
Springer Proceedings in Advanced Robotics 28

Series Editors: Bruno Siciliano · Oussama Khatib

Julien Bourgeois · Jamie Paik · Benoît Piranda · Justin Werfel · Sabine Hauert ·
Alyssa Pierson · Heiko Hamann · Tin Lun Lam · Fumitoshi Matsuno · Negar Mehr ·
Abdallah Makhoul
Editors

Distributed Autonomous Robotic Systems
16th International Symposium

Springer Proceedings in Advanced Robotics 28

Series Editors
Bruno Siciliano, Dipartimento di Ingegneria Elettrica e Tecnologie dell’Informazione,
Università degli Studi di Napoli Federico II, Napoli, Italy
Oussama Khatib, Robotics Laboratory Department of Computer Science, Stanford
University, Stanford, CA, USA

Advisory Editors
Gianluca Antonelli, Department of Electrical and Information Engineering, University
of Cassino and Southern Lazio, Cassino, Italy
Dieter Fox, Department of Computer Science and Engineering, University of
Washington, Seattle, WA, USA
Kensuke Harada, Engineering Science, Osaka University Engineering Science,
Toyonaka, Japan
M. Ani Hsieh, GRASP Laboratory, University of Pennsylvania, Philadelphia, PA, USA
Torsten Kröger, Karlsruhe Institute of Technology, Karlsruhe, Germany
Dana Kulic, University of Waterloo, Waterloo, ON, Canada
Jaeheung Park, Department of Transdisciplinary Studies, Seoul National University,
Suwon, Korea (Republic of)
The Springer Proceedings in Advanced Robotics (SPAR) publishes new developments
and advances in the fields of robotics research, rapidly and informally but with high
quality.
The intent is to cover all the technical contents, applications, and multidisciplinary
aspects of robotics, embedded in the fields of Mechanical Engineering, Computer Sci-
ence, Electrical Engineering, Mechatronics, Control, and Life Sciences, as well as the
methodologies behind them.
The publications within the “Springer Proceedings in Advanced Robotics” are pri-
marily proceedings and post-proceedings of important conferences, symposia and con-
gresses. They cover significant recent developments in the field, both of a founda-
tional and applicable character. Also considered for publication are edited monographs,
contributed volumes and lecture notes of exceptionally high quality and interest.
An important characteristic feature of the series is the short publication time and
world-wide distribution. This permits a rapid and broad dissemination of research results.
Indexed by SCOPUS, SCIMAGO, WTI Frankfurt eG, zbMATH.
All books published in the series are submitted for consideration in Web of Science.
Julien Bourgeois · Jamie Paik · Benoît Piranda ·
Justin Werfel · Sabine Hauert · Alyssa Pierson ·
Heiko Hamann · Tin Lun Lam ·
Fumitoshi Matsuno · Negar Mehr ·
Abdallah Makhoul
Editors

Distributed Autonomous
Robotic Systems
16th International Symposium
Editors

Julien Bourgeois
University of Franche-Comté
Montbéliard, France

Jamie Paik
École Polytechnique Fédérale de Lausanne
Lausanne, Switzerland

Benoît Piranda
CNRS, University of Franche-Comté
Besançon, France

Justin Werfel
Harvard University
Boston, MA, USA

Sabine Hauert
University of Bristol
Bristol, UK

Alyssa Pierson
Boston University
Boston, MA, USA

Heiko Hamann
University of Lübeck
Lübeck, Germany

Tin Lun Lam
The Chinese University of Hong Kong, Shenzhen
Shenzhen, China

Fumitoshi Matsuno
Kyoto University
Kyoto, Japan

Negar Mehr
University of Illinois System
Urbana, IL, USA

Abdallah Makhoul
University of Franche-Comté
Montbéliard, France

Springer Proceedings in Advanced Robotics
ISSN 2511-1256          ISSN 2511-1264 (electronic)
ISBN 978-3-031-51496-8  ISBN 978-3-031-51497-5 (eBook)
https://doi.org/10.1007/978-3-031-51497-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


Foreword

At the dawn of the century’s third decade, robotics is reaching an elevated level of
maturity and continues to benefit from the advances and innovations in its enabling
technologies. These are all contributing to an unprecedented effort to bring robots to
human environments in hospitals and homes, factories and schools; in the field for robots
fighting fires, making goods and products, picking fruits and watering the farmland, and
saving time and lives. Robots today hold the promise for making a considerable impact
in a wide range of real-world applications from industrial manufacturing to healthcare,
transportation, and exploration of the deep space and sea. Tomorrow, robots will become
pervasive and touch upon many aspects of modern life.
The Springer Tracts in Advanced Robotics (STAR) was launched in 2002 with the
goal of bringing to the research community the latest advances in the robotics field
based on their significance and quality. During the last fifteen years, the STAR series has
featured the publication of both monographs and edited collections. Among the latter,
the proceedings of thematic symposia devoted to excellence in robotics research, such
as ISRR, ISER, FSR, and WAFR, have been regularly included in STAR.
The expansion of our field as well as the emergence of new research areas has
motivated us to enlarge the pool of proceedings in the STAR series in the past few years.
This has ultimately led to launching a sister series in parallel to STAR. The Springer
Proceedings in Advanced Robotics (SPAR) is dedicated to the timely dissemination of
the latest research results presented in selected symposia and workshops.
This volume of the SPAR series brings the proceedings of the sixteenth edition of the
DARS Symposium on Distributed Autonomous Robotic Systems, whose proceedings
have been published within SPAR since the thirteenth edition. This symposium took
place in Montbéliard, France, from November 28 to 30, 2022. The volume edited by Julien
Bourgeois, Justin Werfel, Jamie Paik, Benoît Piranda, Sabine Hauert, Alyssa Pierson,
Fumitoshi Matsuno, Negar Mehr, Heiko Hamann, Tin Lun Lam, and Abdallah Makhoul
contains 39 scientific contributions cutting across mobile sensor networks, unmanned
aerial vehicles, multi-agent systems, algorithms for multi-robot systems, modular robots,
swarm robotics, and reinforcement learning or deep learning, all united through the
common thread of distributed robotic systems.
With its excellent technical program, DARS culminates with this unique reference on
the current developments and new advances in distributed autonomous robotic systems—
a genuine tribute to its contributors and organizers!

October 2023

Bruno Siciliano
Oussama Khatib
SPAR Editors
Preface

Distributed robotics is an interdisciplinary and rapidly growing area, with research and
applications spanning fields including computer science, communication and control
systems, electrical and mechanical engineering, life sciences, and humanities. This book
brings together advanced works in this field presented at the 16th International Sympo-
sium on Distributed Autonomous Robotic Systems (DARS), November 28–30, 2022, in
Montbéliard, France.
During this symposium, 39 papers were presented covering a broad scope of topics
within distributed robotics including mobile sensor networks, unmanned aerial vehi-
cles, multi-agent systems, algorithms for multi-robot systems, modular robots, swarm
robotics, and reinforcement learning or deep learning applied to multi-robot systems.
Three keynote talks featured renowned experts in the field: Alcherio Martinoli
(EPFL), on “Gas and Particle Sensing in Air: Challenges, Achievements, and Lessons”;
Jessica Flack (SFI), on “Hourglass Emergence”; and Kirstin Petersen (Cornell), on
“Designing Robotic Systems with Collective Embodied Intelligence.”
The welcome reception featured an artwork made from Blinky Blocks, created by
the artist duo Scenocosme. They have been working for several years with this hard-
ware developed at Carnegie Mellon University and the FEMTO-ST Institute and have
exhibited this work in many museums.
The program committee awarded the Best Paper Award to Shoma Abe, Jun
Ogawa, Yosuke Watanabe, MD Nahin Islam Shiblee, Masaru Kawakami, and Hidemitsu
Furukawa for their work on “Application of 3D Printed Vacuum-Actuated Module with
Multi-Soft Material to Support Handwork Rehabilitation,” and the Best Student Paper
Award to Yuhong Cao, Zhanhong Sun, and Guillaume Sartoretti for their work on “DAN:
Decentralized Attention-based Neural Network for the MinMax Multiple Traveling
Salesman Problem.”
Finally, we would like to thank all those who contributed to the success of DARS
2022, including Pays de Montbéliard Agglomération, EIPHI Graduate School, MA
Scène Nationale, Région Bourgogne Franche-Comté for sponsoring the event, the DARS
advisory committee for their useful advice, the program committee for their high-quality

scientific work, and the organizing committee for serving the attendees during the
conference.

Summer 2023

Julien Bourgeois
Justin Werfel
Jamie Paik
Benoît Piranda
Sabine Hauert
Alyssa Pierson
Fumitoshi Matsuno
Negar Mehr
Heiko Hamann
Tin Lun Lam
Abdallah Makhoul
Contents

How Can We Understand Multi-Robot Systems? A User Study to Compare Implicit
and Explicit Communication Modalities
Valeria Villani, Cristina Vercellino, and Lorenzo Sabattini

The Benefits of Interaction Constraints in Distributed Autonomous Systems
Michael Crosscombe and Jonathan Lawry

Outlining the Design Space of eXplainable Swarm (xSwarm): Experts’ Perspective
Mohammad Naiseh, Mohammad D. Soorati, and Sarvapali Ramchurn

VMAS: A Vectorized Multi-agent Simulator for Collective Robot Learning
Matteo Bettini, Ryan Kortvelesy, Jan Blumenkamp, and Amanda Prorok

FLAM: Fault Localization and Mapping
Guillaume Ricard, David Vielfaure, and Giovanni Beltrame

Social Exploration in Robot Swarms
Elliott Hogg, David Harvey, Sabine Hauert, and Arthur Richards

Stochastic Nonlinear Ensemble Modeling and Control for Robot Team
Environmental Monitoring
Victoria Edwards, Thales C. Silva, and M. Ani Hsieh

A Decentralized Cooperative Approach to Gentle Human Transportation
with Mobile Robots Based on Tactile Feedback
Yi Zhang, Yuichiro Sueoka, Hisashi Ishihara, Yusuke Tsunoda, and Koichi Osuka

Sparse Sensing in Ergodic Optimization
Ananya Rao, Ian Abraham, Guillaume Sartoretti, and Howie Choset

Distributed Multi-robot Tracking of Unknown Clustered Targets with Noisy
Measurements
Jun Chen, Philip Dames, and Shinkyu Park

A Force-Mediated Controller for Cooperative Object Manipulation
with Independent Autonomous Robots
Nicole E. Carey and Justin Werfel

A Distributed Architecture for Onboard Tightly-Coupled Estimation
and Predictive Control of Micro Aerial Vehicle Formations
I. Kagan Erunsal, Rodrigo Ventura, and Alcherio Martinoli

Search Space Illumination of Robot Swarm Parameters for Trustworthy Interaction
James Wilson and Sabine Hauert

Collective Gradient Following with Sensory Heterogeneous UAV Swarm
Tugay Alperen Karagüzel, Nicolas Cambier, A. E. Eiben, and Eliseo Ferrante

DAN: Decentralized Attention-Based Neural Network for the MinMax Multiple
Traveling Salesman Problem
Yuhong Cao, Zhanhong Sun, and Guillaume Sartoretti

Receding Horizon Control on the Broadcast of Information in Stochastic Networks
Thales C. Silva, Li Shen, Xi Yu, and M. Ani Hsieh

Adaptation Strategy for a Distributed Autonomous UAV Formation in Case
of Aircraft Loss
Tagir Muslimov

DGORL: Distributed Graph Optimization Based Relative Localization
of Multi-robot Systems
Ehsan Latif and Ramviyas Parasuraman

Characterization of the Design Space of Collective Braitenberg Vehicles
Jack A. Defay, Alexandra Q. Nilles, and Kirstin Petersen

Decision-Making Among Bounded Rational Agents
Junhong Xu, Durgakant Pushp, Kai Yin, and Lantao Liu

Distributed Multi-robot Information Gathering Using Path-Based Sensors
in Entropy-Weighted Voronoi Regions
Alkesh Kumar Srivastava, George P. Kontoudis, Donald Sofge, and Michael Otte

Distributed Multiple Hypothesis Tracker for Mobile Sensor Networks
Pujie Xin and Philip Dames

Distributed Multirobot Control for Non-cooperative Herding
Nishant Mohanty, Jaskaran Grover, Changliu Liu, and Katia Sycara

On Limited-Range Coverage Control for Large-Scale Teams of Aerial Drones:
Deployment and Study
Filippo Bertoncelli, Mehdi Belal, Dario Albani, Federico Pratissoli,
and Lorenzo Sabattini

MARLAS: Multi Agent Reinforcement Learning for Cooperated Adaptive Sampling
Lishuo Pan, Sandeep Manjanna, and M. Ani Hsieh

Proportional Control for Stochastic Regulation on Allocation of Multi-robots
Thales C. Silva, Victoria Edwards, and M. Ani Hsieh

Comparing Stochastic Optimization Methods for Multi-robot, Multi-target Tracking
Pujie Xin and Philip Dames

Multi-agent Deep Reinforcement Learning for Countering Uncrewed Aerial Systems
Jean-Elie Pierre, Xiang Sun, David Novick, and Rafael Fierro

Decentralized Risk-Aware Tracking of Multiple Targets
Jiazhen Liu, Lifeng Zhou, Ragesh Ramachandran, Gaurav S. Sukhatme,
and Vijay Kumar

Application of 3D Printed Vacuum-Actuated Module with Multi-soft Material
to Support Handwork Rehabilitation
Shoma Abe, Jun Ogawa, Yosuke Watanabe, MD Nahin Islam Shiblee,
Masaru Kawakami, and Hidemitsu Furukawa

Combining Hierarchical MILP-MPC and Artificial Potential Fields
for Multi-robot Priority-Based Sanitization of Railway Stations
Riccardo Caccavale, Mirko Ermini, Eugenio Fedeli, Alberto Finzi,
Emanuele Garone, Vincenzo Lippiello, and Fabrizio Tavano

Exploration System for Distributed Swarm Robots Using Probabilistic
Action Decisions
Toui Sato, Kosuke Sakamoto, Takao Maeda, and Yasuharu Kunii

Distributed Estimation of Scalar Fields with Implicit Coordination
Lorenzo Booth and Stefano Carpin

A Constrained-Optimization Approach to the Execution of Prioritized Stacks
of Learned Multi-robot Tasks
Gennaro Notomista

A Comparison of Two Decoupled Methods for Simultaneous Multiple Robots
Path Planning
Benjamin Bouvier and Julien Marzat

Smarticle 2.0: Design of Scalable, Entangled Smart Matter
Danna Ma, Jiahe Chen, Sadie Cutler, and Kirstin Petersen

Hybrid Flock - Formation Control Algorithms
Cyrill Baumann, Jonas Perolini, Emna Tourki, and Alcherio Martinoli

Multi-Robot Systems Research: A Data-Driven Trend Analysis
João V. Amorim Marques, María-Teresa Lorente, and Roderich Groß

Coverage Control for Exploration of Unknown Non-convex Environments
with Limited Range Multi-robot Systems
Mattia Catellani, Federico Pratissoli, Filippo Bertoncelli, and Lorenzo Sabattini

Author Index


How Can We Understand Multi-Robot Systems?
A User Study to Compare Implicit and Explicit
Communication Modalities

Valeria Villani, Cristina Vercellino, and Lorenzo Sabattini

Department of Sciences and Methods for Engineering,
University of Modena and Reggio Emilia, Reggio Emilia, Italy
{valeria.villani,lorenzo.sabattini}@unimore.it

Abstract. In this paper we investigate the use of different communication modal-
ities for a multi-robot system. Specifically, the aim is to let a human subject be
aware of the goal the robots are moving towards: to this end, we compare implicit
(i.e., the user infers the robots’ goal from their motion pattern) and explicit (i.e.,
using visual or audio cues) communication modalities. We propose a user study,
performed in virtual reality, aimed at assessing which communication modality is
the most effective and how the modalities are perceived by human subjects.
While implicit communication has the intrinsic advantage of not requiring any
specific hardware on the robots, the results of the user study show that implicit
communication is comparable, from the users’ perspective, to explicit communication
modalities when the robots’ goals are unambiguous and sufficiently distinct. In other
cases, explicit communication is preferable.

1 Introduction
Along with the diffusion of robotic platforms, multi-robot systems are expected to be
deployed in a variety of applications in our everyday lives, including precision agricul-
ture, logistics, environmental monitoring, or surveillance [1]. Their spread is enabled
not only by the availability of efficient and safe coordination strategies, but also by intu-
itive communication modalities. Specifically, given the complexity associated with the
coordination among numerous individual robots, it is important that, on the one hand,
users can exert control over the multi-robot system in a straightforward and intuitive
manner and, on the other hand, users can easily understand what robots are doing or
going to do. This second problem is the focus of our work and refers to the availability
of communication modalities that allow robots to give information about their status in
a manner that is easily perceived and understood by human users.
Feedback from the robotic platform to the user can be communicated either explic-
itly or implicitly [2]. Implicit communication consists in defining a robot behavior that
enables the collaborator to quickly and confidently infer the goal [3–6]. This idea is
related to the notion of legibility and requires the use of implicit cues, such as robot
motion. On the contrary, in explicit communication explicit cues are used to provide
feedback from the robot to the user. Classical approaches in this regard resort to audio
or verbal feedback [7]. Additionally, visual feedback is typically employed [8–10]. In


explicit communication, ad-hoc apparatus for communication purposes, such as LEDs,
light indicators or speakers, is typically used, which limits the applicability of this com-
munication modality in practical scenarios. Conversely, implicit communication lever-
ages human understanding of movement and the tendency of people to discover inten-
tionality in everything, even in robots [11]. With specific respect to multi-robot systems,
few previous works have investigated how the concepts of legibility and implicit com-
munication can be applied to multi-robot systems. In particular, the work [5, 6] aimed
to identify robots motion characteristics that are more relevant int terms of system leg-
ibility. To this end, the authors considered different trajectories leading the multi-robot
system to a spatial goal and other parameters such as dispersion and stiffness of the
group of robots. Furthermore, the authors in [12] considered also the effect of target
difficulty, namely whether the spatial goals had the same number of neighboring goals
or not, and initial radius, namely how much the robots were spread out at the beginning
of the experiment.
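
To make the idea of inferring a goal from motion concrete, the following sketch (our own illustration, not a method from the works cited above) shows how an observer model might guess a group's goal by comparing the heading of the group centroid with the direction to each candidate goal; all positions and names are hypothetical:

```python
import numpy as np

def infer_goal(centroid_path, goals):
    """Guess which goal a robot group is heading towards from its motion.

    Compares the group centroid's latest heading with the direction to
    each candidate goal and returns the index of the best-aligned goal.
    """
    velocity = centroid_path[-1] - centroid_path[-2]   # latest heading
    directions = goals - centroid_path[-1]             # direction to each goal
    cosines = (directions @ velocity) / (
        np.linalg.norm(directions, axis=1) * np.linalg.norm(velocity) + 1e-9)
    return int(np.argmax(cosines))

# Five goals in a row and a centroid drifting up and to the right
# (illustrative values, not taken from the paper's setup).
goals = np.array([[0.0, 2.0], [1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]])
path = np.array([[2.0, 0.0], [2.1, 0.2], [2.3, 0.4]])
print(infer_goal(path, goals))  # -> 4 (the rightmost goal)
```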
Building upon these ideas, the aim of this work is to investigate the use of implicit
and explicit communication modalities in the interaction with multi-robot systems, and
assess their effectiveness and users’ satisfaction with them. To this end, we set up an
experiment where two groups of robots move towards a goal and communicate their
intentions to human subjects either with implicit cues, or with explicit color cues, arrows
or verbal messages. Effectiveness of communication modalities is measured in terms of
correctness of subjects’ responses and response time. The results of our study show that
explicit communication modalities are more intuitive for participants and that implicit com-
munication fails especially in the case of goals with close neighbors, thus confirming
the findings of [12].
The rest of the paper is organized as follows. In Sect. 2 we describe the experi-
mental approach and present the test protocol. Section 3 reports and discusses the achieved
results, in terms of measured effectiveness of the different communication modalities
and subjective reporting by test subjects. Finally, Sect. 4 follows with some concluding
remarks.

2 Methods
In this section, we describe the experiments carried out to compare implicit and explicit
communication in human-multi-robot system interaction. The experiment, which is
described in detail in Sect. 2.2, consists of having two groups of mobile robots move
towards a goal and asking participants to infer the goal relying on explicit or implicit
cues.

2.1 Communication Modalities


As mentioned in Sect. 1, implicit communication has been compared with three explicit
communication modalities: vocal commands, color cues, arrows. Figure 1 shows an
example of implicit communication (1(a)), color cues (1(b)) and arrows (1(c)).
Fig. 1. Considered communication modalities.

As regards implicit communication, cues about the robots’ intention have been provided
through their motion. The trajectory leading a group of robots to a goal has been com-
puted according to the minimum jerk criterion: it minimizes the square of the magnitude
of the jerk, which is the rate of change of the acceleration. Previous works have shown
that this kind of trajectory is usually used by humans to grasp objects and improves the
interaction of a human with a robot manipulator [13, 14]. Moreover, in [5, 6] it has been
shown that this trajectory improves the legibility of one or several multi-robot systems
that move in a shared environment with the user.
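
For point-to-point motion, the minimum jerk criterion admits a well-known closed-form solution [13]. The sketch below is our own, with illustrative start and goal positions rather than the paper's actual setup; it generates such a trajectory for each robot in a group:

```python
import numpy as np

def minimum_jerk(start, goal, steps):
    """Closed-form minimum-jerk trajectory between two points.

    Minimizes the integrated squared jerk (Flash & Hogan, 1985):
    x(tau) = x0 + (xf - x0) * (10*tau**3 - 15*tau**4 + 6*tau**5),
    where tau is time normalized to [0, 1].
    """
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    tau = np.linspace(0.0, 1.0, steps)[:, None]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5   # smooth 0 -> 1 profile
    return start + (goal - start) * s

# Six robots starting near a common position and heading to one goal.
rng = np.random.default_rng(0)
starts = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(6, 2))
trajectories = [minimum_jerk(p, goal=[2.0, 1.0], steps=100) for p in starts]
```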
As regards explicit communication, vocal commands consisted of a sentence like
“The group of robots number 1 is moving towards the (e.g.,) red goal”. Color cues
consisted of highlighting the group of robots and their goal with the same color, as
shown in Fig. 1(b). Arrows are shown in Fig. 1(c) and simply consist of an arrow starting
from a group of robots and ending at the associated goal.
It is worthwhile noting that the considered explicit communication modalities are
very intuitive and do not require any processing or intuition by the user, as opposed
to implicit communication. Nevertheless, implementing the explicit communication
modalities requires specific hardware (i.e., speaker, light, projector), which is not
needed in the case of implicit communication.

2.2 Experimental Task


The experimental task was inspired by those considered in [5, 6]. Specifically, two
groups of mobile robots, each composed of six robots, were placed in a virtual envi-
ronment, together with five goals, as shown in Fig. 1. The experiment consisted of two
main parts. In the first part, each group of robots started from a common starting posi-
tion and traveled towards a goal, randomly selected among those available. The two
groups of robots started moving at the same time and were headed to different goals.

Fig. 2. Experimental task with puzzle solving as confounding task.

Participants were asked to select the goal for each group of robots as soon as they
were confident in it. The task was repeated four times, each time with a different
communication modality (either implicit or one of the explicit ones). The order in which com-
munication modalities were used varied among participants. The second part of the
experiment was organized as the first one, with the same objectives. In addition, partic-
ipants were involved in a primary task, consisting of a puzzle. Hence, they were asked
to focus on a puzzle, while robots were moving towards goals. As in the first part, they
were asked to select the goal of each group of robots as soon as they were confident in
it. Figure 2 shows the experimental setting in the second part of the task. As a result, the
experimental task of each participant consisted of eight trials, where all the considered
communication modalities were tested in random order twice: in the first part, focus
was exclusively on inferring the robots’ goal; in the second part, participants had to solve a
puzzle and, at the same time, infer the robots’ goal.
Before the beginning of the experiment, participants were introduced to the objec-
tives of the experiment and its setup. Moreover, they were instructed on how to interact
with the system (e.g., how to select the objects in the environment), and they were
informed that clues about the goal of the groups may have appeared during the experi-
ment. At the end of the experiment, they were administered a questionnaire investigat-
ing their preference for the tested communication modalities.

2.3 Experimental Setup


The experiments were carried out in virtual reality. The use of virtual reality allows
for a high number of robots moving in a wide environment, thus overcoming labora-
tory limitations. Moreover, it guarantees repeatability of experiments, given the lack of
localization errors, slippage or other disturbing factors [6]. As regards interaction, it
provides an immersive interaction experience that faithfully reproduces interaction in a
real environment [15].

The environment was developed in Unity and the experiments were reproduced
with the Oculus Rift headset. Goals were rendered as colored boxes, while robots were
rendered as elliptical cylinders (width equal to 0.05 m, length equal to 0.1 m, height
equal to 0.01 m). The two groups of robots were identified by a text box “Group 1”
or “Group 2”, moving with the group. Next to goals, two buttons were placed to let
participants select the goal inferred for each group. Interaction was enabled through the
Oculus remote controller, called Oculus Remote. Participants could select pieces of the
puzzle or the goal of a group of robots through the central button of the controller.

2.4 Participants

A total of 30 volunteer subjects were enrolled in the experiment. The subjects are stu-
dents and researchers working at our engineering department, in different research fields,
and are categorized as follows: 12 females, 18 males, spread across an age range of 20–
65 years, with a mean age of 25.5 years. All the participants were completely naïve to
the experimental task and goals. Ten of them had never used virtual reality before the
experiment, while eleven reported they had used it once or twice. The remaining nine
participants reported they had used it more than twice before this experiment. More-
over, three reported they had never played videogames before, twelve reported they
play less than once a month, five play a few times per month, and ten play at least
once a week. The study protocol followed the Helsinki declaration and the
participants signed the informed consent form before starting the experiment. All the
data were analyzed and reported anonymously.

2.5 Data Collection and Analysis

During each experiment, we collected data about the participants’ performance, in terms
of the selected goal for each robot group, the time taken from the beginning of the robots’
movements to the selection of the goal for each group and, for the second experimental
task, the time required to complete the puzzle. A total of 480 values were measured for
these quantities (240 for the time required to complete the puzzle), since each subject
participated in the two parts of the experiment (without and with puzzle), and in each
part they experienced the four considered communication modalities with two groups of
robots (30 subjects × 2 parts × 4 modalities × 2 groups = 480). Data were analyzed
using MATLAB R2021a and its Statistics and Machine
Learning Toolbox.
The final questionnaire included four questions addressing comfort during the
experiment, effectiveness of the considered communication modalities in the first and
the second part of the experiment, and overall preference for such modalities.

3 Results
3.1 Correctness of Responses

The first analysis we carried out investigated whether the use of implicit or explicit com-
munication modalities has an influence on the legibility of robot groups. Specifically,

Table 1. Correctness of participants’ responses for the different communication modalities: 1
denotes a correct response, 0 an incorrect response.

Communication modality   Incorrect (0)   Correct (1)
Implicit                 36              84
Vocal commands           14              106
Color cues               10              110
Arrows                   10              110

we analyzed the effect of communication modality on the correctness of estimated goals


for all trials. To this end, we tested the following null hypothesis:
H0 - Null hypothesis: there is no relation between the correctness of a test participant’s
response and the robots’ communication modality.
Here we considered correctness of response as the dependent variable and commu-
nication modality as the independent variable. Since correctness is a categorical variable,
we performed a Pearson’s chi-squared test [16]. Table 1 reports the contingency table
for the hypothesis H0. The table shows that explicit communication modalities led to
fewer errors than implicit communication. From the chi-squared test, these results proved
statistically significantly different, with p-value p = 7.56 × 10⁻⁷. Hence, the null hypoth-
esis can be rejected: the communication modality, implicit or explicit, affects users’
performance on the inference of the robots’ goal, with implicit communication introducing
a higher number of errors. This result is not surprising, since the considered explicit
communication modalities are very intuitive and straightforward and do not require any
information processing or human thinking (although they do require some hardware).
Nevertheless, it is still interesting that implicit cues are processed correctly in 70% of
trials.
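
The test on the contingency table in Table 1 can be reproduced with standard tools; a minimal sketch (ours, using SciPy rather than the MATLAB toolbox named in Sect. 2.5):

```python
from scipy.stats import chi2_contingency

# Rows: implicit, vocal commands, color cues, arrows (Table 1).
# Columns: incorrect (0) and correct (1) responses.
table = [[36, 84],
         [14, 106],
         [10, 110],
         [10, 110]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")  # p ~ 7.6e-07
```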
Furthermore, we investigated the effect, if any, of the position of the goals and
the presence of a puzzle on the correctness of test participants’ responses. To this end, we
defined the two following null hypotheses:
H1 - Null hypothesis: there is no relation between the correctness of a test participant’s
response and the robot group’s goal.
H2 - Null hypothesis: there is no relation between the correctness of a test participant’s
response and the involvement in a confounding task.
Tables 2 and 3 report the contingency tables for these hypotheses, where we consid-
ered, as before, correctness of response as the dependent variable and goal and involve-
ment in a confounding task as the independent variables. The tables report that dif-
ferences can be found among the different conditions. In particular, with reference to
Table 2, the goals between the middle one and the outermost ones, namely the second
from left (green) and the second from right (blue), were the most misestimated, whereas
the leftmost (yellow), middle (red) and rightmost (cyan) goals were correctly estimated
more frequently. This is due to the fact that the green and blue goals were the most
ambiguous for implicit communication, since they had two neighbors. Indeed, while the
one in the middle, which also had two neighbors, was reached with a straight trajectory,
which caused little confusion, the green and blue goals were often confused with the
leftmost and rightmost goals, respectively. Conversely, the latter were rarely confused
since the corresponding trajectories exhibited a large deviation to the left or right. These
considerations are corroborated by the results of the chi-squared test, which returned a
p-value p = 4.72 × 10⁻⁷ for H1, thus rejecting the null hypothesis and confirming that
the position of goals influences the accuracy of estimation.

Table 2. Correctness of participants’ responses for the different robot goals, as shown in Fig. 1.

Goal                       Incorrect (0)   Correct (1)
Leftmost (yellow)          8               86
Second from left (green)   19              77
Middle (red)               4               74
Second from right (blue)   31              72
Rightmost (cyan)           8               101

Table 3. Correctness of participants’ responses with respect to the presence of puzzle solving,
meant as an additional confounding task.

Confounding task   Incorrect (0)   Correct (1)
Puzzle             26              214
None               44              196
As regards the confounding task that test participants had to focus on while
robots were moving towards goals, Table 3 shows that more errors were made when no
confounding task was included. This could be due to the fact that this condition was
always administered first to all the participants, to let them become familiar with
the communication modalities, which were the main focus of our experiments. Thus,
the results in Table 3 might reflect a learning effect. The chi-squared test returned a
p-value p = 0.02 and, hence, the null hypothesis H2 can be rejected. However, a deeper
analysis in this regard should be carried out by randomizing the order of the two tasks
among test participants.

3.2 Response Time


As for correctness, we considered response time (T) as the dependent variable and inves-
tigated whether the three independent variables communication modality (C), robots’
goal (G) and involvement in a confounding task (P) had an effect on it. To this
end, we defined the following null hypothesis:
H3 - Null hypothesis: there is no relation between the response time and the robots’ com-
munication modality, goal and involvement in a confounding task.
The analysis was limited to those trials in which test participants gave a correct reply, that
is 410 out of 480 trials. Table 4 reports the results of a three-way analysis of variance

Table 4. P-values for the three-way ANOVA: C denotes the communication modality, G the
robots’ goal and P the presence of puzzle solving as confounding task.

Variable   p-value
C          < 10⁻⁷
G          < 10⁻⁷
P          0.73
CG         < 10⁻⁷
CP         0.34
PG         0.76

(ANOVA) in terms of the p-value of all the main effects C, G and P and also of the
interaction variables. Results in the table show that communication modality (C) and
goal (G) are the main effects that influence response time in a statistically significant
manner. The effect of their interaction (CG) is also statistically significant. As a
result, we can reject the null hypothesis for the variables C and G since they influence
response time. Conversely, the presence of a confounding task, such as puzzle solving,
did not significantly affect participants’ response time. It should be clarified that no pri-
ority was given to test participants regarding the two tasks, puzzle solving and guessing
the robots’ goal: they could stop puzzle solving as soon as they had a reliable intuition about
the goal of either of the two groups of robots.
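
A three-way ANOVA with pairwise interaction terms can be expressed, for instance, with statsmodels; the sketch below is our own, run on synthetic data with hypothetical column names (the paper's analysis used MATLAB's Statistics and Machine Learning Toolbox, and the real data are not public):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic stand-in for the 410 correct trials.
rng = np.random.default_rng(1)
n = 410
df = pd.DataFrame({
    "comm":   rng.choice(["implicit", "audio", "colors", "arrows"], n),
    "goal":   rng.choice(["yellow", "green", "red", "blue", "cyan"], n),
    "puzzle": rng.choice([0, 1], n),
})
# Longer times for implicit cues, mimicking the paper's main effect.
df["time"] = 3.0 + 2.0 * (df["comm"] == "implicit") + rng.normal(0, 0.5, n)

model = ols("time ~ C(comm) + C(goal) + C(puzzle)"
            " + C(comm):C(goal) + C(comm):C(puzzle) + C(puzzle):C(goal)",
            data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # p-values for C, G, P, CG, CP, PG
```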
To further investigate differences existing among the experimental conditions, we
carried out a post-hoc analysis, based on multiple pairwise comparisons. Indeed, results
in Table 4 suggest that for the variables C, G and CG, on average, at least one of the con-
sidered experimental conditions is different from the others. However, this analysis does
not provide details on which conditions are different from each other. To answer this
question, we performed the Tukey’s honestly significant difference test [17]. As regards
the variable C, that is the effect of communication modality on participants’ response
time, Fig. 3(a) shows the mean response time marginalized for the considered com-
munication modalities. The figure shows that implicit communication caused higher
response time than explicit modalities, which all required comparable response time.
Multiple comparison with Tukey’s test confirmed this result, since the only pairs with
statistically significant difference are those involving implicit communication (p-values
p < 10⁻¹⁰ for all the pairs (implicit, arrows), (implicit, colors), (implicit, audio)). Con-
versely, for all possible pairs with two explicit communication modalities, there was no
statistically significant difference between mean response times in each pair (p-value
p = 0.99 for the pair (arrows, colors), p = 0.16 for (arrows, audio) and p = 0.30 for
(colors, audio)).
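
Tukey's test for such pairwise comparisons is likewise available off the shelf; a sketch reusing the synthetic dataframe from the ANOVA example above:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparisons of mean response time across communication modalities;
# the 'reject' column flags pairs whose means differ significantly.
result = pairwise_tukeyhsd(endog=df["time"], groups=df["comm"], alpha=0.05)
print(result.summary())
```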
As regards the variable G, which refers to the robots’ goal, Fig. 3(b) reports the marginal
mean response time. The figure shows that the green and blue targets required longer
response times, thus confirming that they were the most critical, as anticipated in
Sect. 3.1. In particular, the green one returned statistically significantly different
response times with respect to the yellow (p = 0.0017), red (p = 0.0027) and cyan
(p = 10⁻⁴) targets, while the response time for the blue one was statistically significantly
different from the cyan (p = 0.0162) goal only.

Fig. 3. Mean response time for the different communication modalities and robots’ goals.
The combined effect of communication modality and robots’ goal (variable CG
in Table 4) is described by Fig. 4. The figure confirms that, for each goal, implicit
cues were the slowest communication modality. However, within a goal, the difference
between mean response time for implicit communication and response time for any
explicit modality was not statistically significant only in the case of the yellow and cyan
goals: for the yellow goal (first four rows in Fig. 4) p-value p = 0.9917 for (implicit,
audio), p = 0.9481 for (implicit, colors) and p = 0.9139 for (implicit, arrows); for the
cyan goal (last four rows in Fig. 4) p-value p = 1.000 for (implicit, audio), p = 0.9758
for (implicit, colors) and p = 1.000 for (implicit, arrows). This result is interesting since
the yellow and cyan goals are the only ones having one neighbor, hence they are less
ambiguous than the other goals: under these circumstances, the delay introduced by
implicit communication is negligible, as confirmed by the p-values and Fig. 4. From this
result, we can assume that implicit communication is a viable communication modality
in the case of sufficiently distinct and unambiguous robot goals.

Fig. 4. Mean response time marginalized over communication modality C and robots’ goal G.

Fig. 5. Mean response time marginalized over communication modality C and involvement in
confounding task P.
As a last remark, in Fig. 5 we summarize the effect of communication modality and
involvement in a confounding task on response time. The figure shows that the longest
response times are given in the case of implicit cues, no matter whether the subject
was involved in puzzle solving or focused on the robots’ movements. Moreover, as regards
explicit communication, no significant differences are found between the presence of
puzzle (P = 1) and its absence (P = 0). This is also confirmed by Tukey’s test: the only
p-values that are less than 0.01 are those of pairs involving any explicit communication
and implicit cues, with or without puzzle. This result might lead to the consideration
that if the only reason behind the use of implicit communication is not to overwhelm
users since they are busy with other tasks that are occupying their perceptive channels
(i.e., need to look at something or to pay attention to sounds), then the choice of implicit
communication might be not advantageous as expected. In this regard, Fig. 6 shows that
the use of different communication modalities (and goals) has no effect at all on the time
required to complete the confounding task, namely puzzle solving.

3.3 Subjective Reporting


At the end of the experimental task, participants were administered a questionnaire to
collect their subjective feedback about the experiment and the considered communica-
tion modalities. First, they were asked whether they were at ease during the experiment
and comfortable in interacting with the virtual environment. On a scale from 1 (no,
never) to 4 (yes, always), 80% of them reported that they had always been at ease
during the experiment and 20% almost always; 60% reported they had always been
comfortable with the interaction system and 40% almost always.
Then, they were asked to rate each communication modality on a scale from 1 (very
bad) to 5 (very good). The average scores are as follows: 3.3 for implicit communica-
tion, 4.3 for explicit communication with color cues, 4.3 for explicit communication
with vocal commands and 3.6 for explicit communication with arrows.
Regarding the second part of the experiment, that is the one including puzzle solv-
ing, we asked how much each communication modality was perceived as disturbing.
Scores were on the scale from 1 (not disturbing) to 5 (very disturbing). On average,
implicit communication was rated 1.9, color cues 1.6, arrows 2.6 and audio messages
1.8. Arrows were found to be the most disturbing communication modality and, interest-
ingly, implicit cues were found almost as disturbing as vocal synthesis.
Finally, the following questions were asked:

– Which communication modality did you find the most effective?
– Which communication modality did you find the least effective?
– Which communication modality did you find the most effective while solving the
puzzle?
– Which communication modality did you find the least effective while solving the
puzzle?
– Should you need to interact with a multi-robot system, how would you prefer they
communicate with you?
– Should you need to interact with a multi-robot system, which communication modal-
ity would you definitely not want?

Replies to these questions are reported in Table 5. Participants’ reports show that they
did not appreciate implicit communication and, if given the choice, they would prefer not to
use it. However, it is noteworthy that it was not the most disturbing communication
modality while they were involved in another confounding task. Among the considered
explicit communication modalities, color cues and arrows were preferred over audio,
which was found almost as disturbing as implicit cues.

Table 5. Replies to the questionnaire for subjective reporting.

Question                        Implicit   Color   Arrows   Audio
Most effective                  10.0%      40.0%   33.3%    16.7%
Least effective                 36.7%      13.3%   13.3%    36.7%
Most effective during puzzle    10.0%      30.0%   43.3%    16.7%
Least effective during puzzle   33.3%      16.7%   20.0%    30.0%
Preferred                       6.6%       36.7%   40.0%    16.7%
Not wanted                      40.0%      3.3%    16.7%    40.0%

4 Conclusion
In this paper we investigated the use of different communication modalities for a multi-
robot system. Specifically, the aim is to let a human subject be aware of the goal the
robots are moving towards. To this end, we considered implicit and explicit communication
modalities. In implicit communication, users infer the robots’ goal from their trajectory,
whereas for explicit communication we considered color cues, where the robots and
their goal exhibit the same color, arrows pointing from the center of the multi-robot
system to the goal, and audio messages. We designed an experiment aimed at assessing
which communication modality is the most effective and how the modalities are perceived
by human subjects.
Results have shown that, on average, implicit cues cause the highest number of
errors and require the longest response time. While this result is not surprising, the
experiments have highlighted that implicit communication is comparable to explicit
communication modalities when the robots’ goals are unambiguous and sufficiently
distinct. Moreover, implicit communication does not require any specific hardware on
robots, which is a practical limitation of explicit communication. Conversely, according
to our results, implicit communication does not provide any advantage with respect to
users’ involvement in other tasks that occupy their perceptive channels, such as looking
at something requiring attention.

References
1. Schranz, M., Umlauft, M., Sende, M., Elmenreich, W.: Swarm robotic behaviors and current
applications. Front. Robot. AI 7, 36 (2020)

2. Lasota, P.A., Fong, T., Shah, J.A., et al.: A survey of methods for safe human-robot interac-
tion. Found. Trends Robot. 5(4), 261–349 (2017)
3. Dragan, A.D., Lee, K.C.T., Srinivasa, S.S.: Legibility and predictability of robot motion. In:
Proceedings of 8th Annual ACM/IEEE International Conference Human-Robot Interaction,
pp. 301–308. IEEE Press (2013)
4. St-Onge, D., Levillain, F., Zibetti, E., Beltrame, G.: Collective expression: how robotic
swarms convey information with group motion. Paladyn, J. Behav. Robot. 10(1), 418–435
(2019)
5. Capelli, B., Villani, V., Secchi, C., Sabattini, L.: Understanding multi-robot systems: on the
concept of legibility. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pp. 7355–7361. IEEE (2019)
6. Capelli, B., Villani, V., Secchi, C., Sabattini, L.: Communication through motion: Legibility
of multi-robot systems. In: 2019 International Symposium on Multi-Robot and Multi-Agent
Systems (MRS), pp. 126–132. IEEE (2019)
7. Clair, A.S., Mataric, M.: How robot verbal feedback can improve team performance in
human-robot task collaborations. In: Proceedings of 10th Annual ACM/IEEE International
Conference Human-Robot Interaction, pp. 213–220. ACM (2015)
8. Johnson, T., Tang, G., Fletcher, S.R., Webb, P.: Investigating the effects of signal light posi-
tion on human workload and reaction time in human-robot collaboration tasks. In: Advances
in Ergonomics of Manufacturing: Managing the Enterprise of the Future, pp. 207–215.
Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41697-7_19
9. May, A.D., Dondrup, C., Hanheide, M.: Show me your moves! conveying navigation inten-
tion of a mobile robot to humans. In: European Conference Mobile Robots (ECMR), pp.
1–6. IEEE (2015)
10. Szafir, D., Mutlu, B., Fong, T.: Communicating directionality in flying robots. In: Proceed-
ings of 10th Annual ACM/IEEE International Conference Human-Robot Interaction, pp.
19–26. IEEE (2015)
11. De Graaf, M.M.A., Malle, B.F.: People’s explanations of robot behavior subtly reveal men-
tal state inferences. In: 2019 14th ACM/IEEE International Conference on Human-Robot
Interaction (HRI), pp. 239–248. IEEE (2019)
12. Kim, L.H., Follmer, S.: Generating legible and glanceable swarm robot motion through tra-
jectory, collective behavior, and pre-attentive processing features. ACM Trans. Hum.-Robot
Interact. (THRI) 10(3), 1–25 (2021)
13. Flash, T., Hogan, N.: The coordination of arm movements: an experimentally confirmed
mathematical model. J. Neurosci. 5(7), 1688–1703 (1985)
14. Glasauer, S., Huber, M., Basili, P., Knoll, A., Brandt, T.: Interacting in time and space: inves-
tigating human-human and human-robot joint action. In: 19th International Symposium in
Robot and Human Interactive Communication, pp. 252–257. IEEE (2010)
15. Villani, V., Capelli, B., Sabattini, L.: Use of virtual reality for the evaluation of human-robot
interaction systems in complex scenarios. In: 2018 27th IEEE International Symposium on
Robot and Human Interactive Communication (RO-MAN), pp. 422–427. IEEE (2018)
16. McHugh, M.L.: The Chi-square test of independence. Biochemia Medica 23(2), 143–149
(2013)
17. Miller, R.G.: Simultaneous Statistical Inference. Springer-Verlag New York Inc. (1981).
https://doi.org/10.1007/978-1-4613-8122-8
The Benefits of Interaction Constraints
in Distributed Autonomous Systems

Michael Crosscombe¹ and Jonathan Lawry²

¹ Graduate School of Arts and Sciences, University of Tokyo, Tokyo, Japan
cross@sacral.c.u-tokyo.ac.jp
² Department of Engineering Mathematics, University of Bristol, Bristol, UK
j.lawry@bristol.ac.uk

Abstract. The design of distributed autonomous systems often omits


consideration of the underlying network dynamics. Recent works in
multi-agent systems and swarm robotics alike have highlighted the
impact that the interactions between agents have on the collective
behaviours exhibited by the system. In this paper, we seek to highlight
the role that the underlying interaction network plays in determining
the performance of the collective behaviour of a system, comparing its
impact with that of the physical network. We contextualise this by defin-
ing a collective learning problem in which agents must reach a consensus
about their environment in the presence of noisy information. We show
that the physical connectivity of the agents plays a less important role
than when an interaction network of limited connectivity is imposed on
the system to constrain agent communication. Constraining agent inter-
actions in this way drastically improves the performance of the system in
a collective learning context. Additionally, we provide further evidence
for the idea that ‘less is more’ when it comes to propagating information
in distributed autonomous systems for the purpose of collective learning.

Keywords: collective learning · interaction networks · multi-agent
systems · robot swarms

1 Introduction

During the development of autonomous systems, it is only natural that we focus


our attention on the aspects of a system which are responsible for producing the
desired behaviours. In distributed autonomous systems the focus is often on how
the interactions between individual agents can lead to effective macro-level col-
lective behaviours. What may be oversimplified or even forgotten, however, are
the underlying networks which underpin those interactions. A typical assump-
tion to make is that the agent’s interaction network is merely a product of the
spatial properties of the system, i.e. agents interact randomly with other agents
based only on whether or not they are within the required distance for inter-
action to occur; often this takes the form of a communication range in swarm
robotics using platforms such as the Kilobots with infrared transceivers that are
limited to a physical communication radius of 10 cm [16,20]. However, in this
paper we argue that we ought to consider which agents should be allowed to
interact, rather than merely considering how those agents interact.
A common modelling assumption in distributed multi-agent and multi-robot
systems is the “well-stirred system” assumption: In a well-stirred system each
agent is equally likely to encounter any other agent in the system and, therefore,
each interaction is regarded as an independent event [15]. This equates to a sys-
tem in which the interaction network of the agents is totally connected and edges
are selected at random to simulate the random interactions of different pairs of
agents. However, this assumption does not consider that intentional constraints
placed on agent interactions may yield better performance than if the interac-
tion network is assumed to be totally connected and interaction depends only
upon the physical proximity of the agents. More recently, [18] found that “less is
more” when it comes to communication in a robot swarm and that, by reducing
the physical communication radius of the robots (or the density of the swarm),
the system-level performance improves as a result of there being less communi-
cation overall. While this is the most recent study to highlight the problem of
too much communication, [9, Sect. 3.2] previously demonstrated a similar prob-
lem in which agents became “too cooperative”. This general problem can be
traced further back to [14] in which the authors examined the effect of network
structure on system-level performance. A general survey of recent literature that
considers the connectivity of a multi-agent/robot system conducting collective
exploration is provided in [13].
In previous work we have studied the impact of applying constraints to the
interaction network in a non-physical multi-agent system within a collective
learning setting [5]. We showed that the “less is more” hypothesis holds true
for collective learning when the underlying interaction network’s topology is a
small-world network and that interaction networks with low connectivity lead
to optimal performance for collective learning. However, this study was limited
to studying interaction networks in abstract simulations, without the ability to
consider the physical network layer. Therefore, in this work we propose to study
both the physical and interaction networks in combination, to determine which
layer of connectivity is more important for performance in the development of
distributed autonomous systems.
An outline of the remainder of this paper is as follows: We begin by describing
collective learning as the context of our investigation, before defining a three-
valued model for collective learning. We present results for a totally-connected
interaction network but where the physical network connectivity is varied, so
as to understand the extent to which the physical network has an impact on
system performance. Then, we introduce constraints over the interaction network
by restricting its topology to that of a regular lattice network, before studying
system performance with respect to different levels of connectivity. Finally, we
conclude our findings with some discussions and highlight our main contributions
before detailing directions for future work.
2 Collective Learning in Multi-agent Systems


Social learning broadly describes the observed phenomena in animals where an
individual’s learning is influenced by observing and interacting with other indi-
viduals or their products [10]. An example of this is the use of pheromones to
facilitate teaching in the tandem-running ant Temnothorax albipennis [8]. This
style of learning forms the basis of many distributed autonomous systems which,
by design, rely on the interactions of individual agents to produce behaviours
more complex than could be demonstrated by any one individual.
In the last decade, distributed decision-making has been identified as a foun-
dational problem in swarm systems due to the need for large groups of agents
to cooperate on a shared task by functioning as a collective [2,17]. In swarm
systems, the ability for agents to make decisions in a distributed manner is cru-
cial for their real-world deployment, as robot swarms are appropriate platforms
for deployment in difficult terrain and hazardous environments, such as for the
detection of wildfires [12]. Due to their large scale, swarm systems often cannot
be supervised at the individual level and must instead rely on human supervi-
sion at the swarm level [11], while making their own decisions and taking actions
autonomously, often by reaching a consensus [19] or through quorum sensing [1].
Much of the literature on collective decision-making in robot swarms centres on
the best-of-n problem; a bio-inspired problem in which the swarm must decide
which is the best option of n possible alternatives [15]. However, this class of
decision-making is limited to scenarios in which each option is associated with a
quality value. Instead, we focus on a state-of-the-world problem whereby agents
must learn the current state of the world, in order to make informed decisions
about their assigned task.
In this paper we adopt the term collective learning to describe social learning
in a population of artificial agents attempting to reach a consensus about the
state of their environment [4]. In this context, collective learning is used specifi-
cally to describe the combination of two distinct processes: direct learning based
on evidence obtained directly from an agent’s environment, and indirect learning
in the form of an agent fusing its belief with the belief of another agent. The
effects of combining these two processes have been studied from an epistemologi-
cal perspective by [7] who argue that communication between agents acts both
to propagate information through the population, and to correct for errors that
occur whilst gathering evidence. The connection between these two processes
has been further investigated more recently in [3,6].

3 A Three-Valued Model for Collective Learning


We consider a collective learning problem in which a distributed system of
autonomous agents attempts to learn the true state of their environment. Sup-
pose that the environment can be described by a set of n propositional variables
P = {p1 , . . . , pn } with each relating to a discrete location, then an agent’s belief
about the environment is an allocation of truth values to each of the propositional
variables, b : P → {0, 1/2, 1}. In our model we adopt a three-valued propositional
logic with the truth values 0 and 1 denoting false and true, respectively, whilst
the third truth-value 1/2 is used to denote unknown and allows agents to express
uncertainty about the environment. For notational convenience, we can instead
represent an agent's belief b by the n-tuple B = ⟨B(p1), . . . , B(pn)⟩. By doing so,
an agent can express uncertainty about a location in the environment by assign-
ing the truth value 1/2 to any of the propositional variables in P. As an example,
let us suppose that an environment can be described by n = 2 propositional
variables, then the belief B = ⟨0, 1⟩ would represent that the agent believes p1
to be false, while p2 is true. Now let us suppose another agent holds the belief
B′ = ⟨1/2, 1/2⟩, then the agent is expressing uncertainty about both of the proposi-
tions p1 and p2, thereby indicating that the agent is totally uncertain about the
environment.
We now define the two processes of direct and indirect learning in the context
of our proposed three-valued model and describe how they combine to produce
collective learning at the macro level in distributed autonomous systems.

Evidential updating. The learning process by which an agent obtains evidence


directly from its environment, e.g., through onboard sensory capabilities, and
updates its current belief to reflect this evidence is referred to as evidential
updating. In our model, an agent first decides which proposition to investigate by
selecting a single proposition at random from those propositions about which the
agent is uncertain, i.e. where B(pi) = 1/2. Then, having selected a proposition pi to
investigate, the agent travels to the corresponding location in the environment
where it gathers evidence. Evidence takes the form of an assertion about the
truth value of the chosen proposition pi such that E = ⟨1/2, . . . , S*(pi), . . . , 1/2⟩,
where S* : P → {0, 1} denotes the ground truth – often referred to as the true
state of the world – such that the truth value of each proposition is certainly true
or false. Upon obtaining evidence E the agent then updates its belief to B|E as
follows:

    B|E = ⟨B(p1) ⊙ E(p1), . . . , B(pn) ⊙ E(pn)⟩ = B ⊙ E                  (1)

where ⊙ is a binary operator defined on {0, 1/2, 1} given in Table 1. We will elab-


orate on the reasoning behind our choice of fusion operator when we introduce
the belief fusion process but, for now, notice that the operator works for evi-
dential updating by preserving certainty over uncertainty and that updating in
this manner does not, therefore, alter the truth value of any proposition pj ∈ P
where pj = pi because E(pj ) = 12 . An agent repeatedly gathers evidence about
uncertain propositions until it has become totally certain about the environment,
at which point it ceases to look for evidence.
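
To make the operator concrete, the following minimal Python sketch reproduces
the truth table given in Table 1 below; the encoding of truth values as Fractions
and all function names are our own illustrative choices, not details taken from
the paper.

    from fractions import Fraction

    HALF = Fraction(1, 2)  # the third truth value, "unknown"

    def fuse(u, v):
        """The binary operator of Table 1: certainty is preserved over
        uncertainty, and conflicting certain values become uncertain."""
        if u == v:
            return u
        if u == HALF:      # unknown adopts the other, certain value
            return v
        if v == HALF:
            return u
        return HALF        # 0 fused with 1: disagreement becomes unknown

    def update(belief, other):
        """Element-wise application to n-tuples, as in Eqs. (1) and (3)."""
        return tuple(fuse(b, o) for b, o in zip(belief, other))

With beliefs encoded this way, evidential updating and the belief fusion process
introduced below are the same element-wise operation.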
In real-world environments the act of gathering evidence is often subject to
noise, stemming either from the use of noisy sensors – such as thermal sensors
used by UAVs [21] – or from the environment itself. In this model evidence shall
Table 1. The fusion operator ⊙ applied to beliefs B and B′.

                      B′(pi)
    B(pi) ⊙ |   0    1/2    1
    --------+-----------------
        0   |   0     0    1/2
       1/2  |   0    1/2    1
        1   |  1/2    1     1
therefore take the following form:

    E(pi) = { S*(pi)        with probability 1 − ε
            { 1 − S*(pi)    with probability ε                            (2)

where ε ∈ [0, 0.5] is a noise parameter denoting the probability that the evidence
is erroneous.
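
Reusing HALF from the sketch above, noisy evidence in the sense of Eq. (2)
could be sampled as follows; the function name and the tuple-valued ground
truth s_star are our own assumptions.

    import random

    def observe(s_star, i, eps):
        """Sample noisy evidence about proposition i, as in Eq. (2): the
        true value S*(p_i) with probability 1 - eps, its negation with
        probability eps; all other propositions are reported as unknown."""
        value = s_star[i] if random.random() >= eps else 1 - s_star[i]
        return tuple(value if j == i else HALF
                     for j in range(len(s_star)))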

Belief Fusion. In collective learning the agents do not exist in isolation. Instead,
individual agents can benefit from being able to interact with, and learn from,
other agents in a shared environment. To this end, we now describe the indirect
learning process by which pairs of agents fuse their beliefs in order to improve
the system’s ability to learn the state of the world.
The belief fusion process provides two crucial benefits: Firstly, agents are able
to propagate information they obtain directly from the environment through the
system by fusing their beliefs with those of other agents. We propose to combine
beliefs using the three-valued fusion operator in Table 1 which is applied element-
wise to all of the propositional variables pi ∈ P. Given two beliefs B and B′,
corresponding to the beliefs of two agents, the fused belief is then given by:

    B ⊙ B′ = ⟨B(p1) ⊙ B′(p1), . . . , B(pn) ⊙ B′(pn)⟩                     (3)

Secondly, the belief fusion operator helps to correct for errors that occur when
the evidence obtained by the agents is subject to noise. This error-correcting
effect arises when two agents have gathered evidence about the same proposition
pi , independently of one another, and encounter a disagreement. When a pair
of agents differ in their belief about the truth value of pi , then the belief fusion
process leads them both to become uncertain with regards to pi . As an example,
suppose that two agents hold the beliefs B1(pi) = 1 and B2(pi) = 0, then
upon the agents fusing their beliefs such that B1 ⊙ B2(pi) = 1/2, both agents
will attempt to seek additional evidence about proposition pi , either directly
from the environment or indirectly via fusion with other agents, having become
uncertain about the truth value of pi . Notice that the belief fusion process can
lead to agents becoming less certain when the fusing agents disagree about the
truth value of a given proposition, while evidential updating only leads to agents
becoming more certain.
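
Continuing the illustrative sketch above, this error-correcting effect can be seen
directly: fusing two conflicting beliefs replaces the disagreement with uncertainty.

    B1 = (1, 0)            # agent 1: p1 true, p2 false
    B2 = (0, 0)            # agent 2: p1 false, p2 false
    print(update(B1, B2))  # (Fraction(1, 2), 0): the conflict over p1
                           # becomes "unknown"; agreement on p2 is kept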
Measuring Performance. In the context of collective learning agents are attempt-


ing to learn a representation of the true state of the world underlying their
environment. As such, we can measure the accuracy of the average belief of the
population with respect to the true state of the world in the following way: Given
a population of m agents, the normalised difference between each agent’s belief
B and the true state of the world S*, averaged across the population, is the
average error of the system:

    (1/m) Σ_{j=1}^{m} (1/n) Σ_{i=1}^{n} |Bj(pi) − S*(pi)|                 (4)

This shall be our primary metric of performance.
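
Under the same illustrative encoding as the earlier sketches, the metric of Eq. (4)
can be computed as below; the function name is ours.

    def average_error(beliefs, s_star):
        """Average error of Eq. (4): the population mean of each agent's
        normalised distance between its belief and the true state."""
        n = len(s_star)
        per_agent = [sum(abs(b[i] - s_star[i]) for i in range(n)) / n
                     for b in beliefs]
        return sum(per_agent) / len(beliefs)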

4 Simulation Environment

Fig. 1. Simulation environment depicting a noisy collective learning scenario in which
m = 20 agents attempt to learn about n = 126 locations when the environment noise
ε = 0.2. Each agent has a communication radius Cr = 20.

We now introduce the physical simulation environment used to study collec-


tive learning in a multi-agent system. In Fig. 1 the simulation environment is
depicted with 20 agents and 126 hexagonal locations for the agents to investi-
gate¹. Agents are represented by black Unmanned Aerial Vehicles (UAVs) with
either a grey or a blue dashed circle surrounding them, indicating their phys-
ical communication radius Cr . If two agents are within each other’s physical
radii of communication then communication between those two agents is possi-
ble. If the radius is blue, then an agent is said to be in a communicating state,
whilst grey indicates that it is not communicating. At initialisation (Fig. 1a),
the environment is greyed out to represent the collective lack of information that
the system holds about its environment. During the collective learning process
(Fig. 1b), agents begin to build an internal representation of the state of the
world, which can be averaged across the system to give us a general view of
performance. To show this, the colours of the hexagons increase in intensity,
representing the collective certainty about each location as contained in the belief
of the average agent. The different colours are used to represent ocean (blue),
beach (yellow), and low and high terrain (light and dark green), but are purely
visual in this context. The brown hexagon is the initial launch location from
which all agents begin exploring. In the case that the system collectively adopts
the incorrect truth value for a given proposition, the hexagon at its associated
location is coloured red, with its intensity still representing the average certainty
with which the agents have (incorrectly) determined its truth value. At the end
of the simulation (Fig. 1c) the agents have reached a consensus about the state
of the world and the map is in full view, albeit with some errors depicted by red
hexagons.

¹ We found that 20 agents were sufficient to complete the task with performance
similar to that of 50 agents.

Fig. 2. A snapshot of the simulation during an experiment. In this model two different
types of networks are formed which govern the agent interactions in the system. (a)
A snapshot of the simulation environment during runtime. (b) A zoomed-in view of 7
agents with a system-wide communication radius Cr = 20. (c) A representation of the
physical network that forms from the agents' communication radii. (d) A representation
of the pre-assigned interaction network between the visible agents. (e) The resulting
communication links that are available according to edges which are present in both
the physical and interaction networks.

5 The Networks Underlying Agent Interactions


When we consider the design of distributed autonomous systems (e.g. multi-
agent systems or robot swarms), the system-level performance is driven solely
by the interactions of its constituent agents with their environment and with each
other. The consequence of this fact is that we focus our effort on designing the
rules and processes which define how the agents interact, but often we neglect to
consider which agents ought to interact. In the past, we have often assumed that
it is beneficial for agents to be able to interact freely with any other agent in the
system, but recent works have shown this to be false [5,18]. To understand the
extent to which agents interact in our simulation, we now distinguish between
the physical network and the interaction network as they occur in our simulation
environment.
In Fig. 2 we show a snapshot of the simulation during runtime in which we
overlay visual representations of the different networks which are formed during a
single time step t. Assuming each agent acts as a node in both a physical network
and an interaction network, then the red edges connect pairs of agents which
are physically capable of communicating, based on the agents being within their
communication range with radius Cr . Edges in the physical network are added
or removed during each time step t of the simulation according to which pairs
of agents are within communication range of one another. The yellow edges,
meanwhile, depict the connectivity of the interaction network as determined at
the start of each experiment. Unlike in the physical network, the pairs of agents
which are allowed to interact are immutable, such that no new edges are added,
nor existing edges removed after the network is initialised. The green edges
depict which edges exist in both networks at a given time step t. Each edge then
corresponds to a pair of agents which are both able and allowed to interact with
one another.
An interaction network can be constructed in whatever manner leads to opti-
mal system performance, provided that the agents remain capable of interacting
physically in the environment during their task. The default interaction network
in many multi-agent and multi-robot systems is a totally-connected network. In
this paper we seek to constrain interactions by reducing the connectivity of the
network whilst improving system performance. To do this we shall use regular
lattice networks, having found in previous work that this topology outperforms
even small-world networks in this setting [5,22].

6 Multi-agent Simulation Experiments

For each experimental condition, the system proceeds in the following way: We
initialise m = 20 agents at the same starting location and with total uncertainty
about the n = 126 locations, i.e. each agent begins with the belief B(pi) = 1/2
for pi ∈ P. Every agent is also initialised with the same communication radius
Cr ∈ {20, . . . , 100} and a communication frequency Cf ∈ [0, 1].
After initialisation, each agent selects a proposition pi at random about which
they are uncertain and begins travelling to the associated location in search of
evidence about pi . When they arrive at the location the agent will observe their
environment and obtain evidence E according to Eq. (2), before updating its
belief by Eq. (1). The agent will then select another uncertain proposition pi
to investigate and begin to travel in the direction of the new location. At this
moment, the agent will also enter a communicating state with probability Cf .
If the agent does not enter the communicating state then it continues to its
destination to obtain evidence and the process repeats. If the agent does enter
the communicating state, however, then the agent begins broadcasting its belief
to any agents that come within its communication radius Cr as it travels to its
destination. The agent remains in the communicating state until it encounters
one or more other communicating agents within communication range. Upon
receiving the belief(s) of the other agent(s), the agent will cease communicating
and update its belief B by selecting another belief B′ at random before adopting
the fused belief B ⊙ B′ by Eq. (3). If the agent remains uncertain about the
Fig. 3. Average error at steady state against communication radius Cr for increasing
communication frequency Cf ∈ [0, 1]. Each dotted line shows the noise ε ∈ {0, 0.1, 0.3}.
The shaded regions show a 95% confidence interval.

location to which it is travelling, then the agent continues to that location as


normal. However if the agent, having fused its belief, becomes certain about the
location during travel then it simply repeats the process of searching for new
evidence by first selecting a new uncertain location from its belief. Finally, if
instead the agent holds a totally certain belief, such that B(pi ) ∈ {0, 1} for all
pi ∈ P, then the agent repeatedly enters a communicating state in an attempt
to reach a consensus with the rest of the system.
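
A highly simplified sketch of this per-agent control flow is given below, reusing
update, observe and HALF from the earlier snippets; the Agent fields, the step
function and the abstraction of travel, arrival and radio range are all our own
assumptions, not details of the authors' simulator.

    import random
    from dataclasses import dataclass

    @dataclass
    class Agent:
        belief: tuple             # n-tuple over {0, 1/2, 1}
        target: int | None        # proposition currently investigated
        communicating: bool = False

    def step(agent, s_star, eps, c_f, received):
        """One simplified decision point per time step: fuse with one
        received belief if communicating, otherwise gather evidence."""
        if agent.communicating and received:
            agent.belief = update(agent.belief, random.choice(received))
            agent.communicating = False    # stop broadcasting after fusing
        elif agent.target is not None:
            # arrived at the target: noisy evidential updating, Eqs. (1)-(2)
            agent.belief = update(agent.belief,
                                  observe(s_star, agent.target, eps))
            unknown = [i for i, b in enumerate(agent.belief) if b == HALF]
            agent.target = random.choice(unknown) if unknown else None
            # enter a communicating state with probability C_f
            agent.communicating = random.random() < c_f
        else:
            # totally certain: keep communicating to drive a consensus
            agent.communicating = True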
Each experiment runs for a maximum of 30,000 time steps or until the
population has reached a unanimous consensus about their environment. We
repeat each experiment 50 times for each parameter combination Cr and Cf
and for three different noise levels ε ∈ {0, 0.1, 0.3}, representing noise-free, low-
and moderate-noise scenarios, respectively. The results are then averaged across
the 50 runs. We now present results for simulation experiments where the under-
lying interaction network is totally-connected, in keeping with the “well-stirred
system” assumption, in which we study how changes to the connectivity of the
physical network affect the macro-level dynamics of the system.

6.1 Convergence Results for Physical Networks


In this section we show convergence results for a static totally-connected interac-
tion network and focus our attention on the dynamic physical network for which
the connectivity changes based on the chosen communication radius Cr . In all
figures, the shaded regions show a 95% confidence interval.
In Fig. 3 we show the average error of the population (Eq. (4)) at steady
state against communication radius Cr. Each solid line shows the results for
a different noise level ε, while the dashed line of the same colour marks the value of ε itself.
Each plot within this figure represents a different communication frequency Cf
increasing from 0 (no communication) to 1 (unrestricted communication). For
Cf = 0 we present a special case in which agents do not communicate with
one another (referred to as “asocial learning” by [10]). As a result of this total
absence of belief fusion, the agents rely solely upon evidential updating to inform
their beliefs and, in the case where the evidential updating process is noisy, the
agents are unable to learn an accurate representation of their environment. The
result is the system reaching an average error around ε for all communication
radii Cr. Therefore, ε serves as a simple benchmark of performance, and any
reduction in the average error to below ε demonstrates improved performance.
As we increase Cf and agents are able to fuse their beliefs, we see an imme-
diate and significant reduction in the average error for Cf ≥ 0.001 such that
for all noisy scenarios (i.e. for ε = 0.1 and 0.3) the average error is reduced to
below the level of noise in each scenario. In the noise-free scenario (green line)
the populations converge to an average error above 0 when the communication
frequency is too low (i.e. for Cf < 0.1), suggesting that while the population has
converged to a highly accurate average belief, the population does not reach a
consensus across the system. Here we see that the best performance is measured
for a communication frequency Cf = 0.025, but increasing Cf further results in
an increased average error and, therefore, worse performance in the system.
Perhaps surprisingly, increasing the communication radius Cr of the system,
and thereby the connectivity of the physical network over which agents com-
municate, has little effect on the macro-level performance as measured by the
average error of the system at steady state. The lines in Fig. 3 remain relatively
flat across different communication radii Cr , with the performance of the system
being almost entirely determined by the communication frequency Cf . This sug-
gests that the level of connectivity of the physical network has very little impact
on the performance of the system for the purpose of collective learning when
agents can communicate freely provided they are in range.

7 Constraining the Interaction Network

Fig. 4. Network diagrams with different levels of connectivity. (a) Regular lattice
network with k = 2. (b) Regular lattice network with k = 4. (c) Totally-connected
network.

We now investigate how applying constraints to the interaction network, i.e.


limiting the micro-level dynamics, affects the macro-level performance of our
model for collective learning. In the remainder of this paper, we will study inter-
action networks which take the form of regular lattice networks with different
levels of connectivity. A regular lattice network can be defined by its connectivity
k ∈ [2, m − 1] where k determines how many “nearest neighbours” each agent is
connected to when the agents are arranged around a ring. Examples of lattice
networks for k = 2 and k = 4 are shown in Figs. 4a and 4b, respectively, and are
compared with a totally-connected network equivalent to k = 19 in Fig. 4c.
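
For illustration, such interaction networks can be generated with the networkx
library, since a Watts–Strogatz graph with rewiring probability p = 0 is exactly
a regular ring lattice; the variable names below are our own.

    import networkx as nx

    m = 20                                           # number of agents
    lattice_k2 = nx.watts_strogatz_graph(m, 2, p=0)  # ring lattice, k = 2 (Fig. 4a)
    lattice_k4 = nx.watts_strogatz_graph(m, 4, p=0)  # ring lattice, k = 4 (Fig. 4b)
    total = nx.complete_graph(m)                     # totally connected, k = m - 1 (Fig. 4c)

    # with p = 0 no edges are rewired, so every node keeps exactly k neighbours
    assert all(deg == 4 for _, deg in lattice_k4.degree())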
Fig. 5. Average error at steady state against communication radius Cr for increasing
communication frequency Cf ∈ [0, 1]. Each dotted line shows a different lattice network
with connectivity k ∈ {2, 4, 8, 16}. The shaded regions show a 95% confidence interval.

7.1 Convergence Results for Constrained Interaction Networks


Results are presented for interaction networks with differing levels of connectivity
k. We continue to vary the connectivity of the physical network via the commu-
nication radius to observe whether the physical layer, in conjunction with the
interaction layer, has some effect on system performance that was not observed
previously.
In Fig. 5 we show a comparison of the average error of the system at steady
state against communication radius Cr for different communication frequencies
Cf . Each solid line represents a different regular lattice network of connectivity k.
In addition, we compare results for noise ε = 0.1 (top) against ε = 0.3 (bottom),
and show the noise level as a red, dotted line. Beginning with Fig. 5a we can
see that the low-noise scenario converges to an average error at or around 0
for communication frequencies Cf < 0.2 with the best performance observed
for Cf = 0.1. We also see that the physical connectivity of the system has
little impact when comparing across communication radius Cr . In the high-noise
scenario of Fig. 5b, however, we see that the average error remains above 0 for
all values of Cr and Cf and that, when Cf ≥ 0.1, there is a sudden increase in
average error for k ≥ 4. Most importantly, we observe that the networks with the
highest connectivity exhibit the largest increases in average error as Cf increases,
such that k = 16 always results in worse performance than k = 8, and so forth.
An exception occurs in the high-noise scenario with very low communication
frequencies Cf < 0.1, where a network with k = 2 performs slightly worse
than more connected networks. However, for Cf ≥ 0.1, k = 2 performs on par
or better than more connected networks. For example, at the extreme with a
Fig. 6. Average error against time t for Cr = 20 and Cf = 0.2 for different regular
lattice networks with connectivity k ∈ {2, 4, 8, 16} and a totally-connected network (i.e.
k = 19). Each dotted line shows the noise ε ∈ {0, 0.1, 0.3}. The shaded regions show a
95% confidence interval.

communication frequency Cf = 1, a network of k = 2 has 1/5 the average error
compared with k = 16 for the same communication radius Cr = 20. When we
increase the radius to Cr = 100, a network of k = 2 achieves less than 1/2 the average
error of a network with k = 16. Furthermore, the impact of the physical network
also becomes more apparent in the high-noise scenario where, for Cf = 1, we see
a clear increase in average error as the communication radius Cr is increased.
To better understand the dynamics of our model and how the underlying
connectivity affects time to convergence, we present trajectories of the average
error against time in Fig. 6. Each plot represents a network of different connec-
tivity k, and each line shows that network’s performance for different noise .
We have fixed Cr = 20 and Cf = 0.2 as the best performing parameters in this
comparison. From left to right, as connectivity increases, we see that for a high-
noise scenario the time to convergence is much slower, with the k = 2 network
converging after around 25,000 time steps. This is reduced to less than 15,000
time steps for k = 4 but with slightly higher average error. Then, as we continue
to increase the connectivity k, the network continues to converge in less time
but with greater average error. Therefore, increasing the connectivity worsens
performance when the noise is high.
These results suggest that, for higher communication frequencies, the combi-
nation of both the interaction network and the physical network is an important
consideration. It is also clear that as the interaction network becomes increas-
ingly connected, the connectivity of the physical network has less of an impact,
and that the inverse is true where, for a less connected interaction network, the
physical network has greater impact overall. Therefore, both networks are impor-
tant for considering optimum performance in distributed autonomous systems.

8 Conclusion and Future Work

In this paper we have proposed to study the effects of both the physical and
interaction networks on the collective learning performance of a multi-agent sys-
tem. We have demonstrated that, when the underlying interaction network is
totally connected, the physical network has minimal impact on performance in
our model. However, when the connectivity of the interaction network is greatly
reduced the system performs better, achieving lower average error when com-
pared with highly-connected interaction networks under the same experimental
conditions. Limiting the interactions between agents also results in improved
robustness to noise, where increased connectivity at the physical layer only wors-
ens performance for the same limited interaction network. Overall, performance
is best when the connectivity of both the physical and interaction networks is
severely reduced, e.g., for a connectivity of k = 2 and a communication radius
Cr = 20.
To our knowledge, this is the first such study explicitly investigating the
impact of both the physical and interaction networks in combination in a swarm
setting. While it is difficult to draw strong, broad conclusions from this study, we
hope that this will encourage other researchers working on collective behaviours
to consider the importance of the underlying interaction network and its level
of connectivity. This study is, of course, limited to simulations of the effects of
the interplay between the physical and interaction networks; a more theoretical
study of this interplay is needed to understand the reasons behind the observed
dynamics.
Up to now we have only considered regular lattice networks and small-world
networks as the basis for interaction networks. These networks possess certain
desirable properties which may be more conducive to the processes involved in
collective learning than other types of networks are. Furthermore, we
have only considered the effect of network connectivity in the context of collective
learning. This effect may not necessarily be observed for other kinds of collective
behaviours. As such, we intend to further explore how network connectivity,
and information sharing more generally, impacts different kinds of collective
behaviours in multi-agent systems. We are also interested in understanding how
a more dynamic environment might change the optimal level of connectivity
in the networks, as less connectivity may lead to outdated information being
retained by the population which is likely to affect performance over time.

Acknowledgements. This work was funded and delivered in partnership between


Thales Group, University of Bristol and with the support of the UK Engineering and
Physical Sciences Research Council, ref. EP/R004757/1 entitled “Thales-Bristol Part-
nership in Hybrid Autonomous Systems Engineering (T-B PHASE).”

References
1. Bechon, P., Slotine, J.J.: Synchronization and quorum sensing in a swarm of
humanoid robots. arXiv preprint arXiv:1205.2952 (2012)
2. Brambilla, M., Ferrante, E., Birattari, M., Dorigo, M.: Swarm robotics: a review
from the swarm engineering perspective. Swarm Intell. 7(1), 1–41 (2013)
3. Crosscombe, M., Lawry, J.: A model of multi-agent consensus for vague and uncer-
tain beliefs. Adapt. Behav. 24(4), 249–260 (2016)
4. Crosscombe, M., Lawry, J.: Collective preference learning in the best-of-n problem.
Swarm Intell. 15(1), 145–170 (2021)
5. Crosscombe, M., Lawry, J.: The impact of network connectivity on collective learn-
ing. In: Matsuno, F., Azuma, S., Yamamoto, M. (eds.) DARS 2021. SPAR, vol. 22,
pp. 82–94. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-92790-5_7
6. Douven, I.: Optimizing group learning: an evolutionary computing approach. Artif.
Intell. 275, 235–251 (2019)
7. Douven, I., Kelp, C.: Truth approximation, social epistemology, and opinion
dynamics. Erkenntnis 75(2), 271–283 (2011)
8. Franks, N.R., Richardson, T.: Teaching in tandem-running ants. Nature 439(7073),
153–153 (2006)
9. Hamann, H.: Superlinear scalability in parallel computing and multi-robot systems:
shared resources, collaboration, and network topology. In: Berekovic, M., Buchty,
R., Hamann, H., Koch, D., Pionteck, T. (eds.) ARCS 2018. LNCS, vol. 10793, pp.
31–42. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77610-1_3
10. Heyes, C.M.: Social learning in animals: categories and mechanisms. Biol. Rev.
69(2), 207–231 (1994)
11. Hogg, E., Harvey, D., Hauert, S., Richards, A.: Evolving robust supervisors for
robot swarms in uncertain complex environments. In: Matsuno, F., Azuma, S.,
Yamamoto, M. (eds.) DARS 2021. SPAR, vol. 22, pp. 120–133. Springer, Cham
(2022). https://doi.org/10.1007/978-3-030-92790-5_10
12. Innocente, M.S., Grasso, P.: Self-organising swarms of firefighting drones: Har-
nessing the power of collective intelligence in decentralised multi-robot systems. J.
Comput. Sci. 34, 80–101 (2019)
13. Kwa, H.L., Leong Kit, J., Bouffanais, R.: Balancing collective exploration and
exploitation in multi-agent and multi-robot systems: a review. Front. Robot. AI 8,
771520 (2022)
14. Lazer, D., Friedman, A.: The network structure of exploration and exploitation.
Adm. Sci. Q. 52(4), 667–694 (2007)
15. Parker, C.A., Zhang, H.: Cooperative decision-making in decentralized multiple-
robot systems: the best-of-n problem. IEEE/ASME Trans. Mechatron. 14(2), 240–
251 (2009)
16. Rubenstein, M., Ahler, C., Hoff, N., Cabrera, A., Nagpal, R.: Kilobot: a low cost
robot with scalable operations designed for collective behaviors. Robot. Auton.
Syst. 62(7), 966–975 (2014). Reconfigurable Modular Robotics
17. Schranz, M., Umlauft, M., Sende, M., Elmenreich, W.: Swarm robotic behaviors
and current applications. Front. Robot. AI 7, 36 (2020)
18. Talamali, M.S., Saha, A., Marshall, J.A., Reina, A.: When less is more: robot
swarms adapt better to changes with constrained communication. Sci. Robot.
6(56), eabf1416 (2021)
19. Valentini, G.: Achieving Consensus in Robot Swarms. SCI, vol. 706. Springer,
Cham (2017). https://doi.org/10.1007/978-3-319-53609-5
20. Valentini, G., Ferrante, E., Dorigo, M.: The best-of-n problem in robot swarms:
formalization, state of the art, and novel perspectives. Front. Robot. AI 4, 9 (2017)
21. de Vries, J.: Image processing and noise reduction techniques for thermographic
images from large-scale industrial fires. In: Proceedings of the 12th International
Conference on Quantitative InfraRed Thermography (QIRT) (2014)
22. Watts, D., Strogatz, S.: Collective dynamics of ‘small-world’ networks. Nature 393,
440–442 (1998)
Outlining the Design Space of eXplainable
Swarm (xSwarm): Experts’ Perspective

Mohammad Naiseh¹(B), Mohammad D. Soorati², and Sarvapali Ramchurn²


¹ Department of Computing and Informatics, Bournemouth University, Poole, UK
mnaiseh1@bournemouth.ac.uk
² Department of Electronics and Computer Science, University of Southampton,
Southampton, UK
{m.soorati,sdr1}@soton.ac.uk

Abstract. In swarm robotics, agents interact through local rules to solve


complex tasks beyond an individual’s ability. Even though swarms are
capable of carrying out some operations without the need for human
intervention, many safety-critical applications still call for human oper-
ators to control and monitor the swarm. There are novel challenges to
effective Human-Swarm Interaction (HSI) that are only beginning to be
addressed. Explainability is one factor that can facilitate effective and
trustworthy HSI and improve the overall performance of the Human-Swarm
team. Explainability has been studied across various Human-AI domains, such
as Human-Robot Interaction and Human-Centered ML. However, it is
still unclear whether explanations studied in the Human-AI literature
would be beneficial in Human-Swarm research and development. Fur-
thermore, the literature lacks foundational research on the prerequisites
for explainability requirements in swarm robotics, i.e., what kind of ques-
tions an explainable swarm is expected to answer, and what types of
explanations a swarm is expected to generate. By surveying 26 swarm
experts, we seek to answer these questions and identify challenges experts
faced in generating explanations in Human-Swarm environments. Our work
contributes insights into defining a new area of research of eXplainable
Swarm (xSwarm) which looks at how explainability can be implemented
and developed in swarm systems. This paper opens discussion on xSwarm
and paves the way for more research in the field.

Keywords: Explainable AI · Swarm Robotics · Human-Swarm


Interaction

1 Introduction
Swarm robotics consists of multiple robots that interact with each other to
accomplish a collective task that is beyond individual ability, or that can be per-
formed more effectively by a group of robots [11]. Swarm robotics promises appli-
cations in many scenarios such as pollution control, surveillance, package
delivery and firefighting. Although swarms can perform the task autonomously,
human intervention is still crucial for safe and effective deployment in real-
world scenarios [1]. For example, swarm operators may need to manage swarm
tasks that might be beyond swarm capability or intervene in cases of failures
and errors. This intervention could be more complex when it requires multiple
humans to work together in a team to operate remote swarms [10]. For instance,
in a search and rescue scenario where the swarm is responsible for searching and
identifying casualties, the system may consist of two operators: (a) an operator
tasked with scheduling clusters of robots to search areas and (b) an analyst
tasked with analysing images of casualties.
Human-Swarm Interaction (HSI) is an emerging field of research that studies
human factors and swarm robotics to better utilise both humans and swarm
capabilities in a Human-Swarm environment. An essential part of HSI is com-
municating, to humans, the reasons behind swarms’ decisions and behaviour
[25]. As swarms become more advanced, the explanation of their behaviour (indi-
vidual and collective) becomes buried in increasingly complex communication
between swarm agents and underlying black-box machine learning models.
Explainable Artificial Intelligence (XAI) is a response to this direction of re-
search. XAI is a domain of research that is interested in making the behaviour or
the output of intelligent systems more intelligible to humans by providing expla-
nations [18]. While the concept of XAI is increasingly gaining attention in several
research domains, such as Agent Planning [3], Machine Learning [22] and Motion
Planning [6], there is a clear gap in the literature on integrating swarm robotics
into the research and development of XAI. Similar to other XAI fields of research,
we argue that eXplainable Swarm (xSwarm) should address the fact that an explanation
is an answer to an unexpected event or a failure of the system [18]. However,
compared to other XAI domains, xSwarm can be significantly different since
the decision made by a swarm depends heavily on the communication and
collective contribution of its agents [25]. Further, the explainability of swarms
would require explaining the emergent collective behaviour, which is not necessarily
the summation of the individual robot behaviours [5].
Many studies discussed the need for XAI methods and models for Human-
Swarm environments to improve trust and maintain situational awareness, e.g.,
[1,20,25]. However, at the current stage of research, it is still unclear what kind of
explanations a swarm system should be expected to generate and what questions
an explainable swarm is expected to answer. This paper seeks to conceptualise
xSwarm and introduce a new taxonomy of explainability require-
ments for HSI. Our results can guide the requirements elicitation for explanation
methods and models for Human-Swarm teaming. These insights contribute to
operationalising the concept of xSwarm in reality and can serve as a resource
for researchers and developers interested in designing explainable Human-Swarm
interfaces.
In this paper, we limit our research to swarm robotics, i.e., this paper does
not address multi-robot systems that have explicit goals and whose agents execute
both individual and group plans. We follow a similar approach proposed in recent
publications to outline explainability requirements in a given problem [6,15]. Our
method uses an open-text, scenario-based questionnaire with domain experts,


to elicit explainability considerations for swarm systems. We foresee a lack of
established understanding of the concept or common technical knowledge given
the early stage of xSwarm in industrial practices. Therefore, we used a novel
use-case developed in our recent work [10] and supported with failure scenarios
and explanations to ground our investigation. Our data set was collected from
an online questionnaire with 26 swarm experts. In this paper, we make the
following contributions:
– We present xSwarm taxonomy that summarises the explainability space for
swarm robotics from swarm experts’ perspective.
– We summarize current challenges faced by researchers and industry practi-
tioners to create xSwarm.

2 Background and Related Work


So far, the process of generating explanations for users of AI systems has received a
lot of attention from multiple domains, including psychology [21], artificial intel-
ligence [16], social sciences [18] and law [4]. Furthermore, regulatory requirements
and increasing customer expectations for responsible artificial intelligence raise
new research questions on how to best design meaningful and understandable
explanations for a wide range of AI users [23]. To respond to this trend, many
explainable AI (XAI) models and algorithms have been proposed to explain
black-box ML models [23], agent-path planning [3], multi-agent path planning [2]
and scheduling [14]. These explanations can range from local expla-
nations that provide a rationale behind a single decision to global explanations
that describe the overall AI logic [16].
Despite a growing interest in explainable AI, there is a limited number of
studies on how to incorporate explainability in swarm robotics. Only a few
attempts in the literature have discussed explainability in swarm robotics. For
instance, a scalable human-swarm interaction system has been proposed to allow
a single human operator to monitor the state of the swarm and create tasks for the
swarm [11]. Their study with 100 participants showed that users have differ-
ent visualisation preferences for explanations to observe the state of the swarm
depending on the type of the given task. The explanation in this study was
limited to a visual representation of the swarm coverage using a heatmap or indi-
vidual points, which does not capture the wider space of xSwarm. Another
example is the work proposed in [20]. They proposed Human-Agent EXplain-
ability Architecture (HAEXA) to filter explanations generated by swarm agents
with the aim of reducing human cognitive load. Finally, Roundtree et al. [25] dis-
cussed design challenges and recommendations for explainable Human-Swarm
interfaces, e.g., they described the delivery of explanations to human operators
in swarm systems as the main challenge due to human processing limitations
caused by a large number of individual robots in the swarm. Although these stud-
ies provide steps toward the right direction for xSwarm, it is still unclear whether
explanations developed in other XAI domains would be useful for Human-Swarm
environments. Also, it is still undefined what kind of questions a swarm system


is expected to answer or what explanations from swarm robotics would look like.
Meanwhile, recent papers have been published in multiple domains to guide the
exploration of explainability requirements in a given domain. Liao et al. [15]
created a list of prototypical user questions to guide the design of human-centred
explainable black-box ML. Similarly, Brandao et al. [6] proposed a taxonomy
of explanation types and objects for motion planning based on interviewing
motion planning experts and understanding their current work and practice.
Our work is driven by a similar approach for outlining xSwarm’s design space.
More particularly, in light of the growing interest in XAI, we are interested
in understanding what is required to create xSwarm and what the major
practical challenges to developing it are.

3 Methodology


Ethical approval was acquired via the Ethics Committee of the School of Elec-
tronics and Engineering from the University of Southampton [Reference No.
67669].

3.1 Participants
Thirty-two participants voluntarily accessed our online questionnaire. Partici-
pants were approached through two main mailing lists: multi-agent¹ and TAS-
Hub². Four participants were excluded because they identified them-
selves to have less than one year of working in swarm robotics. Two more par-
ticipants were also excluded as they chose not to complete the questionnaire,
leaving us with twenty-six experts. 46.2% of our participants were researchers,
7.7% were lecturers, 23.1% were PhD students, and 22.9% held industrial roles.
All experts had at least one year of experience working in swarm robotics. 61.5%
of participants had 1–5 years of experience, 23.1% had 5–10 years, 7.7% had
10–15 years and 7.7% exceeded 15 years of experience in the field.

3.2 Study Procedure and Design


Due to the limited research in xSwarm, we chose to conduct an online question-
naire as it allows us to scope an early investigation into outlining a preliminary
xSwarm taxonomy for future in-depth investigations. It also helped us to gather
feedback from a large number of experts in a short time compared with other data
collection methods such as interviews [17]. We designed our questionnaire to include
several elicitation probes to obtain representative results and minimise cognitive
heuristics and bias [13,19] (Appendix A - Table A.1). First, a draft question-
naire was developed by the first author. Then, the draft was pilot tested with
¹ Multi-agent systems – The Alan Turing Institute:
https://www.turing.ac.uk/research/interest-groups/multi-agent-systems.
² UKRI Trustworthy Autonomous Systems Hub (TAS-Hub): https://www.tas.ac.uk/.
experts (researchers from inside and outside the research group) to ensure
that the question formulation could achieve the research goal and be understandable
to study participants. The questionnaire starts by asking participants to read
the participant information sheet. The information sheet outlines the motivation
of the study, compiles background material about the research goal and problem,
and describes the questionnaire structure. The participant information sheet
also contains background material about explainable AI to minimise any avail-
ability bias [19]. Then participants were directed toward the main questionnaire.
The questionnaire has four main sections. Table 1 provides a brief description of
the questionnaire structure and probe types used in each section.

Table 1. Questionnaire structure and elicitation probes

Section: Swarm experience
  Probe type: N/A
  Description: Demographic questions.

Section: Swarm failures
  Probe type: Knowledge
  Description: Participants' knowledge about swarm failures, e.g., What are the
  main reasons for swarm failures? What behaviour sparks your concern?

Section: xSwarm scenario
  Probe type: Situation assessment
  Description: Participants' assessment of a swarm failure or unexpected
  behaviour, e.g., What questions would you ask to debug this failure? Why do
  you think the swarm failed in this case?

Section: xSwarm challenges
  Probe type: Experience
  Description: Participants' previous experience in generating explanations from
  a swarm, e.g., What kind of challenges did you face when generating an
  explanation from a swarm?

  Probe type: Hypothetical
  Description: Participants' needs in developing eXplainable swarms, e.g., How
  would swarm developers be supported in integrating explainability?

Swarm Experience. In this section, we asked participants to provide four


demographic factors related to their previous swarm experience - participants’
area of research or experience, current role, years of experience in swarm
robotics, and level of experience in swarm robotics. We excluded participants
who had less than a year of experience or had identified themselves with a low
level of experience in the field.
xSwarm Scenario. This section used an industrial use case representing a
real-world HSI scenario where explainability is a crucial
requirement [10]. The use case presented the following scenario: “A hurricane has
affected a settlement. Many were evacuated, but some did not manage to escape
and have been caught in debris and fallen structures. A UAV (Unmanned Aerial
Vehicle) swarm is being deployed to identify locations of survivors, to evaluate
their condition, and to advise on where unmanned ground rescue vehicles should
be deployed to evacuate casualties”. A diagram for the use case can be found in
Appendix A - Figure A.1. First, we asked participants to select when and what
the swarm should explain to human operators during the task.
Then, we provided two cases of swarm failures or unexpected behaviour and
asked participants to explain these cases. This is a situation assessment probe to
determine the bases for explanations during failures or unexpected behaviour,


rather than waiting for these incidents to happen [13]. We also asked participants
to state what questions they would ask to debug failures. Figure 1A and Fig. 1B
show two scenarios of either a failure or an unexpected behaviour during two main
stages of the swarm task: (a) Planning: a swarm has suddenly stopped while
it is trying to reach its destination, and (b) Searching: the swarm has failed to
identify all the casualties in the search area.

Fig. 1. Scenarios of swarm failures or unexpected behaviour presented to participants.
(A) The swarm has suddenly stopped. (B) The swarm has failed to recognise all possible
casualties in the search area.

xSwarm Challenges. The questions in this section aim to understand the


challenges faced by experts to generate explanations from swarm systems. First,
this section included open-ended questions to reflect on participants' experience in
generating or obtaining explanations from swarm systems. We asked questions
such as “What were the typical difficulties you encountered while trying to extract
explanations from swarm systems?”. The section also contained hypothetical
questions to identify the key tools and techniques that swarm developers should
have to build xSwarm. We asked questions such as “How would swarm developers
be better supported in integrating explainability?” and “What kind of issues and
considerations should developers have in mind when developing xSwarms?”.

3.3 Analysis
We followed the content analysis method [12]. Our data set included participants’
answers to open-ended questions, from which themes were extracted across perspectives. The
analysis was conducted by the first author. To increase the trustworthiness of our
qualitative analysis, we followed an iterative process across the research team.
This iterative procedure was implemented to check and confirm that the taxon-
omy is reflected in the responses of the participants. The authors met multiple
times during each stage of the research to ensure the accurate interpretation
of each category and the supporting evidence. These discussions and meetings
resulted in dividing, altering, removing, or adding categories to ensure
that all responses and their settings were adequately represented in the data.

that all responses and their settings were adequately represented in the data.

4 Results

Our results are divided into two parts. We start by discussing the general expla-
nation categories emergent in participants’ answers, which outline the design
space of xSwarm. We then discuss the main challenges faced by our experts to
integrate explainability in swarm systems. Table 2 shows a sample of partici-
pants’ answers. More examples can be found in Appendix A - Table A.2.

Table 2. Explanation categories for swarm robotics. The Explain column gives par-
ticipants' explanations of swarm failures or unexpected behaviour; the Debug column
gives participants' questions to debug swarm failures or unexpected behaviour; Freq.
is the number of times the category appeared in the data set.

Category: Consensus
  Explain (Freq. = 24): “Detecting the missing casualty caused a conflict between
  the agents”; “Agents were not able to agree on some features of the casualty”
  Debug (Freq. = 18): “How much of the swarm is consolidating the same view?”;
  “What is the agreement percentage between agents?”

Category: Path Planning
  Explain (Freq. = 22): “Because the swarm initiative plan is to visit Location L
  which is a charging location”; “Because swarm is trying to avoid obstacles”
  Debug (Freq. = 14): “Is it a collision-free route?”; “Where is the swarm right
  now? [the probability distribution of possible robots locations]”

Category: Communication
  Explain (Freq. = 19): “Because the communication is limited between agents”;
  “Environmental condition is limiting the connection between agents”
  Debug (Freq. = 6): “What is the accuracy of information exchange between
  swarm agents?”; “What is the synchronisation status between swarm agents?”

Category: Scheduling
  Explain (Freq. = 14): “Because this casualty should be detected by another
  cluster”; “Because there are not enough agents to detect all casualties”
  Debug (Freq. = 8): “Why there is x number of agents to search this area?”;
  “Why a cluster c1 is assigned to task t1?”

Category: Hardware
  Explain (Freq. = 16): “Because the casualty is out of sensors coverage distance”;
  “Because of low-quality input data from swarm sensors”
  Debug (Freq. = 5): “What is the communication bandwidth between swarm
  robots?”; “What is the limitation of the swarm sensors?”

Category: Architecture and design
  Explain (Freq. = 5): “Because robots have limited communication links per
  robot”; “Because the swarm has not got enough training in such an environment”
  Debug (Freq. = 6): “What is the response threshold function?”; “How is the
  swarm size selected?”

4.1 Explanation Categories

Consensus. A key distinction between swarms and multi-agent systems is that


the swarm can be seen as a single entity rather than multiple individual robots
with multiple goals and tasks. Swarms are usually equipped with meth-
ods to reach a consensus on the swarm decision, based on the accumulation and
sharing of information about features of the decision encountered by individual


robots. The most frequently asked questions by our participants were not only
about why the swarm made a single decision but also, at a higher level, about
who and how many robots contributed to that decision. Our participants fre-
quently asked questions to check either the percentage of individual robots who
communicated the preferred decision, e.g., P18³ mentioned: “How many robots
detected the casualties?”, and P26 described: “Which features of the casualty did
the robots agree on?”. Furthermore, when participants were asked to provide
reasons for swarm failures, they frequently answered that such a failure could be
related to a conflict between swarm agents. Participants sought information
to further understand the opinion dynamics of the swarm. Opinion dynamics
is a branch of statistical physics that studies the process of agreement in large
populations of interacting agents [24]. Because of these dynamics, a population
of agents might progressively transition from a situation in which they may hold
varying opinions to one in which they all hold the same opinion. Participants
were interested in how the swarm opinion dynamics might change and evolve.
For instance, P20 reflected on the swarm searching for casualties task: “What we
are also interested in is that What are the emerging decisions (fragmentation),
Is there a dominant decision, how dominant is this decision, and how will this
opinion evolve”.

Path Planning. Participants frequently asked questions related to the path


that the swarm is going to follow and its related events. For instance, P2 men-
tioned that remote human operators would frequently check what is the next
state of the swarm. Similarly, P20 added, “automatic planning has not been fully
adopted especially in high-risk domains such this scenario, what the system usu-
ally does to suggest the plan to the human operator to check that path”. Further,
participants were not only interested in validating the swarm plan but also asked
questions related to the path planning algorithm to debug swarm failures, e.g.,
P18 and P7 asked to debug a failure in Fig. 1A: “What metrics does the swarm
use to follow this route?” and “Why did the swarm follow this route, not another
one?”. A similar pattern appeared in participants’ answers when they were asked
to explain the unexpected behaviour in Fig. 1A. Participants explained the unex-
pected behaviour based on path planning events, P9 stated: “Because the swarm
agents are charging their batteries according to the initial plan”, similarly P18
commented: “Probably the swarm has not stopped, it is just avoiding obstacles
right now”. These results support the growing interest in Explainable Plan-
ning (XAIP), as shown by many planning contributions such as Human-Aware
Planning and Model Reconciliation [9] and multi-agent path planning [2]. Such
explanations aim to answer human operators’ questions regarding the path for
several agents or clusters of agents to reach their targets [2], such that paths can
be taken simultaneously without the agents colliding. However, the research on
explainable swarm path-planning is still limited and requires further attention.

³ For anonymity, individual participants are referred to as Pn (e.g. P18) in this paper.
Communication. The collective decision-making process is the outcome of an emergent phenomenon that follows localised interactions among agents yielding
a global information flow. As a consequence, to investigate the dynamics of col-
lective decision-making, participants went beyond debugging consensus between
agents and asked questions related to the communication between swarm agents,
e.g., P7 pointed out the following question: “What is the synchronisation status
between swarm agents?”. Participants also brought up that swarm failure in our
examples could be critically related to a failure in ensuring the appropriate flow
of behavioural information among agents, both in terms of information quantity
and accuracy, e.g., P21 commented on the failure in Fig. 1B: “because the envi-
ronmental condition is limiting the connection between agents” and P18 added:
“Perhaps one of the casualties moved on/no longer is, therefore state was not
updated”.

Scheduling-Based. Task scheduling in swarm robotics is a collective behaviour in which robots distribute themselves over different tasks. The goal is to maximize the performance of the swarm by letting the robots dynamically choose which task to perform. Yet, in many applications, human operators are still required to intervene in the scheduling plan and adapt it accordingly. In our questionnaire, participants’ understanding of a particular swarm failure was sometimes associated with the initial task scheduling. Questions and explanations related to task scheduling were mentioned in both scenarios. For instance, P2 and P21 explained the casualty detection failure (Fig. 1B) with the following explanations: “Because this casualty should be detected by another cluster”, and “Because the initial scheduling plan did not allocate enough robots”. Interestingly, regardless of the failure scenario, our participants repeatedly suggested questions following the pattern proposed by Miller [18]: contrastive explanations, which often ask why another scheduling plan is not feasible. For instance, P19 asked the following question: “Why is a cluster c1 better than cluster c2 in performing task t1?”. Our data also showed a set of questions that require interactive scheduling-based explanations, where the human operator can understand the consequences of the scheduling plan. This pattern is pointed out by Wang et al. [?] as what-if questions; P12 mentioned: “What if cluster c1 were to do task t1 rather than task t2?”.

Hardware. Even though the quality and robustness of hardware are increasing, hardware failures are still quite common. Participants explained failures of the swarm with a link to a hardware problem. For instance, when the swarm was not able to detect all the casualties in the area, P17 commented: “because of faulty in the sensors”. Our participants also discussed that explanations of hardware are necessary to give human operators actionable guidance to either better utilise the swarm or to improve the swarm performance, e.g., by sending more swarm agents to a location. There was also a typical pattern in participants’ feedback of explaining swarm behaviour or failure based on its hardware boundaries and limitations as well as environmental and training data limitations. For instance, P11 explained the unexpected behaviour in Fig. 1A: “Because some agents have limited speed capabilities, so they are waiting for each other”, and similarly P14 explained the failure in Fig. 1B: “the swarm sensors have low image input”.

Architecture and Design. Participants also found many reasons for swarm failures based on the swarm architecture or design decisions of the system. This category did not appear frequently in the data: participants mentioned five explanations and six questions. Participants recognised that potential failure or unexpected behaviour can be related to initial design decisions made by the systems engineers, e.g., P12 answered: “Because the swarm has not got enough training in such an environment” and, similarly, P22 commented: “Because the transition probability between states is fixed”. Participants also took a broader view of swarm explainability when they discussed questions that can be answered through descriptive information about the swarm system design. For instance, P1 and P4 asked: “How is the swarm size selected?” and “What is the design method? Is it behaviour-based or automatic design?”. Currently, these kinds of explanations are discussed in the XAI literature as an on-boarding activity where human operators are introduced to the system design architecture to understand its capabilities and limitations [8]. These global properties of the system could help inform the interpretation of swarm behaviour and predictions during critical decision-making scenarios.

4.2 Challenges Faced by Swarm Experts


In this section, we discuss three main challenges participants faced when design-
ing or developing explanations for swarm robotics.

A Trade-Off Between Explainability and Human-Swarm Performance.


Participants discussed that developing explainable swarms comes with a criti-
cal analysis of performance-explainability trade-offs that swarm developers have
to face. Participants frequently pointed out that generating explanations does
not come without cost; they mentioned that this could be a computational or
performance cost, e.g., interruption to the human-swarm task. P18 mentioned:
“asking the operator to read an explanation at an emergent situation can be
vital and the explanation might delay the reaction time and distract the oper-
ator”. These observations are consistent with earlier research [22] that showed
explanations can be ignored by human operators when they are perceived as a
task goal impediment. It has been shown that humans in Human-AI systems focus on completing the task rather than trying to engage cognitively with AI systems to understand their behaviour [7]. Such an effect can be further amplified in human-swarm environments, where swarm operators might face time constraints and multitasking as well as large numbers of agents to monitor. In summary, participants argued that integrating explainability in the Human-Swarm task is difficult: numerous design decisions require trade-offs involving explainability, workload, and usability.

Increasing the Number of Agents Would Increase the Complexity of Explanation Generation. The xSwarm challenge is significantly more difficult than other XAI areas, since a swarm’s decision depends on numerous parameters linked to numerous agents [?]. Participants pointed out that there is a
scalability problem in swarm explainability, i.e., increasing the number of agents
in a swarm will exponentially increase the complexity of the explanation gener-
ation. P12 mentioned: “explainable swarm can have an extreme cost when the
swarm size is 100”. Similarly, P2 added: “some of the technical reasons that led
to the decision are relevant to a given user; as the number of agents increases, the
non-relevant information increases exponentially”. For these reasons, simplicity
as a usability feature for explanation can be essential in swarm environments.
Simplicity states that explanations generated by AI systems should be selec-
tive and succinct enough to avoid overwhelming the explainee with unnecessary
information [26]. In response to this challenge, participants suggested that future
swarm systems shall include methods and tools to help summarise explanations
generated by swarm agents, e.g., P20 suggested: “Human operators cannot deal
with loads of explanations, there should be some tools to provide a summary of the
swarm decision”. In one relevant study, Mualla et al. [20] addressed this issue and presented a novel explainable swarm architecture with a filtering feature to filter explanations generated from swarm agents. They conducted a user evaluation with three experimental groups: “No explanation: the swarm does not present an explanation”, “Detailed explanation: the swarm presents explanations generated by each agent” and “Filtered explanation: the swarm presents a filtered explanation of each agent”. They showed that the filtered explanation was the most preferred option for participants and led to more trust in and understanding of the swarm behaviour.

The Soundness of Swarm Explanations is Difficult to Achieve. The soundness of an explanation measures how truthful an explanation is with respect to the underlying system state [26]. Participants stated that they faced difficulties in generating explanations that are sound and complete because, in some contexts, information might be unavailable or outdated; P7 mentioned: “A
variety of technical factors like bandwidth limitations and swarm size, an envi-
ronmental condition factors associated with swarms diminish the likelihood of
having perfect communication, which negatively impacts generating explanations
that are complete and sound”. To address this challenge, participants suggested that explanation generation tools in swarm robotics should take into consideration the validity of the available information when generating an explanation for a swarm
action. P16 mentioned: “swarm explanation algorithms should have metrics to
measure the soundness of their explanations before presenting it to operators”.
Participants also suggested that outdated information about the swarm may
result in an outdated explanation which may degrade Human-Swarm perfor-
mance and trust; P4 raised this issue: “Sometimes the swarm explanation might
be not synchronised with the current state of the swarm, in these cases, it might
be better to avoid showing this information to human operators”. Previous empirical studies discussed this issue and showed that erroneous actions taken by the human operator to progress toward the overall goal will accumulate if they are based on outdated swarm explanations or information [27]. Future xSwarm research may need to consider mitigation strategies to break the error-propagation cycle that may emerge from outdated explanations.

5 Conclusion

Although research in eXplainable AI is gaining a lot of interest from industry and academia, research on explainability for swarm robotics is still limited. In this paper, we presented results from an online questionnaire with 26 swarm experts to outline the design space of eXplainable Swarm (xSwarm). In particular, we presented a list of explanations generated by swarm experts to explain swarm behaviours or failures, and the kinds of questions they ask to debug the swarm. Our work provides a concrete taxonomy of what xSwarm would look like from swarm experts’ perspective, serving as an initial artefact to outline the design space of xSwarm. Our work suggests opportunities for the HCI and swarm robotics communities, as well as industry practitioners and academics, to work together to advance the field of xSwarm through translational work and a shared knowledge repository that represents an xSwarm design space. Finally, we cannot claim this is a complete analysis of the xSwarm design space. The results only reflect swarm experts’ views elicited through an online questionnaire. Future work could use our taxonomy as a study probe for in-depth investigation with multiple stakeholders in swarm systems to further enhance our taxonomy.

Acknowledgments. This work was supported by the Engineering and Physical Sci-
ences Research Council (EP/V00784X/1).

References
1. Agrawal, A., Cleland-Huang, J.: Explaining autonomous decisions in swarms of
human-on-the-loop small unmanned aerial systems. In: Proceedings of the AAAI
Conference on Human Computation and Crowdsourcing, vol. 9, pp. 15–26 (2021)
2. Almagor, S., Lahijanian, M.: Explainable multi agent path finding. In: AAMAS
(2020)
3. Anjomshoae, S., Najjar, A., Calvaresi, D., Främling, K.: Explainable agents and
robots: results from a systematic literature review. In: 18th International Con-
ference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Mon-
treal, Canada, 13–17 May 2019, pp. 1078–1088. International Foundation for
Autonomous Agents and Multiagent Systems (2019)
4. Atkinson, K., Bench-Capon, T., Bollegala, D.: Explanation in AI and law: past,
present and future. Artif. Intell. 289, 103387 (2020)
5. Brambilla, M., Ferrante, E., Birattari, M., Dorigo, M.: Swarm robotics: a review
from the swarm engineering perspective. Swarm Intell. 7(1), 1–41 (2013)

6. Brandao, M., Canal, G., Krivić, S., Luff, P., Coles, A.: How experts explain motion
planner output: a preliminary user-study to inform the design of explainable plan-
ners. In: 2021 30th IEEE International Conference on Robot & Human Interactive
Communication (RO-MAN), pp. 299–306. IEEE (2021)
7. Buçinca, Z., Malaya, M.B., Gajos, K.Z.: To trust or to think: cognitive forcing
functions can reduce overreliance on AI in AI-assisted decision-making. Proc. ACM
Hum.-Comput. Interact. 5(CSCW1), 1–21 (2021)
8. Cai, C.J., Winter, S., Steiner, D., Wilcox, L., Terry, M.: “Hello AI”: uncovering the
onboarding needs of medical practitioners for human-AI collaborative decision-
making. Proc. ACM Hum.-Comput. Interact. 3(CSCW), 1–24 (2019)
9. Chakraborti, T., Sreedharan, S., Zhang, Y., Kambhampati, S.: Plan explanations
as model reconciliation: moving beyond explanation as soliloquy. arXiv preprint
arXiv:1701.08317 (2017)
10. Clark, J.R., et al.: Industry led use-case development for human-swarm operations.
arXiv preprint arXiv:2207.09543 (2022)
11. Divband Soorati, M., Clark, J., Ghofrani, J., Tarapore, D., Ramchurn, S.D.:
Designing a user-centered interaction interface for human-swarm teaming. Drones
5(4), 131 (2021)
12. Elo, S., Kyngäs, H.: The qualitative content analysis process. J. Adv. Nurs. 62(1),
107–115 (2008)
13. Klein, G.A., Calderwood, R., Macgregor, D.: Critical decision method for eliciting
knowledge. IEEE Trans. Syst. Man Cybern. 19(3), 462–472 (1989)
14. Kraus, S., et al.: AI for explaining decisions in multi-agent environments. In: Pro-
ceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13534–13538
(2020)
15. Liao, Q.V., Gruen, D., Miller, S.: Questioning the AI: informing design practices
for explainable AI user experiences. In: Proceedings of the 2020 CHI Conference
on Human Factors in Computing Systems, pp. 1–15 (2020)
16. Lundberg, S.M., et al.: From local explanations to global understanding with
explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
17. McGuirk, P.M., O’Neill, P.: Using questionnaires in qualitative human geography
(2016)
18. Miller, T.: Explanation in artificial intelligence: insights from the social sciences.
Artif. Intell. 267, 1–38 (2019)
19. Morgan, M.G.: Use (and abuse) of expert elicitation in support of decision making
for public policy. Proc. Natl. Acad. Sci. 111(20), 7176–7184 (2014)
20. Mualla, Y., et al.: The quest of parsimonious XAI: a human-agent architecture for
explanation formulation. Artif. Intell. 302, 103573 (2022)
21. Mueller, S.T., Hoffman, R.R., Clancey, W., Emrey, A., Klein, G.: Explanation in
human-AI systems: a literature meta-review, synopsis of key ideas and publications,
and bibliography for explainable AI. arXiv preprint arXiv:1902.01876 (2019)
22. Naiseh, M., Cemiloglu, D., Al Thani, D., Jiang, N., Ali, R.: Explainable recommen-
dations and calibrated trust: two systematic user errors. Computer 54(10), 28–37
(2021)
23. Naiseh, M., Jiang, N., Ma, J., Ali, R.: Personalising explainable recommendations:
literature and conceptualisation. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo,
S., Orovic, I., Moreira, F. (eds.) WorldCIST 2020. AISC, vol. 1160, pp. 518–533.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45691-7_49
24. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell.
1(1), 33–57 (2007)

25. Roundtree, K.A., Goodrich, M.A., Adams, J.A.: Transparency: transitioning from
human-machine systems to human-swarm systems. J. Cogn. Eng. Decis. Making
13(3), 171–195 (2019)
26. Sokol, K., Flach, P.: Explainability fact sheets: a framework for systematic assess-
ment of explainable approaches. In: Proceedings of the 2020 Conference on Fair-
ness, Accountability, and Transparency, pp. 56–67 (2020)
27. Walker, P., Nunnally, S., Lewis, M., Kolling, A., Chakraborty, N., Sycara, K.:
Neglect benevolence in human control of swarms in the presence of latency. In:
2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC),
pp. 3009–3014. IEEE (2012)
VMAS: A Vectorized Multi-agent
Simulator for Collective Robot Learning

Matteo Bettini(B) , Ryan Kortvelesy, Jan Blumenkamp, and Amanda Prorok

Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
{mb2389,rk627,jb2270,asp45}@cl.cam.ac.uk

Abstract. While many multi-robot coordination problems can be solved optimally by exact algorithms, solutions are often not scalable
in the number of robots. Multi-Agent Reinforcement Learning (MARL)
is gaining increasing attention in the robotics community as a promis-
ing solution to tackle such problems. Nevertheless, we still lack the
tools that allow us to quickly and efficiently find solutions to large-
scale collective learning tasks. In this work, we introduce the Vectorized
Multi-Agent Simulator (VMAS). VMAS is an open-source framework
designed for efficient MARL benchmarking. It is comprised of a vec-
torized 2D physics engine written in PyTorch and a set of twelve chal-
lenging multi-robot scenarios. Additional scenarios can be implemented
through a simple and modular interface. We demonstrate how vectoriza-
tion enables parallel simulation on accelerated hardware without added
complexity. When comparing VMAS to OpenAI MPE, we show how
MPE’s execution time increases linearly in the number of simulations
while VMAS is able to execute 30,000 parallel simulations in under
10 s, proving more than 100× faster. Using VMAS’s RLlib interface,
we benchmark our multi-robot scenarios using various Proximal Policy
Optimization (PPO)-based MARL algorithms. VMAS’s scenarios prove
challenging in orthogonal ways for state-of-the-art MARL algorithms.
The VMAS framework is available at: https://github.com/proroklab/
VectorizedMultiAgentSimulator. A video of VMAS scenarios and exper-
iments is available at: https://youtu.be/aaDRYfiesAY.
Keywords: simulator · multi-robot learning · vectorization

1 Introduction
Many real-world problems require coordination of multiple robots to be solved.
However, coordination problems are commonly computationally hard. Examples
include path-planning [14], task assignment [25], and area coverage [39]. While
exact solutions exist, their complexity grows exponentially in the number of
robots [3]. Metaheuristics [6] provide fast and scalable solutions, but lack optimality. Multi-Agent Reinforcement Learning (MARL) can be used as a scal-
able approach to find near-optimal solutions to these problems [34]. In MARL,
agents trained in simulation collect experiences by interacting with the environ-
ment, and train their policies (typically represented with deep neural networks)
through a reward signal.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 42–56, 2024.
https://doi.org/10.1007/978-3-031-51497-5_4

Fig. 1. Multi-robot scenarios introduced in VMAS. Robots (blue shapes) interact among each other and with landmarks (green, red, and black shapes) to solve a task.

However, current MARL approaches present several issues. Firstly, the train-
ing phase can require significant time to converge to optimal behavior. This is
partially due to the sample efficiency of the algorithm, and partially to the
computational complexity of the simulator. Secondly, current benchmarks are
specific to a predefined task and mostly tackle unrealistic videogame-like sce-
narios [28,31], far from real-world multi-robot problems. This makes research in
this area fragmented, with a new simulation framework being implemented for
each new task introduced. Multi-robot simulators, on the other hand, prove to
be more general, but their high fidelity and full-stack simulation results in slow
performance, preventing their applicability to MARL. Full-stack learning can
significantly hinder training performance. Learning can be made more sample-
efficient if simulation is used to solve high-level multi-robot coordination prob-
lems, while leaving low-level robotic control to first-principles-based methods.
Motivated by these reasons, we introduce VMAS, a vectorized multi-agent
simulator. VMAS is a vectorized 2D physics simulator written in PyTorch [22],
designed for efficient MARL benchmarking. It simulates agents and landmarks
of different shapes and supports torque, elastic collisions and custom gravity.
Holonomic motion models are used for the agents to simplify simulation. Vec-
torization in PyTorch allows VMAS to perform simulations in a batch, seamlessly
scaling to tens of thousands of parallel environments on accelerated hardware.
With the term GPU vectorization we refer to the Single Instruction Multiple
Data (SIMD) execution paradigm available inside a GPU warp. This paradigm
permits executing the same instruction on a batch of parallel simulations. VMAS has an interface compatible with OpenAI Gym [7] and with the
RLlib library [15], enabling out-of-the-box integration with a wide range of RL
algorithms. VMAS also provides a framework to easily implement custom multi-
robot scenarios. Using this framework, we introduce a set of 12 multi-robot
scenarios representing difficult learning problems. Additional scenarios can be
implemented through a simple and modular interface. We vectorize and port all
scenarios from OpenAI MPE [16] into VMAS. We benchmark four of VMAS’s new
scenarios using three MARL algorithms based on Proximal Policy Optimization
(PPO) [29]. We show the benefits of vectorization by benchmarking our scenar-
ios in the RLlib [15] library. Our scenarios prove to challenge state-of-the-art
MARL algorithms in complementary ways.
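To make the SISD/SIMD contrast described above concrete, the following schematic sketch (ours, not VMAS source code) updates the positions of all parallel environments with one batched tensor instruction instead of a per-environment Python loop; the tensor shapes are illustrative.

```python
import torch

num_envs, num_agents = 30_000, 4
device = "cuda" if torch.cuda.is_available() else "cpu"
pos = torch.zeros(num_envs, num_agents, 2, device=device)  # positions
vel = torch.rand_like(pos)                                 # velocities
dt = 0.1

# SISD style: one Python-level update per environment (slow):
# for e in range(num_envs):
#     pos[e] += vel[e] * dt

# SIMD style: a single batched instruction updates every environment
# at once; on a GPU this maps onto parallel execution within warps.
pos += vel * dt
```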
Contributions. We now list the main contributions of this work:
– We introduce the VMAS framework. A vectorized multi-agent simulator
which enables MARL training at scale. VMAS supports inter-agent com-
munication and customizable sensors, such as LIDARs.
– We implement a set of twelve multi-robot scenarios in VMAS, which focus on
testing different collective learning challenges including: behavioural hetero-
geneity, coordination through communication, and adversarial interaction.
– We port and vectorize all scenarios from OpenAI MPE [16] into VMAS and
run a performance comparison between the two simulators. We demonstrate
the benefits of vectorization in terms of simulation speed, showing that VMAS
is up to 100× faster than MPE.
The VMAS codebase is available at: https://github.com/proroklab/VectorizedMultiAgentSimulator.

2 Related Work
In this section, we review the related literature in the fields of multi-agent and
multi-robot simulation, highlighting the core gaps of each field. Furthermore, we
compare the most relevant simulation frameworks with VMAS in Table 1.
Multi-agent Reinforcement Learning Environments. A significant
amount of work exists in the context of MARL to address the issues of
multi-robot simulation for learning hard coordination strategies. Realistic GPU-
accelerated simulators and engines have been proposed. Isaac [17] is a proprietary
NVIDIA simulator used for realistic robotic simulation in reinforcement learning.
Instead of using environment vectorization to accelerate learning, it uses concur-
rent execution of multiple training environments in the same simulation instance.
Despite this, its high-fidelity simulation makes it computationally expensive
for high-level MARL problems. Brax [8] is a vectorized 3D physics engine intro-
duced by Google. It uses the Jax [5] library to achieve environment batching
and full-differentiability. However, computational issues occur when scaling the
number of simulated agents, leading to stalled environments with just 20 agents.
There also exist projects for single-agent vectorized environments [13,35], but
the complexity of extending these to the multi-agent domain is non-trivial.
The core benchmark environments of the MARL literature focus on high-level
inter-robot learning. Multiagent Particle Environments (MPE) [16] are a set of
enviroments created by OpenAI. They share VMAS’s principles of modularity
and ease of new scenario creation, without providing environment vectoriza-
tion. MAgent [38] is a discrete-world environment supporting a high number of
agents. Multi-Agent-Learning-Environments [10] is another simplified discrete-
world set of environments with a range of different multi-robot tasks. Multi-
Agent-Emergence-Environments [2] is a customizable OpenAI 3D simulator for
hide-and-seek style games. Pommerman [26] is a discretized playground for learn-
ing multi-agent competitive strategies. SMAC [28] is a very popular MARL
benchmark based on the Starcraft 2 videogame. Neural-MMO [31] is another
videogame-like set of environments where agents learn to survive in large pop-
ulations. Google Research Football [12] is a football simulation with a suite of
scenarios that test different aspects of the game. Gym-pybullet-drones [21] is
a realistic PyBullet simulator for multi-quadcopter control. Particle Robots
Simulator [30] is a simulator for particle robots, which require high coordination
strategies to overcome actuation limitations and achieve high-level tasks. Multi-
Agent Mujoco [23] consists of multiple agents controlling different body parts
of a single Mujoco [32] agent. While all these environments provide interesting
MARL benchmarks, most of them focus on specific tasks. Furthermore, none of
these environments provide GPU vectorization, which is key for efficient MARL
training. We present a comparison between VMAS and all the aforementioned
environments in Table 1.
Multi-robot Simulators. Video-game physics engines such as Unity and Unreal Engine provide realistic simulation that can be leveraged for multi-agent robotics.
Both make use of the GPU-accelerated NVIDIA PhysX. However, their general-
ity causes high overheads when using them for robotics research. Other popular
physics engines are Bullet, Chipmunk, Box2D, and ODE. These engines are all
similar in their capabilities and prove easier to adopt due to the availability
of Python APIs. Thus, they are often the tool of choice for realistic robotic
simulation. However, because they do not leverage GPU-accelerated batched
simulation, these tools lead to performance bottlenecks in MARL training.
The most widely known robotic simulators are Gazebo [11] and Webots [18].
Their engines are based on the ODE 3D dynamics library. These simulators
support a wide range of robot models, sensors, and actuators, but suffer from
significant performance loss when scaling in the number of agents. Complete
simulation stall is shown to occur with as few as 12 robots [20]. For this reason,
Argos [24] has been proposed as a scalable multi-robot simulator. It is able to

Table 1. Comparison of multi-agent and multi-robot simulators and environments.

Vector^a State^b Comm^c Action^d PhysEng^e #Agents^f Gen^g Ext^h MRob^i MARL^j RLlib^k
Brax [8] ✓ C ✗ C 3D < 10 ✓ ✓ ✗ ✗ ✗
MPE [16] ✗ C C+D C+D 2D < 100 ✓ ✓ ✗ ✓ ✓
MAgent [38] ✗ D ✗ D ✗ > 1000 ✗ ✗ ✗ ✓ ✓
MA-Learning-Environments [10] ✗ D ✗ D ✗ < 10 ✓ ✗ ✓ ✓ ✗
MA-Emergence-Environments [2] ✗ C ✗ C+D 3D < 10 ✗ ✗ ✗ ✓ ✗
Pommerman [26] ✗ D ✗ D ✗ < 10 ✗ ✗ ✗ ✓ ✗
SMAC [28] ✗ C ✗ D ✗ < 100 ✗ ✓ ✗ ✓ ✓
Neural-MMO [31] ✗ C ✗ C+D ✗ < 1000 ✗ ✓ ✗ ✓ ✓
Google research football [12] ✗ C ✗ D 2D < 100 ✗ ✓ ✗ ✓ ✓
gym-pybullet-drones [21] ✗ C ✗ C 3D < 100 ✗ ✓ ✓ ✓ ✓
Particle robots simulator [30] ✗ C ✗ C+D 2D < 100 ✗ ✓ ✓ ✓ ✗
MAMujoco [23] ✗ C ✗ C 3D < 10 ✗ ✗ ✗ ✓ ✗
Gazebo [11] ✗ C C+D C+D 3D < 10 ✓ ✓ ✓ ✗ ✗
Webots [18] ✗ C C+D C+D 3D < 10 ✓ ✓ ✓ ✗ ✗
ARGOS [24] ✗ C C+D C+D 2D&3D < 1000 ✓ ✓ ✓ ✗ ✗
VMAS ✓ C C+D C+D 2D < 100 ✓ ✓ ✓ ✓ ✓
a Vectorized
b Continuous state (C) or discrete state/grid world (D)
c Continuous communication (C) or discrete communication (D) inside the simulator
d Continuous actions (C) or discrete actions (D)
e Type of physics engine
f Number of agents supported
g General purpose simulator: any type of task can be created
h Extensibility (API for creating new scenarios)
i Contains multi-robot tasks
j Made for Multi-Agent Reinforcement Learning (MARL)
k Compatible with RLlib framework [15]

simulate swarms in the thousands of agents by assigning parts of the simulation space to different physics engines with different simulation goals and fidelity.
Furthermore, it uses CPU parallelization through multi-threading. Despite these
features, none of the simulators described are fast enough to be usable in MARL
training. This is because they prioritize realistic full-stack multi-robot simulation
over speed, and they do not leverage GPU acceleration for parallel simulations.
This focus on realism is not always necessary in MARL. In fact, most collective
coordination problems can be decoupled from low-level problems relating to
sensing and control. When these problems can be efficiently solved independently
without loss of generality, fast high-level simulation provides an important tool.
This insight is the key factor motivating the holonomicity assumption in VMAS.

3 The VMAS Platform


The unique characteristic that makes VMAS different from the related works
compared in Table 1 is the fact that our platform brings together multi-agent
learning and environment vectorization. Vectorization is a key component for speeding up MARL training. In fact, an on-policy training iteration (we illustrate an on-policy iteration here, but simulation is a key component of any type of MARL algorithm) is comprised of simulated team rollouts and a policy update. During the rollout phase of iteration $k$, simulations are performed to collect experiences from the agents' interactions with the environment according to their policy $\pi_k$. The collected experiences are then used to update the team policy. The new policy $\pi_{k+1}$ will be employed in the rollout phase of the next training iteration. The rollout phase usually constitutes the bottleneck of this process. Vectorization allows parallel simulation and helps alleviate this issue.
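To fix ideas, the following runnable sketch (ours; the environment and policy are toy placeholders, not the VMAS or RLlib APIs) spells out this rollout-then-update loop over a batch of environments:

```python
import torch

num_envs, num_agents, obs_dim, act_dim = 32, 3, 8, 2

class DummyVecEnv:
    """Stand-in for a vectorized simulator (a placeholder, not VMAS)."""
    def reset(self):
        return torch.zeros(num_envs, num_agents, obs_dim)

    def step(self, actions):
        next_obs = torch.randn(num_envs, num_agents, obs_dim)
        rewards = torch.zeros(num_envs, num_agents)
        dones = torch.zeros(num_envs, dtype=torch.bool)
        return next_obs, rewards, dones, {}

policy = torch.nn.Linear(obs_dim, act_dim)   # toy stand-in for pi_k
env = DummyVecEnv()

obs = env.reset()
for k in range(3):                            # training iterations
    rollout = []
    for _ in range(10):                       # rollout phase, batched over envs
        with torch.no_grad():
            actions = policy(obs)             # one forward pass covers all envs
        next_obs, rewards, dones, _ = env.step(actions)
        rollout.append((obs, actions, rewards))
        obs = next_obs
    # The policy-update phase would consume `rollout` here, yielding pi_{k+1}.
```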
Inspired by the modularity of some existing solutions, like MPE [16], we cre-
ated our framework as a new scalable platform for running and creating MARL
benchmarks. With this goal in mind, we developed VMAS following a set of
tenets:
– Vectorized. VMAS vectorization can step any number of environments in
parallel. This significantly reduces the time needed to collect rollouts for train-
ing in MARL.
– Simple. Complex vectorized physics engines exist (e.g., Brax [8]), but they
do not scale efficiently when dealing with multiple agents. This defeats the
computational speed goal set by vectorization. VMAS uses a simple custom
2D dynamics engine written in PyTorch to provide fast simulation.
– General. The core of VMAS is structured so that it can be used to implement
general high-level multi-robot problems in 2D. It can support adversarial as
well as cooperative scenarios. Holonomic robot simulation shifts focus to high-
level coordination, obviating the need to learn low-level controls using MARL.
– Extensible. VMAS is not just a simulator with a set of environments. It
is a framework that can be used to create new multi-agent scenarios in a
format that is usable by the whole MARL community. For this purpose, we
have modularized our framework to enable new task creation and introduced
interactive rendering to debug scenarios.
– Compatible. VMAS has multiple wrappers which make it directly compati-
ble with different MARL interfaces, including RLlib [15] and Gym [7]. RLlib
has a large number of already implemented RL algorithms.
Let us break down VMAS’s structure in depth.
Interface. The structure of VMAS is illustrated in Fig. 2. It has a vector-
ized interface, which means that an arbitrary number of environments can be
stepped in parallel in a batch. In Sec. 5, we demonstrate how vectorization grants
important speed-ups on the CPU and seamless scaling on the GPU. While the
standard simulator interface uses PyTorch [22] to enable feeding tensors directly
as input/output, we provide wrappers for the standard non-vectorized OpenAI
Gym [7] interface and for the vectorized interface of the RLlib [15] framework.
This enables users to effortlessly access the range of RL training algorithms
already available in RLlib. Actions for all environments and agents are fed to
VMAS for every simulation step. VMAS supports movement and inter-agent
communication actions, both of which can be either continuous or discrete. The
interface of VMAS provides rendering through Pyglet [1].
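For illustration, a representative interaction with the vectorized interface might look as follows; the `vmas.make_env` entry point and keyword names reflect the public repository at the time of writing, while the agent count and action dimensionality are assumptions and should be checked against the chosen scenario.

```python
import torch
import vmas  # https://github.com/proroklab/VectorizedMultiAgentSimulator

num_envs, n_agents, act_dim = 4096, 4, 2  # assumed scenario configuration
device = "cuda" if torch.cuda.is_available() else "cpu"

env = vmas.make_env(
    scenario="transport",      # one of the scenarios of Sec. 4
    num_envs=num_envs,         # environments stepped in a single batch
    device=device,             # tensors stay on this device end to end
    continuous_actions=True,
)
obs = env.reset()              # per-agent observations, batched over envs
for _ in range(100):
    # One random force tensor per agent, batched over all environments
    actions = [torch.rand(num_envs, act_dim, device=device) * 2 - 1
               for _ in range(n_agents)]
    obs, rewards, dones, info = env.step(actions)
```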
Scenario. Scenarios encode the multi-agent task that the team is trying to
solve. Custom scenarios can be implemented in a few hours and debugged using

Fig. 2. VMAS structure. VMAS has a vectorized MARL interface (left) with wrappers
for compatibility with OpenAI Gym [7] and the RLlib RL library [15]. The default
VMAS interface uses PyTorch [22] and can be used for feeding input already on the
GPU. Multi-agent tasks in VMAS are defined as scenarios (center). To define a scenario,
it is sufficient to implement the listed functions. Scenarios access the VMAS core (right),
where agents and landmarks are simulated in the world using a 2D custom written
physics module.

interactive rendering. Interactive rendering is a feature where agents in scenarios can be controlled by users in a videogame-like fashion and all environment-
related data is printed on screen. To implement a scenario, it is sufficient to define
a few functions: make_world creates the agents and landmarks for the scenario
and spawns them in the world, reset_world_at resets a specific environment in
the batch or all environments at the same time, reward returns the reward for
one agent for all environments, observation returns the agent’s observations
for all environments. Optionally, done and info can be implemented to provide
an ending condition and extra information. Further documentation on how to
create new scenarios is available in the repository and in the code.
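A schematic scenario following these functions is sketched below; the base class and import paths mirror the repository layout at the time of writing and should be verified against the current code. The reward and observation shown are illustrative choices, not a scenario shipped with VMAS.

```python
import torch
from vmas.simulator.core import Agent, Landmark, World
from vmas.simulator.scenario import BaseScenario

class MyScenario(BaseScenario):
    def make_world(self, batch_dim: int, device: torch.device, **kwargs) -> World:
        # Create the batched world and spawn agents and one goal landmark.
        world = World(batch_dim, device)
        for i in range(kwargs.get("n_agents", 3)):
            world.add_agent(Agent(name=f"agent_{i}"))
        self.goal = Landmark(name="goal", collide=False)
        world.add_landmark(self.goal)
        return world

    def reset_world_at(self, env_index=None):
        # Re-spawn entities in one environment of the batch (or in all
        # environments when env_index is None); omitted in this skeleton.
        ...

    def reward(self, agent: Agent):
        # One scalar per environment: negative distance to the goal.
        return -torch.linalg.vector_norm(
            agent.state.pos - self.goal.state.pos, dim=-1
        )

    def observation(self, agent: Agent):
        # Batched observation: own state plus relative goal position.
        return torch.cat(
            [agent.state.pos, agent.state.vel,
             self.goal.state.pos - agent.state.pos], dim=-1
        )
```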
Core. Scenarios interact with the core. This is where the world simulation is stepped. The world contains $n$ entities, which can be agents or landmarks. Entities have a shape (sphere, box, or line) and a vectorized state $(x_i, \dot{x}_i, \theta_i, \dot{\theta}_i),\ \forall i \in [1..n] \equiv N$, which contains their position $x_i \in \mathbb{R}^2$, velocity $\dot{x}_i \in \mathbb{R}^2$, rotation $\theta_i \in \mathbb{R}$, and angular velocity $\dot{\theta}_i \in \mathbb{R}$ for all environments. Entities have a mass $m_i \in \mathbb{R}$ and a maximum speed and can be customized to be movable, rotatable, and collidable. Agents' actions consist of physical actions, represented as forces $f_i^a \in \mathbb{R}^2$, and optional communication actions. In the current state of the simulator, agents cannot control their orientation. Agents can either be controlled from the interface or by an “action script” defined in the scenario. Optionally, the simulator can introduce noise to the actions and observations. Custom sensors can be added to agents. We currently support LIDARs. The world has a simulation step $\delta t$, velocity damping coefficient $\zeta$, and customizable gravity $g \in \mathbb{R}^2$.

VMAS has a force-based physics engine. Therefore, the simulation step uses the forces at time $t$ to integrate the state by using a semi-implicit Euler method [19]:

$$
\begin{cases}
f_i(t) = f_i^{a}(t) + f_i^{g} + \sum_{j \in N \setminus \{i\}} f_{ij}^{e}(t) \\
\dot{x}_i(t+1) = (1-\zeta)\,\dot{x}_i(t) + \dfrac{f_i(t)}{m_i}\,\delta t \\
x_i(t+1) = x_i(t) + \dot{x}_i(t+1)\,\delta t
\end{cases}
\tag{1}
$$


where $f_i^{a}$ is the agent action force, $f_i^{g} = m_i g$ is the force deriving from gravity, and $f_{ij}^{e}$ is the environmental force used to simulate collisions between entities $i$ and $j$. It has the following form:

$$
f_{ij}^{e}(t) =
\begin{cases}
c \, \dfrac{x_{ij}(t)}{\lVert x_{ij}(t) \rVert} \, k \log\!\left(1 + e^{-\left(\lVert x_{ij}(t) \rVert - d_{\min}\right)/k}\right) & \text{if } \lVert x_{ij}(t) \rVert \le d_{\min} \\
0 & \text{otherwise}
\end{cases}
\tag{2}
$$

Here, $c$ is a parameter regulating the force intensity. $x_{ij}$ is the relative position between the closest points on the shapes of the two entities. $d_{\min}$ is the minimum distance allowable between them. The term inside the logarithm computes a scalar proportional to the penetration of the two entities, parameterized by a coefficient $k$. This term is then multiplied by the normalized relative position vector. Collision intensity and penetration can be tuned by regulating $c$ and $k$. This is the same collision system used in OpenAI MPE [16].
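The sketch below shows one way to implement Eq. (2) in a vectorized fashion; it reflects our reading of the reconstructed formula (following the MPE-style contact model), and the constants are illustrative rather than VMAS defaults.

```python
import torch

def collision_force(pos_i, pos_j, d_min=0.05, c=1.0, k=0.25):
    """Batched soft-contact force, our reading of Eq. (2).
    pos_i, pos_j: (num_envs, 2) closest points on the two shapes."""
    x_ij = pos_i - pos_j                                        # relative position
    dist = torch.linalg.vector_norm(
        x_ij, dim=-1, keepdim=True).clamp_min(1e-8)             # avoid divide-by-zero
    # Softplus-shaped penetration term, parameterized by k
    penetration = k * torch.log1p(torch.exp(-(dist - d_min) / k))
    force = c * (x_ij / dist) * penetration
    # Zero out entity pairs that are not in contact
    return torch.where(dist <= d_min, force, torch.zeros_like(force))
```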
The simulation step used for the linear state is also applied to the angular
state:
$$
\begin{cases}
\tau_i(t) = \sum_{j \in N \setminus \{i\}} r_{ij}(t) \times f_{ij}^{e}(t) \\
\dot{\theta}_i(t+1) = (1-\zeta)\,\dot{\theta}_i(t) + \dfrac{\tau_i(t)}{I_i}\,\delta t \\
\theta_i(t+1) = \theta_i(t) + \dot{\theta}_i(t+1)\,\delta t
\end{cases}
\tag{3}
$$


Here, $r_{ij} \in \mathbb{R}^2$ is the vector from the center of the entity to the colliding point, $\tau_i$ is the torque, and $I_i$ is the moment of inertia of the entity. The rules regulating the physics simulation in the core are basic 2D dynamics implemented in a vectorized manner using PyTorch. They simulate holonomic (unconstrained motion) entities only.

4 Multi-robot Scenarios
Alongside VMAS, we introduce a set of 12 multi-robot scenarios. These scenar-
ios contain various multi-robot problems, which require complex coordination—
like leveraging heterogeneous behaviour and inter-agent communication—to be
solved. While the ability to send communication actions is not used in these
scenarios, communication can be used in the policy to improve performance.
For example, Graph Neural Networks (GNNs) can be used to overcome partial
observability through information sharing [4].

Each scenario delimits the agents’ input by defining the set of their obser-
vations. This set typically contains the minimum observation needed to solve
the task (e.g., position, velocity, sensory input, goal position). Scenarios can be
made arbitrarily harder or easier by modifying these observations. For example,
if the agents are trying to transport a package, the precise relative distance to
the package can be removed from the agent inputs and replaced with LIDAR
measurements. Removing global observations from a scenario is a good incentive
for inter-agent communication.
All tasks contain numerous parametrizable components. Every scenario
comes with a set of tests, which run a local heuristic on all agents. Furthermore,
we vectorize and port all 9 scenarios from MPE [16] to VMAS. In this section,
we give a brief overview of our new scenarios. For more details (e.g., observation space, reward), in-depth descriptions can be found in the VMAS repository.
Transport (Fig. 1a). N agents have to push M packages to a goal. Packages
have a customizable mass and shape. Single agents are not able to move a high-
mass package by themselves. Cooperation with teammates is thus needed to
solve the task.
Wheel (Fig. 1b). N agents have to collectively rotate a line. The line is
anchored to the origin and has a parametrizable mass and length. The team’s
goal is to bring the line to a desired angular velocity. Lines with a high mass are impossible for single agents to push. Therefore, the team has to organize with
agents on both sides to increase and reduce the line’s velocity.
Balance (Fig. 1c). N agents are spawned at the bottom of a world with ver-
tical gravity. A line is spawned on top of them. The agents have to transport a
spherical package, positioned randomly on top of the line, to a given goal at the
top. The package has a parametrizable mass and the line can rotate.
Give Way (Fig. 1d). Two agents start in front of each other’s goals in a
symmetric environment. To solve the task, one agent has to give way to the
other by using a narrow space in the middle of the environment.
Football (Fig. 1e). A team of N blue agents competes against a team of M
red agents to score a goal. By default, red agents are controlled by a heuristic
AI, but self-play is also possible. Cooperation among teammates is required to
coordinate attacking and defensive maneuvers. Agents need to communicate and
assume different behavioural roles in order to solve the task.
Passage (Fig. 1f ). 5 agents, starting in a cross formation, have to reproduce
the same formation on the other side of a barrier. The barrier has M passages
(M = 1 in the figure). Agents are penalized for colliding amongst each other and
with the barrier. This scenario is a generalization of the one considered in [4].
Reverse Transport (Fig. 1g). This task is the same as Transport, except only
one package is present. Agents are spawned inside of it and need to push it to
the goal.
Dispersion (Fig. 1h). There are N agents and N food particles. Agents start
in the same position and need to cooperatively eat all food. Most MARL algo-
rithms cannot solve this task (without communication or observations from
other agents) as they are constrained by behavioural homogeneity deriving from
parameter sharing. Heterogeneous behaviour is thus needed for each agent to


tackle a different food particle.
Dropout (Fig. 1i). N agents have to collectively reach one goal. To complete
the task, it is enough for only one agent to reach the goal. The team receives
an energy penalty proportional to the sum of all the agents’ controls. Therefore,
agents need to organize themselves to send only the closest robot to the goal,
saving as much energy as possible.
Flocking (Fig. 1j). N agents have to flock around a target without col-
liding among each other and M obstacles. Flocking has been an important
benchmark in multi-robot coordination for years, with first solutions simulating
behaviour according to local rules [27], and more recent work using learning-
based approaches [33]. In contrast to related work, our flocking environment
contains static obstacles.
Discovery (Fig. 1k). N agents have to coordinate to cover M targets as quickly as possible while avoiding collisions. A target is considered covered if K agents have approached it within a distance D (a vectorized check of this rule is sketched after these scenario descriptions). After a target is covered, the K covering agents each receive a reward and the target is re-spawned at a random position. This scenario is a variation of the Stick Pulling Experiment [9] and while it can be solved without communication, it has been shown that communication significantly improves performance for N < M.
Waterfall (Fig. 1l). N agents move from top to bottom through a series of
obstacles. This is a testing scenario that can be used to discover VMAS’s func-
tionalities.
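As referenced above, a batched check of the Discovery covering rule might look as follows; this is our sketch with assumed tensor shapes, not VMAS code:

```python
import torch

def covered_targets(agent_pos, target_pos, D=0.2, K=2):
    """Vectorized Discovery covering rule: a target counts as covered
    when at least K agents are within distance D of it.
    agent_pos: (num_envs, N, 2); target_pos: (num_envs, M, 2)."""
    dists = torch.cdist(agent_pos, target_pos)   # (num_envs, N, M)
    close_enough = (dists <= D).sum(dim=1)       # agents near each target
    return close_enough >= K                     # (num_envs, M) boolean mask
```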

5 Comparison with MPE


In this section, we compare the scalability of VMAS and MPE [16]. Given that
we vectorize and port all the MPE scenarios in VMAS, we can compare the two
simulators on the same MPE task. The task chosen is “simple spread”, as it con-
tains multiple collidable agents in the same environment. VMAS and MPE use
two completely different execution paradigms: VMAS, being vectorized, lever-
ages the Single Instruction Multiple Data (SIMD) paradigm, while MPE uses
the Single Instruction Single Data (SISD) paradigm. Therefore, it is sufficient to
report the benefits of this paradigm shift on only one task, as the benefits are
task-independent.
In Fig. 3, we can see the growth in execution time with respect to the number
of environments stepped in parallel for the two simulators. MPE runs only on
the CPU, while VMAS, using PyTorch, runs both on the CPU and on the GPU.
In this experiment, we compare the two simulators on an Intel(R) Xeon(R) Gold
6248R CPU @ 3.00GHz and we also run VMAS on an NVIDIA GeForce RTX
2080 Ti. The results show the impact of vectorization on simulation speed. On the CPU, VMAS is up to 5× faster than MPE. On the GPU, the simulation time for VMAS is independent of the number of environments, and runs up to 100× faster. The same results can be reproduced on different hardware. In the VMAS repository we provide a script to repeat this experiment.

6 Experiments and Benchmarks


We run a set of training experiments to benchmark the performance of MARL
algorithms on four VMAS scenarios. Thanks to VMAS’s vectorization, we are
able to perform a training iteration (comprised of 60,000 environment interac-
tions and deep neural network training) in 25 s on average. The runs reported in
this section all took under 3 h to complete. The models compared are all based
on Proximal Policy Optimization [29], an actor-critic RL algorithm. The actor
is a Deep Neural Network (DNN) which outputs actions given the observations
and the critic is a DNN (used only during training) which, given the observa-
tions, outputs a value representing the goodness of the current state and action.
We refer to the actor and critic as centralized when they have access to all the
agents’ observations and output all the agent’s actions/values and we call them
decentralized when they only map one agent’s observations to its action/value.
The models compared are:
– CPPO: This model uses a centralized critic and actor. It treats the multi-
agent problem as a single-agent problem with one super-agent.
– MAPPO [37]: This model uses a centralized critic and a decentralized actor.
Therefore, the agents act independently, with local decentralized policies, but
are trained with centralized information.
– IPPO [36]: This model uses a decentralized critic and actor. Every agent
learns and acts independently. Model parameters are shared among agents so
they can benefit from each other’s experiences.
– HetIPPO: We customize IPPO to disable parameter sharing, making each
agent’s model unique.
– Heuristic: This is a hand-designed decentralized heuristic different for each
task.

Fig. 3. Comparison of the scalability of VMAS and MPE [16] in the number of parallel
environments. In this plot, we show the execution time of the “simple spread” scenario
for 100 steps. MPE does not support vectorization and thus cannot be run on a GPU.

Experiments are run in RLlib [15] using the vectorized interface. We run all
algorithms for 400 training iterations. Each training iteration is performed over
60,000 environment interactions. We plot the mean and standard deviation of the
mean episode reward (i.e., the mean of the total rewards of the episodes contained in the training iteration) over 10 runs with different seeds. The model used for all critics and actors is a two-layer Multi-Layer Perceptron (MLP) with hyperbolic tangent activations. A video of the learned policies is available at: https://youtu.be/aaDRYfiesAY. In the following, we discuss the results for the trained scenarios.
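For reference, a model of the kind described, two hidden layers with hyperbolic tangent activations, takes only a few lines of PyTorch; the hidden width below is our assumption, as it is not reported here.

```python
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Module:
    """Two-layer tanh MLP of the kind used for all actors and critics;
    `hidden` is an assumed width, not a value from the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )
```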
Transport (Fig. 4a). In the Transport environment, only IPPO is able to learn
the optimal policy. This is because the other models, which have centralized com-
ponents, have an input space consisting of the concatenation of all the agents’
observations. Consequently, centralized architectures fail to generalize in envi-
ronments requiring a high initial exploration like this one, where there is a high
variance in possible joint states (and therefore there is a low probability that a
similar state will be encountered).

Fig. 4. Benchmark performance of different PPO-based MARL algorithms in four VMAS scenarios. Experiments are run in RLlib [15]. Each training iteration is per-
formed over 60,000 environment interactions. We plot the mean and standard deviation
of the mean episode reward over 10 runs with different seeds.

Wheel (Fig. 4b). The Wheel environment proved to be a hard task for MARL algorithms. Here, none of the models was able to solve the task, and all performed worse than the heuristic.
Balance (Fig. 4c). In Balance, all models were able to solve the task and out-
perform the heuristic. However, this is largely due to the use of a large observation
space containing global information. The task can be made arbitrarily harder by
removing part of the observation space and thus increasing partial observability.
Give Way (Fig. 4d). In the Give Way scenario, it is shown that only algorithms
able to develop heterogeneous agent behaviour can solve the environment. In
fact, IPPO and MAPPO, which use parameter sharing and decentralized actors,
fail this scenario. On the other hand, it is shown that the scenario can be solved
either through a centralized actor (CPPO) or by disabling parameter sharing
and allowing agent policies to be heterogeneous (HetIPPO).
The experimental results confirm that VMAS proposes a selection of sce-
narios which prove challenging in orthogonal ways for current state-of-the-art
MARL algorithms. We show that there exists no one-size-fits-all solution and that
our scenarios can provide a valuable benchmark for new MARL algorithms. In
addition, vectorization enables faster training, which is key to a wider adoption
of multi-agent learning in the robotics community.

7 Conclusion

In this work, we introduced VMAS, an open-source vectorized simulator for multi-robot learning. VMAS uses PyTorch and is composed of a core vector-
ized 2D physics simulator and a set of multi-robot scenarios, which encode hard
collective robotic tasks. The focus of this framework is to act as a platform for
MARL benchmarking. Therefore, to incentivize contributions from the commu-
nity, we made implementing new scenarios as simple and modular as possible.
We showed the computational benefits of vectorization with up to 30,000 parallel
simulations executed in under 10 s on a GPU. We benchmarked the performance
of MARL algorithms on our scenarios. During our training experiments, we
were able to collect 60,000 environment steps and perform a training iteration
in under 25 s. Experiments also showed how VMAS scenarios prove difficult in
orthogonal ways for state-of-the-art MARL algorithms. In the future, we plan
to extend the features of VMAS to widen its adoption, continuing to imple-
ment new scenarios and benchmarks. We are also interested in modularizing the
physics engine, enabling users to swap vectorized engines with different fidelities
and computational demands.

Acknowledgements. This work was supported by ARL DCIST CRA W911NF-17-2-0181 and European Research Council (ERC) Project 949940 (gAIa). R. Kortvelesy
was supported by Nokia Bell Labs through their donation for the Centre of Mobile,
Wearable Systems and Augmented Intelligence to the University of Cambridge. J.
Blumenkamp acknowledges the support of the ‘Studienstiftung des deutschen Volkes’
and an EPSRC tuition fee grant.

References
1. Pyglet. https://pyglet.org/
2. Baker, B., et al.: Emergent tool use from multi-agent autocurricula. In: Interna-
tional Conference on Learning Representations (2019)
3. Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S.: The complexity of decen-
tralized control of Markov decision processes. Math. Oper. Res. 27(4), 819–840
(2002)
4. Blumenkamp, J., Morad, S., Gielis, J., Li, Q., Prorok, A.: A framework for real-
world multi-robot systems running decentralized gnn-based policies. arXiv preprint
arXiv:2111.01777 (2021)
5. Bradbury, J., et al.: JAX: composable transformations of Python+NumPy pro-
grams (2018). http://github.com/google/jax
6. Bräysy, O., Gendreau, M.: Vehicle routing problem with time windows, Part II:
Metaheuristics. Transp. Sci. 39(1), 119–139 (2005)
7. Brockman, G., et al.: Openai gym. arXiv preprint arXiv:1606.01540 (2016)
8. Freeman, C.D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., Bachem, O.: Brax -
a differentiable physics engine for large scale rigid body simulation (2021). http://
github.com/google/brax
9. Ijspeert, A.J., Martinoli, A., Billard, A., Gambardella, L.M.: Collaboration through
the exploitation of local interactions in autonomous collective robotics: the stick
pulling experiment. Auton. Robot. 11(2), 149–171 (2001)
10. Jiang, S., Amato, C.: Multi-agent reinforcement learning with directed exploration
and selective memory reuse. In: Proceedings of the 36th Annual ACM Symposium
on Applied Computing, pp. 777–784 (2021)
11. Koenig, N., Howard, A.: Design and use paradigms for gazebo, an open-source
multi-robot simulator. In: 2004 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), pp. 2149–2154. IEEE
(2004)
12. Kurach, K., et al.: Google research football: a novel reinforcement learning envi-
ronment. Proc. AAAI Conf. Artif. Intell. 34, 4501–4510 (2020)
13. Lange, R.T.: gymnax: A JAX-based reinforcement learning environment library
(2022). http://github.com/RobertTLange/gymnax
14. Li, Q., Gama, F., Ribeiro, A., Prorok, A.: Graph neural networks for decentral-
ized multi-robot path planning. In: 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pp. 11785–11792. IEEE (2020)
15. Liang, E., et al.: Rllib: abstractions for distributed reinforcement learning. In:
International Conference on Machine Learning, pp. 3053–3062. PMLR (2018)
16. Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi-
agent actor-critic for mixed cooperative-competitive environments. Adv. Neural
Inf. Process. Syst. (2017)
17. Makoviychuk, V., et al.: Isaac gym: High performance GPU based physics simula-
tion for robot learning. In: Thirty-fifth Conference on Neural Information Process-
ing Systems Datasets and Benchmarks Track (Round 2) (2021)
18. Michel, O.: Cyberbotics Ltd. Webots™: professional mobile robot simulation. Int.
J. Adv. Robot. Syst. 1, 5 (2004)
19. Niiranen, J.: Fast and accurate symmetric Euler algorithm for electromechanical simulations. In: Electrimacs 99 (Modelling and Simulation of Electric Machines, Converters and Systems), pp. I–71 (1999)

20. Noori, F.M., Portugal, D., Rocha, R.P., Couceiro, M.S.: On 3d simulators for multi-
robot systems in ros: Morse or gazebo? In: 2017 IEEE International Symposium
on Safety, Security and Rescue Robotics (SSRR), pp. 19–24 (2017)
21. Panerati, J., Zheng, H., Zhou, S., Xu, J., Prorok, A., Schoellig, A.P.: Learning to fly-
a gym environment with pybullet physics for reinforcement learning of multi-agent
quadcopter control. In: 2021 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp. 7512–7519. IEEE (2021)
22. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning
library. Adv. Neural Inf. Process. Syst. (2019)
23. Peng, B., et al.: Facmac: factored multi-agent centralised policy gradients. Adv.
Neural Inf. Process. Syst. 12208–12221 (2021)
24. Pinciroli, C., et al.: ARGoS: a modular, parallel, multi-engine simulator for multi-
robot systems. Swarm Intell. 6, 271–295 (2012)
25. Prorok, A.: Robust assignment using redundant robots on transport networks with
uncertain travel time. IEEE Trans. Automat. Sci. Eng. 17, 2025–2037 (2020)
26. Resnick, C., et al.: Pommerman: a multi-agent playground. CoRR (2018)
27. Reynolds, C.W.: Flocks, herds and schools: a distributed behavioral model. In:
Proceedings of the 14th Annual Conference on Computer Graphics and Interactive
Techniques, pp. 25–34 (1987)
28. Samvelyan, M., et al.: The StarCraft Multi-agent Challenge. CoRR (2019)
29. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
30. Shen, J., Xiao, E., Liu, Y., Feng, C.: A deep reinforcement learning envi-
ronment for particle robot navigation and object manipulation. arXiv preprint
arXiv:2203.06464 (2022)
31. Suarez, J., Du, Y., Isola, P., Mordatch, I.: Neural MMO: a massively multiagent
game environment for training and evaluating intelligent agents. arXiv preprint
arXiv:1903.00784 (2019)
32. Todorov, E., Erez, T., Tassa, Y.: Mujoco: a physics engine for model-based control.
In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems,
pp. 5026–5033. IEEE (2012)
33. Tolstaya, E., Gama, F., Paulos, J., Pappas, G., Kumar, V., Ribeiro, A.: Learning
decentralized controllers for robot swarms with graph neural networks. In: Kael-
bling, L.P., Kragic, D., Sugiura, K. (eds.) Proceedings of the Conference on Robot
Learning, Proceedings of Machine Learning Research, vol. 100, pp. 671–682. PMLR
(2020). https://proceedings.mlr.press/v100/tolstaya20a.html
34. Wang, B., Liu, Z., Li, Q., Prorok, A.: Mobile robot path planning in dynamic envi-
ronments through globally guided reinforcement learning. IEEE Robot. Automat.
Lett. 5, 6932–6939 (2020)
35. Weng, J., et al.: Envpool: a highly parallel reinforcement learning environment
execution engine. arXiv preprint arXiv:2206.10558 (2022)
36. de Witt, C.S., et al.: Is independent learning all you need in the starcraft multi-
agent challenge? arXiv preprint arXiv:2011.09533 (2020)
37. Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A., Wu, Y.: The surpris-
ing effectiveness of PPO in cooperative, multi-agent games. arXiv preprint
arXiv:2103.01955 (2021)
38. Zheng, L., et al.: Magent: a many-agent reinforcement learning platform for arti-
ficial collective intelligence. In: Proceedings of the AAAI Conference on Artificial
Intelligence (2018)
39. Zheng, X., Koenig, S., Kempe, D., Jain, S.: Multirobot forest coverage for weighted
and unweighted terrain. IEEE Trans. Robot. 26, 1018–1031 (2010)
FLAM: Fault Localization and Mapping
Guillaume Ricard(B), David Vielfaure, and Giovanni Beltrame

Polytechnique Montréal, Montreal, Canada
guillaume.ricard@polymtl.ca

Abstract. Multi-robot systems are increasingly used in several industrial automation and warehouse management applications, mostly with a
centralized hub for coordination. Several decentralized infrastructures
have been studied for using multi-robot systems in real mission sce-
narios like search-and-rescue, area coverage and exploration. However,
despite designing rigorous methods for using multi-robot systems in
a decentralized setting, long-term field deployments still seem unfea-
sible. The lack of proper infrastructure for tackling fault-detection
is one of the great challenges in this regard. We propose FLAM
(https://github.com/MISTLab/FLAM), a fault localization and map-
ping algorithm that detects faults in a robotic system and uses them
to build a map of the environmental hazards, effectively providing risk-
awareness to the robotic team.

Keywords: Swarm Robotics · Robot Safety · Failure Detection and Recovery

1 Introduction
Using a team of robots, as opposed to a single one, to carry out a given task can result in significant performance gains, especially if the aforementioned task
requires terrain coverage. Indeed, with proper coordination between the robots,
the speed at which the terrain is being covered should increase proportionally
with the number of robots in the team [1]. Time critical applications like search-
and-rescue scenarios and forest fire reckoning are therefore well-suited for multi
robot applications. However, in most cases, the environment in which the task is
being carried is unknown and unfortunately, the use of robotic systems in such
settings remains a challenge. Robots are prone to exhibiting failures when facing
the associated random hazards and constantly changing conditions of such envi-
ronments. Additionally, the agents forming a robot swarm are generally simple
and inefficient on their own [2] and as a result, are not particularly tolerant
to faults. While in theory the inherent redundancy of robot swarms should provide
increased robustness, it has been demonstrated that, in practice, robot swarms
can even exhibit lower reliability levels than traditional centralized robot sys-
tems [3]. Individual failures often lead to a complete system failure, challenging
the assumption that swarm systems are robust by nature [4]. Developing a tool
that tackles failures in decentralized robotic systems is therefore of great impor-
tance. The Science Robotics journal even sees it as one of the great challenges
of the next decade [5]. Furthermore, a robot itself could be normally operational
but under certain external factors could experience temporary sensor or actuator
failures. These environmental factors are in a number of cases, location-based
(for instance, a mud pit could cause wheel slippage in ground robots) and infor-
mation about these locations could provide increased situational awareness for
robots to effectively perform a given mission. In this work, we design a back-end
and a front-end that can be used to determine these environmental hindrances
and create a map of these factors that can be used by robots to gain risk aware-
ness.
In the front-end, we design feature vectors that are periodically shared
between neighbors in the swarm. The feature vectors are then used by an outlier
detector to determine if the agent's behavior is normal or abnormal. An agent is flagged as faulty if its behavior is abnormal; the abnormality is presumed to be the result of a fault. The back-end obtains the identified faults along
with their given locations and creates a distributed belief map of the environ-
mental hazards. Higher levels of risk will be associated with areas where failures
have been detected in the past. The belief map is updated and shared period-
ically across the swarm which in turn provides risk-awareness to the agents to
perform their given task. In this paper, we make the following contribution to
the field of swarm robotics: A decentralized fault detection and fault mapping
mechanism that provides risk-awareness to autonomous robot swarms.
Based on this contribution, we name our algorithm FLAM: Fault Localiza-
tion And Mapping. The paper is structured as follows: in Sect. 2 some
relevant related works are presented; in Sect. 3 we describe the workings of our
system; in Sect. 4 we present results obtained from simulations; and finally, in
Sect. 5 we draw conclusions.

2 Related Work and Background

2.1 Fault Detection
Faults are inevitable in robotic systems; according to Khalastchi et al. [6], the key is to identify and recover from them as quickly as possible. In a swarm of robots, the
problem is slightly different. Because there is usually a large number of agents
forming the swarm, losing one of them to a fault should not lead to a global
system failure. However, it is critically important to gain insights from previous faults so as to avoid the same fault propagating to the entire swarm. A
taxonomy of the faults affecting robotic systems is presented in [6]. Hardware
faults affect physical components of the robots and compromise their sensing
and acting abilities. Software faults affect the behavior of the robots and are
caused by faulty algorithms and/or faulty implementations. Finally, interaction
faults affect the dynamics of the robotic system and are caused by exogenous
events. In this work, we aim at detecting faults caused by location-based hazards,
consequently FLAM focuses on interaction faults. Fault detection methods have
been proposed in the past and the three main families are presented in the
same survey [6]. (i) Data-driven approaches use sensor data and compare it to
past normal/abnormal behaviors or to the behavior of neighbor agents to detect
faults. (ii) Model-based approaches use an explicit a priori model of the system
to identify faults. (iii) Knowledge-based approaches are similar to the way a
human would perform fault detection. Faults are typically represented in a tree
structure where the fault can be linked backwards to where it originated. The
main drawbacks of the last two approaches are the need for an a priori model of the system and their inefficiency in detecting faults in dynamic environments.
Additionally, two strategies can be applied to fault detection: endogenous
fault detection and exogenous fault detection [7,8]. Endogenous fault detection
refers to the actions taken by an individual to perform fault detection on itself
and on its own. It is the robot’s responsibility to determine if it is in a faulty
state. In opposition, exogenous fault detection refers to the actions taken by
neighboring robots to detect faults on an observed entity. This approach relies
on the observations of external robots and requires collaboration of neighboring
agents. Exogenous data-driven fault detection methods are the best-suited for
robot swarms. Exogenous strategies leverage the power of swarms: collaboration
of all team members. Data-driven methods allow the detection of previously
unseen faults and, as a result, enable the deployment of the swarm in unknown
and dynamic environments.
An exogenous data-driven fault detection method inspired by the human
immune system has been proposed by Tarapore et al. [9–11]. Binary feature
vectors that characterize agents’ behavior are periodically shared among robots
of the swarm. Then, using the mathematical formulation of the cross-regulation
model along with the feature vectors, outliers are detected. The method detects
faults by assuming that the behavioral abnormality is the result of a fault. Non-
binary feature vectors could be used to perform fault detection using more com-
mon outlier detection methods like K-nearest neighbor (KNN) and local outlier
factor [12]. A neural network fault detection approach was proposed in [13]
where faults are presumed to cause changes in the data flow of the system. By
monitoring the flow, they can infer the presence of a fault. Of course, such a
method requires learning, which can be challenging when failures are concerned.
However, because outlier detection identifies abnormal behaviors and not faults
directly, it is hard to perform diagnosis and identify the adequate recovery pro-
cedure. In [14,15], the authors address the problem by adding a fault diagnosis layer
that uses feature vectors after the fault detection one.

2.2 Risk Awareness

Several robot swarm applications considering risk have been proposed in the
past. In SPIDER [16], agents are deployed into a dangerous environment and
maintain connectivity through chain formation. Agents display shy behaviors
when the frequency of social interactions is low and bold behaviors when inter-
actions are numerous. In that work, danger is represented through connectivity,
where being disconnected from the other agents is the risk the swarm faces.
Ono et al., Vitus et al. introduced the concept of a “risk budget” [17,18] and
leverage it to both maximize reward and optimize allocation of risk. Unfortu-
nately, global knowledge of the environment is required, making the approach ill-suited for tasks performed in unknown environments. Vielfaure et al. proposed DORA-
Explorer [19], a risk-aware exploration algorithm that maximizes terrain coverage
while minimizing exposure to risk in a fully decentralized fashion. The method
assumes that the risk associated with the environment can directly be sensed
by the agents using an on-board sensor. Risk levels are then used to compute a
direction that will provide meaningful exploration gains while minimizing expo-
sure to risk. In all these methods the nature of risk is known by the robots: in
SPIDER [16] risk is connectivity, in Ono et al., Vitus et al. [17,18] a global state
of the environment is needed and in DORA-Explorer [19] risk can be measured
by the agents. However in some cases the nature of the risk might not be known,
especially is the mission is being carried in dynamic and unknown environments.
In FLAM we instead use a fault detector and leverage the identified failures to
build a map of the environmental hazards. Risk is never directly sensed; it is
inferred from the location of previous failures.

2.3 Distributed Storage

Distributed information sharing remains a challenge in multi-robot applications. Limitations in the communication capabilities of robots generally complicate
achieving consistency among the robotic team [20,21]. The virtual stigmergy
presented by Pinciroli et al. [22] tackles the problem and achieves an eventually
coherent set of data across the swarm. It is specifically designed for decentralized
robotic systems operating in environments with heavy packet loss and partial
connectivity. Using Conflict-Free Replicated Data Types (CRDTs), the virtual
stigmergy allows robots to share and agree on sets of (key, value) pairs and
enables robust decentralized information sharing. The Buzz programming lan-
guage [23] directly integrates the virtual stigmergy and allows straightforward
robot swarm software development. Another decentralized data storage
mechanism has recently been proposed by Majcherczyk et al. [24] where agents
each store some amount of information using a fitness based partitioning scheme.
Data is not fully replicated across the agents of the swarm allowing the swarm
to store more information system wide. However, because agents only store a
portion of the swarm’s information, they are less likely to have access to the
information they need.
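For concreteness, a minimal last-writer-wins (key, value) store, one of the simplest CRDTs, gives the flavour of how replicas of a virtual stigmergy can converge after merging. This Python sketch is ours and differs from the actual implementation in [22]; all class and method names are illustrative:

class LWWStore:
    # Minimal last-writer-wins (key, value) store in the spirit of the
    # virtual stigmergy: entries carry Lamport timestamps, and merging two
    # replicas keeps the newer write, so replicas converge eventually.
    def __init__(self, robot_id):
        self.robot_id = robot_id
        self.clock = 0
        self.data = {}  # key -> (value, timestamp, writer_id)

    def put(self, key, value):
        self.clock += 1
        self.data[key] = (value, self.clock, self.robot_id)

    def merge(self, other):
        # Keep, per key, the entry with the larger (timestamp, writer_id)
        # pair; ties are broken deterministically by writer id.
        for key, entry in other.data.items():
            mine = self.data.get(key)
            if mine is None or (entry[1], entry[2]) > (mine[1], mine[2]):
                self.data[key] = entry
        self.clock = max(self.clock, other.clock)

Repeated pairwise merges propagate each (key, value) pair through the swarm, which is the eventual-consistency behaviour a virtual stigmergy relies on.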
Occupancy maps and belief maps have been extensively used for robotic
exploration purposes [25–27]. Occupancy maps discretize the environment into
cells and record the absence or presence of a feature, generally an obstacle.
Whereas occupancy maps are binary, belief maps store probabilities and are
proven to offer better exploration performance [28]. The previously mentioned
DORA-Explorer algorithm [19] actually leverages a belief map when determining
the direction that both optimizes terrain coverage and minimizes exposure to
risk.
Fig. 1. FLAM intuition. Figure 1a: Robots of the swarm start exploring an unknown
and hazardous environment. Figure 1b: Some robots start experiencing failures due to
environmental hazards (in this example a mud pit causing wheel slippage and moun-
tains causing falling damage). Failures are detected using a data-driven exogenous
fault detector and their respective locations are recorded in the distributed belief map.
Figure 1c: A map of the environmental hazards that embodies the “risk beliefs” is built
and can be leveraged to provide risk-awareness to the swarm to perform the given
mission.

3 System Model
Taking inspiration from previous works, we build FLAM, an exogenous data-
driven fault detection mechanism that is able to identify interaction faults, in
other words, faults caused by exogenous events in the environment modeled as
a grid E ⊂ Z2 . Then, using the detected faults and their respective locations,
FLAM builds a belief map of where the hazards are located in the environment
and shares it among the swarm using the virtual stigmergy. The hazard map pro-
vides risk-awareness to agents of the swarm and could, for example, be later used
for planning purposes to help avoid environmental pitfalls during subsequent
exploration efforts. This intuition is summarized in Fig. 1.

3.1 Risk Modelling

Several triggers are taken into account for risk sources. Some hazards pertain to
fixed areas matching specific terrain or obstacles in the environment: robots are
subject to such faults immediately as they enter the corresponding zone (e.g.
slippage).
Conversely, more diffuse areas of danger may affect the agents with a given
probability as they draw closer, causing interference that may result in robot
death or communication jamming; we model these as point radiation sources of varying intensities $(I_j)_{j \in S} \sim \mathcal{U}(0, 1)$, akin to those used in DORA-Explorer [19].
The radiation induced for robot i at position xi ∈ E by a single source located
at sj ∈ S is expressed as:
$$r_{s_j}(x_i) = \frac{I_j}{1 + \lambda \lVert x_i - s_j \rVert^2} \quad (1)$$
with λ as a decay constant. To determine whether a robot fails due to its radi-
ation exposure, we sum all individual source radiation levels and add Gaussian
noise b ∼ N (0.005) to account for environment and source variability.

$$r(x_i) = b + \sum_{s_j \in S} r_{s_j}(x_i) \quad (2)$$

A failure is then triggered with probability $\min\{1, r(x_i)\}$.
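For concreteness, a minimal Python sketch of this risk model follows. Equations (1)-(2) and the $\min\{1, r\}$ trigger come from the text above; the function names are illustrative, and we read N(0.005) as zero-mean noise with standard deviation 0.005 (an assumption):

import random

def radiation(x, sources, lam=1.0):
    # Sum of per-source radiation levels (Eqs. 1-2); `sources` holds
    # ((sx, sy), intensity) pairs and `lam` is the decay constant lambda.
    total = 0.0
    for (sx, sy), intensity in sources:
        d2 = (x[0] - sx) ** 2 + (x[1] - sy) ** 2
        total += intensity / (1.0 + lam * d2)
    # Gaussian noise b accounts for environment and source variability
    return total + random.gauss(0.0, 0.005)

def fails(x, sources):
    # A failure is triggered with probability min{1, r(x_i)}
    return random.random() < min(1.0, radiation(x, sources))

# Example: four sources with intensities drawn from U(0, 1)
sources = [((random.uniform(0, 20), random.uniform(0, 20)), random.random())
           for _ in range(4)]
print(fails((10.0, 10.0), sources))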

3.2 Feature Vectors

Tarapore et al. [10] introduced a set of boolean feature vectors describing neigh-
bor density, robot motion and obstacle avoidance performance. We extended
these with a seventh binary feature vector to take into account received net-
work packets in a similar fashion as the motion speed: data transfer speed is
sampled over the sliding observation window W and the additional feature vec-
tor component is set to 1 as long as the current network speed is at least 10%
of the last maximum speed. Counting the values added by each agent to the
stigmergy maintains the exogenous nature of estimating nearby robots’ feature
vector locally.
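As a sketch of this additional component, assuming the reference maximum is taken over the same sliding window W (the function name and data layout are ours):

def network_feature(speed_samples, ratio=0.1):
    # speed_samples: data-transfer speeds sampled over the sliding window W.
    # Returns 1 while the current speed is at least `ratio` (10%) of the
    # maximum speed observed in the window, 0 otherwise.
    if not speed_samples:
        return 0
    return int(speed_samples[-1] >= ratio * max(speed_samples))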

3.3 Fault Detection

We compare two approaches for outlier detection:
(i) KNN [12]. Erroneous behavior detection is achieved by comparing the Ham-
ming distance of each agent feature vector with its 3 closest neighbors in the
stigmergy; the predicted class (faulty or normal) is decided with a majority
rule. To perform early on in the absence of robots identified as faulty, KNN
must be seeded with an inkling of which values may be abnormal. Fortu-
nately the definition of the feature vector components outlines impossible
values for a robot in normal operation (e.g. no movement command while
reporting performing obstacle avoidance).
(ii) Cross-regulatory model (CRM) [10]. Used in conjunction with the base
set of feature vectors in their inception paper by Tarapore et al. [10]. Feature
vector values are treated as antigens displayed by agent cells and their clas-
sification result is given by integrating the modeled dynamics of our immune
system. Over time, more-often-encountered values which are common in the
agent population are more tolerated than outliers that we assume are the
result of a fault. The original C++ implementation is used directly with
feature vector values supplied from Buzz.
The local classification results are shared through the stigmergy and a robot
is deemed faulty when the majority of the swarm classifies it as abnormal.
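A minimal sketch of approach (i) and the final consensus rule might look as follows, assuming each robot reads labelled feature vectors from the stigmergy (seeded with known-impossible "faulty" exemplars, as described above); all names are illustrative:

def hamming(a, b):
    # Hamming distance between two binary feature vectors
    return sum(x != y for x, y in zip(a, b))

def knn_is_faulty(fv, labelled_fvs, k=3):
    # labelled_fvs: (feature_vector, is_faulty) pairs from the stigmergy.
    # Majority label of the k closest vectors under Hamming distance.
    ranked = sorted(labelled_fvs, key=lambda pair: hamming(fv, pair[0]))
    faulty_votes = sum(1 for _, is_faulty in ranked[:k] if is_faulty)
    return faulty_votes > k // 2

def swarm_consensus(votes):
    # A robot is deemed faulty when a majority of the swarm votes "abnormal"
    return sum(votes) > len(votes) / 2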

3.4 Distributed Belief Map

Using feature vectors to characterize robot behavior may not allow direct diagno-
sis but it encompasses various kinds of possible faults: without loss of generality,
we may consider all types of fault equal in terms of risk as FLAM leaves the
task of leveraging these maps to the exploration control software.
Thus, we store the detected failure count for each discrete cell in the envi-
ronment modeled as a grid. This failure count $N_f$ can be used with the current time step $t$ to compute an estimate of the failure probability at position $(x, y)$:
$$p_t(x, y) = \frac{N_f}{t} \quad (3)$$
As the fault likelihood decreases with time, entries in the stigmergy older than
5 epochs are evicted after each update.
In a more realistic application, performing fault diagnosis lets us prioritize
which risks we want to avoid by applying additional weights when constructing
the global risk belief map.
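A local stand-in for the shared map, with the Eq. (3) estimate and the 5-epoch eviction rule, could be sketched as follows (class and attribute names are ours; in FLAM the counts live in the virtual stigmergy):

class BeliefMap:
    def __init__(self, max_age=5):
        self.counts = {}       # (x, y) cell -> [failure_count, last_epoch]
        self.max_age = max_age

    def add_failure(self, cell, epoch):
        entry = self.counts.setdefault(cell, [0, epoch])
        entry[0] += 1          # N_f for this cell
        entry[1] = epoch

    def failure_probability(self, cell, t):
        # p_t(x, y) = N_f / t, Eq. (3)
        return self.counts.get(cell, [0, 0])[0] / t if t > 0 else 0.0

    def evict_stale(self, epoch):
        # Entries older than max_age epochs are evicted after each update
        self.counts = {c: e for c, e in self.counts.items()
                       if epoch - e[1] <= self.max_age}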

3.5 Implementation
With the steps put forward, the FLAM implementation may be summarized as
the following algorithm:

Algorithm 1: FLAM implementation

step ← 1, S ← random radiation sources
while True do
    if Failure(S) or consensus_against(self) then
        Add failure to map at (self.x, self.y)
    else if step % W = 0 then
        compute_features()
    else if step % W = 1 then
        update_votes()
    else if step % W = 2 then
        update_consensus()
    random_search()
    step ← step + 1

Failed robots or robots that were voted against by the majority of the swarm
become inactive and stop running FLAM as soon as they are aware that they
no longer seem fit for duty. Updates to the stigmergy are one step apart to let
the information spread through the shared storage.

4 Simulations
4.1 Experimental Setup
We evaluated FLAM with different kinds of faults in the open-source ARGoS robot swarm simulator [29], performing 50 simulation instances over 1200 time
steps. N = 20 wheeled Khepera IV [30] robots start at a random position on

a 20 × 20 grid and perform random search in an environment with two local-
ized error sources (see Fig. 2): terrain difficulty is modeled by fault-causing areas
slowing down robot wheel speed or saturating sensor readings, respectively. Addi-
tionally, four radiation sources may trigger robot termination or communication
jamming/packet loss. The latter are more challenging as they effectively dis-
connect one robot from the stigmergy entirely, preventing further voting and
consensus updates. The Khepera IV is equipped with an additional near-range
communication system using its sensors that is used to shut down robots cut off from the network that the rest of the swarm identified as faulty.

Fig. 2. Simulation setup. Figure 2a: Fault-generating areas are present in the top-left
(slippage) and bottom-right (sensor saturation) corners highlighted with rectangles.
Red stars represent radiation sources fixed across simulation instances to achieve con-
sistent belief maps. Figure 2b: Robots inside ARGoS; cylinders indicate radiation sources.

Simulation logs are collected for each robot and combined to assess fault detection performance with simple counting rules for determining true positives (TP), false positives (FP) and false negatives (FN); a sketch of the derived scores follows the list below.

– TP: Fault-detection occurs for a robot subject to a fault
– FP: Fault-detection occurs for a robot before it ever fails or when it is no longer subject to a fault
– FN: No fault-detection occurs for a failed robot
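The precision, recall and F1 scores reported in Table 1 follow from these counts in the standard way; a one-function sketch:

def detection_scores(tp, fp, fn):
    # Precision, recall and F1 from the TP/FP/FN counting rules above
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1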
4.2 Results

(See Table 1).

Table 1. Overall fault detection performance by method. The scores capture the
retained counting rules for TP, FP, FN. Left: N = 10, Right: N = 20

N = 10                                 N = 20
Method     Precision  Recall  F1       Method     Precision  Recall  F1
KNN [12]   38.9       59.8    47.0     KNN [12]   82.8       43.8    57.3
CRM [10]   68.4       57.9    62.7     CRM [10]   91.9       57.5    70.7

Fig. 3. Average fault detections over time. Left: N = 10, Right: N = 20

Fig. 4. Generated risk-belief map with slippage (blue rectangle) and sensor satura-
tion (green rectangle) areas as well as radiation sources (red stars). Darker intensity
indicates more robot failures at that discrete location. Left: N = 10, Right: N = 20
5 Conclusion
We introduced FLAM, a multiclass bio-inspired distributed fault detection
method that builds risk belief maps from identified robot failures. We had
expected that successfully applying CRM [10] would outperform basic KNN
[12] classification for fault detection and verified that KNN [12] outputs more
false positives. CRM [10] avoids this sensitivity pitfall through its inherently conservative design, though precision is lower in sparse swarms (N = 10) than in denser swarms (N = 20) because of the challenging small-scale swarm context in which FLAM operates to reflect practical applications. Both approaches cease to detect faults when too few robots are left (stagnation in reported faults, Fig. 3).
Independently of fault detection accuracy, maps of failures are obtained that are
consistent with the simulation setups as they highlight the frontier of the error
zones and the surroundings of radiation sources (Fig. 4).
The generated risk belief maps may be used to guide navigation for subse-
quent tasks in the environment. Future work may involve building risk-aware
exploration at the same time as FLAM-based risk assessment is conducted, tak-
ing immediate action to alleviate faults (e.g. by increasing obstacle avoidance
sensitivity and radius). Such robot responses can be explored either as automatic countermeasures or as interactive visualization presenting risk and terrain informa-
tion to a remote operator as it is gathered.

References
1. Burgard, W., Moors, M., Stachniss, C., Schneider, F.E.: Coordinated multi-robot
exploration. IEEE Trans. Robot. 21(3), 376–386 (2005). https://ieeexplore.ieee.
org/abstract/document/1435481
2. Şahin, E.: Swarm robotics: from sources of inspiration to domains of application.
In: Şahin, E., Spears, W.M. (eds.) SR 2004. LNCS, vol. 3342, pp. 10–20. Springer,
Heidelberg (2005). https://doi.org/10.1007/978-3-540-30552-1_2
3. Winfield, A.F., Nembrini, J.: Safety in numbers: fault-tolerance in robot swarms.
Int. J. Model. Identif. Control 1(1), 30–37 (2006). https://www.inderscienceonline.
com/doi/abs/10.1504/IJMIC.2006.008645
4. Bjerknes, J.D., Winfield, A.F.T.: On fault tolerance and scalability of swarm
robotic systems. In: Martinoli, A., et al. (eds.) Distributed Autonomous Robotic
Systems. STAR, vol. 83, pp. 431–444. Springer, Heidelberg (2013). https://doi.
org/10.1007/978-3-642-32723-0_31
5. Yang, G.-Z., et al.: The grand challenges of Science Robotics. Sci. Robot. 3(14),
eaar7650 (2018). https://www.science.org/doi/abs/10.1126/scirobotics.aar7650
6. Khalastchi, E., Kalech, M.: On fault detection and diagnosis in robotic systems.
ACM Comput. Surv. 51(1), 1–24 (2018). https://doi.org/10.1145/3146389
7. Christensen, A.L.: Fault detection in autonomous robots. Ph.D., Université Libre
de Bruxelles (2008)
8. Lau, H.K.: Error detection in swarm robotics: a focus on adaptivity to dynamic
environments. Ph.D. dissertation, University of York (2012)
9. Tarapore, D., Lima, P.U., Carneiro, J., Christensen, A.L.: To err is robotic, to toler-
ate immunological: fault detection in multirobot systems. Bioinspiration Biomimet-
ics 10(1), 016014 (2015)
10. Tarapore, D., Christensen, A.L., Timmis, J.: Generic, scalable and decentralized
fault detection for robot swarms. PLoS ONE 12(8), e0182058 (2017)
11. Tarapore, D., Timmis, J., Christensen, A.L.: Fault detection in a swarm of physical
robots based on behavioral outlier detection. IEEE Trans. Robot. 35(6), 1516–1522
(2019)
12. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: LOF: identifying density-based
local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference
on Management of Data, pp. 93–104 (2000)
13. Christensen, A.L., O’Grady, R., Birattari, M., Dorigo, M.: Fault detection in
autonomous robots based on fault injection and learning. Auton. Robot. 24(1),
49–67 (2008). https://doi.org/10.1007/s10514-007-9060-9
14. O’Keeffe, J., Tarapore, D., Millard, A.G., Timmis, J.: Towards fault diagnosis in
robot swarms: an online behaviour characterisation approach. In: Gao, Y., Fallah,
S., Jin, Y., Lekakou, C. (eds.) TAROS 2017. LNCS (LNAI), vol. 10454, pp. 393–
407. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64107-2_31
15. O’Keeffe, J., Tarapore, D., Millard, A.G., Timmis, J.: Adaptive online fault diag-
nosis in autonomous robot swarms. Front. Robot. AI 5, 131 (2018)
16. Hunt, E.R., Jenkinson, G., Wilsher, M., Dettmann, C.P., Hauert, S.: SPIDER: a
bioinspired swarm algorithm for adaptive risk-taking. In: Artificial Life Conference
Proceedings, pp. 44–51. MIT Press, Cambridge (2020)
17. Ono, M., Williams, B.C.: An efficient motion planning algorithm for stochastic
dynamic systems with constraints on probability of failure. In: AAAI, pp. 1376–
1382 (2008)
18. Vitus, M.P., Tomlin, C.J.: On feedback design and risk allocation in chance con-
strained control. In: 2011 50th IEEE Conference on Decision and Control and
European Control Conference, pp. 734–739. IEEE (2011)
19. Vielfaure, D., Arseneault, S., Lajoie, P.-Y., Beltrame, G.: DORA: distributed online
risk-aware explorer (2021)
20. Amigoni, F., Banfi, J., Basilico, N.: Multirobot exploration of communication-
restricted environments: a survey. IEEE Intell. Syst. 32(6), 48–57 (2017). https://
ieeexplore.ieee.org/abstract/document/8267592
21. Otte, M.: An emergent group mind across a swarm of robots: collective cognition
and distributed sensing via a shared wireless neural network. Int. J. Robot. Res.
37(9), 1017–1061 (2018)
22. Pinciroli, C., Lee-Brown, A., Beltrame, G.: A tuple space for data sharing in robot
swarms. In: Proceedings of the 9th EAI International Conference on Bio-Inspired
Information and Communications Technologies (Formerly BIONETICS), pp. 287–
294 (2016). https://carlo.pinciroli.net/pdf/Pinciroli:BICT2015.pdf
23. Pinciroli, C., Beltrame, G.: Buzz: a programming language for robot swarms. IEEE
Softw. 33(4), 97–100 (2016)
24. Majcherczyk, N., Pinciroli, C.: SwarmMesh: a distributed data structure for
cooperative multi-robot applications. In: 2020 IEEE International Conference
on Robotics and Automation (ICRA), pp. 4059–4065. IEEE (2020). https://
ieeexplore.ieee.org/abstract/document/9197403
25. Kobayashi, F., Sakai, S., Kojima, F.: Sharing of exploring information using belief
measure for multi robot exploration. In: 2002 IEEE World Congress on Computa-
tional Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-
IEEE 2002. Proceedings (Cat. No. 02CH37291), vol. 2, pp. 1544–1549 (2002)
26. Kobayashi, F., Sakai, S., Kojima, F.: Determination of exploration target based
on belief measure in multi-robot exploration. In: Proceedings 2003 IEEE Inter-
national Symposium on Computational Intelligence in Robotics and Automation.
Computational Intelligence in Robotics and Automation for the New Millennium
(Cat. No. 03EX694), vol. 3, pp. 1545–1550 (2003)
27. Indelman, V.: Cooperative multi-robot belief space planning for autonomous nav-
igation in unknown environments. Auton. Robot. 42(2), 353–373 (2018). https://
doi.org/10.1007/s10514-017-9620-6
28. Stachniss, C., Burgard, W.: Mapping and exploration with mobile robots using
coverage maps. In: Proceedings 2003 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453), Las Vegas, Nevada,
USA, vol. 1, pp. 467–472. IEEE (2003)
29. Pinciroli, C., et al.: ARGoS: a modular, multi-engine simulator for heterogeneous
swarm robotics. In: 2011 IEEE/RSJ International Conference on Intelligent Robots
and Systems, pp. 5027–5034 (2011)
30. K-Team, “Khepera IV” (2021). https://www.k-team.com/khepera-iv
Social Exploration in Robot Swarms
Elliott Hogg1(B), David Harvey2, Sabine Hauert1, and Arthur Richards1

1 Bristol Robotics Laboratory, University of Bristol, Bristol, UK
  elliott.hogg@bristol.ac.uk
2 Thales UK Ltd, Reading, UK

Abstract. Robot swarms have shown great potential for exploration

of unknown environments, utilizing simple robots with local interaction
and limited sensing. Despite this, complex indoor environments can cre-
ate issues for reactive swarm behaviours where specific paths need to be
travelled and bottlenecks are present. In this paper we present our social
exploration algorithm which allows the swarm to decide between differ-
ent options of swarm behaviours to search randomly generated environ-
ments. Using a “happiness” measure, agents can reason over the perfor-
mance of different swarm behaviours, aiming to promote free movement.
Agents collaborate to share opinions of different behaviours, forming
teams which are capable of adapting their exploration to any given envi-
ronment. We demonstrate the ability of the swarm to explore complex
environments with minimal information and highlight increased perfor-
mance in relation to other swarm behaviours over 250 randomly gener-
ated environments.

Keywords: Swarm robotics · Collective decision making · Search and Rescue

1 Introduction
Swarm robotics studies how to design large teams of robots which can work
together to achieve some goal. This takes inspiration from swarms found in
nature, particularly, the ability to produce complex emergent behaviours from
simple individual swarm agents reacting to their local environment [1,2]. Swarm
robotics aims to understand how to design the local interaction between neigh-
bours in the swarm to generate these complex behaviours, without any defined
leader [3]. Some of the key benefits of swarm systems are their fully decentralized nature, which reduces the risk of a single point of failure and provides redundancy
in the case of lost agents [4]. Swarm robotics has been investigated and applied
to a large range of applications, including, exploration, collective assembly, and
collective decision making [5,6]. Here, we focus on the problem of exploration of
unknown environments with swarms.
Exploration in robot swarms has been investigated widely, considering how
to design swarm systems to explore different types of environments. Swarms
are well suited to exploration as we can use large teams of simple robots which
can explore different areas in parallel. There are many existing dispersion based
algorithms, where agents can simply repel from their local neighbours, allowing
swarms to cover large areas of different shapes [7,8]. This has also been demon-
strated with minimal sensory requirements for real world applications [9,10].
Other approaches such as random walk behaviours have also been shown to
be effective solutions [11,12]. Exploration is also a fundamental requirement in
foraging based problems [13]. Foraging algorithms can use virtual pheromones
placed by agents to guide other swarm agents in the correct direction [14]. Whilst
each of these proposed swarm algorithms have been shown to be effective, each
are suited to different types of environments. When faced with uncertainty over
the nature of the environment, it is not clear how to select suitable behaviours.
Therefore, there is a need for a universal swarm search strategy.
In this paper we explore the design of a social exploration algorithm which
allows swarms to explore complex, uncertain environments by giving agents the
option to individually choose between different swarm behaviours. Without any
prior knowledge, agents trial different behaviours to try and find the best current
option. Agents use a “happiness” measure to assess their degree of free movement
when exploring to determine whether a behaviour is effective. Crucially, agents
work together by sharing opinions of different behaviours to discover the most
suitable behaviours to use (Fig. 1). This naturally allows the swarm to adapt
its search to the configuration of the environment. We discuss our algorithm in
greater detail in Sect. 3. In the following section we highlight similar work to
our presented algorithm and finally we discuss the performance of our algorithm
when compared to other swarm behaviours in Sect. 4.

Fig. 1. Swarm agents reason over suitable swarm behaviours to explore environments
using a happiness measure, indicating their ability to explore freely. Agents locally
compare happiness to find suitable behaviours and improve exploration.
2 Related Work
The social exploration algorithm discussed in this paper is inspired from other
examples of collective decision making in robot swarms [15]. Examples of collec-
tive decision making can be found in nature such as site selection in ant swarms
[16]. Some examples applied to robotics include the “best of n” problem, tasking
the swarm with comparing the qualities of different sites, aiming to reach con-
sensus on the best option [17]. In these problems, agents develop opinions over
different options and through different communication mechanisms, share this
evidence with their neighbours. Through information sharing, agents compare
beliefs until the swarm can form consensus on one shared belief or decision [18].
Our work relates to the decision making mechanisms used within this field but
crucially differs in that we do not want to achieve global consensus in the swarm.
Agents will share their opinions of different options of swarm behaviours and
whilst we may observe small groups of agents which adopt the same behaviour,
this should remain highly localized. Talamali et al. [19] have also explored collec-
tive decision making with short communication ranges, highlighting the
benefit of localized information sharing.
Global consensus is not a priority in our work, but we find a similar example
where swarm agents reason over possible actions in a foraging task [20]. Here
it is shown that the swarm can form consensus over the best of two actions to
find the shortest path between a food and nest site. Whilst agents don’t need to
directly measure the reward of each action, only two options need to be consid-
ered given two possible paths to take. In exploration of complex environments,
agents are required to make many correct decisions in sequence when many
possible complex paths are present.
We also highlight themes in Particle Swarm Optimization (PSO) which over-
lap with our algorithm. In PSO, swarm agents search an N dimensional param-
eter space for optimal solutions, recording their best individual state and states
of their neighbours [21]. The particles use this shared historical information to
determine their velocity headings, aiming to search for optimal solutions [22,23].
There are similarities here where agents share their states with some value asso-
ciated to that state. Our problem differs from PSO as there is no sense of an
optimal state that we want to reach, rather promoting continuous exploration.
Additionally, agents do not use a sense of a best previous position or state; instead, agents use an internalized sense of performance based only on their choice of a certain option.
Our implementation of agent happiness also links to themes in reinforcement
learning (RL) using reward functions. In multi-agent RL, agents aim to deter-
mine optimal policies, consisting of specific sequences of actions, to solve some problem. This is achieved using a reward function which is associated with dif-
ferent actions an agent can take in different possible states [24,25]. Whilst in
RL, the reward function is used to construct policies over many iterations, here
we employ a real-time sense of reward to make a decision at a given point in
time. Our agents do not construct policies or decide upon sequences of actions
to perform.
3 Methodology
3.1 Simulation and Local Swarm Behaviours
We investigate the exploration of complex environments in a custom simulator
built in Python. Swarm agents are modelled as particles which can move in
any direction with a degree of random motion. Agents aim to represent either
quadcopter or omni-directional wheel based robots. Potential field forces allow
agents to avoid obstacles and walls in the environment. The agents travel at a
constant speed of 0.5 m/s. In our investigations we aim to maximise coverage
of environments which we measure based on the detection of a set of objectives
which are evenly distributed over the environment. An objective is detected
when the Euclidean distance between an agent and an objective is less than 2.5
m. Objectives are distributed evenly 1 m apart over the environment, giving a
granular measurement of coverage. The swarm agents have no access to coverage
information.
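A minimal sketch of this coverage measurement (the simulator is custom, so these names are illustrative):

import math

def coverage(agent_positions, objectives, radius=2.5):
    # Fraction of objectives within `radius` metres of at least one agent
    detected = 0
    for ox, oy in objectives:
        if any(math.hypot(ax - ox, ay - oy) < radius
               for ax, ay in agent_positions):
            detected += 1
    return detected / len(objectives)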
In this investigation, individual agents can choose to follow different local
rules or swarm behaviours. In this case we implement a directed field behaviour
which allows the swarm to disperse in a specific direction. We separate this into
eight discrete behaviours allowing the swarm to disperse, North, East, South,
and West. As well as, Northeast, Northwest, Southeast, and Southwest. We
explored the application of these behaviours in a previous study using central-
ized behaviour selection [26]. These behaviours allow the swarm to disperse in
specific directions, and in this case, individual agents can choose which direction to disperse in rather than all agents following the same global rules.
Agents should then work together to determine which direction is best for them
to travel to explore an environment. We also give agents the option of selecting a
random walk behaviour in addition to the directed field behaviours. The process
of selecting between behaviours is discussed in detail in Sect. 3.4.
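As an illustration of the behaviour set, the nine options can be represented as unit headings: the directed fields bias motion along a fixed direction, while the random walk picks a fresh random heading. This is a sketch under our own naming, omitting the potential-field obstacle avoidance described above:

import math
import random

DIRECTIONS = {  # eight discrete directed-field behaviours
    "N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
    "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1),
}

def heading(behaviour, rng=random):
    # Unit heading for a chosen behaviour; "RW" is the random walk option
    if behaviour == "RW":
        angle = rng.uniform(0.0, 2.0 * math.pi)
        return (math.cos(angle), math.sin(angle))
    dx, dy = DIRECTIONS[behaviour]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm)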

3.2 Generating Random Environments

In order to thoroughly test the performance of our proposed algorithm, we
designed an algorithm to randomly generate environments which resemble pos-
sible building layouts. We focus on these types of environments which have
complex structures, multiple paths, and features including rooms and doorways
which can make exploration more difficult. The environment generation process
follows three key stages:
1. Corridor generation - From a random starting point at the bounds of the
environment, we randomly grow a spanning tree which forms the core corridor
network in the environment. Segments of the corridors are created in sequence
with random probabilities to change directions by 90 degrees. Sub-corridors
are also added during this process.
2. Room Creation - With a generated corridor network, the remainder of the
environment is naturally split into sections. Each section is given randomly
placed doorways which connect to the corridor network.
3. Corridor doorways - With some probability we add additional doorways at junctions in the corridor network.

Key parameters which control the types of environments that are generated include bounds for the segment lengths within corridors, $c_{len}$; the probability of creating sub-corridors, $p_{sub}$; and the minimum area which corridors cover in the environment, $c_{prop}$. All environments in this study were generated with the
same bounding dimensions of 80 × 80 m.
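A simplified stand-in for stage 1 of the generator, growing one corridor polyline with random segment lengths and 90-degree turns (parameter and function names are ours; the actual generator also adds sub-corridors, rooms and doorways):

import random

def grow_corridor(width, height, c_len=(5, 15), n_segments=10, rng=random):
    # Start from a random point on the left or top boundary and grow
    # segments, turning 90 degrees with probability 0.5 between segments
    x, y = rng.choice([(0, rng.randrange(height)), (rng.randrange(width), 0)])
    dx, dy = (1, 0) if x == 0 else (0, 1)
    points = [(x, y)]
    for _ in range(n_segments):
        length = rng.randint(*c_len)
        x = max(0, min(width - 1, x + dx * length))
        y = max(0, min(height - 1, y + dy * length))
        points.append((x, y))
        if rng.random() < 0.5:
            dx, dy = (dy, dx) if rng.random() < 0.5 else (-dy, -dx)
    return points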

3.3 Agents with “Happiness”

For agents to decide between using different behaviours, they need a way of
measuring their own performance. In this case, we implement a “happiness”
measure which agents use to measure their degree of free movement which they
update at given intervals. Here, agents monitor how far they have travelled and
their local density of neighbouring agents to determine their happiness and promote exploration.
Agents do not use any complex sensing or reasoning to understand their sur-
roundings, nor do they have any direct knowledge of global or local coverage.
Instead, agents compare happiness locally to find suitable behaviours (discussed
further in Sect. 3.4).
We include distance travelled such that agents check how far they have moved
within a certain time period. When performing a certain behaviour, agents record
their initial position and after a short period of time, agents can check how far
they have moved from their initial position. We define α as the travel rate where,
xt is the position of the agent at time t, and τ is the period between two states.
Here, agents only have memory of one previous position. vc is the constant speed
at which agents travel. By selecting behaviours which allow agents to maximize
distance travelled, this should in turn promote coverage and allow agents to
detect when they become stuck.

$$\alpha = \frac{\lVert x_t - x_{t-\tau} \rVert}{\tau v_c} \quad (1)$$
The second factor included is agent density. When agents update their hap-
piness measure, at that moment in time, they check how many other agents
are within their communication range. To promote coverage, agents should not
be densely packed together. Therefore we define β where n is how many neigh-
bours an agent has and nmax is the limit on the maximum number of neighbours
an agent should have. In this case, nmax = 10. This means agents with fewer
neighbours are happier.
$$\beta = \begin{cases} \dfrac{n_{max} - n}{n_{max}} & \text{if } n < n_{max} \\ \dfrac{1}{n_{max}} & \text{if } n \geq n_{max} \end{cases} \quad (2)$$

$$\text{happiness} = \alpha \, \beta \quad (3)$$
Agent happiness is then given by Eq. 3 which gives a value between 0 for
lowest happiness, and 1 for highest happiness. Given a defined happiness measure
which agents can use to assess different behaviours, agents can then reason over
which behaviours are best to use.
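Eqs. (1)-(3) translate directly into a short routine (a sketch; argument names are ours):

import math

def happiness(pos_t, pos_prev, tau, v_c, n_neighbours, n_max=10):
    # Travel rate alpha (Eq. 1): distance covered in tau at constant speed v_c
    alpha = math.hypot(pos_t[0] - pos_prev[0],
                       pos_t[1] - pos_prev[1]) / (tau * v_c)
    # Density term beta (Eq. 2): fewer neighbours means higher beta
    if n_neighbours < n_max:
        beta = (n_max - n_neighbours) / n_max
    else:
        beta = 1.0 / n_max
    return alpha * beta    # Eq. (3)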

3.4 Social Exploration Algorithm

Using the happiness measure defined above, we define a set of rules which allow
the swarm agents to work together to discover suitable behaviours to search an
environment. The local decision making rules that are implemented are sum-
marised in Algorithm 1. Agents continually update their happiness each time-
step but may only make decisions at certain points based on the update rate
τ . This gives agents time to trial different behaviours for small periods of time
before checking whether some decision needs to be made. When agents check
their happiness, they enter either happy or unhappy states in which they behave
differently.

Algorithm 1. Social exploration algorithm using happiness measure. At each
time step, agents update their state in either happy or unhappy modes.

for agent in swarm do
    update happiness
    if happiness > 0.5 or happiness increased then
        for neighbour in neighbours do
            if happiness > neighbour happiness then
                neighbour behaviour ← agent's behaviour
    if agent timer > update limit then
        if happiness < 0.5 and not increased then
            if neighbour has higher happiness then
                agent behaviour ← neighbour behaviour
            else if agent has highest happiness then
                pick random behaviour ∉ neighbours' behaviours
                if probability < 0.5 then
                    broadcast behaviour change to neighbours
        previous position ← current position
        agent timer ← 0
        update limit ← happiness ∗ update rate
    agent timer ← agent timer + 1

Unhappy Agents. If the agent’s happiness is low and has not improved since
the last state check, then the agent needs to switch to some other behaviour.
When agents are unhappy, they need to communicate with their neighbours to
find a better option. In this case the agent will first look for agents with higher
happiness than its own. If a neighbour does have a higher happiness, the agent
will switch their behaviour to match their neighbour. This allows unhappy agents
to use positive evidence from their neighbours to adopt suitable behaviours.
However, if a neighbour is using the same behaviour, the unhappy agent will not
switch their behaviour as it conflicts with their own evidence.
When all an agent’s neighbours have equal or lower happiness, the agent
will then select a behaviour which none of its neighbours are using. Here the
agent uses the negative evidence of its neighbours to eliminate bad options. In
addition, there is a probability that upon selecting a new behaviour, the agent
will in turn signal to each neighbour to also switch to this new behaviour. This
allows agents to help unhappy neighbours, promoting co-operation rather than
individual gain.
If an unhappy agent has no neighbours, it will randomly switch to a new
behaviour. With no external evidence, agents use trial and error to discover
more suitable behaviours.

Happy Agents. When an agent has high happiness, it does not make any
changes to its chosen behaviour, but it can influence agents with low happiness.
At each time-step, happy agents have a probability of broadcasting their chosen
behaviour to their neighbours. This probability is set equal to their happiness
level. If a broadcast is successful, a neighbour which receives the broadcast will switch its behaviour to match the happy agent's, but only if the happy agent's happiness is significantly higher than its own. It is important that happy agents help other agents to
improve performance of the swarm as a whole.
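The unhappy-agent rules above can be condensed into a short routine; this is a simplified rendering of Algorithm 1's unhappy branch, with agents as plain dictionaries and all names illustrative:

import random

def unhappy_update(agent, neighbours, behaviours, rng=random):
    # agent/neighbours: dicts with "happiness" and "behaviour" keys
    if not neighbours:
        agent["behaviour"] = rng.choice(behaviours)    # trial and error
        return
    best = max(neighbours, key=lambda n: n["happiness"])
    if best["happiness"] > agent["happiness"]:
        agent["behaviour"] = best["behaviour"]         # positive evidence
    else:
        # Negative evidence: avoid every behaviour the neighbours are using
        used = {n["behaviour"] for n in neighbours}
        unused = [b for b in behaviours if b not in used]
        agent["behaviour"] = rng.choice(unused or behaviours)
        if rng.random() < 0.5:                         # also signal neighbours
            for n in neighbours:
                n["behaviour"] = agent["behaviour"]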

4 Results

In this section we discuss the emergent behaviour produced by the social explo-
ration algorithm and performance across large sets of random environments
(videos of simulations are presented here). For all experiments shown here we
selected a short update interval of 20 s to allow agents to frequently check their
happiness. We also set the communication range between agents to 5 m such
that evidence is only shared between agents in close proximity.

Initial Experiments. We implemented the algorithm discussed above using

the happiness measure based on distance travelled and agent density. We first
highlight the behaviour of a single agent and the ability to adapt to their sur-
roundings by selecting different swarm behaviours. Figure 2a presents a single
agent and the path of the agent through the environment. The path is plotted
in different colours which relate to the options of swarm behaviour available. In
addition, we plot the agents happiness over time, also indicating the behaviour
used at different points in time. What we observe is that the agent is capable
of selecting new swarm behaviours when it becomes stuck and subsequently
unhappy. We observe that initially the agent attempts to head West, but quickly
Fig. 2. Emergent behaviour of the swarm using happiness to select between different
options of exploration.

hits a wall and becomes unhappy. Without any neighbours present, an agent
simply picks a random behaviour to try. The agent briefly attempts to head
Southwest, with no success, before then switching to heading Northeast. This
then allows the agent to head away from the corner and we observe a sharp
improvement in happiness. Whilst the agent has high happiness, no change of
behaviour is needed. As navigation continues, we see the agent repeat this pro-
cess each time their happiness drops. This trial and error approach allows agents
to fix problems when they become unhappy, allowing them to navigate out of
positions where they become stuck. This creates agents which can adapt to their
surroundings without complex reasoning over why they have become stuck.

Social Exploration. Whilst with only one agent we have to rely on trial and error, with many agents this approach becomes more powerful, as agents can
collaborate and share knowledge. Figure 2b shows another environment with 30
deployed agents. By utilizing the mechanisms discussed in the previous section,
happy and unhappy agents work together to try and promote the selection of
effective behaviours in their surroundings. Through the use of happy agents
which broadcast their behaviours, and unhappy agents which collaborate with
other agents to discover new behaviours to try, emergent sub-teams form in

the swarm where small groups of agents converge to the same behaviour. These
flock-like groups form through the use of these mechanisms, allowing agents to
share solutions to local problems and help each other navigate with minimal
resistance. We found that these emergent behaviours that form allow the swarm
to effectively navigate complex environments.

Fig. 3. Trails of 50 agents deployed in a maze environment. The colour of the trails in the environment indicates the level of happiness of agents: red trails show positions where agents had happiness lower than 0.5, and green if greater than 0.5. The coverage achieved over time is presented in comparison to 50 agents performing dispersion or random walk.

We also investigated the behaviour of the swarm in a hand-designed maze
environment as shown in Fig. 3. Coverage is measured based on the detection of
objectives covering the environment, distributed in increments of 1 m in both
directions. When the Euclidean distance between any agent and an objective
is less than 1.5 m, the objective is classified as detected. The total coverage is
determined by the proportion of detected objectives given the total number of
objectives. In this case there is a complex path that needs to be navigated, rely-
ing on a long sequence of actions to travel in the correct direction. We find with
the deployment of social swarming, agents are capable of navigating this environ-
ment quickly and achieving high coverage. The swarm forms small teams which
work together to select the appropriate behaviours to allow them to explore the
maze. This is achieved without any prior knowledge or a direct goal to maximise
coverage. Whilst the swarm has some difficulty travelling away from the bottom
left region, we highlight the performance against random walk and dispersion
swarm behaviours which struggle greatly in this environment. With few agents,
dispersion is not effective at pushing agents around many sequences of corners,
and similarly, using random walk is not suited to specific paths that need to be
travelled.

Fig. 4. Performance comparison of social exploration algorithm and random walk and
dispersion behaviours with 25 agents. Each data point shows the comparison in perfor-
mance of two algorithms for one environment. Each data set contains 250 environments.
Points shown above the red line indicate higher performance with the social exploration
algorithm.

4.1 Performance in Randomly Generated Environments

To systematically test the efficacy of our algorithm we test the performance over
a large set of randomly generated environments. The developed algorithm should
enable the swarm to adapt to search a variety of environments and agents will
work together to promote exploration. We test the performance of a swarm of
25 agents over a set of 250 randomly generated environments. For each envi-
ronment, the swarm has 1000 s to explore and we take the mean performance
of each algorithm over 10 separate trials. The input parameters for the gener-
ated environments (as discussed in Sect. 3.2) are also randomly selected within
controlled bounds to generate a broad spectrum of environments. We measure
performance based on the integral of coverage achieved over time, λ, as shown
by Eq. 4. Here, C(t) is the function of coverage achieved over time and T is the total search duration. Rather than only measuring overall coverage, we also want
to measure the rate of coverage, where λ is higher for faster exploration. Fast
exploration is important in time critical scenarios such as search and rescue.
$$\lambda = \frac{1}{T} \int_0^T C(t) \, dt \quad (4)$$
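In discrete time, λ can be computed from coverage samples with the trapezoidal rule (a sketch; names are ours):

def coverage_integral(coverage_samples, dt, total_time):
    # lambda = (1/T) * integral of C(t) dt (Eq. 4), approximated from
    # coverage values sampled every dt seconds over a run of total_time
    c = coverage_samples
    area = sum((c[i] + c[i + 1]) / 2.0 * dt for i in range(len(c) - 1))
    return area / total_time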
Fig. 5. Simulation of social exploration algorithm at different stages with 30 agents.


Agents which are green have happiness greater than 0.5, red agents have happiness less
than 0.5. The swarm is shown to adapt its behaviour at different points to navigate
through the environment with low resistance. The coverage over time is compared to
random walk and dispersion.

We compare the performance of dispersion and random walk behaviours

across the same set of generated environments as shown in Fig. 4. Whilst more
sophisticated swarm behaviours exist for exploration, such as through the use
of pheromones, this requires the use of either physical beacons or access to a
shared virtual pheromone map [27,28]. The method presented does not require
any global information, only relying on local information sharing.
Each plot presents the difference in performance of each algorithm, each
data point showing the performance in a specific environment. Any point plot-
ted above the line indicates an environment where social exploration performed
better. This highlights that the social algorithm is capable of consistently out-
performing each of the alternative algorithms. In addition, given the distribution
of performance across the data we can also infer that testing covered a broad
sample of both difficult and easy environments to explore. Whilst in some cases
the difference is small, we observe in the majority of environments the difference
is significantly high. When comparing to dispersion, we see many cases where
the performance difference is large, highlighting that dispersion struggled in cer-
tain environments. Social exploration is capable of adapting broadly to difficult

environments where random walk and dispersion struggle. This is also achieved
with relatively few agents.
We assess one example environment in detail which highlights how this algo-
rithm can adapt to difficult environment configurations. Figure 5 presents the
simulation of the social exploration algorithm in one environment where disper-
sion and random walk performed poorly. Initially, agents are packed together
and find behaviours to quickly spread out, with a portion of agents dispersing into the large room (Fig. 5a) and others travelling up the corridor (Fig. 5b).
When reaching the top of the corridor, agents become unhappy and trial new
behaviours. They discover that moving west allows them to increase happiness
again and a team of agents move through to the left region of the environment
(Fig. 5c). Agents then continue to explore both left and right regions, adjusting
behaviours to navigate in and out of rooms (Fig. 5d and 5e). We highlight in
Fig. 5f that the swarm is able to achieve nearly 100% coverage and explores the environment more quickly when compared to random walk and dis-
persion. In these types of environment where there are specific paths which need
to be travelled to fully explore the environment, the social exploration algorithm
is much more effective at navigating these paths. This is further supported by
the performance of the swarm in the maze environment (Fig. 3).

5 Conclusions and Future Work

In this investigation we demonstrated the ability of a swarm to explore com-
plex random environments by using different possible swarm behaviours. With
a simple happiness measure, agents can reason over different options and locally
share opinions of these behaviours to adapt to their immediate surroundings.
This created the emergent formation of small teams that worked together to
discover behaviours to react appropriately to their surroundings. In addition
we highlighted the ability of the swarm to solve a complex maze environment
without any external information. This is also given a high number of nine pos-
sible options of swarm behaviours which agents had to choose between. This is
achieved without the need of complex sensing or reasoning at the individual level.
The findings showed that the social exploration algorithm allowed the swarm to
achieve higher coverage and fast exploration when compared to random walk
and dispersion over a set of 250 environments. This was possible through the
swarms capability to adapt their behaviour in real-time to suit the needs of each
individual environment.
Future work will explore how agent happiness could be defined with different
information such as average speed and number of collisions to promote free
movement, and the subsequent effect on the emergent behaviour of the swarm.
We also consider how giving agents greater memory of previous states could
change performance. In addition, we aim to implement our algorithm with real
robots to verify the ability to produce the same emergent properties shown here.

Acknowledgement. This work was funded and delivered in partnership between the
Thales Group and the University of Bristol, and with the support of the UK Engineering
and Physical Sciences Research Council Grant Award EP/R004757/1 entitled “Thales-
Bristol Partnership in Hybrid Autonomous Systems Engineering (T-B PHASE)”.

References
1. Meng, X.B., Gao, X.Z., Lu, L., Liu, Y., Zhang, H.: A new bio-inspired optimisation
algorithm: bird swarm Algorithm. J. Exp. Theor. Artif. Intell. 28(4), 673–687
(2016)
2. Mirjalili, S., Gandomi, A.H., Mirjalili, S.Z., Saremi, S., Faris, H., Mirjalili, S.M.:
Salp swarm algorithm: a bio-inspired optimizer for engineering design problems.
Adv. Eng. Softw. 114, 163–191 (2017)
3. Dorigo, M., Theraulaz, G., Trianni, V.: Swarm robotics: past, present, and future.
Proc. IEEE 109(7), 1152–1165 (2021)
4. Zhu, B., Xie, L., Han, D., Meng, X., Teo, R.: A survey on recent progress in control
of swarm systems. Sci. China Inf. Sci. 60(7), 1–24 (2017)
5. Hanay, Y.S., Ilter, M.: Aggregation, Foraging, and Formation Control of Swarms with Non-Holonomic Agents Using Potential Functions and Sliding Mode Techniques, vol. 15, no. 2, pp. 149–168 (2007)
6. Cardona, G.A., Calderon, J.M.: Robot swarm navigation and victim detection
using rendezvous consensus in search and rescue operations. Appl. Sci. (Switzer-
land) 9(8) (2019)
7. McLurkin, J., Smith, J.: Distributed algorithms for dispersion in indoor environ-
ments using a swarm of autonomous mobile robots. Distrib. Auton. Robot. Syst.
6, 399–408 (2008)
8. Novischi, D.M., Florea, A.M.: Decentralized swarm aggregation and dispersion
with inter-member collision avoidance for non-holonomic multi-robot systems. In:
Proceedings - 2018 IEEE 14th International Conference on Intelligent Computer
Communication and Processing, ICCP 2018, pp. 89–95 (2018)
9. Bayert, J., Khorbotly, S.: Robotic swarm dispersion using gradient descent algo-
rithm. In: IEEE International Symposium on Robotic and Sensors Environments,
ROSE 2019 - Proceedings (2019)
10. Ludwig, L., Gini, M.: Robotic swarm dispersion using wireless intensity signals.
Distrib. Auton. Robot. Syst. 7, 135–144 (2006)
11. Kegeleirs, M., Garzón Ramos, D., Birattari, M.: Random walk exploration for
swarm mapping. In: Althoefer, K., Konstantinova, J., Zhang, K. (eds.) TAROS
2019. LNCS (LNAI), vol. 11650, pp. 211–222. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25332-5_19
12. Pang, B., Song, Y., Zhang, C., Wang, H., Yang, R.: A swarm robotic exploration
strategy based on an improved random walk method. J. Robot. 2019(i) (2019)
13. Lu, Q., Fricke, G.M., Ericksen, J.C., Moses, M.E.: Swarm foraging review: closing
the gap between proof and practice. Curr. Robot. Rep. 1(4), 215–225 (2020)
14. Song, Y., Fang, X., Liu, B., Li, C., Li, Y., Yang, S.X.: A novel foraging algorithm
for swarm robotics based on virtual pheromones and neural network. Appl. Soft
Comput. J. 90, 106156 (2020)
15. Valentini, G., Brambilla, D., Hamann, H., Dorigo, M.: Collective perception of
environmental features in a robot swarm. In: Dorigo, M., et al. (eds.) ANTS 2016.
LNCS, vol. 9882, pp. 65–76. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44427-7_6

16. Mallon, E.B., Pratt, S.C., Franks, N.R.: Individual and collective decision-making
during nest site selection by the ant Leptothorax albipennis. Behav. Ecol. Sociobiol.
50(4), 352–359 (2001)
17. Valentini, G., Ferrante, E., Dorigo, M.: The best-of-n problem in robot swarms:
formalization, state of the art, and novel perspectives. Front. Robot. AI 4(MAR)
(2017)
18. Valentini, G., Hamann, H., Dorigo, M.: Efficient decision-making in a self-organizing swarm of simple robots: on the speed versus accuracy trade-off, pp. 1305–1314 (2014)
19. Talamali, M.S., Saha, A., Marshall, J.A., Reina, A.: When less is more: robot
swarms adapt better to changes with constrained communication. Sci. Robot. 6(56)
(2021)
20. Scheidler, A., Brutschy, A., Ferrante, E., Dorigo, M.: The k-unanimity rule for
self-organized decision-making in swarms of robots. IEEE Trans. Cybern. 46(5),
1175–1188 (2016)
21. Zhao, S.Z., Liang, J.J., Suganthan, P.N., Tasgetiren, M.F.: Dynamic multi-swarm
particle swarm optimizer with local search for large scale global optimization. In:
2008 IEEE Congress on Evolutionary Computation. CEC 2008, pp. 3845–3852
(2008)
22. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization: an overview.
Swarm Intell. 1(1), 33–57 (2007)
23. Tian, D., Shi, Z.: MPSO: modified particle swarm optimization and its applications.
Swarm Evol. Comput. 41(August 2017), 49–68 (2018)
24. Althamary, I., Huang, C.W., Lin, P.: A survey on multi-agent reinforcement learn-
ing methods for vehicular networks. In: 2019 15th International Wireless Commu-
nications and Mobile Computing Conference. IWCMC 2019, pp. 1154–1159 (2019)
25. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: Deep reinforce-
ment learning: a brief survey. IEEE Signal Process. Mag. 34(6), 26–38 (2017)
26. Hogg, E., Hauert, S., Harvey, D., Richards, A.: Evolving behaviour trees for super-
visory control of robot swarms. Artif. Life Robot. 25(4), 569–577 (2020)
27. Cimino, M.G.C.A., Lazzeri, A., Vaglini, G.: Combining stigmergic and flocking
behaviors to coordinate swarms of drones performing target search. In: 2015 6th
International Conference on Information, Intelligence, Systems and Applications
(IISA), pp. 1–6, IEEE, July 2015
28. Hunt, E.R., Jones, S., Hauert, S.: Testing the limits of pheromone stigmergy in
high-density robot swarms. R. Soc. Open Sci. 6(11) (2019)
Stochastic Nonlinear Ensemble Modeling
and Control for Robot Team
Environmental Monitoring

Victoria Edwards(B), Thales C. Silva, and M. Ani Hsieh

Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, USA
vmedw@seas.upenn.edu

Abstract. We seek methods to model, control, and analyze robot teams performing environmental monitoring tasks. During environmental monitoring, the goal is to have teams of robots collect various data throughout a fixed region for extended periods of time. Standard bottom-up task assignment methods do not scale as the number of robots and task locations increases, and they require computationally expensive replanning. Alternatively, top-down methods have been used to combat computational complexity, but most have been limited to the analysis of methods which focus on transition times between tasks. In this work, we study a class of nonlinear macroscopic models which we use to control a time-varying distribution of robots performing different tasks throughout an environment. Our proposed ensemble model and controller maintain desired time-varying populations of robots by leveraging naturally occurring interactions between robots performing tasks. We validate our approach at multiple levels of fidelity, including experimental results, suggesting its effectiveness for environmental monitoring.

Keywords: multi-robot system · control · environmental-monitoring

1 Introduction
Robot teams performing long-term environmental monitoring require methods
to manage different data collection across the team’s varied sensing modalities.
Since environments can be dynamic and uncertain, it is often necessary to vary
the distribution of robots within the workspace over time as they perform their
monitoring tasks. Existing strategies are mostly bottom-up where the objective
is to synthesize individual robot strategies, [14], but are difficult to ensure the
desired global distribution of the team over time. We present an approach to
model and control time-varying distributions of robots with multi-modal sens-
ing capabilities that extends existing top-down methods, [6], by modeling robot
interactions that govern changes in sensing behavior.
To illustrate the importance of our approach, consider the long-term deploy-
ment of a robot team to monitor marine life in a littoral ocean environment.

In these settings, autonomous surface vehicles (ASVs) may be tasked to measure the water depth and take images of the ocean floor but have limited storage
capacity. While the ASVs can simultaneously collect both types of data, the lim-
ited onboard storage capacity may render it undesirable to collect image data
during high tide when ocean floor visibility is reduced. Instead, at high tide some
robots should focus on collecting water depth data while others focus on collect-
ing images, and during low tide more should switch to collecting both depth and
image data. This problem can be thought of as a planning and coverage problem,
which couples the navigation of the robot to where the robots should sense [1].
One approach to the problem of managing teams of robots with multiple
capabilities is to model, plan, and control each individual robot. Microscopic
modeling of teams of robots considers each robot and assigns robots to tasks as
some variant of the resource allocation problem [7]. Many algorithms exist to
solve the task allocation problem [14]. These methods often rely on heuristics to
measure how well-suited a robot is for a specific task, and as such, they do not
scale as the number of robots, number of sensing capabilities, and number of
task locations increase. Furthermore, time-varying microscopic solutions require
computationally expensive replanning [20], and limited analytical tools exist
to study the global properties of the team. As a result of these limitations,
microscopic modeling restricts our ability to achieve the desired time-varying
distributions of robots to collect the desired data.
Alternatively, macroscopic models represent global team behaviors and are
most commonly derived through some mean field approximation with the
assumption that the microscopic system is fundamentally stochastic [12,18].
Macroscopic models have been employed to study and control the steady-state
behavior of both homogeneous [2], and heterogeneous cooperative swarms [13].
These models have also been used to synthesize control strategies to reduce the
uncertainty in the microscopic systems [4,19]. This is accomplished in [19] by
explicitly modeling the first and second order moments of the team distribu-
tion in the workspace. Since macroscopic models are agnostic to team size and
scale in the number of tasks, these approaches have been employed for homo-
geneous as well as heterogeneous robot teams [22,23]. Nevertheless, these top-
down approaches, similar to their bottom-up counterparts, require replanning
when there is a new desired robot distribution because the macroscopic models
employed are fundamentally linear. Various replanning methods for linear ensem-
ble models have been proposed [15,22]; however, these linear mean field models
are fundamentally limited in terms of the complexity of microscopic behaviors
that they can accurately represent [6,10].
Different from recent work [3], we propose a nonlinear ensemble model that
can describe collaborative robot behaviors at the microscopic level and model
the dynamics of time-varying robot distributions in the workspace. The proposed
class of nonlinear ensemble models can explicitly account for robot-robot inter-
actions and opens up new avenues for analyzing robot team behaviors. In this
work, collaboration during an environmental monitoring task means that robots
probe the same event in space simultaneously in a distributed fashion. Since

robots can also change their behaviors according to the overall environmental
conditions, e.g., tidal changes, collaboration refers to the robot’s ability to adjust
its short term goal to attain a global objective. To achieve a representation that
exhibits those complex behavioral modes, we model the macroscopic continuous
dynamics of the team using the Replicator Equations, which are commonly used
in Biology to study natural selection and evolutionary dynamics [25]. Alternative
models have been considered to understand evolutionary dynamics with poten-
tial applications to robotics, but existing work primarily focuses on the continuous analysis of discrete solutions, overlooking the fundamentally stochastic nature of the microscopic multirobot system [5,16,21]. Instead, we use the Replicator
Equations to model a population of robots interacting in a collaborative fashion,
where the collaboration necessitates a time-varying distribution of robots across
possible behavioral modes and/or specific regions in the workspace.
The contributions of this paper are: 1) the development of a nonlinear macroscopic model for a collaborative robot team based on the Replicator Equations, 2) the design of a feedback control strategy, based on the proposed nonlinear macroscopic model, to handle uncertainties during team deployment, and 3) a method to distribute the macroscopic-model-derived feedback control strategy across the team. We show through multi-level fidelity simulations and experimental results that our proposed framework works well for environmental monitoring tasks that require robots with several sensing capabilities to work collaboratively over time.

2 Problem Formulation

We are interested in problems that require robot teams to exhibit time-varying behavior distributions. Unlike existing linear macroscopic ensemble methods
which consider transition times [6], our approach uses collaboration rates, a mea-
sure of the interaction between robots that share spatial proximity, in a class
of nonlinear models. We illustrate our method with two example time-varying
solutions which are not possible with existing linear macroscopic methods.

2.1 Task Topology

Consider a set of M tasks where for each task a robot must exhibit a specific set
of behaviors, for example sensing a specific phenomenon. We build an undirected
graph to represent the tasks and the potential switching between tasks. The graph G = (V, E) has a node for each task, $v_j \in V$ where j = 1, ..., M, and edges $e_{ij} \in E$. If an edge $e_{ij} = 0$ then it is not possible for robots to switch from task i to task j; when $e_{ij} = 1$ it is possible. We assume the graph G is strongly connected, which means that there is a path from each task to any other task in the graph. Note that the existence of an edge does not reflect the distance to a spatially distributed task as in [19,22], but instead the potential for robot collaboration to occur.
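As a minimal illustration of this assumption (a sketch of our own, not from the paper; the three-task adjacency matrix is hypothetical), the undirected task graph can be stored as a symmetric 0/1 matrix and the connectivity assumption checked with matrix powers:

import numpy as np

# Hypothetical task graph for M = 3 tasks; E[i, j] = 1 means robots may
# switch between task i and task j (edge e_ij in G).
E = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

def is_strongly_connected(E):
    # For an undirected graph, strong connectivity reduces to ordinary
    # connectivity: accumulating E^k for k < M reaches every node pair.
    M = E.shape[0]
    reach = np.eye(M, dtype=bool)
    power = np.eye(M)
    for _ in range(M - 1):
        power = power @ E
        reach |= power > 0
    return bool(reach.all())

print(is_strongly_connected(E))  # True for this fully connected example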

2.2 Ensemble Model


Given N robots and M collaborative tasks, let Yi be a random variable which
represents the population of robots at task i, and the vector of random variables
for all tasks be Y = [Y1 ... YM ]T . We present a class of models that describe the
stochastic jump process for how the robot ensemble evolves in time. Consider
two collaborating robot populations, where robots performing task i and robots
performing task j are expected to have spatially close interactions, then
$Y_i + Y_j \xrightarrow{\;k_{ij}\;} 2Y_i,$  (1)
where kij describes the Poisson collaboration rate between the two tasks i and j
that exists for every nonzero edge in the task graph G. The physical intuition for a nonzero collaboration rate at the microscopic level is that when two agents performing different tasks interact in close physical proximity, the result is one agent changing task. Each collaboration rate $k_{ij}$ can be put into a matrix $K \in \mathbb{R}^{M \times M}$, commonly called the payoff matrix. An entry of the payoff matrix, $k_{ij}$, encodes what happens when an interaction occurs between two agents performing task i and task j. Consider a row i of the payoff matrix: anywhere that $k_{ij} = 0$ means that when a robot performing task i interacts with a robot performing task j, nothing happens. We abuse notation by letting the sign of $k_{ij}$ be negative or positive to express two types of switches.
Specifically, if kij > 0 then the process results in a robot changing from task Yj
and starting to perform task Yi , whereas if kij < 0 then the process results in a
robot changing from task Yi and starting to perform task Yj .
The Replicator Equations [25], are the dynamical systems which use the
provided collaboration rates to describe the evolution of different agent behaviors
and can be written in vector form as,
$\dot{Y}_i = Y_i\left[\sum_{j}^{M} k_{ij}Y_j \;-\; \sum_{p}^{M} Y_p \sum_{l}^{M} k_{pl}Y_l\right].$  (2)

The Replicator Equations in (2) can be constrained to satisfy the assumption that $\sum_{i}^{M} Y_i = 1$, such that, if the initial state Y(0) satisfies this constraint, then all future states will satisfy this condition, maintaining a fixed population of robots N [25]. Note that Eq. (1) only considers collaborations between two task types, while there are third-order terms accounted for in the Replicator Equations. This simplification is justified because it is of low likelihood to have three robots of the appropriate type collaborating.
This formulation allows us to consider the collaborations between robot pop-
ulations and study how spatial robot-robot interactions change the ensemble
dynamics. The Replicator Equations in (2) will behave differently for different
structures of the payoff matrix, K, because each entry can be 0 or nonzero
depending on the allowed collaborations specified by G. We will consider two
relevant examples throughout the paper which have equilibria that guarantee time-varying distributions of robots that could be used in the littoral ocean monitoring scenario.
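Equation-wise this class of models is compact enough to transcribe directly. The sketch below (our own illustration; the payoff values are placeholders, not from the paper) evaluates the replicator vector field of Eq. (2) for an arbitrary payoff matrix and checks numerically that the population fractions stay on the simplex:

import numpy as np

def replicator_rhs(Y, K):
    # Eq. (2): Ydot_i = Y_i [ (K Y)_i - Y^T K Y ]
    fitness = K @ Y        # sum_j k_ij Y_j, payoff of task i
    average = Y @ fitness  # sum_p Y_p sum_l k_pl Y_l, population average
    return Y * (fitness - average)

# Placeholder signed payoff matrix respecting the switching convention.
K = np.array([[ 0.0,  0.5, -0.3],
              [-0.5,  0.0,  0.2],
              [ 0.3, -0.2,  0.0]])
Y = np.array([0.2, 0.3, 0.5])

# Forward-Euler check that sum_i Y_i = 1 is (approximately) preserved.
for _ in range(5000):
    Y = Y + 1e-3 * replicator_rhs(Y, K)
print(Y, Y.sum())  # fractions drift over time; the total stays ~1.0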

Example 1: Consider the canonical example of the Lokta Volterra model, writ-
ten as the Replicator Equations in (2) as detailed by Hofbauer [11]. This example
has three tasks, M = 3, and the payoff matrix is
$K = \begin{bmatrix} 0 & 0 & 0 \\ k_{10} & 0 & -k_{12} \\ -k_{20} & k_{21} & 0 \end{bmatrix}.$  (3)

The rules from potential interactions to collaboration are

$Y_0 + Y_1 \xrightarrow{\;k_{10}\;} 2Y_1,$
$Y_1 + Y_2 \xrightarrow{\;k_{12}\;} 2Y_2,$
$Y_0 + Y_2 \xrightarrow{\;k_{20}\;} 2Y_0,$
$Y_1 + Y_2 \xrightarrow{\;k_{21}\;} 2Y_2,$  (4)

which correspond to the following differential equations that describe how the processes evolve in time:

$\dot{Y}_0 = Y_0\,(k_{20}Y_0Y_2 - k_{10}Y_0Y_1 + (k_{12}-k_{21})Y_1Y_2),$
$\dot{Y}_1 = Y_1\,(k_{10}Y_0 - k_{12}Y_2 + k_{20}Y_0Y_2 - k_{10}Y_0Y_1 + (k_{12}-k_{21})Y_1Y_2),$
$\dot{Y}_2 = Y_2\,(k_{21}Y_1 - k_{20}Y_0 + k_{20}Y_0Y_2 - k_{10}Y_0Y_1 + (k_{12}-k_{21})Y_1Y_2).$  (5)

The parameters $k_{10}$, $k_{12}$, $k_{20}$, and $k_{21}$ are the collaboration rates and define the result of an interaction between two robots performing task i or task j. The possible outcomes based on the collaboration rate are: keep doing the same tasks, or have one of the robots change task.
This model is of interest to us because it exhibits time-varying population distributions. Consider the following equilibrium for positive robot populations, $Y^d = \left[0,\; \frac{k_{12}}{k_{12}-k_{21}},\; \frac{k_{21}}{k_{21}-k_{12}}\right]^T$. If we linearize Eq. (5) about $Y^d$ and compute the eigenvalues of the linearization, we get $\lambda_0 = 0$, $\lambda_1 = -\frac{i\,k_{21}\sqrt{k_{10}}}{\sqrt{k_{20}}}$, and $\lambda_2 = \frac{i\,k_{21}\sqrt{k_{10}}}{\sqrt{k_{20}}}$. From dynamical systems theory we know that the system is classified as a center. (See Guckenheimer and Holmes, Ch. 1 [9] for a detailed analysis.) This means that for the same set of parameters, K, and different initial positions, Y(0), a unique non-intersecting orbit will occur around the fixed point. The oscillations that result from a center equilibrium produce time-varying distributions of robots. For our littoral ocean monitoring case, we can select parameters that match the tidal charts so that the desired sensing measurements are performed for representative amounts of time, without wasting storage resources on unnecessary data or compute resources on replanning.
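The closed orbits implied by the center can be checked numerically. The sketch below (our own verification aid, reusing the reference rates and initial condition reported in Sect. 5) integrates Eq. (5) and confirms that the populations keep oscillating instead of settling:

import numpy as np
from scipy.integrate import solve_ivp

k10, k12, k20, k21 = 2.0, 0.2, 1.5, 0.4   # Sect. 5 reference rates
K = np.array([[ 0.0,  0.0,  0.0],          # Example 1 payoff, Eq. (3)
              [ k10,  0.0, -k12],
              [-k20,  k21,  0.0]])

def rhs(t, Y):
    f = K @ Y
    return Y * (f - Y @ f)                 # replicator field, Eq. (2)

sol = solve_ivp(rhs, (0.0, 40.0), [0.2, 0.2, 0.6], max_step=0.01)
# A center yields sustained oscillation: the range of Y_1 over the
# horizon stays wide rather than shrinking toward a fixed point.
print(sol.y[1].min(), sol.y[1].max())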

Example 2: Consider a four-task example, M = 4, discussed in detail by Schuster et al. [24]. The payoff matrix is written as follows:

$K = \begin{bmatrix} 0 & 1 & -\mu & 0 \\ 0 & 0 & 1 & -\mu \\ -\mu & 0 & 0 & 1 \\ 1 & -\mu & 0 & 0 \end{bmatrix}.$  (6)

The collaboration definitions can be easily obtained by using Eq. (1), and the resulting dynamical system using Eq. (2) comes out to:

$\dot{Y}_1 = Y_1\,(Y_2 - \mu Y_3 + 2\mu Y_1Y_3 + 2Y_2Y_4 - Y_1Y_2 - Y_2Y_3 - Y_4Y_3 - Y_4Y_1),$
$\dot{Y}_2 = Y_2\,(Y_3 - \mu Y_4 + 2\mu Y_1Y_3 + 2Y_2Y_4 - Y_1Y_2 - Y_2Y_3 - Y_4Y_3 - Y_4Y_1),$
$\dot{Y}_3 = Y_3\,(Y_4 - \mu Y_1 + 2\mu Y_1Y_3 + 2Y_2Y_4 - Y_1Y_2 - Y_2Y_3 - Y_4Y_3 - Y_4Y_1),$
$\dot{Y}_4 = Y_4\,(Y_1 - \mu Y_2 + 2\mu Y_1Y_3 + 2Y_2Y_4 - Y_1Y_2 - Y_2Y_3 - Y_4Y_3 - Y_4Y_1).$  (7)

Notice that for $k_{ij} = 1$ an interaction results in a change from task j to task i, and in the case that $k_{ij} = -\mu$ the result is a change from task i to task j. Again, this model is of interest because of equilibrium behavior which provides alternative time-varying population distributions. Schuster et al. proposed the following equilibrium $Y^d = \left[\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}\right]^T$, and if we linearize Eq. (7) about $Y^d$ and compute the eigenvalues of the linearization, we get $\lambda_0 = \frac{-1+\mu}{4}$, $\lambda_1 = \frac{-1-\mu}{4}$, $\lambda_2 = \frac{-i+\mu}{4}$, and $\lambda_3 = \frac{i+\mu}{4}$. The analysis that Schuster et al. performed concluded that at the equilibrium point $Y^d$, $\mu$ is a bifurcation parameter. When $\mu < 0$ the system is stable because the real parts of all eigenvalues are negative. At $\mu = 0$ the system is at the bifurcation point, where the imaginary axis is crossed and the negative real parts of $\lambda_2$ and $\lambda_3$ vanish. Finally, when $\mu > 0$ the system exhibits a stable Hopf bifurcation for small enough $\mu$. (See Schuster et al. [24], Sect. 4 for further details on the analysis.) The existence of a stable Hopf bifurcation means that an attracting limit cycle exists in some neighborhood of the equilibrium, so the resulting robot populations will fluctuate at regulated intervals governed by the limit cycle. In the littoral ocean monitoring task, we can break the tides down into four periods of interest and pick parameters to match, such that different sensing happens based on local interactions during the corresponding time with minimal replanning or information exchange.
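The bifurcation can also be checked without redoing the algebra. Our own numerical sketch below builds the payoff matrix of Eq. (6), forms a finite-difference Jacobian of the replicator field at $Y^d$, and watches the real part of the complex eigenvalue pair change sign with $\mu$:

import numpy as np

def rhs(Y, K):
    f = K @ Y
    return Y * (f - Y @ f)                 # replicator field, Eq. (2)

def numerical_jacobian(Y, K, h=1e-7):
    M, f0 = len(Y), rhs(Y, K)
    J = np.empty((M, M))
    for j in range(M):
        Yp = Y.copy()
        Yp[j] += h
        J[:, j] = (rhs(Yp, K) - f0) / h
    return J

def K_of_mu(mu):
    return np.array([[0.0, 1.0, -mu, 0.0],   # payoff matrix, Eq. (6)
                     [0.0, 0.0, 1.0, -mu],
                     [-mu, 0.0, 0.0, 1.0],
                     [1.0, -mu, 0.0, 0.0]])

Yd = np.full(4, 0.25)                        # equilibrium Y^d
for mu in (-0.1, 0.0, 0.1):
    eigs = np.linalg.eigvals(numerical_jacobian(Yd, K_of_mu(mu)))
    # The complex pair's real part should cross zero at mu = 0,
    # signalling the Hopf bifurcation described above.
    print(mu, np.sort_complex(np.round(eigs, 4)))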

3 Ensemble Model Control

Our goal is to maintain proximity to an equilibrium of interest which provides time-fluctuating distributions of robots performing different behaviors. Previous work has proposed controlling different aspects of the linear mean field model by considering different nonlinear feedback strategies [4,19]. We aim to maintain a trajectory by actuating on the collaboration rates to control how the interaction with other agents influences the switch to different behaviors. The macroscopic
continuous solutions to our proposed models give deterministic behavior, and will
result in constant oscillations either around the center equilibrium in Example
1 or the limit cycle equilibrium in Example 2. However, because our models are
representations of stochastic processes we know that due to variability in the
robot-robot and robot-environment interactions our desired distributions may
not be achieved over time. This is important because, if the environment we are monitoring has specific windows in which phenomena can be observed, it is necessary that we stay near the desired distributions.

3.1 Feedback Controller

We propose a modification to the collaboration rate, $k_{ij}$, to take as feedback a desired trajectory, $Y_i^*(t)$. The value of $Y_i^*$ is obtained from the numerical solution of the continuous macroscopic differential equations, Eq. (2), with the desired parameters. Consider the following trajectory-tracking controller:

$k_{ij} = \alpha_{ij}\left(\frac{Y_i^*(t)}{Y_i(t)} - 1\right),$  (8)

where $\alpha_{ij}$ is a tunable parameter, $Y_i^*(t)$ is the desired trajectory and $Y_i(t)$ is the current population fraction at time t. If we plug this new collaboration rate into Eq. (1) we get:

$Y_i + Y_j \xrightarrow{\;\alpha_{ij}\left(\frac{Y_i^*}{Y_i} - 1\right)\;} 2Y_i.$  (9)

After substituting Eq. (8) into the Replicator Equations in (2) we get:

$\dot{Y}_i = \sum_{j}^{M} Y_j\left[\alpha_{ij}\left(\frac{Y_i^*}{Y_i} - 1\right)Y_i \;-\; \sum_{p}^{M}\sum_{l}^{M} Y_l\,\alpha_{pl}\left(\frac{Y_p^*}{Y_p} - 1\right)Y_p\right]$
$\phantom{\dot{Y}_i} = \sum_{j}^{M} Y_j\left[\alpha_{ij}(Y_i^* - Y_i) \;-\; \sum_{p}^{M}\sum_{l}^{M} \alpha_{pl}\,Y_l\,(Y_p^* - Y_p)\right].$  (10)

The resulting $(Y_i^* - Y_i)$ acts as a proportional error term which regulates the population based on the desired trajectory. Notice that, as before, some collaboration rates will be negative, which switches the resulting collaboration outcome. The same substitution can be replicated for both Example 1 and Example 2.
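In a simulation loop the controller amounts to recomputing the rates from the tracking error at every step. A minimal sketch (our own illustration; the gain values are placeholders):

import numpy as np

def feedback_rates(alpha, Y_star, Y, eps=1e-9):
    # Eq. (8): k_ij = alpha_ij (Y_i^*(t) / Y_i(t) - 1); row i carries
    # the proportional error of task i.
    error_ratio = Y_star / np.maximum(Y, eps) - 1.0
    return alpha * error_ratio[:, None]

# Placeholder gains alpha_ij (zero where the task graph has no edge).
alpha = np.array([[0.0, 0.0, 0.0],
                  [2.0, 0.0, 0.2],
                  [1.5, 0.4, 0.0]])
Y      = np.array([0.30, 0.30, 0.40])   # current population fractions Y(t)
Y_star = np.array([0.20, 0.20, 0.60])   # desired trajectory sample Y*(t)

K = feedback_rates(alpha, Y_star, Y)
f = K @ Y
print(Y * (f - Y @ f))  # one evaluation of the controlled field, cf. Eq. (10)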

3.2 Distributed Microscopic Algorithm

To demonstrate that the proposed nonlinear stochastic representations are beneficial for modeling and controlling robot teams, we outline an idealized microscopic scenario which acts as a proxy for teams performing environmental

monitoring. The goal is to find the area of interaction, $a_{ij}$, and corresponding radius of interaction, $r_{ij}$, which are microscopic instantiations of the macroscopic collaboration rate $k_{ij}$. If there is a nonzero collaboration rate, $k_{ij} \neq 0$, then we know that when robots interact within a certain area a task switch will occur. Notice that if the collaboration rate is zero, then the robots may interact but this interaction will not result in collaboration. The area of interaction is directly related to the collaboration rate, which is a measure of how close two agents need to be to one another for a task switch to happen. Let $\hat{Y}_i$ be the population count for robots performing task i; then the area of interaction, $a_{ij}$, is computed as follows:

$a_{ij} = \frac{A\,k_{ij}}{v_{ij}\,\hat{Y}_i\,\hat{Y}_j},$  (11)

where A is the area of the environment to monitor, $k_{ij}$ is the collaboration rate, $v_{ij}$ is the relative speed of robot i and robot j, and $\hat{Y}_i\hat{Y}_j$ counts all disjoint pairs of collaborations possible given the populations of robots performing these tasks. The robot radius of interaction is $r_{ij} = \sqrt{a_{ij}/\pi}$.
Incorporating the proposed feedback controller, Eq. (8), consider

$a_{ij} = \frac{A\,k_{ij}}{v_{ij}\,\hat{Y}_j\,\frac{1}{\hat{Y}_i^{*}-\hat{Y}_i}} = \frac{A\,k_{ij}\,(\hat{Y}_i^{*}-\hat{Y}_i)}{v_{ij}\,\hat{Y}_j},$  (12)

where $\hat{Y}_i^* = Y_i^* N$ is the re-scaling of the desired trajectory population fraction by the total number of robots in the system. The intuition for the feedback term is that when $\hat{Y}_i$ is far from $\hat{Y}_i^*$ the resulting area of interaction increases; likewise, when $\hat{Y}_i$ is close to $\hat{Y}_i^*$ the area of interaction decreases. The interaction area is updated over time as the microscopic simulations run.
We extend our controller to consider distributed estimation of the population
dynamics. Similar to other methods, [3,15], we are proposing a local estimation
method of robots performing tasks. Consider that each robot has a fixed sens-
ing radius, Ri , and the ability to determine which robots within its radius are
performing which behaviors. The neighborhood of the robot, Ni , is every robot
within Ri , and the population estimate for robots performing different i tasks
within Ni is Ỹi . This method assumes that within the sensing radius, Ri , there
will be a sufficient density of robots to act as a proxy for the overall distribu-
tions of robot behaviors. Substituting Ỹi for Ŷi in Eq. (12) gives a new locally
computed interaction area. Note to scale the desired trajectory, Ỹi∗ = Yi∗ Ni .
Using local estimates and an appropriately scaled desired trajectory we have
eliminated the need for global information in our microscopic experiments.

4 Simulation and Experimental Setup

We used three levels of fidelity to evaluate our proposed approach: macro-discrete simulation trials, microscopic simulation trials, and mixed reality experimental

trials. Parameters for all results were selected to be near our equilibria of interest
for both Example 1 and Example 2, which ensure that we see time-varying
distributions of robots performing tasks.
Our macro-discrete simulations used the Stochastic Simulation Algorithm (SSA) from Gillespie [8]. SSA takes in initial populations of robots $\hat{Y}(0)$ and collaboration rates K. To determine the next time and type of event, SSA uses a pseudo-random number generator. While the total time and total number of transitions are not exceeded, the following two steps are taken iteratively. First, the next event is determined by considering the likelihood of a certain task switch based on the current population counts $\hat{Y}_i$ and the collaboration rates $k_{ij}$. Each collaboration type has the following scaled collaboration rate:

$b_{ij} = k_{ij}\,\hat{Y}_i\,\hat{Y}_j.$  (13)


Select a random number $r_1$ and scale it by $b_n = \sum_{i}^{M}\sum_{j}^{M} b_{ij}$, the total of all possible collaborations; the next event is whichever $b_{ij}$ is closest to the scaled random number. Note that the resulting switch is defined by the reactions outlined in Eq. (1), and the population counts $\hat{Y}_i$ and $\hat{Y}_j$ are updated accordingly.
Next, the time $t_p$ is updated to the next time $t_{p+1}$ by generating a random number $r_2$ such that

$t_{p+1} = t_p + \frac{1}{b_n}\log\left(\frac{1}{r_2}\right).$  (14)

Fig. 1. The multi-robot Coherent Structure Testbed (mCoSTe) water tank where the miniature Autonomous Surface Vehicles (mASVs) perform different tasks.

The process continues until the maximum time or maximum number of transitions is reached. An important observation is that SSA ignores robot dynamics and considers only the stochastic jump process, making it mathematically equivalent to microscopic simulations.

Specific to our implementation, the original Replicator Equations in (2) require initial conditions that sum to 1, which is not compatible with SSA. To deal with this we scale the population fractions by N; the result is a shift in the magnitude of the parameters used for the macroscopic continuous solutions and macro-discrete trials. In addition, to prevent behavior extinction, we enforce that each behavior is performed by at least one agent at any given time.
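A minimal version of this loop, sketched below, is our own transcription of the standard Gillespie algorithm [8] rather than the authors' code; the conversion-rate matrix C re-encodes the signed payoff entries as nonnegative per-event rates:

import numpy as np

rng = np.random.default_rng(0)

def ssa(Y, C, t_max):
    # Gillespie SSA over the events Y_i + Y_j -> 2 Y_i, where C[i, j] is
    # the rate at which an encounter converts a task-j robot into task i.
    Y = Y.astype(float).copy()
    t, M = 0.0, len(Y)
    while t < t_max:
        B = C * np.outer(Y, Y)            # scaled rates b_ij, Eq. (13)
        b_n = B.sum()
        if b_n == 0.0:
            break
        t += np.log(1.0 / rng.random()) / b_n          # time step, Eq. (14)
        i, j = np.unravel_index(
            rng.choice(M * M, p=(B / b_n).ravel()), (M, M))
        if Y[j] > 1.0:                    # keep every task populated
            Y[i] += 1.0
            Y[j] -= 1.0
    return Y

# Example 1 rates from Fig. 2(a); k12 and k21 both send task 1 -> task 2
# (Eq. (4)), so their conversion rates add.
k10, k12, k20, k21 = 0.004, 0.0004, 0.003, 0.0008
C = np.array([[0.0,        0.0, k20],
              [k10,        0.0, 0.0],
              [0.0, k12 + k21, 0.0]])
print(ssa(np.array([2, 50, 50]), C, t_max=100.0))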
To further validate our method, microscopic simulation trials were introduced to incorporate robot dynamics. Robots were simulated as ideal point particles with unit velocity, colliding with the boundary of the workspace with an angle of deflection of π/2. A difference from the macro-discrete trials is that robots now compute a direct interaction radius $r_{ij}$ based on the collaboration rates. This means that when an encounter with another robot occurs, if there is a nonzero collaboration rate and the agents are spatially within $r_{ij}$, then the predetermined interaction task switch ensues. Trials were run for 100 s, with a fixed number of robots represented by point particles in an 18 × 18 unit area.
Experimental results were obtained in the multi-robot Coherent Structure Testbed (mCoSTe), Fig. 1, which includes the miniature Autonomous Surface Vehicles (mASVs) and a Multi Robot Tank (MR tank). The MR tank is a 4.5 m × 3.0 m × 1.2 m water tank equipped with an OptiTrack motion capture system. The mASVs are a differential-drive platform, localize using 120 Hz motion capture data, communicate via XBee, and perform onboard processing with an Arduino Fio. Mixed reality experiments were performed to increase the robot density, due to the limited number of mASV platforms available. Experiments use 4 mASVs and 6 idealized point particles with the same configuration as the microscopic simulation trials. To implement our method on the mASVs, we assigned waypoints to robots by randomly selecting positions on the environmental boundary, which enclosed an area of 2.3 m × 1.3 m. During the mixed reality experiments, robots react to simulated agents as if they were in the environment. Behavior switches occur based on the interaction radius $r_{ij}$: when any agent, robot or simulated, is within $r_{ij}$, the predetermined collaboration switch happens. The interaction radius comes directly from the calculation of the area of interaction, Eq. (12), which is the connection between the macroscopic model and the microscopic implementation.

5 Results

Fig. 2. Example 1 has three time-varying tasks; plots include five SSA trials. Figure (a) parameters: $\hat{Y}(0) = [2\ 50\ 50]^T$, with $k_{10} = 0.004$, $k_{12} = 0.0004$, $k_{20} = 0.003$, and $k_{21} = 0.0008$. Figure (b) parameters: $\hat{Y}(0) = [2\ 50\ 50]^T$, with $\alpha_{10} = 0.03$, $\alpha_{12} = 0.003$, $\alpha_{20} = 0.0225$, and $\alpha_{21} = 0.006$.

Our results use multiple levels of fidelity to demonstrate how interactions can induce time-varying distributions of robot behaviors. Where possible, the same initial conditions were used for both Example 1 and Example 2 to help draw comparisons between different types of results. Specifically, the desired trajectories for Example 1 and Example 2 were the same across all fidelity levels. The initial conditions for Example 1 were a reference trajectory of $Y^*(0) = [0.2\ 0.2\ 0.6]^T$ with parameters $k_{10} = 2.0$, $k_{12} = 0.2$, $k_{20} = 1.5$, and $k_{21} = 0.4$. The initial conditions for Example 2 were a reference trajectory of $Y^*(0) = [0.1\ 0.2\ 0.4\ 0.3]^T$ and parameter $\mu = 0.01$. Figures show reference trajectories $Y^*$ as solid lines and resulting population fractions as dots. Further details of the parameters used for each level of fidelity simulation and experiment are given in the corresponding figure captions. For both Example 1 and Example 2 we selected parameters near the equilibria presented in Sect. 2.2 so that time-varying distributions of robot behaviors were observed.

Fig. 3. Example 2 has four time-varying tasks; plots include five SSA trials. Figure (a) parameters: $\hat{Y}(0) = [5\ 10\ 15\ 20]^T$, with $k_{12} = k_{23} = k_{34} = k_{41} = 0.01$ and $\mu = 0.0001$. Figure (b) parameters: $\hat{Y}(0) = [5\ 10\ 15\ 20]^T$, with $\alpha_{12} = \alpha_{23} = \alpha_{34} = \alpha_{41} = 0.1$, and $\mu = 0.001$.

Results of five macro-discrete (SSA) trials are shown in Figs. 2 and 3. In the case without control, Figs. 2a and 3a, each trial roughly follows the desired time-varying trajectory, but the variability is significant. This variability can be attributed to microscopic behaviors not being accounted for during SSA. In the case with control, Figs. 2b and 3b, the trials follow the trajectory closely, as we anticipated. The gains $\alpha_{ij}$ were tuned so that the rate of interactions was sufficiently high; with an insufficient interaction rate, the SSA trials would lag behind the reference trajectory.

Fig. 4. Controlled microscopic results of Example 1. Figure (a) uses global population estimation, and Figure (b) uses distributed population estimation with $R_i = 5$. Both trials use $\hat{Y}(0) = [2\ 25\ 25]^T$, $\alpha_{10} = 2.0$, $\alpha_{12} = 0.2$, $\alpha_{20} = 1.5$, and $\alpha_{21} = 0.4$.

Fig. 5. Controlled microscopic results of Example 2 for two distributed estimation cases. Figure (a) uses sensing radius $R_i = 10$, and Figure (b) uses sensing radius $R_i = 5$. Both trials used $\hat{Y}(0) = [5\ 10\ 20\ 15]^T$, $\mu = 0.05$, $\alpha_{12} = \alpha_{23} = \alpha_{34} = \alpha_{41} = 0.5$.

Microscopic simulation trials are shown in Fig. 4, which provides a comparison between the centralized and distributed approaches. The distributed method shown in Fig. 4b was not as accurate but roughly follows the trajectory, which is to be expected. In Fig. 5 a direct comparison between different sensing radii is studied: as $R_i$ shrinks, the number of neighbors decreases, giving less accurate estimates of the robot task distributions. The rate at which accuracy is lost as $R_i$ shrinks is related to the density of robots in the environment. These microscopic scenarios are the most idealized version of environmental monitoring task assignment.

Fig. 6. Figure (a) shows a centralized mixed reality experiment, and Figure (b) shows a distributed mixed reality experiment with $R_i = 0.25$. The parameters used are the same as those in Fig. 4. The environment area is 2.3 m × 1.3 m and $\hat{Y}(0) = [2\ 4\ 4]^T$.

Our final evaluation demonstrates that using physical robots in a mixed reality setting produces comparable results. In the mixed reality experimental trials, robots and simulated agents respond to one another in the environment, and the desired distributions of behaviors are achieved, as shown in Fig. 6. Notice that in the centralized case, Fig. 6a, the populations closely follow the desired trajectory. The variability in the distributed estimation plot, Fig. 6b, is due to the breakdown of the underlying assumption that, within the local sensing radius $R_i$, the robot observes a representative sample of other robots performing tasks. Video demonstrations of the mixed reality experiments can be found at https://youtu.be/X0QHgYM1s-g.

6 Discussion
Across multiple types of experimental results, we have demonstrated that we
can control time-varying distributions of robots. Our results consider the most
ideal case, where robots move randomly and consistently throughout the envi-
ronment with parameters that were selected near equilibria of interest to ensure
that our desired behavior was observed. There are several directions of future
work which need to be addressed to further enhance this approach. For example,
a more realistic robot behavior should be considered where there may be pro-
longed stops to complete a task or complex non-random behavior, which would change the reaction dynamics of our macroscopic model. Likewise, we have so far selected parameters to show that we can achieve time-varying distributions of robots; in practice, however, we want the team to evolve based on information sensed in the environment. One way to address these nonidealities is to
consider feedback from the microscopic team to the macroscopic model. This
would require taking data collected from the individual robot and transforming
it into a form that is compatible with the macroscopic model, for example by
determining collaboration rate parameters kij .
A direction of great interest to the authors is to consider more broadly the
classes of replicator dynamics in general which achieve our task objective and
potentially other more diverse task objectives. An important property of the replicator dynamics which we have identified with our two examples is the need for equilibria to exhibit either limit-cycle or center behavior. For exam-
ple, the resulting oscillations would not exist for parameters or equilibria which
resulted in a payoff matrix that had only real eigenvalues, which we know from
dynamical systems theory results in stable solutions [9].
Currently, we enforce that at least one agent must be performing a task at any
given time to avoid extinction events. Extinction can occur when the parameters
selected drive the populations to have extreme shifts in size, for example, all
agents performing one task. Without our stated assumption, there is no way for
the team to recover the other tasks if extinction happens. A solution would be
to consider equilibria which do not cause such extreme population fluctuations.
Alternatively, there are different nonlinear systems, for example, the Replicator
Mutator equations [21] which introduce variability to the model to help ensure
that only traits which are strong persist.

Finally, during the mixed reality experimental trials, robot collisions can
occur and are non-catastrophic due to the small size and low speed of the mASVs.
The addition of collision avoidance requires an additional scaling term to com-
pute the interaction radius and is a direction for future work. An additional
direction of future work is to consider how to expand experimental results to
further improve the distributed estimation using robot-robot communication.
A potential approach similar to [17], would introduce more stable results by
considering a history of observations made about neighbors.

7 Conclusion
In this work, we address the limitations of existing macroscopic modeling tech-
niques to model and control time-varying distributions of robot behaviors. This
is of particular importance for environmental monitoring tasks, where natural phenomena, for example tidal shifts, might require the populations of robots to change behavior. Our proposed methods use robot-robot interactions, which
are explicitly ignored by existing linear mean field models. The result is a rich set
of time-varying robot behavior distributions that were not previously achievable.
In addition, to counteract known uncertainty during robot team deployment, we
propose a feedback control strategy which follows a specified desired distribution.
Our macro-discrete results show that we get improved control over the model,
and the idealized microscopic examples support that this is a fruitful avenue for
further study. Future work includes understanding different types of nonlinear
models, and further characterizing the uncertainty by seeking upper and lower
bounds on the macroscopic model.

Acknowledgement. We gratefully acknowledge the support of ARL DCIST CRA W911NF-17-2-0181, Office of Naval Research (ONR) Award No. N00014-22-1-2157, and the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program.

References
1. Almadhoun, R., Taha, T., Seneviratne, L., Zweiri, Y.: A survey on multi-robot
coverage path planning for model reconstruction and mapping. SN Appl. Sci. 1(8),
1–24 (2019)
2. Berman, S., Halász, A., Hsieh, M.A., Kumar, V.: Optimized stochastic policies for
task allocation in swarms of robots. IEEE Trans. Rob. 25(4), 927–937 (2009)
3. Biswal, S., Elamvazhuthi, K., Berman, S.: Decentralized control of multi-agent
systems using local density feedback. IEEE Trans. Autom. Control 67(8), 3920–
3932 (2021)
4. Deshmukh, V., Elamvazhuthi, K., Biswal, S., Kakish, Z., Berman, S.: Mean-field
stabilization of Markov chain models for robotic swarms: computational approaches
and experimental results. IEEE Robot. Autom. Lett. 3(3), 1985–1992 (2018).
https://doi.org/10.1109/LRA.2018.2792696

5. Dey, B., Franci, A., Özcimder, K., Leonard, N.E.: Feedback controlled bifurca-
tion of evolutionary dynamics with generalized fitness. In: 2018 Annual American
Control Conference (ACC), pp. 6049–6054. IEEE (2018)
6. Elamvazhuthi, K., Berman, S.: Mean-field models in swarm robotics: a survey.
Bioinspiration Biomimetics 15(1), 015001 (2019). https://doi.org/10.1088/1748-3190/ab49a4
7. Gerkey, B.P., Mataric, M.J.: Multi-robot task allocation: analyzing the complexity
and optimality of key architectures. In: 2003 IEEE International Conference on
Robotics and Automation (ICRA), vol. 3, pp. 3862–3868. IEEE (2003)
8. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys.
Chem. 81(25), 2340–2361 (1977)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and
Bifurcations of Vector Fields, vol. 42. Springer, Cham (2013). https://doi.org/10.1007/978-1-4612-1140-2
10. Harwell, J., Sylvester, A., Gini, M.: Characterizing the limits of linear modeling of
non-linear swarm behaviors. arXiv preprint arXiv:2110.12307 (2021)
11. Hofbauer, J.: On the occurrence of limit cycles in the Volterra-Lotka equation.
Nonlinear Anal. Theory Methods Appl. 5(9), 1003–1007 (1981)
12. Hsieh, M.A., Halász, Á., Berman, S., Kumar, V.: Biologically inspired redistri-
bution of a swarm of robots among multiple sites. Swarm Intell. 2(2), 121–141
(2008)
13. Hsieh, M.A., Halasz, A., Cubuk, E.D., Schoenholz, S., Martinoli, A.: Specialization
as an optimal strategy under varying external conditions. In: 2009 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), pp. 1941–1946 (2009).
https://doi.org/10.1109/ROBOT.2009.5152798
14. Khamis, A., Hussein, A., Elmogy, A.: Multi-robot task allocation: a review of the
state-of-the-art. Coop. Robots Sens. Netw. 2015, 31–51 (2015)
15. Lee, W., Kim, D.: Adaptive approach to regulate task distribution in swarm robotic
systems. Swarm Evol. Comput. 44, 1108–1118 (2019)
16. Leonard, N.E.: Multi-agent system dynamics: bifurcation and behavior of animal
groups. Annu. Rev. Control. 38(2), 171–183 (2014)
17. Lerman, K., Jones, C., Galstyan, A., Matarić, M.J.: Analysis of dynamic task allo-
cation in multi-robot systems. Int. J. Robot. Res. 25(3), 225–241 (2006). https://doi.org/10.1177/0278364906063426
18. Lerman, K., Martinoli, A., Galstyan, A.: A review of probabilistic macroscopic
models for swarm robotic systems. In: Şahin, E., Spears, W.M. (eds.) Swarm
Robotics, pp. 143–152. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30552-1_12
19. Mather, T.W., Hsieh, M.A.: Distributed robot ensemble control for deployment to
multiple sites. In: Robotics: Science and Systems VII (2011)
20. Nam, C., Shell, D.A.: Analyzing the sensitivity of the optimal assignment in prob-
abilistic multi-robot task allocation. IEEE Robot. Autom. Lett. 2(1), 193–200
(2017). https://doi.org/10.1109/LRA.2016.2588138
21. Pais, D., Caicedo-Nunez, C.H., Leonard, N.E.: Hopf bifurcations and limit cycles
in evolutionary network dynamics. SIAM J. Appl. Dyn. Syst. 11(4), 1754–1784
(2012)
22. Prorok, A., Hsieh, M.A., Kumar, V.: The impact of diversity on optimal control
policies for heterogeneous robot swarms. IEEE Trans. Rob. 33(2), 346–358 (2017)
23. Ravichandar, H., Shaw, K., Chernova, S.: STRATA: unified framework for task
assignments in large teams of heterogeneous agents. Auton. Agents Multi Agent
Syst. 34(2), 38 (2020)

24. Schuster, P., Sigmund, K., Hofbauer, J., Gottlieb, R., Merz, P.: Selfregulation of
behaviour in animal societies. Biol. Cybern. 40(1), 17–25 (1981)
25. Sigmund, K.: A survey of replicator equations. In: Casti, J.L., Karlqvist, A. (eds.)
Complexity, Language, and Life: Mathematical Approaches. Biomathematics, vol.
16, pp. 88–104. Springer, Heidelberg (1986). https://doi.org/10.1007/978-3-642-70953-1_4
A Decentralized Cooperative Approach
to Gentle Human Transportation
with Mobile Robots Based on Tactile
Feedback

Yi Zhang(B), Yuichiro Sueoka(B), Hisashi Ishihara, Yusuke Tsunoda, and Koichi Osuka

Osaka University, Suita, Osaka 565-0871, Japan
{zhang.y,sueoka,y.tsunoda,osuka}@mech.eng.osaka-u.ac.jp, ishihara@ams.eng.osaka-u.ac.jp

Abstract. Cooperative transportation of objects by a group of small mobile robots is expected to be useful at disaster sites. In this study, we aim to transport fragile objects, including humans, which may move during transport, with as little burden as possible. We propose mounting a thick, flexible tri-axis tactile sensor on top of each robot, on which the object rests, for safe support and state monitoring. We improve leader-follower based control by adding force feedback to the leader's control law to prevent excessive force on the object, which is the disadvantage of the typical leader-follower method. We verified in a dynamical simulation and in physical experiments that, with the proposed control law, the robots can transport a rigid body gently, converging their speeds to a common value. Additionally, we found that multijoint objects can also be transported with the proposed method, whereas stable support of the object by each robot is reserved for future work.

Keywords: cooperative transportation · flexible tactile sensor · human transportation

1 Introduction

Recently, cooperative transportation of objects by a swarm of small mobile robots has attracted much attention [1–3]. Because of their scalability, flexibility, and robustness [4–7], swarm robots are expected to be applied to work at
disaster sites [8]. Examples of disaster site work include cooperative removal of
rubble and fallen trees, transportation of various supplies, and injured people.
Since a single robot is not versatile enough to perform such tasks, methods have
been proposed to realize such tasks through the cooperation of multiple robots.
Recently, many studies on cooperative multi-robot transportation systems [1–
3] have been conducted. Cooperative transportation methods can be roughly
divided into pushing-only, grasping, and caging. In particular, many studies

have adopted the method of pushing and pulling objects along the ground for pushing-only and caging. Among these, some studies deal with human transportation [9,10]: in [10], the human is pulled along the ground by swarm robots, which is not gentle to the human. Among systems that carry adult-sized people on top of the robots, [9] uses about three modular robots, and the large size of each module makes it difficult to accommodate different body sizes. Thus, although there is a demand for robots to transport people at disaster sites, there are few studies of cooperative transport systems in which multiple robots carry a human on their top and which are suitable for various body types.
When we consider small robots carrying a fragile object on their top, the materials covering the contact surface should be soft. If the robot parts are made of hard materials, a substantial force will occur on the contact surface between the robot and the transported object, which may damage the object. Safe interaction is necessary, and objects such as humans need to be treated gently, which can be achieved by using flexible materials [14,15].
Another advantage of using soft materials as a contact surface is that by
introducing a sensor to measure the deformation of the contact surface, each
robot can determine its movement so that no extra force is applied to the object.
According to the action-reaction law, if we move the robot so that the defor-
mation of the contact surface is reduced, the force applied to the object is also
reduced. In most previous studies, by contrast, force information is measured at the joints rather than at the surface that handles the object.
Here, we consider the implementation of flexible tactile sensors that measure
the force through deformation on the contact surface of all robots in a leader-
follower system. Leader-follower is a typical cooperative transportation method
in which the leader decides the direction of the transport, and the followers
move the object to assist the leader’s effort [16–20]. We consider the method
suitable for transporting humans, who may change their position during trans-
port, because the followers can keep following each body part based on the force
information even when the person changes their position. However, the trans-
portation is not gentle if the leader robot runs forward without paying attention
to the overall transportation situation. In [19,20], the control law is designed
so that the leader tries to run at the desired velocity at all times. This will let the
leader exert excessive force on the object when the desired velocity is suddenly
changed. These situations include the beginning of the transport and emergen-
cies where the leader detects a danger such as a soon-to-collapse house and needs
to evacuate as soon as possible. This is why the control of a leader robot also
needs to include feedback from its force sensor.
Therefore, in this study, we introduce force feedback into the control law for
the leader, as a major difference from the conventional leader-follower based
method. Using the improved leader-follower method, in situations when the
leader tries to give a speed command far from the current system’s speed, the
designed control law will restrict the force the leader applies to the object by
moving toward the direction where force is applied, i.e. slowing down when force
102 Y. Zhang et al.

is applied in the direction opposite to the desired velocity. Since each robot's control is completed by its own sensors and controller, there is no need for communication with other robots; the system is fully autonomous and decentralized.
Thus, this study adopts an approach to gentle transportation and safe inter-
action without applying excessive force by improving both the hardware and
the software. We define gentle as the property of having a surface that enables safe interaction with humans while not applying undue force to the transported subject. This is achieved by adopting a robot equipped with a
thick and flexible tri-axis tactile sensor and an improved leader-follower based
control which includes force feedback in the leader’s control.
In this paper, firstly, an autonomous decentralized controller for cooperative
transport is designed based on the force information from the flexible sensors.
Then, a model including the flexible sensor’s dynamics is constructed. Speed
convergence, consensus-speed characteristics and gentleness to the object under the designed autonomous decentralized control are analyzed through simulations and physical experiments using a rigid object. In addition, we discuss the performance
of the extended application of the designed controller to multijoint transport
through transport experiments with a replica of a half-size skeleton.
Contributions: I. An approach to gentle transport without applying excessive
force by improving both hardware and control laws; II. A modeling method of
the system considering the flexible tactile sensor; III. Verification of extensibility
of the controller to multijoint objects.

2 Methodology
In this section, we discuss the mechanism of the flexible tactile sensor and the
control law for gentle cooperative transportation which uses the flexible tactile
sensor’s reading values.

2.1 Flexible Tactile Sensors and Methods of Acquiring Force Information

In recent years, various flexible tactile sensors have been developed [11–13] and
their design methods have been established [11]. In this study, a force sensor with a flexible thick contact layer that can measure triaxial force, developed by Kawasetsu et al. [12], was adopted for the part that interacts with the object. The force exerted on the object during transport is measured at
the contact area, enabling transport without exerting excessive force on the
object. The flexible tactile sensor consists of two main parts: the flexible contact
part and the circuit board part which includes a 4-channel inductance digital
converter (LDC1614EVM, TEXAS INSTRUMENTS). The flexible contact part
has a disk-shaped magnetorheological elastomer (ferromagnetic marker) in the
center that deforms according to the force applied. In [12], the silicon rubber
Gentle Human Transportation Based on Tactile Feedback 103

Fig. 1. (a) shows the mechanism of the measurement using the flexible tactile sensor’s
deformation. (b) shows the coil position indexing for force related values’ calculation.

ECOFLEX 00-30 (Smooth-On) was used, whereas in this study ECOFLEX 00-20 (Smooth-On), a softer silicone rubber, was used to handle objects more gently and to increase sensor sensitivity. The magnetorheological elastomer in the center is an elastomer containing iron powder. Deformation of the magnetorheological elastomer changes the inductance values $L_i \in \mathbb{R}$, $i \in \{1, 2, 3, 4\}$, measured with the four coils as shown in Fig. 1; these can be used to determine $L_x$, $L_y$ and $L_z$ as in Equations (1)–(3), which are known to correspond linearly to the forces along the three axes. We calibrate the sensor by centering each $L_i$ to zero when no load is on the sensor. During the experiment, when force is applied to the sensor, we obtain three values that are proportional to the force as follows:

$L_x = (L_1 + L_4) - (L_2 + L_3),$  (1)
$L_y = (L_1 + L_2) - (L_3 + L_4),$  (2)
$L_z = L_1 + L_2 + L_3 + L_4.$  (3)

Force can be calculated from $L_x$, $L_y$ and $L_z$ for each corresponding direction after calibration. However, we use $L_x$ and $L_y$ in the control law directly because
of their linear relationship with the forces [12].
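To make the measurement pipeline concrete, the following sketch shows how the force-proportional values could be computed from the four coil readings, following Equations (1)–(3). The `read_inductances` callable standing in for the LDC1614EVM driver, the class name, and the calibration sample count are illustrative assumptions, not part of the paper.

```python
import numpy as np

class TactileSensorReader:
    """Converts the four coil inductances into force-proportional values
    (Eqs. 1-3). `read_inductances` is a hypothetical stand-in for the
    LDC1614EVM driver and must return [L1, L2, L3, L4]."""

    def __init__(self, read_inductances):
        self.read_inductances = read_inductances
        self.offset = np.zeros(4)

    def calibrate(self, n_samples=100):
        # Center each Li at zero under no load, as described in the text.
        samples = np.array([self.read_inductances() for _ in range(n_samples)])
        self.offset = samples.mean(axis=0)

    def force_values(self):
        L1, L2, L3, L4 = np.asarray(self.read_inductances()) - self.offset
        Lx = (L1 + L4) - (L2 + L3)   # Eq. (1)
        Ly = (L1 + L2) - (L3 + L4)   # Eq. (2)
        Lz = L1 + L2 + L3 + L4       # Eq. (3)
        return Lx, Ly, Lz
```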

2.2 Design of Control Law

Consider a situation where not all robots know the path along which the object should move. In this case, the robot that knows the path to the goal can act as the leader and drive the object forward. The remaining follower robots can sense the direction the leader wants to go from the force information transmitted through the object, and then follow the leader. No communication between robots is needed. However, the transportation is not gentle if the leader runs forward without paying attention to the overall transportation situation or to movements made by a human subject, such as stretching a limb. Therefore, the control of the leader robot also needs to include feedback from
the force sensor. The system schema is shown in Fig. 2.

Fig. 2. Cooperative transportation using the leader-follower model.

For the follower robot’s control law, [LF x , Ly ]


F T
obtained from the flexible
tactile sensor is multiplied by a gain KF ∈ R to determine the follower’s speed
F
input [vin,x , vin,y
F
]T as shown in Equation (4).
 F   F
vin,x Lx
= K (4)
vin,y
F F
LFy

For the leader robot’s control law, in addition to the desired velocity v d ∈ R2
for leader along the predetermined path, [LL x , Ly ] obtained from the flexible
L T

tactile sensor is multiplied by a gain KL ∈ R to determine the leader’s speed


L
input [vin,x , vin,y
L
]T as shown in Equation (5). Note that KF and KL are con-
stants with the dimension of M −1 T .
 L   L
vin,x Lx
= KL + vd (5)
vin,y
L
LLy

The control law is designed so that the leader decreases its speed when force is applied in a direction opposite to the desired velocity. Due to this feature, the system's velocity does not converge to $v_d$ but to some velocity $v_c$ where $|v_c| < |v_d|$.
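A minimal sketch of the two control laws in Equations (4) and (5), assuming the sensor values and gains are already available; the function names are ours, not the authors'.

```python
import numpy as np

def follower_velocity(L_xy, K_F):
    """Follower speed input, Eq. (4): v_in = K_F * [Lx, Ly]^T."""
    return K_F * np.asarray(L_xy)

def leader_velocity(L_xy, K_L, v_d):
    """Leader speed input, Eq. (5): v_in = K_L * [Lx, Ly]^T + v_d.

    A force component opposing v_d reduces the commanded speed, which
    is what makes the transport gentle.
    """
    return K_L * np.asarray(L_xy) + np.asarray(v_d)
```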

3 Modeling and Simulation


Firstly, we need to construct a model of the robots’ behavior that considers the
dynamics of the system which includes the flexible tactile sensor. Since the leader
robot’s control also contains feedback of the force it receives, the system does
not simply converge to the leader robot’s desired speed. The effects of various
parameters on convergence and the value of consensus speed were investigated
through simulations.

3.1 Modeling
In this model, we consider the transportation of a rigid body along a one-dimensional path for simplicity. Since silicone rubber generally has elasticity and damping
elements, we model the flexible tactile sensor as a spring and a damper with the
modulus of elasticity and damping coefficient denoted by k and c, respectively.

Fig. 3. Modeling schematic diagram. All the flexible sensors mounted on the robots
are assumed to have the same elasticity and damping coefficient. The initial value of
each robot’s and object’s position and speed is set to be 0.

When the object is too heavy for a single leader to move, the followers cannot sense any force from the object in the first place, so we allow multiple leaders in our model. Thus, consider a system with p leaders, q followers and an object, where the positions of the leaders, the object and the followers are denoted by $\{x_0, x_1, \ldots, x_{p-1}\}$, $x_p$ and $\{x_{p+1}, x_{p+2}, \ldots, x_{p+q}\}$, respectively. The mass of each robot is m and that of the object is M. In the initial state, the object
is placed on the robots so that the forces are balanced among all robots. Let
xi ∈ R, {i ∈ Z | 0 ≤ i ≤ p + q} be defined as zero when all robots are in the
initial state, i.e., at the position where the sensor springs are at their initial
length. The model schema is shown in Fig. 3. There is no direct interaction
between the robots. We assume there is no sliding between the object and each
robot.
For the dynamics model, we only consider the dynamics of the object for sim-
plicity. We consider rolling resistance for wheel resistance, which is the main resis-
tance in the low-speed region. The rolling resistance is a force directly applied
to the robot, but we consider it hypothetically applied to the object since it is
assumed that there is no sliding between the object and the robot. We denote the resistance coefficient by Ur. Thus, the dynamics of leaders, followers and the
object can be expressed as in Equations (6)–(8). The leaders’ and followers’
speed change according to the control law and the object moves according to
the summation of forces it receives from all robots.

$$\dot{x}_i = v_d + K_L \left( -k(x_i - x_p) - c(\dot{x}_i - \dot{x}_p) \right), \quad 0 \le i \le p-1 \quad (6)$$

$$M\ddot{x}_p = \sum_{i=0}^{p-1}\left(-k(x_p - x_i) - c(\dot{x}_p - \dot{x}_i) - U_r\,\frac{\dot{x}_i}{|\dot{x}_i|}\Bigl(\frac{M}{p+q} + m\Bigr)g\right) + \sum_{j=p+1}^{p+q}\left(-k(x_p - x_j) - c(\dot{x}_p - \dot{x}_j) - U_r\,\frac{\dot{x}_j}{|\dot{x}_j|}\Bigl(\frac{M}{p+q} + m\Bigr)g\right) \quad (7)$$

$$\dot{x}_j = K_F \left( -k(x_j - x_p) - c(\dot{x}_j - \dot{x}_p) \right), \quad p+1 \le j \le p+q \quad (8)$$



Fig. 4. (a) shows the force applied to each robot and the object, (b) shows the speed
of the leader, follower and object during the base case simulation.

3.2 Simulation

In this section, we use the model constructed in the previous section to inves-
tigate the convergence and trends of consensus speed for different parameters.
We define the consensus speed as the common speed after the leaders and followers reach consensus. The governing equations and initial conditions are identical for all leader robots, so their solutions coincide; the same holds for the followers. Therefore, only one leader and one follower are considered in the simulation, and in the following graphs the leader's state represents the state of all leaders, and likewise for the followers.
We consider a base case that has a certain configuration and investigate the
influence of parameters by comparing other cases with the base case. The param-
eters include the object’s weight, leaders’ number, followers’ number, leader’s
gain and follower’s gain in the control law, sensor’s elasticity, damping effect and
leader’s desired speed. The parameter value settings for each case are described
in Table 1. We set Ur to be 0.01 in all cases.
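The following is a minimal sketch of how the base case could be integrated numerically with Euler steps. The robot mass m, gravitational acceleration g, time step, and horizon are placeholder assumptions not specified in the text; note that Equations (6) and (8) are implicit in the robot velocities through the damping term, so they are solved in closed form at each step.

```python
import numpy as np

def simulate(p=1, q=2, M=5.0, KL=0.1, KF=0.2, k=20.0, c=0.03, vd=3.0,
             m=1.0, Ur=0.01, g=9.81, dt=1e-3, T=10.0):
    """Euler integration of the 1-D model in Eqs. (6)-(8).

    State layout: leaders x[0:p], object x[p], followers x[p+1:p+q+1].
    m, g, dt and T are placeholder values; the rest is the Table 1 base case.
    """
    n_steps = int(T / dt)
    x = np.zeros(p + q + 1)
    v = np.zeros(p + q + 1)
    ob = p
    speeds = np.zeros((n_steps, p + q + 1))
    for s in range(n_steps):
        vp = v[ob]
        # Eqs. (6) and (8) are implicit in the robot velocity via the
        # damping term, so solve them in closed form at each step.
        v[:ob] = (vd - KL * k * (x[:ob] - x[ob]) + KL * c * vp) / (1 + KL * c)
        v[ob + 1:] = KF * (-k * (x[ob + 1:] - x[ob]) + c * vp) / (1 + KF * c)
        # Eq. (7): the object feels the spring-damper reaction of every robot
        # minus rolling resistance, signed by each robot's motion direction
        # (np.sign(0) = 0, i.e. no resistance at rest).
        F = 0.0
        for i in range(p + q + 1):
            if i == ob:
                continue
            F += (-k * (x[ob] - x[i]) - c * (v[ob] - v[i])
                  - Ur * np.sign(v[i]) * (M / (p + q) + m) * g)
        x += v * dt
        v[ob] += (F / M) * dt
        speeds[s] = v
    return speeds
```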
We show the speed and force graphs in Fig. 4. The speeds of the leader and follower reached a consensus at a common value. In both the speed and force graphs, we can observe the gentleness to the object during the first two seconds: when the leader tries to run too fast or applies too much force to the object, it restricts its speed according to the designed control law.
The rest of the cases also showed speed consensus. The speed after consensus
of each case is shown in Table 1. It is shown that when the value of leader number,
follower gain, or leader desired speed increases, the speed after consensus is
higher. On the other hand, if the weight of the object, follower number, or leader
gain increases, the consensus speed becomes lower. The elasticity and damping
effect of the sensor does not affect the consensus speed. More specifically, the
elasticity of 1 to 400 and damping effect of 0.001 to 5 resulted in the same
consensus speed with other parameters fixed, though they affect the length of

Table 1. Simulation settings and results.

| Cases | p | q | M | KL | KF | k | c | vd | Speed after consensus |
|---|---|---|---|---|---|---|---|---|---|
| Base case | 1 | 2 | 5 | 0.1 | 0.2 | 20 | 0.03 | 3 | 1.48 |
| Heavier object | 1 | 2 | 9 | 0.1 | 0.2 | 20 | 0.03 | 3 | 1.46 |
| More leaders | 2 | 2 | 5 | 0.1 | 0.2 | 20 | 0.03 | 3 | 1.98 |
| More followers | 1 | 3 | 5 | 0.1 | 0.2 | 20 | 0.03 | 3 | 1.18 |
| Higher KL | 1 | 2 | 5 | 0.15 | 0.2 | 20 | 0.03 | 3 | 1.17 |
| Higher KF | 1 | 2 | 5 | 0.1 | 0.4 | 20 | 0.03 | 3 | 1.97 |
| Sensor with more elasticity | 1 | 2 | 5 | 0.1 | 0.2 | 40 | 0.03 | 3 | 1.48 |
| Sensor with more damping effect | 1 | 2 | 5 | 0.1 | 0.2 | 20 | 0.08 | 3 | 1.48 |
| Higher leader desired speed | 1 | 2 | 5 | 0.1 | 0.2 | 20 | 0.03 | 10 | 4.98 |

time for the system’s speed to converge. With elasticity larger than the range,
convergence still can be achieved by adjusting the gain. Conditions for stability
will be explored in future work.

4 Experiments

4.1 Experiment Setup


A three-wheeled omnidirectional robot platform manufactured by Nexus robot
was used to test the performance of the control law. The appearance of the
omnidirectional mobile robot equipped with the flexible contact sensor described
in the previous section is shown in Fig. 5 (a),(b).
Although the simulation was one-dimensional for simplicity in showing the convergence of speed, the control is two-dimensional as stated in Sect. 2.2. For
the control of the robots, the relationship between the velocity of each wheel
v1 , v2 , v3 , and the velocity of the robot vx , vy , vθ with respect to the x, y, and θ
axes can be obtained as in Equation (9) using the inverse kinematics model. Pos-
itive directions of velocities are defined as in Fig. 5 (c). Distance from the center
of the robot to the wheel is defined as r. The sampling period in the experiments is about 0.23 s.

$$\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 & r \\ -1/2 & -\sqrt{3}/2 & r \\ -1/2 & \sqrt{3}/2 & r \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_\theta \end{pmatrix} \quad (9)$$
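A direct transcription of Equation (9), useful for converting a planar velocity command into the three wheel speeds; the function name is ours.

```python
import numpy as np

def wheel_speeds(vx, vy, vtheta, r):
    """Inverse kinematics of the three-wheeled omni platform, Eq. (9).

    r is the distance from the robot center to each wheel; returns
    the wheel velocities [v1, v2, v3].
    """
    J = np.array([[1.0,  0.0,              r],
                  [-0.5, -np.sqrt(3) / 2,  r],
                  [-0.5,  np.sqrt(3) / 2,  r]])
    return J @ np.array([vx, vy, vtheta])
```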

In the experiments, the validity of the system is first verified through the transportation of a rigid object. We show that the system is able to transport an object cooperatively even if the weight of the object and the number of leaders and followers are changed.

Fig. 5. (a): The mechanical structure of the developed robot. Vstone's omni-robot platform, which comes with three omni-wheels, 12 V DC coreless motors (gearbox ratio 80:1), and optical encoders (4 CPR), is used for the experiments. (b): The microcontroller (ESP32-DevKitC V4), motor driver (Cytron MDD3A), bi-directional logic level converter (TXB0108), and a LiPo battery are mounted. (c): Velocity indexing of the omnidirectional robot.

Table 2. Experiments’ conditions

| Cases | p | q | M (kg) | KL (M⁻¹T) | KF (M⁻¹T) |
|---|---|---|---|---|---|
| Base case | 1 | 2 | 5 | 15 | 30 |
| Heavier object | 1 | 2 | 9 | 15 | 30 |
| More leaders | 2 | 2 | 5 | 15 | 30 |
| More followers | 1 | 3 | 5 | 15 | 30 |

Secondly, a half-size mannequin is transported to verify the scalability of the system for transporting multijoint objects.

4.2 Experiment Results

Transportation of Rigid Body. The experiment conditions are described in Table 2. Note that in the experiments, we feed back the values measured with the flexible tactile sensor, which are proportional to the force but are not the exact force.
For the base case result, the flexible tactile sensor’s reading and the speed
calculated using encoder values are shown in Fig. 6. A video recording of the
experiment is shown in [21]. It can be confirmed that the speed changes according
to the control law and the sensor value. During the experiment, the slipping of
the leader’s tires is observed. Thus, though the consensus on speed is confirmed,
there is a difference between the speed of the leader and the speed of the followers.
At 44 s, the object is removed from the robots and the sensor readings returned to nearly zero. The followers stopped moving and the leader started moving at its desired speed vd. Follower 2's sensor reading did not return exactly to zero because of hysteresis in the flexible sensor's shape. For Follower 1, the flexible contact part of the sensor, which includes the magnetorheological

Fig. 6. (a,b,c) show the speed of the robots during the base case experiment, calculated with encoders. (d,e,f) show the flexible force sensor's reading in the direction of movement.

elastomer, came off when the object was removed, leaving a residual sensor reading.
In other cases, we also observed the consensus of speed during the experi-
ments. Time-series graphs of other experiments are omitted. The averaged speed
after the observation of the speed consensus in each case is shown in Table 3. The
consensus speed is calculated using the motion capture software Kinovea. The
trend of the consensus speed under parameter changes described in the simulation section also appears in the experiments, except in the case where a heavier object is used. This might be due to differences in the actual vd between experiments: in all cases, the leader's desired input was set to the same value, but the actual desired speeds differ slightly due to differences in LiPo battery voltage.

Transportation of Multijoint Object–Mannequin. In the cooperative


transportation experiment, a half-sized human skeleton model (85 cm, 4.9 kg)
was placed on six robots that are under the back, waist, arms and legs respec-
tively, as shown in Fig. 7. The human skeleton model is fleshed out with sandbags to mimic a human, and its joints are free to move. The leader was placed under
the waist, and the other five robots acted as followers. The leader was given a
straight path in advance.

Table 3. Experiments’ speed result

Base case Heavier More More


object leaders followers
Speed after consensus (Kinovea, m/s) 0.0507 0.0523 0.0778 0.0485

Fig. 7. The scene during the Mannequin transportation experiment.

Table 4. Experiment consensus speed result calculated using Kinovea and the leader’s
desired speed in each case measured with encoder.

| Cases | Leader | Follower | | | | |
|---|---|---|---|---|---|---|
| Body parts | Upper body | Head | Left arm | Right arm | Left leg | Right leg |
| Average speed after consensus (m/s) | 0.014 | 0.012 | 0.014 | 0.014 | 0.016 | 0.014 |

The result of each robot’s average speed after consensus was calculated using
Kinovea is shown in Table 4. We observed the consensus on speed and success
in transportation. A video of the experiment is shown in [22].
In another experiment, we applied force to arms and legs in which the direc-
tion is in the plane that is being controlled to mimic the situation where a
human moves its limbs. A video of the experiment is shown in [23]. The fol-
lowers moved in the direction of applied force following the designed control
law and the robots successfully tracked the limb’s movement. The compatibil-
ity of the leader-follower based control and multijoint objects’ transportation is
demonstrated.

5 Conclusion and Future Work


We have confirmed the feasibility of transporting objects with flexible tactile sensors in both simulated and real environments, including a skeleton model transportation experiment in the real world. We confirmed the consensus of speed and its trends with the improved leader-follower based control, which applies force feedback to the leader in order to achieve gentle transport. The gentleness to the object is verified from the force data in the simulation. Until now,

little thought has been given to the transport of multijoint objects, or to damage and discomfort to the objects being transported. The system in this paper is an example of research that considers both extensions to multijoint object transport and gentleness to the objects being transported.
The control law designed in this study includes feedback from a force sensor to the leader, so the consensus speed is not simply the desired speed given to the leader. Modeling the system allowed us to qualitatively confirm the resulting differences in consensus speed. In future work, analytically determining the consensus speed will clarify which parameters influence its value, and how, making it possible to design for a target consensus speed. In addition, we plan to reduce the instability due to the roundness of the body by making the current robots smaller, increasing the number of robots and increasing the contact area.

Acknowledgement. This work was supported by JSPS KAKENHI 20H04473,


T21K141830, A21H05104a and JST [Moonshot R&D][Grant Number JPMJPS2032]
and in part by grants-in-aid for JSPS KAKENHI Grant Number JP22K14277. The
authors would like to thank Dr. Kawasetsu for his valuable advice on the flexible tac-
tile sensor and Mr. Ito for discussions on the hardware design.

References
1. Cao, Y., Fukunaga, A., Kahng, A.: Cooperative mobile robotics: antecedents
and directions. Auton. Robot. 4, 7–27 (1997). https://doi.org/10.1023/A:
1008855018923
2. Tuci, E., Alkilabi, M.H., Akanyeti, O.: Cooperative object transport in multi-robot
systems: a review of the state-of-the-art. Front. Robot. AI. 5 (2018). https://doi.
org/10.3389/frobt.2018.00059
3. Schranz, M., Umlauft, M., Sende, M., Elmenreich, W.: Swarm robotic behaviors
and current applications. Front. Robot. AI (2020)
4. Fukuda, T., Ueyama, T., Kawauchi, Y., Arai, F.: Concept of cellular robotic system
(CEBOT) and basic strategies for its realization. Comput. Electr. Eng. 18, 11–39
(1992). https://www.sciencedirect.com/science/article/pii/004579069290029D
5. Parker, L.: Current state of the art in distributed autonomous mobile robotics.
Distrib. Auton. Robot. Syst. 4, 3–12 (2000)
6. Yan, Z., Jouandeau, N., Cherif, A.: A survey and analysis of multi-robot coordi-
nation. Int. J. Adv. Robot. Syst. 10 (2013). https://doi.org/10.5772/57313
7. Brambilla, M., Ferrante, E., Birattari, M., Dorigo, M.: Swarm robotics: a review
from the swarm engineering perspective. Swarm Intell. 7, 1–41 (2013)
8. Nazarova, A., Zhai, M.: The application of multi-agent robotic systems for earth-
quake rescue. In: Robotics: Industry 4.0 Issues & New Intelligent Control Paradigms, pp. 133–146 (2020). https://doi.org/10.1007/978-3-030-37841-7_11
9. Osuka, K., Isayama, S.: Motion control of multi-linked stretcher robot DUCKS.
In: Proceedings Of SICE Annual Conference 2010, pp. 873–874 (2010)
10. Dorigo, M., et al.: The SWARM-BOTS project
11. Wang, H., et al.: Design methodology for magnetic field-based soft tri-axis tactile
sensors. Sensors (Switzerland) 16 (2016)

12. Kawasetsu, T., Horii, T., Ishihara, H., Asada, M.: Flexible tri-axis tactile sensor
using spiral inductor and magnetorheological elastomer. IEEE Sens. J. 18, 5834–
5841 (2018)
13. Wang, H., et al.: Design and characterization of tri-axis soft inductive tactile sen-
sors. IEEE Sens. J. 18, 7793–7801 (2018)
14. Hughes, J., Culha, U., Giardina, F., Guenther, F., Rosendo, A., Iida, F.: Soft
manipulators and grippers: a review. Front. Robot. AI 3 (2016)
15. Kim, S., Laschi, C., Trimmer, B.: Soft robotics: a bioinspired evolution in robotics.
Trends Biotechnol. 31, 287–294 (2013)
16. Asama, H., Matsumoto, A., Ishida, Y.: Design of an autonomous and distributed
robot system: ACTRESS. In: Proceedings. IEEE/RSJ International Workshop on
Intelligent Robots and Systems. (IROS ’89) the Autonomous Mobile Robots and
its Applications, pp. 283–290 (1989)
17. Kosuge, K., Oosumi, T., Chiba, K.: Decentralized control of multiple mobile robots
handling a single object in coordination. J. Robot. Soc. Jpn. 16, 87–95 (1998)
18. Osumi, H., Nojiri, H., Kuribayashi, Y., Okazaki, T.: Cooperative control for three
mobile robots transporting a large object. J. Robot. Soc. Jpn. 19, 744–752 (2001)
19. Wang, Z., Schwager, M.: Kinematic multi-robot manipulation with no communi-
cation using force feedback. In: 2016 IEEE International Conference On Robotics
And Automation (ICRA), pp. 427–432 (2016)
20. Wang, Z., Yang, G., Su, X., Schwager, M.: OuijaBots: omnidirectional robots for
cooperative object transport with rotation control using no communication. In:
Groß, R., et al. (eds.) Distributed Autonomous Robotic Systems. SPAR, vol. 6, pp.
117–131. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73008-0_9
21. Zhang, Y.: Transportation experiment with mobile robots based on tactile feedback
- object. https://youtu.be/i5yjvAJ5VRE
22. Zhang, Y.: Transportation experiment with mobile robots based on tactile feedback
- half-sized skeleton. https://youtu.be/AtM5e7bqSZQ
23. Zhang, Y.: Transportation experiment with mobile robots based on tactile feedback
- moving limb. https://youtu.be/NBYaup4oivA
Sparse Sensing in Ergodic Optimization

Ananya Rao¹(B), Ian Abraham², Guillaume Sartoretti³, and Howie Choset¹

¹ Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15203, USA
ananyara@andrew.cmu.edu
² Department of Mechanical Engineering, Yale University, New Haven, CT 06520-8292, USA
³ Mechanical Engineering Department, National University of Singapore, Singapore 117575, Singapore
https://biorobotics.org/

Abstract. This paper presents a novel, sparse sensing motion planning


algorithm for autonomous mobile robots in resource limited coverage
problems. Optimizing usage of limited resources while effectively explor-
ing an area is vital in scenarios where sensing is expensive, has adverse
effects, or is exhaustive. We approach this problem using ergodic search
techniques, which optimize how long a robot spends in a region based
on the likelihood of obtaining informative measurements which guaran-
tee coverage of a space. We recast the ergodic search problem to take
into account when to take sensing measurements. This amounts to a
mixed-integer program that optimizes when and where a sensor measure-
ment should be taken while optimizing the agent’s paths for coverage.
Using a continuous relaxation, we show that our formulation performs
comparably to dense sampling methods, collecting information-rich mea-
surements while adhering to limited sensing measurements. Multi-agent
examples demonstrate the capability of our approach to automatically
distribute sensor resources across the team. Further comparisons show
comparable performance with the continuous relaxation of the mixed-
integer program while reducing computational resources.

Keywords: Search and coverage · Control · Ergodic search

1 Introduction
Current robotic systems lack the capability to explore and search for information in resource-limited settings. Such a problem occurs in several appli-
cations including planetary science robots with expensive sensors, agricultural
robots with limited available measurements, etc. Currently, arbitrary heuristics
are used to prioritize resources, but this approach does not consider the depen-
dence on what information the robot has acquired and its mobility capabilities.
There is a need for methods that can jointly optimize how robots search and
when to take measurements.
Prior works in coverage, especially those based on ergodic search processes,
assume that the sensor being used is consistently taking measurements, and

therefore, constantly uses energy. In robotic applications, taking a sensor mea-


surement incurs a cost, which this work strives to minimize without compro-
mising the efficacy of autonomous coverage, in terms of the ergodic metric. To
improve the applicability and efficacy of robots, especially in resource limited
environments, we need to consider optimizing over when to record and acquire
sensor measurements (for example, in Fig. 1). In this paper, we pose the prob-
lem of acquiring a set of optimal measurements as a sparse ergodic optimization
problem, where the decision to take a measurement is optimized as a vector of
decision variables, and show that placing an L1 norm can promote sparsity in
this vector.
Much prior work in resource-limited robotics focuses on using sparse data to achieve goals like localization [1] and depth reconstruction [2]. However, these formulations pose complex optimizations, which are hard to solve or require forming heuristics that neglect optimality of sensor measurement placement. Unlike many other information-based search methods, trajectories planned using ergodic search processes can be optimized over longer time horizons, and drive the robot to spend time in areas of the search domain in proportion to the amount of expected information at those locations, thus efficiently balancing exploration and exploitation. Optimizing the ergodic metric ties to better coverage of the a priori information distribution. Furthermore, ergodic search processes yield special structure in the formulation of the optimization problem, which we leverage to formulate sampling decision times that are solutions to a convex optimization problem. The sampling solutions result in a near-optimal set of sensing measurements along the robot's trajectory.

Fig. 1. Proposed Approach for Sparse Ergodic Optimization: Our approach automates how a robot explores an area and what informative measurements to collect in a joint optimization problem. Illustrated above is an example sparse sensing solution for covering a one-dimensional information distribution (gray distribution), where peaks correspond to areas of high expected information, and the green colored bar represents when the robot takes a measurement.
The organization of this paper is as follows: In Sect. 2 we discuss existing
search and coverage methods, as well as recent advances in sparse sensing. A
brief background on ergodic search processes is also presented in Sect. 2. Details
on our sparse ergodic optimization approach, and its application to distributing
sensing resources across a multi-agent team are discussed in Sect. 3. In Sect. 4 we
present and discuss the results of our systematic set of experiments, in which we
observe that our approach can maintain coverage performance while significantly
decreasing sensing costs, in both single and multi-agent settings. Finally, we
present concluding remarks in Sect. 5.

2 Prior Work
Search and Coverage Methods: Search and coverage methods have wide-
spread uses in a number of important applications, including search and rescue,
cleaning robots, and automated agriculture techniques. Both object search and
coverage involve motion planning, typically based on an a priori information
distribution in a domain. Most search and coverage techniques involve the con-
struction and use of an information map which encodes the utility of the robot
exploring specific regions of the search space.
Broadly, area search and coverage methods fall into three categories: geo-
metric, gradient-based and trajectory optimization-based approaches. Geomet-
ric methods, like lawnmower search, are good techniques for uniformly covering
a search domain [3,4]. Such approaches are useful when no a priori informa-
tion is provided, or when the a priori information is presumed to be a uniform
probability distribution over the search domain. On the other hand, since these
approaches do not utilize any a priori information about regions of interest, the
efficacy of exploration and object search is upper bounded.
When a priori information is available, a representation of how information
is distributed across a search domain (i.e. an a priori information map or infor-
mation distribution) can be leveraged to develop better search and coverage
techniques. In gradient-based methods [5–7], the short-term information gain is
greedily maximized by guiding the search agent in the direction of the derivative
of the information map around its position. Since the agent is always driven in the
direction of greatest information gain, it is led to areas with higher likelihood of
finding objects. However, these approaches can leave areas unexplored, because
they generally do not take into account the uncertainty associated with the a
priori information distribution, which can help differentiate between unexplored
areas with low information and areas with no information. These approaches are
very sensitive to noise in the information map because noise leads to inaccurate
gradient estimations, which in turn leads to agents greedily over-exploiting local
information maxima.
Trajectory optimization-based methods formulate search as an information
gathering maximization problem, which is then solved by planning a path for
the search agent [8]. The cost function that drives the optimization in many of
these approaches combines both the information distribution and its associated
uncertainty, thereby reducing the likelihood of leaving areas unexplored.
Sparse Sensing: Many real-world robotic applications are plagued by resource
limitations. Sparse sensing techniques are useful in these scenarios. Most prior
work in applications involving sparse sensing focus on using sparse sensor mea-
surements (and therefore sparse data) to accomplish tasks like localization and
SLAM [1], depth reconstruction [2] and wall-following [9]. These works mostly
explore methods to better use limited data that has already been acquired, or
that is actively being acquired. However, in all of these approaches, the robot
still has to acquire the measurements, and post-optimizing for sparse data points
does not help reduce costs for limited onboard resources. While intelligently using

limited data does help improve the performance of resource-limited robotic sys-
tems, further improvements can be made by deciding where to take these limited
measurements.
Ergodic Search: Most information-based search and coverage methods view
the information gathering problem through one of two lenses: exploration or
exploitation. Through the exploratory lens, the information acquisition problem
is framed as the wide space search for diffused information, for applications like
localization or coverage [10,11]. On the other hand, through the exploitative
lens, the information gathering problem is framed as the direct search for highly
structured information peaks, such as in object search [11,12]. Ergodic search is
able to balance both exploration and exploitation goals by accounting for both
diffused information densities and highly focused points of information [8,13,
14]. By doing this, ergodic control trajectories are able to conduct both broad-
stroke coverage for diffused information densities and localized search for more
focused high information points, thereby balancing exploration and exploitation.
Specifically, an ergodic path will drive a robot to spend more time in regions of
higher expected information in an a priori information map, and less time in
regions of low information.
Ergodic coverage [8] produces trajectories for agents such that they spend
time in areas of the domain proportional to the expected amount of information
present in that area. In order to control agents to accomplish this behavior, we
pose an optimization problem that minimizes the distance between the time-
average statistics of the agent (Eq. 1) and the underlying information map.

$$C^t(x, \gamma_t) = \frac{1}{t}\sum_{\tau=0}^{t-1} \delta(x - \gamma(\tau)), \quad (1)$$

where γ is the agent’s trajectory, defined as γ : (0, t] → X , t is the discrete time


horizon, and δ is the Dirac delta function, with X ⊂ IRd in the d-dimensional
search domain. The spatial time-average statistics of an agent’s trajectory quan-
tifies the fraction of time spent at a position x ∈ X .
The expected information distribution, or information map, over the domain
to be explored and searched is determined by a target distribution which defines
the likelihood of generating informative measurements at any given location in
the search domain. Formally, the agent’s time-averaged trajectory statistics are
optimized against this expected information distribution over the whole domain,
by minimizing the distance between the Fourier spectral decomposition of each
distribution. This is obtained by minimizing the ergodic metric Φ(·), expressed
as the weighted sum of the difference between the spectral coefficients of these
two distributions [8]:

$$\Phi(\gamma(t)) = \sum_{k=0}^{m} \alpha_k \left| c_k(\gamma_t) - \xi_k \right|^2, \quad (2)$$

where $c_k$ and $\xi_k$ are the Fourier coefficients of the time-average statistics of an agent's trajectory γ(t) and the desired spatial distribution, respectively, and $\alpha_k$ are the weights of each coefficient difference. In practice, $\alpha_k = (1 + \|k\|^2)^{-(d+1)}$ is usually defined to place higher weights on lower frequency components, which correspond to larger spatial-scale variations in the information distribution.
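As an illustration of Equations (1) and (2), the sketch below evaluates the ergodic metric for a 2-D trajectory against a gridded target distribution, using the cosine basis common in the ergodic search literature; the basis normalization constant is omitted for brevity, and the grid-based evaluation of $\xi_k$ is our simplification.

```python
import numpy as np
from itertools import product

def ergodic_metric(traj, phi_grid, L=1.0, m=10):
    """Ergodic metric of Eq. (2) over the domain [0, L]^2.

    traj:     (T, 2) array of positions
    phi_grid: (G, G) array with the target distribution on a uniform grid
    """
    G = phi_grid.shape[0]
    xs = (np.arange(G) + 0.5) * L / G
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    cell = (L / G) ** 2
    phi = phi_grid / (phi_grid.sum() * cell)  # normalize to a density

    metric = 0.0
    for k1, k2 in product(range(m), repeat=2):
        # c_k: spectral coefficient of the trajectory's time-average (Eq. 1).
        fk_traj = (np.cos(k1 * np.pi * traj[:, 0] / L)
                   * np.cos(k2 * np.pi * traj[:, 1] / L))
        ck = fk_traj.mean()
        # xi_k: spectral coefficient of the target distribution.
        xik = (phi * np.cos(k1 * np.pi * X / L)
                   * np.cos(k2 * np.pi * Y / L)).sum() * cell
        alpha = (1.0 + k1**2 + k2**2) ** (-(2 + 1))  # weights, d = 2
        metric += alpha * abs(ck - xik) ** 2
    return metric
```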
The goal of ergodic coverage is to generate optimal controls u∗ (t) for an
agent, whose dynamics are described by a function f : Q × U → T Q, such that

$$u^*(t) = \arg\min_{u} \Phi(\gamma(t)) \quad \text{subject to}\ \dot{q} = f(q(t), u(t)),\ \|u(t)\| \le u_{max} \quad (3)$$

where q ∈ Q is the state and u ∈ U denotes the set of controls. Equation 3


is solved directly for the optimal control input at each time-step, by trajectory
optimization to plan feed-forward trajectories over a specified time horizon [14],
or by using sampling-based motion planners [15], where it is straightforward to
pose additional constraints such as obstacle avoidance. Note that in this opti-
mization, each point x(t) in the robot’s trajectory is considered a sample point
during the exploration of the search domain, which can be detrimental in sce-
narios where the robot’s trajectory traverses over large regions of low expected
information. In the following section, we describe our approach to selecting the
sampling decision times as a part of the optimization.

3 Sparse Sensing in Ergodic Optimization


We approach sparse sensing in ergodic
coverage in two steps. First, we extend
ergodic optimization to formulate a
sparse ergodic optimization problem
to jointly optimize for both an agent’s
trajectory, and the importance of tak-
ing a sensor measurement at each
point along the trajectory. Second, we
apply this formulation to ergodic cov-
erage using multi-agent teams, and
jointly optimize the trajectories and
sensing decisions of each agent in the
team, resulting in the optimal dis-
tribution of limited sensing resources
amongst agents.

Fig. 2. Sparse Ergodic Trajectory Example: Trajectories are generated with a decision variable for when to take a sensor measurement. This approach results in trajectories which provide maximal information gathering with minimal sensing resources.

3.1 Sparse Ergodic Optimization

In many search and coverage applications, only a few sensor measurements or data points are required to fully

comprehend and optimize a model. Note that in the optimization in Eq. 3, the
locations at which sensor measurements are obtained are uniform in time, that
is, each measurement along the trajectory is treated as being equally important
and contributing equally to minimizing the ergodic metric. In many scenarios,
due to both the non-uniform distribution of information in the region being covered and the dynamic constraints of the robotic agents being used, this
can lead to extraneous measurements that do not contribute to reconstructing
the information prior.
For example, if there are few regions of high information, which are sepa-
rated by larger regions of low expected information, uniform sensor measure-
ments would result in many measurements that provide low information gain
(see Fig. 2). In this case, ergodic performance would be improved by only taking
sensor measurements near the areas of higher information.
In this work we extend ergodic optimization in order to find this set of optimal
measurements for a given search scenario. We optimize for the set of optimal
sensor measurements by posing the following ergodic optimization,

$$u^*(t), \lambda^*(t) = \arg\min_{u,\lambda} \bar{\Phi}(\gamma(t)) \quad \text{subject to}\ \dot{q} = f(q(t), u(t)),\ \|u(t)\| \le u_{max} \quad (4)$$

where q ∈ Q is the state, u ∈ U denotes the set of controls, and λ(t) ∈ {0, 1}.
λ(t) represents the decision variable for choosing whether to take a sensor mea-
surement or not at a given location in the search domain. We encourage sparsity in the sample measurements by regularizing λ with an L1 norm, which is known to promote sparsity [16].
We augment the ergodic metric in Eq. 2 in the following manner

$$\bar{\Phi}(\gamma(t)) = \sum_{k=0}^{m} \alpha_k \left| c_k(\gamma(t), \lambda(t)) - \xi_k \right|^2 + \sum_{k} |\lambda_k|, \quad (5)$$

where $c_k$ and $\xi_k$ are the Fourier coefficients of the time-average statistics of the agent's trajectory γ(t) and the desired spatial distribution, respectively, and $\alpha_k$ are the weights of each coefficient difference.
The spatial time-average statistics of the agent’s trajectory (from Eq. 1) are
also modified to be,

$$C^t(x, \gamma(t)) = \frac{1}{\sum_{t}\lambda(t)} \sum_{\tau=0}^{t} \lambda(\tau)\,\delta(x - \gamma(\tau)), \quad (6)$$

where λ(t) ∈ {0, 1}.


Thus far, λ(t) is defined to be an integer (i.e. λ(t) ∈ {0, 1}), resulting in
Eq. 4 being a mixed integer programming problem. However, such mixed integer
programming problems are difficult to solve, due to a lack of gradient information
from the integer variables [17], and due to requiring direct search methods that
do not scale with longer time horizons. For this reason, we employ a relaxation
of the problem Eq. 4 by defining λ(t) to be a bounded continuous variable

Fig. 3. Mixed-Integer Continuous Relaxation for Sparse Ergodic Optimization: Illustrated is the process with which we jointly optimize for trajectories and when to sample. The decision variable λ is relaxed to be continuous in [0, 1], and solutions are projected into the integer {0, 1} space. As input, our approach takes information prior distributions which guide the planning and sensing. Output trajectory solutions (shown on the right) show concentrated samples over areas of high information with minimal use of sensor resources.

λ(t) ∈ [0, 1] and optimize over the new domain. After optimization, we project λ
from the continuous domain to the nearest integer value, while adhering to the
sensing budget. This allows us to continuously optimize Eq. 4, and then map
the resultant continuous values of λ(t) to discrete values {0, 1}. The procedure
for optimizing the sparse ergodic problem Eq. 4 is depicted in Fig. 3.
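A sketch of the projection step under our reading of the procedure: the relaxed weights are mapped back to {0, 1} by keeping the largest weights up to the sensing budget, and the time-average statistics of Eq. (6) are reweighted accordingly. The Gaussian stand-in for the Dirac delta, the top-k rule, and both function names are our assumptions.

```python
import numpy as np

def project_lambda(lam_cont, budget):
    """Map relaxed decision variables in [0, 1] back to {0, 1} while
    respecting the sensing budget: keep the `budget` largest weights.
    This top-k rule approximates rounding to the nearest integer
    subject to the budget."""
    lam = np.zeros_like(lam_cont)
    keep = np.argsort(lam_cont)[::-1][:budget]
    lam[keep] = 1.0
    return lam

def weighted_time_average(traj, lam, x, sigma=0.05):
    """Sparse time-average statistics of Eq. (6) at query points x,
    with the Dirac delta replaced by a narrow Gaussian for numerical
    use (an assumption). traj: (T, 2), lam: (T,), x: (G, 2)."""
    w = lam / lam.sum()
    d2 = ((x[None, :, :] - traj[:, None, :]) ** 2).sum(-1)  # (T, G)
    return (w[:, None] * np.exp(-d2 / (2 * sigma**2))).sum(0)
```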
When we investigate the numerical values of the decision variables λ(t) cal-
culated in sparse ergodic optimization Eq. 4, we see that there are peaks formed
that correspond to peaks in information in the a priori information distribu-
tion (Fig. 3). When these continuous numerical values are mapped back to
λ(t) ∈ {0, 1}, this results in λ(t) = 1 being more likely in areas of higher infor-
mation, and λ(t) = 0 being more likely in areas of lower information. This shows
that the sparse ergodic optimization drives the likelihood of taking a sensor mea-
surement in an area of higher expected information to be higher, which follows
intuition. Additionally, we also see some peaks in regions of lower information,
which typically correspond with areas between relatively closely places areas of
high information, or with regions of low information where no sensor measure-
ments have been taken for a considerable amount of time.

3.2 Multi-agent Sparse Ergodic Optimization

For a multi-agent team covering a given information prior, the limited number
of measurements required to fully cover the region can be distributed among the

different agents. We apply sparse ergodic optimization (Eq. 4), as described in


Sect. 3.1 to distribute sensing resources amongst a multi-agent team. A sample
result is shown in Fig. 4.
For N agents, the modified joint spatial time-average statistics of the set of agent trajectories $\{\gamma_i\}_{i=1}^{N}$ are defined as

$$C^t(x, \gamma(t)) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sum_{t}\lambda_i(t)} \sum_{\tau=0}^{t} \lambda_i(\tau)\,\delta(x - \gamma_i(\tau)), \quad (7)$$

where $\lambda_i(t) \in \{0, 1\}$ for each agent $i$.


We use the augmented ergodic
metric defined in Eq. 5 to drive the
joint optimization of the set of agent
trajectories and sensing decision vari-
ables. The ergodic optimization posed
in Eq. 4 is used to generate opti-
mal controls and set of optimal sensor
measurements for each agent.
We again employ a relaxation of the problem described by Eq. 4 by defining each λi(t) to be a bounded continuous variable λi(t) ∈ [0, 1] for each agent i, and optimize over the new domain. After optimization, we project the λi for each agent i from the continuous domain to the nearest integer value, while adhering to the sensing budget. We take into account all of the agent trajectories when mapping from the continuous domain to discrete integers in order to distribute sensing measurements across all of the agents.

Fig. 4. Multi-Agent Sparse Ergodic Trajectory Example: Trajectories for a multi-agent team are generated with a decision variable for when to take a sensor measurement. This approach automatically updates trajectories which provide maximal information gathering with minimal sensing resources.

4 Results and Discussion


In this section we empirically show that for a specific ergodic coverage prob-
lem, there exists a minimal set of sensor measurements to take in order to gain
enough information to fully characterize and solve the coverage task. Through
our experiments we illustrate that optimizing for this set of samples, or sensor
measurements, leads to improvements in ergodicity for coverage problems. This
supports our intuition that in many coverage scenarios, extraneous samples are
taken that either do not contribute to or negatively impact the optimization.
Further, we show that our sparse ergodic optimization approach can be applied
to multi-agent coverage tasks in order to optimally distribute a given sensing

budget amongst all of the agents. In multi-agent coverage scenarios as well, we


empirically show that for a specific ergodic coverage problem, there is a minimal
set of sensor measurements required to cover the information prior.

4.1 Experimental Details


In our set of systematic experiments, the dynamics of the agent considered are
defined as the constrained discrete time dynamical system
$$x(t+1) = f(x, u) = x(t) + \tanh(u) \quad (8)$$
where $x(t) \in \mathbb{R}^2$ and $u \in \mathbb{R}^2$, with the control constrained componentwise by the tanh function, whose output is bounded in (−1, 1).
The sensor footprint of the agent is modeled as a Gaussian distribution cen-
tered at the agent’s position, whose variance prescribes a circular observation
range ρ > 0. The information maps are built using information distributions
that are distributed within the continuous search space $X = [0, L]^2 \subset \mathbb{R}^2$, which is defined as

$$p(x) = \sum_{i=1}^{3} \eta_i \exp\left(-\|x - c_i\|^2_{\Sigma_i^{-1}}\right) \quad (9)$$

where $p(x) : X \to \mathbb{R}^+$, and $\eta_i$, $c_i$, $\Sigma_i$ are the normalizing factor, the Gaussian center, and the Gaussian variance, respectively.
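A small sketch of the information map of Eq. (9), assuming isotropic Gaussian variances and an explicit negative exponent (required for a decaying Gaussian); the normalization step and the function signature are our choices.

```python
import numpy as np

def information_map(grid, centers, sigmas, etas):
    """Target distribution of Eq. (9): a sum of three Gaussians over
    X = [0, L]^2. `grid` is an (N, 2) array of query points; each
    sigma is treated as an isotropic variance for simplicity."""
    p = np.zeros(len(grid))
    for eta, c, s in zip(etas, centers, sigmas):
        d2 = ((grid - np.asarray(c)) ** 2).sum(axis=1)
        p += eta * np.exp(-d2 / s)
    return p / p.sum()   # normalize so the map sums to one
```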
As described in Sect. 3.1, the sparse ergodic optimization problem Eq. 4 is a
mixed integer programming problem. Since mixed integer programming prob-
lems are difficult to solve, we relax this formulation by defining λ(t) to be a
bounded continuous variable λ(t) ∈ [0, 1], and map the resultant values to dis-
crete values in {0, 1}. We use a solver to compare the results of the mixed integer
programming formulation to our relaxed problem to show that this relaxation
greatly improves the computational cost of the optimization, without negatively
impacting performance.
We investigate the performance of the sparse ergodic optimization described
in Sect. 3.1 in terms of coverage performance, using ergodicity as the metric.
We compare our coverage results to two different approaches. The first is stan-
dard ergodic optimization Eq. 2, where sensor measurements are uniformly dis-
tributed along the optimized trajectory. The second is a probabilistic heuristic
with a two step process: first we optimize an ergodic trajectory, then we sample
measurement locations from the distribution of information under the optimized
trajectory.
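For comparison, the two-step probabilistic heuristic baseline could be sketched as follows; the sampling-without-replacement choice and the function names are assumptions on our part.

```python
import numpy as np

def heuristic_measurements(traj, info_map, budget, rng=None):
    """Two-step probabilistic heuristic: take an already optimized
    ergodic trajectory and sample `budget` measurement indices with
    probability proportional to the information under the trajectory.
    `info_map(x)` evaluates the a priori information density at x."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array([info_map(x) for x in traj])
    w = w / w.sum()
    return rng.choice(len(traj), size=budget, replace=False, p=w)
```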
Further, we investigate the performance of multi-agent sparse ergodic opti-
mization described in Sect. 3.2 in terms of coverage performance, with ergodicity
as the metric. We compare our results to the two baselines described above, aver-
aged across different team sizes (ranging from 3 to 10 agents).
The performance statistics for each method and sensing budget are averaged across 25 randomized experiment setups, where the initial information map is varied between experiments. For each method, 10 different sensing budgets were explored. Agents' starting positions, initial information maps, and sensing budgets are kept identical among experiments with different controllers to ensure that our results are comparable.

4.2 Single Agent Sensing Distribution

When looking at results of sparse


ergodic optimization in terms of over-
all coverage performance, measured
by the ergodic metric, we observe
that there is a minimal number of
sensor measurements to be taken to
minimize ergodicity (see Fig. 6). In
our experiments, this number of sen-
sor measurements varies with changes
in information map being covered,
sample rate, time horizon and ini-
tial sample weights. However, for a
set of fixed experiment hyperparam-
eters, the minimal number of samples required is consistent. For any fixed experiment setup, when we take fewer than this minimal number of sensor measurements, the optimization lacks relevant information, and so we see a decrease in coverage performance. On the other hand, when we take more sensor measurements than the minimal required number, the ergodic value increases, due to extraneous measurements negatively impacting coverage.

Fig. 5. Standard Ergodic and Sparse Ergodic Trajectory Comparison: Standard ergodic optimization results in a different final trajectory (orange trajectory) than jointly optimizing for both trajectory and sensor weight (blue and yellow trajectory). This implies a cross-effect between agent position and whether a sensing measurement is taken, due to which joint optimization becomes more important as the coverage problem scales.
extraneous measurements negatively impacting coverage.
Fig. 6. Comparison of Baseline Approaches and Sparse Optimization: Results show that sparse ergodic optimization has better coverage performance in terms of the ergodic metric when compared to standard ergodic optimization with uniformly distributed sparse measurements. The probabilistic heuristic results in comparable performance.

We also experimentally demonstrate that it is better to selectively choose measurements along the standard ergodic trajectory, since in all cases the coverage performance of the probabilistic heuristic is much better than that of standard ergodic optimization with uniformly distributed measurements. On the other hand, using the probabilistic heuristic on a standard ergodic trajectory has very similar coverage performance to sparse ergodic optimization for a single agent. We see that optimizing for trajectory alone and jointly optimizing for trajectory and measurement placement create different resultant trajectories (Fig. 5), implying that trajectory optimization and measurement choice impact each other. Jointly optimizing for trajectory and sensing measurements leads to lower control cost, as the cost of moving between chosen sensing measurements is directly taken into account. For a single agent in the simple coverage problems being considered, there are not large differences in control cost, leading to very similar performance of the probabilistic heuristic and sparse ergodic optimization.
Further, we compare the performance statistics and computation cost of our
approach to that of solving the mixed integer programming formulation of the
sparse ergodic optimization problem. We see in Fig. 7 that our relaxation of the
sparse ergodic optimization problems leads to much lower computational costs
(i.e. lower run times), while retaining comparable coverage performance in terms
of the ergodic metric.

(Figure 7 shows two panels plotting (a) Ergodic Value and (b) Computation Time (seconds) against Number of Sensing Measurements, each comparing Sparse Ergodic Optimization with the Mixed Integer Formulation.)

Fig. 7. Comparison of Mixed Integer Optimization Problem and Relaxed


Sparse Optimization Problem: Solving the mixed integer formulation of the sparse
sensing problem leads to slightly better coverage performance (a), in terms of the
ergodic metric, but has a much higher computational cost (b).

4.3 Multi-agent Sparse Ergodic Optimization

When optimizing limited sensing resources (specifically, a sensing budget in terms of a restricted number of sensing measurements) for a multi-agent team using the sparse ergodic optimization approach, we see that the "workload" of covering different peaks in a given a priori information map is distributed among the agents, and the sensing measurements are distributed in order to support the requirements of each coverage workload.

Fig. 8. Comparison of Baseline Approaches and Sparse Optimization for Multi-Agent Coverage: Results show that multi-agent sparse ergodic optimization has better coverage performance in terms of the ergodic metric when compared to standard ergodic optimization with uniformly distributed sparse measurements and with sparse measurements distributed using a probabilistic heuristic.

Similar to the single agent sparse ergodic optimization results, we see that using sparse ergodic optimization to distribute a limited sensing budget across a multi-agent team results in improved coverage performance in terms of
the ergodic metric with fewer sensing measurements (i.e. for lower sensing bud-
gets) (see Fig. 8). We also observe that there is a minimal number of sensor mea-
surements that are required in order to minimize ergodicity, and this minimal
number varies with changes in experiment hyperparameters like information map
being covered, sample rate, time horizon and initial sample weights. For a fixed
set of experiment hyperparameters, the ergodicity increases when the sensing
budget is increased past this minimum required number, since there are extra-
neous measurements being taken, while the ergodicity increases with decrease in
sensing budget below the minimum required number, since the optimization is
missing information.
For multi-agent teams we see that the probabilistic heuristic has worse cov-
erage performance compared to sparse ergodic optimization (see Fig. 8). As
explained in Sect. 4.2, jointly optimizing for trajectory and sensing measure-
ments leads to lower control cost. As we scale up the optimization problem,
control cost becomes more substantial, as we need to optimize multiple tra-
jectories. Further, the cross-effect between the agents' trajectories and whether they take sensing measurements increases with multiple agents, since coverage performance is now impacted by several trajectories with potential areas of overlap. Thus, it
becomes more necessary to jointly optimize for trajectory and choosing sensing
measurements in multi-agent sparse sensing coverage problems.

5 Conclusion
In this paper we investigate the idea that there is an optimal set of sensing mea-
surements that can be taken during coverage, in order to fully characterize the
problem, reduce sensing costs and avoid local sensing optima. To this end, we
formulate a novel approach to sensing-resource limited coverage by modifying
ergodic optimization to jointly optimize for both the sensing trajectory and the
decision of where to take sensing measurements. Specifically, the set of optimal
sensor measurements is posed as a sparse ergodic optimization problem, where
the choice to take a measurement is encoded in a vector of sample weights. Our
set of experiments shows that there exists an optimal set of sensing measurements for a given coverage scenario, in both single- and multi-agent cases, using ergodicity, a state-of-the-art coverage metric. We also infer that there exists a cross-effect between an agent's trajectory and sensing decisions, which makes it important to jointly optimize these in cases with limited sensing measurements.
This effect is stronger for multi-agent scenarios.
This work experimentally shows that there are improvements to be made in
resource-limited coverage scenarios through sparse sensing. Specifically, we show
that there exists an optimal set of measurements required to solve a coverage
problem. Future work will focus on proving theoretical guarantees regarding
the existence of this optimal set of sensing measurements. This work assumes
the availability of accurate a priori information maps, which is not the case
for many real-world coverage applications. Future work will seek to use sparse
ergodic optimization in order to identify and account for inaccurate information
priors with minimum sensor measurements, and take into account sensor noise.

References
1. Marín, L., Vallés, M., Soriano, A., Valera, A., Albertos, P.: Multi sensor fusion
framework for indoor-outdoor localization of limited resource mobile robots. Sen-
sors 13(10), 14133–14160 (2013). https://doi.org/10.3390/s131014133. https://
www.mdpi.com/1424-8220/13/10/14133
2. Ma, F., Carlone, L., Ayaz, U., Karaman, S.: Sparse sensing for resource-constrained
depth reconstruction. In: 2016 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp. 96–103 (2016). https://doi.org/10.1109/IROS.
2016.7759040
3. Ablavsky, V., Snorrason, M.: Optimal search for a moving target - a geometric
approach. In: AIAA Guidance, Navigation, and Control Conference and Exhibit.
AIAA (2000)
4. Choset, H.: Coverage for robotics-a survey of recent results. Ann. Math. Artif.
Intell. 31(1), 113–126 (2001)
5. Lanillos, P., Gan, S.K., Besada-Portas, E., Pajares, G., Sukkarieh, S.: Multi-UAV
target search using decentralized gradient-based negotiation with expected obser-
vation. Inf. Sci. 282, 92–110 (2014)

6. Baxter, J.L., Burke, E.K., Garibaldi, J.M., Norman, M.: Multi-robot search and
rescue: a potential field based approach. In: Mukhopadhyay, S.C., Gupta, G.S.
(eds.) Autonomous Robots and Agents. Studies in Computational Intelligence,
vol. 76, pp. 9–16. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73424-6_2
7. Wong, E.M., Bourgault, F., Furukawa, T.: Multi-vehicle bayesian search for mul-
tiple lost targets. In: International Conference on Robotics and Automation, pp.
3169–3174. IEEE (2005)
8. Mathew, G., Mezić, I.: Metrics for ergodicity and design of ergodic dynamics for
multi-agent systems. Physica D 240(4), 432–442 (2011)
9. Wu, G.D., Zhu, Z.W., Chien, C.W.: Sparse-sensing-based wall-following control
design for a mobile-robot. In: 2016 IEEE International Conference on Control and
Robotics Engineering (ICCRE), pp. 1–5 (2016). https://doi.org/10.1109/ICCRE.
2016.7476144
10. Miller, L.M., Silverman, Y., MacIver, M.A., Murphey, T.D.: Ergodic exploration
of distributed information. IEEE Trans. Rob. 32(1), 36–52 (2015)
11. Abraham, I., Mavrommati, A., Murphey, T.: Data-driven measurement models for
active localization in sparse environments. In: Robotics: Science and Systems XIV.
Robotics: Science and Systems Foundation (2018). https://doi.org/10.15607/RSS.
2018.XIV.045
12. Abraham, I., Prabhakar, A., Murphey, T.D.: An ergodic measure for active learning
from equilibrium. IEEE Trans. Autom. Sci. Eng. 18(3), 917–931 (2021). https://
doi.org/10.1109/TASE.2020.3043636
13. Ayvali, E., Salman, H., Choset, H.: Ergodic coverage in constrained environments
using stochastic trajectory optimization. In: International Conference on Intelligent
Robots and Systems, pp. 5204–5210. IEEE (2017)
14. Miller, L.M., Murphey, T.D.: Trajectory optimization for continuous ergodic explo-
ration. In: American Control Conference (ACC) 2013, pp. 4196–4201. IEEE (2013)
15. Kobilarov, M.: Cross-entropy motion planning. The Int. J. Robot. Res. 31(7), 855–
871 (2012)
16. Schmidt, M., Niculescu-Mizil, A., Murphy, K., et al.: Learning graphical model
structure using l1-regularization paths. In: AAAI, vol. 7, pp. 1278–1283 (2007)
17. Wolsey, L.A.: Mixed Integer Programming. Wiley Encyclopedia of Computer Sci-
ence and Engineering, pp. 1–10. Wiley, Hoboken (2007)
Distributed Multi-robot Tracking
of Unknown Clustered Targets with Noisy
Measurements

Jun Chen¹(B), Philip Dames², and Shinkyu Park¹

¹ King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
{jun.chen.1,shinkyu.park}@kaust.edu.sa
² Temple University, Philadelphia, PA 19122, USA
pdames@temple.edu

Abstract. Distributed multi-target tracking is a canonical task for


multi-robot systems, encompassing applications from environmental
monitoring to disaster response to surveillance. In many situations, the
distribution of unknown objects in a search area is irregular, with objects likely to appear in clusters rather than being evenly distributed. In this
paper, we develop a novel distributed multi-robot multi-target tracking
algorithm for effectively tracking clustered targets from noisy measure-
ments. Our algorithm contains two major components. Firstly, both the
instantaneous and cumulative target density are estimated, providing
the best guess of current target states and long-term coarse distribution
of clusters, respectively. Secondly, the power diagram is implemented in
Lloyd’s algorithm to optimize task space assignment for each robot to
trade-off between tracking detected targets in clusters and searching for
potential targets outside clusters. We demonstrate the efficacy of our
proposed method and show that our method outperforms of other can-
didates in tracking accuracy through a set of simulations.

Keywords: Multiple target tracking · sensor-based control · distributed sensor network

1 Introduction

Multi-target tracking using distributed multi-robot systems (MRSs) has drawn increasing attention over the past decades as robots become more powerful and lower-cost. In a large number of real-world scenarios, targets are likely to be distributed in clusters despite a lack of prior knowledge about their exact locations; examples include social animals, certain species of plants, and trash distributed in inhabited places. In such cases, detecting a target indicates that other targets are likely to appear nearby. Given no prior knowledge, most existing multi-robot multi-target tracking (MR-MTT) algorithms underperform at trading off between having robots search for undetected targets and having them track detected targets when targets are not evenly distributed across the search space. This paper aims at developing effective distributed tracking algorithms for unknown targets that are likely to be distributed in clusters. Such applications include the acquisition of image data from coral reefs for high-precision 3D reconstruction of their habitats, trash collection in a desert area, flaw inspection and repair on the surfaces of buildings and large machinery, detection and sample collection of vegetation in a nature reserve, and so on. There are two key components to an MR-MTT system: an estimation system to model and track objects as they are detected, and a control system to drive the motion of individual robots in the team towards areas that are likely to contain useful information.
Different from single-target tracking, the main challenge of MTT is matching detections to target tracks, a process known as data association, especially in the presence of false negative and false positive detections. There are a number of standard MTT algorithms, each of which solves data association in a different way: global nearest neighbor (GNN) [9], joint probabilistic data association (JPDA) [8], multiple hypothesis tracking (MHT) [2], and particle filters [7]. Each of these trackers propagates the posterior of target states over time and solves the data association problem prior to tracking. Another class of MTT techniques is derived from random finite set (RFS) statistics [13] and simultaneously solves data association and tracking. We use the probability hypothesis density (PHD) filter [12], which tracks the spatial density of targets, making it best suited to situations where each target is not required to have a unique identity. We recently developed a distributed PHD filter that is provably equivalent to the centralized solution [6].
Lloyd's algorithm [11] is one of the best-known control algorithms for distributed target tracking; the idea is to represent target states by a weighting function over the task space and to drive each robot to the weighted centroid of its Voronoi cell [5]. Compared to other MR-MTT algorithms such as graph-based methods [1,16], Lloyd's algorithm requires no prior knowledge about targets, and the number of targets can be unknown and time-varying. It drives more robots to areas where targets are more likely to appear, while allowing fewer robots to search for targets in the rest of the areas based on the weighting function. The weighting function can be estimated online using on-board sensors [15], and Dames [6] uses the PHD as the weighting function, driving robots to actively track targets. However, when no target is within a robot's Voronoi cell, the robot moves erratically, reacting to any false positive detections as well as the dynamically changing shape of its Voronoi cell. As a result, robots often stay within empty sub-regions instead of purposefully seeking out un-tracked targets, slowing down the rate at which they find targets. This problem is further exacerbated when a majority of targets gather within some small subsets of the environment.
To improve the tracking accuracy of clustered targets, this paper develops a novel estimation and control policy. We have three primary contributions: 1) we introduce a state estimation strategy incorporating both instantaneous and cumulative target states, which allows robots to track detected targets precisely through noisy measurements while learning the coarse distribution of targets from historical detection outcomes; 2) we implement the power diagram in Lloyd's algorithm to dynamically assign an optimized task space to each robot based on its current workload, driving the robots to explore or track targets more effectively; and 3) we demonstrate in a series of simulated experiments that with our proposed method, the team finds and tracks targets more effectively than with the previous algorithms in [6].

2 Problem Formulation

A set of n_t targets with states X = {x_1, ..., x_{n_t}} are located within a convex open task space denoted by E ⊂ R². A team of n_r (possibly heterogeneous) robots R = {r_1, ..., r_{n_r}} is tasked with determining n_t and X, both of which are unknown and may vary over time. We assume that each robot r_i knows its location q_i in a global reference frame (e.g., from GPS), though our proposed method can be immediately extended to handle localization uncertainty using the algorithms from our previous work [3]. At each time step, a robot r_i receives a set of |Z_i| noisy measurements Z_i = {z_1, z_2, ...} of targets within the field of view (FoV) F_i of its onboard sensor. Note that the sensor may experience false negative or false positive detections, so the number of detections may not match the true number of targets.

2.1 Lloyd's Algorithm

The objective of Lloyd's algorithm is to minimize the following functional:

$$\mathcal{H}(Q, \mathcal{W}) = \sum_{i=1}^{n} \int_{W_i} f\left(\|x - q_i\|\right) \phi(x)\, dx, \qquad (1)$$

where W_i is the dominance region of robot r_i (i.e., the region that robot r_i is responsible for), ‖·‖ is the Euclidean distance, x ∈ E, φ(x) is the weighting function for all x ∈ E, and f(·) is a monotonically increasing function. The role of f is to quantify the cost of sensing due to the degradation of a robot's ability to measure events with increasing distance. The dominance regions W_i form a partition of E, meaning the regions have disjoint interiors and the union of all regions is E [5].
The goal is for the team to minimize the functional (1) with respect to both the partition set W and the robot positions Q. Minimizing H with respect to W induces a partition of the environment $V_i = \{x \mid i = \arg\min_{k=1,\dots,n} \|x - q_k\|\}$. In other words, V_i is the collection of all points whose nearest neighbor is r_i. This is the Voronoi partition, and these V_i are the Voronoi cells, which are convex by construction. We call q_i the generator point of V_i. Minimizing H with respect to Q leads each sensor to the weighted centroid of its Voronoi cell [5], that is
$$q_i^* = \frac{\int_{V_i} x\, \phi(x)\, dx}{\int_{V_i} \phi(x)\, dx}. \qquad (2)$$

Lloyd's algorithm sets the control input for robot r_i to u_i(q_i^*), where

$$u_i(g) = \min\left(d_{step}, \|g - q_i\|\right) \frac{g - q_i}{\|g - q_i\|}, \qquad (3)$$

g is an arbitrary goal location, and d_step > 0 is the distance a robot can move during one time step. Robots follow this control law online, i.e., they recursively move toward the temporary weighted centroids of their Voronoi cells, re-construct their cells based on their new positions, and compute the new weighted centroids to move to. As a result, the robots asymptotically converge to the weighted centroids of their Voronoi cells, causing the team to reach a local minimum of Eq. (1). This still holds true when φ(x) varies with time. Note that the original Lloyd's algorithm works in a centralized manner.
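To make the recursion concrete, the following Python sketch runs one synchronous Lloyd iteration on a discretized task space. It is a minimal illustration of Eqs. (1)-(3) assuming a uniform grid and a nearest-neighbor (Voronoi) assignment; the helper names are ours, not from the paper's implementation.

import numpy as np

def weighted_centroid(cell_pts, phi):
    # Weighted centroid of a cell, Eq. (2): q* = (sum of x*phi) / (sum of phi).
    w = phi.sum()
    if w == 0:
        return cell_pts.mean(axis=0)
    return (cell_pts * phi[:, None]).sum(axis=0) / w

def lloyd_step(q, grid, phi, d_step=0.5):
    # One synchronous Lloyd iteration: assign grid points to the nearest
    # robot (Voronoi partition), then move each robot toward the weighted
    # centroid of its cell with the saturated control law of Eq. (3).
    dists = np.linalg.norm(grid[:, None, :] - q[None, :, :], axis=2)
    owner = dists.argmin(axis=1)
    q_new = q.copy()
    for i in range(len(q)):
        mask = owner == i
        if not mask.any():
            continue
        g = weighted_centroid(grid[mask], phi[mask])
        step = g - q[i]
        dist = np.linalg.norm(step)
        if dist > 0:
            q_new[i] = q[i] + min(d_step, dist) * step / dist
    return q_new

# Toy usage: 5 robots on a 100 m x 100 m workspace with uniform phi.
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])
phi = np.ones(len(grid))
q = rng.uniform(0, 100, size=(5, 2))
for _ in range(20):
    q = lloyd_step(q, grid, phi)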

3 Distributed Multi-target Tracking

3.1 Instantaneous State Estimation

The sets X and Z_i from above contain a random number of random elements, and thus are realizations of random finite sets (RFSs) [13]. The first-order moment of an RFS is known as the probability hypothesis density (PHD), which we denote v(x); it takes the form of a density function over the state space of a single target or measurement. By assuming that the RFSs are Poisson distributed, where the number of targets follows a Poisson distribution and the spatial distribution of targets is i.i.d., the PHD filter recursively updates this target density function in order to track the distribution over target sets [12].
The PHD filter uses three models to describe the motion of targets: 1) The
motion model, f (x | ξ), describes the likelihood of an individual target tran-
sitioning from an initial state ξ to a new state x. 2) The survival probability
model, ps (x), describes the likelihood that a target with state x will continue to
exist from one time step to the next. 3) The birth PHD, b(x), encodes both the
number and locations of the new targets that may appear in the environment.
The PHD filter also uses three models to describe the ability of robots to
detect targets: 1) The detection model, pd (x | q), gives the probability of a
robot with state q successfully detecting a target with state x. Note that the
probability of detection is identically zero for all x outside the sensor FoV. 2)
The measurement model, g(z | x, q), gives the likelihood of a robot with state q
receiving a measurement z from a target with state x. 3) The false positive (or
clutter) PHD, c(z | q), describes both the number and locations of the clutter
measurements.
Using these target and sensor models, the PHD filter prediction and update equations are:

$$\bar{v}_t(x) = b(x) + \int_E f(x \mid \xi)\, p_s(\xi)\, v_{t-1}(\xi)\, d\xi \qquad (4)$$

$$v_t(x) = \left(1 - p_d(x \mid q)\right) \bar{v}_t(x) + \sum_{z \in Z_t} \frac{\psi_{z,q}(x)\, \bar{v}_t(x)}{\eta_z(\bar{v}_t)} \qquad (5)$$

$$\eta_z(v) = c(z \mid q) + \int_E \psi_{z,q}(x)\, v(x)\, dx \qquad (6)$$

$$\psi_{z,q}(x) = g(z \mid x, q)\, p_d(x \mid q), \qquad (7)$$

where ψ_{z,q}(x) is the probability of a sensor at q receiving measurement z from a target with state x.
The PHD represents the best guess of target states at the current time step and is utilized as the instantaneous estimate of target density, denoted by wi(x), i.e., wi(x) = v(x), x ∈ E. In [6] a distributed PHD filter is formulated, with each robot maintaining the PHD within a unique subset, V_i, of the environment. Three algorithms then account for the motion of the robots (to update the subsets V_i), the motion of the targets (in (4)), and measurement updates (in (5)). As a result, each robot recursively re-constructs its Voronoi cell online based on the current relative locations of neighboring robots and maintains the PHD locally by communicating with neighbors to estimate target states, yielding identical results to running a centralized PHD filter over the task space. In this paper, the same strategies are applied for propagating wi(x) over time in a distributed manner.
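For reference, a minimal grid-based rendering of the measurement update (5)-(7) in Python might look as follows. The callback names (p_d, g_meas, c_clutter) and the isotropic Gaussian measurement model are illustrative assumptions, not the implementation of [6].

import numpy as np

def phd_update(v_bar, grid, q, Z, p_d, g_meas, c_clutter, cell_area):
    # PHD measurement update, Eqs. (5)-(7), on a discretized task space.
    # v_bar: predicted PHD values at the grid points; Z: list of measurements.
    pd = np.array([p_d(x, q) for x in grid])
    v = (1.0 - pd) * v_bar                    # missed-detection term of Eq. (5)
    for z in Z:
        psi = np.array([g_meas(z, x, q) for x in grid]) * pd      # Eq. (7)
        eta = c_clutter(z, q) + (psi * v_bar).sum() * cell_area   # Eq. (6)
        v = v + psi * v_bar / eta             # detection term of Eq. (5)
    return v

def p_d(x, q, fov=5.0):
    # Unit detection probability inside a 5 m isotropic FoV, zero outside.
    return 1.0 if np.linalg.norm(x - q) <= fov else 0.0

def g_meas(z, x, q, sigma=0.5):
    # Isotropic Gaussian position-measurement likelihood.
    return np.exp(-((z - x) ** 2).sum() / (2 * sigma**2)) / (2 * np.pi * sigma**2)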

3.2 Cumulative State Estimation

Since no prior birth model is known for the PHD filter, we typically assume that the target birth rate is uniformly distributed. As a result, individual robots do not actively search for a target when no target is detected, and can often spend a long time locating targets that appear in underexplored regions of the environment, degrading the tracking performance, especially when targets are clustered. To improve this, we need to estimate online the probability of target appearance over a long period of time, i.e., the cumulative state wc.
In a variety of scenarios, targets move randomly within relatively fixed clusters, e.g., animals cluster around water sources. In such cases, the frequency of target appearance at each point x ∈ E can be regarded as time-invariant, and the cumulative state estimate is a density distribution that quantifies the best guess of the number of expected targets at each location based on accumulated observations.
The PHD filter assumes that the number of targets follows a Poisson distribution. The conjugate prior of this is a Gamma distribution, which describes the distribution of the expected number of targets found at x. This is given by

$$\Gamma(\alpha, \beta) = \frac{x^{\alpha-1} e^{-\beta x} \beta^{\alpha}}{(\alpha-1)!}, \quad x > 0,\ \alpha, \beta > 0, \qquad (8)$$

where α(x) is the shape parameter, which describes the total reward obtained from the observations at location x, and β(x) is the inverse rate parameter, whose inverse counts the historical observations that sensors have conducted at location x. Initially, α₀(x) = 0 and β₀(x) = ∞. When a robot r_i receives a measurement set Z_i at time t > 0, it updates α(x), β(x) for all x ∈ F_i by

$$\alpha(x) = \alpha(x) + r_w, \qquad \beta(x) = \left(\beta(x)^{-1} + 1\right)^{-1}, \qquad (9)$$

where

$$r_w = \begin{cases} 0 & |Z_i| = 0 \\ 1 & \text{else.} \end{cases} \qquad (10)$$

Then, for each x ∈ E, the cumulative state estimate is given by the mode of the distribution Γ(α(x), β(x)):

$$wc(x) = \begin{cases} \dfrac{\alpha(x) - 1}{\beta(x)}, & \alpha(x) \geq 1 \\ 0, & \alpha(x) < 1. \end{cases} \qquad (11)$$
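The bookkeeping of Eqs. (9)-(11) reduces to two counters per grid cell. Below is a sketch with hypothetical names; it stores 1/β (the observation count) to avoid the β₀ = ∞ initialization.

import numpy as np

class CumulativeEstimate:
    # Per-cell Gamma(alpha, beta) statistics for the cumulative state wc.
    def __init__(self, n_cells):
        self.alpha = np.zeros(n_cells)      # total reward, Eq. (9)
        self.inv_beta = np.zeros(n_cells)   # 1/beta = number of observations

    def observe(self, fov_idx, detected):
        # Eq. (10): reward 1 if the measurement set was non-empty, else 0.
        self.alpha[fov_idx] += 1.0 if detected else 0.0
        self.inv_beta[fov_idx] += 1.0       # Eq. (9): 1/beta grows by one

    def wc(self):
        # Mode of the Gamma distribution, Eq. (11): (alpha - 1)/beta where
        # alpha >= 1, computed as (alpha - 1)*(1/beta); zero elsewhere.
        out = np.zeros_like(self.alpha)
        ok = self.alpha >= 1.0
        out[ok] = (self.alpha[ok] - 1.0) * self.inv_beta[ok]
        return out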

3.3 Environment Approximation

Since it is neither feasible nor physically meaningful to maintain a different distribution for each point in space, it makes sense to use a subset of points to represent both wi and wc. In this paper, we use uniform grids to represent wi and wc over the task space. The integral of wi and wc over a grid cell is therefore equal to the instantaneous and cumulative expected number of targets in that cell, respectively.

3.4 Distributed Control

Optimized Space Assignment. In order to optimize the assigned coverage area, each robot is endowed with a weight ρ_i² when constructing its Voronoi cell, and the monotonically increasing function in Eq. (1) becomes

$$f\left(\|x - q_i\|\right) = \|x - q_i\|^2 - \rho_i^2, \qquad (12)$$

where ρ_i is the power weight of robot r_i. The resulting W is an additively weighted Voronoi diagram, or power diagram, a variant of the standard Voronoi diagram, and W_i is the power cell. Recall that $\sum_{x \in W_i} wc(x)$ quantifies the expected number of targets to be found in W_i and that the optimal partition is $W_i = \{x \mid i = \arg\min_{k=1,\dots,n} f(\|x - q_k\|)\}$. Therefore, for each robot r_i, we set ρ_i as

$$\rho_i = \left(\sum_{x \in W_i} wc(x)\right)^{-\frac{1}{2}}, \qquad (13)$$

so that the area of W_i is approximately inversely proportional to the estimated number of targets that can be found in W_i. Thereby, robots in areas with lower expected target density are assigned larger areas to explore, which takes full advantage of the search and tracking capability of the team.
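A sketch of the resulting space assignment on the same grid discretization; power_cells and update_weights are our illustrative names, and the eps guard against empty cells is our addition.

import numpy as np

def power_cells(q, rho2, grid):
    # Power-cell assignment: i = argmin_k ||x - q_k||^2 - rho_k^2, Eq. (12).
    d2 = ((grid[:, None, :] - q[None, :, :]) ** 2).sum(axis=2)
    return (d2 - rho2[None, :]).argmin(axis=1)

def update_weights(owner, wc_grid, n_robots, eps=1e-9):
    # Power weight per robot, Eq. (13): rho_i^2 = (sum of wc over W_i)^(-1).
    rho2 = np.empty(n_robots)
    for i in range(n_robots):
        rho2[i] = 1.0 / max(wc_grid[owner == i].sum(), eps)
    return rho2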
3.5 Algorithm Outline

The distributed target search and tracking algorithm is outlined as Algorithm 1. For each robot, both the instantaneous and cumulative states are initialized to null, as there is no prior knowledge about target states, and α(x) and β(x) are initialized. As robots start to explore and receive sensor measurements, each of them exchanges its location with neighbors to compute its power cell W_i, as outlined in lines 5-8. Then each robot updates and broadcasts the tuple {α(x), β(x), wi_i(x), wc_i(x)}. Lines 12-18 outline the control strategy, which separates the team into idle robots (i.e., robots that find no target) and busy robots (i.e., robots that find targets), and causes each robot to move according to the following rules: when targets are detected, the robot is driven to locations with a higher instantaneous estimate of target density to better track the detected targets; otherwise, the robot is driven to search for targets in areas with a higher cumulative estimate of target density, where targets are more likely to be found based on accumulated experience. As a result, the team is able to optimize robot locations by taking advantage of historical measurements and further improve the tracking accuracy.

Algorithm 1 boosts the performance of our previous method [6] in that it allows the team to actively explore the environment and learn the characteristics of the target distribution. In particular, robots are now able to use a combination of detailed local information (coming from the PHD) and coarse global information (coming from α, β) to inform their actions. For evenly distributed targets, Algorithm 1 yields a tracking performance similar to Dames' method [6], since wc is then close to uniform across the task space and W resembles a standard Voronoi diagram.

Algorithm 1: Distributed Search and Tracking

1  for r_i ∈ R do
2      wc(x) ← 0, wi(x) ← 0, x ∈ E
3      α(x) ← 0, β(x) ← ∞, x ∈ E
4  while true do
5      for r_i ∈ R do
6          Receive measurement set Z_i
7          Exchange state q_i with neighbors N_i
8          Compute power cell W_i
9          Update α(x), β(x), x ∈ W_i using (9)
10         Update wi_i(x), wc_i(x), x ∈ W_i using (5) and (11), respectively
11         Broadcast {α(x), β(x), wi_i(x), wc_i(x)}
12         if Z_i = ∅ then
13             φ_i(x) ← wc_i(x), x ∈ W_i        ▷ Idle robots
14         else
15             φ_i(x) ← wi_i(x), x ∈ W_i        ▷ Busy robots
16         Set goal g_i = q_i^* using (2)
17         Compute u_i(g_i) using (3)
18         q_i ← q_i + u_i                       ▷ Move towards goal
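A compact Python rendering of the control portion of Algorithm 1 (lines 12-18) for a single robot, reusing the Lloyd helpers sketched in Sect. 2.1, might read as follows; the robot container bundling q, the power-cell mask, and the wi/wc fields is hypothetical.

import numpy as np

def control_step(robot, Z, grid, d_step=0.5):
    # Lines 12-15: idle robots follow wc, busy robots follow wi.
    phi = robot.wc if len(Z) == 0 else robot.wi
    in_cell = robot.cell                      # boolean mask over grid points
    g = weighted_centroid(grid[in_cell], phi[in_cell])   # line 16, Eq. (2)
    step = g - robot.q
    dist = np.linalg.norm(step)
    if dist > 0:                              # lines 17-18, Eq. (3)
        robot.q = robot.q + min(d_step, dist) * step / dist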
However, the advantage of our method is pronounced when targets are not uniformly distributed in the space but are instead grouped together within small regions. Under these circumstances, idle robots are especially helpful in learning the differences in target density among sub-regions and optimizing the assignment of tracking effort across them.

4 Simulations

We test our proposed algorithms via Matlab simulations. The task space is a 100 m × 100 m square. Targets are distributed in clusters: 30 are located in a 33 m × 33 m square sub-region at the lower-left corner of E, and another 30 in a 33 m × 33 m square sub-region at the top-right corner. Targets move within the sub-regions following a Gaussian random walk at a maximum speed of 3 m/s. All robots begin each trial at randomized locations within a 20 m × 10 m box at the bottom center of the environment. Robots have a maximum speed of 5 m/s and are equipped with isotropic sensors with a sensing radius of 5 m. Both the instantaneous and cumulative estimates are approximated by a uniform grid implementation, with a grid size of 0.1 m × 0.1 m. In the PHD filter, the robots use a Gaussian random walk (with σ = 0.35 m/s) for the motion model f, set the survival probability to 1, and set the birth PHD to 0, since no prior knowledge of targets is available. We use the same measurement model for homogeneous noisy sensors as [6]. Note that our proposed method is also compatible with heterogeneous sensing networks [4]; we make this choice simply to keep the tests straightforward.

Fig. 1. Distribution of 70 robots and 60 targets in the 100 m × 100 m square task space after 300 s of tracking using Dames' method and our method, respectively. Orange diamonds plot the locations of targets. Green squares and circles show the locations and fields of view of robots, respectively. Red crosses plot the robots' temporary goals. Dashed lines plot the boundaries of the robots' currently assigned spaces.
4.1 Qualitative Comparison

We first show how our proposed algorithms improve multi-target tracking using a single trial with 70 robots. Figure 1 shows the locations of robots and targets after 300 s using both Dames' method [6] and our method. When targets gather within only a small portion of the environment, the previous method can hardly optimize robot locations for more accurate tracking of targets, since only a few robots have found targets and are tracking them. As a result, most of the robots are idle and move erratically to cover areas where no targets are distributed, demonstrating the weakness of the previous algorithm: idle robots do not actively search for targets, causing an inefficient use of the team's total sensing capability while searching for un-tracked targets. This phenomenon is mitigated by our method, which drives a larger number of robots to gather at the clusters of targets while a handful of other robots continue to search unexplored areas, since the exact locations and motion models of targets are unknown. The result indicates the efficacy of our method: the distribution of the clusters is learned over time, and the power diagram distributes a more balanced workload to individual robots.

Fig. 2. Surfaces of wi(x) and wc(x) over the 100 m × 100 m task space after 300 s.
To illustrate the different characteristics of the instantaneous and cumulative estimates of targets, we plot in Fig. 2 the values of wi(x) and wc(x) for all x ∈ E after 300 s of the task using our method. The instantaneous estimate, i.e., the PHD, provides the best guess of the exact locations of targets at the current time through a set of sharp peaks, as shown in Fig. 2a. Given the low expected target birth rate, the PHD wi(x) returns to near zero rapidly once targets are no longer found at x, exhibiting high accuracy in estimating exact target locations but a short "memory" of the historical target distribution. On the contrary, the cumulative estimate wc presents a relatively smooth and continuous distribution over the task space, with higher values over the entire target clusters and near-zero values over the rest of the space, as shown in Fig. 2b. Therefore, the instantaneous and cumulative estimates are utilized to drive busy and idle robots, respectively, as the former require exact target locations for accurate tracking while the latter need the coarse distribution of clusters for optimized deployment.

4.2 Quantitative Comparison

To test the efficacy of our proposed approach, we conduct a series of trials using a range of team sizes (from 60 to 90 robots). Three tracking strategies are compared: 1) Dames' method ("D" method) in [6]; 2) Dames' method with the power diagram ("P" method), which uses the instantaneous estimate only for tracking (similar to the power diagram implementation in [4]), with the power weight depending on the PHD instead, given by

$$\rho_i = \left(\sum_{x \in W_i} wi(x)\right)^{-\frac{1}{2}}; \qquad (14)$$

and 3) our method ("O" method), which uses both the instantaneous and cumulative estimates, as outlined in Algorithm 1. For each team size we run 10 trials.

OSPA. To quantify performance, we use the first-order Optimal Sub-Pattern Assignment (OSPA) metric [14], a commonly used approach in MTT. The error between two sets X, Y, where |X| = m ≤ |Y| = n without loss of generality, is

$$d(X, Y) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\left(x_i, y_{\pi(i)}\right)^p + c^p (n - m) \right) \right)^{1/p}, \qquad (15)$$

where c is a cutoff distance, $d_c(x, y) = \min(c, \|x - y\|)$, and Π_n is the set of all permutations of the set {1, 2, ..., n}. This gives the average error in matched targets, where OSPA considers all possible assignments between elements x ∈ X and y ∈ Y that are within distance c of each other. It can be efficiently computed in polynomial time using the Hungarian algorithm [10]. We use c = 10 m, p = 1, and measure the error between the true and estimated target sets. Note that a lower OSPA value indicates more accurate tracking of the target set.
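A direct implementation of Eq. (15), using scipy.optimize.linear_sum_assignment (which accepts rectangular cost matrices) for the optimal matching, might look as follows; the function name is ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=10.0, p=1):
    # First-order OSPA distance, Eq. (15), between point sets X and Y,
    # given as (m, 2) and (n, 2) arrays; sets are swapped so that m <= n.
    X = np.asarray(X, float).reshape(-1, 2)
    Y = np.asarray(Y, float).reshape(-1, 2)
    if len(X) > len(Y):
        X, Y = Y, X
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    Dc = np.minimum(D, c) ** p                 # cutoff distances d_c^p
    rows, cols = linear_sum_assignment(Dc)     # optimal matching (Hungarian)
    cost = Dc[rows, cols].sum() + c**p * (n - m)   # plus cardinality penalty
    return (cost / n) ** (1.0 / p)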

Results. In Fig. 3a, we report the median OSPA value over the final 700 s of each trial as a measure of tracking accuracy, allowing the team to reach or approach a steady state and smoothing out the effects of spurious measurements that cause the OSPA to fluctuate; the results are aggregated into boxplots. The "D" method shows the worst performance in tracking clustered targets regardless of team size. As depicted by the results of the "P" method, this performance is improved by assigning robots optimized task spaces that contain equal instantaneous estimates of the number of targets, which drives the robots to gather where targets are currently tracked. However, robots still do not tend to move to where targets are likely to be clustered when no target is instantaneously tracked. Our proposed method remedies this flaw and shows the best tracking performance for clustered targets, as suggested by the OSPA values of the "O" method.

Fig. 3. OSPA errors over groups of trials. Figure 3a displays boxplots of average OSPA errors from 300 s to 1000 s using the three methods ("D", "P", and "O") with four team sizes, each over ten trials. |R| denotes the number of robots. Figure 3b plots average OSPA errors over the entire 1000 s for the total of 40 trials (10 for each team size) using the three methods.
In Fig. 3b, we plot the average OSPA values of all 40 trials across the four team sizes using the three methods over the entire 1000 s. The OSPA error drops at similar rates over the first 200 s, regardless of the tracking algorithm, as robots start moving from the starting area and explore the entire search space. After that, the "D" method no longer improves the tracking accuracy and the team reaches a steady state, while the other two algorithms continue to drive the OSPA error down. The "P" method reaches a steady state at around 400 s, after which the team has finished updating a steady-state PHD and propagating the optimized cell assignment. Meanwhile, the OSPA error continues to drop with the "O" method until approximately 600 s. This is because the cumulative estimate of the target number continues to grow as more observations are gained, leading to an enhanced congregation of robots in target clusters. This illustrates that our proposed method continuously learns the cluster distribution as robots explore, contributing to its superior tracking accuracy over the other two algorithms.

5 Conclusions

The tracking accuracy of existing distributed MR-MTT algorithms decays significantly when targets are clustered rather than evenly distributed across the task space. In this paper, we propose a novel distributed multi-target tracking algorithm that allows a team of robots to effectively track clustered targets despite having no prior knowledge of target states. Each robot estimates both the instantaneous and cumulative target density and dynamically optimizes its space assignment using a power diagram implementation of Lloyd's algorithm. As a result, robots are able to track detected targets precisely while congregating at target clusters by learning the coarse cluster distribution from past observations. Simulation results suggest that our algorithm outperforms the other candidates in effectively tracking clustered targets.

Acknowledgements. This work was supported by funding from King Abdullah Uni-
versity of Science and Technology (KAUST).

References
1. Ahmad, A., Tipaldi, G.D., Lima, P., Burgard, W.: Cooperative robot localization
and target tracking based on least squares minimization. In: 2013 IEEE Interna-
tional Conference on Robotics and Automation, pp. 5696–5701. IEEE (2013)
2. Blackman, S.S.: Multiple hypothesis tracking for multiple target tracking. IEEE
Aerosp. Electron. Syst. Mag. 19(1), 5–18 (2004)
3. Chen, J., Dames, P.: Collision-free distributed multi-target tracking using teams of
mobile robots with localization uncertainty. In: IEEE/RSJ International Conference
on Intelligent Robots and Systems (2020)
4. Chen, J., Dames, P.: Distributed multi-target tracking for heterogeneous mobile
sensing networks with limited field of views. In: IEEE International Conference
Robotics and Automation (2021), Under review
5. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing
networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004)
6. Dames, P.M.: Distributed multi-target search and tracking using the PHD filter.
Auton. Robot. 44(3), 673–689 (2020)
7. Doucet, A., Vo, B.N., Andrieu, C., Davy, M.: Particle filtering for multi-target
tracking and sensor management. In: Proceedings of the Fifth International Con-
ference on Information Fusion, vol. 1, pp. 474–481. IEEE (2002)
8. Hamid Rezatofighi, S., Milan, A., Zhang, Z., Shi, Q., Dick, A., Reid, I.: Joint
probabilistic data association revisited. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 3047–3055 (2015)
9. Konstantinova, P., Udvarev, A., Semerdjiev, T.: A study of a target tracking algo-
rithm using global nearest neighbor approach. In: Proceedings of the International
Conference on Computer Systems and Technologies (CompSysTech 2003), pp. 290–
295 (2003)
10. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logis-
tics Q. 2(1–2), 83–97 (1955)
11. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2),
129–137 (1982)
12. Mahler, R.P.: Multitarget Bayes filtering via first-order multitarget moments. IEEE
Trans. Aerosp. Electron. Syst. 39(4), 1152–1178 (2003)
13. Mahler, R.P.: Statistical multisource-multitarget information fusion, vol. 685.
Artech House Norwood, MA (2007)
14. Schuhmacher, D., Vo, B.T., Vo, B.N.: A consistent metric for performance evalu-
ation of multi-object filters. IEEE Trans. Signal Process. 56(8), 3447–3457 (2008)
15. Schwager, M., Rus, D., Slotine, J.J.: Decentralized, adaptive coverage control for
networked robots. Inter. J. Robot. Res. 28(3), 357–375 (2009)
16. Sung, Y., Budhiraja, A.K., Williams, R.K., Tokekar, P.: Distributed assignment
with limited communication for multi-robot multi-target tracking. Auton. Robot.
44(1), 57–73 (2020)
A Force-Mediated Controller for Cooperative Object Manipulation with Independent Autonomous Robots

Nicole E. Carey¹(B) and Justin Werfel²

¹ Autodesk Robotics Lab, Autodesk Inc., San Francisco, USA
nic.carey@autodesk.com
² Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Boston, USA
jkwerfel@seas.harvard.edu

Abstract. We consider cooperative manipulation by multiple robots assisting a leader, when information about the manipulation task, environment, and team of helpers is unavailable, and without the use of explicit communication. The shared object being manipulated serves as a physical channel for coordination, with robots sensing forces associated with its movement. Robots minimize force conflicts, which are unavoidable under these restrictions, by inferring an intended context: decomposing the object's motion into a task space of allowed motion and a null space in which perturbations are rejected. The leader can signal a change in context by applying a sustained strong force in an intended new direction. We present a controller, prove its stability, and demonstrate its utility through experiments with (a) an in-lab force-sensitive robot assisting a human operator and (b) a multi-robot collective in simulation.

Keywords: robot cooperation · distributed robotics · manipulation · force-based control

1 Background
Cooperative (or collaborative) manipulation is the task where two or more
agents, which may be any combination of robots and humans, must work together
to manipulate an object too large or unwieldy for one agent to handle alone.
Often one agent may act as a leader, seeking to move the object along a privately
known trajectory through space, aided by an arbitrary number of independent
helper agents who have no direct knowledge of the path or destination.
Challenges associated with this task can include the following: (1) The details
of the environment (topography of the ground, locations of obstacles, etc.) may
not be known by some or all agents, and may change dynamically. (2) The
number and locations of the agents participating in the task may not be known.
(3) The properties of the object (mass, geometry, etc.) may not be known. (4)
There may be unexpected disruptions (e.g., the object or agents may bump into
obstacles). (5) Communication may be problematic, e.g., unreliable, undesirable,
or unscalable.
We are interested in the case of an unanticipated need to move an object,
with no opportunity for advance preparation. That is, the properties of the
environment, object, set of helpers, and desired manipulation trajectory are all
unknown, and some of these elements may also be changeable and dynamic.
Further, if the leader is a human, then communicating the details of a desired
trajectory to the helper robots is unlikely to be quick, easy, or detailed and
accurate. From a usability standpoint, the ideal way for the robots to give help
would be for the human to try to move the object in the desired way directly, by
exerting forces on it, and for the robots to supplement the leader’s force to enable
the manipulation, without requiring any explicit instruction. An approach that
works under these restrictive conditions could also be used in a scenario where
more information or communication is available.

2 Related Work
Work on collective manipulation typically assumes that all robots share knowl-
edge about the intended trajectory, inertial parameters of the payload, and/or
number and locations of teammates, and that explicit robot-to-robot commu-
nication is available [1,7,15,17,19]. Some work has considered the problem of
obtaining such information in partially restricted cases, e.g., estimating inertial
parameters using carefully coordinated actions and knowledge about the team
[15]. Some studies rely on specialized gripper hardware that constrains freedom
of movement in specific ways [10]. A subcategory of cooperative manipulation
is collective transport, where movement is limited to a plane [4,9,10]. Here we
consider cases where only limited payload information (local mass and relative
center of mass (CM) location) can be obtained, no other information or explicit
communication is available, movement in any dimension may be desired, and
off-the-shelf hardware can be used.

3 Challenges and Assumptions

In general, a helper robot cannot disambiguate wrenches that are due to guidance from a leader, collisions, other agents, or the load itself (Fig. 1B). Thus to successfully follow guidance forces and reject all others, each robot must have a contextual framework through which it can observe, filter, and classify sensed forces, and apply an appropriately calculated control torque to smoothly accomplish the task goal.
In previous work [4], we presented an adaptive control framework for col-
lective transport, a special case of the more general cooperative manipulation
problem, in which an unknown object is carried by an arbitrary number of robots
with movement strictly in a horizontal plane. That work followed an operational
space control paradigm [12] in which the robots allow movement in the plane
Fig. 1. A: A robot collective transports an unknown object following a leader's guidance. B: Two helper robots handle a small rigid object, guided by a human leader's force applied to one corner (blue arrow). The left-hand robot experiences a single multi-dimensional wrench at its end-effector, with no disambiguation of the components resulting from the leader, the object's inertial properties, and forces due to the other agent. C: Physical testing of cooperative manipulation of a basket, using a Franka Emika Panda. D: An example application scenario: a robot helps a human manipulate a load in a challenging field situation of installing solar panels.

and prevent movement out of the plane. That is, the contextual framework that
robots used to interpret forces was the a priori knowledge that only movement
in the plane would be intended.
In this paper, we modify our previously presented controller to permit more
general collaborative manipulation tasks, encompassing movement in any direc-
tion. The central principle robots use to determine context is to interpret tran-
sient or weak forces as unintended (e.g., perturbations due to collisions), and sus-
tained stronger forces as intentional control signals. We show that if each robot
can estimate the payload’s mass and relative CM location, then it can respond
to control signals in any direction, switching between dimensional contexts with-
out loss of controller stability or compliance. We demonstrate this framework in
physical collaborative experiments using a compliant 7-DoF manipulator.
The key assumptions of the approach are: (1) Each agent can obtain an estimate of the load's relative CM location and its share of the mass (the latter may typically be m/n_a for a system of n_a agents and total mass m, with both m and n_a unknown to the agents). (See §4.2 for a discussion of how this assumption can be realized.) (2) There are enough agents to handle the load (n_a is large enough that m/n_a is significantly less than the carrying capacity of any single agent). (3) The grasp or connection between each agent and the shared load is rigid, with no slippage or in-grasp rotation. (4) Agents have some compliance, permitting a degree of conflict between forces exerted by different agents without immediately causing instability. (5) Each agent knows the orientation of its base with respect to a shared reference vector (e.g., robots can sense the direction of gravity while moving over uneven terrain). We do not assume other knowledge about the environment, task, or agent numbers and locations, nor direct explicit communication.

We note that compliance can be helpful in multiple ways. For instance, it facilitates movement of the object in response to the guiding force, providing displacement as well as force cues that can be used by helper robots to infer the intended motion (Table 1).

Table 1. Control variables and constants.

d: binary 6 × 1 dimensional constraint vector, [x, y, z, roll, pitch, yaw]^T
D_b: binary diagonal matrix s.t. diag(D_b) = d
Λ_x, Λ_ν: inertial matrices associated with the task and null spaces, respectively
p: the end-effector pose of the robot in Cartesian space
μ_x, μ_ν: Coriolis matrices associated with the task and null spaces, respectively
x: an n × 1 subvector of p consisting of the task-related pose dimensions
J: the (non-square) standard manipulator Jacobian
ν: an m × 1 null-space velocity vector, where the total degrees of freedom of the manipulator can be written n + m
J_E: the extended (n + m) × (n + m) manipulator Jacobian
J_t: the task-space Jacobian
η: a 6 × 1 vector describing the pose in non-task-related dimensions, with task dimensions (from x) set to zero
Z_ν: the null-space Jacobian
q: joint positions
w_t: an n × 1 vector of estimated external wrench along task-related dimensions
K_e, K_ν, γ: controller gain constants (stiffness)
B_ν, D_ν, D_x: controller gain constants (damping)
T: joint torque
f: force (from force/torque sensors) expressed in the robot's base frame
4 Control Methodology

4.1 Closed-Loop Control Dynamics

The system dynamics for each agent take the standard form

$$T = M\ddot{q} + C(q, \dot{q})\dot{q} + g(q) + T_{ext}, \qquad (1)$$
where the load mass and inertia are modeled as connected rigid components
of the end-effector, incorporated into the mass/Coriolis/gravitational matrices
accordingly.
Each agent’s (dynamic) context for the intended manipulation is encoded as
a binary 6 × 1 vector d, expressed in the robot’s base frame. di > 0 indicates
the agent should follow a guiding force along dimension i; di = 0 indicates a
dimension where any external force is likely to be disturbance-related.
We can use d to decouple the instantaneous control torque applied in the
robot’s joint space into two components: a task controller, and a null or residual
controller. Using d as a dimensional filter on the full manipulator Jacobian J
gives a task Jacobian Jt such that ẋ = Jt q̇. The null space Jacobian Zν can
be derived as in [18]. This allows us to develop a task-specific control schema,
and use the complementary space to enact a different form of control on the
remaining null dimensions.
In a previous study [4], we considered a framework in which the task space
was the set of dimensions in which external forces should be rejected, and the null
space allowed free movement along other dimensions. Here we invert that division,
so that instead of assigning a high priority task of stabilisation against unknown
inertial dynamics (with the force-following component relegated to a lower prior-
ity null space), we assign the collaborative force-following goal to the high prior-
ity task space, while disturbance rejection of out-of-task forces is accomplished in
the null space. This assignment allows the force-following behaviour of each agent
(and hence the collective) to be more sensitive and responsive.
The task space controller takes the following form (a task passivity-based control):

$$\ddot{x}_c = \Lambda_x^{-1}\left(w_t - (\mu_x + D_x)\dot{x} - \gamma K_e \tilde{x}\right), \qquad (2)$$

where w_t is an estimate of the guiding external wrench in the chosen task domain, expressed in the robot's base frame and measured by the robot using either joint torque sensors or a wrist-based force/torque sensor; (x, ẋ) describes the trajectory of the end-effector in the task-space dimensions only (i.e., x is a subset of the full 6 × 1 end-effector pose vector p); and Λ_x, μ_x are the task-space dynamic inertia and Coriolis matrices, respectively [20]. The error-correcting stiffness term K_e is explored in Sect. 4.2.
In a cluttered dynamic environment, it is likely that disturbance forces will
be due to environmental collisions. Rigidly maintaining null-space target posi-
tioning during a collision risks potential damage to the robot or the colliding
object/person. However, the control space of the null component is defined by
the residual degrees of freedom that remain unused by the task controller (i.e.,
it is thus both task- and robot-dependent). Noting this, we can take advantage
of the greater number of degrees of freedom offered by redundant manipulators,
and use a redundant joint space to accommodate collisions compliantly while
minimally impacting the task space behavior. (If, instead, a high rigidity null
space is desired, or the robot has insufficient DOFs to accommodate all possi-
ble task space configurations while rejecting error, a high stiffness impedance
controller based on joint space error could be used.)
We define ν as a null space velocity vector [4,20] and create an impedance control-based input torque of the following form:

$$T_\nu = -Z_\nu^{\#T}\left(Z^T K_\nu J_E \tilde{\eta} - B_\nu\left(\Lambda_\nu \dot{Z}^{\#}\dot{q} + (\mu_\nu + D_\nu)\nu\right)\right), \qquad (3)$$

where μ_ν, Λ_ν are the null-space Coriolis and inertial matrices [20], K_ν is a stiffness term acting on offsets in non-task-space end-effector dimensions, and B_ν, D_ν represent damping gains. The null position error state variable η̃ can be calculated using D̄_b, the binary matrix derived from the negation of d, assuming that for small changes in η and q we can use the expanded (full-rank) Jacobian J_E to transform errors in joint space to the (non-task-constrained) end-effector frame dimensions:

$$\delta\eta = \bar{D}_b J_E\, \delta q. \qquad (4)$$

(In fact for x = 0, where the goal joint configuration q_d is governed solely by the null space control input, η̃ is the 6 × 1 end-effector pose projection of the joint error q̃.)
The input torque corresponding to the task control components can be written

$$T_t = -J_t^T\left(\Lambda_x(\dot{J}\dot{q}) - \left(w_t - (\mu_x + D_x)\dot{x} - \gamma K_e \tilde{x}\right)\right), \qquad (5)$$

and our total input torque becomes

$$T = T_t + T_\nu + C(q, \dot{q})\dot{q} + g(q). \qquad (6)$$

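As a simple illustration of the dimensional decoupling, the following Python sketch splits a Jacobian into task rows selected by d and a complementary null-space projector. It uses a plain Moore-Penrose pseudoinverse rather than the dynamically consistent construction of [18,20] used by the controller above; all names are ours.

import numpy as np

def task_null_split(J, d):
    # Select the task rows of the manipulator Jacobian (d_i = 1) and build
    # a projector onto the remaining null space.
    d = np.asarray(d, dtype=bool)
    Jt = J[d, :]                              # task Jacobian: x_dot = Jt q_dot
    N = np.eye(J.shape[1]) - np.linalg.pinv(Jt) @ Jt   # null-space projector
    return Jt, N

# Example: a random 6x7 Jacobian with task dimensions {x, y, yaw}.
rng = np.random.default_rng(1)
J = rng.standard_normal((6, 7))
Jt, N = task_null_split(J, d=[1, 1, 0, 0, 0, 1])
# Joint motions projected into the null space produce no task-space velocity.
assert np.allclose(Jt @ (N @ rng.standard_normal(7)), 0, atol=1e-9)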
4.2 Inertial Estimation and Error Compensation

Here we discuss the need for each agent to have an estimate of the local load mass (m_i) and CM location (r_i), the relationship between mass inertia matrix error and controller responsiveness, and how these estimates can be achieved.

A robot can calculate an estimated mass inertia matrix M^i_est using q, m_i, and r_i. The true (unknown) mass inertia matrix of the (local) system is M^i. We define the error in this inertial estimation as follows:

$$\Delta M^i = M^i - M^i_{est}. \qquad (7)$$

When a follower agent senses an external force, we can assume that this force is predominantly composed of a wrench imparted by the guiding agent, plus any residual uncompensated inertial elements of the load:

$$F_{ext} = W_t + J^T \Delta M^i(q)\,\ddot{q}. \qquad (8)$$

Assuming a constant load, M^i, and hence ΔM^i, are bounded above [21]. In fact, given our assumption of enough agents to handle the load, ‖M^i‖ ≤ m_max, where m_max is the maximum load each agent can support. (Ideally it would be significantly less.) Then an absolute lower bound for the external guiding wrench can be written

$$W_{t,min} = f_{ext} - J^T\left(M_{max}(q) - M^i_{est}(q)\right)\ddot{q}, \qquad (9)$$
where M_max(q) is the mass matrix for a load of mass m_max at joint configuration q. Based on this minimum guiding wrench input and the current system state, we can project a minimum expected end-effector pose estimate x_g over a discrete time window Δt. The difference between the actual pose and this forward projection can be written as an error term in the task domain, x̃ = x_g − x. This allows us to introduce an error correction term into our control acceleration equation, using a spring stiffness gain matrix K_e (see Eq. 2), and our challenge then becomes choosing a suitable K_e so that it fully compensates for model error.

From Eq. 9, suppose f(M_max, q) is the inertial force applied by the maximum load the robot could support (calculated given the joint configuration q). Then a possible choice of the elements k_e of K_e is |k_{e_j}| ≥ max(f(M_max, q))_j; i.e., we estimate the maximum possible force exerted by the inertial mass matrix upper bound M_max in each task dimension j, and choose K_e to guarantee sufficient compensation force. Note that given Eq. 9, the ratio of {maximum force it is reasonable for the guiding agent to exert} to {upper bound of the load estimate} represents a metric for the system responsiveness.
Ke need not be constant: an adaptive stiffness term can be used without loss
of stability [14] (provided K̇e is bounded), as shown in [4], to create regions or
dimensions of high and low responsiveness based on the expected impact of errors
in the inertial estimate. Adaptive stiffness can also be increased temporarily in
response to unexpected motions of the load, to prevent instability that could
otherwise result if load inertia estimates are poor [4].
Other work has shown approaches for multiple robots to collaboratively esti-
mate load parameters [8,15]. In future work we plan to extend these approaches
to the case without explicit communication. In the reductive case of a single
guiding agent and a single follower (as in §5.1), an upper bound on ΔM i can be
found by examining the force residuals Δf (q) at the extremes of inertial offset,
while only the follower agent is carrying the load (call this a stiffness tuning
phase). Since in this case it is also relatively trivial to obtain a high degree of
accuracy in inertial load estimation [16], in practice these residual force errors
are minimal, and a small constant Ke suffices to fully compensate for model
error.
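A sketch of such a stiffness-tuning computation, under the assumption of a known joint-acceleration envelope qddot_max: bound the task-space force that the worst-case inertial mismatch can produce, and take each element of K_e at least that large. The names and the elementwise bounding are ours, not the paper's.

import numpy as np

def ke_lower_bound(J, M_max, M_est, qddot_max):
    # Worst-case joint-torque residual from the inertial-model error
    # (cf. Eqs. 8-9), mapped to an end-effector wrench via the Jacobian.
    dM = M_max - M_est
    tau_err = np.abs(dM) @ np.abs(qddot_max)   # elementwise worst case
    f_err = np.linalg.pinv(J.T) @ tau_err      # joint torque -> wrench
    return np.abs(f_err)                       # per-dimension bound on k_e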

4.3 Controller Stability

In this section, we consider the stability of the full operational space controller, including the model-error correction component described above. Let our state variable be z = {q̃, ν, T̃, f̃, x̃}. From [11], we can demonstrate asymptotic stability for the system if there exists a function V(z) such that in a neighborhood Ω of an equilibrium point z = 0, the following three conditions are satisfied:

1. V(z) ≥ 0 ∀z ∈ Ω, and V(0) = 0;
2. V̇(z) ≤ 0 ∀z ∈ Ω;
3. the system is asymptotically stable for the largest positive invariant set L in {z ∈ Ω | V(z) = 0}.
The total wrench acting on the system along the task dimensions can be written f = w_t + f̃, where w_t is the estimated guiding input wrench as before, and f̃ is the total force estimation error, now including not only inaccuracies in the system model as above, but also other sources of error such as sensor noise. (Note that we are not considering the constrained null dimensions, so this error term is not confounded by disturbance rejection.) At equilibrium, f = f̃, and we can introduce a disturbance observer to track error in the force domain. Using the task state error defined previously, we desire ẋ_g = 0, so $\dot{\tilde{f}} = -\Gamma_f^{-1}\dot{x}$, where Γ_f is symmetric and positive definite.

Let V(x, f) be a candidate Lyapunov sub-function that considers only task-related energy terms:

$$V(x, f) = \tilde{x}^T K_e \tilde{x} + \frac{1}{2}\dot{x}^T \Lambda_x \dot{x} + \frac{1}{2}\tilde{f}^T \Gamma_f \tilde{f}. \qquad (10)$$
This function is positive definite along (x, f) and positive semi-definite on the full state z (condition 1 above). Differentiating, we find

$$\dot{V} = 2\tilde{x}^T K_e \dot{x} + \dot{x}^T \Lambda_x \ddot{x} + \frac{1}{2}\dot{x}^T \dot{\Lambda}_x \dot{x} + \tilde{f}^T \Gamma_f \dot{\tilde{f}}. \qquad (11)$$

Substituting the closed-form dynamics from Eq. 2, we obtain

$$\dot{V} = 2\tilde{x}^T K_e \dot{x} + \dot{x}^T \Lambda_x \Lambda_x^{-1}\left(w_t - (\mu_x + D)\dot{x} - \gamma K_e \tilde{x}\right) + \frac{1}{2}\dot{x}^T \dot{\Lambda}_x \dot{x} + \tilde{f}^T \Gamma_f \dot{\tilde{f}}. \qquad (12)$$

We can substitute the disturbance observer defined above and obtain an estimate of the task-dimension force error by examining the position error in combination with the environmental stiffness. Then, considering the skew-symmetry of Λ̇_x − 2μ_x [20], we find that condition 2 above on V̇ is satisfied provided

$$2\tilde{x}^T K_e \dot{x} - \gamma \tilde{x}^T K_e \dot{x} - \dot{x}^T D \dot{x} < 0, \qquad (13)$$

and since D is constant and positive definite, we require only that the controller gain γ > 2 to ensure conditional stability.

Furthermore, under the set condition x = 0, our null state error term Z^T K_ν J_E η̃ reduces to Z^T K_ν q̃, and the null space control torque is analogous to that used in [20]. By using a similar subset function candidate V_L on the set L = {q̃, ν, T̃, f̃ = 0, x̃ = 0}, it can be shown that the system is asymptotically stable conditional on L (condition 3 above). Hence all conditions for overall system stability are fulfilled.

4.4 Task-Frame Switching

The guiding agent can indicate a change in d (e.g., switching from lifting to
horizontal carrying) to the other agents using the payload as the medium for a
physical signal. For this paper, we note that in the absence of slow changes in
uncompensated force driven by unknown load inertia (and with the assumption
of a joint-compliant null-space control), disturbance forces are likely to be abrupt


in onset and of limited duration. We can use the dynamic characteristics of
such disturbances to define an admittance/rejection filter, establishing a set of
time- and magnitude-based characteristics that signify a force imposed by the
knowledgeable agent intended to change the dimensional domain of the task,
and thus adjust the vector d on the fly. §6.2 discusses examples.
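A minimal sketch of such an admittance/rejection filter in Python, assuming per-dimension thresholds like those used later in Sect. 5 (t_th = 1.5 s, w_th = 5 N); the class and field names are ours.

class SwitchDetector:
    # Flags a task-context switch when a wrench component along a currently
    # rejected dimension stays above w_th for longer than t_th; shorter or
    # weaker forces are treated as disturbances and rejected.
    def __init__(self, t_th=1.5, w_th=5.0):
        self.t_th, self.w_th = t_th, w_th
        self.onset = [None] * 6

    def update(self, wrench, d, now):
        # wrench: 6-vector in the base frame; d: binary context vector;
        # now: current time in seconds. Returns the index of a dimension
        # requesting admission into the task space, or None.
        for i in range(6):
            if d[i] == 0 and abs(wrench[i]) > self.w_th:
                if self.onset[i] is None:
                    self.onset[i] = now
                elif now - self.onset[i] > self.t_th:
                    return i
            else:
                self.onset[i] = None
        return None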

5 Experimental Validation
The approach requires a force-sensitive manipulator with (ideally) > 6 degrees
of joint freedom. The Franka Emika Panda cobot meets these needs (7 DOF)
and is readily commercially available, making it an ideal platform with which
to undertake proof of concept experiments. In §5.1 we discuss hardware test-
ing/demonstration of the approach, using the single physical Panda available to
us to assist a human leader. In §5.2 we discuss simulations with four Pandas
assisting a leader. All code is available at [2].

5.1 Implementation in Hardware

We implemented an initial parameter estimation phase without a human in the loop, using a two-stage recursive least squares method [13] to estimate the approximate mass and CM location of the load, without requiring any foreknowledge of the load parameters or grasp point. Although the inclusion of off-diagonal inertias will in theory improve the stability region [14], we are interested in exploring the case where inertial estimation is imprecise or not all parameters can be determined. The estimation trajectory comprised a series of sinusoidal joint velocity commands (phase offset to each other to minimize inversion errors due to singularity points), operating symmetrically around the robot's default neutral pose. The joint velocity control during this inertial estimation phase was defined as

$$\dot{q}_i(t) = a_i \sin\left(2\pi\, \frac{t - t_L}{\omega_i}\right),$$

with amplitudes a = [0.3265, −0.3265, 0.225, −0.3, −0.4435, 0.3, −0.3625], cycle period scaling factors ω = [3.68, 2.04, 2.98, 1.75, 4.43, 2.749, 1.4], and t_L a phase offset constant.

The parametric convergence of the overall mass and CM location (relative to the end-effector grasp centroid) was usually accomplished in < 5 s, with a mass estimation accuracy within ±5% and a center-of-mass accuracy within ±10%. We did not seek to ensure highly accurate parameter estimation, because we anticipate precise inertial information to be difficult to obtain in the multi-agent case, and did not want this to be a necessary pre-condition when assessing the success of this proof of concept.
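The excitation command itself is straightforward to reproduce; a sketch using the amplitudes and period scaling factors above, with t_L = 0 as our arbitrary choice:

import numpy as np

# Phase-offset sinusoidal joint-velocity excitation from Sect. 5.1:
# qdot_i(t) = a_i * sin(2*pi*(t - t_L)/omega_i), one entry per joint.
a = np.array([0.3265, -0.3265, 0.225, -0.3, -0.4435, 0.3, -0.3625])
omega = np.array([3.68, 2.04, 2.98, 1.75, 4.43, 2.749, 1.4])

def qdot_cmd(t, t_L=0.0):
    return a * np.sin(2 * np.pi * (t - t_L) / omega)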
When calculating the null-space Jacobian projection Z, we must take par-
ticular care to ensure the devolved sub-Jacobian Jm [18] is well-conditioned,
by using a robust inversion function; this necessarily increases the computation
time required. The joint control torque is thus updated asynchronously, leading
to non-zero errors in torque and projected position estimation; however, we find
that the small correction stiffness Ke (described in §4.2) serves to accommodate these errors and ensure stability without impacting performance.
The Franka Panda is extremely sensitive to discontinuities in control inputs.
Abrupt changes in torque input can result in a safety system lock, and aside from
the need to low pass filter the raw torque sensor data due to a high noise level,
there is also the potential for such discontinuities during task switching. There-
fore, when a request for task dimension change has been registered, the real-time
controller (a) sends a short burst of torque commands (∼50ms) comprising mass
and Coriolis compensating elements only, in order to bring the system smoothly
to equilibrium, and (b) re-establishes a new null space origin at the current joint
and end-effector positions. This ensures a relatively smooth transition between
task contexts.
As discussed in Sect. 4.4, we use a force input from the guiding agent to signal
task dimension switches (changes in d). For each test case discussed in the next
section, we established a set di of pre-coded potential task domain vectors, and
time and wrench magnitude thresholds (tth , wi,th ) such that any external force
or torque > wi,th applied for > tth along a non-task dimension would signal a
switch in the task dimension vector. In our case, the number of pre-coded states
di is small; however, with a sufficiently granular set of di , we can attain as much
dimensional control flexibility as desired.

Fig. 2. Hardware implementation: behaviour in null-space (constrained; left) vs. task-space (unconstrained; right) dimensions. Top left: torque was applied around the pitch axis (relative to the robot's base). Bottom left: a force was applied in the Z (vertical) dimension. Top right: torque was applied around the yaw axis (relative to the robot's base). Bottom right: a force was applied in the Y (lateral) dimension. In all cases, blue indicates the applied force or torque, orange indicates the system response along or around the axis in question. Controller gains: Dx = 5.0 I3, Ke = 0.2 I3, γ = 2.5, Kν = diag(240, 120, 120, 120, 120, 120, 120), Bν = 0.2, Dν = 4.0 I4.

5.2 Implementation in Simulation


To demonstrate the efficacy of this method for multi-robot cooperation without
direct communication, we used a simulated Franka Emika Panda model created
in the Unity platform [3], and implemented the controller described above on
four simulated robots manipulating a large shared load. A separate simulated
leader applies forces and torques to the load. Each Panda is mounted on an
omnidirectional wheeled base; it uses lateral motion of its end-effector in the
task space as a control input to its base, thereby moving along with the leader’s
guidance in the plane, and extending its effective workspace.

Fig. 3. Robot guidance without and with contextual information. The human guide applies a torque to rotate the end-effector through 90 degrees around the roll axis. In the first sequence of images (top), the robot control torque compensates for gravitational load, but is otherwise compliant in all dimensions. In the second sequence (bottom), we implement our decoupled force/impedance algorithm with the task constraint vector d = {0, 0, 0, 1, 1, 0}, rejecting wrenches in any direction other than roll and pitch. X's show the position of the end of the object across successive frames. Video: [5].

6 Results
6.1 Controller Performance Within Static Task Domain
To investigate the performance of the two control elements (force following vs.
impedance-control), we defined our task domain as d = {1, 1, 0, 0, 0, 1}, applied a
set of short disturbance forces to a load held by a single robot, and examined the
resulting end-effector trajectory. A sample output is shown in Fig. 2, with motion
and force expressed relative to the robot’s base frame of reference. Disturbances
in the pitch or vertical motion are rejected by the null controller, while those
around the yaw axis or along the lateral Y axis result in motion for as long as
the wrench is applied, and do not return to the origin position afterwards.
To demonstrate the behavioural difference between our chosen decoupled
control schema and a more conventional compliant control that might be used
for, e.g., a learning by demonstration context, we consider a task whose goal is a
90 degree rotation of a manipulated object around the base’s roll axis. Figure 3
compares the motion of the robot and object for each of the two control schemes
in response to the same user input wrench. In the top set of images, the robot
is under a free-floating (mass-compensated) control without contextual informa-
tion about which wrench dimensions should be followed. Since the user input—as
expressed through the shared load—has wrench components in multiple dimen-
sions, the robot’s end-effector position not only rotates, but also moves in the
Y and Z dimensions. In the bottom image series, the robot is controlled by our
decoupled controller and supplied with a dimensional context which restricts
motion in Y and Z (d = {0, 0, 0, 1, 1, 0}). This controller decouples the wrenches
imparted by human manipulation of the load and responds only to those along
or around the task dimensions, resulting in an end-effector pose that changes
only along the roll axis, as desired. In other words, with this control framework
and given a suitable context, we can shape the robot’s resultant trajectory with
much greater precision. See video at [5] (part 1).

Fig. 4. Performing a collaborative task where force following and rejection dimensions
change, under the guidance of a human operator. The human guides the robot to first
insert a key into a slot, then turn it 90 degrees and slide it along the slot. Left: Still
image after insertion, during the turning phase. Right: The force and motion mapping
for one dimension (roll). Force following vs force rejection phases are indicated through
colour on the torque input (blue = force follow, red = force reject). Controller gains: $D_x = 5.0 I_3$, $K_e = 0.2 I_3$, $\gamma = 2.5$, $K_\nu = \mathrm{diag}(240, 120, 120, 120, 120, 120, 120)$, $B_\nu = 0.2$, $D_\nu = 4.0 I_4$. Task-switch thresholding: $t_{th} = 1.5$ s, $w_{th} = (5$ N$, 0.5$ N·m$)$. Video: [5].

6.2 Controller Performance with a Changing Task Domain


To demonstrate switching between contexts when the motion requirements of
a task have a clearly defined dimensional change, we consider a collaborative
key/lock positioning/insertion and rotation task. We pre-code a set of dimensional constraints $d_i$ corresponding to the key insertion (i = 0) and the key turn/slide (i = 1):

$$d_i = \begin{cases} \{1, 1, 1, 0, 0, 1\} & i = 0 \\ \{0, 1, 0, 1, 1, 0\} & i = 1 \end{cases} \qquad (14)$$
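To make the role of these binary vectors concrete, the following hedged sketch shows how a task-dimension vector can be turned into complementary diagonal selection matrices for the follow and reject subspaces (our illustration of the idea, not the authors' code):

```python
import numpy as np

def selection_matrices(d):
    """Split the 6-DoF wrench space according to a task-dimension vector.

    d : length-6 binary vector (x, y, z, roll, pitch, yaw);
        1 = force-following (task) dimension, 0 = rejected dimension.
    Returns (S_task, S_null): diagonal selection matrices whose sum is I_6.
    """
    d = np.asarray(d, dtype=float)
    S_task = np.diag(d)          # passes wrench components to be followed
    S_null = np.eye(6) - S_task  # components handled by the null controller
    return S_task, S_null

# key insertion phase: free in (x, y, z, yaw), constrained in (roll, pitch)
S_task, S_null = selection_matrices([1, 1, 1, 0, 0, 1])
```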
Figure 4 shows a snapshot during the process, and the roll input torque and
end-effector angle over the interaction. Initially, the robot is free to move in (x, y,
z, yaw), but in order to maintain the key in the correct orientation for insertion,
constrained in (roll, pitch). Once the key has been inserted into the slot, the
human seeks to shift the robot's context to allow a 90-degree turn, and does this by applying a firm roll torque for 2 s. The robot then shifts its constraints to a
new dimensional set where it is free in (y, roll, pitch) but constrained in (x, z,
yaw). See video at [5] (part 2).
The plot on the right of Fig. 4 shows the roll torque applied to the robot
through the shared load, and its corresponding roll angle. During the first phase
(indicated by a red color on the input torque), the robot successfully rejects motion
around its roll axis. Once the time/magnitude threshold for context switching has
been passed, the contextual control dimension changes to d1 , and the roll input
torque (now shown in blue) results in a corresponding angular rotation.

Fig. 5. Behaviour of a four-robot collective using our controller, following a two-stage
guiding force. Top left: stable state counteracting inertial load, before external force
application. Top right: An external force is applied along the global X axis. Bottom
left: An external torque is applied around the global X axis. Bottom right: The cor-
responding local Y and local pitch pose of one robot during this process. Video: [5].
6.3 Multi-robot Performance with Changing Task Domain

To demonstrate that this control methodology can be extended to multiple robots with no direct communication, knowing only a (local) estimate of the load inertia, we implemented a simulated trial with four Franka Emika Panda robots using the Unity platform.
After grasping a shared load, each robot applies a joint torque sufficient to
counteract the gravitational inertia of its portion of the manipulation object
(assumed for this proof of concept to be known ahead of time). An external
guiding force is then applied in two stages, first horizontally in the plane, then
rotationally around the long axis of the load object.
The corresponding dimensional control constraints for each stage are:

$$d_i = \begin{cases} \{1, 1, 0, 0, 0, 1\} & i = 0 \\ \{1, 0, 1, 0, 1, 0\} & i = 1 \end{cases} \qquad (15)$$

Figure 5 shows that using only an external force input and a pre-calculated
knowledge of load inertial parameters (note that the robots are not aware of the
number, locations, or behaviour of other agents co-manipulating the object),
the robot collective can ensure the load follows the guidance force. Video: [5]
(part 3).

7 Discussion
A fundamental reason for deploying decoupled operational space control, in this
task of robots performing collective manipulation with limited information, is to
address disturbances imposed by the lack of shared knowledge. Without aware-
ness of the others’ poses or intentions, agents cannot disambiguate between inter-
pretations of sensed forces, and thus in general will apply conflicting torques.
This research demonstrates that (a) given a task context, these kinds of conflicts
between agents can be resolved by using a compliant error-rejection framework;
(b) if the agents are further provided with an estimation of load inertia, this
task context can incorporate out-of-plane motion; and (c) force can be used as a
control signal to enable contextual switching between control dimension subsets.
The use case for such a system is likely to be in collaborative transport or
manipulation in complex, unknown, or dynamic environments. As §6.3 demon-
strates, this kind of spontaneous collaboration can be effectively deployed with
very little foreknowledge or preparation, requiring only one knowledgeable guid-
ing agent. This simplifies the navigational challenges of collaborative transport
considerably, and minimizes the shared knowledge base required. In particular,
using compliant control for both elements of the decoupled controller eliminates
any need for strict global kinematic alignment between agents; hence such sys-
tems should be robust to perturbations such as rough terrain or minor collisions.
To fully realize the goal of collective manipulation without prior knowledge
or communication, further work is needed for letting agents estimate inertial
parameters online, potentially during an interaction [6]. Since direct communication is not necessary for localised mass estimation [15], it may be feasible, e.g.,
to use an adaptive high-gradient stiffness controller [4] to implement a CM esti-
mation routine within a shallow 6DoF low-stiffness basin common to all agents,
solving the dynamic equations iteratively to allow each agent to calculate an
individual estimate of the approximate CM location. Future work will focus on
investigating such methods.

Acknowledgements. This work was supported by a Space Technology Research Institutes grant (number 80NSSC19K1076) from NASA's Space Technology Research Grants Program.

References
1. Bai, H., Wen, J.T.: Cooperative load transport: a formation-control perspective.
IEEE Trans. Rob. 26(4), 742–750 (2010)
2. Carey, N.: Decoupled Controller for the Franka Panda (Version 1.08) [Computer software] (2022). https://github.com/niccarey/panda_decoupled_controller
3. Carey, N.: Unity implementation of the Franka Panda platform (Version 1.0) [Computer software] (2020). https://github.com/niccarey/FrankaPanda_Unity
4. Carey, N.E., Werfel, J.: Collective transport of unconstrained objects via implicit coordination and adaptive compliance. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 12603–12609. IEEE (2021)
5. Carey, N.E., Werfel, J.: A force-mediated controller for cooperative object manipulation [Media file, video] (2022). https://youtu.be/VIoj6v_a-Gw
6. Ćehajić, D., Hirche, S.: Estimating unknown object dynamics in human-robot manipulation tasks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1730–1737. IEEE (2017)
7. Elwin, M.L., Strong, B., Freeman, R.A., Lynch, K.M.: Human-multirobot collaborative mobile manipulation: the Omnid Mocobots. arXiv preprint arXiv:2206.14293 (2022)
8. Franchi, A., Petitti, A., Rizzo, A.: Decentralized parameter estimation and obser-
vation for cooperative mobile manipulation of an unknown load using noisy mea-
surements. In: 2015 IEEE International Conference on Robotics and Automation
(ICRA), (pp. 5517–5522). IEEE (2015)
9. Habibi, G., Kingston, Z., Xie, W., Jellins, M., McLurkin, J.: Distributed centroid
estimation and motion controllers for collective transport by multi-robot systems.
In: 2015 IEEE International Conference on Robotics and Automation (ICRA), (pp.
1282–1288). IEEE (May 2015)
10. Hichri, B., Fauroux, J.C., Adouane, L., Doroftei, I., Mezouar, Y.: Design of cooper-
ative mobile robots for co-manipulation and transportation tasks. Robot. Comput.-
Integr. Manufact. 57, 412–421 (2019)
11. Iggidr, A., Sallet, G.: On the stability of nonautonomous systems. Automatica
39(1), 167–171 (2003)
12. Khatib, O.: A unified approach for motion and force control of robot manipulators:
the operational space formulation. IEEE J. Robot. Autom. 3(1), 43–53 (1987)
13. Kubus, D., Kroger, T., Wahl, F.M.: On-line rigid object recognition and pose esti-
mation based on inertial parameters. In: 2007 IEEE/RSJ International Conference
on Intelligent Robots and Systems, (pp. 1402–1408). IEEE (Oct 2007)
14. Liang, X., et al.: An adaptive time-varying impedance controller for manipulators.
Front. Neurorobot. 16 (2022)
15. Marino, A., Muscio, G., Pierri, F.: Distributed cooperative object parameter estimation and manipulation without explicit communication. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2110–2116. IEEE (May 2017)
16. Mavrakis, N., Stolkin, R.: Estimation and exploitation of objects’ inertial param-
eters in robotic grasping and manipulation: a survey. Robot. Auton. Syst. 124,
103374 (2020)
17. Montemayor, G., Wen, J.T.: Decentralized collaborative load transport by multiple
robots. In: Proceedings of the 2005 IEEE International Conference on Robotics and
Automation, (pp. 372–377). IEEE (Apr 2005)
18. Ott, C.: Cartesian impedance control of redundant and flexible-joint robots.
Springer (2008)
19. Petitti, A., Franchi, A., Di Paola, D., Rizzo, A.: Decentralized motion control for
cooperative manipulation with a team of networked mobile manipulators. In: 2016
IEEE International Conference on Robotics and Automation (ICRA), (pp. 441–
446). IEEE (May 2016)
20. Sadeghian, H., Villani, L., Keshmiri, M., Siciliano, B.: Task-space control of robot
manipulators with null-space compliance. IEEE Trans. Rob. 30(2), 493–506 (2013)
21. Sariyildiz, E., Sekiguchi, H., Nozaki, T., Ugurlu, B., Ohnishi, K.: A stability anal-
ysis for the acceleration-based robust position control of robot manipulators via
disturbance observer. IEEE/ASME Trans. Mechatron. 23(5), 2369–2378 (2018)
22. Wang, Z., Schwager, M.: Kinematic multi-robot manipulation with no communi-
cation using force feedback. In: 2016 IEEE International Conference on Robotics
and Automation (ICRA), (pp. 427–432). IEEE (May 2016)
A Distributed Architecture for Onboard
Tightly-Coupled Estimation
and Predictive Control of Micro Aerial
Vehicle Formations

I. Kagan Erunsal1,2(B), Rodrigo Ventura2, and Alcherio Martinoli1

1 Distributed Intelligent Systems and Algorithms Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
{kagan.erunsal,alcherio.martinoli}@epfl.ch
2 Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal
rodrigo.ventura@isr.tecnico.ulisboa.pt

Abstract. This paper presents a distributed estimation and control architecture for leader-follower formations of multi-rotor micro aerial vehicles. The architecture involves multi-rate extended Kalman filtering
and nonlinear model predictive control in order to optimize the system
performance while satisfying various physical constraints of the vehicles,
such as actuation limits, safety thresholds, and perceptual restrictions.
The architecture leverages exclusively onboard sensing, computation,
and communication resources, and it has been designed for enhanced
robustness to perturbations thanks to its tightly-coupled components.
The architecture has initially been tested and calibrated in a high-fidelity
robotic simulator and then validated with a real two-vehicle system
engaged in formation navigation and reconfiguration tasks. The results
not only show the high formation performance of the architecture while
satisfying numerous constraints, but also indicate that it is possible to
achieve full navigation and coordination autonomy in presence of severe
resource constraints as those characterizing micro aerial vehicles.

Keywords: Formation control · micro aerial vehicles · distributed nonlinear model predictive control · relative and onboard localization

1 Introduction
Recent years have been characterized by various improvements in the coordina-
tion and cooperation strategies of autonomous robots. This is because loosely
connected systems controlled by distributed or decentralized strategies have
inherent advantages over centrally governed systems, such as additional flexi-
bility, robustness, and scalability [1]. Among these strategies, formation control,
usually aimed at achieving prescribed geometric constraints between vehicles, is
one of the most actively investigated topics [2]. A promising method to optimally
achieve this goal while satisfying diverse constraints is Nonlinear Model Predic-
tive Control (NMPC) due to its architectural heterogeneity, simple reconfigura-
bility, and hierarchical flexibility [3]. Furthermore, to improve the performance,
the distributed version of this algorithm is a prominent candidate because it
approximates the global objective more accurately than its decentralized coun-
terpart by including inter-robot communication.
However, NMPC-based algorithms require an accurate model of the system,
correct state estimates, and sufficient computational resources. Furthermore, the
pure distributed version of the algorithm requires a robust communication frame-
work. These requirements might be challenging to satisfy primarily on resource-
constrained Micro Aerial Vehicles (MAVs). For example, system identification
techniques might not yield a faithful model, the system might be time-varying
due to the intrinsic nature of multiple vehicles, and only the partial state infor-
mation and limited computational power might be available because of hardware
and software constraints. As a result, these problems should be addressed in a
tightly-coupled estimation and control architecture that considers the various
constraints in order to be deployed successfully on real robots. Another critical
challenge for deploying such systems is the dependency of robotic platforms on
global localization systems, such as Motion Capture Systems (MCSs) or Global
Navigation Satellite Systems (GNSSs), for indoor and outdoor scenarios, respec-
tively.
There are various successful real-world implementations of model predictive
approaches for the formations of multi-robot systems. Among those, we focus
on decentralized and distributed approaches with real-world implementations.
In [4], authors propose a trajectory planner that considers various task require-
ments such as swarm coordination, flight efficiency, obstacle avoidance for a
swarm of flying robots and demonstrates the method in cluttered environments.
Although the strategy seems highly effective and scalable for flocking navigation,
visual-inertial odometry (VIO) coupled with Ultra-Wide Band (UWB) distance
measurement (without orientation information) might not be accurate enough
for formation control applications. In [5], authors present a decentralized strat-
egy for the compact swarm of unmanned aerial vehicles without communication
and external localization. The approach is well-constructed for outdoor flock-
ing applications; however, the accuracy of the ultra-violet (UV) direction and
ranging technique is limited for indoor formation applications. Furthermore,
since predictive information is not shared among vehicles, it might yield low
performance in dynamic scenarios. A distributed methodology is studied for
the formation control and obstacle avoidance of homogeneous robots employ-
ing multi-robot consensus in [6]. The method is scalable and efficient with the
number of robots; however, the implementation is not carried out onboard, and
a motion capture system is used to obtain state feedback. In [7], a distributed
predictive approach is presented for drone swarms in cluttered environments
and with high sensor noise. Although the real-world experiments with sixteen
quadrotors show the scalability of the strategy, the computations are not carried
out onboard, and the localization solution is global. In [8], authors introduce a
decentralized robotic front-end for autonomous aerial motion capture in an outdoor environment and integrate perception constraints into the optimization
problem. Although computations are performed onboard, GNSS information
is collected and shared among vehicles to generate relative position informa-
tion, which is not always reliable. Other works with real-world implementations
include dynamic encirclement [9], predictive outdoor flocking [10], formation in
motion [11], formation with communication constraints [12] and indoor forma-
tion with visual-inertial odometry [14]. However, the papers listed previously rely on global localization systems and/or offboard computation and/or shared coordinate frames and/or a reliable communication network.
This paper presents a distributed estimation and control architecture based
on a multi-rate Extended Kalman Filter (EKF) and NMPC to perform leader-
follower formations of multi-rotor MAVs under state, actuation, and perception
constraints and in the presence of environmental disturbances and model uncer-
tainties. Our focus here is mainly on the follower subsystem since it constitutes
the central part of the leader-follower formations. We validate the performance
and applicability of this architecture, leveraging exclusively onboard resources
and relative sensing in a high-fidelity robotic simulator and a real-world setup
with diverse experiments. In particular, followers obtain relative pose and body-
fixed velocity information by employing their coupled optical flow and distance
sensor, and an RGB camera tracking April bundles [15].
The main contribution of this paper is to meet the following requirements
simultaneously. First, the formation subsystems exclusively employ onboard
resources for localization, computation, communication, and sensing. This brings
the multi-robot system closer to practical applications. Second, the components
of the overall architecture are designed to be scalable, performant, and robust
at the network level. The main ingredient that allows such a characteristic is the
tight-coupling of perception, communication, estimation, and control at the indi-
vidual robot level. This unified architecture leads each component not only to
exploit its full potential, but also to assist the other components in maintaining
a robust operation. In particular, the distributed NMPC exhibits high perfor-
mance with an accurate model, state estimates, information from neighboring
vehicles, and enough computational resources. For this purpose, the computa-
tionally efficient distributed EKF in each vehicle employs both the information
received on communication and perception assets to provide uninterrupted state
and uncertainty estimates, even if the communication channel has flaws. On the
other hand, the individual EKFs require sufficient perception and communica-
tion update rates to perform successfully. This is where the distributed NMPC
helps the perception and communication components to satisfy their various
constraints, such as the field of view (FOV) and range. Furthermore, the simple
but efficient prediction model selected for NMPC not only minimizes commu-
nication requirements, but also maximizes computational efficiency. Finally, the
distributed architecture can be scaled linearly by increasing the computational
power up to a maximum number of neighbors.
2 Problem Formulation
Similar to [16], the following notation for vectors and rotation matrices will be
adopted assuming that {a}, {b}, {c} are the coordinate frames and x is the name
of the vector:
$x^c_{a/b}$: vector quantity $x \in \mathbb{R}^n$ of the origin of {a} with respect to the origin of {b}, expressed in {c}.

Fig. 1. An example of formation network consisting of one leader ($L_0$) and five followers ($F_i$, i = 1, ..., 5): two sub-networks for $F_1$ and $F_4$; the corresponding relative positions, sensing and communication ranges are also indicated.

Fig. 2. MPC coordinate frames in a prediction horizon of length N: {m} (blue), {b} (yellow), {d} (red), {c} (green). (Color figure online)

$R_a^b$: rotation matrix $R \in SO(3)$ that transforms a vector expression from {a} to {b} with the Z-Y-X convention.
$\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a real matrix, and $\|x\|_P := \sqrt{x^T P x}$ (with $P \in \mathbb{R}^{n \times n}$ and $P \succ 0$) stands for the weighted Euclidean norm of $x \in \mathbb{R}^n$.
Consider a formation network N that consists of K +1 aerial robots. Assume
that one of them is assigned as a leader L and the rest are followers Fi so that
N = {L, F1 , ..., FK }. All robots operate in a workspace W ⊂ R3 . The state of
each robot i ∈ N is defined as follows:
$$\chi_i = [\, x_{b/n}^{n\,T} \;\; v_{b/n}^{n\,T} \;\; t_{b/n}^{n\,T} \;\; w_{b/n}^{b\,T} \,]^T \qquad (1)$$

where $x^n_{b/n}$ and $v^n_{b/n}$ represent the position and linear velocity, $t^n_{b/n}$ is the vector of Euler angles, and $w^b_{b/n}$ denotes the angular velocities. Here, {n} is the Earth-fixed inertial frame and {b} is the body-fixed frame. Then, for instance, the vector $x^n_{b/n}$ would represent the position of the origin of {b} with respect to {n} and expressed in the latter coordinate system.
We adopt the full dynamics of the multi-rotor MAV including dominant (propeller) and auxiliary (drag) aerodynamic effects based on [17,18]. Note that
although this model can be used as a plant model for any simulation, it is
too complex for optimal control purposes. We will explain the simplified model
used for MPC in Sect. 3. Finally, due to the physical limits of the propellers
and electric motors of the aerial vehicle, the inputs should be within a set of
constraints, that is, $u^b_{b/n} \in \mathcal{U}_c$. Additionally, due to vehicle safety limits, we have $\chi_i \in \mathcal{X}_c$.
Consider that a robot $i \in \mathcal{N}$ has a sensing area $A_s \subset B_s(x_i, r_s)$ and a communication area $A_c \subset B_c(x_i, r_c)$, where $x_i$ is the center and $r_s$ and $r_c$ are the radii of the norm balls $B_s$ and $B_c$, respectively. Assume $\Delta x_{ij} = x_j - x_i$ is the relative position of robot $j \in \mathcal{N}$ with respect to robot $i \in \mathcal{N} \setminus \{L\}$. Then the sub-network for follower i is defined as follows: $\mathcal{N}_i := \{i, j \in \mathcal{N},\, j \neq i : \|\Delta x_{ij}\| < \min(r_s, r_c)\}$. Figure 1 represents this network structure.
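For illustration, the sub-network definition translates directly into code; the following minimal sketch assumes centrally known positions, whereas onboard the set emerges implicitly from the sensing and communication ranges:

```python
import numpy as np

def sub_network(i, positions, r_s, r_c):
    """Neighbour set N_i of follower i (a sketch of the definition above).

    positions : dict mapping robot id -> 3D position (np.ndarray)
    r_s, r_c  : sensing and communication radii
    """
    x_i = positions[i]
    radius = min(r_s, r_c)
    return {i} | {j for j, x_j in positions.items()
                  if j != i and np.linalg.norm(x_j - x_i) < radius}

# example: follower 1 among three robots
pos = {0: np.zeros(3), 1: np.array([2.0, 0, 0]), 2: np.array([9.0, 0, 0])}
print(sub_network(1, pos, r_s=6.0, r_c=5.0))  # -> {0, 1}
```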
In this work, all robots, including the leader, have an output type of feedback
(i.e., no state-based feedback across vehicles is possible). The only vehicle that
has access to its absolute position by a global localization system is the leader.
As a result, the leader is responsible for the absolute navigation of the robot
ensemble. All vehicles are equipped with a coupled optical flow and distance
sensor to obtain linear velocities and ground distance, an Inertial Measurement
Unit (IMU), and magnetometer to acquire linear accelerations, rotational veloc-
ities, and Euler angles. All follower vehicles are endowed with an onboard, lim-
ited range and FOV, 3D, relative localization system for measuring inter-vehicle
poses of neighboring vehicles (i.e., relative range, bearing and orientation). We
selected to use an RGB camera and April bundles to obtain such 3D relative
pose information of the neighbor vehicles due to its accuracy and implementa-
tion simplicity. We do not assume any reliable communication network among
vehicles, but we benefit from it when it is available. Furthermore, all vehicles
are subject to model uncertainty and slowly varying airflow disturbances due to the wake effect of neighboring vehicles. Finally, all sensors
are characterized by zero-mean Gaussian noise.

Problem 1. Consider one leader and K follower robots under the input, perception, communication, and state constraints $u^b_{b/n} \in \mathcal{U}_c$, $z^b_{b/n} \in \mathcal{Z}_c$ and $\chi \in \mathcal{X}_c$. Here, the leader can be considered as a separate and independent trajectory-tracking robot. Under the assumptions explained in the previous paragraph, design a control and estimation strategy for followers so that the robotic network will follow a predefined reference trajectory $r_p \in \mathbb{R}^3$ while maintaining the formation and connectivity defined by, possibly time-varying, relative position references $\Delta x^{*}_{ij}$ and relative yaw angle references $\Delta\psi^{*}_{i}$, where $i \in \mathcal{N}_i \setminus \{L\}$ and $j \in \mathcal{N}_i$.
3 Methodology
The components of our distributed architecture are selected and designed to take
into account resource constraints. In particular, we have paid attention to dis-
tribute the architecture over the network while preserving its strong integration
at the robot level. The former feature allows us to obtain scalable behavior, while
the latter brings high performance and robustness due to the tight coupling of
the perception, communication, estimation, and control components.
The proposed architecture is represented in Fig. 3. Here, the neighborhood
consists exclusively of followers, but the leader could also be part of the sub-
network. One can observe the tight-coupling of components on the vehicle level
and their distribution over the network. The details of the estimation and for-
mation control components and their relationship with perception and commu-
nication will be explained in the next section.
To comprehend the details of the estimation and control architecture, we
should define the coordinate frames employed here. The first frame is the body-
fixed reference frame {b} and is defined for the real-time horizon, the second is
the body-fixed MPC reference frame {d} anchoring to the body throughout any
MPC prediction horizon. As a result, {b} and {d} coincide at the beginning of
each prediction horizon. Furthermore, {m} is defined as the inertial MPC-EKF
frame, which corresponds to the zero vector at the beginning of the motion. The
final reference frame is introduced as the MPC control frame {c}, which is a roll
and pitch-free frame anchored to the {d} and adopted to define the control and
reference variables. These frames are represented in Fig. 2.

Fig. 3. Distributed architecture for a sub-network; the tight coupling of components at the agent level is observable.

3.1 Follower Estimation Scheme

We adopt a distributed multi-rate EKF approach to carry out the estimation function due to its computational efficiency. The related literature is vast, and we refer the reader to [19,20] and the references therein. For the sake of clarity,
we will present the process and perception models and highlight the important
points.
Consider a new extended state definition for each robot i ∈ N \ {L}:

$$\xi_{ii} = [\, v_i^T \;\; \dot\psi_i \;\; d_{i,a}^T \,]^T, \qquad \xi_{ij} = [\, \Delta x_{ij}^T \;\; v_j^T \;\; \Delta\psi_{ij} \;\; \dot\psi_j \,]^T \qquad (2)$$

where $\Delta x_{ij}$, $\Delta v_{ij}$ and $\Delta\psi_{ij}$ are the relative positions, velocities, and yaw angles of robot i with respect to $j \in \mathcal{N}_i$; $v_i$, $v_j$, $\dot\psi_i$ and $\dot\psi_j$ are the linear and yaw velocities of robots i and $j \in \mathcal{N}_i$, respectively. Finally, $d_{i,a}$ stands for the slowly varying disturbances in the acceleration dynamics of agent i. Note that all states are expressed in the inertial frame {m}. Let us concatenate the defined states so that we obtain an extended state for an individual robot $i \in \mathcal{N}_i \setminus \{L\}$:

$$\xi_i = [\, \xi_{ii}^T \;\; \xi_{ij_1}^T \;\; \xi_{ij_2}^T \;\cdots\; \xi_{ij_N}^T \,]^T \qquad (3)$$

where $j_1, j_2, \dots, j_N \in \mathcal{N}_i$.
Consider a generic nonlinear time-discrete model of an individual agent i:

$$\xi_i[k+1] = f_i(\xi_i[k], u_i[k], w_i[k]), \qquad z_i[k] = h_i(\xi_i[k], \nu_i[k]) \qquad (4)$$

where $f_i$ is the process map, $h_i$ is the perception map, and $w_i \sim \mathcal{N}(0, \sigma_w)$ and $\nu_i \sim \mathcal{N}(0, \sigma_\nu)$ are the zero-mean Gaussian noise vectors in the process and perception models, respectively. Then, inspired by [21], $\forall j \in \mathcal{N}$, the process model can be written explicitly as follows:

$$v_i[k+1] = v_i[k] + \Delta t \left( R_i[k] \frac{F_i[k]}{m} - g + d_{i,a}[k] + w_{i,a}[k] \right) \qquad (5)$$
$$\psi_i[k+1] = \psi_i[k] + \Delta t\,(w_{i,\psi}[k]) \qquad (6)$$
$$\dot\psi_i[k+1] = \dot\psi_i[k] + w_{i,\psi}[k] \qquad (7)$$
$$\Delta x_{ij}[k+1] = \Delta x_{ij}[k] + \Delta t\,(v_j[k] - v_i[k]) - \frac{\Delta t^2}{2} \left( R_i[k] \frac{F_i[k]}{m} - g - d_{i,a}[k] - w_{ij,\Delta a}[k] \right) \qquad (8\text{–}9)$$
$$v_j[k+1] = v_j[k] + \Delta t\,(w_{j,a}[k]) \qquad (10)$$
$$\Delta\psi_{ij}[k+1] = \Delta\psi_{ij}[k] + \Delta t\,(\Delta\dot\psi_{ij}[k] + w_{ij,\Delta\psi}[k]) \qquad (11)$$
$$\Delta\dot\psi_{ij}[k+1] = \Delta\dot\psi_{ij}[k] + w_{ij,\Delta\psi}[k] \qquad (12)$$
$$d_{i,a}[k+1] = d_{i,a}[k] + w_{i,d}[k] \qquad (13)$$

where Fi is the input, Ri stands for the rotation matrix, g is the gravitational
acceleration vector, Δt is the sample time, w’s represent the Gaussian noise, d’s
are the slowly varying disturbance (or uncertainty) on the corresponding states.
Note that these lumped disturbance estimates accounting for model uncertainty,
environmental disturbance, and time-varying characteristics of the vehicle are
crucial for the robust behavior of the predictive formation controller. In addition, some definitions in the model can be written as follows: $w_{ij,\Delta a} := w_{j,a} - w_{i,a}$ and $w_{ij,\Delta\psi} := w_{j,\psi} - w_{i,\psi}$. Note that all equations are written with respect to the {m} frame and $\forall j \in \mathcal{N}_i$. Furthermore, $\forall j \in \mathcal{N}_i$, the perception model is given as follows:

$$z_{i,\Delta x}[k] = g(R_i[k]\,\Delta x_{ij}[k] + \nu_{i,\Delta x}[k]) \qquad (14)$$
$$z_{i,v}[k] = R_i[k]\,v_i[k] + \nu_{i,v}[k] \qquad (15)$$
$$z_{i,\Delta\psi}[k] = \Delta\psi_{ij}[k] + \nu_{i,\Delta\psi}[k] \qquad (16)$$
$$z_{i,\psi}[k] = \psi_{ij}[k] + \nu_{i,\psi}[k] \qquad (17)$$
$$z_{i,\dot\psi}[k] = \dot\psi_{ij}[k] + \nu_{i,\dot\psi}[k] \qquad (18)$$
$$z_{i,v_j}[k] = R_j[k]\,v_j[k] + \nu_{i,v_j}[k] \qquad (19)$$

where z represents the measurement, ν stands for perception noise, and the
function g(.) maps the relative position to the range and the bearing information.
Furthermore, it is assumed that the robot has an accurate attitude estimator
that generates the roll and pitch estimates φi [k] and θi [k] so that the rotation
matrix Ri is fully defined.
Remark 1. (Neighboring vehicle states): Eq. (19) is critical in the perception
model. vj represents the velocity of any neighboring vehicle j ∈ Ni received
by communication and must be assigned to the reference frame of vehicle i by
Rj . However, if it is not available due to various communication flaws, the EKF
provides a continuous estimation and tracking of the velocity of neighboring
vehicles.
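To ground the filter equations, the following is a hedged sketch of one EKF prediction step for the $(v_i, d_{i,a})$ sub-state, combining Eqs. (5) and (13); the interface and matrix names are ours, not those of the authors' implementation:

```python
import numpy as np

def ekf_predict(v, d_a, R, F, m, g, dt, P, Q):
    """One EKF prediction step for the stacked sub-state [v_i; d_{i,a}].

    Follows Eqs. (5) and (13): velocity integrates thrust, gravity and the
    lumped disturbance; the disturbance follows a random walk.
    v, d_a : current estimates (3-vectors); R : attitude rotation matrix
    F      : thrust input in the body frame; m : vehicle mass
    g      : gravitational acceleration vector (3-vector)
    P, Q   : 6x6 state and process-noise covariances
    """
    v_pred = v + dt * (R @ F / m - g + d_a)
    d_pred = d_a.copy()
    # Jacobian of the stacked state w.r.t. itself: v depends on d_a via dt
    A = np.block([[np.eye(3), dt * np.eye(3)],
                  [np.zeros((3, 3)), np.eye(3)]])
    P_pred = A @ P @ A.T + Q
    return v_pred, d_pred, P_pred
```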

3.2 Follower Controller Scheme


Reliable and high-performance attitude controllers are commonly available for
today’s aerial vehicles. In order to separate high-frequency tasks (attitude con-
trol) from low-frequency ones (formation control), to exploit the full potential
of the autopilot and onboard computer, and to simplify the prediction model,
we leverage a cascaded architecture as shown in Fig. 3. The attitude controller
is selected as the PID-based controller proposed by [22] and the formation con-
troller is a Distributed Nonlinear Model Predictive Controller (D-NMPC). The
reason behind the latter choice is that NMPC can simultaneously handle con-
straints and optimize the performance in a systematic manner, and the dis-
tributed version inherently has higher performance since it approximates the
centralized problem more accurately compared to its decentralized counterpart.
Here, we use the following notation to distinguish predicted values from
actual ones: v[m|n] is the value of the variable v at a discrete instant k = n,
predicted at k = m where m ≥ n. Furthermore, N is defined as the prediction
(and control) horizon and Δt is the sampling time. Now, let us write the Open
Loop Control problem (OCP) P(ξ̂[0], k) (at time k with initial estimated state
ξ̂[0]) inspired by [23]:

$$\min_{\substack{\xi_i[k|k],\dots,\xi_i[k+N|k] \\ u_i[k|k],\dots,u_i[k+N-1|k]}} \;\; \sum_{n=0}^{N-1} J_k(\xi_i[k+n|k], u_i[k+n|k], r_i[k+n|k]) + J_N(\xi_i[k+N|k], u_i[k+N|k], r_i[k+N|k]) \qquad (20)$$

subject to the following constraints for n = 1, 2, ..., N:

$$\xi_i[k+n+1|k] = \hat f(\xi_i[k+n|k], u_i[k+n|k]) \qquad (21)$$
$$g_n(\xi_i[k+n|k], u_i[k+n|k]) \le 0 \qquad (22)$$
$$h_n(\xi_i[k+n|k], u_i[k+n|k]) = 0 \qquad (23)$$
$$\xi_i[k|k] = \hat\xi_i[0] \qquad (24)$$

Here, $J_k$ is the stage cost, $J_N$ is the terminal cost, and $\xi_i$, $u_i$ and $r_i$ represent the state, input and reference of vehicle i, respectively. Additionally, $\hat f$ is the prediction function, and $g_n$ and $h_n$ represent the inequality and equality constraints.
Before explaining the individual terms of the OCP, it is worth visiting the
pinhole camera model [24] so that we can extract the coordinates of the detected
vehicle in the image plane and incorporate this information into OCP to obtain
perception-aware behavior. Consider that Δxij is a representation of the relative
3D coordinates of the neighbor vehicle j and pj are the image coordinates of the
same vehicle assuming that the origin is the image center. Then we can obtain
the following relationship:

$$p_j = K\, C\, \Delta x_{ij} \qquad (25)$$

where C is the camera matrix, which is a function of the focal length f of the
camera and K is the gain.
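A literal sketch of Eq. (25) follows (our illustration; note that the paper's linear form omits the perspective division of a full pinhole model):

```python
import numpy as np

def project_to_image(dx_ij, C, K):
    """Literal sketch of Eq. (25): p_j = K C Δx_ij.

    dx_ij : 3D relative position of the neighbour vehicle
    C     : 2x3 camera matrix (a function of the focal length f)
    K     : scalar gain
    """
    return K * (C @ dx_ij)
```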
Now, the stage and terminal cost functions can be expressed as follows:
$$J_k(\xi_i[k+n|k], u_i[k+n|k], r_i[k+n|k]) = \sum_{j\in \mathcal{N}_i} \|\Delta x_{ij}[k+n|k] - \Delta x^{*}_{ij}[k+n|k]\|_{Q_{\Delta x}} + \sum_{j\in \mathcal{N}_i} \|v_j[k+n|k] - v_i[k+n|k]\|_{Q_v} + \|u_i[k+n|k]\|_{Q_u} + \|\Delta u_i[k+n|k]\|_{Q_{\Delta u}} + \sum_{j\in \mathcal{N}_i} \|p_j\|_{Q_p} \qquad (26)$$

$$J_N(\xi_i[k+N|k], u_i[k+N|k], r_i[k+N|k]) = \sum_{j\in \mathcal{N}_i} \|\Delta x_{ij}[k+N|k] - \Delta x^{*}_{ij}[k+N|k]\|_{Q_{\Delta x_N}} + \sum_{j\in \mathcal{N}_i} \|v_j[k+N|k] - v_i[k+N|k]\|_{Q_{v_N}} + \sum_{j\in \mathcal{N}_i} \|p_j\|_{Q_{p_N}} \qquad (27)$$
where $\Delta x^{*}_{ij}$ represents the geometric formation reference and $\Delta u_i$ stands for the rate of change of inputs. As a result, in the stage cost function, the first term minimizes the formation errors, the second term stabilizes the relative velocities, the third and fourth terms optimize the control effort, and the last one is responsible for satisfying camera FOV constraints.
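For a single neighbour $j$, the stage cost of Eq. (26) can be sketched as follows; we use squared weighted norms, as is common in MPC implementations, and the weight names are ours:

```python
import numpy as np

def wnorm2(x, Q):
    """Squared weighted norm ||x||_Q^2 = x^T Q x."""
    x = np.atleast_1d(x)
    return float(x @ Q @ x)

def stage_cost(dx, dx_ref, v_i, v_j, u, du, p, W):
    """Stage cost of Eq. (26) for a single neighbour j (weights in dict W).

    For several neighbours, the neighbour-dependent terms are summed
    over the sub-network N_i.
    """
    return (wnorm2(dx - dx_ref, W["QDx"])   # formation error
            + wnorm2(v_j - v_i, W["Qv"])    # relative velocity
            + wnorm2(u, W["Qu"])            # control effort
            + wnorm2(du, W["QDu"])          # control rate
            + wnorm2(p, W["Qp"]))           # image-plane (FOV) term
```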
The continuous version of the prediction function $\hat f$ in Eq. (21), which includes not only ego but also neighboring vehicle motion characteristics, and which can be discretized with any efficient method such as implicit Runge-Kutta (i-RK), can be written as follows:

$$\Delta\dot x^m_{d/m,ij} = v^m_{d/m,j} - v^m_{d/m,i}$$
$$\dot v^m_{d/m,i} = \frac{R^m_d}{m_b}\, F^d_{d/m,i} + g + C_d\, v^m_{d/m,i} + d_{i,a}$$
$$\dot t^m_{d/m,i} = \frac{1}{\tau_i}\,(k_i\, t_{ref,i} - t^m_{d/m,i})$$
$$\dot\psi^m_{d/m,j} = 0, \quad \forall j \in \mathcal{N}_i \qquad (28)$$

where the inputs are the thrust $F^d_{d/m,i}$ and the attitude angle references $t_{ref,i}$. Finally, $g_n$ and $h_n$ include constraints to impose the safety conditions/limits such as the maximum velocity, thrust, and Euler angles that the robotic agent is allowed to achieve. The following remarks will be helpful to gain more insight about the OCP:
Remark 2. (Information sharing): At each sample instant, $v^m_{d/m,j}$, the predicted velocity of neighboring vehicles, should be received, and $v^m_{d/m,i}$, the vehicle's own predicted velocity, should be transmitted over the communication channel for the whole horizon. If the received information is not available for some reason, the D-NMPC leverages the estimates sent by the local EKF for the prediction. In this case, the formation controller assumes a constant velocity over the prediction horizon.
horizon.
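The fallback described in Remark 2 reduces to a few lines; this sketch assumes the communicated message, when present, contains the neighbour's full predicted velocity trajectory:

```python
def neighbour_velocity_prediction(received_trajectory, ekf_velocity, N):
    """Remark 2 fallback sketch: prefer communicated predictions; otherwise
    hold the local EKF's velocity estimate constant over the horizon."""
    if received_trajectory is not None:
        return received_trajectory          # length-N predicted velocities
    return [ekf_velocity] * N               # constant-velocity assumption
```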

Remark 3. (Prediction model): The prediction model includes the closed-loop attitude subsystem as three first-order linear differential equations. This simplifies the prediction model and allows a seamless separation of tasks. Their parameters can be easily identified by conducting attitude-tracking tests.

Remark 4. (Disturbance estimates): The disturbances acting on the prediction model, $d_{i,a}$, are received from the estimator and assumed constant over the horizon. This is a valid assumption if the horizon length is not very long and/or the main disturbance source is model uncertainty.

Remark 5. (Perception requirements): The horizontal and vertical field-of-view requirements are not added to $h_n$ to avoid increasing the number of constraints in the OCP. Although it does not provide guarantees, the inclusion of the last terms in the cost function ensures a reliable perception operation.
Real-Time Iteration Scheme. We adopt a modified Real-Time Iteration (RTI) strategy combined with estimation. This is one of the state-of-the-art solution methods applied to real-time systems, as explained by [25]. In this method, discretization and then linearization are applied to the OCP around operating points, and Quadratic Programming (QP) solvers are used successively to solve the generated problem. The optimality of the solution is determined by the Karush-Kuhn-Tucker (KKT) value. The computational constraints can be satisfied by monitoring the online solver time. Please refer to [13] for details of the RTI scheme repeated for each control iteration. Note that to realize the algorithm, the nonlinear program solver ACADO [26] is integrated with the necessary modifications; ACADO employs qpOASES as a convex solver.

4 Experiments and Results


The proposed architecture has been first tested in simulation leveraging the
high-fidelity robotics simulator Webots [27] interfaced with the Robot Operating
System (ROS). A screen-shot of a formation experiment with three MAVs is
shown in Fig. 5. The simulator employs a realistic quadrotor model obtained and
identified in our previous work [13] where we tuned the parameters of both the
estimator and the controller to faithfully match reality. Since the interfaces of the Webots simulator with ROS are exactly the same as those of the physical quadrotors, the transition to physical reality was quite smooth.

Fig. 4. The Helipal quadrotor and its components.

Fig. 5. Webots simulation consisting of three drones equipped with April bundles and an RGB camera.

The quadrotor used in our experiments is a modified Helipal Storm Drone-4 v3, endowed with a Hex Cube Orange autopilot, a Jetson Xavier NX onboard
computer, an IMU, a coupled optical flow and distance sensor, a wide FOV
global shutter RGB camera and a bundle consisting of April tags, as shown in
Fig. 4. The instrumented quadrotor weighs 1.71 kg and has a center-to-propeller
distance of 21 cm. All computations are performed onboard by leveraging ROS. The Jetson Xavier NX computer has six parallel NVIDIA Carmel ARM v8.2
(max. 1.9 GHz) cores and one dedicated GPU (384 NVIDIA CUDA cores and
48 Tensor cores). For the experiments, all CPUs are activated with maximum
power. Only the leader has access to 3D position information generated by a
MCS with millimeter accuracy. The communication among the robots is realized using a Wide Area Network (WAN) and the multimaster_fkie ROS package.
Trajectories are generated inside an indoor volume of 27 m3 . This arena, which
includes two quadrotors, can be seen in Fig. 6. The measurement update rates
of the 3D relative localization solution, optical flow sensor, distance sensor, IMU
and magnetometer are 10 Hz, 20 Hz, 60 Hz, 100 Hz and 40 Hz, respectively. The
control and estimation frequency is 40 Hz. NMPC for the follower solves the
OCP with 40 horizon steps, which corresponds to 1 s of prediction window. The
optical flow and distance sensors are selected and calibrated for indoor use. As
can be seen in Fig. 6, various white stripes are placed on the floor to improve
the quality of optical flow. The appropriate selection and calibration of the cam-
era and tag for the mission and the measures taken against vibration and blur
are the critical parts of the integration. With this solution, we have a detection
range of 6 m and a field of view of 120◦ . The magnetometer is not fully reli-
able indoors; however, onsite and on-the-fly calibration significantly improves
performance. Lastly, the computational resources of both the onboard computer
and the autopilot are allocated to ensure efficient perception, communication,
estimation, and control.

Fig. 6. Experiment arena with two drones.

Fig. 7. Leader's trajectory for the navigation experiment.

We will present the results of two different types of experiments: formation navigation and reconfiguration, which are two common tasks for multi-robot systems performing formations. A video of the experiments is available on the following research page: https://www.epfl.ch/labs/disal/research/quadrotorformationmpc/.
Fig. 8. Formation errors for the navigation experiment: x, y, z and norm error; dashed red lines indicate the averages over time.

Fig. 9. Solver metrics: cost function value, KKT value, solver time and the number of iterations; dashed red lines indicate the averages over time.

4.1 Formation Navigation


For this experiment, after the follower has engaged in the formation, the leader follows a predefined 3D trajectory, and the follower tries to maintain the formation geometry. The formation reference is given as [2, 2, 0] m, and the leader's path can be seen in Fig. 7. The total duration of the experiment is nearly 90 s, and
the same scenario is executed five times.
Figure 8 presents the mean and standard deviations of the formation errors
for each axis and their norm over five experiments. For this calculation, the MCS
is used as a ground truth. As can be seen in the figure, while the average norm
error is less than 0.16 m, ≈ 5% of the total range, the maximum norm error is
less than 0.27 m ≈ 9% of the total range. While the mean errors on the x and y
axes are different from zero, the one for the z axis is very close to zero. This is
possibly due to the fact that modeling uncertainty for the altitude dynamics is
less time-varying and compensates well with higher gains.
Figure 9 shows the mean and standard deviations of the solver metrics over
five experiments. These are the cost function value, the KKT value, the solver
time, and the iteration number. Despite some occasional peaks, the cost value stays below an upper limit and the KKT value is lower than 0.005, which is a solid indication that the Nonlinear Program (NP) is converging through its Quadratic Program (QP) approximations. Furthermore, the solver time remains below 0.025 s, with at most three SQP iterations, which in turn defines the sample time of the controller.
Finally, Fig. 10 shows the estimation errors, again calculated with respect to
the ground-truth and disturbance estimates. As can be observed from the figure,
the average error is around 0.1 m and follows a trend similar to that of formation
errors. The absence of sudden peaks indicates that the FOV is maintained so that
the 3D relative pose measurement is received steadily. Furthermore, the velocity
estimation error in the body frame is less than 0.1 m/s on average, and the
disturbance estimations are mostly steady, indicating that the main disturbance
source in this experiment is the time-invariant model uncertainty.

Fig. 10. Estimation errors (relative position, velocity) and disturbance estimations in 3D; dashed red lines indicate the averages over time.

Fig. 11. Follower's reference during the reconfiguration experiment. 15-30-30 corresponds to the angles of the relative position vectors.

4.2 Formation Reconfiguration

For these experiments, the position of the leader is kept fixed and the formation
reference of the follower is changed over a prescribed duration to reconfigure the
formation geometry. Three distinct relative position references for the follower
are displayed in Fig. 11. The total duration of the experiment is around 23 s and
the scenario is realized five times.
The mean and standard deviations of the formation errors are given in Fig. 12.
We can see that the oscillatory behavior of the errors is due to the fact that the
reference commands are sent separately over the course of the experiment. As
a result, the mean error is around 0.21 m ≈ 7% of the total range. However,
the errors reach steady state after some time and, on average, stabilize around
0.17 m.
Solver metrics are shown in Fig. 13. We can make similar observations for
the solver metrics, except for the cost value. It can be seen that the cost value
is overall higher but reaches the same low levels at the end of the experiment.
This is because the experiment is more dynamic compared to the formation
navigation experiment. Furthermore, the observed plateaus are possibly due to
the transient compromise between the individual terms in the cost function.
Fig. 12. Formation errors for the reconfiguration experiment: x, y, z and norm errors; dashed red lines indicate the averages over time.

Fig. 13. Solver metrics: cost function value, KKT value, solver time and the number of iterations; dashed red lines indicate the averages over time.

4.3 Discussion
In order for the reader to compare the quality of the results, we present a brief
benchmark on relative position estimation and formation errors. Recently, [28]
introduced a decentralized VIO-UWB perception and estimation scheme for
aerial swarms. For an indoor trajectory of 27.8 m, they obtained about 0.08
m average relative position estimation error, which is comparable to our vision-
based solution. Another state-of-the-art localization method presented by [5]
targets outdoor applications with their ultra-violet (UV) markers and camera,
and they achieved 1.16 m relative localization RMSE. Finally, [21] employed
a distributed infrared-based relative localization system for the formations of
quadrotors. The system achieved an error of less than 0.20 m in range, 5◦ in
bearing and 10◦ in elevation for the sensor-to-target distance of 3 m. Consider-
ing formation control errors, [21] adopted a distributed graph-based formation
control method that obtained a 0.27 m range and 5◦ elevation error for the
desired range of 2.26 m, which corresponds to 12% of the full range. In our pre-
vious work [13], we proposed a decentralized predictive approach and achieved
an average formation error of 0.44 m. The high error observed in this work was
due to the exclusion of aerodynamic disturbances occurring between vehicles and
the assumption of constant neighbor velocity in the prediction model.
The low average estimation (≤0.1 m) and formation (≤0.17 m) errors
obtained in two different formation control scenarios reveal the high perfor-
mance of our strategy. By leveraging the tightly-coupled distributed MPC and
EKF scheme, we could simultaneously comply with perception, communication,
and vehicle constraints. Although the method is scalable in terms of algorith-
mic properties, a couple of issues should be addressed for real-world scalability.
First, at least three wide-lens cameras should be used to perceive the neighbor
vehicles in all directions. Next, the April-bundle pose detection algorithm is not
lightweight, and in our current implementation, it runs on the CPU. There are
efficient implementations of this algorithm using a GPU [29]. Finally, obtaining the SQP solution of NMPC is also costly, which limits the maximum number of neighbors each vehicle can have; however, there is no theoretical limit on the total swarm size.

5 Conclusion
Obtaining a robust, autonomous, performant and scalable control and estimation
architecture for multi-robot systems is a significant challenge. In this work, we
present a unified, predictive, and distributed architecture to carry out leader-
follower formations with MAVs. Our approach employs onboard resources and
integrates various physical constraints, such as camera FOV, actuation, velocity,
and acceleration limits, into the architecture. After a parameter tuning process
in a high-fidelity simulator with three vehicles, we validated and showed the
performance of our approach with two real quadrotors, one serving as leader
and the other as follower, for two common formation tasks.
In the future, we intend to validate the approach with more follower vehicles, improve the autonomy of the leader, analyze the robustness in the presence of various communication problems, and compare our solution with alternative approaches in detail.

Acknowledgment. This work has been partially sponsored by the FCT grant [PD/BD/135151/2017], the FCT doctoral program RBCog, and the FCT project [UIDB/50009/2013].

References
1. Ebel, H., Ardakani, E., Eberhard, P.: Distributed model predictive formation con-
trol with discretization-free path planning for transporting a load. Robot. Auton.
Syst. 96, 211–223 (2017)
2. Oh, K., Park, M., Ahn, H.: A survey of multi-agent formation control. Automatica
53, 424–440 (2015)
3. Eren, U., Prach, A., Koçer, B., Raković, S., Kayacan, E., Açıkmeşe, B.: MPC in
aerospace systems: current state and opportunities. J. Guid. Control. Dyn. 40,
1541–1566 (2017)
4. Zhou, X., et al.: Swarm of micro flying robots in the wild. Sci. Robot. 7, eabm5954
(2022)
5. Petráček, P., Walter, V., Báča, T., Saska, M.: Bio-inspired compact swarms of
unmanned aerial vehicles without communication and external localization. Bioin-
spir. Biomim. 16, 026009 (2020)
6. Alonso-Mora, J., Montijano, E., Nägeli, T., Hilliges, O., Schwager, M., Rus, D.:
Distributed multi-robot formation control in dynamic environments. Auton. Robot.
43, 1079–1100 (2019)
7. Soria, E., Schiano, F., Floreano, D.: Distributed Predictive Drone Swarms in Clut-
tered Environments. IEEE Robot. Autom. Lett. 7, 73–80 (2022)
8. Allamraju, R., et al.: Active perception based formation control for multiple aerial
vehicles. IEEE Robot. Autom. Lett. 4, 4491–4498 (2019)
9. Hafez, A., Marasco, A., Givigi, S., Iskandarani, M., Yousefi, S., Rabbath, C.: Solv-
ing multi-UAV dynamic encirclement via model predictive control. IEEE Trans.
Control Syst. Technol. 23, 2251–2265 (2015)
10. Yuan, Q., Zhan, J., Li, X.: Outdoor flocking of quadcopter drones with decentral-
ized model predictive control. Int. Soc. Autom. Trans. 71, 84–92 (2017)
11. Van Parys, R., Pipeleers, G.: Distributed MPC for multi-vehicle systems moving
in formation. IEEE Robot. Autonom. Syst. 97, 144–152 (2017)
12. Abichandani, P., Levin, K., Bucci, D.: Decentralized formation coordination of
multiple quadcopters under communication constraints. In: IEEE International
Conference on Robotics and Automation, pp. 3326–3332 (2019)
13. Erunsal, I., Ventura, R., Martinoli, A.: NMPC for formations of multi-rotor micro aerial vehicles: an experimental approach. In: International Symposium on Experimental Robotics, pp. 449–461 (2020)
14. Thakur, D., Tao, Y., Li, R., Zhou, A., Kushleyev, A., Kumar, V.: Swarm of inexpen-
sive heterogeneous micro aerial vehicles. In: International Symposium on Experi-
mental Robotics, pp. 413–423 (2020)
15. Olson, E.: AprilTag: a robust and flexible visual fiducial system. In: IEEE Inter-
national Conference on Robotics and Automation, pp. 3400–3407 (2011)
16. Fossen, T.: Handbook of marine craft hydrodynamics and motion control. John
Wiley and Sons (2011)
17. Mahony, R., Kumar, V., Corke, P.: Multirotor aerial vehicles: modeling, estimation,
and control of quadrotor. IEEE Robot. Autom. Mag. 19, 20–32 (2012)
18. Omari, S., Hua, M., Ducard, G., Hamel, T.: Nonlinear control of VTOL UAVs
incorporating flapping dynamics. In: IEEE/RSJ International Conference on Intel-
ligent Robots and Systems, pp. 2419–2425 (2013)
19. Jazwinski, A.H.: Stochastic Processes and Filtering Theory. Academic Press (1970)
20. Quan, Q.: Introduction to multicopter design and control. Springer (2017)
21. Dias, D.: Distributed State Estimation and Control of Autonomous Quadrotor
Formations Using Exclusively Onboard Resources. (EPFL-IST PhD Thesis, No.
9224) (2019)
22. PX4 Autopilot control documentation. https://docs.px4.io/. Accessed July 2022
23. Magni, L., Scattolini, R.: Stabilizing decentralized model predictive control of non-
linear systems. Automatica 42, 1231–1236 (2006)
24. Szeliski, R.: Computer vision: algorithms and applications. Springer Science &
Business Media (2010)
25. Gros, S., Zanon, M., Quirynen, R., Bemporad, A., Diehl, M.: From linear to non-
linear MPC: bridging the gap via the real-time iteration. Int. J. Control 93, 62–80
(2020)
26. Ferreau, H., Kraus, T., Vukov, M., Saeys, W., Diehl, M.: High-speed moving hori-
zon estimation based on automatic code generation. In: IEEE Conference On Deci-
sion And Control, pp. 687–692 (2012)
27. Michel, O.: Webots: professional mobile robot simulation. Int. J. Adv. Rob. Syst.
1, 39–42 (2004)
28. Xu, H., et al.: Omni-swarm: a decentralized omnidirectional visual-inertial-uwb
state estimation system for aerial swarms. IEEE Trans. Robot. (2022)
29. https://github.com/NVIDIA-AI-IOT/isaac_ros_apriltag. Accessed Sept 2022
Search Space Illumination of Robot
Swarm Parameters for Trustworthy
Interaction

James Wilson1,2(B) and Sabine Hauert1,2

1 Bristol Robotics Laboratory, Bristol, UK
sabine.hauert@bristol.ac.uk
2 Department of Engineering Mathematics, University of Bristol, Bristol, UK
j.wilson@bristol.ac.uk

Abstract. For swarms of robots to become commonly adopted, operators will need to be able to adjust swarm behaviour in an intuitive
and simple manner. To achieve this, we produce a feature space of
swarm characteristic sets using Map-Elites. These sets provide users with
the ability to explore specific combinations of characteristics known to
achieve good results for the specified requirements. This allows operators
to pass preferential parameter sets to swarm agents, without requiring a deep understanding of how parameters affect swarm behaviour. The
resultant system gives the user reassurance that the swarm will operate
to their needs while maintaining the benefits of distributed swarm tech-
nology. We demonstrate how this system might be used with a swarm
logistics case study where an operator requires two swarm behaviour
modes with differing characteristic needs. In this example we illustrate
how an operator might express preference, receive recommendations, and
select their final swarm characteristics.

Keywords: Search space illumination · Swarm robotics · Trustworthiness

1 Introduction
Robot swarms create promising opportunities within a wide range of applications, from intralogistics for small businesses to space exploration [24]. However, to create a swarm robotic system that can be deemed trustworthy, and thus enable widespread use, swarm operators will need to rapidly simulate, deploy, monitor, and control swarms through intuitive cyber-physical infrastructure, making use of methodologies that do not detract from the swarm's inherent beneficial qualities.
Swarms of robots present an interesting challenge when it comes to human
control. Human input can benefit the swarms performance by providing context
or instructions from the human operator that would not typically be available to
the swarm [13]. However, the high number of agents within a swarm can present
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 173–186, 2024.
https://doi.org/10.1007/978-3-031-51497-5_13
174 J. Wilson and S. Hauert

difficulty; with more agents acting within a system, the more difficult manually
controlling the entire swarm is, potentially surpassing the cognitive limit of the
operator [15]. Additionally, by centralising the swarm to a human operator it is
possible to create a single point of failure in the system. As a result, the robot
system may lose its swarm-like properties i.e. the adaptability, scalability and
robustness swarms gain from their innate redundancy and distributed operation.
[9]
The level of control an operator has over a swarm is a key consideration when producing an effective swarm system [1]. Too much influence from a human, especially from an inexperienced operator, can lead to significantly reduced swarm performance as the collective behaviours of the swarm are overridden by human control. This effect has been referred to as 'Neglect Benevolence' [28], in which swarms can, in some cases, be observed to operate more effectively as interference from a human operator reduces. Nonetheless, there is currently a demand for human operator supervision of autonomous systems [20] and swarms [4]. Not only can operators provide the aforementioned context to swarm tasks, but they can also provide an element of trust in the swarm system, giving the public reassurance in their interactions with the swarm. Operators are able to make ethical considerations when presented with scenarios that current levels of autonomy would struggle to provide adequate resolutions to. For example: directing agents to areas of importance in a wildfire scenario, or selecting a preferable behaviour in a shared human-robot warehouse as humans enter/exit the environment. Even in less critical scenarios, operators themselves may exhibit more trust in a system they are able to influence towards their preference for operation.
To address the issue of effective human swarm interaction, many techniques have been investigated. Some studies have explored approaches for direct control of agents including gesture, voice and touch [14,22], while others have presented approaches for monitoring and providing information on many agents to an operator [6,29]. In addition to this, there have also been investigations into the appropriate levels of control over swarms [1,27].
Recently, evolutionary techniques have been deployed to automatically generate swarm behaviour, overcoming the difficulty of collective behaviour design [2]. Within this field, methods have been explored that generate several swarm behaviour options that are human understandable and human readable [8]. Operators with contextual awareness can then use these options to send commands to a swarm, which then behaves autonomously and in a distributed manner to execute the desired swarm behaviour. This allows the operator to modify the behaviour of the swarm as a task or environment varies, while still preserving swarm qualities.
We look to expand upon the idea of operator behaviour selection and hope to add an element of trustworthiness to swarm systems by providing operators with characteristic options defined in simple terms: options which have been measured and tested based on sets of generated, low-level swarm parameters. When it comes to deployment, this means that an operator can simply express their use-case's characteristic priorities, and select a set of parameters for their swarm which represents an intuitive set of trade-offs. For example, an operator managing a swarm logistics warehouse may express that they require a high performing, robust swarm but have little requirement for energy efficiency. Our system could subsequently recommend multiple options, displaying the trade-offs between performance, robustness and energy efficiency. The operator could then select the option that best represents their requirements, uploading the parameters associated with their selected characteristic set to their swarm. This method should allow operators, even those who are inexperienced, to beneficially adjust the behaviour of the swarm, meeting user requirements for trust while preserving swarm capabilities.
We achieve this method of interaction by creating a large feature space of swarm characteristics representing high performing sets of swarm parameters. This should provide users with the ability to explore combinations of trustworthiness characteristics and, as a result, receive suggestions for relevant parameter sets. Consequently, operators can request swarm behaviours in user-friendly terms without needing in-depth knowledge of swarm behaviour or having to micromanage swarm agents.
We will produce the aforementioned feature space through the use of Map-
Elites, a search space illumination algorithm first introduced by Mouret and
Clune [19]. Map-Elites has since been used to develop varieties of robotic
behaviours to create adaptive methods of control [5], swarm behaviour gen-
eration [3,7,12] and has also been proposed as a method for building trust in
optimisation algorithms by displaying multiple solutions to users at once [26].
From the characteristic sets (or ‘genomes’) generated through Map-Elites, we
will demonstrate how a user might request and interact with the feature space
to effectively and intuitively influence the behaviour of a swarm. An illustration
of our target system can be seen in Fig. 1.

Fig. 1. Illustration of target interface displaying the behaviour of the swarm within
an arena (left side of interface). Users are able to list their priority of characteristic
traits. Based on this priority, the top sets of available characteristic combinations are
displayed in a radar chart to the right. Operators can then select the suggestion they
feel best fits their use case or can re-specify their preference order to receive additional
recommendations.
2 Methodology

2.1 Scenario

Our scenario mimics a logistics swarm warehouse. The swarm executes a simple box delivery task, moving items from one end of a 20 × 20 m environment (displayed in Fig. 1 as an element within the target interface) to the other. In this task, the swarm has 30 boxes to deliver within a time limit of 500 s. If the swarm completes the delivery of all of the boxes before the specified time limit, the simulation ends early and the scores are recorded.

Simulated Agents. The agents within our simulation are based on DOTS, an open-access swarm robot solution designed for industrial swarm experimentation [10]. While a high fidelity simulation of this experimentation platform is available, we chose to create a simple 2D simulation for the purpose of executing the illumination algorithm. This was due to the large amount of computing power that would be required to simulate 5–40 agents in high fidelity across several thousand iterations.

In our simulation, agents measure 30 cm in diameter and are able to pick up boxes instantaneously once within 25 cm of a box's centre coordinate. Agents have been simulated with simple 2D physics and, should their avoidance systems fail, are able to collide elastically with other agents, boxes, and the walls of the environment. The maximum velocity, size of the swarm, maximum amount of time an agent will carry a box, and distance at which agents begin to avoid one another are all controlled by the Map-Elites mutations. The constraints on our swarm parameters and the magnitude of mutations applied on each iteration of Map-Elites are shown in Table 1.

Robot Controller. The controller present on each agent executes a simple set of instructions (illustrated in Fig. 2). Agents first search for boxes, moving randomly through the environment and generating a new random heading once every 10 s to modify their path. Simultaneously, the agents attempt to avoid one another, generating avoidance vectors which navigate them away from the centroid of neighbouring agents (agents within the distance specified by the 'Avoidance Range' parameter).

Table 1. Swarm parameter value constraints.

Parameter             | Minimum  | Maximum       | ± Maximum Mutation
Maximum Velocity      | 1 cm/s   | 200 cm/s      | 20 cm/s
Maximum Acceleration  | 200 cm/s | 200 cm/s      | N/A
Avoidance Range       | 1 cm     | Unconstrained | 25 cm
Swarm Size            | 3 agents | 50 agents     | 5 agents
Maximum Box Hold Time | 5 s      | Unconstrained | 2.5 s

Fig. 2. Flow chart illustrating the behaviour executed by every swarm agent.

In addition to this, we incorporate an emergency avoidance system which overrides the current heading of an agent and directs them away from any agent they sense less than 20 cm away. Once an agent encounters a box by chance, it is able to pick the box up and attempt to move it to the delivery zone, once again moving in the same random fashion while attempting to execute avoidance behaviour.
To assist in the delivery of boxes, a slight bias is applied to the movement of agents. When searching for boxes, a vector of random magnitude (between zero and half the maximum acceleration of the agent) is added in the direction of the box pick-up zone. Once a box is collected, a similar vector is added away from the box pick-up zone to assist in delivery. This is similar to the behaviour deployed by Milner et al. [16,17], with this bias acting as a proxy for simple directional awareness agents may have access to, such as an environmental gradient sensor (e.g. a light or wireless beacon).
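As a concrete illustration of the controller described above, the following Python sketch shows one search-phase update step. It assumes a simple agent object with position, heading and carrying state; the structure, names, and acceleration unit are ours, not the authors' implementation.

    import math
    import random

    HEADING_PERIOD = 10.0  # seconds between random heading changes
    MAX_ACCEL = 2.0        # 200 from Table 1, assumed here to be in m/s^2

    def search_step(agent, pickup_zone, t):
        # Every 10 s, generate a new random heading to modify the path.
        if t - agent.last_heading_time >= HEADING_PERIOD:
            agent.heading = random.uniform(0.0, 2.0 * math.pi)
            agent.last_heading_time = t
        ax = MAX_ACCEL * math.cos(agent.heading)
        ay = MAX_ACCEL * math.sin(agent.heading)
        # Bias of random magnitude (up to half the maximum acceleration):
        # towards the pick-up zone while searching, away from it while
        # carrying a box towards the delivery zone.
        bx, by = pickup_zone[0] - agent.x, pickup_zone[1] - agent.y
        norm = math.hypot(bx, by) or 1.0
        gain = random.uniform(0.0, MAX_ACCEL / 2.0)
        sign = -1.0 if agent.carrying_box else 1.0
        return ax + sign * gain * bx / norm, ay + sign * gain * by / norm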

2.2 Swarm Metrics

In order to create a feature space of elite parameter sets, multiple characteristics relating to the swarm must be produced. In this paper, our focus when implementing Map-Elites was to create a method of swarm interaction which enables trustworthiness. In doing so, not only should the method of interaction be intuitive, fostering the trust between operator and swarm, but the characteristics the operator is able to select preference for should have a strong relation to trust. With this considered, we produced three characteristics: energy consumption, robot danger, and robustness, as well as a metric for overall swarm performance.
Performance. To measure performance ($P$) we used the number of boxes collected by the swarm ($n_B$) within the maximum available time ($T$). In some instances agents may collect all of the available boxes before the maximum time limit. In instances such as this, to avoid using unnecessary computational resources, we truncated the experiment. To compensate for this, and to promote controllers that perform the task quickly, the number of boxes collected is multiplied by $T/t$ (where $t$ is the time actually taken to complete the task). This produces a score representing how many boxes the swarm would have collected had agents continued moving boxes at the average rate of collection for the full experiment length (see Eq. 1).

$$P = \frac{n_B T}{t} \tag{1}$$

Energy Consumption. Our first swarm characteristic is the total energy consumed by the swarm ($E$) (displayed in Eq. 2). Energy consumption is a key ethical consideration when creating modern products and, depending on the use case, may be a factor in the trustworthiness of a system [21,25]. $E$ was produced by summing the energy used by the robots ($E_r$) across the experiment length. Energy consumption in this case was approximated through the magnitude of acceleration each agent experienced at any given time step. This creates a score capturing where the majority of energy would be consumed in the equivalent real-world system.

As with the performance metric, to compensate for early completion of the task and to make parameter set comparison fair, the total score was multiplied by the maximum experiment length ($T$) over the actual time taken to complete the task ($t$).

$$E = \frac{T}{t} \sum_{r=1}^{n_r} E_r \tag{2}$$

Robot Danger. The next characteristic was the average amount of time each agent spent in the emergency avoidance mode (shown in Fig. 2). This characteristic was chosen to give an indication of how safe a system might be. Equation 3 shows that this was calculated by summing the amount of time each agent spent performing emergency avoidance ($D_t$) throughout the experiment and dividing by the total number of agents in the swarm ($n_r$). This was once more multiplied by a factor of $T/t$ to give a value representative of how long agents would be expected to have spent in emergency avoidance should the simulation have extended to the maximum time limit.

$$D = \frac{T}{t} \cdot \frac{\sum_{r=1}^{n_r} D_t}{n_r} \tag{3}$$
Robustness. Our final characteristic, robustness, comes from the recent work on quantifying swarm traits conducted by Milner et al. [18]. This metric is produced from the deviation from optimal performance due to faults ($dP$) over the deviation from optimal swarm size due to loss of agents ($dN$). Using this metric, values where $R \geq 1$ are considered robust while $R < 1$ are not. This means that a system is considered robust when, despite failures or faults, the swarm produces a performance equal to or greater than the same system with the faulty agents entirely removed from the swarm. We show the equations comprising this metric in Eqs. 4–6.

$$R = \frac{dP}{dN} \tag{4}$$

$$dP = 1 - \frac{P - P_o}{P_o} \tag{5}$$

$$dN = 1 - \frac{N_f}{N} \tag{6}$$
To calculate robustness in these experiments, faults must be injected into the system. We inject two faults into comparison trials to gauge the performance of the system when compromised. To simulate these robot faults we recreated a common wheel fault, in which faulty agents have a constant directional bias. We achieved this in simulation by creating an orthogonal vector to the right of the agent at one tenth the magnitude of the current velocity. This caused faulty agents to consistently spiral to the right, compromising avoidance and direction of travel.

As a result of fault injection, we are able to obtain values for the swarm's performance with no faults injected ($P_o$) and performance with faults injected ($P$). As mentioned before, across these trials the number of faults injected into the swarm for testing ($N_f$) was always 2, while $N$, the number of agents in the swarm, would act as a variable parameter in our search space illumination.
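The four metrics can be computed directly from logged trial data. The Python sketch below mirrors Eqs. 1–6; the function and variable names are illustrative rather than taken from the authors' code.

    def performance(n_boxes, T, t):
        # Eq. 1: boxes delivered, scaled to the full experiment length.
        return n_boxes * T / t

    def energy(per_robot_energy, T, t):
        # Eq. 2: energy summed over robots, scaled by T/t.
        return (T / t) * sum(per_robot_energy)

    def danger(emergency_times, T, t):
        # Eq. 3: mean time in emergency avoidance per agent, scaled by T/t.
        return (T / t) * sum(emergency_times) / len(emergency_times)

    def robustness(P, P_o, N_f, N):
        # Eqs. 4-6: performance deviation over swarm-size deviation;
        # R >= 1 means the swarm is considered robust.
        dP = 1 - (P - P_o) / P_o
        dN = 1 - N_f / N
        return dP / dN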

2.3 Map-Elites Illumination Algorithm Methodology

Having defined our simulation, performance metric, swarm parameters, and swarm characteristics, we deployed the Map-Elites illumination algorithm to generate characteristic subsets. This algorithm iterates through genomes (in our case the swarm parameter sets), selecting and mutating existing genomes to create offspring.

To evaluate a new genome's characteristics and performance, an experimental trial (in our case the previously described swarm simulation) is executed using the parameters of the offspring. If the evaluated genome outperforms the current occupant of a cell, or identifies a new characteristic niche, the genome is placed within the cell. The algorithm then iterates as new genomes are selected from the occupied cells, mutated, and their offspring evaluated. An illustration of this process can be seen in Fig. 3.
Fig. 3. Illustration of the Map-Elites algorithm process with example parameters, characteristic values and mutations.

In our use of Map-Elites we began with an empty feature space with charac-
teristics divided equally into 10 bins, using maximum and minimum values iden-
tified through initial testing of the simulation. Starting with a single unevaluated
parameter set, on each iteration thereafter we randomly selected 7 genomes cur-
rently filling the feature space to mutate. We then tested all 7 of the offspring in
parallel, repeating each test 4 times, taking the mean results from the repetitions
as the genome’s final characteristic and performance evaluation scores.
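The following Python sketch outlines the loop just described (one seed genome, batches of 7 mutants, 4 repeated evaluations each). The bin ranges and the mutate/evaluate callables stand in for the simulation and are assumptions, not the authors' code.

    import random

    BINS = 10
    # Characteristic ranges from initial testing (placeholder values here).
    RANGES = {"robustness": (0.0, 5.0), "danger": (0.0, 50.0), "energy": (0.0, 100.0)}
    archive = {}  # cell (tuple of bin indices) -> (genome, performance)

    def bin_of(value, lo, hi):
        return min(BINS - 1, max(0, int(BINS * (value - lo) / (hi - lo))))

    def illuminate(seed, mutate, evaluate, iterations, batch=7, repeats=4):
        parents = [seed]
        for _ in range(iterations):
            for genome in [mutate(random.choice(parents)) for _ in range(batch)]:
                # Mean characteristics/performance over repeated trials.
                runs = [evaluate(genome) for _ in range(repeats)]
                mean = {k: sum(r[k] for r in runs) / repeats for k in runs[0]}
                cell = tuple(bin_of(mean[k], *RANGES[k]) for k in RANGES)
                # Fill an empty niche, or replace a lower-performing elite.
                if cell not in archive or mean["performance"] > archive[cell][1]:
                    archive[cell] = (genome, mean["performance"])
            parents = [g for g, _ in archive.values()]
        return archive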

3 Results

The results of our application of Map-Elites can be seen in Fig. 4. These results, displaying 28000 total parameter mutations (4000 repetitions of our 7 parallel parameter mutation batches, executed until coverage of the feature space was deemed adequate), show patterns forming in the relations between characteristics and performance. To begin, there is a clear gradient towards the top right of each graph. This shows, somewhat intuitively, that there is a link between power consumption, safety and performance. This is explained by fast moving or large volumes of agents consuming large amounts of energy to quickly move boxes from one side of the arena to the other. At these higher speeds agents also struggle to effectively maintain distance from one another, relying more on emergency avoidance to prevent collisions. Interestingly, robustness does not seem to have a linear relationship with performance. Instead, performance peaks at a sweet spot with robustness scores between 1 and 2; this reflects the inherent robustness within the swarm algorithm. Further tests and a wider variety of fault injections would likely produce a greater variance in the scores seen here, however this was outside the scope of this initial proof-of-concept experimentation.
The patterns and trade-offs shown here demonstrate the variety of perfor-
mances that can be achieved by a swarm given different user requirements. The
ability to visualise and intuitively select characteristics based on requirements
could prove valuable in creating trustworthy and adoptable swarm systems. In
the next section we explore the generated genomes in greater detail, demonstrat-
ing how feature spaces of swarm characteristics could be implemented in reality
for user-focused applications.

3.1 Qualitative Analysis: Case Study

Looking more closely at the results from our application of Map-Elites, we consider a case study in which users have specified a priority list for swarm characteristics. We then take a closer look at the parameters that could be selected for the representative genomes, highlighting how Map-Elites generated characteristics can create an effective tool for swarm control.

Taking the feature space generated through our implementation of Map-Elites, we produce radar charts to illustrate characteristic values. To display the information intuitively, characteristics have been normalised to take values between 0 and 1. Energy consumption and danger have been used to produce values for energy efficiency ($1 - E$) and safety ($1 - D$) respectively. The resulting characteristics thus attain their greatest fitness at a value of 1, in line with performance and robustness.
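In code, this normalisation is a one-liner per characteristic; here we assume all four raw scores have already been min-max scaled to [0, 1]:

    def radar_values(perf, robust, energy_consumed, danger):
        # Energy consumption and danger are inverted so that, like
        # performance and robustness, a value of 1 is always best.
        return {"performance": perf, "robustness": robust,
                "energy efficiency": 1 - energy_consumed,
                "safety": 1 - danger}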
In our example case study, a user requires a swarm for a cluttered warehouse
in which humans will frequently interact with swarm agents during the day. At
night there will be no humans present in the warehouse and agents will rely on
energy stored via solar panels during the day. As a result, the operator specifies
priority lists 1 and 2 for day and night operation respectively:

Priority List 1
1. Safety 2. Performance 3. Energy Efficiency 4. Robustness

Priority List 2
1. Energy Efficiency 2. Performance 3. Safety 4. Robustness

To begin observing the trade-offs between characteristics, the feature space is ranked by characteristic in the order specified by the operator. The genome at the top of the list can then be selected as the first characteristic recommendation (examples 1 & 4 in this case study). Then, by making concessions to the highest priority characteristic, genomes from lower down the feature space ranking can be selected, providing recommendations that demonstrate how reducing the highest priority feature might affect the rest of the swarm's features.

Fig. 4. Heat maps of performance based on characteristics extracted using Map-Elites. Coloured cells in these heat maps display the performance scores across characteristic categories for robot danger, energy consumed and robustness. Black cells show characteristic combinations for which no parameter set was found across the Map-Elites iterations performed in this instance.
This method for selecting the top characteristic examples does create some
issues for the operator should they want a balance between multiple characteris-
tic features. However, it should be possible to address this by adding the ability
to specify minimum possible values for certain characteristics. This would reduce
the size of the feature space, possibly removing genomes which perform highly
in one particular characteristic. In restricting elements of the feature space, the
resultant characteristic examples would be more representative of genomes that
perform more evenly across multiple characteristic types.
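One way to implement the recommendation step described above is a lexicographic sort of the feature space followed by concessions on the top-priority characteristic. A minimal sketch follows, with the archive assumed to be a list of characteristic dictionaries (names are illustrative):

    def recommend(genomes, priorities, n_options=3):
        # Rank genomes by the characteristics in the operator's priority
        # order (all normalised so that higher is better).
        ranked = sorted(genomes,
                        key=lambda g: tuple(g[c] for c in priorities),
                        reverse=True)
        options, top = [ranked[0]], priorities[0]
        for g in ranked[1:]:
            if len(options) == n_options:
                break
            # Only accept genomes that concede on the top priority, so each
            # suggestion displays a genuinely different trade-off.
            if g[top] < options[-1][top]:
                options.append(g)
        return options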
In Fig. 5 we illustrate two radar charts showing the genomes that would be presented to the operator for each of the stated priority lists. In the illustration for priority list 1 we can see that example 1 shows the option with the highest possible safety rating. However, this safety comes at a cost to all of the other swarm features. By selecting examples with slightly lower safety scores, we see improvements to both robustness and performance. Similarly, for the radar chart given for priority list 2 we can see that example 4, the genome best representing the operator's priorities, gives very good results for energy efficiency, safety and robustness but lacks performance. As energy efficiency is reduced in examples 5 and 6 we see a notable increase in performance. However, we also see a drop in robustness and safety.
Using these graphs to display swarm information in simple terms, we can see how search space illumination can be beneficial to a user. Operators can make their selection, clearly see the trade-offs between the features they have a primary interest in, and use this information to make an informed swarm characteristic selection.

Fig. 5. Radar diagrams each displaying the characteristics and performance of the
three example parameter sets that may be recommended for the specified operator
preferences.

4 Conclusion and Future Work


In this paper we show how search space illumination techniques can be deployed
to create a method of swarm robot interaction with a focus on trustworthiness.
This is achieved by presenting human-centric swarm characteristics to a user in
a simple, consumable manner. We have illustrated how this system might be
used through a proof of concept case study, and identified the benefit such a
methodology could provide to swarm operators.
In the future we hope to take this work further by implementing the Map-
Elites algorithm on a wider scale; evolving full swarm behaviours using behaviour
trees [11], instead of simply mutating swarm parameters. Through the use of
this system, swarms could be equipped with vast repertoires of behaviour types,
deployed based on human understanding of characteristic requirements.
Another interesting implementation for this system would be to create
diversity amongst a swarm, taking clusters of neighbouring genomes to be
spread amongst agents. These selections may lead to more robust and effec-
tive behaviours with agents specialising and dividing labour to complement each
other’s strengths and weaknesses. This would reflect the benefits demonstrated
by natural examples of increased swarm diversity in social insects [23].
We also hope to investigate the viability of this system for user interaction
by performing a human study. This study would examine whether a system
similar to that presented in this paper would be perceived as beneficial and
more trustworthy than other high-level command techniques, or direct agent
control.
Finally, the generation of discrete sets of parameters seen in this paper could
enable rigorous validation of swarms. After the initial search space illumination,
the resultant parameter sets or behaviours could be taken through a series of
tests, evaluating the viability of the genomes, removing genomes from the fea-
ture space that are deemed unacceptable. Such a process would increase user
confidence in the system, adding an additional layer of safety to enforce the
trustworthiness of any behaviours or parameter sets selected via the user’s char-
acteristic requirements.

References
1. Ashcraft, C.C., Goodrich, M.A., Crandall, J.W.: Moderating operator influence in
human-swarm systems. In: 2019 IEEE International Conference on Systems, Man
and Cybernetics (SMC), pp. 4275–4282. IEEE (2019)
2. Birattari, M., et al.: Automatic off-line design of robot swarms: a manifesto. Front.
Robot. AI 6, 59 (2019)
3. Bossens, D.M., Tarapore, D.: Rapidly adapting robot swarms with swarm map-
based Bayesian optimisation. In: 2021 IEEE International Conference on Robotics
and Automation (ICRA), pp. 9848–9854. IEEE (2021)
4. Carrillo-Zapata, D., et al.: Mutual shaping in swarm robotics: user studies in fire
and rescue, storage organization, and bridge inspection. Front. Robot. AI 7, 53
(2020)
5. Cully, A., Clune, J., Tarapore, D., Mouret, J.B.: Robots that can adapt like ani-
mals. Nature 521(7553), 503–507 (2015)
6. Divband Soorati, M., Clark, J., Ghofrani, J., Tarapore, D., Ramchurn, S.D.:
Designing a user-centered interaction interface for human-swarm teaming. Drones
5(4), 131 (2021)
7. Engebraaten, S.A., Moen, J., Yakimenko, O.A., Glette, K.: A framework for auto-
matic behavior generation in multi-function swarms. Front. Robot. AI, 175 (2020)
8. Hogg, E., Hauert, S., Harvey, D., Richards, A.: Evolving behaviour trees for super-
visory control of robot swarms. Artif. Life Robot. 25(4), 569–577 (2020)
9. Jones, S., Milner, E., Sooriyabandara, M., Hauert, S.: Distributed situational
awareness in robot swarms. Adv. Intell. Syst. 2(11), 2000110 (2020)
10. Jones, S., Milner, E., Sooriyabandara, M., Hauert, S.: DOTS: an open testbed for
industrial swarm robotic solutions. arXiv preprint arXiv:2203.13809 (2022)
11. Jones, S., Studley, M., Hauert, S., Winfield, A.: Evolving behaviour trees for swarm
robotics. In: Groß, R., et al. (eds.) Distributed Autonomous Robotic Systems.
SPAR, vol. 6, pp. 487–501. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73008-0_34
12. Kaiser, T.K., Hamann, H.: Evolution of diverse swarm behaviors with minimal
surprise. In: ALIFE 2020: The 2020 Conference on Artificial Life, pp. 384–392.
MIT Press (2020)
13. Kapellmann-Zafra, G., Salomons, N., Kolling, A., Groß, R.: Human-robot swarm
interaction with limited situational awareness. In: Dorigo, M., et al. (eds.) ANTS
2016. LNCS, vol. 9882, pp. 125–136. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44427-7_11
14. Kim, L.H., Drew, D.S., Domova, V., Follmer, S.: User-defined swarm robot control.
In: Proceedings of the 2020 CHI Conference on Human Factors in Computing
Systems, pp. 1–13 (2020)
15. Lewis, M.: Human interaction with multiple remote robots. Rev. Hum. Factors
Ergon. 9(1), 131–174 (2013)
16. Milner, E., Sooriyabandara, M., Hauert, S.: Stochastic behaviours for retrieval of
storage items using simulated robot swarms. Artif. Life Robot. 27, 1–8 (2022)
17. Milner, E., Sooriyabandara, M., Hauert, S.: Swarm diffusion-taxis: transport of
spatial information for cooperative gradient-based navigation. In: AROB SWARM
(2022)
18. Milner, E., Sooriyabandara, M., Hauert, S.: Swarm performance indicators (SPIs). In preparation, pp. 1–6 (2022)
19. Mouret, J.B., Clune, J.: Illuminating search spaces by mapping elites. arXiv
preprint arXiv:1504.04909 (2015)
20. Nahavandi, S.: Trusted autonomy between humans and robots: toward human-on-
the-loop in robotics and autonomous systems. IEEE Syst. Man Cybern. Magaz.
3(1), 10–17 (2017)
21. Partridge, A., et al.: ReRobot: recycled materials for trustworthy soft robots. In:
5th IEEE-RAS International Conference on Soft Robotics (2022)
22. Podevijn, G., O’Grady, R., Nashed, Y.S.G., Dorigo, M.: Gesturing at subswarms:
towards direct human control of robot swarms. In: Natraj, A., Cameron, S., Mel-
huish, C., Witkowski, M. (eds.) TAROS 2013. LNCS (LNAI), vol. 8069, pp. 390–
403. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43645-5_41
23. Psalti, M.N., Gohlke, D., Libbrecht, R.: Experimental increase of worker diversity
benefits brood production in ants. BMC Ecol. Evol. 21(1), 1–10 (2021)
24. Schranz, M., Umlauft, M., Sende, M., Elmenreich, W.: Swarm robotic behaviors
and current applications. Front. Robot. AI 7, 36 (2020)
25. Thiebes, S., Lins, S., Sunyaev, A.: Trustworthy artificial intelligence. Electron.
Mark. 31(2), 447–464 (2021)
26. Urquhart, N., Guckert, M., Powers, S.: Increasing trust in meta-heuristics by using
map-elites. In: Proceedings of the Genetic and Evolutionary Computation Confer-
ence Companion, pp. 1345–1348 (2019)
27. Walker, P., Nunnally, S., Lewis, M., Chakraborty, N., Sycara, K.: Levels of automa-
tion for human influence of robot swarms. In: Proceedings of the Human Factors
and Ergonomics Society Annual Meeting, vol. 57, pp. 429–433. SAGE Publications
Sage CA: Los Angeles, CA (2013)
28. Walker, P., Nunnally, S., Lewis, M., Kolling, A., Chakraborty, N., Sycara, K.:
Neglect benevolence in human control of swarms in the presence of latency. In:
2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC),
pp. 3009–3014. IEEE (2012)
29. Wilson, J., Hauert, S.: Information transport in communication limited swarms.
In: 5th International Symposium on Swarm Behavior and Bio-Inspired Robotics
(2022)
Collective Gradient Following with Sensory Heterogeneous UAV Swarm

Tugay Alperen Karagüzel1(B), Nicolas Cambier1, A.E. Eiben1, and Eliseo Ferrante1,2

1 Vrije Universiteit Amsterdam, Amsterdam, Noord-Holland, The Netherlands
{t.a.karaguzel,n.p.a.cambier,a.e.eiben,e.ferrante}@vu.nl
2 Technology Innovation Institute, Abu Dhabi, United Arab Emirates
Abstract. In this paper, we present a new method for a swarm to collectively sense and follow a gradient in the environment. The agents in the swarm rely only on relative distance and bearing measurements of neighbors. Additionally, only a minority of agents in the swarm perceive the scalar value of the gradient at their location. We test the method with incrementally changing ratios of sensor-equipped agents across various swarm sizes. In addition to repeated simulation experiments, we also test the performance with a real nano-drone swarm. Results show that, using the new method, the swarm was successful at following the gradient in the environment even with a low proportion of sensor-equipped agents, across various swarm sizes. A real nano-drone swarm also demonstrates good performance in our test, even with members having disabled sensors.

Keywords: collective sensing · sensor heterogeneity · nano-drone swarm

1 Introduction

Coordinated and ordered collective motion of entities can be encountered in different species, from bacteria to large mammals [17]. They gain various advantages from this distributed organization. For some species, the advantage is improved searching for a food source [6]. For others, it can be safety and enhanced survival [11]. Among these advantages, an important one is emergent sensing. Emergent sensing can be defined as the collective behavior arising from social interactions that enables a swarm to perceive and react to a global feature in the environment while individuals alone perceive insufficient information about this feature and would not be able to react alone. A special case, considered in this paper, is the ability to achieve collective gradient following with individuals having measurements limited to the scalar values of the gradient [3]. When individuals have the right decision and evaluation mechanisms, emergent sensing can be achieved without direct information exchange of the environmental feature. Fish schools couple this behavior with collective motion [2,13]. An individual fish cannot estimate the light gradient in the environment and, hence, cannot follow
it to migrate towards darkness. Nevertheless, a collective composed of individuals that can perceive neighbors' relative positions can successfully estimate the light gradient and migrate towards darkness, even when the light gradient is changing with time.
Researchers have analyzed collective and emergent sensing of other species
as well. For instance, bacteria evaluate and modulate their interactions with
others according to the chemical concentration of a certain substance [5]. As a
result, the swarm achieves a collective computation of the chemical gradient. A
similar behavior has also been observed in honeybees [10]. Bees achieve emer-
gent estimation of temperature and aggregate without any explicit information
exchange. They do this by wandering around until colliding with another bee.
They then modulate their waiting time according to the temperature of their
current position.
Such examples from nature have influenced the swarm robotics field. The behavior of honeybees was reproduced in wheeled robots in [18], whereby robots aggregate on particular regions in the environment depending on the light gradient. In another work [4], emergent sensing was achieved by relying only on inhibiting the sociality of agents. In [14], simulated robots modulate their messaging frequency depending on their local measurements to achieve collective taxis. In another example, robots in a tiled arena estimate the most frequent color of tiles [16]. Successful estimation of the main color shows a valid example of collective sensing without any coherence in the swarm. In addition, [7] uses a swarm of nano-drones to localize a gas source. Despite their breadth, none of the above-mentioned methods guarantee an ordered motion of the collective. Moreover, [7,14,16] all rely on peer to peer communication, which adds one possible point of failure to the system and threatens the robustness of the swarm. Emergent sensing and collective motion without any information exchange is, however, possible, but existing works [15] rely on multiple sensors to estimate a desirable direction before modulating their social behaviors.
In [9], we introduced a gradient following method for a flocking swarm. This method achieves collective gradient following without any information exchange, and by having all agents equipped with on-board peer sensing and local scalar environmental measurements. However, it is possible to further improve the robustness of the swarm by considering cases where some of the agents in the swarm do not have sensors for environmental measurements. Having non-sensory agents in the swarm corresponds to cases where a portion of the swarm has damaged sensors or is unable to use its sensors for some other reason. Moreover, having non-sensory agents could be a design choice depending on the sensor cost.

In this paper, we build on [9] to enable a swarm of largely non-sensory agents to successfully follow a gradient. For this purpose, we increase the attraction between neighboring agents to produce more drastic reactions to higher gradient values. Consequently, we expect the sensory agents to be more effective on non-sensory ones and sufficient to influence the whole swarm to move towards the increasing gradient while maintaining a collective motion. With the results, we show that gradient following is robust to a considerable proportion of non-sensory agents in the swarm, for a wide range of swarm sizes. In real life experiments, we emulate the peer and environmental sensing to show that similar behaviors are achievable with real aerial vehicle dynamics, in a way that does not violate the aforementioned constraints.

2 Methodology

Members of the swarm move in a 2D environment. Each member is individually controlled. It has a direction of motion (heading) and can only translate in this direction. In other words, the axis of linear speed is always aligned with the heading direction. Meanwhile, the angular speed of agents determines the change of heading. These constraints are similar to the non-holonomic constraints of a differential drive robot: although these are not natural for drones, they are imposed on the drones (as explained later) in order to build our method on top of others presented recently [1].

All agents can measure the relative distance and bearing of others within a given range, which we set to 3 m in this paper. The range is chosen by considering the trade-off between ensuring the locality of perception within the swarm and agents having sufficient space to react to other agents and maneuver without losing sight of them. The linear and angular speeds of a focal agent $i$ are calculated by combining the distances and bearings of perceived peers. To do that calculation, a proximal control vector [8], designated as $\vec{p}_i$, is used. This vector generates a spring-like effect between the focal agent $i$ and each perceived peer, which oscillates around the desired distance $d_i^{des}$. It becomes a pulling effect if the perceived peer is too far (distance $> d_i^{des}$) and a pushing effect if the peer is too close (distance $< d_i^{des}$).

Some agents, which we call "sensory agents", can also measure the gradient's local value at their positions, but they have no memory of previous measurements, so they have no information about the gradient's direction. Non-sensory agents have no information about the gradient at all and can only measure the relative distance and bearing of their neighbours.
distance and bearing of their neighbours.
We call our new method Attractive Desired Distance Modulation (ADM), as opposed to the previously presented one, which we call Desired Distance Modulation (DM). In ADM, the proximal control vector produces more attraction between agents compared to DM [8,9]. ADM's proximal control vector is calculated as follows for $N$ perceived peers:

$$\vec{p}_i = \alpha \sum_{m \in N} p_i^m(d_i^m, \sigma_i)\,\angle e^{j\phi_i^m} \tag{1}$$

In the equation above, the relative distance of a neighbor peer $m$ is shown with $d_i^m$. The bearing angle of neighbor $m$ is represented with $\angle e^{j\phi_i^m}$.

In ADM, the desired distance, i.e. the distance where the neighbor $m$ produces no virtual force on the focal agent, is given by $d_i^{des} = \sigma_i$. Finally, $\alpha$ stands for the proximal weight coefficient. The contribution of each perceived peer to the magnitude of $\vec{p}_i$ is:
$$p_i^m(d_i^m, \sigma_i) = -\frac{\epsilon}{2}\left[\frac{\sigma_i}{d_i^m} - \left(\frac{\sigma_i}{d_i^m}\right)^{0.5}\right] \tag{2}$$

In the above equation, $\epsilon$ is the coefficient that tunes the strength of the proximal vector. In addition to the peers' positions, sensory agents can use the gradient's local value to compute the proximal vector. These scalar measurements modulate $d_i^{des}$ through $\sigma_i$. In ADM, the desired distance of the focal agent ($d_i^{des}$) takes the same value as the coefficient $\sigma_i$.
The value of $\sigma_i$ of a sensory agent $i$ is computed according to the scalar perception of the gradient as follows:

$$\sigma_i = \sigma_c + \sigma_c\left(\frac{G^\circ}{G_{max} - G^\circ}\right)^{0.2} \tag{3}$$

The scalar gradient perception of the focal agent is depicted by the term $G^\circ$ and the sensor threshold by $G_{max}$. $\sigma_c$ is a parameter which is kept constant and simply determines the upper and lower boundaries of $\sigma_i$ on sensory agents. On non-sensory agents, $\sigma_i$ is always equal to $\sigma_c$. In other words, non-sensory agents always sense $G^\circ$ as zero.
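As a quick numeric illustration of Eq. 3 (as reconstructed above), the following sketch assumes $G_{max}$ equals 255, the grid's maximum value, which is our assumption:

    def sigma_adm(g_local, sigma_c=0.6, g_max=255.0):
        # Non-sensory agents effectively read g_local = 0, giving sigma_c;
        # sigma grows slowly at first and drastically as g_local nears g_max.
        return sigma_c + sigma_c * (g_local / (g_max - g_local)) ** 0.2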
To calculate the linear ($U_i$) and angular ($\omega_i$) speeds, the proximal control vector ($\vec{p}_i$) is first projected onto the agent's local frame. The local frame of the agent is placed such that the x-axis of this orthogonal, right-handed frame always coincides with the heading direction. Next, the speeds are calculated as follows, where $K_1$ and $K_2$ are the respective gains and $p_x$ and $p_y$ are the components of $\vec{p}_i$ in the agent's local frame:

$$U_i = K_1 p_x, \quad \omega_i = K_2 p_y \tag{4}$$

Finally, the linear speed ($U_i$) is clamped between 0 and $U_{max}$ whereas the angular speed ($\omega_i$) is clamped between $-\omega_{max}$ and $\omega_{max}$. This clamping implies that agents cannot move backwards. In case of a sharp repulsion/attraction, the agent changes heading first and slowly increases its linear speed once the heading has turned sufficiently. In addition, the rate of change of heading being finite implies that the agent changes heading over time, not instantaneously.
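Putting Eqs. 1–4 together, a single ADM control update might look as follows in Python. This is a sketch built on the reconstructed equations above, not the authors' implementation; the parameter values are those reported later for the simulations.

    import math

    EPS, ALPHA = 12.0, 3.0       # proximal strength and weight
    K1, K2 = 0.2, 0.1            # linear and angular speed gains
    U_MAX, W_MAX = 0.08, 3.14    # speed limits

    def adm_step(neighbors, sigma_i):
        # neighbors: list of (distance, bearing) pairs in the local frame,
        # where bearing 0 points along the current heading.
        px = py = 0.0
        for d, phi in neighbors:
            # Eq. 2 (as reconstructed): zero at d = sigma_i, attractive
            # (positive) beyond it, repulsive (negative) below it.
            mag = -(EPS / 2.0) * ((sigma_i / d) - (sigma_i / d) ** 0.5)
            px += ALPHA * mag * math.cos(phi)
            py += ALPHA * mag * math.sin(phi)
        # Eq. 4 with clamping: no backward motion, bounded turn rate.
        u = min(max(K1 * px, 0.0), U_MAX)
        w = min(max(K2 * py, -W_MAX), W_MAX)
        return u, w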
Our previously proposed method is called Desired Distance Modulation (DM) [9]. The major differences between DM and ADM are the calculation of the proximal control vector magnitude and the modulation of the desired distance. In DM, the contribution of each neighbor peer to the magnitude of the proximal control vector of $i$ is calculated as follows:

$$p_i^m(d_i^m, \sigma_i) = -\epsilon\left[2\frac{\sigma_i^4}{(d_i^m)^5} - \frac{\sigma_i^2}{(d_i^m)^3}\right] \tag{5}$$

The desired distance coefficient $\sigma_i$ is obtained as in:

$$\sigma_i = \sigma_c + \sigma_c\frac{G^\circ}{G_{max}} \tag{6}$$
Fig. 1. Proximal control vector magnitude vs neighbor distance. Positive values repre-
sent attraction, while negative repulsion.


Differently from ADM, $\sigma_i$ correlates with the desired distance by a factor of $\sqrt{2}$ ($d_i^{des} = 2^{1/2}\sigma_i$) in DM.
Figure 1a and Fig. 1b show the difference between the two methods of calculating, respectively, $p_i^m$ and $\sigma_i$. In Fig. 1a, positive values produce attraction and negative ones repulsion. In addition, the points where the curves intersect the red dashed line (no force) indicate the desired distances ($d_i^{des}$). As seen in Fig. 1a, ADM produces attraction for neighbors which are further than $d_i^{des}$, up to the sensing range, which is set to 3.0. In contrast, DM produces attraction only for a short distance after $d_i^{des}$, after which the attraction starts to decrease. Although the form of DM's proximal control vector helps in maintaining a stable collective motion in the swarm [8], the form of ADM becomes advantageous when there are non-sensory agents, since sensory agents can sufficiently influence the whole swarm. To maintain stable collective motion, the curve of ADM intersects the red dashed line at a shorter distance than DM, implying a shorter desired distance. In Fig. 1b, higher $\sigma_i$ stands for larger $d_i^{des}$ in both methods. In ADM, $\sigma_i$ changes drastically at high local perceived values to produce an impulse within the swarm through the sensory agents, which guides the swarm towards the increasing gradient more successfully.

3 Experimental Setup

We perform both simulations and validation experiments on real nano-drones. Simulations involve $N$ agents with no inertial properties. The state of each agent is its position and heading, which are updated by discrete integration using a time step $d_t$, set to 0.05 simulated seconds. The gradient in the environment is modeled as a grid which carries values between 0 and 255. The gradient model is designed in such a way that it carries the maximum value (255) in its center and has decreasing values radially out from there. Sensory agents can only perceive the value of the cell they are currently located in,
with a random error in the range $[-e_g, e_g]$. All agents, including non-sensory ones, can perceive relative peer distances and bearings within a limited interaction range ($D_i = 3.0$ m). Random error is added to the relative range and bearing measurements within the ranges $[-e_r, e_r]$ and $[-e_b, e_b]$ respectively. These ranges are set to: $e_r = 0.025$ m, $e_b = 0.09$ radians and $e_g = 2.5$. We tested the following swarm sizes: 10, 20, 50, 100, 150, 200, 250. For a fair comparison, the grid is modeled in a way that each cell edge measures 0.04 m. In addition, larger swarms require a larger gradient grid to counterbalance the effect of them covering a larger area. To achieve that, the grid is scaled in correlation with the swarm size while still preserving the cell edge length. To assess the influence of non-sensory agents on the swarm, the ratio of sensory agents is incremented in 10% steps between 10% and 100% for each swarm size. Sensory agents are selected randomly.
For each run of each setting, the swarm is initialized around a random point chosen on the isoline (which is a circle because of our gradient design) carrying values which are 25% of the maximum. This value was chosen to be challenging while starting with the whole swarm inside the gradient, since starting from outside of the gradient would make our methods unusable. The initialization is done within a circle large enough to host all agents with a safe distance among them when uniformly scattered. Random perturbations are applied to the positions before each run. The heading is randomly chosen for each agent. For each setting, 24 experiments are conducted. To assess the gradient following performance (GFP) of swarms, the local values perceived by all agents are averaged over the swarm and recorded every 2 simulated seconds, together with agent states. Since larger swarms move in larger gradients, they require correspondingly longer times. Hence, the experiment duration increases proportionally to the swarm sizes (1620, 2292, 3623, 5123, 6275, 7245, 8100 simulated seconds). In all experiments of ADM and DM, the speed gains were chosen as $K_1 = 0.2$ and $K_2 = 0.1$, whereas $\sigma_c$ was 0.6 for ADM and 0.5 for DM. $\epsilon$ and $\alpha$ are chosen as 12.0 and 3.0 respectively, for both methods. The maximum speed was 0.08 m/s and the maximum angular speed was 3.14 rad/s.
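The gradient grid and the noisy scalar sensing can be sketched as below. The exact radial profile is our assumption, chosen only to match the description (maximum 255 at the center, decreasing radially, 0.04 m cells):

    import math
    import random

    CELL = 0.04  # m per cell edge
    E_G = 2.5    # scalar sensing error bound

    def make_gradient(n_cells):
        # 255 at the center cell, decreasing with radius (assumed linear profile).
        c = (n_cells - 1) / 2.0
        r_max = math.hypot(c, c)
        return [[max(0.0, 255.0 * (1.0 - math.hypot(i - c, j - c) / r_max))
                 for j in range(n_cells)] for i in range(n_cells)]

    def sense(grid, x, y):
        # A sensory agent reads its current cell's value plus uniform noise;
        # non-sensory agents skip this and use 0 instead.
        return grid[int(y / CELL)][int(x / CELL)] + random.uniform(-E_G, E_G)

    def gfp(grid, positions):
        # Gradient following performance: local values averaged over the swarm.
        return sum(grid[int(y / CELL)][int(x / CELL)] for x, y in positions) / len(positions)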
To test our methods on agents which are under the constraints of aerial vehicle dynamics, we chose Crazyflie 2.1 nano-drones1. A swarm of 7 Crazyflies is used for the experiments. In our setup, we use an Ultra-Wideband positioning system developed for the Crazyflie2. In this setup, each drone calculates its position onboard in 3D. Since an off-the-shelf framework for onboard peer sensing on the Crazyflie does not yet exist, we use the Crazyswarm project [12] and parallelized controllers on a central computer. Nevertheless, in our central computer, we guarantee that the constraints on peer ($D_i$) and environmental sensing (defined in Sect. 2) are never violated. During the experiment, each drone sends its position to the central computer. Then, the control software calculates linear and angular speeds on parallel controllers. Local value sensing is achieved in the same way, with a simulated gradient grid: the control software detects which virtual cell the real agent is in, and the value of that cell is used in the velocity command calculations. Finally, linear and angular speed commands are sent back to each drone. Drones and the control software exchange information 20 times a second. Here, it is important to note that the heading does not correspond to the yaw angle of the drone in our setup. Instead, the yaw is kept constant. The heading only updates the instantaneous direction of motion. By doing so, we avoid the extra control overhead of the yaw angle and still have an effective notion of heading.

1 https://www.bitcraze.io/products/crazyflie-2-1/.
2 https://www.bitcraze.io/documentation/system/positioning/loco-positioning-system/.
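The constant-yaw scheme described above amounts to integrating a virtual heading per drone and sending world-frame velocity setpoints. A sketch of that conversion follows; cmdVelocityWorld is part of the Crazyswarm Python API, while the bookkeeping around it is our own illustration:

    import math

    def send_commands(crazyflies, headings, controls, dt=0.05):
        # controls: per-drone (U, omega) pairs from the flocking controller.
        for cf, (u, w) in zip(crazyflies, controls):
            # Integrate the virtual heading; the drone's yaw stays constant.
            headings[cf.id] = (headings[cf.id] + w * dt) % (2.0 * math.pi)
            vx = u * math.cos(headings[cf.id])
            vy = u * math.sin(headings[cf.id])
            # World-frame velocity setpoint at constant altitude, zero yaw rate.
            cf.cmdVelocityWorld([vx, vy, 0.0], yawRate=0.0)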

Fig. 2. Performance comparisons of DM and ADM with 50 agents and 50% sensory
agent ratio.

For each method, we tested the GFP for two cases: all agents are sensory, and only 4 out of 7 agents are sensory. To test each setting, flights are repeated 5 times, once for each region selected around the 4 corners and the center point of our flight arena. Initial positions of the drones are chosen randomly in the selected region. The dimensions of the flight arena are 6.5 m by 4.0 m. Each flight test has a duration of 270 s. Finally, the variable parameters of ADM in flight tests were: $K_1 = 0.4$, $K_2 = 0.2$ and $\sigma_c = 0.7$. In DM, they were chosen as: $K_1 = 0.2$, $K_2 = 0.1$ and $\sigma_c = 0.5$. The maximum speeds, $\alpha$ and $\epsilon$ were exactly the same as in simulations. The parameter values for the simulation and flight experiments are manually tuned for the best performance by considering sensor ranges and maximum allowed speeds at the same time. Some parameter values change between flight and simulation experiments since the motion dynamics of point agents and nano-drones differ from each other.

4 Results and Discussion

In this section, we present simulation experiment results followed by results from nano-drone swarm experiments. Finally, we discuss our findings.
First, we compare the performances of ADM and DM in following the gradient and observe the behaviours by looking at the interactions between agents and the swarm trajectories. In Fig. 2a, the time evolution of the GFP over 24 repetitions is presented for 50 agents and a 50% sensory agent ratio. We can observe that ADM reaches near-maximum values faster than DM and remains stable afterward. Meanwhile, DM shows a wider range for quartiles and oscillation of the median around lower values than ADM.

In Fig. 2b, we present the median and quartiles of the GFP of both methods averaged over the experiment duration, for changing sensory agent ratios.
We also show whether there was any separation in the group. This is determined by calculating the total number of groups ($NoG$) which are completely out of each other's sensing range at any instant. The performance of DM is significantly worse than ADM until the sensory agent ratio reaches 50%. After this point, although we can see a clear improvement with DM, its GFP never reaches values close to ADM's. In addition, we observe more than one group in DM for sensory agent ratios between 40% and 90%. The $NoG$ being higher than 1 indicates that despite some portion of the swarm reaching the high value region, other portions can be elsewhere, even outside of the defined gradient.
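The group count can be computed as the number of connected components of the graph that links agents within sensing range; a straightforward sketch (names are ours):

    import math

    def number_of_groups(positions, sensing_range=3.0):
        # Flood-fill connected components of the proximity graph.
        n, seen, groups = len(positions), set(), 0
        for start in range(n):
            if start in seen:
                continue
            groups += 1
            stack = [start]
            while stack:
                a = stack.pop()
                if a in seen:
                    continue
                seen.add(a)
                for b in range(n):
                    if b not in seen and math.dist(positions[a], positions[b]) <= sensing_range:
                        stack.append(b)
        return groups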
To understand the reasons for the performance differences between the methods, the distances between agents can be recorded and observed. This observation is important because of the main intuitive idea behind our methodology: the heterogeneity of desired distances is the only possible cause of the emergent collective motion. In Fig. 3a and 3b (for DM and ADM respectively), we present all distances between neighboring agents and the desired distances of all agents, as a distribution over the experiment duration and at the mid-point of the experiment duration. We observe that the desired distance and the median distance of neighbors are closer to each other in ADM compared to DM. As expected from our methodology (see Eq. 3), sensory agents have larger desired distances on a gradient than non-sensory ones for both methods. Hence, we can understand the performances by observing the desired distances.

Figure 3b shows a clear increase in desired distance for ADM, which indicates that perceived local values are getting higher and reaching a maximum. This increase is not present for DM in Fig. 3a.
Another difference between the two methods lies in the minimum neighbor distances, i.e. the distance between each agent and its closest neighbor. Differently from the desired distance, the minimum neighbor distance is a result rather than a variable we control with our methodology. In ADM, the minimum distance is significantly lower than the desired one. Nevertheless, the minimum distance never reaches near-zero values (which could be dangerous for physical agents) for 50 agents. In DM, it changes in close correlation with the desired distance, which means that agents keep a safe distance from each other. The reason behind this difference in minimum distances between DM and ADM is that, in ADM, attraction is produced even for the furthest perceived neighbors. Thus, some portion of the repulsion that would keep the focal agent away from the nearest neighbor is cancelled out by that increased attraction. As a result,
Fig. 3. Distributions of neighbor distances for sensory and non-sensory agents of 50 agents with 50% sensory ratio.

the minimum distance where the balance is achieved is smaller when compared
to DM.
When the violins are observed (which show the distribution at a single time instant), we naturally see very similar properties to the over-time distributions. Yet, it can be seen that distances are clustered around several values, in a multimodal distribution. This indicates that neighbors are located in "neighborhood rings" around the focal agents, with empty spaces between rings.
In Fig. 4, we present the trajectories of the swarm centroids (i.e., the center of mass of the swarm in each experiment, not the individual agent trajectories) and snapshots from simulations. For both methods, we can clearly see "neighborhood rings" with aligned directions of motion when the swarm is following the gradient. Centroid trajectories also indicate that ADM ascends the gradient successfully with a decisive direction of motion, while DM trajectories show disturbances and curves. Also, in the top center snapshot, ADM forms a unique emergent and dynamic formation whereby the swarm rotates around the maximum. In DM, the swarm wanders around the maximum, covering a larger region without any emergent formation.
Next, we focus on scalability and look at the performances of the methods for changing swarm sizes. For each swarm size, all possible sensory agent ratios are observed. In addition, the interaction between agents within a larger swarm is also analyzed. The medians of the time-averaged GFP distributions are presented in Fig. 5.

With this new metric, we have an overall look at the performances for all swarm sizes and sensory agent ratios. We observe that performance with ADM is
Fig. 4. Centroid trajectories of the swarms for 24 repetitions and snapshots from phases
of a simulation run. DM in top row, ADM in bottom. Non-sensory agents are colored
in blue and sensory agents are colored in orange. (Color figure online)

Fig. 5. Median of time-averaged GFP of the two methods over 24 repetitions. The left block shows values for DM and the right block for ADM. Blue box boundaries point out a group separation during at least one of the experiments.
consistently better than DM's. Still, we can pinpoint the cases where ADM is not at its best, specifically a low sensory agent ratio combined with a small swarm size or a very large swarm size. When the swarm size is small, the sensory agents' total influence becomes insufficient to guide the whole swarm. On the other hand, when the swarm size is larger, visual inspection has shown reduced order. This results in a lower speed of the swarm centroid, which makes the given time limit insufficient for the swarm to reach higher values of the gradient. For DM, performances suffer comparatively more from decreasing sensory agent ratios and never reach high values for ratios lower than 50%. In addition, the blue box boundaries indicate a group separation ($NoG > 1$). This is another indicator of low performance for DM, especially for larger swarm sizes.
Experiments with increased swarm size give us the opportunity to analyze
the scalability of the neighbor distances since we observed potential issues with
smaller swarms. Figures 6a and 6b show this information for the 50% sensory
agent ratio. For ADM with 250 agents, the desired and median neighbor distances
are very close to each other, similarly to the case with 50 agents. Yet, the
minimum neighbor distance is clearly shorter. It can even be considered
dangerous, since it reaches near-zero values, which indicates possible
collisions for real robots. When those distances are observed for DM, we can
see that the minimum neighbor distance never falls below 0.5 m. Despite that,
sensory agents with DM still show a difference between 50 and 250 agents: the
difference between the desired distance and the minimum distance is
significantly higher for 250 agents. Especially in larger swarms with DM,
agents fail to obtain the emergent isolation between sensory and non-sensory
agents. Non-sensory agents, with their low desired distances, become trapped
among sensory ones. Hence, they produce minimum agent-to-agent distances for
sensory agents similar to those of non-sensory ones.
Finally, we assess the performances of ADM and DM with aerial vehicle
dynamics. We observe the centroid trajectory of the nano-drone swarm (each
trajectory corresponds to the swarm centroid in one experiment) in addition
to calculating GFP. Figure 7 presents these centroid trajectories in addition to
the mean time-averaged GFP over 5 experiments. Start points are labeled with S,
while last recorded points are labeled with F.
What we can observe from the centroid trajectories is that when all agents
are sensory, both methods manage to follow the ascending gradient and stay
around the maximum for the rest of the experiment. However, when only 4
agents are sensory out of 7, DM struggles to follow the gradient. Even if the
swarm manages to reach the maximum, it does not stay there. In addition, since
we did not introduce any boundaries, swarms with DM leave the gradient (and,
hence, our flight arena) in 2 out of 5 experiments. For safety reasons, these
experiments are terminated earlier than the time limit and the last recorded
point is labeled with E. On the other hand, swarms with ADM show satisfactory
performances with only 4 sensory agents. Although the centroid wanders around
the maximum instead of staying there, we do not observe any behavior where
the swarm abandons it or outright leaves the gradient. In addition, the formation
we observed in simulation experiments also emerges in real flights with ADM.

Fig. 6. Distributions of neighbor distances for sensory and non-sensory agents of 250 agents with 50% sensory ratio

As a final remark, for both methods and for all settings, there were no crashes
or group separations. The drones always maintained a safe distance from each
other and remained a cohesive group. Shots of the drone swarms can be seen in
Fig. 7. Videos can be accessed at https://youtu.be/migbwjt_EW0.
We now discuss the reported results at a higher level. Altogether, when over-
all performances of both methods are compared, with various swarm sizes and
sensory agent ratios, ADM clearly stands out. It shows better performance with
larger swarms and lower proportions of sensory agents, even down to 10% of the
whole swarm. Yet, for both methods, when the swarm is too small, the influence of
sensory agents becomes insufficient to guide the rest of the swarm. Conversely,
when the swarm becomes too large with low proportions of sensory agents, order
cannot be obtained and maintained. Thus, although the swarm moves towards
the ascending gradient, the performance ends up unsatisfactory within a given
time limit.
In the neighbor distance data, we can observe how the methods differ in
sensory and non-sensory agents' neighbor distances. In ADM, the gap is larger,
and hence its performance is better. To understand this, we need to comprehend
how both methods work in the first place. When sensory agents perceive
higher values, they increase their desired distances to others. In other words,
they become more "asocial". Since non-sensory agents still have shorter desired

Fig. 7. The swarm centroid trajectories of the nano-drone swarm and shots from flights

distances, they move to be closer to the others; in other words, they become
more "social". The chase of the "social" agents to get closer to the "asocial"
ones results in a collective motion that follows the increasing gradient. When
the distance gap between sensory and non-sensory agents is more significant,
gradient following gets better. Yet, we could not manually tune DM to obtain a
larger difference in distances: in that case, group separation becomes a serious
problem. This is expected because DM has a smaller attractive region in its
proximal control vector (see Fig. 1a). ADM does not suffer from that but, as a
drawback, it cannot guarantee safe distances between agents, especially in
larger swarms. DM guarantees safe distances

but suffers from low performance. The best trade-off likely lies between the
two, but it remains to be discovered.
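The mechanism can be sketched as follows; the linear modulation, the gain parameter, and the function name are our own illustrative assumptions, not the exact controller of either method:

```python
def desired_distance(base, gain, perceived_value, is_sensory):
    # Sensory agents grow more "asocial" as the perceived scalar field
    # value rises, enlarging their desired distance to neighbors, while
    # non-sensory agents keep their (shorter) base desired distance.
    # The linear form base + gain * value is an illustrative assumption.
    if is_sensory:
        return base + gain * perceived_value
    return base
```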
We do not see the drawback of ADM in real flights since the number of
agents is lower than in simulations. Hence ADM performs better than DM in
nano-drone experiments as it does in simulation experiments.

5 Conclusion

In this paper, we propose an efficient gradient-following method which does
not require information communication and relies on scalar sensing only. We
compare the gradient following performance with our previous effort and test
the performance of the method over changing swarm sizes and sensory agent
ratios. Results show that this new method’s gradient following ability compares
favorably to our previous effort which was, itself, comparable to the ideal case of
a single agent able to sense the ascending direction of the gradient at each step.
Moreover, our proposed method is robust to disabled or absent sensors for large
portions of the swarm, for a wide range of swarm size.
The possible search applications for the gradient-following behaviour are
numerous: wildfires, dangerous radioactive sources outdoors, or gas leaks
indoors are only some of them. The proposed method's robustness to disabled
sensors becomes an effective advantage in these applications. Even if a large
portion of the swarm has disabled sensors for some reason, we can still expect
the swarm to keep conducting the search task.
Implementing the control, peer sensing, and scalar measurements on-board
is an important step to take in the future. Meanwhile, designing a more
realistic gradient model (a plume or fluctuating local values) would be
a complementary step. In addition, our results also constitute building blocks
for alternative functionalities, which opens the door for further research. For
instance, if the swarm is tasked to become a predator entity to chase some prey,
we expect that it will be able to do so, even if the latter is in the sensing range
of only a portion of the swarm.
For that reason, we believe further research and real-life trials with relevant
sensors will reveal exciting promise for everyday applications.

DAN: Decentralized Attention-Based
Neural Network for the MinMax Multiple
Traveling Salesman Problem

Yuhong Cao, Zhanhong Sun, and Guillaume Sartoretti(B)

National University of Singapore, Mechanical Engineering Department, Singapore, Singapore
{caoyuhong,sun z}@u.nus.edu, guillaume.sartoretti@nus.edu.sg

Abstract. The multiple traveling salesman problem (mTSP) is a well-known
NP-hard problem with numerous real-world applications. In particular,
this work addresses MinMax mTSP, where the objective is to
minimize the max tour length among all agents. Many robotic deploy-
ments require recomputing potentially large mTSP instances frequently,
making the natural trade-off between computing time and solution qual-
ity of great importance. However, exact and heuristic algorithms become
inefficient as the number of cities increases, due to their computational
complexity. Encouraged by the recent developments in deep reinforce-
ment learning (dRL), this work approaches the mTSP as a cooperative
task and introduces DAN, a decentralized attention-based neural method
that aims at tackling this key trade-off. In DAN, agents learn fully decen-
tralized policies to collaboratively construct a tour, by predicting each
other’s future decisions. Our model relies on attention mechanisms and is
trained using multi-agent RL with parameter sharing, providing natural
scalability to the numbers of agents and cities. Our experimental results
on small- to large-scale mTSP instances (50 to 1000 cities and 5 to 20
agents) show that DAN is able to match or outperform state-of-the-art
solvers while keeping planning times low. In particular, given the same
computation time budget, DAN outperforms all conventional and dRL-
based baselines on larger-scale instances (more than 100 cities, more than
5 agents), and exhibits enhanced agent collaboration. A video explaining
our approach and presenting our results is available at https://youtu.be/xi3cLsDsLvs.

Keywords: multiple traveling salesman problem · decentralized planning · deep reinforcement learning

1 Introduction

The traveling salesman problem (TSP) is a challenging NP-hard problem where,
given a group of cities (i.e., nodes) of a given complete graph, an agent needs to
find a complete tour of this graph, i.e., a closed path from a given starting node

that visits all other nodes exactly once with minimal path length. The TSP can
be further extended to the multiple traveling salesman problem (mTSP), where
multiple agents collaborate with each other to visit all cities from a common
starting node with minimal cost. In particular, MinMax (minimizing the max
tour length among agents, i.e., total task duration) and MinSum (minimizing
total tour length) are two most common objectives for mTSP [1–3]. Although
simple in nature, the TSP and mTSP are ubiquitous in robotics problems that
need to address agent distribution, task allocation, and/or path planning, such
as multi-robot patrolling, last mile delivery, or distributed search/coverage. For
example, Faigl et al. [4], Oßwald et al. [5] and Cao et al. [6] proposed methods for
robot/multi-robot exploration tasks, at the core of which is such a TSP/mTSP
instance, whose solution quality will influence the overall performance. More gen-
erally, dynamic environments are involved in many robotic deployments, where
the underlying graph may change with time and thus require the TSP/mTSP
solution be recomputed frequently and rapidly (within seconds or tens of seconds
if possible). While state-of-the-art exact algorithms (e.g., CPLEX [7], LKH3 [8],
and Gurobi [9]) can find near-optimal solutions to TSP instances in a few sec-
onds, exact algorithms become unusable for mTSP instances if computing time
is limited [2]. As an alternative, meta-heuristic algorithms like OR Tools [10]
were proposed to balance solution quality and computing time. In recent years,
neural-based methods have been developed to solve TSP instances [11–13] and
showed promising advantages over heuristic algorithms both in terms of com-
puting time and solution quality. However, neural-based methods for mTSP are
scarce [2,3]. In this work, we introduce DAN, a decentralized attention-based
neural approach for the MinMax mTSP, which is able to quickly and distribut-
edly generate tours that minimize the time needed for the whole team to visit
all cities and return to the depot (i.e., the makespan).
We focus on solving mTSP as a decentralized cooperation problem, where agents
each construct their own tour towards a common objective. To this end, we rely
on a threefold approach: first, we formulate mTSP as a sequential
decision-making problem where agents make decisions asynchronously towards
enhanced collaboration. Second, we propose an attention-based neural network to
allow agents to make individual decisions according to their own observations,
which provides agents with the ability to implicitly predict other agents'
future decisions, by modeling the dependencies of all the agents and cities.
Third, we train our model using multi-agent reinforcement learning with
parameter sharing, which provides our model with natural scalability to
arbitrary team sizes (Fig. 1).

Fig. 1. DAN's final solution to an example mTSP problem.
We present test results on randomized mTSP
instances involving 50 to 1000 cities and 5 to 20 agents, and compare DAN’s
performance with that of exact, meta-heuristic, and dRL methods. Our results

highlight that DAN achieves performance close to OR Tools, a highly optimized
meta-heuristic baseline [10], in relatively small-scale mTSP (fewer than 100
cities). In relatively large-scale mTSP, our model is able to significantly out-
perform OR Tools both in terms of solution quality and computing time. We
believe this advantage makes DAN more reliable in dynamic robotic tasks than
non-learning approaches. DAN also outperforms two recent dRL based methods
in terms of solution quality for nearly all instances.

2 Prior Works
For TSP, exact algorithms like dynamic programming and integer program-
ming can theoretically guarantee optimal solutions. However, these algorithms do
not scale well. Nevertheless, exact algorithms with handcrafted heuristics (e.g.,
CPLEX [7]) remain state-of-the-art, since they can reduce the search space effi-
ciently. Neural network methods for TSP became competitive after the recent
advancements in dRL. Vinyals et al. [11] first built the connection between deep
learning and TSP by proposing the Pointer network, a sequence-to-sequence
model with a modified attention mechanism, which allows one neural network to
solve TSP instances composed of an arbitrary number of cities. Kool et al. [13]
replaced the recurrent unit in the Pointer Network and proposed a model based
on a Transformer unit [14]. Taking advantage of self-attention for modeling
the dependencies among cities, Kool et al.'s approach achieved significant
improvements in terms of solution quality.
In mTSP, cities need to be partitioned and allocated to each agent in addi-
tion to finding optimal sequences/tours. This added complexity makes state-of-
the-art exact algorithm TSP solvers impractical for larger mTSP instances. OR
Tools [10], developed by Google, is a highly optimized meta-heuristic algorithm.
Although it does not guarantee optimal solutions, OR Tools is one of the most
popular mTSP solvers because it effectively balances the trade-off between solu-
tion quality and computing time. A few recent neural-based methods have also
approached the mTSP in a decentralized manner. Notably, Hu et al. [2] proposed
a model based on a shared graph neural network and distributed attention mech-
anism networks to first allocate cities to agents. OR Tools can then be used to
quickly generate the tour associated with each agent. Park et al. [3] presented an
end-to-end decentralized model based on a graph attention network, which can
solve mTSP instances with arbitrary numbers of agents and cities. They used a
type-aware graph attention mechanism to learn the dependencies between cities
and agents, where the extracted agents' and cities' features are concatenated
during the embedding procedure before outputting the final policy.

3 Problem Formulation
The mTSP is defined on a graph G = (V, E), where V = {1, ..., n} is a set of n
nodes (cities), and E is a set of edges. In this work, we consider G to be complete,
i.e., (i, j) ∈ E for all i ≠ j. Node 1 is defined as the depot, where all m agents

are initially placed, and the remaining nodes as cities to be visited. The cities
must be visited exactly once by any agent. After all cities are visited, all agents
return to the depot to finish their tour. Following the usual mTSP statement
in robotics [15], we use the Euclidean distance between cities as edge weights,
i.e., this work addresses the Euclidean mTSP. We define a solution to mTSP,
π = {π^1, ..., π^m}, as a set of agent tours. Each agent tour
π^i = (π^i_1, π^i_2, ..., π^i_{n_i}) is an ordered set of the cities visited by
this agent, where π^i_t ∈ V and π^i_1 = π^i_{n_i} is the depot. n_i denotes the
number of cities in agent i's tour, so Σ_{i=1}^{m} n_i = n + 2m − 1, since all
agent tours involve the depot twice. Denoting the Euclidean distance between
cities i and j as c_{ij}, the cost of agent i's tour reads:
L(π^i) = Σ_{j=1}^{n_i − 1} c_{π^i_j π^i_{j+1}}. As discussed in Sect. 1, we
consider MinMax, i.e., minimizing max_{i∈{1,...,m}} L(π^i), as the objective of
our model.
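To make the objective concrete, the tour cost and MinMax objective above can be computed as in the following minimal Python sketch (0-based indices with the depot at index 0; function names are ours):

```python
import math

def tour_length(tour, coords):
    # L(pi^i): sum of Euclidean edge costs c_ij along one agent's closed
    # tour, which starts and ends at the depot.
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))

def minmax_cost(solution, coords):
    # The MinMax objective: the longest individual tour (the makespan);
    # its negative is used later as the shared reward R(pi) in Sect. 4.
    return max(tour_length(tour, coords) for tour in solution)

coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(minmax_cost([[0, 1, 3, 0], [0, 2, 0]], coords))
```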

4 mTSP as an RL Problem
In this section, we cast mTSP into a decentralized multi-agent reinforcement
learning (MARL) framework. In our work, we consider mTSP as a cooperative
task instead of an optimization task. We first formulate mTSP as a sequential
decision making problem, which allows RL to tackle mTSP. We then detail the
agents’ state and action spaces, and the reward structure used in our work.

Sequential Decision Making. Building upon recent works on neural-based
TSP solvers, we consider mTSP as a sequential decision-making problem. That
is, we let agents interact with the environment by performing a sequence of
decisions, which in mTSP are to select the next city to visit. These decisions
are made sequentially and asynchronously by each agent based on their own
observation, upon arriving at the next city along their tour, thus constructing
a global solution collaboratively and iteratively. Each decision (i.e., RL action)
will transfer the agent from the current city to the next city it selects. Assum-
ing all agents move at the same, uniform velocity, each transition takes time
directly proportional to the Euclidean distance between the current city and
the next city. We denote the remaining time until the next decision of agent i
as g_i (i.e., its remaining travel time). In practice, we discretize time in steps,
and let g_i decrease by Δg at each time step, which defines the agents' velocity.
This assumption respects the actual time needed by agents to move about the
domain, but also lets agents make decisions asynchronously (i.e., only when they
reach the next city on their tour). We note that staggering the agents' decisions
in time naturally helps avoid potential conflicts in the city selection process, by
allowing agents to select cities in a sequentially-conditional manner (i.e., agents
select one after the other, each having access to the decisions already made by
agents before them in this sequence). We empirically found that this significantly
improves collaboration and performance.

Observation. We consider a fully observable world where each agent can access
the states of all cities and all agents. Although a partial observation is more

Algorithm 1. Sequential decision making to solve mTSP.

Input: number of agents m, graph G = (V, E)
Output: solution π
Initialize mask M = {0}, remaining travel times g = {0}, and empty tours starting at the depot: π^i = {1} (1 ≤ i ≤ m).
while sum(M) < n do
    for i = 1, ..., m do
        if g_i ≤ 0 then
            Observe s^c_i, s^a_i of agent i and output policy p
            Select next city π^i_t from p (t = Length(π^i) + 1)
            Append π^i_t to π^i; set M[π^i_t] ← 1 and g_i ← c_{π^i_{t−1} π^i_t}
        end if
        g_i ← g_i − Δg
    end for
end while
return π = {π^1, ..., π^m}
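A minimal, runnable Python sketch of this decision loop follows; here `policy` is a stand-in for DAN's learned network (introduced in Sect. 5), and the contract that it returns an unvisited city or the depot is our simplifying assumption:

```python
import math

def solve_mtsp(coords, m, policy, dg=0.1):
    # Sketch of Algorithm 1: asynchronous, sequentially-conditional city
    # selection. 'policy(i, tours, mask)' stands in for DAN's network and
    # must return an unvisited city (or the depot, which stays unmasked).
    n = len(coords)
    mask = [False] * n                 # global visit history M
    g = [0.0] * m                      # remaining travel times g_i
    tours = [[0] for _ in range(m)]    # all tours start at the depot (0)
    while sum(mask) < n - 1:           # until all non-depot cities visited
        for i in range(m):
            if g[i] <= 0:
                nxt = policy(i, tours, mask)
                g[i] = math.dist(coords[tours[i][-1]], coords[nxt])
                tours[i].append(nxt)
                if nxt != 0:           # the depot is never masked
                    mask[nxt] = True
            g[i] -= dg                 # time step defines agent velocity
    for tour in tours:                 # all agents return to the depot
        tour.append(0)
    return tours
```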

common in decentralized MARL [16], a global observation is necessary to make
our model comparable to baseline algorithms, and partial observability will be
considered in future works. Each agent’s observation consists of three parts: the
cities state, the agents state, and a global mask.
The cities state s^c_i = (x^c_i, y^c_i), i ∈ {1, ..., n}, contains the Euclidean
coordinates of all cities relative to the observing agent. Compared to absolute
information, we empirically found that relative coordinates can help prevent
premature convergence and lead to a better final policy.
The agents state s^a_i = (x^a_i, y^a_i, g_i), i ∈ {1, ..., m}, contains the
Euclidean coordinates of all agents relative to the observing agent, and the
agents' remaining travel times g_i. As mTSP is a cooperative task, one agent can
benefit from observing the state of other agents, e.g., to predict their future
decisions.
Finally, agents can observe a global mask M : an n-dimensional binary vector
containing the visit history of all n cities. Each entry of M is initially 0, and is
set to 1 after any agent has visited the corresponding city. Note that the depot
is always unmasked during the task. This keeps agents from being forced to visit
the remaining cities even when doing so would lead to worse solutions.

Action. At each decision step of agent i, based on its current observation
(s^c_i, s^a_i, M), our decentralized attention-based neural network outputs a
stochastic policy p(π^i_t), parameterized by the set of weights θ:
p_θ(π^i_t = j | s^c_i, s^a_i, M), where j denotes an unvisited city. Agent i
takes an action based on this policy to select the next city π^i_t. By
performing such actions iteratively, agent i constructs its tour π^i.

Reward Structure. To show the advantage of reinforcement learning, we try
to minimize the amount of domain knowledge introduced into our approach.

In this work, the reward is simply the negative of the max tour length among
agents: R(π) = − max_{i∈{1,...,m}} L(π^i), and all agents share it as a global
reward. This reward structure is sparse, i.e., agents only get rewarded after
all agents finish their tours.

5 DAN: Decentralized Attention-Based Network


We propose an attention-based neural network, composed of a city encoder,
an agent encoder, a city-agent encoder, and a decoder. Its structure is used to
model three kinds of dependencies in mTSP, i.e., the agent-agent dependencies,
the city-city dependencies, and the agent-city dependencies. To achieve good
collaboration in mTSP, it is important for agents to learn all of these depen-
dencies to make decisions that benefit the whole team. Each agent uses its local
DAN network to select the next city based on its own observation. Compared
to existing attention-based TSP solvers, which only learn dependencies among
cities and finds good individual tours, DAN further endows agents with the abil-
ity to predict each others’ future decision to improve agent-city allocation, by
adding the agent and the city-agent encoders.
Figure 2 shows the structure of DAN. Based on the observations of the decid-
ing agent, we first use the city encoder and the agent encoder to model the depen-
dencies among cities and among agents respectively. In the city-agent encoder,
we then update the city features by considering other agents’ potential decisions
according to their features. Finally, in the decoder, based on the deciding agent’s
current state and the updated city features, we allocate attention weights to each
city, which we directly use as its policy.

Attention Layer. The Transformer attention layer [14] is used as the
fundamental building block in our model. The input of such an attention layer
consists of the query source h^q and the key-and-value source h^{k,v}, which are
both vectors with the same dimension. The attention layer updates the query
source using the weighted sum of the values, where each attention weight depends
on the similarity between query and key. We compute the query q_i, key k_i, and
value v_i as: q_i = W^Q h^q_i, k_i = W^K h^{k,v}_i, v_i = W^V h^{k,v}_i, where
W^Q, W^K, W^V are all learnable matrices of size d × d. Next, we compute the
similarity u_{ij} between the query q_i and the key k_j using a scaled dot
product: u_{ij} = q_i^T · k_j / √d. Then we calculate the attention weights
a_{ij} using a softmax: a_{ij} = e^{u_{ij}} / Σ_{j=1}^{n} e^{u_{ij}}. Finally,
we compute the weighted sum of the values as the output embedding of this
attention layer: h'_i = Σ_{j=1}^{n} a_{ij} v_j. The embedding is then passed
through a feed-forward sublayer (containing two linear layers and a ReLU
activation). Layer normalization and residual connections are used within these
two sublayers as in [14].
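For reference, the layer described above can be sketched in PyTorch as follows (single-head, with the feed-forward sublayer, residual connections, and layer normalization; class and variable names are ours, not the authors' released code):

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Single-head Transformer-style attention layer: scaled dot-product
    attention followed by a feed-forward sublayer, with residual
    connections and layer normalization around each sublayer."""
    def __init__(self, d=128):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)   # W^Q
        self.wk = nn.Linear(d, d, bias=False)   # W^K
        self.wv = nn.Linear(d, d, bias=False)   # W^V
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.scale = d ** 0.5

    def forward(self, h_q, h_kv):
        q, k, v = self.wq(h_q), self.wk(h_kv), self.wv(h_kv)
        u = q @ k.transpose(-2, -1) / self.scale   # similarities u_ij
        a = torch.softmax(u, dim=-1)               # attention weights a_ij
        h = self.norm1(h_q + a @ v)                # residual + layer norm
        return self.norm2(h + self.ff(h))          # feed-forward sublayer
```

Self-attention (as in the city and agent encoders) corresponds to calling `forward(h, h)`, while cross-attention (as in the city-agent encoder) uses two different inputs.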

City Encoder. The city encoder is used to extract features from the cities
state s^c_i and model their dependencies. The city encoder first embeds the
relative Euclidean coordinates x^c_i of each city i, i ∈ {2, ..., n}, into a
d-dimensional (d = 128

Fig. 2. DAN consists of a city encoder, an agent encoder, a city-agent encoder and
a final decoder, which allows each agent to individually process its inputs (the cities
states, and the agents states), to finally obtain its own city selection policy. In partic-
ular, the agent and city-agent encoders are introduced in this work to endow agents
with the ability to predict each other's future decisions and improve the decentralized
distribution of agents.

in practice) initial city embedding h^c_i using a linear layer. Similarly, the
depot's Euclidean coordinates x^c_1 are embedded by another linear layer into
h^c_1. The initial city embedding is then passed through an attention layer.
Here h^q = h^{k,v} = h^c, as is commonly done in self-attention mechanisms.
Self-attention achieved good performance in modeling the dependencies of cities
in single-TSP approaches [13], and we propose to rely on the same fundamental
idea to model the dependencies in mTSP. We term the output of the city encoder,
h̃^c, the city embedding. It contains the dependencies between each city i and
all other cities.

Agent Encoder. The agent encoder is used to extract features from the agents
state s^a_i and model their dependencies. A linear layer is used to separately
embed each (3-dimensional) component of s^a_i into the initial agent embedding
h^a_i. This embedding is then passed through an attention layer, where
h^q = h^{k,v} = h^a. We term the output of this encoder, h̃^a, the agent
embedding. It contains the dependencies between each agent i and all other
agents.

City-agent Encoder. The city-agent encoder is used to model the dependencies
between cities and agents. The city-agent encoder applies an attention layer
with cross-attention, where h^q = h̃^c and h^{k,v} = h̃^a. We term the output
h^{ca} of this encoder the city-agent embedding. It contains the relationship
between each city i and each agent j, and implicitly predicts whether city i is
likely to be selected by another agent j, which is one of the keys to the
improved performance of our model.

Decoder. The decoder is used to decode the different embeddings into a policy
for selecting the next city to visit. The decoder starts with encoding the
deciding agent's current state. We choose to express the current agent state
implicitly by computing an aggregated embedding h^s, which is the mean of the
city embedding. This operation is similar to the graph embedding used in [13].

The first attention layer then adds the agent embedding to the aggregated
embedding. In doing so, it relates the state of the deciding agent to that of
all other agents. Here h^q = h^s and h^{k,v} = h̃^a. This layer outputs the
current state embedding h̃^s. After that, a second attention layer is used to
compute the final candidate embedding h^f, where h^q = h̃^s and h^{k,v} = h^{ca}.
This layer serves as a glimpse, which is a common way to improve attention
mechanisms [12]. There, when computing the similarities, we rely on the global
mask M to manually set the similarity u_i = −∞ if the corresponding city i has
already been visited, to ensure the attention weights of visited cities are 0.
The final candidate embedding h^f then passes through a third and final
attention layer. The query source is the final candidate embedding h^f, and the
key source is the city-agent embedding h^{ca}. For this final layer only,
following [11], we directly use the vector of attention weights as the final
policy of the deciding agent. The same masking operation is also applied in this
layer to satisfy the mTSP constraint. These similarities are normalized using a
softmax operation, to finally yield the probability distribution p over the next
city to visit: p_i = p_θ(π^j_t = i | s^c_i, s^a_i, M) = e^{u_i} / Σ_{i=1}^{n} e^{u_i}.
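The masking step can be sketched as follows; tensor names are illustrative, and the depot entry of the mask is assumed to be kept at 0 so the depot remains selectable:

```python
import torch

def masked_policy(u, visited):
    # u: similarity scores u_i over all n cities; visited: binary mask M.
    # Setting u_i = -inf for visited cities makes their softmax
    # probability exactly zero, enforcing the mTSP constraint.
    u = u.masked_fill(visited.bool(), float('-inf'))
    return torch.softmax(u, dim=-1)   # p_i = e^{u_i} / sum_j e^{u_j}

p = masked_policy(torch.randn(6), torch.tensor([0, 1, 0, 1, 0, 0]))
```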

6 Training

In this section, we describe how DAN is trained, including the choice of hyper-
parameters and hardware used.

REINFORCE with Rollout Baseline. In order to train our model, we define
the policy loss: L = −E_{p_θ(π^i)}[R(π)], where
p_θ(π^i) = Π_{t=1}^{n_i} p_θ(π^i_t | s^c_i, s^a_i, M). The policy loss is the
expectation of the negative of the max length among the agents' tours. The loss
is optimized by gradient descent using the REINFORCE algorithm with a greedy
rollout baseline [13]. That is, we re-run the same exact episode from the start
a second time, and let all agents take decisions by greedily exploiting the best
policy so far (i.e., the "baseline model" explained below). The cumulative
reward b(π) of this baseline episode is then used to estimate the advantage
function: A(π) = R(π) − b(π) (with R(π) the cumulative reward of the RL
episode). This helps reduce the gradient variance and avoids the burden of
training the model to explicitly estimate the state value, as in traditional
actor-critic algorithms. The final gradient estimator for the policy loss
reads: ∇L = −E_{p_θ(π^i)}[(R(π) − b(π)) ∇ log p_θ(π^i)].
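Assuming the per-decision log-probabilities of one episode have been collected in a tensor, the resulting loss term can be sketched as:

```python
import torch

def reinforce_loss(log_probs, episode_reward, baseline_reward):
    # log_probs: log p_theta(pi_t^i) for every decision in the episode.
    # The advantage A = R(pi) - b(pi) is a constant that scales the
    # summed log-likelihood of the episode's decisions.
    advantage = episode_reward - baseline_reward
    return -advantage * log_probs.sum()

loss = reinforce_loss(torch.log(torch.rand(10)), -2.1, -2.4)
```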

Distributed Training. We train our model using parameter sharing, a general
method for MARL [17]. That is, we allow agents to share the parameters of a
common neural network, thus making the training more efficient by relying on
the sum of experience from all agents. Our model is trained on a workstation
equipped with an i9-10980XE CPU and four NVIDIA GeForce RTX 3090 GPUs. We train
our model utilizing Ray, a distributed framework for machine learning [18], to
accelerate training by parallelizing the code. With Ray, we run 8 mTSP
instances in parallel and pass gradients to be applied to the global network
under the A2C structure [19]. At each training episode, the positions of cities
are generated uniformly at random in the unit square [0, 1]^2 and the velocity
of agents is set to Δg = 0.1. The number of agents is randomized within [5, 10]
and the number of cities is randomized within [20, 100] during early training,
which needs 96 h to converge. After initial convergence of the policy, the
number of cities is randomized within [20, 200] for further refinement,
requiring 24 h of training. We form one training batch after 8 mTSP instances
are solved, and perform one gradient update for each agent. We train the model
with the Adam optimizer [20], using an initial learning rate of 10^−5 decayed
every 1024 steps by a factor of 0.96. Every 2048 steps, we update the baseline
model if the current training model is significantly better than the baseline
model according to a t-test. Our full training and testing code is available at
https://bit.ly/DAN_mTSP.
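The reported optimizer settings map directly onto standard PyTorch utilities; a sketch, with a placeholder module standing in for the DAN network:

```python
import torch

model = torch.nn.Linear(128, 128)  # placeholder for the DAN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Decay the learning rate by a factor of 0.96 every 1024 steps, as reported.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1024, gamma=0.96)
```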

7 Experiments
We test our decentralized attention-based neural network (DAN) on numerous
sets of 500 mTSP instances each, generated uniformly at random in the unit
square [0, 1]^2. We test two different variants of our model, which utilize the same
trained policy differently:
– Greedy: each agent always selects the action with highest activation in its
policy.
– Sampling: each agent selects the city stochastically according to its policy.
For our sampling variant, we run our model multiple times on the same instance
and report the solution with the highest quality. While [13] samples 1280 solutions
for the single TSP, we only sample 64 solutions (denoted as s.64) for each mTSP
instance to balance the trade-off between computing time and solution quality.
In the tests, the velocity of agents is set to Δg = 0.01, which improves the
performance of our model by allowing more finely-grained asynchronous action
selection.
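The two variants differ only in how actions are drawn from the policy, with the sampling variant keeping the best of k = 64 runs; a sketch (`solve_once` and its return format are illustrative placeholders):

```python
import torch

def greedy_action(p):
    # Greedy variant: always take the city with the highest probability.
    return int(torch.argmax(p))

def sampled_action(p):
    # Sampling variant: draw the next city from the policy distribution.
    return int(torch.multinomial(p, 1))

def best_of_k(solve_once, k=64):
    # Run the stochastic solver k times on the same instance and keep the
    # solution with the lowest MinMax cost; solve_once() is assumed to
    # return a (cost, tours) pair in this sketch.
    return min((solve_once() for _ in range(k)), key=lambda s: s[0])
```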

7.1 Results
We compare the performance of our model with both conventional and neural-
based methods. For conventional methods, we test OR Tools, an evolutionary
algorithm (EA), and self-organizing maps (SOM) [21] on the same test set. OR Tools
initially gets a solution using meta-heuristic algorithms (path-cheapest-arc) and
then further improves it by local search (2-opt) [22] (denoted as OR). We allow
OR Tools to run for a similar amount of time as our sampling variant, for fair

Table 1. Results on random mTSP set (500 instances each). n denotes the number of cities and m denotes the number of agents; "–" marks values not reported for the corresponding method.

Method      | n=50, m=5    | n=50, m=7    | n=50, m=10   | n=100, m=5   | n=100, m=10
            | Max.   T(s)  | Max.   T(s)  | Max.   T(s)  | Max.   T(s)  | Max.   T(s)
EA          | 2.35   7.82  | 2.08   9.58  | 1.96   11.50 | 3.55   12.80 | 2.75   17.52
SOM         | 2.57   0.76  | 2.30   0.78  | 2.16   0.76  | 3.10   1.58  | 2.41   1.58
OR          | 2.04   12.00 | 1.96   12.00 | 1.96   12.00 | 2.36   18.00 | 2.29   18.00
Gurobi      | 2.54   3600  | –      –     | 2.42   3600  | 7.29   3600  | 7.17   3600
Kool et al. | 2.84   0.20  | 2.64   0.22  | 2.52   0.25  | 3.28   0.37  | 2.77   0.41
Park et al. | 2.37   –     | 2.18   –     | –      –     | 2.88   –     | 2.23   –
Hu et al.   | 2.12   0.01  | –      –     | 1.95   0.02  | 2.48   0.04  | 2.09   0.04
DAN(g.)     | 2.29   0.25  | 2.11   0.26  | 2.03   0.30  | 2.72   0.43  | 2.17   0.48
DAN(s.64)   | 2.12   7.87  | 1.99   9.38  | 1.95   11.26 | 2.55   12.18 | 2.05   14.81

Method      | n=100, m=15  | n=200, m=10  | n=200, m=15  | n=200, m=20
            | Max.   T(s)  | Max.   T(s)  | Max.   T(s)  | Max.   T(s)
EA          | 2.51   21.63 | 4.07   29.91 | 3.62   34.33 | 3.37   39.34
SOM         | 2.22   1.57  | 2.81   3.01  | 2.50   3.04  | 2.34   3.04
OR          | 2.25   18.00 | 2.57   63.70 | 2.59   60.29 | 2.59   61.74
Kool et al. | 2.64   0.46  | 3.27   0.78  | 2.92   0.83  | 2.77   0.89
Park et al. | 2.16   –     | 2.50   –     | 2.38   –     | 2.44   –
DAN(g.)     | 2.09   0.58  | 2.40   0.93  | 2.20   0.98  | 2.15   1.07
DAN(s.64)   | 2.00   19.13 | 2.29   23.49 | 2.13   26.27 | 2.07   29.83

comparison. Note that OR Tools is always allowed to perform local search if


there is computing time left after finding an initial solution. Exact algorithms
need hours to solve a single mTSP instance. Here, we report the results of
Gurobi [9], one of the state-of-the-art integer linear programming solvers, from
Hu et al.'s paper [2], where 1 h of computation was allowed for each instance.
Table 1 reports the average MinMax cost (lower is better) for small-scale mTSP
instances (from 50 to 200 cities), as well as the average computing time per
instance for each solver.
For neural-based methods, we report Park et al.’s results [3] and Hu et al.’s
results [2] from their papers, since they did not make their code available publicly.
Since Park et al.’s paper does not report the computing time of their approach,
we leave the corresponding cells blank in Table 1. Similarly, since Hu et al. did
not provide any results for cases involving more than 100 cities or more than 10
agents, these cells are also left blank. Note that the test sets used by Park et al.
and Hu et al. are likely different from ours, since they have not been made public.
However, the values reported here from their paper are also averaged over 500
instances under the same exact conditions as the ones used in this work. Finally,
for completeness, we also test Kool et al.’s (TSP) model on our mTSP instances,

Table 2. Results on the large-scale mTSP set (500 instances each) where the number of agents is fixed to 10.

Method        | n=500        | n=600        | n=700        | n=800        | n=900        | n=1000
              | Max    T(s)  | Max    T(s)  | Max    T(s)  | Max    T(s)  | Max    T(s)  | Max    T(s)
OR [2]        | 7.75   1800  | 9.64   1800  | 11.24  1800  | 12.34  1800  | 13.71  1800  | 14.84  1800
SOM [21]      | 3.86   7.63  | 4.24   9.39  | 4.54   10.86 | 4.93   14.28 | 5.21   16.65 | 5.53   17.89
Hu et al. [2] | 3.32   0.56  | 3.65   0.81  | 3.95   1.22  | 4.20   1.69  | 4.59   2.21  | 4.81   2.87
DAN(g.)       | 3.29   2.15  | 3.60   2.58  | 3.91   3.03  | 4.23   3.36  | 4.55   3.81  | 4.84   4.21
DAN(s.64)     | 3.14   48.91 | 3.46   57.81 | 3.75   67.69 | 4.10   77.08 | 4.42   87.03 | 4.75   97.26

Table 3. Results on TSPlib instances where longer computing time is allowed.

Method      | eil51      | eil76      | eil101     | kroa150     | tsp225
            | m=5  m=10  | m=5  m=10  | m=5  m=10  | m=10  m=20  | m=10  m=20
LKH3 [8]    | 119  112   | 142  126   | 147  113   | 1554  1554  | 998   999
OR (600 s)  | 119  114   | 145  130   | 153  116   | 1580  1560  | 1068  1143
DAN (s.256) | 126  113   | 160  128   | 168  116   | 1610  1571  | 1111  1032

by simply replacing our neural network structure with theirs in our distributed
RL framework.
Table 2 shows the average MinMax cost for large-scale mTSP instances (from
500 to 1000 cities), where the number of agents is fixed to 10 (due to the limita-
tion of Hu et al.’s model). When testing our sampling variant, we set C = 100 in
the third decoder layer for efficient exploration (since the tour is much longer).
Except for DAN and Hu et al.'s model, no method can handle such large-scale
mTSP instances within reasonable time, but we still report the results of OR
Tools from Hu et al.'s paper, as well as SOM results as the best-performing
meta-heuristic algorithm, for completeness.
Table 3 shows the test results on TSPlib [23] instances, a well-known bench-
mark library, where city distributions come from real-world data. There, we
extend the computing time of OR Tools to 600 s and increase the sample size of
DAN to 256, to yield solutions as optimal as possible. Note that the computing
time of DAN never exceeds 100 s. LKH3 results are reported from [8], where long
enough computing time was allowed to yield exact solutions.

7.2 Discussion

We first notice that DAN significantly outperforms OR Tools in larger-scale
mTSP instances (m > 5, n ≥ 50), but is outperformed by OR Tools in smaller-
scale instances (as can be expected). In smaller-scale mTSP, OR Tools can
explore the search space sufficiently to produce near-optimal solutions. In this
situation, DAN and all other decentralized methods considered find it difficult
to achieve the same level of performance.

However, DAN outperforms OR Tools in larger-scale mTSP instances, where OR
Tools can only yield sub-optimal solutions within the same time budget. In
mTSP100 (m ≥ 10), our sampling variant is 10% better than OR Tools. The
advantage of our model becomes more significant as the scale of the instance
grows, even when using our greedy variant. For instances involving 500 or more
cities, OR Tools becomes impractical even when allowing up to 1800 s per
instance, while our greedy variant still outputs solutions with good quality in a
matter of seconds. In general, the computing time of our model increases linearly
with the scale of the mTSP instance, as shown in Fig. 3.

Fig. 3. Planning time for the different solvers from n = 20 to n = 140 while m = 5. The computing time of our model only increases linearly with respect to the number of cities, while the computing time of OR Tools (without local search) increases exponentially.
Second, we notice that DAN’s structure helps achieve better agent collabo-
ration than the other decentralized dRL methods, thus yielding better overall
results. In particular, we note that simply applying a generic encoder-decoder
structure (Kool et al.’s model) only provides mediocre to low-quality solutions
across mTSP instances. This supports our claim that DAN’s extended network
structure is key in yielding drastically improved performance over this original
work. More generally, we note that all four dRL methods tested in our work (i.e.,
DAN, Hu et al.'s model, Park et al.'s model, and Kool et al.'s model) contain
network structures similar to our city encoder and decoder. Therefore, our
overall better performance seems to indicate that our additional agent encoder
and city-agent encoder are responsible for most of DAN's increased performance.
Finally, we notice that DAN can provide near-optimal solutions for larger
teams in medium-scale mTSP instances, while keeping computing times much
lower than non-learning baselines (at least one order of magnitude). As shown in
Tables 1 and 3, although exact solvers cannot find reasonable mTSP solutions
in seconds/minutes like OR Tools and DAN, by running for long enough they
still can guarantee near-optimal solutions. However, DAN’s performance is still
close (within 4%) to LKH3 in problems involving 10 agents.

8 Conclusion

This work introduced DAN, a decentralized attention-based neural network to
solve the MinMax multiple traveling salesman problem (mTSP). We approach
mTSP as a cooperative task and formulate it as a sequential decision
making problem, where agents distributedly construct a collaborative mTSP
solution iteratively and asynchronously. In doing so, our attention-based neu-
ral model allows agents to achieve implicit coordination to solve the mTSP
214 Y. Cao et al.

instance together, in a fully decentralized manner. Through our results, we
showed that our model exhibits excellent performance for small- to large-scale
mTSP instances, involving 50 to 1000 cities and 5 to 20 agents. Compared to
state-of-the-art conventional baselines, our model achieves better performance
both in terms of solution quality and computing time in large-scale mTSP
instances, while achieving comparable performance in small-scale mTSP
instances. We note that this feature may make DAN more interesting for robotic
tasks, where mTSP needs to be solved frequently and within seconds or a few
minutes.
We believe that the developments made in the design of DAN can extend to
more general robotic problems where agent allocation/distribution is key, such
as multi-robot patrolling, distributed search/coverage, or collaborative manu-
facturing. We also acknowledge that robotic applications of mTSP may benefit
from the consideration of real-life deployment constraints directly at the core of
planning. We note that constraints such as agent capacity, time windows, and
city demand were added to learning-based TSP solvers with minimal changes in
network structure [13]. We believe that the same should hold true for DAN, and
such developments will be the subject of future works as well.

Acknowledgements. This work was supported by Temasek Laboratories (TL@NUS)
under grants TL/SRP/20/03 and TL/SRP/21/19. We thank colleagues at TL@NUS
and DSO for useful discussions, and Mehul Damani for his help with the initial
manuscript. Detailed comments from anonymous referees contributed to the quality
of this paper.

References
1. Kaempfer, Y., Wolf, L.: Learning the multiple traveling salesmen problem with
permutation invariant pooling networks. arXiv preprint arXiv:1803.09621 (2018)
2. Hu, Y., Yao, Y., Lee, W.S.: A reinforcement learning approach for optimizing
multiple traveling salesman problems over graphs. Knowl.-Based Syst. 204, 106244
(2020)
3. Park, J., Bakhtiyar, S., Park, J.: ScheduleNet: learn to solve multi-agent scheduling
problems with reinforcement learning. arXiv preprint arXiv:2106.03051 (2021)
4. Faigl, J., Kulich, M., Přeučil, L.: Goal assignment using distance cost in multi-robot
exploration. In: 2012 IEEE/RSJ International Conference on Intelligent Robots
and Systems, pp. 3741–3746. IEEE (2012)
5. Oßwald, S., Bennewitz, M., Burgard, W., Stachniss, C.: Speeding-up robot explo-
ration by exploiting background information. IEEE Robot. Autom. Lett. 1(2),
716–723 (2016)
6. Chao, C., Hongbiao, Z., Howie, C., Ji, Z.: TARE: a hierarchical framework for
efficiently exploring complex 3D environments. In: Robotics: Science and Systems
Conference (RSS). Virtual (2021)
7. IBM: CPLEX Optimizer (2018). https://www.ibm.com/analytics/cplex-optimizer
8. Helsgaun, K.: An extension of the Lin-Kernighan-Helsgaun TSP solver for con-
strained traveling salesman and vehicle routing problems. Roskilde University,
Roskilde (2017)

9. Gurobi Optimizer (2020). https://www.gurobi.com


10. Google: OR Tools (2012). https://developers.google.com/optimization/routing/vrp
11. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. arXiv preprint
arXiv:1506.03134 (2015)
12. Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S.: Neural combinatorial opti-
mization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016)
13. Kool, W., Van Hoof, H., Welling, M.: Attention, learn to solve routing problems!
arXiv preprint arXiv:1803.08475 (2018)
14. Vaswani, A., et al.: Attention is all you need. In: Proceedings of NeurIPS, pp.
5998–6008 (2017)
15. Bektas, T.: The multiple traveling salesman problem: an overview of formulations
and solution procedures. Omega 34(3), 209–219 (2006). https://doi.org/10.1016/j.omega.2004.10.004
16. Zhang, K., Yang, Z., Başar, T.: Multi-agent reinforcement learning: a selective
overview of theories and algorithms. arXiv:1911.10635 (2021)
17. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using
deep reinforcement learning. In: Proceedings of AAMAS, pp. 66–83 (2017)
18. Moritz, P., et al.: Ray: a distributed framework for emerging AI applications. In:
Proceedings of OSDI, pp. 561–577 (2018)
19. OpenAI: OpenAI Baselines: ACKTR & A2C (2017). https://openai.com/blog/baselines-acktr-a2c/
20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980
(2017)
21. Lupoaie, V.I., Chili, I.A., Breaban, M.E., Raschip, M.: SOM-guided evolutionary
search for solving MinMax multiple-TSP. arXiv:1907.11910 (2019)
22. Voudouris, C., Tsang, E.P., Alsheddy, A.: Guided local search. In: Gendreau, M.,
Potvin, J.Y. (eds.) Handbook of Metaheuristics. International Series in Operations
Research & Management Science, vol. 146, pp. 321–361. Springer, Boston, MA
(2010). https://doi.org/10.1007/978-1-4419-1665-5_11
23. Reinelt, G.: TSPLIB-A traveling salesman problem library. INFORMS J. Comput.
3(4), 376–384 (1991)
Receding Horizon Control
on the Broadcast of Information
in Stochastic Networks

Thales C. Silva¹(B), Li Shen¹, Xi Yu², and M. Ani Hsieh¹

¹ Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104, USA
scthales@seas.upenn.edu
² Department of Mechanical and Aerospace Engineering, West Virginia University, Morgantown, WV 26501, USA
xi.yu1@mail.wvu.edu

Abstract. This paper focuses on the broadcast of information on robot
networks with stochastic network interconnection topologies. Problematic
communication networks are almost unavoidable in areas where we
atic communication networks are almost unavoidable in areas where we
wish to deploy multi-robotic systems, usually due to a lack of environ-
mental consistency, accessibility, and structure. We tackle this problem
by modeling the broadcast of information in a multi-robot communica-
tion network as a stochastic process with random arrival times, which
can be produced by irregular robot movements, wireless attenuation,
and other environmental factors. Using this model, we provide and ana-
lyze a receding horizon control strategy to control the statistics of the
information broadcast. The resulting strategy compels the robots to re-
direct their communication resources to different neighbors according to
the current propagation process to fulfill global broadcast requirements.
Based on this method, we provide an approach to compute the expected
time to broadcast the message to all nodes. Numerical examples are pro-
vided to illustrate the results.

Keywords: multi-robot system · receding horizon control · broadcast of information

1 Introduction

We are interested in leveraging the spatio-temporal sensing capabilities of
robotic teams to explore or monitor large-scale complex environments (e.g., forests,
oceans, and underground caves and tunnels [3,4,7,27,29]). While individual
robots can each monitor or explore limited regions, a team of them can effi-
ciently cover larger areas. Naturally, individual robots in swarms are expected
to exchange information and make certain decisions based on the information
gathered by itself and received from other robots [9,14]. It is, therefore, essential
for any robot to be able to transmit its gathered data to others in the team.
Receding Horizon Control on the Broadcast of Information 217

As the demand for information transmission can emerge between any pair of
nodes, we use a propagation model to study the efficiency of the connectivity
of the whole team. In such models, robots are represented as nodes of a graph,
and interactions between nodes are represented by edges. At the beginning of a
propagation, the nodes are considered as ‘non-informed’, and will be ‘informed’
by the information propagated through the graph. A faster convergence of all
nodes into the status of ‘informed’ indicates a more efficiently connected network.
Existing works abstract the propagation of information from any individual agent
to the rest of the team as compartmental models [10,11,31], which are also
broadly used in modeling the spread of infectious diseases [2]. The flow of the
nodes from one compartment to another occurs when active links exist between
robots that are ‘informed’ and ‘non-informed’ (i.e., only ‘informed’ robots can
transmit the information to ‘non-informed’ ones).
However, compartmental models often focus on the statistical behavior of the
nodes, overlooking the capabilities of the individual-level decision making that
may impact the network’s performance. Recent works in robotics have noticed
and embraced the potential of controlling individual robots so that the per-
formance of the whole network (i.e., the transmission rates, if a propagation
model is considered,) can be impacted or improved. The topology of networks
was analyzed in [18] and control strategies were proposed to identify and isolate
the nodes that have a higher impact on the propagation process. The design of
time-varying networks where robots leverage their mobility to move into each
other’s communication ranges to form and maintain temporal communication
links has been addressed in [8,12,28]. Information can be propagated in such
networks by relying on time-respecting routing paths formed by a sequence of
chronologically ordered temporal links [28]. Nonetheless, such approaches still
require thorough and perfect knowledge of the network's topology and the robots'
ability to maintain active links, and require robots to execute their motion plans
precisely both in time and space. As such, the resulting networks lack robustness
to uncertainties in link formation timings that may result from errors in motion
plan execution due to localization errors and/or actuation uncertainties.
Existing stochastic graph models focus on random changes in a network’s
topology centered around random creation and removal of edges within a graph.
For example, the Canadian Traveller Problem assumes partially visible graphs
where edges randomly appear and disappear [1,15]. For time-varying networks,
[13,21] assumed that any temporal link may suffer from deviations from its
scheduled timing due to the uncertain arrival time of one robot into another
robot’s communication range. Information routing or propagation planned for
such networks may experience severe delays if subsequent links along a routing
path appear out of chronological order in the presence of link formation uncer-
tainties resulting from uncertainties in robot motions [28]. These challenges are
addressed in [21] for networks with periodically time-varying interconnection
topologies. Control strategies to 'fix' nodes with a higher impact on the whole
network's performance were also proposed in [21], similar in spirit to those
presented in [18].

In the Canadian Traveller Problem and [21], messages are assumed to be
transmitted instantaneously between nodes whenever a link appears. As such,
the resulting strategies are solely based on the existence of an available link
between a certain pair of robots at a certain time point. When the time to
transmit a message is non-trivial, the question goes beyond whether the
information can be transmitted or not and must consider the quality of the
transmission (e.g., safe, fast, confident). Consider a pair of robots flying
through a Poisson forest maintaining a communication link that requires
line-of-sight (LOS). The randomly distributed obstacles may intermittently
disrupt the LOS between the robots, causing randomized delays in the completion
of the information transmission task. In these situations, it becomes difficult
to determine whether a robot is informed or not at a given time point, which
then impacts the robots' ability to plan subsequent actions.
In this paper, we aim to develop receding horizon control schemes that allow
individual robots to re-direct their transmission resources (e.g., time, power)
to different neighboring robots on a communication network with stochastic
links, to guarantee exponentially fast convergence of the information
propagation. We model the completion of a message transmission across a link
as a Poisson point process. The density parameter of this process is determined
by the transmission resources invested by the nodes. Each node carries limited
transmission resources that are distributed among all the links connecting to
it. The completion of the transmission of a message from one node to another is
then modeled as a Markov process. All robots are then able to update the
distribution of their own transmission resources according to their neighbors'
current states and follow the control directions. Therefore, the control
strategy takes into account the transmission resource capabilities of the
robots and acts on them to fulfill performance requirements. Such a set of
resources changes according to the application: for example, in the Poisson
forest we might require that the robots stay close together at the cost of
decreasing the covered area; in a situation in which the robots need to move
back and forth to carry information (e.g., surveillance around buildings or in
tunnels), we might require the robots to increase their frequency of visits to
each other.
The paper is organized as follows. Sections 2.1 and 2.2 provide the-
oretical backgrounds of the graph structure and the propagation model. The
problem is formally stated at the end of Sect. 2. Section 3 introduces our app-
roach of developing the receding horizon control scheme for the stochastic net-
work. Section 4 validates our proposed control schemes with numerical examples.
Section 5 concludes the paper and proposes future directions.

2 Background and Problem Formulation


2.1 Graph Theory
We represent the communication network as a graph G(V, E), in which V =
{1, ..., n} is the node set and E ⊂ V ×V the edge set. Each element of the edge set
represents a directed link from i to j if (i, j) ∈ E. The set constituted by the nodes
that have the ith node as a child node is denoted by Ni = {j ∈ V : (j, i) ∈ E}
and is called the neighborhood of the ith node. In accordance with this nomenclature,
we call a node j in Ni a neighbor of node i. The adjacency matrix A = [aij ]
associated with the graph G is an object that maps the edge set into a matrix
as,

a_{ij} = \begin{cases} 0, & \text{if } i = j \text{ or } (j,i) \notin E, \\ \omega_{ij}(t), & \text{if } (j,i) \in E, \end{cases}    (1)

where ωij (t) > 0 maps the transmission rate of information from node j to node i, which we will examine in more detail below.
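As a concrete illustration, below is a minimal sketch of this construction in Python with NumPy (the edge list, rate values, and function name are our own illustrative choices, not part of the paper):

import numpy as np

def adjacency(n, edges, omega):
    """Build A = [a_ij] per Eq. (1): a_ij = omega_ij(t) if (j, i) in E, else 0.

    n     -- number of nodes (0-indexed here; the paper indexes from 1)
    edges -- iterable of directed links (j, i), meaning j transmits to i
    omega -- dict mapping (j, i) -> current transmission rate omega_ij(t) > 0
    """
    A = np.zeros((n, n))
    for (j, i) in edges:
        if i != j:                    # a_ii = 0 by definition
            A[i, j] = omega[(j, i)]   # row i collects the rates of i's neighbors
    return A

# Hypothetical example: three nodes in a chain 0 -> 1 -> 2
A = adjacency(3, [(0, 1), (1, 2)], {(0, 1): 0.5, (1, 2): 0.8})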

2.2 Information Propagation Model

We study the stochastic broadcast of information in a networked system represented
by a graph G(V, E), in which the edges represent communication channels
and the nodes represent the robots. We assume that each robot has limited
communication capabilities; in such scenarios the system may have to operate with
intermittent network connectivity due to wireless signal attenuation resulting
from obstacle occlusions and other environmental factors [23]. We borrow an
epidemiological model from Van Mieghem and Cator [24], who defined an exact
2^n-state joint Markov chain to study the stochastic process of propagation of
diseases using a Susceptible-Infected (SI) representation. We investigate a similar
model to represent the broadcast of information. Naturally, using 2^n states to
represent the broadcast of information can quickly lead to computational infeasibility.
To circumvent this problem, we apply a moment closure to the model
[5,20,25], which approximates the joint Markov chain using
a lower-dimensional representation. Afterward, we compare the performance of
both representations.
In an SI model, each individual can be either “susceptible” or “infected” by a
disease, and a susceptible agent can only become infected if there is at least one
infected agent in its neighborhood. We adopt this idea to model the transmission
of information in a robot networked system, such that each agent can be either
informed or non-informed about a message to be transmitted over the network.
We assume that the transmission process between any given pair of agents (j, i)
follows a stochastic jump process with Poisson distribution and transmission rate
ωij (t) [16]. This type of modeling can represent scenarios where the information is
carried by mobile robots with short-range and limited communication capabilities,
which usually leads to intermittent interaction between agents, such as in [23,30].
To be able to affect the transmission statistics over the whole network while
satisfying a team's performance criterion, we assume that the robots have access
to the states of their neighborhoods at predefined sampling times; each node can
then carry out the computation with that information. Even though this
assumption can be conservative in general, there are scenarios in which it is
possible to have a longer range, but a low transmission rate, to perform general
and simple coordination, while agents need to perform locally more demanding
data transactions, for example in data fusion tasks. Also, estimation techniques
might help to alleviate such an assumption.
We model the current state of a robot as an indicator random variable Xi (t) ∈
{0, 1}, such that Xi (t) = 1 if the ith robot is informed at time t, and Xi (t) = 0
otherwise. Note that Xi (t) is a Bernoulli random variable, therefore the evolution
of the network expectation of broadcast of information can be determined by the
evolution of the probability Pr{Xi (t) = 1 : Xj (0) = 1} for any pair (i, j), over
time t > 0. The joint state of the network maps the set of agents that hold the
information in a given instant. Since Xi (t) ∈ {0, 1} for each i ∈ V, it is clear that
the joint states of the network evolve in a 2n state-space. Inspired by [19,24] in
epidemiological studies, we define a time-dependent state-space vector for the
whole network as

Y (t) = [Y0 (t) Y1 (t) · · · Y2n−1 ] , with


 n
1, i = k=1 Xk 2k−1 ,
Yi (t) =
0, otherwise,
n
for all Xk ∈ {0, 1}. Observe that Y (t) ∈ R2 corresponds to a binary-valued
vector that maps the current state of each node into a specific element Yi (t)
(i.e., we have a bijective map between each possible state of the network and the
vector Y (t)). Figure 1 illustrates the relation of all possible states of the network
and Y (t) for a network with three nodes. In addition, note that the elements of
Y (t) also are Bernoulli random variables, hence E[Yi (t)] = Pr{Yi (t) = 1}.
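As a small sanity check of this bijection, the joint-state index can be computed directly from the node states (a Python sketch under the paper's binary encoding; the function name is ours):

def joint_state_index(X):
    """Map node states X = (X_1, ..., X_n), X_k in {0, 1}, to the index
    i = sum_{k=1}^{n} X_k * 2^(k-1) of the unique nonzero element Y_i(t)."""
    return sum(x << k for k, x in enumerate(X))

# Three-node network as in Fig. 1: robots 1 and 3 informed, robot 2 not
assert joint_state_index((1, 0, 1)) == 5   # i.e., Y_5(t) = 1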

Fig. 1. Representation of all possible states of a network with three agents. The arrows
represent possible switching between states.

One could try to model the states of the whole network directly as X(t) =
[X1 (t) X2 (t) · · · Xn (t)], since that would reduce the dimension of the representation
from R^{2^n} to R^n. However, as shown in [19], the evolution of E[X(t)] repre-
sents only a marginal probability distribution of the states of the network, and this
information is not enough to describe the evolution of Pr{Xi (t) = 1 : Xj (0) = 1}
for any t > 0. Therefore, if we are interested in computing the expectation of
future states given a current distribution, using X(t) alone is not enough. Alter-
native methods to reduce the dimension of the state-space include mean-field
type approximations and moment-closure methods (see [19] and references
therein). Typically those methods are valid under the assumption of large
numbers and have reduced precision about the underlying stochastic process.
Indeed, as pointed out in [25], in general, mean-field approximations do not
consistently provide either an upper- or a lower-bound for general spreading
processes.
Since Y(t) is an exact characterization of the states of the network, an
expression for Pr{Yi (t) = 1 : Y(0)} can be obtained using a continuous-time
Markov chain with 2^n states, determined by the following infinitesimal generator
Q(t) = [q_{ij}(t)] ∈ R^{2^n × 2^n} [24],

q_{ij}(t) = \begin{cases} \sum_{k=1}^{n} \omega_{mk} X_k, & \text{for } m = 1, \dots, n,\ X_m = 1,\ \text{if } i = j - 2^{m-1}, \\ -\sum_{k \neq j} q_{kj}, & \text{if } i = j, \\ 0, & \text{otherwise}, \end{cases}    (2)

for all Xk ∈ {0, 1}, in which the corresponding Markov chain evolves according to
the following Kolmogorov forward equation

\frac{dv(t)}{dt} = v(t)\, Q(t),    (3)

where the state vector is defined as v(t) = [v_0(t) · · · v_{2^n-1}(t)] with elements
v_i(t) = Pr{Y_i(t) = 1} and satisfies \sum_i v_i(t) = 1 for all t > 0.
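For small n, (2)-(3) can be integrated numerically along the following lines (a sketch in Python with NumPy/SciPy; it follows the SI jump structure described above with a row-generator convention, and all helper names are ours, not the paper's implementation):

import numpy as np
from scipy.integrate import solve_ivp

def si_generator(A):
    """Assemble the 2^n x 2^n generator of the SI broadcast chain.
    A[i, j] = omega_ij: the rate at which an informed node j informs node i."""
    n = A.shape[0]
    Q = np.zeros((2 ** n, 2 ** n))
    for s in range(2 ** n):                 # s encodes the informed set
        X = [(s >> k) & 1 for k in range(n)]
        for m in range(n):
            if not X[m]:                    # node m still non-informed
                rate = sum(A[m, k] * X[k] for k in range(n))
                Q[s, s + (1 << m)] += rate  # jump to the state with X_m = 1
                Q[s, s] -= rate             # diagonal: rows sum to zero
    return Q

def forward_equation(A, v0, t_final):
    """Integrate dv/dt = v Q, Eq. (3), from an initial distribution v0."""
    Q = si_generator(A)
    sol = solve_ivp(lambda _, v: v @ Q, (0.0, t_final), np.asarray(v0, float))
    return sol.y[:, -1]                     # distribution over joint states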
In this paper, we are interested in enforcing a desired evolution for the propa-
gation of information on the network (i.e., Pr{Xi (t) → 1 : Xj (0) = 1}) exponen-
tially fast, by closing the loop of the network with a feedback control strategy,
which chooses the control action from a finite set of possible actions and mini-
mizes a cost of available resources. Specifically, the problem investigated can be
stated as:

Problem 1. Given a network with stochastic propagation of information represented
by the topology in (1), define a feedback control that actuates on the
transmission rates ωij (t) such that the evolution of the probability of all nodes in
the network becoming informed follows the desired convergence criteria.

3 Methodology
3.1 Control Strategy
To focus on the transmission of information over the network and understand the
ability to shape the broadcast of information, we make the following assumption:

Assumption 1. The communication graph has a directed spanning tree with a root
corresponding to the robot that is the origin of the information to be transmitted.
This assumption implies that each edge of the spanning tree has a positive
transmission rate (i.e., the information is propagated with some positive base-
line transmission). In this scenario, it is clear that the broadcast of information
will eventually reach every robot on the network with probability 1. Hence, the
222 T. C. Silva et al.

main objective of our control problem can be viewed as an attempt to shape the
statistics of the broadcast of information on the network to satisfy a given con-
vergence criterion. It is worth noticing that most works on contact processes are
concerned with regular lattices or mean-field models [17,18]. Also, usually, the
desired convergence rate is enforced through static strategies, targeting central
nodes and using heuristics [18,21]. With our control strategy, the network can
adapt to the stochastic realization of the propagation of information and only
expend resources to actuate during realizations that might interfere with the
team’s desired behavior.
With the broadcast model in (3), we control the transmission rates ωij (t)
by applying allowable actions based on the current state of the network and on
predicted future states, sampled at predefined times, following a sampled-data
predictive controller [22]. The controller we study in this paper is defined by the
following optimization problem:

\min_{\omega_{ij} \in U}\ C(X_i(t))    (4)
\text{s.t.}\quad 1 - E[X_i(t + \Delta t) : X(t)] - (1 - X_i(t))\, e^{-r\Delta t} \le 0,

where E[X_i(t + Δt) : X(t)] is the prediction stage, which is computed using (3)
(later we utilize a moment-closure to compute its upper-bounds), C(Xi (t)) is
the cost of applying the control action ωij on the transmission between nodes
i and j, Δt > 0 defines a window of prediction from the current instant t,
the positive constant r defines a desired exponential convergence rate, and U
represents the set of admissible control actions. In practice, the set of admissible
control actions should represent the operations that robots can perform to adjust
their transmission rate among neighbors. For example, if the team of robots
operates in geographically disjoint locations and the robots travel back and forth
regularly to neighbors' sites carrying information with them, one could define an
increase in the transmission rate as an increase in the frequency with which the agents visit
each other. Another example is in the case of periodic rendezvous, in which
admissible actions could map the range of possible changes in the agent’s period.
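To make the sampled-data loop concrete, here is a minimal sketch of one controller step, assuming a finite candidate set U and an external predictor for E[X_i(t + Δt) : X(t)] (e.g., the CTMC or moment-closure models); predict, cost, and auxiliary_control are illustrative placeholders:

import math

def receding_horizon_step(U, X, predict, cost, r, dt, auxiliary_control):
    """One step of controller (4): among admissible rate assignments in U,
    pick the cheapest one whose predicted informed-probabilities satisfy
    1 - E[X_i(t+dt) : X(t)] - (1 - X_i(t)) * exp(-r dt) <= 0 for every i."""
    feasible = []
    for omega in U:
        EX = predict(omega, X, dt)
        if all(1.0 - EX[i] - (1.0 - X[i]) * math.exp(-r * dt) <= 1e-9
               for i in range(len(X))):
            feasible.append((cost(omega, X), omega))
    if not feasible:                 # fall back to the assumed auxiliary law
        return auxiliary_control(X)
    return min(feasible, key=lambda pair: pair[0])[1]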
We draw attention to the fact that ensuring the constraints in (4) plays a
central role in guaranteeing the fulfillment of the desired performance. We notice
that for the general case it might be difficult to guarantee feasibility of problem
(4). Therefore, in this work we assume that there is an auxiliary control law that
provides feasibility and that the controller has access to it. This assumption is
not uncommon in the nonlinear model predictive literature (see [6] and references
therein), typically such an auxiliary controller is provided ad hoc. Nonetheless,
in the problems we are interested in, we notice that in several applications it is
possible to find feasible, but high-priced, solutions for the problem. For example,
in tasks where robots carry the message from site to site intermittently we might
satisfy the constraints by bringing all robots closer to each other; in the case of
robots performing tasks on cyclic paths we could make all robots cover overlapping
paths. Hence, even though it is necessary to provide an auxiliary control strategy
according to each task, we notice that in the problem of broadcasting information
simply defining a feasible strategy is not necessarily interesting. Therefore, we


seek to apply admissible control actions that do not lead to trivial connectivity
on the network and preserve a satisfactory performance.

3.2 Expected Time to Broadcast Information

The feasibility of the optimization in (4) guarantees that the information is
transmitted throughout the network exponentially fast. However, a remaining
open question is how we can infer how long it will take until all robots in the
network receive the message of interest. In other words, what is the hitting
time for the stochastic process to reach the stationary state? In this section, we
analyze this question and provide the expectation of the propagation time.
Given that we have the exact representation of the Markov chain that
describes the underlying stochastic process we could, in principle, compute the
hitting times directly. However, since we solve the receding optimization problem
(4), our Markov chain is non-homogeneous and its transition rates might change
according to each realization. Therefore, we take advantage of the constraint
in problem (4), relying on the previously discussed assumption that a feasible
solution can always be found, and compute the expected time to broadcast
information based on Watkins et al. (the method shown below is similar to
Theorem 5 in [26]).
Summing the constraint in (4) over the n nodes gives

n - E[S(X(t + \Delta t))] \le (n - S(X(t)))\, e^{-r\Delta t},

where S(X(t)) = \sum_{i=1}^{n} X_i(t) is the number of informed robots at time t. The
feasibility of the problem ensures that

n - E[S(X(t))] \le (n - S(X(0)))\, e^{-rt}.    (5)

Define the broadcast time as t_b = \inf\{t \ge 0 : S(X(t)) = n\}; then its expectation
can be computed as

E[t_b : X(0)] = \int_0^{\infty} \big(1 - \Pr\{t_b \le \sigma\}\big)\, d\sigma.    (6)

Note that Pr{t_b ≤ t} = Pr{S(X(t)) = n}; therefore the equivalence 1 − Pr{t_b ≤
t} = Pr{S(X(t)) < n} holds, where Pr{S(X(t)) < n} is the probability of
having any robot non-informed at time t. We use this equality and (5) to compute
the expected time for the broadcast of information. Note that the right-hand
side of (5) bounds the expected number of non-informed robots, therefore

\Pr\{S(X(t)) < n : X(0)\} \le \min\{1,\ (n - S(X(0)))\, e^{-rt}\},
hence, (6) yields

E[t_b : X(0)] \le \int_0^{\infty} \min\{1,\ (n - S(X(0)))\, e^{-r \lfloor \sigma/\Delta t \rfloor \Delta t}\}\, d\sigma
\le \tau_1 + \sum_{i = \tau_1/\Delta t}^{\infty} (n - S(X(0)))\, \Delta t\, e^{-r i \Delta t}
\le \tau_1 + \frac{e^{-r\tau_1}}{1 - e^{-r\Delta t}}\, (n - S(X(0)))\, \Delta t,    (7)

with \tau_1 = \inf\{t \ge 0 : (n - S(X(0)))\, e^{-r \lfloor t/\Delta t \rfloor \Delta t} \le 1\}.
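The bound (7) is straightforward to evaluate numerically; a short sketch (Python), illustrated with the 15-node scenario parameters of Sect. 4 (r = 0.22, Δt = 1) and one robot informed initially:

import math

def broadcast_time_bound(n, S0, r, dt):
    """Evaluate the upper bound (7) on E[t_b : X(0)], with S0 = S(X(0))."""
    k = 0                            # find tau_1 by searching the sampling grid
    while (n - S0) * math.exp(-r * k * dt) > 1.0:
        k += 1
    tau1 = k * dt
    return tau1 + (n - S0) * dt * math.exp(-r * tau1) / (1.0 - math.exp(-r * dt))

print(broadcast_time_bound(15, 1, 0.22, 1.0))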

3.3 Robust Moment Closure


In this section we consider a moment closure technique to solve Problem 1 with a
computationally amenable method. We reiterate that, while (3) is a theoretically
exact representation of our problem, it is not computationally tractable even for
medium-size networks due to the need for 2^n states to capture
the evolution of the probabilities in the network. The technique considered in
this section was introduced in [25] for the problem of continuous-time epidemics.
The approach is based on the Fréchet inequalities, defined as

F_{lwr}(\Pr\{A\}, \Pr\{B\}) = \max\{0,\ \Pr\{A\} + \Pr\{B\} - 1\},    (8)
F_{upr}(\Pr\{A\}, \Pr\{B\}) = \min\{\Pr\{A\},\ \Pr\{B\}\},    (9)

which are lower- and upper-bounds for the joint probability Pr{A, B}, respectively.
Their main advantage over mean-field approximations is that they give meaningful
bounds on the evolution of the real mean of the underlying stochastic
process (see [25] for a detailed discussion).
Initially, we need to reformulate our representation to consider the marginal
probabilities of each robot receiving the information, since we are interested in
drawing bounds on their evolution. For our problem, the marginal probabilities
can be written as the expectation of a robot becoming informed as a function of
its neighbors as,
\frac{dE[X_i(t)]}{dt} = \sum_{j \in N_i} E[\bar{X}_i(t) X_j(t)]\, \omega_{ij}(t),    (10)

for all i ∈ V, where X̄i (t) is the complement of Xi (t) (i.e., X̄i (t) = 1 if Xi (t) = 0).
This equation maps an infinitesimal probability of Xi receiving the information
given that some neighbors Xj , j ∈ Ni , have the message. Notice that (10)
is not closed, therefore we cannot solve it for E[Xi (t)] without equations for
E[X̄i (t)Xj (t)]. Central to the result in [25] is to bound the joint probabilities of
the form E[X̄i Xj ]. Next, we show the optimal approximations for the expecta-
tions we employ, which can be expressed as closed-form functions.
Lemma 1 (Watkins et al. [25]). Let \underline{x}(0) = X(0) = \bar{x}(0), and define S(X) as the
number of informed nodes in X. It holds that

E[S(X(t)) : X(0)] \le \sum_{i \in V} \min\{\bar{x}_i^{\mathrm{inf}}(t),\ 1 - \underline{x}_i^{\mathrm{non}}(t)\},    (11)

E[S(X(t)) : X(0)] \ge \sum_{i \in V} \max\{\underline{x}_i^{\mathrm{inf}}(t),\ 1 - \bar{x}_i^{\mathrm{non}}(t)\},    (12)

where the upper and lower solutions \bar{x}_i^{\ell}(t) and \underline{x}_i^{\ell}(t), for each \ell \in L = \{\mathrm{inf}, \mathrm{non}\},
are the solutions of the following ordinary differential equations,

\dot{\bar{x}}_i^{\ell} = \sum_{j \in N_i} \sum_{\ell' \in L} F_{upr}\big(B_{\bar{x}_i^{\ell}}(\bar{x}_i^{\ell'}),\ \bar{x}_j^{\ell'}\big)\, \omega_{ij}(t),    (13)

\dot{\underline{x}}_i^{\ell} = \sum_{j \in N_i} \sum_{\ell' \in L} F_{lwr}\big(\underline{x}_i^{\ell'},\ \underline{x}_j^{\ell'}\big)\, \omega_{ij}(t),    (14)

and B_{\bar{x}_i}(\bar{x}_i) = \min\{1 - \bar{x}_i,\ \bar{x}_i\} is the upper-bounding operator. In addition, the
following inclusion is always satisfied:

E[X_i(t) : X(0)] \in [\underline{x}_i(t),\ \bar{x}_i(t)] \subseteq [0, 1],

for all t ≥ 0.

The result in Lemma 1 allows us to compute bounds on the evolution of the
probabilities of the broadcast of information in the network using 4n differential
equations (Eqs. (13) and (14) for each informed and non-informed node),
instead of 2^n. This enables us to apply the receding horizon strategy in Eq.
(4) to larger networks.
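As an illustration of how such bounding ODEs are integrated in practice, below is a sketch for the SI setting (Python/SciPy). It tracks only the informed compartment, i.e., one upper and one lower solution per node driven by the Fréchet bounds (8)-(9); this is a deliberate simplification under our reading of Lemma 1, not the paper's full 4n-equation implementation:

import numpy as np
from scipy.integrate import solve_ivp

def rmc_bounds(A, x0, t_final):
    """Integrate Frechet-based bounding ODEs for the SI broadcast model.

    Per node i (xu = upper solution, xl = lower solution):
      d(xu_i)/dt = sum_j omega_ij * F_upr(1 - xl_i, xu_j)
      d(xl_i)/dt = sum_j omega_ij * F_lwr(1 - xu_i, xl_j)
    with F_upr(a, b) = min(a, b) and F_lwr(a, b) = max(0, a + b - 1).
    A[i, j] = omega_ij; x0 = X(0) in {0, 1}^n.
    """
    n = len(x0)

    def rhs(_, z):
        xu, xl = z[:n], z[n:]
        dxu = (A * np.minimum(1.0 - xl[:, None], xu[None, :])).sum(axis=1)
        dxl = (A * np.maximum(0.0, xl[None, :] - xu[:, None])).sum(axis=1)
        return np.concatenate([dxu, dxl])

    z0 = np.concatenate([np.asarray(x0, float)] * 2)
    sol = solve_ivp(rhs, (0.0, t_final), z0)
    return sol.y[:n, -1], sol.y[n:, -1]   # bounds bracketing E[X_i(t)]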

4 Numerical Evaluation and Discussions


In this section, we illustrate the approach proposed in Sect. 3 with two sets of
simulation results. The first simulation was carried out on a static topology of
15 robots. The topology of the network was generated using the Erdös-Rényi
model with a connection probability of 0.3. We assumed that each robot has a
uniform total transmission rate μ to distribute among its connected neighbors.
In other words, an informed robot can communicate with its neighbors according
to different transmission rates ωij (t), such that the sum of its transmission
rates over its neighbors is limited by the fixed value μ. This parameter, along
with the connection probability of the Erdös-Rényi model, has a big impact on
the information propagation speed of the whole network. In this simulation, we
set the total transmission rate μ to 2 for better comparison of the results.
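This setup can be reproduced along the following lines (a sketch; the directed-graph variant, even initial split, and seed are our own choices):

import numpy as np

def er_rate_matrix(n, p, mu, seed=0):
    """Erdos-Renyi-style topology where each node's outgoing rates sum to mu.

    Returns A with A[i, j] = omega_ij, the rate node j devotes to informing
    node i, initialized as an even split of j's budget mu over its neighbors
    (the receding horizon controller then re-tunes these rates online)."""
    rng = np.random.default_rng(seed)
    E = rng.random((n, n)) < p            # directed edge (j -> i) indicator
    np.fill_diagonal(E, False)
    A = np.zeros((n, n))
    for j in range(n):
        receivers = np.flatnonzero(E[:, j])
        if receivers.size:
            A[receivers, j] = mu / receivers.size
    return A

A = er_rate_matrix(15, 0.3, 2.0)          # first-scenario parameters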
We applied the control strategy (4) proposed in Sect. 3.1 with the Continuous-Time
Markov Chain (CTMC) representation and with the Robust Moment Closure
(RMC) model, as described in Sect. 3.3. We can choose from a
myriad of cost functions that align with different design purposes. Here we chose
C(X_i(t)) = \sum_{j \in N_i} \omega_{ij}^2. If we were not seeking an exponential propagation, this
objective function, together with the constraints on the transmission resource,
ωij (t) ∈ (0, μ) and \sum_{j \in N_i} ω_{ij} = μ for each i ∈ V, would push for an even
distribution of every robot's resources among its neighbors. We chose Δt = 1 and
r = 0.22 for both the CTMC and RMC models, and simulated our proposed
controller, which guarantees an exponential propagation while minimizing the chosen
cost. The original exact model was also simulated without any control method in
the loop. We ran all three models 10 times each and took their averages as the
performance results. The performance comparison is shown in Fig. 2 and
the numerical results at different stages of the propagation process are in Table 1.
From Fig. 2a, the CTMC model with control outperforms the two other models;
the RMC model with control is slightly slower than the CTMC, but still faster
than the original model without control. The CTMC model was on average 39.6%
faster than the open-loop model in informing all robots, and the RMC model was
27.3% faster than the open-loop model's completion time. Notice that the black
and red lines in Fig. 2b follow similar patterns; the biggest difference between
CTMC and RMC is 3 robots, which occurs around 2.5 s.

Fig. 2. Numerical comparison of the 15-node network's broadcasting speed. In (a),
the bar chart showcases the total number of robots that received the information after
the end of each 1.0833 s interval. The error bars represent the maximum and minimum
informed robots in our 10 different data sets. In (b), the line chart demonstrates the
evolution of the difference in informed robots with respect to time. The red plot shows
the difference between CTMC and the exact model, and the black plot shows the
difference between RMC and the exact model.

We conducted the second simulation in a larger static topology network with
100 nodes. The CTMC model is not appropriate in this setting, as the state
space evolves in R^{2^{100} × 2^{100}}. The topology of the network is again generated
using the Erdös-Rényi model, with a connection probability of 0.05 and total
transmission rate μ = 4. We compared the performance between the open-loop model
Fig. 3. Performance comparison of the 100-node network's broadcasting speed. In (a),
the bar chart depicts the total number of informed robots after every 0.2609 s for both
the RMC and exact models. The error bars indicate the variation in informed robots over
10 different simulations. In (b), the line plot represents the evolution of the difference
in informed robots between RMC and the open-loop method.

Table 1. Average completion time (s) at each propagation stage

Topology   Model      20%     40%     60%     80%     100%
15 nodes   CTMC       2.0571  2.4084  2.7997  3.7673  4.5902
           RMC        2.2174  2.7436  3.3202  4.1394  5.0324
           Open-Loop  2.6085  3.1619  3.7573  5.1964  6.4092
100 nodes  RMC        0.2325  0.2849  0.3131  0.3525  0.9364
           Open-Loop  0.2515  0.3343  0.4603  0.6428  2.9845

and the RMC model in broadcasting speed. The control strategy is implemented
considering Δt = 0.1 and exponential convergence rate r = 2.8. We ran 10
different trials for each model using random seeds and used their mean values as the
numerical results.
Figure 3 shows the performance in information broadcasting for these two
models. Notice that at the beginning, before 0.3 s in Fig. 3a, there was not much
difference between the two methods. This is because the constraint of the optimization
problem was satisfied by the initial trivial solution, and there were
not many actions in U that could further improve the information propagation
during that period. This may explain the fluctuating pattern at the beginning
of Fig. 3b. After that, the control strategy starts to show its advantage; the
average time to broadcast the information to all robots is 0.93 s, on average
218.7% faster than the open-loop network, which has an average of 2.98 s.
Both numerical examples show significantly better performance of our proposed
methods in scenarios where a random node propagates information to the rest of
the team. Notice that such scenarios are not usually seen in real-world applications.
A faster propagation indicates a more efficiently connected network, providing a
tighter upper bound on the information transmission time between any pair of
nodes in the network. The propagation models were therefore chosen to demonstrate
that a more efficient transmission is expected along any shortest path connecting
two nodes in the network. A limitation of this propagation model is that all
shortest paths are considered equally important, while in real-world applications
certain paths deserve higher priority due to actual variations in demand. Our
method can be easily tailored to fit different objective functions.

5 Conclusions and Future Work


We have investigated a control structure for shaping the transmission rates in
networked systems to increase connectivity in networks where the broadcast of
information can be modeled as a stochastic jump process. As a first result, we
have shown that by using an epidemiological model to represent the transmission
of information among robots we can describe the probability of future transmissions
based on the current joint state of the network and its topological structure.
Then, based on this finding, we examined the applicability of a receding horizon
control strategy to account for possible drifts in performance in the current
realization and, subsequently, actuate on transmission rates whenever necessary. This
approach provides efficient adjustments to the network. Finally, numerical
experiments illustrate the effectiveness of the method. Possible
future work includes the extension of the strategy for distributed optimization.
It is also essential to connect the robot dynamics with the changing transition
rates. One possible approach to bridging those two models is through maps
defined a priori and then accounting for them on the admissible set of actions.
Finally, our methodology only accounts for the expectation of the transmission
of information. This can be a weak decision parameter; an extension
considering the variance would be appealing.

Acknowledgements. We gratefully acknowledge the support of ARL DCIST CRA
W911NF-17-2-0181 and Office of Naval Research (ONR) Award No. N00014-22-1-2157.

References
1. Bar-Noy, A., Schieber, B.: The Canadian traveller problem. In: Proceedings of
the Second Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 261–270
(1991)
2. Brauer, F.: Compartmental models in epidemiology. In: Brauer, F., van den Driess-
che, P., Wu, J. (eds.) Mathematical Epidemiology, pp. 19–79. Springer, Heidelberg
(2008). https://doi.org/10.1007/978-3-540-78911-6_2
3. Breitenmoser, A., Schwager, M., Metzger, J.C., Siegwart, R., Rus, D.: Voronoi cov-
erage of non-convex environments with a group of networked robots. In: 2010 IEEE
International Conference on Robotics and Automation, pp. 4982–4989 (2010).
https://doi.org/10.1109/ROBOT.2010.5509696
4. Cassandras, C.G.: Smart cities as cyber-physical social systems. Engineering 2(2),
156–158 (2016)
5. Cator, E., Van Mieghem, P.: Second-order mean-field susceptible-infected-
susceptible epidemic threshold. Phys. Rev. E 85, 056111 (2012). https://doi.org/
10.1103/PhysRevE.85.056111
6. Ellis, M., Durand, H., Christofides, P.D.: A tutorial review of economic model
predictive control methods. J. Process Control 24(8), 1156–1178 (2014)
7. Gusrialdi, A., Hirche, S., Hatanaka, T., Fujita, M.: Voronoi based coverage control
with anisotropic sensors. In: Proceedings of the IEEE American Control Confer-
ence, pp. 736–741 (2008)
8. Hollinger, G.A., Singh, S.: Multirobot coordination with periodic connectivity:
theory and experiments. IEEE Trans. Rob. 28(4), 967–973 (2012)
9. Hsieh, M.A., Kumar, V., Chaimowicz, L.: Decentralized controllers for shape gen-
eration with robotic swarms. Robotica 26(5), 691–701 (2008)
10. Huang, L., Park, K., Lai, Y.C.: Information propagation on modular networks.
Phys. Rev. E 73(3), 035103 (2006)
11. Khelil, A., Becker, C., Tian, J., Rothermel, K.: An epidemic model for information
diffusion in MANETs. In: Proceedings of the 5th ACM International Workshop on
Modeling Analysis and Simulation of Wireless and Mobile Systems, pp. 54–60
(2002)
12. Khodayi-Mehr, R., Kantaros, Y., Zavlanos, M.M.: Distributed state estimation
using intermittently connected robot networks. IEEE Trans. Robot. 35(3), 709–
724 (2019). https://doi.org/10.1109/TRO.2019.2897865
13. Knizhnik, G., Li, P., Yu, X., Hsieh, M.A.: Flow-based control of marine robots in
gyre-like environments. In: 2022 IEEE international Conference on Robotics and
Automation (ICRA). IEEE (2022)
14. Moarref, S., Kress-Gazit, H.: Automated synthesis of decentralized controllers for
robot swarms from high-level temporal logic specifications. Auton. Robot. 44(3),
585–600 (2020)
15. Nikolova, E., Karger, D.R.: Route planning under uncertainty: the Canadian trav-
eller problem. In: Proceedings of the 23rd National Conference on Artificial Intel-
ligence (AAAI 2008), pp. 969–974. AAAI Press (2008)
16. Øksendal, B., Sulem, A.: Applied Stochastic Control of Jump Diffusions, 2nd edn.
Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-69826-5
17. Paré, P.E., Beck, C.L., Nedić, A.: Epidemic processes over time-varying networks.
IEEE Trans. Control Netw. Syst. 5(3), 1322–1334 (2018). https://doi.org/10.1109/
TCNS.2017.2706138
18. Preciado, V.M., Zargham, M., Enyioha, C., Jadbabaie, A., Pappas, G.J.: Opti-
mal resource allocation for network protection against spreading processes. IEEE
Trans. Control Netw. Syst. 1(1), 99–108 (2014). https://doi.org/10.1109/TCNS.
2014.2310911
19. Sahneh, D.F., Scoglio, C., Van Mieghem, P.: Generalized epidemic mean-field
model for spreading processes over multilayer complex networks. IEEE/ACM
Trans. Netw. 21(5), 1609–1620 (2013). https://doi.org/10.1109/TNET.2013.
2239658
20. Schnoerr, D., Sanguinetti, G., Grima, R.: Comparison of different moment-closure
approximations for stochastic chemical kinetics. J. Chem. Phys. 143(18), 185101
(2015). https://doi.org/10.1063/1.4934990
21. Shen, L., Yu, X., Hsieh, M.A.: Topology control of a periodic time-varying com-
munication network with stochastic temporal links. In: 2022 American Control
Conference (ACC), pp. 4211–4217. IEEE (2022)
22. Sopasakis, P., Patrinos, P., Sarimveis, H.: MPC for sampled-data linear systems:
guaranteeing constraint satisfaction in continuous-time. IEEE Trans. Autom. Con-
trol 59(4), 1088–1093 (2014). https://doi.org/10.1109/TAC.2013.2285786
23. Twigg, J., Dagefu, F., Chopra, N., Sadler, B.M.: Robotic parasitic array optimiza-
tion in outdoor environments. In: 2019 IEEE International Symposium on Safety,
Security, and Rescue Robotics (SSRR), pp. 1–8 (2019). https://doi.org/10.1109/
SSRR.2019.8848974
24. Van Mieghem, P., Cator, E.: Epidemics in networks with nodal self-infection and
the epidemic threshold. Phys. Rev. E 86, 016116 (2012). https://doi.org/10.1103/
PhysRevE.86.016116
25. Watkins, N.J., Nowzari, C., Pappas, G.J.: A robust moment closure for general
continuous-time epidemic processes. In: 2018 IEEE Conference on Decision and
Control (CDC), pp. 244–249 (2018). https://doi.org/10.1109/CDC.2018.8618664
26. Watkins, N.J., Nowzari, C., Pappas, G.J.: Robust economic model predictive con-
trol of continuous-time epidemic processes. IEEE Trans. Autom. Control 65(3),
1116–1131 (2020). https://doi.org/10.1109/TAC.2019.2919136
27. Wei, C., Yu, X., Tanner, H.G., Hsieh, M.A.: Synchronous rendezvous for networks
of active drifters in gyre flows. In: Distributed Autonomous Robotic Systems, pp.
413–425. Springer, Cham (2019)
28. Yu, X., Hsieh, M.A.: Synthesis of a time-varying communication network by robot
teams with information propagation guarantees. IEEE Robot. Automat. Lett. 5(2),
1413–1420 (2020)
29. Yu, X., Hsieh, M.A., Wei, C., Tanner, H.G.: Synchronous rendezvous for networks
of marine robots in large scale ocean monitoring. Front. Robot. AI 6, 76 (2019)
30. Yu, X., Saldaña, D., Shishika, D., Hsieh, M.A.: Resilient consensus in robot swarms
with periodic motion and intermittent communication. IEEE Trans. Robot. 38,
110–125 (2021). https://doi.org/10.1109/TRO.2021.3088765
31. Zanette, D.H.: Dynamics of rumor propagation on small-world networks. Phys.
Rev. E 65(4), 041908 (2002)
Adaptation Strategy for a Distributed
Autonomous UAV Formation in Case
of Aircraft Loss

Tagir Muslimov

Ufa State Aviation Technical University, Ufa, Russian Federation
tagir.muslimov@gmail.com

Abstract. Controlling a distributed autonomous unmanned aerial
vehicle (UAV) formation is usually considered in the context of recovering
the connectivity graph should a single UAV agent be lost. At the same
time, little focus is made on how such loss affects the dynamics of the
formation as a system. To compensate for the negative effects, we pro-
pose an adaptation algorithm that reduces the increasing interaction
between the UAV agents that remain in the formation. This algorithm
enables the autonomous system to adjust to the new equilibrium state.
The algorithm has been tested by computer simulation on full nonlinear
UAV models. Simulation results prove the negative effect (the increased
final cruising speed of the formation) to be completely eliminated.

Keywords: UAV Formation Flight · Fault-tolerant Formation Control · Drone Flocking · Formation Reconfiguration

1 Introduction
For a decentralized formation of unmanned aerial vehicles (UAVs), the forma-
tion needs to be capable of reconfiguration should one of the UAVs fail. Some
papers approach this problem by developing algorithms to optimize the recovery
or maintenance of connectivity. For instance, paper [1] investigates reconfigur-
ing the communication topology to minimize the communication cost in case of
the UAVs in the formation losing their links. When considering a UAV forma-
tion as a multiagent system, existing approaches become applicable, see, e.g.,
[2] for an overview. Paper [3] investigates the possibility of reaching consensus
should agents stop communicating. Work [4] covers the effectiveness of a swarm
intelligence in a multiagent system that loses its agents. Article [5] investigates
controlling a multiagent system when links between agents are lost. Paper [6]
proposes a novel approach that relies on a Negative Imaginary (NI) theory-based
control algorithm that keeps a decentralized system of single integrators stable
when several agents are lost. Work [7] covers an agent loss-tolerant multiagent
system. One of the approaches covered therein consists in adding redundant links
to make the group robust to agent loss.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 231–242, 2024.
https://doi.org/10.1007/978-3-031-51497-5_17
Being a complex system, a UAV formation shapes the reconfiguration
algorithms developed for it. The way it does so depends, for instance, on the
mission being planned or on the dynamics of the aircraft themselves. Paper [8]
presents a concept of fault-tolerant formation control for cases of UAV loss. The
control strategy consists in adjusting the objective function that keeps the entire
formation functional. Work [9] covers a situation of UAV loss and proposes an
algorithm that performs priority ranking in order to choose a UAV to replace
the lost one. Article [10] presents an approach where the lost UAV agents are
replaced on a priority basis in a formation caught in a natural disaster. Paper
[11] covers operating a UAV formation in a wildfire. Should a single UAV be
lost, the proposed algorithm reconfigures the formation by adjusting for the lost
links. Numerous papers cover UAV formation control strategies in cases of actu-
ator faults. For instance, work [12] demonstrates the use of UAV formations to
tackle wildfires; it shows that cooperative tracking errors could be bounded even
in case of actuator faults. Article [13] investigates UAV formation stability by
applying a Lyapunov function, assuming both actuator faults and communica-
tion interrupts. Adaptive control enabled the authors to compensate for actuator
faults.
The above overview leads to the conclusion that existing publications
pay little attention to how losing a UAV agent affects the dynamics of the entire
formation. Of particular interest are cases where the system has completely
decentralized control, i.e., each UAV agent relies exclusively on readings of the
adjacent aircraft to steer itself. Yamaguchi et al. have developed one of the most
effective control methods for distributed autonomous systems. Paper [14] was
the first to propose a model that enabled completely decentralized control. The
approach was further enhanced in a series of subsequent papers that showed
using this method to enable a group of autonomous robots to generate a variety
of shapes [15] as well as to encircle a target [16]. Yamaguchi states this approach
is based on formation vectors. The idea is that each agent in a formation has its
own formation vector that is calculated from the adjacent agents’ readings. Work
[17] also showed that when autonomous robotic agents are configured to maintain
different distances to their neighbors, the general stability of the formation can
be maintained by running a special adaptation algorithm.
However, this cycle of papers did not focus on losing an agent from the
formation. Besides, they only covered linear integrators as agents. Paper [18]
proposed an adaptation algorithm for a decentralized system of linear oscillators,
which also covered the case of losing a single oscillator. This approach can be
modified for use in a system of autonomous robots including flying ones/UAVs.
Thus, the novelty hereof lies in that we (i) investigate the feasibility of
implementing an agent loss adaptation algorithm in a completely decentral-
ized/distributed autonomous system, from the standpoint of its dynamics; (ii)
test the proposed algorithm on full nonlinear UAV models engaged in coopera-
tive target tracking in a circular formation.
2 Adaptation Algorithm for a Decentralized UAV Formation

2.1 Distributed Autonomous System Model

A decentralized system usually consists of interchangeable, locally interconnected
systems. The structural similarity of such systems to their biological counterparts
is what makes researchers believe these systems could be flexible and adaptable.
This gives rise to two problems: controlling the entire system with only local
data at hand; and implementing adaptation mechanisms. Of particular interest
is the development of a fault-tolerant algorithm to adapt the UAV formation to
losing a faulty UAV agent when tracking a ground target. This model is based
on the Japanese researcher’s work [18–20], where they developed an adaptation
algorithm for the locomotion model of an oscillatory network of motor neurons
in a biological organism; they also developed a multicylinder engine model [21].
These papers consider locally interconnected oscillators that maintain relative
phase angles to create locomotion rhythms that control walking, running, swim-
ming, and other types of organism locomotion. In a target-tracking decentralized
UAV formation, UAV agents, too, communicate locally and maintain the preset
relative phase angles (phase shifts) of orbiting the target. This is why this model,
when further modified and developed, could become the foundation for an adap-
tive UAV formation reconfiguration algorithm for fault-tolerant control.

Fig. 1. Losing one UAV in a formation due to failure

We further describe a distributed autonomous system per [18]. The system
consists of m interchangeable subsystems {S_i}_{i=1,...,m}. Interchangeability means
that the dynamics of such subsystems can be described with differential equations
of the same form. For analysis, consider a subsystem state written as the
state variable q_i. This could be the phase angle of orbiting the target in a
formation. Consider the vector Q = [q_1, ..., q_m]^⊤. This vector will be further used
as the vector of states. For example, in the case of a UAV formation, the vector Q
can be a vector of the phase angles mentioned earlier.
In a decentralized formation, UAVs do not have one-to-all communications,
i.e., a single UAV agent will not be able to communicate with all other UAV
agents in the formation. In other words, the UAV agent dynamics contains the
state variables of a limited set of adjacent UAV agents. Interaction refers
herein to receiving the state readings of the adjacent UAV agents. Let N(q_i) =
{q_l | q_l is interacting with q_i} be the set of such adjacent UAVs.
Each pair of interacting UAVs, e.g., S_i and S_j, has a linear functional
dependency relationship p_k = p_k(q_i, q_j), where k = 1, ..., n is the ordinal
number of the interaction. The difference between the respective subsystem states is an
example of such a functional dependency. Consider the vector P = [p_1, ..., p_n]^⊤.
In the case of a decentralized UAV formation, this vector can be regarded as
the vector of current phase shifts in circular motion around the target object.
The functional dependencies in the formation can be written as the equation
P = LQ, where L ∈ R^{n×m}.
Control consists in finding subsystem dynamics q̇_i = f_i(N(q_i)) such that the
functional dependencies P converge to the desired values P_d = [p_{d1}, ..., p_{dn}]^⊤. In
the case of a decentralized UAV formation, the vector P_d can be considered as the vector
of desired phase shifts in circular motion around the target object. For brevity,
let us refer to the desired geometric shape of the UAV formation (which depends on
P_d) as the pattern. An important assumption here is that the desired values of
the functional dependencies P_d are predetermined consistently, i.e., there is a vector Q
such that the condition LQ = P_d holds. Write such a condition as (I − LL^{+}) P_d = 0,
where L^{+} is the pseudoinverse of the matrix L, and I is the identity matrix.
UAV formation control applies when the vector P converges to the vector
P_d, i.e., each element of the vector P converges to the correspondingly numbered
element of the vector P_d. However, the vector P_d is not always attainable, for a
variety of reasons. For instance, it will not be attainable if one or more UAVs
in the formation fail. Thus, creating an adaptation (or adaptive reconfiguration)
algorithm boils down to developing a P_d tuning algorithm.
Yuasa and Ito published a series of papers [19,20] where they solved the
above-mentioned control problem of determining the subsystem dynamics
q̇_i = f_i(N(q_i)) for oscillators by proving a number of theorems. However, their
solution was developed with linear agents in mind and would require
substantial modification for use with nonlinear agents. To begin with, the
equation q̇_i = f_i(N(q_i)) can be rewritten in vector form as Q̇ = F(Q), where
F = [f_1, ..., f_m]^⊤. The equation can then be transformed as Ṗ = LQ̇ = LF(Q).

Fig. 2. Reconfiguring a decentralized system by means of an adaptation algorithm.
Figure partly adapted from [18]
Theorem 1 (Yuasa and Ito [19]). The dynamics of P in the equation Ṗ = LQ̇ =
LF(Q) can be described as an autonomous system if and only if F in the equation
Q̇ = F(Q) satisfies the following condition: L \frac{\partial F}{\partial Q} (I - L^{+}L) = 0.

Theorem 2 (Yuasa and Ito [19]). If the dynamics of P can be described
as a gradient system with the potential function V(P), i.e., Ṗ = -\frac{\partial V}{\partial P}, then
F can be written as

F = \left(\frac{\partial V_X(X)}{\partial X}\right)^{\top} + (I - L^{+}L)\, Q',

where X = -L^{\top}P, V_X(X) = V_X(-L^{\top}P) = V(P), and Q' is an arbitrary vector that has the
same dimensionality as Q. Conversely, if F can be described with the equation
above, there exists a potential function V(P) such that the dynamics of P is
described as Ṗ = -\frac{\partial V}{\partial P}.

Suppose that the subsystem dynamics can be described as q̇_i = f̄_i + f̃_i(x_i),
where [f̄_1, ..., f̄_m]^⊤ = F̄ ∈ ker L and [f̃_1, ..., f̃_m]^⊤ = F̃ ∈ (ker L)^⊥. The dynamics
of P is determined only by the summand f̃_i(x_i), since LF̄ = 0. Given this, write
the dynamics of P as Ṗ = LF̃. Paper [18] also proves the following theorem.

Theorem 3 [18]. If f̃_i(x_i) satisfies the following conditions: 1. f̃_i(x_i) = 0 at the
point x_i = x_{di}; 2. \frac{\partial \tilde{f}_i(x_i)}{\partial x_i}\Big|_{x_i = x_{di}} > 0, then X = X_d (here X_d = -L^{\top}P_d) will be
one of the equilibrium states. Besides, if the following also holds: 3. f̃_i(x_i) · (x_i - x_{di}) > 0
for all x_i ≠ x_{di}, then X = X_d becomes the only equilibrium state.

The authors of [19] prove this theorem under the assumption that the dynamics
of P become a gradient system. When all three conditions of this theorem are
satisfied, X = X_d is the global minimum of the potential function
V(X) = \sum_{i=1}^{m} \int \tilde{f}_i(x_i)\, dx_i. Below is a variant of the dynamics satisfying this theorem:

\dot{q}_i = \bar{f}_i + f_i^{+}(x_i - x_{di}),    (1)

where f_i^{+}(·) is an odd increasing function.

2.2 Adaptation Strategy for a Distributed Autonomous System


In paper [18], the authors refer to losing one or more oscillators as an environment
variation. In a UAV formation, it can be referred to as a structural variation in
the formation. Should the formation lose a single UAV as shown in Fig. 1, the
desired phase shifts between the UAVs will be altered, as they are assumed
to depend on the total number of UAVs in the formation. In other words, the
pattern P_d configured before losing a faulty UAV becomes unattainable. The
pattern P_d being unattainable means there is no vector Q that would satisfy the
equation LQ = P_d. From the above equations X_d = −L^⊤ P_d and X = −L^⊤ P,
derive X − X_d = −L^⊤ (P − P_d). Apparently, X − X_d = 0 is necessary but
insufficient for P − P_d = 0. In other words, if P − P_d ≠ 0 and also
P − P_d ∈ ker L^⊤, then X − X_d = 0.
Thus, P_d needs to be adaptively tuned. Such tuning involves estimating the
current state of the system. From an engineering standpoint, it is reasonable to use the
function

E = \frac{1}{2} \sum_{i=1}^{n} (p_i - p_{di})^2.    (2)

Figure 2 shows that the minimum of this function is attained when the desired pattern
P_d is formed; the local interactions between adjacent UAVs are then null. If the
pattern P_d cannot be attained due to losing a UAV from the formation, then
the interaction is ongoing, and a new pattern emerges as a result of balancing
such interactions, see Fig. 2b. To reduce subsystem interaction while preserving
the newly emerged pattern, P_d needs to be tuned, see Fig. 2c.
As noted in [18], the interaction of a UAV agent S_i as a subsystem can be defined
in terms of the equation

I_k^i = -L_{ki} (p_k - p_{dk}).    (3)

Here, k is the ordinal number of the interaction; i.e., if the elements L_{ki} and L_{kj}
of the matrix L are non-zero, the UAVs numbered i and j interact as subsystems
in the formation.
Based on (3), the second term of Eq. (1) can be written as [18]

f_i^{+}(x_i - x_{di}) = f_i^{+}\left(\sum_{k=1}^{n} I_k^i\right).

Using (3), the objective function (2) can be transformed as follows [18]:

E = \frac{1}{4} \sum_{i=1}^{m} \sum_{\substack{k=1 \\ L_{ki} \neq 0}}^{n} \frac{1}{L_{ki}^2} \left(I_k^i\right)^2.

By minimizing the function E, we thus minimize the interactions I_k^i between
the subsystems.
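In code, the interactions (3) and the estimate E are computed directly (a Python/NumPy sketch; L, P, and P_d as defined above, function names ours):

import numpy as np

def interactions(L, P, Pd):
    """I_k^i = -L_ki (p_k - p_dk), Eq. (3); entry [k, i] of the result."""
    return -L * (P - Pd)[:, None]

def estimate_E(L, P, Pd):
    """Objective (2) rewritten through the interactions, as in the text."""
    I = interactions(L, P, Pd)
    mask = L != 0
    return 0.25 * np.sum((I[mask] / L[mask]) ** 2)

For the chain topology used below, each row of L has exactly two nonzero entries, so estimate_E reduces exactly to (2).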
The key point of such adaptive reconfiguration is the separation of dynamics
on the time scale, i.e., the system dynamics needs to be much faster than the
adaptive tuning dynamics. The reason is that adaptation requires estimating the
current state of the system. Before the pattern can be adjusted and a new variation
thereof can be made, the current pattern (variation) needs to be tested for
suitability.
Let us show how this approach could be used on a decentralized UAV formation
when tracking a target. Consider a 4-UAV formation orbiting the target at
a preset radius. Control aims at maintaining a predefined phase shift between
the UAVs numbered i and i + 1. The UAVs are engaged in an “open-chain”
interaction, i.e., 1–2, 2–3, 3–4. This phase shift is denoted as p_{i,i+1} = φ_{i+1} − φ_i,
where φ_∗ is the phase angle of a UAV orbiting the target, with the subscript
for the UAV's ordinal number in the formation. As this is a four-UAV group,
i = 1, 2, 3, 4. Phase shifts should converge to the desired values determined by
the pattern vector P_d = [D_1, D_2, D_3]^⊤ = [2π/3, 9π/13, 18π/29]^⊤. Then we
can obtain
can obtain
⎡ ⎤
−1 1 0 0
P = LQ, L = ⎣ 0 −1 1 0 ⎦ ,
0 0 −1 1
 T  T
where P = p12 p23 p34 and Q = ϕ1 ϕ2 ϕ3 ϕ4 . The kernel of the matrix L
 T
in this case is defined as ker L = 1 1 1 1 , and from the relation X = −LT LQ,
it can be found that
⎡ ⎤ ⎡ ⎤
p12 −ϕ1 + ϕ2
⎢ −p12 + p23 ⎥ ⎢ ϕ1 − 2ϕ2 + ϕ3 ⎥
X=⎢ ⎥ ⎢ ⎥
⎣ −p23 + p34 ⎦ = ⎣ ϕ2 − 2ϕ3 + ϕ4 ⎦ .
−p34 ϕ3 − ϕ 4
The UAV formation dynamics in the case of encircling a target is defined as

\dot{Q} = \begin{bmatrix} \dot{\varphi}_1 \\ \dot{\varphi}_2 \\ \dot{\varphi}_3 \\ \dot{\varphi}_4 \end{bmatrix} = \begin{bmatrix} \omega_1 \\ \omega_2 \\ \omega_3 \\ \omega_4 \end{bmatrix} + \begin{bmatrix} v_f \frac{2}{\pi\rho_1} \arctan(k_\theta (p_{12} - D_1)) \\ v_f \frac{2}{\pi\rho_2} \arctan(k_\theta (-p_{12} + p_{23} + D_1 - D_2)) \\ v_f \frac{2}{\pi\rho_3} \arctan(k_\theta (-p_{23} + p_{34} + D_2 - D_3)) \\ v_f \frac{2}{\pi\rho_4} \arctan(k_\theta (-p_{34} + D_3)) \end{bmatrix},

where ω_∗ and ρ_∗ are the angular velocity of a UAV orbiting the target and the
radius of that orbit, respectively; subscripts match the UAVs' ordinal numbers
in the formation. Applying the nonlinear arctan(·) function limits the minima
and maxima of the speed control signal. It thus avoids the undesirable integral
windup that might cause the UAV formation to go unstable. Here v_f is a positive
constant that determines the maximum added speed; k_θ is a positive constant
that determines the smoothness with which a UAV reaches its orbit.
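A compact kinematic sketch of this closed-loop phase dynamics (Python; a simplification of the full nonlinear UAV models used in Sect. 3, with illustrative parameter values):

import numpy as np

L_CHAIN = np.array([[-1, 1, 0, 0],
                    [0, -1, 1, 0],
                    [0, 0, -1, 1]], dtype=float)  # interactions 1-2, 2-3, 3-4

def phase_rates(phi, D, omega, v_f, k_theta, rho):
    """Q-dot of the encircling dynamics; the arctan arguments are exactly
    the components of x - x_d = -L^T (P - P_d)."""
    p = L_CHAIN @ phi                              # current phase shifts P
    x_err = -L_CHAIN.T @ (p - D)
    return omega + v_f * (2.0 / (np.pi * rho)) * np.arctan(k_theta * x_err)

D = np.array([2 * np.pi / 3, 9 * np.pi / 13, 18 * np.pi / 29])
phi = np.array([0.0, 0.4, 0.9, 1.5])               # illustrative initial phases
print(phase_rates(phi, D, omega=0.12, v_f=1.0, k_theta=1.0, rho=100.0))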
The dynamics of the pattern Ṗ, once the uniform radius ρ is attained, is
written as follows, since ω_1 = ω_2 = ω_3 = ω_4 = v/ρ, where v is the final cruising
linear speed of the formation:

\dot{P} = \begin{bmatrix} \dot{p}_{12} \\ \dot{p}_{23} \\ \dot{p}_{34} \end{bmatrix} = \begin{bmatrix} -v_f \frac{2}{\pi\rho} \arctan(k_\theta (p_{12} - D_1)) + v_f \frac{2}{\pi\rho} \arctan(k_\theta (-p_{12} + p_{23} + D_1 - D_2)) \\ -v_f \frac{2}{\pi\rho} \arctan(k_\theta (-p_{12} + p_{23} + D_1 - D_2)) + v_f \frac{2}{\pi\rho} \arctan(k_\theta (-p_{23} + p_{34} + D_2 - D_3)) \\ -v_f \frac{2}{\pi\rho} \arctan(k_\theta (-p_{23} + p_{34} + D_2 - D_3)) + v_f \frac{2}{\pi\rho} \arctan(k_\theta (-p_{34} + D_3)) \end{bmatrix}.
We propose the following adaptation algorithm:

\dot{p}_{di} = a_s f_{\mathrm{sigm}}(\tau_p (p_i - p_{di})),    (4)

where 0 < τ_p < 1; f_sigm is a sigmoid odd function, e.g., an arctangent or
hyperbolic tangent; a_s is a positive coefficient that sets the maxima and minima
of the function a_s f_sigm; for example, if an arctangent is picked for f_sigm, this
can be a_s = 2/π.
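In discrete time, (4) amounts to a one-line update (a Python sketch with the arctangent choice; τ_p = 0.1 mirrors Sect. 3, a_s = 2/π, and the step size is illustrative):

import numpy as np

def adapt_pattern(Pd, P, tau_p=0.1, dt=0.01, a_s=2.0 / np.pi):
    """One Euler step of (4): slowly steer the desired pattern Pd toward the
    realized shifts P, shrinking the residual inter-UAV interactions."""
    return Pd + dt * a_s * np.arctan(tau_p * (P - Pd))

Run at each sampling instant, this tuning is deliberately much slower than the phase dynamics above, consistent with the time-scale separation discussed in Sect. 2.2.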

3 Simulation Results and Discussion


A simulation model of a 4-UAV formation based on full nonlinear UAV models has been
developed in MATLAB/Simulink. For details on the models, refer to the monograph [22].
The simulation parameters are the same as in paper [23]. Below are the results of
the simulation modeling. Figure 3a shows the trajectories
of a 4-UAV formation in the case of a stationary target. Figure 3b shows changes in
the phase shift errors. Apparently, these errors reach zero over time. There are three
phase shift errors, as the UAVs in the formation use the open-chain interaction.
Stability can be proved as follows, similarly to [18]. Let the function E in (2)
be the Lyapunov function. Then the derivative of E along
the trajectories of the system (4), with an arctangent for f_sigm, is written as

\dot{E} = -\sum_{i=1}^{n} \frac{2}{\pi} (p_i - p_{di}) \arctan(\tau_p (p_i - p_{di})) < 0, \quad \forall (p_i - p_{di}) \neq 0.

Here the quasi-stationarity condition is adopted and taken into account, that
is, p_i ≈ const: the system manages to attain a new state of equilibrium after
losing one or more UAVs because, as mentioned earlier, the adaptation algorithm has
much slower dynamics than the system.
Figure 4a shows how the phase shift errors change and that they do not reach
zero, because losing UAV No. 3 makes the pattern unattainable. However, the graphs
also show the system reaching a new state of equilibrium. There are two phase
shift errors here, as the UAVs in the formation use the open-chain interaction; losing
one UAV makes it a 3-UAV formation.
As noted earlier, the new equilibrium state exists as a result of interaction
balancing. This can be seen in the UAV speeds graph when the formation loses
one unit, see Fig. 4b, where UAV No. 3 is lost. The remaining UAVs fail to attain
the initially preconfigured cruising speed of 12 m/s: these are UAV No. 1 and
UAVs No. 2 and No. 4, which the lost UAV was between. The failure to reach the
required cruising speed is due to the continuous inter-UAV communication that keeps
the system in this new equilibrium state. Cruising at a speed other than the initially
configured value is strongly undesirable for two reasons: (i) it may jeopardize
the mission that the initial cruising speed was meant for; (ii) it may be
resource-suboptimal, as the initial cruising speed was optimized for the UAV's
aerodynamics to maximize fuel efficiency. Thus, reducing interaction within the UAV
formation through adaptive reconfiguration is directly applicable to attaining
the initially configured cruising speed.
We further implemented the adaptation algorithm (4) with the coefficient
τ_p = 0.1. As shown in Fig. 5a, this made every UAV in the formation reach its
preconfigured final cruising speed of 12 m/s. Figure 5b shows the change in the phase
shifts between UAV No. 1 and UAV No. 2, and between UAV No. 2 and UAV No. 4.
The new state of equilibrium is different from that of the 4-UAV formation.
To further illustrate the performance of the adaptation algorithm,
we conducted simulations with other initial positions of the UAV agents in the
formation. This time, we used the following initial positions in the inertial
coordinate system: [230, 750, 100]^⊤, [150, 360, 100]^⊤, [810, 770, 100]^⊤. Here, the order
in which the initial position vectors are listed corresponds to the ordinal numbers
of the UAVs in the formation. The desired relative phase shift angles and other
simulation parameters were chosen to be the same as in the first scenario described
above. Figure 6a shows how the speeds of the UAVs changed in the absence of
the adaptation algorithm. Figure 6b shows how the speeds of the UAVs changed
when the adaptation algorithm was running. From the graphs, we can see the
effect of eliminating the increase in the final cruising speed of the formation.

Fig. 3. (a) 4-UAV formation trajectories; (b) relative phase errors in a 4-UAV formation

Fig. 4. (a) Phase shift errors in the resulting 3-UAV formation after losing 3rd UAV;
(b) UAV speeds in the resulting 3-UAV formation after losing 3rd UAV
Fig. 5. (a) Adaptation algorithm performance in case of adaptively reconfiguring a
UAV formation after losing a single UAV; (b) changes in phase shift during post-loss
adaptive reconfiguration

Fig. 6. (a) UAV speeds in the resulting 3-UAV formation after losing 3rd UAV; (b)
adaptation algorithm performance in case of adaptively reconfiguring a UAV formation
after losing a single UAV

4 Conclusions

The adaptation strategy presented herein eliminates a negative effect that com-
pletely decentralized UAV formations tend to show when losing one or more
UAVs. This negative effect lies in the fact that in such a situation the final
cruising speed of the formation increases. If it is possible to take the UAV out
of formation in advance with fault diagnosis, then the proposed approach will
allow the overall mission to continue while maintaining high safety. The forma-
tion reconfiguration approach shown here could be applicable in more diverse
UAV formation control scenarios; the adaptation algorithms are further modifi-
able. This adaptation scenario also necessitates designing a diagnostics module
for UAVs; further work will present its models.
Acknowledgements. This work was supported by the Ministry of Science and Higher
Education of the Russian Federation (Agreement No. 075-15-2021-1016).

References
1. Wang, G., Luo, H., Hu, X., Ma, H., Yang, S.: Fault-tolerant communication topol-
ogy management based on minimum cost arborescence for leader-follower UAV
formation under communication faults. Int. J. Adv. Robot. Syst. 14, 1–17 (2017)
2. Yang, H., et al.: Fault-tolerant cooperative control of multiagent systems: a survey
of trends and methodologies. IEEE Trans. Ind. Inform. 16, 4–17 (2020)
3. Rezaee, H., Parisini, T., Polycarpou, M.M.: Almost sure resilient consensus under
stochastic interaction: links failure and noisy channels. IEEE Trans. Autom. Con-
trol 66, 5727–5741 (2021)
4. Mann, V., Sivaram, A., Das, L., Venkatasubramanian, V.: Robust and efficient
swarm communication topologies for hostile environments. Swarm Evol. Comput.
62, 100848 (2021)
5. Abel, R.O., Dasgupta, S., Kuhl, J.G.: The relation between redundancy and con-
vergence rate in distributed multi-agent formation control. In: 2008 47th IEEE
Conference on Decision and Control, pp. 3977–3982. IEEE (2008)
6. Hu, J., Lennox, B., Arvin, F.: Robust formation control for networked robotic
systems using negative imaginary dynamics. Automatica 140, 110235 (2022)
7. Alireza Motevallian, S., Yu, C., Anderson, B.D.O.: On the robustness to multiple
agent losses in 2D and 3D formations. Int. J. Robust Nonlinear Control 25, 1654–
1687 (2015)
8. Slim, M., Saied, M., Mazeh, H., Shraim, H., Francis, C.: Fault-tolerant control
design for multirotor UAVs formation flight. Gyroscopy Navig. 12, 166–177 (2021)
9. Huang, S., Teo, R.S.H., Kwan, J.L.P., Liu, W., Dymkou, S.M.: Distributed UAV
loss detection and auto-replacement protocol with guaranteed properties. J. Intell.
Robot. Syst. 93, 303–316 (2019)
10. Aminzadeh, A., Khoshnood, A.: A novel distributed fault-tolerant cooperative con-
trol approach for auto-reconfiguration of a multi agent system in natural disasters.
In: 2021 9th RSI International Conference on Robotics and Mechatronics (ICRoM),
pp. 163–170. IEEE (2021)
11. Ghamry, K.A., Zhang, Y.: Fault-tolerant cooperative control of multiple UAVs for
forest fire detection and tracking mission. In: 2016 3rd Conference on Control and
Fault-Tolerant Systems (SysTol), pp. 133–138 (2016)
12. Yu, Z., Zhang, Y., Jiang, B., Yu, X.: Fault-tolerant time-varying elliptical formation
control of multiple fixed-wing UAVs for cooperative forest fire monitoring. J. Intell.
Robot. Syst. 101, 48 (2021)
13. Han, B., Jiang, J., Yu, C.: Distributed fault-tolerant formation control for multiple
unmanned aerial vehicles under actuator fault and intermittent communication
interrupt. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 235, 1064–1083
(2021)
14. Yamaguchi, H., Arai, T.: Distributed and autonomous control method for gen-
erating shape of multiple mobile robot group. In: Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, vol. 2, pp. 800–807
(1994)
15. Yamaguchi, H., Beni, G.: Distributed autonomous formation control of mobile
robot groups by swarm-based pattern generation. Distrib. Auton. Robot. Syst. 2,
141–155 (1996)

16. Yamaguchi, H.: A distributed motion coordination strategy for multiple nonholo-
nomic mobile robots in cooperative hunting operations. Robot. Auton. Syst. 43,
257–282 (2003)
17. Yamaguchi, H., Arai, T., Beni, G.: A distributed control scheme for multiple robotic
vehicles to make group formations. Robot. Auton. Syst. 36, 125–147 (2001)
18. Ito, S., Yuasa, H., Ito, M., Hosoe, S.: On an adaptation in distributed system
based on a gradient dynamics. In: IEEE SMC’99 Conference Proceedings. 1999
IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.
99CH37028), pp. 200–205. IEEE (1999)
19. Yuasa, H., Ito, M.: Coordination of many oscillators and generation of locomotory
patterns. Biol. Cybern. 63, 177–184 (1990)
20. Ito, M.: Trends and perspective of researches on control system theory. Distrib.
Auton. Robot. Syst. 2, 3–11 (1996)
21. Ito, S., Sasaki, M., Fujita, Y., Yuasa, H.: A circularly coupled oscillator system for
relative phase regulation and its application to timing control of a multicylinder
engine. Commun. Nonlinear Sci. Numer. Simul. 15, 3100–3112 (2010)
22. Beard, R.W., McLain, T.W.: Small Unmanned Aircraft: Theory and Practice.
Princeton University Press, Princeton (2012)
23. Muslimov, T.Z., Munasypov, R.A.: Coordinated UAV standoff tracking of moving
target based on Lyapunov vector fields. In: 2020 International Conference Nonlin-
earity, Information and Robotics (NIR), pp. 1–5. IEEE (2020)
DGORL: Distributed Graph Optimization
Based Relative Localization of Multi-robot
Systems

Ehsan Latif and Ramviyas Parasuraman(B)

School of Computing, University of Georgia, Athens, GA 30602, USA


{ehsan.latif,ramviyas}@uga.edu
https://github.com/herolab-uga/DGORL

Abstract. An optimization problem is at the heart of many estimation, planning,
and optimal control problems in robotics. Several attempts have been made at
model-based multi-robot localization, and a few have formulated the multi-robot
collaborative localization problem as a factor graph problem to be solved through
graph optimization. Here, the optimization objective is to minimize the errors of
the relative location estimates in a distributed manner. Our novel graph-theoretic
approach to solving this problem consists of three major components: (connectivity)
graph formation, expansion through a transition model, and optimization of relative
poses. First, we estimate the relative pose-connectivity graph using the received
signal strength between connected robots, which indicates the relative ranges
between them. Then, we apply a motion model to formulate graph expansion and
optimize the graphs using g2o graph optimization as a distributed solver over
dynamic networks. Finally, we theoretically analyze the algorithm and numerically
validate its optimality and performance through extensive simulations. The results
demonstrate the practicality of the proposed solution compared to a state-of-the-art
algorithm for collaborative localization in multi-robot systems.

Keywords: Multi-Robot · Localization · Graph Theory · Distributed Optimization

1 Introduction
The estimation of relative pose (position and orientation) [1] for multi-robot
systems (MRS) [2] is the foundation for higher-level tasks like collision avoidance,
cooperative transportation, and object manipulation. Motion capture systems [3],
ultra-wideband (UWB) systems with anchors, and RTK-GPS systems are a few examples
of established multi-robot relative positioning solutions, all of which currently
rely on the deployment of physical anchors or base stations in the application
environment. Such schemes, however, are not suitable for large areas or indoor
settings where it is difficult to adapt the infrastructure, which limits the
overall performance and application possibilities of multi-robot systems and makes
their use more difficult and expensive. Furthermore, the extraction of range and
bearing measurements from cameras and visual markers, while another practical
approach, has the drawbacks of a small field of view, short range, occlusion by
nearby objects, and potentially high computational cost.
The use of distance measurements from sensors like radars, Lidars, and UWB to achieve
relative localization, on the other hand, has recently attracted more significant interest.
The multi-robot relative localization (MRL) problem, which refers to detecting and
locating the relative configurations of mobile agents (typically from sparse sensor
data such as relative range or bearing) with respect to other agents or landmarks,
is critical in MRS because it is required for robot teaming and swarming [4, 5].
As a result, many applications are frequently confronted with the relative
localization problem, including formation control, cooperative transportation,
perimeter surveillance, area coverage, and situational awareness. Relative
localization and mapping (i.e., multi-robot SLAM) is an extension of the MRL
problem. While several researchers have proposed novel solutions to the multi-robot
map merging problem using pose graph matching and optimization techniques, these
rely on extensive sensor data inputs (such as point clouds or Lidar scans) [6–8].
Therefore, solving the MRL problem with relative range or bearing measurements in a
distributed manner is desirable and scalable in MRS [9].

Fig. 1. Overview of the configuration space of a multi-robot system, sharing their pose (x_i) and relative range (R_ij) measurements in our DGORL solution.

Distributed optimization is the problem of minimizing a joint objective function
that is the sum of many local objective functions, each corresponding to a
computational node. Many fundamental activities in this area can be modeled as
distributed optimization problems, which have significant implications for
multi-robot systems. Examples include cooperative estimation [10], multi-agent
learning [11], and collaborative motion planning. The distributed optimization
formulation provides a versatile and effective paradigm for creating distributed
algorithms for numerous multi-robot problems.
In consumer electronics, Wi-Fi is one of the most extensively used wireless
technologies for indoor wireless networks. The Received Signal Strength Indicator
(RSSI), measured from an Access Point (AP) or a Wireless Sensor Node (WSN), is
ubiquitously available on such inexpensive commercial devices. The RSSI value can
be used in various applications, including relative localization [12–14],
cooperative control [15, 16], and communication optimization [17].
In this paper, we formulate the MRL problem as a graph optimization problem and
solve it in a distributed manner using a polynomial-time optimizer, the General
Graph Optimization framework (g2o [18]). g2o is an open-source graph-based
framework for handling nonlinear error minimization problems; it optimizes the
global measurement poses given the initial global measurement poses and local
relative pose constraints.
Our solution, termed DGORL, aims to achieve high localization accuracy efficiently
in a distributed fashion. DGORL forms relative position-weighted connectivity
graphs using RSSI as local sensor data, then expands the graphs based on possible
positions at each instant and optimizes them to obtain relative position estimates
for all connected robots. See Fig. 1 for an overview of the configuration space of
DGORL.
The main contributions of this paper are listed below.
1. A novel distributed, efficient, and precise relative localization system based on
shared inertial measurements and RSSI inputs from connected robots.
2. Position-weighted connectivity graph construction and optimization strategy tai-
lored specifically for obtaining reliable relative pose estimates.
3. Theoretical and numerical analysis to evaluate the performance of the algorithm.
4. Validation of the accuracy and efficiency of DGORL compared to a recent collaborative
multi-robot localization algorithm [19], which uses a covariance intersection
technique to address the temporal correlation between received signals.
5. Open-sourcing of the code1 for use and improvement by the research community.

2 Related Work
Most recent solutions to the simultaneous localization and mapping (SLAM) and MRL
problem are based on graph optimization (i.e., all robot poses and landmark positions
compose the graph’s nodes, while each edge encodes a measurement constraint) [18].
A conventional graph formulation, on the other hand, may suffer from unbounded pro-
cessing and memory complexity, which might constantly expand over time. This is
because new robot poses (and new landmarks in the case of feature-based SLAM) are
constantly being added to the graph, resulting in an increase in the number of nodes
over time; additionally, if frequent loop-closing events occur in SLAM, loop-closure
constraints (edges) can significantly increase the graph density [20]. For example, this
could be the case if a service robot works for an extended time inside an office building.
Particularly, graph optimization and factoring have been recently proposed in the
literature to solve different variants of the MRL problem [21–23]. Even though the issue
of reducing the complexity of graph optimization has recently been addressed [24, 25],
to the best of our knowledge, little work has yet explicitly taken into account estimation
consistency (i.e., unbiased estimates and an estimated covariance larger than or
equal to the actual covariance [26]) in the design of graph reduction (sparsification)
schemes. This is a critical flaw: if an estimator is inconsistent, the accuracy of the
derived state estimates is unclear, making the estimator untrustworthy [27]. Moreover,
existing approaches to the localization problem in dynamic environments significantly
trade performance against efficiency.
Most cooperative localization methods entail robot communication and observation,
which makes each step prone to inaccuracy. In a recent attempt at multi-robot
localization, many robots locate themselves jointly using Terrain Relative
Navigation (TRN) [19]. There, the localization estimate based on shared information
fusion is improved by an estimator structure that takes advantage of covariance
intersection (CI) to remove one source of measurement correlation while properly
including others. Similarly, the work in [28] developed a CI-based localization
method with an explicit communication update that simultaneously guarantees
estimation consistency, which increases the robustness of multi-robot cooperative
localization in a dispersed setting. However, the robots maintain a system-wide
estimate in their algorithm, which relative observations can instantly update.
Consequently, this increases the computational complexity at a centralized server
and over-burdens each robot with accurately tracking the dynamics for global
positioning. Unlike these explicit CI modeling methods, our objective is to
localize each robot relative to the other robots in a distributed fashion by
utilizing a polynomial-time graph optimization technique with high accuracy and
efficiency.

1 http://github.com/herolab-uga/DGORL.
Therefore, this paper proposes a graph-based optimization algorithm to address the
gaps mentioned above. To form a graph using shared RSSI information among robots,
we employ a Relative Pose Measurement Graph (RPMG) using observability analysis
[22]. Once we have created a connected, reliable graph, we exploit particle filtering over
the motion model to expand the graph based on mobility constraints. Finally, we con-
struct a k-possible combination of graphs for each robot which needs to be optimized.
We use the polynomial-time graph optimizer (g2 o [18]) for graph optimization with a
distributed constraint optimization.

3 Problem Formulation and the Proposed DGORL Solution


An overview of the typical MRS components is given in this section, along with a
thorough explanation of the proposed distributed graph optimization formulation.

Multi-robot System. Robotic members of an MRS are divided into disjoint
(isolated/disconnected from others) and connected (operating collaboratively)
robots, which can be either ground-based or aerial. A robot that enters the
measuring range of a monitoring robot at a given moment is referred to as an
observed robot; a robot that takes measurements of another robot is an observing
robot. The following properties are assumed for the considered MRS's robotic
members:

– Wireless communication is used to share information across the MRS.


– The observing robot can extract a neighboring robot's relative range (e.g., through
RSSI measurements) and can uniquely identify each robot in its field of view.
– While the observed robots have limited sensory and computational capabilities, the
observing robot may use high-precision sensors to carry out its self-localization.
– We restrict the movement of the robots within a two-dimensional planar space.

Assume that at a given time t, a team of n ∈ N connected robots forms a weighted
undirected graph, denoted by G = (V, E, A), of order n, consisting of a vertex set
V = {v_1, ..., v_n}, an undirected edge set E ⊆ V × V, where each edge encodes the
range between connected robots, and an adjacency matrix A = {a_ij}_{n×n} with
non-negative elements a_ij > 0 if (v_i, v_j) ∈ E and a_ij = 0 otherwise. An
undirected edge e_ij in the graph G is denoted by the unordered pair of robots
(v_i, v_j), which means that robots v_i and v_j can exchange information with each
other.

Here, we only consider undirected graphs, indicating that the robots' communications
are all bidirectional. Then, the connection weight between robots v_i and v_j in
graph G satisfies a_ij = a_ji > 0 if they are connected; otherwise, a_ij = a_ji = 0.
Without loss of generality, a_ii = 0, indicating that there are no self-connections
in the graph. The degree of robot v_i is defined by $d(v_i) = \sum_{j=1, j \neq i}^{n} a_{ij}$
for i = 1, 2, ..., n. The Laplacian matrix of the graph G is defined as L_n = D − A,
where D is the diagonal matrix D = diag{d(v_1), d(v_2), ..., d(v_n)}. Since the
graph G is undirected, L_n is symmetric and positive semi-definite. A path between
robots v_i and v_j in a graph G is a sequence of edges (v_i, v_{i1}), (v_{i1}, v_{i2}),
..., (v_{ik}, v_j) in the graph with distinct robots v_{il} ∈ V. An undirected graph
G is connected if a path exists between any pair of distinct robots v_i and v_j
(i, j = 1, ..., n).
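Since the algorithm repeatedly tests connectivity on such graphs, the following minimal NumPy sketch shows one standard way to compute the Laplacian L_n = D − A and test connectivity via its second-smallest eigenvalue (the Fiedler value); the helper names are our own, and the spectral test is equivalent to the path-based definition above.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A from a symmetric adjacency matrix A."""
    D = np.diag(A.sum(axis=1))
    return D - A

def is_connected(A, tol=1e-9):
    """An undirected graph is connected iff the second-smallest Laplacian
    eigenvalue (the Fiedler value) is strictly positive."""
    eigvals = np.linalg.eigvalsh(laplacian(A))  # ascending order
    return eigvals[1] > tol

# Example: a 3-robot chain v1 -- v2 -- v3 with unit weights
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
print(is_connected(A))  # True
```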

Fig. 2. The Distributed Graph Optimization for Relative Localization of Multi-Robot
Systems (DGORL) system architecture, showing the input (the connected robots'
received motion information and RSSI) and the output (relative pose estimates) for
robot i. The robot generates and expands graphs based on observability for
optimization, along with the constraints set up through local and received inertial
measurements, which are further fed into the optimization function. The optimized
graph yields relative pose estimates for all connected robots.

In this paper, we formulate the multi-robot system as a graph problem to find an
optimized solution for relative localization based on inequality constraints with a
time-varying domain. Furthermore, we segment the solution into three components,
which can be solved simultaneously at every iteration over distributed constraints:
1) graph formation, 2) expansion through transition, and 3) optimization. See
Fig. 2 for the sequential procedure of DGORL.

3.1 Graph Formation

An undirected graph G = (V, E), where (i, j) ∈ E if and only if (i, j) and (j, i)
are both in E, is called the underlying undirected graph of G. G is connected if its
underlying undirected graph is connected. An isolated, connected subgraph of G is
called a component. Suppose that an undirected graph G has n nodes and m edges.
The incidence matrix of G, denoted by A(G), is an m × n matrix whose m rows and n
columns correspond to the m edges and n nodes of G, respectively. Each element
a_{ij} of A(G) is defined as

$$a_{ij} = \begin{cases} 1 & \text{if node } v_j \text{ is the tail/head of edge } e_i, \\ 0 & \text{otherwise.} \end{cases}$$
The following lemma describes the relation between the rank of A(G) and the con-
nectivity of G.
Lemma 1. Let G be an undirected graph with n nodes and A(G) be its incidence
matrix. If G is composed of λ components, the rank of A(G) satisfies

rank(A(G)) = n − λ ≤ n − 1

The overall algorithm is composed of two main steps, depicted in Algorithms 1 and 2,
respectively. The first step is to generate the Estimated Relative Position
Measurement Graph (ERPMG), G_E = (V_E, E_E, A_E), based on the Relative Position
Measurement Graph (RPMG), G = (V, E, A), which can be built up using the range
information received from the connected robots of an n-robot system and which
describes the relative position measurements among robots. In this step, the node
set V_E of the ERPMG is initialized to be the same as that of the RPMG, and the edge
set E_E of the ERPMG is initialized to be empty. Then, each relative position
measurement edge e_ij of E is added to the edge set E_E if it satisfies the motion
constraint $J_{F_j} V_{B_j} = 0$. A concise description of this step is given in
Algorithm 1.

Algorithm 1: ERPMG Formation

1 Input: State x, input u, RPMG G = (V, E, A);
2 Output: ERPMG G_E;
3 Initialize ERPMG G_E = (V_E, E_E) with V_E = V, E_E = ∅;
4 for each node v_i ∈ V do
5   for each edge e_ij ∈ E do
6     if J_{F_j} V_{B_j} = 0 then
7       add edge e_ij to E_E;
8     end
9   end
10 end
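A minimal Python sketch of Algorithm 1 is given below. The predicate `satisfies_motion_constraint` is a hypothetical stand-in for the J_{F_j} V_{B_j} = 0 test above, whose concrete form comes from the observability analysis of [22].

```python
def form_erpmg(V, E, satisfies_motion_constraint):
    """Sketch of Algorithm 1: copy the RPMG node set and keep only the
    relative-position edges that pass the motion-constraint test.
    `satisfies_motion_constraint(i, j)` is a hypothetical predicate
    standing in for the J_{F_j} V_{B_j} condition in the paper."""
    V_E = list(V)  # the ERPMG node set equals that of the RPMG
    E_E = []       # the ERPMG edge set starts empty
    for (i, j) in E:
        if satisfies_motion_constraint(i, j):
            E_E.append((i, j))
    return V_E, E_E
```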

The second step (see Algorithm 2) is to examine the observability of the n-robot
system according to the ERPMG G_E generated by Algorithm 1. We initialize the
incidence matrix A(G_E) and the diagonal weight matrix L_n as zero matrices, with
their dimensions determined by the number of nodes and edges in G_E. We then
construct the incidence matrix A(G_E) and the diagonal matrix L_n: the incidence
matrix A(G_E) describes which nodes are connected by which edges and is constructed
based on the topology of G_E, while the diagonal matrix L_n holds the weight vector
of each edge in G_E. The edges in A(G_E) and L_n take the same order. We can then
obtain the spectral matrix C(G_E) from A(G_E) and L_n. The observability of the
n-robot system is determined by evaluating the rank of C(G_E).

3.2 Expansion Through Transition

Algorithm 2: Observability Checking

1 Input: n = |V_E|, m = |E_E|;
2 Initialize the incidence matrix A(G_E) as an m × n zero matrix;
3 Initialize the diagonal matrix L_n as an n × n zero matrix;
4 k = 0;
5 for each node v_i ∈ V_E do
6   for each edge (i, j) ∈ E_E do
7     k = k + 1;
8     A(G_E)[k, i] = 1;
9     W[3k − 2 : 3k, 3k − 2 : 3k] = diag(L_n((i, j)));
10   end
11 end
12 if rank(C(G_E)) is equal to 4(n − 1) then
13   return True;
14 else
15   return False;
16 end
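To make the rank tests concrete, the sketch below computes the number of connected components from the incidence-matrix rank via Lemma 1; the signed-incidence convention and helper names are our own. Algorithm 2's full observability test additionally weights the edges and compares rank(C(G_E)) against 4(n − 1).

```python
import numpy as np

def incidence_matrix(n, edges):
    """Incidence matrix A(G): one row per edge, one column per node.
    We use the common signed convention; the paper marks tail/head with 1."""
    A = np.zeros((len(edges), n))
    for k, (i, j) in enumerate(edges):
        A[k, i] = 1.0
        A[k, j] = -1.0
    return A

def num_components(n, edges):
    """By Lemma 1, rank(A(G)) = n - lambda, so the number of connected
    components is lambda = n - rank(A(G))."""
    return n - np.linalg.matrix_rank(incidence_matrix(n, edges))

# Example: 4 nodes joined in a chain form a single component
print(num_components(4, [(0, 1), (1, 2), (2, 3)]))  # 1
```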

Initial position and velocity constraints are known to each robot; hence, the
locally constructed ERPMG tends to change based on relative motion in the network.
We limit the number of possible positions of an individual robot to k by exploiting
the concept of particle filtering. The motion process is carried out in the 2-D
state space: X_{n,t} includes the position (x_{n,t}, y_{n,t}) and orientation
φ_{n,t}. The robot model f(·) can be written as:

$$x_{n,t+1} = x_{n,t} + v_{n,t}\,\Delta t \cos(\phi_{n,t})$$
$$y_{n,t+1} = y_{n,t} + v_{n,t}\,\Delta t \sin(\phi_{n,t}) \qquad (1)$$
$$\phi_{n,t+1} = \phi_{n,t} + \omega_{n,t}\,\Delta t$$

In Eq. (1), v_{n,t} and ω_{n,t} are the velocity and angular velocity of robot n at
time t, respectively, and Δt represents the time interval between two control
inputs. Based on the robot model f(·), the control input of robot n at time t is
defined as:

$$u_{n,t} = [v_{n,t}, \omega_{n,t}] \qquad (2)$$
It is worth mentioning that Eq. (1) is the ideal kinematic model of the robot:
under pure rolling conditions, the robot's motion follows this equation. However,
imperfections are unavoidable in real devices; complex wheel-ground interactions
and system noises cause disturbances to the robots. These disturbances are modeled
as Gaussian random variables, characterized by their mean and covariance matrices.
Thus, the model of a given robot n at time t is defined as:

$$x_{n,t+1} = f(x_{n,t}, u_{n,t}) + N_{noise_{n,t}} \qquad (3)$$

Here, x_{n,t} ∈ R^{n_x} and u_{n,t} ∈ R^{n_u} are the state and control input of
robot n at time t, respectively. N_{noise} ∈ R^{n_x} is an unknown disturbance with
a Gaussian probability distribution. Considering the disturbance level in a real
environment, the noise N_{n,t} in simulations is set to diag(0.1 m, 0.1 m, 0.5°).
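A single step of this disturbed kinematic model, combining Eqs. (1) and (3), can be sketched as follows; the function name and the NumPy random generator are our own choices, and the default noise matches the diag(0.1 m, 0.1 m, 0.5°) setting above.

```python
import numpy as np

def unicycle_step(x, y, phi, v, omega, dt, rng,
                  noise_std=(0.1, 0.1, np.deg2rad(0.5))):
    """One step of the kinematic model of Eq. (1) with the additive
    Gaussian disturbance of Eq. (3)."""
    x_next = x + v * dt * np.cos(phi) + rng.normal(0.0, noise_std[0])
    y_next = y + v * dt * np.sin(phi) + rng.normal(0.0, noise_std[1])
    phi_next = phi + omega * dt + rng.normal(0.0, noise_std[2])
    return x_next, y_next, phi_next

rng = np.random.default_rng(0)
print(unicycle_step(0.0, 0.0, 0.0, v=1.0, omega=0.1, dt=0.1, rng=rng))
```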
As each robot also receives RSSI from the other robots, we can find the intersecting
region as an area of interest in which to map the estimated relative position of
each robot using the model, and select the k most likely positions via a soft-max.
Once we have the k possible positions of each robot, we can generate at most n^k
solvable graphs for optimization.

3.3 Optimization

Constraints. We are interested in solving a constrained convex optimization problem
over a multi-robot network in a distributed fashion. More specifically, we consider
a network with n robots, labeled by V = {1, 2, ..., n}, and k possible connections
to other robots. Every robot i has a local convex objective function and a global
constraint set. The network cost function is given by:

$$\text{minimize } f(x) = \sum_{i=1}^{N} f_i(x) \quad \text{subject to } x \in D = \left\{ x \in \mathbb{R}^k : c(x) \le 0 \right\} \qquad (4)$$

Here, x ∈ R^k is a global decision vector; f_i : R^k → R is the convex objective
function of robot i, known only by robot i; and D is a bounded convex domain, which
is (without loss of generality) characterized by an inequality constraint
c(x) ≤ 0, where c : R^k → R is a convex constraint function known by all the
robots. We assume that D is contained in a Euclidean ball of radius R, that is:

$$D \subseteq B = \left\{ x \in \mathbb{R}^k : \|x\|_2 \le R \right\}. \qquad (5)$$

We also assume that there exists a point x̂ such that the inequality constraint is
strictly feasible, i.e., c(x̂) < 0. We introduce a regularized Lagrangian function
to deal with the inequality constraint c(x):

$$L(x, \lambda) = \sum_{i=1}^{N} \left( f_i(x) + \lambda \frac{c(x)}{N} - \frac{\gamma}{2N} \lambda^2 \right) = \sum_{i=1}^{N} f_i(x) + \lambda c(x) - \frac{\gamma}{2} \lambda^2 = \sum_{i=1}^{N} L_i(x, \lambda), \qquad (6)$$

where we have replaced the inequality constraint c(x) ≤ 0 with c(x)/N ≤ 0, and
γ > 0 is a parameter. It is noted that L_i(x, λ) is known only by robot i.
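For intuition only, a generic gradient descent-ascent step on the per-robot term L_i(x, λ) of Eq. (6) might look like the sketch below; this is a textbook saddle-point iteration under our reading of Eq. (6), not the distributed solver actually used in the paper, and all names are illustrative.

```python
def primal_dual_step(x, lam, grad_fi, c, grad_c, gamma, N, step):
    """One hypothetical primal-dual step on L_i(x, lam): descend in x,
    ascend in lam, and keep the multiplier non-negative."""
    x_next = x - step * (grad_fi(x) + (lam / N) * grad_c(x))
    lam_next = max(0.0, lam + step * (c(x) / N - (gamma / N) * lam))
    return x_next, lam_next
```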

Algorithm 3: Relative Localization Based on Graph Optimization

1 At a given time interval t, for robot i;
2 Observe the connected robots n;
3 Receive motion and observability information from the connected robots;
4 Generate the ERPMG from the RPMG using Algorithm 1;
5 Check the observability of the n-robot system using Algorithm 2;
6 for every sampling instance t_k, k = 0, 1, 2, ... do
7   for each connected robot R_j, j ≠ i and j = 1, 2, ..., n do
8     Get previous odometry information;
9     Get previous relative positioning information;
10    Generate the n^k possible graphs over the estimation horizon;
11  end
12 end
13 Set constraints as in Eq. (4);
14 Generate a time-varying network model for the n connected robots;
15 Solve the graph optimization problem using the distributed solver [18] over a predicted horizon;
16 for each connected robot R_j, j ≠ i and j = 1, 2, ..., n do
17   Extract the position estimate from the optimized graph: X_{j,t+1} = (x_{j,t+1}, y_{j,t+1}, φ_{j,t+1});
18 end

Model. We consider a time-varying network model that has been widely considered in
the literature [18, 20, 29]. The robots' connectivity at time t can be represented
by an undirected graph G(t) = (V, E(t), A(t)), where E(t) is the set of activated
edges at time t, i.e., edge (i, j) ∈ E(t) if and only if robot i can receive data
from robot j, and we assign a weight [A(t)]_{ij} > 0 to the data on edge (i, j) at
time t. Note that the set E(t) includes self-edges (i, i) for all i. We make the
following standard assumptions on the graph G(t).

Assumptions: The graph G(t) = (V, E(t), A(t)) satisfies the following.

1. For all t ≥ 0, the weight matrix A(t) is doubly stochastic.


2. There exists a positive scalar ξ such that [A(t)]ii ≥ ξ for all i and t ≥ 0, and
[A(t)]ij ≥ ξ if [A(t)]ij > 0.
3. There exists an integer T ≥ 1 such that the graph (V, E(sT ) ∪ ... ∪ E((s + 1)T − 1))
is strongly connected for all s ≥ 0.

Once robot i has modeled the network and constraints, g2o [18] optimizes the
provided k-ERPMG to find the best possible position estimates for the n connected
robots in the form of an optimized graph G_o = (V_o, E_o, A_o): V_o are the possible
node positions with respect to i, E_o contains the possible distance ranges to the
other nodes in terms of weighted edges, and A_o is an optimized adjacency matrix
from i to every other node. Algorithm 3 describes the entire localization strategy
for a given MRS, with the graph-based optimizer denoting the algorithm's three main
iterative steps.
252 E. Latif and R. Parasuraman

4 Theoretical Analysis
Graph optimization-based relative pose localization has three submodules: formation,
expansion, and optimization. The formation depends on the number of robots n, which
drives the generation of the ERPMG G = (V, E, A), and the expansion relies on the
number of possible position instances k, which produces n^k potential graphs to be
optimized. As both initial steps can be done serially in polynomial time, one can
deduce that this part of the algorithm is polynomial and scalable with the number of
robots. However, the optimization of the n^k possible graphs using the g2o optimizer
needs to be analyzed for its NP-hardness.

NP-hardness: We will show that the problem of finding optimizers is NP-hard in
general; an example shows that L1-optimality is NP-hard for non-submodular energies.
Recall that a set U of vertices of a graph G = (V, E, A) is independent if no two of
its vertices are connected by an edge. The problem of finding a maximal independent
set of vertices of an arbitrary graph is known to be NP-hard [30]. Consider the
following local costs as an example:
– Give each vertex v with label l the cost l_i.
– For each edge with both vertices of label l, let the cost be N = |V| + l.
– Assign the cost 0 to every other edge.
It is worth noting that the maximum cost of a labeling l is N if and only if the set
U = l^{-1}(1) is not independent, and every labeling l associated with an
independent U has a maximum cost of 1. Furthermore, the labeling l is a strict
minimizer when the number of cost-1 atoms for U, which is |V| − |U|, is minimal,
i.e., when the size of U is maximal. To put it another way, if we use the previously
mentioned local cost assignments for a graph G, then l is a strict minimizer if and
only if U := l^{-1}(1) is a maximal independent set of vertices. As a result, our
problem, like the problem of finding the largest independent set of vertices, is
NP-hard. Since such graph problems can be verified in polynomial time [30], the
problem is NP-complete.

Convergence: In order to control convergence, the g2o method adds a damping factor
and backup operations to Gauss-Newton, so that g2o solves a damped version of the
optimization function. A damping factor λ is present here: the larger it is, the
smaller the increments Δx are. In the case of nonlinear surfaces, this helps control
the step size. The g2o algorithm's goal is to regulate the damping factor
dynamically: the error of the new configuration is tracked throughout each
iteration. If the new error is smaller than the prior one, λ is reduced for the next
iteration; if not, λ is raised and the solution is reverted. We refer to [18] for a
more thorough discussion of how the g2o algorithm guarantees convergence.
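The damping strategy can be illustrated with a generic Levenberg-Marquardt-style loop in plain NumPy; this is a sketch of the scheme described above, not g2o's actual implementation or API, and the halving/doubling schedule is an illustrative choice.

```python
import numpy as np

def damped_gauss_newton(residual, jacobian, x0, lam=1e-3, iters=50):
    """Damped Gauss-Newton: solve (H + lam*I) dx = -b; accept the step and
    shrink lam if the error decreases, otherwise revert and grow lam."""
    x = np.asarray(x0, dtype=float)
    err = np.sum(residual(x) ** 2)
    for _ in range(iters):
        J, r = jacobian(x), residual(x)
        H, b = J.T @ J, J.T @ r
        dx = np.linalg.solve(H + lam * np.eye(len(x)), -b)
        x_new = x + dx
        err_new = np.sum(residual(x_new) ** 2)
        if err_new < err:          # step accepted: damp less
            x, err, lam = x_new, err_new, lam * 0.5
        else:                      # step rejected: damp more, keep old x
            lam *= 2.0
    return x
```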

5 Simulation Experiments and Results


We performed extensive simulation experiments in a 60 × 60 m bounded region under
certain conditions to analyze the algorithm. We set up the MRS as discussed in
Sect. 3, where each robot shares its inertial measurements and RSSI with the
connected robots.

Fig. 3. Trajectories and the optimized graph. Left: simulation with the ground
truth (green) trajectory along with the colored predicted trajectory of a robot;
Right: robots as colored vertices and colored edges between connected robots, with
the respective colors, for the optimized graphs.

Fig. 4. DGORL performance of five robots in terms of Root Mean Squared Error (RMSE)
in the 60 m × 60 m simulation workspace (Left). Comparison with Collaborative
Terrain Relative Navigation (TRN) based on Covariance Intersection (Right).
Each robot in the MRS performs relative localization in its own frame of reference.
We can calculate the weight of an edge e_ij in G = (V, E) as the range d between
R_i and R_j using the path-loss model on the perceived RSSI:

$$d = 10^{\frac{A - RSSI}{10\,n}} \qquad (7)$$

Here, n denotes the signal propagation exponent, which varies between 2 (free
space) and 6 (complex indoor environment), d denotes the distance between robots i
and j, and A denotes the received signal strength at a reference distance of one
meter. On every iteration, r_i generates the ERPMG based on its observations and
performs an observability analysis (at least four nodes connected, based on the
observability checking in Algorithm 2). Later, it expands the ERPMG and generates
the input for the g2o optimizer.
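Inverting Eq. (7) to recover an edge weight from a raw RSSI reading is a one-liner; in the sketch below, the reference power A and the exponent n are illustrative example values, not the ones used in the experiments.

```python
def rssi_to_distance(rssi, A=-40.0, n=2.5):
    """Range implied by Eq. (7): d = 10 ** ((A - RSSI) / (10 n)).
    A is the RSSI at the 1 m reference distance; n is the propagation
    exponent (about 2 in free space, up to 6 indoors)."""
    return 10 ** ((A - rssi) / (10 * n))

print(rssi_to_distance(-65.0))  # -> 10.0 m under these example parameters
```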
The initial global measurement poses, the local relative pose constraints, and the
corresponding information matrix make up g2o's input. Once an optimized graph is
obtained, r_i estimates the relative positions from the vertex poses, and each robot
in the MRS performs this localization in a distributed manner. To measure error, we
captured the initial positions on a global scale and compared the predicted
trajectories with the ground truth as RMSE in meters. We used range sensors to
compare our strategy to Terrain Relative Navigation (TRN) from [19].

Localization Accuracy. We performed experiments with five robots driving on a
random walk for 100 iterations and obtained consistent results through 10 repeated
trials under identical conditions. Figure 3 visualizes the predicted trajectories
and connected graphs for this experiment. The results in Fig. 4 validate the claim
about the accuracy of DGORL in a larger space, with an 8% localization error.
Furthermore, experimentation was carried out in the same simulated workspace to
evaluate the localization accuracy of DGORL in contrast to TRN and to confirm its
applicability to larger scenarios. We performed ten trials under identical
simulation conditions and computed the RMSE in meters to assess the localization
accuracy. The results in Fig. 4, which demonstrate a 23% improvement in
localization accuracy over TRN, support the assertion that DGORL is viable in
larger contexts. Furthermore, the absolute localization error of 4.2 m for a
3600 m² zone with five robots, in a standalone view of DGORL's localization, is
highly encouraging. Still, further analysis is needed to understand its limitations
and sufficient conditions. Nevertheless, the results showcase the viability of
DGORL for MRS applications.

Computational Demand. The mean optimization time (the time taken by the solver) and
the mean CPU usage of the whole process are 10.2 ± 2.7 ms and 143 ± 49%,
respectively, evaluated for five robots over 10 runs of 100 iterations each. The
computational complexity of covariance intersection-based TRN is close to that of
DGORL, which further highlights the efficiency of DGORL. It is worth noting that
this performance provides evidence for DGORL's practical use on small,
resource-constrained robots (i.e., those equipped with tiny computers, e.g., a
quad-core Raspberry Pi Zero running at 1 GHz). Hence, DGORL can be used as an
efficient relative localization algorithm for most multi-robot systems, including
swarm robotics.

6 Conclusion
Many estimation, planning, and optimal control challenges in robotics are founded
on optimization problems. In most optimization problems, the objective to be
maximized or minimized is made up of multiple local components or terms; that is,
each term relies only on a small portion of the variables. We use graph theory to
solve a well-defined multi-robot relative localization problem. Graph generation,
expansion through transition, and optimization are the three fundamental components
of our novel technique. We first estimated a relative pose measurement graph using
the received signal intensity between connected robots. Later, we used a motion
model to build graph expansions and a polynomial-time graph optimizer to find an
optimized pose estimation graph from which a robot extracts relative positions for
the connected robots. Finally, we analyzed the algorithm through simulations to
validate its practicality and accuracy in a large workspace. Furthermore, DGORL
delivers higher localization accuracy than comparable methods while maintaining a
comparable optimization time, which further supports its efficiency and accuracy.
In the future, we will perform hardware experiments to substantiate the
applicability of graph optimization in the real world.

References
1. Islam, M.J., Mo, J., Sattar, J.: Robot-to-robot relative pose estimation using humans as mark-
ers. Auton. Robot. 45(4), 579–593 (2021)
2. Xianjia, Y., Qingqing, L., Queralta, J.P., Heikkonen, J., Westerlund, T.: Applications of
UWB networks and positioning to autonomous robots and industrial systems. In: 2021 10th
Mediterranean Conference on Embedded Computing (MECO), pp. 1–6. IEEE (2021)
3. Najafi, M., Nadealian, Z., Rahmanian, S., Ghafarinia, V.: An adaptive distributed approach
for the real-time vision-based navigation system. Measurement 145, 14–21 (2019)
4. Guo, K., Qiu, Z., Meng, W., Xie, L., Teo, R.: Ultra-wideband based cooperative relative
localization algorithm and experiments for multiple unmanned aerial vehicles in GPS denied
environments. Int. J. Micro Air Veh. 9(3), 169–186 (2017)
5. Fink, J., Michael, N., Kim, S., Kumar, V.: Distributed pursuit-evasion without global local-
ization via local fronteirs. Auton. Robot. 32(1), 81–95 (2012)
6. Dubé, R., Gawel, A., Sommer, H., Nieto, J., Siegwart, R., Cadena, C.: An online multi-robot
slam system for 3D lidars. In: 2017 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pp. 1004–1011. IEEE (2017)
7. Mangelson, J.G., Dominic, D., Eustice, R.M., Vasudevan, R.: Pairwise consistent measure-
ment set maximization for robust multi-robot map merging. In: IEEE International Confer-
ence on Robotics and Automation (ICRA), pp. 2916–2923. IEEE (2018)
8. Tian, Y., Chang, Y., Arias, F.H., Nieto-Granda, C., How, J.P., Carlone, L.: Kimera-Multi:
robust, distributed, dense metric-semantic slam for multi-robot systems. IEEE Trans. Robot.
38, 2022–2038 (2022)
9. Latif, E., Parasuraman, R.: Multi-robot synergistic localization in dynamic environments. In:
ISR Europe 2022; 54th International Symposium on Robotics, pp. 1–8 (2022)
10. Shorinwa, O., Yu, J., Halsted, T., Koufos, A., Schwager, M.: Distributed multi-target tracking
for autonomous vehicle fleets. In: IEEE International Conference on Robotics and Automa-
tion (ICRA) 2020, pp. 3495–3501 (2020)
11. Wai, H.-T., Yang, Z., Wang, Z., Hong, M.: Multi-agent reinforcement learning via double
averaging primal-dual optimization. In: Advances in Neural Information Processing Sys-
tems, vol. 31 (2018)
12. Wang, W., Jadhav, N., Vohs, P., Hughes, N., Mazumder, M., Gil, S.: Active rendezvous for
multi-robot pose graph optimization using sensing over Wi-Fi. In: Asfour, T., Yoshida, E.,
Park, J., Christensen, H., Khatib, O. (eds.) Robotics Research. ISRR 2019. Springer Proceed-
ings in Advanced Robotics, vol. 20, pp. 832–849. Springer, Cham (2019). https://doi.org/10.
1007/978-3-030-95459-8_51
13. Parashar, R., Parasuraman, R.: Particle filter based localization of access points using
direction of arrival on mobile robots. In: IEEE 92nd Vehicular Technology Conference
(VTC2020-Fall), pp. 1–6. IEEE (2020)
14. Latif, E., Parasuraman, R.: Online indoor localization using DOA of wireless signals. arXiv
preprint arXiv:2201.05105 (2022)
15. Luo, S., Kim, J., Parasuraman, R., Bae, J.H., Matson, E.T., Min, B.-C.: Multi-robot ren-
dezvous based on bearing-aided hierarchical tracking of network topology. Ad Hoc Netw.
86, 131–143 (2019)

16. Parasuraman, R., Min, B.-C.: Consensus control of distributed robots using direction of
arrival of wireless signals. In: Correll, N., Schwager, M., Otte, M. (eds.) Distributed
Autonomous Robotic Systems. SPAR, vol. 9, pp. 17–34. Springer, Cham (2019). https://
doi.org/10.1007/978-3-030-05816-6_2
17. Parasuraman, R., Oegren, P., Min, B.-C.: Kalman filter based spatial prediction of wireless
connectivity for autonomous robots and connected vehicles. In: IEEE 88th Vehicular Tech-
nology Conference (VTC-fall), pp. 1–5. IEEE (2018)
18. Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2 o: a general frame-
work for graph optimization. In: 2011 IEEE International Conference on Robotics and
Automation, pp. 3607–3613. IEEE (2011)
19. Wiktor, A., Rock, S.: Collaborative multi-robot localization in natural terrain. In: IEEE
International Conference on Robotics and Automation (ICRA), pp. 4529–4535 (2020)
20. Huang, G., Kaess, M., Leonard, J.J.: Consistent sparsification for graph optimization. In:
European Conference on Mobile Robots, pp. 150–157 (2013)
21. Zheng, S., et al.: Multi-robot relative positioning and orientation system based on UWB
range and graph optimization. Measurement 195, 111068 (2022)
22. Hao, N., He, F., Hou, Y., Yao, Y.: Graph-based observability analysis for mutual localization
in multi-robot systems. Syst. Control Lett. 161, 105152 (2022)
23. Sahawneh, L.R., Brink, K.M.: Factor graphs-based multi-robot cooperative localization: a
study of shared information influence on optimization accuracy and consistency. In: Pro-
ceedings of the 2017 International Technical Meeting of the Institute of Navigation (ION
ITM 2017), Monterey, California, pp. 819–838 (2017)
24. Carlevaris-Bianco, N., Eustice, R.M.: Generic factor-based node marginalization and edge
sparsification for pose-graph SLAM. In: 2013 IEEE International Conference on Robotics
and Automation, pp. 5748–5755. IEEE (2013)
25. Johannsson, H., Kaess, M., Fallon, M., Leonard, J.J.: Temporally scalable visual slam using
a reduced pose graph. In: 2013 IEEE International Conference on Robotics and Automation,
pp. 54–61. IEEE (2013)
26. Bar-Shalom, Y., Li, X.R., Kirubarajan, T.: Estimation with Applications to Tracking and
Navigation: Theory Algorithms and Software. Wiley, New York (2004)
27. Indelman, V., Nelson, E., Michael, N., Dellaert, F.: Multi-robot pose graph localization and
data association from unknown initial relative poses via expectation maximization. In: 2014
IEEE International Conference on Robotics and Automation (ICRA), pp. 593–600. IEEE
(2014)
28. Chang, T.-K., Chen, K., Mehta, A.: Resilient and consistent multirobot cooperative localiza-
tion with covariance intersection. IEEE Trans. Rob. 38(1), 197–208 (2021)
29. Jiang, X., Zeng, X., Sun, J., Chen, J.: Distributed solver for discrete-time Lyapunov equations
over dynamic networks with linear convergence rate. IEEE Trans. Cybern. 52(2), 937–946
(2022)
30. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2022)
Characterization of the Design Space
of Collective Braitenberg Vehicles

Jack A. Defay, Alexandra Q. Nilles(B) , and Kirstin Petersen

School of Electrical and Computer Engineering, Cornell University, Ithaca,


NY 14853, USA
{jad452,aqn3,kirstin}@cornell.edu

Abstract. Large collectives of artificial agents are quickly becoming


a reality at the micro-scale for healthcare and biological research, and
at the macro-scale for personal care, transportation, and environmen-
tal monitoring. However, the design space of reactive collectives and the
resulting emergent behaviors are not well understood, especially with
respect to different sensing models. Our work presents a well-defined
model and simulation for study of such collectives, extending the Brait-
enberg Vehicle model to multi-agent systems with on-board stimulus. We
define omnidirectional and directional sensing and stimulus models, and
examine the impact of the modelling choices. We characterize the result-
ing behaviors with respect to spatial and kinetic energy metrics over the
collective, and identify several behaviors that are robust to changes in
the sensor model and other parameters. Finally, we provide a demon-
stration of how this approach can be used for control of a swarm using
a single controllable agent and global mode switching.

1 Introduction

Valentino Braitenberg introduced his vehicles [2] as a thought experiment in


synthetic psychology: a light-hearted demonstration of the human tendency to
see order and intelligence in the emergent behaviors of simple reactive agents.
Sensors are “wired” directly to actuators, so the agents do not use on-board
stateful representations or memory (Fig. 1(a)). Seemingly intelligent behaviors
such as collision avoidance or aggregation easily arise from memoryless reactions
between agents and environmental stimuli, along with the human tendency to see
agency in expressive motion [18]. In this work, we introduce a simulated model of
simple Braitenberg Vehicles (BVs) in a collective. Most demonstrations of BVs
implement off-board stimuli sources; here, we extend the model to on-board
rear-facing stimuli sources; the environment of interest is the collective itself and
its dynamics. We analyze the emergent behaviors for robustness, and cast the
behaviors in light of potential robotics applications.
From social organisms such as cellular slime mold, we know that collec-
tive intelligence is possible without highly complex nervous systems. At the
micrometer-scale, researchers are developing engineered collectives for drug

Fig. 1. (a) Overview of how the original BV reacts to environmental stimulus. (b)
Overview of the local and global coordinates referenced in this article.

delivery and minimally invasive surgery [13]. Robotics techniques at this scale
are rapidly becoming a reality, with memristor-based circuits capable of sensing
chemical and light signals [5] and read/write memory of temporal events [34].
However, the agents are too small to use traditional onboard state-estimation
and control algorithms, instead requiring careful design of primarily reactive
strategies, uniform off-board controls for the entire collective [30], or a combina-
tion of the two approaches [9]. Reactive approaches that use local information
and embodied intelligence are crucial for success at the micro-scale, as biological
collectives show [10]. At the human scale, as robotic collectives begin assisting
us with personal care, transportation, and environmental monitoring, the design
of reactive collectives is appealing. Decentralized methods are more scalable,
and could even help side-step some privacy and surveillance concerns if we can
implement systems that do not require cameras, full scene reconstruction, or
global positioning [8,29]. To our knowledge, the dynamics of multiple BVs with
onboard stimuli have not been a first-class object of study in the robotics, control
or autonomous systems literature.
Our contributions include a more formal definition of the BV model, an open-
source Python implementation for simulating and characterizing the system, and
an interpretation of the results and next steps toward application. We extend the
BV model to include on-board stimuli and define two stimuli models: a symmetric
omni-directional stimulus such as that enabled by radio frequency localization
[17], and a directional stimulus modeled off a generic LED’s spatial intensity.
We also define two sensor models, an omnidirectional sensor model and a sensor
model with an angular dependency, such that the observed stimuli intensity
decreases with misalignment. In this paper, we examine three combinations of
these models, and characterize emergent behaviors with respect to metrics over
the nearest-neighbor distances and kinetic energy of the collective. We examine
homogeneous collectives while varying three main design parameters:

– Influence, γ: scale of the reactive control law, updates heading and velocity
– Influence angle, θmax : angular extent of constrained sensors and/or stimuli
– Actuation noise: Gaussian noise on heading and velocity execution

1.1 Related Work

The premise that collectives of simple, reactive agents exhibit robust behaviors
has been extensively validated in simulated and empirical work, from boid mod-
els of swarms [25] and particle-based optimization algorithms [7], to consensus
in biological collectives [28]. However, the effort to design robust and efficient
artificial collectives is still very much underway, with current systems falling far
short of the performance seen in biological collectives [6]. It is still difficult to
design for robust behavior, especially if the behaviors are learned, evolved or envi-
ronmental disturbances are present. Recent works in formalizing reactive plans
or controllers have focused on problems such as co-design of individual robotic
hardware and software with dynamic effects [36]; task-driven single-robot plan-
ning and design problems [22]; or top-down synthesis problems whose solutions
require significant sensing, computation, and communication between agents [3].
Biological systems, such as schooling fish, are a proof-of-concept that parsimo-
nious local strategies can lead to robust behavior, especially when combined
with embodied agent-environment interactions [19]. In this work we contribute
a dynamical system definition and characterization of the BV model as a step
toward more formal understanding of the dynamics and design of reactive col-
lectives.
The BV model has not often been a direct object of study by engineers,
apart from use as a demonstration in education [14] and graphics [33]. Similar
underlying themes can be found in the reactive multi-robot systems literature,
often focused on finding minimal requirements of a single behavior, while we aim
to generalize many behaviors in a single model. One fruitful line of work shows
that local tactile information, such as the frequency of robot-robot encounters, is
sufficient for tasks such as localization and density regulation in structured envi-
ronments [21]. In [11], the authors perform a parameter sweep over the control
space to find an optimal reactive controller for aggregation with a one-bit laser-
type occupancy sensor. Similarly, segregative behavior is achieved by a reactive
controller on a sensor reporting only whether another agent is nearby and if it is
the same or a different type [23]. Interestingly, neither of these systems [11,23]
include the ability to sense whether nearby agents are to the left or right, as is
fundamental to the BV model. However, in both systems the optimal motion
pattern in the absence of stimulus is a circular arc, and the agents are able to
effectively search the space and aggregate when they detect nearby agents. This
“sweep-until-find” strategy seems to simulate the taxis-kinesis strategy of love
with a more minimalist active sensing strategy that avoids the need to compute
direction of taxis. However, these systems are not designed to switch modes to
other flocking or spatial coverage behaviors, as the BV model allows.
Another promising approach for safe emergent behavior design is the Gibbs
Random Fields (GRF) model, demonstrating segregative and flocking behaviors
[26] as well as static shape formation [27]. The GRF approach achieves stronger
guarantees on performance for specific behaviors, but assumes the ability to
estimate neighboring agent positions, velocities, and type (in the heterogenous
swarm case). In this work, we focus our efforts on the BV model with onboard

stimulus for homogeneous collectives and characterize the behavior space. The
BV model has minimal sensing requirements and yet still demonstrates multiple
robust, switchable behavior modes.
Researchers in control theory and applied mathematics have also shown for-
mal stability proofs for emergent collective motion. The Vicsek model of constant
speed agents in the plane with a nearest-neighbor alignment feedback rule is a
stable switched linear system [16,32]; the Dubin’s car model has been shown
to reliably rendezvous given a simple reactive steering control law [35]. Discrete
stochastic methods can give formal guarantees on performance of collectives of
minimal agents tasked with static shape formation, aggregation and dispersal
[4,20], and the strategies show promising applicability to continuous systems. So
while we have strong mathematical intuition that reactive swarms can perform
robust behaviors, there is a lack of understanding of dynamic system behav-
ior, especially considering different sensor models and sensor-actuator couplings.
One exception is [12], where the dynamics of schooling fish were investigated as a
function of both the perception angle and manoeuvrability of the fish, identifying
three possible stable states (schooling, swarming and milling). In a similar spirit,
we address the BV model as a starting point to unify these insights and advance
the understanding of minimalist, mode-switching autonomous collectives.
Here, we present a thorough characterization of the BV model as a first step
toward more formal understanding of the system’s stable states. A similar and
more developed effort is the swarmalator model [24]; the model leverages the
physics of oscillators to formally show five stable states in collective phase and
position, but is limited in its applicability to systems with directional sensing
constraints or the inability to sense neighboring phase. BVs have recently been
used as a testbed for embodied artificial intelligence research in developmen-
tal neurosimulation [1] and structured Bayesian techniques [31]. However, these
approaches generally lack a formal model or an accessible implementation. To
fill this gap, we implement an open-source BV model and swarm characteriza-
tion metrics, allowing for comparison between approaches to reactive collective
design.

2 Model and Simulation

First, we detail the modelling choices made in the adaptation of the BV to


our simulated implementation. We have emphasized directional sensing and
stimulus production, motivated by practical constraints in robotic systems.
Our interactive, Python-based implementation is available at github.com/CEI-
lab/DARS2022-BVcollectives. An overview of the model and results is also avail-
able in video format at https://youtu.be/9ngfH0vkOfc.

2.1 Agent Model

Each agent i has a position on the plane, xi , orientation θi , and speed vi ∈


[Vmin = 0; Vmax ]. Agents are considered to be points in the plane; we leave most

considerations of morphology and collisions to future work, and assume that each
agent is small compared to the workspace. Time progresses in discrete stages,
where agents update their heading and speed at each time step as a memoryless,
reactive function of their observations in the previous time step. We omit the
time variable in our notation where operations occur within a single time step.
We implement directional and omnidirectional sensing and stimulus models,
inspired by common geometric sensing constraints for multi-robot systems, such
as those found with radio communications or LED-photodiode implementations.
Consider two agents (a leader and a follower ) with positions and orientations
xL , θL and xF , θF , respectively. Let θC be the global orientation of the vector
from xF to xL . Let α ∈ [−π, π] be the angular coordinate of the follower in the
leader’s reference frame. If the follower is to the right as viewed by the leader,
α is negative. Let β be the angular coordinate of the leader in the follower’s
reference frame (β = 0 when θC = θF ). If the leader is to the right as viewed by
the follower, β is negative.
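A small sketch consistent with these conventions, using `atan2` to recover θ_C, α, and β (the helper names are ours):

```python
import numpy as np

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return np.arctan2(np.sin(a), np.cos(a))

def relative_angles(xL, thetaL, xF, thetaF):
    """theta_C is the global bearing from follower to leader; alpha is the
    follower's angular coordinate in the leader's frame (negative when the
    follower is to the leader's right); beta is the leader's angular
    coordinate in the follower's frame (zero when theta_C = theta_F)."""
    theta_C = np.arctan2(xL[1] - xF[1], xL[0] - xF[0])
    alpha = wrap(theta_C + np.pi - thetaL)  # bearing of follower from leader
    beta = wrap(theta_C - thetaF)
    return alpha, beta
```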
In both stimulus models, intensity drops off proportional to the square of
the distance from the stimulus source, as would be the case with light-based
stimulus. If an observer is distance d from a stimulus source, the fraction of
stimulus intensity remaining after distance is taken into account is
d
Id = 1 − ( )2 , (1)
dmax
where dmax is the distance cutoff beyond which the stimulus is undetectable.
The directional source is modelled off a generic LED intensity profile, with
the stimulus pointing “behind” the agent and visible if the follower is within
the stimulus “cone” (α > π − θmax or α < −π + θmax ). We assume perceived
intensity does not depend on α within this angular range.
In the omnidirectional sensing case, the sensed area is the entire circle around
an agent defined by radius dmax .
In the directional sensing case, the field of view is limited by the angle
constraint θ_max such that |β| < θ_max. In addition, the apparent intensity is
scaled by a multiplicative factor I_β, as photodiodes are often flat and thus
reflect more incoming photons when misaligned. We use the relation

$$I_\beta = 1 - \left(\frac{\beta}{\theta_{max}}\right)^2 \qquad (2)$$

to model a smooth dropoff in sensor sensitivity that reaches zero at a
misalignment of β = θ_max. See Fig. 1(b) for an illustration of the directional
stimulus intensity profile for θ_max = π/3, θ_F = θ_L = π/2.
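Putting Eqs. (1) and (2) together, the perceived intensity of a single source can be computed as in the following sketch; the function name and argument handling are our own.

```python
import numpy as np

def stimulus_intensity(d, beta, d_max, theta_max, directional_sensor=True):
    """Perceived intensity of one source: quadratic falloff with distance
    (Eq. 1), optionally scaled by sensor misalignment beta (Eq. 2).
    Returns 0 beyond d_max or outside the field of view."""
    if d >= d_max:
        return 0.0
    I = 1.0 - (d / d_max) ** 2                 # Eq. (1)
    if directional_sensor:
        if abs(beta) >= theta_max:
            return 0.0
        I *= 1.0 - (beta / theta_max) ** 2     # Eq. (2)
    return I

print(stimulus_intensity(2.0, np.pi / 6, d_max=10.0, theta_max=np.pi / 3))
```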
Inspired by Braitenberg's two “forward-facing” sensors, we define a “windshield” sensor model, splitting the field of view into the left and right halves of the “windshield,” centered on the agent's orientation. Each agent then has two high-level sensors, each of which reports the total intensity of stimuli observed in its half. We let $y_{i,r}$ and $y_{i,l}$ denote the right-hand and left-hand sensor readings for agent i. To compute each of these sensor readings, we compute the sum of each
visible agent's intensity ($I_{\max} = 1$ for all agents in our simulations), scaled by $I_d$ and $I_\beta$ as applicable.
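Given that helper, the two windshield readings for agent i could be accumulated as in the sketch below; poses is assumed to be a list of (position, orientation) pairs, and the code reuses the hypothetical perceived_intensity and wrap_angle helpers from the previous block.

```python
def windshield_readings(i, poses, d_max, th_max):
    """Left/right 'windshield' readings y_{i,l}, y_{i,r} for agent i.

    A positive bearing beta places a neighbor in the left half of the
    field of view; a negative bearing places it in the right half.
    """
    x_i, th_i = poses[i]
    y_l = y_r = 0.0
    for j, (x_j, th_j) in enumerate(poses):
        if j == i:
            continue
        I = perceived_intensity(x_j, th_j, x_i, th_i, d_max, th_max)
        if I == 0.0:
            continue
        delta = np.asarray(x_j, dtype=float) - np.asarray(x_i, dtype=float)
        beta = wrap_angle(np.arctan2(delta[1], delta[0]) - th_i)
        if beta >= 0.0:
            y_l += I
        else:
            y_r += I
    return y_l, y_r
```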
Next, we define reactive, proportional control laws according to Braitenberg’s
schema, where we vary kinesis (k) to move faster or slower in response to stimuli,
and vary taxis (t) to turn toward or away from stimuli. For agent i, we control
the speed $v_i$ and the orientation $\theta_i$ according to

$$ \dot{v}_i = k(y_{i,l}, y_{i,r}) \qquad (3) $$
$$ \dot{\theta}_i = t(y_{i,l}, y_{i,r}) \qquad (4) $$

where $y_{i,l}$ and $y_{i,r}$ are the left and right sensor response magnitudes, respectively. For readability, we define $\sigma_i = y_{i,l} + y_{i,r}$ for the total stimulus observed by an agent, and $\delta_i = y_{i,l} - y_{i,r}$ for the difference between the two sensor readings, with a positive $\delta_i$ indicating more stimulus on the left-hand side of the agent than the right.
The basic reactive “behaviors” that we consider here are love, fear, aggression, and curiosity, each corresponding to a different simple control-law implementation for k and t. For example, in the love behavior, agents turn toward stimuli and slow down as intensity increases, leading to reliable aggregations. The discretized control laws for love are then

$$ v_i(t+1) = v_i(t) - \gamma\,\sigma_i(t) \qquad (5) $$
$$ \theta_i(t+1) = \theta_i(t) + \gamma\,\delta_i(t) \qquad (6) $$

where γ is a scaling factor and design parameter we call the influence, gov-
erning how strongly the agents actuate in each time step. In our discrete model
and simulation, it combines the design parameters of gain in the control law and
the discretization of time.
Likewise, we define fear as the emergent behavior when the agents turn away
from stimuli and slow down in response to stimuli; agents “fear” the stimuli and wish to avoid them. Aggression emerges when agents turn toward stimuli and speed
up (“attacking”), and curiosity emerges when agents turn away from stimuli and
speed up (on their way to find the next stimulus signal). The control laws are
summarized in Table 1.
We implement two independent sources of noise, on the actuation direction
and speed. During each timestep, the agent updates its control inputs (head-
ing and speed) according to the rules in Table 1. To simulate actuation noise,
an additive Gaussian noise term is applied to both of the control inputs, inde-
pendently for each agent, before the simulation takes a step. The default noise
has a mean of zero for both heading and speed, and a standard deviation of $\sigma_\theta = 2\pi/72 = 5^\circ$ for the heading and $\sigma_v = v_{\max}/72$ for speed. In later sections, we
increase these standard deviations to study system response to actuation noise.
Table 1. The reaction rules for each of the four regimes of interest.

Response to Increasing Stimulus | Negative Kinesis (Slow Down)  | Positive Kinesis (Speed Up)
--------------------------------|-------------------------------|------------------------------
Positive Taxis (Turn Toward)    | Love                          | Aggression
                                | v_i(t+1) = v_i(t) - γσ_i(t)   | v_i(t+1) = v_i(t) + γσ_i(t)
                                | θ_i(t+1) = θ_i(t) + γδ_i(t)   | θ_i(t+1) = θ_i(t) + γδ_i(t)
Negative Taxis (Turn Away)      | Fear                          | Curiosity
                                | v_i(t+1) = v_i(t) - γσ_i(t)   | v_i(t+1) = v_i(t) + γσ_i(t)
                                | θ_i(t+1) = θ_i(t) - γδ_i(t)   | θ_i(t+1) = θ_i(t) - γδ_i(t)
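Since the four regimes differ only in the two signs applied to the kinesis and taxis terms, one update routine can cover the whole table. The sketch below also folds in the additive Gaussian actuation noise described above; the sign-pair encoding and function signature are our own convention, and wrap_angle is reused from the earlier sketch.

```python
BEHAVIORS = {            # (kinesis sign, taxis sign)
    "love":       (-1, +1),
    "aggression": (+1, +1),
    "fear":       (-1, -1),
    "curiosity":  (+1, -1),
}

def step_agent(v, theta, y_l, y_r, gamma, behavior,
               v_max, sigma_v, sigma_theta, rng):
    """One discrete control update from Table 1 plus Gaussian actuation noise."""
    k_sign, t_sign = BEHAVIORS[behavior]
    total = y_l + y_r                            # sigma_i: total stimulus
    diff = y_l - y_r                             # delta_i: left minus right
    v_new = v + k_sign * gamma * total + rng.normal(0.0, sigma_v)
    th_new = theta + t_sign * gamma * diff + rng.normal(0.0, sigma_theta)
    return np.clip(v_new, 0.0, v_max), wrap_angle(th_new)
```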

2.2 World

We define a two-dimensional 500 × 500 unit environment and give the agents a maximum velocity of 4 units per second. To study collective behavior without the influence of boundary conditions, we identify the edges [0, y] ∼ [500, y] and [x, 0] ∼ [x, 500], creating a torus environment. Agents are able to
sense across these identified edges. We include the possibility for obstacles to
be present, with a default behavior of an elastic collision when an agent crosses
an obstacle boundary. However, the computation of occlusions and shading is beyond the scope of this work, so agents are able to detect each other through
obstacles if the angle and distance constraints are met. For the radius of the
circle denoting maximum distance of stimulus detection, we use dmax = 200
for all simulations, regardless of other geometric constraints on stimulus and
sensing.
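A minimal sketch of the wrap-around geometry under these conventions follows (the helper names are ours); in the full model, the displacements used for bearings and distances in the sensing sketches above would be computed this way.

```python
WORLD = 500.0

def wrap_position(x):
    """Identify opposite edges: positions live on a flat torus."""
    return np.mod(np.asarray(x, dtype=float), WORLD)

def torus_displacement(x_from, x_to):
    """Shortest displacement on the torus (minimum-image convention)."""
    d = np.asarray(x_to, dtype=float) - np.asarray(x_from, dtype=float)
    return (d + WORLD / 2.0) % WORLD - WORLD / 2.0

def torus_distance(x_a, x_b):
    return float(np.linalg.norm(torus_displacement(x_a, x_b)))
```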

2.3 Metrics

We define two metrics for evaluating the emergent behavior in terms of spatial and velocity distributions. The Nearest Neighbor Distance (NND) metric is defined as the population average of the distance from each agent to its closest neighbor, or

$$ \mathrm{NND}_{pop} = \frac{1}{N} \sum_{i=1}^{N} \min_{j \neq i} d(a_i, a_j), \qquad (7) $$

where $d(a_i, a_j)$ is the minimum Euclidean distance along the surface of the torus world between agents i and j.
Using an approximation based on uniformly distributed points in a unit square [15], we estimate $\mathrm{NND}_{pop} \approx 60$ in the case where 20 agents are distributed uniformly; however, this is an underestimate due to the approximation as well as our torus geometry. In practice, we see a maximum $\mathrm{NND}_{pop} \approx 75$ in cases where the behavior is observed to create a more nearly uniform spatial distribution.
We also compute the population average for a kinetic energy metric, such that

$$ \mathrm{KE}_{pop} = \frac{1}{N} \sum_{i=1}^{N} v_i^2, \qquad (8) $$

normalizing the mass of each agent to 1.
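Both population metrics are straightforward to compute; a sketch, assuming the hypothetical torus_distance helper above:

```python
def nnd_pop(positions):
    """Eq. (7): mean distance from each agent to its nearest neighbor."""
    N = len(positions)
    total = 0.0
    for i in range(N):
        total += min(torus_distance(positions[i], positions[j])
                     for j in range(N) if j != i)
    return total / N

def ke_pop(speeds):
    """Eq. (8): mean squared speed, taking every agent's mass as 1."""
    return float(np.mean(np.square(speeds)))
```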

3 Emergent Behaviors
3.1 Behavior Characterization
For the models with directional sensing, we characterize each of the four emergent
behaviors, with respect to the nearest neighbor distance and energy metrics, by
performing parameter sweeps across the influence angle θmax and the influence
magnitude γ. For the influence angle, we vary θmax ∈ [π/8, π/2], and we vary the
influence scale in [0.25, 2.5], with each range discretized into ten evenly spaced
values. For each pair of angle/influence variables, we simulated 20 randomly
initialized collectives of 20 agents, collecting the metrics at each timestep after
a convergence period Tconv = 990 for a duration of Trecord = 10 timesteps, for
a total simulation runtime of 1000 timesteps. We then average each metric over
time during the converged behavior, and report the value on the heat maps in
Figs. 2, 3, 4, and 5. We also show a representative time-lapse of agent positions
for the two directional models over the last 100 frames of the simulation. Finally,
we show the system response to noise for all three models. For each model, we
fix γ = 1.5 and θmax = π/2, and compute the mean and standard deviation of
NND and KE over 20 iterations with Tconv = 990 and Trecord = 10, as in the
heat maps.
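As a sketch, the sweep protocol could be organized as below; the parameter values are taken from the text, and simulate is a hypothetical stand-in that runs one randomly initialized collective and returns the recorded per-timestep metrics.

```python
import itertools
import numpy as np

def sweep(simulate, n_trials=20, n_agents=20, t_conv=990, t_record=10):
    """Parameter sweep over (theta_max, gamma), averaging converged metrics.

    simulate(n_agents, th_max, gamma, t_conv, t_record, seed) is assumed to
    return a dict with per-timestep 'nnd' and 'ke' arrays over the
    recording window after the convergence period.
    """
    theta_vals = np.linspace(np.pi / 8, np.pi / 2, 10)
    gamma_vals = np.linspace(0.25, 2.5, 10)
    results = {}
    for th_max, gamma in itertools.product(theta_vals, gamma_vals):
        nnd, ke = [], []
        for trial in range(n_trials):
            m = simulate(n_agents, th_max, gamma, t_conv, t_record, seed=trial)
            nnd.append(np.mean(m["nnd"]))
            ke.append(np.mean(m["ke"]))
        results[(th_max, gamma)] = (np.mean(nnd), np.mean(ke))  # heat-map cell
    return results
```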

3.2 General Observations


While all four taxis-kinesis combinations produce distinct emergent behaviors,
themes emerge in our characterization of the design space. Firstly, most behav-
iors are robust to noise, suggesting several stable states in the system phase
space. In fact, without adequate noise, the system may get stuck in a “local
minimum” where connectivity is too low to activate a steady state with higher
kinetic energy, especially in cases of limited sensing, stimulus, and influence.
If influence is too low, or the angular extent of directional sensors/stimulus
is too narrow, the agents may be “under-stimulated” and will not necessarily
produce the expected emergent behaviors. However, they may still form coher-
ent flows or other temporary or local dynamical structures, as can be seen in
the supplementary video. In a second case, if influence is too high or sensing
and stimulus production is overpowered, the agents are “over-stimulated” and
may produce an unexpected emergent behavior. Of course in between these
extremes there is a regime where the system produces emergent behaviors that
are expected and indicate a balance between too much and too little reactivity;
but all behaviors are of technical interest for multi-modal artificial collectives.
In the following sections, we will first characterize the balanced behavior regime,
then identify under- and over-stimulated modes if they exist.

3.3 Love

“Loving” agents turn toward stimulus and slow down as local stimulus increases.
For artificial collectives, this behavior is primarily useful for aggregation into a
low-velocity collective, perhaps to await a new task or to establish a static net-
work for exchange of information. Figure 2 presents an overview of the emergent
behaviors of this taxis-kinesis combination.

Fig. 2. Characterization of love behavior (positive taxis, negative kinesis). (a) First
row: NND and kinetic energy profiles for the directional stimulus and sensing model,
varying the angle θmax identically for both stimulus and sensing, and the influence γ.
The right-most figure is a snapshot of the behavior for the last 100 frames of a random
simulation with γ = 1 and θ = π/2. Second row, (b) results for omni-directional
stimulus and directional sensing model, varying the angle θmax (now only affecting
angular sensing range) and the influence γ. Third row, (c) results for each sensor-stimuli
model of NND and kinetic energy, while varying noise scaling factor as a percent of
the maximum value π for rotational noise and vmax = 4px for linear actuation noise.
All data points have θmax = π/2 where applicable and γ = 1.5.
Overall, this is one of the most consistent quadrants, with almost all parameter choices leading to reliable aggregative behaviors. Aggregation is not a new
behavior in the decentralized reactive agent literature, and our results confirm
that minimal sensing and reactive strategies lead to stable, low kinetic energy,
low NND states regardless of noise. Of interest is that the three sensor/stimulus
models have notably different mean NND as the noise magnitude increases, but
the mean kinetic energy of all three implementations is unaffected by noise.
In the case of directional stimulus and sensing, the converged steady state has
higher NND because the agents cannot see nearby agents behind them, and thus
an aligned following behavior is stable in the torus environment. In the cases
with omnidirectional stimulus and/or sensing, the agents have sufficient mutual
visibility such that some agents will halt completely and form more traditional
aggregates.

3.4 Aggression
Aggressive agents turn toward stimuli and speed up. This behavior is useful for
clustering and flocking in spaces where agents wish to keep a high momentum for
quick dispersion after aggregation, or when agents must come together quickly
or follow stimulus sources closely. As seen in Fig. 3(a) and (b), the traditional
behavior of aggressive following is found to emerge for higher values of influence
and θmax , and the behavior is not as robust as love.
At lower levels of stimulation, the agents tend to become more uniformly dis-
tributed, and qualitatively their behavior consists of short sprints toward stim-
ulus before losing the signal and slowing down. Quantitatively we also observe a
lower average kinetic energy. As we increase the power of the sensing and stimuli
models, mean kinetic energy increases, but mean NND stays relatively constant.
We observe that for narrow θmax , agents form small slow clusters, but as θmax
and influence increase, the collective forms chains of agents following each other
around the torus. Agents are closely following each other, so mean NND stays
low while collective kinetic energy increases.
Noise has relatively little effect on the behavior of the system, apart from
preventing a deadlocked stable state at zero noise for directional stimulus and
sensing. Increasing noise also tends to increase the kinetic energy of the sys-
tem for directional constraints, increasing the chance of sighting another agent
through random walk coverage and re-orientations. Finally, we observe that the
standard deviation for kinetic energy is high in the directional cases, supporting
our observation of bi-modal velocity distributions with this taxis-kinesis combi-
nation.

3.5 Fear
Fearful agents turn away from stimuli and slow down, and the resulting emergent
behavior is a space-filling formation. This behavior lends itself well to applica-
tions that need automatic organization of a collective into a mostly static network
of agents monitoring the environment or passing messages.
Fig. 3. Characterization of aggression behavior (positive taxis, positive kinesis). (a) First row: NND and kinetic energy profiles for the directional stimulus and sensing
model, varying the angle θmax identically for both stimulus and sensing, and the influ-
ence γ. The right-most figure is a snapshot of the behavior for the last 100 frames
of a random simulation with γ = 1 and θ = π/2. Second row, (b) results for omni-
directional stimulus and directional sensing model, varying the angle θmax (now only
affecting angular sensing range) and the influence γ. Third row, (c) results for each
sensor-stimuli model of NND and kinetic energy, while varying noise scaling factor
as a percent of the maximum value π for rotational noise and vmax = 4px for linear
actuation noise. All data points have θmax = π/2 where applicable and γ = 1.5.

We see from Fig. 4 that overall, the NND metric is consistently high with
this behavior regardless of sensor/stimulus model, nearly maximized in much of
the design space. As sensing and stimulus production become more powerful,
each agent is able to detect more stimulus around it, and becomes more “afraid”
and moves more slowly.
The under-stimulated case is apparent, dramatically so in the case of direc-
tional stimulus and sensing, where kinetic energy is consistently high across all
noise values. The system is in a stable state of agents detecting a low amount
of stimulus most of the time, and stopping and reorienting when they happen
to cross paths. The resulting behavior is reminiscent of run-and-tumble models
in active matter systems. The taxis-kinesis combination of fear is notably more
robust with respect to noise than the previous two behaviors.
Fig. 4. Characterization of fear behavior (negative taxis, negative kinesis). (a) First
row: NND and kinetic energy profiles for the directional stimulus and sensing model,
varying the angle θmax identically for both stimulus and sensing, and the influence γ.
The right-most figure is a snapshot of the behavior for the last 100 frames of a random
simulation with γ = 1 and θ = π/2. Second row, (b) results for omni-directional
stimulus and directional sensing model, varying the angle θmax (now only affecting
angular sensing range) and the influence γ. Third row, (c) results for each sensor-stimuli
model of NND and kinetic energy, while varying noise scaling factor as a percent of
the maximum value π for rotational noise and vmax = 4px for linear actuation noise.
All data points have θmax = π/2 where applicable and γ = 1.5.

3.6 Curiosity

Finally, in the curiosity mode, agents turn away from stimulus and speed up
when more stimulus is detected. Intuitively, they are “seeking” regions with
no stimulus in which to stop. The agents tend to disperse, similarly to fear, but
display a greater sensitivity to noise and system design parameters. The curiosity
behavior is applicable to the formation of space-filling static formations, as well
as dynamic space-filling trajectories that have a higher average energy, depending
on the sensing and stimulus model and design parameters. In Fig. 5, we see that
counterintuitively, more directional models are better at maintaining a uniform
spatial distribution. As θmax increases, agents are able to see more neighbors,
and a stable state occurs where agents align and form a “search party” style
formation, as seen in Fig. 5(a).
Fig. 5. Characterization of curiosity behavior (negative taxis, positive kinesis). (a) First row: NND and kinetic energy profiles for the directional stimulus and sensing
model, varying the angle θmax identically for both stimulus and sensing, and the influ-
ence γ. The right-most figure is a snapshot of the behavior for the last 100 frames
of a random simulation with γ = 1 and θ = π/2. Second row, (b) results for omni-
directional stimulus and directional sensing model, varying the angle θmax (now only
affecting angular sensing range) and the influence γ. Third row, (c) results for each
sensor-stimuli model of NND and kinetic energy, while varying noise scaling factor
as a percent of the maximum value π for rotational noise and vmax = 4px for linear
actuation noise. All data points have θmax = π/2 where applicable and γ = 1.5.

At high θmax in the omnidirectional stimulus and directed sensing case, the
system becomes overstimulated, as it also does in the fully omnidirectional case.
In this mode, agents cannot escape stimuli, and maintain a consistently high
speed. Interestingly, this behavior does not maximize the NND metric, indicating
some additional collective structure, especially at lower noise values.

4 Conclusion and Outlook

We have presented a novel model of Braitenberg Vehicles with onboard stimulus, and characterized the behavior of the system with respect to nearest-neighbor
distance and kinetic energy, under three sensing and stimulus models inspired by
robotics applications. We identified several unexpected modes, such as both low
and high kinetic energy modes for curiosity. We also analyzed the response to
noise, and found the emergent collective dynamics to be quite stable, giving hope
for continued formalization of multi-agent system design. It is also interesting to
note that noise often aids stability or can be used to tune the emergent behavior.
We also identify certain “deadlock” states, such as aggression with directional stimulus and sensing; when agents cannot adequately detect others, the system can become “stuck” in a motionless state without noise.
We identified that constraints on angular range of sensing and stimulus have
a key effect on behavior in much of the design space, supporting findings in other
works on embodied system dynamics [12]. Other interesting design parameters
that we have not considered here include agent density in the space, environment
geometries and strategies for agent-environment interactions, heterogeneous col-
lectives, and the influence of controllable or unreliable agents. In the supplemen-
tal video, available at https://youtu.be/9ngfH0vkOfc, we show a demonstration
of how a single controllable agent can be used along with global mode-switching
to guide a swarm through narrow passageways and explore a multi-room envi-
ronment.
Many of these behaviors that we have shown are amenable to more formal
analysis. We also aim to use this platform for comparison of reactive, stateful,
and learned strategies under minimal sensing, as well as develop an embodied
representation of the agents that allows for design of agent-agent and agent-
environment interactions. The Braitenberg Vehicle model’s strength lies in lever-
aging the ability to switch a collective between several useful modes with minimal
requirements in hardware and software.

Acknowledgments. This project was funded by the Packard Fellowship for Science
and Engineering, GETTYLABS, and the National Science Foundation (NSF) Grant
#2042411.

References
1. Alicea, B., Dvoretskii, S., Felder, S., Gong, Z., Gupta, A., Parent, J.: Developmental
embodied agents as meta-brain models. In: DevoNN Workshop (2020)
2. Braitenberg, V.: Vehicles: Experiments in Synthetic Psychology. MIT Press, Cam-
bridge (1986)
3. Chen, J., Sun, R., Kress-Gazit, H.: Distributed control of robotic swarms from
reactive high-level specifications. In: 2021 IEEE 17th International Conference on
Automation Science and Engineering (CASE), pp. 1247–1254. IEEE (2021)
4. Coppola, M., Guo, J., Gill, E., de Croon, G.C.: Provable self-organizing pattern
formation by a swarm of robots with limited knowledge. Swarm Intell. 13(1), 59–94
(2019)
5. Ding, T., et al.: Light-induced actuating nanotransducers. Proc. Natl. Acad. Sci.
113(20), 5503–5507 (2016)
6. Dorigo, M., Theraulaz, G., Trianni, V.: Swarm robotics: past, present, and future
[point of view]. Proc. IEEE 109(7), 1152–1165 (2021)
7. Eberhart, R.C., Shi, Y., Kennedy, J.: Swarm Intelligence. Elsevier, Amsterdam
(2001)
8. Frank, S., Kuijper, A.: Privacy by design: survey on capacitive proximity sensing
as system of choice for driver vehicle interfaces. In: Computer Science in Cars
Symposium, pp. 1–9 (2020)
9. Gardi, G., Ceron, S., Wang, W., Petersen, K., Sitti, M.: Microrobot collectives
with reconfigurable morphologies, behaviors, and functions. Nat. Commun. 13(1),
1–14 (2022)
10. Garnier, S., et al.: The embodiment of cockroach aggregation behavior in a group
of micro-robots. Artif. Life 14(4), 387–408 (2008)
11. Gauci, M., Chen, J., Li, W., Dodd, T.J., Groß, R.: Self-organized aggregation
without computation. Int. J. Robot. Res. 33(8), 1145–1161 (2014)
12. Gautrais, J., Jost, C., Theraulaz, G.: Key behavioural factors in a self-organised
fish school model. In: Annales Zoologici Fennici, vol. 45, pp. 415–428. BioOne
(2008)
13. Hauert, S.: Swarm engineering across scales: from robots to nanomedicine. In:
ECAL 2017, The Fourteenth European Conference on Artificial Life, pp. 11–12.
MIT Press (2017)
14. Hogg, D.W., Martin, F., Resnick, M.: Braitenberg Creatures. Epistemology and
Learning Group, MIT Media Laboratory Cambridge (1991)
15. achille hui (https://math.stackexchange.com/users/59379/achille hui): Average distance between n randomly distributed points on a square with their nearest neighbors. Mathematics Stack Exchange. https://math.stackexchange.com/q/2565546. Accessed 11 Nov 2018
16. Jadbabaie, A., Lin, J., Morse, A.S.: Coordination of groups of mobile autonomous
agents using nearest neighbor rules. IEEE Trans. Autom. Control 48(6), 988–1001
(2003)
17. Lanzisera, S., Zats, D., Pister, K.S.: Radio frequency time-of-flight distance mea-
surement for low-cost wireless sensor localization. IEEE Sens. J. 11(3), 837–845
(2011)
18. LaViers, A., et al.: Choreographic and somatic approaches for the development of
expressive robotic systems. In: Arts, vol. 7, p. 11. MDPI (2018)
19. Lei, L., Escobedo, R., Sire, C., Theraulaz, G.: Computational and robotic modeling
reveal parsimonious combinations of interactions between individuals in schooling
fish. PLoS Comput. Biol. 16(3), e1007194 (2020)
20. Li, S., et al.: Programming active cohesive granular matter with mechanically
induced phase changes. Sci. Adv. 7(17), eabe8494 (2021)
21. Mayya, S.: Local encounters in robot swarms: from localization to density regula-
tion. Ph.D. thesis, Georgia Institute of Technology (2019)
22. McFassel, G., Shell, D.A.: Reactivity and statefulness: action-based sensors, plans,
and necessary state. Int. J. Robot. Res., 02783649221078874 (2021)
23. Mitrano, P., Burklund, J., Giancola, M., Pinciroli, C.: A minimalistic approach to
segregation in robot swarms. In: 2019 International Symposium on Multi-Robot
and Multi-Agent Systems (MRS), pp. 105–111. IEEE (2019)
24. O’Keeffe, K.P., Hong, H., Strogatz, S.H.: Oscillators that sync and swarm. Nat.
Commun. 8(1), 1–13 (2017)
25. Reynolds, C.W.: Flocks, herds and schools: a distributed behavioral model. In:
Proceedings of the 14th Annual Conference on Computer Graphics and Interactive
Techniques, pp. 25–34 (1987)
26. Rezeck, P., Assunção, R.M., Chaimowicz, L.: Flocking-segregative swarming
behaviors using Gibbs random fields. In: 2021 IEEE International Conference on
Robotics and Automation (ICRA), pp. 8757–8763. IEEE (2021)
27. Rezeck, P., Chaimowicz, L.: Chemistry-inspired pattern formation with robotic
swarms. arXiv preprint: arXiv:2206.03388 (2022)
28. Rosenthal, S.B., Twomey, C.R., Hartnett, A.T., Wu, H.S., Couzin, I.D.: Revealing
the hidden networks of interaction in mobile animal groups allows prediction of
complex behavioral contagion. Proc. Natl. Acad. Sci. 112(15), 4690–4695 (2015)
29. Rueben, M., et al.: Themes and research directions in privacy-sensitive robotics.
In: 2018 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO),
pp. 77–84. IEEE (2018)
30. Shahrokhi, S., Lin, L., Ertel, C., Wan, M., Becker, A.T.: Steering a swarm of
particles using global inputs and swarm statistics. IEEE Trans. Rob. 34(1), 207–
219 (2017)
31. Thornton, C.: Predictive processing simplified: the infotropic machine. Brain Cogn.
112, 13–24 (2017)
32. Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I., Shochet, O.: Novel type of phase
transition in a system of self-driven particles. Phys. Rev. Lett. 75(6), 1226 (1995)
33. Wiseman, J.: Braitenberg vehicles (1999). http://people.cs.uchicago.edu/wiseman/
vehicles/
34. Yang, J.F., et al.: Memristor circuits for colloidal robotics: temporal access to
memory, sensing, and actuation. Adv. Intell. Syst. 4(4), 2100205 (2022)
35. Yu, J., LaValle, S.M., Liberzon, D.: Rendezvous without coordinates. IEEE Trans.
Autom. Control 57(2), 421–434 (2011)
36. Zardini, G., Censi, A., Frazzoli, E.: Co-design of autonomous systems: from hard-
ware selection to control synthesis. In: 2021 European Control Conference (ECC),
pp. 682–689. IEEE (2021)
Decision-Making Among Bounded
Rational Agents

Junhong Xu1(B), Durgakant Pushp1, Kai Yin2, and Lantao Liu1

1 Indiana University, Bloomington, IN 47408, USA
{xu14,dpushp,lantao}@iu.edu
2 Expedia Group, Seattle, USA

Abstract. When robots share the same workspace with other intelli-
gent agents (e.g., other robots or humans), they must be able to reason
about the behaviors of their neighboring agents while accomplishing the
designated tasks. In practice, agents frequently do not exhibit absolutely rational behavior due to their limited computational resources. Thus, predicting the optimal agent behaviors is both impractical (because it demands prohibitive computational resources) and undesirable (because the prediction may be wrong). Motivated by this observation, we remove the assumption of perfectly rational agents and propose incorporating the concept of bounded rationality from an information-theoretic view into the game-theoretic framework. This allows the robots to reason about other agents' sub-optimal behaviors and act accordingly under their computa-
tional constraints. Specifically, bounded rationality directly models the
agent’s information processing ability, which is represented as the KL-
divergence between nominal and optimized stochastic policies, and the
solution to the bounded-optimal policy can be obtained by an efficient
importance sampling approach. Using both simulated and real-world
experiments in multi-robot navigation tasks, we demonstrate that the
resulting framework allows the robots to reason about different levels
of rational behaviors of other agents and compute a reasonable strat-
egy under its computational constraint.

A preliminary version of this work appeared as a poster in the 2021 NeurIPS Workshop on Learning and Decision-Making with Strategic Feedback. The video of the real-world experiments can be found at https://youtu.be/hzCitSSuWiI. We gratefully acknowledge the support of NSF grants No. 2006886 and 2047169.

Keywords: Bounded Rationality · Game Theory · Multi-Robot


System

1 Introduction

We consider the problem of generating reasonable decisions for robots in multi-


agent environments. This decision-making problem is complex because each
robot’s motion trajectory depends on and affects the trajectories of others. Thus
they need to anticipate how others will respond to their decisions. The game-
theoretic framework provides an appealing model choice to describe this com-
plex decision-making problem among agents [16] and has been applied to various
robotics applications, e.g., drone racing [24] and swarm coordination [1]. In an
ideal situation, where all the agents are perfectly rational (i.e., they have unlimited computational resources), they can select the motion trajectories that reach the Nash equilibrium (if one exists). However, since these trajectories live in a contin-
uous space, agents need to evaluate infinitely many trajectories and the interac-
tion among them, which is intractable. As a remedy, most of the works consider
constraining the search space of the multi-robot problem via sampling [25] or
locally perturbing the solution [24] to find a good trajectory within a reasonable
amount of time.
Most game-theoretic planners mentioned above do not explicitly consider the
agents’ computational limitations in their modeling process, i.e., they assume
each agent is perfectly rational. They consider these limitations only externally,
e.g., by truncating the number of iterations during optimization. In contrast, we
propose a novel and more principled treatment for modeling agents with limited
computational resources by directly modeling the agents being only bounded-
rational [6] in the game-theoretic framework. Bounded Rationality (BR) has
been developed in economics [22] and cognitive science [7] to describe behav-
iors of humans (or other intelligent agents), who have limited computational
resources and partial information about the world but still need to make deci-
sions from an enormous number of choices, and has been applied to analyze the
robustness of controllers in the single-agent setting [17]. In this work, we use
the information-theoretic view of BR [6], which states that the agent optimizes
its strategy under an information-theoretic constraint (e.g., KL-divergence),
describing the cost of transforming its a priori strategy into an optimized one.
This problem can be solved efficiently by evaluating a finite number of trajectory
choices from its prior trajectory distribution. Since BR explicitly considers the
computational constraints during the modeling process, the resulting solution
provides an explainable way to trade off computational efficiency and perfor-
mance. Furthermore, by incorporating BR into the game-theoretic framework,
robots can naturally reason about other agents’ sub-optimal behaviors and use
these predictions to take advantage of (or avoid) other agents with less (or higher)
computational resources.

2 Related Work

Our work is related to motion planning in multi-agent systems and game-theoretic frameworks. Here we provide a brief overview of these topics. The
game-theoretic approach in robotics has gained increasing popularity recently.
For example, the authors in [23,24] combine Model-Predictive Control (MPC)
and Iterated Best Response (IBR) to approximate the Nash Equilibrium solution
of a two-player vehicle racing game. Recently, a game-theoretic iterative linear
quadratic regulator (iLQR) has been proposed to solve a general-sum stochastic game [21]. In addition, the game-theoretic framework is also used in self-driving
vehicles for trajectory prediction [5,20] and motion planning among pedestri-
ans [4,12]. The above works are all based on the concept of rational agents
who are assumed to be able to maximize their utility. In contrast, the proposed
bounded rational framework explicitly considers the information-processing con-
straints of intelligent agents.
Although bounded rational solutions are used in almost all robotic systems
in practice, e.g., anytime planners terminate the computation if the time runs
out [8], only a few works attempt to model this bounded rational assumption
explicitly. Recently, authors in [17] analyze the single-agent robust control per-
formance using the information-theoretic bounded rationality [6,15]. Another
closely related but independently developed literature is KL-Control [3,9]. When
computing the optimal policy, it includes an additional information-theoretic
cost measured by the KL-divergence between a prior policy distribution and a
posterior after optimization. This is similar to the effect of the information-
theoretic constraints of bounded rationality, where the final optimal control
distribution can be sampled from the exponential family using approximate
Bayesian inference [10,26]. Although these methods have similar traits, they
generally only focus on single-agent problems. In contrast, our work integrates
the bounded rationality idea into the game-theoretic framework and provides a
method to compute agents’ strategies under computational limits.

3 Problem Formulation
In this section, we define multi-agent decision-making using the formulation of Multi-Agent Markov Decision Processes (MMDPs) or Markov Games (MGs) [11],
and provide the resulting Nash Equilibrium (NE) solution concept under the
perfect rationality assumption. In the subsequent sections, we show how this
assumption can be mitigated using the BR concept and derive a sampling-based
method to find a Nash Equilibrium strategy profile under bounded rationality.

3.1 Multi-agent Markov Decision Process


In an MMDP with N agents, each agent i has its own state space $\mathcal{S}^i$ and action space $\mathcal{A}^i$, where $s^i \in \mathcal{S}^i$ and $a^i \in \mathcal{A}^i$ denote the state and action of agent i. We denote the joint states and actions of all the agents as $S = [s^1, \ldots, s^N]$ and $A = [a^1, \ldots, a^N]$. The agents progress in the environment as follows. At every timestep t, all the agents simultaneously execute their actions $A_t$ to advance to the next states $S_{t+1}$ according to their stochastic transition functions $s^i_{t+1} \sim p^i(s^i_{t+1} \mid S_t, A_t)$. At the same time, they receive rewards $R_t = [r_t^1, \ldots, r_t^N]$, where $r_t^i = f_t^i(S_t, A_t)$ is the reward function for agent i. Each agent's stochastic transition and reward functions depend on all agents' states and actions in the system. Under the perfectly rational assumption, the goal for agent i is to find a strategy that maximizes an expected utility function

$$ \pi_t^{i,*} = \arg\max_{\pi_t^i} U^i(S_t, \Pi_t), \qquad (1) $$

where $U^i(S_t, \Pi_t) = \mathbb{E}\left[ \sum_{k=t}^{H+t} r_k^i(S_k, A_k) \right]$ is agent i's utility function at time t, H is the planning horizon, and $\Pi_t = [\pi_t^1, \ldots, \pi_t^N]$ denotes the strategy profile of all agents. In this work, we assume that the agents' strategies take a specific form: a distribution over the action sequence, $\mathbf{a}_t^i \sim \pi_t^i(\mathbf{a}_t^i \mid S_t, \Pi_t^{-i})$, where $\mathbf{a}_t^i = [a_t^i, \ldots, a_{t+H}^i]$ is the action sequence up to horizon H and $\Pi_t^{-i}$ is the strategy profile without agent i. This policy form is well-suited for most robotics problems because the trajectory induced by the action sequence can be tracked by a low-level controller.
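A Monte Carlo sketch of how the utility in Eq. (1) could be estimated for a given joint action-sequence profile is shown below; step_fn, reward_fn, and the action_profile layout are hypothetical stand-ins for the joint transition and reward models, not part of the paper's implementation.

```python
import numpy as np

def utility(i, S0, action_profile, step_fn, reward_fn, n_rollouts=16, rng=None):
    """Monte Carlo estimate of U^i(S_t, Pi_t) for a joint action-sequence profile.

    action_profile[j][k] is agent j's action at horizon step k; step_fn and
    reward_fn stand in for the joint transition model and agent i's reward.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_agents, H = len(action_profile), len(action_profile[0])
    total = 0.0
    for _ in range(n_rollouts):                  # average over transition noise
        S = S0
        for k in range(H):
            A = [action_profile[j][k] for j in range(n_agents)]
            total += reward_fn(i, S, A)          # accumulate r^i_k(S_k, A_k)
            S = step_fn(S, A, rng)               # S_{k+1} ~ p(. | S_k, A_k)
    return total / n_rollouts
```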

3.2 Iterative Best Response for MMDPs


To solve the problem defined in Eq. (1), each agent needs to predict how other agents will behave and respond to each other. For brevity, we will write $\pi^i(\mathbf{a}_t^i \mid S_t)$ for agent i's strategy and omit the dependency on $\Pi^{-i}$. One possible and common way to predict other agents' behaviors is to assume that all other agents are perfectly rational, and thus that the strategy profile of all the agents reaches a Nash Equilibrium (NE), if one exists [16], which satisfies the following relationship:

$$ U^i(S_t, \Pi_t^{-i,*}, \pi_t^{i,*}) \ge U^i(S_t, \Pi_t^{-i,*}, \pi_t^i), \quad \forall i \in \{1, \ldots, N\}, \text{ for any } \pi_t^i. \qquad (2) $$

Intuitively, if the agents satisfy the NE, no agent can improve its utility by unilaterally changing its strategy. To compute the NE, we can apply a numerical procedure called Iterative Best Response (IBR) [19]. Starting from an initial guess of the strategy profiles of all the agents, we update each agent's strategy to the best response to the current strategies of all other agents. The above procedure is applied iteratively for each agent until the strategy profile does not change. If every agent is perfectly rational, the robot can use the NE strategy profile to predict other agents' behaviors and act correspondingly (we make the simplifying assumption that there is only one NE in the game). However, there is a gap between this perfect rationality assumption and the real world, as most existing methods can only search the strategy profile in a neighborhood of the initial guess [23,24] due to computational limits. In the following section, we fill this gap by explicitly formulating this bounded rational solution.
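Algorithmically, IBR is a simple fixed-point iteration. Below is a sketch under our own conventions, where best_response is whatever single-agent optimizer is available (e.g., the importance sampler of Sect. 4.2) and each strategy is represented by its mean action sequence.

```python
import numpy as np

def iterative_best_response(profile, best_response, n_iters=10, tol=1e-3):
    """Fixed-point IBR loop over a joint strategy profile.

    profile is a list of per-agent strategies (here, mean action sequences);
    best_response(i, profile) returns agent i's best reply to the others.
    """
    for _ in range(n_iters):
        changed = False
        for i in range(len(profile)):
            new_i = best_response(i, profile)
            if np.linalg.norm(np.asarray(new_i) - np.asarray(profile[i])) > tol:
                changed = True
            profile[i] = new_i
        if not changed:                          # profile stopped changing
            break
    return profile
```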

4 Methodology
This section first provides details on the formulation of bounded rationality and
integrates it into the game-theoretic framework. Then, we propose a sampling-
based approach to generate bounded-rational stochastic policies.

4.1 Bounded Rational Agents in Game-Theoretic Framework


In the standard game-theoretic frameworks, agents are rational, i.e., they opti-
mize their decisions via evaluating an infinite number of action sequences with-
out considering the computational resources. In contrast, we model each agent
as bounded rational: it makes decisions that maximize its utility subject to a certain computational constraint. Following the work on information-theoretic bounded rationality [6,14], this constraint is explicitly defined by the neighborhood of a default policy $q^i$. Intuitively, this default policy describes the nominal behavior of the agent. For example, in a driving scenario, the default policy of an aggressive driver may be more likely to drive at a high speed. The agent's goal is to search for an optimized posterior policy bounded within the neighborhood of $q^i$. The size of the neighborhood may reflect limited computational resources or other practical considerations. In the following, we omit the time subscript t for clarity. We use the KL-divergence to characterize this neighborhood:

$$ \pi^{i,*} = \arg\max_{\pi^i} U^i(S, \Pi), \quad \text{s.t. } \mathrm{KL}(\pi^i \,\|\, q^i) \le K_i. \qquad (3) $$

$K_i$ is a constant denoting the amount of computation (measured in bits) by which agent i can deviate from the default policy. Using Lagrange multipliers, we can rewrite the constrained optimization problem in Eq. (3) as an unconstrained one, $\pi^{i,*} = \arg\max_{\pi^i} U^i(S, \Pi) - \frac{1}{\beta_i}\mathrm{KL}(\pi^i \| q^i)$, where $\beta_i > 0$ indicates the rationality level.
To see how this bounded-optimal stochastic policy can be computed, we can first observe that the unconstrained problem can be written as

$$ U^i(S, \Pi) - \frac{1}{\beta_i}\,\mathrm{KL}(\pi^i \| q^i) = -\frac{1}{\beta_i}\left[ \mathrm{KL}(\pi^i \| q^i) - \beta_i\, U^i(S, \Pi) \right] = -\frac{1}{\beta_i} \int \pi^i(\mathbf{a}^i \mid S)\, \log \frac{\pi^i(\mathbf{a}^i \mid S)}{q^i(\mathbf{a}^i)\, e^{\beta_i U^i(S, \Pi)}}\, d\mathbf{a}^i = -\frac{1}{\beta_i}\,\mathrm{KL}(\pi^i \| \psi^i), \qquad (4) $$

where $\psi^i(\mathbf{a}^i \mid S) \propto q^i(\mathbf{a}^i)\, e^{\beta_i U^i(S, \Pi)}$. Since the KL-divergence is non-negative, the maximum of $-\frac{1}{\beta_i}\mathrm{KL}(\pi^i \| \psi^i)$ is obtained only when $\mathrm{KL}(\pi^i \| \psi^i) = 0$, which means $\pi^i = \psi^i$. Therefore, the optimal action sequence distribution of agent i (while keeping other agents' strategies fixed) under the bounded rationality constraint is

$$ \pi^{i,*}(\mathbf{a}^i \mid S, \Pi^{-i}) = \frac{1}{Z}\, q^i(\mathbf{a}^i)\, e^{\beta_i U^i(S, \Pi)}, \qquad (5) $$

where $Z = \int q^i(\mathbf{a}^i)\, e^{\beta_i U^i(S, \Pi)}\, d\mathbf{a}^i$ is a normalization constant. This bounded-optimal strategy provides an intuitive and explainable way to trade off between computation and performance. As $\beta_i$ increases from 0 to infinity, the agent becomes more rational and requires more time to compute the optimal behavior. When $\beta_i = 0$, the bounded-rational policy becomes the prior, meaning agent i has no computational resources to leverage. On the other hand, when $\beta_i \to \infty$, the agent becomes entirely rational and selects the optimal action sequence deterministically. The rationality parameter $\beta$ allows us to model agents with different amounts of computational resources.
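The effect of β is easy to see on a toy discrete example. The sketch below evaluates the normalized form of Eq. (5) for three candidate action sequences; the utilities and prior values are invented purely for illustration.

```python
import numpy as np

U = np.array([1.0, 2.0, 3.0])        # utilities of three candidate sequences
q = np.array([0.5, 0.3, 0.2])        # default (prior) policy

def bounded_rational_policy(beta):
    """Normalized form of Eq. (5): pi* proportional to q * exp(beta * U)."""
    w = q * np.exp(beta * U)
    return w / w.sum()

print(bounded_rational_policy(0.0))   # beta = 0: exactly the prior q
print(bounded_rational_policy(1.0))   # shifted toward high-utility actions
print(bounded_rational_policy(50.0))  # beta -> inf: nearly deterministic argmax
```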
Similar to the rational case, our goal is to find the Nash Equilibrium strategy
profile for a group of bounded-rational agents whose bounded-optimal policies
are defined in Eq. (5). This can be done using the IBR procedure analogous to
Sect. 3. Instead of optimizing the policy in Eq. (1), each bounded-rational agent
finds the optimal strategy distribution defined in Eq. (5) while keeping other
agents’ strategies fixed. This procedure is carried out for each agent for a fixed
number of iterations or until no agent’s stochastic policy changes.

4.2 Importance Sampling for Computing Bounded-Rational Strategies
The previous section described the bounded rationality concept, its integration with the game-theoretic framework, and the IBR numerical method used to solve for a bounded rational Nash Equilibrium strategy profile. To actually compute the bounded-rational strategies, we need an efficient way to query samples from the distribution in Eq. (5) for each agent. Since it is relatively easy to sample actions from the default policy $q^i(\mathbf{a}^i)$, we can use importance sampling to estimate the expectation of the optimal action sequence as the best response for agent i while keeping the others' actions fixed [2]:

$$ \mathbb{E}_{\mathbf{a}_t^{i,*} \sim \pi_t^{i,*}}\!\left[ \mathbf{a}_t^{i,*} \mid S_t, \Pi_t^{-i} \right] = \frac{1}{Z} \int \mathbf{a}_t^i\, q_t^i(\mathbf{a}_t^i)\, e^{\beta U^i(S_t, \Pi_t)}\, d\mathbf{a}_t^i = \frac{1}{Z}\, \mathbb{E}_{\mathbf{a}_t^i \sim q_t^i(\mathbf{a}_t^i)}\!\left[ w(\mathbf{a}_t^i)\, \mathbf{a}_t^i \right] \approx \frac{1}{Z} \frac{1}{K} \sum_{k=1}^{K} w(\mathbf{a}_{t,k}^i)\, \mathbf{a}_{t,k}^i, \qquad (6) $$

where $w(\mathbf{a}_t^i) = \exp\{\beta \sum_{k=t}^{H+t} r_k^i(S_k, A_k)\}$ and $\mathbf{a}_{t,k}^i$ denotes the $k$-th sample from the default policy, with K samples in total. Similarly, the normalization constant
can also be approximated as

$$ Z = \int q^i(\mathbf{a}_t^i)\, e^{\beta U^i(S_t, \Pi_t)}\, d\mathbf{a}_t^i = \mathbb{E}_{\mathbf{a}_t^i \sim q^i(\mathbf{a}_t^i)}\!\left[ w(\mathbf{a}_t^i) \right] \approx \frac{1}{K} \sum_{k=1}^{K} w(\mathbf{a}_{t,k}^i). \qquad (7) $$

At a high level, the importance sampling procedure proceeds as follows. The agents first propose action sequence samples from their default policies $q_t^i(\mathbf{a}^i)$ and then assign each sequence a weight $w(\mathbf{a}_t^i)$ indicating its value based on the agents' utilities and rationality levels. Finally, by combining Eq. (6) and Eq. (7), we can use the weighted average to compute the expected optimal action sequence

$$ \mathbb{E}_{\mathbf{a}_t^{i,*} \sim \pi_t^{i,*}}\!\left[ \mathbf{a}_t^{i,*} \mid S_t, \Pi_t^{-i} \right] \approx \frac{\sum_{k=1}^{K} w(\mathbf{a}_{t,k}^i)\, \mathbf{a}_{t,k}^i}{\sum_{k=1}^{K} w(\mathbf{a}_{t,k}^i)}. \qquad (8) $$
To find the bounded-rational Nash Equilibrium strategy profile, we replace the optimization step in the IBR procedure of Eq. (2) with the above importance sampling. One important observation is that the shape of the prior distribution $q^i$, the number of samples K used for evaluation, and the rationality level $\beta$ all play essential roles in the final performance. Their relationships are explored in the experimental section.
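Putting Eqs. (6)-(8) together, the best response reduces to a self-normalized, softmax-weighted average of sampled action sequences. A sketch follows; sample_prior and utility_fn are hypothetical stand-ins for drawing from $q^i$ and for a rollout of agent i's utility with the other agents' strategies held fixed.

```python
import numpy as np

def sampled_best_response(i, profile, sample_prior, utility_fn, beta, K=1024,
                          rng=None):
    """Eq. (8): expected bounded-optimal action sequence for agent i.

    sample_prior(K, rng) draws K action sequences from q^i; utility_fn(i, seq,
    profile) returns U^i with the other agents' strategies held fixed.
    """
    rng = rng if rng is not None else np.random.default_rng()
    seqs = np.asarray(sample_prior(K, rng))      # shape (K, H, action_dim)
    utils = np.array([utility_fn(i, s, profile) for s in seqs])
    logw = beta * utils
    logw -= logw.max()                           # stabilize exp(); cancels in Eq. (8)
    w = np.exp(logw)
    w /= w.sum()                                 # self-normalized weights
    return np.tensordot(w, seqs, axes=1)         # weighted average of sequences
```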

5 Simulated Experiments
We conduct extensive simulation experiments to demonstrate that integrating bounded rationality with the game-theoretic framework (1) allows each agent to reason about other agents' rationality levels in order to exploit (or avoid) others with lower (or higher) computational capabilities, and (2) explicitly trades off between performance and computation by varying the rationality level β and the number of sampled trajectories. We also show qualitatively that our method can generate group behaviors that approximately reach a bounded rational Nash Equilibrium strategy profile for a varying number of agents.

5.1 Simulation Setup


In this experiment, we consider the task of navigating a group of aerial vehicles in a 3D space. Each agent's goal is to swap its position with the diametrically opposite one while avoiding collisions with the other agents. The distance between each agent and its goal is 6 m. This environment is neither fully cooperative, as each agent has a different goal location, nor fully competitive, because their objectives are not totally reversed (zero-sum). Thus, the agents need to collaborate with each other to avoid blockage in the center and at the same time compete with each other to follow their shortest paths to the goals. The agents are homogeneous in terms of their sizes and physical capabilities. Each agent has a size of 0.125 m in the x, y, z directions (similar to the dimensions of the Crazyflie drones used in the next section). Similar to [24], we set their transition functions using a deterministic single-integrator model with minimum and maximum speeds $a_{\min} = 0$ m/s and $a_{\max} = 1$ m/s. We set a uniform prior $q^i(\mathbf{a}) = \mathrm{Uniform}(a_{\min}, a_{\max})$ as the default policy for all the agents. The number of IBR iterations is set to 10, as we found that under varying parameters, IBR usually converges to a bounded rational NE strategy within 10 iterations. In all the simulations, we assume that the rationality levels of all the agents are known to each other. The one-step reward function is the same for each agent and is set to penalize collisions and large distances to the goal. We run each simulation 50 times with T = 80 timesteps.
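For this setup, the uniform prior over action sequences can be sampled directly and rolled out with the single-integrator model. A sketch follows; the horizon length, timestep, and the direction-times-speed velocity parameterization are our own assumptions.

```python
import numpy as np

A_MIN, A_MAX, DT, H = 0.0, 1.0, 0.1, 10          # horizon/timestep are assumptions

def sample_uniform_sequences(K, rng):
    """Draw K action sequences from the uniform default policy q^i.

    Each action is a 3D velocity with a uniformly random direction and a
    speed drawn uniformly from [A_MIN, A_MAX].
    """
    dirs = rng.normal(size=(K, H, 3))
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    speeds = rng.uniform(A_MIN, A_MAX, size=(K, H, 1))
    return dirs * speeds

def rollout(x0, seq):
    """Deterministic single-integrator rollout: x_{k+1} = x_k + DT * a_k."""
    xs = [np.asarray(x0, dtype=float)]
    for a in seq:
        xs.append(xs[-1] + DT * a)
    return np.stack(xs)
```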

5.2 Results
We first show that using the proposed framework agents can naturally reason
about the behaviors of others with the same physical capabilities but different
Fig. 1. (a) Comparison of the performance by increasing the ego’s rationality level from
β = 0.01 to β = 0.13 while keeping other agents’ rationality levels fixed in six (solid
lines) and eight (dashed lines) agent environments. The x-axis and y-axis indicate the
ego’s β values and traveled distances. The green lines are the average travel distance
of other agents and the red lines indicate the ego's travel distance. (b)(c) Show the agents' trajectories in the six-agent environment with β = 0.01 and β = 0.13, respectively. The ego's trajectories are depicted in red.

rationality levels β. Since we want to examine how varying β affects the performance, we need to ensure that the policy converges to the “optimal” one under the computational constraints. Thus, we sample a large number of trajectories, $K = 5 \times 10^5$, for policy computation for each agent. In this simulation, we fix the other agents' β = 0.05 and vary one agent's (called the ego agent) β from 0.01 to 0.13, and compare the travel distances between the ego and the other agents (the distance of
other agents is averaged). Figure 1(a) shows the performance comparison in six
and eight agent environments. In general, when the ego has a lower rationality
level β than other agents, it avoids them by taking a longer path. When all the
agents have the same β, the ego and other agents have a similar performance.
As β increases, the ego begins to exploit its advantage in computing more ratio-
nal decisions and taking shorter paths. We also notice that the ego generally
performs better with a large β when there are more agents in the scene. The
trajectories of the six-agent environment for β = 0.01 and β = 0.13 are plotted
in Fig. 1(b) and Fig. 1(c), respectively. When the ego’s β = 0.01 (in the red tra-
jectory), it takes a detour by elevating to a higher altitude to avoid other agents.
In contrast, when its β = 0.13, it pushes other agents aside and takes an almost
straight path. These results are aligned with Fig. 1(a). We omit the trajectories of the eight-agent environments to avoid clutter. The readers are encouraged to watch the videos at https://youtu.be/hzCitSSuWiI.
Next, we evaluate the performance of a group of agents with the same β
to show that the trade-off between the performance and computation can be
directly controlled by the rationality level. Since most of the computation occurs
when evaluating a large number of sampled trajectories in the proposed impor-
tance sampling-based method, we can use the number of evaluated trajectories as
a proxy to measure the amount of computation an agent possesses. In Fig. 2(c),
we analyze the relationship between β and the computation constraint in the
six-agent environment. We can observe that when the computation is limited,
Fig. 2. (a) Compares the performance (average traveled distance of the group) of different β values using a different number of trajectories in the six-agent environment. The x and y axes are the number of trajectories and the average traveled distance of the group. (b) Shows the number of trajectories required to converge for different β in four-, six-, and eight-agent environments.

Fig. 3. Minimum distances of each agent to other agents at every timestep in (a)
four-agent (b) six-agent (c) ten-agent environments. The x-axis and y-axis are the
timesteps and agents’ minimum distances, respectively. The colored solid lines repre-
sent the statistics of each agent, which is the same as their trajectory color in Fig. 4.
The dashed grey line shows the minimum safety distance (0.25m) that needs to be
maintained to avoid collisions.

i.e., only a small number of action sequences can be evaluated, a larger β (more
rational) actually hurts the performance. When the more rational agents have
the resources to evaluate more trajectories, they travel less distance on average
than the less rational groups. This result demonstrates that, by controlling the rationality parameter, the bounded rational framework can effectively trade off between optimality and limited computation. In Fig. 2(b), we also evaluate the
number of trajectories that need to be sampled for the method to converge at
different rationality levels in four, six, and eight agent environments. The result
aligns with the previous observation – in general, when β is larger, more trajec-
tories need to be evaluated to converge to “optimal” under the bounded rational
constraints. Furthermore, when more agents are present in the environment, the
method requires more trajectories to converge.
Fig. 4. Agents’ trajectories in environments with four, six, and ten agents. The three
columns show the snapshots at t = 20, 60, 80, respectively. Each agent, its trajectory,
and its goal are assigned the same unique color. We only show the last 10 steps of
trajectories to avoid clutter.

Finally, we show qualitative trajectory results of the group behaviors under bounded rationality using a fixed β = 0.1 and number of samples $K = 5 \times 10^5$
for each agent in Fig. 4. Additionally, we provide the minimum distances of each
agent to other agents in Fig. 3. The result shows that in each environment, the
agent can maintain a safe distance > 0.25m to avoid collisions.

6 Physical Experiments

We use the Crazyflie 2.1 nano-drones under a motion capture system to validate
our method in the real world. For this hardware experiment, we consider two
types of tasks with a varying number of agents. The first task is to navigate
a group of drones to a designated goal region while avoiding static obstacles
and inter-drone collisions. The second task is position swapping similar to the
previous section. The size of the workspace considered in all the experiments
is 4.2 m × 5.4 m × 2 m in the x, y, and z axes, and the size of the drone is
Fig. 5. (a) Shows the experimental setup with 4 Crazyflie drones. Green arrow points
to the goal position. All the agents are navigating to the same goal point. (b) Snapshot
of the experiment. Shows the path followed by the agents to avoid the obstacles.

Fig. 6. Shows the experimental setup with 6 Crazyflie drones. The drones are divided
into two groups - red and green. (a) Shows the initial position of the drones. The task
is to swap the positions. Black lines show assigned swapping tasks among agents. (b)
Snapshot of the experiment during swapping. It shows the path followed by the agents
to avoid collision with each other.

92 mm × 92 mm × 29 mm. To mitigate the downwash effect and control inac-


curacy, we buffer the drone’s collision size to be 0.5m × 0.5m × 0.5m. We use
Crazyswarm [18] platform to control the drones. For each run, we generate tra-
jectories using N = 3 × 105 and β = 0.1 for all the agents using the proposed
method. These trajectories contain a sequence of (x, y, z) coordinates. We use
minimum snap trajectory optimization and control strategy [13] to track the
trajectories generated by the proposed planner.
We show two representative scenarios for each task. For complete experimental videos, please refer to https://youtu.be/hzCitSSuWiI. Figure 5 shows that a group of four drones has to go from the red zone to the green zone while avoiding four obstacles of various sizes distributed around the center of the workspace. Note that one of the drones opted to go over the obstacle of height 1.5 m, which shows that it finds a path through a space as narrow as the size
of the drone (0.5m) in the z-axis. This event is captured in the snapshot shown
in Fig. 5(b). Figure 6 shows the position swapping scenario. We use the same
dynamics model as the previous section to generate the trajectories. We observe
that the outcomes of the physical experiments are consistent with the results
obtained in the simulation.

7 Conclusion
This paper considers the problem of making sequential decisions for agents with finite computational resources, where they need to interact with each other to complete their designated tasks. This problem is challenging because each agent needs to evaluate an infinite number of possible decisions (e.g., waypoints or actuator commands) and reason about how others will respond to its behavior. While the game-theoretic formulation provides an elegant way to describe this problem, it is based on an unrealistic assumption that agents are perfectly rational and have the ability to evaluate the large decision space. We propose a formulation that replaces this rationality assumption with the bounded rationality concept and present a sampling-based approach to computing agents' policies under their computational constraints. As shown in the experiments, by removing the perfect rationality assumption, the proposed formulation allows the agents to take advantage of those with less computational power or avoid those that are more computationally capable. Additionally, when all the agents are similarly computationally capable, they exhibit behaviors that avoid being taken advantage of by others.

References
1. Abdelkader, M., Güler, S., Jaleel, H., Shamma, J.S.: Aerial swarms: recent appli-
cations and challenges. Current Rob. Reports 2(3), 309–320 (2021)
2. Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4. Springer (2006)
3. Botvinick, M., Toussaint, M.: Planning as inference. Trends Cogn. Sci. 16(10),
485–488 (2012)
4. Chen, Y.F., Everett, M., Liu, M., How, J.P.: Socially aware motion planning with
deep reinforcement learning. In: 2017 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS), pp. 1343–1350. IEEE (2017)
5. Fisac, J.F., Bronstein, E., Stefansson, E., Sadigh, D., Sastry, S.S., Dragan, A.D.:
Hierarchical game-theoretic planning for autonomous vehicles. In: 2019 Interna-
tional Conference on Robotics and Automation (ICRA), pp. 9590–9596. IEEE
(2019)
6. Genewein, T., Leibfried, F., Grau-Moya, J., Braun, D.A.: Bounded rationality,
abstraction, and hierarchical decision-making: An information-theoretic optimality
principle. Front. Robot. AI 2, 27 (2015)
7. Gigerenzer, G., Brighton, H.: Homo heuristicus: why biased minds make better
inferences. Top. Cogn. Sci. 1(1), 107–143 (2009)
8. Ingrand, F., Ghallab, M.: Deliberation for autonomous robots: a survey. Artif.
Intell. 247, 10–44 (2017)
9. Kappen, H.J., Gómez, V., Opper, M.: Optimal control as a graphical model infer-
ence problem. Mach. Learn. 87(2), 159–182 (2012)
10. Lambert, A., Fishman, A., Fox, D., Boots, B., Ramos, F.: Stein variational model
predictive control. arXiv preprint arXiv:2011.07641 (2020)
11. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learn-
ing. In: Machine Learning Proceedings 1994, pp. 157–163. Elsevier (1994)
12. Lütjens, B., Everett, M., How, J.P.: Safe reinforcement learning with model uncer-
tainty estimates. In: 2019 International Conference on Robotics and Automation
(ICRA), pp. 8662–8668. IEEE (2019)
13. Mellinger, D., Kumar, V.: Minimum snap trajectory generation and control for
quadrotors. In: 2011 IEEE International Conference on Robotics and Automation,
pp. 2520–2525 (2011)
14. Ortega, P.A., Braun, D.A.: Thermodynamics as a theory of decision-making
with information-processing costs. Proc. Royal Soc. A: Mathem. Phys. Eng. Sci.
469(2153), 20120683 (2013)
15. Ortega, P.A., Braun, D.A., Dyer, J., Kim, K.-E., Tishby, N.: Information-theoretic
bounded rationality. arXiv preprint arXiv:1512.06789 (2015)
16. Osborne, M.J., et al.: An introduction to game theory, vol. 3. Oxford University
Press New York (2004)
17. Pacelli, V., Majumdar, A.: Robust control under uncertainty via bounded ratio-
nality and differential privacy. arXiv preprint arXiv:2109.08262 (2021)
18. Preiss, J.A., Hönig, W., Sukhatme, G.S., Ayanian, N.: Crazyswarm: a large nano-
quadcopter swarm. In: IEEE International Conference on Robotics and Automa-
tion (ICRA), pp. 3299–3304. IEEE (2017). https://github.com/USC-ACTLab/
crazyswarm
19. Reeves, D., Wellman, M.P.: Computing best-response strategies in infinite games
of incomplete information. arXiv preprint arXiv:1207.4171 (2012)
20. Schwarting, W., Alonso-Mora, J., Rus, D.: Planning and decision-making for
autonomous vehicles. Annual Rev. Control Robot. Autonom. Syst. 1, 187–210
(2018)
21. Schwarting, W., Pierson, A., Karaman, S., Rus, D.: Stochastic dynamic games in
belief space. IEEE Trans. Robot. (2021)
22. Simon, H.A.: A behavioral model of rational choice. Q. J. Econ. 69(1), 99–118
(1955)
23. Spica, R., Cristofalo, E., Wang, Z., Montijano, E., Schwager, M.: A real-time game
theoretic planner for autonomous two-player drone racing. IEEE Trans. Rob. 36(5),
1389–1403 (2020)
24. Wang, M., Wang, Z., Talbot, J., Gerdes, J.S., Schwager, M.: Game theoretic plan-
ning for self-driving cars in competitive scenarios. In: Robotics: Science and Sys-
tems (2019)
25. Williams, G., Goldfain, B., Drews, P., Rehg, J.M., Theodorou, E.A.: Best response
model predictive control for agile interactions between autonomous ground vehi-
cles. In: 2018 IEEE International Conference on Robotics and Automation (ICRA),
pp. 2403–2410. IEEE (2018)
26. Williams, G., et al.: Information theoretic mpc for model-based reinforcement
learning. In: 2017 IEEE International Conference on Robotics and Automation
(ICRA), pp. 1714–1721. IEEE (2017)
Distributed Multi-robot Information
Gathering Using Path-Based Sensors
in Entropy-Weighted Voronoi Regions

Alkesh Kumar Srivastava¹(B), George P. Kontoudis¹, Donald Sofge², and Michael Otte¹

¹ University of Maryland, College Park, MD 20742, USA
{alkesh,kont,otte}@umd.edu
² U.S. Naval Research Laboratory, Washington, DC 20375, USA
donald.a.sofge.civ@us.navy.mil

Abstract. In this paper, we present a distributed information-gathering
algorithm for multi-robot systems that use multiple path-based sensors
to infer the locations of hazards within the environment. Path-based sen-
sors output binary observations, reporting whether or not an event (like
robot destruction) has occurred somewhere along a path, but without the
ability to discern where along a path an event has occurred. Prior work
has shown that path-based sensors can be used for search and rescue
in hazardous communication-denied environments—sending robots into
the environment one-at-a-time. We extend this idea to enable multiple
robots to search the environment simultaneously. The search space con-
tains targets (human survivors) amidst hazards that can destroy robots
(triggering a path-based hazard sensor). We consider a case where com-
munication from the unknown field is prohibited due to communication
loss, jamming, or stealth. The search effort is distributed among multiple
robots using an entropy-weighted Voronoi partitioning of the environ-
ment, such that during each search round all regions have approximately
equal information entropy. In each round, every robot is assigned a region
in which its search path is calculated. Numerical Monte Carlo simula-
tions are used to compare this idea to other ways of using path-based
sensors on multiple robots. The experiments show that dividing search
effort using entropy-weighted Voronoi partitioning outperforms the other
methods in terms of the information gathered and computational cost.

Keywords: path planning · multi-robot exploration · distributed decision
making · Shannon Information Theory · information-driven partitioning ·
Bayesian Probability

1 Introduction

Autonomous agents are increasingly used in scientific, commercial, and military
applications. In critical missions of locating human survivors after disasters,

Fig. 1. Information gathering with two robots, where the search space is decomposed
into two entropy-weighted Voronoi partitions. The dotted lines represent the paths that
each agent traverses in the three search rounds. Robot 1 efficiently traverses the path
of round 1, reporting a potential target at the location shown in green. However, it
gets destroyed in rounds 2 and 3, and thus we only know that a hazard exists along
these two paths. The overlapping areas of rounds 2 and 3 for robot 1 have a higher
probability of hazard existence (darker gray). Robot 2 gets destroyed in round 1, and
survives rounds 2 and 3. Since robot 2 reports a target at the same location in rounds 2
and 3, the probability of target existence is higher (darker green).

the deployment of multi-robot systems may be more efficient than deploying
a single agent. In this work, we are primarily motivated by search and rescue
missions for human survivors in lethally hostile environments. We consider a
general case where information exchange between agents in the field is prohibited.
Yet, the locations of base stations (where robots start and end each search
round's journey through the environment) provide connectivity to a shared
centralized server (i.e., before and after each journey). The environment is
completely unknown and lethally hostile, such that agents may be destroyed by
stationary hazards during each search round, and the locations of hazards are
initially unknown. Due to the lack of communication in the field, agent destruction
happens at an unknown location along the agent's path and leads to the loss of
any information collected about targets along that path. However, agent loss also
provides information about the locations of hazards within the environment (see
Fig. 1). Even though we do not know where along the path the event of robot
destruction has occurred, the shape of the path constrains the set of possible
locations at which the hazard may have destroyed the agent.
The idea that a path’s shape provides information that can be used to update
a belief map (e.g., of hazard presence) given that an event (e.g., robot destruc-
tion) has occurred somewhere along a path is called a path-based sensor [1].
The objective of the work described in the current paper is to extend the use
of path-based sensors to the case where multiple robots explore the environ-
ment simultaneously and in a distributed fashion, to improve the computational
efficiency and accelerate the convergence speed of the hazard and target belief
maps.
The proposed distributed methodology decomposes this problem into two
distinct sub-problems: decomposition of the search space, and path planning in the
decomposed space. Thus, this technique is a decoupled approach consisting of
two steps: (i) centralized Shannon entropy-driven partitioning of the environment
into a set of disjoint regions (one region per agent); and (ii) distributed
information-theoretic path planning for each agent in its region. The first step assigns
each agent a particular partitioned region in the search space. To ensure that
each point in the partitioned space is given to the nearest base station and affects
the entropy of the partitioned space the least, we use entropy-weighted Voronoi
partitioning. This distributes the workload among the agents for the next
step, where each agent simultaneously and locally plans an information-driven
path in its corresponding search space, as described in Fig. 1, thereby helping
the multi-robot system to efficiently and robustly infer the hazard and target
belief maps of the environment.
Related Work: The concept of path-based sensors was introduced in [1], which
focused on the scenario where multiple agents explored the environment sequen-
tially. The current paper compares three different extensions of the path-based
sensor idea in which M agents explore the environment in parallel during each
search round.
We now discuss two different bodies of related work. We begin by reviewing
the literature that studied the passive cooperation problem for coverage path
planning (CPP) algorithms, and then we review work on information gathering.
CPP algorithms use a combination of environmental partitioning and passive
communication [3]. In [4], Nair and Guruprasad use a partition-and-cover
approach based on Voronoi partitioning to achieve passive cooperation between
agents to avoid task duplicity. Distributed strategies are used in [5–12] to keep
track of regions covered. The use of geodesic or Manhattan distance as a dis-
tance metric to improve exploration is proposed in [13]. Cooperative sweep-
ing of the environment by multiple robots is studied in [14] and the use of
entropy as a weight for Voronoi tessellation was discussed in [15]. In our work,
we employ a divide-and-conquer approach to environmental partitioning using
entropy-weighted Voronoi decomposition. Our work assumes that base stations
have a centralized communication topology, and we focus on using path-based
sensors to gather information about hazards when communication is denied in
the field.
Recursive Bayesian filters are used in the literature to update belief maps of
the environment [16,17]. However, [17] assumed knowledge of the exact location of
malfunctioning agents, i.e., the false positive location is known. A formal derivation
of the mutual information gradient for information gathering is presented in [18]. The
latter introduced information surfing, the idea of using the gradient of mutual
information for information gathering, and the work is extended in [19,20], but
with the assumption that a hazard could not destroy the agents. The focus
of this work is on combining path-based sensors and information gathering in
environments with lethal hazards and targets, using a team of imperfect robots
that may malfunction (false positive) or report false negative observations.
Contribution: The contribution of this paper is twofold. First, we present a
distributed methodology for information gathering in a multi-robot system using
path-based sensors. The proposed technique is a decoupled two-step scheme
that uses a distributed information-theoretic mechanism for partitioning the
environment and a local information-theoretic planner to maximize information
gathering. Second, we also synthesize a centralized global planner for information
gathering in a multi-robot system that uses a sequential Bayesian filter to calculate
expected belief maps of the environment.

2 Problem Formulation
Consider a spatial environment divided into $a \times b$ discrete cells composing
the search space $S \subset \mathbb{R}^2$. A team of $M$ agents is deployed in the environment,
where each agent is denoted by $i = 1, \ldots, M$. The search space includes $n$ base
stations $D = \{d_1, d_2, \ldots, d_n\} \subset S$, where $n \le M$. Note that in this paper we
consider the case of $n = M$. Each agent $i$ is located at its corresponding base
station $d_i$ to explore a specific bounded region $S_i \subset S$. The regions of
exploration $S_i$ together constitute the search space, $\bigcup_{i=1}^{M} S_i = S$, with no overlap,
i.e., $S_i \cap S_j = \emptyset$ for all $i \neq j$, where $i, j \in \{1, 2, \ldots, M\}$. Each agent $i$ can visit
up to $l_i$ cells $c$ in its search space, forming the path $\zeta_i = (c_1, c_2, \ldots, c_{l_i}) \subseteq S_i$,
such that the exploration starts and ends at the agent's corresponding base
station $d_i$, i.e., $c_1 = c_{l_i} = d_i$. When all the surviving agents reach their
respective base stations at the end of their exploration, we terminate one search
round $r \in \mathbb{Z}_{>0}$. We define the time taken by an agent $i$ to move from cell $c_j$
to $c_{j+1}$ as one timestep $t$. Thus, the duration of a search round is
$t_r := \max_{i \in \{1,\ldots,M\}} l_i t$.

Definition 1. Path-based sensor [1]
A sensor that reports whether or not an event has occurred somewhere along a
path, but has no conception of where along the path that event has taken place.

The search space includes hazardous elements Z = {0, 1} that may destroy
the agent, where Z = 1 indicates the presence of a hazard, while Z = 0 denotes
the absence of a hazard in a particular cell. If the agent i is destroyed anywhere
along the path ζi then the path-based sensor (Definition 1) is triggered (Θ = 1),
and if the agent i survives the path ζi then we consider that the path-based
sensor is not triggered (Θ = 0). We assume that the path-based sensor may
report false positive and false negative triggering. False-positive accounts for
faulty or malfunctioning robots that get destroyed regardless of the presence of
a hazard, whereas false-negative accounts for the chance of the robot surviving
a cell despite having a hazard. The search space also includes some elements of
interest, hereafter referred to as targets, X = {0, 1}, where X = 1 indicates the
presence of a target, while X = 0 denotes the absence of a target in a particular
cell. The presence of a target is recorded by a noisy sensor that may also report
a false positive or a false negative observation of the target. The sensor used for
the detection of targets is not a path-based sensor. In other words, if a robot
survives a path, the target sensor reports the exact whereabouts of the target
observation along the survived path.
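
To make the path-based sensor model concrete, the following Python sketch shows
one way a per-cell Bayesian hazard update from a single path-based observation
could be implemented. It assumes independent per-cell hazard beliefs and uses the
malfunction and lethality rates that appear later in Sect. 4 purely as illustrative
defaults; the function and variable names are hypothetical, not the authors'.

import numpy as np

def update_hazard_belief(belief, path, triggered,
                         lethality=0.9, p_malfunction=0.05):
    """One Bayesian update of per-cell hazard beliefs from a single
    path-based observation (a sketch; cell independence is assumed).

    belief    : dict mapping cell -> prior P(hazard in cell)
    path      : list of cells the robot attempted to traverse
    triggered : True if the robot was destroyed somewhere along the path
    """
    # Chance a visited cell destroys the robot, marginalized over the
    # (unknown) presence of a hazard in that cell.
    kill_if_hazard = p_malfunction + (1 - p_malfunction) * lethality
    p_kill = {c: p_malfunction + (1 - p_malfunction) * lethality * belief[c]
              for c in path}
    p_survive_all = float(np.prod([1.0 - pk for pk in p_kill.values()]))
    p_trigger = 1.0 - p_survive_all
    posterior = dict(belief)
    for c in p_kill:
        # Likelihood of the observed outcome conditioned on cell c's state.
        survive_others = p_survive_all / (1.0 - p_kill[c])
        p_trig_hazard = 1.0 - (1.0 - kill_if_hazard) * survive_others
        if triggered:
            posterior[c] = p_trig_hazard * belief[c] / max(p_trigger, 1e-12)
        else:
            posterior[c] = (1.0 - p_trig_hazard) * belief[c] / max(p_survive_all, 1e-12)
    return posterior

Under these assumptions, a destroyed robot raises the hazard belief of every cell
along its path, most strongly where several failed paths overlap, matching the
behavior illustrated in Fig. 1.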

All agents start their exploration simultaneously at the beginning of a search
round. Inter-agent communication is prohibited at all times. However, the
successful traversal of a path by an agent provides information about the absence of
hazards, and each surviving agent also transmits information about targets along
its path to its base station. In other words, each agent i reports its observations
only to its base station di. A central server σ can obtain this information from
each base station to update the global belief map.

Problem 1. Given a team of M agents, with inter-agent communication prohibited
at all times, in an environment with multiple available base stations D that
communicate with a central server σ, the task of the agents is to explore a
completely unknown environment efficiently and gather information about targets
X and hazards Z.

Remark 1. The use of the centralized server to partition the environment and
update shared belief maps is convenient, but not required. If a completely decen-
tralized and distributed algorithm is desired, then the use of the centralized
server can be eliminated, assuming that base stations are able to synchronize
their data after each search round. Each base station can perform the same
(deterministic) weighted Voronoi partitioning algorithm in parallel such that all
base stations achieve the same partitioning.

3 Distributed Information Gathering


In this section, we propose three methodologies to address Problem 1. The
first methodology consists of a decoupled two-step scheme: (i) entropy-weighted
Voronoi partitioning as a centralized method to decompose the environment into
spaces Si; and (ii) a mutual-information-based planner from each base station di
that maximizes local information gathering. In the second method, we assume
no partitioning of the environment; thus, information-based planning is executed
over the full search space S, but we deploy a team of robots simultaneously from
various base stations di. Lastly, we discuss a global planning method that
calculates expected belief maps for each agent i before deployment from the same
base station d and then plans an information-driven path for all agents.

3.1 Distributed Entropy Voronoi Partition and Planner (DEVPP)

To address the ambiguous information obtained from a path-based sensor, [1]
proposed integrating over all potential locations of a path-based sensor triggering.
However, this increases the computational complexity of optimal path solutions
beyond short path lengths. Therefore, we propose a distributed strategy with
multiple agents that partitions the environment in a way that maximizes the
expected information gain over the entire search space S without task duplicity.
The structure of the proposed methodology is presented in Fig. 2.

Fig. 2. The block diagram illustrates the decoupled two-step scheme of distributed local
planning with centralized partitioning. In the leftmost block, the base stations $d_i$
communicate their belief maps $X_i^{(r)}, Z_i^{(r)}$ to the central server $\sigma$. Next, the central
server partitions the environment $S$ into sub-search spaces $S_i$ and then transmits this
information to each base station $d_i$. The sub-search spaces $S_i$ are used locally for
information-theoretic planning.

Centralized Entropy Decomposition. For centralized decomposition we use
weighted Voronoi partitioning to decompose the search space $S$ into $M$ smaller
search spaces $\{S_1, S_2, \ldots, S_M\}$, where each robot $i$ is assigned $S_i$. This partitioning
uses the base stations $d_i$ as generators, assigning each agent $i$ to a specific search
space $S_i$. Let $p_i = [p_{i,x}, p_{i,y}] \in S$ denote the location of a cell in the search space
and $p_{d_j} \in S$ denote the location of base station $d_j$, where $i = 1, 2, \ldots, \mathrm{card}(S)$
with $\mathrm{card}(S) = ab$. The geodesic Voronoi partitioning of the search space is
computed by

$$S_i = \{\, p_i \in S \mid g(p_i, p_{d_j}) \le g(p_i, p_{d_k}),\ \forall\, j \neq k \,\}, \qquad (1)$$

where $g(p_i, p_d) = \|p_i - p_d\|$ is the distance between $p_i$ and $p_d$. Next, we redefine
$g(p_i, p_d)$ to incorporate the Shannon entropy of the partition,

$$g_w(p_i, p_d; X_{p_i}, X_{S_{p_d}}) = w(X_{p_i}, X_{S_{p_d}})\, g(p_i, p_d), \qquad (2)$$

where $w(X_{p_i}, X_{S_{p_d}}) = \big(H(X_{p_i}) + H(X_{S_{p_d}})\big) / \big(\mathrm{card}(X_{S_{p_d}}) + 1\big)$ is a weight
representing the average entropy of the expected partition, with entropy given
as $H(X_{p_i}) = -\int_X \mathbb{P}(X_{p_i}) \log \mathbb{P}(X_{p_i})\, dx$. Thus, by substituting (2) into (1), the
entropy-weighted Voronoi partitioning yields

$$S_i = \{\, p_i \in S \mid w(X_{p_i}, X_{S_j})\, g(p_i, p_{d_j}) \le w(X_{p_i}, X_{S_k})\, g(p_i, p_{d_k}),\ \forall\, j \neq k \,\}. \qquad (3)$$

Note that for the initial decomposition of the environment with no prior information,
the entropy-weighted Voronoi partitioning (3) is identical to the geodesic
Voronoi partitioning (1), i.e., $w(X_{p_i}, X_{S_j}) = w(X_{p_i}, X_{S_k})$ for all $j \neq k$.
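
As a rough illustration of (1)–(3), the following Python sketch partitions a
discretized search space using entropy-weighted distances. It assumes Euclidean
distances in place of geodesic ones and approximates the region entropy
$H(X_{S_{p_d}})$ by a sum of per-cell Bernoulli entropies, so it is a sketch of the
idea under these assumptions rather than the authors' exact implementation.

import numpy as np

def binary_entropy(p):
    """Shannon entropy of a Bernoulli belief, elementwise, in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def entropy_weighted_partition(cells, stations, target_belief, n_iters=3):
    """Assign each grid cell to a base station by entropy-weighted distance.

    cells: (N,2) array of cell centers; stations: (M,2) array of base stations;
    target_belief: (N,) array of P(target in cell). Returns labels in {0..M-1}.
    The weights depend on the current partition, so we iterate a few times
    from a plain (unweighted) nearest-station start.
    """
    H = binary_entropy(target_belief)
    dists = np.linalg.norm(cells[:, None, :] - stations[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)          # eq. (1): unweighted start
    for _ in range(n_iters):
        w = np.empty_like(dists)
        for j in range(len(stations)):
            in_j = labels == j
            # eq. (2): average entropy of the cell plus its candidate region
            w[:, j] = (H + H[in_j].sum()) / (in_j.sum() + 1)
        labels = np.argmin(w * dists, axis=1)  # eq. (3)
    return labels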

Local Information-Theoretic Planning. Centralized entropy-weighted
Voronoi partitioning decomposes the problem into single-agent sub-problems
of information gathering, with each agent i stationed at its base station di.

Algorithm 1. Distributed Entropy Voronoi Partitioning and Planner (DEVPP)
Inputs: {X_i^(0)}_{i=1}^M, {Z_i^(0)}_{i=1}^M, S, M, D, r_max
Output: {ζ_i^(r)*}_{i=1}^M
1:  for i = 1, ..., M do                              ▷ Initial decomposition and planning
2:    S_i^(0) ← geodesicPartition(p, p_{d_i})  (1)
3:    ζ_i^(0)* ← calculatePath(X_i^(0), Z_i^(0), S_i^(0))
4:    X_i^(1), Z_i^(1) ← beliefUpdate(X_i^(0), Y_i^(0), Z_i^(0), ζ_i^(0)*)
5:  end for
6:  for r = 2 to r_max do
7:    for i = 1, ..., M do                            ▷ Communication to server
8:      communicate X_i^(r-1), Z_i^(r-1) from base stations to central server
9:    end for
10:   S_i^(r) ← {∅}
11:   for p ∈ S do                                    ▷ Centralized entropy decomposition
12:     for i = 1, ..., M do
13:       g_w ← EntropyWeight(p, p_{d_i}; X_p, X_{S_{p_{d_i}}})  (2)
14:     end for
15:     p_{d_i} ← arg min_{p_{d_i} ∈ D} {g_w}
16:     i ← index(p_{d_i});  S_i^(r) ← S_i^(r) ∪ p
17:   end for
18:   broadcast S_i^(r) from central server to base stations  ▷ Communication from server
19:   for i = 1, ..., M do                            ▷ Local information-theoretic planning
20:     ζ_i^(r)* ← calculatePath(X_i^(r-1), Z_i^(r-1), S_i^(r))
21:     Θ_{ζ_i}^(r), Y_i^(r) ← traversePath(ζ_i^(r)*)
22:     X_i^(r), Z_i^(r) ← beliefUpdate(X_i^(r-1), Y_i^(r), Z_i^(r-1), Θ_{ζ_i}^(r), ζ_i^(r)*)
23:   end for
24: end for

Each agent $i$ is now assigned to find a path $\zeta_i^*$ in its corresponding search
space $S_i$ that maximizes the expected information gained about targets $X$ and
hazards $Z$, given the observed target sensor measurements $Y$ and the path-based
sensor measurement $\Theta_{\zeta_i}$ along a path $\zeta_i$ at exploration round $r$,

$$\zeta_i^{(r)*} = \arg\max_{\zeta_i \in \Omega_i} \left[ I\big(X_i^{(r-1)}; Y_i^{(r)} \mid \Theta_{\zeta_i}^{(r)}, S_i^{(r)}\big) + I\big(Z_i^{(r-1)}; \Theta_{\zeta_i}^{(r)} \mid S_i^{(r)}\big) \right], \qquad (4)$$

where $\Omega_i$ is the space of all possible paths of agent $i$ in its search space $S_i^{(r)}$,
$Y_i^{(r)} = [Y_{i,1}, \ldots, Y_{i,l}] \in \mathbb{R}^l$ denotes all $l$ observations collected at search round
$r$ by agent $i$, and $\Theta_{\zeta_i}^{(r)}$ is the observation of agent $i$'s path-based sensor
triggering. This local path planning problem (4) is addressed using the methodology in [1].
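
Since the exact planner of [1] integrates over all possible triggering locations, a
full implementation is beyond a short sketch. The Python fragment below instead
shows a greedy entropy-seeking stand-in with the paper's 9-grid connectivity and
the start-and-end-at-base constraint, purely to illustrate the interface that
calculatePath would expose; all names are hypothetical.

def neighbors(c):
    # 9-grid connectivity: the 8 adjacent cells plus staying in place.
    x, y = c
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def greedy_entropy_path(base, region, cell_entropy, max_len):
    """Greedy stand-in for the planner of eq. (4): at each step, move to the
    neighbor (within this robot's region S_i) with the highest remaining
    entropy, then retrace to the base station so the path starts and ends
    at d_i. The paper uses the exact mutual-information planner of [1]."""
    path, seen = [base], {base}
    budget = max_len // 2  # reserve half the budget for the return leg
    while len(path) < budget:
        cands = [c for c in neighbors(path[-1]) if c in region]
        if not cands:
            break
        nxt = max(cands, key=lambda c: (cell_entropy.get(c, 0.0), c not in seen))
        path.append(nxt)
        seen.add(nxt)
    return path + path[-2::-1]  # retrace back to the base station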

We illustrate the proposed methodology for partitioning the environment
with M agents and M base stations in Fig. 2. DEVPP (Algorithm 1) describes
the centralized partitioning and local planning methodology, where brown color

Fig. 3. The block diagram illustrates distributed local planning with a centralized
sequential Bayesian filter (SBF). In the leftmost block, all base stations $d_i$ communicate
their local observations $Y_i^{(r)}, \Theta_{\zeta_i}^{(r)}$ to the central server $\sigma$. Next, the central server
updates the belief map sequentially, where the posterior belief map of agent $i$ is used
as the prior for the $i + 1$ update. The posterior belief map is broadcast to the base
stations, which use it for local information-theoretic planning.

indicates an action of the central server σ, and blue color a local action of a base
station di. After the environment has been partitioned successfully at the central
entity, each agent i plans a path using the calculatePath algorithm introduced in
[1]. DEVPP terminates after a fixed number of rounds rmax.

3.2 Multi-agent Distributed Information-Theoretic Planner (MA-DITP)

Unlike DEVPP, this approach does not perform any partitioning of the search
space. Thus, each agent $i$ explores the entire search space $S$ and plans its path
based on the prior belief of hazards and targets in the search space $S$ at every
search round $r$. In other words, the planning is global over the entire search space
$S$ and not local as in Sect. 3.1. After a search round, each agent $i$ transmits the
target observations $Y_i^{(r)}$ and the path-based sensor measurement $\Theta_{\zeta_i}^{(r)}$ to its base
station $d_i$, which subsequently uploads this information to the central server $\sigma$.
The central server then receives the observations from all base stations, $\{\Theta_{\zeta_i}^{(r)}\}_{i=1}^M$
and $\{Y_i^{(r)}\}_{i=1}^M$, and updates the belief maps using sequential Bayesian filtering (SBF).
Next, the central server broadcasts the posterior belief maps to the base stations.
Lastly, each base station $d_i$ computes an information-theoretic path $\zeta_i^{(r+1)*}$ in
the search space $S$, and the robots are assigned to traverse the paths and collect
measurements. The proposed MA-DITP method is presented in Algorithm 2,
where brown color indicates an action of the central entity, and blue color a local
action of a base station $d_i$. The block diagram is illustrated in Fig. 3.

3.3 Multi-agent Global Information-Theoretic Planner (MA-GITP)

In this section, we discuss a global planning method where a team of robots is
deployed simultaneously from the same base station d, instead of from multiple
base stations as in DEVPP and MA-DITP. Using MA-DITP in a single-station,

Algorithm 2. Multi-Agent Distributed Information Theoretic Planner (MA-DITP)
Inputs: {X_i^(0)}_{i=1}^M, {Z_i^(0)}_{i=1}^M, S, M, r_max
Output: {ζ_i^(r)*}_{i=1}^M
1: for r = 1 to r_max do
2:   for i = 1, ..., M do                             ▷ Local information-theoretic planning
3:     ζ_i^(r)* ← calculatePath(X_i^(r-1), Z_i^(r-1), S)
4:     Θ_{ζ_i}^(r), Y_i^(r) ← traversePath(ζ_i^(r)*)
5:     communicate Θ_{ζ_i}^(r), Y_i^(r) from base station to central server
6:   end for
7:   X_i^(r), Z_i^(r) ← beliefUpdate(X_i^(r-1), Y_i^(r), Z_i^(r-1), Θ_{ζ_i}^(r), ζ_i^(r)*)  ▷ Centralized SBF
8:   broadcast X_i^(r), Z_i^(r) from central server to base stations
9: end for

multi-agent scenario would result in redundant and repeated exploration of the
search space $S$, hampering the information-gathering process. To address this
problem, we introduce the idea of using an expected belief map for each agent.
The expected belief map for the first agent is the true belief map computed by
the sequential Bayesian filter (SBF) using observations from the previous search
round $r - 1$. Next, the base station makes a prediction for each possible path-based
sensor observation of agent 1 ($\Theta_{\zeta_1}^{(r)} = 1$ for destruction or $\Theta_{\zeta_1}^{(r)} = 0$ for
survival) and computes the weighted average belief map, which generalizes to

$$Z_{i+1}^{(r)} = p_1^{(r)}\, Z_i^{(r)}\big(\Theta_{\zeta_i}^{(r)} = 1\big) + p_0^{(r)}\, Z_i^{(r)}\big(\Theta_{\zeta_i}^{(r)} = 0\big), \quad \forall i = 1, \ldots, M - 1, \qquad (5)$$

where $p_1^{(r)} = \mathbb{P}(\Theta_{\zeta_i}^{(r)} = 1)$ is the probability of destruction, $p_0^{(r)} = \mathbb{P}(\Theta_{\zeta_i}^{(r)} = 0)$ is
the probability of survival, $Z_i^{(r)}(\Theta_{\zeta_i}^{(r)} = 1)$ is the belief map in case of destruction,
and $Z_i^{(r)}(\Theta_{\zeta_i}^{(r)} = 0)$ is the belief map in case of survival of the previous agent
$i$. Each expected belief map $Z_{i+1}^{(r)}$ is used to compute an information-theoretic
path $\zeta_{i+1}^{(r)}$. The multi-agent global information-theoretic planning method (MA-GITP)
from a single base station is presented in Algorithm 3. Note that Algorithm 3
includes only local updates, thus all actions are illustrated in blue.
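
A minimal sketch of the expected belief map of (5) follows, assuming per-cell
beliefs stored as dictionaries and reusing a beliefUpdate-style routine such as
the hazard-update sketch given in Sect. 2; p_destroy stands for the predicted
probability $p_1^{(r)}$ of the previous agent's destruction, and the names are
illustrative.

def expected_belief_map(Z_i, path, p_destroy, update_fn):
    """Expected hazard belief for agent i+1 before agent i has reported,
    per eq. (5): average the two posteriors that would follow from the two
    possible path-based outcomes of agent i. `update_fn` stands in for the
    paper's beliefUpdate routine."""
    Z_destroyed = update_fn(Z_i, path, triggered=True)
    Z_survived = update_fn(Z_i, path, triggered=False)
    return {c: p_destroy * Z_destroyed[c] + (1 - p_destroy) * Z_survived[c]
            for c in Z_i}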
The multi-agent global planner is inherently resilient to communication channel
attacks, because there is no communication and all processing is performed at a
single base station for all agents. In addition, this approach is practical, as in
most search and rescue missions we typically have access to a single station
for exploration.

Algorithm 3. Multi-Agent Global Information Theoretic Planner (MA-GITP)
Inputs: X^(0), {Z_i^(0)}_{i=1}^M, S, M, r_max
Output: {ζ_i^(r)*}_{i=1}^M
1:  for r = 1 to r_max do
2:    for i = 1, ..., M do                            ▷ Weighted average belief map
3:      ζ_i^(r)* ← calculatePath(X^(r-1), Z_i^(r-1))
4:      Z_{i+1}^(r-1) ← weightedAvgBelief(Z_i^(r-1))  (5)
5:    end for
6:    for i = 1, ..., M do                            ▷ Sequential Bayesian filter
7:      Θ_{ζ_i}^(r), Y_i^(r) ← traversePath(ζ_i^(r)*)
8:      X^(r), Z_i^(r) ← beliefUpdate(X^(r-1), Y_i^(r), Z_i^(r-1), Θ_{ζ_i}^(r), ζ_i^(r)*)
9:    end for
10: end for

4 Experiments and Results


Experimental Setup: To evaluate the performance of the proposed methodologies
we run 30 Monte Carlo (MC) replications for all experiments, where
each MC replication has $r = 150$ search rounds. The spatial environment has
dimensions $a \times b = 15 \times 15$ cells. A cell in the environment can be empty
(contain no hazard and no target), or contain a hazard, a target, or both. The
movement of each agent in the environment is determined by 9-grid connectivity,
i.e., an agent can move to any of the 8 neighboring cells or decide to
stay at its current cell for the next timestep.
Each agent has a malfunction probability of 5% per timestep, and the
target sensor has a false positive and false negative rate of 10% per
timestep. We consider environments with different hazard lethalities of 50%,
70%, and 90%. Lethality refers to the probability of agent destruction when
the agent visits a cell that contains a hazard. We evaluate the proposed methodologies
for various fleet sizes, ranging from $M = 1$ to $M = 5$ agents. In all numerical
experiments, we set the maximum number of timesteps for each path to twice
the Manhattan distance between its base station and the furthest corner of its
search space $S_i$, i.e., $l_i = 2 \max_{p_i \in S_i} \|p_{d_i} - p_i\|_1$.
Experiment 1: In this set of experiments, we compare the efficiency of DEVPP
against [1] for different fleet sizes. Note that for the case of $M = 1$ agent,
DEVPP is identical to [1]. We deploy up to $M = 5$ robots at $\mathrm{card}(D) = 5$ distinct
base stations. In Fig. 4, we demonstrate the progress of the information entropy of
the environment during 150 search rounds. As the number of agents increases, the
information entropy reduces faster for all hazard lethality cases. Contrary to
the notion that adding more agents may increase the computational cost of the
experiments, Fig. 5 (Left) shows that the time elapsed per agent to complete
150 search rounds decreases with an increasing number of agents. This significant
reduction in computational time is attributed to the smaller search sub-spaces for
planning with multiple robots. Although the time elapsed for the entire planning
and execution decreases with an increasing number of robots, the time taken by the

Fig. 4. Information entropy of the environment with hazard lethalities of 50%, 70%,
and 90% using the DEVPP method. As the fleet size increases, the information entropy
of the environment decreases faster.

Fig. 5. (Left) Computational time required for each agent to perform local computations
for r = 150 search rounds, using DEVPP for M ≥ 2 and [1] for M = 1. As
the fleet size increases, the computation is distributed among the agents. (Right) Computational
time required by the central server σ to partition the environment using (3).
As the size of the fleet increases, the time required to compute the partitioned search
spaces Si ⊂ S increases.

central server σ to compute the partitioning increases with the addition of robots in
the search space S, as shown in Fig. 5 (Right). The computations of the central
server σ are insignificant compared to the local computations of the agents for
small fleet sizes (M = 2, 3), but increase to a level similar to the local computations
for larger fleet sizes (M = 4, 5). In any case, even if we account for the central
computations for all fleet sizes, the proposed DEVPP method outperforms [1].
Experiment 2: Next we compare all methods with each other and with [1]. The
lethality rate is fixed to 90% and we consider three fleet sizes: M = 2, M = 4,
and M = 5 robots. Note that when M = 1 agent, all methods are identical
to [1]. In Fig. 6, we present the progress of the information entropy through r = 150
search rounds. The results show that DEVPP outperforms all other methods;
MA-DITP gathers information competitive with DEVPP; and MA-GITP performs
slightly better than [1] for all fleet sizes. In Fig. 7, we show the execution
time required per agent after r = 150 search rounds. The execution time of
DEVPP is significantly lower than that of all other methods, with minimal variation.

Fig. 6. Information entropy of the environment with hazard lethality of 90% for all
proposed methods. We consider three fleet sizes, M = 2, M = 4, and M = 5 agents;
all methods are identical to [1] when M = 1.

Fig. 7. The boxplots illustrate the execution time per agent in seconds with hazard
lethality of 90% for fleet sizes M = 2, M = 4, and M = 5 agents. When M = 1,
all methods report execution time identical to [1].

Thus, not only is the information gathering using DEVPP more efficient, it is
also executed significantly faster. The other proposed methods (MA-DITP
and MA-GITP) execute in similar or slightly more time than [1].
Main Result: The numerical experiments reveal that the proposed DEVPP
method with multiple robots is more efficient in terms of information gathered
than the single-robot case of [1]. This result is attributed to the shorter
paths produced for each agent in the assigned subspaces, which subsequently
reduces the likelihood of robot destruction and significantly reduces execution
time. Overall, simultaneous deployment of multiple robots leads to efficient
information gathering in all proposed methodologies.

5 Conclusion

This paper introduces DEVPP, a distributed information-gathering approach for
environments with path-based sensors that uses a team of robots simultaneously.
We demonstrate the efficacy of this approach in various experiments with Monte
Carlo replications. We also compare this approach against two information-gathering
methods that employ multiple robots simultaneously (MA-DITP and
MA-GITP) and against [1] in the same environment. We experimentally show that the
information gathered for hazards and targets using DEVPP outperforms the
other approaches in all cases. In addition, DEVPP reduces the computational
cost, thereby accelerating the overall execution time of the information-theoretic
planner. Although MA-GITP runs slower than the other methods, it requires no
communication, and thus it is resilient to malicious attacks on the communication
system. All proposed multi-robot methodologies reduce the entropy
of the environment faster than the case where only a single robot is deployed.

Acknowledgments. This study was conducted at the Motion and Teaming Labo-
ratory, University of Maryland (UMD), and was funded by the Maryland Robotics
Center (MRC) and the Office of Naval Research (ONR). Alkesh K. Srivastava and
George P. Kontoudis were partially supported by the MRC to conduct this research
with “Pathway to the Ph.D.” and “MRC Postdoctoral Fellowship” programs, respec-
tively. This work has been possible with the support of ONR under the grant “Exper-
imentally Verified Autonomous Path Planning for Information Gathering in Lethally
Hostile Environments with Severely Limited Communication” (N0001420WX01827).
The views, positions, and conclusions contained in this document are solely those of
the authors and do not explicitly represent those of ONR.

References
1. Otte, M., Sofge, D.: Path-based sensors: paths as Sensors, Bayesian updates, and
Shannon information gathering. IEEE Trans. Autom. Sci. Eng. 18, 946–967 (2021)
2. Okabe, A., Boots, B., Sugihara, K.: Nearest neighbourhood operations with gen-
eralized Voronoi diagrams: a review. Int. J. Geogr. Inf. Syst. 8(1), 43–71 (1994)
3. Choset, H.: Coverage for robotics - a survey of recent results. Ann. Math. Artif.
Intell. 31, 113–126 (2001)
4. Nair, V.G., Guruprasad, K.R.: GM-VPC: an algorithm for multi-robot coverage of
known spaces using generalized Voronoi partition. Robotica 38(5), 845–860 (2020)
5. Jager, M., Nebel, B.: Dynamic decentralized area partitioning for cooperating
cleaning robots. In: IEEE International Conference on Robotics and Automation,
vol. 4, pp. 3577–3582 (2002)
6. Rankin, E.S., Rekleitis, I., New, A.P., Choset, H.: Efficient Boustrophedon multi-
robot coverage: an algorithmic approach. Ann. Math. Artif. Intell. 52, 109–142
(2008)
7. Hazon, N., Kaminka, G.A.: On redundancy, efficiency, and robustness in coverage
for multiple robots. Robot. Auton. Syst. 56, 1102–1114 (2008)
8. Agmon, N., Hazon, N., Kaminka, G.A.: The giving tree: constructing trees for
efficient offline and online multi-robot coverage. Ann. Math. Artif. Intell. 52(2–4),
43–168 (2009)
9. Zheng, X., Koenig, S., Kempe, D., Jain, S.: Multirobot forest coverage for weighted
and unweighted terrain. IEEE Trans. Rob. 26(6), 1018–1031 (2010)
10. Wilson, Z., Whipple, T., Dasgupta, P.: Multi-robot coverage with dynamic coverage
information compression. In: International Conference on Informatics in Control,
Automation and Robotics, pp. 236–241 (2011)
11. Michel, D., McIsaac, K.: New path planning scheme for complete coverage of
mapped areas by single and multiple robots. In: IEEE International Conference
on Mechatronics and Automation, pp. 1233–1240 (2012)

12. Macharet, D., Azpurua, H., Freitas, G., Campos, M.: Multi-robot coverage path
planning using hexagonal segmentation for geophysical surveys. Robotica 36(8),
1144–1166 (2018)
13. Guruprasad, K.R., Wilson, Z., Dasgupta, P.: Complete coverage of an initially
unknown environment by multiple robots using Voronoi partition. In: International
Conference on Advances in Control and Optimization in Dynamical Systems, Ban-
galore, India (2012)
14. Kurabayashi, D., Ota, J., Arai, T., Yoshida, E.: Cooperative sweeping by multiple
mobile robots. In: IEEE International Conference on Robotics and Automation,
vol. 2, pp. 1744–1749 (1996)
15. Bhattacharya, S., Michael, N., Kumar, V.: Distributed coverage and exploration
in unknown non-convex environments. In: Martinoli, A., et al. (eds.) Distributed
Autonomous Robotic Systems. Springer Tracts in Advanced Robotics, vol. 83, pp.
61–75. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-32723-0_5
16. Otte, M., Sofge, D.: Path planning for information gathering with lethal hazards
and no communication. In: International Workshop on the Algorithmic Founda-
tions of Robotics, pp. 389–405 (2018)
17. Schwager, M., Dames, P., Rus, D., Kumar, V.: A multi-robot control policy for
information gathering in the presence of unknown hazards. In: Christensen, H.I.,
Khatib, O. (eds.) Robotics Research. STAR, vol. 100, pp. 455–472. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-29363-9_26
18. Julian, B.J., Angermann, M., Schwager, M., Rus, D.: Distributed robotic sensor
networks: an information-theoretic approach. Int. J. Robot. Res. 31(10), 1134–1154
(2012)
19. Dames, P., Schwager, M., Kumar, V., Rus, D.: A decentralized control policy for
adaptive information gathering in hazardous environments. In: IEEE Conference
on Decision and Control, pp. 2807–2813 (2012)
20. Dames, P., Schwager, M., Rus, D., Kumar, V.: Active magnetic anomaly detection
using multiple micro aerial vehicles. IEEE Robot. Autom. Lett. 1, 153–160 (2016)
Distributed Multiple Hypothesis Tracker
for Mobile Sensor Networks

Pujie Xin and Philip Dames(B)

Temple University, Philadelphia, PA 19122, USA


{pujie.xin,pdames}@temple.edu
https://sites.temple.edu/trail/

Abstract. This paper proposes a distributed estimation and control
algorithm to allow a team of robots to search for and track an unknown
number of targets. The number of targets in the area of interest varies
over time as targets enter or leave, and there are many sources of sensing
uncertainty, including false positive detections, false negative detections,
and measurement noise. The robots use a novel distributed Multiple
Hypothesis Tracker (MHT) to estimate both the number of targets and
the states of each target. A key contribution is a new data association
method that reallocates target tracks across the team. The distributed
MHT is compared against another distributed multi-target tracker to
test its utility for multi-robot, multi-target tracking.

Keywords: Distributed estimation · Multiple Hypothesis Tracker ·
Multi-target tracking

1 Introduction

This paper addresses the problem of multi-robot, multi-target tracking (MR-MTT),
a canonical task in robotics, which includes problems such as search and
rescue [21], surveillance [11], and mapping [10]. Additionally, we assume that
the robots operate in a known area that contains an unknown number of targets
(e.g., the robots know the blueprint map of a building but not how many people
are inside it). The targets of interest can be stationary or dynamic, and they
might enter or leave the environment; thus, the number of targets varies
with time. A solution to this problem contains two primary components: (I) an
estimation algorithm to track the targets and (II) a control algorithm that
leads robots to search for new targets (exploration) while also keeping track of
detected targets (exploitation).
Multiple Target Tracking (MTT) is a well studied problem, with the primary
distinction between methods being how they address the problem of data associa-
tion [30]. Common approaches include the Multiple Hypothesis Tracker (MHT)
[2], Joint Probabilistic Data Association (JPDA) [13], Probability Hypothesis
Density (PHD) filter [20], and Multi-target Multi-Bernoulli (MeMBer) Filter


[33]. We base our solution on the track-oriented MHT (TOMHT) [16] as this
offers a conceptually simple and computationally efficient solution.
Active search with teams of robots is another well studied problem. The
primary distinction in this space is the use of centralized [27,34] versus dis-
tributed strategies [11,19,29]. One of the most successful distributed algorithms
is Voronoi-based control [7], where the basic idea is to divide the search area
with a Voronoi partition [12] and iteratively move each robot to the centroid of its
Voronoi cell, a process known as Lloyd’s algorithm. In recent years, Multi-Agent
Deep Reinforcement Learning (MADRL) has offered a new method to generate
search policies through repeated interactions between other agents and the envi-
ronment in an attempt to maximize a reward function [15]. However, much of
the work based on MADRL requires global information and uses a centralized
architecture [14,25]. Zhou et al. [35] propose a decentralized MADRL method
using the maximum reciprocal reward to track cooperatively with UAV swarms.
However, one downside of learning-based frameworks is that they often have
trouble generalizing to unseen scenarios.
There are few works that specifically address the MR-MTT problem. One of
the first solutions was Cooperative Multi-robot Observation of Multiple Moving
Targets (CMOMMT) from Parker [22], which assigns robots to targets to maxi-
mize the time such that each target is observed by at least one robot. Related to
CMOMMT, Pierson and Rus [23] proposed a distributed method for capturing
multiple evaders with multiple pursuers in 2D and 3D environment by assigning
pursuers to evaders and then using gradient descent. Pimenta et al. [24] propose
Simultaneous Coverage and Tracking (SCAT) where they use Lloyd’s algorithm
to create a control law with guaranteed exponential convergence to a local min-
imum of the objective function. Our previous work [9] built upon SCAT by
allowing the robots to learn the target distribution online rather than needing
that information a priori. We did this by formulating a distributed PHD filter
and using this as the importance weighting function in Lloyd’s algorithm to drive
target search. This paper will use the same control framework but formulate a
new distributed MHT.
The primary contribution of this work is creating a new distributed TOMHT
algorithm. Our approach leverages information about the relative locations of
sensors (i.e., robots) to decrease the memory and computational requirements
relative to a naïve implementation. We then demonstrate the utility of this for-
mulation to solve the MR-MTT problem by using the resulting target estimates
to drive robot search. Our search strategy naturally and effectively drives the
robots to follow previously detected targets and to explore unknown areas that
may contain new targets. To demonstrate our result is correct and efficient, we
compare our result to that of the PHD filter [9] through a series of simulated
experiments with static and moving targets.

2 Background
In this section we provide the mathematical formulation of the standard Multiple
Hypothesis Tracker (MHT) for multi-target tracking and of Lloyd’s algorithm
for distributed coverage control.

Fig. 1. Illustration of the MHT. (a) Example at t = 1, where observations 2 and 3 are
in the gate and could be associated with the current track 1, while observation 4 is out
of the gate and cannot be associated with existing tracks. (b) Observations 1–6 arrive
at different time steps; with the gating process, initial track hypotheses are built based
on the gating test. (c) The corresponding track trees. Each tree node is associated with
an observation in (b).

2.1 MHT Definition


Let the set of objects be $X_t = \{x_t^1, x_t^2, \ldots\}$, where $x_t^i$ is the state of the $i$th
object at time $t$. The MHT uses an extended Kalman filter to estimate the state
of each object, so $\hat{x}_t^i \sim \mathcal{N}(\mu_t^i, \Sigma_t^i)$. Since the number of objects is unknown, the
MHT uses the concept of a track to represent a potential object. Therefore, the
number of tracks represents the estimated number of objects within the search
space, and the state of each track represents the state of that individual object.
At each time step, a robot receives a set of measurements $Z_t = \{z_t^1, z_t^2, \ldots\}$.
The number of measurements depends upon the number of targets within the
robot's field of view (FoV) as well as the probabilities of receiving a false negative
$p_{fn}$ or a false positive detection $p_{fp}$. The MHT uses a two-step approach to solve
the MTT problem, first solving the data association problem to match measurements
to tracks and then using the associated measurements to update each track.

Data Association. The MHT uses a gating method to determine if an individual
measurement $z_t^j$ should be associated with a track $\hat{x}_t^i$:

$$d_t^2(\hat{x}_t^i, z_t^j) = (\hat{\mu}_t^i - z_t^j)^\top (\Sigma_t^i)^{-1} (\hat{\mu}_t^i - z_t^j) \le d^2, \qquad (1)$$

where $\mu_t^i, \Sigma_t^i$ are the mean and covariance of the estimated track $\hat{x}_t^i$ and $d$
is the allowable distance threshold. Figure 1a shows the gating process for one
time step. This comparison determines the full set of potential associations
between all measurements $Z_t$ and all tracks $\hat{X}_t$ at time $t$.
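
A minimal Python sketch of the gating test in (1) follows; the threshold argument
plays the role of $d$, and the commented loop shows how the full set of gated
track–measurement pairs could be enumerated. The names are illustrative.

import numpy as np

def in_gate(mu, Sigma, z, d_thresh):
    """Mahalanobis gating test of eq. (1): accept measurement z for the
    track N(mu, Sigma) if the squared Mahalanobis distance is below the
    squared threshold."""
    r = z - mu
    d2 = float(r @ np.linalg.solve(Sigma, r))
    return d2 <= d_thresh ** 2

# All potential associations between tracks and measurements at time t:
# pairs = [(i, j) for i, (mu, S) in enumerate(tracks)
#                 for j, z in enumerate(Z) if in_gate(mu, S, z, d_thresh)]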
The TOMHT then considers a set of potential associations, as Fig. 1b shows.
For example, if measurements $z_t^2$ and $z_t^3$ could both be associated with a track $\hat{x}_t^1$,
then two copies of that track are propagated into the future, one using each
hypothetical association, as Fig. 1c shows. All of these distinct hypotheses for a
track are maintained using a tree structure, where each branch is a unique series
of associations and each node has a distinct mean and covariance. Additionally,
the TOMHT considers the possibility of starting a new tree using each measurement,
which represents the possibility that a detection corresponds to an entirely
new object.

Track Scoring. Based on the above structure, there will be many distinct
hypotheses for an individual track based on different association histories. Therefore,
the MHT uses a scoring system to determine the likelihood of individual
tracks. This score can be used both to find the maximum likelihood track and
to prune branches within the tree to avoid an exponential growth in the set
of hypotheses over time. A standard way to formulate the score $S$ uses the log
likelihood ratio [3,16]

$$S(A_{1:t}^i) = \ln \frac{p(z_{1:t}^{k_{1:t}} \mid z_i^{k_i} \subseteq Z_i\ \forall i)}{p(z_{1:t}^{k_{1:t}} \mid z_i^{k_i} \subseteq \emptyset)} = \ln \frac{\prod_{i=1:t} p(z_i^{k_i} \mid z_{1:i-1}^{k_{1:i-1}},\ z_i^{k_i} \subseteq Z_i)}{\prod_{i=1:t} p(z_i^{k_i} \mid z_{1:i-1}^{k_{1:i-1}},\ z_i^{k_i} \subseteq \emptyset)} \qquad (2)$$

where $A_{1:t}^i = \{k_1, \ldots, k_t\}$ is a history of associations for track $i$,
$p(z_i^{k_i} \mid z_{1:i-1}^{k_{1:i-1}}, z_i^{k_i} \subseteq Z_i) = \mathcal{N}(z_i^{k_i}; \hat{x}_i^{k_i})$ represents the track being updated
with a specific measurement, and $p(z_i^{k_i} \mid z_{1:i-1}^{k_{1:i-1}}, z_i^{k_i} \subseteq \emptyset) = 1/A_{FoV}$ represents
a missed detection (i.e., false negative), where $A_{FoV}$ is the area of the sensor field
of view. This score can be easily updated at each time step,

$$S(A_{1:t}^i) = S(A_{1:t-1}^i) + \begin{cases} \ln \dfrac{p_{fn}}{1 - p_{fp}}, & z_t^{k_t} \subseteq \emptyset \\[6pt] \ln \dfrac{A_{FoV}}{(2\pi)^{d/2}} - \dfrac{1}{2} \ln |\Sigma_t^{k_t}| + \dfrac{d_t^2}{2}, & z_t^{k_t} \subseteq Z_t \end{cases} \qquad (3)$$

where $d\ (= 2)$ is the dimension of the space.
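
The recursion in (3) is straightforward to implement; the Python sketch below
mirrors the equation as reconstructed above, with default constants taken from
the sensor model of Sect. 4 purely for illustration. All names are assumptions
for this sketch.

import numpy as np

def score_update(S_prev, associated, Sigma=None, d2=None, dim=2,
                 p_fn=0.2, p_fp=3.66e-3, A_fov=np.pi * 5.0 ** 2):
    """Recursive track-score update following eq. (3). `d2` is the squared
    Mahalanobis distance of the associated measurement from eq. (1)."""
    if not associated:  # missed-detection branch of eq. (3)
        return S_prev + np.log(p_fn / (1.0 - p_fp))
    # detection branch: Gaussian measurement likelihood vs. uniform clutter
    return (S_prev + np.log(A_fov / (2.0 * np.pi) ** (dim / 2))
            - 0.5 * np.log(np.linalg.det(Sigma)) + d2 / 2.0)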


To get the maximum likelihood target set at time $t$, the TOMHT selects the
best association history by solving the combinatorial assignment problem

$$\max \sum_i S(A_{1:t}^i) \quad \text{s.t.} \quad A_{1:t}^i \cap A_{1:t}^j = \emptyset,\ \forall i \neq j. \qquad (4)$$

In other words, we want to find the maximum total score such that each measurement
is associated with at most one target. This optimization problem is
solved in [16]. The N-scan pruning approach then traces back to $t - N$ and
prunes the subtrees that diverge from the globally optimal hypothesis, i.e.,
the solution to (4).

2.2 Lloyd’s Algorithm

Lloyd's algorithm [7] locally minimizes the function

$$H(q_1, q_2, \ldots, q_R) = \int_E \min_{r \in \{1, \ldots, R\}} f(d(x, q_r))\, \phi(x)\, dx, \qquad (5)$$

where $d(x, q)$ measures the distance between elements in $E$, $f(\cdot)$ is a monotonically
increasing function, and $\phi(x)$ is a non-negative weighting function. $\phi(x)$ can
be a time-invariant function in coverage problems, or a time-varying
function in tracking problems. A common choice is $f(x) = x^2$. The minimum of
the term inside the integral is attained when we divide the environment using the
Voronoi partition, $V_r = \{x \mid d(x, q_r) \le d(x, q_j),\ \forall j \neq r\}$, where the $V_r$ are the Voronoi
cells. The minimum with respect to the robot positions is

$$(q^r)^* = \frac{\int_{V_r} x\, \phi(x)\, dx}{\int_{V_r} \phi(x)\, dx}. \qquad (6)$$

This is a distributed control algorithm: as long as each robot knows the
positions of its neighbors, it is capable of computing its Voronoi cell and
moving toward the weighted centroid of that cell. Note that in practice,
Voronoi neighbors can potentially be far apart. We make the assumption that
the density of robots and/or the communication range of the robots is sufficiently
large to allow robots to communicate with all of their neighbors. The problem
of formulating a consistent partition subject to communication range limits will
be the subject of future work.
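
For concreteness, one Lloyd iteration on a discretized workspace might look like
the following Python sketch, which builds the Voronoi partition by nearest-robot
assignment and returns the φ-weighted centroids of (6). It is written as a
centralized stand-in for what each robot computes locally from its neighbors'
positions, and the names are illustrative.

import numpy as np

def lloyd_step(positions, grid, phi):
    """One iteration of Lloyd's algorithm on a discretized workspace.
    positions: (R,2) robot positions; grid: (N,2) cell centers;
    phi: (N,) importance weights. Returns the new (R,2) goal points, the
    weighted centroids of each robot's Voronoi cell per eq. (6)."""
    positions = np.asarray(positions, dtype=float)
    # Voronoi partition: assign every grid cell to its nearest robot.
    d = np.linalg.norm(grid[:, None, :] - positions[None, :, :], axis=2)
    owner = np.argmin(d, axis=1)
    goals = positions.copy()
    for r in range(len(positions)):
        in_cell = owner == r
        mass = phi[in_cell].sum()
        if mass > 0:  # eq. (6): phi-weighted centroid of V_r
            goals[r] = (grid[in_cell] * phi[in_cell, None]).sum(axis=0) / mass
    return goals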
This simple iterative, myopic control scheme is widely used in coverage prob-
lems. For example, Lee and Egerstedt [18] use Lloyd's algorithm to track a time-varying
density function with a team of robots. Bhattacharya et al. [1] expand
the search space to non-convex and non-Euclidean spaces. Suresh et al. [31]
propose a method with low complexity and communication workload in the distributed
case using Lloyd's algorithm. Our previous work [9] repurposes this coverage
controller to perform target tracking by choosing the estimated target density
as the time-varying importance weighting function φ(x).

3 Distributed Multiple Hypothesis Tracker

The formulation above was for the standard single-sensor case. However, two
new considerations arise when there are multiple robots. First, multiple robots
may observe the same target at the same time (and consequently may each have
a local copy of a track for that target). Second, targets may leave the FoV of
one robot and enter the FoV of another robot. So long as those two robots are
within communication range (which is typically much larger than the sensing
range), then the first robot can provide some information to the second robot
to improve its initial estimation. This section outlines our novel formulation to
address these issues.

Note that there are existing distributed MHT solutions for multi-sensor
surveillance in radar systems [5,6]. However, in that case, the sensors are static
and the topology of the communication network is fixed. In our case, the sen-
sors are mobile, have a smaller FoV than that of radar, and the communication
topology changes online. Our new formulation accounts for all of these effects.

3.1 Assumptions
We make three main assumptions. First, that robots know their pose. This is
a strong assumption, but robots operating in indoor environments with a high-
quality prior map [26] or in outdoor environments with GPS receivers are able to
navigate for long periods of time with minimal drift in their pose estimates. We
also showed in our past work that sufficiently small drift has negligible effects
on distributed target tracking accuracy [9]. Second, we assume that each robot
is capable of communicating with its Voronoi neighbors and all the robots with
overlapping sensor FoVs. This assumption was also discussed above in Sect. 2.2.
Third, each robot has a unique ID, which they will use to determine precedence
to avoid conflicts.

Fig. 2. Example of track fusion. There are three robots, whose FoVs are shown by the
dashed circles. Each robot has a unique color that is shared by its tracks. There are
4 targets, located at (10.5, 11), (5.5, 6.5), (5.5, 9), and (16, 16), and 6 tracks (colored
ellipses), 2 from each robot. Here r2 is the only robot with a nonempty set of private
tracks, P2 = {(16, 16)}. All other tracks lie in the union of multiple regions. Since r3 has
the largest ID, R3 = {(5.5, 9), (10.5, 11)} and F3 = {(10, 11), (11, 11), (5.5, 9)}, while
R1, R2, F1, F2 = ∅. Then, r3 calculates the KL divergence between tracks in F3 and
R3 and fuses tracks with low KL divergence using Algorithm 1.

3.2 Track Exchange and Fusion


Similar to our past work creating a distributed PHD filter [9], our distributed
MHT solution utilizes the Voronoi partition from Lloyd’s algorithm to spatially

distribute information across the team, i.e., robot r maintains the tracks that
are located within its Voronoi cell Vr . The key is to create a set of rules for data
exchange to ensure consistency across the team as cell boundaries shift due to
robot motion and as targets move.
When robots are well spaced out with their FoVs not overlapping, each
robot can run its own separate MHT. However, when multiple robots view the
same region, each may have its own separate track for the same target.
These repeated tracks not only drive several robots to track one target, which
wastes system resources, but also increase the false positive rate. Algorithm 1
outlines our approach, which is also illustrated in Fig. 2.
The process starts by inflating a robot’s FoV to account for effects such as
measurement noise (line 1), typically using the standard deviation of the mea-
surement model 2σ. Robots then exchange these inflated regions to determine
their local grouping NF (line 2). Robots then exchange track estimates amongst
these local clusters (lines 3–14), using the relative positions of the track and
other robots as well as the IDs of each robot to sort each track into one of three
categories: 1) private (i.e., only in F̂r ), 2) fused (i.e., in multiple regions but r
has the highest ID amongst these), or 3) received (i.e., in multiple regions but
r is not the highest ID).
After this data exchange process, robot r measures the Kullback-Leibler
divergence between all tracks in its fused and received sets F and R (lines 15–20).
Robot r then uses the Munkres algorithm [17] to find the optimal assignment of
tracks (line 21), where a valid assignment (i.e., with cost less than dmax ) means
that a fused and received track are the same object and the estimates are com-
bined and added to the private list (lines 26–27). Unassigned tracks are added
directly to the private list (lines 23–24 and 30–34).
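
The core of lines 15–34 (the pairwise Gaussian KL divergences, the Munkres
assignment, and the pass-through of unmatched tracks) could be sketched in
Python as follows. The averaging rule for matched pairs follows lines 26–27 of
Algorithm 1, and names such as fuse_tracks are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def gauss_kl(mu_r, S_r, mu_f, S_f):
    """KL divergence between 2-D Gaussian tracks, as in line 17 of Alg. 1."""
    Sf_inv = np.linalg.inv(S_f)
    dmu = mu_f - mu_r
    return 0.5 * (dmu @ Sf_inv @ dmu + np.trace(Sf_inv @ S_r) - 2.0
                  + np.log(np.linalg.det(S_f) / np.linalg.det(S_r)))

def fuse_tracks(received, fused, kl_max, d_max=1e6):
    """Match received to fused tracks; each track is a (mu, Sigma) pair.
    Matched pairs are averaged (lines 26-27); unmatched tracks pass through
    to the private list (lines 23-24 and 30-34)."""
    D = np.full((len(received), len(fused)), d_max)
    for i, (mr, Sr) in enumerate(received):
        for j, (mf, Sf) in enumerate(fused):
            kl = gauss_kl(mr, Sr, mf, Sf)
            if kl < kl_max:
                D[i, j] = kl
    private, used = [], set()
    if D.size:
        rows, cols = linear_sum_assignment(D)  # Munkres assignment (line 21)
        for i, j in zip(rows, cols):
            if D[i, j] < d_max:                # valid match: average the pair
                mr, Sr = received[i]
                mf, Sf = fused[j]
                private.append(((mr + mf) / 2, (Sr + Sf) / 2))
                used |= {('r', i), ('f', j)}
    private += [t for i, t in enumerate(received) if ('r', i) not in used]
    private += [t for j, t in enumerate(fused) if ('f', j) not in used]
    return private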

3.3 Importance Weighting Function


We then use the resulting target estimates from the MHT to drive exploration
using Lloyd's algorithm. The key choice in Lloyd's algorithm is the importance
weighting function $\phi(x)$. In our previous work [9], we used the PHD as the
weighting function. We take a similar approach here, using

$$\phi(x) = v(x) + \sum_{i=1}^{n} \mathcal{N}(x \mid \mu_i, \Sigma_i), \qquad (7)$$

where $v(x)$ accounts for undetected targets (i.e., driving exploration) and the
right term accounts for detected targets (i.e., exploitation). We create a simple
ad hoc rule to approximate the number of undetected targets,

$$v(x) = \begin{cases} (1 - p_d(x))\, v(x), & x \in \text{FoV} \\ \min(v(x) + \Delta v,\ \delta), & \text{else}, \end{cases} \qquad (8)$$

where $p_d(x)$ is the probability of the robot detecting a target at $x$, $\Delta v$ is the
number of new targets potentially appearing per time step, and $\delta$ is a maximum
target density to prevent “windup” in infrequently observed areas.

Algorithm 1. Track Exchange and Fusion for Robot r
1:  F̂_r = F_r ⊕ B(2σ)                                 ▷ Expand the FoV
2:  N_F = {j ≠ r | F̂_r ∩ F̂_j ≠ ∅}
3:  P, F, R = ∅                                        ▷ Private, fused, and received track lists
4:  for t ∈ T do
5:    N_t = {j ∈ N_F | t ∈ F̂_j}
6:    if N_t is empty then
7:      P = P ∪ {t}
8:    else if r > max_{j∈N_t} j then
9:      F = F ∪ {t}
10:   else
11:     Send t to robot max_{j∈N_t} j
12:   end if
13: end for
14: Add all tracks received from other robots to R
15: for t_i^R ∈ R do                                   ▷ Create assignment distance matrix D
16:   for t_j^F ∈ F do
17:     D_KL(t_i^R ‖ t_j^F) = (1/2)[ ‖μ_j^F − μ_i^R‖²_{(Σ_j^F)⁻¹} + tr((Σ_j^F)⁻¹ Σ_i^R) − 2 + ln(det Σ_j^F / det Σ_i^R) ]
18:     d_{i,j} = D_KL(t_i^R ‖ t_j^F) if D_KL(t_i^R ‖ t_j^F) < K_DL, else d_max
19:   end for
20: end for
21: Find assignment A : R → F
22: for t_i^R ∈ R do
23:   if A(i) = ∅ then                                 ▷ Track t_i^R not assigned
24:     P = P ∪ {t_i^R}
25:   else
26:     Make new track t with μ = (μ_i^R + μ_{A(i)}^F)/2, Σ = (Σ_i^R + Σ_{A(i)}^F)/2
27:     P = P ∪ {t}
28:   end if
29: end for
30: for t_j^F ∈ F do
31:   if (∄ i s.t. A(i) = j) or (d_{i,j} = d_max, ∀i) then   ▷ Track t_j^F not assigned
32:     P = P ∪ {t_j^F}
33:   end if
34: end for

The motivation behind this rule is to decrease the expected number of undetected
targets at all points that are observed and to increase the number of undetected
targets outside of the field of view (with $\Delta v = 0$ when it is known that the targets
are stationary). We set $v(x) = \delta$ at the start of exploration. Typically, we set
$\delta = 1/A$, where $A$ is the area of the environment; this assumes there is only one
target in the environment.
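
A direct transcription of the rule in (8) on a grid of key points might look like
the following sketch. The masks and rates are assumed inputs, and the cap δ is
applied with a minimum, consistent with its role as a maximum density.

import numpy as np

def update_undetected_density(v, in_fov, p_d, dv, delta):
    """Ad hoc update of eq. (8) for the undetected-target density v(x).
    v: (N,) current density; in_fov: (N,) bool mask of observed cells;
    p_d: (N,) detection probabilities; dv: new-target rate; delta: cap."""
    v = v.copy()
    v[in_fov] *= (1.0 - p_d[in_fov])                 # observed cells decay
    v[~in_fov] = np.minimum(v[~in_fov] + dv, delta)  # unobserved cells grow, capped
    return v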
To make the problem tractable, we approximate the weighting function by
a set of key points along a uniform grid in the environment, a common choice
in robotics [32]. We then calculate the weight of each key point using (7) and
use this to approximate the integrals in (6) by summations. We use the same
algorithms as our previous work [9, Algorithm 1] to ensure the data is properly
exchanged as Voronoi boundaries shift.

4 Results
In this section, we test our approach against our previous results using the
distributed PHD filter [9], as this is the most closely related MR-MTT solution
in the literature. First, we isolate the differences due to the tracker by using the
same measurement and pose data from [9] and running this data through our
new MHT. Then, we examine the benefits of the new weighting function (7) to
show that unifying the estimation and control is helpful.
All trials take place in an open 100 × 100 m area. We run trials with different numbers of targets, with target positions drawn uniformly from a 120 × 120 m area; any targets initialized outside the map are discarded. The robots begin at random locations within a small box at the bottom center of the map. The robots are holonomic with a maximum velocity of 2 m/s. Each robot is equipped with an onboard sensor with a circular FoV of radius 5 m, p_fn = 1 − p_d = 0.2, and p_fp = 3.66 · 10⁻³. Robots measure the relative position of a target as z ∼ N(x − q, 0.25 I₂). The sensors collect measurements at 2 Hz. Note, these
values do not represent any specific real-world situation, but could represent a
team of drones equipped with downward facing cameras. For a real situation, one
could follow the procedures outlined in our other work to carefully characterize
the sensor models [4,8].
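For concreteness, the sketch below simulates one measurement set under this sensor model. The interpretation of p_fp as a per-scan clutter probability with clutter placed uniformly over the FoV is our assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sense(q, targets, r_fov=5.0, p_d=0.8, p_fp=3.66e-3, var=0.25):
    """Simulate one scan for a robot at q: relative positions with noise."""
    z = []
    for x in targets:
        if np.linalg.norm(x - q) <= r_fov and rng.random() < p_d:
            z.append(x - q + rng.normal(0.0, np.sqrt(var), size=2))
    if rng.random() < p_fp:  # occasional clutter point, uniform over the FoV
        ang, rad = rng.uniform(0, 2 * np.pi), r_fov * np.sqrt(rng.random())
        z.append(np.array([rad * np.cos(ang), rad * np.sin(ang)]))
    return z

targets = rng.uniform(0, 100, size=(10, 2))
print(sense(np.array([50.0, 50.0]), targets))
```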

4.1 Performance Metric

We measure tracking error using the Optimal SubPattern Assignment (OSPA) metric [28]. The error between sets X and Y, where |X| = m ≤ |Y| = n without loss of generality, is

\[ d(X, Y) = \left( \frac{1}{n} \Big[ \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c(x_i, y_{\pi(i)})^p + c^p (n - m) \Big] \right)^{1/p} \tag{9} \]

where c is the cutoff distance (we use c = 10 m), d_c(x, y) = min(c, ‖x − y‖), Π_n is the set of permutations of {1, 2, 3, ..., n}, and p = 1 is our choice of p-norm. Since the OSPA can fluctuate due to false negative and false positive detections, we calculate the median of the OSPA over the last 100 s to obtain a steady estimate, and we measure the variance of the OSPA over this same time window to quantify the stability of the tracker.
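A minimal OSPA computation for p = 1, replacing the explicit minimum over permutations in (9) with an equivalent optimal assignment solved by the Hungarian algorithm (an illustrative sketch, not the paper's evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=10.0):
    """OSPA distance (9) with p = 1 between 2-D point sets X and Y."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    if len(X) > len(Y):
        X, Y = Y, X                          # ensure |X| = m <= |Y| = n
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    # Cutoff distance matrix d_c(x, y) = min(c, ||x - y||)
    D = np.minimum(c, np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2))
    rows, cols = linear_sum_assignment(D)    # optimal sub-pattern assignment
    return (D[rows, cols].sum() + c * (n - m)) / n
```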

Fig. 3. Robot trajectories. (a) shows the trajectories of the robots during a single trial with 20 robots and 5 static targets. (b) shows the trajectories of the robots during a single trial with 20 robots and dynamic targets.

Fig. 4. Boxplots of the median OSPA over 10 runs for teams of 10–100 robots and 10, 30, or 50 static targets, for the distributed MHT, the MHT run on data collected under PHD-based control (MHT-PHD), and the distributed PHD filter, respectively.

4.2 Stationary Targets

In the static case, the number of targets and their positions remain constant.
We run a batch of trials with 10, 30, and 50 targets, respectively. The robots
begin by distributing over the environment. Once a robot detects a target, it will
typically continue to track that target unless the track gets assigned to another
robot, as Fig. 3a shows. As the robots explore the environment, v(x) decreases in
the detected area, thus the weighted centroid in one Voronoi cell keeps shifting
away from regions without targets.

Fig. 5. Boxplots of the median standard deviation of the OSPA over 10 runs for teams of 10–100 robots and 10, 30, or 50 static targets, for the distributed MHT, the MHT-PHD case, and the distributed PHD filter, respectively.

Figure 4 shows the median OSPA over the last 100 s of a 550 s experiment
(long enough to reach a steady state). We see that the median OSPA decreases
significantly when the number of robots increases for all three methods, leveling
out when the number of robots is near the number of targets. We see that in
general, the performance is about the same using all three methods, with the
mixed MHT-PHD case having a higher average OSPA as the control was not
directly tied to the tracker. This is also borne out in the standard deviation of
the OSPA data in Fig. 5, where the MHT-PHD case tends to have the highest
standard deviation. This also explains the increase in the standard deviation as
the number of robots increases.
We also see that the MHT data tends to be between the mixed case and the
PHD. This is likely due to the MHT making hard decisions about the presence
of a target (i.e., a track exists or is pruned) compared to the PHD, which uses
a target density function that more gracefully degrades. We expect the MHT
to perform better when there are fewer false negative detections, as repeated
missed detections often cause tracks to be pruned.

4.3 Dynamic Targets


In the dynamic case, targets move around in the environment and the number of targets varies over time as targets enter or leave the map area. Thus we set up a birth rate b(x) near the map boundary ∂E to simulate the appearance of new targets:

\[ b(x) = \begin{cases} 5.26 \times 10^{-5} & \|x - \partial E\| \le 5\,\text{m} \\ 0 & \text{else} \end{cases} \qquad v(x) := v(x) + b(x) \tag{10} \]

Note that this birth model is incorporated into the v term in (8). With this birth model, the total expected number of new targets is 1 per 10 s (integrating b(x) over the roughly 1900 m² band within 5 m of the boundary gives about 0.1 targets per second). Because of this flux, the final number of targets is always similar regardless of the initial number.

Fig. 6. Boxplots of the median OSPA error over 10 runs for teams of 10–100 robots and dynamic targets, for the distributed MHT, the MHT-PHD case, and the distributed PHD filter, respectively.

The emergent behavior in this case differs from the static case: some robots spread out along the map boundary to detect new targets, as Fig. 3b shows. To obtain the steady-state tracking performance, we extend the experiment to 800 s and evaluate the OSPA error over the last 200 s.
Figure 6 shows the median OSPA of the distributed MHT, MHT-PHD and
PHD. We see that the OSPA decreases as the number of robots increases, and
the variance of the OSPA has the same trend as in the static case. The MHT estimator run on PHD-control data (the MHT-PHD case) still performs a little worse than the distributed PHD and has the largest variance among the three methods. The most likely reason is similar to that in the static case: the control policy based on the distributed PHD disturbs the track assignment in the MHT estimation.
However, unlike the static case, the MHT has a consistently lower OSPA value
than the PHD data, especially in smaller teams. It also has a smaller variation
in larger teams, meaning it provides a more consistent result.
We do not analyze the standard deviation of the OSPA in the last 200 s of each trial because the appearance and disappearance of targets is the principal component of the standard deviation.

4.4 Discussion

Beyond the qualitative and quantitative tracking performance in the above sce-
nario, there are a number of other factors to consider with MR-MTT algorithms.

Failure Cases. The most challenging scenario for MTT is when targets are
tightly clustered together (relative to the measurement covariance), as this is
when data association is most difficult. Due to their different approaches, the
MHT and PHD filter will fail in different ways. The MHT makes multiple hard
decisions about data association. Thus, tightly clustered targets will lead to a large branching factor at each time step, resulting in a very large number of branches that are difficult to prune. On the other hand, the PHD filter represents targets
using one target density function. Thus, the PHD filter will have a single large
peak in the density function, making it difficult to extract the precise location
of any individual object within the cluster.
Another challenging scenario is when there are multiple missed detections in
a row. In the MHT, this causes a track to be pruned prematurely, as there is
typically a maximum timeout for a track to ensure that old branches get pruned.
Increasing this timeout will make the MHT more robust to missed detections,
but also increases computation by increasing the number of tracks. The PHD
filter fails in a different manner. In the PHD filter update equation [9, Eq. (4)],
the target density function decreases in areas where the robot expects to detect
a target and increases in areas where the robot actually detects a target. Thus,
repeated missed detections lower the density and make it difficult to recover
quickly.
Adding multiple robots adds another challenge. Due to the design of our
distributed MTT algorithms, when a target lies right at the boundary between
two Voronoi cells then the ownership of that track will be ambiguous. Thus, two
robots may repeatedly exchange ownership of a track simply due to measurement
noise pushing the mean position across the Voronoi boundary. This does not affect tracking quality but increases the communication load.

Computational Complexity. Analyzing the computational complexity of the


MHT filter is challenging as the number of computations depends on the specific
scenario under consideration. As mentioned above, the worst case scenario for
the MHT is when all the targets are tightly clustered together (i.e., within
the gate radius of one another). In this case, the number of branches will grow
exponentially over time, where the base of the exponential function is the number
of targets in the cluster. On the other extreme, when targets are widely spaced,
there will only be one branch per target, essentially just running an extended
Kalman filter for each target in parallel.
The complexity of the PHD filter scales differently than that of the MHT
filter in terms of the number of targets. As detailed in [9, Eq. (4)], the PHD filter
features a summation over the number of measurements (which is proportional
to the number of targets). This superior scaling law relative to the MHT is a
result of the choices made during the data association step of the MTT algorithm
design.
The distributed MHT or PHD filter will generally scale better in the number
of targets than a centralized MTT as the computations are distributed across the
team, with each robot taking ownership of one portion of the full environment.
The worst case would be if all targets were in the cell of a single robot, but this is almost guaranteed never to occur by construction of the control method, as robots are
drawn towards areas of high target density.

Communication Load. Like the complexity, the communication load is situationally dependent. There are three main stages to the distributed MHT or PHD filter algorithms: 1) Voronoi calculations, 2) environment reassignment, and 3) MTT update.
The Voronoi calculations are exactly the same between the distributed MHT
and PHD filter and only require robots to exchange pose information with their
neighbors, a low bandwidth operation.
The environment reassignment step in the MHT filter is where robots
exchange tracks that have crossed the boundaries of Voronoi cells. The com-
munication load depends upon the number of targets, the length of history, and
the number of branches in the track tree. In the PHD filter, the robots exchange
portions of the target density function (in our case, sets of weighted particles).
Thus, with low target density the MHT will be more efficient but with high
target density the PHD filter will be more efficient.
The MTT update step requires robots to exchange measurements if and
only if they can see into the cell of another robot (i.e., when the robot-robot
distance is smaller than the field of view of a robot). In this case, the robots
simply need to exchange measurement sets, which is a relatively low bandwidth
operation as this is typically just a numerical array (e.g., a list of range-bearing
measurements). In the case of the PHD filter, the robots need to perform a
second round of communication in order to compute a consistent normalization
term in the PHD update step [9, Eq. (5)].

5 Conclusion

In this paper, we propose a distributed algorithm to search for and track an unknown number of targets in a search area. There are two main components: 1) a novel distributed MHT formulation for mobile robot teams and 2) a Voronoi-based control strategy. We compare the result of the distributed MHT to our
previous work, which uses an alternative MTT solution, a distributed PHD filter.
We found that both filters perform similarly well to solve the MR-MTT task.
The PHD filter was slightly better in the case of static targets, yielding more
accurate and consistent tracking performance. However, the MHT performed
slightly better when the targets were dynamic, particularly when the size of
the team is smaller. We also demonstrated the importance of tying the control
directly to the tracking, with the MHT performing better on live data compared
to running on the prior data collected by robots using the PHD filter.

Acknowledgment. This work was funded by NSF grants IIS-1830419 and CNS-
2143312.

References
1. Bhattacharya, S., Ghrist, R., Kumar, V.: Multi-robot coverage and exploration on
Riemannian manifolds with boundaries. Int. J. Robot. Res. 33(1), 113–137 (2014).
https://doi.org/10.1177/0278364913507324
2. Blackman, S.: Multiple hypothesis tracking for multiple target tracking. IEEE
Aerosp. Electron. Syst. Mag. 19(1), 5–18 (2004). https://doi.org/10.1109/MAES.
2004.1263228
3. Blackman, S., Popoli, R.: Design and Analysis of Modern Tracking Systems. Artech
House Radar Library. Artech House (1999)
4. Chen, J., Xie, Z., Dames, P.: The semantic PHD filter for multi-class target track-
ing: from theory to practice. Robot. Auton. Syst. 149 (2022). https://doi.org/10.
1016/j.robot.2021.103947
5. Chong, C.Y., Mori, S., Reid, D.B.: Forty years of multiple hypothesis tracking -
a review of key developments. In: 2018 21st International Conference on Informa-
tion Fusion (FUSION), pp. 452–459 (2018). https://doi.org/10.23919/ICIF.2018.
8455386
6. Coraluppi, S., et al.: Distributed MHT with active and passive sensors. In: 2015
18th International Conference on Information Fusion (Fusion), pp. 2065–2072
(2015)
7. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing
networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004). https://doi.org/10.
1109/TRA.2004.824698
8. Dames, P., Kumar, V.: Experimental characterization of a bearing-only sensor for
use with the PHD filter (2015)
9. Dames, P.M.: Distributed multi-target search and tracking using the PHD filter.
Auton. Robot. 44(3), 673–689 (2020). https://doi.org/10.1007/s10514-019-09840-
9
10. Deutsch, I., Liu, M., Siegwart, R.: A framework for multi-robot pose graph SLAM.
In: 2016 IEEE International Conference on Real-time Computing and Robotics
(RCAR), pp. 567–572 (2016). https://doi.org/10.1109/RCAR.2016.7784092
11. Doitsidis, L., et al.: Optimal surveillance coverage for teams of micro aerial vehicles
in GPS-denied environments using onboard vision. Auton. Robot. 33(1), 173–188
(2012). https://doi.org/10.1007/s10514-012-9292-1
12. Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applica-
tions and algorithms. SIAM Rev. 41(4), 637–676 (1999). https://doi.org/10.1137/
S0036144599352836
13. Fortmann, T., Bar-Shalom, Y., Scheffe, M.: Sonar tracking of multiple targets using
joint probabilistic data association. IEEE J. Oceanic Eng. 8(3), 173–184 (1983).
https://doi.org/10.1109/JOE.1983.1145560
14. Goldhoorn, A., Garrell, A., Alquézar, R., Sanfeliu, A.: Searching and tracking peo-
ple with cooperative mobile robots. Auton. Robot. 42(4), 739–759 (2018). https://
doi.org/10.1007/s10514-017-9681-6
15. Gronauer, S., Diepold, K.: Multi-agent deep reinforcement learning: a survey. Artif.
Intell. Rev. 55(2), 895–943 (2022). https://doi.org/10.1007/s10462-021-09996-w
16. Kim, C., Li, F., Ciptadi, A., Rehg, J.M.: Multiple hypothesis tracking revisited.
In: Proceedings of the IEEE International Conference on Computer Vision, pp.
4696–4704 (2015)

17. Konstantinova, P., Udvarev, A., Semerdjiev, T.: A study of a target tracking algo-
rithm using global nearest neighbor approach. In: Proceedings of the 4th Interna-
tional Conference Conference on Computer Systems and Technologies: E-Learning,
CompSysTech 2003, pp. 290–295 (2003). https://doi.org/10.1145/973620.973668
18. Lee, S.G., Egerstedt, M.: Controlled coverage using time-varying density functions.
IFAC Proc. Vol. 46(27), 220–226 (2013). https://doi.org/10.3182/20130925-2-DE-
4044.00030
19. Leonard, M.R., Zoubir, A.M.: Multi-target tracking in distributed sensor networks
using particle PHD filters. Signal Process. 159, 130–146 (2019)
20. Mahler, R.: Multitarget Bayes filtering via first-order multitarget moments. IEEE
Trans. Aerosp. Electron. Syst. 39(4), 1152–1178 (2003). https://doi.org/10.1109/
TAES.2003.1261119
21. Murphy, R.: Human-robot interaction in rescue robotics. IEEE Trans. Syst. Man
Cybern. Part C (Appl. Rev.) 34(2), 138–153 (2004). https://doi.org/10.1109/
TSMCC.2004.826267
22. Parker, L.E.: Cooperative robotics for multi-target observation. Intell. Autom. Soft
Comput. 5(1), 5–19 (1999)
23. Pierson, A., Rus, D.: Distributed target tracking in cluttered environments with
guaranteed collision avoidance. In: 2017 International Symposium on Multi-Robot
and Multi-Agent Systems (MRS), pp. 83–89 (2017). https://doi.org/10.1109/MRS.
2017.8250935
24. Pimenta, L.C.A., et al.: Simultaneous coverage and tracking (SCAT) of moving tar-
gets with robot networks. In: Chirikjian, G.S., Choset, H., Morales, M., Murphey,
T. (eds.) Algorithmic Foundation of Robotics VIII. Springer Tracts in Advanced
Robotics, vol. 57, pp. 85–99. Springer, Heidelberg (2010). https://doi.org/10.1007/
978-3-642-00312-7 6
25. Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L.: Joint optimization of multi-UAV
target assignment and path planning based on multi-agent reinforcement learning.
IEEE Access 7, 146264–146272 (2019). https://doi.org/10.1109/ACCESS.2019.
2943253
26. Ribas, D., Ridao, P., Neira, J.: Localization with an a priori Map. In: Ribas,
D., Ridao, P., Neira, J. (eds.) Underwater SLAM for Structured Environments
Using an Imaging Sonar. Springer Tracts in Advanced Robotics, vol. 65, pp. 47–
75. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14040-2 5
27. Robin, C., Lacroix, S.: Multi-robot target detection and tracking: taxonomy and
survey. Auton. Robot. 40(4), 729–760 (2016)
28. Schuhmacher, D., Vo, B.T., Vo, B.N.: A consistent metric for performance evalua-
tion of multi-object filters. IEEE Trans. Signal Process. 56(8), 3447–3457 (2008).
https://doi.org/10.1109/TSP.2008.920469
29. Schwager, M., Dames, P., Rus, D., Kumar, V.: A multi-robot control policy for
information gathering in the presence of unknown hazards. In: Christensen, H.I.,
Khatib, O. (eds.) Robotics Research. STAR, vol. 100, pp. 455–472. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-29363-9 26
30. Stone, L.D., Streit, R.L., Corwin, T.L., Bell, K.L.: Bayesian Multiple Target Track-
ing. Artech House (2013)
31. Suresh, A.T., Yu, F.X., McMahan, H.B., Kumar, S.: Distributed mean estimation
with limited communication. CoRR abs/1611.00429 (2016)
32. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press, Cambridge
(2005)

33. Vo, B.T., Vo, B.N., Cantoni, A.: The cardinality balanced multi-target multi-
bernoulli filter and its implementations. IEEE Trans. Signal Process. 57(2), 409–
423 (2009). https://doi.org/10.1109/TSP.2008.2007924
34. Yan, Z., Jouandeau, N., Cherif, A.A.: A survey and analysis of multi-robot coordi-
nation. Int. J. Adv. Rob. Syst. 10(12), 399 (2013). https://doi.org/10.5772/57313
35. Zhou, W., Li, J., Liu, Z., Shen, L.: Improving multi-target cooperative tracking
guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aero-
naut. 35(7), 100–112 (2022). https://doi.org/10.1016/j.cja.2021.09.008
Distributed Multirobot Control
for Non-cooperative Herding

Nishant Mohanty(B) , Jaskaran Grover, Changliu Liu, and Katia Sycara

The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA


{nishantm,jaskarag,cliu6,sycara}@andrew.cmu.edu

Abstract. In this paper, we consider the problem of protecting a high-


value area from being breached by sheep agents by crafting motions for
dog robots. We use control barrier functions to pose constraints on the
dogs’ velocities that induce repulsions in the sheep relative to the high-
value area. This paper extends the results developed in our prior work
on the same topic in three ways. Firstly, we implement and validate our
previously developed centralized herding algorithm on many robots. We
show herding of up to five sheep agents using three dog robots. Secondly,
as an extension to the centralized approach, we develop two distributed
herding algorithms, one favoring feasibility and the other favoring optimality. In the first algorithm, we allocate a unique sheep to a unique dog,
making that dog responsible for herding its allocated sheep away from the
protected zone. We provide feasibility proof for this approach, along with
numerical simulations. In the second algorithm, we develop an iterative
distributed reformulation of the centralized algorithm, which inherits the
optimality (i.e. budget efficiency) from the centralized approach. Lastly,
we conduct real-world experiments of these distributed algorithms and
demonstrate herding of up to five sheep agents using five dog robots.
Videos of these results are available at https://bit.ly/3bZq0dB.

Keywords: Herding · Barrier Functions · Quadratic Programming

1 Introduction
Recent developments in robotics and sensing have created significant interest among researchers in deploying multiple robots that operate cooperatively towards achieving a common goal. Many works have developed techniques to tackle real-world problems using multi-robot systems (MRS), such as conducting surveys or automating warehouses [1–3]. The major developments in MRS for enabling
multiple robots to behave cooperatively have been based on interactions within
a single team, i.e., a robot interacts with other robots in its group to achieve a
given objective [4,5]. The main features of these types of algorithms are a) local interaction, b) collision-free motion within the group, and c) achieving collective behavior using local interaction [6].

N. Mohanty and J. Grover—These authors contributed equally to this work.
This research is supported by AFOSR FA9550-18-1-0097 and AFRL/AFOSR FA9550-18-1-0251.
In the literature, there are studies on MRS that involve interaction between multiple groups of agents. Here, along with the local interaction with group members, the individuals also interact with external agents from another group. An example of this is a scenario where a group of adversarial robots has a goal of its own that might damage a given high-value unit; a group of defenders must then interact with the adversarial robots to ensure the safety of the unit [7,8]. In this paper, we propose a provably correct controller for the group of defenders (“dog robots”) to prevent an adversarial group (the “sheep robots”) from breaching a protected zone. This is challenging because the dog robots do not control the sheep robots directly; rather, they have to rely on the interaction dynamics between the dogs and sheep to influence the sheep’s behavior.
In our prior work [9]¹, we developed a centralized algorithm to solve this
problem using control barrier functions. In this work, a) we provide more exper-
imental validation of the centralized algorithm, b) propose two distributed algo-
rithms, and c) provide simulations and experiments to validate these algorithms.
Our formulation computes the velocity of each dog locally to prevent sheep from
breaching the protected zone(s). In the first distributed algorithm, we allocate
each sheep to a unique dog and pose a constraint on that dog’s velocity to herd
its allocated sheep away from the protected zone. We provide a proof of feasibility of this approach, thus showing that whenever the numbers of sheep and dogs are equal, the herding problem is well-posed. Our previously proposed centralized algorithm lacked this feasibility guarantee. However, it did not necessitate
equal numbers of dogs and sheep; in fact, in many experiments, fewer dogs
than sheep were sufficient to herd all the sheep away. This observation led us
to develop the second algorithm. In this algorithm, we construct an iterative
distributed approach that asymptotically attains the same velocities as com-
puted by the centralized approach, thereby attaining the same total optimality
(measured in terms of the total movement the dogs exhibit) as the centralized
approach and obviating the need to have equal numbers of dogs and sheep.
We build on the dual-decomposition algorithms proposed in [10,11] for develop-
ing this distributed algorithm. Both of our proposed distributed algorithms are compositional in nature, i.e., we can protect multiple zones by including more constraints, as shown in Fig. 1(c). To highlight the performance of our formula-
tion, we provide results from numerical simulations showing the success of our
approach for multiple dogs against multiple sheep. Finally, we demonstrate our
algorithm on real robots and show multiple dog robots successfully preventing
the breaching of protected zones against multiple sheep robots.
The outline of this paper is as follows: in Sect. 2, we give a brief review of
the prior work in this area. In Sect. 3, we provide a mathematical formulation of
the problem statement. In Sect. 4, we show how to use control barrier functions
to pose constraints on dog velocities. Section 5 provides simulation and experimental results to demonstrate the proposed approach. Finally, we summarize our work in Sect. 6 along with our directions for future work.

¹ Accepted in IEEE Conference on Decision and Control 2022.

2 Prior Work
The framework of multi-group interaction within MRS has many applications
beyond the adversarial problem statements. The shepherding problem is an
example of such a category. In [12,13], the authors have proposed methods to
enable multiple shepherd agents to influence a flock of sheep by modeling the
interaction as repulsion forces. The Robot Sheepdog Project [14,15] conducted
a real-world demonstration of a shepherding algorithm where a group of mobile
ground robots cooperatively herded a flock of ducks to a given goal location.
In the literature, there are several works on non-cooperative shepherding as an example of a multi-group interaction problem. Works such as [13,16–20] deal with settings where the sheep robots do not exhibit adversarial behavior: they do not have any goals of their own, but they experience a repulsive force from the dog robots, which is exploited to produce a desired behavior in the sheep, for example, collecting all the sheep at some location and then driving them to a target goal.
Differently from prior work, our sheep may or may not be adversarial. We call them adversarial if their goal lies inside the protected zone and non-adversarial otherwise; our safe control synthesis approach remains the same regardless. The dog robots observe the sheep and generate their control commands considering the cohesion between the sheep robots, the sheep’s attraction to their goal location, and the repulsion the sheep experience from the dog robots. Since we use control barrier functions to generate the constraints on the velocities of the dog robots, our method only requires the dynamics of the sheep to be representable as a symbolic function, thus allowing the sheep to experience any kind of attractive or repulsive forces.

3 Problem Formulation
Consider a scenario with m sheep agents flocking towards a common goal loca-
tion. One commonly assumed model for flocking is the Reynolds-Boids dynamics
[6] that considers inter-sheep cohesive forces, inter-sheep repulsive forces, and
attraction to a common goal. In the presence of dog agents, each sheep’s dynam-
ics would include repulsive forces from each dog robot. While en route to their
goal, the sheep, having no knowledge about high-value regions in workspace
(protected zones), pose a risk of breaching them. Thus, our problem is to orches-
trate the motions of dog robots by capitalizing on the repulsions that the sheep
experience from the dogs to prevent this breaching. Next, we pose this problem
in formal terms.
Consider the protected zone P ⊂ R² as a disc centered at x_P with radius R_p, i.e., P := {x ∈ R² | ‖x − x_P‖ ≤ R_p}. We denote the flock of sheep as S and the position of the i-th sheep as x_{S_i} ∈ R². The collective positions of all sheep are denoted x^{all}_S := (x_{S_1}, x_{S_2}, ..., x_{S_m}). Similarly, we denote the set of all dogs as D. The position of the k-th dog is x_{D_k} ∈ R², and the positions of all dogs are collectively denoted x^{all}_D := (x_{D_1}, x_{D_2}, ..., x_{D_n}). Each sheep follows the single-integrator dynamics ẋ_{S_i} := f_i(x_{S_1}, ..., x_{S_m}, x_{D_1}, ..., x_{D_n}), given by
\[ \dot{x}_{S_i} = u_{S_i} = \underbrace{k_S \sum_{j \in S \setminus i} \Big(1 - \frac{R_S^3}{\|x_{S_j} - x_{S_i}\|^3}\Big)(x_{S_j} - x_{S_i})}_{\text{inter-sheep cohesion and repulsion}} + \underbrace{k_G (x_G - x_{S_i})}_{\text{attraction to goal}} + \underbrace{k_D \sum_{l \in D} \frac{x_{S_i} - x_{D_l}}{\|x_{S_i} - x_{D_l}\|^3}}_{\text{repulsion from dogs}} \tag{1} \]

Here, R_S is a safety margin that each sheep tends to maintain with every other sheep, x_G is the sheep’s desired goal, and k_S, k_G, and k_D are proportional gains corresponding to the attractive and repulsive forces. We model each dog as a velocity-controlled robot with the following dynamics:

\[ \dot{x}_{D_k} = u_{D_k} \quad \forall k \in \{1, 2, \cdots, n\} \tag{2} \]
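For concreteness, a direct transcription of the models (1)–(2) in Python (a simulation sketch; the gains follow the values reported in Sect. 5.1, while the safety margin R_S and step size are illustrative assumptions):

```python
import numpy as np

def sheep_velocity(i, xs, xd, xg, k_s=0.5, k_g=1.0, k_d=0.1, r_s=0.3):
    """f_i of (1): flocking + goal attraction + repulsion from dogs."""
    u = k_g * (xg - xs[i])                          # attraction to goal
    for j in range(len(xs)):
        if j == i:
            continue
        d = xs[j] - xs[i]
        r = np.linalg.norm(d)
        u += k_s * (1.0 - r_s**3 / r**3) * d        # cohesion/repulsion
    for l in range(len(xd)):
        d = xs[i] - xd[l]
        u += k_d * d / np.linalg.norm(d)**3         # repulsion from dog l
    return u

def step(xs, xd, ud, xg, dt=0.05):
    """One Euler step of the coupled system; dogs follow commands (2)."""
    us = np.array([sheep_velocity(i, xs, xd, xg) for i in range(len(xs))])
    return xs + dt * us, xd + dt * ud
```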

Before posing the problem, we state some assumptions on the dogs’ knowledge:
Assumption 1. The dog robots have knowledge about the sheep’s dynamics i.e.
(1) and can measure the sheep’s positions accurately.

Assumption 2. Each dog robot can measure the velocities of other dog robots
(by using numerical differentiation, for example).

Problem 1. Assuming that the initial positions of the sheep satisfy x_{S_i}(0) ∉ P ∀i ∈ S, the dog robots’ problem is to synthesize controls {u_{D_1}, · · · , u_{D_n}} such that x_{S_i}(t) ∉ P ∀t ≥ 0, ∀i ∈ S.

4 Controller Design
In this section, we show two approaches to solve Problem 1, building on our previously proposed centralized algorithm [9]. Define a safety index h(·) : R² → R that quantifies the distance of S_i from P:

\[ h(x_{S_i}) = \|x_{S_i} - x_P\|^2 - (r + R_p)^2 \tag{3} \]

Here r is a safety buffer distance. Thus, we require h(x_{S_i}(t)) ≥ 0 ∀t ≥ 0. We define x = (x^{all}_S, x^{all}_D) as the aggregated state of all sheep and all dogs. To ensure h(x_{S_i}(t)) ≥ 0 ∀t ≥ 0, we treat h(·) as a control barrier function and require its derivative to satisfy

\[ \dot{h}(x) + p_1 h(x_{S_i}) \ge 0. \tag{4} \]

Here p_1 is a design parameter and is chosen such that it satisfies:

\[ p_1 > 0 \quad \text{and} \quad p_1 > -\frac{\dot{h}(x(0))}{h(x_{S_i}(0))}. \tag{5} \]

The first condition on p_1 ensures the pole is real and negative. The second depends on the initial positions x(0) of all the sheep and dogs relative to the protected zone. Note that the constraint in (4) does not contain any dog velocity terms, which is what we require to control each dog. Therefore, we define the LHS of (4) as another control barrier function v(x) : R^{4n} → R:

\[ v = \dot{h} + p_1 h, \tag{6} \]

and require its derivative to satisfy the constraint v̇(x) + p_2 v(x) ≥ 0. Here p_2 is another design parameter which must satisfy

\[ p_2 > 0 \quad \text{and} \quad p_2 > -\frac{\ddot{h}(x(0)) + p_1 \dot{h}(x(0))}{\dot{h}(x(0)) + p_1 h(x_{S_i}(0))} \tag{7} \]
Using (3), (6) and the constraint on the derivative, we get

\[ \ddot{h}(x) + \alpha \dot{h}(x) + \beta h(x_{S_i}) \ge 0 \tag{8} \]

where α := p_1 + p_2 and β := p_1 p_2. The derivatives of h(·) are:

\[ \dot{h}(x) = 2 (x_{S_i} - x_P)^T \dot{x}_{S_i} = 2 (x_{S_i} - x_P)^T f_i(x) \tag{9} \]

\[ \ddot{h}(x) = 2 f_i^T f_i + 2 (x_{S_i} - x_P)^T \Big( \sum_{j \in S} J^S_{ji} f_j + \sum_{l \in D} J^D_{li} u_{D_l} \Big) \tag{10} \]

where the Jacobians are defined as J^S_{ji} := ∇_{x_{S_j}} f_i(x) and J^D_{li} := ∇_{x_{D_l}} f_i(x).


Note that (10) contains the velocity terms of all dogs. In [9], we leveraged this observation to obtain a linear constraint on the velocities of all dogs collectively for preventing sheep S_i from breaching P:

\[ A^C_i u^{all}_D \le b^C_i, \quad \text{where} \tag{11} \]
\[ A^C_i := (x_P - x_{S_i})^T \left[ J^D_{1i}, J^D_{2i}, \cdots, J^D_{ni} \right] \]
\[ b^C_i := f_i^T f_i + (x_{S_i} - x_P)^T \sum_{j \in S} J^S_{ji} f_j + \alpha (x_{S_i} - x_P)^T f_i + \beta \frac{h}{2} \]

This gives us a centralized algorithm that collectively computes the velocities of all dogs using the following QP:

\[ u^{all}_D = \arg\min_{u^{all}_D} \|u^{all}_D\|^2 \quad \text{subject to} \quad A^C_i u^{all}_D \le b^C_i \ \ \forall i \in S. \tag{12} \]

Building on this centralized approach, in this paper we develop two distributed approaches wherein we allow each dog to compute its velocity locally. The computed velocities will make the dogs herd the sheep away from P.
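The QP (12) is small—2n decision variables and one linear constraint per sheep—so any off-the-shelf solver applies. Below is a sketch using CVXPY, assuming the rows A_i^C and entries b_i^C of (11) have already been stacked into a matrix A and vector b; it is not the authors' implementation.

```python
import cvxpy as cp

def centralized_dog_velocities(A, b, n_dogs):
    """Solve QP (12): min ||u||^2 s.t. A u <= b, with u stacking all dog velocities."""
    u = cp.Variable(2 * n_dogs)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u)), [A @ u <= b])
    prob.solve()
    if prob.status != cp.OPTIMAL:
        return None                       # infeasible: no safe joint velocity exists
    return u.value.reshape(n_dogs, 2)     # one 2-D velocity per dog
```

In the distributed variants that follow, each dog replaces this joint program with a local computation.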

4.1 Approach 1: One Dog to One Sheep Allocation Based Approach

In this approach, we assume that we have an equal number of dogs and sheep. By
exploiting this equality, we assign a unique sheep Si for i ∈ {1, · · · , n} to a unique
dog Dk for k ∈ {1, · · · , n} and make Dk responsible for herding Si away from
P. In other words, D_k computes a velocity u_{D_k} that repels S_i from P, thereby ensuring that x_{S_i}(t) ∉ P ∀t ≥ 0. The premise is that, owing to the equality, each sheep will end up being herded by a unique dog; therefore, no sheep will breach the protected zone². Now while this strategy necessitates having an equal
number of dogs and sheep, the benefit of this approach stems from the feasibility
guarantee (that we prove shortly), which the centralized approach lacks. Simple
algebraic manipulation of constraint (11) yields a constraint on the velocity of
Dk as follows

\[ A^H_i u_{D_k} \le b^H_i, \quad \text{where} \tag{13} \]
\[ A^H_i := (x_P - x_{S_i})^T J^D_{ki} \]
\[ b^H_i := f_i^T f_i + \beta \frac{h}{2} + (x_{S_i} - x_P)^T \Big( \sum_{j \in S} J^S_{ji} f_j + \alpha f_i + \sum_{l \in D \setminus k} J^D_{li} u_{D_l} \Big) \]

Here A^H_i ∈ R^{1×2} and b^H_i ∈ R. The term u_{D_l} in the expression of b^H_i is computed by numerical differentiation of the positions x_{D_l}. We pose a QP to obtain the min-norm velocity for D_k as follows:

\[ u^*_{D_k} = \arg\min_{u_{D_k}} \|u_{D_k}\|^2 \quad \text{subject to} \quad A^H_i u_{D_k} \le b^H_i \tag{14} \]
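Since (14) has a single linear inequality constraint, it admits a closed-form solution—u* = 0 when the constraint is already satisfied at zero, and otherwise the projection of the origin onto the constraint boundary. This observation and the sketch below are ours, not stated in the paper:

```python
import numpy as np

def min_norm_velocity(A, b):
    """Closed-form solution of (14): min ||u||^2 s.t. A u <= b, A in R^{1x2}.

    Assumes A != 0, which Theorem 1 guarantees (possibility 1 is ruled out).
    """
    A = np.asarray(A, dtype=float).reshape(-1)
    if b >= 0.0:
        return np.zeros(2)               # u = 0 already satisfies the constraint
    return (b / (A @ A)) * A             # projection onto {u : A u = b}
```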

The obtained velocity u∗Dk guarantees that the protected zone P will not be
breached by sheep Si by ensuring that h(xSi (t)) ≥ 0 ∀t ≥ 0. Since each dog
in D is in-charge of herding exactly one sheep in S, feasibility of (13) ∀k ∈ D
would ensure no sheep breaches P. Next, we show the conditions under which
(14) remains feasible but first state some assumptions.
Assumption 3. We make the following assumptions on the distances between pairs of agents:

1. There exist lower and upper bounds on the distance between any pair of sheep, i.e., L_S ≤ ‖x_{S_i} − x_{S_j}‖ ≤ M_S, ∀i, j ∈ S and i ≠ j.
2. There exists a lower bound on the distance between every sheep and dog, i.e., ‖x_{S_i} − x_{D_k}‖ ≥ L_D ∀i ∈ S and k ∈ D.
3. There exist upper bounds on the distance between each sheep and its goal, i.e., ‖x_{S_i} − x_G‖ ≤ M_G, and between each sheep and the center of the protected zone, i.e., ‖x_{S_i} − x_P‖ ≤ M_P.

² Note that although S_i is assigned to D_k, the positions of the remaining dogs {1, · · · , n}\k and the remaining sheep {1, · · · , n}\i do influence D_k’s constraint parameters (A^H_i, b^H_i), and in turn, its computed velocity u*_{D_k}.

Theorem 1. In a scenario with ‘n’ dogs and ‘n’ sheep, with each dog assigned a unique sheep, the herding constraint (13) for a given dog is always feasible, provided Assumptions 3 are met.

Proof. See appendix (Sect. 7).

4.2 Approach 2: Iterative Distributed Reformulation of (12)

The distributed formulation proposed in (14) comes with a feasibility guarantee


ensuring that all sheep will be herded away from P. While vital, this comes at
the cost of requiring as many dog robots as the number of sheep agents. This
is because, in a way, this equality ensures that controlling the sheep from the
perspective of dog robots is not an underactuated problem. Be that as it may,
in our simulations and experiments involving the centralized approach with an
equal number of dogs and sheep, we frequently observed that not all dog robots
needed to move to repel the sheep away from P, i.e., the equality may have been overkill. Thus, in terms of budget efficiency, at least empirically, the centralized
approach outweighs the distributed approach.
This raises the question: can we convert the centralized algorithm of (12) into a distributed version that inherits the budget efficiency (optimality) promised by (12)? Indeed, we found that [10,11] propose algorithms to convert constraint-coupled convex optimization problems (such as (12)) into distributed counterparts. They combine techniques called dual decomposition and proximal minimization and develop iterative distributed schemes which consist of local optimization problems. The solutions to these optimization problems asymptotically converge to the solution of the centralized optimization under mild convexity assumptions and connectivity properties of the communication network. In our case, this network refers to the communication between dog robots. Below, we present the distributed dual subgradient method of [10,11] adapted to the costs and constraints of (12). This algorithm calculates an estimate of dog D_k’s velocity û_{D_k} which, given a large enough number of iterations K_max, matches the k-th velocity component of the optimal velocities u^{all*}_D returned by (12). A_k ∈ R^{n_S × 2} refers to those columns of A^H that correspond to u_{D_k} in u^{all}_D.

Algorithm 1. Distributed Dual Subgradient for (12) (based on Sect. 3.4.2 in [11])

Initialize Lagrange multiplier: μ⁰_k = 0 ∈ R^{n_S}
Evolution: t = 1, 2, · · · , K_max
    Gather multipliers μᵗ_r from D_r ∀r ∈ {1, · · · , n_D}\k
    Average multipliers: v^{t+1}_k = (1/n_D) Σ_{r∈{1,··· ,n_D}\k} μᵗ_r
    Local solution: u^{t+1}_{D_k} = arg min_u ‖u‖² + (v^{t+1}_k)^T (A_k u − (1/n_D) b^H) = −(1/2) A_k^T v^{t+1}_k
    Update multiplier: μ^{t+1}_k = [ v^{t+1}_k + γᵗ (A_k u^{t+1}_{D_k} − (1/n_D) b^H) ]_+
Return average: û_{D_k} = (1/K_max) Σ_{t=1}^{K_max} u^t_{D_k}
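A minimal sketch of this iteration in Python (the inter-robot multiplier exchange is abstracted into shared lists, and the diminishing step size γ_t = γ_0/t is an illustrative assumption):

```python
import numpy as np

def dual_subgradient(A_blocks, b, K_max=500, gamma0=1.0):
    """Distributed dual subgradient for (12); A_blocks[k] is A_k (n_S x 2)."""
    n_d, n_s = len(A_blocks), len(b)
    mu = [np.zeros(n_s) for _ in range(n_d)]        # one multiplier per dog
    u_sum = [np.zeros(2) for _ in range(n_d)]
    for t in range(1, K_max + 1):
        # Average the multipliers gathered from the other dogs
        v = [sum(mu[r] for r in range(n_d) if r != k) / n_d for k in range(n_d)]
        for k in range(n_d):
            u = -0.5 * A_blocks[k].T @ v[k]          # local closed-form solution
            u_sum[k] += u
            step = gamma0 / t                        # diminishing step size
            # Projected multiplier update [.]_+ onto the nonnegative orthant
            mu[k] = np.maximum(0.0, v[k] + step * (A_blocks[k] @ u - b / n_d))
        # (on hardware, each dog would now broadcast mu[k] to the others)
    return [s / K_max for s in u_sum]                # running averages u_hat
```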

5 Results

In this section, we provide simulation and real-world experimental results demon-


strating our proposed distributed algorithms.

5.1 Simulation Results

We first validate the first distributed algorithm and the feasibility proof given
in Sect. 4.1. For this, we model the sheep with the Reynolds-Boids dynamics (1)
with gains kS = 0.5, kG = 1 and kD = 0.1. The dogs use (14) to compute their
velocities, where hyperparameters α and β are computed following (5) and (7).
We chose a circular protected zone of radius R_p = 0.6 m centered at the origin. The sheep are initialized outside of the protected zone, and their goal location x_G is chosen such that their nominal trajectory would make them breach the zone, thus necessitating intervention from the dogs. The positions of the dogs are initialized randomly within a certain range of the protected zone. In Figs. 1(a) and 1(b),

Fig. 1. Preventing the breaching of the protected zone using our proposed distributed
algorithm in Sect. 4.1. Here dogs are shown in blue and sheep in red. The green disc
represents the protected zone. The nominal task of the sheep is to go straight towards
goal xG . However, since this would result in infiltration of the protected zone, the dog
intervenes using the control algorithm presented in (14). In Fig. 1(c), we defend two
protected zones from four sheep.

we show two examples involving a) two dog robots vs. two sheep robots and b)
three dog robots vs. three sheep robots. To demonstrate the compositionality of
our approach, we consider two protected zones in Fig. 1(c) where we have four
dogs defending both zones from four sheep. In all these simulations, none of the
sheep breach any zone, thus demonstrating the correctness of our approach. In
the interest of space, we skip the simulation results for the algorithm in Sect. 4.2
but do provide experimental results.

5.2 Robot Experiments

In this section, we show the results obtained from robot experiments implementing the distributed algorithms of Sect. 4.1 and Sect. 4.2. Additionally, we also present more experimental results for our prior centralized algorithm from [9] (because at the time, we did not have as many robots). We conduct these experiments in our lab’s multi-robot arena, which consists of a 14 ft × 7 ft platform with multiple Khepera IV robots and eight Vicon cameras for motion tracking. Although Khepera robots have unicycle dynamics, [9] provides a technique to convert the single-integrator dynamics (assumed for dogs and sheep) to linear and angular velocity commands for the robots.
First of all, to build upon our previous work, we show additional experiments using the centralized velocity computation for the dog robots (12). Figure 2 shows a case with 2 dog and 4 sheep robots. The dog robots have a green tail, and the sheep robots have an orange tail; the tails point in the direction opposite to the robot’s heading angle. The protected zone is the green circular region. This figure shows the performance in the case of an underactuated system, i.e., with more sheep than dogs. Another example is shown in Fig. 3, where 3 dogs successfully prevent breaching against 5 sheep robots.
Following that, multiple experiments were conducted using the distributed algorithm presented in Sect. 4.1, which requires equal numbers of dogs and sheep. Figure 4 shows a scenario with 4 dog robots against 4 sheep robots. Here we consider two protected zones and show that the dogs can protect both of them, highlighting the compositional nature of our proposed algorithm. We also conducted experiments with 5 dog robots and 5 sheep robots, as shown in Fig. 5. Here, some dog robots did not need to move, as their assigned sheep were already prevented from entering the protected zone by the configuration of the flock itself.
Finally, we test our distributed algorithm presented in Sect. 4.2. Figure 6 shows a case where 2 dogs prevent the breaching of the protected zone against three sheep. This highlights that our distributed approach can handle under-actuated scenarios. Figure 7 and Fig. 2 can be compared to see the centralized and distributed algorithms handling a similar scenario of 2 dogs against 4 sheep.

Fig. 2. Experiments for Centralized Control: Two dogs defending the protected zone from four sheep using the centralized control algorithm (12) from our prior work [9]. Video at https://bit.ly/3OTAnOu.

Fig. 3. Experiment for Centralized Control: Three dogs (green-tailed robots) defending a protected zone from five sheep (orange-tailed robots) using centralized control (12) from our prior work [9]. Video at https://youtu.be/2 Xuxnd9jZw.

Fig. 4. Experiment for the distributed algorithm in Sect. 4.1: Four dogs (green-tailed robots) defending two protected zones from four sheep (orange-tailed robots). The goal position x_G (red disc) is at the extreme left, which would encourage the sheep to breach both zones. However, our proposed algorithm moves the dogs so that neither zone gets breached. Video at https://bit.ly/3yo9ziC.

Fig. 5. Experiment for the distributed algorithm in Sect. 4.1: Five dogs (green-tailed robots) defending the protected zone from five sheep (orange-tailed robots). The sheep’s goal (red disc) is at the center of the protected zone. Eventually, a deadlock occurs in this scenario, where all sheep come to a stop outside the protected zone. Video at https://bit.ly/3o51Cu1.

Fig. 6. Experiment for the distributed algorithm in Sect. 4.2: Two dogs (green-tailed robots) defending the protected zone from three sheep (orange-tailed robots). The goal position x_G (red disc) is at the center of the zone. Video at https://youtu.be/IbCjkR1ye0c.

Fig. 7. Experiment for the distributed algorithm in Sect. 4.2: Two dogs (green-tailed robots) defending the protected zone from four sheep (orange-tailed robots). This case is similar to the one shown in Fig. 2. Video at https://youtu.be/51FoHZWFYC4.

6 Conclusions
In this paper, we developed novel optimization-based distributed control techniques to enable multiple dog robots to prevent the breaching of protected zones by sheep agents. We provided a proof of feasibility of the controller when n dog robots face an equal number of sheep robots. Additionally, we developed another distributed algorithm that iteratively computes a solution that agrees with the solution returned by the centralized problem without requiring equal numbers of dogs and sheep. We experimentally validated both distributed algorithms in addition to validating our previously developed centralized control. We showed that multiple dog robots can prevent the breaching of protected zones in both simulation and real-world experiments. In future work, we aim for the dog robots to learn the dynamics of the sheep robots online while preventing them from breaching.

7 Appendix: Proof of Feasibility for Approach 1

Theorem 1. In a scenario with ‘n’ dogs and ‘n’ sheep, with each dog assigned
a unique sheep, the herding constraint (13) for a given dog is always feasible,
provided Assumptions 3 are met.

Proof. Our strategy to guarantee feasibility of constraint (13) relies on ruling out situations in which it is infeasible. (13) can become infeasible

– either when A^H_i = 0 and b^H_i < 0 (possibility 1),
– or when b^H_i = −∞ (possibility 2).

To determine the conditions in which possibility 1 occurs, we calculate the determinant of J^D_{ki} as

\[ \det(J^D_{ki}) = \frac{-2 k_D^2}{\|x_{D_k} - x_{S_i}\|^6} \]

The determinant det(J^D_{ki}) is non-zero as long as the distance between dog D_k and sheep S_i is finite. Therefore, J^D_{ki} has no null space, implying that A^H_i ≠ 0 ∀x_{S_i} ∈ R², x_{D_k} ∈ R². This rules out possibility 1 for infeasibility. To rule out possibility 2, we need to check for conditions under which b^H_i → −∞. Given b^H_i in (13), we find its worst-case lower bound. Here f_i^T f_i ≥ 0, and as we assume that at the current time step the sheep is outside P, we have β h/2 ≥ 0. By removing these terms, a lower bound on b^H_i can be given as

\[ b^H_i \ge \sum_{j \in S \setminus i} (x_{S_i} - x_P)^T J^S_{ji} f_j + (x_{S_i} - x_P)^T J^S_{ii} f_i + \sum_{l \in D \setminus k} (x_{S_i} - x_P)^T J^D_{li} u_{D_l} + \alpha (x_{S_i} - x_P)^T f_i \tag{1} \]
Using the triangle inequality on the RHS and the Cauchy-Schwarz inequality on individual terms, we get

\[ b^H_i \ge -\sum_{j \in S \setminus i} \sigma_{max}\big(J^S_{ji}\big) \|x_{S_i} - x_P\| \|f_j\| - \sigma_{max}\big(J^S_{ii}\big) \|x_{S_i} - x_P\| \|f_i\| - \sum_{l \in D \setminus k} \sigma_{max}\big(J^D_{li}\big) \|x_{S_i} - x_P\| \|u_{D_l}\| - \alpha \|x_{S_i} - x_P\| \|f_i\| \tag{2} \]

where σ_max is the largest singular value of a matrix. Further, using the fact that the largest singular value of a matrix (σ_max) is upper bounded by its Frobenius norm (σ_F), we obtain

\[ b^H_i \ge -\sum_{j \in S \setminus i} \sigma_F\big(J^S_{ji}\big) \|x_{S_i} - x_P\| \|f_j\| - \sigma_F\big(J^S_{ii}\big) \|x_{S_i} - x_P\| \|f_i\| - \sum_{l \in D \setminus k} \sigma_F\big(J^D_{li}\big) \|x_{S_i} - x_P\| \|u_{D_l}\| - \alpha \|x_{S_i} - x_P\| \|f_i\| \tag{3} \]

Now, to compute this lower bound, we make use of Assumption 3. We use the dynamics in (1) to compute J^S_{ii}, obtain an upper bound on σ_F(J^S_{ii}), and use the bounds on distances from Assumption 3 to get the following upper bound:

\[ \sigma_F\big(J^S_{ii}\big) \le k_S \sum_{j \in S \setminus i} \Big( \sqrt{2} + \frac{(3+\sqrt{2}) R^3}{\|x_{S_i} - x_{S_j}\|^3} \Big) + \sqrt{2}\, k_G + \sum_{l \in D} \frac{(3+\sqrt{2})\, k_D}{\|x_{S_i} - x_{D_l}\|^3} \le (n-1) k_S \Big( \sqrt{2} + \frac{(3+\sqrt{2}) R^3}{L_S^3} \Big) + \sqrt{2}\, k_G + n\, \frac{(3+\sqrt{2})\, k_D}{L_D^3} := \lambda_M \]

We omit the proof of this computation in the interest of space. Similarly, using the dynamics in (1), we compute an expression for J^S_{ji} and obtain an upper bound on σ_F(J^S_{ji}) as follows:

\[ \sigma_F\big(J^S_{ji}\big) \le \sqrt{2}\, k_S + \frac{(3+\sqrt{2})\, k_S R^3}{\|x_{S_i} - x_{S_j}\|^3} \le \sqrt{2}\, k_S + \frac{(3+\sqrt{2})\, k_S R^3}{L_S^3} := \lambda_S \]
 
Likewise, an upper bound on σ_F(J^D_{li}) is given by

\[ \sigma_F\big(J^D_{li}\big) \le \frac{(3+\sqrt{2})\, k_D}{\|x_{S_i} - x_{D_l}\|^3} \le \frac{(3+\sqrt{2})\, k_D}{L_D^3} := \lambda_D \]
Lastly, we obtain an upper bound on the dynamics of each sheep f_i as:

\[ \|f_i\| \le k_S \sum_{j \in S \setminus i} \Big( \|x_{S_i} - x_{S_j}\| + \frac{R^3}{\|x_{S_i} - x_{S_j}\|^2} \Big) + k_G \|x_G - x_{S_i}\| + k_D \sum_{l \in D} \frac{\|x_{S_i} - x_{D_l}\|}{\|x_{S_i} - x_{D_l}\|^3} \tag{4} \]

Now we need to compute the maximum possible value of the RHS to get the upper bound on the sheep dynamics. The first term has a local minimum at ‖x_{S_i} − x_{S_j}‖ = 2^{1/3} R; therefore, its maximum can occur at either the lower or the upper bound of ‖x_{S_i} − x_{S_j}‖. Thus, the maximum value of the first term can be given as

\[ F_{max} := \max\Big( k_S L_S + k_S \frac{R^3}{L_S^2},\ \ k_S M_S + k_S \frac{R^3}{M_S^2} \Big). \]

The second term is maximum when ‖x_G − x_{S_i}‖ = M_G. The last term is maximum when the distance from the sheep to the dogs is minimum, ‖x_{S_i} − x_{D_k}‖ = L_D. Using these, the upper bound on the sheep dynamics is computed as:

\[ \|f_i\| \le (n-1)\, F_{max} + k_G M_G + n k_D \frac{1}{L_D^2} \]
Assuming that the velocities of the dog robots have an upper bound u_{D,max}, and taking the upper bounds on the dynamics of all the sheep to be equal, the lower bound on b^H_i from (3) is (taking γ = −(α + λ_M + (n−1)λ_S) M_P):

\[ b^H_i \ge \gamma\, \Big( (n-1)\, F_{max} + k_G M_G + \frac{n k_D}{L_D^2} \Big) - (n-1)\, \lambda_D\, M_P\, u_{D,max} \]

This shows that b^H_i has a finite lower bound, thus ruling out possibility 2. Thus, the herding constraint (13) for one dog to repel one sheep from the protected zone is always feasible. Since each sheep in S is allocated to one unique dog in D, extension of this feasibility result to all sheep ensures that none of them will breach the protected zone.
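As a numerical sanity check on the Jacobian determinant used above (our own verification aid, not part of the original proof), one can compare a finite-difference Jacobian of the dog-repulsion term in (1) against the closed form:

```python
import numpy as np

def repulsion(x_s, x_d, k_d=0.1):
    """Dog-repulsion term of (1): k_D (x_S - x_D) / ||x_S - x_D||^3."""
    d = x_s - x_d
    return k_d * d / np.linalg.norm(d)**3

def fd_jacobian(x_s, x_d, eps=1e-6):
    """Finite-difference Jacobian of the repulsion term w.r.t. the dog position."""
    J = np.zeros((2, 2))
    for c in range(2):
        dx = np.zeros(2); dx[c] = eps
        J[:, c] = (repulsion(x_s, x_d + dx) - repulsion(x_s, x_d - dx)) / (2 * eps)
    return J

x_s, x_d, k_d = np.array([1.0, 2.0]), np.array([-0.5, 0.3]), 0.1
r = np.linalg.norm(x_d - x_s)
print(np.linalg.det(fd_jacobian(x_s, x_d)), -2 * k_d**2 / r**6)  # should match
```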

References
1. D’Andrea, R.: Guest editorial: a revolution in the warehouse: a retrospective on
kiva systems and the grand challenges ahead. IEEE Trans. Autom. Sci. Eng. 9(4),
638–639 (2012)
2. D’Andrea, R., Dullerud, G.E.: Distributed control design for spatially intercon-
nected systems. IEEE Trans. Autom. Control 48(9), 1478–1495 (2003)
3. Kazmi, W., Bisgaard, M., Garcia-Ruiz, F., Hansen, K.D., la Cour-Harbo, A.: Adap-
tive surveying and early treatment of crops with a team of autonomous vehicles.
In: Proceedings of the 5th European Conference on Mobile Robots ECMR 2011,
pp. 253–258 (2011)
4. Ji, M., Egerstedt, M.: Distributed coordination control of multiagent systems while
preserving connectedness. IEEE Trans. Rob. 23(4), 693–703 (2007)
5. Lin, J., Morse, A.S., Anderson, B.D.: The multi-agent rendezvous problem-the
asynchronous case. In: 2004 43rd IEEE Conference on Decision and Control (CDC)
(IEEE Cat. No. 04CH37601), vol. 2. IEEE, pp. 1926–1931 (2004)
6. Reynolds, C.W.: Flocks, herds and schools: a distributed behavioral model. In:
Proceedings of the 14th Annual Conference on Computer Graphics and Interactive
Techniques, pp. 25–34 (1987)
7. Walton, C., Kaminer, I., Gong, Q., Clark, A., Tsatsanifos, T., et al.: Defense against
adversarial swarms with parameter uncertainty. arXiv preprint arXiv:2108.04205
(2021)

8. Tsatsanifos, T., Clark, A.H., Walton, C., Kaminer, I., Gong, Q.: Modeling and con-
trol of large-scale adversarial swarm engagements. arXiv preprint arXiv:2108.02311
(2021)
9. Grover, J., Mohanty, N., Luo, W., Liu, C., Sycara, K.: Noncooperative herd-
ing with control barrier functions: Theory and experiments, arXiv preprint
arXiv:2204.10945 (2022)
10. Falsone, A., Margellos, K., Garatti, S., Prandini, M.: Dual decomposition for multi-
agent distributed optimization with coupling constraints. Automatica 84, 149–158
(2017)
11. Notarstefano, G., Notarnicola, I., Camisa, A., et al.: Distributed optimization for
smart cyber-physical networks. Found. Trends Syst. Control 73, 253–383 (2019)
12. Lien, J.-M., Bayazit, O.B., Sowell, R.T., Rodriguez, S., Amato, N.M.: Shepherd-
ing behaviors. In: Proceedings of IEEE International Conference on Robotics and
Automation, ICRA’04. 2004, vol. 4. IEEE, 2004, pp. 4159–4164 (2004)
13. Pierson, A., Schwager, M.: Controlling noncooperative herds with robotic herders.
IEEE Trans. Rob. 34(2), 517–525 (2017)
14. Vaughan, R., Sumpter, N., Henderson, J., Frost, A., Cameron, S.: Robot control
of animal flocks. In: Proceedings of the 1998 IEEE International Symposium on
Intelligent Control (ISIC) held jointly with IEEE International Symposium on
Computational Intelligence in Robotics and Automation (CIRA) Intell., pp. 277–
282. IEEE (1998)
15. Vaughan, R., Sumpter, N., Henderson, J., Frost, A., Cameron, S.: Experiments in
automatic flock control. Robotics and autonomous systems, vol. 31, no. 1-2, pp.
109–117 (2000)
16. Pierson, A., Schwager, M.: Bio-inspired non-cooperative multi-robot herding. In:
2015 IEEE International Conference on Robotics and Automation (ICRA), pp.
1843–1849. IEEE (2015)
17. Licitra, R.A., Bell, Z.I., Doucette, E.A., Dixon, W.E.: Single agent indirect herding
of multiple targets: a switched adaptive control approach. IEEE Control Syst. Lett.
2(1), 127–132 (2017)
18. Licitra, R.A., Hutcheson, Z.D., Doucette, E.A., Dixon, W.E.: Single agent herding
of n-agents: a switched systems approach. IFAC-PapersOnLine 50(1), 14 374–
14 379 (2017)
19. Sebastián, E., Montijano, E.: Multi-robot implicit control of herds. In: 2021 IEEE
International Conference on Robotics and Automation (ICRA), pp. 1601–1607.
IEEE (2021)
20. Bacon, M., Olgac, N.: Swarm herding using a region holding sliding mode con-
troller. J. Vib. Control 18(7), 1056–1066 (2012)
On Limited-Range Coverage Control
for Large-Scale Teams of Aerial Drones:
Deployment and Study

Filippo Bertoncelli1 , Mehdi Belal2(B) , Dario Albani2 , Federico Pratissoli1 ,


and Lorenzo Sabattini1
1 Department of Sciences and Methods for Engineering (DISMI),
University of Modena and Reggio Emilia, Reggio Emilia, Italy
{filippo.bertoncelli,federico.pratissoli,lorenzo.sabattini}@unimore.it
2 Autonomous Robotics Research Center Technology Innovation Institute (TII),

Abu Dhabi, UAE


{mehdi.belal,dario.albani}@tii.ae

Abstract. When considering a multi-robot aerial system, coverage control can be seen as a basic behavior serving as a building block for complex missions. However, when the number of units in play becomes too high, centralized and globally optimal solutions might not scale well, causing a reduction in efficiency and reactiveness. To this end, in this manuscript we propose and analyze a control methodology for area coverage that is fully distributed and is able to mitigate the above factors while still providing high performance. The approach we propose relies on local Voronoi partitioning constrained by limited robot sensing and communication capabilities. To validate our approach, we perform an ablation study over a large set of simulations that we evaluate with a set of proposed metrics. At the same time, the algorithm has been tested in the field, with real quad-copters, in a set of proof-of-concept experiments deploying up to seven aerial units.

1 Introduction
Addressing a persistent and unpredictable service demand with a multi-robot
system can provide higher efficiency, thanks to parallel operation, and larger
flexibility by modulating the operational costs in terms of robots employed,
as well as by deploying robots only where needed. Concrete solutions can be
obtained by addressing two concurrent problems. In the last years, significant
improvements have been made by the researchers into the field of multi-robot
systems [4]. In particular, these systems are gaining relevance in the real world
applications where, due to unpredictability of the environments and the pres-
ence of external factors unknown a priori, they often requires the distributed
capabilities provided by a teams of cooperating robots [13]. At the same time,
a distributed and decentralized robotic system proves scalable, fault-tolerant
and, due to the absence of a central point of failure, ensures robust continuous
and efficient operation. Among the many applications that could benefit from
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 333–346, 2024.
https://doi.org/10.1007/978-3-031-51497-5_24
334 F. Bertoncelli et al.

above properties we find: Precision agriculture, emergency response and rescue,


environmental monitoring, search and rescue operations, and outdoor industrial
operations such as fault diagnosis [4]. Some require control and coordination
strategies and in the recent years several have been deeply studied which aim
at organizing the robots to enable solving global tasks using local information.
The main representative classes of control strategies are discussed in [4] and
among them, a remarkable paradigm is represented by coverage control, which
entails controlling a group of robots to be deployed in a given environment whit
a distribution that allows the units to cover it in an optimal way.
The literature is flourishing with strategies proposing multiple implementa-
tion of coverage control performed by a group of mobile robots—either ground,
aerial, marine or hybrid. Amongst these, well known methods [5,6] propose
decentralized coordination algorithms for groups of mobile agents based on
Voronoi diagrams [7,9,10] and proximity graphs [2,11]. This control strategy
provides a methodology to guarantee convergence of the positions of the robots
in the team to a configuration that maximizes the area covered in the environ-
ment. The main drawback is related to the fact that the computation of each
robot’s control input requires the computation of the Voronoi partitioning of the
entire environment, which requires global knowledge and/or global communica-
tion and information exchange [15,17]. However, in many practical application
scenarios, robots may not have access to global information or to a reliable all-
to-all communication network: therefore, methods were proposed to define the
Voronoi partitioning in a range-constrained scenario. A distributed methodology
was introduced in [5,6] to calculate the Voronoi partition over a limited sensing
network, exploiting the algorithm presented in [3], which consists in gradually
refining the globally computed partition by utilizing an adjustable sensing range
and incrementally increasing the sensing range. However, in practical applica-
tions, limited sensing capabilities are typically a hard constraint, and rarely the
sensing radius can be adjusted to detect sufficient information to allow compu-
tation of the Voronoi partition.
To overcome these issues, a few methodologies were recently proposed to
implement coverage control in a totally distributed manner, letting each robot
compute its own control input based on locally available information only. In par-
ticular, a methodology was proposed in [14] that allows a team of mobile robots,
moving in a two-dimensional environment, to optimally cover a bounded region
with unknown features. Robots are considered with limited sensing capabilities,
and without access to reliable high-bandwidth communication infrastructures.
The proposed methodology was validated by means of kinematic simulations and
preliminary experiments on ground robots.

Contribution. The aim of this paper is to experimentally evaluate the perfor-


mance of the coverage control architecture proposed in [14] deployed on a large
team of drones for outdoor operations. In the manuscript, we perform a deep and
extensive analysis of both realistic simulations and real field tests with a fleet
of quad-rotor drones in an unstructured environment such as the Abu Dhabi desert.
Thanks to these, we assess the scalability of the control methodology and its

performance in the presence of real-world constraints and limitations such as


localization error coming from noisy GPS readings. In order to guarantee the
statistical relevance of the simulated results, we perform an ablation study of
the main parameters with randomized initialization conditions repeated over
multiple runs.
We note that, to the best of the authors’ knowledge, this represents the first
large-scale deployment of a team of drones performing fully decentralized and
distributed coverage control in an outdoor environment.

Organization of the Paper. The rest of the paper is organized as follows. The
methodology considered for our experiment is described in Sect. 2. The anal-
ysis of the experiments is extensively presented in Sect. 3. Here, quantitative
measurements are reported along with a description of the simulator and the
platform used for the real-world experiments, continuing with a discussion
of the observed results. Finally, Sect. 4 closes the manuscript with
some concluding remarks.

2 Methods
The aim of our experiments is to evaluate the performance of a team of drones
controlled by a decentralized, limited-range, coverage control strategy, deployed
in a large outdoor environment. For this purpose, field trials and high-fidelity
dynamic simulations were set up, to achieve a statistically relevant characteri-
zation. Details of the methods are provided in the following paragraphs.

2.1 Coverage Control Strategy


We consider a swarm of robots composed of n drones modeled as holonomic
units able to move in two dimensions. All the robots possess the same limited
sensing and communication capabilities. The objective of the coverage control
strategy here proposed is to optimize the distribution of the agents over a bounded
environment Q ⊂ R2. Let pi ∈ R2 be the position of the i-th drone. Also, assume
P = {p1, . . . , pn} ⊂ R2 to be the set of positions of all the robots. Last, let the
velocity ṗi of each drone be directly controllable. By defining ui ∈ R2 as the
control input of the i-th drone, ∀i = 1, . . . , n, the kinematic model of the robot
can be expressed as

$$\dot{p}_i = u_i \qquad (1)$$
As discussed in [5,6], Lloyd's algorithm can be used to define a control
strategy to optimally cover a desired environment with a team of holonomic
robots. This is achieved by partitioning the environment into Voronoi cells Vi, adjusted
over time using the positions pi ∈ R2 of the robots as seeds. A probability density
function (pdf) φ(·) : Q → R+ can be defined to identify regions of interest in Q:
the centroid CVi of each cell is then computed considering the pdf, as

$$C_{V_i} = \frac{\int_{V_i} q\,\phi(q)\,dq}{\int_{V_i} \phi(q)\,dq} \qquad (2)$$

Each robot is then controlled to move towards the centroid of its cell by:

$$u_i = -k_{prop}\,(p_i - C_{V_i}) \qquad (3)$$

with ui representing a simple proportional controller and kprop > 0 being a
proportional gain. To account for the limited sensing and communication capa-
bilities of the robots, as discussed in [14], the partition of the environment is
defined only locally, with non-overlapping Voronoi cells computed by each robot
i as:

$$V_i = \{\,q \in B(p_i, r) : \|q - p_i\| \le \|q - p_j\|,\ \forall j \neq i,\ j = 1, \dots, n\,\} \qquad (4)$$
In Eq. (4), the term B(pi, r) represents a circle of radius r centered at pi, with
r associated with the range—either for communication or for sensing—of the
robot. This limited-range definition allows the robot to compute, in a decentral-
ized fashion, its own control action (3), based on its current position and on the
positions of its neighbors within B(pi, r).
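
To make the control step concrete, the following is a minimal sketch (our own Python, not the authors' implementation) of Eqs. (2)-(4) on a discretized grid: each robot keeps only the points of B(pi, r) for which it is the closest seed, computes the weighted centroid, and applies the proportional law. The helper names and the brute-force grid integration are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of one limited-range coverage step.
import numpy as np

def coverage_step(p_i, neighbors, phi, xs, ys, r, k_prop=0.5):
    """One evaluation of u_i = -k_prop (p_i - C_{V_i}) (Eqs. 2-4).

    p_i       : (2,) position of robot i
    neighbors : (m, 2) positions of neighbors detected within range r
    phi       : pdf phi(q) -> float identifying regions of interest
    xs, ys    : 1D arrays discretizing the environment Q
    """
    num, den = np.zeros(2), 0.0
    for x in xs:
        for y in ys:
            q = np.array([x, y])
            d_i = np.linalg.norm(q - p_i)
            if d_i > r:                                   # q outside B(p_i, r)
                continue
            # q is in V_i iff robot i is the closest seed (Eq. 4)
            if any(np.linalg.norm(q - p_j) < d_i for p_j in neighbors):
                continue
            w = phi(q)
            num += w * q                                  # numerator of Eq. (2)
            den += w                                      # denominator of Eq. (2)
    if den == 0.0:                                        # degenerate cell: hold
        return np.zeros(2)
    return -k_prop * (p_i - num / den)                    # Eq. (3)
```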

2.2 Simulation Protocol

Simulated experiments have been performed using the Flightmare Quadrotor


Simulator [16], a photo-realistic simulator with integrated physics, developed by
the Robotics and Perception group at UZH. The control algorithm described
in Sect. 2.1 has been implemented using the Robot Operating System—ROS—
on separate threads, one for each drone. Each robot thread also handles the
low-level controller for quad-rotor flight, the initial take-off, and the shar-
ing of information. A python script acting as a parallel daemon manages the
execution of the simulation batches and collects the data needed to assess per-
formances. The simulations have been performed on an Intel Xeon Platinum
8280M CPU@2.70 GHz with 4 processors, for a total of 112 cores and 512 GB of
RAM. Each simulation collects the global position of each drone over a time
interval of 120 s—a duration set empirically. To guarantee the
independence of the proposed methodology from the initial conditions, drones
are spawned inside the bounded environment in randomized positions: for every
simulation run, the vehicles are spawned within the simulation environment in a
neighborhood of the center of the coverage area with a random displacement. The
final position is bounded to be at least 0.5 m from any of the already positioned
vehicles. After arming and take-off of all the vehicles at the same altitude, the
coverage control algorithm takes over and steers the drones toward their target
positions using the velocities from Eq. (3). The simulator makes use of the RPG
Quadrotor control [8], which allows for generating desired orientation, collective
thrust command, body rates, angular accelerations, and angular jerk to generate
the control values of the four motors from the requested velocity command.
The measurements of the positions of the other vehicles, used to feed the
control function, are corrupted by additive noise drawn from a normal distribu-
tion N(0, 1) to simulate realistic measurement error. Ten simulation runs have
been performed for each combination of parameters, expressed as a tuple (n, a, r)

respectively representing: the number of drones n ∈ N = {8, 32, 64}; the size
of the operational area a ∈ A = {50 m, 100 m, 200 m}; and the sensitivity range
r ∈ R = {10 m, 50 m, 100 m, 500 m}.
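
As a sketch of how such a batch could be scripted (the function and variable names below are placeholders, not the authors' daemon), one run draws randomized spawn points at least 0.5 m apart near the area center, perturbs neighbor measurements with N(0, 1) noise, and iterates over the (n, a, r) tuples:

```python
# Hypothetical batch driver; the actual Flightmare/ROS launch code is omitted.
import itertools
import numpy as np

N_SET = [8, 32, 64]                    # number of drones n
A_SET = [50.0, 100.0, 200.0]           # operational area side a [m]
R_SET = [10.0, 50.0, 100.0, 500.0]     # sensing range r [m]
RUNS = 10                              # repetitions per (n, a, r) tuple

def spawn_positions(n, spread=5.0, min_dist=0.5, rng=None):
    """Random spawn points near the center, at least min_dist apart."""
    rng = rng if rng is not None else np.random.default_rng()
    placed = []
    while len(placed) < n:
        cand = rng.uniform(-spread, spread, size=2)
        if all(np.linalg.norm(cand - p) >= min_dist for p in placed):
            placed.append(cand)
    return np.array(placed)

def noisy_neighbor_measurement(true_pos, rng):
    """Additive N(0, 1) noise on each observed neighbor position."""
    return true_pos + rng.normal(0.0, 1.0, size=true_pos.shape)

for n, a, r in itertools.product(N_SET, A_SET, R_SET):
    for run in range(RUNS):
        rng = np.random.default_rng(run)
        p0 = spawn_positions(n, rng=rng)
        # launch_episode(p0, a, r, duration=120.0) would run one 120 s
        # simulation and log drone positions; omitted here.
```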

2.3 Field Test Protocol


Real-world experiments have been performed in a remote location in the Al
Khaznah region of the Abu Dhabi emirate in the UAE. The platform selected as
the flying unit is depicted in Fig. 1. It is a custom-built quad-rotor developed
by the company Fly4Future, consisting of a standard DJI frame of 450 mm, a
Pixhawk4 autopilot with custom firmware, and an Intel NUC with i7 processor
and 16 GB of RAM as an on-board computer. The latter is used to run the high-
level software control stack that, in this particular case, runs on the MRS UAV
system openly available for research purposes and released by the Multi-robot
Systems Group at the Czech Technical University in Prague [1].

Fig. 1. Figure of the MRS F450, the quad-rotor platform selected for the experiments

The robotic unit is equipped with several sensors among which, for the spe-
cific use case we analyze here, we highlight a Global Positioning System module
for self-localization—as the one coming with the Pixhawk autopilot bundle—
and XBee devices for communication among the units and information sharing.
The latter make use of an ad-hoc peer-to-peer proprietary mesh network called
DiGiMesh, which we limit to be single-hop to enforce the communication range con-
straint. We employed omni-directional low-power antennas on board all drones,
each of which runs at 900 MHz, a frequency empirically estimated to guarantee good con-
nectivity among all the units in the group. The vehicle controller [12] is used to
generate the desired orientation, collective thrust command, body rates, angular
accelerations, and angular jerk based on the coverage control. The experiments
were carried out using 7 vehicles, in an area of 50×50 m; as previously described,
the communication devices used offer full connectivity; however, to evaluate the behav-
ior of the control strategy, we manually limit the sensing range r

with the same values used for the simulations—i.e., r ∈ {10 m, 50 m, 100 m}.
Three runs have been performed for each value of the sensing range.
As in the simulations, the control algorithm proposed in Sect. 2 runs in a dis-
tributed way on board each drone. The position information of neighboring
units needed by the coverage control is solely provided through radio commu-
nication and only by the aerial units—i.e., no external or central computer is
used. Figure 2 shows the trajectories of the UAVs in one of the experiments
carried out; following the takeoff procedure to bring the vehicles to the same alti-
tude, the coverage control function is performed to determine the new positions
of the vehicles.
The speed command of Eq. (1) is limited to 2 m/s for safety reasons, taking into
account the number of vehicles involved in the experiment and the freedom
that the drones have to move in various directions.

2.4 Metrics

In order to evaluate the overall performance of the system, the same metrics
were collected both in simulations and in field tests. In particular, we consider
the following three main aspects: (i) optimality of the final configuration, (ii)
efficiency in the coverage of the environment, and (iii) effectiveness in the use
of the robots.

Fig. 2. Vehicle trajectories of a real experiment with 7 drones



Optimality of the Configuration. As is typically done in the literature, the per-


formance of the coverage control algorithm is evaluated by computing the value of
an optimization function H(P), whose value indicates how close the configura-
tion of the robots is to the optimal one, that is, the centroidal Voronoi
configuration.
Considering the positions P of the robots, it is possible to partition the
environment Q into n range-unlimited Voronoi cells {V̄1, . . . , V̄n }, where the i-th
cell is computed using position pi as a seed. The optimization function is then
defined as follows:
$$H(P) = -\sum_{i=1}^{n} \int_{\bar{V}_i} \|q - p_i\|^2\,\phi(q)\,dq \qquad (5)$$

As discussed in [6,15], this definition quantifies how close the configura-
tion of the robots is to the centroidal Voronoi configuration, defined considering,
for each point q ∈ Q, the value of the pdf φ(q).

Coverage Effectiveness. Considering the limited-range sensing capabilities of the


robots, in order to assess how well the environment Q is covered by the set
of individual robot non-overlapping limited-range cells {V1, . . . , Vn }, we propose
to evaluate the ratio A(P) of the integral of the pdf φ(·), computed over the
union of the cells Vi, with respect to the same integral computed over Q,
namely
$$A(P) = \frac{\sum_{i=1}^{n} \int_{V_i} \phi(q)\,dq}{\int_{Q} \phi(q)\,dq} \qquad (6)$$

Overall Efficiency. Considering the fact that each robot is able to individually
cover a circular area with radius r, we measure the effectiveness in the use of
resources by computing the ratio between the actual area covered by the team of
drones, and the upper-bound on the area that could be theoretically covered,
namely:
$$E(P) = \frac{\sum_{i=1}^{n} \int_{V_i} dq}{n\,\pi r^2} \qquad (7)$$
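
On the same grid discretization, the three metrics can be approximated numerically; the sketch below (our own helper names, brute-force summation over cells) evaluates Eqs. (5)-(7) by assigning every cell to its closest robot and checking the range constraint.

```python
# Sketch of the evaluation of H(P), A(P) and E(P) on a discretized Q.
import numpy as np

def coverage_metrics(P, phi, xs, ys, r):
    """P: (n, 2) robot positions; phi: pdf on Q; xs, ys: grid of Q; r: range."""
    n = len(P)
    cell = (xs[1] - xs[0]) * (ys[1] - ys[0])      # area of one grid cell
    H, num_A, den_A, covered = 0.0, 0.0, 0.0, 0.0
    for x in xs:
        for y in ys:
            q = np.array([x, y])
            d = np.linalg.norm(P - q, axis=1)
            i = int(np.argmin(d))                 # unlimited-range Voronoi owner
            w = phi(q)
            H -= d[i] ** 2 * w * cell             # Eq. (5)
            den_A += w * cell
            if d[i] <= r:                         # q inside limited-range V_i
                num_A += w * cell                 # numerator of Eq. (6)
                covered += cell                   # numerator of Eq. (7)
    return H, num_A / den_A, covered / (n * np.pi * r ** 2)
```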

3 Results and Discussion


In this section we report the results for both simulations and field tests per-
formed according to the presented methodology. Quantitative results are related
to the value of the optimization function defined in Eq. (5), with larger values
corresponding to a configuration that is closer to the optimal one; to the integral
of the PDF defined in Eq. (6), where again larger values reflect better coverage
efficiency; and last to the measurement of the overall covered area as defined
in (7), where greater values indicate a better effectiveness in the usage of the
robots.

3.1 Simulation Results

We hereby present the results of the simulations. Figure 3 displays the
average value of H(P) across the set of repetitions. The blue line shows the
value for 8 robots, the red line for 32 robots, and the yellow line for 64 robots.
Each plot is grouped according to the size of the area to cover and the sensing range.
Following the same color scheme, Figs. 4 and 5 display the average value of A(P)
and E(P) respectively.
Figure 3 clearly shows that increasing the number of robots increases the
performance with respect to the coverage metric. This effect is intuitively
attributable to a greater cumulative covered area. It is also possible to note
that the rate of convergence decreases as the area size increases, increases as the
sensing range grows, and decreases as more robots are employed. The oscilla-
tions present on the blue line for the cases with large Q and a big sensing range are
caused by the dynamics of the drones. Inherently, each drone possesses a certain
motion dynamic different from ṗi = ui. The velocity controller that generates

Fig. 3. Average value of H(P) across all repetitions. The plots on the same line refer
to the same total area to cover, while the plots on the same column refer to cases with
same sensing range. The values for 8, 32 and 64 robots are displayed with blue, red
and yellow lines respectively.

Fig. 4. Average value of A(P) across all repetitions. The plots on the same line refer
to the same total area to cover, while the plots on the same column refer to cases with
same sensing range. The values for 8, 32 and 64 robots are displayed with blue, red
and yellow lines respectively.

Fig. 5. Average value of E(P) across all repetitions. The plots on the same line refer
to the same total area to cover, while the plots on the same column refer to cases with
same sensing range. The values for 8, 32 and 64 robots are displayed with blue, red
and yellow lines respectively.

the motor commands controls the drone so that its motion is as similar to (1)
as possible. However, if the requested velocity is too different from the current
velocity, the controller encounters acceleration limits that render it impossible
to closely track the requested velocity profile. With large Q and large sensing
range, the resulting Vi can be very large and their centroid can be far away
from the drone. Therefore, the input generated by the control law (3) is very
large, requiring accelerations outside the limitations of the drone, causing the
oscillating behavior depicted by the blue line. With a larger number of drones,
this behavior is limited or totally absent since each drone has a larger number
of neighbors, therefore its resulting cell Vi is smaller. This effect can be reduced
by reducing the value of kprop in (3) or by applying a saturation on the value
of ui . Nevertheless, the coverage algorithm is still capable of spreading drones
across the area.
Figure 4 shows how well the area is covered by the drones. When the sensing
range is small, the robots often struggle to cover the totality of Q. The cov-
ered area increases with the increase of the number of drones employed. This is
expected since, obviously, more robots imply more cells Vi and the behavior for
each drone is to distance itself from its neighbors, enlarging its cell Vi .
Figure 5 depicts the efficiency of the coverage. The plots clearly show that
fewer robots are more efficient. A value of 1 for E means that each robot has
reached the configuration where its cell Vi is the largest possible. With few robots
there are fewer neighbors to interact with, ultimately allowing for the maximum
enlargement of each cell. The condition for a team of robots not to reach max-
imum efficiency is that the total area Q cannot accommodate all the robots'
cells.
These considerations ultimately provide design guidelines to consider
when deploying a coverage application. A larger number of robots is not neces-
sarily better in all scenarios, even if it increases the performance. Depending on
the sensing range, which is often dictated by other factors, a larger number
of robots can imply a better area coverage, but a trade-off has to be considered
in order not to overly reduce efficiency.

3.2 Results of Field Tests

We continue our analysis of the results with real field experiments involving
a group of seven drones. Results are presented in Fig. 6 where all the metrics
detailed in Sect. 2.4 are utilized. The first row of graphs describes the metric
H(P) for the different sensitivity range values taken into consideration. The
second and third rows describe A(P) and E(P), respectively.

Fig. 6. Average value of Eqs. (5), (6), (7) across all repetitions. The plots on the same
line display the same metric, while the plots on the same column refer to cases with
same sensing range.

Parallel to the simulated case, it can be seen how the metric H(P), first
row, blue line, increases as the value of the range increases. Unlike Fig. 5, it can
be seen that the convergence times of the optimization function in Eq. (5) are
longer. This is due to a cap on the maximum velocity of 2 m/s introduced for
safety reasons and to avoid large overshoots with the real platform. Results for
values of r equal to 10, 50 and 100 show different convergence times. For r = 50
and r = 100 we note how the velocity cap is particularly detrimental, since the
vehicles are unable to reach the speed output by the proportional control law
and the optimization function requires more time to reach the steady state.
Figure 7 shows the initial and final conditions of the whole control process.
Coloured dots show the position of each drone in the group at the start—leftmost
figure—and end—rightmost figure—of the experiment. The circles surrounding
the dots represent the sensing range of the robots. In particular, by visual anal-
ysis of Fig. 7b, it can be noted that the final group configuration does not show
a compact and close formation. Indeed, while in perfect conditions the final
graph should have adjacent but non-overlapping sensing ranges—circles—this is
not the case here. By analysis of the data, it has been noticed that, during the
experiment, the GPS covariance entries $\sigma^2_{x,x}$ and $\sigma^2_{y,y}$ ranged in the interval
[1.0, 1.4] m, while the values of the EPH, the standard deviation of the horizontal posi-
tion error, and of the EPV, the standard deviation of the vertical position error, oscillated
between 0.6 and 0.95 m. These values, and the position uncertainty that derives
from them, justify the distance between the sensitivity ranges.

Fig. 7. On image (a) the initial configuration of the vehicles during a real experiment
with sensitivity range R = 50; on image (b), the final configuration of the drones in the
same experiment. On image (c) an aerial photo of the convergence process is shown,
with a different orientation of the axes.

4 Conclusions

In this paper, we presented a distributed coverage control algorithm capable


of coordinating a group of autonomous vehicles with limited sensing capabilities.
Thanks to an extensive set of simulations and the implementation of real field
experiments, the strategy has been proven to be robust and efficient. It is indeed
capable of achieving close-to-optimal performances both in simulation and reality
and, as shown, it is also capable of orchestrating a significant number of drones
marking the approach as scalable. On the other hand, real experiments have shown
how the presence of non-ideal conditions, such as noisy GPS readings, has an
effect on the final configuration but overall does not compromise the behavior of the
coverage control function.
Future work includes a more extensive study of the real experiments and
the investigation of an alternative control law. Indeed, in this work, the final
output velocity is generated by the proportional law in Eq. (3). In some sit-
uations, as previously discussed, the generated velocity is too high—e.g., when
the point to reach is far from the drone's actual position—and does not respect
the physical constraints of the platform. This motivates a future
investigation analyzing control strategies that take into account the dynamics
of the drone, through a more elaborate kinematic model.

References
1. Baca, T., et al.: The MRS UAV system: pushing the frontiers of reproducible
research, real-world deployment, and education with autonomous unmanned aerial
vehicles. J. Intell. Robot. Syst. 102(1), 1–28 (2021)
2. Bullo, F., Cortés, J., Martinez, S.: Distributed Control of Robotic Networks: A
Mathematical Approach to Motion Coordination Algorithms. Princeton University
Press, Princeton (2009)
3. Cao, M., Hadjicostis, C.: Distributed algorithms for Voronoi diagrams and appli-
cation in ad-hoc networks. Technical report UILU-ENG-03-2222, DC-210, UIUC
Coordinated Science Laboratory (2003)
4. Cortés, J., Egerstedt, M.: Coordinated control of multi-robot systems: a survey.
SICE J. Control Meas. Syst. Integr. 10(6), 495–503 (2017)
5. Cortes, J., Martinez, S., Bullo, F.: Spatially-distributed coverage optimization and
control with limited-range interactions. ESAIM Control Optim. Calculus Varia-
tions 11(4), 691–719 (2005)
6. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing
networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004)
7. Du, Q., Emelianenko, M., Ju, L.: Convergence of the Lloyd algorithm for computing
centroidal Voronoi tessellations. SIAM J. Numer. Anal. 44(1), 102–119 (2006)
8. Faessler, M., Franchi, A., Scaramuzza, D.: Differential flatness of quadrotor dynam-
ics subject to rotor drag for accurate tracking of high-speed trajectories. IEEE
Robot. Autom. Lett. 3(2), 620–626 (2018). https://doi.org/10.1109/LRA.2017.
2776353
9. Guruprasad, K., Dasgupta, P.: Distributed Voronoi partitioning for multi-robot
systems with limited range sensors. In: 2012 IEEE/RSJ International Conference
on Intelligent Robots and Systems, pp. 3546–3552. IEEE (2012)
10. He, C., Feng, Z., Ren, Z.: Distributed algorithm for Voronoi partition of wireless
sensor networks with a limited sensing range. Sensors 18(2), 446 (2018)
11. Laventall, K., Cortés, J.: Coverage control by multi-robot networks with limited-
range anisotropic sensory. Int. J. Control 82(6), 1113–1121 (2009)
12. Lee, T., Leok, M., McClamroch, N.H.: Geometric tracking control of a quadrotor
UAV on SE(3). In: 49th IEEE Conference on Decision and Control (CDC), pp.
5420–5425. IEEE (2010)
13. Parker, L.E.: Multi-robot team design for real-world applications. In: Asama, H.,
Fukuda, T., Arai, T., Endo, I. (eds.) Distributed Autonomous Robotic Systems
2, pp. 91–102. Springer, Tokyo (1996). https://doi.org/10.1007/978-4-431-66942-5_10

14. Pratissoli, F., Capelli, B., Sabattini, L.: On coverage control for limited range multi-
robot systems. In: IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS) (2022)
15. Schwager, M., McLurkin, J., Rus, D.: Distributed coverage control with sensory
feedback for networked robots. In: Robotics: Science and Systems, pp. 49–56 (2006)
16. Song, Y., Naji, S., Kaufmann, E., Loquercio, A., Scaramuzza, D.: Flightmare: a
flexible quadrotor simulator. In: Conference on Robot Learning (2020)
17. Teruel, E., Aragues, R., López-Nicolás, G.: A practical method to cover evenly a
dynamic region with a swarm. IEEE Robot. Autom. Lett. 6(2), 1359–1366 (2021)
MARLAS: Multi Agent Reinforcement
Learning for Cooperated Adaptive
Sampling

Lishuo Pan1,2(B) , Sandeep Manjanna1,3(B) , and M. Ani Hsieh1(B)


1 GRASP Laboratory at the University of Pennsylvania, Philadelphia, USA
msandeep.sjce@gmail.com, m.hsieh@seas.upenn.edu
2 Brown University, Providence, USA

lishuo pan@brown.edu
3 Plaksha University, Mohali, India

Abstract. The multi-robot adaptive sampling problem aims at finding


trajectories for a team of robots to efficiently sample the phenomenon of
interest within a given endurance budget of the robots. In this paper, we
propose a robust and scalable approach using Multi-Agent Reinforcement
Learning for cooperated Adaptive Sampling (MARLAS) of quasi-static
environmental processes. Given a prior on the field being sampled, the
proposed method learns decentralized policies for a team of robots to
sample high-utility regions within a fixed budget. The multi-robot adap-
tive sampling problem requires the robots to coordinate with each other
to avoid overlapping sampling trajectories. Therefore, we encode the esti-
mates of neighbor positions and intermittent communication between
robots into the learning process. We evaluated MARLAS over multiple
performance metrics and found it to outperform other baseline multi-
robot sampling techniques. Additionally, we demonstrate scalability with
both the size of the robot team and the size of the region being sampled.
We further demonstrate robustness to communication failures and robot
failures. The experimental evaluations are conducted both in simulations
on real data and in real robot experiments on a demo environmental setup
(the demo video can be accessed at: https://youtu.be/qRRpNC60KL4).

Keywords: multi-robot systems · adaptive sampling · reinforcement


learning

1 Introduction

In this paper, we propose a decentralized multi-robot planning strategy (Fig. 1(a)


presents the framework of our method) to cooperatively sample data from two-
dimensional spatial fields so that the underlying physical phenomenon can be

L. Pan and S. Manjanna—Equal contribution.



modeled accurately. We consider sampling from quasi-static spatial processes,


that is, we can reasonably assume the field to be temporally static for the sam-
pling duration. Examples of such spatial fields include algal blooms, coral reefs,
the distribution of flora and fauna, and aerosol concentration. Building high-
resolution representations of environmental physical phenomena can help better
understand the effects of global warming and climate change on our environment.
Observing and modeling the spatial evolution of such fields using multiple data
sampling robots plays a key role in applications such as environmental monitor-
ing, search and rescue, anomaly detection, and geological surveys. We present a
planning algorithm for multiple robotic platforms to sample these spatial fields
efficiently and adaptively. In this context, sampling refers to collecting sensor
measurements of a phenomenon.
Adaptive sampling of spatial fields with hotspots has gained a lot of momen-
tum in the field of robotic exploration and mapping [1–4]. Adaptive sampling
refers to strategic online path planning for robots based on the observations made
until the current time step. Exhaustively sampling each point of an unknown
survey region can be tedious, inefficient, and impractical if the survey space is
large and/or the phenomenon of interest has only a few regions (hotspots) with
important information [1,5]. It has been observed that for low-pass multiband
signals, uniform sampling can be inefficient, and sampling rates far below the
Nyquist rate can still preserve information [6]. This is the key guiding principle
behind active and non-uniform sampling [7]. Although the approaches solving
coverage control problems [8,9] seem suitable for solving sampling problems,
they differ in an important aspect. The coverage control or area division
problem [10,11] focuses on finding the most preferable positions for the robots
such that a non-occupied space is covered with their sensors or the correspond-
ing assigned areas are physically visited by the respective robot as depicted in
Fig. 1(b). On the contrary, in this paper we are looking at designing robot paths
that visit high-utility regions within a fixed budget as illustrated in an example
run of our method in Fig. 1(c).
Recently, multi-agent reinforcement learning (MARL) methods have proven
to be effective for searching cooperative policies in complex and large-scale multi-
robot tasks and they have been explored for information gathering and coverage
problems. Persistent monitoring and coverage with MARL involve defining equal
utility for every state [12,13]. In the sampling problem, utilities are distributed
unevenly, thus requiring nonuniform exploration. Some of the multi-robot infor-
mation gathering techniques are limited to offline planning [14] or are incapable
of handling robot failures, as they function by dividing the region of interest
among the robots in the team [15]. Scalable multi-agent reinforcement learning
policy is proposed to achieve large-scale active information gathering problem
in [16]. However, they use a central node for target localization, which is fragile
to single-node failure.
The key contribution of this work is a fully decentralized multi-robot control
strategy, MARLAS, to learn multi-robot cooperation and achieve efficient online
sampling of an environmental field. From comprehensive experimental results,

we show that our method is scalable with both the size of the robot team and the
size of the region being sampled, and achieves equitable task distribution among
robots. We further demonstrate our approach is robust to both communication
failures and individual robot failures.

2 Problem Formulation

Consider the multi-robot adaptive sampling problem where N homogeneous


robots aim to maximize accumulated utilities over a region of interest in a given
time horizon H. The sampling region E ⊂ R2 is a bounded continuous 2D space.
In this work, E is discretized into uniform grid cells forming a workspace with
length h and width w. Thus, the position xi of robot r i is an element of Z2 .
Each grid cell is assigned a prior value F(x1, x2), which represents the amount of
data that robots can collect at this location. In this work, we assume that once a
grid cell is visited by any robot, all the data in that cell will be collected. We further
assume that a low-quality estimate of the reward map F ∈ Rh×w of the region E can
be obtained through pilot surveys or satellite data or historical knowledge of the
field in the region of interest.

Fig. 1. (a) Overview of MARLAS algorithm, (b) coverage paths generated by one of
the area division approaches (DARP [15]), and (c) an example sampling run of the
proposed approach (MARLAS).

The objective is to obtain a set of decentralized policies π for all robots in


the multi-robot team that maximize the averaged expected return of the N robots,
$J_{tot}(\pi) = \frac{1}{N}\sum_{i=1}^{N} J^i(\pi^i)$. Here, $J^i$ is the expected return of robot $r^i$ along its
trajectory $\tau^i$. In the homogeneous team setting, we can assume that all robots
have the same control strategy; therefore, we will drop the superscripts and write
the policy as $\pi = \pi^i$ for each robot $r^i$. We use a neural network with parameters θ
to approximate the underlying optimal control policy as a conditional probability
πθ = P(ai |si ; θ), where ai , si are the action and state of the robot r i , respectively.
Each robot maintains locally a belief of the position of other robots x̂ij ∈ Rh×w
and the belief of the reward map F̂ i ∈ Rh×w throughout the mission. Our learning

task is to solve the optimization problem to find a set of parameters θ ∗ for the
neural network.
Multi-robot Learning Formulation. In our decentralized multi-agent system
setting, individual agents do not have direct access to the ground truth
information, and need to trade off between exploiting the current information
and communicating with neighbors to solve the task in a cooperative manner. We
frame the task as a decentralized partially observable Markov Decision Process
(Dec-POMDP) represented by the tuple, (I, S, {A i }, {Z i }, T, R, γ). Here, I is the
set of robots, S is a set of states, and A is the set of discrete actions. Z i is a set of
possible observations by the robot ri . T models the transition function. R, γ are
the reward function and discount factor. Optimal solution of the Dec-POMDP
using dynamic programming is provably intractable as the time horizon, number
of robots, and the size of region increases [17]. In this work, we train a neural
network in a multi-agent reinforcement learning (MARL) setting to approximate
the underlying optimal decentralized policy.

3 Approach

An overview of our proposed MARLAS framework  is presented in Fig. 1(a).  Let


$\{r^1, r^2, \cdots, r^N\}$ be the set of robots, and $\left(x^i_{(t-l+1):t}, \{x^j_{(t-l+1):t}\}, F_{(t-l+1):t}\right)$ repre-
sents the state of each robot, where $x^i_{(t-l+1):t}$ are the robot's historical positions up
to l steps in the past, $\{x^j_{(t-l+1):t}, \forall j \neq i\}$ is the set of positions of the neighboring
robots with a history up to l steps in the past, and $F_{(t-l+1):t}$ are the reward
maps up to l steps in the past. Thus, in our formulation each robot stores the
trajectory history of length l and uses intermittent communication with other
robots to update the neighbor positions and reward maps. The individual robot
has no direct access to the neighbor positions when they are not in communica-
tion. Thus, the ground truth of the reward maps quickly becomes unavailable to
individual robots. As a substitute, each robot $r^i$ estimates and maintains belief
distributions over the neighbor positions and the reward maps. Hence, the new
state of robot $r^i$ at time t is defined as $s^i_t := \left(x^i_{(t-l+1):t}, \{\hat{x}^i_{j,(t-l+1):t}\}, \hat{F}^i_{(t-l+1):t}\right)$.
We denote $\hat{x}^i_{j,(t-l+1):t}$ as the belief distributions of the neighbor positions and $\hat{F}^i_{(t-l+1):t}$
as the belief distributions of the reward maps. A discrete action set is defined as
$A = \{(i, j) \mid \forall i, j \in \{-1, 0, 1\},\ (i, j) \neq (0, 0)\}$; that is, our action space consists of
actions leading to 8-connected neighboring grids from the current position. At
each time step t, each robot r i takes an action ait ∈ A. We assume that robots
can uniquely identify other robots, and thus zij,t represents the observation of
robot r j ’s position by robot r i at time t. More concretely, we define zij,t as the
occupancy {0, 1}h×w of grid cells within the sensing radius of robot r i . We assume
a deterministic transition function T , but our method can be easily extended to
handle non-deterministic transitions.

Distributed Belief State Estimate. We assume that a robot’s estimates


of its neighbor positions are limited by its sensing radius and these estimates
are noisy due to hardware limitations and environmental uncertainties. Direct
communication between robots is limited by the communication radius. Since
knowledge of global robot states is necessary to achieve multi-robot coordination,
we assume each robot r i maintains local beliefs x̂ij of all teammates’ positions,
where j is the index of its teammates. Without loss of generality, we choose the
Bayes filter to estimate neighbor positions, which can be replaced with other
filters such as the Kalman or learning-based filters.
For simplicity, we assume that the sensing radius is equal to the commu-
nication radius, and both denoted as dcr , but this is not a strict requirement.
When zij (x1, x2 ) = 1, a neighboring robot is detected at location (x1, x2 ), otherwise
zij (x1, x2) = 0. Due to sensing noise, we consider false positives and false negatives
in the discrete setting. Grid cells that are outside of the robot’s sensing radius
have zij = 0. In this work, we assume that the robot has no knowledge of its
neighbors’ control policies. In other words, in the action update of a neighbor’s
position estimate, the unobserved neighbors have equal transition probabilities
to its 8-connected neighboring grids. Thus, the distribution x̂ij is dictated by a
2D random walk in the action update. The sequence of action and sensor updates
are given by,
$$\text{action update: } p(\hat{x}^i_{j,t+1} \mid \hat{x}^i_{j,t}, a^j_t) = p(\hat{x}^i_{j,t+1} \mid \hat{x}^i_{j,t}), \qquad (1)$$
$$\text{sensor update: } p(\hat{x}^i_{j,t+1} \mid z^i_{j,t}) = \frac{p(z^i_{j,t} \mid \hat{x}^i_{j,t}) \cdot p(\hat{x}^i_{j,t})}{\sum_{\hat{x}^i_{j,t}} p(z^i_{j,t} \mid \hat{x}^i_{j,t}) \cdot p(\hat{x}^i_{j,t})} \qquad (2)$$

To enhance each robot’s estimates of its neighbors’ positions, we utilize com-


munication between robots. To achieve this, each robot saves a trajectory history
$\tau^i_{t-l+1:t}$ of length l, considering the limitation of memory units in distributed
systems. When neighboring robots enter a robot’s communication radius, the
robot has direct access to its neighbor’s trajectory history and updates its belief
of the neighbor’s position, x̂ij,t−l+1:t , accordingly. In this work, we only consider
1-hop communication without communication delays. In general, multi-hop com-
munication would introduce communication delays and thus is out of scope of
this work.
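
A minimal sketch of this filter on an h × w grid is shown below, under two explicit assumptions of ours: the random-walk action update of Eq. (1) is implemented as a convolution with a uniform 3 × 3 kernel, and the detection likelihood only models missed detections inside the sensing radius.

```python
# Sketch of the per-neighbor position filter of Eqs. (1)-(2); names are ours.
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3)) / 9.0   # uniform transition to 8-connected cells

def action_update(belief):
    """Eq. (1): spread probability mass by the random-walk transition."""
    return convolve2d(belief, KERNEL, mode="same", boundary="symm")

def sensor_update(belief, z, in_range, p_fn=0.1):
    """Eq. (2): Bayes update from the detection map z.

    z        : {0,1}^{h x w} detections (0 outside the sensing radius)
    in_range : boolean mask of cells within the sensing radius
    p_fn     : assumed false-negative rate of the detector
    """
    like = np.ones_like(belief)                       # no info outside range
    like[in_range] = np.where(z[in_range] == 1, 1.0 - p_fn, p_fn)
    post = like * belief
    s = post.sum()
    return post / s if s > 0 else belief              # renormalize
```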
Decentralized Control Policy. In this work, we use the expressiveness of a
neural network to approximate the underlying optimal decentralized control pol-
icy for the proposed multi-agent reinforcement learning problem. We use a fully
connected neural network of two hidden layers f (·) with 128 hidden units, and
hyperbolic tangent activation functions following each hidden layer. The output
layer units correspond to the actions to be taken. To define a stochastic policy, a
softmax layer is attached to generate a distribution on the set of actions A. From
experiments, we find that with the spatial feature aggregation φ(·) detailed in
the following section, two-hidden-layers neural networks have sufficient expres-
siveness to approximate the underlying optimal policy. We do not exploit the
fine-tuning of neural network architectures, as it is not the main focus of this

work. Policy parameterization is given by

$$\pi_\theta(a^i_t \mid s^i_t) = \mathrm{Softmax}(f(\phi(s^i_t); \theta)) \qquad (3)$$

where φ(sit ) represents the spatial feature aggregation vector. At each time step,
robots sample the actions from stochastic policy and move to their next positions
following transition function T .
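
Written in PyTorch as an assumption about the implementation (the paper does not specify a framework), the parameterization of Eq. (3) amounts to a two-hidden-layer tanh MLP over the aggregated features with a softmax over the eight actions:

```python
# Sketch of the policy network of Eq. (3); feature dimension is an assumption.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, feat_dim, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.Tanh(),   # two hidden layers,
            nn.Linear(128, 128), nn.Tanh(),        # 128 units, tanh
            nn.Linear(128, n_actions),             # logits over A
        )

    def forward(self, phi_s):
        # softmax over the action set A, returned as a distribution
        return torch.distributions.Categorical(logits=self.net(phi_s))

# Training samples actions; deployment takes the arg-max:
# dist = policy(phi_s); a = dist.sample()      # training
# a = dist.probs.argmax(dim=-1)                # deployment
```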
To avoid collisions with other robots in their neighborhood, robots employ
a potential field based obstacle avoidance strategy similar to [18]. It is assumed
that collision avoidance will only be activated within d0 distance from the clos-
est neighboring robot. We further quantized repulsive force synthesised from
collision avoidance strategy, by maximizing its alignment with the action from
the discrete set A. For robots whose collision avoidance is active, their actions
generated from learned policy are replaced with a repulsive force. Additionally,
robots generate random actions when the collision occurs.
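
One way to read this quantization step (our own sketch, with assumed helper names) is to select the 8-connected action whose unit vector has the maximum dot product with the synthesized repulsive force:

```python
# Sketch of quantizing a repulsive force onto the discrete action set A.
import numpy as np

ACTIONS = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0)]
UNITS = np.array([a / np.linalg.norm(a) for a in np.array(ACTIONS, float)])

def quantize_force(force):
    """force: (2,) repulsive vector from the potential field."""
    return ACTIONS[int(np.argmax(UNITS @ force))]  # max alignment with A
```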
Spatial Feature Aggregation. In our previous work [3,19], we proposed and
evaluated various feature aggregation methods for adaptive sampling. Multires-
olution feature aggregation [3] has been empirically proven to outperform other
methods, as it reduces the state dimension and utilizes the geometry induced
bias for the sampling task. In multiresolution feature aggregation, we divided
the current belief distribution of reward map F̂ti into regions with different res-
olutions according to their distance to each robot. Feature regions closer to
a robot have higher resolution, and resolution decreases farther away from the
robot. Since rewards at locations visited by other robots are removed, each robot
must estimate its neighbors’ positions at any given time. We utilize each robot’s
belief estimates for other robot locations, $\hat{x}^i_j$, and the belief estimate for the pre-
vious reward map $\hat{F}^i_{t-l}$ when updating the belief of the current reward map $\hat{F}^i_t$ in
the form
$$\hat{F}^i_t(x_1, x_2) = \hat{F}^i_{t-l}(x_1, x_2)\Big[J_{h,w} - \sum_{k=t-l+1}^{t} \sum_{j} \hat{x}^i_{j,k}(x_1, x_2)\Big]_+, \text{ for all } (x_1, x_2).$$
Here, $J_{h,w}$ is an all-ones matrix of the size of the reward map, $\hat{x}^i_{j,k}$ is the discrete
belief distribution of the neighbor positions with its value between 0 and 1 in
each grid cell, and the operation $[\cdot]_+$ converts negative values to 0. Notice that
direct communication between robots updates $\hat{x}^i_{j,t-l+1:t}$; thus we recompute
the summation over l time steps of the belief distributions of the neighbor posi-
tions. Here, the belief distributions of the reward maps $\hat{F}^i_{(t-l+1):t}$ follow the Markov
property.
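
The update can be read as an element-wise product of the stored map with the clipped complement of the accumulated visit probabilities; a small numpy sketch under our own naming assumptions:

```python
# Sketch of the reward-map belief update; shapes mirror the paper's notation.
import numpy as np

def update_reward_belief(F_prev, neighbor_beliefs):
    """F_prev: (h, w) belief at time t-l.
    neighbor_beliefs: for each of the last l steps, a list of (h, w)
    position beliefs, one per neighbor j."""
    visited = np.zeros_like(F_prev)
    for step in neighbor_beliefs:        # k = t-l+1 ... t
        for x_hat in step:               # sum over neighbors j
            visited += x_hat
    J = np.ones_like(F_prev)             # all-ones matrix J_{h,w}
    return F_prev * np.clip(J - visited, 0.0, None)   # [.]_+ clipping
```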
Multi-robot Learning in Adaptive Sampling. As mentioned in Sect. 2, solv-
ing a Dec-POMDP optimally is provably intractable in our task. Therefore, we
use the belief model to maintain belief distributions of the ground truth and
employ a neural network to approximate the optimal decentralized policy π ∗
and we use policy gradient methods to search the optimal policy on the neu-
ral network parameter space. To encourage cooperation, we adapt the central-
ized training with decentralized execution (CTDE) technique commonly used for
multi-agent policy training. More specifically, we compute a centralized reward
as the average of all the robots’ individual rewards, using global reward func-
tion R(·), state s and action a information. Thus, the centralized expected return
is given by $J_{tot}(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\frac{1}{N}\sum_{t=1}^{H} \gamma^t \sum_{i=1}^{N} R(s^i_t, a^i_t) \mid s_0\right]$. Control policy $\pi_\theta$ is
decentralized in both training and execution.
To optimize θ directly, we use policy gradient which gives the gradient of
the expected return in the form $\nabla J(\pi_\theta) \propto \mathbb{E}_{\tau \sim \pi_\theta}[G_t \cdot \nabla \log \pi_\theta(a_t \mid s_t)]$, where
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R(s^i_k, a^i_k)$. Action $a_t$ and state $s_t$ are drawn from τ. To esti-
mate Gt , we adapt the Monte Carlo based REINFORCE method by generating
trajectories τ following πθ . For each training episode of MARLAS, we initialize
robots randomly in E. We assumed the robot knows the initial position of all
the other robots. Each robot is given the control policy with the most updated
parameters θ. At time step t, each robot r i randomly samples action ait from
πθ (ait |sit ). Robot r i receives a reward F(xi ) by sampling data at location xi . Once
the grid cell xi is visited, F(xi ) is set to a negative value indicating the data
has been consumed at this location. If multiple robots enter the same grid cell
simultaneously, the reward will be evenly distributed to each robot in the cell,
and there will be a collision penalty for each robot in the same cell. The reward
function is given by,
$$R(s^i, a^i) = \frac{F(x^i)}{c} + \beta_{col} \cdot \mathbb{I}(x^i = x^j), \quad \forall j \in I, \qquad (4)$$
where c is the total number of robots in the grid of xi , I is the indicator function,
which indicates a collision between robots, and βcol < 0 is the collision penalty.
Each robot generates τ by executing the control policy up to the time horizon H.
Furthermore, we generate M trajectories for each robot as one training batch.
We use REINFORCE with baseline to update the parameters θ as follows
$$\theta_{k+1} = \theta_k + \alpha \cdot \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=0}^{H-1} (G^{i,m}_t - b^i_t)\,\nabla \log \pi_\theta(a^{i,m}_t \mid s^{i,m}_t), \qquad (5)$$

with the baseline defined as $b^i_t = \frac{1}{M}\sum_{m=1}^{M} G^{i,m}_t$. Superscripts i, m are the indices
of the robots and the trajectories. Note that we draw actions according to the
policy distribution in training to enhance the exploration. During deployment,
we draw the action that has the maximum probability.
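
Reusing the Policy sketch above, the update of Eq. (5) could be written as follows; batch shapes, the loss formulation, and the optimizer choice are our assumptions:

```python
# Sketch of REINFORCE with a per-robot Monte Carlo baseline (Eq. 5).
import torch

def reinforce_update(policy, optimizer, feats, actions, returns):
    """feats: (N, M, H, d) aggregated states phi(s); actions: (N, M, H) ints;
    returns: (N, M, H) Monte Carlo returns G_t."""
    baseline = returns.mean(dim=1, keepdim=True)      # b_t^i over M trajectories
    advantage = returns - baseline                    # G_t^{i,m} - b_t^i
    dist = policy(feats.reshape(-1, feats.shape[-1]))
    logp = dist.log_prob(actions.reshape(-1))
    loss = -(advantage.reshape(-1) * logp).mean()     # ascent on J_tot
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```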

4 Experiments and Results


We train the multi-robot team on a region E synthesized by a mixture of Gaus-
sians. In the training phase, we use a robot team size of N = 5, a sensing and
communication radius of dcr = 10 grid cells, a collision avoidance range of do = 2
grid cells and collision penalty coefficient βcol = −2. Other hyperparameters used
for training include: a discount factor γ = 0.9, training time horizon H = 200, and
set the number of trajectories M = 40. We use the ADAM stochastic optimizer
[20] in our training phase and the control policy is trained on a single map of size
30 × 30 by changing the starting locations of the robots for every epoch. We only
use one trained policy to generate all the results in this paper. We found that the

Fig. 2. Test datasets we used in evaluation, (a) real bathymetry data from Barba-
dos, (b) synthetic cshape distribution, (c) reaction diffusion simulation [21], and (d) a
satellite image of a coral reef.

learned policies can adapt to diverse fields during the test phase as presented in
our results. The results presented in this Section show that the learned control
policies outperform existing baseline methods. We further evaluate scalability of
MARLAS for different robot team and workspace sizes, and the robustness of
MARLAS to communication failures and robot failures.
Testing Datasets: For our testing phase, we consider a set of diverse environ-
ments generated both by real-world data and synthetic simulations, as illustrated
in Fig. 2.
Performance Metrics: Many applications in the domain of robotic sampling
need to sample information-rich locations as early as possible, due to the limited
endurance of the robots carrying out the survey and the quasi-static nature of the
environmental processes being sampled. Hence, we measure the discounted accumulated
reward as one of the performance metrics. Standard deviation of discounted accu-
mulated reward provides a measure on task sharing between the robots. A lower
value implies that the robots contribute more equally to the overall task. Aver-
age pair-wise overlap measures the total pair-wise overlap between the robot
trajectories averaged by the number of all possible neighbors. It reflects the per-
formance of a multi-robot sampling approach in terms of coordination and task
allocation. Average overlap percentage measures the percentage of the averaged
pair-wise overlap in the robot trajectories and is computed as the ratio between
the average pair-wise overlap and the number of steps the robots have traveled.
Coverage measures the completeness of the active sampling mission, calculated
as the ratio between the accumulated reward and the maximum total rewards
possible in the given time horizon H. Average communication volume is a mea-
sure of the amount of communication between robots, given by $\frac{1}{N}\sum_i \sum_t |N^i_t|$,
where $N^i_t$ is the set of neighboring robots of robot $r^i$ at time t.
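
Two of these metrics reduce to simple set and counting arithmetic over logged trajectories; a short sketch with assumed data layouts:

```python
# Sketch of average pair-wise overlap and average communication volume.
import itertools

def avg_pairwise_overlap(trajs):
    """trajs: list of N lists of visited grid cells (as tuples)."""
    pairs = list(itertools.combinations(range(len(trajs)), 2))
    shared = sum(len(set(trajs[i]) & set(trajs[j])) for i, j in pairs)
    return shared / max(len(pairs), 1)

def avg_comm_volume(neighbor_sets):
    """neighbor_sets[i][t]: set of neighbors of robot i at time t."""
    N = len(neighbor_sets)
    return sum(len(s) for robot in neighbor_sets for s in robot) / N
```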

Fig. 3. Comparison with baseline methods: Column (a) illustrates trajectories of the
deployed robot team. Circles are the deployment location, and triangles are the stop
location. Column (b) presents comparisons of discounted accumulated reward. Column
(c) presents comparisons of average pair-wise overlap. The 95% confidence intervals are
over 40 trials.

In Fig. 3, we compare MARLAS with multiple baseline sampling methods


including independently trained multi-robot sampler [19], DARP (Divide Area
based on the Robot’s initial Positions) algorithm [15], and maxima search algo-
rithm [22]. We used discounted accumulated reward and average pair-wise over-
lap metrics for these comparisons. Column (a) of Fig. 3 presents the trajectories
followed by two robots sampling the underlying distribution using MARLAS.
The communication radius is fixed as dcr = 20%Dmax , where Dmax is the maxi-
mum possible distance in the given region of interest.
Column (b) of Fig. 3 presents a quantitative comparison of the discounted
accumulated rewards. It can be observed that MARLAS outperforms the baseline
methods. Figure 3 column (c) compares the pair-wise overlap between paths
generated by MARLAS and independently trained multi-robot sampling tech-
nique, which keeps track of the full history for inter-robot communications. In
these experiments, the robots using MARLAS communicate only trajectory data
of history length l = 50; compared to the baseline methods, where the full trajec-
tory history is sent, MARLAS outperforms the other methods in generating

non-overlapping, utility-maximizing paths. Even though the average overlap for
MARLAS with l = 2 is high, the learning embedded with neighbor position estimates
helps it achieve higher discounted rewards compared to the baselines. MAR-
LAS is stochastic because the robots are initialized in the same grid and select
random actions to avoid collisions; nonetheless, its metrics have a tight confidence
interval.

4.1 Analyzing Scalability of Learned Policy

Scalability is an important property for a learning-based multi-robot decentral-


ized policy. To evaluate the scalability of MARLAS, we conducted experiments
with different robot team sizes; different sizes of the workspace; and varying
sensing and communication radii. We trained a policy with 5 robots (N = 5)
and sensing and communication radius dcr = 10 grids, and generalized to all the
different settings.

Fig. 4. Deployment of robot teams over reef data for 150 time duration. The first row
presents robot trajectories (in different colors) as the team size is increased from 2 to
15. Circles are the deployment location, and triangles are the stop location. Second row
presents sequential progress of a sampling run with 15 robots at time steps 10, 40, 80
and 150. Circles represent the robot current positions.

Scalability with the Team Size: In this scenario, we increase the number of
robots from 2 to 20 and fix the sensing and communication radius to 10 grids.
We scale the map by a factor of $\sqrt{n}$ to maintain a constant density of robots for
different sizes of robot teams. For example, a map of size 26 × 26 is used for a 2
robot experiment, while a map of size 40 × 40 is used for a 5 robot experiment.

Robots start exploring from the upper left corner of the map, as illustrated in
Fig. 4 so that they have the same opportunity to accumulate rewards. Qualitative
results presented in the first row of Fig. 4 show the robot trajectories as the team
size is increased from 2 to 15. The plots in the second row of Fig. 4 illustrate the
sequential progress of a sampling run with 15 robots at time steps 10, 40, 80 and
150. We observe that the robots quickly spread out, identify, and cover to dense
reward regions. Furthermore, the robots spread to the different regions and are
able to perform the sampling task in a coordinated manner.
The quantitative results in Fig. 5 illustrate that the standard deviation of
the discounted accumulated rewards starts close to 0 and increases sub-linearly
as the robot team size increases. The increase of the standard deviation val-
ues is expected as the discounted accumulated rewards also increase as team
size increases. We further notice that all the standard deviation values are neg-
ligible compared to the discounted accumulated rewards. This indicates that
the collected rewards are distributed evenly to each individual robot. Together
with the qualitative results, we conclude that each robot continues to contribute
equally to the sampling task as we scale up the team size. The majority of coverage
metrics in Fig. 5(b) remain above 70% even when scaling the number of
robots up to 20 on an 80 × 80 map, which is far bigger than the training setup.
The decrease in coverage is not surprising, as the robots need to travel longer
distances between the dense reward regions as the map resolution increases.

Fig. 5. Quantitative results for scaling the team size. The dots represent each of the
40 trials.

Scalability with Sensing and Communication Radius: To evaluate the


scalability with the sensing and the communication radius, we change the sens-
ing and the communication radius dcr during the deployment from 0%Dmax
to 100%Dmax . In our simulations, we down-sampled the reef data to 30 × 30.
Figure 6 (a) and (b) present 5 robots deployed at upper left corner of the map
carrying out the sampling task with dcr = 0%Dmax and 100%Dmax respectively.
As expected, with a larger sensing and communication radius, robots explore
cooperatively and do not sample the same region repeatedly.

Fig. 6. (a) and (b) Robot team with 5 robots carrying out the sampling task with dcr =
0%Dmax and dcr = 100%Dmax using the learned policy. Circles are the deployment
location, and triangles are the stop location. (c) Quantitative results include plots of
discounted accumulated reward, average pair-wise overlap, and average communication
volume over different communication radii. The 95% confidence intervals over 40 trials.

The plots in Fig. 6(c) illustrate that the discounted accumulated reward
increases and the average pair-wise overlap decreases, and both quickly plateau
at dcr = 30%Dmax . The average communication volume increases linearly and
slows its rate after dcr = 60%Dmax . We observed consistent behaviors over all
the other test datasets. We conclude that, with a moderate range of sensing and
communication, the cooperative behavior is achieved by the team, and increas-
ing the sensing and communication radius yields no significant contribution to
its performance.

4.2 Analyzing Robustness of the Learned Policy


In this section, we show how the ability to estimate the position of other robots
makes the system robust against communication failures and robot failures.
Communication Failure: We first consider a communication failure scenario
where the robot can only rely on sensing to localize neighbor positions. The robot
team and workspace setup is kept the same as the one described in the communica-
tion scalability experiment. We created four scenarios by changing the neighbor
position estimation and communication capability of the learned model: the first
is when the estimates of the neighbor positions and the communication are both
enabled; the second scenario is when the communication failed at time step
20 and robots localize their neighbors only via filtering. The third scenario is
when both the estimates of the neighbor positions and the communication failed
at time step 20. Lastly, we use the control policy with global communication as
a benchmark. To better reflect the overlap change with respect to the robots’
travel distance, we use average overlap percentage as a metric. The results in
Fig. 7 (a) and (b) illustrate that before the failure, there is no significant dif-
ference between models’ performances in all the scenarios. Once the failure is
introduced, the second scenario, with the communication failure but active posi-
tion estimate, has a performance that is very close to that of the benchmark

scenario with the global communication. This means that the estimates of the
neighbor positions serve as good substitutes for the ground truth information.
The scenario with both the communication failure and the neighbor position
estimation failure performs poorly compared to all the other scenarios showing
that the proposed belief model compensates for complete communication fail-
ures, thus increasing the robustness of MARLAS. We observed consistent results
over all the test datasets.

Fig. 7. Robustness to failures: The failures happen at time step 20, denoted as the
dashed line. The 95% confidence intervals are computed over 40 trials. (c) Snapshots
of robot failure experiments.

Robot Failure: In Fig. 7 (c), we present qualitative experimental results to


show the robustness of MARLAS against robot failures. As shown in the left image
in Fig. 7 (c), when there are no failures, all three robots share and cover the
hotspot regions. When two of the robots fail during a survey, the other robot
samples from the hotspot regions and achieves good overall performance for the
team, as depicted in the right image in Fig. 7 (c). These preliminary results display
the fault-tolerant nature of MARLAS, and we would like to further investigate
this behavior in the near future.

Fig. 8. Online adaptation: Three sequential time steps of a changing field and MAR-
LAS generating robot paths that adaptively cover the underlying distribution. Circles
represent the start locations, and triangles represent the stop locations.

4.3 Online Adaptations to the Changing Field


To demonstrate the online adaptation capability of MARLAS, we deploy the
robots to sample a stochastic field. We synthesize a dynamic field as a mixture-of-
Gaussians with a random motion added to the means of the Gaussians. Figure 8
illustrates three sequential time steps of a changing field and MARLAS generat-
ing robot paths that adaptively cover the underlying distribution. An updated
low-resolution prior is provided to the robots every 100 time steps; this could be
supplied by either an aerial vehicle or a sparse network of stationary sensors.
The MARLAS algorithm is able to generate updated plans for the robot
team on the fly.
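As a concrete illustration (our sketch, with hypothetical grid size, hotspot count, and noise scale, none taken from the paper), such a dynamic field can be synthesized as a mixture of Gaussians whose means perform a random walk:

```python
import numpy as np

def dynamic_field(grid_x, grid_y, means, variances, weights):
    """Evaluate an isotropic mixture-of-Gaussians field on a grid."""
    field = np.zeros(grid_x.shape)
    for (mx, my), var, w in zip(means, variances, weights):
        field += w * np.exp(-((grid_x - mx) ** 2 + (grid_y - my) ** 2) / (2 * var))
    return field / field.sum()  # normalize into a sampling distribution

rng = np.random.default_rng(0)
means = rng.uniform(10, 40, size=(3, 2))        # 3 hotspots in a 50 x 50 workspace
xs, ys = np.meshgrid(np.arange(50.0), np.arange(50.0))
for t in range(300):
    field = dynamic_field(xs, ys, means, variances=[20.0] * 3, weights=[1.0] * 3)
    means += rng.normal(scale=0.2, size=means.shape)  # random motion of the means
    # every 100 steps, a coarse version of `field` would be re-sent to the robots
```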

4.4 Robot Experiments


To demonstrate the feasibility of the proposed strategy in the physical world, we
conducted experiments in an indoor water tank using centimeter-scale robotic
boats, as presented in Fig. 9(a) and (b). We pre-plan the robot trajectories using
the MARLAS algorithm, and each mini boat is controlled by base station commands
to execute its assigned trajectory. Plots in Fig. 9(c), (d), and (e) illustrate
the difference between the planned paths and the paths executed by the physical
robots. These results indicate that the action space used for our policy generation
needs further investigation to include the dynamics of the autonomous vehicles.
In the future, we plan to design the policies over motion primitives to achieve
smooth trajectories for physical robots.

Fig. 9. (a) Indoor water tank experimental setup; the embedded picture shows the
internals of a mini robotic boat. (b) MARLAS policies run on mini robotic boats to
sample from the spatial field that is projected onto the water surface.

5 Conclusion and Future Work


We proposed an online and fully decentralized multi-robot sampling algorithm
that is robust to communication failures and robot failures, and is scalable with
both the size of the robot team and the size of the workspace. Our proposed method,
MARLAS, outperforms other baseline multi-robot sampling techniques. We analyzed
the key features of the MARLAS algorithm and presented qualitative and
quantitative results to support our claims. In the future, we would like to
investigate the application of MARLAS to sampling dynamic spatiotemporal
processes. We would also like to study the effect of reducing the length l of the
trajectory history that needs to be communicated between teammates, thereby
enhancing memory efficiency and system robustness. We would like to further
investigate distributed learning with map exploration using local measurements of
the environment.

Acknowledgements. We gratefully acknowledge the support of NSF IIS 1812319 and
ARL DCIST CRA W911NF-17-2-0181.

Proportional Control for Stochastic
Regulation on Allocation of Multi-robots

Thales C. Silva(B) , Victoria Edwards, and M. Ani Hsieh

Department of Mechanical Engineering and Applied Mechanics, University of


Pennsylvania, Philadelphia, PA 19104, USA
scthales@seas.upenn.edu

Abstract. Any strategy used to distribute a robot ensemble over a set
of sequential tasks is subject to inaccuracy due to robot-level uncertain-
ties and environmental influences on the robots' behavior. We approach
the problem of inaccuracy during task allocation by modeling and con-
trolling the overall ensemble behavior. Our model represents the allo-
cation problem as a stochastic jump process and we regulate the mean
and variance of such a process. The main contributions of this paper
are establishing a structure for the transition rates of the equivalent
stochastic jump process and formally showing that this approach leads
to decoupled parameters that allow us to adjust the first- and second-
order moments of the ensemble distribution over tasks, which gives the
flexibility to decrease the variance in the desired final distribution. This
allows us to directly shape the impact of uncertainties on the group
allocation over tasks. We introduce a detailed procedure to design the
gains to achieve the desired mean and show how the additional param-
eters impact the covariance matrix, which is directly associated with
the degree of task allocation precision. Our simulation and experimental
results illustrate the successful control of several robot ensembles during
task allocation.

Keywords: multi-robot system · control · task-allocation

1 Introduction
Modeling an ensemble of robots as an aggregate dynamical system offers flexi-
bility in the task assignment and is an alternative to traditional bottom-up task
allocation methods, which usually are computationally expensive (see [3,5] and
references therein). Given a desired allocation of robots to tasks, each robot
must navigate, handle dynamic constraints, and interact with the environment
to achieve the desired allocation while meeting some desired team-level perfor-
mance specifications. It is well understood that uncertainties resulting from the
robot’s interactions with the environment, execution of its assigned tasks, and
noise from its own sensors and actuators might lead to several issues during task
allocation (e.g., performance loss and inaccuracy) [13,14]. This is because alloca-
tions are often computed prior to execution and do not account for uncertainties

that arise during runtime. In addition, analysis of the team-level performance of
swarms has shown that local (and sometimes small) deviations from the required
performance at the robot level can combine and propagate, leading to substan-
tial changes in the behavior of the whole group, which might result in loss of
performance for the entire team (see Sect. 2 in [7]). Consequently, addressing the
fundamental question of how to properly represent and design robot ensemble
behaviors to meet the desired performance requirements can help us to improve
our understanding of such complex systems.
An existing category of ensemble models uses stochastic processes to repre-
sent robot teams [4,8,11]. These models use random variables to describe the
population of robots performing tasks, whereas transitions between tasks are
described based on jump processes with random arrival times. Using these rep-
resentations it is possible to design stable equilibria that reflect desired distribu-
tions of the robots over tasks [1,2]. In addition, it is also possible to account for
and to incorporate heterogeneous sensing and mobility capabilities in a robot
team using these models [17,18]. One of the major advantages of these macro-
scopic representations is their ability to scale to larger team sizes since their
complexity is solely a function of the number of tasks. In contrast, the complex-
ity of traditional task allocation methods grows as the number of agents and
tasks increases.
Nevertheless, most macroscopic approaches are accurate only for an asymptotically
large number of agents, i.e., when N → ∞, where N represents the number of robots
in the team. Fundamental to this assumption is the notion that no individual
robot's deviations from its desired performance will greatly impact the team's
performance as a whole. And yet Tarapore et al. [19] recently discussed the
challenges associated with employing large teams of robots in real-world
applications. Considering the impracticability of dense swarms in several
applications, they propose sparse swarms, arguing that guiding research toward
low-density swarms is more relevant to real-world applications. One of the goals
of this paper is to address the large-number assumption by providing a method to
regulate the variance of a robot distribution in the task allocation problem.
Specifically, by explicitly controlling the mean of the robot distribution and
independently controlling its variance, we can employ our methodology in teams
with a relatively small number of agents; that is, we decrease the impact of
individual robot deviations on the team's performance. Hence, our methodology is
a stride towards accurately modeling the allocation dynamics of these sparse
swarms. While similar techniques exist [12], the strategy proposed there lacks
the ability to simultaneously control both the first- and second-order moments.
Instead, we leverage the work from Klavins [10], which considers a stochastic
hybrid system to model and control the mean and variance of molecular species
concentrations. We extend this result to robot swarm applications, which
fundamentally alters some of the assumptions in [10]; for example, in our setting
we are concerned with teams of constant size and without a set of continuous
variables. In robotics, Napp et al. [15] presented an integral set-point
regulation technique, employed to regulate the mean of the robot population
performing specific tasks.

While they highlighted the importance of considering variance control, since the
mean can be a sensitive control objective, they did not propose a methodology to
adjust the variance.
In this paper, we present a strategy to govern the mean and variance of
the distribution of robots across a set of tasks. The contributions of this work
include 1) a decoupled formulation of the first- and second-order moments for
an ensemble model of robot teams that allows them to be adjusted individually,
and 2) systematic methods to determine the control gains that achieve the desired
distributions of robots over tasks. Our experimental results indicate that our
proposed macroscopic technique allows for improved control of small teams.

2 Problem Formulation

2.1 Graph Theory

We are interested in the stochastic behavior of a robot ensemble generated by
robot-level uncertainties. We assume a group of N ∈ ℕ robots executing M ∈ ℕ
spatially distributed tasks; the relationship among tasks is represented by a
graph G(V, E), in which the elements of the node set V = {1, ..., M} represent
the tasks, and the edge set E ⊂ V × V maps adjacent allowable switching between
task locations. Each element of the edge set represents a directed path from i to
j if (i, j) ∈ E. A graph G is called undirected if (i, j) ∈ E ⟺ (j, i) ∈ E.
We define the neighborhood of the ith task node as the set formed by all tasks
that have the ith task as a child node, denoted by N_i = {j ∈ V : (j, i) ∈ E}.
We call a neighbor of task i an adjacent task j that belongs to N_i. In this
paper we assume the following:

Assumption 1. The graph that maps the relationship between tasks, G(V, E),
is connected and undirected.

Assumption 1 allows the robots to sequentially move back and forth between
tasks. For some task topologies, it is possible to adjust the transition parameters
such that directionality in the task graph is obtained [2].

2.2 Ensemble Model

To derive the macroscopic ensemble model for the N robot team simultane-
ously executing M tasks, we assume each robot can execute only one task at a
given time and must switch between adjacent tasks to complete the team-wide
objective. The model needs to capture the global stochastic characteris-
tics that emerge from a robot’s interaction with other robots and the environ-
ment, such as variations in terrain [12,17], collision avoidance [11], and delays
[1]. Thus, the state of our system is given by a vector of random variables
$X(t) = [X_1(t) \cdots X_M(t)]'$, where the prime symbol on a vector, $a'$, denotes its
transpose, and each element of X(t) describes the number of robots executing
the respective task at time t; therefore $\sum_{i=1}^{M} X_i(t) = N$ for any t ≥ 0.
We aim to characterize the evolution of the random variable Xi (t), for each
i ∈ {1, . . . , M } and, ultimately, regulate its value. However, the stochastic nature
of these variables might lead to convergence only to a vicinity of the desired
value, where the size of the convergence region depends on robot-level deviations
from the desired performance [9]. In principle, the control of systems subject to
random variations could be approached by applying robust control techniques
and defining input-to-state stability criteria. However, in our scenario, defining
bounds for variations of robot-level behavior for a general number of tasks and
robots can be difficult. Therefore, we approach the problem by controlling the
mean value of the state variables Xi (t) and the dynamics of their second-order
moments. Observe that the vicinity size of the desired objective is directly asso-
ciated with the second-order moment.
The states of our system characterize pure jump processes Xi (t) : [0, ∞) → D,
where D is a finite discrete set. In particular, let ψ : D × [0, ∞) → ℝ be a
real-valued function; it is a known result that the dynamics of the expected value
of ψ are described by the following relation (see Chapter 1 in [16]):
$$\frac{d}{dt}\,E[\psi(X_i(t))] = E[\mathcal{L}\psi(X_i(t))],$$
in which L is the infinitesimal generator of the stochastic process defined as

$$\mathcal{L}\psi(X_i(t)) = \sum_{\ell} \left[\psi(\phi_\ell(X_i(t))) - \psi(X_i(t))\right]\lambda_\ell(X_i(t)), \qquad (1)$$


where λ_ℓ(X_i(t)) is a function of the number of robots at task i and in its
neighborhood which describes the rate of transitions, i.e., λ_ℓ(X_i(t))dt
represents the probability of a type-ℓ transition in X_i(t) occurring in dt time,
and φ_ℓ(X_i(t)) maps the size of a change in X_i(t) given that a type-ℓ transition
happened. In our case, we assume two possible unitary changes in dt: a transition
of type ℓ = 1, in which one agent leaves X_i(t), defined by φ_1(X_i(t)) = X_i(t) − 1,
and of type ℓ = 2, where one agent arrives at X_i(t), defined by
φ_2(X_i(t)) = X_i(t) + 1. Hence, ℓ ∈ {1, 2} indexes the type of transition, i.e.,
one robot leaving or one robot arriving at X_i(t), respectively.
In this paper, we are interested in enforcing desired first- and second-order
moments of the distribution of robots across tasks. Specifically, the problem
investigated can be stated as:

Problem 1. Given M tasks to be dynamically executed by N robots, in which
the switching between adjacent tasks is represented by the topology of the task
graph G(V, E), define a feedback controller that changes the transition rates
λ_ℓ(X_i), for all i ∈ {1, . . . , M}, such that the first- and second-order moments
of the distribution converge to the desired values.

It is important to note that a solution to a similar problem was previously
proposed in [11,12]. Even though those works were motivated by the study from Klavins

[10], the proposed control strategy relies on a feedback linearization action, which
leads to a closed-loop system of the form (see equation (9) in [11])
$$\frac{d}{dt}\,E[X(t)] = (K^{\alpha} + K^{\beta})\,E[X(t)],$$
where K^α is a matrix that maps the transition rates of the system and K^β is a
matrix with new parameters to attain the desired variance. The main drawback
of this approach is twofold: i) by changing K^β the stationary distribution of
the equivalent Markov chain also changes, which makes it necessary to readjust
the gains for the whole system, and ii) addressing the regulation of second-order
moments using K^β is equivalent to changing the transition rate of the whole
process during the entire evolution. In comparison, our proposed strategy
overcomes these limitations by simply manipulating the graph structure and
defining a new structure for λ_ℓ(·), in such a way that the new feedback strategy
does not directly modify the stationary distribution of the equivalent Markov
chain and provides changes to the transition rates according to the current state
of the system. We describe our methodology in the following section.

3 Methodology
3.1 Control Strategy
We model the arrival and departure of robots by pure stochastic jump processes.
For example, for two tasks:
$$X_1(t) \;\underset{\lambda_\ell(\cdot)}{\longleftrightarrow}\; X_2(t) \qquad (2)$$

the random variables X_1(t) and X_2(t) represent the number of robots in each task
at time t, and λ_ℓ(X_i) for ℓ ∈ {1, 2} maps the transition rate of agents between
tasks. Motivated by Klavins [10] on gene regulatory networks, we approach Prob-
lem 1 by defining the transition rate functions with a term that is proportional
to the number of agents on a given site and in its neighborhood. In Eq. (2)
this choice means that switching from task 1 to 2 and from 2 to 1 will depend
on the number of agents at both tasks. In particular, for Eq. (1) we define the
transition rates as
$$\lambda_1(X_i) = \sum_{j \in N_i} \left(k_{ij} X_i - \beta_i X_i X_j\right), \qquad (3)$$
$$\lambda_2(X_i) = \sum_{j \in N_i} \left(k_{ji} X_j - \beta_i X_i X_j\right). \qquad (4)$$

Remark 1. A necessary condition for the expressions in (3) and (4) to be valid
transition rates is that they be non-negative. While it is possible to guarantee
positivity during the whole evolution for some well-designed initial conditions
with careful choices of gains, we instead let robots transition back to previous
tasks before the team reaches steady state by mapping the equivalent non-negative
transition with reversed direction. This flexibility is the justification for
Assumption 1.
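To make the structure of (3) and (4) concrete, the sketch below (our illustration, not the authors' code) evaluates both rates for a task; folding a negative value into the reversed transition is one simple reading of Remark 1, not the paper's exact mechanism:

```python
import numpy as np

def transition_rates(i, X, k, beta, neighbors):
    """Rates (3)-(4) at task i: lam1 = rate of one robot leaving X_i,
    lam2 = rate of one robot arriving at X_i."""
    lam1 = sum(k[i, j] * X[i] - beta[i] * X[i] * X[j] for j in neighbors[i])
    lam2 = sum(k[j, i] * X[j] - beta[i] * X[i] * X[j] for j in neighbors[i])
    # Remark 1: a negative rate is mapped to the equivalent non-negative
    # transition in the reversed direction (a simplified interpretation).
    if lam1 < 0:
        lam2, lam1 = lam2 - lam1, 0.0
    if lam2 < 0:
        lam1, lam2 = lam1 - lam2, 0.0
    return lam1, lam2

# Example: 2 tasks, k_12 = k_21 = 1, beta = 0.05
k = np.array([[0.0, 1.0], [1.0, 0.0]])
print(transition_rates(0, [15, 5], k, [0.05, 0.05], {0: [1], 1: [0]}))  # (11.25, 1.25)
```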

It is worth noticing that our control strategy is distinct from the one presented
in [10]. Here, we only require the transition rate to be adjusted according to the
current number of agents in the tasks and in their neighborhood. In [10], by
contrast, it is assumed that each node in the network has a source and a sink,
which implies that the value of the random variable X_i can increase or decrease
independently of the variation in its neighbors, in addition to the switching
between adjacent nodes.
In the following, we show that with our choice for the transition rates λ_ℓ(X_i),
the first-order moment depends only on the parameters k_{ij}, while the variables
β_i appear only in the second-order moments. This introduces a free variable,
allowing the variance to be adjusted independently.
Defining ψ(Xi ) = Xi and applying the infinitesimal generator (1) with the
transition rates in (3) and (4), we get the following first-order moment for a
general arrangement,
 
$$\frac{d}{dt}E[X_i] = E\!\left[\sum_{j\in N_i}\left(k_{ji}X_j - \beta_i X_i X_j\right) - \sum_{j\in N_i}\left(k_{ij}X_i - \beta_i X_j X_i\right)\right] = \sum_{j\in N_i}\left(k_{ji}E[X_j] - k_{ij}E[X_i]\right).$$

For the second-order moments we define ψ(X_i) = X_i², which gives
$$\frac{d}{dt}E[X_i^2] = E\!\left[\left((X_i+1)^2 - X_i^2\right)\sum_{j\in N_i}\left(k_{ji}X_j - \beta_i X_i X_j\right) + \left((X_i-1)^2 - X_i^2\right)\sum_{j\in N_i}\left(k_{ij}X_i - \beta_i X_i X_j\right)\right]$$
$$= E\!\left[\sum_{j\in N_i}\left(2k_{ji}X_iX_j - 2k_{ij}X_i^2 + k_{ij}X_i + k_{ji}X_j - 2\beta_i X_i X_j\right)\right],$$

and for the off-diagonal terms,
$$\frac{d}{dt}E[X_iX_q] = E\!\Big[\left((X_i+1)X_q - X_iX_q\right)\sum_{j\in N_i}\left(k_{ji}X_j - \beta_i X_iX_j\right) + \left((X_q+1)X_i - X_iX_q\right)\sum_{j\in N_q}\left(k_{jq}X_j - \beta_q X_qX_j\right)$$
$$+ \left((X_i-1)X_q - X_iX_q\right)\sum_{j\in N_i}\left(k_{ij}X_i - \beta_i X_iX_j\right) + \left((X_q-1)X_i - X_iX_q\right)\sum_{j\in N_q}\left(k_{qj}X_q - \beta_q X_qX_j\right)\Big]$$
$$= E\!\left[\sum_{j\in N_i}\left(k_{ji}X_qX_j - k_{ij}X_iX_q\right) + \sum_{j\in N_q}\left(k_{jq}X_iX_j - k_{qj}X_qX_i\right)\right].$$

An important property of these equations is that they are in closed form, which
allows us to use them to design gains that attain the desired distribution. For a
compact vectorized form, let K be an M × M matrix defined as follows,
$$K_{ij} = \begin{cases} k_{ji} & \text{if } (j, i) \in E, \\ -\sum_{q=1}^{M} k_{iq} & \text{if } i = j, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)$$

then the first- and second-order moments can be written as
$$\frac{d}{dt}E[X] = K\,E[X], \qquad (6)$$
$$\frac{d}{dt}E[XX'] = K\,E[XX'] + E[XX']\,K' + \sum_{i=1}^{M} e_i e_i' \otimes \sum_{j\in N_i}\left(k_{ij}E[X_i] + k_{ji}E[X_j] - 2\beta_i E[X_iX_j]\right), \qquad (7)$$

where e_i, for i = 1, . . . , M, are the canonical basis vectors of ℝ^M and ⊗
denotes the Kronecker product. Eqs. (6) and (7) model the dynamics of the first-
and second-order moments of a group of N robots executing M tasks. There are
two important underlying assumptions in this representation. First, we assume
the times for robots to leave and arrive at a task follow a Poisson process. We
believe this assumption is not too restrictive since, as discussed above,
robot-level deviations from desired performance will impact the scheduling,
leading to stochastic jump processes. The second assumption, and arguably the
hardest to justify for general systems, is that we can actuate on the system's
transition rates. To do this in our case, note the structure of λ_ℓ(·) in (3)
and (4): we need to monitor the current number of agents at each task and its
neighborhood and then communicate this value to each agent at these sites to
compute the switching times. Dynamically defining the transition rates has an
intricate relationship with the allocated time and microscopic deviations. We
expect that the feedback nature of our approach accounts for microscopic
deviations from the desired performance in the long run.
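Because (6) is a linear ODE, the mean dynamics can be checked numerically; the sketch below (our illustration with a hand-built two-task gain matrix, not from the paper) integrates it and confirms convergence toward a distribution in the null space of K:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Two-task gain matrix with k_12 = 1, k_21 = 2; columns sum to zero (1'K = 0)
# and K @ x_d = 0 for x_d proportional to [2, 1].
K = np.array([[-1.0,  2.0],
              [ 1.0, -2.0]])
x0 = np.array([15.0, 5.0])            # initial mean allocation, N = 20
sol = solve_ivp(lambda t, x: K @ x, (0.0, 10.0), x0)
print(sol.y[:, -1])                   # -> approx [13.33, 6.67] = N * [2/3, 1/3]
```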

3.2 Control Analysis


In this section we analyse the convergence of the mean to the desired distribution
and show a simple method to obtain the parameters of the gain matrix K.
Theorem 1. Let N agents execute M tasks organized according to an undirected
graph G(V, E). Given a desired stationary distribution E[X^d] such that
$\sum_{i=1}^{M} X_i^d = N$, the robot ensemble converges asymptotically to E[X^d]
from any initial distribution, with transition rates defined in (3) and (4), if,
and only if, K E[X^d] = 0.

Proof. The demonstration follows from standard Lyapunov analysis. Note that,
by definition, the eigenvalues of the matrix K have negative real part, except for
one which is zero, that is, 0 = σ_1(K) > σ_2(K) ≥ · · · ≥ σ_M(K), where σ_i(K)
denotes the ith eigenvalue of K. Hence, define a simple Lyapunov function
candidate V = E[X]′ P E[X], where P is a symmetric positive definite matrix.
The time derivative of V along the trajectories of the first-order moments yields
$$\frac{dV}{dt} = E[X]'\left(K'P + PK\right)E[X];$$
because every eigenvalue of K has negative or zero real part, we have that
$$\frac{dV}{dt} \le 0.$$
Consider the invariant set $S = \{E[X] \in \mathbb{R}^M : dV/dt = 0\}$. Since the
robots will always be at one of the graph's nodes, i.e., $\sum_{i=1}^{M} X_i = N$,
we have that E[X(t)] ≠ 0 for all t > 0. Therefore, no solution can stay in S other
than E[X^d]; in other words, the equilibrium point E[X^d] is globally
asymptotically stable.
Notice that we assumed each robot is always assigned to one task at any
instant; to consider robots in transit between nodes, it is possible to extend
the task graph as in [1]. In addition, by construction, K results in a stable
ensemble. The compelling aspect of Theorem 1 with respect to our model for the
first- and second-order moments, Eqs. (6) and (7), is that we can use the
condition provided in the theorem to design the gains that attain a desired
distribution, while still having free variables β_i to adjust the second-order
terms.
We compute the gain matrix K by solving the following optimization:
$$K = \underset{Q}{\operatorname{argmin}}\ \left\| Q\,E[X^d] \right\| \qquad (8)$$
$$\text{s.t. } \mathbf{1}'Q = 0; \quad Q \in \mathcal{K}; \quad \text{structure in (5)},$$


where the set 𝒦 accounts for the set of admissible rate matrices for a given set
of tasks and E[X^d] ∈ Null(Q) is the desired final distribution.
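One minimal way to carry out (8) numerically (our sketch; the edge list and the lower bound on the gains are hypothetical stand-ins for the set 𝒦) is to optimize the edge gains directly, since building K with the structure in (5) makes 1′K = 0 hold automatically:

```python
import numpy as np
from scipy.optimize import minimize

edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (0, 3), (3, 0)]
x_d = np.array([13.0, 9.0, 6.0, 2.0])         # desired mean distribution

def build_K(gains, M=4):
    K = np.zeros((M, M))
    for (i, j), k_ij in zip(edges, gains):
        K[j, i] += k_ij                       # off-diagonal entry per (5)
        K[i, i] -= k_ij                       # diagonal collects outgoing rates
    return K

# Lower-bound the gains to exclude the trivial solution K = 0
# (a crude stand-in for the constraint set in the paper).
res = minimize(lambda g: np.linalg.norm(build_K(g) @ x_d),
               x0=np.ones(len(edges)), bounds=[(0.1, None)] * len(edges))
print(build_K(res.x) @ x_d)                   # approx 0: x_d lies in Null(K)
```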
At the moment, although we have introduced additional tuning variables in our
approach, we do not have a methodology to compute appropriate gains β_i for a
desired covariance matrix. Nonetheless, we give an intuition of the mechanism by
which they impact the steady-state covariance matrix. Let
C = E[(X − E[X])(X − E[X])′] be the covariance matrix. Replacing E[XX′] in (7)
and recalling that K E[X] = 0 during steady state, analysing the equilibrium
point $\frac{d}{dt}E[XX'] = 0$ gives us
$$KC + CK' = -\sum_{i=1}^{M} e_i e_i' \otimes \sum_{j\in N_i}\left(k_{ij}E[X_i^d] + k_{ji}E[X_j^d] - 2\beta_i E[X_i^d X_j^d]\right).$$

Since every eigenvalue of K has negative real part except for one which is equal
to zero, and E[X^d] ≠ 0, we have a unique solution for C [9,10]. Therefore, the
additional tuning variables β_i, for i ∈ V, are inversely proportional to the
covariance at steady state: the bigger the values of β_i, while (3) and (4) remain
positive during steady state (recalling Remark 1), the smaller the covariance.

4 Results
4.1 Numerical Simulations
Example 1. We use the stochastic simulation algorithm proposed by Gillespie [6]
to evaluate the proposed control strategy. We present a four-task example, M = 4,
with the topology represented in Fig. 1 (a). We run two different tests to
evaluate the effectiveness of our method: one with parameters βi = 0, for
i = 1, . . . , 4, and another with nonzero βi parameters (values given below). In
both cases, we aim to achieve the same final distribution. The initial populations
at the tasks are given by X^0 = [5 15 5 5]′, and the desired distribution is
E[X^d] = [13 9 6 2]′. The gain matrix K used to reach this distribution was
computed by solving the optimization problem (8) with
𝒦 = {K ∈ ℝ^{4×4} : K ≤ −1.5I}; this constraint was imposed to avoid excessively
long convergence times. The resulting parameters from the optimization are
k12 = 2.1, k14 = 1.4, k21 = 1.5, k23 = 1.3, k32 = 0.9, k34 = 1.2, k41 = 0.1,
k43 = 0.6, and 0 elsewhere off the main diagonal. We computed the mean and
variance from the simulation considering 130 data points sampled at t > 2.0
seconds in each simulation; this number was chosen to match the sample size
between the two experiments. The simulation with βi = 0 for all i = 1, . . . , 4
in Fig. 1 (b) has mean E[X] = [12.16 9.76 5.18 2.77]′ and variances
diag(E[XX′]) = [5.78 6.83 4.20 1.44]′.

Fig. 1. The graph topology is depicted in (a). In (b) and (c) it is shown a realization of
each of the stochastic simulations, where the dots represent the number of agents and
the solid lines represent the numerical solution of (6). The tasks are identified by the
color caption. The realization in (b) uses βi = 0, ∀i ∈ V, while in (c) is with β1 = 0.05,
β2 = 0.20, β3 = 0.11, and β4 = 0.052.
The variances were improved by greedily modifying the βi values: if the variance
was reduced and another task's population variance was not actively disturbed,
then that value was kept. Figure 1 (c)
used parameters β1 = 0.05, β2 = 0.20, β3 = 0.11, and β4 = 0.052; the resulting
means are E[X] = [12.26 9.30 5.83 2.60]′ and the variances are
diag(E[XX′]) = [1.06 1.12 1.15 0.45]′.
These results highlight the impact of variability on small-size teams (N = 30).
Visually, the control reduces the variance throughout the experiment, especially
once steady state is reached. It is worth mentioning that we have tested our
strategy on larger teams and the results were in agreement with our hypothesis
that it is possible to shape the variance using (3) and (4); however, due to space
limitations of the venue, we chose not to show those results.
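For reference, the Gillespie stochastic simulation driving these trials can be sketched as follows (our simplified version: per-edge rates are clamped at zero instead of reversed as in Remark 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def gillespie_step(X, t, k, beta, neighbors):
    """One SSA step: sample the next i -> j switch and its arrival time."""
    rates, moves = [], []
    for i in range(len(X)):
        for j in neighbors[i]:
            rates.append(max(k[i, j] * X[i] - beta[i] * X[i] * X[j], 0.0))
            moves.append((i, j))
    total = sum(rates)
    if total == 0.0:
        return X, t + 1.0                          # nothing can move
    t += rng.exponential(1.0 / total)              # exponential waiting time
    i, j = moves[rng.choice(len(moves), p=np.array(rates) / total)]
    X[i] -= 1                                      # one robot switches i -> j
    X[j] += 1
    return X, t
```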

Example 2. In this example, we run six different trials with varying team sizes
to numerically evaluate the variation in accuracy for teams of different sizes.
To this end, we consider four tasks with the topology represented in Fig. 1 (a),
and we computed the variance considering 160 data points sampled from each
simulation at t > 2.0 s. In addition, we vary the total number of robots between
simulations; hence the initial and desired distributions, given as percentages of
the total number of robots in the team, are X^0 = [25% 25% 0% 50%]′ and
E[X^d] = [50% 50% 0% 0%]′, respectively. In these simulations, we used the same
values for the parameters βi as in Example 1, and we do not change them between
simulations. To provide an intuition of the level of spread in each scenario, we
compute the Relative Variance (RV) as the quotient of the mean by the variance of
the respective task. The results are presented in Table 1. We notice that for
this scenario, the RV for the multi-robot system

Table 1. Expectation and Relative Variance (RV) of a series of numerical
experiments considering different team sizes, with the task graph in Fig. 1 (a)
and initial and desired final distributions X^0 = [25% 25% 0% 50%]′ and
E[X^d] = [50% 50% 0% 0%]′, respectively; when the βi parameters are used, their
values are β1 = 0.05, β2 = 0.20, β3 = 0.11, and β4 = 0.052.

         | βi = 0 for i = 1, . . . , 4 | βi ≠ 0 for i = 1, . . . , 4
         | N = 52   N = 26   N = 16   | N = 52   N = 26   N = 16
E[X1]    | 24.8     12.8     8.2      | 25.2     13.3     7.5
E[X2]    | 26.5     12.7     7.7     | 25.6     14.6     8.3
E[X3]    | 0.5      0.0      0.0      | 0.6      0.0      0.1
E[X4]    | 0.2      0.0      0.0      | 0.4      0.0      0.0
RV(X1)   | 0.49     0.62     0.61     | 0.21     0.36     0.53
RV(X2)   | 0.54     0.63     0.69     | 0.21     0.23     0.47
RV(X3)   | 0.97     0.1      0.0      | 0.88     0.00     1.56
RV(X4)   | 0.75     0.1      0.0      | 1.38     0.00     0.00

without βi and with 52 robots is similar to the RV of a team with only 16 robots
and nonzero βi values. This suggests that our methodology improves the accuracy
of the allocation of relatively small teams. Figures 2 (a) and (b) provide
realizations for an instance with βi = 0 and N = 52 and an instance with βi ≠ 0
and N = 16.

4.2 Experimental Results


Experimental tests were performed in the multi-robot Coherent Structure
Testbed (mCoSTe). Figure 3 depicts the miniature autonomous surface vehicles
(mASVs) in Fig. 3 (b) and the Multi-Robot Tank (MR Tank) in Fig. 3 (a).
The MR Tank is a 4.5 m × 3.0 m × 1.2 m water tank equipped with an OptiTrack
motion capture system. The mASVs are differential-drive platforms, localized
using 120 Hz motion capture data, with a communication link via XBee and onboard
processing on an Arduino Fio. Experiments were run for 4 minutes each, with
X^0 = [4 0 0 0]′ and E[X^d] = [1 1 1 1]′, following the task graph outlined in
Fig. 1 (a). The parameters used for the experimental trials were kij = 0.01 for
each nonzero off-diagonal element of the matrix K, and when variance control was
used, βi = 0.005 for all i = 1, . . . , 4. These parameters were chosen to account
for the time necessary for the mASVs to travel among tasks. The parameters were
converted to individual robot transitions by computing the likelihood of a
transition using Eqs. (3) and (4), as well as which task the robot should
transition to. At each task the mASVs were required to circle the location
with a radius of 0.25 m until a new transition was required.
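One simple way to realize this conversion (our sketch, not the authors' exact onboard implementation; the equal per-robot sharing rule is an assumption) is to split the task-level rate among the robots currently at the task and turn it into a per-step switching probability:

```python
import numpy as np

def sample_transition(i, X, k, beta, neighbors, dt, rng):
    """Decide whether a robot at task i switches during a short interval dt."""
    probs, dests = [], []
    for j in neighbors[i]:
        rate = max(k[i, j] * X[i] - beta[i] * X[i] * X[j], 0.0)
        probs.append(rate * dt / max(X[i], 1))    # equal share per robot at i
        dests.append(j)
    stay = max(1.0 - sum(probs), 0.0)
    p = np.array(probs + [stay])
    choice = rng.choice(len(p), p=p / p.sum())
    return dests[choice] if choice < len(dests) else i
```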

(a) Realization without βi and N = 52. (b) Realization with βi and N = 16.

Fig. 2. Panels (a) and (b) show a realization of the stochastic simulations with
N = 52 and N = 16, respectively, where the dots represent the number of agents
and the solid lines represent the numerical solution of (6). The figure axes are
the same to aid visualization of the relative spread. The tasks are identified by
the color legend and the graph topology is depicted in Fig. 1 (a). The
realization in (a) uses βi = 0, ∀i ∈ V, while (b) uses β1 = 0.05, β2 = 0.20,
β3 = 0.11, and β4 = 0.052.

Instances of experimental trials, both with and without βi, are included in
Fig. 4. The trajectory of each robot is shown in a distinct color, and the
intensity of the color encodes time: lighter segments were traveled earlier in
the trial. Figure 4 (a) has a total of 28 transitions among tasks; those
transitions take place throughout the experiment, as indicated by darker lines
between tasks. In the case of variance control, Fig. 4 (b), there are 13
transitions among tasks; notice that many of these transitions happened earlier
in the experiment. Across five experimental trials without variance control the
average number of task switches was 25.4, and across five trials with variance
control it was 15.6. This confirms that when variance feedback is provided the
number of switches is reduced. A video illustrating our results can be found at
https://youtu.be/fTsX-5z6BUw.
During experimentation we observed that sometimes the mASVs would transition
before they could physically reach the assigned task, traveling diagonally among
tasks. This was the case in Fig. 4 (b), where Robot 4 quickly switched to another
task while in transit. While this did not impact the overall results, choosing
parameters that lead to the desired task distribution while remaining within the
capabilities of the robots is an open area of interest.

5 Discussion

In the preceding sections, we have investigated the problem of stochastic
allocation of multi-robot teams. For our main result, we formally demonstrated that
through a particular structure for the transition rates of the stochastic jump
process model, we can decouple the first-order moments from the second-order
moments. Such a decoupling allows us to introduce additional tuning variables
to regulate the variance of the desired distribution of robots over tasks. The
additional degree of freedom helps to reduce the impact of individual robots
on the overall team’s performance. Therefore, the result of this contribution is
to expand the viability of top-down stochastic models to include reduced-size

Fig. 3. The mCoSTe environment and mASV platform used for experimental results.

Fig. 4. Experimental results from a 4-minute, 4-mASV run, with the task graph in
Fig. 1 (a). The desired distribution is one agent at each task. The test in
Fig. 4 (a) had 28 switches, while that in Fig. 4 (b) had 13 switches. The colors'
intensity reflects time: the lighter the color, the earlier that path was
traversed in the trial.

robot teams. In general, the intuition for these models is that they are more
accurate as the team size increases. However, when using our proposed method and
directly adjusting the team variance, it is possible to increase the accuracy of
the top-down stochastic model for smaller teams. We argue that such refinement is
a stride towards sparse swarms, even though we are technically approaching the
team-size problem and ignoring the task-space size.
Although we have formally shown the impact of the additional tuning variables, as
well as the decoupling between first- and second-order moments, and have
experimentally and numerically investigated the influence of such variables, we
did not derive general mathematical expressions to compute their values at the
design stage. In our investigations, we used a greedy algorithm that adjusted the
βi values based on the desired variance and final distribution. In addition, our
optimization problem to compute the gain matrix K for a desired distribution
incorporates designer knowledge about the robot capabilities through the set 𝒦,
which directly affects the robots' mean transition times. We plan to formalize
this open issue in future investigations.

6 Conclusions and Future Work

We have provided a new structure for the transition rates of ensemble robot
distributions modeled from a top-down perspective. As a first result, we
demonstrated that this structure leads to decoupled parameters for adjusting the
mean and the variance of the desired team distribution. Based on this finding, we
examined simple design strategies to compute the necessary gains. This approach
provides efficient ensemble behavior for relatively small groups. Finally,
physical and numerical experiments were conducted, illustrating the effectiveness
of the method. Possible future work includes the extension of the

strategy for distributed regulation. A potential strategy is to perform
distributed estimates of the number of agents performing each task. It is also of
interest to
connect the robot dynamics with the changing transition rates. One possible
approach to bridging those two models is through hybrid switching systems.
A formal methodology to design the network structure for a given covariance
bound will be considered in the future.

Acknowledgements. We gratefully acknowledge the support of ARL DCIST CRA
W911NF-17-2-0181, Office of Naval Research (ONR) Award No. N00014-22-1-2157, and
the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program.

References
1. Berman, S., Halász, A., Hsieh, M.A., Kumar, V.: Optimized stochastic policies for
task allocation in swarms of robots. IEEE Trans. Rob. 25(4), 927–937 (2009)
2. Deshmukh, V., Elamvazhuthi, K., Biswal, S., Kakish, Z., Berman, S.: Mean-
field stabilization of Markov chain models for robotic swarms: computational
approaches and experimental results. IEEE Rob. Autom. Lett. 3(3), 1985–1992
(2018). https://doi.org/10.1109/LRA.2018.2792696
3. Elamvazhuthi, K., Berman, S.: Mean-field models in swarm robotics: a survey.
Bioinspiration Biomimetics 15(1), 015001 (2019). https://doi.org/10.1088/1748-
3190/ab49a4
4. Elamvazhuthi, K., Biswal, S., Berman, S.: Controllability and decentralized stabi-
lization of the Kolmogorov forward equation for Markov chains. Automatica 124,
109351 (2021). https://doi.org/10.1016/j.automatica.2020.109351
5. Gerkey, B.P., Matarić, M.J.: A formal analysis and taxonomy of task allocation in
multi-robot systems. Int. J. Rob. Res. 23(9), 939–954 (2004)
6. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys.
Chem. 81(25), 2340–2361 (1977)
7. Hamann, H.: Towards swarm calculus: universal properties of swarm performance
and collective decisions. In: Dorigo, M., et al. (eds.) ANTS 2012. LNCS, vol.
7461, pp. 168–179. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32650-9_15
8. Hsieh, M.A., Halász, Á., Berman, S., Kumar, V.: Biologically inspired redistri-
bution of a swarm of robots among multiple sites. Swarm Intell. 2(2), 121–141
(2008)
9. Khalil, H.K.: Nonlinear Systems, vol. 10, 3rd edn. Prentice Hall, Hoboken (2002)
10. Klavins, E.: Proportional-integral control of stochastic gene regulatory networks.
In: 49th IEEE Conference on Decision and Control (CDC), pp. 2547–2553. IEEE
(2010)
11. Mather, T.W., Hsieh, M.A.: Distributed robot ensemble control for deployment to
multiple sites. In: Robotics: Science and Systems VII (2011)
12. Mather, T.W., Hsieh, M.A.: Ensemble modeling and control for congestion man-
agement in automated warehouses. In: 2012 IEEE International Conference on
Automation Science and Engineering (CASE), pp. 390–395 (2012). https://doi.
org/10.1109/CoASE.2012.6386498
13. Nam, C., Shell, D.A.: When to do your own thing: analysis of cost uncertainties in
multi-robot task allocation at run-time. In: 2015 IEEE International Conference
on Robotics and Automation (ICRA), pp. 1249–1254 (2015). https://doi.org/10.
1109/ICRA.2015.7139351

14. Nam, C., Shell, D.A.: Analyzing the sensitivity of the optimal assignment in proba-
bilistic multi-robot task allocation. IEEE Rob. Autom. Lett. 2(1), 193–200 (2017).
https://doi.org/10.1109/LRA.2016.2588138
15. Napp, N., Burden, S., Klavins, E.: Setpoint regulation for stochastically interacting
robots. Auton. Robot. 30(1), 57–71 (2011). https://doi.org/10.1007/s10514-010-
9203-2
16. Øksendal, B., Sulem, A.: Applied Stochastic Control of Jump Diffusions, 2nd edn.
Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-69826-5
17. Prorok, A., Hsieh, M.A., Kumar, V.: The impact of diversity on optimal control
policies for heterogeneous robot swarms. IEEE Trans. Rob. 33(2), 346–358 (2017)
18. Ravichandar, H., Shaw, K., Chernova, S.: Strata: unified framework for task assign-
ments in large teams of heterogeneous agents. Auton. Agents Multi Agent Syst.
34(2), 38 (2020)
19. Tarapore, D., Groß, R., Zauner, K.P.: Sparse robot swarms: moving swarms to
real-world applications. Front. Robot. AI 7 (2020). https://doi.org/10.3389/frobt.2020.00083
Comparing Stochastic Optimization
Methods for Multi-robot, Multi-target
Tracking

Pujie Xin and Philip Dames(B)

Temple University, Philadelphia, PA 19122, USA


{pujie.xin,pdames}@temple.edu
https://sites.temple.edu/trail/

Abstract. This paper compares different distributed control approaches
which enable a team of robots to search for and track an unknown number
of targets. The robots are equipped with sensors which have a limited
field of view (FoV), and they are required to explore the environment.
The team uses a distributed formulation of the Probability Hypothesis
Density (PHD) filter to estimate the number and positions of the targets.
The resulting target estimate is used to select the subsequent search
locations for each robot. This paper compares Lloyd's algorithm, a
traditional method for distributed search, with two typical stochastic
optimization methods: Particle Swarm Optimization (PSO) and Simulated
Annealing (SA). This paper presents novel formulations of PSO and SA
to solve the multi-target tracking problem, which more effectively trade
off between exploration and exploitation. Simulations demonstrate that
the use of these stochastic optimization techniques improves coverage of
the search space and reduces the error in the target estimates compared
to the baseline approach.

Keywords: Multi-target tracking · Stochastic optimization · Distributed control

1 Introduction
Multi-robot, multi-target tracking (MR-MTT) encompasses many traditional
robotics problems, including search and rescue [19], surveillance [10], and map-
ping [9]. All of these scenarios require a team of robots to search across an
environment to detect, identify, and track individual objects. A generic solution
to the MR-MTT problem consists of two parts: 1) an estimation algorithm to
determine the number and states of all targets and 2) a control algorithm to
drive the robots to seek out undetected targets as well as track detected targets.
These problems are extensively studied in multi-robot systems [29].

1.1 Related Work


Multi-target tracking (MTT) is a problem wherein one or more sensors seek to
determine the number of targets in a space and the state of each individual

target. Often the number of targets is unknown a priori and in many cases the
targets are not uniquely identifiable. One of the main challenges in MTT is
the data association problem, i.e., matching measurements to potential targets.
Stone et al. [25] discuss this problem in their book, and survey existing solutions,
such as the Multiple Hypothesis Tracker (MHT) [1], Joint Probabilistic Data
Association (JPDA) [12], and the Probability Hypothesis Density (PHD) filter
[17]. Each of these methods solves data association in a different way. We pick
the PHD filter due to its relative simplicity and good performance.
Multi-robot search is another well-studied problem in the literature, with many
existing methods, as discussed in the survey [22]. In recent years, coordination
policies using Multi-Agent Deep Reinforcement Learning (MADRL) have become
popular [21,30], with some examples using partially observable Monte-Carlo
planning [14] while others perform a joint optimization of target assignment and
path planning [20]. However, many of these methods are based on a centralized
policy and global information, which is a main challenge in MADRL. One of the
most widely used classical techniques is Lloyd's algorithm [11], a decentralized
algorithm which iteratively partitions the environment and drives the robots
using a gradient descent controller. Due to its wide popularity, we will use
Lloyd's algorithm as our baseline approach for comparison.
In this paper, we examine the use of different numerical optimization algorithms
to drive MR-MTT. These algorithms fall into two broad categories: deterministic
and stochastic. Deterministic methods often suffer from entrapment in local
optima, particularly the gradient-based methods, which require knowledge of the
entire search space [18]. However, in our setting the target locations are
unknown a priori, which makes the use of gradient-based methods challenging. In
contrast, stochastic optimization methods do not rely on gradient descent, which
makes them particularly well suited to problems with unknown search spaces. Based
on the number of candidate solutions, there are individual-based and
population-based algorithms. In the context of numerical optimization, these
techniques are typically used to find a single global extremum of a function.
However, in the MR-MTT problem we aim to find all local maxima (i.e., all target
locations), which requires us to reformulate these techniques to suit our problem.
In the family of individual-based methods, there are Tabu Search [13], Hill
Climbing [7], and Simulated Annealing (SA) [16]. The SA algorithm is widely used
for solving unconstrained and bound-constrained optimization problems. This
method models the physical process of heating and tempering in materials
processing. It is likely to accept worse solutions at a rate proportional to the
temperature. This enables SA to explore a broad search space and escape from
local optima, which allows the robots to better explore the environment.
Different versions of distributed control with SA have been proposed based on a
variety of objective functions. One similar work minimizes the total travel
distance under Boolean specifications using SA [24]. Their scenario is similar to
the Travelling Salesman Problem (TSP), which enumerates trajectories; the
trajectories are combinations of different nodes.

PSO [15] is one of the most successful population-based algorithms. Derr et
al. [8] proposed a decentralized version of PSO for MR-MTT, but in their
experiments the targets are mobile phones. This makes the problem much simpler,
since the sensors have unlimited range (i.e., they are able to detect the signal
strength everywhere in the environment) and there is no ambiguity in data
association. Tang et al. [26] proposed methods for environments with obstacles,
but the optimization function is still predefined, which means the global
importance weight is known to each robot (i.e., target locations are known). Our
solution differs from these other works in two key ways: 1) the objective
function, or global importance weight, of the map is unknown, and 2) the robots
have a limited field of view.

1.2 Contributions

In this paper, we propose two novel reformulations of traditional stochastic
optimization algorithms, SA and PSO, to solve the MR-MTT problem. We compare
these against Lloyd's algorithm, a traditional method for multi-robot search. In
all cases we keep the distributed MTT algorithm constant, so the only difference
in performance is due to the search strategy. Our results show that the new
techniques result in more complete coverage of the search space and lower errors
in the target estimates. We also see that SA and PSO yield qualitatively
different search patterns as a result of their different structures.

2 Problem Formulation
We have a team of R robots exploring a convex environment E ⊆ ℝ² to search for
an unknown number of targets. The set of targets is X^t = {x^t_1, x^t_2, . . .},
where x^t_i is the state of the ith target at time t. At each time step, robot r
receives a set of measurements Z^t_r = {z^{1,r}_t, z^{2,r}_t, . . .}. The number
of measurements depends upon the number of targets within the robot's field of
view (FoV) as well as the chance of receiving false negative or false positive
detections.

2.1 PHD Filter

The PHD filter [17] models the sets X and Z above as realizations of Poisson
random finite sets (RFSs), which are sets composed of independently and identi-
cally distributed (i.i.d.) elements where the number of elements follows a Poisson
distribution. The likelihood of such an RFS X is

$$p(X) = e^{-\lambda} \prod_{x \in X} v(x), \qquad (1)$$

where v(x) is the PHD and $\lambda = \int_E v(x)\,dx$. The PHD is a density function over
the state space of the targets, and the expected cardinality of X is the integral
of the PHD over E.

The PHD filter recursively updates v(x), which completely encodes the
distribution over sets, using the measurement sets collected by the robots and
the motion model of the targets. The PHD filter uses six models. 1) The target
motion model f(x^t | x^{t−1}) characterizes the state transition of x. 2) The
birth PHD b(x^t) describes both the number of new targets and their locations at
time t. 3) The survival probability p_s(x^{t−1}) models the continued existence
of a target with state x^{t−1}. 4) The detection model p_d(x^t | q) describes the
probability of detecting a target with state x^t using a sensor with state q.
5) The measurement model g(z | x^t; q) describes the probability of a robot with
state q receiving measurement z from a detected target with state x^t. Combining
this with the detection model yields ψ_{z,q}(x^t) = g(z | x^t, q) p_d(x^t | q),
which characterizes the probability of receiving measurement z from a target with
state x^t by a sensor located at q. 6) Finally, the clutter model c(z) is another
PHD describing the number and states of false positive detections. With these
target and sensor models, the prediction and update steps of the PHD filter are:

$$\bar{v}^t(x^t) = b(x^t) + \int_E f(x^t \mid x^{t-1})\, p_s(x^{t-1})\, v^{t-1}(x^{t-1})\, dx^{t-1}, \qquad (2a)$$
$$v^t(x^t) = \left(1 - p_d(x^t \mid q)\right)\bar{v}^t(x^t) + \sum_{z \in Z^t} \frac{\psi_{z,q}(x^t)\,\bar{v}^t(x^t)}{\eta_z(\bar{v}^t)}, \qquad (2b)$$
$$\eta_z(v) = c(z \mid q) + \int_E \psi_{z,q}(x^t)\, v^t(x^t)\, dx^t. \qquad (2c)$$

Here, η_z(v) is a normalization factor, though note that $\int_E v(x)\,dx = \lambda$, i.e., the
expected number of targets in the search space, rather than 1, as is standard
in single-target tracking.
Since the PHD v(x) is a density function over the state space of targets, we
need an additional assumption to represent it parametrically in a manner that is
amenable to computation. The two traditional methods are a set of weighted
particles [28] or a mixture of Gaussians [27]. We use the former, as it
allows for arbitrarily non-linear sensing and target models.
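To make the particle representation concrete, the sketch below (ours; it assumes a user-supplied likelihood g(z, x, q), a constant detection probability, and a clutter density, none of which are specified here) applies the update (2b)-(2c) to a set of weighted particles:

```python
import numpy as np

def phd_update(particles, weights, Z, q, p_d, g, clutter):
    """Particle PHD update, Eqs. (2b)-(2c).
    particles: (n, d) states; weights: (n,) PHD mass carried by each particle;
    Z: list of measurements; g(z, x, q): measurement likelihood;
    clutter(z): clutter PHD c(z | q)."""
    new_w = (1.0 - p_d) * weights                  # missed-detection term of (2b)
    for z in Z:
        psi = p_d * np.array([g(z, x, q) for x in particles])
        eta = clutter(z) + psi @ weights           # normalizer (2c)
        new_w = new_w + psi * weights / eta        # detection term of (2b)
    return new_w                                   # new_w.sum() = expected target count
```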

2.2 Lloyd’s Algorithm

Lloyd's algorithm [3] iteratively partitions the environment and navigates the
robots towards the weighted centroids of their Voronoi cells. This process
locally minimizes the function
$$H(q_1, q_2, \ldots, q_R) = \int_E \min_{r \in \{1,\ldots,R\}} f(d(x, q_r))\,\phi(x)\,dx, \qquad (3)$$
where d(x, q) measures the distance between elements of E, f(·) is a
monotonically increasing function, and φ(x) is a non-negative weighting function.
A common choice is f(x) = x². The integrand is minimized when the environment is
partitioned using the Voronoi partition,

$V_i = \{x \mid d(x, q_i) \le d(x, q_j),\ \forall j \ne i\}$, where $V_i$ is the Voronoi cell for robot i.
The minimizer with respect to the robot's position is
$$(q^i)^* = \frac{\int_{V_i} x\,\phi(x)\,dx}{\int_{V_i} \phi(x)\,dx}. \qquad (4)$$

This is a distributed control algorithm: if each robot knows the positions of its
neighbors, it can compute its Voronoi cell and move to that cell's weighted
centroid. In our case, the PHD serves as the weighting function φ(x), converting
the traditional coverage problem into a search and tracking problem.
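With a particle PHD playing the role of φ(x), the centroid (4) reduces to a weighted average over the particles that fall inside a robot's Voronoi cell; a minimal sketch (ours):

```python
import numpy as np

def weighted_centroid(particles, weights, robot_positions, r):
    """Goal point (4) for robot r: PHD-weighted centroid of its Voronoi cell."""
    d = np.linalg.norm(particles[:, None, :] - robot_positions[None, :, :], axis=2)
    in_cell = d.argmin(axis=1) == r                # nearest-robot membership test
    w = weights[in_cell]
    if w.sum() == 0.0:
        return robot_positions[r]                  # empty cell: hold position
    return (particles[in_cell] * w[:, None]).sum(axis=0) / w.sum()
```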

2.3 Distributed PHD Filter


We use the distributed PHD filter that we developed in our previous work [6].
The key idea is that each robot maintains the portion of the PHD within its own
Voronoi cell and exchanges data with its Voronoi neighbors. Three algorithms then
account for the motion of Voronoi cells (i.e., motion of robots), the motion of
targets across cell boundaries, and measurements within other robots' cells. This
estimation process is described, and its convergence proved, in our previous
work [6]. Since the distributed estimation method is the same for all approaches,
we will focus on the distributed control methods. Because all teams use the same
distributed PHD filter, they all employ the same Voronoi-based partitioning
scheme regardless of their chosen search strategy.

2.4 Assumptions
The first assumption we make is that each robot knows its own position. While
this is a strong assumption, it can be satisfied using a localization algorithm
in indoor environments [4] or GPS receivers in outdoor environments.
Additionally, we have shown that the distributed PHD filter performs well even
with some bounded uncertainty in the pose estimates [6].
Another assumption we make is that all robots can communicate with all of their
Voronoi neighbors; this is necessary under Lloyd's algorithm (and for our
distributed PHD filter formulation). In a practical setting, this could be
addressed by using robots as communication relays or by using multi-hop
communication strategies.

3 Distributed Control
As mentioned in the introduction, stochastic optimization algorithms can be divided into population-based and individual-based algorithms according to the number of candidate solutions. In this paper, we pick one typical method from each family and compare these methods with the standard baseline of Lloyd's algorithm. Specifically, we use SA as an example of individual-based algorithms and PSO as an example of population-based algorithms.

Traditionally, optimization algorithms aim to find a single global maximum of some function. However, in the multi-target tracking case we instead wish to find all the local maxima. In the context of the PHD, these local maxima correspond to peaks of the PHD in the search space, which represent the locations of the targets.

3.1 Simulated Annealing


Simulated annealing takes inspiration from the annealing process in manufacturing, where materials are slowly cooled to allow them to reach a stable configuration. In SA, at each iteration, a new candidate point located at q̂^{t+1} is randomly generated near the last solution point q^t, where the distance δ = |q̂^{t+1} − q^t| is proportional to the "temperature" T. Suppose we want to maximize the objective function. The solution then moves to the new location if the value there is better, or with some probability if the value is worse. This probability of accepting a suboptimal solution is given by

P_{accept} \sim \text{Bernoulli}\left( \frac{1}{1 + \exp\big(\delta / \max(T)\big)} \right)   (5)

which allows the solution to escape from local optima and hopefully reach a
global optimum, with the probability of a jump decreasing over time as the
temperature cools.

Algorithm 1. Distributed control with SA for Robot r
1: V̂_r = V_r ⊕ B(d)    ▷ Expand the Voronoi cell
2: for i ∈ N(r) do    ▷ This step can be combined with particle exchange in estimation
3:   Compute ΔV_{i,r} = V_i ∩ V̂_r
4:   Send particles in ΔV_{i,r} to robot i
5: end for
6: Compute c_old = \int_{V̂_r ∩ FoV_r(q_r)} v(x)\, dx
7: for θ ∈ [0 : Δθ : 2π] do
8:   q̂^{t+1}(θ) = q_r + T R(θ)d
9:   Δc(θ) = \int_{V̂_r ∩ FoV_r(q̂^{t+1}(θ))} v(x)\, dx − c_old
10: end for
11: θ* = arg max_θ Δc(θ)
12: if Δc(θ*) > 0 then
13:   q_r^{t+1} = q̂^{t+1}(θ*)
14:   T = T_init    ▷ T_init = 1 in experiments
15: else if rand(0, 1) < P_accept then    ▷ P_accept from (5)
16:   q_r^{t+1} = q̂^{t+1}(θ*)
17:   T := 0.95T    ▷ 0.95 is a typical cooling factor
18: else
19:   Set q_r^{t+1} according to (4)
20:   T := 0.95T
21: end if

The rules governing simulated annealing correspond to a single searching agent. Thus, in a multi-robot setting, each agent selects actions on its own based on the information stored in the shared, distributed PHD. Our approach is outlined in Algorithm 1. First, robot r expands its Voronoi cell to V̂_r and exchanges the PHD particles with its neighbors (lines 1–5). The weight c_old is the sum of the PHD weights of the particles in the overlap between V̂_r and the robot's FoV. Robots then search for the best action from a finite set of actions composed of angles θ, which follow a discrete uniform distribution with fixed step size d (line 7). Robot r then evaluates potential new locations (line 8), where T is the temperature, R is a rotation matrix, and d is a standard step. For each potential location, the robot predicts the weight it would obtain if it moved there by summing the PHD weights of the particles in the overlap between V̂_r and the potential future FoV (line 9). The movement θ* leading to the maximum Δc is selected as the next movement step (line 11). Robot r moves to the new position q̂^{t+1} only if Δc > 0 or rand(0, 1) < P_accept (lines 12–17). Otherwise, the robot falls back to the standard Lloyd's algorithm (line 19).
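The per-robot decision rule can be summarized in a short Python sketch (ours, not the authors' implementation). Here phd_mass(q) is an assumed helper that returns the PHD mass in V̂_r ∩ FoV_r(q), the candidate q_r + T·d·(cos θ, sin θ) realizes line 8's q_r + T R(θ)d, and we read the max(T) in Eq. (5) as a guard against a vanishing temperature:

```python
import numpy as np

def sa_control_step(q_r, T, phd_mass, lloyd_goal, d=1.0, n_dirs=16,
                    T_init=1.0, cooling=0.95, rng=np.random.default_rng()):
    """One SA control step for robot r (sketch of Algorithm 1, lines 6-21)."""
    c_old = phd_mass(q_r)
    thetas = np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False)
    candidates = [q_r + T * d * np.array([np.cos(th), np.sin(th)]) for th in thetas]
    gains = np.array([phd_mass(q) - c_old for q in candidates])  # Delta c(theta)
    best = int(gains.argmax())
    delta = np.linalg.norm(candidates[best] - q_r)
    p_accept = 1.0 / (1.0 + np.exp(delta / max(T, 1e-9)))  # Eq. (5)
    if gains[best] > 0:
        return candidates[best], T_init          # improving move: accept, reset T
    if rng.random() < p_accept:
        return candidates[best], cooling * T     # worse move: accept with P_accept
    return lloyd_goal, cooling * T               # otherwise fall back to Lloyd's goal
```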

3.2 Particle Swarm Optimization

In PSO, a collection of particles P moves through the search space, and the algorithm evaluates the objective value at each particle. These particles are the candidate solutions, where particle i is located at q_i^t at time t. The velocity v_i^t of particle i is updated using these objective evaluations:

q_i^t = q_i^{t-1} + v_i^t
v_i^t = w\, v_i^{t-1} + c_1 r_1 (q_{i,b}^* - q_i^t) + c_2 r_2 (q_{i,g}^* - q_i^t)   (6)

where r_1 and r_2 are random numbers between 0 and 1, c_1 and c_2 are predefined constants in PSO, and w is the inertia term. Here, q_{i,b}^* is the position of particle i where the objective function achieves its best value (i.e., a local best), and q_{i,g}^* is the position with the best objective function value in the history (i.e., a global best).
PSO is a metaheuristic method that generates multiple candidate solutions in the search space at the same time. This has a natural analog in multi-robot search, where each robot represents a single particle. However, slightly different from standard PSO, a robot can measure the value of points everywhere in its FoV instead of at a single point. Additionally, our objective is not to find a single global best value of the objective function, but rather to find all local maxima (i.e., all targets). Thus, we need to redefine the meaning of q_{·,g}^* and q_{·,b}^*. In Algorithm 2, we set q_{·,g}^* to be the weighted centroid of the Voronoi cell (i.e., the location from Lloyd's algorithm), and if a robot obtains the highest weight in its history at its current location, it sets its current position q as q_{·,b}^*. The highest weight in the history, denoted w_history, is initialized as the weight obtained at t = 1 and is updated with the current weight w^t if w^t ≥ w_history. Since r_1, r_2 ∼ rand(0, 1), the robot moves with a random trade-off between q_{·,b}^* and q_{·,g}^*.

Algorithm 2. Distributed control with PSO for Robot r
1: Compute local w^t = \sum_{x ∈ FoV} v(x) at time t
2: Set q_{r,g}^* according to (4)
3: if w^t ≥ w_history then
4:   q_{r,b}^* = q_r    ▷ Current robot location
5:   w_history = w^t
6: end if
7: v_r^t = w\, v_r^{t-1} + c_1 r_1 (q_{r,b}^* − q_r^t) + c_2 r_2 (q_{r,g}^* − q_r^t)    ▷ Experiments set w, c_1, c_2 = 1
8: q_r^{t+1} = q_r^t + v_r^t    ▷ Robot moves to q_r^{t+1}
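A minimal Python sketch of the per-robot PSO update (our own illustration; the function and variable names are hypothetical) combines the inertia, local-best, and global-best terms of Eq. (6):

```python
import numpy as np

def pso_control_step(q_r, v_prev, q_local_best, q_global_best,
                     w=1.0, c1=1.0, c2=1.0, rng=np.random.default_rng()):
    """One PSO control step for robot r (sketch of Algorithm 2, lines 7-8).

    q_global_best is the weighted Voronoi centroid from Eq. (4), and
    q_local_best is the robot's best-weight position in its history."""
    r1, r2 = rng.random(2)
    v = (w * v_prev
         + c1 * r1 * (q_local_best - q_r)     # pull toward the local best
         + c2 * r2 * (q_global_best - q_r))   # pull toward the Lloyd centroid
    return q_r + v, v                          # new position and velocity
```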

3.3 Communication and Complexity

All teams use the distributed PHD filter for MTT regardless of the search strat-
egy. This is a limiting factor in terms of both computational complexity and
communication bandwidth. It requires robots to exchange ownership of por-
tions of the PHD density function as the Voronoi partitions shift [6, Algorithm
1], exchange information about targets that may move across the boundaries of
Voronoi cells [6, Algorithm 2], and exchange measurements about targets viewed
in neighboring Voronoi cells [6, Algorithm 3]. The standard PHD filter equations
scale linearly in complexity with the number of targets (i.e., measurements) [6,
eq. (4)]. The complexity and communication load of the distributed PHD filter
depend on the situation. For example, the particle exchange step [6, Algorithm
1] depends on the amount of area changing ownership, which can be large when
the entire team is tightly packed into one area (yielding very large Voronoi cells
for boundary robots). The PHD update [6, Algorithm 3] scales linearly with the
number of robots who can see into one another’s Voronoi cells, again making
the tightly packed case difficult. When all robots are evenly distributed in the
space, the complexity and communication bandwidth will be minimal. Note also
the distributed PHD filter assumes that all robots can communicate with their
Voronoi neighbors, that communication channels are lossless, and that it ignores
the effects of latency. Addressing these real-world factors will be the subject of
future work.
The different control policies (Lloyd’s algorithm, SA, and PSO) all utilize
the resulting PHD estimate to find the goal location. Since robots compute their
goal locations using only the PHD within their own Voronoi cell, no additional
communication is required for the control computations.1 Additionally, the con-
trol computations are all quite straightforward and have constant complexity
with respect to team size and target count.
Lastly, it should be noted that the “particles” used within our SA and PSO
algorithms are actually individual robotic agents in the team. Thus, while it is
well known that the performance of SA and PSO is dependent upon the number
of particles, in our case this simply means it is a function of team size.

¹ This assumes that, per the note in Algorithm 1, line 2, the Voronoi computations are combined with the distributed PHD filter step.

4 Experimental Results
In this section, we conduct a set of simulated experiments using Matlab to 1) evaluate the correctness and 2) demonstrate the efficacy of the proposed distributed estimation and control algorithms by comparing them with the distributed methods used in our previous work [6].
The environment is an open 100 × 100 m area with no obstacles. For simplicity, we represent the PHD using a uniform grid of particles with 1 m spacing. The initial weight of each particle is w_i = 1 · 10^{−4}, so the initial expected number of targets in the environment is 1. The robots have a maximum velocity of 2 m/s and are holonomic. The sensor collects data at 2 Hz, and the sensor specifications are:

p_d(x \mid q) = \begin{cases} 0.8 & \|x - q\| \le 5\,\text{m} \\ 0 & \text{else} \end{cases}
g(z \mid x, q) = \mathcal{N}(z \mid x - q,\ 0.25\, I_2)   (7)
c(z \mid q) = 3.66 \cdot 10^{-3}

The expected number of clutter detections per scan is \int_{FoV} c(z \mid q)\, dz = 0.287.
Note, these values do not represent any specific real-world situation, but could
represent a team of drones equipped with downward facing cameras. For a real
situation, one could follow the procedures outlined in our other work to carefully
characterize the sensor models [2,5].
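For illustration, the sensor model of Eq. (7) can be sampled with the short Python sketch below (ours, not the authors' code). We assume the clutter intensity is uniform over the circular 5 m field of view, consistent with the constant density 3.66 · 10^{−3} integrating to roughly 0.287 over the disk:

```python
import numpy as np

def simulate_scan(targets, q, p_d=0.8, r_fov=5.0, noise_var=0.25,
                  clutter_rate=0.287, rng=np.random.default_rng()):
    """Draw one set of relative-position measurements from Eq. (7).

    targets: (n, 2) array of target positions; q: (2,) sensor position."""
    Z = []
    for x in targets:
        if np.linalg.norm(x - q) <= r_fov and rng.random() < p_d:
            # True detection: Gaussian noise with covariance 0.25 * I_2
            Z.append(x - q + rng.normal(0.0, np.sqrt(noise_var), size=2))
    # Poisson-distributed clutter, uniform over the circular field of view
    for _ in range(rng.poisson(clutter_rate)):
        r = r_fov * np.sqrt(rng.random())  # sqrt for uniform density over the disk
        th = 2 * np.pi * rng.random()
        Z.append(np.array([r * np.cos(th), r * np.sin(th)]))
    return Z
```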

4.1 Evaluation Metric


To evaluate the results we use the Optimal SubPattern Assignment (OSPA)
metric [23], which is widely used in the PHD literature. The OSPA between two
sets X, Y, where |X| = m ≤ |Y| = n without loss of generality, is

d(X, Y) = \left( \frac{1}{n} \Big( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\big(x_i, y_{\pi(i)}\big)^p + c^p (n - m) \Big) \right)^{1/p}   (8)

where c is the cutoff distance, which we set to c = 10 m, d_c(x, y) = \min(c, \|x - y\|), and \Pi_n is the set of permutations of \{1, 2, 3, \ldots, n\}. We set p = 1 as the p-norm. Since the OSPA can fluctuate due to false negative and false positive detections, we use the median of the OSPA over the last 100 s of each trial to obtain a steady estimate in each experiment, and we measure the variance of the OSPA to evaluate the stability of the tracker.
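A compact Python implementation of Eq. (8) (our own sketch; it relies on scipy's linear_sum_assignment to solve the minimization over assignments) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=10.0, p=1):
    """OSPA distance of Eq. (8) between point sets X and Y."""
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    if len(X) > len(Y):          # enforce |X| = m <= |Y| = n
        X, Y = Y, X
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    # Cutoff distances d_c(x, y) = min(c, ||x - y||)
    D = np.minimum(c, np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2))
    rows, cols = linear_sum_assignment(D ** p)              # optimal assignment
    cost = (D[rows, cols] ** p).sum() + (c ** p) * (n - m)  # cardinality penalty
    return (cost / n) ** (1.0 / p)
```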

4.2 Stationary Targets


In our first set of experiments we assume all targets are static, so the transition model is the identity map, the survival probability is 1, and the birth model is 0. We use teams of 10 to 100 robots (in steps of 10) searching for 10, 30, or 50 targets and compare the new SA and PSO control methods against our previous work that

uses Lloyd’s algorithm alone [6]. The target positions are drawn uniformly at
random, while the robots all begin in a 20 × 10 m rectangle at the bottom center
of the environment. For each configuration of target number, robot number, and
control algorithm, we run 10 trials. Note that to ensure consistency in the results,
we use identical target distributions across all team/control methods, i.e., there
are 10 different randomized target distributions for each configuration.

Tracking Accuracy. Figure 1 shows the statistics of the median OSPA error over the ten trials using our original method (PHD), PSO, and SA, respectively.
We can see in all cases, as the size of the team increases and surpasses the number
of targets, the OSPA error reaches a minimum and does not decrease, even with
the addition of more robots. The OSPA error of around 0.8 indicates that almost
all targets have been detected and tracked with higher accuracy than the PHD
grid (1 m). This empirical minimum is based on the accuracy of the sensors, the
particle spacing, and our method for extracting target locations (i.e., finding
peaks in the PHD density function [6]).
We see that the PSO- and SA-based algorithms both have a much lower OSPA error when the number of robots is smaller than the number of targets, with the PSO algorithm having a lower average value than SA or PHD. The difference between the new methods and PHD is especially evident in the 10-robot, 10-target case, where the original method fails to find about half of the targets (i.e., OSPA about halfway between the minimum of 0.8 and the maximum of c = 10). We also see that the spread in the OSPA values is smallest for the SA case, indicating that it offers the most reliable performance of the three approaches. In summary, PSO offers the lowest average OSPA, SA offers the most consistent results, and the original method is not the best in any category.

Coverage. We also wish to examine how well each algorithm explores the environment, as this is another indicator of success. Figure 2 shows the robots' trajectories in the 10-robot, 10-target case, where the orange diamonds are the target locations and the lines in different colors are the robots' trajectories. We can observe that in all cases most of the robots eventually find targets and oscillate around them. However, the paths they take to get there differ. Lloyd's algorithm, as shown in Fig. 2a, spends the least time exploring the space, which contributes to it having the highest OSPA error. In this figure, two targets are not detected. The robots move directly to their weighted Voronoi centroids and become idle once they have arrived at the goal point, which is a great waste of search resources. We see in Fig. 2b that the robots cover much more of the search space in the same amount of time. Thus, the addition of the new local best term (since the global best is identical to Lloyd's) causes the robots to explore more. We also see that the robot paths are relatively smooth, a result of the inertia term in (6) causing the heading direction to remain somewhat consistent. We see in Fig. 2c that these robots take more of a random walk. This is because the annealing process drives the robots to oscillate between the best point

Fig. 1. Boxplots show the median OSPA statistics over 10 runs for teams of 10–100 robots and 10, 30, and 50 static targets for the original PHD, PSO, and SA, respectively.

Fig. 2. Robots’ trajectories using different control methods. Robots all begin in the
dashed box at the bottom of the environment.

Fig. 3. Percent of undetected area over time over 10 runs for 10-robot, 10-target case,
where the solid curves show the mean value and the shaded areas show the range.

Fig. 4. Boxplots show the percent of undetected area at the end of the experiment over
10 runs for teams of 10–100 robots and 50 static targets for the original PHD, PSO,
and SA search strategies.

and the potential point. However, since the temperature keeps decreasing, the
oscillation range is smaller than that of PSO.
We can examine these trends in a quantitative manner by plotting the per-
centage of unexplored area over time for each method in Fig. 3. We observe that
the average exploration speed is similar across all methods over the first 100 s,
with SA showing a larger range. After that, we observe that the exploration rate
of Lloyd’s algorithm levels off around 0.35 as robots become trapped. The other
methods both have ways of escaping local minima, with SA yielding the highest
and most consistent exploration rate as time increases.
To further explain, the unexplored area metric in Fig. 4 measures the per-
centage of the total area that was not explored by any robot at the end of the
search period. We see that in the 50 target case, all methods are able to cover
a significant portion of the environment, but SA has the highest median and
maximum unexplored area. This is likely due to the fact that the robots fre-
quently get stuck in local minima due to the higher density of targets in the
environment. Once the number of robots exceeds the number of targets, all methods yield similar results and cover more than 99% of the environment for ≥60 robots in
the allotted time. This is because there are free robots to continue to explore
the remainder of the space even when each target has a robot assigned to track
it.

Fig. 5. Boxplots show the median OSPA statistics over 10 runs for teams of 10–100 robots tracking moving targets for the original PHD, PSO, and SA, respectively.

4.3 Dynamic Targets


We also conduct a set of experiments with moving targets. Targets have a maximum velocity of 1 m/s and may enter or leave the environment by moving across the boundary. When a new target crosses the boundary, it moves along a random

direction. Targets become inactive if they move out of the environment. To model the birth of the targets, we set the birth PHD to be non-zero near the map boundary ∂E:

b(x) = \begin{cases} 5.26 \times 10^{-5} & \|x - \partial E\| \le 5\,\text{m} \\ 0 & \text{else} \end{cases}   (9)

This model expects one new target to appear every 10 s. Based on the appearance and disappearance of targets, the number of targets reaches a steady state of around 30 targets, regardless of the initial number of targets.
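For illustration, the birth model of Eq. (9) reduces to a few lines of Python (our own sketch; it assumes the square 100 × 100 m environment of Sect. 4 and measures the distance to the nearest boundary edge):

```python
def birth_phd(x, y, side=100.0, band=5.0, weight=5.26e-5):
    """Birth PHD b(x) of Eq. (9): non-zero within 5 m of the boundary."""
    dist_to_boundary = min(x, side - x, y, side - y)
    return weight if dist_to_boundary <= band else 0.0
```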
Figure 5 shows the median steady state OSPA value, as measured by the
average over the final 200 out of 1000 time steps. We see that in all cases, the
OSPA decreases as the number of robots increases and appears to reach an
asymptote. However, the PHD method yields the highest OSPA error when the
number of robots is small, but has the lowest asymptote when the number of
robots is high. We see that the PSO and SA methods are quite similar with 10–40
robots. However, above that we see that PSO reaches a higher asymptote and has
a higher variance in the OSPA, meaning it is both less accurate and less reliable
than SA. This is similar to the static target case, where Lloyd's algorithm gets stuck tracking one target while PSO and SA can both escape these local minima
to switch to another target. This escape behavior is very beneficial when there
are fewer robots than targets but slightly harmful to tracking performance when
there are more robots than targets.

5 Conclusion
In this paper, we examined the performance of different search strategies to
solve the MR-MTT problem. All methods share a common structure with two
components: 1) an estimation algorithm based on the distributed PHD filter
and 2) a control strategy combining the stochastic optimization algorithm with
Voronoi-based control. The robots use the estimate from the distributed PHD
filter to weight the relative importance of the area within their Voronoi cell
and select an action. We compare the standard Lloyd's algorithm with two new methods based on standard stochastic optimization algorithms, namely simulated annealing (SA) and particle swarm optimization (PSO). Both of these new methods allow robots to trade off between a global goal and other local areas known to contain targets, unlike Lloyd's algorithm, which only uses local information. We
see that the new stochastic methods enable the robot team to explore more
search area and yield better tracking results when there are fewer robots than
targets. In other words, the robots make a better trade-off between exploration
and exploitation.

Acknowledgement. This work was funded by NSF grants IIS-1830419 and CNS-
2143312.

References
1. Blackman, S.: Multiple hypothesis tracking for multiple target tracking. IEEE
Aerosp. Electron. Syst. Mag. 19(1), 5–18 (2004). https://doi.org/10.1109/MAES.
2004.1263228
2. Chen, J., Xie, Z., Dames, P.: The semantic PHD filter for multi-class target track-
ing: From theory to practice. Robot. Autonomous Syst. 149 (2022). https://doi.
org/10.1016/j.robot.2021.103947
3. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing
networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004). https://doi.org/10.
1109/TRA.2004.824698
4. Dames, P., Kumar, V.: Autonomous localization of an unknown number of targets
without data association using teams of mobile sensors. IEEE Trans. Autom. Sci.
Eng. 12(3), 850–864 (2015). https://doi.org/10.1109/tase.2015.2425212
5. Dames, P., Kumar, V.: Experimental characterization of a bearing-only sensor for
use with the phd filter (2015). https://doi.org/10.48550/arXiv.1502.04661
6. Dames, P.M.: Distributed multi-target search and tracking using the PHD filter.
Auton. Robot. 44, 673–689 (2020). https://doi.org/10.1007/s10514-019-09840-9
7. Davis, L.: Bit-climbing, representational bias, and test suite design. In: Proc. Intl. Conf. Genetic Algorithms, pp. 18–23 (1991)
8. Derr, K., Manic, M.: Multi-robot, multi-target particle swarm optimization search
in noisy wireless environments. In: 2009 2nd Conference on Human System Inter-
actions, pp. 81–86 (2009). https://doi.org/10.1109/HSI.2009.5090958
9. Deutsch, I., Liu, M., Siegwart, R.: A framework for multi-robot pose graph slam.
In: 2016 IEEE International Conference on Real-time Computing and Robotics
(RCAR), pp. 567–572 (2016). https://doi.org/10.1109/RCAR.2016.7784092
10. Doitsidis, L., et al.: Optimal surveillance coverage for teams of micro aerial vehicles
in gps-denied environments using onboard vision. Auton. Robot. 33(1), 173–188
(2012). https://doi.org/10.1007/s10514-012-9292-1
11. Du, Q., Faber, V., Gunzburger, M.: Centroidal voronoi tessellations: applica-
tions and algorithms. SIAM Rev. 41(4), 637–676 (1999). https://doi.org/10.1137/
S0036144599352836
12. Fortmann, T., Bar-Shalom, Y., Scheffe, M.: Sonar tracking of multiple targets using
joint probabilistic data association. IEEE J. Oceanic Eng. 8(3), 173–184 (1983).
https://doi.org/10.1109/JOE.1983.1145560
13. Glover, F.: Tabu search-part i. ORSA J. Comput. 1(3), 190–206 (1989)
14. Goldhoorn, A., Garrell, A., Alquézar, R., Sanfeliu, A.: Searching and tracking peo-
ple with cooperative mobile robots. Auton. Robot. 42(4), 739–759 (2018). https://
doi.org/10.1007/s10514-017-9681-6
15. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948 (1995). https://doi.org/10.1109/ICNN.1995.488968
16. Kirkpatrick, S., Gelatt, C.D., Jr., Vecchi, M.P.: Optimization by simulated anneal-
ing. Science 220(4598), 671–680 (1983)
17. Mahler, R.: Multitarget bayes filtering via first-order multitarget moments. IEEE
Trans. Aerosp. Electron. Syst. 39(4), 1152–1178 (2003). https://doi.org/10.1109/
TAES.2003.1261119
18. Mirjalili, S.: Moth-flame optimization algorithm: A novel nature-inspired heuris-
tic paradigm. Knowl.-Based Syst. 89, 228–249 (2015). https://doi.org/10.1016/j.
knosys.2015.07.006

19. Murphy, R.: Human-robot interaction in rescue robotics. IEEE Trans. Syst. Man
Cybern. Part C (Applications and Reviews) 34(2), 138–153 (2004). https://doi.
org/10.1109/TSMCC.2004.826267
20. Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L.: Joint optimization of multi-uav
target assignment and path planning based on multi-agent reinforcement learning.
IEEE Access 7, 146264–146272 (2019). https://doi.org/10.1109/ACCESS.2019.
2943253
21. Rizk, Y., Awad, M., Tunstel, E.W.: Decision making in multiagent systems: a sur-
vey. IEEE Trans. Cognitive Developmental Syst. 10(3), 514–529 (2018). https://
doi.org/10.1109/TCDS.2018.2840971
22. Robin, C., Lacroix, S.: Multi-robot target detection and tracking: taxonomy and
survey. Auton. Robot. 40(4), 729–760 (2016)
23. Schuhmacher, D., Vo, B.T., Vo, B.N.: A consistent metric for performance evalua-
tion of multi-object filters. IEEE Trans. Signal Process. 56(8), 3447–3457 (2008).
https://doi.org/10.1109/TSP.2008.920469
24. Shi, W., He, Z., Tang, W., Liu, W., Ma, Z.: Path planning of multi-robot systems
with boolean specifications based on simulated annealing. IEEE Robot. Autom.
Lett. 7, 6091–6098 (2022). https://doi.org/10.1109/LRA.2022.3165184
25. Stone, L.D., Streit, R.L., Corwin, T.L., Bell, K.L.: Bayesian multiple target track-
ing. Artech House (2013)
26. Tang, Q., Yu, F., Xu, Z., Eberhard, P.: Swarm robots search for multiple tar-
gets. IEEE Access 8, 92814–92826 (2020). https://doi.org/10.1109/ACCESS.2020.
2994151
27. Vo, B.N., Ma, W.K.: The Gaussian mixture probability hypothesis density filter.
IEEE Trans. Signal Process. 54(11), 4091–4104 (2006)
28. Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo methods for multitarget
filtering with random finite sets. IEEE Trans. Aerosp. Electron. Syst. 41(4), 1224–
1245 (2005)
29. Wang, B.: Coverage problems in sensor networks: A survey. ACM Comput. Surv.
43(4) (2011). https://doi.org/10.1145/1978802.1978811
30. Zhang, C., Lesser, V.: Coordinating multi-agent reinforcement learning with lim-
ited communication. In: Proceedings of the 2013 International Conference on
Autonomous Agents and Multi-Agent Systems, AAMAS ’13, pp. 1101–1108. Inter-
national Foundation for Autonomous Agents and Multiagent Systems, Richland,
SC (2013)
Multi-agent Deep Reinforcement
Learning for Countering Uncrewed Aerial
Systems

Jean-Elie Pierre1(B) , Xiang Sun1 , David Novick2 , and Rafael Fierro1


1 Department of Electrical and Computer Engineering, The University of New
Mexico, Albuquerque, NM 87131, USA
jpierre@unm.edu
2 Sandia National Laboratories, Albuquerque, New Mexico 87185, USA

Abstract. The proliferation of small uncrewed aerial systems (UAS)


poses many threats to airspace systems and critical infrastructures. In
this paper, we apply deep reinforcement learning (DRL) to intercept
rogue UAS in urban airspaces. We train a group of homogeneous friendly
UAS, in this paper referred to as agents, to pursue and intercept a
faster UAS evading capture while navigating through crowded airspace
with several moving non-cooperating interacting entities (NCIEs). The
problem is formulated as a multi-agent Markov Decision Process, and
we develop the Proximal Policy Optimization based Advantage Actor-Critic (PPO-A2C) method to solve it, where the actor and critic networks are trained on a centralized server and the derived actor network is distributed to the agents to generate optimal actions based on their observations. The simulation results show that, compared to the traditional method, PPO-A2C fosters collaboration among agents to achieve the highest probability of capturing the evader while maintaining a low collision rate with other agents and NCIEs in the environment.

Keywords: Multi-agent systems · deep reinforcement learning ·


counter uncrewed aerial systems (C-UAS) · machine learning

1 Introduction
Counter Uncrewed Aerial Systems (C-UAS) is an active area of research. A paper
by the Center for The Study of Drones identified 235 counter-drone products on
the market or in development as of 2018 [1]. The rise in C-UAS research is largely due to the proliferation of uncrewed aerial systems (UAS). Although UAS have shown great potential in solving various critical and difficult tasks, the growing autonomy of UAS poses a threat to the security of individuals and organizations alike. A growing concern is the intrusion of UAS into guarded or protected spaces, as any mishandled or malicious UAS in a protected space poses a safety risk to the related entities, such as aircraft near or in airports and groups of people at large gatherings. In addition, malicious UAS can be

Fig. 1. C-UAS scenario.

used to attack critical infrastructure, such as airport control towers, or perform


unwanted surveillance in a protected space [19].
It is necessary to design a C-UAS method that can intercept any threat UAS before it causes further damage in protected or target areas. Instead of having one UAS (the pursuer) intercept a threat UAS (the evader), we consider a group of pursuers, whose original task is to provide surveillance service, to intercept evaders, thus protecting the target area. Normally, the C-UAS problem can be formulated as a pursuit-evasion problem, where an evader moves around and avoids capture by a pursuer [10,23]. The traditional way to solve the problem is to apply control techniques in each pursuer, which, however, has high complexity and requires each agent to have global knowledge of the network, and is thus impractical in most cases. In recent years
there has been increasing research on solving the pursuit-evasion problem using multi-agent deep reinforcement learning (DRL) [4,9], where deep neural networks are used as policies to derive the actions, i.e., the velocities and angular velocities of the pursuers, based on their observations. Here, each pursuer could apply the same or different policies (deep neural networks) to derive its actions. Compared to traditional control techniques, the multi-agent DRL method has a low response time in deriving the optimal actions of a pursuer from the pursuer's current observation, as long as the deep neural networks are well trained via extensive simulations. On the other hand, although DRL methods have been proposed to solve the C-UAS problem, some of them do not consider collaboration amongst multiple pursuers [2]. Some works apply the multi-agent DRL method to enable collaborations

among pursuers, but they do not consider, for example, the existence of moving obstacles or collisions between pursuers [7]; this may significantly reduce the state space and complexity of the problem, but it renders the methods impractical.
As compared to the existing works, we study a more complicated but realistic C-UAS scenario, where there is one evader and multiple pursuers in an alarmed area. As shown in Fig. 1, the pursuers have to work together to prevent the evader from entering the target area, while avoiding collisions with other pursuers and non-cooperating interacting entities (NCIEs) in the alarmed area. We design a novel multi-agent DRL algorithm inspired by [16,18], where different pursuers derive their actions based on their local actor networks, which are trained with the shared experiences of all the pursuers simultaneously based on Proximal Policy Optimization (PPO). The contributions of the paper are summarized as follows. 1) We consider a complicated C-UAS scenario, where multiple pursuers, NCIEs, and one evader exist in an alarmed area. We tackle this challenging problem by designing a Proximal Policy Optimization based Advantage Actor-Critic for parameter-sharing multi-agent DRL. More specifically, the pursuers share their observations with a centralized controller, which trains the actor and critic networks and then distributes the actor network to the pursuers to locally derive their actions. The designed solution allows the pursuers to quickly intercept the evader while avoiding collisions and model divergence. 2) We conduct extensive simulations to show the convergence of the method and demonstrate its performance by comparing it with existing baseline methods.
The rest of the paper is structured as follows. We formulate the C-UAS problem in Sect. 2 as a multi-agent Markov Decision Process. Section 3 describes our designed multi-agent DRL algorithm, Sect. 4 discusses simulation results, and Sect. 5 provides concluding remarks and future research directions.

1.1 Related Work


The multi-agent pursuit-evasion problem has been studied throughout the years.
In traditional pursuit-evasion games, a pursuer tries to capture evaders, while an
evader tries to evade capture from the pursuers. The authors in [14] used the area
minimization strategy as a means to intercept rogue robots. The strategy seeks
the best coordination to minimize Voronoi cells of evaders. Similarly, the authors
in [24] solve the cooperative pursuit problem by partitioning the problem space
into Voronoi regions and developing a control strategy that allows each pursuer
to occupy the regions, thus increasing the probability of capturing the evaders.
One variation of the pursuit-evasion problem is the perimeter defense game described in [17]. The authors developed an algorithm to intercept a group of evaders trying to reach a target area. In their problem formulation, the pursuers are restricted to the perimeter of the target, whereas in our paper the pursuers are free to move outside the perimeter of the target area, which is more practical. Another variation of the pursuit-evasion problem is described in [23], where policies are developed for two pursuers to intercept any incoming threat. The authors described several intercept strategies for

catching an evader. The main focus of that work is on using two pursuers, though, which may not be adequate to capture a superior evader. Another variation is described in [3], where the designed policy herds a group of swarm evaders to a safe area. The authors developed a StringNet-based algorithm to herd the attackers to a safe area away from the target space.
Different from using the traditional control theory, multi-agent DRL is
another solution to solve the pursuit-evasion problem. Research in multi-agent
reinforcement learning has been plentiful over the years. The authors in [21]
highlighted the current state of multi-agent DRL and its challenges. Actually,
Multi-agent DRL can solve not only pursuit-evasion games, but also, for exam-
ple, swarm control [9,21] and cooperation. Some of the challenges to applying
multi-agent DRL include nonstationarity and scalability. The nonstationarity
issue arises because an agent’s action derived from its policy may affect other
agents’ actions, thus leading to model training divergence. The scalability arises
because the action and state spaces exponentially increase with respect to the
number of agents, and thus are not scalable to a larger group of agents. To
address the scalability issue, the authors in [9] propose Mean Embedding for
defining the agent’s observation so that training is invariant to the number of
agents. Whereas [8] used a self-attention-based model to aid with scalability. In
[22], a target prediction network (TP Net) is developed to predict an evader’s
location using long short-term memory (LSTM). In [4], the work most closely related to ours, the authors provided a framework for
training eight pursuers to capture one evader. The evader’s goal is to avoid
capture while moving around within a specific area. This is different from our
scenario, where the evaders’ goal is to reach the predefined target area. Their
main contribution is the evaluation of different factors that affect the probabil-
ity of capture for the pursuers such as the evader’s relative speed, number of
pursuers and radius of pursuers and evader. They also used curriculum learning
to train their agents. With curriculum learning, the agents learn to achieve suc-
cessive tasks by first completing and building on easier tasks. This method has
been shown to have good stability performance but does not address collision
between agents and obstacle avoidance. Our approach builds on this research
and addresses inter-agent collision and obstacle avoidance.
To train the deep neural networks (i.e., the policies) in multi-agent reinforcement learning, several methods have been proposed. For example, [6] applied trust region policy optimization to train the deep neural networks, whose parameters are shared among agents. In [12], the authors created a variant of DDPG for multi-agent systems, where the agents share the action-value function for learning and act in a decentralized fashion during execution. To achieve superior training performance, a reward function must be provided that maps agents' states to their rewards; [20] uses sparse rewards to enable pursuers to reach evaders. At each successful capture, the pursuers get a reward based on the number of pursuers within the capture distance of the evader.

2 Problem Formulation
As shown in Fig. 1, there is a group of N pursuers tasked with protecting a
target area T from a group of M evaders in an alarmed area W, where T ⊂ W.
The positions of the pursuers and evaders at each time step are denoted as P_p^i(t) ∈ ℝ² (i = 1, ..., N) and P_e^j(t) ∈ ℝ² (j = 1, ..., M), respectively, where i and j are the indices of the pursuers and evaders, respectively, and t is the current time step. P_p^i(t) and θ_p^i(t) are the 2D position and heading of pursuer i, and P_e^j(t) and θ_e^j(t) are the 2D position and heading of evader j. Define x_p^i(t) and x_e^j(t) as the state vectors of pursuer i and evader j, where x_p^i(t) = (P_p^i(t), θ_p^i(t)) and x_e^j(t) = (P_e^j(t), θ_e^j(t)). If the unicycle model is applied, then the dynamics of the pursuer and evader are given by

\frac{d}{dt} x_p^i(t) = \begin{bmatrix} \cos\theta_p^i(t) & 0 \\ \sin\theta_p^i(t) & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_p^i(t) \\ \omega_p^i(t) \end{bmatrix} \quad\text{and}\quad \frac{d}{dt} x_e^j(t) = \begin{bmatrix} \cos\theta_e^j(t) & 0 \\ \sin\theta_e^j(t) & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_e^j(t) \\ \omega_e^j(t) \end{bmatrix},   (1)

respectively, where v_p^i(t) and ω_p^i(t) are the wheel velocity and angular velocity of pursuer i at time step t, and v_e^j(t) and ω_e^j(t) are the wheel velocity and angular velocity of evader j at time step t. The feasible values of v_p^i(t), ω_p^i(t), v_e^j(t), and ω_e^j(t) satisfy v_min ≤ v_p^i(t), v_e^j(t) ≤ v_max and ω_min ≤ ω_p^i(t), ω_e^j(t) ≤ ω_max.
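For illustration, the unicycle dynamics of Eq. (1) can be integrated with a simple forward-Euler step, as in the Python sketch below (ours, not the authors' code; the time step dt and any clipping of the inputs to their feasible ranges are left to the caller):

```python
import numpy as np

def step_unicycle(state, v, omega, dt):
    """Euler-integrate the unicycle dynamics of Eq. (1) for one time step.

    state = [px, py, theta]; v and omega are the wheel and angular
    velocities, assumed already clipped to [v_min, v_max] and
    [omega_min, omega_max]."""
    px, py, theta = state
    return np.array([px + v * np.cos(theta) * dt,
                     py + v * np.sin(theta) * dt,
                     theta + omega * dt])
```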
In addition, there are K non-cooperating interacting entities (NCIEs) moving in the area W; denote P_o^k(t) ∈ ℝ² (k = 1, ..., K) as the position of NCIE k at time step t. The NCIEs move throughout area W with constant speeds and no rotational components, i.e.,

\dot{x}_o^k(t) = c_x^k, \quad \dot{y}_o^k(t) = c_y^k, \quad \theta_o^k(t) = \arctan\big(c_y^k, c_x^k\big),   (2)

where cxk and cyk represent the constant velocity of NCIE k in the x and y direc-
tion, respectively. Here, the constant speed setup for an NCIE is used to emulate
the trajectory of the NCIE. Although these NCIEs, such as flying aircraft in busy
airports or a flock of birds, may have variable speeds that lead to much more
sophisticated control of the pursuers to avoid collisions as compared to the con-
stant speed setup of the NCIEs, we adopt a simplified but reasonable strategy
that each pursuer should ensure a safe distance to each NCIE to avoid the col-
lision. Here, the safety distance is calculated based on the worst-case scenario
that two entities are flying towards each other with their maximum velocities.
Hence, the constant/variable speed setup for an NCIE does not affect our pro-
posed method to be mentioned later on. Note that for future reference, we will
drop indices (i, j, k) and t when not applicable. In this paper, we assume that all
entities are at the same height, and therefore will only focus on 2-dimensional
space for this paper. We leave 3-dimensional C-UAS for future research. Also,
restricting UASs to nonholonomic constraints is feasible for working on a planar
space W for fixed-wing UAVs.

The goal of the evaders is to enter the target area T while evading capture. Evader j has successfully penetrated the target space if ‖P_e^j(t) − P_T‖ ≤ δ_T, where P_T and δ_T are the center position and radius of the target area T, respectively, and ‖·‖ is the Euclidean norm. The goal of the pursuers is to capture the evaders before they enter the target area. Evader j is successfully captured by pursuer i if ‖P_p^i(t) − P_e^j(t)‖ ≤ δ_c, where δ_c is a capture radius around the evader. Moreover, the pursuers cannot enter the target area T, which is motivated by the fact that in the real world some agents cannot enter crowded closed areas in order to avoid complicated events. In addition, a collision occurs between two pursuers once ‖P_p^i(t) − P_p^{i'}(t)‖ ≤ δ_p, where i and i' are the indices of the two pursuers and δ_p is the safety distance for avoiding a collision between two pursuers. Similarly, a collision between pursuer i and NCIE k occurs if ‖P_p^i(t) − P_o^k(t)‖ ≤ δ_o, where δ_o is the safety distance for avoiding a collision between a pursuer and an NCIE. We disregard collisions between two NCIEs as well as between two evaders, as we assume the evaders/NCIEs are well coordinated to prevent collisions. Specifically, the initial positions of the pursuers and evaders are always within and outside of the alarmed area W, respectively.

In our scenario, the evader's action, i.e., a_e^j(t) = (v_e^j(t), ω_e^j(t)), is dictated by a virtual attractive force defined in [23]. To define the virtual attractive force, we first convert the action of the evader from a unicycle model to a single integrator using [5]. The integrator model takes as input

v_e^j(t) = \left( v_{x,e}^j(t),\ v_{y,e}^j(t) \right),   (3)

where v_{x,e}^j and v_{y,e}^j are the current speeds of the evader in the x and y directions, respectively. Alternatively, the velocity vector of the evader can be defined as

v_e^j(t) = v_e^j\, \hat{d}_e^j(t),   (4)

where v_e^j is the maximum permissible velocity of the evader and \hat{d}_e^j(t) is a unit vector pointing in the desired direction of the evader. Since the evader's goal is to reach the target, we can derive a virtual attractive force that pulls the evader towards the target while simultaneously pushing the evader away from the NCIEs and the pursuers. This desired driving direction is given as

d_e^j(t) = -k_r \left( \sum_{i=0}^{N} \frac{1}{\| P_p^i(t) - P_e^j(t) \|^{\alpha}}\, \hat{P}_{p,e}^i(t) + \sum_{k=0}^{K} \frac{1}{\| P_o^k(t) - P_e^j(t) \|^{\alpha}}\, \hat{P}_{o,e}^k(t) \right) + k_a \frac{1}{\| P_e^j(t) - P_T \|}\, \hat{P}_{T,e}(t),   (5)
where the first two summations represent the repulsive forces pushing the evader away from the pursuers and NCIEs, and the last term represents an attractive force that pulls the evader towards the target space area T. Here, k_r and k_a are the repulsive and attractive force gains, respectively, and α determines whether the forces are distance dependent; in our simulation, we set α equal to 1. Since v_e^j(t) is the action for the single-integrator model, we use [5] to convert it to the action a_e^j(t) of the unicycle model. The values of k_r and k_a are optimized through extensive evaluation to maximize the evader's success in entering the alarmed space while avoiding collisions with the pursuers and NCIEs.
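A minimal Python sketch of Eqs. (4)-(5) follows (our own illustration; the gain values are placeholders, and we take each hat vector as the unit vector from the evader toward the corresponding entity, so the negated repulsive terms point away from it):

```python
import numpy as np

def desired_direction(p_e, pursuers, ncies, p_T, k_r=1.0, k_a=1.0, alpha=1.0):
    """Unit driving direction d_hat of Eqs. (4)-(5) for the evader.

    p_e: evader position; pursuers, ncies: lists of 2D positions;
    p_T: center of the target area."""
    eps = 1e-9  # guards against division by zero at coincident points
    d = np.zeros(2)
    for p in list(pursuers) + list(ncies):
        diff = p_e - p  # repulsion: points away from the pursuer/NCIE,
        # combining the unit vector and the 1/||.||^alpha distance weight
        d += k_r * diff / (np.linalg.norm(diff) ** (alpha + 1) + eps)
    diff = p_T - p_e    # attraction: points toward the target center
    d += k_a * diff / (np.linalg.norm(diff) ** 2 + eps)
    return d / (np.linalg.norm(d) + eps)
```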

3 Multi-agent Deep Reinforcement Learning for C-UAS


As described in [7], the C-UAS problem can be formulated as a multi-agent Markov Decision Process (MAMDP), formally given as a tuple (N, S, A, R, F, γ). Here,
1. N represents the number of agents (i.e., pursuers) in the environment.
2. S is the state space of the environment. Here, a state observed by pursuer i at time step t, denoted s_t^i ∈ S, comprises: 1) the states of pursuer i itself, i.e., (v_p^i(t), ω_p^i(t), d_{i,T}(t), θ_{i,T}(t)), where v_p^i(t) and ω_p^i(t) are the wheel velocity and angular velocity of pursuer i at time step t, and d_{i,T}(t) and θ_{i,T}(t) are the distance and angle between pursuer i and the center of the target space T, respectively; 2) the states of the evaders with respect to pursuer i, i.e., (d_{i,j}, θ_{i,j}, d_{j,T}, θ_{j,T}), where d_{i,j} is the distance between pursuer i and evader j, and θ_{i,j} and θ_{j,T} are the angles between evader j and pursuer i and between evader j and the center of the target space T, respectively; 3) the states of the NCIEs with respect to pursuer i, i.e., (d_{i,k}, θ_{i,k}), where d_{i,k} and θ_{i,k} are the distance and angle between pursuer i and NCIE k. Here, we assume each pursuer knows the states of the other pursuers, evaders, and NCIEs. This may not be realistic in the real world, but such information can be obtained using lidar or vision-based sensors.
3. A is the joint action space of all N agents, i.e., A = A_1 × ... × A_N. Let a_t^i ∈ A_i be the action of agent i at time step t, which is the velocity and angular velocity of pursuer i, i.e., a_t^i = (v_p^i, ω_p^i).

4. R is the reward function. The goal of each agent is to maximize its cumulative reward R^i = \sum_{t=0}^{T} \gamma^t r_t^i, where γ ∈ [0, 1] is the discount factor, T is the length of one episode, and r_t^i is the reward of agent i at time step t, i.e.,

r_t^i = \begin{cases} 1, & \| P_p^i(t) - P_e^j(t) \| \le \delta_c, \\ -1, & \| P_e^j(t) - P_T \| \le \delta_T, \\ -10, & \| P_p^i(t) - P_T \| \le \delta_T, \\ -0.10, & \| P_p^i(t) - P_p^{i'}(t) \| \le \delta_p, \\ -0.15, & \| P_p^i(t) - P_o^k(t) \| \le \delta_o, \\ -\beta\, d_{i,j}/d_{max}, & \text{otherwise}, \end{cases}   (6)

where r_t^i = 1 if the evader is captured, r_t^i = −1 if the evader reaches the target space, r_t^i = −10 if pursuer i enters the target space, r_t^i = −0.10 if the pursuer collides with another pursuer, and r_t^i = −0.15 if the pursuer collides with an NCIE in the environment. The term −β d_{i,j}/d_{max} is a negative reward calculated based on the distance between the pursuer and the evader: β determines the weight of the penalty, d_{i,j} is the Euclidean distance to the evader, and d_{max} is the maximum possible distance in the environment.
5. F : S × A → S defines the state transition probability density function that
maps the current states and actions into the next states.
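For illustration, the reward of Eq. (6) maps directly to a short Python function (our own sketch; the radii, β, and d_max defaults are placeholders mirroring the paper's notation):

```python
import numpy as np

def pursuer_reward(p_i, p_e, p_T, other_pursuers, ncies, d_max=100.0,
                   delta_c=1.0, delta_T=5.0, delta_p=1.0, delta_o=1.0, beta=1.0):
    """Reward r_t^i of Eq. (6) for one pursuer; cases checked top to bottom."""
    if np.linalg.norm(p_i - p_e) <= delta_c:
        return 1.0                     # evader captured
    if np.linalg.norm(p_e - p_T) <= delta_T:
        return -1.0                    # evader reached the target space
    if np.linalg.norm(p_i - p_T) <= delta_T:
        return -10.0                   # pursuer entered the target space
    if any(np.linalg.norm(p_i - q) <= delta_p for q in other_pursuers):
        return -0.10                   # pursuer-pursuer collision
    if any(np.linalg.norm(p_i - o) <= delta_o for o in ncies):
        return -0.15                   # pursuer-NCIE collision
    return -beta * np.linalg.norm(p_i - p_e) / d_max  # distance shaping
```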

3.1 Proximal Policy Optimization Based Advantage Actor Critic


In our multi-agent C-UAS environment, each pursuer interacts with the envi-
ronment, receives a reward rti , and proceeds into a new state st+1 . During each
time step t, the pursuer takes an action ati ∈ A based on a state st ∈ S. The goal
of the pursuer, denoted as G, is to maximize its cumulative discounted reward
in the environment following a policy π, which maps state s to action ati [13],
i.e.,

G = \mathbb{E}_{\pi, s_0} \left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \,\middle|\, a_t = \pi(\cdot \mid s_t) \right],   (7)

where γ ∈ [0, 1] is the discount factor and s_0 is the initial state of the environment. In policy-based RL algorithms, we can choose a parameterized policy π_φ(a_t | s_t) to maximize the cumulative discounted reward G(φ) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t; \phi)\right]. Many solutions have been designed to efficiently solve the MAMDP problem [13]; motivated by [16], we apply the Advantage Actor-Critic (A2C) method to the problem. A2C is a DRL method combining policy-based and value-based reinforcement learning. In A2C, there are two neural networks, i.e., the actor and critic networks. The actor network provides the stochastic policy used to choose actions such that the expected cumulative reward G is maximized. Here, we apply Proximal Policy Optimization (PPO) [16], which has been shown to outperform other policy-based methods in continuous control environments [16]. Denote π_φ(a_t | s_t) as the parameterized policy for PPO, where φ is the set of weights. The loss function in PPO is

L^{CLIP}(\phi) = \hat{\mathbb{E}}_t \left[ \min\big( r_t(\phi)\, \hat{A}_t,\ \mathrm{clip}(r_t(\phi),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right],   (8)

where ε is the clipping ratio and r_t(φ) is the ratio between the current policy π_φ(a_t | s_t) and the old policy π_{φ_old}(a_t | s_t), i.e., r_t(\phi) = \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi_{old}}(a_t \mid s_t)}. \hat{A}_t in Eq. (8) is the advantage function. In our implementation, \hat{A}_t is calculated using Generalized Advantage Estimation [15],

\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{T} (\gamma\lambda)^l\, \delta_{t+l}^V.   (9)

Here, \delta_{t+l}^V is the temporal difference (TD) residual error at time step t + l, i.e.,

\delta_{t+l}^V = -V_\nu(s_{t+l}) + r_{t+l} + \gamma V_\nu(s_{t+l+1}),   (10)

where V_ν is the state value, which is estimated by the critic network, and ν indicates the set of weights in the critic network. The objective of the critic network is to minimize the loss function L(ν), which is defined as the mean square error between the estimated state value and the expected cumulative reward, i.e.,

L(\nu) = \big( r_t + \gamma V_\nu(s_{t+1}) - V_\nu(s_t) \big)^2.   (11)

Note that λ in Eq. (9) (where 0 ≤ λ ≤ 1) is a hyperparameter that affects the bias and variance of V_ν [15].
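For concreteness, Eqs. (9)-(10) reduce to a single backward recursion, as in the Python sketch below (our own illustration, not the authors' implementation):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation of Eqs. (9)-(10) over one trajectory.

    rewards: r_0 ... r_{T-1}; values: V(s_0) ... V(s_T), including a
    bootstrap value for the final state."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual (10)
        running = delta + gamma * lam * running                 # recursion for (9)
        adv[t] = running
    return adv
```

Computing the recursion backward in time lets each advantage reuse the discounted sum already accumulated for later steps.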

3.2 Parameter Sharing


One of the major challenges of multi-agent RL is non-stationarity: when each agent trains its actor and critic networks based only on its own observations, the models may diverge. To overcome this issue, [6] proposes parameter sharing for cooperative model training in multi-agent systems. Specifically, the observations of each agent are shared with a centralized server, as shown in Fig. 2, which is used to train and update the actor and critic networks based on Eq. (12):

\phi := \phi - \beta_a \nabla_\phi L^{CLIP}(\phi), \qquad \nu := \nu - \beta_c \nabla_\nu L(\nu).   (12)

Fig. 2. The architecture of parameter sharing in multi-agent reinforcement learning.



Here, β_a and β_c are the learning rates for the actor and critic networks, respectively. The updated actor network is then forwarded to all the agents to derive their actions based on their observations.
Algorithm 1 summarizes the Proximal Policy Optimization based Advantage Actor-Critic (PPO-A2C), which is implemented in RLlib [11], an RL application programming interface (API) for training multi-agent systems using reinforcement learning.
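As a standalone illustration of the parameter-sharing update in Eq. (12) (our own PyTorch sketch, not the RLlib implementation; the network sizes, names, and observation dimension are hypothetical), a single shared actor/critic pair is updated on the pooled batch of all pursuers' transitions:

```python
import torch
import torch.nn as nn

# All pursuers share one actor and one critic; the centralized server
# updates both on the pooled batch of every agent's transitions (Fig. 2).
obs_dim, act_dim = 22, 2  # hypothetical observation/action sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=5e-5)    # beta_a
opt_critic = torch.optim.Adam(critic.parameters(), lr=5e-5)  # beta_c

def critic_update(obs, rewards, next_obs, gamma=0.99):
    """One gradient step on the pooled batch, minimizing the TD loss of Eq. (11)."""
    with torch.no_grad():
        target = rewards + gamma * critic(next_obs).squeeze(-1)
    loss = ((target - critic(obs).squeeze(-1)) ** 2).mean()
    opt_critic.zero_grad()
    loss.backward()
    opt_critic.step()
    return loss.item()
```

The actor would be updated analogously by ascending the clipped objective of Eq. (8) on the same pooled batch.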

4 Simulation Results
We conduct extensive simulations to validate the proposed PPO-A2C algorithm. The environment has an arena size of 80 m × 60 m, where four pursuers attempt to capture one evader while five NCIEs navigate the space. The capture and collision radii for pursuers/evader/NCIEs (i.e., δ_c/δ_p/δ_o) are set to 1 m. The maximum speeds of the pursuers and the evader are the same, 10 m/s, while the NCIEs move with a constant speed of 1 m/s. At each episode, the pursuers are initialized at random locations inside the circular alarmed area with a radius of 30 m but outside the circular target area with a radius of 5 m. Meanwhile, the evader is randomly initialized outside the alarmed area, and the NCIEs are initialized at random locations inside the environment with random headings θ. The length of each episode is 40 s of simulation time, and an episode ends if the evader is captured by any pursuer or if any pursuer enters the protected space. Figure 3 shows a still frame of the pursuers capturing the evader; for additional full-length videos, please visit our shared folder on Google Drive¹. The blue trace lines show the trajectories used by the pursuers to reach the evader. The hyperparameters for PPO-A2C are listed in Table 1. The simulation is conducted on a computer with a 12-core Intel Core i7 processor, 16 GiB of memory, and an Nvidia GeForce RTX 2070 graphics card, for 20 hours. Figure 4 shows the learning curves for the pursuers using the proposed PPO-A2C and for the evader based on Eq. (5), respectively. From Fig. 4 we can see that the A2C

Algorithm 1. PPO-A2C
Input: initial policy parameters φ_0, initial value function parameters ν_0
for iteration = 1, 2, · · · do
  for each agent i = 1, 2, · · · , N do
    Run the actor network to derive action a_t^i based on the observation;
    Interact with the environment to obtain r_t^i and s_{t+1}^i;
    Transmit (s_t^i, a_t^i, r_t^i, s_{t+1}^i) to the centralized server;
  end for
  Update φ and ν based on Eq. (12) via mini-batch SGD;
  Forward the updated actor network to all the agents;
end for
¹ https://drive.google.com/drive/folders/1GNMzj1GrD3Ov8oOhKUZCorw5D4Xpg2ul?usp=sharing.

models of the pursuers eventually converge. In contrast, the evader's normalized reward decreases as the pursuers learn an optimal policy.
We compare the performance of the proposed PPO-A2C to a baseline method, artificial potential function pursuers (APFP), where each pursuer uses the artificial potential field function to determine its action to capture the evader. We also apply Multi-Agent Reinforcement Learning Pursuers (MARLP), which also uses PPO-A2C, but whose pursuers neglect the existence of NCIEs and receive no penalty for pursuer-to-pursuer collisions. Note that for all the methods, the evader uses the APF (i.e., Eq. (5)) to reach the target area while trying to avoid collisions.

Table 1. Hyperparameters

Parameters        Values
γ                 0.99
λ                 0.95
Minibatch size    4096
ε                 0.2
β_a, β_c          5 × 10^{-5}

The metrics to evaluate the policy performance of PPO-A2C, APFP, and MARLP are the capture probability
of evaders, target collision rate, NCIE collision rate, and pursuer collision rate.
The capture probability of evaders is defined as the number of episodes when
the evader was captured divided by all the episodes. The target collision rate is
the number of episodes when the pursuer enters the protected space divided by
the total number of episodes. The NCIE collision rate is the number of collisions
between pursuers and NCIEs divided by the total number of time steps. Lastly,
the pursuer collision rate is defined as the number of collisions between pursuers
divided by the total number of time steps.

Fig. 3. An example showing the trajectories of the evader and pursuers.
Fig. 4. Normalized rewards for the pursuer and evader during training.

Figure 5 shows the mean capture probability of the three methods over episodes. PPO-A2C has the highest mean capture probability, which demonstrates its effectiveness. MARLP has the lowest mean capture probability because it is trained in an environment without NCIEs and is thus unable to adapt to the environment with NCIEs, leading to many collisions. Figure 6

Fig. 5. Probability of capturing the evader over 400 episodes.
Fig. 6. Probability of capturing the evader as we vary the number of pursuers.

illustrates the mean capture probability of the three methods when the number
of pursuers increases. From the figure, we can see that PPO-A2C can gener-
ate efficient policies for the pursuers to achieve the highest capture probability
when the number of pursuers is less than seven but achieves a similar capture
probability to APFP when the number of pursuers is no less than seven.
We also observe the target collision rate, pursuer collision rate, and NCIE col-
lision rate for the three methods by varying the number of pursuers. As shown in
Fig. 7a, PPO-A2C always maintains the lowest target collision rate as compared
to the other two methods. In Fig. 7b, we can find that PPO-A2C has a similar
collision rate to APFP. Interestingly, MARLP has the lowest pursuer collision
rate. Lastly, in Fig. 7c, APFP maintains the lowest NCIE collision rate even as
the number of pursuers increases, and PPO-A2C has slightly worse performance
but not as bad as MARLP. Based on the above figures, we can conclude that
PPO-A2C can achieve a comparable low collision rate to the repulsive algorithm,
but derive a better policy to capture the evaders with a higher probability than
the repulsive algorithm.

Fig. 7. Probability of different collisions. (a) Target collision rate. (b) Pursuer collision
rate. (c) NCIE collision rate.

5 Conclusion
In this paper, we proposed PPO-A2C to tackle the counter-UAS scenario. PPO-A2C has been demonstrated to outperform the traditional APFP method against a smart evader. In addition, each agent can avoid obstacles and maintain a safe distance around the protected space. One limitation of our method is the lack of cooperative behaviors among the pursuers: during the simulations, the pursuers were observed to behave greedily when catching the evader. One possible way to overcome this issue is to design reward values that foster collaboration among the pursuers; we will tackle this limitation in future research. Furthermore, we plan to improve the proposed method so that each pursuer does not need complete state information of the environment to train the models and derive its actions. Lastly, we will extend the proposed PPO-A2C method to capture multiple evaders.

Acknowledgements. This work is supported by Sandia National Laboratories, North


Atlantic Treaty Organization (NATO), and National Science Foundation under Award
CNS-2148178.
Sandia National Laboratories is a multi-mission laboratory managed and oper-
ated by National Technology and Engineering Solutions of Sandia, LLC., a wholly
owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s
National Nuclear Security Administration under contract DE-NA0003525.

References
1. Michel, A.H.: Counter-drone systems. http://dronecenter.bard.edu/counter-drone-
systems/ (2018)
2. Çetin, E., Barrado, C., Pastor, E.: Counter a drone in a complex neighborhood
area by deep reinforcement learning. Sensors 20(8), 2320 (2020)
3. Chipade, V.S., Marella, V.S.A., Panagou, D.: Aerial swarm defense by stringnet
herding: theory and experiments. Front. Robot. AI, 8, 81 (2021)
4. De Souza, C., Newbury, R., Cosgun, A., Castillo, P., Vidolov, B., Kulić, D.:
Decentralized multi-agent pursuit using deep reinforcement learning. IEEE Robot.
Autom. Lett. 6(3), 4552–4559 (2021)
5. Glotfelter, P., Egerstedt, M.: A parametric MPC approach to balancing the cost of
abstraction for differential-drive mobile robots. In:2018 IEEE International Con-
ference on Robotics and Automation (ICRA), pp. 732–737. IEEE (2018)
6. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using
deep reinforcement learning. In: Sukthankar, G., Rodriguez-Aguilar, J.A. (eds.)
AAMAS 2017. LNCS (LNAI), vol. 10642, pp. 66–83. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-71682-4_5
7. Hasan, Y.A., Garg, A., Sugaya, S., Tapia, L.: Defensive escort teams for navigation
in crowds via multi-agent deep reinforcement learning. IEEE Robot. Autom. Lett.
5(4), 5645–5652 (2020)
8. Hasan, Y.A., Garg, A., Sugaya, S., Tapia, L.: Scalable reinforcement learning policies for multi-agent control. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)

9. Hüttenrauch, M., Adrian, S., Neumann, G., et al.: Deep reinforcement learning for
swarm systems. J. Mach. Learn. Res. 20(54), 1–31 (2019)
10. Kamimura, A., Ohira, T.: Group chase and escape. New J. Phys. 12(5), 053013
(2010)
11. Liang, E., et al.: RLlib: abstractions for distributed reinforcement learning. In: International Conference on Machine Learning (ICML) (2018)
12. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv:1706.02275 (2017)
13. Oroojlooy, A., Hajinezhad, D.: A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963 (2019). https://doi.org/10.1007/s10489-022-04105-y
14. Pierson, A., Wang, Z., Schwager, M.: Intercepting rogue robots: an algorithm for
capturing multiple evaders with multiple pursuers. IEEE Robot. Autom. Lett. 2(2),
530–537 (2016)
15. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional con-
tinuous control using generalized advantage estimation. arXiv:1506.02438 (2015)
16. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
17. Shishika, D., Paulos, J., Kumar, V.: Cooperative team strategies for multi-player
perimeter-defense games. IEEE Robot. Autom. Lett. 5(2), 2738–2745 (2020)
18. Terry, J.K., Grammel, N., Hari, A., Santos, L., Black, B.: Revisiting parameter
sharing in multi-agent deep reinforcement learning (2021)
19. Wang, J., Liu, Y., Song, H.: Counter-unmanned aircraft system(s) (C-UAS): state of the art, challenges, and future trends. IEEE Aerosp. Electron. Syst. Mag. 36(3), 4–29 (2021)
20. Yu, C., Dong, Y., Li, Y., Chen, Y.: Distributed multi-agent deep reinforcement learning for cooperative multi-robot pursuit. J. Eng. 2020(13), 499–504 (2020)
21. Zhang, K., Yang, Z., Başar, T.: Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds.) Handbook of Reinforcement Learning and Control. SSDC, vol. 325, pp. 321–384. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-60990-0_12
22. Zhang, R., Zong, Q., Zhang, X., Dou, L., Tian, B.: Game of drones: multi-UAV pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. (2022)
23. Zhang, S., Liu, M., Lei, X., Yang, P., Huang, Y., Clark, R.: Synchronous intercept
strategies for a robotic defense-intrusion game with two pursuers. Auton. Robot.
45(1), 15–30 (2021)
24. Zhou, Z., et al.: Cooperative pursuit with Voronoi partitions. Automatica 72, 64–72 (2016)
Decentralized Risk-Aware Tracking
of Multiple Targets

Jiazhen Liu1(B), Lifeng Zhou1, Ragesh Ramachandran2, Gaurav S. Sukhatme2, and Vijay Kumar1
1 University of Pennsylvania, Philadelphia, PA 19104, USA
{jzliu,lfzhou,kumar}@seas.upenn.edu
2 University of Southern California, Los Angeles, CA 90007, USA
{rageshku,gaurav}@usc.edu

Abstract. We consider the setting where a team of robots is tasked with tracking multiple targets with the following property: approaching
the targets enables more accurate target position estimation, but also
increases the risk of sensor failures. Therefore, it is essential to address
the trade-off between tracking quality maximization and risk minimiza-
tion. In the previous work [1], a centralized controller is developed to plan
motions for all the robots – however, this is not a scalable approach. Here,
we present a decentralized and risk-aware multi-target tracking frame-
work, in which each robot plans its motion trading off tracking accu-
racy maximization and aversion to risk, while only relying on its own
information and information exchanged with its neighbors. We use the
control barrier function to guarantee network connectivity throughout
the tracking process. Extensive numerical experiments demonstrate that
our system can achieve similar tracking accuracy and risk-awareness to
its centralized counterpart.

Keywords: distributed robotics · multi-target tracking · risk-awareness · connectivity maintenance

1 Introduction
Tracking multiple targets using a decentralized robot team finds applications in
many settings, e.g., tracking biochemical or nuclear hazards [2] or hot-spots in a
fire [3]. In these settings, approaching the targets closely generally produces more
accurate state estimates, but this comes at a cost – the closer the team of robots
comes to the targets, the more likely they are to experience sensor failures or other
system faults due to e.g., radiation or heat emitted by the targets. In certain
cases, the targets may even be actively hostile. The challenge is to maximize
tracking accuracy while avoiding risky behavior on the part of the robot team. A
common strategy for achieving team coordination is to devise a central controller
to make team-level decisions for all robots [4]. This relies on the assumption that all robots have a sufficiently large communication range to stay connected with a central station. In real applications, robots may have a limited communication range, and a decentralized approach – where robots only communicate with nearby team members – is then a natural choice [5,6]. In a decentralized strategy, each robot makes its own decisions according to local information gathered by itself and its peers within a certain communication range. For such a decentralized strategy to achieve the same level of task execution as a centralized system, it must maintain network connectivity, so that robots can share information and make the best use of their resources.
(G.S. Sukhatme holds concurrent appointments as a Professor at USC and as an Amazon Scholar. This paper describes work performed at USC and is not associated with Amazon.)
Motivated by these issues, we present a decentralized risk-aware multi-target
tracking framework. Our contributions are as follows:
1. Our system is risk aware – it automatically trades off tracking quality and
safety;
2. Our system is fully decentralized – each robot makes decisions relying
on information sharing between neighbors within a communication range.
We apply the control barrier function (CBF) in a decentralized manner to
guarantee network connectivity maintenance and collision avoidance;
3. Our system produces comparable results to its centralized counterpart – we extensively evaluate the performance of our system through simulated experiments in the Gazebo environment under various settings, and compare it to its centralized counterpart from our previous work [1]. We present qualitative demonstrations to provide an intuitive illustration; quantitative results on metrics including tracking error and risk level are also reported.

2 Related Work
Multi-robot multi-target tracking is usually formulated as an iterative state esti-
mation problem [7–12]. Various methods have been proposed for this; most of
them design different flavors of filters to reduce the uncertainty in target pose
estimation during the tracking process [1,7,13,14]. In the more challenging sce-
nario where the targets may adversarially cause damage to the robots, safety
becomes a major concern when the robot team plans its motion. To incorporate
risk-awareness into a multi-robot system, [15] uses safety barrier certificates to
ensure safety. With a special emphasis on tracking and covering wildfire, [3]
encodes hazards from fire using an artificial potential field. Mutual information
is used in [2] to build a probabilistic model of the distribution of hazards in the
environment.
Considering that a multi-robot system controlled by a central planner does
not scale well [1], efforts have been made to decentralize the multi-target tracking
task. For example, [16] adopts the local algorithm for solving the max-min prob-
lem to solve multi-target tracking in a distributed manner. In [17], a decentralized
MPC-based strategy is proposed that generates collision-free trajectories. In [18],
the general problem of multi-robot coordination while preserving connectivity is
addressed. As it points out, a frequently adopted method to coordinate multiple robots in order to achieve a common goal is to use a consensus-based estimation algorithm.
To enable multi-robot multi-target tracking in a decentralized framework, it
is typically required that the connectivity of the underlying network graph is
maintained. To do this, [19] formulates the connected region as a safety set and
uses a CBF to render it forward invariant such that the network will always
remain connected as long as the robots are initialized properly. For a systematic
elaboration of the theory of CBF and its relationship with Lyapunov control,
see [20]. In [21], a discussion of using graph spectral properties such as the
algebraic connectivity to ensure that the network is connected is given. In [18],
the connectivity maintenance problem is approached by assigning weights to the
edges and designing control strategies accordingly.

3 Problem Formulation
3.1 Notation
We use capital letters in boldface (e.g., $\mathbf{A}$) to denote matrices, lowercase letters in boldface (e.g., $\mathbf{x}$) to denote vectors, and lowercase letters in regular font (e.g., $q$) to denote scalars. $\mathbf{1}_{M\times N}$ denotes an all-one matrix of shape $M \times N$. $\mathbf{I}_M$ is the identity matrix of shape $M \times M$. $\mathbf{M}_1 \otimes \mathbf{M}_2$ gives the Kronecker product of two matrices $\mathbf{M}_1$ and $\mathbf{M}_2$. $[\mathbf{v}_1; \mathbf{v}_2; \cdots; \mathbf{v}_n]$ represents vertical vector concatenation; the vertical concatenation of matrices or scalars is defined in the same fashion. $\mathrm{Diag}([\mathbf{M}_1, \mathbf{M}_2, \cdots, \mathbf{M}_n])$ represents the diagonal matrix with constituents $\mathbf{M}_1, \mathbf{M}_2, \cdots, \mathbf{M}_n$ placed along the diagonal. $\mathrm{Tr}(\mathbf{M})$ is the trace of matrix $\mathbf{M}$. We use $\|\cdot\|$ to denote the 2-norm of vectors. For a set $S$, $|S|$ returns its cardinality. $[z]$ denotes the set $\{1, 2, \cdots, z\}$ for an arbitrary integer $z \in \mathbb{Z}^+$. $\mathcal{N}(0, \mathbf{Q})$ represents the Gaussian distribution with zero mean and covariance matrix $\mathbf{Q}$.

3.2 Robot and Target Modeling


The multi-target tracking task is assigned to a team of $N$ robots, indexed by $R := [N]$. For each robot $i \in [N]$ with state $x_i \in X \subseteq \mathbb{R}^p$ and control input $u_i \in U \subseteq \mathbb{R}^q$, its motion is modeled by the control-affine dynamics:

$$\dot{x}_i = f(x_i) + g(x_i)u_i \quad (1)$$

where $f: \mathbb{R}^p \to \mathbb{R}^p$ and $g: \mathbb{R}^p \to \mathbb{R}^{p \times q}$ are continuously differentiable vector fields. The robot team is equipped with a set of sensors that are heterogeneous in terms of their sensing ability. Suppose there are in total $U$ different sensors, indexed by $[U]$. We encode the measurement characteristic of sensor $k \in [U]$ using a vector $h_k$. At time step $t$, we denote the sensor status, i.e., whether each sensor is functioning or damaged, for robot $i$ as:

$$\Gamma_k^{i,t} = \begin{cases} 1, & \text{if sensor } k \text{ on robot } i \text{ is functioning well at time } t;\\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Based on this, $\Gamma_t$ denotes the sensor status of the whole robot team at time $t$. The set of sensors available at time $t$ for robot $i$ is denoted as $\gamma_{i,t} = \{k \in [U] \mid \Gamma_k^{i,t} = 1\}$. Combining the sensor status with the measurement characteristic of each sensor, the measurement matrix for robot $i$ taking measurements of target $j$ is $H_{ij,t} = [h_{\gamma_{i,t}(1)}; h_{\gamma_{i,t}(2)}; \cdots; h_{\gamma_{i,t}(|\gamma_{i,t}|)}]$, and the measurement matrix for robot $i$ measuring all $M$ targets is $H_{i,t} = I_M \otimes H_{ij,t}$.
Without loss of generality, we assume that all the robots have a homogeneous maximum communication range $R_{comm}$, as heterogeneity in the robots' communication ability is beyond the scope of this paper and is left for future exploration. For robot $i$, we define the region within which other robots can exchange information with it as $\chi_i := \{x \in X \mid \|x - x_i\| \le R_{comm}\}$, and call the robots within $\chi_i$ robot $i$'s neighbors. A robot is only able to communicate with its neighbors. The index set of robot $i$'s neighbors is defined as $N_i := \{j \in [N], j \ne i \mid x_j \in \chi_i\}$. Note that $N_i$ is time-varying for each robot $i \in R$, as the robots actively reconfigure their positions to track the targets.
Let the $M$ targets be indexed by $T := [M]$. Denote the true state of target $j$ at time step $t$ as $z_{j,t} \in X \subseteq \mathbb{R}^p$ and its control input as $v_{j,t} \in V \subseteq \mathbb{R}^q$. The dynamics of the targets follow:

$$z_{t+1} = A z_t + B v_t + w_t \quad (3)$$

where $A$ and $B$ are the process and control matrices, respectively. $z_t$ and $v_t$ are obtained by stacking $z_{j,t}$ and $v_{j,t}$, $\forall j \in T$, vertically. $w_t \sim \mathcal{N}(0, Q)$ is the Gaussian process noise with covariance matrix $Q$. $A$, $B$ and $Q$ are assumed to be known to the robots. Robot $i$'s measurement of the targets with noise $\zeta_{i,t} = [\zeta_{i1,t}; \zeta_{i2,t}; \cdots; \zeta_{iM,t}]$ is modeled by:

$$y_{i,t} = H_{i,t} z_t + \zeta_{i,t}. \quad (4)$$

We let $\zeta_{ij,t} \sim \mathcal{N}(0, R_{ij,t})$, $\forall j \in T$, where the measurement noise covariance matrix $R_{ij,t}$ grows exponentially as the distance between robot $i$ and target $j$ increases:

$$R_{ij,t}^{-1} = \mathrm{Diag}([R_{ij1,t}^{-1}, R_{ij2,t}^{-1}, \cdots, R_{ij|\gamma_{i,t}|,t}^{-1}]) \quad (5)$$

with $R_{ijk,t}^{-1}$, relying on a particular sensor $k \in \gamma_{i,t}$, being:

$$R_{ijk,t}^{-1} = w_k \exp(-\lambda_k \|x_{i,t} - z_{j,t}\|) \quad (6)$$

where $w_k$ and $\lambda_k$ are parameters characterizing the noise of sensor $k \in \gamma_{i,t}$. The target risk field $\phi$, which encodes the probability that a robot experiences sensor failures induced by the targets, inherits the same definition as our previous work [1, Sect. III-E, V]. In particular, the risk from target $j$ acting on all robots is:

$$\phi_j(x) = \frac{c_j}{2\pi|\Sigma_j|} \exp\!\left(-\frac{1}{2}(x - z_j)^T \Sigma_j (x - z_j)\right) \quad (7)$$

where $c_j$ and $\Sigma_j$ denote the peak value and covariance of the risk field of target $j$, respectively. Correspondingly, the safety field is defined as $\pi_j(x) = 1 - \phi_j(x)$, $\forall j \in T$.
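The distance dependence in Eqs. (5)–(7) is straightforward to reproduce numerically. Below is a minimal sketch (ours, not the authors' implementation) of the per-sensor inverse noise weight and the risk/safety fields; all parameter values are illustrative only.

```python
import numpy as np

def inv_noise(w_k, lam_k, x_i, z_j):
    """Per-sensor inverse noise variance, Eq. (6): measurement accuracy
    decays exponentially as robot i moves away from target j."""
    return w_k * np.exp(-lam_k * np.linalg.norm(x_i - z_j))

def risk(x, z_j, c_j, Sigma_j):
    """Risk field phi_j of target j at position x, Eq. (7)."""
    d = x - z_j
    return c_j / (2.0 * np.pi * np.linalg.det(Sigma_j)) * np.exp(-0.5 * d @ Sigma_j @ d)

# Illustrative values: one sensor, one target at the origin.
x_i, z_j = np.array([2.0, 0.0]), np.zeros(2)
print(inv_noise(w_k=1.0, lam_k=0.5, x_i=x_i, z_j=z_j))   # shrinks with distance
print(1.0 - risk(x_i, z_j, c_j=1.0, Sigma_j=np.eye(2)))  # safety field pi_j = 1 - phi_j
```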

3.3 Problem Definition

We focus on planning the motion of each robot in the team to obtain an accurate
estimation of the targets’ positions while at the same time avoiding risk, in
a decentralized manner at each time step t throughout the tracking process.
Specifically, for each robot i ∈ R, the goal is to solve the following optimization
program:
$$\begin{aligned} \min_{x_i, \delta_{i1}, \delta_{i2}} \quad & q_1 \eta_i \|\delta_{i1}\|^2 + q_2 \frac{1}{1+\eta_i} \delta_{i2}^2 & \text{(8a)}\\ \text{s.t.} \quad & \mathrm{Tr}(P_{ij}) \le \rho_{i1,j} + \delta_{i1,j}, \ \delta_{i1,j} \ge 0, \ \forall j \in T; & \text{(8b)}\\ & \mathrm{Tr}(O_{i,\Pi}^{-1}) \le \rho_{i2} + \delta_{i2}, \ \delta_{i2} \ge 0; & \text{(8c)}\\ & |N_i| > 0; & \text{(8d)}\\ & \|x_i - \bar{x}_i\| \le d_{max}; & \text{(8e)}\\ & \|x_i - x_l\| \ge d_{min}, \ \forall l \in R, \ l \ne i. & \text{(8f)} \end{aligned}$$

The time index $t$ is dropped here for brevity of notation. $x_i$ is the optimization variable representing robot $i$'s position at the next step, while $\bar{x}_i$ refers to robot $i$'s current position. The cost function Eq. (8a) balances the competing objectives of high tracking quality (first term) and safety (second term), using the sensor margin $\eta_i$, which encodes the abundance of sensors available to robot $i$. The larger the value of $\eta_i$, the more powerful the sensor suite on robot $i$ is for observing the targets, and therefore the more the first term in the objective is stressed. On the contrary, when the sensors on robot $i$ are limited, $\eta_i$ becomes small, stressing the second term in Eq. (8a) and prioritizing safety. $q_1, q_2$ are constants weighing the two objectives. The tracking error is represented using the trace of the target estimation covariance matrix $P_{ij}$ and is upper bounded by the threshold $\rho_{i1,j}, j \in T$ to ensure tracking accuracy, as in Eq. (8b). Meanwhile, the risk of sensor failure is encoded by $\mathrm{Tr}(O_{i,\Pi}^{-1})$, where $O_{i,\Pi}$ is the risk-aware observability Gramian for robot $i$, defined similarly to the ensemble form in [1, Eq. (14)]. The risk level is upper bounded by a pre-specified threshold $\rho_{i2}$ to discourage highly risky behavior, as in Eq. (8c). $\delta_{i1,j}, \forall j \in T$ and $\delta_{i2}$ are slack variables, which enable the “bend but not break” behavior when sensor failures render it impossible to meet the strict constraints of $\rho_{i1,j}, \forall j \in T$ and $\rho_{i2}$. See [1, Sect. IV-B, C, D] for more details on the design of the sensor margin and observability Gramian. Moreover, according to Eq. (8d), each robot should have at least one neighbor, so that the communication network formed by the robots remains connected. Note that the connectivity relationship is determined using $x_i$ rather than $\bar{x}_i$, to ensure that the robots are still connected at the next time step. Constraint Eq. (8e) specifies the maximum distance $d_{max}$ that robot $i$ can travel between two consecutive steps. Finally, Eq. (8f) is the collision avoidance constraint that guarantees the minimum safety distance $d_{min}$ between two robots. Notice that the optimization program in Eq. (8) is a non-convex problem that is computationally difficult to solve. We discuss an approach that solves it in two stages in the next section.

4 Approach
In the proposed decentralized system, every robot relies on four major modules
to perform the target tracking task, as illustrated in Fig. 1. At each time step
t, every robot first generates its ideal goal state that it wants to reach without
considering constraints including network connectivity and collision avoidance.
Its decisions are only based on trading off tracking quality and risk aversion. This
is performed by the Ideal Action Generation module, as introduced in Sect. 4.1.
Sect. 4.2 corresponds to the Connectivity and Collision Avoidance module, which
employs CBF to ensure network connectivity and to avoid inter-robot collisions
when the robots move towards their respective goal positions. After each robot
moves to its new position, the Decentralized Estimation module generates a new estimate of the target positions and maintains common knowledge of this estimate among neighbors using the Kalman-Consensus Filtering algorithm in [22], which we introduce in Sect. 4.3. Note that the Local Information Exchange module is responsible for information sharing within the neighborhood, and the other three modules all rely on it.

Fig. 1. The proposed decentralized framework. Each robot performs the decentralized
target tracking task using 4 modules: Local Information Exchange, Decentralized Ideal
Action Generation, Decentralized Connectivity and Collision Avoidance, and Decen-
tralized Estimation. The Local Information Exchange module takes charge of all com-
munications between neighbors.

4.1 Decentralized Goal Position Generation

In this section, we focus on computing the “ideal” positions that robots want to
reach at every time step, only out of the desire for higher tracking accuracy and
aversion to overly risky behavior, without considering the connectivity mainte-
nance and collision avoidance constraints. To this end, each robot i ∈ R solves a
non-convex optimization program as follows:

$$\begin{aligned} \min_{x_i, \delta_{i1}, \delta_{i2}} \quad & q_1 \eta_i \|\delta_{i1}\|^2 + q_2 \frac{1}{1+\eta_i} \delta_{i2}^2 & \text{(9a)}\\ \text{s.t.} \quad & \mathrm{Tr}(P_{ij}) \le \rho_{i1,j} + \delta_{i1,j}, \ \delta_{i1,j} \ge 0, \ \forall j \in T; & \text{(9b)}\\ & \mathrm{Tr}(O_{i,\Pi}^{-1}) \le \rho_{i2} + \delta_{i2}, \ \delta_{i2} \ge 0. & \text{(9c)} \end{aligned}$$
This is a less constrained version of the optimization program in Eq. (8). Note that to achieve team-level coordination, the computation of $P_{ij}$ and $O_{i,\Pi}^{-1}$ requires not only robot $i$'s information, but also that of its neighbors, acquired through the Local Information Exchange module. Since each robot runs its own optimization program, all robots in the team solve the optimization program in Eq. (9) over multiple rounds, until their ideal goal positions converge.
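As a concrete illustration, the ideal-goal step can be prototyped with an off-the-shelf nonlinear solver. The sketch below solves program (9) for one robot using scipy's SLSQP; `trace_P` and `trace_O_inv` are hypothetical toy surrogates for $\mathrm{Tr}(P_{ij})$ and $\mathrm{Tr}(O_{i,\Pi}^{-1})$, which in the actual system come from the Kalman covariance and the risk-aware observability Gramian, computed from the robot's own and its neighbors' information.

```python
import numpy as np
from scipy.optimize import minimize

# Toy surrogates (not the paper's models): both grow smoothly with distance.
targets = [np.array([5.0, 0.0]), np.array([0.0, 5.0])]
def trace_P(x, j):   return 0.001 * (1.0 + np.linalg.norm(x - targets[j]))
def trace_O_inv(x):  return 0.05 * (1.0 + np.linalg.norm(x))

def ideal_goal(x0, eta, rho1, rho2, q1=3.0, q2=20.0):
    M = len(rho1)
    def cost(v):                        # v = [x (2), delta1 (M), delta2 (1)]
        d1, d2 = v[2:2 + M], v[-1]
        return q1 * eta * d1 @ d1 + q2 / (1.0 + eta) * d2**2   # Eq. (9a)
    cons = [{'type': 'ineq',            # Eq. (9b): one constraint per target
             'fun': (lambda v, j=j: rho1[j] + v[2 + j] - trace_P(v[:2], j))}
            for j in range(M)]
    cons.append({'type': 'ineq',        # Eq. (9c): risk bound with slack
                 'fun': lambda v: rho2 + v[-1] - trace_O_inv(v[:2])})
    bounds = [(None, None)] * 2 + [(0.0, None)] * (M + 1)      # slacks >= 0
    v0 = np.concatenate([x0, np.zeros(M + 1)])
    res = minimize(cost, v0, method='SLSQP', constraints=cons, bounds=bounds)
    return res.x[:2]                    # robot i's ideal next position

print(ideal_goal(np.zeros(2), eta=1.0, rho1=[0.001, 0.001], rho2=0.15))
```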

4.2 Decentralized Motion Synthesis


Given the generated ideal goal positions, we aim to steer the robots towards their goal positions while preserving the network connectivity and the inter-robot safety distance. We achieve this by using control barrier functions.

We consider the robots with control-affine dynamics $\dot{x} = f(x) + g(x)u$ as defined in Eq. (1), where $f(x), g(x)$ are locally Lipschitz functions. We let $C$ be the region within which we want the robot team to stay. Concretely, we define $C$ as the superlevel set of a continuously differentiable function $h(X)$: $C = \{X \in \mathbb{R}^p \mid h(X) \ge 0\}$. From [20], we know that if we can select a feasible $u \in \mathbb{R}^q$ such that

$$L_f h(X) + L_g h(X) u + \alpha(h(X)) \ge 0 \quad (10)$$

is satisfied, we can render $C$ forward invariant and ensure that if the robots initially start from an arbitrary state inside $C$, they will always remain within $C$. Note that $L_f$ and $L_g$ represent the Lie derivatives of $h(\cdot)$ along the vector fields $f$ and $g$, respectively. $\alpha(\cdot)$ is an extended class $\mathcal{K}$ function as introduced in [19].
Back to our case, to ensure network connectivity, we define $h(\cdot)$ as:

$$h(x) = \lambda_2 - \epsilon \quad (11)$$

where $\lambda_2$ is the algebraic connectivity, i.e., the second smallest eigenvalue of the weighted Laplacian matrix of the robots' communication graph, and $\epsilon$ is the connectivity threshold adjusting the extent to which we want to enforce the connectivity constraint. A larger $\epsilon$ enforces network connectivity more strictly. The weighted Laplacian matrix is $L_t = D_t - A_t$, where $D_t$ and $A_t$ represent the weighted degree matrix and weighted adjacency matrix, respectively. Specifically, we adopt the same definition from [19, Eq. (5)] to define the adjacency $a_{il,t}$ between robots $i$ and $l$ at time $t$, using the distance $d_{il,t}$ between them. That is,

$$a_{il,t} = \begin{cases} e^{(R_{comm}^2 - d_{il,t}^2)/\sigma^2} - 1, & \text{if } d_{il,t} \le R_{comm};\\ 0, & \text{otherwise,} \end{cases} \quad (12)$$

where $\sigma \in \mathbb{R}$ is a normalization constant. Based on these, we formulate the connectivity maintenance constraint using the CBF in Eq. (13c) and similarly formulate the collision avoidance constraint in Eq. (13d). With these constraints,
each robot i ∈ R is steered towards its desired position by solving the following
quadratically constrained quadratic program:
$$\begin{aligned} \min_{u_i} \quad & \frac{1}{2}\|u_i - u_{des,i}\|^2 & \text{(13a)}\\ \text{s.t.} \quad & \|u_i\| \, \Delta t \le d_{max}; & \text{(13b)}\\ & \frac{\partial \lambda_2}{\partial \bar{x}_i} u_i + \lambda_2 - \epsilon \ge 0; & \text{(13c)}\\ & \frac{\partial d_{il}^2}{\partial \bar{x}_i} u_i + d_{il}^2 - d_{min}^2 \ge 0, \ \forall l \in N_i, & \text{(13d)} \end{aligned}$$
where $u_{des,i}$ denotes the ideal control input, which can be computed easily from the ideal goal position generated in Sect. 4.1. $\Delta t$ is the time interval between two consecutive steps. $\bar{x}_i$ is robot $i$'s state at the current step. $d_{il}$ is the distance between robots $i$ and $l$, calculated using their current states $\bar{x}_i$ and $\bar{x}_l$. The cost function in Eq. (13a) minimizes the discrepancy between robot $i$'s actual control input and its ideal one. Constraint Eq. (13b) specifies the maximum distance $d_{max}$ that each robot can move between consecutive steps. Moreover, Eqs. (13c) and (13d) are the connectivity maintenance and collision avoidance constraints, respectively, as introduced above.
In the decentralized system, every robot solves the optimization program
in Eq. (13) to generate its actual control input. Similar to Sect. 4.1, we adopt
an iterative procedure where all the robots exchange information and solve the
optimization program in Eq. (13) repeatedly, until their solutions $u_i, \forall i \in R$ converge.
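For concreteness, the per-robot program in Eq. (13) can be posed directly as a small convex program. The sketch below (our illustration, not the authors' code) uses cvxpy and assumes single-integrator dynamics, under which the Lie-derivative terms reduce to the gradient expressions shown in (13c)–(13d).

```python
import cvxpy as cp
import numpy as np

def safe_input(u_des, x_i, nbr_pos, lam2, dlam_dx, eps, d_min, d_max, dt):
    """One robot's QP (Eq. (13)): stay close to the ideal input u_des while
    honoring the connectivity CBF (13c) and collision CBFs (13d)."""
    u = cp.Variable(2)
    cons = [cp.norm(u, 2) * dt <= d_max,          # (13b): step-length limit
            dlam_dx @ u + (lam2 - eps) >= 0]      # (13c): keep lambda_2 >= eps
    for x_l in nbr_pos:                           # (13d): one per neighbor
        diff = x_i - x_l                          # d(d_il^2)/dx_i = 2 (x_i - x_l)
        cons.append(2 * diff @ u + (diff @ diff - d_min**2) >= 0)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(u - u_des)), cons).solve()
    return u.value

# Illustrative call with made-up local quantities.
u = safe_input(u_des=np.array([1.0, 0.0]), x_i=np.zeros(2),
               nbr_pos=[np.array([0.6, 0.0])], lam2=0.4,
               dlam_dx=np.array([0.1, 0.0]), eps=0.25,
               d_min=0.5, d_max=0.3, dt=0.1)
print(u)
```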
From the computational perspective, each robot $i \in R$ needs to know $\lambda_2$ and $\partial \lambda_2 / \partial \bar{x}_i$ to evaluate Eq. (13c). In a centralized system, a central controller has access to the information of every robot and can directly compute $\lambda_2$ and $\partial \lambda_2 / \partial \bar{x}_i$. In the decentralized system, however, the robots need to compute them using local information. Inspired by [23, Sect. III-IV], we utilize the Decentralized Power Iteration (PI) algorithm, which enables each robot to compute a local approximation of $\lambda_2$ and $\partial \lambda_2 / \partial \bar{x}_i$. Such approximation is achieved through repeated local information exchange. Specifically, in each round of the information exchange, the message passed between neighbors is:
$$E_i = \{\nu_i, y_{i,1}, w_{i,1}, y_{i,2}, w_{i,2}\} \quad (14)$$

where $\nu_i$ approximates the $i$-th element of the eigenvector $\nu$ corresponding to $\lambda_2$ of the weighted Laplacian matrix $L$. $y_{i,1}$ is robot $i$'s estimate of $\mathrm{Ave}(\{\nu_i\}), \forall i \in R$, i.e., the average of all entries in $\nu$. Similarly, $y_{i,2}$ is robot $i$'s estimate of $\mathrm{Ave}(\{\nu_i^2\}), \forall i \in R$, i.e., the average of the squares of all entries in $\nu$. $w_{i,1}$ and $w_{i,2}$ are variables introduced for computation purposes.
[23] shows that the decentralized PI algorithm converges in a few iterations. Once it converges, every robot computes its local estimate of the algebraic connectivity $\lambda_2$ as:

$$\lambda_2^i = -\frac{\sum_{l \in N_i} L_{il} \nu_l}{\nu_i} \quad (15)$$

Additionally, from [19, Eq. (8)], $\partial \lambda_2 / \partial \bar{x}_i$ is computed as:

$$\frac{\partial \lambda_2}{\partial \bar{x}_i} = \sum_{l \in N_i} (\nu_i - \nu_l)^2 \frac{\partial a_{il}}{\partial \bar{x}_i} \quad (16)$$

where $a_{il}$ is the weighted adjacency introduced in Eq. (12). In this way, $\partial \lambda_2 / \partial \bar{x}_i$ is computed locally using the values of $\nu_i, \nu_l$ obtained from the decentralized PI algorithm. Substituting the approximations of $\lambda_2$ and $\partial \lambda_2 / \partial \bar{x}_i$ into the optimization program in Eq. (13), each robot $i \in R$ can solve Eq. (13) to obtain its actual control input.
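The local quantities in Eqs. (12), (15), and (16) are simple to evaluate once robot $i$ holds its own Fiedler-vector entry $\nu_i$ and the entries $\nu_l$ received from its neighbors; the sketch below (function names are ours) assumes these come from an already-converged run of the decentralized PI algorithm.

```python
import numpy as np

def adjacency(x_i, x_l, R_comm, sigma):
    """Edge weight a_il of Eq. (12)."""
    d2 = float(np.sum((x_i - x_l)**2))
    return np.exp((R_comm**2 - d2) / sigma**2) - 1.0 if d2 <= R_comm**2 else 0.0

def local_lambda2(nu_i, nu_nbrs, a_nbrs):
    """Local estimate of lambda_2, Eq. (15), using L_il = -a_il for l != i,
    so -sum_l L_il nu_l = sum_l a_il nu_l."""
    return sum(a * nu for a, nu in zip(a_nbrs, nu_nbrs)) / nu_i

def grad_lambda2(x_i, nu_i, x_nbrs, nu_nbrs, R_comm, sigma):
    """Gradient of lambda_2 w.r.t. robot i's position, Eq. (16), where
    da_il/dx_i = -(2/sigma^2)(x_i - x_l) exp((R^2 - d^2)/sigma^2)."""
    g = np.zeros_like(x_i)
    for x_l, nu_l in zip(x_nbrs, nu_nbrs):
        d2 = float(np.sum((x_i - x_l)**2))
        if d2 <= R_comm**2:
            da = -(2.0 / sigma**2) * (x_i - x_l) * np.exp((R_comm**2 - d2) / sigma**2)
            g += (nu_i - nu_l)**2 * da
    return g
```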

4.3 Decentralized Target Position Estimation

Whenever the robot team reaches a new state, each robot obtains new measurements of the targets and runs a Kalman Filter (KF) independently to generate new estimates of all targets' positions. Specifically, each robot $i \in R$ computes the a priori estimate $\hat{z}_{i,t|t-1}$ and covariance matrix $P_{i,t|t-1}$ during the prediction step as:

$$\hat{z}_{i,t|t-1} = A \hat{z}_{i,t-1|t-1}, \quad P_{i,t|t-1} = A P_{i,t-1|t-1} A^\top + Q, \quad (17)$$

where $\hat{z}_{i,t-1|t-1}$ and $P_{i,t-1|t-1}$ represent robot $i$'s estimate of the targets' positions and the corresponding covariance matrix at the previous time step, respectively. $A$ is the process matrix for the targets, and $Q$ is the targets' process noise covariance matrix, both introduced in Sect. 3.2.

After robot $i \in R$ obtains a new measurement $y_{i,t}$ of the targets, the measurement residual $\tilde{y}_{i,t}$ and Kalman gain $K_{i,t}$ are calculated as:

$$\tilde{y}_{i,t} = y_{i,t} - H_{i,t} \hat{z}_{i,t|t-1}, \quad K_{i,t} = P_{i,t|t-1} H_{i,t}^\top S_{i,t}^{-1}, \quad (18)$$

where $S_{i,t} = H_{i,t} P_{i,t|t-1} H_{i,t}^\top + R_{i,t}$, with $R_{i,t}$ being the measurement noise covariance matrix introduced in Sect. 3.2. The new measurement is incorporated into robot $i$'s estimate during the update step of the KF:

$$\hat{z}_{i,t} = \hat{z}_{i,t|t-1} + K_{i,t} \tilde{y}_{i,t}, \quad P_{i,t} = P_{i,t|t-1} - K_{i,t} S_{i,t} K_{i,t}^\top. \quad (19)$$

Since each robot runs its own KF, their estimates may differ. To maintain a common estimate of the targets' positions, we utilize the Kalman-Consensus Filtering algorithm [22, Table I]. This is an iterative approach for the robots to reach an agreement on their estimates of the targets' positions. Each robot first exchanges the information matrix $\Omega_i^t(0) = (P_{i,t})^{-1}$ and information vector $q_i^t(0) = (P_{i,t})^{-1} \hat{z}_{i,t}$ with its neighbors. Then, each robot $i \in R$ performs multiple rounds of the following convex combination:

$$\Omega_i^t(\tau+1) = \sum_{l \in N_i} \tilde{\pi}_{i,l} \, \Omega_l^t(\tau), \quad q_i^t(\tau+1) = \sum_{l \in N_i} \tilde{\pi}_{i,l} \, q_l^t(\tau), \quad (20)$$

until the values of $\Omega_i^t$ and $q_i^t$ converge. Note that $\tilde{\pi}_{i,l}$ is the weight between robots $i$ and $l$, assigned using the method in [24]. Each robot then uses $\Omega_i^t$ and $q_i^t$ from the last round of the consensus algorithm to compute the updated estimates of the target positions and the covariance matrix. Based on [24], it is guaranteed that once the consensus algorithm converges, each robot has the same estimates of the targets' positions as its neighbors.
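Eqs. (17)–(20) translate almost line-for-line into code. The following is a minimal sketch (ours), assuming the consensus weights $\tilde{\pi}_{i,l}$ are given, e.g., computed by the rule of [24]:

```python
import numpy as np

def kf_predict(z_hat, P, A, Q):
    """Prediction step, Eq. (17)."""
    return A @ z_hat, A @ P @ A.T + Q

def kf_update(z_hat, P, y, H, R):
    """Residual, gain, and update, Eqs. (18)-(19)."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return z_hat + K @ (y - H @ z_hat), P - K @ S @ K.T

def consensus_round(Omega, q, pi, nbrs):
    """One round of Eq. (20): each robot i replaces its information matrix
    and vector by a convex combination over its neighborhood nbrs[i] (which
    we assume includes i itself so the combination stays convex)."""
    new_Omega = [sum(pi[i][l] * Omega[l] for l in nbrs[i]) for i in range(len(Omega))]
    new_q = [sum(pi[i][l] * q[l] for l in nbrs[i]) for i in range(len(q))]
    return new_Omega, new_q

# After convergence, robot i recovers its estimate from the information form:
# z_hat_i = inv(Omega_i) @ q_i, and P_i = inv(Omega_i).
```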

5 Experiments
Through simulated experiments in the Gazebo environment, we show that the
proposed system can track multiple targets in a decentralized, risk-aware, and
adaptive fashion, while achieving performance comparable to its centralized
counterpart.

5.1 Decentralized Tracking

We demonstrate that our decentralized target tracking system is:

1. Able to track multiple targets simultaneously, with a reasonable division of labor among the robots;
2. Risk-aware and adaptive to the sensor configuration;
3. Able to remain connected.

At the start of the target tracking process, each robot is equipped with three heterogeneous sensors with the following measurement vectors:

$$h_1 = [1 \ \ 0], \quad h_2 = [0 \ \ 1], \quad h_3 = [1 \ \ 1] \quad (21)$$

i.e., the first type of sensor has sensing ability in the $x$ direction, the second type has sensing ability in the $y$ direction, and the third type has both. Sensor failures are simulated randomly by flipping a biased coin, where the probability of failure is proportional to the total risk at each robot's position induced by all the targets. Each robot's random sensor failure is generated independently. For an illustration of the system performance, a demo of the Gazebo simulations, in which five robots track three targets, is available online1.
Here, we present qualitative results for the setting where five robots track four targets. The initial sensor status as described in Eq. (2) is $\Gamma_0 = \mathbf{1}_{5\times 3}$. For each robot $i \in \{1, 2, \cdots, 5\}$, we let the parameters trading off tracking quality maximization against risk minimization in Eq. (9a) be $q_1 = 3.0, q_2 = 20.0$. The maximum tracking error in Eq. (9b) is chosen as $\rho_{i1} = [0.001; 0.001; 0.001; 0.001]$, and the maximum risk in Eq. (9c) is $\rho_{i2} = 0.15$. The connectivity threshold in Eq. (13c) is chosen to be $\epsilon = 0.25$.
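For reference, the parameters of this run can be collected in one place; the values are exactly those stated above, while the field names are our own shorthand:

```python
import numpy as np

config = dict(
    n_robots=5, n_targets=4,
    Gamma0=np.ones((5, 3)),   # all three sensors initially working, Eq. (2)
    q1=3.0, q2=20.0,          # tracking-vs-safety weights in Eq. (9a)
    rho1=[0.001] * 4,         # per-target tracking error bound, Eq. (9b)
    rho2=0.15,                # risk bound, Eq. (9c)
    eps=0.25,                 # connectivity threshold in Eq. (13c)
)
```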
Figure 2 compares the behavior of robots under three different settings: risk-
agnostic with communication range Rcomm = 11.0 (Fig. 2a), risk-aware with

1 https://www.youtube.com/watch?v=AAOKDGqkuz8.

Rcomm = 11.0 (Fig. 2b), and risk-aware with Rcomm = 7.0 (Fig. 2c). To simu-
late the risk-agnostic case, we remove constraint Eq. (9c) and the second term
in the cost function Eq. (9a) for all robots. All robots are initially placed around
the origin, and the targets move in concentric circles. In all three cases, the
robots move closer to the targets so that they can obtain more accurate estimates of the target positions. Meanwhile, the five robots spread out to track different targets.
Figure 2a shows that without risk awareness, the robots directly approach the
targets to get better estimates of the targets’ positions. In contrast, the robots
keep a certain distance from the targets if the risk-aware constraint is added,
as shown in Fig. 2b. Such risk-aware behavior helps the robot team preserve its
sensor suite for a longer time. Figure 2c shows the behavior of the robot team
when the communication range is relatively small. Compared to Fig. 2b, the
robot team is more sparsely connected in this case.

Fig. 2. Comparison of the target tracking behaviors under three different settings. Grey circles represent the robots. True target positions and their trajectories over the latest 20 steps are drawn with red squares and red dotted lines; correspondingly, blue squares and blue dotted lines represent the estimated target positions and trajectories. The purple-shaded regions around the targets indicate the distance-dependent risk field. A green edge denotes that the robots at its two endpoints are connected. (Color figure online)

We further present quantitative results to show the effectiveness of risk awareness in the proposed decentralized system. We utilize two metrics, $\mathrm{Tr}(P)$ and $\eta$, which are the trace of the target state estimation covariance matrix and the sensor margin, respectively, both averaged across all robots. For both the risk-aware and risk-agnostic cases, we run five trials and generate sensor failures randomly. Figure 3 shows the comparison for a 3-vs-3 case with $\rho_{i1} = [0.01; 0.01; 0.01]$, $\rho_{i2} = 0.33$, $q_1 = 1.0$, $q_2 = 10.0$, $\forall i \in \{1, 2, 3\}$ in program (9), and $\epsilon = 0.25$ in Eq. (13c). The initial sensor configuration is $\Gamma_0 = \mathbf{1}_{3\times 3}$. The communication range is $R_{comm} = 18.0$ so that the robots can easily maintain connectivity. It can be observed that if we remove the risk-aware constraint Eq. (9c), the sensor margin of the system decreases dramatically because of more frequent sensor failures. As a result, the tracking error surges sharply as the increasing number of failures quickly renders the system non-observable.

To demonstrate the system's ability to trade off between tracking quality and risk aversion, we compare $\mathrm{Tr}(P)$ and $\mathrm{Tr}(O_{\Pi}^{-1})$ with respect to the sensor margin $\eta$. The result is presented in Fig. 4. Note that, different from the above setting where the sensor margin is computed based on the available suite of sensors at each time step, here $\eta$ is specified as a sequence of fixed values to observe the system's behavior as the abundance of sensors varies. As expected, when the sensor margin $\eta$ is large, there are abundant sensors in the team and the system prioritizes tracking accuracy; conversely, as $\eta$ decreases, the system settles for a relatively higher tracking error to preserve its sensor suite.

5.2 Comparison to Centralized Target Tracking

We compare the proposed decentralized target tracking system with its central-
ized counterpart in terms of tracking error and sensor margin. The centralized
tracking system used for comparison is from our previous work [1]. For a fair
comparison, we use the same CBFs (Eq. (13c) and (13d)) for network connec-
tivity maintenance and collision avoidance.

Fig. 3. Comparison of the tracking error (Fig. 3a) and sensor margin (Fig. 3b) when the risk-aware constraint is vs. is not added to our decentralized target tracking framework, with Rcomm = 18.0.

Fig. 4. The trade-off between tracking error and risk level with respect to the sensor
margin. When η is small, the system prioritizes safety; as η increases, the system
prioritizes higher tracking accuracy.

Fig. 5. Comparison of the centralized and decentralized systems under the same set-
ting. We enforce the same sequence of sensor failures on both of them as denoted by the
grey crosses. For the decentralized system, the Tr(P) and η shown are both obtained
from averaging across all robots.

We consider the scenario where 4 robots are tasked to track 4 targets. In both the centralized and decentralized systems, we let $R_{comm} = 18.0$, $q_1 = 1.0$, $q_2 = 10.0$ in Eq. (9a), and the connectivity threshold in Eq. (13c) be $\epsilon = 0.25$. The maximum tracking error in Eq. (9b) is chosen as $\rho_{i1,j} = 0.01, \forall j \in \{1, 2, 3, 4\}$, and the maximum risk level in Eq. (9c) as $\rho_{i2} = 0.33$, for each robot $i \in \{1, 2, 3, 4\}$. Initially, each robot is equipped with three sensors with measurement vectors $h_1, h_2, h_3$, respectively, i.e., the initial sensor configuration as described in Eq. (2) is $\Gamma_0 = \mathbf{1}_{4\times 3}$. To conveniently compare their performance,
we enforce the same sequence of pre-specified sensor failures instead of inducing
sensor failures randomly. In a total of 50 time steps, a sensor whose measure-
ment matrix is equal to h3 is damaged at time step 20; another sensor whose
measurement matrix is h2 is damaged at step 35. The comparison of the track-
ing error and sensor margin is illustrated in Fig. 5. When both systems have the
same changes in their sensor margin as shown in Fig. 5b, the tracking error of
the decentralized system is similar to that of the centralized one as shown in
Fig. 5a.

6 Conclusion and Future Work


In this paper, we designed a decentralized and risk-aware multi-target tracking system, motivated by the setting where a team of robots is tasked with tracking multiple adversarial targets, and approaching the targets for more accurate position estimation increases the risk of sensor failures. We developed a decentralized approach to address the trade-off between tracking quality and safety: each robot plans its motion to balance the conflicting objectives of tracking accuracy and safety, relying on its own information and communication with its neighbors. We utilized control barrier functions to guarantee network connectivity throughout the tracking process. We showed via simulated experiments that our
system achieves similar tracking accuracy and risk-awareness to its centralized counterpart. Since our focus is the framework design, we adopted a simplified
linear measurement model. Therefore, one future research direction is to uti-
lize more realistic observation models, e.g., range and bearing sensors [8], which
requires linearized approximations in the optimization program. A second future
direction relates to the computation of goal positions. In this work, the robots
generate ideal goal positions by solving a non-convex optimization program. This
gradient-based approach is in general computationally time-consuming. When a very strict requirement is specified for the tracking accuracy, the system may fail to find the globally optimal solution and get stuck in local optima. To this end, we plan to design more efficient distributed algorithms to
compute the ideal goal positions [25]. Additionally, we consider extending the
proposed framework to deal with the scenarios where the targets can act strate-
gically to compromise the sensors and communications of the robots [26–31].

References
1. Mayya, S., et al.: Adaptive and risk-aware target tracking for robot teams with
heterogeneous sensors. IEEE Robot. Autom. Lett. 7(2), 5615–5622 (2022)
2. Schwager, M., Dames, P., Rus, D., Kumar, V.: A multi-robot control policy for
information gathering in the presence of unknown hazards. In: Christensen, H.I.,
Khatib, O. (eds.) Robotics Research, pp. 455–472. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-29363-9_26
3. Pham, H.X., La, H.M., Feil-Seifer, D., Deans, M.: A distributed control framework
for a team of unmanned aerial vehicles for dynamic wildfire tracking. In: 2017
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pp. 6648–6653. IEEE (2017)
4. Robin, C., Lacroix, S.: Multi-robot target detection and tracking: taxonomy and
survey. Auton. Robot. 40(4), 729–760 (2016)
5. Sung, Y., Budhiraja, A.K., Williams, R.K., Tokekar, P.: Distributed assignment with limited communication for multi-robot multi-target tracking. Auton. Robot. 44(1), 57–73 (2020)
6. Zhao, Y., Wang, X., Wang, C., Cong, Y., Shen, L.: Systemic design of distributed multi-UAV cooperative decision-making for multi-target tracking. Auton. Agent. Multi-Agent Syst. 33(1), 132–158 (2019)
7. Bar-Shalom, Y., Li, X.R., Kirubarajan, T.: Estimation with Applications to Track-
ing and Navigation: Theory Algorithms and Software. John Wiley & Sons (2004)
8. Zhou, K., Roumeliotis, S.I.: Multirobot active target tracking with combinations
of relative observations. IEEE Trans. Robot. 27(4), 678–695 (2011)
9. Dames, P., Tokekar, P., Kumar, V.: Detecting, localizing, and tracking an unknown
number of moving targets using a team of mobile robots. Int. J. Robot. Res. 36(13–
14), 1540–1553 (2017)
10. Zhou, L., Tokekar, P.: Active target tracking with self-triggered communications
in multi-robot teams. IEEE Trans. Autom. Sci. Eng. 16(3), 1085–1096 (2018)
11. Zhou, L., Tokekar, P.: Sensor assignment algorithms to improve observability while
tracking targets. IEEE Trans. Rob. 35(5), 1206–1219 (2019)
12. Dames, P.M.: Distributed multi-target search and tracking using the PHD filter.
Auton. Robot. 44(3), 673–689 (2020)

13. Dias, S.S., Bruno, M.G.S.: Cooperative target tracking using decentralized particle
filtering and RSS sensors. IEEE Trans. Signal Process. 61(14), 3632–3646 (2013)
14. Wakulicz, J., Kong, H., Sukkarieh, S.: Active information acquisition under arbi-
trary unknown disturbances. In: 2021 IEEE International Conference on Robotics
and Automation (ICRA), pp. 8429–8435. IEEE (2021)
15. Wang, L., Ames, A.D., Egerstedt, M.: Safety barrier certificates for collisions-free
multirobot systems. IEEE Trans. Robot. 33(3), 661–674 (2017)
16. Sung, Y., Budhiraja, A.K., Williams, R.K., Tokekar, P.: Distributed simultaneous
action and target assignment for multi-robot multi-target tracking. In: 2018 IEEE
International Conference on Robotics and Automation (ICRA), pp. 3724–3729.
IEEE (2018)
17. Tallamraju, R., Rajappa, S., Black, M.J., Karlapalem, K., Ahmad, A.: Decen-
tralized MPC based obstacle avoidance for multi-robot target tracking scenarios.
In: 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics
(SSRR), pp. 1–8. IEEE (2018)
18. Ji, M., Egerstedt, M.: Distributed coordination control of multiagent systems while
preserving connectedness. IEEE Trans. Robot. 23(4), 693–703 (2007)
19. Capelli, B., Fouad, H., Beltrame, G., Sabattini, L.: Decentralized connectivity
maintenance with time delays using control barrier functions. In: 2021 IEEE Inter-
national Conference on Robotics and Automation (ICRA), pp. 1586–1592. IEEE
(2021)
20. Ames, A.D., Coogan, S., Egerstedt, M., Notomista, G., Sreenath, K., Tabuada, P.:
Control barrier functions: theory and applications. In: 2019 18th European Control
Conference (ECC), pp. 3420–3431. IEEE (2019)
21. Sabattini, L., Secchi, C., Chopra, N., Gasparri, A.: Distributed control of multi-
robot systems with global connectivity maintenance. IEEE Trans. Robot. 29(5),
1326–1332 (2013)
22. Liu, Q., Wang, Z., He, X., Zhou, D.H.: On Kalman-consensus filtering with random link failures over sensor networks. IEEE Trans. Autom. Control 63(8), 2701–2708 (2017)
23. Yang, P., Freeman, R.A., Gordon, G.J., Lynch, K.M., Srinivasa, S.S., Sukthankar,
R.: Decentralized estimation and control of graph connectivity for mobile sensor
networks. Automatica 46(2), 390–396 (2010)
24. Boyd, S., Diaconis, P., Xiao, L.: Fastest mixing Markov chain on a graph. SIAM Rev. 46(4), 667–689 (2004)
25. Atanasov, N., Le Ny, J., Daniilidis, K., Pappas, G.J.: Decentralized active information acquisition: theory and application to multi-robot SLAM. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4775–4782. IEEE (2015)
26. Zhou, L., Tzoumas, V., Pappas, G.J., Tokekar, P.: Resilient active target tracking
with multiple robots. IEEE Robot. Autom. Lett. 4(1), 129–136 (2018)
27. Shi, G., Zhou, L., Tokekar, P.: Robust multiple-path orienteering problem: securing
against adversarial attacks. In: 2020 Robotics: Science and Systems (RSS) (2020)
28. Ramachandran, R.K., Zhou, L., Preiss, J.A., Sukhatme, G.S.: Resilient coverage:
exploring the local-to-global trade-off. In: 2020 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pp. 11740–11747. IEEE (2020)
29. Liu, J., Zhou, L., Tokekar, P., Williams, R.K.: Distributed resilient submodular
action selection in adversarial environments. IEEE Robot. Autom. Lett. 6(3),
5832–5839 (2021)

30. Zhou, L., Tzoumas, V., Pappas, G.J., Tokekar, P.: Distributed attack-robust sub-
modular maximization for multirobot planning. IEEE Trans. Robot. (2022)
31. Zhou, L., Kumar, V.: Robust multi-robot active target tracking against sensing
and communication attacks. In: 2022 American Control Conference (ACC), pp.
4443–4450. IEEE (2022)
Application of 3D Printed
Vacuum-Actuated Module
with Multi-soft Material to Support
Handwork Rehabilitation

Shoma Abe, Jun Ogawa(B), Yosuke Watanabe, MD Nahin Islam Shiblee, Masaru Kawakami, and Hidemitsu Furukawa

Yamagata University, Jonan-4-3-16, Yonezawa, Yamagata 992-8510, Japan
{jun.ogawa,yosuke.w,nahin,kmasaru,furukawa}@yz.yamagata-u.ac.jp

Abstract. Soft modular robotics is a research field that pursues body flexibility while maintaining reconfigurability by employing flowable or
soft materials as the main constituent materials of modular robots. In
soft modular robotics, a hollow structure made of silicone is created,
and the inside of the structure is pressurized to generate stretching and
bending motions as an actuator module. In general, it is important to
design modular robots with reconfigurability, various deformability, and
environmental adaptability. Modularization of soft actuators with recon-
figurable and diverse deformability allows morphology to be tailored to
the desired task. We have developed a soft actuator module, MORI-
A, that combines hollow silicone and 3D printable deformable struc-
tures (uniaxial shrinking, bending, shearing, uniform contracting, and no
deformation). This study proposes a functional extension of the module
“MORI-A FleX” for application as a wearable device for physical reha-
bilitation using this MORI-A module. Our MORI-A FleX connector is
a thermoplastic polyurethane flexible connector that can be coated with
materials that offer different textures to achieve excellent connectivity
and texture variation. We have assembled MORI-A Flex as a rehabili-
tation device for hand work, assisting the fingers to move harmoniously
and enabling them to grasp a slippery, brittle half-boiled egg and a PET
bottle containing 200 ml of water in a deactivated state.

Keywords: 3D printing · soft robotics · pneumatic actuator · modular robotics · wearable device

1 Introduction

Soft modular robotics is a research field that pursues body flexibility while main-
taining reconfigurability by employing flowable or soft materials as the main con-
stituent materials of modular robots. In soft modular robotics, a hollow structure
made of silicone is created, and the inside of the structure is pressurized to gen-
erate stretching and bending motions as an actuator module [1,2]. In general,
it is important to design modular robots with (1) reconfigurability, (2) various deformability, and (3) environmental adaptability. Conventional soft modular robots achieve reconfigurability by forming a large number of microchannels with connecting devices such as magnets and silicone tubes [3–8], and create various deformabilities through a local hardness bias on the surface of the shell structure [9,10]. However, the highly reconfigurable approach, while easy to connect and highly controllable thanks to its independent flow paths, suppresses the deformation of the connected modules through its hard materials and the complexity of the connections. On the other hand, direct processing of silicone surfaces increases deformability but reduces versatility.
We have developed a vacuum-driven soft module “MORI-A” [11] that
achieves these three functionalities by combining a cube-shaped hollow silicone
case with a soft polymer material and a deformation-controlling structure that
can be easily created with a 3D printer. The MORI-A module has ventilation
holes on six sides of the cube, and each hole is sealed by a connector or sealing
plug made by a 3D printer. Figure 1 shows a schematic diagram of the MORI-A
module. Since neither electrical devices nor water-soluble resins are used in the
construction materials, the module is mobile in different environments, whether
on land or in a fluid environment, and is drivable enough to allow buoyancy
adjustment. In addition, we propose the MORI-A FleX, which extends the sur-
face texture and extensibility of the MORI-A module connector by modifying
its materials for application to physical rehabilitation. The complex geometry
and intended use of this actuator have potential applications in physical reha-
bilitation. As examples of research on hand rehabilitation, two pneumatic soft
actuators have been reported to assist ROM training and muscle stretching [12],
and the soft actuator “pneumagami” with an origami structure has been reported
to support human physical activity through deformation [13]. In the case of arti-
ficial muscles, soft machines have been developed to assist the grip force, joints
and posture of the legs and hips under pressure [14–16]. In the case of arti-
ficial muscles, exoskeletal gloves have been developed for hand rehabilitation,
and motorized ankle-foot orthoses have been developed for foot rehabilitation
[17–20]. In addition, soft machines for rehabilitation are mostly used for adults,
but in recent years, small soft rehabilitation machines for infants have been
developed [21].While these methods are effective as assistive tools, they are char-
acterized by the need for electrical components in the actuators that are directly
attached, and by their lack of reconfigurability. In addition, their size creates
limitations on the age and body size at which they can be used. Therefore,
this study aims to apply MORI-A FleX to wearable devices that support hand
and finger movements, which do not require electrical wiring for actuators with
various deformability and reconfigurability plus scale variation.

2 MORI-A FleX: The Vacuum-Actuated Module with 3D Printed Flexible Connectors

This section describes the procedure for creating the MORI-A FleX module.

2.1 Creation Procedure of MORI-A FleX

The silicone case is made of EcoflexTM 00–10 (Smooth-On, Inc.), a stretchable sil-
icone. The shape is a hollow cube of 19.0 mm square with a thickness of 2.0 mm.
Since the goal here is to develop a wearable device for hand work rehabilitation,
the dimensions are adjusted to the average finger width of an adult. The scala-
bility of the MORI-A FleX module is not limited to this size, as it can be scaled up or down. In addition, modules with different dimensions are designed to be connectable.

Fig. 1. Schematic diagram of the vacuum-actuated module “MORI-A FleX”

Fig. 2. Procedure for creating a MORI-A FleX module

Fig. 3. Basic components of the internal structure and parametric design

Figure 2 shows the process of creating the MORI-A FleX module.
First, laser-cut acrylic panels are assembled to form a lattice of partitions. A
core consisting of a thermoplastic polyurethane (TPU) inner structure and a
water-soluble polyvinyl alcohol (PVA) outer shell and lid is placed in the acrylic
partition. After installation, silicone is poured into the core and left to cure for
at least 4 h in a vacuum de-aerator to prevent the formation of air bubbles on the
silicone surface. The cured silicone case is then removed from the acrylic parti-
tion and immersed in water for one hour to dissolve the PVA core. This process
produces MORI-A FleX, a silicone with an embedded TPU inner structure.

2.2 TPU Internal Construction


This internal structure is embedded in a module that is vacuum compressed to
create the desired deformation through the case. The internal structure adopts a
3D printable structure called a parallel-cross structure [22], in which parallel line
objects are rotated by an intersection angle θ and stacked alternately; the basic structure is shown in Fig. 3. This structure is edited by seven variables: the intersection angle θ, the maximum grid width sx, the numbers of cells m and n, the line width t, the stacking height tz, and the maximum stacking height z. The number of cells can be arbitrarily determined by n, and the value of m is determined by
the formula m = 2n − 1.

Fig. 4. 3D printed deformation-defining structures

Fig. 5. Appearance of MORI-A FleX deformation

The deformation-prescribing structures of the MORI-A FleX module comprise “no deformation”, “shrinking”, “bending”, and
“shearing”. The parameters set for these structures and the resulting deformations are shown in Fig. 4 and Fig. 5. The shrinking, bending, and shearing structures are 3D printed with TPU filaments because they require flexibility to allow deformation. The no-deformation structure is instead achieved through the material: poly-lactic acid (PLA), whose stiffness prevents bending even when the silicone case is vacuumed, so that for this case the material, rather than the structural form, is the main prescribing element. These parallel-cross structures were fabricated using an FDM 3D printer (X-Pro, QIDI Tech, Inc.).
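The parameterization above is easy to enumerate programmatically. The sketch below (our illustration, not the authors' CAD script) lists the layer parameters of a parallel-cross structure; the pitch rule s_x/m is our assumption about how the grid width is divided among the cells.

```python
def parallel_cross_layers(theta_deg, s_x, n, t, t_z, z_max):
    """Enumerate the stacked layers of a parallel-cross structure:
    alternate layers are rotated by the intersection angle theta,
    and the cell counts obey m = 2n - 1 (as stated in the text)."""
    m = 2 * n - 1
    layers = []
    for k in range(int(z_max / t_z)):          # stack up to the maximum height z
        layers.append({'z': k * t_z,
                       'angle': theta_deg if k % 2 else 0.0,  # alternating rotation
                       'pitch': s_x / m,       # assumed line-spacing rule
                       'line_width': t})
    return layers

print(parallel_cross_layers(theta_deg=60.0, s_x=15.0, n=3, t=0.8, t_z=0.4, z_max=4.0)[:2])
```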

2.3 3D Printed Flexible Connector

The MORI-A FleX module also has a parametric design for the connector that
closes the opening. The basic shape of the connector is a cylinder with a diameter
slightly larger than the silicone case opening. These connectors can be fabricated
on a 3D printer using the Ninja FleX (Ninja Tek) TPU filament or the UV light-
curable resin (Clear/Black Resin, Wanhao). The list of connectors to be molded
with this material and their roles are shown in Fig. 6. There are four types of
MORI-A FleX connectors: (1) module-to-module connector; (2) module air plug;
(3) texture plug; and (4) tubing connector. (1) The module-to-module connectors
are used to connect two vacuum-actuated modules with flow paths secured. (2)
The module air plug is a connector that seals the opening of a module without
connecting it to another module. The thickness of the cylindrical part is the same
as the wall thickness of the silicone case, so it can be sealed without allowing
air to enter. (3) The texture plug is used to change the surface texture of the
module. (4) A tube connector is a connector that can connect a silicone tube to
a module to allow air to flow from the silicone tube to the module. It is mainly
used to connect an air compressor.

Fig. 6. Connectors used in MORI-A FleX module and their roles

3 MORI-A FleX Mechanical Properties

3.1 Load Displacement Performance
Since MORI-A FleX is designed to be vacuum-actuated with many modules connected, the load capacity in the tensile direction is an important parameter for applications such as grippers. Here, load-displacement tests are performed on chains of 1 to 8 uniaxial-shrinkage MORI-A FleX modules. The tests were performed using an STA-1150 table-top material testing machine manufactured by A&D Company. The measurements and results are shown in Fig. 7 and Fig. 8. The inside of the uniaxially contracting MORI-A FleX is vacuumed, and the displacement and load are measured when tension is applied after the module is mounted in the chuck of the tensile testing machine. The pressure condition of the compressor was set to 500 kPa. From the results in Fig. 8, a single driven MORI-A FleX module carries a maximum load of about 15.1 N at a displacement of about 7.3 mm. This means that a single MORI-A FleX module can lift an object with a mass of about 1.5 kg. As the number of connected modules increases, the maximum load decreases while the displacement increases; the load values are equivalent for four or more connected modules. The actual displacement depends on the stiffness of the internal structure.
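The quoted ~1.5 kg figure follows directly from the measured maximum load:

$$ m = \frac{F_{\max}}{g} = \frac{15.1\ \mathrm{N}}{9.81\ \mathrm{m/s^2}} \approx 1.54\ \mathrm{kg}. $$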

Fig. 7. MORI-A FleX load displacement test in tensile direction

3.2 Coefficient of Friction with Texture Change Connector


Connectors with variable surface properties can be coated with materials such
as UV-curable resin (Clear Resin, Wanhao), shape memory gel (SMG) [23], and pile to give the desired texture to the module. The results of a comparison of the processed surface textures are shown in Fig. 9. Connectors coated with UV-curable resin retain the flexibility of the Ninja FleX (NinjaTek) TPU but have a smooth, glossy surface texture. Next, connectors coated with SMG develop a temperature-dependent tackiness that is characteristic of this material; by taking advantage of this tackiness, it is possible to hold objects that are difficult for the conventional MORI-A to grip. Finally, a simple flocking experiment device (Green Techno Co., Ltd.) was used to apply flocking pile to the connectors. Connectors with flocking pile attached have a fluffy, feather-like texture, which can give the module a texture unique to fibrous fabrics.

Fig. 8. Comparison of the load carrying capacity of MORI-A FleX in the tensile direction depending on the number of connections

Fig. 9. Comparison of MORI-A FleX texture by the machined surface of the flexible connector (texture plug)

Fig. 10. Comparison of friction coefficients due to materials coated on texture plugs
In a friction test, the effects of the UV resin, pile, and SMG surface coatings fabricated in this study are compared. A velocity friction tester (Type: μV1000, Trinity-Lab., Inc.) is used for the test, with soda-lime-silica glass substrates of 100 mm × 100 mm × 2 mm as the friction surfaces. Figure 10 shows the results. The pile showed the lowest friction, with an average friction coefficient of 0.37; this is closest to that of magnesium (0.34), a material used in wheelchairs, canes, and other assistive devices, as well as in electronic components. The friction of the UV resin is slightly higher than that of the untreated resin (TPU), and compared to the other coatings its friction converges to a constant value; its average coefficient of friction is 0.48, closest to that of cerium, which is used as an abrasive for glass, including automotive glass. The average friction coefficient of the SMG is approximately 1.2.

Fig. 11. Schematic of wearable device with MORI-A FleX assembly supported by
handwork

Fig. 12. Grasping a PET bottle of 200 ml water and a half-boiled egg (subject’s hand
is in a state of weakness)

4 Applications

MORI-A FleX will be applied to a wearable device for the rehabilitation of human
hand-grasping movements. A schematic diagram of the wearable device to be
developed is shown in Fig. 11. To secure MORI-A FleX, which is scaled to the
width of a human finger, to the finger, a ring connector is formed by gluing a
binding band to one side of a UV-curable resin sealing plug. The modules on the
five fingers are connected to each other by silicone tubes and a six-port
connector made of UV-curable resin developed in-house. To reproduce the movement
along the fingers with MORI-A FleX, a shear deformation module is attached to
the tip of the finger, while the rest of the finger is fitted with a module that
can stretch and deform.

Fig. 13. Sixth finger with bending module

Fig. 14. Block diagram showing the wearable device developed in this study and the control of the sixth finger

Figure 12 shows the actual grasping of a boiled egg and
a PET bottle containing 200 ml of water with this wearable device. The results
are evaluated from two points of view: rehabilitation and assistance. First, in
terms of rehabilitation, since the finger flexes softly and without strain,
repeatedly applying and releasing pressure is expected to gradually restore the
sense of finger bending for those who have peripheral nerve problems. From the
viewpoint of power assist, even when the subject's hand is in a state of
weakness, attaching a highly tacky SMG texture-change plug to the MORI-A FleX
at the tip of the little finger makes it easier to grasp a PET bottle, and the
device successfully carried an object with a load of 200 g. Next, in the
grasping of a half-boiled egg, conducted to verify whether an object with a
smooth texture could be grasped, the egg was wrapped in the palm of the hand
and carried stably, without slipping, under reduced pressure. The
deformability, contact texture, and changeable shape of the wearable device
thus enable simple power assistance even when the finger muscles are completely
inoperable. However, the MORI-A FleX attached to the finger was unable to
withstand the weight of an object heavier than 200 g. This was due to
misalignment between the module and the finger, air leakage inside the module,
and misalignment between the adhesive surface and the object, which made it
impossible to grasp the object properly. Although the PET bottle was grasped
from the side, for objects that can be wrapped by the hand, such as a
half-boiled egg, grasping objects weighing 200 g or more can be expected.
Adding two shear modules next to the fingertips would likely allow objects to
be carried more stably. In addition, as shown in Fig. 13, the bending module is
expected to assist in grasping by forming a sixth finger composed only of
bending modules. Furthermore, if this is combined with the wearable device
developed in this work, it should become possible to grasp heavier and more
difficult-to-grasp objects. Figure 14 shows a block diagram of the wearable
device developed in this study and the control of the sixth finger. Compressed
air from the compressor is fed to a vacuum generator to produce reduced
pressure. A solenoid valve is inserted just before the vacuum generator: when
the valve is open, the module attached to the finger is depressurized and
driven; when the valve is closed, air flows back and the module returns to its
original shape. The pneumatic pressure is controlled by opening and closing the
valve with a relay circuit driven by an Arduino.
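The prototype drives the valve with a relay circuit on an Arduino; as a language-neutral illustration of the same on/off logic, the following Python sketch uses a hypothetical relay_write() driver, and the pin number and hold times are assumptions, not values from the paper.

```python
import time

VALVE_RELAY_PIN = 7   # hypothetical relay channel driving the solenoid valve

def relay_write(pin: int, state: bool) -> None:
    """Placeholder for a real relay/GPIO driver; here it only logs."""
    print(f"relay {pin} -> {'OPEN' if state else 'CLOSED'}")

def actuate_module(hold_s: float = 2.0, release_s: float = 2.0) -> None:
    # Valve open: the vacuum generator depressurizes and drives the module.
    relay_write(VALVE_RELAY_PIN, True)
    time.sleep(hold_s)
    # Valve closed: air flows back and the module returns to its shape.
    relay_write(VALVE_RELAY_PIN, False)
    time.sleep(release_s)

if __name__ == "__main__":
    for _ in range(3):   # repeated pressure application/release cycles
        actuate_module()
```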

5 Conclusion
We have developed a versatile vacuum-actuated module, MORI-A FleX, which
allows for high reconfigurability, diverse deformability, and changes in
contact-surface texture. By applying the morphological adaptability and texture
adjustment capabilities of this module, we have succeeded in developing a soft
wearable device for simple grasp rehabilitation and assistance that supports
soft motions following hand and finger movements. Since this module does not
require electrical wiring for the actuator, there is no risk of wire breakage
or electric shock,
and it can be viewed as a safe rehabilitation tool. Future prospects include the
development of MORI-A FleX wearable devices for infants and toddlers through
morphological changes, scaling, various combinations and refinements of different
textures, automatic calculation of module configurations to fit each individual,
and motion assist for different body parts.

Acknowledgement. This work was supported in part by JSPS KAKENHI Grant
Numbers JP18H05471, JP19H01122, JP21H04936, JP21K14040, JP22K17972, the
JST-OPERA Program Grant Number JPMJOP1844, the Moonshot Agriculture, Forestry
and Fisheries Research and Development Program (MS508, Grant Number JPJ009237),
and the Cabinet Office (CAO) Cross-ministerial Strategic Innovation Promotion
Program (SIP) “An intelligent knowledge processing infrastructure, integrating
physical and virtual domains” and “Intensive Support for Young Promising
Researchers” (funding agency: NEDO).

References
1. Cheney, N., MacCurdy, R., Clune, J., Lipson, H.: Unshackling evolution: evolv-
ing soft robots with multiple materials and a powerful generative encoding. ACM
SIGEVOlution 7(1), 11–23 (2014)
2. Kriegman, S., Walker, S., Shah, D., Levin, M., Kramer-Bottiglio, R., Bongard,
J.: Automated shapeshifting for function recovery in damaged robots.
arXiv:1905.09264 (2019)
3. Vergara, A., Lau, Y., Mendoza-Garcia, R.F., Zagal, J.C.: Soft modular robotic
cubes: toward replicating morphogenetic movements of the embryo. PLoS ONE
12(1), e0169179 (2017)
4. Lee, J., Kim, W., Choi, W., Cho, K.: Soft robotic blocks: introducing SoBL, a
fast-build modularized design block. IEEE Robot. Autom. Magaz. 23(3), 30–41
(2016)
5. Nemitz, M.P., Mihaylov, P., Barraclough, T.W., Ross, D., Stokes, A.A.: Soft robotic
blocks: using voice coils to actuate modular soft robots: wormbot, an example. Soft
Rob. 3(4), 198–204 (2016)
6. Zhang, Y., et al.: A mechatronics-embedded pneumatic soft modular robot powered
via single air tube. Appl. Sci. 9(11), 2260 (2019)
7. Zou, J., Lin, Y., Ji, C., Yang, H.: A reconfigurable omnidirectional soft robot based
on caterpillar locomotion. Soft Rob. 5(2), 164–174 (2018)
8. Sui, X., Cai, H., Bie, D., Zhang, Y., Zhao, J., Zhu, Y.: Automatic generation of
locomotion patterns for soft modular reconfigurable robots. Appl. Sci. 10(1), 294
(2020)
9. Morin, S.A., et al.: Using “Click-e-Bricks” to make 3D elastomeric structures. Adv.
Mater. 26(34), 5991–5999 (2014)
10. Morin, S.A., et al.: Elastomeric tiles for the fabrication of inflatable structures.
Adv. Func. Mater. 24(35), 5541–5549 (2014)
11. Ogawa, J., Mori, T., Watanabe, Y., Kawakami, M., Shiblee, M.D.N.I., Furukawa,
H.: MORI-A: soft vacuum-actuated module with 3D-printable deformation struc-
ture. IEEE Robot. Autom. Lett. 7(2), 2495–2502 (2022)
12. Hironari, T., Shuichi, W., Koichi, S.: Development of hand rehabilitation system to
prevent contracture for finger joints based on the therapy of occupational therapists
(massage hand and range of motion exercises using pneumatic soft actuators).
Trans. JSME 80(820), 348-348 (2014). (in Japanese)

13. Robertson, M.A., Kara, O.C., Paik, J.: Soft pneumatic actuator-driven origami-
inspired modular robotic “pneumagami”. Int. J. Robot. Res. 40(1), 72–85 (2021)
14. Yiyue, L., et al.: Digital fabrication of pneumatic actuators with integrated sensing
by machine knitting. In: CHI Conference on Human Factors in Computing Systems,
pp. 1-13 (2022)
15. Al-Fahaam, H., Davis, S., Nefti-Meziani, S., Theodoridis, T.: Novel soft bending
actuator-based power augmentation hand exoskeleton controlled by human inten-
tion. Intell. Serv. Robot. 11(3), 247–268 (2018)
16. Xie, D., Ma, Z., Liu, J., Zuo, S.: Pneumatic artificial muscle based on novel winding
method. Actuators 10(5), 100 (2021)
17. Herr, H.M., Kornbluh, R.D.: New horizons for orthotic and prosthetic technol-
ogy: artificial muscle for ambulation. In: Smart Structures and Materials 2004:
Electroactive Polymer Actuators and Devices (EAPAD), vol. 5385, pp. 1-9 (2004)
18. Dong, T., Zhang, X., Liu, T.: Artificial muscles for wearable assistance and reha-
bilitation. Front. Inf. Technol. Electron. Eng. 19(11), 1303–1315 (2018)
19. Al-Fahaam, H., Davis, S., Nefti-Meziani, S.: The design and mathematical mod-
elling of novel extensor bending pneumatic artificial muscles (EBPAMs) for soft
exoskeletons. Robot. Auton. Syst. 99, 63–74 (2018)
20. Pan, M., et al.: Soft actuators and robotic devices for rehabilitation and assistance.
Adv. Intell. Syst. 4(4), 2100140 (2022)
21. Mendoza, M.J., et al.: A vacuum-powered artificial muscle designed for infant reha-
bilitation. Micromachines 12(8), 971 (2021)
22. Touma, T.: Complex of Parallel crosses structure. J. Imaging Soc. Japan 58, 406–
414 (2019)
23. Yokoo, T., Hidema, R., Furukawa, H.: Smart lenses developed with high-strength
and shape memory gels. e-J. Surf. Sci. Nanotechnol. 10, 243–247 (2012)
Combining Hierarchical MILP-MPC and Artificial Potential Fields
for Multi-robot Priority-Based Sanitization of Railway Stations

Riccardo Caccavale1, Mirko Ermini2, Eugenio Fedeli2, Alberto Finzi1,
Emanuele Garone3, Vincenzo Lippiello1, and Fabrizio Tavano1,2(B)

1 Università degli studi di Napoli “Federico II”, via Claudio 21, 80125 Naples, Italy
{riccardo.caccavale,alberto.finzi,vincenzo.lippiello,fabrizio.tavano}@unina.it
2 Rete Ferroviaria Italiana, Piazza della Croce Rossa 1, 00161 Rome, Italy
mi.ermini@rfi.it, e.fedeli@rfi.it
3 ULB, Ecole polytechnique de Bruxelles, Campus du Solbosch - CP 165/55 Avenue F.D. Roosevelt, 50, Bruxelles, Belgium
emanuele.garone@ulb.be

Abstract. We propose a distributed framework driving a team of robots
for the sanitization of very large, dynamic indoor environments such as
railway stations. A centralized server uses Hierarchical Mixed-Integer
Linear Programming to coordinate the robots, assigning the different
zones where cleaning is a priority; following a Model Predictive Control
approach, we use historical data about the distribution of people and
knowledge about the transportation service of the station to predict the
future dynamic evolution of the positions of people in the environment
and the spreading of the contaminants. Each robot navigates the large
environment, represented as a gridmap, exploiting the Artificial
Potential Fields technique in order to reach and clean the assigned
areas. We tested our solution on real data collected by the WiFi network
of the main Italian railway station, Roma Termini, and compared our
results with a decentralized multi-robot deep reinforcement learning
approach.

Keywords: Hierarchical Mixed-Integer Linear Programming · Model
Predictive Control · Artificial Potential Fields · distributed ·
multi-agent · Priority-Based Sanitization

1 Introduction and Related Works


The pandemic caused by SARS-CoV-2 has spawned a crisis that has significantly
affected the railway sector [29], for example by inducing people to
prefer cars instead of trains [6]. In this regard, disinfectant robot
technologies have proven very useful in fighting global pandemics [30] by
reducing the number of people involved in the cleaning process and by
optimizing the sterilization. In this work, we propose a multi-robot sanitizing
framework for teams of robots capable of cleaning human-populated environments
[16] such as railway stations. Our main goal is to design a framework in which
a team of autonomous indoor cleaning robots executes the sanitizing activities
exploiting information about human presence retrieved from the WiFi
infrastructure, so as to maximize the cleaning performance in big indoor
environments such as railway stations without stopping the railway/commercial
activities. In particular, we exploit the station's WiFi network to identify
the most populated areas of the environment by retrieving information about
the positions of the users' smartphones [24]. In our solution, a server
(MPC-Server) receives from the WiFi network the information about the presence
of people in the station and estimates the positions of the contaminated areas,
which are mapped into a heatmap. Here, we deploy a distributed MPC approach
that adapts the sanitization strategy to the evolution of the contamination due
to the variation of the positions and number of people in the station. Several
approaches have been proposed for the multi-robot cleaning of indoor
environments. One possible approach is to formulate the cleaning of the
environment as a coverage path planning (CPP) problem, where the objective is
to find a strategy to reach every part of the environment while minimizing the
execution time. An early work relies on the random-walk method [4], where
robots perform random movements inside the environment; in this case, the
execution time is affected by position overlapping and path repetitions. Other
solutions consider the environment as a grid map that is partitioned into
several parts assigned to different robots executing fixed-shape cleaning paths
(spirals, triangles, rectangles, etc.) [9–12,14,17,18,20]. These methods are
effective at providing a continuous cleaning service that maximizes the
coverage and minimizes the idleness of the agents. Additional works on the CPP
methodology consider the presence of zones of the environment with persistent,
static, higher priority that need to be cleaned/sanitized with a pre-specified
higher frequency [5,13,15,21,28]. Those approaches often consider a graph-based
representation of the environment with only a limited number of nodes. In
contrast, we are interested in finding a solution handling high resolution and
dynamic priorities in very large railway stations. An interesting work is
described in [5], where the authors solve the multi-robot persistent coverage
control problem over a grid environment using Mixed-Integer Linear Programming
(MILP) to maximize the coverage level of the cells over a finite horizon. In
their study, the authors defined terminal constraints to ensure that the agents
are always able to terminate their plans in predetermined closed trajectories.
In our framework we also consider persistent priorities derived from the
historical data recorded by the Meraki WiFi network but, in contrast, we are
more interested in finding a solution that may adapt to different
circumstances, following the unexpected changes of the priorities in the
environment, without fixed trajectories or fixed nodes. Moreover, we chose to
design a Hierarchical MILP (HMILP) [3,7] where the problem of finding
different paths for different robots in a common graph is decomposed into a
cascade of MILP problems, one for each robot, and each selected path is used to
constrain the selection of the following ones. In this way, robots starting
from the same initial conditions are forced to select different paths,
increasing the coverage. In this context, Model Predictive Control (MPC) has
often been used to find local solutions in open environments with a small
number of obstacles [2]. Unfortunately, including geometric constraints for
complex and very large indoor environments with static obstacles makes the
optimization problem very expensive to solve in real time [1]. This depends on
the number of added obstacle constraints, which increases the iteration cost of
MPC solvers, and on the extra non-convexity, which makes it hard to find good
local solutions. For this reason, we propose an approach that avoids geometric
constraints in the control layer by delegating collision checking and obstacle
avoidance to a lower-level Artificial Potential Fields (APF) navigation module.
Similarly to our work, in [27] an APF framework controlling and coordinating a
group of robots for cooperative manipulation tasks is proposed. Also in [23],
the APF method and optimal-control methods are used for the path planning of
autonomous vehicles. Following these perspectives, in this paper we propose a
MILP-MPC approach combined with APF, which is able to produce high-level
strategies for the cleaning/sanitizing of crowded environments by adapting
on-line the agents' behavior with respect to the current distribution of
people. In particular, the proposed system relies on anonymous information from
the preexisting WiFi infrastructure, which is often available in public
environments, to estimate the positions of people [24] and to prioritize the
sanitizing process toward the more crowded areas of the environment. Moreover,
we also defined a simple parametric model to estimate the spreading of
contaminants (bacteria or viruses) due to the presence of people, in order to
better estimate the risky areas to be cleaned with higher priority. In
particular, our main contributions can be summarized as follows:

– We propose a scalable MPC-MILP framework combined with APF, where
multiple mobile robots cooperate during the execution of cleaning tasks in
a very large crowded environment, exploiting WiFi information about the
distribution of people, historical data, and knowledge about the
transportation service in the station.
– We tested the behavior of a team of four robots on two weeks of real data
recorded by the Cisco Meraki WiFi network of the Roma Termini station.
– We compared the behavior of our solution with the Multi-Agent Reinforcement
Learning (MARL) framework [25]. For this experiment we have shared
the video MARLvsHMILP.mp4 at the link: http://wpage.unina.it/fabrizio.
tavano/video/MARLvsHMILP.mp4.
– We provide an additional experiment showing how the proposed system can
be adapted to consider on-line requests for the sanitization of specific areas.
For this experiment we have shared the video CleaningSpecificArea.mp4 at
the link: http://wpage.unina.it/fabrizio.tavano/video/CleaningSpecificArea.
mp4.

The rest of the paper is structured as follows. In Sect. 2 we describe the
architecture of the proposed framework. In Sect. 3 we focus on the case studies
comparing our approach with the MARL technique presented in [25]. Finally,
Sect. 4 concludes the paper and discusses future works.

2 Architecture

The architecture is based on a decentralized client-server scheme, represented
in Fig. 1, where a team of robots is connected on-line to a unique server. We
made this choice because our aim is a solution that responds to the demands of
Rete Ferroviaria Italiana (RFI) using the resources and technologies already
present in the Italian railway stations, such as the Cisco Meraki WiFi
infrastructure, which records information on the passenger foot traffic flowing
through the station in the form of a Meraki heatmap.

Fig. 1. Distributed client-server architecture, where the MILP server sends the MPC-Heatmap and the destination nodes to the robots, which decide their route thanks to the APF motion-path technique.

As shown in Fig. 1, we consider that the MPC-Server receives the Meraki
heatmap as visualized in the Meraki Dashboard, with an update period of 1 h.
In the Meraki heatmap, the colors represent the current distribution of people
in the station. This image is converted into a new MPC-Heatmap formatted as a
gridmap of dimensions 100 × 172 pixels, in which the colors now indicate the
level of contamination of the areas in the environment, as shown in Fig. 2(a).
More formally, we define M as the gridmap of the station, X as the set of all
obstacle-free cells in the map, S as the set of possible heatmaps (i.e.,
priority distributions) on the map M, and A as the set of actions available to
a single agent, where a_i ∈ A drives agent i from the current cell to an
adjacent one (Fig. 2(b)). Each element of the MPC-Heatmap is a real number in
the interval [0, 1], where 1 is the maximum priority and 0 means that no
cleaning is needed. This matrix can be displayed as a color-coded image (see
the MPC-Heatmap in Fig. 2(b)), where black pixels are associated with
0-priority areas, while colors from red to
yellow are for increasingly higher priorities. We have also modeled the
tendency of the contaminants to spread and attenuate over time. The update of
the heatmap (cleaning priority) is computed from the positions of people
(clusters) by modeling the possible spreading of viruses or bacteria using a
Gaussian dispersion model as in [8]. Specifically, we apply the periodic
convolution of a Gaussian filter at every step, where μ and σ are suitable
parameters that can be tuned depending on the meters/pixel ratio, the timestep,
and the considered typology of spreading (in this work we assume a setting
inspired by the aerial diffusion of Covid-19 [26]). In our case, we use the
values of μ and σ as in [25]. The effect of the spreading and attenuation of
the priorities on the MPC-Heatmap is shown in Fig. 2(c) after 100 timesteps.
This spreading process is also used as a model in our MPC method in order to
estimate the evolution of the priority over time.
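The spreading step can be sketched as a periodic Gaussian blur of the priority grid followed by a mild attenuation. The snippet below is a minimal illustration using scipy's gaussian_filter; the sigma and decay values are illustrative stand-ins, not the parameters taken from [25].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spread_priorities(heatmap: np.ndarray, sigma: float = 1.0,
                      decay: float = 0.99) -> np.ndarray:
    """One timestep of contaminant dynamics on the MPC-Heatmap: Gaussian
    spatial spreading followed by a mild attenuation; `sigma` and `decay`
    are illustrative values, not the parameters used in the paper."""
    spread = gaussian_filter(heatmap, sigma=sigma, mode="constant")
    return np.clip(spread * decay, 0.0, 1.0)   # priorities stay in [0, 1]

# Example: a single contaminated cell diffusing over 100 timesteps.
h = np.zeros((100, 172))
h[50, 86] = 1.0
for _ in range(100):
    h = spread_priorities(h)
```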

Fig. 2. Heatmap representation of the Roma Termini station: (a) Meraki heatmap, (b) MPC-Heatmap, (c) MPC-Heatmap after 100 timesteps, (d) MPC-Heatmap with the associated graph.

Our sanitization strategy has four main steps in its execution, as shown in
Fig. 3. We consider that the cleaning starts with a delay of 30 min from the
reception of the first Meraki heatmap in the morning. Following the arrow of
the timeline, in a first step the MPC-Server receives the update of the Meraki
heatmap from the Meraki server (step A). The MPC-Server applies the HMILP as in
[3] to decide the destinations of every robot (step B). In this phase the
MPC-Server makes use of historical data about the distribution of people in the
station in order to make a prediction of the conditions over the next 30 min.
Afterwards (step C) there is a new update of the Meraki heatmap, and finally
(step D) a new HMILP decision is executed to correct the first prediction
considering the new Meraki heatmap update, without the contribution of the
historical knowledge. The latter correction is needed to compensate for the
unpredictable behavior of the people.
The server is also responsible for extracting the positions of the people in
the environment from the Meraki heatmap and for receiving the position of every
robot inside the station. This information is used to build the MPC-Heatmap
(Fig. 2) with differently colored zones. The MPC-Server extracts the
information about the positioning of the people in the station from the current
Meraki heatmap, receives the positions of the robots and their chosen paths,
and clears the corresponding zones in the MPC-Heatmap to prepare the new
updated map. The MPC-Server partitions the current MPC-Heatmap into a discrete
and limited number of areas with a fixed sectioning step (in our case we chose
20 pixels); every resulting subarea is represented by a single value in an
ordered list of partitions G1. Considering i ∈ N = {0, ..., n}, where n is the
total number of partitions of the MPC-Heatmap, the value of element i of the
list G1 is q_i, defined as the sum of the priority risk of the corresponding
portion of the MPC-Heatmap. The MPC-Server applies the same procedure to build
a second ordered list G2, using a second heatmap, MPC-Heatmap 2, which is used
to estimate the future expected distribution of people. MPC-Heatmap 2 is built
considering the knowledge about the transportation service of the station and
the average behavior of people in the historical data. RFI defines four
categories of days to organize the transportation service in the Roma Termini
station: the working day, from Tuesday to Thursday; the Friday and the day
before holidays, including Saturday; the holiday, including Sunday; and the day
after holidays, such as Monday.
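As an illustration, the construction of G1 (and, analogously, G2) can be sketched as follows; this is a minimal numpy-based sketch under the stated 20-pixel sectioning, not the authors' code, and the row-by-row scan order is an assumption.

```python
import numpy as np

def partition_priorities(heatmap: np.ndarray, step: int = 20) -> list:
    """Build an ordered partition list (G1 from the current MPC-Heatmap,
    G2 from MPC-Heatmap 2): each entry is the summed priority risk of one
    step x step block, scanned row by row (scan order is an assumption)."""
    values = []
    rows, cols = heatmap.shape
    for r in range(0, rows, step):
        for c in range(0, cols, step):
            values.append(float(heatmap[r:r + step, c:c + step].sum()))
    return values

# Example on the 100 x 172 MPC-Heatmap grid.
g1 = partition_priorities(np.random.rand(100, 172))
```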

Fig. 3. The four main steps of execution of the sanitization strategy presented in this study.

Figure 4 shows the different numbers of departures in Roma Termini, grouped in
steps of 1 h, where every color is associated with a different category. There
are differences in the number of train services between the four categories.
RFI defines four types of transportation service: the passenger high-speed
service by the Trenitalia company (TI Pax HSS), the passenger high-speed
service of other private companies (OC-HSS), the long-haul standard low-speed
service (TI-PAX-SU), and the local transportation (TI-PAX-REG), as shown for
example in Fig. 4 for the cases of the working day and the holiday.
MPC-Heatmap 2 is then built considering the average of
historical Meraki heatmaps representing the conditions of the station on days
belonging to the same category as the day of the current MPC-Heatmap used to
build G1, but one hour later. Considering i ∈ N = {0, ..., n}, where n is the
total number of partitions of MPC-Heatmap 2, the value of element i of the list
G2 is p_i, defined as the sum of the priority risk of the corresponding portion
of MPC-Heatmap 2. Starting from the last received position of every robot in
the MPC-Heatmap, the MPC-Server finds for every robot its current starting node
in the graph. Then, it calculates the next nodes for every robot, considering
the contribution of the node values of the two ordered lists G1 and G2. Every
robot receives from the MPC-Server the updated MPC-Heatmap and a list of
destination nodes that it must visit, following the indications of the HMILP
results. Every node represents a specific portion of the MPC-Heatmap. Every
robot, acting independently and autonomously, identifies the positions of the
priority peaks in the portion of the MPC-Heatmap corresponding to the nodes of
its path in the graph. It sorts the found peaks from the highest value to the
lowest; peaks with the same value are ordered in the list from the closest to
its current position to the farthest. Finally, every robot autonomously uses
the APF method to reach the peaks in the list. We assume that the team of
robots is constantly connected to the common MPC-Server thanks to the WiFi
railway service of the station. We consider that the robots communicate with
each other and know the APF paths of the other robots; in this way we avoid
path repetition. The robots, acting independently, are also responsible for the
obstacle-avoidance process. When a robot meets an obstacle, it calculates the
shortest path to go around it: it counts the number of steps necessary to avoid
the obstacle in both directions and then chooses the shortest route. When a
priority peak is reached, the robot starts to clean following a spiral-shaped
path centered at the coordinates of the found peak in the gridmap, as in
[11,12]. We consider that each robot sanitizes an area of 36 pixels of the
gridmap at every timestep, where the timestep has a duration of 1 min as in
[25]. Model Predictive Control (MPC) consists of solving an on-line,
finite-horizon, open-loop optimization problem using the current state as the
initial state for the team of robots. The solution to the problem gives a
sequence of control inputs for the entire control horizon; however, only some
elements of the optimal input sequence are applied to the team of robots. In
our case, the number of applied inputs depends on the number of priority peaks
present in the MPC-Heatmap and on the horizon time. In our context, the Meraki
server sends an update of the Meraki heatmap with a periodicity of 1 h, so our
control horizon is also 1 h.
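The peak-visiting order described above (highest priority first, ties broken by proximity) can be expressed compactly; the snippet below is an illustrative sketch, with all names hypothetical.

```python
import math

def sort_peaks(peaks, robot_xy):
    """Order priority peaks: decreasing priority value, ties broken by
    increasing distance from the robot. `peaks` is a list of ((x, y), value)
    pairs; all names here are illustrative."""
    def key(peak):
        (x, y), value = peak
        return (-value, math.hypot(x - robot_xy[0], y - robot_xy[1]))
    return sorted(peaks, key=key)

# Equal-value peaks are visited nearest-first.
print(sort_peaks([((5, 5), 0.9), ((1, 1), 0.9), ((3, 7), 1.0)], (0, 0)))
```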

Fig. 4. Comparison between the RFI categories of day for the transportation service in the Roma Termini station. Left: different numbers of train departures at different hours for the different categories. Right: different typologies of service for two categories, holiday and working day.

The graph representation used to describe the problem in this study is similar
to the ones proposed in [19,31]. We define the graph G = (N, A), where
N = {0, 1, . . . , n} is the set of vertexes and A = {(i, j) : i, j ∈ N, i ≠ j}
is the set of arcs. Vertex 0 denotes the depot of the robots, while the other
vertexes correspond to the different areas of the partitioned gridmap
MPC-Heatmap. Each area i must be sanitized, so every node has a non-negative
priority number q_i + p_i that represents the total quantity of risk priority
in the corresponding area of the MPC-Heatmap added to the predicted value from
MPC-Heatmap 2. Each arc between two vertexes i and j is associated with a
distance d_{i,j} (in our solution it represents the reward for the robot that
travels between nodes i and j). The value associated with an arc also takes
into account the length of the path in the gridmap needed to reach the
destination node and the presence of fixed obstacles. The reward assigned to an
arc is lower in the case of movements diagonal with respect to the axes of the
gridmap (as shown in Fig. 2(d)), or in case the robot may meet a wall or other
fixed obstacles to be avoided while traveling along the arc. The weight of an
edge is calculated as $d_{i,j} = q_j + p_j$, $(i, j) \in A$. In the HMILP path
decision, it is important to also take into account the cost of the distance
that a robot covers to reach a node. In general, a greater distance, caused for
example by the presence of an obstacle, may reduce the total percentage of
cleaning achieved within the horizon time, because our robots also clean the
path traveled to reach their target, as in [25]. For this reason, with
reference to Fig. 2(d), we define three kinds of arcs: the straight arcs, which
connect two nodes on the same x-axis or y-axis of the MPC-Heatmap (examples of
straight arcs in Fig. 2(d) are the connections between nodes 0 and 1, or 0 and
3); the
diagonal arcs, such as those between nodes 0 and 4 or 1 and 5; and a third type
of arc, in which the robot meets an obstacle along the gridmap path tracked by
that arc. In G, d represents a benefit for the robot running the corresponding
arc; for this reason, the arcs associated with a greater distance, as in the
case of obstacles, are also associated with a lower benefit. In particular, we
consider the following formula to define the weights of the three defined
typologies of arcs:

$d_{i,j} = \begin{cases} d_{i,j}, & \text{straight arcs} \\ d_{i,j}/2, & \text{diagonal arcs} \\ d_{i,j}/2, & \text{obstacles in the path} \end{cases} \qquad (1)$
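Read as code, Eq. (1) simply halves the benefit of non-straight or obstructed arcs; the following one-liner is an illustrative sketch (function and argument names are hypothetical):

```python
def arc_benefit(base: float, diagonal: bool, obstructed: bool) -> float:
    """Eq. (1): full benefit d_{i,j} for straight arcs, halved when the arc
    is diagonal or its gridmap path crosses a fixed obstacle."""
    return base / 2 if (diagonal or obstructed) else base
```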

At the beginning of the horizon time, the robots start their new path in the
graph from the last node reached in the previous hour. The first action that
the MPC-Server must do is the positioning of the robots from the depot to the
current starting node from which each robot has sent its actual position. This
is done by considering a dummy arc with a very high gain. The first step of
every robot's path, from the depot to the starting node, is removed from the
final path considered for the cleaning and sent to every robot at the beginning
of their decentralized operations. The notation, including sets, parameters,
and variables used in the model, is listed in Table 1.

Table 1. Notations used in the mathematical model

Notation   Description
N          set of all nodes, N = {1, . . . , n}
N_0        N plus the depot, N_0 = N ∪ {0}
K          total number of robots
Q          max value of sanitization of a single robot
d_{i,j}    benefit assigned to the arc between nodes i, j ∈ N_0
x_{i,j}    1 if the k-th robot travels the arc (i, j), 0 otherwise
u_i        total priority sanitized by the k-th robot
q_i        priority of G1
p_i        priority of G2

The MILP constraints for our multi-robot sanitization problem are the
following, as in [22]:

$\max(z) \qquad (2)$

$\sum_{i \in N_0} \sum_{j \in N_0} x_{i,j}\, d_{i,j} \geq z \qquad (3)$

$\sum_{j \in N_0} x_{i,j} = 1, \quad i \in N \qquad (4)$

$\sum_{i \in N_0} x_{i,l} - \sum_{j \in N_0} x_{l,j} = 0, \quad l \in N,\ i \neq j,\ i \neq l,\ j \neq l \qquad (5)$

$p_i + q_i \leq u_i, \quad i \in N \qquad (6)$

$x_{i,j} \in \{0, 1\} \qquad (7)$

The objective function of the model is given in (2), which maximizes the
quantity of priority that is removed from the graph G. Here, z represents the
sum of the benefits d along a path selected on the graph G. Constraint (3)
indicates that the objective may not exceed the sum of the benefits of all the
arcs of G. Constraint (4) ensures that each section of the MPC-Heatmap is
visited exactly once. Constraint (5) implies that every robot that leaves a
node will reach a new, different node. Constraint (6) represents a limit on the
quantity of sanitization done in a node i, which cannot be less than the
priority assigned to node i of the graph G. Our approach to solving the
multi-robot optimization problem is the hierarchical MILP as in [3,7]. The
MPC-Server repeats the MILP computation for every single robot, in sequence. At
every step, it finds the path for one robot, sets to zero the nodes of the
graph G belonging to the path of the previous robot, and calculates the new
path for the following robot. The MPC-Server repeats this procedure for every
following robot. In this way, the MPC-Server considers different graphs for
every robot, preventing two robots from selecting the same path even when they
start from the same node. The MPC-Server computes the total reward of the whole
team as the sum of the single rewards gained by every robot, for each
permutation of the sequence of the four robots, and chooses the sequence with
the highest result. The MPC-Server also considers a different depot for every
robot, positioned at different places of the graph G and fixed at the beginning
of the simulation. The assignment of a different depot to one or more robots
may be changed on-line by human intervention to correct the cleaning operation
from a remote control room. In Fig. 5 we see the cleaning action of four
robots, represented by the white square dots. The cold zones, in black, are the
paths they leave when they clean a zone of the environment. The environment is
represented by the MPC-Heatmap of Roma Termini, where 1 pixel corresponds to 1
square meter.
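To make one stage of the hierarchical decomposition concrete, the sketch below encodes constraints (3)-(6) with the open-source PuLP library. This is a minimal illustration, not the authors' implementation: the node set, the tuple-keyed benefit dictionary d, and the priority dictionaries q and p are hypothetical placeholders, and nothing is added beyond the paper's constraints.

```python
from itertools import product
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

def solve_single_robot(nodes, d, q, p, depot=0):
    """One stage of the HMILP: choose arcs x[i, j] on graph G for one robot,
    maximizing the collected benefit z subject to (3)-(6)."""
    n0 = [depot] + list(nodes)                       # N_0 = N ∪ {0}
    prob = LpProblem("single_robot_stage", LpMaximize)
    x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary)
         for i, j in product(n0, n0) if i != j}
    u = {i: LpVariable(f"u_{i}", lowBound=0) for i in nodes}
    z = LpVariable("z", lowBound=0)
    prob += z                                        # objective (2)
    prob += lpSum(d[i, j] * x[i, j] for (i, j) in x) >= z        # (3)
    for i in nodes:                                  # visit each area once (4)
        prob += lpSum(x[i, j] for j in n0 if j != i) == 1
    for l in nodes:                                  # flow conservation (5)
        prob += (lpSum(x[i, l] for i in n0 if i != l)
                 - lpSum(x[l, j] for j in n0 if j != l)) == 0
    for i in nodes:                                  # sanitization bound (6)
        prob += p[i] + q[i] <= u[i]
    prob.solve()
    return [(i, j) for (i, j), var in x.items() if var.value() == 1]
```

In the hierarchical scheme, after each call the priorities of the nodes on the returned path would be set to zero before solving for the next robot.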

3 Experiments and Comparison with MARL Technique

Fig. 5. On the left, the distribution of priorities at 6:00 a.m. on 6 September 2021 in our simulation. On the right, the cleaning action of 4 robots driven by the HMILP-MPC control and moving, thanks to the APF method, in the environment representing the Roma Termini station.

In order to assess the performance of the system in a realistic dynamic
environment, we propose a case study where the presented method is compared
with the solution proposed in [25]. Thanks to the collaboration with RFI, we
received a Meraki heatmap for every hour of the day, representing the whole
planimetry of the station, between 30 August 2021 and 12 September 2021. We
verified that in the considered period no exceptional events occurred that
might change the ordinary behavior of the station (for example, national or
religious holidays, the opening of schools, etc.). Moreover, the two weeks are
temporally consecutive. For these reasons there are no changes in the behavior
of the passengers who frequent the station. We calculated the Euclidean
distance between the MPC-Heatmaps built at every hour of the available days,
and we verified that the average Euclidean distance between days belonging to
the same category is lower than that between days of different categories. With
the available dataset we can therefore predict the distribution and density of
people in the station on a working day of the second week using the average of
the working days of the first week at the same hour. In order to assess the
performance of the team of robots, we define a simple metric c_perc as in [25],
representing the cleaning rate (in percentage) of the map as follows:

$c_{perc} = \frac{x_{tot} - s_{curr}}{x_{tot}} \cdot 100, \qquad (8)$

where $x_{tot} = |X|$ is the number of obstacle-free cells of the map, while
$s_{curr} = \sum_{(i,j)} s(i,j)$ is the sum of the current priorities over all
cells of the heatmap. The value $c_{perc}$ is 100% if the map is totally clean
(i.e., each cell has 0 priority) and 0% if every cell of the map has maximum
priority.
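As a small illustration, c_perc of Eq. (8) can be computed directly on the priority grid; the mask-based formulation below is a sketch with hypothetical variable names.

```python
import numpy as np

def cleaning_rate(heatmap: np.ndarray, free_mask: np.ndarray) -> float:
    """c_perc of Eq. (8): 100 when every obstacle-free cell has zero priority,
    0 when every such cell is at maximum priority (values in [0, 1])."""
    x_tot = float(free_mask.sum())              # x_tot = |X|
    s_curr = float(heatmap[free_mask].sum())    # sum of current priorities
    return (x_tot - s_curr) / x_tot * 100.0
```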
We compared the quality of cleaning for the day 6 September 2021, using for the HMILP the
mean values of the heatmaps of the days 31 August, 1 September, and 2 September
2021, considered working days in the RFI category definitions. As shown in
Fig. 6, the two methods have comparable results and approximately the same
curve shape; both may be considered valid alternatives for the proposed problem.

Fig. 6. Simulation of the sanitization of the day 6 September 2021. The first two curves represent c_perc for the two cases, MARL and HMILP-MPC. The yellow curve represents the spreading of the contaminants without any robot action.

Table 2. Comparison of c_percz [%] between case 1 and case 2

Time   Case 1   Case 2
17:00  48.895   56.871
17:30  32.733   61.778
18:00  47.469   79.425
18:30  33.945   54.611

Our solution has an important advantage with respect to the MARL method
presented in [25]: it can be controlled remotely by technical staff in a
control room. This represents an important innovation in solving the problem
proposed by RFI, because Roma Termini is a challenging environment where
exceptional events may occur and form new aggregations of people. For this
reason, a solution is required that permits correcting the cleaning activity of
the robots by assigning greater priority to a specific area. This feature is
enabled by exploiting the graph-based representation of the environment. We can
select on-line, for every robot, a different depot where it concludes its
mission, choosing among those available in the station. It is also possible to
assign a gain to the nodes inside the subarea that we want to select. We
verified this possibility by simulating a situation in which a rapid increase
of priority risk occurs in a zone Z involving the nodes 7, 10, and 13. We
selected that zone Z by multiplying the weights of these nodes by a gain equal
to 10. We also assigned to the robots the depots at the nodes 10 and 13, while
in normal conditions their depots are assigned to the nodes 2, 7, 9,
11. In order to assess the performance on Z we defined a specific value
$c_{percz}$, similar to the one proposed in Eq. (8). In this case we set
$x_{tot} = |Z|$ as the number of obstacle-free cells of the map in the zone Z,
while $s_{curr} = \sum_{(i,j) \in Z} s(i,j)$ is the sum of the current
priorities over the cells of the heatmap in Z. The results of the cleaning are
shown in Table 2, where we compare the case in which no subarea is selected
(case 1) with the case in which the zone Z is selected (case 2). Columns 2 and
3 report the corresponding values of $c_{percz}$ for the two cases at different
times of day in the simulation. We can see that the team of case 2 sanitizes
the zone Z better, reaching a cleaning peak at 18:00, where $c_{percz}$ equals
79.425%, in contrast to the 47.469% of the team of case 1.

4 Conclusions

In this work we presented a new technique, based on the hierarchical MILP-MPC
method with APF, for the on-line cleaning of very large and dynamic
environments such as a railway station. We verified the possibility of using
historical data about the positions of people, recorded by the Meraki WiFi
infrastructure network, and the knowledge about the transportation service of
the considered railway station to predict the movement of clusters of
passengers with a horizon time of 1 h. We compared our solution with the MARL
technique proposed in [25], highlighting the advantages deriving from the use
of a graph and the MILP: the on-line control of the sanitization strategy by a
remote human staff. We also tested the quality in a real-case simulation, with
real data from the Roma Termini station, the largest and most populated station
of Italy. As future work, we plan to test the solution with real robots and to
investigate how the proposed MPC approach can be integrated into a MARL
framework to improve the cleaning performance. It is also interesting to verify
the possibility of using the MPC method in the training process of the MARL
solution.
Declarations
Acknowledgments. The research leading to these results has been supported by the
Italian Infrastructure Manager Rete Ferroviaria Italiana S.p.A. and by the HARMONY
project (Horizon 2020 Grant Agreement No. 101017008). The authors are solely
responsible for its content.
Conflict of Interests. The authors declare that they have no conflict of interest
regarding this work.

References
1. Andersson, O., Ljungqvist, O., Tiger, M., Axehill, D., Heintz, F.: Receding-horizon
lattice-based motion planning with dynamic obstacle avoidance. In: 2018 IEEE
Conference on Decision and Control (CDC), pp. 4467–4474 (2018)
2. Andersson, O., Wzorek, M., Rudol, P., Doherty, P.: Model-predictive control with
stochastic collision avoidance using Bayesian policy optimization. In: 2016 IEEE
International Conference on Robotics and Automation (ICRA), pp. 4597–4604
(2016)
3. Branca, C., Fierro, R.: A hierarchical optimization algorithm for cooperative vehi-
cle networks. In: 2006 American Control Conference, p. 6 (2006)
4. Cao, Z.L., Huang, Y., Hall, E.L.: Region filling operations with random obstacle
avoidance for mobile robots. J. Field Robot. 5, 87–102 (1988)
5. Charitidou, M., Keviczky, T.: An MILP approach for persistent coverage tasks with
multiple robots and performance guarantees. Eur. J. Control 64, 100610 (2022)
6. Ciuffini, F., Tengattini, S., Bigazzi, A.Y.: Mitigating increased driving after the
Covid-19 pandemic: an analysis on mode share, travel demand, and public trans-
port capacity. Transp. Res. Rec. 03611981211037884 (2023)
7. Earl, M.G., D’Andrea, R.: A decomposition approach to multi-vehicle cooperative
control. Robot. Auton. Syst. 55(4), 276–291 (2007)
8. Hanna, S.: Transport and dispersion of tracers simulating Covid-19 aerosols in
passenger aircraft. Indoor Air 32(1), e12974 (2022)
9. Lakshmanan, A.K., et al.: Complete coverage path planning using reinforcement
learning for tetromino based cleaning and maintenance robot. Autom. Constr. 112,
103078 (2020)
10. Lee, T.-K., Baek, S., Oh, S.-Y.: Sector-based maximal online coverage of unknown
environments for cleaning robots with limited sensing. Robot. Auton. Syst. 59(10),
698–710 (2011)
11. Lee, T.-K., Baek, S.-H., Choi, Y.-H., Oh, S.-Y.: Smooth coverage path planning
and control of mobile robots based on high-resolution grid map representation.
Robot. Auton. Syst. 59(10), 801–812 (2011)
12. Lee, T.-K., Baek, S.-H., Oh, S.-Y., Choi, Y.-H.: Complete coverage algorithm based
on linked smooth spiral paths for mobile robots. In: 2010 11th International Con-
ference on Control Automation Robotics Vision, pp. 609–614 (2010)
13. Mallya, D., Kandala, S., Vachhani, L., Sinha, A.: Priority patrolling using multi-
ple agents. In: 2021 IEEE International Conference on Robotics and Automation
(ICRA), pp. 8692–8698 (2021)
14. Miao, X., Lee, J., Kang, B.-Y.: Scalable coverage path planning for cleaning robots
using rectangular map decomposition on large environments. IEEE Access 6,
38200–38215 (2018)
15. Murtaza, G., Kanhere, S., Jha, S.: Priority-based coverage path planning for aerial
wireless sensor networks. In: 2013 IEEE Eighth International Conference on Intel-
ligent Sensors, Sensor Networks and Information Processing, pp. 219–224 (2013)
16. Narang, M., et al.: Fighting Covid: an autonomous indoor cleaning robot (AICR)
supported by artificial intelligence and vision for dynamic air disinfection. In: 2021
14th IEEE International Conference on Industry Applications (INDUSCON), pp.
1146–1153 (2021)
17. Nasirian, B., Mehrandezh, M., Janabi-Sharifi, F.: Efficient coverage path planning
for mobile disinfecting robots using graph-based representation of environment.
Front. Robot. AI 8, 4 (2021)
452 R. Caccavale et al.

18. Nasirian, B., Mehrandezh, M., Janabi-Sharifi, F.: Efficient coverage path planning
for mobile disinfecting robots using graph-based representation of environment.
Front. Robot. AI 8 (2021)
19. Madridano, Á., Al-Kaff, A., Martín, D., de la Escalera, A.: Trajectory planning for
multi-robot systems: methods and applications. Expert Syst. Appl. 173, 114660
(2021)
20. Oh, J.S., Choi, Y.H., Park, J.B., Zheng, Y.: Complete coverage navigation of clean-
ing robots using triangular-cell-based map. IEEE Trans. Industr. Electron. 51(3),
718–726 (2004)
21. Pasqualetti, F., Durham, J.W., Bullo, F.: Cooperative patrolling via weighted
tours: performance analysis and distributed algorithms. IEEE Trans. Robot. 28(5),
1181–1188 (2012)
22. Qin, W., Zhuang, Z., Huang, Z., Huang, H.: A novel reinforcement learning-based
hyper-heuristic for heterogeneous vehicle routing problem. Comput. Ind. Eng. 156,
107252 (2021)
23. Rasekhipour, Y., Khajepour, A., Chen, S.-K., Litkouhi, B.: A potential field-based
model predictive path-planning controller for autonomous road vehicles. IEEE
Trans. Intell. Transp. Syst. 18(5), 1255–1267 (2017)
24. Ren, Y., et al.: D-log: a WIFI log-based differential scheme for enhanced indoor
localization with single RSSI source and infrequent sampling rate. Pervasive Mob.
Comput. 37, 94–114 (2017)
25. Caccavale, R., Calà, V., Ermini, M., Finzi, A., Lippiello, V., Tavano, F.: Multi-
robot sanitization of railway stations based on deep Q-learning. In: AIRO 2021: 8th
Italian Workshop on Artificial Intelligence and Robotics of the 20th International
Conference of the Italian Association for Artificial Intelligence (AI*IA 2021) (2021)
26. Setti, L., et al. Airborne transmission route of Covid-19: why 2 meters/6 feet of
inter-personal distance could not be enough, April 2020
27. Song, P., Kumar, V.: A potential field based approach to multi-robot manipulation.
In: Proceedings 2002 IEEE International Conference on Robotics and Automation
(Cat. No.02CH37292), vol. 2, pp. 1217–1222 (2002)
28. Stump, E., Michael, N.: Multi-robot persistent surveillance planning as a vehicle
routing problem. In: 2011 IEEE International Conference on Automation Science
and Engineering, pp. 569–575 (2011)
29. Tardivo, A., Zanuy, A.C., Martı́n, C.S.: Covid-19 impact on transport: a paper
from the railways’ systems research perspective. Transp. Res. Rec. 2675(5), 367–
378 (2021)
30. Tavakoli, M., Carriere, J., Torabi, A.: Robotics, smart wearable technologies, and
autonomous intelligent systems for healthcare during the Covid-19 pandemic: an
analysis of the state of the art and future vision. Adv. Intell. Syst. 2(7), 2000071
(2020)
31. Zhou, X., Wang, H., Ding, B., Hu, T., Shang, S.: Balanced connected task alloca-
tions for multi-robot systems: an exact flow-based integer program and an approx-
imate tree-based genetic algorithm. Expert Syst. Appl. 116, 10–20 (2019)
Exploration System for Distributed Swarm Robots Using Probabilistic Action Decisions

Toui Sato1(B), Kosuke Sakamoto1, Takao Maeda2, and Yasuharu Kunii1

1 Chuo University, Tokyo, Japan
{a14.pw6d,ksakamoto605}@g.chuo-u.ac.jp, kunii@hmsl.elect.chuo-u.ac.jp
2 Tokyo University of Agriculture and Technology, Tokyo, Japan
t-maeda@go.tuat.ac.jp

Abstract. In this paper, we propose a new distributed exploration system
consisting of a parent robot and a group of ground-mobile child robots
that search the surrounding area, to address the low efficiency of
planetary exploration by relatively large robots such as NASA's
Curiosity. The proposed system uses a probabilistic algorithm in which
the robots act based on a specified probability distribution. A
distributed swarm robot system based on the proposed algorithm was
constructed in real space and validated by developing and testing robots
named RED for experimental demonstration. The effectiveness of the
proposed system was evaluated through experiments. The experimental
results showed that the robots explored the region according to the
specified probability distribution.

1 Introduction
In recent years, various studies have been conducted on multi-robot agent
systems and their hierarchical forms, such as swarm robots, parent-child
robots, and so on [1–4]. These studies have considered applications such as
environmental measurement on the ground, disaster relief, and planetary
exploration. However, the size of such robots, for example a planetary rover,
is relatively large because a single robot must perform a variety of tasks [5].
As a result, the robot requires strict safety, the development period and cost
become large, and the risk of loss due to failure is high. In addition, the
traveling speed is extremely slow because careful operation is required to
maintain safety, and only limited environments, such as along the path, can be
measured. Therefore, the efficiency of exploration by large robots is low. In
order to solve these problems, small robots have been developed by many
institutes, such as JAXA [6–10]. The advantages of small robots are as follows:
1. the development period and cost can be reduced;
2. the robots can be mounted together with other payloads.
The details of extreme environments, such as a planetary surface, are almost
unknown. Therefore, it is hard for the robots to localize themselves.

2 The Proposed Exploration Swarm Robot System and Area Control Algorithm

In this section, we present the proposed exploration system for wide areas. We
also present the concept of our proposed algorithm and its description.

2.1 Distributed Robot System for Wide-Area Exploration by Swarm Robots

This study focuses on a small swarm robot system without path planning. A
schematic diagram is shown in Fig. 1 and the system structure in Fig. 2. We
consider a simple configuration for the exploration system, without assuming
that the ground robots have abundant functions and performance in terms of
sensing and communication capabilities. The proposed exploration system assumes
a heterogeneous small-robot system in which a small hopping leader robot
occupies the upper layer and small wheeled swarm robots occupy the lower layer.
We do not consider a large robot as a leader robot, since this study focuses on
small swarm robot systems.
The wheeled swarm robots in the lower tier are basically engaged with a single
leader robot and explore the surrounding area widely with the leader as a
dispersion center. It is generally difficult for small ground robots to observe
the environment and create a map by their own sensing, due to the constriction
of their field of view. In addition, small ground robots do not create paths,
for the reasons mentioned above. The ground robots need to be aware of their
positional relationship with the leader robot, and for this purpose we employ a
previous study [11]. Since there are studies on global and local path planning
by leader robots, this paper does not focus on the motion planning or control
of the leader robot [12].
The role of the leader robot in this study is to serve as a reference point for
the ground robots, which cannot sense the environment. Each ground robot needs
to periodically observe the leader robot at an altitude higher than the
narrowing of its field of view; the leader robot, for example, leaps directly
above the ground robots at the same position until the ground robots finish
their exploration. At this time, the leader robot also repeats its observation
to determine the next moving point and searches for the next moving position.
This paper focuses only on the exploration method around this distributed
reference point and shows how to design an algorithm for ground robots that act
independently and autonomously, based only on the positional information
relative to the leader robot.
In addition, the loss of some robots does not significantly affect this system,
because the total number of robots is much larger than the number of losses.
Thus, the system has an advantage in terms of risk diversification [13].

Fig. 1. The overview of the proposed system

Fig. 2. The structure of the proposed system

2.2 Overview of the Region Control Algorithm with Probabilistic Action Decision

In this method, we adopt a random walk as the basic behavioral model of the
robots and suppress their actions in a stochastic manner [14]. This section
explains the proposed stochastic distributed exploration algorithm for
wide-area observation. In the assumed exploration system, the objective is for
the lower-layer robots to perform wide-area distributed exploration around a
dispersion center, such as the leader, without commands from the upper layer.
There are many randomness factors, such as the presence of obstacles and
topographical features, in natural environments or disaster areas, so it is
difficult to predict the robot positions accurately. Considering the low
intelligence, low functionality, and low performance of the robots, the
algorithm should be as simple as possible. Formation control by animal mimicry
has been the mainstay of conventional swarm-intelligence studies; however,
these methods are not suited for the kind of exploration this study focuses on
[15]. Therefore, we adopt a stochastic and simple random-walk model as the
basic algorithm. The random-walk model is generally used to describe the
phenomenon of particle dispersion from an arbitrary point, which is known to
gradually expand spatially with the passage of time. Therefore, we consider the
random-walk-based method suitable for wide-area exploration. However, if the
spreading distance becomes too large, many robots move away from the dispersion
center and become isolated. Therefore, it is necessary to moderately restrain
probabilistic movements away from the center of dispersion.

2.3 Algorithm for Probabilistic Action Decision Making


For this purpose, we propose the following stochastic action control method
using the Metropolis method. The basic behavior of the robot is a repetition of
turning and walking straight ahead, with the turning angle selected randomly.
In a general random walk, since the variance is proportional to time, the swarm
gradually diffuses as time goes by. Thus, it is necessary to control the region
where the followers stay. A similar study is a method for controlling the
distribution of robot presence in a given region with a probabilistic approach
in which the distribution of agents over the state space is modeled [16]. In
this study, the robot presence region is defined based on the distance from the
dispersion center, and each robot makes probabilistic action decisions based on
this presence region. The stochastic action decision method is as follows.
(1) Decide the destination (x', y').
Determine the direction of movement by random walk and estimate the
amount of movement from the current position. The estimation of the
amount of movement is based on information obtained internally by the
robot, such as rotary encoder readings. Let dx and dy be the displacements
along the coordinate axes, respectively:

$(x', y') = (x + dx, y + dy) \qquad (1)$


(2) Substitute the distance between the current location and the destination
into the horizontal axis of the proposal distribution.
Prepare the proposal distribution with the distance between the robot and
the dispersion center on the horizontal axis and the probability on the
vertical axis. For the proposal distribution, we use the normal distribution.
This is because the probability is high at the center and decreases as it
spreads outward, thus realizing an algorithm that always adopts actions toward
the center and judges actions toward the outside probabilistically, as described
in the previous section.
The normal distribution has internal parameters, the standard deviation $\sigma$ and
mean $\mu$, and can be expressed by the following Eq. (2).

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$    (2)

(3) Calculate the ratio of probabilities.
The probability $p$ of a follower at each location is calculated by substituting
the distance $\alpha$ between the leader and the follower before moving, and the
estimated distance $\alpha'$ between the leader and the follower at the destination,
into the proposal distribution. The ratio of $p(\alpha')$ after the move to
$p(\alpha)$ at the current distance is denoted as $r$.

$r = \frac{p(\alpha')}{p(\alpha)}$    (3)

(4) Generate a uniform random number $R$ ($0 < R < 1$) and compare it with $r$
to evaluate the behavior.

If $R < r$, the destination $(x', y')$ is adopted, and the action is actually taken.
If $R > r$, the generated move is rejected and a rejection action is taken instead.
Although various patterns of "rejection action" could be considered, such as
re-selecting the direction of movement or moving toward the center of
dispersion, in this study the robot is given a lateral (left-right) movement toward
the center of dispersion, which allows the region to be observed widely.
The "rejection action" induces the robots to rotate within the dispersed area
and reduces the bias of the observation.

With the stochastic control method described above, each robot always accepts
moves toward the dispersion center, since in that case $r > 1$, while moves away
from the dispersion center are probabilistically rejected. The proposed method
can confine a group of robots within $3\sigma$ of the proposal distribution and
perform a diffuse exploration. The internal parameters of the proposal
distribution can thus be used for stochastic control of the robots, and the size
of the search area around the center of dispersion can be controlled.
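To make the decision rule concrete, the following is a minimal Python sketch of one turn-and-go-straight step with the Metropolis-style acceptance test of Eqs. (1)-(3). The function names and the simplification of the rejection action (staying in place rather than moving laterally toward the center) are ours, not the paper's.

```python
import math
import random

def proposal_pdf(d, mu, sigma):
    """Normal proposal distribution of Eq. (2), evaluated at distance d."""
    return math.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def metropolis_step(pos, center, mu=0.0, sigma=2.0, step=1.0):
    """One turn-and-go-straight step with Metropolis-style acceptance."""
    theta = random.uniform(0.0, 2.0 * math.pi)             # random turning angle
    cand = (pos[0] + step * math.cos(theta),
            pos[1] + step * math.sin(theta))               # destination, Eq. (1)
    r = proposal_pdf(math.dist(cand, center), mu, sigma) \
        / proposal_pdf(math.dist(pos, center), mu, sigma)  # ratio, Eq. (3)
    if random.random() < r:   # moves toward the center have r > 1 and are always taken
        return cand
    return pos                # rejected: the real robot performs the "rejection action"

# Example: 150 steps around a dispersion center at (10, 10)
pos = center = (10.0, 10.0)
for _ in range(150):
    pos = metropolis_step(pos, center)
```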

3 Evaluation by Simulation

In this section, we show how the control algorithm with stochastic action decisions
proposed in the previous section behaves in simulation.

3.1 Algorithm Verification by Simulator and Comparison of Search Efficiency by Changing Control Parameters

For the simulation using the stochastic domain control algorithm, the number of
robots was set to 10, the straight-ahead distance per step was set to 1.0 [m],
and the total number of steps was set to 150. The internal parameters of the
proposal distribution were set to mean μ0 = 0 and standard deviation σ0 = 2.0.
The center of variance is set at (10, 10), and the position of the center of variance
is set to the initial position of the robot. Figure 3 shows the results of the search
around the dispersion center using the stochastic dispersion algorithm. Figure 4
shows the ideal distribution for the same control parameters. The values of the
color bars indicate the number of overlapping observations for each square of the
grid map.
From Fig. 3, it is confirmed that the stochastic region density control algo-
rithm controls the robot within a region of radius 6.0[m] from the center of
dispersion as a result of suppressing the diffusion phenomenon over time. Since
the robot does not deviate significantly from the radius 3σ = 6.0 [m],
which corresponds to the tail of the proposal distribution prepared in advance,

it can be said that the robot is controlled within the range in accordance with
the control parameters. This is also true in comparison with the ideal distribu-
tion shown in Fig. 4. In Fig. 3, all cells except the unexplored ones are colored,
and the blue cells toward the outside are lighter in color, in accordance with the
color specification. It can be confirmed that this area is within a radius of 6.0 [m] from the
dispersion center.
We also examined whether the proposed method produced exploration results
in accordance with the proposed distribution. A histogram of the x-axis of Fig. 3
is shown in Fig. 5. It can be seen that a high density of robots is formed around
the center of dispersion. The histogram was fitted with a normal distribution,
whose standard deviation was found to be σ0 = 1.7. Although this value is not
exactly equal to the set value σ0 = 2.0, the results are close in tendency and
numerical value, and it is confirmed that the control follows the proposal
distribution.

Fig. 3. Stochastic distributed control
Fig. 4. Ideal of stochastic distributed control

When the mean value of the control parameters is μ0 = 0, as shown in Fig. 3,
the exploration progress toward the outside of the area is delayed, and
exploration efficiency is lost.
To improve exploration efficiency, it is desirable to eliminate the concentration
of movement around the center of dispersion and allow the robots to actively
move to more distant locations. Accordingly, our method alleviates the
concentration of robot movement around the dispersion center by changing the mean
value μ0 of the proposal distribution, a control parameter, away from 0, thereby
improving the exploration efficiency.
In the case of μ0 = 1.0, the assignment value of the proposed distribution
is the distance between the robot and the dispersion center, which means that

Fig. 5. Exploration density by histogram

the frequency of exploration behavior of the robot at a distance of 1 [m] from
the dispersion center increases. This is expected to relieve the concentration of
exploration frequency at the center of dispersion, thereby increasing the efficiency
of exploration.
The results of the simulation by changing the mean value, which is an internal
parameter of the proposed distribution, are shown in Fig. 6. In this case, the
mean value μ0 = 5.0 and the standard deviation σ0 = 2.0 were used as the
internal parameters. The ideal distribution for the same parameters is shown in
Fig. 7. The number of units was set to 10 and the number of steps was set to
150. The results show that the search density is not concentrated in the center
of dispersion, but is distributed over a wide area around the center of dispersion,
and the bias of the search density is suppressed compared to the case where the
mean μ0 = 0 and standard deviation σ0 = 2.0 are used as internal parameters.

Fig. 6. Torus probability distribution density
Fig. 7. Ideal of torus probability distribution density

4 Development and Implementation of Exploration Systems

In this section we describe the exploration system we have built by introducing
our proposed algorithm. We describe the hardware and software of the entire
exploration system and the robot called RED that we developed to validate the
proposed algorithm.

4.1 Hardware Development of the Robot for Actual Equipment Verification

In order to demonstrate the stochastic region control algorithm described in the
previous section, we developed the tubular robot shown in Fig. 8 as a ground
robot for experimental verification; it corresponds to the lower level of the
exploration system shown in Fig. 2. We named this robot RED and will refer to
it as the RED robot from now on. The robot has a width of 20.0 [cm], a depth of
17.0 [cm], a height of 13.0 [cm], and a weight of 600 [g]. It moves by repeatedly
turning and going straight. The wheels of the RED robot are uneven to improve
its ability to overcome obstacles, and their sides have protrusions to make it
easier for the robot to right itself if it gets stuck.
Figure ?? shows how the RED robot observes the floor with its camera. This
floor camera allows the robots to observe and cover the space while turning and
traveling.
The hardware configuration of the robot is shown in Fig. 9. The RED robot
consists of a Raspberry Pi equipped with a motor unit for driving and a gyro
sensor for posture assistance. This robot is equipped with cameras on the front
and the top, respectively. The camera on the front takes pictures of the floor to
detect scratches. In this system, an LED marker is used in place of robots in the
upper layer in order to validate the probabilistic exploration algorithm for the
ground robots in the lower layer. The camera at the top is used to detect the
LED marker, which is treated as a dispersion center. The RED robot measures

Fig. 8. A robot named RED: There is a hole on the top surface to put out a camera
to recognize a photo of an LED marker in the direction of the ceiling. The robot is
equipped with a distance sensor, a camera, and LED lights in the front.

Fig. 9. Structure of robot system
Fig. 10. The RED robot explores the area around one or more LED markers as a distributed center.

in which direction the LED markers are located and how far away they are. The
measurements are used to run the proposed algorithm. The LED markers are
placed in the exploration space as shown in the Fig. 10 and the RED robot is
allowed to explore around them. The height (z-component) of the LED marker is
given to the robots as a parameter in advance, and the marker is detected by the
camera on the top surface; from this detection and the known height, the distance
to the LED marker in the x-y plane can be calculated.
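A short sketch of this distance computation may help; it assumes an ideal pinhole camera pointing straight up, with the marker height above the camera known in advance. All names and calibration parameters here are illustrative, not taken from the paper.

```python
import math

def marker_xy_distance(u, v, cx, cy, f_px, marker_height):
    """Estimate the x-y distance and bearing to an LED marker detected at pixel
    (u, v) by an upward-facing camera with principal point (cx, cy) and focal
    length f_px in pixels. marker_height is the known height of the marker
    above the camera."""
    # The pixel offset from the image center gives the tangent of the angle
    # between the vertical optical axis and the ray toward the marker.
    tx, ty = (u - cx) / f_px, (v - cy) / f_px
    distance = marker_height * math.hypot(tx, ty)  # horizontal offset on the ground plane
    bearing = math.atan2(ty, tx)                   # marker direction in the image frame
    return distance, bearing
```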

Fig. 11. Structure of software system

4.2 Software and Exploration System Configuration

The software configuration is shown in Fig. 11, where the information is
integrated by a Git server and a UDP server on the PC. The software on each RED
robot is kept up to date by referring to the Git server. In addition, we developed
the GUI application shown in Fig. 12, which integrates the UDP server. Each
RED robot sends its individual address, acquired images, and its own state to

the application, and the application sends parameters necessary for each robot
to operate. The application sends the following parameters to the robot.

1. Height of the LED marker (measured in advance)
2. Standard deviation σ and mean value μ of the proposal distribution
3. The robot's straight-line speed

The RED robots can act autonomously with the proposed algorithm based on
the parameters and the positional relationship with the LED markers, which are
the dispersion centers. They send the following information to this application.

1. Estimated distance and direction from the center of dispersion at each step
2. Camera pictures in the floor and ceiling directions at each step
3. Status of each robot, including connection status

The application can analyze the information received from the robots to
estimate their spatial distribution and the corresponding parameters. In
addition, by observing the exploration of the robots with an external camera, we
can identify the explored and unexplored locations. By determining the percentage
of explored and unexplored locations within the exploration area, we can estimate
the coverage of space over time.
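As an illustration, coverage over time can be computed from the sequence of heat maps as the fraction of grid cells inside the exploration circle that have been observed at least once; this sketch and its names are ours, not part of the paper's software.

```python
import numpy as np

def coverage_over_time(heatmaps, center, radius):
    """heatmaps: list of 2-D observation-count arrays, one per time step;
    center and radius, in grid cells, define the circular exploration area.
    Returns the fraction of in-area cells observed at least once at each step."""
    yy, xx = np.indices(heatmaps[0].shape)
    inside = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2
    total = np.count_nonzero(inside)
    return [np.count_nonzero((h > 0) & inside) / total for h in heatmaps]
```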

5 Exploration Field Tests and Results


In this section, we present experimental results from a real robot system to
validate the proposed exploration algorithm.

Fig. 12. Applications for Management and Analysis of Swarm Robot Systems

5.1 Description of Experimental Environment/Conditions and Data Collected

The exploration field was a concrete area within the black circle shown in
Fig. 13. Ten robots were operated in the exploration area. The control
parameters of the proposed algorithm were set to a mean value of μ = 5.0
and a standard deviation of σ = 2.0. The radius of the black circle was set
to 8.0 [m], so the area of the exploration region is 64π [m²]. The exploration
was performed for 1200 [sec]. The RED robots were observed by an external
camera. The trajectories of their movements were extracted from the video and
are highlighted in white in Fig. 13. The LED marker post was placed at the
center of the circular area in Fig. 13, but is excluded from the figure for the
sake of analysis.
In surface exploration, the size of the area and the number of overlaps
(uniformity) affect the efficiency of exploration. To validate the exploration
efficiency, the experiments are evaluated in terms of the "exploration time"
required to reach a certain level of coverage in an area.
We now describe a heat map created to evaluate exploration efficiency, shown in
Fig. 14. The numbers in the color map indicate the number of overlapping
explorations of each location by the robot swarm; the number of overlaps
increases as the color changes from blue to red.
to red. The robots estimate the direction and distance of the dispersion center at
each step, and the graph summarizing the estimated distance among these data
is shown in Fig. 15. A graph of the coverage of the exploration area versus time
was created from the information on the unexplored and explored squares in the
exploration area obtained from the heat map. This graph is shown in Fig. 16.

Fig. 13. Robot movement trajectory obtained from video analysis of the results of the wide-area exploration experiment (painted in white)
Fig. 14. Heat map of the robot's movement trajectory

Fig. 15. Histogram of estimated distance between robot and dispersion center
Fig. 16. Graph of spatial coverage of the exploration area of radius 8.0 [m] versus exploration time (in seconds)

5.2 Examination of Collected Data

From Fig. 6 and Fig. 14, it is confirmed that torus-like region control works
in the actual experiment as well as in the simulation, depending on the control
parameters. Torus-like region control makes the mean value μ of the control
parameters larger than 0 to prevent the robot's stochastic behavior
from concentrating around the center of dispersion. Figure 15 shows that the
frequency of robot movement around the position 5.0 [m] from the dispersion
center is higher. The distribution of robots was shaped according to the control
parameter mean μ = 5.0. This data confirms that torus-like area control was
achieved in the actual experiment. Figure 16 shows that in 1200 [sec], the robots
were able to explore 78% of the area with a radius of 8.0[m] from the LED
marker, which is the center of dispersion.

6 Conclusion

In this paper, we propose a distributed exploration system using swarm robots


that can maintain swarm formation in a certain exploration area and observe the
entire area, while the robots act autonomously and flexibly. A stochastic explo-
ration algorithm for area control based on a random walk model of swarm robots
is proposed. Simulation results show that the proposed algorithm can observe the
area around an arbitrary center of dispersion and control swarms in a certain
area while maintaining autonomy and diffuseness. An actual wide-area explo-
ration experiment was conducted using the RED robot swarm, a robot for algo-
rithm verification. The experimental results showed that the actual exploration
system is as effective as the simulation in controlling by the control parameters.
In addition, 78% of the 8.0 [m] radius exploration area was observed within
1200 [sec]. Therefore, the proposed exploration system and control algorithm are
practical for wide-area distributed exploration.

Acknowledgment. The results of this research were carried out in collaboration with
the Space Exploration Innovation Hub of the National Aerospace Laboratory of Japan
Aerospace Exploration Agency (JAXA), Takenaka Corporation, and Chuo University.
We would like to take this opportunity to express our deepest gratitude.

References
1. Zungeru, A.M., Ang, L.M., Seng, K.P.: Classical and swarm intelligence based
routing protocols for wireless sensor networks: a survey and comparison. J. Netw.
Comput. Appl. 35, 1508–1536 (2012)
2. Ichikawa, S., Hara, F.: Characteristics on swarm intelligence generated in multi-
robot system. J. Robot. Soc. Japan 13(8), 1138–1144 (1995)
3. Nouyan, S., Groß, R., Bonani, M.: Teamwork in self-organized robot colonies. IEEE
Trans. Evol. Comput. 13(4), 695–711 (2009)
4. Hamann, H.: Swarm Robotics: A Formal Approach. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-74528-2
5. Lindemann, R.A., Voorhees, C.J.: Mars exploration rover mobility assembly
design, test and performance. In: 2005 IEEE International Conference on Systems,
Man and Cybernetics, pp. 450–455 (2005)
6. Tao, J., Deng, Z., Hu, M., Liu, J., Bi, Z.: A small wheeled robotic rover for planetary
exploration. In: 2006 1st International Symposium on Systems and Control in
Aerospace and Astronautics (2006)
7. Middleton, A., Paschall, S., Cohanim, B.: Small lunar lander/hopper perfor-
mance analysis. In: Proceedings of IEEE International Conference on Systems,
Man and Cybernetics (2010)
8. Yoshimitsu, T., Kubota, T., Nakatani, I.: MINERVA rover which became a small
artificial solar satellite. In: 20th Annual AIAA/USU Conference on Small Satellites
(2006)
9. Rubenstein, M., Ahler, C., Nagpal, R.: Kilobot: a low cost scalable robot system
for collective behaviors. In: Robotics and Automation (ICRA) (2012)
10. Kazuya, Y., Yoichi, N., Takeshi, M.: Sampling and surface exploration strategies in
MUSES-C and future asteroid missions. In: 20th Annual AIAA/USU Conference
on Small Satellites (2003)
11. Matsumoto, R., Sato, T., Maeda, T., Kunii, Y., Kida, H.: Position and attitude
estimation for distributed exploration of small rovers using flash light from leader
agent. In: 2019 9th International Conference on Recent Advances in Space Tech-
nologies (RAST) (2019)
12. Ushijima, M., Kunii, Y., Maeda, T., Yoshimitsu, T., Otsuki, M.: Path planning with
risk consideration on hopping mobility. In: IEEE/SICE International Symposium
on System Integration (2017)
13. Zhu, Y.F., Tang, X.M.: Overview of swarm intelligence. In: International Confer-
ence on Computer Application and System Modeling, ICCASM 2010 (2010)
14. Nurzaman, S.G., Matsumoto, Y., Nakamura, Y., Koizumi, S., Ishiguro, H.: Yuragi-
based adaptive searching behavior in mobile robot: from bacterial chemotaxis to
Lévy walk. In: The 2008 IEEE International Conference on Robotics and Biomimet-
ics, Bangkok, Thailand, 21–26 February (2009)
15. Reynolds, C.W.: Flocks, herds, and schools: a distributed behavioral model. In:
ACM SIGGRAPH 1987 Conference Proceedings. Anaheim, California (1987)
16. Milutinovic, D., Lima, P.: Modeling and optimal centralized control of a large-size
robotic population. IEEE Trans. Rob. 22(6), 1280–1285 (2006). https://doi.org/
10.1109/TRO.2006.882941
Distributed Estimation of Scalar Fields
with Implicit Coordination

Lorenzo Booth and Stefano Carpin(B)

Department of Computer Science and Engineering, University of California, Merced,


5200 N Lake Road, Merced, CA 95343, USA
{lbooth,scarpin}@ucmerced.edu

Abstract. Motivated by our ongoing work in robotics for precision agri-


culture, in this work we consider the problem of estimating a scalar field
using a team of robots collecting samples and subject to a travel bud-
get. Our fully distributed method leverages the underlying properties of
Gaussian Process regression to promote dispersion using minimal infor-
mation sharing. Extensive simulations demonstrate that our proposed
solution outperforms alternative approaches.

Keywords: Coordination · Applications in Agriculture · Estimation

1 Introduction
In this work we consider the problem of estimating a scalar field using a team of
collaborating robots. This problem is motivated by our ongoing research in pre-
cision agriculture, where it is often necessary to estimate the spatial distribution
of parameters such as soil moisture, nitrates, or carbon dioxide flux that can be
modeled as spatially-varying scalars. To model this underlying field, we use a
Gaussian process (GP) [6]. GPs are not only elegant and efficient computational
tools to solve regression problems, but are also the model of choice in geostatis-
tics and agroecological modeling, where GP regression is known as kriging [9]. A
key feature about GP regression is that this model allows to easily estimate the
uncertainty of the predictions, thus enabling iterative methods where new sensor
measurements are collected in regions of high uncertainty to refine accuracy.
In agricultural applications one is often faced with the problem of estimating
quantities over very large domains, therefore the use of multiple robots allows for
quicker coverage of the region of interest. In these conditions, robots have to plan
their motions with multiple objectives. When an exhaustive, systematic coverage

L. Booth is supported by the Labor & Automation in California Agriculture (LACA)
project (UC-MRPI initiative). S. Carpin is partially supported by the National
Science Foundation (NSF) under NSF Cooperative Agreement Number EEC-1941529
(IoT4Ag) and USDA/NIFA under
award # 2021-67022-33452. Any opinions, findings, conclusions, or recommendations
expressed in this publication are those of the authors and do not necessarily reflect the
views of the funding agencies.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 466–478, 2024.
https://doi.org/10.1007/978-3-031-51497-5_33
Distributed Estimation of Scalar Fields 467

of the entire region is not feasible, it is important to collect samples at the most
informative places. It is also necessary to be aware of the limited distance that
can be covered before the robot must be refueled or recharged. Therefore robots
have to plan paths that will eventually terminate at a designated end point before
they run out of energy or fuel. Likewise, it is also important to consider that
in real world applications the energy consumed to move between two locations
is not deterministically known upfront; rather, it is a random variable whose
realization is only known at run-time. For example, a robot may need to take a
detour to reach a certain place, or it may move through a muddy area causing
wheel slippage, etc. Finally, in rural regions communication infrastructures are
often lacking or limited, and therefore robots cannot assume the availability of
broadband communication channels.
With these motivations in mind, in this paper we present an algorithm to
solve this estimation problem with a team of robots. Coordination between the
agents is obtained with minimal information exchange and by leveraging the
mathematical properties of GPs to promote dispersion. The approach is fully
distributed and each robot just broadcasts to the rest of the team the limited
information consisting of the locations where it has collected data, the value it
measured, and its unique identifier—no more than a handful of bytes at a very
low frequency. No other communication is required and robots never exchange
their individual plans or models. In addition, each robot uses a refinement of our
recently developed planner for stochastic orienteering to ensure that it reaches
the final location before it runs out energy. Through extensive simulations we
demonstrate that this approach ensures robots collect samples in areas leading
to a more accurate reconstruction of the underlying unknown scalar field.
The rest of the paper is organized as follows. Selected related work is pre-
sented in Sect. 2. The problem formulation is introduced in Sect. 3 and our meth-
ods are discussed in Sect. 4. In Sect. 5 we present extensive simulations to eval-
uate our proposal and in Sect. 6 we draw the conclusions.

2 Related Work
The problem considered in this contribution is related to the orienteering combi-
natorial optimization problem where one has to plan a path through a weighted
graph to gather the maximum sum of vertex rewards while ensuring that the
length of the path does not exceed a preassigned travel budget. Recently, we
have extensively studied this problem in agricultural settings, both for single
agents [7,13] and multiple agents [14], and we also considered its stochastic vari-
ants [12]. In all these works, however, rewards associated with vertices were
static and assigned upfront, while in this work rewards associated with ver-
tices are iteratively re-estimated based on the gathered samples. Moreover, our
former multi-robot solution [14] was centralized, while we here propose a fully
distributed approach.
The use of Gaussian Processes for estimating scalar fields with robots has
also been explored in the past. Suryan and Tokekar [11] proposed an algorithm
468 L. Booth and S. Carpin

to reduce in minimal time the variance of the scalar field being estimated with a
GP. Their solution uses a team of robots, but does not consider travel budgets,
i.e., robots can travel as much as needed. Similarly, Di Caro et al. [2] propose
an estimation algorithm for GPs aiming at reducing uncertainty under time and
communication constraints. However, their solution does not consider a travel
budget.
Our work is also related to Informative path planning (IPP) where the
emphasis is on planning paths to collect the most informative samples [3,4].
In IPP, however, the set of candidate sample points is not given, but is rather
determined by the algorithm, and the travel budget is typically not explicitly
considered.

3 Problem Statement

We consider the problem of estimating a scalar function f : X → R defined over


a bounded region of interest X given a limited number of observations collected
by multiple robots. The goal is to determine where robots should collect these
samples. As commonly done in this domain, a graph structure is used to model
the navigable environment. We assume that observations of the underlying scalar
function can be collected at a limited set of sampling locations, denoted as the
set of vertices V in the graph. This assumption holds in a variety of real-world
agricultural applications, where specific points of interest (e.g., sentinel trees
that serve as early-indicators of ecosystem health) have been pre-identified, or
when the robots sense the environment by leveraging pre-deployed infrastructure
(such as soil sensors implanted in the ground). This assumption is not restrictive;
when prior sensing locations are not specified, one can choose V arbitrarily, e.g.
as a set of equally-spaced points covering the region of interest, as is typical in
naïve surveying schemes. We consider this navigable environment as a complete
graph, with edge set E = V × V . To each edge e ∈ E we assign a random
variable c(e) representing the movement cost (e.g. energy spent) when the robot
traverses the edge. We assume that the density functions characterizing these
random variables are known.
All robots must begin at an assigned start vertex vs and end at an assigned
goal vertex vg before they run out of energy. Each robot ri starts with a travel
budget Bi . When the robot traverses an edge e, its budget decreases by the
random value drawn from the random variable c(e). For simplicity, we assume
that all Bi s are the same, but this is not a strict requirement. Each time the
robot visits a location v ∈ V , using its onboard sensor(s) it collects a sample of
the underlying function f, obtaining a noisy observation $y_v = f(v) + \varepsilon$, where
$\varepsilon \sim \mathcal{N}(0, \sigma_m^2)$ is measurement noise, Gaussian-distributed with zero mean and
variance $\sigma_m^2$.
With regard to communication, we make the following two assumptions:
When robots are collecting data, they can only anonymously broadcast packets
of the type (v, yv , n) indicating that they collected observation yv at vertex v. The
last component n is a unique id assigned to each robot— its use will be presented

in Sect. 4. At the end of the mission, after the robots have converged at the goal
vertex vg and are in proximity, they can exchange all the data they have gathered
during the mission. However, at that point data collection is concluded and they
cannot return to the field and acquire more data to improve the estimate. These
communication assumptions are consistent with contemporary technology used
by robots in agricultural domain. In particular, LoRa [10] offers the capability of
streaming limited amounts of data at long distances and is compatible with the
assumptions we made when robots are out in the field. When robots terminate
their mission and are in close proximity, data can instead be exchanged using
onboard WiFi.
To reconstruct the scalar field we use Gaussian process (GP) regression, as
detailed in Sect. 4. Throughout the mission, using the available data (either
collected or communicated) robots can make predictions about the value of f at
arbitrary locations in X . We indicate such predictions as fˆ. The overall objective
is to collect the set of observations providing the most accurate reconstruction
of the underlying scalar field. As common in estimation literature, in this work
our metric for accuracy is the mean squared error (MSE), defined as

$MSE = \frac{1}{|\mathcal{X}|} \int_{\mathcal{X}} \left( f(\psi) - \hat{f}(\psi) \right)^2 d\psi.$
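In practice the integral is evaluated on a dense grid over X, so it reduces to an average over grid cells; a minimal sketch of this discretized metric (our helper, not part of the paper's code) is:

```python
import numpy as np

def mse(f_true, f_hat):
    """Discretized MSE: both fields are sampled on the same dense grid over X,
    so the normalized integral reduces to a mean over grid cells."""
    return float(np.mean((np.asarray(f_true) - np.asarray(f_hat)) ** 2))
```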

4 Methods
4.1 Gaussian Process Regression
We provide the necessary notation and background for GP regression and we
refer the reader to [6] for a more comprehensive introduction. As per our prob-
lem definition, we describe the spatial distribution of an unknown parameter
(moisture, nitrates, etc.) as a function f : X → R that is continuous in the
2-D environment X ⊂ R², where measurements are taken. The environmental field f
and the measurement noise σ_m are represented as random variables that follow an
i.i.d. Gaussian distribution with zero mean μ and variance σ². The Gaussian
process assumption is to model f as a
random probability distribution over a set of functions, such that the values of f
for arbitrary inputs x and x′ (f(x) and f(x′), respectively) have a jointly-Gaussian
distribution. We assume that f(x) is a realization of a Gaussian process, which is
completely defined by a mean function m(x) and a covariance function k(x, x′)
with input vector x:

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$    (1)

The joint distribution of the observations (the explanatory variable)
$\mathbf{y} = \{f(x_1) + \varepsilon_1, \ldots, f(x_n) + \varepsilon_n\}$ at inputs $X = \{x_1, \ldots, x_n\}$ and the function
values (the response variable) $\mathbf{f} = \{f_1, \ldots, f_n\}$ can be written as:

$\begin{bmatrix} \mathbf{y} \\ f(x_*) \end{bmatrix} \sim \mathcal{N}\!\left( 0, \begin{bmatrix} k(X, X) + \sigma_n^2 I_N & k(X, x_*) \\ k(x_*, X) & k(x_*, x_*) \end{bmatrix} \right)$    (2)

where $\mathbf{y}$ is a column vector of scalar outputs from a training set $D$ of $n$
observations, $D = (X, \mathbf{y}) = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, $k$ is the covariance function
(or kernel), $\sigma_n^2$ is the variance of the observation noise, and the input vectors $x$ and
query points $x_*$ of dimension $d$ are aggregated in the $d \times n$ design matrices $X$
and $X_*$, respectively.
Through the marginalization of jointly Gaussian distributions, we can derive
the following predictive conditional distribution at a single query point,
$f_* \mid D, x_* \sim \mathcal{N}(\mathbb{E}[f_*], \mathbb{V}[f_*])$, as [6]:

$\mu_* = \mathbb{E}[f_*] = k(x_*, X) \left[ k(X, X) + \sigma_n^2 I_n \right]^{-1} \mathbf{y}$    (3)

$\sigma_*^2 = \mathbb{V}[f_*] = k(x_*, x_*) - k(x_*, X) \left[ k(X, X) + \sigma_n^2 I_n \right]^{-1} k(X, x_*)$    (4)

where $k(X, X)$ is a matrix containing the joint prior covariances of the function
$f$ at the inputs $X$, and $k(x_*, X)$ is a matrix containing the covariances
between the function at the query points and the training inputs. The covariance
function $k$ (or kernel) captures prior knowledge about the function of interest,
including stationarity and smoothness. Often, it is assumed that the covariance
of any two random variables depends only on their distance (isotropy), indepen-
dent of their location (stationarity) [6]. Note that the variance in Equation (4)
can be computed for any point and not only for the observed locations. This
serves as a means to reason about overall map uncertainty at unobserved loca-
tions and is key for the robots to decide where to sample next to decrease the
MSE at the end of the surveying task.
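For concreteness, Eqs. (3) and (4) can be implemented directly. The following NumPy sketch (our own, using the standard Cholesky-based formulation) returns the predictive mean and variance at a set of query points.

```python
import numpy as np

def gp_predict(X, y, Xq, kernel, sigma_n):
    """Predictive mean (Eq. 3) and variance (Eq. 4) at query points Xq.
    kernel(A, B) must return the covariance matrix between the rows of A and B."""
    K = kernel(X, X) + sigma_n ** 2 * np.eye(len(X))
    Kq = kernel(Xq, X)            # covariances: query points vs. training inputs
    L = np.linalg.cholesky(K)     # numerically stable inverse of K + sigma_n^2 I
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Kq @ alpha               # Eq. (3)
    V = np.linalg.solve(L, Kq.T)
    var = np.diag(kernel(Xq, Xq)) - np.sum(V ** 2, axis=0)  # Eq. (4), diagonal terms
    return mu, var
```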

4.2 Spatial Prior

The kernel k establishes a prior likelihood over the space of functions that can
fit the observed data in the regression task. Kernel selection and tuning is a key
component in GP regression tasks. In machine learning, the radial basis function
(RBF) kernel is often used. In this paper, however, we use the Matérn kernel
with ν = 3/2, which is finitely differentiable. Our choice of this kernel
is motivated by its broad use in the geostatistical literature for modeling physical
processes [9] like those motivating this research. The Matérn covariance function
takes the form:

$K_{\text{Matérn}}(X, X') = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}}{l}\, r \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}}{l}\, r \right)$    (5)

where K_ν is a modified Bessel function, Γ(·) is the Gamma function, and r is the
Euclidean distance between the input points X and X′. The hyperparameters ν > 0,
l > 0, and σ² > 0 represent smoothness, lengthscale, and observation variance,
respectively. As common in GP inference, to account for the measurement noise,

the kernel we use is the sum of two kernels, namely the Matérn kernel and a
noise term, i.e., the kernel we use is

$K(X, X') = K_{\text{Matérn}}(X, X') + \sigma_n^2 I$

where I is the identity matrix and the term σ_n² models the measurement noise
σ_m². While we keep ν fixed at 3/2, the other hyperparameters θ = {σ², σ_n², l} can
be trained with various optimization methods, using the marginal likelihood to
match the properties of the environment and the sampled data [6]. In particular,
the length scale l is related to the lag parameter of the variogram, a function
used in geostatistics that establishes how quickly the variance increases as a
function of the separation distance between pairs of observations in space [9]. In
the GP kernel, smaller values of l imply that the variance grows quickly with
distance, while with larger values the variance grows more slowly. As we will see
in the next subsection, by putting constraints on the range of possible values of
l one can implicitly encourage dispersion between the robots, thus promoting the
collection of samples in different areas of the environment.
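Since the evaluation uses scikit-learn [5], the kernel just described can be assembled as follows. The lower bound l0 on the length scale is the constraint discussed above; the upper bound and the initial hyperparameter values are illustrative assumptions of ours.

```python
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def make_kernel(l0):
    """Matern(nu=3/2) kernel plus a noise term; the length scale is bounded
    below by l0 so the hyperparameter optimizer cannot shrink it arbitrarily."""
    return (ConstantKernel(1.0) *
            Matern(length_scale=1.0, length_scale_bounds=(l0, 1e2), nu=1.5)
            + WhiteKernel(noise_level=1e-2))
```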

4.3 Exploration

In this section we present the planning algorithm executed by each robot in the
team. No global data structure is shared among the agents, and all the quantities
described in the following are local to each robot. Let G = (V, E) be the graph
of possible sampling locations and let $D = \{(v_i, y_i)\}_{i=1,\ldots,n}$ be the set of collected
samples (vertices and values). All robots start from the same start vertex $v_s$ and
must end at the same goal vertex $v_g$. D is initialized as an empty set, and then
grows as the robot collects more data or receives data from other agents. Each
robot is given a unique numerical identifier $n_i$, but the robots need not know
how many agents are in the team. At each iteration the robot assigns a reward
function to each of the vertices in V , i.e., it computes a function r : V → R
assigning a value to each possible sampling location. Different options for the
function r will be discussed in the next subsection.
Once the function r has been computed, the robot is faced with an instance
of the stochastic orienteering problem, i.e., it has a graph G = (V, E) with known
deterministic rewards r associated to vertices and stochastic costs c associated
to edges, as well as a residual budget B. At this point the robot executes the
algorithm presented in [12] to solve the stochastic orienteering problem (SOP).
Because of the intrinsic computational complexity of the orienteering problem,
the SOP algorithm uses a Monte Carlo tree search informed by a heuristic
aiming at identifying vertices with high value r, low travel cost, and from which
the robot can still reach v_g with high probability (the reader is referred to [12] for
all details). The SOP algorithm returns the next vertex va to visit. The robot
then moves to va , collects an observation ya , updates D, and broadcasts the
packet (va , ya , ni ) to all other agents, where ni is the unique id of the robot.
This process continues until the SOP algorithm returns vg , in which case the
agent moves to the goal vertex vg and terminates. Throughout the process the

robot keeps listening for possible packets broadcast by other agents, and when
they are received the location and sampled values are added to D.
As the SOP algorithm was developed for the single robot case, in this work
we added a minor modification to account for the presence of multiple robots.
The change is as follows. When considering the reward r(v) of a vertex, the robot
considers, for each other agent, the last packet that agent transmitted (if it
exists). Then, if it determines that another agent is closer to v than itself, it
discounts the reward r(v). More precisely, let v be the vertex whose utility is
being evaluated, and r(v) its reward. Let $v_c$ be the location of the current robot,
and assume that it determines that robot j has broadcast a packet indicating it
collected a sample at vertex $v_j$. If $v_j$ is closer to v than $v_c$, then r(v) is updated
as $r'(v) = r(v)\, d(v_j, v)/d(v_c, v)$, where d is the Euclidean distance between the vertices. The
rationale for this update is that if another robot is closer to v, then it is more
likely to reach v than the former robot, so the utility of v is decreased for
the former robot to prevent having both robots visiting v, as this would be a
replicated effort wasting resources. However, the utility is not set to zero because
robots do not communicate with each other and do not know their individual
intentions. Also, since each robot maintains its own set of GP hyperparameters
(see discussion below) and these will be different from each other, robots cannot
make absolute predictions about the intentions of other robots in the team.
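One possible reading of this discount rule is sketched below in Python (our names throughout); here only the closest teammate's last reported location triggers the discount, since the paper does not spell out the case of several closer agents.

```python
import math

def discounted_reward(r_v, v, v_c, last_locations):
    """Discount r(v) when another robot's last broadcast sample location is
    closer to v than this robot is. last_locations maps robot ids to the
    vertices of their most recent packets."""
    d_self = math.dist(v_c, v)
    d_min = min((math.dist(v_j, v) for v_j in last_locations.values()),
                default=float("inf"))
    if d_min < d_self:
        return r_v * d_min / d_self   # r'(v) = r(v) d(v_j, v) / d(v_c, v)
    return r_v
```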
Remark: one could imagine that after a robot has determined which vertex v
it will visit next, it could broadcast this information to other agents so that they
do not consider it anymore among their choices. However, this is not done for two
reasons. First, such additional communication would almost double the amount of
transmitted data, thus going against our effort to keep exchanged information at
a minimum. Second, because of the stochastic nature of the environment there is
no guarantee that a robot electing to visit a certain vertex will eventually reach
it and collect a sample. Hence we opt for the current approach where robots
share measured data only after they have reached and sampled a location.

4.4 Vertex Quality Computation


Key to the presented approach is the reward function r : V → R used by the
SOP algorithm to decide which vertex to visit next. Ideally, the function should
identify instrumentally good vertices to visit, where good in this case means
vertices that will yield a reduction of the MSE metric. Different metrics have been
proposed in the literature. One obvious choice is to use Eq. (4) to predict the
variance at the vertices in V and set $r(v) = \sigma^2(v)$. In this case, the objective is to assign
high values to vertices with high uncertainty in the estimate. In [8] the authors
instead propose to use a linear combination of the mean and standard deviation
predicted by Eqs. (3) and (4). Their approach aims at discovering the extrema
of an unknown function. As in our application we are interested in the entire
function, and not just its peaks, we could set r(v) = |μ(v)| + βσ(v). Finally,
in [1] the authors propose an algorithm to compute the mutual information for
vertices (prior to and after being added to the movement graph) using predic-
tions for mean and variance. After having implemented these three alternatives,

preliminary experiments did not reveal significant differences between them.
However, setting $r(v) = \sigma^2(v)$ has the advantage of not requiring the tuning
of additional parameters, as is instead necessary for the other two methods.
Therefore, informed by these preliminary findings, in our implementation each
robot assigns the predicted variance as the value of a vertex. Note that for
vertices already in D the algorithm sets r(v) = 0, so that robots never consider
again vertices that have already been sampled at least once.
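With scikit-learn, this reward assignment reduces to a variance prediction at every candidate vertex. The following sketch assumes a fitted GaussianProcessRegressor and uses our own names; it is an illustration, not the authors' code.

```python
import numpy as np

def vertex_rewards(gp, V, sampled):
    """r(v) = predicted variance at v, and 0 for already-sampled vertices.
    gp: a fitted GaussianProcessRegressor; V: list of (x, y) vertices."""
    _, std = gp.predict(np.asarray(V), return_std=True)
    return {v: 0.0 if v in sampled else float(s) ** 2 for v, s in zip(V, std)}
```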
The kernel we use to make predictions about the variance depends on three
hyperparameters θ = {σ², σ_n², l} that can be tuned to best fit the data in D.
As pointed out in [6], Ch. 5, to obtain better results it is possible to repeat the
optimization process multiple times, with random restarts to avoid getting stuck
in suboptimal local minima. In our approach, before assigning values to the
vertices, each robot executes the optimization locally with ten restarts, but never
communicates the hyperparameters of its internal model θ to the other team
members. Each agent then operates with its own set of hyperparameters θ,
which is unlikely to match the others', due to the random restarts of the optimizer.
This difference further decreases the likelihood that multiple agents will
select the same vertices to sample, because even with identical sets D the variance
predicted by the GP will be different. However, during the optimization process
each agent uses the same lower bound $l_0$ for the length scale l. This choice
encourages robots to disperse, because the variance of vertices in V near
vertices already inserted in D is lower than the variance of vertices far from D,
and thereby the reward associated with vertices near already sampled locations
is lower.
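The local refit with random restarts, as each robot performs it before assigning vertex values, might look as follows; this is a sketch under the assumptions above, with illustrative kernel bounds and our own helper name.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def refit_local_gp(X, y, l0):
    """Refit this robot's private GP on its current data D = (X, y) with ten
    random restarts of the hyperparameter optimizer; the resulting
    hyperparameters theta are never communicated to teammates."""
    kernel = Matern(length_scale=1.0, length_scale_bounds=(l0, 1e2), nu=1.5) \
             + WhiteKernel(noise_level=1e-2)
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
    return gp.fit(X, y)
```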

5 Experimental Evaluation and Discussion

To assess the merits and limitations of the proposed approach, we perform
simulations on test cases while varying the different parameters related to the
planning and surveying objectives. Due to space limitations, we examine the task
of reconstructing two scalar fields. The first is a synthetic scalar field with a
periodic trend, depicted in Fig. 1a. The second, displayed in Fig. 1b, shows the
soil moisture distribution measured in Summer 2018 in a commercial vineyard
located in Central California. This second scalar field was used as a benchmark in
some of our former publications [14]. To ease the comparisons between the two
cases, both fields were rescaled to the same size, although the amplitude of
the underlying values are different. GP predictions of each respective field were
made with a Matérn kernel with ν = 3/2. It should be noted that this kernel
is commonly used in geostatistical applications and is more appropriate for the
soil moisture dataset. Here, the periodic synthetic field serves as a pathological
example, with a mismatched spatial prior. In fact, the use of periodic kernels
could lead to better results for the synthetic field. Future work will examine
online adaptive kernel selection through Bayesian optimization.
Our algorithm, indicated as Coord in the following discussion, is compared
with two baseline alternatives:

Fig. 1. Benchmark scalar fields to be estimated. With reference to the travel budget
B, the length of the side edge is 5. In both instances the start vertex vs is in the lower
left corner and the goal vertex vg is in the top right corner.

– The random waypoint selection algorithm (RWP), which selects the next ver-
tex to visit at random among those still to be visited. Due to the nature of the
selection process, the ability to communicate during the sampling process is
immaterial. The RWP algorithm is often considered as a baseline comparison
in this type of tasks (see e.g., [2]).
– A non-coordinated (NC) approach, which selects the next sampling point
using the same criteria used by our proposed algorithm, but does not exchange
any information during the sampling process, i.e., during the selection process
each agent only considers the samples it collected, but not those collected by
the other agents.

At the end of the mission, when all robots have reached vg , both RWP and
NC share all collected sensor observations and the MSE is computed after fitting
the GP using all data collected by all robots. This step is not necessary for Coord
because data is exchanged on the go, but it ensures that the MSE evaluation is
done fairly among the three algorithms. After the algorithm has selected the next
point to visit, all algorithms, including Coord, use the same planning engine, i.e.,
the one we proposed in [12]. Finally, both NC and Coord refit the GP and
update the parameters θ before computing r, while RWP does not do this step
because it does not use the current estimate to select the next point to visit.
All procedures were executed single-threaded in Python running on an Apple
M1 processor. All computations related to GP fitting and processing use the
scikit-learn library [5]. For both scalar fields considered in the tests, we varied
the number of agents (3,5,7,9), the budget B (10,15,20), and the parameter l0
(0.1,0.5,1). For each combination of parameters, twenty independent simulations
were performed, for a total of about 3500 executions.
Figure 2 shows the average MSE as a function of the number of robots for
all algorithms. As expected, the trend is decreasing (i.e. improved prediction
accuracy), with diminishing returns as the number of robots grows. We can observe
that the proposed algorithm outperforms the others. Note that the range of
values for MSE in the two test cases is different because of the different values in

Fig. 2. Average final map MSE for n=20 trials per algorithm. Error bars show ± one
standard deviation.

the underlying scalar fields. Next, in Fig. 3 we show the reconstructed scalar field
for the three algorithms with a budget of 20 and 5 robots. The red dots show the
locations where the samples were collected. Due to the random selection process,
RWP ends up collecting fewer samples before exhausting the budget, and this leads
to an inferior estimate. NC and Coord, instead, collect more samples, but we
can see how Coord spreads them more uniformly and ultimately leads to a more
accurate estimate (see Fig. 1b for the ground truth). Similar results are observed
for the synthetic map, but are not displayed for lack of space. Table 1 provides
a more detailed numeric analysis of the performance of the three algorithms.
Specifically, we look at the number of unvisited locations as well as the number
of locations visited by more than one robot. These are two proxies for the MSE
metric, and lower values are desired in both cases. When the travel budget is 10,
the number of unvisited locations is similar for the three algorithms, because with
a limited budget the set of available choices before the budget is exhausted is small. As
the budget grows, we see that the Coord algorithm emerges as the algorithm with
fewer unvisited vertices, thus substantiating the claim that agents spread in the
environment in a more coordinated fashion. For the number of revisited locations,
RWP (as expected) always has the lowest number of revisited locations, due to
the completely random nature of the selection. However, when comparing NC
with Coord we see that the latter always has a lower number, again showing
that the agents spread better in the environment, avoiding revisiting the same
places, and thus ensuring that coordination leads to a better use of the robots as
mobile sensors.
Finally, in Fig. 4 we display how the choice of l0, the lower bound for the
length scale parameter l, impacts the value of the MSE metric. The two panels
correspond to a budget of 15 and 20 respectively, and group the results for
different numbers of surveying robots. For 9 robots the impact is marginal, and
this is explained by the fact that with this many agents the team manages to
cover most of the environment during the mission. However, for budget of 15 and

Fig. 3. Reconstructed scalar field soil moisture map with 5 robots and B = 20.

Table 1. Average number of unvisited and re-visited waypoints, from an experimental
setting of 200 candidate sampling locations. X/Y means that there were on average X
unvisited vertices and Y revisited vertices.

map      |     synthetic dataset      |   soil moisture dataset
budget   |   10       15       20     |   10       15       20
RWP      | 179/6    164/8    155/10   | 179/6    164/9    155/10
NC       | 164/17   139/31   122/42   | 166/16   140/30   122/40
Coord    | 163/12   129/18   104/21   | 166/11   133/17   106/21

Fig. 4. Average MSE for different values of l0 and number of robots.

3 robots, a value of l0 = 1 gives a clearly better result. Likewise, for a budget of
20 and 5 robots, a value of l0 = 0.5 is best. These results show that by tuning l0
it is possible to implicitly promote better dispersion in the team and then lower
values for the MSE. An outstanding question to be investigated is how to select
this value in a general setting. Nevertheless, these results confirm our hypothesis
that by constraining the GP kernel parameters being optimized, one can enforce
different behaviors on the team members.

6 Conclusion
We have presented an approach to reconstruct a scalar field using a team of
robots performing distributed GP estimation. Through limited communication,
by exploiting the underlying properties of GPs, robots implicitly manage to dis-
perse throughout the domain and collect samples at locations leading to a more
accurate estimation. Our proposed approach has been extensively evaluated in
simulation and outperforms competing approaches.

References
1. Contal, E., Perchet, V., Vayatis, N.: Gaussian process optimization with mutual
information. In: Proceedings of the 31st International Conference on Machine
Learning, pp. 253–261 (2014)
2. Di Caro, G.A., Yousaf, Z., Wahab, A.: Map learning via adaptive region-based
sampling in multi-robot systems. In: Matsuno, F., Azuma, S.-I., Yamamoto, M.
(eds.) Distributed Autonomous Robotic Systems: 15th International Symposium,
pp. 335–348. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-92790-5_26
3. Hollinger, G.A., Sukhatme, G.S.: Sampling-based robotic information gathering
algorithms. Int. J. Robot. Res. 33(9), 1271–1287 (2014)
4. Jadidi, M.G., Miro, J.V., Dissanayake, G.: Sampling-based incremental information
gathering with applications to robotic exploration and environmental monitoring.
Int. J. Robot. Res. 38(6), 658–685 (2019)
5. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
6. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning.
Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA (2006)
7. Shamshirgaran, A., Carpin, S.: Reconstructing a spatial field with an autonomous
robot under a budget constraint. In: Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems, pp. 8963–8970 (2022)
8. Srinivas, N., Krause, A., Kakade, S.M., Seeger, M.W.: Information-theoretic regret
bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf.
Theory 58(5), 3250–3265 (2012)
9. Stein, M.L.: Interpolation of Spatial Data: Some Theory for Kriging. Springer
Series in Statistics. Springer, New York (1999)
10. Sundaram, J.P.S., Du, W., Zhao, Z.: A survey on LoRa networking: research prob-
lems, current solutions, and open issues. IEEE Commun. Surv. Tutor. 22(1), 371–
388 (2019)
11. Suryan, V., Tokekar, P.: Learning a spatial field in minimum time with a team of
robots. IEEE Trans. Rob. 36(5), 1562–1576 (2020)
12. Thayer, T.C., Carpin, S.: Solving stochastic orienteering problems with chance con-
straints using Monte Carlo tree search. In: Proceedings of the IEEE International
Conference on Automation Science and Engineering, pp. 1170–1177 (2022)

13. Thayer, T.C., Vougioukas, S., Goldberg, K., Carpin, S.: Routing algorithms for
robot assisted precision irrigation. In: Proceedings of the IEEE International Con-
ference on Robotics and Automation, pp. 2221–2228 (2018)
14. Thayer, T.C., Vougioukas, S., Goldberg, K., Carpin, S.: Multi-robot routing algo-
rithms for robots operating in vineyards. IEEE Trans. Autom. Sci. Eng. 17(3),
1184–1194 (2020)
A Constrained-Optimization Approach
to the Execution of Prioritized Stacks
of Learned Multi-robot Tasks

Gennaro Notomista(B)

Department of Electrical and Computer Engineering, University of Waterloo,


Waterloo, ON, Canada
gennaro.notomista@uwaterloo.ca

Abstract. This paper presents a constrained-optimization formulation


for the prioritized execution of learned robot tasks. The framework lends
itself to the execution of tasks encoded by value functions, such as
tasks learned using the reinforcement learning paradigm. The tasks are
encoded as constraints of a convex optimization program by using con-
trol Lyapunov functions. Moreover, an additional constraint is enforced
in order to specify relative priorities between the tasks. The proposed
approach is showcased in simulation using a team of mobile robots exe-
cuting coordinated multi-robot tasks.

Keywords: Multi-robot motion coordination · Distributed control and


planning · Learning and adaptation in teams of robots

1 Introduction

Learning complex robotic tasks can be challenging for several reasons. The
nature of compound tasks, made up of several simpler subtasks, renders it dif-
ficult to simultaneously capture and combine all features of the subtasks to be
learned. Another limiting factor of the learning process of compound tasks is
the computational complexity of machine learning algorithms employed in the
learning phase. This can make the training phase prohibitive, especially when
the representation of the tasks comprises a large number of parameters, as is
generally the case when dealing with complex tasks made up of several subtasks,
or in the case of high-dimensional state space representations.
For these reasons, when there is an effective way of combining the execution
of multiple subtasks, it is useful to break down complex tasks into building blocks
that can be independently learned in a more efficient fashion. Besides the reduced
computational complexity stemming from the simpler nature of the subtasks to
be learned, this approach has the benefit of increasing the modularity of the task
execution framework, by allowing for a reuse of the subtasks as building blocks
for the execution of different complex tasks. Discussions and analyses of such
advantages can be found, for instance, in [9,16,26,32].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Bourgeois et al. (Eds.): DARS 2022, SPAR 28, pp. 479–493, 2024.
https://doi.org/10.1007/978-3-031-51497-5_34
480 G. Notomista

Along these lines, in [13], compositionality and incrementality are recognized


to be two fundamental features of robot learning algorithms. Compositionality,
in the context of learning to execute multiple tasks, is intended as the property
of learning strategies to be in a form that allows them to be combined with previous
knowledge. Incrementality guarantees the possibility of adding new knowledge
and abilities over time, by, for instance, incorporating new tasks. Several
approaches have been proposed, which exhibit these two properties. Neverthe-
less, challenges still remain regarding tasks prioritization and stability guarantees
[6,21,25,28,34]. The possibility of prioritizing tasks together with the stability
guarantees allows us to characterize the behavior resulting from the composition
of multiple tasks.
In fact, when dealing with redundant robotic systems—i.e. systems which
possess more degrees of freedom compared to the minimum number required to
execute a given task, as, for example, multi-robot systems—it is often useful to
allow for the execution of multiple subtasks in a prioritized stack. Task priorities
may allow robots to adapt to the different scenarios in which they are employed
by exhibiting structurally different behaviors. Therefore, it is desirable that a
multi-task execution framework allows for the prioritized execution of multiple
tasks.
In this paper, we present a constrained-optimization robot-control framework
suitable for the stable execution of multiple tasks in a prioritized fashion. This
approach leverages the reinforcement learning (RL) paradigm in order to get
an approximation of the value functions which will be used to encode the tasks
as constraints of a convex quadratic program (QP). Owing to its convexity,
the latter can be solved in polynomial time [3], and it is therefore suitable to
be employed in a large variety of robotic applications, in online settings, even
under real-time constraints. The proposed framework shares the optimization-
based nature with the one proposed in [18] for redundant robotic manipulators,
where, however, it is assumed that a representation for all tasks to be executed
is known a priori. As will be discussed later in the paper, this framework indeed
combines compositionality and incrementality—i.e. the abilities of combining
and adding sub-tasks to build up compound tasks, respectively—with stable
and prioritized task execution in a computationally efficient optimization-based
algorithm.
Figure 1 pictorially shows the strategy adopted in this paper to allow robots
to execute multiple prioritized tasks learned using the RL paradigm. Once a
value function is learned using the RL paradigm (using, e.g., the value itera-
tion algorithm [2]), this learned value function is used to construct a control
Lyapunov function [30] in such a way that a controller synthesized using a min-
norm optimization program is equivalent to the optimal policy corresponding
to the value function [20]. Then, multiple tasks encoded by constraints in a
min-norm controller are combined in a prioritized stack as in [17].
To summarize, the contributions of this paper are the following: (i) We
present a compositional and incremental framework for the execution of mul-
tiple tasks encoded by value functions; (ii) We show how priorities among tasks
can be enforced in a constrained-optimization-based formulation; (iii) We frame the prioritized multi-task execution as a convex QP which can be efficiently solved in online settings; (iv) We demonstrate how the proposed framework can be employed to control robot teams to execute coordinated tasks.

Fig. 1. Pictorial representation of the strategy adopted in this paper for the execution of prioritized stacks of learned tasks.

2 Background and Related Work

2.1 Multi-task Learning, Composition, and Execution

The prioritized execution framework for learned tasks proposed in this paper
can be related to approaches devised for multi-task learning—a machine learn-
ing paradigm which aims at leveraging useful information contained in mul-
tiple related tasks to help improve the generalization performance of all the
tasks [35]. The learning of multiple tasks can happen in parallel (indepen-
dently) or in sequence for naturally sequential tasks [10,29], and a number
of computational frameworks have been proposed to learn multiple tasks (see,
e.g., [14,24,35], and references therein). It is worth noticing how, owing to its constrained-optimization nature, the approach proposed in this paper is dual to multi-objective optimization frameworks, such as [5,27], or to the Riemannian motion policies [15,22,23].
Several works have focused on the composition and hierarchy of deep rein-
forcement learning policies. The seminal work [33] shows compositionality for
a specific class of value functions. More general value functions are considered
in [12], where, however, there are no guarantees on the policy resulting from
the multi-task learning process. Boolean and weighted composition of reward,
(Q-)value functions, or policies are considered in [11,19,34]. While these works
have shown their effectiveness on complex systems and tasks, our proposed app-
roach differs from them in two main aspects: (i) It separates the task learning
from the task composition; (ii) It allows for (possibly time-varying and state-
dependent) task prioritization, with task stacks that are enforced at runtime.
2.2 Constraint-Based Task Execution


In this paper, we adopt a constrained-optimization approach to the prioritized
execution of multiple tasks learned using the RL paradigm. In [17], a constraint-
based task execution framework is presented for a robotic system with control
affine dynamics
$$\dot x = f_0(x) + f_1(x)\,u, \tag{1}$$
where $x \in X \subseteq \mathbb{R}^n$ and $u \in U \subseteq \mathbb{R}^m$ denote state and control input, respectively. The M tasks to be executed are encoded by continuously differentiable, positive definite cost functions $V_i : X \to \mathbb{R}_+$, $i = 1, \dots, M$. With the notation which will be adopted in this paper, the constraint-based task execution framework in [17] can be expressed as follows:

$$\begin{aligned} \underset{u,\delta}{\text{minimize}} \quad & \|u\|^2 + \kappa\|\delta\|^2 \\ \text{subject to} \quad & L_{f_0}V_i(x) + L_{f_1}V_i(x)\,u \le -\gamma(V_i(x)) + \delta_i, \quad i = 1, \dots, M \\ & K\delta \ge 0, \end{aligned} \tag{2}$$

where $L_{f_0}V_i(x)$ and $L_{f_1}V_i(x)$ are the Lie derivatives of $V_i$ along the vector fields $f_0$ and $f_1$, respectively. The components of $\delta = [\delta_1, \dots, \delta_M]^T$ are slack variables employed to prioritize the different tasks; $\gamma : \mathbb{R} \to \mathbb{R}$ is a Lipschitz continuous extended class K function—i.e. a continuous, monotonically increasing function with $\gamma(0) = 0$; $\kappa > 0$ is an optimization parameter; and K is the prioritization matrix, known a priori, which enforces relative constraints between components of $\delta$ of the following type: $\delta_i \le l\delta_j$, for $l \ll 1$, which encodes the fact that task i is executed at higher priority than task j.
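For concreteness, the following sketch shows how a program of the form (2) could be assembled and solved numerically. The Lie-derivative values, the class-K function values, and the prioritization pairs are illustrative placeholders, and the cvxpy modeling library is an assumption of this sketch (any QP solver would do).

# Sketch of the prioritized CLF-QP (2); all numerical values are hypothetical
# quantities evaluated at the current state x.
import numpy as np
import cvxpy as cp

M, m = 3, 2                                       # number of tasks, input dimension
kappa = 10.0                                      # slack-penalty weight
Lf0V = np.array([0.1, -0.2, 0.3])                 # placeholder Lf0 Vi(x), i = 1..M
Lf1V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # placeholder Lf1 Vi(x) rows
gammaV = np.array([0.5, 0.4, 0.6])                # placeholder gamma(Vi(x))

# Prioritization matrix K: each row encodes delta_i <= l * delta_j
# (task i at higher priority than task j), written as K @ delta >= 0.
l = 1e-3
K = np.array([[-1.0, l, 0.0],                     # task 1 above task 2
              [0.0, -1.0, l]])                    # task 2 above task 3

u = cp.Variable(m)
delta = cp.Variable(M)
constraints = [Lf0V + Lf1V @ u <= -gammaV + delta,   # one CLF constraint per task
               K @ delta >= 0]                       # relative priorities
prob = cp.Problem(cp.Minimize(cp.sum_squares(u) + kappa * cp.sum_squares(delta)),
                  constraints)
prob.solve()
print(u.value, delta.value)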
In the following, Sect. 2.3 will be devoted to showing the connection between
dynamic programming and optimization-based controllers. In Sect. 3, this con-
nection will allow us to execute tasks learned using the RL paradigm by means
of a formulation akin to (2).

2.3 From Dynamic Programming to Constraint-Driven Control


To illustrate how controllers obtained using dynamic programming can be syn-
thesized as the solution of an optimization program, consider a system with the
following discrete-time dynamics:

$$x_{k+1} = f(x_k, u_k). \tag{3}$$

These dynamics can be obtained, for instance, from (1) through a discretization process. In (3), $x_k$ denotes the state, $u_k \in U_k(x_k)$ the input, and the input set $U_k(x_k)$ may in general depend on the time k and the state $x_k$. The value iteration algorithm to solve a deterministic dynamic programming problem with no terminal cost can be stated as follows [2]:
 
$$J_{k+1}(x_k) = \min_{u_k \in U_k(x_k)} \big[ g_k(x_k, u_k) + J_k(f_k(x_k, u_k)) \big], \tag{4}$$
Multi-robot Multi-learned-Tasks 483

with $J_0(x_0) = 0$, where $x_0$ is the initial state and $g_k(x_k, u_k)$ is the cost incurred at time k. The total cost accumulated along the system trajectory is given by
$$J(x_0) = \lim_{N \to \infty} \sum_{k=0}^{N-1} \alpha^k g_k(x_k, u_k). \tag{5}$$

In this paper, we will consider $\alpha = 1$ and we will assume there exists a cost-free termination state.¹
By Proposition 4.2.1 in [2], the value iteration algorithm (4) converges to $J^*$ satisfying
$$J^*(x) = \min_{u \in U(x)} \big[ g(x, u) + J^*(f(x, u)) \big]. \tag{6}$$
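As an illustration of (4)–(6), the following minimal sketch runs tabular value iteration on a toy deterministic chain with $\alpha = 1$ and a cost-free termination state; the dynamics and stage cost are hypothetical.

# Minimal tabular value iteration (4) on a toy deterministic chain: state 0 is
# the cost-free termination state, each step elsewhere costs 1.
import numpy as np

n_states, n_actions = 6, 2          # actions: move left (0) or stay (1)

def f(x, u):                        # deterministic dynamics x_{k+1} = f(x_k, u_k)
    return max(x - 1, 0) if u == 0 else x

def g(x, u):                        # stage cost; zero at the termination state
    return 0.0 if x == 0 else 1.0

J = np.zeros(n_states)              # J_0(x) = 0
for _ in range(50):                 # J_{k+1}(x) = min_u [g(x,u) + J_k(f(x,u))]
    J = np.array([min(g(x, u) + J[f(x, u)] for u in range(n_actions))
                  for x in range(n_states)])

print(J)   # converges to J*(x) = x, the shortest-path cost to state 0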

Adopting an approximation scheme in value space, $J^*$ can be replaced by its approximation $\tilde J^*$ by solving the following approximate dynamic programming algorithm:
$$\tilde J_{k+1}(x_k) = \min_{u_k \in U_k(x_k)} \big[ g_k(x_k, u_k) + \tilde J_k(f_k(x_k, u_k)) \big].$$

In these settings, deep RL algorithms can be leveraged to find parametric approximations, $\tilde J^*$, of the value function using neural networks. This will be the paradigm considered in this paper in order to approximate the value functions encoding the tasks to be executed in a prioritized fashion.
The bridge between dynamic programming and constraint-driven control is
optimal control. In fact, the cost in (5) is typically considered in optimal control
problems, recalled, in the following, for the continuous time control affine system
(1):
$$\begin{aligned} \underset{u(\cdot)}{\text{minimize}} \quad & \int_0^\infty \big( q(x(t)) + u(t)^T u(t) \big)\, dt \\ \text{subject to} \quad & \dot x = f_0(x) + f_1(x)\,u. \end{aligned} \tag{7}$$
Comparing (7) with (5), we recognize that the instantaneous cost g(x, u) in (5), in the context of the optimal control problem (7), corresponds to $q(x) + u^T u$, where $q : X \to \mathbb{R}$ is a continuously differentiable and positive definite function.
A dynamic programming argument on (7) leads to the following Hamilton-Jacobi-Bellman equation:
$$L_{f_0}J^*(x) - \frac{1}{4}L_{f_1}J^*(x)\big(L_{f_1}J^*(x)\big)^T + q(x) = 0,$$
where $J^*$ is the value function—similar to (6) for continuous-time problems—representing the minimum cost-to-go from state x, defined as
$$J^*(x) = \min_{u(\cdot)} \int_t^\infty \big( q(x(\tau)) + u(\tau)^T u(\tau) \big)\, d\tau. \tag{8}$$

¹ Problems of this class are referred to as shortest path problems in [2].
The optimal policy corresponding to the optimal value function (8) can be evaluated as follows [4]:
$$u^* = -\frac{1}{2}\big(L_{f_1}J^*(x)\big)^T. \tag{9}$$
In order to show how the optimal policy $u^*$ in (9) can be obtained using an optimization-based formulation, we now recall the concept of control Lyapunov functions.
Definition 1 (Control Lyapunov function [30]). A continuously differentiable, positive definite function $V : \mathbb{R}^n \to \mathbb{R}$ is a control Lyapunov function (CLF) for the system (1) if, for all $x \ne 0$,
$$\inf_u \big[ L_{f_0}V(x) + L_{f_1}V(x)\,u \big] < 0. \tag{10}$$

To select a control input u which satisfies the inequality (10), a universal expression—known as Sontag's formula [31]—can be employed. With the aim of encoding the optimal control input $u^*$ by means of a CLF, we will consider the following modified Sontag's formula, originally proposed in [7]:
$$u(x) = \begin{cases} -v(x)\,\big(L_{f_1}V(x)\big)^T & \text{if } L_{f_1}V(x) \ne 0 \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$
where $v(x) = \dfrac{L_{f_0}V(x) + \sqrt{(L_{f_0}V(x))^2 + q(x)\,L_{f_1}V(x)(L_{f_1}V(x))^T}}{L_{f_1}V(x)(L_{f_1}V(x))^T}$.
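A direct transcription of (11) could look as follows; the inputs (the Lie derivatives of V and the value of q at the current state) are assumed to be computed elsewhere.

# Sketch of the modified Sontag formula (11); inputs are placeholders
# evaluated at the current state.
import numpy as np

def sontag_control(Lf0V, Lf1V, q):
    """Lf0V: scalar, Lf1V: (m,) array of Lie derivatives, q: scalar q(x) > 0."""
    b2 = float(Lf1V @ Lf1V)                  # Lf1V (Lf1V)^T
    if b2 < 1e-12:                           # Lf1V(x) = 0 branch of (11)
        return np.zeros_like(Lf1V)
    v = (Lf0V + np.sqrt(Lf0V**2 + q * b2)) / b2
    return -v * Lf1V                         # u(x) = -v(x) (Lf1V)^T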
As shown in [20], the modified Sontag's formula (11) is equivalent to the solution of the optimal control problem (7) if the following relation between the CLF V and the value function $J^*$ holds:
$$\frac{\partial J^*}{\partial x} = \lambda(x)\,\frac{\partial V}{\partial x}, \tag{12}$$
where $\lambda(x) = 2v(x)\,(L_{f_1}V(x))^T$. The relation in (12) corresponds to the fact that the level sets of the CLF V and those of the value function $J^*$ are parallel.
The last step towards the constrained-optimization-based approach to gen-
erate optimal control policies is to recognize the fact that, owing to its inverse
optimality property [7], the modified Sontag’s formula (11) can be obtained using
the following constrained-optimization formulation, also known as the pointwise
min-norm controller:
$$\begin{aligned} \underset{u}{\text{minimize}} \quad & \|u\|^2 \\ \text{subject to} \quad & L_{f_0}V(x) + L_{f_1}V(x)\,u \le -\sigma(x), \end{aligned} \tag{13}$$
where $\sigma(x) = \sqrt{(L_{f_0}V(x))^2 + q(x)\,L_{f_1}V(x)(L_{f_1}V(x))^T}$. This formulation shares the same optimization structure with the one introduced in (2) in Sect. 2, and in the next section we will provide a formulation which strengthens the connection with approximate dynamic programming.
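Since the constraint in (13) defines a half-space in the input, the min-norm solution can also be written in closed form as the projection of the origin onto that half-space. The sketch below, under the same assumptions as the previous one, reproduces the controller of (11), giving a numerical illustration of the inverse optimality property.

# Closed-form solution of the pointwise min-norm program (13), with the
# sigma(x) defined above; with that choice it reproduces (11).
import numpy as np

def min_norm_control(Lf0V, Lf1V, q):
    b2 = float(Lf1V @ Lf1V)
    sigma = np.sqrt(Lf0V**2 + q * b2)        # sigma(x) from (13)
    if b2 < 1e-12 or Lf0V + sigma <= 0.0:    # constraint inactive: u = 0
        return np.zeros_like(Lf1V)
    return -((Lf0V + sigma) / b2) * Lf1V     # least-norm u on the boundary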
In Appendix A, additional results are reported, which further illustrate the theoretical equivalence discussed in this section, by comparing the optimal controller, the optimization-based controller, and a policy learned using the RL framework for a simple dynamical system.

3 Prioritized Multi-task Execution


When $V = \tilde J^*$, the min-norm controller solution of (13) is the optimal policy which would be learned using a deep RL algorithm. This is what allows us to bridge the gap between constraint-driven control and RL, and it is the key to executing tasks learned using the RL paradigm in a compositional, incremental, prioritized, and computationally efficient fashion.
Following the formulation given in (2), the multi-task prioritized execution of tasks learned using RL can be implemented by executing the control input solution of the following optimization program:
$$\begin{aligned} \underset{u,\delta}{\text{minimize}} \quad & \|u\|^2 + \kappa\|\delta\|^2 \\ \text{subject to} \quad & \frac{1}{\lambda_i(x)}\Big( L_{f_0}\tilde J_i^*(x) + L_{f_1}\tilde J_i^*(x)\,u \Big) \le -\sigma_i(x) + \delta_i, \quad i = 1, \dots, M \\ & K\delta \ge 0, \end{aligned} \tag{14}$$

where $\tilde J_1^*, \dots, \tilde J_M^*$ are the approximated value functions encoding the tasks learned using the RL paradigm (e.g., value iteration). In summary, with the RL paradigm, one can get the approximate value functions $\tilde J_1^*, \dots, \tilde J_M^*$; the robotic system is then controlled using the control input solution of (14) in order to execute these tasks in a prioritized fashion.
Remark 1. The Lie derivatives $L_{f_0}\tilde J_1^*(x), \dots, L_{f_0}\tilde J_M^*(x)$ contain the gradients $\frac{\partial \tilde J_1^*}{\partial x}, \dots, \frac{\partial \tilde J_M^*}{\partial x}$. When $\tilde J_1^*(x), \dots, \tilde J_M^*(x)$ are approximated using neural networks, these gradients can be efficiently computed using backpropagation.
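As a sketch of how the ingredients of (14) can be obtained in practice per Remark 1, the gradient of a neural value-function approximation can be computed by automatic differentiation; the network architecture and the use of PyTorch are assumptions of this illustration.

# Gradient of a hypothetical neural value-function approximation via backprop.
import torch

J_tilde = torch.nn.Sequential(               # placeholder learned value function
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

x = torch.tensor([[0.3, -1.2]], requires_grad=True)   # current state
dJdx = torch.autograd.grad(J_tilde(x).sum(), x)[0]    # dJ/dx by backpropagation

# For driftless single-integrator robots (f0 = 0, f1 = I), the constraint row
# in (14) reduces to (1/lambda_i) * dJdx @ u <= -sigma_i + delta_i.
print(dJdx)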
We conclude this section with the following Proposition 1, which ensures the stability of the prioritized execution of multiple tasks encoded through the value functions $\tilde J_i^*$ by a robotic system modeled by the dynamics (1) and controlled with the control input solution of (14).

Proposition 1 (Stability of multiple prioritized learned tasks). Consider executing a set of M prioritized tasks encoded by approximate value functions $\tilde J_i^*$, $i = 1, \dots, M$, by solving the optimization problem in (14). Assume the following:
1. All constraints in (14) are active;
2. The robotic system can be modeled by a driftless control affine dynamical system, i.e. $f_0(x) = 0$ for all $x \in X$;
3. The instantaneous cost function g used to learn the tasks is positive for all $x \in X$ and $u \in U$.
Then,
$$\begin{bmatrix} \tilde J_1^*(x(t)) \\ \vdots \\ \tilde J_M^*(x(t)) \end{bmatrix} \to \mathcal{N}(K)$$
as $t \to \infty$, where $\mathcal{N}(K)$ denotes the null space of the prioritization matrix K. That is, the tasks will be executed according to the priorities specified by the prioritization matrix K in (2).
Proof. The Lagrangian associated with the optimization problem (14) is given by $\mathcal{L}(u, \delta) = \|u\|^2 + \kappa\|\delta\|^2 + \eta_1^T\big(\hat f_0(x) + \hat f_1(x)u + \sigma(x) - \delta\big) + \eta_2^T(-K\delta)$, where $\hat f_0(x) \in \mathbb{R}^M$ and $\hat f_1(x) \in \mathbb{R}^{M \times m}$ are defined as follows: the i-th component of $\hat f_0(x)$ is equal to $\frac{1}{\lambda_i(x)}L_{f_0}\tilde J_i^*(x)$, while the i-th row of $\hat f_1(x)$ is equal to $\frac{1}{\lambda_i(x)}L_{f_1}\tilde J_i^*(x)$. $\eta_1$ and $\eta_2$ are the Lagrange multipliers corresponding to the task and prioritization constraints, respectively.
From the KKT conditions, we obtain:
$$u = -\frac{1}{2}\begin{bmatrix}\hat f_1(x)^T & 0\end{bmatrix}\eta, \qquad \delta = \frac{1}{2\kappa}\begin{bmatrix}I & -K^T\end{bmatrix}\eta, \tag{15}$$
where $\eta = [\eta_1^T, \eta_2^T]^T$. By resorting to the Lagrange dual problem, and by using assumption 1, we get the following expression for $\eta$:
$$\eta = 2\underbrace{\begin{bmatrix}\frac{I}{\kappa} + \hat f_1(x)\hat f_1(x)^T & -K^T \\ K & KK^T\end{bmatrix}^{-1}}_{A_1^{-1}}\underbrace{\begin{bmatrix}\hat f_0(x) + \sigma(x) \\ 0\end{bmatrix}}_{b_0}, \tag{16}$$
where I denotes an identity matrix of appropriate size. Substituting (16) in (15), we get $u = -\begin{bmatrix}\hat f_1(x)^T & 0\end{bmatrix}A_1^{-1}b_0$ and $\delta = \frac{1}{\kappa}\begin{bmatrix}I & -K^T\end{bmatrix}A_1^{-1}b_0$.
To show the claimed stability property, we will proceed by a Lyapunov argument. Let us consider the Lyapunov function candidate $V(x) = \frac{1}{2}\tilde J^*(x)^T K^T K \tilde J^*(x)$, where $\tilde J^*(x) = \big[\tilde J_1^*(x) \ \dots \ \tilde J_M^*(x)\big]^T$. The time derivative of V evaluates to:
$$\dot V = \frac{\partial V}{\partial x}\dot x = \tilde J^*(x)^T K^T K \frac{\partial \tilde J^*}{\partial x}\dot x = \tilde J^*(x)^T K^T K \underbrace{\frac{\partial \tilde J^*}{\partial x} f_1(x)}_{\hat f_{1,\lambda}(x)}\, u \quad \text{(by assumption 2)},$$
where, notice, $\hat f_{1,\lambda}(x) = \Lambda(x)\hat f_1(x)$ and $\Lambda(x) = \mathrm{diag}([\lambda_1(x), \dots, \lambda_M(x)])$. By assumption 2, $\lambda_i(x) \ge 0$ for all i, and therefore $\Lambda(x) \succeq 0$, i.e. $\Lambda(x)$ is positive semidefinite. Then, $\dot V = \tilde J^*(x)^T K^T K \Lambda(x)\hat f_1(x)\,u = -\tilde J^*(x)^T K^T K \Lambda(x)\hat A \begin{bmatrix}\sigma(x) \\ 0\end{bmatrix}$, where
$$\hat A = \hat f_1(x)\begin{bmatrix}\hat f_1(x)^T & 0\end{bmatrix}\begin{bmatrix}\frac{I}{\kappa} + \hat f_1(x)\hat f_1(x)^T & -K^T \\ K & KK^T\end{bmatrix}^{-1} \tag{17}$$
as in Proposition 3 in [17], and we used assumption 2 to simplify the expression of $b_0$.
By assumption 3, it follows that the value functions $\tilde J_i^*$ are positive definite. Therefore, from the definition of σ, in a neighborhood of $0 \in \mathbb{R}^M$, we can bound $\sigma(x)$—defined by the gradients of $\tilde J^*$—by the value of $\tilde J^*$ as $\sigma(x) = \gamma_J(\tilde J^*(x))$, where $\gamma_J$ is a class K function.
Then, proceeding similarly to Proposition 3 in [17], we can bound $\dot V$ as follows: $\dot V = -\tilde J^*(x)^T K^T K \Lambda(x)\hat A\, \gamma_J(\tilde J^*(x)) \le -\tilde J^*(x)^T K^T K \Lambda(x)\tilde J^*(x) \le -\bar\lambda V(x)$, where $\bar\lambda = \min\{\lambda_1(x), \dots, \lambda_M(x)\}$. Hence, $K\tilde J^*(x(t)) \to 0$ as $t \to \infty$, and $\tilde J^*(x(t)) \to \mathcal{N}(K)$ as $t \to \infty$. □

Remark 2. The proof of Proposition 1 can be carried out even in the case of a time-varying and state-dependent prioritization matrix K. Under the assumption that K is bounded and continuously differentiable for all x and uniformly in time, the norm and the gradient of K can be bounded in order to obtain an upper bound for $\dot V$.

Remark 3. Even when the prioritization stack specified through the matrix K in
(14) is not physically realizable—due to the fact that, for instance, the functions
encoding the tasks cannot achieve the relative values prescribed by the prioriti-
zation matrix—the optimization program will still be feasible. Nevertheless, the
tasks will not be executed with the desired priorities and even the execution of
high-priority tasks might be degraded.

4 Experimental Results
In this section, the proposed framework for the execution of prioritized stacks
of tasks is showcased in simulation using a team of mobile robots. Owing to the
multitude of robotic units of which they are comprised, multi-robot systems are
often highly redundant with respect to the tasks they have to execute. Therefore,
they perfectly lend themselves to the concurrent execution of multiple prioritized
tasks.

4.1 Multi-robot Tasks


For multi-robot systems, the redundancy stems from the multiplicity of robotic
units of which the system is comprised. In this section, we will showcase the
execution of dependent tasks—two tasks are dependent if executing one prevents
the execution of the other [1]—in different orders of priority. The multi-robot
system is comprised of 6 planar robots modeled with single integrator dynamics and controlled to execute the following 4 tasks: all robots assemble a hexagonal formation (task T1), robot 1 goes to goal point 1 (task T2), robot 2 goes to goal point 2 (task T3), and robot 3 goes to goal point 3 (task T4). While task T1 is independent of each of the other tasks taken singly, it is not independent of any pair of tasks T2, T3, and T4. This intuitively corresponds to the fact that it is possible to form a hexagonal formation at different points in space, but it might
not be feasible to form a hexagonal formation while two robots are constrained to be in two pre-specified arbitrary locations.

Fig. 2. Snapshots (a–l) and plot of $\tilde J_1^*$, $\tilde J_2^*$, $\tilde J_3^*$, and $\tilde J_4^*$ (m) corresponding to the hexagonal formation control task and the 3 go-to-goal tasks for robots 1, 2, and 3, respectively, recorded during the course of a simulated experiment with a multi-robot system. Robots are gray dots, connection edges between the robots used to assemble the desired formation are depicted in blue, and goal points are shown as red dots.
Figure 2 reports a sequence of snapshots and the graph of the value functions
encoding the four tasks recorded during the course of the experiment. Denoting by $T_i \prec T_j$ the condition under which task $T_i$ has higher priority than $T_j$, the sequence of prioritized stacks tested in the experiment is the following:
$$\begin{cases} T_2, T_3, T_4 \prec T_1 & 0\,\mathrm{s} \le t < 15\,\mathrm{s} \\ T_1 \prec T_2, T_3, T_4 & 15\,\mathrm{s} \le t < 30\,\mathrm{s} \\ T_1 \prec T_2 \prec T_3, T_4 & 30\,\mathrm{s} \le t \le 45\,\mathrm{s}. \end{cases} \tag{18}$$
The plot of the value functions in Fig. 2m shows how, for $0\,\mathrm{s} \le t < 15\,\mathrm{s}$, since the hexagonal formation control task has lower priority than the three go-to-goal tasks, its value function $\tilde J_1^*$ is allowed to grow while the other three value functions are driven to 0 by the velocity control input solution of (14) supplied to the robots. For $15\,\mathrm{s} \le t < 30\,\mathrm{s}$, the situation is reversed: the hexagonal formation control is executed with highest priority while the value functions encoding the three go-to-goal tasks are allowed to grow—a condition which corresponds to the non-execution of those tasks. Finally, for $30\,\mathrm{s} \le t \le 45\,\mathrm{s}$, task T2, i.e. the go-to-goal task driving robot 1 to goal point 1, is added at higher priority with respect to tasks T3 and T4. Since it is independent of task T1, it can be executed at the same time. As a result, as can be seen from the snapshots, the formation translates towards the red point marked with 1. Tasks T1 and T2 are successfully executed, while tasks T3 and T4 are not executed, since they are not independent of the first two and they have lower priority.

Remark 4. The optimization program responsible for the execution of multiple prioritized tasks encoded by value functions is solved at each iteration of the
robot control loop. This illustrates how the convex optimization formulation of
the developed framework is computationally efficient and therefore amenable
to be employed in online settings. Alternative approaches for task prioritiza-
tion and allocation in the context of multi-robot systems generally result in
(mixed-)integer optimization programs, which are often characterized by a com-
binatorial nature and are not always suitable for an online implementation [8].

4.2 Discussion

The experiments of the previous section highlight several desirable properties of the framework developed in this paper for the prioritized execution of tasks encoded by value functions. First of all, its compositionality is given by the fact that tasks can easily be inserted and removed by adding and removing constraints from the optimization program (14). For the same reason, the framework is incremental and modular, as it allows for building a complex task using a number of subtasks which can be incrementally added to the constraints of an optimization-based controller. Moreover, it allows for the seamless incorporation of priorities among tasks, and, as we showcased in Sect. 4.1, these priorities can also be switched in an online fashion, in particular without the need of stopping and restarting the motion of the robots. Furthermore, Proposition 1 shows that the execution of multiple tasks using constraint-driven control is stable and the robotic system will indeed execute the given tasks according to the specified priorities. Finally, as the developed optimization program is a convex QP, its low computational complexity allows for an efficient implementation in online settings, even under real-time constraints on computationally limited robotic platforms.

5 Conclusion

In this paper, we presented an optimization-based framework for the prioritized execution of multiple tasks encoded by value functions. The approach combines control-theoretic and learning techniques in order to exhibit properties of compositionality, incrementality, stability, and low computational complexity. These properties render the proposed framework suitable for online and real-time robotic implementations. A multi-robot simulated scenario illustrated its effectiveness in the control of a redundant robotic system executing a prioritized stack of tasks.

A Comparison Between Optimal Control, Optimization-Based Control, and RL Policy

To compare the optimal controller, the optimization-based controller, and the RL policy, in this section we consider the stabilization of a double integrator system to the origin. The system dynamics are given by
$$\dot x = \begin{bmatrix}0 & 1 \\ 0 & 0\end{bmatrix}x + \begin{bmatrix}0 \\ 1\end{bmatrix}u,$$
where $x = [x_1, x_2]^T \in \mathbb{R}^2$ and $u \in \mathbb{R}$. The instantaneous cost considered in the optimal control problem (7) is given by $q(x) + u^2$, where $q(x) = x^T x$. The reward function of the value iteration algorithm employed to learn an approximate representation of the value function has been set to $g(x, u) = -q(x) - u^2$, and the resulting value function $\tilde J^*$ has been shifted so that $\tilde J^*(0) = 0$.
The results of the comparison are reported in Fig. 3. Here, the optimization-based controller solution of (13) with $V = \tilde J^*$ is compared to the optimal controller given in (9), and to the RL policy corresponding to the approximate value function $\tilde J^*$. As can be seen, the optimization-based controller and the optimal controller coincide, while the RL policy becomes closer and closer as the number of training epochs increases.
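For this quadratic cost and linear dynamics, the optimal value function is quadratic and the controller (9) reduces to a linear-quadratic regulator, so the optimal baseline can be computed in closed form; a sketch, assuming scipy is available, follows.

# Optimal baseline of Appendix A: with cost q(x) + u^2, q(x) = x^T x, the
# value function is J*(x) = x^T P x with P solving the continuous-time
# algebraic Riccati equation, and (9) gives u* = -B^T P x.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])      # double integrator
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)                 # q(x) = x^T x, input cost u^T u

P = solve_continuous_are(A, B, Q, R)
x = np.array([1.0, -0.5])                   # example state
u_opt = -B.T @ P @ x                        # optimal controller (9)
print(P, u_opt)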

B Implementation Details

The results reported in Sect. 4 have been obtained using a custom value function
learning algorithm written in Python. The details of each multi-robot task are
given in the following.
Fig. 3. Comparison between the optimal controller (given in (9)), the RL policy (based on the approximate value function $\tilde J^*$ (8)), and the optimization-based controller (solution of (13) with $V = \tilde J^*$) employed to stabilize a double-integrator system to the origin of its state space, i.e. driving both $x_1$ and $x_2$ to 0. As can be seen, when trained for a sufficiently long time, the RL policy recovers the optimal controller, which is also equivalent to the optimization-based controller.

Each robot in the team of N robots is modeled using single integrator dynamics $\dot x_i = u_i$, where $x_i, u_i \in \mathbb{R}^2$ are the position and velocity input of robot i. The ensemble state and input will be denoted by x and u, respectively. For the formation control task, the cost g is given by $g(x, u) = 1000 - 0.01(-E(x) - 10\|u\|^2)$, where E(x) is the formation energy defined as $E(x) = \sum_{i=1}^N \sum_{j \in \mathcal{N}_i} \big(\|x_i - x_j\|^2 - W_{ij}^2\big)^2$, $\mathcal{N}_i$ being the neighborhood of robot i, i.e. the set of robots with which robot i shares an edge, and
$$W = \begin{bmatrix} 0 & l & \sqrt{3}l & 2l & 0 & l \\ l & 0 & l & 0 & 2l & 0 \\ \sqrt{3}l & l & 0 & l & 0 & 2l \\ 2l & 0 & l & 0 & l & 0 \\ 0 & 2l & 0 & l & 0 & l \\ l & 0 & 2l & 0 & l & 0 \end{bmatrix} \tag{19}$$
with $l = 1$. The entry ij of the matrix W corresponds to the desired distance to be maintained between robots i and j.
The cost function g for the go-to-goal tasks is given by $g(x, u) = 100 - 0.01(-\|x - \hat x\|^2 - \|u\|^2)$, where $\hat x \in \mathbb{R}^2$ is the desired goal point.
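A minimal transcription of the formation-cost ingredients (the matrix W of (19) and the formation energy E(x)) could be:

# Hexagon distance matrix W of (19), l = 1, and formation energy E(x).
import numpy as np

l = 1.0
s3 = np.sqrt(3) * l
W = np.array([[0,    l,  s3, 2*l, 0,   l],
              [l,    0,   l, 0,   2*l, 0],
              [s3,   l,   0, l,   0,   2*l],
              [2*l,  0,   l, 0,   l,   0],
              [0,    2*l, 0, l,   0,   l],
              [l,    0, 2*l, 0,   l,   0]])

def formation_energy(x):
    """x: (6, 2) array of robot positions; sums over edges, i.e. W_ij > 0."""
    E = 0.0
    for i in range(6):
        for j in range(6):
            if W[i, j] > 0:                      # j in the neighborhood of i
                d2 = np.sum((x[i] - x[j])**2)
                E += (d2 - W[i, j]**2)**2
    return E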
Remark 5 (Combination of single-robot and multi-robot tasks). Single-robot tasks (e.g. the go-to-goal tasks considered in this paper) are combined with multi-robot tasks (e.g. the formation control task) by defining the task gradient required to compute $L_{f_0}\tilde J_i^*(x)$ and $L_{f_1}\tilde J_i^*(x)$ in the optimization program (14) in the following way: $\frac{\partial \tilde J_i^*}{\partial x} = \begin{bmatrix}0 & \cdots & 0 & \frac{\partial \tilde J_{ij}^*}{\partial x} & 0 & \cdots & 0\end{bmatrix}$, where the j-th entry $\tilde J_{ij}^*$ is the approximate value function for task i and robot j.

References
1. Antonelli, G.: Stability analysis for prioritized closed-loop inverse kinematic algo-
rithms for redundant robotic systems. IEEE Trans. Rob. 25(5), 985–994 (2009).
https://doi.org/10.1109/TRO.2009.2017135
2. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific
Belmont, MA (2019)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press,
Cambridge (2004)
4. Bryson, A.E., Ho, Y.C.: Applied Optimal Control: Optimization, Estimation, and
Control. Routledge, New York (2018)
5. Bylard, A., Bonalli, R., Pavone, M.: Composable geometric motion policies using
multi-task pullback bundle dynamical systems. arXiv preprint arXiv:2101.01297
(2021)
6. Dulac-Arnold, G., Mankowitz, D., Hester, T.: Challenges of real-world reinforce-
ment learning. arXiv preprint arXiv:1904.12901 (2019)
7. Freeman, R.A., Primbs, J.A.: Control Lyapunov functions: new ideas from an old
source. In: Proceedings of 35th IEEE Conference on Decision and Control, vol. 4,
pp. 3926–3931. IEEE (1996)
8. Gerkey, B.P., Matarić, M.J.: A formal analysis and taxonomy of task allocation in
multi-robot systems. Int. J. Robot. Res. 23(9), 939–954 (2004)
9. Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., Levine, S.: Divide-and-conquer
reinforcement learning. arXiv preprint arXiv:1711.09874 (2017)
10. Gupta, A., et al.: Reset-free reinforcement learning via multi-task learning: learn-
ing dexterous manipulation behaviors without human intervention. arXiv preprint
arXiv:2104.11203 (2021)
11. Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., Levine, S.: Composable
deep reinforcement learning for robotic manipulation. In: 2018 IEEE International
Conference on Robotics and Automation (ICRA), pp. 6244–6251. IEEE (2018)
12. Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep
energy-based policies. In: International Conference on Machine Learning, pp. 1352–
1361. PMLR (2017)
13. Kaelbling, L.P.: The foundation of efficient robot learning. Science 369(6506), 915–
916 (2020)
14. Micchelli, C.A., Pontil, M.: Kernels for multi–task learning. In: NIPS, vol. 86, p. 89.
Citeseer (2004)
15. Mukadam, M., Cheng, C.A., Fox, D., Boots, B., Ratliff, N.: Riemannian motion
policy fusion through learnable Lyapunov function reshaping. In: Conference on
Robot Learning, pp. 204–219. PMLR (2020)
16. Nachum, O., Gu, S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement
learning. arXiv preprint arXiv:1805.08296 (2018)
17. Notomista, G., Mayya, S., Hutchinson, S., Egerstedt, M.: An optimal task allo-
cation strategy for heterogeneous multi-robot systems. In: 2019 18th European
Control Conference (ECC), pp. 2071–2076. IEEE (2019)
18. Notomista, G., Mayya, S., Selvaggio, M., Santos, M., Secchi, C.: A set-theoretic
approach to multi-task execution and prioritization. In: 2020 IEEE International
Conference on Robotics and Automation (ICRA), pp. 9873–9879. IEEE (2020)
19. Peng, X.B., Chang, M., Zhang, G., Abbeel, P., Levine, S.: MCP: learning compos-
able hierarchical control with multiplicative compositional policies. arXiv preprint
arXiv:1905.09808 (2019)
20. Primbs, J.A., Nevistić, V., Doyle, J.C.: Nonlinear optimal control: a control Lya-
punov function and receding horizon perspective. Asian J. Control 1(1), 14–24
(1999)
21. Qureshi, A.H., Johnson, J.J., Qin, Y., Henderson, T., Boots, B., Yip, M.C.: Com-
posing task-agnostic policies with deep reinforcement learning. arXiv preprint
arXiv:1905.10681 (2019)
22. Rana, M.A., et al.: Learning reactive motion policies in multiple task spaces from
human demonstrations. In: Conference on Robot Learning, pp. 1457–1468. PMLR
(2020)
23. Ratliff, N.D., Issac, J., Kappler, D., Birchfield, S., Fox, D.: Riemannian motion
policies. arXiv preprint arXiv:1801.02854 (2018)
24. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv
preprint arXiv:1706.05098 (2017)
25. Sahni, H., Kumar, S., Tejani, F., Isbell, C.: Learning to compose skills. arXiv
preprint arXiv:1711.11289 (2017)
26. Schwartz, A., Thrun, S.: Finding structure in reinforcement learning. Adv. Neural.
Inf. Process. Syst. 7, 385–392 (1995)
27. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. arXiv
preprint arXiv:1810.04650 (2018)
28. Singh, S.P.: Transfer of learning by composing solutions of elemental sequential
tasks. Mach. Learn. 8(3), 323–339 (1992)
29. Smith, V., Chiang, C.K., Sanjabi, M., Talwalkar, A.: Federated multi-task learning.
arXiv preprint arXiv:1705.10467 (2017)
30. Sontag, E.D.: A Lyapunov-like characterization of asymptotic controllability. SIAM
J. Control. Optim. 21(3), 462–471 (1983)
31. Sontag, E.D.: A “universal” construction of Artstein’s theorem on nonlinear sta-
bilization. Syst. Control Lett. 13(2), 117–123 (1989)
32. Teh, Y.W., et al.: Distral: robust multitask reinforcement learning. arXiv preprint
arXiv:1707.04175 (2017)
33. Todorov, E.: Compositionality of optimal control laws. Adv. Neural. Inf. Process.
Syst. 22, 1856–1864 (2009)
34. Van Niekerk, B., James, S., Earle, A., Rosman, B.: Composing value functions
in reinforcement learning. In: International Conference on Machine Learning, pp.
6401–6409. PMLR (2019)
35. Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE Trans. Knowl. Data
Eng. 34, 5586–5609 (2021)
A Comparison of Two Decoupled
Methods for Simultaneous Multiple
Robots Path Planning

Benjamin Bouvier and Julien Marzat

DTIS, ONERA, Université Paris Saclay, 91123 Palaiseau, France


{benjamin.bouvier,julien.marzat}@onera.fr

Abstract. Two path planning algorithms dedicated to multiple robots driving simultaneously in a cluttered workspace are defined and compared in this paper. Each robot is assigned an initial position, a goal position and a constant reference speed, and has to drive without colliding
tion and a constant reference speed, and has to drive without colliding
with its teammates and the environment. Both approaches are based on
an implementation of the widely known A* algorithm and belong to the
family of decoupled path planning methods, since robots are considered
sequentially and with a predefined priority ranking. The first algorithm
is called Prioritized Planning (PP), where the path of each robot is com-
puted sequentially while taking into account static obstacles as well as the
previously-planned robot positions. The second algorithm, called Fixed-
Path Coordination (FPC), follows a two-step approach: (i) obtaining A*
paths of all robots taking into account static obstacles only; (ii) applying
a velocity-tuning procedure so that lower-priority robots can stop and
restart along their paths to let the higher-priority robots move freely.
Both algorithms have been applied on a variety of test-cases through a
Monte Carlo setup to evaluate and compare their performances.

Keywords: Multi-robot motion coordination · Decoupled path planning · Prioritized Planning · Fixed-Path Coordination

1 Introduction
Coordinated path planning of a group of mobile robots is a major requirement for
many cooperative tasks in interaction with a complex environment (e.g. unknown,
cluttered, dynamical). A high-level description of the problems related to either
collaborative or competitive robot coordination can be found in [1]. This coordi-
nation implies an inherent conflict of resources: in path planning the conflict is
mostly a space conflict because the robots move around each other, but it may
also be a conflict due to limited communication channels or a conflict of payload
if various robots manipulate the same one. Task allocation is a problem in itself
too, as well as the option of leaving the fleet management to a centralized author-
ity or allowing every robot to take decisions on their own. It is considered here that
the allocation of goal positions has already been performed and that the problem
consists in coordinating the robots’ simultaneous motion in the workspace.
When managing more than one robot, a solution can consist in applying
motion coordination [2] i.e. coordinating the robots online, as they are moving,
to prevent collisions from occurring. This can be achieved either through a cen-
tralized authority sending commands to the robots (centralized approach) or with
the robots communicating with each other based on their future displacements
(decentralized approach). Motion coordination can result in various solutions: (i)
a system of traffic rules; (ii) reactive approaches which can be based on poten-
tial fields with attractive forces towards goals and repulsive ones to stay away
from obstacles; (iii) the management of a swarm or fleet with a common objec-
tive. For example, coordinated control approaches have been previously used for
several applications that can be addressed by a fleet of autonomous aerial or
terrestrial robots [3]. To limit computational complexity, the online search of a
model predictive control input for each vehicle can be reduced to a discretized
set while still taking into account the predicted motion of the other robots, as
in [4]. However this strategy can be limited for autonomous robots moving freely
in a very cluttered environment, where in some configurations no feasible solu-
tion can be found. This is why graph-based and sampling-based path planning
methods should also be considered for simultaneous robot coordination [5,6]. In
the latter framework, a simple approach could consist in considering only the
current positions of the other robots when computing the path of each one, but
collision avoidance with the other robots is then not guaranteed. The objective
of the present work is to study path planning methods that can be applied to
multiple robots moving simultaneously in the same cluttered environment, with
an explicit handling of their motion interaction. For this purpose two decou-
pled sampling-based path planning algorithms derived from the A* procedure
are described in Sect. 3, and their performances are evaluated in a Monte Carlo
setup with a combination of map sizes, rate of static obstacles and number of
robots (see Sect. 4).

2 Related Work

Simultaneous motion planning for multiple robots comes with various difficulties,
such as the numerical capacity of planning for a high number of robots or large
workspaces, the consideration of robot dynamics as well as sensing constraints,
or the transition from numerical simulation to real robots experiments [2]. To
deal with this variety of possible issues, several motion planning paradigms have
been derived as described in the reference textbook [5], where in particular algo-
rithms can be classified as either sampling-based or combinatorial methods. In
Sampling-based motion planning, the idea is to sample the free space to conduct
the search for a path relying on a collision detection module, hence the explicit
construction of the obstacle space is avoided. In Combinatorial motion plan-
ning, also referred to as exact methods, paths are computed in the continuous
configuration space without resorting to sampling or discretization. Two main
categories emerge from the literature to address simultaneous path planning,
namely coupled approaches [6–8] and decoupled approaches [8–11]. In coupled
planning, the problem can be seen as handling a single virtual robot whose state concatenates the states of the individual robots. This increases the dimension of the
configuration space and complexity is known to rise exponentially in this num-
ber. Classical algorithms may then fail to find a solution or require too much
time (this has been verified in our evaluations, see Sect. 4.1). In decoupled path
planning, two subcategories can be considered [12]:

1. Robots are treated one by one. For every one of them, a path is calculated
considering all other robots as moving obstacles on top of the already-existing
static obstacles. Two steps are required, namely: a planning algorithm and a
prioritization rule. The planning algorithm is applied to obtain the path of
each robot to its respective goal, and then prioritization should favor those
with the highest cost (depending on time, distance, energy, etc.). For instance,
robots with the least cost could be penalized by having to bypass or make
detours.
2. Each robot has its own path, calculated independently of the others with only
the static obstacles taken into account. Then a coordination diagram should
plan the displacements to avoid collisions. Some authors mention a second
phase of velocity tuning (e.g. in [13]) to coordinate robots in parallel: the
relative speed between two robots can thus be tuned in the portion of path
subject to a risk of collision.

Finally, it is possible to build hybrid approaches [12] which rely on both coupled and decoupled search. The idea is to divide the robots into subgroups
depending on their level of dependency or degree of priority (the constitution of
such groups is in itself a task). For each subgroup, which has a smaller configura-
tion space, a coupled approach is applied. The groups are treated in a decoupled
way with respect to one another. The authors in [14] proposed a planning algo-
rithm based on satisfiability modulo theory, where suboptimal paths can be
accepted if, in turn, this makes the problem solvable.
Coupled or hybrid approaches however still report computational loads that
are prohibitive for embedded applications for realistic numbers of robots and
workspace dimensions. It has thus been chosen to focus on decoupled planning
methods in this work, since this category of approaches has the capacity of
decomposing the problem into simpler ones, and the literature shows that fea-
sible solutions can be found even when the number of robots, the workspace
dimension or the degrees of freedom increase. In the context of managing inter-
sections for autonomous vehicles, a priority-based coordination framework which
combines a high-level priority graph with a feedback control law has been pro-
posed in [15]. Under this framework, the proposed overall coordination system is
proven to be collision-free and deadlock-free. In [16], a prioritized planning app-
roach has been defined with a formal guarantee to provide a solution under strict
conditions regarding the workspace and the robots initial conditions. An asyn-
chronous decentralized implementation of this strategy has also been proposed
and demonstrated reliable results via a simulation evaluation on real-world maps.
Similar approaches relying on priority-based planning have also been proposed
in [9,10]. The main ideas for coordinating the speed of robots having indepen-
dent goals along pre-defined paths have been presented in [8]. The so-called
coordination diagram based on a bounding box representation of the obstacles
has then been introduced in [17] to tackle the motion coordination of several
robots moving along fixed independent paths while avoiding mutual collisions.
Multi-robot path planning continues to be an active area of research with many
different approaches proposed in the past few years, based e.g. on reinforcement
learning [18,19], particle swarm optimization [20,21], graph neural networks [22]
and with the design of novel heuristics [23].
Several decoupled multi-robot coordination strategies have been defined and
evaluated in these previous works, relying either on priority-based successive
planning or on velocity adjustment along fixed paths. Two such decoupled algo-
rithms are defined in the present paper, one from each category, namely Pri-
oritized Planning (PP) and Fixed-Path Coordination (FPC). They are both
derived from the same A* framework with practical implementations, which are
detailed in pseudocodes. This allows to compare fairly the two approaches in the
same simulation setup using extensive Monte-Carlo evaluations on randomly-
defined workspace configurations, and as a result provide recommendations for
the motion coordination of multiple robots having independent goals in cluttered
environments.

3 Planning Algorithms
3.1 Problem Formulation

The problem is illustrated in Fig. 1. A two-dimensional workspace $W = \mathbb{R}^2$ is considered (but the methods can easily generalize to higher-dimension workspaces).
This workspace can be separated into two parts: a first part where robots are free to move, $C_{free}$, and a second one occupied by static obstacles, $C_{obs}$, where robots are not allowed to drive. A robot A can undergo various transformations which result in reaching a particular configuration $q \in W$. An obstacle-free configuration such that $q \in C_{free}$ is considered valid, while a collision configuration $q \in C_{obs}$ is invalid. Let us now consider m robots, $m \in [[2; +\infty[[$. Every robot $A_i$, $i \in [[1; m]]$, is assigned an initial configuration, $q_I^i$, and a final configuration or goal, $q_G^i$, defined in $C_{free}$. Contrary to the single-robot case, where it is sufficient to consider robot-obstacle collisions, robot-robot collisions should be taken into account. Robots are moving with respect to each other, therefore a given robot $A_i$ should consider all the other robots $A_j$, $j \ne i$, as moving obstacles. The set of configurations such that robot $A_i$ collides with robot $A_j$, $j \ne i$, is defined as $C_{mov\,obs}^{i,j}$.
Fig. 1. Multiple robots path planning: robots A1 and A2 drive from their initial to final
configurations while avoiding two static obstacles and ensuring they do not collide.

The path planning problem consists in finding, for every robot $A_i$, a continuous path $\tau_i : \{[0; 1] \to C_{free};\ s \mapsto \tau_i(s)\}$, where s parameterizes the path, such that $\tau_i(0) = q_I^i$ and $\tau_i(1) = q_G^i$, and with an empty intersection with both $C_{obs}$ and $C_{mov\,obs}^{i,j}$ for all $j \ne i$. The notion of moving obstacles requires a parameter that can be used to determine where a given robot is located at some instant. If all robots had the same speed, the cumulative distance could be used and s would be sufficient. However, for robots with different speeds $v_i$, time should be considered explicitly in the problem definition. Therefore, a date is associated to every configuration of the path. Instead of looking for the parameterized path $\tau_i$ defined previously, we now look for a trajectory $\phi_i = \tau_i \circ \sigma_i$ considering a timing function defined as $\sigma_i : \{T \to [0; 1];\ t \mapsto s\}$, where t denotes time.

3.2 Decoupled Prioritized Planning (PP)

The first algorithm is a decoupled two-dimensional algorithm derived from A*


using prioritization to sequentially obtain the path of every robot without col-
lisions. The method requires that the m robots have been ranked from highest
to lowest priority. For the first one, A1 , a standard A* path is calculated from
its initial position to its goal position, taking into account the static obstacles
present in the workspace. Then, the second robot is treated: besides static obsta-
cles, A2 has to consider the previously computed motion of robot A1 along its
path with a given constant speed v1 . In case of collision with A1 at a given node,
A2 will have to reach another configuration, therefore bypassing A1 which has
priority. This procedure is applied iteratively for the next robots until all have
been treated. The idea is illustrated in Fig. 2. In this algorithm, every robot is
assumed to drive at its own constant nonzero speed, meaning that a robot is
unable to stop. All it can do is bypass, which requires to have sufficient free
space around the static and moving obstacles.
Fig. 2. PP Algorithm: A2's path is obtained by bypassing the higher-priority robot A1.

Each robot is assigned a vector of dates associated to its A* path, given its nominal velocity $v_i$ specified by the user. For robot $A_i$, these dates allow interpolating the positions of all higher-priority robots $A_j$, $1 \le j < i$, to detect possible
collisions. Collisions are defined as follows: the current robot Ai is considered as
a point, while all the robots with higher priorities Aj , 1 ≤ j < i, are considered
to occupy a disk of parameterized safety radius centered at the robot position
at the current date. If Ai finds itself inside or on the outer circle of any disk
representing Aj , a collision is detected and the corresponding node is declared
unfeasible. The pseudocode of the algorithm is given in Algorithm 1. It globally
consists in a loop with a modified A* procedure applied for each robot (modifi-
cations are indicated in blue). The dates in the list time vec (which is of same
length as the heuristic scores gscores ) are defined for the nodes present either in
the closedSet (which progressively contains the path) or in the openSet (which
contains nodes considered to be part of the path but eventually not selected).
The considered heuristic dist(nbg, n) is based on the Euclidean distance. In the
end, a trajectory (pathi ; datesi ) is obtained for each robot i. If the while loop
is exited with success = f alse, the path is extracted from the node of mini-
i
mal heuristic value instead of qG , therefore an incomplete path is obtained to
determine how far the robot can go before having to stop.
The core of the multi-robot layer lies in the robot-robot collision detection.
When managing static obstacles, it is sufficient and straightforward to check if
the intended configuration is inside or outside Cobs . When it comes to moving
obstacles (higher-priority robots), one has to consider the relative speed between
the robot and the obstacle as well as the obstacle width. Collision checks are
performed every dT seconds, given the relative speed vector and the obstacle
width. When moving from a current node position n to the intended one ngb,
the robot does not change its direction but the moving obstacle might. As a
consequence, care should be taken in selecting the dates and in calculating the
relative speed vector. Moreover, collision checks are performed with all robots of
higher priority (the sweeping is nevertheless stopped when a collision is detected
because one is sufficient). All these additional calculations can be costly so they
are run only if a static obstacle has not already been detected. Finally, a post-
check is conducted on the obtained paths and dates such that the coordination
is considered to have failed in the particular case where a lower-priority robot
has reached its destination and is blocking the path of a higher-priority robot.
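A sketch of this collision test could look as follows; the linear interpolation of the higher-priority trajectories and all names are illustrative assumptions.

# PP robot-robot collision test: higher-priority robots are disks of a safety
# radius centered at their (interpolated) position at date t.
import numpy as np

def position_at(path, dates, t):
    """Linearly interpolate a robot position along (path, dates) at time t."""
    t = np.clip(t, dates[0], dates[-1])
    k = np.searchsorted(dates, t, side='right') - 1
    k = min(k, len(dates) - 2)
    a = (t - dates[k]) / max(dates[k + 1] - dates[k], 1e-9)
    return (1 - a) * path[k] + a * path[k + 1]

def collides(p, t, higher_paths, higher_dates, r_safe=0.5):
    """True if point p at date t lies inside or on any higher-priority disk."""
    for path, dates in zip(higher_paths, higher_dates):
        if np.linalg.norm(p - position_at(path, dates, t)) <= r_safe:
            return True                      # one collision is sufficient
    return False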
1  foreach robot i ∈ [[1; m]] [ranked from highest to lowest priority] do
2    Initialize closedSet to empty and openSet to the node containing q_I^i;
3    Initialize lists gscores (cumulated distances from q_I^i) and time_vec (time instants);
     while openSet not empty do
4      Find node n in openSet with minimal heuristic value mhv;
5      if mhv = ∞ then Assign success := false; Break Loop; end;
6      if n = q_G^i then Assign success := true; Break Loop; end;
7      Add n to closedSet; Remove n from openSet;
8      for all neighbors ngb of n inside the workspace do
9        if ngb ∉ closedSet then
10         Compute candidate cost alt := gscores(n) + dist(ngb, n);
11         if ngb collides with a static obstacle or with previous robots on
           {path_j, dates_j, ∀j < i} then
12           alt := ∞;
13         end
14         if alt < gscores(ngb) then Update gscores, time_vec end;
15       end
16     end
17   end
18   if success = true then
19     Get full path_i from start q_I^i to goal q_G^i and vector dates_i from time_vec;
20   else
21     Get incomplete path_i and vector dates_i from start to best reachable node;
22   end
23   Return path_i and dates_i;
24 end
Algorithm 1: Prioritized Planning (PP) pseudocode

3.3 Decoupled Fixed-Path Coordination (FPC)

The second algorithm is also decoupled, two-dimensional and A*-based. It differs from the PP algorithm in that it allows robots to stop on their paths and resume
their motions. Therefore, robot-robot collisions are not avoided by bypassing but
simply by stopping and starting again (see Fig. 3a), which is one possible choice
for modulating the robot velocities along their trajectories [5].

Fig. 3. A2 ’s trajectory is obtained by pausing until higher-priority robot A1 has gone


by.
First, an A* path is obtained independently for each of the m robots by considering static obstacles only (referred to as single-robot planning). The input
requirements are the same as the previous algorithm: the robots should be ranked
from highest to lowest priority and each is assigned a nominal velocity vi . A
speed profile is then initialized for each robot along the A* paths. The next step
consists in tuning these velocity profiles to ensure collision-free paths, which
is achieved by inserting pauses every time a robot-robot collision is detected.
Note that these breaks can be inserted anywhere in the path, which means not
necessarily on the nodes of the A* grid. If a collision is detected (with the same
mechanism as in PP), the robot with the lowest priority will pause as long as
necessary for the higher-priority robots to go past. Another less conservative
option would be to optimize the robot motions given some global criterion to
select which robot should stop for given collision conditions, which might result
in different prioritization rules for similar collision configurations at different
time instants.
In the PP algorithm, a list of dates has been added as additional informa-
tion to handle collision-checking. The same applies in this second algorithm,
with the addition of a binary motion command state associated to each date,
to determine if the robot is driving or waiting (see Fig. 3b). Another differ-
ence with Algorithm 1 is the calculation of dT . In PP it is calculated from
the relative speed vector between two given robots at a certain date, while a
unique dT is now used in FPC for all couples of robots on the basis of the
highest possible relative speed. The pseudocode is given in Algorithm 2. Every
time a collision is detected, a pause is inserted, which requires to check again
the collisions of all pairs of robots at the date of collision, in case the pause
insertion would induce a collision that did not exist before. Also, when two
robots are moving exactly in opposite directions, that stop may need to be
inserted several times dT before the collision date, which requires going back-
wards in time to check all pairs and can turn out to be costly. Although not
explicit in Algorithm 2, the main while loop (line 7) can be interrupted on
a maximum number of collision checks in the sub-routines, which indicates a
likely absence of solution and is thus considered as a failure to find coordinated
paths.
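One plausible way to instantiate such a global dT—an assumption of this sketch, since the exact formula is not given above—is to bound the displacement between two successive checks by the safety radius under the worst-case (head-on) relative speed:

# Hypothetical global collision-check step for FPC: two robots closing at the
# highest possible relative speed cannot jump over a safety disk between
# two successive checks.
def collision_check_step(speeds, r_safe=0.5):
    v_rel_max = 2 * max(speeds)        # worst case: head-on motion
    return r_safe / v_rel_max          # displacement per check <= r_safe

dT = collision_check_step([0.8, 1.0, 1.2])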
1  for all robots i ∈ [[1; m]] do
2    get A* path_i (with static obstacles only), initial dates_i and motion_i vectors;
3  end
4  Tmax := maximum final date among all robots;
5  Calculate dT given nominal speeds v_i, i ∈ [[1; m]];
6  Initialize K := ceil(Tmax/dT) and time index k ← 1;
7  while k ≤ K do
8    Initialize boolean any_collision_detected with value true;
9    Initialize integer k_candidate with value k;
10   while any_collision_detected do
11     Set collision_counter to 0;
12     for robot r from 1 to m − 1 [loop over priority robots] do
13       for robot s from r to m [loop over subordinate robots] do
14         if robot s collides with robot r then
15           collision_counter ← collision_counter + 1;
16           Insert a stop at last date dstop when a motion command had been
             applied before the collision, and update dates_s and motion_s;
17           k_candidate ← min {k_candidate; ceil(dstop/dT)};
18           if final date of robot s > final date of all other robots then
19             K ← K + 1;
20           end
21         end
22       end
23     end
24     if collision_counter == 0 then any_collision_detected := false;
25   end
26   end
27   if k_candidate < k then
28     k ← k_candidate; [start over the collision check in the subinterval containing
       or ending by the smallest dstop used in the latest collision while loop]
29   else
30     k ← k + 1 [go to next time subinterval];
31   end
32 end
33 For all robots i ∈ [[1; m]], return dates_i, motion_i (path_i is fixed by design);
Algorithm 2: Fixed-Path Coordination (FPC) pseudocode

4 Numerical Experiments
Extensive simulations have been run based on the following setup. The workspace
is a bounded subset of R2 with translational degrees of freedom. As shown in
Fig. 4a, the free space is sampled at regularly-spaced points (in green) which
form a grid with at most eight neighbors per node. Robots move along the edges
with the aim of traveling from their starting positions (black) to their final ones
(red). The number indicated next to every node is the date at which the robot
reaches the corresponding position.
Fig. 4. Illustrative test case with 3 robots successfully solved by PP (b) and FPC (c-d).
Arrival times are indicated around the nodes and motion command profiles (d) show
the stop-and-go behaviour of FPC.

4.1 Illustrative Test Cases

Coupled Planning. A coupled A* strategy has been considered as a baseline, where all the positions of the robots in the fleet were aggregated into a single
state vector. However the combinatorial complexity grows exponentially with
the number of nodes and the number of robots, making it impractical for the
simultaneous path planning of teams with more than two or three robots. This
has been verified on a simple example with around 100 nodes where computing
the paths takes a few milliseconds for one robot, several seconds for two robots,
and several hours for three robots. Therefore, the focus has been put on the
comparison of the proposed decoupled algorithms.
Decoupled Methods. As an illustrative example, a simple test case for three robots with different velocities is presented in Figs. 4b–4d. Both decoupled
algorithms are successful in coordinating the three agents. With PP, robot A2
finds an alternative node to avoid a collision with A1 at node (45; 25). With
FPC, robots A2 and A3 stop once and twice respectively, increasing their motion
duration by 16 and 89% compared to the constant-speed travel time of the initial
independent A* paths.

4.2 Monte Carlo Simulations


In order to evaluate and compare the algorithms performances, a Monte Carlo
simulation was set up. It consisted in (i) creating random scenarios (follow-
ing uniform distributions) as two-dimensional maps with randomly-placed static
obstacles and random initial and final configurations of robots and (ii) applying
the two algorithms. The following parameters are selected for each experiment:
(a) a number of nodes (related to the map size), (b) a rate of static obstacles
and (c) a number of robots. The static obstacles are placed so that a particular
percentage of the map (called occupancy rate) is covered with them, e.g. 10%
means that 10% of the nodes are occupied by static obstacles. Even though both
algorithms allow the robots to have different nominal speeds, they were all set to
the same value in these evaluations for a fair comparison since path lengths and
travel times are handled differently by PP and FPC. Two values were respec-
tively chosen for the map size (about 10³ and 10⁴ nodes), the occupancy rate
(10% and 30%) and the number of robots (5 and 10), and all eight combinations
evaluated. Any map where single-robot A* planning failed was dismissed, since
that means at least one destination cannot be reached. The expression single-
robot planning refers to the superposition of the paths obtained for every robot
considering only the static obstacles and ignoring the other agents. Two main
metrics were computed for all the test cases and algorithms: the success rate
and the increase of travel duration with respect to the single-robot reference
(see Table 1).
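To make the scenario generation concrete, the following minimal Python sketch (our illustration, not the authors' code; all names and the grid representation are assumptions) draws a random map with a given occupancy rate and collision-free random start and goal nodes:

```python
import random

def make_scenario(side, occupancy_rate, n_robots, seed=None):
    """Random grid map: obstacles cover occupancy_rate of the nodes,
    and starts/goals are drawn without replacement from the free nodes."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(side) for y in range(side)]
    obstacles = set(rng.sample(nodes, int(occupancy_rate * len(nodes))))
    free = [n for n in nodes if n not in obstacles]
    picks = rng.sample(free, 2 * n_robots)  # no shared initial/final positions
    return obstacles, picks[:n_robots], picks[n_robots:]

# One of the eight evaluated combinations: ~10^3 nodes, 10% occupancy, 5 robots.
obstacles, starts, goals = make_scenario(32, 0.10, 5, seed=42)
```

A generated map would then be kept only if single-robot A* succeeds for every start-goal pair, mirroring the dismissal rule described above.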

Table 1. Monte Carlo comparison (mean results over 1000 runs) for the PP and FPC
algorithms.

Fig. 5. PP and FPC success rates show little dependency on occupancy rate

Rate of Planning Success. Single-robot success refers to cases where independent
single-robot planning happens to generate no collisions with constant-speed
travel. As can be seen in Table 1 (line 1), the success rates ranging from 10 to
90% decrease by 4 to 15 points when the occupancy rate is tripled and by 26 to
67 points when the number of robots is doubled. Available nodes become pro-
gressively less numerous so the probability of path overlap and/or crowded areas
rises, which produces more collisions. On the contrary, growing the number of
nodes has a positive impact on the success rate because more free space becomes
available: paths are less likely to overlap and, even if they do, robots are less
likely to take these common portions at the same dates.
Considering now the cases where single-robot planning fails and multi-robot
planning becomes necessary, we may compare the efficiency of PP and FPC
(lines 2-3 in Table 1, also shown in Fig. 5). It should be noted that this metric
is calculated only on the subset of single-robot failures. PP has a success rate
between 64 and 96% and FPC from 51 to 81%. PP always performs better than
FPC because FPC fails when an initial or final position is on another robot’s
path. A goal position acts as an additional static obstacle from the perspective of
a lower-priority robot; an initial position is also seen as a static obstacle but by
a higher-priority robot because this is the only position that cannot be freed by
holding up the robot somewhere else in the map. In a nutshell, FPC handles path
overlaps less efficiently. A check on initial and final positions, as well as priority
reordering, could help avoid such cases. Finally, the number of nodes has a
positive impact on the planning status because it offers more possibilities for
PP to find alternative paths and because FPC is less affected by the previously
described issues.

Travel Duration Increase. It is worth recalling the difference between PP
and FPC in terms of path alteration. PP will change portions of paths to bypass higher-
priority robots while driving continuously so both travel duration and distance
may increase. FPC, however, can make lower-priority robots wait but can-
not make them divert from single-robot paths, which means distance remains

unchanged. This is why Table 1 only mentions travel duration increase. Path
lengthening would be zero for FPC and equal to duration increase for PP because
all speeds are equal. The metric is computed only on multi-robot configurations
for which the paths were successfully modified by the corresponding algorithm.
The results show that FPC delays the robots much more than PP, with 18 to
57% of time increase on average compared to less than 6.1% for the smaller map
and always less than 1% on the bigger one. In many cases, bypassing a robot
has a limited impact on arrival time because various slightly different paths (and
sometimes ones of exact same length) lead to the goal. If two robots drive in
opposite directions, one may simply shift to a parallel lane with PP whereas in
FPC the lower-priority agent will have to stop to give way.

5 Conclusion
Two path planning algorithms dedicated to multiple robots driving simultane-
ously in a cluttered workspace have been defined as variations of A* procedures
and compared in this paper. Both are based on a predefined order of priority
and assumptions of constant nominal velocities. The first, named Prioritized
Planning (PP), consists of bypassing higher-priority agents while continuously
driving. The second, named Fixed-Path Coordination (FPC), handles collision
risks along fixed paths by inserting pauses in order to let higher-priority robots
move past. The Monte Carlo simulation setup showed that directly superimpos-
ing single-robot planning paths can be successful in a number of configurations
with limited occupancy rates and fleet sizes. This could thus be a first step when
addressing a multi-robot coordination problem, where the new strategies can
then be called upon when this straightforward approach has failed. PP showed
good performance thanks to its flexibility in adapting the paths by following
alternative nodes. Satisfactory success rates and very limited travel duration and
distance increases were obtained. The proposed FPC version proved less efficient
because of its inability to divert from the initial single-robot paths.
This strategy may only prove more efficient in extremely cluttered environments
where alternative nodes would be harder to find for PP. Most of the FPC failed
cases were due to initial and final positions conflicts, as they may prevent other
robots from reaching their goals. Moreover, although distance remains minimal
with FPC, travel duration grows significantly more than with PP. Free space
sampling maximization is of interest to improve the success rate of both single-
and multi-robot algorithms.
Future work directions include the evaluation of priority shuffles for both
PP and FPC algorithms and their extension to higher-dimensional workspaces,
which will make it possible to handle heterogeneous teams of robots. Additional comparison
criteria can also be considered, such as energy efficiency. Finally, the deployment
on embedded computers of real-world robots and the interaction with actual
perception and control modules would also make it possible to estimate the
computational load and performance in full autonomy loops in the field.

Acknowledgement. This work was supported by ONERA project PR PARADIS.



Smarticle 2.0: Design of Scalable,
Entangled Smart Matter

Danna Ma1 , Jiahe Chen1 , Sadie Cutler2 , and Kirstin Petersen1(B)


1
School of Electrical and Computer Engineering, Cornell University,
Ithaca, NY 14853, USA
{dm797,jc3472,kirstin}@cornell.edu
2
Sibley School of Mechanical and Aerospace Engineering, Cornell University,
Ithaca, NY 14853, USA
sc3236@cornell.edu

Abstract. We present a new iteration of smart active matter modules
capable of unprecedented 3D entanglement, designed specifically for fab-
rication and operation at large scales by a range of scientific users. We
discuss the benefits of entanglement compared to traditional rigid, lat-
tice formations in active matter and modular robots, and the design
which supports low cost, a small and appropriate form factor, low weight,
low barrier-of-entry, and ease of operation. We characterize the platform
in terms of actuation repeatability and longevity, lifting and holding
strength, a number of sensing modalities, and battery life. We demon-
strate short and (relatively) long range communication using tactile and
acoustic transceivers. We further show exploratory collective behaviors
with up to 10 modules, including static entanglement and self-disassem-
bly. We hope that this open-source ‘robo-physical’ platform can pave the
way for new innovations across the fields of modular robots and active
and soft matter.

Keywords: scalable collective · modular robot · active matter

1 Introduction
Smart robotic matter consists of large aggregates of programmable units capa-
ble of actuation, sensing, and intelligent response to their surroundings [1]. It
promises applications both as a physical test platform to gather further insights
on materials, colloids, and biological swarms [2,3], as well as the ability to
act as a taskable, modular robot that can morph between shapes and change
properties as needed [4–6]. Demonstrations currently span swarms of carefully
linked, individually-capable modules [1,5], to loose aggregates of modules with
little-to-no individual mobility, sensing, or computation [2,7,8]. The latter is of
special interest in this article because system-wide complexity and functionality
can emerge from local interactions between large numbers of individually limited
components. However, it also involves significant challenges related to scalable
fabrication and operation, and methods for coordination and reliable autonomy.

Fig. 1. A) Smarticle 2.0 modules, capable of transitioning between “I” and “U” shapes.
B) Wireless charging circuit. C) Entangled staples hanging from a ruler. D) 20 entan-
gled modules.

This paper presents the newest iteration of a Smarticle (smart active matter)
robot platform [2], specifically designed to support affordable and large-scale
operation, 3D entanglement, a variety of sensing modalities, and low barrier of
entry for novice users across scientific fields (Fig. 1). A Smarticle 2.0 module has
a long slender profile that can alternate between a staple-like shape (“U”) and
an oblong shape (“I”) using a single actuator, and has the ability to sense and
communicate acoustically, optically, and mechanically.
This article focuses on the details of the Smarticle 2.0 design (Sect. 3), fab-
rication (Sect. 4), and mechanical and sensor characterization (Sect. 5). How-
ever, we also show some exploratory multi-robot experiments with up to 10
robots (Sect. 6). To clarify the results presented, we recommend that the reader
watches the accompanying video (https://youtu.be/KrcyYz2ccdg). To the best
of our knowledge, this is the first demonstration of a smart matter platform
capable of large-scale, 3D entangled operation.

2 Background and Related Work


The typical trend in smart matter is to rely on modules with sufficient locomo-
tion, sensing, communication, and reasoning skills to function on their own or
in small numbers [1,5]. However, more recent works have suggested that smart
matter can be composed of much less capable modules that are largely agnostic
to each other and their surroundings. Examples include modules that are indi-
vidually immobile, but collectively move [9–11], and, perhaps more impressively,
exhibit emergent intelligent behaviors in loose aggregates with only very coarse
perception and no explicit communication [8,10,12,13].
One of the major practical issues with smart matter and modular robots
is the design of couplings/connection points. Beyond supporting the weight of
adjoining modules, these connection points are often used to transfer drive forces,
communication signals, and power. In reconfigurable robotic matter, these mech-
anisms must dock and undock repeatedly and with high accuracy. Hence, these
coupling mechanisms are often responsible for a significant fraction of the weight,
volume, price, and wear. To overcome this issue, many platforms rely on active or
passive magnetic coupling [1,5,14], with the explicit trade-off of low force trans-
fer which limits configurations to small overhangs or planar operation. Another
interesting alternative is modules with compliant magnetic couplings, which per-
mit loosely coupled aggregates in non-lattice formations [8–10]. A final promising
option is the reversible fusion of surface material, as demonstrated in 3D modular
robots [15].
The Smarticles, first presented in 2015 [2], break with these discrete on/off
unit couplings by leveraging entanglement of convex modules. Entanglement is
advantageous because 1) it does not require precise module-module alignment;
2) it does not require expensive, heavy, and bulky coupling mechanisms; and 3)
although it cannot produce particular formations, it permits modules to couple
loosely to generate materials and conglomerates with different shapes and rheo-
logical and structural properties like viscosity, yield stress, and packing density.
The Smarticles entangle simply by leveraging two rotating appendages, causing
units to alternate between “U”, “I”, and, in some cases, “Z” shapes. In spite
of their simplicity, Smarticle aggregates have been shown to move collectively
when many are confined in a small space [11], and to perform phototaxis using
only binary sensor inputs and physical interactions [12,16]. Further demonstrat-
ing their versatility, they have also been used for physics studies, e.g. pertaining
to non-equilibrium ordering phenomena [3] and novel phase dynamics in active
matter [17] and biological model systems [18].
While the original inspiration for the Smarticles platform stems from 3D
entanglement of “U”-shaped particles [19], all prior demonstrations have been
restricted to planar operation. The reason is two-fold. Most importantly, the
Smarticle 1.0 prototype does not have a profile that supports effective 3D entan-
glement. Each module can only entangle with one other due to the aspect ratio.
Furthermore, the current prototype was not optimized for operation in large
robot swarms which are needed for the stochastic effects to average out. In this
paper, we present a new design which enables full 3D entanglement, large scale

deployment, and may serve as an easily accessible platform to support novel
studies on smart and active matter systems.

3 Electro-mechanical Design

Compared to other modular and swarming robots in the literature (Table 1), the
Smarticles 2.0 have a relatively small form factor (200×20×17 mm³), very light
weight (20 g), low cost (<71 USD/pc), and short assembly time (∼15 min), making
it feasible to acquire and operate many in traditional research labs.
Equally important, the modules have the optimal aspect ratio that is key to
3D entanglement (l/w = 0.4–0.75 [19]). Smarticle 2.0 can transition between "I"
and “U” shapes leveraging a non-backdrivable 4-bar crank and slider mechanism.
Modules also have the ability to sense and communicate optically, acoustically,
and mechanically. Finally, modules are powered by a Venom Fly 180mAh 3.7V
Lithium Polymer battery which can support reasonably long experiments (up to
200 appendage cycles).
The cost of a module breaks down as follows for a 15-piece order: the raw
PCB costs 22.1USD, the electronic components 25.53USD, the motor 14.99USD,
and the battery 6.49 USD. (Note that the price of the 3D printed parts is negligible,
as they total 2.24 g of material, and that the module price will drop with bulk
ordering.)
The crank and slider mechanism dominates the module design, volume,
weight, and cost. It works by rotating a custom worm gear with two opposite
thread angles, which is translated to linear motion through two gear racks that
attach to raise or lower the appendages. To support this mechanism we chose a
commercially available, affordable, strong, small, and light weight motor; specif-
ically the 700:1 Sub-Micro Plastic Planetary Gearmotor (6mm diameter, 21mm
length, 1.3g) from Pololu.
Beyond the off-the-shelf motor and the battery, modules consist almost
entirely of FR4 and polyimide material, commonly known as rigid and flexi-
ble printed circuit boards (PCBs). We order these pre-soldered by Sandnwave
for ease; because these fabrication houses are located in China, this also circum-
vents the long lead times caused by the ongoing shipping crisis. Beyond the PCBs, the
module consists of a 3D printed bearing and motor holder which are snap-fit into
holes in the PCB. We use the Formlabs 2.0 and the Carbon 3D printers, with
Tough1500 and UMA90 material respectively, to produce the bearing, motor
holder, and worm gear. While the resolution of the prints is important, the
material itself only affects how quickly the parts wear down.
Figure 2 shows the breakdown of the mechanical parts as well as the top and
side view of a robot. The top of the PCB mounts the worm gear and motor
pair, the bearing and motor holder, as well as the left and right sliders. The
bearing and motor holder not only secure the worm gear and motor during
operation, but also help to guide the sliding motion of the gear racks. These
parts are fitted on the center body, connected to the left and right appendage by
4 flexible PCB link pieces. The battery is mounted on the back of the module.

Table 1. Examples of self-reconfigurable modular and swarm robotic platforms, char-
acterized in terms of properties related to scalable fabrication and operation.

Platform             Form factor [mm]           Weight [g]  Cost [USD]  Assembly time [min]
Jasmine robots [20]  ∼30×30×30                  –           130         –
Donuts [9]           46 (diameter), 70 (tall)   25.4        587         48
Kilobots [21]        33 (diameter), 40 (tall)   –           14*         15
Smarticles 1.0 [2]   151×53×25                  –           40.8*       –
Smarticles 2.0       200×20×17                  20          71          15

– no information; * parts cost only

Fig. 2. Component overview. Mechanical Components: 1. Worm gear; 2. Planetary
DC motor; 3. Motor bearing; 4. Motor holder; 5. Left slider; 6. Right slider; 7. Left
appendage; 8. Flexible PCB link; 9. Center body; 10. Right appendage; 11. Battery.
Electronics: 1. Light sensor; 2. IR encoders; 3. Motor current sensor; 4. Buzzer; 5.
RGB LED; 6. Atmega328pb microcontroller; 7. Battery state sensor; 8. Accelerometer;
9. Microphone.

Each module is controlled by an 8-bit ATmega328pb microcontroller, making
it compatible with the popular do-it-yourself (DIY) Arduino IDE programming
framework. Modules have six sensing modalities. In terms of proprioception, they
can measure their battery state, appendage load (motor current), appendage
position through IR encoders, and orientation through a 3-axis accelerometer.
In terms of exteroception, they can measure ambient light using an IR sensor,
audio using a buzzer-microphone pair, and vibrations using the accelerometer.
These sensing modalities complement each other: acoustic signals can penetrate
an entangled collective, light can address only perimeter modules, and, because
the robots are typically loosely entangled, mechanical vibrations translate only
between neighboring modules.
To lower maintenance, the battery is connected to a coil embedded in the
left appendage, such that the modules can be charged by loose placement on a
wireless charger (Fig. 1B). The current charging circuit costs 85 USD and simul-
taneously charges 5 robots fully over 6 h. We have yet to optimize it for cost and
ease of use, i.e., the amount of misalignment the charger can tolerate for easier
placement.

4 Fabrication

Fig. 3. Smarticle 2.0 assembly steps.

Fabrication time and ease are both critical to support user-adoption of swarm
modular robot platforms. All electronic components can be mounted on the rigid
PCB either in- or out of house. The latter requires less in-house expertise and
labor, and is in fact economically advantageous due to the current shipping crisis.

Excluding component soldering and 3D printing of components, the in-house
assembly takes less than 15 min per robot and requires little prior knowledge of
soldering techniques. The fabrication process involves six main steps (Fig. 3):

1. Assemble slider pieces: To support accurate, rapid, and inexpensive assem-
bly of the flexible and rigid PCBs that make up the sliders, we added solder
pads such that they can be laminated together in a re-flow oven. This process
enables the pieces to self-align due to the surface tension created by the solder
paste. Depending on the size of the oven, a few dozen sliders can be produced
in each batch. Alternatively, this step can be done with a carefully pointed
heat gun.
2. Connect body and appendages: We connect the two appendages and the
center body by soldering the four flexible link pieces on to each side. To assist
in the alignment of PCBs, we perform this step on a 3D printed holder.
3. Mount the actuator: We snap-fit the bearing and motor holder to the
center body PCB, to hold together the two sliders and the worm gear.
4. Assemble crank-and-slider: We solder the sliders to the appendages.
5. Attach Motor: We glue the motor into the motor holder and solder the
motor connections onto the center body PCB.
6. Finalize: We plug in the battery and apply a thin layer of liquid electrical
insulation paint on the surface of the robot.

5 Characterization
In this section we characterize essential properties of the Smarticle 2.0 mod-
ules, including their actuation profile (accuracy/longevity, lifting strength, and
holding strength) and sensor specifications (acoustic, orientation).
First, to test the repeatability and durability of the mechanical components,
a module was programmed to raise-, pause-, lower-, and pause the appendages
open-loop for 2 s-3 s-2 s-3 s, respectively. We repeated this behavior for 67min
continuously, amounting to more than 400 cycles. We found that even with-
out sensor feedback, the appendage motion is fairly symmetrical and accurate
within a standard deviation of less than 2◦, and persists over many cycles. Specif-
ically, the left and right appendage minimum angles were found to be 30.2±1.0◦
and 35.8±1.2◦, respectively; the left and right appendage maximum angles were
found to be 88.0±0.9◦ and 85.4±2.0◦, respectively.
Second, we analyzed the appendage lifting force, by combining the governing
equations for the crank and slider mechanism [22] and the worm gear force
translation [23]. These parameters are also shown in Fig. 4A. Geometrically, the
distance from the end of the connecting rod to the crank axle, x, and the angle
between the slider and the connecting rod, β, can be calculated as follows:

x = r\cos(\alpha) + \sqrt{l^2 - r^2 \sin^2(\alpha)} \quad (1)

\beta = \cos^{-1}\!\left(\frac{x + r\cos(\pi - \alpha)}{l}\right) \quad (2)

where α is the angle of the crank, r is the rotating crank length, and l is the length
of the connecting rod. The torque from the crank and slider on the appendage,
τ , depends on F , the linear force pulling on the slider:

\tau = F\, r\, \sin(\alpha + \beta) \quad (3)

Conversely, the motor stall torque, τstall , relates to the maximum linear force
the crank and slider mechanism is capable of delivering, estimated as:
F = \frac{2\,\tau_{stall}}{d_{mean}} \left( \frac{\pi d_{mean} - f\, p\, \sec(\zeta)}{p + \pi f\, d_{mean} \sec(\zeta)} \right) \quad (4)

Fig. 4. A) Left: Kinematic forces acting on the worm gear; Right: Single-appendage
view of the crank and slider mechanism, with design parameters. B) Lifting force of
a single appendage. The blue line is the theoretical prediction, red data points show
experimentally obtained values. C) Alternative design with a stronger motor. (Color
figure online)

where dmean and p are the mean thread diameter and pitch of the worm gear, and
ζ and f are the angle and amplitude of the normal force between the teeth and
the grooves in the (FR4-based) rack, respectively. We now combine Eqs. 4 and
3 and iterate over all appendage angles to estimate their lifting force (Fig. 4B).
Note that the model is inaccurate due largely to the loss of efficiency in the worm-
rack transmission and friction between the sliders and center body PCBs. We
partially make up for this inaccuracy by tuning the value of f to two sets of real

data points at low and high appendage angles gathered from the real modules.
We see that the mechanism is strongest (can lift the highest load) when the
appendage is near vertical, and vice versa. Practically, this means that when the
appendage is at a 90◦ angle, the appendage can lift 3.42 modules placed 40mm
out from the center body PCB; at 45◦ it can lift 1.7 modules; and at 10◦ it can
lift 0.33 modules.
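As an illustration, the following Python sketch chains Eqs. (1)-(4) to estimate the torque available at the appendage and the equivalent number of 20 g modules lifted 40 mm from the center body. All geometry and motor values below are placeholder assumptions (including the mapping between crank and appendage angle), not the calibrated parameters of the module:

```python
import math

def appendage_torque(alpha, r, l, tau_stall, d_mean, p, zeta, f):
    # Eq. (1): distance from the end of the connecting rod to the crank axle
    x = r * math.cos(alpha) + math.sqrt(l**2 - r**2 * math.sin(alpha)**2)
    # Eq. (2): angle between the slider and the connecting rod
    beta = math.acos((x + r * math.cos(math.pi - alpha)) / l)
    # Eq. (4): maximum linear force on the slider at motor stall
    sec_z = 1.0 / math.cos(zeta)
    F = (2 * tau_stall / d_mean) * (math.pi * d_mean - f * p * sec_z) \
        / (p + math.pi * f * d_mean * sec_z)
    # Eq. (3): resulting torque on the appendage
    return F * r * math.sin(alpha + beta)

# Placeholder geometry (m), stall torque (Nm), assumed thread angle and friction.
geom = dict(r=0.004, l=0.010, tau_stall=0.02, d_mean=0.005,
            p=0.002, zeta=math.radians(14.5), f=0.4)
for deg in (10, 45, 90):
    tau = appendage_torque(math.radians(deg), **geom)
    # Equivalent number of 20 g modules lifted 40 mm from the center body
    print(deg, "deg:", round(tau / (0.020 * 9.81 * 0.040), 2), "modules")
```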
The real strength of the chosen mechanism is that it is non-back-drivable.
This feature allows robots to maintain a joint angle without energy consumption,
independent of load size (provided that it does not exceed the breaking point). As
a rule of thumb, this property holds for worm gears with a lead angle smaller than
atan(μ), where μ is the coefficient of friction. Estimating that μ is comparable
to two glass plates sliding against each other (0.4), we get that the lead angle λ
must be smaller than 21◦, a condition comfortably met by the 5.6◦ of our design. Practically, this
means that the module’s holding force is only limited by the bonding strength of
the solder joints that attach the appendages to the main PCB. In other words,
once in position, the module can hold a significant amount of weight without
necessitating continuous power. In the spirit of keeping our robots intact, we
have yet to perform destructive testing, but in preliminary tests we have loaded
appendages with over 15 modules without mechanical failure.
The Smarticle 2.0 design is also compatible with other off-the-shelf motors
such as the Micro Metal Gearmotors (cross section 10 × 12 mm) from Pololu.
Taking a Pololu micro motor with gear ratio 298:1 (stall torque 20 oz-in) as an
example, it uses the same PCB design and only requires some adjustments to the
3D printed parts to fit the motor, as shown in Fig. 4C. Assembling a Smarticle
2.0 with this motor follows the same steps as the Plastic Planetary Gearmotor
design and takes similar assembly time. The overall weight of a Smarticle 2.0 is
25 g and its total cost is slightly higher at 72.47USD. With this motor, when the
appendage is at a 90◦ angle, the appendage can lift 41.6 modules placed 40mm

Fig. 5. A) Characterization of accelerometer accuracy: mean and standard deviation
of ten measurements at each rotation. B) Microphone-buzzer detection between two
modules.

out from the center body PCB; and at a 10◦ angle, it can lift 2.04 modules. We
will explore the functionalities of heterogeneous collectives in the future.
Finally, Fig. 5A-B shows a characterization of the onboard accelerometer and
buzzer-microphone pair. We see that rotation around the accelerometer x-axis
(roll) and y-axis (pitch), is accurately detected. We further see that at 50%
power using the onboard microphone and buzzer pair, two modules can signal
over a 300 mm range, equivalent to 1.5 times the module body length. This value can
be tuned by adjusting the average buzzer power (the pulse width) of the carrier
signal.

6 Multi-robot Demonstrations
While this paper is focused mainly on describing the design of and characterizing
individual modules, there are many avenues for extension of this work.

Fig. 6. A) Three-module tensile test at different appendage angles. B) Three-module
compression tests with appendages located at 90◦. C) Recorded audio signal of acoustic
synchronization tests. 0–10 s shows the beeps from three synchronized modules; at 10 s
an unsynchronized module is added to the collective; past 25 s all four modules have
synchronized again. D) Active entanglement by hanging multiple modules. E) Before
and after of disassembly test with 7 modules.

In this section, we show exploratory multi-robot experiments to illustrate the platform's
potential to support both active matter and robotic functionality in the future.
The Smarticle 2.0 platform can be used for both taskable robot- and smart
matter experiments. As a simple demonstration of this concept, Fig. 6A-B shows
tensile and compression tests of 3 modules arranged in a 2D chain. Note that to
calculate stress in Fig. 6A, we assume that the cross-sectional area of the “Smar-
ticle 2.0 material” remains constant; i.e. only the force changes. The experiment
shows how the material, similar to most hyperelastic polymers, first exhibits a
roughly linear profile due to linkage deflection, then rapidly declines in stress
because the modules start to slide against each other. As one would expect,
sliding occurs earlier with lower appendage angles; i.e. the higher the appendage
angle, the higher the elastic modulus.
The compression test was limited to Smarticles with 90◦ appendages, because
at lower angles, the center module would immediately start sliding. In the three
repeated trials, we again see a roughly linear profile caused by link deflection,
and then a rapid decline in compression strength due to module sliding.
To demonstrate that, mechanically, the Smarticle 2.0 modules have a form
factor and weight that permits entanglement, we attached a single module to a
ruler and manually hung as many other modules off it as possible (Fig. 6D). In
five repeated trials, we achieved 7, 8, 9, 8, and 9 modules. In the near future, we
plan to repeat this experiment with active entanglement - i.e. by placing modules
in a bin, having them actively entangle, and then pulling out globs of “Smarticle
2.0 material” with a stick similar to past papers demonstrating entanglement of
passive staples [19], fire ants, and worm blobs [24].
Finally, to demonstrate that material composed of entangled Smarticle 2.0
modules can actively undergo a phase change, we manually assembled a glob
of 7 modules with appendages at 90◦, and programmed them to unfold upon
hearing an acoustic signal in close proximity (Fig. 6E). In doing so, the glob went
from a volume of 1853.8 cm³ to 3745.4 cm³, i.e. roughly a two-fold expansion,
similar to a phase change from solid to liquid.
As a building block for more complex and distributed coordination schemes,
we demonstrated acoustic synchronization of four modules using the Kuramoto
model [25]. In Fig. 6C, we show how this works, by placing a non-synchronized
module next to three previously synchronized modules, while recording their
buzzes with a separate module. In this case the modules synchronized over 9
cycles. In longer experiments where all four modules started unsynchronized,
full synchronization took 17 cycles. These characteristics can be optimized with
better coupling factors, but the general concept of module synchronization is
promising for future work on consensus and collective motions.
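As a minimal sketch of the underlying mechanism (the coupling gain, step size, and discrete update are our assumptions, not the firmware implementation), the all-to-all Kuramoto update used for such synchronization can be written as:

```python
import math

def kuramoto_step(phases, omega, K, dt):
    """One Euler step of the all-to-all Kuramoto model: each oscillator is
    pulled towards the phases of the others with coupling strength K."""
    n = len(phases)
    return [p + dt * (omega + (K / n) * sum(math.sin(q - p) for q in phases))
            for p in phases]

phases = [0.0, 0.8, 1.6, 2.4]   # four modules, scrambled initial phases
for _ in range(3000):
    phases = kuramoto_step(phases, omega=2 * math.pi, K=2.0, dt=0.01)
print("phase spread:", round(max(phases) - min(phases), 4), "rad")  # -> ~0
```

On the modules, the phase would be advanced between beeps and nudged whenever a neighbor's beep is heard, rather than integrated centrally as above.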
In Fig. 7A, we demonstrate how the onboard accelerometer can be used for
tactile communication; i.e. by sensing changes in orientation. We manually poke
a module, causing it to activate by flapping its appendages, which causes the
signal to propagate to other nearby robots. It is worth noting that the signal
amplitude is heavily dependent on where along the center body PCB the modules
connect, and that the flapping action can cause unwanted accelerometer read-
outs that self-activate the module.

Fig. 7. A-B) Smarticles sequentially activate each other through tactile cues.

In Fig. 7B we included a small inhibitory period, thereby avoiding self-activation
and only flapping the appendages once upon external stimulus.

7 Summary
We presented and characterized a design optimized for scalable fabrication and
operation of robotic modules, capable of 3D entanglement by a range of scientific
users. This relatively new subfield of 3D entangled robots presents unique oppor-
tunities related to active matter and taskable modular robots as evidenced by
entangled social organisms in nature, but also unique research challenges related
to communication and consensus algorithms, reconfiguration, and mobility. In
future work, we aim to scale up the size of our collective and demonstrate more
complex multi-agent behaviors in both hardware and simulation. The design files
for these robots are available upon request from the authors.

Acknowledgements. This project was funded by NSF awards #1933284 and
#2042411, and the Packard Fellowship for Science and Engineering. The authors would
like to acknowledge many fruitful discussions with Prof. Goldman at the Georgia Insti-
tute of Technology.

References
1. Goldstein, S.C., Campbell, J.D., Mowry, T.C.: Programmable matter. Computer
38(6), 99–101 (2005)
2. Savoie, W., Pazouki, A., Negrut, D., Goldman, D.: Smarticles: design and construc-
tion of smart particles to aid discovery of principles of smart, active granular mat-
ter. In: The First International Symposium on Swarm Behavior and Bio-Inspired
Robotics (2015)
3. Chvykov, P., et al.: Low rattling: a predictive principle for self-organization in
active collectives. Science 371(6524), 90–95 (2021)
4. Gardi, G., Ceron, S., Wang, W., Petersen, K., Sitti, M.: Microrobot collectives
with reconfigurable morphologies, behaviors, and functions. Nat. Commun. 13(1),
1–14 (2022)
5. Daudelin, J., Jing, G., Tosun, T., Yim, M., Kress-Gazit, H., Campbell, M.: An inte-
grated system for perception-driven autonomy with modular robots. Sci. Robot.
3(23), eaat4983 (2018)
6. Gilpin, K., Rus, D.: Modular robot systems. IEEE Robot. Autom. Mag. 17(3),
38–55 (2010)
7. Haghighat, B., Droz, E., Martinoli, A.: Lily: a miniature floating robotic platform
for programmable stochastic self-assembly. In: 2015 IEEE International Conference
on Robotics and Automation (ICRA), pp. 1941–1948. IEEE (2015)
8. Ceron, S., Kimmel, M.A., Nilles, A., Petersen, K.: Soft robotic oscillators with
strain-based coordination. IEEE Robot. Autom. Lett. 6(4), 7557–7563 (2021)
9. Wilson, N.J., Ceron, S., Horowitz, L., Petersen, K.: Scalable and robust fabrication,
operation, and control of compliant modular robots. Front. Robot. AI 7, 44 (2020)
10. Li, S., et al.: Particle robotics based on statistical mechanics of loosely coupled
components. Nature 567(7748), 361–365 (2019)
11. Warkentin, R., Savoie, W., Goldman, D.I.: Locomoting robots composed of immo-
bile robots. In: 2018 Second IEEE International Conference on Robotic Computing
(IRC), pp. 224–227. IEEE (2018)
12. Savoie, W., et al.: A robot made of robots: emergent transport and control of a
smarticle ensemble. Sci. Robot. 4(34), eaax4316 (2019)
13. Nilles, A., Ceron, S., Napp, N., Petersen, K.: Strain-based consensus in soft,
inflatable robots. In: 2022 IEEE 5th International Conference on Soft Robotics
(RoboSoft), pp. 789–794. IEEE (2022)
14. Gilpin, K., Knaian, A., Rus, D.: Robot pebbles: one centimeter modules for pro-
grammable matter through self-disassembly. In: 2010 IEEE International Confer-
ence on Robotics and Automation, pp. 2485–2492. IEEE (2010)
15. Swissler, P., Rubenstein, M.: Fireant3D: a 3D self-climbing robot towards non-
latticed robotic self-assembly. In: 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pp. 3340–3347. IEEE (2020)
16. Savoie, W., et al.: Phototactic supersmarticles. Artif. Life Robot. 23(4), 459–468
(2018)
17. Zhou, W., Gravish, N.: Density dependent synchronization in contact-coupled oscil-
lators. arXiv preprint arXiv:2012.07124 (2020)
18. Ozkan-Aydin, Y., Goldman, D.I., Bhamla, M.S.: Collective dynamics in entangled
worm and robot blobs. Proc. Natl. Acad. Sci. 118(6), e2010542118 (2021)

19. Gravish, N., Franklin, S.V., Hu, D.L., Goldman, D.I.: Entangled granular media.
Phys. Rev. Lett. 108(20), 208001 (2012)
20. Woern, H., Szymanski, M., Seyfried, J.: The i-swarm project. In: ROMAN 2006-
The 15th IEEE International Symposium on Robot and Human Interactive Com-
munication, pp. 492–496. IEEE (2006)
21. Rubenstein, M., Ahler, C., Nagpal, R.: Kilobot: a low cost scalable robot system
for collective behaviors. In: 2012 IEEE International Conference on Robotics and
Automation, pp. 3293–3298. IEEE (2012)
22. Hartenberg, R., Danavit, J.: Kinematic Synthesis of Linkages. McGraw-Hill, New
York (1964)
23. Budynas, R., Nisbett, K.: EBOOK Shigley’s Mechanical Engineering Design 11e
in SI Units. McGraw-Hill Education (Asia) (2020)
24. Hu, D., Phonekeo, S., Altshuler, E., Brochard-Wyart, F.: Entangled active matter:
from cells to ants. Eur. Phys. J. Spec. Top. 225(4), 629–649 (2016)
25. Acebrón, J.A., Bonilla, L.L., Vicente, C.J.P., Ritort, F., Spigler, R.: The Kuramoto
model: a simple paradigm for synchronization phenomena. Rev. Mod. Phys. 77(1),
137 (2005)
Hybrid Flock - Formation Control
Algorithms

Cyrill Baumann(B) , Jonas Perolini, Emna Tourki, and Alcherio Martinoli

Distributed Intelligent Systems and Algorithms Laboratory, School of Architecture,


Civil and Environmental Engineering, École Polytechnique Fédérale de Lausanne,
Lausanne, Switzerland
{cyrill.baumann,jonas.perolini,emna.tourki,alcherio.martinoli}@epfl.ch

Abstract. Two prominent categories of algorithms for achieving coordi-
nated multi-robot displacement are flocking and navigation in formation.
Both categories have their own body of literature and characteristics,
including their respective advantages and disadvantages. Although typi-
cally they are treated separately, we believe that a combination of flock
and formation control represents a promising algorithmic solution. Such
an algorithm could take advantage of a combination of characteristics of
both categories that are best suited for a given situation. Therefore, in
this work, we propose two distributed algorithms capable of gradually
and reversibly shifting between flocking and formation behaviors using
a single parameter W. We evaluate them using both simulated and real
robots with and without the presence of obstacles. We find that both
algorithms successfully trade off flock density for formation error. Fur-
thermore, leveraging a narrow passage experiment as an application case
study, we demonstrate that an adaptive shift between flock and forma-
tion behavior, adopting a simple method to define W in real-time using
exclusively on-board resources, results in a statistically relevant reduc-
tion of traversing time in comparison to a non-adaptive formation control
algorithm.

Keywords: flocking · formation control · multi-robot coordination

1 Introduction
In recent years, Multi-Robot Systems (MRS) have gained importance, as coor-
dinated teams are able to perform tasks which would be unfeasible for single
robots. Additionally, their inherent redundancy introduces robustness to the
system, allowing it to successfully complete the overall mission objectives even
when some individual agents fail. Spatial coordination of MRS is crucial in a
wide range of applications, from search and rescue, through surveillance
and escorting missions, to spacecraft formations. Existing algorithms that
tackle this challenge can be split into two main categories: those that manage
tight spatial topologies, most commonly referred to as formation control, and
those that manage loose ones, usually called flocking.

Flocking has its origin in nature, where such behavior achieves multiple
functionalities (e.g., energy benefits, navigation accuracy, protection from preda-
tors) and is present in all media (e.g., schools of fish and flocks of birds). This
behavior was first engineered in [1] for the purpose of modeling the behavior
of birds, fish, etc. in computer animation. It has since been improved upon, but
the main principles have remained the same. The behavior of each individual
agent can be expressed as the vector sum of three complementary components:
separation, cohesion, and velocity matching. The result is a swarm-like behav-
ior with (in steady state) almost equal distances among the flock members, but
without any precise spatial formation. Additionally, formal methods for design-
ing flocking behavior have also been proposed, for example, in [2], where three
different kinds of lattices are leveraged to form triangle-like flocks, allowing for
splitting and rejoining, as well as squeezing maneuvers to avoid obstacles.
Formation control is an actively studied topic with a large amount of prior
work, including recent surveys such as [3]. There exist three main approaches
for addressing formation control: Leader-Follower, Virtual Structure (VS), and
Behavior-Based (BB) methods. In the LF approach, one agent is designated as
leader. All other agents, called followers, are then tasked with maintaining a
predefined relative position to the leader, resulting in formation shape retention
[4]. The VS approach [5] consists of three steps. First, the virtual structure (cor-
responding to the predefined formation) is aligned to the current positions of the
agents in order to minimize the distance from every agent to its corresponding
spot within the formation. Then the virtual structure is moved in the desired
direction of movement. In the third step, the agents are controlled towards their
corresponding positions in the shifted virtual structure. The BB approach is
based on defining several desired behaviors for every agent. The control is then
either a weighted average of the resulting motion vectors created by the indi-
vidual behaviors [6], a Finite State Machine (FSM) switching between them [7],
or a more elaborate combination of behaviors [8]. A BB approach results in a
collective response, fairly adaptive to different environmental situations. While
this represents the main advantage of this approach, one major drawback is that
the resulting spatial topology emerges from local interactions and is therefore
difficult to control precisely.
In multiple related works, formation control has subsequently been "loos-
ened up", usually to account for obstacle avoidance. In [9] the formation is capa-
ble of splitting and rejoining. In [10] and [11] dynamical formation changes are
leveraged. In [12] a potential field augmented with formation transformation is
used. Deformable formations are also leveraged in [13].
On the flocking side, it is worth noting that, in comparison to the flocking
based on Reynold’s rules, flocking based on lattices (as in [2]) can be considered
significantly more rigid and thus closer to formation control.
However, the above approaches all remain faithful to their respective cate-
gories of algorithms. More specifically, to the best of our knowledge, no work
exists which is capable of shifting gradually from flocking to formation con-
trol or vice-versa. Yet, this gradual shift would enable a smooth transition
between the characteristics of both types of algorithm, with an infinite number of

intermediate states in between. Such a continuous change could then be leveraged
in numerous scenarios. For example, it could be used to optimize multi-robot
navigation by trading off formation precision versus the ability to traverse nar-
row passages or areas with obstacles. As the shift between formation control
and flocking would only be performed to the degree required, this would result
in reduced change of desired positions and could thus even reduce the overall
energy consumption.
In this work, we propose and experimentally validate two such algorithms,
designed to gradually shift between a flock and a formation behavior. For both
algorithms proposed, the formation control is based on a graph-based, Laplacian
control algorithm [14], which can be seen as part of the VS category. The cor-
responding flocking algorithm used differs between the two proposed solutions:
Pot2Lapl leverages a potential-based flocking algorithm based on [2], while
Flock2Lapl uses the original Reynolds rules [1], more specifically the version
introduced in [15].
The remainder of this paper is structured as follows: we start by introduc-
ing the two algorithms Flock2Lapl and Pot2Lapl. We then present our two
experimental evaluation setups, report and discuss the algorithms’ performances,
before ending with some concluding remarks.

2 Methodology
We consider a homogeneous group of N differential wheeled robots, tasked with
following a predefined trajectory while forming and maintaining spatial coordi-
nation between them. This spatial coordination varies from flock to formation
control depending on the scalar W.
The robot control is designed following the framework introduced and
described in detail in [16]. The control algorithm consists of the following seven
functional elements:
1. Pose estimation: each robot is equipped with both a local range-and-bearing
system and a receiver exploiting data from an emulated Global Navigation
Satellite System (GNSS). Global communication is then used to transmit
(and receive) GNSS-based position data. This effectively results in a global,
occlusion-less sensing of neighboring robots. All information is fused using an
Extended Kalman Filter.
2. Consensus on orientation: the formation orientation is not established a priori
but obtained online using Laplacian consensus.
3. Position allocation: a Hungarian algorithm [17] is used to assign each robot
the optimal position within the formation (minimizing the overall distances
between the robots and their respective target positions).
4. Coordination algorithm: the estimated relative positions of neighboring
robots (Xj with j ∈ [1, N ]) are used for both coordination algorithms. They
are specified in Sects. 2.1 and 2.2, respectively.
5. Group movement: the common goal is given as a temporal trajectory in a
global reference frame.

6. Collision avoidance: the commonly used Braitenberg algorithm is applied
using the modifications introduced in [16] to cast it in the motion vector
framework.
7. Robot control: a single integrator control law that includes a damping term
as in [16] is used.

It is further worth noting that the motion vectors of elements 4, 5 and 6 (Xcoord ,
Xtraj and Xcol respectively) are summed to obtain the final motion vector.
Both coordination algorithms, Pot2Lapl and Flock2Lapl, use a graph-based
Laplacian control algorithm as the formation control part of their designs.
Using the standard feedback method for a single integrator kinematic
model [18], we obtain for robot i:

\dot{z}_i = -\frac{1}{N} \sum_{j=1}^{N} L_{i,j}\,\big(z_{i,j} + b_{i,j}\big) \quad (1)

with L the Laplacian matrix based on the underlying connectivity graph (which
is fully connected in this work), zi,j the position of robot j and bi,j the desired
position of robot j, both defined in the local coordinates of robot i. bi,j can
be calculated from the desired positions in global coordinates BN for a given
number of robots N . Following the reasoning in [16], we can express the motion
vector of the Laplacian control algorithm as:

X_i^{lapl} = -\dot{z}_i\, dt \quad (2)

with BN , dt numerical parameters. The numerical values used in this work are
summarized in Table 1. The resulting V-shape formation for N = 5 is depicted
in Fig. 3b.
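As a minimal sketch (our notation and simplifications, not the authors' implementation), the Laplacian motion vector of Eqs. (1)-(2) for a fully connected graph can be computed as follows; for brevity, the desired positions b are taken directly from BN expressed in robot i's frame:

```python
import numpy as np

def laplacian_motion(i, z, b, dt=0.05):
    """Motion vector X_i^lapl of Eqs. (1)-(2) for robot i.
    z: (N,2) positions of all robots in robot i's frame (z[i] = 0);
    b: (N,2) desired formation positions in the same frame."""
    N = z.shape[0]
    L_row = np.full(N, -1.0)
    L_row[i] = N - 1                         # row i of the complete-graph Laplacian
    zdot_i = -(1.0 / N) * (L_row @ (z + b))  # Eq. (1)
    return -zdot_i * dt                      # Eq. (2)

# Example with the V-shape B5 of Table 1 and 5 randomly placed robots.
B5 = np.array([(0, 0), (0, 1.2), (0.4, 0.3), (0.4, 0.9), (0.8, 0.6)])
z = np.random.randn(5, 2); z[0] = 0.0
print(laplacian_motion(0, z, B5))
```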

2.1 Pot2Lapl

Intuitively, Pot2Lapl corresponds to a potential-based flocking algorithm (cf.
[2]), with an additional attraction towards holonomic, massless virtual agents
following Laplacian control. Then, varying a single attractive weight W towards
the virtual agents allows the group of robots to gradually shift from pure flocking
(i.e. virtual agents are ignored) to pure formation control (i.e. attraction to
virtual agents dominates flocking).
More specifically, the target motion vector X_i^{coord} is given by

X_i^{coord} = \sum_{j=1,\, j \neq i}^{N} \frac{d\psi}{dz}\big(\|z_{i,j}\| - d_{ref}\big)\, \frac{z_{i,j}}{\|z_{i,j}\|} + W X_i^{virt} \quad (3)

with ψ the potential function as defined below, d_ref the desired inter-robot
distance, and X_i^{virt} the position of the virtual agent associated with robot i
following a Laplacian-based control; that is, X_i^{virt} = X_i^{lapl}.

To define ψ, we first define a base potential function ψbase (z) similar to [2]:

\psi_{base}(z) = \frac{a+b}{2}\left(\sqrt{1+(z+c)^2} - \sqrt{1+c^2}\right) + \frac{a-b}{2}\, z \quad (4)

using c = \frac{|a-b|}{2\sqrt{ab}}, with a and b numerical parameters.
Similarly to [2], we further define a cut-off function ρ(z), in order to limit
the influence of distant neighbors:
\rho(z) = \begin{cases} 0, & \text{if } z < \frac{d_{ref}}{r} + \delta \\ 1 - \frac{1}{2}\left[1 + \cos\left(\pi\,\frac{z - \frac{d_{ref}}{r} - \delta}{1 - \frac{d_{ref}}{r} - \delta}\right)\right], & \text{if } \frac{d_{ref}}{r} + \delta \le z \le 1 \\ 1, & \text{otherwise} \end{cases} \quad (5)

Additionally, we introduce a piece-wise constant function B(z) to increase
the repulsive effect of the potential function:

B(z) = \begin{cases} K_{rep}, & \text{if } z < 0 \\ K_{att}, & \text{otherwise} \end{cases} \quad (6)

We can then define a piece-wise differentiable, cut-off potential function ψ:

\psi(z) = B(z)\left[\psi_{base}(z) - \rho\left(\frac{z + d_{ref}}{r_{cut}}\right)\big(\psi_{base}(z) - \psi_{base}(r_{cut})\big)\right] \quad (7)

with d_ref, δ, K_rep, K_att and r_cut numerical parameters. Figure 1 shows the resulting
ψ(z − d_ref) and the corresponding gradient dψ/dz (z − d_ref) with the numerical
parameters used in this work.

Fig. 1. Potential function ψ and the corresponding gradient dψ/dz (z − d_ref) with the
parameters used in this work, as a function of the inter-robot distance z.
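The following Python sketch assembles the Pot2Lapl coordination term of Eq. (3) from the potential of Eqs. (4)-(7), using the Table 1 values; the numerical gradient and the overall structure are our simplifications, not the authors' code:

```python
import numpy as np

A, B_VAL = 1.0, 7.0                                # a, b
C = abs(A - B_VAL) / (2 * np.sqrt(A * B_VAL))      # c of Eq. (4)
D_REF, DELTA, R_CUT, K_REP, K_ATT = 0.4, 0.0, 2.0, 10.0, 0.9

def psi_base(z):                                   # Eq. (4)
    return (A + B_VAL) / 2 * (np.sqrt(1 + (z + C)**2) - np.sqrt(1 + C**2)) \
           + (A - B_VAL) / 2 * z

def rho(z, r=R_CUT):                               # Eq. (5), cut-off function
    lo = D_REF / r + DELTA
    if z < lo:
        return 0.0
    if z <= 1.0:
        return 1 - 0.5 * (1 + np.cos(np.pi * (z - lo) / (1 - lo)))
    return 1.0

def psi(z):                                        # Eqs. (6)-(7)
    Bz = K_REP if z < 0 else K_ATT
    return Bz * (psi_base(z)
                 - rho((z + D_REF) / R_CUT) * (psi_base(z) - psi_base(R_CUT)))

def grad_psi(z, eps=1e-6):                         # numerical dpsi/dz
    return (psi(z + eps) - psi(z - eps)) / (2 * eps)

def pot2lapl(i, z, x_virt, W):
    """Eq. (3): z is (N,2) relative positions in robot i's frame,
    x_virt the Laplacian motion vector X_i^lapl of the virtual agent."""
    X = np.zeros(2)
    for j in range(z.shape[0]):
        if j == i:
            continue
        d = np.linalg.norm(z[j])
        X += grad_psi(d - D_REF) * z[j] / d
    return X + W * x_virt
```

With W = 0 the virtual agents are ignored (pure flocking); large W makes the attraction to the Laplacian-controlled virtual agents dominate.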

2.2 Flock2Lapl
Flock2Lapl is based on the original Reynolds rules [1]. Intuitively, it uses non-
linear cohesion and separation vectors, extended with an attraction to a virtual
agent following a Laplacian formation control. Formally, we can write the target
motion vector X_i^{coord} of robot i as

X_i^{coord} = K_{coh} X_i^{coh} + K_{sep} X_i^{sep} + W X_i^{virt} \quad (8)

using again X_i^{virt} = X_i^{lapl}, with the cohesion vector defined as

X_i^{coh} = \frac{1}{N} \sum_{j=1}^{N} z_{i,j} \quad (9)

and the separation vector defined as


X_i^{sep} = \begin{cases} \frac{1}{N} \sum_{j=1}^{N} \frac{-z_{i,j}}{\|z_{i,j}\|^2}, & \text{if } \|z_{i,j}\| < s \\ 0, & \text{otherwise} \end{cases} \quad (10)

with Kcoh , Ksep and s numerical parameters.
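A companion sketch of the Flock2Lapl term (Eqs. 8-10), under the same conventions and caveats as the Pot2Lapl sketch above:

```python
import numpy as np

K_COH, K_SEP, S = 1.0, 0.15, 0.45   # Table 1 values

def flock2lapl(i, z, x_virt, W):
    """Eq. (8): z is (N,2) relative positions in robot i's frame (z[i] = 0)."""
    N = z.shape[0]
    x_coh = z.sum(axis=0) / N                  # Eq. (9): cohesion
    x_sep = np.zeros(2)                        # Eq. (10): separation
    for j in range(N):
        d = np.linalg.norm(z[j])
        if j != i and 0 < d < S:
            x_sep += -z[j] / d**2
    x_sep /= N
    return K_COH * x_coh + K_SEP * x_sep + W * x_virt
```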

Table 1. Numerical parameters used in this work, listed in order of appearance.

Parameter  Value
B5         {(0, 0), (0, 1.2), (0.4, 0.3), (0.4, 0.9), (0.8, 0.6)} [m]
B10        {(0.8, 0), (0.8, 0.4), (0.8, 0.8), (0.8, 1.2), (1.2, 0.2),
           (1.2, 0.6), (1.2, 1), (1.6, 0.4), (1.6, 0.8), (2, 0.6)} [m]
dt         50 [ms]
a          1 [-]
b          7 [-]
dref       0.4 [m]
δ          0 [m]
rcut       2 [m]
Krep       10 [-]
Katt       0.9 [-]
Kcoh       1 [-]
Ksep       0.15 [-]
s          0.45 [m]
rs         0.3 [m]
W0         4 [-]

3 Experimental Evaluation
This section introduces the robotic platform used to demonstrate the algorithms,
the experimental setup for both simulated and real robot experiments, as well
as the evaluation metrics used.

3.1 Robotic Platform


All experiments were conducted using both simulated and real Khepera IV robots
[19]. Khepera IV robots are differentially driven vehicles with a diameter of 14 cm

and a maximum speed of ∼ 81 cm/s. They are further equipped with eight infrared
(IR) proximity sensors with a range of 20 cm that are used for obstacle avoidance.
As illustrated in Fig. 2, the robots are enhanced with a custom range-and-bearing
module [20] for relative localization and an active marker module featuring two
LEDs that allow accurate tracking through the SwisTrack software [21]. Commu-
nication between the robots leverages UDP/IP and is assumed to be error-free
and to have a range larger than the experimental arena.

Fig. 2. Khepera IV robot with a custom range-and-bearing module as well as a tracking
module. In this work, both the camera and the US sensors are not used and thus not
further mentioned.

3.2 Evaluation Metrics

The performance of the coordination algorithms is evaluated using four metrics
at every temporal instant t, given the robots' positions Zi with i ∈ [1, N ] in a
global reference frame.

– Φdensity evaluates the density of the group with respect to an ideal, lattice-
like flock with dref between all robots. Based on circle packing, it compares
Ar , the enclosing circle of all robots, with Ao , the optimal enclosing circle for
the given number of robots.
\Phi_{density} = \frac{1}{1 + \frac{1}{N}\,\|A_o - A_r\|} \quad (11)

– Φform captures the error between the positions of the robots and their respec-
tive attributed positions (ZA ) within the ideal formation (of shape as defined
by BN ), centered on ZF and using optimal position assignments.

\Phi_{form} = \frac{1}{1 + \frac{1}{N}\, \min_{Z_F} \sum_{i=1}^{N} \|Z_i - (Z_A \mid Z_F)\|} \quad (12)

– Φorder (adopted from [15]) evaluates the heading alignment of the robots:

\Phi_{order} = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{\dot{Z}_i \cdot \dot{Z}_j}{\|\dot{Z}_i\|\, \|\dot{Z}_j\|} \quad (13)

– Φsafe (adopted from [15]) measures the risk of collision among robots:
\Phi_{safe} = 1 - \frac{n_s}{N(N-1)} \quad (14)

with ns the number of occurrences where two robots are "too close", that is, closer
than rs, a numerical parameter: ns = |{(i, j) s.t. j ≠ i and ‖Zi − Zj‖ < rs}|.

We further denote by Φ the average of each metric over the entire duration
of the experiment.
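For reference, a small sketch (our code, assuming global positions and velocities are available) of the order and safety metrics of Eqs. (13)-(14):

```python
import numpy as np

def phi_order(vel):
    """Eq. (13): mean pairwise alignment of the velocity directions."""
    N = vel.shape[0]
    unit = vel / np.linalg.norm(vel, axis=1, keepdims=True)
    total = sum(unit[i] @ unit[j]
                for i in range(N) for j in range(N) if i != j)
    return total / (N * (N - 1))

def phi_safe(pos, r_s=0.3):
    """Eq. (14): 1 minus the fraction of ordered robot pairs closer than r_s."""
    N = pos.shape[0]
    n_s = sum(1 for i in range(N) for j in range(N)
              if i != j and np.linalg.norm(pos[i] - pos[j]) < r_s)
    return 1 - n_s / (N * (N - 1))
```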

3.3 Experimental Setup

Webots [22], an open-source, high-fidelity robotics simulation platform, has been
used for the simulated experiments. All simulated sensors and actuators of the
simulated robots were calibrated to match those of the real robots.
For the physical experiments, SwisTrack [21] is used in combination with a
GigE color camera to both track the robots in order to calculate the evaluation
metrics, and to emulate absolute localization such as that provided by a GNSS
module in indoor settings. SwisTrack is calibrated using the algorithm proposed
by Tsai et al. [23], which results in a spatial error of 1.06 ± 0.65 cm and a negli-
gible temporal error of less than 1 mm amplitude. Emulated GNSS information
is sent to the robots via UDP/IP at a rate of 1 Hz.

3.4 Characterization Experiment

Characterization experiments are conducted similarly in simulation and reality:
five Khepera IV robots are randomly placed in a 3 × 2 m arena, as depicted in
Fig. 3a. Using the algorithmic framework described in Sect. 2, both algorithms
are then evaluated for different parameters, as summarized in Table 2. It should
be noted that the main purpose of these experiments is to evaluate the perfor-
mance of the algorithms as a function of W, both in simulation and in reality,
and both for ideal and more realistic conditions, that is, without and with the
presence of obstacles, respectively. An additional set of experiments, involving
ten robots without obstacles, has been added in simulation to give additional
insights into the scalability of the algorithms. However, a complete scalability
analysis will require further, more systematic experiments in future work.
An 8-shaped trajectory has been chosen as the group movement for all experi-
ments. For the experiments with obstacles, six square obstacles of 20 × 20 cm
are placed arbitrarily in the arena. We note that the positions of the obstacles
are kept constant over all experiments to increase comparability among them.
Figure 3a shows the corresponding obstacle positions used for both real and sim-
ulated experiments.

Fig. 3. Experimental arenas used in this work. The required V-shaped formation is
identical for both sets of experiments.

Table 2. Experimental conditions evaluated in this work

Environment              | W            | N
-------------------------|--------------|--------
Webots without obstacles | [0, 1, 2, 5] | [5, 10]
Webots with obstacles    | [0, 1, 2, 5] | 5
Real robots              | [0, 1, 2, 5] | 5

3.5 Narrow Passage Experiment


A narrow passage experiment, similar to the one in [16], is used as a case study
to demonstrate the performance of the algorithms. Five robots are placed ran-
domly within a starting zone (see Fig. 3b). The robots are then tasked to pass
through the narrow passage and cross the finish line, while maintaining spatial
coordination. The migration urge is given as a temporal trajectory in the global
reference frame. Real robot experiments were repeated 10 times, simulated ones
20 times. In addition to the four metrics introduced previously, we record for
every experimental run the time taken until the last robot crosses the finishing
line.
In contrast to the characterization experiments of Sect. 3.4, W is calculated
dynamically on-board for every robot and timestep: W = max(W0 − Nact , 0)
with W0 a numerical parameter corresponding to the desired “rigidness” of the
formation and Nact the number of infrared sensors detecting an obstacle, that
is, returning a sensor reading above a given threshold (20% of the maximum
value for this work). The performance of this “adaptive” algorithm is compared
to a “non-adaptive” version which uses W = W0.
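A minimal sketch of this update (the raw sensor scale is an assumption, not a value from the paper):

```python
# Sketch of the adaptive weight update W = max(W0 - N_act, 0).
# max_reading is an assumed raw IR scale; the text only specifies the 20% threshold.
def adaptive_w(ir_readings, w0=4.0, threshold_frac=0.2, max_reading=1023.0):
    """ir_readings: the eight raw IR proximity values of one robot.
    A sensor counts as active above 20% of its maximum reading."""
    n_act = sum(r > threshold_frac * max_reading for r in ir_readings)
    return max(w0 - n_act, 0.0)
```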

4 Results and Discussion

Figure 4 shows the resulting performance of the Flock2Lapl algorithm on the
four metrics presented above, for different W. As intended, with increasing W,
Φform increases and Φdensity decreases. It is also worth noting that Φsafe and
Φorder are not affected by W.

Fig. 4. Experimental results for the Flock2Lapl algorithm, using both 5 and 10 robots
in simulation without obstacles (WB), using 5 robots in simulation with obstacles (WB)
and using 5 robots in reality without obstacles (RR). Lines show mean values with the
shaded areas representing standard deviation.

Figure 5 shows the corresponding performance for the Pot2Lapl algorithm,
which is similar to that of Flock2Lapl. Figures 4 and 5 further show the
resulting performance of both algorithms for 10 robots. The most noticeable
impact of this increase in N can be observed for Φform at low W, that is, when
almost purely in flocking configuration. This can be explained by the fact that,
for 10 robots, the difference between a V-shaped formation and a flock is larger
than for 5 robots.

Fig. 5. Experimental results for the Pot2Lapl algorithm, using both 5 and 10 robots in
simulation without obstacles (WB), using 5 robots in simulation with obstacles (WB)
and using 5 robots in reality without obstacles (RR). Lines show mean values with the
shaded areas representing standard deviation.

Fig. 6. Experimental results for the narrow passage experiment using five robots in
simulation (WB) and reality (RR).
Figure 6 shows the results of the narrow passage experiments. For all metrics,
a one-way ANOVA revealed a statistically significant difference in the respective
metric between at least two groups. However, Tukey’s HSD test for multiple
comparisons found that the mean value of the metrics differs only between the
simulated and real experiments, with the following exceptions: for Flock2Lapl,
statistical differences (p < 0.05) can be observed for Φform for real and Φorder
for simulated experiments. For Pot2Lapl, Φdensity is statistically different for
real experiments (p < 0.01). For both algorithms, the traversing time is
statistically different for both simulated and real experiments (p < 0.05).
This indicates that our proposed algorithms successfully decreased the traversing
time, demonstrating one possible use case. We note that the impact on Φform
is limited by the averaging over the entire experiment. However, given a more
sensitive metric, we would expect a trade-off between traversing time and Φform.
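The reported tests can be reproduced with standard tooling; a minimal sketch with hypothetical per-run values (group names and numbers are placeholders, not data from the experiments):

```python
# Sketch of the reported analysis: one-way ANOVA followed by Tukey's HSD.
# All values below are hypothetical placeholders.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

groups = {
    "sim_adaptive":  np.array([41.2, 39.8, 43.1, 40.5]),   # traversing times [s]
    "sim_fixed":     np.array([48.0, 46.3, 49.9, 47.2]),
    "real_adaptive": np.array([44.9, 46.1, 43.7, 45.3]),
}

# Is there a significant difference between at least two groups?
f_stat, p_value = f_oneway(*groups.values())

# Which pairs of groups actually differ?
values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```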
For all metrics in Fig. 6, relatively high standard deviations can be observed
for both simulated and real experiments. This is due to the high degree of
stochasticity present, mostly introduced by the random initialization of the
robots within the starting area, but also by sensor noise.
We also acknowledge that, since this work focuses on the coordination algo-
rithms, the sensing capabilities of the robots are maximized. Naturally, further
work will be required to analyze the performance of the algorithms under more
realistic conditions, considering exclusively local sensing affected by occlusions.

5 Conclusion
In this work we introduced two hybrid flock-formation control algorithms. By
varying a single parameter W, both algorithms are able to change gradually
from a pure flocking behavior to a formation behavior and vice versa. We then
characterized the behavior of the two algorithms using four dedicated metrics,
for different numerical values of W, in both simulated and real robot experiments.
We further introduced a simple method to compute W on-board, based on the
number of infrared sensors that detect an obstacle. By comparing this adaptive
algorithm to a non-adaptive formation control algorithm in a narrow passage
scenario, we find that the adaptive algorithm achieves a significant reduction in
traversing time.

Acknowledgements. We would like to thank Masaki Haruna, Masahiko Kurishige


and Tomoki Emmei of the Advanced Technology R&D Center, Mitsubishi Electric
Corporation, Amagasaki City, Japan for the fruitful technical discussions all along this
work. This work is financially supported by Mitsubishi Electric Corporation, Japan.

Appendix
A video of the experiments performed for this contribution can be found on our
research web page: www.epfl.ch/labs/disal/research/mrsmodelingcontrol.
References
1. Reynolds, C.W.: Flocks, herds and schools: a distributed behavioral model. In:
ACM Annual Conference on Computer Graphics and Interactive Techniques,
pp. 25–34 (1987)
2. Olfati-Saber, R.: Flocking for multi-agent dynamic systems: algorithms and theory.
IEEE Trans. Autom. Control 51, 401–420 (2006)
3. Liu, Y., Bucknall, R.: A survey of formation control and motion planning of mul-
tiple unmanned vehicles. Robotica 36(7), 1019–1047 (2018)
4. Wang, P.K.C.: Navigation strategies for multiple autonomous mobile robots mov-
ing in formation. J. Robot. Syst. 8(2), 177–195 (1991)
5. Tan, K.-H., Lewis, M.A.: Virtual structures for high-precision cooperative mobile
robotic control. In: Proceedings of IEEE/RSJ International Conference on Intelli-
gent Robots and Systems, vol. 1, pp. 132–139 (1996)
6. Balch, T., Arkin, R.: Behavior-based formation control for multirobot teams. IEEE
Trans. Robot. Autom. 14(6), 926–939 (1998)
7. Wang, H., Rubenstein, M.: Shape formation in homogeneous swarms using local
task swapping. IEEE Trans. Rob. 36(3), 597–612 (2020)
8. Antonelli, G., Arrichiello, F., Chiaverini, S.: The NSB control: a behavior-based
approach for multi-robot systems. Paladyn, J. Behav. Robot. 1, 48–56 (2010)
9. Monteiro, S., Bicho, E.: Attractor dynamics approach to formation control: theory
and application. Auton. Robots 29, 331–355 (2010)
10. Vilca, J., Adouane, L., Mezouar, Y.: Adaptive leader-follower formation in clut-
tered environment using dynamic target reconfiguration. In: Chong, N.-Y., Cho,
Y.-J. (eds.) Distributed Autonomous Robotic Systems. STAR, vol. 112, pp. 237–
254. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55879-8_17
11. Wasik, A., Pereira, J.N., Ventura, R., Lima, P.U., Martinoli, A.: Graph-based dis-
tributed control for adaptive multi-robot patrolling through local formation trans-
formation. In: IEEE/RSJ International Conference on Intelligent Robots and Sys-
tems, pp. 1721–1728 (2016)
12. Seng, W.L., Barca, J.C., Şekercioğlu, Y.A.: Distributed formation control of net-
worked mobile robots in environments with obstacles. Robotica 34(6), 1403–1415
(2016)
13. Soares, J.M., Aguiar, A.P., Pascoal, A.M., Martinoli, A.: A distributed formation-
based odor source localization algorithm - design, implementation, and wind tun-
nel evaluation. In: IEEE International Conference on Robotics and Automation,
pp. 1830–1836 (2015)
14. Falconi, R., Gowal, S., Martinoli, A.: Graph based distributed control of non-
holonomic vehicles endowed with local positioning information engaged in escort-
ing missions. In: IEEE International Conference on Robotics and Automation,
pp. 3207–3214 (2010)
15. Soria, E., Schiano, F., Floreano, D.: The influence of limited visual sensing on the
Reynolds flocking algorithm. In: Third IEEE International Conference on Robotic
Computing, pp. 138–145 (2019)
16. Baumann, C., Martinoli, A.: A modular functional framework for the design and
evaluation of multi-robot navigation. Robot. Auton. Syst. 144, 103849 (2021)
17. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res.
Logist. Q. 2(1–2), 83–97 (1955)
18. Mesbahi, M., Egerstedt, M.: Graph Theoretic Methods in Multiagent Networks.
Princeton Series in Applied Mathematics (2010)
19. Soares, J.M., Navarro, I., Martinoli, A.: The Khepera IV mobile robot: performance
evaluation, sensory data and software toolbox. In: Robot 2015: Second Iberian
Robotics Conference. AISC, vol. 417, pp. 767–781. Springer, Cham (2016). https://
doi.org/10.1007/978-3-319-27146-0_59
20. Pugh, J., Raemy, X., Favre, C., Falconi, R., Martinoli, A.: A fast onboard relative
positioning module for multirobot systems. IEEE/ASME Trans. Mechatron. 14(2),
151–162 (2009)
21. Lochmatter, T., Roduit, P., Cianci, C., Correll, N., Jacot, J., Martinoli, A.: Swis-
Track - a flexible open source tracking software for multi-agent systems. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4004–
4010 (2008)
22. Michel, O.: WebotsTM: professional mobile robot simulation. Int. J. Adv. Rob.
Syst. 1, 39–42 (2004)
23. Tsai, R.: A versatile camera calibration technique for high-accuracy 3D machine
vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom.
3(4), 323–344 (1987)
Multi-Robot Systems Research: A
Data-Driven Trend Analysis

João V. Amorim Marques1,2, María-Teresa Lorente1,3, and Roderich Groß1(B)

1 Sheffield Robotics, The University of Sheffield, Sheffield, UK
r.gross@sheffield.ac.uk
2 INESC Science and Technology, Faculty of Engineering, University of Porto,
Porto, Portugal
3 Instituto Tecnológico de Aragón, Zaragoza, Spain

Abstract. This paper provides a data-driven analysis of the trends of


research on multi-robot systems (MRS). First, it reports the findings of
an exhaustive search of the MRS studies published from 2010 to 2020 in
27 leading robotics journals, including a quantitative analysis of trends.
Second, it reports the findings of a survey capturing the views of 68
leading experts in the field of MRSs. Finally, it summarises the findings.

1 Introduction
In the late 1940s, Walter [1] built two autonomous robot tortoises that displayed
not only impressive individual capabilities, including moving towards a light
source or into a station for self-charging, but also collective capabilities, described
by the author as “mutual recognition” or forming a “sort of community”. These
robots are one of the first physical realisations of multi-robot systems (MRSs).
Despite significant developments, especially over the past 35 years, MRSs
are only beginning to become commonplace in real-world applications. Numer-
ous surveys have been published on the topic of MRSs [2–5], and the reader is
referred to these for a comprehensive overview of the field. This paper seeks to
complement previous studies by providing the first data-driven analysis of the
trends in the field. It reports:
1. the quantitative findings of an exhaustive search of the MRS studies published
in leading robotics journals (between 2010 and 2020); and
2. an analysis of the views of leading MRS experts.
The findings may provide insights to guide future research, for example, by
determining areas of development that require significant effort and the short-
term to long-term prospects for specific applications.

2 Exhaustive Search
This section presents the methods and findings from the exhaustive search.
All authors contributed equally to this work.
2.1 Methodology

We use robot to refer to a machine that perceives, and acts in, an environment
in physically plausible ways. The environment can be understood as the set of
characteristics that define the space in which the robot resides. Studies employing
abstract environment/robot models are considered within scope where clearly of
relevance to real-world MRSs. For the purpose of the exhaustive search, we
restrict the scope to systems composed of mobile robots, that is, robots that are
able to change their position and orientation in their environment.
A multi-robot system can be defined as a group of robots that operate in the
same environment while exhibiting some form of interaction. When part of an
MRS, robots usually work collaboratively towards a common goal. This allows
them to address a more complex task, or perform a task more efficiently, than
when on their own. MRSs may have different levels of autonomy when deployed,
from being teleoperated to being fully autonomous.
We use an exhaustive search to target all documents published in years 2010–
2020 in 27 of the 28 “robotics” journals that are included in the 2020 Journal
Citations Report of the Web of Science Group.1 The documents include original
research articles, perspectives, and surveys, but exclude editorials, corrections
and other communications. Conference papers, although valuable to the robotics
field, were not considered due to the sheer volume of works and the lack of clear
criteria for which conferences to consider. For each document, we determine
whether it is within scope. Where a document is not within scope, no further
analysis is performed. For each document within scope, it is first rated in terms of
(i) the originality (up to 6 points) of the work that is reported, (ii) its rigour (up
to 4 points) and (iii) its significance (up to 5 points). The individual ratings are
then added to produce a relevance score. Although detailed criteria and training
are used to guide the assessors (two of the authors) and improve consistency in
evaluation, the ratings remain ultimately subjective.
To reduce the impact of possible assessor bias (i.e. one assessor tending to
provide higher values than the other), samples of documents were marked by
both of the assessors—once at the beginning and twice during the exhaustive
search. From the results of these double-marking rounds, a consistent but small
bias can be observed. To counter this bias, the relevance scores that we report
have been adjusted by -0.15 and +0.20 respectively, such that the top 25% of
documents within scope for either assessor have a score of 6.00 or above. In the
following, these works are referred to as highly relevant.
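For concreteness, the scoring arithmetic can be sketched as follows (which assessor receives which offset is our assumption; the text only gives the two values):

```python
# Sketch of the relevance scoring described above.
# The assessor-to-offset mapping is an assumption, not stated in the text.
ADJUSTMENT = {"assessor_1": -0.15, "assessor_2": +0.20}

def relevance_score(originality, rigour, significance, assessor):
    """originality <= 6, rigour <= 4, significance <= 5 points."""
    raw = originality + rigour + significance      # at most 15 points
    return raw + ADJUSTMENT[assessor]

# A document is 'highly relevant' when its adjusted score reaches 6.00.
is_highly_relevant = relevance_score(3, 2, 2, "assessor_2") >= 6.00
```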

2.2 Findings from Journal-Specific Analysis

Overall, 24 250 documents were examined in the exhaustive search. Of these,


1487 (6.1%) are within scope, and hence analysed in detail. Figure 1 shows the

1 The documents of one of the journals (International Journal of Robotics & Automation) could not be accessed in full and were hence excluded.
Fig. 1. Percentage of documents within scope (per journal). Portions in red correspond
to papers regarded as highly relevant. The numbers represent the total number of
papers identified to be within scope.

percentage of documents that are within scope for each journal. For eight jour-
nals, more than 10% of documents are within scope. The journals Swarm Intel-
ligence (SI) and Autonomous Robots (AURO) have the strongest focus on MRS
research.

Fig. 2. Distribution of the technology readiness levels for the documents within scope
(per journal).

Figure 2 shows the distribution of the technology readiness levels (TRL) that
we estimated for the systems reported in all documents within scope for each
journal (for a definition, see [6]). Most of the journals primarily report find-
ings with a TRL level of up to 4, considered as laboratory validation by the
assessors. The Journal of Field Robotics (JFR) and Revista Iberoamericana de
Automática e Informática Industrial (RIAI) stand out: 75% and 46% of their
documents, respectively, report work at TRL 5 or above, owing to these journals
primarily focusing on studies that have been validated via field experiments. It should be
noted that the journals RIAI, Intelligent Service Robotics (ISR), JFR, The Inter-
national Journal of Robotics Research (IJRR), AURO, and Robotica include
some documents assessed at TRL 6, where the system was demonstrated
in a relevant environment.

Fig. 3. Distribution of documents within scope by type of validation (laboratory only,


real-world only, or both).

Figure 3 shows the percentage of documents within scope that validated their
research via laboratory and/or real-world experiments, as well as the percent-
age of documents that report formal, theoretical results. As expected, the JFR
presents the greatest proportion of documents with real-world validation. In
terms of theoretical validation, the journals IEEE Transactions on Robotics (T-
RO), IJRR, AURO, IEEE Robotics and Automation Letters (RA-L), Robotica,
and Journal of Intelligent & Robotic Systems (JINT) contain the largest propor-
tions of works reporting theoretical findings (journals with fewer than 10 works
being omitted from consideration of proportions).

2.3 Findings from Trend Analysis

This section presents a quantitative analysis of the trends observed based on the
data collected in the exhaustive search.
Figure 4 presents the number and percentage of documents that were within
scope for each year. Dashed lines show the trends, obtained through linear regres-
sion using the least squares method. The trend in number of documents on MRSs
suggests a substantial growth in the number of documents within scope, whereas
the trend in percentage of documents suggests that the fraction of documents
devoted to MRSs has decreased. A possible explanation is that robotics journals
tend to report on an increasingly diverse range of topics, some of which attracted
significant attention in recent years (e.g. soft-bodied and learning robots).
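The dashed trend lines are ordinary least-squares fits; a minimal sketch with hypothetical per-year counts (the actual values come from the exhaustive search):

```python
# Least-squares trend fit as used for the dashed lines in Figs. 4 and 5.
# The counts below are hypothetical placeholders, not the study's data.
import numpy as np

years = np.arange(2010, 2021)
counts = np.array([98, 105, 112, 121, 128, 134, 142, 151, 161, 166, 169])

slope, intercept = np.polyfit(years, counts, 1)   # linear regression
trend = slope * years + intercept
print(f"estimated growth: {slope:.1f} documents/year")
```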
Figure 5 presents the evolution of the percentage of studies within scope that
perform theoretical, laboratory and real-world validation. Note that it is pos-
sible for a single study to feature multiple types of validation. Overall, the
trends suggest the percentage of theoretical studies is increasing, while labora-
tory validation is shown to be decreasing and real-world validation is shown to
remain relatively constant over the years. Note however that the number of doc-
uments within scope is increasing over the years. As a result, the total amount
of validation is increasing as well, for both laboratory and real-world experi-
ments. Although theoretical validation is often based on strong assumptions, it
Fig. 4. Number and percentage of documents within scope (per year), as identified
by the exhaustive search. The dashed lines represent the trends obtained by linear
regression.

Fig. 5. Percentage of documents within scope using theoretical, laboratory and real-
world validation (per year), as identified by the exhaustive search. The dashed lines
represent the trends obtained by linear regression.

can offer a detailed understanding of the underlying mechanisms and allow for
a basic demonstration of the validity of the proposed solutions. The decreased
focus on laboratory validation may be attributed to increases in computational
capabilities, allowing more trustworthy and realistic simulations to be performed,
which may help substitute for laboratory validations.

3 Expert Survey

The expert survey is conducted to help better understand the state of the art and
possible future developments of MRSs. It captures the views of a representative
sample of internationally leading experts in the field of MRSs.

3.1 Methodology

The expert survey received ethics approval following a review by The University
of Sheffield (application reference 034028). It uses a questionnaire that includes
closed- and open-ended questions. Responses are anonymous. Participants can


choose not to answer any specific question if they prefer. Where subject-specific
questions provide multiple options to choose from, these are presented in a ran-
dom order to each participant.
To identify suitable prospective participants, research paper databases; con-
ference, workshop and journal websites (invited speakers, organisers, editorial
boards); resources related to non-academic organisations (e.g. industry, gov-
ernment); websites of technical committees (e.g. IEEE RAS TC Multi-Robot
Systems); and other sources are explored. Special emphasis is placed on iden-
tifying female experts, experts from traditionally underrepresented geographic
regions, and experts from industry and other non-academic organisations. The
participants are given three weeks to complete the survey from the moment of
invitation (9 July 2020).

3.2 Findings
Overall, 179 experts were invited to participate in the survey, including 119 men
(66.5%) and 60 women (33.5%).2 The invited experts were from the following
broad, geographic regions: Africa (10), America (62), Asia (35), Europe (66),
and Oceania (6), as defined by the Statistics Division of the United Nations3 .
A total of 68 participants completed the survey, corresponding to a partic-
ipation of 38.0% of the invited experts. Each participant provided consent to
participate in the survey. In this section, we analyse the findings.

Participant Distribution. To ensure the responses to the expert survey are


representative, or reveal biases where they exist, we included four personal ques-
tions regarding the participant gender, geographic region, sector of work, and the
number of years of relevant professional experience. Overall, 76.5% and 20.6% of
participants reported themselves as male and female, respectively. A small per-
centage of participants preferred not to answer (2.9%). The gender distribution
is clearly biased towards men, primarily as 66.5% of those invited to participate
were men. This can, at least partially, be attributed to women being underrep-
resented in the field of MRSs, and, especially, in leading roles. For the related
disciplines of electrical engineering, mechanical engineering, and computer sci-
ence, between 13.5% and 19.7% of tenured/tenure-track faculty positions in
the USA are held by women according to a recent survey [7].
Most participants were from Europe, America and Asia. The participant
share of Europe is 42.6%, with 1.5% from Eastern Europe, 7.4% from Northern
Europe, 16.2% from Southern Europe, and 17.6% from Western Europe. The
participant share of America is 32.4%, with 26.5% from North America and
5.9% from South America. The participant share of Asia is 20.6%, with 7.4%
from Eastern Asia and 4.4% from each of South-eastern, Southern, or Western
2 The gender for each prospective participant was estimated. Actual participants could choose to report their gender.
3 See geographic regions on https://unstats.un.org/unsd/methodology/m49/.
Asia. The remaining participants were from Australia and New Zealand (2.9%)
or did not disclose their geographic region (1.5%). Although the survey succeeded
in capturing the views of experts from a wide range of regions, some bias can
be observed, in particular towards participants from Europe. It should also be
noted that none of the experts reported Africa as their geographic region.
Most participants work in academia (77.9%), followed by industry (14.7%).
Participants who work in both academia and industry, or in other sectors, are
in the minority (2.9% for either case).
Three quarters of the participants have up to 20 years of relevant profes-
sional experience with half reporting 10 to 20 years (50.0%) and a quarter less
than 10 years (25.0%). Most of the remaining participants have between 20 and
30 years of relevant professional experience (19.1%). Two participants have more
than 30 years of relevant professional experience (2.9%).

Analysis of Responses to Close-Ended Questions. This section presents


the views obtained from the experts when answering the subject-specific, closed-
ended questions of the survey.
The first question is about what constitutes an MRS. The experts are asked
to what extent they consider 12 example systems, labelled a–ℓ, to be MRSs.
Figure 6 shows the distribution of their responses. Their opinions are diverse.
However, broad consensus exists that systems a, b and c would not be considered
as MRSs. In these systems, the robots may not even be active. Moreover, broad
consensus exists that system ℓ, where the robots operate in a same environment
and have a common purpose, would be considered as an MRS. Most experts
viewed systems where multiple robots interact, such as systems i, j, and k, as
instances of MRSs, even when the robots were physically separated, such as in
system k. Systems d and e are not considered as MRSs by most of the experts,
though conflicting views exist. For system d, the robots are not self-propelled
and do not necessarily interact. For system e, one robot transports another
robot. For the remaining systems f –h, no clear trend is observed. For system
f , the robots share a common infrastructure to obtain position information but
are otherwise independent. In the case of system g, a group of conventional
manipulators work towards a common goal along a factory assembly line. Given
that the motion of each manipulator is fully automated, there appears to be
limited interaction. Where the operations of one manipulator build upon the
operations of the previous ones, however, the group could be seen as interacting
via the products that they assemble. For system h, each robot is teleoperated
by a dedicated human operator while sharing the same environment. Different
from system j, the robots are not moving autonomously. As the wider context
may not be sufficiently clear, it is difficult to categorise this system, as suggested
by a relatively high percentage of experts selecting the option “other (e.g. would
need more information to decide)”.
Fig. 6. Distribution of the expert responses to the question of whether the following
are considered an MRS: (a) A group of robots stored on the same shelf, (b) Any robot
of which more than two copies exist, (c) A group of robots that can be stacked to ease
transportation, (d) A group of passive beads residing in a same environment, and being
externally controlled through changes in a magnetic field, (e) A robot transporting
another robot, (f ) A group of robots operating in different environments, and that do
not interact; each obtains positional information from the same global infrastructure,
(g) A group of conventional manipulators along a factory assembly line, working on the
same products though at different stages, (h) A group of robots operating in a same
environment, each being teleoperated by a dedicated operator; the operators may have
conflicting goals, though they always attempt to prevent collisions between robots,
(i) A robot composed of several modules that can autonomously rearrange the way
they are connected to assume a new shape, (j) A group of robots operating in a same
environment, each having its own goal, including not to collide with others, (k) A group
of robots operating in different environments and that learn from each other through
a cloud service, (ℓ) A group of robots operating in a same environment and having a
common purpose. Statements ranked by weighted mean results.

The second question concerns the future deployments of MRSs. The experts
were asked the time frame in which they expect MRSs to be widely deployed
within their geographic region in various application scenarios. Figure 7 illus-
trates the distribution of the expected times for each of the application scenar-
ios. Almost all experts consider warehouse logistics as an achievable target in the
short term (0–5 years). The majority of experts consider surveillance and secu-
rity, manufacturing and entertainment as achievable targets in the short term
(0–5 years). These application scenarios have already fully embraced robotics
in multiple geographic regions. Transport and delivery, inspection and mainte-
nance, agriculture and forestry, education, mining, first response, and domestic
applications are considered achievable targets in the medium term (5–10 years)
by many of the experts. Space, medical applications, rehabilitation and care and
construction are expected to face a longer wait (> 10 years). Additionally, we
should notice that education has a moderate number of experts suggesting MRSs
may never be widely deployed or not be applicable.
Fig. 7. Distribution of the responses by experts to the question in which time frame
they expect MRSs to be widely deployed within their geographic region in the specified
applications. Applications ranked by weighted mean results.

Deploying MRSs in the real world involves solving some problems that are
equally present when deploying single robot systems. The third question specifi-
cally focuses on the transition from systems of single robots to systems of multi-
ple robots. It asks the experts to gauge how much additional effort this transition
requires by the robotics community with respect to a number of areas of develop-
ment. Figure 8 shows the distribution of the additional effort the experts consider
necessary for each of the suggested areas of development. In general, we observe
that all aspects would demand additional effort. Particularly, experimentation
under realistic conditions, autonomy/resilience, system integration, trustworthi-
ness, and end-user acceptance can be highlighted as the most difficult areas of
development, with the highest percentage of experts choosing ratings of either 5
or 6. On the contrary, issues such as materials, modelling, ethical concerns, and
simulation are considered less demanding in terms of additional effort, with the
highest percentage of experts choosing ratings of either 1 or 2. Regarding legal
aspects, the opinions are divided. Diversity and inclusion has been appraised as
“Undecided” by 29.4% of experts, possibly as the challenge is viewed by some
as the same irrespective of whether the system to be designed comprises one or
multiple robots.
The last question asks the experts how fast they expect the field of MRSs
to have grown in ten years from the time of the survey compared to the field
of robotics as a whole. Figure 9 presents the responses, suggesting a bimodal
distribution with a small peak at “slower” and a larger one at “faster”.

Analysis of Responses to Open-Ended Questions. In this section, we


present and discuss the participant responses to the open-ended questions of the
survey.
The first question asked how the experts’ attitude towards MRSs had changed
over the past 10 years. About a third (32.4%) of the participants responded that

Fig. 8. Distribution of the expert responses to the question of how much additional
effort is required by the robotics community to move from single-robot system to MRS
in real-world applications with respect to various areas of development. The scale is
such that 1 corresponds to “insignificant”, whereas 6 corresponds to “unprecedented
(e.g. paradigm change required)”. Areas ranked by weighted mean results.

Fig. 9. Distribution of the expert responses to the question of how fast they expect
the field of MRSs to have grown in ten years from the time of the survey compared to
the field of robotics as a whole.

their attitude had not changed. The remaining participants presented a multi-
tude of aspects that had changed when considering their position towards MRSs.
The most frequently mentioned aspects include the focus of their work (27.3%),
in particular a shift from theoretical work to more practical applications, the
definition of MRS they used, to either include or exclude specific control archi-
tectures (centralised vs. decentralised), system size, etc., and the consideration of
new capabilities or applications to MRSs, such as machine learning, cloud com-
puting, and warehouse logistics. Although the outlook of most responses to this
question were positive, a small number of participants (6.8% of those who
reported a change in attitude) expressed a negative position, questioning the
extent to which real-world applications will be impacted by MRSs.
The next question asked participants to give examples of commercial applica-
tions other than warehouse logistics where MRSs are currently being deployed.
A total of 63.2% of participants gave examples, with the most common being (i)
entertainment, for example, through the use of drones such as in light shows
and for filming, (ii) agriculture, for seeding, tending and harvesting of crops,
and (iii) surveillance. Interestingly, agriculture is included in the list of appli-


cations that experts expect to reach deployment only in the medium term (see
Sect. 3.2). Other applications, from the most to the least mentioned, include
defence, inspection, monitoring, logistics, transport, education, hospital applica-
tions, mining, maintenance, and manufacturing.
The field of MRSs is diverse, with research and development constantly pro-
gressing in a wide range of topics. We presented the experts with a list of topics,
and asked them to choose the one they are most familiar with. An option to add
their own topic was provided. Each expert was then asked to present the main
challenges regarding the chosen topic. The two most frequently chosen topics
are coordination/cooperation, control and planning (44.1%) and bio-inspiration,
modular and swarm robots (25.0%). The most frequently mentioned challenges
in coordination/cooperation, control and planning are related to system inte-
gration, particularly in unstructured environments; robustness of the system,
both in terms of hardware and software/algorithms/behaviours; logistics, in
terms of system deployment and energy; scalability, particularly in terms of how
behaviours scale as more robots are added; and costs. Less frequently mentioned
challenges include the reality gap; operation safety, in terms of humans, robots
and tasks; and the lack of real-world testing. For the participants choosing the
topic of bio-inspiration, modular and swarm robotics, the challenge most men-
tioned is the lack of accessibility to, and the high costs of, reliable and capable
robot platforms. This is followed by logistics challenges, which are particularly
severe when studying swarms of robots. One other challenge is the lack of realistic
real-world applications, as many studies performed on this topic frequently fall
back to simple and commonly used behaviours, such as aggregation or pattern
formation, which would generally solve only a small part of a given real-world
problem. The availability of capable, reliable and robust robot platforms is also
the most frequently mentioned challenge on the topic of applications. For net-
worked robots, communication, end-user acceptance and scalability are some of
the challenges that were identified by the experts.
To be able to assess what can be achieved in the future, information is needed
on what has been achieved in the past. With that in mind, survey participants
were asked which area in MRSs they consider to have improved the most in the
last 10 years.
most in the last 10 years. From the responses (89.7% of the participants replied),
the areas most frequently mentioned are coordination/cooperation, control and
hardware. From the responses to the previous question, we know that more than
40% of the participants are most familiar with coordination/cooperation, control
and planning, and the challenges identified in these fields were mostly related to
implementation issues. As a result, it is reasonable for coordination/cooperation,
control and planning to be perceived as areas with significant development over
the last 10 years. By contrast, hardware, in the form of available platforms, is
one of the aspects frequently mentioned in the previous question (for multiple
topics) that still presents a challenge to be overcome. Despite hardware being
one of the topics participants believe to have shown most development, further
progress is vital, for example, to enable more comprehensive validation and/or


deployment of solutions.
While current challenges and recent developments help determine possible
future development paths, we also wish to identify promising areas for future
research. Survey participants were asked which developments specific to MRSs
they would like to see in the next 10 years. The most commonly mentioned devel-
opments were real-world testing (16.2%), real-world applications (13.2%) and
coordination/cooperation (11.7%). The interest in real-world testing could be
viewed as a consequence of an insufficient availability of hardware for implemen-
tation and testing. Although hardware has been identified as a topic of interest
to develop in the next 10 years, the percentage of experts mentioning it is rel-
atively small (4.4%). However, developments in this field are required to allow
real-world testing to be performed. The interest in real-world applications could
be viewed as a consequence of the change in perspective, as previously men-
tioned, where more theoretical work is being replaced with more practical work.
Additionally, it is through the study of new applications that new challenges
can be identified. Lastly, with new applications and more experimentation, the
requirements for coordination/cooperation algorithms are expected to become
more stringent, as MRS performance evaluations move from laboratory environ-
ments to realistic environments. Not only does the amount of recent development
imply that coordination/cooperation is a growing topic, but the development of
new technologies, particularly in communication, is also expected to enable more
reliable coordination/cooperation of MRSs. However, significant work is required to rec-
oncile new technologies and coordination/cooperation strategies.
One of the logistics concerns presented by the participants is the lack of
adequately trained personnel. The participants were asked what they consider
to be an effective activity for training/teaching staff or students to work with
MRSs. The majority (85.3%) of the participants replied, resulting in the following
proposed activities (ordered from most to least mentioned): hands-on experience;
simulations; experimentation; courses; algorithm development; participation in
competitions; study and discussions of recent literature; development of robotic
systems; and summer schools.
In the last open-ended question, we asked participants to state the main dif-
ficulties faced in their current or past jobs when undertaking work related to
MRSs. Contrary to our expectations, many of the responses concerned tech-
nical difficulties, with the most commonly mentioned ones being hardware relia-
bility, logistics, complexity of experimentation, implementation issues, and com-
munication. Non-technical difficulties that were reported include the costs and
access to equipment and infrastructures, personnel training, space constraints,
tool availability, and management of expectations in the field of study.

4 Conclusions
This paper reported a data-driven trend analysis of MRSs.
An exhaustive search of all papers published in 27 leading robotics journals
from 2010 to 2020 revealed (i) a substantial growth in the number of papers
on MRSs, although the fraction of papers devoted to the topic decreased; (ii) for
half of the journals, the fraction of papers devoted to MRSs exceeded 5%; (iii)
six journals reported MRS research at TRL 6 or above; and (iv) a decline, in
relative terms, in the use of laboratory demonstrations while theoretical work
and real-world demonstrations remain in focus.
A survey of 68 leading experts revealed (i) that there is no consensus regard-
ing what constitutes an MRS; (ii) the applications where MRSs are expected to
be widely deployed in the short-, medium- and long-term; and (iii) the areas of
development that require the most effort to move from single-robot systems to
MRSs in real-world applications, including experimentation under realistic con-
ditions, autonomy/resilience, system integration, trustworthiness and end-user
acceptance. There was no consensus regarding how rapidly the field will develop
over the next decade, with some expecting faster and others slower development.

Acknowledgments. This study was conducted as part of the Multi-Robot Systems


project (Oct 2019–Dec 2020), which was funded by the Defence Science and Technology
Laboratory (contract no DSTL-1000141223). The authors thank the 68 participants of
the expert survey, and Alejandro R. Mosteo for feedback on the design of the expert
survey.

References
1. Walter, W.G.: The Living Brain. W. W. Norton (1953)
2. Dudek, G., Jenkin, M.R.M., Milios, E., Wilkes, D.: A taxonomy for multi-agent
robotics. Auton. Robot. 3(4), 375–397 (1996)
3. Farinelli, A., Iocchi, L., Nardi, D.: Multirobot systems: a classification focused on
coordination. IEEE Trans. Syst. Man Cybern. Part B 34(5), 2015–2028 (2004)
4. Parker, L.E., Rus, D., Sukhatme, G.S.: Multiple mobile robot systems. In: Springer
Handbook of Robotics, pp. 1335–1384. Springer (2016)
5. Rossi, F., Bandyopadhyay, S., Wolf, M., Pavone, M.: Review of multi-agent algo-
rithms for collective behavior: a structural taxonomy. IFAC-PapersOnLine 51(12),
112–117 (2018)
6. UK House of Commons Science and Technology Committee: Technology and Inno-
vation Centres, Annex 1: Technology Readiness Levels (2010). https://publications.
parliament.uk/pa/cm201011/cmselect/cmsctech/619/61913.htm#note221
7. American Society for Engineering Education: Engineering and Engineering Tech-
nology by the Numbers. Washington, DC (2019)
Coverage Control for Exploration
of Unknown Non-convex Environments
with Limited Range Multi-robot Systems

Mattia Catellani(B), Federico Pratissoli, Filippo Bertoncelli, and Lorenzo Sabattini

Department of Sciences and Methods for Engineering (DISMI),
University of Modena and Reggio Emilia, Reggio Emilia, Italy
{mattia.catellani,federico.pratissoli,filippo.bertoncelli,
lorenzo.sabattini}@unimore.it

Abstract. This paper presents a novel methodology for the explo-


ration and monitoring of unknown environments performed by a group
of coordinated mobile robots. The proposed methodology builds upon
Voronoi-based coverage control methods, with time-varying probability
density functions. In particular, we propose a methodology that, exploit-
ing locally available information, allows robots to dynamically update the
probability density function in order to drive the robots to explore pre-
viously unexplored areas. The proposed control strategy was validated
in extensive simulations, and by means of experiments carried out on a
group of mobile robots. Results of simulations and experiments demon-
strate the effectiveness of the proposed methodology in leading to the
exploration of large portions of the environment.

1 Introduction
The problem of coordinating multiple robots with the task to explore (and map)
an unknown environment has been widely studied in the last years. One of the
most widely adopted strategies, first described in [11], exploits potential fields
to generate collision-free trajectories. Every robot is treated as a positively
charged particle, with obstacles likewise modeled as positively charged and the
goal as negatively charged, so that every robot is driven towards the goal while
being repelled by both other agents and obstacles. Other control strategies can
be derived from the well-known random walk approach, such as the ones proposed
in [10], where robots move randomly within the environment. An alternative approach is the one pro-
posed by [18], where the authors study the problem of multi-robot exploration
and coverage with limited communication. They propose a decentralized algo-
rithm based on leaving marks on the unknown terrain, which can be sensed by
the robots and allow them to cover the terrain. The limited communication issue
has been considered also by [6], where the control of the multi-robot system relies
on the communication with a base station. In order to assist the robot cover-
age of the environment, [2] proposes an algorithm that deploys radio beacons

and, similarly, [3] proposes robots that, in the process of moving around, deploy
network nodes into the environment.
One of the most relevant and studied exploration methodologies is the frontier-
based coordination approach [19]. The environment is discretized into standard
occupancy grids, and a frontier cell is any already explored cell that is an
immediate neighbor of an unknown, unexplored cell. A map of the environment,
usually generated by exploiting grid maps or graphs, is required to keep the
explored portion of the environment updated and to localize the positions of
the frontier nodes [8].
Coverage based control strategies [5] coordinate a group of autonomous
robots with the aim of maximizing the coverage of the environment and gen-
erally exploit a Voronoi partitioning of the environment. The tools behind these
methodologies can be exploited to deal with the exploration of unknown environ-
ments, like in [1], where the Voronoi diagram is used to generate an obstacle-free
convex neighbourhood among strongly convex obstacles. A significant work is
presented by [9] where a frontier-based exploration methodology is integrated
with a Voronoi partitioning of the environment to avoid duplicated exploration
areas. In both of these works, the boundary of the environment and the obstacles
need to be fully or partially known. A recent study presented in [17] deals with
the problem caused by updating a single global grid map managing only local
information. In [16] an exploration and navigation framework has been proposed
to explore GPS-denied environments using the cooperation between UAVs and
UGVs to increase the accuracy of the SLAM algorithm.
The mentioned studies and the majority of the approaches in multi-robot
exploration are based on a discretization of the environment, typically exploit-
ing occupancy grids. This approach is not suitable for large environments,
where the number of cells grows significantly, thus leading to severe compu-
tational and memory demands. Moreover, these approaches require a strong
communication network in order to keep the shared map updated among the
robots [12]. A methodology that deals with communication issues has been pre-
sented in [14], where the total area is partitioned based on the topology exploiting
the Generalized Voronoi Graphs for exploration purposes. However, the robots
are constrained to move on the edges of the Voronoi diagram. Moreover, this strat-
egy is mostly effective in environments structured in a way that naturally drives
robots during exploration, e.g. with corridors, losing effectiveness in more generic
spaces. An alternative and relevant approach to obtain a mapping and explo-
ration of large environments is presented by [4]. A Gaussian mixture model for
global mapping is used to model a complex environment. This approach, devel-
oped and tested for one robot monitoring, has the benefits to require a small
memory footprint and a low volume of communication. Following this strategy,
in this work, we propose a methodology that exploits a mixture of Gaussian dis-
tributions to coordinate a group of robots and explore an unknown environment.
Since each Gaussian function is defined by only a few parameters and no grid
discretization of the environment is needed, the strategy aims at efficient memory
usage and low communication load among robots.
Contribution. Our methodology uses the coverage control theory described


in [15], which, differently from the approach in [5], introduces a limited Voronoi
diagram able to deal with unknown and non-convex environments. Exploiting
this strategy, which allows an optimal configuration of the agents in the
environment to be obtained in order to maximize the covered area, we provide a solu-
tion to dynamically explore an unknown environment while continuously passing
through it, in order to keep it constantly monitored. We combine the coverage
based control methodology with a mixture of Gaussian distributions to model
the areas of the environments already explored. The proposed approach performs
the exploration and monitoring without a discretization of the environment (grid
map or graph) and without any knowledge of the boundaries of the environment.
Moreover, it is worth noting that the few parameters that define a Gaussian func-
tion can be easily shared among the robots, keeping communication and memory
usage low, as stated in [4]. Finally, most of the works found in the literature
provide only simple simulations of the proposed strategy, and only a few test
the control on real platforms. We extensively tested
the proposed control in both simulated and real environment experiments. We
compare the proposed approach with other typical ones that do not require a
discretization or a partial knowledge of the environment, such as random walk
and potential field approaches.

2 Problem Statement
Consider a multi-robot system constituted by n holonomic robots able to move
in two dimensions, controlled with the objective of monitoring and exploring
an unknown environment. We assume each robot to be modeled as a single
integrator system,1 whose position pi ∈ R² evolves according to ṗi = ui, where
ui ∈ R² is the control input, ∀i = 1, . . . , n.
We consider the following assumptions:
1. Localization with respect to a global reference frame: each robot is able to
localize itself with respect to a global reference frame, which is shared among
the robots in the team;2
2. Limited sensing capabilities: each robot is able to measure the position of
neighboring robots and objects (including boundaries of the environment),
within its limited sensing range (defined by a circle with radius rsens ∈ R>0 ).
Based on these assumptions, the problem addressed in this paper is then for-
malized as follows:
1 We would like to remark that, even though the single integrator is a very simplified
model, it can still effectively be exploited to control real mobile robots: using a
sufficiently good trajectory tracking controller, the single integrator model can be
used to generate velocity references for widely used mobile robotic platforms, such
as wheeled mobile robots, and unmanned aerial vehicles.
2 This can be achieved, e.g., by means of Global Positioning System (GPS) or coop-
erative localization techniques, which are however out of the scope of the paper and
will not be addressed.
Problem 1. Define a distributed control strategy that allows a multi-robot sys-


tem with limited sensing capabilities to explore an arbitrary unknown, possibly
non-convex, large environment. The methodology must account for the possibility
that the environment cannot be discretized due to memory and communication
limitations.
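As remarked in footnote 1, the single-integrator command can be mapped to the inputs of common platforms; a minimal sketch for a differential-drive robot, using the standard offset-point transformation (an illustrative choice, not prescribed by this paper):

```python
# Sketch: map a single-integrator command u = (ux, uy) to unicycle inputs
# (v, omega) by controlling a point offset by d ahead of the wheel axle.
import numpy as np

def si_to_unicycle(u, theta, d=0.05):
    """u: desired planar velocity; theta: robot heading [rad];
    d > 0: offset of the controlled point (a tuning parameter)."""
    c, s = np.cos(theta), np.sin(theta)
    v = c * u[0] + s * u[1]              # forward velocity
    omega = (-s * u[0] + c * u[1]) / d   # angular velocity
    return v, omega
```

Here d trades tracking accuracy against the magnitude of the commanded angular velocity.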
Notation and Definitions. In the rest of the paper, we denote by N, R, R≥0,
and R>0 the sets of natural, real, real non-negative, and real positive numbers,
respectively. Given x ∈ Rⁿ, let ‖x‖ be the Euclidean norm. Let F(R²) be the
collection of finite point sets in R²; we denote an element of F(R²) as P =
{p1, . . . , pn} ⊂ R², where p1, . . . , pn are points in R². For p ∈ R² and r ∈ R>0,
we denote the closed and open balls in R² centered at p with radius r by
B̄(p, r) = {q ∈ R² : ‖q − p‖ ≤ r} and B(p, r) = {q ∈ R² : ‖q − p‖ < r}, respectively.
In the paper, Q ⊂ R² denotes a generic polygon: it will be used, in particular, to
denote the environment to be covered by the robots. An arbitrary point in Q is
denoted by q ∈ Q.

3 Distributed Control Algorithm for Exploration


In this section we briefly describe the control algorithm used to determine the
motion of the robotics agents tasked to explore the unknown environment. The
proposed strategy builds upon the coverage based control approach described
in [15] where the multi-robot system reaches the optimal coverage positions,
i.e., the central Voronoi configurations, in a not necessarily convex environment,
according to the robots’ sensing capabilities. Briefly, the coverage based control
makes use of Lloyd’s algorithm to maximize the sensed area of a mobile sensing
network. The environment is partitioned into cells following the Voronoi par-
titioning algorithm. Every cell Vi is computed according to the corresponding robot’s
position pi ∈ P and is limited to the area sensed by the robot. The centroid of
each partition or cell CVi is then computed, possibly considering a probability
density function φ(q) in the environment. The i-th robot is then instructed to
move towards the centroid of its cell according to the law

$u_i = -k_{prop}\,(p_i - C_{V_i})$   (1)

where kprop ∈ R≥0 is a proportional gain. It is worth remarking that the centroid


of each cell, and hence the control law of each robot, is strongly dependent on
the chosen density distribution (e.g., uniform density function or Gaussian-type
density function). In this work we make use of this property to control the group
of robots with the aim to explore and monitor the environment. In fact, the
objective of the coverage based control is a static and optimal positioning of
the robots in the environment without a dynamical exploration or monitoring
of it. As stated in Sect. 2, each robot has a limited sensing range, therefore the
Voronoi partitioning depends on the sensing range of the robot, according to the
definition in [15]:

$V_i(P) = \{\, q \in \bar{B}(p_i, r_{sens}) \;:\; \|q - p_i\| \leq \|q - p_j\|,\ \forall p_j \in P \,\}.$   (2)


Namely, the i-th cell contains all the points in the environment that are closer
to the i-th robot than to other robots, and that are contained within the sensing
radius of the i-th robot.
The centroid of the Voronoi cell, denoted by $C_{V_i}$, is then computed according to the probability density distribution $\phi(q)$, that is:
$C_{V_i} = \dfrac{\int_{V_i} q \, \phi(q) \, dq}{\int_{V_i} \phi(q) \, dq}.$    (3)
While the probability density function can, in general, be exploited to highlight the relative importance of specific portions of the environment with respect to others, in our approach this distribution describes the probability that a given area has already been explored, as will be explained in Sect. 4.
The control law can then be computed following the methodology described in [15], where each robot is driven towards the centroid of its cell, computed as in (3). It is worth noting that, according to (2) and (3), both the partition and the centroid can be computed in a fully distributed manner: no global knowledge is required for the computation of the control action. Each robot can, in fact, compute its own control input knowing only its own location, its neighbors' locations, and the probability distribution in the environment. In the following section we describe how we define the probability distribution using a mixture of Gaussian functions that leads to an exploration of the environment.
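To illustrate how the control pipeline of (1)-(3) can be evaluated in practice, the following minimal Python sketch computes the limited-range Voronoi cell, its centroid, and the resulting control input for a single robot. The grid here is only a local quadrature over the robot's sensing disk (the approach requires no global discretization of the environment); the function names, the resolution, and the parameter values are our own assumptions for illustration, not the authors' implementation.

import numpy as np

def coverage_step(p_i, neighbors, phi, r_sens=4.0, k_prop=1.0, res=0.1):
    """One coverage-control step for robot i, implementing Eqs. (1)-(3).

    p_i       : (2,) array, own position
    neighbors : (m, 2) array, neighbor positions within communication range
    phi       : callable mapping an (N, 2) array of points to (N,) densities,
                e.g. the shared distribution of Eq. (8)
    """
    # Local quadrature grid over the sensing disk (local only; the
    # environment itself is never discretized globally).
    xs = np.arange(p_i[0] - r_sens, p_i[0] + r_sens, res)
    ys = np.arange(p_i[1] - r_sens, p_i[1] + r_sens, res)
    X, Y = np.meshgrid(xs, ys)
    q = np.column_stack([X.ravel(), Y.ravel()])

    # Limited-range Voronoi cell V_i of Eq. (2): points inside the sensing
    # disk that are closer to p_i than to any neighbor.
    d_own = np.linalg.norm(q - p_i, axis=1)
    in_cell = d_own <= r_sens
    for p_j in neighbors:
        in_cell &= d_own <= np.linalg.norm(q - p_j, axis=1)
    cell = q[in_cell]

    # Density-weighted centroid C_{V_i} of Eq. (3).
    w = phi(cell)
    centroid = (cell * w[:, None]).sum(axis=0) / w.sum()

    # Proportional control input of Eq. (1).
    return -k_prop * (p_i - centroid)

Note that, since the density is evaluated only over the robot's own cell, the routine uses exactly the local information discussed above: the robot's position, its neighbors' positions, and the shared density distribution.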
4 Time-Varying Probability Density Function for Exploration
As introduced in Sect. 2, the aim of this work is to monitor and explore an unknown and possibly non-convex environment with a group of robots with limited sensing capabilities. In order to complete the exploration of the whole environment, the robots need to keep track of the areas already explored by the multi-robot system. A probability density function can be exploited for this purpose, focusing attention on the areas still to be explored and avoiding recently visited ones. The proposed distributed control algorithm uses this approach to maintain updated, shared information about the explored environment.
The probability density function is defined as a combination of multiple Gaussian density functions. Each Gaussian function $\varphi(q)$ is determined by a mean value $\bar{q} \in \mathbb{R}^2$ and a standard deviation $\sigma \in \mathbb{R}_{>0}$, and is defined as follows:
$\varphi(q) = \exp\!\left( -\dfrac{\|q - \bar{q}\|^2}{2\sigma^2} \right).$    (4)
Every agent in the environment must be able to access this data when $\bar{q}$ lies inside its sensing range; the exchange of information can thus be achieved by exploiting a central storage unit, or by physically placing a marker or a beacon that each robot can easily localize. During the motion and the exploration, every robot places
several Gaussian functions in the environment. The combination of all the shared
Gaussian functions determines the motion of the multi-robot system. Every time
a new area is explored by a robot, a new Gaussian function $\varphi(q)$, defined as in (4),
is added by the robot to the shared density distribution following the algorithm
described later (Algorithm 1), affecting the motion of the overall multi-robot
system. During the exploration of an unknown environment, we consider two
possible scenarios the robot can face: (1) the robot explores areas that need
to be crossed only once during the navigation, and (2) areas that need to be
crossed more than once to complete the exploration without deadlocks. Thus,
we define two types of Gaussian functions to be added by the robot following the
definition in (4): a time-constant Gaussian function and a time-varying Gaussian
function. The first one intuitively remains constant over time during the whole
exploration:

$f_{F_i}(q) = \exp\!\left( -\dfrac{\|q - \bar{q}_i\|^2}{2\sigma_i^2} \right).$    (5)
Conversely, the time-varying Gaussian function decreases over time: for this purpose, we make use of a forgetting factor $k_j(t) \in [0, 1]$, which gradually decreases the intensity of the function over time and is defined as

$k_j(t) = 1 - \dfrac{t - T_{0_j}}{T_{forget}},$    (6)
where $T_{0_j}$ is the time instant at which the Gaussian component is added and $T_{forget}$ represents how long the component should remain active. Thus, the time-varying Gaussian function can be defined as
$f_{V_j}(q, t) = \min\!\left( \exp\!\left( -\dfrac{\|q - \bar{q}_j\|^2}{2\sigma_j^2} \right), \; k_j(t) \right),$    (7)
where the forgetting factor $k_j(t)$, decreasing over time, lowers the peak of the probability function until completely nullifying it at $t = T_{0_j} + T_{forget}$.
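To make the two component types concrete, the following sketch (our own illustration; the function names and the explicit clipping of $k_j(t)$ to $[0, 1]$ are assumptions consistent with the definitions above) evaluates (5)-(7):

import numpy as np

def f_constant(q, q_bar, sigma):
    """Time-constant Gaussian component, Eq. (5)."""
    return np.exp(-np.sum((q - q_bar) ** 2, axis=-1) / (2.0 * sigma ** 2))

def forgetting_factor(t, t0, T_forget):
    """Forgetting factor k_j(t) of Eq. (6), kept in [0, 1] as stated."""
    return np.clip(1.0 - (t - t0) / T_forget, 0.0, 1.0)

def f_varying(q, t, q_bar, sigma, t0, T_forget):
    """Time-varying Gaussian component, Eq. (7): the peak is capped by
    k_j(t) and vanishes entirely once t >= t0 + T_forget."""
    return np.minimum(f_constant(q, q_bar, sigma),
                      forgetting_factor(t, t0, T_forget))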
The definition of two different functions allows one to specify whether an explored area needs to be crossed multiple times or visited only once, depending on the specific application. This differentiation is carried out by marking explored areas that do not need to be visited again with the function $f_{F_i}(q)$, while areas required to be crossed multiple times are marked with the function $f_{V_j}(q, t)$: time variance decreases the intensity of the Gaussian function that models the area over time, allowing a specific region to be visited again in the future by the system. In this way, the robots continue monitoring the environment even once the exploration is completed, keeping the gathered information up to date. Moreover, time variance is useful to avoid local minima, which cause the robots to get stuck and make it difficult to achieve a complete exploration of the environment. Every robot stores two lists of Gaussian functions, time-varying and time-constant, whose combination generates
the density distribution perceived by the robot and exploited to compute the control input. During the exploration process, the lists of Gaussian functions are shared across the robot network. Each robot, during the motion, computes the Gaussian components and communicates their parameters to the network. Every robot's control input is continuously updated following the changes in the shared density distribution. The combination of the various Gaussian components is computed as follows:
$\phi(q) = c + m + 1 - \displaystyle\sum_{i=1}^{c} f_{F_i}(q) - \displaystyle\sum_{j=1}^{m} f_{V_j}(q, t),$    (8)
where $c$ is the number of constant density functions $f_{F_i}$ and $m$ is the number of time-varying density functions $f_{V_j}$. The presented formulation generates a density distribution $\phi$ that takes smaller values in areas already visited. The main objective for each robot is to add Gaussian components while exploring the environment, decreasing the intensity of the density distribution around recently visited areas in order to drive the robots' attention towards more interesting areas. The density distribution $\phi$ is then used, following the control strategy described in Sect. 3, to determine the control input for each robot.
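Building on the component sketches above, a possible evaluation of the mixture (8) is the following (the list-of-dictionaries bookkeeping is a hypothetical convenience, not the authors' data structure):

import numpy as np

def phi(q, t, constant_comps, varying_comps):
    """Shared density distribution of Eq. (8).

    constant_comps : list of dicts with keys 'q_bar', 'sigma'
    varying_comps  : list of dicts with keys 'q_bar', 'sigma', 't0', 'T_forget'
    """
    c, m = len(constant_comps), len(varying_comps)
    total = np.full(q.shape[:-1], float(c + m + 1))
    for g in constant_comps:
        total -= f_constant(q, g['q_bar'], g['sigma'])
    for g in varying_comps:
        total -= f_varying(q, t, g['q_bar'], g['sigma'],
                           g['t0'], g['T_forget'])
    return total

Since every component takes values in $[0, 1]$, the offset $c + m + 1$ guarantees $\phi(q) \geq 1 > 0$ everywhere, so the centroid in (3) is always well defined.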
Every robot determines the addition of a new Gaussian component according to Algorithm 1. The main idea of the algorithm is to add new Gaussian functions in order to keep the robot in motion. A new component is added whenever a robot's position is sufficiently far from the mean points of existing components: the threshold is set equal to the standard deviation $\sigma$. The standard deviation $\sigma$ can be seen geometrically as the width of the Gaussian function and is set to half of the robot's sensing range $r$: $\sigma = r/2$. In a normal distribution, 95.45% of the samples lie within a $2\sigma$ radius of the center point; therefore, the value $\sigma = r/2$ reflects the information collected by the robot within its sensing range.
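The 95.45% figure can be verified directly from the Gaussian error function (a one-line check using only the Python standard library):

from math import erf, sqrt

# Probability mass of a 1-D normal distribution within 2 sigma of the mean:
# erf(2 / sqrt(2)) ≈ 0.9545, i.e., 95.45%.
print(erf(2 / sqrt(2)))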
It is worth noting that new components are not added at the current position of the robot: in fact, this might lead to the definition of symmetric probability distributions, which would place the centroid of the region at the same position as the robot, resulting in a null control input in (1). To avoid this, new components are centered at a previously occupied position: in particular, the delay $\Delta$ is calculated as the time needed for a robot to travel a distance of $\sigma$ at maximum velocity $v_{MAX}$:

$\Delta = \dfrac{\sigma}{v_{MAX}}.$    (9)
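As a concrete example, with the parameter values used in the simulations of Sect. 5 ($r_{sens} = 4$ m, hence $\sigma = 2$ m, and $v_{max} = 0.5$ m/s), the delay evaluates to $\Delta = 2/0.5 = 4$ s: each new component is centered at the position the robot occupied four seconds earlier.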
Moreover, if a robot gets stuck, a new component is added to introduce a perturbation in the probability distribution of the surrounding environment, moving the centroid to a different position. The result of using the Gaussian functions described above is a probability distribution indicating whether an area has already been explored or not: an example of such a function is depicted in Fig. 1.
Algorithm 1: Exploration Algorithm

input : robot's position $p$; time delay $\Delta$; lists of probability functions $f_V(q)$ and $f_F(q)$, with mean points $\bar{q}_1, \ldots, \bar{q}_{c+m}$ and variance $\sigma^2$
output: updated probability distribution

1 begin
2    if $\|p - \bar{q}_i\| > \sigma$ for all $i \in \{1, \ldots, c+m\}$ then
3        if robot is moving then
4            Add a new time-varying Gaussian component with mean point corresponding to the position of the robot $\Delta$ seconds earlier
5        if robot has stopped then
6            Add a new time-varying Gaussian component with $\bar{q}$ at the most recent robot position at a distance $\geq 0.75\sigma$ from $p$
7    if robot lies in an area that needs to be visited only once then
8        Add a time-constant Gaussian component with $\bar{q} = p$ and $\sigma = r/2$
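A possible rendering of Algorithm 1 in code, under the same bookkeeping assumptions as the sketches above, is the following; the pose history `trail`, the helper `position_at`, the moving/stopped flag, and the `T_forget` value are hypothetical, as the paper does not specify how past poses are stored:

import numpy as np

def position_at(trail, t_query):
    """Return the recorded position closest in time to t_query
    (hypothetical bookkeeping over a list of (time, position) pairs)."""
    return min(trail, key=lambda tp: abs(tp[0] - t_query))[1]

def update_components(p, trail, t, comps, sigma, delta,
                      moving=True, visit_once=False, T_forget=30.0):
    """Sketch of Algorithm 1: add Gaussian components to the shared lists."""
    p = np.asarray(p)
    means = [g['q_bar'] for g in comps['constant'] + comps['varying']]
    # Add a component only when sufficiently far from every existing mean.
    if all(np.linalg.norm(p - q_bar) > sigma for q_bar in means):
        if moving:
            q_new = position_at(trail, t - delta)  # pose Delta seconds ago
        else:
            # Stuck: perturb using the most recent pose >= 0.75*sigma away
            # (assumed to exist in the recorded trail).
            q_new = next(q for (_, q) in reversed(trail)
                         if np.linalg.norm(p - q) >= 0.75 * sigma)
        comps['varying'].append({'q_bar': np.asarray(q_new), 'sigma': sigma,
                                 't0': t, 'T_forget': T_forget})
    if visit_once:  # e.g., a detected corner or dead-end area
        comps['constant'].append({'q_bar': p, 'sigma': sigma})
    return comps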
5 Simulations and Experiments

5.1 Implementation
The proposed algorithm has been implemented using ROS2 [13] and has been validated by means of extensive simulations and real-world experiments. In particular, several simulation environments have been created using the Gazebo simulator, representing wide areas populated by a large number of robots. Two series of rigorous simulations have been conducted: the first aiming at a statistical analysis of the results and of the performance of the proposed control strategy, the other at a comparison with other strategies found in the literature. As for the first series of simulations, the initial number of robots was randomly chosen, as well as their starting locations in the environment. The simulations conducted include configurations with 2, 4, 6, and 8 robots, with a sensing range $r_{sens} = 4$ m and a maximum linear velocity $v_{max} = 0.5$ m/s. Moreover, various environment dimensions have been tested, along with different environment configurations and different obstacle locations. In particular, we considered non-convex polygonal environments with areas of 200 m², 400 m², 900 m², and 2500 m². Two of the scenarios tested in the simulations are shown in Fig. 2: they differ in number of robots, environment dimension, and complexity. For each experimental scenario, we ran 20 tests of 5 min of simulation: overall, 320 tests were conducted. For each experiment we saved the robots' locations over time, which are used to compute the portion of covered surface.
Fig. 1. Probability distribution of the environment

The same statistical approach was followed to make a comparison with other strategies. In this case, our solution was compared with the classic potential field navigation algorithm [11] and a random walk solution presented in [7], in which robots drive with a constant straight motion, only changing their direction when an obstacle is detected. As before, the same robots were used in teams of 6, 8, 10, and 12 agents placed at random starting positions within the same 2500 m² environment already described. Twenty simulations were run for every team configuration, for a total of 80 simulations per method.
We also conducted experiments on real robotic platforms exploiting the ROS2 framework. The tests took place in a limited-size environment with polygonal obstacles, using three Turtlebot3 Burger differential-drive robots (see Fig. 3). The robots' localization with respect to a global reference frame was obtained through the OptiTrack motion capture system. Every robot, following the control algorithm, shares the locally added Gaussian functions with the other robots. We conducted several tests, randomly choosing the initial locations and orientations of the robots.
In both simulations and real-world experiments, the parameters defining the Gaussian functions were shared among the agents using a central storage unit, where each robot is able to both save and retrieve data. In addition, we considered non-circulating areas, such as corners or dead-end streets, as places that need to be visited only once, marking them with the time-constant Gaussian functions $f_{F_i}(q)$ defined in (5). Assuming that every robot is capable of detecting corners in the environment by adopting a state-of-the-art strategy such as the one proposed in [20], the agents can recognize when a non-circulating area has been visited and consequently define a new time-constant Gaussian function $f_{F_i}(q)$.
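As a hedged illustration of the information sharing, the following minimal ROS2 node publishes each newly added component as a flat parameter vector and collects the ones added by teammates. The topic name, the message layout, and the use of a peer-to-peer topic in place of the central storage unit are our own assumptions, not the authors' implementation:

import rclpy
from rclpy.node import Node
from std_msgs.msg import Float64MultiArray

class GaussianSharer(Node):
    """Shares Gaussian parameters [x, y, sigma, t0, T_forget]; by
    convention (an assumption), T_forget < 0 marks a time-constant
    component. The node also receives its own messages, which a real
    implementation would filter out."""

    def __init__(self):
        super().__init__('gaussian_sharer')
        self.remote = []  # components received from other robots
        self.pub = self.create_publisher(Float64MultiArray, '/gaussians', 10)
        self.sub = self.create_subscription(
            Float64MultiArray, '/gaussians', self.on_component, 10)

    def share(self, x, y, sigma, t0, T_forget=-1.0):
        msg = Float64MultiArray()
        msg.data = [float(x), float(y), float(sigma), float(t0),
                    float(T_forget)]
        self.pub.publish(msg)

    def on_component(self, msg):
        self.remote.append(list(msg.data))

def main():
    rclpy.init()
    rclpy.spin(GaussianSharer())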
5.2 Results and Discussion
As previously mentioned, a large number of simulations were executed in order to obtain a rigorous statistical analysis of the results. We show that the performance of the control strategy depends on the initial number of robots and is not affected by the environment configuration or by the initial robot locations. In all the experiments and simulations, the multi-robot system was able to entirely explore the unknown non-convex environment. In particular, the percentage of area covered during the exploration increases with the number of robots used.
Fig. 2. Two of the scenarios tested in simulation exploiting the Gazebo environment. The two experiments differ in number of robots, environment dimension, and complexity. The blue circles represent the sensing areas of the robots.
Figure 4 focuses on the results of the simulations conducted in the environment shown in Fig. 2a, which has an area of 400 m². From the results in Fig. 4, we can conclude that 4 robots are sufficient to successfully explore the desired environment, while slightly better performance is obtained with 6 and 8 robots.
Fig. 3. The experimental set-up of one of the experiments conducted with real robotic platforms. Three Turtlebot3 Burger robots were tasked to explore an unknown non-convex area.

Fig. 4. Explored area in the 400 m² environment

Fig. 5. Comparison with other solutions

Figure 5 instead shows the comparison between the results obtained with our solution and the other strategies. The advantages with respect to the Potential Field navigation are clear: the latter performed poorly because of the large environment used, which results in a great number of local minima where robots get stuck. The comparison with the Random Walk approach also shows that our solution achieves better performance: the multi-robot system was able to explore a larger region of the environment thanks to its ability to coordinate the agents in an optimal way.
Finally, the real-world experiments have also shown very good exploration of the environment: in the example in Fig. 6, where 3 robots were employed, the covered area reached 90% of the total. This value is in line with the expectations from the simulated trials, since the total area is much smaller than in the simulated environments, but fewer robots were employed.
Fig. 6. Explored area in field trials
6 Conclusions
In this paper we presented a distributed coverage-based control strategy to coordinate a group of autonomous robots with limited sensing capabilities, with the aim of exploring an unknown environment. The proposed control strategy exploits a density distribution to drive the multi-robot system towards areas of the environment still to be explored. The density distribution is built from the combination of multiple Gaussian functions placed by the robots in the environment during their motion. We showed how the full exploration of the unknown environment can be achieved without discretizing the environment into graphs or grid cells, differently from the majority of the studies found in the literature. Moreover, the robots need to store and share across the network only the Gaussian parameters, which require low communication volume and low memory consumption. Several experiments and simulations were performed on a group of mobile robots to validate the performance of the proposed control method and to compare it with other solutions found in the literature. As future work, we aim to further investigate the proposed approach in comparison with frontier-based exploration methodologies.
References
1. Arslan, O., Koditschek, D.E.: Sensor-based reactive navigation in unknown convex
sphere worlds. Int. J. Robot. Res. 38(2–3), 196–223 (2019)
2. Batalin, M.A., Sukhatme, G.S.: Coverage, exploration and deployment by a mobile
robot and communication network. Telecommun. Syst. 26(2), 181–196 (2004)
3. Batalin, M.A., Sukhatme, G.S.: The analysis of an efficient algorithm for robot
coverage and exploration based on sensor network deployment. In: Proceedings of
the 2005 IEEE International Conference on Robotics and Automation, pp. 3478–
3485. IEEE (2005)
4. Corah, M., O’Meadhra, C., Goel, K., Michael, N.: Communication-efficient plan-
ning and mapping for multi-robot exploration in large environments. IEEE Robot.
Autom. Lett. 4(2), 1715–1721 (2019)
5. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing
networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004)
6. De Hoog, J., Cameron, S., Visser, A.: Dynamic team hierarchies in communication-
limited multi-robot exploration. In: 2010 IEEE Safety Security and Rescue
Robotics, pp. 1–7. IEEE (2010)
7. Francesca, G., Brambilla, M., Brutschy, A., Trianni, V., Birattari, M.: AutoMoDe:
a novel approach to the automatic design of control software for robot swarms.
Swarm Intell. 8, 1–24 (2014)
8. Franchi, A., Freda, L., Oriolo, G., Vendittelli, M.: The sensor-based random graph
method for cooperative robot exploration. IEEE/ASME Trans. Mechatron. 14(2),
163–175 (2009)
9. Hu, J., Niu, H., Carrasco, J., Lennox, B., Arvin, F.: Voronoi-based multi-robot
autonomous exploration in unknown environments via deep reinforcement learning.
IEEE Trans. Veh. Technol. 69(12), 14413–14423 (2020)
10. Kegeleirs, M., Garzón Ramos, D., Birattari, M.: Random walk exploration for
swarm mapping. In: Althoefer, K., Konstantinova, J., Zhang, K. (eds.) TAROS
2019. LNCS (LNAI), vol. 11650, pp. 211–222. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25332-5_19
11. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. In:
Cox, I.J., Wilfong, G.T. (eds.) Autonomous Robot Vehicles, pp. 396–404. Springer,
New York, NY (1986). https://doi.org/10.1007/978-1-4613-8997-2_29
12. Kim, J.: Cooperative exploration and networking while preserving collision avoid-
ance. IEEE Trans. Cybern. 47(12), 4038–4048 (2016)
13. Macenski, S., Foote, T., Gerkey, B., Lalancette, C., Woodall, W.: Robot operating
system 2: design, architecture, and uses in the wild. Sci. Robot. 7(66), eabm6074
(2022)
14. Masaba, K., Quattrini Li, A.: GVGExp: communication-constrained multi-robot
exploration system based on generalized Voronoi graphs. In: 2021 International
Symposium on Multi-Robot and Multi-Agent Systems (MRS). IEEE (2021)
15. Pratissoli, F., Capelli, B., Sabattini, L.: On coverage control for limited range multi-
robot systems. In: IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS) (2022)
16. Qin, H., et al.: Autonomous exploration and mapping system using heterogeneous
UAVs and UGVs in GPS-denied environments. IEEE Trans. Veh. Technol. 68(2),
1339–1350 (2019)
17. Ryu, H.: Graph search-based exploration method using a frontier-graph structure
for mobile robots. Sensors 20(21), 6270 (2020)
18. Senthilkumar, K., Bharadwaj, K.K.: Multi-robot exploration and terrain coverage
in an unknown environment. Robot. Auton. Syst. 60(1), 123–132 (2012)
19. Stachniss, C.: Coordinated Multi-Robot Exploration. In: Robotic Mapping and
Exploration. Springer Tracts in Advanced Robotics, vol. 55, pp. 43–71. Springer,
Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01097-2_4
20. Yan, C., Shao, B., Zhao, H., Ning, R., Zhang, Y., Xu, F.: 3d room layout estimation
from a single RGB image. IEEE Trans. Multimed. 22(11), 3014–3024 (2020)