[TABLE III. Average steps with and without CoRS.]
The maximum numbers of time steps for the 40 × 40 and 80 × 80 maps are 256 and 386, respectively.
From the experimental results illustrated in Fig. 6, it is evident that our proposed CoRS-DHC significantly outperforms the existing DHC algorithm on all test sets. In high agent density scenarios, such as 40 × 40 with 64 agents and 80 × 80 with 128 agents, CoRS-DHC improves the pathfinding success rate by more than 20% compared to DHC. Furthermore, as shown in Table III, CoRS effectively reduces the number of steps required for all agents to reach their target spots in the various test scenarios. Although the DHC algorithm incorporates heuristic functions and inter-agent communication, it still struggles with cooperation among agents. In contrast, our CoRS-DHC promotes inter-agent collaboration through reward shaping, thereby substantially enhancing the performance of the DHC algorithm in high-density scenarios. Notably, this improvement is achieved without any change to the algorithm's structure or network scale, underscoring the effectiveness of our approach.

B. Success Rate and Average Steps

Additionally, the policy trained through CoRS-DHC is compared with other advanced MAPF algorithms. The current SOTA algorithm DCC [25], which is also based on reinforcement learning, is selected as the main comparison object, with the centralized MAPF algorithm ODrM* [6] and PRIMAL [20] (based on RL and IL) serving as references. DCC is an efficient model designed to enhance agent performance by training agents to selectively communicate with their neighbors during both the training and execution stages. It introduces a complex decision causal unit into each agent, which determines the appropriate neighbors for communication during these stages. Conversely, the PRIMAL algorithm achieves distributed MAPF by imitating ODrM* and incorporating reinforcement learning. ODrM* is a centralized algorithm designed to generate optimal paths for multiple agents and is one of the best centralized MAPF algorithms currently available; we use it as a baseline to show the differences between distributed and centralized algorithms. The experimental results are presented in Fig. 7.

[Fig. 7. Success rate and average steps across different testing scenarios.]

The experimental results indicate that our CoRS-DHC algorithm exceeds the success rate of the DCC algorithm in the majority of scenarios. Additionally, aside from the 40 × 40 grid with 64 agents, the makespan of the policies trained by CoRS-DHC is comparable to or even shorter than that of DCC across the other scenarios. These results demonstrate that CoRS-DHC achieves performance comparable to that of DCC. However, it should be noted that DCC employs a significantly more complex communication mechanism during both training and execution, whereas CoRS only uses simple reward shaping during the training phase. Compared to PRIMAL and DHC, CoRS-DHC exhibits remarkably superior performance.
VI. CONCLUSION

This letter proposes a reward-shaping method termed CoRS, applicable to standard MAPF tasks. By promoting cooperative behavior among multiple agents, CoRS significantly improves the efficiency of MAPF. The experimental results indicate that CoRS substantially enhances the performance of MARL algorithms in solving the MAPF problem. CoRS also has implications for other multi-agent reinforcement learning tasks. We plan to extend this reward-shaping strategy to a wider range of MARL environments in future work.

REFERENCES

[1] Fu Chen. Aircraft taxiing route planning based on multi-agent system. In 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pages 1421–1425, 2016.
[2] Jiankun Wang and Max Q.-H. Meng. Real-time decision making and path planning for robotic autonomous luggage trolley collection at airports. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 52(4):2174–2183, 2021.
[3] Oren Salzman and Roni Stern. Research challenges and opportunities in multi-agent path finding and multi-agent pickup and delivery problems. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 1711–1715, 2020.
[4] Altan Yalcin. Multi-Agent Route Planning in Grid-Based Storage Systems. Doctoral thesis, Europa-Universität Viadrina Frankfurt, 2018.
[5] Kevin Nagorny, Armando Walter Colombo, and Uwe Schmidtmann. A service- and multi-agent-oriented manufacturing automation architecture: An IEC 62264 level 2 compliant implementation. Computers in Industry, 63(8):813–823, 2012.
[6] Cornelia Ferner, Glenn Wagner, and Howie Choset. ODrM*: Optimal multirobot path planning in low dimensional search spaces. In 2013 IEEE International Conference on Robotics and Automation, pages 3854–3859. IEEE, 2013.
[7] Teng Guo and Jingjin Yu. Sub-1.5 time-optimal multi-robot path planning on grids in polynomial time. arXiv preprint arXiv:2201.08976, 2022.
[8] Han Zhang, Jiaoyang Li, Pavel Surynek, T. K. Satish Kumar, and Sven Koenig. Multi-agent path finding with mutex propagation. Artificial Intelligence, 311:103766, 2022.
[9] Glenn Wagner and Howie Choset. Subdimensional expansion for multirobot path planning. Artificial Intelligence, 219:1–24, 2015.
[10] Na Fan, Nan Bao, Jiakuo Zuo, and Xixia Sun. Decentralized multi-robot collision avoidance algorithm based on RSSI. In 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP), pages 1–5. IEEE, 2021.
[11] Chunyi Peng, Minkyong Kim, Zhe Zhang, and Hui Lei. VDN: Virtual machine image distribution network for cloud data centers. In 2012 Proceedings IEEE INFOCOM, pages 181–189. IEEE, 2012.
[12] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020.
[13] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[14] Wesley Suttle, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Başar, and Ji Liu. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. IFAC-PapersOnLine, 53(2):1549–1554, 2020. 21st IFAC World Congress.
[15] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
[16] Sirui Chen, Zhaowei Zhang, Yaodong Yang, and Yali Du. STAS: Spatial-temporal return decomposition for multi-agent reinforcement learning. arXiv preprint arXiv:2304.07520, 2023.
[17] Jiahui Li, Kun Kuang, Baoxiang Wang, Furui Liu, Long Chen, Fei Wu, and Jun Xiao. Shapley counterfactual credits for multi-agent reinforcement learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 934–942, 2021.
[18] Zhenghao Peng, Quanyi Li, Ka Ming Hui, Chunxiao Liu, and Bolei Zhou. Learning to simulate self-driven particles system with coordinated policy optimization. Advances in Neural Information Processing Systems, 34:10784–10797, 2021.
[19] Binyu Wang, Zhe Liu, Qingbiao Li, and Amanda Prorok. Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robotics and Automation Letters, 5(4):6932–6939, 2020.
[20] Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, T. K. Satish Kumar, Sven Koenig, and Howie Choset. PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019.
[21] Zuxin Liu, Baiming Chen, Hongyi Zhou, Guru Koushik, Martial Hebert, and Ding Zhao. MAPPER: Multi-agent path planning with evolutionary reinforcement learning in mixed dynamic environments. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11748–11754. IEEE, 2020.
[22] Qingbiao Li, Fernando Gama, Alejandro Ribeiro, and Amanda Prorok. Graph neural networks for decentralized multi-robot path planning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11785–11792. IEEE, 2020.
[23] Cornelia Ferner, Glenn Wagner, and Howie Choset. ODrM*: Optimal multirobot path planning in low dimensional search spaces. In 2013 IEEE International Conference on Robotics and Automation, pages 3854–3859, 2013.
[24] Ziyuan Ma, Yudong Luo, and Hang Ma. Distributed heuristic multi-agent path finding with communication. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 8699–8705. IEEE, 2021.
[25] Ziyuan Ma, Yudong Luo, and Jia Pan. Learning selective communication for multi-agent path finding. IEEE Robotics and Automation Letters, 7(2):1455–1462, 2022.
[26] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
[27] Alexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865, 2017.
[28] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pages 3040–3049. PMLR, 2019.
[29] Sirui Chen, Zhaowei Zhang, Yaodong Yang, and Yali Du. STAS: Spatial-temporal return decomposition for solving sparse rewards problems in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17337–17345, 2024.
[30] Songyang Han, He Wang, Sanbao Su, Yuanyuan Shi, and Fei Miao. Stable and efficient Shapley value-based reward reallocation for multi-agent reinforcement learning of autonomous vehicles. In 2022 International Conference on Robotics and Automation (ICRA), pages 8765–8771. IEEE, 2022.
[31] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5887–5896. PMLR, 2019.
[32] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
APPENDIX A
PROOFS

In this section, the proof of Theorem 1 is provided. For clarity, we hereby reiterate the assumptions involved in the proof.

Assumption 1. The reward for an agent staying at its target point is $0$, while the rewards for both movement and staying at a non-target point are $r_m < 0$. The collision reward satisfies $r_c < r_m$.

Assumption 2. During the interaction process between $A^i$ and $\tilde{A}^{-i}$, neither $A^i$ nor $\tilde{A}^{-i}$ has ever reached its respective endpoint.

Theorem 1. Assume Assumptions 1 and 2 hold. Then, when $\alpha = \frac{1}{2}$, $Q^i_{\pi^i_*}(\bar{s}_t, a^i)$, $Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i})$ and $Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i, a^{-i}\})$ satisfy the IGM condition.

Proof. Theorem 1 is equivalent to the following statement: for $a^i_* = \arg\max_{a^i} Q^i_{\pi^i_*}(\bar{s}_t, a^i)$, there exists $a^{-i}_* = \arg\max_{a^{-i}} Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i})$ such that
$$Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i_*, a^{-i}_*\}) = \max_{\bar{a}} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \bar{a}).$$

For simplicity in the discussion, we denote $r^i(\bar{s}_t, \bar{a}_t)$ and $r^{-i}(\bar{s}_t, \bar{a}_t)$ as $r^i_t$ and $r^{-i}_t$, and $\mathbb{E}_\tau\big[\sum_{t=0}^{N} \gamma^t r^j_t\big]$ as $\sum_\tau \gamma^t r^j$. When $\alpha = \frac{1}{2}$, $\tilde{r}^i = \frac{1}{2} r^i + \frac{1}{2} \max_{a^{-i}} r^{-i}$ and $\tilde{r}^{-i} = \frac{1}{2} r^{-i} + \frac{1}{2} \max_{a^i} r^i$.

First, under Assumptions 1 and 2, if no collisions occur along the trajectory $\tau$, then for all $\bar{s}_t$ and $\bar{a}_t \in \tau$ we have $r^i = \max_{a^i} r^i = r_m$ and $r^{-i} = \max_{a^{-i}} r^{-i} = r_m$. This holds because if $A^i$ and $\tilde{A}^{-i}$ do not reach their goals, then $r^i < 0$ and $r^{-i} < 0$; meanwhile, if no collisions occur along $\tau$, then $r^i > r_c$ and $r^{-i} > r_c$. Thus, for all $\bar{s}_t$ and $\bar{a}_t \in \tau$, $r^i = \max_{a^i} r^i = r_m$ and $r^{-i} = \max_{a^{-i}} r^{-i} = r_m$.

Second, $\forall a^i \in \mathcal{A}^i$ and $\forall \{a^i, a^{-i}\} \in \mathcal{A}$, $Q^i_{\pi^i_*}(\bar{s}_t, a^i) \geq \frac{1}{2} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i, a^{-i}\})$. This is because
$$
\begin{aligned}
Q^i_{\pi^i_*}(\bar{s}, a^i) - \frac{1}{2} Q^{tot}_{\bar{\pi}_*}(\bar{s}, \bar{a})
&= \sum_{\tau^i} \gamma^t \Big( \frac{1}{2} r^i + \frac{1}{2} \max_{a^{-i}} r^{-i} \Big) - \sum_{\bar{\tau}} \gamma^t \Big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \Big) \\
&\overset{(a)}{\geq} \sum_{\bar{\tau}} \gamma^t \Big( \frac{1}{2} r^i + \frac{1}{2} \max_{a^{-i}} r^{-i} \Big) - \sum_{\bar{\tau}} \gamma^t \Big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \Big) \\
&= \sum_{\bar{\tau}} \gamma^t \Big( \frac{1}{2} \max_{a^{-i}} r^{-i} - \frac{1}{2} r^{-i} \Big) \geq 0,
\end{aligned}
$$
where $(a)$ holds because the trajectory $\tau^i$ maximizes $\sum \gamma^t \tilde{r}^i$: when the trajectory $\tau^i$ is replaced by $\bar{\tau}$, we have $\sum_{\tau^i} \gamma^t \tilde{r}^i \geq \sum_{\bar{\tau}} \gamma^t \tilde{r}^i$.

Then consider $a^i_*$ and $Q^i_{\pi^i_*}(\bar{s}, a^i_*) = \sum_{\tau^i} \gamma^t \big( \frac{1}{2} r^i + \frac{1}{2} \max_{a^{-i}} r^{-i} \big)$. There must be no collisions along $\tau^i$, otherwise $a^i_*$ cannot be optimal. Therefore, $\max_{a^i} Q^i_{\pi^i_*}(\bar{s}_t, a^i) = \sum_{\tau^i} \gamma^t \big( \frac{1}{2} r^i + \frac{1}{2} \max_{a^{-i}} r^{-i} \big) = \sum_{\tau^i} \gamma^t \big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \big)$. Furthermore, since $Q^i_{\pi^i_*}(\bar{s}_t, a^i_*) \geq \frac{1}{2} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i_*, a^{-i}\})$, we find that $\sum_{\tau^i} \gamma^t \big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \big) = \max_{a^i} Q^i_{\pi^i_*}(\bar{s}_t, a^i) = \frac{1}{2} \max_{\bar{a}} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \bar{a}) = \frac{1}{2} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \bar{a}_t)$, where $\bar{a}_t \in \tau^i$. We select $a^{-i}_*$ such that $\{a^i_*, a^{-i}_*\} = \bar{a}_t$.

Based on the previous discussion, it can be deduced that $a^i_*$ and $a^{-i}_*$ maximize both $Q^i_{\pi^i_*}(\bar{s}_t, a^i)$ and $Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i, a^{-i}\})$. Next, we demonstrate that $a^{-i}_*$ also maximizes $Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i})$. Let $Q^i_{\pi^i_*}(\bar{s}_t, a^i_*) = \frac{1}{2} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \{a^i_*, a^{-i}_*\}) = U$. Given that there is no conflict in $\tau^i$ and $U$ represents the maximum accumulated reward of $\frac{1}{2} r^i + \frac{1}{2} r^{-i}$ in state $\bar{s}_t$, we have
$$U = \sum_{\tau^i} \gamma^t \Big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \Big) = \sum_{\tau^i} \gamma^t \Big( \frac{1}{2} \max_{a^i} r^i + \frac{1}{2} r^{-i} \Big) = Q^{-i}_{\pi^{-i}_*}(\bar{s}, a^{-i}_*).$$
Assume there exists $\hat{a}^{-i} \neq a^{-i}_*$ such that $\max_{a^{-i}} Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i}) = Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, \hat{a}^{-i}) = V > U$. Therefore,
$$U < V = \sum_{\tau^{-i}} \gamma^t \Big( \frac{1}{2} r^{-i} + \frac{1}{2} \max_{a^i} r^i \Big) = \sum_{\tau^{-i}} \gamma^t \Big( \frac{1}{2} r^{-i} + \frac{1}{2} r^i \Big) \overset{(b)}{\leq} \sum_{\tau^i} \gamma^t \Big( \frac{1}{2} r^{-i} + \frac{1}{2} r^i \Big) = U,$$
which is clearly a contradiction. The validity of $(b)$ follows from the fact that $\sum_{\tau^i} \gamma^t \big( \frac{1}{2} r^i + \frac{1}{2} r^{-i} \big) = \frac{1}{2} \max_{\bar{a}} Q^{tot}_{\bar{\pi}_*}(\bar{s}_t, \bar{a})$. Therefore, $Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i}_*) = \max_{a^{-i}} Q^{-i}_{\pi^{-i}_*}(\bar{s}_t, a^{-i})$, which completes the proof.

APPENDIX B
EXAMPLES

In this section, some examples are provided to facilitate the understanding of the CoRS algorithm.

A. The Instability of Eq. (2)

In this section, we illustrate the instability of Eq. (2) through an example. Fig. 8 presents two agents, $A^1$ and $A^2$, with the blue and red flags representing their respective endpoints. The optimal trajectory in this scenario is already depicted in the figure. Based on the optimal trajectory, it is evident that the optimal action pair in the current state is Action Pair 1. We anticipate that when $A^1$ takes the action from Action Pair 1, $I_c^1(\bar{s}_t, \bar{a}_t)$ will be higher than for other actions.

[Fig. 8. An example illustrating the instability of Eq. (2).]

According to the calculation method provided in Eq. (2) and the reward function given in Table I, we compute the values of $I_c^1$ when $A^1$ takes the action from Action Pair 1 under different conditions. When $A^2$ takes the action in Action Pair 1, $I_c^1 = \frac{1}{|\mathcal{A}^{-1}|} \sum_{j \in \mathcal{A}^{-1}} r^j = r^2 = -0.070$. When $A^2$ takes the action in Action Pair 2, a collision occurs between $A^1$ and $A^2$, and $I_c^1 = \frac{1}{|\mathcal{A}^{-1}|} \sum_{j \in \mathcal{A}^{-1}} r^j = r^2 = -0.5$. These calculations show that in the same state, even though $A^1$ takes the optimal action, the value of $I_c^1$ changes depending on $a^2$. This indicates that the calculation method provided in Eq. (2) is unstable.
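To make the two cases above easy to verify, the following sketch recomputes $I_c^1$ as the mean reward of the other agents, with the per-step values quoted in the example ($-0.070$ for a regular move, $-0.5$ for a collision, as given in Table I). This is an illustrative reconstruction of the Eq. (2) credit for this two-agent example, not the authors' code; the function name `eq2_credit` is ours.

```python
def eq2_credit(rewards, i):
    """Eq. (2)-style credit for agent i: the mean reward of all other agents."""
    others = [r for j, r in enumerate(rewards) if j != i]
    return sum(others) / len(others)

R_MOVE, R_COLLIDE = -0.070, -0.5  # per-step rewards quoted from Table I

# A1 (index 0) takes its optimal action (Action Pair 1) in the same state:
case_pair1 = eq2_credit([R_MOVE, R_MOVE], i=0)        # A2 also acts as in Action Pair 1
case_pair2 = eq2_credit([R_COLLIDE, R_COLLIDE], i=0)  # A2 deviates and the agents collide

print(case_pair1, case_pair2)  # -0.07 vs -0.5: same state and action for A1, different credit
```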
[Fig. 9. An example illustrating how the reward-shaping method enhances inter-agent cooperation.]

B. How Reward Shaping Promotes Inter-Agent Cooperation

In this section, we illustrate with an example how the reward-shaping method of Eq. (4) promotes inter-agent cooperation, thereby enhancing the overall system efficiency. Table I presents the reward function used in this example. As shown in Fig. 9, there are two agents, $A^1$ and $A^2$, engaged in interaction. Throughout their interaction, $A^1$ and $A^2$ remain neighbors. Each agent aims to maximize its cumulative reward. For this two-agent scenario, we set $\alpha = \frac{1}{2}$. Two trajectories, $\tau_1$ and $\tau_2$, are illustrated in Fig. 9. We calculate the cumulative rewards that $A^1$ can obtain on the two trajectories using different reward functions. For simplicity, in this example, $\gamma = 1$.

If $A^1$ uses $r^1$ as its reward function, the cumulative reward on trajectory $\tau_1$ is $(-0.070) + (-0.070) + (-0.070) = -0.21$, and on trajectory $\tau_2$ it is likewise $(-0.070) + (-0.070) + (-0.070) = -0.21$. Consequently, $A^1$ may choose trajectory $\tau_2$. In $\tau_2$, $A^2$ will stop and wait due to being blocked by $A^1$. This indicates that using the original reward function may not promote collaboration between agents, resulting in a decrease in overall system efficiency.

However, if $A^1$ adopts $\tilde{r}^1 = \frac{1}{2} r^1 + \frac{1}{2} \max_{a^2} r^2$ as its reward, the cumulative reward for $A^1$ on trajectory $\tau_1$ is $(-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.070) + (-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.070) + (-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.070) = -0.21$. In contrast, on trajectory $\tau_2$ it is $(-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.075) + (-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.070) + (-\frac{1}{2}\cdot 0.070 - \frac{1}{2}\cdot 0.070) + (\frac{1}{2}\cdot 0 - \frac{1}{2}\cdot 0.070) = -0.2475$. Therefore, $A^1$ will choose trajectory $\tau_1$ instead of $\tau_2$, allowing $A^2$ to reach its goal as quickly as possible. This demonstrates that using the reward $\tilde{r}^1$ shaped by Eq. (4) enables $A^1$ to adequately consider the impact of its actions on other agents, thereby promoting cooperation among agents and enhancing the overall efficiency of the system.
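The arithmetic of this example can be reproduced with a few lines of code. The sketch below recomputes $A^1$'s return on the two trajectories under the original reward $r^1$ and under the shaped reward $\tilde{r}^1 = \frac{1}{2} r^1 + \frac{1}{2}\max_{a^2} r^2$, with the per-step values transcribed from the text and $\gamma = 1$; the list and function names are ours, not part of the paper's implementation.

```python
# Per-step pairs (r1, max_a2 r2) for agent A1, transcribed from the worked example.
tau1 = [(-0.070, -0.070), (-0.070, -0.070), (-0.070, -0.070)]
tau2 = [(-0.070, -0.075), (-0.070, -0.070), (-0.070, -0.070), (0.0, -0.070)]

def ret_original(traj):
    """A1's return under its original reward r1 (gamma = 1)."""
    return sum(r1 for r1, _ in traj)

def ret_shaped(traj, alpha=0.5):
    """A1's return under the shaped reward alpha*r1 + (1-alpha)*max_a2 r2."""
    return sum(alpha * r1 + (1 - alpha) * r2_max for r1, r2_max in traj)

print(round(ret_original(tau1), 4), round(ret_original(tau2), 4))  # -0.21 vs -0.21: A1 is indifferent
print(round(ret_shaped(tau1), 4), round(ret_shaped(tau2), 4))      # -0.21 vs -0.2475: A1 prefers tau1
```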
APPENDIX C
DETAILS OF EXPERIMENTS

In this section, we provide the detailed data behind Section V. Table IV and Table V present the success rates of the different algorithms across various test scenarios, while Table VI and Table VII show their makespan in these scenarios. The results for the ODrM* algorithm, which are considered optimal, are highlighted in bold. The best metrics among the other algorithms, which use reinforcement learning techniques, are marked in red.

TABLE IV
Success Rate (%) in the 40 × 40 Environment.

Agents   DHC   DCC   CoRS-DHC   PRIMAL   ODrM*
4        96    100   100        84       100
8        95    98    98         88       100
16       94    98    100        94       100
32       92    94    96         86       100
64       64    91    89         17       92

TABLE V
Success Rate (%) in the 80 × 80 Environment.

Agents   DHC   DCC   CoRS-DHC   PRIMAL   ODrM*
4        93    99    100        67       100
8        90    98    100        53       100
16       90    97    97         33       100
32       88    94    94         17       100
64       87    92    93         6        100
128      40    81    83         5        100

TABLE VI
Average Steps in the 40 × 40 Environment.

Agents   DHC      DCC      CoRS-DHC   PRIMAL   ODrM*
4        64.15    48.58    50.36      79.08    50
8        77.67    59.60    64.77      76.53    52.17
16       86.87    71.34    68.48      107.14   59.78
32       115.72   93.45    95.42      155.21   67.39
64       179.69   135.55   151.02     170.48   82.60

TABLE VII
Average Steps in the 80 × 80 Environment.

Agents   DHC      DCC      CoRS-DHC   PRIMAL   ODrM*
4        114.69   93.89    92.14      134.86   93.40
8        133.39   109.89   109.15     153.20   104.92
16       147.55   122.24   121.25     180.74   114.75
32       158.58   132.99   137.06     250.07   121.31
64       183.44   159.67   153.06     321.63   134.42
128      213.75   192.90   193.50     350.76   143.84