
Cooperative Reward Shaping for Multi-Agent Pathfinding

Zhenyu Song, Ronghao Zheng, Senlin Zhang, Meiqin Liu

arXiv:2407.10403v1 [cs.AI] 15 Jul 2024

Abstract—The primary objective of Multi-Agent Pathfinding
(MAPF) is to plan efficient and conflict-free paths for all agents. Traditional multi-agent path planning algorithms struggle to achieve efficient distributed path planning for multiple agents. In contrast, Multi-Agent Reinforcement Learning (MARL) has been demonstrated as an effective approach to achieve this objective. By modeling the MAPF problem as a MARL problem, agents can achieve efficient path planning and collision avoidance through distributed strategies under partial observation. However, MARL strategies often lack cooperation among agents due to the absence of global information, which subsequently leads to reduced MAPF efficiency. To address this challenge, this letter introduces a unique reward shaping technique based on Independent Q-Learning (IQL). The aim of this method is to evaluate the influence of one agent on its neighbors and integrate such an interaction into the reward function, leading to active cooperation among agents. This reward shaping method facilitates cooperation among agents while operating in a distributed manner. The proposed approach has been evaluated through experiments across various scenarios with different scales and agent counts. The results are compared with those from other state-of-the-art (SOTA) planners. The evidence suggests that the approach proposed in this letter parallels other planners in numerous aspects, and outperforms them in scenarios featuring a large number of agents.

Index Terms—Multi-agent pathfinding, reinforcement learning, motion and pathfinding.

Fig. 1. Multi-agent scenarios in the real world and a grid map with numerous mobile robots. In the grid map, blue cells denote agents, yellow cells indicate their goals, and green cells signify agents that have reached their goals.

I. INTRODUCTION

MAPF is a fundamental research area in the field of multi-agent systems, aiming to find conflict-free routes for each agent. This has notable implications in various environments such as ports, airports [1], [2], and warehouses [3], [4], [5]. In these scenarios, there are typically a large number of mobile agents. These scenarios can generally be abstracted into grid maps, as illustrated in Fig. 1. The MAPF methodology is divided mainly into two classes: centralized and decentralized algorithms. Centralized algorithms [6], [7], [8] offer efficient paths by leveraging global information but fall short in scalability when handling a significant number of agents because of increased computational needs and extended planning time. In contrast, decentralized algorithms [9], [10] show better scalability in large-scale environments but struggle to ensure sufficient cooperation among agents, thereby affecting the success rate in pathfinding and overall efficiency.

MARL techniques, particularly those utilizing distributed execution, provide effective solutions to MAPF problems. By modeling MAPF as a partially observable Markov decision process (POMDP), MARL algorithms can develop policies that make decisions based on agents' local observations and inter-agent communication. Given that MARL-trained policy networks do not rely on global observations, these methods exhibit excellent scalability and flexibility in dynamic environments. MARL enhances the success rate and robustness of path planning, making it particularly suitable for large-scale multi-agent scenarios. Algorithms such as [11], [12], [13] utilize a centralized training distributed execution (CTDE) framework and foster cooperation between agents using global information during training. However, they struggle to scale to larger numbers of agents due to increasing training costs. In contrast, algorithms based on distributed training distributed execution (DTDE) frameworks [14], [15] perform well in large-scale systems. However, due to the lack of global information, individual agents tend to focus solely on maximizing their own rewards, resulting in limited cooperation across the whole system. To address this issue, recent work [16], [17], [18] has emerged that improves the performance of RL algorithms in distributed training frameworks through reward shaping. However, some of these reward shaping methods are too computationally complex, while others lack stability. This instability arises because an agent's rewards are influenced by the actions of other agents, which are typically unknown to this agent.

In this letter, a reward shaping method named Cooperative Reward Shaping (CoRS) is devised to enhance MAPF efficiency within a DTDE framework. The approach is straightforward and tailored for a limited action space. The cooperative trend of action a^i is represented by the maximum rewards that neighboring agents can achieve after agent A^i performs a^i. Specifically, when agent A^i takes action a^i, its neighbor A^j traverses its action space, determining the maximum reward that A^j can achieve given the condition of A^i taking action a^i. The shaped reward is then generated by weighting this index with the reward earned by A^i itself.

The principal contributions of our work are as follows:
1) This letter introduces a novel reward-shaping method, CoRS, designed to promote cooperation among agents in MAPF tasks within the IQL framework. This reward-shaping method is unaffected by actions from other agents, ensures easy convergence during training, and is notable for its computational simplicity.
2) It can be demonstrated that the CoRS method can alter the agent's behavior, making it more inclined to cooperate with other agents, thereby improving the overall efficiency of the system.
3) The CoRS method is challenged against present SOTA algorithms, showcasing equivalent or superior performance while eliminating the necessity for complex network structures.

The rest of this letter is organized as follows: Section II reviews the related works on MAPF. Preliminary concepts are presented in Section III. The proposed algorithm is detailed in Section IV. Section V discusses the experimental results. Finally, Section VI concludes this letter.

II. RELATED WORKS

A. MAPF Based on Reinforcement Learning

RL-based planners such as [19], [20], [21], [22] typically cast MAPF as a MARL problem to learn distributed policies for agents from partial observations. This method is particularly effective in environments populated by a large number of agents. Techniques like Imitation Learning (IL) often enhance policy learning during this process. Notably, the PRIMAL algorithm [20] utilizes the Asynchronous Advantage Actor Critic (A3C) algorithm and applies behavior cloning for supervised RL training using experiences from the centralized planner ODrM* [23]. However, the use of a centralized planner limits its efficiency, as solving the MAPF problem can be time-intensive, particularly in complex environments with a large number of agents.

In contrast, the Distributed Heuristic Coordination (DHC) [24] and Decision Causal Communication (DCC) [25] algorithms do not require a centralized planner. Although guided by an individual agent's shortest path, DHC innovatively incorporates all potential shortest-path choices into the model's heuristic input rather than obligating an agent to a specific path. Additionally, DHC collects data from neighboring agents to inform its decisions and employs multi-head attention as a convolution kernel to calculate interactions among agents. DCC is an efficient algorithm that enhances the performance of agents by enabling selective communication with neighbors during both training and execution. Specifically, a neighboring agent is deemed significant only if its presence instigates a change in the decision of the central agent. The central agent only needs to communicate with its significant neighbors.

B. Reward Shaping

The reward function significantly impacts the performance of RL algorithms. Researchers persistently focus on designing reward functions to optimize algorithm learning efficiency and agent performance. A previous study [26] analyzes the effect of modifying the reward function in Markov Decision Processes on optimal strategies, indicating that the addition of a transition reward function can boost learning efficiency. In multi-agent systems, the aim is to encourage cooperation among agents through appropriate reward shaping methods, thereby improving overall system performance. [27] probes the enhancement of cooperative agent behavior within the context of a two-player Stag Hunt game, achieved through the design of reward functions. Introducing a prosocial coefficient, the study validates through experimentation that prosocial reward shaping methods elevate performance in multi-agent systems with static network structures. Moreover, [28] promotes cooperation among agents by rewarding an agent whose actions causally influence the behavior of other agents. The evaluation of causal impacts is achieved through counterfactual reasoning, with each agent simulating alternative actions at each time step and calculating their effect on other agents' behaviors. Actions that lead to significant changes in the behavior of other agents are deemed influential and are rewarded accordingly.

Several studies, including [29], [17], [30], utilize the Shapley value decomposition method to calculate or redistribute each agent's cooperative benefits. [30] confirms that if a transferable utility game is a convex game, the MARL reward redistribution, based on Shapley values, falls within the core, thereby securing stable and effective cooperation; consequently, agents should maintain their partnerships or collaborative groups. This concept is the basis for a proposed cooperative strategy learning algorithm rooted in Shapley value reward redistribution. The effectiveness of this reward shaping method in promoting cooperation among agents, specifically within a basic autonomous driving scenario, is demonstrated in the paper. However, the process of calculating the Shapley value can be intricate and laborious. Ref. [29] aims to alleviate these computational challenges by introducing an approximation of marginal contributions and employing Monte Carlo sampling to estimate Shapley values. Coordinated Policy Optimization (CoPO) [18] puts forth the concept of a "cooperation coefficient", which shapes the reward by taking a weighted average of an agent's individual rewards and the average rewards of its neighboring agents, based on the cooperation coefficient. This approach proves that rewards shaped in this manner fulfill the Individual Global Max (IGM) condition. Findings from traffic simulation experiments further suggest that this method of reward shaping can significantly enhance the overall performance and safety of the system. However, this approach ties an agent's rewards not merely to its personal actions but also to those of its neighbors. Such dependencies might compromise the stability of the training process and the efficiency of the converged strategy.

III. PRELIMINARY

A. Cooperative Multi-Agent Reinforcement Learning

Consider a Markov process involving n agents {A^1, ..., A^n} := A, represented by the tuple (S, A, O, R, P, γ). A^i represents agent i. At each time step t, A^i chooses an action a_t^i from its action space
A^i based on its state s_t^i ∈ S^i and observation o_t^i ∈ O according to its policy π^i. All a_t^i form a joint action ā_t = {a_t^1, ..., a_t^n} ∈ (A^1 × ··· × A^n) := A, and all s_t^i form a joint state s̄_t = {s_t^1, ..., s_t^n} ∈ (S^1 × ··· × S^n) := S. For convenience in further discussions, an agent's local observation o_t^i is treated as a part of the agent's state s_t^i. Whenever a joint action ā_t is taken, the agents acquire a reward r̄ = {r_t^1, r_t^2, ..., r_t^n} ∈ R, which is determined by the local reward function r_t^i(s̄_t, ā_t): S × A → R with respect to the joint state and action. The state transition function P(s̄_{t+1} | s̄_t, ā_t): S × S × A → [0, 1] characterizes the probability of transition from the current state s̄_t to s̄_{t+1} under action ā_t. The policy π^i(a_t^i | s_t^i) provides the probability of A^i taking action a_t^i in the state s_t^i, and π̄ represents the joint policy of all agents. The action-value function is given by Q_π^i(s_t^i, a_t^i) = E_{τ^i∼π^i}[Σ_{t=0}^{T} γ^t r_t^i], where the trajectory τ^i = [s_{t+1}^i, a_{t+1}^i, s_{t+2}^i, ...] represents the path taken by the agent A^i. The state-value function is given by V_π^i(s) = Σ_{a∈A^i} π(a | s_t^i) Q_π^i(s_t^i, a), and the discounted cumulative reward is J^i = E_{s_0∼ρ_0} V_π^i(s_0), where ρ_0 represents the initial state distribution. For cooperative MARL tasks, the objective is to maximize the total cumulative reward J^tot = Σ_{i=1}^{n} J^i over all agents.

B. Multi-agent Pathfinding Environment Setup

This letter adopts the same definition of the multi-agent pathfinding problem as presented in [24], [25]. Consider n agents in a w × h undirected grid graph G(V, E) with m obstacles {B^1, ..., B^m}, where V = {v(i, j) | 1 ≤ i ≤ w, 1 ≤ j ≤ h} is the set of vertices in the graph, and all agents and obstacles are located within V. All vertices v(i, j) ∈ V follow the 4-neighborhood rule, that is, [v(i, j), v(p, q)] ∈ E for all v(p, q) ∈ {v(i, j ± 1), v(i ± 1, j)} ∩ V. Each agent A^i has its unique starting vertex s^i and goal vertex g^i, and its position at time t is x_t^i ∈ V. The position of the obstacle B^k is represented as b^k ∈ V. At each time step, each agent can execute an action chosen from its action space A^i = {"Up", "Down", "Left", "Right", "Stop"}. During the execution process, two types of conflict can arise: vertex conflict (x_t^i = x_t^j or x_t^i = b^k) and edge conflict ([x_{t−1}^i, x_t^i] = [x_t^j, x_{t−1}^j]). If two agents conflict with each other, their positions remain unchanged. The subscript t for all the aforementioned variables can be omitted as long as it does not cause ambiguity. The goal of the MAPF problem is to find a set of non-conflicting paths P̄ = {P^1, P^2, ..., P^n} for all agents, where the agent's path P^i = [s^i, x_1^i, ..., x_t^i, ..., g^i] is an ordered list of A^i's positions. Incorporating the setup of multi-agent reinforcement learning, we design a reward function for the MAPF task, as detailed in Table I. The design of the reward function basically follows [24], [25], with slight adjustments to increase the reward for the agent moving towards the target to better align with our reward shaping method.

TABLE I
REWARD FUNCTION

Action                                   Reward
Move (towards goal, away from goal)      -0.070, -0.075
Stay (on goal, off goal)                 0, -0.075
Collision (obstacle/agents)              -0.5
Finish                                   3
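For concreteness, the reward rules in Table I can be written as a small per-step function. The sketch below is only one reading of the table; the helper names and the treatment of the "Finish" case (a one-time bonus when the goal is first reached) are our own assumptions, not the reference implementation of [24], [25].

```python
def manhattan(p, q):
    """Manhattan distance between two grid cells (row, col)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def step_reward(pos, new_pos, goal, collided):
    """Reward for one agent after one step; `collided` marks a vertex or edge
    conflict with an obstacle or another agent (positions then stay put)."""
    if collided:
        return -0.5                                  # Collision (obstacle/agents)
    if new_pos == goal:
        return 0.0 if pos == goal else 3.0           # Stay on goal / Finish
    if new_pos == pos:
        return -0.075                                # Stay off goal
    closer = manhattan(new_pos, goal) < manhattan(pos, goal)
    return -0.070 if closer else -0.075              # Move towards / away from goal
```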
IV. COOPERATIVE REWARD SHAPING

Many algorithms employing MARL techniques to address the MAPF problem utilize IQL or other decentralized training and execution frameworks to ensure good scalability. For example, [24], [25] are developed based on IQL. Although IQL can be applied to scenarios with a large number of agents, it often performs poorly in tasks that require a high degree of cooperation among agents, such as the MAPF task. This poor performance arises because, within the IQL framework, each agent greedily maximizes its own cumulative reward, leading agents to behave in an egocentric and aggressive manner, thus reducing the overall efficiency of the system. To counteract this, this letter introduces a reward shaping method named Cooperative Reward Shaping (CoRS) and combines CoRS with the DHC algorithm. The framework combining CoRS with DHC is shown in Fig. 2. The aim of CoRS is to enhance performance within MAPF problem scenarios. The employment of reward shaping intends to stimulate collaboration among agents, effectively curtailing the performance decline in the multi-agent system caused by selfish behaviors within a distributed framework.

Fig. 2. The combined framework of the CoRS and DHC algorithms. The CoRS component predominantly shapes rewards. The DHC component comprises communication blocks and dueling Q networks. Notably, the communication block employs a multi-head attention mechanism. The framework utilizes parallel training as an efficient solution. This process involves the simultaneous generation of multiple actors to produce experiential data and upload it to the global buffer. The learner then retrieves this data from the global buffer for training purposes, thereby allowing for frequent updates to the Actor's network.
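The parallel training pattern described in the Fig. 2 caption can be sketched as follows. This is only a schematic of the actor / global buffer / learner loop; the environment and network interfaces (env.reset, env.step, policy.act, policy.update) are assumed placeholders and not the DHC implementation.

```python
import random
from collections import deque

# Shared experience buffer that parallel actors write to and the learner reads from.
global_buffer = deque(maxlen=100_000)

def actor(env, policy, episodes):
    """Each parallel actor rolls out the current policy and uploads experience."""
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            actions = {i: policy.act(o) for i, o in obs.items()}
            next_obs, rewards, done, _ = env.step(actions)
            global_buffer.append((obs, actions, rewards, next_obs, done))
            obs = next_obs

def learner(policy, batch_size=192, steps=500_000):
    """The learner samples batches from the shared buffer and updates the
    dueling Q-network (reduced here to a single policy.update call)."""
    for _ in range(steps):
        if len(global_buffer) >= batch_size:
            batch = random.sample(list(global_buffer), batch_size)
            policy.update(batch)
```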

A. Design of the Reward Shaping Method

In the MAPF task, the policies trained using IQL often result in scenarios where one agent blocks the path of other agents or collides with them. To enhance cooperation among agents within the IQL framework, a feasible approach is reward shaping. Reward shaping involves meticulously designing the agents' reward functions to influence their behavior. For example, when a certain type of behavior needs to be encouraged, a higher reward is typically assigned to that behavior. Thus, to foster cooperation among agents in the MAPF problem, it is necessary to design a metric that accurately evaluates how cooperative an agent's behavior is and to incorporate this metric into the agents' rewards. Consequently, as each agent maximizes its own reward, it will consider the impact of its action on other agents, thereby promoting cooperation among agents and improving overall system efficiency.

Let I_c^i(s̄_t, ā_t) be a metric that measures the cooperativeness of the action a_t^i of agent A^i. To better regulate the behavior of the agent, [27] introduces a cooperation coefficient α and shapes the agent's reward function in the following form:

r̃_t^i = (1 − α) r_t^i + α I_c^i(s̄_t, ā_t),   (1)

where α describes the cooperativeness of the agent. When α = 0, the agent completely disregards the impact of its actions on other agents, acting entirely selfishly, and when α = 1.0, the agent behaves with complete altruism. For agent A^i, an intuitive approach to measure the cooperativeness of A^i's behavior is to use the average reward of all agents except A^i:

I_c^i(s̄_t, ā_t) = (1/|A^{−i}|) Σ_{j∈A^{−i}} r^j(s̄_t, ā_t),   (2)

where A^{−i} denotes the set of all agents except A^i, and |A^{−i}| represents the number of agents in A^{−i}. This reward shaping method is equivalent to the neighborhood reward proposed in [18] when d_n, the neighborhood radius of the agent, approaches infinity. The physical significance of Eq. (2) is as follows: if the average reward of the agents other than A^i is relatively high, it indicates that A^i's actions have not harmed the interests of other agents. Hence, A^i's behavior can be considered cooperative. Conversely, if the average reward of the other agents is low, it suggests that A^i's behavior exhibits poor cooperation.

However, I_c^i(s̄_t, ā_t) in Eq. (2) is unstable, as it is not only related to a_t^i but is also strongly correlated with the actions a^{−i} = {a^j | A^j ∈ A, j ≠ i} of the other agents. Appendix B-A provides specific examples to illustrate this instability. Within the IQL framework, this instability in the reward function makes learning of Q-values challenging and can even prevent convergence. To address this issue, this letter proposes a new metric to assess the cooperativeness of agent behavior. The specific form of this metric is as follows:

I_c^i(s̄_t, ā_t) = max_{a^{−i}} (1/|A^{−i}|) Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a^{−i}}).   (3)

Here ā_t = {a_t^i, a_t^{−i}}. The use of the max operator in Eq. (3) eliminates the influence of a_t^{−i} on I_c^i while reflecting the impact of a_t^i on the other agents. The term max_{a^{−i}} Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a^{−i}}) represents the maximum reward that all the agents except A^i can achieve under the condition of s̄_t and a_t^i, whereas the actual value of Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a_t^{−i}}) is determined by a_t^{−i} when a_t^i is given. Accordingly, it holds true under any circumstances that Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a_t^{−i}}) ≤ max_{a^{−i}} Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a^{−i}}). The complete reward shaping method is then as follows:

r̃_t^i(s̄_t, a_t^i) = (1 − α) r_t^i + α max_{a^{−i}} (1/|A^{−i}|) Σ_{j∈A^{−i}} r^j(s̄_t, {a_t^i, a^{−i}}).   (4)

Fig. 3. An example of the reward shaping method.

Fig. 3 gives a one-step example to illustrate how the reward-shaping approach of Eq. (4) is calculated and promotes inter-agent cooperation when α = 1/2. There are two agents, A^1 and A^2, along with some obstacles. Without considering A^2, A^1 has two optimal actions: "Up" and "Right". Fig. 3 also illustrates the potential rewards that A^2 may receive when A^1 takes different actions. Table II presents the specific reward values for each agent in this example. According to the reward shaping method proposed in Eq. (4), choosing the "Right" action yields a higher one-step reward, and in such a case A^2 should also take the action "Right". Appendix B-B provides another analysis of this example from the vantage point of the cumulative rewards of the agent, further elucidating how the reward shaping method of Eq. (4) facilitates cooperation among agents.

TABLE II
REWARDS FOR AGENTS UNDER DIFFERENT ACTIONS

a^1    r^2 for a^2 = ↑     ↓      ←       →       wait     max r^2    r̃^1
→             -0.3   -0.3   -0.075  -0.070  -0.075   -0.070     -0.140
↑             -0.3   -0.3   -0.075  -0.3    -0.075   -0.075     -0.145
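As a concrete illustration of Eq. (4), the shaped reward can be computed by enumerating the joint actions of the other agents. The sketch below is a naive reference computation under an assumed interface: reward_fn(j, s, joint_action), standing in for r^j(s̄_t, ā_t), is a hypothetical helper rather than code from this letter.

```python
from itertools import product

def cors_reward(r_i, i, s, a_i, others, action_spaces, reward_fn, alpha=0.5):
    """Shaped reward of Eq. (4): (1 - alpha) * r_i plus alpha times the best
    average reward the other agents could obtain, given that agent i plays a_i.

    `r_i` is agent i's own realized one-step reward; `reward_fn(j, s, joint)`
    returns agent j's reward under a full joint action (a dict agent -> action).
    """
    if not others:
        return r_i
    best_avg = float("-inf")
    for combo in product(*(action_spaces[j] for j in others)):   # enumerate a^{-i}
        joint = dict(zip(others, combo))
        joint[i] = a_i
        avg = sum(reward_fn(j, s, joint) for j in others) / len(others)
        best_avg = max(best_avg, avg)
    return (1 - alpha) * r_i + alpha * best_avg
```

Because the loop enumerates the full joint action space of the other agents (5^{n−1} combinations for the MAPF action set), this direct form is practical only for very small n, which is precisely what motivates the approximation in Section IV-C.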
B. Analysis of Reward Shaping

This section provides a detailed analysis of how the reward shaping method of Eq. (4) influences agent behavior. For ease of discussion, the subsequent analysis will be conducted from the perspective of A^i.

In the MAPF problem, the actions of agents are tightly coupled. The action of A^i may impact A^j, and subsequently, the action of A^j may also affect A^k. Thus, the action of A^i indirectly affects A^k. This coupling makes it challenging to analyze the interactions among multiple agents. To mitigate this coupling, we consider all agents in A^{−i} as a single virtual agent Ã^{−i}. The action of Ã^{−i} is a^{−i} and its reward r^{−i} is the average reward (1/|A^{−i}|) Σ_{j∈A^{−i}} r^j. The use of the average here ensures that A^i and Ã^{−i} are placed on an equal footing. The virtual agent Ã^{−i} must satisfy the condition that no collisions occur between the agents constituting Ã^{−i}. It is important to note that the interaction between A^i and Ã^{−i} is not entirely equivalent to the interaction between A^i and the agents in A^{−i}. This is because when considering the agents in A^{−i} as Ã^{−i}, all robots within Ã^{−i} fully cooperate. In contrast, treating these agents as independent individuals makes it difficult to ensure full cooperation. Nonetheless, Ã^{−i} can still represent the ideal behavior of the agents in A^{−i}. Therefore, analyzing the interaction between A^i and Ã^{−i} can still illustrate the impact of reward shaping on agent interactions. Consider the interaction between A^i and Ã^{−i} in the time period [t_s, t_e].

Q_π^i(s̄, a^i) = E_τ[ Σ_{t=t_s}^{t_e} γ^{t−t_s} r̃^i(s̄_t, a_t^i) | s̄_{t_s} = s̄, a_{t_s}^i = a^i ] and Q_π^{−i}(s̄, a^{−i}) = E_τ[ Σ_{t=t_s}^{t_e} γ^{t−t_s} r̃^{−i}(s̄_t, a_t^{−i}) | s̄_{t_s} = s̄, a_{t_s}^{−i} = a^{−i} ] are the cumulative rewards of A^i and Ã^{−i}. Q^tot(s̄, ā) = E_τ[ Σ_{t=t_s}^{t_e} γ^{t−t_s} (r^i + r^{−i}) | s̄_{t_s} = s̄, ā_{t_s} = ā ] is the cumulative reward for both A^i and Ã^{−i}. The optimal policies are π_*^i = argmax_π Q_π^i(s̄, a^i), π_*^{−i} = argmax_π Q_π^{−i}(s̄, a^{−i}), and π̄_* = argmax_π Q_π^tot(s̄, ā). A^i and Ã^{−i} will select their actions according to a_t^i = argmax_{a^i} Q_π^i(s̄_t, a^i) and a_t^{−i} = argmax_{a^{−i}} Q_π^{−i}(s̄_t, a^{−i}). We hope that a_t^i and a_t^{−i} will collectively maximize Q^tot(s̄, ā) = Q^tot(s̄, {a_t^i, a_t^{−i}}). Specifically,

argmax_ā Q_{π̄_*}^tot(s̄, ā) = { argmax_{a^i} Q_{π_*^i}^i(s̄, a^i), argmax_{a^{−i}} Q_{π_*^{−i}}^{−i}(s̄, a^{−i}) }.

That is, Q_{π_*^i}^i, Q_{π_*^{−i}}^{−i}, and Q_{π̄_*}^tot(s̄, ā) satisfy the Individual-Global-Max (IGM) condition. The definition of the IGM condition can be found in [31]. We introduce the following two assumptions to facilitate the analysis.

Assumption 1. The reward for an agent staying at its target point is 0, while the rewards for both movement and staying at non-target points are r_m < 0, and the collision reward r_c < r_m.

Assumption 2. During the interaction process between A^i and Ã^{−i}, neither A^i nor Ã^{−i} has ever reached its respective endpoint.

Here, Ã^{−i} not reaching its destination means that ∀j ∈ A^{−i}, A^j has not reached its destination. Based on the assumptions, it can be proven that:

Theorem 1. Assume Assumps. 1 and 2 hold. Then when α = 1/2, Q_{π_*^i}^i(s̄_t, a^i), Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}) and Q_{π̄_*}^tot(s̄_t, {a^i, a^{−i}}) satisfy the IGM condition.

Proof. The proof is provided in Appendix A.

Theorem 1 demonstrates that the optimal policy, trained using the reward function given by Eq. (4), maximizes the overall reward of A^i and the virtual agent Ã^{−i}, rather than selfishly maximizing A^i's own cumulative reward. This implies that A^i will actively cooperate with other agents, thus improving the overall efficiency of the system. It should be noted that the impact of α on the agent's behavior is complex. The choice of α = 1/2 here is a specific result derived from using the virtual agent Ã^{−i}. Although this illustrates how the reward shaping method in Eq. (4) induces cooperative behavior in agents, it does not imply that 1/2 is the optimal value of α in all scenarios.

C. Approximation

Theorem 1 illustrates that the reward shaping method of Eq. (4) can promote cooperation among agents in MAPF tasks. However, calculating Eq. (3) requires traversing the joint action space of A^{−i}. For the MAPF problem, where the action space size of each agent is 5, the joint action space for n agents contains up to 5^n states, significantly reducing the computational efficiency of the reward shaping method. Therefore, we must simplify the calculation of Eq. (3). Before approximating Eq. (3), we first define the neighbors of an agent in the grid map as follows: A^j is considered a neighbor of A^i if the Manhattan distance between them is no more than d_n, where d_n is the neighborhood radius of the agent. If the Manhattan distance between two agents is no more than 2, a conflict may arise between them in a single step. Therefore, we choose d_n = 2, which maximally simplifies interactions between agents while adequately considering potential collisions. Let N_i denote the set of neighboring agents of A^i, and |N_i| represent the number of neighbors of A^i. Fig. 4 illustrates the neighbors of A^i when d_n = 2.

Fig. 4. Neighbors of A^i when d_n = 2.

Considering that in MAPF tasks the direct interactions among agents are constrained by the distances between them, the interactions between any given agent and all other agents can be simplified to the interactions between the agent and its neighbors. That is:

I_c^i(s̄_t, ā_t) ≈ max_{a^{N_i}} (1/|N_i|) Σ_{j∈N_i} r^j(s̄_t, {a_t^i, a^{N_i}}),
where a^{N_i} represents the joint actions of all neighbors of A^i. Next, we approximate max_{a^{N_i}} (1/|N_i|) Σ_{j∈N_i} r^j(s̄_t, {a_t^i, a^{N_i}}) by (1/|N_i|) Σ_{j∈N_i} max_{a^j} r^j(s̄_t, {a_t^i, a^j}). This approximation implies that we consider the interaction between agent A^i and one of its neighbors A^j at a time. The computational complexity of the approximate I_c is much lower than the original complexity.

To approximate s̄_t, we posit that when A^i can observe its two-hop neighbors, its observation is considered sufficient, and its state s_t^i can approximate s̄_t to a certain extent, i.e., r̃^i(s̄_t, a_t^i) = r̃^i(s_t^i, a_t^i). When d_n = 2, the minimum observation range includes all grids within a Manhattan distance of no more than 2·d_n = 4 from the agent. For convenience, a 9 × 9 square is chosen as the agent's field of view (FOV), consistent with the settings in [24] and covering the agent's minimum observation range. Consequently, the final reward shaping method is given by:

r̃^i(s^i, a^i) = (1 − α) r_t^i + α (1/|N_i|) Σ_{j∈N_i} max_{a^j∈A^j} r^j(s_t^i, {a_t^i, a^j}).   (5)

This reward shaping method possesses the following characteristics:
1) r̃^i for A^i is explicitly dependent on a^i and is explicitly independent of a^j for j ≠ i. This property improves the stability of the agent's reward function and facilitates the convergence of the network, as shown in Fig. 5.
2) This method of reward shaping is applicable to MAPF scenarios where the relationship among agents' neighbors is time-varying.
3) This method, which considers the interaction between just two agents at a time, dramatically reduces the computational difficulty of the reward.

Fig. 5. Training losses of two different reward shaping methods in a 10 × 10 map with 10 agents. Eq. (5) significantly reduces the training loss compared to the method using Eq. (2).
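Under the two approximations above, the shaped reward of Eq. (5) requires only one maximization per neighbor. A minimal sketch of this computation follows; pair_reward is an assumed helper that evaluates a neighbor's one-step reward from agent i's 9 × 9 FOV and is not part of the released DHC/CoRS code.

```python
def cors_reward_pairwise(r_i, obs_i, a_i, neighbors, action_spaces,
                         pair_reward, alpha):
    """Approximate shaped reward of Eq. (5): each neighbor j in N_i is treated
    independently, and only the pairwise interaction (i, j) is maximized.

    `pair_reward(j, obs_i, a_i, a_j)` returns the reward neighbor j would get
    if agent i plays a_i and j plays a_j, judged from agent i's local FOV.
    """
    if not neighbors:            # no neighbor within d_n = 2: reward unchanged
        return r_i
    coop = sum(max(pair_reward(j, obs_i, a_i, a_j) for a_j in action_spaces[j])
               for j in neighbors) / len(neighbors)
    return (1 - alpha) * r_i + alpha * coop
```

Per agent, this costs |N_i| × 5 reward evaluations instead of the 5^{n−1} joint actions required by Eq. (3).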
D. Adjustment of the Cooperation Coefficient

For Eq. (5), it is necessary to adjust α so that the agent can balance its own interests with the interests of other agents. To find an appropriate cooperation coefficient α, it is advisable to examine how the policies trained under different α perform in practice and then employ a gradient descent algorithm based on the performance of the policy to optimize α. The cumulative reward J of the agent often serves as a performance metric in RL. However, in environments with discrete action and state spaces, the cumulative reward J is non-differentiable with respect to α, presenting a significant challenge for adjusting α. Some works employ zero-order optimization for stochastic gradient estimation to handle non-differentiable optimization problems, that is, estimating the gradient by finite differences of function values. We also use differences in the cumulative reward of the agent to estimate the gradient with respect to the cooperation coefficient α:

∂Ĵ(α)/∂α = (J(α + u) − J(α)) / u,   u ∈ [−ϵ, ϵ],

where ϵ represents the maximum step size of the differential. If the update step length is too small, the updates are halted. To expedite the training process, a method of fine-tuning the network is employed. That is, after each adjustment of α, the network is not trained from the initial state. Instead, fine-tuning training is performed based on the optimal policy network obtained previously. This training method improves the speed of the training process.
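The resulting procedure alternates fine-tuning with a finite-difference update of α. The sketch below only illustrates that loop under assumed interfaces (finetune, evaluate) and illustrative step sizes; it is not the letter's training code.

```python
import random

def adjust_alpha(alpha, prev_policy, finetune, evaluate,
                 eps=0.05, lr=0.1, min_step=1e-3):
    """One round of the zero-order update of the cooperation coefficient.

    `finetune(policy, alpha)` fine-tunes the previously obtained network under
    a new alpha; `evaluate(policy)` returns its cumulative reward J.
    """
    policy = finetune(prev_policy, alpha)
    j_alpha = evaluate(policy)
    u = random.uniform(1e-3, eps) * random.choice((-1, 1))   # perturbation in [-eps, eps]
    j_perturbed = evaluate(finetune(policy, alpha + u))
    grad_est = (j_perturbed - j_alpha) / u                   # finite-difference gradient
    step = lr * grad_est
    if abs(step) < min_step:                                 # halt when updates become tiny
        return alpha, policy, True
    return alpha + step, policy, False
```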
V. EXPERIMENTS

We conducted our experiments in the standard MAPF environment, where each agent has a 9 × 9 FOV and can communicate with up to two nearest neighbors. Following the curriculum learning method [32] used by DHC, we gradually introduced more challenging tasks to the agents. Training began with a simple task that involved a single agent in a 10 × 10 environment. Upon achieving a success rate above 0.9, we either added an agent or increased the environment size by 5 to establish two more complex tasks. The model was ultimately trained to handle 10 agents in a 40 × 40 environment. The maximum number of steps per episode was set to 256. Training was carried out with a batch size of 192, a sequence length of 20, and a dynamic learning rate starting at 10^{-4}, which was halved at 100,000 and 300,000 steps, with a maximum of 500,000 training steps. During fine-tuning, the learning rate was maintained at 10^{-5}. Distributed training was used to improve efficiency, with 16 independent environments running in parallel to generate agent experiences, which were uploaded to a global buffer. The learner then retrieved these data from the buffer and trained the agent's strategy on a GPU. CoRS-DHC adopted the same network structure as DHC. All training and testing were performed on an Intel® i5-13600KF and an Nvidia® RTX2060 6G.

A. Impact of Reward Shaping

Following several rounds of network fine-tuning and updates to the cooperation coefficient α, a value of α = 0.1675 and a strategy π are obtained. Upon obtaining the final α and π, the CoRS-DHC-trained policy is compared with the policy trained by the original DHC algorithm to assess the effect of the reward shaping method on performance improvement. For a fair comparison, the DHC and CoRS-DHC algorithms are tested on maps of different scales (40 × 40 and 80 × 80) with varying agent counts {4, 8, 16, 32, 64}. Recognizing the larger environmental spatial capacity of the 80 × 80 map, a scenario with 128 agents is also introduced for additional insights. Each experimental scenario includes 200 individual test cases, maintaining a consistent obstacle density of 0.3. The maximum time steps for the 40 × 40 and 80 × 80 maps are 256 and 386, respectively.

Fig. 6. The enhancement of DHC through CoRS. This figure demonstrates the success rate of CoRS-DHC and DHC under different testing scenarios. The table provides the average steps required to complete tasks in different testing environments.

TABLE III
AVERAGE STEPS WITH AND WITHOUT CoRS

          Map Size 40 × 40          Map Size 80 × 80
Agents    DHC       CoRS-DHC        DHC       CoRS-DHC
4         64.15     50.36           114.69    92.14
8         77.67     64.77           133.39    109.15
16        86.87     68.48           147.55    121.25
32        115.72    95.42           158.58    137.06
64        179.69    151.02          183.44    153.06
128       -         -               213.75    193.50
From the experimental results illustrated in Fig. 6, it is evident that our proposed CoRS-DHC significantly outperforms the existing DHC algorithm in all test sets. In high agent density scenarios, such as 40 × 40 with 64 agents and 80 × 80 with 128 agents, our CoRS-DHC improves the pathfinding success rate by more than 20% compared to DHC. Furthermore, as shown in Table III, CoRS effectively reduces the number of steps required for all agents to reach their target spot in various test scenarios. Although the DHC algorithm incorporates heuristic functions and inter-agent communication, it still struggles with cooperative issues among agents. In contrast, our CoRS-DHC promotes inter-agent collaboration through reward shaping, thereby substantially enhancing the performance of the DHC algorithm in high-density scenarios. Notably, this significant improvement was achieved without any changes to the algorithm's structure or network scale, underscoring the effectiveness of our approach.
B. Success Rate and Average Step

Additionally, the policy trained through CoRS-DHC is compared with other advanced MAPF algorithms. The current SOTA algorithm DCC [25], which is also based on reinforcement learning, is selected as the main comparison object, with the centralized MAPF algorithm ODrM* [6] and PRIMAL [20] (based on RL and IL) serving as references. DCC is an efficient model designed to enhance agent performance by training agents to selectively communicate with their neighbors during both the training and execution stages. It introduces a complex decision causal unit to each agent, which determines the appropriate neighbors for communication during these stages. Conversely, the PRIMAL algorithm achieves distributed MAPF by imitating ODrM* and incorporating reinforcement learning algorithms. ODrM* is a centralized algorithm designed to generate optimal paths for multiple agents. It is one of the best centralized MAPF algorithms currently available. We use it as a comparative baseline to show the differences between distributed and centralized algorithms. The experimental results are demonstrated in Fig. 7.

Fig. 7. Success rate and average steps across different testing scenarios.

The experimental results indicate that our CoRS-DHC algorithm consistently exceeds the success rate of the DCC algorithm in the majority of scenarios. Additionally, aside from the 40 × 40 grid with 64 agents, the makespan of the policies trained by the CoRS-DHC algorithm is comparable to or even shorter than that of the DCC algorithm across the other scenarios. These results clearly demonstrate that our CoRS-DHC algorithm achieves a performance comparable to that of DCC. However, it should be noted that DCC employs a significantly more complex communication mechanism during both training and execution, while our CoRS algorithm only utilizes simple reward shaping during the training phase. Compared to PRIMAL and DHC, CoRS-DHC exhibits a remarkably superior performance.
VI. CONCLUSION

This letter proposes a reward shaping method termed CoRS, applicable to standard MAPF tasks. By promoting cooperative behavior among multiple agents, CoRS significantly improves the efficiency of MAPF. The experimental results indicate that CoRS significantly enhances the performance of MARL algorithms in solving the MAPF problem. CoRS also has implications for other multi-agent reinforcement learning tasks. We plan to further extend the application of this reward-shaping strategy to a wider range of MARL environments in future exploration.

REFERENCES

[1] Fu Chen. Aircraft taxiing route planning based on multi-agent system. In 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pages 1421–1425, 2016.
[2] Jiankun Wang and Max Q.-H. Meng. Real-time decision making and path planning for robotic autonomous luggage trolley collection at airports. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 52(4):2174–2183, 2021.
[3] Oren Salzman and Roni Stern. Research challenges and opportunities in multi-agent path finding and multi-agent pickup and delivery problems. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 1711–1715, 2020.
[4] Altan Yalcin. Multi-Agent Route Planning in Grid-Based Storage Systems. Doctoral thesis, Europa-Universität Viadrina Frankfurt, 2018.
[5] Kevin Nagorny, Armando Walter Colombo, and Uwe Schmidtmann. A service- and multi-agent-oriented manufacturing automation architecture: An IEC 62264 level 2 compliant implementation. Computers in Industry, 63(8):813–823, 2012.
[6] Cornelia Ferner, Glenn Wagner, and Howie Choset. ODrM*: Optimal multirobot path planning in low dimensional search spaces. In 2013 IEEE International Conference on Robotics and Automation, pages 3854–3859. IEEE, 2013.
[7] Teng Guo and Jingjin Yu. Sub-1.5 time-optimal multi-robot path planning on grids in polynomial time. arXiv preprint arXiv:2201.08976, 2022.
[8] Han Zhang, Jiaoyang Li, Pavel Surynek, T. K. Satish Kumar, and Sven Koenig. Multi-agent path finding with mutex propagation. Artificial Intelligence, 311:103766, 2022.
[9] Glenn Wagner and Howie Choset. Subdimensional expansion for multirobot path planning. Artificial Intelligence, 219:1–24, 2015.
[10] Na Fan, Nan Bao, Jiakuo Zuo, and Xixia Sun. Decentralized multi-robot collision avoidance algorithm based on RSSI. In 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP), pages 1–5. IEEE, 2021.
[11] Chunyi Peng, Minkyong Kim, Zhe Zhang, and Hui Lei. VDN: Virtual machine image distribution network for cloud data centers. In 2012 Proceedings IEEE INFOCOM, pages 181–189. IEEE, 2012.
[12] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020.
[13] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[14] Wesley Suttle, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Başar, and Ji Liu. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. IFAC-PapersOnLine, 53(2):1549–1554, 2020. 21st IFAC World Congress.
[15] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
[16] Sirui Chen, Zhaowei Zhang, Yaodong Yang, and Yali Du. STAS: Spatial-temporal return decomposition for multi-agent reinforcement learning. arXiv preprint arXiv:2304.07520, 2023.
[17] Jiahui Li, Kun Kuang, Baoxiang Wang, Furui Liu, Long Chen, Fei Wu, and Jun Xiao. Shapley counterfactual credits for multi-agent reinforcement learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 934–942, 2021.
[18] Zhenghao Peng, Quanyi Li, Ka Ming Hui, Chunxiao Liu, and Bolei Zhou. Learning to simulate self-driven particles system with coordinated policy optimization. Advances in Neural Information Processing Systems, 34:10784–10797, 2021.
[19] Binyu Wang, Zhe Liu, Qingbiao Li, and Amanda Prorok. Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robotics and Automation Letters, 5(4):6932–6939, 2020.
[20] Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, T. K. Satish Kumar, Sven Koenig, and Howie Choset. PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019.
[21] Zuxin Liu, Baiming Chen, Hongyi Zhou, Guru Koushik, Martial Hebert, and Ding Zhao. MAPPER: Multi-agent path planning with evolutionary reinforcement learning in mixed dynamic environments. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11748–11754. IEEE, 2020.
[22] Qingbiao Li, Fernando Gama, Alejandro Ribeiro, and Amanda Prorok. Graph neural networks for decentralized multi-robot path planning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11785–11792. IEEE, 2020.
[23] Cornelia Ferner, Glenn Wagner, and Howie Choset. ODrM*: Optimal multirobot path planning in low dimensional search spaces. In 2013 IEEE International Conference on Robotics and Automation, pages 3854–3859, 2013.
[24] Ziyuan Ma, Yudong Luo, and Hang Ma. Distributed heuristic multi-agent path finding with communication. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 8699–8705. IEEE, 2021.
[25] Ziyuan Ma, Yudong Luo, and Jia Pan. Learning selective communication for multi-agent path finding. IEEE Robotics and Automation Letters, 7(2):1455–1462, 2022.
[26] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
[27] Alexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865, 2017.
[28] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pages 3040–3049. PMLR, 2019.
[29] Sirui Chen, Zhaowei Zhang, Yaodong Yang, and Yali Du. STAS: Spatial-temporal return decomposition for solving sparse rewards problems in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17337–17345, 2024.
[30] Songyang Han, He Wang, Sanbao Su, Yuanyuan Shi, and Fei Miao. Stable and efficient Shapley value-based reward reallocation for multi-agent reinforcement learning of autonomous vehicles. In 2022 International Conference on Robotics and Automation (ICRA), pages 8765–8771. IEEE, 2022.
[31] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5887–5896. PMLR, 2019.
[32] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
APPENDIX A
PROOFS

In this section, the proof of Theorem 1 is provided. For clarity, we hereby reiterate the assumptions involved in the proof process.

Assumption 1. The reward for an agent staying at its target point is 0, while the rewards for both movement and staying at non-target points are r_m < 0. The collision reward r_c < r_m.

Assumption 2. During the interaction process between A^i and Ã^{−i}, neither A^i nor Ã^{−i} has ever reached its respective endpoint.

Theorem 1. Assume Assumps. 1 and 2 hold. Then when α = 1/2, Q_{π_*^i}^i(s̄_t, a^i), Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}) and Q_{π̄_*}^tot(s̄_t, {a^i, a^{−i}}) satisfy the IGM condition.

Proof. Theorem 1 is equivalent to the following statement: for a_*^i = argmax_{a^i} Q_{π_*^i}^i(s̄_t, a^i), there exists a_*^{−i} = argmax_{a^{−i}} Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}) such that Q_{π̄_*}^tot(s̄_t, {a_*^i, a_*^{−i}}) = max_ā Q_{π̄_*}^tot(s̄_t, ā).

For the sake of simplicity in the discussion, we denote r^i(s̄_t, ā_t) and r^{−i}(s̄_t, ā_t) as r_t^i and r_t^{−i}, and E_τ[Σ_{t=0}^{N} γ^t r_t^j] as Σ_τ γ^t r^j. When α = 1/2, r̃^i = (1/2) r^i + (1/2) max_{a^{−i}} r^{−i} and r̃^{−i} = (1/2) r^{−i} + (1/2) max_{a^i} r^i.

First, under Assumps. 1 and 2, if no collisions occur along the trajectory τ, then for all s̄_t and ā_t ∈ τ, we have r^i = max_{a^i} r^i = r_m and r^{−i} = max_{a^{−i}} r^{−i} = r_m. This holds because if A^i and Ã^{−i} do not reach their goals, then r^i < 0 and r^{−i} < 0. Meanwhile, if no collisions occur along τ, then r^i > r_c and r^{−i} > r_c. Thus, it follows that for all s̄_t and ā_t ∈ τ, r^i = max_{a^i} r^i = r_m and r^{−i} = max_{a^{−i}} r^{−i} = r_m.

Second, ∀a^i ∈ A^i and ∀{a^i, a^{−i}} ∈ A, Q_{π_*^i}^i(s̄_t, a^i) ≥ (1/2) Q_{π̄_*}^tot(s̄_t, {a^i, a^{−i}}). This is because:

Q_{π_*^i}^i(s̄, a^i) − (1/2) Q_{π̄_*}^tot(s̄, ā)
  = Σ_{τ^i} γ^t ((1/2) r^i + (1/2) max_{a^{−i}} r^{−i}) − Σ_{τ̄} γ^t ((1/2) r^i + (1/2) r^{−i})
  ≥ Σ_{τ̄} γ^t ((1/2) r^i + (1/2) max_{a^{−i}} r^{−i}) − Σ_{τ̄} γ^t ((1/2) r^i + (1/2) r^{−i})    (a)
  = Σ_{τ̄} γ^t ((1/2) max_{a^{−i}} r^{−i} − (1/2) r^{−i}) ≥ 0,

where (a) holds because the trajectory τ^i maximizes Σ γ^t r̃^i, so that when the trajectory τ^i is replaced by τ̄, Σ_{τ^i} γ^t r̃^i ≥ Σ_{τ̄} γ^t r̃^i.

Then consider a_*^i and Q_{π_*^i}^i(s̄, a_*^i) = Σ_{τ^i} γ^t ((1/2) r^i + (1/2) max_{a^{−i}} r^{−i}). There must be no collisions along τ^i, otherwise a_*^i cannot be optimal. Therefore, max_{a^i} Q_{π_*^i}^i(s̄_t, a^i) = Σ_{τ^i} γ^t ((1/2) r^i + (1/2) max_{a^{−i}} r^{−i}) = Σ_{τ^i} γ^t ((1/2) r^i + (1/2) r^{−i}). Furthermore, since Q_{π_*^i}^i(s̄_t, a_*^i) ≥ (1/2) Q_{π̄_*}^tot(s̄_t, {a_*^i, a^{−i}}), we find that Σ_{τ^i} γ^t ((1/2) r^i + (1/2) r^{−i}) = max_{a^i} Q_{π_*^i}^i(s̄_t, a^i) = (1/2) max_ā Q_{π̄_*}^tot(s̄_t, ā) = (1/2) Q_{π̄_*}^tot(s̄_t, ā_t), where ā_t ∈ τ. We select a_*^{−i} such that {a_*^i, a_*^{−i}} = ā_t.

Based on the previous discussion, it can be deduced that a_*^i and a_*^{−i} can maximize both Q_{π_*^i}^i(s̄_t, a^i) and Q_{π̄_*}^tot(s̄_t, {a^i, a^{−i}}). Next, we demonstrate that a_*^{−i} can also maximize Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}). Let Q_{π_*^i}^i(s̄_t, a_*^i) = (1/2) Q_{π̄_*}^tot(s̄_t, {a_*^i, a_*^{−i}}) = U. Given that there is no conflict in τ^i and U represents the maximum accumulated reward of (1/2) r^i + (1/2) r^{−i} in state s̄_t, we have U = Σ_{τ^i} γ^t ((1/2) r^i + (1/2) r^{−i}) = Σ_{τ^i} γ^t ((1/2) max_{a^i} r^i + (1/2) r^{−i}) = Q_{π_*^{−i}}^{−i}(s̄, a_*^{−i}). Assume there exists â^{−i} ≠ a_*^{−i} such that max_{a^{−i}} Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}) = Q_{π_*^{−i}}^{−i}(s̄_t, â^{−i}) = V > U. Therefore:

U < V = Σ_{τ^{−i}} γ^t ((1/2) r^{−i} + (1/2) max_{a^i} r^i) = Σ_{τ^{−i}} γ^t ((1/2) r^{−i} + (1/2) r^i)
      ≤ Σ_{τ^i} γ^t ((1/2) r^{−i} + (1/2) r^i) = U,    (b)

which is clearly a contradiction. The validity of (b) follows from the fact that Σ_{τ^i} γ^t ((1/2) r^i + (1/2) r^{−i}) = (1/2) max_ā Q_{π̄_*}^tot(s̄_t, ā). Therefore, Q_{π_*^{−i}}^{−i}(s̄_t, a_*^{−i}) = max_{a^{−i}} Q_{π_*^{−i}}^{−i}(s̄_t, a^{−i}). ∎

APPENDIX B
EXAMPLES

In this section, some examples are provided to facilitate the understanding of the CoRS algorithm.

A. The Instability of Eq. (2)

In this section, we illustrate the instability of Eq. (2) through an example. Fig. 8 presents two agents, A^1 and A^2, with the blue and red flags representing their respective endpoints. The optimal trajectory in this scenario is already depicted in the figure. Based on the optimal trajectory, it is evident that the optimal action pair in the current state is Action Pair 1. We anticipate that when A^1 takes the action from Action Pair 1, I_c^1(s̄_t, ā_t) will be higher than for other actions.

Fig. 8. An example illustrating the instability of Eq. (2).

According to the calculation method provided in Eq. (2) and the reward function given in Table I, we compute the values of I_c^1 when A^1 takes the action from Action Pair 1 under different conditions. When A^2 takes the action in Action Pair 1, I_c^1 = (1/|A^{−1}|) Σ_{j∈A^{−1}} r^j = r^2 = −0.070. When A^2 takes the action in Action Pair 2, a collision occurs between A^1 and A^2, and I_c^1 = (1/|A^{−1}|) Σ_{j∈A^{−1}} r^j = r^2 = −0.5. These calculations show that in the same state, even though A^1 takes the optimal action, the value of I_c^1 changes depending on a^2. This indicates that the calculation method provided in Eq. (2) is unstable.

Fig. 9. An example illustrating how the reward shaping method enhances inter-agent cooperation.

B. How Reward Shaping Promotes Inter-Agent Cooperation

In this section, we illustrate with an example how the reward-shaping method of Eq. (4) promotes inter-agent cooperation, thereby enhancing the overall system efficiency. Table I presents the reward function used in this example. As shown in Fig. 9, there are two agents, A^1 and A^2, engaged in interaction. Throughout their interaction, A^1 and A^2 remain neighbors. Each agent aims to maximize its cumulative reward. For this two-agent scenario, we set α = 1/2. Two trajectories, τ_1 and τ_2, are illustrated in Fig. 9. We will calculate the cumulative rewards that A^1 can obtain on the two trajectories using different reward functions. For the sake of simplicity, in this example γ = 1.

If A^1 uses r^1 as the reward function, the cumulative reward on the trajectory τ_1 is (−0.070) + (−0.070) + (−0.070) = −0.21, and on the trajectory τ_2 it is (−0.070) + (−0.070) + (−0.070) = −0.21. Consequently, A^1 may choose trajectory τ_2. In τ_2, A^2 will stop and wait due to being blocked by A^1. This indicates that using the original reward function may not promote collaboration between agents, resulting in a decrease in overall system efficiency.

However, if A^1 adopts r̃^1 = (1/2) r^1 + (1/2) max_{a^2} r^2 as its reward, the cumulative reward for A^1 on trajectory τ_1 is (−(1/2)·0.070 − (1/2)·0.070) + (−(1/2)·0.070 − (1/2)·0.070) + (−(1/2)·0.070 − (1/2)·0.070) = −0.21. In contrast, on trajectory τ_2 it is (−(1/2)·0.070 − (1/2)·0.075) + (−(1/2)·0.070 − (1/2)·0.070) + (−(1/2)·0.070 − (1/2)·0.070) + ((1/2)·0 − (1/2)·0.070) = −0.2475. Therefore, A^1 will choose the trajectory τ_1 instead of τ_2, allowing A^2 to reach its goal as quickly as possible. This demonstrates that using the reward r̃^1 shaped by the reward-shaping method of Eq. (4) enables A^1 to adequately consider the impact of its actions on other agents, thereby promoting cooperation among agents and enhancing the overall efficiency of the system.
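The two cumulative rewards above can be checked with a few lines of arithmetic. The per-step (r^1, max_{a^2} r^2) pairs below are simply transcribed from the text for α = 1/2 and γ = 1; the snippet is a verification aid, not part of the experiments.

```python
# Verification of the two shaped returns above (alpha = 1/2, gamma = 1).
tau1 = [(-0.070, -0.070), (-0.070, -0.070), (-0.070, -0.070)]
tau2 = [(-0.070, -0.075), (-0.070, -0.070), (-0.070, -0.070), (0.0, -0.070)]

shaped_return = lambda traj: sum(0.5 * r1 + 0.5 * r2 for r1, r2 in traj)
print(round(shaped_return(tau1), 4))   # -0.21
print(round(shaped_return(tau2), 4))   # -0.2475, so tau1 is preferred
```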
APPENDIX C
DETAILS OF EXPERIMENTS

In this section, we provide the detailed data for Section V. Table IV and Table V present the success rates of different algorithms across various test scenarios, while Table VI and Table VII show the makespan of different algorithms in these scenarios. The results for the ODrM* algorithm, which are considered optimal, are highlighted in bold. The best metrics among the other algorithms, which use reinforcement learning techniques, are marked in red.

TABLE IV
SUCCESS RATE (%) IN 40 × 40 ENVIRONMENT

Agents    DHC    DCC    CoRS-DHC    PRIMAL    ODrM*
4         96     100    100         84        100
8         95     98     98          88        100
16        94     98     100         94        100
32        92     94     96          86        100
64        64     91     89          17        92

TABLE V
SUCCESS RATE (%) IN 80 × 80 ENVIRONMENT

Agents    DHC    DCC    CoRS-DHC    PRIMAL    ODrM*
4         93     99     100         67        100
8         90     98     100         53        100
16        90     97     97          33        100
32        88     94     94          17        100
64        87     92     93          6         100
128       40     81     83          5         100

TABLE VI
AVERAGE STEPS IN 40 × 40 ENVIRONMENT

Agents    DHC      DCC      CoRS-DHC    PRIMAL    ODrM*
4         64.15    48.58    50.36       79.08     50
8         77.67    59.60    64.77       76.53     52.17
16        86.87    71.34    68.48       107.14    59.78
32        115.72   93.45    95.42       155.21    67.39
64        179.69   135.55   151.02      170.48    82.60

TABLE VII
AVERAGE STEPS IN 80 × 80 ENVIRONMENT

Agents    DHC      DCC      CoRS-DHC    PRIMAL    ODrM*
4         114.69   93.89    92.14       134.86    93.40
8         133.39   109.89   109.15      153.20    104.92
16        147.55   122.24   121.25      180.74    114.75
32        158.58   132.99   137.06      250.07    121.31
64        183.44   159.67   153.06      321.63    134.42
128       213.75   192.90   193.50      350.76    143.84
