


Multiagent Reinforcement Learning for Multi-Robot
Systems: A Survey
Erfu Yang and Dongbing Gu
Department of Computer Science, University of Essex
Wivenhoe Park, Colchester, Essex, CO4 3SQ, United Kingdom
Emails: {eyang, dgu}@essex.ac.uk

Abstract— Multiagent reinforcement learning for multi-robot systems is a challenging issue in both robotics and artificial intelligence. With ever increasing interest in theoretical research and practical applications, there have been many efforts towards providing solutions to this challenge. However, there are still many difficulties in scaling up multiagent reinforcement learning to multi-robot systems. The main objective of this paper is to provide a survey, though not a complete one, of multiagent reinforcement learning in multi-robot systems. After reviewing important advances in this field, some challenging problems and promising research directions are analyzed. A concluding remark is made from the perspectives of the authors.

I. INTRODUCTION

Multi-Robot Systems (MRSs) can often be used to fulfil tasks that are difficult to accomplish with an individual robot, especially in the presence of uncertainties, incomplete information, distributed control, and asynchronous computation. The redundancy and co-operation of MRSs contribute to task solutions in a more reliable, faster, or cheaper way. Many practical and potential applications, such as unmanned aerial vehicles (UAVs), spacecraft, autonomous underwater vehicles (AUVs), ground mobile robots, and other robot-based applications in hazardous and/or unknown environments, can benefit from the use of MRSs. Therefore, MRSs have received considerable attention during the last decade [1]–[16].

However, there are still many challenging issues in MRSs. These challenges often involve the realization of basic behaviours, such as trajectory tracking, formation-keeping control, and collision avoidance, or allocating tasks, communication, co-ordinating actions, team reasoning, etc. For a practical multi-robot system, basic behaviours or lower-level functions must first be feasible or available. At the same time, upper-level modules for task allocation and planning have to be designed carefully. When designing MRSs, it is impossible to predict all the potential situations robots may encounter and to specify all robot behaviours optimally in advance. Robots in MRSs have to learn from, and adapt to, their operating environment and their counterparts. Thus control and learning become two important and challenging problems in MRSs. In this paper we will mainly focus on the learning problems of multi-robot systems and assume that basic behaviours for each participating robot are available.

Currently, there has been a great deal of research on multiagent reinforcement learning (RL) in MRSs [2], [3], [5]–[7], [9]–[11], [15]–[28]. Multiagent reinforcement learning allows participating robots to learn a mapping from their states to their actions by rewards or payoffs obtained through interacting with their environment. MRSs can benefit from RL in many aspects. Robots in MRSs are expected to co-ordinate their behaviours to achieve their goals.
These robots can either obtain co-operative behaviours or accelerate their learning speed through learning. Among RL algorithms, Q-learning has attracted a great deal of attention [29]–[31]. An explicit presentation of the emergence of co-operative behaviours through an individual Q-learning algorithm can be found in [23]. Improving learning efficiency through co-learning was shown by Tan [32]. That study indicates that K co-operative robots learned faster than they did individually. Tan also demonstrated that sharing perception and learning experience can accelerate the learning processes within a robot group.

Although Q-learning has been applied to many MRSs, such as foraging robots [2], soccer-playing robots [5], [23], prey-pursuing robots [32], [17], and moving-target observation robots [9], most research work in these applications has only focused on tackling the large learning spaces of MRSs. For example, modular Q-learning approaches advocate that a large learning space can be separated into several small learning spaces to ease exploration [17], [23]. Normally, a mediator is needed in these approaches to select optimal policies generated from the different modules.

Theoretically, the environment of an MRS is not stationary. Thus the basic assumption under which traditional Q-learning works is violated. The rewards or payoffs learning robots receive depend not only on their own actions but also on the actions of other robots. Therefore, individual Q-learning methods are unable to model the dynamics of simultaneous learners in a shared environment.

Over the last decade there has been increasing interest in extending individual RL to multiagent systems, particularly MRSs [16], [28], [33]–[50]. From a theoretic viewpoint, this is a very attractive research field since it expands the range of RL from the realm of the simple single agent to the realm of complex multiagent settings where several agents learn simultaneously.

There have been some advances in both multiagent systems and MRSs. The objective of this paper is to review these existing works and analyze some challenging issues from the viewpoint of multiagent RL in MRSs. Moreover, we hope to find some interesting directions for our ongoing research projects.

II. PRELIMINARIES

A. Markov Decision Process

Markov Decision Processes (MDPs) are the mathematical foundation for RL in a single-agent environment. Formally, the definition is as follows [31], [42]:

Definition 1 (Markov Decision Process): A Markov Decision Process is a tuple < S, A, T, R, γ >, where S is a finite discrete set of environment states, A is a finite discrete set of actions available to the agent, γ (0 ≤ γ < 1) is a discount factor, T : S × A → Π(S) is a transition function giving, for each state and action, a probability distribution over states, and R : S × A → R is a reward function of the agent, giving the expected immediate reward received by the agent under each action in each state.

Definition 2 (Policies): A policy π denotes a description of the behaviours of an agent. A stationary policy π : S → Π(A) is a probability distribution over actions to be taken for each state. A deterministic policy is one that assigns probability 1 to some action in each state.

Each MDP has a deterministic stationary optimal policy [42]. In an MDP, the agent acts in such a way as to maximize the long-run value it can expect to gain. Under the discounted objective, the factor γ controls how much effect future rewards have on the decisions at each moment. Denoting by Q^π(s, a) the expected discounted future reward to the agent for starting in state s, taking action a for one step and then following a policy π, we can define a set of simultaneous linear equations for each state s, i.e., the Q-function for π:
Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} T(s, a, s') Σ_{a'∈A} π(s', a') Q^π(s', a')    (1)

The Q-function Q* for the deterministic stationary policy π* that is optimal for every starting state is defined by a set of equations:

Q*(s, a) = R(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s')    (2)

where V*(s') = max_{a'∈A} Q*(s', a'). Thus a greedy policy can be defined according to the Q-function Q*.

Definition 3 (Greedy Policy): A policy π is said to be greedy if it always assigns probability one to an action argmax_a Q*(s, a) in state s.

B. Reinforcement Learning

The objective of RL is to learn how to act in a dynamic environment from experience by maximizing some payoff function or, equivalently, minimizing some cost function. In RL, the state dynamics and reinforcement function are at least partially unknown. Thus learning occurs iteratively and is performed only through trial-and-error methods and reinforcement signals, based on the experience of interactions between the agent and its environment.

C. Q-Learning

Q-learning [29] is a value-learning version of RL that learns utility values (Q values) of state and action pairs. It is a form of model-free RL and provides a simple way for agents to learn how to act optimally in controlled Markovian domains. It can also be viewed as a method of asynchronous dynamic programming. In essence, Q-learning is a temporal-difference learning method. The objective of Q-learning is to estimate Q values for an optimal policy. During learning, an agent uses its experience to improve its estimate by blending new information into its prior experience. Although there may be more than one optimal policy, the Q* values are unique [29].

In Q-learning the agent's experience consists of a sequence of distinct episodes. The available experience for an agent in an MDP environment can be described by a sequence of experience tuples < s_t, a_t, s'_t, r_t >. Table I shows the scheme of Q-learning.

Individual Q-learning in discrete cases has been proved to converge to the optimal values with probability one if state-action pairs are visited infinitely many times and the learning rate declines. The following theorem [29] provides a set of conditions under which Q_t(s, a) converges to Q*(s, a) as t → ∞:

Theorem 1: Given bounded rewards r_t, learning rates α_t ∈ [0, 1), and

Σ_{i=1}^∞ α_{t_i(s,a)} = ∞,   Σ_{i=1}^∞ [α_{t_i(s,a)}]² < ∞,   ∀s, a    (4)

then Q_t(s, a) → Q*(s, a) as t → ∞ with probability one for all s, a, where t_i(s, a) denotes the index of the i-th time that action a is tried in state s.

Although the greedy policy converges to an optimal policy as Q_t(s, a) → Q*(s, a), the agent may not explore a sufficient amount to guarantee convergent performance if the greedy policy is adopted to choose actions throughout the learning process [51]. To overcome this conflict, a GLIE (Greedy in the Limit with Infinite Exploration) policy was proposed in [51]. To show the convergence of a GLIE policy, the following concept is crucial [42]:

Definition 4 (Convergence in Behaviour): An agent converges in behaviour if its action distribution becomes stationary in the limit.

According to Definition 4, Littman [42] pointed out that a GLIE policy need not converge in behaviour, since ties in greedy actions are broken arbitrarily. However, an agent with Q-learning will also converge in behaviour if there is a unique optimal policy and the Q-function converges. The following convergence results come from [42], [51].
TABLE I
THE Q-LEARNING ALGORITHM

1) Observes the current state s_t.
2) Chooses an action a_t and performs it.
3) Observes the new state s'_t and receives an immediate reward r_t.
4) Adjusts the Q_{t−1} values using a learning factor α_t according to the following rule:

   Q_t(s, a) = (1 − α_t) Q_{t−1}(s, a) + α_t [r_t + γ V_{t−1}(s'_t)]   if s = s_t and a = a_t
   Q_t(s, a) = Q_{t−1}(s, a)                                          otherwise        (3)

   where V_{t−1}(s') = max_{b∈A} Q_{t−1}(s', b).
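As a concrete illustration of the scheme in Table I, the sketch below implements tabular Q-learning with ε-greedy exploration and a per-pair decaying learning rate. The environment interface (`reset()`, `step()`) and the exploration/decay schedules are illustrative assumptions, not part of the survey.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, gamma=0.9,
               alpha0=0.5, epsilon0=0.2):
    """Tabular Q-learning sketch following the scheme of Table I.

    `env` is a hypothetical environment exposing reset() -> state and
    step(action) -> (next_state, reward, done); states must be hashable.
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    visits = defaultdict(int)                      # per (s, a) visit counts

    for episode in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration (a GLIE schedule would let epsilon
            # decay with the number of visits to each state)
            if random.random() < epsilon0 / (1 + episode):
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])

            s_next, r, done = env.step(a)

            # decaying learning rate: one schedule satisfying Theorem 1
            visits[(s, a)] += 1
            alpha = alpha0 / visits[(s, a)]

            # update rule (3): blend the new estimate into the old one
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s_next
    return Q
```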

Theorem 2: In a single-agent environment, a Q-learning agent will converge to Q*(s, a) with probability one. Furthermore, an agent following a GLIE policy will converge in behaviour with probability one if the optimal policy is unique.

However, one may find this theorem difficult to apply, since one often does not know whether there is a unique optimal policy or not. For Q-learning, the usual situation is that the Q* values, rather than one optimal policy, are unique [29].

D. Matrix Games

Matrix games are the most elementary type of game for many players, particularly two-player games [52]. In matrix games players select actions from their available action spaces and receive rewards that depend on all the other players' actions.

Definition 5 (Matrix Games): A matrix game is given by a tuple < n, A_1, ..., A_n, R_1, ..., R_n >, where n is the number of players, and A_i and R_i (i = 1, ..., n) are the finite action set and payoff function respectively for player i.

Among matrix games, bimatrix games are often used to formulate frameworks for multiagent reinforcement learning. Hence we particularly give their definition as follows [53]:

Definition 6 (Bimatrix Games): A bimatrix game is defined by a pair of payoff matrices (M_1, M_2). Each matrix M_i (i = 1, 2) has dimension |A_1| × |A_2| and its entry M_i(a_1, a_2) gives the reward of the i-th player under a joint action pair (a_1, a_2), ∀(a_1, a_2) ∈ A_1 × A_2.

In matrix games the joint actions correspond to particular entries in the payoff matrices. For reinforcement learning purposes, agents play the same matrix game repeatedly. For applying repeated matrix game theory to multiagent reinforcement learning, the payoff structure must be given explicitly. This requirement seems to be a restriction for applying matrix games to the domains of MRSs, where the payoff structure is often difficult to define in advance.

E. Stochastic Games

Currently, multiagent learning has focused on the theoretic framework of Stochastic Games (SGs) or Markov Games (MGs). SGs extend one-state matrix games to multi-state cases by modelling state transitions with an MDP.
Each state in an SG can be viewed as a matrix game, and an SG with one player can be viewed as an MDP. In what follows we first review some important concepts to ease the understanding of the related works in this survey.

Definition 7 (Best Response): A policy is said to be a best response to the other players' policies if it is optimal given their policies.

Definition 8 (Nash Equilibrium): A Nash equilibrium is a collection of strategies, one for each of the players, such that each player's strategy is a best response to the other players' strategies.

At a Nash equilibrium, no player can do better by changing strategies unilaterally, given that the other players do not change their Nash strategies. At least one Nash equilibrium exists in any game [52], [54].

Definition 9 (Mixed-Strategy Nash Equilibrium): A mixed-strategy Nash equilibrium for a bimatrix game is a pair of strategies (π*_1, π*_2), one for each of the players, such that each player's strategy is a best response to the other player's strategy, i.e., mathematically,

π*_1ᵀ M_1 π*_2 ≥ π_1ᵀ M_1 π*_2   ∀π_1 ∈ Π(A_1)
π*_1ᵀ M_2 π*_2 ≥ π*_1ᵀ M_2 π_2   ∀π_2 ∈ Π(A_2)       (5)

where Π(A_1) and Π(A_2) are the probability distributions over the corresponding action spaces of agents 1 and 2, respectively.

F. Game Theory and Multiagent Reinforcement Learning

Game Theory (GT) is explicitly designed for reasoning among multiple players. Players in GT are assumed to act rationally: they always play their best policies, rather than play tricks. A great deal of research on multiagent learning has borrowed its theoretic frameworks and notions from SGs [33]–[37], [39], [41]–[45], [47]–[50], [55]–[58]. SGs have been well studied in the field of multiagent reinforcement learning and appear to be a natural and powerful extension of MDPs to multi-agent domains. In the framework of SGs, the Nash equilibrium is an important solution concept for the problem of simultaneously finding optimal policies in the presence of other learning agents. At a Nash equilibrium each agent is playing optimally with respect to the others under a Nash equilibrium policy. If all the agents are rationally playing a policy at a Nash equilibrium, then no agent could learn a better policy.

III. THEORETIC FRAMEWORKS FOR MULTIAGENT REINFORCEMENT LEARNING

A. SG-Based Frameworks

The framework of SGs (or MGs) is widely adopted by researchers to model multiagent systems with finite states and actions [33], [42], [50], [59], [60]. In particular, the framework of SGs is exploited in extending Q-learning to multiagent systems. In an SG, all agents select their actions simultaneously. The reward each agent receives depends on the joint actions of all agents and the current state, as well as on the state transitions, according to the Markov property. A reinforcement learning framework of SGs is given by the following formal definition [33], [42], [50], [59].

Definition 10 (Framework of SGs): A learning framework of SGs is described by a tuple < S, A_1, ..., A_n, T, R_1, ..., R_n, γ >, where
• S is a finite state space;
• A_1, ..., A_n are the corresponding finite sets of actions available to each agent;
• T : S × A_1 × ... × A_n → Π(S) is a state transition function, given each state and one action from each agent. Here Π(S) is a probability distribution over the state space S;
• R_i : S × A_1 × ... × A_n → R (i = 1, ..., n) represents a reward function for each agent;
• 0 ≤ γ < 1 is the discount factor.

In such a learning framework of SGs, learning agents attempt to maximize their expected sum of discounted rewards.
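To make the best-response inequalities (5) of Definition 9 concrete, the following sketch numerically checks them for the classic matching-pennies bimatrix game; the game, the candidate strategies, and the helper name are illustrative choices rather than material from the survey.

```python
import numpy as np

# Matching pennies: player 1 wins on a match, player 2 wins on a mismatch.
M1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])
M2 = -M1                          # zero-sum special case of a bimatrix game

pi1_star = np.array([0.5, 0.5])   # candidate equilibrium strategies
pi2_star = np.array([0.5, 0.5])

def is_nash(M1, M2, pi1, pi2, tol=1e-9):
    """Check the inequalities in (5) against every pure deviation.

    Testing pure strategies suffices: any mixed deviation is a convex
    combination of pure ones, so it cannot do better than the best pure one.
    """
    v1 = pi1 @ M1 @ pi2
    v2 = pi1 @ M2 @ pi2
    best_dev_1 = max(M1[a1, :] @ pi2 for a1 in range(M1.shape[0]))
    best_dev_2 = max(pi1 @ M2[:, a2] for a2 in range(M2.shape[1]))
    return v1 >= best_dev_1 - tol and v2 >= best_dev_2 - tol

print(is_nash(M1, M2, pi1_star, pi2_star))               # True
print(is_nash(M1, M2, np.array([1.0, 0.0]), pi2_star))   # False: player 2 deviates
```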
Correspondingly, a set of Q-functions for agent i (i = 1, ..., n) can be defined according to the stationary policies π_1, ..., π_n. Unlike in a single-agent system, in multiagent systems the joint actions determine the next state and the rewards to each agent. After selecting actions, the agents transition to the next state and receive their rewards.

B. Fictitious Play Framework

In a known SG, the framework of fictitious play can be used as a technique for finding equilibria. For a learning paradigm, fictitious play can also be applied to form a theoretical framework [60]. It provides a quite simple learning model. In the framework of fictitious play, the algorithm maintains information about the average estimated sum of future discounted rewards. According to the Q-functions of the agents, the fictitious play method deterministically chooses, for each agent, the actions that would have done best in the past. For computing the estimated sum of future discounted rewards, a simple temporal-difference backup may be used.

Compared with the framework of SGs, the main merit of fictitious play is that it is capable of finding equilibria in both zero-sum games and some classes of general-sum games [60]. One obvious disadvantage of this framework is that fictitious play merely adopts deterministic policies and cannot play stochastic strategies. Hence it is hard to apply in zero-sum games, because it can only find an equilibrium policy but does not actually play according to that policy [60]. In addition, learning stability is another serious problem. Since the fictitious play framework is inherently discontinuous, a small change in the data could lead to an abrupt change in behaviour [49]. To overcome this instability, many variants of fictitious play have been developed; see [49] as well as the literature therein.

C. Bayesian Framework

The multiagent reinforcement learning algorithms developed from the SG framework, such as Minimax-Q, Nash-Q, etc., always require convergence to desirable equilibria. Thus, sufficient exploration of the strategy space is needed before convergence can be established. Solutions to multiagent reinforcement learning problems are usually based on equilibria. Thus, to obtain an optimal policy, agents have to find and even identify the equilibria before the policy is used at the current state.

A Bayesian framework for exploration in multiagent reinforcement learning systems was proposed in [47], [49]. The Bayesian framework is a model-based reinforcement learning model. In this framework the learning agent can use priors to reason about how its actions will influence the behaviours of other agents. Thus, some prior density over the possible dynamics and reward distributions has to be known by the learning agent in advance.

A basic assumption in the Bayesian framework is that the learning agent is able to observe the actions taken by all agents, the resulting game state, and the rewards received by other agents. Of course, this assumption poses no problem for the coordination of multiple agents, but it restricts applications in other settings, where opponent agents generally will not broadcast their information to the others.

To establish its belief, a learning agent under the Bayesian framework has some priors, such as a probability distribution over the state space as well as over the possible strategy space. The belief is then updated during learning by observing the results of its actions and the action choices of other agents. In order to predict accurately the actions of other agents, the learning agent has to record and maintain an appropriate observable history. In [47], [49] it is assumed that the learning agent can keep track of sufficient history to make such predictions.
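One simple instance of the belief maintenance just described is a Dirichlet (count-based) model of each opponent's per-state strategy, updated from the observed actions. The sketch below is an assumed, minimal formulation; it omits the priors over transition and reward models that [47], [49] also maintain.

```python
from collections import defaultdict
import numpy as np

class OpponentBelief:
    """Dirichlet belief over one opponent's action choice in each state.

    counts[s] holds the Dirichlet parameters; the prior `alpha` encodes how
    strongly the agent initially believes the opponent plays uniformly.
    """
    def __init__(self, n_actions, alpha=1.0):
        self.n_actions = n_actions
        self.counts = defaultdict(lambda: np.full(n_actions, alpha))

    def update(self, state, observed_action):
        # Bayesian update: observing an action increments its pseudo-count.
        self.counts[state][observed_action] += 1.0

    def predictive(self, state):
        # Posterior predictive distribution over the opponent's next action.
        c = self.counts[state]
        return c / c.sum()

# usage sketch: after each joint step, fold in the observed opponent action
belief = OpponentBelief(n_actions=3)
belief.update(state=0, observed_action=2)
print(belief.predictive(0))   # [0.25, 0.25, 0.5] with a uniform prior
```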
Besides the aforementioned assumptions, in [47], [49] there are two extra assumptions on the belief. First, the priors over models can be factored into independent local models for both rewards and transitions. Second, it needs to be assumed that the belief about opponent strategies can also be factored and represented in some convenient form [47], [49].

D. Policy Iteration Framework

Unlike the value iteration frameworks, the policy iteration framework can provide a direct way to find the optimal strategy in the policy space. Under the policy iteration framework, Bowling and Veloso [61] proposed a WoLF-PHC algorithm, assuming the other agents are playing stationary policies. Other works following the line of thinking of [61] can be found in [53], [62]. There seem to be no other reported works on algorithms under this framework when the other agents in the system are considered to learn simultaneously. Compared with the aforementioned frameworks, much research on policy iteration reinforcement learning in multiagent systems still needs to be done in the future. Fortunately, there are already many research results on policy iteration algorithms; for instance, one may refer to [63] as well as the literature therein. Thus one possible way is to extend the existing policy iteration algorithms of single-agent systems to the field of multiagent systems.

IV. MULTIAGENT REINFORCEMENT LEARNING ALGORITHMS

The difference between single-agent and multiagent systems lies in the environments. In multiagent systems other adapting agents make the environment no longer stationary, violating the Markov property that traditional single-agent behaviour learning relies upon.

For individual robot learning, traditional Q-learning has been successfully applied to many paradigms. Some researchers also apply Q-learning in a straightforward fashion to each agent in a multiagent system. However, the aforementioned fact that the environment is no longer stationary in a multiagent system is usually neglected. Over the last decade many researchers have made efforts to use the RL methodology, particularly the Q-learning framework, as an alternative approach to the learning of MRSs. As pointed out earlier, the basic assumption under which traditional Q-learning works is violated in the case of MRSs.

A. Minimax-Q Learning Algorithm

Under the SG framework, Littman [33] proposed a Minimax-Q learning algorithm for zero-sum games in which the learning player maximizes its payoffs in the worst situation. The players' interests in the game are opposite. Essentially, Minimax-Q learning is a value-function reinforcement learning algorithm. In Minimax-Q learning the player always tries to maximize its expected value in the face of the worst-possible action choice of the opponent. Hence the player becomes more cautious after learning. To calculate the probability distribution, i.e., the optimal policy of the player, Littman [33] simply used linear programming.

An illustrative version of the Minimax-Q learning algorithm is shown in Table II. The Minimax-Q learning algorithm was first given in [33], which included only empirical results on a simple zero-sum SG version of soccer. A complete convergence proof was provided in the works thereafter [34], [39], [42], which can be summarized in the following theorem:

Theorem 3: In a two-player zero-sum multiagent SG environment, an agent following the Minimax-Q learning algorithm will converge to the optimal Q-function with probability one. Furthermore, an agent using a GLIE policy will converge in behaviour with probability one if the limit equilibrium is unique.

The Minimax-Q learning algorithm may provide a safe policy in that it can be performed regardless of the existence of its opponent [42].
TABLE II
THE MINIMAX-Q LEARNING ALGORITHM

1) Initialize V(s) and Q(s, a_1, a_2) for all s ∈ S, a_1 ∈ A_1, and a_2 ∈ A_2.
2) Choose an action:
   a) With an exploring probability, return an action uniformly at random.
   b) Otherwise, return action a_1 with probability π(s, a_1).
3) Learn:
   a) After receiving a reward r for moving from state s to s' via action a_1 and opponent's action a_2,
   b) Update:
      Q(s, a_1, a_2) ← (1 − α) Q(s, a_1, a_2) + α (r + γ V(s'))
   c) Use linear programming to find π(s, ·) such that:
      π(s, ·) ← argmax_{π'(s,·)∈Π(A_1)} min_{a'_2∈A_2} Σ_{a'_1} π'(s, a'_1) Q(s, a'_1, a'_2)
   d) Let:
      V(s) ← min_{a'_2∈A_2} Σ_{a'_1} π(s, a'_1) Q(s, a'_1, a'_2)
   e) Let α ← αε, where ε is a decay rate for the learning parameter α.
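Step 3c of Table II is an ordinary linear program: maximize the security value v subject to the mixed policy guaranteeing at least v against every opponent action. The sketch below solves that program for one state's Q-table, assuming SciPy is available; it illustrates the step and is not code from [33].

```python
import numpy as np
from scipy.optimize import linprog

def minimax_policy(Q_s):
    """Solve step 3c of Table II for one state.

    Q_s is an |A1| x |A2| array of Q(s, a1, a2) values. Returns the
    maximin mixed policy pi(s, .) over A1 and the game value V(s).
    Variables are x = [pi_1, ..., pi_m, v]; we minimise -v.
    """
    m, n = Q_s.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximise v
    # For every opponent action a2:  v - sum_a1 pi(a1) Q[a1, a2] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is free, pi entries lie in [0, 1].
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, v = res.x[:-1], res.x[-1]
    return pi, v

# Matching pennies as a one-state check: the value is 0, the policy uniform.
pi, v = minimax_policy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(np.round(pi, 3), round(v, 3))     # [0.5 0.5] 0.0
```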

The policy used in the Minimax-Q learning algorithm can guarantee that the agent receives the largest value possible in the absence of knowledge of the opponent's policy. Although the Minimax-Q learning algorithm manifests many advantages in the domain of two-player zero-sum multiagent SGs, an explicit drawback of this algorithm is that it is very slow to learn, since in each episode and in each state a linear program has to be solved. The use of linear programming significantly increases the computation cost before the system reaches convergence.

B. Nash-Q Learning Algorithm

Hu and Wellman [37], [50] extended the zero-sum game framework of Littman [33] to general-sum games and developed a Nash-Q learning algorithm for multiagent reinforcement learning. To extend Q-learning to the multiagent learning domain, the joint actions of the participating agents, rather than merely individual actions, need to be taken into account. Considering this important difference between single-agent and multiagent reinforcement learning, the Nash-Q learning algorithm needs to maintain Q values for both the learner itself and the other players. The idea is to find Nash equilibria at each state in order to obtain Nash equilibrium policies for Q-value updating.

To apply the Nash-Q learning algorithm, one has to define the Nash Q-value. A Nash Q-value is defined as the expected sum of discounted rewards when all agents follow specified Nash equilibrium strategies from the next period on. The following definitions come directly from [50] for understanding the Nash-Q learning algorithm.
Definition 11 (Nash Q-function): The Nash Q-function for the i-th agent is defined over (s, a_1, ..., a_n) as the sum of its current reward plus its future rewards when all agents follow a joint Nash equilibrium strategy, i.e.,

Q*_i(s, a_1, ..., a_n) = r_i(s, a_1, ..., a_n) + γ Σ_{s'∈S} T(s, s', a_1, ..., a_n) V_i(s', π*_1, ..., π*_n)    (6)

where (π*_1, ..., π*_n) is a joint Nash equilibrium strategy, r_i(s, a_1, ..., a_n) is the one-period reward of the i-th agent in state s under the joint action (a_1, ..., a_n), V_i(s', π*_1, ..., π*_n) is the total discounted reward over infinite periods starting from state s' given that the agents follow the equilibrium strategies, and T(s, s', a_1, ..., a_n) is the probability distribution of state transitions under the joint action (a_1, ..., a_n).

Definition 12 (Nash Joint Strategy): A joint strategy (π_1, ..., π_n) constitutes a Nash equilibrium for the stage games (M_1, ..., M_n) if, for k = 1, ..., n,

π_k π_{−k} M_k ≥ π̂_k π_{−k} M_k   for all π̂_k ∈ Π(A_k),

where π_{−k} is the product of the strategies of all agents other than k, i.e., π_{−k} = π_1 ··· π_{k−1} · π_{k+1} ··· π_n.

With the above preliminaries, the Nash-Q learning algorithm can now be summarized in Table III. Hu and Wellman used quadratic programming to find Nash equilibria in the Nash-Q learning algorithm for general-sum games.

Hu and Wellman [37], [50] have shown that the Nash-Q learning algorithm in a multi-player environment converges to Nash equilibrium policies with probability one under some conditions and additional assumptions on the payoff structures. More formally, the main results can be summarized in the following theorem [37], [42], [50]:

Theorem 4: In a multiagent SG environment, an agent following the Nash-Q learning algorithm will converge to the optimal Q-function with probability one as long as all Q-functions encountered have coordination equilibria and these are used in the update rule. Furthermore, the agent using a GLIE policy will converge in behaviour with probability one if the limit equilibrium is unique.

To guarantee convergence, the Nash-Q learning algorithm needs to know that a Nash equilibrium is either unique or has the same value as all others. Littman [42] has discussed the applicability of Theorem 4 and pointed out that it is hard to apply, since the strict conditions are difficult to verify in advance. To tackle this difficulty, Littman [44] thereafter proposed a so-called Friend-or-Foe Q-learning (FFQ) algorithm, which is introduced in the following subsection.

C. Friend-or-Foe Q-learning (FFQ) Algorithm

Motivated by the conditions of Theorem 4 on the convergence of Nash-Q learning, Littman [44] developed a Friend-or-Foe Q-learning (FFQ) algorithm for RL in general-sum SGs. The main idea is that each agent in the system is identified as being either a "friend" or a "foe". Thus, the equilibria can be classified as either coordination or adversarial equilibria. Compared with Nash-Q learning, FFQ-learning can provide a stronger convergence guarantee.

Littman [44] presented the following results to prove the convergence of the FFQ-learning algorithm:

Theorem 5: Foe-Q learns values for a Nash equilibrium policy if there is an adversarial equilibrium; Friend-Q learns values for a Nash equilibrium policy if the game has a coordination equilibrium. This is true regardless of opponent behaviour.

Theorem 6: Foe-Q learns a Q-function whose corresponding policy will achieve at least the learned values regardless of the policy selected by the opponent.

Although the convergence property of FFQ-learning has been improved over that of the Nash-Q learning algorithm, a complete treatment of general-sum stochastic games using Friend-or-Foe concepts is still lacking [44].
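The practical difference between the two cases lies in how the state value is backed up from the joint-action Q-table: Friend-Q assumes the other agent helps select the best joint action, while Foe-Q falls back on the maximin value used by Minimax-Q. The sketch below shows both backup operators for a two-agent state; it reuses the `minimax_policy` helper from the Minimax-Q sketch above, and the example payoffs are invented.

```python
import numpy as np

def friend_value(Q_s):
    """Friend-Q backup: assume a coordination equilibrium, so both agents
    jointly pick the entry that is best for the learner."""
    return float(np.max(Q_s))

def foe_value(Q_s):
    """Foe-Q backup: assume an adversarial equilibrium and use the maximin
    (security) value, exactly as in the Minimax-Q update."""
    pi, v = minimax_policy(Q_s)        # LP sketch from the Minimax-Q section
    return float(v)

# In a common-payoff (team) game this table has a coordination
# equilibrium at the joint action (0, 0).
Q_s = np.array([[4.0, 0.0],
                [0.0, 2.0]])
print(friend_value(Q_s))   # 4.0
print(foe_value(Q_s))      # maximin value, 4/3 here under mixing
```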
TABLE III
THE NASH-Q LEARNING ALGORITHM

1) Initialize:
   a) Give the initial state s_0 and let t ← 0.
   b) Take the i-th agent as the learning agent.
   c) For all s ∈ S and a_i ∈ A_i, i = 1, ..., n, let Q^t_i(s, a_1, ..., a_n) ← 0.
2) Repeat:
   a) Choose action a^t_i.
   b) Observe r^t_1, ..., r^t_n; a^t_1, ..., a^t_n; and s^{t+1} ← s'.
   c) Update Q^t_i for i = 1, ..., n:
      Q^{t+1}_i(s, a_1, ..., a_n) = (1 − α_t) Q^t_i(s, a_1, ..., a_n) + α_t [r^t_i + γ NashQ^t_i(s')]
3) t ← t + 1.

where α_t ∈ (0, 1) and NashQ^t_i(s') is defined as:

NashQ^t_i(s') = π_1(s') ··· π_n(s') · Q^t_i(s')

In comparison to the Nash-Q learning algorithm, FFQ-learning does not require learning estimates of the Q-functions of opponents. However, FFQ-learning still requires a very strong condition for application: the agent must know, in advance, how many equilibria there are in the game and whether an equilibrium is a coordination or an adversarial one. FFQ-learning itself does not provide a way to find a Nash equilibrium or to identify a Nash equilibrium as being either a coordination or an adversarial one. Like Nash-Q learning, FFQ-learning also cannot be applied to systems where neither a coordination nor an adversarial equilibrium exists.

D. rQ-learning Algorithm

Morales [64] developed a so-called rQ-learning algorithm for dealing with the large search space problem. In this algorithm an r-state and an r-action set need to be defined in advance. An r-state is defined by a set of first-order relations, such as goal in front, team robot to the left, opponent robot with ball, etc. An r-action is described by a set of pre-conditions, a generalized action, and possibly a set of post-conditions. For an r-action to be defined properly, the following condition must be satisfied: if an r-action is applicable to a particular instance of an r-state, then it should be applicable to all the instances of that r-state. The rQ-learning algorithm can reduce the size of the search space; the process is given in Table IV.

Although the rQ-learning algorithm seems to be useful for dealing with the large search space problem, it may be very difficult to define an r-state and an r-action set properly, particularly in the case of incomplete knowledge about the concerned MRS. Furthermore, in the r-state space there is no guarantee that the defined r-actions are adequate to find an optimal sequence of primitive actions, and sub-optimal policies can be produced [64].
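A minimal illustration of the r-state idea is a `rels` function mapping a raw, metric robot state to a set of first-order relations, which can then serve as the key of the Q-table in Table IV below. The predicates, thresholds, and input format here are invented for illustration and are not taken from [64].

```python
import math

def rels(state):
    """Map a raw soccer-robot state to an r-state: a frozenset of relations.

    `state` is an assumed dict with metric entries such as ball and goal
    positions relative to the robot.
    """
    relations = set()
    bx, by = state["ball"]
    gx, gy = state["goal"]
    if math.hypot(bx, by) < 0.5:
        relations.add("ball_near")
    if abs(math.atan2(gy, gx)) < 0.3:
        relations.add("goal_in_front")
    if state.get("teammate_left", False):
        relations.add("team_robot_to_left")
    if state.get("opponent_has_ball", False):
        relations.add("opponent_with_ball")
    return frozenset(relations)        # hashable, so usable as a Q-table key

r_state = rels({"ball": (0.3, 0.1), "goal": (2.0, 0.2),
                "opponent_has_ball": True})
print(sorted(r_state))   # ['ball_near', 'goal_in_front', 'opponent_with_ball']
```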
TABLE IV
THE rQ-LEARNING ALGORITHM

1) Initialize Q(S, A) arbitrarily (S is the r-state set and A is the r-action set).
2) Repeat:
   a) Initialize state s.
   b) S ← rels(s), evaluating the set of relations over state s.
   c) Repeat for each step of the episode:
      i) Choose A from S using a persistently exciting policy;
      ii) Choose action a from A randomly;
      iii) Apply action a and observe r, s';
      iv) Update:
         S' ← rels(s')
         Q(S, A) ← Q(S, A) + α (r + γ max_{A'} Q(S', A') − Q(S, A))
         S ← S'
   d) until s is terminal.

E. Fictitious Play Algorithm

Since Nash-equilibrium-based learning has difficulty in finding Nash equilibria, fictitious play may provide another way to deal with multiagent reinforcement learning under the SG framework. In the fictitious play algorithm, the beliefs about other players' policies are represented by the empirical distribution of their past play [1], [48]. Hence, the players only need to maintain their own Q values, which are related to joint actions and are weighted by their belief distribution over the other players' actions. Table V shows a fictitious play algorithm for two-player zero-sum SGs using a model [60].

For stationary policies of the other players, the fictitious play algorithm becomes a variant of individual Q-learning. For non-stationary policies of the other players, these fictitious-play-based approaches have been used empirically either in competitive games, where the players model their adversarial opponents (called opponent modelling), or in collaborative games, where the players learn Q values of their joint actions and each player is called a Joint Action Learner (JAL) [1].

For the fictitious-play-based approaches, the algorithms will converge to a Nash equilibrium in games that are iterated dominance solvable if all players are playing fictitious play [65].

Although fictitious-play-based learning eliminates the necessity of finding equilibria, learning agents have to model the others, and learning convergence has to depend on some heuristic rules [61].
TABLE V
THE FICTITIOUS PLAY ALGORITHM

1) Let t ← 0 and initialize Q_i for all s ∈ S and a_i ∈ A_i.
2) Repeat for each state s ∈ S:
   a) Let a_i = argmax_{a_i∈A_i} Q_i(s, a_i).
   b) Update Q_i(s, a_i), ∀s ∈ S, a_i ∈ A_i:
      Q_i(s, a_i) ← Q_i(s, a_i) + R(s, < a_{−i}, a_i >) + γ Σ_{s'∈S} T(s, a, s') V(s')
      where
      V(s) = max_{a_i∈A_i} Q_i(s, a_i)/t
3) t ← t + 1.
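For a single-state (matrix) game the scheme of Table V reduces to classical fictitious play: keep empirical frequencies of the opponent's past actions and best-respond to that empirical mixture. The sketch below shows this simpler loop on an invented payoff matrix; in the full SG case the Q values would additionally bootstrap through the transition function.

```python
import numpy as np

def fictitious_play(M1, M2, rounds=2000):
    """Two-player fictitious play on a bimatrix game (M1, M2).

    Each player best-responds to the empirical frequency of the other
    player's past actions; the final empirical mixtures are returned.
    """
    n1, n2 = M1.shape
    counts1 = np.ones(n1)      # pseudo-counts so that round 0 is defined
    counts2 = np.ones(n2)
    for _ in range(rounds):
        freq1 = counts1 / counts1.sum()
        freq2 = counts2 / counts2.sum()
        a1 = int(np.argmax(M1 @ freq2))        # best response of player 1
        a2 = int(np.argmax(freq1 @ M2))        # best response of player 2
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Matching pennies: the empirical play converges towards (0.5, 0.5).
M1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
pi1, pi2 = fictitious_play(M1, -M1)
print(np.round(pi1, 2), np.round(pi2, 2))
```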

F. Multiagent SARSA Learning Algorithm

The Minimax-Q and Nash-Q learning algorithms are actually off-policy RL algorithms, since they replace the max operator of the individual Q-learning algorithm with their best response (Nash equilibrium policy). In RL, an off-policy learning algorithm always tries to converge to the optimal Q values of the optimal policy regardless of what policy is currently being executed. For off-policy learning algorithms, ε-greedy policies can usually be used to balance exploration and exploitation of the learning space.

The SARSA algorithm [31] is an on-policy RL algorithm that tries to converge to the optimal Q values of the policy currently being executed. Considering the disadvantages of the Minimax-Q and Nash-Q learning algorithms, a SARSA-based multi-agent algorithm called EXORL (Extended Optimal Response Learning) was developed in [48]. In [48] the fact that the opponents may take stationary policies, rather than Nash equilibrium policies, is taken into account. Once opponents take stationary policies, there is no need to find Nash equilibria at all during learning. Thus, the learning update can be simplified by eliminating the necessity of finding Nash equilibria if the opponents take stationary policies. In addition, some heuristic rules were employed to switch the algorithm between Nash-equilibrium-based learning and fictitious-play-based learning.

The EXORL algorithm is depicted in Table VI. The basic idea of this algorithm is that the agent should learn a policy which is an optimal response to the opponent's policy, but it tries to reach a Nash equilibrium when the opponent is adaptable.

Like the Nash-Q learning algorithm, the EXORL algorithm will have difficulty when there exist multiple equilibria. Another obvious shortcoming of the EXORL algorithm is that one agent is assumed to be capable of observing the opponent's actions and rewards. In some cases this will be a very serious restriction, since all the agents may learn their strategies simultaneously and one agent cannot obtain the actions of the opponent at all in advance. Moreover, the opponent may also take a stochastic strategy instead of deterministic policies. Observing the rewards obtained by the opponent will be even more difficult, since the rewards are only available after the policies are put into action in practice.

Only some empirical results were given in [48] for the EXORL algorithm; a theoretic foundation is still lacking. Hence, a complete proof of convergence is to be expected. The theoretical results provided in [35], [39] may be helpful for obtaining some convergence properties for the EXORL algorithm.

G. Policy Hill Climbing (PHC) Algorithm

The PHC algorithm updates Q values in the same way as the fictitious play algorithm, but it maintains a mixed policy (or stochastic policy) by performing hill-climbing in the space of mixed policies. Bowling and Veloso [60], [61] proposed a WoLF-PHC algorithm by adopting the idea of Win or Learn Fast (WoLF) and using a variable learning rate.
TABLE VI
THE EXORL ALGORITHM: A MULTIAGENT VERSION OF THE SARSA ALGORITHM

1) Initialize, for all s ∈ S, a_1 ∈ A_1 and a_2 ∈ A_2:
   Q_1(s, a_1, a_2) ← 0,   Q_2(s, a_1, a_2) ← 0,   π_1(s, a_1) ← 1/|A_1|,   π̂_2(s, a_2) ← 1/|A_2|
2) Repeat for each state s ∈ S:
   a) Choose action a^t_1 according to π_1(s^t) with suitable exploration.
   b) Observe the rewards (r^{t+1}_1, r^{t+1}_2), the action a^t_2 taken by the opponent, and the next state s^{t+1}.
   c) Update the value functions for i = 1, 2:
      Q_i(s^{t−1}, a^{t−1}_1, a^{t−1}_2) ← (1 − α) Q_i(s^{t−1}, a^{t−1}_1, a^{t−1}_2) + α [r^t_i + γ Q_i(s^t, a^t_1, a^t_2)]
   d) Update the estimate of the opponent policy:
      π̂_2(s^t) ← (1 − β) π̂_2(s^t) + β π^t_2
      where π^t_2 is the indicator vector
      π^t_2(a_2) = 1 if a_2 = a^t_2, and 0 otherwise
   e) Update the policy π_1(s^{t−1}) to maximize O_{s^{t−1}}(π_1(s^{t−1})), defined as:
      O_s(π_1) = π_1ᵀ Q_1(s) π̂_2(s) − σ ρ_s(π_1)
      where σ is a tuning parameter and
      ρ_s(π_1) = max_{π_2} [ π_1ᵀ Q_2(s) π_2 − π_1ᵀ Q_2(s) π̂_2(s) ]

The full algorithm for an agent i is shown in Table VII. The WoLF principle results in the agent learning quickly when it is doing poorly and cautiously when it is performing well. Changing the learning rates in such a way helps convergence by not overfitting to the other agents' changing policies. In this respect the WoLF-PHC algorithm seems attractive. Although many examples, from MGs to zero-sum and general-sum SGs, were given in [60], [61], a complete proof of the convergence properties has not been provided so far.

Rigorously speaking, the WoLF-PHC algorithm is still not a multiagent version of the PHC algorithm, since the learning factors of the other agents in the non-Markovian environment are not taken into account at all. Thus, it is only rational and reasonable if the other agents are playing stationary strategies. In addition, the convergence may become very slow when the WoLF principle is applied [48].
TABLE VII
THE WoLF-PHC ALGORITHM FOR AN AGENT i

1) Take learning rates α ∈ (0, 1] and δ_l > δ_w ∈ (0, 1]. Initialize Q(s, a), π(s, a), and C(s) for all s ∈ S and a ∈ A_i:
   Q(s, a) ← 0,   π(s, a) ← 1/|A_i|,   C(s) ← 0
2) Repeat for each state s ∈ S:
   a) Choose an action a at state s according to the mixed strategy π(s, a) with suitable exploration.
   b) Observe the reward r and the new state s'.
   c) Update the value function Q(s, a):
      Q(s, a) ← (1 − α) Q(s, a) + α (r + γ V(s'))
      where V(s') = max_{a'∈A_i} Q(s', a')
   d) Update the estimate of the average policy π̄(s, a') for all a' ∈ A_i:
      C(s) ← C(s) + 1
      π̄(s, a') ← π̄(s, a') + (π(s, a') − π̄(s, a'))/C(s)
   e) Step π(s, a) closer to the optimal policy: π(s, a) ← π(s, a) + Δ_{sa},
      where
      Δ_{sa} = −δ_{sa}                  if a ≠ argmax_{a'} Q(s, a')
      Δ_{sa} = Σ_{a'≠a} δ_{sa'}         otherwise
      δ_{sa} = min( π(s, a), δ/(|A_i| − 1) )
      δ = δ_w   if Σ_{a'∈A_i} π(s, a') Q(s, a') > Σ_{a'∈A_i} π̄(s, a') Q(s, a')
      δ = δ_l   otherwise
3) Until a terminal state is reached.
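The heart of Table VII is step 2e, where the mixed policy is nudged towards the greedy action with a step size that depends on whether the agent is currently "winning" (its policy outperforms its average policy). A sketch of that single step is given below; the variable names and example numbers are mine, and the surrounding Q and average-policy bookkeeping is assumed to follow the table.

```python
import numpy as np

def wolf_phc_policy_step(Q_s, pi_s, pi_bar_s, delta_w=0.05, delta_l=0.2):
    """One WoLF-PHC policy improvement step for a single state (Table VII, 2e).

    Q_s, pi_s, pi_bar_s are 1-D arrays over the agent's own actions: the
    Q values, current mixed policy, and running average policy at state s.
    """
    n = len(Q_s)
    greedy = int(np.argmax(Q_s))

    # "Win or Learn Fast": small step when winning, large step when losing.
    winning = pi_s @ Q_s > pi_bar_s @ Q_s
    delta = delta_w if winning else delta_l

    # Move probability mass from non-greedy actions towards the greedy one,
    # never removing more mass than an action currently holds.
    pi_new = pi_s.copy()
    moved = 0.0
    for a in range(n):
        if a == greedy:
            continue
        step = min(pi_new[a], delta / (n - 1))
        pi_new[a] -= step
        moved += step
    pi_new[greedy] += moved
    return pi_new

pi = wolf_phc_policy_step(Q_s=np.array([0.2, 1.0, 0.4]),
                          pi_s=np.array([0.3, 0.4, 0.3]),
                          pi_bar_s=np.array([1/3, 1/3, 1/3]))
print(np.round(pi, 3))   # probability mass shifts towards action 1
```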
H. Other Algorithms

Sen et al. [66] studied multiagent coordination with learning classifier systems. Action policies mapping from perceptions to actions were used by multiple agents to learn coordination strategies without relying on shared information. The experimental results provided in [66] indicated that classifier systems can be more effective than the more widely used Q-learning scheme for multiagent coordination.

In multiagent systems, a learning agent may learn faster and establish new rules for its own utility in future unseen situations if the experience and knowledge of other agents are available to it. Considering this fact and the possible benefits gained from extracting proper rules out of the other agents' knowledge, a weighted strategy sharing (WSS) method was proposed in [46] for coordination learning using the expertness of RL. In this method, each agent measures the expertness of the other agents in the team, assigns a weight to their knowledge, and learns from them accordingly. Moreover, the Q-table of one of the cooperative agents is changed randomly.
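The core of weighted strategy sharing can be viewed as a weighted blend of teammates' Q-tables, with weights derived from relative expertness. The expertness measure (accumulated reward) and the normalisation below are illustrative assumptions rather than the exact scheme of [46].

```python
import numpy as np

def weighted_strategy_sharing(q_tables, expertness, learner):
    """Blend teammates' Q-tables into the learner's table (WSS sketch).

    q_tables   : list of arrays of identical shape (|S| x |A|), one per robot
    expertness : list of scalar expertness scores, e.g. accumulated reward
    learner    : index of the robot that assimilates the others' knowledge
    """
    scores = np.asarray(expertness, dtype=float)
    scores = scores - scores.min()               # make the weights non-negative
    if scores.sum() == 0.0:
        weights = np.full(len(q_tables), 1.0 / len(q_tables))
    else:
        weights = scores / scores.sum()
    blended = sum(w * q for w, q in zip(weights, q_tables))
    # keep part of the learner's own estimate rather than overwriting it
    return 0.5 * q_tables[learner] + 0.5 * blended

q_a = np.zeros((4, 2))
q_b = np.ones((4, 2))
print(weighted_strategy_sharing([q_a, q_b], expertness=[1.0, 3.0], learner=0))
```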
In tackling the problem of coordination in multiagent systems, Boutilier [55] proposed a method for solving sequential multiagent decision problems by allowing agents to reason explicitly about specific coordination mechanisms. In this method an extension of value iteration needs to be defined in which the state space of the system is augmented with the state of the adopted coordination mechanism. This method allows the agents to reason about the short- and long-term prospects for coordination, and to make decisions to engage in or avoid coordination problems based on expected value [55].

A Bayesian approach to coordination in multiagent RL was proposed in [47], [49]. Since this method requires a very restrictive assumption, namely that the learning agent has the ability to observe the actions, states, and rewards of the other agents, it is only rational to use it in coordinative multiagent systems. The advantage of this method is that there is no need to find equilibria for obtaining a best-response-based policy, unlike the Minimax-Q or Nash-Q learning algorithms.

V. SCALING REINFORCEMENT LEARNING TO MULTI-ROBOT SYSTEMS

Multi-robot learning is a challenge of learning to act in a non-Markovian environment which contains other robots. Robots in MRSs have to interact with and adapt to their environment, as well as learn from and adapt to their counterparts, rather than only taking stationary policies.

The tasks arising from MRSs have continuous state and/or action spaces. As a result, there are difficulties in directly applying the aforementioned results on multiagent RL with finite states and actions to MRSs.

State and action abstraction approaches claim that extracting features from a large learning space is effective. These approaches include condition and behaviour extraction [2], teammate internal modelling, relationship-state estimation [5], and state vector quantisation [9]. However, all these approaches can be viewed as variants of individual Q-learning algorithms, since they model other robots either as parts of the environment or as stationary-policy holders.

One line of research on scaling reinforcement learning towards RoboCup soccer has been reported by Stone and Sutton [40]. RoboCup soccer can be viewed as a special class of MRS and is often used as a good test-bed for developing AI techniques in both single-agent and multiagent systems. The most challenging issues in MRSs also appear in RoboCup soccer, such as the large state/action space, uncertainties, etc. In [40], an approach using episodic SMDP SARSA(λ) with linear tile-coding function approximation and variable λ was designed to learn higher-level decisions in a keepaway subtask of RoboCup soccer.
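The combination just mentioned, SARSA(λ) over sparse binary features such as those produced by tile coding, amounts to keeping one weight and one eligibility trace per feature and action. The sketch below runs one episode of that update with an assumed feature function and environment interface; it only illustrates the mechanism and is not the keepaway implementation of [40].

```python
import random
import numpy as np

def linear_sarsa_lambda(env, features, n_features, n_actions, gamma=0.99,
                        lam=0.8, alpha=0.1, epsilon=0.05):
    """One episode of SARSA(lambda) with a linear, per-action value function.

    `features(state)` is an assumed function returning the indices of active
    binary features (e.g. active tiles); `env` exposes reset() and step().
    """
    w = np.zeros((n_actions, n_features))      # one weight vector per action
    e = np.zeros_like(w)                       # eligibility traces

    def q(state, action):
        return w[action, features(state)].sum()

    def choose(state):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q(state, a))

    s = env.reset()
    a = choose(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        delta = r - q(s, a)
        if not done:
            a_next = choose(s_next)
            delta += gamma * q(s_next, a_next)
        e[a, features(s)] = 1.0                # replacing traces
        w += alpha * delta * e                 # TD update on all traced weights
        e *= gamma * lam                       # decay the traces
        if not done:
            s, a = s_next, a_next
    return w
```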
Since the general theory of RL with function approximation is not yet well understood, linear SARSA(λ), which may be the best understood among current methods [40], was used in scaling reinforcement learning to RoboCup soccer. Moreover, the authors also claimed that it has advantages over off-policy methods such as Q-learning, which can be unstable with linear and other kinds of function approximation. However, they did not answer the open question of whether SARSA(λ) can fail to converge as well.

To study cooperation problems in learning many behaviours using RL, a subtask of RoboCup soccer, keepaway, was also investigated in [27] by combining SARSA(λ) and linear tile-coding function approximation. However, only single-agent RL techniques, namely SARSA(λ) with eligibility traces and tile-coding function approximation, were directly applied to a multiagent domain. As pointed out previously, such a straightforward application of single-agent RL techniques to multiagent systems has no sound theoretic foundation. Kostiadis and Hu [22] used the Kanerva coding technique [31] to produce a decision-making module for possession football in RoboCup soccer. In this application Kanerva coding was used as a generalisation method to form a feature vector from raw sensory readings, while RL uses this feature vector to learn an optimal policy. Although the results provided in [22] demonstrated that the learning approach outperformed a number of benchmark policies, including a hand-coded one, a theoretic analysis of how a series of single-agent RL techniques can work well in a multiagent domain was lacking.

The work in [2] presented a formulation of RL that enables learning in the concurrent multi-robot domain. The methodology adopted in that study makes use of behaviours and conditions to minimize the learning space. The credit assignment problem was dealt with through shaped reinforcement in the form of heterogeneous reinforcement functions and progress estimators.

Morales [64] proposed an approach to RL in robotics based on a relational representation. With this relational representation, the method can be applied over large search spaces and domain knowledge can also be incorporated. The main idea behind this approach is to represent states as sets of properties that characterize a particular state and may be common to other states. Since both states and actions are represented in terms of first-order relations in the proposed framework of [64], policies are learned over such generalized representations.

In order to deal with the state space growing exponentially in the number of team members, Touzet [8] studied robot awareness in cooperative mobile robot learning and proposed a method which requires a less cooperative mechanism, i.e., various levels of awareness rather than communication. The results illustrated in [8], with applications to the cooperative multi-robot observation of multiple moving targets, show better performance than a purely collectively learned behaviour.

In [11] a variety of methods were reviewed and used to demonstrate learning in the multi-robot domain. In that study behaviours were regarded as the underlying control representation for handling scaling in learning policies and models, as well as learning from other agents. Touzet [16] proposed a pessimistic-algorithm-based distributed lazy Q-learning for cooperative mobile robots. The pessimistic algorithm was used to compute a lower bound on the utility of executing an action in a given situation for each robot in a team. Although Q-learning with lazy learning was used, the author also neglected the important fact for the applicability of Q-learning, namely that in multi-agent systems the environment is not stationary.

Park et al. [23] studied modular-Q-learning-based multi-agent cooperation for robot soccer, where modular Q-learning was used to assign a proper action to an agent in a multiagent system. In this approach the architecture of modular Q-learning consists of learning modules and a mediator module. The function of the mediator is to select a proper action for the learning agent based on the Q-values obtained from each learning module.
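One plausible mediator rule for the modular architecture just described is "greatest mass": sum the Q values proposed by every module for each candidate action and pick the action with the largest total. The sketch below assumes each module exposes its own Q estimates for the current situation; it is an illustrative mediator, not necessarily the one used in [23].

```python
def mediator_select_action(modules, state, n_actions):
    """Greatest-mass mediator for modular Q-learning.

    `modules` is a list of objects exposing q_values(state) -> list of
    per-action Q estimates from that module's own (smaller) state space.
    """
    totals = [0.0] * n_actions
    for module in modules:
        q = module.q_values(state)
        for a in range(n_actions):
            totals[a] += q[a]
    # the mediator proposes the action with the largest combined support
    return max(range(n_actions), key=lambda a: totals[a])
```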
Although a variety of RL techniques have been developed for multiagent learning systems, very few of these techniques scale well to MRSs. On the one hand, the theory of multiagent RL systems in finite discrete domains is still under development and has not been well established. On the other hand, it is essentially very difficult to solve MRSs in the general case because of the continuous and large state space as well as action space.

VI. FUZZY LOGIC SYSTEMS AND MULTIAGENT REINFORCEMENT LEARNING

Fuzzy Logic Controllers (FLCs) can be used to generalize Q-learning over continuous state spaces. The combination of FLCs with Q-learning has been proposed as Fuzzy Q-Learning (FQL) for many single-robot applications [67]–[69].
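In its simplest form, FQL treats the fuzzy rule activations as linear features: each rule keeps one Q parameter per action, the global Q value is the firing-strength-weighted sum, and the temporal-difference error is distributed back to the rules in proportion to their activations. The sketch below uses triangular memberships over a one-dimensional state and is an illustrative formulation, not a specific FQL variant from [67]–[69].

```python
import numpy as np

def triangular(x, centre, width):
    """Triangular membership value of x for a fuzzy set at `centre`."""
    return max(0.0, 1.0 - abs(x - centre) / width)

class FuzzyQ:
    """Minimal fuzzy Q-learning over a 1-D continuous state."""
    def __init__(self, centres, n_actions, width=1.0, alpha=0.1, gamma=0.95):
        self.centres = centres
        self.width, self.alpha, self.gamma = width, alpha, gamma
        self.q = np.zeros((len(centres), n_actions))   # one row per fuzzy rule

    def strengths(self, x):
        phi = np.array([triangular(x, c, self.width) for c in self.centres])
        return phi / phi.sum() if phi.sum() > 0 else phi

    def q_values(self, x):
        return self.strengths(x) @ self.q              # weighted sum over rules

    def update(self, x, a, reward, x_next):
        phi = self.strengths(x)
        target = reward + self.gamma * np.max(self.q_values(x_next))
        delta = target - self.q_values(x)[a]
        self.q[:, a] += self.alpha * delta * phi       # credit rules by activation

fq = FuzzyQ(centres=[0.0, 1.0, 2.0], n_actions=2)
fq.update(x=0.4, a=1, reward=1.0, x_next=0.6)
print(np.round(fq.q_values(0.4), 3))
```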
In [70] a modular-fuzzy cooperative algorithm for multiagent systems was presented, taking advantage of a modular architecture, an internal model of the other agent, and fuzzy logic. In this algorithm, the internal model is used to estimate the agent's own action and evaluate the other agents' actions. To overcome the problem of the huge dimension of the state space, fuzzy logic was used to map from input fuzzy sets representing the state space of each learning module to output fuzzy sets denoting the action space. A fuzzy rule base for each learning module was built through Q-learning, but without providing any convergence proof.

Kilic and Arslan [71] developed a Minimax fuzzy Q-learning algorithm for cooperative multi-agent systems. In this method, the learning agent always needs to observe the actions the other agents take and uses Minimax Q-learning to update fuzzy Q-values based on fuzzy state and fuzzy goal representations. It should be noted that the Minimax Q-learning in [71] is in the sense of fuzzy operators (i.e., max and min) and is totally different from the Minimax-Q learning of Littman [33]. Similarly to [70], no proof was given to guarantee the optimal convergence of Minimax fuzzy Q-learning.

A fuzzy game theoretic approach to multiagent coordination was presented in [72], based on the consideration that utility values are usually approximate and the differences between utility values are somewhat vague. Thus, a fuzzy game theoretic approach may be useful when there are uncertainties in utility values. To establish a framework of fuzzy games, a series of notions, including fuzzy dominance relations, fuzzy Nash equilibria, and fuzzy policies, were defined in [72] under both fuzzy logic theory and game theory. It was also shown in [72] that a fuzzy strategy can outperform a mixed strategy of traditional game theory in tackling the case of multiple equilibria, which is a very challenging issue in game theory. The study in [72] merely focused on one-stage-policy fuzzy games, instead of RL. Thus, a combination of the fuzzy game theoretic approach with popular RL techniques is quite possible for multi-agent/robot systems.

A convergence proof appears to be very difficult, particularly for multiagent reinforcement learning with fuzzy generalizations. More recently, a convergence proof for single-agent fuzzy reinforcement learning (FRL) was provided in [73]. However, one can find that the example presented in [73] does not reflect the theoretic work of that study at all. An obvious fact is that triangular membership functions were used in the experiment instead of the Gaussian membership functions which are the basis of the theoretic work in [73]. Therefore the proving techniques and outcomes will be very difficult to extend to the domains of multiagent reinforcement learning with fuzzy logic generalizations.
Furthermore, Watkins [29] also pointed out that Q-learning may not converge correctly for representations other than a look-up table representation of the Q-function.

VII. MAIN CHALLENGES

MRSs often exhibit all of the challenges of multiagent learning systems, such as continuous state and action spaces, uncertainties, and a nonstationary environment. Since the algorithms reviewed in Section IV require the enumeration of states, either for policies or for value functions, there is a major limitation in scaling the established multiagent reinforcement learning outcomes to MRSs.

Most SGs studied in multiagent RL have a simple agent-based background where players execute perfect actions, observe complete (or partially observed) states, and have full knowledge of other players' actions, states and rewards. This is not true for most MRSs. It is unfeasible for robots to completely obtain other players' information, especially in competitive games, since opponents do not actively broadcast their information to share with the other players.

In addition, adversarial opponents may not act rationally. Accordingly, it is difficult to find Nash equilibria for the Nash-equilibrium-based approaches or to model their dynamics for the fictitious-play-based approaches.

Taking into account the state of the art in multiagent learning systems, there is particular difficulty in scaling the established (or at least partially recognized) multiagent RL algorithms, such as Minimax-Q learning, Nash-Q learning, etc., to MRSs with large and continuous state and action spaces. On the one hand, most theoretic works on multiagent systems merely focus on domains with small finite state and action sets. On the other hand, there is still a lack of sound theoretic grounds which can be used to guide the scaling up of multiagent RL algorithms to MRSs. As a result, the learning performance (such as convergence, efficiency, and stability) cannot be guaranteed when approximation and generalization techniques are applied.

One important fact is that most of the multiagent RL algorithms, such as Minimax-Q learning and Nash-Q learning, are value-function-based iteration methods. Thus, to apply these techniques to a continuous system the value function has to be approximated, either by using discretization or by general approximators (such as neural networks, polynomial functions, fuzzy logic, etc.). However, some researchers have pointed out that the combination of DP methods with function approximators may produce unstable or divergent results even when applied to some very simple problems; see [74] and the references therein.

VIII. FUTURE RESEARCH DIRECTIONS

A. Coordination Games or Team Cooperation

Due to the aforementioned difficulties, a possible opportunity for RL in MRSs is to learn robot coordination. Robots in a team may learn to work co-operatively or share their learned experience to accelerate their learning processes through their limited physical communication or observation abilities. These robots have common interests or identical payoffs. Games with these characteristics are referred to as co-ordination games. Littman [42] has proved the convergence of Q-learning in co-ordination games. However, it is hard to apply in practice, because the conditions are difficult to verify in advance, since games may contain adversarial equilibria and co-ordination equilibria simultaneously even if the Friend-or-Foe idea were adopted. One special case is the team game, where all players have the same Q values. Littman also showed that the conditions are easily satisfied in team games, since there is only one MDP under these conditions. But the phenomenon of multiple equilibria in coordination games questions its practical value and remains a challenging issue. Thus an ongoing research question is how to select one from multiple Nash equilibria.
equilibria. Furthermore, the exact same Q values are to learn coordinated behaviors. To cope with the
hard to maintain in each physical robot because of huge state space, function approximation and gen-
their sensory and executive uncertainties. Therefore, eralization techniques were used in their work. Un-
more research needs to be performed to gain a clear fortunately, the proof of convergence with function
understanding of Q-learning convergence in the co- approximation and generalization techniques was
ordination games. not provided at all in [75]. Currently, a generic
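As a small illustration of the equilibrium-selection problem (our example, not taken from [42]), consider two robots with a common payoff who must each choose one of two joint behaviours:

```latex
% Illustrative two-robot coordination game with identical payoffs (our example).
% Rows are robot 1's actions, columns are robot 2's; each cell is the common payoff.
\[
R(a_1, a_2) =
\begin{array}{c|cc}
             & \text{pass} & \text{shoot} \\ \hline
\text{pass}  & 10 & 0 \\
\text{shoot} & 0  & 2
\end{array}
\]
```

Both (pass, pass) and (shoot, shoot) are Nash equilibria, but only the first is payoff-dominant; two independent learners that converge to different equilibria miscoordinate and receive zero, which is exactly the selection problem discussed above.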
B. State and Action Abstraction of MRSs

Incomplete information, large learning spaces, and uncertainty are major obstacles to learning in MRSs. Learning in Behaviour-Based Robotics (BBR) can effectively reduce the search space in both size and dimension and handle uncertainties locally. The action space is transformed from the continuous space of control inputs into a limited discrete set of behaviours. However, proving convergence for algorithms that rely on state and action abstraction of MRSs remains a very challenging problem.

There are some advances in state and action abstraction for MRSs; nevertheless, there are still no completely satisfactory solutions for coping with the continuous state and action spaces that occur in MRS domains; see, for example, [64].
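A minimal sketch of what such an abstraction might look like is given below; the behaviour set, the sensor thresholds, and the two-feature state are illustrative assumptions rather than a recipe from the surveyed work.

```python
import numpy as np

# Illustrative behaviour-level abstraction for one robot: continuous sensing is
# coarsened into a small discrete state, and the learner selects among a few
# pre-programmed basic behaviours instead of raw motor commands.
BEHAVIOURS = ["go_to_goal", "avoid_obstacle", "follow_teammate"]  # assumed behaviour set

def abstract_state(nearest_obstacle_m, nearest_teammate_m):
    """Map continuous range readings (in metres) to a coarse discrete state."""
    obstacle = "near" if nearest_obstacle_m < 0.5 else "far"
    teammate = "near" if nearest_teammate_m < 2.0 else "far"
    return (obstacle, teammate)

def greedy_behaviour(q_table, state):
    """Select the behaviour with the highest Q-value for the abstracted state."""
    q_values = [q_table.get((state, b), 0.0) for b in BEHAVIOURS]
    return BEHAVIOURS[int(np.argmax(q_values))]
```

The abstraction shrinks the learning problem to a handful of state-behaviour pairs, but, as noted above, it also makes the convergence analysis of the resulting learner an open question.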
systems, recently there have been a great deal of
C. Generalization and Approximation efforts towards directly developing continuous RL
When the state and action spaces of the sys- techniques for the complex continuous systems.
tem are small and finite discrete, the lookup table Based on the theoretic framework of viscosity solu-
method is generally feasible. However, in MRSs, tions, Munos [74] conducted a study of RL for the
the state and action spaces are often very huge or continuous state-space and time control problems.
continuous, thus the lookup table method seems In this continuous case, the value-function will sat-
inappropriate. To solve this problem, besides the isfy the Hamilton-Jacobi-Bellman (HJB) equation,
state and action abstraction, function approximation which is a nonlinear first or second order equation.
and generalization appears to be another feasible It is well known that solving the HJB equation is a
solution. For learning in a partially observable and very hard task. To solve the HJB equation, Munos
nonstationary environment in the area of multiagent used a powerful framework of viscosity solutions
systems, Abul et al. [75] presented two multia- and showed that the unique value function solution
gent based domain independent coordination mech- to the HJB equation can be found in the sense of
anisms,i.e. perceptual coordination mechanism and viscosity solutions.
observing coordination mechanism. The advantage A continuous Q-learning method was presented
of their approach is that multiple agents do not in [77] by using an incremental topology preserving
require explicit communication among themselves map to partition the input space and the incorpora-

19
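For reference, the update analyzed in this line of work is the standard TD(0) rule with a linear approximator (notation ours): the value estimate is linear in a fixed feature vector, and the weights move along the temporal-difference error.

```latex
% Standard TD(0) with linear function approximation (notation ours).
V_\theta(s) = \theta^\top \phi(s), \qquad
\delta_t = r_{t+1} + \gamma\, \theta_t^\top \phi(s_{t+1}) - \theta_t^\top \phi(s_t), \qquad
\theta_{t+1} = \theta_t + \alpha_t\, \delta_t\, \phi(s_t)
```

Results of the kind given in [76] show that, under conditions on the features, the sampling process, and the step sizes, these iterates converge almost surely.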
D. Continuous Reinforcement Learning

Since there are many difficulties in extending RL from discrete finite domains to continuous learning systems, there has recently been a great deal of effort towards directly developing continuous RL techniques for complex continuous systems. Based on the theoretical framework of viscosity solutions, Munos [74] conducted a study of RL for continuous state-space, continuous-time control problems. In this continuous case, the value function satisfies the Hamilton-Jacobi-Bellman (HJB) equation, which is a nonlinear first- or second-order partial differential equation. It is well known that solving the HJB equation is a very hard task. To solve it, Munos used the powerful framework of viscosity solutions and showed that the unique value-function solution to the HJB equation can be found in the sense of viscosity solutions.
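For orientation, a standard first-order HJB equation for a deterministic, discounted, continuous-time problem is shown below in our own notation (state x, control u, dynamics dx/dt = f(x, u), reward rate r, discount rate rho > 0); this is meant only to indicate the structure of the equation discussed in [74], not to reproduce its exact formulation.

```latex
% Standard first-order HJB equation for a deterministic discounted problem (our notation).
\rho\, V(x) = \sup_{u \in U} \Big[ r(x, u) + \nabla_x V(x) \cdot f(x, u) \Big]
% In the stochastic case a second-order diffusion term is added, which is why the
% equation is described above as first- or second-order.
```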
A continuous Q-learning method was presented in [77], using an incremental topology-preserving map to partition the input space and incorporating bias to initialize the learning process. The resulting continuous action is an average of the discrete actions of the winning unit, weighted by their Q-values [77]. More interestingly, the authors also reported experimental results in robotics indicating that the continuous Q-learning method works better than the standard discrete-action version of Q-learning in terms of both asymptotic performance and learning speed. This continuous Q-learning method still focuses on single-agent systems. Hence a version of the continuous Q-learning method for multiagent systems is expected accordingly.
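The action-blending idea can be sketched as follows; the softmax-style weighting, the candidate action set, and the single winning unit are our illustrative simplifications and only approximate the scheme actually defined in [77].

```python
import numpy as np

def continuous_action(discrete_actions, q_values, temperature=1.0):
    """Blend a unit's discrete candidate actions into one continuous action.

    discrete_actions -- array of shape (n, action_dim): the winning unit's candidates
    q_values         -- array of shape (n,): their current Q-values
    The softmax weighting is an illustrative choice; [77] uses its own weighting rule.
    """
    prefs = np.asarray(q_values, dtype=float) / temperature
    weights = np.exp(prefs - prefs.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ np.asarray(discrete_actions)  # convex combination of the actions

# Example: blending three candidate steering rates (rad/s) for a mobile robot.
print(continuous_action([[-0.5], [0.0], [0.5]], [0.2, 1.0, 0.4]))
```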
Another ongoing line of research for the continuous case is continuous RL for feedback control systems [78], [79]. In [78] a continuous RL algorithm was developed and applied to a control problem involving the refinement of a Proportional-Integral (PI) controller. More interestingly, Tu [78] claimed that, according to his results, the continuous RL algorithm outperforms the discrete RL algorithms. In [79] continuous Q-learning was studied in order to apply RL methodology to the control of real systems; Linear Quadratic Regulation (LQR) techniques and Q-learning were combined for both linear and nonlinear continuous control systems.
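Part of the appeal of the LQR setting is that the Q-function has a known parametric form, so Q-learning reduces to estimating a matrix; the sketch below uses our own notation and states the standard structure rather than the specific algorithm of [79].

```latex
% Quadratic Q-function structure exploited in LQR-style Q-learning (our notation):
% linear dynamics x_{k+1} = A x_k + B u_k with a quadratic stage cost.
Q(x, u) =
\begin{bmatrix} x \\ u \end{bmatrix}^{\top}
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x \\ u \end{bmatrix},
\qquad
u^{\ast}(x) = -\,H_{uu}^{-1} H_{ux}\, x
```

Estimating the matrix H from data therefore yields a linear feedback gain without an explicit plant model, which is broadly the combination of LQR and Q-learning referred to above.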
IX. OTHER RELATED WORK

A survey of the field of RL was given by Kaelbling et al. [80] from a computer-science perspective. Some central issues of RL, including trading off exploration and exploitation, learning from delayed reinforcement, and making use of generalization, were discussed in that investigation.

Stone and Veloso [40] made a survey of multiagent systems from a machine learning perspective and presented a series of general multiagent scenarios. Although their investigation did not focus exclusively on robotic systems, robotic soccer was used as a test bed and several robotic multiagent systems were discussed. From the robot-soccer perspective, Kim and Vadakkepat [21] gave a survey on multiagent systems and cooperative robotics. Their paper covered cooperative robotics and related issues of group architecture, resource conflict, and geometric problems. However, few theoretical research advances were mentioned.

A technical report on SGs with multiple learning players in multiagent RL systems was given by Chalkiadakis [49]. However, there is almost no survey on the scaling up of multiagent RL to MRSs.

Shoham and Powers [58] made a very critical survey of the recent work in AI on multiagent RL. They argued that the work on learning in SGs is flawed, even though there has been a lot of research in this field under the framework of SGs. The fundamental flaw, they thought, is a lack of clarity about the problem or problems being addressed. In their work they questioned why the focus is on equilibria, and they commented that the results concerning convergence of Nash-Q learning are quite awkward. Particularly, if multiple optimal equilibria exist then the agents need an oracle to coordinate their choices in order to converge to a Nash equilibrium, which begs the question of why learning should be used for coordination at all [58]. Although they made many negative comments on the current work on RL in SGs, they also agreed that these results are unproblematic for the cases of zero-sum SGs and team or pure coordination games with common payoff functions.

X. CONCLUDING REMARKS

Recently there has been growing interest in scaling multiagent RL to MRSs. Although RL seems to be a good option for learning in multiagent systems, the continuous state and action spaces often hamper its applicability in MRSs. Fuzzy logic methodology seems to be a candidate for dealing with the approximation and generalization issues in the RL of multiagent systems. However, this scaling approach still remains open. Particularly, there is a lack of theoretical grounds which can be used for proving the convergence and predicting the performance of fuzzy logic-based multiagent RL (such as fuzzy multiagent Q-learning).
For cooperative robot systems, although some research outcomes are now available for some special cases, there are also some difficulties (such as multiple equilibria and the selection of payoff structure, etc.) in directly applying them to a practical MRS, e.g., a robotic soccer system.

This paper gave a survey of multiagent RL in MRSs. The main objective of this work was to review some important advances in this field, though still not completely. Some challenging problems and promising research directions were provided and discussed. Although this paper cannot provide a complete and exhaustive survey of multiagent RL in MRSs, we still believe that it will help readers more clearly understand the existing work and the challenging issues in this ongoing research field.

ACKNOWLEDGMENT

This research is funded by the Engineering and Physical Sciences Research Council (EPSRC) under grant GR/S45508/01 (2003-2005).

REFERENCES

[1] Y. U. Cao, A. S. Fukunaga, and A. B. Kahng, “Cooperative mobile robotics: antecedents and directions,” Autonomous Robots, vol. 4, pp. 1–23, 1997.
[2] M. J. Matarić, “Reinforcement learning in the multi-robot domain,” Autonomous Robots, vol. 4, pp. 73–83, 1997.
[3] F. Michaud and M. J. Matarić, “Learning from history for behavior-based mobile robots in non-stationary conditions,” Autonomous Robots, vol. 5, pp. 335–354, 1998.
[4] T. Balch and R. C. Arkin, “Behavior-based formation control for multirobot teams,” IEEE Transactions on Robotics and Automation, vol. 14, no. 6, pp. 926–939, December 1998.
[5] M. Asada, E. Uchibe, and K. Hosoda, “Co-operative behaviour acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development,” Artificial Intelligence, vol. 110, pp. 275–292, 1999.
[6] M. Wiering, R. P. Salustowicz, and J. Schmidhuber, “Reinforcement learning soccer teams with incomplete world models,” Autonomous Robots, vol. 7, pp. 77–88, 1999.
[7] S. V. Zwaan, J. A. A. Moreira, and P. U. Lima, “Cooperative learning and planning for multiple robots,” 2000. [Online]. Available: citeseer.nj.nec.com/299778.html
[8] C. F. Touzet, “Robot awareness in cooperative mobile robot learning,” Autonomous Robots, vol. 8, pp. 87–97, 2000.
[9] F. Fernandez and L. E. Parker, “Learning in large co-operative multi-robot domains,” International Journal of Robotics and Automation, vol. 16, no. 4, pp. 217–226, 2001.
[10] J. Liu and J. Wu, Multi-Agent Robotic Systems. CRC Press, 2001.
[11] M. J. Matarić, “Learning in behavior-based multi-robot systems: policies, models, and other agents,” Journal of Cognitive Systems Research, vol. 2, pp. 81–93, 2001.
[12] M. Bowling and M. Veloso, “Simultaneous adversarial multi-robot learning,” in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, August 2003.
[13] I. H. Elhajj, A. Goradia, N. Xi, et al., “Design and analysis of internet-based tele-coordinated multi-robot systems,” Autonomous Robots, vol. 15, pp. 237–254, 2003.
[14] L. Iocchi, D. Nardi, M. Piaggio, and A. Sgorbissa, “Distributed coordination in heterogeneous multi-robot systems,” Autonomous Robots, vol. 15, pp. 155–168, 2003.
[15] M. J. Matarić, G. S. Sukhatme, and E. H. Østergaard, “Multi-robot task allocation in uncertain environments,” Autonomous Robots, vol. 14, pp. 255–263, 2003.
[16] C. F. Touzet, “Distributed lazy Q-learning for cooperative mobile robots,” International Journal of Advanced Robotic Systems, vol. 1, no. 1, pp. 5–13, 2004.
[17] T. Fujii, Y. Arai, H. Asama, and I. Endo, “Multilayered reinforcement learning for complicated collision avoidance problems,” in Proceedings of the IEEE International Conference on Robotics and Automation, 1998, pp. 2186–2191.
[18] R. P. Salustowicz, M. A. Wiering, and J. Schmidhuber, “Learning team strategies: soccer case studies,” Machine Learning, vol. 33, pp. 263–282, 1998.
[19] S. Sen, “Multiagent systems: milestones and new horizons.” [Online]. Available: citeseer.ist.psu.edu/294585.html
[20] G. Weiß and P. Dillenbourg, What is ‘multi’ in multi-agent learning?, 1999, ch. 4 in Collaborative-learning: Cognitive. [Online]. Available: wwwbrauer.in.tum.de/~weissg/Docs/weiss-dillenbourg.pdf
[21] J.-H. Kim and P. Vadakkepat, “Multi-agent systems: a survey from the robot-soccer perspective,” International Journal of Intelligent Automation and Soft Computing, vol. 6, no. 1, pp. 3–17, 2000.
[22] K. Kostiadis and H. Hu, “KaBaGe-RL: Kanerva-based generalisation and reinforcement learning for possession football,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hawaii, 2001.
[23] K.-H. Park, Y.-J. Kim, and J.-H. Kim, “Modular Q-learning based multi-agent cooperation for robot soccer,” Robotics and Autonomous Systems, vol. 35, pp. 109–122, 2001.
[24] A. Merke and M. Riedmiller, “A reinforcement learning approach to robotic soccer,” in RoboCup 2001, 2001, pp. 435–440.
[25] S. Maes, K. Tuyls, and B. Manderick, “Reinforcement learning in large state spaces: simulated robotic soccer as a testbed,” 2002. [Online]. Available: http://citeseer.ist.psu.edu/502799.html
[26] W. D. Smart and L. P. Kaelbling, “Effective reinforcement learning for mobile robots,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2002. [Online]. Available: www.ai.mit.edu/people/lpk/papers/icra2002.pdf
[27] S. Valluri and S. Babu, “Reinforcement learning for keepaway soccer problem,” 2002. [Online]. Available: http://www.cis.ksu.edu/~babu/final/html/ProjectReport.htm
[28] M. Wahab, “Reinforcement learning in multiagent systems,” 2003. [Online]. Available: http://www.cs.mcgill.ca/~mwahab/RL%20in%20MAS.pdf
[29] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, pp. 279–292, 1992.
[30] M. E. Harmon and S. S. Harmon, “Reinforcement learning: a tutorial,” 1996. [Online]. Available: http://citeseer.ist.psu.edu/harmon96reinforcement.html
[31] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press, 1998.
[32] M. Tan, “Multi-agent reinforcement learning: independent vs. cooperative agents,” in Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, 1993, pp. 330–337.
[33] M. L. Littman, “Markov games as a framework for multi-agent learning,” in Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, California, 1994, pp. 157–163.
[34] M. L. Littman and C. Szepesvári, “A generalized reinforcement-learning model: convergence and applications,” in Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, July 3-6, 1996, pp. 310–318.
[35] C. Szepesvári and M. L. Littman, “Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms,” Department of Computer Science, Brown University, Technical report CS-96-11, November 1996. [Online]. Available: http://www.cs.duke.edu/~mlittman/docs/unrefer.html
[36] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” in Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998, pp. 746–752.
[37] J. Hu and M. P. Wellman, “Multiagent reinforcement learning in stochastic games,” 1999. [Online]. Available: citeseer.ist.psu.edu/hu99multiagent.html
[38] R. Sun and D. Qi, “Rationality assumptions and optimality of co-learning,” in Design and Applications of Intelligent Agents, Third Pacific Rim International Workshop on Multi-Agents, PRIMA 2000, ser. Lecture Notes in Computer Science, C. Zhang and V.-W. Soo, Eds., vol. 1881. Springer, 2000, pp. 61–75.
[39] C. Szepesvári and M. L. Littman, “A unified analysis of value-function-based reinforcement learning algorithms,” Neural Computation, vol. 11, no. 8, pp. 2017–2059, 1999.
[40] P. Stone and M. Veloso, “Multiagent systems: a survey from a machine learning perspective,” Autonomous Robots, vol. 8, pp. 345–383, 2000.
[41] J. Hu and M. P. Wellman, “Learning about other agents in a dynamic multiagent system,” Journal of Cognitive Systems Research, vol. 2, pp. 67–79, 2001.
[42] M. L. Littman, “Value-function reinforcement learning in Markov games,” Journal of Cognitive Systems Research, vol. 2, pp. 55–66, 2001.
[43] M. L. Littman and P. Stone, “Leading best-response strategies in repeated games,” in The 17th Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms, 2001.
[44] M. L. Littman, “Friend-or-foe Q-learning in general-sum games,” in Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, 2001, pp. 322–328.
[45] F. A. Dahl, “The lagging anchor algorithm: reinforcement learning in two-player zero-sum games with imperfect information,” Machine Learning, vol. 49, pp. 5–37, 2002.
[46] M. N. Ahmadabadi and M. Asadpour, “Expertness based cooperative Q-learning,” IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 32, no. 1, pp. 66–76, February 2002.
[47] G. Chalkiadakis, “Multiagent reinforcement learning: stochastic games with multiple learning players,” Department of Computer Science, University of Toronto, Technical report, March 2003. [Online]. Available: www.cs.toronto.edu/~gehalk/DepthReport/DepthReport.ps
[48] N. Suematsu and A. Hayashi, “A multiagent reinforcement learning algorithm using extended optimal response,” in Proceedings of the First International Joint Conference on Autonomous Agents & Multiagent Systems, Bologna, Italy, July 15-19, 2002, pp. 370–377.
[49] G. Chalkiadakis and C. Boutilier, “Multiagent reinforcement learning: theoretical framework and an algorithm,” in Proceedings of the Second International Joint Conference on Autonomous Agents & Multiagent Systems (AAMAS), Melbourne, Australia, July 14-18, 2003, pp. 709–716.
[50] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[51] S. Singh, T. Jaakkola, M. L. Littman, et al., “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine Learning, vol. 39, pp. 287–308, 2000.
[52] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory. London: Academic Press, 1982.
[53] B. Banerjee and J. Peng, “Adaptive policy gradient in multiagent learning,” in Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems. ACM Press, 2003, pp. 686–692.
[54] D. Fudenberg and J. Tirole, Game Theory. London: MIT Press, 1991.
[55] C. Boutilier, “Sequential optimality and coordination in multiagent systems,” in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999, pp. 478–485.
[56] M. G. Lagoudakis and R. Parr, “Value function approximation in zero-sum Markov games,” 2002. [Online]. Available: http://www.cs.duke.edu/~mgl/papers/PDF/uai2002.pdf
[57] X. Li, “Refining basis functions in least-square approximation of zero-sum Markov games,” 2003. [Online]. Available: http://www.xiaolei.org/research/li03basis.pdf
[58] Y. Shoham and R. Powers, “Multi-agent reinforcement learning: a critical survey,” 2003. [Online]. Available: http://www.stanford.edu/~grenager/MALearning_ACriticalSurvey_2003_0516.pdf
[59] J. Hu and M. P. Wellman, “Multiagent reinforcement learning: theoretical framework and an algorithm,” in Proceedings of the Fifteenth International Conference on Machine Learning, San Francisco, California, 1998, pp. 242–250.
[60] M. Bowling, “Multiagent learning in the presence of agents with limitations,” Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, May 2003, CMU-CS-03-118.
[61] M. H. Bowling and M. M. Veloso, “Multiagent learning using a variable learning rate,” Artificial Intelligence, vol. 136, no. 2, pp. 215–250, 2002. [Online]. Available: citeseer.ist.psu.edu/bowling02multiagent.html
[62] B. Banerjee and J. Peng, “Convergent gradient ascent in general-sum games,” in Proceedings of the 13th European Conference on Machine Learning, August 13-19, 2002, pp. 686–692.
[63] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems 12. MIT Press, pp. 1057–1063.
[64] E. F. Morales, “Scaling up reinforcement learning with a relational representation,” in Workshop on Adaptability in Multi-Agent Systems, The First RoboCup Australian Open 2003 (AORC-2003), Sydney, Australia, January 31, 2003.
[65] D. Fudenberg and D. K. Levine, The Theory of Learning in Games. Cambridge, Massachusetts: MIT Press, 1999.
[66] S. Sen and M. Sekaran, “Multiagent coordination with learning classifier systems,” in Proceedings of the IJCAI Workshop on Adaption and Learning in Multi-Agent Systems, G. Weiß and S. Sen, Eds., vol. 1042. Springer Verlag, 1996, pp. 218–233. [Online]. Available: citeseer.ist.psu.edu/sen96multiagent.html
[67] H. R. Berenji and D. A. Vengerov, “Cooperation and coordination between fuzzy reinforcement learning agents in continuous-state partially observable Markov decision processes,” in Proceedings of the 8th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’99), 2000.
[68] ——, “Advantages of cooperation between reinforcement learning agents in difficult stochastic problems,” in Proceedings of the 9th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’00), 2000.
[69] H. R. Berenji and D. Vengerov, “Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms,” Intelligent Inference Systems Corp., Sunnyvale, CA, Technical report IIS-00-10, October 2000. [Online]. Available: http://www.iiscorp.com/projects/multi-agent/tech_rep_iis_00_10.pdf
[70] I. Gültekin and A. Arslan, “Modular-fuzzy cooperative algorithm for multi-agent systems,” in Advances in Information Systems, Second International Conference, ADVIS 2002, Izmir, Turkey, October 23-25, 2002, Proceedings, ser. Lecture Notes in Computer Science, T. M. Yakhno, Ed., vol. 2457. Springer, 2002, pp. 255–263.
[71] A. Kilic and A. Arslan, “Minimax fuzzy Q-learning in cooperative multi-agent systems,” in Advances in Information Systems, Second International Conference, ADVIS 2002, Izmir, Turkey, October 23-25, 2002, Proceedings, ser. Lecture Notes in Computer Science, T. M. Yakhno, Ed., vol. 2457. Springer, 2002, pp. 264–272.
[72] S.-H. Wu and V.-W. Soo, “Fuzzy game theoretic approach to multi-agent coordination,” in Multiagent Platforms, First Pacific Rim International Workshop on Multi-Agents, PRIMA ’98, Singapore, November 23, 1998, Selected Papers, ser. Lecture Notes in Computer Science, T. Ishida, Ed., vol. 1599. Springer, 1999, pp. 76–87.
[73] H. R. Berenji and D. Vengerov, “A convergent Actor-Critic-based FRL algorithm with application to power management of wireless transmitters,” IEEE Transactions on Fuzzy Systems, vol. 11, no. 4, pp. 478–485, August 2003.
[74] R. Munos, “A study of reinforcement learning in the continuous case by the means of viscosity solutions,” Machine Learning, vol. 40, pp. 265–299, 2000.
[75] O. Abul, F. Polat, and R. Alhajj, “Multiagent reinforcement learning using function approximation,” IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 30, no. 4, pp. 485–497, November 2000.
[76] V. Tadić, “On the convergence of temporal-difference learning with linear function approximation,” Machine Learning, vol. 42, pp. 241–267, 2001.
[77] J. Del R. Millán, D. Posenato, and E. Dedieu, “Continuous-action Q-learning,” Machine Learning, vol. 49, no. 2-3, 2002.
[78] J. Tu, “Continuous reinforcement learning for feedback control systems,” Master’s thesis, Computer Science Department, Colorado State University, Fort Collins, Colorado, 2001. [Online]. Available: http://www.engr.colostate.edu/nnhvac/papers/jilin-tu.ps.gz
[79] S. H. G. ten Hagen, “Continuous state space Q-learning for control of nonlinear systems,” Ph.D. dissertation, Faculty of Science, IAS, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, 2001.
[80] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: a survey,” Journal of Artificial Intelligence Research, vol. 4, no. 4, pp. 237–285, 1996.
