Multiagent Reinforcement Learning

(MARL)
September 27, 2013 - ECML'13

Presenters

! Daan Bloembergen
! Daniel Hennes
! Michael Kaisers
! Peter Vrancx

September 27, 2013 - ECML MARL Tutorial


.
Schedule

! Fundamentals of multi-agent reinforcement learning


! 15:30 - 17:00, Daan Bloembergen and Daniel Hennes
! Dynamics of learning in strategic interactions
! 17:15 - 17:45, Michael Kaisers
! Scaling multi-agent reinforcement learning
! 17:45 - 18:45, Peter Vrancx

September 27, 2013 - ECML MARL Tutorial


.

Who are you?

We would like to get to know our audience!

September 27, 2013 - ECML MARL Tutorial


.
Fundamentals of Multi-Agent
Reinforcement Learning
Daan Bloembergen and Daniel Hennes

September 27, 2013 - ECML MARL Tutorial


.

Outline (1)

Single Agent Reinforcement Learning


! Markov Decision Processes
! Value Iteration
! Policy Iteration
! Algorithms
! Q-Learning
! Learning Automata

Multiagent Reinforcement Learning - 2/60


.
Outline (2)

Multiagent Reinforcement Learning


! Game Theory
! Markov Games
! Value Iteration
! Algorithms
! Minimax-Q Learning
! Nash-Q Learning
! Other Equilibrium Learning Algorithms
! Policy Hill-Climbing

Multiagent Reinforcement Learning - 3/60


.

Part I:
Single Agent Reinforcement Learning

Multiagent Reinforcement Learning - 4/60


.
Richard S. Sutton and Andrew G. Barto
Reinforcement Learning: An
Introduction
MIT Press, 1998

Available on-line for free!

Multiagent Reinforcement Learning - 5/60


.

Why reinforcement learning?

Based on ideas from psychology


! Edward Thorndike's law of effect
! Satisfaction strengthens behavior,
discomfort weakens it
! B.F. Skinner's principle of
reinforcement
! Skinner Box: train animals by
providing (positive) feedback

Learning by interacting with the


environment

Multiagent Reinforcement Learning - 6/60


.
Why reinforcement learning?
Control theory
! Design a controller to minimize some measure of a dynamical system's behavior
! Richard Bellman
! Use system state and value functions (optimal return)
! Bellman equation
! Dynamic programming
! Solve optimal control problems by solving the Bellman
equation
These two threads came together in the 1980s, producing the
modern field of reinforcement learning

Multiagent Reinforcement Learning - 7/60


.

The RL setting

[Figure: the agent-environment loop; the agent sends action a(t) to the environment and receives the next state s(t+1) and reward r(t+1)]

! Learning from interaction
! Learning about, from, and while interacting with an external environment
! Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

Multiagent Reinforcement Learning - 8/60
.
Key features of RL

! Learner is not told which action to take


! Trial-and-error approach
! Possibility of delayed reward
! Sacrifice short-term gains for greater long-term gains
! Need to balance exploration and exploitation
! In between supervised and unsupervised learning

Multiagent Reinforcement Learning - 9/60


.

The agent-environment interface

Agent interacts at discrete time steps t = 0, 1, 2, . . .

! Observes state s_t ∈ S
! Selects action a_t ∈ A(s_t)
! Obtains immediate reward r_{t+1} ∈ ℝ
! Observes resulting state s_{t+1}

[Figure: the agent-environment loop and the resulting trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, . . .]

Multiagent Reinforcement Learning - 10/60


.
Elements of RL

! Time steps need not refer to fixed intervals of real time


! Actions can be
! low level (voltage to motors)
! high level (go left, go right)
! "mental" (shift focus of attention)
! States can be
! low level "sensations" (temperature, (t, v) coordinates)
! high level abstractions, symbolic
! subjective, internal ("surprised", "lost")
! The environment is not necessarily known to the agent

Multiagent Reinforcement Learning - 11/60


.

Elements of RL

! State transitions are


! changes to the internal state of the agent
! changes in the environment as a result of the agent's action
! can be nondeterministic
! Rewards are
! goals, subgoals
! duration
! ...

Multiagent Reinforcement Learning - 12/60


.
Learning how to behave

! The agent's policy π at time t is
! a mapping from states to action probabilities
! π_t(s, a) = Pr(a_t = a | s_t = s)

! Reinforcement learning methods specify how the agent


changes its policy as a result of experience
! Roughly, the agent's goal is to get as much reward as it
can over the long run

Multiagent Reinforcement Learning - 13/60


.

The objective

Suppose the sequence of rewards after time t is

r_{t+1}, r_{t+2}, r_{t+3}, . . .

! The goal is to maximize the expected return E{R_t} for each time step t
! Episodic tasks naturally break into episodes, e.g., plays of a game, trips through a maze

R_t = r_{t+1} + r_{t+2} + . . . + r_T

Multiagent Reinforcement Learning - 14/60


.
The objective

! Continuing tasks do not naturally break up into episodes
! Use discounted return instead of total reward

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + . . . = Σ_{k=0}^∞ γ^k r_{t+k+1}

where γ, 0 ≤ γ ≤ 1, is the discount factor such that

shortsighted 0 ← γ → 1 farsighted
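To make the discount concrete, here is a minimal Python sketch (not from the slides; the reward sequence and γ are arbitrary example values) that computes the discounted return of a finite reward sequence:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence.
def discounted_return(rewards, gamma):
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# Arbitrary example: rewards observed after time t, gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```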

Multiagent Reinforcement Learning - 15/60


.

Example: pole balancing

! As an episodic task, where each episode ends upon failure
! reward = +1 for each step before failure
! return = number of steps before failure
! As a continuing task with discounted return
! reward = -1 upon failure, 0 otherwise
! return = -γ^k, for k steps before failure
! In both cases, the return is maximized by avoiding failure for as long as possible

Multiagent Reinforcement Learning - 16/60
.
A unified notation

! Think of each episode as ending in an absorbing state that always produces a reward of zero

[Figure: an episode s_0, s_1, s_2 entering an absorbing state, with rewards r_1 = +1, r_2 = +1, r_3 = +1, r_4 = 0, r_5 = 0, . . .]

! Now we can cover both episodic and continuing tasks by writing

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}

Multiagent Reinforcement Learning - 17/60


.

Markov decision processes

! It is often useful to assume that all relevant information is present in the current state: the Markov property

Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, s_{t-1}, a_{t-1}, . . . , r_1, s_0, a_0)

! If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP)
! Assuming finite state and action spaces, it is a finite MDP

Multiagent Reinforcement Learning - 18/60


.
Markov decision processes

An MDP is defined by
! State and action sets
! One-step dynamics defined by state transition probabilities

P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a)

! Reward expectations

R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')
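To make the notation concrete, a minimal sketch (not part of the slides) of how the one-step dynamics P^a_{ss'} and reward expectations R^a_{ss'} of a finite MDP can be stored as nested Python dictionaries; the two-state, two-action MDP below is an invented toy example:

```python
# Toy finite MDP with states s0, s1 and actions a0, a1 (hypothetical example).
# P[s][a] maps successor states to probabilities; R[s][a] maps successor states to expected rewards.
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0},            "a1": {"s0": 0.5, "s1": 0.5}},
}
R = {
    "s0": {"a0": {"s0": 0.0, "s1": 1.0}, "a1": {"s1": 2.0}},
    "s1": {"a0": {"s0": 0.0},            "a1": {"s0": 0.0, "s1": 1.0}},
}
```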

Multiagent Reinforcement Learning - 19/60


.

Value functions

! When following a fixed policy π we can define the value of a state s under that policy as

V^π(s) = E_π(R_t | s_t = s) = E_π( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

! Similarly we can define the value of taking action a in state s as

Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

Multiagent Reinforcement Learning - 20/60


.
Value functions

! The value function has a particular recursive relationship, defined by the Bellman equation

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

! The equation expresses the recursive relation between the value of a state and its successor states, and averages over all possibilities, weighting each by its probability of occurring

Multiagent Reinforcement Learning - 21/60


.

Optimal policy for an MDP


! We want to find the policy that maximizes long-term reward, which equates to finding the optimal value function

V*(s) = max_π V^π(s)   ∀ s ∈ S

Q*(s, a) = max_π Q^π(s, a)   ∀ s ∈ S, a ∈ A(s)

! Expressed recursively, this is the Bellman optimality equation

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

Multiagent Reinforcement Learning - 22/60


.
Solving the Bellman equation

! We can find the optimal policy by solving the Bellman


equation
! Dynamic Programming
! Two approaches:
! Iteratively improve the value function: value iteration
! Iteratively evaluate and improve the policy: policy iteration
! Both approaches are proven to converge to the optimal
value function

Multiagent Reinforcement Learning - 23/60


.

Value iteration

! One possibility for finding the optimal policy
! Repeatedly sweep over the states, updating each value from the current estimates of successor values (the Bellman optimality equation used as an update rule): bootstrapping

[Figure: value iteration algorithm]
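A minimal value iteration sketch, assuming the nested-dictionary MDP format from the earlier toy example (P[s][a][s'] and R[s][a][s']); it repeatedly applies the Bellman optimality equation as an update until the values stop changing:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Return V*(s) for a finite MDP given as nested dicts P[s][a][s'] and R[s][a][s']."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
            q = [sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                 for a in P[s]]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```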

Multiagent Reinforcement Learning - 24/60


.
Policy iteration

! Often the optimal policy has been reached long before the
value function has converged
! Policy iteration calculates a new policy based on the
current value function, and then calculates a new value
function based on this policy
! This process often converges faster to the optimal policy
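A companion policy iteration sketch in the same hypothetical dictionary format: it alternates evaluating the current policy and improving it greedily, and stops as soon as the policy no longer changes:

```python
def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy evaluation: iterate V^pi for the current policy until convergence.
        while True:
            delta = 0.0
            for s in P:
                a = policy[s]
                v_new = sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                                 for s2, p in P[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V
```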

Multiagent Reinforcement Learning - 25/60


.

Policy iteration

[Figure: policy iteration algorithm, alternating policy evaluation and policy improvement]

Multiagent Reinforcement Learning - 26/60


.
Learning an optimal policy online

! Both previous approaches require knowledge of the dynamics of the environment
! Often this information is not available
! Using temporal difference (TD) methods is one way of
overcoming this problem
! Learn directly from raw experience
! No model of the environment required (model-free)
! E.g.: Q-learning
! Update predicted state values based on new observations
of immediate rewards and successor states

Multiagent Reinforcement Learning - 27/60


.

Q-learning

! Q-learning updates state-action values based on the immediate reward and the optimal expected return

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

! Directly learns the optimal value function independent of the policy being followed
! In contrast to on-policy learners, e.g. SARSA
! Proven to converge to the optimal policy given "sufficient" updates for each state-action pair and a decreasing learning rate α [Watkins92]
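A minimal sketch of tabular Q-learning with ε-greedy exploration; the env object and its reset()/step() interface are an assumption made for illustration, not something defined in the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning; env is a hypothetical environment with reset() and step(a) -> (s', r, done)."""
    Q = defaultdict(float)  # Q[(state, action)], defaulting to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # Off-policy update: bootstrap on max_a' Q(s', a') regardless of the action taken next.
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```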

Multiagent Reinforcement Learning - 28/60


.
One-step Q-learning

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

! Proven to converge to the optimal policy, given certain conditions [Tsitsiklis, 1994]

Multiagent Reinforcement Learning - 29/60


.

Action selection

! How to select an action based on the values of the states


or state-action pairs?
! Success of RL depends on a trade-off
! Exploration
! Exploitation
! Exploration is needed to prevent getting stuck in local
optima
! To ensure convergence you need to exploit

Multiagent Reinforcement Learning - 30/60


.
Action selection

Two common choices


! ε-greedy
! Choose the best action with probability 1 − ε
! Choose a random action with probability ε
! Boltzmann exploration (softmax) uses a temperature parameter τ to balance exploration and exploitation

π_t(s, a) = e^{Q_t(s,a)/τ} / Σ_{a'} e^{Q_t(s,a')/τ}

pure exploitation 0 ← τ → ∞ pure exploration
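Both selection rules in a short sketch; the Q-values are assumed to be a plain list indexed by action:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, a uniformly random action otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau):
    """Softmax over Q-values: low tau approaches pure exploitation, high tau pure exploration."""
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / total for p in prefs])[0]
```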

Multiagent Reinforcement Learning - 31/60


.

Learning automata

! Learning automata [Narendra74] directly modify their policy based on the observed reward (policy iteration)
! Finite action-set learning automata learn a policy over a finite set of actions

π'(a) = π(a) + { α r (1 − π(a)) − β (1 − r) π(a)               if a = a_t
               { −α r π(a) + β (1 − r) [ (k − 1)^{-1} − π(a) ]  if a ≠ a_t

where k = |A|, α and β are the reward and penalty parameters respectively, and r ∈ [0, 1]

! Cross learning is a special case where α = 1 and β = 0
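A sketch of the finite action-set learning automaton update above (linear reward-penalty); the policy is a list of action probabilities, chosen is the index of the selected action, and r ∈ [0, 1] is the observed reward:

```python
def la_update(policy, chosen, r, alpha, beta):
    """Linear reward-penalty update for a finite action-set learning automaton."""
    k = len(policy)
    new_policy = []
    for a, pa in enumerate(policy):
        if a == chosen:
            new_policy.append(pa + alpha * r * (1 - pa) - beta * (1 - r) * pa)
        else:
            new_policy.append(pa - alpha * r * pa + beta * (1 - r) * (1.0 / (k - 1) - pa))
    return new_policy

# Cross learning is the special case la_update(policy, chosen, r, alpha=1.0, beta=0.0).
```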

Multiagent Reinforcement Learning - 32/60


.
Networks of learning automata

! A single learning automaton ignores any state information
! In a network of learning automata [Wheeler86] control is passed on from one automaton to another
! One automaton is active for each state
! The immediate reward r is replaced by the average cumulative reward r̄ since the last visit to that state

r̄_t(s) = Δr / Δt = ( Σ_{i=t(s)}^{t−1} r_i ) / ( t − t(s) )

where t(s) indicates in which time step state s was last visited

Multiagent Reinforcement Learning - 33/60


.

Extensions
! Multi-step TD: eligibility traces
! Instead of observing one immediate reward, use n consecutive rewards for the value update
! Intuition: your current choice of action may have
implications for the future
! State-action pairs are eligible for future rewards, with more
recent states getting more credit

Multiagent Reinforcement Learning - 34/60


.
Extensions
! Reward shaping
! Incorporate domain knowledge to provide additional
rewards during an episode
! Guide the agent to learn faster
! (Optimal) policies preserved given a potential-based
shaping function [Ng99]
! Function approximation
! So far we have used a tabular notation for value functions
! For large state and action spaces this approach becomes intractable
! Function approximators can be used to generalize over
large or even continuous state and action spaces

Multiagent Reinforcement Learning - 35/60


.

Questions so far?

Multiagent Reinforcement Learning - 36/60


.
Part II:
Multiagent Reinforcement Learning

Preliminaries: Fundamentals of Game Theory

Multiagent Reinforcement Learning - 37/60


.

Game theory

! Models strategic interactions as games


! In normal form games, all players simultaneously select an
action, and their joint action determines their individual
payoff
! One-shot interaction
! Can be represented as an n-dimensional payoff matrix, for n players
! A player's strategy is defined as a probability distribution
over his possible actions

Multiagent Reinforcement Learning - 38/60


.
Example: Prisoner's Dilemma

! Two prisoners (A and B) commit a crime together
! They are questioned separately and can choose to confess or deny
! If both confess, both prisoners will serve 3 years in jail
! If both deny, both serve only 1 year for minor charges
! If only one confesses, he goes free, while the other serves 5 years

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 39/60


.

Example: Prisoner's Dilemma

! What should they do?
! If both deny, their total penalty is lowest
! But is this individually rational?
! Purely selfish: regardless of what the other player does, confess is the optimal choice
! If the other confesses, 3 instead of 5 years
! If the other denies, free instead of 1 year

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 40/60


.
Solution concepts
! Nash equilibrium
! Individually rational
! No player can improve by unilaterally changing his strategy
! Mutual confession is the only Nash equilibrium of this game
! Jointly the players could do better
! Pareto optimum: there is no other solution for which all players do at least as well and at least one player is strictly better off
! Mutual denial Pareto dominates the Nash equilibrium in this game

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 41/60


.

Types of games

! Competitive or zero-sum
! Players have opposing preferences
! E.g. Matching Pennies

Matching Pennies
        H        T
H    +1, -1   -1, +1
T    -1, +1   +1, -1

! Symmetric games
! Players are identical
! E.g. Prisoner's Dilemma

Prisoner's Dilemma
        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

! Asymmetric games
! Players are unique
! E.g. Battle of the Sexes

Battle of the Sexes
        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 42/60


.
Part II:
Multiagent Reinforcement Learning

Multiagent Reinforcement Learning - 43/60


.

MARL: Motivation
! MAS offer a solution paradigm that can cope with complex
problems
! Technological challenges require decentralised solutions
! Multiple autonomous vehicles for exploration, surveillance
or rescue missions
! Distributed sensing
! Traffic control (data, urban or air traffic)
! Key advantages: Fault tolerance and load balancing
! But: highly dynamic and nondeterministic environments!
! Need for adaptation on an individual level
! Learning is crucial!

Multiagent Reinforcement Learning - 44/60


.
MARL: From single to multiagent learning

! Inherently more challenging


! Agents interact with the environment and each other
! Learning is simultaneous
! Changes in strategy of one agent might affect strategy of
other agents
! Questions:
! One vs. many learning agents?
! Convergence?
! Objective: maximise common reward or individual reward?
! Credit assignment?

Multiagent Reinforcement Learning - 45/60


.

Independent reinforcement learners

! Naive extension to the multiagent setting


! Independent learners mutually ignore each other
! Implicitly perceive interaction with other agents as noise in
a stochastic environment

Multiagent Reinforcement Learning - 46/60


.
Learning in matrix games

! Two Q-learners interact in Battle of the Sexes
! α = 0.01
! Boltzmann exploration with τ = 0.2
! They only observe their immediate reward
! Policy is gradually improved

        B        S
B     2, 1     0, 0
S     0, 0     1, 2
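A minimal sketch of this experiment: two independent, stateless Q-learners with Boltzmann exploration repeatedly play Battle of the Sexes, each observing only its own reward. The code is an illustrative reconstruction, not the presenters' implementation; parameter values mirror the slide (α = 0.01, τ = 0.2):

```python
import math
import random

# Battle of the Sexes: PAYOFF[i][a1][a2] is the reward of player i for joint action (a1, a2).
PAYOFF = [[[2, 0], [0, 1]],   # player 1
          [[1, 0], [0, 2]]]   # player 2

def boltzmann_action(q, tau=0.2):
    prefs = [math.exp(v / tau) for v in q]
    total = sum(prefs)
    return random.choices([0, 1], weights=[p / total for p in prefs])[0]

def run(iterations=250, alpha=0.01, tau=0.2):
    q = [[0.0, 0.0], [0.0, 0.0]]  # one stateless Q-vector per player
    for _ in range(iterations):
        a = [boltzmann_action(q[0], tau), boltzmann_action(q[1], tau)]
        for i in (0, 1):
            r = PAYOFF[i][a[0]][a[1]]
            # Each player updates only the value of its own chosen action from its own reward.
            q[i][a[i]] += alpha * (r - q[i][a[i]])
    return q

print(run())
```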

Multiagent Reinforcement Learning - 47/60


.

Learning in matrix games

[Plot: probability of playing the first action over 250 iterations for Player 1 and Player 2 in Battle of the Sexes]

        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 48/60


.
Learning in matrix games

[Plot: average cumulative reward over 250 iterations for Player 1 and Player 2 in Battle of the Sexes]

        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 49/60


.

Markov games

n-player game: ⟨ n, S, A^1, . . . , A^n, R^1, . . . , R^n, P ⟩
! S : set of states
! A^i : action set for player i
! R^i : reward/payoff function for player i
! P : transition function

The payoff function R^i : S × A^1 × · · · × A^n → ℝ maps the joint action a = (a^1, . . . , a^n) to an immediate payoff value for player i.

The transition function P : S × A^1 × · · · × A^n → Δ(S) determines the probabilistic state change to the next state s_{t+1}.

Multiagent Reinforcement Learning - 50/60


.
Value iteration in Markov games

Single-agent MDP:

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

2-player zero-sum stochastic game:

Q*(s, a^1, a^2) = R(s, a^1, a^2) + γ Σ_{s' ∈ S} P_{s'}(s, a^1, a^2) V*(s')

V*(s) = max_{π ∈ Δ(A^1)} min_{a^2 ∈ A^2} Σ_{a^1 ∈ A^1} π_{a^1} Q*(s, a^1, a^2)

Multiagent Reinforcement Learning - 51/60


.

Minimax-Q
! Value iteration requires knowledge of the reward and transition functions
! Minimax-Q [Littman94]: learning algorithm for zero-sum games
! Payoffs balance out, so each agent only needs to observe its own payoff
! Q is a function of the joint action:

Q(s, a^1, a^2) = R(s, a^1, a^2) + γ Σ_{s' ∈ S} P_{s'}(s, a^1, a^2) V(s')

! A joint action learner (JAL) is an agent that learns Q-values for joint actions as opposed to individual actions.

Multiagent Reinforcement Learning - 52/60


.
Minimax-Q (2)

Update rule for agent 1 with reward R_t at stage t:

Q_{t+1}(s_t, a^1_t, a^2_t) = (1 − α_t) Q_t(s_t, a^1_t, a^2_t) + α_t [ R_t + γ V_t(s_{t+1}) ]

The value of the next state V(s_{t+1}):

V_{t+1}(s) = max_{π ∈ Δ(A^1)} min_{a^2 ∈ A^2} Σ_{a^1 ∈ A^1} π_{a^1} Q_t(s, a^1, a^2)

Minimax-Q converges to Nash equilibria under the same assumptions as regular Q-learning [Littman94]
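Computing V(s) requires solving the max-min of a matrix game at every update. A sketch of that inner step using scipy.optimize.linprog is below; Q is assumed to be a 2-D array Q[a1][a2] of the learner's joint-action values for the current state:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(Q):
    """Solve max_pi min_a2 sum_a1 pi[a1] * Q[a1, a2] as a linear program.

    Variables are the mixed strategy pi and the game value v: maximize v subject to
    sum_a1 pi[a1] * Q[a1, a2] >= v for every opponent action a2, with pi a probability vector.
    """
    Q = np.asarray(Q, dtype=float)
    m, n = Q.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                               # minimize -v, i.e. maximize v
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])  # v - pi^T Q[:, a2] <= 0 for each a2
    b_ub = np.zeros(n)
    A_eq = np.array([[1.0] * m + [0.0]])       # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                # maximin policy and value V(s)
```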

Multiagent Reinforcement Learning - 53/60


.

Nash-Q learning

! Nash-Q learning [Hu03]: joint action learner for general-sum stochastic games
! Each individual agent has to estimate Q-values for all other agents as well
! Optimal Nash-Q values: sum of immediate reward and discounted future rewards under the condition that all agents play a specified Nash equilibrium from the next stage onward

Multiagent Reinforcement Learning - 54/60


.
Nash-Q learning (2)

Update rule for agent i:

Q^i_{t+1}(s_t, a^1, . . . , a^n) = (1 − α_t) Q^i_t(s_t, a^1, . . . , a^n) + α_t [ R^i_t + γ NashQ^i_t(s_{t+1}) ]

A Nash equilibrium is computed for each stage game ( Q^1_t(s_{t+1}, ·), . . . , Q^n_t(s_{t+1}, ·) ) and results in the equilibrium payoff NashQ^i_t(s_{t+1}) to agent i

Agent i uses the same update rule to estimate Q-values for all other agents, i.e., Q^j for all j ∈ {1, . . . , n} \ {i}

Multiagent Reinforcement Learning - 55/60


.

Other equilibrium learning algorithms

! Friend-or-Foe Q-learning [Littman01]
! Correlated-Q learning (CE-Q) [Greenwald03]
! Nash bargaining solution Q-learning (NBS-Q) [Qiao06]
! Optimal adaptive learning (OAL) [Wang02]
! Asymmetric-Q learning [Kononen03]

Multiagent Reinforcement Learning - 56/60


.
Limitations of MARL

! Convergence guarantees are mostly restricted to stateless


repeated games
! ... or are inapplicable in general-sum games
! Many convergence proofs have strong assumptions with
respect to a-priori knowledge and/or observability
! Equilibrium learners focus on stage-wise solutions (only
indirect state coupling)

Multiagent Reinforcement Learning - 57/60


.

Summary

In a multi-agent system
! be aware of what information is available to the agent
! if you can afford to try, just run an algorithm that matches the assumptions
! proofs of convergence are available for small games
! new research can focus either on engineering solutions, or on advancing the state-of-the-art theories

Multiagent Reinforcement Learning - 58/60


.
Questions so far?

Multiagent Reinforcement Learning - 59/60


.

Thank you!
Daan Bloembergen ¦ daan.bloembergen@gmail.com
Daniel Hennes ¦ daniel.hennes@gmail.com

Multiagent Reinforcement Learning - 60/60


.
Dynamics of Learning in Strategic
Interactions
Michael Kaisers

September 27, 2013 - ECML MARL Tutorial


.

Outline

! Action values in dynamic environments


! Deriving learning dynamics
! Illustrating Convergence
! Comparing dynamics of various algorithms
! Replicator dynamics as models of evolution, swarm
intelligence and learning
! Summary

Multiagent Reinforcement Learning - 2/23


.
Action values in dynamic environments

Action values are estimated by sampling from interactions with


the environment, possibly in the presence of other agents.

Multiagent Reinforcement Learning - 3/23


.

Action values in dynamic environments

Static environment, off-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.
Action values in dynamic environments

Static environment, on-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.

Action values in dynamic environments

Static environment, on-policy Q-value updates

[Plots: reward over time for action 1 and action 2 (top) and the corresponding policy over time (bottom)]

Multiagent Reinforcement Learning - 4/23


.
Action values in dynamic environments

Adversarial environment, on-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.

Action values in dynamic environments

Adversarial environment, on-policy Q-value updates

[Plots: reward over time for action 1 and action 2 (top) and the corresponding policy over time (bottom)]

Multiagent Reinforcement Learning - 4/23


.
Deriving Learning Dynamics

Matching Pennies
        H        T
H     1, 0     0, 1
T     0, 1     1, 0

[Plots: learning trajectories in the joint policy space for learning rates α = 0.1 and α = 0.001]
Multiagent Reinforcement Learning - 5/23
.

Deriving Learning Dynamics

Learning algorithm   lim_{α→0} E(Δx) / Δt = dx/dt = ẋ   Dynamical system

Advantages of dynamical systems

! Deterministic
! Convergence guarantees using the Jacobian
! Vast related body of literature (e.g., bifurcation theory)

Multiagent Reinforcement Learning - 6/23


.
Deriving Learning Dynamics
Example: Cross learning [Boergers97]

x_i(t + 1) ← { (1 − α r_i) x_i + α r_i    if action i is selected
             { (1 − α r_j) x_i            if another action j is selected

E(Δx_i) = x_i [ (1 − α r_i) x_i + α r_i − x_i ] + Σ_{k≠i} x_k [ (1 − α r_k) x_i − x_i ]
        = α x_i [ (1 − x_i) r_i − Σ_{k≠i} x_k r_k ] = α x_i [ r_i − Σ_k x_k r_k ]

Taking the limit lim_{α→0} E(Δx) / Δt = dx/dt = ẋ turns the learning algorithm into a dynamical system:

ẋ_i = x_i ( E[r_i] − Σ_k x_k E[r_k] ) = x_i [ (Ay)_i − xAy ]    (replicator dynamics)

Multiagent Reinforcement Learning - 7/23


.

Deriving Learning Dynamics


Prisoners' Dilemma             Battle of Sexes                Matching Pennies
        D        C                     B        S                     H        T
D     1, 1     5, 0            B     2, 1     0, 0            H     1, -1    -1, 1
C     0, 5     3, 3            S     0, 0     1, 2            T    -1, 1     1, -1

Player 1 plays its first action with probability x_1, player 2 with probability y_1.

[Plots: replicator dynamics vector fields in the joint policy space (x_1, y_1) for each of the three games]

ẋ_i = x_i [ (Ay)_i − xAy ]

ẏ_i = y_i [ (x^T B)_i − x^T B y ]
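These two-population replicator dynamics can be integrated numerically with a simple Euler scheme; the sketch below does this for Matching Pennies (payoff matrices A and B as on the slide), with an arbitrary starting point and step size:

```python
import numpy as np

# Matching Pennies payoff matrices for the row player (A) and column player (B).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A  # zero-sum

def replicator_step(x, y, dt=0.01):
    """One Euler step of x_i' = x_i[(Ay)_i - xAy] and y_i' = y_i[(x^T B)_i - xBy]."""
    Ay = A @ y
    xB = x @ B
    dx = x * (Ay - x @ Ay)
    dy = y * (xB - xB @ y)
    return x + dt * dx, y + dt * dy

x, y = np.array([0.7, 0.3]), np.array([0.2, 0.8])
for _ in range(1000):
    x, y = replicator_step(x, y)
print(x, y)  # the trajectory orbits the mixed equilibrium (0.5, 0.5)
```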

Multiagent Reinforcement Learning - 8/23


.
Deriving Learning Dynamics
Dynamics have been derived for
! Learning Automata (Cross Learning)
! Regret Matching (RM)
! Variations of Infinitesimal Gradient Ascent
! Infinitesimal Gradient Ascent (IGA)
! Win-or-Learn-Fast (WoLF) IGA
! Weighted Policy Learning (WPL)
! Variations of Q-learning
! Repeated Update Q-learning
See our talk on Thursday, Session F4 Learning 1, 13:50
! Frequency Adjusted Q-learning

Multiagent Reinforcement Learning - 9/23


.

Illustrating Convergence
Q-learning [Watkins92]

x_i  probability of playing action i
α    learning rate
r    reward
τ    temperature

Update rule

Q_i(t + 1) ← Q_i(t) + α [ r_i(t) + γ max_j Q_j(t) − Q_i(t) ]

Policy generation function

x_i(Q, τ) = e^{τ^{-1} Q_i} / Σ_j e^{τ^{-1} Q_j}

Multiagent Reinforcement Learning - 10/23


.
Illustrating Convergence
Frequency Adjusted Q-learning (FAQ-learning) [Kaisers2010]

x_i  probability of playing action i
α    learning rate
r    reward
τ    temperature

Update rule

Q_i(t + 1) ← Q_i(t) + α (1 / x_i) [ r_i(t) + γ max_j Q_j(t) − Q_i(t) ]

Policy generation function

x_i(Q, τ) = e^{τ^{-1} Q_i} / Σ_j e^{τ^{-1} Q_j}
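The two update rules side by side for the stateless repeated games used in these illustrations; the unbounded 1/x_i factor follows the slide's formula (practical implementations often bound it), and γ = 0 because there is no state:

```python
def q_update(Q, i, r, alpha=0.01, gamma=0.0):
    """Standard Q-learning update for action i (gamma = 0 in a stateless repeated game)."""
    Q[i] += alpha * (r + gamma * max(Q) - Q[i])

def faq_update(Q, i, r, x, alpha=0.01, gamma=0.0):
    """FAQ-learning: the same update scaled by 1 / x_i, the probability of playing action i."""
    Q[i] += alpha * (1.0 / x[i]) * (r + gamma * max(Q) - Q[i])
```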

Multiagent Reinforcement Learning - 10/23


.

Illustrating Convergence

Cross Learning [Boergers97]

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Frequency Adjusted Q-learning [Tuyls05, Kaisers2010]

ẋ_i = α x_i ( τ^{-1} ( E[r_i(t)] − Σ_k x_k E[r_k(t)] ) − log x_i + Σ_k x_k log x_k )

Proof of convergence in two-player two-action games [Kaisers2011, Kianercy2012]

Multiagent Reinforcement Learning - 11/23


.
Illustrating Convergence
Prisoners' Dilemma

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 12/23


.

Illustrating Convergence
Battle of Sexes

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 13/23


.
Illustrating Convergence
Matching Pennies

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 14/23


.

Comparing Dynamics

Dynamical systems have been associated with


! Infinitesimal Gradient Ascent (IGA)
! Win-or-Learn-Fast Infinitesimal Gradient Ascent (WoLF)
! Weighted Policy Learning (WPL)

! Cross Learning (CL)


! Frequency Adjusted Q-learning (FAQ)
! Regret Matching (RM)

Multiagent Reinforcement Learning - 15/23


.
Comparing Dynamics

Learning dynamics for two-agent two-action games. The common gradient is abbreviated ∂ = y u + A_{12} − A_{22}, with u = A_{11} − A_{12} − A_{21} + A_{22}.

Algorithm   ẋ
IGA         α ∂
WoLF        α ∂ · { α_min   if V(x, y) > V(x_e, y)
                  { α_max   otherwise
WPL         α ∂ · { x       if ∂ < 0
                  { 1 − x   otherwise
CL          α x (1 − x) ∂
FAQ         α x (1 − x) [ ∂ τ^{-1} − log( x / (1 − x) ) ]
RM          α x (1 − x) ∂ · { (1 + α x ∂)^{-1}          if ∂ < 0
                            { (1 − α (1 − x) ∂)^{-1}    otherwise

Multiagent Reinforcement Learning - 16/23


.

Comparing Dynamics

Cross Learning is linked to the replicator dynamics

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Gradient-based algorithms use the orthogonal projection function, leading to the following dynamics for IGA:

ẋ_i = α ( E[r_i(t)] − (1/n) Σ_k E[r_k(t)] )

Gradient dynamics use information about all actions, and are equivalent to the replicator dynamics under the uniform policy (i.e., also given off-policy updates).

Multiagent Reinforcement Learning - 17/23


.
Replicator Dynamics
Path finding with Ant Colony Optimization

Pheromones τ and a travel cost heuristic η lead to probabilistic selection of the state transition x_{i,j} from i to j:

x_{i,j} = τ_{i,j}^α η_{i,j}^β / Σ_l τ_{i,l}^α η_{i,l}^β,   with α, β tuning parameters

Multiagent Reinforcement Learning - 18/23


.

Replicator Dynamics
Pheromone trail reinforcement in Ant Colony Optimization

τ_{i,j}(t + 1) = (1 − ρ) τ_{i,j}(t) + Σ_{m=1}^{M} δ_{i,j}(t, m),

where ρ denotes the pheromone evaporation rate, M is the number of ants, and δ_{i,j}(t, m) = Q n_{i,j}, with Q a constant and n_{i,j} the number of times edge (i, j) has been visited.

ẋ_{i,j} = d/dt [ τ_{i,j}^α η_{i,j}^β / Σ_l τ_{i,l}^α η_{i,l}^β ]
        = α x_{i,j} τ̇_{i,j} / τ_{i,j} − α x_{i,j} Σ_l x_{i,l} τ̇_{i,l} / τ_{i,l}
        = α ( x_{i,j} Θ_{i,j} − x_{i,j} Σ_k x_{i,k} Θ_{i,k} ),   with Θ_{i,j} = τ̇_{i,j} / τ_{i,j}    (replicator dynamics)
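A small sketch of the two ACO equations above; tau and eta are assumed to be dictionaries keyed by edge (i, j), neighbors maps each node to its successors, and deposits collects the per-ant contributions, all placeholder structures for illustration:

```python
def transition_probs(i, tau, eta, neighbors, alpha=1.0, beta=2.0):
    """x_{i,j} proportional to tau_{i,j}^alpha * eta_{i,j}^beta over the neighbours of i."""
    weights = {j: (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta) for j in neighbors[i]}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def update_pheromones(tau, deposits, rho=0.1):
    """tau_{i,j}(t+1) = (1 - rho) * tau_{i,j}(t) + total deposit of all ants on edge (i, j)."""
    return {edge: (1 - rho) * value + deposits.get(edge, 0.0) for edge, value in tau.items()}
```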

Multiagent Reinforcement Learning - 19/23


.
Replicator Dynamics

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Relative competitiveness as encoded by the replicator dynamics models
! the selection operator in evolutionary game theory
! pheromone trail reinforcement in swarm intelligence
! and exploitation in reinforcement learning dynamics.

Multiagent Reinforcement Learning - 20/23


.

Summary

In strategic interactions
! the action values change over time
! the joint learning is a complex stochastic system
! dynamics can be captured in dynamical systems
! proof of convergence
! similarity of dynamics despite different implementations
! link between learning, evolution and swarm intelligence
! different assumptions about observability give rise to a
menagerie of algorithms to choose from

Multiagent Reinforcement Learning - 21/23


.
Questions?

Multiagent Reinforcement Learning - 22/23


.

Thank you!
Michael Kaisers ¦ michaelkaisers@gmail.com

Multiagent Reinforcement Learning - 23/23


.