Multiagent Reinforcement Learning

(MARL)
September 27, 2013 - ECML'13

Presenters

! Daan Bloembergen
! Daniel Hennes
! Michael Kaisers
! Peter Vrancx

September 27, 2013 - ECML MARL Tutorial


.
Schedule

! Fundamentals of multi-agent reinforcement learning


! 15:30 - 17:00, Daan Bloembergen and Daniel Hennes
! Dynamics of learning in strategic interactions
! 17:15 - 17:45, Michael Kaisers
! Scaling multi-agent reinforcement learning
! 17:45 - 18:45, Peter Vrancx

September 27, 2013 - ECML MARL Tutorial


.

Who are you?

We would like to get to know our audience!

September 27, 2013 - ECML MARL Tutorial


.
Fundamentals of Multi-Agent
Reinforcement Learning
Daan Bloembergen and Daniel Hennes

September 27, 2013 - ECML MARL Tutorial


.

Outline (1)

Single Agent Reinforcement Learning


! Markov Decision Processes
! Value Iteration
! Policy Iteration
! Algorithms
! Q-Learning
! Learning Automata

Multiagent Reinforcement Learning - 2/60


.
Outline (2)

Multiagent Reinforcement Learning


! Game Theory
! Markov Games
! Value Iteration
! Algorithms
! Minimax-Q Learning
! Nash-Q Learning
! Other Equilibrium Learning Algorithms
! Policy Hill-Climbing

Multiagent Reinforcement Learning - 3/60


.

Part I:
Single Agent Reinforcement Learning

Multiagent Reinforcement Learning - 4/60


.
Richard S. Sutton and Andrew G. Barto
Reinforcement Learning: An
Introduction
MIT Press, 1998

Available on-line for free!

Multiagent Reinforcement Learning - 5/60


.

Why reinforcement learning?

Based on ideas from psychology


! Edward Thorndike's law of effect
! Satisfaction strengthens behavior,
discomfort weakens it
! B.F. Skinner's principle of
reinforcement
! Skinner Box: train animals by
providing (positive) feedback

Learning by interacting with the


environment

Multiagent Reinforcement Learning - 6/60


.
Why reinforcement learning?
Control theory
! Design a controller to minimize some measure of a dynamical system's behavior
! Richard Bellman
! Use system state and value functions (optimal return)
! Bellman equation
! Dynamic programming
! Solve optimal control problems by solving the Bellman
equation
These two threads came together in the 1980s, producing the
modern field of reinforcement learning

Multiagent Reinforcement Learning - 7/60


.

The RL setting

[Figure: the agent-environment loop; the agent sends action a(t) to the environment and receives the next state s(t+1) and reward r(t+1)]

! Learning from interaction
! Learning about, from, and while interacting with an external environment
! Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

Multiagent Reinforcement Learning - 8/60
.
Key features of RL

! Learner is not told which action to take


! Trial-and-error approach
! Possibility of delayed reward
! Sacrifice short-term gains for greater long-term gains
! Need to balance exploration and exploitation
! In between supervised and unsupervised learning

Multiagent Reinforcement Learning - 9/60


.

The agent-environment interface

Agent interacts at discrete time steps t = 0, 1, 2, . . .

! Observes state s_t ∈ S
! Selects action a_t ∈ A(s_t)
! Obtains immediate reward r_{t+1} ∈ ℝ
! Observes resulting state s_{t+1}

[Figure: the agent-environment loop and the resulting trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, . . .]

Multiagent Reinforcement Learning - 10/60


.
Elements of RL

! Time steps need not refer to fixed intervals of real time


! Actions can be
! low level (voltage to motors)
! high level (go left, go right)
! "mental" (shift focus of attention)
! States can be
! low level "sensations" (temperature, (t, v) coordinates)
! high level abstractions, symbolic
! subjective, internal ("surprised", "lost")
! The environment is not necessarily known to the agent

Multiagent Reinforcement Learning - 11/60


.

Elements of RL

! State transitions are


! changes to the internal state of the agent
! changes in the environment as a result of the agent's action
! can be nondeterministic
! Rewards are
! goals, subgoals
! duration
! ...

Multiagent Reinforcement Learning - 12/60


.
Learning how to behave

! The agent's policy π at time t is
! a mapping from states to action probabilities
! π_t(s, a) = Pr(a_t = a | s_t = s)

! Reinforcement learning methods specify how the agent


changes its policy as a result of experience
! Roughly, the agent's goal is to get as much reward as it
can over the long run

Multiagent Reinforcement Learning - 13/60


.

The objective

Suppose the sequence of rewards after time t is

r_{t+1}, r_{t+2}, r_{t+3}, . . .

! The goal is to maximize the expected return E{R_t} for each time step t
! Episodic tasks naturally break into episodes, e.g., plays of a game, trips through a maze

R_t = r_{t+1} + r_{t+2} + . . . + r_T

Multiagent Reinforcement Learning - 14/60


.
The objective

! Continuing tasks do not naturally break up into episodes
! Use discounted return instead of total reward

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + . . . = Σ_{k=0}^∞ γ^k r_{t+k+1}

where γ, 0 ≤ γ ≤ 1, is the discount factor such that

shortsighted 0 ← γ → 1 farsighted
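To make the discount concrete, here is a minimal Python sketch (not from the slides; the reward sequence and γ are arbitrary example values) that computes the discounted return of a finite reward sequence:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence.
def discounted_return(rewards, gamma):
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# Arbitrary example: rewards observed after time t, gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```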

Multiagent Reinforcement Learning - 15/60


.

Example: pole balancing

! As an episodic task, where each episode ends upon failure
! reward = +1 for each step before failure
! return = number of steps before failure
! As a continuing task with discounted return
! reward = -1 upon failure, 0 otherwise
! return = -γ^k, for k steps before failure
! In both cases, the return is maximized by avoiding failure for as long as possible

Multiagent Reinforcement Learning - 16/60
.
A unified notation

! Think of each episode as ending in an absorbing state that always produces a reward of zero

[Figure: an episode s_0, s_1, s_2 entering an absorbing state, with rewards r_1 = +1, r_2 = +1, r_3 = +1, r_4 = 0, r_5 = 0, . . .]

! Now we can cover both episodic and continuing tasks by writing

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}

Multiagent Reinforcement Learning - 17/60


.

Markov decision processes

! It is often useful to assume that all relevant information is present in the current state: the Markov property

Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, s_{t-1}, a_{t-1}, . . . , r_1, s_0, a_0)

! If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP)
! Assuming finite state and action spaces, it is a finite MDP

Multiagent Reinforcement Learning - 18/60


.
Markov decision processes

An MDP is defined by
! State and action sets
! One-step dynamics defined by state transition probabilities

P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a)

! Reward expectations

R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')
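To make the notation concrete, a minimal sketch (not part of the slides) of how the one-step dynamics P^a_{ss'} and reward expectations R^a_{ss'} of a finite MDP can be stored as nested Python dictionaries; the two-state, two-action MDP below is an invented toy example:

```python
# Toy finite MDP with states s0, s1 and actions a0, a1 (hypothetical example).
# P[s][a] maps successor states to probabilities; R[s][a] maps successor states to expected rewards.
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0},            "a1": {"s0": 0.5, "s1": 0.5}},
}
R = {
    "s0": {"a0": {"s0": 0.0, "s1": 1.0}, "a1": {"s1": 2.0}},
    "s1": {"a0": {"s0": 0.0},            "a1": {"s0": 0.0, "s1": 1.0}},
}
```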

Multiagent Reinforcement Learning - 19/60


.

Value functions

! When following a fixed policy π we can define the value of a state s under that policy as

V^π(s) = E_π(R_t | s_t = s) = E_π( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

! Similarly we can define the value of taking action a in state s as

Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

Multiagent Reinforcement Learning - 20/60


.
Value functions

! The value function has a particular recursive relationship, defined by the Bellman equation

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

! The equation expresses the recursive relation between the value of a state and its successor states, and averages over all possibilities, weighting each by its probability of occurring

Multiagent Reinforcement Learning - 21/60


.

Optimal policy for an MDP


! We want to find the policy that maximizes long-term reward, which equates to finding the optimal value function

V*(s) = max_π V^π(s)   ∀ s ∈ S

Q*(s, a) = max_π Q^π(s, a)   ∀ s ∈ S, a ∈ A(s)

! Expressed recursively, this is the Bellman optimality equation

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

Multiagent Reinforcement Learning - 22/60


.
Solving the Bellman equation

! We can find the optimal policy by solving the Bellman


equation
! Dynamic Programming
! Two approaches:
! Iteratively improve the value function: value iteration
! Iteratively evaluate and improve the policy: policy iteration
! Both approaches are proven to converge to the optimal
value function

Multiagent Reinforcement Learning - 23/60


.

Value iteration

! One possibility for finding the optimal policy
! Repeatedly sweep over the states, updating each value from the current estimates of successor values (the Bellman optimality equation used as an update rule): bootstrapping

[Figure: value iteration algorithm]
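A minimal value iteration sketch, assuming the nested-dictionary MDP format from the earlier toy example (P[s][a][s'] and R[s][a][s']); it repeatedly applies the Bellman optimality equation as an update until the values stop changing:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Return V*(s) for a finite MDP given as nested dicts P[s][a][s'] and R[s][a][s']."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
            q = [sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                 for a in P[s]]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```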

Multiagent Reinforcement Learning - 24/60


.
Policy iteration

! Often the optimal policy has been reached long before the
value function has converged
! Policy iteration calculates a new policy based on the
current value function, and then calculates a new value
function based on this policy
! This process often converges faster to the optimal policy
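A companion policy iteration sketch in the same hypothetical dictionary format: it alternates evaluating the current policy and improving it greedily, and stops as soon as the policy no longer changes:

```python
def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy evaluation: iterate V^pi for the current policy until convergence.
        while True:
            delta = 0.0
            for s in P:
                a = policy[s]
                v_new = sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                                 for s2, p in P[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V
```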

Multiagent Reinforcement Learning - 25/60


.

Policy iteration

[Figure: policy iteration algorithm, alternating policy evaluation and policy improvement]

Multiagent Reinforcement Learning - 26/60


.
Learning an optimal policy online

! Both previous approaches require knowledge of the dynamics of the environment
! Often this information is not available
! Using temporal difference (TD) methods is one way of
overcoming this problem
! Learn directly from raw experience
! No model of the environment required (model-free)
! E.g.: Q-learning
! Update predicted state values based on new observations
of immediate rewards and successor states

Multiagent Reinforcement Learning - 27/60


.

Q-learning

! Q-learning updates state-action values based on the immediate reward and the optimal expected return

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

! Directly learns the optimal value function independent of the policy being followed
! In contrast to on-policy learners, e.g. SARSA
! Proven to converge to the optimal policy given "sufficient" updates for each state-action pair and a decreasing learning rate α [Watkins92]
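A minimal sketch of tabular Q-learning with ε-greedy exploration; the env object and its reset()/step() interface are an assumption made for illustration, not something defined in the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning; env is a hypothetical environment with reset() and step(a) -> (s', r, done)."""
    Q = defaultdict(float)  # Q[(state, action)], defaulting to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # Off-policy update: bootstrap on max_a' Q(s', a') regardless of the action taken next.
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```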

Multiagent Reinforcement Learning - 28/60


.
One-step Q-learning

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

! Proven to converge to the optimal policy, given certain conditions [Tsitsiklis, 1994]

Multiagent Reinforcement Learning - 29/60


.

Action selection

! How to select an action based on the values of the states


or state-action pairs?
! Success of RL depends on a trade-off
! Exploration
! Exploitation
! Exploration is needed to prevent getting stuck in local
optima
! To ensure convergence you need to exploit

Multiagent Reinforcement Learning - 30/60


.
Action selection

Two common choices


! ε-greedy
! Choose the best action with probability 1 − ε
! Choose a random action with probability ε
! Boltzmann exploration (softmax) uses a temperature parameter τ to balance exploration and exploitation

π_t(s, a) = e^{Q_t(s,a)/τ} / Σ_{a'} e^{Q_t(s,a')/τ}

pure exploitation 0 ← τ → ∞ pure exploration
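Both selection rules in a short sketch; the Q-values are assumed to be a plain list indexed by action:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, a uniformly random action otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau):
    """Softmax over Q-values: low tau approaches pure exploitation, high tau pure exploration."""
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / total for p in prefs])[0]
```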

Multiagent Reinforcement Learning - 31/60


.

Learning automata

! Learning automata [Narendra74] directly modify their policy based on the observed reward (policy iteration)
! Finite action-set learning automata learn a policy over a finite set of actions

π'(a) = π(a) + { α r (1 − π(a)) − β (1 − r) π(a)               if a = a_t
               { −α r π(a) + β (1 − r) [ (k − 1)^{-1} − π(a) ]  if a ≠ a_t

where k = |A|, α and β are the reward and penalty parameters respectively, and r ∈ [0, 1]

! Cross learning is a special case where α = 1 and β = 0
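A sketch of the finite action-set learning automaton update above (linear reward-penalty); the policy is a list of action probabilities, chosen is the index of the selected action, and r ∈ [0, 1] is the observed reward:

```python
def la_update(policy, chosen, r, alpha, beta):
    """Linear reward-penalty update for a finite action-set learning automaton."""
    k = len(policy)
    new_policy = []
    for a, pa in enumerate(policy):
        if a == chosen:
            new_policy.append(pa + alpha * r * (1 - pa) - beta * (1 - r) * pa)
        else:
            new_policy.append(pa - alpha * r * pa + beta * (1 - r) * (1.0 / (k - 1) - pa))
    return new_policy

# Cross learning is the special case la_update(policy, chosen, r, alpha=1.0, beta=0.0).
```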

Multiagent Reinforcement Learning - 32/60


.
Networks of learning automata

! A single learning automaton ignores any state information
! In a network of learning automata [Wheeler86] control is passed on from one automaton to another
! One automaton is active for each state
! The immediate reward r is replaced by the average cumulative reward r̄ since the last visit to that state

r̄_t(s) = Δr / Δt = ( Σ_{i=t(s)}^{t−1} r_i ) / ( t − t(s) )

where t(s) indicates in which time step state s was last visited

Multiagent Reinforcement Learning - 33/60


.

Extensions
! Multi-step TD: eligibility traces
! Instead of observing one immediate reward, use n consecutive rewards for the value update
! Intuition: your current choice of action may have
implications for the future
! State-action pairs are eligible for future rewards, with more
recent states getting more credit

Multiagent Reinforcement Learning - 34/60


.
Extensions
! Reward shaping
! Incorporate domain knowledge to provide additional
rewards during an episode
! Guide the agent to learn faster
! (Optimal) policies preserved given a potential-based
shaping function [Ng99]
! Function approximation
! So far we have used a tabular notation for value functions
! For large state and action spaces this approach becomes intractable
! Function approximators can be used to generalize over
large or even continuous state and action spaces

Multiagent Reinforcement Learning - 35/60


.

Questions so far?

Multiagent Reinforcement Learning - 36/60


.
Part II:
Multiagent Reinforcement Learning

Preliminaries: Fundamentals of Game Theory

Multiagent Reinforcement Learning - 37/60


.

Game theory

! Models strategic interactions as games


! In normal form games, all players simultaneously select an
action, and their joint action determines their individual
payoff
! One-shot interaction
! Can be represented as an n-dimensional payoff matrix, for n players
! A player's strategy is defined as a probability distribution
over his possible actions

Multiagent Reinforcement Learning - 38/60


.
Example: Prisoner's Dilemma

! Two prisoners (A and B) commit a crime together
! They are questioned separately and can choose to confess or deny
! If both confess, both prisoners will serve 3 years in jail
! If both deny, both serve only 1 year for minor charges
! If only one confesses, he goes free, while the other serves 5 years

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 39/60


.

Example: Prisoner's Dilemma

! What should they do?
! If both deny, their total penalty is lowest
! But is this individually rational?
! Purely selfish: regardless of what the other player does, confess is the optimal choice
! If the other confesses, 3 instead of 5 years
! If the other denies, free instead of 1 year

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 40/60


.
Solution concepts
! Nash equilibrium
! Individually rational
! No player can improve by unilaterally changing his strategy
! Mutual confession is the only Nash equilibrium of this game
! Jointly the players could do better
! Pareto optimum: there is no other solution for which all players do at least as well and at least one player is strictly better off
! Mutual denial Pareto dominates the Nash equilibrium in this game

        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

Multiagent Reinforcement Learning - 41/60


.

Types of games

! Competitive or zero-sum
! Players have opposing preferences
! E.g. Matching Pennies

Matching Pennies
        H        T
H    +1, -1   -1, +1
T    -1, +1   +1, -1

! Symmetric games
! Players are identical
! E.g. Prisoner's Dilemma

Prisoner's Dilemma
        C        D
C    -3, -3    0, -5
D    -5, 0    -1, -1

! Asymmetric games
! Players are unique
! E.g. Battle of the Sexes

Battle of the Sexes
        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 42/60


.
Part II:
Multiagent Reinforcement Learning

Multiagent Reinforcement Learning - 43/60


.

MARL: Motivation
! MAS offer a solution paradigm that can cope with complex
problems
! Technological challenges require decentralised solutions
! Multiple autonomous vehicles for exploration, surveillance
or rescue missions
! Distributed sensing
! Traffic control (data, urban or air traffic)
! Key advantages: Fault tolerance and load balancing
! But: highly dynamic and nondeterministic environments!
! Need for adaptation on an individual level
! Learning is crucial!

Multiagent Reinforcement Learning - 44/60


.
MARL: From single to multiagent learning

! Inherently more challenging


! Agents interact with the environment and each other
! Learning is simultaneous
! Changes in strategy of one agent might affect strategy of
other agents
! Questions:
! One vs. many learning agents?
! Convergence?
! Objective: maximise common reward or individual reward?
! Credit assignment?

Multiagent Reinforcement Learning - 45/60


.

Independent reinforcement learners

! Naive extension to the multiagent setting


! Independent learners mutually ignore each other
! Implicitly perceive interaction with other agents as noise in
a stochastic environment

Multiagent Reinforcement Learning - 46/60


.
Learning in matrix games

! Two Q-learners interact in Battle of the Sexes
! α = 0.01
! Boltzmann exploration with τ = 0.2
! They only observe their immediate reward
! Policy is gradually improved

        B        S
B     2, 1     0, 0
S     0, 0     1, 2
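A minimal sketch of this experiment: two independent, stateless Q-learners with Boltzmann exploration repeatedly play Battle of the Sexes, each observing only its own reward. The code is an illustrative reconstruction, not the presenters' implementation; parameter values mirror the slide (α = 0.01, τ = 0.2):

```python
import math
import random

# Battle of the Sexes: PAYOFF[i][a1][a2] is the reward of player i for joint action (a1, a2).
PAYOFF = [[[2, 0], [0, 1]],   # player 1
          [[1, 0], [0, 2]]]   # player 2

def boltzmann_action(q, tau=0.2):
    prefs = [math.exp(v / tau) for v in q]
    total = sum(prefs)
    return random.choices([0, 1], weights=[p / total for p in prefs])[0]

def run(iterations=250, alpha=0.01, tau=0.2):
    q = [[0.0, 0.0], [0.0, 0.0]]  # one stateless Q-vector per player
    for _ in range(iterations):
        a = [boltzmann_action(q[0], tau), boltzmann_action(q[1], tau)]
        for i in (0, 1):
            r = PAYOFF[i][a[0]][a[1]]
            # Each player updates only the value of its own chosen action from its own reward.
            q[i][a[i]] += alpha * (r - q[i][a[i]])
    return q

print(run())
```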

Multiagent Reinforcement Learning - 47/60


.

Learning in matrix games

[Plot: probability of playing the first action over 250 iterations for Player 1 and Player 2 in Battle of the Sexes]

        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 48/60


.
Learning in matrix games

[Plot: average cumulative reward over 250 iterations for Player 1 and Player 2 in Battle of the Sexes]

        B        S
B     2, 1     0, 0
S     0, 0     1, 2

Multiagent Reinforcement Learning - 49/60


.

Markov games

n-player game: ⟨ n, S, A^1, . . . , A^n, R^1, . . . , R^n, P ⟩
! S : set of states
! A^i : action set for player i
! R^i : reward/payoff function for player i
! P : transition function

The payoff function R^i : S × A^1 × · · · × A^n → ℝ maps the joint action a = (a^1, . . . , a^n) to an immediate payoff value for player i.

The transition function P : S × A^1 × · · · × A^n → Δ(S) determines the probabilistic state change to the next state s_{t+1}.

Multiagent Reinforcement Learning - 50/60


.
Value iteration in Markov games

Single-agent MDP:

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

2-player zero-sum stochastic game:

Q*(s, a^1, a^2) = R(s, a^1, a^2) + γ Σ_{s' ∈ S} P_{s'}(s, a^1, a^2) V*(s')

V*(s) = max_{π ∈ Δ(A^1)} min_{a^2 ∈ A^2} Σ_{a^1 ∈ A^1} π_{a^1} Q*(s, a^1, a^2)

Multiagent Reinforcement Learning - 51/60


.

Minimax-Q
! Value iteration requires knowledge of the reward and transition functions
! Minimax-Q [Littman94]: learning algorithm for zero-sum games
! Payoffs balance out, so each agent only needs to observe its own payoff
! Q is a function of the joint action:

Q(s, a^1, a^2) = R(s, a^1, a^2) + γ Σ_{s' ∈ S} P_{s'}(s, a^1, a^2) V(s')

! A joint action learner (JAL) is an agent that learns Q-values for joint actions as opposed to individual actions.

Multiagent Reinforcement Learning - 52/60


.
Minimax-Q (2)

Update rule for agent 1 with reward R_t at stage t:

Q_{t+1}(s_t, a^1_t, a^2_t) = (1 − α_t) Q_t(s_t, a^1_t, a^2_t) + α_t [ R_t + γ V_t(s_{t+1}) ]

The value of the next state V(s_{t+1}):

V_{t+1}(s) = max_{π ∈ Δ(A^1)} min_{a^2 ∈ A^2} Σ_{a^1 ∈ A^1} π_{a^1} Q_t(s, a^1, a^2)

Minimax-Q converges to Nash equilibria under the same assumptions as regular Q-learning [Littman94]
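Computing V(s) requires solving the max-min of a matrix game at every update. A sketch of that inner step using scipy.optimize.linprog is below; Q is assumed to be a 2-D array Q[a1][a2] of the learner's joint-action values for the current state:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(Q):
    """Solve max_pi min_a2 sum_a1 pi[a1] * Q[a1, a2] as a linear program.

    Variables are the mixed strategy pi and the game value v: maximize v subject to
    sum_a1 pi[a1] * Q[a1, a2] >= v for every opponent action a2, with pi a probability vector.
    """
    Q = np.asarray(Q, dtype=float)
    m, n = Q.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                               # minimize -v, i.e. maximize v
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])  # v - pi^T Q[:, a2] <= 0 for each a2
    b_ub = np.zeros(n)
    A_eq = np.array([[1.0] * m + [0.0]])       # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                # maximin policy and value V(s)
```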

Multiagent Reinforcement Learning - 53/60


.

Nash-Q learning

! Nash-Q learning [Hu03]: joint action learner for general-sum stochastic games
! Each individual agent has to estimate Q-values for all other agents as well
! Optimal Nash-Q values: sum of immediate reward and discounted future rewards under the condition that all agents play a specified Nash equilibrium from the next stage onward

Multiagent Reinforcement Learning - 54/60


.
Nash-Q learning (2)

Update rule for agent i:

Q^i_{t+1}(s_t, a^1, . . . , a^n) = (1 − α_t) Q^i_t(s_t, a^1, . . . , a^n) + α_t [ R^i_t + γ NashQ^i_t(s_{t+1}) ]

A Nash equilibrium is computed for each stage game ( Q^1_t(s_{t+1}, ·), . . . , Q^n_t(s_{t+1}, ·) ) and results in the equilibrium payoff NashQ^i_t(s_{t+1}) to agent i

Agent i uses the same update rule to estimate Q-values for all other agents, i.e., Q^j for all j ∈ {1, . . . , n} \ {i}

Multiagent Reinforcement Learning - 55/60


.

Other equilibrium learning algorithms

! Friend-or-Foe Q-learning [Littman01]
! Correlated-Q learning (CE-Q) [Greenwald03]
! Nash bargaining solution Q-learning (NBS-Q) [Qiao06]
! Optimal adaptive learning (OAL) [Wang02]
! Asymmetric-Q learning [Kononen03]

Multiagent Reinforcement Learning - 56/60


.
Limitations of MARL

! Convergence guarantees are mostly restricted to stateless


repeated games
! ... or are inapplicable in general-sum games
! Many convergence proofs have strong assumptions with
respect to a-priori knowledge and/or observability
! Equilibrium learners focus on stage-wise solutions (only
indirect state coupling)

Multiagent Reinforcement Learning - 57/60


.

Summary

In a multi-agent system
! be aware of what information is available to the agent
! if you can afford to try, just run an algorithm that matches the assumptions
! proofs of convergence are available for small games
! new research can focus either on engineering solutions, or on advancing the state-of-the-art theories

Multiagent Reinforcement Learning - 58/60


.
Questions so far?

Multiagent Reinforcement Learning - 59/60


.

Thank you!
Daan Bloembergen ¦ daan.bloembergen@gmail.com
Daniel Hennes ¦ daniel.hennes@gmail.com

Multiagent Reinforcement Learning - 60/60


.
Dynamics of Learning in Strategic
Interactions
Michael Kaisers

September 27, 2013 - ECML MARL Tutorial


.

Outline

! Action values in dynamic environments


! Deriving learning dynamics
! Illustrating Convergence
! Comparing dynamics of various algorithms
! Replicator dynamics as models of evolution, swarm
intelligence and learning
! Summary

Multiagent Reinforcement Learning - 2/23


.
Action values in dynamic environments

Action values are estimated by sampling from interactions with


the environment, possibly in the presence of other agents.

Multiagent Reinforcement Learning - 3/23


.

Action values in dynamic environments

Static environment, off-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.
Action values in dynamic environments

Static environment, on-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.

Action values in dynamic environments

Static environment, on-policy Q-value updates

[Plots: reward over time for action 1 and action 2 (top) and the corresponding policy over time (bottom)]

Multiagent Reinforcement Learning - 4/23


.
Action values in dynamic environments

Adversarial environment, on-policy Q-value updates

[Plot: reward over time for action 1 and action 2, with their expected values and Q-value estimates]

Multiagent Reinforcement Learning - 4/23


.

Action values in dynamic environments

Adversarial environment, on-policy Q-value updates

[Plots: reward over time for action 1 and action 2 (top) and the corresponding policy over time (bottom)]

Multiagent Reinforcement Learning - 4/23


.
Deriving Learning Dynamics

Matching Pennies
        H        T
H     1, 0     0, 1
T     0, 1     1, 0

[Plots: learning trajectories in the joint policy space for learning rates α = 0.1 and α = 0.001]
Multiagent Reinforcement Learning - 5/23
.

Deriving Learning Dynamics

Learning algorithm   lim_{α→0} E(Δx) / Δt = dx/dt = ẋ   Dynamical system

Advantages of dynamical systems

! Deterministic
! Convergence guarantees using the Jacobian
! Vast related body of literature (e.g., bifurcation theory)

Multiagent Reinforcement Learning - 6/23


.
Deriving Learning Dynamics
Example: Cross learning [Boergers97]

x_i(t + 1) ← { (1 − α r_i) x_i + α r_i    if action i is selected
             { (1 − α r_j) x_i            if another action j is selected

E(Δx_i) = x_i [ (1 − α r_i) x_i + α r_i − x_i ] + Σ_{k≠i} x_k [ (1 − α r_k) x_i − x_i ]
        = α x_i [ (1 − x_i) r_i − Σ_{k≠i} x_k r_k ] = α x_i [ r_i − Σ_k x_k r_k ]

Taking the limit lim_{α→0} E(Δx) / Δt = dx/dt = ẋ turns the learning algorithm into a dynamical system:

ẋ_i = x_i ( E[r_i] − Σ_k x_k E[r_k] ) = x_i [ (Ay)_i − xAy ]    (replicator dynamics)

Multiagent Reinforcement Learning - 7/23


.

Deriving Learning Dynamics


Prisoners' Dilemma             Battle of Sexes                Matching Pennies
        D        C                     B        S                     H        T
D     1, 1     5, 0            B     2, 1     0, 0            H     1, -1    -1, 1
C     0, 5     3, 3            S     0, 0     1, 2            T    -1, 1     1, -1

Player 1 plays its first action with probability x_1, player 2 with probability y_1.

[Plots: replicator dynamics vector fields in the joint policy space (x_1, y_1) for each of the three games]

ẋ_i = x_i [ (Ay)_i − xAy ]

ẏ_i = y_i [ (x^T B)_i − x^T B y ]
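These two-population replicator dynamics can be integrated numerically with a simple Euler scheme; the sketch below does this for Matching Pennies (payoff matrices A and B as on the slide), with an arbitrary starting point and step size:

```python
import numpy as np

# Matching Pennies payoff matrices for the row player (A) and column player (B).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A  # zero-sum

def replicator_step(x, y, dt=0.01):
    """One Euler step of x_i' = x_i[(Ay)_i - xAy] and y_i' = y_i[(x^T B)_i - xBy]."""
    Ay = A @ y
    xB = x @ B
    dx = x * (Ay - x @ Ay)
    dy = y * (xB - xB @ y)
    return x + dt * dx, y + dt * dy

x, y = np.array([0.7, 0.3]), np.array([0.2, 0.8])
for _ in range(1000):
    x, y = replicator_step(x, y)
print(x, y)  # the trajectory orbits the mixed equilibrium (0.5, 0.5)
```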

Multiagent Reinforcement Learning - 8/23


.
Deriving Learning Dynamics
Dynamics have been derived for
! Learning Automata (Cross Learning)
! Regret Matching (RM)
! Variations of Infinitesimal Gradient Ascent
! Infinitesimal Gradient Ascent (IGA)
! Win-or-Learn-Fast (WoLF) IGA
! Weighted Policy Learning (WPL)
! Variations of Q-learning
! Repeated Update Q-learning
See our talk on Thursday, Session F4 Learning 1, 13:50
! Frequency Adjusted Q-learning

Multiagent Reinforcement Learning - 9/23


.

Illustrating Convergence
Q-learning [Watkins92]

x_i  probability of playing action i
α    learning rate
r    reward
τ    temperature

Update rule

Q_i(t + 1) ← Q_i(t) + α [ r_i(t) + γ max_j Q_j(t) − Q_i(t) ]

Policy generation function

x_i(Q, τ) = e^{τ^{-1} Q_i} / Σ_j e^{τ^{-1} Q_j}

Multiagent Reinforcement Learning - 10/23


.
Illustrating Convergence
Frequency Adjusted Q-learning (FAQ-learning) [Kaisers2010]

x_i  probability of playing action i
α    learning rate
r    reward
τ    temperature

Update rule

Q_i(t + 1) ← Q_i(t) + α (1 / x_i) [ r_i(t) + γ max_j Q_j(t) − Q_i(t) ]

Policy generation function

x_i(Q, τ) = e^{τ^{-1} Q_i} / Σ_j e^{τ^{-1} Q_j}
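The two update rules side by side for the stateless repeated games used in these illustrations; the unbounded 1/x_i factor follows the slide's formula (practical implementations often bound it), and γ = 0 because there is no state:

```python
def q_update(Q, i, r, alpha=0.01, gamma=0.0):
    """Standard Q-learning update for action i (gamma = 0 in a stateless repeated game)."""
    Q[i] += alpha * (r + gamma * max(Q) - Q[i])

def faq_update(Q, i, r, x, alpha=0.01, gamma=0.0):
    """FAQ-learning: the same update scaled by 1 / x_i, the probability of playing action i."""
    Q[i] += alpha * (1.0 / x[i]) * (r + gamma * max(Q) - Q[i])
```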

Multiagent Reinforcement Learning - 10/23


.

Illustrating Convergence

Cross Learning [Boergers97]

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Frequency Adjusted Q-learning [Tuyls05, Kaisers2010]

ẋ_i = α x_i ( τ^{-1} ( E[r_i(t)] − Σ_k x_k E[r_k(t)] ) − log x_i + Σ_k x_k log x_k )

Proof of convergence in two-player two-action games [Kaisers2011, Kianercy2012]

Multiagent Reinforcement Learning - 11/23


.
Illustrating Convergence
Prisoners' Dilemma

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 12/23


.

Illustrating Convergence
Battle of Sexes

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 13/23


.
Illustrating Convergence
Matching Pennies

[Plots: learning trajectories of Q-learning (top row) and FAQ-learning (bottom row) in the joint policy space (x_1, y_1), for pessimistic, neutral, and optimistic Q-value initialization]

Multiagent Reinforcement Learning - 14/23


.

Comparing Dynamics

Dynamical systems have been associated with


! Infinitesimal Gradient Ascent (IGA)
! Win-or-Learn-Fast Infinitesimal Gradient Ascent (WoLF)
! Weighted Policy Learning (WPL)

! Cross Learning (CL)


! Frequency Adjusted Q-learning (FAQ)
! Regret Matching (RM)

Multiagent Reinforcement Learning - 15/23


.
Comparing Dynamics

Learning dynamics for two-agent two-action games. The common gradient is abbreviated ∂ = y u + A_{12} − A_{22}, with u = A_{11} − A_{12} − A_{21} + A_{22}.

Algorithm   ẋ
IGA         α ∂
WoLF        α ∂ · { α_min   if V(x, y) > V(x_e, y)
                  { α_max   otherwise
WPL         α ∂ · { x       if ∂ < 0
                  { 1 − x   otherwise
CL          α x (1 − x) ∂
FAQ         α x (1 − x) [ ∂ τ^{-1} − log( x / (1 − x) ) ]
RM          α x (1 − x) ∂ · { (1 + α x ∂)^{-1}          if ∂ < 0
                            { (1 − α (1 − x) ∂)^{-1}    otherwise

Multiagent Reinforcement Learning - 16/23


.

Comparing Dynamics

Cross Learning is linked to the replicator dynamics

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Gradient-based algorithms use the orthogonal projection function, leading to the following dynamics for IGA:

ẋ_i = α ( E[r_i(t)] − (1/n) Σ_k E[r_k(t)] )

Gradient dynamics use information about all actions, and are equivalent to the replicator dynamics under the uniform policy (i.e., also given off-policy updates).

Multiagent Reinforcement Learning - 17/23


.
Replicator Dynamics
Path finding with Ant Colony Optimization

Pheromones τ and a travel cost heuristic η lead to probabilistic selection of the state transition x_{i,j} from i to j:

x_{i,j} = τ_{i,j}^α η_{i,j}^β / Σ_l τ_{i,l}^α η_{i,l}^β,   with α, β tuning parameters

Multiagent Reinforcement Learning - 18/23


.

Replicator Dynamics
Pheromone trail reinforcement in Ant Colony Optimization

τ_{i,j}(t + 1) = (1 − ρ) τ_{i,j}(t) + Σ_{m=1}^{M} δ_{i,j}(t, m),

where ρ denotes the pheromone evaporation rate, M is the number of ants, and δ_{i,j}(t, m) = Q n_{i,j}, with Q a constant and n_{i,j} the number of times edge (i, j) has been visited.

ẋ_{i,j} = d/dt [ τ_{i,j}^α η_{i,j}^β / Σ_l τ_{i,l}^α η_{i,l}^β ]
        = α x_{i,j} τ̇_{i,j} / τ_{i,j} − α x_{i,j} Σ_l x_{i,l} τ̇_{i,l} / τ_{i,l}
        = α ( x_{i,j} Θ_{i,j} − x_{i,j} Σ_k x_{i,k} Θ_{i,k} ),   with Θ_{i,j} = τ̇_{i,j} / τ_{i,j}    (replicator dynamics)
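A small sketch of the two ACO equations above; tau and eta are assumed to be dictionaries keyed by edge (i, j), neighbors maps each node to its successors, and deposits collects the per-ant contributions, all placeholder structures for illustration:

```python
def transition_probs(i, tau, eta, neighbors, alpha=1.0, beta=2.0):
    """x_{i,j} proportional to tau_{i,j}^alpha * eta_{i,j}^beta over the neighbours of i."""
    weights = {j: (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta) for j in neighbors[i]}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def update_pheromones(tau, deposits, rho=0.1):
    """tau_{i,j}(t+1) = (1 - rho) * tau_{i,j}(t) + total deposit of all ants on edge (i, j)."""
    return {edge: (1 - rho) * value + deposits.get(edge, 0.0) for edge, value in tau.items()}
```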

Multiagent Reinforcement Learning - 19/23


.
Replicator Dynamics

ẋ_i = x_i ( E[r_i(t)] − Σ_k x_k E[r_k(t)] )

Relative competitiveness as encoded by the replicator dynamics models
! the selection operator in evolutionary game theory
! pheromone trail reinforcement in swarm intelligence
! and exploitation in reinforcement learning dynamics.

Multiagent Reinforcement Learning - 20/23


.

Summary

In strategic interactions
! the action values change over time
! the joint learning is a complex stochastic system
! dynamics can be captured in dynamical systems
! proof of convergence
! similarity of dynamics despite different implementations
! link between learning, evolution and swarm intelligence
! different assumptions about observability give rise to a
menagerie of algorithms to choose from

Multiagent Reinforcement Learning - 21/23


.
Questions?

Multiagent Reinforcement Learning - 22/23


.

Thank you!
Michael Kaisers ¦ michaelkaisers@gmail.com

Multiagent Reinforcement Learning - 23/23


.