Yang - Good Brief of RL
Multi-agent Reinforcement Learning (MARL)
September 27, 2013 - ECML'13
Presenters
! Daan Bloembergen
! Daniel Hennes
! Michael Kaisers
! Peter Vrancx
Outline (1)
Part I: Single Agent Reinforcement Learning
The RL setting
Preliminaries: the setting of RL
What is it?
[Figure: the agent-environment loop: the agent sends action a(t) to the environment and observes the next state s(t+1).]
Key features of RL
The agent-environment interface
Agent interacts at discrete time steps t = 0, 1, 2, . . .
! Observes state s_t ∈ S
! Selects action a_t ∈ A(s_t)
! Obtains immediate reward r_{t+1} ∈ R
! Observes resulting state s_{t+1}
[Figure: the interaction loop, producing the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, . . .]
Elements of RL
The objective
R_t = r_{t+1} + r_{t+2} + . . . + r_T
R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},   shortsighted 0 ← γ → 1 farsighted
An MDP is defined by
! State and action sets
! One-step dynamics, defined by the state transition probabilities
  P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a)
! Reward probabilities
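The discounted objective above can be made concrete with a few lines of code. A minimal sketch, with made-up reward values for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```

With γ near 0 only the first reward matters (shortsighted); with γ near 1 later rewards count almost fully (farsighted).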
Value functions
Value iteration
! One possibility for finding the optimal policy
! Bootstrapping!
! Often the optimal policy has been reached long before the
value function has converged
! Policy iteration calculates a new policy based on the
current value function, and then calculates a new value
function based on this policy
! This process often converges faster to the optimal policy
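The value-iteration idea can be sketched on a toy problem. A minimal sketch on an invented two-state MDP (the state names, payoffs, and the dict-based transition encoding are my own illustration, not from the tutorial):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Bellman optimality backups until the value function stops changing.
    P[s][a]: list of (probability, next_state); R[s][a]: expected reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Toy two-state chain: 'go' pays 1 from A, 'stay' pays 1 in B
states = ['A', 'B']
actions = ['stay', 'go']
P = {'A': {'stay': [(1.0, 'A')], 'go': [(1.0, 'B')]},
     'B': {'stay': [(1.0, 'B')], 'go': [(1.0, 'A')]}}
R = {'A': {'stay': 0.0, 'go': 1.0}, 'B': {'stay': 1.0, 'go': 0.0}}
V = value_iteration(states, actions, P, R)  # both values converge to 1/(1-gamma) = 10
```

Note that the greedy policy (go in A, stay in B) is already optimal after the first sweep, long before V itself converges, which is exactly the observation that motivates policy iteration.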
Policy iteration
Q-learning
Action selection: softmax (Boltzmann) exploration
π_t(s, a) = e^{Q_t(s,a)/τ} / Σ_{a' ∈ A} e^{Q_t(s,a')/τ}
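A minimal sketch of this Boltzmann (softmax) action-selection rule; subtracting the maximum Q-value is only for numerical stability and does not change the probabilities:

```python
import math
import random

def softmax_policy(q_values, tau):
    """Boltzmann probabilities pi(a) proportional to exp(Q(a)/tau)."""
    m = max(q_values)                       # stabilize the exponentials
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_policy(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs)[0]
```

A high temperature τ gives near-uniform exploration; as τ → 0 the rule becomes greedy.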
Learning automata
Extensions
! Multi-step TD: eligibility traces
! Instead of observing one immediate reward, use n
consecutive rewards for the value update
! Intuition: your current choice of action may have
implications for the future
! State-action pairs are eligible for future rewards, with more
recent states getting more credit
Questions so far?
Game theory
Types of games
Matching Pennies
        H        T
H    1, −1   −1, 1
T    −1, 1    1, −1
MARL: Motivation
! Multi-agent systems (MAS) offer a solution paradigm that
can cope with complex problems
! Technological challenges require decentralised solutions
! Multiple autonomous vehicles for exploration, surveillance
or rescue missions
! Distributed sensing
! Traffic control (data, urban or air traffic)
! Key advantages: Fault tolerance and load balancing
! But: highly dynamic and nondeterministic environments!
! Need for adaptation on an individual level
! Learning is crucial!
[Figure: probability of the first action over 250 iterations, for both players in the Battle of the Sexes game.]
Battle of the Sexes:
        B      S
B    2, 1   0, 0
S    0, 0   1, 2
[Figure: average cumulative reward over 250 iterations, for both players.]
Markov games
An n-player Markov game: ⟨n, S, A_1, . . . , A_n, R_1, . . . , R_n, P⟩
! S : set of states
! A_i : action set for player i
! R_i : reward/payoff function for player i
! P : transition function
Minimax value (zero-sum, from player 1's perspective):
V*(s) = max_{π ∈ Δ(A_1)} min_{a_2 ∈ A_2} Σ_{a_1 ∈ A_1} π(a_1) Q*(s, a_1, a_2)
Minimax-Q
! Value iteration requires knowledge of the reward and
transition functions
! Minimax-Q [Littman94]: learning algorithm for zero-sum
games
! Payoffs balance out, so each agent only needs to observe its
own payoff
! Q is a function of the joint action:
Q(s, a_1, a_2) = R(s, a_1, a_2) + γ Σ_{s' ∈ S} P(s' | s, a_1, a_2) V(s')
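Minimax-Q needs the minimax value of the stage game at every state; in general this is a linear program, but for a 2×2 zero-sum game it has a closed form. A minimal sketch (the helper name and the example matrices are my own, not from the tutorial):

```python
def solve_zero_sum_2x2(A):
    """Minimax value and row player's mixed strategy for a 2x2 zero-sum game.
    A[i][j] is the row player's payoff; the column player receives -A[i][j]."""
    (a, b), (c, d) = A
    # Pure-strategy saddle point exists when maximin equals minimax
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        p = 1.0 if min(a, b) >= min(c, d) else 0.0
        return maximin, p
    # Otherwise mix so the payoff is equal against both opponent actions
    denom = a - b - c + d
    p = (d - c) / denom              # probability of playing the first row
    value = (a * d - b * c) / denom
    return value, p

# Matching Pennies: value 0, mix 50/50
v, p = solve_zero_sum_2x2([[1, -1], [-1, 1]])
```

This is only a sketch for the 2×2 case; full minimax-Q solves the linear program max_π min_{a_2} Σ_{a_1} π(a_1) Q(s, a_1, a_2) at each visited state.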
Nash-Q learning
A Nash equilibrium is computed for each stage game
(Q_1^t(s_{t+1}, ·), . . . , Q_n^t(s_{t+1}, ·)) and results in the equilibrium
payoff Nash_i(s_{t+1}, ·) to agent i.
Agent i uses the same update rule to estimate Q-values for all
other agents, i.e., Q_j ∀j ∈ {1, . . . , n} \ {i}
Summary
In a multi-agent system
! be aware of what information is available to the agent
! if you can afford to try, just run an algorithm that matches
the assumptions
! proofs of convergence are available for small games
! new research can focus either on engineering solutions, or
advancing the state-of-the-art theories
Thank you!
Daan Bloembergen ¦ daan.bloembergen@gmail.com
Daniel Hennes ¦ daniel.hennes@gmail.com
Outline
[Figure: reward traces and policy trajectories (probability of action 1 vs. action 2) over time, shown for learning rates α = 0.1 and α = 0.001.]
Learning algorithm  →  lim_{α→0} E[∆x]/∆t = ẋ  →  Dynamical system
ẋ_i = x_i ( E[r_i] − Σ_j x_j E[r_j] ) = x_i [ (Ax)_i − x·Ax ]
(replicator dynamics)
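The replicator dynamics can be simulated directly by Euler integration. A minimal single-population sketch for a symmetric game (the coordination-game matrix and step sizes are my own example):

```python
def replicator_step(x, A, dt=0.01):
    """One Euler step of the replicator dynamics
    x_i' = x_i * ((A x)_i - x . A x)."""
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    avg = sum(x[i] * Ax[i] for i in range(n))          # average payoff x . A x
    return [x[i] + dt * x[i] * (Ax[i] - avg) for i in range(n)]

# Pure coordination game: both pure strategies are stable rest points
A = [[1, 0], [0, 1]]
x = [0.6, 0.4]          # start with a slight bias toward action 1
for _ in range(2000):
    x = replicator_step(x, A, dt=0.1)
```

Starting above the mixed equilibrium at (0.5, 0.5), the population share of action 1 grows toward 1; the update leaves the simplex constraint Σ_i x_i = 1 invariant.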
Illustrating Convergence
Q-learning [Watkins92]
x_i : probability of playing action i
α : learning rate
r : reward
τ : temperature
Update rule
Q_i(t + 1) ← Q_i(t) + α ( r_i(t) + γ max_j Q_j(t) − Q_i(t) )
x_i(Q, τ) = e^{τ^{−1} Q_i} / Σ_j e^{τ^{−1} Q_j}
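Convergence plots of this kind can be reproduced with two independent Q-learners playing a repeated 2×2 game. A minimal stateless sketch (single stage game, γ = 0 by default; the parameter values are my own choices):

```python
import math
import random

def softmax(q, tau):
    """Boltzmann probabilities x_i = exp(Q_i/tau) / sum_j exp(Q_j/tau)."""
    m = max(q)
    e = [math.exp((v - m) / tau) for v in q]
    s = sum(e)
    return [v / s for v in e]

# Battle of the Sexes payoffs: R1 for the row player, R2 for the column player
R1 = [[2, 0], [0, 1]]
R2 = [[1, 0], [0, 2]]

def run(alpha=0.1, tau=0.1, gamma=0.0, steps=5000, seed=0):
    """Two independent Q-learners with softmax exploration in a stage game."""
    rng = random.Random(seed)
    q1, q2 = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        a1 = rng.choices([0, 1], weights=softmax(q1, tau))[0]
        a2 = rng.choices([0, 1], weights=softmax(q2, tau))[0]
        # Update rule from the slide, restricted to a single state:
        # Q_i(t+1) <- Q_i(t) + alpha * (r_i(t) + gamma * max_j Q_j(t) - Q_i(t))
        q1[a1] += alpha * (R1[a1][a2] + gamma * max(q1) - q1[a1])
        q2[a2] += alpha * (R2[a1][a2] + gamma * max(q2) - q2[a2])
    return q1, q2

q1, q2 = run()
```

Tracking softmax(q1, tau) over the iterations yields policy trajectories like those in the figures.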
Illustrating Convergence
[Figure: policy trajectories (x1 vs. y1) for Q-learning (top row) and FAQ-learning (bottom row), three panels each.]
Illustrating Convergence
Battle of the Sexes
[Figure: policy trajectories (x1 vs. y1) for Q-learning (top row) and FAQ-learning (bottom row) in the Battle of the Sexes game, three panels each.]
Comparing Dynamics
Cross learning is linked to the replicator dynamics:
ẋ_i = x_i ( E[r_i(t)] − Σ_j x_j E[r_j(t)] )
Replicator Dynamics
Pheromone trail reinforcement in Ant Colony Optimization:
τ_{i,j}(t + 1) = (1 − ρ) τ_{i,j}(t) + Σ_{k=1}^{m} δ_{i,j}(t, k)
ẋ_i = x_i ( E[r_i(t)] − Σ_j x_j E[r_j(t)] )
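The pheromone update maps directly to code. A minimal sketch in which edges are keyed by (i, j) tuples and each ant k contributes a dict of deposits (all names and numbers are my own illustration):

```python
def pheromone_update(tau, deposits, rho=0.1):
    """tau_{i,j}(t+1) = (1 - rho) * tau_{i,j}(t) + sum_k delta_{i,j}(t, k).
    `deposits` holds one dict of per-edge deposits per ant."""
    return {edge: (1 - rho) * t + sum(d.get(edge, 0.0) for d in deposits)
            for edge, t in tau.items()}

tau = {('A', 'B'): 1.0, ('A', 'C'): 1.0}
# Two ants: the first reinforces edge (A, B), the second edge (A, C)
tau = pheromone_update(tau, [{('A', 'B'): 0.5}, {('A', 'C'): 0.2}])
```

The evaporation term (1 − ρ) plays the role of forgetting, while the deposit sum plays the role of reinforcement, which is what links this update to the replicator dynamics above.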
Summary
In strategic interactions
! the action values change over time
! the joint learning is a complex stochastic system
! dynamics can be captured in dynamical systems
! proof of convergence
! similarity of dynamics despite different implementations
! link between learning, evolution and swarm intelligence
! different assumptions about observability give rise to a
menagerie of algorithms to choose from
Thank you!
Michael Kaisers ¦ michaelkaisers@gmail.com