Reinforcement Learning
[Figure: the 4×3 world. START is at (1,1); the terminal states have rewards +1 and −1. Each action moves the agent in the intended direction with probability 0.8, and perpendicular to it with probability 0.1 on each side.]
States s ∈ S, actions a ∈ A
Model T(s, a, s′) ≡ P(s′ | s, a) = probability that doing a in s leads to s′
Reward function R(s) (or R(s, a), R(s, a, s′)):
    R(s) = −0.04 (small penalty) for nonterminal states
    R(s) = ±1 for terminal states
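As a concrete illustration, here is a minimal sketch of this MDP in Python; the names (STATES, TERMINALS, ACTIONS, reward, transition) are ad hoc choices for this sketch, not taken from the slides or any library, and the wall at (2,2) and the terminal squares follow the 4×3 figure above.

```python
# A sketch of the 4x3 world as an MDP. All names here (STATES, TERMINALS,
# ACTIONS, reward, transition) are illustrative, not from the slides or a library.

# States are (column, row) pairs; (2, 2) is a wall, (4, 3) and (4, 2) are terminal.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def reward(s):
    """R(s): -0.04 for nonterminal states, +1 or -1 for the terminal states."""
    return TERMINALS.get(s, -0.04)

def _move(s, delta):
    """Deterministic move; bumping into the wall or the grid edge leaves s unchanged."""
    s2 = (s[0] + delta[0], s[1] + delta[1])
    return s2 if s2 in STATES else s

# Perpendicular directions for each action, used for the 0.1 "slip" probabilities.
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transition(s, a):
    """T(s, a, s') as a dict mapping s' to P(s' | s, a): intended direction with
    probability 0.8, each perpendicular direction with probability 0.1."""
    if s in TERMINALS:                 # no transitions out of terminal states
        return {s: 1.0}
    probs = {}
    for direction, p in [(a, 0.8), (PERPENDICULAR[a][0], 0.1), (PERPENDICULAR[a][1], 0.1)]:
        s2 = _move(s, ACTIONS[direction])
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```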
[Figure: the 4×3 world and the utilities of its states (values such as 0.762 and 0.660 in row 2, with ±1 at the terminal states).]
U(1, 1) = −0.04 + γ max{ 0.8 U(1, 2) + 0.1 U(2, 1) + 0.1 U(1, 1),   (up)
                         0.9 U(1, 1) + 0.1 U(1, 2),                 (left)
                         0.9 U(1, 1) + 0.1 U(2, 1),                 (down)
                         0.8 U(2, 1) + 0.1 U(1, 2) + 0.1 U(1, 1) }  (right)
One equation per state = n nonlinear equations in n unknowns
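These equations can be solved iteratively (value iteration): start with arbitrary utilities and repeatedly apply the Bellman update until the values stop changing. Below is a minimal sketch, assuming the STATES, TERMINALS, ACTIONS, reward and transition definitions from the MDP sketch above; γ and the stopping threshold are illustrative choices.

```python
# A minimal value-iteration sketch; it reuses STATES, TERMINALS, ACTIONS, reward
# and transition from the MDP sketch above. gamma and eps are illustrative choices.

def value_iteration(gamma=1.0, eps=1e-6):
    """Solve the Bellman equations by repeatedly applying the Bellman update."""
    U = {s: 0.0 for s in STATES}
    while True:
        delta, new_U = 0.0, {}
        for s in STATES:
            if s in TERMINALS:
                new_U[s] = reward(s)       # a terminal state's utility is its reward
            else:
                # U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')
                new_U[s] = reward(s) + gamma * max(
                    sum(p * U[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS)
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:
            return U

# With gamma = 1 the resulting utilities should match the figure above
# (values such as 0.762 and 0.660, and +/-1 at the terminal states).
U = value_iteration()
```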
[Figure: utility estimates for selected states (e.g. (3,1) and (4,1)) as a function of the number of iterations.]
[Figure: the 4×3 world, with START at (1,1) and terminal states +1 and −1.]
[Figure: learning curves for the 4×3 world, plotted against the number of trials: utility estimates for the states (1,1), (1,3), (2,1), (3,3), and (4,3) (left panel, 500 trials) and a second learning curve (right panel, 100 trials).]
One drawback of learning U(s): still need T(s, a, s′) to make decisions
Learning Q(a, s) directly avoids this problem
Bellman equation:

    Q(a, s) = R(s) + γ Σ_{s′} T(s, a, s′) max_{a′} Q(a′, s′)
Q-learning update:

    Q(a, s) ← Q(a, s) + α (R(s) + γ max_{a′} Q(a′, s′) − Q(a, s))
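Below is a minimal tabular Q-learning sketch for the same world, reusing the definitions from the MDP sketch above; the ε-greedy exploration, learning rate α, and number of trials are illustrative choices, not prescribed by the slides. Note that the agent acts and updates from sampled transitions only; the model T is used solely to simulate the environment.

```python
# A tabular Q-learning sketch for the same world, reusing STATES, TERMINALS,
# ACTIONS, reward and transition from the sketches above. The exploration rate,
# learning rate and number of trials are illustrative, not taken from the slides.
import random

def sample_next_state(s, a):
    """Sample s' from T(s, a, .) -- used only by the simulated environment."""
    r, acc = random.random(), 0.0
    for s2, p in transition(s, a).items():
        acc += p
        if r <= acc:
            return s2
    return s2                                  # guard against rounding error

def q_learning(trials=5000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = {(a, s): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(trials):
        s = (1, 1)                             # every trial starts at START
        while s not in TERMINALS:
            # epsilon-greedy action choice: no model is needed to act
            if random.random() < epsilon:
                a = random.choice(list(ACTIONS))
            else:
                a = max(ACTIONS, key=lambda act: Q[(act, s)])
            s2 = sample_next_state(s, a)
            # bootstrap value of s': its own reward if terminal, else max_a' Q(a', s')
            boot = reward(s2) if s2 in TERMINALS else max(Q[(a2, s2)] for a2 in ACTIONS)
            # Q(a, s) <- Q(a, s) + alpha * (R(s) + gamma * boot - Q(a, s))
            Q[(a, s)] += alpha * (reward(s) + gamma * boot - Q[(a, s)])
            s = s2
    return Q
```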
[Figure: Q-learning in the 4×3 world: RMS error and policy loss as a function of the number of trials.]
[Figure: an arbitrator architecture: policies π1, π2, π3 each propose an action (a1, a2, a3) for the current state s, and an arbitrator selects the action a that is executed.]
[Figure: learning curves over 1000 training episodes comparing Local SARSA, Global SARSA, and Local Q.]