5.4 Reinforcement Learning - Part 3: Q-Learning
The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.
For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it
maximizes the expected value of the total reward over all successive steps, starting from the current state.
Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time
and a partly-random policy.
"Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.
Suppose we have the optimal Q-function (s, a) then the optimal policy in state s is argmax a Q(s, a).
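For reference, the standard tabular Q-learning update, which the worked example below applies step by step, is (with learning rate α, discount factor γ, reward r, and successor state s'):

\[
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \Big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\Big]
\]

The tables in this lecture are consistent with α = 1 and γ = 0.5, in which case the update simplifies to Q(s, a) ← r + 0.5 · max_a' Q(s', a').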
Q-learning Algorithm
[Figure: the example gridworld and its rewards: r = 8 for the move into the goal state, r = -8 for a move into a penalty cell, and r = 0 for every other move.]
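A minimal Python sketch of this algorithm, assuming a hypothetical environment object with reset(), actions(s), and step(s, a) methods (these names are illustrative, not from the slides); the defaults alpha = 1.0 and gamma = 0.5 match the values the worked example below appears to use:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=1.0, gamma=0.5, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration."""
        Q = defaultdict(float)  # Q[(state, action)], implicitly 0 at the start

        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy action selection: explore occasionally,
                # otherwise take the best-known action.
                if random.random() < epsilon:
                    a = random.choice(env.actions(s))
                else:
                    a = max(env.actions(s), key=lambda act: Q[(s, act)])

                s_next, r, done = env.step(s, a)

                # Bootstrap from the best action in the successor state
                # (no bootstrap term once the episode has ended).
                target = r if done else r + gamma * max(
                    Q[(s_next, a2)] for a2 in env.actions(s_next))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q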
States and Actions
States: s in {1, ..., 20}; Actions: a in {N, S, E, W}

[Figure: a 4 x 5 grid of states,

     1  2  3  4  5
     6  7  8  9 10
    11 12 13 14 15
    16 17 18 19 20

with the four compass actions N (north), S (south), E (east), W (west).]
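This gridworld can be sketched as code compatible with the q_learning function above. The goal and penalty cells are not listed in the text; the positions below are inferred from the Q-tables later in the lecture (goal at state 10, penalty cells ringing the corridor of states 7-9 and 12-14) and should be read as assumptions:

    import random

    class GridWorld:
        """4 x 5 grid of states 1-20; moving N/S/E/W shifts by -5/+5/+1/-1."""
        MOVES = {"N": -5, "S": 5, "E": 1, "W": -1}
        GOAL = 10                                   # entering it: r = +8, episode ends
        PENALTY = {2, 3, 4, 6, 11, 15, 17, 18, 19}  # entering one: r = -8, episode ends
        START = [7, 8, 9, 12, 13, 14]               # assumed: episodes start in the corridor

        def reset(self):
            return random.choice(self.START)

        def actions(self, s):
            row, col = divmod(s - 1, 5)
            acts = []
            if row > 0: acts.append("N")
            if row < 3: acts.append("S")
            if col < 4: acts.append("E")
            if col > 0: acts.append("W")
            return acts  # only moves that stay on the grid

        def step(self, s, a):
            s_next = s + self.MOVES[a]
            if s_next == self.GOAL:
                return s_next, 8, True
            if s_next in self.PENALTY:
                return s_next, -8, True
            return s_next, 0, False

Running Q = q_learning(GridWorld()) for enough episodes reproduces the pattern of the tables below: 8 one step from the goal, then 4, 2, 1, 0.5 as the distance grows, and -8 for every move onto a penalty cell.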
The Q(s, a) table is initialized to zero for every state-action pair:

    States:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
    Actions
      N       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      S       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      W       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      E       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
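In code this is just an array of zeros; one possible layout (rows for the four actions in the table's order N, S, W, E, columns for states 1-20) would be:

    import numpy as np

    Q = np.zeros((4, 20))  # Q[action, state - 1], all zero before learning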
An Episode
[Figure: the agent's first episode traced on the grid. The exact path is not recoverable from the text; from the table that follows, the episode's last move was N from state 7 into a penalty cell (r = -8).]
Calculating new Q(s, a) values
The per-step equations were given in a figure; their net effect can be reconstructed from the resulting table, using the update Q(s, a) ← r + 0.5 · max_a' Q(s', a'):
1st step: r = 0 and every Q value of the successor state is still 0, so the entry stays 0.
2nd step: likewise, no change.
3rd step: likewise, no change.
4th step: the move N from state 7 enters a penalty cell, so Q(7, N) ← -8 + 0.5 · 0 = -8.
The Q(s, a) function after the first episode
    States:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
    Actions
      N       0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
      S       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      W       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      E       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The only change is Q(7, N) = -8.
A second episode
[Figure: the agent's second episode traced on the grid; from the table that follows, its last move was E from state 9 into the goal state 10 (r = 8).]
Calculating new Q(s, a) values
Again reconstructed from the resulting table:
1st step: r = 0 and the successor state's best Q value is 0, so the entry stays 0.
2nd step: likewise, no change.
3rd step: likewise, no change.
4th step: the move E from state 9 enters the goal (terminal), so Q(9, E) ← 8.
The Q(s, a) function after the second episode
    States:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
    Actions
      N       0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
      S       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      W       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      E       0   0   0   0   0   0   0   0   8   0   0   0   0   0   0   0   0   0   0   0

The new entry is Q(9, E) = 8; Q(7, N) = -8 carries over from the first episode.
The Q(s, a) function after a few episodes
    States:    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
    Actions
      N        0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
      S        0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
      W        0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
      E        0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
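Working a few of these entries through the assumed update Q(s, a) ← r + 0.5 · max_a' Q(s', a') shows why the values halve with each extra step of distance from the goal at state 10:

\[
\begin{aligned}
Q(9, \mathrm{E}) &= 8 && \text{(moves into the goal; terminal, no bootstrap term)} \\
Q(8, \mathrm{E}) &= 0 + 0.5 \max_{a'} Q(9, a') = 0.5 \cdot 8 = 4 \\
Q(7, \mathrm{E}) &= 0 + 0.5 \max_{a'} Q(8, a') = 0.5 \cdot 4 = 2 \\
Q(7, \mathrm{S}) &= 0 + 0.5 \max_{a'} Q(12, a') = 0.5 \cdot 1 = 0.5
\end{aligned}
\]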
One of the optimal policies
The Q-table is the one shown above; on the original slide, one maximizing action was highlighted in each state. With these values the greedy choices are E in states 7, 8, and 9 (toward the goal at 10), N in state 14, and a tie between N and E in states 12 and 13. One optimal policy therefore chooses, for example, E in states 12 and 13 as well.
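The greedy policy can be read off the table with a short script. This sketch assumes the numpy layout used earlier (rows = actions N, S, W, E; columns = states 1-20) and fills in only the corridor states:

    import numpy as np

    ACTIONS = ["N", "S", "W", "E"]
    Q = np.zeros((4, 20))
    # Values from the table above, for states 7-9 and 12-14.
    Q[:, 6:9]   = [[-8, -8, -8], [0.5, 1, 2], [-8, 1, 2], [2, 4, 8]]
    Q[:, 11:14] = [[1, 2, 4], [-8, -8, -8], [-8, 0.5, 1], [1, 2, -8]]

    # List every action that attains the maximal Q value in each state.
    for s in [7, 8, 9, 12, 13, 14]:
        col = Q[:, s - 1]
        best = [a for a, q in zip(ACTIONS, col) if q == col.max()]
        print(s, best)   # states 12 and 13 print two tied actions, N and E

The ties at states 12 and 13 are exactly what makes several policies optimal at once, which is why the following slides can show a second, different optimal policy over the same Q-table.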
An optimal policy graphically
[Figure: the 4 x 5 grid with an arrow in each corridor state showing this policy: E in states 7, 8, 9, 12, and 13, and N in state 14, all routes leading to the goal at state 10.]
Another of the optimal policies
The Q-table is again the same; the slide highlights a different set of maximizing actions. Because N and E are tied in states 12 and 13 (Q = 1 and Q = 2 respectively), choosing N there instead of E yields another policy that is equally optimal.
Another optimal policy graphically
[Figure: the same grid showing the alternative policy, with N chosen in states 12 and 13 (then E along states 7-9), an equally short route to the goal.]