
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 5 Machine Learning enabled by prior Theories

Video 5.4 Reinforcement Learning – Part 3 Q-learning


Q-Learning

Q-learning is a model-free, off-policy, temporal-difference (TD) reinforcement learning algorithm.

The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.

For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it
maximizes the expected value of the total reward over any and all successive steps, starting from the current
state.

Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time
and a partly-random policy.

"Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.

Suppose we have the optimal Q-function Q*(s, a); then the optimal policy in state s is to take the action argmax_a Q*(s, a).
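
For example, given a (hypothetical) table of Q values for a single state, the greedy action is simply the one with the largest value. A minimal sketch in Python, with made-up numbers:

    # Hypothetical Q(s, a) values for one state s
    Q_s = {"N": -8.0, "S": 0.5, "W": -8.0, "E": 2.0}

    best_action = max(Q_s, key=Q_s.get)   # argmax_a Q(s, a)
    print(best_action)                    # -> 'E'
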
Q-learning Algorithm

Initialize Q(s, a) arbitrarily

Repeat (for each episode)
    Initialize s
    Repeat (for each step of the episode)
        Choose a from s using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal

With α = 1 the updating formula simplifies to the first line below; with α = 1 and γ = 1 it simplifies further to the second:

Q(s, a) ← r + γ max_a' Q(s', a')
Q(s, a) ← r + max_a' Q(s', a')
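
As an illustration, here is a minimal sketch of tabular Q-learning in Python. The environment interface (env.reset(), env.step()), the ε-greedy exploration and the parameter defaults are assumptions chosen for the example, not something prescribed by the pseudocode above.

    import numpy as np

    def q_learning(env, n_states, n_actions,
                   episodes=500, alpha=1.0, gamma=0.5, epsilon=0.1):
        """Tabular Q-learning sketch; env.reset()/env.step() is an assumed interface."""
        Q = np.zeros((n_states, n_actions))        # initialize Q(s, a) arbitrarily (here: 0)
        for _ in range(episodes):                  # for each episode
            s = env.reset()                        # initialize s
            done = False
            while not done:                        # for each step of the episode
                # choose a from s using an epsilon-greedy policy derived from Q
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)      # take action a, observe r, s'
                # Q(s, a) <- Q(s, a) + alpha [r + gamma max_a' Q(s', a') - Q(s, a)]
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next                         # s <- s'
        return Q

With alpha = 1.0 and gamma = 0.5 this matches the setting used in the grid-world example that follows.
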
Example

(Figure: a grid world in which some moves receive reward r = 8, some r = -8, and all others r = 0.)
States and Actions

States s: the cells of a 4 x 5 grid, numbered 1-20

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

Actions a: N, S, E, W

Assume that α=1 and γ = 0.5


Initializing the Q(s, a) function

All Q(s, a) values are initialized to 0 (rows = actions N, S, W, E; columns = states 1-20):

State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
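
A sketch of this initialization in Python/numpy; the action-to-index mapping and the 0-based state indexing are assumptions made for the example:

    import numpy as np

    N_STATES, N_ACTIONS = 20, 4                  # grid cells 1-20; actions N, S, W, E
    ACTIONS = {"N": 0, "S": 1, "W": 2, "E": 3}   # assumed action-to-index mapping

    # Q table of zeros; row index = state - 1, column index = action index
    Q = np.zeros((N_STATES, N_ACTIONS))
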
An Episode

(Figure: the 4 x 5 grid of states 1-20, showing the four moves of the first episode.)
Calculating new Q(s, a) values

1st step:

2nd step:

3rd step:

4th step:

Because every Q(s, a) value is still 0, the term γ max_a' Q(s', a') is 0 at each step, so each update reduces to Q(s, a) ← r. The only move with a nonzero reward is action N taken in state 7, which sets Q(7, N) = -8; this is the only entry that changes in the table below.
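
A sketch of that single nonzero update (the successor state 2, the cell north of state 7, is an assumption; since all Q values are still 0 it does not affect the result):

    import numpy as np

    alpha, gamma = 1.0, 0.5
    Q = np.zeros((20, 4))                 # all Q values are still 0 at this point
    N = 0                                 # assumed index of action N, as in the mapping above

    s, a = 7 - 1, N                       # state 7, action N (0-based row index)
    s_next, r = 2 - 1, -8                 # assumed successor state 2; observed reward -8

    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    print(Q[s, a])                        # -> -8.0
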
The Q(s, a) function after the first episode

State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
A second episode

(Figure: the 4 x 5 grid of states 1-20, showing the four moves of the second episode.)
Calculating new Q(s, a) values

1st step:

2nd step:

3rd step:

4th step:

Again γ max_a' Q(s', a') is 0 at every step (the only nonzero entry so far, Q(7, N) = -8, is never the maximum over actions), so each update reduces to Q(s, a) ← r. The only move with a nonzero reward is action E taken in state 9, which sets Q(9, E) = 8; this is the only entry that changes in the table below.
The Q(s, a) function after the second episode

State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    8    0    0    0    0    0    0    0    0    0    0    0
The Q(s, a) function after a few episodes

State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
S        0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
W        0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
E        0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
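
Reading a greedy policy out of this table can be sketched as below; the action ordering and 0-based indexing are the same assumptions as before, and only the nonzero columns (states 7-9 and 12-14) are included:

    import numpy as np

    ACTION_NAMES = ["N", "S", "W", "E"]            # assumed row ordering

    # Columns = states 7, 8, 9, 12, 13, 14 of the table above (all other states are 0)
    Q_part = np.array([
        [-8.0, -8.0, -8.0,  1.0,  2.0,  4.0],      # N
        [ 0.5,  1.0,  2.0, -8.0, -8.0, -8.0],      # S
        [-8.0,  1.0,  2.0, -8.0,  0.5,  1.0],      # W
        [ 2.0,  4.0,  8.0,  1.0,  2.0, -8.0],      # E
    ])

    for i, s in enumerate([7, 8, 9, 12, 13, 14]):
        col = Q_part[:, i]
        best = np.flatnonzero(col == col.max())    # all maximizing actions (handles ties)
        print(s, [ACTION_NAMES[j] for j in best])
    # 7 ['E'], 8 ['E'], 9 ['E'], 12 ['N', 'E'], 13 ['N', 'E'], 14 ['N']

The ties in states 12 and 13 are what give rise to more than one optimal policy, as shown on the following slides.
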
One of the optimal policies

(Slide: the same Q(s, a) table as above, with one choice of maximal-Q action marked for each state.)
An optimal policy graphically

(Figure: the 4 x 5 grid of states 1-20 with arrows indicating the selected optimal actions.)
Another of the optimal policies

(Slide: the same Q(s, a) table as above, with an alternative choice of maximal-Q action marked for each state.)
Another optimal policy graphically

(Figure: the 4 x 5 grid of states 1-20 with arrows indicating the alternative optimal actions.)
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture 5.5 will be on the topic:

Case Based Reasoning
