Lec 21
University of Toronto
Reinforcement Learning

Reinforcement Learning Problem: An agent continually interacts with the environment. How should it choose its actions so that its long-term rewards are maximized?

Also might be called:
• Adaptive Situated Agent Design
Playing Games: Atari
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Playing Games: Super Mario
https://www.youtube.com/watch?v=wfL4L_l4U9A
Making Pancakes!
https://www.youtube.com/watch?v=W_gxLKSsSIE
The agent has a state s ∈ S in the environment, e.g., the location of X and O in tic-tac-toe, or the location of a robot in a room.
At every time step t = 0, 1, . . . , the agent is at state s_t:
• It takes an action a_t.
• It moves into a new state s_{t+1}, according to the dynamics of the environment and the selected action, i.e., s_{t+1} ∼ P(·|s_t, a_t).
• It receives some reward r_{t+1} ∼ R(·|s_t, a_t, s_{t+1}).
The transition probability describes how the state of the agent changes when it chooses actions.
This model has the Markov property: the future depends on the past only through the current state.
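To make the loop concrete, here is a minimal Python sketch of an agent interacting with a toy finite MDP. The transition matrix P, reward table R, and the uniform-random policy are all illustrative assumptions, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# A toy finite MDP with 3 states and 2 actions (all numbers made up).
# P[s, a, s'] is the probability of landing in s' after action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([[0.0, 1.0],                  # r(s, a), deterministic here
              [0.5, 0.0],
              [0.0, 0.0]])

def policy(s):
    # Placeholder: choose an action uniformly at random.
    return int(rng.integers(2))

s = 0
for t in range(10):
    a = policy(s)                           # choose a_t
    s_next = int(rng.choice(3, p=P[s, a]))  # s_{t+1} ~ P(.|s_t, a_t)
    r = R[s, a]                             # observe reward r_{t+1}
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next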
A policy is the action selection mechanism of the agent, and describes its
behaviour.
Policy can be deterministic or stochastic:
Deterministic policy: a = π(s)
Stochastic policy: A ∼ π(·|s)
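As a small illustration (the function names are hypothetical, not from the lecture), the two kinds of policy differ only in whether the action is computed or sampled:

import numpy as np

rng = np.random.default_rng(1)
n_actions = 3

def deterministic_policy(s):
    # a = pi(s): the same state always maps to the same action.
    return s % n_actions

def stochastic_policy(s):
    # A ~ pi(.|s): sample from a state-dependent distribution
    # (uniform here purely for illustration).
    probs = np.full(n_actions, 1.0 / n_actions)
    return int(rng.choice(n_actions, p=probs))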
Value function is the expected future reward, and is used to evaluate the
desirability of states.
State-value function V^π (or simply value function) for policy π is a function defined as

    V^π(s) ≜ E_π[ Σ_{t≥0} γ^t R_t | S_0 = s ].
It describes the expected discounted reward if the agent starts from state s
and follows policy π.
The action-value function Q^π for policy π is

    Q^π(s, a) ≜ E_π[ Σ_{t≥0} γ^t R_t | S_0 = s, A_0 = a ].
It describes the expected discounted reward if the agent starts from state s,
takes action a, and afterwards follows policy π.
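Both definitions can be approximated by Monte Carlo rollouts. Below is a hedged sketch for Q^π; the step and policy callables are assumed interfaces supplied by the caller, and the infinite sum is truncated at a finite horizon:

import numpy as np

def mc_action_value(step, policy, s0, a0, gamma=0.9,
                    n_rollouts=1000, horizon=100):
    # Monte Carlo estimate of Q^pi(s0, a0): average the truncated
    # discounted return sum_{t < horizon} gamma^t * R_t over rollouts.
    # step(s, a) -> (next_state, reward); policy(s) -> action.
    returns = []
    for _ in range(n_rollouts):
        s, a = s0, a0
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r = step(s, a)      # environment transition and reward
            g += discount * r      # accumulate gamma^t * R_t
            discount *= gamma
            a = policy(s)          # afterwards, follow pi
        returns.append(g)
    return float(np.mean(returns))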
Value Function
Our aim will be to find a policy π that maximizes the value function (the total reward we receive over time): find the policy with the highest expected reward.
Optimal value function: Q*(s, a) = sup_π Q^π(s, a).
Key observations:
• Q^π is the fixed point of the Bellman operator T^π: Q^π = T^π Q^π.
• Q* is the fixed point of the Bellman optimality operator T*: Q* = T* Q*.
• If Q̂ ≈ T* Q̂, then the greedy policy π(s; Q̂) = argmax_a Q̂(s, a) is close to optimal: Q^{π(·;Q̂)} ≈ Q*.
• This suggests the fixed-point iteration (Value Iteration): Q_{k+1} ← T* Q_k.
The Bellman optimality operator T* can be written for general (integral) and finite (sum) state spaces:

    Q_{k+1}(s, a) ← r(s, a) + γ ∫_S P(ds'|s, a) max_{a'∈A} Q_k(s', a')

    Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'∈A} Q_k(s', a')
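For a finite MDP stored as NumPy arrays (shapes are my assumption: P is (S, A, S'), R is (S, A)), one application of the finite-state backup above might look like:

import numpy as np

def bellman_optimality_backup(Q, P, R, gamma=0.9):
    # One application of T*:
    # (T*Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a')
    max_next = Q.max(axis=1)        # max_{a'} Q(s', a') for each s'
    return R + gamma * np.einsum("sat,t->sa", P, max_next)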
Value Iteration
Value Iteration converges to the optimal value function.
This is because of the contraction property of the Bellman (optimality) operator, i.e., ‖T*Q_1 − T*Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.
    Q_{k+1} ← T* Q_k
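Putting the pieces together, here is a sketch of the full Value Iteration loop under the same array-shape assumptions; the sup-norm stopping test is justified by the contraction property above:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10000):
    # Iterate Q_{k+1} <- T* Q_k until ||Q_{k+1} - Q_k||_inf < tol.
    # The contraction property guarantees geometric convergence to Q*.
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        max_next = Q.max(axis=1)
        Q_new = R + gamma * np.einsum("sat,t->sa", P, max_next)
        if np.max(np.abs(Q_new - Q)) < tol:   # sup-norm change
            break
        Q = Q_new
    return Q_new

Given the returned Q, a greedy policy can be read off as π(s) = argmax_a Q(s, a), i.e., Q.argmax(axis=1).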
Maze Example
Each board position (taking into account symmetry) has some probability