Lec 21


CSC 411: Introduction to Machine Learning

CSC 411 Lecture 21: Reinforcement Learning I

Mengye Ren and Matthew MacKay

University of Toronto

UofT CSC411 2019 Winter Lecture 21 1 / 28


Reinforcement Learning Problem

In supervised learning, the problem is to predict an output t given an input x.


But often the ultimate goal is not to predict, but to make decisions, i.e.,
take actions.
And we need to take a sequence of actions.

Reinforcement Learning (RL)


The actions have long-term consequences.

An agent observes the world, takes an action and its state changes, with the goal of achieving long-term rewards.

Reinforcement Learning Problem: An agent continually interacts with the environment. How should it choose its actions so that its long-term rewards are maximized?

Also might be called:
• Adaptive Situated Agent Design

UofT CSC411 2019 Winter Lecture 21 2 / 28
Playing Games: Atari

https://www.youtube.com/watch?v=V1eYniJ0Rnk
UofT CSC411 2019 Winter Lecture 21 3 / 28
Playing Games: Super Mario

https://www.youtube.com/watch?v=wfL4L_l4U9A
UofT CSC411 2019 Winter Lecture 21 4 / 28
Making Pancakes!

https://www.youtube.com/watch?v=W_gxLKSsSIE

UofT CSC411 2019 Winter Lecture 21 5 / 28


Reinforcement Learning Resources

Book: Sutton & Barto, Reinforcement Learning: An Introduction, second edition (2018)

Video lectures by David Silver

UofT CSC411 2019 Winter Lecture 21 6 / 28


Reinforcement Learning

Learning algorithms differ in the information available to the learner:


Supervised: correct outputs, e.g., class label

Unsupervised: no feedback, must construct measure of good output

Reinforcement learning: Reward (or cost)


More realistic learning scenario:
Continuous stream of input information, and actions

Effects of action depend on state of the world

Obtain reward that depends on world state and actions


You know only the reward for the action you took, not for other actions.
There could be a delay between action and reward.

UofT CSC411 2019 Winter Lecture 21 7 / 28


Reinforcement Learning

UofT CSC411 2019 Winter Lecture 21 8 / 28


Example: Tic Tac Toe, Notation

UofT CSC411 2019 Winter Lecture 21 9 / 28


Example: Tic Tac Toe, Notation

UofT CSC411 2019 Winter Lecture 21 10 / 28


Example: Tic Tac Toe, Notation

UofT CSC411 2019 Winter Lecture 21 11 / 28


Example: Tic Tac Toe, Notation

UofT CSC411 2019 Winter Lecture 21 12 / 28


Formalizing Reinforcement Learning Problems

Markov Decision Process (MDP) is the mathematical framework to describe


RL problems
A discounted MDP is defined by a tuple (S, A, P, R, γ).
S: State space. Discrete or continuous
A: Action space. Here we consider finite action space, i.e.,
A = {a1 , . . . , a|A| }.
P: Transition probability
R: Immediate reward distribution
γ: Discount factor (0 ≤ γ < 1)
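As a concrete sketch (not from the slides), a small finite MDP can be written down as a few Python tables; the two-state example below is invented purely for illustration:

    # A tiny hand-made MDP (S, A, P, R, gamma); states and actions are integer indices.
    S = [0, 1]        # state space
    A = [0, 1]        # action space
    gamma = 0.9       # discount factor, 0 <= gamma < 1

    # P[(s, a)] maps next states to probabilities, i.e., P(s'|s, a).
    P = {
        (0, 0): {0: 1.0},
        (0, 1): {1: 0.8, 0: 0.2},
        (1, 0): {0: 1.0},
        (1, 1): {1: 1.0},
    }

    # R[(s, a)] is the expected immediate reward r(s, a).
    R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}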

UofT CSC411 2019 Winter Lecture 21 13 / 28


Formalizing Reinforcement Learning Problems

The agent has a state s ∈ S in the environment, e.g., the location of X and O in tic-tac-toe, or the location of a robot in a room.
At every time step t = 0, 1, . . . , the agent is at state st .
Takes an action at
Moves into a new state st+1 , according to the dynamics of the
environment and the selected action, i.e., st+1 ∼ P(·|st , at )
Receives some reward rt+1 ∼ R(·|st , at , st+1 )


UofT CSC411 2019 Winter Lecture 21 14 / 28


Formulating Reinforcement Learning

The action selection mechanism is described by a policy π


Policy π is a mapping from states to actions, i.e., at = π(st )
The goal is to find a policy π such that the long-term reward of the agent is maximized.
Different notions of long-term reward:
Cumulative reward:
r_t + r_{t+1} + r_{t+2} + . . .
Discounted reward, where a reward received k time-steps in the future is discounted by γ^k:
r_t + γ r_{t+1} + γ² r_{t+2} + . . .

If γ is close to 1, rewards further in the future count more, and we say that the agent is "farsighted".
γ is less than 1 because there is usually a time limit to the sequence of actions needed to solve a task (we prefer rewards sooner rather than later).
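A minimal numeric sketch of the two notions, using a made-up reward sequence:

    # Hypothetical reward sequence r_t, r_{t+1}, r_{t+2}, r_{t+3}
    rewards = [1.0, 0.0, 2.0, 1.0]
    gamma = 0.9

    cumulative = sum(rewards)                                      # 4.0
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))  # 1 + 0 + 1.62 + 0.729
    print(cumulative, discounted)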

UofT CSC411 2019 Winter Lecture 21 15 / 28


Transition Probability (or Dynamics)

The transition probability describes the changes in the state of the agent
when it chooses actions

P(st+1 = s′, rt+1 = r′ | st = s, at = a)

This model has the Markov property: the future depends on the past only through the current state.
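A sketch of what such a tabular transition model might look like in code (the dynamics below are invented; the point is that sampling the next state and reward needs only the current (s, a), which is the Markov property):

    import random

    # Hypothetical dynamics: P[(s, a)] is a list of (probability, next_state, reward) triples.
    P = {
        (0, 0): [(1.0, 0, 0.0)],
        (0, 1): [(0.8, 1, 0.0), (0.2, 0, 0.0)],
        (1, 0): [(1.0, 0, 0.0)],
        (1, 1): [(1.0, 1, 1.0)],
    }

    def step(s, a):
        # Sample (s', r) ~ P(.|s, a); depends only on the current state and action.
        outcomes = P[(s, a)]
        probs = [p for p, _, _ in outcomes]
        _, s_next, r = random.choices(outcomes, weights=probs)[0]
        return s_next, r

    print(step(0, 1))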

UofT CSC411 2019 Winter Lecture 21 16 / 28


Policy

A policy is the action selection mechanism of the agent, and describes its
behaviour.
Policy can be deterministic or stochastic:
Deterministic policy: a = π(s)
Stochastic policy: A ∼ π(·|s)
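In code, the two kinds of policy for a finite MDP might look like this (the tables are made up):

    import random

    # Deterministic policy: a = pi(s), stored as a lookup table.
    pi_det = {0: 1, 1: 0}

    def act_deterministic(s):
        return pi_det[s]

    # Stochastic policy: A ~ pi(.|s), stored as action probabilities per state.
    pi_stoch = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}

    def act_stochastic(s):
        actions, probs = zip(*pi_stoch[s].items())
        return random.choices(actions, weights=probs)[0]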

UofT CSC411 2019 Winter Lecture 21 17 / 28


Value Function

Value function is the expected future reward, and is used to evaluate the
desirability of states.
State-value function V^π (or simply value function) for policy π is a function defined as

V^π(s) ≜ E_π[ Σ_{t≥0} γ^t R_t | S_0 = s ].

It describes the expected discounted reward if the agent starts from state s
and follows policy π.
The action-value function Q^π for policy π is

Q^π(s, a) ≜ E_π[ Σ_{t≥0} γ^t R_t | S_0 = s, A_0 = a ].

It describes the expected discounted reward if the agent starts from state s,
takes action a, and afterwards follows policy π.
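These expectations can be approximated by Monte Carlo rollouts. The sketch below assumes a simulator env_step(s, a) -> (s_next, r) and a policy function pi(s) (both hypothetical), and truncates the infinite sum at a finite horizon where gamma**t is negligible:

    def mc_q_estimate(env_step, pi, s, a, gamma=0.9, horizon=200, n_rollouts=1000):
        # Estimate Q^pi(s, a): start in s, take a, then follow pi, and
        # average the discounted return over many rollouts.
        total = 0.0
        for _ in range(n_rollouts):
            state, action, ret, discount = s, a, 0.0, 1.0
            for _ in range(horizon):
                state, r = env_step(state, action)
                ret += discount * r
                discount *= gamma
                action = pi(state)
            total += ret
        return total / n_rollouts

    def mc_v_estimate(env_step, pi, s, **kw):
        # V^pi(s) is Q^pi evaluated at the policy's own action.
        return mc_q_estimate(env_step, pi, s, pi(s), **kw)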
UofT CSC411 2019 Winter Lecture 21 18 / 28
Value Function

Our aim will be to find a policy π that maximizes the value function (the
total reward we receive over time): find the policy with the highest expected
reward
Optimal value function:

Q^*(s, a) = sup_π Q^π(s, a)

Given Q^*, the optimal policy can be obtained as

π^*(s) ← argmax_a Q^*(s, a)

The goal of an RL agent is to find a policy π that is close to optimal, i.e., Q^π ≈ Q^*.
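With a tabular Q (a dict keyed by (s, a), invented here), extracting the greedy policy π^*(s) = argmax_a Q^*(s, a) is a one-liner:

    # Hypothetical optimal action-value table Q*[(s, a)].
    Q_star = {(0, 0): 1.2, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.1}
    actions = [0, 1]

    def greedy_policy(Q, s):
        # pi*(s) = argmax_a Q(s, a)
        return max(actions, key=lambda a: Q[(s, a)])

    print(greedy_policy(Q_star, 0))  # -> 1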

UofT CSC411 2019 Winter Lecture 21 19 / 28


Bellman Equation
The value function satisfies the following recursive relationship:

Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t R_t | S_0 = s, A_0 = a ]
          = E[ R(S_0, A_0) + γ Σ_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a ]
          = E[ R(S_0, A_0) + γ Q^π(S_1, π(S_1)) | S_0 = s, A_0 = a ]
          = r(s, a) + γ ∫_S P(ds′|s, a) Q^π(s′, π(s′))  ≜ (T^π Q^π)(s, a)

This is called the Bellman equation and T^π : B(S × A) → B(S × A) is the Bellman operator. Similarly, we define the Bellman optimality operator:

(T^* Q)(s, a) ≜ r(s, a) + γ ∫_S P(ds′|s, a) max_{a′∈A} Q(s′, a′)
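For a finite MDP the integral becomes a sum over next states, and T^π can be applied to a Q table directly. A sketch using the dict conventions from the earlier (hypothetical) examples; iterating Q ← T^π Q from any starting Q converges to Q^π:

    def bellman_operator(Q, r, P, pi, gamma, states, actions):
        # (T^pi Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) * Q(s', pi(s'))
        TQ = {}
        for s in states:
            for a in actions:
                expected_next = sum(p * Q[(s2, pi[s2])] for s2, p in P[(s, a)].items())
                TQ[(s, a)] = r[(s, a)] + gamma * expected_next
        return TQ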

UofT CSC411 2019 Winter Lecture 21 20 / 28


Bellman Equation

Key observation:

Qπ = T πQπ
Q ∗ = T ∗Q ∗

Value-based approaches try to find a Q̂ such that

Q̂ ≈ T ∗ Q̂

The greedy policy of Q̂ is then close to the optimal policy:

Q^{π(·; Q̂)} ≈ Q^*

where the greedy policy is defined as

π(s; Q̂) ← argmax_{a∈A} Q̂(s, a)

UofT CSC411 2019 Winter Lecture 21 21 / 28


Finding the Optimal Value Function: Value Iteration
Assume that we know the model P and R. How can we find the optimal
value function?
This is the problem of Planning.
We can benefit from the Bellman optimality equation and use a method
called Value Iteration
Q_{k+1} ← T^* Q_k

[Figure: the iterates Q_{k+1} ← T^* Q_k converging toward Q^* under the Bellman operator T^*]
Q_{k+1}(s, a) ← r(s, a) + γ ∫_S P(ds′|s, a) max_{a′∈A} Q_k(s′, a′)

Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′∈A} Q_k(s′, a′)
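A minimal tabular Value Iteration sketch for a finite MDP, reusing the same (assumed) dict conventions as the earlier examples:

    def value_iteration(states, actions, P, r, gamma, n_iters=100):
        # Q_{k+1}(s, a) <- r(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_k(s', a')
        Q = {(s, a): 0.0 for s in states for a in actions}
        for _ in range(n_iters):
            Q = {(s, a): r[(s, a)] + gamma * sum(
                     p * max(Q[(s2, a2)] for a2 in actions)
                     for s2, p in P[(s, a)].items())
                 for s in states for a in actions}
        return Q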
UofT CSC411 2019 Winter Lecture 21 22 / 28
Value Iteration
Value Iteration converges to the optimal value function.
This is because of the contraction property of the Bellman (optimality) operator, i.e., ‖T^* Q_1 − T^* Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.

Q_{k+1} ← T^* Q_k

Q_{k+1}(s, a) ← r(s, a) + γ ∫_S P(ds′|s, a) max_{a′∈A} Q_k(s′, a′)

Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′∈A} Q_k(s′, a′)
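In practice the contraction gives a simple stopping rule: iterate until successive Q tables stop changing in sup norm, since the gap shrinks by a factor of γ per application. A sketch, where update is assumed to perform one T^* application as in the Value Iteration sketch above:

    def iterate_to_convergence(Q0, update, tol=1e-6):
        # Repeat Q <- T* Q until ||Q_{k+1} - Q_k||_inf < tol.
        Q = Q0
        while True:
            Q_next = update(Q)
            gap = max(abs(Q_next[k] - Q[k]) for k in Q)
            if gap < tol:
                return Q_next
            Q = Q_next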
UofT CSC411 2019 Winter Lecture 21 23 / 28
Maze Example

Rewards: −1 per time-step


Actions: N, E, S, W
States: Agent’s location
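These ingredients are enough to code the maze as a small MDP. The sketch below invents a tiny grid (the actual maze layout only appears as a figure on the slide), with deterministic moves, reward −1 per time-step, and an absorbing goal:

    # Hypothetical 2x3 maze: '#' is a wall, 'G' the goal, '.' a free cell.
    grid = ["..G",
            ".#."]
    ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def step(state, action):
        # Deterministic dynamics: move if the target cell is free, else stay put.
        # Reward is -1 per time-step; the goal state is absorbing with reward 0.
        row, col = state
        if grid[row][col] == "G":
            return state, 0.0
        dr, dc = ACTIONS[action]
        nr, nc = row + dr, col + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            return (nr, nc), -1.0
        return state, -1.0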

[Slide credit: D. Silver]

UofT CSC411 2019 Winter Lecture 21 24 / 28


Maze Example

Arrows represent the policy π(s) for each state s

[Slide credit: D. Silver]

UofT CSC411 2019 Winter Lecture 21 25 / 28


Maze Example

Numbers represent the value V^π(s) of each state s

[Slide credit: D. Silver]

UofT CSC411 2019 Winter Lecture 21 26 / 28


Example: Tic-Tac-Toe

Consider the game tic-tac-toe:


reward: win/lose/tie the game (+1 / −1 / 0) [received only at the final move of the game]

state: positions of X’s and O’s on the board

policy: mapping from states to actions


based on rules of game: choice of one open position

value function: prediction of reward in future, based on current state


In tic-tac-toe, since the state space is tractable, we can use a table to represent the value function
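Because the state space is small, the value function really can be an explicit table, e.g. a dict keyed by the board configuration (a sketch with invented conventions):

    # A board is a tuple of 9 cells, each 'X', 'O', or ' '.
    empty_board = (" ",) * 9

    # Tabular value function: estimated probability of winning from each state.
    V = {empty_board: 0.5}

    def value(board):
        # States not yet seen start at the neutral initial value 0.5.
        return V.get(board, 0.5)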

UofT CSC411 2019 Winter Lecture 21 27 / 28


RL & Tic-Tac-Toe

Each board position (taking into account symmetry) has some associated probability of winning

Simple learning process:


start with all values = 0.5
policy: choose move with highest
probability of winning given current
legal moves from current state
update entries in table based on
outcome of each game
After many games the value function will represent the true probability of winning from each state
Can try alternative policy: sometimes select moves randomly (exploration)
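A minimal sketch of this learning scheme in the spirit of Sutton & Barto's tic-tac-toe example (the step size, exploration rate, and helper functions are assumptions, not from the slides): after each move, nudge the stored value of the earlier state toward the value of the state that followed, and occasionally pick a random exploratory move.

    import random

    def apply_move(board, move, player="X"):
        # board: tuple of 9 cells; move: index 0..8 of an empty cell. (Hypothetical helper.)
        cells = list(board)
        cells[move] = player
        return tuple(cells)

    def choose_move(board, legal_moves, V, epsilon=0.1):
        # Mostly greedy on the value table, occasionally a random exploratory move.
        if random.random() < epsilon:
            return random.choice(legal_moves)
        return max(legal_moves, key=lambda m: V.get(apply_move(board, m), 0.5))

    def update_value(V, prev_board, next_board, alpha=0.1):
        # Move V(prev_board) a step toward V(next_board) (or toward the final outcome).
        v_prev = V.get(prev_board, 0.5)
        v_next = V.get(next_board, 0.5)
        V[prev_board] = v_prev + alpha * (v_next - v_prev)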

UofT CSC411 2019 Winter Lecture 21 28 / 28
