Lec 10


CS 188: Artificial Intelligence

Reinforcement Learning

Instructor: Pieter Abbeel


University of California, Berkeley
[Slides by Dan Klein, Pieter Abbeel, Anca Dragan. http://ai.berkeley.edu.]
Reinforcement Learning
Reinforcement Learning

[Diagram: the agent receives a state s and a reward r from the environment, and sends actions a back to the environment]

o Basic idea:
o Receive feedback in the form of rewards
o Agent’s utility is defined by the reward function
o Must (learn to) act so as to maximize expected rewards
o All learning is based on observed samples of outcomes!
Reinforcement Learning
o Still assume a Markov decision process (MDP):
o A set of states s ∈ S
o A set of actions (per state) A
o A model T(s,a,s’)
o A reward function R(s,a,s’)
o Still looking for a policy π(s)

o New twist: don’t know T or R


o I.e. we don’t know which states are good or what the actions do
o Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)

Offline Solution Online Learning


Example: Learning to Walk

Initial | A Learning Trial | After Learning [1K Trials]

[Kohl and Stone, ICRA 2004]


Example: Learning to Walk

[Kohl and Stone, ICRA 2004]


Initial
[Video: AIBO WALK – initial]
Example: Learning to Walk

[Kohl and Stone, ICRA 2004]


Training
[Video: AIBO WALK – training]
Example: Learning to Walk

[Kohl and Stone, ICRA 2004]


Finished
[Video: AIBO WALK – finished]
The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]


Video of Demo Crawler Bot
DeepMind Atari (©Two Minute Lectures)

Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Model-Based Reinforcement Learning
Model-Based Reinforcement Learning
o Model-Based Idea:
o Learn an approximate model based on experiences
o Solve for values as if the learned model were correct

o Step 1: Learn empirical MDP model
o Count outcomes s’ for each s, a
o Normalize to give an estimate of T(s,a,s’)
o Discover each R(s,a,s’) when we experience (s, a, s’)

o Step 2: Solve the learned MDP
o For example, use value iteration, as before

(and repeat as needed)


Example: Model-Based RL
Input Policy π (assume γ = 1): B → east, C → east, E → north; A and D exit
[Grid of states: A above C; B, C, D in a row; E below C]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:
T(s,a,s’): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
R(s,a,s’): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
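To make Step 1 concrete, here is a minimal Python sketch (not from the slides; the episode encoding and variable names are my own) that estimates T and R by counting and normalizing the transitions in the four episodes above:

```python
from collections import defaultdict

# Episodes from the example, as (s, a, s', r) tuples; 'x' marks the terminal state.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = number of times observed
rewards = {}                                    # rewards[(s, a, s')] = observed reward

for episode in episodes:
    for s, a, s_next, r in episode:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r

# Normalize the counts to get the estimated transition model.
T_hat = {
    (s, a, s_next): n / sum(outcomes.values())
    for (s, a), outcomes in counts.items()
    for s_next, n in outcomes.items()
}

print(T_hat[("B", "east", "C")])    # 1.0
print(T_hat[("C", "east", "D")])    # 0.75
print(T_hat[("C", "east", "A")])    # 0.25
print(rewards[("D", "exit", "x")])  # 10
```

Step 2 would then run value iteration on T_hat and rewards exactly as in the MDP lectures, repeating as more experience arrives.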
Analogy: Expected Age
Goal: Compute expected age of cs188 students

Known P(A):
E[A] = Σ_a P(a) · a

Without P(A), instead collect samples [a1, a2, … aN]

Unknown P(A): “Model Based”
P̂(a) = num(a) / N
E[A] ≈ Σ_a P̂(a) · a
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
E[A] ≈ (1/N) Σ_i a_i
Why does this work? Because samples appear with the right frequencies.
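As a tiny numerical illustration of the analogy (the sample ages below are made up, not CS188 data), both estimators are computed from the same samples; on a fixed sample set they give the same number, and differ only in that the model-based route also yields an estimated distribution P̂(a) that can be reused for other queries:

```python
from collections import Counter

samples = [19, 20, 20, 21, 22, 20, 19, 23]  # hypothetical ages a_1 ... a_N
N = len(samples)

# "Model Based": estimate P_hat(a) = num(a)/N, then take the expectation under it.
P_hat = {a: n / N for a, n in Counter(samples).items()}
model_based = sum(P_hat[a] * a for a in P_hat)

# "Model Free": just average the samples directly.
model_free = sum(samples) / N

print(model_based, model_free)  # 20.5 20.5
```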
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Passive Model-Free Reinforcement Learning
Passive Model-Free Reinforcement Learning
o Simplified task: policy evaluation
o Input: a fixed policy π(s)
o You don’t know the transitions T(s,a,s’)
o You don’t know the rewards R(s,a,s’)
o Goal: learn the state values

o In this case:
o Learner is “along for the ride”
o No choice about what actions to take
o Just execute the policy and learn from experience
o This is NOT offline planning! You actually take actions in the world.
Direct Evaluation

o Goal: Compute values for each state under π

o Idea: Average together observed sample values
o Act according to π
o Every time you visit a state, write down what the sum of discounted rewards turned out to be
o Average those samples

o This is called direct evaluation


Example: Direct Evaluation
Input Policy π (assume γ = 1): B → east, C → east, E → north; A and D exit

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output Values:
A = -10, B = +8, C = +4, D = +10, E = -2

If B and E both go to C under this policy, how can their values be different?
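A minimal sketch of direct evaluation on these episodes (the data layout and names are mine); with γ = 1 it reproduces the output values above, and it also shows mechanically why E ends up different from B: Episode 4 drags E's average down even though both states lead to C.

```python
from collections import defaultdict

gamma = 1.0

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)  # returns[s] = observed sums of discounted rewards from s

for episode in episodes:
    G = 0.0
    # Walk backwards: the return from each step is its reward plus gamma times the return after it.
    for s, a, s_next, r in reversed(episode):
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # D: 10.0, C: 4.0, B: 8.0, E: -2.0, A: -10.0
```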
Problems with Direct Evaluation
o What’s good about direct evaluation?
o It’s easy to understand
o It doesn’t require any knowledge of T, R
o It eventually computes the correct average values, using just sample transitions

o What’s bad about it?
o It wastes information about state connections
o Each state must be learned separately
o So, it takes a long time to learn

(Recall the output values from the previous example: A = -10, B = +8, C = +4, D = +10, E = -2. If B and E both go to C under this policy, how can their values be different?)
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Temporal Difference Value Learning
Why Not Use Policy Evaluation?
o Simplified Bellman updates calculate V for a fixed policy:
o Each round, replace V with a one-step look-ahead layer over V:

Vπ_0(s) = 0
Vπ_{k+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ_k(s’) ]

o This approach fully exploited the connections between the states
o Unfortunately, we need T and R to do it!

o Key question: how can we do this update to V without knowing T and R?
o In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?
o We want to improve our estimate of V by computing these averages:

Vπ_{k+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ_k(s’) ]

o Idea: Take samples of outcomes s’ (by doing the action!) and average:

sample_1 = R(s, π(s), s’_1) + γ Vπ_k(s’_1)
sample_2 = R(s, π(s), s’_2) + γ Vπ_k(s’_2)
…
sample_n = R(s, π(s), s’_n) + γ Vπ_k(s’_n)

Vπ_{k+1}(s) ← (1/n) Σ_i sample_i

Almost! But we can’t rewind time to get sample after sample from state s.
Temporal Difference Value Learning
o Big idea: learn from every experience!
o Update V(s) each time we experience a transition (s, a, s’, r)
o Likely outcomes s’ will contribute updates more often

o Temporal difference learning of values
o Policy still fixed, still doing evaluation!
o Move values toward value of whatever successor occurs: running average

Sample of V(s): sample = R(s, π(s), s’) + γ Vπ(s’)

Update to V(s): Vπ(s) ← (1 - α) Vπ(s) + α · sample

Same update: Vπ(s) ← Vπ(s) + α · (sample - Vπ(s))
Exponential Moving Average
o Exponential moving average
o The running interpolation update: x̄_n = (1 - α) · x̄_{n-1} + α · x_n

o Makes recent samples more important:
x̄_n = [ x_n + (1 - α) · x_{n-1} + (1 - α)² · x_{n-2} + … ] / [ 1 + (1 - α) + (1 - α)² + … ]

o Forgets about the past (distant past values were wrong anyway)

o Decreasing learning rate (alpha) can give converging averages


Example: Temporal Difference Value Learning
States: A above C; B, C, D in a row; E below C
Assume: γ = 1, α = 1/2

Observed Transitions: B, east, C, -2 then C, east, D, -2

Values of (B, C, D) after each update (A and E stay at 0):
Initially: B = 0, C = 0, D = 8
After (B, east, C, -2): B = -1, C = 0, D = 8
After (C, east, D, -2): B = -1, C = 3, D = 8
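A minimal Python sketch of the TD update applied to these two transitions (variable names mine); with γ = 1 and α = 1/2 it reproduces the numbers above:

```python
gamma, alpha = 1.0, 0.5

V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

transitions = [("B", "east", "C", -2), ("C", "east", "D", -2)]

for s, a, s_next, r in transitions:
    sample = r + gamma * V[s_next]              # value suggested by what actually happened
    V[s] = (1 - alpha) * V[s] + alpha * sample  # running average: move V(s) toward the sample

print(V["B"], V["C"])  # -1.0 3.0
```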
Problems with TD Value Learning

o TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

o However, if we want to turn values into a (new) policy, we’re sunk:

π(s) = argmax_a Q(s, a)
Q(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V(s’) ]

o Idea: learn Q-values, not values
o Makes action selection model-free too!
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Q-Value Iteration
o Value iteration: find successive (depth-limited) values
o Start with V_0(s) = 0, which we know is right
o Given V_k, calculate the depth k+1 values for all states:

V_{k+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_k(s’) ]

o But Q-values are more useful, so compute them instead
o Start with Q_0(s, a) = 0, which we know is right
o Given Q_k, calculate the depth k+1 q-values for all q-states:

Q_{k+1}(s, a) ← Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ max_{a’} Q_k(s’, a’) ]
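A minimal sketch of Q-value iteration for an MDP given as plain dictionaries (the data-structure layout and function name are my own, not a CS188 API):

```python
def q_value_iteration(states, actions, T, R, gamma, iterations):
    """Repeated Bellman updates on Q-values.

    actions(s) returns the legal actions in s; T[(s, a)] is a dict {s': probability};
    R[(s, a, s')] is the reward for that transition.
    """
    Q = {(s, a): 0.0 for s in states for a in actions(s)}  # Q_0 = 0, which we know is right
    for _ in range(iterations):
        Q_new = {}
        for s in states:
            for a in actions(s):
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                    for s2, p in T[(s, a)].items()
                )
        Q = Q_new
    return Q
```

With the T and R estimated in the model-based sketch earlier, this is one way to carry out “Step 2: solve the learned MDP.”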
Q-Learning
o Q-Learning: sample-based Q-value iteration

Q_{k+1}(s, a) ← Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ max_{a’} Q_k(s’, a’) ]

o Learn Q(s, a) values as you go
o Receive a sample (s, a, s’, r)
o Consider your old estimate: Q(s, a)
o Consider your new sample estimate (no longer policy evaluation!):

sample = r + γ max_{a’} Q(s’, a’)

o Incorporate the new estimate into a running average:

Q(s, a) ← (1 - α) Q(s, a) + α · sample

[Demo: Q-learning – gridworld (L10D2)]
[Demo: Q-learning – crawler (L10D3)]
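A minimal sketch of tabular Q-learning with ε-greedy action selection (the env interface with reset()/step(a), the loop structure, and the default parameters are assumptions for illustration, not the Project 3 API):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Learn Q(s, a) from samples; actions(s) returns a list of legal actions,
    env.reset() returns a start state, env.step(a) returns (s', r, done)."""
    Q = defaultdict(float)  # Q-values, default 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the current Q, occasionally act randomly.
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Sample-based update: move Q(s,a) toward r + gamma * max_a' Q(s', a').
            sample = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```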
Q-Learning Properties
o Amazing result: Q-learning converges to the optimal policy -- even if you’re acting suboptimally!

o This is called off-policy learning

o Caveats:
o You have to explore enough
o You have to eventually make the learning rate small enough
o … but not decrease it too quickly
o Basically, in the limit, it doesn’t matter how you select actions (!)
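One common way to satisfy the “small enough, but not too quickly” caveat is a per-(s, a) learning rate such as α = 1/N(s, a), where N is a visit count (this particular schedule is an illustration, not something the slides prescribe): the step sizes sum to infinity while their squares sum to a finite value.

```python
from collections import defaultdict

N = defaultdict(int)  # visit counts per (state, action) pair

def learning_rate(s, a):
    """Step size alpha = 1/N(s,a): decays, but slowly enough that updates never stop mattering."""
    N[(s, a)] += 1
    return 1.0 / N[(s, a)]
```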
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Active Reinforcement Learning
Active Reinforcement Learning
o Full reinforcement learning: optimal policies (like value iteration)
o You don’t know the transitions T(s,a,s’)
o You don’t know the rewards R(s,a,s’)
o You choose the actions now
o Goal: learn the optimal policy / values

o In this case:
o Learner makes choices!
o Fundamental tradeoff: exploration vs. exploitation
o This is NOT offline planning! You actually take actions in the world and find out what happens…
Exploration vs. Exploitation
Video of Demo Q-learning – Manual Exploration – Bridge Grid
How to Explore?
o Several schemes for forcing exploration
o Simplest: random actions (ε-greedy)
o Every time step, flip a coin
o With (small) probability ε, act randomly
o With (large) probability 1 - ε, act on the current policy

o Problems with random actions?
o You do eventually explore the space, but keep thrashing around once learning is done
o One solution: lower ε over time
o Another solution: exploration functions

[Demo: Q-learning – manual exploration – bridge grid (L10D5)]
[Demo: Q-learning – epsilon-greedy -- crawler (L10D3)]
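A minimal sketch of ε-greedy action selection with ε lowered over time, addressing the thrashing problem above (the decay schedule and parameter values are illustrative choices of mine):

```python
import random

def epsilon_greedy(Q, s, actions, t, eps_start=1.0, eps_end=0.05, decay=0.001):
    """With probability epsilon act randomly, otherwise follow the current greedy policy.

    Epsilon shrinks from eps_start toward eps_end as the step counter t grows."""
    epsilon = eps_end + (eps_start - eps_end) / (1.0 + decay * t)
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit the current policy
```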
Video of Demo Q-learning – Epsilon-Greedy – Crawler
Exploration Functions
o When to explore?
o Random actions: explore a fixed amount
o Better idea: explore areas whose badness is not (yet) established, eventually stop exploring

o Exploration function
o Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n

Regular Q-Update: Q(s, a) ← (1 - α) Q(s, a) + α [ R(s, a, s’) + γ max_{a’} Q(s’, a’) ]
Modified Q-Update: Q(s, a) ← (1 - α) Q(s, a) + α [ R(s, a, s’) + γ max_{a’} f(Q(s’, a’), N(s’, a’)) ]

o Note: this propagates the “bonus” back to states that lead to unknown states as well!

[Demo: exploration – Q-learning – crawler – exploration function (L10D4)]
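A minimal sketch of the modified Q-update using the exploration function f(u, n) = u + k/n (the bonus constant k, the visit-count table, and the n + 1 tweak for never-visited pairs are my own choices):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q-value estimates
N = defaultdict(int)    # visit counts per (state, action)

def f(u, n, k=2.0):
    """Optimistic utility: rarely visited (s, a) pairs get a large bonus, well-visited ones look like u."""
    return u + k / (n + 1)  # n + 1 keeps the bonus finite when n == 0 (the slide's form is u + k/n)

def exploration_q_update(s, a, s_next, r, next_actions, alpha=0.5, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' f(Q(s',a'), N(s',a')) ]."""
    N[(s, a)] += 1
    best = max((f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)
```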
Video of Demo Q-learning – Exploration Function – Crawler
Regret
o Even if you learn the optimal policy, you still make mistakes along the way!
o Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
o Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
o Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns values of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q-learning – learns Q-values of the optimal policy (uses a Q version of TD Learning)

o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to explore efficiently?
o How to trade off exploration vs. exploitation?
o Applies to both model-based and model-free. In CS188 we’ll cover it only in the context of Q-learning.
Discussion: Model-Based vs Model-Free RL

