Lec 10
Reinforcement Learning
[Diagram: the agent-environment loop. The agent receives state s and reward r from the environment and sends back actions a.]
o Basic idea:
o Receive feedback in the form of rewards
o Agent’s utility is defined by the reward function
o Must (learn to) act so as to maximize expected rewards
o All learning is based on observed samples of outcomes!
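This loop is easy to sketch in code. Below is a minimal Python sketch of the agent-environment interaction; the Environment API (reset() -> state, step(action) -> (next_state, reward, done)) is a hypothetical stand-in for whatever simulator or real system the agent faces. The point is only that the learner sees (state, action, reward, next state) samples.

def run_episode(env, policy, learn):
    """Run one episode, feeding every observed sample to a learning update."""
    s = env.reset()                     # initial state from the environment
    done = False
    while not done:
        a = policy(s)                   # agent chooses an action
        s_next, r, done = env.step(a)   # environment returns reward + next state
        learn(s, a, r, s_next)          # learn from the observed sample
        s = s_next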
Reinforcement Learning
o Still assume a Markov decision process (MDP):
o A set of states s ∈ S
o A set of actions (per state) A
o A model T(s,a,s’)
o A reward function R(s,a,s’)
o Still looking for a policy π(s)
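As a concrete illustration, a tiny MDP can be written down directly in Python; the states, transition probabilities, and rewards below are made up for this example.

# A made-up three-state MDP written as plain Python dictionaries.
# T[s][a] maps each successor s' to its probability T(s, a, s');
# R maps (s, a, s') triples to rewards (unlisted triples are 0).

states  = ["A", "B", "C"]
actions = ["stay", "go"]

T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"C": 0.9, "B": 0.1}},
    "C": {"stay": {"C": 1.0}, "go": {"C": 1.0}},
}

R = {("A", "go", "B"): -1.0, ("B", "go", "C"): 10.0}

policy = {"A": "go", "B": "go", "C": "stay"}   # a fixed policy π(s)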
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP (see the sketch after this overview)
o Model-free Passive RL
o Forgo learning the MDP model; directly learn V or Q:
o Value learning – learns the value of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q learning – learns Q values of the optimal policy (uses a Q version of TD Learning)
o In this case:
o Learner is “along for the ride”
o No choice about what actions to take
o Just execute the policy and learn from experience
o This is NOT offline planning! You actually take actions in the world.
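Here is a minimal sketch of the model-based idea named in the overview: count transitions to estimate T, average observed rewards to estimate R, then solve the estimated MDP offline (e.g. with value iteration). The transition-tuple format is an assumption.

from collections import Counter, defaultdict

def estimate_model(transitions):
    """Model-based passive RL, step 1: estimate T and R from experience.

    `transitions` is assumed to be an iterable of (s, a, r, s_next) tuples
    gathered while following some policy.
    """
    counts  = defaultdict(Counter)      # (s, a) -> Counter over s_next
    rewards = defaultdict(list)         # (s, a, s_next) -> observed rewards
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)].append(r)

    # Normalize counts into probabilities and average the rewards.
    T = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
    R = {sas: sum(rs) / len(rs) for sas, rs in rewards.items()}
    return T, R   # step 2 (not shown): solve the estimated MDP, e.g. value iteration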
Direct Evaluation
o Idea: act according to π; every time you visit a state, record the total discounted reward that followed, and average those samples to estimate V^π(s) (sketched below)
[Diagram: transitions (s, π(s), s') observed while executing the fixed policy]
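A minimal Python sketch of direct evaluation under one assumption about the data format: episodes are lists of (s, a, r) steps recorded while executing the fixed policy.

def direct_evaluation(episodes, gamma=1.0):
    """Direct evaluation: average the observed returns from each state."""
    returns = {}                        # s -> list of observed returns from s
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the discounted return-to-go at each step.
        for s, _a, r in reversed(episode):
            G = r + gamma * G
            returns.setdefault(s, []).append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}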
o Direct evaluation needs no model, but it wastes information about the connections between states; policy evaluation, in contrast, fully exploited those connections
o Unfortunately, we need T and R to do it!
o Idea: Take samples of outcomes s' (by doing the action!) and average
[Diagram: from state s, taking action π(s), sampled outcomes s1', s2', s3']
Temporal Difference Learning
Update to V(s):
  sample = R(s, π(s), s') + γ · V^π(s')
  V^π(s) ← (1 - α) · V^π(s) + α · sample
Same update:
  V^π(s) ← V^π(s) + α · (sample - V^π(s))
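A minimal Python sketch of TD value learning using this update, assuming the same hypothetical environment interface as the earlier sketches and a policy given as a function from states to actions:

def td_policy_evaluation(env, policy, alpha=0.5, gamma=1.0, n_episodes=100):
    """TD value learning for a fixed policy: apply the update above to
    every observed transition. No model of T or R is needed."""
    V = {}                                        # missing states default to 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            sample = r + gamma * V.get(s_next, 0.0)
            V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
            s = s_next
    return V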
Exponential Moving Average
o Exponential moving average
o The running interpolation update:  x̄_n = (1 - α) · x̄_{n-1} + α · x_n
o Forgets about the past (distant past values were wrong anyway)
Example: TD value learning, observing the transitions (B, east, C, -2) and then (C, east, D, -2)
Assume: γ = 1, α = 1/2

State   Initial   After (B, east, C, -2)   After (C, east, D, -2)
A       0         0                        0
B       0         -1                       -1
C       0         0                        3
D       8         8                        8
E       0         0                        0
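The two updates in this example can be verified with a few lines of Python (values start at 0 except V(D) = 8):

alpha, gamma = 0.5, 1.0
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

# Observe (B, east, C, -2): V(B) <- (1 - 1/2)*0 + 1/2*(-2 + 1*V(C)) = -1
V["B"] += alpha * (-2 + gamma * V["C"] - V["B"])
print(V["B"])   # -1.0

# Observe (C, east, D, -2): V(C) <- (1 - 1/2)*0 + 1/2*(-2 + 1*V(D)) = 3
V["C"] += alpha * (-2 + gamma * V["D"] - V["C"])
print(V["C"])   # 3.0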
Problems with TD Value Learning
[Diagram: state s, action a, and the Q-state (s, a)]
o TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
o However, to turn the learned values into a (new) policy we need Q-values; so the idea is to learn Q-values directly (Q-learning)

Q-Learning Properties
o Q-learning converges to an optimal policy -- even if you're acting suboptimally!
o Caveats:
o You have to explore enough
o You have to eventually make the learning rate small enough
o … but not decrease it too quickly
o Basically, in the limit, it doesn't matter how you select actions (!)
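A sketch of the tabular Q-learning step these caveats refer to; the dictionary-based Q-table and the action set passed in are assumptions, and terminal-state handling is omitted for brevity.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (sample - q_sa)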
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forgo learning the MDP model; directly learn V or Q:
o Value learning – learns the value of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q learning – learns Q values of the optimal policy (uses a Q version of TD Learning)
o Active Reinforcement Learning (= how to collect experiences)
o In this case:
o Learner makes choices!
o Fundamental tradeoff: exploration vs. exploitation
o This is NOT offline planning! You actually take actions in the world and find out what happens…
Exploration vs. Exploitation
Video of Demo Q-learning – Manual Exploration – Bridge Grid
How to Explore?
o Several schemes for forcing exploration (both sketched in code below)
o Simplest: random actions (ε-greedy)
o Every time step, flip a coin
o With (small) probability ε, act randomly
o With (large) probability 1 - ε, act on current policy
o Exploration function
o Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-Update:   Q(s,a) ← (1 - α) · Q(s,a) + α · [R(s,a,s') + γ · max_{a'} Q(s',a')]
Modified Q-Update:  Q(s,a) ← (1 - α) · Q(s,a) + α · [R(s,a,s') + γ · max_{a'} f(Q(s',a'), N(s',a'))]
o Note: this propagates the “bonus” back to states that lead to unknown states as well! [Demo: exploration – Q-learning – crawler – exploration function (L10D4)]
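Both schemes can be sketched in a few lines of Python; the constant k in f(u, n) = u + k/n and the n + 1 guard against division by zero are assumptions of this sketch.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """ε-greedy: with (small) probability epsilon act randomly,
    otherwise act greedily with respect to the current Q-values."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit

def f(u, n, k=1.0):
    """Exploration function f(u, n) = u + k/n: rarely-visited pairs get an
    optimism bonus. The +1 avoids division by zero for unvisited pairs."""
    return u + k / (n + 1)

# Modified Q-update target (N counts visits to each (state, action) pair):
#   sample = r + gamma * max over a' of f(Q(s', a'), N(s', a'))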
Video of Demo Q-learning – Exploration Function – Crawler
Regret
o Even if you learn the optimal policy, you still make mistakes along the way!
o Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
o Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal
o Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
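One common way to make this precise (a standard textbook formalization, not taken from these slides) is as the gap between the optimal agent's expected cumulative reward and the learner's:

% Regret: optimal expected cumulative reward minus what you actually
% expect to collect while learning (r_t^* = optimal agent's rewards).
\mathrm{Regret} \;=\; \mathbb{E}\Big[\sum_{t} r_t^{*}\Big] \;-\; \mathbb{E}\Big[\sum_{t} r_t\Big]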
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forgo learning the MDP model; directly learn V or Q:
o Value learning – learns the value of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q learning – learns Q values of the optimal policy (uses a Q version of TD Learning)