
Fundamentals of Reinforcement Learning
December 9, 2013 - Techniques of AI
Yann-Michaël De Hauwere - ydehauwe@vub.ac.be


Course material
Slides online

T. Mitchell
Machine Learning, chapter 13
McGraw Hill, 1997

Richard S. Sutton and Andrew G. Barto
Reinforcement Learning: An Introduction
MIT Press, 1998
Available online for free!

Reinforcement Learning - 2/33


Why reinforcement learning?

Based on ideas from psychology
- Edward Thorndike's law of effect
  - Satisfaction strengthens behavior, discomfort weakens it
- B.F. Skinner's principle of reinforcement
  - Skinner Box: train animals by providing (positive) feedback

Learning by interacting with the environment

Reinforcement Learning - 3/33


Why reinforcement learning?

Control learning
- Robot learning to dock on a battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon and other games

Reinforcement Learning - 4/33


The RL setting

- Learning from interactions
- Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

Reinforcement Learning - 5/33


Key features of RL

- Learner is not told which action to take
- Trial-and-error approach
- Possibility of delayed reward
  - Sacrifice short-term gains for greater long-term gains
- Need to balance exploration and exploitation
- States may be only partially observable
- May need to learn multiple tasks with the same sensors
- In between supervised and unsupervised learning

Reinforcement Learning - 6/33


The agent-environment interface
The agent interacts with the environment at discrete time steps $t = 0, 1, 2, \ldots$
- Observes state $s_t \in S$
- Selects action $a_t \in A(s_t)$
- Obtains immediate reward $r_{t+1} \in \mathbb{R}$
- Observes resulting state $s_{t+1}$

[Figure: the agent-environment loop - the agent receives state $s_t$ and reward $r_t$, emits action $a_t$, and the environment returns $r_{t+1}$ and $s_{t+1}$, yielding the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, \ldots$]
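To make the interface concrete, here is a minimal sketch of the interaction loop in Python. The `Environment` class and its `reset`/`step` methods are illustrative assumptions, not part of the slides; any environment exposing the state-reward-action cycle above would fit.

```python
import random

class Environment:
    """Toy 5-state chain: moving 'right' from the last state pays +1 and restarts."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right
        if action == 1 and self.state == 4:
            self.state = 0
            return self.state, 1.0        # goal reached: reward +1, episode restarts
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        return self.state, 0.0

env = Environment()
s = env.reset()
for t in range(10):
    a = random.choice([0, 1])             # a random policy, just to show the loop
    s_next, r = env.step(a)               # environment returns r_{t+1} and s_{t+1}
    print(f"t={t}  s_t={s}  a_t={a}  r_t+1={r}  s_t+1={s_next}")
    s = s_next
```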

Reinforcement Learning - 7/33


Elements of RL

- Time steps need not refer to fixed intervals of real time
- Actions can be
  - low level (voltage to motors)
  - high level (go left, go right)
  - "mental" (shift focus of attention)
- States can be
  - low-level "sensations" (temperature, (x, y) coordinates)
  - high-level abstractions, symbolic
  - subjective, internal ("surprised", "lost")
- The environment is not necessarily known to the agent

Reinforcement Learning - 8/33


Elements of RL

- State transitions are
  - changes to the internal state of the agent
  - changes in the environment as a result of the agent's action
  - possibly nondeterministic
- Rewards can encode
  - goals, subgoals
  - duration
  - ...

Reinforcement Learning - 9/33


Learning how to behave

- The agent's policy $\pi$ at time $t$ is
  - a mapping from states to action probabilities
  - $\pi_t(s, a) = P(a_t = a \mid s_t = s)$
- Reinforcement learning methods specify how the agent changes its policy as a result of experience
- Roughly, the agent's goal is to get as much reward as it can over the long run

Reinforcement Learning - 10/33


The objective

- Use the discounted return instead of the total reward:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

where $\gamma \in [0, 1]$ is the discount factor:

shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted
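As a quick worked example (my own numbers, not from the slides): with $\gamma = 0.9$ and rewards $r_{t+1} = 1$, $r_{t+2} = 0$, $r_{t+3} = 5$ and zero afterwards, the return is $R_t = 1 + 0.9 \cdot 0 + 0.81 \cdot 5 = 5.05$. A short sketch of the same computation:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1} over the observed rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 5]))  # 1 + 0.9*0 + 0.81*5 = 5.05
```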

Reinforcement Learning - 11/33


Example: backgammon

- Learn to play backgammon
- Immediate reward:
  - +100 if win
  - -100 if lose
  - 0 for all other states

Trained by playing 1.5 million games against itself.
Now approximately equal to the best human players.

Reinforcement Learning - 12/33


Example: pole balancing

- A continuing task with discounted return:
  - reward = -1 upon failure, 0 otherwise
  - return = $-\gamma^k$, for $k$ steps before failure
- Return is maximized by avoiding failure for as long as possible:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

Reinforcement Learning - 13/33


Examples: pole balancing (movie)

Reinforcement Learning - 14/33


Markov decision processes

- It is often useful to assume that all relevant information is present in the current state: the Markov property

$$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0)$$

- If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP)
- Assuming finite state and action spaces, it is a finite MDP

Reinforcement Learning - 15/33


Markov decision processes
An MDP is defined by
- state and action sets
- a transition function

$$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$

- a reward function

$$R^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$$
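A small finite MDP can be written down directly as these two functions. The sketch below is a hypothetical two-state, two-action example of my own, not one from the slides; it only shows the shape of the transition and reward tables.

```python
# A hypothetical 2-state, 2-action finite MDP, just to show the shape of P and R.
states = [0, 1]
actions = [0, 1]

# P[s][a][s2] = P(s_{t+1} = s2 | s_t = s, a_t = a)
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}},
}

# R[s][a][s2] = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s2)
R = {
    0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}},
    1: {0: {0: -1.0, 1: 0.0}, 1: {0: 0.0, 1: 0.5}},
}

# Sanity check: each transition distribution sums to 1.
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```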

Reinforcement Learning - 16/33


Value functions
- Goal: learn $\pi : S \rightarrow A$, given experience of the form $\langle \langle s, a \rangle, r \rangle$
- When following a fixed policy $\pi$ we can define the value of a state $s$ under that policy as

$$V^{\pi}(s) = E_{\pi}(R_t \mid s_t = s) = E_{\pi}\!\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right)$$

- Similarly, we can define the value of taking action $a$ in state $s$ as

$$Q^{\pi}(s, a) = E_{\pi}(R_t \mid s_t = s, a_t = a)$$

- Optimal policy: $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s$

Reinforcement Learning - 17/33


Value functions

- The value function has a particular recursive relationship, expressed by the Bellman equation

$$V^{\pi}(s) = \sum_{a \in A(s)} \pi(s, a) \sum_{s' \in S} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$$

- The equation expresses the recursive relation between the value of a state and the values of its successor states, averaging over all possibilities and weighting each by its probability of occurring
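The Bellman equation can be turned directly into an update rule: sweep over the states and replace each value by the right-hand side until the values stop changing (iterative policy evaluation). The sketch below assumes the P/R dictionary layout from the earlier MDP example and a policy given as action probabilities; it is illustrative rather than something specified in the slides.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman equation for V^pi until convergence.

    P[s][a][s2]: transition probability, R[s][a][s2]: expected reward,
    pi[s][a]: probability of taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                               for s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example with a uniformly random policy over two actions:
# pi = {s: {0: 0.5, 1: 0.5} for s in states}
# V = policy_evaluation(states, actions, P, R, pi)
```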

Reinforcement Learning - 19/33


Learning an optimal policy online

- Often the transition and reward functions are unknown
- Temporal difference (TD) methods are one way of overcoming this problem
  - Learn directly from raw experience
  - No model of the environment required (model-free)
  - E.g. Q-learning
- Update predicted state values based on new observations of immediate rewards and successor states

Reinforcement Learning - 20/33


Q-function

$$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a)) \quad \text{with } s_{t+1} = \delta(s_t, a_t)$$

- If we know $Q$, we do not have to know $\delta$:

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$

$$\pi^*(s) = \arg\max_a Q(s, a)$$

Reinforcement Learning - 21/33


Training rule to learn Q
- $Q$ and $V^*$ are closely related:

$$V^*(s) = \max_{a'} Q(s, a')$$

- which allows us to write $Q$ recursively as:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$

- So if $\hat{Q}$ represents the learner's current approximation of $Q$, the training rule is:

$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$

Reinforcement Learning - 22/33


Q-learning

- Q-learning updates state-action values based on the immediate reward and the optimal expected return:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

- Directly learns the optimal value function, independent of the policy being followed
- Proven to converge to the optimal policy given "sufficient" updates for each state-action pair and a decreasing learning rate $\alpha$ [Watkins92, Tsitsiklis94]
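A minimal tabular sketch of this update, assuming an environment with the reset/step interface from the earlier interaction-loop example and ε-greedy action selection (see the action-selection slides below); with $\alpha = 1$ and a deterministic environment it reduces to the training rule on the previous slide. Names and episode cap are my own choices.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                      # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(100):                    # cap episode length
            if random.random() < epsilon:       # epsilon-greedy action selection
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r = env.step(a)
            td_target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```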

Reinforcement Learning - 23/33


Q-learning

Reinforcement Learning - 24/33


Action selection

- How to select an action based on the values of the states or state-action pairs?
- The success of RL depends on a trade-off between
  - exploration
  - exploitation
- Exploration is needed to prevent getting stuck in local optima
- To ensure convergence you need to exploit

Reinforcement Learning - 25/33


Action selection

Two common choices:
- $\epsilon$-greedy
  - Choose the best action with probability $1 - \epsilon$
  - Choose a random action with probability $\epsilon$
- Boltzmann exploration (softmax) uses a temperature parameter $\tau$ to balance exploration and exploitation

$$\pi_t(s, a) = \frac{e^{Q_t(s, a)/\tau}}{\sum_{a' \in A} e^{Q_t(s, a')/\tau}}$$

pure exploitation $0 \leftarrow \tau \rightarrow \infty$ pure exploration
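A short sketch of both strategies, assuming Q is a dict keyed by (state, action) as in the earlier Q-learning sketch; the function names and layout are my own.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```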

Reinforcement Learning - 26/33


Updating Q: in practice

Reinforcement Learning - 27/33


Convergence of deterministic Q-learning

$\hat{Q}$ converges to $Q$ when each $\langle s, a \rangle$ is visited infinitely often.

Proof:
- Let a full interval be an interval during which each $\langle s, a \rangle$ is visited
- Let $\hat{Q}_n$ be the Q-table after $n$ updates
- $\Delta_n$ is the maximum error in $\hat{Q}_n$:

$$\Delta_n = \max_{s, a} |\hat{Q}_n(s, a) - Q(s, a)|$$

Reinforcement Learning - 28/33


Convergence of deterministic Q-learning
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n + 1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ satisfies

$$\begin{aligned}
|\hat{Q}_{n+1}(s, a) - Q(s, a)| &= |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))| \\
&= |\gamma \max_{a'} \hat{Q}_n(s', a') - \gamma \max_{a'} Q(s', a')| \\
&\leq \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')| \\
&\leq \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')| \\
|\hat{Q}_{n+1}(s, a) - Q(s, a)| &\leq \gamma \Delta_n < \Delta_n
\end{aligned}$$

Reinforcement Learning - 29/33


Extensions
- Multi-step TD
  - Instead of observing one immediate reward, use $n$ consecutive rewards for the value update
  - Intuition: your current choice of action may have implications for the future
- Eligibility traces
  - State-action pairs are eligible for future rewards, with more recent states getting more credit
Reinforcement Learning - 30/33


Extensions

- Reward shaping
  - Incorporate domain knowledge to provide additional rewards during an episode
  - Guides the agent to learn faster
  - (Optimal) policies are preserved given a potential-based shaping function [Ng99]
- Function approximation
  - So far we have used a tabular representation of value functions
  - For large state and action spaces this approach becomes intractable
  - Function approximators can be used to generalize over large or even continuous state and action spaces (see the sketch below)
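As an illustration of the last point, here is a minimal sketch of a Q-learning step with a linear function approximator over state-action features; the polynomial feature map and gradient-style update are generic textbook choices of my own, not something specified in these slides.

```python
import numpy as np

def linear_q_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One gradient-style Q-learning step with Q(s,a) = w . features(s,a)."""
    q_sa = w @ features(s, a)
    q_next = max(w @ features(s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * features(s, a)

def features(s, a, n_actions=2, order=3):
    """Illustrative feature map: polynomial features of a 1-D state, one block per action."""
    phi = np.array([s ** i for i in range(order)])
    out = np.zeros(order * n_actions)
    out[a * order:(a + 1) * order] = phi
    return out

w = np.zeros(6)
w = linear_q_update(w, features, s=0.5, a=1, r=1.0, s_next=0.6, actions=[0, 1])
```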

Reinforcement Learning - 31/33


Demo

http://wilma.vub.ac.be:3000

Reinforcement Learning - 32/33


Questions?

Reinforcement Learning - 33/33
