Batch Reinforcement Learning: Alan Fern
Alan Fern
Fitted Q-iteration
Batch DQN
Online versus Batch RL
Online RL: integrates data collection and optimization
Select actions in the environment and at the same time update parameters based on each observed (s, a, s', r)
3. Repeat
   Evaluate the current policy against the database:
   run LSTDQ to generate a new set of weights
   The new weights imply a new Q-function and hence a new policy
   Replace the current weights with the new weights
   Until convergence
   (A code sketch of this outer loop is given below.)
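Here is a minimal sketch of that outer loop in Python/NumPy. The `lstdq` argument is a placeholder for the LSTDQ evaluation step derived later in these slides; the function and variable names are illustrative assumptions, not part of the original listing.

```python
import numpy as np

def lspi(samples, num_features, lstdq, tol=1e-6, max_iters=100):
    """Sketch of the LSPI outer loop described above.

    samples      : list of (s, a, r, s') transitions collected offline
    num_features : dimension K of the Q-function feature vector
    lstdq        : placeholder callable (samples, w) -> new weights,
                   standing in for the LSTDQ evaluation step
    """
    w = np.zeros(num_features)            # initial weights -> initial greedy policy
    for _ in range(max_iters):
        # Evaluate the current policy against the fixed sample database
        w_new = lstdq(samples, w)
        # New weights imply a new Q-function and hence a new greedy policy;
        # stop once the weights stop changing (until convergence)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new                         # replace current weights with new weights
    return w
```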
Results: Bicycle Riding
Projection Approach to Approximation
Recall the standard Bellman equation:
$V^*(s) = \max_a \left[ R(s,a) + \beta \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$
or equivalently $V^* = T[V^*]$, where $T[\cdot]$ is the Bellman operator
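As a concrete reference, a minimal NumPy sketch of one application of the Bellman operator for a small finite MDP is shown below. The array shapes and variable names are assumptions made for the sketch, not taken from the slides.

```python
import numpy as np

def bellman_backup(V, R, P, beta):
    """One application of the Bellman operator T[V] for a finite MDP.

    V    : (S,)      current value estimate
    R    : (S, A)    R[s, a] = expected immediate reward
    P    : (A, S, S) P[a, s, s'] = transition probability
    beta : discount factor
    """
    # expected next-state value: E[V(s') | s, a] = sum_{s'} P[a, s, s'] * V[s']
    EV = np.einsum('ast,t->as', P, V).T      # shape (S, A)
    Q = R + beta * EV                        # Q(s, a) = R(s, a) + beta * E[V(s')]
    return Q.max(axis=1)                     # T[V](s) = max_a Q(s, a)
```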
Recall from value iteration that the sub-optimality of a value function can be bounded in terms of the Bellman error:
$\| V - T[V] \|_\infty$
So we seek an approximation $\hat{V}^*$ that approximately satisfies $\hat{V}^* \approx T[\hat{V}^*]$
Depending on the approximation space, this will have a small Bellman error
In projected value iteration, the exact backup $V^{i+1} = T[V^i]$ is replaced by
$\hat{V}^{i+1} = \Pi\, T[\hat{V}^i]$, where $\Pi$ projects the backed-up function back onto the representable space
Example: Projected Bellman Backup
Restrict the space to linear functions over a single feature $\phi$:
$\hat{V}(s) = w\,\phi(s)$, with $\phi(s_1) = 1$ and $\phi(s_2) = 2$
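To make the backup-then-project step concrete, here is a small NumPy sketch using the slide's single feature $\phi(s_1)=1$, $\phi(s_2)=2$. The rewards, transitions, and discount are invented for illustration (the slide does not specify them), and with a single action per state the max in the backup is trivial; the point is only the mechanics of doing an exact backup and then least-squares fitting $w$.

```python
import numpy as np

# V_hat(s) = w * phi(s), with phi(s1) = 1 and phi(s2) = 2 as on the slide.
phi  = np.array([1.0, 2.0])       # feature value of each state
R    = np.array([0.0, 1.0])       # assumed rewards (not from the slide)
P    = np.array([[0.0, 1.0],      # assumed transitions: both states go to s2
                 [0.0, 1.0]])
beta = 0.5                        # assumed discount factor

w = 0.0
for i in range(6):
    V_hat  = w * phi                      # current approximate values
    backup = R + beta * P @ V_hat         # exact Bellman backup of V_hat
    # Projection: least-squares fit of w so that w*phi matches the backup
    w_fit, *_ = np.linalg.lstsq(phi.reshape(-1, 1), backup, rcond=None)
    w = float(w_fit[0])
    print(f"iteration {i}: w = {w:.4f}")  # converges for this small discount
```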
Problem: Stability
For exact value iteration, stability is ensured by the contraction property of the Bellman backup:
$V^{i+1} = T[V^i]$
Is the projected backup $\hat{V}^{i+1} = \Pi\, T[\hat{V}^i]$ still a contraction?
Example: Stability Problem [Bertsekas & Tsitsiklis 1996]
Problem: most projections lead to backups that are not contractions and are therefore unstable
[Figure: a two-state example ($s_1$, $s_2$) plotting $V(x)$ against the iteration number; the successive approximations $\hat{V}^1, \hat{V}^2, \hat{V}^3, \ldots$ grow without bound, i.e. projected value iteration diverges.]
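Below is a sketch of this kind of two-state instability, in the spirit of the Bertsekas & Tsitsiklis example: $s_1 \to s_2$, $s_2 \to s_2$, all rewards zero, features $\phi(s_1)=1$, $\phi(s_2)=2$. The exact numbers are a reconstruction rather than a verbatim copy of the slide's figure; the true value function is identically zero, yet the projected backups drive $w$ off to infinity once the discount is large enough.

```python
import numpy as np

phi  = np.array([1.0, 2.0])       # phi(s1) = 1, phi(s2) = 2
P    = np.array([[0.0, 1.0],      # s1 -> s2, s2 -> s2 (deterministic)
                 [0.0, 1.0]])
R    = np.zeros(2)                # all rewards zero, so the true V is 0
beta = 0.95                       # large discount factor

w = 1.0                           # any nonzero starting weight
for i in range(10):
    backup = R + beta * P @ (w * phi)      # exact backup of current V_hat
    # closed-form least-squares projection back onto span{phi}
    w = float(phi @ backup / (phi @ phi))  # here this is w <- (6*beta/5) * w
    print(f"iteration {i}: w = {w:.3f}")   # grows without bound
```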
Linear approximation: $\hat{V}(s) = \sum_k w_k\, \phi_k(s)$
The $\phi_k$ are arbitrary feature functions of states
In matrix form, $\hat{V} = \Phi\, w$:
$$\Phi = \begin{pmatrix} \phi_1(s_1) & \cdots & \phi_K(s_1) \\ \vdots & & \vdots \\ \phi_1(s_n) & \cdots & \phi_K(s_n) \end{pmatrix}, \qquad w = \begin{pmatrix} w_1 \\ \vdots \\ w_K \end{pmatrix}$$
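A small NumPy sketch of this matrix form; the particular states and feature functions below are arbitrary placeholders chosen only to show how $\Phi$ and $\hat{V} = \Phi w$ are assembled.

```python
import numpy as np

def feature_matrix(states, feature_fns):
    """Phi: one row per state, one column per feature function phi_k."""
    return np.array([[phi_k(s) for phi_k in feature_fns] for s in states])

states      = [0, 1, 2, 3]                                    # placeholder states
feature_fns = [lambda s: 1.0, lambda s: float(s), lambda s: s ** 2]
Phi   = feature_matrix(states, feature_fns)                   # shape (n, K)
w     = np.array([0.5, -0.1, 0.02])                           # some weight vector
V_hat = Phi @ w                     # V_hat(s_i) = sum_k w_k * phi_k(s_i)
```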
Deriving LSTD
$\hat{V} = \Phi\, w$ assigns a value to every state
For the fixed policy $\pi$ being evaluated, we want $\hat{V} \approx R + \beta P \hat{V}$, where
$$R = \begin{pmatrix} R(s_1, \pi(s_1)) \\ \vdots \\ R(s_n, \pi(s_n)) \end{pmatrix}, \qquad P = \begin{pmatrix} P(s_1 \mid s_1, \pi(s_1)) & \cdots & P(s_n \mid s_1, \pi(s_1)) \\ \vdots & & \vdots \\ P(s_1 \mid s_n, \pi(s_n)) & \cdots & P(s_n \mid s_n, \pi(s_n)) \end{pmatrix}$$
But…
It is expensive to construct these matrices (e.g. P is |S| × |S|)
Presumably we are using LSPI because |S| is enormous
We don’t know P
We don’t know R
Using Samples for $\Phi$
Suppose we have state transition samples of the policy running in the MDP: $\{(s_i, a_i, r_i, s_i')\}$
Estimate $\Phi$ from the sampled states, with one row per sample and one column per basis function (K basis functions):
$$\hat{\Phi} = \begin{pmatrix} \phi_1(s_1) & \phi_2(s_1) & \cdots & \phi_K(s_1) \\ \phi_1(s_2) & \phi_2(s_2) & \cdots & \phi_K(s_2) \\ \vdots & & & \vdots \end{pmatrix}$$
Using Samples for R
Suppose we have state transition samples of the policy running in the MDP: $\{(s_i, a_i, r_i, s_i')\}$
Estimate $R$ from the sampled rewards, with one row per sample:
$$\hat{R} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \end{pmatrix}$$
Using Samples for P
Idea: replace the expectation over next states with the sampled next states.
Estimate $P\Phi$ using the next state $s'$ from each sample $(s, a, r, s')$, with one row per sample and one column per basis function (K basis functions):
$$\widehat{P\Phi} = \begin{pmatrix} \phi_1(s_1') & \phi_2(s_1') & \cdots & \phi_K(s_1') \\ \phi_1(s_2') & \phi_2(s_2') & \cdots & \phi_K(s_2') \\ \vdots & & & \vdots \end{pmatrix}$$
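Putting the last three slides into code: a minimal sketch that builds the sampled versions of $\Phi$, $R$, and $P\Phi$ from a batch of transitions. The function name and data layout are assumptions for the sketch.

```python
import numpy as np

def sample_matrices(samples, feature_fns):
    """Build the sample-based estimates from the preceding slides.

    samples     : list of (s, a, r, s') transitions generated by the policy
    feature_fns : the K basis functions phi_1, ..., phi_K

    Returns Phi_hat (features of sampled states), R_hat (sampled rewards),
    and PhiP_hat (features of sampled next states, standing in for P * Phi).
    """
    Phi_hat  = np.array([[phi(s)  for phi in feature_fns] for s, a, r, sp in samples])
    R_hat    = np.array([r for s, a, r, sp in samples])
    PhiP_hat = np.array([[phi(sp) for phi in feature_fns] for s, a, r, sp in samples])
    return Phi_hat, R_hat, PhiP_hat
```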
Putting it Together
LSTD needs to compute:
$$w = \left( \Phi^T \Phi - \beta\, \Phi^T P \Phi \right)^{-1} \Phi^T R = B^{-1} b$$
$B = \Phi^T (\Phi - \beta P \Phi)$
$b = \Phi^T R$, with $\Phi$, $R$, and $P\Phi$ estimated from samples as on the previous slides
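Combining the pieces, here is a minimal sketch of the LSTD solve using the sampled matrices built above (in practice one might add regularization or use a pseudo-inverse when $B$ is ill-conditioned; that detail is not from the slides).

```python
import numpy as np

def lstd_weights(Phi_hat, R_hat, PhiP_hat, beta):
    """LSTD weights w = B^{-1} b from sampled matrices.

    B = Phi^T (Phi - beta * P Phi), with P Phi replaced by next-state features
    b = Phi^T R
    """
    B = Phi_hat.T @ (Phi_hat - beta * PhiP_hat)
    b = Phi_hat.T @ R_hat
    return np.linalg.solve(B, b)     # w = B^{-1} b
```

For example, `w = lstd_weights(*sample_matrices(samples, feature_fns), beta=0.95)` would evaluate the policy that generated `samples` (names as in the sketches above).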