Lecture 6: Value Function Approximation
David Silver
Outline
1 Introduction
2 Incremental Methods
3 Batch Methods
Introduction
Types of Value Function Approximation

v̂(s, w) ≈ vπ(s)
or q̂(s, a, w) ≈ qπ(s, a)

[Diagram: three architectures — state s in, v̂(s, w) out; state s and action a in, q̂(s, a, w) out; state s in, q̂(s, a1, w) … q̂(s, am, w) out, one value per action]
Incremental Methods
Gradient Descent
Let J(w) be a differentiable function of parameter vector w

Define the gradient of J(w) to be

∇w J(w) = ( ∂J(w)/∂w1 , … , ∂J(w)/∂wn )⊤

To find a local minimum of J(w), adjust w in the direction of the negative gradient

∆w = −½ α ∇w J(w)

where α is a step-size parameter
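As a rough illustration, a single gradient-descent step on an invented quadratic objective might look like this in numpy (the objective, target, and step size are made up for the sketch):

```python
import numpy as np

# Illustrative objective: J(w) = ||w - target||^2, whose gradient is 2(w - target).
target = np.array([1.0, -2.0, 0.5])

def grad_J(w):
    return 2.0 * (w - target)

alpha = 0.1          # step-size parameter
w = np.zeros(3)      # initial parameter vector

for _ in range(100):
    w = w - 0.5 * alpha * grad_J(w)   # delta_w = -1/2 * alpha * grad J(w)

print(w)  # approaches target, the minimiser of J
```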
Feature Vectors
Represent state by a feature vector

x(S) = ( x1(S), …, xn(S) )⊤
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
Linear Function Approximation
Table lookup is a special case of linear value function approximation, using table-lookup features

xtable(S) = ( 1(S = s1), …, 1(S = sn) )⊤

The parameter vector w then gives the value of each individual state

v̂(S, w) = Σn 1(S = sn) wn
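A minimal sketch of table lookup as linear function approximation with indicator features (the state count and weight values are invented for the example):

```python
import numpy as np

n_states = 5

def x_table(s):
    """Table-lookup feature vector: 1 in position s, 0 elsewhere."""
    feat = np.zeros(n_states)
    feat[s] = 1.0
    return feat

w = np.array([0.0, 1.5, -0.3, 2.0, 0.7])  # one weight per state

def v_hat(s, w):
    # v̂(S, w) = sum_n 1(S = s_n) w_n, which reduces to w[s]
    return x_table(s) @ w

print(v_hat(3, w))  # equals w[3] = 2.0
```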
Incremental Prediction Algorithms
[Diagram: control with value function approximation — approximate policy iteration, alternating policy evaluation qw ≈ qπ with ε-greedy policy improvement, from a starting w toward qw ≈ q*]
q̂(S, A, w) ≈ qπ (S, A)
∇w q̂(S, A, w) = x(S, A)
∆w = α(qπ (S, A) − q̂(S, A, w))x(S, A)
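A hedged sketch of the incremental update above for a linear q̂, where a made-up scalar target stands in for the (unknown) true qπ(S, A):

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value approximation: q̂(S, A, w) = x(S, A)·w."""
    return x_sa @ w

def update(w, x_sa, q_target, alpha=0.1):
    """∆w = α (q_target − q̂(S, A, w)) x(S, A); q_target stands in for qπ(S, A)."""
    return w + alpha * (q_target - q_hat(x_sa, w)) * x_sa

w = np.zeros(4)
x_sa = np.array([1.0, 0.0, 0.5, 0.0])   # invented feature vector for one (S, A) pair
for _ in range(200):
    w = update(w, x_sa, q_target=3.0)

print(q_hat(x_sa, w))  # approaches the target value 3.0
```

In practice the target qπ(S, A) is itself estimated, e.g. by a Monte-Carlo return or a TD target.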
Incremental Control Algorithms
Baird’s Counterexample
Convergence
Batch Methods
Stochastic Gradient Descent with Experience Replay

Given experience D consisting of ⟨state, value⟩ pairs

Repeat:
1 Sample state, value from experience
⟨s, vπ⟩ ∼ D
2 Apply stochastic gradient descent update
∆w = α(vπ − v̂(s, w)) ∇w v̂(s, w)

Converges to least squares solution
wπ = argminw LS(w)
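The replay loop above can be sketched for a linear v̂ with a toy experience set D of ⟨feature, value⟩ pairs (all data and constants below are invented; targets are noise-free for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy experience D: feature vectors x(s) and values vπ(s) from a known linear function
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))   # x(s) for 500 sampled states
V = X @ true_w                  # vπ for each state

w = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    i = rng.integers(len(X))             # 1. sample <s, vπ> ~ D
    x, v = X[i], V[i]
    w += alpha * (v - x @ w) * x         # 2. SGD update: ∆w = α(vπ − x·w) x

print(w)  # converges toward the least squares solution (here true_w)
```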
Least Squares Prediction
DQN in Atari
At the minimum of LS(w), the expected update must be zero

ED[∆w] = 0

α Σt=1..T x(st)(vtπ − x(st)⊤w) = 0

Σt=1..T x(st) vtπ = ( Σt=1..T x(st) x(st)⊤ ) w

w = ( Σt=1..T x(st) x(st)⊤ )⁻¹ Σt=1..T x(st) vtπ
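The closed-form solution above, sketched in numpy on random data (shapes and data are illustrative; np.linalg.solve is used instead of forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 200, 3
X = rng.normal(size=(T, n))   # row t is x(s_t)^T
v = rng.normal(size=T)        # targets v_t^π

# w = (sum_t x_t x_t^T)^{-1} sum_t x_t v_t^π, i.e. solve (X^T X) w = X^T v
A = X.T @ X
b = X.T @ v
w = np.linalg.solve(A, b)

# Check the first-order condition: sum_t x_t (v_t^π − x_t^T w) = 0
residual = X.T @ (v - X @ w)
print(np.allclose(residual, 0.0))  # True
```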
[Diagram: least squares policy iteration — policy evaluation by least squares gives qw ≈ qπ, greedy policy improvement, from a starting w toward qw ≈ q*]

Chain Walk Example

[Figure: chain walk MDP with states 1–4 — each action (L or R) moves in the intended direction with probability 0.9 and in the opposite direction with probability 0.1; features of the form I(a = R) × s²]
Least Squares Control
LSPI in Chain Walk: Action-Value Function

[Figure: state-action value functions over iterations 1–7 of LSPI on the 50-state chain]

LSPI in Chain Walk: Policy

[Figure: policies over iterations 1–7 of LSPI on the 50-state chain]

Figure 13: LSPI iterations on a 50-state chain with a radial basis function approximator (reward only in states 10 and 41). Top: The state-action value function of the policy being evaluated in each iteration (LSPI approximation - solid lines; exact
Questions?