David Silver
Lecture 7: Policy Gradient
Outline
1 Introduction
2 Finite Difference Policy Gradient
3 Monte-Carlo Policy Gradient
4 Actor-Critic Policy Gradient

Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ,
Vθ (s) ≈ V π (s)
Qθ (s, a) ≈ Q π (s, a)
In this lecture we will directly parametrise the policy,
πθ (s, a) = P [a | s, θ]
We will focus again on model-free reinforcement learning
Value-Based and Policy-Based RL
Value Based: learnt value function, implicit policy (e.g. ε-greedy)
Policy Based: no value function, learnt policy
Actor-Critic: learnt value function and learnt policy
[Venn diagram: the Value Function and Policy circles overlap; Value-Based methods lie in the Value Function circle, Policy-Based methods in the Policy circle, and Actor-Critic methods in the intersection.]
Advantages of Policy-Based RL
Advantages:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Disadvantages:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance
Example: Rock-Paper-Scissors
Two-player game of rock-paper-scissors
A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. Nash equilibrium)
Similarly, in a partially observed (aliased) gridworld, an optimal stochastic policy will reach the goal state in a few steps with high probability, whereas a deterministic policy can get stuck
Policy-based RL can learn the optimal stochastic policy
Policy Search

Policy Optimisation
Policy-based reinforcement learning is an optimisation problem: find the θ that maximises J(θ)

Policy Gradient
Let J(θ) be any policy objective function
Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ,
∆θ = α∇θ J(θ)
where ∇θ J(θ) is the policy gradient and α is a step-size parameter
Finite Difference Policy Gradient

Training AIBO to Walk by Finite Difference Policy Gradient
Goal: learn a fast AIBO walk (useful for Robocup)
AIBO walk policy is controlled by 12 numbers (elliptical loci)
Adapt these parameters by finite difference policy gradient
Evaluate performance of policy by field traversal time

Background (excerpted from Kohl and Stone, UT Austin Villa [10]): the team first approached the gait optimisation problem by hand-tuning a gait described by half-elliptical loci, which performed comparably to those of other teams at RoboCup 2003 and serves as the starting point for learning. Each foot moves through a half-ellipse defined by its length, height, and position in the x-y plane (Fig. 2), with each pair of diagonally opposite legs in phase with each other and perfectly out of phase with the other two. At the American Open tournament in May 2003, a simplified parameterisation that did not allow the front and rear heights of the robot to differ was hand-tuned to a gait of 140 mm/s; after allowing the heights to differ, the hand-tuned gait reached 245 mm/s at RoboCup 2003.

In the training environment (Fig. 5), each Aibo times itself as it moves back and forth between a pair of landmarks (A and A', B and B', or C and C'). Fifteen policies (t = 15) were evaluated per iteration; since there was significant noise in each evaluation, each set of parameters was evaluated three times and the scores averaged, and learning was accelerated by using a larger step size of η = 2. Despite the remaining evaluation noise, visible as spikes in the learning curve of Fig. 6, after 23 iterations the learning process produced a gait of 291 ± 3 mm/s, faster than both the best hand-tuned gaits and the best previously learned gaits.

Fig. 1. Maximum forward velocities of the best gaits (in mm/s):
Hand-tuned: CMU (2002) 200, German Team 230, UT Austin Villa 245, UNSW 254
Learned: Hornby 1999 170, UNSW 270

Fig. 7. The initial policy, the amount of change (∆) for each parameter, and the best policy learned after 23 iterations. All values are in centimetres, except time to move through locus (seconds) and time on ground (a fraction).

Parameter                     Initial Value   ∆       Best Value
Front locus (height)          4.2             0.35    4.081
Front locus (x offset)        2.8             0.35    0.574
Front locus (y offset)        4.9             0.35    5.152
Rear locus (height)           5.6             0.35    6.02
Rear locus (x offset)         0.0             0.35    0.217
Rear locus (y offset)         -2.8            0.35    -2.982
Locus length                  4.893           0.35    5.285
Locus skew multiplier         0.035           0.175   0.049
Front height                  7.7             0.35    7.483
Rear height                   11.2            0.35    10.843
Time to move through locus    0.704           0.016   0.679
Time on ground                0.5             0.05    0.430
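To make the finite-difference estimate concrete, here is a minimal Python sketch. It assumes a hypothetical black-box evaluate(theta) that returns a (noisy) measure of policy performance, such as walk speed; this illustrates the general technique, not the UT Austin Villa implementation.

import numpy as np

def finite_difference_gradient(evaluate, theta, eps=0.05):
    # Estimate the k-th partial derivative by perturbing theta by a
    # small amount eps in the k-th dimension:
    #   dJ/dtheta_k ~= (J(theta + eps * u_k) - J(theta)) / eps
    grad = np.zeros_like(theta)          # theta: NumPy float vector
    j0 = evaluate(theta)                 # objective at the current policy
    for k in range(len(theta)):
        perturbed = theta.copy()
        perturbed[k] += eps
        grad[k] = (evaluate(perturbed) - j0) / eps
    return grad                          # n+1 evaluations for n dimensions

# Gradient ascent: theta <- theta + alpha * finite_difference_gradient(...)
# Averaging several evaluations per policy reduces the noise in each estimate.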
AIBO example
Before training
During training
After training
Monte-Carlo Policy Gradient
Monte-Carlo Policy Gradient

Likelihood Ratios
Assume the policy πθ is differentiable whenever it is non-zero, and that we know the gradient ∇θ πθ (s, a). Likelihood ratios exploit the following identity,
∇θ πθ (s, a) = πθ (s, a) · ∇θ πθ (s, a) / πθ (s, a)
= πθ (s, a) ∇θ log πθ (s, a)
The score function is ∇θ log πθ (s, a)
Softmax Policy
Weight actions using a linear combination of features, φ(s, a)ᵀθ
The probability of an action is proportional to the exponentiated weight, πθ (s, a) ∝ e^{φ(s,a)ᵀθ}
The score function is ∇θ log πθ (s, a) = φ(s, a) − Eπθ [φ(s, ·)]

Gaussian Policy
In continuous action spaces, a Gaussian policy is natural: the mean is a linear combination of state features, μ(s) = φ(s)ᵀθ, the variance may be fixed at σ², and actions are sampled a ∼ N(μ(s), σ²)
The score function is ∇θ log πθ (s, a) = (a − μ(s)) φ(s) / σ²
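As a sketch, the softmax score function can be computed directly from the definition above; phi is an assumed feature function returning a NumPy float vector, not something defined in the lecture.

import numpy as np

def softmax_probs(theta, phi, s, actions):
    # pi_theta(s, a) proportional to exp(phi(s, a) . theta)
    prefs = np.array([phi(s, a) @ theta for a in actions])
    prefs -= prefs.max()                 # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def softmax_score(theta, phi, s, a, actions):
    # grad_theta log pi_theta(s, a) = phi(s, a) - E_pi[phi(s, .)]
    p = softmax_probs(theta, phi, s, actions)
    expected_phi = sum(pb * phi(s, b) for pb, b in zip(p, actions))
    return phi(s, a) - expected_phi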
One-Step MDPs
Consider a simple class of one-step MDPs: start in state s ∼ d(s), and terminate after one time-step with reward r = Rs,a . Using likelihood ratios to compute the policy gradient,
J(θ) = Eπθ [r]
= Σs∈S d(s) Σa∈A πθ (s, a) Rs,a
∇θ J(θ) = Σs∈S d(s) Σa∈A πθ (s, a) ∇θ log πθ (s, a) Rs,a
= Eπθ [∇θ log πθ (s, a) r]
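The last line suggests a sampling-based estimator: average ∇θ log πθ (s, a) r over draws from the policy. A minimal sketch for an illustrative two-action one-step MDP (a bandit) with a softmax policy, where the Monte-Carlo average approaches the exact gradient:

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # one preference per action
R = np.array([1.0, 0.0])             # rewards R_{s,a} for the single state

p = np.exp(theta - theta.max())
p /= p.sum()

# Monte-Carlo estimate of grad J(theta) = E[grad log pi(a) * r]
samples = []
for _ in range(10_000):
    a = rng.choice(2, p=p)
    score = np.eye(2)[a] - p         # softmax score function
    samples.append(score * R[a])
print(np.mean(samples, axis=0))      # ~= [0.25, -0.25], the exact gradient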
Policy Gradient Theorem
Theorem
For any differentiable policy πθ (s, a),
for any of the policy objective functions J = J1 , JavR , or 1/(1−γ) JavV ,
the policy gradient is
∇θ J(θ) = Eπθ [∇θ log πθ (s, a) Q πθ (s, a)]
Monte-Carlo Policy Gradient (REINFORCE)
Update parameters by stochastic gradient ascent, using the policy gradient theorem and the return vt as an unbiased sample of Q πθ (st , at ), ∆θt = α∇θ log πθ (st , at ) vt
function REINFORCE
Initialise θ arbitrarily
for each episode {s1 , a1 , r2 , ..., sT −1 , aT −1 , rT } ∼ πθ do
for t = 1 to T − 1 do
θ ← θ + α∇θ log πθ (st , at )vt
end for
end for
return θ
end function
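For concreteness, here is a runnable Python sketch of REINFORCE with a softmax policy over linear features. The environment interface (reset() -> s, step(a) -> (s', r, done)) and the feature function phi (returning NumPy float vectors) are assumptions, not part of the lecture.

import numpy as np

def reinforce(env, phi, n_actions, alpha=0.01, gamma=1.0, episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(phi(env.reset(), 0))

    def probs(s):
        # softmax policy: pi(s, a) proportional to exp(phi(s, a) . theta)
        prefs = np.array([phi(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(episodes):
        s, done, episode = env.reset(), False, []
        while not done:                      # sample one episode from pi_theta
            a = rng.choice(n_actions, p=probs(s))
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        v = 0.0
        for s, a, r in reversed(episode):    # v_t: return from step t onwards
            v = r + gamma * v
            p = probs(s)
            score = phi(s, a) - sum(pb * phi(s, b) for pb, b in zip(p, range(n_actions)))
            theta += alpha * score * v       # theta <- theta + alpha grad log pi * v_t
    return theta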
Puck World Example
Continuous actions exert a small force on the puck
The puck is rewarded for getting close to the target
The target location is reset every 30 seconds
The policy is trained using a variant of Monte-Carlo policy gradient
[Figure (Baxter et al.): average reward versus training iterations for the neural-network puck controller, including the four runs (out of 100) that converged to substantially suboptimal local minima.]
Actor-Critic Policy Gradient

Reducing Variance Using a Critic
Monte-Carlo policy gradient still has high variance
We use a critic to estimate the action-value function,
Qw (s, a) ≈ Q πθ (s, a)
Actor-critic algorithms maintain two sets of parameters:
Critic: updates action-value function parameters w
Actor: updates policy parameters θ, in the direction suggested by the critic
Action-Value Actor-Critic
Simple actor-critic algorithm based on the action-value critic
Using linear value function approximation, Qw (s, a) = φ(s, a)ᵀw
Critic: updates w by linear TD(0)
Actor: updates θ by policy gradient
function QAC
Initialise s, θ
Sample a ∼ πθ
for each step do
Sample reward r = R_s^a ; sample transition s′ ∼ P^a_{s,·}
Sample action a′ ∼ πθ (s′, a′)
δ = r + γQw (s′, a′) − Qw (s, a)
θ = θ + α∇θ log πθ (s, a) Qw (s, a)
w ← w + βδ φ(s, a)
a ← a′, s ← s′
end for
end function
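A Python sketch of QAC with the linear critic Qw (s, a) = φ(s, a)ᵀw and a softmax actor; as above, env and phi are assumed interfaces rather than anything defined in the lecture.

import numpy as np

def qac(env, phi, n_actions, alpha=0.01, beta=0.1, gamma=0.9, steps=50_000, seed=0):
    rng = np.random.default_rng(seed)
    s = env.reset()
    theta = np.zeros_like(phi(s, 0))          # actor parameters
    w = np.zeros_like(theta)                  # critic parameters

    def probs(state):
        prefs = np.array([phi(state, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()                  # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    a = rng.choice(n_actions, p=probs(s))
    for _ in range(steps):
        s2, r, done = env.step(a)
        a2 = rng.choice(n_actions, p=probs(s2))
        # Critic: linear TD(0) error (no bootstrap on terminal transitions)
        target = r if done else r + gamma * (phi(s2, a2) @ w)
        delta = target - phi(s, a) @ w
        # Actor: ascend grad log pi(s, a) weighted by the critic's Q estimate
        p = probs(s)
        score = phi(s, a) - sum(pb * phi(s, b) for pb, b in zip(p, range(n_actions)))
        theta = theta + alpha * score * (phi(s, a) @ w)
        w = w + beta * delta * phi(s, a)
        if done:
            s = env.reset()
            a = rng.choice(n_actions, p=probs(s))
        else:
            s, a = s2, a2
    return theta, w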
Compatible Function Approximation
Theorem (Compatible Function Approximation)
If the following two conditions are satisfied:
1. The value function approximator is compatible to the policy, ∇w Qw (s, a) = ∇θ log πθ (s, a)
2. The value function parameters w minimise the mean-squared error, ε = Eπθ [(Q πθ (s, a) − Qw (s, a))²]
then the policy gradient is exact, ∇θ J(θ) = Eπθ [∇θ log πθ (s, a) Qw (s, a)]

Proof: if w minimises the mean-squared error, the gradient of ε w.r.t. w must be zero,
∇w ε = 0
Eπθ [(Q πθ (s, a) − Qw (s, a)) ∇w Qw (s, a)] = 0
By compatible function approximation, ∇w Qw (s, a) = ∇θ log πθ (s, a), so
Eπθ [(Q πθ (s, a) − Qw (s, a)) ∇θ log πθ (s, a)] = 0
Eπθ [Q πθ (s, a) ∇θ log πθ (s, a)] = Eπθ [Qw (s, a) ∇θ log πθ (s, a)]
i.e. Qw (s, a) can be substituted directly into the policy gradient
Estimating the Advantage Function
The advantage function Aπθ (s, a) = Q πθ (s, a) − V πθ (s) can significantly reduce the variance of the policy gradient. One option is to estimate both values, using two function approximators and two parameter vectors,
Vv (s) ≈ V πθ (s)
Qw (s, a) ≈ Q πθ (s, a)
A(s, a) = Qw (s, a) − Vv (s)
Alternatively, for the true value function V πθ (s), the TD error δ πθ = r + γV πθ (s′) − V πθ (s) is an unbiased estimate of the advantage function,
Eπθ [δ πθ | s, a] = Eπθ [r + γV πθ (s′) | s, a] − V πθ (s)
= Q πθ (s, a) − V πθ (s)
= Aπθ (s, a)
So we can use the TD error to compute the policy gradient
∇θ J(θ) = Eπθ [∇θ log πθ (s, a) δ πθ ]
In practice we can use an approximate TD error
δv = r + γVv (s′) − Vv (s)
This approach only requires one set of critic parameters v
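A minimal sketch of the corresponding single-transition update, assuming a linear state-value critic Vv (s) = φ(s)ᵀv and a score function like the softmax one above (phi_s and score are assumed helpers):

import numpy as np

def td_actor_critic_update(theta, v, phi_s, score, s, a, r, s2,
                           alpha=0.01, beta=0.1, gamma=0.9):
    # Approximate TD error from the state-value critic Vv(s) = phi_s(s) . v
    delta = r + gamma * (phi_s(s2) @ v) - phi_s(s) @ v
    theta = theta + alpha * score(s, a) * delta   # actor: grad log pi * delta
    v = v + beta * delta * phi_s(s)               # critic: linear TD(0) on V
    return theta, v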
Eligibility Traces
As in backward-view TD(λ), the actor can use an eligibility trace of the score function,
δ = r + γVv (s′) − Vv (s)
e ← λe + ∇θ log πθ (s, a)
∆θ = αδe
Natural Actor-Critic
The natural policy gradient is parametrisation independent,
∇θ^nat πθ (s, a) = Gθ⁻¹ ∇θ πθ (s, a)
where Gθ is the Fisher information matrix,
Gθ = Eπθ [∇θ log πθ (s, a) ∇θ log πθ (s, a)ᵀ]
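One way to estimate the natural gradient from samples is sketched below, assuming scores is an array whose rows are sampled score vectors ∇θ log πθ (s, a) and grad_j is a vanilla policy-gradient estimate (both names are illustrative):

import numpy as np

def natural_gradient(scores, grad_j, reg=1e-3):
    # Fisher information matrix G = E[score score^T], estimated from samples
    G = scores.T @ scores / scores.shape[0]
    G += reg * np.eye(G.shape[0])         # regularise so G is invertible
    return np.linalg.solve(G, grad_j)     # G^{-1} grad J, without forming G^{-1}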
Natural Actor-Critic in Snake Domain
With the initial policy parameters θᵢᵇ, the robot soon hit the wall and got stuck (Fig. 3(a)). In contrast, the policy parameters after learning, θᵢᵃ, allowed the robot to pass through the crank course repeatedly (Fig. 3(b)). This result shows that a good controller was acquired by the off-NAC algorithm.
[Fig. 3: trajectories of the snake robot (position y vs. position x, in metres) before and after learning, together with time series of cos(angle), joint angle (rad), and distance d (m) over 12,000 time steps.]
Snake example