
Linking Score Function Algorithms, Pathwise Derivative Algorithms, and Q-Learning Algorithms

Lord Apricot
School of the Dark Arts
Shadow Realm University
The Shadow Realm, TAS, AUSTRALIA, 666
lord.apricot@darkmagic.edu

March 12, 2020

Abstract
Nothing here.

1 Introduction
Let’s just get to it.

1.1 Deriving standard policy gradient loss functions

We start with a parameterized distribution pθ(x) and a fixed distribution q(x). The cross-entropy between the two is:

\[ H(p, q) = -\int_x p_\theta(x) \log q(x)\,dx = -\mathbb{E}_{x \sim p}\left[\log q(x)\right] \]

Our goal is to reduce H(p, q) by updating the parameters of pθ(x), which we can do by taking the gradient of H(p, q) and stepping θ in the direction that minimizes it:

\[ \nabla_\theta H(p, q) = -\nabla_\theta \int_x p_\theta(x) \log q(x)\,dx \]

Using the score-function (log-derivative) identity from the policy gradient theorem, ∇θ pθ(x) = pθ(x) ∇θ log pθ(x):

\[ \nabla_\theta H(p, q) = -\mathbb{E}_{x \sim p}\left[\nabla_\theta \log p_\theta(x) \log q(x)\right] \]
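To make this estimator concrete, here is a minimal sketch, assuming PyTorch: policy is any torch.distributions object whose parameters depend on θ, and log_q is a hypothetical callable that returns log q(x) as a tensor.

def cross_entropy_surrogate(policy, log_q, num_samples=64):
    # Surrogate loss whose gradient is a Monte Carlo estimate of grad_theta H(p, q).
    x = policy.sample((num_samples,))        # x ~ p_theta(x); sampling carries no gradient
    # Only log p_theta(x) carries the gradient; log q(x) is treated as a fixed weight.
    return -(policy.log_prob(x) * log_q(x).detach()).mean()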


We can recover the policy gradient theorem by letting pθ(x) = πθ(at|st) and q(x) = exp(Qπ(st, at) − Vπ(st)) / Zt, and summing over timesteps T:

\[ \nabla_\theta J(\theta) = \sum_t^{T} \nabla_\theta H(p_t, q_t) = -\mathbb{E}_{s,a \sim \pi}\left[\sum_t^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\left(Q^\pi(s_t, a_t) - V^\pi(s_t) - \log Z_t\right)\right] \]
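In code, this is the familiar advantage-weighted policy-gradient loss. A minimal sketch, assuming PyTorch tensors (names are illustrative): log_probs holds log πθ(at|st) along sampled trajectories and advantages holds estimates of Qπ(st, at) − Vπ(st) − log Zt.

def pg_cross_entropy_loss(log_probs, advantages):
    # Gradient descent on this surrogate follows -grad H(p_t, q_t), summed over timesteps.
    return -(log_probs * advantages.detach()).sum()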

This also implies a form of the policy gradient theorem that minimizes the KL-divergence instead. This can be recovered using:

\[ \mathrm{KL}(p \,\|\, q) = H(p, q) - H(p) \]



Where H(p) = −Ex∼p[log pθ(x)] is the entropy, and the gradient of the entropy term is taken only through the log pθ(x) inside the expectation (holding the sampling distribution fixed). Thus:

\[ \nabla_\theta \mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\nabla_\theta \log p_\theta(x)\right] - \mathbb{E}_{x \sim p}\left[\nabla_\theta \log p_\theta(x) \log q(x)\right] \]
Which reduces to:

\[ \nabla_\theta \mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\nabla_\theta \log p_\theta(x)\,\bigl(1 - \log q(x)\bigr)\right] \]


We can recover the target objective in [1] by substituting pθ(x) = πθ(a|s) and the entropy-regularized objective log q(x) = Qπ(s, a) − log πθ(a|s). Alternatively, we can make the same substitutions we used to recover the policy gradient theorem, giving a policy gradient that minimizes the KL-divergence rather than the cross-entropy:

\[ \nabla_\theta J(\theta) = \sum_t^{T} \nabla_\theta \mathrm{KL}(p_t \,\|\, q_t) = \mathbb{E}_{s,a \sim \pi}\left[\sum_t^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\,\Bigl(1 - \bigl(Q^\pi(s_t, a_t) - V^\pi(s_t) - \log Z_t\bigr)\Bigr)\right] \]
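Under the same assumptions as the earlier policy-gradient sketch, the KL variant only changes the weight on each log-probability from the advantage term to one minus that term (and flips the overall sign):

def pg_kl_loss(log_probs, advantages):
    # Gradient matches the KL form above: E[ grad log pi * (1 - advantage) ].
    return (log_probs * (1.0 - advantages.detach())).sum()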

It is also straightforward to show that the pathwise derivative, as used in DDPG, TD3, SVG(0), and SAC, follows the same pattern. Using a change of variables:

\[ \nabla_\theta H(p, q) = -\int_x p_\theta(x)\, \nabla_\theta \log p_\theta(x) \log q(x)\,dx = -\int_\epsilon p(\epsilon)\, \nabla_x \log q(x(\theta; \epsilon))\, \nabla_\theta x(\theta; \epsilon)\,d\epsilon \]

We can recover the DDPG update by letting x = a such that a(θ; ε) = f(θ) + ε, where ε is some form of noise injection (Ornstein-Uhlenbeck in the case of DDPG, N(0, σI) in the case of TD3 and SVG(0)) and log q(x) = Qµ(s, a) is the off-policy state-action value function:

\[ \nabla_\theta H(p, q) = -\int_{s,\epsilon} p(s)\, p(\epsilon)\, \nabla_a Q^\mu(s, a(\theta; \epsilon))\, \nabla_\theta a(\theta; \epsilon)\, d\epsilon\, ds \approx -\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\left[\nabla_a Q^\mu(s, a(\theta; \epsilon))\, \nabla_\theta a(\theta; \epsilon)\right] \]
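A minimal sketch of this update, assuming PyTorch and hypothetical actor and critic modules (the critic standing in for Qµ): gradients flow through the critic into the actor's action, as in the pathwise estimate above.

def ddpg_actor_loss(actor, critic, states):
    actions = actor(states)              # deterministic actions; exploration noise is not used here
    # Minimizing -Q(s, a(theta)) follows the pathwise gradient; in practice only the
    # actor's parameters are updated with this loss.
    return -critic(states, actions).mean()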

Again, this implies a form of the loss function that minimizes the KL-divergence instead, which, using the same identities as before, gives us:

\[ \nabla_\theta \mathrm{KL}(p \,\|\, q) \approx \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\left[\nabla_\theta \bigl(\log \pi_\theta(a|s) - Q^\mu(s, a(\theta; \epsilon))\bigr)\right] \]

Which is the SAC policy loss [2].
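A minimal sketch of that loss, assuming PyTorch: policy(states) is assumed to return a reparameterizable torch.distributions object (so rsample() routes gradients through a(θ; ε)), and critic stands in for Qµ; only the policy parameters should be optimized with this loss.

def sac_policy_loss(policy, critic, states):
    dist = policy(states)
    actions = dist.rsample()             # reparameterized sample, a = a(theta; eps)
    log_probs = dist.log_prob(actions)   # assumed to be summed over action dimensions
    return (log_probs - critic(states, actions)).mean()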

1.2 Linking policy gradients with Q-learning algorithms

As a side-effect of the above, we can choose other objective functions that minimize the same quantities, such as the mean squared-error loss function. For example, A2C can be trained using the objective:

\[ J(\theta) = \mathbb{E}_\pi\left[\sum_t^{T-1} \bigl(\log \pi_\theta(a_t|s_t) + V^\pi(s_t) - Q^\pi(s_t, a_t)\bigr)^2\right] \]

Where the un-normalized value estimates are used (yes, this works). This not only reduces the cross-entropy between the policy and the value function, but also maximizes the return of the policy, since the expansion of the loss function takes the form Kt − 2 log πθ(at|st)(Qπ(st, at) − Vπ(st)), where Kt collects the remaining terms. This also provides a potential insight into what discrete-action algorithms such as Q-learning are actually doing.
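A minimal sketch of the squared-error objective above, assuming PyTorch tensors for the per-timestep quantities (the names are illustrative): values and q_values hold the un-normalized Vπ(st) and Qπ(st, at) estimates, detached so that only the policy receives this gradient.

def mse_policy_loss(log_probs, values, q_values):
    # Squared-error form of the cross-entropy objective: (log pi + V - Q)^2.
    return ((log_probs + values.detach() - q_values.detach()) ** 2).mean()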
Q-learning uses the loss:

\[ J(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \epsilon}\left[\Bigl(Q^\pi(s, a) - \bigl(r + \gamma \max_{a'} Q^\mu(s', a')\bigr)\Bigr)^2\right] \]
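For reference, a minimal sketch of this loss, assuming PyTorch and hypothetical q_net and target_net modules that map states to a vector of per-action values (terminal-state masking omitted for brevity):

import torch

def q_learning_loss(q_net, target_net, states, actions, rewards, next_states, gamma=0.99):
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                # the bootstrapped target is held fixed
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    return ((q_sa - targets) ** 2).mean()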

Since Qπ(s, a) is actually a function Qθ : S → Rd, where d is the number of discrete actions, we can think of its output as log pθ(a|s) for a categorical distribution. Rather than sampling from pθ(a|s) directly, we use heuristic noise injection (e.g. ε-greedy) to select actions. Taking this view, we can re-frame Q-learning as minimizing the loss:


\[ J(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \epsilon}\left[\bigl(\log p_\theta(a|s) - Q^*(s, a)\bigr)^2\right] \]

Where we estimate V∗(s) ≈ maxa Qµ(s, a) and make use of the definition of the Q-function (Q∗(s, a) = r + γV∗(s′) for the sampled transition). This can be expanded to get K − 2 log pθ(a|s) Q∗(s, a), which, when we take the gradient with respect to θ, takes the form of a policy gradient using the score function estimator.
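Purely to illustrate this view (it is not a standard algorithm), a sketch that implements the re-framed loss literally, assuming PyTorch: logits_net is a hypothetical network whose log-softmax output is read as log pθ(a|s), and q_star_targets is a fixed estimate of Q∗(s, a) built as described above.

import torch.nn.functional as F

def reframed_q_loss(logits_net, states, actions, q_star_targets):
    log_p = F.log_softmax(logits_net(states), dim=1)
    log_p_a = log_p.gather(1, actions.unsqueeze(1)).squeeze(1)
    return ((log_p_a - q_star_targets.detach()) ** 2).mean()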

1.3 Is the KL-loss a better loss function?

As a side effect of minimizing the KL-loss above, we also maximize the entropy (since we minimize the negative entropy) in the process. This may benefit exploration at the cost of slower learning, since we would expect the agent to take worse actions more often. This ties in with entropy regularization for exploration in RL algorithms: the entropy bonus is effectively approximating a KL-divergence loss function.

References
[1] Wenjie Shi, Shiji Song, and Cheng Wu. Soft policy gradient method for maximum entropy deep reinforcement
learning. arXiv preprint arXiv:1909.03198, 2019.
[2] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy
deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
