Professional Documents
Culture Documents
ORF 544 Week 8 PFAs and Policy Search
ORF 544 Week 8 PFAs and Policy Search
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
120.00
100.00
80.00
60.00
Discharge
40.00
Charge 20.00
0.00
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70
parameters
© 2019 Warren B. Powell
Policy function approximations
Our policy function might be the parametric
model (this is nonlinear in the parameters):
1 if pt charge
X ( St | ) 0 if charge pt discharge
1 if p charge
t
Energy in storage:
Price of electricity:
Discharge
Charge
© 2019 Warren Powell Slide 7
PFAs
St 2
St 3
X ( St )
StI
1 if pt charge
X ( St | ) 0 if charge pt discharge
1 if p charge
t
Discharge
Charge
© 2019 Warren B. Powell
Derivative-based policy search
Simulating finite difference
One-dimensional search
x n 1 x n nf ( x n )
Starting point
Stepsizes for policy search
Tuning the parameters
» Stepsizes tuned for region [0,2].
Prior
3 5 8
Posterior iter 2
Exclude w.
Exclude w. prob.
prob.
2 3 5 8
Stepsizes for policy search
Fibonacci search
» Fibonacci search is an optimal
algorithms for finding the
maximum of a unimodular
function.
» This process requires that we be
able to evaluate the function
deterministically.
» We assume Lipschitz continuity, as
well as knowledge of the error
distribution. This allows us to
estimate the probability that the
optimum is toward the left or right.
» We then repeat this multiple times,
and create a distribution about where
the optimum is.
Stepsizes for policy search
Fibonacci search for
» 34 Fibonacci numbers
» Iterations: 1
» Deterministic response
Frequency
» Always finds the correct
optimum.
Stepsizes for policy search
Fibonacci search for
» 34 Fibonacci numbers
» Iterations: 1
» Medium noise
Frequency
Stepsizes for policy search
Fibonacci search for
» 34 Fibonacci numbers
» Iterations: 1
» High noise
Frequency
Stepsizes for policy search
Fibonacci search for
» 34 Fibonacci numbers
» Iterations: 1
» High noise
Frequency
Fibonacci search for
» 34 Fibonacci numbers
» Iterations: 10
» High noise
Frequency
» Clearly 10 iterations (340
function evaluations) are
not needed.
Stepsizes for policy search
Fibonacci search for
» Iterations: 10
» High noise
Frequency
Stepsizes for policy search
Fibonacci search for
» Iterations: 10
» High noise
Frequency
Stepsizes for policy search
Notes
» Even with very high noise, it appears that Fibonacci
does a very reliable job of finding a near-optimal point.
» The next step is to see how well this works in a
stochastic gradient algorithm.
Derivative-based:
Categorical actions – Boltzmann policies
Derivative-free
» Let:
be a sample of any variables in a Bayesian prior in .
be a sampled sequence (where we follow the learning policy)
be a sampled sequence (where we follow the implementation policy)
» Now replace each expectation with a sampled estimate.
Belief mode:
Structural uncertainty:
𝑥𝑝𝑟𝑜𝑥
Quadratic
Estimate of optimal approximation
solution
Knowledge gradient with local quadratic belief
Assumes polynomial approx. is
globally accurate.
True optimum
Estimated optimum