ORF 544 Week 5 Derivative Free Stochastic Optimization VFA and DLA
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
Designing policies
Policy search:
1a) Policy function approximations (PFAs): $x_t = X^{PFA}(S_t|\theta)$
• Lookup tables
– “when in this state, take this action”
• Parametric functions
– Order-up-to policies: if inventory is less than s, order up to S.
– Affine policies: $X^{PFA}(S_t|\theta) = \sum_{f \in F} \theta_f \phi_f(S_t)$
– Neural networks
• Locally/semi/non parametric
– Requires optimizing over local regions
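As an illustration of policy search over a PFA, here is a minimal sketch of the order-up-to policy with tunable parameters $\theta = (s, S)$; the cost coefficients, initial inventory, and demand distribution are hypothetical, chosen only to make the sketch runnable:

```python
import random

def order_up_to_policy(inventory, s=5, S=20):
    """PFA sketch: if inventory is below s, order up to S; otherwise order
    nothing. The parameters theta = (s, S) are what policy search tunes."""
    return S - inventory if inventory < s else 0

def simulate(theta, T=50, seed=0):
    """Evaluate the policy on one sample path for a fixed theta = (s, S)."""
    s, S = theta
    rng = random.Random(seed)
    inventory, total_cost = 10, 0.0
    for _ in range(T):
        x = order_up_to_policy(inventory, s, S)
        demand = rng.randint(0, 8)                 # hypothetical demand
        inventory = max(0, inventory + x - demand)
        total_cost += 2.0 * x + 1.0 * inventory    # order cost + holding cost
    return total_cost
```

Policy search would then optimize $\theta$ by comparing `simulate(theta)` across many sample paths.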
Designing policies
Lookahead approximations – Approximate the impact of a decision now on the future:
» An optimal policy (based on looking ahead):
$$X_t^*(S_t) = \arg\max_{x_t}\left( C(S_t,x_t) + \mathbb{E}\left\{ \max_{\pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C(S_{t'}, X_{t'}^{\pi}(S_{t'})) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\} \right)$$
» The same policy written on a lookahead model over horizon $H$, using tilde variables for the approximation:
$$X_t^*(S_t) \approx \arg\max_{x_t}\left( C(S_t,x_t) + \mathbb{E}\left\{ \max_{\tilde\pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{t+H} C(\tilde S_{tt'}, \tilde X_{tt'}^{\tilde\pi}(\tilde S_{tt'})) \,\Big|\, \tilde S_{t,t+1} \right\} \,\Big|\, S_t, x_t \right\} \right)$$
» Chance-constrained programming: $P[A_t x_t \le f(W)] \ge 1 - \alpha$
» Stochastic lookahead (stochastic programming / Monte Carlo tree search):
$$X_t^{LA-S}(S_t) = \arg\max_{x_t}\left( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{\tilde\omega \in \tilde\Omega_t} p(\tilde\omega) \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(\tilde\omega), \tilde x_{tt'}(\tilde\omega)) \right)$$
» "Robust optimization":
$$X_t^{LA-RO}(S_t) = \arg\max_{\tilde x_{tt},\ldots,\tilde x_{t,t+H}}\ \min_{w \in \tilde W_t(\theta)}\left( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{t+H} C(\tilde S_{tt'}(w), \tilde x_{tt'}(w)) \right)$$
Lookahead policies: value function approximations
where
» $S^n$ = state of knowledge
» $x^n = X^\pi(S^n)$ = the decision made by policy $X^\pi$
» $W^{n+1}$ = "winnings"
» Online learning
» Offline learning
© 2019 Warren Powell
Gittins indices
[Slides: illustrations of Gittins indices, with beliefs represented by Beta distributions.]
General theory of Gittins indices
» $G(1/k, \gamma)$ is decreasing in $k$ (the exploration premium shrinks as the number of observations $k$ grows)
Proposed experiment $x$: parameter estimates are updated after running the experiment.
» Review how means are updated given the estimates in $S^n$ (which contains $\bar\mu^n$) and the observation $W^{n+1}$ for a truth $\mu$ and noise $\varepsilon$
$$\nu_x^{KG,n} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{L}\sum_{l=1}^{L}\left[ \max_{x'} \bar\mu_{x'}^{n+1}\big(x \mid S^n, W_x = \mu_x^k + \varepsilon^l\big) - \max_{x'} \bar\mu_{x'}^{n} \right]$$
Knowledge gradient
Value of information
[Figure: updated belief distributions for alternatives 4 and 5; the shaded area beyond $z$ is the probability of an improvement over the current best.]
$$f(z) = \mathbb{E}\left[\max(z, Z)\right] = z\Phi(z) + \phi(z)$$
$\Phi(z)$ = cumulative standard normal distribution
$\phi(z)$ = standard normal density
» We update the mean and precision using
$$\bar\mu_x^{n+1} = \frac{\beta_x^n \bar\mu_x^n + \beta_W W^{n+1}}{\beta_x^n + \beta_W}, \qquad \beta_x^{n+1} = \beta_x^n + \beta_W$$
» In terms of the variance, where $\beta_x^n = 1/\sigma_x^{2,n}$ and $\beta_W = 1/\sigma_W^2$, this is the same as
$$\sigma_x^{2,n+1} = \frac{\sigma_x^{2,n}}{1 + \sigma_x^{2,n}/\sigma_W^2}$$
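The precision update is a one-liner in code. A minimal sketch, assuming a known measurement precision $\beta_W$:

```python
def update_belief(mu_n, beta_n, W, beta_W=1.0):
    """One Bayesian update of a normally distributed belief.
    beta = 1/sigma^2 is the precision; beta_W is the measurement precision."""
    beta_next = beta_n + beta_W
    mu_next = (beta_n * mu_n + beta_W * W) / beta_next
    return mu_next, beta_next

# Example: a weak prior (sigma^2 = 25, so beta = 0.04) plus one observation.
mu1, beta1 = update_belief(mu_n=3.0, beta_n=0.04, W=4.0)
# beta1 = 0.04 + 1.0 = 1.04; equivalently sigma^2 goes 25 -> 25/(1 + 25) = 1/1.04
```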
The knowledge gradient
Derivation
» The change in variance can be found to be
$$\tilde\sigma_x^{2,n} = \mathrm{Var}\left[\bar\mu_x^{n+1} - \bar\mu_x^n \mid S^n\right] = \sigma_x^{2,n} - \sigma_x^{2,n+1} = \frac{\sigma_x^{2,n}}{1 + \sigma_W^2/\sigma_x^{2,n}}$$
» The knowledge gradient is then
$$\nu_x^{KG} = \tilde\sigma_x^n f(\zeta_x^n), \qquad \zeta_x^n = -\left|\frac{\bar\mu_x^n - \max_{x' \ne x} \bar\mu_{x'}^n}{\tilde\sigma_x^n}\right|$$
where $\zeta_x^n$ is the normalized distance to the best (or second best) alternative.
Decision & $\mu^n$ & $\sigma^n$ & $\beta^n$ & $\beta^{n+1}$ & $\tilde\sigma$ & $\max_{x' \ne x} \mu_{x'}$ & $\zeta$ & $f(\zeta)$ & $\nu^{KG}_x$
1 & 3.0 & 5.00 & 0.0400 & 1.0400 & 4.9029 & 5 & -0.4079 & 0.2277 & 1.1165
2 & 4.0 & 8.00 & 0.0156 & 1.0156 & 7.9382 & 5 & -0.1260 & 0.3391 & 2.6920
3 & 5.0 & 8.00 & 0.0156 & 1.0156 & 7.9382 & 4.5 & -0.0630 & 0.3682 & 2.9232
4 & 4.5 & 9.00 & 0.0123 & 1.0123 & 8.9450 & 5 & -0.0559 & 0.3716 & 3.3241
5 & 3.5 & 10.00 & 0.0100 & 1.0100 & 9.9504 & 5 & -0.1507 & 0.3281 & 3.2646
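The table above can be reproduced numerically. A minimal sketch, assuming $\sigma_W = 1$ (consistent with the $\beta^{n+1} = \beta^n + 1$ column):

```python
import math

def knowledge_gradient(mu, sigma, sigma_W=1.0):
    """KG values for a set of alternatives.
    mu, sigma: current belief means and standard deviations;
    sigma_W: standard deviation of one measurement."""
    def Phi(z):  # cumulative standard normal distribution
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    def phi(z):  # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    nu = []
    for x in range(len(mu)):
        var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_W ** 2)
        sigma_tilde = math.sqrt(sigma[x] ** 2 - var_next)  # change in variance
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde      # normalized distance
        nu.append(sigma_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return nu

mu = [3.0, 4.0, 5.0, 4.5, 3.5]
sigma = [5.0, 8.0, 8.0, 9.0, 10.0]
print([round(v, 4) for v in knowledge_gradient(mu, sigma)])
# → [1.1165, 2.692, 2.9232, 3.3241, 3.2646]
```

Note that decision 4 has the highest KG value even though decision 3 has the highest estimated mean: the KG trades off the mean against the remaining uncertainty.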
Properties of the knowledge gradient
Decision trees
Looking further into the future
Decision trees
From decision trees to dynamic programs.
[Figure: a decision tree over belief states $(a_0, b_0)$ with decisions Continue/Stop, random outcomes $W$, and branch probabilities.]
Decision trees
Decision   Outcome   Decision   Outcome
[Figure: the expanded decision tree; each (decision, outcome) pair branches the tree further.]
Decision trees
From March 2, 2019:
The next series of slides illustrates the exponential expansion when we use decision trees (the path from the origin to each node is unique).
[Figure: alternating stages – Decision, Outcome, Decision, Outcome, Decision.]
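The exponential expansion is easy to quantify: with $D$ decisions and $O$ outcomes per stage, the number of unique paths after $H$ (decision, outcome) stages is $(D \cdot O)^H$. A small illustration (the counts 2 and 4 are hypothetical):

```python
def tree_nodes(decisions, outcomes, horizon):
    """Count leaf nodes of a decision tree after `horizon` rounds of
    (decision, outcome) pairs. Because the path from the origin to each
    node is unique, every round multiplies the number of paths by D * O."""
    return (decisions * outcomes) ** horizon

# With 2 decisions (continue/stop) and 4 outcomes per round:
print([tree_nodes(2, 4, h) for h in range(1, 5)])
# → [8, 64, 512, 4096]
```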
Offline vs. online objectives
» Asymptotic (offline) formulation:
$$\max_x \mathbb{E}\{F(x, W) \mid S^0\}$$
» Finite-horizon (final-reward) formulation:
$$\max_\pi \mathbb{E}_W\{F(x^{\pi,N}, W) \mid S^0\}$$
» Finite-horizon (cumulative-reward, online) formulation:
$$\max_\pi \mathbb{E}_{S^0}\,\mathbb{E}_{W^1,W^2,\ldots,W^N \mid S^0}\left\{ \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1}) \,\Big|\, S^0 \right\}$$
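The two finite-horizon objectives lead to different ways of evaluating a learning policy in simulation. A minimal sketch (the `policy` interface and the frequentist running-average beliefs are illustrative assumptions, not from the slides):

```python
import random

def evaluate_policy(policy, mu_true, N=100, seed=0, online=True):
    """Evaluate a learning policy under the online objective (cumulative
    expected reward while learning) or the offline objective (value of the
    final design only). policy(mu_bar, counts, n) returns an alternative."""
    rng = random.Random(seed)
    K = len(mu_true)
    mu_bar, counts = [0.0] * K, [0] * K
    cumulative = 0.0
    for n in range(N):
        x = policy(mu_bar, counts, n)
        W = mu_true[x] + rng.gauss(0.0, 1.0)       # noisy observation
        counts[x] += 1
        mu_bar[x] += (W - mu_bar[x]) / counts[x]   # running average
        cumulative += mu_true[x]                   # online: sum of rewards
    x_final = max(range(K), key=lambda i: mu_bar[i])
    return cumulative if online else mu_true[x_final]  # offline: final design
```

The same policy class can then be tuned twice, once under each objective, which is the point made below about tuning PFA policies for offline vs. online use.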
Tunable versions:
Knowledge gradient for online learning
Knowledge gradient policy
» For finite-horizon online problems:
… interval estimation: $\nu_x^{IE,n} = \bar\mu_x^n + z_\alpha \bar\sigma_x^n$ ???
… and UCB1: $\nu_x^{UCB1,n} = \bar\mu_x^n + 4\sigma_W \sqrt{\frac{\log n}{N_x^n}}$ ???
Knowledge gradient for online learning
[Figure: performance vs. measurement budget – online KG slightly underperforms Gittins over part of the budget range and slightly outperforms Gittins over the rest.]
Knowledge gradient for online learning
Contrast:
» Offline vs. online KG
• Offline:
• Online:
» Tuning PFA policies for offline vs. online
• Policies are the same – just tuned differently
• E.g. Interval estimation:
Discuss
» Convergence
» Applying to large problems
» Upper confidence bounding:
$$X^{UCB}(S^n|\theta^{UCB}) = \arg\max_x \left( \bar\mu_x^n + \theta^{UCB} \sqrt{\frac{\log n}{N_x^n}} \right)$$
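A minimal sketch of this tunable UCB policy (the try-each-untested-alternative-first rule is a standard convention added here, not taken from the slide):

```python
import math

def ucb_policy(mu_bar, counts, n, theta_ucb=4.0):
    """UCB policy with a tunable coefficient theta_ucb (UCB1 fixes it at
    4 * sigma_W). mu_bar: estimated means; counts: N_x^n; n: iteration >= 1."""
    for x, c in enumerate(counts):
        if c == 0:                     # measure untried alternatives first
            return x
    return max(range(len(mu_bar)),
               key=lambda x: mu_bar[x] + theta_ucb * math.sqrt(math.log(n) / counts[x]))
```

The key point of the slide is that $\theta^{UCB}$ is just another tunable parameter, so the UCB "policy" is really a parameterized family that can be tuned offline or online like any PFA.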
[Figure: a family of candidate materials – Fe/Ni catalyst compounds at nanometer scales.]
The learning challenge
Finding the best material
The learning challenge
Correlations
» A simple belief model assumes independence across alternatives
» Similar catalysts tend to have similar material properties
» Scientists can use domain knowledge to estimate correlations between experiments on similar catalysts.
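With correlated beliefs, measuring one catalyst updates the estimates for every correlated alternative. A sketch of the standard correlated normal (rank-one) Bayesian update, which is the usual conjugate formula assumed here rather than taken from the slide:

```python
def correlated_update(mu, Sigma, x, W, sigma2_W=1.0):
    """Bayesian update with correlated beliefs: observing alternative x
    with value W also updates every alternative correlated with x.
    mu: list of means; Sigma: covariance matrix as a list of lists."""
    K = len(mu)
    denom = Sigma[x][x] + sigma2_W
    gain = (W - mu[x]) / denom
    mu_new = [mu[i] + gain * Sigma[i][x] for i in range(K)]
    Sigma_new = [[Sigma[i][j] - Sigma[i][x] * Sigma[x][j] / denom
                  for j in range(K)] for i in range(K)]
    return mu_new, Sigma_new

# Measuring x=0 also raises the estimate of the correlated alternative 1:
mu1, Sigma1 = correlated_update([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], x=0, W=2.0)
```

This is what lets a single expensive experiment "teach us about other materials," as the next slide puts it.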
The learning challenge
Testing one material teaches us about other
materials
The learning challenge
Nonlinear, parametric belief model:
» For example, the following model describes the length
of nanotubes in low temperatures:
[Figure: a sampled grid of candidate parameter vectors $\vartheta_1, \ldots, \vartheta_{25}$ representing the belief about the unknown model parameters.]
Building a belief model
Temperature dependence of the chemical reaction rate $k$:
$$\ln(k) = \ln(A) - \frac{E_a}{k_B}\frac{1}{T}$$
With $\theta_1 = \ln(A)$, $\theta_2 = E_a/k_B$, and $x = 1/T$, this is linear in $x$:
$$\ln(k) = \theta_1 - \theta_2 x$$
We might have beliefs about different slopes…
Building a belief model
Temperature dependence of the chemical reaction rate $k$ (same linearized model):
$$\ln(k) = \theta_1 - \theta_2 x, \qquad x = 1/T$$
… as well as different intercepts.
Building a belief model
Temperature dependence of the chemical reaction rate $k$, now with a multivariate normal prior on the coefficients:
$$\ln(k) = \theta_1 - \theta_2 x, \qquad \theta \sim N(\theta^0, \Sigma^0), \qquad \Sigma^0 = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$
capturing beliefs about both slopes and intercepts.
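The linearized model is easy to evaluate for sampled beliefs. A small sketch; the numeric values of $(\theta_1, \theta_2)$ and the temperature are hypothetical:

```python
def arrhenius_features(T):
    """Linearized Arrhenius model: ln k = theta_1 - theta_2 * (1/T),
    with theta_1 = ln A and theta_2 = Ea / kB. Returns the features (1, 1/T)."""
    return (1.0, 1.0 / T)

def log_rate(theta, T):
    """Predicted ln(k) at temperature T for one belief sample theta."""
    theta1, theta2 = theta
    phi0, phi1 = arrhenius_features(T)
    return theta1 * phi0 - theta2 * phi1

# Two belief samples with different slopes/intercepts give different
# predictions, which is exactly what the prior on (theta_1, theta_2) captures.
print(round(log_rate((10.0, 2000.0), 400.0), 2))
# → 5.0
```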
Possible relationships
[Figure: candidate response curves relating density to photo-induced current.]
Building a belief model
Classifying
temperature/concentration
regions
» Different regions produce
different materials
» Each combination of temperature
and concentration is a very
expensive experiment.
» How do we minimize the number
of experiments to come up with a
good classification?
[Figure: phase regions (DM, Hex, US, NB, La) over temperature (10–40 °C) and concentration (20–50).]
Building a belief model
One possible clustering:
[Figure: one clustering of the temperature/concentration regions. Courtesy C. Lam, B. Olsen]
Building a belief model
Other clusterings:
[Figures: alternative clusterings of the same temperature/concentration data. Courtesy C. Lam, B. Olsen]
Tutorial article in preparation: