ORF 544 Week 5 Derivative Free Stochastic Optimization VFA and DLA
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
Designing policies
Policy search:
1a) Policy function approximations (PFAs): $x_t = X^{PFA}(S_t|\theta)$
• Lookup tables
– “when in this state, take this action”
• Parametric functions
– Order-up-to policies: if inventory is less than s, order up to S.
– Affine policies: $X^{PFA}(S_t|\theta) = \sum_{f \in F} \theta_f \phi_f(S_t)$
– Neural networks
• Locally/semi/non parametric
– Requires optimizing over local regions
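As an illustration of policy search over a PFA, here is a minimal sketch of the order-up-to policy with tunable parameters $\theta = (s, S)$; the cost coefficients, initial inventory, and demand distribution are hypothetical, chosen only to make the sketch runnable:

```python
import random

def order_up_to_policy(inventory, s=5, S=20):
    """PFA sketch: if inventory is below s, order up to S; otherwise order
    nothing. The parameters theta = (s, S) are what policy search tunes."""
    return S - inventory if inventory < s else 0

def simulate(theta, T=50, seed=0):
    """Evaluate the policy on one sample path for a fixed theta = (s, S)."""
    s, S = theta
    rng = random.Random(seed)
    inventory, total_cost = 10, 0.0
    for _ in range(T):
        x = order_up_to_policy(inventory, s, S)
        demand = rng.randint(0, 8)                 # hypothetical demand
        inventory = max(0, inventory + x - demand)
        total_cost += 2.0 * x + 1.0 * inventory    # order cost + holding cost
    return total_cost
```

Policy search would then optimize $\theta$ by comparing `simulate(theta)` across many sample paths.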
Designing policies
Lookahead approximations – Approximate the impact of a decision now on the future:
» An optimal policy (based on looking ahead):
$$X_t^*(S_t) = \arg\max_{x_t}\left( C(S_t,x_t) + \mathbb{E}\left\{ \max_{\pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C(S_{t'}, X_{t'}^{\pi}(S_{t'})) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\} \right)$$
» The same policy written on a lookahead model over horizon $H$, using tilde variables for the approximation:
$$X_t^*(S_t) \approx \arg\max_{x_t}\left( C(S_t,x_t) + \mathbb{E}\left\{ \max_{\tilde\pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{t+H} C(\tilde S_{tt'}, \tilde X_{tt'}^{\tilde\pi}(\tilde S_{tt'})) \,\Big|\, \tilde S_{t,t+1} \right\} \,\Big|\, S_t, x_t \right\} \right)$$
» Chance-constrained programming: $P[A_t x_t \le f(W)] \ge 1 - \alpha$
» Stochastic lookahead (stochastic programming / Monte Carlo tree search):
$$X_t^{LA-S}(S_t) = \arg\max_{x_t}\left( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{\tilde\omega \in \tilde\Omega_t} p(\tilde\omega) \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(\tilde\omega), \tilde x_{tt'}(\tilde\omega)) \right)$$
» "Robust optimization":
$$X_t^{LA-RO}(S_t) = \arg\max_{\tilde x_{tt},\ldots,\tilde x_{t,t+H}}\ \min_{w \in \tilde W_t(\theta)}\left( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{t+H} C(\tilde S_{tt'}(w), \tilde x_{tt'}(w)) \right)$$
Lookahead policies: value function approximations
where
» $S^n$ = state of knowledge
» $x^n = X^\pi(S^n)$ = the decision made by policy $X^\pi$
» $W^{n+1}$ = "winnings"
» Online learning
» Offline learning
© 2019 Warren Powell
Gittins indices
[Slides: illustrations of Gittins indices, with beliefs represented by Beta distributions.]
General theory of Gittins indices
» $G(1/k, \gamma)$ is decreasing in $k$ (the exploration premium shrinks as the number of observations $k$ grows)
Proposed experiment $x$: parameter estimates are updated after running the experiment.
» Review how means are updated given the estimates in $S^n$ (which contains $\bar\mu^n$) and the observation $W^{n+1}$ for a truth $\mu$ and noise $\varepsilon$
$$\nu_x^{KG,n} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{L}\sum_{l=1}^{L}\left[ \max_{x'} \bar\mu_{x'}^{n+1}\big(x \mid S^n, W_x = \mu_x^k + \varepsilon^l\big) - \max_{x'} \bar\mu_{x'}^{n} \right]$$
Knowledge gradient
Value of information
[Figure: updated belief distributions for alternatives 4 and 5; the shaded area beyond $z$ is the probability of an improvement over the current best.]
$$f(z) = \mathbb{E}\left[\max(z, Z)\right] = z\Phi(z) + \phi(z)$$
$\Phi(z)$ = cumulative standard normal distribution
$\phi(z)$ = standard normal density
» We update the mean and precision using
$$\bar\mu_x^{n+1} = \frac{\beta_x^n \bar\mu_x^n + \beta_W W^{n+1}}{\beta_x^n + \beta_W}, \qquad \beta_x^{n+1} = \beta_x^n + \beta_W$$
» In terms of the variance, where $\beta_x^n = 1/\sigma_x^{2,n}$ and $\beta_W = 1/\sigma_W^2$, this is the same as
$$\sigma_x^{2,n+1} = \frac{\sigma_x^{2,n}}{1 + \sigma_x^{2,n}/\sigma_W^2}$$
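The precision update is a one-liner in code. A minimal sketch, assuming a known measurement precision $\beta_W$:

```python
def update_belief(mu_n, beta_n, W, beta_W=1.0):
    """One Bayesian update of a normally distributed belief.
    beta = 1/sigma^2 is the precision; beta_W is the measurement precision."""
    beta_next = beta_n + beta_W
    mu_next = (beta_n * mu_n + beta_W * W) / beta_next
    return mu_next, beta_next

# Example: a weak prior (sigma^2 = 25, so beta = 0.04) plus one observation.
mu1, beta1 = update_belief(mu_n=3.0, beta_n=0.04, W=4.0)
# beta1 = 0.04 + 1.0 = 1.04; equivalently sigma^2 goes 25 -> 25/(1 + 25) = 1/1.04
```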
The knowledge gradient
Derivation
» The change in variance can be found to be
$$\tilde\sigma_x^{2,n} = \mathrm{Var}\left[\bar\mu_x^{n+1} - \bar\mu_x^n \mid S^n\right] = \sigma_x^{2,n} - \sigma_x^{2,n+1} = \frac{\sigma_x^{2,n}}{1 + \sigma_W^2/\sigma_x^{2,n}}$$
» The knowledge gradient is then
$$\nu_x^{KG} = \tilde\sigma_x^n f(\zeta_x^n), \qquad \zeta_x^n = -\left|\frac{\bar\mu_x^n - \max_{x' \ne x} \bar\mu_{x'}^n}{\tilde\sigma_x^n}\right|$$
where $\zeta_x^n$ is the normalized distance to the best (or second best) alternative.
Decision & $\mu^n$ & $\sigma^n$ & $\beta^n$ & $\beta^{n+1}$ & $\tilde\sigma$ & $\max_{x' \ne x} \mu_{x'}$ & $\zeta$ & $f(\zeta)$ & $\nu^{KG}_x$
1 & 3.0 & 5.00 & 0.0400 & 1.0400 & 4.9029 & 5 & -0.4079 & 0.2277 & 1.1165
2 & 4.0 & 8.00 & 0.0156 & 1.0156 & 7.9382 & 5 & -0.1260 & 0.3391 & 2.6920
3 & 5.0 & 8.00 & 0.0156 & 1.0156 & 7.9382 & 4.5 & -0.0630 & 0.3682 & 2.9232
4 & 4.5 & 9.00 & 0.0123 & 1.0123 & 8.9450 & 5 & -0.0559 & 0.3716 & 3.3241
5 & 3.5 & 10.00 & 0.0100 & 1.0100 & 9.9504 & 5 & -0.1507 & 0.3281 & 3.2646
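The table above can be reproduced numerically. A minimal sketch, assuming $\sigma_W = 1$ (consistent with the $\beta^{n+1} = \beta^n + 1$ column):

```python
import math

def knowledge_gradient(mu, sigma, sigma_W=1.0):
    """KG values for a set of alternatives.
    mu, sigma: current belief means and standard deviations;
    sigma_W: standard deviation of one measurement."""
    def Phi(z):  # cumulative standard normal distribution
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    def phi(z):  # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    nu = []
    for x in range(len(mu)):
        var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_W ** 2)
        sigma_tilde = math.sqrt(sigma[x] ** 2 - var_next)  # change in variance
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde      # normalized distance
        nu.append(sigma_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return nu

mu = [3.0, 4.0, 5.0, 4.5, 3.5]
sigma = [5.0, 8.0, 8.0, 9.0, 10.0]
print([round(v, 4) for v in knowledge_gradient(mu, sigma)])
# → [1.1165, 2.692, 2.9232, 3.3241, 3.2646]
```

Note that decision 4 has the highest KG value even though decision 3 has the highest estimated mean: the KG trades off the mean against the remaining uncertainty.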
Properties of the knowledge gradient
Decision trees
Looking further into the future
Decision trees
From decision trees to dynamic programs.
[Figure: a decision tree over belief states $(a_0, b_0)$ with decisions Continue/Stop, random outcomes $W$, and branch probabilities.]
Decision trees
Decision   Outcome   Decision   Outcome
[Figure: the expanded decision tree; each (decision, outcome) pair branches the tree further.]
Decision trees
From March 2, 2019:
The next series of slides illustrates the exponential expansion when we use decision trees (the path from the origin to each node is unique).
[Figure: alternating stages – Decision, Outcome, Decision, Outcome, Decision.]
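The exponential expansion is easy to quantify: with $D$ decisions and $O$ outcomes per stage, the number of unique paths after $H$ (decision, outcome) stages is $(D \cdot O)^H$. A small illustration (the counts 2 and 4 are hypothetical):

```python
def tree_nodes(decisions, outcomes, horizon):
    """Count leaf nodes of a decision tree after `horizon` rounds of
    (decision, outcome) pairs. Because the path from the origin to each
    node is unique, every round multiplies the number of paths by D * O."""
    return (decisions * outcomes) ** horizon

# With 2 decisions (continue/stop) and 4 outcomes per round:
print([tree_nodes(2, 4, h) for h in range(1, 5)])
# → [8, 64, 512, 4096]
```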
Offline vs. online objectives
» Asymptotic (offline) formulation:
$$\max_x \mathbb{E}\{F(x, W) \mid S^0\}$$
» Finite-horizon (final-reward) formulation:
$$\max_\pi \mathbb{E}_W\{F(x^{\pi,N}, W) \mid S^0\}$$
» Finite-horizon (cumulative-reward, online) formulation:
$$\max_\pi \mathbb{E}_{S^0}\,\mathbb{E}_{W^1,W^2,\ldots,W^N \mid S^0}\left\{ \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1}) \,\Big|\, S^0 \right\}$$
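The two finite-horizon objectives lead to different ways of evaluating a learning policy in simulation. A minimal sketch (the `policy` interface and the frequentist running-average beliefs are illustrative assumptions, not from the slides):

```python
import random

def evaluate_policy(policy, mu_true, N=100, seed=0, online=True):
    """Evaluate a learning policy under the online objective (cumulative
    expected reward while learning) or the offline objective (value of the
    final design only). policy(mu_bar, counts, n) returns an alternative."""
    rng = random.Random(seed)
    K = len(mu_true)
    mu_bar, counts = [0.0] * K, [0] * K
    cumulative = 0.0
    for n in range(N):
        x = policy(mu_bar, counts, n)
        W = mu_true[x] + rng.gauss(0.0, 1.0)       # noisy observation
        counts[x] += 1
        mu_bar[x] += (W - mu_bar[x]) / counts[x]   # running average
        cumulative += mu_true[x]                   # online: sum of rewards
    x_final = max(range(K), key=lambda i: mu_bar[i])
    return cumulative if online else mu_true[x_final]  # offline: final design
```

The same policy class can then be tuned twice, once under each objective, which is the point made below about tuning PFA policies for offline vs. online use.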
Tunable versions:
Knowledge gradient for online learning
Knowledge gradient policy
» For finite-horizon online problems:
… interval estimation: $\nu_x^{IE,n} = \bar\mu_x^n + z_\alpha \bar\sigma_x^n$ ???
… and UCB1: $\nu_x^{UCB1,n} = \bar\mu_x^n + 4\sigma_W \sqrt{\frac{\log n}{N_x^n}}$ ???
Knowledge gradient for online learning
[Figure: performance vs. measurement budget – online KG slightly underperforms Gittins over part of the budget range and slightly outperforms Gittins over the rest.]
Knowledge gradient for online learning
Contrast:
» Offline vs. online KG
• Offline:
• Online:
» Tuning PFA policies for offline vs. online
• Policies are the same – just tuned differently
• E.g. Interval estimation:
Discuss
» Convergence
» Applying to large problems
» Upper confidence bounding:
$$X^{UCB}(S^n|\theta^{UCB}) = \arg\max_x \left( \bar\mu_x^n + \theta^{UCB} \sqrt{\frac{\log n}{N_x^n}} \right)$$
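A minimal sketch of this tunable UCB policy (the try-each-untested-alternative-first rule is a standard convention added here, not taken from the slide):

```python
import math

def ucb_policy(mu_bar, counts, n, theta_ucb=4.0):
    """UCB policy with a tunable coefficient theta_ucb (UCB1 fixes it at
    4 * sigma_W). mu_bar: estimated means; counts: N_x^n; n: iteration >= 1."""
    for x, c in enumerate(counts):
        if c == 0:                     # measure untried alternatives first
            return x
    return max(range(len(mu_bar)),
               key=lambda x: mu_bar[x] + theta_ucb * math.sqrt(math.log(n) / counts[x]))
```

The key point of the slide is that $\theta^{UCB}$ is just another tunable parameter, so the UCB "policy" is really a parameterized family that can be tuned offline or online like any PFA.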
[Figure: a family of candidate materials – Fe/Ni catalyst compounds at nanometer scales.]
The learning challenge
Finding the best material
The learning challenge
Correlations
» A simple belief model assumes independence across alternatives
» Similar catalysts tend to have similar material properties
» Scientists can use domain knowledge to estimate correlations between experiments on similar catalysts.
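With correlated beliefs, measuring one catalyst updates the estimates for every correlated alternative. A sketch of the standard correlated normal (rank-one) Bayesian update, which is the usual conjugate formula assumed here rather than taken from the slide:

```python
def correlated_update(mu, Sigma, x, W, sigma2_W=1.0):
    """Bayesian update with correlated beliefs: observing alternative x
    with value W also updates every alternative correlated with x.
    mu: list of means; Sigma: covariance matrix as a list of lists."""
    K = len(mu)
    denom = Sigma[x][x] + sigma2_W
    gain = (W - mu[x]) / denom
    mu_new = [mu[i] + gain * Sigma[i][x] for i in range(K)]
    Sigma_new = [[Sigma[i][j] - Sigma[i][x] * Sigma[x][j] / denom
                  for j in range(K)] for i in range(K)]
    return mu_new, Sigma_new

# Measuring x=0 also raises the estimate of the correlated alternative 1:
mu1, Sigma1 = correlated_update([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], x=0, W=2.0)
```

This is what lets a single expensive experiment "teach us about other materials," as the next slide puts it.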
The learning challenge
Testing one material teaches us about other
materials
The learning challenge
Nonlinear, parametric belief model:
» For example, the following model describes the length
of nanotubes in low temperatures:
[Figure: a sampled grid of candidate parameter vectors $\vartheta_1, \ldots, \vartheta_{25}$ representing the belief about the unknown model parameters.]
Building a belief model
Temperature dependence of the chemical reaction rate $k$:
$$\ln(k) = \ln(A) - \frac{E_a}{k_B}\frac{1}{T}$$
With $\theta_1 = \ln(A)$, $\theta_2 = E_a/k_B$, and $x = 1/T$, this is linear in $x$:
$$\ln(k) = \theta_1 - \theta_2 x$$
We might have beliefs about different slopes…
Building a belief model
Temperature dependence of the chemical reaction rate $k$ (same linearized model):
$$\ln(k) = \theta_1 - \theta_2 x, \qquad x = 1/T$$
… as well as different intercepts.
Building a belief model
Temperature dependence of the chemical reaction rate $k$, now with a multivariate normal prior on the coefficients:
$$\ln(k) = \theta_1 - \theta_2 x, \qquad \theta \sim N(\theta^0, \Sigma^0), \qquad \Sigma^0 = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$
capturing beliefs about both slopes and intercepts.
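The linearized model is easy to evaluate for sampled beliefs. A small sketch; the numeric values of $(\theta_1, \theta_2)$ and the temperature are hypothetical:

```python
def arrhenius_features(T):
    """Linearized Arrhenius model: ln k = theta_1 - theta_2 * (1/T),
    with theta_1 = ln A and theta_2 = Ea / kB. Returns the features (1, 1/T)."""
    return (1.0, 1.0 / T)

def log_rate(theta, T):
    """Predicted ln(k) at temperature T for one belief sample theta."""
    theta1, theta2 = theta
    phi0, phi1 = arrhenius_features(T)
    return theta1 * phi0 - theta2 * phi1

# Two belief samples with different slopes/intercepts give different
# predictions, which is exactly what the prior on (theta_1, theta_2) captures.
print(round(log_rate((10.0, 2000.0), 400.0), 2))
# → 5.0
```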
Possible relationships
[Figure: candidate response curves relating density to photo-induced current.]
Building a belief model
Classifying
temperature/concentration
regions
» Different regions produce
different materials
» Each combination of temperature
and concentration is a very
expensive experiment.
» How do we minimize the number
of experiments to come up with a
good classification?
[Figure: phase regions (DM, Hex, US, NB, La) over temperature (10–40 °C) and concentration (20–50).]
Building a belief model
One possible clustering:
[Figure: one clustering of the temperature/concentration regions. Courtesy C. Lam, B. Olsen]
Building a belief model
Other clusterings:
[Figures: alternative clusterings of the same temperature/concentration data. Courtesy C. Lam, B. Olsen]
Tutorial article in preparation: