ORF 544 Week 1 Introduction and Overview
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
Introduction
Drug discovery
Designing molecules
Drug discovery
We express our belief using a linear, additive QSAR model:
$$Y = \theta_0 + \sum_{\text{sites } i} \; \sum_{\text{substituents } j} \theta_{ij} X_{ij}^m$$
where $X_{ij}^m$ is an indicator variable equal to 1 if molecule $m$ has substituent $j$ at site $i$.
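A minimal sketch of fitting this additive model by least squares in Python; the number of features, the coefficient values, and the noise level are all synthetic, purely for illustration:

```python
import numpy as np

# Hypothetical setup: 3 sites x 2 candidate substituents -> 6 indicator features.
# Each row of X encodes one molecule: X[m, k] = 1 if molecule m has the
# substituent/site combination k. Y[m] is its measured activity.
rng = np.random.default_rng(0)
n_molecules, n_features = 20, 6
X = rng.integers(0, 2, size=(n_molecules, n_features)).astype(float)

true_theta = np.array([1.0, -0.5, 2.0, 0.3, -1.2, 0.8])   # made-up theta_ij
Y = 4.0 + X @ true_theta + rng.normal(0.0, 0.1, n_molecules)  # theta_0 = 4.0

# Fit theta_0 and theta_ij jointly by least squares on [1, X].
A = np.hstack([np.ones((n_molecules, 1)), X])
theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

In the actual drug-discovery application the estimates would be updated sequentially as each new compound is tested, which is what motivates the knowledge gradient comparison below.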
Drug discovery
Knowledge gradient versus pure exploration for 99 compounds
[Figure: performance relative to the best possible for the knowledge gradient policy versus pure exploration, over the 99 compounds]
Optimal learning
» Optimizing nanoparticles to
maximize photoconductivity
© 2019 Warren Powell
E-commerce
Revenue management
» Optimizing prices to maximize total revenue for a particular night in a hotel.
Ad-click optimization
» How much to bid for ads on the
internet.
Personalized offer
optimization
» Designing offers for individual
customers.
[Figure: electricity price path over roughly 70 time periods, with charge and discharge thresholds shown as the tunable parameters]
Policy function approximations
Our policy function might be the parametric model (this is nonlinear in the parameters $\theta$):
$$X^\pi(S_t \mid \theta) = \begin{cases} +1 \ \text{(charge)} & \text{if } p_t < \theta^{charge} \\ 0 & \text{if } \theta^{charge} \le p_t \le \theta^{discharge} \\ -1 \ \text{(discharge)} & \text{if } p_t > \theta^{discharge} \end{cases}$$
[Figure: energy in storage and the price of electricity over time, with charge and discharge periods marked]
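A minimal sketch of this threshold policy in Python. The sign convention (+1 = charge, -1 = discharge), the price process, and the threshold values are assumptions for illustration, and the sketch ignores the storage capacity constraint:

```python
import random

def battery_policy(price, theta_charge, theta_discharge):
    """Threshold PFA: charge when the price is low, discharge when it is high."""
    if price < theta_charge:
        return 1        # charge (buy energy)
    if price > theta_discharge:
        return -1       # discharge (sell energy)
    return 0            # hold

# Simulate the policy on a random price path and tally the cash flow.
random.seed(1)
prices = [random.uniform(20.0, 100.0) for _ in range(24)]
actions = [battery_policy(p, 35.0, 70.0) for p in prices]
cash = sum(-a * p for a, p in zip(actions, prices))  # pay to charge, earn to discharge
```

Tuning $(\theta^{charge}, \theta^{discharge})$ to maximize the simulated cash flow is exactly the policy-search problem this course studies.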
Energy storage
How much energy to store in a battery to handle the
volatility of wind and spot prices to meet demands?
Nomadic trucker illustration
We use initial value function approximations:
$\bar{V}^0(MN) = 0$, $\bar{V}^0(NY) = 0$, $\bar{V}^0(CO) = 0$, $\bar{V}^0(CA) = 0$
[Figure: the trucker starts in Texas with load offers of $350, $150, $450 and $300; state $S_t = (TX, \hat{D}_t)$]
Nomadic trucker illustration
… and make our first choice: $x$.
[Figure: the trucker takes the $450 load to NY; post-decision state $S_t^x = (NY)$, moving to time $t+1$]
Nomadic trucker illustration
Update the value of being in Texas:
$\bar{V}^1(TX) = 450$
[Figure: all other value estimates remain 0; post-decision state $S_t^x = (NY)$]
Nomadic trucker illustration
Now move to the next state, sample new demands and make a new decision.
[Figure: from NY the trucker sees new load offers of $180, $400, $600 and $125; state $S_{t+1} = (NY, \hat{D}_{t+1})$, with $\bar{V}^1(TX) = 450$]
Nomadic trucker illustration
Update the value of being in NY:
$\bar{V}(NY) = 600$
[Figure: the trucker chooses the load to CA; post-decision state $S_{t+1}^x = (CA)$, moving to time $t+2$]
Nomadic trucker illustration
Move to California.
[Figure: from CA the trucker sees new load offers of $400, $200, $150 and $350; state $S_{t+2} = (CA, \hat{D}_{t+2})$]
Nomadic trucker illustration
Make the decision to return to TX and update the value of being in CA:
$\bar{V}(CA) = 800$ (the $350 load to TX plus the downstream value $\bar{V}^1(TX) = 450$)
[Figure: state $S_{t+2} = (CA, \hat{D}_{t+2})$]
Nomadic trucker illustration
Updating the value function:
Old value: $\bar{V}^1(TX) = \$450$
New estimate: $\hat{v}^2(TX) = \$800$
Smoothed update with stepsize $\alpha = 0.1$:
$$\bar{V}^2(TX) = (1 - \alpha)\,\bar{V}^1(TX) + \alpha\,\hat{v}^2(TX) = 0.9(450) + 0.1(800) = 485$$
[Figure: back in TX at time $t+3$ with new load offers of $385, $125 and $275; state $S_{t+3} = (TX, \hat{D}_{t+3})$, with $\bar{V}(NY) = 600$ and $\bar{V}(CA) = 800$]
Real-time logistics
Uber
» Provides real-time, on-demand
transportation.
» Drivers are encouraged to enter or leave
the system using pricing signals and
informational guidance.
Decisions:
» How to price to get the right balance of
drivers relative to customers.
» Assigning and routing drivers to
manage Uber-created congestion.
» Real-time management of drivers.
» Pricing (trips, new services, …)
» Policies (rules for managing drivers,
customers, …)
The assignment of cars to riders evolves over time, with new riders
arriving, along with updates of cars available.
Autonomous EVs
Challenges
» A variety of challenges arise in managing risk:
» Spikes in natural gas and electricity prices.
» Pipeline outages due to storms.
An energy generation portfolio
[Figure: portfolio profiles by month, January through December and September through August]
Solar energy
Solar from a single solar farm
$977/MW !!!
Stochastic prices
[Figure: electricity price path over roughly 70 time periods; the policy buys when the price drops below the lower threshold and sells when it rises above the upper threshold]
Planning under uncertainty
Don’t gamble; take all your savings
and buy some good stock and hold
it till it goes up, then sell it. If it
don’t go up, don’t buy it.
Will Rogers
» Start with a deterministic optimization problem:
$$\min \sum_{t=0}^T c_t x_t$$
subject to
$$A_0 x_0 = b_0,$$
$$A_t x_t - B_{t-1} x_{t-1} = b_t, \quad t \ge 1.$$
» We can convert this to a proper stochastic model by replacing $x_t$ with a policy $X_t^\pi(S_t)$ and taking an expectation:
$$\min_\pi \mathbb{E} \sum_{t=0}^T c_t X_t^\pi(S_t)$$
» The policy has to satisfy $A_t X_t^\pi(S_t) = R_t$, with transition function:
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
Unknown outages
» Outage calls (known)
» Network outages (unknown)
[Figure: after a storm, known customers in outage place calls ("call"); the network outages (X) that caused them are unobserved and must be inferred]
Escalation Algorithm
[Figure sequence: successive frames of the escalation algorithm, which combines incoming customer calls ("call") with the network structure to progressively infer the unknown outage locations (X)]
Lookahead policies
Decision trees:
C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1-49, March 2012.
Monte Carlo tree search
AlphaGo
» Much more complex state
space.
» Uses hybrid of policies:
• MCTS
• PFA
• VFA
» Stochastic gradient:
$$x^{n+1} = x^n + \alpha_n \nabla_x F(x^n, W^{n+1})$$
» Asymptotic convergence:
$$\lim_{n \to \infty} \mathbb{E}\, F(x^n, W) = \mathbb{E}\, F(x^*, W)$$
» Finite time performance:
$$\max_\pi \mathbb{E}\, F(x^{\pi, n}, W)$$
where $\pi$ is an algorithm (or policy).
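A minimal sketch of the stochastic gradient update for the case $F(x, W) = (x - W)^2$, whose minimizer is $\mathbb{E}W$; the distribution of $W$ and the stepsize rule are assumptions. (The slide's update is written for maximization; the sign flips for minimization.)

```python
import random

random.seed(3)
x = 0.0
for n in range(1, 5001):
    W = random.gauss(5.0, 1.0)   # new observation W^{n+1}, mean EW = 5
    grad = 2.0 * (x - W)         # gradient of F(x, W) = (x - W)^2
    alpha = 0.5 / n              # stepsize alpha_n (satisfies the usual conditions)
    x = x - alpha * grad         # stochastic gradient step (minimization form)
```

With this stepsize the iterate is exactly the running sample mean, so $x^n \to \mathbb{E}W = 5$, illustrating the asymptotic convergence statement above.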
Canonical problems
Ranking and selection (derivative-free)
» Basic problem:
$$\max_\pi \mathbb{E}\, F(x^{\pi, N}, W)$$
after a budget of $N$ measurements.
» We need to find a policy $X^\pi(S^n)$ for playing machine $x$ that maximizes:
$$\max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})$$
where
$W^{n+1}$ = "winnings" (new information)
$S^n$ = state of knowledge (what we know about each slot machine)
$x^n = X^\pi(S^n)$ = choice of the next "arm" to play
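A sketch of one simple learning policy for this problem (epsilon-greedy, used here purely for illustration; the true means and all parameters are made up), maximizing cumulative winnings:

```python
import random

random.seed(4)
mu = [0.3, 0.5, 0.7]            # true (unknown) mean winnings per machine
counts = [0, 0, 0]
est = [0.0, 0.0, 0.0]           # S^n: state of knowledge (sample means)

def policy(epsilon=0.1):
    """Epsilon-greedy X^pi(S^n): explore with prob epsilon, else exploit."""
    if random.random() < epsilon or min(counts) == 0:
        return random.randrange(3)
    return max(range(3), key=lambda x: est[x])

total = 0.0
for n in range(2000):
    x = policy()
    W = 1.0 if random.random() < mu[x] else 0.0   # winnings W^{n+1}
    counts[x] += 1
    est[x] += (W - est[x]) / counts[x]            # update the knowledge state
    total += W
```

The knowledge gradient discussed earlier is a more sophisticated policy for exactly this state of knowledge.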
» Bellman's equation for a Markov decision process:
$$V_t(S_t) = \min_{a \in \mathcal{A}} \left( C(S_t, a) + \sum_{s'} p(S_{t+1} = s' \mid S_t, a)\, V_{t+1}(s') \right)$$
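The backward recursion implied by this equation can be sketched on a made-up two-state, two-action, three-period problem:

```python
# Tiny finite-horizon MDP solved by backward induction, following the
# Bellman equation above. States, actions, costs, and transition
# probabilities are invented for illustration.
T = 3
states = [0, 1]
actions = [0, 1]

def cost(s, a):
    """C(S_t, a): action 1 is cheaper from state 1."""
    return 1.0 if a == 0 else 2.0 - s

def prob(s2, s, a):
    """p(S_{t+1} = s2 | S_t = s, a): action 1 biases us toward state 1."""
    p1 = 0.8 if a == 1 else 0.3
    return p1 if s2 == 1 else 1.0 - p1

V = {s: 0.0 for s in states}    # terminal values V_T = 0
for t in reversed(range(T)):
    V = {s: min(cost(s, a) + sum(prob(s2, s, a) * V[s2] for s2 in states)
                for a in actions)
         for s in states}
```

This exact enumeration is only feasible for tiny state spaces, which is what motivates the approximation strategies covered later.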
• Decision:
$$X_t^\pi = \begin{cases} 1 & \text{if we stop and sell at time } t \\ 0 & \text{otherwise} \end{cases}$$
» Optimization problem:
$$\max_\pi \mathbb{E} \sum_t p_t X_t^\pi$$
where $X_t^\pi$ must be an "$\mathcal{F}_t$-measurable function."
• State:
$R_t$ = 1 if we are holding the asset, 0 otherwise.
$S_t = (R_t, p_t, \bar{p}_t)$
• Policy:
$$X_t^\pi(S_t \mid \theta) = \begin{cases} 1 & \text{if } p_t \ge \bar{p}_t + \theta \\ 0 & \text{otherwise} \end{cases}$$
• Tuning the policy:
$$\max_\theta \mathbb{E} \sum_{t=0}^T p_t X_t^\pi(S_t \mid \theta)$$
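A sketch of tuning $\theta$ by simulation. The price process, the smoothing rule for $\bar{p}_t$, and the candidate grid of $\theta$ values are all made-up assumptions:

```python
import random

random.seed(5)

def simulate(theta, T=50):
    """One sample path of the sell policy X_t(S_t | theta):
    sell the first time p_t >= pbar_t + theta. Returns the sale price."""
    p = 50.0       # current price p_t
    pbar = 50.0    # smoothed price pbar_t
    for t in range(T):
        p = max(1.0, p + random.gauss(0.0, 2.0))   # assumed price process
        pbar = 0.9 * pbar + 0.1 * p                # smoothed price update
        if p >= pbar + theta:
            return p                               # sell now
    return p                                       # forced sale at time T

# Tune theta by brute force over sample-average estimates of the objective.
best = max([1.0, 2.0, 3.0, 4.0],
           key=lambda th: sum(simulate(th) for _ in range(500)) / 500)
```

Replacing the brute-force grid with a stochastic search algorithm turns this into the derivative-free optimization problem described above.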
Canonical problems
Linear quadratic regulation (LQR)
» A popular optimal control problem in engineering
involves solving:
$$\min_{u_0, \ldots, u_T} \mathbb{E} \sum_{t=0}^T \left[ (x_t)^T Q x_t + (u_t)^T R u_t \right]$$
» where:
$x_t$ = state at time $t$
$u_t$ = control at time $t$ (must be $\mathcal{F}_t$-measurable)
$x_{t+1} = f(x_t, u_t) + w_t$ (where $w_t$ is random at time $t$)
» It is possible to show that the optimal policy looks like:
$$U_t^*(x_t) = K_t x_t$$
where $K_t$ is a complicated function of $Q$ and $R$.
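A sketch of how the gains $K_t$ are actually computed, assuming the common special case of linear dynamics $x_{t+1} = A x_t + B u_t + w_t$ (additive noise does not change the optimal gains); the system matrices here are made up:

```python
import numpy as np

# Backward Riccati recursion for finite-horizon LQR.
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # made-up dynamics (double integrator)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                             # state cost
R = np.array([[1.0]])                     # control cost
T = 20

P = Q.copy()                              # terminal condition P_T = Q
gains = []
for t in reversed(range(T)):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain K_t
    P = Q + A.T @ P @ A - A.T @ P @ B @ K               # Riccati update
    gains.append(K)
gains.reverse()
# Optimal policy (with this sign convention): U_t*(x_t) = -K_t x_t
```

The recursion shows why the policy is linear even though the cost is quadratic: each $P_t$ stays quadratic under the backward step.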
Canonical problems
Stochastic programming
» A (two-stage) stochastic programming problem
$$\min_{x_0 \in \mathcal{X}_0} \left( c_0 x_0 + \mathbb{E}\, Q(x_0, \omega_1) \right)$$
where
$$Q(x_0, \omega_1(\omega)) = \min_{x_1(\omega) \in \mathcal{X}_1(\omega)} c_1(\omega)\, x_1(\omega)$$
» This is the canonical form of stochastic programming, which might also be written over multiple periods by summing over scenarios $\omega \in \Omega$:
$$\min \left( c_0 x_0 + \sum_{\omega \in \Omega} p(\omega) \sum_{t=1}^T c_t(\omega)\, x_t(\omega) \right)$$
» Rolled forward, the same problem is solved at each time $t$:
$$\min_{x_t \in \mathcal{X}_t} \left( c_t x_t + \mathbb{E}\, Q(x_t, \omega_{t+1}) \right)$$
where
$$Q(x_t, \omega_{t+1}(\omega)) = \min_{x_{t+1}(\omega) \in \mathcal{X}_{t+1}(\omega)} c_{t+1}(\omega)\, x_{t+1}(\omega)$$
which in scenario form, over a horizon $H$, becomes
$$\min \left( c_t x_t + \sum_{\omega \in \Omega_t} p(\omega_t) \sum_{t' = t+1}^{t+H} c_{tt'}(\omega)\, x_{tt'}(\omega) \right)$$
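A toy two-stage instance, solved by scenario enumeration; the newsvendor-style costs and demand scenarios are made up:

```python
# Choose an order quantity x0 at cost c0 per unit; after demand omega is
# revealed, the recourse Q(x0, omega) sells min(x0, omega) at price r.
# We minimize first-stage cost plus expected second-stage (negative) revenue.
c0, r = 1.0, 3.0
scenarios = [(0.3, 5), (0.4, 10), (0.3, 15)]   # (p(omega), demand)

def total_cost(x0):
    """c0*x0 + sum over omega of p(omega) * Q(x0, omega)."""
    return c0 * x0 + sum(p * (-r * min(x0, d)) for p, d in scenarios)

x_star = min(range(0, 21), key=total_cost)
```

Enumerating scenarios like this is exactly the deterministic-equivalent form above; real stochastic programs replace the brute-force search with a linear-programming solver.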
» State variables
» Decision variables
» Exogenous information
» Transition function
» Objective function
The state may also include a knowledge state ("belief state") $K_t$, capturing for example:
» Belief about the weather
» Belief about traffic delays
» Belief about the status of equipment
Bizarrely, only the controls community has a tradition of actually defining state variables. We return to state variables later.
Modeling dynamic problems
Decisions:
Markov decision processes/Computer science
at Discrete action
Control theory
ut Low-dimensional continuous vector
Operations research
xt Usually a discrete or continuous but high-dimensional
vector of decisions.
At this point, we do not specify how to make a decision. Instead, we define the function $X^\pi(s)$ (or $A^\pi(s)$ or $U^\pi(s)$), where $\pi$ specifies the type of policy. "$\pi$" carries information about the type of function $f$, and any tunable parameters $\theta \in \Theta^f$.
» Continuous scalar: $x \in \mathcal{X} = [a, b]$
» Continuous vector: $x = (x_1, \ldots, x_K),\ x_k \in \mathbb{R}$
» Discrete vector: $x = (x_1, \ldots, x_K),\ x_k \in \mathbb{Z}$
» Categorical: $x = (a_1, \ldots, a_I)$, where $a_i$ is a category (e.g. red/green/blue)
Modeling dynamic problems
Exogenous information:
Wt New information that first became known at time t
= Rˆt , Dˆ t , pˆ t , Eˆ t
Rˆt Equipment failures, delays, new arrivals
New drivers being hired to the network
Dˆ t New customer demands
pˆ t Changes in prices
Eˆ t Information about the environment (temperature, ...)
Note: Any variable indexed by t is known at time t. This
convention, which is not standard in control theory,
dramatically simplifies the modeling of information.
Below, we let $\omega$ represent a sequence of actual observations $W_1, W_2, \ldots$.
$W_t(\omega)$ refers to a sample realization of the random variable $W_t$.
Modeling dynamic problems
The transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
$$R_{t+1} = R_t + x_t + \hat{R}_{t+1} \quad \text{(inventories)}$$
$$p_{t+1} = p_t + \hat{p}_{t+1} \quad \text{(spot prices)}$$
$$D_{t+1} = D_t + \hat{D}_{t+1} \quad \text{(market demands)}$$
Also known as the: "system model," "transfer function," "state transition model," "plant model," "transformation function," "law of motion," "plant equation," "transition law," or simply "model."
For many applications, these equations are unknown. This is known as "model-free" dynamic programming.
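The three equations above translate directly into code; the numbers in the usage line are made up:

```python
def transition(S, x, W):
    """S^M(S_t, x_t, W_{t+1}) for the inventory/price/demand example above."""
    R, p, D = S
    R_hat, p_hat, D_hat = W        # exogenous information W_{t+1}
    R_next = R + x + R_hat         # inventory balance
    p_next = p + p_hat             # spot price update
    D_next = D + D_hat             # market demand update
    return (R_next, p_next, D_next)

# Example: 10 units on hand, sell 3, then 2 arrive; price and demand shift.
S1 = transition((10, 30.0, 5), -3, (2, 1.5, -1))
```

Writing the model this way keeps the policy, the exogenous information, and the dynamics cleanly separated, which is the point of the five-element framework.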
Modeling dynamic problems
The objective function
Dimensions of objective functions
» Type of performance metric
» Final cost vs. cumulative cost
» Expectation or risk measures
» Mathematical properties (convexity,
monotonicity, continuity, unimodularity,
…)
» Time to compute (fractions of seconds
to minutes, to hours, to days or months)
Elements of a dynamic model
Objective functions
» Cumulative reward ("online learning"):
$$\max_\pi \mathbb{E} \left\{ \sum_{t=0}^T C\big(S_t, X_t^\pi(S_t), W_{t+1}\big) \,\Big|\, S_0 \right\}$$
• Policies have to work well over time.
» Final reward ("offline learning"):
$$\max_\pi \mathbb{E} \left\{ F(x^{\pi, N}, \hat{W}) \,\Big|\, S_0 \right\}$$
• Risk:
$$\max_\pi \rho\Big( C(S_0, X_0^\pi(S_0)),\, C(S_1, X_1^\pi(S_1)),\, \ldots,\, C(S_T, X_T^\pi(S_T)) \,\Big|\, S_0 \Big)$$
with transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
Reading the five elements:
» Objective function — performance metrics
» Transition function — how the state variables evolve
» Exogenous information — what we didn't know when we made our decision
» Decision variables — what we control
» State variables — what we need to know (and only what we need)
Course overview