ORF 544 Week 1 Introduction and Overview
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
Introduction
Drug discovery
Designing molecules
Drug discovery
We express our belief using a linear, additive QSAR model:
$$Y = \theta_0 + \sum_{\text{sites } i} \; \sum_{\text{substituents } j} \theta_{ij} X_{ij}^m$$
where $X_{ij}^m$ is an indicator variable equal to 1 if molecule $m$ has substituent $j$ at site $i$.
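A minimal sketch of fitting this additive model by least squares in Python; the number of features, the coefficient values, and the noise level are all synthetic, purely for illustration:

```python
import numpy as np

# Hypothetical setup: 3 sites x 2 candidate substituents -> 6 indicator features.
# Each row of X encodes one molecule: X[m, k] = 1 if molecule m has the
# substituent/site combination k. Y[m] is its measured activity.
rng = np.random.default_rng(0)
n_molecules, n_features = 20, 6
X = rng.integers(0, 2, size=(n_molecules, n_features)).astype(float)

true_theta = np.array([1.0, -0.5, 2.0, 0.3, -1.2, 0.8])   # made-up theta_ij
Y = 4.0 + X @ true_theta + rng.normal(0.0, 0.1, n_molecules)  # theta_0 = 4.0

# Fit theta_0 and theta_ij jointly by least squares on [1, X].
A = np.hstack([np.ones((n_molecules, 1)), X])
theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

In the actual drug-discovery application the estimates would be updated sequentially as each new compound is tested, which is what motivates the knowledge gradient comparison below.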
Drug discovery
Knowledge gradient versus pure exploration for 99 compounds
[Figure: performance relative to the best possible for the knowledge gradient policy versus pure exploration, over the 99 compounds]
Optimal learning
» Optimizing nanoparticles to
maximize photoconductivity
© 2019 Warren Powell
E-commerce
Revenue management
» Optimizing prices to maximize total revenue for a particular night in a hotel.
Ad-click optimization
» How much to bid for ads on the
internet.
Personalized offer
optimization
» Designing offers for individual
customers.
[Figure: electricity price path over roughly 70 time periods, with charge and discharge thresholds shown as the tunable parameters]
Policy function approximations
Our policy function might be the parametric model (this is nonlinear in the parameters $\theta$):
$$X^\pi(S_t \mid \theta) = \begin{cases} +1 \ \text{(charge)} & \text{if } p_t < \theta^{charge} \\ 0 & \text{if } \theta^{charge} \le p_t \le \theta^{discharge} \\ -1 \ \text{(discharge)} & \text{if } p_t > \theta^{discharge} \end{cases}$$
[Figure: energy in storage and the price of electricity over time, with charge and discharge periods marked]
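A minimal sketch of this threshold policy in Python. The sign convention (+1 = charge, -1 = discharge), the price process, and the threshold values are assumptions for illustration, and the sketch ignores the storage capacity constraint:

```python
import random

def battery_policy(price, theta_charge, theta_discharge):
    """Threshold PFA: charge when the price is low, discharge when it is high."""
    if price < theta_charge:
        return 1        # charge (buy energy)
    if price > theta_discharge:
        return -1       # discharge (sell energy)
    return 0            # hold

# Simulate the policy on a random price path and tally the cash flow.
random.seed(1)
prices = [random.uniform(20.0, 100.0) for _ in range(24)]
actions = [battery_policy(p, 35.0, 70.0) for p in prices]
cash = sum(-a * p for a, p in zip(actions, prices))  # pay to charge, earn to discharge
```

Tuning $(\theta^{charge}, \theta^{discharge})$ to maximize the simulated cash flow is exactly the policy-search problem this course studies.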
Energy storage
How much energy to store in a battery to handle the
volatility of wind and spot prices to meet demands?
Nomadic trucker illustration
We use initial value function approximations:
$\bar{V}^0(MN) = 0$, $\bar{V}^0(NY) = 0$, $\bar{V}^0(CO) = 0$, $\bar{V}^0(CA) = 0$
[Figure: the trucker starts in Texas with load offers of $350, $150, $450 and $300; state $S_t = (TX, \hat{D}_t)$]
Nomadic trucker illustration
… and make our first choice: $x$.
[Figure: the trucker takes the $450 load to NY; post-decision state $S_t^x = (NY)$, moving to time $t+1$]
Nomadic trucker illustration
Update the value of being in Texas:
$\bar{V}^1(TX) = 450$
[Figure: all other value estimates remain 0; post-decision state $S_t^x = (NY)$]
Nomadic trucker illustration
Now move to the next state, sample new demands and make a new decision.
[Figure: from NY the trucker sees new load offers of $180, $400, $600 and $125; state $S_{t+1} = (NY, \hat{D}_{t+1})$, with $\bar{V}^1(TX) = 450$]
Nomadic trucker illustration
Update the value of being in NY:
$\bar{V}(NY) = 600$
[Figure: the trucker chooses the load to CA; post-decision state $S_{t+1}^x = (CA)$, moving to time $t+2$]
Nomadic trucker illustration
Move to California.
[Figure: from CA the trucker sees new load offers of $400, $200, $150 and $350; state $S_{t+2} = (CA, \hat{D}_{t+2})$]
Nomadic trucker illustration
Make the decision to return to TX and update the value of being in CA:
$\bar{V}(CA) = 800$ (the $350 load to TX plus the downstream value $\bar{V}^1(TX) = 450$)
[Figure: state $S_{t+2} = (CA, \hat{D}_{t+2})$]
Nomadic trucker illustration
Updating the value function:
Old value: $\bar{V}^1(TX) = \$450$
New estimate: $\hat{v}^2(TX) = \$800$
Smoothed update with stepsize $\alpha = 0.1$:
$$\bar{V}^2(TX) = (1 - \alpha)\,\bar{V}^1(TX) + \alpha\,\hat{v}^2(TX) = 0.9(450) + 0.1(800) = 485$$
[Figure: back in TX at time $t+3$ with new load offers of $385, $125 and $275; state $S_{t+3} = (TX, \hat{D}_{t+3})$, with $\bar{V}(NY) = 600$ and $\bar{V}(CA) = 800$]
Real-time logistics
Uber
» Provides real-time, on-demand
transportation.
» Drivers are encouraged to enter or leave
the system using pricing signals and
informational guidance.
Decisions:
» How to price to get the right balance of
drivers relative to customers.
» Assigning and routing drivers to
manage Uber-created congestion.
» Real-time management of drivers.
» Pricing (trips, new services, …)
» Policies (rules for managing drivers,
customers, …)
The assignment of cars to riders evolves over time, with new riders
arriving, along with updates of cars available.
Autonomous EVs
Challenges
» A variety of challenges arise in managing risk:
» Spikes in natural gas and electricity prices.
» Pipeline outages due to storms.
An energy generation portfolio
[Figure: portfolio profiles by month, January through December and September through August]
Solar energy
Solar from a single solar farm
$977/MW !!!
Stochastic prices
[Figure: electricity price path over roughly 70 time periods; the policy buys when the price drops below the lower threshold and sells when it rises above the upper threshold]
Planning under uncertainty
Don’t gamble; take all your savings
and buy some good stock and hold
it till it goes up, then sell it. If it
don’t go up, don’t buy it.
Will Rogers
» Start with a deterministic optimization problem:
$$\min \sum_{t=0}^T c_t x_t$$
subject to
$$A_0 x_0 = b_0,$$
$$A_t x_t - B_{t-1} x_{t-1} = b_t, \quad t \ge 1.$$
» We can convert this to a proper stochastic model by replacing $x_t$ with a policy $X_t^\pi(S_t)$ and taking an expectation:
$$\min_\pi \mathbb{E} \sum_{t=0}^T c_t X_t^\pi(S_t)$$
» The policy has to satisfy $A_t X_t^\pi(S_t) = R_t$, with transition function:
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
Unknown outages
» Outage calls (known)
» Network outages (unknown)
[Figure: after a storm, known customers in outage place calls ("call"); the network outages (X) that caused them are unobserved and must be inferred]
Escalation Algorithm
[Figure sequence: successive frames of the escalation algorithm, which combines incoming customer calls ("call") with the network structure to progressively infer the unknown outage locations (X)]
Lookahead policies
Decision trees:
C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1-49, March 2012.
Monte Carlo tree search
AlphaGo
» Much more complex state
space.
» Uses hybrid of policies:
• MCTS
• PFA
• VFA
» Stochastic gradient:
$$x^{n+1} = x^n + \alpha_n \nabla_x F(x^n, W^{n+1})$$
» Asymptotic convergence:
$$\lim_{n \to \infty} \mathbb{E}\, F(x^n, W) = \mathbb{E}\, F(x^*, W)$$
» Finite time performance:
$$\max_\pi \mathbb{E}\, F(x^{\pi, n}, W)$$
where $\pi$ is an algorithm (or policy).
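A minimal sketch of the stochastic gradient update for the case $F(x, W) = (x - W)^2$, whose minimizer is $\mathbb{E}W$; the distribution of $W$ and the stepsize rule are assumptions. (The slide's update is written for maximization; the sign flips for minimization.)

```python
import random

random.seed(3)
x = 0.0
for n in range(1, 5001):
    W = random.gauss(5.0, 1.0)   # new observation W^{n+1}, mean EW = 5
    grad = 2.0 * (x - W)         # gradient of F(x, W) = (x - W)^2
    alpha = 0.5 / n              # stepsize alpha_n (satisfies the usual conditions)
    x = x - alpha * grad         # stochastic gradient step (minimization form)
```

With this stepsize the iterate is exactly the running sample mean, so $x^n \to \mathbb{E}W = 5$, illustrating the asymptotic convergence statement above.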
Canonical problems
Ranking and selection (derivative-free)
» Basic problem:
$$\max_\pi \mathbb{E}\, F(x^{\pi, N}, W)$$
after a budget of $N$ measurements.
» We need to find a policy $X^\pi(S^n)$ for playing machine $x$ that maximizes:
$$\max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})$$
where
$W^{n+1}$ = "winnings" (new information)
$S^n$ = state of knowledge (what we know about each slot machine)
$x^n = X^\pi(S^n)$ = choice of the next "arm" to play
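A sketch of one simple learning policy for this problem (epsilon-greedy, used here purely for illustration; the true means and all parameters are made up), maximizing cumulative winnings:

```python
import random

random.seed(4)
mu = [0.3, 0.5, 0.7]            # true (unknown) mean winnings per machine
counts = [0, 0, 0]
est = [0.0, 0.0, 0.0]           # S^n: state of knowledge (sample means)

def policy(epsilon=0.1):
    """Epsilon-greedy X^pi(S^n): explore with prob epsilon, else exploit."""
    if random.random() < epsilon or min(counts) == 0:
        return random.randrange(3)
    return max(range(3), key=lambda x: est[x])

total = 0.0
for n in range(2000):
    x = policy()
    W = 1.0 if random.random() < mu[x] else 0.0   # winnings W^{n+1}
    counts[x] += 1
    est[x] += (W - est[x]) / counts[x]            # update the knowledge state
    total += W
```

The knowledge gradient discussed earlier is a more sophisticated policy for exactly this state of knowledge.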
» Bellman's equation for a Markov decision process:
$$V_t(S_t) = \min_{a \in \mathcal{A}} \left( C(S_t, a) + \sum_{s'} p(S_{t+1} = s' \mid S_t, a)\, V_{t+1}(s') \right)$$
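The backward recursion implied by this equation can be sketched on a made-up two-state, two-action, three-period problem:

```python
# Tiny finite-horizon MDP solved by backward induction, following the
# Bellman equation above. States, actions, costs, and transition
# probabilities are invented for illustration.
T = 3
states = [0, 1]
actions = [0, 1]

def cost(s, a):
    """C(S_t, a): action 1 is cheaper from state 1."""
    return 1.0 if a == 0 else 2.0 - s

def prob(s2, s, a):
    """p(S_{t+1} = s2 | S_t = s, a): action 1 biases us toward state 1."""
    p1 = 0.8 if a == 1 else 0.3
    return p1 if s2 == 1 else 1.0 - p1

V = {s: 0.0 for s in states}    # terminal values V_T = 0
for t in reversed(range(T)):
    V = {s: min(cost(s, a) + sum(prob(s2, s, a) * V[s2] for s2 in states)
                for a in actions)
         for s in states}
```

This exact enumeration is only feasible for tiny state spaces, which is what motivates the approximation strategies covered later.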
• Decision:
$$X_t^\pi = \begin{cases} 1 & \text{if we stop and sell at time } t \\ 0 & \text{otherwise} \end{cases}$$
» Optimization problem:
$$\max_\pi \mathbb{E} \sum_t p_t X_t^\pi$$
where $X_t^\pi$ must be an "$\mathcal{F}_t$-measurable function."
• State:
$R_t$ = 1 if we are holding the asset, 0 otherwise.
$S_t = (R_t, p_t, \bar{p}_t)$
• Policy:
$$X_t^\pi(S_t \mid \theta) = \begin{cases} 1 & \text{if } p_t \ge \bar{p}_t + \theta \\ 0 & \text{otherwise} \end{cases}$$
• Tuning the policy:
$$\max_\theta \mathbb{E} \sum_{t=0}^T p_t X_t^\pi(S_t \mid \theta)$$
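A sketch of tuning $\theta$ by simulation. The price process, the smoothing rule for $\bar{p}_t$, and the candidate grid of $\theta$ values are all made-up assumptions:

```python
import random

random.seed(5)

def simulate(theta, T=50):
    """One sample path of the sell policy X_t(S_t | theta):
    sell the first time p_t >= pbar_t + theta. Returns the sale price."""
    p = 50.0       # current price p_t
    pbar = 50.0    # smoothed price pbar_t
    for t in range(T):
        p = max(1.0, p + random.gauss(0.0, 2.0))   # assumed price process
        pbar = 0.9 * pbar + 0.1 * p                # smoothed price update
        if p >= pbar + theta:
            return p                               # sell now
    return p                                       # forced sale at time T

# Tune theta by brute force over sample-average estimates of the objective.
best = max([1.0, 2.0, 3.0, 4.0],
           key=lambda th: sum(simulate(th) for _ in range(500)) / 500)
```

Replacing the brute-force grid with a stochastic search algorithm turns this into the derivative-free optimization problem described above.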
Canonical problems
Linear quadratic regulation (LQR)
» A popular optimal control problem in engineering
involves solving:
$$\min_{u_0, \ldots, u_T} \mathbb{E} \sum_{t=0}^T \left[ (x_t)^T Q x_t + (u_t)^T R u_t \right]$$
» where:
$x_t$ = state at time $t$
$u_t$ = control at time $t$ (must be $\mathcal{F}_t$-measurable)
$x_{t+1} = f(x_t, u_t) + w_t$ (where $w_t$ is random at time $t$)
» It is possible to show that the optimal policy looks like:
$$U_t^*(x_t) = K_t x_t$$
where $K_t$ is a complicated function of $Q$ and $R$.
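A sketch of how the gains $K_t$ are actually computed, assuming the common special case of linear dynamics $x_{t+1} = A x_t + B u_t + w_t$ (additive noise does not change the optimal gains); the system matrices here are made up:

```python
import numpy as np

# Backward Riccati recursion for finite-horizon LQR.
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # made-up dynamics (double integrator)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                             # state cost
R = np.array([[1.0]])                     # control cost
T = 20

P = Q.copy()                              # terminal condition P_T = Q
gains = []
for t in reversed(range(T)):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain K_t
    P = Q + A.T @ P @ A - A.T @ P @ B @ K               # Riccati update
    gains.append(K)
gains.reverse()
# Optimal policy (with this sign convention): U_t*(x_t) = -K_t x_t
```

The recursion shows why the policy is linear even though the cost is quadratic: each $P_t$ stays quadratic under the backward step.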
Canonical problems
Stochastic programming
» A (two-stage) stochastic programming problem
$$\min_{x_0 \in \mathcal{X}_0} \left( c_0 x_0 + \mathbb{E}\, Q(x_0, \omega_1) \right)$$
where
$$Q(x_0, \omega_1(\omega)) = \min_{x_1(\omega) \in \mathcal{X}_1(\omega)} c_1(\omega)\, x_1(\omega)$$
» This is the canonical form of stochastic programming, which might also be written over multiple periods by summing over scenarios $\omega \in \Omega$:
$$\min \left( c_0 x_0 + \sum_{\omega \in \Omega} p(\omega) \sum_{t=1}^T c_t(\omega)\, x_t(\omega) \right)$$
» Rolled forward, the same problem is solved at each time $t$:
$$\min_{x_t \in \mathcal{X}_t} \left( c_t x_t + \mathbb{E}\, Q(x_t, \omega_{t+1}) \right)$$
where
$$Q(x_t, \omega_{t+1}(\omega)) = \min_{x_{t+1}(\omega) \in \mathcal{X}_{t+1}(\omega)} c_{t+1}(\omega)\, x_{t+1}(\omega)$$
which in scenario form, over a horizon $H$, becomes
$$\min \left( c_t x_t + \sum_{\omega \in \Omega_t} p(\omega_t) \sum_{t' = t+1}^{t+H} c_{tt'}(\omega)\, x_{tt'}(\omega) \right)$$
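A toy two-stage instance, solved by scenario enumeration; the newsvendor-style costs and demand scenarios are made up:

```python
# Choose an order quantity x0 at cost c0 per unit; after demand omega is
# revealed, the recourse Q(x0, omega) sells min(x0, omega) at price r.
# We minimize first-stage cost plus expected second-stage (negative) revenue.
c0, r = 1.0, 3.0
scenarios = [(0.3, 5), (0.4, 10), (0.3, 15)]   # (p(omega), demand)

def total_cost(x0):
    """c0*x0 + sum over omega of p(omega) * Q(x0, omega)."""
    return c0 * x0 + sum(p * (-r * min(x0, d)) for p, d in scenarios)

x_star = min(range(0, 21), key=total_cost)
```

Enumerating scenarios like this is exactly the deterministic-equivalent form above; real stochastic programs replace the brute-force search with a linear-programming solver.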
» State variables
» Decision variables
» Exogenous information
» Transition function
» Objective function
The state may also include a knowledge state ("belief state") $K_t$, capturing for example:
» Belief about the weather
» Belief about traffic delays
» Belief about the status of equipment
Bizarrely, only the controls community has a tradition of actually defining state variables. We return to state variables later.
Modeling dynamic problems
Decisions:
Markov decision processes/Computer science
at Discrete action
Control theory
ut Low-dimensional continuous vector
Operations research
xt Usually a discrete or continuous but high-dimensional
vector of decisions.
At this point, we do not specify how to make a decision. Instead, we define the function $X^\pi(s)$ (or $A^\pi(s)$ or $U^\pi(s)$), where $\pi$ specifies the type of policy. "$\pi$" carries information about the type of function $f$, and any tunable parameters $\theta \in \Theta^f$.
» Continuous scalar: $x \in \mathcal{X} = [a, b]$
» Continuous vector: $x = (x_1, \ldots, x_K),\ x_k \in \mathbb{R}$
» Discrete vector: $x = (x_1, \ldots, x_K),\ x_k \in \mathbb{Z}$
» Categorical: $x = (a_1, \ldots, a_I)$, where $a_i$ is a category (e.g. red/green/blue)
Modeling dynamic problems
Exogenous information:
Wt New information that first became known at time t
= Rˆt , Dˆ t , pˆ t , Eˆ t
Rˆt Equipment failures, delays, new arrivals
New drivers being hired to the network
Dˆ t New customer demands
pˆ t Changes in prices
Eˆ t Information about the environment (temperature, ...)
Note: Any variable indexed by t is known at time t. This
convention, which is not standard in control theory,
dramatically simplifies the modeling of information.
Below, we let $\omega$ represent a sequence of actual observations $W_1, W_2, \ldots$.
$W_t(\omega)$ refers to a sample realization of the random variable $W_t$.
Modeling dynamic problems
The transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
$$R_{t+1} = R_t + x_t + \hat{R}_{t+1} \quad \text{(inventories)}$$
$$p_{t+1} = p_t + \hat{p}_{t+1} \quad \text{(spot prices)}$$
$$D_{t+1} = D_t + \hat{D}_{t+1} \quad \text{(market demands)}$$
Also known as the: "system model," "transfer function," "state transition model," "plant model," "transformation function," "law of motion," "plant equation," "transition law," or simply "model."
For many applications, these equations are unknown. This is known as "model-free" dynamic programming.
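The three equations above translate directly into code; the numbers in the usage line are made up:

```python
def transition(S, x, W):
    """S^M(S_t, x_t, W_{t+1}) for the inventory/price/demand example above."""
    R, p, D = S
    R_hat, p_hat, D_hat = W        # exogenous information W_{t+1}
    R_next = R + x + R_hat         # inventory balance
    p_next = p + p_hat             # spot price update
    D_next = D + D_hat             # market demand update
    return (R_next, p_next, D_next)

# Example: 10 units on hand, sell 3, then 2 arrive; price and demand shift.
S1 = transition((10, 30.0, 5), -3, (2, 1.5, -1))
```

Writing the model this way keeps the policy, the exogenous information, and the dynamics cleanly separated, which is the point of the five-element framework.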
Modeling dynamic problems
The objective function
Dimensions of objective functions
» Type of performance metric
» Final cost vs. cumulative cost
» Expectation or risk measures
» Mathematical properties (convexity,
monotonicity, continuity, unimodularity,
…)
» Time to compute (fractions of seconds
to minutes, to hours, to days or months)
Elements of a dynamic model
Objective functions
» Cumulative reward ("online learning"):
$$\max_\pi \mathbb{E} \left\{ \sum_{t=0}^T C\big(S_t, X_t^\pi(S_t), W_{t+1}\big) \,\Big|\, S_0 \right\}$$
• Policies have to work well over time.
» Final reward ("offline learning"):
$$\max_\pi \mathbb{E} \left\{ F(x^{\pi, N}, \hat{W}) \,\Big|\, S_0 \right\}$$
• Risk:
$$\max_\pi \rho\Big( C(S_0, X_0^\pi(S_0)),\, C(S_1, X_1^\pi(S_1)),\, \ldots,\, C(S_T, X_T^\pi(S_T)) \,\Big|\, S_0 \Big)$$
with transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
Reading the five elements:
» Objective function — performance metrics
» Transition function — how the state variables evolve
» Exogenous information — what we didn't know when we made our decision
» Decision variables — what we control
» State variables — what we need to know (and only what we need)
Course overview