
Optimal Control and Planning via Dynamic Programming and Reinforcement Learning Methods
Lecture 1: Introduction to DP and RL
Peter Zhang
Introductions
Instructor
TA
Yourselves
Educational background: high school, undergraduate, or graduate
Preparation in linear algebra and calculus: little, medium, or advanced
Prior exposure to artificial intelligence
Motivating Examples
Sequential Decision Problems are Everywhere!
• Planning your way from residence to classroom
• Deciding whether to go to graduate school or go into the industry
• Deciding what to do with your current savings
•…

• These problems share a common trait:


• you have a sequence of decisions to make today, tomorrow, the day after
tomorrow,...
• they all affect each other!
Sequential Decision Problems are Fundamental in Artificial (General) Intelligence
• Examples…
Games

Path Planning

Vendor’s Pricing Problem

Airline’s Pricing Problem

Netflix’s Recommendation

Flappy Bird

[Figure: gameplay screenshots at 2,000, 20,000, and 2,000,000 training iterations]
RL with Human Feedback for LLMs
Autonomous Driving and Traffic Simulation
with CARLA and SUMMIT
https://www.youtube.com/watch?v=dNiR0z2dROg
Course Organization
Course Objective
• Two focuses of this course
• Modeling of sequential decision problems
• Solution methods to solve these problems
• Key learning objective
• Learning to recognize and model sequential decision problems in a
disciplined way
• Mastering several key solution techniques conceptually and in Python
• Starting to interact with complex sequential decision problem environments
• Goal for subsequent research course
• Being able to design your own dynamic programming planning and
reinforcement learning environment
Course Outline
Overview of Model Notations
Word of Caution
1. This subject is highly interdisciplinary
• Operations research: dynamic programming
• Engineering: control theory
• Computer science: reinforcement learning
• …
2. Unlike many other subjects, learning the notations of this
subject is (more than) half the battle
• Unfortunately, different domains use different sets of notations
• We will switch between two sets of notations – not to confuse you, but to
familiarize you with the wild world out there
Key Concepts
• Agent
• Environment
• States
• Actions (e.g., for a simple game agent: stay still, jump, go left, go right)
• Policies
• Reward / cost function
• Transition function
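
To make these concepts concrete, here is a minimal sketch of the standard agent–environment interaction loop in Python. The toy environment, goal state, and reward values are illustrative placeholders, not part of any specific library:

```python
import random

class SimpleEnvironment:
    """A toy environment: the agent walks on a line and is rewarded at a goal state."""
    def __init__(self):
        self.state = 0  # initial state

    def step(self, action):
        # Transition function: the action shifts the state left (-1) or right (+1).
        self.state += action
        # Reward function: +1 at the goal, a small per-step cost otherwise.
        reward = 1.0 if self.state == 5 else -0.1
        done = self.state == 5
        return self.state, reward, done

def random_policy(state):
    """A policy maps each state to an action; here, chosen uniformly at random."""
    return random.choice([-1, +1])

env = SimpleEnvironment()
state, total_reward = env.state, 0.0
for _ in range(1000):  # cap the episode length
    action = random_policy(state)           # the agent picks an action
    state, reward, done = env.step(action)  # the environment transitions and emits a reward
    total_reward += reward
    if done:
        break
print(f"Episode ended in state {state} with total reward {total_reward:.1f}")
```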
Case 1: Hospital Management
We will unpack this example in the next few lectures. For now, if you do not
follow the details, that is okay!
Example: Waiting in the Hospital

• Staffing issues in the hospital (e.g., ER)


• …too many nurses and doctors on a unit one hour and too few the next
• Results in long wait times, under-utilized resources

Example: Waiting in the Hospital
• How do we approach this problem?
• Step 1: gather longitudinal patient data – arrival rates, demographics, symptoms – and predict future trends

Example: Waiting in the Hospital
• Step 2: gather data about staff availability, wages, labor union rules, etc.
• Step 3: optimize
• Reading Hospital (West Reading, PA) has been saving $1M per year after implementing new staffing policies in 2021 to balance supply and demand
• How would you have solved the problem?
Example: Waiting in the Hospital
• You have been hired to review a hospital’s operations
• The hospital provides high-quality care to its patients, but
there have been complaints due to long wait times in the past
• What are possible solutions one might consider?
• Which practical considerations do you need to account for?
• Which data would you need to gather?

Example: Waiting in the Hospital
• From conversations with the hospital’s leadership, major opportunities seem to exist in supply-side management
• Currently, doctors are staffed based on long-term contracts
• Pre-determined number of doctors at each hour
• However, current operations lead to high wait times at some times of the day and idle doctors at other times
• The hospital’s leadership would like to leverage a new opportunity: on-demand staffing of doctors
• Additional doctors can be staffed at each hour, at a price premium
→ Trade-off between staffing costs and patient waiting costs
What is a Policy?
• A policy is a complete contingency plan that tells us what to do at
each time period, under any possible situation
Time →            t=0    t=1    t=2    t=3    …
system    N=0      ?      ?      ?      ?
state     N=1      ?      ?      ?      ?
          N=2      ?      ?      ?      ?
          N=3      ?      ?      ?      ?
          ⋮

These question marks are our decision variables: one action for every (state, time) pair.
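
As a sketch, such a tabular policy can be represented directly as a lookup table in Python. The horizon, state cap, and placeholder rule below are illustrative, not optimized:

```python
# A tabular policy for the staffing example: one action per (state, time) pair.
T = 4             # illustrative planning horizon (t = 0, ..., 3)
MAX_PATIENTS = 3  # illustrative cap on the number of waiting patients

# policy[n][t] = number of extra doctors to staff when n patients wait at hour t.
# The "?" entries in the table above become concrete numbers once we optimize;
# here we fill them with an arbitrary placeholder rule.
policy = {n: {t: (1 if n >= 2 else 0) for t in range(T)}
          for n in range(MAX_PATIENTS + 1)}

n_waiting, t = 2, 1
print(f"{n_waiting} patients waiting at t={t}: staff {policy[n_waiting][t]} extra doctor(s)")
```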
Some Intuition
• The policy
• Number of on-demand doctors to staff in each hour as a function of number of
patients waiting
• What are some characteristics of the optimal policy?
• (marginal cost of hiring one more doctor) = (marginal cost of additional waiting)
• Factors that would change the optimal policy
• The cost of staffing on-demand doctors (per doctor-hour)
• The waiting cost (per patient-hour)
• The cost of unserved patients (per patient)
• The level of patient demand
• Open questions
• For how many patients do we start staffing additional doctors?
• Then, how many more doctors do we staff as more patients are waiting?
• How does all this depend on the time of day?

The Hospital Problem
• Stages: hour t = 1, 2, …, T (end of day)
• States: number of patients waiting
• Action (decision): how many on-demand doctors to staff
• Policy: a sequence of actions
• Transition function: how the number of waiting patients evolves from hour to hour
• Cost per stage: patient waiting costs + doctor staffing costs
• Terminal cost: cost of any unserved patient
State Variables
• Observable system characteristics: information needed to
compute costs/rewards, determine the optimal decisions, and
characterize the evolution of the system
• It can include different “types” of information
• Resource state, or physical state
• Information state—mainly, exogenous information on system dynamics
• Knowledge state, or belief state—useful when learning occurs
• Notation: x_t denotes the state at stage t
• Examples
• Hospital example: how many patients are waiting
• Climate policy: temperature, sea levels, state of the economy, etc.
• Power systems planning: current capacity, demand, technology, etc.
• Disaster response: state of evacuation, disaster magnitude, etc.

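
As a hypothetical sketch (not from the lecture), the different “types” of state information can be bundled into one state object; the field names below are invented for the hospital example:

```python
from dataclasses import dataclass

@dataclass
class HospitalState:
    patients_waiting: int    # resource/physical state: the queue length
    hour: int                # stage index
    forecast_arrivals: float # information state: exogenous demand forecast
    est_service_rate: float  # knowledge/belief state: learned estimate of patients treated per doctor-hour

state = HospitalState(patients_waiting=12, hour=9, forecast_arrivals=18.0, est_service_rate=2.0)
print(state)
```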
Decision Rules, or Policy
• Deterministic optimization focuses on decisions
• In stochastic optimization and MDPs, we must consider a decision rule, i.e., a decision as a function of the (unknown) state (e.g., 2nd-stage decisions in two-stage stochastic optimization) – the decision can be different in each state!

• The decision rule is called a policy
• Examples
• Hospital example: ______________________________
• Climate policy: emissions / cap & trade as a function of temperatures
• Disaster response: evacuation plan as a function of real-time info.

Transition Function
• Characterization of state in next period, given the state of the
system and the decision made in the current period
• Usually relies on Markovian “memoryless” property: State at
stage 𝑡 + 1 depends only on state and decision at stage 𝑡
• Relies on exogenous information: forecasts, models, etc.

• Examples
• Hospital example: future queue length = f(current queue length,
number of doctors)
• Climate policy: future temp = f(current temp., emissions)
• Disaster response: future state = f(current state, evacuation plan)

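
A minimal sketch of the hospital transition function in Python, using the case-study assumptions introduced later (10 contracted doctors, 2 patients treated per doctor-hour); the function and variable names are illustrative:

```python
BASE_DOCTORS = 10        # doctors on long-term contracts
PATIENTS_PER_DOCTOR = 2  # each doctor treats 2 patients per hour

def transition(queue_length, extra_doctors, arrivals):
    """Next queue length, given the current queue, the staffing decision,
    and the exogenous arrivals this hour (Markovian: depends only on stage t)."""
    capacity = (BASE_DOCTORS + extra_doctors) * PATIENTS_PER_DOCTOR
    return max(0, queue_length + arrivals - capacity)

# Example: 25 patients waiting, 2 extra doctors staffed, 10 new arrivals.
print(transition(25, 2, 10))  # -> max(0, 25 + 10 - 24) = 11
```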
Cost/Reward Contributions
• Minimization of total expected costs, or maximization of total expected rewards, across the full planning horizon:

min E[ g_1(x_1, u_1) + g_2(x_2, u_2) + … + g_{T-1}(x_{T-1}, u_{T-1}) + g_T(x_T) ]

where g_t(x_t, u_t) is the cost per stage, as a function of state and action, and g_T(x_T) is the terminal cost

• Examples
• Hospital example: cost of patients’ time + cost of staffing
• Climate policy: climate damages + economic costs
• Disaster response: number of fatalities

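
As a sketch, this objective can be evaluated for any fixed policy by simulating one trajectory, accumulating stage costs, and adding the terminal cost; all names and the toy dynamics below are illustrative:

```python
def total_cost(policy, transition, stage_cost, terminal_cost, x0, T):
    """Accumulate per-stage costs g_t(x_t, u_t) along a trajectory, then add g_T(x_T)."""
    x, total = x0, 0.0
    for t in range(T):
        u = policy(t, x)              # decision rule: action as a function of state
        total += stage_cost(t, x, u)  # cost per stage
        x = transition(t, x, u)       # move to the next state
    return total + terminal_cost(x)   # add the terminal cost

# Example with toy one-dimensional dynamics:
print(total_cost(
    policy=lambda t, x: 1,
    transition=lambda t, x, u: x + u,
    stage_cost=lambda t, x, u: x + u,
    terminal_cost=lambda x: 10 * x,
    x0=0, T=3,
))  # -> 36.0
```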
Summary
• Stages: t = 1, 2, …, T

• States: x_t

• Policy: u_t(x_t), a decision rule mapping each state and stage to an action

• Transition: x_{t+1} = f(x_t, u_t)

• Cost: per-stage cost g_t(x_t, u_t) plus terminal cost g_T(x_T)

• Objective: minimize total expected cost across the planning horizon
Case Study: Data
• Setting:
• The hospital operates for 12 hours between 8 am and 8 pm
• You start the day with an empty waiting room
• Doctor supply
• 10 doctors have been staffed for the full 12-hour period
• Each additional doctor can be staffed hourly at a cost of $500
• Each doctor can treat exactly 2 patients per hour
• Patient demand and waiting costs:
• The number of patients arriving at each hour is given (and may be
random)
• At the beginning of each hour, you incur a cost equal to 𝑤 for
each patient in the waiting room
• At the end of the day, you incur a cost 𝑊 for any remaining
patient in the waiting room

Case Study Model

• Transition function: x_{t+1} = max(0, x_t + a_t − 2(10 + u_t)), where x_t is the number of waiting patients, a_t the number of arrivals, and u_t the number of additional doctors staffed in hour t
• Cost function: g_t(x_t, u_t) = w·x_t + 500·u_t for t < T, with terminal cost g_T(x_T) = W·x_T

• “Bellman equation”: the cost J_t(x_t) of state x_t at time t satisfies
J_t(x_t) = min_{u_t} { w·x_t + 500·u_t + E[ J_{t+1}(x_{t+1}) ] }, with J_T(x_T) = W·x_T
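
As a sketch (not the lecture's reference implementation), the Bellman recursion above can be solved by backward induction. Deterministic arrivals are assumed for simplicity, and the arrival sequence, cost parameters w and W, and state/action caps are illustrative:

```python
# Finite-horizon DP for the hospital staffing model (deterministic arrivals).
T = 12                   # 12 operating hours (8 am - 8 pm)
BASE_CAPACITY = 10 * 2   # 10 contracted doctors, 2 patients each per hour
DOCTOR_COST = 500        # cost per additional doctor-hour
w, W = 100, 1000         # hourly and terminal waiting costs per patient (illustrative)
MAX_QUEUE = 60           # cap on the state space
MAX_EXTRA = 5            # at most 5 on-demand doctors per hour (illustrative)
arrivals = [18] * T      # assumed deterministic arrivals per hour

# J[t][x] = optimal cost-to-go from x patients waiting at hour t.
J = [[0.0] * (MAX_QUEUE + 1) for _ in range(T + 1)]
policy = [[0] * (MAX_QUEUE + 1) for _ in range(T)]
for x in range(MAX_QUEUE + 1):
    J[T][x] = W * x  # terminal cost

for t in range(T - 1, -1, -1):  # backward induction
    for x in range(MAX_QUEUE + 1):
        best_cost, best_u = float("inf"), 0
        for u in range(MAX_EXTRA + 1):
            x_next = min(MAX_QUEUE, max(0, x + arrivals[t] - (BASE_CAPACITY + 2 * u)))
            cost = w * x + DOCTOR_COST * u + J[t + 1][x_next]
            if cost < best_cost:
                best_cost, best_u = cost, u
        J[t][x], policy[t][x] = best_cost, best_u

print("Optimal action with 20 patients waiting at hour 0:", policy[0][20])
```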
Case Study: Results
• Hire additional doctors if…
• >15 patients waiting in early stages of the day
• Waiting room is non-empty towards the end of the day

[Figure: the optimal staffing policy u_t(x_t), plotted as a function of the number of waiting patients x_t and the hour t]
Case 2: Epidemic Control
Compartmental Model in Epidemiology: Variables and
Parameters

• N: population size
• S(t): number of susceptible people at time t
• I(t): number of actively infected people at time t
• R(t): number of recovered/removed people at time t
(S, I, and R are the states)

• S(t) + I(t) + R(t) = N for every t

• β: average number of people that an infectious person can infect within one time step

• γ: recovery rate
Compartmental Model in Epidemiology: Interpretable Dynamic Functions

• Transition (the standard discrete-time SIR dynamics, written out from the definitions of β and γ above):
S(t+1) = S(t) − β·S(t)·I(t)/N
I(t+1) = I(t) + β·S(t)·I(t)/N − γ·I(t)
R(t+1) = R(t) + γ·I(t)
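
A minimal simulation sketch of these dynamics in Python; the population size, parameter values, and initial condition are illustrative assumptions, not the lecture's:

```python
# Discrete-time SIR simulation using the transition equations above.
N = 10_000               # population size
beta, gamma = 0.3, 0.1   # infection and recovery rates (illustrative)
S, I, R = N - 10, 10, 0  # start with 10 infected people

for t in range(160):
    new_infections = beta * S * I / N
    new_recoveries = gamma * I
    S -= new_infections
    I += new_infections - new_recoveries
    R += new_recoveries
    if t % 40 == 0:
        print(f"day {t:3d}: S={S:7.0f}  I={I:7.0f}  R={R:7.0f}")
```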
Actions, reward…
• What are potential actions in epidemic control?
• What are the rewards / costs?
Sample System Trajectory
[Figure: sample system trajectories for Time to React = 25, 15, and 5 days, each under SD = 14 days and SD = 56 days]
Lecture Summary
Modeling Sequential Decision Problems
• Sequential decision problems are ubiquitous – you can model
almost every decision problem in this way
• A main challenge in this subject is learning the “language”
• States
• Actions
• Policies
• Reward / cost function
• Transition function
• With sufficient practice, you’ll find them to be very intuitive!
Course Objective
• Two focuses of this course
• Modeling of sequential decision problems
• Solution methods to solve these problems
• Key learning objective
• Learning to recognize and model sequential decision problems in a
disciplined way
• Mastering several key solution techniques conceptually and in Python
• Starting to interact with complex sequential decision problem environments
• Goal for subsequent research course
• Being able to design your own dynamic programming planning and
reinforcement learning environment
