
Optimal Control and Planning via Dynamic Programming and Reinforcement Learning Methods
Lecture 1: Introduction to DP and RL
Peter Zhang
Introductions
Instructor
TA
Yourselves
Educational background: high school, undergraduate, or graduate
Preparation in linear algebra and calculus: little, medium, or advanced
Prior exposure to artificial intelligence
Motivating Examples
Sequential Decision Problems are Everywhere!
• Planning your way from residence to classroom
• Deciding whether to go to graduate school or go into the industry
• Deciding what to do with your current savings
•…

• These problems share a common trait:


• you have a sequence of decisions to make today, tomorrow, the day after
tomorrow,...
• they all affect each other!
Sequential Decision Problems are Fundamental in Artificial (General) Intelligence
• Examples…
Games

Path Planning

Vendor’s Pricing Problem

Airline’s Pricing Problem

Netflix’s Recommendation

Flappy Bird

[Figure: gameplay screenshots at 2,000, 20,000, and 2,000,000 training iterations]
RL with Human Feedback for LLMs
Autonomous Driving and Traffic Simulation
with CARLA and SUMMIT
https://www.youtube.com/watch?v=dNiR0z2dROg
Course Organization
Course Objective
• Two focuses of this course
• Modeling of sequential decision problems
• Solution methods to solve these problems
• Key learning objective
• Learning to recognize and model sequential decision problems in a
disciplined way
• Mastering several key solution techniques conceptually and in Python
• Starting to interact with complex sequential decision problem environments
• Goal for subsequent research course
• Being able to design your own dynamic programming planning and
reinforcement learning environment
Course Outline
Overview of Model Notations
Word of Caution
1. This subject is highly interdisciplinary
• Operations research: dynamic programming
• Engineering: control theory
• Computer science: reinforcement learning
• …
2. Unlike many other subjects, learning the notations of this
subject is (more than) half the battle
• Unfortunately, different domains use different sets of notations
• We will switch between two sets of notations – not to confuse you, but to
familiarize you with the wild world out there
Key Concepts
• Agent
• Environment
• States
• Actions (e.g., for a simple game agent: stay still, jump, go left, go right)
• Policies
• Reward / cost function
• Transition function
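
To make these concepts concrete, here is a minimal sketch of the standard agent–environment interaction loop in Python. The toy environment, goal state, and reward values are illustrative placeholders, not part of any specific library:

```python
import random

class SimpleEnvironment:
    """A toy environment: the agent walks on a line and is rewarded at a goal state."""
    def __init__(self):
        self.state = 0  # initial state

    def step(self, action):
        # Transition function: the action shifts the state left (-1) or right (+1).
        self.state += action
        # Reward function: +1 at the goal, a small per-step cost otherwise.
        reward = 1.0 if self.state == 5 else -0.1
        done = self.state == 5
        return self.state, reward, done

def random_policy(state):
    """A policy maps each state to an action; here, chosen uniformly at random."""
    return random.choice([-1, +1])

env = SimpleEnvironment()
state, total_reward = env.state, 0.0
for _ in range(1000):  # cap the episode length
    action = random_policy(state)           # the agent picks an action
    state, reward, done = env.step(action)  # the environment transitions and emits a reward
    total_reward += reward
    if done:
        break
print(f"Episode ended in state {state} with total reward {total_reward:.1f}")
```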
Case 1: Hospital Management
We will unpack this example in the next few lectures. For now, if you do not
follow the details, that is okay!
Example: Waiting in the Hospital

• Staffing issues in the hospital (e.g., ER)


• …too many nurses and doctors on a unit one hour and too few the next
• Results in long wait times, under-utilized resources

Example: Waiting in the Hospital
• How do we approach this problem?
• Step 1: gather longitudinal patient data – arrival rates, demographics, symptoms – and predict future trends

Example: Waiting in the Hospital
• Step 2: gather data about staff availability, wages, labor union rules, etc.
• Step 3: optimize
• Reading Hospital (West Reading, PA) has been saving $1M per year after implementing new staffing policies in 2021 to balance supply and demand
• How would you have solved the problem?
Example: Waiting in the Hospital
• You have been hired to review a hospital’s operations
• The hospital provides high-quality care to its patients, but
there have been complaints due to long wait times in the past
• What are possible solutions one might consider?
• Which practical considerations do you need to account for?
• Which data would you need to gather?

Example: Waiting in the Hospital
• From conversations with the hospital’s leadership, major opportunities seem to exist in supply-side management
• Currently, doctors are staffed based on long-term contracts
• Pre-determined number of doctors at each hour
• However, current operations lead to high wait times at some times of the day and idle doctors at other times
• The hospital’s leadership would like to leverage a new opportunity: on-demand staffing of doctors
• Additional doctors can be staffed at each hour, at a price premium
→ Trade-off between staffing costs and patient waiting costs
What is a Policy?
• A policy is a complete contingency plan that tells us what to do at
each time period, under any possible situation
Time →            t=0    t=1    t=2    t=3    …
system    N=0      ?      ?      ?      ?
state     N=1      ?      ?      ?      ?
          N=2      ?      ?      ?      ?
          N=3      ?      ?      ?      ?
          ⋮

These question marks are our decision variables: one action for every (state, time) pair.
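
As a sketch, such a tabular policy can be represented directly as a lookup table in Python. The horizon, state cap, and placeholder rule below are illustrative, not optimized:

```python
# A tabular policy for the staffing example: one action per (state, time) pair.
T = 4             # illustrative planning horizon (t = 0, ..., 3)
MAX_PATIENTS = 3  # illustrative cap on the number of waiting patients

# policy[n][t] = number of extra doctors to staff when n patients wait at hour t.
# The "?" entries in the table above become concrete numbers once we optimize;
# here we fill them with an arbitrary placeholder rule.
policy = {n: {t: (1 if n >= 2 else 0) for t in range(T)}
          for n in range(MAX_PATIENTS + 1)}

n_waiting, t = 2, 1
print(f"{n_waiting} patients waiting at t={t}: staff {policy[n_waiting][t]} extra doctor(s)")
```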
Some Intuition
• The policy
• Number of on-demand doctors to staff in each hour as a function of number of
patients waiting
• What are some characteristics of the optimal policy?
• (marginal cost of hiring one more doctor) = (marginal cost of additional waiting)
• Factors that would change the optimal policy
• The cost of staffing on-demand doctors (per doctor-hour)
• The waiting cost (per patient-hour)
• The cost of unserved patients (per patient)
• The level of patient demand
• Open questions
• For how many patients do we start staffing additional doctors?
• Then, how many more doctors do we staff as more patients are waiting?
• How does all this depend on the time of day?

The Hospital Problem
• Stages: hour t = 1, 2, …, T (end of day)
• States: number of patients waiting
• Action (decision): how many on-demand doctors to staff
• Policy: a sequence of actions
• Transition function: how the number of waiting patients evolves from hour to hour
• Cost per stage: patient waiting costs + doctor staffing costs
• Terminal cost: cost of any unserved patient
State Variables
• Observable system characteristics: information needed to
compute costs/rewards, determine the optimal decisions, and
characterize the evolution of the system
• It can include different “types” of information
• Resource state, or physical state
• Information state—mainly, exogenous information on system dynamics
• Knowledge state, or belief state—useful when learning occurs
• Notation: x_t denotes the state at stage t
• Examples
• Hospital example: how many patients are waiting
• Climate policy: temperature, sea levels, state of the economy, etc.
• Power systems planning: current capacity, demand, technology, etc.
• Disaster response: state of evacuation, disaster magnitude, etc.

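
As a hypothetical sketch (not from the lecture), the different “types” of state information can be bundled into one state object; the field names below are invented for the hospital example:

```python
from dataclasses import dataclass

@dataclass
class HospitalState:
    patients_waiting: int    # resource/physical state: the queue length
    hour: int                # stage index
    forecast_arrivals: float # information state: exogenous demand forecast
    est_service_rate: float  # knowledge/belief state: learned estimate of patients treated per doctor-hour

state = HospitalState(patients_waiting=12, hour=9, forecast_arrivals=18.0, est_service_rate=2.0)
print(state)
```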
Decision Rules, or Policy
• Deterministic optimization focuses on decisions
• In stochastic optimization and MDPs, we must consider a decision rule, i.e., a decision as a function of the (unknown) state (e.g., 2nd-stage decisions in two-stage stochastic optimization) – the decision can be different in each state!

• The decision rule is called a policy
• Examples
• Hospital example: ______________________________
• Climate policy: emissions / cap & trade as a function of temperatures
• Disaster response: evacuation plan as a function of real-time info.

Transition Function
• Characterization of state in next period, given the state of the
system and the decision made in the current period
• Usually relies on Markovian “memoryless” property: State at
stage 𝑡 + 1 depends only on state and decision at stage 𝑡
• Relies on exogenous information: forecasts, models, etc.

• Examples
• Hospital example: future queue length = f(current queue length,
number of doctors)
• Climate policy: future temp = f(current temp., emissions)
• Disaster response: future state = f(current state, evacuation plan)

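
A minimal sketch of the hospital transition function in Python, using the case-study assumptions introduced later (10 contracted doctors, 2 patients treated per doctor-hour); the function and variable names are illustrative:

```python
BASE_DOCTORS = 10        # doctors on long-term contracts
PATIENTS_PER_DOCTOR = 2  # each doctor treats 2 patients per hour

def transition(queue_length, extra_doctors, arrivals):
    """Next queue length, given the current queue, the staffing decision,
    and the exogenous arrivals this hour (Markovian: depends only on stage t)."""
    capacity = (BASE_DOCTORS + extra_doctors) * PATIENTS_PER_DOCTOR
    return max(0, queue_length + arrivals - capacity)

# Example: 25 patients waiting, 2 extra doctors staffed, 10 new arrivals.
print(transition(25, 2, 10))  # -> max(0, 25 + 10 - 24) = 11
```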
Cost/Reward Contributions
• Minimization of total expected costs, or maximization of total expected rewards, across the full planning horizon:

min E[ g_1(x_1, u_1) + g_2(x_2, u_2) + … + g_{T-1}(x_{T-1}, u_{T-1}) + g_T(x_T) ]

where g_t(x_t, u_t) is the cost per stage, as a function of state and action, and g_T(x_T) is the terminal cost

• Examples
• Hospital example: cost of patients’ time + cost of staffing
• Climate policy: climate damages + economic costs
• Disaster response: number of fatalities

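
As a sketch, this objective can be evaluated for any fixed policy by simulating one trajectory, accumulating stage costs, and adding the terminal cost; all names and the toy dynamics below are illustrative:

```python
def total_cost(policy, transition, stage_cost, terminal_cost, x0, T):
    """Accumulate per-stage costs g_t(x_t, u_t) along a trajectory, then add g_T(x_T)."""
    x, total = x0, 0.0
    for t in range(T):
        u = policy(t, x)              # decision rule: action as a function of state
        total += stage_cost(t, x, u)  # cost per stage
        x = transition(t, x, u)       # move to the next state
    return total + terminal_cost(x)   # add the terminal cost

# Example with toy one-dimensional dynamics:
print(total_cost(
    policy=lambda t, x: 1,
    transition=lambda t, x, u: x + u,
    stage_cost=lambda t, x, u: x + u,
    terminal_cost=lambda x: 10 * x,
    x0=0, T=3,
))  # -> 36.0
```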
Summary
• Stages: t = 1, 2, …, T

• States: x_t

• Policy: u_t(x_t), a decision rule mapping each state and stage to an action

• Transition: x_{t+1} = f(x_t, u_t)

• Cost: per-stage cost g_t(x_t, u_t) plus terminal cost g_T(x_T)

• Objective: minimize total expected cost across the planning horizon
Case Study: Data
• Setting:
• The hospital operates for 12 hours between 8 am and 8 pm
• You start the day with an empty waiting room
• Doctor supply
• 10 doctors have been staffed for the full 12-hour period
• Each additional doctor can be staffed hourly at a cost of $500
• Each doctor can treat exactly 2 patients per hour
• Patient demand and waiting costs:
• The number of patients arriving at each hour is given (and may be
random)
• At the beginning of each hour, you incur a cost equal to 𝑤 for
each patient in the waiting room
• At the end of the day, you incur a cost 𝑊 for any remaining
patient in the waiting room

Case Study Model

• Transition function: x_{t+1} = max(0, x_t + a_t − 2(10 + u_t)), where x_t is the number of waiting patients, a_t the number of arrivals, and u_t the number of additional doctors staffed in hour t
• Cost function: g_t(x_t, u_t) = w·x_t + 500·u_t for t < T, with terminal cost g_T(x_T) = W·x_T

• “Bellman equation”: the cost J_t(x_t) of state x_t at time t satisfies
J_t(x_t) = min_{u_t} { w·x_t + 500·u_t + E[ J_{t+1}(x_{t+1}) ] }, with J_T(x_T) = W·x_T
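
As a sketch (not the lecture's reference implementation), the Bellman recursion above can be solved by backward induction. Deterministic arrivals are assumed for simplicity, and the arrival sequence, cost parameters w and W, and state/action caps are illustrative:

```python
# Finite-horizon DP for the hospital staffing model (deterministic arrivals).
T = 12                   # 12 operating hours (8 am - 8 pm)
BASE_CAPACITY = 10 * 2   # 10 contracted doctors, 2 patients each per hour
DOCTOR_COST = 500        # cost per additional doctor-hour
w, W = 100, 1000         # hourly and terminal waiting costs per patient (illustrative)
MAX_QUEUE = 60           # cap on the state space
MAX_EXTRA = 5            # at most 5 on-demand doctors per hour (illustrative)
arrivals = [18] * T      # assumed deterministic arrivals per hour

# J[t][x] = optimal cost-to-go from x patients waiting at hour t.
J = [[0.0] * (MAX_QUEUE + 1) for _ in range(T + 1)]
policy = [[0] * (MAX_QUEUE + 1) for _ in range(T)]
for x in range(MAX_QUEUE + 1):
    J[T][x] = W * x  # terminal cost

for t in range(T - 1, -1, -1):  # backward induction
    for x in range(MAX_QUEUE + 1):
        best_cost, best_u = float("inf"), 0
        for u in range(MAX_EXTRA + 1):
            x_next = min(MAX_QUEUE, max(0, x + arrivals[t] - (BASE_CAPACITY + 2 * u)))
            cost = w * x + DOCTOR_COST * u + J[t + 1][x_next]
            if cost < best_cost:
                best_cost, best_u = cost, u
        J[t][x], policy[t][x] = best_cost, best_u

print("Optimal action with 20 patients waiting at hour 0:", policy[0][20])
```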
Case Study: Results
• Hire additional doctors if…
• >15 patients waiting in early stages of the day
• Waiting room is non-empty towards the end of the day

[Figure: the optimal staffing policy u_t(x_t), plotted as a function of the number of waiting patients x_t and the hour t]
Case 2: Epidemic Control
Compartmental Model in Epidemiology: Variables and
Parameters

• N: population size
• S(t): number of susceptible people at time t
• I(t): number of actively infected people at time t
• R(t): number of recovered/removed people at time t
(S, I, and R are the states)

• S(t) + I(t) + R(t) = N for every t

• β: average number of people that an infectious person can infect within one time step

• γ: recovery rate
Compartmental Model in Epidemiology: Interpretable Dynamic Functions

• Transition (the standard discrete-time SIR dynamics, written out from the definitions of β and γ above):
S(t+1) = S(t) − β·S(t)·I(t)/N
I(t+1) = I(t) + β·S(t)·I(t)/N − γ·I(t)
R(t+1) = R(t) + γ·I(t)
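
A minimal simulation sketch of these dynamics in Python; the population size, parameter values, and initial condition are illustrative assumptions, not the lecture's:

```python
# Discrete-time SIR simulation using the transition equations above.
N = 10_000               # population size
beta, gamma = 0.3, 0.1   # infection and recovery rates (illustrative)
S, I, R = N - 10, 10, 0  # start with 10 infected people

for t in range(160):
    new_infections = beta * S * I / N
    new_recoveries = gamma * I
    S -= new_infections
    I += new_infections - new_recoveries
    R += new_recoveries
    if t % 40 == 0:
        print(f"day {t:3d}: S={S:7.0f}  I={I:7.0f}  R={R:7.0f}")
```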
Actions, reward…
• What are potential actions in epidemic control?
• What are the rewards / costs?
Sample System Trajectory
[Figure: sample system trajectories for Time to React = 25, 15, and 5 days, each under SD = 14 days and SD = 56 days]
Lecture Summary
Modeling Sequential Decision Problems
• Sequential decision problems are ubiquitous – you can model
almost every decision problem in this way
• A main challenge in this subject is learning the “language”
• States
• Actions
• Policies
• Reward / cost function
• Transition function
• With sufficient practice, you’ll find them to be very intuitive!
Course Objective
• Two focuses of this course
• Modeling of sequential decision problems
• Solution methods to solve these problems
• Key learning objective
• Learning to recognize and model sequential decision problems in a
disciplined way
• Mastering several key solution techniques conceptually and in Python
• Starting to interact with complex sequential decision problem environments
• Goal for subsequent research course
• Being able to design your own dynamic programming planning and
reinforcement learning environment
