Lecture 01: Introduction to Reinforcement Learning
Wilhelm Kirchgässner
19 July 2023
Table of contents
1 Course framework
4 Basic terminology
The teaching team
Contact
● Email: see department's homepage
● Offices: E-building, 4th floor
● Individual appointments on request (remote or in person)
Examination Regulations
Final exam
● Oral examination
● Roughly 45 minutes for presentation and discussion
● Individual appointment request via email (at least 2 weeks in advance)
Recommended textbooks
● Reinforcement Learning: An Introduction
● R. Sutton and G. Barto
● MIT Press, 2nd edition, 2018
● Available online
The basic reinforcement learning structure
Key characteristics:
● No supervisor
● Data-driven
● Discrete time steps
● Sequential data stream (not i.i.d. data)
● Agent actions affect subsequent data

Figure: The basic RL operation principle (sequential decision making)
(derivative of www.wikipedia.org, CC0 1.0)

The nomenclature of this slide set is based on the default variable usage
in control theory. In other RL books, one often finds s as state, a as
action and o as observation.
Agent and environment
At each step k the agent:
● Picks an action uk.
● Receives an observation yk.
● Receives a reward rk.
At each step k the environment:
● Receives an action uk.
● Emits an observation yk+1.
● Emits a reward rk+1.
Remark on time
A one-step time delay is assumed between executing the action and
receiving the observation as well as the reward. We assume that the
resulting time interval ∆t = tk+1 − tk is constant.
Some basic definitions from the literature
What is reinforcement?
● “Reinforcement is a consequence applied that will strengthen an
organism’s future behavior whenever that behavior is preceded by a
specific antecedent stimulus. [...] There are four types of
reinforcement: positive reinforcement, negative reinforcement,
extinction, and punishment.”, wikipedia.org (obtained 2023-03-31)
What is learning?
Context around reinforcement learning
Machine Learning
● Unsupervised Learning: process and interpret data based only on the input
  ● Clustering
  ● Dimension Reduction
● Supervised Learning: develop models to map input and output data
  ● Regression
  ● Classification
● Reinforcement Learning: learn optimal control actions to maximize long-term reward
  ● Single-Agent
  ● Multi-Agent
Context around machine learning
Many faces of reinforcement learning
Figure: Reinforcement learning at the intersection of many fields:
machine learning (computer science), optimal control (engineering),
reward system (neuroscience), operations research (mathematics),
classical/operant conditioning (psychology), bounded rationality (economics)
Methodical origins
History of reinforcement learning
Contemporary application examples
Reward
Reward examples
● Flipping a pancake:
● Pos. reward: catching the 180° rotated pancake
● Neg. reward: dropping the pancake on the floor
● Stock trading:
● Trading portfolio monetary value
● Playing Atari games:
● Highscore value at the end of a game episode
● Driving an autonomous car:
● Pos. reward: getting safely from A to B without crashing
● Neg. reward: hitting another car, pedestrian, bicycle,...
● Classical control task (e.g., electric drive, inverted pendulum,...):
● Pos. reward: following a given reference trajectory precisely
● Neg. reward: violating system constraints and/or large control error
Reward characteristics
Rewards can have many different flavors and depend heavily on the
given problem:
● Actions may have short and/or long term consequences.
● The reward for a certain action may be delayed.
● Examples: stock trading, strategic board games,...
● Rewards can take positive and negative values.
● Certain situations (e.g., car hits wall) might lead to a negative
reward.
● Exogenous impacts might introduce stochastic reward components.
Remark on reward
The RL agent's learning process is heavily linked with the reward
distribution over time. Designing expedient reward functions is
therefore crucially important for successfully applying RL. Often
there is no predefined way to design the "best" reward function.
The reward function hassle
● “Be careful what you wish for - you might get it.” (proverb)
● “...it grants what you ask for, not what you should have asked for
or what you intend.” (Norbert Wiener, American mathematician)
Task-dependent return definitions
Episodic tasks
● A problem which naturally breaks into subsequences (finite
horizon).
● Examples: most games, maze,...
● The return becomes a finite sum:
g_k = r_{k+1} + r_{k+2} + ⋯ + r_N . (2)
● Episodes end at their terminal step k = N.
Continuing tasks
● A problem which lacks a natural end (infinite horizon).
● Example: process control task
● The return should be discounted to prevent it from growing unboundedly:
g_k = r_{k+1} + γ r_{k+2} + γ^2 r_{k+3} + ⋯ = ∑_{i=0}^{∞} γ^i r_{k+i+1} . (3)
● Here, γ ∈ [0, 1] is the discount rate.
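The two return definitions translate directly into code. The following sketch (illustrative only, not part of the original slides) computes the finite episodic return of eq. (2) and a truncated discounted return approximating eq. (3) for a given reward sequence:

```python
def episodic_return(rewards):
    """Finite-horizon return g_k = r_{k+1} + ... + r_N, cf. eq. (2)."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return g_k = sum_i gamma^i * r_{k+i+1}, cf. eq. (3),
    truncated to the length of the given reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Example: rewards observed after step k
rewards = [1.0, 0.0, -1.0, 2.0]
print(episodic_return(rewards))          # 2.0
print(discounted_return(rewards, 0.9))   # 1.0 + 0.0 - 0.81 + 1.458 = 1.648
```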
Discounted rewards
Numeric viewpoint
● If γ = 1 and r_k > 0 for k → ∞, g_k in (3) becomes infinite.
● If γ < 1 and r_k is bounded for k → ∞, g_k in (3) is bounded.
Strategic viewpoint
● If γ ≈ 1: agent is farsighted.
● If γ ≈ 0: agent is shortsighted (only interested in the immediate reward).
Mathematical options
● The current return is the immediate reward plus the discounted future return:
g_k = r_{k+1} + γ r_{k+2} + γ^2 r_{k+3} + ⋯ = r_{k+1} + γ (r_{k+2} + γ r_{k+3} + ⋯)
    = r_{k+1} + γ g_{k+1} . (4)
● If r_k = r is a constant and γ < 1 one receives:
g_k = ∑_{i=0}^{∞} γ^i r = r ∑_{i=0}^{∞} γ^i = r / (1 − γ) . (5)
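A quick numerical sanity check (an illustrative sketch, not from the original slides) of the recursion in eq. (4) and the constant-reward closed form in eq. (5):

```python
gamma = 0.9
rewards = [1.0, -0.5, 2.0, 0.0, 1.5]

# Compute returns backwards via the recursion g_k = r_{k+1} + gamma * g_{k+1}, eq. (4)
g = 0.0
returns = []
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()
print(returns[0])   # return from the first step

# Constant reward r with gamma < 1: g_k = r / (1 - gamma), eq. (5)
r = 1.0
approx = sum(gamma**i * r for i in range(1000))   # truncated geometric series
print(approx, r / (1 - gamma))                    # both approximately 10.0
```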
State (1)
State (2)
Agent state
● Random variable X_k^a with realizations x_k^a
● Internal status representation of the agent
● In general: x_k^a ≠ x_k^e, e.g., due to measurement noise or an
additional memory of the agent
● Agent's condensed information relevant for the next action
● Dependent on the internal knowledge / policy representation of the
agent
● Continuous or discrete quantity
History and information state
Model examples with Markov states
x_{k+1} = f(x_k, u_k) ,
y_k = h(x_k, u_k) . (9)
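A concrete instance of eq. (9), here a linear time-invariant system with f(x, u) = A x + B u and h(x, u) = C x (an illustrative sketch; the matrices are made up for demonstration):

```python
import numpy as np

# Hypothetical LTI system matrices (double integrator, sample time 1 s)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.5],
              [1.0]])
C = np.array([[1.0, 0.0]])

def f(x, u):
    """State transition x_{k+1} = f(x_k, u_k)."""
    return A @ x + B @ u

def h(x, u):
    """Observation y_k = h(x_k, u_k); here independent of u."""
    return C @ x

x = np.zeros((2, 1))
u = np.array([[1.0]])
for k in range(3):
    y = h(x, u)
    x = f(x, u)
print(x.ravel())   # state after three constant unit inputs
```

Since x_{k+1} depends only on x_k and u_k, this state is Markov.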
Degree of observability
Full observability
● The agent directly measures the full environment state (e.g., y_k = I x_k).
● If x_k is Markov: Markov decision process (MDP).
Partial observability
● The agent does not have full access to the environment state (e.g.,
y_k = [I 0; 0 0] x_k).
● If x_k is Markov: partially observable MDP (POMDP).
● The agent may reconstruct state information x̂_k ≈ x_k (belief, estimate).
POMDP examples
● Technical systems with limited sensors (cutting costs)
● Poker game (unknown opponents' cards)
● Human health status (overly complex system)
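A minimal sketch (with made-up matrices) contrasting a fully observable measurement y_k = I x_k with a partially observable one where only the first state component is exposed:

```python
import numpy as np

x = np.array([2.0, -1.0])          # full environment state

C_full = np.eye(2)                 # full observability: y_k = I x_k
C_partial = np.array([[1.0, 0.0],
                      [0.0, 0.0]]) # partial observability: second component hidden

y_full = C_full @ x        # -> [ 2., -1.]
y_partial = C_partial @ x  # -> [ 2.,  0.]  (agent must estimate the hidden part)
print(y_full, y_partial)
```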
Action
● An action is the agent's degree of freedom for maximizing its
reward.
● Major distinction:
● Finite action set (FAS): u_k ∈ {u_k,1 , u_k,2 , . . .} ⊂ R^m
● Continuous action set (CAS): infinite number of actions, u_k ∈ R^m
● Deterministic u_k or random variable U_k
● Often state-dependent and potentially constrained: u_k ∈ U(x_k) ⊆ R^m
● Examples:
● Take a card during a blackjack game (FAS)
● Drive an autonomous car (CAS)
● Buy stock options for your trading portfolio (FAS/CAS)
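One possible way (illustrative only) to represent the two kinds of action sets in code: a finite set as an explicit list, a continuous set via box constraints with clipping onto the admissible set:

```python
import random

# Finite action set (FAS): explicit enumeration
fas_actions = ["hit", "stand"]          # e.g., blackjack
u_fas = random.choice(fas_actions)

# Continuous action set (CAS) with box constraints: u in [u_min, u_max]
u_min, u_max = -1.0, 1.0                # e.g., a normalized steering command
def clip(u):
    """Project an action onto the admissible set U(x_k)."""
    return min(max(u, u_min), u_max)

u_cas = clip(1.7)   # -> 1.0
print(u_fas, u_cas)
```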
Policy
● A (deterministic) policy maps the current state to an action:
u_k = π(x_k) . (10)
Deterministic policy example
Find optimal gains {Kp , Ki , Kd } given the reward r_k = −e_k² :
● The agent's behavior is explicitly determined by {Kp , Ki , Kd }.
● The reference value is part of the environment state: x_k = [y_k y_k^*].
● The control output is the agent's action: u_k = π(x_k ∣ Kp , Ki , Kd ).
Figure: Control loop with the agent acting as controller on the plant / process
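A possible sketch of such a parameterized deterministic policy, here a discrete-time PID controller whose gains are the learnable parameters (gain values are illustrative, not from the lecture):

```python
class PIDPolicy:
    """Deterministic policy u_k = pi(x_k | Kp, Ki, Kd) for a control error e_k = y*_k - y_k."""
    def __init__(self, Kp, Ki, Kd, dt=1.0):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.e_int = 0.0     # integrated error
        self.e_prev = 0.0    # previous error (for the derivative part)

    def __call__(self, y, y_ref):
        e = y_ref - y
        self.e_int += e * self.dt
        e_dot = (e - self.e_prev) / self.dt
        self.e_prev = e
        return self.Kp * e + self.Ki * self.e_int + self.Kd * e_dot

policy = PIDPolicy(Kp=1.2, Ki=0.1, Kd=0.05)
u = policy(y=0.0, y_ref=1.0)      # action for the current state [y, y*]
r = -(1.0 - 0.0) ** 2             # reward r_k = -e_k^2
print(u, r)
```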
Stochastic policy example
Two-player game of extended rock-paper-scissors:
● A deterministic policy can be easily exploited by the opponent.
● A uniform random policy, in contrast, is unpredictable (assuming
an ideal random number generator).
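A uniform random policy over a finite action set is simply a distribution to sample from; a minimal sketch (the particular extended action set below is an assumption for illustration):

```python
import random

actions = ["rock", "paper", "scissors", "lizard", "spock"]   # assumed extended variant

def uniform_policy():
    """Stochastic policy: every action has probability 1/|U|."""
    return random.choice(actions)

print([uniform_policy() for _ in range(5)])
```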
Value functions
● The state-value function is the expected return when being in state x_k and
following a policy π: v_π(x_k).
● Assuming an MDP problem structure, the state-value function is
v_π(x_k) = E_π [ G_k ∣ X_k = x_k ] = E_π [ ∑_{i=0}^{∞} γ^i R_{k+i+1} ∣ x_k ] . (12)
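The expectation in eq. (12) can be approximated by averaging sampled returns from many episodes starting in the same state. A minimal Monte Carlo sketch, assuming a hypothetical environment interface env.reset / env.step (not defined in the lecture):

```python
def mc_state_value(env, policy, x0, gamma=0.99, n_episodes=1000, max_steps=200):
    """Monte Carlo estimate of v_pi(x0) = E[G_k | X_k = x0]."""
    total = 0.0
    for _ in range(n_episodes):
        x = env.reset(x0)                  # assumed: start every episode in x0
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            u = policy(x)
            x, r, done = env.step(u)       # assumed environment interface
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / n_episodes
```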
Exploration and exploitation
● In RL the environment is initially unknown. How to act optimally?
● Exploration: find out more about the environment.
● Exploitation: maximize the current reward using the limited information available.
● Trade-off problem: what is the best split between both strategies?
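One common (though not the only) way to handle this trade-off is ε-greedy action selection. A minimal sketch over estimated action values, with a hypothetical value table q:

```python
import random

def epsilon_greedy(q, state, actions, eps=0.1):
    """With probability eps explore (random action), otherwise exploit the best-known action."""
    if random.random() < eps:
        return random.choice(actions)                   # exploration
    return max(actions, key=lambda u: q[(state, u)])    # exploitation

# Hypothetical action-value estimates for one state
q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
print(epsilon_greedy(q, "s0", ["left", "right"], eps=0.1))
```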
Model
● A model predicts the environment's behavior, e.g., a state-transition model.
● Or a reward model R.
Maze example
Problem statement:
● Reward: r_k = −1 per time step
● At the goal: episode termination
● Actions: u_k ∈ {N, E, S, W}
● State: the agent's location
● Deterministic problem (no stochastic influences)
Figure: Maze setup (start and goal cells)
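The maze fits the terminology above directly. A minimal gridworld sketch (layout, size and coordinates are made up for illustration) with reward −1 per step and termination at the goal:

```python
# Minimal deterministic gridworld: state = (row, col), actions = N, E, S, W
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
N_ROWS, N_COLS = 4, 4          # assumed maze size
GOAL = (3, 3)                  # assumed goal cell

def step(state, action):
    """Environment step: returns (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    # Stay in place when bumping into the outer wall
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS):
        r, c = state
    next_state = (r, c)
    done = next_state == GOAL
    return next_state, -1.0, done   # reward -1 per time step

state, g = (0, 0), 0.0
for a in ["S", "S", "S", "E", "E", "E"]:
    state, reward, done = step(state, a)
    g += reward
print(state, g, done)   # (3, 3) -6.0 True
```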
Inside an RL agent
Key characteristics:
● For any state there is a direct action command.
● The policy is explicitly available.
Figure: Maze with start and goal, annotated with the agent's policy
Maze example: RL solution
Maze example: model (by model evaluation)
RL agent taxonomy
Figure: RL agent taxonomy: value-based, policy-based, model-free and
model-based agents
RL vs. planning
Two fundamental solutions to sequential decision making:
● Reinforcement learning:
● The environment is initially unknown.
● The agent interacts with the environment.
● The policy is improved based on environment feedback (reward).
● Planning:
● An a priori environment model exists.
● The agent interacts with its own model.
● The policy is improved based on the model feedback ('virtual
reward').
For certain cases closed-form solutions can be found, e.g., an LTI system
with quadratic costs and no further constraints can be optimally
controlled by a linear-quadratic regulator (LQR). However, for arbitrary
cases that is not possible and one relaxes the problem to a finite-horizon
optimal control problem:
v_k^*(x_k) = min_{u_k} ∑_{i=0}^{N_p} γ^i r_{k+i+1}(x_{k+i}, u_{k+i}) . (17)
...
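For a small discrete problem, the finite-horizon optimization in eq. (17) can be solved naively by enumerating all action sequences over the horizon. The sketch below is illustrative only, with a made-up scalar system and a stage cost in place of r; real planners such as MPC solve this far more efficiently:

```python
from itertools import product

gamma, N_p = 1.0, 3
actions = [-1.0, 0.0, 1.0]                # small finite action set

def f(x, u):
    return x + u                          # toy scalar dynamics

def stage_cost(x, u):
    return x**2 + 0.1 * u**2              # cost to be minimized, cf. eq. (17)

def finite_horizon_plan(x0):
    """Exhaustively search the best action sequence over the horizon N_p."""
    best_cost, best_seq = float("inf"), None
    for seq in product(actions, repeat=N_p + 1):
        x, cost = x0, 0.0
        for i, u in enumerate(seq):
            cost += gamma**i * stage_cost(x, u)
            x = f(x, u)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

print(finite_horizon_plan(x0=2.0))   # drives the state toward zero
```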
MPC vs. RL
Hence, MPC and RL are two sides of the same coin. Both share the
same general goal (sequential optimal decision making), but follow their
own philosophy:

Property                    MPC                 RL
Objective                   minimize costs      maximize return
A priori model              required            not required
Pre-knowledge integration   easy                rather complex
Constraint handling         inherent            only indirect
Adaptivity                  requires add-ons    inherent
Online complexity           it depends          it depends
Stability theory            mature              immature

Table: Key differences of MPC vs. RL
(inspired by D. Görges, Relations between Model Predictive Control and
Reinforcement Learning, IFAC-PapersOnLine 50-1, pp. 4920-4928, 2017)
Summary: what you’ve learned in this lecture