
MA338: Dynamic Programming and Reinforcement Learning

Lecture 1

Felipe Maldonado
Department of Mathematical Sciences
University of Essex
email: felipe.maldonado@essex.ac.uk

Key references

• Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (2nd Edition), MIT Press, Cambridge, MA, 2018. http://incompleteideas.net/book/RLbook2020.pdf

• Csaba Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 4.1 (2010): 1-103. https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf

• Martin L. Puterman, Markov Decision Processes, Handbooks in Operations Research and Management Science 2 (1990): 331-434.

• Wayne L. Winston, Operations Research: Applications and Algorithms, 4th edition, 2004.

• Andreas Lindholm, Niklas Wahlstroem, Fredrik Lindsten, and Thomas B. Schoen, Machine Learning - A First Course for Engineers and Scientists, Cambridge University Press, 2022. http://smlbook.org

Module Information:

Module materials homepage: via Moodle


Lectures:
• Tuesday, 16:00 - 18:00, IC HallSem (LH.1.12), https://findyourway.essex.ac.uk/bcdc98e0-e3c3-11eb-b52e-05a67b7792fc/search/projects/23/60ef1a852031e800c2303d90
• (PG only) Thursday, 16:00 - 17:00, Room CTC.2.05, (weeks 21-25)
Lab Sessions (Everyone):
• Friday, 09:00 - 10:00, IT Lab T

Assessment:
• Lab Assignment: 10%
• Project: 20%
• Examination: 70%

Module Outline

1. Introduction to Reinforcement Learning

Part I: Tabular Methods


2. Multi-armed Bandits
3. (Finite) Markov Decision Processes
4. Dynamic Programming
5. Monte Carlo (MC) Methods
6. Temporal Difference (TD) Learning (Q-Learning and SARSA)
7. Extensions: n-step TD and Planning Methods

Part II: Approximate Methods


8. Prediction with Approximation
9. Control with Approximation
10. Policy-Gradient and Actor-Critic Methods
11. (PGT) Special Topics

1. Introduction to Reinforcement Learning

People vector created by freepik - www.freepik.com

• Computational approach to learning from interactions.

• Trial and error and delayed rewards: the most distinguishing features of Reinforcement Learning (RL).

• Formalisation of the methods: capture the most important aspects of the problem facing a learning agent that interacts
over time with its environment to achieve a goal.

• A reinforcement learning agent must be able to sense the environment, take actions that affect its
current state, and have a particular goal.

• In Supervised Learning (SL), learning occurs from a training set of labelled examples provided by
a knowledgeable external supervisor. But in uncharted territory, where one would expect learning to be most
beneficial, an agent must be able to learn from its own experience.

• Unsupervised Learning (UL) is typically about finding structure hidden in collections of
unlabelled data. Uncovering structure in an agent's experience can certainly be useful in RL, but by
itself it does not address the reinforcement learning problem of maximising a reward signal.

• UL could tell you how to identify places where wild animals live, SL could teach you to recognise a lion, and RL
could tell you that you need to run from that place.

• Unlike Supervised and Unsupervised Learning, RL has to deal with the trade-off of exploration and
exploitation.

• To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past
and found to be effective in producing reward. But what if it has not explored the best action yet?

• The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at
the task. The agent must try a variety of actions and progressively favour those that appear to be best.

• Curse of dimensionality: What is it? How do we tackle it?

• Core algorithms for RL were originally inspired by biological learning systems (e.g., genetic algorithms)

Examples

• Chess

• Roomba

• Prepare breakfast

Remarks

• These examples share features that are so basic that they are easy to overlook. All involve interaction
between an active decision-making agent and its environment, within which the agent seeks to achieve a
goal despite uncertainty about its environment.

• Correct choice requires taking into account indirect, delayed consequences of actions, and thus may
require foresight or planning.

General Elements of Reinforcement Learning Problems

• Optimisation: Find an optimal way to make decisions.

• Delayed Consequences: Decisions NOW can impact things much LATER

• Exploration: Learning about the world by making decisions. But those decisions impact what we learn.

• Decisions: based on a POLICY that takes past experience and recommends an action.

A (reinforcement) learning agent uses a POLICY to make its decisions: it observes states of the environment
(given by a MODEL or by real experience), considers its past rewards, and decides accordingly.
A REWARD SIGNAL defines the goal of a reinforcement learning problem. On each time step, the
environment sends the reinforcement learning agent a single number called the reward (signalling good and bad
events for the agent).
A VALUE FUNCTION specifies what is good in the long run. The value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state.
Example: https://www.youtube.com/watch?v=qy_mIEnnlF4 Sheldon trains Penny with
positive reinforcement when she does what he thinks is good behaviour.

Agent and Environment


At each step t, the agent: executes an action $A_t$, receives an observation $O_t$ (how the environment changes),
and receives a reward $R_t$.
At each step t, the environment: receives the action $A_t$, emits an observation $O_{t+1}$, and emits a reward $R_{t+1}$.
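To make this interaction loop concrete, here is a minimal Python sketch of the agent-environment protocol. The Env and Agent classes, the two dummy actions, and the Gaussian reward are illustrative assumptions only, not part of any particular library.

import random

class Env:
    """Toy environment: the observation is just a step counter, the reward is noisy."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        # Receives action A_t, emits observation O_{t+1} and reward R_{t+1}.
        self.t += 1
        observation = self.t
        reward = random.gauss(float(action), 1.0)
        return observation, reward

class Agent:
    def act(self, observation):
        # Placeholder policy: choose one of two actions uniformly at random.
        return random.choice([0, 1])

env, agent = Env(), Agent()
observation = 0
for t in range(5):
    action = agent.act(observation)          # agent executes A_t
    observation, reward = env.step(action)   # environment emits O_{t+1}, R_{t+1}
    print(t, action, observation, round(reward, 2))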

Motivation: Flappy Bird



Part I. Tabular Methods


• Here we describe almost all the core ideas of reinforcement learning algorithms in their simplest forms:
that in which the state and action spaces are small enough for the approximate value functions to be
represented as arrays, or tables. In this case, the methods can often find exact solutions, that is, they can
often find exactly the optimal value function and the optimal policy.

• Section 2: the special case of the reinforcement learning problem in which there is only a single state, known as
bandit problems.

• Section 3: the general problem formulation, finite Markov decision processes, and its main ideas, including
Bellman equations and value functions.

• Sections 4, 5 and 6 describe three fundamental classes of methods for solving RL problems:
Dynamic Programming, Monte Carlo methods, and Temporal-Difference learning.

• Section 7: how the strengths of Monte Carlo methods can be combined with the strengths of temporal-difference
methods via multi-step bootstrapping, and how both can be combined with model learning in planning methods.

2. Multi-armed Bandits

2.1 k-armed Bandit Problem

• Repeatedly choose among k different options, or actions.

• After each choice you receive a numerical reward chosen from a stationary probability
distribution that depends on the action you selected.

• Value of action a: the expected or mean reward of action a, given that it was selected at some point
in time.

• Objective: maximise the expected total reward over a period of time.
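As a concrete illustration, below is a minimal Python sketch of a stationary k-armed bandit. The class name, the Gaussian reward distributions, and the seed are assumptions made purely for this example.

import random

class KArmedBandit:
    def __init__(self, k=10, seed=0):
        rng = random.Random(seed)
        # True action values q_*(a); unknown to the learning agent.
        self.q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]
        self.k = k

    def pull(self, a):
        # Reward drawn from a stationary distribution that depends on the chosen action.
        return random.gauss(self.q_star[a], 1.0)

bandit = KArmedBandit(k=10)
print(bandit.pull(3))   # one noisy reward sample for arm 3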

Notation

$A_t$: the action selected at time step t.

$R_t$: the corresponding reward from selecting $A_t$ at time t.

$q_*(a)$: the value of an arbitrary action a: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$. We assume that we do not know these values, but
that we have estimates of them (a).

$Q_t(a)$: the estimated value of action a at time t, which we would like to be close to $q_*(a)$.

Definition: Assume that for all actions a we have good estimates $Q_t(a) \approx q_*(a)$. An action is called a
Greedy Action if it is chosen such that $A_t := \operatorname{argmax}_a Q_t(a)$.

Selecting one of the greedy actions means that the agent is exploiting its current knowledge. If it chooses a
non-greedy action, we say that it is exploring. There are sophisticated ways to balance the exploration-exploitation
trade-off, but they rely on strong assumptions that are rarely satisfied in real examples. In this module we will
only focus on finding a (reasonable) balance.

Example: Listening to unknown songs but with common names.


(a) What happens if we know those values?
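A small illustrative Python helper for picking a greedy action. Breaking ties at random (rather than always taking the first maximiser) is an assumption worth making explicit, since several estimates often coincide, e.g. at initialisation.

import random

def greedy_action(Q):
    """Return argmax_a Q(a), breaking ties uniformly at random."""
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])

print(greedy_action([0.0, 0.0, 0.2, 0.2]))   # prints 2 or 3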

2.2 Action-Value Methods

• Action-value methods are methods for estimating the values of actions (their mean reward) and for
using those estimates to make action decisions.

• One natural way to estimate this is by averaging the rewards actually received:

$$Q_t(a) := \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]}$$

As the denominator goes to infinity (when $t \to \infty$), by the Law of Large Numbers, $Q_t(a)$ converges to
$q_*(a)$. We call this the sample-average method for estimating action values because each estimate is an
average of the sample of relevant rewards.
The simplest action selection rule is to select one of the actions with the highest estimated value: greedy
actions $A_t = \operatorname{argmax}_a Q_t(a)$.
Any issue with this approach? ... only exploitation.
One simple alternative is to behave greedily most of the time, but with a small probability $\varepsilon$ to select
another action at random (independently of the action-value estimates). We call this type of method $\varepsilon$-greedy.
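As a sketch of these two ideas, the sample-average estimate and $\varepsilon$-greedy selection could be coded as below. The function names and the history format (a list of (action, reward) pairs) are illustrative assumptions, not the lab code.

import random

def sample_average(history, a):
    """Q_t(a): average of the rewards received on the steps where action a was taken."""
    rewards = [r for (action, r) in history if action == a]
    return sum(rewards) / len(rewards) if rewards else 0.0

def epsilon_greedy(history, k, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(k)                                   # explore: random action
    Q = [sample_average(history, a) for a in range(k)]
    best = max(Q)
    return random.choice([a for a in range(k) if Q[a] == best])      # exploit greedily

history = [(0, 1.0), (1, 0.5), (0, 0.8)]
print(epsilon_greedy(history, k=3))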

2.3 Incremental Methods


One of the key elements of RL methods in general is the natural recursiveness in how values can be
computed. This can save a lot of computation if done properly.

• Let us consider a single action. Let $R_i$ now denote the reward received after the $i$-th selection of this
action.

• Let $Q_n$ denote the estimate of its action value after it has been selected $n-1$ times, which we can now
write simply as

$$Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1} \qquad (2.1)$$

• The incremental formula for $Q_n$ can be written as

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right] \qquad (2.2)$$

• Incremental formulas of this type,
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate],
are quite useful since they allow us to update averages with a small, constant amount of computation (see the numerical check below).

• Target − OldEstimate is an error in the estimate; the update moves the estimate a step closer to the target.
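A quick numerical sanity check of the incremental rule; this is only an illustrative sketch, not a substitute for the algebra asked for in Exercise 1 below.

import random

rewards = [random.gauss(1.0, 1.0) for _ in range(1000)]

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q = Q + (R - Q) / n                  # incremental update, Eq. (2.2)

batch = sum(rewards) / len(rewards)      # plain sample average, Eq. (2.1)
print(round(Q, 10), round(batch, 10))    # identical up to floating-point error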

Exercise 1: Show how to obtain Eq. (2.2) from Eq. (2.1)



Solution Exercise 1:

Pseudocode for bandit algorithm

Algorithm 1: A simple bandit algorithm

Result: action-value estimates Q(a), updated incrementally as in Eq. (2.2)
Initialisation: for a = 1 to k (the number of arms):
    Q(a) ← 0
    N(a) ← 0
while True do
    A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly),
        or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
end
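A minimal Python rendering of Algorithm 1 follows; it is a sketch only, assuming Gaussian rewards for the bandit(A) call and random tie-breaking in the argmax.

import random

k, epsilon, steps = 10, 0.1, 1000
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]   # true values, unknown to the agent

def bandit(A):
    # R <- bandit(A): noisy reward for arm A (Gaussian, purely for illustration).
    return random.gauss(q_star[A], 1.0)

Q = [0.0] * k   # action-value estimates
N = [0] * k     # selection counts

for _ in range(steps):
    if random.random() < epsilon:                     # random action with probability epsilon
        A = random.randrange(k)
    else:                                             # greedy action, breaking ties randomly
        best = max(Q)
        A = random.choice([a for a in range(k) if Q[a] == best])
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                         # incremental update, Eq. (2.2)

# Compare the arm the agent thinks is best with the truly best arm.
print(max(range(k), key=lambda a: Q[a]), max(range(k), key=lambda a: q_star[a]))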

2.4 Nonstationary Problem


An RL problem where the true action values (reward probabilities) change over time is called a nonstationary
problem. Most of the problems we will study fall into this category. In such cases it makes sense to give more
weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a
constant step-size parameter $\alpha \in (0, 1]$:

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] \qquad (2.3)$$

Exercise 2: Show that (2.3) can be written as

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \qquad (2.4)$$

Exercise 3: Check that $(1-\alpha)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1$.

Note that the weight $\alpha(1-\alpha)^{n-i}$ given to the reward $R_i$ depends on how many rewards ago, $n-i$, it was
observed. The quantity $1-\alpha < 1$, and thus the weight given to $R_i$ decreases as $n$ increases. In fact, the
weight decays exponentially according to the exponent on $1-\alpha$.
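A quick numerical check (a sketch only, and no substitute for the algebra in Exercises 2 and 3): iterating the recursion (2.3) from $Q_1$ gives the same value as the weighted expansion (2.4) for an arbitrary reward sequence.

import random

alpha, n, Q1 = 0.3, 20, 0.5
R = [random.gauss(1.0, 1.0) for _ in range(n)]   # rewards R_1, ..., R_n

# Iterate the recursion (2.3).
Q = Q1
for r in R:
    Q = Q + alpha * (r - Q)

# Evaluate the expansion (2.4) directly.
expanded = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * R[i - 1] for i in range(1, n + 1)
)

print(round(Q, 12), round(expanded, 12))   # identical up to floating-point error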

Solution Exercise 2:

Solution Exercise 3:

Sometimes it is convenient to vary the step-size parameter from step to step. Let $\alpha_n(a)$ denote the step-size
parameter used to process the reward received after the nth selection of action a. As we have noted, the
choice $\alpha_n(a) = \frac{1}{n}$ results in the sample-average method.

Theorem: Let $Q_n$ be given by the recurrence $Q_{n+1} = Q_n + \alpha_n(a)\left[R_n - Q_n\right]$. If
$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty,$$
then, with probability 1, $Q_n$ converges to the expected reward $q_*(a)$.

The first condition is required to guarantee that the steps are large enough to eventually overcome any initial
conditions or random fluctuations. The second condition guarantees that eventually the steps become small
enough to assure convergence.

• A constant $\alpha$ does not satisfy the second condition, and hence there is no convergence (see the check after the example below).

• A variable $\alpha$ does not work so well in nonstationary cases (which are the most common in RL).

• Variable step sizes are not that common in RL, but there are other settings where they appear
naturally.

Example: Market share.
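As a sketch of why the two step-size choices above behave differently under the theorem (standard facts about series, stated without proof): for the sample-average choice $\alpha_n(a) = \frac{1}{n}$,
$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \quad\text{(harmonic series)}, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$
so both conditions hold and $Q_n \to q_*(a)$ with probability 1. For a constant step size $\alpha_n(a) = \alpha \in (0,1]$,
$$\sum_{n=1}^{\infty} \alpha = \infty \qquad\text{but}\qquad \sum_{n=1}^{\infty} \alpha^2 = \infty,$$
so the second condition fails: the estimates never settle down and keep tracking the most recent rewards, which is exactly the behaviour we want in nonstationary problems.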



Initial Values and Biases


All the methods we have discussed so far depend to some extent on the initial action-value estimates,
$Q_1(a)$. In the language of statistics, these methods are biased by their initial estimates. For the
sample-average method, the bias disappears once all actions have been selected at least once, but for
methods with constant $\alpha$, the bias is permanent, though it decreases over time as given by Eq. (2.4).

Upper-Confidence-Bound (UCB) Action Selection


$\varepsilon$-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference
for those that are nearly greedy or particularly uncertain. One effective way of taking this uncertainty into
account is to select actions according to

$$A_t := \operatorname*{argmax}_{a} \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

$N_t(a)$ denotes the number of times that action a has been selected prior to time t, and the number $c > 0$
controls the degree of exploration. If $N_t(a) = 0$, then a is considered to be a maximising action.
The bracketed quantity acts as an upper bound on the plausible true value of action a, with c
determining the confidence level. UCB can be tricky to implement in more general RL problems, such as those
with a large state space.
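A minimal Python sketch of UCB action selection (illustrative only); unvisited arms are returned first, following the convention above that $N_t(a) = 0$ makes a a maximising action.

import math
import random

def ucb_action(Q, N, t, c=2.0):
    """Select argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], trying unvisited arms first."""
    unvisited = [a for a in range(len(Q)) if N[a] == 0]
    if unvisited:
        return random.choice(unvisited)
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    best = max(scores)
    return random.choice([a for a in range(len(Q)) if scores[a] == best])

Q, N = [0.4, 0.1, 0.3], [10, 2, 5]
print(ucb_action(Q, N, t=17))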

2.5 Gradient Bandit Algorithms

$H_t(a)$: a numerical preference for action a, which we learn.

$\pi_t(a)$: the probability of taking action a at time t.

$\hat{R}_t$: the average of the rewards up to (but not including) time t: $\hat{R}_t = \frac{1}{t-1}\sum_{i=1}^{t-1} R_i$.

Then, the probabilities are given by the soft-max function

$$\pi_t(a) = \Pr\{A_t = a\} := \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

The larger the preference, the more often that action is taken, but the preference has no interpretation in terms
of reward.

On each step t, after selecting action $A_t$ and receiving reward $R_t$, the preferences are updated according to the
following rule:
$$H_{t+1}(a) := H_t(a) + \alpha\,(R_t - \hat{R}_t)\left(\mathbb{1}[A_t = a] - \pi_t(a)\right) \qquad (2.5)$$

$\hat{R}_t$ serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the
probability of taking $A_t$ in the future is increased, and if the reward is below the baseline, then the probability is
decreased. The non-selected actions move in the opposite direction.
Remark: the name gradient bandit comes from the fact that (2.5) can be interpreted as a (stochastic estimate of
the) gradient of the expected reward, and therefore this method is an instance of stochastic gradient ascent. This
assures us that the algorithm has robust convergence properties (we will discuss this further in Part II of the
module).
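A minimal Python sketch of the gradient-bandit updates, assuming Gaussian arm rewards purely for illustration; the soft-max of the preferences gives $\pi_t$, the baseline is the running average of past rewards, and the preferences are updated with Eq. (2.5).

import math
import random

k, alpha, steps = 5, 0.1, 1000
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]   # true values, unknown to the agent

H = [0.0] * k        # preferences H_t(a)
R_bar, t = 0.0, 0    # running average of past rewards (the baseline R_hat_t)

for _ in range(steps):
    # Soft-max of the preferences gives the policy pi_t(a).
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    pi = [e / total for e in exps]

    # Sample A_t from pi_t and observe the reward R_t.
    A = random.choices(range(k), weights=pi)[0]
    R = random.gauss(q_star[A], 1.0)

    # Preference update, Eq. (2.5), using the average of the rewards before time t.
    for a in range(k):
        indicator = 1.0 if a == A else 0.0
        H[a] += alpha * (R - R_bar) * (indicator - pi[a])

    # Incorporate R_t into the baseline for the next step.
    t += 1
    R_bar += (R - R_bar) / t

# The most preferred arm should usually coincide with the truly best arm.
print(max(range(k), key=lambda a: H[a]), max(range(k), key=lambda a: q_star[a]))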

Notes:
