What We Learned Last Time: 1. Intelligence Is The Computational Part of The Ability To Achieve Goals
The simplest reinforcement learning problem
You are the algorithm! (bandit1)
• Action 1 — Reward is always 8
  value of action 1 is q*(1) = 8
• Action 3 — Reward is random, distributed between −10 and 35
  q*(3) = 12.5
• Action 4 — Reward is 0, 20, or 13, each a third of the time
  q*(4) = (1/3)·0 + (1/3)·20 + (1/3)·13 = 0 + 20/3 + 13/3 = 33/3 = 11
The k-armed Bandit Problem
• On each of an infinite sequence of time steps, t = 1, 2, 3, …, you choose an action At from k possibilities, and receive a real-valued reward Rt
• You can never stop exploring, but maybe you should explore less with time. Or maybe not.
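As a concrete (hypothetical) sketch of this interaction in Python, assuming the usual testbed setup in which each reward is drawn from a unit-variance normal around a fixed true value q*(a):

import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit: the agent picks an action A_t each step
    and receives a reward R_t ~ N(q*(A_t), 1); q* itself is never observed."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)   # fixed true action values

    def step(self, action):
        return self.rng.normal(self.q_star[action], 1.0)  # noisy reward for the chosen action

# The learner interacts only through step():
bandit = KArmedBandit(k=10, seed=0)
reward = bandit.step(3)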
Action-Value Methods
• Methods that learn action-value estimates, and nothing else, and use those estimates to make action-selection decisions
• The true value of an action is the mean reward when that action is selected
• For example, estimate action values as sample averages of the rewards actually received:

  Qt(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
        = Σ_{i=1}^{t−1} Ri · 1{Ai=a}  /  Σ_{i=1}^{t−1} 1{Ai=a}
Initialize, for a = 1 to k:  Q(a) ← 0,  N(a) ← 0
Repeat forever:
  A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly),
      or a random action with probability ε
  R ← bandit(A)
  N(A) ← N(A) + 1
  Q(A) ← Q(A) + (1/N(A)) · [R − Q(A)]
The update rule above is of a form that occurs frequently throughout the book. The general form is

  NewEstimate ← OldEstimate + StepSize · [Target − OldEstimate]

The expression [Target − OldEstimate] is an error in the estimate; it is reduced by taking a step toward the target.
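A minimal Python sketch of the pseudocode and update rule above; the `bandit_step` reward function is assumed (for instance, the testbed sketch given earlier):

import numpy as np

def epsilon_greedy_bandit(bandit_step, k, epsilon=0.1, steps=1000, seed=None):
    """Simple bandit algorithm: epsilon-greedy selection with incrementally
    computed sample-average estimates Q(a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)      # action-value estimates
    N = np.zeros(k)      # number of times each action was taken
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))                           # explore: random action
        else:
            A = int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit, ties broken randomly
        R = bandit_step(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]   # NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]
        rewards.append(R)
    return Q, rewards

For example, with the testbed class sketched earlier: Q, rewards = epsilon_greedy_bandit(bandit.step, k=10, epsilon=0.1).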
Bandits: What we learned so far
1. Multi-armed bandits are a simplification of the real problem
[Figure: an example 10-armed testbed problem — the reward distribution for each action, centred on its true value q*(1), …, q*(10); each reward is drawn as Rt ~ N(q*(a), 1). Run for 1000 steps.]
ε-Greedy Methods on the 10-Armed Testbed
[Figure: average reward (top) and % optimal action (bottom) over 1000 steps on the 10-armed testbed, for ε = 0.1, ε = 0.01, and ε = 0 (greedy).]
Bandits: What we learned so far
1. Multi-armed bandits are a simplification of the real problem
Convergence conditions

• The step size αn(a) used after the nth selection of action a must satisfy
    Σ_{n=1}^{∞} αn = ∞   and   Σ_{n=1}^{∞} αn² < ∞
  (the first condition guarantees that the steps are large enough to eventually overcome any initial conditions or random fluctuations; the second guarantees that eventually the steps become small enough to assure convergence)
• e.g., αn = 1/n — the sample-average method, which is guaranteed to converge to the true action values
• Note: not αn = 1/n² (the first condition fails)
• if αn = n^(−p) with p ∈ (0, 1), then convergence is at the optimal rate: O(1/√n)

Both conditions are met for the sample-average case, αn(a) = 1/n, but not for a constant step-size parameter, αn(a) = α: the second condition is not met, so the estimates never completely converge but continue to vary in response to the most recently received rewards.
Tracking a Non-stationary Problem
• Suppose the true action values change slowly over time
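A sketch of why a constant step size helps here: the two estimators below are fed the same action/reward stream, but the sample average (step size 1/n) stops adapting while the constant-α estimate keeps tracking. The random-walk drift (σ = 0.01) and the parameter values are illustrative assumptions, not from the slides.

import numpy as np

def nonstationary_run(steps=10_000, k=10, epsilon=0.1, alpha=0.1, seed=0):
    """Track drifting action values with a sample-average estimator (step size 1/n)
    and a constant step-size estimator (exponential recency-weighted average)."""
    rng = np.random.default_rng(seed)
    q_star = np.zeros(k)                  # true values start equal, then drift
    Q_avg, N = np.zeros(k), np.zeros(k)   # sample-average estimates and counts
    Q_const = np.zeros(k)                 # constant step-size estimates
    for _ in range(steps):
        q_star += rng.normal(0.0, 0.01, size=k)      # slow random-walk drift (illustrative)
        # epsilon-greedy selection based on the constant-alpha estimates
        A = int(rng.integers(k)) if rng.random() < epsilon else int(np.argmax(Q_const))
        R = rng.normal(q_star[A], 1.0)
        N[A] += 1
        Q_avg[A] += (R - Q_avg[A]) / N[A]            # 1/n step size: adapts ever more slowly
        Q_const[A] += alpha * (R - Q_const[A])       # constant alpha: keeps tracking the drift
    return Q_avg, Q_const, q_star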
Upper Confidence Bound (UCB) action selection

• A clever way of reducing exploration over time
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound

Exploration is needed because the estimates of the action values are uncertain. The greedy actions are those that look best at present, but some of the other actions may actually be better. ε-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions as:

  At = argmax_a [ Qt(a) + c · sqrt( log t / Nt(a) ) ]
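A minimal sketch of this selection rule, assuming sample-average estimates Q and visit counts N as in the earlier pseudocode; trying each untried action once before applying the bound is a common convention, not something stated on the slide:

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick A_t = argmax_a [ Q_t(a) + c * sqrt(log t / N_t(a)) ].
    Q and N are arrays of value estimates and visit counts; t counts from 1."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])            # try each action once before using the bound
    bonus = c * np.sqrt(np.log(t) / N)    # uncertainty bonus shrinks as N_t(a) grows
    return int(np.argmax(Q + bonus))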
Gradient-Bandit Algorithms

• Let Ht(a) be a learned preference for taking action a

Only the relative preference of one action over another is important; if we add 1000 to all the preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann distribution) as follows:

  Pr{At = a} = e^{Ht(a)} / Σ_{b=1}^{k} e^{Ht(b)} = πt(a),

where πt(a) denotes the probability of taking action a at time t. Initially all preferences are the same (e.g., H1(a) = 0), so that all actions have an equal probability of being selected.

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting the action At and receiving the reward Rt, the preferences are updated by:

  Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At)),    and
  Ht+1(a)  = Ht(a)  − α (Rt − R̄t) πt(a)           for all a ≠ At,

or, equivalently, for all a:

  Ht+1(a) = Ht(a) + α (Rt − R̄t)(1{a=At} − πt(a)),

where α > 0 is a step-size parameter, and R̄t = (1/t) Σ_{i=1}^{t} Ri is the average of all the rewards up through and including time t, which can be computed incrementally as described in Section 2.3 (or Section 2.4 if the problem is nonstationary). The R̄t term serves as a baseline.

[Figure: gradient bandit on the 10-armed testbed — % optimal action over 1000 steps, with and without the reward baseline, for α = 0.1 and α = 0.4.]
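A short Python sketch of these updates; the `bandit_step` reward function is assumed, as in the earlier sketches, and subtracting the maximum preference before exponentiating is only for numerical stability:

import numpy as np

def gradient_bandit(bandit_step, k, alpha=0.1, steps=1000, use_baseline=True, seed=None):
    """Gradient-bandit algorithm: soft-max over learned preferences H(a),
    updated by stochastic gradient ascent with the average reward as baseline."""
    rng = np.random.default_rng(seed)
    H = np.zeros(k)              # preferences, H_1(a) = 0 for all a
    baseline = 0.0               # incremental average reward R̄_t (stays 0 if unused)
    for t in range(1, steps + 1):
        prefs = np.exp(H - H.max())          # soft-max numerator (max subtracted for stability)
        pi = prefs / prefs.sum()             # pi_t(a) = e^{H_t(a)} / sum_b e^{H_t(b)}
        A = rng.choice(k, p=pi)
        R = bandit_step(A)
        if use_baseline:
            baseline += (R - baseline) / t   # average of rewards up through time t
        one_hot = np.zeros(k)
        one_hot[A] = 1.0
        H += alpha * (R - baseline) * (one_hot - pi)   # H_{t+1}(a) update, all a at once
    return H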
Derivation of gradient-bandit algorithm

In exact gradient ascent:

  Ht+1(a) = Ht(a) + α · ∂E[Rt]/∂Ht(a),        (1)

where:

  E[Rt] = Σ_b πt(b) q*(b),

  ∂E[Rt]/∂Ht(a) = ∂/∂Ht(a) [ Σ_b πt(b) q*(b) ]
                = Σ_b q*(b) ∂πt(b)/∂Ht(a)
                = Σ_b (q*(b) − Xt) ∂πt(b)/∂Ht(a),

where Xt does not depend on b, because Σ_b ∂πt(b)/∂Ht(a) = 0. Continuing:

  ∂E[Rt]/∂Ht(a) = Σ_b (q*(b) − Xt) ∂πt(b)/∂Ht(a)
                = Σ_b πt(b) (q*(b) − Xt) (∂πt(b)/∂Ht(a)) / πt(b)
                = E[ (q*(At) − Xt) (∂πt(At)/∂Ht(a)) / πt(At) ]
                = E[ (Rt − R̄t) (∂πt(At)/∂Ht(a)) / πt(At) ],

where

  ∂πt(b)/∂Ht(a) = πt(b) (1{a=b} − πt(a)).
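Completing the derivation (this last step is not shown above but follows directly from the identity): substituting ∂πt(At)/∂Ht(a) = πt(At)(1{a=At} − πt(a)) cancels the πt(At) in the denominator, giving

  ∂E[Rt]/∂Ht(a) = E[ (Rt − R̄t)(1{a=At} − πt(a)) ].

Replacing this expectation by a single sample at each step (stochastic gradient ascent) yields exactly the preference update

  Ht+1(a) = Ht(a) + α (Rt − R̄t)(1{a=At} − πt(a)),    for all a.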
[Figure: parameter study on the 10-armed testbed — average reward over the first 1000 steps vs. parameter value (ε, α, c, or Q0, on a log scale from 1/128 to 4) for ε-greedy, gradient bandit, UCB, and greedy with optimistic initialization (α = 0.1).]
Conclusions

• These are all simple methods
• Bandits? Problem or solution?
• Evaluative vs instructive
• Associative vs non-associative
Problem space

                         Single State                        Associative
  Instructive feedback   Averaging                           Supervised learning
  Evaluative feedback    Bandits (Function optimization)     Associative Search (Contextual bandits)