
What we learned last time

1. Intelligence is the computational part of the ability to achieve goals

   • looking deeper: 1) it's a continuum, 2) it's an appearance, 3) it varies
     with observer and purpose

2. We will (probably) figure out how to make intelligent systems in
   our lifetimes; it will change everything

3. But prior to that it will probably change our careers

   • as companies gear up to take advantage of the economic
     opportunities

4. This course has a demanding workload


Multi-armed Bandits
Sutton and Barto, Chapter 2

The simplest reinforcement learning problem

You are the algorithm! (bandit1)
• Action 1 — Reward is always 8

  • value of action 1 is q_*(1) = 8

• Action 2 — 88% chance of 0, 12% chance of 100!

  • value of action 2 is q_*(2) = 0.88 × 0 + 0.12 × 100 = 12

• Action 3 — Randomly between -10 and 35, equiprobable

  • q_*(3) = 12.5

• Action 4 — a third 0, a third 20, and a third from {8, 9, …, 18}

  • q_*(4) = (1/3) × 0 + (1/3) × 20 + (1/3) × 13 = 0 + 20/3 + 13/3 = 33/3 = 11
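As a quick sanity check, here is a minimal Python sketch (assuming NumPy is available, a continuous uniform distribution for Action 3, and a uniform choice over {8, …, 18} for the last third of Action 4) that simulates these four actions and estimates their values by averaging sampled rewards; the estimates should come out near 8, 12, 12.5, and 11.

import numpy as np

rng = np.random.default_rng(0)

def sample_reward(action):
    # Hypothetical reward distributions matching the four example actions above
    if action == 1:
        return 8.0                                    # always 8
    if action == 2:
        return 100.0 if rng.random() < 0.12 else 0.0  # 12% chance of 100, else 0
    if action == 3:
        return rng.uniform(-10, 35)                   # equiprobable between -10 and 35
    u = rng.random()                                  # action 4: a third 0, a third 20,
    if u < 1/3:                                       # a third drawn from {8, ..., 18}
        return 0.0
    if u < 2/3:
        return 20.0
    return float(rng.integers(8, 19))                 # mean of {8, ..., 18} is 13

for a in (1, 2, 3, 4):
    estimate = np.mean([sample_reward(a) for _ in range(100_000)])
    print(f"estimated q*({a}) = {estimate:.2f}")      # roughly 8, 12, 12.5, 11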
The k-armed Bandit Problem

• On each of an infinite sequence of time steps, t = 1, 2, 3, …,
  you choose an action A_t from k possibilities, and receive a real-valued reward R_t

• The reward depends only on the action taken;
  it is independently and identically distributed (i.i.d.):

      q_*(a) ≐ E[R_t | A_t = a],  ∀a ∈ {1, …, k}       (true values)

• These true values are unknown. The distribution is unknown

• Nevertheless, you must maximize your total reward

• You must both try actions to learn their values (explore),
  and prefer those that appear best (exploit)
The Exploration/Exploitation Dilemma

• Suppose you form estimates

      Q_t(a) ≈ q_*(a),  ∀a        (action-value estimates)

• Define the greedy action at time t as

      A_t* ≐ argmax_a Q_t(a)

• If A_t = A_t*, then you are exploiting
  If A_t ≠ A_t*, then you are exploring

• You can't do both, but you need to do both

• You can never stop exploring, but maybe you should explore
  less with time. Or maybe not.
Action-Value Methods

• Methods that learn action-value estimates and nothing else

• For example, estimate action values as sample averages:

      Q_t(a) ≐ (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
             = Σ_{i=1}^{t−1} R_i · 1_{A_i=a}  /  Σ_{i=1}^{t−1} 1_{A_i=a}

  where 1_{predicate} denotes the random variable that is 1 if the predicate is true
  and 0 otherwise; if the denominator is zero, we instead define Q_t(a) as some
  default value, such as Q_1(a) = 0

• The sample-average estimates converge to the true values
  if the action is taken an infinite number of times:

      lim_{N_t(a)→∞} Q_t(a) = q_*(a)

  where N_t(a) is the number of times action a has been taken by time t
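A small Python sketch of the sample-average estimate, assuming NumPy and a recorded history of actions and rewards; the function name and the toy history below are illustrative, not from the slides.

import numpy as np

def sample_average_estimates(actions, rewards, k, default=0.0):
    # Q_t(a) = (sum of rewards when a taken) / (number of times a taken),
    # with a default value for actions that have never been taken
    sums = np.zeros(k)
    counts = np.zeros(k, dtype=int)
    for a, r in zip(actions, rewards):
        sums[a] += r
        counts[a] += 1
    return np.where(counts > 0, sums / np.maximum(counts, 1), default), counts

# toy history with k = 3 actions (indexed from 0)
Q, N = sample_average_estimates([0, 1, 0, 2, 1], [1.0, 0.0, 3.0, 5.0, 2.0], k=3)
print(Q)   # [2. 1. 5.]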
ε-Greedy Action Selection

• In greedy action selection, you always exploit

• In ε-greedy, you are usually greedy, but with probability ε you
  instead pick an action at random (possibly the greedy action again)

• This is perhaps the simplest way to balance exploration and exploitation
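A minimal sketch of ε-greedy selection in Python, assuming NumPy; Q is an array of action-value estimates, and ties among greedy actions are broken randomly.

import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    # With probability epsilon, pick uniformly at random (possibly the greedy action again);
    # otherwise pick a greedy action, breaking ties randomly.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(rng.choice(np.flatnonzero(Q == Q.max())))

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.5, 0.5, 0.1]), epsilon=0.1, rng=rng))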
      Q_{n+1} = Q_n + (1/n) [R_n − Q_n],                                   (2.3)

which holds even for n = 1, obtaining Q_2 = R_1 for arbitrary Q_1. This implementation
requires memory only for Q_n and n, and only the small computation (2.3)
for each new reward. Pseudocode for a complete bandit algorithm using incrementally
computed sample averages and ε-greedy action selection is shown below. The
function bandit(a) is assumed to take an action and return a corresponding reward.

A simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Repeat forever:
    A ← argmax_a Q(a)    with probability 1 − ε   (breaking ties randomly)
        a random action  with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]

The update rule (2.3) is of a form that occurs frequently throughout this book.
The general form is

      NewEstimate ← OldEstimate + StepSize [Target − OldEstimate].          (2.4)

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by
taking a step toward the target.
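A direct Python rendering of the boxed pseudocode above, assuming NumPy; the bandit(a) environment here is a stand-in (a Gaussian arm with made-up true values), since the pseudocode leaves the environment unspecified.

import numpy as np

rng = np.random.default_rng(0)
k, epsilon = 10, 0.1
q_star = rng.normal(0, 1, size=k)        # hypothetical true action values

def bandit(a):
    # take an action, return a corresponding (noisy) reward
    return rng.normal(q_star[a], 1)

Q = np.zeros(k)                          # Q(a) <- 0
N = np.zeros(k, dtype=int)               # N(a) <- 0

for t in range(1000):                    # "repeat forever", truncated for the sketch
    if rng.random() < epsilon:
        A = rng.integers(k)              # a random action with probability epsilon
    else:
        A = rng.choice(np.flatnonzero(Q == Q.max()))  # greedy, breaking ties randomly
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]            # incrementally computed sample average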
Bandits: What we learned so far
1. Multi-armed bandits are a simplification of the real problem

1. they have action and reward (a goal), but no input or sequentiality

2. A fundamental exploitation-exploration tradeoff arises in bandits

1. 𝜀-greedy action selection is the simplest way of trading off

3. Learning action values is a key part of solution methods

4. The 10-armed testbed illustrates all



One Bandit Task from The 10-armed Testbed

[Figure: reward distribution of each of the 10 actions for one bandit task.
 True values q_*(a) ~ N(0, 1); rewards R_t ~ N(q_*(A_t), 1).
 Run for 1000 steps; repeat the whole thing 2000 times with different bandit tasks.]
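A minimal sketch of the testbed protocol on this slide, assuming NumPy: draw q_*(a) ~ N(0, 1) for one task, give rewards R_t ~ N(q_*(A_t), 1), run an agent for 1000 steps, and average over 2000 tasks. The run_agent callable is a placeholder for any of the methods in this lecture; it is assumed to return the list of rewards it received.

import numpy as np

def run_testbed(run_agent, k=10, steps=1000, tasks=2000, seed=0):
    # Average reward at each step, over many independently drawn bandit tasks
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(tasks):
        q_star = rng.normal(0, 1, size=k)                  # q_*(a) ~ N(0, 1)
        bandit = lambda a: rng.normal(q_star[a], 1)        # R_t ~ N(q_*(A_t), 1)
        avg_reward += np.asarray(run_agent(bandit, k, steps)) / tasks
    return avg_reward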
ε-Greedy Methods on the 10-Armed Testbed

[Figure: average reward (top) and % optimal action (bottom) over 1000 steps,
 averaged over the 2000 bandit tasks, for ε = 0.1, ε = 0.01, and ε = 0 (greedy).]
Bandits: What we learned so far
1. Multi-armed bandits are a simplification of the real problem

1. they have action and reward (a goal), but no input or sequentiality

2. The exploitation-exploration tradeoff arises in bandits

1. 𝜀-greedy action selection is the simplest way of trading off

3. Learning action values is a key part of solution methods

4. The 10-armed testbed illustrates all

5. Learning as averaging – a fundamental learning rule


Averaging ⟶ learning rule

• To simplify notation, let us focus on one action

• We consider only its rewards, and its estimate after n rewards:

      Q_{n+1} ≐ (R_1 + R_2 + ··· + R_n) / n

• How can we do this incrementally (without storing all the rewards)?

• Could store a running sum and count (and divide), or equivalently:

      Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

• This is a standard form for learning/update rules:

      NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
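A quick numeric check in Python (assuming NumPy) that the incremental rule reproduces the plain average of the observed rewards:

import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(5.0, 2.0, size=100)

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n                       # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)

print(np.isclose(Q, rewards.mean()))       # True: same as storing all rewards and averaging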
Derivation of incremental update

To simplify notation we concentrate on a single action. Let R_i denote the
reward received after the ith selection of this action, and let Q_n denote the
estimate of its action value after it has been selected n − 1 times, which we can
write simply as

      Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1).

The obvious implementation would be to maintain a record of all the rewards and
then perform this computation whenever the estimated value was needed. However,
in this case the memory and computational requirements would grow over time as
more rewards are seen. Each additional reward would require more memory to store
it and more computation to compute the sum in the numerator.

As you might suspect, this is not really necessary. It is easy to devise incremental
formulas for updating averages with small, constant computation required to process
each new reward. Given Q_n and the nth reward, R_n, the new average of all n rewards
can be computed by

      Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
              = (1/n) ( R_n + Σ_{i=1}^{n−1} R_i )
              = (1/n) ( R_n + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} R_i )
              = (1/n) ( R_n + (n − 1) Q_n )
              = (1/n) ( R_n + n Q_n − Q_n )
              = Q_n + (1/n) [ R_n − Q_n ],

which holds even for n = 1, obtaining Q_2 = R_1 for arbitrary Q_1.
!
sample averages
1 of X1
observed rewards. We now turn to the questio

Averaging ⟶ learning rule


averages= can beRcomputed
n+ Ri in a computationally efficient manner, in p
n
constant memory andi=1 per-time-step computation.
1 ⇣ ⌘
To simplify
= Rnotation
n + (n we 1)Q concentrate
n + Qn Qon
n a single action. Let Ri n
• To simplify
reward notation,
n
received
⇣ afterlet usith
the focus on oneofaction
selection
⌘ this action, and let Qn deno
1
of its action
= value
Rn +after
nQn it Q has
n been selected n 1 times, which we
simply as n onlyh its rewards,
• We consider i and its estimate after n+1 rewards:
1
= .QR n 1 + R2R+
+ n · · ·Q
+ n R, n 1
Qn = n .
n 1
• How
which holdscan
evenwe
The obvious
fordon this
= 1,incrementally
obtaining Q2 (without
implementation storing all
would be to maintain
= R1 for arbitrary Qthe rewards)?
a record
. This
of all th
implem
then perform this computation whenever the estimated1value was ne
requires memory only for Qn and n, and only the small computation (2.3
• Could
in thisstore
new reward.
case atherunning
memory sum and
and count (and
computational divide), or equivalently:
requirements would gro
more rewards are seen.h Each additional
i reward would require more m
The itupdate 1 of a to
rule=computation
(2.3) is
andQmore
n+1 Qn + Rn form that occurs
compute
Q n
the sumfrequently throughout t
in the numerator.
The general form is n
• This is a standard form for learning/update rules:
h i
NewEstimate OldEstimate + StepSize Target OldEstimate .
Standard stochastic approximation convergence conditions

• As noted, the choice α_n(a) = 1/n gives the sample-average method, which is
  guaranteed to converge to the true action values by the law of large numbers.
  But convergence is not guaranteed for all choices of the sequence {α_n(a)}.

• A well-known result in stochastic approximation theory gives the conditions
  required to assure convergence with probability 1:

      Σ_{n=1}^{∞} α_n(a) = ∞     and     Σ_{n=1}^{∞} α_n²(a) < ∞

  The first condition guarantees that the steps are large enough to eventually
  overcome any initial conditions or random fluctuations; the second guarantees
  that eventually the steps become small enough to assure convergence

• e.g., α_n = 1/n meets both conditions, but α_n = 1/n² does not

• if α_n = n^{−p}, p ∈ (0, 1), then convergence is at the optimal rate: O(1/√n)

• For a constant step-size parameter, α_n(a) = α, the second condition is not met,
  indicating that the estimates never completely converge but continue to vary in
  response to the most recently received rewards
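A small numeric illustration of the two conditions (assuming NumPy), using partial sums up to n = 10^6: α_n = 1/n keeps growing while its squares converge, whereas a constant step size fails the second condition.

import numpy as np

n = np.arange(1, 1_000_001, dtype=float)

alpha_1_over_n = 1.0 / n
alpha_const = np.full_like(n, 0.1)

# alpha_n = 1/n: first sum still growing (diverges), sum of squares ~ pi^2/6 (converges)
print(alpha_1_over_n.sum(), (alpha_1_over_n ** 2).sum())   # ~14.39, ~1.64

# constant alpha: sum of squares also diverges, so the second condition is not met
print(alpha_const.sum(), (alpha_const ** 2).sum())         # 100000.0, 10000.0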
Tracking a Non-stationary Problem

• Suppose the true action values change slowly over time

  • then we say that the problem is nonstationary

• In this case, sample averages are not a good idea (Why?)

• Better is an "exponential, recency-weighted average":

      Q_{n+1} ≐ Q_n + α [R_n − Q_n]
              = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i,

  where α is a constant step-size parameter, α ∈ (0, 1]

• There is bias due to Q_1 that becomes smaller over time
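A minimal sketch (assuming NumPy, with a made-up random-walk drift) of why the constant-α, recency-weighted average tracks a nonstationary value better than the sample average:

import numpy as np

rng = np.random.default_rng(0)
q_true, alpha = 0.0, 0.1
Q_avg, Q_exp, n = 0.0, 0.0, 0

for t in range(10_000):
    q_true += rng.normal(0, 0.01)          # true value drifts slowly (nonstationary)
    R = q_true + rng.normal(0, 1)          # noisy reward
    n += 1
    Q_avg += (R - Q_avg) / n               # sample average: weighs all rewards equally
    Q_exp += alpha * (R - Q_exp)           # exponential recency-weighted average

print(abs(Q_avg - q_true), abs(Q_exp - q_true))   # constant-alpha error is typically much smaller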


Optimistic Initial Values

• All methods so far depend on Q_1(a), i.e., they are biased.
  So far we have used Q_1(a) = 0

• Suppose we initialize the action values optimistically (Q_1(a) = +5),
  e.g., on the 10-armed testbed (with α = 0.1)

[Figure: % optimal action versus steps (1 to 1000), comparing the optimistic, greedy
 method (Q_1 = 5, ε = 0) with the realistic, ε-greedy method (Q_1 = 0, ε = 0.1).]
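A minimal sketch (assuming NumPy) of the optimistic, greedy configuration from the plot: Q_1(a) = +5, ε = 0, constant α = 0.1, on one hypothetical 10-armed task. The initial optimism forces early exploration even though selection is purely greedy.

import numpy as np

rng = np.random.default_rng(0)
k, alpha = 10, 0.1
q_star = rng.normal(0, 1, size=k)          # one hypothetical testbed task

Q = np.full(k, 5.0)                        # optimistic initial values, Q_1(a) = +5

for t in range(1000):
    A = rng.choice(np.flatnonzero(Q == Q.max()))   # purely greedy (epsilon = 0)
    R = rng.normal(q_star[A], 1)
    Q[A] += alpha * (R - Q[A])             # constant step size; the optimism washes out over time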
Upper Confidence Bound (UCB) action selection

• A clever way of reducing exploration over time

• Estimate an upper bound on the true action values

• Select the action with the largest (estimated) upper bound

Exploration is needed because the estimates of the action values are uncertain. The
greedy actions are those that look best at present, but some of the other actions
may actually be better. ε-greedy action selection forces the non-greedy actions to
be tried, but indiscriminately, with no preference for those that are nearly greedy
or particularly uncertain. It would be better to select among the non-greedy actions
according to their potential for actually being optimal, taking into account both how
close their estimates are to being maximal and the uncertainties in those estimates.
One effective way of doing this is to select actions as

      A_t ≐ argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ],

where ln t denotes the natural logarithm of t (the number that e ≈ 2.718 would
have to be raised to in order to equal t), and the number c > 0 controls the degree
of exploration. If N_t(a) = 0, then a is considered to be a maximizing action.
The idea of this upper confidence bound (UCB) action selection is that the square-
root term is a measure of the uncertainty or variance in the estimate of a's value.

[Figure: average reward versus steps on the 10-armed testbed for UCB with c = 2
 and ε-greedy with ε = 0.1.]
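A small Python sketch of UCB selection (assuming NumPy); Q and N are the current estimates and selection counts, and any action with N_t(a) = 0 is taken first, as the text above prescribes. The function name is illustrative.

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Select the action with the largest upper confidence bound
    Q, N = np.asarray(Q, dtype=float), np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:                   # N_t(a) = 0: a is considered maximizing
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

print(ucb_action(Q=[0.2, 0.5, 0.1], N=[3, 5, 0], t=9))   # returns 2, the untried action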
Gradient-Bandit Algorithms

• Let H_t(a) be a learned preference for taking action a

• Only the relative preference of one action over another is important; if we add
  1000 to all the preferences there is no effect on the action probabilities, which
  are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann
  distribution) as follows:

      Pr{A_t = a} ≐ e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} ≐ π_t(a),

  where π_t(a) denotes the probability of taking action a at time t. Initially all
  preferences are the same (e.g., H_1(a) = 0), so that all actions have an equal
  probability of being selected

• There is a natural learning algorithm for this setting based on the idea of
  stochastic gradient ascent. On each step, after selecting action A_t and receiving
  the reward R_t, the preferences are updated by:

      H_{t+1}(a) ≐ H_t(a) + α (R_t − R̄_t)(1_{a=A_t} − π_t(a)),   ∀a,

  where α > 0 is a step-size parameter, and R̄_t ∈ ℝ is the average of all the
  rewards up through and including time t,

      R̄_t ≐ (1/t) Σ_{i=1}^{t} R_i,

  which can be computed incrementally as described in Section 2.3 (or Section 2.4
  if the problem is nonstationary). The R̄_t term serves as a baseline.

[Figure: % optimal action versus steps on the 10-armed testbed for the gradient
 bandit algorithm with α = 0.1 and α = 0.4, with and without a reward baseline.]
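A minimal Python sketch (assuming NumPy) of the gradient-bandit update above, run on one hypothetical Gaussian 10-armed task; the baseline R̄_t is maintained incrementally as in Section 2.3.

import numpy as np

rng = np.random.default_rng(0)
k, alpha = 10, 0.1
q_star = rng.normal(0, 1, size=k)           # hypothetical true values

H = np.zeros(k)                             # preferences H_1(a) = 0: equal probabilities
baseline = 0.0                              # R-bar: incrementally computed average reward

for t in range(1, 1001):
    pi = np.exp(H - H.max())
    pi /= pi.sum()                          # soft-max (Gibbs/Boltzmann) distribution pi_t
    A = rng.choice(k, p=pi)
    R = rng.normal(q_star[A], 1)
    baseline += (R - baseline) / t          # average of rewards up through time t
    indicator = np.zeros(k)
    indicator[A] = 1.0
    H += alpha * (R - baseline) * (indicator - pi)   # H_{t+1}(a) = H_t(a) + alpha (R_t - Rbar_t)(1_{a=A_t} - pi_t(a))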
Derivation of gradient-bandit algorithm

In exact gradient ascent:

      H_{t+1}(a) ≐ H_t(a) + α ∂E[R_t]/∂H_t(a),                        (1)

where:

      E[R_t] ≐ Σ_b π_t(b) q_*(b),

      ∂E[R_t]/∂H_t(a) = ∂/∂H_t(a) [ Σ_b π_t(b) q_*(b) ]
                      = Σ_b q_*(b) ∂π_t(b)/∂H_t(a)
                      = Σ_b (q_*(b) − X_t) ∂π_t(b)/∂H_t(a),

where X_t does not depend on b, because Σ_b ∂π_t(b)/∂H_t(a) = 0.

      ∂E[R_t]/∂H_t(a) = Σ_b (q_*(b) − X_t) ∂π_t(b)/∂H_t(a)
                      = Σ_b π_t(b) (q_*(b) − X_t) [∂π_t(b)/∂H_t(a)] / π_t(b)
                      = E[ (q_*(A_t) − X_t) [∂π_t(A_t)/∂H_t(a)] / π_t(A_t) ]
                      = E[ (R_t − R̄_t) [∂π_t(A_t)/∂H_t(a)] / π_t(A_t) ],

where here we have chosen X_t = R̄_t and substituted R_t for q_*(A_t),
which is permitted because E[R_t | A_t] = q_*(A_t).

For now assume ∂π_t(b)/∂H_t(a) = π_t(b)(1_{a=b} − π_t(a)). Then:

                      = E[ (R_t − R̄_t) π_t(A_t)(1_{a=A_t} − π_t(a)) / π_t(A_t) ]
                      = E[ (R_t − R̄_t)(1_{a=A_t} − π_t(a)) ].

      H_{t+1}(a) = H_t(a) + α (R_t − R̄_t)(1_{a=A_t} − π_t(a)),       (from (1), Q.E.D.)

Thus it remains only to show that

      ∂π_t(b)/∂H_t(a) = π_t(b)(1_{a=b} − π_t(a)).

Recall the standard quotient rule for derivatives:

      ∂/∂x [ f(x)/g(x) ] = [ (∂f(x)/∂x) g(x) − f(x) ∂g(x)/∂x ] / g(x)².

Using this, we can write...


Recall the standard quotient rule for derivatives:

      Quotient Rule:   ∂/∂x [ f(x)/g(x) ] = [ (∂f(x)/∂x) g(x) − f(x) ∂g(x)/∂x ] / g(x)².

Using this, we can write:

      ∂π_t(b)/∂H_t(a) = ∂/∂H_t(a) π_t(b)
                      = ∂/∂H_t(a) [ e^{H_t(b)} / Σ_{c=1}^{k} e^{H_t(c)} ]
                      = [ (∂e^{H_t(b)}/∂H_t(a)) Σ_{c=1}^{k} e^{H_t(c)} − e^{H_t(b)} ∂(Σ_{c=1}^{k} e^{H_t(c)})/∂H_t(a) ]
                          / ( Σ_{c=1}^{k} e^{H_t(c)} )²                              (by the quotient rule)
                      = [ 1_{a=b} e^{H_t(a)} Σ_{c=1}^{k} e^{H_t(c)} − e^{H_t(b)} e^{H_t(a)} ]
                          / ( Σ_{c=1}^{k} e^{H_t(c)} )²                              (because ∂e^x/∂x = e^x)
                      = 1_{a=b} e^{H_t(b)} / Σ_{c=1}^{k} e^{H_t(c)}  −  e^{H_t(b)} e^{H_t(a)} / ( Σ_{c=1}^{k} e^{H_t(c)} )²
                      = 1_{a=b} π_t(b) − π_t(b) π_t(a)
                      = π_t(b) (1_{a=b} − π_t(a)).                                   (Q.E.D.)
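A quick finite-difference check of this soft-max derivative in Python (assuming NumPy), comparing the analytic Jacobian π_t(b)(1_{a=b} − π_t(a)) against numerical perturbation of the preferences:

import numpy as np

def softmax(H):
    e = np.exp(H - H.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=5)
pi = softmax(H)

# Analytic Jacobian: d pi(b) / d H(a) = pi(b) * (1_{a=b} - pi(a)); rows index b, columns index a
analytic = pi[:, None] * (np.eye(5) - pi[None, :])

# Numerical Jacobian by finite differences
eps = 1e-6
numeric = np.zeros((5, 5))
for a in range(5):
    H_perturbed = H.copy()
    H_perturbed[a] += eps
    numeric[:, a] = (softmax(H_perturbed) - pi) / eps

print(np.allclose(analytic, numeric, atol=1e-5))   # True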
Summary Comparison of Bandit Algorithms

[Figure: parameter study on the 10-armed testbed. Average reward over the first
 1000 steps versus parameter value (ε, α, c, or Q_0, on a log scale from 1/128 to 4)
 for ε-greedy, gradient bandit, UCB, and greedy with optimistic initialization (α = 0.1).]
Conclusions
• These are all simple methods

• but they are complicated enough—we will build on them

• we should understand them completely

• there are still open questions

• Our first algorithms that learn from evaluative feedback

• and thus must balance exploration and exploitation

• Our first algorithms that appear to have a goal

  • that learn to maximize reward by trial and error
Our first dimensions!

• Problems vs Solution Methods

• Evaluative vs Instructive

• Associative vs Non-associative

Bandits? Problem or Solution?
Problem space

                        Single State                    Associative

  Instructive           Averaging                       Supervised
  feedback                                              learning

  Evaluative            Bandits                         Associative Search
  feedback              (Function optimization)         (Contextual bandits)
