CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan

CS 747, Autumn 2023: Lecture 5
Shivaram Kalyanakrishnan
Department of Computer Science and Engineering

Indian Institute of Technology Bombay
Autumn 2023
1/16
Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 1 / 16

Multi-armed Bandits
1. Analysis of UCB
2. Other bandit problems
2/16

Multi-armed Bandits
1. Analysis of UCB
2/16

UCB (Auer et al., 2002)
- Pull each arm once. q
- For t ∈ {n, n + 1, . . . }, for a ∈ A, ucbta = p̂at + 2 ln(t) ; pull argmaxa∈A ucbta .
def
ut a
ucb at
pt
a
0
R
3/16

def
ut a
ucb at
pt
a
0
R
3/16

def
ut a
ucb at
pt
a
0
R
PT −1
Recall that RT = Tp⋆ − t=0 E[r t ].
3/16

def
ut a
ucb at
pt
a
0
R
PT −1
Recall that RT = Tp⋆ − t=0 E[r t ]. P
1
We shall show that UCB achieves RT = O a:pa ̸=p⋆ p⋆ −pa
log(T ) . 3/16

Notation
def
∆a = p⋆ − pa (instance-specific constant); ⋆ an optimal arm.
4/16

Notation
def
Let Zat be the event that arm a is pulled at time t.
4/16

Notation
def

Let zat be a random variable that takes value 1 if arm a is pulled at time t, and
0 otherwise.
4/16

Notation
def

0 otherwise.
Observe that E[zat ] = P{Zat }(1) + (1 − P{Zat })(0) = P{Zat }.
4/16

Notation
def

0 otherwise.
As in the algorithm, uat is a random variable that denotes the number of pulls
arm a has received up to time t:
t−1
X
uat = zai .
i=0
4/16

Notation
def

0 otherwise.
As in the algorithm, uat is a random variable that denotes the number of pulls
arm a has received up to time t:
t−1
X
uat = zai .
i=0
l m
def
We define an instance-specific constant ūaT = (∆8a )2 ln(T ) that will serve in
our proof as a “sufficient” number of pulls of arm a for horizon T . 4/16

Proof Sketch
To upper-bound RT , upper-bound the number of pulls of each sub-optimal
arm a.
Give each such arm a ūaT pulls for free.

Beyond ūaT pulls, arm a’s UCB will have width at most ∆a /2.
If a continues to be pulled beyond ūaT pulls, either its empirical mean has
deviated by more than ∆a /2 from its true mean, or ⋆’s UCB has fallen below
its true mean.
Both events above have a low probability—in aggregate at most a constant
even if summed over an infinite horizon.
5/16

Proof Sketch
To upper-bound RT , upper-bound the number of pulls of each sub-optimal
arm a.
Give each such arm a ūaT pulls for free.

Beyond ūaT pulls, arm a’s UCB will have width at most ∆a /2.
If a continues to be pulled beyond ūaT pulls, either its empirical mean has
deviated by more than ∆a /2 from its true mean, or ⋆’s UCB has fallen below
its true mean.
Both events above have a low probability—in aggregate at most a constant
even if summed over an infinite horizon.
KL-UCB uses the KL inequality, and slightly more sophisticated analysis. 5/16

E[uaT ]∆a .
P
Step 1: Show that RT = a:pa ̸=p⋆
6/16

E[uaT ]∆a .
P
T −1
X
RT = Tp − ⋆
E[r t ]
t=0
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆ ⋆
E[r ] = Tp − P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
X
= Tp⋆ − E[zat ]pa
t=0 a∈A
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
= Tp⋆ − E[zat ]pa = E[uaT ] p⋆ − E[uaT ]pa
t=0 a∈A a∈A a∈A
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
X
= E[uaT ](p⋆ − pa )
a∈A
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
X X
= E[uaT ](p⋆ − pa ) = E[uaT ]∆a .
a∈A a:pa ̸=p⋆
6/16

E[uaT ]∆a .
P
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
X X
= E[uaT ](p⋆ − pa ) = E[uaT ]∆a .
a∈A a:pa ̸=p⋆
To show the regret bound, we shall show for each sub-optimal arm a that

T 1
E[ua ] = O log(T ) .
(∆a )2
6/16

Step 2: Two Regimes for Sub-optimal Pulls
7/16


1
To prove E[uaT ] =O ∆2a
log(T ) , we show E[uaT ] ≤ ūaT + C for constant C.
7/16


1
T −1
X
E[uaT ] = E[zat ]
t=0
7/16


1
T −1
X T −1
X
E[uaT ] = E[zat ] = P{Zat }
t=0 t=0
7/16


1
T −1
X T −1
X
t=0 t=0
T −1
X T −1
X
= P{Zat and (uat < ūaT )} + P{Zat and (uat ≥ ūaT )}
t=0 t=0
7/16


1
T −1
X T −1
X
t=0 t=0
T −1
X T −1
X
t=0 t=0
= A + B.
7/16


1
T −1
X T −1
X
t=0 t=0
T −1
X T −1
X
t=0 t=0
= A + B.
We show A is upper-bounded by ūaT and B is upper-bounded by a constant.

7/16

Step 3: Bounding A
8/16

Step 3: Bounding A
T −1
X
A= P{Zat and (uat < ūaT )}
t=0
8/16

Step 3: Bounding A
T −1
X
t=0
T
a −1
T −1 ūX
X
= P{Zat and (uat = m)}
t=0 m=0
8/16

Step 3: Bounding A
T −1
X
t=0
T ūaT −1 T −1
X a −1
T −1 ūX XX
= P{Zat and (uat = m)} = P{Zat and (uat = m)}
t=0 m=0 m=0 t=0
8/16

Step 3: Bounding A
T −1
X
t=0
T ūaT −1 T −1
X a −1
T −1 ūX XX
t=0 m=0 m=0 t=0
ūaT −1
X
= P{Za0 , (ua0 = m) or Za1 , (ua1 = m) or . . . or ZaT −1 , (uaT −1 = m)}
m=0
8/16

Step 3: Bounding A
T −1
X
t=0
T ūaT −1 T −1
X a −1
T −1 ūX XX
t=0 m=0 m=0 t=0
ūaT −1
X
m=0
ūaT −1
X
≤ 1
m=0
8/16

Step 3: Bounding A
T −1
X
t=0
T ūaT −1 T −1
X a −1
T −1 ūX XX
t=0 m=0 m=0 t=0
ūaT −1
X
m=0
ūaT −1
X
≤ 1 = ūaT .
m=0
8/16

Step 3: Bounding A
T −1
X
t=0
T ūaT −1 T −1
X a −1
T −1 ūX XX
t=0 m=0 m=0 t=0
ūaT −1
X
m=0
ūaT −1
X
≤ 1 = ūaT .
m=0
We have used the fact that for 0 ≤ i < j ≤ t − 1, (Zai , (uai = m)) and
(Zaj , (uaj = m)) are mutually exclusive. 8/16

Step 4.1: Bounding B
9/16

T −1
X
B= P{Zat and (uat ≥ ūaT )}
t=0
9/16

T −1
X
t=0
T −1
X
= P{Zat and (uat ≥ ūaT )}
t=n
9/16

T −1
X
t=0
T −1
X
t=n
s s
T −1
( ! )
X 2 2
≤ P p̂at + t
ln(t) ≥ p̂⋆t + t T
ln(t) and (ua ≥ ūa )
t=n
ua u⋆t
9/16

T −1
X
t=0
T −1
X
t=n
s s
T −1
( ! )
X 2 2
≤ P + p̂at t
ln(t) ≥ p̂⋆ + t T
ln(t) and (ua ≥ ūa )
t=n
uat u⋆t
s
T −1 X t Xt
( r )
X 2 2
≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t) where
t=n
x y
Tx=ūa y =1
p̂a (x) is the empirical mean of the first x pulls of arm a, and
p̂⋆ (y ) is the empirical mean of the first y pulls of arm ⋆. 9/16

Fix x ∈ {ūaT , ūaT + 1, . . . , t} and y ∈ {1, 2, . . . , t}.
10/16

1. We have:
r s
2 2
p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
r ! s !
2 2
=⇒ p̂a (x) + ln(t) ≥ p⋆ or p̂⋆ (y ) + ln(t) < p⋆ .
x y
10/16

1. We have:
r s
2 2
p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
r ! s !
2 2
x y
Fact: If α > β, then α ≥ γ or β < γ. Holds for arbitrary α, β, γ!
10/16

1. We have:
r s
2 2
p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
r ! s !
2 2
x y
Fact: If α > β, then α ≥ γ or β < γ. Holds for arbitrary α, β, γ!

q q
2. Since x ≥ ūa , we have x ln(t) ≤ ū2T ln(t) ≤ ∆2a , and so
T 2
a
r
2 ∆a
p̂a (x) + ln(t) ≥ p⋆ =⇒ p̂a (x) ≥ pa + .
x 2 10/16

Continuing from Step 4.1, using the two results from Step 4.2, and invoking
Hoeffding’s Inequality:
s
−1
T t X
t
( r )
X X 2 2
B≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
t=n x=ūaT y =1
11/16

s
−1
T t X
t
( r )
X X 2 2
B≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
t=n x=ūaT y =1
s
−1 X
T t X t ( )!
X ∆a 2
≤ P p̂a (x) ≥ pa + + P p̂⋆ (y ) < p⋆ − ln(t)
2 y
t=n x=ūa y =1
T
11/16

s
−1
T t X
t
( r )
X X 2 2
B≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
t=n x=ūaT y =1
s
−1 X
T t X t ( )!
X ∆a 2
≤ P p̂a (x) ≥ pa + + P p̂⋆ (y ) < p⋆ − ln(t)
2 y
t=n x=ūa y =1
T
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
11/16

s
−1
T t X
t
( r )
X X 2 2
B≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
t=n x=ūaT y =1
s
−1 X
T t X t ( )!
X ∆a 2
≤ P p̂a (x) ≥ pa + + P p̂⋆ (y ) < p⋆ − ln(t)
2 y
t=n x=ūa y =1
T
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
T −1 t X
t T −1 ∞
π2

X X
−4 ln(t) −4 ln(t)
X
2 2 X 2
≤ e +e ≤ t ≤ = .
t4 t2 3
t=n x=ūaT y =1 t=n t=1
11/16

s
−1
T t X
t
( r )
X X 2 2
B≤ P p̂a (x) + ln(t) ≥ p̂⋆ (y ) + ln(t)
x y
t=n x=ūaT y =1
s
−1 X
T t X t ( )!
X ∆a 2
≤ P p̂a (x) ≥ pa + + P p̂⋆ (y ) < p⋆ − ln(t)
2 y
t=n x=ūa y =1
T
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
T −1 t X
t T −1 ∞
π2

X X
−4 ln(t) −4 ln(t)
X
2 2 X 2
≤ e +e ≤ t ≤ = .
t4 t2 3
t=n x=ūaT y =1 t=n t=1
We are done! 11/16

Multi-armed Bandits
1. Analysis of UCB
12/16

Other Bandit Problems
In this course, we have covered
▶ stochastic multi-armed bandits,
▶ minimisation of expected cumulative regret.
There are many other variations/formulations.
13/16

Incorporating risk/variance in the objective.

▶ Arm 1 gives rewards 0 and 100, each w.p. 1/2.
▶ Which arm would you prefer?
13/16

Incorporating risk/variance in the objective.

▶ Which arm would you prefer?
What if the arms’ (true) means vary over time?

▶ Nonstationary setting, seen for example, in on-line ads.
▶ Approach depends on nature of drift/change in rewards.
▶ In practice, one might only trust most recent data from arms.
▶ In practice, the set of arms can itself change over time! 13/16

Pure exploration.
▶ Separate “testing” and “live” phases.
▶ In testing phase, rewards don’t matter.
▶ PAC formulation: W.p. at least 1 − δ, must return an ϵ-optimal arm, while
incurring a small number of pulls.
▶ Simple regret formulation: Given a budget of T pulls, must output an arm a
such that pa is large, or equivalently, simple regret = p⋆ − pa is small).
14/16

Pure exploration.
▶ Separate “testing” and “live” phases.
▶ In testing phase, rewards don’t matter.
▶ PAC formulation: W.p. at least 1 − δ, must return an ϵ-optimal arm, while
incurring a small number of pulls.
▶ Simple regret formulation: Given a budget of T pulls, must output an arm a
such that pa is large, or equivalently, simple regret = p⋆ − pa is small).
Limited number of feedback stages.

▶ Suppose you are given budget T , but your algorithm can look at history only
s < T times?
▶ UCB, Thompson Sampling, etc. are fully sequential (s = T ).
▶ How to manage with fewer “stages” s?
14/16

What if the number of arms is large (thousands, millions)?
▶ If arms can be described using features, mean reward is often treated as a
(linear) function of these features.
▶ Quantile-regret: look for “good”, rather than “optimal” arms.
15/16

What if we are interacting with many bandits simultaneously?

▶ Contextual bandits: If the bandits themselves can be described using features
(a “context”), data from one can be used to generate estimates about others.
15/16

What if we are interacting with many bandits simultaneously?

▶ Contextual bandits: If the bandits themselves can be described using features
(a “context”), data from one can be used to generate estimates about others.
What if the rewards do not come from a fixed random process?

▶ Adversarial bandits make no assumption on the rewards.
▶ Possible to show sub-linear regret when compared against playing a single arm
for the entire run.
▶ Necessary to use a randomised algorithm. 15/16

Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm
Concentration bounds
Understanding Thompson Sampling
Other bandit problems
16/16

Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm
Concentration bounds
Understanding Thompson Sampling
Other bandit problems
Next class: Markov Decision Problems 16/16

CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan

Uploaded by

Copyright:

Available Formats

You might also like

CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan

Uploaded by

Copyright:

Available Formats

CS 747, Autumn 2023: Lecture 5

Department of Computer Science and Engineering

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 1 / 16

2. Other bandit problems

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 2 / 16

2. Other bandit problems

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 2 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 3 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 3 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 3 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 3 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Let Zat be the event that arm a is pulled at time t.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Let Zat be the event that arm a is pulled at time t.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Let Zat be the event that arm a is pulled at time t.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Let Zat be the event that arm a is pulled at time t.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Let Zat be the event that arm a is pulled at time t.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 4 / 16

Give each such arm a ūaT pulls for free.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 5 / 16

Give each such arm a ūaT pulls for free.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 5 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 6 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

We show A is upper-bounded by ūaT and B is upper-bounded by a constant.

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 7 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 8 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 9 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 9 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 9 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 9 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 9 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 10 / 16

Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 10 / 16

Fact: If α > β, then α ≥ γ or β < γ. Holds for arbitrary α, β, γ!