Professional Documents
Culture Documents
CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan
CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan
CS 747, Autumn 2023: Lecture 5: Shivaram Kalyanakrishnan
Shivaram Kalyanakrishnan
Autumn 2023
1/16
1. Analysis of UCB
2/16
1. Analysis of UCB
2/16
ucb at
pt
a
0
R
3/16
ucb at
pt
a
0
R
3/16
ucb at
pt
a
0
R
PT −1
Recall that RT = Tp⋆ − t=0 E[r t ].
3/16
ucb at
pt
a
0
R
PT −1
Recall that RT = Tp⋆ − t=0 E[r t ]. P
1
We shall show that UCB achieves RT = O a:pa ̸=p⋆ p⋆ −pa
log(T ) . 3/16
4/16
4/16
4/16
4/16
As in the algorithm, uat is a random variable that denotes the number of pulls
arm a has received up to time t:
t−1
X
uat = zai .
i=0
4/16
As in the algorithm, uat is a random variable that denotes the number of pulls
arm a has received up to time t:
t−1
X
uat = zai .
i=0
l m
def
We define an instance-specific constant ūaT = (∆8a )2 ln(T ) that will serve in
our proof as a “sufficient” number of pulls of arm a for horizon T . 4/16
If a continues to be pulled beyond ūaT pulls, either its empirical mean has
deviated by more than ∆a /2 from its true mean, or ⋆’s UCB has fallen below
its true mean.
Both events above have a low probability—in aggregate at most a constant
even if summed over an infinite horizon.
5/16
If a continues to be pulled beyond ūaT pulls, either its empirical mean has
deviated by more than ∆a /2 from its true mean, or ⋆’s UCB has fallen below
its true mean.
Both events above have a low probability—in aggregate at most a constant
even if summed over an infinite horizon.
KL-UCB uses the KL inequality, and slightly more sophisticated analysis. 5/16
6/16
T −1
X
RT = Tp − ⋆
E[r t ]
t=0
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆ ⋆
E[r ] = Tp − P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
X
= Tp⋆ − E[zat ]pa
t=0 a∈A
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
= Tp⋆ − E[zat ]pa = E[uaT ] p⋆ − E[uaT ]pa
t=0 a∈A a∈A a∈A
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
= Tp⋆ − E[zat ]pa = E[uaT ] p⋆ − E[uaT ]pa
t=0 a∈A a∈A a∈A
X
= E[uaT ](p⋆ − pa )
a∈A
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
= Tp⋆ − E[zat ]pa = E[uaT ] p⋆ − E[uaT ]pa
t=0 a∈A a∈A a∈A
X X
= E[uaT ](p⋆ − pa ) = E[uaT ]∆a .
a∈A a:pa ̸=p⋆
6/16
T −1
X T −1 X
X
t
RT = Tp − ⋆
E[r ] = Tp − ⋆
P{Zat }E[r t |Zat ]
t=0 t=0 a∈A
T −1 X
!
X X X
= Tp⋆ − E[zat ]pa = E[uaT ] p⋆ − E[uaT ]pa
t=0 a∈A a∈A a∈A
X X
= E[uaT ](p⋆ − pa ) = E[uaT ]∆a .
a∈A a:pa ̸=p⋆
To show the regret bound, we shall show for each sub-optimal arm a that
T 1
E[ua ] = O log(T ) .
(∆a )2
6/16
7/16
7/16
T −1
X
E[uaT ] = E[zat ]
t=0
7/16
T −1
X T −1
X
E[uaT ] = E[zat ] = P{Zat }
t=0 t=0
7/16
T −1
X T −1
X
E[uaT ] = E[zat ] = P{Zat }
t=0 t=0
T −1
X T −1
X
= P{Zat and (uat < ūaT )} + P{Zat and (uat ≥ ūaT )}
t=0 t=0
7/16
T −1
X T −1
X
E[uaT ] = E[zat ] = P{Zat }
t=0 t=0
T −1
X T −1
X
= P{Zat and (uat < ūaT )} + P{Zat and (uat ≥ ūaT )}
t=0 t=0
= A + B.
7/16
T −1
X T −1
X
E[uaT ] = E[zat ] = P{Zat }
t=0 t=0
T −1
X T −1
X
= P{Zat and (uat < ūaT )} + P{Zat and (uat ≥ ūaT )}
t=0 t=0
= A + B.
8/16
8/16
8/16
8/16
8/16
8/16
8/16
We have used the fact that for 0 ≤ i < j ≤ t − 1, (Zai , (uai = m)) and
(Zaj , (uaj = m)) are mutually exclusive. 8/16
9/16
9/16
9/16
9/16
p̂a (x) is the empirical mean of the first x pulls of arm a, and
p̂⋆ (y ) is the empirical mean of the first y pulls of arm ⋆. 9/16
10/16
10/16
10/16
11/16
11/16
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
11/16
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
T −1 t X
t T −1 ∞
π2
X X
−4 ln(t) −4 ln(t)
X
2 2 X 2
≤ e +e ≤ t ≤ = .
t4 t2 3
t=n x=ūaT y =1 t=n t=1
11/16
T −1 t X
t q 2
∆a 2 2
e−2x ( ) + e−2y ln(t)
X X
≤ 2 y
t=n x=ūaT y =1
T −1 t X
t T −1 ∞
π2
X X
−4 ln(t) −4 ln(t)
X
2 2 X 2
≤ e +e ≤ t ≤ = .
t4 t2 3
t=n x=ūaT y =1 t=n t=1
1. Analysis of UCB
12/16
13/16
13/16
14/16
15/16
15/16
16/16