
Lecture Notes for MS&E 325: Topics in Stochastic Optimization (Stanford)
and CIS 677: Algorithmic Decision Theory and Bayesian Optimization (UPenn)
Ashish Goel
Stanford University
ashishg@stanford.edu
Sudipto Guha
University of Pennsylvania
sudipto@cis.upenn.edu
Winter 2008-09 (Stanford); Spring 2008-09 (UPenn)
Under Construction: Do not Distribute
Chapter 1
Introduction: Class Overview,
Markov Decision Processes, and
Priors
This class deals with optimization problems where the input comes from a probability distribution, or in some cases, is generated iteratively by an adversary. The first part of the class deals with Algorithmic Decision Theory, where we will study algorithms for designing strategies for making decisions which are provably (near-)optimal, computationally efficient, and use available and acquired data, as well as probabilistic models thereon. This field touches upon statistics, machine learning, combinatorial algorithms, and convex/linear optimization, and some of the results we study are several decades old. The field is seeing a resurgence because of the large scale of data being generated by Internet applications.
The second part of this class will deal with combinatorial optimization problems, such as knapsack, scheduling, routing, network design, and inventory management, given stochastic inputs.
1.1 Input models and objectives
Consider N alternatives, and assume that you have to choose one alternative during every time step t, where t goes from 0 to $\infty$. This series of choices is called a strategy; the arm chosen by the strategy at time t is denoted as $a_t$. Generally, the alternatives are called arms, and choosing an alternative is called playing the corresponding arm. Arm i gives a reward of $r_i(t)$ at time t, where $r_i(t)$ may depend on all the past choices made by the strategy. The quantity $r_i(t)$ may be a random variable, with a distribution that is unknown, known, or on which you have some prior beliefs. The quantity $r_i(t)$ may also be chosen adversarially. This gives two broad classes of input models, probabilistic and adversarial, with many important variations. We will study these models in great detail.
There are also several objectives we could have. The first is a finite-horizon objective, where we are given a finite horizon T and the goal is to:
$$\text{Maximize } E\left[\sum_{t=0}^{T-1} r_{a_t}(t)\right].$$
The second is the infinite horizon discounted reward model. Here, we are given a discount factor $\gamma \in [0, 1)$; informally, this is today's value for a Dollar that we will make tomorrow. Think of this as (1 - the interest rate). If you have a long planning horizon, $\gamma$ should be chosen to be very close to 1. If you are very short-sighted, $\gamma$ will be close to 0. The objective is:
$$\text{Maximize } E\left[\sum_{t=0}^{\infty} \gamma^t r_{a_t}(t)\right].$$
By linearity of expectation (and, for the infinite horizon, an exchange of the limit and the expectation), we obtain the equivalent objectives:
$$\text{Maximize } \sum_{t=0}^{T-1} E[r_{a_t}(t)], \qquad\text{and}\qquad \text{Maximize } \sum_{t=0}^{\infty} \gamma^t E[r_{a_t}(t)].$$
We will now define $r_{a_t}(t)$ as the expected reward, and get rid of the expectation altogether. This expectation is over both the input (if the input is probabilistic) and the strategy (if the strategy is randomized).
These problems are all collectively called multi-armed bandit problems. Colloquially, a slot
machine in a casino is also called a one-armed bandit, since it has one lever, and usually robs you
of your money. Hence, a multi-armed bandit is an appropriate name for this class of problems,
where we can think of each alternative as one arm of a multi-armed slot machine.
Remember, our goal in this class is to design these strategies algorithmically.
1.2 An illustrative example
Suppose you are in Las Vegas for a year, and will go to a casino every day. In the casino there are two slot machines, each of which gives a Dollar as a reward with some unknown probability, which may not be the same for both machines. For the first, there is data available from 3 trials, and one of them was a success (i.e. gave a reward) and two were failures (i.e. gave no reward). For the second machine, there is data available from $2\times 10^8$ trials, $10^8$ of which gave a reward. We will say that the first machine is a (1, 2) machine: the first component of the tuple refers to the number of successful trials, and the second refers to the number of unsuccessful ones. Thus, the second machine is a $(10^8, 10^8)$ machine.
What would be your best guess of the expected reward from the first machine? Clearly 1/3. For the second machine? Clearly 1/2. Also, it seems clear that the second machine is less risky. So which machine should you play the very first day? Surprisingly, it is the first. If you get a reward the first day, the first machine would become a (2, 2) machine and you can play it again. If you get a reward again, the first machine would become a (3, 2) machine, and would start to look better than the second. If, on the other hand, the first machine does not give you a reward the first day, or the second day, then you can revert to playing the second machine. We will see how to make this statement more formal.
Here, we sacrifice some expected reward (i.e. choose not to exploit the best available machine) on the first day in order to explore an arm with a high upside; this is an example of an exploration-exploitation tradeoff and is a central feature of algorithmic decision theory.
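As a sanity check on this intuition, here is a minimal Monte Carlo sketch, assuming the Bayesian model formalized in Sections 1.4 and 3.1: the first machine's unknown success probability is treated as a draw from a Beta(2, 3) distribution (the posterior after one success and two failures under a uniform prior), the second machine succeeds with probability exactly 1/2, and the horizon is 365 days. The exploring policy plays the first machine on day one and keeps playing it while its posterior mean stays at least 1/2; under these assumptions it should come out ahead of the 365/2 = 182.5 earned by always playing the second machine.

import random

def expected_reward_of_exploring(T=365, trials=20000, seed=0):
    # Machine 1's success probability is modeled as a draw from Beta(2, 3)
    # (posterior after 1 success and 2 failures under a uniform prior).
    # Machine 2 succeeds with probability exactly 1/2, so always playing it
    # earns T/2 in expectation.  The policy below plays machine 1 on day one
    # and keeps playing it while its posterior mean alpha/(alpha+beta) stays
    # at least 1/2; otherwise it switches to machine 2 for good.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        p1 = rng.betavariate(2, 3)
        a, b = 2, 3                         # posterior parameters for machine 1
        reward = 0
        for day in range(T):
            if day == 0 or a >= b:          # posterior mean >= 1/2: play machine 1
                if rng.random() < p1:
                    reward += 1
                    a += 1
                else:
                    b += 1
            elif rng.random() < 0.5:        # otherwise play machine 2
                reward += 1
        total += reward
    return total / trials

print("always machine 2 (exact):", 365 / 2)
print("explore machine 1 first :", expected_reward_of_exploring())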
1.3 Markov Decision Processes
Before starting on the main topics in this class, it is worth seeing one of the most basic and useful tools in stochastic optimization: Markov Decision Processes. When you see a stochastic optimization problem, this is probably the first tool you must try, the second being Bellman's formula for dynamic programming, which we will briefly see later. Only when these are inefficient or inapplicable should you try more advanced approaches such as the ones we are going to see in the rest of this class.
Assume you are given a finite state space S, an initial state $s_0$, a set of actions A, a reward function $r : S \times A \to \Re$, and a function $P : S \times A \times S \to [0, 1]$ such that $\sum_{v \in S} P(u, a, v) = 1$ for all states $u \in S$ and all actions $a \in A$. Informally, an MDP is like a Markov chain, but the transitions happen only when an action is taken, and the transition probabilities depend on the state as well as the action taken. Also, depending on the state u and the action a, we get a reward r(u, a). Of course the reward itself could be a random variable, but as pointed out earlier, we replace it by its expected value in that case.
Given an MDP, you might want to maximize the finite horizon reward or the infinite horizon discounted reward. We will focus on the latter for now; the former is tractable as well. Let $\gamma$ be the discount factor. Let $\phi(s)$ be the expected discounted reward obtained by the optimum strategy starting from state s, and let $\phi(s, a)$ be the expected discounted reward obtained by the optimum strategy starting from state s assuming that action a is performed first. Then the following linear constraints must be satisfied by the $\phi$'s:
$$\forall u \in S, a \in A : \quad \phi(u) \geq \phi(u, a) \quad (1.1)$$
$$\forall u \in S, a \in A : \quad \phi(u, a) \geq r(u, a) + \gamma \sum_{v \in S} P(u, a, v)\,\phi(v) \quad (1.2)$$
The optimum solution can now be found by a linear program with the constraints as given above and the linear objective:
$$\text{Minimize } \phi(s_0) \quad (1.3)$$
Thus, MDPs can be solved very efficiently, provided the state space is small. MDPs can also be defined with countable state spaces, but then the LP formulation above is not directly useful as a solution procedure.
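To make the LP concrete, here is a minimal sketch using scipy.optimize.linprog (the toy transition matrix and rewards below are made up for illustration, not taken from the notes). It minimizes the sum of $\phi(s)$ over all states rather than just $\phi(s_0)$ as in (1.3); this yields the same optimal value at $s_0$ and recovers the whole value function at once.

import numpy as np
from scipy.optimize import linprog

def solve_discounted_mdp(P, r, gamma):
    # P[a][u][v]: probability of moving from state u to v under action a.
    # r[u][a]   : expected reward for taking action a in state u.
    # Constraints: phi(u) >= r(u, a) + gamma * sum_v P(u, a, v) * phi(v),
    # rewritten for linprog as -phi(u) + gamma * sum_v P(u,a,v) phi(v) <= -r(u,a).
    nA, nS, _ = np.shape(P)
    A_ub, b_ub = [], []
    for u in range(nS):
        for a in range(nA):
            row = gamma * np.array(P[a][u], dtype=float)
            row[u] -= 1.0
            A_ub.append(row)
            b_ub.append(-r[u][a])
    res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
    return res.x                            # optimal phi(u) for every state

# Toy 2-state, 2-action instance (hypothetical numbers).
P = [[[0.9, 0.1], [0.2, 0.8]],              # transitions under action 0
     [[0.5, 0.5], [0.6, 0.4]]]              # transitions under action 1
r = [[1.0, 0.2], [0.0, 0.5]]                # r[state][action]
print(solve_discounted_mdp(P, r, gamma=0.9))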
1.4 Priors and posteriors
Let us revisit the illustrative example. Consider the (1, 2) machine. Given that this is all you know about the machine, what would be a reasonable estimate of the success probability of the machine? It seems reasonable to say 1/3; remember, this is an illustrative example, and there are scenarios in which some other estimate of the probability might make sense. The tuple (1, 2) and the probability estimate (1/3) represent a prior belief about the machine. If you play this machine, and you get a success (which we now believe will happen with a probability of 1/3), then the machine will become a (2, 2) machine, with a success probability estimate of 1/2; this is called a posterior belief. Of course, if the first trial results in a failure, the posterior would have been (1, 3) with a success probability estimate of 1/4.
This is an example of what are known as Beta priors. We will use $(\alpha, \beta)$ to roughly¹ denote the number of successes and failures, respectively. We will see these priors in some detail later on, and will refine and motivate the definitions. These are the most important class of priors. Another prior we will use frequently (despite it being trivial) is the fixed prior, where we believe we know the success probability p and it never changes. We will refer to an arm with this prior as a standard arm with probability p.
For the purpose of this class, we can think of a prior as a Markov chain with a countable state space S, transition probability matrix (or kernel) P, a current state $u \in S$, and a reward function $r : S \to \Re$. Typically, the range of the reward function will be [0, 1] and we will interpret that as the probability of success. Beta priors can then be interpreted as having a state space $Z^+ \times Z^+$, a reward function $r(\alpha, \beta) = \alpha/(\alpha+\beta)$, and transition probabilities $P((\alpha, \beta), (\alpha+1, \beta)) = \alpha/(\alpha+\beta)$ and $P((\alpha, \beta), (\alpha, \beta+1)) = \beta/(\alpha+\beta)$ (and 0 elsewhere).
When we perform an experiment on, i.e. play, an arm with prior (S, r, P, u), the state of the
arm changes to v according to the transition probability matrix P. We assume that we observe this
change, and we now obtain the posterior (S, r, P, v), which acts as the prior for the next step.
For much of this class, we will assume that the prior changes only when an arm is played.
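A minimal sketch of this Markov-chain view of a Beta prior (the class name and interface below are hypothetical, for illustration only):

import random

class BetaArm:
    # A Beta prior viewed as a Markov chain over states (alpha, beta): the
    # reward rate in the current state is alpha / (alpha + beta), and playing
    # the arm moves to (alpha + 1, beta) with that probability (a success)
    # and to (alpha, beta + 1) otherwise (a failure).
    def __init__(self, alpha=1, beta=1, rng=None):
        self.alpha, self.beta = alpha, beta
        self.rng = rng or random.Random()

    def reward_rate(self):
        return self.alpha / (self.alpha + self.beta)

    def play(self):
        # Play once, observe the transition, and return the observed reward;
        # the new state acts as the prior for the next step.
        if self.rng.random() < self.reward_rate():
            self.alpha += 1
            return 1
        self.beta += 1
        return 0

arm = BetaArm(2, 3)    # e.g. one observed success and two observed failures
print([arm.play() for _ in range(5)], (arm.alpha, arm.beta))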
How do we know we have the correct prior? Where does a prior come from? On one level, these are philosophical questions for which there is no formal answer. Since this is a class in optimization, we can assume that the priors are given to us, that they are reliable, and that we will optimize assuming they are correct. On another level, the priors often come from a generative model, i.e. from some knowledge of the underlying process. For example, we might know (or believe) that a slot machine gives i.i.d. Bernoulli rewards, with an unknown probability parameter p. We might further make the Bayesian assumption that the probability that this parameter is p, given the number of successes and failures we have observed, is proportional to the probability that we would get the observed number of successes and failures if the parameter were p. This leads to the Beta priors, as we will discuss later. Of course this just brings up the philosophical question of whether the Bayesian assumption is valid or not. In this class, we will be agnostic with respect to this question. Given a prior, we will attempt to obtain optimum strategies with respect to that prior. But we will also discuss the scenario where the rewards are adversarial.
¹More precisely, $\alpha - 1$ and $\beta - 1$ denote the number of successes and failures, respectively.
Chapter 2
Discounted Multi-Armed Bandits and
the Gittins Index
We will now study the problem of maximizing the expected discounted infinite horizon reward, given a discount factor $\gamma$, and given n arms with the i-th arm having a prior $(S_i, r_i, P_i, u_i)$. Our goal is to maximize the expected discounted reward by playing exactly one arm in every time step. The state $u_i$ is observable, an arm makes a transition according to $P_i$ when it is played, and this transition takes exactly one time unit, during which no other arm can be played.
This problem is useful in many domains including advertising, marketing, clinical trials, oil exploration, etc. Since $S_i, r_i, P_i$ remain fixed as the prior evolves, the state of the system at any time is described by $(u_1, u_2, \ldots, u_n)$. If the state spaces of the individual arms are finite and of size k each, the state space of the system can be of size $O(k^n)$, which precludes a direct dynamic programming or MDP based approach. We will still be able to solve this efficiently using a striking and beautiful theorem, due to Gittins and Jones. We will assume for now that the range of each reward function is [0, 1]. Also recall that $\gamma \in [0, 1)$.
Theorem 2.1 [The Gittins index theorem] Given a discount factor $\gamma$, there exists a function $g_\gamma$ from the space of all priors to $[0, 1/(1-\gamma)]$ such that it is an optimum strategy to play an arm i for which $g_\gamma(S_i, r_i, P_i, u_i)$ is the largest.
This theorem is remarkable in terms of how sweeping it is. Notice that the function $g_\gamma$, also called the Gittins index, depends only on one arm, and is completely oblivious to how many other arms there are in the system and what their priors are. For common priors such as the Beta priors, one can purchase or pre-compute the Gittins index for various values of $\alpha$, $\beta$, and then the optimum strategy can be implemented using a simple table-lookup based process, much simpler than an MDP or a dynamic program over the joint space of all the arms. This theorem is existential, but as it turns out, the very existence of the Gittins index leads to an efficient algorithm for computing it. We will outline a method for finite state spaces.
2.1 Computing the Gittins index
First, recall the definition of the standard arm with success probability p; this arm, denoted $R_p$, always gives a reward with probability p. Given two standard arms $R_p$ and $R_q$, where p < q, which would you rather play? Clearly, $R_q$. Thus, the Gittins index of $R_q$ must be higher than that of $R_p$. $R_1$ dominates all possible arms, and $R_0$ is dominated by all possible arms. The arm $R_p$ yields a total discounted profit of $p/(1-\gamma)$, by summing up the infinite geometric progression $p, \gamma p, \gamma^2 p, \ldots$. Given an arm J with prior (S, r, P, u), there must be a standard arm $R_p$ such that, given these two arms, the optimum strategy is indifferent between playing J or $R_p$ in the first step. The Gittins index of arm J must then be the same as the Gittins index of arm $R_p$, and hence any strictly increasing function of p can be used as the Gittins index. In this class, we will use $g_\gamma = p/(1-\gamma)$. The goal then is merely to find the arm $R_p$ given arm J.
This is easily accomplished using a simple LP. Consider the MDP with a finite state space S and action space $\{a_J, a_R\}$, where the first action corresponds to playing arm J and the second corresponds to playing arm $R_p$, in which case we will always play $R_p$ from then on, since nothing changes in the next step. The first action yields reward r(u) when in state u, whereas the second yields the reward $x = p/(1-\gamma)$. The first action leads to a transition according to the matrix P whereas the second leads to termination of the process. The optimum solution to this MDP is given by:
$$\text{Minimize } \phi(u), \text{ subject to: (a) } \forall s \in S,\ \phi(s) \geq x, \text{ and (b) } \forall s \in S,\ \phi(s) \geq r(s) + \gamma \sum_{v \in S} P(s, v)\,\phi(v).$$
If this objective function is bigger than x then it must be better to play arm J in state u.
Let the optimum objective function value of this LP be denoted $z^*(x)$. Our goal is to find the smallest x such that $z^*(x) \leq x$ (which will denote the point of indifference between $R_p$ and J). We can obtain this by performing a binary search over x: if $z^*(x) = x$ then the Gittins index can not be larger than x, and if $z^*(x) > x$ then the Gittins index must be larger than x. In fact, this can also be obtained from the LP:
$$\text{Minimize } x, \text{ subject to: (a) } \phi(u) \leq x, \text{ (b) } \forall s \in S,\ \phi(s) \geq x, \text{ and (c) } \forall s \in S,\ \phi(s) \geq r(s) + \gamma \sum_{v \in S} P(s, v)\,\phi(v).$$
A spreadsheet for approximately computing the Gittins index for arms with Beta priors is available at http://www.stanford.edu/~ashishg/msande325_09/gittins_index.xls.
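Here is a minimal sketch of this computation (illustration only). Instead of solving the LP directly, it computes $z^*(x)$, the value of the MDP with retirement reward x, by value iteration on $\phi(s) = \max(x, r(s) + \gamma\sum_v P(s,v)\phi(v))$, and binary searches for the smallest x at which retiring in state u is already optimal; that x is the index in the $g_\gamma = p/(1-\gamma)$ normalization.

import numpy as np

def gittins_index(P, r, u, gamma, tol=1e-6):
    # P: |S| x |S| transition matrix of the arm, r: reward vector, u: state.
    # For a candidate retirement value x, value iteration computes the fixed
    # point of phi(s) = max(x, r(s) + gamma * sum_v P(s, v) phi(v)), which is
    # the quantity z*(x) from the LP above, evaluated at u.
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)

    def z_star(x):
        phi = np.full(len(r), x)
        while True:
            new = np.maximum(x, r + gamma * P.dot(phi))
            if np.max(np.abs(new - phi)) < tol:
                return new[u]
            phi = new

    lo, hi = 0.0, 1.0 / (1.0 - gamma)       # the index lies in [0, 1/(1-gamma)]
    while hi - lo > tol:
        x = (lo + hi) / 2
        if z_star(x) > x + tol:
            lo = x                          # continuing beats retiring: index > x
        else:
            hi = x
    return (lo + hi) / 2

# Sanity check: a standard arm R_p (one state, self-loop) has index p/(1-gamma).
print(gittins_index(P=[[1.0]], r=[0.3], u=0, gamma=0.9))   # roughly 3.0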
Exercise 2.1 In this problem, we will assume that three advertisers have made bids on the same keyword in a search engine. The search engine (which acts as the auctioneer) assigns $(\alpha, \beta)$ priors to each advertiser, and uses their Gittins index to compute the winner. The advertisers are:
1. Advertiser (a) has $\alpha = 2$, $\beta = 5$, has bid $1 per click, and has no budget constraint.
2. Advertiser (b) has $\alpha = 1$, $\beta = 4$, pays $0.2 per impression and additionally $1 if his ad is clicked. He has no budget constraint.
3. Advertiser (c) has $\alpha = 1$, $\beta = 2$, has bid $1.5 per click, and his ad can only be shown 5 times (including this one).
There is a single slot, the discount factor is $\gamma = 0.95$, and a first price auction is used. Compute the Gittins index for each of the three advertisers. Which ad should the auctioneer allocate the slot to? Briefly speculate on what might be a reasonable second price auction.
2.2 A proof of the Gittins index theorem
We will follow the proof of Tsitsiklis; the paper is on the class web-page.
Chapter 3
Bayesian Updates, Beta Priors, and
Martingales
As mentioned before, priors often come from a parametrized generative model, followed by Bayesian updates. We will call such priors Bayesian. We will assume that the generative model is a family of single parameter distributions, parametrized by $\theta$. Let $f_\theta$ denote the probability distribution on the reward if the underlying parameter of the generative model is $\theta$. If we knew $\theta$, the prior would be trivial. At time t, the prior is denoted as a probability distribution $p_t$ on the parameter $\theta$. Let $x_t$ denote the reward obtained the t-th time an arm is played (i.e. the observation at time t). We are going to assume there exist suitable probability measures over which we can integrate the functions $f_\theta$ and $p_t$.
A Bayesian update essentially says that the posterior probability at time t (i.e. the prior for time t+1) of the parameter being $\theta$, given an observation $x_t$, is proportional to the probability of the observation being $x_t$ given parameter $\theta$. This of course is modulated by the probability of the parameter being $\theta$ at time t. Hence, $p_{t+1}(\theta \mid x_t)$ is proportional to $f_\theta(x_t)\, p_t(\theta)$. We have to normalize this to make $p_{t+1}$ a probability distribution, which gives us
$$p_{t+1}(\theta \mid x_t) = \frac{f_\theta(x_t)\, p_t(\theta)}{\int_{\theta'} f_{\theta'}(x_t)\, p_t(\theta')\, d\theta'}.$$
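A minimal sketch of this update on a discretized parameter grid (illustration only; the grid, the helper name, and the Bernoulli example are assumptions, not part of the notes):

import numpy as np

def bayes_update(thetas, prior, likelihood, x):
    # thetas    : grid of candidate parameter values
    # prior     : prior[i] = p_t(thetas[i]), summing to 1
    # likelihood: likelihood(theta, x) plays the role of f_theta(x)
    # Returns p_{t+1}(theta | x), proportional to f_theta(x) * p_t(theta).
    weights = np.array([likelihood(th, x) for th in thetas]) * prior
    return weights / weights.sum()

# Example: Bernoulli rewards.  Starting from a (discretized) uniform prior,
# a single observed success tilts the posterior toward larger theta.
thetas = np.linspace(0.005, 0.995, 100)
prior = np.full(len(thetas), 1.0 / len(thetas))
bern = lambda th, x: th if x == 1 else 1.0 - th
posterior = bayes_update(thetas, prior, bern, x=1)
print(posterior @ thetas)    # posterior mean, close to 2/3 (the Beta(2, 1) mean)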
3.1 Beta priors
Recall that the prior Beta$(\alpha_t, \beta_t)$ corresponds to having observed $\alpha_t - 1$ successes and $\beta_t - 1$ failures up to time t. We will show that we can also interpret the prior Beta$(\alpha_t, \beta_t)$ as one that comes from the generative model of Bernoulli distributions with Bayesian updates, where the parameter $\theta$ corresponds to the "on" probability of the Bernoulli distribution.
Suppose $\alpha_0 = 1$ and $\beta_0 = 1$, i.e., we have observed 0 successes and 0 failures initially. It seems natural to have this correspond to having a uniform prior on $\theta$ over the range [0, 1]. Applying the Bayes rule repeatedly, we get that $p_t(\theta \mid \alpha_t, \beta_t)$ is proportional to the probability of observing $\alpha_t - 1$ successes and $\beta_t - 1$ failures if the underlying parameter is $\theta$, i.e. proportional to
$$\binom{\alpha_t + \beta_t - 2}{\alpha_t - 1}\, \theta^{\alpha_t - 1} (1-\theta)^{\beta_t - 1}.$$
Normalizing, and using the fact that $\int_{x=0}^{1} x^{a-1}(1-x)^{b-1}\,dx = \Gamma(a)\Gamma(b)/\Gamma(a+b)$, we get
$$p_t(\theta) = \frac{\Gamma(\alpha_t + \beta_t)}{\Gamma(\alpha_t)\Gamma(\beta_t)}\, \theta^{\alpha_t - 1}(1-\theta)^{\beta_t - 1}.$$
This is known as the Beta distribution. The following exercise completes the proof that the transition matrix over state spaces defined earlier is the same as the Bayesian update based prior derived above.
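As a quick numerical check of this derivation (illustration only), repeated discretized Bayesian updates starting from a uniform grid prior should track the closed-form Beta density after the same observations:

import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.005, 0.995, 100)
posterior = np.full(len(thetas), 1.0 / len(thetas))   # uniform prior, i.e. Beta(1, 1)
observations = [1, 0, 1, 1, 0]                        # 3 successes, 2 failures

for x in observations:
    likelihood = thetas if x == 1 else 1.0 - thetas   # Bernoulli f_theta(x)
    posterior = likelihood * posterior
    posterior /= posterior.sum()

closed_form = beta.pdf(thetas, 4, 3)                  # Beta(alpha_t, beta_t) = Beta(4, 3)
closed_form /= closed_form.sum()                      # renormalize on the same grid
print(np.max(np.abs(posterior - closed_form)))        # should be essentially zero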
Exercise 3.1 Given the prior $p_t$ as defined above, show that the probability of obtaining a reward at time t is $\alpha_t/(\alpha_t + \beta_t)$.
3.2 Bayesian priors and martingales
We will now show that Bayesian updates result in a martingale process for the rewards. Consider the case where the prior can be given both over a state space (S, r, P, y) as well as by a generative model with Bayesian updates. Let $s_t$ denote the state at time t and let $r_t$ denote the expected reward in state $s_t$. We will show that the sequence of rewards $\langle r_t\rangle$ is a martingale with respect to the sequence of states $\langle s_t\rangle$. This fact will come in handy multiple times later in this class.
Formally, the claim that the sequence of rewards $\langle r_t\rangle$ is a martingale with respect to the sequence of states $\langle s_t\rangle$ is the same as saying that
$$E[r_{t+1} \mid s_0, s_1, \ldots, s_t] = r_t.$$
We will use x to denote the observed reward at time t and y to denote the observed reward at time t+1. Thinking of the prior as coming from a generative model, we get
$$r_t = \int_x x \int_\theta p_t(\theta)\, f_\theta(x)\, d\theta\, dx,$$
where $p_t$ depends only on $s_t$. Similarly, we get¹
$$E[r_{t+1} \mid s_t] = \int_y y \int_{\theta'} f_{\theta'}(y)\, p_{t+1}(\theta' \mid s_t)\, d\theta'\, dy.$$
Using Bayesian updates, we get
$$p_{t+1}(\theta' \mid s_t, x) = \frac{f_{\theta'}(x)\, p_t(\theta')}{\int_{\theta''} f_{\theta''}(x)\, p_t(\theta'')\, d\theta''}.$$
In order to remove the conditioning over x we need to integrate over x, i.e.,
$$p_{t+1}(\theta' \mid s_t) = \int_x p_{t+1}(\theta' \mid s_t, x) \int_{\theta''} p_t(\theta'')\, f_{\theta''}(x)\, d\theta''\, dx.$$
Combining, we obtain:
$$E[r_{t+1} \mid s_t] = \int_y y \int_{\theta'} f_{\theta'}(y) \int_x \frac{f_{\theta'}(x)\, p_t(\theta')}{\int_{\theta''} f_{\theta''}(x)\, p_t(\theta'')\, d\theta''} \int_{\theta''} p_t(\theta'')\, f_{\theta''}(x)\, d\theta''\, dx\, d\theta'\, dy.$$
The integrals over $\theta''$ cancel out, giving
$$E[r_{t+1} \mid s_t] = \int_y y \int_{\theta'} f_{\theta'}(y) \int_x f_{\theta'}(x)\, p_t(\theta')\, dx\, d\theta'\, dy.$$
Since $f_{\theta'}$ is a probability distribution, the inner-most integral evaluates to $p_t(\theta')$, giving
$$E[r_{t+1} \mid s_t] = \int_y y \int_{\theta'} f_{\theta'}(y)\, p_t(\theta')\, d\theta'\, dy.$$
This is the same as $r_t$.
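For Beta priors this identity can also be checked directly: the expected reward rate after one play of a Beta$(\alpha, \beta)$ arm equals the current rate $\alpha/(\alpha+\beta)$. A small sketch in exact arithmetic (illustration only):

from fractions import Fraction

def next_expected_rate(alpha, beta):
    # With probability alpha/(alpha+beta) the state moves to (alpha+1, beta),
    # otherwise to (alpha, beta+1); average the resulting reward rates.
    a, b = Fraction(alpha), Fraction(beta)
    p_success = a / (a + b)
    return p_success * (a + 1) / (a + b + 1) + (1 - p_success) * a / (a + b + 1)

for alpha, beta in [(1, 1), (2, 3), (7, 2)]:
    assert next_expected_rate(alpha, beta) == Fraction(alpha, alpha + beta)
print("E[r_{t+1} | s_t] = r_t holds exactly for these Beta states")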
Exercise 3.2 Give an example of a prior over a state space such that this prior can not be obtained
from any generative model using Bayesian updates. Prove your claim.
¹Here $\theta'$ and $\theta''$ are just different symbols; they are not derivatives or second derivatives of $\theta$.
Exercise 3.3 Extra experiments can not hurt: The budgeted learning problem is defined as follows. You are given n arms with a separate prior $(S_i, r_i, P_i, u_i)$ on each arm. You are allowed to make T plays. At the end of the T plays, you must pick a single arm i, and you will earn the expected reward of the chosen arm at that time. Let $z^*$ denote the expected reward obtained by the optimum strategy for this problem. Show that $z^*$ is non-decreasing in T if the priors are Bayesian.
The next two exercises illustrate how surprising the Gittins index theorem is.
Exercise 3.4 Show that the budgeted learning problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples. Hint: Define arms A, B such that given A, B the optimum choice is to play A, whereas given A and two copies of B, the optimum choice is to play B. Hence, there can not be a total order on all priors.
Exercise 3.5 Show that the finite-horizon multi-armed bandit problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples.
Chapter 4
Minimizing Regret Against Unknown
Distributions
The algorithm in this chapter is based on the paper "Finite-time Analysis of the Multiarmed Bandit Problem" by P. Auer, N. Cesa-Bianchi, and P. Fischer, http://www.springerlink.com/content/l7v1647363415h1t/.
The Gittins index is efficiently computable, decouples different arms, and gives an optimum solution. There is one problem of course; it assumes a prior, and is optimum only in the class of strategies which have no additional information. We will now make the problem one level more complex. We will assume that the reward for each arm comes from an unknown probability distribution. Consider N arms. Let $X_{i,s}$ denote the reward obtained when the i-th arm is played for the s-th time. We will assume that the random variables $X_{i,s}$ are independent of each other, and we will further assume that for any arm i, the variables $X_{i,s}$ are identically distributed. No other assumptions will be necessary. We will assume that these distributions are generated by an adversary who knows the strategy we are going to employ (but not any random coin tosses we may make). Let $\mu_i = E[X_{i,s}]$ be the expected reward each time arm i is played. Let $i^*$ denote the arm with the highest expected reward, $\mu^*$. Let $\Delta_i = \mu^* - \mu_i$ denote the difference in the expected reward of the optimal arm and arm i.
A strategy must choose an arm to play during each time step. Let $I_t$ denote the arm chosen at time t and let $k_{i,t}$ denote the number of times arm i is played during the first t steps.
Ideally, we would like to maximize the total reward obtained over all time horizons T simultaneously. Needless to say there is no hope of achieving this: the adversary may randomly choose a special arm j, make all $X_{j,s}$ equal to 1 deterministically, and for all $i \neq j$, make all $X_{i,s} = 0$ deterministically. For any strategy, there is a probability of at least half that the strategy will not play arm j in the first N/2 steps, and hence, the expected profit of any strategy is at most N/4 over the first N/2 steps, whereas the optimum profit if we knew the distribution would be N/2.
Instead, we can set a more achievable goal. Define the regret of a strategy at time T to be the total difference between the optimal reward over the first T steps and the reward of the strategy over the same period. The expected regret is then given by
$$E[\mathrm{Regret}(T)] = T\mu^* - \sum_{t=1}^{T} E[\mu_{I_t}] = \sum_{i:\, \mu_i < \mu^*} \Delta_i\, E[k_{i,T}].$$
In a classic paper, Lai and Robbins showed that this expected regret can be no less than
$$\left(\sum_{i=1}^{N} \frac{\Delta_i}{D(X_i\,\|\,X^*)}\right)\log T.$$
Here $D(X_i\,\|\,X^*)$ is the Kullback-Leibler divergence (also known as the information divergence) of the distribution of $X_i$ with respect to the distribution of $X^*$, and is given by $\int f_i \ln(f_i/f^*)$, where the integral is over some suitable underlying measure space. Surprisingly, they showed that this lower bound can be matched for special classes of distributions asymptotically, i.e., as $T \to \infty$. We are going to see an even more surprising result, due to Auer, Cesa-Bianchi, and Fischer, where a very similar bound is achieved simultaneously for all T and for all distributions with support in [0, 1]. We will skip the proof of correctness since that is clearly specified in the paper (with a different notation), but will describe the algorithm, state their main theorem, and see a useful balancing trick.
The algorithm, which they call UCB$_1$, assigns an upper confidence bound to each arm. Let $\bar{X}_i(t)$ denote the average reward obtained from all the times arm i was played up to and including time t. Let $c_{t,s} = \sqrt{\frac{2\ln t}{s}}$ denote a confidence interval. Then, at the end of time t, assign the following index, called the upper-confidence-index and denoted $U_i(t)$, to arm i:
$$U_i(t) = \bar{X}_i(t) + c_{t,\,k_i(t)}.$$
During the first N steps, each arm is played exactly once in some arbitrary order. After that, the algorithm repeatedly plays the arm with the highest upper-confidence index, i.e. at time t+1 it plays the arm with the highest index $U_i(t)$. Ties are broken arbitrarily.
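A minimal sketch of UCB$_1$ as just described (illustration only; the pull interface and the toy Bernoulli arms are assumptions, not from the paper):

import math
import random

def ucb1(pull, N, T):
    # Play each arm once, then always play the arm maximizing the index
    # X_bar_i(t) + sqrt(2 ln t / k_i(t)); pull(i) returns a reward in [0, 1].
    counts = [0] * N
    sums = [0.0] * N
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1                               # initial round of plays
        else:
            i = max(range(N), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        x = pull(i)
        counts[i] += 1
        sums[i] += x
    return counts

# Toy run: three Bernoulli arms with means unknown to the algorithm.
means = [0.2, 0.5, 0.6]
plays = ucb1(lambda i: float(random.random() < means[i]), N=3, T=10000)
print(plays)    # the 0.6 arm should receive the vast majority of the plays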
This strikingly simple rule leads to the following powerful theorem (proof scribed by Pranav Dandekar):
Theorem 4.1 The expected number of times arm i is played up to time T, $E[k_i(T)]$, is at most $\frac{8\ln T}{\Delta_i^2} + \kappa$, where $\kappa$ is some fixed constant independent of the distributions, T, or the number of arms.
Proof: In the first N steps, the algorithm will play each arm once. Therefore, we have
$$\forall i: \quad k_i(T) = 1 + \sum_{t=N+1}^{T} \mathbf{1}\{I_t = i\},$$
where $\mathbf{1}\{I_t = i\}$ is an indicator variable which takes value 1 if arm i is played at time t and 0 otherwise. For any integer $l \geq 1$, we can similarly write
$$\forall i: \quad k_i(T) \leq l + \sum_{t=N+1}^{T} \mathbf{1}\{I_t = i,\ k_i(t-1) \geq l\}.$$
Let
$$i^* = \arg\max_i \mu_i, \qquad U^*(t) = U_{i^*}(t), \qquad \bar{X}^*(t) = \bar{X}_{i^*}(t), \qquad k^*(t) = k_{i^*}(t).$$
If arm i was played at time t, this implies its score, $U_i(t-1)$, was at least the score of the arm with the highest mean, $U^*(t-1)$ (note that this is a necessary but not sufficient condition). Therefore we have
$$\forall i: \quad k_i(T) \leq l + \sum_{t=N+1}^{T} \mathbf{1}\{U_i(t-1) \geq U^*(t-1),\ k_i(t-1) \geq l\}$$
$$\forall i: \quad k_i(T) \leq l + \sum_{t=N+1}^{T} \mathbf{1}\left\{\max_{l \leq s_i < t} U_i(s_i) \geq \min_{0 < s < t} U^*(s),\ k_i(t-1) \geq l\right\}$$
Instead of taking the max over $l \leq s_i < t$ and the min over $0 < s < t$, we sum over all occurrences where $U_i(s_i) \geq U^*(s)$:
$$k_i(T) \leq l + \sum_{t=1}^{T}\sum_{s_i=l}^{t}\sum_{s=1}^{t} \mathbf{1}\{U_i(s_i) \geq U^*(s)\} = l + \sum_{t=1}^{T}\sum_{s_i=l}^{t}\sum_{s=1}^{t} \mathbf{1}\{\bar{X}_i(s_i) + c_{t,s_i} \geq \bar{X}^*(s) + c_{t,s}\} \quad (4.1)$$
Observe that $\bar{X}_i(s_i) + c_{t,s_i} \geq \bar{X}^*(s) + c_{t,s}$ implies that at least one of the following must hold:
$$\bar{X}^*(s) \leq \mu^* - c_{t,s} \quad (4.2)$$
$$\bar{X}_i(s_i) \geq \mu_i + c_{t,s_i} \quad (4.3)$$
$$\mu^* < \mu_i + 2c_{t,s_i} \quad (4.4)$$
We choose l such that the last condition is false. Since $c_{T,l} = \sqrt{\frac{2\ln T}{l}}$, we set $l \geq \frac{8\ln T}{\Delta_i^2}$. Ignoring the $k_i(t-1) \geq l$ condition, we have
$$E[k_i(T)] \leq \frac{8\ln T}{\Delta_i^2} + \sum_{t=1}^{\infty} t \sum_{s_i=1}^{t} \Pr[\bar{X}_i(s_i) \geq \mu_i + c_{t,s_i}] + \sum_{t=1}^{\infty} t \sum_{s=1}^{t} \Pr[\bar{X}^*(s) \leq \mu^* - c_{t,s}].$$
To bound the probabilities in the above expression, we make use of the Chernoff-Hoeffding bound:
Fact 4.2 (Chernoff-Hoeffding Bound) Given a sequence of t i.i.d. random variables $z_1, z_2, \ldots, z_t$ such that $z_i \in [0, 1]$ for all i, let $S = \sum_{i=1}^{t} z_i$ and $\mu = E[z_i]$. Then for all $a \geq 0$,
$$\Pr\left[\frac{S}{t} > \mu + a\right] \leq e^{-2ta^2} \qquad\text{and}\qquad \Pr\left[\frac{S}{t} < \mu - a\right] \leq e^{-2ta^2}.$$
Using the Chernoff-Hoeffding bound, we get
$$\Pr[\bar{X}_i(s_i) \geq \mu_i + c_{t,s_i}] \leq e^{-2 s_i c_{t,s_i}^2} = e^{-4\ln t} = \frac{1}{t^4}.$$
Similarly,
$$\Pr[\bar{X}^*(s) \leq \mu^* - c_{t,s}] \leq e^{-2 s c_{t,s}^2} = e^{-4\ln t} = \frac{1}{t^4}.$$
Substituting these bounds on the probabilities, we get
$$E[k_i(T)] \leq \frac{8\ln T}{\Delta_i^2} + \sum_{t=1}^{\infty} t \sum_{s_i=1}^{t} \frac{1}{t^4} + \sum_{t=1}^{\infty} t \sum_{s=1}^{t} \frac{1}{t^4} = \frac{8\ln T}{\Delta_i^2} + 2\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{8\ln T}{\Delta_i^2} + \kappa,$$
where $\kappa$ is a constant.
Plugging into the expression for the regret, we get an expected regret of $O\left((\ln T)\sum_{i:\, \mu_i < \mu^*} (1/\Delta_i)\right)$, which is very close (both qualitatively and quantitatively) to the lower bound of Lai and Robbins.
Exercise 4.1 Prove that if the discount factor $\gamma$ is larger than $1 - 1/N^2$ then the algorithm UCB$_1$ results in a near-optimal solution to the discounted infinite horizon problem.
Exercise 4.2 Prove that the algorithm UCB$_1$ plays every arm infinitely often.
The above algorithm appears to have very high regret when the means are all very close together, i.e. when the $\Delta_i$'s are all very small. That doesn't seem right, since intuitively the algorithm should do well in that setting. The key to understanding this case is a balancing argument. Let us ignore the constant $\kappa$ in the above theorem. Now, divide the arms into two classes: those with $\Delta_i \geq \sqrt{\frac{N\ln T}{T}}$ and those with $\Delta_i < \sqrt{\frac{N\ln T}{T}}$. Let
$$A^+ = \left\{i : \Delta_i \geq \sqrt{\frac{N\ln T}{T}}\right\}, \qquad A^- = \left\{i : \Delta_i < \sqrt{\frac{N\ln T}{T}}\right\}.$$
Then the total regret of the algorithm is given by
$$E[\mathrm{Regret}(T)] = \sum_{i\in A^+} E[\mathrm{Regret}_i(T)] + \sum_{i\in A^-} E[\mathrm{Regret}_i(T)]$$
$$E[\mathrm{Regret}(T)] \leq \sum_{i\in A^+}\left(\frac{8\ln T}{\Delta_i} + O(\Delta_i)\right) + \sum_{i\in A^-}\Delta_i\, E[k_i(T)]$$
$$E[\mathrm{Regret}(T)] \leq \sum_{i\in A^+}\left(8\sqrt{\frac{T\ln T}{N}} + O(\Delta_i)\right) + \sum_{i\in A^-}\sqrt{\frac{N\ln T}{T}}\, E[k_i(T)]$$
$$E[\mathrm{Regret}(T)] \leq 8\sqrt{NT\ln T} + O(N) + T\sqrt{\frac{N\ln T}{T}} = O(\sqrt{NT\ln T}).$$
Thus, we have the corollary:
Corollary 4.3 The algorithm UCB$_1$ has expected regret $O(\sqrt{NT\ln T})$ for all T.
Notice that the algorithm does not depend on the choice of the balancing factor, just the analysis.
Chapter 5
Minimizing Regret in the Partial
Information Model
Scribed by Michael Kapralov. Based primarily on the paper: The Nonstochastic
Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire.
SIAM J on Computing, 32(1), 48-77.
We use $x_i(t) \in [0, 1]$ to denote the reward obtained from playing arm i at time t, but now we assume that the values $x_i(t)$ are determined by an adversary and need not come from a fixed distribution. We assume that the adversary chooses a value for $x_i(t)$ at the beginning of time t. The algorithm must then choose an arm $I_t$ to play, possibly using some random coin tosses. The algorithm then receives profit $x_{I_t}(t)$. It is easy to see that any deterministic strategy can be forced by the adversary to earn revenue 0, so any solution with regret bounds needs to be randomized. It is important that the adversary does not see the outcomes of the random coin tosses made by the algorithm. This is called the partial information model, since only the profit from the chosen arm is revealed. We will compare the profit obtained by an algorithm to the profit obtained by the best arm in hindsight.
Our algorithm maintains weights $w_i(t) \geq 0$, where t is the timestep. We set $w_i(0) := 1$ for all i. Denote $W(t) := \sum_{i=1}^{N} w_i(t)$.
At time t, arm i is chosen with probability
$$p_i(t) = (1-\gamma)\frac{w_i(t)}{W(t)} + \frac{\gamma}{N},$$
where $\gamma > 0$ is a parameter that will be assigned a value later. We denote the index of the arm played at time t by $I_t$.
We define the random variable $\hat{x}_i(t)$ as
$$\hat{x}_i(t) = \begin{cases} x_i(t)/p_i(t) & \text{if arm } i \text{ was chosen at time } t,\\ 0 & \text{otherwise.}\end{cases}$$
Note that $E[\hat{x}_i(t)] = x_i(t)$. We can now define the update rule for the weights $w_i(t)$:
$$w_i(t+1) = w_i(t)\exp\left(\frac{\gamma}{N}\hat{x}_i(t)\right).$$
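A minimal sketch of the sampling and weight update just described (illustration only; the toy adversary and the particular choice of $\gamma$, which anticipates the tuning derived below, are assumptions):

import math
import random

def exp3(reward, N, T, gamma, seed=0):
    # reward(i, t) returns the adversary's value x_i(t) in [0, 1]; only the
    # chosen arm's value is observed, as in the partial information model.
    rng = random.Random(seed)
    w = [1.0] * N
    total = 0.0
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * w[i] / W + gamma / N for i in range(N)]
        I = rng.choices(range(N), weights=p)[0]        # play arm I_t ~ p(t)
        x = reward(I, t)
        total += x
        xhat = x / p[I]                                # unbiased estimate of x_I(t)
        w[I] *= math.exp((gamma / N) * xhat)           # only the played arm moves
    return total

# Toy adversary (hypothetical): each arm pays a fixed amount every step.
N, T = 3, 10000
vals = lambda i, t: [0.3, 0.6, 0.4][i]
g = exp3(vals, N, T, gamma=math.sqrt(N * math.log(N) / (2 * T)))
print(g, "vs best single arm:", 0.6 * T)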
The regret after T steps is
$$\mathrm{Regret}[T] = \max_i\left\{\sum_{t=1}^{T} E[x_i(t)] - \sum_{t=1}^{T} E[x_{I_t}(t)]\right\}.$$
Note that Regret[T] can trivially be as large as T, i.e. linear, but we will be able to tune $\gamma$ for a fixed value of T to obtain sublinear regret. In order to handle unbounded T we can play using a fixed $\gamma$ for some time and then adjust $\gamma$.
We will use the following facts, which follow from the definition of $\hat{x}_i(t)$:
$$\hat{x}_i(t) \leq \frac{x_i(t)}{p_i(t)} \leq \frac{N}{\gamma} \quad (5.1)$$
$$\sum_{i=1}^{N} p_i(t)\,\hat{x}_i(t) = x_{I_t}(t) \quad (5.2)$$
$$\sum_{i=1}^{N} p_i(t)\,(\hat{x}_i(t))^2 = \sum_{i=1}^{N} \hat{x}_i(t)\, x_i(t) \leq \sum_{i=1}^{N} \hat{x}_i(t). \quad (5.3)$$
We have
$$\frac{W(t+1)}{W(t)} = \frac{\sum_{i=1}^{N} w_i(t+1)}{W(t)} = \frac{\sum_{i=1}^{N} w_i(t)\exp\left(\frac{\gamma}{N}\hat{x}_i(t)\right)}{W(t)} \leq \sum_{i=1}^{N}\frac{w_i(t)}{W(t)} + \sum_{i=1}^{N}\frac{w_i(t)}{W(t)}\left(\frac{\gamma}{N}\hat{x}_i(t) + \frac{\gamma^2}{N^2}(\hat{x}_i(t))^2\right) = 1 + \sum_{i=1}^{N}\frac{w_i(t)}{W(t)}\left(\frac{\gamma}{N}\hat{x}_i(t) + \frac{\gamma^2}{N^2}(\hat{x}_i(t))^2\right).$$
We applied the inequality $\exp(x) \leq 1 + x + x^2$ for $x \in [0, 1]$ to $\exp\left(\frac{\gamma}{N}\hat{x}_i(t)\right)$ (justified by (5.1)) and used the definition of W(t) to substitute the first sum with 1.
Since $p_i(t) - \gamma/N = (1-\gamma)\,w_i(t)/W(t)$, we have $\frac{w_i(t)}{W(t)} = \frac{p_i(t)}{1-\gamma} - \frac{\gamma}{N(1-\gamma)} \leq \frac{p_i(t)}{1-\gamma}$. Using this estimate together with (5.2) and (5.3), we get
$$\frac{W(t+1)}{W(t)} \leq 1 + \sum_{i=1}^{N}\frac{p_i(t)}{1-\gamma}\left(\frac{\gamma}{N}\hat{x}_i(t) + \frac{\gamma^2}{N^2}(\hat{x}_i(t))^2\right) \leq 1 + \frac{\gamma}{N(1-\gamma)}x_{I_t}(t) + \frac{\gamma^2}{N^2(1-\gamma)}\sum_{i=1}^{N}\hat{x}_i(t).$$
We now take logarithms of both sides and sum over t from 1 to T. Note that the lhs telescopes, and after applying the inequality $\log(1+x) \leq x$ to the rhs we get
$$\log\frac{W(T)}{W(0)} \leq \sum_{t=1}^{T}\left(\frac{\gamma}{N(1-\gamma)}x_{I_t}(t) + \frac{\gamma^2}{N^2(1-\gamma)}\sum_{j=1}^{N}\hat{x}_j(t)\right) = \frac{1}{1-\gamma}\left(\frac{\gamma}{N}\sum_{t=1}^{T}x_{I_t}(t) + \frac{\gamma^2}{N^2}\sum_{j=1}^{N}\sum_{t=1}^{T}\hat{x}_j(t)\right).$$
We denote the reward obtained by the algorithm by $G = \sum_{t=1}^{T}x_{I_t}(t)$ and the optimal reward by $G^* = \max_i\sum_{t=1}^{T}x_i(t)$. Using the fact that $E[G^*] \geq E\left[\sum_{t=1}^{T}\hat{x}_i(t)\right]$ for every i, and that $W(0) = N$, we get
$$E\left[\log\frac{W(T)}{N}\right] \leq \frac{1}{1-\gamma}\left(\frac{\gamma}{N}E[G] + \frac{\gamma^2}{N}E[G^*]\right). \quad (5.4)$$
On the other hand, since $W(T) \geq w_j(T) = \exp\left(\sum_{t=1}^{T}\frac{\gamma}{N}\hat{x}_j(t)\right)$ for every j, we have
$$\log\frac{W(T)}{N} \geq \log\frac{w_j(T)}{N} = \sum_{t=1}^{T}\frac{\gamma}{N}\hat{x}_j(t) - \log N.$$
Using the fact that $E[\hat{x}_j(t)] = x_j(t)$ and setting $j = \arg\max_{1\leq j\leq N}\sum_{t=1}^{T}x_j(t)$, we get
$$E\left[\log\frac{W(T)}{N}\right] \geq \frac{\gamma}{N}E[G^*] - \log N. \quad (5.5)$$
Putting (5.4) and (5.5) together, we get
$$\frac{\gamma}{N}E[G^*] - \log N \leq \frac{1}{1-\gamma}\left(\frac{\gamma}{N}E[G]\right) + \frac{\gamma^2}{N(1-\gamma)}E[G^*]. \quad (5.6)$$
This implies that
$$E[G] \geq (1-\gamma)E[G^*] - \gamma E[G^*] - \frac{(1-\gamma)N\log N}{\gamma}, \quad (5.7)$$
i.e.
$$E[G^*] - E[G] \leq 2\gamma E[G^*] + \frac{N\log N}{\gamma}. \quad (5.8)$$
To balance the first two terms, we set $\gamma := \sqrt{\frac{N\log N}{2G^*}}$, obtaining
$$E[G^*] - E[G] \leq 2\sqrt{2N\log N\, E[G^*]}. \quad (5.9)$$
Since $G^* \leq T$, we also have
$$E[G^*] - E[G] \leq 2\gamma T + \frac{N\log N}{\gamma}, \quad (5.10)$$
and setting $\gamma := \sqrt{\frac{N\log N}{2T}}$ yields
$$E[G^*] - E[G] \leq 2\sqrt{2NT\log N}. \quad (5.11)$$
Exercise 5.1 This is the second straight algorithm we have seen that has a regret that depends on $\sqrt{T}$. Unlike the previous algorithm, this one requires knowledge of T. Present a technique that converts an algorithm which achieves a regret of $O(f(N)\sqrt{T})$ for any given T to one that achieves a regret of $O(f(N)\sqrt{T})$ for all T.
Exercise 5.2 Imagine now that there are M distinct types of customers. During each time step, you are told which type of customer you are dealing with. You must show the customer one product (equivalent to playing an arm), which the customer will either purchase or discard. If the customer purchases the product, then you make some amount between 0 and 1. The regret is computed relative to the best product-choice for each customer type, in hindsight. Present an algorithm that achieves regret $O(\sqrt{MNT\log N})$ against an adversary. Prove your result. What lower bound can you deduce from material that has been covered in class or pointed out in the reading list?
Exercise 5.3 Designed by Bahman Bahmani.
Assume a seller with an unlimited supply of a good is sequentially selling copies of the good to n buyers, each of whom is interested in at most one copy of the good and has a private valuation for the good which is a number in [0, 1]. At each instance, the seller offers a price to the current buyer, and the buyer will buy the good if the offered price is less than or equal to his private valuation.
Assume the buyers' valuations are iid samples from a fixed but unknown (to the seller) distribution with cdf $F(x) = \Pr(\text{valuation} \leq x)$. Define $D(x) = 1 - F(x)$ and $f(x) = xD(x)$.
In this problem, we will prove that if f(x) has a unique global maximum $x^*$ in (0, 1) and $f''(x^*) < 0$, then the seller has a pricing strategy that achieves $O(\sqrt{n\log n})$ regret compared to the adversary who knows the exact value of each buyer's valuation but is restricted to offer a single price to all the buyers.
To do this, assume the seller restricts herself to offering one of the prices in $\{1/K, 2/K, \ldots, (K-1)/K, 1\}$. Define $\mu_i$ as the expected reward of offering price i/K, $\mu^* = \max\{\mu_1, \ldots, \mu_K\}$, and $\Delta_i = \mu^* - \mu_i$.
a) Prove that there exist constants $C_1, C_2$ such that $\forall x \in [0, 1]:\ C_1(x^* - x)^2 < f(x^*) - f(x) < C_2(x^* - x)^2$.
b) Prove that $\Delta_i \geq C_1(x^* - i/K)^2$ for all i. Also, prove that the j-th smallest value among the $\Delta_i$'s is at least as large as $C_1\left(\frac{j-1}{2K}\right)^2$.
c) Prove that $\mu^* > f(x^*) - C_2/K^2$.
d) Prove that using a good choice of K and some of the results on MAB discussed in class, the seller can achieve $O(\sqrt{n\log n})$ regret.
Chapter 6
The Full Information Model, along
with Linear Generalizations
This chapter is based on the paper "Efficient algorithms for online decision problems" by A. Kalai and S. Vempala, http://people.cs.uchicago.edu/~kalai/papers/onlineopt/onlineopt.pdf.
Exercise 6.1 Read the algorithm by Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent", http://www.cs.ualberta.ca/~maz/publications/ICML03.pdf. How would you apply his technique and proof in a black-box fashion to the simplest linear generalization of the multi-armed bandit problem in the full information model (e.g. as described in the paper by Kalai and Vempala)? Note that Zinkevich requires a convex decision space, whereas Kalai and Vempala assume that the decision space is arbitrary, possibly just a set of points.
Exercise 6.2 Modify the analysis of UCB$_1$ to show that in the full information model, playing the arm with the best average return so far has statistical regret $O(\sqrt{T(\log T + \log N)})$ (i.e. regret against a distribution).