

Publisher: Institute for Operations Research and the Management Sciences (INFORMS)
INFORMS is located in Maryland, USA

INFORMS TutORials in Operations Research


Publication details, including instructions for authors and subscription information:
http://pubsonline.informs.org

Recent Advances in Multiarmed Bandits for Sequential Decision Making
Shipra Agrawal (https://orcid.org/0000-0003-4486-3871)

To cite this entry: Shipra Agrawal. Recent Advances in Multiarmed Bandits for Sequential Decision Making. In INFORMS TutORials in Operations Research. Published online: 02 Oct 2019; 167-188.
https://doi.org/10.1287/educ.2019.0204

Full terms and conditions of use: https://pubsonline.informs.org/Publications/Librarians-Portal/PubsOnLine-Terms-and-Conditions

This article may be used only for the purposes of research, teaching, and/or private study. Commercial use
or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher
approval, unless otherwise noted. For more information, contact permissions@informs.org.

The Publisher does not warrant or guarantee the article’s accuracy, completeness, merchantability, fitness
for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or
inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or
support of claims made of that product, publication, or service.

Copyright © 2019, INFORMS


With 12,500 members from nearly 90 countries, INFORMS is the largest international association of operations research (O.R.)
and analytics professionals and students. INFORMS provides unique networking and learning opportunities for individual
professionals, and organizations of all types and sizes, to better understand and use O.R. and analytics tools and methods to
transform strategic visions and achieve better outcomes.
For more information on INFORMS, its publications, membership, or meetings visit http://www.informs.org
© 2019 INFORMS | ISBN 978-0-9906153-3-0
https://pubsonline.informs.org/series/educ
https://doi.org/10.1287/educ.2019.0204

Recent Advances in Multiarmed Bandits for Sequential Decision Making

Shipra Agrawal
Columbia University, New York, New York 10027
Contact: sa3305@columbia.edu, https://orcid.org/0000-0003-4486-3871 (SA)

Abstract Reinforcement learning (RL) is a very general framework for making sequential decisions
when the underlying system dynamics are a priori unknown. RL algorithms use the outcomes
of past actions to learn system dynamics and improve the decision maker’s strategy over time.
The stochastic multiarmed bandit (MAB) problem refers to a special case of RL when the
responses are independent across different actions and independent and identically dis-
tributed across time for a given action. The stochastic MAB problem enjoys availability of
many efficient algorithms in the literature, with rigorous near-optimal theoretical guarantees
on performance. This tutorial discusses some recent advances in sequential decision-making
models that build on the basic MAB setting to greatly expand its purview. Specifically, we
discuss progress on three models that lie between MAB and RL: (a) contextual bandits,
(b) combinatorial bandits, and (c) bandits with long-term constraints and nonadditive rewards. These models incorporate settings well beyond the purview of MAB by relaxing premises such as stationarity of the reward distributions and independence of feedback across actions or across time. This tutorial discusses the state of the art in algorithm design
and analysis techniques for these and related models, along with applications in several
domains such as online advertising, recommendation systems, crowdsourcing, healthcare,
network routing, assortment optimization, revenue management, and resource allocation.

Keywords multiarmed bandits • reinforcement learning • exploration-exploitation • regret bounds

1. Introduction
In many operations management problems, decision makers are faced with the challenge of
making sequential decisions that are not just profitable today but also put the system into
a better position to face the constraints and uncertainties of tomorrow. To achieve this, the decision maker needs to use past observations and data to understand the nature of the uncertainties and to optimize for long-term goals.
Reinforcement learning (RL) is a very general framework for learning to optimize such
sequential decisions under uncertainty. In reinforcement learning, the underlying stochastic
process is modeled as a Markov decision process (MDP). However, the parameters of this
model (e.g., parameters determining the reward function or the transition probability dis-
tribution) are a priori unknown to the decision maker. A reinforcement learning algorithm
learns the unknown model parameters by exploring different actions and uses the observed
outcomes to adaptively improve the decision maker’s policy over time. This requires the
algorithm to manage the trade-off between exploration and exploitation—that is, exploring
different actions in order to learn versus taking actions that currently seem to be reward
maximizing.
The stochastic multiarmed bandit (MAB) problem is a special case of RL in which the responses to actions are independent across time and across different actions; furthermore, feedback from one action does not provide any information about another action. The multiarmed bandit problem derives its name from the problem of a gambler playing on N rigged,
nonidentical slot machines in a casino. The extent to which each machine is rigged is unknown
to the gambler. Pulling the arm of any machine once requires investing one dollar. The gambler
has a fixed amount of dollars that can be invested to pull the arms of the slot machines in order
to receive some (random) reward. The gambler must decide which arms to pull in a sequence of
trials so as to maximize total reward. The gambler wants to use the outcomes of the pulls to
learn the statistics of the slot machines and use that learning to adaptively make better
investments over time. Thus, in each trial, the gambler faces a trade-off between exploration
(in order to find the best machine) and exploitation (playing the arm of the machine believed
to give the best payoff).
MAB is often considered the most fundamental model that captures the exploration-
exploitation trade-off in sequential decision making. Although restricted in its ability to capture
many practical sequential decision-making settings, the basic MAB problem is a mature area
of research and enjoys availability of many efficient algorithms with rigorous near-optimal
theoretical guarantees on performance. This tutorial discusses some recent advances in se-
quential decision-making models that build on the above-described basic MAB setting to
greatly expand its purview.
Specifically, we discuss progress on three models that lie between MAB and RL: (a) con-
textual bandits, (b) combinatorial bandits, and (c) bandits with long-term constraints and
nonadditive rewards. These models allow incorporating problem settings and considerations
that may violate some basic premises of the MAB framework such as the assumptions of
stationary distribution and independence of feedback across actions or across time. Although
these settings still do not allow the full generality of a reinforcement learning problem, their
structure has enabled new efficient algorithms and performance analysis based on multiarmed
bandit techniques. In this tutorial, we discuss the state of the art in algorithm design and
analysis techniques for these and related models, along with applications in several domains
such as online advertising, recommendation systems, crowdsourcing, healthcare, network
routing, assortment optimization, revenue management, and resource allocation.

1.1. Organization
The rest of this tutorial is organized as follows. Section 2 provides an introduction to the stochastic multiarmed bandit problem and popular algorithmic techniques. Sections 3, 4,
and 5 discuss the three variations of contextual bandits, combinatorial bandits, and bandits
with long-term constraints and nonadditive objectives, respectively.

2. The Stochastic Multiarmed Bandit Problem


The stochastic MAB problem is a prominent framework for capturing the exploration-exploitation trade-off in online decision making and experiment design. The MAB problem proceeds in discrete sequential rounds. In each round $t = 1, 2, 3, \ldots$, one of $N$ arms (or actions) must be chosen to be pulled (or played). Let $I_t \in \{1,\ldots,N\}$ denote the arm pulled at the $t$-th time step. On pulling arm $I_t = i$ at time $t$, a random real-valued reward $r_t \in \mathbb{R}$ is observed, generated according to a fixed but unknown distribution associated with arm $i$, with mean $\mathbb{E}[r_t \mid I_t = i] = \mu_i$. The random rewards obtained from playing an arm repeatedly are independent and identically distributed over time and independent of the plays of the other arms. The reward is observed immediately after playing the arm. An algorithm for the stochastic MAB problem must decide which arm to play at each time step $t$, based on the outcomes of the previous $t-1$ plays. The goal is to maximize the expected total reward at time $T$, that is, $\mathbb{E}[\sum_{t=1}^{T}\mu_{I_t}]$, where $I_t$ is the arm played in step $t$. Here, the expectation is over the random choices of $I_t$ made by the algorithm, where the randomization can result from any randomization in the algorithm
as well as the randomness in the outcomes of arm pulls, which may affect the algorithm’s
sequential decisions.
As an example of a multiarmed bandit problem, consider a news website that needs to decide
which articles to display to a visitor. Showing an article (i.e., pulling an arm) generates clicks
(i.e., reward). The website has no a priori information about the click-through rates (CTRs) of
different articles. Then, the MAB problem can be used to formulate the problem of se-
quentially choosing articles to display from a fixed pool of N articles in order to learn the CTRs
of different articles while maximizing the user engagement measured as the total number of
clicks. It is important to note, however, that this basic MAB formulation makes an implicit
assumption that every time an article is shown, it has the same likelihood of generating a click
(irrespective of the preferences of the current user or the external context). Later, we dis-
cuss extensions such as the contextual bandit problem, which can handle these additional
considerations.
2.1. Regret Definition. To measure the performance of an algorithm for the MAB
problem, it is common to work with the measure of expected total regret (i.e., the amount lost
because of not playing the optimal arm in each step).
To formally define regret, let us introduce some notation. Let $\mu^* := \max_i \mu_i$, and let $\Delta_i := \mu^* - \mu_i$. Let $n_{i,t}$ denote the total number of times arm $i$ is played in rounds 1 to $t$; thus $n_{i,t}$ is a random variable. Then the expected total regret in $T$ rounds is defined as
$$\mathcal{R}(T) := \mathbb{E}\Big[\sum_{t=1}^{T}(\mu^* - \mu_{I_t})\Big] = \mathbb{E}\Big[\sum_{i=1}^{N} n_{i,T}\,\Delta_i\Big],$$

where expectation is taken with respect to both randomness in outcomes, which may affect the
sequential decisions made by the algorithm, and any randomization in the algorithm.
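To make this regret measure concrete, here is a minimal Python sketch (an illustration written for this tutorial, not code from the cited references) that simulates a Bernoulli MAB instance and records the pseudo-regret of an arbitrary arm-selection rule; the instance means `mu` and the policy interface are assumptions of the sketch.

```python
import numpy as np

def run_bandit(policy, mu, T, rng):
    """Simulate a Bernoulli MAB instance with means mu for T rounds.

    `policy` maps the history (a list of (arm, reward) pairs) to an arm index.
    Returns the pseudo-regret sum_t (mu_star - mu[I_t]).
    """
    mu = np.asarray(mu)
    mu_star = mu.max()
    history, regret = [], 0.0
    for t in range(T):
        arm = policy(history)
        reward = float(rng.random() < mu[arm])   # Bernoulli(mu[arm]) reward
        history.append((arm, reward))
        regret += mu_star - mu[arm]              # per-round gap Delta_{I_t}
    return regret

# Example: a uniformly random policy on a 3-armed instance.
rng = np.random.default_rng(0)
print(run_bandit(lambda history: rng.integers(3), mu=[0.3, 0.5, 0.7], T=1000, rng=rng))
```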
Two kinds of regret bounds appear in the literature for the stochastic MAB problem:
(1) logarithmic problem-dependent (or instance-dependent) bounds that may have dependence on problem parameters such as $\mu_i$ or $\Delta_i$, and
(2) sublinear problem-independent (or worst-case) bounds that provide uniform bounds for all instances with $N$ arms.
To differentiate between these two types of bounds, we use the more detailed notation $\mathcal{R}(T,\nu)$ to denote regret for problem instance $\nu$. The instance $\nu$ is specified by the sufficient statistics for the reward distributions of the arms. Then, problem-dependent or instance-dependent bounds on regret are bounds on $\mathcal{R}(T,\nu)$ for every problem instance $\nu$, in terms of $T$, the means $\mu_i, i = 1,\ldots,N$, and possibly other distribution parameters associated with $\nu$. Problem-independent bounds are bounds on the worst-case regret $\max_\nu \mathcal{R}(T,\nu)$, where the maximum is taken over all instances $\nu$ with $N$ arms, with arbitrary arm distributions (possibly within a given family of distributions).
For example, Auer [12] derives an instance-dependent bound of $\mathcal{R}(T,\nu) = O\big(\sum_{i:\mu_i<\mu^*}\frac{1}{\mu^*-\mu_i}\log T\big)$ on the regret of the UCB algorithm for any instance $\nu$ in which the distribution for each arm $i$ has bounded support in $[0,1]$ and mean $\mu_i$, and $\mu^* = \max_i \mu_i$. The same article also provides a worst-case regret bound of $O(\sqrt{NT\log T})$ over all such instances.
2.1.1. Bayesian Regret. If a good prior on the distribution of instances $\nu$ is available, we would hope to achieve better performance. Accordingly, another definition of regret, called Bayesian regret, is often considered in the literature, especially when analyzing algorithms such as Thompson sampling (discussed later), which are based on Bayesian posterior sampling. Given a prior $P(\nu)$ over instances $\nu$ of the stochastic MAB problem, the Bayesian regret is the expected regret over instances sampled from this prior:
$$\text{Bayesian regret in time } T = \mathbb{E}_{\nu\sim P}\big[\mathcal{R}(T,\nu)\big],$$

where the first expectation is over the instance and the second expectation is over the
algorithm/reward generation.
In this article, we focus on the frequentist regret bounds (problem dependent and problem
independent); however, some references to Bayesian analysis are provided at relevant places.
2.2. Overview of Algorithmic Techniques. We briefly discuss two widely used algorithmic techniques for the multiarmed bandit problem: (1) optimism under uncertainty, or more specifically, the upper confidence bound (UCB) algorithm (Auer [12], Auer et al. [13]), and (2) posterior sampling, or more specifically, the Thompson sampling (TS) algorithm (Agrawal and Goyal [4, 8], Russo and Van Roy [47], Russo et al. [50], Thompson [56]). Some other prominent techniques include inverse propensity scoring and multiplicative weight update algorithms such as the EXP3 algorithm (Auer et al. [14]), the epsilon-greedy algorithm, and the successive elimination algorithm (see the survey in Bubeck and Cesa-Bianchi [22]).

2.3. Upper Confidence Bound Algorithm


The UCB algorithm is based on the “optimism under uncertainty” principle. The basic idea is
to maintain an “optimistic” bound on the mean reward for each arm—that is, a quantity that is
above the mean with high probability and converges to the mean as more observations are
made. In each round, the algorithm pulls the arm with largest UCB. An observation made from
the pulled arm is used to update its UCB.
The precise mechanics of the algorithm are as follows. As before, let $n_{i,t}$ denote the number of times arm $i$ was played until (and including) round $t$, let $I_t \in \{1,\ldots,N\}$ denote the arm pulled at time $t$, and let $r_t \in [0,1]$ denote the reward observed at time $t$. Then, an empirical reward estimate of arm $i$ at time $t$ is defined as
$$\hat\mu_{i,t} = \frac{\sum_{s\le t:\, I_s = i} r_s}{n_{i,t}}. \tag{1}$$
The UCB algorithm computes the following quantity for each arm $i$ at the end of each round $t$:
$$\text{UCB}_{i,t} := \hat\mu_{i,t} + \sqrt{\frac{2\ln t}{n_{i,t}}}. \tag{2}$$
Then, the algorithm pulls the arm $i$ that has the highest $\text{UCB}_{i,t}$ at time $t$. The algorithm is summarized as Algorithm 1.
Algorithm 1 (UCB Algorithm for the Stochastic N-Armed Bandit Problem)
foreach t = 1, ..., N do
  Play arm t
end
foreach t = N+1, N+2, ..., T do
  Play arm $I_t = \arg\max_{i\in\{1,\ldots,N\}} \text{UCB}_{i,t-1}$
  Observe $r_t$, compute $\text{UCB}_{i,t}$
end
Here, for simplicity, it was assumed that $T \ge N$, and the algorithm starts by playing every arm once. Other variations of this algorithm may be found in Auer [12] and Bubeck and Cesa-Bianchi [22].
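For concreteness, the following Python sketch implements Algorithm 1 on a simulated Bernoulli instance (an illustration written for this tutorial; the exploration bonus follows Equation (2), while the simulated means in the demo are assumptions).

```python
import numpy as np

def ucb(mu, T, rng):
    """Run the UCB algorithm on a simulated Bernoulli N-armed bandit with means mu."""
    N = len(mu)
    counts = np.zeros(N)          # n_{i,t}: number of pulls of each arm
    sums = np.zeros(N)            # running sum of rewards of each arm
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= N:
            arm = t - 1                                   # play every arm once
        else:
            means = sums / counts                         # empirical means
            bonus = np.sqrt(2.0 * np.log(t) / counts)     # exploration bonus from (2)
            arm = int(np.argmax(means + bonus))           # arm with the largest UCB
        reward = float(rng.random() < mu[arm])
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward

rng = np.random.default_rng(1)
print(ucb([0.3, 0.5, 0.7], T=10000, rng=rng))
```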
2.3.1. Regret Analysis. Intuitively, the additional term $\sqrt{\ln t/n_{i,t}}$ in (2) allows exploration of an arm that has been played less often so far (i.e., an arm with low $n_{i,t}$), even if its current empirical mean estimate is low. A key observation in the analysis of the UCB algorithm is that the term $\sqrt{\ln t/n_{i,t}}$ is a high-confidence upper bound on the empirical error of $\hat\mu_{i,t}$. More precisely, for each arm $i$ at time $t$, we have that, with probability at least $1 - 2/t^2$,
$$\big|\hat\mu_{i,t} - \mu_i\big| < \sqrt{\frac{4\ln t}{n_{i,t}}}. \tag{3}$$

There are two useful observations we can immediately derive from (3):
(1) A lower bound for $\text{UCB}_{i,t}$. With probability at least $1 - 2/t^2$,
$$\text{UCB}_{i,t} > \mu_i. \tag{4}$$
(2) An upper bound for $\hat\mu_{i,t}$ with many samples. Given that $n_{i,t} \ge (16\ln t)/\Delta_i^2$, with probability at least $1 - 2/t^2$,
$$\hat\mu_{i,t} < \mu_i + \frac{\Delta_i}{2}. \tag{5}$$
Inequality (4) states that the UCB value is probably as large as the true reward: in this sense, the UCB algorithm is optimistic. Inequality (5) states that given enough (specifically, at least $(16\ln t)/\Delta_i^2$) samples, the reward estimate probably does not exceed the true reward by more than $\Delta_i/2$. Together, these bounds can be used to show that after being pulled $(16\ln t)/\Delta_i^2$ times, a suboptimal arm $i$ has a very low probability of being pulled in subsequent rounds. More precisely, consider any arm $i$ with $\mu_i < \mu^*$. At the end of round $t$, let $n_{i,t} \ge (16\ln t)/\Delta_i^2$. Then, if both (4) and (5) hold,
$$\begin{aligned}
\text{UCB}_{i,t} &= \hat\mu_{i,t} + \sqrt{\frac{\ln t}{n_{i,t}}} \;\le\; \hat\mu_{i,t} + \frac{\Delta_i}{2} && \text{since } n_{i,t} \ge \frac{16\ln t}{\Delta_i^2},\\
&< \mu_i + \frac{\Delta_i}{2} + \frac{\Delta_i}{2} && \text{by (5)},\\
&= \mu^* && \text{since } \Delta_i := \mu^* - \mu_i,\\
&< \text{UCB}_{i^*,t} && \text{by (4)}.
\end{aligned}$$
The probability of (4) or (5) not holding is at most $4/t^2$ by the union bound. Thus, because $\text{UCB}_{i^*,t} > \text{UCB}_{i,t}$ whenever both hold, the algorithm's selection criterion implies that the probability of playing arm $i$ in round $t+1$ is at most $4/t^2$. This observation can be used to show that with high probability, $n_{i,T} \le (16\ln(T))/\Delta_i^2$ for any time horizon $T$ and suboptimal arm $i$. Because the regret on playing arm $i$ is bounded by $\Delta_i = \mu^* - \mu_i$, this yields the following upper bound on regret. Formal proofs with tighter constants can be found in Auer [12] and Bubeck and Cesa-Bianchi [22].

Theorem 1. Let $\mathcal{R}(T,\nu)$ denote the regret of the UCB algorithm in time $T$ for an instance $\nu$ of the stochastic independent and identically distributed (i.i.d.) multiarmed bandit problem with arm reward distributions $\nu_1,\ldots,\nu_N$ and means $\mu_1,\ldots,\mu_N$. For all instances $\nu$ and all $T \ge N$, the expected regret of the UCB algorithm is bounded as
$$\mathcal{R}(T,\nu) \;\le\; \sum_{i:\,\mu_i < \mu^*}\left(\frac{16\ln T}{\Delta_i} + 8\Delta_i\right),$$
where $\Delta_i = \mu^* - \mu_i$.
Theorem 1 gives an upper bound on $\mathbb{E}[\mathcal{R}(T,\nu)]$ that is logarithmic in $T$. This is near optimal: Lai and Robbins [38] provided a lower bound demonstrating that any algorithm must suffer expected total regret of order $\ln T$ on any instance $\nu$. However, note that the bound in Theorem 1 depends on the parameters $\Delta_1,\ldots,\Delta_N$ (i.e., it is an "instance-dependent" or "problem-dependent" bound). This bound does not directly imply a very good worst-case bound: for an instance with $\Delta_i = \ln T/T$, the bound is linear in $T$. But a simple trick can be applied to obtain a sublinear "instance-independent" (a.k.a. "problem-independent" or "worst-case") regret bound. The idea is to bound the total regret from arms with $\Delta_i \le \sqrt{(16N\ln T)/T}$ trivially by $T\cdot\sqrt{(16N\ln T)/T} = \sqrt{16NT\ln T}$ (these arms are pulled at most $T$ times in total), and, for the remaining arms, to substitute $\Delta_i \ge \sqrt{(16N\ln T)/T}$ into Theorem 1 to bound the regret from their pulls by $(16\ln T)/\Delta_i \le \sqrt{16T\ln T/N}$ each, that is, at most $\sqrt{16NT\ln T}$ in total. This analysis gives the following instance-independent bound.
Theorem 2. For all $T \ge N$, the expected total regret achieved by the UCB algorithm in round $T$ is
$$\mathcal{R}(T) \;\le\; 8\sqrt{NT\ln T} + 8N.$$
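Written out, the splitting argument behind Theorem 2 is the following short calculation (it merely restates the step just described, using the constants of Theorem 1). For any threshold $\Delta > 0$,
$$\mathcal{R}(T,\nu) \;=\; \sum_{i:\,\Delta_i\le\Delta}\mathbb{E}[n_{i,T}]\,\Delta_i \;+\; \sum_{i:\,\Delta_i>\Delta}\mathbb{E}[n_{i,T}]\,\Delta_i \;\le\; T\,\Delta \;+\; \frac{16N\ln T}{\Delta} \;+\; 8N,$$
and choosing $\Delta = \sqrt{(16N\ln T)/T}$ makes the first two terms on the right equal to $4\sqrt{NT\ln T}$ each, which yields the bound of Theorem 2.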

2.4. Thompson Sampling


Thompson sampling (TS), also known as Bayesian posterior sampling, is one of the oldest
heuristics for the multiarmed bandit problem. It first appeared in a 1933 article by W. R.
Thompson [56]. In recent years, there have been significant advances in theoretical regret-based analysis of this algorithm for the $N$-armed stochastic MAB problem, including near-optimal problem-dependent and worst-case problem-independent bounds (Agrawal and Goyal [4, 6, 8], Kaufmann et al. [36]) and Bayesian regret bounds (Russo and Van Roy [47, 48]). The algorithm is
based on a Bayesian philosophy of learning.
2.4.1. Bayesian Learning. Consider the problem of learning from observations generated
from a parametric distribution. A frequentist approach assumes parameters to be fixed and
uses the observed data to learn those parameters as accurately as possible.
For example, consider the problem of learning the distribution parameter given 10 independent samples from a Bernoulli distribution (a random variable distributed as Bernoulli($\mu$) is 1 with probability $\mu$ and 0 with probability $1-\mu$):
$$0,\, 0,\, 1,\, 1,\, 0,\, 1,\, 1,\, 1,\, 0,\, 0.$$
A frequentist would estimate that $\mu$ is close to 0.5, with some confidence (probability).
On the other hand, a Bayesian learner maintains a probability distribution (or belief) to
capture the uncertainty about the unknown parameter. At the beginning (before seeing the
data), the prior distribution encodes the initial belief of the learner about the value of the
parameter. Upon seeing the data, the learner updates the belief using Bayes’ rule. This
updated distribution is called the posterior distribution.
Let us continue with the example from above, where the learner observes independent realizations from a Bernoulli distribution with parameter $\mu$. Let the learner start with a prior $p(x)$ representing the learner's prior belief (probability) that $\mu$ takes value $x$:
$$p(x) = \Pr[\mu = x].$$
After observing data $D$ (e.g., the samples $0, 0, 1, 1, 0, 1, 1, 1, 0, 0$), the learner obtains a posterior distribution using Bayes' rule:
$$\Pr[\mu = x \mid D] = \frac{\Pr[D \mid \mu = x]\cdot \Pr[\mu = x]}{\Pr[D]} \;\propto\; \Pr[D \mid \mu = x]\cdot p(x).$$
Here, $\Pr[D \mid \mu = x]$ is the probability of generating data $D$ from the Bernoulli distribution with parameter $x$. This is also called the likelihood function.
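As a small concrete illustration (written for this tutorial, and anticipating the Beta–Bernoulli conjugacy discussed in Sections 2.4.2 and 2.4.3), the posterior for the 10 samples above can be obtained by simple counting when the prior is the uniform Beta(1, 1):

```python
# Bernoulli samples from the running example.
data = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]

# Uniform prior Beta(alpha=1, beta=1) on the unknown mean mu.
alpha, beta = 1, 1
for r in data:
    alpha += r       # each observed 1 increments alpha
    beta += 1 - r    # each observed 0 increments beta

# Posterior is Beta(6, 6); its mean 0.5 matches the frequentist guess,
# while the full distribution also quantifies the remaining uncertainty.
print(alpha, beta, alpha / (alpha + beta))
```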

2.4.2. Algorithm Overview. Suppose that, for each arm $i$, the reward is generated from some parametric distribution $\nu_i$. Then, the overall structure of the Thompson sampling algorithm, as
described in Thompson [56], is as follows:
For every arm i, start with a prior belief on the parameters of its reward distribution.
In every round t,
—pull an arm with its probability of being the best arm according to the current belief, and
—use the observed reward to update the posterior belief distribution for the pulled arm.
Given the prior distribution and the likelihood function, in some cases the posterior dis-
tribution has a closed analytical form. In particular, given Bernoulli i.i.d. samples, if the prior is
a Beta distribution, then the posterior distribution is also a Beta distribution. Also,
given Gaussian i.i.d. samples, if the prior is a Gaussian distribution, then the posterior is also
given by a Gaussian distribution. This property makes these distributions a convenient choice
for implementation of Thompson sampling. Below, we give precise details of the TS algorithm
for the special cases of (a) Bernoulli reward distribution and (b) Gaussian reward distribution.

2.4.3. Thompson Sampling for Bernoulli MAB. Assume that we have a Bernoulli multiarmed bandit (Bernoulli MAB) instance. That is, for arm $i$, every time it is pulled, the reward is generated from Bernoulli($\mu_i$). The aim is to learn the model parameters $\mu_i, i = 1,\ldots,N$, for all arms in order to find the best arm.
The calculations below show that, given a Beta prior with parameters $(\alpha,\beta)$, on observing one sample $r \in \{0 \text{ (w.p. } 1-\mu),\, 1 \text{ (w.p. } \mu)\}$, the posterior distribution is Beta($\alpha + r,\, \beta + 1 - r$):
$$\begin{aligned}
\Pr[\mu \mid r] &\propto \Pr[r \mid \mu]\,\Pr[\mu]\\
&= \text{Bernoulli}_{\mu}(r)\,\text{Beta}_{\alpha,\beta}(\mu)\\
&= \mu^{r}(1-\mu)^{1-r}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}\\
&\propto \mu^{\alpha+r-1}(1-\mu)^{\beta-r}\\
&\propto \text{Beta}_{\alpha+r,\,\beta+1-r}(\mu).
\end{aligned}$$

From this observation, for every arm $i$, the Thompson sampling algorithm (as presented in Agrawal and Goyal [4, 6, 8]) starts with a uniform prior belief Beta(1, 1) about its mean. After $n_{i,t}$ pulls in rounds $1,\ldots,t$, the algorithm updates its belief to Beta($S_{i,t}+1, F_{i,t}+1$), where
$S_{i,t}$: the number of 1s in the $n_{i,t}$ pulls of arm $i$, and
$F_{i,t}$: the number of 0s in the $n_{i,t}$ pulls of arm $i$.
The initial values of these variables (before any pulls) are set to 0.
The algorithm, at time $t$, plays an arm $i$ with its probability of being the best. That is, if $X_j, j = 1,\ldots,N$, are random variables distributed as Beta($S_{j,t}+1, F_{j,t}+1$), then the algorithm plays arm $i$ with probability $\Pr(X_i > \max_{j\ne i} X_j)$. Note that a quick way to implement this is to generate a sample from Beta($S_{i,t}+1, F_{i,t}+1$) for each arm $i$ and then pull the arm whose sample is largest.
Algorithm 2 (Thompson Sampling for Bernoulli MAB Using Beta Priors)
foreach t = 1, 2, ... do
  For each arm $i = 1,\ldots,N$, independently sample $\theta_{i,t} \sim \text{Beta}(S_{i,t-1}+1, F_{i,t-1}+1)$
  Play arm $I_t := \arg\max_i \theta_{i,t}$
  Observe $r_t$.
end
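A minimal Python sketch of Algorithm 2 on a simulated Bernoulli instance follows (an illustration written for this tutorial; the instance means used in the demo are assumptions).

```python
import numpy as np

def thompson_sampling_bernoulli(mu, T, rng):
    """Thompson sampling with Beta(1, 1) priors on a simulated Bernoulli MAB with means mu."""
    N = len(mu)
    S = np.zeros(N)   # S_{i,t}: number of observed 1s for each arm
    F = np.zeros(N)   # F_{i,t}: number of observed 0s for each arm
    total_reward = 0.0
    for t in range(T):
        theta = rng.beta(S + 1, F + 1)            # one posterior sample per arm
        arm = int(np.argmax(theta))               # pull the arm with the largest sample
        reward = float(rng.random() < mu[arm])
        S[arm] += reward
        F[arm] += 1 - reward
        total_reward += reward
    return total_reward

rng = np.random.default_rng(2)
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=10000, rng=rng))
```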

The mean of the posterior distribution Beta($S_{i,t-1}+1, F_{i,t-1}+1$) is $(S_{i,t-1}+1)/(S_{i,t-1}+F_{i,t-1}+2)$, which is close to the empirical mean $\hat\mu_{i,t-1}$, and its variance is inversely proportional to $S_{i,t-1}+F_{i,t-1}+2 = n_{i,t-1}+2$. Therefore, as the number of plays $n_{i,t}$ of an arm increases, the variance of the posterior distribution decreases, and the empirical mean $\hat\mu_{i,t}$ converges to the true mean $\mu_i$ of the Bernoulli distribution. For arms with small $n_{i,t}$, the variance is high, which enables exploration of arms that have been played less often and therefore have more uncertainty in their estimates.
These observations were utilized to derive the following problem-dependent (or instance-dependent) bound for the Bernoulli MAB in Agrawal and Goyal [6, 8].
Theorem 3 (Agrawal and Goyal [6, 8]). For any instance $\mu = \{\mu_1,\ldots,\mu_N\}$ of the Bernoulli MAB and any $\epsilon > 0$,
$$\mathcal{R}(T,\mu) \;\le\; (1+\epsilon)\sum_{i\ne i^*}\frac{\ln(T)\,\Delta_i}{\mathrm{KL}(\mu_i,\mu^*)} + O\!\left(\frac{N}{\epsilon^2}\right),$$
where $\mathrm{KL}(\cdot,\cdot)$ denotes the Kullback–Leibler divergence: $\mathrm{KL}(\mu_i,\mu^*) := \mu_i\log(\mu_i/\mu^*) + (1-\mu_i)\log\big((1-\mu_i)/(1-\mu^*)\big)$. The big-$O$ notation above treats $\mu_i, \Delta_i, i = 1,\ldots,N$, as constants.
In Lai and Robbins [38], a lower bound of $\lim_{T\to\infty}\mathcal{R}(T,\mu)/\ln(T) \ge \sum_{i\ne i^*}\Delta_i/\mathrm{KL}(\mu_i,\mu^*)$ is demonstrated for any (consistent) algorithm for the Bernoulli MAB problem. Therefore, the above theorem shows that Thompson sampling matches this lower bound. The following problem-independent regret bound is also known for this algorithm.
Theorem 4 (Agrawal and Goyal [8]). The regret in time $T$ for the Bernoulli MAB problem is bounded as follows:
$$\mathcal{R}(T) = \max_{\mu}\mathcal{R}(T,\mu) \;\le\; O\big(\sqrt{NT\log T} + N\big),$$
where the maximization is over all instances $\mu$ of the $N$-armed Bernoulli MAB problem.

2.4.4. Thompson Sampling for Gaussian MAB. Consider an instance $\nu = (\nu_1,\ldots,\nu_N)$ of the stochastic MAB problem, where the reward $r_t$ on pulling arm $i$ is generated i.i.d. from the Gaussian distribution $\nu_i = \mathcal{N}(\mu_i, 1)$; $\mu_i$ is unknown. Using Gaussian priors is a convenient choice for deriving the Thompson sampling algorithm in this case: when starting from Gaussian priors, on observing independent Gaussian samples, the posterior distribution is also Gaussian. In particular, if the prior for arm $i$ is the Gaussian $\mathcal{N}(0,1)$, then by applying Bayes' rule we can derive that the posterior at time $t$ (after observing $n_{i,t}$ independent samples from the distribution $\mathcal{N}(\mu_i,1)$) will be the Gaussian $\mathcal{N}\big(\hat\mu_{i,t},\, 1/(n_{i,t}+1)\big)$. Here, $n_{i,t}$ is the number of plays of arm $i$ in rounds $1,\ldots,t$, and $\hat\mu_{i,t} = \frac{1}{n_{i,t}+1}\sum_{\tau\le t:\,I_\tau=i} r_\tau$. This observation results in the following variation of the Thompson sampling algorithm.
Algorithm 3 (Thompson Sampling Using Gaussian Priors)
foreach t = 1, 2, ... do
  Independently for each arm $i = 1,\ldots,N$, sample $\theta_{i,t}$ from $\mathcal{N}\big(\hat\mu_{i,t-1},\, 1/(n_{i,t-1}+1)\big)$.
  Play arm $I_t := \arg\max_i \theta_{i,t}$
  Observe reward $r_t$.
end
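Similarly, here is a minimal Python sketch of Algorithm 3 (written for this tutorial; it tracks only the running counts and sums needed for the posterior, and the reward simulator used in the demo is an assumption).

```python
import numpy as np

def thompson_sampling_gaussian(sample_reward, N, T, rng):
    """Thompson sampling with N(0, 1) priors; sample_reward(arm) draws one reward."""
    n = np.zeros(N)        # n_{i,t}: number of pulls of each arm
    s = np.zeros(N)        # running sum of rewards of each arm
    for t in range(T):
        mu_hat = s / (n + 1)                                # posterior means
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(n + 1))    # posterior std = sqrt(1/(n+1))
        arm = int(np.argmax(theta))
        r = sample_reward(arm)
        n[arm] += 1
        s[arm] += r
    return n               # pulls per arm after T rounds

rng = np.random.default_rng(3)
means = [0.3, 0.5, 0.7]
print(thompson_sampling_gaussian(lambda i: rng.normal(means[i], 1.0), N=3, T=10000, rng=rng))
```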
In general, we may not be able to assume that the distributions $\nu_i$ of the arms are Bernoulli or Gaussian or even parametric. Algorithm 3 can be used for an MAB problem even if rewards do not follow a Gaussian distribution and/or there is no Gaussian prior on the distribution parameters. Strictly speaking, in that case, Algorithm 3 would not be a Bayesian posterior sampling algorithm, because the posterior calculations (which were done for Gaussian reward distributions) are no longer valid. Still, the following result from Agrawal and Goyal [6, 8] shows that these versions of the Thompson sampling algorithm achieve a logarithmic
[6, 8] shows that these versions of the Thompson sampling algorithm achieve a logarithmic
problem-dependent regret bound. The regret bounds require only a bounded or sub-Gaussian
reward assumption. The intuition is that given the opportunity of collecting enough samples
from the arms, starting from the exact prior is not important, as long as the starting prior has
enough variance (e.g., uniform distribution or standard Gaussian) and its support includes the
optimal parameter.
Theorem 5 (Agrawal and Goyal [6, 8]). Consider any $N$-armed stochastic MAB instance $\nu = \{\nu_1,\ldots,\nu_N\}$ such that the reward distributions $\{\nu_i\}$ are sub-Gaussian or have bounded support in $[0,1]$. Then, Algorithm 3 achieves the following regret bound for any such instance:
$$\mathcal{R}(T,\nu) \;\le\; \sum_{i\ne i^*}\left(\frac{18\log(T\Delta_i^2)}{\Delta_i} + \frac{25}{\Delta_i}\right) + O(1).$$
Here, $\mu_i$ is the mean of distribution $\nu_i$, and $\Delta_i = \mu^* - \mu_i$.


It is not known whether this algorithm achieves the asymptotic lower bound on regret for the Gaussian MAB, as the previous algorithm did for the Bernoulli MAB. Korda et al. [37] provided a Thompson sampling algorithm for all distributions in the single-parameter exponential family (which includes the Gaussian MAB defined above, because we consider only $\mu_i$ as the "single" unknown parameter) using Jeffreys priors and proved that it achieves the asymptotic lower bounds for this family.
Here, we discussed only frequentist analysis of regret. For near-tight bounds on Bayesian
regret of Thompson sampling, the interested reader should refer to Russo and Van Roy [47],
who derived these bounds for a very general setting.

3. Contextual Bandits
In many sequential decision-making applications, including online recommendation systems (Li
et al. [40]), online advertising (Tang et al. [54]), online retail (Cohen et al. [26]), and healthcare
(Bastani and Bayati [18], Durand et al. [32], Tewari and Murphy [55]), the decision in every
round needs to be customized to the time-varying features of the users being served and/or
seasonal factors. In the contextual bandit problem (Langford and Zhang [39]), also referred to as
“associative reinforcement learning” [52], these factors and features form the context or “side
information” that the algorithm can take into account before making the decision in every round.
3.1. Problem Definition. The precise definition of this problem is as follows. In every round $t$, first the context $x_{i,t}$ for every arm $i = 1,\ldots,N$ is observed, and then the algorithm needs to pick an arm $I_t \in A_t \subseteq \{1,\ldots,N\}$ to be pulled. The outcome of pulling an arm depends on the context $x_{I_t,t}$ of the arm pulled.
A special case of this problem is the linear contextual bandit problem (Abbasi-yadkori et al. [1], Auer [12], Chu et al. [25]), where the expected reward on pulling an arm is a linear function of the context. Specifically, an instance of the linear contextual bandit problem is defined by a $d$-dimensional parameter $\mu \in \mathbb{R}^d$ a priori unknown to the algorithm. The expected value of the observed reward $r_t$ on pulling an arm $i \in A_t$ with context vector $x_{i,t}$ is given by $\mathbb{E}[r_t \mid I_t = i] = \mu^\top x_{i,t}$. The regret definition compares the performance of an algorithm to a clairvoyant policy that picks the arm with the highest expected reward in every round:
$$\mathcal{R}(T) := \sum_{t=1}^{T}\max_{i\in A_t}\mu^\top x_{i,t} \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big].$$

More generally, the contextual bandit problem is defined via a linear or nonlinear, parametric or nonparametric, contextual response function $f(\cdot)$, so that the expected value of the observed reward $r_t$ on pulling an arm $i$ with context vector $x_{i,t}$ is given by $\mathbb{E}[r_t \mid I_t = i] = f(x_{i,t})$. The function $f$ is unknown to the decision maker and may be learned using the observations $r_t$. For the special case of the linear contextual bandit problem defined above, $f(x_{i,t}) = \mu^\top x_{i,t}$. A significant generalization to Lipschitz bandits was provided in Slivkins [51], where the only assumption on $f$ is that it satisfies a Lipschitz condition with respect to a metric.
3.1.1. Significance. The contextual bandit models represent a significant leap beyond the
basic MAB setting: they get rid of the basic premise of stationary distributions for every arm and
allow time-varying distributions. However, some important modeling assumptions implicit in
this setting allow extensions of the algorithmic techniques discussed earlier for MAB. These
assumptions include the ability to observe the contexts before making the decision, concise
parametric dependence of the distribution of rewards on the context and arm (e.g., through the
unknown parameter $\mu$), and contexts that are oblivious to the decisions made. Such assumptions are satisfied in many interesting settings. For example, in a recommendation system, where a user's arrival does not depend strategically on past decisions, user features and external seasonal factors may be observed before a recommendation is served to the user.
Contextual bandits inherit the idea of using features to encode contexts and the models for
the relation between these feature vectors from supervised learning, in order to utilize sim-
ilarity between arms and achieve scalable learning when the number of arms is large. For
example, the linear contextual bandit formulation is useful as a scalable model even when
contexts are not changing with time, but instead, there is a fixed context vector, also known as
the feature vector, associated with every arm. In this formulation, also known as a static
contextual bandit problem, or a linear bandit problem, there is a fixed known feature vector $x_i$ associated with every arm, and the expected value of the observed reward $r_t$ on pulling an arm $i$ with context vector $x_i$ is given by $\mathbb{E}[r_t \mid I_t = i] = \mu^\top x_i$. When the number of arms $N$ is large, this formulation can provide significant advantages in terms of computation and learning efficiency: it reduces the problem of learning $N$ distributions to learning a (potentially much smaller) $d$-dimensional parameter vector $\mu$. The linear bandit problem can thus be thought of as a combination of linear regression and reinforcement learning; more general formulations have been studied, for example, those based on kernel regression (Valko et al. [58]) or under access to an oracle for supervised learning (Agarwal et al. [2], Dudík et al. [31]).
3.1.2. An Illustrative Example. As an example, consider the following network route optimization problem. We are given a graph $G$ with $n$ nodes and $d$ edges, representing a transportation network. There is an unknown expected delay $\mu_e$ associated with each edge $e$ of the $d$ edges in the graph. In every one of the sequential rounds $t = 1,\ldots,T$, the decision maker routes one request from a fixed start node to a fixed end node through this network and observes the total delay on the route. The aim is to learn from the observed delays to optimize the future routes. A naïve MAB formulation would model each possible path in the graph as an arm, to get an exponentially large number of arms and therefore a regret bound that is exponential in $n$. Instead, in a linear bandit formulation, for every path $p$, we can define a feature vector $x_p \in \{0,1\}^d$ as the incidence vector of the path ($x_{p,e} = 1$ if edge $e$ belongs to the path, and $x_{p,e} = 0$ otherwise). Then, the expected delay on a path $p$ is given by $x_p \cdot \mu$, where $\mu \in \mathbb{R}^d$ is the unknown delay vector ($\mu_e$ denotes the unknown expected delay on using edge $e$). Using linear bandit algorithms, we can then obtain regret bounds (as discussed in the following) that scale with $d$ instead of $N$. Furthermore, using the generalization to a linear contextual bandit, we can model the problem where, in every round $t$, a request for a different source-destination pair $(s_t, e_t)$ needs to be routed.
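A small illustration of this feature construction (with a hypothetical edge indexing and delay vector, written for this tutorial):

```python
import numpy as np

# Hypothetical edge indexing for a small graph: edge name -> coordinate in R^d.
edge_index = {"a-b": 0, "b-c": 1, "a-c": 2, "c-d": 3, "b-d": 4}
d = len(edge_index)

def incidence_vector(path_edges):
    """Return x_p in {0,1}^d with x_{p,e} = 1 iff edge e lies on the path."""
    x = np.zeros(d)
    for e in path_edges:
        x[edge_index[e]] = 1.0
    return x

mu = np.array([2.0, 1.5, 4.0, 0.5, 3.0])    # hypothetical expected edge delays
x_p = incidence_vector(["a-b", "b-c", "c-d"])
print(x_p @ mu)                             # expected delay of the path: x_p . mu
```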

3.2. LinUCB Algorithm


The UCB algorithm has been adapted to obtain the LinUCB algorithm for linear contextual bandits (Abbasi-yadkori et al. [1], Auer [12], Chu et al. [25], Rusmevichientong and Tsitsiklis [45]). The following steps are involved in making the decision in any round $\tau$ of this algorithm. Here, we use $x_t$ to denote $x_{I_t,t}$, the context of the arm pulled at time $t$.

Step 1. Given the history up to time $\tau$, $(r_1, x_1), (r_2, x_2), \ldots, (r_\tau, x_\tau)$, compute a regularized least squares estimate of the unknown parameter by solving
$$\hat\mu_\tau = \operatorname*{argmin}_{z\in\mathbb{R}^d}\left\{\sum_{t=1}^{\tau}(r_t - x_t^\top z)^2 + \|z\|^2\right\},$$
whose solution is
$$\hat\mu_\tau = M_\tau^{-1} y_\tau,$$
where $M_\tau = I_{d\times d} + \sum_{t=1}^{\tau} x_t x_t^\top$ and $y_\tau = \sum_{t=1}^{\tau} r_t x_t$.

As a sanity check, consider the $N$-armed bandit problem. It can be modeled as a linear bandit problem with $x_t = \mathbb{1}_{I_t}$ (the $I_t$-th canonical vector) for all $t$; then
$$M_\tau = I + \sum_{t=1}^{\tau} x_t x_t^\top = \begin{bmatrix} n_{1,\tau}+1 & & \\ & \ddots & \\ & & n_{N,\tau}+1 \end{bmatrix} \quad\text{and}\quad y_{\tau,i} = \sum_{s\le\tau:\,I_s=i} r_s; \qquad\text{therefore,}\quad \hat\mu_\tau = \begin{pmatrix} \hat\mu_{1,\tau}\\ \vdots\\ \hat\mu_{N,\tau}\end{pmatrix},$$
where each coordinate $\hat\mu_{i,\tau} = y_{\tau,i}/(n_{i,\tau}+1)$ is (up to the $+1$ from regularization) the empirical mean reward of arm $i$.

Step 2. Using exponential inequalities for self-normalized martingales, the following theorem has been proven in the literature; it provides a useful concentration bound for the computed estimate.
Theorem 6 (Abbasi-yadkori et al. [1], Rusmevichientong and Tsitsiklis [45]). Suppose $\|x_t\|_2 \le \sqrt{Ld}$, $\|\mu\|_2 \le \sqrt{d}$, and $|r_t| \le 1$. Then, with probability at least $1-\delta$, the vector $\mu$ lies in the set
$$C_t = \left\{ z \in \mathbb{R}^d : \|z - \hat\mu_t\|_{M_t} \le \sqrt{d\,\log\!\left(\frac{TdL}{\delta}+1\right)} + \sqrt{d} \right\}. \tag{6}$$
Here, $\|\cdot\|_M$ denotes the matrix norm $\|x\|_M = \sqrt{x^\top M x}$.
Observe that this bound recovers the UCB confidence interval up to a factor of $\sqrt{d}$ in the special case of the $N$-armed bandit problem, using the linear bandit formulation discussed above.
Step 3. Next, we use the above theorem to define an upper confidence bound $\text{UCB}(x)$ for any $x \in \mathbb{R}^d$ such that $\text{UCB}(x) \ge x^\top\mu$ with high probability. Specifically, define
$$\text{UCB}(x) := \max_{z\in C_t} x^\top z.$$
Step 4. At time $t$, pick
$$\arg\max_{i\in A_t}\text{UCB}(x_{i,t}) = \arg\max_{i\in A_t}\max_{z\in C_t} z^\top x_{i,t}.$$

Below is a summary of the steps of the algorithm.

Algorithm 4 (LinUCB Algorithm)
foreach t = 1, ..., T do
  Observe the set $A_t \subseteq [N]$ and the context $x_{i,t}$ for all $i \in A_t$.
  Play arm $I_t = \arg\max_{i\in A_t}\max_{z\in C_t} z^\top x_{i,t}$, with $C_t$ as defined in (6)
  Observe $r_t$. Compute $C_{t+1}$
end
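A minimal Python sketch of Algorithm 4 follows (written for this tutorial). For the ellipsoidal confidence set in (6), the inner maximization has the closed form $\max_{z\in C_t} z^\top x = x^\top\hat\mu_t + \beta_t\|x\|_{M_t^{-1}}$, which the sketch uses; the radius $\beta_t$ merely mimics the form of Theorem 6, and its constants as well as the synthetic environment in the demo are assumptions.

```python
import numpy as np

class LinUCB:
    """LinUCB with the ellipsoidal confidence set of Theorem 6 (constants illustrative)."""

    def __init__(self, d, T, L=1.0, delta=0.05):
        self.M = np.eye(d)            # M_t = I + sum_t x_t x_t^T
        self.y = np.zeros(d)          # y_t = sum_t r_t x_t
        self.d, self.T, self.L, self.delta = d, T, L, delta

    def _beta(self):
        # Confidence radius mimicking the form of (6).
        return np.sqrt(self.d * np.log(self.T * self.d * self.L / self.delta + 1)) + np.sqrt(self.d)

    def select(self, contexts):
        """contexts: array of shape (num_arms, d); return the arm with the largest UCB."""
        M_inv = np.linalg.inv(self.M)
        mu_hat = M_inv @ self.y       # regularized least-squares estimate
        widths = np.sqrt(np.einsum("id,dk,ik->i", contexts, M_inv, contexts))  # ||x||_{M^{-1}}
        return int(np.argmax(contexts @ mu_hat + self._beta() * widths))

    def update(self, x, r):
        self.M += np.outer(x, x)
        self.y += r * x

# Usage on a synthetic linear model (the true parameter and noise level are assumptions).
rng = np.random.default_rng(4)
d, T = 5, 2000
mu_true = rng.uniform(0, 1, d) / np.sqrt(d)
alg = LinUCB(d, T)
for t in range(T):
    X = rng.uniform(0, 1, (10, d)) / np.sqrt(d)   # contexts of 10 available arms
    i = alg.select(X)
    alg.update(X[i], float(X[i] @ mu_true + 0.1 * rng.normal()))
```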
In fact, the above steps can be modified to obtain an algorithm even when the number of arms is infinite, as long as the maximization problem can be solved. In particular, in that case we could define $A_t$ as the set of contexts (corresponding to arms) available in round $t$ and then pick the arm corresponding to the context
$$\arg\max_{x\in A_t}\max_{z\in C_t} z^\top x.$$
One must note, however, that solving the above double maximization problem is difficult (NP-hard even when the sets $A_t$ are convex).
The LinUCB algorithm has been shown to achieve an $\tilde{O}(\sqrt{dT\log N})$ regret bound (Abbasi-yadkori et al. [1]). In case the number of arms is very large, the modified version of the algorithm discussed above can also achieve a regret bound of $\tilde{O}(d\sqrt{T})$, independent of the number of arms. These bounds match the available lower bound for this problem within logarithmic factors in $T$ and $d$ (Bubeck and Cesa-Bianchi [22]). However, as discussed above, in the latter case, the algorithm may not be efficiently implementable. Dani et al. [27] showed a modification that gives an efficiently implementable algorithm with a regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$.

3.3. Thompson Sampling for Linear Contextual Bandits


An extension of Thompson sampling to linear contextual bandits was introduced in Agrawal and Goyal [5, 7]; it is discussed below. The algorithm is derived using a Gaussian likelihood function and a Gaussian prior. Suppose that the likelihood of reward $r_t$ at time $t$, given context $x_{i,t}$ and parameter $\mu$, were given by the probability density function of a Gaussian distribution $\mathcal{N}(x_{i,t}^\top\mu, v^2)$. Here, $v = R\sqrt{9d\ln(T/\delta)}$, where $\delta$ is a parameter used by the algorithm. As before, let $x_t := x_{I_t,t}$, and define
$$M_t := I_d + \sum_{\tau=1}^{t-1} x_\tau x_\tau^\top, \qquad \hat\mu_t := M_t^{-1}\left(\sum_{\tau=1}^{t-1} x_\tau r_\tau\right). \tag{7}$$
Then, if the prior for $\mu$ at time $t$ is given by $\mathcal{N}(\hat\mu_t, v^2 M_t^{-1})$, then using Bayes' rule it is easy to compute the posterior distribution at time $t+1$,
$$\Pr(\tilde\mu \mid r_t) \propto \Pr(r_t \mid \tilde\mu)\,\Pr(\tilde\mu),$$
as $\mathcal{N}(\hat\mu_{t+1}, v^2 M_{t+1}^{-1})$ (details of this computation are in Agrawal and Goyal [5, 7]). The Thompson sampling algorithm (Agrawal and Goyal [5, 7]) generates a sample $\tilde\mu_t$ from the distribution $\mathcal{N}(\hat\mu_t, v^2 M_t^{-1})$ at every time step $t$, and it pulls the arm $i$ that maximizes $x_{i,t}^\top\tilde\mu_t$.
Algorithm 5 (Thompson Sampling for Linear Contextual Bandits)
foreach t = 1, 2, ... do
  Observe context $x_{i,t}$ for all arms $i = 1,\ldots,N$.
  Sample $\tilde\mu_t$ from distribution $\mathcal{N}(\hat\mu_t, v^2 M_t^{-1})$.
  Play arm $I_t := \arg\max_i x_{i,t}^\top\tilde\mu_t$.
  Observe reward $r_t$. Compute $\hat\mu_{t+1}, M_{t+1}$ as given by (7).
end
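A minimal Python sketch of Algorithm 5 (written for this tutorial; the choice of $v$ and the synthetic reward model in the usage example are assumptions):

```python
import numpy as np

class LinearTS:
    """Thompson sampling for linear contextual bandits (Gaussian prior and likelihood)."""

    def __init__(self, d, v, seed=0):
        self.M = np.eye(d)       # M_t as in (7)
        self.b = np.zeros(d)     # running sum of x_t r_t, so that mu_hat_t = M_t^{-1} b
        self.v = v               # posterior scale, e.g. v = R * sqrt(9 d ln(T / delta))
        self.rng = np.random.default_rng(seed)

    def select(self, contexts):
        """contexts: array (num_arms, d). Sample mu_tilde and play the best arm for it."""
        M_inv = np.linalg.inv(self.M)
        mu_hat = M_inv @ self.b
        mu_tilde = self.rng.multivariate_normal(mu_hat, self.v ** 2 * M_inv)
        return int(np.argmax(contexts @ mu_tilde))

    def update(self, x, r):
        self.M += np.outer(x, x)
        self.b += r * x

# Usage on a synthetic instance (the true parameter, noise level, and v are assumptions).
rng = np.random.default_rng(5)
d, T, N = 5, 2000, 10
mu_true = rng.uniform(0, 1, d) / np.sqrt(d)
alg = LinearTS(d, v=0.5)
for t in range(T):
    X = rng.uniform(0, 1, (N, d)) / np.sqrt(d)
    i = alg.select(X)
    alg.update(X[i], float(X[i] @ mu_true + 0.1 * rng.normal()))
```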
It must be noted that the Gaussian priors and the Gaussian likelihood model for rewards are
only used above to design the Thompson sampling algorithm for contextual bandits. The
regret bounds below hold in worst-case parameter settings and do not require rewards to have
a Gaussian distribution. They rely only on the bounded reward assumption (or more generally,
on an assumption of R-sub-Gaussian distribution of rewards; Agrawal and Goyal [5]).
Theorem 7 (Agrawal and Goyal [5]). With probability $1-\delta$, the total regret for the Thompson sampling algorithm in time $T$ is bounded as
$$\mathcal{R}(T) = O\!\left(d^{3/2}\sqrt{T}\left(\ln(T) + \sqrt{\ln(T)\ln(1/\delta)}\right)\right), \quad\text{or} \tag{8}$$
$$\mathcal{R}(T) = O\!\left(d\sqrt{T\ln(N)}\left(\ln(T) + \sqrt{\ln(T)\ln(1/\delta)}\right)\right), \tag{9}$$
whichever is smaller, for any $0 < \delta < 1$, where $\delta$ is a parameter used by the algorithm.
The above regret bounds are within a $\sqrt{d}\,\ln T$ factor of the lower bound for this problem (Bubeck and Cesa-Bianchi [22]).
The regret bound in (8) for the Thompson sampling algorithm has a slightly worse de-
pendence on d compared with the corresponding bounds for the LinUCB algorithm. However,
the bounds match the best available bounds for any efficiently implementable algorithm for
this problem (e.g., that given by Dani et al. [27]).

4. Combinatorial Bandits: The MNL-Bandit Problem


In many applications of sequential decision making, the decision in every round can be best
described as pulling of a set or “assortment” of multiple arms. For example, consider the
problem of choosing a set of ads to display on a page in online advertising, or the assortment of
products to recommend to a customer in online retail. The decision maker needs to select
a subset of items from a universe of items. The objective may be to maximize the expected
number of clicks or sales revenue. Importantly, the customer response to the recommended
assortment may depend on the combination of items and not just the marginal utility of each
item in the assortment. For example, two complementary items such as bread and milk may
generate more purchases when presented together. On the other hand, an item’s purchase
probability may decrease when it is presented along with a substitutable item, that is, another product with similar functionality but a different brand, color, or price; this is referred to as the substitution effect. Thus, pulling an arm (i.e., offering an item as part of an assortment) no longer generates
a reward from its marginal distribution independent of other arms.
A general combinatorial bandit problem can be stated as the problem of selecting a subset $S_t \subseteq [N]$ in each of the sequential rounds $t = 1,\ldots,T$. On selecting a subset $S_t$, a reward $r_t$ is observed with expected value $\mathbb{E}[r_t \mid S_t] = f(S_t)$, where the set function $f : 2^{[N]} \to [0,1]$ is unknown. The goal is to minimize regret against the subset with maximum expected value:
$$\mathcal{R}(T) := T f(S^*) - \mathbb{E}\Big[\sum_t r_t\Big] = \mathbb{E}\Big[\sum_{t=1}^{T}\big(f(S^*) - f(S_t)\big)\Big], \tag{10}$$
where $S^* = \arg\max_{S\subseteq[N]} f(S)$. However, it is easy to construct instances of the function $f(\cdot)$ such that the lower bounds for the MAB problem would imply a regret at least exponential in $N$. In many cases, even if the expected value $f(S)$ is known for all $S$, computing $S^*$ may be intractable. Therefore, for this problem to be tractable, some structural assumptions on $f(\cdot)$ must be utilized. Examples of such structural assumptions include the linear model $f(S) = \mu^\top\mathbb{1}_S$ and the Lipschitz functions (metric bandits) discussed in the previous section. Another example is the assumption of submodularity of the function $f$, also known as the submodular bandit problem. The algorithm for online submodular minimization in Hazan and Kale [33] achieves a regret bounded by $O\big(NT^{2/3}\sqrt{\log(1/\delta)}\big)$ with probability $1-\delta$ for the submodular bandit problem. Their results are, in fact, applicable to the adversarial bandit problem (i.e., when $r_t = f_t(S_t)$ for an arbitrary unknown sequence of submodular functions $f_1,\ldots,f_T$).
This tutorial focuses on the useful structural properties provided by some well-studied
consumer choice models in assortment optimization. Choice models capture substitution
effects among products by specifying the probability that a consumer selects a product from
the offered set. The multinomial logit (MNL) model is a natural and convenient way to specify
these distributions; it is one of the most widely used choice models for assortment selection
problems in retail settings. The model was introduced independently by Luce [41] and
Plackett [44]; see also Ben-Akiva and Lerman [19], McFadden [42], and Train [57] for further
discussion and surveys of other commonly used choice models.

4.1. The MNL-Bandit Problem


The MNL-bandit problem is formulated by using the MNL choice model to model the feedback $f(S)$ on offering a set $S$ in the combinatorial bandit problem. In this problem, in every round $t = 1,\ldots,T$, the seller selects an assortment $S_t \subseteq \{1,\ldots,N\}$ and observes the customer's choice $c_t \in S_t \cup \{0\}$, where $\{0\}$ denotes the no-purchase alternative, which is always available to the consumer. Consumer choice is modeled using an MNL model. Under this model, the probability that a consumer purchases product $i$ at time $t$ when offered an assortment $S_t = S \subseteq \{1,\ldots,N\}$ is given by
$$P(c_t = i \mid S_t = S) = p_i(S) := \begin{cases} \dfrac{v_i}{v_0 + \sum_{j\in S} v_j} & \text{if } i \in S\cup\{0\},\\[4pt] 0 & \text{otherwise},\end{cases} \tag{11}$$
where $v_0, v_1, \ldots, v_N$ are fixed, but a priori unknown, parameters of the MNL model. From (11), the expected revenue at time $t$ is
$$\mathbb{E}[r_t \mid S_t = S] = f(S) := \sum_{i\in S} q_i\, p_i(S) = \frac{\sum_{i\in S} q_i v_i}{1 + \sum_{j\in S} v_j}, \tag{12}$$
where $q_i$ is the revenue obtained when product $i$ is purchased and is known a priori (here and in the rest of this section, $v_0$ is normalized to 1).
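To make (11) and (12) concrete, the following short Python function (written for this tutorial, with $v_0$ normalized to 1 as in (12)) computes the purchase probabilities and the expected revenue of an offered assortment; the parameter values in the demo are assumptions.

```python
import numpy as np

def mnl_expected_revenue(S, v, q):
    """Purchase probabilities and expected revenue f(S) under the MNL model (v_0 = 1).

    S: indices of offered products; v: attraction parameters; q: per-product revenues.
    """
    denom = 1.0 + sum(v[i] for i in S)          # 1 + sum_{j in S} v_j
    probs = {i: v[i] / denom for i in S}        # p_i(S) for offered products
    p0 = 1.0 / denom                            # no-purchase probability
    f = sum(q[i] * probs[i] for i in S)         # expected revenue of the assortment
    return f, probs, p0

v = np.array([0.8, 0.5, 0.3, 1.0])              # hypothetical MNL parameters v_1, ..., v_N
q = np.array([1.0, 0.8, 0.6, 0.4])              # hypothetical per-product revenues
print(mnl_expected_revenue([0, 1, 3], v, q))
```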
If the consumer preference model parameters (i.e., MNL parameters v) are known a priori,
then the problem of computing the optimal assortment, which we refer to as the static as-
sortment optimization problem, is well studied. Talluri and van Ryzin [53] considered the
unconstrained assortment planning problem under the MNL model and presented a greedy
approach to obtain the optimal assortment. Recent works by Davis et al. [28] and Désir and
Goyal [30] consider assortment planning problems under MNL with various constraints.
In the absence of a priori knowledge of the MNL parameters, the bandit problem aims to
offer a sequence of assortments, S1 ; . . . ; ST , where T is the planning horizon, and learn the
model parameters in order to maximize cumulative expected revenue or minimize regret as
defined in (10).

4.2. Algorithmic Techniques and Regret Bounds


A key difficulty in applying standard MAB techniques to this problem is that the response observed on offering a product $i$ (as part of an assortment $S$) is not independent of the other products in the assortment. Therefore, the $N$ products cannot be directly treated as $N$ independent arms. As mentioned before, a naive extension of MAB algorithms for this problem would treat each of the $\binom{N}{K}$ possible assortments as an arm, leading to a computationally inefficient algorithm with regret exponential in $K$. The algorithms in Agrawal et al. [10, 11] utilize the specific properties of the dependence structure in the MNL model to obtain efficient algorithms based on the principle of optimism under uncertainty and Thompson sampling, respectively.
Here, we present the algorithm from Agrawal et al. [10], which is based on a nontrivial extension of the UCB algorithm. It utilizes two novel ideas. The first idea is to offer a set $S$ repeatedly until a no-purchase happens. The MNL model then implies that the expected number of purchases of product $i$ in these repeated offerings is $v_i/v_0$. Thus, assuming $v_0 = 1$, the number of observed purchases in such repeated offerings provides an unbiased estimate of the parameter $v_i$. Therefore, upper confidence bounds $v_i^{\text{UCB}}$ can be constructed for each item $i$ in a manner similar to the UCB algorithm. Second, the authors observe that, given the upper confidence bound estimates for each item parameter, the problem of finding an optimistic assortment can be formulated as a static assortment optimization problem with $\{v_i^{\text{UCB}}, i = 1,\ldots,N\}$ as the model parameters.
More precisely, if in any round $t$ a purchase of any item in the offered set $S_t$ is observed, then the algorithm continues to offer the same assortment in round $t+1$ (i.e., $S_{t+1} = S_t$).

If a no-purchase is observed in round $t$, then the algorithm updates the parameter estimates and makes a new assortment selection for round $t+1$ in the following way. Let $n_{i,t}$ be the number of time steps up to time $t$ in which item $i$ was offered as part of an assortment, and let $m_{i,t}$ be the number of times the item was purchased (i.e., picked by the customer). Then,
$$\bar{v}_{i,t} = \frac{m_{i,t}}{n_{i,t}}. \tag{13}$$
Using these estimates, upper confidence bounds $v_{i,t}^{\text{UCB}}$ are computed as
$$v_{i,t}^{\text{UCB}} := \bar v_{i,t} + \sqrt{\frac{12\,\bar v_{i,t}}{n_{i,t}}\log T} + \frac{30\log^2 T}{n_{i,t}}. \tag{14}$$
In Agrawal et al. [10], it is proven that $v_{i,t}^{\text{UCB}}$ is an upper confidence bound on the true parameter $v_i$ (i.e., $v_{i,t}^{\text{UCB}} \ge v_i$ for all $i$ with high probability).

From the above estimates, define a revenue estimate:
$$f(S, v_t^{\text{UCB}}) := \frac{\sum_{i\in S} r_i\, v_{i,t}^{\text{UCB}}}{1 + \sum_{j\in S} v_{j,t}^{\text{UCB}}}. \tag{15}$$
The algorithm then selects assortment $S_{t+1}$ as the assortment with the highest revenue estimate; that is,
$$S_{t+1} := \arg\max_{S\subseteq[N],\,|S|\le K} f(S, v_t^{\text{UCB}}). \tag{16}$$

In fact, $f(S_{t+1}, v_t^{\text{UCB}})$ gives an upper confidence bound on the optimal revenue. That is, with high probability, $f(S_{t+1}, v_t^{\text{UCB}}) \ge f(S^*)$ (Agrawal et al. [10]). Furthermore, the optimization problem (16) is a standard $K$-cardinality-constrained assortment optimization problem under the MNL model, with model parameters $v_{i,t}^{\text{UCB}}, i = 1,\ldots,N$. There are efficient polynomial-time algorithms available to solve this assortment optimization problem. Davis et al. [28] showed a simple linear programming formulation of this problem. Rusmevichientong and Tsitsiklis [45] proposed an enumerative method that utilizes the observation that the optimal assortment belongs to an efficiently enumerable collection of $N^2$ assortments.
The algorithm is summarized as Algorithm 6.
The results in Agrawal et al. [10] provide the following upper bound on the regret of
Algorithm 6.
Algorithm 6 (UCB-Based Algorithm for the MNL-Bandit Problem)
Initialize $\bar v_{i,0} = 1$ for $i = 1,\ldots,N$, and $c_0 = 0$
foreach t = 1, ..., T do
  if $c_{t-1} = 0$ then
    Compute $v_{i,t-1}^{\text{UCB}}$ for $i = 1,\ldots,N$ as given by (14).
    Offer set $S_t := \arg\max_{S\subseteq[N],\,|S|\le K} f(S, v_{t-1}^{\text{UCB}})$
  else
    Offer set $S_t := S_{t-1}$
  end
  Observe the consumer's choice $c_t \in S_t \cup \{0\}$.
end
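A minimal Python sketch of Algorithm 6 follows (written for this tutorial). It uses the UCB formula (14), a brute-force search over assortments of size at most K for the optimization in (16) (a practical implementation would use the polynomial-time methods cited above), and a simple simulator of MNL choices; the instance parameters in the demo are assumptions.

```python
import numpy as np
from itertools import combinations

def mnl_bandit_ucb(v_true, q, K, T, rng):
    """UCB-based algorithm for the MNL-bandit problem (v_0 = 1 assumed known)."""
    N = len(v_true)
    n = np.zeros(N)                # n_i: offerings of item i, counted per completed epoch
    m = np.zeros(N)                # m_i: purchases of item i
    v_bar = np.ones(N)             # \bar v_{i,0} = 1
    c_prev, S = 0, []

    def revenue(assortment, vals):
        return sum(q[i] * vals[i] for i in assortment) / (1.0 + sum(vals[i] for i in assortment))

    for t in range(1, T + 1):
        if c_prev == 0:            # last round ended in a no-purchase: re-estimate and re-optimize
            v_bar = np.where(n > 0, m / np.maximum(n, 1), 1.0)
            v_ucb = (v_bar + np.sqrt(12.0 * v_bar * np.log(T) / np.maximum(n, 1))
                     + 30.0 * np.log(T) ** 2 / np.maximum(n, 1))        # formula (14)
            S = max((list(c) for k in range(1, K + 1) for c in combinations(range(N), k)),
                    key=lambda c: revenue(c, v_ucb))                    # brute-force (16)
        # Simulate the customer's MNL choice from the offered set S (environment, not algorithm).
        weights = np.array([1.0] + [v_true[i] for i in S])
        c_prev = int(rng.choice([0] + [i + 1 for i in S], p=weights / weights.sum()))
        if c_prev > 0:
            m[c_prev - 1] += 1     # purchase observed: same assortment is offered next round
        else:
            for i in S:
                n[i] += 1          # epoch ends with a no-purchase: count this epoch's offerings
    return v_bar

rng = np.random.default_rng(6)
print(mnl_bandit_ucb(v_true=[0.6, 0.4, 0.9, 0.2], q=[0.5, 0.8, 0.3, 1.0], K=2, T=5000, rng=rng))
```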

Theorem 8. For any instance of the MNL-bandit problem with $N$ products, $1 \le K \le N$, $r_i \in [0,1]$, and $v_0 \ge v_i$ for $i = 1,\ldots,N$, the regret of Algorithm 6 in time $T$ is bounded as
$$\mathcal{R}(T) = O\big(\sqrt{NT}\log T + N\log^3 T\big).$$

In Chen and Wang [24], a lower bound of $\Omega(\sqrt{NT})$ on the regret of any online algorithm for the MNL-bandit problem is provided under the same assumptions as in Theorem 8. Thus, Algorithm 6 is near optimal.

5. Bandits with Long-Term Constraints and Nonadditive Rewards


For many real-world problems, there are multiple complex constraints on resources that are
consumed during the entire decision process. Furthermore, it may be desirable to evaluate the
solution not simply by the sum of rewards obtained at individual time steps but by a more
complex utility function. For instance, consider the online ad allocation problem (Chakrabarti
et al. [23], Pandey and Olston [43]). In this problem, in each round the decision maker needs to
pick an ad to be displayed, and the goal is to maximize the total number of clicks. An MAB
formulation of this problem would use exploration-exploitation to implicitly learn the click-
through rates of ads in order to identify the ad with maximum click-through rate. However, in
practice, advertisers may have prespecified budgets, as a result of which there is a constraint on
the number of times an ad can be shown even if it is identified to be the one with maximum
click-through rate. Furthermore, to model advertisers' dissatisfaction at being underserved, we may want to include underdelivery penalties as part of the objective; these are often formulated as nonlinear functions of the leftover budget (e.g., see Balseiro et al. [17]). Similar considerations
originate from product supply constraints and nonlinear risk functions on revenue in dynamic
pricing and network revenue management (Babaioff et al. [15], Besbes and Zeevi [20, 21]).
Such global constraints and objectives introduce state dependence in rewards/costs, so that
the rewards are no longer independent across time. For example, in the budgeted ad allocation
problem discussed above, the vector of leftover advertiser budgets captures the state in every
round, which determines the reward in addition to the arm pulled. In particular, pulling an
arm (ad) generates a reward (payment-per-click) of 0 after its budget is consumed. Even before
the budget is consumed, the reward in round t may be given by the increment in a nonadditive/
nonlinear utility function, and therefore it depends on the number of pulls so far as determined
by the current state. Such state dependence suggests a need for modeling these problems as an
MDP (with an unknown reward and transition function) and using reinforcement learning to
solve them. Fortunately, however, for many such problems of interest, the specific problem structure can be exploited to formulate them as a variation of MAB, so that many efficient algorithm design and analysis techniques can be extended and applied. Below we discuss two such
formulations in the recent literature.

5.1. Bandit with Knapsacks (Badanidiyuru et al. [16])


In the bandit with knapsacks (BwK) problem, as defined by Badanidiyuru et al. [16], there are $d-1$ resources and $B$ units available for each resource. In every time step $t$, after pulling an arm $I_t \in \{1, \ldots, N\}$, a reward $r_t \in [0,1]$ and a cost vector $c_t \in [0,1]^{d-1}$ are observed. Given $I_t = i$, the observations $(r_t, c_t)$ in round $t$ are generated independently of previous rounds from a fixed but unknown distribution for arm $i$, so that for all rounds, $E[r_t \mid I_t = i] = \mu_i$ and $E[c_t \mid I_t = i] = C_i \in \mathbb{R}^{d-1}$. The goal is to maximize the total reward while satisfying knapsack constraints on the total cost. That is,

maximize $\sum_{t=1}^T r_t$ while ensuring $\sum_{t=1}^T c_t \le B \mathbf{1}_{d-1}.$   (17)
This formulation allows box constraints on the aggregate (sum of) costs incurred.
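For reference, the following Python sketch spells out the interaction protocol behind (17) under one common convention, namely that the process stops as soon as some resource budget would be exceeded; the policy object and the pull_arm routine are hypothetical placeholders for the learner and the unknown environment, not part of the formulation itself.

import numpy as np

def run_bwk(policy, pull_arm, B, num_resources, T):
    # remaining units of each of the d-1 resources
    budget = np.full(num_resources, float(B))
    total_reward = 0.0
    for t in range(T):
        arm = policy.select_arm(t)            # I_t in {1, ..., N}
        reward, cost = pull_arm(arm)          # r_t in [0,1], c_t in [0,1]^{d-1}
        if np.any(budget < cost):             # some knapsack constraint would be violated
            break
        budget -= cost
        total_reward += reward
        policy.update(arm, reward, cost)      # feed the observation back to the learner
    return total_reward, budget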
As an example, the blind network revenue management problem discussed in Besbes and Zeevi [21] can be formulated as a special case of this problem. In that problem, a firm is selling multiple products. At every time step $t$, the firm chooses a price vector $q_t$ from a set of $N$ possible price vectors for its products and observes a realization of the demand vector $D_t$, generating revenue $r_t = q_t \cdot D_t$ and resource consumption $c_t = A D_t$ of the $d-1$ resources. The distribution of demand is unknown. A typical assumption is that the demand at any time $t$ is generated by

a multivariate Poisson process with intensity $\lambda(q_t)$ at time $t$, where the function $\lambda(\cdot)$ is unknown. The expected revenue at time $t$ is therefore $E[r_t \mid q_t] = q_t \cdot \lambda(q_t)$. The objective is to dynamically adjust the prices in order to maximize the expected revenue while satisfying the resource constraints.
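As a small illustration of this reduction, the sketch below wraps an assumed demand model into the per-round feedback of the BwK formulation: each arm is a candidate price vector $q$, and a pull returns reward $r_t = q_t \cdot D_t$ and cost $c_t = A D_t$. The sample_demand routine is a hypothetical stand-in for the unknown demand distribution with intensity $\lambda(q)$, and the rescaling of rewards and costs into $[0,1]$ required by the BwK model is omitted.

import numpy as np

def nrm_pull_arm(price_vectors, A, sample_demand):
    # Returns a pull_arm function in the BwK interface for the blind network
    # revenue management problem.
    def pull_arm(arm_index):
        q = price_vectors[arm_index]     # chosen price vector q_t
        D = sample_demand(q)             # realized demand vector D_t at these prices
        reward = float(np.dot(q, D))     # revenue r_t = q_t . D_t
        cost = A @ D                     # resource consumption c_t = A D_t
        return reward, cost
    return pull_arm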
The next formulation generalizes the BwK setting to further allow convex constraints on
aggregate cost and concave functions of aggregate reward.

5.2. Bandits with Global Convex Constraints and Objective (Agrawal and
Devanur [3])
In this problem, we are given a convex set $S \subseteq [0,1]^d$ and a concave objective function $f : [0,1]^d \to [0,1]$. On playing an arm $I_t \in \{1, \ldots, N\}$ at time $t$, the algorithm observes a vector $v_t \in [0,1]^d$, generated independently of the previous observations from a fixed but unknown distribution such that $E[v_t \mid I_t] = V_{I_t}$, the $I_t$th column vector of the matrix $V$. The matrix $V \in [0,1]^{d \times m}$ is fixed but unknown to the algorithm. The goal is to make the average of the observed vectors $\frac{1}{T}\sum_t v_t$ be contained in the set $S$ and at the same time maximize $f\big(\frac{1}{T}\sum_t v_t\big)$:

maximize $f\big(\frac{1}{T}\sum_{t=1}^T v_t\big)$ while ensuring $\frac{1}{T}\sum_{t=1}^T v_t \in S.$   (18)

A simple special case of this formulation occurs when the observation vector $v_t$ is composed of a reward and several costs (i.e., $v_t = (r_t; c_t)$); $f(\cdot)$ is a function of the first component of this vector (the reward), i.e., $f\big(\frac{1}{T}\sum_t v_t\big) = f\big(\frac{1}{T}\sum_t r_t\big)$; and the constraints are only on the last $d-1$ components of this vector (the costs), i.e., $\frac{1}{T}\sum_{t=1}^T c_t \in S$. Through this special case, the bandits with global convex constraints and objective (BwCR) model generalizes BwK to allow maximizing arbitrary concave utility functions on the total reward under arbitrary convex constraints on the total cost.
But more generally, the observation vector $v_t$ can be used to model other aspects of user feedback beyond rewards and costs in order to incorporate unconventional considerations into the objective/constraints. For example, suppose that the arms belong to two subgroups $(A, A')$, and the decision maker is interested in fairness across the two groups in addition to revenue. This consideration can be captured in the BwCR model as follows. Extend the vector $v_t$ to include one more component indicating membership in a subgroup: let $v_t := (r_t; c_t; a_t)$, where $a_t$ is set to 1 if the pulled arm belongs to group $A$ and 0 otherwise. Now, extend the objective to $f\big(\frac{1}{T}\sum_t r_t\big) - \big\|\frac{1}{2} - \frac{1}{T}\sum_t a_t\big\|_2$. This objective is concave in $\frac{1}{T}\sum_t v_t$ and encourages equal allocation between the two groups.
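The sketch below evaluates this fairness-augmented objective on the running average of the observation vectors; the concave utility f is left as a placeholder, and the layout of $v_t$ (reward first, group-membership indicator last) follows the construction above.

import numpy as np

def fairness_objective(observations, f):
    # observations: array of shape (t, d) holding v_1, ..., v_t = (r, c, a)
    v_avg = np.mean(observations, axis=0)           # (1/t) sum of the observed vectors
    avg_reward = v_avg[0]                           # first component: average reward
    avg_group_a = v_avg[-1]                         # last component: fraction of pulls in group A
    return f(avg_reward) - abs(0.5 - avg_group_a)   # concave in the average vector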

5.3. Regret Definition


In this problem, two kinds of regret are analyzed: regret in the objective and regret in the constraints, denoted below by avg-regret$_1$ and avg-regret$_2$, respectively. Regret in the objective (avg-regret$_1$) compares the algorithm's objective value to the objective value achieved by the best static distribution over arms; that is,

avg-regret$_1(T) := \max_{p \in \Delta_N : V p \in S} f(V p) - f\big(\frac{1}{T}\sum_{t=1}^T v_t\big),$   (19)

and regret in the constraints (avg-regret$_2$) captures the constraint violation:

avg-regret$_2(T) := d\big(\frac{1}{T}\sum_{t=1}^T v_t, S\big).$   (20)

Here, $\Delta_N$ denotes the $N$-dimensional simplex, and $d(x, S)$ is a distance function defined as

$d(x, S) := \min_{y \in S} \|x - y\|,$   (21)

with $\|\cdot\|$ denoting an $L_q$ norm for some $q$.
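For intuition, the following sketch computes avg-regret$_2$ from (20) in the special case where $S$ is an axis-aligned box and the norm in (21) is the Euclidean ($L_2$) norm, so that the nearest point of $S$ is obtained by coordinate-wise clipping; a general convex $S$ would require a projection oracle instead.

import numpy as np

def avg_regret_constraints(observations, lo, hi):
    # observations: array of shape (T, d); S = {x : lo <= x <= hi} coordinate-wise
    x = np.mean(observations, axis=0)               # (1/T) sum_t v_t
    nearest = np.clip(x, lo, hi)                    # closest point of the box S
    return float(np.linalg.norm(x - nearest))       # d(x, S) under the L2 norm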


Agrawal and Devanur [3] justified the above as a very strong definition of regret by showing that $f(V p^*) \ge \mathrm{OPT}_f$, where $p^*$ is the optimal static distribution in (19) and $\mathrm{OPT}_f$ is the maximum expected objective value achievable by a clairvoyant online algorithm. In particular, this clairvoyant online algorithm can utilize the knowledge of the distributions governing the vector outcomes for each arm, and furthermore, it is required to be feasible only in expectation. That is,

$\mathrm{OPT}_f := \max \; E\big[f\big(\frac{1}{T}\sum_{t=1}^T v_t\big)\big]$ subject to $E\big[\frac{1}{T}\sum_{t=1}^T v_t\big] \in S,$   (22)

where the maximum is taken over all online algorithms. Then, the following is proven in Agrawal and Devanur [3] using the concavity of $f$ and the convexity of $S$.
Lemma 1 (Agrawal and Devanur [3]). Assuming (22) is feasible, there exists a distribution $p^*$ over the $N$ arms such that $V p^* \in S$ and $f(V p^*) \ge \mathrm{OPT}_f$.

5.4. Algorithmic Techniques and Regret Bounds


A challenge in using UCB techniques for the BwCR problem is that the observation vector cannot be interpreted as either a cost or a reward. Agrawal and Devanur [3] presented a UCB-like algorithm, which constructs both lower and upper confidence bounds on the expected feedback $V_i$ for every arm $i$ and then considers the range of estimates defined by these bounds. More precisely, for every arm $i$ and component $j$, two estimates $\mathrm{LCB}_{t,ji}(V)$ and $\mathrm{UCB}_{t,ji}(V)$ are constructed at time $t$ using past observations in the following manner. Define the empirical average $\hat{V}_{t,ji}$ for each arm $i$ and component $j$ at time $t$ as

$\hat{V}_{t,ji} := \frac{\sum_{s < t : I_s = i} v_{s,j}}{k_{t,i} + 1},$   (23)

where $k_{t,i}$ is the number of plays of arm $i$ before time $t$. Let $\mathrm{rad}(\nu, N) = \sqrt{\frac{\gamma \nu}{N}} + \frac{\gamma}{N}$ for some $\gamma > 0$. Then, define the upper and lower confidence bounds as

$\mathrm{UCB}_{t,ji}(V) := \min\{1, \hat{V}_{t,ji} + 2\,\mathrm{rad}(\hat{V}_{t,ji}, k_{t,i} + 1)\},$
$\mathrm{LCB}_{t,ji}(V) := \max\{0, \hat{V}_{t,ji} - 2\,\mathrm{rad}(\hat{V}_{t,ji}, k_{t,i} + 1)\},$   (24)

for $i = 1, \ldots, m$; $j = 1, \ldots, d$; and $t = 1, \ldots, T$.
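A direct transcription of the estimates (23) and (24) into Python might look as follows; history[i] is assumed to hold the vectors observed on past plays of arm $i$, and the radius follows the form of rad stated above.

import numpy as np

def confidence_bounds(history, d, m, gamma):
    LCB = np.zeros((d, m))
    UCB = np.ones((d, m))
    for i in range(m):
        plays = history[i]                  # list of observed d-dimensional vectors for arm i
        k = len(plays)                      # k_{t,i}: number of plays of arm i so far
        v_hat = np.sum(plays, axis=0) / (k + 1) if k > 0 else np.zeros(d)     # eq. (23)
        rad = np.sqrt(gamma * v_hat / (k + 1)) + gamma / (k + 1)              # rad(v_hat, k+1)
        UCB[:, i] = np.minimum(1.0, v_hat + 2 * rad)                          # eq. (24)
        LCB[:, i] = np.maximum(0.0, v_hat - 2 * rad)
    return LCB, UCB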
The design and analysis of the UCB algorithm for BwCR then closely follow the design and analysis of the UCB algorithm for the $N$-armed bandit problem. In particular, parallel to the two observations made in Section 2.1, the following observations are made regarding $\mathrm{UCB}_t$ and $\mathrm{LCB}_t$ (Agrawal and Devanur [3]):
(1) The mean for every arm $i$ and component $j$ is guaranteed to lie in the range defined by its estimates $\mathrm{LCB}_{t,ji}(V)$ and $\mathrm{UCB}_{t,ji}(V)$, with high probability. That is, with probability $1 - mTd\,e^{-\Omega(\gamma)}$,

$V \in \mathcal{H}_t,$ where   (25)
$\mathcal{H}_t := \{\tilde{V} : \tilde{V}_{ji} \in [\mathrm{LCB}_{t,ji}(V), \mathrm{UCB}_{t,ji}(V)], \ j = 1, \ldots, d, \ i = 1, \ldots, m\}.$   (26)

(2) Let the probability of playing arm $i$ at time $t$ be $p_{t,i}$. Then, with probability $1 - mTd\,e^{-\Omega(\gamma)}$, the total difference between the estimated and the actual observations for the played arms can be bounded as

$\big\| \sum_{t=1}^T (\tilde{V}_t p_t - v_t) \big\| \le O\big(\|\mathbf{1}_d\| \sqrt{\gamma m T}\big)$   (27)

for any $\{\tilde{V}_t\}_{t=1}^T$ such that $\tilde{V}_t \in \mathcal{H}_t$ for all $t$.

The first property shows that $[\mathrm{LCB}_{t,i}, \mathrm{UCB}_{t,i}]$ forms a high-confidence interval for the unknown parameter $V_i$ at time $t$, and the second property shows that this interval becomes more refined as more observations are made, so that the total estimation error over the played arms is small. Then, using the optimism-under-uncertainty principle, at time $t$, the UCB algorithm for BwCR plays the best arm (or the best distribution over arms) according to the best estimates in the set $\mathcal{H}_t$. The algorithm is summarized as Algorithm 7.
Algorithm 7 (UCB Algorithm for BwCR)
for each $t = 1, 2, \ldots, T$ do

    $p_t = \arg\max_{p \in \Delta_m} \ \max_{\tilde{U} \in \mathcal{H}_t} f(\tilde{U} p)$
          subject to $\min_{\tilde{V} \in \mathcal{H}_t} d(\tilde{V} p, S) = 0.$   (28)

    If no feasible solution is found to the above problem, set $p_t$ arbitrarily.
    Play arm $i$ with probability $p_{t,i}$.
end
Observe that when $f(\cdot)$ is a monotone nondecreasing function, as in the classic MAB problem (where $f(x) = x$), the inner maximizer in the objective of (28) will simply be $\tilde{U}_t = \mathrm{UCB}_t(V)$, and therefore, for the classic MAB problem, this algorithm reduces to the UCB algorithm.
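As a minimal sketch of this observation, consider the unconstrained scalar case ($d = 1$, $f(x) = x$, and $S = [0,1]$, so the feasibility condition in (28) is vacuous); the optimistic choice of $p_t$ then concentrates all its mass on the arm with the largest upper confidence bound, which is exactly the UCB rule.

import numpy as np

def optimistic_distribution_classic_mab(ucb):
    # ucb[i] = UCB_t(V)_i for the single (reward) component of arm i
    p = np.zeros(len(ucb))
    p[int(np.argmax(ucb))] = 1.0   # all mass on the arm with the highest UCB
    return p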
Given the parallel between the two properties of the confidence interval $[\mathrm{LCB}_{t,i}, \mathrm{UCB}_{t,i}]$ mentioned above and the two properties of the $\mathrm{UCB}_{i,t}$ estimates in Section 2.1, the regret analysis
of Algorithm 7 follows steps similar to those for analyzing the UCB algorithm for the N -armed
bandit problem. The following regret bounds are proven in Agrawal and Devanur [3].
Theorem 9. For any $\delta > 0$, with probability $1 - \delta$, the regret of Algorithm 7 for the BwCR problem is bounded as

avg-regret$_1(T) = O\big(L \|\mathbf{1}_d\| \sqrt{\gamma m / T}\big)$ and avg-regret$_2(T) = O\big(\|\mathbf{1}_d\| \sqrt{\gamma m / T}\big),$

where $\gamma = O(\log(mTd/\delta))$, and $\mathbf{1}_d$ is the $d$-dimensional vector of all ones.

Observe that Algorithm 7 requires computing the best distribution for the most optimistic
estimates in the given confidence interval. Specifically, (28) requires maximizing the maximum
of some concave functions while ensuring that the minimum of some convex (distance)
functions is less than or equal to 0. Agrawal and Devanur [3] demonstrated that this opti-
mization problem is, in fact, a convex optimization problem and is solvable in time poly-
nomial in N and d. However, solving the optimization problem can still be slow. Agrawal and
Devanur [3] also presented two efficient algorithmic approaches, based on primal-dual tech-
niques and on the Frank–Wolfe algorithm for convex optimization. The primal-dual algorithm
maintains (Fenchel) dual variables representing the incremental cost of the constraints or the
incremental value of the objective. A procedure for efficiently updating these dual variables
is provided. Then, the problem of choosing an arm in a round involves simply comparing
different arms based on a combination of these dual variables for each arm. For more details,
refer to Agrawal and Devanur [3].

6. Summary
This tutorial discussed some recent advances in bandit models and their applications to
a variety of operations management problems. Beyond the models presented here, many
additional innovative settings have been studied in the recent literature that expand the
representability and applicability of the multiarmed bandit model. These include bandits with
delayed feedback (Joulani et al. [34]), sleeping bandits (Kanade et al. [35]), bandits with

switching costs (Dekel et al. [29]), and so forth. More recently, there has also been progress on
using bandit techniques for settings that involve more complex dependence on states—for
example, the inventory control problem (Zhang et al. [59]) and finite-state, finite-action MDPs
(e.g., Agrawal and Jia [9], Russo et al. [49]).

Endnotes
1. This is derived from the Azuma–Hoeffding inequality, which states that given samples $x_1, \ldots, x_n \in [0,1]$ with $E[x_i \mid x_1, \ldots, x_{i-1}] = \mu$,
$\Pr\big( \big| \frac{1}{n}\sum_{i=1}^n x_i - \mu \big| \ge \epsilon \big) \le 2 e^{-n\epsilon^2/2}.$
2. A Beta distribution has support $(0,1)$ and two parameters $(\alpha, \beta)$, with probability density function
$f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}.$
Here, $\Gamma(x)$ is called the Gamma function. For integers $x \ge 1$, $\Gamma(x) = (x-1)!$.
3. The $\tilde{O}(\cdot)$ notation hides logarithmic factors in $T$ and $d$, in addition to the absolute constants.

References
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits.
J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds. Advances in
Neural Information Processing Systems Vol. 24. Curran Associates, Red Hook, NY, 2312–2320,
2011.
[2] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and
simple algorithm for contextual bandits. E. P. Xing and T. Jebara, eds. Proceedings of the 31st
International Conference on Machine Learning. PMLR, 1638–1646, 2014.
[3] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. Proceedings
of the 15th ACM Conference on Economics and Computation. ACM, New York, 989–1006, 2014.
[4] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem.
S. Mannor, N. Srebro, and R. C. Williamson, eds. Proceedings of the 25th Annual Conference on
Learning Theory. PMLR, 39.1–39.26, 2012.
[5] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. Working
paper, Microsoft Research India, Bangalore. https://arxiv.org/abs/1209.3352, 2012.
[6] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. C. M. Carvalho
and P. Ravikumar, eds. Proceedings of the 16th International Conference on Artificial Intelligence
and Statistics. PMLR, 99–107, 2013.
[7] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. S. Dasgupta
and D. McAllester, eds. Proceedings of the 30th International Conference on Machine Learning.
JMLR, 1220–1228, 2013.
[8] S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. Journal of ACM
64(5):1–30, 2017.
[9] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret
bounds. I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett, eds. Advances in Neural Information Processing Systems, Vol. 30. Curran Associates,
Red Hook, NY, 1184–1194, 2017.
[10] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. A near-optimal exploration-exploitation
approach for assortment selection. Proceedings of the 2016 ACM Conference on Economics and
Computation. ACM, New York, 599–600, 2016.
[11] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for the MNL-bandit.
S. Kale and O. Shamir, eds. Proceedings of the 30th Annual Conference on Learning Theory.
PMLR, 76–78, 2017.
[12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine
Learning Research 3(3):397–422, 2002.
[13] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning 47(2–3):235–256, 2002.
[14] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit
problem. SIAM Journal on Computing 32(1):48–77, 2002.

[15] M. Babaioff, S. Dughmi, R. Kleinberg, and A. Slivkins. Dynamic pricing with limited supply. Pro-
ceedings of the 13th ACM Conference on Electronic Commerce. ACM, New York, 74–91, 2012.
[16] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. Proceedings of the 2013
IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE Computer Society,
Washington, DC, 207–216, 2013.
[17] S. R. Balseiro, J. Feldman, V. Mirrokni, and S. Muthukrishnan. Yield optimization of display
advertising with ad exchange. Management Science 60(12):2886–2907, 2014.
[18] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. Working
paper, University of Pennsylvania, Philadelphia, 2015.
[19] M. Ben-Akiva and S. Lerman. Discrete Choice Analysis: Theory and Application to Travel Demand,
Vol. 9. MIT Press, Cambridge, MA, 1985.
[20] O. Besbes and A. Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and
near-optimal algorithms. Operations Research 57(6):1407–1420, 2009.
[21] O. Besbes and A. Zeevi. Blind network revenue management. Operations Research 60(6):1537–1550,
2012.
[22] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends in Machine Learning 5(1):1–122, 2012.
[23] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. D. Koller,
D. Schuurmans, Y. Bengio, and L. Bottou, eds. Advances in Neural Information Processing
Systems, Vol. 21. Curran Associates, Red Hook, NY, 273–280, 2008.
[24] X. Chen and Y. Wang. A note on tight lower bound for MNL-bandit assortment selection models.
Operation Research Letters 46(5):534–537, 2018.
[25] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions.
G. J. Gordon, D. B. Dunson, and M. Dudík, eds. Proceedings of the 14th International Conference
on Artificial Intelligence and Statistics. PMLR, 2011.
[26] M. C. Cohen, I. Lobel, and R. Paes Leme. Feature-based dynamic pricing. Proceedings of the 2016
ACM Conference on Economics and Computation. ACM, New York, 817–817, 2016.
[27] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback.
Proceedings of the 21st Conference on Learning Theory. 355–366, 2008.
[28] J. Davis, G. Gallego, and H. Topaloglu. Assortment planning under the multinomial logit model
with totally unimodular constraint structures. Technical report, Cornell University, Ithaca, NY,
2013.
[29] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: T2/3 regret. Proceedings of
the 46th Annual ACM Symposium on Theory of Computing. ACM, New York, 459–467, 2014.
[30] A. Désir and V. Goyal. Near-optimal algorithms for capacity constrained assortment optimization.
Working paper, Columbia University, New York, 2014.
[31] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient
optimal learning for contextual bandits. F. Cozman and A. Pfeffer, eds. Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA, 169–178, 2011.
[32] A. Durand, C. Achilleos, D. Iacovides, K. Strati, G. D. Mitsis, and J. Pineau. Contextual bandits for
adapting treatment in a mouse model of de novo carcinogenesis. Proceedings of the 3rd Machine
Learning for Healthcare Conference, Vol. 85. PMLR, 67–82, 2018.
[33] E. Hazan and S. Kale. Online submodular minimization. Journal of Machine Learning Research
13(1):2903–2922, 2012.
[34] P. Joulani, A. György, and C. Szepesvári. Online learning under delayed feedback. S. Dasgupta and
D. McAllester, eds. Proceedings of the 30th International Conference on Machine Learning. PMLR,
1453–1461, 2013.
[35] V. Kanade, H. B. McMahan, and B. Bryan. Sleeping experts and bandits with stochastic action
availability and adversarial rewards. D. van Dyk and M. Welling, eds. Proceedings of the 12th
International Conference on Artificial Intelligence and Statistics. PMLR, 272–279, 2009.
[36] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-
time analysis. N. H. Bshouty, G. Stoltz, N. Vayatis, and T. Zeugmann, eds. Algorithmic Learning
Theory—23rd International Conference. Springer, Berlin, 199–213, 2012.
[37] N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family
bandits. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.
Advances in Neural Information Processing Systems, Vol. 26. Curran Associates, Red Hook, NY,
1448–1456, 2013.

[38] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied
Mathematics 6(1):4–22, 1985.
[39] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. J. C. Platt,
D. Koller, Y. Singer, and S. T. Roweis, eds. Advances in Neural Information Processing Systems,
Vol. 20. Curran Associates, Red Hook, NY, 817–824, 2007.
[40] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news
article recommendation. Proceedings of the 19th International Conference on World Wide Web.
ACM, New York, 661–670, 2010.
[41] R. Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, New York, 1959.
[42] D. McFadden. Modeling the choice of residential location. Transportation Research Record (673):
72–77, 1978.
[43] S. Pandey and C. Olston. Handling advertisements of unknown quality in search advertising.
B. Schölkopf, J. C. Platt, and T. Hoffman, eds. Advances in Neural Information Processing
Systems, Vol. 19. MIT Press, Cambridge, MA, 1065–1072, 2006.
[44] R. L. Plackett. The analysis of permutations. Applied Statistics 24(2):193–202, 1975.
[45] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations
Research 35(2):395–411, 2010.
[46] P. Rusmevichientong, Z. M. Shen, and D. B. Shmoys. Dynamic assortment optimization with a
multinomial logit choice model and capacity constraint. Operations Research 58(6):1666–1680,
2010.
[47] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations
Research 39(4):1221–1243, 2014.
[48] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of
Machine Learning Research 17(1):2442–2471, 2016.
[49] D. Russo, I. Osband, and B. Van Roy. (More) efficient reinforcement learning via posterior
sampling. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.
Advances in Neural Information Processing Systems, Vol. 26. Curran Associates, Red Hook, NY,
2013.
[50] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling.
Foundations and Trends in Machine Learning 11(1):1–96, 2018.
[51] A. Slivkins. Multi-armed bandits on implicit metric spaces. J. Shawe-Taylor, R. S. Zemel,
P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds. Advances in Neural Information Processing
Systems, Vol. 24. Curran Associates, Red Hook, NY, 1602–1610, 2011.
[52] A. L. Strehl. Associative reinforcement learning. C. Sammut and G. I. Webb, eds. Encyclopedia of
Machine Learning. Springer US, Boston, MA, 49–51, 2019.
[53] K. Talluri and G. van Ryzin. Revenue management under a general discrete choice model of
consumer behavior. Management Science 50(1):15–33, 2004.
[54] L. Tang, R. Rosales, A. Singh, and D. Agarwal. Automatic ad format selection via contextual
bandits. Proceedings of the 22nd ACM International Conference on Information and Knowledge
Management. ACM, New York, 1587–1594, 2013.
[55] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health.
J. M. Rehg, S. A. Murphy, and S. Kumar, eds. Mobile Health: Sensors, Analytic Methods and
Applications. Springer, Cham, Switzerland, 495–517, 2017.
[56] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the
evidence of two samples. Biometrika 25(3–4):285–294, 1933.
[57] K. E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, Cambridge,
UK, 2009.
[58] M. Valko, N. Korda, R. Munos, I. N. Flaounas, and N. Cristianini. Finite-time analysis of kernelised
contextual bandits. A. Nicholson and P. Smyth, eds. Proceedings of the 29th Conference on
Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, OR, 2013.
[59] H. Zhang, X. Chao, and C. Shi. Closing the gap: A learning algorithm for the lost-sales inventory
system with lead times. Working paper, Pennsylvania State University, State College. https://
ssrn.com/abstract=2922820, 2018.
