Recent Advances in Multiarmed Bandits for Sequential Decision Making
To cite this entry: Shipra Agrawal (https://orcid.org/0000-0003-4486-3871). Recent Advances in Multiarmed Bandits for Sequential Decision Making. In INFORMS TutORials in Operations Research. Published online: 02 Oct 2019; 167-188. https://doi.org/10.1287/educ.2019.0204
© 2019 INFORMS | ISBN 978-0-9906153-3-0
https://pubsonline.informs.org/series/educ
Abstract. Reinforcement learning (RL) is a very general framework for making sequential decisions when the underlying system dynamics are a priori unknown. RL algorithms use the outcomes of past actions to learn the system dynamics and improve the decision maker's strategy over time. The stochastic multiarmed bandit (MAB) problem refers to a special case of RL in which the responses are independent across different actions and independent and identically distributed across time for a given action. The stochastic MAB problem enjoys availability of many efficient algorithms in the literature, with rigorous near-optimal theoretical guarantees on performance. This tutorial discusses some recent advances in sequential decision-making models that build on the basic MAB setting to greatly expand its purview. Specifically, we discuss progress on three models that lie between MAB and RL: (a) contextual bandits, (b) combinatorial bandits, and (c) bandits with long-term constraints and nonadditive rewards. These models incorporate settings that are well beyond the purview of MAB by getting rid of premises such as stationary distributions and independence of feedback across actions or across time. This tutorial discusses the state of the art in algorithm design and analysis techniques for these and related models, along with applications in several domains such as online advertising, recommendation systems, crowdsourcing, healthcare, network routing, assortment optimization, revenue management, and resource allocation.
1. Introduction
In many operations management problems, decision makers are faced with the challenge of
making sequential decisions that are not just profitable today but also put the system into
a better position to face the constraints and uncertainties of tomorrow. In order to achieve this,
the decision maker needs to utilize the past observations and data to understand the nature of
uncertainties, and optimize for the long-term goals.
Reinforcement learning (RL) is a very general framework for learning to optimize such
sequential decisions under uncertainty. In reinforcement learning, the underlying stochastic
process is modeled as a Markov decision process (MDP). However, the parameters of this
model (e.g., parameters determining the reward function or the transition probability dis-
tribution) are a priori unknown to the decision maker. A reinforcement learning algorithm
learns the unknown model parameters by exploring different actions and uses the observed
outcomes to adaptively improve the decision maker’s policy over time. This requires the
algorithm to manage the trade-off between exploration and exploitation—that is, exploring
different actions in order to learn versus taking actions that currently seem to be reward
maximizing.
The stochastic multiarmed bandit (MAB) problem is a special case of RL in which the responses to actions are independent across time and across different actions and, furthermore,
feedback from one action does not provide any information about another action. The multiarmed bandit problem derives its name from the problem of a gambler playing on N rigged
nonidentical slot machines in a casino. The extent to which each machine is rigged is unknown
to the gambler. Pulling the arm of any machine once requires investing one dollar. The gambler
has a fixed amount of dollars that can be invested to pull the arms of the slot machines in order
to receive some (random) reward. The gambler must decide which arms to pull in a sequence of
trials so as to maximize total reward. The gambler wants to use the outcomes of the pulls to
learn the statistics of the slot machines and use that learning to adaptively make better
investments over time. Thus, in each trial, the gambler faces a trade-off between exploration
(in order to find the best machine) and exploitation (playing the arm of the machine believed
to give the best payoff).
MAB is often considered as the most fundamental model that captures the exploration-
exploitation trade-off in sequential decision making. Although restricted in its ability to capture
many practical sequential decision-making settings, the basic MAB problem is a mature area
of research and enjoys availability of many efficient algorithms with rigorous near-optimal
theoretical guarantees on performance. This tutorial discusses some recent advances in sequential decision-making models that build on the above-described basic MAB setting to greatly expand its purview.
Specifically, we discuss progress on three models that lie between MAB and RL: (a) contextual bandits, (b) combinatorial bandits, and (c) bandits with long-term constraints and
nonadditive rewards. These models allow incorporating problem settings and considerations
that may violate some basic premises of the MAB framework such as the assumptions of
stationary distribution and independence of feedback across actions or across time. Although
these settings still do not allow the full generality of a reinforcement learning problem, their
structure has enabled new efficient algorithms and performance analysis based on multiarmed
bandit techniques. In this tutorial, we discuss the state of the art in algorithm design and
analysis techniques for these and related models, along with applications in several domains
such as online advertising, recommendation systems, crowdsourcing, healthcare, network
routing, assortment optimization, revenue management, and resource allocation.
1.1. Organization
The rest of this tutorial is organized as follows. Section 2 provides an introduction to the stochastic multiarmed bandit problem and popular algorithmic techniques. Sections 3, 4, and 5 discuss the three variations of contextual bandits, combinatorial bandits, and bandits with long-term constraints and nonadditive objectives, respectively.
2. Stochastic Multiarmed Bandit Problem
In the stochastic MAB problem, the decision maker chooses one of N arms (actions) in each of T sequential rounds; each pull of an arm yields a reward generated independently from a fixed but unknown distribution associated with that arm. The decision maker must manage the exploration-exploitation trade-off in the face of this a priori uncertainty about the arms' reward distributions, as well as the randomness in the outcomes of arm pulls, which may affect the algorithm's sequential decisions.
As an example of a multiarmed bandit problem, consider a news website that needs to decide
which articles to display to a visitor. Showing an article (i.e., pulling an arm) generates clicks
(i.e., reward). The website has no a priori information about the click-through rates (CTRs) of
different articles. Then, the MAB problem can be used to formulate the problem of sequentially choosing articles to display from a fixed pool of N articles in order to learn the CTRs of different articles while maximizing the user engagement, measured as the total number of clicks. It is important to note, however, that this basic MAB formulation makes an implicit assumption that every time an article is shown, it has the same likelihood of generating a click (irrespective of the preferences of the current user or the external context). Later, we discuss extensions such as the contextual bandit problem, which can handle these additional considerations.
2.1. Regret Definition. To measure the performance of an algorithm for the MAB
problem, it is common to work with the measure of expected total regret (i.e., the amount lost
because of not playing the optimal arm in each step).
To formally define regret, let us introduce some notation. Let $\mu^* := \max_i \mu_i$, and let $\Delta_i := \mu^* - \mu_i$. Let $n_{i,t}$ denote the total number of times arm $i$ is played in rounds 1 to $t$; thus $n_{i,T}$ is a random variable. Then the expected total regret in $T$ rounds is defined as
$$ \mathcal{R}(T) := E\left[\sum_{t=1}^{T} (\mu^* - \mu_{I_t})\right] = E\left[\sum_{i=1}^{N} n_{i,T}\, \Delta_i\right], $$
where expectation is taken with respect to both randomness in outcomes, which may affect the
sequential decisions made by the algorithm, and any randomization in the algorithm.
Two kinds of regret bounds appear in the literature for the stochastic MAB problem:
(1) logarithmic problem-dependent (or instance-dependent) bounds that may have dependence on problem parameters such as $\Delta_i$ or $\mu_i$, and
(2) sublinear problem-independent (or worst-case) bounds that provide uniform bounds for
all instances with N arms.
To differentiate between these two types of bounds, we use the more detailed notation $\mathcal{R}(T, \mathcal{I})$ to denote regret for problem instance $\mathcal{I}$. The instance $\mathcal{I}$ is specified by the sufficient statistics for the reward distributions of arms. Then, problem-dependent or instance-dependent bounds on regret are bounds on $\mathcal{R}(T, \mathcal{I})$ for every problem instance $\mathcal{I}$, in terms of $T$, $\Delta_i, i = 1, \ldots, N$, and possibly other distribution parameters associated with $\mathcal{I}$. Problem-independent bounds are bounds on the worst-case regret $\max_{\mathcal{I}} \mathcal{R}(T, \mathcal{I})$, where the maximum is taken over all instances with $N$ arms, with arbitrary arm distributions (possibly within a given family of distributions).
For example, Auer [12] derives an instance-dependent bound of
$$ \mathcal{R}(T, \mathcal{I}) = O\Big(\sum_{i:\, \mu_i < \mu^*} \frac{1}{\mu^* - \mu_i} \log T\Big) $$
on the regret of the UCB algorithm for any instance $\mathcal{I}$ in which the distribution for each arm $i$ has bounded support in $[0, 1]$ and mean $\mu_i$, with $\mu^* = \max_i \mu_i$. The same article also provides a worst-case regret bound of $O(\sqrt{NT \log T})$ over all such instances.
2.1.1. Bayesian Regret. If a good prior on the distribution of instances $\mathcal{I}$ is available, we would hope to obtain better performance. Accordingly, another definition of regret, called Bayesian regret, is often considered in the literature, especially when analyzing algorithms such as Thompson sampling (discussed later), which are based on Bayesian posterior sampling. Given a prior $P(\mathcal{I})$ over instances $\mathcal{I}$ of the stochastic MAB problem, the Bayesian regret is the expected regret over instances sampled from this prior:
$$ \text{Bayesian regret in time } T = E_{\mathcal{I} \sim P}\big[\mathcal{R}(T, \mathcal{I})\big], $$
where the outer expectation is over the instance $\mathcal{I}$ and the inner expectation (inside $\mathcal{R}(T, \mathcal{I})$) is over the algorithm's randomization and the reward generation.
In this article, we focus on the frequentist regret bounds (problem dependent and problem
independent); however, some references to Bayesian analysis are provided at relevant places.
2.2. Overview of Algorithmic Techniques. We briefly discuss two widely used algorithmic techniques for multiarmed bandit problems: (1) optimism under uncertainty, or more specifically, the upper confidence bound (UCB) algorithm (Auer [12], Auer et al. [13]), and (2) posterior sampling, or more specifically, the Thompson sampling (TS) algorithm (Agrawal and Goyal [4, 8], Russo and Van Roy [47], Russo et al. [50], Thompson [56]). Some other prominent techniques include inverse propensity scoring and multiplicative weight update algorithms such as the EXP3 algorithm (Auer et al. [14]), the epsilon-greedy algorithm, and the successive elimination algorithm (see the survey by Bubeck and Cesa-Bianchi [22]).
2.3. The UCB Algorithm. The UCB algorithm computes the following quantity for each arm $i$ at the end of each round $t$:
$$ \mathrm{UCB}_{i,t} := \hat{\mu}_{i,t} + \sqrt{\frac{2 \ln t}{n_{i,t}}}, \qquad (2) $$
where $\hat{\mu}_{i,t}$ denotes the empirical mean of the rewards observed from arm $i$ in rounds 1 to $t$. Then, in each round, the algorithm pulls the arm with the highest current UCB index. The algorithm is summarized as Algorithm 1.
Algorithm 1 (UCB Algorithm for the Stochastic N-Armed Bandit Problem)
foreach t = 1, ..., N do
    Play arm t
end
foreach t = N + 1, N + 2, ..., T do
    Play arm I_t = argmax_{i in {1,...,N}} UCB_{i,t-1}
    Observe r_t, and compute UCB_{i,t} for all arms i
end
Here, for simplicity, it was assumed that $T \ge N$, and the algorithm starts by playing every arm once. Other variations of this algorithm may be found in Auer [12] and Bubeck and Cesa-Bianchi [22].
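To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 (not part of the original tutorial): rewards are assumed to lie in [0, 1], and the helper pull(i) and the Bernoulli simulation at the end are illustrative stand-ins for the real environment.

import numpy as np

def ucb(pull, N, T):
    """Minimal sketch of the UCB algorithm (Algorithm 1).

    pull(i) returns a random reward in [0, 1] for arm i.
    """
    counts = np.zeros(N)   # n_{i,t}: number of pulls of arm i so far
    sums = np.zeros(N)     # cumulative reward observed from arm i
    # Initialization: play every arm once.
    for i in range(N):
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(N + 1, T + 1):
        means = sums / counts
        # UCB index from equation (2): empirical mean plus exploration bonus.
        ucb_index = means + np.sqrt(2.0 * np.log(t) / counts)
        i = int(np.argmax(ucb_index))
        sums[i] += pull(i)
        counts[i] += 1
    return sums.sum()

# Example usage with Bernoulli arms (the means are illustrative).
rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.7]
total_reward = ucb(lambda i: rng.binomial(1, mu[i]), N=3, T=10000)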
2.3.1. Regret Analysis. Intuitively, the additional term $\sqrt{2 \ln t / n_{i,t}}$ in (2) allows exploration of an arm that has been played less often so far (i.e., an arm with low $n_{i,t}$), even if its current empirical mean estimate is low. A key observation in the analysis of the UCB algorithm
is that the term $\sqrt{4 \ln t / n_{i,t}}$ is a high-confidence upper bound on the empirical error of $\hat{\mu}_{i,t}$. More precisely, for each arm $i$ at time $t$, we have that^1 with probability at least $1 - 2/t^2$,
$$ |\hat{\mu}_{i,t} - \mu_i| < \sqrt{\frac{4 \ln t}{n_{i,t}}}. \qquad (3) $$
There are two useful observations we can immediately derive from (3):
(1) A lower bound for $\mathrm{UCB}_{i,t}$: with probability at least $1 - 2/t^2$,
$$ \mathrm{UCB}_{i,t} > \mu_i. \qquad (4) $$
(2) An upper bound for $\hat{\mu}_{i,t}$ with many samples: given that $n_{i,t} \ge (16 \ln t)/\Delta_i^2$, with probability at least $1 - 2/t^2$,
$$ \hat{\mu}_{i,t} < \mu_i + \frac{\Delta_i}{2}. \qquad (5) $$
Inequality (4) states that the UCB value is, with high probability, at least as large as the true mean reward: in this sense, the UCB algorithm is optimistic. Inequality (5) states that given enough (specifically, at least $(16 \ln t)/\Delta_i^2$) samples, the reward estimate probably does not exceed the true reward by more than $\Delta_i/2$. Together, these bounds can be used to show that after being pulled $(16 \ln t)/\Delta_i^2$ times, a suboptimal arm $i$ has a very low probability of being pulled in subsequent rounds. More precisely, consider any arm $i$ with $\mu_i < \mu^*$. At the end of round $t$, let $n_{i,t} \ge (16 \ln t)/\Delta_i^2$. Then, if both (4) and (5) hold,
$$
\begin{aligned}
\mathrm{UCB}_{i,t} &= \hat{\mu}_{i,t} + \sqrt{\frac{2 \ln t}{n_{i,t}}} \;\le\; \hat{\mu}_{i,t} + \frac{\Delta_i}{2} && \text{since } n_{i,t} \ge \frac{16 \ln t}{\Delta_i^2},\\
&< \mu_i + \frac{\Delta_i}{2} + \frac{\Delta_i}{2} && \text{by (5)},\\
&= \mu^* && \text{since } \Delta_i := \mu^* - \mu_i,\\
&< \mathrm{UCB}_{i^*,t} && \text{by (4) applied to the optimal arm } i^*.
\end{aligned}
$$
The probability of (4) or (5) not holding is at most $4/t^2$ by the union bound. Because the algorithm's selection criterion plays arm $i$ in round $t+1$ only if $\mathrm{UCB}_{i,t} \ge \mathrm{UCB}_{i^*,t}$, the probability of playing arm $i$ in round $t+1$ is at most $4/t^2$. This observation can be used to show that, with high probability, $n_{i,T}$ does not substantially exceed $(16 \ln T)/\Delta_i^2$ for any time horizon $T$ and suboptimal arm $i$. Because the regret of playing arm $i$ once is $\Delta_i = \mu^* - \mu_i$, this yields the following upper bound on regret. Formal proofs with tighter constants can be found in Auer [12] and Bubeck and Cesa-Bianchi [22].
Theorem 1. Let $\mathcal{R}(T, \mathcal{I})$ denote the regret of the UCB algorithm in time $T$ for an instance $\mathcal{I} = \{\mu_1, \nu_1; \ldots; \mu_N, \nu_N\}$ of the stochastic independent and identically distributed (i.i.d.) multiarmed bandit problem. For all instances $\mathcal{I}$, and all $T \ge N$, the expected regret of the UCB algorithm is bounded as
$$ \mathcal{R}(T, \mathcal{I}) \le \sum_{i:\, \mu_i < \mu^*} \left( \frac{16 \ln T}{\Delta_i} + 8 \Delta_i \right), $$
where $\Delta_i = \mu^* - \mu_i$.
Theorem 1 gives an upper bound on $\mathcal{R}(T, \mathcal{I})$ that is logarithmic in $T$. This is near optimal: Lai and Robbins [38] provided a lower bound demonstrating that any algorithm must suffer at least $\Omega(\ln T)$ expected total regret on any instance $\mathcal{I}$. However, note that the bound in Theorem 1 depends on the parameters $\Delta_1, \ldots, \Delta_N$ (i.e., it is an "instance-dependent" or
"problem-dependent" bound). This bound does not directly imply a very good worst-case bound: for an instance with $\Delta_i = \ln T / T$, this bound is linear in $T$. But a simple trick can be applied to obtain a sublinear "instance-independent" (a.k.a. "problem-independent" or "worst-case") regret bound. The idea is to separately bound the regret of arms with $\Delta_i \le \sqrt{(16 N \ln T)/T}$ by the trivial bound $T \Delta_i \le \sqrt{16 N T \ln T}$; for the remaining arms, substitute $\Delta_i \ge \sqrt{(16 N \ln T)/T}$ to bound the regret due to their pulls by $\sum_i (16 \ln T)/\Delta_i \le \sqrt{16 N T \ln T}$. This analysis gives the following instance-independent bound.
Theorem 2. For all $T \ge N$, the expected total regret achieved by the UCB algorithm in $T$ rounds is
$$ \mathcal{R}(T) \le 8 \sqrt{NT \ln T} + 8N. $$
2.4. Thompson Sampling. The Thompson sampling algorithm takes a Bayesian view of the unknown parameters. To contrast the two viewpoints, suppose a learner observes independent samples from a Bernoulli distribution with unknown parameter $\theta$, say the sequence 0, 0, 1, 1, 0, 1, 1, 1, 0, 0. A frequentist would guess that $\theta$ is close to 0.5 with some confidence (probability).
On the other hand, a Bayesian learner maintains a probability distribution (or belief) to
capture the uncertainty about the unknown parameter. At the beginning (before seeing the
data), the prior distribution encodes the initial belief of the learner about the value of the
parameter. Upon seeing the data, the learner updates the belief using Bayes’ rule. This
updated distribution is called the posterior distribution.
Let us continue with the example from above, where the learner observes independent realizations from a Bernoulli distribution with parameter $\theta$. Let the learner start with a prior $p(x)$ representing the learner's prior belief (probability) that $\theta$ takes value $x$:
$$ p(x) = \Pr[\theta = x]. $$
After observing data $D$ (e.g., the samples 0, 0, 1, 1, 0, 1, 1, 1, 0, 0), the learner obtains a posterior distribution using Bayes' rule:
$$ \Pr[\theta = x \mid D] = \frac{\Pr[D \mid \theta = x] \Pr[\theta = x]}{\Pr[D]} \propto \Pr[D \mid \theta = x]\, p(x). $$
Here, $\Pr[D \mid \theta = x]$ is the probability of generating the data $D$ from the Bernoulli distribution with parameter $x$. This is also called the likelihood function.
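As a small numerical illustration of this update (the grid discretization and the uniform prior below are illustrative choices, not part of the tutorial), the posterior over $\theta$ can be computed directly from Bayes' rule:

import numpy as np

# Grid-based illustration of the Bayes update for a Bernoulli parameter theta.
data = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]         # the example sample D from the text
theta_grid = np.linspace(0.001, 0.999, 999)    # candidate values x for theta
prior = np.ones_like(theta_grid)               # uniform prior p(x)

# Likelihood Pr[D | theta = x] for i.i.d. Bernoulli observations.
k, n = sum(data), len(data)
likelihood = theta_grid ** k * (1 - theta_grid) ** (n - k)

posterior = likelihood * prior
posterior /= posterior.sum() * (theta_grid[1] - theta_grid[0])   # normalize to a density
print(theta_grid[np.argmax(posterior)])        # posterior mode, approximately 0.5 here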
2.4.2. Algorithm Overview. Suppose that for each arm $i$, the reward is generated from some parametric distribution $\nu_i$. Then, the overall structure of the Thompson sampling algorithm, as described in Thompson [56], is as follows:
For every arm i, start with a prior belief on the parameters of its reward distribution.
In every round t,
—pull an arm with its probability of being the best arm according to the current belief, and
—use the observed reward to update the posterior belief distribution for the pulled arm.
Given the prior distribution and the likelihood function, in some cases the posterior distribution has a closed analytical form. In particular, given Bernoulli i.i.d. samples, if the prior is
a Beta distribution,2 then the posterior distribution is also given by a Beta distribution. Also,
given Gaussian i.i.d. samples, if the prior is a Gaussian distribution, then the posterior is also
given by a Gaussian distribution. This property makes these distributions a convenient choice
for implementation of Thompson sampling. Below, we give precise details of the TS algorithm
for the special cases of (a) Bernoulli reward distribution and (b) Gaussian reward distribution.
2.4.3. Thompson Sampling for Bernoulli MAB. Assume that we have a Bernoulli multiarmed bandit (Bernoulli MAB) instance. That is, for arm $i$, every time it is pulled, the reward is generated from Bernoulli($\mu_i$). The aim is to learn the model parameters $\mu_i$, $i = 1, \ldots, N$, for all arms in order to find the best arm.
The calculations below show that given a Beta prior with parameters $(\alpha, \beta)$, on observing one sample $r \in \{0\ (\text{w.p. } 1-\theta),\ 1\ (\text{w.p. } \theta)\}$, the posterior distribution is Beta($\alpha + r$, $\beta + 1 - r$):
$$
\begin{aligned}
\Pr[\theta \mid r] &\propto \Pr[r \mid \theta]\, \Pr[\theta] \\
&= \mathrm{Bernoulli}_{\theta}(r)\, \mathrm{Beta}_{\alpha,\beta}(\theta) \\
&= \theta^r (1-\theta)^{1-r}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1} \\
&\propto \theta^{\alpha+r-1} (1-\theta)^{\beta-r} \\
&\propto \mathrm{Beta}_{\alpha+r,\, \beta+1-r}(\theta).
\end{aligned}
$$
From this observation, for every arm $i$, the Thompson sampling algorithm (as presented in Agrawal and Goyal [4, 6, 8]) starts with a uniform prior belief Beta(1, 1) about its mean. After $n_{i,t}$ pulls in rounds $1, \ldots, t$, the algorithm updates its belief to Beta($S_{i,t} + 1$, $F_{i,t} + 1$), where $S_{i,t}$ is the number of 1s observed in the $n_{i,t}$ pulls of arm $i$, and $F_{i,t}$ is the number of 0s observed in the $n_{i,t}$ pulls of arm $i$. The initial values of these variables (before any pulls) are set to 0.
The algorithm, at time $t$, plays an arm $i$ with its probability of being the best. That is, if $X_j$, $j = 1, \ldots, N$, are random variables distributed as Beta($S_{j,t} + 1$, $F_{j,t} + 1$), then the algorithm plays arm $i$ with probability $\Pr(X_i > \max_{j \ne i} X_j)$. Note that a quick way to implement this is to generate a sample from Beta($S_{i,t} + 1$, $F_{i,t} + 1$) for each arm $i$ and then pull the arm whose sample is largest.
Algorithm 2 (Thompson Sampling for Bernoulli MAB Using Beta Priors)
foreach t = 1, 2, ... do
    For each arm i = 1, ..., N, independently sample theta_{i,t} ~ Beta(S_{i,t-1} + 1, F_{i,t-1} + 1)
    Play arm I_t := argmax_i theta_{i,t}
    Observe r_t, and update S_{I_t,t} = S_{I_t,t-1} + r_t, F_{I_t,t} = F_{I_t,t-1} + 1 - r_t
end
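The following is a minimal Python sketch of Algorithm 2 (again, pull(i) and the Bernoulli simulation are illustrative; rewards are assumed to be 0/1):

import numpy as np

def thompson_sampling_bernoulli(pull, N, T, seed=0):
    """Minimal sketch of Algorithm 2: Thompson sampling with Beta priors.

    pull(i) returns a 0/1 reward for arm i.
    """
    rng = np.random.default_rng(seed)
    successes = np.zeros(N)   # S_i: number of 1s observed for arm i
    failures = np.zeros(N)    # F_i: number of 0s observed for arm i
    total = 0
    for t in range(T):
        # One draw from each arm's Beta posterior (starting from the uniform Beta(1,1) prior).
        theta = rng.beta(successes + 1, failures + 1)
        i = int(np.argmax(theta))
        r = pull(i)
        successes[i] += r
        failures[i] += 1 - r
        total += r
    return total

# Example usage with Bernoulli arms (the means are illustrative).
rng = np.random.default_rng(1)
mu = [0.3, 0.5, 0.7]
total_reward = thompson_sampling_bernoulli(lambda i: rng.binomial(1, mu[i]), N=3, T=10000)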
where the maximization is over all instances of the N-armed Bernoulli MAB problem. Agrawal and Goyal [6, 8] show that these versions of the Thompson sampling algorithm achieve a logarithmic problem-dependent regret bound. The regret bounds require only a bounded or sub-Gaussian reward assumption. The intuition is that, given the opportunity to collect enough samples
from the arms, starting from the exact prior is not important, as long as the starting prior has
enough variance (e.g., uniform distribution or standard Gaussian) and its support includes the
optimal parameter.
Theorem 5 (Agrawal and Goyal [6, 8]). Consider any N-armed stochastic MAB instance $\mathcal{I} = \{\nu_1, \ldots, \nu_N\}$ such that the reward distributions $\{\nu_i\}$ are sub-Gaussian or have bounded support in $[0, 1]$. Then, Algorithm 3 achieves the following regret bound for any such instance:
$$ \mathcal{R}(T, \mathcal{I}) \le \sum_{i \ne i^*} \Delta_i \left( \frac{18 \log(T \Delta_i^2)}{\Delta_i^2} + \frac{25}{\Delta_i^2} \right) + O(1). $$
3. Contextual Bandits
In many sequential decision-making applications, including online recommendation systems (Li
et al. [40]), online advertising (Tang et al. [54]), online retail (Cohen et al. [26]), and healthcare
(Bastani and Bayati [18], Durand et al. [32], Tewari and Murphy [55]), the decision in every
round needs to be customized to the time-varying features of the users being served and/or
seasonal factors. In the contextual bandit problem (Langford and Zhang [39]), also referred to as
“associative reinforcement learning” [52], these factors and features form the context or “side
information” that the algorithm can take into account before making the decision in every round.
3.1. Problem Definition. The precise definition of this problem is as follows. In every round $t$, first the context $x_{i,t}$ for every arm $i = 1, \ldots, N$ is observed, and then the algorithm needs to pick an arm $I_t \in A_t \subseteq \{1, \ldots, N\}$ to be pulled. The outcome of pulling an arm depends on the context $x_{I_t,t}$ of the arm pulled.
A special case of this problem is the linear contextual bandit problem (Abbasi-yadkori et al. [1], Auer [12], Chu et al. [25]), where the expected reward on pulling an arm is a linear function of the context. Specifically, an instance of the linear contextual bandit problem is defined by a d-dimensional parameter $\mu \in \mathbb{R}^d$, a priori unknown to the algorithm. The expected value of the observed reward $r_t$ on pulling an arm $i \in A_t$ with context vector $x_{i,t}$ is given by $E[r_t \mid I_t = i] = \mu^\top x_{i,t}$. The regret definition compares the performance of an algorithm to a clairvoyant policy that picks the arm with highest expected reward in every round:
$$ \mathcal{R}(T) := \sum_{t=1}^{T} \max_{i \in A_t} \mu^\top x_{i,t} - E\left[\sum_{t=1}^{T} r_t\right]. $$
More generally, the contextual bandit problem is defined via a linear or nonlinear, parametric or nonparametric, contextual response function $f(\cdot)$, so that the expected value of the observed reward $r_t$ on pulling an arm $i$ with context vector $x_{i,t}$ is given by $E[r_t \mid I_t = i] = f(x_{i,t})$. The function $f$ is unknown to the decision maker and may be learned using the observations $r_t$. For the
special case of the linear contextual bandit problem defined above, $f(x_{i,t}) = \mu^\top x_{i,t}$. A significant generalization to Lipschitz bandits was provided in Slivkins [51], where the only assumption on $f$ is that it satisfies a Lipschitz condition with respect to a metric.
3.1.1. Significance. The contextual bandit models represent a significant leap beyond the
basic MAB setting: they get rid of the basic premise of stationary distributions for every arm and
allow time-varying distributions. However, some important modeling assumptions implicit in
this setting allow extensions of the algorithmic techniques discussed earlier for MAB. These
assumptions include the ability to observe the contexts before making the decision, concise
parametric dependence of the distribution of rewards on the context and arm (e.g., through the
unknown parameter $\mu$), and the oblivious nature of the context process with respect to the decisions made. Such assumptions are satisfied in many interesting settings: for example, in a recommendation system, where a user's arrival does not depend strategically on past decisions, user features and external seasonal factors may be observed before a recommendation is served to the user.
Contextual bandits inherit from supervised learning the idea of using features to encode contexts and of modeling the relation between these feature vectors and rewards, in order to utilize similarity between arms and achieve scalable learning when the number of arms is large. For example, the linear contextual bandit formulation is useful as a scalable model even when contexts are not changing with time but, instead, there is a fixed context vector, also known as the feature vector, associated with every arm. In this formulation, also known as the static contextual bandit problem or the linear bandit problem, there is a fixed known feature vector $x_i$ associated with every arm, and the expected value of the observed reward $r_t$ on pulling an arm $i$ with context vector $x_i$ is given by $E[r_t \mid I_t = i] = \mu^\top x_i$. When the number of arms N is large, this formulation can provide significant advantages in terms of computational and learning efficiency: it reduces the problem of learning N distributions to learning a (potentially much smaller) d-dimensional parameter vector $\mu$. The linear bandit problem can thus be thought of as a combination of linear regression and reinforcement learning; more general formulations have been studied, for example, those based on kernel regression (Valko et al. [58]) or under access to an oracle for supervised learning (Agarwal et al. [2], Dudík et al. [31]).
3.1.2. An Illustrative Example. As an example, consider the following network route optimization problem. We are given a graph G with n nodes and d edges, representing a transportation network. There is an unknown expected delay $\mu_e$ associated with each edge $e$ of the d edges in the graph. In each of the sequential rounds $t = 1, \ldots, T$, the decision maker routes one request from a fixed start node to a fixed end node through this network and observes the total delay on the route. The aim is to learn from the observed delays in order to optimize the future routes. A naïve MAB formulation would model each possible path in the graph as an arm, giving an exponentially large number of arms and therefore a regret bound that is exponential in n. Instead, in a linear bandit formulation, for every path $p$, we can define a feature vector $x_p \in \{0,1\}^d$ as the incidence vector of the path ($x_{p,e} = 1$ if edge $e$ belongs to the path, and $x_{p,e} = 0$ otherwise). Then, the expected delay on a path $p$ is given by $x_p^\top \mu$, where $\mu \in \mathbb{R}^d$ is the unknown delay vector ($\mu_e$ denotes the unknown expected delay on using edge $e$). Using linear bandit algorithms, we can then obtain regret bounds (as discussed in the following) that scale with d instead of N. Furthermore, using the generalization to a linear contextual bandit, we can model the problem where, in every round $t$, a request for a different source-destination pair $(s_t, e_t)$ needs to be routed.
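To make the feature construction concrete, the following Python sketch builds the incidence vectors for two candidate paths in a toy four-edge graph (the graph, the paths, and the delay vector are all illustrative) and evaluates the linear delay model $x_p^\top \mu$:

import numpy as np

# Illustrative graph with d = 4 edges, indexed 0..3.
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
edge_index = {e: k for k, e in enumerate(edges)}
d = len(edges)

# Unknown (to the learner) expected delay mu_e of each edge.
mu = np.array([1.0, 2.0, 0.5, 3.0])

def incidence_vector(path_edges):
    """Feature vector x_p in {0,1}^d: 1 for edges on the path, 0 otherwise."""
    x = np.zeros(d)
    for e in path_edges:
        x[edge_index[e]] = 1.0
    return x

# Two candidate s->t paths, treated as the "arms" of a linear bandit.
paths = [[("s", "a"), ("a", "t")], [("s", "b"), ("b", "t")]]
features = np.array([incidence_vector(p) for p in paths])

# Expected delay of each path is linear in its incidence vector: x_p^T mu.
expected_delays = features @ mu   # -> [3.0, 3.5]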
The LinUCB algorithm applies the optimism-under-uncertainty principle to this setting; at every time step, it proceeds in the following three steps.
Step 1. Given the history up to time $\tau$, $(r_1, x_1), (r_2, x_2), \ldots, (r_\tau, x_\tau)$, compute a regularized least squares estimate of the unknown parameter by solving
$$ \hat{\mu}_\tau = \arg\min_{z \in \mathbb{R}^d} \left\{ \sum_{t=1}^{\tau} (r_t - x_t^\top z)^2 + \|z\|^2 \right\}, $$
whose solution is
$$ \hat{\mu}_\tau = M_\tau^{-1} y_\tau, \quad \text{where } M_\tau = I_{d \times d} + \sum_{t=1}^{\tau} x_t x_t^\top \ \text{ and } \ y_\tau = \sum_{t=1}^{\tau} r_t x_t. $$
Here, $x_t$ denotes the context vector of the arm pulled at time $t$.
As a sanity check, consider the N-armed bandit problem. It can be modeled as a linear bandit problem with $x_t = \mathbf{1}_{I_t}$ (the $I_t$-th canonical basis vector) for all $t$; then
$$ M_\tau = I + \sum_{t=1}^{\tau} x_t x_t^\top = \mathrm{diag}(n_{1,\tau} + 1, \ldots, n_{N,\tau} + 1) \quad \text{and} \quad y_{\tau,i} = \sum_{s \le \tau:\, I_s = i} r_s, $$
and therefore the $i$-th component of $\hat{\mu}_\tau = M_\tau^{-1} y_\tau$ is $y_{\tau,i}/(n_{i,\tau} + 1)$, essentially the empirical mean estimate $\hat{\mu}_{i,\tau}$ for arm $i$.
Step 2. Using exponential inequalities for self-normalized martingales, the following theorem has been proven in the literature; it provides a useful concentration bound for the computed estimate.
Theorem 6 (Abbasi-yadkori et al. [1], Rusmevichientong and Tsitsiklis [45]). Suppose that $\|x_t\|_2 \le \sqrt{Ld}$, $\|\mu\|_2 \le \sqrt{d}$, and $|r_t| \le 1$. Then, with probability at least $1 - \delta$, the vector $\mu$ lies in the set
$$ C_t = \left\{ z \in \mathbb{R}^d : \|z - \hat{\mu}_t\|_{M_t} \le \sqrt{d \log\left(\frac{TdL}{\delta} + 1\right)} + \sqrt{d} \right\}. \qquad (6) $$
Here, $\|\cdot\|_M$ denotes the matrix norm, $\|x\|_M = \sqrt{x^\top M x}$.
Observe that this bound recovers the UCB confidence interval up to a $\sqrt{d}$ factor in the special case of the N-armed bandit problem, using the linear bandit formulation discussed above.
Step 3. Next, we use the above theorem to define an upper confidence bound $\mathrm{UCB}(x)$ for any $x \in \mathbb{R}^d$ such that $\mathrm{UCB}(x) \ge x^\top \mu$ with high probability. Specifically, define
$$ \mathrm{UCB}(x) := \max_{z \in C_t} x^\top z. $$
Define $A_t$ as the set of contexts (corresponding to arms) available in round $t$. Then we pick the arm corresponding to the context
$$ \arg\max_{x \in A_t}\ \max_{z \in C_t} z^\top x. $$
One must note, however, that solving the above double maximization problem is difficult (NP-hard even when the sets $A_t$ are convex).
The LinUCB algorithm has been shown to achieve an $\tilde{O}(\sqrt{dT \log N})$ regret bound^3 (Abbasi-yadkori et al. [1]). In case the number of arms is very large, the modified version of the algorithm discussed above can also achieve a regret bound of $\tilde{O}(d\sqrt{T})$, independent of the number of arms. These bounds match the available lower bound for this problem within logarithmic factors in $T$ and $d$ (Bubeck and Cesa-Bianchi [22]). However, as discussed above, in the latter case, the algorithm may not be efficiently implementable. Dani et al. [27] showed a modification that yields an efficiently implementable algorithm with a regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$.
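For finitely many arms, the inner maximization in Step 3 has a closed form over the ellipsoidal confidence set: $\max_{z \in C_t} x^\top z = x^\top \hat{\mu}_t + \beta_t \|x\|_{M_t^{-1}}$, where $\beta_t$ is the confidence radius from (6). The following is a minimal Python sketch of LinUCB built on this form; the function names (get_contexts, pull) and the fixed width beta are illustrative simplifications rather than the exact choice in Theorem 6.

import numpy as np

def linucb(get_contexts, pull, d, T, beta=1.0):
    """Minimal LinUCB sketch for finitely many arms.

    get_contexts(t) returns an (N_t, d) array of contexts for round t;
    pull(t, i) returns the observed reward of arm i in round t;
    beta is the confidence-ellipsoid width (a tunable stand-in for (6)).
    """
    M = np.eye(d)            # M_t = I + sum_t x_t x_t^T
    y = np.zeros(d)          # y_t = sum_t r_t x_t
    for t in range(1, T + 1):
        X = get_contexts(t)
        mu_hat = np.linalg.solve(M, y)                      # ridge estimate M^{-1} y
        M_inv = np.linalg.inv(M)
        # UCB(x) = x^T mu_hat + beta * ||x||_{M^{-1}} for each available context x.
        widths = np.sqrt(np.einsum("ij,jk,ik->i", X, M_inv, X))
        ucb_values = X @ mu_hat + beta * widths
        i = int(np.argmax(ucb_values))
        r = pull(t, i)
        x = X[i]
        M += np.outer(x, x)
        y += r * x
    return np.linalg.solve(M, y)                            # final estimate of mu

# Example: 5 arms with fixed feature vectors (a "linear bandit"); data is illustrative.
rng = np.random.default_rng(3)
Xfix = rng.random((5, 4))
mu_true = np.array([0.5, -0.2, 0.3, 0.1])
mu_est = linucb(lambda t: Xfix,
                lambda t, i: float(Xfix[i] @ mu_true + 0.1 * rng.standard_normal()),
                d=4, T=2000)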
Then, if the prior on $\mu$ at time $t$ is given by $\mathcal{N}(\hat{\mu}_t, v^2 M_t^{-1})$, then using Bayes' rule it is easy to compute the posterior distribution at time $t+1$:
$$ \Pr(\tilde{\mu} \mid r_t) \propto \Pr(r_t \mid \tilde{\mu})\, \Pr(\tilde{\mu}), $$
which works out to $\mathcal{N}(\hat{\mu}_{t+1}, v^2 M_{t+1}^{-1})$ (details of this computation are in Agrawal and Goyal [5, 7]). The Thompson sampling algorithm (Agrawal and Goyal [5, 7]) generates a sample $\tilde{\mu}_t$ from the distribution $\mathcal{N}(\hat{\mu}_t, v^2 M_t^{-1})$ at every time step $t$, and it pulls the arm $i$ that maximizes $x_{i,t}^\top \tilde{\mu}_t$.
Algorithm 5 (Thompson Sampling for Linear Contextual Bandits)
foreach t = 1, 2, ... do
    Observe the contexts x_{i,t} for all arms i = 1, ..., N.
    Sample mu~_t from the distribution N(mu^_t, v^2 M_t^{-1}).
    Play arm I_t := argmax_i x_{i,t}^T mu~_t.
    Observe r_t, and update mu^_{t+1} and M_{t+1}.
end
whichever is smaller, for any $0 < \epsilon < 1$, where $\epsilon$ is a parameter used by the algorithm.
The above regret bounds are within a $\sqrt{d}\,\ln T$ factor of the lower bound for this problem (Bubeck and Cesa-Bianchi [22]).
The regret bound in (8) for the Thompson sampling algorithm has a slightly worse dependence on $d$ compared with the corresponding bounds for the LinUCB algorithm. However, the bounds match the best available bounds for any efficiently implementable algorithm for this problem (e.g., that given by Dani et al. [27]).
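A minimal Python sketch of Algorithm 5, analogous to the LinUCB sketch above (again with illustrative function names and with the scale parameter v left as a tunable constant), is:

import numpy as np

def linear_thompson_sampling(get_contexts, pull, d, T, v=1.0, seed=0):
    """Minimal sketch of Algorithm 5: Thompson sampling for linear contextual bandits.

    get_contexts(t) returns an (N_t, d) array of arm contexts for round t;
    pull(t, i) returns the observed reward of arm i in round t;
    v scales the posterior covariance v^2 * M_t^{-1}.
    """
    rng = np.random.default_rng(seed)
    M = np.eye(d)
    y = np.zeros(d)
    for t in range(1, T + 1):
        X = get_contexts(t)
        mu_hat = np.linalg.solve(M, y)
        cov = v ** 2 * np.linalg.inv(M)
        # Sample a parameter from the Gaussian posterior and act greedily on the sample.
        mu_tilde = rng.multivariate_normal(mu_hat, cov)
        i = int(np.argmax(X @ mu_tilde))
        r = pull(t, i)
        x = X[i]
        M += np.outer(x, x)
        y += r * x
    return np.linalg.solve(M, y)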
4. Combinatorial Bandits
In the combinatorial bandit problem, in every round $t$ the decision maker selects a subset $S_t \subseteq [N]$ of the $N$ base arms and observes a reward $r_t$ whose expected value is given by an a priori unknown function $f(S_t)$. The regret in $T$ rounds is
$$ \mathcal{R}(T) := T f(S^*) - E\Big[\sum_{t=1}^{T} r_t\Big] = E\Big[\sum_{t=1}^{T} \big(f(S^*) - f(S_t)\big)\Big], \qquad (10) $$
where $S^* = \arg\max_{S \subseteq [N]} f(S)$. However, it is easy to construct instances of the function $f(\cdot)$ such that
the lower bounds for the MAB problem would imply regret at least exponential in $N$. In many cases, even if the expected value $f(S)$ is known for all $S$, computing $S^*$ may be intractable. Therefore, for this problem to be tractable, some structural assumptions on $f(\cdot)$ must be utilized. Examples of such structural assumptions include the linear model $f(S) = \mu^\top \mathbf{1}_S$ or Lipschitz functions (metric bandits) discussed in the previous section. Another example is the assumption of submodularity of the function $f$, also known as the submodular bandit problem. The algorithm for online submodular minimization in Hazan and Kale [33] achieves regret bounded by $O(N T^{2/3} \sqrt{\log(1/\delta)})$ with probability $1 - \delta$ for the submodular bandit problem. Their results are, in fact, applicable to the adversarial bandit problem (i.e., when $r_t = f_t(S_t)$ for an arbitrary unknown sequence of submodular functions $f_1, \ldots, f_T$).
This tutorial focuses on the useful structural properties provided by some well-studied
consumer choice models in assortment optimization. Choice models capture substitution
effects among products by specifying the probability that a consumer selects a product from
the offered set. The multinomial logit (MNL) model is a natural and convenient way to specify
these distributions; it is one of the most widely used choice models for assortment selection
problems in retail settings. The model was introduced independently by Luce [41] and
Plackett [44]; see also Ben-Akiva and Lerman [19], McFadden [42], and Train [57] for further
discussion and surveys of other commonly used choice models.
where $q_i$ is the revenue obtained when product $i$ is purchased and is known a priori.
If the consumer preference model parameters (i.e., MNL parameters v) are known a priori,
then the problem of computing the optimal assortment, which we refer to as the static assortment optimization problem, is well studied. Talluri and van Ryzin [53] considered the
unconstrained assortment planning problem under the MNL model and presented a greedy
approach to obtain the optimal assortment. Recent works by Davis et al. [28] and Désir and
Goyal [30] consider assortment planning problems under MNL with various constraints.
In the absence of a priori knowledge of the MNL parameters, the bandit problem aims to
offer a sequence of assortments, S1 ; . . . ; ST , where T is the planning horizon, and learn the
model parameters in order to maximize cumulative expected revenue or minimize regret as
defined in (10).
If a no-purchase is observed in round $t$, then the algorithm updates the parameter estimates and makes a new assortment selection for round $t+1$ in the following way. Let $n_{i,t}$ be the number of time steps until time $t$ in which item $i$ was offered as part of an assortment, and let $m_{i,t}$ be the number of times the item was purchased (or picked by the customer). Then,
$$ \hat{v}_{i,t} = \frac{m_{i,t}}{n_{i,t}}. \qquad (13) $$
Using these estimates, upper confidence bounds $v^{\mathrm{UCB}}_{i,t}$ are computed as
$$ v^{\mathrm{UCB}}_{i,t} := \hat{v}_{i,t} + \sqrt{\frac{12\, \hat{v}_{i,t} \log T}{n_{i,t}}} + \frac{30 \log^2 T}{n_{i,t}}. \qquad (14) $$
In Agrawal et al. [10], it is proven that $v^{\mathrm{UCB}}_{i,t}$ is an upper confidence bound on the true parameter $v_i$ (i.e., $v^{\mathrm{UCB}}_{i,t} \ge v_i$ for all $i$, with high probability).
The algorithm then selects assortment $S_{t+1}$ as the assortment with the highest revenue estimate; that is,
$$ S_{t+1} := \arg\max_{S \subseteq [N],\, |S| \le K} f(S, v^{\mathrm{UCB}}_t). \qquad (16) $$
In Chen and Wang [24], a lower bound of $\Omega(\sqrt{NT})$ on the regret of any online algorithm for the MNL-bandit problem is provided under the same assumptions as in Theorem 8. Thus, Algorithm 6 is near optimal.
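As a concrete (if brute-force) illustration of the selection step (16), the Python sketch below uses the standard MNL expected-revenue expression $f(S, v) = \sum_{i \in S} q_i v_i / (1 + \sum_{j \in S} v_j)$ (see Talluri and van Ryzin [53]) and enumerates all assortments of size at most K; this is practical only for small N, and the actual algorithm relies on more efficient optimization for the MNL structure. The data at the end are illustrative.

from itertools import combinations
import numpy as np

def mnl_revenue(S, q, v):
    """Expected revenue of assortment S under the standard MNL choice model:
    sum_{i in S} q_i v_i / (1 + sum_{j in S} v_j)."""
    S = list(S)
    denom = 1.0 + sum(v[j] for j in S)
    return sum(q[i] * v[i] for i in S) / denom

def select_assortment(q, v_ucb, K):
    """Step (16): brute-force search over assortments of size <= K."""
    N = len(q)
    best, best_rev = (), 0.0
    for k in range(1, K + 1):
        for S in combinations(range(N), k):
            rev = mnl_revenue(S, q, v_ucb)
            if rev > best_rev:
                best, best_rev = S, rev
    return best, best_rev

# Illustrative data: per-product revenues q_i and optimistic estimates v_i^UCB.
q = np.array([1.0, 0.8, 0.6, 0.4])
v_ucb = np.array([0.2, 0.5, 0.9, 1.2])
print(select_assortment(q, v_ucb, K=2))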
5. Bandits with Long-Term Constraints and Nonadditive Rewards
5.1. Bandits with Knapsacks. In the bandits with knapsacks (BwK) problem (Badanidiyuru et al. [16]), pulling an arm generates a vector of resource consumptions in addition to a reward, and the total consumption of each resource over the horizon must stay within a given budget. This formulation allows box constraints on the aggregate (sum of) costs incurred.
As an example, the blind network revenue management problem discussed in Besbes and Zeevi [21] can be formulated as a special case of this problem. In that problem, a firm is selling multiple products. At every time step $t$, the firm chooses a price vector $q_t$ from a set of $N$ possible price vectors for its products and observes a realization of the demand vector $D_t$, generating revenue $r_t = q_t^\top D_t$ and consumption $c_t = A D_t$ of $d - 1$ resources. The distribution of demand is unknown; a typical assumption is that the demand at any time $t$ is generated independently and identically across time from a fixed but unknown distribution.
5.2. Bandits with Global Convex Constraints and Objective (Agrawal and
Devanur [3])
In this problem, we are given a convex set $S \subseteq [0,1]^d$ and a concave objective function $f: [0,1]^d \to [0,1]$. On playing an arm $I_t \in \{1, \ldots, N\}$ at time $t$, the algorithm observes a vector $v_t \in [0,1]^d$, generated independently of the previous observations from a fixed but unknown distribution such that $E[v_t \mid I_t] = V_{I_t}$, the $I_t$-th column vector of a matrix $V$. The matrix $V \in [0,1]^{d \times m}$ is fixed but unknown to the algorithm. The goal is to make the average of the observed vectors $\frac{1}{T}\sum_t v_t$ lie in the set $S$ and, at the same time, to maximize $f\big(\frac{1}{T}\sum_t v_t\big)$:
$$ \text{maximize } f\Big(\frac{1}{T}\sum_{t=1}^{T} v_t\Big) \quad \text{while ensuring} \quad \frac{1}{T}\sum_{t=1}^{T} v_t \in S. \qquad (18) $$
A simple special case of this formulation occurs when the observation vector $v_t$ is composed of a reward and several costs (i.e., $v_t = (r_t, c_t)$); $f(\cdot)$ is a function of the first component of this vector (the reward), i.e., $f\big(\frac{1}{T}\sum_t v_t\big) = f\big(\frac{1}{T}\sum_t r_t\big)$; and the constraints are only on the last $d - 1$ components of this vector (the costs), i.e., $\frac{1}{T}\sum_{t=1}^{T} c_t \in S$. Through this special case, the bandits with global convex constraints and objective (BwCR) model generalizes BwK to allow maximizing arbitrary concave utility functions of the total reward under arbitrary convex constraints on the total cost.
But more generally, the observation vector $v_t$ can be used to model other aspects of user feedback beyond rewards and costs in order to incorporate unconventional considerations into the objective/constraints. For example, suppose that the arms belong to two subgroups $(A, A')$, and the decision maker is interested in fairness across the two groups in addition to revenue. This consideration can be captured in the BwCR model as follows. Extend the vector $v_t$ to include one more component indicating membership to a subgroup: let $v_t := (r_t, c_t, a_t)$, where $a_t$ is set to 1 if the pulled arm belongs to group $A$ and 0 otherwise. Now, extend the objective to $f\big(\frac{1}{T}\sum_t r_t\big) - \big\|\frac{1}{2} - \frac{1}{T}\sum_t a_t\big\|^2$. This objective is concave in $\frac{1}{T}\sum_t v_t$ and encourages equal allocation among the two groups.
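A small Python sketch of this construction (the concave utility f, the penalty form, and the sample data below are illustrative choices) is:

import numpy as np

def fairness_objective(avg_v, f=np.sqrt):
    """Concave objective on the average observation vector avg_v = (r_bar, c_bar..., a_bar):
    f(average reward) minus a penalty for unequal allocation between the two groups.
    (f = sqrt is just an illustrative concave utility.)"""
    r_bar, a_bar = avg_v[0], avg_v[-1]
    return f(r_bar) - (0.5 - a_bar) ** 2

# Example: rewards, one cost component, and group-membership indicators from T pulls.
T = 1000
rng = np.random.default_rng(2)
v = np.column_stack([rng.random(T),             # r_t
                     rng.random(T),             # c_t (one cost component)
                     rng.integers(0, 2, T)])    # a_t: 1 if the pulled arm is in group A
print(fairness_objective(v.mean(axis=0)))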
Here, $\Delta_N$ denotes the $N$-dimensional simplex, and $d(x, S)$ is a distance function defined as
$$ d(x, S) := \min_{y \in S} \|x - y\|, \qquad (21) $$
where the maximum is taken over all online algorithms. Then, the following is proven in Agrawal and Devanur [3] using the concavity of $f$ and the convexity of $S$.
Lemma 1 (Agrawal and Devanur [3]). Assuming (22) is feasible, there exists a distribution $p^*$ over the $N$ arms such that $V p^* \in S$ and $f(V p^*) \ge \mathrm{OPT}$.
for $i = 1, \ldots, m$, $j = 1, \ldots, d$, and $t = 1, \ldots, T$.
The design and analysis of the UCB algorithm for BwCR then closely follows the design and analysis of the UCB algorithm for the N-armed bandit problem. In particular, parallel to the two observations made in Section 2.3.1, the following observations hold for $\mathrm{UCB}_t$ and $\mathrm{LCB}_t$ (Agrawal and Devanur [3]):
(1) The mean for every arm $i$ and component $j$ is guaranteed to lie in the range defined by its estimates $\mathrm{LCB}_{t,ji}(V)$ and $\mathrm{UCB}_{t,ji}(V)$, with high probability. That is, with probability $1 - mTd\, e^{-\gamma}$,
$$ V \in \mathcal{C}_t, \quad \text{where} \qquad (25) $$
$$ \mathcal{C}_t := \big\{\tilde{V} : \tilde{V}_{ji} \in [\mathrm{LCB}_{t,ji}(V), \mathrm{UCB}_{t,ji}(V)],\ j = 1, \ldots, d,\ i = 1, \ldots, m\big\}. \qquad (26) $$
(2) Let the probability of playing arm $i$ at time $t$ be $p_{t,i}$. Then, with probability $1 - mTd\, e^{-\gamma}$, the total difference between the estimated and the actual observations for the played arms can be bounded as
$$ \Big\| \sum_{t=1}^{T} \big(\tilde{V}_t\, p_t - v_t\big) \Big\| \le O\big(\|\mathbf{1}_d\| \sqrt{\gamma\, m T}\big) \qquad (27) $$
for any $\{\tilde{V}_t\}_{t=1}^{T}$ such that $\tilde{V}_t \in \mathcal{C}_t$ for all $t$.
The first property shows that $[\mathrm{LCB}_{t,i}, \mathrm{UCB}_{t,i}]$ forms a high-confidence interval for the unknown parameter $V_i$ at time $t$. And the second property shows that this interval becomes more refined as more observations are made, so that the total estimation error over the played arms is small. Then, using the optimism under uncertainty principle, at time $t$, the UCB algorithm for BwCR plays the best arm (or the best distribution over arms) according to the best estimates in the set $\mathcal{C}_t$. The algorithm is summarized as Algorithm 7.
Algorithm 7 (UCB Algorithm for BwCR)
foreach t = 1, 2, ..., T do
    Compute the distribution
    $$ p_t = \arg\max_{p \in \Delta_m}\ \max_{\tilde{U} \in \mathcal{C}_t} f(\tilde{U} p) \quad \text{s.t.} \quad \min_{\tilde{V} \in \mathcal{C}_t} d(\tilde{V} p, S) = 0, \qquad (28) $$
    and play an arm drawn from p_t.
end
Here, $\gamma = O(\log(mTd/\delta))$, and $\mathbf{1}_d$ is the $d$-dimensional vector of all ones.
Observe that Algorithm 7 requires computing the best distribution for the most optimistic estimates in the given confidence interval. Specifically, (28) requires maximizing the maximum of some concave functions while ensuring that the minimum of some convex (distance) functions is less than or equal to 0. Agrawal and Devanur [3] demonstrated that this optimization problem is, in fact, a convex optimization problem and is solvable in time polynomial in N and d. However, solving the optimization problem can still be slow. Agrawal and Devanur [3] also presented two efficient algorithmic approaches, based on primal-dual techniques and on the Frank–Wolfe algorithm for convex optimization. The primal-dual algorithm maintains (Fenchel) dual variables representing the incremental cost of the constraints or the incremental value of the objective. A procedure for efficiently updating these dual variables is provided. Then, the problem of choosing an arm in a round involves simply comparing different arms based on a combination of these dual variables for each arm. For more details, refer to Agrawal and Devanur [3].
6. Summary
This tutorial discussed some recent advances in bandit models and their applications to
a variety of operations management problems. Beyond the models presented here, many
additional innovative settings have been studied in the recent literature that expand the
representability and applicability of the multiarmed bandit model. These include bandits with
delayed feedback (Joulani et al. [34]), sleeping bandits (Kanade et al. [35]), bandits with
switching costs (Dekel et al. [29]), and so forth. More recently, there has also been progress on
using bandit techniques for settings that involve more complex dependence on states—for
example, the inventory control problem (Zhang et al. [59]) and finite-state, finite-action MDPs
(e.g., Agrawal and Jia [9], Russo et al. [49]).
Endnotes
1. This is derived from the Azuma–Hoeffding inequality, which states that given samples $x_1, \ldots, x_n \in [0,1]$ with $E[x_i \mid x_1, \ldots, x_{i-1}] = \mu$,
$$ \Pr\left( \left| \frac{\sum_{i=1}^{n} x_i}{n} - \mu \right| \ge \epsilon \right) \le 2 e^{-n\epsilon^2 / 2}. $$
2. A Beta distribution has support $(0, 1)$ and two parameters $(\alpha, \beta)$, with probability density function
$$ f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}. $$
Here, $\Gamma(x)$ is the Gamma function. For integers $x \ge 1$, $\Gamma(x) = (x - 1)!$.
3. The $\tilde{O}(\cdot)$ notation hides logarithmic factors in $T$ and $d$, in addition to absolute constants.
References
[1] Y. Abbasi-yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits.
J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds. Advances in
Neural Information Processing Systems Vol. 24. Curran Associates, Red Hook, NY, 2312–2320,
2011.
[2] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and
simple algorithm for contextual bandits. E. P. Xing and T. Jebara, eds. Proceedings of the 31st
International Conference on Machine Learning. PMLR, 1638–1646, 2014.
[3] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. Proceedings
of the 15th ACM Conference on Economics and Computation. ACM, New York, 989–1006, 2014.
[4] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem.
S. Mannor, N. Srebro, and R. C. Williamson, eds. Proceedings of the 25th Annual Conference on
Learning Theory. PMLR, 39.1–39.26, 2012.
[5] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. Working
paper, Microsoft Research India, Bangalore. https://arxiv.org/abs/1209.3352, 2012.
[6] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. C. M. Carvalho
and P. Ravikumar, eds. Proceedings of the 16th International Conference on Artificial Intelligence
and Statistics. PMLR, 99–107, 2013.
[7] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. S. Dasgupta
and D. McAllester, eds. Proceedings of the 30th International Conference on Machine Learning.
JMLR, 1220–1228, 2013.
[8] S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. Journal of ACM
64(5):1–30, 2017.
[9] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret
bounds. I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett, eds. Advances in Neural Information Processing Systems, Vol. 30. Curran Associates,
Red Hook, NY, 1184–1194, 2017.
[10] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. A near-optimal exploration-exploitation
approach for assortment selection. Proceedings of the 2016 ACM Conference on Economics and
Computation. ACM, New York, 599–600, 2016.
[11] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for the MNL-bandit.
S. Kale and O. Shamir, eds. Proceedings of the 30th Annual Conference on Learning Theory.
PMLR, 76–78, 2017.
[12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine
Learning Research 3(3):397–422, 2002.
[13] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning 47(2–3):235–256, 2002.
[14] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit
problem. SIAM Journal on Computing 32(1):48–77, 2002.
[15] M. Babaioff, S. Dughmi, R. Kleinberg, and A. Slivkins. Dynamic pricing with limited supply. Pro-
ceedings of the 13th ACM Conference on Electronic Commerce. ACM, New York, 74–91, 2012.
[16] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. Proceedings of the 2013
IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE Computer Society,
Washington, DC, 207–216, 2013.
[17] S. R. Balseiro, J. Feldman, V. Mirrokni, and S. Muthukrishnan. Yield optimization of display
advertising with ad exchange. Management Science 60(12):2886–2907, 2014.
[18] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. Working
paper, University of Pennsylvania, Philadelphia, 2015.
[19] M. Ben-Akiva and S. Lerman. Discrete Choice Analysis: Theory and Application to Travel Demand,
Vol. 9. MIT Press, Cambridge, MA, 1985.
[20] O. Besbes and A. Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and
near-optimal algorithms. Operations Research 57(6):1407–1420, 2009.
[21] O. Besbes and A. Zeevi. Blind network revenue management. Operations Research 60(6):1537–1550,
2012.
[22] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends in Machine Learning 5(1):1–122, 2012.
[23] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. D. Koller,
D. Schuurmans, Y. Bengio, and L. Bottou, eds. Advances in Neural Information Processing
Systems, Vol. 21. Curran Associates, Red Hook, NY, 273–280, 2008.
[24] X. Chen and Y. Wang. A note on tight lower bound for MNL-bandit assortment selection models.
Operations Research Letters 46(5):534–537, 2018.
[25] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions.
G. J. Gordon, D. B. Dunson, and M. Dudík, eds. Proceedings of the 14th International Conference
on Artificial Intelligence and Statistics. PMLR, 2011.
[26] M. C. Cohen, I. Lobel, and R. Paes Leme. Feature-based dynamic pricing. Proceedings of the 2016
ACM Conference on Economics and Computation. ACM, New York, 817–817, 2016.
[27] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback.
Proceedings of the 21st Conference on Learning Theory. 355–366, 2008.
[28] J. Davis, G. Gallego, and H. Topaloglu. Assortment planning under the multinomial logit model
with totally unimodular constraint structures. Technical report, Cornell University, Ithaca, NY,
2013.
[29] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: T2/3 regret. Proceedings of
the 46th Annual ACM Symposium on Theory of Computing. ACM, New York, 459–467, 2014.
[30] A. Désir and V. Goyal. Near-optimal algorithms for capacity constrained assortment optimization.
Working paper, Columbia University, New York, 2014.
[31] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient
optimal learning for contextual bandits. F. Cozman and A. Pfeffer, eds. Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA, 169–178, 2011.
[32] A. Durand, C. Achilleos, D. Iacovides, K. Strati, G. D. Mitsis, and J. Pineau. Contextual bandits for
adapting treatment in a mouse model of de novo carcinogenesis. Proceedings of the 3rd Machine
Learning for Healthcare Conference, Vol. 85. PMLR, 67–82, 2018.
[33] E. Hazan and S. Kale. Online submodular minimization. Journal of Machine Learning Research
13(1):2903–2922, 2012.
[34] P. Joulani, A. György, and C. Szepesvári. Online learning under delayed feedback. S. Dasgupta and
D. McAllester, eds. Proceedings of the 30th International Conference on Machine Learning. PMLR,
1453–1461, 2013.
[35] V. Kanade, H. B. McMahan, and B. Bryan. Sleeping experts and bandits with stochastic action
availability and adversarial rewards. D. van Dyk and M. Welling, eds. Proceedings of the 12th
International Conference on Artificial Intelligence and Statistics. PMLR, 272–279, 2009.
[36] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-
time analysis. N. H. Bshouty, G. Stoltz, N. Vayatis, and T. Zeugmann, eds. Algorithmic Learning
Theory—23rd International Conference. Springer, Berlin, 199–213, 2012.
[37] N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family
bandits. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.
Advances in Neural Information Processing Systems, Vol. 26. Curran Associates, Red Hook, NY,
1448–1456, 2013.
[38] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied
Mathematics 6(1):4–22, 1985.
[39] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. J. C. Platt,
D. Koller, Y. Singer, and S. T. Roweis, eds. Advances in Neural Information Processing Systems,
Vol. 20. Curran Associates, Red Hook, NY, 817–824, 2007.
[40] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news
article recommendation. Proceedings of the 19th International Conference on World Wide Web.
ACM, New York, 661–670, 2010.
[41] R. Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, New York, 1959.
[42] D. McFadden. Modeling the choice of residential location. Transportation Research Record (673):
72–77, 1978.
[43] S. Pandey and C. Olston. Handling advertisements of unknown quality in search advertising.
B. Schölkopf, J. C. Platt, and T. Hoffman, eds. Advances in Neural Information Processing
Systems, Vol. 19. MIT Press, Cambridge, MA, 1065–1072, 2006.
[44] R. L. Plackett. The analysis of permutations. Applied Statistics 24(2):193–202, 1975.
[45] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations
Research 35(2):395–411, 2010.
[46] P. Rusmevichientong, Z. M. Shen, and D. B. Shmoys. Dynamic assortment optimization with a
multinomial logit choice model and capacity constraint. Operations Research 58(6):1666–1680,
2010.
[47] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations
Research 39(4):1221–1243, 2014.
[48] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of
Machine Learning Research 17(1):2442–2471, 2016.
[49] D. Russo, I. Osband, and B. Van Roy. (More) efficient reinforcement learning via posterior
sampling. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.
Advances in Neural Information Processing Systems, Vol. 26. Curran Associates, Red Hook, NY,
2013.
[50] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling.
Foundations and Trends in Machine Learning 11(1):1–96, 2018.
[51] A. Slivkins. Multi-armed bandits on implicit metric spaces. J. Shawe-Taylor, R. S. Zemel,
P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds. Advances in Neural Information Processing
Systems, Vol. 24. Curran Associates, Red Hook, NY, 1602–1610, 2011.
[52] A. L. Strehl. Associative reinforcement learning. C. Sammut and G. I. Webb, eds. Encyclopedia of
Machine Learning. Springer US, Boston, MA, 49–51, 2019.
[53] K. Talluri and G. van Ryzin. Revenue management under a general discrete choice model of
consumer behavior. Management Science 50(1):15–33, 2004.
[54] L. Tang, R. Rosales, A. Singh, and D. Agarwal. Automatic ad format selection via contextual
bandits. Proceedings of the 22nd ACM International Conference on Information and Knowledge
Management. ACM, New York, 1587–1594, 2013.
[55] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health.
J. M. Rehg, S. A. Murphy, and S. Kumar, eds. Mobile Health: Sensors, Analytic Methods and
Applications. Springer, Cham, Switzerland, 495–517, 2017.
[56] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the
evidence of two samples. Biometrika 25(3–4):285–294, 1933.
[57] K. E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, Cambridge,
UK, 2009.
[58] M. Valko, N. Korda, R. Munos, I. N. Flaounas, and N. Cristianini. Finite-time analysis of kernelised
contextual bandits. A. Nicholson and P. Smyth, eds. Proceedings of the 29th Conference on
Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, OR, 2013.
[59] H. Zhang, X. Chao, and C. Shi. Closing the gap: A learning algorithm for the lost-sales inventory
system with lead times. Working paper, Pennsylvania State University, State College. https://
ssrn.com/abstract=2922820, 2018.