
CHANNEL SELECTION WITH RAYLEIGH FADING: A MULTI-ARMED BANDIT FRAMEWORK
Wassim Jouini and Christophe Moy
SUPELEC, IETR, SCEE, Avenue de la Boulaie, CS 47601, 35576 Cesson-Sévigné, France.
INSERM U936 - IFR140 - Faculté de Médecine, Université de Rennes 1 - 35033 Rennes, France.
ABSTRACT
Channel selection in fading environments with no prior information on the channels' quality is a challenging issue. In the case of Rayleigh channels, the measured Signal-to-Noise Ratio follows an exponential distribution. We therefore suggest in this paper a simple algorithm that deals with resource selection when the measured samples are drawn from exponential distributions. This strategy, referred to as the Multiplicative Upper Confidence Bound algorithm (MUCB), associates a utility index to every available arm and then selects the arm with the highest index. For every arm, the associated index is equal to the product of a multiplicative factor and the sample mean of the rewards collected by this arm. We show that MUCB policies are order optimal. Moreover, simulations illustrate and validate the stated theoretical results.
1. INTRODUCTION
Several sequential decision making problems face a dilemma between the exploration of a space of choices, or solutions, and the exploitation of the information available to the decision maker. The problem described herein is known as sequential decision making under uncertainty. In this paper we focus on a sub-class of this problem, where the decision maker has a discrete set of stateless choices and the added information is a real-valued sequence (of feedbacks, or rewards) that quantifies how well the decision maker behaved in the previous time steps. This particular instance of sequential decision making problems is generally known as the multi-armed bandit (MAB) problem [1, 2].
A common approach to solving the exploration versus exploitation dilemma within MAB problems consists in assigning a utility value to every arm. An arm's utility aggregates all the past information about the lever and quantifies the gambler's interest in pulling it. Such utilities are called indexes. Agrawal [2] emphasized the family of indexes minimizing the expected cumulated loss and called them Upper Confidence Bound (UCB) indexes. UCB indexes provide an optimistic estimation of the arms' performances while ensuring a rapidly decreasing probability of selecting a suboptimal arm. The decision maker builds its policy by greedily selecting the largest index. Recently, Auer et al. [3] proved that a simple additive form of the rewards' sample mean and a bias, known as UCB1, can achieve order optimality over time when dealing with rewards drawn from bounded distributions. Tackling exponentially distributed rewards, as usually occurs when measuring Signal-to-Noise Ratios (SNR) in fading environments, remains however a challenge, as optimal learning algorithms for this matter prove to be complex to implement [1, 2].

The authors would like to thank Damien Ernst, Raphael Fonteneau and Emmanuel Rachelson for their many helpful comments and answers regarding this work.
This paper is inspired by the aforementioned work and is motivated by the problem of channel selection when the channels are subject to Rayleigh fading. However, we suggest the analysis of a multiplicative, rather than an additive, expression for the index.

The main contribution of this paper is to design and analyze a simple, deterministic, multiplicative index-based policy. The decision making strategy computes an index associated to every available arm, and then selects the arm with the highest index. Every index associated to an arm is equal to the product of the sample mean of the rewards collected by this arm and a scaling factor. The scaling factor is chosen so as to provide an optimistic estimation of the considered arm's performance.

We show that our decision policy has a low computational complexity and can lead to a logarithmic loss over time under some non-restrictive conditions. For the rest of this paper we refer to our suggested policy as the Multiplicative Upper Confidence Bound index (MUCB).

The outline of this paper is the following. We start by presenting some general notions on the multi-armed bandit framework with exponentially distributed rewards in Section 2. Then, Section 3 introduces our index policy and Section 4 analyzes its behavior, proving the order optimality of the suggested algorithm. Section 5 presents simulation results, and Section 6 concludes.
2. MULTI-ARMED BANDITS
A K-armed bandit ($K \in \mathbb{N}$) is a machine learning problem based on an analogy with the traditional slot machine (one-armed bandit), but with more than one lever. Such a problem is defined by the K-tuple $(\theta_1, \theta_2, \ldots, \theta_K) \in \Theta^K$, $\Theta$ being the set of all positive reward distributions. When pulled at a time $t \in \mathbb{N}$, each lever¹ $k \in \{1, \ldots, K\}$ provides a reward $r_t$ drawn from the distribution $\theta_k$ associated to that specific lever. The objective of the gambler is to maximize the cumulated sum of rewards through iterative pulls. It is generally assumed that the gambler has no (or partial) initial knowledge about the levers. The crucial tradeoff the gambler faces at each trial is between exploitation of the lever that has the highest expected payoff and exploration to get more information about the expected payoffs of the other levers. In this paper, we assume that the different exponentially distributed payoffs drawn from a machine are independent and identically distributed (i.i.d.) and that the independence of the rewards holds between the machines. However, the different machines' reward distributions $(\theta_1, \theta_2, \ldots, \theta_K)$ are not supposed to be the same.

¹ We use indifferently the words lever, arm, or machine.
Let $I_t \in \{1, \ldots, K\}$ denote the machine selected at a time $t$, and let $H_t$ be the history vector available to the gambler at instant $t$, i.e., $H_t = [I_0, r_0, I_1, r_1, \ldots, I_{t-1}, r_{t-1}]$. We assume that the gambler uses a policy $\pi$ to select arm $I_t$ at instant $t$, such that $I_t = \pi(H_t)$. We shall also write, for all $k \in \{1, \ldots, K\}$, $\mu_k \triangleq \frac{1}{\lambda_k} \triangleq \mathbb{E}[\theta_k]$, where $\lambda_k$ refers to the parameter of the considered exponential distribution with pdf $f_{\lambda_k}(x) = \lambda_k e^{-\lambda_k x}$, $x \geq 0$, and we assume that $\lambda_k > 0$ for all $k \in \{1, \ldots, K\}$. The (cumulated) regret of a policy at time $t$ (after $t$ pulls) is defined as $R_t = t\mu^* - \sum_{m=0}^{t-1} r_m$, where $\mu^* = \max_{k \in \{1, \ldots, K\}} \{\mu_k\}$ refers to the expected reward of the optimal arm.

We seek to find a policy that minimizes the expected cumulated regret (Equation 1),

$$\mathbb{E}[R_t] = \sum_{k \neq k^*} \Delta_k \, \mathbb{E}[T_{k,t}] , \quad (1)$$

where $\Delta_k = \mu^* - \mu_k$ is the expected loss of playing arm $k$, and $T_{k,t}$ refers to the number of times machine $k$ has been played from instant $0$ to instant $t-1$.
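
For concreteness, the following Python sketch (our own illustration; the class and function names are not from the paper) simulates such a K-armed bandit with exponentially distributed rewards and accumulates the regret of a generic policy object assumed to expose select(t) and update(k, r):

import numpy as np

class ExponentialBandit:
    """K-armed bandit whose arm k yields i.i.d. Exp(lambda_k) rewards
    with mean mu_k = 1/lambda_k (the SNR model under Rayleigh fading)."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)   # mu_1, ..., mu_K
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, k):
        # numpy parameterises the exponential by its scale, i.e. its mean
        return self.rng.exponential(self.means[k])

def run(policy, bandit, horizon):
    """Play `policy` for `horizon` rounds and return the cumulated regret R_t."""
    mu_star = bandit.means.max()
    total_reward = 0.0
    for t in range(horizon):
        k = policy.select(t)      # I_t = pi(H_t)
        r = bandit.pull(k)        # r_t drawn from theta_{I_t}
        policy.update(k, r)
        total_reward += r
    return horizon * mu_star - total_reward   # R_t = t*mu_star - sum of collected rewards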
3. MULTIPLICATIVE UPPER CONFIDENCE
BOUND ALGORITHMS
This section presents our main contribution, the introduction of a new multiplicative index. Let $B_{k,t}(T_{k,t})$ denote the index of arm $k$ at time $t$ after being pulled $T_{k,t}$ times. We refer to as Multiplicative Upper Confidence Bound algorithms (MUCB) the family of indexes that can be written in the form:

$$B_{k,t}(T_{k,t}) = \bar{X}_{k,t}(T_{k,t}) \, M_{k,t}(T_{k,t}) ,$$

where $\bar{X}_{k,t}(T_{k,t})$ is the sample mean of machine $k$ at step $t$ after $T_{k,t}$ pulls, i.e., $\bar{X}_{k,t}(T_{k,t}) = \frac{1}{T_{k,t}} \sum_{i=0}^{t-1} \mathbb{1}_{\{I_i = k\}} r_i$, and $M_{k,t}(\cdot)$ is an upper confidence scaling factor chosen to ensure that the index $B_{k,t}(T_{k,t})$ is an increasing function of the number of rounds $t$. This last property ensures that the index of an arm that has not been pulled for a long time will increase, thus eventually leading to the sampling of this arm. We introduce a particular parametric class of MUCB indexes, which we call MUCB($\alpha$), given as follows²:

$$\forall \alpha \geq 0, \quad M_{k,t}(T_{k,t}) = \frac{1}{\max\left\{0 \,;\; 1 - \sqrt{\frac{\alpha \ln(t)}{T_{k,t}}}\right\}} \quad (2)$$

We adopt the convention that $\frac{1}{0} = +\infty$. Given a history $H_t$, one can compute the values of $T_{k,t}$ and $M_{k,t}$ and derive an index-based policy as follows:

$$I_t = \pi(H_t) \in \arg\max_{k \in \{1, \ldots, K\}} \{B_{k,t}(T_{k,t})\} . \quad (3)$$

² This form offers a compact mathematical formula. Practically speaking, however, a machine $k$ is played whenever $T_{k,t} \leq \alpha \ln(t)$ (its index is then infinite); otherwise the machine with the largest finite index is played.
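
As an illustration, a minimal Python sketch of the MUCB($\alpha$) policy of Equations (2) and (3), compatible with the simulation loop sketched in Section 2, could look as follows (our own code; the interface is illustrative):

import math
import numpy as np

class MUCB:
    """Multiplicative UCB index policy, Equations (2)-(3).
    An arm with T_{k,t} <= alpha*ln(t) has an infinite index (1/0 convention),
    so it is played before any arm whose index is finite."""

    def __init__(self, n_arms, alpha=4.01):
        self.alpha = alpha
        self.counts = np.zeros(n_arms)   # T_{k,t}: number of pulls of each arm
        self.sums = np.zeros(n_arms)     # cumulated reward of each arm

    def _index(self, k, t):
        if self.counts[k] == 0:
            return math.inf              # unplayed arms are sampled first
        mean = self.sums[k] / self.counts[k]                                   # X_{k,t}
        margin = 1.0 - math.sqrt(self.alpha * math.log(max(t, 1)) / self.counts[k])
        return math.inf if margin <= 0 else mean / margin                      # X * M, Eq. (2)

    def select(self, t):
        # Equation (3): greedy selection of the largest index
        # (ties between infinite indexes are broken by the lowest arm number)
        return int(np.argmax([self._index(k, t) for k in range(len(self.counts))]))

    def update(self, k, reward):
        self.counts[k] += 1
        self.sums[k] += reward

Arms whose index is infinite (the max in Equation (2) equals zero) are therefore served first, which matches the practical rule given in footnote 2.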
4. ANALYSIS OF MUCB($\alpha$) POLICIES
This section analyses the theoretical properties of MUCB($\alpha$) algorithms. More specifically, it focuses on determining how fast the optimal arm is identified and what the probabilities of anomalies, that is of sub-optimal pulls, are.
4.1. Consistency and order optimality of MUCB indexes
Definition 1 ($\beta$-consistency) Consider the set $\Theta^K$ of K-armed bandit problems. A policy $\pi$ is said to be $\beta$-consistent, $0 < \beta \leq 1$, with respect to $\Theta^K$, if and only if:

$$\forall (\theta_1, \ldots, \theta_K) \in \Theta^K, \quad \lim_{t \to \infty} \frac{\mathbb{E}[R_t]}{t^{\beta}} = 0 \quad (4)$$

We expect good policies to be at least 1-consistent. As a matter of fact, 1-consistency ensures that, asymptotically, the average expected reward is optimal.
From the expression of Equation 1, one can remark that it is sufficient to upper bound the expected number of times $\mathbb{E}[T_{k,t}]$ a suboptimal machine $k$ is played after $t$ rounds in order to obtain an upper bound on the expected cumulated regret. This leads to the main result of this paper, in the form of the following theorem.
Theorem 1 (Order optimality of MUCB($\alpha$) policies) Let $\bar{\mu}_k = \mu_k / \mu^*$ for all $k \in \{1, \ldots, K\} \setminus \{k^*\}$. For all $K \geq 2$, if policy MUCB($\alpha$) with $\alpha > 4$ is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then:

$$\mathbb{E}[R_t] \leq 4\alpha\mu^* \sum_{k:\,\Delta_k > 0} \frac{1}{1 - \bar{\mu}_k} \ln(t) + o(\ln(t)) \quad (5)$$
Proving Theorem 1 relies on three lemmas that we analyze and prove in the next subsection.
4.2. Learning Anomalies and Consistency of MUCB policies
Let us introduce the set $\mathcal{S} = \mathbb{N} \times \mathbb{R}$; then one can write $S^{\pi}_{k,t} = (T_{k,t}, B_{k,t}) \in \mathcal{S}$ for the decision state of arm $k$ at time $t$. We associate the product order to the set $\mathcal{S}$: for a pair of states $S = (T, B) \in \mathcal{S}$ and $S' = (T', B') \in \mathcal{S}$, we write $S \geq S'$ if and only if $T \geq T'$ and $B \geq B'$.
Definition 2 (Anomaly of type 1) We assume that there exists at least one suboptimal machine, i.e., $\{1, \ldots, K\} \setminus \{k^*\} \neq \emptyset$. We call anomaly of type 1, denoted by $\{A_1(u_k)\}^{\pi}_{k,t}$, for a suboptimal machine $k \in \{1, \ldots, K\} \setminus \{k^*\}$ and with parameter $u_k \in \mathbb{N}$, the following event:

$$\{A_1(u_k)\}^{\pi}_{k,t} = \left\{ S^{\pi}_{k,t} \geq (u_k, \mu^*) \right\} .$$
Definition 3 (Anomaly of type 2) We refer to as anomaly of type 2, denoted by $\{A_2\}^{\pi}_{t}$ and associated to the optimal machine $k^*$, the following event:

$$\{A_2\}^{\pi}_{t} = \left\{ S^{\pi}_{k^*,t} < (\infty, \mu^*) \ \wedge \ T_{k^*,t} \geq 1 \right\} ,$$

i.e., the event that the index of the optimal arm, once it has been pulled at least once, falls below $\mu^*$.
Lemma 1 (Expected cumulated regret; proof in 8.2) Given a policy $\pi$ and a MAB problem, let $u = [u_1, \ldots, u_K]$ represent a set of integers; then the expected cumulated regret is upper bounded by:

$$\mathbb{E}[R_t] \leq \sum_{k \neq k^*} \Delta_k u_k + \sum_{k \neq k^*} \Delta_k P_t(u_k) ,$$

with $P_t(u_k) = \sum_{m=u_k+1}^{t} \left[ P\left(\{A_2\}^{\pi}_{m}\right) + P\left(\{A_1(u_k)\}^{\pi}_{k,m}\right) \right]$.
We consider the following values for the set $u$: for all suboptimal arms $k$, $u_k(t) = \left\lceil \frac{4\alpha}{(1 - \bar{\mu}_k)^2} \ln(t) \right\rceil$.

We show in the two following lemmas that, for this choice of $u$, the probabilities of both anomalies are upper bounded by polynomially decreasing functions of the number of iterations, of order $t^{-\alpha/2+1}$.
Lemma 2 (Upper bound on Anomaly 1; proof in 8.3) For all $K \geq 2$, if policy MUCB($\alpha$) is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then for all $k \in \{1, \ldots, K\} \setminus \{k^*\}$:

$$P\left(\{A_1(u_k)\}^{\pi}_{k,t}\right) \leq t^{-\alpha/2+1} \quad (6)$$

Lemma 3 (Upper bound on Anomaly 2; proof in 8.4) For all $K \geq 2$, if policy MUCB($\alpha$) is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then:

$$P\left(\{A_2\}^{\pi}_{t}\right) \leq t^{-\alpha/2+1} \quad (7)$$
We end this section with the proof of Theorem 1.

[Proof of Theorem 1] For $\alpha > 4$, relying on Lemmas 1, 2 and 3, we can write:

$$\mathbb{E}[R_t] \leq \sum_{k \neq k^*} \Delta_k \left\lceil \frac{4\alpha}{(1 - \bar{\mu}_k)^2} \ln(t) \right\rceil + o(\ln(t)) ,$$

with $\sum_{k \neq k^*} \Delta_k P_t(u_k) = o(\ln(t))$; indeed, by Lemmas 2 and 3, $P_t(u_k) \leq 2 \sum_{m=u_k+1}^{t} m^{-\alpha/2+1}$, which is bounded by a constant since $\alpha > 4$. Finally, since $\Delta_k = \mu^*(1 - \bar{\mu}_k)$ and $u_k(t) = \frac{4\alpha}{(1 - \bar{\mu}_k)^2} \ln(t) + o(\ln(t))$, we find the stated result of Theorem 1. □
5. SIMULATION RESULTS
For illustration purposes, we consider a secondary user (SU) willing to evaluate the quality of $K = 10$ channels. The SU relies on the measured SNR of the channels to evaluate which channel is best. We assume that the SU suffers Rayleigh fading. Consequently, for every channel, the measured SNR follows an exponential distribution. The presented simulation considers the following parameters $\mu = \{\mu_1, \ldots, \mu_{10}\}$ for the channels, where $\mu_1 \leq \cdots \leq \mu_{10}$ without loss of generality, and $\mu = \{0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9; 1\}$.

The simulations compare three MUCB policies with $\alpha$ equal to 1, 2 and 4.01, respectively. These algorithms are referred to as MUCB(1), MUCB(2) and MUCB(4), respectively. Notice that MUCB(4) is chosen so as to respect the condition imposed in Theorem 1, i.e., $\alpha > 4$. MUCB(1) and MUCB(2), on the contrary, are considered as possibly risky by Theorem 1. The simulations consider a time horizon of $10^6$ iterations.
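
The following Python sketch (ours, reusing the illustrative ExponentialBandit and MUCB classes sketched in Sections 2 and 3) reproduces the spirit of this experiment; it is only an indicative setup, not the code used for the paper, and the horizon can be reduced for a quicker check:

import numpy as np

MEANS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]   # mu_1 <= ... <= mu_10
HORIZON = 10**6      # time horizon (reduce for a faster, rougher check)
N_RUNS = 100         # number of independent experiments to average over

def average_final_regret(alpha):
    regrets = []
    for seed in range(N_RUNS):
        bandit = ExponentialBandit(MEANS, rng=np.random.default_rng(seed))
        policy = MUCB(n_arms=len(MEANS), alpha=alpha)
        regrets.append(run(policy, bandit, HORIZON))
    return float(np.mean(regrets))

for alpha in (1.0, 2.0, 4.01):
    print(f"MUCB({alpha}): average regret after {HORIZON} pulls =",
          f"{average_final_regret(alpha):.1f}")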
Figure 1 plots the cumulated average regret of the MUCB policies. In order to obtain relevant results, the curves were averaged over 100 experiments. All curves show a similar behavior: first an exploration phase where the regret grows quickly; then the curves tend to confirm that the regret of MUCB policies grows as a logarithmic function of the number of iterations. As a matter of fact, we notice that after the first exploration phase, on a logarithmic scale, the regret grows as a linear function. Moreover, since MUCB(1) and MUCB(2) also seem to respect this trend, these curves suggest that the condition imposed in Theorem 1, $\alpha > 4$, might be improvable.
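
As a rough numerical companion to Figure 1 (our own back-of-the-envelope check, not part of the paper), the leading term of the bound (5) can be evaluated for this parameter set:

import math

alpha = 4.01
means = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
mu_star = max(means)
# leading constant of Equation (5): 4*alpha*mu_star * sum over suboptimal arms of 1/(1 - mu_k/mu_star)
constant = 4 * alpha * mu_star * sum(1.0 / (1.0 - m / mu_star) for m in means if m < mu_star)
t = 10**6
print(f"Theorem 1 leading term at t = 1e6: {constant * math.log(t):.0f}")   # roughly 6.3e3

As expected for a worst-case guarantee, this value sits comfortably above the averaged regret observed in Figure 1.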
6. CONCLUSION
A new low-complexity algorithm for MAB problems, MUCB, is suggested and analyzed in this paper. The analysis of its regret proves that the algorithm is order optimal over time. In order to quantify its performance compared to optimal algorithms, further empirical evaluations are needed and are currently under investigation.
Fig. 1. Average regret over 100 experiments (cumulated regret versus number of iterations, logarithmic horizontal axis from $10^1$ to $10^6$) for MUCB(1), MUCB(2) and MUCB(4): illustration of Theorem 1.
7. REFERENCES
[1] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
[2] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054-1078, 1995.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[4] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493-507, 1952.
8. APPENDIX
8.1. Large deviations inequalities
Assumption 1 (Cramér condition) Let $X$ be a real random variable. $X$ satisfies the Cramér condition if and only if there exists $s_0 > 0$ such that, for all $s \in (0, s_0)$, $\mathbb{E}\left[e^{sX}\right] < \infty$.
Lemma 4 (Cramér-Chernoff lemma for the sample mean) Let $X_1, \ldots, X_n$ ($n \in \mathbb{N}$) be a sequence of i.i.d. real random variables satisfying the Cramér condition, with expected value $\mathbb{E}[X]$. We denote by $\bar{X}_n$ the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then there exist two functions $l_1(\cdot)$ and $l_2(\cdot)$ such that:

$$\forall \rho_1 > \mathbb{E}[X], \quad P(\bar{X}_n \geq \rho_1) \leq e^{-l_1(\rho_1)\, n} ,$$
$$\forall \rho_2 < \mathbb{E}[X], \quad P(\bar{X}_n \leq \rho_2) \leq e^{-l_2(\rho_2)\, n} .$$

The functions $l_1(\cdot)$ and $l_2(\cdot)$ do not depend on the sample size $n$; they are continuous, non-negative, strictly increasing (respectively strictly decreasing) for all $\rho_1 > \mathbb{E}[X]$ (respectively $\rho_2 < \mathbb{E}[X]$), and both vanish at $\rho_1 = \rho_2 = \mathbb{E}[X]$.
This result was initially proposed and proved in [4]. The bounds provided by this lemma are called Large Deviations Inequalities (LDIs) in this paper.

In the case of exponential distributions this lemma applies, and the LDI functions have the following expressions:

$$l_1(\rho) = l_2(\rho) = \frac{\rho}{\mathbb{E}[X]} - 1 - \ln\left(\frac{\rho}{\mathbb{E}[X]}\right) \geq \frac{3\left(1 - \frac{\rho}{\mathbb{E}[X]}\right)^2}{2\left(1 + 2\frac{\rho}{\mathbb{E}[X]}\right)}$$
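
As a quick numerical sanity check of these expressions (our own illustration, not part of the paper), one can compare the empirical tail probability of the sample mean of exponential variables with the corresponding Chernoff bound $e^{-l_1(\rho) n}$:

import numpy as np

def l1(rho, mean):
    """Exponential rate function: rho/E[X] - 1 - ln(rho/E[X])."""
    x = rho / mean
    return x - 1.0 - np.log(x)

rng = np.random.default_rng(0)
mean, rho, n, trials = 1.0, 1.5, 50, 100_000
sample_means = rng.exponential(mean, size=(trials, n)).mean(axis=1)
empirical = (sample_means >= rho).mean()
bound = np.exp(-l1(rho, mean) * n)
print(f"P(sample mean >= {rho}): empirical ~ {empirical:.1e}, Chernoff bound {bound:.1e}")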
8.2. Proof of Lemma 1
According to Equation 1: $\mathbb{E}[R^{\pi}_t] = \sum_{k \neq k^*} \Delta_k \, \mathbb{E}[T_{k,t}]$. By definition, $T_{k,t} = \sum_{m=0}^{t-1} \mathbb{1}_{\{I_m = k\}}$. Then $\mathbb{E}[T_{k,t}] = \sum_{m=0}^{t-1} \mathbb{E}\left[\mathbb{1}_{\{I_m = k\}}\right]$. After playing an arm $u_k$ times, bounding the first $u_k$ terms by 1 yields:

$$\mathbb{E}[T_{k,t}] \leq u_k + \sum_{m=u_k+1}^{t-1} P\left(\{I_m = k\} \cap \{T_{k,m} > u_k\}\right) \quad (8)$$

Then we can notice that the following events are equivalent:

$$\{I_m = k\} = \left\{ B_{k,m} > \max_{k' \neq k} B_{k',m} \right\} .$$

Moreover, we can notice that:

$$\left\{ B_{k,m} > \max_{k' \neq k} B_{k',m} \right\} \subseteq \left\{ B_{k,m} > B_{k^*,m} \right\} ,$$

which can be further included in the following union of events:

$$\left\{ B_{k,m} > B_{k^*,m} \right\} \subseteq \left\{ B_{k,m} \geq \mu^* \right\} \cup \left\{ \mu^* > B_{k^*,m} \right\} .$$

Consequently, we can write:

$$\{I_m = k\} \cap \{T_{k,m} > u_k\} \subseteq \{A_1(u_k)\}^{\pi}_{k,m} \cup \{A_2\}^{\pi}_{m} \quad (9)$$

Finally, applying the probability operator:

$$\mathbb{E}[T_{k,t}] \leq u_k + \sum_{m=u_k+1}^{t-1} \left[ P\left(\{A_1(u_k)\}^{\pi}_{k,m}\right) + P\left(\{A_2\}^{\pi}_{m}\right) \right] \quad (10)$$

The combination of Equation 1, given at the beginning of this proof, and Equation 10 concludes this proof. □
8.3. Proof of Lemma 2
From the definition of $\{A_1(u_k)\}^{\pi}_{k,t}$ we can write:

$$P\left(\{A_1(u_k)\}^{\pi}_{k,t}\right) = P\left(S^{\pi}_{k,t} \geq (u_k, \mu^*)\right) \leq \sum_{u=u_k}^{t-1} P\left(B_{k,t}(u) \geq \mu^*\right) ,$$

where the sum decomposes the event over the possible values $u$ of $T_{k,t}$. In the case of MUCB policies, we have, for all $u \leq t$:

$$P\left(B_{k,t}(u) \geq \mu^*\right) = P\left(\bar{X}_{k,t}(u) \geq \frac{\mu^*}{M_{k,t}(u)}\right) .$$

Consequently, we can upper bound the probability of occurrence of type 1 anomalies by:

$$P\left(\{A_1(u_k)\}^{\pi}_{k,t}\right) \leq \sum_{u=u_k}^{t-1} P\left(\bar{X}_{k,t}(u) \geq \frac{\mu^*}{M_{k,t}(u)}\right) \quad (11)$$
Let us define $\rho_{k,t}(T_{k,t}) = \frac{\mu^*}{M_{k,t}(T_{k,t})}$. Since we are dealing with exponential distributions, the rewards provided by arm $k$ satisfy the Cramér condition. As a matter of fact, since $u \geq u_k \geq \frac{4\alpha \ln(t)}{(1 - \bar{\mu}_k)^2}$, then:

$$\frac{\rho_{k,t}(u)}{\mu_k} = \frac{1}{\bar{\mu}_k}\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right) \geq 1 .$$

So, according to the large deviation inequality for $\bar{X}_{k,t}(T_{k,t})$ given by Lemma 4 (with $T_{k,t} \geq u_k$ and $u_k$ large enough), there exists a continuous, non-decreasing, non-negative function $l_{1,k}$ such that:

$$P\left(\bar{X}_{k,t}(T_{k,t}) \geq \rho_{k,t}(T_{k,t}) \mid T_{k,t} = u\right) \leq e^{-l_{1,k}(\rho_{k,t}(u))\, u} .$$

Finally: $P\left(\{A_1(u_k)\}^{\pi}_{k,t}\right) \leq \sum_{u=u_k}^{t-1} e^{-l_{1,k}(\rho_{k,t}(u))\, u}$.

The end of this proof aims at proving that, for $u \geq u_k$:

$$l_{1,k}(\rho_{k,t}(u)) \geq \frac{\alpha \ln(t)}{2u} .$$
Note that, since we are dealing with exponential distributions, we can write: $l_{1,k}(\rho_{k,t}(u)) \geq \frac{3\left(1 - \rho_{k,t}(u)/\mu_k\right)^2}{2\left(1 + 2\rho_{k,t}(u)/\mu_k\right)}$. Moreover, since $u \geq u_k \geq \frac{4\alpha \ln(t)}{(1 - \bar{\mu}_k)^2}$, then:

$$\frac{\rho_{k,t}(u)}{\mu_k} = \frac{1}{\bar{\mu}_k}\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right) \leq \frac{1}{\bar{\mu}_k} .$$

Consequently, it is sufficient to prove that:

$$\frac{3\left(1 - \rho_{k,t}(u)/\mu_k\right)^2}{2\left(1 + \frac{2}{\bar{\mu}_k}\right)} \geq \frac{\alpha \ln(t)}{2u} .$$

Let us define $h(t) = \sqrt{\frac{\alpha \ln(t)}{u}} \in [0, 1]$. We analyze the sign of the function:

$$g(t) = \left(\frac{1}{\bar{\mu}_k}\, h(t) - \left(\frac{1}{\bar{\mu}_k} - 1\right)\right)^2 - \frac{1 + \frac{2}{\bar{\mu}_k}}{3}\, h(t)^2 \quad (12)$$

Consequently, we need to prove that, for $u \geq u_k$, $g(\cdot)$ takes non-negative values.
Factorizing the last equation leads to the product of the following two terms:

$$\left(\left(\frac{1}{\bar{\mu}_k} - \sqrt{\frac{1 + \frac{2}{\bar{\mu}_k}}{3}}\right) h(t) - \left(\frac{1}{\bar{\mu}_k} - 1\right)\right) \left(\left(\frac{1}{\bar{\mu}_k} + \sqrt{\frac{1 + \frac{2}{\bar{\mu}_k}}{3}}\right) h(t) - \left(\frac{1}{\bar{\mu}_k} - 1\right)\right) \quad (13)$$

Since, per definition, $h(t) \in [0, 1]$ and $\frac{1}{\bar{\mu}_k} \geq 1$, the first term of Equation 13 satisfies $\left(\frac{1}{\bar{\mu}_k} - \sqrt{\frac{1 + \frac{2}{\bar{\mu}_k}}{3}}\right) h(t) - \left(\frac{1}{\bar{\mu}_k} - 1\right) \leq 0$. Consequently, $g(\cdot)$ is non-negative if the second term of Equation 13 is also non-positive, i.e.,

$$\sqrt{\frac{\alpha \ln(t)}{u}} \leq \frac{\frac{1}{\bar{\mu}_k} - 1}{\frac{1}{\bar{\mu}_k} + \sqrt{\frac{1 + \frac{2}{\bar{\mu}_k}}{3}}} .$$

Since $u \geq u_k$, this last inequality is verified. Finally, upper bounding Equation 11 for $u \geq u_k$:

$$P\left(\{A_1(u_k)\}^{\pi}_{k,t}\right) \leq \sum_{u=u_k}^{t-1} e^{-\alpha \ln(t)/2} \leq \sum_{u=u_k}^{t-1} \frac{1}{t^{\alpha/2}} \leq \frac{1}{t^{\alpha/2-1}} . \qquad \square$$
8.4. Proof of Lemma 3
This proof follows the same steps as the proof in Subsection 8.3. From the definition of $\{A_2\}^{\pi}_{t}$ we can write:

$$P\left(\{A_2\}^{\pi}_{t}\right) \leq \sum_{u=1}^{t-1} P\left(B_{k^*,t}(u) \leq \mu^*\right) .$$

In the case of MUCB policies, we have, for all $u \leq t$:

$$P\left(B_{k^*,t}(u) \leq \mu^*\right) = P\left(\bar{X}_{k^*,t}(u) \leq \frac{\mu^*}{M_{k^*,t}(u)}\right) .$$
Consequently, we can upper bound the probability of occurrence of type 2 anomalies by:

$$P\left(\{A_2\}^{\pi}_{t}\right) \leq \sum_{u=1}^{t-1} P\left(\bar{X}_{k^*,t}(u) \leq \mu^* \max\left\{0 \,;\; 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\}\right) .$$

Since $\mu^* \max\left\{0 \,;\; 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\} \leq \mu^*$, Cramér's condition is verified. Moreover, since the machine is played whenever the maximum in the previous term is equal to $0$, we can consider that $u \geq \alpha \ln(t)$ and that:

$$\mu^* \max\left\{0 \,;\; 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\} = \mu^*\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right) .$$
Consequently, we can upper bound the occurrence of Anomaly 2:

$$P\left(\{A_2\}^{\pi}_{t}\right) \leq \sum_{u=\lceil \alpha \ln(t) \rceil}^{t-1} e^{-l_2(\rho_{k^*,t}(u))\, u} \quad (14)$$

where $l_2(\rho_{k^*,t}(u))$ verifies the LDI as defined in Appendix 8.1. Thus, after mild simplifications, we can write:

$$l_2(\rho_{k^*,t}(u)) \geq \frac{3\,\frac{\alpha \ln(t)}{u}}{2\left(1 + 2\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right)\right)} \geq \frac{\alpha \ln(t)}{2u} .$$

Including this last inequality into Equation 14 ends the proof. □