Online Learning for Causal Bandits
[...] sampling-based methods for bandit learning.

2.1. Multi-Armed Bandits

Consider the Bernoulli Multi-Armed Bandit problem in which we have N arms and, at each time step t = 1, 2, ..., T over a finite horizon T, one of the N arms is played. After an arm i is played, a reward Y_t ∈ {0, 1}, which is independent of previous plays and i.i.d. from the selected arm's distribution, is observed. An MAB algorithm must efficiently use the observations from the previous t − 1 plays to estimate the reward distributions of the N arms and choose which arm to play at time t. One measure of an MAB algorithm's performance is its ability to maximize expected reward over a horizon T: E[Σ_{t=1}^T µ_{A(t)}], where A(t) is the arm played at time t and µ_{A(t)} is the selected arm's expected reward. Another commonly used measure for providing bounds on algorithmic performance is the concept of expected cumulative regret over a finite horizon T: E[R(T)] = E[Σ_{t=1}^T (µ* − µ_{A(t)})] = Σ_i ∆_i E[k_i(T)], where µ* = max_i µ_i, ∆_i := µ* − µ_i, and k_i(t) denotes the number of times arm i has been played up to time t. In deriving regret bounds for our MAB algorithms, we will prove an upper bound on the expected cumulative regret over a finite horizon T.
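To make these quantities concrete, the short sketch below simulates a small Bernoulli bandit under a deliberately naive uniform-random policy and tallies Σ_i ∆_i k_i(T); the arm means and the policy are hypothetical choices for illustration only, not part of this paper's setup.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # hypothetical arm means; mu* = 0.7
T = 1000
plays = np.zeros(len(mu), dtype=int)  # k_i(T): number of times arm i was played

for t in range(T):
    a = rng.integers(len(mu))         # naive policy: play an arm uniformly at random
    y = rng.binomial(1, mu[a])        # Y_t ~ Bernoulli(mu_{A(t)}), i.i.d. given the arm
    plays[a] += 1

gaps = mu.max() - mu                  # Delta_i = mu* - mu_i
regret = np.dot(gaps, plays)          # sum_i Delta_i k_i(T); averaging over runs estimates E[R(T)]
print(regret)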
2.2. Causal Multi-Armed Bandits

In our causal model, following the terminology of (Koller & Friedman, 2009), we have a directed acyclic graph G, a set of variables X = {X_1, X_2, ..., X_N}, and a joint distribution P over X that factorizes over G. Each variable is Bernoulli, and the existence of an edge from variable X_i to X_j conditions the probability distribution of X_j, i.e. P(X_j = x_j | X_i = x_i) ≠ P(X_j = x_j). We denote the parents of a variable X_i, that is, the subset of X such that there exists an edge from X_j to X_i, by Pa_{X_i}. An intervention (of size m) is represented as do(X = x), which sets the values x = {x_1, x_2, ..., x_m} to the corresponding variables in X. When an intervention is performed, all edges between the intervened variables X_i and their parents Pa_{X_i} are mutilated, and the graph G can be represented by an altered probability distribution P(X^c | do(X = x)), where X^c denotes the variables of X that are not intervened on.
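To make the do(·) operator concrete, the following sketch samples a toy Bernoulli network both observationally and under an intervention that mutilates the incoming edge of the intervened variable; the three-variable chain and its parameters are hypothetical and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain X1 -> X2 -> Y with Bernoulli conditionals.
p_x1 = 0.4                           # P(X1 = 1)
p_x2 = {0: 0.2, 1: 0.8}              # P(X2 = 1 | X1 = x1)
p_y = {0: 0.3, 1: 0.7}               # P(Y = 1 | X2 = x2)

def sample(do_x2=None):
    """Draw (X1, X2, Y); a non-None do_x2 applies do(X2 = do_x2)."""
    x1 = rng.binomial(1, p_x1)
    if do_x2 is None:
        x2 = rng.binomial(1, p_x2[x1])   # observational regime
    else:
        x2 = do_x2                       # edge X1 -> X2 is mutilated; X2 is clamped
    y = rng.binomial(1, p_y[x2])
    return x1, x2, y

observational = [sample() for _ in range(10000)]
interventional = [sample(do_x2=1) for _ in range(10000)]
# Under do(X2 = 1), X1 keeps its marginal distribution while X2 ignores its parent.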
Our agent in the online causal bandit problem is given a set of allowed actions A and limited knowledge of the graph G. One of the variables in our graph G, Y ∈ X, is the reward variable and can be represented by a Bernoulli random variable. The expected reward for an action a ∈ A can be represented as µ_a = E[Y | do(X = a)], and the expected reward for the optimal action is µ_{a*} = max_{a∈A} E[Y | do(X = a)]. After choosing an action a, the agent observes realized values X_t of the observable variables in X and a reward Y_t. Using these realizations, the agent updates its best estimates of the expected rewards for each a ∈ A and repeats this procedure for T episodes. We define the cumulative regret after T episodes as R(T) = µ_{a*} T − Σ_{t=1}^T µ_{â*_t}, where â*_t represents the estimate of the optimal action at time t.

This environment formulation is modeled on the setting in (Lattimore & Reid, 2016). Distinctively, we focus on an online learning representation and measure performance in cumulative regret rather than simple regret. Note that the problem formulation differs from contextual bandits in that we observe our set of variables X_t only after selecting an action and, therefore, cannot construct a best action from observations of X_t.

2.3. Thompson Sampling

The Thompson Sampling (TS) algorithm initializes with a Bayesian prior over the reward distribution of each arm. After each observation Y_t, TS updates the posterior distribution of the arm that was selected in timestep t. If we initialize our prior to be uniform over the unit interval, our best estimate µ̂_i of the expected reward for each arm i can be derived simply by taking the point θ ∈ (0, 1) that maximizes our posterior distribution (and likelihood function): µ̂_i = argmax_{θ∈(0,1)} P_i(θ | x_1, x_2, ..., x_n) ∝ f_i(x_1, x_2, ..., x_n | θ). Furthermore, a plausible expected reward for arm i, θ_i^t, can be sampled from our estimated posterior P_i^t(θ | x_1, x_2, ..., x_n).

Generally, in the case of Bernoulli bandits, TS starts with a uniform prior over the support [0, 1], which can be represented by a Beta(1, 1) distribution. Given that we have to update our posterior distribution at the end of each trial, the beta distribution, which has density f(x; α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^(α−1) (1 − x)^(β−1) with α, β ∈ Z_{++}, is a convenient choice due to its simple posterior updates. Concretely, we can represent each arm's posterior distribution P_i^t(θ | x_1, ..., x_n) as Beta(S_i(t − 1) + 1, F_i(t − 1) + 1), where S_i(t − 1) is the number of successes from Bernoulli trials for plays of arm i and F_i(t − 1) is the corresponding number of failures. To update the posterior distribution at timestep t for the arm i selected at time t, we simply increment S_i(t) = S_i(t − 1) + 1{Y_t = 1} and F_i(t) = F_i(t − 1) + 1{Y_t = 0}. Note that a Dirichlet distribution allows for an extension to more than two classes and will be used for modeling causal dependencies in our proposed algorithm.
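A minimal sketch of Bernoulli Thompson Sampling with Beta(1, 1) priors and the success/failure counters S_i, F_i described above; the environment means are hypothetical and only illustrate the update rule.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.35, 0.55, 0.60])   # hypothetical true arm means (unknown to the learner)
T = 2000

S = np.zeros(len(mu))               # S_i(t): successes observed for arm i
F = np.zeros(len(mu))               # F_i(t): failures observed for arm i

for t in range(T):
    theta = rng.beta(S + 1, F + 1)  # draw a plausible mean from each Beta(S_i + 1, F_i + 1)
    a = int(np.argmax(theta))       # play the arm with the largest sampled mean
    y = rng.binomial(1, mu[a])      # observe Y_t
    S[a] += y                       # S_i(t) = S_i(t - 1) + 1{Y_t = 1}
    F[a] += 1 - y                   # F_i(t) = F_i(t - 1) + 1{Y_t = 0}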
3. Related Work

We will review two general frameworks for bandit learning in causal environments. First, we consider scenarios with [...]
[...]bination of variables, P(Y = 1 | Pa_Y = Z_k), and the dependencies in our causal graph, P(Pa_Y = Z_k | do(X = a)). We consider two variants of the algorithm with varying degrees of knowledge of the causal graph G. In the first setting, we know only {Pa_Y}, the subset of X such that there exists an edge from each element of {Pa_Y} to Y. In the second setting, we assume knowledge of P(Pa_Y = Z_k | do(X = a)), the probability distributions of {Pa_Y} conditioned on performing the intervention do(X = a).

For the setting where P(Pa_Y = Z_k | do(X = a)) is known, we can use the true conditional distributions for P(Pa_Y = Z_k | do(X = a)) when calculating µ_{â_t} rather [...] the algorithm is learning and empirically results in a sig- [...] unrealistic.
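The sketch below shows one way this decomposition could be used when P(Pa_Y = Z_k | do(X = a)) is known: the conditional rewards P(Y = 1 | Pa_Y = Z_k) are tracked with Beta posteriors and combined with the known interventional distributions to score each action. The array layout and function names are our own illustration under these assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: |A| actions and K joint configurations Z_k of Pa_Y.
n_actions, K = 6, 4
# P_parents[a, k] = P(Pa_Y = Z_k | do(X = a)); assumed known in this setting.
P_parents = rng.dirichlet(np.ones(K), size=n_actions)

S = np.zeros(K)   # successes observed under each parent configuration Z_k
F = np.zeros(K)   # failures observed under each parent configuration Z_k

def choose_action():
    """Score mu_a = sum_k P(Y = 1 | Pa_Y = Z_k) P(Pa_Y = Z_k | do(X = a)) for every a,
    using a Thompson-style draw for the unknown conditional rewards."""
    p_y_given_z = rng.beta(S + 1, F + 1)    # sampled P(Y = 1 | Pa_Y = Z_k)
    mu_hat = P_parents @ p_y_given_z        # marginalise over the parent configurations
    return int(np.argmax(mu_hat))

def update(z_index, y):
    """After an episode, credit the parent configuration Z_k that was realised."""
    S[z_index] += y
    F[z_index] += 1 - y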
5. Experiments (Replication from (Lattimore & Reid, 2016))

Included are some experiments comparing both the simple and cumulative regret of our online causal bandit algorithm and the two offline algorithms proposed in (Lattimore & Reid, 2016) (denoted Algorithm 1 and Algorithm 2, as in the original paper). Note that for the experiments in this section, we use a modified version of OC-TS that plays each action exactly once for the first |A| timesteps. As a result, in the experiments where the time horizon T < |A|, we see that OC-TS has a high simple regret at time T, as it is still collecting information about the environment. After the first |A| plays, OC-TS plays actions in an online manner and performs optimally for the conducted experiments.

These experiment settings are the same as those in (Lattimore & Reid, 2016). In all of the experiments, Y depends only on a single variable X_1 (unknown to the algorithms): Y_t ∼ Bernoulli(1/2 + ε) if X_1 = 1 and Y_t ∼ Bernoulli(1/2 − ε′) if X_1 = 0, where ε′ = εq_1/(1 − q_1). We therefore have an expected reward of 1/2 + ε for do(X_1 = 1), 1/2 − ε′ for do(X_1 = 0), and 1/2 for all other actions. We set q_i = 0 for i ≤ m and 1/2 otherwise. We notice that our online algorithm outperforms both of the offline variants whenever T > |A|.
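For reference, a sketch of how this parallel-bandit environment could be instantiated from the description above; the function and variable names are our own, and the X_1 = 0 branch follows the stated expected rewards.

import numpy as np

rng = np.random.default_rng(0)

N, m, eps = 50, 2, 0.3
q = np.where(np.arange(1, N + 1) <= m, 0.0, 0.5)   # q_i = 0 for i <= m, 1/2 otherwise
eps_prime = eps * q[0] / (1.0 - q[0])              # eps' = eps * q_1 / (1 - q_1)

def episode(action=None):
    """One episode: X_i ~ Bernoulli(q_i) independently; `action` is None (observe only)
    or a pair (i, v) meaning do(X_{i+1} = v). The reward depends only on X_1."""
    x = rng.binomial(1, q)
    if action is not None:
        i, v = action
        x[i] = v                                   # intervention overrides X_i
    p_y = 0.5 + eps if x[0] == 1 else 0.5 - eps_prime
    y = rng.binomial(1, p_y)
    return x, y

x, y = episode(action=(0, 1))   # do(X_1 = 1): expected reward 1/2 + eps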
Figure 1. Simple Regret Comparison. (a) Simple regret vs. m(q) for fixed horizon T = 400, number of variables N = 50, and fixed ε = 0.3. (b) Simple regret vs. horizon T, with N = 50, m = 2, and ε = √(N/(8T)).

5.2. Cumulative Regret Experiments

Figure 2. Cumulative Regret Comparison. (a) Cumulative regret for fixed horizon T = 800, number of variables N = 50, m(q) = 25, and fixed ε = 0.3. (b) Cumulative regret for fixed horizon T = 2000, number of variables N = 50, m(q) = 2, and ε = √(N/(8T)).

Now, we measure the cumulative regret for the same experiments with a fixed horizon T for each scenario (large and small ε). For each algorithm, we follow standard behavior for the first T/2 timesteps and fix the best estimated action for the second T/2 timesteps. We notice that our online algorithm's regret converges for both experiments, unlike the offline variants.

[...] graphs. Note that for the experiments in this section, we use a modified version of OC-TS that plays each action exactly once for the first |A| timesteps. As a result, in the experiments where the time horizon T < |A|, we see that OC-TS has a high simple regret at time T, as it is still collecting information about the environment. After the first |A| plays, OC-TS plays actions in an online manner and performs well for the conducted experiments.

6.1. Chain Causal Graph

In the chain causal graph setting, Y depends only on a single variable X_3: Y_t ∼ Bernoulli(1/2 + ε) if X_3 = 1 and Y_t ∼ Bernoulli(1/2) if X_3 = 0. We have an expected reward of 1/2 + ε [...]
Our best possible action for both scenarios, Pu = 0.1 and Pu = 0.3, is the intervention do(X_1 = 1). When Pu = 0.3, the difference in conditional rewards for the actions do(X_1 = 1) and do(X_2 = 1) is very small. We set q_i = 0 for i ≤ m, i ∉ {2, 3}; q_i = Pu for i ∈ {2, 3}; and 1/2 otherwise. Note that we compare two versions of OC-TS: (1) one where each action is played once for the first |A| timesteps, and (2) one where each action is drawn from η*, the same sampling distribution as Algorithm 2 from (Lattimore & Reid, 2016): η* = argmin_η m(η) = argmin_η max_{a∈A} E[ P(Pa_Y(X) | a) / Σ_{b∈A} η_b P(Pa_Y(X) | b) ].

We measure the simple regret for the chain causal graph experiments as we vary the horizon T for each Pu ∈ {0.1, 0.3}. We notice that our online algorithm performs optimally for all horizons T > |A|.
Figure 4. Simple Regret Comparison (chain graph). (a) Simple regret vs. horizon T; N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Simple regret vs. horizon T; N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

Figure 5. Cumulative Regret Comparison (chain graph). (a) Cumulative regret for fixed horizon T = 800, N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Cumulative regret for fixed horizon T = 800, N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

Figure 7. Simple Regret Comparison (confounding graph). (a) Simple regret vs. horizon T; N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Simple regret vs. horizon T; number of variables N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

We measure the simple regret for the confounded causal graph experiments as we vary the horizon T for each Pu ∈ {0.1, 0.3}.
We notice that our online algorithm performs well for all [...]

Figure 8. Cumulative Regret Comparison (confounded graph). (a) Cumulative regret for fixed horizon T = 4000, N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Cumulative regret for fixed horizon T = 4000, N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

Now, we measure the cumulative regret for the same experiments with a fixed horizon T for each scenario (large and small ε). For each algorithm, we follow standard behavior for the first T/2 timesteps and fix the best estimated action for the second T/2 timesteps.
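A sketch of this evaluation protocol under stated assumptions: `bandit` is any policy exposing hypothetical select_action, update, and best_action methods, `env.step` returns the observed variables and reward, and `env.mu_star` / `env.expected_reward` expose the true means used to score regret. Whether estimates keep updating after the switch is not specified above, so the sketch freezes them.

def run_two_phase(env, bandit, T):
    """Follow the policy's standard behaviour for the first T/2 episodes, then
    commit to its best estimated action for the remaining T/2 episodes,
    accumulating cumulative regret against the optimal expected reward."""
    switch = T // 2
    cumulative_regret = 0.0
    fixed_action = None
    for t in range(T):
        if t < switch:
            action = bandit.select_action()
        else:
            if fixed_action is None:
                fixed_action = bandit.best_action()   # freeze the best estimate
            action = fixed_action
        x, y = env.step(action)
        if t < switch:
            bandit.update(action, x, y)
        cumulative_regret += env.mu_star - env.expected_reward(action)
    return cumulative_regret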
7. Limitations of OC-TS

One limitation of our proposed algorithm is the need to seed initial exploration, either through exhaustive search or by minimizing the maximal importance sampling ratio as seen in Section 6. In particular, for the confounded causal graph of Section 6.2, we have provided empirical results below where we do not 'warm-start' our algorithm. We observe that for our experimental parameters, Pu = 0.3 results in a gap µ_{a*} − max_{a∈A, a≠a*} µ_a = 0.04, compared to µ_{a*} − max_{a∈A, a≠a*} µ_a = 0.08 for Pu = 0.1. For Pu = 0.1, but not Pu = 0.3, we are able to play the optimal action with [...]

Figure 9. Simple Regret (No Exhaustive Initial Exploration). (a) Simple regret vs. horizon T; N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Simple regret vs. horizon T; N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

Figure 10. Cumulative Regret (No Exhaustive Initial Exploration). (a) Cumulative regret for fixed horizon T = 800, N = 50, m(q) = 30, Pu = 0.1, and fixed ε = 0.3. (b) Cumulative regret for fixed horizon T = 800, N = 50, m(q) = 30, Pu = 0.3, and fixed ε = 0.3.

In the figures above, we can see that without performing an exhaustive initial exploration, the performance of our algorithm is severely degraded. This is an area we are still researching and a direction for future work. In particular, we are interested in understanding how important initial exploration is as the action space and the number of causal dependencies scale.

8. Experiments: Sparse Causal Graphs

Included are some experiments comparing online bandit algorithms in the setting of sparse causal graphs. For dense causal graphs, we contrast the performance of Online Causal Thompson Sampling with knowledge of {Pa_Y} and Online Thompson Sampling without causal signaling.

Figure 11. Dense Causal Graph Problem Formulation. (a) Graph conditional probabilities. (b) Graph depiction (nodes X_1, X_2, X_3).

Our causal graph G has variables X = {X_1, X_2, X_3, X_4, X_5} and |A| = 10, since each variable is a Bernoulli random variable and we are allowed to make an intervention of size 1. Note that P(Y = 1 | X_2, X_3, X_4, X_5) = P(Y = 1 | X_2, X_3) and that P(X_2 = 1 | X_1) and P(X_3 = 1 | X_1) are the same as in Figure 3.
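As a quick check of the action-space count: size-1 interventions on Bernoulli variables contribute two actions per intervenable variable, so five variables give |A| = 10 (and seven give |A| = 14, as in the sparser graph considered below). A one-line enumeration, with the variable names taken from the graph above:

# do(X_i = v) for each Bernoulli variable and each value v in {0, 1}.
variables = ["X1", "X2", "X3", "X4", "X5"]
actions = [(xi, v) for xi in variables for v in (0, 1)]
assert len(actions) == 10   # |A| = 10; with X1..X7 the same construction gives 14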
Figure 12. Sparse Causal Graph (|A| = 10). (a) Cumulative regret vs. episode (500 trials). (b) Probability of optimal action vs. episode (500 trials). Curves compare TS and TS CO.

Now we consider a sparser causal graph G which has variables X = {X_1, X_2, ..., X_7} and |A| = 14. Note that P(Y = 1 | X_2, X_3, ..., X_7) = P(Y = 1 | X_2, X_3) and that P(X_2 = 1 | X_1) and P(X_3 = 1 | X_1) are the same as in Figure 3.
Figure 13. Sparse Causal Graph (|A| = 14). (a) Cumulative regret vs. episode (500 trials). (b) Probability of optimal action vs. episode (500 trials). Curves compare TS and TS CO.

[...] online method and the computational tractability of offline methods.

10. Future Work

We would like to better understand the fundamental limitations of our algorithm and potential alternatives that may trade off performance with robustness and/or computational complexity. One limitation we presented is sensitivity to the initial exploration phase of our algorithm. When the gap between the optimal action and the next best action, µ_{a*} − max_{a∈A, a≠a*} µ_a, is small, we notice that our algorithm's performance is sensitive to the initial exploration strategy. We believe this is a byproduct of the cardinality of the action space being large as well as the sparsity of the causal graph, and we are currently performing research to validate this claim. Another potential piece of future work is understanding the performance of batch methods for causal bandits and whether such methods provide any added robustness in the estimation of causal dependencies or conditional reward distributions.

References

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.

Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS), 2015.