
Fairness in Reinforcement Learning*

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth¹

Abstract

We initiate the study of fairness in reinforcement learning, where the actions of a learning algorithm may affect its environment and future rewards. Our fairness constraint requires that an algorithm never prefers one action over another if the long-term (discounted) reward of choosing the latter action is higher. Our first result is negative: despite the fact that fairness is consistent with the optimal policy, any learning algorithm satisfying fairness must take time exponential in the number of states to achieve non-trivial approximation to the optimal policy. We then provide a provably fair polynomial time algorithm under an approximate notion of fairness, thus establishing an exponential gap between exact and approximate fairness.

1. Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing [22], hiring [19], lending [4], and criminal sentencing [2], mounting empirical evidence suggests these concerns are not merely hypothetical [1; 25].

We initiate the study of fairness in reinforcement learning, where an algorithm's choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models [5; 6; 9; 10]. The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for settings where historical context can have a distinct influence on the future. For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markovian decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm's productivity after hiring any particular candidate). The actions taken by the algorithm affect the underlying state of the system (e.g. the company's demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state s, an algorithm never chooses an available action a with probability higher than another action a′ unless Q*(s, a) > Q*(s, a′), i.e. the long-term reward of a is greater than that of a′. This definition, adapted from Joseph et al. (2016), is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one population over another if hiring from either population will have similar long-term benefit to the firm's productivity.

¹University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth <jabbari, majos, mkearns, jamiemor, aaroth@cis.upenn.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

*The full technical version of this paper is available at https://arxiv.org/pdf/1611.03071.pdf.

Unfortunately, our first result shows an exponential separation in expected performance between the best unfair algorithm and any algorithm satisfying fairness. This motivates our study of a natural relaxation of (exact) fairness, for which we provide a polynomial time learning algorithm, thus establishing an exponential separation between exact and approximately fair learning in MDPs.

Our Results. Throughout, we use (exact) fairness to refer to the adaptation of Joseph et al. (2016)'s definition defining an action's quality as its potential long-term discounted reward. We also consider two natural relaxations. The first, approximate-choice fairness, requires that an algorithm never chooses a worse action with probability substantially higher than better actions. The second, approximate-action fairness, requires that an algorithm never favors an action of substantially lower quality than that of a better action.

The contributions of this paper can be divided into two parts. First, in Section 3, we give a lower bound on the time required for a learning algorithm to achieve near-optimality subject to (exact) fairness or approximate-choice fairness.

Theorem (Informal statement of Theorems 3, 4, and 5). For constant ε, to achieve ε-optimality, (i) any fair or approximate-choice fair algorithm takes a number of rounds exponential in the number of MDP states, and (ii) any approximate-action fair algorithm takes a number of rounds exponential in 1/(1 − γ), for discount factor γ.

Second, we present an approximate-action fair algorithm (Fair-E3) in Section 4 and prove a polynomial upper bound on the time it requires to achieve near-optimality.

Theorem (Informal statement of Theorem 6). For constant ε and any MDP satisfying standard assumptions, Fair-E3 is an approximate-action fair algorithm achieving ε-optimality in a number of rounds that is (necessarily) exponential in 1/(1 − γ) and polynomial in the other parameters.

The exponential dependence of Fair-E3 on 1/(1 − γ) is tight: it matches our lower bound on the time complexity of any approximate-action fair algorithm. Furthermore, our results establish rigorous trade-offs between fairness and performance facing reinforcement learning algorithms.

1.1. Related Work

The most relevant parts of the large body of literature on reinforcement learning focus on constructing learning algorithms with provable performance guarantees. E3 [13] was the first learning algorithm with a polynomial learning rate, and subsequent work improved this rate (see Szita and Szepesvári (2010) and references within). The study of robust MDPs [16; 18; 20] examines MDPs with high parameter uncertainty but generally uses "optimistic" learning strategies that ignore (and often conflict with) fairness, and so does not directly apply to this work.

Our work also belongs to a growing literature studying the problem of fairness in machine learning. Early work in data mining [8; 11; 12; 17; 21; 29] considered the question from a primarily empirical standpoint, often using statistical parity as a fairness goal. Dwork et al. (2012) explicated several drawbacks of statistical parity and instead proposed one of the first broad definitions of algorithmic fairness, formalizing the idea that "similar individuals should be treated similarly". Recent papers have proven several impossibility results for satisfying different fairness requirements simultaneously [7; 15]. More recently, Hardt et al. (2016) proposed new notions of fairness and showed how to achieve these notions via post-processing of a black-box classifier. Woodworth et al. (2017) and Zafar et al. (2017) further studied these notions theoretically and empirically.

1.2. Strengths and Limitations of Our Models

In recognition of the duration and consequence of choices made by a learning algorithm during its learning process – e.g. job applicants not hired – our work departs from previous work and aims to guarantee the fairness of the learning process itself. To this end, we adapt the fairness definition of Joseph et al. (2016), who studied fairness in the bandit framework and defined fairness with respect to one-step rewards. To capture the desired interaction and evolution of the reinforcement learning setting, we modify this myopic definition and define fairness with respect to long-term rewards: a fair learning algorithm may only choose action a over action a′ if a has true long-term reward at least as high as a′. Our contributions thus depart from previous work in reinforcement learning by incorporating a fairness requirement (ruling out existing algorithms which commonly make heavy use of "optimistic" strategies that violate fairness) and depart from previous work in fair learning by requiring "online" fairness in a previously unconsidered reinforcement learning context.

First note that our definition is weakly meritocratic: an algorithm satisfying our fairness definition can never probabilistically favor a worse option but is not required to favor a better option. This confers both strengths and limitations. Our fairness notion still permits a type of "conditional discrimination" in which a fair algorithm favors group A over group B by selecting choices from A when they are superior and randomizing between A and B when choices from B are superior. In this sense, our fairness requirement is relatively minimal, encoding a necessary variant of fairness rather than a sufficient one. This makes our lower bounds and impossibility results (Section 3) relatively stronger and upper bounds (Section 4) relatively weaker.

Next, our fairness requirement holds (with high probability) across all decisions that a fair algorithm makes. We view this strong constraint as worthy of serious consideration, since "forgiving" unfairness during the learning process may badly mistreat the training population, especially if the learning process is lengthy or even continual. Additionally, it is unclear how to relax this requirement, even for a small fraction of the algorithm's decisions, without enabling discrimination against a correspondingly small population.

Instead, aiming to preserve the "minimal" spirit of our definition, we consider a relaxation that only prevents an algorithm from favoring a significantly worse option over a better option (Section 2.1). Hence, approximate-action fairness should be viewed as a weaker constraint: rather than safeguarding against every violation of "fairness", it instead restricts how egregious these violations can be. We discuss further relaxations of our definition in Section 5.

2. Preliminaries

In this paper we study reinforcement learning in Markov Decision Processes (MDPs). An MDP is a tuple M = (S_M, A_M, P_M, R_M, T, γ) where S_M is a set of n states, A_M is a set of k actions, T is a horizon of a (possibly infinite) number of rounds of activity in M, and γ is a discount factor. P_M : S_M × A_M → S_M and R_M : S_M → [0, 1] denote the transition probability distribution and reward distribution, respectively. We use R̄_M to denote the mean of R_M.² A policy π is a mapping from a history h (the sequence of triples (state, action, reward) observed so far) to a distribution over actions. The discounted state and state-action value functions are denoted by V^π and Q^π, and V^π(s, T) represents the expected discounted reward of following π from s for T steps. The highest value functions are achieved by the optimal policy π* and are denoted by V* and Q* [24]. We use μ^π to denote the stationary distribution of π. Throughout we make the following assumption.

²Note that R̄_M ≤ 1 and Var(R_M) ≤ 1 for all states. The bounded reward assumption can be relaxed (see e.g. [13]). Also, assuming rewards in [0, 1] can be made w.l.o.g. up to scaling.

Assumption 1 (Unichain Assumption). The stationary distribution of any policy in M is independent of its start state.

We denote the ε-mixing time of π by T_ε^π. Lemma 1 relates the ε-mixing time of any policy π to the number of rounds until the V_M^π values of the states visited by π are close to the expected V_M^π values (under the stationary distribution μ^π). We defer all the omitted proofs to the Appendix.

Lemma 1. Fix ε > 0. For any state s, following π for T ≥ T_ε^π steps from s satisfies
E_{s∼μ^π}[V_M^π(s)] − E[(1/T) Σ_{t=1}^T V_M^π(s_t)] ≤ ε/(1 − γ),
where s_t is the state visited at time t when following π from s and the expectation in the second term is over the transition function and the randomization of π.³

³Lemma 1 can be stated for a weaker notion of mixing time called the ε-reward mixing time, which is always linearly bounded by the ε-mixing time but can be much smaller in certain cases (see Kearns and Singh (2002) for a discussion).

The horizon time H_ε := log(ε(1 − γ))/log(γ) of an MDP captures the number of steps an approximately optimal policy must optimize over. The expected discounted reward of any policy after H_ε steps approaches the expected asymptotic discounted reward (Kearns and Singh (2002), Lemma 2). A learning algorithm L is a non-stationary policy that at each round takes the entire history and outputs a distribution over actions. We now define a performance measure for learning algorithms.

Definition 1 (ε-optimality). Let ε > 0 and δ ∈ (0, 1/2). L achieves ε-optimality in T₀ steps if for any T ≥ T₀,
E_{s∼μ*}[V*_M(s)] − E[(1/T) Σ_{t=1}^T V*_M(s_t)] ≤ 2ε/(1 − γ),   (1)
with probability at least 1 − δ, for s_t the state L reaches at time t, where the expectation is taken over the transitions and the randomization of L, for any MDP M.

We thus ask that a learning algorithm, after sufficiently many steps, visits states whose values are arbitrarily close to the values of the states visited by the optimal policy. Note that this is stronger than the "hand-raising" notion in Kearns and Singh (2002),⁴ which only asked that the learning algorithm stop in a state from which the discounted return is near-optimal, permitting termination in a state from which the optimal discounted return is poor. In Definition 1, if there are states with poor optimal discounted reward that the optimal policy eventually leaves for better states, so must our algorithms. We also note the following connection between the average V^π_M values of states visited under the stationary distribution of π (and in particular an optimal policy) and the average undiscounted rewards achieved under the stationary distribution of that policy.

⁴We suspect unfair E3 also satisfies this stronger notion.

Lemma 2 (Singh (2016)). Let R̄_M be the vector of mean rewards in states of M and V^π_M the vector of discounted rewards in states under π. Then μ^π · R̄_M = (1 − γ) μ^π · V^π_M.

We design an algorithm which quickly achieves ε-optimality, and we bound the number of steps T before this happens by a polynomial in the parameters of M.
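
Both the ε-horizon H_ε and the optimal value functions V*_M and Q*_M used throughout the paper are easy to compute for a small tabular MDP. The sketch below is our own illustration rather than anything specified in the paper: it assumes state-dependent mean rewards R̄_M and a tabular transition array, and computes Q* and V* by standard value iteration alongside H_ε.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Compute Q* and V* for a tabular MDP.

    P[s, a, s2] is the probability of moving from s to s2 under action a;
    R[s] is the mean reward of state s (rewards depend only on the state,
    as in Section 2).
    """
    n, k, _ = P.shape
    Q = np.zeros((n, k))
    while True:
        V = Q.max(axis=1)
        # Bellman optimality update: Q*(s, a) = R(s) + gamma * sum_s2 P(s, a, s2) V*(s2)
        Q_new = R[:, None] + gamma * (P @ V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.max(axis=1)
        Q = Q_new

def horizon_time(eps, gamma):
    """H_eps = log(eps * (1 - gamma)) / log(gamma): the number of steps an
    approximately optimal policy must optimize over."""
    return int(np.ceil(np.log(eps * (1.0 - gamma)) / np.log(gamma)))
```

A Q* table of this kind is exactly what the fairness notions of Section 2.1 compare action probabilities against; the learning algorithm of Section 4, of course, only ever has estimates of it.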

2.1. Notions of Fairness

We now turn to formal notions of fairness. Translated to our setting, Joseph et al. (2016) define action a's quality as the expected immediate reward for choosing a from state s and then require that an algorithm not probabilistically favor a over a′ if a has lower expected immediate reward. However, this naive translation does not adequately capture the structural differences between bandit and MDP settings, since present rewards may depend on past choices in MDPs. In particular, defining fairness in terms of immediate rewards would prohibit any policy sacrificing short-term rewards in favor of long-term rewards. This is undesirable, since it is the long-term rewards that matter in reinforcement learning, and optimizing for long-term rewards often necessitates short-term sacrifices. Moreover, the long-term impact of a decision should be considered when arguing about its relative fairness. We will therefore define fairness using the state-action value function Q*_M.

Definition 2 (Fairness). L is fair if for all inputs δ > 0, all M, all rounds t, all states s and all actions a, a′,
Q*_M(s, a) ≥ Q*_M(s, a′) ⟹ L(s, a, h_{t−1}) ≥ L(s, a′, h_{t−1})
with probability at least 1 − δ over histories h_{t−1}.⁵

⁵L(s, a, h) denotes the probability L chooses a from s given history h.

Fairness requires that an algorithm never probabilistically favors an action with lower long-term reward over an action with higher long-term reward. In hiring, this means that an algorithm cannot target one applicant population over another unless the targeted population has a higher quality.

In Section 3, we show that fairness can be extremely restrictive. Intuitively, L must play uniformly at random until it has high confidence about the Q*_M values, in some cases taking exponential time to achieve near-optimality. This motivates relaxing Definition 2. We first relax the probabilistic requirement and require only that an algorithm not substantially favor a worse action over a better one.

Definition 3 (Approximate-choice Fairness). L is α-choice fair if for all inputs δ > 0 and α > 0: for all M, all rounds t, all states s and actions a, a′,
Q*_M(s, a) ≥ Q*_M(s, a′) ⟹ L(s, a, h_{t−1}) ≥ L(s, a′, h_{t−1}) − α
with probability at least 1 − δ over histories h_{t−1}. If L is α-choice fair for any input α > 0, we call L approximate-choice fair.

A slight modification of the lower bound for (exact) fairness shows that algorithms satisfying approximate-choice fairness can also require exponential time to achieve near-optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in Section 1.1, the resulting notion of approximate-action fairness is in some sense the most fitting relaxation of fairness, and is a particularly attractive one because it allows us to give algorithms circumventing the exponential hardness proved for fairness and approximate-choice fairness.

Definition 4 (Approximate-action Fairness). L is α-action fair if for all inputs δ > 0 and α > 0, for all M, all rounds t, all states s and actions a, a′,
Q*_M(s, a) > Q*_M(s, a′) + α ⟹ L(s, a, h_{t−1}) ≥ L(s, a′, h_{t−1})
with probability at least 1 − δ over histories h_{t−1}. If L is α-action fair for any input α > 0, we call L approximate-action fair.

Approximate-choice fairness prevents equally good actions from being chosen at very different rates, while approximate-action fairness prevents substantially worse actions from being chosen over better ones. In hiring, an approximately-action fair firm can only (probabilistically) target one population over another if the targeted population is not substantially worse. While this is a weaker guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical confidence. This is a step forward from current practices, in which companies have much higher degrees of uncertainty about the quality (and impact) of hiring individuals from under-represented populations. For this reason and the computational benefits mentioned above, our upper bounds will primarily focus on approximate-action fairness.

We now state several useful observations regarding fairness. We defer all the formal statements and their proofs to the Appendix. We note that there always exists a (possibly randomized) optimal policy which is fair (Observation 1); moreover, any optimal policy (deterministic or randomized) is approximate-action fair (Observation 2), as is the uniformly random policy (Observation 3).

Finally, we consider a restriction of the actions in an MDP M to nearly-optimal actions (as measured by Q*_M values).

Definition 5 (Restricted MDP). The α-restricted MDP of M, denoted by M^α, is identical to M except that in each state s, the set of available actions is restricted to {a ∈ A_M : Q*_M(s, a) ≥ max_{a′∈A_M} Q*_M(s, a′) − α}.

M^α has the following two properties: (i) any policy in M^α is α-action fair in M (Observation 4) and (ii) the optimal policy in M^α is also optimal in M (Observation 5). Observations 4 and 5 aid our design of an approximate-action fair algorithm: we construct M^α from estimates of the Q*_M values (see Section 4.3 for more details).
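
The definitions above translate directly into code once a Q* table is available. The following sketch is our own illustration (the function names and array layout are assumptions, not the paper's notation): it checks a per-state action distribution against α-action fairness (Definition 4) and builds the α-restricted action sets of Definition 5.

```python
import numpy as np

def violates_action_fairness(Q_star, probs, alpha):
    """Return True if the action distribution `probs` violates alpha-action
    fairness (Definition 4): an action whose Q* value exceeds another's by
    more than alpha may never be played with smaller probability.

    Q_star[s, a]: optimal state-action values.
    probs[s, a]:  probability the learner assigns to action a in state s.
    """
    n, k = Q_star.shape
    for s in range(n):
        for a in range(k):
            for b in range(k):
                if Q_star[s, a] > Q_star[s, b] + alpha and probs[s, a] < probs[s, b]:
                    return True  # a is substantially better yet chosen less often
    return False

def restricted_actions(Q_star, alpha):
    """Per-state action sets of the alpha-restricted MDP M^alpha
    (Definition 5): keep only actions within alpha of the best Q* value."""
    best = Q_star.max(axis=1, keepdims=True)
    return [np.flatnonzero(Q_star[s] >= best[s] - alpha) for s in range(Q_star.shape[0])]
```

In the algorithm of Section 4, the same restriction is applied not to the true Q* values but to estimates of them, which is why the accuracy requirements on "known" states are stated in terms of α.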

3. Lower Bounds

We now demonstrate a stark separation between the performance of learning algorithms with and without fairness. First, we show that neither fair nor approximate-choice fair algorithms achieve near-optimality unless the number of time steps T is at least Ω(k^n), exponential in the size of the state space. We then show that any approximate-action fair algorithm requires a number of time steps T that is at least Ω(k^{1/(1−γ)}) to achieve near-optimality. We start by proving a lower bound for fair algorithms.

Theorem 3. If δ < 1/4, γ > 1/2 and ε < 1/8, no fair algorithm can be ε-optimal in T = O(k^n) steps.⁶

⁶We have not optimized the constants upper-bounding parameters in the statement of Theorems 3, 4 and 5. The values presented here are only chosen for convenience.

Standard reinforcement learning algorithms (absent a fairness constraint) learn an ε-optimal policy in a number of steps polynomial in n and 1/ε; Theorem 3 therefore shows a steep cost of imposing fairness. We outline the idea for the proof of Theorem 3. For intuition, first consider the special case when the number of actions k = 2. We introduce the MDPs witnessing the claim in Theorem 3 for this case.

Definition 6 (Lower Bound Example). For A_M = {L, R}, let M(x) = (S_M, A_M, P_M, R_M, T, γ, x) be an MDP with
• for all i ∈ [n], P_M(s_i, L, s_1) = P_M(s_i, R, s_j) = 1 where j = min{i + 1, n}, and 0 otherwise;
• for i ∈ [n − 1], R_M(s_i) = 0.5, and R_M(s_n) = x.

Figure 1. MDP(x): Circles represent states (labels denote the state name and deterministic reward). Arrows represent actions.

Figure 1 illustrates the MDP from Definition 6. All the transitions and rewards in M are deterministic, but the reward at state s_n can be either 1 or 1/2, and so no algorithm (fair or otherwise) can determine whether the Q*_M values of all the states are the same or not until it reaches s_n and observes its reward. Until then, fairness requires that the algorithm play all the actions uniformly at random (if the reward at s_n is 1/2, any fair algorithm must play uniformly at random forever). Thus, any fair algorithm will take exponential time in the number of states to reach s_n. This can be easily modified for k > 2: from each state s_i, k − 1 of the actions (deterministically) return to state s_1 and only one action (deterministically) reaches state s_{min{i+1,n}}. It will take k^n steps before any fair algorithm reaches s_n and can stop playing uniformly at random (which is necessary for near-optimality). The same example, with a slightly modified analysis, also provides a lower bound of Ω((k/(1 + kα))^n) time steps for approximate-choice fair algorithms, as stated in Theorem 4.

Theorem 4. If δ < 1/4, α < 1/4, γ > 1/2 and ε < 1/8, no α-choice fair algorithm is ε-optimal for T = O((k/(1 + kα))^n) steps.

Fairness and approximate-choice fairness are both extremely costly, ruling out polynomial time learning rates. Hence, we focus on approximate-action fairness. Before moving to positive results, we mention that the time complexity of approximate-action fair algorithms will still suffer from an exponential dependence on 1/(1 − γ).

Theorem 5. For δ < 1/4, α < 1/8, γ > max(0.9, c) with c ∈ (1/2, 1), and ε < (e^c − 1)/16, no α-action fair algorithm is ε-optimal for T = O((k^{1/(1−γ)})^c) steps.

The MDP in Figure 1 also witnesses the claim of Theorem 5 when n = ⌈log(1/(2α)) / log(1/γ)⌉. The discount factor γ is generally taken as a constant, so in most interesting cases 1/(1 − γ) ≪ n: this lower bound is substantially less stringent than the lower bounds proven for fairness and approximate-choice fairness. Hence, from now on, we focus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter but 1/(1 − γ), and with tight dependence on 1/(1 − γ).
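
The chain construction behind these lower bounds can also be written down explicitly. Below is our own sketch of the MDP family M(x) from Definition 6, including the k-action generalization used in the argument above; the array layout and the choice of which action index advances along the chain are ours.

```python
import numpy as np

def lower_bound_mdp(n, k, x):
    """Transition and reward tables for the chain MDP of Definition 6,
    generalized to k actions as in Section 3.

    From state s_i, a single action (index 0 here) advances to s_{min(i+1, n)}
    while the remaining k-1 actions return deterministically to s_1.  States
    s_1, ..., s_{n-1} pay reward 0.5 and s_n pays x, so the two hypotheses
    x = 1/2 and x = 1 cannot be distinguished before s_n is reached.
    """
    P = np.zeros((n, k, n))   # P[i, a, j]: probability of moving from s_{i+1} to s_{j+1}
    R = np.full(n, 0.5)
    R[n - 1] = x
    for i in range(n):
        P[i, 0, min(i + 1, n - 1)] = 1.0   # the single "advance" action
        P[i, 1:, 0] = 1.0                  # the k-1 "reset to s_1" actions
    return P, R
```

Until s_n is visited, every action in every state has the same estimated quality, so a fair (or approximate-choice fair) learner is forced to randomize uniformly; this is exactly what makes reaching s_n take exponentially many steps in expectation.
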
4. A Fair and Efficient Learning Algorithm

We now present an approximate-action fair algorithm, Fair-E3, with the performance guarantees stated below.

Theorem 6. Given ε > 0, α > 0, δ ∈ (0, 1/2) and γ ∈ [0, 1) as inputs, Fair-E3 is an α-action fair algorithm which achieves ε-optimality after
T = Õ( n^5 T_ε* k^{1/(1−γ)+5} / ( min{α^4, ε^4} ε^2 (1 − γ)^{12} ) )   (2)
steps, where Õ hides poly-logarithmic terms.

The running time of Fair-E3 (which we have not attempted to optimize) is polynomial in all the parameters of the MDP except 1/(1 − γ); Theorem 5 implies that this exponential dependence on 1/(1 − γ) is necessary.

Several more recent algorithms (e.g. R-MAX [3]) have improved upon the performance of E3. We adapted E3 primarily for its simplicity. While the machinery required to properly balance fairness and performance is somewhat involved, the basic ideas of our adaptation are intuitive. We further note that subsequent algorithms improving on E3 tend to heavily leverage the principle of "optimism in the face of uncertainty": such behavior often violates fairness, which generally requires uniformity in the face of uncertainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E3 as an apt starting point for designing a fair planning algorithm.

The remainder of this section explains Fair-E3, beginning with a high-level description in Section 4.1. We then define the "known" states Fair-E3 uses to plan in Section 4.2, explain this planning process in Section 4.3, and bring this all together to prove Fair-E3's fairness and performance guarantees in Section 4.4.

4.1. Informal Description of Fair-E3

Fair-E3 relies on the notion of "known" states. A state s is defined to be known after all actions have been chosen from s enough times to confidently estimate relevant reward distributions, transition probabilities, and Q^π_M values for each action. At each time t, Fair-E3 then uses known states to reason about the MDP as follows:

• If in an unknown state, take a uniformly random trajectory of length H_ε.
• If in a known state, compute (i) an exploration policy which escapes to an unknown state quickly and p, the probability that this policy reaches an unknown state within 2T_ε* steps, and (ii) an exploitation policy which is near-optimal in the known states of M.
  – If p is large enough, follow the exploration policy; otherwise, follow the exploitation policy.

Fair-E3 thus relies on known states to balance exploration and exploitation in a reliable way. While Fair-E3 and E3 share this general idea, fairness forces Fair-E3 to more delicately balance exploration and exploitation. For example, while both algorithms explore until states become "known", the definition of a known state must be much stronger in Fair-E3 than in E3 because Fair-E3 additionally requires accurate estimates of actions' Q^π_M values in order to make decisions without violating fairness. For this reason, Fair-E3 replaces the deterministic exploratory actions of E3 with random trajectories of actions from unknown states. These random trajectories are then used to estimate the necessary Q^π_M values.

In a similar vein, Fair-E3 requires particular care in computing exploration and exploitation policies, and must restrict the set of such policies to fair exploration and fair exploitation policies. Correctly formulating this restriction process to balance fairness and performance relies heavily on the observations about the relationship between fairness and performance provided in Section 2.1.

4.2. Known States in Fair-E3

We now formally define the notion of known states for Fair-E3. We say a state s becomes known when one can compute good estimates of (i) R_M(s) and P_M(s, a) for all a, and (ii) Q*_M(s, a) for all a.

Definition 7 (Known State). Let
m_1 = O( k^{H_ε+3} (1/((1 − γ)α))^2 n log(k/δ) )   and
m_2 = O( (n/min{ε, α})^4 H_ε^8 log(1/δ) ).
A state s becomes known after taking
m_Q := k · max{m_1, m_2}   (3)
length-H_ε random trajectories from s.

It remains to show that motivating conditions (i) and (ii) indeed hold for our formal definition of a known state. Informally, m_1 random trajectories suffice to ensure that we have accurate estimates of all Q*_M(s, a) values, and m_2 random trajectories suffice to ensure accurate estimates of the transition probabilities and rewards.

To formalize condition (i), we rely on Theorem 7, connecting the number of random trajectories taken from s to the accuracy of the empirical V^π_M estimates.

Theorem 7 (Theorem 5.5, Kearns et al. (2000)). For any state s and α > 0, after
m = O( k^{H_ε+3} (1/((1 − γ)α))^2 log(|Π|/δ) )
random trajectories of length H_ε from s, with probability of at least 1 − δ, we can compute estimates V̂^π_M such that |V^π_M(s) − V̂^π_M(s)| ≤ α, simultaneously for all π ∈ Π.

Theorem 7 enables us to translate between the number of trajectories taken from a state and the uncertainty about its V^π_M values for all policies (including π* and hence V*_M). Since |Π| = k^n, we substitute log(|Π|) = n log(k). To estimate the Q*_M(s, a) values using the V*_M(s) values, we increase the number of necessary length-H_ε random trajectories by a factor of k.

For condition (ii), we adapt the analysis of E3 [13], which states that if each action in a state s is taken m_2 times, then the transition probabilities and reward in state s can be estimated accurately (see Section 4.4).
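
Definition 7 is simple bookkeeping once H_ε is fixed. The following sketch computes the known-state threshold m_Q of Equation 3; the constants c1 and c2 are placeholders of our own for the constants hidden inside the O(·) notation, and the function name is ours as well.

```python
import math

def known_state_threshold(n, k, eps, alpha, delta, gamma, c1=1.0, c2=1.0):
    """Number of length-H_eps random trajectories after which a state is
    declared known (Definition 7).  c1 and c2 stand in for the unspecified
    constants of the O(.) bounds on m1 and m2.
    """
    H = math.ceil(math.log(eps * (1.0 - gamma)) / math.log(gamma))
    m1 = c1 * (float(k) ** (H + 3)) * (1.0 / ((1.0 - gamma) * alpha)) ** 2 * n * math.log(k / delta)
    m2 = c2 * (n / min(eps, alpha)) ** 4 * (float(H) ** 8) * math.log(1.0 / delta)
    return k * max(m1, m2)   # m_Q = k * max{m1, m2}   (Equation 3)
```

The k^{H_ε+3} term is where the exponential dependence on 1/(1 − γ) enters, since H_ε itself grows like 1/(1 − γ) up to logarithmic factors.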

4.3. Planning in Fair-E3

We now formalize the planning steps in Fair-E3 from known states. For the remainder of our exposition, we make Assumption 2 for convenience (and show how to remove this assumption in the Appendix).

Assumption 2. T_ε* is known.

Fair-E3 constructs two ancillary MDPs for planning: M_Γ is the exploitation MDP, in which the unknown states of M are condensed into a single absorbing state s_0 with no reward. In the known states Γ, transitions are kept intact and the rewards are deterministically set to their mean value. M_Γ thus incentivizes exploitation by giving reward only for staying within known states. In contrast, M_{[n]\Γ} is the exploration MDP, identical to M_Γ except for the rewards. The rewards in the known states are set to 0 and the reward in s_0 is set to 1. M_{[n]\Γ} then incentivizes exploration by giving reward only for escaping to unknown states. See the middle (right) panel of Figure 2 for an illustration of M_Γ (M_{[n]\Γ}), and the Appendix for formal definitions.

Figure 2. Left: An MDP M with two actions (L and R) and deterministic transition functions and rewards. Green denotes the set of known states Γ. Middle: M_Γ. Right: M_{[n]\Γ}.

Fair-E3 uses these constructed MDPs to plan according to the following natural idea: when in a known state, Fair-E3 constructs M̂_Γ and M̂_{[n]\Γ} based on the estimated transitions and rewards observed so far (see the Appendix for formal definitions), and then uses these to compute additional restricted MDPs M̂^α_Γ and M̂^α_{[n]\Γ} for approximate-action fairness. Fair-E3 then uses these restricted MDPs to choose between exploration and exploitation.

More formally, if the optimal policy in M̂^α_{[n]\Γ} escapes to the absorbing state of M_Γ with high enough probability within 2T_ε* steps, then Fair-E3 explores by following that policy. Otherwise, Fair-E3 exploits by following the optimal policy in M̂^α_Γ for T_ε* steps. While following either of these policies, whenever Fair-E3 encounters an unknown state, it stops following the policy and proceeds by taking a length-H_ε random trajectory.

4.4. Analysis of Fair-E3

In this section we formally analyze Fair-E3 and prove Theorem 6. We begin by proving that M^α_Γ is useful in the following sense: M^α_Γ has at least one of an exploitation policy achieving high reward or an exploration policy that quickly reaches an unknown state in M.

Lemma 8 (Exploit or Explore Lemma). For any state s ∈ Γ, δ ∈ (0, 1) and any T > 0, at least one of the statements below holds:
• there exists an exploitation policy π in M^α_Γ such that
max_{π̄∈Π} E[ Σ_{t=1}^T V^π̄_{M_Γ}(π̄^t(s), T) ] − E[ Σ_{t=1}^T V^π_{M_Γ}(π^t(s), T) ] ≤ δT,
where the random variables π^t(s) and π̄^t(s) denote the states reached from s after following π and π̄ for t steps, respectively;
• there exists an exploration policy π in M^α_Γ such that the probability that a walk of 2T steps from s following π will terminate in s_0 exceeds δ/T.

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in M is approximate-action fair, if the optimal policy stays in the set Γ of M's known states, then following the optimal policy in M^α_Γ is both optimal and approximate-action fair. However, if instead the optimal policy in M quickly escapes to an unknown state in M, the optimal policy in M^α_Γ may not be able to compete with the optimal policy in M.

Ignoring fairness, one natural way of computing an escape policy to "keep up" with the optimal policy is to compute the optimal policy in M_{[n]\Γ}. Unfortunately, following this escape policy might violate approximate-action fairness – high-quality actions might be ignored in lieu of low-quality exploratory actions that quickly reach the unknown states of M. Instead, we compute an escape policy in M^α_{[n]\Γ} and show that if no near-optimal exploitation policy exists in M^α_Γ, then the optimal policy in M^α_{[n]\Γ} (which is fair by construction) quickly escapes to the unknown states of M.

Next, in order for Fair-E3 to check whether the optimal policy in M_{[n]\Γ} quickly reaches the absorbing state of M_Γ with significant probability, Fair-E3 simulates the execution of the optimal policy of M_{[n]\Γ} for 2T_ε* steps from the known state s in M_Γ several times, counting the ratio of the runs ending in s_0, and applying a Chernoff bound; this is where Assumption 2 is used.

Having discussed exploration, it remains to show that the exploitation policy described in Lemma 8 satisfies ε-optimality as defined in Definition 1. By setting T ≥ T_ε* in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy.

Corollary 9. For any state s ∈ Γ and T ≥ T_ε*, if there exists an exploitation policy π in M^α_Γ then
| (1/T) E[ Σ_{t=1}^T V^π_{M_Γ}(π^t(s), T) ] − E_{s∼μ*}[V*_M(s)] | ≤ ε/(1 − γ).
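
The two ancillary MDPs and the simulation-based escape check described above are easy to sketch for a tabular model. The code below is our own illustration, not the paper's implementation: it assumes estimated tabular transitions and mean rewards, collapses the unknown states into the absorbing state s_0, and estimates by Monte Carlo the probability that a given policy reaches s_0 within a step budget (the quantity Fair-E3 thresholds, via a Chernoff bound, before deciding to explore).

```python
import numpy as np

def planning_mdps(P, R_bar, known):
    """Build the exploitation MDP (M_Gamma) and exploration MDP (M_{[n]\\Gamma})
    of Section 4.3 from transition estimates P[s, a, s2] and mean rewards R_bar[s].

    Both share the same transitions: known-state transitions are kept intact,
    and all probability mass leading to unknown states is redirected to a new
    absorbing state s0 (index n).  Only the rewards differ.
    """
    n, k, _ = P.shape
    unknown = [s for s in range(n) if s not in known]
    P_aux = np.zeros((n + 1, k, n + 1))
    P_aux[n, :, n] = 1.0                               # s0 is absorbing
    for s in known:
        P_aux[s, :, :n] = P[s]
        P_aux[s, :, n] = P[s][:, unknown].sum(axis=1)  # mass into unknown states -> s0
        P_aux[s, :, unknown] = 0.0
    # rows for unknown original states are never visited and stay unused
    R_exploit = np.zeros(n + 1)
    R_exploit[list(known)] = R_bar[list(known)]        # reward only inside known states
    R_explore = np.zeros(n + 1)
    R_explore[n] = 1.0                                 # reward only for escaping to s0
    return P_aux, R_exploit, R_explore

def escape_probability(P_aux, policy, s, steps, runs=1000, seed=0):
    """Monte Carlo estimate of the probability that `policy` (one action
    distribution per state of the auxiliary MDP) reaches the absorbing state
    s0 within `steps` steps when started from the known state s."""
    rng = np.random.default_rng(seed)
    n_states, k, _ = P_aux.shape
    s0 = n_states - 1
    hits = 0
    for _ in range(runs):
        cur = s
        for _ in range(steps):
            a = rng.choice(k, p=policy[cur])
            cur = rng.choice(n_states, p=P_aux[cur, a])
            if cur == s0:
                hits += 1
                break
    return hits / runs
```

With enough simulated runs, a Chernoff bound turns this empirical ratio into a reliable test of whether the fair escape policy reaches an unknown state quickly enough, which is where the knowledge of T_ε* (Assumption 2) is used.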

1. VM⇡
(s) ⇡
min{↵/2, ✏}  VM ⇡
ˆ (s)  VM (s) + exploration, the number of 2T✏⇤ steps exploitation phases
min{↵/2, ✏}, needed is at least
2. Q⇡M (s, a) min{↵/2, ✏}  Q⇡Mˆ (s, a)  ✓ ◆
T1 (1 )
Q⇡M (s, a) + min{↵/2, ✏}. T2 = O .

We now have the necessary results to prove Theorem 6.
Therefore, after T = T1 + T2 steps we have
Proof of Theorem 6. We divide the analysis into separate T
parts: the performance guarantee of Fair-E3 and its ⇤ 1 X ⇤ 2✏
Es⇠µ⇤ VM (s) E V (st )  ,
approximate-action fairness. We defer the analysis of the T t=1 M 1
probability of failure of Fair-E3 to the Appendix.
as claimed in Equation 2. The running time of Fair-E3 is
We start with the performance guarantee and show that 3 2

when Fair-E3 follows the exploitation policy the average O( nT✏ ): the additional nT✏ factor comes from offline
VM⇤
values of the visited states is close to Es⇠µ⇤ VM ⇤
(s). computation of the optimal policies in M̂ ↵ and M̂[n]\

.
However, when following an exploration policy or taking
We wrap up by proving Fair-E3 satisfies
random trajectories, visited states’ VM

values can be small.
approximate-action fairness in every round. The ac-
To bound the performance of Fair-E3 , we bound the num-
tions taken during random trajectories are fair (and hence
ber of these exploratory steps by the MDP parameters so
approximate-action fair) by Observation 3. Moreover,
they only have a small effect on overall performance.
Fair-E3 computes policies in M̂ ↵ and M̂[n]\ ↵
. By
Note that in each T✏⇤ -step exploitation phase of Fair-E3 , the Lemma 10 with probability at least 1 any Q⇤ or V ⇤
expectation of the average VM ⇤
values of the visited states value estimated in M̂ ↵ or M̂[n]\

is within ↵/2 of its
is at least Es⇠µ⇤ VM (s) ✏/(1

) ✏/2 by Lemmas 1, 8 corresponding true value in M ↵ or M[n]\↵
. As a result,
and Observation 5. By a Chernoff bound, the probability M̂ ↵ and M̂[n]\

(i) contain all the optimal policies and
that the actual average VM ⇤
values of the visited states is (ii) only contain actions with Q⇤ values within ↵ of the
less than Es⇠µ⇤ VM (s) ✏/(1

) 3✏/4 is less than /4 optimal actions. It follows that any policy followed in M̂ ↵
1
if there are at least
log( )
2 exploitation phases. and M̂[n]\

is ↵-action fair, so both the exploration and

exploitation policies followed by Fair-E3 satisfy ↵-action
We now bound the total number of exploratory steps of
fairness, and Fair-E3 is therefore ↵-action fair.
Fair-E3 by
✓ ⇣ n ⌘◆
T1 = O nmQ H✏ + nmQ
T✏⇤
log , 5. Discussion and Future Work

Our work leaves open several interesting questions. For ex-
where mQ is defined in Equation 3 of Definition 7. The ample, we give an algorithm that has an undesirable expo-
two components of this term bound the number of rounds nential dependence on 1/(1 ), but we show that this de-
in which Fair-E3 plays non-exploitatively: the first bounds pendence is unavoidable for any approximate-action fair al-
the number of steps taken when Fair-E3 follows random gorithm. Without fairness, near-optimality in learning can
trajectories, and the second bounds how many steps are be achieved in time that is polynomial in all of the param-
taken following explicit exploration policies. The former eters of the underlying MDP. So, we can ask: does there
bound follows from the facts that each random trajectory exist a meaningful fairness notion that enables reinforce-
has length H✏ ; that in each state, mQ trajectories are suf- ment learning in time polynomial in all parameters?
ficient for the state to become known; and that random tra-
jectories are taken only before all n states are known. The Moreover, our fairness definitions remain open to further
latter bound follows from the fact that Fair-E3 follows modulation. It remains unclear whether one can strengthen
an exploration policy for 2T✏⇤ steps; and an exploration our fairness guarantee to bind across time rather than sim-
T⇤ ply across actions available at the moment without large
policy needs to be followed only O( ✏✏ log( n )) times be-
performance tradeoffs. Similarly, it is not obvious whether
fore reaching an unknown state (since any exploration pol-
one can gain performance by relaxing the every-step na-
icy will end up in an unknown state with probability of at
ture of our fairness guarantee in a way that still forbids dis-
least T✏⇤ according to Lemma 8, and applying a Chernoff
✏ crimination. These and other considerations suggest many
bound); that an unknown state becomes known after it is questions for further study; we therefore position our work
visited mQ times; and that exploration policies are only as a first cut for incorporating fairness into a reinforcement
followed before all states are known. learning setting.
Finally, to make up for the potentially poor performance in

References

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. Propublica, 2016.
Anna Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing/. Retrieved 4/28/2016.
Ronen Brafman and Moshe Tennenholtz. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016. URL https://www.technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science, pages 214–226, 2012.
Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.
Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. CoRR, abs/1609.07236, 2016.
Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.
Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3315–3323, 2016.
Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 325–333, 2016.
Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In Proceedings of the 12th IEEE International Conference on Data Mining, pages 924–929, 2012.
Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2012.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large POMDPs via reusable trajectories. In Proceedings of the 13th Annual Conference on Neural Information Processing Systems, pages 1001–1007, 2000.
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, 2017.
Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 701–709, 2013.
Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 502–510, 2011.
Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, 2012.
Clair Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015. URL http://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html/. Retrieved 4/28/2016.
Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.
Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM, 2008.
Cynthia Rudin. Predictive policing using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/predictive-policing-using-machine-learning-to-detect-patterns-of-crime/. Retrieved 4/28/2016.
Satinder Singh. Personal Communication, June 2016.
Richard Sutton and Andrew Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.
István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, 2010.
Blake Woodworth, Suriya Gunasekar, Mesrob Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory, 2017.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment and disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International World Wide Web Conference, 2017.
Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325–333, 2013.