
Orthogonalized Estimation of Difference of Q-functions

Anonymous Author(s)
Affiliation
Address
email

1 Introduction and Related Work


Learning optimal dynamic treatment rules, or sequential policies for taking actions, is important, although often only observational data is available. Many recent works in offline reinforcement learning develop methodology to evaluate and optimize sequential decision rules without the ability to conduct online exploration.

An extensive literature on causal inference and machine learning establishes methodologies for learning causal contrasts, such as the conditional average treatment effect (CATE), which is sufficient for making optimal decisions. Methods that specifically estimate causal contrasts (such as the CATE) can better adapt to potentially smoother or more structured contrast functions, while methods that instead contrast estimates (by taking the difference of outcome regressions or Q-functions) cannot. Additionally, estimation of causal contrasts can be improved via orthogonalization or double machine learning [Ken22, CCD+18]. Estimating the causal contrast is therefore both sufficient for optimal decisions and statistically favorable.

In this work, building on recent advances in heterogeneous treatment effect estimation, we focus on estimating analogous causal contrasts for offline reinforcement learning, namely $\tau_t^\pi(s) = Q_t^\pi(s,1) - Q_t^\pi(s,0)$, and natural multiple-action generalizations thereof. (For multiple actions, fix some action choice $a_0$ and estimate, for all other actions $a$, $Q_t^\pi(s,a) - Q_t^\pi(s,a_0)$ via the same approach as for two actions.)^1

The sequential setting offers even more motivation to target estimation of the contrast: additional structure can arise from sparsity patterns induced by the joint (in)dependence of rewards and transition dynamics on (decompositions of) the state variable. A number of recent works point out this additional structure [WXZS, WDT+22], for example a certain transition-reward factorization, first studied by [DTC18], that admits a sparse Q-function contrast [PS23]. [Zho24] proposes a variant of the underlying blockwise pattern that also admits sparse optimal policies, but requires a special modification of the LASSO. Our method can adapt to such underlying sparsity structure when it is present in the Q-function contrast, in addition to other scenarios where the contrast is smoother than the Q-functions themselves.

The contributions of this work are as follows: [az: ] We develop a dynamic generalization of the R-learner [NW21] for estimating the Q-function contrast. The method wraps around standard estimation procedures in offline reinforcement learning via a sequence of sequential loss minimization problems, which makes it appealingly practical. It works with black-box nuisance estimators of the Q-function and behavior policy, allowing for the use of current state-of-the-art deep learning approaches, for example. But the residualized learning step allows for targeting the more structured smoothness of the Q-function contrast itself.

^1 This contrast is subtly different from advantage functions in reinforcement learning, defined as $Q^\pi(s,a) - V^\pi(s)$, the advantage of taking action $a$ beyond the policy, which have also been extensively studied in reinforcement learning.

Related work There is a large body of work on offline policy evaluation and optimization in offline reinforcement learning [JYW21, XCJ+21], including approaches that leverage importance sampling or introduce marginalized versions thereof [JL16, TTG15, KU19a, LLTZ18]. For Markov decision processes, other papers study statistically semiparametrically efficient estimation [KU19a, KU19b, XYZ23, KZ22, 1]. The literature on dynamic treatment regimes (DTRs) studies a method called advantage learning [STLD14], although DTRs in general lack a reward at every timestep, whereas we are particularly motivated by sparsity implications that arise jointly from reward and transition structure. Prior works that consider policy optimization under a restricted function class can require estimating difficult policy-dependent nuisance functions; we maximize the advantage function without further restricting functional complexity, which requires re-estimating nuisance functions at every timestep (but not at every iteration of policy optimization, as in [XYWL21]).
At a high level, our method is similar to the dynamic DR-learner studied in [LS21] in that we extend the R-learner identification approach to a sequential setting, although the estimand is quite different. In particular, they consider heterogeneity only with respect to a fixed initial state, study dynamic treatment regimes with terminal rewards, and generalize structural nested mean models (SNMMs) by estimating "blip-to-zero" functions. Consequently, our analysis is similar only at a high level.
Overall, the most closely related work when it comes to estimating contrast functionals in reinforcement learning is that of [SLZS22], which derives a pseudo-outcome for estimating the Q-function contrast in the infinite-horizon setting. We share the same estimand, but in the finite-horizon setting. They regress upon the contrast:
$$\hat{Q}^{(\ell)}(a, S_{i,t}) + \frac{\mathbb{I}(A_{i,t}=a)}{\Pr(A_{i,t}=a \mid S_{i,t})}\Big\{ R_{i,t} + \gamma \max_{a'} \hat{Q}^{(\ell)}(a', S_{i,t+1}) - \hat{Q}^{(\ell)}(A_{i,t}, S_{i,t}) \Big\} + \frac{\gamma}{1-\gamma}\,\tilde{\eta}_{i,t,a}.$$
Crucial differences include: we directly generalize the residualized learning (R-learner) approach, and we work in finite horizons with the propensity score, which is quite different from the hard-to-estimate stationary distribution density ratio [LLTZ18]. [?] studies fast rates using the margin assumption for fitted-Q-iteration; here we simply use the margin assumption to relate contrast estimation error to the policy value, and our estimation approach is quite different.
To relate this approach to natural approaches based on pseudo-outcome regression against the doubly robust score (i.e., the pseudo-outcome counterpart of [TTG15, JL16]), note that the (single-stage) R-learner loss function is an overlap-weighted [LTL19] regression against the doubly robust score (DR-learner [Ken20]). (See [MDV23, CHK+24] for more discussion.) [az: what are the overlap weights?]

Other more distantly related work studies advantage functions, the more classical RL notion of contrast [NP08]. [NBW21] study OPE for advantage functions in a special case of optimal stopping. [BSZ23] also introduces orthogonality in estimation, but of robust Q-functions.

2 Method

2.1 Problem Setup

We consider a finite-horizon Markov decision process on the full-information state space, given by a tuple $M = (\mathcal{S}, \mathcal{A}, r, P, \gamma, T)$ of states, actions, reward function $r(s,a)$, transition probability matrix $P$, discount factor $\gamma < 1$, and time horizon of $T$ steps, where $t = 1, \dots, T$. We let the state space $\mathcal{S} \subseteq \mathbb{R}^d$ be continuous and assume the action space $\mathcal{A}$ is finite. A policy $\pi : \mathcal{S} \mapsto \Delta(\mathcal{A})$ maps from the state space to a distribution over actions, where $\Delta(\cdot)$ is the set of distributions over $(\cdot)$, and $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$. (At times we overload notation a bit so that $\pi(s) \in \mathcal{A}$ indicates the action random variable under $\pi$ evaluated at state $s$.)

The value function is $V_t^\pi(s) = \mathbb{E}_\pi[\sum_{t'=t}^T \gamma^{t'-t} R_{t'} \mid s]$, where $\mathbb{E}_\pi$ denotes expectation under the joint distribution induced by the MDP $M$ running policy $\pi$. The state-action value function, or Q-function, is $Q_t^\pi(s,a) = \mathbb{E}_\pi[\sum_{t'=t}^T \gamma^{t'-t} R_{t'} \mid s, a]$. These satisfy the Bellman equation, e.g. $Q_t^\pi(s,a) = r(s,a) + \gamma \mathbb{E}[V_{t+1}^\pi(s_{t+1}) \mid s, a]$. The optimal value and Q-functions, under the optimal policy, are denoted $V^*, Q^*$. We focus on estimating the difference of Q-functions (each under the same policy), $\tau_t^\pi(s) = Q_t^\pi(s,1) - Q_t^\pi(s,0)$. (This differs slightly from the conventional advantage function studied in RL, defined as $Q^\pi(s,a) - V^\pi(s)$, where the contrast being estimated depends on the policy.) We focus on the offline reinforcement learning setting where we have access to a dataset of $n$ offline trajectories, $\mathcal{D} = \{(S_t^{(i)}, A_t^{(i)}, R_t^{(i)}, S_{t+1}^{(i)})_{t=1}^T\}_{i=1}^n$, where actions were taken according to some behavior policy $\pi^b$. We assume throughout that the behavior policy is stationary, and that the offline trajectories (drawn potentially from a series of episodes) are independent.
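For concreteness, a minimal sketch of one way such an offline dataset could be stored in code; the container name and array shapes are illustrative assumptions, not notation from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class OfflineTrajectories:
    """Offline dataset D of n trajectories over horizon T (illustrative layout)."""
    states: np.ndarray       # shape (n, T, d): S_t^{(i)}
    actions: np.ndarray      # shape (n, T), integer-coded: A_t^{(i)}
    rewards: np.ndarray      # shape (n, T): R_t^{(i)}
    next_states: np.ndarray  # shape (n, T, d): S_{t+1}^{(i)}

    @property
    def n(self) -> int:
        return self.states.shape[0]

    @property
    def horizon(self) -> int:
        return self.states.shape[1]
```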
We state some notational conventions. For a generic function $f$ we define the norm $\|f\|_u := \mathbb{E}[\|f(X)\|^u]^{1/u}$. In the context of estimation (rather than discussing identification), we denote the true population functions with a $\circ$ superscript, i.e. $\tau_t^{\pi,\circ}$ and so on.

2.2 Policy evaluation

Identification First we give an overview of the derivation of our estimating moments. The arguments are broadly a generalization of the so-called residualized R-learner [NW21]; [LS21] considers a similar generalization for structural nested mean models without state-dependent heterogeneity. For the purposes of this section, we discuss the true population $Q, m, \tau$ functions without notational decoration, which we introduce later on when we discuss estimation.

Denote $\underline{\pi}_{t+1} = \pi_{t+1:T} := \{\pi_{t+1}, \dots, \pi_T\}$ for brevity. Then $Q_t^{\underline{\pi}_{t+1}}$ indicates the $Q_t$ function under policy $\pi$. For brevity, we further abbreviate $Q_t^\pi := Q_t^{\underline{\pi}_{t+1}}$ when this is unambiguous. We seek to estimate:
$$\tau_t^\pi(S_t) = Q_t^\pi(S_t, 1) - Q_t^\pi(S_t, 0). \quad (1)$$
Note that the Q-function satisfies $Q_t^\pi(S_t, A_t) = \mathbb{E}[R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) \mid S_t, A_t]$. Define
$$\epsilon_t^{(i)}(A_t) = R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) - \{Q_t^\pi(S_t, 0) + A_t \tau_t^\pi(S_t)\}.$$
Under sequential unconfoundedness and Markovian properties, we obtain the conditional moment:
$$\mathbb{E}[\epsilon_t^{(i)}(A_t) \mid S_t, A_t] = 0. \quad (2)$$
Define the analogue of the marginal outcome function, which is the state-conditional value function under the behavior policy: $m_t^\pi(S_t) = V_t^{\pi_t^b, \underline{\pi}_{t+1}}(S_t) = \mathbb{E}_{\pi_t^b}[R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) \mid S_t]$. Under sequential unconfoundedness,
$$R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) = Q_t^\pi(S_t, 0) + A_t \tau_t^\pi(S_t) + \epsilon_t(A_t),$$
$$m_t^\pi(S_t) = \mathbb{E}_{\pi^b}[R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) \mid S_t] = Q_t^\pi(S_t, 0) + \pi^b(1 \mid S_t)\,\tau_t^\pi(S_t).$$
Hence,
$$R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) - m_t^\pi(S_t) = (A_t - \pi^b(1 \mid S_t))\,\tau_t^\pi(S_t) + \epsilon_t(A_t). \quad (3)$$

Extension to multiple actions. So far we have presented the method with $A \in \{0,1\}$ for simplicity, but all methods in this paper extend to the multi-action case. For multiple actions, fix a choice $a_0 \in \mathcal{A}$, and for $a \in \mathcal{A} \setminus \{a_0\}$, define $\tau_t^\pi(s, a) := \tau_{a,t}^\pi(s) = Q_t^\pi(s, a) - Q_t^\pi(s, a_0)$. For $k \in \mathcal{A}$, let $\pi^b(k \mid S_t) = P(A_t = k \mid S_t)$. Redefine $\epsilon_t^{(i)}(A_t) = R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) - \{Q_t^\pi(S_t, a_0) + \mathbb{I}[A_t = a]\,\tau_{a,t}^\pi(S_t)\}$. Then the equivalent of eq. (3) is that $\tau_{a,t}^\pi(S_t)$ satisfies:
$$R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) - m_t^\pi(S_t) = (\mathbb{I}[A_t = a] - \pi^b(a \mid S_t))\,\tau_{a,t}^\pi(S_t) + \epsilon_t(A_t).$$

The loss function. This motivates the approach based on (penalized) empirical risk minimization:
$$\tau_t(\cdot) \in \operatorname*{argmin}_\tau \mathbb{E}\Big[\big( \{R_t + \gamma Q_{t+1}^\pi(S_{t+1}, A_{t+1}) - m_t^\pi(S_t)\} - \{A_t - \pi_t^b(1 \mid S_t)\}\cdot\tau_t^\pi(S_t) \big)^2\Big]. \quad (4)$$
Again, so far we have discussed identification assuming the true $Q, m, \pi^b$ functions, etc. Later on we relax this assumption of known nuisance functions, and outside of this section we refer to the population-level true nuisance functions as $Q^{\pi,\circ}, m^{\pi,\circ}, \pi^{b,\circ}, \tau^{\pi,\circ}$.

Feasible estimation In practice, the nuisance functions need to be estimated. We introduce some notation before defining the full estimation algorithm. Let the nuisance vector be denoted $\eta = [\{Q_t^\pi\}_{t=1}^T, \{m_t^\pi\}_{t=1}^T, \{\pi_t^b\}_{t=1}^T]$. The fitted advantage R-learner for evaluation is a feasible version of the sequential loss minimization approach implied by eq. (4); we describe the algorithm in Algorithm 1. Given an evaluation policy $\pi_e$, first fit the nuisance functions: a pilot estimate of the Q-function and the behavior policy. Then, evaluate the loss function in eq. (4) and estimate $\tau_t$.

Algorithm 1 Dynamic Residualized Difference-of-Q Evaluation
1: Given: $\pi_e$, the evaluation policy; and, for sample splitting, a partition of $\mathcal{D}$ into $K$ folds, $\{\mathcal{D}_k\}_{k=1}^K$.
2: On $\mathcal{D}_k$, estimate $\hat{Q}^{\pi_e,k}(s, a)$ and the behavior policy $\hat{\pi}_t^{b,k}(s)$. Evaluate the value function via integrating/summing $\hat{Q}$ over the empirical distribution of actions, $a \sim \pi^b$, observed in the data, so that $\hat{m}(s) = \mathbb{E}_{\pi_t^b}[R_t + \gamma \hat{Q}_{t+1}^{\pi_e}(S_{t+1}, A_{t+1}) \mid S_t = s]$.
3: for timestep $t = T, \dots, 1$ do
4: $\quad \hat{\tau}_t \in \operatorname*{argmin}_{\tau} \sum_{k=1}^K \sum_{i \in \mathcal{D}_k} \Big( R_t^{(i)} + \gamma \hat{Q}_{t+1}^{\pi,-k}(S_{t+1}^{(i)}, A_{t+1}^{(i)}) - \hat{m}_t^{\pi,-k}(S_t^{(i)}) - \{A_t^{(i)} - \hat{\pi}_t^{b,-k}(1 \mid S_t^{(i)})\}\,\tau_t(S_t^{(i)}) \Big)^2$
5: end for
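As a minimal sketch (assuming binary actions and generic scikit-learn regressors; the helper name `fit_tau_t` and its signature are our own illustrative choices, not the paper's API), the inner minimization in step 4 can be implemented as a weighted regression, using the identity $\sum_i (y_i - d_i\,\tau(x_i))^2 = \sum_i d_i^2\,(y_i/d_i - \tau(x_i))^2$ for $d_i \neq 0$:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_tau_t(S_t, A_t, R_t, S_tp1, A_tp1, q_hat_tp1, m_hat_t, pi_b_hat_t, gamma=0.99):
    """One timestep of Algorithm 1: residualized difference-of-Q regression.

    q_hat_tp1(s, a): cross-fitted (out-of-fold) estimate of Q_{t+1}^{pi_e}
    m_hat_t(s):      cross-fitted estimate of m_t^{pi}
    pi_b_hat_t(s):   cross-fitted estimate of pi_t^b(1 | s)
    """
    # Outcome residual: R_t + gamma * Q_{t+1}(S_{t+1}, A_{t+1}) - m_t(S_t)
    outcome_resid = R_t + gamma * q_hat_tp1(S_tp1, A_tp1) - m_hat_t(S_t)
    # Treatment residual: A_t - pi_t^b(1 | S_t); overlap keeps this away from zero.
    treat_resid = A_t - pi_b_hat_t(S_t)
    # Weighted-regression form of the squared loss in step 4.
    pseudo_target = outcome_resid / treat_resid
    weights = treat_resid ** 2
    tau_model = GradientBoostingRegressor()
    tau_model.fit(S_t, pseudo_target, sample_weight=weights)
    return tau_model  # tau_model.predict(s) approximates tau_t^{pi_e}(s)
```

Looping this over $t = T, \dots, 1$ with fold-wise nuisance estimates recovers the backward pass of Algorithm 1.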

Algorithm 2 Dynamic Residualized Difference-of-Q Optimization
1: Given: a partition of $\mathcal{D}$ into 3 folds, $\{\mathcal{D}_k\}_{k=1}^3$.
2: Estimate $\hat{\pi}_t^b$ on $\mathcal{D}_1$.
3: At time $T$: set $\hat{Q}_{T+1}(s, a) = 0$. Estimate $m_T = \mathbb{E}_{\pi^b}[R_T \mid S_T]$ on $\mathcal{D}_1$ and $\hat{\tau}_T$ on $\mathcal{D}_{k(T)}$, where $k(t) = 2$ if $t$ is odd and $k(t) = 3$ if $t$ is even. Optimize $\hat{\pi}_T(s) \in \arg\max \hat{\tau}_T(s)$.
4: for timestep $t = T-1, \dots, 1$ do
5: $\quad$ Estimate $Q^{\hat{\pi}_{t+1}}$ on $\mathcal{D}_1$. Evaluate $m_t^{\hat{\pi}_{t+1}}$. Estimate $\hat{\tau}_t^{\hat{\pi}_{t+1}}$ on $\mathcal{D}_{k(t)}$ by minimizing the empirical loss:
$$\hat{\tau}_t(\cdot) \in \operatorname*{argmin}_{\tau} \sum_{i \in \mathcal{D}_{k(t)}} \Big( R_t^{(i)} + \gamma \hat{Q}_{t+1}^{\pi,(1)}(S_{t+1}^{(i)}, A_{t+1}^{(i)}) - \hat{m}_t^{\pi,(1)}(S_t^{(i)}) - \{A_t^{(i)} - \hat{\pi}_t^{b,(1)}(1 \mid S_t^{(i)})\}\,\tau_t(S_t^{(i)}) \Big)^2$$
6: $\quad$ Optimize $\hat{\pi}_t(s) \in \arg\max \hat{\tau}_t^{\hat{\pi}_{t+1}}(s)$.
7: end for

Cross-fitting. We also introduce cross-fitting, which will differ slightly between policy evaluation and optimization: we split the dataset $\mathcal{D}$ into $K$ folds (preserving trajectories, i.e., randomizing over the trajectory index $i$), and learn the nuisance function $\eta^{-k}$ on $\{\mathcal{D}_{k'}\}_{k' \in [K]\setminus k}$. (In scenarios with possible confusion we denote the nuisance function $\eta^{(-k)}$ instead.) In evaluating the loss function, we evaluate the nuisance function $\eta^{-k}$ using data from the held-out $k$-th fold. Given the cross-fitting procedure, we introduce the empirical squared loss function:
$$\hat{L}_t(\tau, \eta) = \sum_{k=1}^K \sum_{i \in \mathcal{D}_k}\Big( R_t^{(i)} + \gamma \hat{Q}_{t+1}^{\pi,-k}(S_{t+1}^{(i)}, A_{t+1}^{(i)}) - \hat{m}_t^{\pi,-k}(S_t^{(i)}) - \{A_t^{(i)} - \hat{\pi}_t^{b,-k}(1 \mid S_t^{(i)})\}\,\tau_t(S_t^{(i)}) \Big)^2,$$
and let the population loss function $L_t(\tau, \eta)$ be the population expectation of the above.
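A short sketch of trajectory-preserving fold construction (illustrative; the paper does not prescribe a particular splitting utility):

```python
import numpy as np

def trajectory_folds(n_trajectories: int, K: int, seed: int = 0):
    """Partition trajectory indices i = 0, ..., n-1 into K folds.

    Splitting at the trajectory level (rather than the transition level) keeps all
    transitions of a trajectory in one fold, so that eta^{-k} trained on the other
    folds is evaluated only on held-out trajectories.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_trajectories)  # randomize over trajectory index i
    return np.array_split(perm, K)          # list of K arrays of trajectory indices
```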

Estimating the nuisance functions. The Q-function nuisance can be estimated with a variety of approaches, such as fitted-Q-evaluation [LVY19, CM14, DJL21], other approaches in offline reinforcement learning, minimum-distance estimation for conditional moment restrictions/GMM [KU19a], or the finite-horizon analogue of the DR-learner suggested in [SLZS22]. Estimating the behavior policy is a classic probabilistic classification or multi-class classification problem. Sometimes the offline trajectories might arise from a system with known exploration probabilities, so that the behavior policy might be known.
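A sketch of simple nuisance estimators under the illustrative data layout above; backward-recursive fitted-Q-evaluation with gradient boosting and a logistic behavior-policy model are just two of the options listed, chosen here for brevity:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def fit_behavior_policy(S, A):
    """Probabilistic classification of A_t given S_t; returns s -> P(A = 1 | s)."""
    clf = LogisticRegression(max_iter=1000).fit(S, A)
    return lambda s: clf.predict_proba(s)[:, 1]

def fitted_q_evaluation(data, pi_e, gamma=0.99):
    """Backward-recursive fitted-Q-evaluation over a finite horizon.

    data: trajectory arrays as in OfflineTrajectories
    pi_e: callable pi_e(t, s_batch) -> actions of the evaluation policy
    Returns q_hat, a list with q_hat[t](s, a) approximating Q_t^{pi_e}(s, a).
    """
    n, T = data.rewards.shape
    q_hat = [None] * T
    for t in reversed(range(T)):
        S, A, R, Sp = data.states[:, t], data.actions[:, t], data.rewards[:, t], data.next_states[:, t]
        if t + 1 < T:
            a_next = pi_e(t + 1, Sp)            # next action under pi_e
            cont = q_hat[t + 1](Sp, a_next)     # continuation value
        else:
            cont = np.zeros(n)                  # terminal convention Q_{T+1} = 0
        target = R + gamma * cont               # Bellman target for Q_t^{pi_e}
        model = GradientBoostingRegressor().fit(np.column_stack([S, A]), target)
        q_hat[t] = (lambda m: lambda s, a: m.predict(np.column_stack([s, a])))(model)
    return q_hat
```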

2.3 Policy optimization

The sequential loss minimization approach also admits a policy optimization procedure. The policy at every timestep is greedy with respect to the estimated $\tau$. We describe the algorithm in Algorithm 2. We use a slightly different cross-fitting approach for policy optimization. We introduce an additional fold, upon which we alternate estimation of $\hat{\tau}_t$. So, overall we use three folds: one for estimating the nuisance functions $\eta$, and the other two for estimating $\hat{\tau}_t^{\hat{\pi}_{t+1}}$. On these two other folds, between every timestep, we alternate estimation of $\hat{\tau}_t$ on one of them, in order to break dependence between the estimated optimal forwards policy $\hat{\underline{\pi}}_{t+1}$ and $\hat{\tau}_t$ (and therefore the greedy policy $\hat{\pi}_t$).
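A schematic sketch of the backward recursion of Algorithm 2, reusing the hypothetical helpers above (`fit_tau_t`, nuisance estimators); the fold alternation $k(t)$ and fold restriction are abstracted into the assumed `estimate_nuisances` callable for brevity:

```python
import numpy as np

def optimize_policy(data, estimate_nuisances, gamma=0.99):
    """Backward recursion: estimate tau_t under the forward greedy policy, then act greedily.

    estimate_nuisances(t, pi_forward) -> (q_hat_tp1, m_hat_t, pi_b_hat_t): assumed callable
        returning cross-fitted nuisances on the nuisance fold under policy pi_forward.
    Returns tau_models and greedy_policy(t, s) = 1{tau_t(s) > 0} (binary-action case).
    """
    n, T = data.rewards.shape
    tau_models = [None] * T

    def greedy_policy(t, s):
        # Greedy with respect to the estimated difference-of-Q function.
        return (tau_models[t].predict(s) > 0).astype(float)

    for t in reversed(range(T)):
        # The forward policy pi_hat_{t+1:T} is already defined by tau_models[t+1:].
        q_hat_tp1, m_hat_t, pi_b_hat_t = estimate_nuisances(t, greedy_policy)
        S, A, R, Sp = data.states[:, t], data.actions[:, t], data.rewards[:, t], data.next_states[:, t]
        A_next = greedy_policy(t + 1, Sp) if t + 1 < T else np.zeros(n)
        tau_models[t] = fit_tau_t(S, A, R, Sp, A_next, q_hat_tp1, m_hat_t, pi_b_hat_t, gamma=gamma)
    return tau_models, greedy_policy
```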

3 Analysis

Our analysis generally proceeds under the following assumptions.
Assumption 1 (Independent and identically distributed trajectories). We assume that the data was collected under a stationary behavior policy, i.e. not adaptively collected from a policy learning over time.
Assumption 2 (Sequential unconfoundedness). $R_t(a) \perp\!\!\!\perp A_t \mid S_t$ and $S_{t+1}(a) \perp\!\!\!\perp A_t \mid S_t$.

This assumption posits that the state space is sufficient, i.e. [az: ]
Assumption 3 (Boundedness). $V_t \le B_V$, $\tau \le B_\tau$.

[az: ] The following structural assumptions relate more specifically to our method and analysis.
Assumption 4 (Bounded transition density). Transitions have bounded density: $P(s' \mid s, a) \le c$. Let $d^\pi(s)$ denote the marginal state distribution under policy $\pi$. Assume that $d_t^{\pi^b}(s) < c$, for $t = 1, \dots, T$.

[az: explain]


Next we establish convergence rates for $\hat{\tau}^\pi$, depending on convergence rates of the nuisance functions. Broadly we follow the analysis of [FS19, LS21] for orthogonal statistical learning. The analysis considers some $\hat{\tau}$ with small excess risk relative to the projection onto the function class, i.e. as might arise from an optimization algorithm with some approximation error.

For a fixed evaluation policy $\pi^e$, define the projection of the true advantage function onto $\Psi^n$,
$$\tau_t^{\pi^e, n} = \arg\inf_{\tau_t \in \Psi_t^n} \|\tau_t - \tau_t^{\circ,\pi^e}\|_2. \quad (5)$$
For a fixed evaluation policy $\pi^e$, define the error of some estimate $\hat{\tau}_t^{\pi^e}$ to the projection onto the function class:
$$\nu_t^{\pi^e} = \hat{\tau}_t^{\pi^e} - \tau_t^{n,\pi^e}.$$

Theorem 1 (Policy evaluation). Suppose $\sup_{s,t} \mathbb{E}[(A_t - \pi_t^b)(A_t - \pi_t^b) \mid s] \le C$ and Assumptions 1 to 4. Consider a fixed evaluation policy $\pi^e$. Consider any estimation algorithm that produces an estimate $\hat{\tau}^{\pi^e} = \{\hat{\tau}_1^{\pi^e}, \dots, \hat{\tau}_T^{\pi^e}\}$ with small plug-in excess risk at every $t$, with respect to any generic candidate $\tilde{\tau}^{\pi^e}$, at some nuisance estimate $\hat{\eta}$, i.e.,
$$L_{D,t}(\hat{\tau}_t^{\pi^e}; \hat{\eta}) - L_{D,t}(\tilde{\tau}_t^{\pi^e}; \hat{\eta}) \le \epsilon(\tau_t^n, \hat{\eta}).$$
Let $\rho_t$ denote the product error terms:
$$\rho_t^{\pi^e}(\hat{\eta}) = B_\tau^2 \|(\hat{\pi}_t^b - \pi_t^{b,\circ})^2\|_u + B_\tau \|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{m}_t^{\pi^e} - m_t^{\pi^e,\circ})\|_u + \gamma\big( B_\tau \|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{Q}_{t+1}^{\pi^e} - Q_{t+1}^{\pi^e,\circ})\|_u + \|(\hat{m}_t^{\pi^e} - m_t^{\pi^e,\circ})(\hat{Q}_{t+1}^{\pi^e} - Q_{t+1}^{\pi^e,\circ})\|_u \big). \quad (6)$$
Then, for $\sigma > 0$ and $u^{-1} + \bar{u}^{-1} = 1$,
$$\frac{\lambda}{2}\|\nu_t^{\pi^e}\|_2^2 - \frac{\sigma}{4}\|\nu_t^{\pi^e}\|_{\bar{u}}^2 \le \epsilon(\hat{\tau}_t^{\pi^e}, \hat{\eta}) + \frac{2}{\sigma}\big( \|\tau^{\pi^e,\circ} - \tau_t^{\pi^e,n}\|_u^2 + \rho_t^{\pi^e}(\hat{\eta})^2 \big).$$
In the above theorem, $\epsilon(\hat{\tau}_t^{\pi^e}, \hat{\eta})$ is the excess risk of the empirically optimal solution. The bias term is $\|\tau^{\pi^e,\circ} - \tau_t^{\pi^e,n}\|_u^2$, which describes the model misspecification bias of the function class parametrizing Q-function contrasts, $\Psi$. The product error terms $\rho_t^{\pi^e}(\hat{\eta})$ highlight the reduced dependence on individual nuisance error rates. We will instantiate the previous generic theorem for the projection onto $\Psi^n$, defined in Equation (5), also accounting for the sample splitting. We will state the results with local Rademacher complexity, which we now introduce. For generic 1-bounded functions $f$ in a function space $\mathcal{F}$, $f \in [-1, 1]$, the local Rademacher complexity is defined as follows:
$$\mathcal{R}_n(\mathcal{F}; \delta) = \mathbb{E}_{\epsilon_{1:n}, X_{1:n}}\Big[ \sup_{f \in \mathcal{F}: \|f\|_2 \le \delta} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i) \Big].$$
The critical radius $\delta$ more tightly quantifies the statistical complexity of a function class, and is any solution to the so-called basic inequality $\mathcal{R}_n(\mathcal{F}; \delta) \le \delta^2$. The star hull of a generic function class $\mathcal{F}$ is defined as $\mathrm{star}(\mathcal{F}) = \{cf : f \in \mathcal{F}, c \in [0, 1]\}$. Bounds on the critical radius of common function classes like linear and polynomial models, deep neural networks, etc. can be found in standard references on statistical learning theory, e.g. [Wai19]. [az: how does critical radius scale] We can obtain mean-squared error rates for policy evaluation by specializing Theorem 1 to the 2-norm and leveraging results from [FS19].

Theorem 2 (MSE rates for policy evaluation). Suppose $\sup_{s,t} \mathbb{E}[(A_t - \pi_t^b)(A_t - \pi_t^b) \mid s] \le C$ and Assumptions 1 to 4. Consider a fixed policy $\pi^e$. Suppose each of $\mathbb{E}[\|\hat{\pi}_t^b - \pi_t^{b,\circ}\|_2^2]$, $\mathbb{E}[\|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{m}_t^{\pi^e} - m_t^{\pi^e,\circ})\|_2^2]$, $\mathbb{E}[\|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{Q}_{t+1}^{\pi^e} - Q_{t+1}^{\pi^e,\circ})\|_2^2]$, and $\mathbb{E}[\|(\hat{m}_t^{\pi^e} - m_t^{\pi^e,\circ})(\hat{Q}_{t+1}^{\pi^e} - Q_{t+1}^{\pi^e,\circ})\|_2^2]$ is of order $O(\delta_{n/2}^2 + \|\tau_t^{\pi^e,\circ} - \tau_t^{\pi^e,n}\|_2^2)$. Then
$$\mathbb{E}[\|\hat{\tau}_t^{\pi^e} - \tau_t^{\pi^e,\circ}\|_2^2] = O\big( \delta_{n/2}^2 + \|\tau_t^{\pi^e,\circ} - \tau_t^{\pi^e,n}\|_2^2 \big).$$
Remark 1. Working with the orthogonalized estimate results in the weaker product-error rate requirements included above. However, our estimating moments do include the Q-function nuisances, and quarter-root rates are required for estimating both the Q and $\pi^b$ functions.

3.1 Policy optimization

Convergence of $\tau_t$ implies convergence in policy value. We quantify this with the margin assumption, a low-noise condition that quantifies the gap between regions of different optimal action. It is commonly invoked to relate estimation error of plug-in quantities to decision regions; in this case it relates error in the difference-of-Q functions to convergence of optimal decision values.
Assumption 5 (Margin [Tsy04]). Assume there exist some constants $\alpha, \delta_0 > 0$ such that
$$P\Big( \max_a Q^*(s, a) - \max_{a' \in \mathcal{A} \setminus \arg\max_a Q^*(s,a)} Q^*(s, a') \le \epsilon \Big) = O(\epsilon^\alpha).$$

The probability density in Assumption 5 is evaluated with respect to Lebesgue measure over the state space. [az: add exposition on margin]
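As an illustration (our own worked example, not taken from the cited references): write the optimality gap as $\Delta(s) = \max_a Q^*(s,a) - \max_{a' \notin \arg\max_a Q^*(s,a)} Q^*(s,a')$. If the gap has a bounded density $f_\Delta$ near zero under the relevant state measure, then
$$P(\Delta(s) \le \epsilon) = \int_0^{\epsilon} f_\Delta(u)\, du \le \|f_\Delta\|_\infty\, \epsilon = O(\epsilon^1),$$
so Assumption 5 holds with $\alpha = 1$; if instead actions are well separated, $\Delta(s) \ge \Delta_{\min} > 0$ almost everywhere, the probability vanishes for $\epsilon < \Delta_{\min}$ and the assumption holds for arbitrarily large $\alpha$.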
Lemma 1 (Advantage estimation error to policy value via margin). Suppose Assumptions 2, 4 and 5 (the margin assumption holds with $\alpha$). Suppose that with high probability, for some $b_* > 0$, [az: where does $O(n^{-\kappa})$ come from?]
$$\sup_{s \in \mathcal{S}, a \in \mathcal{A}} |\hat{\tau}_t^{\hat{\pi}_{t+1}}(s) - \tau_t^{\pi^*_{t+1},\circ}(s)| \le K n^{-b_*},$$
then
$$\mathbb{E}[V_t^*(S_t) - V_t^{\hat{\pi}_{\hat{\tau}}}(S_t)] \le \frac{(1 - \gamma^{T-t})}{1 - \gamma}\, c K^2 n^{-b_*(1+\alpha)} + O(n^{-\kappa}),$$
and
$$\Big\{ \int \big(Q_t^*(s, \pi^*(s)) - Q_t^*(s, \hat{\pi}_{\hat{\tau}})\big)^2\, ds \Big\}^{1/2} \le \frac{(1 - \gamma^{T-t})}{1 - \gamma}\, c K^2 n^{-b_*(1+\alpha)} + O(n^{-\kappa}).$$
Else, assume that $\mathbb{E}\Big[ \int_{s \in \mathcal{S}} |\hat{\tau}_t^n(s) - \tau_t^\circ(s)|^2\, ds \Big]^{1/2} \le K n^{-b_*}$ [az: check superscripts] for some $b_* > 0$. Then
$$\mathbb{E}[V_t^*(S_t) - V_t^{\hat{\pi}_{\hat{\tau}}}(S_t)] = O\big(n^{-b_* \frac{2+2\alpha}{2+\alpha}}\big), \quad \text{and} \quad \Big\{ \int \big(Q_t^*(s, \pi^*(s)) - Q_t^*(s, \hat{\pi}_{\hat{\tau}})\big)^2\, ds \Big\}^{1/2} = O\big(n^{-b_* \frac{2+2\alpha}{2+\alpha}}\big).$$

Next we study policy optimization. Studying the convergence of policy optimization requires conditions on the convergence of advantage functions from previous steps of the algorithm.
Theorem 3 (Policy optimization bound). Suppose Assumptions 1, 2 and 4 and ??. Further, suppose that $Q^\circ$ satisfies Assumption 5 (margin) with $\alpha > 0$. Suppose the product error rate conditions hold for each $t$ for the data-optimal policies evaluated along the algorithm, i.e. for each $t$, for $\hat{\underline{\pi}}_{t+1}$, each of $\mathbb{E}[\|\hat{\pi}_t^b - \pi_t^{b,\circ}\|_2^2]$, $\mathbb{E}[\|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{m}_t^{\hat{\pi}_{t+1}} - m_t^{\circ,\hat{\pi}_{t+1}})\|_2^2]$, $\mathbb{E}[\|(\hat{\pi}_t^b - \pi_t^{b,\circ})(\hat{Q}_{t+1}^{\hat{\pi}_{t+2}} - Q_{t+1}^{\circ,\hat{\pi}_{t+2}})\|_2^2]$, and $\mathbb{E}[\|(\hat{m}_t - m_t^\circ)(\hat{Q}_{t+1}^{\hat{\pi}_{t+2}} - Q_{t+1}^{\circ,\hat{\pi}_{t+2}})\|_2^2]$ is of order $O(\delta_{n/2}^2 + \|\tau_t^{\hat{\pi}_{t+1},\circ} - \tau_t^{\hat{\pi}_{t+1},n}\|_2^2)$. [az: update this to be estimation error conditioned on the algorithm filtration] [az: if we add extra sample splitting, update extra notation of $\delta_{n/2}$]
Suppose that for $\hat{\pi}_t$, under the above assumptions, Theorem 2 holds, and the critical radius $\delta_{n/2}$ and, for time $t$, the function class misspecification error $\|\tau_t^{\hat{\pi}_{t+1},\circ} - \tau_t^{\hat{\pi}_{t+1},n}\|_2$ satisfy the root-mean-squared-error rate conditions $\rho_t^{(c)}, \rho_t^{(\Psi)}$:
$$\delta_{n/2}^2 = K_r^2\, n^{-2\rho_t^{(c)}}, \qquad \|\tau_t^{\hat{\pi}_{t+1},\circ} - \tau_t^{\hat{\pi}_{t+1},n}\|_2^2 = K_\Psi^2\, n^{-2\rho_t^{(\Psi)}}.$$
Further define, for a generic $t$, $\rho_{\ge t}^{(\cdot)} = \min_{t' \ge t}\{\rho_{t'}^{(\cdot)}\}$, for $(\cdot) \in \{(c), (\Psi)\}$. Then [az: with high probability]
$$\|\hat{\tau}_t^{\hat{\pi}_{t+1}} - \tau_t^{\circ,\pi^*_{t+1}}\| \le O\big(\delta_{n/2} + \|\tau_t^{\circ,\hat{\pi}_{t+1}} - \tau_t^{n,\hat{\pi}_{t+1}}\|_2\big) + K n^{-R_t}, \quad (7)$$
where
$$R_k = \min\Big\{ \rho_{k+1}^{(c)} \cdot \tfrac{2+2\alpha}{2+\alpha},\; \rho_{k+1}^{(\Psi)} \cdot \tfrac{2+2\alpha}{2+\alpha},\; \min_{k' \ge k+1}\big(\rho_{k'}^{(c)}, \rho_{k'}^{(\Psi)}\big) \cdot \big(\tfrac{2+2\alpha}{2+\alpha}\big)^{T-k} \Big\}.$$
Further suppose that $\alpha > 0$ and that for $t' \ge t$ we have $\rho_t^{(\cdot)} \le \rho_{t'}^{(\cdot)}$, for $(\cdot) \in \{(c), (\Psi)\}$, i.e. the estimation error rate is nonincreasing over time. Then
$$\|\hat{\tau}_t^{\hat{\pi}_{t+1}} - \tau_t^{\circ,\pi^*_{t+1}}\| \le O\big(\delta_{n/2} + \|\tau_t^{\circ,\hat{\pi}_{t+1}} - \tau_t^{n,\hat{\pi}_{t+1}}\|_2\big), \quad (8)$$
and
$$\mathbb{E}[V_1^{\pi^*}(S_1) - V_1^{\hat{\pi}_{\hat{\tau}}}(S_1)] = O\big(n^{-\min\{\rho_{\ge 1}^{(c)}, \rho_{\ge 1}^{(\Psi)}\}\frac{2+2\alpha}{2+\alpha}}\big).$$

Our method introduces auxiliary estimation at every timestep, so that the exponentiated terms are higher-order relative to the difference-of-Q one-step estimation error at every timestep.
Remark 2. Other work also studies fast rates in offline reinforcement learning via local Rademacher complexity [DJL21] or the margin assumption of [Tsy04] [HKU24, SZLS22]. In our work, we also use a margin assumption to relate estimated advantage functions to their policy regret, although we introduce an auxiliary estimation step of the advantage function between policy updates, and so qualitatively the estimation implications of the margin assumption are quite different. Note that [HKU24] also establishes margin constants for linear and tabular MDPs.

3.2 Limitations

[az: to fill]

4 Experiments

4.1

One of the motivations mentioned earlier was recent work that points out joint implications of certain blockwise conditional independence properties of rewards and transition probabilities, with some components being "exogenous" in the sense of being reward-irrelevant or action-irrelevant. Most papers take a model-based approach, filtering out irrelevant factors via estimation of rewards and transitions under conditional independences. [Zho24] considers a variant of the exogenous estimation model that implies sparse optimal policies. [az: fill in]. Unfortunately, [Zho24] assumes this structure holds. Although some recent works suggest the equivalence of pre-testing for the presence of such conditional independences, pre-testing has poor statistical properties vis-a-vis the overall procedure. Therefore, if there is some uncertainty about the underlying structure, it may be unwise to rely on an estimation procedure specialized to it. On the other hand, different underlying structures may imply the same sparsity pattern on the advantage function. For example, under the "exo-endo" decomposition of [DTC18], the advantage function is sparse.
Figure 1: Reward-relevant/irrelevant factored dynamics. The dotted line from $a_t$ to $s^c_{t+1}$ indicates that its presence or absence is permitted in the model.

[PS23] studies whether advantage function estimation would therefore naturally recover the endogenous component under this model, although it considers an online learning setting, whereas we obtain statistical improvements in the offline setting; hence the methods are not directly comparable.
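To make the exo-endo structure concrete, the following is a small illustrative simulator (our own toy construction, not the paper's experimental setup): the state splits into an endogenous block that drives rewards and an exogenous block whose dynamics ignore the action, so the difference-of-Q function is sparse in the exogenous coordinates.

```python
import numpy as np

def simulate_exo_endo(n=500, T=4, d_endo=2, d_exo=8, seed=0):
    """Toy factored MDP: rewards depend only on the endogenous block and the action,
    and the exogenous block evolves independently of actions, so tau_t(s) is sparse
    in the exogenous coordinates. Purely illustrative; constants are arbitrary."""
    rng = np.random.default_rng(seed)
    d = d_endo + d_exo
    S = np.zeros((n, T + 1, d))
    A = np.zeros((n, T), dtype=int)
    R = np.zeros((n, T))
    S[:, 0] = rng.normal(size=(n, d))
    theta = rng.normal(size=d_endo)                 # reward weights on endogenous block only
    for t in range(T):
        s_endo, s_exo = S[:, t, :d_endo], S[:, t, d_endo:]
        A[:, t] = rng.binomial(1, 0.5, size=n)      # logging policy: uniform exploration
        R[:, t] = (s_endo @ theta) * (2 * A[:, t] - 1) + 0.1 * rng.normal(size=n)
        # Endogenous transition depends on the action; exogenous transition does not.
        S[:, t + 1, :d_endo] = 0.9 * s_endo + 0.5 * A[:, t][:, None] + 0.1 * rng.normal(size=(n, d_endo))
        S[:, t + 1, d_endo:] = 0.9 * s_exo + 0.1 * rng.normal(size=(n, d_exo))
    return S, A, R
```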
[az: redo experiments]

5 Conclusion

We developed methodology for orthogonalized estimation of contrasts of Q-functions, which enables practical policy optimization. Orthogonalization helps reduce the dependence of estimation error on nuisance functions, while targeting the Q-function contrasts allows us to adapt to the structure that is relevant for decision-making. Important directions also include reducing dependence on assumptions, both for identification and estimation.

References

[1]

[BSZ23] David Bruns-Smith and Angela Zhou. Robust fitted-Q-evaluation and iteration under sequentially exogenous unobserved confounders. arXiv preprint arXiv:2302.00662, 2023.

[CCD+18] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.

[CHK+24] Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. Applied causal inference powered by ML and AI. rem, 12(1):338, 2024.

[CM14] Bibhas Chakraborty and Susan A Murphy. Dynamic treatment regimes. Annual Review of Statistics and Its Application, 1:447–464, 2014.

[DJL21] Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and Rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning, pages 2892–2902. PMLR, 2021.

[DTC18] Thomas Dietterich, George Trimponias, and Zhitang Chen. Discovering and removing exogenous state variables and rewards for reinforcement learning. In International Conference on Machine Learning, pages 1262–1270. PMLR, 2018.

[FS19] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

[HKU24] Yichun Hu, Nathan Kallus, and Masatoshi Uehara. Fast rates for the regret of offline reinforcement learning. Mathematics of Operations Research, 2024.

[JL16] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[JYW21] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.

[Ken20] Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.

[Ken22] Edward H Kennedy. Semiparametric doubly robust targeted double machine learning: a review. arXiv preprint arXiv:2203.06469, 2022.

[KU19a] Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019.

[KU19b] Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019.

[KZ22] Nathan Kallus and Angela Zhou. Stateful offline contextual policy evaluation and learning. In International Conference on Artificial Intelligence and Statistics, pages 11169–11194. PMLR, 2022.

[LLTZ18] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.

[LS21] Greg Lewis and Vasilis Syrgkanis. Double/debiased machine learning for dynamic treatment effects. Advances in Neural Information Processing Systems, 34, 2021.

[LTL19] Fan Li, Laine E Thomas, and Fan Li. Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology, 188(1):250–257, 2019.

[LVY19] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.

[MDV23] Pawel Morzywolek, Johan Decruyenaere, and Stijn Vansteelandt. On a general class of orthogonal learners for the estimation of heterogeneous treatment effects. arXiv preprint arXiv:2303.12687, 2023.

[NBW21] Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning when-to-treat policies. Journal of the American Statistical Association, 116(533):392–409, 2021.

[NP08] Gerhard Neumann and Jan Peters. Fitted Q-iteration by advantage weighted regression. Advances in Neural Information Processing Systems, 21, 2008.

[NW21] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

[PS23] Hsiao-Ru Pan and Bernhard Schölkopf. Learning endogenous representation in reinforcement learning via advantage estimation. In Causal Representation Learning Workshop at NeurIPS 2023, 2023.

[SLZS22] Chengchun Shi, Shikai Luo, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. arXiv preprint arXiv:2202.13163, 2022.

[STLD14] Phillip J Schulte, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science, 29(4):640, 2014.

[SZLS22] Chengchun Shi, Sheng Zhang, Wenbin Lu, and Rui Song. Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):765–793, 2022.

[Tsy04] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[TTG15] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.

[Wai19] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[WDT+22] Tongzhou Wang, Simon S Du, Antonio Torralba, Phillip Isola, Amy Zhang, and Yuandong Tian. Denoised MDPs: Learning world models better than the world itself. arXiv preprint arXiv:2206.15477, 2022.

[WXZS] Zizhao Wang, Xuesu Xiao, Yuke Zhu, and Peter Stone. Task-independent causal state abstraction.

[XCJ+21] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 34:6683–6694, 2021.

[XYWL21] Tengyu Xu, Zhuoran Yang, Zhaoran Wang, and Yingbin Liang. Doubly robust off-policy actor-critic: Convergence and optimality. In International Conference on Machine Learning, pages 11581–11591. PMLR, 2021.

[XYZ23] Chuhan Xie, Wenhao Yang, and Zhihua Zhang. Semiparametrically efficient off-policy evaluation in linear Markov decision processes. In International Conference on Machine Learning, pages 38227–38257. PMLR, 2023.

[Zho24] Angela Zhou. Reward-relevance-filtered linear offline reinforcement learning, 2024.

A Proofs

A.1 Preliminaries

Lemma 2 (Excess variance).
$$\mathbb{E}[\hat{L}_t(\tau, \eta)] - L_t(\tau, \eta) = \mathrm{Var}\big[ \max_{a'} Q(S_{t+1}, a') \mid \pi_t^b \big].$$

Proof. Add and subtract $\pm \gamma\,\mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b]$ instead of $\pm \gamma\,\mathcal{T} Q_{t+1}^{\pi^e}$:
$$\mathbb{E}[\hat{L}_t(\tau, \eta)] = \mathbb{E}\Big[\big( \{R_t + \gamma Q_{t+1}^{\pi^e}(S_{t+1}, A_{t+1}) - V^{\pi_t^b, \pi_{t+1:T}}(S_t)\} \pm \gamma\,\mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b] - \{A - \pi_t^b(1 \mid S_t)\}\cdot\tau(S_t) \big)^2\Big]$$
$$= \mathbb{E}\Big[\big( \{R_t + \gamma\,\mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b] - V^{\pi_t^b, \pi_{t+1:T}}(S_t)\} - \{A - \pi_t^b(1 \mid S_t)\}\cdot\tau(S_t) + \gamma\big(Q_{t+1}^{\pi^e}(S_{t+1}, A_{t+1}) - \mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b]\big) \big)^2\Big]$$
$$= \mathbb{E}\Big[\big( \{R_t + \gamma\,\mathcal{T}Q_{t+1}^{\pi^e} - V^{\pi_t^b, \pi_{t+1:T}}(S_t)\} - \{A - \pi_t^b(1 \mid S_t)\}\cdot\tau(S_t) \big)^2\Big] \qquad \text{(squared loss of identifying moment)}$$
$$\quad + \mathbb{E}\Big[\gamma^2\big(Q_{t+1}^{\pi^e}(S_{t+1}, A_{t+1}) - \mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b]\big)^2\Big] \qquad \text{(residual variance of } Q_t(s,a) - R_t(s,a)\text{)}$$
$$\quad + 2\,\mathbb{E}\Big[\Big( R_t + \gamma\,\mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b] - V_t^{\pi_t^b, \pi_{t+1:T}}(S_t) - \{A - \pi_t^b(1 \mid S_t)\}\cdot\tau(S_t) \Big)\cdot\gamma\big(Q_{t+1}^{\pi^e}(S_{t+1}, A_{t+1}) - \mathbb{E}[Q_{t+1}^{\pi^e} \mid S_t, \pi_t^b]\big)\Big].$$
Note the last term equals $0$ by iterated expectations and the pull-out property of conditional expectation.

A.2 Orthogonality

Below we will omit the $\pi$ superscript; the analysis below holds for any valid $\pi$. Define $\nu_t = \hat{\tau}_t - \tau_t^n$, $\nu_t^\circ = \hat{\tau}_t - \tau_t^\circ$. We define for any functional $L(f)$ the Frechet derivative as
$$D_f L(f)[\nu] = \frac{\partial}{\partial t} L(f + t\nu)\Big|_{t=0}.$$
Higher-order derivatives are denoted $D_{g,f} L(f, g)[\mu, \nu]$.
Lemma 3 (Universal orthogonality).
$$D_{\eta, \tau_t} L_t(\tau_t^n; \tau_{t+1}^n, \eta^*)[\eta - \eta^*, \nu_t] = 0.$$

333 Proof of Lemma 3. For brevity, for a generic f , let {f }ϵ denote f +ϵ(f −f ◦ ). Then the first Frechet
334 derivatives are:
d hn
πe ,◦ π e ,◦
o i
Lt (τ̃ , η ◦ )[τ − τ̃ , η − η ◦ ] = E Rt + γ{Qt+1 }ϵ − {mt }ϵ − (At − {πtb ◦ }ϵ )τ (At − {πtb,◦ }ϵ )(τ̃ − τ )
dϵτ

d d
Lt (τ̃ , η ◦ ) [η − η ◦ , τ − τ ]
dϵe dϵτ ϵ=0
π e ,◦
h  hn o i
b,◦
= E πt − πt b
τ (τ − τ̃ )(At − et )]+ E R + γQπt+1
e
− mt − (At − et ) (τ − τ̃ ) · − (et − e◦t )
=0
d d
Lt (τ̃ , η ◦ ) [η − η ◦ , τ − τ̃ ]
dϵQt+1 dϵτ ϵ=0
πe ,◦
= E[γ(Qπt+1
e
− Qt+1 )(At − πtb,◦ )(τt − τ̃t )]]
=0
d d
Lt (τ̃ , η ◦ ) [η − η ◦ , τ − τ̃ ]
dϵmt dϵτ ϵ=0
πe π e ,◦
= E[−(mt − mt )(At − πtb,◦ )(τt − τ̃t )]
=0
335

Lemma 4 (Second-order derivatives). For $Q_{t+1}, Q_{t+1}^\circ$ evaluated at some fixed policy $\pi^e$:
$$D_{\eta_t, \eta_t} L_t[\hat{\eta}_t - \eta_t^\circ, \hat{\eta}_t - \eta_t^\circ] = \mathbb{E}\big[\tau_t^2(\hat{\pi}_t^b - \pi_t^{b,\circ})^2\big] + \mathbb{E}\big[(\hat{\pi}_t^b - \pi_t^{b,\circ})\,\tau_t\,(\hat{m}_t - m_t^\circ)\big] + \mathbb{E}\big[(\hat{\pi}_t^b - \pi_t^{b,\circ})\,\tau_t\,\gamma(\hat{Q}_{t+1} - Q_{t+1}^\circ)\big] - \mathbb{E}\big[(\hat{m}_t - m_t^\circ)\,\gamma(\hat{Q}_{t+1} - Q_{t+1}^\circ)\big].$$

337 Proof of Lemma 4. Below, the evaluation policy π e is fixed and omitted for brevity. Note that
De LD [ê − e◦ ] = E[(Rt + γQt+1 − π b⊤ Qt + (A − πtb )τt )(−τt )(ê − e◦ )]
Dmt LD [m̂t − m◦t ] = E[(Rt + γQt+1 − π b⊤ Qt + (A − πtb )τt )(−1) ∗ (mt − m◦ )]

338 By inspection, note that the nonzero terms of the second-order derivatives are as follows:
  2 
Dπtb ,πtb Lt [π̂tb − πtb,◦ , π̂tb − πtb,◦ ] = E τt2 π̂tb − πtb,◦
h  i
Dmt ,Qt+1 Lt [Q̂t+1 − Q◦t+1 , m̂t − m◦t ] = E − (m̂t − m◦t ) γ Q̂t+1 − Q◦t+1
h i
Dmt ,πtb Lt [π̂tb − πtb,◦ , m̂t − m◦t ] = E (π̂tb − πtb,◦ )τt (m̂t − m◦t )
h i
DQt+1 ,πtb Lt [π̂tb − πtb,◦ , Q̂t+1 − Q◦t+1 ] = E (π̂tb − πtb,◦ )τt γ(Q̂t+1 − Q◦t+1 )

339 By the chain rule for Frechet differentiation, we have that

Dηt ,ηt Lt [η̂t − ηt◦ , η̂t − ηt◦ ] = Dπtb ,πtb Lt [π̂tb − πtb,◦ , π̂tb − πtb,◦ ]
+ Dmt ,πtb Lt [π̂tb − πtb,◦ , m̂t − m◦t ] + DQt+1 ,πtb Lt [π̂tb − πtb,◦ , Q̂t+1 − Q◦t+1 ] + Dmt ,Qt+1 Lt [Q̂t+1 − Q◦t+1 , m̂t − m◦t ]
340

A.3 Proof of sample complexity bounds

Proof of Lemma 1.

Vt∗ (s) − Vtπτ̂ (s) = Vt∗ (s) − Vtπτ̂ (s) ± Qπ (s, πτ̂ )
= Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ ) + Q∗t (s, π̂τ̂ ) − Vtπ̂τ̂ (s)
h ∗ i
≤ γEπ̂t Vt+1 π π̂τ̂
− Vt+1 | s + Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )

342 Therefore for any t and Markovian policy π inducing a marginal state distribution:[az: under what
343 distribution]
h i
π∗
E[Vt∗ (s)] − E[Vtπτ̂ (s)] ≤ γE Eπ̂t [Vt+1 π̂τ̂
− Vt+1 | s] + E[Q∗t (s, π ∗ ) − Q∗t (s, π̂τ̂ )] (9)

344 Assuming bounded rewards implies that P (st+1 | s, a) ≤ c, which remains true under the state-
345 action distribution induced by any Markovian policy π(s, a), including the optimal policy. Therefore
346 the second term of the above satisfies:
Z
Eπ [Q∗t (st , π ∗ ) − Q∗t (st , π̂τ̂ )] ≤ c {Q∗t (s, π ∗ ) − Q∗t (s, π̂τ̂ } ds, (10)

and fixing t = 1, we obtain:


Z
E[Q∗1 (s1 , π ∗ ) − Q∗1 (s1 , π̂τ̂ )] ≤ c {Q∗1 (s, π ∗ ) − Q∗1 (s, π̂τ̂ } ds.

347 Next we continue for generic t and bound the right hand side term of eq. (10).

First we suppose we have a high-probability bound on ℓ∞ convergence of τ̂ . Define the good event
 
π̂ t+1 π∗ ,◦ −b∗
Eg = sup |τ̂ (s) − τ t+1 (s)| ≤ Kn
s∈S,a∈A

348 A maximal inequality gives that P (Eg ) ≥ 1 − n−κ . We have that


Z Z Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} ds = {Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I [Eg ] ds+ {Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I Egc ds
 

(11)
349 Assuming boundedness, the bad event occurs with vanishingly small probability n−κ , which bounds
350 the second term of eq. (15).
For the first term of eq. (15), note that on the good event, if mistakes occur such that πt∗ (s) ̸= π̂t (s),
then the true contrast function is still bounded in magnitude by the good event ensuring closeness of
π ∗ ,◦
the estimate, so that τt t+1 (s) ≤ 2Kn−b∗ . And if no mistakes occur, at s the contribution to the
integral is 0. Denote the mistake region as
π∗
t+1 ,◦
Sm = {s ∈ S : τt (s) ≤ 2Kn−b∗ }
351 Therefore
Z Z
∗ ∗ ∗
{Qt (s, π (s)) − Qt (s, π̂τ̂ )} ds ≤ {Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I [s ∈ Sm ] I [Eg ] ds+O(n−κ )
s∈Sm
(12)
352 Note also that (for two actions), if action mistakes occur on the good event Eg , the difference of Q
353 functions must be near the decision boundaries so that we have the following bound on the integrand:

|Q∗ (s, π ∗ ) − Q∗ (s, π̂)| ≤ |τ πt+1 ,◦ | ≤ 2Kn−b∗ . (13)
354 Therefore,
Z Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} ds ≤ O(n −κ
) + Kn −b∗
I [s ∈ Sm ] ds

≤ O(n−κ ) + (Kn−b∗ )(Kn−b∗α )


= O(n−κ ) + (K 2 n−b∗(1+α) ) (14)
355 where the first inequality follows from the above, and the second from assumption 5 (margin).
356 Combining eqs. (9) and (14), we obtain:
T
X Z 
E[Vt∗ (St )] − E[Vtπ̂τ̂ (St )] ≤ γtc Qπ̂t τ̂ (s, π ∗ (s)) − Qπ̂t τ̂ (s, π̂τ̂ )ds
t=1
(1 − γ T )
≤ cT {O(n−κ ) + (K 2 n−b∗(1+α) )}
1−γ

357 We also obtain analogous results for norm bounds:


Z 1/u
u
(Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )) ds
Z 1/u
≤ (Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ ))u I [s ∈ Sm ] I [Eg ] ds + O(n−κ )
s∈Sm
T
(1 − γ )
≤ cT {O(n−κ ) + (K 2 n−b∗(1+α) )}
1−γ

The results under an integrated risk bound assumption on convergence of τ follow analogously as
[SLZS22], which we also include for completeness. For a given ε > 0, redefine the mistake region
parametrized by ϵ: n o
Sϵ = max Q∗ (s, a) − Q∗ (s, π̂(s)) ≤ ε .
a

358 Again we obtain the bound by conditioning on the mistake region:
Z Z Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} ds = {Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I [Sϵ ] ds+ {Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I [Sϵc ] ds
(15)
Using similar arguments as earlier, we can show by Assumption 5:
Z Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I(s ∈ S∗ )ds ≤ ε I(s ∈ S∗ )ds = O ε1+α .

x

359 As previously argued, we can show mistakes πt∗ (s) ̸= π̂t (s) occur only when

max Q∗t (s, a) − Q∗ (s, π̂(s)) ≤ 2 τ̂ π̂t+1 (s) − τ πt+1 ,◦ (s) . (16)
a

It follows that Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} I [s ∈ Sϵc ] ds
∗ 2
Z 4 τ̂ π̂t+1 (s) − τ πt+1 ,◦ (s)
≤E I [s ∈ Sϵc ] ds
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )}
Z 2
4 ∗
τ̂ π̂t+1 (s) − τ πt+1 ,◦ (s) ds = O ε−1 |I|−2b∗ .


ε
Combining this together with (E.106) and (E.107) yields that
Z
{Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )} ds = O ε1+α + O ε−1 |I|−2b∗ .
 

360 The result follows by choosing ε = n−2b∗ /(2+α) to balance the two terms.

For the norm bound, the first term is analogously bounded as O ε1+α :
Z 1/2
(Q∗t (s, π ∗ (s)) − Q∗t (s, π̂τ̂ ))2 I[s ∈ S∗ ]ds = O ε1+α .


For the second term,


  1/2
2 2


1/2  π̂t+1 πt+1 ,◦
(s) − τ

 4 τ̂ (s) 
Z 
Z 

∗ ∗ ∗ 2 c c
(Qt (s, π (s)) − Qt (s, π̂τ̂ )) I [s ∈ Sϵ ] ds ≤  ∗ I [s ∈ Sϵ ] ds
Qt (s, π ∗ (s)) − Q∗t (s, π̂τ̂ )


 

 
Z 4
4 ∗
≤ { τ̂ π̂t+1 (s) − τ πt+1 ,◦ (s) ds}1/2 = O ε−1 |I|−2b∗ .

ε
361 The result follows as previous.

362 Proof of Theorem 1. In the following, at times we omit the fixed evaluation policy π e from the
e e
363 notation for brevity. That is, in this proof, τ̂t , τtn are equivalent to τ̂tπ , τtn,π . Further define

νt = τ̂t − τtn , νt◦ = τ̂t − τt◦


Strong convexity implies that: [az: to add]
2
Dτt ,τt L (τt , η̂) [νt , νt ] ≥ λ ∥νt ∥2

364 therefore

λ 2
∥νt ∥2 ≤ LD (τ̂t , η̂) − LD (τtn , η̂) − Dτt LD (τtn , η̂)[νt ] (17)
2
≤ ϵ(τ̂t , η̂) − Dτt LD (τtn , η ◦ )[νt ]
+ Dτt LD (τtn , η ◦ )[νt ] − Dτt LD (τtn , η̂)[νt ]

365 We bound each term in turn.
To bound |Dτt LD (τtn , η ◦ )[νt ]| , note that
π b ,π
Dτt LD (τtn , η ◦ )[νt ] = E[(R + γQt+1 − Vt t+1:T + A − πtb τt )) A − πtb νt ]
 

and by the properties of the conditional moment at the true τ ◦ ,


π b ,π
= E[(R + γQt+1 − Vt t+1:T + A − πtb τt◦ )) A − πtb νt ] = 0
 

Therefore,
Dτt LD (τtn , η ◦ )[νt ] = −E[(τ ◦ − τtn )(A − πtb )(A − πtb )(τ̂t − τtn )]
366 Note that in general, for generic p, q, r such that 1/p + 1/q + 1/r = 1 we have that E[f gh] ≤
pq
367 ∥f g∥p′ ∥h∥r ≤ ∥f ∥p ∥g∥q ∥h∥r where p′ = p+q or p1′ = p1 + 1q or 1 = p/p
1 1
′ + q/p′ .

368 Therefore,
Dτt LD (τtn , η ◦ )[νt ] ≤ |Dτt LD (τtn , η ◦ )[νt ]|
≤ E[(τ ◦ − τtn )E[(At − πtb )(At − πtb ) | St ](τ̂t − τtn )]
 
◦ n n b b
≤ ∥(τ − τt )∥u ∥(τ̂t − τt )∥u · sup E[(At − πt )(At − πt ) | s]
s
1 1
369 where u, u satisfy u + u = 1.
370 Next we bound Dτt LD (τtn , η ◦ )[νt ] − Dτt LD (τtn , η̂)[νt ] by universal orthogonality. By a second
371 order Taylor expansion, we have that, where ηϵ = η ◦ + ϵ(η̂ − η ◦ ).
1 1
Z
Dτt (LD (τtn , η ◦ ) − LD (τtn , η̂)) [νt ] = ◦
Dη,η,τt (τtn , τt+1 , ηϵ )[η̂ − η ◦ , η̂ − η ◦ , νt ]
2 0
372 We can deduce from Lemmas 3 and 4 that the integrand is:
  2  h i h i
E τt2 π̂tb − πtb,◦ νt + E (π̂tb − πtb,◦ )τt (m̂t − m◦t )νt + E (π̂tb − πtb,◦ )τt γ(Q̂t+1 − Q◦t+1 )νt
h   i
− E (m̂t − m◦t ) γ Q̂t+1 − Q◦t+1 νt
 2
≤Bτ 2 ∥ π̂tb − πtb,◦ ∥u ∥νt ∥u + Bτ ∥(π̂tb − πtb,◦ )(m̂t − m◦t )∥u ∥νt ∥u + γBτ ∥(π̂tb − πtb,◦ )(Q̂t+1 − Q◦t+1 )∥u ∥νt ∥u
+ γ∥(m̂t − m◦t )(Q̂t+1 − Q◦t+1 )∥u ∥νt ∥u
373 Putting the bounds together, we obtain:
λ 2
∥νt ∥2 ≤ ϵ(τ̂t , η̂) + ∥νt ∥u ∥(τ ◦ − τtn )∥u
2   2
+ ∥νt ∥u Bτ 2 ∥ π̂tb − πtb,◦ ∥u + Bτ ∥(π̂tb − πtb,◦ )(m̂t − m◦t )∥u + γBτ ∥(π̂tb − πtb,◦ )(Q̂t+1 − Q◦t+1 )∥u

+γ∥(m̂t − m◦t )(Q̂t+1 − Q◦t+1 )∥u (18)
e
374 Let ρπt (η̂) denote the collected product error terms, e.g.
e
 2
ρπt (η̂) = Bτ 2 ∥ π̂tb − πtb,◦ ∥u + Bτ ∥(π̂tb − πtb,◦ )(m̂t − m◦t )∥u

+ γ(Bτ ∥(π̂tb − πtb,◦ )(Q̂t+1 − Q◦t+1 )∥u + ∥(m̂t − m◦t )(Q̂t+1 − Q◦t+1 )∥u )
375 Analogously we drop the π e decoration from ρt in this proof. The AM-GM inequality implies that
376 for x, y ≥ 0, σ > 0, we have that xy ≤ 21 ( σ2 x2 + σ2 y 2 ). Therefore
λ 2 σ 1 2
∥νt ∥2 − ∥νt ∥2u ≤ ϵ(τ̂t , η̂) + (∥(τ ◦ − τtn )∥u + ρt (η̂)) (19)
2 4 σ
and since (x + y)2 ≤ 2(x2 + y 2 ),
λ 2 σ 2
∥νt ∥2 − ∥νt ∥2u ≤ ϵ(τ̂t , η̂) + ∥(τ ◦ − τtn )∥2u + ρt (η̂)2

2 4 σ
377

378 Proof of Theorem 2. Let L̂S,t , L̂S ′ ,t denote the empirical loss over the samples in S and S ′ ; analo-
379 gously η̂S , η̂S ′ are the nuisance functions trained on each sample split.
Define the loss function ℓt on observation O = {(St , At , Rt , St+1 )}Tt=1 :
 π et+1
2
ℓt (O; τt ; η̂) = {Rt + Q̂t+1 (St+1 , At+1 ) − m̂t (St )} − {A − π̂tb (1 | St )} · τt (St )

and the centered loss function ∆ℓ, centered with respect to τ̂tn :
∆ℓt (O; τt ; η̂) = ℓt (O; τt ; η̂) − ℓt (O; τ̂tn ; η̂).
Assuming boundedness, ℓt is L−Lipschitz constant in τt :
|∆ℓt (O; τt ; η̂) − ∆ℓt (O; τt′ ; η̂)| ≤ L∥τt − τt ∥2 .
380 Note that ℓ(O, τ̂tn , η̂) = 0. Define the centered average losses:

∆L̂S,t (τt , η̂) = L̂S,t (τt , η̂) − L̂S,t (τ̂tn , η̂) = ÊSn/2 [∆ℓt (O, τT , η̂)]
∆LS,t (τt , η̂) = LS,t (τt , η̂) − LS,t (τ̂tn , η̂) = E[∆ℓt (O, τT , η̂)]
381 Assume that δn is an upper bound on the critical radius of the centered function class {Ψnt,i − τ̂t,i n
,
q
log(c T /ξ)
382 with δn = Ω( r lognlog n ), and define δn,ξ = δn + c0 1
n for some c0 , c1 . [az: ξ is like δ and
2
383 δξ includes a union bound over time horizon ]
384 By Lemma 6 (Lemma 14 of [FS19] on local Rademacher complexity decompositions), with high
385 probability 1-ξ, for all t ∈ [T ], and for c0 a universal constant ≥ 1.

|∆LS,t (τ̂t , η̂S ′ ) − ∆LD,t (τ̂t , η̂S ′ )| = |∆LS,t (τ̂t , η̂S ′ ) − ∆LS,t (τ̂tn , η̂S ′ ) − (∆LD,t (τ̂t , η̂S ′ ) − ∆LD,t (τ̂tn , η̂S ′ ))|
 
≤ c0 rmδn/2,ξ ∥τ̂t − τ̂tn ∥22 + rmδn/2,ξ 2

386 [az: what’s r]


 
1
387 Assuming realizability of τ̂t , we have that 2 ∆L̂S,t (τ̂t , η̂S ′ ) + ∆L̂S ′ ,t (τ̂t , η̂S ) ≤ 0. Then with
388 high probability ≥ 1 − 2ξ:
1
(∆LD,t (τ̂t , η̂S ′ ) + ∆LD,t (τ̂t , η̂S ))
2
1
≤ |∆LD,t (τ̂t , η̂S ′ ) − ∆LS,t (τ̂t , η̂S ′ ) + ∆LD,t (τ̂t , η̂S ) − ∆LS ′ ,t (τ̂t , η̂S )|
2
1
≤ |∆LD,t (τ̂t , η̂S ′ ) − ∆LS,t (τ̂t , η̂S ′ )| + |∆LD,t (τ̂t , η̂S ) − ∆LS ′ ,t (τ̂t , η̂S )|
2  
≤c0 rmδn/2,ξ ∥τ̂t − τ̂tn ∥2 + rmδn/2,ξ
2

389 The ϵ excess risk term in Theorem 1 indeed corresponds to one of the loss differences defined here,
390 i.e. ∆LD,t (τ̂t , η̂S ) := ϵ(τ̂tn , τ̂t , ĥS ). Therefore, applying Theorem 1 with u = u = 2 and σ = λ
391 with the above bound, and averaging the sample-split estimators, we obtain
 
λ 2 1 2 ◦ 2
X
∥νt ∥2 ≤ (ϵ(τ̂t , η̂S ) + ϵ(τ̂t , η̂S ′ )) + ∥τt − τ̂tn ∥2 + ρt (η̂s )2 
4 2 λ
s∈{S,S ′ }

392 We further decompose the excess risk of empirically-optimal τ̂t relative to the population minimizer
2 2
393 to instead bound by the error of τ̂t to the projection onto Ψ, τ̂tn , since ∥τ̂t − τt◦ ∥2 ≤ ∥τ̂t − τ̂tn ∥2 +
2
394 ∥τ̂tn − τt◦ ∥2 , we obtain
λ   8 + λ2 2 X
2
∥τ̂t − τt◦ ∥22 ≤ c0 rmδn/2,ξ ∥τ̂t − τ̂tn ∥2 + rmδn/2,ξ
2
+ ∥τt◦ − τtn ∥2 + ρt (η̂s )2
4 4λ λ ′ s∈{S,S }

Again using the AM-GM inequality xy ≤ 12 σ2 x2 + σ2 y 2 , we bound

395

  c 2 2 ϵ
0
c0 rmδn/2,ξ ∥τ̂t − τ̂tn ∥2 + rmδn/2,ξ
2
≤ r2 m2 (1 + )δn/2,ξ + ∥τ̂t − τ̂tn ∥22
2 ϵ 4
1 2 ϵ
≤ c0 r m (1 + )δn/2,ξ + (∥τ̂t − τt◦ ∥22 + ∥τt◦ − τ̂tn ∥22 )
2 2
ϵ 4
396 Therefore,
8 + λ2
 
λ−ϵ 1 2 ϵ 2 2 X
∥τ̂t − τt◦ ∥22 ≤ c0 r2 m2 (1 + )δn/2,ξ + + ∥τt◦ − τtn ∥2 + ρt (η̂s )2
4 ϵ 4λ 4 λ ′ s∈{S,S }

397 Choose ϵ ≤ λ/8 so that


4 + λ2
 
λ ◦ 2 8 2 2 2 X
2 2
∥τ̂t − τt ∥2 ≤ c0 r m (1 + )δn/2,ξ + ∥τt◦ − τtn ∥2 + ρt (η̂s )2
8 λ 2λ λ
s∈{S,S ′ }
 
8 λ 2
X
≤ 1+ + (c0 r2 m2 δn/2,ξ
2
+ ∥τt◦ − τtn ∥2 + ρt (η̂s )2 )
λ 2 ′ s∈{S,S }

and therefore
 
8 8 2
X
∥τ̂t − τt◦ ∥22 ≤ (1 + ) + 4 (c0 r2 m2 δn/2,ξ
2
+ ∥τt◦ − τtn ∥2 + ρt (η̂s )2 )
λ λ
s∈{S,S ′ }

Taking expectations:
 
8 8 2
E[∥τ̂t − τt◦ ∥22 ] ≤ (1 + ) + 4 (c0 r2 m2 δn/2
2
+ ∥τt◦ − τtn ∥2 + max′ E[ρt (η̂s )2 ])
λ λ s∈{S,S }

398 Therefore, if the product error rate terms are all of the same order as the estimation order terms:
2
E[∥π̂tb − πtb,◦ ∥22 ] = O(δn/2
2
+ ∥τt◦ − τtn ∥2 )
2
E[∥(π̂tb − πtb,◦ )(m̂t − m◦t )∥22 ] = O(δn/2
2
+ ∥τt◦ − τtn ∥2 )
2
E[∥(π̂tb − πtb,◦ )(Q̂t+1 − Q◦t+1 )∥22 ] = O(δn/2
2
+ ∥τt◦ − τtn ∥2 )
2
E[∥(m̂t − m◦t )(Q̂t+1 − Q◦t+1 )∥22 ] = O(δn/2
2
+ ∥τt◦ − τtn ∥2 )
399

Proof of Theorem 3. Preliminaries We introduce some additional notation. For the analysis of im-
plications of policy optimization, we further introduce notation that parametrizes the time-t loss
function with respect to the time-(t + 1) policy. In analyzing the policy optimization, this will be
used to decompose the policy error arising from time steps closer to the horizon. Define
 2 
πτ ′

LD (τtn , τt+1 , η̂) = E {Rt + γQt+1 t+1
(St+1 , At+1 ) − Vπtb ,πτ′ (St )} − {A − πtb (1 | St )} · τ (St )
t+1


400 where πτt+1
′ (s) ∈ argmax τt+1 (s). That is, the second argument parameterizes the difference-of-Q
401 function that generates the policy that oracle nuisance functions are evaluated at.
Then, for example, the true optimal policy satisfies that πt∗ ∈ arg max τt◦ (s). We define the oracle
loss function with nuisance functions evaluated with respect to the optimal policy π ∗ .
" 2 #
πτ∗◦
n ◦ t+1 ◦ b
LD (τt , τ , η̂) = E {Rt + γQt+1 (St+1 , At+1 ) − m (St )} − γ{A − πt (1 | St )} · τ (St )

In contrast, the empirical policy optimizes with respect to a next-stage estimate of the empirical best
next-stage policy π̂τ̂t+1 . That is, noting the empirical loss function:
 2 
n π̂τ̂t+1 ◦ b
LD (τt , τ̂t+1 , η̂) = E {Rt + γQt+1 (St+1 , At+1 ) − m (St )} − γ{A − πt (1 | St )} · τ (St )

Step 1: Applying advantage estimation results. At every timestep, the first substep is to estimate
π̂
the Q-function contrast, τ̂t t+1 . The assumptions on product error nuisance rates imply that for a
fixed π̂t+1 that we would obtain estimation error
 
h i e e 2
π̂ π̂ ,◦
E ∥τ̂t t+1 − τt t+1 ∥22 = O δn/22
+ τtπ ,◦ − τtπ ,n
2

π̂
402 Step 2: Establishing policy consistency. Applying Lemma 1 requires a convergence rate of τ̂t t+1
π∗
403 toτ̂t t+1 .
The estimation error guarantees on the contrast function, however, are for the policy π̂t+1 .
404 We obtain the required bound via induction. At a high level, the estimation error arising from π̂t+1

405 vs πt+1 too eventually is integrated; so when the margin exponent α > 0, these policy error terms
406 are higher-order and vanish at a faster rate.
407 Importantly, we suppose the product error rate conditions hold for each t for data-optimal policies
408 evaluated along the algorithm, i.e. for each t, for each t, for π̂ t+1 , each of E[∥(π̂tb − πtb,◦ )∥22 ],
π̂ ◦,π̂ t+1 π̂ ◦,π̂
409 E[∥(π̂tb − πtb,◦ )(m̂t t+1 − mt )∥22 ], E[∥(π̂tb − πtb,◦ )(Q̂t+1
t+2
− Qt+1t+2 )∥22 ], and E[∥(m̂t −
◦,π̂ ◦,π̂ π̂ ,◦ π̂ ,n
410 m◦t )(Q̂t+1t+2 − Qt+1t+2 )∥22 ] are of order O(δn/2
2
+ ∥τt t+1 − τt t+1 ∥22 ).

411 Step 2a: induction hypothesis.

412 Next we show the induction hypothesis.


413 First we consider the base case: When t = T , τT is independent of the forward policy so that

414 ∥τ̂Tπ̂ − τT◦,π ∥ = ∥τ̂T − τT◦ ∥. Then the base case follows by Theorem 2.
415 Suppose it is true that for timesteps k ≥ t + 1, we have that
π̂ ◦,π ∗ ◦,π̂ k+1 n,π̂ k+1
∥τ̂k k+1 − τk k+1
∥ = O(δn/2 + ∥τk − τk ∥2 ) + Kn−Rk , (20)
416 where

!
(c) 2 + 2α (Ψ) 2 + 2α (c) (Ψ) 2 + 2α T −k
Rk = min ρk+1 · , ρk+1 · , −{ ′min (ρk′ , ρk′ )} · . (21)
2+α 2+α k ≥k+1 2+α
417 And therefore, applying Lemma 1, that
∗ (c) (Ψ) (c) (Ψ)
} 2+2α } 2+2α
E[Vkπ − Vkπ̂τ̂ ] = O(n− min{ρk ,ρk 2+α ) + o(n− min{ρk ,ρk 2+α ). (22)

418 We will show that the induction hypothesis implies


π̂ t+1 ◦,π ∗ ◦,π̂ t+1 n,π̂ t+1
∥τ̂t − τt t+1
∥ ≤ O(δn/2 + ∥τt − τt ∥2 ) + Kn−Rt .
419 and (c) (Ψ) (c) (Ψ)

} 2+2α } 2+2α
E[Vkπ − Vkπ̂τ̂ ] = O(n− min{ρk ,ρk 2+α ) + o(n− min{ρk ,ρk 2+α )

π̂ t+1 ◦,π ∗
t+1
420 First decompose the desired error ∥τ̂t − τt ∥ as:
π̂ ◦,π ∗ π̂ ◦,π̂ ◦,π̂ t+1 ◦,π ∗ t+1
∥τ̂t t+1 − τt t+1 ∥ ≤ ∥τ̂t t+1 − τt t+1 ∥ + ∥τt − τt ∥ (23)
421 The first term is the policy evaluation estimation error, and under the product error rate assumptions
π̂ ◦,π̂ ◦,π̂ n,π̂
422 [az: point], Theorems 1 and 2 give that E[∥τ̂t t+1 − τt t+1 ∥22 ] = O(δn/2 2
+ ∥τt t+1 − τt t+1 ∥22 ).
423 The second term of the above depends on the convergence of the empirically optimal policy π̂;
424 we use our analysis from Lemma 1 to bound the impact of future estimates of difference-of-Q
425 functions using the induction hypothesis. The following analysis will essentially reveal that the
426 margin assumption of Assumption 5 implies that the error due to the empirically optimal policy is
427 higher-order, and the first term (time−t estimation error of τ̂t ) is the leading term.
428 As in eq. (9), we have that:
h ∗ i
Vt∗ (s) − Vtπτ̂ (s) ≤ γEπ̂t Vt+1
π π̂τ̂
− Vt+1 | st + Q∗t (s, π ∗ ) − Q∗t (s, π̂τ̂ ).

Decompose:
◦,π̂ t+1 ◦,π ∗
t+1
X π∗ π̂
∥τt − τt ∥≤ ∥Qt t+1 (s, a) − Qt t+1 (s, a)∥
a

π̂ t+1 ◦,π ∗
429 [az: is this all ∥τ̂t − τt t+1
∥22 or ? ]
π̂t+1 ,π ∗
t+2
430 By definition of τ and ±Vt+1 , for each a, we have that
π∗ π̂
∥Qt t+1 (s, a) − Qt t+1 (s, a)∥
t+1 π∗ t+1 π̂
= ∥Eπta [Vt+1 − Vt+1 | St ]∥
t+1 π∗ π̂t+1 ,π ∗
t+2 π̂t+1 ,π ∗
t+2 π̂
t+1
≤ ∥Eπta [Vt+1 − Vt+1 | St ]∥ + ∥Eπta [Vt+1 − Vt+1 | St ]∥
t+2 π∗ ∗ t+2 π∗ t+2 t+2 π∗ π̂
= ∥Eπta [Qt+1 (St+1 , πt+1 ) − Qt+1 (St+1 , π̂t+1 ) | St ]∥ + γ∥Eπta [Eπ̂t+1 [Vt+2 − Vt+2 | St ]]∥
(24)
Z ∗ ∗
1/2 ∗
π t+2 ∗ π t+2 π t+2 π̂ t+2
≤c (Qt+1 (s, πt+1 ) − Qt+1 (s, π̂t+1 ))2 ds + γ∥Eπta [Eπ̂t+1 [Vt+2 − Vt+2 | St ]]∥
(25)

431 [az: watch γ] where the last inequality follows by Assumption 4 and the policy-convolved transition
432 density.
Next we bound the first term using the margin analysis of Lemma 1 and the inductive hypothesis.
[az: definition of c∗ , rΨ ] Supposing the product error rates are satisfied on the nuisance functions
for estimation of τ̂t+1 , the induction hypothesis gives that
◦,π ∗
π̂ t+2
 e

E[∥τ̂t+1 − τt+1t+2 ∥2 ] = O δn/2 + ∥τtπ ,◦ − τtn ∥2 + n−Rt+1 .

433 The induction hypothesis gives the integrated risk rate assumption on τ̂t+1 to apply Lemma 1,
Z 1/2
π∗ ∗ π∗
t+2
(Qt+1 (s, πt+1 ) − t+2
Qt+1 (s, π̂t+1 ))2 ds

(1 − γ T −t−1 ) (c) (Ψ)


≤ c(T − t − 1){O(n−κ ) + Kn− min{rt+1 ,rt+1 ,Rt+1 }(1+α) }.
1−γ

434 Combining with the previous analysis, we obtain:


n o
π̂ t+1 ◦,π ∗
t+1 2 2 ◦,π̂ t+1 n,π̂ t+1 2 (c) (Ψ)
− min ρt+2 ,ρt+2 ,Rt+2 2+2α
∥τ̂t − τt ∥2 ≤ O(δt,n/2 + ∥τt − τt ∥2 ) + O(n 2+α
)}
(1 − γ T −t−1 ) (c) (Ψ) 2+2α
+ c(T − t − 1){O(n−κ ) + Kn− min{ρt+1 ,ρt+1 ,Rt+1 } 2+α }
1−γ
(26)

435 from eq. (24) and appendix A.3.


436 Hence we obtain the inductive step and the result follows.
(·) (·)
437 If we further assume that for t′ ≥ t, we have that ρt ≤ ρt′ , for (·) ∈ {(c), (Ψ)}, i.e. the estimation
438 error rate is nonincreasing over time, and that α > 0 (i.e. Assumption 5, the margin assumption,
439 holds with exponent α > 0, then we can see from the result that the integrated risk terms obtain
440 faster rates, hence are higher-order, and the leading term is the auxiliary estimation error of the
441 Q-function contrast.
442 [az: I feel like uniform consistency over all policies is kind of harder to establish (directly applying
443 lemma 1). Induction feels more explicit anyhow. bounded the sum with ]
444

B Results used from other works

Here we collect technical lemmas from other works, stated without proof.
Lemma 5 (Lemma 18 of [LS21]). Consider any sequence of non-negative numbers $a_1, \dots, a_m$ satisfying the inequality
$$a_t \le \mu_t + c_t \max_{j=t+1}^m a_j,$$
with $\mu_t, c_t \ge 0$. Let $c := \max_{t \in [m]} c_t$ and $\mu := \max_{t \in [m]} \mu_t$. Then it must also hold that
$$a_t \le \mu\, \frac{c^{m-t+1} - 1}{c - 1}.$$
Lemma 6 (Lemma 14 of [FS19]; see also results on local Rademacher complexity [Wai19]). Consider a function class $\mathcal{F}$ with $\sup_{f \in \mathcal{F}} \|f\|_\infty \le 1$, and pick any $f^\star \in \mathcal{F}$. Let $\delta_n^2 \ge \frac{4d \log(41 \log(2 c_2 n))}{c_2 n}$ be any solution to the inequalities
$$\forall t \in \{1, \dots, d\}: \quad \mathcal{R}\big(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta\big) \le \delta^2.$$
Moreover, assume that the loss $\ell$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$ norm. Then for some universal constants $c_5, c_6$, with probability $1 - c_5 \exp(-c_6 n \delta_n^2)$,
$$|P_n(\mathcal{L}_f - \mathcal{L}_{f^\star}) - P(\mathcal{L}_f - \mathcal{L}_{f^\star})| \le 18 L d \delta_n\{\|f - f^\star\|_2 + \delta_n\}, \quad \forall f \in \mathcal{F}.$$
Hence, the outcome $\hat{f}$ of constrained ERM satisfies, with the same probability,
$$P\big(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^\star}\big) \le 18 L d \delta_n\big\{\|\hat{f} - f^\star\|_2 + \delta_n\big\}.$$
If the loss $\mathcal{L}_f$ is also linear in $f$, i.e. $\mathcal{L}_{f+f'} = \mathcal{L}_f + \mathcal{L}_{f'}$ and $\mathcal{L}_{\alpha f} = \alpha \mathcal{L}_f$, then the lower bound on $\delta_n^2$ is not required.

449 NeurIPS Paper Checklist
450 The reviewers of your paper will be asked to use the checklist as one of the factors in their evalu-
451 ation. While "[Yes] " is generally preferable to "[No] ", it is perfectly acceptable to answer "[No]
452 " provided a proper justification is given (e.g., "error bars are not reported because it would be too
453 computationally expensive" or "we were unable to find the license for the dataset we used"). In
454 general, answering "[No] " or "[NA] " is not grounds for rejection. While the questions are phrased
455 in a binary way, we acknowledge that the true answer is often more nuanced, so please just use your
456 best judgment and write a justification to elaborate. All supporting evidence can appear either in the
457 main paper or the supplemental material, provided in appendix. If you answer [Yes] to a question,
458 in the justification please point to the section(s) where related material for the question can be found.

459 1. Claims
460 Question: Do the main claims made in the abstract and introduction accurately reflect the
461 paper’s contributions and scope?
462 Answer: [Yes]
463 Justification: the abstract describes the claims in the paper.
464 2. Limitations
465 Question: Does the paper discuss the limitations of the work performed by the authors?
466 Answer: [Yes]
467 Justification: We include a Limitations section (Section 3.2) with further detail.
468 3. Theory Assumptions and Proofs
469 Question: For each theoretical result, does the paper provide the full set of assumptions and
470 a complete (and correct) proof?
471 Answer: [Yes]
472 Justification: All proofs are in the appendix. We have a separate assumptions block at the
473 beginning of ??. Every theorem statement starts with the stated assumptions. (Some theorem
474 statements impose additional assumptions, of a mild technical nature, that do not apply
475 broadly across the paper and are therefore not listed in the earlier assumptions block.)
476 4. Experimental Result Reproducibility
477 Question: Does the paper fully disclose all the information needed to reproduce the main
478 experimental results of the paper to the extent that it affects the main claims and/or conclu-
479 sions of the paper (regardless of whether the code and data are provided or not)?
480 Answer: [Yes]
481 Justification: Additional section included in ??
482 5. Open access to data and code
483 Question: Does the paper provide open access to the data and code, with sufficient instruc-
484 tions to faithfully reproduce the main experimental results, as described in supplemental
485 material?
486 Answer: [Yes]
487 Justification: attached in supplement
488 6. Experimental Setting/Details
489 Question: Does the paper specify all the training and test details (e.g., data splits, hyper-
490 parameters, how they were chosen, type of optimizer, etc.) necessary to understand the
491 results?
492 Answer: [TODO]
493 Justification: [TODO]
494 Guidelines:
495 • The answer NA means that the paper does not include experiments.

496 • The experimental setting should be presented in the core of the paper to a level of
497 detail that is necessary to appreciate the results and make sense of them.
498 • The full details can be provided either with the code, in appendix, or as supplemental
499 material.
500 7. Experiment Statistical Significance
501 Question: Does the paper report error bars suitably and correctly defined or other appropri-
502 ate information about the statistical significance of the experiments?
503 Answer: [TODO]
504 Justification: [TODO]
505 Guidelines:
506 • The answer NA means that the paper does not include experiments.
507 • The authors should answer "Yes" if the results are accompanied by error bars, confi-
508 dence intervals, or statistical significance tests, at least for the experiments that support
509 the main claims of the paper.
510 • The factors of variability that the error bars are capturing should be clearly stated (for
511 example, train/test split, initialization, random drawing of some parameter, or overall
512 run with given experimental conditions).
513 • The method for calculating the error bars should be explained (closed form formula,
514 call to a library function, bootstrap, etc.)
515 • The assumptions made should be given (e.g., Normally distributed errors).
516 • It should be clear whether the error bar is the standard deviation or the standard error
517 of the mean.
518 • It is OK to report 1-sigma error bars, but one should state it. The authors should prefer-
519 ably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of
520 Normality of errors is not verified.
521 • For asymmetric distributions, the authors should be careful not to show in tables or
522 figures symmetric error bars that would yield results that are out of range (e.g. negative
523 error rates).
524 • If error bars are reported in tables or plots, The authors should explain in the text how
525 they were calculated and reference the corresponding figures or tables in the text.
526 8. Experiments Compute Resources
527 Question: For each experiment, does the paper provide sufficient information on the com-
528 puter resources (type of compute workers, memory, time of execution) needed to reproduce
529 the experiments?
530 Answer: [TODO]
531 Justification: [TODO]
532 Guidelines:
533 • The answer NA means that the paper does not include experiments.
534 • The paper should indicate the type of compute workers CPU or GPU, internal cluster,
535 or cloud provider, including relevant memory and storage.
536 • The paper should provide the amount of compute required for each of the individual
537 experimental runs as well as estimate the total compute.
538 • The paper should disclose whether the full research project required more compute
539 than the experiments reported in the paper (e.g., preliminary or failed experiments
540 that didn’t make it into the paper).
541 9. Code Of Ethics
542 Question: Does the research conducted in the paper conform, in every respect, with the
543 NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
544 Answer: [Yes]
545 Justification: The research conforms with the code of ethics.
546 10. Broader Impacts

547 Question: Does the paper discuss both potential positive societal impacts and negative
548 societal impacts of the work performed?
549 Answer: [Yes]
550 Justification: yes, in Section 3.2
551 11. Safeguards
552 Question: Does the paper describe safeguards that have been put in place for responsible
553 release of data or models that have a high risk for misuse (e.g., pretrained language models,
554 image generators, or scraped datasets)?
555 Answer: [NA]
556 Justification: The paper artifacts do not have a high risk of misuse beyond impacts dis-
557 cussed in Section 3.2.
558 12. Licenses for existing assets
559 Question: Are the creators or original owners of assets (e.g., code, data, models), used in
560 the paper, properly credited and are the license and terms of use explicitly mentioned and
561 properly respected?
562 Answer: [Yes]
563 Justification: Properly credited in ?? [az: fix cref]
564 Guidelines:
565 13. New Assets
566 Question: Are new assets introduced in the paper well documented and is the documenta-
567 tion provided alongside the assets?
568 Answer: [NA]
569 Justification: no new assets
570 14. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
571 Subjects
572 Question: Does the paper describe potential risks incurred by study participants, whether
573 such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
574 approvals (or an equivalent approval/review based on the requirements of your country or
575 institution) were obtained?
576 Answer: [NA]
577 Justification: no study participants
