
A Two-Timescale Stochastic Algorithm Framework for Bilevel
Optimization: Complexity Analysis and Application to Actor-Critic

arXiv:2007.05170v3 [math.OC] 30 Dec 2021

Mingyi Hong∗†    Hoi-To Wai‡    Zhaoran Wang§    Zhuoran Yang¶

January 3, 2022

Abstract

This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization.
Bilevel optimization is a class of problems which exhibits a two-level structure, and its goal is
to minimize an outer objective function with variables which are constrained to be the optimal
solution to an (inner) optimization problem. We consider the case when the inner problem is
unconstrained and strongly convex, while the outer problem is constrained and has a smooth
objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for
tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger
step size is used for the inner problem, while a projected stochastic gradient update with a
smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA
algorithm under various settings: when the outer problem is strongly convex (resp. weakly
convex), the TTSA algorithm finds an O(Kmax^{−2/3})-optimal (resp. O(Kmax^{−2/5})-stationary) solution,
where Kmax is the total iteration number. As an application, we show that a two-timescale
natural actor-critic proximal policy optimization algorithm can be viewed as a special case of
our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at
a rate of O(Kmax^{−1/4}) in terms of the gap in expected discounted reward compared to a global
optimal policy.

1 Introduction
Consider bilevel optimization problems of the form:
\[
\min_{x \in X \subseteq \mathbb{R}^{d_1}} \; \ell(x) := f\big(x, y^\star(x)\big) \quad \text{subject to} \quad y^\star(x) \in \arg\min_{y \in \mathbb{R}^{d_2}} g(x, y), \tag{1}
\]

where d1 , d2 ≥ 1 are integers; X is a closed and convex subset of Rd1 , f : X × Rd2 → R and
g : X × Rd2 → R are continuously differentiable functions with respect to (w.r.t.) x, y. Problem (1)

∗ Authors listed in alphabetical order.
† University of Minnesota, email: mhong@umn.edu
‡ The Chinese University of Hong Kong, email: htwai@se.cuhk.edu.hk
§ Northwestern University, email: zhaoranwang@gmail.com
¶ Princeton University, email: zy6@princeton.edu

Table 1: Summary of the main results. SC stands for strongly convex, WC for weakly convex, C
for convex; k is the iteration counter, Kmax is the total number of iterations.

  ℓ(x) | Constraint  | Step Size (αk, βk)              | Rate (outer)      | Rate (inner)
  SC   | X ⊆ R^{d1}  | O(k^{−1}), O(k^{−2/3})          | O(Kmax^{−2/3}) †  | O(Kmax^{−2/3}) ⋆
  C    | X ⊆ R^{d1}  | O(Kmax^{−3/4}), O(Kmax^{−1/2})  | O(Kmax^{−1/4}) ¶  | O(Kmax^{−1/2}) ⋆
  WC   | X ⊆ R^{d1}  | O(Kmax^{−3/5}), O(Kmax^{−2/5})  | O(Kmax^{−2/5}) #  | O(Kmax^{−2/5}) ⋆

† in terms of ‖x^{Kmax} − x⋆‖², where x⋆ is the optimal solution; ⋆ in terms of ‖y^{Kmax} − y⋆(x^{Kmax−1})‖², where
y⋆(x^{Kmax−1}) is the optimal inner solution for the fixed x^{Kmax−1}; ¶ measured using ℓ(x) − ℓ(x⋆); # measured
using the distance to a fixed point of the Moreau proximal map x̂(·); see (18).

involves two optimization problems in a two-level structure. We refer to min_{y∈R^{d2}} g(x, y) as
the inner problem, whose solution depends on x, and g(x, y) is called the inner objective function;
min_{x∈X} ℓ(x) is referred to as the outer problem, and ℓ(x) ≡ f(x, y⋆(x)) is called the outer
objective function. Moreover, both f, g can be stochastic functions whose gradients may be difficult
to compute. Despite being a non-convex stochastic problem in general, (1) has a wide range of
applications, e.g., reinforcement learning [33], hyperparameter optimization [22], game theory [51], etc.
Tackling (1) is challenging as it involves solving the inner and outer optimization problems
simultaneously. Even in the simplest case, when ℓ(x) and g(x, y) are strongly convex in x and y,
respectively, solving (1) is difficult. For instance, if we aim to minimize ℓ(x) via a gradient method,
then at any iterate x_cur ∈ R^{d1}, applying the gradient method to (1) involves a double-loop procedure
that (a) solves the inner optimization problem y⋆(x_cur) = arg min_{y∈R^{d2}} g(x_cur, y), and then (b)
evaluates the gradient ∇ℓ(x_cur) based on the solution y⋆(x_cur). Depending on the application,
step (a) is usually accomplished by applying yet another gradient method to the inner
problem (unless a closed-form solution for y⋆(x_cur) exists). In this way, the resulting algorithm
necessitates a double-loop structure.
To this end, [23] and the references therein proposed a stochastic algorithm for (1) involving a
double-loop update. During the iterations, the inner problem min_{y∈R^{d2}} g(x_cur, y) is solved using a
stochastic gradient descent (SGD) method, with the solution denoted by ŷ⋆(x_cur). Then, the outer
problem is optimized with an SGD update using estimates of ∇̄f(x_cur, ŷ⋆(x_cur)). Such a double-loop
algorithm is proven to converge to a stationary solution, yet a practical issue lingers: what if the
(stochastic) gradients of the inner and outer problems are only revealed sequentially? This arises,
for example, when the two problems must be updated at the same time, as in a sequential game.
To address the above issues, this paper investigates a single-loop stochastic algorithm for (1).
Focusing on the class of bilevel optimization problems (1) in which the inner problem is unconstrained
and strongly convex and the outer objective function is smooth, our contributions are three-fold:

• We study a two-timescale stochastic approximation (TTSA) algorithm [7] for the concerned class
of bilevel optimization. The TTSA algorithm updates both outer and inner solutions simultane-
ously, by using some cheap estimates of stochastic gradients of both inner and outer objectives.
The algorithm guarantees convergence by improving the inner (resp., outer) solution with a larger
(resp., smaller) step size, also known as using a faster (resp., slower) timescale.

• We analyze the expected convergence rates of the TTSA algorithm. Our results are summarized
in Table 1. Our analysis is accomplished by building a set of coupled inequalities for the one-step
update in TTSA. For a strongly convex outer function, we derive inequalities that couple the
outer and inner optimality gaps. For convex or weakly convex outer functions, we establish
inequalities coupling the difference of successive outer iterates, the optimality gap of the function
values, and the inner optimality gap. We also provide new and generic results for solving such
coupled inequalities. The distinction of timescales between the step sizes of the inner and outer
updates plays a crucial role in our convergence analysis.

• Finally, we illustrate the application of our analysis to a two-timescale natural actor-critic
policy optimization algorithm with linear function approximation [30, 45]. The natural
actor-critic algorithm converges at the rate O(K^{−1/4}) to an optimal policy, which is comparable
to state-of-the-art results.

The rest of this paper is organized as follows. §2 formally describes the problem setting of bilevel
optimization and specifies the problem class of interest. In addition, the TTSA algorithm is introduced
and some application examples are discussed. §3 presents the main convergence results for the
generic bilevel optimization problem (1). The convergence analysis is also presented where we
highlight the main proof techniques used. Lastly, §4 discusses the application to reinforcement
learning. Notice that some technical details of the proof have been relegated to the online appendix
[24].

1.1 Related Works

The study of bilevel optimization problem (1) can be traced to that of Stackelberg games [51], where
the outer (resp. inner) problem optimizes the action taken by a leader (resp. the follower). In the
optimization literature, bilevel optimization was introduced in [10] for resource allocation problems,
and later studied in [9]. Furthermore, bilevel optimization is a special case of the broader class of
problems known as Mathematical Programming with Equilibrium Constraints [39].
Many related algorithms have been proposed for bilevel optimization. These include approximate
descent methods [19, 56] and penalty-based methods [26, 58]. The approximate descent methods handle
a subclass of problems where the outer problem possesses a certain (local) differentiability property,
while the penalty-based methods approximate the inner and/or outer problems with appropriate
penalty functions. It is noted in [12] that descent-based methods make relatively strong assumptions
about the inner problem (such as non-degeneracy), while the penalty-based methods are typically slow.
Moreover, these works typically focus on asymptotic convergence analysis, without characterizing
convergence rates; see [12] for a comprehensive survey.
In [13, 23, 27], the authors considered bilevel problems in the (stochastic) unconstrained setting,
where the outer problem is non-convex and the inner problem is strongly (or strictly) convex. These
works are the most closely related to the algorithms and results developed in the current paper. In
this case, the (stochastic) gradient of the outer problem may be computed using the chain rule.
However, to obtain an accurate estimate, one has to either use a double-loop structure in which the
inner loop solves the inner sub-problem to high accuracy [13, 23], or use a large batch size (e.g.,
O(1/ε)) [27]. Both of these methods can be difficult to implement in practice, as the batch size or
the number of required inner-loop iterations is difficult to tune. In reference [48], the

authors analyzed a special bilevel problem in which a single optimization variable is shared by the
outer and inner levels. The authors proposed a Sequential Averaging Method (SAM), which provably
solves problems with a strongly convex outer problem and convex inner problems. Building
upon SAM, [35, 38] developed first-order algorithms for bilevel problems without requiring that,
for each fixed outer-level variable, the inner-level solution be a singleton.
In a different line of recent works, references [36, 50] proposed and analyzed different versions
of the so-called truncated back-propagation approach for approximating the (stochastic) gradient
of the outer-problem, and established convergence for the respective algorithms. The idea is to use
a dynamical system to model an optimization algorithm that solves the inner problem, and then
replace the optimal solution of the inner problem by unrolling a few iterations of the updates. How-
ever, computing the (hyper-)gradient of the objective function `(x) requires using back-propagation
through the optimization algorithm, which can be computationally very expensive. It is important to
note that none of the methods discussed above considers single-loop stochastic algorithms,
in which a small batch of samples is used to approximate the inner and outer gradients at each
iteration. Later we will see that the ability to update using a small number of samples
for both the outer and inner problems is critical in a number of applications, and it is also beneficial
numerically.
In contrast to the above mentioned works, this paper considers a TTSA algorithm for stochastic
bilevel optimization, which is a single-loop algorithm employing cheap stochastic estimates of the
gradients. Notice that TTSA [7] is a class of algorithms designed to solve coupled systems of (nonlinear)
equations. While its asymptotic convergence properties are well understood, e.g., [7, 8, 32],
the convergence rate analysis has focused on linear cases, e.g., [14, 31, 34]. In general, the
bilevel optimization problem (1) requires a nonlinear TTSA algorithm. For this case, an asymptotic
convergence rate was analyzed in [42] under a restricted form of nonlinearity. For convergence rate
analysis, [48] considered a single-loop algorithm for deterministic bilevel optimization with only one
variable, and [17] studied the convergence rate when the expected updates are strongly monotone.
Finally, it is worthwhile mentioning that various forms of TTSA have been applied to tackle
compositional stochastic optimization [57], policy evaluation methods [6, 54], and actor-critic meth-
ods [5,33,40]. Notice that some of these optimization problems can be cast as a bilevel optimization,
as we will demonstrate next.

Notations Unless otherwise noted, ‖·‖ is the Euclidean norm on a finite-dimensional Euclidean
space. For a twice differentiable function f : X × Y → R, ∇x f(x, y) (resp. ∇y f(x, y)) denotes its
partial gradient taken w.r.t. x (resp. y), and ∇²yx f(x, y) (resp. ∇²xy f(x, y)) denotes the Jacobian
of ∇x f(x, y) w.r.t. y (resp. of ∇y f(x, y) w.r.t. x). A function ℓ(·) is said to be weakly convex with
modulus µℓ ∈ R if

ℓ(w) ≥ ℓ(v) + ⟨∇ℓ(v), w − v⟩ + µℓ ‖w − v‖²,  ∀ w, v ∈ X.  (2)

Notice that if µℓ ≥ 0 (resp. µℓ > 0), then ℓ(·) is convex (resp. strongly convex).
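As a quick numeric illustration of (2) (an example of our own, not from the paper): ℓ(x) = sin(x) has ℓ''(x) ≥ −1, and is therefore weakly convex with modulus µℓ = −1/2 under (2); note that (2) puts µℓ, rather than the also-common µℓ/2, in front of ‖w − v‖².

```python
import numpy as np

# Check l(w) >= l(v) + l'(v)(w - v) + mu*(w - v)^2 for l = sin, mu = -1/2,
# over a grid of (v, w) pairs; a small tolerance absorbs floating-point error.
mu = -0.5
v = np.linspace(-5.0, 5.0, 201)
V, W = np.meshgrid(v, v)
lhs = np.sin(W)
rhs = np.sin(V) + np.cos(V) * (W - V) + mu * (W - V) ** 2
holds = bool((lhs >= rhs - 1e-9).all())
print(holds)
```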

2 Two-Timescale Stochastic Approximation Algorithm for (1)

To formally define the problem class of interest, we state the following conditions on the bilevel
optimization problem (1).

Assumption 1. The outer functions f (x, y) and `(x) := f (x, y ? (x)) satisfy:

1. For any x ∈ Rd1 , ∇x f (x, ·) and ∇y f (x, ·) are Lipschitz continuous with respect to (w.r.t.)
y ∈ Rd2 , and with constants Lf x and Lf y , respectively.

2. For any y ∈ Rd2 , ∇y f (·, y) is Lipschitz continuous w.r.t. x ∈ X, and with constant L̄f y .

3. For any x ∈ X, y ∈ Rd2 , we have k∇y f (x, y)k ≤ Cfy , for some Cfy > 0.

Assumption 2. The inner function g(x, y) satisfies:

1. For any x ∈ X and y ∈ Rd2 , g(x, y) is twice continuously differentiable in (x, y);

2. For any x ∈ X, ∇y g(x, ·) is Lipschitz continuous w.r.t. y ∈ Rd2 , and with constant Lg .

3. For any x ∈ X, g(x, ·) is strongly convex in y, and with modulus µg > 0.

4. For any x ∈ X, ∇2xy g(x, ·) and ∇2yy g(x, ·) are Lipschitz continuous w.r.t. y ∈ Rd2 , and with
constants Lgxy > 0 and Lgyy > 0, respectively.

5. For any y ∈ R^{d2}, ∇²xy g(·, y) and ∇²yy g(·, y) are Lipschitz continuous w.r.t. x ∈ X, and with
constants L̄gxy > 0 and L̄gyy > 0, respectively.

6. For any x ∈ X and y ∈ Rd2 , we have k∇2xy g(x, y)k ≤ Cgxy for some Cgxy > 0.

Basically, Assumptions 1 and 2 require that the outer and inner functions f, g are well-behaved.
In particular, ∇x f, ∇y f, ∇²xy g, and ∇²yy g are Lipschitz continuous w.r.t. x when y is fixed, and
Lipschitz continuous w.r.t. y when x is fixed. These assumptions are satisfied by common problems
in machine learning and optimization, e.g., the application examples discussed in Sec. 2.1.
Our first endeavor is to develop a single-loop stochastic algorithm for tackling (1). Focusing
on solutions which satisfy the first-order stationarity conditions of (1), we aim at finding a pair of
solutions (x⋆, y⋆) such that

∇y g(x⋆, y⋆) = 0,  ⟨∇ℓ(x⋆), x − x⋆⟩ ≥ 0,  ∀ x ∈ X.  (3)
Given x⋆, a solution y⋆ satisfying the first condition in (3) may be found by a cheap stochastic
gradient recursion y ← y − βh_g with E[h_g] = ∇y g(x⋆, y). On the other hand, given y⋆(x), suppose
that we can obtain a cheap stochastic gradient estimate h_f with E[h_f] = ∇ℓ(x) = ∇̄x f(x, y⋆(x)),
where ∇̄x f(x, y) is a surrogate for ∇ℓ(x) (to be described later); then the second condition can be
satisfied by a simple projected stochastic gradient recursion x ← P_X(x − αh_f), where P_X(·) denotes
the Euclidean projection onto X.
A challenge in designing a single-loop algorithm for satisfying (3) is to ensure that the outer
function’s gradient ∇x f (x, y) is evaluated at an inner solution y that is close to y ? (x). This led
us to develop a two-timescale stochastic approximation (TTSA) [7] framework, as summarized in
Algorithm 1. An important feature is that the algorithm utilizes two step sizes αk , βk for the outer
(xk ), inner (y k ) solution, respectively, designed with different timescales as αk /βk → 0. As a larger
step size is taken to optimize y k , the latter shall stay close to y ? (xk ). Using this strategy, it is
expected that y k will converge to y ? (xk ) asymptotically.

Algorithm 1. Two-Timescale Stochastic Approximation (TTSA)

S0) Initialize the variable (x0 , y 0 ) ∈ X × Rd2 and the step size sequence {αk , βk }k≥0 ;
S1) For iteration k = 0, ..., K,

y k+1 = y k − βk · hkg , (4a)


xk+1 = PX (xk − αk · hkf ), (4b)

where hkg , hkf are stochastic estimates of ∇y g(xk , y k ), ∇x f (xk , y k+1 ) [cf. (6)], respectively, satis-
fying Assumption 3 given below. Moreover, PX (·) is the Euclidean projection operator onto the
convex set X.
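A minimal sketch of the two updates (4a)-(4b) on a toy quadratic bilevel instance of our own construction; the problem data, step-size constants, and noise level below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 2, 2
A = np.array([[1.0, 0.5], [0.0, 1.0]])

# Toy instance (ours): inner g(x, y) = 0.5||y - Ax||^2 is strongly convex in y
# with y*(x) = Ax; outer f(x, y) = 0.5||y||^2 + 0.5||x||^2 on X = [-1, 1]^2,
# so l(x) = 0.5||Ax||^2 + 0.5||x||^2 is minimized at x = 0.
def proj_X(x):
    return np.clip(x, -1.0, 1.0)      # Euclidean projection onto the box X

x = np.array([0.5, 0.5])
y = np.array([1.0, -1.0])
noise = 0.1
for k in range(2000):
    alpha = 0.5 / (k + 10)            # slower (outer) timescale
    beta = 0.5 / (k + 10) ** (2 / 3)  # faster (inner) timescale, alpha/beta -> 0
    # (4a): inner SGD step, h_g is a noisy version of grad_y g(x, y) = y - Ax
    h_g = (y - A @ x) + noise * rng.standard_normal(d2)
    y = y - beta * h_g
    # (4b): projected outer SGD step using the surrogate (6); for this instance
    # grad_x f = x, grad2_xy g = -A^T, grad2_yy g = I, grad_y f = y, so the
    # surrogate is x + A^T y.
    h_f = (x + A.T @ y) + noise * rng.standard_normal(d1)
    x = proj_X(x - alpha * h_f)

print(np.linalg.norm(x), np.linalg.norm(y - A @ x))
```

In this run x drifts toward the minimizer of ℓ while y tracks y⋆(x^k) = Ax^k, illustrating the timescale separation.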

Inspired by [23], we provide a method for computing a surrogate of ∇ℓ(x) given y, for general
objective functions satisfying Assumptions 1, 2. Given y⋆(x), we observe that, using the chain rule,
the gradient of ℓ(x) can be derived as

\[
\nabla \ell(x) = \nabla_x f\big(x, y^\star(x)\big) - \nabla^2_{xy} g\big(x, y^\star(x)\big) \big[ \nabla^2_{yy} g\big(x, y^\star(x)\big) \big]^{-1} \nabla_y f\big(x, y^\star(x)\big). \tag{5}
\]

We note that the computation of the above gradient critically depends on the fact that the inner
problem is strongly convex and unconstrained, so that the implicit function theorem can be applied
when computing ∇y⋆(x).
We may now define ∇̄x f(x, y) as a surrogate of ∇ℓ(x) by replacing y⋆(x) with y ∈ R^{d2}:

\[
\bar\nabla_x f(x, y) := \nabla_x f(x, y) - \nabla^2_{xy} g(x, y) \big[ \nabla^2_{yy} g(x, y) \big]^{-1} \nabla_y f(x, y). \tag{6}
\]

Notice that ∇ℓ(x) = ∇̄x f(x, y⋆(x)), so (6) is a surrogate for ∇ℓ(x) that may be used in TTSA.
We emphasize that (6) is not the only construction; TTSA can accommodate other forms of
gradient surrogate. For example, see (41) in the application of our results to actor-critic.
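The identity ∇ℓ(x) = ∇̄x f(x, y⋆(x)) behind (5)-(6) can be verified numerically on a small quadratic instance of our own choosing, where both sides have closed forms:

```python
import numpy as np

# Toy instance (ours): g(x, y) = 0.5||y - Ax||^2 gives y*(x) = Ax, and
# f(x, y) = 0.5||y||^2 + 0.5||x||^2 gives l(x) = 0.5||Ax||^2 + 0.5||x||^2,
# so the exact gradient is grad l(x) = (A^T A + I) x.
A = np.array([[1.0, 0.5], [0.0, 2.0]])
x = np.array([0.3, -0.7])
y_star = A @ x                       # closed-form inner solution

grad_x_f = x                         # partial gradient of f in x
grad_y_f = y_star                    # partial gradient of f in y, at y = y*(x)
H_xy = -A.T                          # grad2_xy g: x-Jacobian of grad_y g = y - Ax
H_yy = np.eye(2)                     # grad2_yy g
surrogate = grad_x_f - H_xy @ np.linalg.inv(H_yy) @ grad_y_f   # Eq. (6)

exact = (A.T @ A + np.eye(2)) @ x    # closed-form grad l(x)
print(np.max(np.abs(surrogate - exact)))   # agreement up to rounding
```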
Let F_k := σ{y⁰, x⁰, ..., y^k, x^k} and F'_k := σ{y⁰, x⁰, ..., y^k, x^k, y^{k+1}} be the filtrations of the
random variables up to iteration k, where σ{·} denotes the σ-algebra generated by the random variables.
We consider the following assumption regarding h^k_f, h^k_g:

Assumption 3. For any k ≥ 0, there exist constants σ_g, σ_f and a nonincreasing sequence {b_k}_{k≥0}
such that:

E[h^k_g | F_k] = ∇y g(x^k, y^k),  E[h^k_f | F'_k] = ∇̄x f(x^k, y^{k+1}) + B_k,  ‖B_k‖ ≤ b_k,  (7a)
E[‖h^k_g − ∇y g(x^k, y^k)‖² | F_k] ≤ σ_g² · {1 + ‖∇y g(x^k, y^k)‖²},  (7b)
E[‖h^k_f − B_k − ∇̄x f(x^k, y^{k+1})‖² | F'_k] ≤ σ_f².  (7c)

Notice that the conditions on h^k_g are standard when the latter is taken as a stochastic gradient
of g(x^k, y^k), while h^k_f is a potentially biased estimate of ∇̄x f(x^k, y^{k+1}). As we will see in the
convergence analysis, the bias shall decay polynomially to zero.
In light of (6) and inspired by [23], we suggest constructing a stochastic estimate of ∇̄x f(x^k, y^{k+1})
as follows. Let t_max(k) ≥ 1 be an integer, let c_h ∈ (0, 1] be a scalar parameter, and denote x ≡ x^k,
y ≡ y^{k+1} for brevity. Consider:

1. Select p ∈ {0, ..., t_max(k) − 1} uniformly at random and draw 2 + p independent samples as
ξ⁽¹⁾ ∼ µ⁽¹⁾, ξ₀⁽²⁾, ..., ξ_p⁽²⁾ ∼ µ⁽²⁾.

2. Construct the gradient estimator h^k_f as

\[
h_f^k = \nabla_x f(x, y; \xi^{(1)}) - \frac{t_{\max}(k)\, c_h}{L_g}\, \nabla^2_{xy} g(x, y; \xi_0^{(2)}) \prod_{i=1}^{p} \Big( I - \frac{c_h}{L_g} \nabla^2_{yy} g(x, y; \xi_i^{(2)}) \Big)\, \nabla_y f(x, y; \xi^{(1)}),
\]

where, as a convention, we set \(\prod_{i=1}^{0} \big( I - \frac{c_h}{L_g} \nabla^2_{yy} g(x, y; \xi_i^{(2)}) \big) = I\).
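To see why this randomized truncation controls the bias in expectation, note that in the noiseless case, averaging over the uniformly drawn index p telescopes a truncated Neumann series for [∇²yy g]⁻¹, so the bias against the true inverse decays geometrically in t_max, consistent with (10). A minimal sketch (the matrix H and the constants are our own illustrative choices):

```python
import numpy as np

H = np.array([[2.0, 0.3], [0.3, 1.0]])   # stand-in for grad2_yy g (SPD)
L_g, c_h = 3.0, 1.0                      # chosen so c_h/L_g < 1/lambda_max(H)
M = np.eye(2) - (c_h / L_g) * H          # contraction: spectral radius < 1

biases = []
for t_max in (5, 20, 80):
    # exact expectation over p ~ Uniform{0, ..., t_max - 1} of the estimator
    # (t_max * c_h / L_g) * M^p; it equals H^{-1} (I - M^{t_max}) here
    est = (t_max * c_h / L_g) * sum(
        np.linalg.matrix_power(M, p) for p in range(t_max)) / t_max
    biases.append(np.linalg.norm(est - np.linalg.inv(H), 2))

print(biases)   # shrinking geometrically in t_max
```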

In the above, the distributions µ⁽¹⁾, µ⁽²⁾ are defined such that they yield unbiased estimates of the
gradients/Jacobians/Hessians:

∇x f(x, y) = E_{µ⁽¹⁾}[∇x f(x, y; ξ⁽¹⁾)],  ∇y f(x, y) = E_{µ⁽¹⁾}[∇y f(x, y; ξ⁽¹⁾)],  (8)
∇²xy g(x, y) = E_{µ⁽²⁾}[∇²xy g(x, y; ξ⁽²⁾)],  ∇²yy g(x, y) = E_{µ⁽²⁾}[∇²yy g(x, y; ξ⁽²⁾)],

and satisfying E[‖∇y f(x, y; ξ⁽¹⁾)‖²] ≤ C_y, E[‖∇²xy g(x, y; ξ⁽²⁾)‖²] ≤ C_g, and

E[‖∇x f(x, y) − ∇x f(x, y; ξ⁽¹⁾)‖²] ≤ σ²_{fx},  E[‖∇y f(x, y) − ∇y f(x, y; ξ⁽¹⁾)‖²] ≤ σ²_{fy},  (9)
E[‖∇²xy g(x, y) − ∇²xy g(x, y; ξ⁽²⁾)‖₂²] ≤ σ²_{gxy},

where ‖·‖₂ is the Schatten-2 norm. For convenience of analysis, we assume µ_g/(µ_g² + σ²_{gxy}) ≤ 1 and L_g ≥ 1.
The next lemma shows that h^k_f satisfies Assumption 3.

Lemma 1. Under Assumptions 1, 2, (8), (9), and with c_h = µ_g/(µ_g² + σ²_{gxy}), for any x ∈ X,
y ∈ R^{d2}, t_max(k) ≥ 1, it holds that

\[
\big\| \bar\nabla_x f(x^k, y^{k+1}) - \mathbb{E}[h_f^k] \big\| \;\le\; C_{g_{xy}} C_{f_y} \cdot \frac{1}{\mu_g} \cdot \Big( 1 - \frac{\mu_g^2}{L_g (\mu_g^2 + \sigma_{g_{xy}}^2)} \Big)^{t_{\max}(k)}. \tag{10}
\]

Furthermore, the variance is bounded as

\[
\mathbb{E}\big[ \| h_f^k - \mathbb{E}[h_f^k] \|^2 \big] \;\le\; \sigma_{f_x}^2 + \Big[ (\sigma_{f_y}^2 + C_y^2) \{ \sigma_{g_{xy}}^2 + 2 C_{g_{xy}}^2 \} + \sigma_{f_y}^2 C_{g_{xy}}^2 \Big] \max\Big\{ \frac{3}{\mu_g^2},\; \frac{3 d_1 / L_g}{\mu_g^2 + \sigma_{g_{xy}}^2} \Big\}. \tag{11}
\]

The proof of the above lemma is relegated to our online appendix; see §E in [24]. Note that the
variance bound (11) relies on analyzing the expected norm of a product of random matrices, using
techniques inspired by [18, 25]. Finally, observe that the upper bounds in (10), (11) correspond to
b_k, σ_f² in Assumption 3, respectively, so the requirements on h^k_f are satisfied.
To further understand the properties of the TTSA algorithm with (6), we borrow the following
results from [23] on the Lipschitz continuity of the maps ∇ℓ(x), y⋆(x):

Lemma 2. [23, Lemma 2.2] Under Assumptions 1, 2, it holds that

‖∇̄x f(x, y) − ∇ℓ(x)‖ ≤ L‖y⋆(x) − y‖,  ‖y⋆(x₁) − y⋆(x₂)‖ ≤ L_y‖x₁ − x₂‖,  (12a)
‖∇ℓ(x₁) − ∇ℓ(x₂)‖ = ‖∇̄x f(x₁, y⋆(x₁)) − ∇̄x f(x₂, y⋆(x₂))‖ ≤ L_f‖x₁ − x₂‖,  (12b)

for any x, x₁, x₂ ∈ X and y ∈ R^{d2}, where we have defined

\[
L := L_{f_x} + \frac{L_{f_y} C_{g_{xy}}}{\mu_g} + C_{f_y} \Big( \frac{L_{g_{xy}}}{\mu_g} + \frac{L_{g_{yy}} C_{g_{xy}}}{\mu_g^2} \Big),
\]
\[
L_f := L_{f_x} + \frac{(\bar L_{f_y} + L)\, C_{g_{xy}}}{\mu_g} + C_{f_y} \Big( \frac{\bar L_{g_{xy}}}{\mu_g} + \frac{\bar L_{g_{yy}} C_{g_{xy}}}{\mu_g^2} \Big), \qquad L_y = \frac{C_{g_{xy}}}{\mu_g}. \tag{13}
\]

The above properties will be pivotal in establishing the convergence of TTSA. First, we note that
(12b) implies that the composite function ℓ(x) is weakly convex with a modulus of at least (−L_f).
Furthermore, (7c) in Assumption 3 combined with Lemma 2 leads to the following estimate:

E[‖h^k_f‖² | F'_k] ≤ σ̃_f² + 3b_k² + 3L²‖y^{k+1} − y⋆(x^k)‖²,  where σ̃_f² := σ_f² + 3 sup_{x∈X} ‖∇ℓ(x)‖².  (14)

Throughout, we assume σ̃_f² is bounded; this holds, e.g., if X is bounded, or if ℓ(x) has
bounded gradient.

2.1 Applications

Practical problems such as hyperparameter optimization [22, 41, 50] and Stackelberg games [51] can
be cast as special cases of the bilevel optimization problem (1). To be specific, we discuss three
applications of the bilevel optimization problem (1) below.

Model-Agnostic Meta-Learning An important paradigm of machine learning is to find a model
that adapts to multiple training sets in order to achieve the best performance on individual tasks.
Among others, a popular formulation is model-agnostic meta-learning (MAML) [20], which minimizes
an outer objective of empirical risk over all training sets, while the inner objective is a one-step
projected gradient. Let D⁽ʲ⁾ = {z_i⁽ʲ⁾}_{i=1}^n be the j-th (j ∈ [J]) training set with sample size n.
MAML can be formulated as a bilevel optimization problem [47]:

\[
\min_{\theta \in \Theta} \; \sum_{j=1}^{J} \sum_{i=1}^{n} \bar\ell\big( \theta^{\star(j)}(\theta),\, z_i^{(j)} \big)
\quad \text{subject to} \quad
\theta^{\star(j)}(\theta) \in \arg\min_{\theta^{(j)}} \; \sum_{i=1}^{n} \big\langle \theta^{(j)},\, \nabla_\theta \bar\ell(\theta, z_i^{(j)}) \big\rangle + \frac{\lambda}{2} \big\| \theta^{(j)} - \theta \big\|^2. \tag{15}
\]

Here θ is the shared model parameter, θ⁽ʲ⁾ is the adaptation of θ to the j-th training set, and ℓ̄ is the
loss function. It can be checked that the inner problem is strongly convex. Assumptions 1, 2, 3 hold
for stochastic gradient updates provided ℓ̄ is sufficiently regular, e.g., when the losses are the logistic
loss. Moreover, [21] proved that, assuming λ is sufficiently large and ℓ̄ is strongly convex, the outer
problem is also strongly convex. In fact, [46] demonstrated that an algorithm with no inner loop
achieves a comparable performance to [21].
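The inner problem of (15) is a λ-strongly-convex quadratic in θ⁽ʲ⁾, so its minimizer has a closed form: one gradient step from θ with step size 1/λ. A short check, with illustrative numbers of our own:

```python
import numpy as np

# Inner problem of (15) for a fixed task j: min_u <u, G> + (lam/2)||u - theta||^2,
# where G aggregates the sampled gradients grad l(theta, z_i). Setting the
# gradient G + lam*(u - theta) to zero gives u* = theta - G/lam.
rng = np.random.default_rng(1)
theta = rng.standard_normal(3)       # shared model parameter
G = rng.standard_normal(3)           # stand-in for the aggregated task gradient
lam = 2.5

u_star = theta - G / lam             # closed-form inner solution
grad_at_u = G + lam * (u_star - theta)   # first-order optimality residual
print(np.max(np.abs(grad_at_u)))     # zero up to rounding
```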

Policy Optimization Another application of the bilevel optimization problem (1) is policy
optimization, particularly when combined with an actor-critic scheme. The goal is to find an
optimal policy that maximizes the expected (discounted) reward. Here, the 'actor' serves as the
outer problem, while the 'critic' serves as the inner problem, which evaluates the performance of
the 'actor' (the current policy). To avoid redundancy, we refer our readers to §4, where we present
a detailed case study. The latter will also shed light on the generality of our proof techniques for
TTSA algorithms.

Data hyper-cleaning The data hyper-cleaning problem trains a classifier with a dataset of
randomly corrupted labels [50]. The problem formulation is given below:

\[
\min_{x \in \mathbb{R}^{d_1}} \; \ell(x) := \sum_{i \in \mathcal{D}_{\rm val}} L\big( a_i^\top y^\star(x),\, b_i \big)
\quad \text{s.t.} \quad
y^\star(x) = \arg\min_{y \in \mathbb{R}^{d_2}} \Big\{ \lambda \|y\|^2 + \sum_{i \in \mathcal{D}_{\rm tr}} \sigma(x_i)\, L(a_i^\top y, b_i) \Big\}. \tag{16}
\]

In this problem, we have d₁ = |D_tr|, and d₂ is the dimension of the classifier y; (a_i, b_i) is the i-th
data point; L(·) is the loss function; x_i is the parameter that determines the weight of the i-th data
sample; σ : R → R₊ is the weight function; λ > 0 is a regularization parameter; D_val and D_tr are the
validation and training sets, respectively. Here, the inner problem finds the classifier y⋆(x) on the
training set D_tr, while the outer problem finds the best weights x with respect to the validation set
D_val.
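For a concrete feel of (16), take L to be the squared loss and σ the sigmoid; the inner ridge problem then has a closed form, and one can check that down-weighting corrupted training samples reduces the validation loss ℓ(x). The data and constants below are our own toy construction:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

w_true = np.array([1.0, -1.0])
A_tr = rng.standard_normal((20, 2))
b_tr = A_tr @ w_true
b_tr[10:] += 5.0                       # corrupt the last 10 training labels
A_val = rng.standard_normal((30, 2))
b_val = A_val @ w_true                 # clean validation set
lam = 0.1

def inner_solution(x):
    # y*(x) = argmin_y lam*||y||^2 + sum_i sigmoid(x_i)*0.5*(a_i^T y - b_i)^2,
    # solved in closed form with sample weights W = diag(sigmoid(x))
    W = np.diag(sigmoid(x))
    return np.linalg.solve(2 * lam * np.eye(2) + A_tr.T @ W @ A_tr,
                           A_tr.T @ W @ b_tr)

def outer_loss(x):                     # l(x): validation loss at y*(x)
    r = A_val @ inner_solution(x) - b_val
    return 0.5 * r @ r

x_uniform = np.zeros(20)               # every sample gets weight sigmoid(0) = 0.5
x_clean = np.concatenate([5 * np.ones(10), -5 * np.ones(10)])  # down-weight bad
loss_uniform, loss_clean = outer_loss(x_uniform), outer_loss(x_clean)
print(loss_uniform, loss_clean)        # cleaning weights reduce the outer loss
```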
Before ending this subsection, let us mention that we are not aware of any general sufficient
conditions for verifying whether the outer function ℓ(x) is (strongly) convex. To our knowledge,
the convexity of ℓ(x) has to be verified on a case-by-case basis; please see [21, Appendix B3] for
how this can be done.

3 Main Results

This section presents the convergence results for the TTSA algorithm applied to (1). We first
summarize a list of important constants in Table 2, to be used in the forthcoming analysis. Next,
we discuss a few concepts pivotal to our analysis.

Tracking Error TTSA tackles the inner and outer problems simultaneously using single-loop
updates. Due to the coupled nature of the inner and outer problems, in order to obtain an upper
bound on the optimality gap Δ^k_x := E[‖x^k − x⋆‖²], where x⋆ is an optimal solution to (1), we need
to estimate the tracking error defined as

Δ^k_y := E[‖y^k − y⋆(x^{k−1})‖²],  where y⋆(x) = arg min_{y∈R^{d2}} g(x, y).  (17)

For any x ∈ X, y⋆(x) is well defined since the inner problem is strongly convex by Assumption 2.
By definition, Δ^k_y quantifies how close y^k is to the optimal solution of the inner problem given x^{k−1}.

Moreau Envelope Fix ρ > 0, and define the Moreau envelope and proximal map as

Φ_{1/ρ}(z) := min_{x∈X} { ℓ(x) + (ρ/2)‖x − z‖² },  x̂(z) := arg min_{x∈X} { ℓ(x) + (ρ/2)‖x − z‖² }.  (18)

For any ε > 0, x^k ∈ X is said to be an ε-nearly stationary solution [16] if x^k is an approximate
fixed point of x̂(·), in the sense that

Δ̃^k_x := E[‖x̂(x^k) − x^k‖²] ≤ ρ⁻² · ε.  (19)
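On a one-dimensional instance of our own (ℓ(x) = (x − 1.5)² on X = [0, 2]), the proximal map in (18) has a closed form, and the near-stationarity measure in (19) vanishes exactly at the minimizer:

```python
import numpy as np

rho = 2.0
def xhat(z):
    # argmin over [0, 2] of (x - 1.5)^2 + (rho/2)(x - z)^2: set the derivative
    # 2(x - 1.5) + rho*(x - z) to zero, then clip to the interval
    return np.clip((3.0 + rho * z) / (2.0 + rho), 0.0, 2.0)

# cross-check the closed form against brute-force minimization on a fine grid
grid = np.linspace(0.0, 2.0, 200001)
for z in (0.0, 0.5, 1.5, 2.0):
    vals = (grid - 1.5) ** 2 + 0.5 * rho * (grid - z) ** 2
    assert abs(grid[np.argmin(vals)] - xhat(z)) < 1e-4

print(xhat(1.5) - 1.5, xhat(0.5) - 0.5)  # zero at the minimizer, nonzero away
```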

Table 2: Summary of the Constants for Section 3

  Constant           | Description                                                        | Reference
  L_{fx}, L_{fy}     | Lipschitz constants of ∇x f(x, ·), ∇y f(x, ·) w.r.t. y, resp.      | Assumption 1
  L̄_{fy}             | Lipschitz constant of ∇y f(·, y) w.r.t. x                          | Assumption 1
  C_{fy}             | Upper bound on ‖∇y f(x, y)‖                                        | Assumption 1
  L_g                | Lipschitz constant of ∇y g(x, ·)                                   | Assumption 2
  µ_g                | Strong convexity modulus of g(x, ·) w.r.t. y                       | Assumption 2
  L_{gxy}, L_{gyy}   | Lipschitz constants of ∇²xy g(x, ·), ∇²yy g(x, ·) w.r.t. y, resp.  | Assumption 2
  L̄_{gxy}, L̄_{gyy}   | Lipschitz constants of ∇²xy g(·, y), ∇²yy g(·, y) w.r.t. x, resp.  | Assumption 2
  C_{gxy}            | Upper bound on ‖∇²xy g(x, y)‖                                      | Assumption 2
  b_k                | Bound on the bias of h^k_f at iteration k                          | Assumption 3
  σ_g², σ_f²         | Variances of the stochastic estimates h^k_g, h^k_f, resp.          | Assumption 3
  σ̃_f²               | Constant term in the bound on E[‖h^k_f‖²]                          | (14)
  L                  | Bound on ‖∇̄x f(x, y) − ∇ℓ(x)‖ in terms of ‖y⋆(x) − y‖              | Lemma 2
  L_y                | Lipschitz constant of y⋆(x)                                        | Lemma 2
  L_f                | Lipschitz constant of ∇ℓ(x)                                        | Lemma 2

We observe that if ε = 0, then x^k ∈ X is a stationary solution to (1) satisfying the second condition
in (3). As we will demonstrate next, the near-stationarity condition (19) provides an apparatus to
quantify the finite-time convergence of TTSA in the case when ℓ(x) is non-convex.

3.1 Strongly Convex Outer Objective Function

Our first result considers the instance of (1) where `(x) is strongly convex. We obtain:
Theorem 1. Suppose Assumptions 1, 2, 3 hold, ℓ(x) is weakly convex with a modulus µ_ℓ > 0
(i.e., it is strongly convex), and the step sizes satisfy

\[
\alpha_k \le c_0 \beta_k^{3/2}, \quad \beta_k \le c_1 \alpha_k^{2/3}, \quad \frac{\beta_{k-1}}{\beta_k} \le 1 + \frac{\beta_k \mu_g}{8}, \quad \frac{\alpha_{k-1}}{\alpha_k} \le 1 + \frac{3 \alpha_k \mu_\ell}{4}, \tag{20a}
\]
\[
\alpha_k \le \frac{1}{\mu_\ell}, \quad \beta_k \le \min\Big\{ \frac{1}{\mu_g},\; \frac{\mu_g}{L_g^2 (1 + \sigma_g^2)},\; \frac{\mu_g^2}{48\, c_0^2 L^2 L_y^2} \Big\}, \quad 8 \mu_\ell \alpha_k \le \mu_g \beta_k, \quad \forall\, k \ge 0, \tag{20b}
\]

where the constants L, L_y were defined in Lemma 2 and c_0, c_1 > 0 are free parameters. If the bias
is bounded as b_k² ≤ c̃_b α_{k+1}, then for any k ≥ 1, the TTSA iterates satisfy

\[
\Delta_x^k \lesssim \prod_{i=0}^{k-1} (1 - \alpha_i \mu_\ell) \Big( \Delta_x^0 + \frac{L^2}{\mu_\ell^2} \Delta_y^0 \Big) + \frac{c_1 L^2}{\mu_\ell^2} \Big[ \frac{\sigma_g^2}{\mu_g} + \frac{c_0^2 L_y^2}{\mu_g^2} \tilde\sigma_f^2 \Big] \alpha_{k-1}^{2/3},
\]
\[
\Delta_y^k \lesssim \prod_{i=0}^{k-1} (1 - \beta_i \mu_g / 4)\, \Delta_y^0 + \Big[ \frac{\sigma_g^2}{\mu_g} + \frac{c_0^2 L_y^2}{\mu_g^2} \tilde\sigma_f^2 \Big] \beta_{k-1}, \tag{21}
\]

where the symbol ≲ indicates that numerical constants are omitted (see Section 3.3).

Notice that the bounds in (21) show that the expected optimality gap and tracking error at the k-th
iteration consist of a transient term and a fluctuation term. For instance, in the bound on Δ^k_x, the
first (transient) term decays sub-geometrically as ∏_{i=0}^{k−1}(1 − α_i µ_ℓ), while the second (fluctuation)
term decays as α_{k−1}^{2/3}.

The conditions in (20) are satisfied by both diminishing and constant step sizes. For example,
define the constants

\[
k_\alpha = \max\Big\{ 3^5 \Big( \frac{L_g}{\mu_g} \Big)^3 (1 + \sigma_g^2)^{3/2},\; \Big( \frac{512\, L^2 L_y^2}{\mu_\ell^2} \Big)^{3/2} \Big\}, \quad c_\alpha = \frac{8}{3\mu_\ell}, \quad k_\beta = \frac{k_\alpha}{4}, \quad c_\beta = \frac{32}{3\mu_g}. \tag{22}
\]

Then, for diminishing step sizes, we set α_k = c_α/(k + k_α), β_k = c_β/(k + k_β)^{2/3}, and for constant
step sizes, we set α_k = c_α/k_α, β_k = c_β/k_β^{2/3}. Both pairs of step sizes satisfy (20) with
c_0 = µ_g^{3/2}/µ_ℓ and c_1 = 10 µ_ℓ^{2/3}/µ_g. For diminishing step sizes, Theorem 1 shows that the
last-iterate convergence rate for both the optimality gap and the tracking error is O(k^{−2/3}). To
compute an ε-optimal solution with Δ^k_x ≤ ε, the TTSA algorithm with diminishing step sizes
requires a total of O(ε^{−3/2} log(1/ε)) calls to the stochastic (gradient/Hessian/Jacobian) oracles of
both the outer (f(·,·)) and inner (g(·,·)) functions¹. While this is arguably the easiest case of (1),
we notice that the double-loop algorithm in [23] requires O(1/ε) and O(1/ε²) stochastic oracle calls
for the outer (i.e., f(·,·)) and inner (i.e., g(·,·)) functions, respectively. As such, the TTSA
algorithm requires fewer stochastic oracle calls for the inner function.
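A quick sanity check, with simple illustrative constants rather than the exact values in (22), that diminishing schedules of this shape separate the two timescales and satisfy ratio conditions of the type in (20a):

```python
import numpy as np

# alpha_k = c_a/(k + k_a), beta_k = c_b/(k + k_b)^(2/3); the constants below
# are placeholders of our own, not the paper's k_alpha, c_alpha, k_beta, c_beta
k_a, c_a, k_b, c_b = 100.0, 2.0, 25.0, 1.0
k = np.arange(100000)
alpha = c_a / (k + k_a)
beta = c_b / (k + k_b) ** (2 / 3)

ratio = alpha / beta                  # timescale separation: ratio decays to 0
# a (20a)-style ratio condition, here with the constant mu_g/8 set to 1:
cond = beta[:-1] / beta[1:] <= 1 + beta[1:]
print(ratio[0], ratio[-1], bool(cond.all()))
```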

3.2 Smooth (Possibly Non-convex) Outer Objective Function

We focus on the case where `(x) is weakly convex. We obtain

Theorem 2. Suppose Assumptions 1, 2, 3 hold and ℓ(·) is weakly convex with modulus µ_ℓ ∈ R.
Let K_max ≥ 1 be the maximum iteration number and set

\[
\alpha = \min\Big\{ \frac{\mu_g^2}{8 L_y L},\; \frac{1}{L_g^2 (1 + \sigma_g^2)} K_{\max}^{-3/5} \Big\}, \quad
\beta = \min\Big\{ \frac{\mu_g}{4 L_y L},\; \frac{2}{L_g^2 (1 + \sigma_g^2)\, \mu_g} K_{\max}^{-2/5} \Big\}. \tag{23}
\]

If b_k² ≤ α, then for any K_max ≥ 1, the iterates of the TTSA algorithm satisfy

\[
\mathbb{E}[\widetilde\Delta_x^{K}] = O(K_{\max}^{-2/5}), \qquad \mathbb{E}[\Delta_y^{K+1}] = O(K_{\max}^{-2/5}), \tag{24}
\]

where K is an independent uniformly distributed random variable on {0, ..., K_max − 1}, and we recall
Δ̃^k_x := ‖x̂(x^k) − x^k‖². When K_max is large and µ_ℓ < 0, setting ρ = 2|µ_ℓ| yields

\[
\mathbb{E}[\widetilde\Delta_x^{K}] \;\lesssim\; \frac{L^2}{|\mu_\ell|^2} \Big( \Delta^0 + \frac{\sigma_g^2}{\mu_g^2} + \mu_g \widetilde\sigma_f^2 \Big) K_{\max}^{-2/5}, \qquad
\mathbb{E}[\Delta_y^{K+1}] \;\lesssim\; \Big( \frac{\Delta^0}{L} + \frac{\sigma_g^2}{\mu_g^2} + \frac{\mu_g \widetilde\sigma_f^2}{L^2} \Big) K_{\max}^{-2/5}, \tag{25}
\]

where we define Δ⁰ := max{Φ_{1/ρ}(x⁰), L_y OPT₀, Δ⁰_y} and use the conditions α < 1, µ_g ≤ 1; the
symbol ≲ indicates that numerical constants are omitted (see Section 3.3).

The above result uses constant step sizes determined by the maximum number of iterations,
K_max. Here we set the step sizes as α_k ≍ K_max^{−3/5} and β_k ≍ K_max^{−2/5}. Similar to the previous
case of a strongly convex outer objective function, α_k/β_k converges to zero as K_max goes to infinity.
Nevertheless, Theorem 2 shows that TTSA requires O(ε^{−5/2} log(1/ε)) calls to the stochastic oracle for
sampled gradients/Hessians to find an ε-nearly stationary solution. In addition, it is worth noting that

¹ Notice that since we need b_k = O(√α_{k+1}), by Lemma 1 the polynomial bias decay requires using t_max(k) =
O(1 + log(k)) samples per iteration, which justifies the log factor in the bound.

when X = R^{d1}, Theorem 2 implies that TTSA achieves E[‖∇ℓ(x^K)‖²] = O(K_max^{−2/5}) [16, Sec. 2.2],
i.e., x^K is an O(K_max^{−2/5})-approximate (near) stationary point of ℓ(x) in expectation.
Let us compare our sampling complexity bounds to the double-loop algorithm in [23], which
requires O(ε^{−3}) (resp. O(ε^{−2})) stochastic oracle calls for the inner problem (resp. outer problem)
to reach an ε-stationary solution. The sample complexity of TTSA yields a tradeoff between the inner
and outer stochastic oracles. We also observe that a trivial extension to a single-loop algorithm
results in a constant error bound². Finally, we can extend Theorem 2 to the case where ℓ(·) is a
convex function.
Corollary 1. Under Assumptions 1, 2, 3, assume that ℓ(x) is weakly convex with modulus µ_ℓ ≥ 0, i.e., ℓ(x) is convex. Consider (1) with X ⊆ R^{d_1}, D_x = sup_{x,x′∈X} ||x − x′|| < ∞. Let K_max ≥ 1 be the maximum iteration number and set

α = min{ µ_g² / (8 L_y L L_g²(1 + σ_g²)), K_max^{-3/4} / (4 L_y L) },  β = min{ µ_g / (L_g²(1 + σ_g²)), (2/µ_g) K_max^{-1/2} }.  (26)

If b_k ≤ c_b K_max^{-1/4}, then for large K_max, the TTSA algorithm satisfies

E[ℓ(x^K) − ℓ(x*)] = O(K_max^{-1/4}),  E[∆_y^{K+1}] = O(K_max^{-1/2}),  (27)

where K is an independent uniform random variable on {0, ..., K_max − 1}. By convexity, the above implies E[ℓ((1/K_max) Σ_{k=1}^{K_max} x^k) − ℓ(x*)] = O(K_max^{-1/4}).

From Corollary 1, the TTSA algorithm requires O(ε^{-4} log(1/ε)) stochastic oracle calls to find an ε-optimal solution (in terms of the optimality gap defined with objective values). This is comparable to the complexity bounds in [23], which requires O(ε^{-4} log(1/ε)) (resp. O(ε^{-2})) stochastic oracle calls for the inner problem (resp. outer problem). Additionally, we mention that the constant D_x, which represents the diameter of the constraint set, appears in the constants of the convergence bounds; it is therefore omitted in the big-O notation in (27). For details, please see the proof in Appendix B.4.

3.3 Convergence Analysis

We now present the proofs of Theorems 1 and 2. The proof of Corollary 1 is similar to that of Theorem 2; due to the space limitation, we refer the readers to [24]. We highlight that the proofs of both theorems rely on similar ideas for tackling coupled inequalities.

Proof of Theorem 1 Our proof relies on bounding the optimality gap and the tracking error, which are coupled with each other. First, we derive the convergence of the inner problem.
Lemma 3. Under Assumptions 1, 2, 3, suppose that the step sizes satisfy (20a), (20b). For any k ≥ 1, it holds that

∆_y^{k+1} ≤ ∏_{ℓ=0}^k (1 − β_ℓ µ_g/2) ∆_y⁰ + (8/µ_g) { σ_g² + (4c₀² L_y²/µ_g)( σ̃_f² + 3b₀² ) } β_k.  (28)
² To see this, the readers are referred to [23, Theorem 3.1]. If a single inner iteration is performed, t_k = 1, so Ā_k ≥ ||y⁰ − y*(x^k)||, which is a constant. Then the r.h.s. of (3.70), (3.73), (3.74) in [23, Theorem 3.1] will all have a constant term.

Notice that the bound in (28) relies on the strong convexity of the inner problem and the Lipschitz properties of y*(x) established in Lemma 2; see §A.1. We emphasize that the step size condition α_k ≤ c₀ β_k^{3/2} is crucial in establishing the above bound. As the second step, we bound the convergence of the outer problem.

Lemma 4. Under Assumptions 1, 2, 3, assume that the bias satisfies b_k² ≤ c̃_b α_{k+1}. With (20a), (20b), for any k ≥ 1, it holds that

∆_x^{k+1} ≤ ∏_{ℓ=0}^k (1 − α_ℓ µ_ℓ) ∆_x⁰ + [ 2c̃_b/µ_ℓ² + (4σ̃_f² + 6b₀²)/µ_ℓ ] α_k
  + [ 2L²/µ_ℓ + 3α₀ L² ] Σ_{j=0}^k α_j ∏_{ℓ=j+1}^k (1 − α_ℓ µ_ℓ) ∆_y^{j+1}.  (29)

The proof is in §A.2. We observe that (28), (29) lead to a pair of coupled inequalities. To compute the final bound in the theorem, we substitute (28) into (29). As ∆_y^{j+1} = O(β_j) = O(α_j^{2/3}), the dominating term in (29) can be estimated as

Σ_{j=0}^k α_j ∏_{ℓ=j+1}^k (1 − α_ℓ µ_ℓ) ∆_y^{j+1} = Σ_{j=0}^k O(α_j^{5/3}) ∏_{ℓ=j+1}^k (1 − α_ℓ µ_ℓ) = O(α_k^{2/3}),  (30)

yielding the desired rates in the theorem. See §A.3 for details.

Proof of Theorem 2 Without strong convexity in the outer problem, the analysis becomes more
challenging. To this end, we first develop the following lemma on coupled inequalities with numerical
sequences, which will be pivotal to our analysis:

Lemma 5. Let K ≥ 1 be an integer. Consider sequences of non-negative scalars {Ω^k}_{k=0}^K, {Υ^k}_{k=0}^K, {Θ^k}_{k=0}^K, and let c₀, c₁, c₂, d₀, d₁, d₂ be positive constants. Suppose that the recursion

Ω^{k+1} ≤ Ω^k − c₀ Θ^{k+1} + c₁ Υ^{k+1} + c₂,  Υ^{k+1} ≤ (1 − d₀) Υ^k + d₁ Θ^k + d₂  (31)

holds for any k ≥ 0. Then, provided that c₀ − c₁ d₁ (d₀)^{-1} > 0 and d₀ − d₁ c₁ (c₀)^{-1} > 0, it holds that

(1/K) Σ_{k=1}^K Θ^k ≤ [ Ω⁰ + (c₁/d₀)(Υ⁰ + d₁ Θ⁰ + d₂) ] / [ (c₀ − c₁ d₁ (d₀)^{-1}) K ] + [ c₂ + c₁ d₂ (d₀)^{-1} ] / [ c₀ − c₁ d₁ (d₀)^{-1} ],
(1/K) Σ_{k=1}^K Υ^k ≤ [ Υ⁰ + d₁ Θ⁰ + d₂ + (d₁/c₀) Ω⁰ ] / [ (d₀ − d₁ c₁ (c₀)^{-1}) K ] + [ d₂ + d₁ c₂ (c₀)^{-1} ] / [ d₀ − d₁ c₁ (c₀)^{-1} ].  (32)

The proof of the above lemma is simple and is relegated to §B.1.
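As a quick numerical sanity check of Lemma 5 (our illustration; the constants and the sequence construction are arbitrary assumptions), one can generate sequences satisfying (31) and verify the averaged bounds (32):

```python
import numpy as np

# Arbitrary constants with c0 - c1*d1/d0 > 0 and d0 - d1*c1/c0 > 0.
c0, c1, c2 = 1.0, 0.1, 0.05
d0, d1, d2 = 0.5, 0.2, 0.01
K = 500
rng = np.random.default_rng(0)

Om, Up, Th = [1.0], [0.5], [0.0]           # Omega^0, Upsilon^0, Theta^0
for _ in range(K):
    Up.append((1 - d0) * Up[-1] + d1 * Th[-1] + d2)          # equality in (31)
    # pick Theta^{k+1} so that Omega stays non-negative
    Th.append(rng.uniform(0, 1) * (Om[-1] + c1 * Up[-1] + c2) / c0)
    Om.append(Om[-1] - c0 * Th[-1] + c1 * Up[-1] + c2)

den_T = c0 - c1 * d1 / d0
den_U = d0 - d1 * c1 / c0
bound_T = (Om[0] + (c1 / d0) * (Up[0] + d1 * Th[0] + d2)) / (den_T * K) \
          + (c2 + c1 * d2 / d0) / den_T    # first bound in (32)
bound_U = (Up[0] + d1 * Th[0] + d2 + (d1 / c0) * Om[0]) / (den_U * K) \
          + (d2 + d1 * c2 / c0) / den_U    # second bound in (32)
avg_T = sum(Th[1:]) / K
avg_U = sum(Up[1:]) / K
```

Both running averages stay below their respective bounds in (32), for any K.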


We demonstrate that stationarity measures of the TTSA iterates satisfy (31). The conditions
c0 − c1 d1 (d0 )−1 > 0, d0 − d1 c1 (c0 )−1 > 0 impose constraints on the step sizes and (32) leads to a
finite-time bound on the convergence of TTSA. To begin our derivation of Theorem 2, we observe
the following coupled descent lemma:

Lemma 6. Under Assumptions 1, 2, 3, if µ_g β/2 < 1 and β L_g²(1 + σ_g²) ≤ µ_g, then the following inequalities hold for any k ≥ 0:

OPT^{k+1} ≤ OPT^k − ( 1/(2α) − L_f/2 ) E[||x^{k+1} − x^k||²] + α ( 2L² ∆_y^{k+1} + 2b₀² + σ_f² ),  (33a)
∆_y^{k+1} ≤ (1 − µ_g β/2) ∆_y^k + ( 2/(µ_g β) − 1 ) L_y² E[||x^k − x^{k-1}||²] + β² σ_g².  (33b)

The proof of (33a) is due to the smoothness of the outer function ℓ(·) established in Lemma 2, while (33b) follows from the strong convexity of the inner problem. See the details in §B.2. Note that (33a), (33b) together form a special case of (31) with:

Ω^k = OPT^k,  Θ^k = E[||x^k − x^{k-1}||²],  c₀ = 1/(2α) − L_f/2,  c₁ = 2αL²,  c₂ = α(2b₀² + σ_f²),
Υ^k = ∆_y^k,  d₀ = µ_g β/2,  d₁ = ( 2/(µ_g β) − 1 ) L_y²,  d₂ = β² σ_g².  (34)
Notice that Θ⁰ = 0. Assuming that α ≤ 1/(2L_f), we notice the following implication:

α/β ≤ µ_g/(8 L_y L)  ⟹  c₀ − c₁ d₁/d₀ ≥ 1/(8α) > 0,  d₀ − d₁ c₁/c₀ ≥ µ_g β/4 > 0,  (35)

i.e., if (35) holds, then the conclusion (32) can be applied. It can be shown that the step sizes in (23) satisfy (35). Applying Lemma 5 shows that
(1/K) Σ_{k=1}^K E[||x^k − x^{k-1}||²] ≤ [ 2 OPT⁰ + (L/L_y)(∆_y⁰ + 4σ_g²/µ_g²) ] / ( L_y L · K^{8/5} ) + [ (2b₀² + σ_f²) + 8σ_g² L²/µ_g² ] / ( 2 L_y² L² · K^{6/5} ),
(1/K) Σ_{k=1}^K ∆_y^k ≤ [ 2∆_y⁰ + 8σ_g²/µ_g² + (4L_y/(µ_g L)) OPT⁰ ] / K^{3/5} + [ 8σ_g²/µ_g² + µ_g (2b₀² + σ_f²)/(2L²) ] / K^{2/5}.
Again, we emphasize that the two-timescale step size design is crucial to establishing the above upper bounds. Now, recalling the properties of the Moreau envelope in (18), we obtain the following descent estimate:

Lemma 7. Under Assumptions 1, 2, 3, set ρ > −µ_ℓ with ρ ≥ 0. Then for any k ≥ 0, it holds that

E[Φ_{1/ρ}(x^{k+1}) − Φ_{1/ρ}(x^k)] ≤ (5ρ/2) E[||x^{k+1} − x^k||²] + [ 2αρL²/(ρ + µ_ℓ) + 3α²ρL² ] ∆_y^{k+1}
  − ( (ρ + µ_ℓ)ρα/4 ) E[||x̂(x^k) − x^k||²] + [ 2ρ/(ρ + µ_ℓ) + ρ( σ̃_f² + 3α ) ] α².  (36)

See details in §B.3. Summing up the inequality (36) from k = 0 to k = K_max − 1 gives the following upper bound:

(1/K_max) Σ_{k=0}^{K_max−1} E[||x̂(x^k) − x^k||²] ≤ ( 4/((ρ + µ_ℓ)ρ) ) [ Φ_{1/ρ}(x⁰)/(α K_max) + ( 5ρ/(2α K_max) ) Σ_{k=1}^{K_max} E[||x^k − x^{k-1}||²] ]
  + ( 4/(ρ + µ_ℓ) ) [ ( L²/(ρ + µ_ℓ) + 3αL² ) (1/K_max) Σ_{k=1}^{K_max} ∆_y^k + ( 2/(ρ + µ_ℓ) + σ̃_f² + 3α ) α ].

Combining the above and α ≍ K_max^{-3/5} yields the desired (1/K_max) Σ_{k=0}^{K_max−1} E[||x̂(x^k) − x^k||²] = O(K_max^{-2/5}). In particular, the asymptotic bound is given by setting ρ = 2|µ_ℓ|.

Proof of Corollary 1 We observe that Lemma 6 can be applied directly in this setting since convex functions are also weakly convex. With the step size choice (26), similar conclusions hold:

(1/K) Σ_{k=1}^K E[||x^k − x^{k-1}||²] ≤ [ 2 OPT⁰ + (L/L_y)(∆_y⁰ + 4σ_g²/µ_g²) ] / ( L_y L · K^{7/4} ) + [ (2b₀² + σ_f²) + 8σ_g² L²/µ_g² ] / ( 2 L_y² L² · K^{6/4} ),
(1/K) Σ_{k=1}^K ∆_y^k ≤ [ 2∆_y⁰ + 8σ_g²/µ_g² + (4L_y/(µ_g L)) OPT⁰ ] / K^{1/2} + [ 8σ_g²/µ_g² + µ_g (2b₀² + σ_f²)/(2L²) ] / K^{1/2}.

With the additional property µ_ℓ ≥ 0, in §B.4 we further derive an alternative descent estimate to (33a) that leads to the desired bound on K^{-1} Σ_{k=1}^K OPT^k.

4 Application to Reinforcement Learning


Consider a Markov decision process (MDP) (S, A, γ, P, r), where S and A are the state and action
spaces, respectively, γ ∈ [0, 1) is the discount factor, P (s0 |s, a) is the transition kernel to the
next state s0 given the current state s and action a, and r(s, a) ∈ [0, 1] is the reward at (s, a).
Furthermore, the initial state s0 is drawn from a fixed distribution ρ0 . We follow a stationary
policy π : S × A → R. For any (s, a) ∈ S × A, π(a|s) is the probability of the agent choosing
action a ∈ A at state s ∈ S. Note that a policy π ∈ X induces a Markov chain on S. Denote the induced Markov transition kernel as P^π, such that s_{t+1} ∼ P^π(·|s_t). For any s, s′ ∈ S, we have P^π(s′|s) = Σ_{a∈A} π(a|s) P(s′|s, a). For any π ∈ X, P^π is assumed to induce a stationary distribution over S, denoted by µ^π. We assume that |A| < ∞ while |S| is possibly infinite (but countable). To simplify our notation, for any distribution ρ on S, we let ⟨·, ·⟩_ρ be the inner product with respect to ρ, and ||·||_{µ^π ⊗ π} be the weighted ℓ₂-norm with respect to the probability measure µ^π ⊗ π over S × A; for measurable functions f, g on S × A,

⟨f, g⟩_ρ = Σ_{s∈S} ⟨f(s, ·), g(s, ·)⟩ ρ(s),  ||f||_{µ^π ⊗ π} = ( Σ_{s∈S} Σ_{a∈A} π(a|s) [f(s, a)]² µ^π(s) )^{1/2}.

In policy optimization, our objective is to maximize the expected total discounted reward received by the agent with respect to the policy π, i.e.,

max_{π∈X⊆R^{|S|×|A|}}  −ℓ(π) = E_π[ Σ_{t≥0} γ^t · r(s_t, a_t) | s₀ ∼ ρ₀ ],  (37)

where Eπ is the expectation with the actions taken according to policy π. Here we let ρ0 to denote
the distribution of the initial state. To see that (37) is approximated as a bilevel problem, set P π as
the Markov operator under the policy π. We let Qπ be the unique solution to the following Bellman
equation [53]:
Q(s, a) = r(s, a) + γ(P π Q)(s, a), ∀ s, a ∈ S × A. (38)

Notice that the following holds:

Q^π(s, a) = E_π[ Σ_{t≥0} γ^t r(s_t, a_t) | s₀ = s, a₀ = a ],  E_{a∼π(·|s)}[Q^π(s, a)] = ⟨Q^π(s, ·), π(·|s)⟩.
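In the tabular case (finite S and A), the Bellman equation (38) is a linear system in Q and Q^π can be computed exactly; a minimal sketch (our illustration with dense arrays, not part of the algorithm):

```python
import numpy as np

def q_pi(P, r, pi, gamma):
    # Solve Q = r + gamma * (P^pi Q) exactly, cf. (38).
    # P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy.
    S, A = r.shape
    # (P^pi)[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
    Ppi = np.einsum('sat,tb->satb', P, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * Ppi, r.reshape(-1))
    return q.reshape(S, A)
```

For a single state and action this recovers the geometric series Q = r/(1 − γ); since γ < 1 the system matrix is invertible.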

Further, we parameterize Q using a linear approximation Q(s, a) ≈ Q_θ(s, a) := φ(s, a)^⊤ θ, where φ: S × A → R^d is a known feature mapping and θ ∈ R^d is a finite-dimensional parameter. Using the fact that ℓ(π) = −E_π[Q^π(s, a)], problem (37) can be approximated as a bilevel optimization problem:

min_{π∈X⊆R^{|S|×|A|}}  ℓ(π) = −⟨Q_{θ*(π)}, π⟩_{ρ₀}
subject to  θ*(π) ∈ arg min_{θ∈R^d} (1/2) || Q_θ − r − γ P^π Q_θ ||²_{µ^π ⊗ π}.  (39)

Solving the Policy Optimization Problem We illustrate how to adopt the TTSA algorithm to solve (39). First, the inner problem is the policy evaluation (a.k.a. ‘critic’), which minimizes the mean squared Bellman error (MSBE). A standard approach is TD learning [52]. We draw two consecutive state-action pairs (s, a, s′, a′) satisfying s ∼ µ^{π^k}, a ∼ π^k(·|s), s′ ∼ P(·|s, a), and a′ ∼ π^k(·|s′), and update the critic via

θ^{k+1} = θ^k − β h_g^k  with  h_g^k = [ φ(s, a)^⊤ θ^k − r(s, a) − γ φ(s′, a′)^⊤ θ^k ] φ(s, a),  (40)

where β is the step size. This step resembles (4a) of TTSA, except that the mean field E[h_g^k | F_k] is a semigradient of the MSBE function.
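The critic recursion (40) is a one-line update given a sampled transition; a minimal sketch (our illustration; the sampling of (s, a, s′, a′) is assumed to be provided):

```python
import numpy as np

def td0_step(theta, phi_sa, phi_next, reward, gamma, beta):
    # Semigradient TD(0) step mirroring (40):
    # theta <- theta - beta * [phi(s,a)^T theta - r - gamma * phi(s',a')^T theta] * phi(s,a)
    td_error = phi_sa @ theta - reward - gamma * (phi_next @ theta)
    return theta - beta * td_error * phi_sa
```

With a single constant feature φ ≡ 1 and constant reward r, the iterates contract to the fixed point r/(1 − γ).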
Secondly, the outer problem searches for the policy (a.k.a. ‘actor’) that maximizes the expected discounted reward. To develop this step, let us define the visitation measure and the Bregman divergence as:

ρ^{π^k}(s) := (1 − γ)^{-1} Σ_{t≥0} γ^t P(s_t = s),  D̄_{ψ,ρ^{π^k}}(π, π^k) := Σ_{s∈S} D_ψ( π(·|s), π^k(·|s) ) ρ^{π^k}(s),

such that {s_t}_{t≥0} is a trajectory of states obtained by drawing s₀ ∼ ρ₀ and following the policy π^k, and D_ψ is the Kullback–Leibler (KL) divergence between probability distributions over A. We also define the following gradient surrogate:

[∇_π f(π^k, θ^{k+1})](s, a) = −(1 − γ)^{-1} Q_{θ^{k+1}}(s, a) ρ^{π^k}(s),  ∀ (s, a).  (41)
Similar to (6), and under the additional assumption that the linear approximation is exact, i.e., Q_{θ*(π^k)} = Q^{π^k}, we can show ∇_π f(π^k, θ*(π^k)) = ∇ℓ(π^k) using the policy gradient theorem [53]. In a similar vein as (4b) in TTSA, we consider the mirror descent step for improving the policy (cf. proximal policy optimization in [49]):

π^{k+1} = arg min_{π∈X} { −(1 − γ)^{-1} ⟨Q_{θ^{k+1}}, π − π^k⟩_{ρ^{π^k}} + (1/α) D̄_{ψ,ρ^{π^k}}(π, π^k) },  (42)
where α is the step size. Note that the above update can be performed as:

π^{k+1}(·|s) ∝ π^k(·|s) exp( α (1 − γ)^{-1} Q_{θ^{k+1}}(s, ·) ) = π⁰(·|s) exp( (1 − γ)^{-1} φ(s, ·)^⊤ Σ_{i=0}^k α θ^{i+1} ).

In other words, π^{k+1} can be represented using the running sum of critic parameters Σ_{i=0}^k α θ^{i+1}. This is similar

to the natural policy gradient method [30], and the algorithm requires a low memory footprint.
Finally, the recursions (40), (42) give the two-timescale natural actor critic (TT-NAC) algorithm.
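Per state, the closed-form update of (42) is a multiplicative-weights step on the simplex; a minimal sketch (our illustration, computed in log space for numerical stability):

```python
import numpy as np

def nac_actor_step(pi_s, q_s, alpha, gamma):
    # Mirror-descent (KL) policy step at one state, cf. (42):
    # pi^{k+1}(.|s) proportional to pi^k(.|s) * exp(alpha/(1-gamma) * Q(s,.)).
    logits = np.log(pi_s) + (alpha / (1.0 - gamma)) * q_s
    w = np.exp(logits - logits.max())      # shift by the max for stability
    return w / w.sum()
```

The output remains a probability distribution, and actions with larger critic values gain probability mass.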

4.1 Convergence Analysis of TT-NAC

Consider the following assumptions on the MDP model of interest.


Assumption 4. The reward function is uniformly bounded by a constant r. That is, |r(s, a)| ≤ r
for all (s, a) ∈ S × A.
Assumption 5. The feature map φ : S × A → Rd satisfies kφ(s, a)k2 ≤ 1 for all (s, a) ∈ S × A. The
action-value function associated with each policy is a linear function of φ. That is, for any policy
π ∈ X, there exists θ? (π) ∈ Rd such that Qπ (·, ·) = φ(·, ·)> θ? (π) = Qθ? (π) (·, ·).
Assumption 6. For each policy π ∈ X, the induced Markov chain P^π admits a unique stationary distribution µ^π. Moreover, there exists µ_φ > 0 such that E_{s∼µ^π, a∼π(·|s)}[ φ(s, a) φ(s, a)^⊤ ] ⪰ µ_φ² · I_d for all π ∈ X.
Assumption 7. For any (s, a) ∈ S × A and any π ∈ X, let %(s, a, π) be a probability measure over S, defined by

[%(s, a, π)](s′) = (1 − γ)^{-1} Σ_{t≥0} γ^t · P(s_t = s′),  ∀ s′ ∈ S.  (43)

That is, %(s, a, π) is the visitation measure induced by the Markov chain starting from (s₀, a₀) = (s, a) and following π afterwards. For an optimal policy π*, there exists C_ρ > 0 such that

E_{s′∼ρ*}[ ( [%(s, a, π)](s′) / ρ*(s′) )² ] ≤ C_ρ²,  ∀ (s, a) ∈ S × A, π ∈ X.

Here we let ρ* denote ρ^{π*}, the visitation measure induced by π* with s₀ ∼ ρ₀, to simplify the notation.

We remark that Assumption 4 is standard in the reinforcement learning literature [53, 55]. In
Assumption 5, we assume that each Qπ is linear which implies that the linear function approximation
is exact. A sufficient condition for Assumption 5 is that the underlying MDP is a linear MDP [28,62],
where both the reward function and Markov transition kernel are linear in φ. Linear MDP contains
the tabular MDP as a special case, where the feature mapping φ(s, a) becomes the canonical vector
in RS×A . The assumption that the stationary distribution µπ exists for any policy π is a common
property for the MDP analyzed in TD learning, e.g., [4, 15]. Assumption 6 further assumes that
the smallest eigenvalue of Σπ is bounded uniformly away from zero. Assumption 6 is commonly
made in the literature on policy evaluation with linear function approximation, e.g., [4, 37]. Finally,
Assumption 7 postulates that ρ* is regular, in the sense that the density ratio between %(s, a, π) and ρ* has uniformly bounded second-order moments under ρ*. Such an assumption is closely related to
the concentratability coefficient [1,2,43], which characterizes the distribution shift incurred by policy
updates and is conjectured essential for the sample complexity analysis of reinforcement learning
methods [11]. Assumption 7 is satisfied if the initial distribution ρ0 has lower bounded density over
S × A [1].
To state our main convergence results, let us define the quantities of interest:
∆_Q^{k+1} := E[||θ^{k+1} − θ*(π^k)||₂²],  OPT^k := E[ℓ(π^k) − ℓ(π*)],  (44)

where the expectations above are taken with respect to the i.i.d. draws of state-action pairs in (40) for TT-NAC. We remark that ∆_Q^k, analogous to ∆_y^k used in TTSA, is the tracking error that characterizes the performance of TD learning when the target value function, Q^{π^k}, is time-varying due to policy updates. We obtain:

Theorem 3. Consider the TT-NAC algorithm (40)–(42) for the policy optimization problem (39). Let K_max ≥ 32² be the maximum number of iterations. Under Assumptions 4–7, we set the step sizes as

α = ( (1 − γ)³ µ_φ / √(r · C_ρ²) ) · min{ (1 − γ)² µ_φ² / 128, K_max^{-3/4} },  β = min{ (1 − γ) µ_φ² / 8, ( 16/((1 − γ) µ_φ²) ) K_max^{-1/2} }.  (45)

Then the following holds:

E[OPT^K] = O(K_max^{-1/4}),  E[∆_Q^{K+1}] = O(K_max^{-1/2}),  (46)

where K is an independent random variable uniformly distributed over {0, ..., K_max − 1}.

To shed light on our analysis, we first observe the following performance difference lemma proven in [29, Lemma 6.1]:

ℓ(π) − ℓ(π*) = (1 − γ)^{-1} ⟨Q^π, π* − π⟩_{ρ^{π*}},  ∀ π ∈ X,  (47)

where π* is an optimal policy solving (39). The above implies a restricted form of convexity, and our analysis uses the insight that (47) plays a similar role as (2) [with µ_ℓ ≥ 0] and characterizes the loss geometry of the outer problem.
Our result shows that the TT-NAC algorithm finds an optimal policy at the rate of O(K^{-1/4}) in terms of the objective value. This rate is comparable to that of another variant of the TT-NAC algorithm in [61], which provided a customized analysis for TT-NAC. In contrast, the analysis of our TT-NAC algorithm is rooted in the general TTSA framework developed in §3.3 for tackling bilevel optimization problems. Notice that an analysis of the two-timescale actor-critic algorithm can also be found in [59], which provides an O(K^{-2/5}) convergence rate to a stationary solution.

5 Numerical Experiments

We consider the data hyper-cleaning task (16), and compare TTSA with several algorithms such as
the BSA algorithm [23], the stocBiO [27] for different batch size choices, and the HOAG algorithm
in [44]. Note that HOAG is a deterministic algorithm and it requires full gradient computation at
each iteration. In contrast, stocBiO is a stochastic algorithm but it relies on large batch gradient
computations.
We consider problem (16) with L(·) being the cross-entropy loss (i.e., a data cleaning problem for logistic regression); σ(x) := 1/(1 + exp(−x)); c = 0.001; see [50]. The problem is trained on the
FashionMNIST dataset [60] with 50k, 10k, and 10k image samples allocated for training, validation
and testing purposes, respectively. We consider the setting where each sample in the training dataset
is corrupted with probability 0.4. Note that the outer problem `(x) is non-convex while the lower
level problem is strongly-convex. The simulation results are an average of 3 independent runs. The
step sizes for different algorithms are chosen according to their theoretically suggested values.

Figure 1: Data hyper-cleaning task on the FashionMNIST dataset. We plot the training loss and testing accuracy against the number of gradients evaluated, with corruption rate p = 0.4.

Let the outer iteration be indexed by t. For TTSA we choose α_t = c_α/(1 + t)^{3/5} and β_t = c_β/(1 + t)^{2/5}, and tune c_α and c_β over the set {10^{-3}, 10^{-2}, 10^{-1}, 10}. For BSA [23], we index the outer iteration by t and the inner iterations by k̄ ∈ {1, ..., k̄_t}. We set k̄_t = ⌈√(t + 1)⌉ as suggested in [23] and choose the outer and inner step sizes, respectively, as α_t = d_α/(1 + t)^{1/2} and β_k̄ = d_β/(k̄ + 2), tuning d_α and d_β over the set {10^{-3}, 10^{-2}, 10^{-1}, 10}. Finally, for stocBiO we tune the parameters α_t and β_t in the range [0, 1].
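The two-timescale schedules used for TTSA above can be written as a small helper (our sketch; c_alpha and c_beta stand in for the tuned constants):

```python
def ttsa_stepsizes(t, c_alpha=0.1, c_beta=0.1):
    # alpha_t = c_alpha/(1+t)^{3/5}, beta_t = c_beta/(1+t)^{2/5}:
    # alpha_t/beta_t -> 0, so the inner (critic-like) update runs on the faster timescale.
    alpha = c_alpha / (1 + t) ** 0.6
    beta = c_beta / (1 + t) ** 0.4
    return alpha, beta
```

The key property is the decaying ratio α_t/β_t ∝ (1 + t)^{-1/5}.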
In Figure 1, we compare the performance of the different algorithms against the total number of outer samples accessed. As observed, TTSA outperforms BSA, stocBiO, and HOAG. We remark that HOAG is a deterministic algorithm and hence requires full-batch gradient computations at each iteration. Similarly, stocBiO relies on large-batch gradients, which results in relatively slow convergence.

6 Conclusion

This paper develops efficient two-timescale stochastic approximation algorithms for a class of bilevel optimization problems where the inner problem is unconstrained and strongly convex. We show the convergence rates of the proposed TTSA algorithm under the settings where the outer objective function is either strongly convex, convex, or non-convex. Additionally, we show how our theory and analysis can be customized to a two-timescale actor-critic proximal policy optimization algorithm in reinforcement learning, and obtain a convergence rate comparable to the existing literature.

A Omitted Proofs of Theorem 1

To simplify notation, for any n, m ∈ N, we define the following quantities:

G^{(1)}_{m:n} = ∏_{i=m}^n (1 − β_i µ_g/4),  G^{(2)}_{m:n} = ∏_{i=m}^n (1 − α_i µ_ℓ).  (48)

A.1 Proof of Lemma 3

Following a direct expansion of the updating rule for y^{k+1} and taking the conditional expectation given the filtration F_k yield

E[||y^{k+1} − y*(x^k)||² | F_k] ≤ (1 − 2β_k µ_g) ||y^k − y*(x^k)||² + β_k² E[||h_g^k||² | F_k],

where we used the unbiasedness of h_g^k [cf. Assumption 3] and the strong convexity of g. By direct computation and (7b) in Assumption 3, we have

E[||h_g^k||² | F_k] = E[ ||h_g^k − ∇g(x^k, y^k)||² | F_k ] + ||∇g(x^k, y^k)||²
  ≤ σ_g² + (1 + σ_g²) ||∇g(x^k, y^k)||² ≤ σ_g² + (1 + σ_g²) L_g² ||y^k − y*(x^k)||²,  (49)

where the last inequality uses Assumption 2 and the inner-problem optimality condition ∇_y g(x^k, y*(x^k)) = 0. As β_k L_g²(1 + σ_g²) ≤ µ_g, we have

E[||y^{k+1} − y*(x^k)||² | F_k] ≤ (1 − β_k µ_g) ||y^k − y*(x^k)||² + β_k² σ_g².  (50)

Using the basic inequality 2ab ≤ (1/c) a² + c b² for all c > 0 and a, b ∈ R, we have

||y^k − y*(x^k)||² ≤ (1 + 1/c) ||y^k − y*(x^{k-1})||² + (1 + c) ||y*(x^{k-1}) − y*(x^k)||².  (51)

Note we have taken the convention that x^{-1} = x⁰. Furthermore, we observe that

||y*(x^{k-1}) − y*(x^k)||² ≤ L_y² ||x^k − x^{k-1}||² ≤ α_{k-1}² L_y² ||h_f^{k-1}||²,  (52)

where the first inequality follows from Lemma 2, and the second inequality follows from the non-expansive property of the projection. We have set h_f^{-1} = 0 as a convention. Through setting c = 2(1 − β_k µ_g)/(β_k µ_g), we have (1 + 1/c)(1 − β_k µ_g) = 1 − (µ_g/2) β_k. Substituting this quantity c into (51) and combining with (50) show that

E[||y^{k+1} − y*(x^k)||² | F_k] ≤ (1 − β_k µ_g/2) ||y^k − y*(x^{k-1})||² + β_k² σ_g² + ( (2 − µ_g β_k)/(µ_g β_k) ) α_{k-1}² L_y² ||h_f^{k-1}||².  (53)

Taking the total expectation and using (14), we have

∆_y^{k+1} ≤ (1 − β_k µ_g/2) ∆_y^k + β_k² σ_g² + ( (2 − µ_g β_k)/(µ_g β_k) ) α_{k-1}² L_y² ( σ̃_f² + 3b_{k-1}² + 3L² ∆_y^k ),

with the convention α_{-1} = 0. Using α_{k-1} ≤ 2α_k and α_k ≤ c₀ β_k^{3/2}, we have

∆_y^{k+1} ≤ [ 1 − β_k µ_g/2 + (12 c₀² L_y² L²/µ_g) β_k² ] ∆_y^k + β_k² σ_g² + (4 c₀² L_y²/µ_g) β_k² ( σ̃_f² + 3b₀² )
  ≤ (1 − β_k µ_g/4) ∆_y^k + β_k² [ σ_g² + (4 c₀² L_y²/µ_g)( σ̃_f² + 3b₀² ) ],
where the last inequality is due to (20a). Solving the recursion leads to

∆_y^{k+1} ≤ G^{(1)}_{0:k} ∆_y⁰ + Σ_{j=0}^k β_j² G^{(1)}_{j+1:k} { σ_g² + (4 c₀² L_y²/µ_g)( σ̃_f² + 3b₀² ) }.  (54)

Since β_{k-1}/β_k ≤ 1 + β_k · (µ_g/8), applying Lemma 10 to {β_k}_{k≥0} with a = µ_g/4 and q = 2, we have Σ_{j=0}^k β_j² G^{(1)}_{j+1:k} ≤ 8β_k/µ_g. Finally, we can simplify (54) as

∆_y^{k+1} ≤ G^{(1)}_{0:k} ∆_y⁰ + C_y^{(1)} β_k,  where C_y^{(1)} := (8/µ_g) { σ_g² + (4 c₀² L_y²/µ_g)( σ̃_f² + 3b₀² ) }.  (55)

A.2 Proof of Lemma 4


Due to the projection property, we get

||x^{k+1} − x*||² ≤ ||x^k − α_k h_f^k − x*||² = ||x^k − x*||² − 2α_k ⟨h_f^k, x^k − x*⟩ + α_k² ||h_f^k||².

Taking the conditional expectation given F_k′ gives

E[||x^{k+1} − x*||² | F_k′] ≤ ||x^k − x*||² − 2α_k ⟨∇ℓ(x^k), x^k − x*⟩ + α_k² E[||h_f^k||² | F_k′]
  − 2α_k ⟨∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k) + B_k, x^k − x*⟩,  (56)

where we used (7a). The strong convexity implies ⟨∇ℓ(x^k), x^k − x*⟩ ≥ µ_ℓ ||x^k − x*||²; we further bound the r.h.s. of (56) via

E[||x^{k+1} − x*||² | F_k′]
  ≤ (1 − 2α_k µ_ℓ) ||x^k − x*||² − 2α_k ⟨∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k) + B_k, x^k − x*⟩ + α_k² E[||h_f^k||² | F_k′]
  ≤ (1 − α_k µ_ℓ) ||x^k − x*||² + (α_k/µ_ℓ) ||∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k) + B_k||² + α_k² E[||h_f^k||² | F_k′]
  ≤ (1 − α_k µ_ℓ) ||x^k − x*||² + (2α_k/µ_ℓ) ( L² ||y^{k+1} − y*(x^k)||² + b_k² ) + α_k² E[||h_f^k||² | F_k′],

where the last inequality is from Lemma 2. Using (14) and taking the total expectation:

∆_x^{k+1} ≤ (1 − α_k µ_ℓ) ∆_x^k + (2α_k/µ_ℓ) ( L² ∆_y^{k+1} + b_k² ) + α_k² ( σ̃_f² + 3b₀² + 3L² ∆_y^{k+1} )
  ≤ (1 − α_k µ_ℓ) ∆_x^k + ( 2α_k/µ_ℓ + 3α_k² ) L² ∆_y^{k+1} + α_k² ( 2c̃_b/µ_ℓ + σ̃_f² + 3b₀² ),

where we have used b_k² ≤ c̃_b α_k. Solving the recursion above leads to

∆_x^{k+1} ≤ G^{(2)}_{0:k} ∆_x⁰ + ( 2c̃_b/µ_ℓ + σ̃_f² + 3b₀² ) Σ_{j=0}^k α_j² G^{(2)}_{j+1:k} + ( 2L²/µ_ℓ + 3α₀ L² ) Σ_{j=0}^k α_j G^{(2)}_{j+1:k} ∆_y^{j+1}
  ≤ G^{(2)}_{0:k} ∆_x⁰ + ( 2c̃_b/µ_ℓ + σ̃_f² + 3b₀² ) (2/µ_ℓ) α_k + ( 2L²/µ_ℓ + 3α₀ L² ) Σ_{j=0}^k α_j G^{(2)}_{j+1:k} ∆_y^{j+1}.

The last inequality follows from applying Lemma 10 with q = 2 and a = µ_ℓ.

A.3 Bounding ∆_x^k by Coupling with ∆_y^k

Using (55), we observe that

Σ_{j=0}^k α_j G^{(2)}_{j+1:k} ∆_y^{j+1} ≤ Σ_{j=0}^k α_j G^{(2)}_{j+1:k} { G^{(1)}_{0:j} ∆_y⁰ + C_y^{(1)} β_j }.  (57)

We bound each term on the right-hand side of (57). For the first term, as µ_ℓ α_i ≤ µ_g β_i/8 [cf. (20b)], applying Lemma 11 with a = µ_ℓ, b = µ_g/4, γ_i = α_i, ρ_i = β_i gives

Σ_{j=0}^k α_j G^{(2)}_{j+1:k} G^{(1)}_{0:j} ∆_y⁰ ≤ (1/µ_ℓ) G^{(2)}_{0:k} ∆_y⁰.  (58)

Recall that β_j ≤ c₁ · α_j^{2/3}. Applying Lemma 10 with q = 5/3 and a = µ_ℓ yields

Σ_{j=0}^k α_j β_j G^{(2)}_{j+1:k} ≤ c₁ Σ_{j=0}^k α_j^{5/3} G^{(2)}_{j+1:k} ≤ (2c₁/µ_ℓ) α_k^{2/3}.  (59)

We obtain a bound on the optimality gap as

∆_x^{k+1} ≤ G^{(2)}_{0:k} { ∆_x⁰ + [ 2L²/µ_ℓ² + 3α₀ L²/µ_ℓ ] ∆_y⁰ } + (2/µ_ℓ) [ 2c̃_b/µ_ℓ + σ̃_f² + 3b₀² ] α_k + (2c₁/µ_ℓ) [ 2L²/µ_ℓ + 3α₀ L² ] C_y^{(1)} α_k^{2/3}.

To simplify the notation, we define the constants

C_x^{(0)} = ∆_x⁰ + [ 2L²/µ_ℓ² + 3α₀ L²/µ_ℓ ] ∆_y⁰,  C_x^{(1)} = (2/µ_ℓ) [ 2c̃_b/µ_ℓ + σ̃_f² + 3b₀² ] + (2c₁/µ_ℓ) [ 2L²/µ_ℓ + 3α₀ L² ] C_y^{(1)}.

Then, as long as α_k < 1/µ_ℓ and we use the step size parameters in (22), we have

∆_x^{k+1} ≤ G^{(2)}_{0:k} C_x^{(0)} + C_x^{(1)} α_k^{2/3} = O( [ L²/µ_ℓ² + L² L_y²/(µ_g² µ_ℓ⁴) ] σ̃_f²/k^{2/3} + L² σ_g²/(µ_ℓ² µ_g k^{2/3}) ),

and we recall that σ̃_f² = σ_f² + 3 sup_{x∈X} ||∇ℓ(x)||².

B Omitted Proofs of Theorem 2 and Corollary 1

B.1 Proof of Lemma 5

We observe that summing the first and the second inequalities in (31) from k = 0 to k = K − 1 gives:

c₀ Σ_{k=1}^K Θ^k ≤ Ω⁰ + c₁ Σ_{k=1}^K Υ^k + c₂ K,  (60)
d₀ Σ_{k=1}^K Υ^k ≤ Υ¹ + d₁ Σ_{k=1}^K Θ^k + d₂ K.  (61)

Substituting (60) into (61) gives

d₀ Σ_{k=1}^K Υ^k ≤ Υ¹ + d₂ K + (d₁/c₀) [ Ω⁰ + c₁ Σ_{k=1}^K Υ^k + c₂ K ].  (62)

Therefore, if d₀ − d₁ c₁/c₀ > 0, a simple computation yields the second inequality in (32). Similarly, we substitute (61) into (60) to yield

c₀ Σ_{k=1}^K Θ^k ≤ Ω⁰ + c₂ K + (c₁/d₀) [ Υ¹ + d₁ Σ_{k=1}^K Θ^k + d₂ K ].  (63)

Under c₀ − c₁ d₁/d₀ > 0, a simple computation yields the first inequality in (32).

B.2 Proof of Lemma 6

Recall that we defined OPT^k := E[ℓ(x^k) − ℓ(x*)] for each k ≥ 0. To begin with, we have the following descent estimate:

ℓ(x^{k+1}) ≤ ℓ(x^k) + ⟨∇ℓ(x^k), x^{k+1} − x^k⟩ + (L_f/2) ||x^{k+1} − x^k||².  (64)

The optimality condition of step (4b) leads to the following bound:

⟨∇ℓ(x^k), x^{k+1} − x^k⟩ ≤ ⟨∇ℓ(x^k) − ∇_x f(x^k, y^{k+1}) − B_k, x^{k+1} − x^k⟩ + ⟨B_k + ∇_x f(x^k, y^{k+1}) − h_f^k, x^{k+1} − x^k⟩ − (1/α) ||x^{k+1} − x^k||²,

where we obtained the inequality by adding and subtracting B_k + ∇_x f(x^k, y^{k+1}) − h_f^k. Then, taking the conditional expectation given F_k′, for any c, d > 0, we obtain

E[⟨∇ℓ(x^k), x^{k+1} − x^k⟩ | F_k′]
  ≤ E[ ||∇ℓ(x^k) − ∇_x f(x^k, y^{k+1}) − B_k|| · ||x^{k+1} − x^k|| | F_k′ ] + E[ ||B_k + ∇_x f(x^k, y^{k+1}) − h_f^k|| · ||x^{k+1} − x^k|| | F_k′ ] − (1/α) E[||x^{k+1} − x^k||² | F_k′]
  ≤ (1/(2c)) E[ ||∇ℓ(x^k) − ∇_x f(x^k, y^{k+1}) − B_k||² | F_k′ ] + (c/2) E[||x^{k+1} − x^k||² | F_k′] + σ_f²/(2d) + (d/2) E[||x^{k+1} − x^k||² | F_k′] − (1/α) E[||x^{k+1} − x^k||² | F_k′],

where the second inequality follows from Young's inequality and Assumption 3. Simplifying the terms above leads to

E[⟨∇ℓ(x^k), x^{k+1} − x^k⟩ | F_k′] ≤ (1/(2c)) E[ ||∇ℓ(x^k) − ∇_x f(x^k, y^{k+1}) − B_k||² | F_k′ ] + σ_f²/(2d) + ( (c + d)/2 − 1/α ) E[||x^{k+1} − x^k||² | F_k′].

Setting d = c = 1/(2α), plugging the above into (64), and taking the full expectation:

OPT^{k+1} ≤ OPT^k − ( 1/(2α) − L_f/2 ) E[||x^{k+1} − x^k||²] + α ∆^{k+1} + α σ_f²,  (65)

where we have denoted

∆^{k+1} := E[ ||∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k) − B_k||² ] ≤ 2L² E[||y^{k+1} − y*(x^k)||²] + 2b_k²,

where the last inequality follows from Lemma 2 [cf. (12a)] and (7a) in Assumption 3. Next, following the standard SGD analysis [cf. (50)] and using β ≤ µ_g/(L_g²(1 + σ_g²)), we have

E[||y^{k+1} − y*(x^k)||² | F_k] ≤ (1 − µ_g β) E[||y^k − y*(x^k)||² | F_k] + β² σ_g²
  ≤ (1 + c)(1 − µ_g β) E[||y^k − y*(x^{k-1})||² | F_k] + (1 + 1/c) E[||y*(x^k) − y*(x^{k-1})||² | F_k] + β² σ_g²
  ≤ (1 − µ_g β/2) E[||y^k − y*(x^{k-1})||² | F_k] + ( 2/(µ_g β) − 1 ) E[||y*(x^k) − y*(x^{k-1})||² | F_k] + β² σ_g²
  ≤ (1 − µ_g β/2) E[||y^k − y*(x^{k-1})||² | F_k] + ( 2/(µ_g β) − 1 ) L_y² E[||x^k − x^{k-1}||² | F_k] + β² σ_g²,

where the last inequality is due to the Lipschitz continuity property (12a) and µ_g β < 1. Furthermore, we have picked c = µ_g β · [2(1 − µ_g β)]^{-1}, so that

(1 + c)(1 − µ_g β) = 1 − µ_g β/2,  1/c + 1 = 2/(µ_g β) − 1.

Taking a full expectation on both sides leads to the desired result.

B.3 Proof of Lemma 7

For simplicity, we let x̂^{k+1} and x̂ denote x̂(x^{k+1}) and x̂(x), respectively. For any x ∈ X, letting x₁ = x̂ and x₂ = x in (2), we get

ℓ(x̂) ≥ ℓ(x) + ⟨∇ℓ(x), x̂ − x⟩ + (µ_ℓ/2) ||x̂ − x||².  (66)

Moreover, by the definition of x̂, for any x ∈ X, we have

[ ℓ(x) + (ρ/2) ||x − x||² ] − [ ℓ(x̂) + (ρ/2) ||x̂ − x||² ] = ℓ(x) − [ ℓ(x̂) + (ρ/2) ||x̂ − x||² ] ≥ 0.  (67)

Adding the two inequalities above, we obtain

−( (µ_ℓ + ρ)/2 ) · ||x̂ − x||² ≥ ⟨∇ℓ(x), x̂ − x⟩.  (68)

Note that we choose ρ such that ρ + µ_ℓ > 0. To proceed, combining the definitions of the Moreau envelope and x̂ in (18), for x^{k+1}, we have

Φ_{1/ρ}(x^{k+1}) = ℓ(x̂^{k+1}) + (ρ/2) ||x^{k+1} − x̂^{k+1}||² ≤ ℓ(x̂^k) + (ρ/2) ||x^{k+1} − x̂^k||²
  ≤ ℓ(x̂^k) + (ρ/2) ||x^k − x̂^k||² + (ρ/2) ||x^{k+1} − x^k||² + ρα ⟨x̂^k − x^k, h_f^k⟩ + αρ ⟨h_f^k, x^k − x^{k+1}⟩ + ρ ||x^{k+1} − x^k||²
  ≤ Φ_{1/ρ}(x^k) + (5ρ/2) ||x^{k+1} − x^k||² + ρα ⟨x̂^k − x^k, h_f^k⟩ + α²ρ ||h_f^k||²,  (69)

where the first equality and the first inequality follow from the optimality of x̂^{k+1} = x̂(x^{k+1}), and the second inequality uses the optimality condition in (4b). For any x* that is a global optimal solution to the original problem min_{x∈X} ℓ(x), we must have

Φ_{1/ρ}(x*) = min_{x∈X} { ℓ(x) + (ρ/2) ||x − x*||² } = ℓ(x*),

where the last equality holds because

Φ_{1/ρ}(x*) = min_{x∈X} ℓ(x) + (ρ/2) ||x − x*||² ≤ ℓ(x*) + (ρ/2) ||x* − x*||² = ℓ(x*),  (70)
Φ_{1/ρ}(z) = min_{x∈X} ℓ(x) + (ρ/2) ||x − z||² ≥ min_{x∈X} ℓ(x) = ℓ(x*),  ∀ z ∈ X.  (71)

Taking the expectation of ⟨x̂^k − x^k, h_f^k⟩ conditioned on F_k′, we have:

E[⟨x̂^k − x^k, h_f^k⟩ | F_k′] = E[⟨x̂^k − x^k, h_f^k − ∇_x f(x^k, y^{k+1}) + ∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k) + ∇ℓ(x^k)⟩ | F_k′]
  = ⟨x̂^k − x^k, B_k⟩ + E[⟨x̂^k − x^k, ∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k)⟩ + ⟨x̂^k − x^k, ∇ℓ(x^k)⟩ | F_k′],  (72)

where the second equality follows from (7a) in Assumption 3. By Young's inequality, for any c > 0, we have

⟨x̂^k − x^k, B_k⟩ ≤ (c/4) ||x̂^k − x^k||² + (1/c) b_k²,  (73)
E[⟨x̂^k − x^k, ∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k)⟩ | F_k′] ≤ (1/c) E[ ||∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k)||² | F_k′ ] + (c/4) ||x̂^k − x^k||²,

where we also use (7a) in deriving (73). Combining (68), (72), (73), and setting c = (ρ + µ_ℓ)/2, we obtain that

E[⟨x̂^k − x^k, h_f^k⟩ | F_k′]  (74)
  ≤ (c/2) ||x̂^k − x^k||² + (1/c) b_k² + (1/c) E[ ||∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k)||² | F_k′ ] − ( (ρ + µ_ℓ)/2 ) ||x̂^k − x^k||²
  = ( 2/(ρ + µ_ℓ) ) E[ ||∇_x f(x^k, y^{k+1}) − ∇ℓ(x^k)||² | F_k′ ] − ( (ρ + µ_ℓ)/4 ) ||x̂^k − x^k||² + ( 2/(ρ + µ_ℓ) ) b_k²
  ≤ ( 2L²/(ρ + µ_ℓ) ) E[ ||y^{k+1} − y*(x^k)||² | F_k′ ] − ( (ρ + µ_ℓ)/4 ) ||x̂^k − x^k||² + ( 2/(ρ + µ_ℓ) ) b_k²,

where the last step follows from the first inequality of Lemma 2. Plugging the above into (69), and taking a full expectation, we obtain

E[Φ_{1/ρ}(x^{k+1})] − E[Φ_{1/ρ}(x^k)]
  ≤ (5ρ/2) E[||x^{k+1} − x^k||²] + ( 2ραL²/(ρ + µ_ℓ) ) ∆_y^{k+1} − ( (ρ + µ_ℓ)ρα/4 ) E[||x̂^k − x^k||²] + ( 2ρα/(ρ + µ_ℓ) ) b_k² + α²ρ E[||h_f^k||²]
  ≤ (5ρ/2) E[||x^{k+1} − x^k||²] + ( 2ραL²/(ρ + µ_ℓ) ) ∆_y^{k+1} − ( (ρ + µ_ℓ)ρα/4 ) E[||x̂^k − x^k||²] + ( 2ρα/(ρ + µ_ℓ) ) b_k² + α²ρ ( σ̃_f² + 3b_k² + 3L² ∆_y^{k+1} )
  ≤ (5ρ/2) E[||x^{k+1} − x^k||²] + [ 2αρL²/(ρ + µ_ℓ) + 3α²ρL² ] ∆_y^{k+1} − ( (ρ + µ_ℓ)ρα/4 ) E[||x̂^k − x^k||²] + [ 2ρ/(ρ + µ_ℓ) + ρ( σ̃_f² + 3b₀² ) ] α²,

where the last inequality is due to the assumption b_k² ≤ α.

B.4 Proof of Corollary 1

Our proof departs from that of Theorem 2 through manipulating the descent estimate (64) in an
alternative way. The key is to observe the following three-point inequality [3]:
1 n ? o
hhkf , xk+1 − x? i ≤ kx − xk k2 − kx? − xk+1 k2 − kxk − xk+1 k2 , (75)

where x? is an optimal solution to (1). Observe that
h∇`(xk ), xk+1 − xk i = h∇`(xk ) − hkf , xk+1 − x? i + hhkf , xk+1 − x? i + h∇`(xk ), x? − xk i. (76)
Notice that due to the convexity of `(x), we have h∇`(xk ), x? − xk i ≤ −OPTk . Furthermore,
h∇`(xk ) − hkf , xk+1 − x? i
= h∇`(xk ) − hkf + Bk + ∇x f (xk , y k+1 ) − Bk − ∇x f (xk , y k+1 ), xk+1 − x? i
≤ Dx bk + Lky k+1 − y ? (xk )k + hBk + ∇x f (xk , y k+1 ) − hkf , xk+1 − xk + xk − x? i.


25
We notice that E[hBk + ∇x f (xk , y k+1 ) − hkf , xk − x? i|Fk0 ] = 0. Thus, taking the total expectation
on both sides and applying Young’s inequality on the last inner product lead to
α 1
E[h∇`(xk ) − hkf , xk+1 − x? ] ≤ Dx bk + LE[ky k+1 − y ? (xk )k] + σf2 + E[kxk+1 − xk k2 ]

2 2α
Substituting the above observations into (64) and using the three-point inequality (75) give
$$
\mathbb{E}[\ell(x^{k+1}) - \ell(x^k)] \leq D_x b_k + D_x L\,\mathbb{E}[\|y^{k+1} - y^\star(x^k)\|] + \frac{1}{2\alpha}\Big\{ \mathbb{E}[\|x^\star - x^k\|^2] - \mathbb{E}[\|x^\star - x^{k+1}\|^2] \Big\} + \frac{\alpha}{2}\sigma_f^2 - \mathrm{OPT}^k + \frac{L_f}{2}\mathbb{E}[\|x^{k+1} - x^k\|^2].
$$
Summing up both sides from $k = 0$ to $k = K_{\max} - 1$ and dividing by $K_{\max}$ gives
$$
\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathrm{OPT}^k \leq D_x b_0 + \frac{\alpha \sigma_f^2}{2} + \frac{\|x^\star - x^0\|^2}{2\alpha K_{\max}} + \frac{D_x L}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|y^k - y^\star(x^{k-1})\|] + \frac{L_f}{2K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|x^k - x^{k-1}\|^2].
$$
Applying the Cauchy-Schwarz inequality and Lemmas 5, 6 with $\alpha = O(K_{\max}^{-3/4})$, $\beta = O(K_{\max}^{-1/2})$ as in (26) shows that
$$
\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|y^k - y^\star(x^{k-1})\|] \leq \sqrt{ \frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \Delta_y^k } = O(K_{\max}^{-1/4});
$$
cf. (34). The proof is concluded.
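The final averaging step above relies on the Cauchy-Schwarz (Jensen) relation between the average of norms and the square root of the averaged squared norms. As a sanity check outside the paper, the relation can be verified numerically; the values below are synthetic stand-ins for the iterate errors, not the algorithm's output.

```python
import numpy as np

# synthetic stand-ins for the per-iteration errors E[||y^k - y*(x^{k-1})||]
rng = np.random.default_rng(0)
errs = np.abs(rng.standard_normal(1000))

avg_of_norms = errs.mean()                         # (1/K) * sum_k a_k
root_avg_of_squares = np.sqrt((errs ** 2).mean())  # sqrt((1/K) * sum_k a_k^2)
```

The first quantity never exceeds the second, which is what converts the $O(K_{\max}^{-1/2})$ bound on averaged squared errors into an $O(K_{\max}^{-1/4})$ bound on averaged errors.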

C Proof of Theorem 3

Hereafter, we let $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the inner product and $\ell_1$-norm on $\mathbb{R}^{|\mathcal{A}|}$, respectively. For any two policies $\pi_1$ and $\pi_2$ and any $s \in \mathcal{S}$, $\|\pi_1(\cdot|s) - \pi_2(\cdot|s)\|_1$ is the total variation distance between $\pi_1(\cdot|s)$ and $\pi_2(\cdot|s)$. For any $f, f' : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, define the following norms:
$$
\|f\|_{\rho,1} = \Big( \sum_{s \in \mathcal{S}} \|f(s,\cdot)\|_1^2\, \rho(s) \Big)^{1/2}, \qquad \|f\|_{\rho,\infty} = \Big( \sum_{s \in \mathcal{S}} \|f(s,\cdot)\|_\infty^2\, \rho(s) \Big)^{1/2}.
$$
The following result can be derived from Hölder's inequality:
$$
\langle f, f' \rangle_\rho \leq \sum_{s \in \mathcal{S}} \langle f(s,\cdot), f'(s,\cdot) \rangle\, \rho(s) \leq \|f\|_{\rho,1} \|f'\|_{\rho,\infty}. \qquad (77)
$$
Lastly, it can be shown that $\|\pi\|_{\rho,1} = 1$ and $\|\pi\|_{\rho,\infty} \leq 1$.
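These weighted norms and the Hölder-type inequality (77) are easy to check numerically. The sketch below is a sanity check on a random instance (the dimensions and distributions are our choice, not part of the analysis).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
rho = rng.dirichlet(np.ones(n_states))                 # weights rho(s), summing to one
f = rng.standard_normal((n_states, n_actions))
g = rng.standard_normal((n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # a row-stochastic "policy"

# weighted norms from the definitions above
f_rho_1 = np.sqrt(np.sum(np.linalg.norm(f, 1, axis=1) ** 2 * rho))
g_rho_inf = np.sqrt(np.sum(np.linalg.norm(g, np.inf, axis=1) ** 2 * rho))
pi_rho_1 = np.sqrt(np.sum(np.linalg.norm(pi, 1, axis=1) ** 2 * rho))

# <f, g>_rho = sum_s <f(s,.), g(s,.)> rho(s)
inner_rho = np.sum(np.sum(f * g, axis=1) * rho)
```

On any such instance, `inner_rho` is bounded by `f_rho_1 * g_rho_inf`, and `pi_rho_1` equals one since every row of a policy has unit $\ell_1$-norm.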


Under Assumption 5, $\theta^\star(\pi)$ is the solution to the inner problem with $Q^\pi(\cdot,\cdot) = \phi(\cdot,\cdot)^\top \theta^\star(\pi)$. Below we first show that $\theta^\star(\pi)$ and $Q^\pi$ are Lipschitz continuous maps with respect to $\|\cdot\|_{\rho^\star,1}$, where $\rho^\star$ is the visitation measure of an optimal policy $\pi^\star$.

Lemma 8. Under Assumptions 4-7, for any two policies $\pi_1, \pi_2 \in X$,
$$
\|Q^{\pi_1} - Q^{\pi_2}\|_{\rho^\star,\infty} \leq (1-\gamma)^{-2} \cdot r \cdot C_\rho \cdot \|\pi_1 - \pi_2\|_{\rho^\star,1},
$$
$$
\|\theta^\star(\pi_1) - \theta^\star(\pi_2)\|_2 \leq (1-\gamma)^{-2} \cdot r \cdot C_\rho / \mu_\phi \cdot \|\pi_1 - \pi_2\|_{\rho^\star,1}, \qquad (78)
$$
where $r$ is an upper bound on the reward function, $\mu_\phi$ is specified in Assumption 6, and $C_\rho$ is defined in Assumption 7.

The proof of the above lemma is relegated to §C.1.

In the sequel, we first derive coupled inequalities on the non-negative sequences $\mathrm{OPT}^k := \mathbb{E}[\ell(\pi^k) - \ell(\pi^\star)]$, $\Delta_Q^k := \mathbb{E}[\|\theta^k - \theta^\star(\pi^{k-1})\|^2]$, and $\mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2]$; then we apply Lemma 5 to derive the convergence rates of TT-NAC. Using the performance difference lemma [cf. (47)], we obtain
$$
\ell(\pi^{k+1}) - \ell(\pi^k) = -(1-\gamma)^{-1} \langle Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} + (1-\gamma)^{-1} \langle Q^{\pi^k}, \pi^k - \pi^\star \rangle_{\rho^\star}
$$
$$
= (1-\gamma)^{-1} \langle -Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} + (1-\gamma)^{-1} \langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star}. \qquad (79)
$$
Applying the inequality (77), we further have
$$
\langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} \leq \|Q^{\pi^k} - Q^{\pi^{k+1}}\|_{\rho^\star,\infty} \|\pi^{k+1} - \pi^\star\|_{\rho^\star,1} \qquad (80)
$$
$$
\leq 2L_Q \|\pi^k - \pi^{k+1}\|_{\rho^\star,1} \leq \frac{1-\gamma}{4\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \frac{4L_Q^2 \alpha}{1-\gamma}, \qquad (81)
$$
where $L_Q := (1-\gamma)^{-2}\, r \cdot C_\rho$. The above inequality follows from $\|\pi_1 - \pi_2\|_{\rho^\star,1} \leq 2$ for any $\pi_1, \pi_2 \in X$ and applying Lemma 8. Then, combining (79), (81) leads to
$$
\ell(\pi^{k+1}) - \ell(\pi^k) \leq -(1-\gamma)^{-1} \langle Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} + \frac{1}{4\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \frac{4\alpha L_Q^2}{(1-\gamma)^2}. \qquad (82)
$$

Let us bound the first term on the right-hand side of (82). To proceed, note that the policy update (42) can be implemented for each state individually as follows:
$$
\pi^{k+1}(\cdot|s) = \arg\min_{\nu : \sum_a \nu(a) = 1,\; \nu(a) \geq 0} \Big\{ -(1-\gamma)^{-1} \langle Q_{\theta^{k+1}}(s,\cdot), \nu \rangle + \frac{1}{\alpha}\, D_\psi\big(\nu, \pi^k(\cdot|s)\big) \Big\}, \qquad (83)
$$
for all $s \in \mathcal{S}$. Observe that we can modify $\rho^{\pi^k}$ in (42) to $\rho^\star$ without changing the optimal solution for this subproblem. Specifically, (42) can be written as
$$
\pi^{k+1} = \arg\min_{\pi \in X} \Big\{ -(1-\gamma)^{-1} \langle Q_{\theta^{k+1}}, \pi - \pi^k \rangle_{\rho^\star} + \frac{1}{\alpha}\, \bar{D}_{\psi,\rho^\star}(\pi, \pi^k) \Big\}. \qquad (84)
$$
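For concreteness, when $D_\psi$ is instantiated as the KL divergence (negative-entropy mirror map), the per-state subproblem (83) has a closed-form multiplicative-weights solution. The sketch below is illustrative only: the choice of mirror map and the constants are assumptions of this example, not prescribed here.

```python
import numpy as np

def kl_mirror_step(pi_s, q_s, alpha, gamma):
    # closed-form minimizer of -(1-gamma)^{-1} <q_s, nu> + (1/alpha) KL(nu || pi_s)
    # over the simplex: nu(a) proportional to pi_s(a) * exp(alpha/(1-gamma) * q_s(a))
    logits = np.log(pi_s) + alpha / (1.0 - gamma) * q_s
    w = np.exp(logits - logits.max())              # stabilized exponentiation
    return w / w.sum()

def objective(nu, pi_s, q_s, alpha, gamma):
    kl = np.sum(nu * np.log(nu / pi_s))
    return -np.dot(q_s, nu) / (1.0 - gamma) + kl / alpha

rng = np.random.default_rng(1)
alpha, gamma = 0.1, 0.9
pi_s = rng.dirichlet(np.ones(4))                   # current policy at one state
q_s = rng.standard_normal(4)                       # critic's Q-estimates at that state
new_pi = kl_mirror_step(pi_s, q_s, alpha, gamma)
```

The returned distribution stays on the simplex and attains an objective value no larger than that of the current policy, which is exactly the descent property the mirror-descent optimality condition below exploits.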
We have
$$
-(1-\gamma)^{-1} \langle Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} = (1-\gamma)^{-1} \Big[ \langle Q_{\theta^{k+1}} - Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} - \langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} \Big].
$$

Furthermore, from (84), we obtain
$$
\frac{\langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^k \rangle_{\rho^\star}}{1-\gamma} \geq \frac{1}{\alpha} \sum_{s \in \mathcal{S}} \big\langle \nabla D_\psi\big(\pi^{k+1}(\cdot|s), \pi^k(\cdot|s)\big), \pi^{k+1}(\cdot|s) - \pi^k(\cdot|s) \big\rangle\, \rho^\star(s), \qquad (85)
$$
where the inequality follows from the optimality condition of the mirror descent step. Meanwhile, the 1-strong convexity of $D_\psi(\cdot,\cdot)$ implies that
$$
\big\langle \nabla D_\psi\big(\pi^{k+1}(\cdot|s), \pi^k(\cdot|s)\big), \pi^{k+1}(\cdot|s) - \pi^k(\cdot|s) \big\rangle \geq \|\pi^{k+1}(\cdot|s) - \pi^k(\cdot|s)\|^2. \qquad (86)
$$
Thus, combining (85) and (86), and applying Young's inequality, we further have
$$
-(1-\gamma)^{-1} \langle Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} \leq \frac{1}{4\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \alpha(1-\gamma)^{-2}\, \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 - \frac{1}{\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2
$$
$$
= \alpha(1-\gamma)^{-2}\, \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 - \frac{3}{4\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2. \qquad (87)
$$

By direct computation and using $\|\phi(s,a)\| \leq 1$ [cf. Assumption 5], we have
$$
\|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 = \sum_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \big\{ \phi(s,a)^\top [\theta^\star(\pi^k) - \theta^{k+1}] \big\}^2 \rho^\star(s)
$$
$$
\leq \sum_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \|\phi(s,a)\|^2 \|\theta^\star(\pi^k) - \theta^{k+1}\|^2 \rho^\star(s) \leq \|\theta^\star(\pi^k) - \theta^{k+1}\|^2. \qquad (88)
$$
Combining (82), (87), and (88), we obtain
$$
\ell(\pi^{k+1}) - \ell(\pi^k) \leq \alpha(1-\gamma)^{-2} \|\theta^\star(\pi^k) - \theta^{k+1}\|^2 - \frac{1}{2\alpha} \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + 4(1-\gamma)^{-2} L_Q^2 \alpha.
$$
Taking full expectation leads to
$$
\mathrm{OPT}^{k+1} - \mathrm{OPT}^k \leq \alpha(1-\gamma)^{-2}\, \Delta_Q^{k+1} - \frac{1}{2\alpha}\, \mathbb{E}[\|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2] + 4(1-\gamma)^{-2} L_Q^2 \alpha. \qquad (89)
$$

Next, we consider the convergence of $\Delta_Q^k$. Let $\mathcal{F}_k = \sigma\{\theta^0, \pi^0, \ldots, \theta^k, \pi^k\}$ be the $\sigma$-algebra generated by the first $k+1$ actor and critic updates. Under Assumption 5, we can write the conditional expectation of $h_g^k$ as
$$
\mathbb{E}[h_g^k \,|\, \mathcal{F}_k] = \mathbb{E}_{\mu^{\pi^k}} \big[ \{ Q_{\theta^k}(s,a) - r(s,a) - \gamma Q_{\theta^k}(s',a') \} \phi(s,a) \,\big|\, \mathcal{F}_k \big]
$$
$$
= \mathbb{E}_{\mu^{\pi^k}} \big[ \phi(s,a) \{ \phi(s,a) - \gamma \phi(s',a') \}^\top \big] \big[\theta^k - \theta^\star(\pi^k)\big], \qquad (90)
$$
where $\mathbb{E}_{\mu^{\pi^k}}[\cdot]$ denotes the expectation taken with $s \sim \mu^{\pi^k}$, $a \sim \pi^k(\cdot|s)$, $s' \sim P(\cdot|s,a)$, $a' \sim \pi^k(\cdot|s')$. Under Assumptions 5 and 6, Lemma 3 of [4] shows that $\mathbb{E}[h_g^k \,|\, \mathcal{F}_k]$ is a semigradient of the MSBE function $\|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu^{\pi^k} \otimes \pi^k}^2$. Particularly, we obtain
$$
\mathbb{E}[h_g^k \,|\, \mathcal{F}_k]^\top \big[\theta^k - \theta^\star(\pi^k)\big] \geq (1-\gamma)\|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu^{\pi^k} \otimes \pi^k}^2 \geq \mu_{\rm td} \|\theta^k - \theta^\star(\pi^k)\|_2^2, \qquad (91)
$$

where we have let $\mu_{\rm td} = (1-\gamma)\mu_\phi^2$. Moreover, Lemma 5 of [4] demonstrates that the second order moment $\mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k]$ is bounded as
$$
\mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k] \leq 8\|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu^{\pi^k} \otimes \pi^k}^2 + \sigma_{\rm td}^2 \leq 8\|\theta^k - \theta^\star(\pi^k)\|_2^2 + \sigma_{\rm td}^2, \qquad (92)
$$
where $\sigma_{\rm td}^2 = 4r^2(1-\gamma)^{-2}$. Combining (91), (92) and recalling $\beta \leq \mu_{\rm td}/8$, it holds
$$
\mathbb{E}[\|\theta^{k+1} - \theta^\star(\pi^k)\|_2^2 \,|\, \mathcal{F}_k] = \|\theta^k - \theta^\star(\pi^k)\|_2^2 - 2\beta\, \mathbb{E}[h_g^k \,|\, \mathcal{F}_k]^\top \big[\theta^k - \theta^\star(\pi^k)\big] + \beta^2\, \mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k]
$$
$$
\leq \big(1 - 2\mu_{\rm td}\beta + 8\beta^2\big) \|\theta^k - \theta^\star(\pi^k)\|_2^2 + \beta^2 \sigma_{\rm td}^2 \leq (1 - \mu_{\rm td}\beta) \|\theta^k - \theta^\star(\pi^k)\|_2^2 + \beta^2 \sigma_{\rm td}^2. \qquad (93)
$$
By Young's inequality and Lemma 8, we further have
$$
\mathbb{E}[\|\theta^{k+1} - \theta^\star(\pi^k)\|_2^2 \,|\, \mathcal{F}_k] \leq (1+c)(1 - \mu_{\rm td}\beta)\|\theta^k - \theta^\star(\pi^{k-1})\|_2^2 + (1 + 1/c)\|\theta^\star(\pi^k) - \theta^\star(\pi^{k-1})\|_2^2 + \beta^2 \sigma_{\rm td}^2
$$
$$
\leq (1 - \mu_{\rm td}\beta/2)\|\theta^k - \theta^\star(\pi^{k-1})\|_2^2 + \Big( \frac{2}{\mu_{\rm td}\beta} - 1 \Big)\bar{L}_Q \|\pi^k - \pi^{k-1}\|_{\rho^\star,1}^2 + \beta^2 \sigma_{\rm td}^2, \qquad (94)
$$
where we have chosen $c > 0$ such that $(1+c)(1 - \mu_{\rm td}\beta) = 1 - \mu_{\rm td}\beta/2$, which implies that $1/c + 1 = 2/(\mu_{\rm td}\beta) - 1 > 0$ [cf. (45)]; the last inequality comes from Lemma 8 with the constant $\bar{L}_Q = (1-\gamma)^{-4} \cdot r^2 \cdot C_\rho^2 \cdot \mu_\phi^{-2}$.

From (89), (94), we identify that condition (31) of Lemma 5 holds with:
$$
\Omega^k = \mathrm{OPT}^k, \quad \Theta^k = \mathbb{E}[\|\pi^k - \pi^{k-1}\|_{\rho^\star,1}^2], \quad c_0 = \frac{1}{2\alpha}, \quad c_1 = \frac{\alpha}{(1-\gamma)^2}, \quad c_2 = \frac{4L_Q^2 \alpha}{(1-\gamma)^2},
$$
$$
\Upsilon^k = \mathbb{E}[\|\theta^k - \theta^\star(\pi^{k-1})\|^2], \quad d_0 = \mu_{\rm td}\beta/2, \quad d_1 = \Big( \frac{2}{\mu_{\rm td}\beta} - 1 \Big)\bar{L}_Q > 0, \quad d_2 = \beta^2 \sigma_{\rm td}^2.
$$
Selecting the step sizes as in (45), one can verify that $\frac{\alpha}{\beta} < \frac{\mu_{\rm td}(1-\gamma)}{16\sqrt{\bar{L}_Q}}$. This ensures
$$
c_0 - c_1 d_1 (d_0)^{-1} > \frac{1}{4\alpha}, \qquad d_0 - c_1 d_1 (c_0)^{-1} > \frac{\mu_{\rm td}\beta}{4}. \qquad (95)
$$
Applying Lemma 5, we obtain for any $K \geq 1$ that
$$
\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2] \leq \frac{\mathrm{OPT}^0 \cdot 4\alpha + \frac{8\alpha^2(1-\gamma)^{-2}}{\mu_{\rm td}\beta}\big( \Delta_Q^0 + \beta^2 \sigma_{\rm td}^2 \big)}{K} + \frac{8\alpha^2 \big( 4L_Q^2 + \beta \sigma_{\rm td}^2/\mu_{\rm td} \big)}{(1-\gamma)^2},
$$
$$
\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[\Delta_Q^{k+1}] \leq \frac{\mathbb{E}[\Delta_Q^0] + \beta^2 \sigma_{\rm td}^2 + \frac{4\alpha}{\mu_{\rm td}\beta}\bar{L}_Q\, \mathrm{OPT}^0}{\mu_{\rm td}\beta K/4} + \frac{\beta^2 \sigma_{\rm td}^2 + \frac{16\alpha^2}{\mu_{\rm td}\beta}(1-\gamma)^{-2}\bar{L}_Q L_Q^2}{\mu_{\rm td}\beta/4}.
$$
Particularly, plugging in $\alpha \asymp K_{\max}^{-3/4}$, $\beta \asymp K_{\max}^{-1/2}$ shows that the convergence rates are
$$
\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2] = O(K_{\max}^{-3/2}), \qquad \frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\Delta_Q^{k+1}] = O(K_{\max}^{-1/2}).
$$

Our last step is to analyze the convergence rate of the objective value $\mathrm{OPT}^k$. To this end, we observe the following three-point inequality [3]:
$$
\frac{-\langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star}}{1-\gamma} \leq \frac{1}{\alpha}\Big[ \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^k) - \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) - \bar{D}_{\psi,\rho^\star}(\pi^{k+1}, \pi^k) \Big]. \qquad (96)
$$
Meanwhile, by the inequalities (77), (79), (80), we have
$$
\ell(\pi^{k+1}) - \ell(\pi^k) = -\frac{1}{1-\gamma} \langle Q^{\pi^k}, \pi^{k+1} - \pi^k \rangle_{\rho^\star} + \frac{1}{1-\gamma} \langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star}
$$
$$
\leq \frac{1}{1-\gamma}\Big[ \langle -Q^{\pi^k}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} - \langle Q^{\pi^k}, \pi^\star - \pi^k \rangle_{\rho^\star} + \|\pi^{k+1} - \pi^\star\|_{\rho^\star,1}\|Q^{\pi^k} - Q^{\pi^{k+1}}\|_{\rho^\star,\infty} \Big]
$$
$$
\leq \frac{1}{1-\gamma}\Big[ \langle -Q^{\pi^k}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} - \langle Q^{\pi^k}, \pi^\star - \pi^k \rangle_{\rho^\star} + 2L_Q \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} \Big],
$$
where the last inequality follows from Lemma 8. Now, with the performance difference lemma $\ell(\pi^\star) - \ell(\pi^k) = (1-\gamma)^{-1}\langle -Q^{\pi^k}, \pi^\star - \pi^k \rangle_{\rho^\star}$, the above simplifies to
$$
\ell(\pi^{k+1}) - \ell(\pi^\star) \leq (1-\gamma)^{-1}\Big[ -\langle Q^{\pi^k}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} + 2L_Q \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} \Big].
$$
With $\langle Q^{\pi^k}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} = \langle Q^{\pi^k} - Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star} + \langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star \rangle_{\rho^\star}$ and applying the three-point inequality (96), we have
$$
\ell(\pi^{k+1}) - \ell(\pi^\star) \leq \frac{2}{1-\gamma}\Big[ L_Q \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} + \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty} \Big] + \frac{1}{\alpha}\Big[ \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^k) - \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) - \bar{D}_{\psi,\rho^\star}(\pi^{k+1}, \pi^k) \Big]
$$
$$
\leq \frac{2}{1-\gamma}\Big[ L_Q \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} + \|\theta^\star(\pi^k) - \theta^{k+1}\|_2 \Big] + \frac{1}{\alpha}\Big[ \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^k) - \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) \Big],
$$
where the last inequality uses (88) and the fact that $\bar{D}_{\psi,\rho^\star}$ is non-negative. Finally, taking the full expectation on both sides of the inequality, we obtain
$$
\mathrm{OPT}^{k+1} \leq 2(1-\gamma)^{-1}\,\mathbb{E}[\|\theta^\star(\pi^k) - \theta^{k+1}\|_2] + 2(1-\gamma)^{-1} L_Q\,\mathbb{E}[\|\pi^{k+1} - \pi^k\|_{\rho^\star,1}] + \frac{1}{\alpha}\,\mathbb{E}\big[ \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^k) - \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) \big]. \qquad (97)
$$
Summing up both sides from $k = 0$ to $k = K_{\max} - 1$ and dividing by $K_{\max}$ yields
$$
\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathrm{OPT}^k \leq \frac{1}{\alpha K_{\max}}\Big[ \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^0) - \bar{D}_{\psi,\rho^\star}(\pi^\star, \pi^{K_{\max}}) \Big] + \frac{2}{(1-\gamma)K_{\max}} \sum_{k=1}^{K_{\max}} \Big\{ \mathbb{E}[\|\theta^\star(\pi^{k-1}) - \theta^k\|_2] + L_Q \cdot \mathbb{E}[\|\pi^k - \pi^{k-1}\|_{\rho^\star,1}] \Big\}. \qquad (98)
$$
Using the Cauchy-Schwarz inequality, it can be easily seen that the right-hand side is $O(K_{\max}^{-1/4})$. This concludes the proof of the theorem.

C.1 Proof of Lemma 8

We first bound $|Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a)|$. By the Bellman equation (38) and the performance difference lemma (47), we have
$$
Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \cdot \big[ V^{\pi_1}(s') - V^{\pi_2}(s') \big]
$$
$$
= (1-\gamma)^{-1} \sum_{s' \in \mathcal{S}} P(s'|s,a) \cdot \mathbb{E}_{\tilde{s} \sim \tilde{\varrho}(s', \pi_1)}\big[ \langle Q^{\pi_2}(\tilde{s}, \cdot), \pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s}) \rangle \big], \qquad (99)
$$
where $\tilde{\varrho}(s', \pi_1)$ is the visitation measure of the Markov chain induced by $\pi_1$ with the initial state fixed to $s'$. Recall the definition of the visitation measure $\varrho(s,a,\pi)$ in (43). We rewrite (99) as
$$
Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) = (1-\gamma)^{-1} \cdot \mathbb{E}_{\tilde{s} \sim \varrho(s,a,\pi_1)}\big[ \langle Q^{\pi_2}(\tilde{s}, \cdot), \pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s}) \rangle \big]. \qquad (100)
$$
Moreover, notice that $\sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} |Q^\pi(s,a)| \leq (1-\gamma)^{-1} \cdot r$ under Assumption 4. Then, applying Hölder's inequality to (100), we obtain
$$
\big| Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) \big| \leq (1-\gamma)^{-1} \cdot \mathbb{E}_{\tilde{s} \sim \varrho(s,a,\pi_1)}\big[ \|Q^{\pi_2}(\tilde{s}, \cdot)\|_\infty \cdot \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1 \big]
$$
$$
\leq (1-\gamma)^{-2} \cdot r \cdot \mathbb{E}_{\tilde{s} \sim \rho^\star}\Big[ \frac{\varrho(s,a,\pi_1)}{\rho^\star}(\tilde{s}) \cdot \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1 \Big]
$$
$$
\leq (1-\gamma)^{-2} \cdot r \cdot \Big( \mathbb{E}_{\tilde{s} \sim \rho^\star}\Big[ \Big( \frac{\varrho(s,a,\pi_1)}{\rho^\star}(\tilde{s}) \Big)^2 \Big] \Big)^{1/2} \Big( \mathbb{E}_{\tilde{s} \sim \rho^\star}\big[ \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1^2 \big] \Big)^{1/2}
$$
$$
\leq (1-\gamma)^{-2} \cdot r \cdot C_\rho \cdot \|\pi_1 - \pi_2\|_{\rho^\star,1}, \qquad (101)
$$
where the second inequality is from the boundedness of $Q^\pi$, the third one is the Cauchy-Schwarz inequality, and the last one is from Assumption 7. Finally, we have
$$
\|Q^{\pi_1} - Q^{\pi_2}\|_{\rho^\star,\infty} \leq (1-\gamma)^{-2} \cdot r \cdot C_\rho \cdot \|\pi_1 - \pi_2\|_{\rho^\star,1}.
$$
It remains to bound $\|\theta^\star(\pi_1) - \theta^\star(\pi_2)\|_2$. Under Assumption 5, we have
$$
\|Q^{\pi_1} - Q^{\pi_2}\|_{\mu^{\pi^\star} \otimes \pi^\star}^2 = \mathbb{E}_{s \sim \mu^{\pi^\star}, a \sim \pi^\star(\cdot|s)}\big[ \big( Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) \big)^2 \big]
$$
$$
= \mathbb{E}_{s \sim \mu^{\pi^\star}, a \sim \pi^\star(\cdot|s)}\big[ \big( \phi(s,a)^\top [\theta^\star(\pi_1) - \theta^\star(\pi_2)] \big)^2 \big] = [\theta^\star(\pi_1) - \theta^\star(\pi_2)]^\top \Sigma_{\pi^\star} [\theta^\star(\pi_1) - \theta^\star(\pi_2)]. \qquad (102)
$$
Then, combining Assumption 6 and (101), we have
$$
\mu_\phi^2 \|\theta^\star(\pi_1) - \theta^\star(\pi_2)\|^2 \leq \|Q^{\pi_1} - Q^{\pi_2}\|_{\mu^{\pi^\star} \otimes \pi^\star}^2 \leq (1-\gamma)^{-4} \cdot r^2 \cdot C_\rho^2 \cdot \|\pi_1 - \pi_2\|_{\rho^\star,1}^2, \qquad (103)
$$
which yields the second inequality in Lemma 8. We conclude the proof.
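The flavor of Lemma 8 can be checked on a small random MDP by solving the Bellman equations exactly for two nearby policies. The sketch below uses the cruder sup-ratio version of the bound in (101), namely $\max_{s,a}|Q^{\pi_1} - Q^{\pi_2}| \leq (1-\gamma)^{-2}\, r\, \max_s \|\pi_1(\cdot|s) - \pi_2(\cdot|s)\|_1$; the MDP instance and this simplified constant are assumptions of the sketch, not the paper's $C_\rho$-based statement.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # rewards in [0, 1]

def q_pi(pi):
    # exact policy evaluation: V = (I - gamma P_pi)^{-1} r_pi, then Q = r + gamma P V
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return r + gamma * (P @ v)

pi1 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi2 = 0.9 * pi1 + 0.1 * rng.dirichlet(np.ones(n_actions), size=n_states)

q_gap = np.max(np.abs(q_pi(pi1) - q_pi(pi2)))
policy_gap = np.max(np.abs(pi1 - pi2).sum(axis=1))   # max_s ||pi1(.|s) - pi2(.|s)||_1
```

On any such instance, `q_gap` is controlled linearly by `policy_gap` with the $(1-\gamma)^{-2}$ constant, and each $Q^\pi$ obeys the crude bound $\sup |Q^\pi| \leq r/(1-\gamma)$ used above.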

D Auxiliary Lemmas

The proofs for the lemmas below can be found in the online appendix [24].
Lemma 9. [31, Lemma 12] Let $a > 0$ and let $\{\gamma_j\}_{j \geq 0}$ be a non-increasing, non-negative sequence such that $\gamma_0 < 1/a$. Then it holds for any $k \geq 0$ that
$$
\sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \leq \frac{1}{a}. \qquad (104)
$$
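Lemma 9 can be verified numerically; the diminishing step-size sequence below is an arbitrary choice satisfying the hypotheses (our choice, for illustration only).

```python
import numpy as np

def weighted_tail_sum(gammas, a):
    # sum_{j=0}^{k} gamma_j * prod_{l=j+1}^{k} (1 - gamma_l * a), computed backwards
    total, tail_prod = 0.0, 1.0
    for g in gammas[::-1]:
        total += g * tail_prod
        tail_prod *= 1.0 - g * a
    return total

a = 2.0
gammas = 0.4 / np.sqrt(np.arange(300) + 1.0)   # non-increasing, gamma_0 = 0.4 < 1/a
val = weighted_tail_sum(gammas, a)
```

The backward recursion mirrors the inductive proof: if the tail sum is at most $1/a$, then $\gamma + (1 - \gamma a)/a = 1/a$ keeps it there.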

Lemma 10. Fix a real number $1 < q \leq 2$. Let $a > 0$ and let $\{\gamma_j\}_{j \geq 0}$ be a non-increasing, non-negative sequence such that $\gamma_0 < 1/(2a)$. Suppose that $\frac{\gamma_{\ell-1}}{\gamma_\ell} \leq 1 + \frac{a}{2(q-1)}\gamma_\ell$. Then, it holds for any $k \geq 0$ that
$$
\sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \leq \frac{2}{a}\gamma_k^{q-1}. \qquad (105)
$$

Lemma 11. Fix the real numbers $a, b > 0$. Let $\{\gamma_j\}_{j \geq 0}$, $\{\rho_j\}_{j \geq 0}$ be non-increasing, non-negative sequences such that $2a\gamma_j \leq b\rho_j$ for all $j$, and $\rho_0 < 1/b$. Then, it holds that
$$
\sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \prod_{i=0}^{j} (1 - \rho_i b) \leq \frac{1}{a} \prod_{\ell=0}^{k} (1 - \gamma_\ell a), \qquad \forall\, k \geq 0. \qquad (106)
$$
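Similarly, (106) can be checked numerically for sequences satisfying $2a\gamma_j \leq b\rho_j$ and $\rho_0 < 1/b$; the particular sequences below are our choice, for illustration only.

```python
import numpy as np

a, b, k = 1.0, 4.0, 200
rhos = 0.2 / np.sqrt(np.arange(k + 1) + 1.0)   # non-increasing, rho_0 = 0.2 < 1/b
gammas = rhos.copy()                            # then 2*a*gamma_j = 2*rho_j <= b*rho_j

lhs = 0.0
for j in range(k + 1):
    tail = np.prod(1.0 - gammas[j + 1:] * a)    # prod_{l=j+1}^{k} (1 - gamma_l * a)
    head = np.prod(1.0 - rhos[: j + 1] * b)     # prod_{i=0}^{j} (1 - rho_i * b)
    lhs += gammas[j] * tail * head
rhs = np.prod(1.0 - gammas * a) / a
```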

E Technical Results Omitted from the Main Paper

E.1 Proof of Lemma 10

To derive this result, we observe that
$$
\sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \leq \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j \frac{\gamma_j^{q-1}}{\gamma_k^{q-1}} \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) = \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} \Big( \frac{\gamma_{\ell-1}}{\gamma_\ell} \Big)^{q-1} (1 - \gamma_\ell a).
$$
Furthermore, from the conditions on $\gamma_\ell$,
$$
\Big( \frac{\gamma_{\ell-1}}{\gamma_\ell} \Big)^{q-1} (1 - \gamma_\ell a) \leq \Big( 1 + \frac{a}{2(q-1)}\gamma_\ell \Big)^{q-1} (1 - \gamma_\ell a) \leq 1 - \frac{a}{2}\gamma_\ell.
$$
Therefore,
$$
\sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \leq \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} \Big( 1 - \frac{a}{2}\gamma_\ell \Big) \leq \frac{2}{a}\gamma_k^{q-1},
$$
where the last step applies Lemma 9 with $a/2$ in place of $a$. This concludes the proof.

E.2 Proof of Lemma 11

First of all, the condition $2a\gamma_j \leq b\rho_j$ implies
$$
\frac{1 - \rho_i b}{1 - \gamma_i a} \leq 1 - \rho_i b/2, \qquad \forall\, i \geq 0.
$$
As such, we observe $\prod_{i=0}^{j} \frac{1 - \rho_i b}{1 - \gamma_i a} \leq \prod_{i=0}^{j} (1 - \rho_i b/2)$ and subsequently,
$$
\sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \prod_{i=0}^{j} (1 - \rho_i b) \leq \Big[ \prod_{\ell=0}^{k} (1 - \gamma_\ell a) \Big] \sum_{j=0}^{k} \gamma_j \prod_{i=0}^{j} (1 - \rho_i b/2).
$$
Furthermore, for any $j = 0, \ldots, k$, it holds
$$
\rho_j \prod_{i=0}^{j-1} (1 - \rho_i b/2) = \frac{2}{b}\Big[ \prod_{i=0}^{j-1} (1 - \rho_i b/2) - \prod_{i=0}^{j} (1 - \rho_i b/2) \Big], \qquad (107)
$$
where we have taken the convention $\prod_{i=0}^{-1} (1 - \rho_i b/2) = 1$. We obtain that
$$
\sum_{j=0}^{k} \gamma_j \prod_{i=0}^{j} (1 - \rho_i b/2) \leq \frac{b}{2a} \sum_{j=0}^{k} \rho_j \prod_{i=0}^{j-1} (1 - \rho_i b/2) = \frac{1}{a} \sum_{j=0}^{k} \Big[ \prod_{i=0}^{j-1} (1 - \rho_i b/2) - \prod_{i=0}^{j} (1 - \rho_i b/2) \Big] \leq \frac{1}{a},
$$
where the last inequality follows from the bound $1 - \prod_{i=0}^{k} (1 - \rho_i b/2) \leq 1$. Combining with the above inequality yields the desired result.

E.3 Proof of Lemma 1

Proof. Since the samples are drawn independently, the expected value of $h_f^k$ is
$$
\mathbb{E}[h_f^k] = \nabla_x f(x,y) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}\Big[ \frac{t_{\max}\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big] \nabla_y f(x,y). \qquad (108)
$$
We have
$$
\|\overline{\nabla}_x f(x,y) - \mathbb{E}[h_f^k]\| = \Big\| \nabla_{xy}^2 g(x,y) \Big( \mathbb{E}\Big[ \frac{t_{\max}\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big] - \big[ \nabla_{yy}^2 g(x,y) \big]^{-1} \Big) \nabla_y f(x,y) \Big\|
$$
$$
\leq C_{gxy} C_{fy} \Big\| \mathbb{E}\Big[ \frac{t_{\max}\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big] - \big[ \nabla_{yy}^2 g(x,y) \big]^{-1} \Big\|,
$$
where the last inequality follows from Assumption 1-3 and Assumption 2-5. Applying [23, Lemma 3.2], the latter norm can be bounded by
$$
\frac{1}{\mu_g} \Big( 1 - \frac{\mu_g^2}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \Big)^{t_{\max}}.
$$
This concludes the proof for the first part of the lemma.
It remains to bound the variance of $h_f^k$. We first let
$$
H_{yy} = \frac{t_{\max}\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big).
$$

To estimate the variance of $h_f^k$, using (108), we observe that
$$
\mathbb{E}[\|h_f^k - \mathbb{E}[h_f^k]\|^2] = \mathbb{E}[\|\nabla_x f(x,y;\xi^{(1)}) - \nabla_x f(x,y)\|^2] + \mathbb{E}\Big[ \big\| \nabla_{xy}^2 g(x,y;\xi_0^{(2)}) H_{yy} \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y) \big\|^2 \Big].
$$
The first term on the right-hand side can be bounded by $\sigma_{fx}^2$. Furthermore,
$$
\nabla_{xy}^2 g(x,y;\xi_0^{(2)}) H_{yy} \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y)
$$
$$
= \big( \nabla_{xy}^2 g(x,y;\xi_0^{(2)}) - \nabla_{xy}^2 g(x,y) \big) H_{yy} \nabla_y f(x,y;\xi^{(1)}) + \nabla_{xy}^2 g(x,y)\{ H_{yy} - \mathbb{E}[H_{yy}] \}\nabla_y f(x,y;\xi^{(1)}) + \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}] \big( \nabla_y f(x,y;\xi^{(1)}) - \nabla_y f(x,y) \big).
$$
We also observe
$$
\mathbb{E}[\|\nabla_y f(x,y;\xi^{(1)})\|^2] = \mathbb{E}[\|\nabla_y f(x,y;\xi^{(1)}) - \nabla_y f(x,y)\|^2] + \|\nabla_y f(x,y)\|^2 \leq \sigma_{fy}^2 + C_y^2.
$$
Using $(a+b+c)^2 \leq 3(a^2 + b^2 + c^2)$ and the Cauchy-Schwarz inequality, we have
$$
\mathbb{E}\Big[ \big\| \nabla_{xy}^2 g(x,y;\xi_0^{(2)}) H_{yy} \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y) \big\|^2 \Big]
\leq 3\Big\{ (\sigma_{fy}^2 + C_y^2)\big\{ \sigma_{gxy}^2\, \mathbb{E}[\|H_{yy}\|^2] + C_{gxy}^2\, \mathbb{E}[\|H_{yy} - \mathbb{E}[H_{yy}]\|^2] \big\} + \sigma_{fy}^2 C_{gxy}^2 \cdot \|\mathbb{E}[H_{yy}]\|^2 \Big\}.
$$
Next, we observe that
$$
\mathbb{E}[\|H_{yy}\|^2] = \frac{t_{\max}\mu_g^2}{L_g^2(\mu_g^2 + \sigma_{gxy}^2)^2} \sum_{p=0}^{t_{\max}-1} \mathbb{E}\Big[ \Big\| \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big\|^2 \Big]. \qquad (109)
$$
Observe that the product of random matrices satisfies the conditions in Lemma 12 with
$$
\mu \equiv \frac{\mu_g^2}{L_g(\mu_g^2 + \sigma_{gxy}^2)}, \qquad \sigma^2 \equiv \frac{\mu_g^2 \sigma_{gxy}^2}{L_g^2(\mu_g^2 + \sigma_{gxy}^2)^2}.
$$
Under the condition $L_g \geq 1$, it can be seen that
$$
(1-\mu)^2 + \sigma^2 \leq 1 - \frac{\mu_g^2}{L_g(\mu_g^2 + \sigma_{gxy}^2)} < 1.
$$

Applying Lemma 12 shows that
$$
\mathbb{E}\Big[ \Big\| \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big\|^2 \Big] \leq d_1 \Big( 1 - \frac{\mu_g^2}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \Big)^p.
$$
Subsequently, summing the geometric series over $p$ in (109) gives
$$
\mathbb{E}[\|H_{yy}\|^2] \leq \frac{d_1 t_{\max}}{L_g(\mu_g^2 + \sigma_{gxy}^2)}.
$$
Furthermore, it is easy to derive that $\|\mathbb{E}[H_{yy}]\| \leq 1/\mu_g$.³ Together, the above gives the following estimate on the variance:
$$
\mathbb{E}[\|h_f^k - \mathbb{E}[h_f^k]\|^2] \leq \sigma_{fx}^2 + (\sigma_{fy}^2 + C_y^2)\big( \sigma_{gxy}^2 + 2C_{gxy}^2 \big) \max\Big\{ \frac{3}{\mu_g^2}, \frac{3 d_1 t_{\max}}{L_g(\mu_g^2 + \sigma_{gxy}^2)} \Big\} + \frac{3\sigma_{fy}^2 C_{gxy}^2}{\mu_g^2}. \qquad (110)
$$
This concludes the proof for the second part of the lemma.

We observe the following lemma on the product of (possibly non-PSD) matrices, which is inspired by [18, 25]:

Lemma 12. Let $Z_i$, $i = 0, 1, \ldots$ be a sequence of random matrices defined recursively as $Z_i = Y_i Z_{i-1}$, $i \geq 1$, with $Z_0 = I \in \mathbb{R}^{d \times d}$, where $Y_i$, $i = 1, 2, \ldots$ are independent, symmetric, random matrices satisfying $\|\mathbb{E}[Y_i]\| \leq 1 - \mu$ and $\mathbb{E}[\|Y_i - \mathbb{E}[Y_i]\|_2^2] \leq \sigma^2$. If $(1-\mu)^2 + \sigma^2 < 1$, then for any $t \geq 0$, it holds
$$
\mathbb{E}[\|Z_t\|^2] \leq \mathbb{E}[\|Z_t\|_2^2] \leq d\big( (1-\mu)^2 + \sigma^2 \big)^t, \qquad (111)
$$
where $\|X\|_2$ denotes the Schatten-2 norm of the matrix $X$.

Proof. We note that the norm equivalence between the spectral norm and the Schatten-2 norm yields $\|Z_t\| \leq \|Z_t\|_2$ and thus $\mathbb{E}[\|Z_t\|^2] \leq \mathbb{E}[\|Z_t\|_2^2]$. For any $i \geq 1$, we observe that
$$
Z_i = \underbrace{(Y_i - \mathbb{E}[Y_i]) Z_{i-1}}_{=A_i} + \underbrace{\mathbb{E}[Y_i] Z_{i-1}}_{=B_i}.
$$
Notice that as $\mathbb{E}[A_i | B_i] = 0$, applying [25, Proposition 4.3] yields
$$
\mathbb{E}[\|Z_t\|_2^2] \leq \mathbb{E}[\|A_t\|_2^2] + \mathbb{E}[\|B_t\|_2^2]. \qquad (112)
$$
Furthermore, using the fact that the $Y_i$ are independent random matrices and Hölder's inequality for matrices, we observe that
$$
\mathbb{E}[\|A_t\|_2^2] \leq \mathbb{E}[\|Y_t - \mathbb{E}[Y_t]\|_2^2 \|Z_{t-1}\|_2^2] = \mathbb{E}[\|Y_t - \mathbb{E}[Y_t]\|_2^2]\, \mathbb{E}[\|Z_{t-1}\|_2^2] \leq \sigma^2\, \mathbb{E}[\|Z_{t-1}\|_2^2],
$$
and using [25, (4.1)],
$$
\mathbb{E}[\|B_t\|_2^2] \leq \|\mathbb{E}[Y_t]\|^2\, \mathbb{E}[\|Z_{t-1}\|_2^2] \leq (1-\mu)^2\, \mathbb{E}[\|Z_{t-1}\|_2^2].
$$
Substituting the above into (112) yields $\mathbb{E}[\|Z_t\|_2^2] \leq ((1-\mu)^2 + \sigma^2)\, \mathbb{E}[\|Z_{t-1}\|_2^2]$. Repeating the same argument $t$ times and using $\|I\|_2^2 = d$ yields the upper bound.
³Notice that a slight modification of the proof of [23, Lemma 3.2] yields $(\mathbb{E}[\|H_{yy}\|])^2 \leq \mu_g^{-2}$. However, the latter lemma requires $I - (1/L_g)\nabla_{yy}^2 g(x,y;\xi_i^{(2)})$ to be a PSD matrix almost surely, which is not required in our analysis.
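The moment bound (111) can be illustrated by Monte Carlo simulation. The Gaussian perturbation model below is one convenient family satisfying the lemma's assumptions; the dimensions, constants, and distribution are our choices, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu, sigma, t = 4, 0.3, 0.2, 12

def sample_Y():
    # symmetric Y_i with ||E[Y_i]|| = 1 - mu and E||Y_i - E[Y_i]||_2^2 <= sigma^2
    A = rng.standard_normal((d, d))
    return (1 - mu) * np.eye(d) + (sigma / (2 * d)) * (A + A.T)

def sample_Zt():
    Z = np.eye(d)
    for _ in range(t):
        Z = sample_Y() @ Z
    return Z

# Monte Carlo estimate of E||Z_t||_2^2 (the Frobenius norm is the Schatten-2 norm)
second_moment = np.mean([np.linalg.norm(sample_Zt()) ** 2 for _ in range(2000)])
bound = d * ((1 - mu) ** 2 + sigma ** 2) ** t
```

The estimated second moment sits comfortably below `bound`, consistent with the geometric decay predicted by Lemma 12.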

References
[1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient
methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research,
2021.
[2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
71(1):89–129, 2008.
[3] Amir Beck. First-order methods in optimization, volume 25. SIAM, 2017.
[4] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning
with linear function approximation. In Conference On Learning Theory, pages 1691–1692, 2018.
[5] Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural
actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 105–112, 2008.
[6] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba
Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation.
In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
[7] Vivek S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–
294, 1997.
[8] Vivek S Borkar and Sarath Pattathil. Concentration bounds for two time scale stochastic approximation.
In Allerton Conference on Communication, Control, and Computing, pages 504–511, 2018.
[9] Jerome Bracken, James E. Falk, and James T. McGill. Technical note—the equivalence of two mathe-
matical programs with optimization problems in the constraints. Operations Research, 22(5):1102–1104,
1974.
[10] Jerome Bracken and James T. McGill. Mathematical programs with optimization problems in the
constraints. Operations Research, 21(1):37–44, 1973.
[11] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In
ICML, 2019.
[12] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of
Operations Research, 153(1):235–256, 2007.
[13] Nicolas Couellan and Wenjuan Wang. On the convergence of stochastic bi-level gradient methods. 2016.
[14] Gal Dalal, Balazs Szorenyi, and Gugan Thoppe. A tale of two-timescale reinforcement learning with
the tightest finite-time bound. arXiv preprint arXiv:1911.09157, 2019.
[15] Christoph Dann, Gerhard Neumann, Jan Peters, et al. Policy evaluation with temporal differences: A
survey and comparison. Journal of Machine Learning Research, 15:809–883, 2014.
[16] Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{-1/4})
on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
[17] Thinh T Doan. Nonlinear two-time-scale stochastic approximation: Convergence and finite-time per-
formance. arXiv preprint arXiv:2011.01868, 2020.
[18] Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai.
Tight high probability bounds for linear stochastic approximation with fixed stepsize. In NeurIPS,
2021.
[19] James E. Falk and Jiming Liu. On bilevel programming, part I: General nonlinear cases. Mathematical
Programming, 70:47–72, 1995.
[20] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of
deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.

35
[21] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Inter-
national Conference on Machine Learning, 2019.
[22] L Franceschi, P Frasconi, S Salzo, R Grazzi, and M Pontil. Bilevel programming for hyperparameter
optimization and meta-learning. In International Conference on Machine Learning, pages 1563–1572,
2018.
[23] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint
arXiv:1802.02246, 2018.
[24] M. Hong, H. Wai, Z. Wang, and Z. Yang. A two-timescale framework for bilevel optimization: Com-
plexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.
[25] De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products.
Foundations of Computational Mathematics, pages 1–33, 2021.
[26] Y. Ishizuka and E. Aiyoshi. Double penalty method for bilevel optimization problems. Ann Oper Res,
34:73–88, 1992.
[27] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Provably faster algorithms for bilevel optimization and
applications to meta-learning. arXiv preprint arXiv:2010.07962, 2020.
[28] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning
with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
[29] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In
International Conference on Machine Learning, pages 267–274, 2002.
[30] Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems,
pages 1531–1538, 2002.
[31] Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis
of linear two-timescale stochastic approximation with Markovian noise. In COLT, 2020.
[32] Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with con-
trolled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research,
43(1):130–151, 2018.
[33] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information
Processing Systems, pages 1008–1014, 2000.
[34] Vijay R Konda, John N Tsitsiklis, et al. Convergence rate of linear two-time-scale stochastic approxi-
mation. Annals of Applied Probability, 14(2):796–819, 2004.
[35] Junyi Li, Bin Gu, and Heng Huang. Improved bilevel model: Fast and optimal algorithm with theoretical
guarantee. arXiv preprint arXiv:2009.00690, 2020.
[36] Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, and Adrian Weller. Ufo-blo:
Unbiased first-order bilevel optimization. arXiv preprint arXiv:2006.03631, 2020.
[37] Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, and Marek Petrik. Proximal
gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity.
Journal of Artificial Intelligence Research, 63:461–494, 2018.
[38] Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A generic first-order algorithmic
framework for bi-level programming beyond lower-level singleton. arXiv preprint arXiv:2006.04045,
2020.
[39] Zhi-Quan Luo, Jong-Shi Pang, and Daniel Ralph. Mathematical Programs with Equilibrium Constraints.
Cambridge University Press, 1996.
[40] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy
learning control with function approximation. In International Conference on Machine Learning, 2010.
[41] Akshay Mehra and Jihun Hamm. Penalty method for inversion-free deep bilevel optimization. In Asian
Conference on Machine Learning, 2019.

36
[42] Abdelkader Mokkadem, Mariane Pelletier, et al. Convergence rate and averaging of nonlinear two-time-
scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
[43] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine
Learning Research, 9(May):815–857, 2008.
[44] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International conference
on machine learning, pages 737–746. PMLR, 2016.
[45] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[46] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse?
towards understanding the effectiveness of MAML. In International Conference on Learning Represen-
tations, 2019.
[47] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit
gradients. 2019.
[48] Shoham Sabach and Shimrit Shtern. A first order method for solving convex bilevel optimization
problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
[49] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[50] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for
bilevel optimization. In Artificial Intelligence and Statistics, pages 1723–1732, 2019.
[51] H. von Stackelberg. The theory of the market economy. Oxford University Press, 1952.
[52] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning,
3(1):9–44, 1988.
[53] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
[54] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári,
and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function
approximation. In International Conference on Machine Learning, pages 993–1000, 2009.
[55] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence
and machine learning, 4(1):1–103, 2010.
[56] L. Vicente, G. Savard, and J. Judice. Descent approaches for quadratic bilevel programming. Journal
of Optimization Theory and Applications, 81:379–399, 1994.
[57] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for
minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449,
2017.
[58] D. J. White and G. Anandalingam. A penalty function approach for solving bi-level linear programs.
Journal of Global Optimization, 3:397–419, 1993.
[59] Yue Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite time analysis of two time-scale actor-critic
methods. In NeurIPS, 2020.
[60] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[61] Tengyu Xu, Zhe Wang, and Yingbin Liang. Non-asymptotic convergence analysis of two time-scale
(natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557, 2020.
[62] Lin Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In
International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.

