
A LYAPUNOV ANALYSIS FOR ACCELERATED GRADIENT METHODS:
FROM DETERMINISTIC TO STOCHASTIC CASE

MAXIME LABORDE AND ADAM M. OBERMAN

arXiv:1908.07861v2 [math.OC] 19 Sep 2019

Abstract. The article [SBC14] made a connection between Nesterov’s accelerated gradient descent method
and an ordinary differential equation (ODE). We show that this connection can be extended to the case of
stochastic gradients, and develop a Lyapunov-function-based convergence rate proof for Nesterov’s accelerated
stochastic gradient descent. In the full gradient case, we show that if a Hessian damping term is added to the
ODE from [SBC14], then Nesterov’s method arises as a straightforward discretization of the modified ODE.
The established Lyapunov analysis is used to recover the accelerated rates of convergence in both continuous
and discrete time. Moreover, the Lyapunov analysis extends to the case of stochastic gradients, which allows
the full gradient case to be treated as a special case of the stochastic one. The result is a unified approach
to convex acceleration in both continuous and discrete time and in both the stochastic and full gradient
cases.

1. Introduction
In [SBC14], Su, Boyd and Candès made the connection between Nesterov’s accelerated gradient descent
method and a second order differential equation. The goal of the approach appeared to be to develop insight
into Nesterov’s algorithm, possibly leading to new optimization algorithms. This work resulted in a renewed
interest in continuous time approaches; for example, in [WWJ16, WRJ16] and [WMW19] a Lyapunov analysis
is carried out to recover the optimal rate of convergence. Continuous time analysis also appears in [FB15], [LRP16],
and [KBB15], among many other recent works. Of course, the Lyapunov approach to proving convergence
rates appears widely in optimization; for example, see [BT09] for FISTA.
So far there is less work on continuous time approaches to stochastic optimization. Stochastic Gradient
Descent (SGD) [RM51] is a widely used optimization algorithm due to its ubiquitous use in machine learning
[Bot91, BCN16]. Convergence rates are available in a wide setting [LJSB12, BCN16, QRG+19]. When SGD
is combined with momentum [Pol64, Nes13], empirical performance is improved, but this improvement is not
always theoretically established [KNJK18]. The optimal convergence rate for SGD in the smooth, strongly
convex case is of order 1/k. In the convex, nonsmooth case, the optimal rate goes down to O(1/√k) [NJLS09, Bub14].
Accelerated versions of stochastic gradient descent algorithms are comparatively more recent: they
appear in [LMH15] as well as in [FGKS15] and [JKK+18]. A direct acceleration method with a connection to
Nesterov’s method can be found in [AZ17]. For the continuous case, going from ODEs and Lyapunov analysis
to perturbed ODEs was done in [APR16], with results for accelerated gradient descent in continuous time.

Outline of approach. One goal of the continuous time approach is to take advantage of the Lyapunov function
approach to obtaining convergence rates for differential equations, with the hope that an explicit discretiza-
tion of the differential equation leads to an algorithm which also decreases the Lyapunov function. We
present a general Lyapunov function approach which allows us to go from continuous time to discrete time,
in both the full gradient and stochastic gradient setting. The discrete time result follows from the continuous
time one by enforcing a restriction on the time step (learning rate), as in (CFL) below. Since we use a first
order system to represent our ODE, the analysis can also be adapted to the non-smooth case.
The abstract analysis applies in particular to each of the cases: continuous time/algorithm, acceler-
ated/standard gradient descent, convex/strongly convex, and full gradients/stochastic gradients. In each
case, we combine a Lyapunov function and a differential equation/finite difference equation. There is a fairly
systematic way to pass from one case to another, which we make an effort to make clear. We use the same Lyapunov
function E(t, z) to go from continuous time to the algorithm. Going from full gradients to perturbed gradi-
ents requires adding a second term, I, to the Lyapunov function: this term satisfies an easily solved ODE in
the continuous time case, and an easily solved recursion equation in the discrete time case.

We extend the analysis to the case of stochastic gradients by first analyzing what we call the “perturbed”
gradient case. We write

(1)    ∇̃f(x) = ∇f(x) + e,
where e is an error term. We first perform the analysis in an abstract setting with time step/learning rate
hk. The key step is to obtain the following inequality for a rate-generating Lyapunov function:

(2)    E(tk+1, zk+1) ≤ (1 − rE hk)E(tk, zk) + hk βk.

Here rE ≥ 0 is a rate constant coming from the Lyapunov function, which is zero in the convex case and
positive in the strongly convex case. The term βk depends on the error ek from (1); it is zero in the full
gradient case. The algorithms involve no averaging of previous values.
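To make the role of (2) concrete, the following Python sketch iterates the worst case of the inequality (equality at every step). The constants rE, hk and βk below are hypothetical choices of ours, used only to illustrate the mechanism, not values prescribed by the analysis.

```python
# Worst-case iteration of the key inequality (2):
#   E_{k+1} <= (1 - rE * hk) * E_k + hk * betak.
# The constants r, h(k) and beta(k) below are hypothetical choices.

def iterate_bound(E0, steps, r, h, beta):
    """Iterate (2) with equality at every step (the worst case)."""
    E = E0
    for k in range(steps):
        E = (1 - r * h(k)) * E + h(k) * beta(k)
    return E

# Full gradient, strongly convex flavour: betak = 0, constant step,
# giving exact geometric decay E_k = (1 - r*h)^k * E0.
E_exact = iterate_bound(1.0, 200, r=0.5, h=lambda k: 0.1, beta=lambda k: 0.0)

# Stochastic flavour: betak proportional to hk, with hk ~ 1/k, so the
# accumulated error contribution still vanishes and E_k keeps shrinking.
E_noisy = iterate_bound(1.0, 5000, r=0.5,
                        h=lambda k: 1.0 / (k + 10),
                        beta=lambda k: 0.1 / (k + 10))
print(E_exact, E_noisy)
```

With βk = 0 and a constant step the bound contracts geometrically, while with βk proportional to a decreasing hk it still tends to zero; this is the mechanism behind the stochastic rates summarized in Figure 1.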
When we consider the stochastic case, we take expectations in (2): E[βk] is proportional to hk σ², where σ²
is the variance of the perturbation of the gradient. After taking expectations, we establish convergence
rates for standard and accelerated SGD in both the convex and strongly convex cases. In the strongly convex
case, we obtain an algorithm which corresponds to an adaptive time step version of Nesterov’s method,
with stochastic gradients and a time step of order 1/k. The algorithm recovers the 1/k rate of convergence
of other SGD algorithms, Propositions 4.8 and 6.16.
In the convex case, we obtain an algorithm which corresponds to Nesterov’s algorithm in the convex
case, but with a scheduled learning rate. The learning rate schedule we obtain is different from existing
ones: it corresponds to Nesterov’s algorithm with learning rate hk = 1/k^(3/4+ǫ). The convergence rate is
1/k^(1/2−2ǫ), for any ǫ > 0, Proposition 5.19. A summary of our asymptotic results in the convex and strongly
convex stochastic gradient cases can be found in Figure 1.

                            time step hk                      rate of E[f(xk)] − f*

Convex       SGD            k^(−α), α ∈ (2/3, 1]              O(k^(α−1)) if α ∈ (2/3, 1);  O(ln(k)^(−1)) if α = 1
             Acc. SGD       k^(−α), α ∈ (3/4, 1]              O(k^(2(α−1))) if α ∈ (3/4, 1);  O(ln(k)^(−2)) if α = 1
µ-Strongly   SGD            2/(µk + 2(Cf+1)σ² E0^(−1))        2(Cf+1)σ²/(µk + 2(Cf+1)σ² E0^(−1))
Convex       Acc. SGD       2/(µk + 2σ² E0^(−1))              2σ²/(µk + 2σ² E0^(−1))

Figure 1. Convergence rates in expectation of f(xk) − f*, where hk is a non-constant
learning rate and the error satisfies E[ek] = 0 and Var(ek) = σ². E0 denotes the
Lyapunov function at the initial time and Cf := L/µ denotes the condition number of the µ-
strongly convex, L-smooth function f.

In addition, we obtain non-asymptotic error estimates in finite time, which measure the deviation from the
rate in the full gradient case: the correction is a sum or integral of the error terms. By summing the errors,
we obtain a result in the constant time step case, assuming decreasing error sizes. In the case of constant
time steps, we need to assume that |ei| decreases fast enough; however, we do not take expectations of
the error or make mean zero assumptions. In particular, we cover the biased gradient case.

Remark 1.1 (interpretation of the results). One motivation/application for this work is the use of accelerated
SGD in deep learning. Practical implementations use Polyak’s momentum, and decrease the time step after
several epochs (a fixed number of time steps). In the first phase, one sees behaviour consistent with accelerated
gradient descent; then the noise dominates, and the learning rate is decreased. Our analysis covers both
phases: we show that after a finite number of iterations with constant learning rate, the decrease in the f gap is
consistent with the accelerated rate plus a correction due to the error in the gradients. Second, we show that
we can obtain the asymptotic 1/k rate in the strongly convex case using accelerated gradient descent with
decreasing time step.

Discussion of asymptotic error rates in the perturbed case. In the perturbed gradient case, we consider the
case of decreasing errors ei but fixed learning rates. We show that if the size of the errors decreases quickly
enough, we can recover the asymptotic convergence rates of the unperturbed gradient case. A sufficient
condition is that

|ek| ∼ 1/k^α

with α > 1 for perturbed gradient descent and α > 2 for perturbed accelerated gradient descent. Thus, in
order for the accelerated rates to be obtained, the perturbations of the gradients need to go to zero faster in
the accelerated case.
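The sufficient condition above can be checked empirically. The sketch below runs perturbed gradient descent (1) on a simple strongly convex quadratic with |ek| = k^(−α); the test function, step size and horizon are illustrative choices of ours, not quantities from the paper.

```python
import numpy as np

# Perturbed gradient descent (1) on the strongly convex quadratic
# f(x) = 0.5 x^T A x (so f* = 0), with error sizes |e_k| = k^(-alpha).
# The problem, step size and error model are illustrative choices.

rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 10.0])
L = 10.0                                  # largest eigenvalue of A

def f(x):
    return 0.5 * x @ A @ x

def run(alpha=None, steps=2000, h=1.0 / L):
    x = np.array([1.0, 1.0, 1.0])
    for k in range(1, steps + 1):
        if alpha is None:
            e = np.zeros(3)               # unperturbed gradient descent
        else:
            d = rng.standard_normal(3)
            e = d / np.linalg.norm(d) / k ** alpha   # |e_k| = k^(-alpha)
        x = x - h * (A @ x + e)           # step with perturbed gradient (1)
    return f(x)

gap_clean = run()                         # e_k = 0
gap_summable = run(alpha=1.5)             # sum_k |e_k| < infinity
print(gap_clean, gap_summable)
```

With α = 1.5 (so the errors are summable) the perturbed iterates reach essentially the same objective gap as the unperturbed ones.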
Under these assumptions, by introducing a perturbed Lyapunov function to compensate for the effect of
the error, inspired by the continuous time analysis of [APR16, ACPR18], we are able to obtain the same
accelerated rate of convergence as in the deterministic case, Corollary 5.10 and Corollary 5.15; see Figure 2.

                            error                             rate of f(xk) − f*

Continuous   SGD            ∫_0^∞ |e(t)| dt < +∞              O(1/t)
             Acc. SGD       ∫_0^∞ t|e(t)| dt < +∞             O(1/t²)
Discrete     SGD            Σ_{k=0}^∞ |ek| < +∞               O(1/k)
             Acc. SGD       Σ_{k=0}^∞ k|ek| < +∞              O(1/k²)

Figure 2. Convergence rates in the convex case, where h is the learning rate, with
0 < h ≤ 1/L in the gradient descent case and 0 < h ≤ 1/√L in the accelerated case.

In addition, if we assume that f is strongly convex, we can introduce another Lyapunov function in order
to take advantage of the gap in the dissipation along the dynamics. This implies an accelerated estimate
on the decrease of |∇f(yk)|², Corollary 5.18.
Similarly, in the strongly convex case, the perturbation needs to go to zero faster in the accelerated case
compared to the non-accelerated case. In the discrete case, |ek| should decrease as exp(−µk) for
gradient descent, whereas, for the accelerated method, |ek| should decrease as exp(−√µ k), which is faster
in the relevant case µ < 1. Then, under these assumptions, the same accelerated rate as in the deterministic
case is achieved, see Corollary 6.10 in the continuous case and Corollary 6.13 in the discrete case. A summary
of the results in the strongly convex case is given in Figure 3.

                            error                                      rate of f(xk) − f*

Continuous   SGD            ∫_0^∞ exp(µt)|e(t)| dt < +∞                O(exp(−µt))
             Acc. SGD       ∫_0^∞ exp(√µ t)|e(t)| dt < +∞              O(exp(−√µ t))
Discrete     SGD            Σ_{k=0}^∞ (1 − hµ)^(−k) |ek| < +∞          O((1 − hµ)^k)
             Acc. SGD       Σ_{k=0}^∞ (1 − h√µ)^(−k) |ek| < +∞         O((1 − h√µ)^k)

Figure 3. Convergence rates in the strongly convex case, where h is the learning rate, with
0 < h ≤ 2/(L + µ) in the gradient descent case and 0 < h ≤ 1/√L in the accelerated
case.

Remark 1.2 (Applications of the abstract perturbed gradient). The perturbation of the gradient can be abstract.
In particular:
(1) We can have a stochastic gradient, where the gradient is a mini-batch gradient. In order to convert from
the mini-batch gradient to an abstract error, we require an estimate of the mean and variance of
the mini-batch error e = ∇f(x) − ∇_I f(x).
(2) We can include the case where the error involves variance reduction [JZ13]: the gradient is corrected by a snapshot
of the full gradient at a snapshot location ỹ, which is updated every m iterations,
ek = ∇f(ỹ) − ∇_I f(ỹ) − (∇f(yk) − ∇_I f(yk)).
The combination of variance reduction and momentum was discussed in [AZ17].

(3) The error ek can also represent the difference between Nesterov’s method and Polyak’s momentum
method, which comes from replacing ∇f(yk) by ∇f(xk). This difference can simply
be absorbed into the error in the gradient,
∇f(yk) + ek = ∇f(xk) + ẽk,    ẽk = ek + O(xk − yk).
In the early phase of the algorithm, where we compare to the accelerated gradient method rate,
the finite time error estimate can simply include this term. So, in Phase 1, Polyak SGD is
not too different from Nesterov SGD.
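For item (1), the required mean and variance of the mini-batch error can be estimated empirically. Below is a sketch for a least squares objective; the data, batch size and sample count are all illustrative choices of ours.

```python
import numpy as np

# Estimating the mean and variance of the mini-batch error
# e = grad f(x) - grad_I f(x) for the least squares objective
# f(x) = (1/2n) sum_i (a_i . x - b_i)^2. All sizes are illustrative.

rng = np.random.default_rng(1)
n, d, batch = 1000, 5, 32
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)

full_grad = A.T @ (A @ x - b) / n

errors = []
for _ in range(2000):
    I = rng.choice(n, size=batch, replace=False)    # uniform mini-batch
    mb_grad = A[I].T @ (A[I] @ x - b[I]) / batch
    errors.append(full_grad - mb_grad)
errors = np.array(errors)

mean_err = np.linalg.norm(errors.mean(axis=0))      # should be near 0
var_err = errors.var(axis=0).sum()                  # empirical sigma^2 at x
print(mean_err, var_err)
```

Sampling the batch uniformly gives an error with (numerically) zero mean, and the empirical variance plays the role of σ² in Figure 1.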
1.1. Other related work. Our goal here is to obtain rates for optimization algorithms using a continuous
time perspective. The idea put forward by many authors, notably [SBC14, WWJ16, WRJ16], is that the
convergence proofs of accelerated algorithms do not give enough insight, and that building the connection
with continuous time methods may bring the insight needed to more easily develop new algorithms.
However, continuous time approaches to optimization have been around for a long time. Polyak’s method
[Pol64] is related to successive over-relaxation for linear equations [Var57], which was initially used to
accelerate the solution of linear partial differential equations [You54]. Continuous time interpretations of
Newton’s method can be found in [Pol87] or [AABR02], and of mirror descent [NY83] in
[B+15, XWG18a, XWG18b].
Indeed, the continuous time approach to first order convex optimization is a very well-developed theory,
and there exists a huge literature on the study of Nesterov’s method by continuous time and ODE arguments.
The continuous time analysis can offer a very good framework for optimization and may lead to a better
understanding of algorithms. However, the passage from the continuous analysis to discrete algorithms is still
not clearly defined, despite the recent work of [WWJ16, WRJ16, WMW19].
Related work studying discretizations of ordinary and partial differential equations which respect Lya-
punov functions can be found in [SH96]; although in this case the discretizations are typically implicit, so
they require further solution of equations to obtain an algorithm.
1.2. Notations and organization. Throughout the paper, denote x* = argmin_x f(x) and −∞ < f* :=
f(x*) = min_x f(x). We say that a function f : R^d → R is L-smooth if it satisfies

f(y) − f(x) − ∇f(x) · (y − x) ≤ (L/2)|x − y|².

In addition, we also consider the class of µ-strongly convex functions, i.e. functions f such that f − (µ/2)|·|² is convex,
which satisfy

f(x) + ∇f(x) · (y − x) ≤ f(y) − (µ/2)|x − y|².

Combining these properties, we get, in particular, for L-smooth convex functions,

(3)    (1/(2L))|∇f(x)|² ≤ f(x) − f*.

The condition number of a µ-strongly convex, L-smooth function f is denoted Cf and defined by Cf := L/µ.
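As a sanity check, inequality (3) can be verified numerically on a convex quadratic f(x) = ½ xᵀAx, for which f* = 0 and L is the largest eigenvalue of A; the test function is our choice, not one from the paper.

```python
import numpy as np

# Numerical sanity check of inequality (3) on the convex quadratic
# f(x) = 0.5 x^T A x, for which f* = 0 and L = max eigenvalue of A.

rng = np.random.default_rng(2)
A = np.diag([0.5, 3.0, 7.0])
L = 7.0

for _ in range(100):
    x = rng.standard_normal(3)
    grad = A @ x
    fx = 0.5 * x @ A @ x
    assert grad @ grad / (2 * L) <= fx + 1e-12   # inequality (3)
print("inequality (3) holds on all samples")
```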

The paper is organized as follows. In section 2, we introduce second order ODEs with Hessian damping,
(H-ODE) and (H-ODE-SC), and especially their associated first order systems, (1st-ODE) and (1st-ODE-SC).
We show that Nesterov’s schemes derive from an explicit discretization of these systems in both convex and
strongly convex cases. Section 3 is devoted to the presentation of an abstract Lyapunov analysis in order to
obtain rates for optimization algorithms using a continuous time perspective. Then we extend this analysis
to the perturbed case where the gradient is replaced by ∇̃f from (1). In this case, provided that the error term
decreases fast enough, we show an abstract convergence result with the same rate as in the unperturbed
case. To conclude this section, we consider the case of a variable time step and an error with fixed variance.
Under these assumptions, we provide abstract convergence results in expectation. Then, we apply this abstract
analysis in Section 4 to the gradient descent case for convex and strongly convex functions. Finally, in
sections 5 and 6, we extend this framework to the special case of accelerated gradient methods, applying it
to the first order systems (1st-ODE) and (1st-ODE-SC). In particular, in the unperturbed case, we recover
the usual optimal rates in the continuous and discrete settings. In addition, we give an accelerated rate
for the gradient, taking advantage of the gap in the dissipation of the Lyapunov functions. Concerning the
perturbed case, we present a slight extension of our abstract setting to recover the optimal rates from the
unperturbed case. In addition, we give an accelerated rate in the stochastic case.

2. ODEs and derivation of Nesterov’s methods


In this section, we introduce, in both the convex and strongly convex cases, a second order ODE which is a
perturbation of the one introduced by Su, Boyd and Candès [SBC14], with a Hessian damping term. However,
the analysis of this ODE requires f to be twice differentiable. Nevertheless, we introduce a first order
system, equivalent to the second order ODE when f is smooth. We prove that Nesterov’s methods derive
from an explicit discretization of our first order systems. One advantage of the first order system is that it
allows us to deal with non-smooth f. Moreover, to go to the stochastic gradient case, we really need a first
order system in continuous time: indeed, in [SBC14] and [SDJS18], a term in ẋ appears in the Lyapunov
function, and it is then not clear how to extend this to the stochastic case.
We first present the convex case and then deal with the strongly convex case.
2.1. Convex case. In [SBC14] Su, Boyd and Candès made a connection between Nesterov’s method for a
convex, L-smooth function f and the second order ordinary differential equation (ODE)

(A-ODE)    ẍ + (3/t)ẋ + ∇f(x) = 0,

which can be written as the first order system

(4)    ẋ = (2/t)(v − x),    v̇ = −(t/2)∇f(x).
Our starting point is the following system of first order ODEs, which is a slight modification of (4):

(1st-ODE)    ẋ = (2/t)(v − x) − (1/√L)∇f(x),    v̇ = −(t/2)∇f(x).

The system (1st-ODE) is equivalent to the following second order differential equation with a Hessian damping
term:

(H-ODE)    ẍ + (3/t)ẋ + ∇f(x) = −(1/√L)(D²f(x)·ẋ + (1/t)∇f(x)).
Derivation of (H-ODE). Solve for v in the first line of (1st-ODE),

v = (t/2)(ẋ + (1/√L)∇f(x)) + x,

and differentiate to obtain

v̇ = (1/2)(ẋ + (1/√L)∇f(x)) + (t/2)(ẍ + (1/√L)D²f(x)·ẋ) + ẋ.

Insert into the second line of (1st-ODE),

(1/2)(ẋ + (1/√L)∇f(x)) + (t/2)(ẍ + (1/√L)D²f(x)·ẋ) + ẋ = −(t/2)∇f(x),

and simplify to obtain (H-ODE). □
We will show below that solutions of (1st-ODE) decrease the same Lyapunov function faster than solutions
of (A-ODE). Interestingly, this leads to the second order ODE (H-ODE), which has an additional Hessian
damping term with coefficient 1/√L. This Hessian damping term combines the continuous time Newton method
and the accelerated dynamics (A-ODE). Notice that (H-ODE) is a perturbation of (A-ODE) of order 1/√L, and
the perturbation goes to zero as L → ∞. Similar ODEs have been studied in [AABR02]; they have been
shown to accelerate gradient descent in continuous time in [APR16].
In the ODE (H-ODE), the coefficient of ẋ is damped by the Hessian and thus depends on x. In addition, the
coefficient of ∇f(x) is perturbed by 1/(√L t), which goes to zero asymptotically. This equation corresponds to a
first order perturbation, O(h) with h = 1/√L, of (A-ODE). Recently, in [SDJS18], Shi, Du, Jordan and Su introduced
a family of second order differential equations called high-resolution differential equations. These equations are
derived from Nesterov’s method using terms of order O(1) and O(h), instead of only the terms of order O(1) used to
derive (A-ODE). In this context, (H-ODE) corresponds to the high-resolution equation with parameter
1/√L.
However, we demonstrate below that the first order system (1st-ODE) is more amenable to analysis,
allowing for short, clean proofs which generalize to the perturbed gradient case. The system (1st-ODE)
can be discretized to recover Nesterov’s method using an explicit discretization with time step h = 1/√L,
Proposition 2.2. By a Lyapunov analysis, we recover the usual optimal rates, in both the continuous and discrete
cases, and, in addition, we obtain an extra gap in the dissipation of the Lyapunov function, Proposition 5.2,
which gives us an estimate on the decrease of |∇f(x)|², Corollary 5.5.
Nesterov’s method for a convex, L-smooth function f can be written as [Nes13, Section 2.2]

(C-Nest)    xk+1 = yk − (1/L)∇f(yk),    yk+1 = xk+1 + (k/(k+3))(xk+1 − xk).
Definition 2.1. Let h > 0 be a given small time step/learning rate and let tk = h(k + 2). The discretization
of (1st-ODE) corresponds to an explicit time discretization with gradients evaluated at yk, the convex
combination of xk and vk defined below:

(FE-C)    xk+1 − xk = (2h/tk)(vk − xk) − (h/√L)∇f(yk),
          vk+1 − vk = −(h tk/2)∇f(yk),
          yk = (1 − 2/(k+2))xk + (2/(k+2))vk.

Then the following result holds.

Proposition 2.2. The discretization of (1st-ODE) given by (FE-C) with h = 1/√L is equivalent to the
standard Nesterov method (C-Nest).

Proof. The system (FE-C) with h = 1/√L and tk = h(k + 2) becomes

    xk+1 − xk = (2/(k+2))(vk − xk) − (1/L)∇f(yk),
    vk+1 − vk = −((k+2)/(2L))∇f(yk).

Eliminate the variable vk using the definition of yk in (FE-C) to obtain (C-Nest). □
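Proposition 2.2 can be confirmed numerically: on a quadratic f(x) = ½ xᵀAx (an illustrative choice of ours), the iterates of (FE-C) with h = 1/√L and of (C-Nest), both started from y0 = v0 = x0, agree to machine precision.

```python
import numpy as np

# Numerical check of Proposition 2.2 on f(x) = 0.5 x^T A x: the forward
# Euler scheme (FE-C) with h = 1/sqrt(L) and Nesterov's method (C-Nest)
# produce the same iterates x_k. The quadratic is an illustrative choice.

A = np.diag([1.0, 4.0, 9.0])
L = 9.0
grad = lambda z: A @ z
h = 1.0 / np.sqrt(L)

x0 = np.array([1.0, -2.0, 0.5])

# (FE-C): iterate (x_k, v_k), with y_k a convex combination of them.
x, v = x0.copy(), x0.copy()
xs_fe = [x.copy()]
for k in range(50):
    tk = h * (k + 2)
    y = (1 - 2.0 / (k + 2)) * x + (2.0 / (k + 2)) * v
    x_next = x + (2 * h / tk) * (v - x) - (h / np.sqrt(L)) * grad(y)
    v = v - (h * tk / 2) * grad(y)
    x = x_next
    xs_fe.append(x.copy())

# (C-Nest): standard Nesterov iteration, started from y_0 = x_0.
x, y = x0.copy(), x0.copy()
xs_nest = [x.copy()]
for k in range(50):
    x_next = y - grad(y) / L
    y = x_next + (k / (k + 3)) * (x_next - x)
    x = x_next
    xs_nest.append(x.copy())

gap = max(np.linalg.norm(a - b) for a, b in zip(xs_fe, xs_nest))
print(gap)  # close to machine precision
```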
2.2. Strongly convex case. In the case of a µ-strongly convex function, we are interested in another second
order differential equation with a Hessian damping term. For a µ-strongly convex function f, consider the first
order system

(1st-ODE-SC)    ẋ = √µ(v − x) − (1/√L)∇f(x),    v̇ = √µ(x − v) − (1/√µ)∇f(x),

which is equivalent, for smooth f, to the second order equation with Hessian damping

(H-ODE-SC)    ẍ + 2√µ ẋ + ∇f(x) = −(1/√L)(D²f(x)·ẋ + √µ ∇f(x)).
Equivalence between (1st-ODE-SC) and (H-ODE-SC). Solve for v in the first line of (1st-ODE-SC),

v = (1/√µ)(ẋ + (1/√L)∇f(x)) + x,

and differentiate to obtain

v̇ = (1/√µ)(ẍ + (1/√L)D²f(x)·ẋ) + ẋ.

Insert into the second line of (1st-ODE-SC),

(1/√µ)(ẍ + (1/√L)D²f(x)·ẋ) + ẋ = −ẋ − (1/√L + 1/√µ)∇f(x),

and simplify to obtain (H-ODE-SC). □
The equation (H-ODE-SC) can be seen as a combination of Polyak’s ODE

(A-ODE-SC)    ẍ + 2√µ ẋ + ∇f(x) = 0,

which is an accelerated gradient method when f is quadratic, see [SRBd17], and of the ODE for Newton’s method.
Similarly to the convex case, notice that (H-ODE-SC) can be seen as the high-resolution equation from
[SDJS18] with the largest parameter value, 1/√L. Using a Lyapunov analysis, we will show in Section 6 that
the same Lyapunov function as for (A-ODE-SC) decreases faster along (1st-ODE-SC), which allows an acceleration
in the decrease of the gradient. The asymptotic exponential rates are recovered in the continuous and discrete
settings, Proposition 6.2. In addition, rewriting (H-ODE-SC) as a first order system permits deriving
Nesterov’s method (SC-Nest) using an explicit discretization with time step h = 1/√L, Proposition 2.5, and
extending the Lyapunov analysis to the perturbed case.
Nesterov’s method in the strongly convex case can be written as follows:

(SC-Nest)    xk+1 = yk − (1/L)∇f(yk),
             yk+1 = xk+1 + ((1 − √(Cf^(−1)))/(1 + √(Cf^(−1))))(xk+1 − xk).

Definition 2.3. Let h > 0 be a small time step, and take an explicit Euler method for (1st-ODE-SC),
with the gradient evaluated at yk, defined below, and with h√µ replaced by λh = h√µ/(1 + h√µ):

(FE-SC)    xk+1 − xk = λh(vk − xk) − (h/√L)∇f(yk),
           vk+1 − vk = λh(xk − vk) − (h/√µ)∇f(yk),
           yk = (1 − λh)xk + λh vk.

Remark 2.4. As in the convex case, to obtain Nesterov’s method we need to evaluate the gradient at yk,
which is a perturbation of xk. In addition, in the strongly convex case, we perturb h√µ, replacing it by λh.

Proposition 2.5. The discretization of (1st-ODE-SC) given by (FE-SC) with h = 1/√L is equivalent to
the standard Nesterov method (SC-Nest).

Proof. (FE-SC) with h = 1/√L becomes

    xk+1 − xk = (√(Cf^(−1))/(1 + √(Cf^(−1))))(vk − xk) − (1/L)∇f(yk),
    vk+1 − vk = (√(Cf^(−1))/(1 + √(Cf^(−1))))(xk − vk) − (1/√(Lµ))∇f(yk).

Eliminate the variable vk using the definition of yk to obtain (SC-Nest). □
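As with Proposition 2.2, this equivalence is easy to confirm numerically on a strongly convex quadratic (an illustrative choice of ours), starting both schemes from y0 = v0 = x0.

```python
import numpy as np

# Numerical check of Proposition 2.5 on f(x) = 0.5 x^T A x: the scheme
# (FE-SC) with h = 1/sqrt(L) reproduces Nesterov's strongly convex
# method (SC-Nest). The quadratic is an illustrative choice.

A = np.diag([2.0, 5.0, 8.0])
mu, L = 2.0, 8.0
grad = lambda z: A @ z

h = 1.0 / np.sqrt(L)
lam = h * np.sqrt(mu) / (1 + h * np.sqrt(mu))   # lambda_h, Definition 2.3
q = np.sqrt(mu / L)                             # sqrt(Cf^{-1})

x0 = np.array([1.0, -1.0, 2.0])

# (FE-SC)
x, v = x0.copy(), x0.copy()
xs_fe = [x.copy()]
for _ in range(60):
    y = (1 - lam) * x + lam * v
    x_next = x + lam * (v - x) - (h / np.sqrt(L)) * grad(y)
    v = v + lam * (x - v) - (h / np.sqrt(mu)) * grad(y)
    x = x_next
    xs_fe.append(x.copy())

# (SC-Nest), started from y_0 = x_0
x, y = x0.copy(), x0.copy()
xs_nest = [x.copy()]
beta = (1 - q) / (1 + q)
for _ in range(60):
    x_next = y - grad(y) / L
    y = x_next + beta * (x_next - x)
    x = x_next
    xs_nest.append(x.copy())

gap = max(np.linalg.norm(a - b) for a, b in zip(xs_fe, xs_nest))
print(gap)  # close to machine precision
```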

3. Abstract Lyapunov Analysis: going from continuous to discrete time


The advantage of a continuous time Lyapunov analysis for obtaining convergence rates is that there is no time
step, so there are fewer terms to control. By presenting our ODEs as first order systems, we use explicit
gradient calculations and avoid the tedious substitution of terms which results from using
the second order ODE. In addition, the Lyapunov functions are cleaner, since they involve only variables
without time derivatives. The first order system approach becomes even more important in the stochastic
case, where derivatives of error terms might otherwise appear. Finally, the first order system formulation
allows the analysis to go through for nonsmooth objectives, although we do not pursue that here.
In this section we present theorems showing how to go from continuous time Lyapunov functions to
discrete time, in both the full gradient and perturbed cases in the abstract setting.
We begin by formulating the continuous to discrete problem in an abstract setting. Then, we show how
this framework can be extended to study perturbed gradient descent (i.e. when an error is made in the
evaluation of the gradient). We also discuss how the framework can be adapted to deal
with accelerated gradient methods. To conclude this section, we present an abstract convergence rate for
variable learning rates.
In the first subsection, we define the class of ODEs we consider and the associated Forward Euler methods.
We define a generic Lyapunov function, which may depend on time, and provide conditions for the Lyapunov
function to give a rate in continuous time, and then show how this rate extends to the forward Euler method,
provided a restriction on the time step is satisfied.
In the subsequent subsection, we show how the same analysis can be extended to the case of perturbed
gradients, still in the abstract Lyapunov function setting. Our analysis shows that we can recover
the rates corresponding to the full gradient case, provided that the error in the gradient decreases fast
enough. Although this is an unusual way to present the rates, presenting the results in this way gives a unified
approach to the perturbed and full gradient cases. Finally, we present an abstract
convergence rate in the case where the learning rate is no longer constant and the error has zero mean and
fixed variance.
3.1. ODEs, Perturbed ODEs and discretizations. Consider an abstract ordinary differential equation,
generated by the velocity field g as follows.
Definition 3.1. Let g(t, z, p) be Lg -Lipschitz continuous, and affine in the variable p,

g(t, z, p) = g1 (t, z) + g2 (t, z)p.


Consider the ODE

(ODE)    ż(t) = g(t, z(t), ∇f(z(t))).

Let e(t) be a perturbation of the gradient ∇f(z(t)) as in (1), and consider the perturbed ODE

(PODE)    ż(t) = g(t, z(t), ∇f(z(t)) + e(t)).

Since g is Lipschitz continuous, (ODE) has unique solutions for all time for every initial condition z(0) = z0 ∈ R^n.
Moreover, if we assume that e(t) is Lipschitz continuous in time, then (PODE) has unique solutions for all time
for every initial condition z(0) = z0 ∈ R^n. On the other hand, if we wish to consider a model of e(t) which is
more consistent with random, mean zero errors, then (PODE) is no longer well-posed as an ODE. However, we
could consider a Stochastic Differential Equation (SDE) [Oks13, Pav16], which would lead to results similar to
the discrete case where we take expectations of the mean zero error term. We do not pursue the SDE approach
here, to simplify the exposition.
Definition 3.2. For a given time step (learning rate) h ≥ 0, the forward Euler discretization of (ODE)
corresponds to the sequence
(FE) zk+1 = zk + hg(tk , zk , ∇f (zk )), tk = hk
given an initial value z0 . Similarly, the forward Euler discretization of (PODE) is given by
(FEP) zk+1 = zk + hg(tk , zk , ∇f (zk ) + ek ) tk = hk
The solution of (FE) or of (FEP) can be interpolated to a function of time z^h : [0, T) → R^n by simply
setting z^h(tk) = zk, with piecewise constant or piecewise linear interpolation between time steps. It
is a standard result from the numerical analysis of ODEs [Ise09] that the functions z^h converge to z(t) with
error of order h, provided h ≤ 1/Lg.
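The order-h convergence of the forward Euler interpolation can be observed on a toy problem; the scalar ODE below, ż = −z with z(0) = 1 and exact solution exp(−t), is our illustrative choice of (ODE) with g(t, z, p) = −z.

```python
import math

# Forward Euler (FE) on z' = -z, z(0) = 1, integrated to time T = 1.
# The exact solution is exp(-T); the global error should be O(h).

def euler_error(n, T=1.0):
    h = T / n
    z = 1.0
    for _ in range(n):
        z = z + h * (-z)        # forward Euler step for g(t, z, p) = -z
    return abs(z - math.exp(-T))

e1 = euler_error(100)           # h = 0.01
e2 = euler_error(200)           # h = 0.005
print(e1 / e2)                  # close to 2: halving h halves the error
```

Halving the step size roughly halves the error, consistent with first order convergence.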
3.2. Lyapunov analysis for the unperturbed ODE. First, we give the definition of a rate-generating
Lyapunov function for (ODE).
Definition 3.3. We say E(t, z) is a rate-generating Lyapunov function for (ODE) if, for all t > 0, E(t, z ∗ ) =
0 and ∇E(t, z ∗ ) = 0 where z ∗ is a stationary solution of (ODE), i.e. g(t, z ∗ , ∇f (z ∗ )) = 0, and if there are
constants rE , aE ≥ 0 such that
(5) ∂t E(t, z) + ∇E(t, z) · g(t, z, ∇f (z)) ≤ −rE E(t, z) − aE |g(t, z, ∇f (z))|2
Remark 3.4. Definition 3.3 can be extended to allow a nonnegative time-dependent gap aE = aE(t). This
appears especially in the convex case, and the analysis below does not change.
Then, we can deduce the following rate in the continuous case.

Lemma 3.5. Let E be a rate-generating Lyapunov function for (ODE). Then

E(t, z(t)) ≤ E(0, z(0)) exp(−rE t).

Proof. By (5),

(d/dt)E(t, z(t)) = ∂tE(t, z(t)) + ∇E(t, z(t)) · g(t, z(t), ∇f(z(t))) ≤ −rE E(t, z(t)) − aE|g(t, z(t), ∇f(z(t)))|² ≤ −rE E(t, z(t)),

and Gronwall’s inequality gives the result. □
In the discrete setting, we obtain
Lemma 3.6. Let zk be the solution of the forward Euler method (FE) for (ODE). Let E be a rate-generating
Lyapunov function for (ODE), so that (5) holds. Suppose in addition that there exists LE > 0 such that E
satisfies

(6)    E(tk+1, zk+1) − E(tk, zk) ≤ ∂tE(tk, zk)(tk+1 − tk) + ⟨∇E(tk, zk), zk+1 − zk⟩ + (LE/2)|zk+1 − zk|².

Then

E(tk+1, zk+1) ≤ (1 − h rE)E(tk, zk),

provided

(CFL)    h ≤ 2aE/LE.

In particular, choosing equality in (CFL), we have

(7)    E(tk, zk) ≤ (1 − 2aE rE/LE)^k E(0, z0).
Remark 3.7. Note that condition (6) is a generalization of L-smoothness in space, and is automatically
satisfied, by LE-smoothness, when the Lyapunov function does not depend on time. Below, in Lemma 4.1,
we will see that this assumption is satisfied in the gradient descent case for convex and strongly convex
functions.
Proof. Estimate E(tk+1, zk+1) − E(tk, zk) using (6), then apply (5) with zk+1 − zk = h g(tk, zk, ∇f(zk)):

E(tk+1, zk+1) − E(tk, zk) ≤ ∂tE(tk, zk)(tk+1 − tk) + ⟨∇E(tk, zk), zk+1 − zk⟩ + (LE/2)|zk+1 − zk|²
≤ −h rE E(tk, zk) − h aE|g(tk, zk, ∇f(zk))|² + (h²LE/2)|g(tk, zk, ∇f(zk))|²
= −h rE E(tk, zk) − (h aE − h²LE/2)|g(tk, zk, ∇f(zk))|².

Applying (CFL) gives

E(tk+1, zk+1) − E(tk, zk) ≤ −h rE E(tk, zk),

which gives (7) by induction. □
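Lemma 3.6 can be illustrated for gradient descent, zk+1 = zk − h∇f(zk), on a strongly convex quadratic. A short computation (our sketch; the paper treats the gradient descent case in Lemma 4.1 and Section 4) suggests the constants rE = µ, aE = 1/2 and LE = L for E(z) = f(z) − f*, so that (CFL) reads h ≤ 1/L and the lemma predicts the per-step contraction E(zk+1) ≤ (1 − hµ)E(zk).

```python
import numpy as np

# Check of Lemma 3.6 for gradient descent on the strongly convex
# quadratic f(z) = 0.5 z^T A z (f* = 0), with E(z) = f(z) - f*.
# Assumed constants (our sketch): rE = mu, aE = 1/2, LE = L,
# so (CFL) gives h <= 1/L.

A = np.diag([1.0, 3.0, 6.0])
mu, L = 1.0, 6.0
h = 1.0 / L                      # equality in (CFL)

z = np.array([2.0, -1.0, 1.5])
E = 0.5 * z @ A @ z
for _ in range(100):
    z = z - h * (A @ z)          # forward Euler for z' = -grad f(z)
    E_next = 0.5 * z @ A @ z
    assert E_next <= (1 - h * mu) * E + 1e-12   # per-step contraction
    E = E_next
print("per-step contraction verified; final E =", E)
```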
3.3. Lyapunov analysis for the perturbed ODE. First, we compute the dissipation of a rate-generating
Lyapunov function in the unperturbed case along (PODE).
Lemma 3.8. Let z be a solution of (PODE) and suppose E(t, z) is a rate-generating Lyapunov function for
(ODE), so that (5) holds. Then

(d/dt)E(t, z(t)) ≤ −rE E(t, z(t)) − aE|g(t, z(t), ∇f(z(t)))|² + ⟨∇E(t, z(t)), g2(t, z(t))e(t)⟩.

Proof. Since z is a solution of (PODE), g is affine in its last variable, and E satisfies (5),

(d/dt)E(t, z(t)) = ∂tE(t, z(t)) + ⟨∇E(t, z(t)), g(t, z(t), ∇f(z(t)) + e(t))⟩
= ∂tE(t, z(t)) + ⟨∇E(t, z(t)), g(t, z(t), ∇f(z(t)))⟩ + ⟨∇E(t, z(t)), g2(t, z(t))e(t)⟩
≤ −rE E(t, z(t)) − aE|g(t, z(t), ∇f(z(t)))|² + ⟨∇E(t, z(t)), g2(t, z(t))e(t)⟩. □

Observe that when we go from the unperturbed ODE (ODE) to the perturbed ODE (PODE), the additional
term ⟨∇E(t, z), g2(t, z(t))e(t)⟩ appears in the time derivative of E along (PODE). In this section we
show how to add a term to the original Lyapunov function so as to obtain a Lyapunov function for the perturbed
case.
In order to compensate for the additional term coming from the error e(t), we are led to define the
perturbed Lyapunov function Ẽ(t, z) = E(t, z) + I(t, z(·)), where I(t, z(·)) satisfies

I′(t, z(·)) = −⟨∇E(t, z(t)), g2(t, z(t))e(t)⟩ − rE I(t, z(·)),    I(0, z(·)) = 0.

Note that, unlike E, I depends on the history of z and on e; we emphasize this with the notation. This
linear time-dependent ODE is easily solved by standard methods; the solution is given by

I(t, z(·)) = −exp(−rE t) ∫_0^t exp(rE s)⟨∇E(s, z(s)), g2(s, z(s))e(s)⟩ ds.

Definition 3.9. Write

J(t, z(·)) = ∫_0^t exp(rE s)⟨∇E(s, z(s)), g2(s, z(s))e(s)⟩ ds,

so that J(t, z(·)) = −exp(rE t)I(t, z(·)). Define the perturbed Lyapunov function

(8)    Ẽ(t, z(·)) = E(t, z(t)) − exp(−rE t)J(t, z(·)).
Proposition 3.10. Let z(t) be a solution of the perturbed ODE (PODE) and let E be a rate-generating
Lyapunov function for the unperturbed ODE (ODE).
Then
\[
(9) \qquad E(t, z(t)) \le \exp(-r_E t)\big(E(0, z(0)) + J(t, z(\cdot))\big).
\]
Proof. We first establish
\[
(10) \qquad \tilde E(t, z(t)) \le \tilde E(0, z(0))\exp(-r_E t).
\]
By assumption (3.3) and the calculation at the beginning of this section,
\[
\frac{d}{dt}\tilde E(t, z(t)) = \partial_t E(t, z) + \langle \nabla E(t, z), g(t, z, \nabla f(z(t)))\rangle + \langle \nabla E(t, z), g_2(t, z(t))e(t)\rangle + I'(t, z(\cdot))
\le -r_E E(t, z(t)) - a_E|g(t, z(t), \nabla f(z(t)))|^2 + \langle \nabla E(t, z), g_2(t, z(t))e(t)\rangle + I'(t, z(\cdot))
= -r_E E(t, z(t)) - a_E|g(t, z(t), \nabla f(z(t)))|^2 - r_E I(t, z(\cdot))
\le -r_E \tilde E(t, z(t)) - a_E|g(t, z(t), \nabla f(z(t)))|^2.
\]
Gronwall's inequality completes the proof of (10). From (10), using the definition of Ẽ and the fact that Ẽ(0, z(0)) = E(0, z(0)), we have
\[
\tilde E(t, z(\cdot)) = E(t, z(t)) - \exp(-r_E t)J(t, z(\cdot)) \le \exp(-r_E t)E(0, z(0)),
\]
which gives the second result. 

Corollary 3.11. Under the assumptions of the previous proposition, for all t > 0, |∇E(t, z(t))| satisfies
\[
\sup_{0\le s\le t}|\nabla E(s, z(s))| \le M(t, z(\cdot)),
\]
where
\[
M(t, z(\cdot)) := 1 + 2L_E E(0, z(0))\exp(-r_E t)
+ \exp(-r_E t)\,4L_E^2 E(0, z(0))\int_0^t |g_2(s, z(s))e(s)|\exp\Big(2L_E\int_s^t |g_2(u, z(u))e(u)|\,du\Big)\,ds
+ 2L_E\exp(-r_E t)\int_0^t \exp(r_E s)|g_2(s, z(s))e(s)|\exp\Big(2L_E\int_s^t |g_2(u, z(u))e(u)|\,du\Big)\,ds,
\]
and, if e satisfies
\[
(11) \qquad \int_0^{+\infty}\exp(r_E s)|g_2(s, z(s))e(s)|\,ds < +\infty,
\]
then M(t, z(·)) is bounded in L^∞(ℝ_+) and
\[
E(t, z(t)) \le \exp(-r_E t)\Big(E(0, z(0)) + M(t, z(\cdot))\int_0^t \exp(r_E s)|g_2(s, z(s))e(s)|\,ds\Big) = O(\exp(-r_E t)).
\]

Proof. From Proposition 3.10, we have
\[
\exp(r_E t)E(t, z(t)) \le E(0, z(0)) + J(t, z(\cdot)).
\]
Since, for all t, by (1.2),
\[
E(t, z(t)) \ge \frac{1}{2L_E}|\nabla E(t, z(t))|^2 \ge \frac{1}{2L_E}\big(|\nabla E(t, z(t))| - 1\big),
\]
and
\[
J(t, z(\cdot)) \le \int_0^t \exp(r_E s)|\nabla E(s, z(s))|\,|g_2(s, z(s))e(s)|\,ds,
\]
we obtain
\[
\exp(r_E t)|\nabla E(t, z(t))| \le \exp(r_E t) + 2L_E E(0, z(0)) + 2L_E\int_0^t \exp(r_E s)|\nabla E(s, z(s))|\,|g_2(s, z(s))e(s)|\,ds.
\]
Applying Gronwall's Lemma to t ↦ exp(r_E t)|∇E(t, z(t))| gives the first part of the claim. We conclude by noticing that every term on the right-hand side is bounded as t → +∞ under assumption (11). □

3.4. Lyapunov analysis for the perturbed algorithm. Now we consider the discrete case. As in the
continuous case, the strategy is to see which extra terms arise in the dissipation of the original Lyapunov
function along the perturbed equation, and then build an additional term into the perturbed Lyapunov
function to cancel them out. The following lemma computes the excess term.

Lemma 3.12. Let zk be the solution of (FEP). Suppose E is a rate-generating Lyapunov function for (ODE)
which satisfies (3.6). Then
 
\[
(12) \qquad E(t_{k+1}, z_{k+1}) - E(t_k, z_k) \le -h r_E E(t_k, z_k) - h\Big(a_E - \frac{L_E h}{2}\Big)|g(t_k, z_k, \nabla f(z_k))|^2 + h\beta_k,
\]
where β_k is defined by
\[
(13) \qquad \beta_k := \langle \nabla E(t_k, z_k), g_2(t_k, z_k)e_k\rangle + L_E h\Big\langle g(t_k, z_k, \nabla f(z_k)) + \frac{1}{2}g_2(t_k, z_k)e_k,\; g_2(t_k, z_k)e_k\Big\rangle.
\]

Remark 3.13. Note that (12) is a perturbation of the analogous result in the continuous case: the first term in β_k is a discretization of the corresponding continuous term, while the remaining terms are perturbations of order h.

Remark 3.14. Note also that, under the standard assumptions on the error, E[e_i] = 0 and Var(e_i) = σ², along with independence,
\[
\mathbb{E}[\beta_k] = \frac{L_E\, g_2^2\, h\,\sigma^2}{2}.
\]
Proof.
\[
E(t_{k+1}, z_{k+1}) - E(t_k, z_k) \le \partial_t E(t_k, z_k)(t_{k+1} - t_k) + \langle \nabla E(t_k, z_k), z_{k+1} - z_k\rangle + \frac{L_E}{2}|z_{k+1} - z_k|^2 \quad\text{by (3.6)}
\]
\[
\le h\big(\partial_t E(t_k, z_k) + \langle \nabla E(t_k, z_k), g(t_k, z_k, \nabla f(z_k))\rangle\big) + h\langle \nabla E(t_k, z_k), g_2(t_k, z_k)e_k\rangle + \frac{L_E h^2}{2}|g(t_k, z_k, \nabla f(z_k) + e_k)|^2 \quad\text{by (FEP)}
\]
\[
\le -h r_E E(t_k, z_k) - h a_E|g(t_k, z_k, \nabla f(z_k))|^2 + h\langle \nabla E(t_k, z_k), g_2(t_k, z_k)e_k\rangle + \frac{L_E h^2}{2}|g(t_k, z_k, \nabla f(z_k) + e_k)|^2 \quad\text{by (3.3)}
\]
\[
\le -h r_E E(t_k, z_k) - h\Big(a_E - \frac{L_E h}{2}\Big)|g(t_k, z_k, \nabla f(z_k))|^2 + h\Big\langle\frac{L_E h}{2}g_2(t_k, z_k)e_k + L_E h\,g(t_k, z_k, \nabla f(z_k)) + \nabla E(t_k, z_k),\; g_2(t_k, z_k)e_k\Big\rangle,
\]
which concludes the proof. □
Following the argument in the continuous case, and using the previous calculation, we see that the problem is to define I_k such that I_0 = 0 and
\[
(14) \qquad I_{k+1} - I_k = -h r_E I_k - h\beta_k.
\]
Lemma 3.15. The solution of (14) is given by
\[
(15) \qquad I_k = -(1 - h r_E)^k\, h\sum_{i=0}^{k-1}(1 - h r_E)^{-i-1}\beta_i.
\]

Proof. Applying the definition (15), we obtain
\[
I_{k+1} - I_k = -(1 - h r_E)^k h\Big[(1 - h r_E)\sum_{i=0}^{k}(1 - h r_E)^{-i-1}\beta_i - \sum_{i=0}^{k-1}(1 - h r_E)^{-i-1}\beta_i\Big]
= -(1 - h r_E)^k h\big[(1 - h r_E)^{-k-1}\beta_k - h r_E(1 - h r_E)^{-k-1}\beta_k\big] - h r_E I_k
= -h r_E I_k - h\beta_k,
\]
as required. □
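The recursion (14) and its closed-form solution (15) are easy to check numerically. The following sketch (with arbitrary illustrative values for h, r_E and the sequence β_i, not values from the paper) verifies that the closed form satisfies the recursion:

```python
import random

def I_closed_form(k, h, rE, beta):
    # Closed form (15): I_k = -(1 - h rE)^k * h * sum_{i<k} (1 - h rE)^{-i-1} beta_i
    return -(1 - h * rE) ** k * h * sum(
        (1 - h * rE) ** (-i - 1) * beta[i] for i in range(k)
    )

random.seed(0)
h, rE = 0.1, 0.5
beta = [random.gauss(0, 1) for _ in range(50)]

for k in range(49):
    Ik = I_closed_form(k, h, rE, beta)
    Ik1 = I_closed_form(k + 1, h, rE, beta)
    # Recursion (14): I_{k+1} - I_k = -h rE I_k - h beta_k
    assert abs((Ik1 - Ik) - (-h * rE * Ik - h * beta[k])) < 1e-9
```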
The arguments above lead us to the following definition.
Definition 3.16. Define the perturbed Lyapunov function Ẽ_k by
\[
\tilde E_k := E(t_k, z_k) - (1 - h r_E)^k J_k = E(t_k, z_k) + I_k,
\]
where
\[
J_k = h\sum_{i=0}^{k-1}(1 - h r_E)^{-i-1}\beta_i,
\]
and β_i is given by (13).


Lemma 3.17. Let zk be the solution of the forward Euler method (FEP) for (PODE) with time step h > 0.
Let Ẽ be the perturbed Lyapunov function defined in Definition 3.16. Suppose E is a rate-generating Lyapunov
function for (ODE) which satisfies (3.6). Choose
\[
(\mathrm{CFL}) \qquad 0 < h \le \frac{2a_E}{L_E}.
\]
Then
\[
(16) \qquad \tilde E(t_{k+1}, z_{k+1}) \le (1 - h r_E)\tilde E(t_k, z_k).
\]
Proof. From Lemma 3.12, (12) combined with (CFL) gives
\[
E(t_{k+1}, z_{k+1}) - E(t_k, z_k) \le -h r_E E(t_k, z_k) + h\beta_k.
\]
Also, from Lemma 3.15, (14) holds for I_k. Combining these two estimates and using the definition of Ẽ_k, we obtain
\[
\tilde E(t_{k+1}, z_{k+1}) - \tilde E(t_k, z_k) \le -h r_E \tilde E(t_k, z_k),
\]
which gives (16). □
Proposition 3.18. Under the assumptions of Lemma 3.17, if h satisfies (CFL), then
\[
(17) \qquad E(t_k, z_k) \le (1 - h r_E)^k\big(E(t_0, z_0) + J_k\big).
\]
Note that (17) parallels the corresponding continuous-time result (9).
Proof. First, using (16) and induction, we have
\[
E(t_k, z_k) \le (1 - h r_E)^k E(t_0, z_0) - I_k = (1 - h r_E)^k E(t_0, z_0) + (1 - h r_E)^k J_k = (1 - h r_E)^k\big(E(t_0, z_0) + J_k\big). \qquad\square
\]
Corollary 3.19. Under the assumptions of Lemma 3.17, assume in addition that E satisfies
\[
(18) \qquad E(t_k, z_k) \ge C_1|\tilde\beta_k| - C_2, \qquad \tilde\beta_k = \nabla E(t_k, z_k) + L_E h\, g(t_k, z_k, \nabla f(z_k)),
\]
for some constants C_1 > 0 and C_2 ≥ 0. Then, for all k > 0, |β̃_k| satisfies
\[
\sup_{0\le i\le k}|\tilde\beta_i| \le M_k,
\]
where M_k depends on
\[
(19) \qquad \sum_{i=0}^{k-1}(1 - h r_E)^{-i}|g_2(t_i, z_i)e_i|,
\]

and, in particular, if (19) remains finite as k → +∞, then M_k is bounded in ℓ^∞ and, for all k ≥ 0,
\[
E(t_k, z_k) \le (1 - h r_E)^k\Big(E(t_0, z_0) + M_k\sum_{i=0}^{k-1}(1 - h r_E)^{-i-1}|g_2(t_i, z_i)e_i|\Big) = O\big((1 - h r_E)^k\big).
\]

Proof. The proof is similar to the continuous case, Corollary 3.11, using condition (18) to apply Gronwall's Lemma to (1 − h r_E)^{−i}|β̃_i|. □
Remark 3.20. Note that assumption (18) is not satisfied in general in the accelerated setting, as we will see below. Indeed, in Nesterov's method, the gradient of f is evaluated at a different point than the Lyapunov function.
3.5. Discussion of accelerated methods. Contrary to the gradient descent case, the situation is not
quite so simple in the accelerated case. First note that there is a gap between the discrete and continuous
setting: more than one ODE can be consistent with a discrete algorithm. This means, in principle, that
there may be whole parameterized families of ODEs which satisfy the condition of Definition 3.21 and which can be discretized to obtain the algorithm. In the accelerated case, we need to use such a parameterized family. It may also be too restrictive to require that the forward Euler discretization of the ODE dissipate the same Lyapunov function. However, we can consider a parameterized ODE system, and we may need to consider a
parameterized Lyapunov function.

Consider the parameterized velocity field g(t, z, ∇f(y^ε); ε), where z = (x, v), y^ε = x + θ_ε(v − x), with θ_ε ∈ (0, 1) for ε > 0, θ_0 = 0, and
\[
g(t, z, \nabla f(y^\varepsilon); \varepsilon) = g_1(t, z; \varepsilon) + g_2(t, z; \varepsilon)\nabla f(y^\varepsilon),
\]
which is Lg -Lipschitz continuous, uniformly in ǫ. In continuous time, set ǫ = 0, and consider


(ODE-h) ż(t) = g(t, z(t), ∇f (x(t)); 0)
However, for a given learning rate (time step) h > 0, we allow a perturbation which depends on h, i.e. ε = h, and instead consider the algorithm, with z_k = (x_k, v_k),
\[
(\text{FE-h}) \qquad z_{k+1} = z_k + h\,g(t_k, z_k, \nabla f(y_k^h); h), \qquad y_k^h = x_k + \theta_h(v_k - x_k),
\]
given an initial value z_0.
Definition 3.21. The (continuous) function E(t, z; h) is a rate-generating Lyapunov function for (ODE-h) if there exist r_E, a_E ≥ 0 such that
\[
(20) \qquad \frac{d}{dt}E(t, z(t); 0) \le -r_E E(t, z(t); 0) - a_E|\nabla f(x(t))|^2
\]
for every solution z(t) = (x(t), v(t)) : [0, ∞) → ℝ^n × ℝ^n of (ODE-h). If, in addition, there is L_E > 0 so that
\[
E(t_{k+1}, z_{k+1}; h) \le (1 - r_E h)E(t_k, z_k; h) + (L_E h - a_E)\,h|\nabla f(y_k^h)|^2
\]
for all solutions z_k of (FE-h) with 0 < h ≤ a_E/L_E, then we call E a rate-generating Lyapunov function for the sequence (FE-h).
In Sections 5 and 6, we will adapt the previous analysis to the accelerated gradient method in the convex
and strongly convex case. In particular, we will extend the accelerated gradient method to the perturbed
gradient case. The main difficulty resides in the fact that, in the discrete case, the gradient and the Lyapunov
function are not evaluated at the same point anymore and we will use different methods to overcome this
issue.
3.6. Variable time step and convergence in expectation. Now we consider the case where the error
ek satisfies
(21) E[ek ] = 0 and Var(ek ) = E[|ek |2 ] = σ 2 > 0,
and where the time step/learning rate h is no longer constant, i.e. h = h_k. We require that each h_k satisfies (CFL); in practice this amounts to a warm-up phase: take h small enough and run a few steps until E_0 is small enough. Then, (12) becomes
(22) E(tk+1 , zk+1 ) − E(tk , zk ) ≤ −hk rE E(tk , zk ) + hk βk ,
where now β_k also depends on h_k:
\[
\beta_k := \langle \nabla E(t_k, z_k), g_2(t_k, z_k)e_k\rangle + h_k\Big\langle\frac{L_E}{2}g_2(t_k, z_k)e_k + L_E\,g(t_k, z_k, \nabla f(z_k)),\; g_2(t_k, z_k)e_k\Big\rangle,
\]
so
\[
\mathbb{E}[\beta_k] = \frac{h_k L_E\, g_2(t_k, z_k)^2\sigma^2}{2}.
\]
Taking the expectation in (22), we obtain
\[
(23) \qquad \mathbb{E}[E(t_{k+1}, z_{k+1})] \le (1 - h_k r_E)E(t_k, z_k) + \frac{h_k^2 L_E\, g_2(t_k, z_k)^2\sigma^2}{2}.
\]
Then, we deduce the following result.
Proposition 3.22 (Case r_E > 0). Assume that r_E > 0 and \bar g_2 = \max_{(t,z)} g_2(t, z) < +\infty. If
\[
h_k := \frac{2}{r_E(k + \alpha^{-1}E_0^{-1})}, \qquad\text{where}\qquad \alpha = \frac{r_E^2}{2L_E\bar g_2^2\sigma^2},
\]
then
\[
\mathbb{E}[E(t_k, z_k)] \le \frac{1}{\alpha(k + \alpha^{-1}E_0^{-1})}.
\]
Note that the assumptions that \bar g_2 is bounded and r_E > 0 apply to strongly convex gradient descent as well as strongly convex accelerated gradient descent.
Proof. The proof of Proposition 3.22 is by induction and is an adaptation of the one in [OP19]. The initialization k = 0 is trivial, and for all k ≥ 1, from (23), we have
\[
\mathbb{E}[E(t_{k+1}, z_{k+1})] \le (1 - h_k r_E)E(t_k, z_k) + \frac{h_k^2 L_E\bar g_2^2\sigma^2}{2},
\]
and, by definition of h_k and α, using the induction hypothesis,
\[
\mathbb{E}[E(t_{k+1}, z_{k+1})] \le \Big(1 - \frac{2}{k + \alpha^{-1}E_0^{-1}}\Big)\frac{1}{\alpha(k + \alpha^{-1}E_0^{-1})} + \frac{1}{\alpha(k + \alpha^{-1}E_0^{-1})^2}
\le \frac{1}{\alpha(k + \alpha^{-1}E_0^{-1})} - \frac{1}{\alpha(k + \alpha^{-1}E_0^{-1})^2}
\le \frac{1}{\alpha(k + 1 + \alpha^{-1}E_0^{-1})},
\]
which concludes the proof. □
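The induction above can also be checked numerically by iterating the recursion (23) with equality, which is the worst case. All constants below (r_E, L_E, ḡ_2, σ, E_0) are illustrative choices, not values from the paper:

```python
# Worst-case iteration of (23) with the step size h_k of Proposition 3.22.
rE, LE, g2, sigma = 1.0, 2.0, 1.0, 0.5   # illustrative constants
alpha = rE ** 2 / (2 * LE * g2 ** 2 * sigma ** 2)
E0 = 1.0

E = E0
c = 1.0 / (alpha * E0)        # the shift alpha^{-1} E0^{-1}
for k in range(1000):
    hk = 2.0 / (rE * (k + c))
    E = (1 - hk * rE) * E + hk ** 2 * LE * g2 ** 2 * sigma ** 2 / 2
    # Bound of Proposition 3.22 at index k+1: E <= 1 / (alpha (k + 1 + c))
    assert E <= 1.0 / (alpha * (k + 1 + c)) + 1e-12
```

The final value of E decays like O(1/k), matching the stated rate.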

In the r_E = 0 case, we assume that there exist five constants a_1, a_2, a_3, b_1, b_2 ≥ 0 such that
\[
(24) \qquad \mathbb{E}[E(t_{k+1}, z_{k+1})] - E(t_k, z_k) \le \frac{(a_1 + a_2 t_k + a_3 t_k^2)h_k^2\sigma^2}{2}
\]
and, in addition, that
\[
(25) \qquad \mathbb{E}[E(t_k, z_k)] \ge (b_1 t_k + b_2 t_k^2)\big(\mathbb{E}[f(x_k)] - f^*\big).
\]
Then we obtain
Then we obtain
Proposition 3.23 (Case r_E = 0). Assume that r_E = 0 and E satisfies (24)–(25). If h_k = k^{-\alpha} and t_k = \sum_{i=0}^k h_i, then the following holds:
• Case a_1, a_2, b_1 > 0, a_3 = b_2 = 0: if α ∈ (2/3, 1), then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{1}{k^{1-\alpha}}\Big),
\]
and if α = 1, then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{1}{\ln(k)}\Big).
\]
• Case a_3 > 0 and b_2 > 0: if α ∈ (3/4, 1), then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{b_1}{k^{1-\alpha}} + \frac{b_2}{k^{2-2\alpha}}\Big),
\]
and if α = 1, then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{b_1}{\ln(k)} + \frac{b_2}{\ln(k)^2}\Big).
\]
We will see that this result applies to convex gradient descent and to the convex accelerated gradient method.
Proof. First, since t_k = \sum_{i=1}^k \frac{1}{i^\alpha}, we need α ≤ 1 so that t_k → +∞. Summing (24) from 0 to k − 1, we obtain
\[
\mathbb{E}[E(t_k, x_k)] \le E(t_0, x_0) + \frac{\sigma^2}{2}\sum_{i=0}^{k-1} h_i^2\big(a_1 + a_2 t_i + a_3 t_i^2\big).
\]
Now we want to prove that E[E(t_k, x_k)] is bounded.


• Case a_1, a_2, b_1 > 0, a_3 = b_2 = 0: in this case, we need to prove that
\[
a_1\sum_{i=1}^{\infty}\frac{1}{i^{2\alpha}} + a_2\sum_{i=1}^{\infty}\frac{1}{i^{2\alpha}}\sum_{j=1}^{i}\frac{1}{j^\alpha} < +\infty.
\]
The first series converges if and only if α > 1/2. For the second series, first remark that
\[
\sum_{j=1}^{i}\frac{1}{j^\alpha} \sim \begin{cases}\dfrac{i^{1-\alpha}}{1-\alpha} & \text{if } \alpha \ne 1,\\[4pt] \ln(i) & \text{if } \alpha = 1,\end{cases}
\qquad\text{and then}\qquad
\frac{1}{i^{2\alpha}}\sum_{j=1}^{i}\frac{1}{j^\alpha} \sim \begin{cases}\dfrac{i^{1-3\alpha}}{1-\alpha} & \text{if } \alpha \ne 1,\\[4pt] \dfrac{\ln(i)}{i^2} & \text{if } \alpha = 1,\end{cases}
\]
which implies that the series converges if and only if α = 1 or 2/3 < α < 1. So we have proved that E[E(t_k, x_k)] is bounded if α = 1 or 2/3 < α < 1. By (25), we obtain
\[
b_1 t_k\big(\mathbb{E}[f(x_k)] - f^*\big) \le C.
\]
In addition,
\[
t_k = \sum_{i=1}^{k}\frac{1}{i^\alpha} \sim \begin{cases}\dfrac{k^{1-\alpha}}{1-\alpha} & \text{if } \alpha \ne 1,\\[4pt] \ln(k) & \text{if } \alpha = 1,\end{cases}
\]
which concludes the proof of the first item.
which concludes the proof of the first item.
• Case a_3 > 0 and b_2 > 0: in this case, we need to prove that
\[
a_1\sum_{i=1}^{\infty}\frac{1}{i^{2\alpha}} + a_2\sum_{i=1}^{\infty}\frac{1}{i^{2\alpha}}\sum_{j=1}^{i}\frac{1}{j^\alpha} + a_3\sum_{i=1}^{\infty}\frac{1}{i^{2\alpha}}\Big(\sum_{j=1}^{i}\frac{1}{j^\alpha}\Big)^2 < +\infty.
\]
Arguing as in the first part of the proof, the first two series converge if α = 1 or 2/3 < α < 1. Expanding the square,
\[
\frac{1}{i^{2\alpha}}\Big(\sum_{j=1}^{i}\frac{1}{j^\alpha}\Big)^2 = \frac{1}{i^{2\alpha}}\sum_{j=1}^{i}\frac{1}{j^{2\alpha}} + \frac{1}{i^{2\alpha}}\sum_{\substack{j,l=1\\ j\ne l}}^{i}\frac{1}{j^\alpha l^\alpha}.
\]
Then, in our case, the series converges if 3/4 < α < 1 or α = 1. Combining the fact that
\[
b_1 t_k + b_2 t_k^2 \ge \begin{cases} b_1\dfrac{k^{1-\alpha}}{1-\alpha} + b_2\dfrac{k^{2-2\alpha}}{2-2\alpha} & \text{if } \alpha \ne 1,\\[4pt] b_1\ln(k) + b_2\ln(k)^2 & \text{if } \alpha = 1,\end{cases}
\]
with (25) concludes the proof. □


4. Applications to Gradient descent


In this section, we apply our previous abstract analysis to gradient descent for convex and strongly convex
functions. In this case, g is given by
g(t, z, p) = −p i.e. g1 (t, z) = 0 and g2 (t, z) = −1.
Let f be a µ-strongly convex, L-smooth function. Consider
(28) ẋ(t) = −∇f (x(t))
and its associated forward Euler scheme
(29) xk+1 − xk = −h∇f (xk )
as well as their perturbed versions, where the gradient is replaced by \tilde\nabla f = \nabla f + e:
(30) ẋ(t) = −(∇f(x(t)) + e(t)),
and the forward Euler scheme
(31) xk+1 − xk = −h(∇f (xk ) + ek ),
with initial condition x0 .
Then, define the functions
\[
E^c(t, x) := t(f(x) - f^*) + \frac{1}{2}|x - x^*|^2, \qquad E^{sc}(x) := f(x) - f^* + \frac{\mu}{2}|x - x^*|^2,
\]
for the convex and strongly convex cases, respectively. In the following, we will apply the previous abstract analysis to E^c and E^{sc}.
4.1. Unperturbed case. First, we show that the Lyapunov functions satisfy (3.6).
Proposition 4.1.
• E^c is a Lyapunov function in the sense of Definition 3.3 with r_{E^c} = 0 and a_{E^c} = t. In addition, E^c satisfies (3.6) with L_{E^c} = L t_{k+1} + 1.
• E^{sc} is a Lyapunov function in the sense of Definition 3.3 with r_{E^{sc}} = µ and a_{E^{sc}} = 1. In addition, E^{sc} satisfies (3.6) with L_{E^{sc}} = L + µ.
Proof.
• In the convex case, we first verify (3.3):
\[
\partial_t E^c(t, z) - \langle\nabla E^c(t, z), \nabla f(z)\rangle = f(z) - f^* - \langle t\nabla f(z) + z - x^*, \nabla f(z)\rangle
= f(z) - f^* - \langle z - x^*, \nabla f(z)\rangle - t|\nabla f(z)|^2 \le -t|\nabla f(z)|^2,
\]
by convexity, which gives r_{E^c} = 0 and a_{E^c} = t. Now, by 1-convexity of the quadratic term and L-smoothness of f,
\[
E^c(t_{k+1}, z_{k+1}) - E^c(t_k, z_k) \le t_{k+1}\Big(f(z_k) - f^* + \langle\nabla f(z_k), z_{k+1} - z_k\rangle + \frac{L}{2}|z_{k+1} - z_k|^2\Big) - t_k(f(z_k) - f^*) + \frac{1}{2}|z_{k+1} - x^*|^2 - \frac{1}{2}|z_k - x^*|^2
\]
\[
\le \langle t_k\nabla f(z_k) + z_k - x^*, z_{k+1} - z_k\rangle + \frac{L t_{k+1} + 1}{2}|z_{k+1} - z_k|^2 + (t_{k+1} - t_k)(f(z_k) - f^*) + (t_{k+1} - t_k)\langle\nabla f(z_k), z_{k+1} - z_k\rangle
\]
\[
\le (t_{k+1} - t_k)\partial_t E^c(t_k, z_k) + \langle\nabla E^c(t_k, z_k), z_{k+1} - z_k\rangle + \frac{L t_{k+1} + 1}{2}|z_{k+1} - z_k|^2,
\]
since
\[
(t_{k+1} - t_k)\langle\nabla f(z_k), z_{k+1} - z_k\rangle = -h^2|\nabla f(z_k)|^2 \le 0.
\]
Then L_{E^c} = L t_{k+1} + 1.
• In the strongly convex case,
\[
\partial_t E^{sc}(z) - \langle\nabla E^{sc}(z), \nabla f(z)\rangle = -\langle\nabla f(z) + \mu(z - x^*), \nabla f(z)\rangle
= -\mu\langle z - x^*, \nabla f(z)\rangle - |\nabla f(z)|^2
\le -\mu\Big(f(z) - f^* + \frac{\mu}{2}|z - x^*|^2\Big) - |\nabla f(z)|^2,
\]
by strong convexity, and then r_{E^{sc}} = µ and a_{E^{sc}} = 1. Concerning (3.6), since E^{sc} is time independent, (3.6) is equivalent to the L-smoothness condition, which gives L_{E^{sc}} = L + µ. □

Then, applying Lemma 3.5 and Lemma 3.6 to E^c and E^{sc}, we obtain the usual rates in the convex and strongly convex cases.
Corollary 4.2. Let x be the solution of (28) and x_k the sequence generated by (29). Then:
• Convex case: for all h ≤ 1/L,
\[
f(x(t)) - f^* \le \frac{1}{2t}|x_0 - x^*|^2 \qquad\text{and}\qquad f(x_k) - f^* \le \frac{1}{2hk}|x_0 - x^*|^2.
\]
• Strongly convex case: for all h ≤ 2/(L + µ),
\[
f(x(t)) - f^* + \frac{\mu}{2}|x(t) - x^*|^2 \le e^{-\mu t}E^{sc}(x_0) \qquad\text{and}\qquad f(x_k) - f^* + \frac{\mu}{2}|x_k - x^*|^2 \le (1 - h\mu)^k E^{sc}(x_0).
\]
4.2. Perturbed gradient descent: convex case. In this section, we replace the gradients in (28) and (29) by the perturbed gradient \tilde\nabla f = \nabla f + e, where e is an error term, and we then consider (30) and (31).
Following Definition 3.9, define the perturbed Lyapunov function Ẽ^c by
\[
\tilde E^c(t, x) = E^c(t, x) + I^c(t, x(\cdot)),
\]
where E^c is defined as previously and
\[
I^c(t, x(\cdot)) = \int_0^t \langle\nabla E^c(s, x(s)), e(s)\rangle\, ds = \int_0^t \langle x(s) - x^* + s\nabla f(x(s)), e(s)\rangle\, ds.
\]
Then, applying Proposition 3.10 and Corollary 3.11, we have
Proposition 4.3. Let x be a solution of (30). Then,
\[
E^c(t, x(t)) \le E^c(0, x_0) - I^c(t, x(\cdot)).
\]
In addition, if e ∈ L¹(ℝ_+), then \sup_{s\ge 0}|x(s) - x^* + s\nabla f(x(s))| \le M_\infty < +\infty, and
\[
f(x(t)) - f^* \le \frac{1}{t}\Big(\frac{1}{2}|x_0 - x^*|^2 + M_\infty\|e\|_{L^1}\Big).
\]
Following the abstract case, define the discrete perturbed Lyapunov function Ẽ_k^c by
\[
\tilde E_k^c = E^c(t_k, x_k) + I_k^c,
\]
where t_k := hk, and
\[
I_k^c = -h\sum_{i=0}^{k-1}\beta_i,
\]
with
\[
(32) \qquad \beta_i := -\big\langle x_i - x^* + t_{i+1}\nabla f(x_i),\, e_i\big\rangle + h(L t_{i+1} + 1)\Big\langle\nabla f(x_i) + \frac{1}{2}e_i,\, e_i\Big\rangle.
\]
In the next proposition, we sum up the results coming from Lemma 3.12, Lemma 3.17 and Corollary 3.19.
Proposition 4.4. Let x_k be the sequence generated by perturbed gradient descent (31). Assume that h satisfies h ≤ 1/L. Then E_k^c satisfies
\[
(33) \qquad E_{k+1}^c \le E_k^c + h\beta_k,
\]
where β_k is defined by (32). Then
\[
f(x_k) - f^* \le \frac{1}{t_k}\Big(\frac{1}{2}|x_0 - x^*|^2 - I_k^c\Big).
\]
In addition, \sup_{0\le i\le k}\big(|x_i - x^*| + |t_{i+1}\nabla f(x_i)|\big) \le M_k, where M_k depends on
\[
(34) \qquad \sum_{i=0}^{k-1}\big(|e_i| + i|e_i|^2\big).
\]
Assuming that (34) remains finite, then M_k is bounded in ℓ^∞ and
\[
f(x_k) - f^* \le \frac{1}{t_k}\Big(\frac{1}{2}|x_0 - x^*|^2 + M_k h\sum_{i=0}^{k-1}\big(|e_i| + i|e_i|^2\big)\Big) = O\Big(\frac{1}{k}\Big).
\]
4.3. Variable time step and convergence in expectation: convex case. In this section we consider the case of a variable time step h_k and a zero-mean, fixed-variance error e_k, i.e. e_k satisfies (21). Note that (33) still holds for an adaptive time step, and then, since E[β_k] = h_k(L t_{k+1} + 1)σ²/2,
\[
\mathbb{E}[E_{k+1}^c] \le E_k^c + \frac{h_k^2(L t_{k+1} + 1)\sigma^2}{2},
\]
and
\[
\mathbb{E}[E_k^c] \ge t_k\big(\mathbb{E}[f(x_k)] - f^*\big),
\]
which correspond to (24) and (25) with
\[
a_1 = L h_k + 1, \quad a_2 = L, \quad a_3 = 0, \quad b_1 = 1 \quad\text{and}\quad b_2 = 0.
\]
So Proposition 3.23 gives
Proposition 4.5. Assume h_k := k^{-\alpha} and t_k = \sum_{i=0}^k h_i; then the following holds:
• If α ∈ (2/3, 1), then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{1}{k^{1-\alpha}}\Big);
\]
• if α = 1, then
\[
\mathbb{E}[f(x_k)] - f^* = O\Big(\frac{1}{\ln(k)}\Big).
\]
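The mechanism behind Proposition 4.5 can be sketched by iterating the expectation recursion of Section 4.3 with equality (the worst case): the Lyapunov value stays bounded while t_k → ∞, so the f-gap bound E_k/t_k decays like k^{α−1}. All constants (L, σ, α, E_0) are illustrative choices:

```python
# Worst-case iteration of E[E_{k+1}] <= E_k + h_k^2 (L t_{k+1} + 1) sigma^2 / 2
# with h_k = k^{-alpha}, alpha in (2/3, 1).
L, sigma, alpha = 1.0, 0.5, 0.8   # illustrative constants
E, t = 1.0, 0.0                    # E_0 = 1, t_0 = 0
for k in range(1, 100001):
    hk = k ** (-alpha)
    E += hk ** 2 * (L * (t + hk) + 1) * sigma ** 2 / 2
    t += hk

assert E < 10.0      # E_k stays bounded (the series converges for alpha > 2/3)
assert E / t < 0.1   # the f-gap bound E_k / t_k decays like t_k^{-1} ~ k^{alpha-1}
```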
4.4. Perturbed gradient descent: strongly convex case. Now consider a µ-strongly convex function f. Define the perturbed Lyapunov function Ẽ^{sc} : [0, +∞) × ℝ^d → [0, +∞) by
\[
\tilde E^{sc}(t, x) = E^{sc}(x) + I^{sc}(t, x(\cdot)),
\]
where
\[
I^{sc}(t, x(\cdot)) = e^{-\mu t}\int_0^t e^{\mu s}\langle\nabla E^{sc}(x(s)), e(s)\rangle\, ds = e^{-\mu t}\int_0^t e^{\mu s}\langle\mu(x(s) - x^*) + \nabla f(x(s)), e(s)\rangle\, ds = -e^{-\mu t}J^{sc}(t, x(\cdot)).
\]
Then the following result holds.
Proposition 4.6. Let x be a solution of (30) with initial data x_0; then
\[
f(x(t)) - f^* + \frac{\mu}{2}|x(t) - x^*|^2 \le e^{-\mu t}\big(E^{sc}(x_0) + J^{sc}(t, x(\cdot))\big).
\]
In addition, if \exp(\mu\cdot)e \in L^1(\mathbb{R}_+), then \sup_{s\ge 0}|\mu(x(s) - x^*) + \nabla f(x(s))| \le M_\infty < +\infty and
\[
f(x(t)) - f^* + \frac{\mu}{2}|x(t) - x^*|^2 \le e^{-\mu t}\big(E^{sc}(x_0) + M_\infty\|\exp(\mu\cdot)e\|_{L^1}\big) = O(e^{-\mu t}).
\]

Now, in the discrete case, define the discrete perturbed Lyapunov function Ẽ_k^{sc}, for k ≥ 0, by
\[
\tilde E_k^{sc} = E^{sc}(x_k) + I_k^{sc},
\]
where x_k is generated by (31), the forward Euler discretization of (30), and
\[
I_k^{sc} = -(1 - h\mu)^k h\sum_{i=0}^{k-1}(1 - h\mu)^{-i-1}\beta_i = -(1 - h\mu)^k J_k^{sc},
\]
where
\[
(35) \qquad \beta_i = -\big\langle\mu(x_i - x^*) + \nabla f(x_i),\, e_i\big\rangle + h(L + \mu)\Big\langle\nabla f(x_i) + \frac{1}{2}e_i,\, e_i\Big\rangle.
\]
As in the convex case, in the next proposition we sum up results coming from Lemma 3.12, Lemma 3.17
and Corollary 3.19.
Proposition 4.7. Let x_k be the sequence generated by perturbed gradient descent (31). Assume that h satisfies h ≤ 2/(L + µ). Then E_k^{sc} satisfies
\[
(36) \qquad E_{k+1}^{sc} \le (1 - h\mu)E_k^{sc} + h\beta_k,
\]
where β_k is defined by (35). Then
\[
f(x_k) - f^* + \frac{\mu}{2}|x_k - x^*|^2 \le (1 - h\mu)^k\big(E^{sc}(x_0) + J_k^{sc}\big).
\]
In addition, \sup_{0\le i\le k}\big(|x_i - x^*| + |\nabla f(x_i)|\big) \le M_k, where M_k depends on
\[
(37) \qquad \sum_{i=0}^{k-1}(1 - h\mu)^{-i}|e_i|.
\]
Assuming that (37) remains finite as k → +∞, then M_k is bounded in ℓ^∞ and
\[
f(x_k) - f^* + \frac{\mu}{2}|x_k - x^*|^2 \le (1 - h\mu)^k\Big(E^{sc}(x_0) + M_k\sum_{i=0}^{k-1}(1 - h\mu)^{-i}|e_i|\Big) = O\big((1 - h\mu)^k\big).
\]

4.5. Variable time step and convergence in expectation: strongly convex case. Now consider the case of a variable time step h_k and a zero-mean, constant-variance error e_k, i.e. e_k satisfies (21). As in the convex case, (36) is still true for an adaptive time step h_k, and then, since E[β_k] = h_k(L + µ)σ²/2,
\[
\mathbb{E}[E_{k+1}^{sc}] \le (1 - \mu h_k)E_k^{sc} + \frac{h_k^2(L + \mu)\sigma^2}{2}.
\]
We are in the case where r_{E^{sc}} = µ > 0 and \bar g_2^2 = 1, so Proposition 3.22 gives
Proposition 4.8. If
\[
h_k^{sc} := \frac{2}{\mu(k + \alpha_{sc}^{-1}E_0^{-1})}, \qquad \alpha_{sc} := \frac{\mu}{2(C_f + 1)\sigma^2},
\]
where C_f = L/µ denotes the condition number, and t_k = \sum_{i=0}^k h_i, then the following holds:
\[
\mathbb{E}[E^{sc}(x_k)] \le \frac{1}{\alpha_{sc}k + E_0^{-1}} = \frac{2(C_f + 1)\sigma^2}{\mu k + 2(C_f + 1)\sigma^2 E_0^{-1}}.
\]

5. Accelerated method: convex case


In the remainder of the paper, we will extend the analysis developed in section 3 to the accelerated gradient
method in both continuous and discrete time. We study the perturbed case as in the full gradient descent
case. However, we will see, in the perturbed case, that the definition of I requires slight modifications.
Indeed, the Lyapunov function and the gradient of f are not evaluated at the same point. We will also
consider the case where the time step is varying and the error has zero-mean and a constant variance.
Section 5 considers the convex case while Section 6 focuses on the strongly convex case.
In this section, we study the system (1st-ODE) as well as its discretization (FE-C). Then we extend the Lyapunov analysis to the perturbed case. We also present a convergence rate for the expectation of f in the case of a varying time step h_k and an error with constant variance.

5.1. Unperturbed gradient case. Define the parameterized velocity field by setting z = (x, v) and
\[
(38) \qquad g(t, x, v, \nabla f(y^\varepsilon); \varepsilon) = g_1(t, x, v) + g_2(t, x, v)\nabla f(y^\varepsilon),
\]
where
\[
(39) \qquad g_1(t, x, v) = \begin{pmatrix}\frac{2}{t}(v - x)\\ 0\end{pmatrix}, \qquad g_2(t, x, v) = \begin{pmatrix}-\frac{1}{\sqrt L}\\ -\frac{t}{2}\end{pmatrix},
\]
and
\[
(40) \qquad y^\varepsilon = x + \frac{2\varepsilon}{t}(v - x).
\]
t
Let h > 0 be a given small time step/learning rate and let t_k = h(k + 2). Consider the perturbation with ε = h of the forward Euler method for g, given by
\[
(\text{FE-C}) \qquad
\begin{cases}
x_{k+1} - x_k = \dfrac{2h}{t_k}(v_k - x_k) - \dfrac{h}{\sqrt L}\nabla f(y_k),\\[6pt]
v_{k+1} - v_k = -\dfrac{h t_k}{2}\nabla f(y_k),\\[6pt]
y_k = \dfrac{k x_k + 2v_k}{k + 2} = x_k + \dfrac{2}{k+2}(v_k - x_k).
\end{cases}
\]
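A direct implementation of (FE-C) with h = 1/√L can be used to check numerically the 1/(k+1)² decay of f(x_k) established below (Corollary 5.4). The two-dimensional quadratic here is an illustrative test function, not an example from the paper:

```python
import math

# The scheme (FE-C) with h = 1/sqrt(L) on f(x) = (x1^2 + 4*x2^2)/2 (L = 4),
# checking f(x_k) - f* <= (f(x_0) - f* + 2L|v_0 - x*|^2) / (k+1)^2.
L = 4.0
h = 1.0 / math.sqrt(L)

def f(x):
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)

def grad(x):
    return [x[0], 4.0 * x[1]]

x, v = [1.0, 1.0], [1.0, 1.0]          # minimizer x* = 0, f* = 0
bound0 = f(x) + 2 * L * (v[0] ** 2 + v[1] ** 2)
for k in range(200):
    tk = h * (k + 2)
    y = [x[i] + 2.0 / (k + 2) * (v[i] - x[i]) for i in range(2)]
    g = grad(y)
    x = [x[i] + 2 * h / tk * (v[i] - x[i]) - h / math.sqrt(L) * g[i] for i in range(2)]
    v = [v[i] - h * tk / 2 * g[i] for i in range(2)]
    assert f(x) <= bound0 / (k + 2) ** 2 + 1e-12   # rate at iterate k+1
```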

Definition 5.1. Define the continuous time parameterized Lyapunov function
\[
(41) \qquad E^{ac,c}(t, x, v; \varepsilon) := (t - \varepsilon)^2(f(x) - f^*) + 2|v - x^*|^2.
\]
Define the discrete time Lyapunov function E_k^{ac,c} by
\[
(42) \qquad E_k^{ac,c} = E^{ac,c}(t_k, x_k, v_k; h) = E^{ac,c}(t_{k-1}, x_k, v_k; 0).
\]
In the next proposition, we show that E^{ac,c} is a rate-generating Lyapunov function, in the sense of Definition 3.21, for the system (1st-ODE) and its explicit discretization (FE-C).
Proposition 5.2. Suppose f is convex and L-smooth. Define the velocity field by (38)–(40) and let E^{ac,c} be given by (41). Then E^{ac,c}(·; 0) is a continuous Lyapunov function with r_{E^{ac,c}} = 0 and a_{E^{ac,c}} = \frac{t^2}{\sqrt L} (with gap \frac{t^2}{\sqrt L}|\nabla f(x)|^2), i.e.
\[
\frac{d}{dt}E^{ac,c}(t, x(t), v(t); 0) \le -\frac{t^2}{\sqrt L}|\nabla f(x)|^2,
\]
and E_k^{ac,c} is a discrete Lyapunov function, with L_{E^{ac,c}} = t_k^2, for the sequence generated by (FE-C): for all k ≥ 0,
\[
E_{k+1}^{ac,c} \le E_k^{ac,c} - h^2(f(x_k) - f^*) + \Big(h - \frac{1}{\sqrt L}\Big)t_k^2 h|\nabla f(y_k)|^2,
\]
for h ≤ \frac{1}{\sqrt L}.
Since (FE-C) is equivalent to Nesterov's method, the rate is known. The proof of the rate using a Lyapunov function can be found in [BT09]. A proof showing that the constant time step can be used can be found in [Bec17]. The discrete Lyapunov function (42) was used in [SBC14, APR16] to prove a rate.
Remark 5.3. Note that, compared to Su–Boyd–Candès' ODE (A-ODE), there is a gap in the dissipation of the Lyapunov function E^{ac,c}, which would not be there if the extra term -\frac{1}{\sqrt L}\nabla f(x) were missing. In particular, if z̃ is a solution of (A-ODE) and z a solution of (1st-ODE), then we can prove faster convergence for z due to the gap. Indeed, this gap improves the asymptotic rate of decay of the gradient, see Corollary 5.5.
Corollary 5.4. Let f be a convex and L-smooth function. Let (x(t), v(t)) be a solution to (1st-ODE); then, for all t > 0,
\[
f(x(t)) - f^* \le \frac{2}{t^2}|v_0 - x^*|^2.
\]
Furthermore, let x_k, v_k be given by (FE-C). Then, for all k ≥ 0 and h = \frac{1}{\sqrt L},
\[
f(x_k) - f^* \le \frac{1}{(k+1)^2}\big(f(x_0) - f^* + 2L|v_0 - x^*|^2\big).
\]

Proof of Proposition 5.2. First, in the continuous case, by definition of E^{ac,c}, we have
\[
\frac{d}{dt}E^{ac,c}(t, x(t), v(t); 0) = 2t(f(x) - f^*) + t^2\langle\nabla f(x), \dot x\rangle + 4\langle v - x^*, \dot v\rangle
= 2t(f(x) - f^*) + 2t\langle\nabla f(x), v - x\rangle - \frac{t^2}{\sqrt L}|\nabla f(x)|^2 - 2t\langle v - x^*, \nabla f(x)\rangle
= 2t\big(f(x) - f^* - \langle x - x^*, \nabla f(x)\rangle\big) - \frac{t^2}{\sqrt L}|\nabla f(x)|^2.
\]
The proof is concluded by convexity:
\[
f(x) - f^* - \langle x - x^*, \nabla f(x)\rangle \le 0.
\]
In the discrete case, using the convexity and the L-smoothness of f, we obtain the following classical inequality, see [SBC14, APR16]:
\[
t_k^2(f(x_{k+1}) - f^*) - t_{k-1}^2(f(x_k) - f^*) \le \Big(\Big(\frac{k t_k}{k+2}\Big)^2 - t_{k-1}^2\Big)(f(x_k) - f^*) + \frac{2t_k^2}{k+2}\langle\nabla f(y_k), v_k - x^*\rangle + \Big(\frac{h}{2} - \frac{1}{\sqrt L}\Big)h t_k^2|\nabla f(y_k)|^2.
\]
By definition of v_{k+1}, we have
\[
2|v_{k+1} - x^*|^2 - 2|v_k - x^*|^2 = -2h t_k\langle v_k - x^*, \nabla f(y_k)\rangle + \frac{h^2 t_k^2}{2}|\nabla f(y_k)|^2.
\]
Combining these two inequalities, and noticing that \big(\frac{k t_k}{k+2}\big)^2 - t_{k-1}^2 = -h^2(2k+1) \le -h^2 and \frac{2t_k^2}{k+2} = 2h t_k, we obtain
\[
E_{k+1}^{ac,c} - E_k^{ac,c} \le -h^2(f(x_k) - f^*) + \Big(h - \frac{1}{\sqrt L}\Big)t_k^2 h|\nabla f(y_k)|^2. \qquad\square
\]
Notice that due to the extra gap we obtain an improvement in the rate of convergence of |∇f(y_k)|². It is well known that, by (1.2) and Corollary 5.4, we have
\[
\frac{1}{2L}|\nabla f(x_k)|^2 \le \frac{1}{(k+1)^2}\big(f(x_0) - f^* + 2L|v_0 - x^*|^2\big).
\]
The gap obtained in the dissipation of E^{ac,c} gives a faster convergence rate.
Corollary 5.5. For all h < \frac{1}{\sqrt L} and k ≥ 1,
\[
\min_{0\le i\le k}|\nabla f(y_i)|^2 \le \frac{3\sqrt L}{h^3(1 - \sqrt L h)}\,\frac{1}{k^3}\,E^{ac,c}(x_0, v_0).
\]

Proof. By Proposition 5.2, for all k ≥ 0,
\[
E_k^{ac,c} - E_0^{ac,c} = \sum_{i=0}^{k-1}\big(E_{i+1}^{ac,c} - E_i^{ac,c}\big) \le \Big(h - \frac{1}{\sqrt L}\Big)h\sum_{i=0}^{k-1}t_i^2|\nabla f(y_i)|^2 \le \Big(h - \frac{1}{\sqrt L}\Big)h^3\Big(\min_{0\le i\le k}|\nabla f(y_i)|^2\Big)\sum_{i=0}^{k-1}(i+2)^2.
\]
To simplify the bound, notice that
\[
\sum_{i=0}^{k-1}(i+2)^2 = \frac{2k^3 + 9k^2 + 13k}{6} \ge \frac{1}{3}k^3,
\]
which, since E_k^{ac,c} ≥ 0, concludes the proof. □


Remark 5.6. The optimal choice for h is h = \frac{3}{4\sqrt L}, and then we have
\[
\min_{0\le i\le k}|\nabla f(y_i)|^2 \le \frac{256L^2}{9k^3}\,E^{ac,c}(x_0, v_0).
\]
5.2. Perturbed gradient: continuous time. In this section, we consider that an error e(t) is made in
the evaluation of the gradient at time t. We study the following perturbation of system (1st-ODE),
\[
(\text{Per-1st-ODE}) \qquad
\begin{cases}
\dot x = \dfrac{2}{t}(v - x) - \dfrac{1}{\sqrt L}\big(\nabla f(x) + e(t)\big),\\[6pt]
\dot v = -\dfrac{t}{2}\big(\nabla f(x) + e(t)\big),
\end{cases}
\]
where e is a function which represents an error in the calculation of the gradient ∇f(x).

If we assume that e and f are smooth, the corresponding second order ODE is
\[
\ddot x + \frac{3}{t}\dot x + \frac{1}{\sqrt L}D^2 f(x)\cdot\dot x + \Big(1 + \frac{1}{t\sqrt L}\Big)\nabla f(x) = -\Big(1 + \frac{1}{t\sqrt L}\Big)e(t) - \frac{1}{\sqrt L}e'(t),
\]
which corresponds to (H-ODE) perturbed by the term -\big(1 + \frac{1}{t\sqrt L}\big)e(t) - \frac{1}{\sqrt L}e'(t).

Definition 5.7. Define the perturbed Lyapunov function for this system, Ẽ^{ac,c}, by
\[
\tilde E^{ac,c}(t, x, v; \varepsilon) = E^{ac,c}(t, x, v; \varepsilon) + I^{ac,c}(t, x(\cdot), v(\cdot); \varepsilon),
\]
where E^{ac,c} is defined as in (41) and
\[
I^{ac,c}(t, x(\cdot), v(\cdot); \varepsilon) = \int_0^t s\Big\langle 2(v(s) - x^*) + \frac{s}{\sqrt L}\nabla f(y^\varepsilon(s)),\, e(s)\Big\rangle + 2\varepsilon s^2\Big\langle\nabla f(y^\varepsilon(s)) + \frac{e(s)}{2},\, e(s)\Big\rangle\, ds,
\]
where y^ε is defined by (40).
Remark 5.8. We will see in Proposition 5.11 that I^{ac,c}(·; ε) is the right definition in the sense of Definition 3.16. However, in the discrete case (ε = h), the gradient in I^{ac,c}(·; h) is evaluated at y_k, while f in E^{ac,c}(·; h) is evaluated at x_k. We will overcome this difficulty by defining another Lyapunov function which does not depend on ∇f(y_k). Nevertheless, Ẽ^{ac,c} will still be useful, assuming that f is in addition µ-strongly convex, to obtain an accelerated rate for min_{i≤k}|∇f(y_i)|². Moreover, Ẽ^{ac,c} will also allow us to apply the abstract analysis with a variable time step and an error with zero mean and fixed variance, developed in Section 3.6.
Proposition 5.9. Let (x, v) be a solution of (Per-1st-ODE) with initial condition (x(0), v(0)) = (x0 , v0 ).
Then
\[
\frac{d}{dt}\tilde E^{ac,c}(t, x, v; 0) \le -\frac{t^2}{\sqrt L}|\nabla f(x)|^2
\]
and
\[
f(x) - f^* \le \frac{1}{t^2}\Big(2|v_0 - x^*|^2 - I^{ac,c}(t, x(\cdot), v(\cdot))\Big).
\]
Proof. Following the proof of Proposition 5.2, we have
\[
\frac{d}{dt}E^{ac,c}(t, x, v; 0) \le -\frac{t^2}{\sqrt L}|\nabla f(x)|^2 - \frac{t^2}{\sqrt L}\langle\nabla f(x), e(t)\rangle - 2t\langle v - x^*, e(t)\rangle.
\]
In addition,
\[
\frac{d}{dt}I^{ac,c}(t) = \frac{t^2}{\sqrt L}\langle\nabla f(x), e(t)\rangle + 2t\langle v - x^*, e(t)\rangle.
\]
Then
\[
\frac{d}{dt}\tilde E^{ac,c}(t, x, v; 0) \le -\frac{t^2}{\sqrt L}|\nabla f(x)|^2,
\]
and the rest of the proof follows directly. □
Then we deduce
Corollary 5.10. Let (x, v) be a solution of (Per-1st-ODE) with initial condition (x(0), v(0)) = (x_0, v_0). Then,
\[
\sup_{0\le s\le t}\big(|v(s) - x^*| + s|\nabla f(x(s))|\big) \le M(t),
\]
where M(t) depends on
\[
(43) \qquad \int_0^t s|e(s)|\, ds.
\]
Assume that (43) is bounded; then M(t) is bounded in L^∞(ℝ_+) and
\[
f(x(t)) - f^* \le \frac{1}{t^2}\Big(2|v_0 - x^*|^2 + M(t)\int_0^t s|e(s)|\, ds\Big) = O\Big(\frac{1}{t^2}\Big).
\]
Proof. From the previous proposition, Ẽ^{ac,c} is decreasing and
\[
t^2(f(x) - f^*) + 2|v - x^*|^2 \le 2|v_0 - x^*|^2 - \int_0^t s\Big\langle 2(v(s) - x^*) + \frac{s}{\sqrt L}\nabla f(x(s)),\, e(s)\Big\rangle\, ds.
\]
Using (1.2), we obtain
\[
\frac{1}{2L}|t\nabla f(x(t))| + 2|v(t) - x^*| \le 2|v_0 - x^*|^2 + \frac{1}{2L} + \Big(2 + \frac{1}{\sqrt L}\Big)\int_0^t\big(|s\nabla f(x(s))| + 2|v(s) - x^*|\big)\,|s e(s)|\, ds,
\]
and, since e satisfies (43), we conclude by applying Gronwall's Lemma. □

5.3. Perturbed gradient: discrete time. Replacing gradients with \tilde\nabla f, the ε = h perturbation of the forward Euler scheme (FE-C) becomes
\[
(\text{Per-FE-C}) \qquad
\begin{cases}
x_{k+1} - x_k = \dfrac{2h}{t_k}(v_k - x_k) - \dfrac{h}{\sqrt L}\big(\nabla f(y_k) + e_k\big),\\[6pt]
v_{k+1} - v_k = -\dfrac{h t_k}{2}\big(\nabla f(y_k) + e_k\big),
\end{cases}
\]
where y_k is as in (FE-C), h is a constant time step, and t_k := h(k + 2).
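A minimal numerical sketch of (Per-FE-C) with a summable error sequence illustrates that the accelerated decay survives small perturbations, as established below. The quadratic test function and the error magnitude are illustrative choices, not examples from the paper:

```python
import math

# (Per-FE-C) with h = 1/sqrt(L) and a summable error e_k = 0.01/(k+1)^2
# (so sum_i i|e_i| is finite) on f(x) = (x1^2 + 4*x2^2)/2, with L = 4.
L = 4.0
h = 1.0 / math.sqrt(L)

def grad(x):
    return [x[0], 4.0 * x[1]]

x, v = [1.0, 1.0], [1.0, 1.0]          # minimizer x* = 0, f* = 0
for k in range(300):
    tk = h * (k + 2)
    e = 0.01 / (k + 1) ** 2            # summable perturbation of the gradient
    y = [x[i] + 2.0 / (k + 2) * (v[i] - x[i]) for i in range(2)]
    gy = grad(y)
    g = [gy[i] + e for i in range(2)]  # perturbed gradient
    x = [x[i] + 2 * h / tk * (v[i] - x[i]) - h / math.sqrt(L) * g[i] for i in range(2)]
    v = [v[i] - h * tk / 2 * g[i] for i in range(2)]

fx = 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)
assert fx < 1e-2   # accelerated decay preserved under summable errors
```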

In this setting, we first give two estimates of the dissipation of E_k^{ac,c} along (Per-FE-C).
Proposition 5.11. Let x_k, v_k, y_k be the sequences generated by (Per-FE-C)–(FE-C). Then,
\[
(44) \qquad E_{k+1}^{ac,c} - E_k^{ac,c} \le -h^2(f(x_k) - f^*) + \Big(h - \frac{1}{\sqrt L}\Big)t_k^2 h|\nabla f(y_k)|^2 + h\beta_k,
\]
where \beta_k := -t_k\big\langle 2(v_k - x^*) + \frac{t_k}{\sqrt L}\nabla f(y_k),\, e_k\big\rangle + 2h t_k^2\big\langle\nabla f(y_k) + \frac{e_k}{2},\, e_k\big\rangle, and
\[
(45) \qquad E_{k+1}^{ac,c} - E_k^{ac,c} \le \Big(h - \frac{1}{\sqrt L}\Big)t_k^2 h|\nabla f(y_k) + e_k|^2 + h\bar\beta_k,
\]
where \bar\beta_k := -2t_k\langle v_k - x^*, e_k\rangle + \frac{t_k^2}{\sqrt L}\langle\nabla f(y_k) + e_k,\, e_k\rangle.

Proof. First, using the convexity and the L-smoothness of f, we obtain the following classical inequality (see [APR16] or [SBC14] in the case e_k = 0, and Appendix A for e_k ≠ 0):
\[
(46) \qquad t_k^2(f(x_{k+1}) - f^*) - t_{k-1}^2(f(x_k) - f^*) \le -h^2(f(x_k) - f^*) + 2h t_k\langle\nabla f(y_k), v_k - x^*\rangle - \Big(\frac{1}{\sqrt L} - \frac{h}{2}\Big)h t_k^2|\nabla f(y_k)|^2 - \frac{h t_k^2}{\sqrt L}\langle\nabla f(y_k), e_k\rangle + h^2 t_k^2\Big\langle\nabla f(y_k) + \frac{e_k}{2},\, e_k\Big\rangle.
\]
By definition of v_{k+1}, we have
\[
2|v_{k+1} - x^*|^2 - 2|v_k - x^*|^2 = -2h t_k\langle v_k - x^*, \nabla f(y_k) + e_k\rangle + \frac{h^2 t_k^2}{2}|\nabla f(y_k) + e_k|^2
= -2h t_k\langle v_k - x^*, \nabla f(y_k)\rangle + \frac{h^2 t_k^2}{2}|\nabla f(y_k)|^2 - 2h t_k\langle v_k - x^*, e_k\rangle + h^2 t_k^2\Big\langle\nabla f(y_k) + \frac{e_k}{2},\, e_k\Big\rangle.
\]
Therefore,
\[
E_{k+1}^{ac,c} - E_k^{ac,c} \le -h^2(f(x_k) - f^*) - \Big(\frac{1}{\sqrt L} - h\Big)h t_k^2|\nabla f(y_k)|^2 - 2h t_k\Big\langle v_k - x^* + \frac{t_k}{2\sqrt L}\nabla f(y_k),\, e_k\Big\rangle + 2h^2 t_k^2\Big\langle\nabla f(y_k) + \frac{e_k}{2},\, e_k\Big\rangle,
\]
and (44) is proved. For the second inequality, arguing as in Appendix A, we obtain
\[
t_k^2(f(x_{k+1}) - f^*) - t_{k-1}^2(f(x_k) - f^*) \le -h^2(f(x_k) - f^*) + 2h t_k\langle\nabla f(y_k), v_k - x^*\rangle - \Big(\frac{1}{\sqrt L} - \frac{h}{2}\Big)h t_k^2|\nabla f(y_k) + e_k|^2 + \frac{h t_k^2}{\sqrt L}\langle\nabla f(y_k) + e_k,\, e_k\rangle,
\]
which, combined with the identity for 2|v_{k+1} − x*|² above, concludes the proof of (45). □

Remark 5.12. Inequalities (44) and (45) still hold with a variable time step h_k, with t_k defined by t_k = \sum_{i=0}^k h_i. In Section 5.5, we will use (44) to apply Proposition 3.23.
We cannot use the same method as in Section 3 to define the Lyapunov function, because the perturbed part of the Lyapunov function would be
\[
I_k^{ac,c} := -h\sum_{i=0}^{k-1}\beta_i,
\]
which implies we would need to control |∇f(y_k)|; however, E_k^{ac,c} controls only |∇f(x_k)|. In Section 5.4, we will show that we can still use this perturbed Lyapunov function assuming that f is µ-strongly convex. Indeed, in this case, we have
\[
\frac{\mu}{2}t_k^2|x_k - x^*|^2 + 2|v_k - x^*|^2 \le E_k^{ac,c} \quad\text{and}\quad t_k|\nabla f(y_k)| \le L t_k|y_k - x^*| \le (1 - \lambda_h)L t_k|x_k - x^*| + \lambda_h L t_k|v_k - x^*|.
\]
However, in this section we focus on the optimal case h = \frac{1}{\sqrt L}, for which we can define a simpler discrete perturbed Lyapunov function as follows.
Definition 5.13. Define the discrete perturbed Lyapunov function \bar E_k^{ac,c} := E_k^{ac,c} + \bar I_k^{ac,c}, for k ≥ 0, where E_k^{ac,c} is given by (42) and, for k ≥ 0,
\[
\bar I_k^{ac,c} := h\sum_{i=0}^{k-1}2t_i\langle v_{i+1} - x^*, e_i\rangle.
\]
Proposition 5.14. Let x_k, v_k, y_k be sequences generated by (Per-FE-C)-(FE-C). Then, for \(h=\frac{1}{\sqrt{L}}\), \(\overline E_k^{ac,c}\) is decreasing and
\[
f(x_k)-f^*\le\frac{L}{(k+2)^2}\Big(\overline E_0^{ac,c}-\overline I_k^{ac,c}\Big).
\]
Proof.
\[
\overline I_{k+1}^{ac,c}-\overline I_k^{ac,c}=2ht_k\langle v_{k+1}-x^*,e_k\rangle=2ht_k\langle v_k-x^*,e_k\rangle-h^2t_k^2\langle\nabla f(y_k)+e_k,e_k\rangle.
\]
Combine this identity with (5.11) to obtain
\[
\overline E_{k+1}^{ac,c}-\overline E_k^{ac,c}\le-h^2(f(x_k)-f^*)+\Big(h-\frac{1}{\sqrt{L}}\Big)t_k^2h|\nabla f(y_k)+e_k|^2+\Big(\frac{ht_k^2}{\sqrt{L}}-h^2t_k^2\Big)\langle e_k,\nabla f(y_k)+e_k\rangle.
\]
Since \(h=\frac{1}{\sqrt{L}}\), we deduce that \(\overline E_k^{ac,c}\) is decreasing, and the rest of the proof follows. □
We immediately have the following result.
Corollary 5.15. Under the assumptions of Proposition 5.14, \(\max_{0\le i\le k}|v_i-x^*|\le M_k\), where M_k depends on
\[
\sum_{i=1}^{k-1}i|e_i|.\tag{47}
\]
Suppose (47) is finite; then M_k is bounded in \(\ell^\infty\) and
\[
f(x_k)-f^*\le\frac{L}{(k+1)^2}\left(E_0^{ac,c}+M_k\sum_{i=1}^{k-1}i|e_i|\right)=O\left(\frac{1}{(k+1)^2}\right).
\]
Proof. Since \(\overline E_k^{ac,c}\) is decreasing by Proposition 5.14,
\[
2|v_k-x^*|^2\le2|v_0-x^*|^2+\frac{1}{L}\sum_{i=0}^{k-1}|v_i-x^*|(i+3)|e_i|,
\]
and the discrete version of Gronwall's Lemma gives the result. □
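The discrete Gronwall step can be made concrete. The following sketch is our own illustration (not from the paper): it checks one standard quadratic variant of the discrete Gronwall lemma, namely that \(u_k^2\le A+\sum_{i<k}b_iu_i\) with \(b_i\ge0\) implies \(u_k\le\sqrt{A}+\sum_{i<k}b_i\), on the extremal sequence defined by equality; the values of A and b_i are arbitrary illustrative choices.

```python
import math

def worst_case_sequence(A, b):
    """u_k defined by equality in the Gronwall hypothesis:
    u_k^2 = A + sum_{i<k} b_i * u_i  (the extremal case)."""
    u, acc = [], A
    for b_i in b:
        u.append(math.sqrt(acc))
        acc += b_i * u[-1]
    u.append(math.sqrt(acc))
    return u

A = 2.0                                   # plays the role of 2|v_0 - x*|^2
b = [0.1 / (i + 1) for i in range(200)]   # summable weights, like (i+3)|e_i|/L
u = worst_case_sequence(A, b)
bound = math.sqrt(A) + sum(b)             # conclusion of the discrete Gronwall lemma
assert max(u) <= bound + 1e-12
```

Any sequence satisfying the hypothesis is dominated by this extremal one, so a single check of the equality case covers the general case for the chosen weights.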
5.4. Decrease of |∇f(y_k)| in the strongly convex case. If, in addition, we assume that f is µ-strongly convex, it is possible to improve the decrease of |∇f(y_k)|. To do so, we introduce a discrete perturbed Lyapunov function different from the one in Definition 5.13.
Definition 5.16. Define the perturbed Lyapunov function \(\tilde E_k^{ac,c}:=\tilde E^{ac,c}(t_k,x_k,v_k;h)\), where \(\tilde E^{ac,c}\) is defined in Definition 5.7, i.e. \(\tilde E^{ac,c}(t_k,x_k,v_k;h)=E_k^{ac,c}+I_k^{ac,c}\), where E_k^{ac,c} is given by (5.1) and, for k ≥ 0,
\[
I_k^{ac,c}:=I^{ac,c}(t_k,(x_k),(v_k);h)=-h\sum_{i=0}^{k-1}\beta_i,
\]
where
\[
\beta_i=-t_i\Big\langle2(v_i-x^*)+\frac{t_i}{\sqrt{L}}\nabla f(y_i),e_i\Big\rangle+2ht_i^2\Big\langle\nabla f(y_i)+\frac{e_i}{2},e_i\Big\rangle.
\]
First, we show that this perturbed Lyapunov function is decreasing.

Proposition 5.17. The perturbed Lyapunov function satisfies
\[
\tilde E_{k+1}^{ac,c}-\tilde E_k^{ac,c}\le\Big(h-\frac{1}{\sqrt{L}}\Big)t_k^2h|\nabla f(y_k)|^2,\qquad k\ge0.
\]

Proof. From (5.11), we have
\[
E_{k+1}^{ac,c}-E_k^{ac,c}\le\Big(h-\frac{1}{\sqrt{L}}\Big)ht_k^2|\nabla f(y_k)|^2+h\beta_k.
\]
In addition,
\[
I_{k+1}^{ac,c}-I_k^{ac,c}=-h\beta_k.
\]
Therefore we get
\[
\tilde E_{k+1}^{ac,c}-\tilde E_k^{ac,c}\le\Big(h-\frac{1}{\sqrt{L}}\Big)t_k^2h|\nabla f(y_k)|^2.\qquad\square
\]
Then, we obtain the following result.

Corollary 5.18. For all \(h<\frac{1}{\sqrt{L}}\), we have
\[
\sup_{0\le i\le k}\big(|v_i-x^*|+i|x_i-x^*|\big)\le M_k,
\]
where M_k depends on
\[
\sum_{i=0}^{k-1}i^2|e_i|.\tag{48}
\]
Moreover, if e_k has a finite second moment, i.e. (48) is bounded, then M_k is bounded and
\[
\min_{0\le i\le k}|\nabla f(y_i)|^2\le\frac{3\sqrt{L}}{h^3(1-\sqrt{L}h)k^3}\left(E^{ac,c}(x_0,v_0)+CM_k\sum_{i=0}^{k-1}i^2|e_i|\right)=O\left(\frac{1}{k^3}\right).
\]
Proof. From Proposition 5.17, \(\tilde E_k^{ac,c}\) is decreasing, and by µ-strong convexity and L-smoothness we have
\begin{align*}
\frac{\mu t_k^2}{2}|x_k-x^*|^2+2|v_k-x^*|^2&\le E_k^{ac,c}\\
&\le E_0^{ac,c}+h\sum_{i=0}^{k-1}t_i^2|e_i|^2+h\sum_{i=0}^{k-1}\Big(2|v_i-x^*|+\frac{t_i}{\sqrt{L}}|\nabla f(y_i)|\Big)t_i|e_i|\\
&\le E_0^{ac,c}+h\sum_{i=0}^{k-1}t_i^2|e_i|^2+h\sum_{i=0}^{k-1}\Big(2|v_i-x^*|+\sqrt{L}t_i|y_i-x^*|\Big)t_i|e_i|\\
&\le E_0^{ac,c}+h\sum_{i=0}^{k-1}t_i^2|e_i|^2+h\sum_{i=0}^{k-1}\Big((2+\sqrt{L}t_i\lambda_h)|v_i-x^*|+\sqrt{L}t_i(1-\lambda_h)|x_i-x^*|\Big)t_i|e_i|.
\end{align*}
Then, Gronwall's Lemma gives a bound on \(\sup_{k\ge1}|v_k-x^*|\) and \(\sup_{k\ge1}k|x_k-x^*|\), which depends on (48) and is bounded as soon as (48) is. The rest of the proof is similar to the unperturbed case, Corollary 5.5, taking advantage of the gap in the dissipation of \(\tilde E_k^{ac,c}\), Proposition 5.17. □
5.5. Variable time step: convergence in expectation. To conclude this section, we consider the case of a variable time step h_k and a zero-mean, fixed-variance error e_k, i.e. e_k satisfies (3.6). We do not assume strong convexity anymore, only convexity.

From (5.11) and Remark 5.12, we can deduce that if x_k, v_k are generated by (Per-FE-C) with time steps h_k, \(t_k=\sum_{i=0}^k h_i\), and
\[
y_k=\Big(1-\frac{2h_k}{t_k}\Big)x_k+\frac{2h_k}{t_k}v_k,
\]
then (5.11) becomes
\[
E_{k+1}^{ac,c}-E_k^{ac,c}\le-h_k^2(f(x_k)-f^*)+\Big(h_k-\frac{1}{\sqrt{L}}\Big)t_k^2h_k|\nabla f(y_k)|^2+h_k\beta_k.
\]
Since \(\mathbb E[\beta_k]=h_kt_k^2\sigma^2\), then
\[
\mathbb E[E_{k+1}^{ac,c}]-E_k^{ac,c}\le h_k^2t_k^2\sigma^2,
\]
and
\[
\mathbb E[E_k^{ac,c}]\ge t_{k-1}^2\big(\mathbb E[f(x_k)]-f^*\big),
\]
which corresponds to the conditions (3.6) with
\[
a_1=a_2=0,\qquad a_3=2,\qquad b_1=0\quad\text{and}\quad b_2=1.
\]
Then we can apply Proposition 3.23 to obtain the following convergence results.

Proposition 5.19. Assume \(h_k:=k^{-\alpha}\) and \(t_k=\sum_{i=0}^k h_i\); then the following holds:
• If \(\alpha\in\left(\frac34,1\right)\), then
\[
\mathbb E[f(x_k)]-f^*=O\left(\frac{1}{k^{2-2\alpha}}\right);
\]
• if \(\alpha=1\), then
\[
\mathbb E[f(x_k)]-f^*=O\left(\frac{1}{\ln(k)^2}\right).
\]
Remark 5.20. Compared to gradient descent, Proposition 4.5, the admissible range of the power α is smaller, but for a fixed \(h_k=k^{-\alpha}\) with \(\alpha\in\left(\frac34,1\right)\), the rate in Proposition 5.19 is accelerated. Indeed,
\[
\text{accelerated rate}=(\text{GD rate})^2.
\]
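To illustrate Proposition 5.19, the recursion \(\mathbb E[E_{k+1}^{ac,c}]-E_k^{ac,c}\le h_k^2t_k^2\sigma^2\) can be unrolled into the bound \((E_0+\sigma^2\sum_{i<k}h_i^2t_i^2)/t_{k-1}^2\). The sketch below is our own numerical check (the values of \(E_0\), \(\sigma^2\) and the horizon are arbitrary): it estimates the empirical decay exponent of this bound for \(\alpha=0.9\) and compares it with the predicted \(2-2\alpha\); since the asymptotic exponent is approached slowly, a loose tolerance is used.

```python
# Sanity check of Proposition 5.19 in the convergent-sum regime alpha in (3/4, 1).
# Unrolling E[E_{k+1}] <= E[E_k] + h_k^2 t_k^2 sigma^2 gives the bound
#   B(k) = (E0 + sigma^2 * sum_{i<k} h_i^2 t_i^2) / t_{k-1}^2,
# which should decay like k^{-(2 - 2*alpha)}.  E0, sigma2 are arbitrary.
import math

alpha, E0, sigma2 = 0.9, 1.0, 1.0
K = 200_000
t, S = 0.0, 0.0
bounds = {}
for k in range(1, K + 1):
    h = k ** (-alpha)
    t += h                      # t_k = sum_{i<=k} h_i
    S += (h * t) ** 2           # sum of h_i^2 t_i^2 (converges for alpha > 3/4)
    if k in (K // 2, K):
        bounds[k] = (E0 + sigma2 * S) / t ** 2

# Empirical decay exponent between k = K/2 and k = K; finite-k effects make it
# overshoot the asymptotic value 2 - 2*alpha = 0.2 slightly.
rate = -math.log(bounds[K] / bounds[K // 2]) / math.log(2)
assert abs(rate - (2 - 2 * alpha)) < 0.15
```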
6. Accelerated method: strongly convex case

This section is devoted to the analysis of (H-ODE-SC), and in particular of its first-order system (1st-ODE-SC), in continuous and discrete time, using a Lyapunov analysis in the strongly convex case. We then extend the obtained results to the perturbed case, provided that the error decreases fast enough. Finally, we present a convergence rate for the expectation of f in the case of a variable time step h_k and an error with fixed variance.
6.1. Unperturbed gradient: continuous and discrete time. Define the perturbed ODE by setting z = (x, v) and \(g(t,x,v,\nabla f(y^\epsilon);\epsilon)=g_1(t,x,v;\epsilon)+g_2(t,x,v)\nabla f(y^\epsilon)\), as follows:
\[
g_1(t,x,v;\epsilon)=\frac{\sqrt{\mu}}{1+\epsilon\sqrt{\mu}}\begin{pmatrix}-1&1\\1&-1\end{pmatrix}\begin{pmatrix}x\\v\end{pmatrix}
\quad\text{and}\quad
g_2(t,x,v)=\begin{pmatrix}-\frac{1}{\sqrt{L}}\\-\frac{1}{\sqrt{\mu}}\end{pmatrix},
\]
with
\[
y^\epsilon=x+\frac{\epsilon\sqrt{\mu}}{1+\epsilon\sqrt{\mu}}(v-x).
\]
Then, we recall that the second-order equation with Hessian damping (H-ODE-SC) can be rewritten as
\[
\begin{cases}
\dot x=\sqrt{\mu}(v-x)-\frac{1}{\sqrt{L}}\nabla f(x),\\
\dot v=\sqrt{\mu}(x-v)-\frac{1}{\sqrt{\mu}}\nabla f(x),
\end{cases}\tag{1st-ODE-SC}
\]
which is the case ε = 0. Given a small time step h > 0, the ε = h perturbation of the forward Euler method for (1st-ODE-SC) gives
\[
\begin{cases}
x_{k+1}-x_k=\lambda_h(v_k-x_k)-\frac{h}{\sqrt{L}}\nabla f(y_k),\\[2pt]
v_{k+1}-v_k=\lambda_h(x_k-v_k)-\frac{h}{\sqrt{\mu}}\nabla f(y_k),\\[2pt]
y_k=(1-\lambda_h)x_k+\lambda_h v_k,\qquad\lambda_h=\frac{h\sqrt{\mu}}{1+h\sqrt{\mu}}.
\end{cases}\tag{FE-SC}
\]
Now define the Lyapunov function associated to this problem.

Definition 6.1. Define the continuous-time Lyapunov function E^{ac,sc} by
\[
E^{ac,sc}(x,v)=f(x)-f^*+\frac{\mu}{2}|v-x^*|^2,\tag{49}
\]
and the discrete-time Lyapunov function by
\[
E_k^{ac,sc}=E^{ac,sc}(x_k,v_k)=f(x_k)-f^*+\frac{\mu}{2}|v_k-x^*|^2.\tag{50}
\]
In the next proposition, we show that E^{ac,sc} is a rate-generating Lyapunov function, in the sense of Definition 3.21, for the system (1st-ODE-SC) and its explicit discretization (FE-SC).

Proposition 6.2. Suppose f is µ-strongly convex and L-smooth. Let (x, v) be a solution of (1st-ODE-SC) and (x_k, v_k) be sequences generated by (FE-SC). Let E^{ac,sc} be given by (49). Then E^{ac,sc} is a continuous Lyapunov function with \(r_{E^{ac,sc}}=\sqrt{\mu}\) and \(a_{E^{ac,sc}}=\frac{1}{\sqrt{L}}\), i.e.
\[
\frac{d}{dt}E^{ac,sc}(x,v)\le-\sqrt{\mu}E^{ac,sc}(x,v)-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}|v-x|^2,\tag{51}
\]
and E_k^{ac,sc} is a discrete Lyapunov function, with \(L_{E^{ac,sc}}=1\), for the sequence generated by (FE-SC): for all k ≥ 0 and \(h\le\frac{1}{\sqrt{L}}\),
\[
E_{k+1}^{ac,sc}\le(1-h\sqrt{\mu})E_k^{ac,sc}+\Big(h^2-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_k)|^2+\Big(\frac{h\sqrt{\mu}L}{2}-\frac{\sqrt{\mu}}{2h}\Big)|x_k-y_k|^2.\tag{52}
\]
Then we retrieve the usual optimal rates in the continuous and discrete cases.
Corollary 6.3. Let f be a µ-strongly convex and L-smooth function. Let (x(t), v(t)) be a solution to (1st-ODE-SC); then for all t > 0,
\[
f(x(t))-f^*+\frac{\mu}{2}|v(t)-x^*|^2\le\exp(-\sqrt{\mu}t)E^{ac,sc}(x_0,v_0).
\]
Furthermore, let x_k, v_k be given by (FE-SC). Then for all k ≥ 0 and \(h\le\frac{1}{\sqrt{L}}\),
\[
f(x_k)-f^*+\frac{\mu}{2}|v_k-x^*|^2\le(1-h\sqrt{\mu})^k\Big(f(x_0)-f^*+\frac{\mu}{2}|v_0-x^*|^2\Big).
\]
Corollary 6.3 follows immediately from Proposition 6.2, so in what follows we focus on the proofs of (51) and (52).
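The scheme (FE-SC) is straightforward to implement. The sketch below is our own illustration on a separable quadratic (the test function, dimension, iteration count and tolerance are arbitrary choices, not from the paper); it checks the discrete bound of Corollary 6.3 at every iteration.

```python
import math

# f(x) = (mu*x1^2 + L*x2^2)/2 is mu-strongly convex and L-smooth, with
# minimizer x* = 0 and f* = 0; mu, L and the starting point are illustrative.
mu, L = 0.1, 4.0
f = lambda x: 0.5 * (mu * x[0] ** 2 + L * x[1] ** 2)
grad = lambda x: [mu * x[0], L * x[1]]

h = 1.0 / math.sqrt(L)
lam = h * math.sqrt(mu) / (1 + h * math.sqrt(mu))   # lambda_h in (FE-SC)

x, v = [1.0, -2.0], [0.0, 0.0]
E0 = f(x) + 0.5 * mu * sum(vi ** 2 for vi in v)     # E^{ac,sc}(x_0, v_0)

for k in range(1, 201):
    y = [(1 - lam) * xi + lam * vi for xi, vi in zip(x, v)]
    g = grad(y)
    x_new = [xi + lam * (vi - xi) - (h / math.sqrt(L)) * gi
             for xi, vi, gi in zip(x, v, g)]
    v_new = [vi + lam * (xi - vi) - (h / math.sqrt(mu)) * gi
             for xi, vi, gi in zip(x, v, g)]
    x, v = x_new, v_new
    Ek = f(x) + 0.5 * mu * sum(vi ** 2 for vi in v)
    # Discrete bound of Corollary 6.3 (small slack for floating point).
    assert Ek <= (1 - h * math.sqrt(mu)) ** k * E0 * (1 + 1e-9)
```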
The discrete Lyapunov function E^{ac,sc} was used to prove a rate in the strongly convex case in [WRJ16]; the proof of (52) can be found in [WRJ16, Theorem 6]. For completeness, we also provide the proof here. We split it into two parts: first we prove the gap inequality in the continuous case, and then we consider the discrete case.
Proof of (51). Using (1st-ODE-SC), we obtain
\begin{align*}
\frac{d}{dt}E^{ac,sc}(x,v)&=\langle\nabla f(x),\dot x\rangle+\mu\langle v-x^*,\dot v\rangle\\
&=\sqrt{\mu}\langle\nabla f(x),v-x\rangle-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\mu\sqrt{\mu}\langle v-x^*,v-x\rangle-\sqrt{\mu}\langle\nabla f(x),v-x^*\rangle\\
&=-\sqrt{\mu}\langle\nabla f(x),x-x^*\rangle-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}\big(|v-x^*|^2+|v-x|^2-|x-x^*|^2\big).
\end{align*}
By strong convexity, we have
\begin{align*}
\frac{d}{dt}E^{ac,sc}(x,v)&\le-\sqrt{\mu}\Big(f(x)-f^*+\frac{\mu}{2}|x-x^*|^2\Big)-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}\big(|v-x^*|^2+|v-x|^2-|x-x^*|^2\big)\\
&\le-\sqrt{\mu}E^{ac,sc}(x,v)-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}|v-x|^2,
\end{align*}
which establishes (51). □
In order to prove the gap inequality in the discrete case, note that, since the gradients are evaluated at y_k and not at x_k, the first step is to use strong convexity and L-smoothness to estimate the differences of E_k^{ac,sc} in terms of gradients evaluated at y_k.
Lemma 6.4. Suppose that f is a µ-strongly convex and L-smooth function. Then
\[
f(x_{k+1})-f(x_k)\le\langle\nabla f(y_k),y_k-x_k\rangle-\frac{\mu}{2}|y_k-x_k|^2+\Big(\frac{h^2}{2}-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_k)|^2.\tag{53}
\]

Proof. First, we remark that
\begin{align*}
f(x_{k+1})-f(x_k)&=f(x_{k+1})-f(y_k)+f(y_k)-f(x_k)\\
&\le\langle\nabla f(y_k),x_{k+1}-y_k\rangle+\frac{L}{2}|x_{k+1}-y_k|^2+\langle\nabla f(y_k),y_k-x_k\rangle-\frac{\mu}{2}|y_k-x_k|^2.
\end{align*}
Since the first line of (FE-SC) can be rewritten as
\[
x_{k+1}=y_k-\frac{h}{\sqrt{L}}\nabla f(y_k),
\]
we obtain (53). □
Now we can prove (52).
Proof of (52). In the proof we estimate the linear term \(\langle y_k-x_k,\nabla f(y_k)\rangle\) in terms of \(\langle y_k-x^*,\nabla f(y_k)\rangle\), plus a correction which is controlled by the gap (the negative quadratic) in (53) and by the quadratic term in \(E_k^{ac,sc}\).

The second term in the Lyapunov function gives, using 1-smoothness of the quadratic term in \(E_k^{ac,sc}\),
\begin{align*}
\frac{\mu}{2}|v_{k+1}-x^*|^2-\frac{\mu}{2}|v_k-x^*|^2
&=\mu\langle v_k-x^*,v_{k+1}-v_k\rangle+\frac{\mu}{2}|v_{k+1}-v_k|^2\\
&=-\mu\lambda_h\langle v_k-x^*,v_k-x_k\rangle-h\sqrt{\mu}\langle v_k-x^*,\nabla f(y_k)\rangle+\frac{\mu}{2}|v_{k+1}-v_k|^2.
\end{align*}
Before going on, by definition of y_k in (FE-SC) as a convex combination of x_k and v_k, we have
\[
\lambda_h(v_k-x_k)=\frac{\lambda_h}{1-\lambda_h}(v_k-y_k)=h\sqrt{\mu}\,(v_k-y_k)
\quad\text{and}\quad
v_k-y_k=\frac{1-\lambda_h}{\lambda_h}(y_k-x_k)=\frac{1}{h\sqrt{\mu}}(y_k-x_k),
\]
which gives
\begin{align*}
\frac{\mu}{2}|v_{k+1}-x^*|^2-\frac{\mu}{2}|v_k-x^*|^2
&=-h\mu\sqrt{\mu}\langle v_k-x^*,v_k-y_k\rangle-h\sqrt{\mu}\langle v_k-y_k,\nabla f(y_k)\rangle-h\sqrt{\mu}\langle y_k-x^*,\nabla f(y_k)\rangle+\frac{\mu}{2}|v_{k+1}-v_k|^2\\
&\le-\frac{h\mu\sqrt{\mu}}{2}\big(|v_k-x^*|^2+|v_k-y_k|^2-|y_k-x^*|^2\big)-\langle y_k-x_k,\nabla f(y_k)\rangle-h\sqrt{\mu}\Big(f(y_k)-f^*+\frac{\mu}{2}|y_k-x^*|^2\Big)\\
&\qquad+\frac{\mu}{2}\Big(|y_k-x_k|^2+\frac{2h}{\sqrt{\mu}}\langle y_k-x_k,\nabla f(y_k)\rangle+\frac{h^2}{\mu}|\nabla f(y_k)|^2\Big),
\end{align*}
by strong convexity. Then, using the L-smoothness of f, we obtain
\[
\frac{\mu}{2}\big(|v_{k+1}-x^*|^2-|v_k-x^*|^2\big)\le-h\sqrt{\mu}E_k^{ac,sc}-\langle y_k-x_k,\nabla f(y_k)\rangle+\Big(\frac{\mu}{2}+\frac{h\sqrt{\mu}L}{2}-\frac{\sqrt{\mu}}{2h}\Big)|y_k-x_k|^2+\frac{h^2}{2}|\nabla f(y_k)|^2.\tag{54}
\]
Combining (53) and (54), we obtain (52). □

Then, we can deduce from the gap in (6.2) an accelerated convergence for min0≤i≤k |∇f (yi )|2 and
min0≤i≤k |xi − yi |2 for h < √1L .

Corollary 6.5. For all \(h<\frac{1}{\sqrt{L}}\) and k ≥ 1,
\[
\min_{0\le i\le k}|\nabla f(y_i)|^2\le\frac{\sqrt{L}}{h(1-h\sqrt{L})}\,\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}\,E_0^{ac,sc},
\]
and
\[
\min_{0\le i\le k}|x_i-y_i|^2\le\frac{2h}{\sqrt{\mu}(1-h^2L)}\,\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}\,E_0^{ac,sc}.
\]
Remark 6.6. The rate in Corollary 6.5 is better than the usual rate \((1-h\sqrt{\mu})^k\). Indeed, for k ≥ 1,
\[
(1-h\sqrt{\mu})^k\big((1-h\sqrt{\mu})^{-k}-1\big)=1-(1-h\sqrt{\mu})^k\ge h\sqrt{\mu},
\]
and then
\[
(1-h\sqrt{\mu})^k\ge\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}.
\]
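The elementary inequality behind Remark 6.6 is easy to check numerically; the sketch below is our own illustration (the grid of values for \(q=h\sqrt{\mu}\) and k is arbitrary).

```python
# Check: for q in (0,1) and k >= 1,  (1-q)^k >= q / ((1-q)^{-k} - 1),
# equivalently 1 - (1-q)^k >= q, which holds since (1-q)^k <= 1-q.
for q in [0.01, 0.1, 0.3, 0.5, 0.9]:
    for k in range(1, 50):
        lhs = (1 - q) ** k
        rhs = q / ((1 - q) ** (-k) - 1)
        assert lhs >= rhs - 1e-12
```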
Proof. From (52), we have
\begin{align*}
E_k^{ac,sc}-(1-h\sqrt{\mu})^kE_0^{ac,sc}
&=\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{k-1-i}\big(E_{i+1}^{ac,sc}-(1-h\sqrt{\mu})E_i^{ac,sc}\big)\\
&\le\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{k-1-i}\left(\Big(h^2-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_i)|^2+\Big(\frac{h\sqrt{\mu}L}{2}-\frac{\sqrt{\mu}}{2h}\Big)|x_i-y_i|^2\right)\\
&\le-\min_{0\le i\le k}\left(\Big(\frac{h}{\sqrt{L}}-h^2\Big)|\nabla f(y_i)|^2+\Big(\frac{\sqrt{\mu}}{2h}-\frac{h\sqrt{\mu}L}{2}\Big)|x_i-y_i|^2\right)\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{k-1-i}.
\end{align*}
Since \(E_k^{ac,sc}\ge0\) and \(\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{k-1-i}=\frac{1-(1-h\sqrt{\mu})^k}{h\sqrt{\mu}}\), we deduce
\[
\min_{0\le i\le k}\left(\Big(\frac{h}{\sqrt{L}}-h^2\Big)|\nabla f(y_i)|^2+\Big(\frac{\sqrt{\mu}}{2h}-\frac{h\sqrt{\mu}L}{2}\Big)|x_i-y_i|^2\right)\le\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}\,E_0^{ac,sc},
\]
which concludes the proof. □
Remark 6.7. In the proof we also show that, for all \(h<\frac{1}{\sqrt{L}}\),
\[
\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i-1}|\nabla f(y_i)|^2\le\frac{\sqrt{L}}{h(1-h\sqrt{L})}\,E_0^{ac,sc},
\]
and
\[
\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i-1}|x_i-y_i|^2\le\frac{2h}{\sqrt{\mu}(1-h^2L)}\,E_0^{ac,sc}.
\]
6.2. Perturbed gradient: continuous time. Now, define the perturbed version of (1st-ODE-SC) by
\[
\begin{cases}
\dot x=\sqrt{\mu}(v-x)-\frac{1}{\sqrt{L}}(\nabla f(x)+e(t)),\\
\dot v=\sqrt{\mu}(x-v)-\frac{1}{\sqrt{\mu}}(\nabla f(x)+e(t)),
\end{cases}\tag{Per-1st-ODE-SC}
\]
where e is a locally integrable function.
Definition 6.8. Define the continuous-time perturbed Lyapunov function \(\tilde E^{ac,sc}\) by
\[
\tilde E^{ac,sc}(t,x,v):=E^{ac,sc}(x,v)+I^{ac,sc}(t,x(\cdot),v(\cdot)),
\]
where E^{ac,sc} is given by (49) and
\begin{align*}
I^{ac,sc}(t,x(\cdot),v(\cdot))&:=\exp(-\sqrt{\mu}t)\int_0^t\exp(\sqrt{\mu}s)\Big\langle\sqrt{\mu}(v(s)-x^*)+\frac{1}{\sqrt{L}}\nabla f(x(s)),e(s)\Big\rangle\,ds\\
&=:-\exp(-\sqrt{\mu}t)J(t,x(\cdot),v(\cdot)).
\end{align*}
The next proposition gives the dissipation of \(\tilde E^{ac,sc}\).

Proposition 6.9. Let (x, v) be a solution of (Per-1st-ODE-SC) with initial condition (x(0), v(0)) = (x_0, v_0); then
\[
\frac{d}{dt}\tilde E^{ac,sc}(t,x,v)\le-\sqrt{\mu}\tilde E^{ac,sc}-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}|v-x|^2,
\]
and
\[
f(x(t))-f^*+\frac{\mu}{2}|v(t)-x^*|^2\le\exp(-\sqrt{\mu}t)\big(E^{ac,sc}(x_0,v_0)+J(t,x(\cdot),v(\cdot))\big).
\]
Proof. Using (51), we obtain
\begin{align*}
\frac{d}{dt}\tilde E^{ac,sc}(t,x,v)&\le\frac{d}{dt}E^{ac,sc}(x,v)-\sqrt{\mu}I^{ac,sc}(t)+\Big\langle\sqrt{\mu}(v-x^*)+\frac{1}{\sqrt{L}}\nabla f(x),e\Big\rangle\\
&\le-\sqrt{\mu}E^{ac,sc}(x,v)-\Big\langle\sqrt{\mu}(v-x^*)+\frac{1}{\sqrt{L}}\nabla f(x),e\Big\rangle-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}|v-x|^2\\
&\qquad-\sqrt{\mu}I^{ac,sc}(t)+\Big\langle\sqrt{\mu}(v-x^*)+\frac{1}{\sqrt{L}}\nabla f(x),e\Big\rangle\\
&\le-\sqrt{\mu}\tilde E^{ac,sc}(t,x,v)-\frac{1}{\sqrt{L}}|\nabla f(x)|^2-\frac{\mu\sqrt{\mu}}{2}|v-x|^2.
\end{align*}
Then \(t\mapsto\tilde E^{ac,sc}(t,x(t),v(t))\) is decreasing, and the last inequality follows directly. □
Then, we obtain:

Corollary 6.10. Let (x, v) be a solution of (Per-1st-ODE-SC) with initial condition (x(0), v(0)) = (x_0, v_0). Then,
\[
\sup_{0\le s\le t}\Big(2|v(s)-x^*|+\frac{1}{\sqrt{L}}|\nabla f(x(s))|\Big)\le M(t),
\]
where M(t) depends on
\[
\int_0^t\exp(\sqrt{\mu}s)|e(s)|\,ds.\tag{57}
\]
Assume that (57) is finite; then \(M\in L^\infty(\mathbb R_+)\) and
\[
f(x(t))-f^*+\frac{\mu}{2}|v(t)-x^*|^2\le\exp(-\sqrt{\mu}t)\left(E^{ac,sc}(x_0,v_0)+M(t)\int_0^t\exp(\sqrt{\mu}s)|e(s)|\,ds\right)=O(\exp(-\sqrt{\mu}t)).
\]
Proof. By Proposition 6.9, we have
\[
f(x(t))-f^*+\frac{\mu}{2}|v(t)-x^*|^2\le\exp(-\sqrt{\mu}t)\left(f(x_0)-f^*+\frac{\mu}{2}|v_0-x^*|^2+\int_0^t\Big|\sqrt{\mu}(v(s)-x^*)+\frac{1}{\sqrt{L}}\nabla f(x(s))\Big|\exp(\sqrt{\mu}s)|e(s)|\,ds\right).
\]
The first part of the statement follows from Gronwall's Lemma, which then concludes the proof. □
6.3. Perturbed gradient: discrete time. Consider the h-perturbation of the forward Euler discretization of (Per-1st-ODE-SC), given by
\[
\begin{cases}
x_{k+1}-x_k=\lambda_h(v_k-x_k)-\frac{h}{\sqrt{L}}(\nabla f(y_k)+e_k),\\[2pt]
v_{k+1}-v_k=\lambda_h(x_k-v_k)-\frac{h}{\sqrt{\mu}}(\nabla f(y_k)+e_k),\\[2pt]
y_k=(1-\lambda_h)x_k+\lambda_h v_k,\qquad\lambda_h=\frac{h\sqrt{\mu}}{1+h\sqrt{\mu}},
\end{cases}\tag{Per-FE-SC}
\]
where e_k is a given error.
Inspired by the continuous framework, define the discrete perturbed Lyapunov function \(\tilde E_k^{ac,sc}\).

Definition 6.11. Define \(\tilde E_k^{ac,sc}:=E_k^{ac,sc}+I_k^{ac,sc}\), where E_k^{ac,sc} is given by (50) and
\begin{align*}
I_k^{ac,sc}&:=-h(1-h\sqrt{\mu})^k\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i-1}\beta_i\\
&=:-h(1-h\sqrt{\mu})^kJ_k^{ac,sc},
\end{align*}
where
\[
\beta_i:=-\Big\langle\sqrt{\mu}(x_i-y_i+v_i-x^*)+\frac{1}{\sqrt{L}}\nabla f(y_i),e_i\Big\rangle+2h\Big\langle\nabla f(y_i)+\frac{e_i}{2},e_i\Big\rangle.
\]
Then we obtain the following convergence result for sequences generated by (Per-FE-SC).
Proposition 6.12. Let x_k, v_k be two sequences generated by the scheme (Per-FE-SC) with initial condition (x_0, v_0). Suppose that \(h\le\frac{1}{\sqrt{L}}\); then
\[
E_{k+1}^{ac,sc}\le(1-h\sqrt{\mu})E_k^{ac,sc}+\Big(h^2-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_k)|^2-\frac{\sqrt{\mu}}{2h}\big(1-h^2L\big)|x_k-y_k|^2+h\beta_k.\tag{60}
\]
Then \(\tilde E_k^{ac,sc}\) is decreasing and
\[
f(x_k)-f^*+\frac{\mu}{2}|v_k-x^*|^2\le(1-h\sqrt{\mu})^k\big(E_0^{ac,sc}+J_k^{ac,sc}\big).
\]
Proof of Proposition 6.12. First, arguing as in Proposition 6.2,
\begin{align*}
f(x_{k+1})-f(x_k)\le{}&\langle\nabla f(y_k),y_k-x_k\rangle-\frac{\mu}{2}|y_k-x_k|^2+\Big(\frac{h^2}{2}-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_k)|^2\\
&-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+h^2\Big\langle\nabla f(y_k)+\frac{e_k}{2},e_k\Big\rangle,
\end{align*}
and
\begin{align*}
\frac{\mu}{2}|v_{k+1}-x^*|^2-\frac{\mu}{2}|v_k-x^*|^2\le{}&-h\sqrt{\mu}E_k^{ac,sc}-\langle y_k-x_k,\nabla f(y_k)\rangle+\frac{\sqrt{\mu}}{2}\Big(\sqrt{\mu}+Lh-\frac{1}{h}\Big)|x_k-y_k|^2+\frac{h^2}{2}|\nabla f(y_k)|^2\\
&-h\sqrt{\mu}\langle v_k-x^*+x_k-y_k,e_k\rangle+h^2\Big\langle\nabla f(y_k)+\frac{e_k}{2},e_k\Big\rangle.
\end{align*}
Summing these two inequalities,
\begin{align*}
E_{k+1}^{ac,sc}-E_k^{ac,sc}\le{}&-h\sqrt{\mu}E_k^{ac,sc}+\Big(h^2-\frac{h}{\sqrt{L}}\Big)|\nabla f(y_k)|^2-\frac{\sqrt{\mu}}{2h}\big(1-h^2L\big)|x_k-y_k|^2\\
&-h\sqrt{\mu}\langle x_k-y_k+v_k-x^*,e_k\rangle-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+2h^2\Big\langle\nabla f(y_k)+\frac{e_k}{2},e_k\Big\rangle.
\end{align*}
For the term \(I_k^{ac,sc}\), we obtain
\[
I_{k+1}^{ac,sc}-I_k^{ac,sc}\le-h\sqrt{\mu}I_k^{ac,sc}-h\beta_k.
\]
Putting everything together, we obtain (60); then \(\tilde E_k^{ac,sc}\) is decreasing, which concludes the proof. □
Remark that
\[
x_k-y_k+v_k-x^*=\lambda_h(x_k-x^*)+(1-\lambda_h)(v_k-x^*)
\quad\text{and}\quad
|\nabla f(y_k)|\le L|y_k-x^*|\le L(1-\lambda_h)|x_k-x^*|+L\lambda_h|v_k-x^*|,
\]
so that
\[
\Big|\sqrt{\mu}(x_k-y_k+v_k-x^*)+\frac{1}{\sqrt{L}}\nabla f(y_k)\Big|\le(\lambda_h(1-L)+L)|x_k-x^*|+(1+\lambda_h(L-1))|v_k-x^*|.
\]
Therefore, we achieve the same rate as in the unperturbed case.
Corollary 6.13. Under the assumptions of Proposition 6.12, \(\sup_{0\le i\le k}|v_i-x^*|+\sup_{i\ge0}|x_i-x^*|\le M_k\), where M_k depends on
\[
\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i}|e_i|.\tag{65}
\]
If (65) is bounded as \(k\to+\infty\), then \(M_k\in\ell^\infty\) and
\[
f(x_k)-f^*+\frac{\mu}{2}|v_k-x^*|^2\le(1-h\sqrt{\mu})^k\left(E_0^{ac,sc}+CM_k\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i}|e_i|\right)=O\big((1-h\sqrt{\mu})^k\big).
\]
Proof. Since \(J_k^{ac,sc}\) satisfies
\begin{align*}
J_k^{ac,sc}&\le h\sum_{i=0}^{k-1}\Big|\sqrt{\mu}(x_i-y_i+v_i-x^*)+\frac{1}{\sqrt{L}}\nabla f(y_i)\Big|(1-h\sqrt{\mu})^{-i-1}|e_i|\\
&\le h\sum_{i=0}^{k-1}\big((\lambda_h(1-L)+L)|x_i-x^*|+(1+\lambda_h(L-1))|v_i-x^*|\big)(1-h\sqrt{\mu})^{-i-1}|e_i|,
\end{align*}
combined with \(E_k^{ac,sc}\ge\frac{\mu}{2}\big(|x_k-x^*|^2+|v_k-x^*|^2\big)\), we conclude, as in the convex case, by applying the discrete Gronwall lemma and (65). □
Moreover, taking advantage of the gap in the dissipation of \(\tilde E_k^{ac,sc}\), we obtain:

Corollary 6.14. For all \(h<\frac{1}{\sqrt{L}}\) and k ≥ 1,
\[
\min_{0\le i\le k}|\nabla f(y_i)|^2\le\frac{\sqrt{L}}{h(1-h\sqrt{L})}\,\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}\big(E_0^{ac,sc}+I_\infty\big),
\]
and
\[
\min_{0\le i\le k}|x_i-y_i|^2\le\frac{2h}{\sqrt{\mu}(1-h^2L)}\,\frac{h\sqrt{\mu}}{(1-h\sqrt{\mu})^{-k}-1}\big(E_0^{ac,sc}+I_\infty\big).
\]

Proof. The proof is similar to that of Corollary 6.5, using (60). □
Remark 6.15. As in the deterministic case, we have
\[
\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i-1}|\nabla f(y_i)|^2\le\frac{\sqrt{L}}{h(1-h\sqrt{L})}\big(E_0^{ac,sc}+I_\infty\big),
\]
and
\[
\sum_{i=0}^{k-1}(1-h\sqrt{\mu})^{-i-1}|x_i-y_i|^2\le\frac{2h}{\sqrt{\mu}(1-h^2L)}\big(E_0^{ac,sc}+I_\infty\big).
\]
6.4. Variable time step and convergence in expectation. Now we consider the case of a variable time step h_k and a zero-mean, constant-variance error e_k, (3.6).

As in the convex case, the proof of (60) remains valid for an adaptive time step h_k, and then, since \(\mathbb E[\beta_k]=\frac{h_k\sigma^2}{2}\),
\[
\mathbb E[E_{k+1}^{ac,sc}]\le(1-\sqrt{\mu}h_k)E_k^{ac,sc}+\frac{h_k^2\sigma^2}{2}.
\]
We are in the case where \(r_{E^{ac,sc}}=\sqrt{\mu}>0\), \(L_{E^{ac,sc}}=1\) and \(g_2^2=1\) (Proposition 6.2), so Proposition 3.22 gives:
Proposition 6.16. If
\[
h_k^{ac,sc}:=\frac{2}{\sqrt{\mu}\big(k+(\alpha^{ac,sc})^{-1}(E_0^{ac,sc})^{-1}\big)},\qquad
\alpha^{ac,sc}:=\frac{\mu}{2\sigma^2},
\]
and \(t_k=\sum_{i=0}^k h_i\), then the following holds:
\[
\mathbb E[E^{ac,sc}(x_k)]\le\frac{1}{\alpha^{ac,sc}k+(E_0^{ac,sc})^{-1}}=\frac{2\sigma^2}{\mu k+2\sigma^2(E_0^{ac,sc})^{-1}}.
\]
Remark 6.17. If we compare this result to the gradient descent case, Proposition 4.8, starting with \(E_0^{sc}=E_0^{ac,sc}=E_0\), we remark that since \(\alpha^{ac,sc}\ge\alpha^{sc}\), even if the time step has to be smaller in the accelerated case, \(h_k^{ac,sc}\le h_k^{sc}\), we obtain an accelerated rate:
\[
\frac{2\sigma^2}{\mu k+2\sigma^2E_0^{-1}}=\frac{1}{\alpha^{ac,sc}k+E_0^{-1}}\le\frac{1}{\alpha^{sc}k+E_0^{-1}}=\frac{2(C_f+1)\sigma^2}{\mu k+2(C_f+1)\sigma^2E_0^{-1}}.
\]
Acknowledgments. Research supported by AFOSR FA9550-18-1-0167 (A.O.).
References
[AABR02] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and P. Redont. A second-order gradient-like dissipative dynamical
system with Hessian-driven damping. Application to optimization and mechanics. Journal de mathématiques pures
et appliquées, 81(8):747–780, 2002.
[ACPR18] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dynamics and
algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1-2):123–175, 2018.
[APR16] Hedy Attouch, Juan Peypouquet, and Patrick Redont. Fast convex optimization via inertial dynamics with hessian
driven damping. Journal of Differential Equations, 261(10):5734–5783, 2016.
[AZ17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the
49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1200–1205, New York, NY,
USA, 2017. ACM. URL: http://doi.acm.org/10.1145/3055399.3055448, doi:10.1145/3055399.3055448 .
[B+ 15] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine
Learning, 8(3-4):231–357, 2015.
[BCN16] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv
preprint arXiv:1606.04838, 2016.
[Bec17] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[Bot91] Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12, 1991.
[BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM
journal on imaging sciences, 2(1):183–202, 2009.
[Bub14] S. Bubeck. Convex Optimization: Algorithms and Complexity. ArXiv e-prints, May 2014. arXiv:1405.4980.
[FB15] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. In Conference on
Learning Theory, pages 658–695, 2015.
[FGKS15] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and
faster stochastic algorithms for empirical risk minimization. In Francis R. Bach and David M. Blei, editors,
ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2540–2548. JMLR.org, 2015. URL:
http://dblp.uni-trier.de/db/conf/icml/icml2015.html#FrostigGKS15.
[Ise09] Arieh Iserles. A first course in the numerical analysis of differential equations. Number 44. Cambridge university
press, 2009.
[JKK+ 18] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic
gradient descent for least squares regression. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors,
Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research,
pages 545–604. PMLR, 06–09 Jul 2018.
[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS,
2013.
[KBB15] Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and
discrete time. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Ad-
vances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc., 2015. URL:
http://papers.nips.cc/paper/5843-accelerated-mirror-descent-in-continuous-and-discrete-time.pdf.
[KNJK18] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of existing momentum
schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9.
IEEE, 2018.
[LJSB12] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence
rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization.
In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems 28, pages 3384–3392. Curran Associates, Inc., 2015. URL:
http://papers.nips.cc/paper/5928-a-universal-catalyst-for-first-order-optimization.pdf .
[LRP16] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral
quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[Nes13] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business
Media, 2013.
[NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation
approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
[NY83] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in opti-
mization. Wiley, 1983.
[Oks13] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business
Media, 2013.
[OP19] Adam M. Oberman and Mariana Prazeres. Stochastic Gradient Descent with Polyak’s Learning Rate. arXiv e-prints,
page arXiv:1903.08688, Mar 2019. arXiv:1903.08688.
[Pav16] Grigorios A Pavliotis. Stochastic processes and applications. Springer, 2016.
[Pol64] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathe-
matics and Mathematical Physics, 4(5):1–17, 1964.
[Pol87] Boris T Polyak. Introduction to optimization. translations series in mathematics and engineering. Optimization
Software, 1987.
[QRG+ 19] Xun Qian, Peter Richtarik, Robert Gower, Alibek Sailanbayev, Nicolas Loizou, and Egor Shulgin. Sgd with arbitrary
sampling: General analysis and improved rates. In International Conference on Machine Learning, pages 5200–5209,
2019.
[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09
1951. URL: https://doi.org/10.1214/aoms/1177729586, doi:10.1214/aoms/1177729586.
[SBC14] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient
method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
[SDJS18] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-
resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.
[SH96] AM Stuart and AR Humphries. Dynamical systems and numerical analysis, volume 2 of Cambridge Monographs
on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 1996.
[SRBd17] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and accelerated
optimization algorithms. arXiv preprint arXiv:1702.06751, 2017.
[Var57] Richard S. Varga. A comparison of the successive overrelaxation method and semi-iterative methods using Chebyshev
polynomials. J. Soc. Indust. Appl. Math., 5:39–46, 1957.
[WMW19] Ashia Wilson, Lester Mackey, and Andre Wibisono. Accelerating Rescaled Gradient Descent: Fast Optimization of
Smooth Functions. arXiv e-prints, page arXiv:1902.08825, Feb 2019. arXiv:1902.08825.
[WRJ16] Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A lyapunov analysis of momentum methods in optimization.
arXiv preprint arXiv:1611.02635, 2016.
[WWJ16] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in
optimization. Proceedings of the National Academy of Sciences, page 201614734, 2016.
[XWG18a] Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics
to discrete-time algorithms. In Proceedings of the Twenty-First International Conference on Artificial Intelligence
and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1087–1096, Playa Blanca, Lanzarote,
Canary Islands, 2018.
[XWG18b] Pan Xu, Tianhao Wang, and Quanquan Gu. Continuous and discrete-time accelerated stochastic mirror descent for
strongly convex functions. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of
Proceedings of Machine Learning Research, pages 5492–5501, Stockholmsmässan, Stockholm, Sweden, 2018.
[You54] David Young. Iterative methods for solving partial difference equations of elliptic type. Trans. Amer. Math. Soc.,
76:92–111, 1954. URL: https://doi.org/10.2307/1990745, doi:10.2307/1990745.
Appendix A. Proof of inequality (5.3)

The proof is a perturbed version of the one in [BT09, SBC14, APR16]. First we prove the following inequality.
Lemma A.1. Assume f is a convex, L-smooth function. Then for all x, y, z, f satisfies
\[
f(z)\le f(x)+\langle\nabla f(y),z-x\rangle+\frac{L}{2}|z-y|^2.\tag{68}
\]

Proof. By L-smoothness,
\[
f(z)-f(x)\le f(y)-f(x)+\langle\nabla f(y),z-y\rangle+\frac{L}{2}|z-y|^2,
\]
and since f is convex,
\[
f(y)-f(x)\le\langle\nabla f(y),y-x\rangle.
\]
We conclude the proof by combining these two inequalities. □
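Inequality (68) combines smoothness and convexity; for a quick illustration (our own, using the convex, 1-smooth function \(f(x)=x^2/2\), an arbitrary choice), it can be checked pointwise on random triples.

```python
# Check (68): f(z) <= f(x) + <grad f(y), z - x> + (L/2)|z - y|^2
# for f(x) = x^2/2 (so grad f(y) = y, L = 1).  For this f the gap equals
# (x - y)^2 / 2 >= 0, so the inequality holds for every triple.
import random

random.seed(0)
L = 1.0
f = lambda u: 0.5 * u * u
for _ in range(1000):
    x, y, z = (random.uniform(-10, 10) for _ in range(3))
    lhs = f(z)
    rhs = f(x) + y * (z - x) + 0.5 * L * (z - y) ** 2
    assert lhs <= rhs + 1e-9
```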
Now apply inequality (68) at (x, y, z) = (x_k, y_k, x_{k+1}):
\begin{align*}
f(x_{k+1})&\le f(x_k)+\langle\nabla f(y_k),x_{k+1}-x_k\rangle+\frac{L}{2}|x_{k+1}-y_k|^2\\
&\le f(x_k)+\frac{2h}{t_k}\langle\nabla f(y_k),v_k-x_k\rangle-\frac{h}{\sqrt{L}}|\nabla f(y_k)|^2-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+\frac{h^2}{2}|\nabla f(y_k)+e_k|^2,
\end{align*}
and then
\[
f(x_{k+1})\le f(x_k)+\frac{2h}{t_k}\langle\nabla f(y_k),v_k-x_k\rangle-\Big(\frac{1}{\sqrt{L}}-\frac{h}{2}\Big)h|\nabla f(y_k)|^2-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+h^2\Big\langle\nabla f(y_k)+\frac{e_k}{2},e_k\Big\rangle.\tag{69}
\]
If we apply (68) also at (x, y, z) = (x*, y_k, x_{k+1}), we obtain
\begin{align*}
f(x_{k+1})&\le f^*+\langle\nabla f(y_k),x_{k+1}-x^*\rangle+\frac{L}{2}|x_{k+1}-y_k|^2\\
&\le f^*+\langle\nabla f(y_k),y_k-x^*\rangle-\frac{h}{\sqrt{L}}|\nabla f(y_k)|^2-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+\frac{h^2}{2}|\nabla f(y_k)+e_k|^2,
\end{align*}
then
\[
f(x_{k+1})\le f^*+\langle\nabla f(y_k),y_k-x^*\rangle-\Big(\frac{1}{\sqrt{L}}-\frac{h}{2}\Big)h|\nabla f(y_k)|^2-\frac{h}{\sqrt{L}}\langle\nabla f(y_k),e_k\rangle+h^2\Big\langle\nabla f(y_k)+\frac{e_k}{2},e_k\Big\rangle.\tag{70}
\]
 
Summing 1 − 2h 2h
tk (A) and tk (A), we have
   
2h 2h 1 h
f (xk+1 ) − f ∗ ≤ 1− (f (xk ) − f ∗ ) + h∇f (yk ), vk − x∗ i − √ − h|∇f (yk )|2
tk tk L 2
h D ek E
− √ h∇f (yk ), ek i + h2 ∇f (yk ) + , ek .
L 2
Then,
 
2 1 h
∗ ∗
tk (f (xk+1 ) − f ) ≤ (tk − 2h) tk (f (xk ) − f ) + 2htk h∇f (yk ), vk − x i − √ − ∗
ht2k |∇f (yk )|2
L 2
ht2 D ek E
− √ k h∇f (yk ), ek i + h2 t2k ∇f (yk ) + , ek
L 2
 
2 2 1 h
∗ ∗
ht2k |∇f (yk )|2

≤ tk−1 − h (f (xk ) − f ) + 2htk h∇f (yk ), vk − x i − √ −
L 2
ht2 D ek E
− √ k h∇f (yk ), ek i + h2 t2k ∇f (yk ) + , ek ,
L 2
which concludes the proof.