On The Optimization of Deep Networks: Implicit Acceleration by Overparameterization

On the Optimization of Deep Networks:
Implicit Acceleration by Overparameterization
Sanjeev Arora 1 2 Nadav Cohen 2 Elad Hazan 1 3
Abstract Given the longstanding consensus on expressiveness vs. op-

Conventional wisdom in deep learning states that timization trade-offs, this paper conveys a rather counter-
arXiv:1802.06509v2 [cs.LG] 11 Jun 2018
increasing depth improves expressiveness but intuitive message: increasing depth can accelerate opti-
complicates optimization. This paper suggests mization. The effect is shown, via first-cut theoretical and
that, sometimes, increasing depth can speed up empirical analyses, to resemble a combination of two well-
optimization. The effect of depth on optimization known tools in the field of optimization: momentum, which
is decoupled from expressiveness by focusing on led to provable acceleration bounds (Nesterov, 1983); and
settings where additional layers amount to over- adaptive regularization, a more recent technique proven to
parameterization – linear neural networks, a well- accelerate by Duchi et al. (2011) in their proposal of the
studied model. Theoretical analysis, as well as AdaGrad algorithm. Explicit mergers of both techniques
experiments, show that here depth acts as a pre- are quite popular in deep learning (Kingma & Ba, 2014;
conditioner which may accelerate convergence. Tieleman & Hinton, 2012). It is thus intriguing that merely
Even on simple convex problems such as linear introducing depth, with no other modification, can have a
regression with `p loss, p > 2, gradient descent similar effect, but implicitly.
can benefit from transitioning to a non-convex There is an obvious hurdle in isolating the effect of depth
overparameterized objective, more than it would on optimization: if increasing depth leads to faster train-
from some common acceleration schemes. We ing on a given dataset, how can one tell whether the im-
also prove that it is mathematically impossible to provement arose from a true acceleration phenomenon, or
obtain the acceleration effect of overparametriza- simply due to better representational power (the shallower
tion via gradients of any regularizer. network was unable to attain the same training loss)? We
respond to this hurdle by focusing on linear neural networks
(cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt &
1. Introduction Ma (2016); Kawaguchi (2016)). With these models, adding
layers does not alter expressiveness; it manifests itself only
How does depth help? This central question of deep learn-
in the replacement of a matrix parameter by a product of
ing still eludes full theoretical understanding. The general
matrices – an overparameterization.
consensus is that there is a trade-off: increasing depth im-
proves expressiveness, but complicates optimization. Supe- We provide a new analysis of linear neural network opti-
rior expressiveness of deeper networks, long suspected, is mization via direct treatment of the differential equations
now confirmed by theory, albeit for fairly limited learning associated with gradient descent when training arbitrarily
problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee deep networks on arbitrary loss functions. We find that the
et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., overparameterization introduced by depth leads gradient
2018). Difficulties in optimizing deeper networks have also descent to operate as if it were training a shallow (single
been long clear – the signal held by a gradient gets buried layer) network, while employing a particular precondition-
as it propagates through many layers. This is known as ing scheme. The preconditioning promotes movement along
the “vanishing/exploding gradient problem”. Modern tech- directions already taken by the optimization, and can be seen
niques such as batch normalization (Ioffe & Szegedy, 2015) as an acceleration procedure that combines momentum with
and residual connections (He et al., 2015) have somewhat adaptive learning rates. Even on simple convex problems
alleviated these difficulties in practice. such as linear regression with `p loss, p > 2, overparam-
1
eterization via depth can significantly speed up training.
Department of Computer Science, Princeton University,
Surprisingly, in some of our experiments, not only did over-
Princeton, NJ, USA 2 School of Mathematics, Institute for Ad-
vanced Study, Princeton, NJ, USA 3 Google Brain, USA. Corre- parameterization outperform naı̈ve gradient descent, but it
spondence to: Nadav Cohen <cohennadav@ias.edu>. was also faster than two well-known acceleration methods –
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). sive), ultimately concluding that increasing depth can slow
In addition to purely linear networks, we also demonstrate down optimization, albeit by a modest amount. In contrast
(empirically) the implicit acceleration of overparameteri- to these two works, our analysis applies to a general loss
zation on a non-linear model, by replacing hidden layers function, and any depth including one. Intriguingly, we find
with depth-2 linear networks. The implicit acceleration of that for `p regression, acceleration by depth is revealed only
overparametrization is different from standard regulariza- when p > 2. This explains why the conclusion reached
tion – we prove its effect cannot be attained via gradients of in Saxe et al. (2013) differs from ours.
any fixed regularizer.
Turning to general optimization, accelerated gradient (mo-
Both our theoretical analysis and our empirical evaluation mentum) methods were introduced in Nesterov (1983), and
indicate that acceleration via overparameterization need not later studied in numerous works (see Wibisono et al. (2016)
be computationally expensive. From an optimization per- for a short review). Such methods effectively accumulate
spective, overparameterizing using wide or narrow networks gradients throughout the entire optimization path, using the
has the same effect – it is only the depth that matters. collected history to determine the step at a current point in
time. Use of preconditioners to speed up optimization is
The remainder of the paper is organized as follows. In Sec-
also a well-known technique. Indeed, the classic Newton’s
tion 2 we review related work. Section 3 presents a warmup
method can be seen as preconditioning based on second
example of linear regression with `p loss, demonstrating
derivatives. Adaptive preconditioning with only first-order
the immense effect overparameterization can have on op-
(gradient) information was popularized by the BFGS al-
timization, with as little as a single additional scalar. Our
gorithm and its variants (cf. Nocedal (1980)). Relevant
theoretical analysis begins in Section 4, with a setup of pre-
theoretical guarantees, in the context of regret minimization,
liminary notation and terminology. Section 5 derives the
were given in Hazan et al. (2007); Duchi et al. (2011). In
preconditioning scheme implicitly induced by overparame-
terms of combining momentum and adaptive precondition-
terization, followed by Section 6 which shows that this form
ing, Adam (Kingma & Ba, 2014) is a popular approach,
of preconditioning is not attainable via any regularizer. In
particularly for optimization of deep networks.
Section 7 we qualitatively analyze a very simple learning
problem, demonstrating how the preconditioning can speed Algorithms with certain theoretical guarantees for non-
up optimization. Our empirical evaluation is delivered in convex optimization, and in particular for training deep
Section 8. Finally, Section 9 concludes. neural networks, were recently suggested in various works,
for example Ge et al. (2015); Agarwal et al. (2017); Carmon
2. Related Work et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and
references therein. Since the focus of this paper lies on the
Theoretical study of optimization in deep learning is a highly analysis of algorithms already used by practitioners, such
active area of research. Works along this line typically an- works lie outside our scope.
alyze critical points (local minima, saddles) in the land-
scape of the training objective, either for linear networks 3. Warmup: `p Regression
(see for example Kawaguchi (2016); Hardt & Ma (2016) or
Baldi & Hornik (1989) for a classic account), or for specific We begin with a simple yet striking example of the effect
non-linear networks under different restrictive assumptions being studied. For linear regression with `p loss, we will
(cf. Choromanska et al. (2015); Haeffele & Vidal (2015); see how even the slightest overparameterization can have an
Soudry & Carmon (2016); Safran & Shamir (2017)). Other immense effect on optimization. Specifically, we will see
works characterize other aspects of objective landscapes, that simple gradient descent on an objective overparameter-
for example Safran & Shamir (2016) showed that under ized by a single scalar, corresponds to a form of accelerated
certain conditions a monotonically descending path from gradient descent on the original objective.
initialization to global optimum exists (in compliance with
Consider the objective for a scalar linear regression problem
the empirical observations of Goodfellow et al. (2014)).
with `p loss (p – even positive integer):
The dynamics of optimization was studied in Fukumizu h i
(1998) and Saxe et al. (2013), for linear networks. Like L(w) = E(x,y)∼S p1 (x> w − y)p
ours, these works analyze gradient descent through its cor- x ∈ Rd here are instances, y ∈ R are continuous labels,
responding differential equations. Fukumizu (1998) focuses S is a finite collection of labeled instances (training set), and
on linear regression with `2 loss, and does not consider the w ∈ Rd is a learned parameter vector. Suppose now that we
effect of varying depth – only a two (single hidden) layer apply a simple overparameterization, replacing the param-
network is analyzed. Saxe et al. (2013) also focuses on eter vector w by a vector w1 ∈ Rd times a scalar w2 ∈ R:
`2 regression, but considers any depth beyond two (inclu- h i
L(w1 , w2 ) = E(x,y)∼S p1 (x> w1 w2 − y)p
Pk Pk
Obviously the overparameterization does not affect the ex- − r=1 yr log(eŷr / r0 =1 eŷr0 ), where yr and ŷr stand
pressiveness of the linear model. How does it affect opti- for coordinate r of y and ŷ respectively. For a predic-
mization? What happens to gradient descent on this non- tor φ, i.e. a mapping X to Y, the overall training loss
Pm from (i)
convex objective? is L(φ) := m 1 (i)
i=1 l(φ(x ), y ). If φ comes from some
Observation 1. Gradient descent over L(w1 , w2 ), with parametric family Φ := {φθ : X → Y|θ ∈ Θ}, we view the
fixed small learning rate and near-zero initialization, is corresponding training loss as aPfunction of the parameters,
1 m
equivalent to gradient descent over L(w) with particular i.e. we consider LΦ (θ) := m (i) (i)
i=1 l(φθ (x ), y ). For
adaptive learning rate and momentum terms. example, if the parametric family in question is the class of
(directly parameterized) linear predictors:
To see this, consider the gradients of L(w) and L(w1 , w2 ): Φlin := {x 7→ W x|W ∈ Rk,d } (1)
:= E(x,y)∼S (x> w − y)p−1 x

∇w the respective training loss is a function from R k,d
to R≥0 .
∇w1 := E(x,y)∼S (x> w1 w2 − y)p−1 w2 x

In our context, a depth-N (N ≥ 2) linear neural network,
∇w2 := E(x,y)∼S (x> w1 w2 − y)p−1 w1> x

with hidden widths n1 , n2 , . . . , nN −1 ∈N, is the follow-
ing parametric family of linear predictors: Φn1 ...nN −1 :=
Gradient descent over L(w1 , w2 ) with learning rate η > 0:
{x 7→ WN WN −1 · · ·W1 x|Wj ∈Rnj ,nj−1 , j=1...N }, where
w1
(t+1) (t)
←[ w1 −η∇w(t) , w2
(t+1) (t)
←[ w2 −η∇w(t) by definition n0 := d and nN := k. As customary, we refer
1 2 to each Wj , j=1...N , as the weight matrix of layer j. For
The dynamics of the underlying parameter w = w1 w2 are: simplicity of presentation, we hereinafter omit from our no-
(t+1) (t+1)
tation the hidden widths n1 ...nN −1 , and simply write ΦN
w(t+1) = w1 w2 instead of Φn1 ...nN −1 (n1 . . .nN −1 will be specified explic-
(t) (t) itly if not clear by context). That is, we denote:
←[ (w1 −η∇w(t) )(w2 −η∇w(t) )
1 2
(t) (t) (t) (t) ΦN := (2)
= w1 w 2 − ηw2 ∇w(t) − η∇w(t) w1 + O(η 2 ) nj ,nj−1
1 2 {x 7→ WN WN −1 · · ·W1 x| Wj ∈ R , j=1...N }
(t) (t)
= w(t) − η(w2 )2 ∇w(t) − η(w2 )−1 ∇w(t) w(t) + O(η 2 ) For completeness, we regard a depth-1 network as the family
2
η is assumed to be small, thus we neglect O(η ). De- 2 of directly parameterized linear predictors, i.e. we set Φ1 :=
(t) (t) Φlin (see Equation 1).
noting ρ(t) :=η(w2 )2 ∈ R and γ (t) :=η(w2 )−1 ∇w(t) ∈ R,
2
this gives: The training loss that corresponds to a depth-N lin-
N
w(t+1) ←[ w(t) − ρ(t) ∇w(t) − γ (t) w(t) ear network – LΦ (W1 , ..., WN ), is a function from
Rn1 ,n0 ×· · ·×RnN ,nN −1 to R≥0 . For brevity, we will de-
Since by assumption w1 and w2 are initialized near zero, note this function by LN (·). Our focus lies on the behavior
w will initialize near zero as well. This implies that at every of gradient descent when minimizing LN (·). More specifi-
iteration t, w(t) is a weighted combination of past gradients. cally, we are interested in the dependence of this behavior
There thus exist µ(t,τ ) ∈ R such that: on N , and in particular, in the possibility of increasing N
Xt−1 leading to acceleration. Notice that for any N ≥ 2 we have:
w(t+1) ←[ w(t) − ρ(t) ∇w(t) − µ(t,τ ) ∇w(τ )
τ =1
LN (W1 , ..., WN ) = L1 (WN WN −1 · · ·W1 ) (3)
We conclude that the dynamics governing the underlying
parameter w correspond to gradient descent with a momen- and so the sole difference between the training loss of a
tum term, where both the learning rate (ρ(t) ) and momentum depth-N network and that of a depth-1 network (classic lin-
coefficients (µ(t,τ ) ) are time-varying and adaptive. ear model) lies in the replacement of a matrix parameter by
a product of N matrices. This implies that if increasing N
can indeed accelerate convergence, it is not an outcome of
4. Linear Neural Networks any phenomenon other than favorable properties of depth-
Let X := Rd be a space of objects (e.g. images or word induced overparameterization for optimization.
embeddings) that we would like to infer something about,
and let Y := Rk be the space of possible inferences.
5. Implicit Dynamics of Gradient Descent
Suppose we are given a training set {(x(i) , y(i) )}m
i=1 ⊂
X × Y, along with a (point-wise) loss function l : Y × In this section we present a new result for linear neural
Y → R≥0 . For example, y(i) could hold continuous val- networks, tying the dynamics of gradient descent on LN (·) –
2
ues with l(·) being the `2 loss: l(ŷ, y) = 21 kŷ − yk2 ; the training loss corresponding to a depth-N network, to
or it could hold one-hot vectors representing categories those on L1 (·) – training loss of a depth-1 network (classic
with l(·) being the softmax-cross-entropy loss: l(ŷ, y) = linear model). Specifically, we show that gradient descent
on LN (·), a complicated and seemingly pointless overpa- Then, the end-to-end weight matrix We (Equation 5) is
rameterization, can be directly rewritten as a particular pre- governed by the following differential equation:
conditioning scheme over gradient descent on L1 (·).
Ẇe (t) = −ηλN · We (t) (8)
When applied to LN (·), gradient descent takes on the fol- XN j−1
lowing form: −η We (t)We> (t) N ·
j=1
dL1
NN−j
∂LN
>
Wj
(t+1)
←[ (1 − ηλ)Wj
(t)
−η
(t) (t)
(W1 , . . . , WN ) (4) dW (We (t)) · We (t)We (t)
∂Wj j−1 N −j
, j = 1. . .N where [·] N and [·] N , j = 1 . . . N , are fractional power
operators defined over positive semidefinite matrices.
η > 0 here is a learning rate, and λ ≥ 0 is an optional
weight decay coefficient. For simplicity, we regard both η Proof. (sketch – full details in Appendix A.1) If λ = 0
and λ as fixed (no dependence on t). Define the underlying (no weight decay) then one can easily show that
end-to-end weight matrix: >
Wj+1 (t)Ẇj+1 (t) = Ẇj (t)Wj> (t) throughout optimization.
We := WN WN −1 · · · W1 (5) Taking the transpose of this equation and adding to itself,
followed by integration over time, imply that the differ-
>
Given that LN (W1 , . . . , WN ) = L1 (We ) (Equation 3), we ence between Wj+1 (t)Wj+1 (t) and Wj (t)Wj> (t) is con-
view We as an optimized weight matrix for L1 (·), whose stant. This difference is zero at initialization (Equation 7),
dynamics are governed by Equation 4. Our interest then thus will remain zero throughout, i.e.:
>
boils down to the study of these dynamics for different Wj+1 (t)Wj+1 (t) = Wj (t)Wj> (t) , ∀t ≥ t0 (9)
choices of N . For N = 1 they are (trivially) equivalent to
standard gradient descent over L1 (·). We will characterize A slightly more delicate treatment shows that this is true
the dynamics for N ≥ 2. even if λ > 0, i.e. with weight decay included.
To be able to derive, in our general setting, an explicit update Equation 9 implies alignment of the (left and right) sin-
rule for the end-to-end weight matrix We (Equation 5), we gular spaces of Wj (t) and Wj+1 (t), simplifying the prod-
introduce an assumption by which the learning rate is small, uct Wj+1 (t)Wj (t). Successive application of this simpli-
i.e. η 2 ≈ 0. Formally, this amounts to translating Equation 4 fication allows a clean computation for the product of all
to the following set of differential equations: layers (that is, We ), leading to the explicit form presented
in theorem statement (Equation 8).
∂LN
Ẇj (t) = −ηλWj (t) − η (W1 (t), . . . , WN (t)) (6)
∂Wj Translating the continuous dynamics of Equation 8 back to
, j = 1. . .N discrete time, we obtain the sought-after update rule for the
end-to-end weight matrix:
where t is now a continuous time index, and Ẇj (t) stands
for the derivative of Wj with respect to time. The use of We(t+1) ←[ (1 − ηλN )We(t) (10)
differential equations, for both theoretical analysis and al- XN h i j−1
N
gorithm design, has a long and rich history in optimization −η We(t) (We(t) )> ·
j=1
research (see Helmke & Moore (2012) for an overview). h i NN−j
dL1 (t) (t) > (t)
When step sizes (learning rates) are taken to be small, tra- dW (W e ) · (W e ) W e
jectories of discrete optimization algorithms converge to This update rule relies on two assumptions: first, that the
smooth curves modeled by continuous-time differential learning rate η is small enough for discrete updates to ap-
equations, paving way to the well-established theory of proximate continuous ones; and second, that weights are
the latter (cf. Boyce et al. (1969)). This approach has led initialized on par with Equation 7, which will approximately
to numerous interesting findings, including recent results in be the case if initialization values are close enough to zero.
the context of acceleration methods (e.g. Su et al. (2014); It is customary in deep learning for both learning rate and
Wibisono et al. (2016)). weight initializations to be small, but nonetheless above
assumptions are only met to a certain extent. We support
With the continuous formulation in place, we turn to express
their applicability by showing empirically (Section 8) that
the dynamics of the end-to-end matrix We :
the end-to-end update rule (Equation 10) indeed provides
Theorem 1. Assume the weight matrices W1 . . .WN follow an accurate description for the dynamics of We .
the dynamics of continuous gradient descent (Equation 6).
Assume also that their initial values (time t0 ) satisfy, for A close look at Equation 10 reveals that the dynamics of the
j = 1. . .N − 1: end-to-end weight matrix We are similar to gradient descent
over L1 (·) – training loss corresponding to a depth-1 net-
>
Wj+1 (t0 )Wj+1 (t0 ) = Wj (t0 )Wj> (t0 ) (7) work (classic linear model). The only difference (besides the
scaling by N of the weight decay coefficient λ) is that the i.e. be oblivious to parameter value. However, if one takes
1
gradient dL
dW (We ) is subject to a transformation before be-
into account the common practice in deep learning of ini-
ing used. Namely, for j = 1. . .N , it is multiplied from the tializing weights near zero, the location in parameter space
j−1 N −j
left by [We We> ] N and from the right by [We> We ] N , fol- can also be regarded as the overall movement made by the
lowed by summation over j. Clearly, when N = 1 (depth-1 algorithm. We thus interpret our findings as indicating that
network) this transformation reduces to identity, and as ex- overparameterization promotes movement along directions
pected, We precisely adheres to gradient descent over L1 (·). already taken by the optimization, and therefore can be seen
When N ≥ 2 the dynamics of We are less interpretable. We as a form of acceleration. This intuitive interpretation will
arrange it as a vector to gain more insight: become more concrete in the subsection that follows.
Claim 1. For an arbitrary matrix A, denote by vec(A) its A final point to make, is that the end-to-end update rule
arrangement as a vector in column-first order. Then, the (Equation 10 or 11), which obviously depends on N – num-
end-to-end update rule in Equation 10 can be written as: ber of layers in the deep linear network, does not depend on
the hidden widths n1 . . . nN −1 (see Section 4). This implies
vec(We(t+1) ) ←[ (1 − ηλN ) · vec(We(t) ) (11)
1 that from an optimization perspective, overparameterizing
dL (t) using wide or narrow networks has the same effect – it is
−η · PW (t) vec dW (We )
e
only the depth that matters. Consequently, the acceleration
where PW (t) is a positive semidefinite preconditioning ma- of overparameterization can be attained at a minimal compu-
e
(t) tational price, as we demonstrate empirically in Section 8.
trix that depends on We . Namely, if we denote the sin-
(t)
gular values of We ∈ Rk,d by σ1 . . . σmax{k,d} ∈ R≥0
(by definition σr = 0 if r > min{k, d}), and correspond- 5.1. Single Output Case
ing left and right singular vectors by u1 . . . uk ∈ Rk and To facilitate a straightforward presentation of our findings,
v1 . . . vd ∈ Rd respectively, the eigenvectors of PW (t) are: we hereinafter focus on the special case where the optimized
e
vec(ur vr>0 ) , r = 1 . . . k , r0 = 1 . . . d models have a single output, i.e. where k = 1. This corre-
sponds, for example, to a binary (two-class) classification
with corresponding eigenvalues: problem, or to the prediction of a numeric scalar property
XN 2 N −j 2 j−1 (regression). It admits a particularly simple form for the
σr N σr 0 N , r = 1 . . . k , r0 = 1 . . . d end-to-end update rule of Equation 10:
j=1
Claim 2. Assume k = 1, i.e. We ∈ R1,d . Then, the end-to-

Proof. The result readily follows from the properties of the
end update rule in Equation 10 can be written as follows:
Kronecker product – see Appendix A.2 for details.
We(t+1) ←[ (1 − ηλN ) · We(t) (12)
Claim 1 implies that in the end-to-end update rule of Equa- 2
(t) 2− N
1
(t)
1 −ηkWe k2 · dLdW (We )+
tion 10, the transformation applied to the gradient dL dW (We ) 1 (t)

is essentially a preconditioning, whose eigendirections (N − 1) · P rW (t) dL dW (We )
e
and eigenvalues depend on the singular value decompo- 2− 2
sition of We . The eigendirections are the rank-1 matri- where k·k2 N stands for Euclidean norm raised to the
ces ur vr>0 , where ur and vr0 are left and right (respec- power of 2 − N2 , and P rW {·}, W ∈ R1,d , is defined to be
tively) singular vectors of We . The eigenvalue of ur vr>0 the projection operator onto the direction of W :
PN 2(N −j)/N 2(j−1)/N
is j=1 σr σr 0 , where σr and σr0 are the P rW : R1,d → R1,d (13)
singular values of We corresponding to ur and vr0 (respec- (
W > W
kW k2 V · , W 6= 0
tively). When N ≥ 2, an increase in σr or σr0 leads to P rW {V } := kW k2
an increase in the eigenvalue corresponding to the eigendi- 0 , W =0
rection ur vr>0 . Qualitatively, this implies that the precondi-
tioning favors directions that correspond to singular vectors Proof. The result follows from the definition of a fractional
whose presence in We is stronger. We conclude that the power operator over matrices – see Appendix A.3.
effect of overparameterization, i.e. of replacing a classic lin-
ear model (depth-1 network) by a depth-N linear network, Claim 2 implies that in the single output case, the effect
boils down to modifying gradient descent by promoting of overparameterization (replacing classic linear model by
movement along directions that fall in line with the current depth-N linear network) on gradient descent is twofold:
location in parameter space. A-priori, such a preference may first, it leads to an adaptive learning rate schedule, by in-
2−2/N
seem peculiar – why should an optimization algorithm be troducing the multiplicative factor kWe k2 ; and second,
sensitive to its location in parameter space? Indeed, we gen- it amplifies (by N ) the projection of the gradient on the
erally expect sensible algorithms to be translation invariant, direction of We . Recall that we view We not only as the
e
optimized parameter, but also as the overall movement made
R e r e 0
r e R e
in optimization (initialization is assumed to be near zero).
Accordingly, the adaptive learning rate schedule can be seen  1
r ,R  3
r ,R
as gaining confidence (increasing step sizes) when optimiza-

tion moves farther away from initialization, and the gradient  2r , R
projection amplification can be thought of as a certain type
of momentum that favors movement along the azimuth taken
so far. These effects bear potential to accelerate convergence,  4r , R
as we illustrate qualitatively in Section 7, and demonstrate
empirically in Section 8. Figure 1. Curve Γr,R over which line integral is non-zero.
6. Overparametrization Effects Cannot Be

Fφ (W ) =
D E
Attained via Regularization 2
(
2− N W W
kW k φ(W )+(N −1) φ(W ), kW k kW k , W 6=0
Adding a regularizer to the objective is a standard approach 0 , W =0
for improving optimization (though lately the term regular- Notice that for φ = dL1
we get exactly the vector field F (·)
dW ,
ization is typically associated with generalization). For ex- defined in theorem statement.
ample, AdaGrad was originally invented to compete with the
best regularizer from a particular family. The next theorem We note simple properties of the mapping φ 7→ Fφ . First,
shows (for single output case) that the effects of overparame- it is linear, since for any vector fields φ1 , φ2 and scalar c:
terization cannot be attained by adding a regularization term Fφ1 +φ2 =Fφ1 +Fφ2 and Fc·φ1 =c·Fφ1 . Second, because of
to the original training loss, or via any similar modification. the linearity
R of line integrals, for any curve Γ, the functional
This is not obvious a-priori, as unlike many acceleration φ 7→ Γ Fφ , a mapping of vector fields to scalars, is linear.
methods that explicitly maintain memory of past gradients, We show that F (·) contradicts the fundamental theorem for
updates under overparametrization are by definition the gra- line integrals. To do so, we construct aHclosed curve Γ=Γr,R
dients of something. The assumptions in the theorem are for which the linear functional φ 7→ Γ Fφ does not vanish
minimal and also necessary, as one must rule-out the trivial 1
dL1 dL1
counter-example of a constant training loss. at φ= dLdW . Let e := dW (W =0)/k dW (W =0)k, which
1
1 is well-defined since by assumption dL dW (W =0)6= 0. For
Theorem 2. Assume dLdW does not vanish at W = 0, and is r < R we define (see Figure 1):
continuous on some neighborhood around this point. For a
given N ∈ N, N > 2,1 define: Γr,R := Γ1r,R → Γ2r,R → Γ3r,R → Γ4r,R
where:
F (W ) := (14)
2
• Γ1r,R is the line segment from −R · e to −r · e.

2− N dL1
dL1
kW k2 · dW (W ) + (N −1) · P rW dW (W )
• Γ2r,R is a spherical curve from −r · e to r · e.
where P rW {·} is the projection given in Equation 13. Then, • Γ3r,R is the line segment from r · e to R · e.
there exists no function (of W ) whose gradient field is F (·).
• Γ4r,R is a spherical curve from R · e to −R · e.
1
Proof. (sketch – full details in Appendix A.4) The proof With the definition of Γr,R in place, we decompose dLdW into
uses elementary differential geometry (Buck, 2003): curves, dL1
a constant vector field κ ≡ dW (W =0) plus a residual ξ.
arc length and the fundamental theorem for line integrals, We explicitly compute the line integrals along Γ1r,R . . . Γ4r,R
which states that the integral of ∇g for any differentiable for Fκ , and derive bounds for FRξ . This, along with the
function g amounts to 0 along every closed curve. linearity of the functional φ 7→ Γ Fφ , provides a lower
Overparametrization changes gradient descent’s behavior: bound on the line integral of F (·) over Γr,R . We show
1
instead of following the original gradient dL the lower bound is positive as r, R → 0, thus F (·) indeed
dW , it follows
some other direction F (·) (see Equations 12 and 14) that contradicts the fundamental theorem for line integrals.
is a function of the original gradient as well as the current
point W . We think of this change as a transformation that
maps one vector field φ(·) to another – Fφ (·): 7. Illustration of Acceleration
1
For the result to hold with N = 2, additional assump- To this end, we showed that overparameterization (use of
tions on L1 (·) are required; otherwise any non-zero linear func- depth-N linear network in place of classic linear model)
tion L1 (W ) = W U > serves as a counter-example – it leads to a induces on gradient descent a particular preconditioning
vector field F (·) that is the gradient of W 7→ kW k2 · W U > . scheme (Equation 10 in general and 12 in the single output
L2 loss (distance from optimum)

case), which can be interpreted as introducing some forms 2-layer, width-100 2-layer, width-100
2-layer, width-1 2-layer, width-1
of momentum and adaptive learning rate. We now illus- 2-layer emulation 2-layer emulation
3-layer, width-100 3-layer, width-100
trate qualitatively, on a very simple hypothetical learning 10−1
3-layer, width-1
100
3-layer, width-1
3-layer emulation 3-layer emulation
problem, the potential of these to accelerate optimization.
Consider the task of linear regression, assigning to vectors 10−2
0 50000 100000 150000 200000 0 50000 100000 150000 200000
in R2 labels in R. Suppose that our training set consists iteration iteration
of two points in R2 × R: ([1, 0]> , y1 ) and ([0, 1]> , y2 ). Figure 2. (to be viewed in color) Gradient descent optimization
of deep linear networks (depths 2, 3) vs. the analytically-derived
Assume also that the loss function of interest is `p , p ∈ 2N:
equivalent preconditioning schemes (over single layer model;
`p (ŷ, y) = p1 (ŷ − y)p . Denoting the learned parameter by
Equation 12). Both plots show training objective (left – `2 loss;
w = [w1 , w2 ]> , the overall training loss can be written as:2 right – `4 loss) per iteration, on a numeric regression dataset from
UCI Machine Learning Repository (details in text). Notice the
L(w1 , w2 ) = p1 (w1 − y1 )p + p1 (w2 − y2 )p
emulation of preconditioning schemes. Notice also the negligible
effect of network width – for a given depth, setting size of hidden
With fixed learning rate η > 0 (weight decay omitted for
layers to 1 (scalars) or 100 yielded similar convergence (on par
simplicity), gradient descent over L(·) gives: with our analysis).
(t+1) (t) (t)
wi ←[ wi − η(wi − yi )p−1 , i = 1, 2
Changing variables per ∆i = wi − yi , we have:
network, induces on gradient descent a certain precondition-
(t+1) (t) (t)
←[ ∆i 1 − η(∆i )p−2

∆i , i = 1, 2 (15) ing scheme. We qualitatively argued (Section 7) that in some
Assuming the original weights w1 and w2 are initialized cases, this preconditioning may accelerate convergence. In
near zero, ∆1 and ∆2 start off at −y1 and −y2 respectively, this section we put these claims to the test, through a se-
and will eventually reach the optimum ∆∗1 = ∆∗2 = 0 if the ries of empirical evaluations based on TensorFlow toolbox
learning rate is small enough to prevent divergence: (Abadi et al. (2016)). For conciseness, many of the details
2 behind our implementation are deferred to Appendix C.
η< yip−2
, i = 1, 2
Suppose now that the problem is ill-conditioned, in the We begin by evaluating our analytically-derived precondi-
sense that y1 y2 . If p = 2 this has no effect on the bound tioning scheme – the end-to-end update rule in Equation 10.
for η.3 If p > 2 the learning rate is determined by y1 , lead- Our objective in this experiment is to ensure that our analy-
ing ∆2 to converge very slowly. In a sense, ∆2 will suffer sis, continuous in nature and based on a particular assump-
from the fact that there is no “communication” between tion on weight initialization (Equation 7), is indeed appli-
the coordinates (this will actually be the case not just with cable to practical scenarios. We focus on the single output
gradient descent, but with most algorithms typically used in case, where the update-rule takes on a particularly simple
large-scale settings – AdaGrad, Adam, etc.). (and efficiently implementable) form – Equation 12. The
dataset chosen was UCI Machine Learning Repository’s
Now consider the scenario where we optimize L(·) via over- “Gas Sensor Array Drift at Different Concentrations” (Ver-
parameterization, i.e. with the update rule in Equation 12 gara et al., 2012; Rodriguez-Lujan et al., 2014). Specifi-
(single output). In this case the coordinates are coupled, cally, we used the dataset’s “Ethanol” problem – a scalar
and as ∆1 gets small (w1 gets close to y1 ), the learning regression task with 2565 examples, each comprising 128
2− 2
rate is effectively scaled by y1 N (in addition to a scal- features (one of the largest numeric regression tasks in the
ing by N in coordinate 1 only), allowing (if y1 >1) faster repository). As training objectives, we tried both `2 and `4
convergence of ∆2 . We thus have the luxury of temporar- losses. Figure 2 shows convergence (training objective per
ily slowing down ∆2 to ensure that ∆1 does not diverge, iteration) of gradient descent optimizing depth-2 and depth-
with the latter speeding up the former as it reaches safe 3 linear networks, against optimization of a single layer
grounds. In Appendix B we consider a special case and model using the respective preconditioning schemes (Equa-
formalize this intuition, deriving a concrete bound for the tion 12 with N = 2, 3). As can be seen, the preconditioning
acceleration of overparameterization. schemes reliably emulate deep network optimization, sug-
gesting that, at least in some cases, our analysis indeed
captures practical dynamics.
8. Experiments
Alongside the validity of the end-to-end update rule, Fig-
Our analysis (Section 5) suggests that overparameteriza- ure 2 also demonstrates the negligible effect of network
tion – replacement of a classic linear model by a deep linear width on convergence, in accordance with our analysis (see
2
We omit the averaging constant 21 for conciseness. Section 5). Specifically, it shows that in the evaluated set-
3
Optimal learning rate for gradient descent on quadratic objec- ting, hidden layers of size 1 (scalars) suffice in order for the
tive does not depend on current parameter value (cf. Goh (2017)). essence of overparameterization to fully emerge. Unless oth-

1-layer 1-layer AdaGrad 100
Adam, 1-layer
2-layer 100
2-layer AdaDelta Adam, 2-layer
100
3-layer 3-layer 3-layer Adam, 3-layer
10−1
10−1
10−1 10−1
10−2
10−2
10−2 10−2
0 200000 400000 600000 800000 1000000 0 200000 400000 600000 800000 1000000 0 200000 400000 600000 800000 1000000 0 50000 100000 150000 200000
iteration iteration iteration iteration
Figure 3. (to be viewed in color) Gradient descent optimization Figure 4. (to be viewed in color) Left: Gradient descent optimiza-
of single layer model vs. linear networks of depth 2 and 3. Setup tion of depth-3 linear network vs. AdaGrad and AdaDelta over
is identical to that of Figure 2, except that here learning rates were single layer model. Setup is identical to that of Figure 3-right.
chosen via grid search, individually per model (see Appendix C). Notice that the implicit acceleration of overparameterization out-
Notice that with `2 loss, depth (slightly) hinders optimization, performs both AdaGrad and AdaDelta (former is actually slower
whereas with `4 loss it leads to significant acceleration (on par than plain gradient descent). Right: Adam optimization of single
with our qualitative analysis in Section 7). layer model vs. Adam over linear networks of depth 2 and 3. Same
setup, but with learning rates set per Adam’s default in TensorFlow.
Notice that depth improves speed, suggesting that the acceleration
erwise indicated, all results reported hereinafter are based of overparameterization may be somewhat orthogonal to explicit
on this configuration, i.e. on scalar hidden layers. The com- acceleration methods.
putational toll associated with overparameterization will
thus be virtually non-existent.
Adam (a setting we did not theoretically analyze), further
As a final observation on Figure 2, notice that it exhibits
acceleration is attained – see Figure 4-right. This suggests
faster convergence with a deeper network. This however
that at least in some cases, not only plain gradient descent
does not serve as evidence in favor of acceleration by depth,
benefits from depth, but also more elaborate algorithms
as we did not set learning rates optimally per model (simply
commonly employed in state of the art applications.
used the common choice of 10−3 ). To conduct a fair compar-
ison between the networks, and more importantly, between An immediate question arises at this point. If depth indeed
them and a classic single layer model, multiple learning accelerates convergence, why not add as many layers as one
rates were tried, and the one giving fastest convergence was can computationally afford? The reason, which is actually
taken on a per-model basis. Figure 3 shows the results of apparent in our analysis, is the so-called vanishing gradient
this experiment. As can be seen, convergence of deeper problem. When training a very deep network (large N ),
networks is (slightly) slower in the case of `2 loss. This while initializing weights to be small, the end-to-end ma-
falls in line with the findings of Saxe et al. (2013). In stark trix We (Equation 5) is extremely close to zero, severely
contrast, and on par with our qualitative analysis in Sec- attenuating gradients in the preconditioning scheme (Equa-
tion 7, is the fact that with `4 loss adding depth significantly tion 10). A possible approach for alleviating this issue is to
accelerated convergence. To the best of our knowledge, this initialize weights to be larger, yet small enough such that the
provides first empirical evidence to the fact that depth, even end-to-end matrix does not “explode”. The choice of iden-
without any gain in expressiveness, and despite introducing tity (or near identity) initialization leads to what is known
non-convexity to a formerly convex problem, can lead to as linear residual networks (Hardt & Ma, 2016), akin to the
favorable optimization. successful residual networks architecture (He et al., 2015)
commonly employed in deep learning. Notice that identity
In light of the speedup observed with `4 loss, it is natu-
initialization satisfies the condition in Equation 7, rendering
ral to ask how the implicit acceleration of depth compares
the end-to-end update rule (Equation 10) applicable. Fig-
against explicit methods for acceleration and adaptive learn-
ure 5-left shows convergence, under gradient descent, of
ing. Figure 4-left shows convergence of a depth-3 network
a single layer model against deeper networks than those
(optimized with gradient descent) against that of a single
evaluated before – depths 4 and 8. As can be seen, with
layer model optimized with AdaGrad (Duchi et al., 2011)
standard, near-zero initialization, the depth-4 network starts
and AdaDelta (Zeiler, 2012). The displayed curves cor-
making visible progress only after about 65K iterations,
respond to optimal learning rates, chosen individually via
whereas the depth-8 network seems stuck even after 100K
grid search. Quite surprisingly, we find that in this spe-
iterations. In contrast, under identity initialization, both net-
cific setting, overparameterizing, thereby turning a convex
works immediately make progress, and again depth serves
problem non-convex, is a more effective optimization strat-
as an implicit accelerator.
egy than carefully designed algorithms tailored for convex
problems. We note that this was not observed with all al- As a final sanity test, we evaluate the effect of overparam-
gorithms – for example Adam (Kingma & Ba, 2014) was eterization on optimization in a non-idealized (yet simple)
considerably faster than overparameterization. However, deep learning setting. Specifically, we experiment with the
when introducing overparameterization simultaneously with convolutional network tutorial for MNIST built into Ten-
101
original be interpreted as a combination between certain forms of
1-layer
overparameterized
4-layer
8-layer
adaptive learning rate and momentum. Given that it depends
batch loss
100
4-layer, id-init
8-layer, id-init
100
on network depth but not on width, acceleration by overpa-
rameterization can be attained at a minimal computational
10−1
price, as we demonstrate empirically in Section 8.
10−1
0 20000 40000 60000 80000 100000 0 1000 2000 3000 4000 5000 6000 7000 8000
iteration iteration Clearly, complete theoretical analysis for non-linear net-
Figure 5. (to be viewed in color) Left: Gradient descent optimiza- works will be challenging. Empirically however, we showed
tion of single layer model vs. linear networks deeper than before
that the trivial idea of replacing an internal weight matrix by
(depths 4, 8). For deep networks, both near-zero and near-identity
a product of two can significantly accelerate optimization,
initializations were evaluated. Setup identical to that of Figure 3-
right. Notice that deep networks suffer from vanishing gradients with absolutely no effect on expressiveness (Figure 5-right).
under near-zero initialization, while near-identity (“residual”) ini- The fact that gradient descent over classic convex problems
tialization eliminates the problem. Right: Stochastic gradient such as linear regression with `p loss, p > 2, can accelerate
descent optimization in TensorFlow’s convolutional network tu-
from transitioning to a non-convex overparameterized objec-
torial for MNIST. Plot shows batch loss per iteration, in original
tive, does not coincide with conventional wisdom, and pro-
setting vs. overparameterized one (depth-2 linear networks in place
of dense layers). vides food for thought. Can this effect be rigorously quanti-
fied, similarly to analyses of explicit acceleration methods
such as momentum or adaptive regularization (AdaGrad)?
sorFlow,4 which includes convolution, pooling and dense
layers, ReLU non-linearities, stochastic gradient descent Acknowledgments
with momentum, and dropout (Srivastava et al., 2014). We Sanjeev Arora’s work is supported by NSF, ONR, Simons
introduced overparameterization by simply placing two ma- Foundation, Schmidt Foundation, Mozilla Research, Ama-
trices in succession instead of the matrix in each dense layer. zon Research, DARPA and SRC. Elad Hazan’s work is
Here, as opposed to previous experiments, widths of the supported by NSF grant 1523815 and Google Brain. Na-
newly formed hidden layers were not set to 1, but rather to dav Cohen is a member of the Zuckerman Israeli Post-
the minimal values that do not deteriorate expressiveness doctoral Scholars Program, and is supported by Eric and
(see Appendix C). Overall, with an addition of roughly 15% Wendy Schmidt.
in number of parameters, optimization has accelerated con-
siderably – see Figure 5-right. The displayed results were
obtained with the hyperparameter settings hardcoded into References
the tutorial. We have tried alternative settings (varying Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J.,
learning rates and standard deviations of initializations – see Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Ten-
Appendix C), and in all cases observed an outcome similar sorflow: A system for large-scale machine learning. In OSDI,
volume 16, pp. 265–283, 2016.
to that in Figure 5-right – overparameterization led to sig-
nificant speedup. Nevertheless, as reported above for linear Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T.
networks, it is likely that for non-linear networks the effect Finding approximate local minima faster than gradient descent.
of depth on optimization is mixed – some settings accelerate In Proceedings of the 49th Annual ACM SIGACT Symposium
on Theory of Computing, pp. 1195–1199. ACM, 2017.
by it, while others do not. Comprehensive characteriza-
tion of the cases in which depth accelerates optimization Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding
warrants much further study. We hope our work will spur deep neural networks with rectified linear units. International
interest in this avenue of research. Conference on Learning Representations (ICLR), 2018.
Baldi, P. and Hornik, K. Neural networks and principal component
analysis: Learning from examples without local minima. Neural
9. Conclusion networks, 2(1):53–58, 1989.
Through theory and experiments, we demonstrated that over-
Boyce, W. E., DiPrima, R. C., and Haines, C. W. Elementary
parameterizing a neural network by increasing its depth can differential equations and boundary value problems, volume 9.
accelerate optimization, even on very simple problems. Wiley New York, 1969.
Our analysis of linear neural networks, the subject of vari- Buck, R. C. Advanced calculus. Waveland Press, 2003.
ous recent studies, yielded a new result: for these models,
overparameterization by depth can be understood as a pre- Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accel-
erated methods for non-convex optimization. arXiv preprint
conditioning scheme with a closed form description (Theo- arXiv:1611.00756, 2016.
rem 1 and the claims thereafter). The preconditioning may
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and
4
https://github.com/tensorflow/models/ LeCun, Y. The loss surfaces of multilayer networks. In Artificial
tree/master/tutorials/image/mnist Intelligence and Statistics, pp. 192–204, 2015.
Cohen, N., Sharir, O., Levine, Y., Tamari, R., Yakira, D., Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. On the
and Shashua, A. Analysis and design of convolutional net- ability of neural nets to express distributions. arXiv preprint
works via hierarchical tensor decompositions. arXiv preprint arXiv:1702.07028, 2017.
arXiv:1705.02302, 2017.
Livni, R., Shalev-Shwartz, S., and Shamir, O. On the compu-
Daniely, A. Depth separation for neural networks. arXiv preprint tational efficiency of training neural networks. Advances in
arXiv:1702.08489, 2017. Neural Information Processing Systems, 2014.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods Nesterov, Y. A method of solving a convex programming problem
for online learning and stochastic optimization. Journal of with convergence rate o (1/k2). In Soviet Mathematics Doklady,
Machine Learning Research, 12(Jul):2121–2159, 2011. volume 27, pp. 372–376, 1983.
Eldan, R. and Shamir, O. The power of depth for feedforward Nocedal, J. Updating quasi-newton matrices with limited storage.
neural networks. arXiv preprint arXiv:1512.03965, 2015. Mathematics of Computation, 35(151):773–782, 1980.
Fukumizu, K. Effect of batch learning in multilayer neural net- Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-
works. Gen, 1(04):1E–03, 1998. Dickstein, J. On the expressive power of deep neural networks.
arXiv preprint arXiv:1606.05336, 2016.
Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle
points—online stochastic gradient for tensor decomposition. In Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., and
Conference on Learning Theory, pp. 797–842, 2015. Huerta, R. On the calibration of sensor arrays for pattern recog-
nition using the minimal number of experiments. Chemometrics
Goh, G. Why momentum really works. Distill, 2017. doi: 10. and Intelligent Laboratory Systems, 130:123–134, 2014.
23915/distill.00006. URL http://distill.pub/2017/
momentum. Safran, I. and Shamir, O. On the quality of the initial basin in
overspecified neural networks. In International Conference on
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Machine Learning, pp. 774–782, 2016.
learning, volume 1. MIT press Cambridge, 2016.
Safran, I. and Shamir, O. Spurious local minima are com-
Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively mon in two-layer relu neural networks. arXiv preprint
characterizing neural network optimization problems. arXiv arXiv:1712.08968, 2017.
preprint arXiv:1412.6544, 2014.
Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions
Haeffele, B. D. and Vidal, R. Global Optimality in Tensor Fac- to the nonlinear dynamics of learning in deep linear neural
torization, Deep Learning, and Beyond. CoRR abs/1202.2745, networks. arXiv preprint arXiv:1312.6120, 2013.
cs.NA, 2015.
Soudry, D. and Carmon, Y. No bad local minima: Data independent
Hardt, M. and Ma, T. Identity matters in deep learning. arXiv training error guarantees for multilayer neural networks. arXiv
preprint arXiv:1611.04231, 2016. preprint arXiv:1605.08361, 2016.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and
for online convex optimization. Mach. Learn., 69(2-3):169–192, Salakhutdinov, R. Dropout: a simple way to prevent neural net-
December 2007. ISSN 0885-6125. works from overfitting. Journal of Machine Learning Research,
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for 15(1):1929–1958, 2014.
image recognition. arXiv preprint arXiv:1512.03385, 2015. Su, W., Boyd, S., and Candes, E. A differential equation for
Helmke, U. and Moore, J. B. Optimization and dynamical systems. modeling nesterovs accelerated gradient method: Theory and
Springer Science & Business Media, 2012. insights. In Advances in Neural Information Processing Systems,
pp. 2510–2518, 2014.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. In Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gra-
International conference on machine learning, pp. 448–456, dient by a running average of its recent magnitude. COURSERA:
2015. Neural networks for machine learning, 4(2):26–31, 2012.
Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the Perils Vergara, A., Vembu, S., Ayhan, T., Ryan, M. A., Homer, M. L.,
of Non-Convexity: Guaranteed Training of Neural Networks and Huerta, R. Chemical gas sensor drift compensation using
using Tensor Methods. CoRR abs/1506.08473, 2015. classifier ensembles. Sensors and Actuators B: Chemical, 166:
320–329, 2012.
Jones, E., Oliphant, T., Peterson, P., et al. SciPy: Open source sci-
entific tools for Python, 2001–. URL http://www.scipy. Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational per-
org/. [Online; accessed ¡today¿]. spective on accelerated methods in optimization. Proceedings
of the National Academy of Sciences, 113(47):E7351–E7358,
Kawaguchi, K. Deep learning without poor local minima. In 2016.
Advances in Neural Information Processing Systems, pp. 586–
594, 2016. Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv
preprint arXiv:1212.5701, 2012.
Kingma, D. and Ba, J. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
A. Deferred Proofs the left. This yields:

A.1. Proof of Theorem 1 > >
Wj+1 (t)Ẇj+1 (t) + ηλ · Wj+1 (t)Wj+1 (t) =
Before delving into the proof, we introduce notation that will Ẇj (t)Wj> (t) + ηλ · Wj (t)Wj> (t)
admit a more compact presentation of formulae. For 1 ≤
a ≤ b ≤ N , we denote: Taking the transpose of these equations and adding to them-
selves, we obtain, for every j = 1 . . . N −1:
Yj=b
Wj := Wb Wb−1 · · · Wa >
Wj+1 >
(t)Ẇj+1 (t) + Ẇj+1 (t)Wj+1 (t)+
a
Yb >
Wj> := Wa> Wa+1 >
· · · Wb> 2ηλ · Wj+1 (t)Wj+1 (t) =
j=a
Ẇj (t)Wj> (t) + Wj (t)Ẇj> (t)+
where W1 . . . WN are the weight matrices of the depth-
2ηλ · Wj (t)Wj> (t) (17)
N linear network (Equation 2). If a > b, then by defini-
Qj=b Qb
tion both a Wj and j=a Wj> are identity matrices, Denote for j = 1 . . . N :
with size depending on context, i.e. on the dimensions of
matrices they are multiplied against. Given any square Cj (t) := Wj (t)Wj> (t) , Cj0 (t) := Wj> (t)Wj (t)
matrices (possibly scalars) A1 , A2 , . . . , Am , we denote by
Equation 17 can now be written as:
diag(A1 . . . Am ) a block-diagonal matrix holding them on
its diagonal: 0
Ċj+1 0
(t) + 2ηλ · Cj+1 (t) = Ċj (t) + 2ηλ · Cj (t)
A1 0 0 0 , j = 1. . .N − 1
 
 0 ... 0 0
diag(A1 . . . Am ) = 
 
 Turning to Lemma 1 below, while recalling our assumption
0 0 Am 0 for time t0 (Equation 7):
0 0 0 0 0
Cj+1 (t0 ) = Cj (t0 ) , j = 1. . .N − 1
As illustrated above, diag(A1 . . . Am ) may hold additional,
zero-valued rows and columns beyond A1 . . . Am . Con- we conclude that, throughout the entire time-line:
versely, it may also trim (omit) rows and columns, from its 0
Cj+1 (t) = Cj (t) , j = 1. . .N − 1
bottom and right ends respectively, so long as only zeros
are being removed. The exact shape of diag(A1 . . . Am ) is Recollecting the definitions of Cj (t), Cj0 (t), this means:
again determined by context, and so if B and C are matrices,
>
the expression B · diag(A1 . . . Am ) · C infers a number of Wj+1 (t)Wj+1 (t) = Wj (t)Wj> (t) , j = 1. . .N −1 (18)
rows equal to the number of columns in B, and a number of
columns equal to the number of rows in C.
Regard t now as fixed, and for every j = 1 . . . N , let:
Turning to the actual proof, we disregard the trivial case Wj (t) = Uj Σj Vj> (19)
N = 1, and begin by noticing that Equation 3, along with
the definition of We (Equation 5), imply that for every j = be a singular value decomposition. That is to say, Uj and Vj
1 . . . N: are orthogonal matrices, and Σj is a rectangular-diagonal
j−1
matrix holding non-decreasing, non-negative singular values
N
∂LN Y
> dL
1 Y on its diagonal. Equation 18 implies that for j = 1 . . . N −1:
(W1 , . . ., WN ) = Wi · (We ) · Wi>
∂Wj dW
i=j+1 i=1 Vj+1 Σ> > > >
j+1 Σj+1 Vj+1 = Uj Σj Σj Uj
Plugging this into the differential equations of gradient de- For a given j, the two sides of the above equation are
scent (Equation 6), we get: both orthogonal eigenvalue decompositions of the same ma-
Ẇj (t) = −ηλWj (t) (16) trix. The square-diagonal matrices Σ> >
j+1 Σj+1 and Σj Σj
are thus the same, up to a possible permutation of diag-
N j−1
Y dL1 Y onal elements (eigenvalues). However, since by defini-
−η Wi> (t) · (We (t)) · Wi> (t)
dW tion Σj+1 and Σj have non-increasing diagonals, it must
i=j+1 i=1
hold that Σ> >
j+1 Σj+1 = Σj Σj . Let ρ1 >ρ2 > · · · >ρm ≥0
, j = 1. . .N be the distinct eigenvalues, with corresponding multiplici-
ties d1 , d2 , . . . , dm ∈ N. We may write:
For j = 1 . . . N −1, multiply the j’th equation by Wj> (t)
from the right, and the j+1’th equation by Wj+1>
(t) from Σ> >
j+1 Σj+1 = Σj Σj = diag(ρ1 Id1 , . . . , ρm Idm ) (20)
where Idr , 1≤r≤m, is the identity matrix of size dr × dr . It follows that for every j = 1 . . . N :
Moreover, there exist orthogonal matrices Oj,r ∈ Rdr ,dr ,
i=N N
1≤r≤m, such that: Y Y N −j+1
Wi> (t) = We (t)We> (t) N

Wi (t) (23)
j i=j
Uj = Vj+1 · diag(Oj,1 , . . . , Oj,m ) j i=j
Y Y j
Wi> (t) Wi (t) = We> (t)We (t) N

(24)
Oj,r here is simply a matrix changing between orthogonal i=1 1
bases in the eigenspace of ρr – it maps the basis comprising
N −j+1 j
Vj+1 -columns to that comprising Uj -columns. Recalling where [·] N and [·] N stand for fractional power operators
that both Σj and Σj+1 are rectangular-diagonal, holding defined over positive semidefinite matrices.
only non-negative values, Equation 20 implies that each
√ √
of these matrices is equal to diag( ρ1 ·Id1 , . . . , ρm ·Idm ). With Equations 23 and 24 in place, we are finally in a po-
Note that the matrices generally do not have the same shape sition to complete the proof. Returning Qi=N to Equation 16,
and thus, formally, are not equal to one another. Nonethe- we multiply Ẇj (t) from the left by j+1 Wi (t) and from
less, in line with our diag notation (see beginning of this the right by 1
Qi=j−1
Wi (t), followed by summation over
subsection), Σj and Σj+1 may differ from each other only j = 1 . . . N . This gives:
in trailing, zero-valued rows and columns. By an inductive
argument, all the singular value matrices Σ1 , Σ2 , . . . , ΣN XN Yi=N Y
i=j−1

(see Equation 19) are equal up to trailing zero rows and Wi (t) Ẇj (t) Wi (t) =
j=1 j+1 1
columns. The fact that ρ1 . . . ρm do not include an index j XN Yi=N Y
i=j−1

in their notation is thus in order, and we may write, for every −ηλ Wi (t) Wj (t) Wi (t)
j=1 j+1 1
j = 1 . . . N −1:
XN Yi=N YN
−η Wi (t) Wi> (t) ·
j=1 j+1 i=j+1
Wj (t) = Uj Σj Vj>
1
Y
= Vj+1 · diag(Oj,1 , . . . , Oj,m ) · dL j−1 Yi=j−1
(We (t)) · Wi> (t) Wi (t)
√ √ dW i=1 1
diag( ρ1 ·Id1 , . . . , ρm ·Idm ) · Vj>
Qi=N
By definition We (t) = 1 Wj (t), so we can substitute
For the N ’th weight matrix we have: the first two lines above:
WN (t) = UN ΣN VN> Ẇe (t) = −ηλN · We (t)

√ √
= UN · diag( ρ1 ·Id1 , . . . , ρm ·Idm ) · VN> N Y
X i=N YN
−η Wi (t) Wi> (t) ·
j+1 i=j+1
j=1
Concatenations of weight matrices thus simplify as follows:
dL1
Y
j−1 Yi=j−1
(We (t)) · Wi> (t) Wi (t)
Qi=N
Wi (t)
QN
Wi> (t) = (21) dW i=1 1
j i=j

UN · diag (ρ1 )N −j+1 ·Id1 , . . . , (ρm )N −j+1 ·Idm · UN
> Finally, plugging in the relations in Equations 23 and 24,
the sought-after result is revealed:
Qj >
Qi=j
i=1 Wi (t) 1 Wi (t) = (22) Ẇe (t) = −ηλN · We (t)

N
V1 · diag (ρ1 )j ·Id1 , . . . , (ρm )j ·Idm · V1> X N −j
We (t)We> (t) N ·

−η
j=1
, j = 1...N
dL1 j−1
(We (t)) · We> (t)We (t) N

dW
where we used the orthogonality of Oj,r , and the obvi-
ous fact that it commutes with Idr . Consider Equation 21
with j = 1 and Equation 22 Qwith j = N , while recalling
i=N
that by definition We (t) = 1 Wj (t):
Lemma 1. Let I ⊂ R be a connected interval, and let
f, g : I → R be differentiable functions. Suppose that there
We (t)We> (t) = UN ·diag (ρ1 )N Id1 , . . . , (ρm )N Idm ·UN>
exists a constant α ≥ 0 for which:

We> (t)We (t) = V1 ·diag (ρ1 )N Id1 , . . . , (ρm )N Idm ·V1> f˙(t) + α · f (t) = ġ(t) + α · g(t) , ∀t ∈ I
Then, if f and g assume the same value at some t0 ∈ I in A and the number of columns in B. vec(·) here, as
(interior or boundary), they must coincide along the entire in claim statement, stands for matrix vectorization in
interval, i.e. it must hold that f (t) = g(t) for all t ∈ I. column-first order.
Proof. Define h := f − g. h is a differentiable function • If A1 , A2 , B1 and B2 are matrices such that the matrix
from I to R, and we have: products A1 B1 and A2 B2 are defined, then:
ḣ(t) = −α · h(t) , ∀t ∈ I (25) (A1 A2 )(B1 B2 ) = (A1 B1 ) (A2 B2 ) (28)
We know that h(t0 ) = 0 for some t0 ∈ I, and would • For any matrices A and B:
like to show that h(t) = 0 ∀t ∈ I. Assume by contra-
diction that this is not the case, so there exists t2 ∈ I for (A B)> = A> B > (29)
which h(t2 ) 6= 0. Without loss of generality, suppose that
h(t2 ) > 0, and that t2 > t0 . Let S be the zero set of h, • Equation 28 and 29 imply, that if A and B are some
i.e. S := {t ∈ I : h(t) = 0}. Since h is continuous orthogonal matrices, so is A B:
in I, S is topologically closed, therefore its intersection
with the interval [t0 , t2 ] is compact. Denote by t1 the maxi- A> = A−1 ∧ B > = B −1
mal element in this intersection, and consider the interval =⇒ (A B)> = (A B)−1 (30)
J := [t1 , t2 ] ⊂ I. By construction, h is positive along J, be-
sides on the endpoint t1 where it assumes the value of zero.
For t1 < t ≤ t2 , we may solve as follows the differential With the Kronecker product in place, we proceed to the
equation of h (Equation 25): actual proof. It suffices to show that vectorizing:
N h i j−1 i NN−j
ḣ(t) X N dL1 h
= −α =⇒ h(t) = βe−αt We(t) (We(t) )> · (We(t) )· (We(t) )> We(t)
h(t) j=1
dW
−αt2
where β is the positive constant defined by h(t2 ) = βe . yields:
Since in particular h is bounded away from zero on (t1 , t2 ],
dL1

(t)
and assumes zero at t1 , we obtain a contradiction to its PW (t) · vec (We )
e dW
continuity. This completes the proof.
where PW (t) is the preconditioning matrix defined in claim
e
A.2. Proof of Claim 1 statement. For notational conciseness, we hereinafter omit

(t)
the iteration index t, and simply write We instead of We .
Our proof relies on the Kronecker product operation for
matrices. For arbitrary matrices A and B of sizes ma × na Let Id and Ik be the identity matrices of sizes d × d and k ×
and mb × nb respectively, the Kronecker product A B is k respectively. Utilizing the properties of the Kronecker
defined to be the following block matrix: product, we have:
 
  N
a11 ·B · · · a1na ·B X j−1
dL 1 N −j
We We> N (We ) We> We N 

 . vec 
.. ..  ∈ Rma mb ,na nb

dW
A B :=   .. . .  j=1
ama 1 ·B · · · ama na ·B N
X j−1

Id We We> N

(26) = ·
where aij stands for the element in row i and column j of A. j=1
1
The Kronecker product admits numerous useful properties. > NN−j dL
We will employ the following: W e We Ik · vec (We )
dW
N 1
X N −j j−1 dL
• If A and B are matrices such that the matrix product = We> We N We We> N vec (We )
AB is defined, then: j=1
dW
vec(AB) = (B > IrA ) · vec(A) The first equality here makes use of Equation 27, and the
second of Equation 28. We will show that the matrix:
= (IcB A) · vec(B) (27)
N
X > NN−j j−1
where IrA and IcB are the identity matrices whose Q := We We We We> N (31)
sizes correspond, respectively, to the number of rows j=1
meets the characterization of PWe , thereby completing the A.3. Proof of Claim 2
proof. Let:
We disregard the trivial case N = 1, as well as the scenario
We = U DV > (t)
We = 0 (both lead Equations 10 and 12 to equate). Omit-
be a singular value decomposition, i.e. U ∈ Rk,k and V ∈ ting the iteration index t from our notation, it suffices to
Rd,d are orthogonal matrices, and D is a rectangular- show that:
diagonal matrix holding (non-negative) singular values on
N
its diagonal. Plug this into the definition of Q (Equation 31): X j−1 dL1 N −j
We We> N · (We ) · We> We N =

(35)
N j=1
dW
X N −j j−1
V D> DV > N U DD> U > N

Q= 2− 2
1 dL1
kWe k2 N dLdW (W e ) + (N − 1)P rWe dW (W e )
j=1
N
> NN−j > j−1 >
where P rWe {·} is the projection operator defined in claim
X
> N

= V D D V U DD U
j=1 statement (Equation 13), and we recall that by assump-
j−1
tion k = 1 (We ∈ R1,d ). We We> N is a scalar, equal

N
X > NN−j j−1
DD> N (V > U > )

= (V U ) D D 2 j−1 N −j
to kWe k2 N for every j = 1 . . . N . We> We N on the

j=1
  other hand is a d × d matrix, by definition equal to iden-
N tity for j = N , and otherwise, for j = 1 . . . N − 1, it
X > NN−j j−1
DD> N  (V U )>

= (V U )  D D 2 N −j >
is equal to kWe k2 N (We/kWe k2 ) (We/kWe k2 ). Plugging
j=1
these equalities into the first line of Equation 35 gives:
The third equality here is based on the relation in Equa-
N
tion 28, and the last equality is based on Equation 29. De- X j−1 dL1 N −j
We We> N (We ) We> We N =

noting: dW
j=1
N −1
O := V U (32) X 2 j−1 dL1 2 N −j

We
>
We

N kWe k2 N
(We ) kWe k2 N kW k kW k
X > NN−j j−1 dW e 2 e 2
Λ := D D DD> N (33) j=1
j=1 −1
2 NN dL1
+ kWe k2 · (We ) =
dW
we have: 1 >
2 N −1 dL

Q = OΛO> (34) (N − 1) kWe k2 N We
(We ) kW e k2
We
kWe k2
dW
Now, since by definition U and V are orthogonal, O is or- 1
2 N −1 dL
thogonal as well (follows from the relation in Equation 30). + kWe k2 N · (We )
dW
Additionally, the fact that D is rectangular-diagonal im-
plies that the square matrix Λ is also diagonal. Equation 34 The latter expression is precisely the second line of Equa-
is thus an orthogonal eigenvalue decomposition of Q. Fi- tion 35, thus our proof is complete.
nally, denote the columns of U (left singular vectors of We )
by u1 . . . uk , those of V (right singular vectors of We ) by
v1 . . . vd , and the diagonal elements of D (singular val-
A.4. Proof of Theorem 2
ues of We ) by σ1 . . . σmax{k,d} (by definition σr = 0 if
r > min{k, d}). The definitions in Equations 32 and 33 Our proof relies on elementary differential geometry:
imply that the columns of O are: curves, arc length and line integrals (see Chapters 8 and 9
in Buck (2003)).
vec(ur vr>0 ) , r = 1 . . . k , r0 = 1 . . . d
Let U ⊂ R1,d be a neighborhood of W = 0 (i.e. an open
1
with corresponding diagonal elements in Λ being: set that includes this point) on which dL dW is continuous
(U exists by assumption). It is not difficult to see that F (·)
XN −j
2 NN 2 j−1
σr σr 0 N , r = 1 . . . k , r0 = 1 . . . d (Equation 14) is continuous on U as well. The strategy of our
j=1 proof will be to show that F (·) does not admit the gradient
theorem (also known as the fundamental theorem for line
We conclude that Q indeed meets the characterization
integrals). According to the theorem, if h : U → R is a
of PWe in claim statement. This completes the proof.
continuously differentiable function, and Γ is a piecewise
smooth curve lying in U with start-point γs and end-point γe ,
then: Z Let R be a positive constant small enough such that the Eu-
dh clidean ball of radius R around the origin is contained in U.
= h(γe ) − h(γs )
Γ dW Let r be a positive constant smaller than R. Define Γr,R to
In words, the line integral of the gradient of h over Γ, is be a curve as follows (see illustration in Figure 1):6
equal to the difference between the value taken by h at the
Γr,R := Γ1r,R → Γ2r,R → Γ3r,R → Γ4r,R (38)
end-point of Γ, and that taken at the start-point. A direct
implication of the theorem is that if Γ is closed (γe = γs ), where:
the line integral vanishes:
I
dh • Γ1r,R is the line segment from −R · e to −r · e.
=0
dW
Γ • Γ2r,R is a geodesic on the sphere of radius r, starting
We conclude that if F (·) is the gradient field of some func- from −r · e and ending at r · e.
tion, its line integral over any closed (piecewise smooth) • Γ3r,R is the line segment from r · e to R · e.
curve lying in U must vanish. We will show that this is not
the case. • Γ4r,R is a geodesic on the sphere of radius R, starting
from R · e and ending at −R · e.
For notational conciseness we hereinafter identify R1,d
and Rd , so in particular U is now a subset of Rd . To further Γr,R is a piecewise smooth, closed curve that fully lies
simplify, we omit the subindex from the Euclidean norm, within U. Consider the linear functional it induces: φ 7→
writing k·k instead of k·k2 . Given an arbitrary continuous H
F . We will evaluate this functional on φ = dL
1
Γr,R φ dW . To
vector field φ : U → Rd , we define a respective (continuous)
do so, we decompose the latter as follows:
vector field as follows:
dL1
dW (·) = c · e(·) + ξ(·) (39)
F φ : U → Rd
where:
Fφ (w) = (36)
1
• c is a scalar equal to k dL
( 2
D E
dW (W =0)k.
2− w w
kwk N φ(w)+(N −1) φ(w), kwk kwk , w6=0
0 , w=0 • e(·) is a vector field returning the constant e (Equa-
tion 37).
1
Notice that for φ = dLdW , we get exactly the vector field F (·)
1
• ξ(·) is a vector field returning the values of dL
dW (·)
defined in theorem statement (Equation 14) – the subject of
dL1
our inquiry. As an operator on (continuous) vector fields, shifted by the constant − dW (W =0). It is continuous
the mapping φ 7→ Fφ is linear.5 This, along with the linear- on U and vanishes at the origin.
ity of line integrals, imply that for any piecewise
R smooth
curve Γ lying in U, the functional φ 7→ Γ Fφ , a mapping Applying Lemma 2 to ξ over Γr,R gives:
of (continuous) vector fields to scalars, is linear. Lemma 2 I
below provides an upper bound on this linear functional in
2− 2
Fξ ≤ N · len(Γr,R ) · max kγk N · max kξ(γ)k

terms of the length of Γ, its maximal distance from origin,
Γr,R γ∈Γr,R γ∈Γr,R
and the maximal norm φ takes on it. 2
= N · (πr + πR + 2(R − r)) · R2− N · max kξ(γ)k
In light of the above, to show that F (·) contradicts the γ∈Γr,R
2
gradient theorem, thereby completing the proof, it suffices ≤ N · 2π · R 3− N
· max kξ(γ)k
to find a closed (piecewise smooth) curve Γ for which the γ∈Γr,R
1
linear functional φ 7→ Γ Fφ does not vanish at φ = dL
H
dW . On the other hand, by Lemma 3:
1
By assumption dL dW (W =0) 6= 0, and so we may define the I
1
unit vector in the direction of dL 2N 2 2

dW (W =0):
3− N 3− N
Fe = − 2 R − r
Γr,R 3 − 2/N
dL1
dW (W =0) 6
e := ∈ Rd (37) The proof would have been slightly simplified had we used
1 (W =0)
dL
dW a curve that passes directly through the origin. We avoid this in
order to emphasize that the result is not driven by some point-wise
5
For any φ1 , φ2 : U → Rd and c ∈ R, it holds that Fφ1 +φ2 = singularity (the origin received special treatment in the definition
Fφ1 + Fφ2 and Fc·φ1 = c · Fφ1 . of F (·) – see Equations 14 and 13).
Lemma 2. Let φ : U → Rd be a continuous vector field,

H
The linearity of the functional φ 7→ Γr,R
Fφ , along with
Equation 39, then imply: and let Γ be a piecewise smooth curve lying in U. Consider
I I I the (continuous) vector field Fφ : U → Rd defined in Equa-
tion 36. The line integral of the latter over Γ is bounded as
F dL1 = c · Fe + Fξ
Γr,R dW
Γr,R Γr,R follows:

2N Z
2 2
3− N 3− N 2
≥ c· − 2 R − r Fφ ≤ N · len(Γ) · max kγk2− N · max kφ(γ)k

3 − 2/N
Γ
γ∈Γ γ∈Γ
2
− N · 2π · R3− N · max kξ(γ)k where len(Γ) is the arc length of Γ, and γ ∈ Γ refers to a
γ∈Γr,R
point lying on the curve.
We will show that for proper choices of R and r, the lower
bound above is positive. Γr,R will then be a piecewise Proof. We begin by noting that the use of max (as opposed
smooth closed curve lying in U, for which the functional to sup) in stated upper bound is appropriate, since under our
1
φ 7→ Γr,R Fφ does not vanish at φ = dL
H
dW . As stated, this
definition of a curve (adopted from Buck (2003)), points
will imply that F (·) violates the gradient theorem, thereby lying on it constitute a compact set. This subtlety is of little
concluding our proof. importance – one may as well replace max by sup, and the
lemma would still serve its purpose.
All that is left is to affirm that the expression: It is not difficult to see that for any w ∈ U, w 6= 0:

w w
2 2

c · 3−2N R3− N − r3− N 2− 2
2/N − 2 kFφ (w)k = kwk N φ(w)+(N −1) φ(w),
kwk kwk
2
−N · 2π · R3− N · maxγ∈Γr,R kξ(γ)k
2− N2 w w
≤ kwk kφ(w)k +(N −1) φ(w),
·
kwk kwk
can indeed be made positive with proper choices of R and r.
Recall that: 2− N2 w
= kwk kφ(w)k +(N −1) φ(w),
kwk
2
2N 2− N
• N > 2 by assumption; implies 3−2/N − 2 > 0. ≤ kwk (kφ(w)k +(N −1)kφ(w)k)
2
2− N
• R is any positive constant small enough such that the ≤ N kwk kφ(w)k
ball of radius R around the origin is contained in U. 2− 2
Trivially, kFφ (w)k ≤ N kwk N kφ(w)k holds for w=0
• r is any positive constant smaller than R. as well. The sought-after result now follows from the prop-
erties of line integrals:
• Γr,R is a curve whose points are all within distance R Z Z Z
from the origin.

Fφ ≤
2− 2
kFφ k ≤ N kwk N kφ(w)k
Γ Γ Γ
1
• c = k dL
dW (W =0)k – positive by assumption. ≤ N · len(Γ) · max kγk
2
2− N
· max kφ(γ)k
γ∈Γ γ∈Γ
• ξ(·) is a vector field that is continuous on U and van-
ishes at the origin.
Lemma 3. Let e be a unit vector, let Γr,R be a piecewise
The following procedure gives R and r as required: smooth closed curve as specified in Equation 38 and the text
thereafter, and let φ 7→ Fφ be the operator on continuous
2 2 vector fields defined by Equation 36. Overloading notation
• Set r to follow R such that: r3− N = 0.5 · R3− N .
by regarding e(·) ≡ e as a constant vector field, it holds
• Choose > 0 for which 0.5c 3− 2N that:
2 −2 − 2πN > 0.
N I
2N 2
3− N 2
3− N

Fe = − 2 R − r
• Set R to be small enough such that kξ(w)k ≤ for Γr,R 3 − 2/N
any point w within distance R from the origin.
Proof. We compute the line integral by decomposing Γr,R
The proof is complete. into its smooth components Γ1r,R . . . Γ4r,R :
I Z Z Z Z
Fe = Fe + Fe + Fe + Fe (40)
Γr,R Γ1r,R Γ2r,R Γ3r,R Γ4r,R
Starting from Γ1r,R , notice that for every point w lying on In the context of Section 7, we will treat the setting of p = 4
w w
this curve: he, kwk i kwk = e. Therefore: (`4 loss) and N = 2 (depth-2 network). We will also as-
Z Z Z sume, in accordance with the problem being ill-conditioned –
2
2− N 2
2− N y1 y2 , that initialization values are ill-conditioned as well,
Fe = kwk (e+(N −1)e) = N kwk e (0)
Γ1r,R Γ1r,R Γ1r,R and in particular 1 /2 ≈ y1 /y2 , where i := |wi |. An ad-
ditional assumption we make is that y2 is on the order of 1,
The line integral on the right translates into a simple univari- and thus the near-zero initialization of w1 and w2 implies
ate integral: y2 1 , 2 . Finally, we assume that 1 y1 1.
Z −r Z R
As shown in Section 7, under gradient descent, w1 and w2
Z
2− 2 2− 2 2
kwk N e = |ρ| N dρ = ρ2− N dρ move independently, and to prevent divergence, the learning
Γ1r,R −R r
rate must satisfy η < min{2/y1p−2 , 2/y2p−2 }. In our setting,
1 2
3− N 2
3− N

= R − r this translates to (GD below stands for gradient descent):
3 − 2/N
η GD < 2/y12 (45)
We thus have:
For w2 , the optimal learning rate (convergence in a single
step) is 1/y22 , and the constraint above will lead to very slow
Z
N 2
3− N 2
3− N

Fe = R − r (41)
Γ1r,R 3 − 2/N convergence (see Equation 15 and its surrounding text).
Suppose now that we optimize via overparameterization,
Turning to Γ2r,R , note that for any point w along this curve i.e. with the update rule in Equation 12 (single output). In
2
our particular setting (recall, in addition to the above, that
2− N 2− N2 w
kwk =r , and kwk is perpendicular to the direc- we omitted weight decay for simplicity – λ = 0), this update
tion of motion. This implies: rule translates to:
(t+1) >
] ←[ [w1 , w2 ]>
(t+1) (t) (t)
Z Z [w1 , w2 (46)
2
Fe = r2− N e 1/2
· [(w1 − y1 )3 , (w2 − y2 )3 ]>
(t) (t) (t) (t)
Γ2r,R Γ2r,R −η (w1 )2 + (w2 )2
−1/2
(t) (t)
The line integral Γ2 e is simply equal to the progress Γ2r,R −η (w1 )2 + (w2 )2
R
r,R
· w1 (w1 − y1 )3 + w2 (w2 − y2 )3 · [w1 , w2 ]>

makes in the direction of e, which is 2r. Accordingly: (t) (t) (t) (t) (t) (t)
(0)
Z
2
Fe = r2− N · 2r = 2r3− N
2
(42) For the first iteration (t = 0), replacing i := |wi |, while
Γ2r,R recalling that y1 y2 1 2 , we obtain:
(1) (1)
[w1 , w2 ]> ≈ η · 1 · [y13 , y23 ]> + η · y13 · [1 , 2 ]>
As for Γ3r,R and Γ4r,R , their line integrals may be computed
similarly to those of Γ1r,R and Γ2r,R respectively. Such com- = η · [21 y13 , 1 y23 + 2 y13 ]>
putations yield: (1)
Set η = 1/21 y12 . Then w1 ≈ y1 and w2 ≈ y23 /2y12 +
(1)
Z
N 2 2
2 y1 /21 . Our assumptions thus far (y1 y2 and 1 2 )
3− N 3− N
Fe = R − r (43) (1) (1)
imply w1 w2 . Moreover, since 2 /1 ≈ y2 /y1 , it
Γ3r,R 3 − 2/N
(1)
Z
2
holds that w2 ∈ O(y2 ) = O(1). Taking all of this into ac-
Fe = −2R3− N (44) count, the second iteration (t = 1) of the overparameterized
Γ4r,R update rule (Equation 46) becomes:
(2) (2) (1)
Combining Equation 40 with Equations 41, 42, 43 and 44, [w1 , w2 ]> ≈ [y1 , w2 ]>
we obtain the desired result. 1 (1) (1)
− [(w1 − y1 )3 , (w2 − y2 )3 ]>
21 y1
(1) (1) (1)
B. A Concrete Acceleration Bound y1 (w1 − y1 )3 + w2 (w2 − y2 )3 (1)
− [y1 , w2 ]>
21 y13
In Section 7 we illustrated qualitatively, on a family of
(1) (1)
very simple hypothetical learning problems, the potential ≈ [y1 , w2 − 1/21 y1 · (w2 − y2 )3 ]>
of overparameterization (use of depth-N linear network in
place of classic linear model) to accelerate optimization. In In words, w1 will stay approximately equal to y1 , whereas
this appendix we demonstrate how the illustration can be w2 will take a step that corresponds to gradient descent with
made formal, by considering a special case and deriving a learning rate (OP below stands for overparameterization):
concrete bound on the acceleration. η OP := 1/21 y1 (47)
By assumption 1 y1 1 and y2 ∈O(1), thus η OP <2/y22 , C.2. Convolutional Network

meaning that w2 will remain on the order of y2 (or less).
An inductive argument can therefore be applied, and our For the experiment on TensorFlow’s MNIST convolutional
observation regarding the second iteration (t = 1) continues network tutorial, we simply downloaded the code,7 and
to hold throughout – w1 is (approximately) fixed at y1 , and introduced two minor changes:
w2 follows steps that correspond to gradient descent with • Hidden dense layer: 3136×512 weight matrix replaced
learning rate η OP . by multiplication of 3136×512 and 512×512 matrices.
To summarize our findings, we have shown that while stan- • Output layer: 512×10 weight matrix replaced by mul-
dard gradient descent limits w2 with a learning rate η GD tiplication of 512×10 and 10×10 matrices.
that is at most 2/y12 (Equation 45), overparameterization The newly introduced weight matrices were initialized in the
can be adjusted to induce on w2 an implicit gradient descent same way as their predecessors (random Gaussian distribu-
scheme with learning rate η OP = 1/21 y1 (Equation 47), tion with mean zero and standard deviation 0.1). Besides the
all while admitting immediate (single-step) convergence above, no change was made. An addition of roughly 250K
for w1 . Since both η GD and η OP are well below 1/y22 , we parameters to a 1.6M -parameter model gave the speedup
obtain acceleration by at least η OP /η GD > y1 /41 (we presented in Figure 5-right.
remind the reader that y1 1 is the target value of w1 , and
1 1 is the magnitude of its initialization). To rule out the possibility of the speedup resulting from sub-
optimal learning rates, we reran the experiment with grid
C. Implementation Details search over the latter. The learning rate hardcoded into the
tutorial follows an exponentially decaying schedule, with
Below we provide implementation details omitted from our base value 10−2 . For both the original and overparameter-
experimental report (Section 8). ized models, training was run multiple times, with the base
value varying in {10−5 , 5 · 10−5 , . . . , 10−1 , 5 · 10−1 }. We
C.1. Linear Neural Networks chose, for each model separately, the configuration giving
The details hereafter apply to all of our experiments besides fastest convergence, and then compared the models one
that on the convolutional network (Figure 5-right). against the other. The observed gap in convergence rates
was similar to that in Figure 5-right.
In accordance with our theoretical setup (Section 4), evalu-
ated linear networks did not include bias terms, only weight An additional point we set out to examine, is the sensitivity
matrices. The latter were initialized to small values, drawn of the speedup to initialization of overparameterized layers.
i.i.d. from a Gaussian distribution with mean zero and stan- For this purpose, we retrained the overparameterized model
dard deviation 0.01. The only exception to this was the multiple times, varying in {10−3 , 5 · 10−3 , . . . , 10−1 , 5 ·
setting of identity initialization (Figure 5-left), in which 10−1 } the standard deviation of the Gaussian distribution
an offset of 1 was added to the diagonal elements of each initializing overparameterized layers (as stated above, this
weight matrix (including those that are not square). standard deviation was originally set to 10−1 ). Convergence
rates across the different runs were almost identical. In par-
When applying a grid search over learning rates, the values ticular, they were all orders of magnitude faster than the con-
{10−5 , 5 · 10−5 , . . . , 10−1 , 5 · 10−1 } were tried. We note vergence rate of the baseline, non-overparameterized model.
that in the case of depth-8 network with standard near-zero
initialization (Figure 5-left), all learning rates led either to
divergence, or to a failure to converge (vanishing gradients).
For computing optimal `2 loss (used as an offset in respec-
tive convergence plots), we simply solved, in closed form,
the corresponding least squares problem. For the optimal
`4 loss, we used scipy.optimize.minimize – a nu-
merical optimizer built into SciPy (Jones et al., 2001–), with
the default method of BFGS (Nocedal, 1980).
7
https://github.com/tensorflow/models/
tree/master/tutorials/image/mnist

On The Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

On The Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Uploaded by

Copyright:

Available Formats

On the Optimization of Deep Networks:

Implicit Acceleration by Overparameterization

Sanjeev Arora 1 2 Nadav Cohen 2 Elad Hazan 1 3

Abstract Given the longstanding consensus on expressiveness vs. op-

Claim 2. Assume k = 1, i.e. We ∈ R1,d . Then, the end-to-

as gaining confidence (increasing step sizes) when optimiza-

6. Overparametrization Effects Cannot Be

L2 loss (distance from optimum)

L4 loss (distance from optimum)

in R2 labels in R. Suppose that our training set consists iteration iteration

L4 loss (distance from optimum)

L4 loss (distance from optimum)

L4 loss (distance from optimum)

A. Deferred Proofs the left. This yields:

WN (t) = UN ΣN VN> Ẇe (t) = −ηλN · We (t)

ḣ(t) = −α · h(t) , ∀t ∈ I (25) (A1 A2 )(B1 B2 ) = (A1 B1 ) (A2 B2 ) (28)

A.2. Proof of Claim 1 statement. For notational conciseness, we hereinafter omit

Λ := D D DD> N (33) j=1

Lemma 2. Let φ : U → Rd be a continuous vector field,

· w1 (w1 − y1 )3 + w2 (w2 − y2 )3 · [w1 , w2 ]>

By assumption 1 y1 1 and y2 ∈O(1), thus η OP <2/y22 , C.2. Convolutional Network

You might also like

On The Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

On The Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Uploaded by

Copyright:

Available Formats

On the Optimization of Deep Networks:

Implicit Acceleration by Overparameterization

Sanjeev Arora 1 2 Nadav Cohen 2 Elad Hazan 1 3

Abstract Given the longstanding consensus on expressiveness vs. op-

Claim 2. Assume k = 1, i.e. We ∈ R1,d . Then, the end-to-

as gaining confidence (increasing step sizes) when optimiza-

6. Overparametrization Effects Cannot Be 

L2 loss (distance from optimum)

L4 loss (distance from optimum)

in R2 labels in R. Suppose that our training set consists iteration iteration

L4 loss (distance from optimum)

L4 loss (distance from optimum)

L4 loss (distance from optimum)

A. Deferred Proofs the left. This yields:

WN (t) = UN ΣN VN> Ẇe (t) = −ηλN · We (t)

ḣ(t) = −α · h(t) , ∀t ∈ I (25) (A1 A2 )(B1 B2 ) = (A1 B1 ) (A2 B2 ) (28)

A.2. Proof of Claim 1 statement. For notational conciseness, we hereinafter omit

Λ := D D DD> N (33) j=1

Lemma 2. Let φ : U → Rd be a continuous vector field,

· w1 (w1 − y1 )3 + w2 (w2 − y2 )3 · [w1 , w2 ]>

By assumption 1 y1 1 and y2 ∈O(1), thus η OP <2/y22 , C.2. Convolutional Network

You might also like

6. Overparametrization Effects Cannot Be

By assumption 1 y1 1 and y2 ∈O(1), thus η OP <2/y22 , C.2. Convolutional Network