On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
1 Department of Computer Science, Princeton University, Princeton, NJ, USA  2 School of Mathematics, Institute for Advanced Study, Princeton, NJ, USA  3 Google Brain, USA. Correspondence to: Nadav Cohen <cohennadav@ias.edu>.

Abstract

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization – linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with ℓ_p loss, p > 2, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. We also prove that it is mathematically impossible to obtain the acceleration effect of overparameterization via gradients of any regularizer.

1. Introduction

How does depth help? This central question of deep learning still eludes full theoretical understanding. The general consensus is that there is a trade-off: increasing depth improves expressiveness, but complicates optimization. Superior expressiveness of deeper networks, long suspected, is now confirmed by theory, albeit for fairly limited learning problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., 2018). Difficulties in optimizing deeper networks have also long been clear – the signal held by a gradient gets buried as it propagates through many layers. This is known as the "vanishing/exploding gradient problem". Modern techniques such as batch normalization (Ioffe & Szegedy, 2015) and residual connections (He et al., 2015) have somewhat alleviated these difficulties in practice.

This paper conveys a counterintuitive message: increasing depth can accelerate optimization. The effect is shown, via first-cut theoretical and empirical analyses, to resemble a combination of two well-known tools in the field of optimization: momentum, which led to provable acceleration bounds (Nesterov, 1983); and adaptive regularization, a more recent technique proven to accelerate by Duchi et al. (2011) in their proposal of the AdaGrad algorithm. Explicit mergers of both techniques are quite popular in deep learning (Kingma & Ba, 2014; Tieleman & Hinton, 2012). It is thus intriguing that merely introducing depth, with no other modification, can have a similar effect, but implicitly.

There is an obvious hurdle in isolating the effect of depth on optimization: if increasing depth leads to faster training on a given dataset, how can one tell whether the improvement arose from a true acceleration phenomenon, or simply from better representational power (the shallower network was unable to attain the same training loss)? We respond to this hurdle by focusing on linear neural networks (cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt & Ma (2016); Kawaguchi (2016)). With these models, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices – an overparameterization.

We provide a new analysis of linear neural network optimization via direct treatment of the differential equations associated with gradient descent when training arbitrarily deep networks on arbitrary loss functions. We find that the overparameterization introduced by depth leads gradient descent to operate as if it were training a shallow (single layer) network, while employing a particular preconditioning scheme. The preconditioning promotes movement along directions already taken by the optimization, and can be seen as an acceleration procedure that combines momentum with adaptive learning rates. Even on simple convex problems such as linear regression with ℓ_p loss, p > 2, overparameterization via depth can significantly speed up training. Surprisingly, in some of our experiments, not only did overparameterization outperform naive gradient descent, but it was also faster than two well-known acceleration methods –
AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). In addition to purely linear networks, we also demonstrate (empirically) the implicit acceleration of overparameterization on a non-linear model, by replacing hidden layers with depth-2 linear networks. The implicit acceleration of overparameterization is different from standard regularization – we prove its effect cannot be attained via gradients of any fixed regularizer.

Both our theoretical analysis and our empirical evaluation indicate that acceleration via overparameterization need not be computationally expensive. From an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters.

The remainder of the paper is organized as follows. In Section 2 we review related work. Section 3 presents a warmup example of linear regression with ℓ_p loss, demonstrating the immense effect overparameterization can have on optimization, with as little as a single additional scalar. Our theoretical analysis begins in Section 4, with a setup of preliminary notation and terminology. Section 5 derives the preconditioning scheme implicitly induced by overparameterization, followed by Section 6, which shows that this form of preconditioning is not attainable via any regularizer. In Section 7 we qualitatively analyze a very simple learning problem, demonstrating how the preconditioning can speed up optimization. Our empirical evaluation is delivered in Section 8. Finally, Section 9 concludes.

2. Related Work

Theoretical study of optimization in deep learning is a highly active area of research. Works along this line typically analyze critical points (local minima, saddles) in the landscape of the training objective, either for linear networks (see for example Kawaguchi (2016); Hardt & Ma (2016), or Baldi & Hornik (1989) for a classic account), or for specific non-linear networks under different restrictive assumptions (cf. Choromanska et al. (2015); Haeffele & Vidal (2015); Soudry & Carmon (2016); Safran & Shamir (2017)). Other works characterize other aspects of objective landscapes; for example, Safran & Shamir (2016) showed that under certain conditions a monotonically descending path from initialization to global optimum exists (in compliance with the empirical observations of Goodfellow et al. (2014)).

The dynamics of optimization were studied in Fukumizu (1998) and Saxe et al. (2013), for linear networks. Like ours, these works analyze gradient descent through its corresponding differential equations. Fukumizu (1998) focuses on linear regression with ℓ_2 loss, and does not consider the effect of varying depth – only a two (single hidden) layer network is analyzed. Saxe et al. (2013) also focuses on ℓ_2 regression, but considers any depth beyond two (inclusive), ultimately concluding that increasing depth can slow down optimization, albeit by a modest amount. In contrast to these two works, our analysis applies to a general loss function, and any depth including one. Intriguingly, we find that for ℓ_p regression, acceleration by depth is revealed only when p > 2. This explains why the conclusion reached in Saxe et al. (2013) differs from ours.

Turning to general optimization, accelerated gradient (momentum) methods were introduced in Nesterov (1983), and later studied in numerous works (see Wibisono et al. (2016) for a short review). Such methods effectively accumulate gradients throughout the entire optimization path, using the collected history to determine the step at a current point in time. Use of preconditioners to speed up optimization is also a well-known technique. Indeed, the classic Newton's method can be seen as preconditioning based on second derivatives. Adaptive preconditioning with only first-order (gradient) information was popularized by the BFGS algorithm and its variants (cf. Nocedal (1980)). Relevant theoretical guarantees, in the context of regret minimization, were given in Hazan et al. (2007); Duchi et al. (2011). In terms of combining momentum and adaptive preconditioning, Adam (Kingma & Ba, 2014) is a popular approach, particularly for optimization of deep networks.

Algorithms with certain theoretical guarantees for non-convex optimization, and in particular for training deep neural networks, were recently suggested in various works, for example Ge et al. (2015); Agarwal et al. (2017); Carmon et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and references therein. Since the focus of this paper lies on the analysis of algorithms already used by practitioners, such works lie outside our scope.

3. Warmup: ℓ_p Regression

We begin with a simple yet striking example of the effect being studied. For linear regression with ℓ_p loss, we will see how even the slightest overparameterization can have an immense effect on optimization. Specifically, we will see that simple gradient descent on an objective overparameterized by a single scalar corresponds to a form of accelerated gradient descent on the original objective.

Consider the objective for a scalar linear regression problem with ℓ_p loss (p – even positive integer):

    L(w) = E_{(x,y)~S} [ (1/p) (x^T w - y)^p ]

x ∈ R^d here are instances, y ∈ R are continuous labels, S is a finite collection of labeled instances (training set), and w ∈ R^d is a learned parameter vector. Suppose now that we apply a simple overparameterization, replacing the parameter vector w by a vector w_1 ∈ R^d times a scalar w_2 ∈ R:

    L(w_1, w_2) = E_{(x,y)~S} [ (1/p) (x^T w_1 w_2 - y)^p ]
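Before any analysis, the two objectives can be probed numerically. Below is a minimal sketch on assumed toy data (dimensions, step size, and iteration budget are illustrative, not from the paper); it only checks that plain gradient descent makes progress on both the original and the overparameterized objective, deferring any timing comparison to the analysis that follows.

```python
import numpy as np

# Toy instance of the two objectives above (assumed data; p = 4).
rng = np.random.default_rng(0)
d, m, p = 8, 50, 4
X = rng.normal(size=(m, d))
y = X @ (0.3 * rng.normal(size=d))     # realizable continuous labels

def loss(w):
    # L(w) = mean of (x^T w - y)^p / p over the training set S
    return np.mean((X @ w - y) ** p) / p

def grad(w):
    # dL/dw = E[(x^T w - y)^(p-1) x]
    return X.T @ ((X @ w - y) ** (p - 1)) / m

eta, T = 1e-2, 5000                    # assumed step size / iteration budget
L0 = loss(np.zeros(d))

# Gradient descent on the original objective L(w).
w = np.zeros(d)
for _ in range(T):
    w -= eta * grad(w)

# Gradient descent on the overparameterized objective L(w1, w2),
# i.e. on the factorization w = w1 * w2 (chain rule for both factors).
w1, w2 = 0.1 * np.ones(d), 0.1
for _ in range(T):
    g = grad(w1 * w2)
    w1, w2 = w1 - eta * g * w2, w2 - eta * (g @ w1)

# Loose thresholds on purpose: this probe makes no claim about speed.
assert loss(w) < 0.1 * L0
assert loss(w1 * w2) < 0.5 * L0
```

The thresholds are deliberately loose; whether (and by how much) the overparameterized run is faster depends on initialization and step size, which is exactly what the coming derivation clarifies.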
Obviously the overparameterization does not affect the expressiveness of the linear model. How does it affect optimization? What happens to gradient descent on this non-convex objective?

Observation 1. Gradient descent over L(w_1, w_2), with fixed small learning rate and near-zero initialization, is equivalent to gradient descent over L(w) with particular adaptive learning rate and momentum terms.

To see this, consider the gradients of L(w) and L(w_1, w_2):

    ∇_w = E_{(x,y)~S} [ (x^T w - y)^{p-1} x ]
    ∇_{w_1} = E_{(x,y)~S} [ (x^T w_1 w_2 - y)^{p-1} w_2 x ]
    ∇_{w_2} = E_{(x,y)~S} [ (x^T w_1 w_2 - y)^{p-1} w_1^T x ]

Gradient descent over L(w_1, w_2) with learning rate η > 0:

    w_1^{(t+1)} ← w_1^{(t)} - η ∇_{w_1^{(t)}} ,   w_2^{(t+1)} ← w_2^{(t)} - η ∇_{w_2^{(t)}}

The dynamics of the underlying parameter w = w_1 w_2 are:

    w^{(t+1)} = w_1^{(t+1)} w_2^{(t+1)}
              ← ( w_1^{(t)} - η ∇_{w_1^{(t)}} ) ( w_2^{(t)} - η ∇_{w_2^{(t)}} )
              = w_1^{(t)} w_2^{(t)} - η w_2^{(t)} ∇_{w_1^{(t)}} - η ∇_{w_2^{(t)}} w_1^{(t)} + O(η^2)
              = w^{(t)} - η (w_2^{(t)})^2 ∇_{w^{(t)}} - η (w_2^{(t)})^{-1} ∇_{w_2^{(t)}} w^{(t)} + O(η^2)

η is assumed to be small, thus we neglect O(η^2). Denoting ρ^{(t)} := η (w_2^{(t)})^2 ∈ R and γ^{(t)} := η (w_2^{(t)})^{-1} ∇_{w_2^{(t)}} ∈ R, this gives:

    w^{(t+1)} ← w^{(t)} - ρ^{(t)} ∇_{w^{(t)}} - γ^{(t)} w^{(t)}

Since by assumption w_1 and w_2 are initialized near zero, w will initialize near zero as well. This implies that at every iteration t, w^{(t)} is a weighted combination of past gradients. There thus exist µ^{(t,τ)} ∈ R such that:

    w^{(t+1)} ← w^{(t)} - ρ^{(t)} ∇_{w^{(t)}} - Σ_{τ=1}^{t-1} µ^{(t,τ)} ∇_{w^{(τ)}}

We conclude that the dynamics governing the underlying parameter w correspond to gradient descent with a momentum term, where both the learning rate (ρ^{(t)}) and the momentum coefficients (µ^{(t,τ)}) are time-varying and adaptive.
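The derivation above can be checked numerically. The following sketch (toy data and a non-degenerate state are assumed, not taken from the paper) performs one overparameterized step and confirms that it matches the rewritten update exactly, up to the neglected second-order term, which here equals η²∇_{w_1}∇_{w_2}.

```python
import numpy as np

# Numerical check of Observation 1's algebra: one gradient descent step on
# the overparameterized objective equals the adaptive-rate update on
# w = w1 * w2, up to the stated O(eta^2) remainder.
rng = np.random.default_rng(1)
d, m, p, eta = 8, 50, 4, 1e-3
X = rng.normal(size=(m, d))
y = X @ (0.3 * rng.normal(size=d))

def grad(w):
    # dL/dw = E[(x^T w - y)^(p-1) x]
    return X.T @ ((X @ w - y) ** (p - 1)) / m

w1, w2 = 0.1 * rng.normal(size=d), 0.5   # assumed (nonzero) current state
w = w1 * w2
g = grad(w)
g1, g2 = g * w2, g @ w1                  # chain rule: grads w.r.t. w1 and w2

# Exact underlying parameter after one overparameterized step:
w_exact = (w1 - eta * g1) * (w2 - eta * g2)

# The rewritten update: rho(t) = eta * w2^2, gamma(t) = eta * g2 / w2.
rho, gamma = eta * w2 ** 2, eta * g2 / w2
w_rewritten = w - rho * g - gamma * w

# The two differ by exactly the neglected term eta^2 * grad_w1 * grad_w2.
assert np.allclose(w_exact - w_rewritten, eta ** 2 * g1 * g2)
```

Expanding the product (w_1 − η∇_{w_1})(w_2 − η∇_{w_2}) by hand reproduces the chain of equalities above term by term, which is what the final assertion verifies.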
4. Linear Neural Networks

Let X := R^d be a space of objects (e.g. images or word embeddings) that we would like to infer something about, and let Y := R^k be the space of possible inferences. Suppose we are given a training set {(x^(i), y^(i))}_{i=1}^m ⊂ X × Y, along with a (point-wise) loss function l : Y × Y → R_{≥0}. For example, y^(i) could hold continuous values with l(·) being the ℓ_2 loss: l(ŷ, y) = (1/2) ||ŷ - y||_2^2; or it could hold one-hot vectors representing categories with l(·) being the softmax-cross-entropy loss: l(ŷ, y) = -Σ_{r=1}^k y_r log( e^{ŷ_r} / Σ_{r'=1}^k e^{ŷ_{r'}} ), where y_r and ŷ_r stand for coordinate r of y and ŷ respectively. For a predictor φ, i.e. a mapping from X to Y, the overall training loss is L(φ) := (1/m) Σ_{i=1}^m l(φ(x^(i)), y^(i)). If φ comes from some parametric family Φ := {φ_θ : X → Y | θ ∈ Θ}, we view the corresponding training loss as a function of the parameters, i.e. we consider L^Φ(θ) := (1/m) Σ_{i=1}^m l(φ_θ(x^(i)), y^(i)). For example, if the parametric family in question is the class of (directly parameterized) linear predictors:

    Φ^lin := { x ↦ Wx | W ∈ R^{k,d} }                                   (1)

the respective training loss is a function from R^{k,d} to R_{≥0}.

In our context, a depth-N (N ≥ 2) linear neural network, with hidden widths n_1, n_2, ..., n_{N-1} ∈ N, is the parametric family of linear predictors below, where by definition n_0 := d and n_N := k. As customary, we refer to each W_j, j = 1...N, as the weight matrix of layer j. For simplicity of presentation, we hereinafter omit the hidden widths n_1...n_{N-1} from our notation and simply write Φ^N (n_1...n_{N-1} will be specified explicitly if not clear by context). That is, we denote:

    Φ^N := { x ↦ W_N W_{N-1} ··· W_1 x | W_j ∈ R^{n_j, n_{j-1}}, j = 1...N }        (2)

For completeness, we regard a depth-1 network as the family of directly parameterized linear predictors, i.e. we set Φ^1 := Φ^lin (see Equation 1).

The training loss that corresponds to a depth-N linear network – L^{Φ^N}(W_1, ..., W_N) – is a function from R^{n_1,n_0} × ··· × R^{n_N,n_{N-1}} to R_{≥0}. For brevity, we will denote this function by L^N(·). Our focus lies on the behavior of gradient descent when minimizing L^N(·). More specifically, we are interested in the dependence of this behavior on N, and in particular, in the possibility of increasing N leading to acceleration. Notice that for any N ≥ 2 we have:

    L^N(W_1, ..., W_N) = L^1(W_N W_{N-1} ··· W_1)                       (3)

and so the sole difference between the training loss of a depth-N network and that of a depth-1 network (classic linear model) lies in the replacement of a matrix parameter by a product of N matrices. This implies that if increasing N can indeed accelerate convergence, it is not an outcome of any phenomenon other than favorable properties of depth-induced overparameterization for optimization.

5. Implicit Dynamics of Gradient Descent

In this section we present a new result for linear neural networks, tying the dynamics of gradient descent on L^N(·) – the training loss corresponding to a depth-N network – to those on L^1(·) – the training loss of a depth-1 network (classic linear model). Specifically, we show that gradient descent on L^N(·), a complicated and seemingly pointless overparameterization, can be directly rewritten as a particular preconditioning scheme over gradient descent on L^1(·).

When applied to L^N(·), gradient descent takes on the following form:

    W_j^{(t+1)} ← (1 - ηλ) W_j^{(t)} - η (∂L^N/∂W_j)(W_1^{(t)}, ..., W_N^{(t)}) ,  j = 1...N        (4)

η > 0 here is a learning rate, and λ ≥ 0 is an optional weight decay coefficient. For simplicity, we regard both η and λ as fixed (no dependence on t). Define the underlying end-to-end weight matrix:

    W_e := W_N W_{N-1} ··· W_1                                          (5)

Given that L^N(W_1, ..., W_N) = L^1(W_e) (Equation 3), we view W_e as an optimized weight matrix for L^1(·), whose dynamics are governed by Equation 4. Our interest then boils down to the study of these dynamics for different choices of N. For N = 1 they are (trivially) equivalent to standard gradient descent over L^1(·). We will characterize the dynamics for N ≥ 2.

To be able to derive, in our general setting, an explicit update rule for the end-to-end weight matrix W_e (Equation 5), we introduce an assumption by which the learning rate is small, i.e. η^2 ≈ 0. Formally, this amounts to translating Equation 4 to the following set of differential equations:

    Ẇ_j(t) = -ηλ W_j(t) - η (∂L^N/∂W_j)(W_1(t), ..., W_N(t)) ,  j = 1...N        (6)

where t is now a continuous time index, and Ẇ_j(t) stands for the derivative of W_j with respect to time. The use of differential equations, for both theoretical analysis and algorithm design, has a long and rich history in optimization research (see Helmke & Moore (2012) for an overview). When step sizes (learning rates) are taken to be small, trajectories of discrete optimization algorithms converge to smooth curves modeled by continuous-time differential equations, paving the way to the well-established theory of the latter (cf. Boyce et al. (1969)). This approach has led to numerous interesting findings, including recent results in the context of acceleration methods (e.g. Su et al. (2014); Wibisono et al. (2016)).

With the continuous formulation in place, we turn to express the dynamics of the end-to-end matrix W_e:

Theorem 1. Assume the weight matrices W_1...W_N follow the dynamics of continuous gradient descent (Equation 6). Assume also that their initial values (time t_0) satisfy, for j = 1...N-1:

    W_{j+1}^T(t_0) W_{j+1}(t_0) = W_j(t_0) W_j^T(t_0)                   (7)

Then, the end-to-end weight matrix W_e (Equation 5) is governed by the following differential equation:

    Ẇ_e(t) = -ηλN · W_e(t) - η Σ_{j=1}^N [ W_e(t) W_e^T(t) ]^{(j-1)/N} · (dL^1/dW)(W_e(t)) · [ W_e^T(t) W_e(t) ]^{(N-j)/N}        (8)

where [·]^{(j-1)/N} and [·]^{(N-j)/N}, j = 1...N, are fractional power operators defined over positive semidefinite matrices.

Proof. (sketch – full details in Appendix A.1) If λ = 0 (no weight decay) then one can easily show that W_{j+1}^T(t) Ẇ_{j+1}(t) = Ẇ_j(t) W_j^T(t) throughout optimization. Taking the transpose of this equation and adding it to itself, followed by integration over time, imply that the difference between W_{j+1}^T(t) W_{j+1}(t) and W_j(t) W_j^T(t) is constant. This difference is zero at initialization (Equation 7), thus will remain zero throughout, i.e.:

    W_{j+1}^T(t) W_{j+1}(t) = W_j(t) W_j^T(t) ,  ∀t ≥ t_0               (9)

A slightly more delicate treatment shows that this is true even if λ > 0, i.e. with weight decay included. Equation 9 implies alignment of the (left and right) singular spaces of W_j(t) and W_{j+1}(t), simplifying the product W_{j+1}(t) W_j(t). Successive application of this simplification allows a clean computation for the product of all layers (that is, W_e), leading to the explicit form presented in the theorem statement (Equation 8).

Translating the continuous dynamics of Equation 8 back to discrete time, we obtain the sought-after update rule for the end-to-end weight matrix:

    W_e^{(t+1)} ← (1 - ηλN) W_e^{(t)} - η Σ_{j=1}^N [ W_e^{(t)} (W_e^{(t)})^T ]^{(j-1)/N} · (dL^1/dW)(W_e^{(t)}) · [ (W_e^{(t)})^T W_e^{(t)} ]^{(N-j)/N}        (10)

This update rule relies on two assumptions: first, that the learning rate η is small enough for discrete updates to approximate continuous ones; and second, that weights are initialized on par with Equation 7, which will approximately be the case if initialization values are close enough to zero. It is customary in deep learning for both learning rate and weight initializations to be small, but nonetheless the above assumptions are only met to a certain extent. We support their applicability by showing empirically (Section 8) that the end-to-end update rule (Equation 10) indeed provides an accurate description for the dynamics of W_e.
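The discrete rule can be probed numerically. The sketch below (assumed toy dimensions and an ℓ_2 loss, not from the paper; a balanced initialization satisfying Equation 7 is constructed from the SVD of a random matrix) takes one gradient descent step on a depth-2 network and one step of the end-to-end rule of Equation 10 with λ = 0, and checks that they agree up to an exact η² remainder.

```python
import numpy as np

# Numerical probe of the end-to-end update rule (Equation 10) for N = 2.
rng = np.random.default_rng(3)
d, k, m, eta = 5, 3, 20, 1e-3
X = rng.normal(size=(d, m))
Y = rng.normal(size=(k, m))

# Balanced initialization (Equation 7) built from an SVD: W2 = U sqrt(S),
# W1 = sqrt(S) V^T, so that W2^T W2 = W1 W1^T = diag(s).
U, s, Vt = np.linalg.svd(rng.normal(size=(k, d)), full_matrices=False)
W2 = U * np.sqrt(s)
W1 = np.sqrt(s)[:, None] * Vt
assert np.allclose(W2.T @ W2, W1 @ W1.T)

def grad(W):
    # dL1/dW for L1(W) = ||W X - Y||_F^2 / (2m)
    return (W @ X - Y) @ X.T / m

def psd_power(M, alpha):
    # Fractional power of a positive semidefinite matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.clip(vals, 0.0, None) ** alpha) @ vecs.T

We = W2 @ W1
G = grad(We)
G2, G1 = G @ W1.T, W2.T @ G          # layer gradients via the chain rule

# One discrete gradient descent step on the depth-2 network:
We_net = (W2 - eta * G2) @ (W1 - eta * G1)

# One step of the end-to-end rule (Equation 10 with lambda = 0, N = 2):
We_rule = We - eta * (psd_power(We @ We.T, 0.5) @ G + G @ psd_power(We.T @ We, 0.5))

# Under balancedness the two agree up to the exact remainder eta^2 G We^T G.
assert np.allclose(We_net - We_rule, eta ** 2 * G @ We.T @ G)
```

The η² discrepancy is exactly the price of moving between the continuous dynamics (Equation 8) and the discrete updates (Equations 4 and 10), consistent with the small-learning-rate assumption stated above.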
A close look at Equation 10 reveals that the dynamics of the end-to-end weight matrix W_e are similar to gradient descent over L^1(·) – the training loss corresponding to a depth-1 network (classic linear model). The only difference (besides the scaling by N of the weight decay coefficient λ) is that the gradient (dL^1/dW)(W_e) is subject to a transformation before being used. Namely, for j = 1...N, it is multiplied from the left by [W_e W_e^T]^{(j-1)/N} and from the right by [W_e^T W_e]^{(N-j)/N}, followed by summation over j. Clearly, when N = 1 (depth-1 network) this transformation reduces to identity, and as expected, W_e precisely adheres to gradient descent over L^1(·). When N ≥ 2 the dynamics of W_e are less interpretable. We arrange it as a vector to gain more insight:

Claim 1. For an arbitrary matrix A, denote by vec(A) its arrangement as a vector in column-first order. Then, the end-to-end update rule in Equation 10 can be written as:

    vec(W_e^{(t+1)}) ← (1 - ηλN) · vec(W_e^{(t)}) - η · P_{W_e^{(t)}} vec( (dL^1/dW)(W_e^{(t)}) )        (11)

where P_{W_e^{(t)}} is a positive semidefinite preconditioning matrix that depends on W_e^{(t)}. Namely, if we denote the singular values of W_e^{(t)} ∈ R^{k,d} by σ_1 ... σ_{max{k,d}} ∈ R_{≥0} (by definition σ_r = 0 if r > min{k,d}), and the corresponding left and right singular vectors by u_1 ... u_k ∈ R^k and v_1 ... v_d ∈ R^d respectively, the eigenvectors of P_{W_e^{(t)}} are:

    vec(u_r v_{r'}^T) ,  r = 1...k ,  r' = 1...d

with corresponding eigenvalues:

    Σ_{j=1}^N σ_r^{2(N-j)/N} σ_{r'}^{2(j-1)/N} ,  r = 1...k ,  r' = 1...d

Claim 1 expresses the preconditioning through the singular value decomposition of W_e. The eigendirections are the rank-1 matrices u_r v_{r'}^T, where u_r and v_{r'} are left and right (respectively) singular vectors of W_e. The eigenvalue of u_r v_{r'}^T is Σ_{j=1}^N σ_r^{2(N-j)/N} σ_{r'}^{2(j-1)/N}, where σ_r and σ_{r'} are the singular values of W_e corresponding to u_r and v_{r'} (respectively). When N ≥ 2, an increase in σ_r or σ_{r'} leads to an increase in the eigenvalue corresponding to the eigendirection u_r v_{r'}^T. Qualitatively, this implies that the preconditioning favors directions that correspond to singular vectors whose presence in W_e is stronger. We conclude that the effect of overparameterization, i.e. of replacing a classic linear model (depth-1 network) by a depth-N linear network, boils down to modifying gradient descent by promoting movement along directions that fall in line with the current location in parameter space. A priori, such a preference may seem peculiar – why should an optimization algorithm be sensitive to its location in parameter space? Indeed, we generally expect sensible algorithms to be translation invariant, i.e. oblivious to parameter value. However, if one takes into account the common practice in deep learning of initializing weights near zero, the location in parameter space can also be regarded as the overall movement made by the algorithm. We thus interpret our findings as indicating that overparameterization promotes movement along directions already taken by the optimization, and therefore can be seen as a form of acceleration. This intuitive interpretation will become more concrete in the subsection that follows.

A final point to make is that the end-to-end update rule (Equation 10 or 11), which obviously depends on N – the number of layers in the deep linear network – does not depend on the hidden widths n_1 ... n_{N-1} (see Section 4). This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters. Consequently, the acceleration of overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

5.1. Single Output Case

To facilitate a straightforward presentation of our findings, we hereinafter focus on the special case where the optimized models have a single output, i.e. where k = 1. This corresponds, for example, to a binary (two-class) classification problem, or to the prediction of a numeric scalar property (regression). It admits a particularly simple form for the end-to-end update rule of Equation 10:

Claim 2. Assume k = 1. Then, the end-to-end update rule in Equation 10 can be written as:

    W_e^{(t+1)} ← (1 - ηλN) W_e^{(t)} - η ||W_e^{(t)}||_2^{2-2/N} · ( (dL^1/dW)(W_e^{(t)}) + (N-1) · Pr_{W_e^{(t)}}{ (dL^1/dW)(W_e^{(t)}) } )        (12)

where ||·||_2^{2-2/N} stands for Euclidean norm raised to the power of 2 - 2/N, and Pr_W{·}, W ∈ R^{1,d}, is defined to be the projection operator onto the direction of W:

    Pr_W : R^{1,d} → R^{1,d} ,  Pr_W{V} := (V W^T / ||W||_2^2) · W  if W ≠ 0 ,  Pr_W{V} := 0  if W = 0        (13)

Proof. The result follows from the definition of a fractional power operator over matrices – see Appendix A.3.

Claim 2 implies that in the single output case, the effect of overparameterization (replacing a classic linear model by a depth-N linear network) on gradient descent is twofold: first, it leads to an adaptive learning rate schedule, by introducing the multiplicative factor ||W_e||_2^{2-2/N}; and second, it amplifies (by N) the projection of the gradient on the direction of W_e.
Recall that we view W_e not only as the optimized parameter, but also as the overall movement made in optimization (initialization is assumed to be near zero). Accordingly, the adaptive learning rate schedule – the multiplicative factor ||W_e||_2^{2-2/N} in Equation 12 – can be seen as increasing the step size as optimization moves farther away from initialization.

As a very simple learning problem illustrating the above, consider a training set consisting of two points in R^2 × R: ([1, 0]^T, y_1) and ([0, 1]^T, y_2). Assume also that the loss function of interest is ℓ_p, p ∈ 2N: ℓ_p(ŷ, y) = (1/p)(ŷ - y)^p. Denoting the learned parameter by w = [w_1, w_2]^T, the overall training loss can be written as:^2

    L(w_1, w_2) = (1/p)(w_1 - y_1)^p + (1/p)(w_2 - y_2)^p

With fixed learning rate η > 0 (weight decay omitted for simplicity), gradient descent over L(·) gives:

    w_i^{(t+1)} ← w_i^{(t)} - η (w_i^{(t)} - y_i)^{p-1} ,  i = 1, 2

Changing variables per ∆_i = w_i - y_i, we have:

    ∆_i^{(t+1)} ← ∆_i^{(t)} ( 1 - η (∆_i^{(t)})^{p-2} ) ,  i = 1, 2        (15)

Assuming the original weights w_1 and w_2 are initialized near zero, ∆_1 and ∆_2 start off at -y_1 and -y_2 respectively, and will eventually reach the optimum ∆*_1 = ∆*_2 = 0 if the learning rate is small enough to prevent divergence:

    η < 2 / y_i^{p-2} ,  i = 1, 2

Suppose now that the problem is ill-conditioned, in the sense that y_1 ≫ y_2. If p = 2 this has no effect on the bound for η.^3 If p > 2 the learning rate is determined by y_1, leading ∆_2 to converge very slowly. In a sense, ∆_2 will suffer from the fact that there is no "communication" between the coordinates (this will actually be the case not just with gradient descent, but with most algorithms typically used in large-scale settings – AdaGrad, Adam, etc.).

Now consider the scenario where we optimize L(·) via overparameterization, i.e. with the update rule in Equation 12 (single output). In this case the coordinates are coupled, and as ∆_1 gets small (w_1 gets close to y_1), the learning rate is effectively scaled by y_1^{2-2/N} (in addition to a scaling by N in coordinate 1 only), allowing (if y_1 > 1) faster convergence of ∆_2. We thus have the luxury of temporarily slowing down ∆_2 to ensure that ∆_1 does not diverge, with the latter speeding up the former as it reaches safe grounds. In Appendix B we consider a special case and formalize this intuition, deriving a concrete bound for the acceleration of overparameterization.

^2 We omit the averaging constant 1/2 for conciseness.
^3 The optimal learning rate for gradient descent on a quadratic objective does not depend on the current parameter value (cf. Goh (2017)).

Figure 2. (to be viewed in color) Gradient descent optimization of deep linear networks (depths 2, 3) vs. the analytically-derived equivalent preconditioning schemes (over single layer model; Equation 12). Both plots show training objective (left – ℓ_2 loss; right – ℓ_4 loss) per iteration, on a numeric regression dataset from UCI Machine Learning Repository (details in text). Notice the emulation of preconditioning schemes. Notice also the negligible effect of network width – for a given depth, setting the size of hidden layers to 1 (scalars) or 100 yielded similar convergence (on par with our analysis).
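The decoupled dynamics of Equation 15 can be simulated directly. The sketch below uses assumed values (y_1 = 5, y_2 = 0.5, p = 4, not taken from the paper's experiments) and checks the two qualitative claims: above the bound 2 / y_1^{p-2} coordinate 1 diverges, while just below it coordinate 1 converges quickly and coordinate 2 remains relatively far from its optimum.

```python
# Simulation of Equation 15: Delta_i <- Delta_i * (1 - eta * Delta_i^(p-2)),
# with assumed problem values y1 >> y2 and p > 2.
p, y1, y2 = 4, 5.0, 0.5

def run(eta, steps):
    d1, d2 = -y1, -y2               # near-zero init for w means Delta_i = -y_i
    for _ in range(steps):
        d1 = d1 * (1 - eta * d1 ** (p - 2))
        d2 = d2 * (1 - eta * d2 ** (p - 2))
    return d1, d2

# Above the bound eta < 2 / y1^(p-2) = 0.08, coordinate 1 diverges:
assert abs(run(0.1, 5)[0]) > y1

# Just below the bound, coordinate 1 converges quickly, while coordinate 2
# (whose own safe step size would be far larger) is still relatively far off:
d1, d2 = run(0.07, 1000)
assert abs(d1) / y1 < 0.05
assert abs(d2) / y2 > 0.1
```

This makes the "no communication between coordinates" point concrete: the single step size admissible for the large-label coordinate is the bottleneck for the small one, which is precisely the coupling that the overparameterized update of Equation 12 introduces.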
8. Experiments
Our analysis (Section 5) suggests that overparameterization – replacement of a classic linear model by a deep linear network – induces on gradient descent a certain preconditioning scheme. We qualitatively argued (Section 7) that in some cases, this preconditioning may accelerate convergence. In this section we put these claims to the test, through a series of empirical evaluations based on the TensorFlow toolbox (Abadi et al. (2016)). For conciseness, many of the details behind our implementation are deferred to Appendix C.

We begin by evaluating our analytically-derived preconditioning scheme – the end-to-end update rule in Equation 10. Our objective in this experiment is to ensure that our analysis, continuous in nature and based on a particular assumption on weight initialization (Equation 7), is indeed applicable to practical scenarios. We focus on the single output case, where the update rule takes on a particularly simple (and efficiently implementable) form – Equation 12. The dataset chosen was UCI Machine Learning Repository's "Gas Sensor Array Drift at Different Concentrations" (Vergara et al., 2012; Rodriguez-Lujan et al., 2014). Specifically, we used the dataset's "Ethanol" problem – a scalar regression task with 2565 examples, each comprising 128 features (one of the largest numeric regression tasks in the repository). As training objectives, we tried both ℓ_2 and ℓ_4 losses. Figure 2 shows convergence (training objective per iteration) of gradient descent optimizing depth-2 and depth-3 linear networks, against optimization of a single layer model using the respective preconditioning schemes (Equation 12 with N = 2, 3). As can be seen, the preconditioning schemes reliably emulate deep network optimization, suggesting that, at least in some cases, our analysis indeed captures practical dynamics.

Alongside the validity of the end-to-end update rule, Figure 2 also demonstrates the negligible effect of network width on convergence, in accordance with our analysis (see Section 5). Specifically, it shows that in the evaluated setting, hidden layers of size 1 (scalars) suffice in order for the essence of overparameterization to fully emerge. Unless otherwise indicated, all results reported hereinafter are based on this configuration, i.e. on scalar hidden layers. The computational toll associated with overparameterization will thus be virtually non-existent.

Figure 3. (to be viewed in color) Gradient descent optimization of single layer model vs. linear networks of depth 2 and 3. Setup is identical to that of Figure 2, except that here learning rates were chosen via grid search, individually per model (see Appendix C). Notice that with ℓ_2 loss, depth (slightly) hinders optimization, whereas with ℓ_4 loss it leads to significant acceleration (on par with our qualitative analysis in Section 7).

Figure 4. (to be viewed in color) Left: Gradient descent optimization of depth-3 linear network vs. AdaGrad and AdaDelta over single layer model. Setup is identical to that of Figure 3-right. Notice that the implicit acceleration of overparameterization outperforms both AdaGrad and AdaDelta (the former is actually slower than plain gradient descent). Right: Adam optimization of single layer model vs. Adam over linear networks of depth 2 and 3. Same setup, but with learning rates set per Adam's default in TensorFlow. Notice that depth improves speed, suggesting that the acceleration of overparameterization may be somewhat orthogonal to explicit acceleration methods.
As a final observation on Figure 2, notice that it exhibits faster convergence with a deeper network. This however does not serve as evidence in favor of acceleration by depth, as we did not set learning rates optimally per model (we simply used the common choice of $10^{-3}$). To conduct a fair comparison between the networks, and more importantly, between them and a classic single layer model, multiple learning rates were tried, and the one giving fastest convergence was taken on a per-model basis. Figure 3 shows the results of this experiment. As can be seen, convergence of deeper networks is (slightly) slower in the case of $\ell_2$ loss. This falls in line with the findings of Saxe et al. (2013). In stark contrast, and on par with our qualitative analysis in Section 7, with $\ell_4$ loss adding depth significantly accelerated convergence. To the best of our knowledge, this provides the first empirical evidence that depth, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem, can lead to favorable optimization.

In light of the speedup observed with $\ell_4$ loss, it is natural to ask how the implicit acceleration of depth compares against explicit methods for acceleration and adaptive learning. Figure 4-left shows convergence of a depth-3 network (optimized with gradient descent) against that of a single layer model optimized with AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). The displayed curves correspond to optimal learning rates, chosen individually via grid search. Quite surprisingly, we find that in this specific setting, overparameterizing, thereby turning a convex problem non-convex, is a more effective optimization strategy than carefully designed algorithms tailored for convex problems. We note that this was not observed with all algorithms – for example Adam (Kingma & Ba, 2014) was considerably faster than overparameterization. However, when introducing overparameterization simultaneously with Adam (a setting we did not theoretically analyze), further acceleration is attained – see Figure 4-right. This suggests that, at least in some cases, not only plain gradient descent benefits from depth, but also the more elaborate algorithms commonly employed in state of the art applications.

An immediate question arises at this point. If depth indeed accelerates convergence, why not add as many layers as one can computationally afford? The reason, which is actually apparent in our analysis, is the so-called vanishing gradient problem. When training a very deep network (large $N$) while initializing weights to be small, the end-to-end matrix $W_e$ (Equation 5) is extremely close to zero, severely attenuating gradients in the preconditioning scheme (Equation 10). A possible approach for alleviating this issue is to initialize weights to be larger, yet small enough such that the end-to-end matrix does not "explode". The choice of identity (or near-identity) initialization leads to what is known as linear residual networks (Hardt & Ma, 2016), akin to the successful residual network architecture (He et al., 2015) commonly employed in deep learning. Notice that identity initialization satisfies the condition in Equation 7, rendering the end-to-end update rule (Equation 10) applicable. Figure 5-left shows convergence, under gradient descent, of a single layer model against deeper networks than those evaluated before – depths 4 and 8. As can be seen, with standard near-zero initialization, the depth-4 network starts making visible progress only after about 65K iterations, whereas the depth-8 network seems stuck even after 100K iterations. In contrast, under identity initialization, both networks immediately make progress, and again depth serves as an implicit accelerator.

Figure 5. (to be viewed in color; left panel plots $\ell_4$ loss (distance from optimum), right panel plots batch loss, both against iteration) Left: Gradient descent optimization of single layer model vs. linear networks deeper than before (depths 4, 8). For deep networks, both near-zero and near-identity initializations were evaluated. Setup identical to that of Figure 3-right. Notice that deep networks suffer from vanishing gradients under near-zero initialization, while near-identity ("residual") initialization eliminates the problem. Right: Stochastic gradient descent optimization in TensorFlow's convolutional network tutorial for MNIST. Plot shows batch loss per iteration, in original setting vs. overparameterized one (depth-2 linear networks in place of dense layers).

As a final sanity test, we evaluate the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting. Specifically, we experiment with the convolutional network tutorial for MNIST built into TensorFlow,⁴ which includes convolution, pooling and dense layers, ReLU non-linearities, stochastic gradient descent with momentum, and dropout (Srivastava et al., 2014). We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer. Here, as opposed to previous experiments, widths of the newly formed hidden layers were not set to 1, but rather to the minimal values that do not deteriorate expressiveness (see Appendix C). Overall, with an addition of roughly 15% in the number of parameters, optimization accelerated considerably – see Figure 5-right. The displayed results were obtained with the hyperparameter settings hardcoded into the tutorial. We tried alternative settings (varying learning rates and standard deviations of initializations – see Appendix C), and in all cases observed an outcome similar to that in Figure 5-right – overparameterization led to significant speedup. Nevertheless, as reported above for linear networks, it is likely that for non-linear networks the effect of depth on optimization is mixed – some settings accelerate by it, while others do not. Comprehensive characterization of the cases in which depth accelerates optimization warrants much further study. We hope our work will spur interest in this avenue of research.

⁴https://github.com/tensorflow/models/tree/master/tutorials/image/mnist

9. Conclusion

Through theory and experiments, we demonstrated that overparameterizing a neural network by increasing its depth can accelerate optimization, even on very simple problems.

Our analysis of linear neural networks, the subject of various recent studies, yielded a new result: for these models, overparameterization by depth can be understood as a preconditioning scheme with a closed form description (Theorem 1 and the claims thereafter). The preconditioning may be interpreted as a combination between certain forms of adaptive learning rate and momentum. Given that it depends on network depth but not on width, acceleration by overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

Clearly, a complete theoretical analysis for non-linear networks will be challenging. Empirically, however, we showed that the trivial idea of replacing an internal weight matrix by a product of two can significantly accelerate optimization, with absolutely no effect on expressiveness (Figure 5-right). The fact that gradient descent over classic convex problems such as linear regression with $\ell_p$ loss, $p > 2$, can accelerate from transitioning to a non-convex overparameterized objective does not coincide with conventional wisdom, and provides food for thought. Can this effect be rigorously quantified, similarly to analyses of explicit acceleration methods such as momentum or adaptive regularization (AdaGrad)?

Acknowledgments

Sanjeev Arora's work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Elad Hazan's work is supported by NSF grant 1523815 and Google Brain. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Eric and Wendy Schmidt.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. International Conference on Learning Representations (ICLR), 2018.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Boyce, W. E., DiPrima, R. C., and Haines, C. W. Elementary Differential Equations and Boundary Value Problems, volume 9. Wiley, New York, 1969.

Buck, R. C. Advanced Calculus. Waveland Press, 2003.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.
Cohen, N., Sharir, O., Levine, Y., Tamari, R., Yakira, D., and Shashua, A. Analysis and design of convolutional networks via hierarchical tensor decompositions. arXiv preprint arXiv:1705.02302, 2017.

Daniely, A. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.

Fukumizu, K. Effect of batch learning in multilayer neural networks. Gen, 1(04):1E–03, 1998.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.

Goh, G. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.

Haeffele, B. D. and Vidal, R. Global optimality in tensor factorization, deep learning, and beyond. CoRR abs/1202.2745, cs.NA, 2015.

Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, December 2007. ISSN 0885-6125.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Helmke, U. and Moore, J. B. Optimization and Dynamical Systems. Springer Science & Business Media, 2012.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473, 2015.

Jones, E., Oliphant, T., Peterson, P., et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028, 2017.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, 2014.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Nocedal, J. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., and Huerta, R. On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometrics and Intelligent Laboratory Systems, 130:123–134, 2014.

Safran, I. and Shamir, O. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Safran, I. and Shamir, O. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Su, W., Boyd, S., and Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Vergara, A., Vembu, S., Ayhan, T., Ryan, M. A., Homer, M. L., and Huerta, R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.

Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
where $I_{d_r}$, $1 \le r \le m$, is the identity matrix of size $d_r \times d_r$. Moreover, there exist orthogonal matrices $O_{j,r} \in \mathbb{R}^{d_r,d_r}$, $1 \le r \le m$, such that:

$$U_j = V_{j+1} \cdot \mathrm{diag}(O_{j,1}, \ldots, O_{j,m})$$

$O_{j,r}$ here is simply a matrix changing between orthogonal bases in the eigenspace of $\rho_r$ – it maps the basis comprising $V_{j+1}$-columns to that comprising $U_j$-columns. Recalling that both $\Sigma_j$ and $\Sigma_{j+1}$ are rectangular-diagonal, holding only non-negative values, Equation 20 implies that each of these matrices is equal to $\mathrm{diag}(\sqrt{\rho_1} \cdot I_{d_1}, \ldots, \sqrt{\rho_m} \cdot I_{d_m})$. Note that the matrices generally do not have the same shape and thus, formally, are not equal to one another. Nonetheless, in line with our $\mathrm{diag}$ notation (see beginning of this subsection), $\Sigma_j$ and $\Sigma_{j+1}$ may differ from each other only in trailing, zero-valued rows and columns. By an inductive argument, all the singular value matrices $\Sigma_1, \Sigma_2, \ldots, \Sigma_N$ (see Equation 19) are equal up to trailing zero rows and columns. The fact that $\rho_1 \ldots \rho_m$ do not include an index $j$ in their notation is thus in order, and we may write, for every $j = 1 \ldots N-1$:

$$W_j(t) = U_j \Sigma_j V_j^\top = V_{j+1} \cdot \mathrm{diag}(O_{j,1}, \ldots, O_{j,m}) \cdot \mathrm{diag}(\sqrt{\rho_1} \cdot I_{d_1}, \ldots, \sqrt{\rho_m} \cdot I_{d_m}) \cdot V_j^\top$$

For the $N$'th weight matrix we have:

$$W_N(t) = U_N \Sigma_N V_N^\top = U_N \cdot \mathrm{diag}(\sqrt{\rho_1} \cdot I_{d_1}, \ldots, \sqrt{\rho_m} \cdot I_{d_m}) \cdot V_N^\top$$

It follows that for every $j = 1 \ldots N$:

$$\prod_{i=j}^{N} W_i(t) \, \prod_{i=j}^{N} W_i^\top(t) = \left[ W_e(t) W_e^\top(t) \right]^{\frac{N-j+1}{N}} \tag{23}$$

$$\prod_{i=1}^{j} W_i^\top(t) \, \prod_{i=1}^{j} W_i(t) = \left[ W_e^\top(t) W_e(t) \right]^{\frac{j}{N}} \tag{24}$$

where $[\cdot]^{\frac{N-j+1}{N}}$ and $[\cdot]^{\frac{j}{N}}$ stand for fractional power operators defined over positive semidefinite matrices.

With Equations 23 and 24 in place, we are finally in a position to complete the proof. Returning to Equation 16, we multiply $\dot{W}_j(t)$ from the left by $\prod_{i=j+1}^{N} W_i(t)$ and from the right by $\prod_{i=1}^{j-1} W_i(t)$, followed by summation over $j = 1 \ldots N$. This gives:

$$\sum_{j=1}^{N} \prod_{i=j+1}^{N} W_i(t) \cdot \dot{W}_j(t) \cdot \prod_{i=1}^{j-1} W_i(t) = -\eta\lambda \sum_{j=1}^{N} \prod_{i=j+1}^{N} W_i(t) \cdot W_j(t) \cdot \prod_{i=1}^{j-1} W_i(t) \; - \; \eta \sum_{j=1}^{N} \prod_{i=j+1}^{N} W_i(t) \prod_{i=j+1}^{N} W_i^\top(t) \cdot \frac{dL^1}{dW}\big(W_e(t)\big) \cdot \prod_{i=1}^{j-1} W_i^\top(t) \prod_{i=1}^{j-1} W_i(t)$$

By definition $W_e(t) = \prod_{i=1}^{N} W_i(t)$, so we can substitute in the first two lines above:
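Equations 23 and 24 lend themselves to a quick numerical check. The sketch below is our own construction (square, full-rank matrices for simplicity): it builds weights satisfying the balancedness condition $W_{j+1}^\top(t) W_{j+1}(t) = W_j(t) W_j^\top(t)$ from the SVD of a random end-to-end matrix, and compares both sides of Equation 23.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 4                                # depth and (square, full-rank) size

# Balanced factorization of a random end-to-end matrix W_e = U S V^T:
# take W_j = O_j S^{1/N} O_{j-1}^T with O_0 = V, O_N = U, and
# O_1..O_{N-1} arbitrary orthogonal matrices (our own construction).
We = rng.normal(size=(d, d))
U, s, Vt = np.linalg.svd(We)
O = [Vt.T] + [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(N - 1)] + [U]
W = [O[j] @ np.diag(s ** (1 / N)) @ O[j - 1].T for j in range(1, N + 1)]

def frac_power(M, p):
    """fractional power of a symmetric PSD matrix via eigendecomposition"""
    lam, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.clip(lam, 0.0, None) ** p) @ Q.T

# sanity: the factors multiply back to W_e (product ordered W_N ... W_1)
prod = np.eye(d)
for Wi in W:
    prod = Wi @ prod
assert np.allclose(prod, We)

# Equation 23: (prod_{i=j}^N W_i)(prod_{i=j}^N W_i)^T = (W_e W_e^T)^{(N-j+1)/N}
errs = []
for j in range(1, N + 1):
    P = np.eye(d)
    for Wi in W[j - 1:]:
        P = Wi @ P                         # accumulates W_N ... W_j
    lhs = P @ P.T
    rhs = frac_power(We @ We.T, (N - j + 1) / N)
    errs.append(np.max(np.abs(lhs - rhs)))
assert max(errs) < 1e-8
```

The telescoping products in this construction make the fractional powers exact, mirroring the role the balancedness condition plays in the proof.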
Then, if $f$ and $g$ assume the same value at some $t_0 \in I$ (interior or boundary), they must coincide along the entire interval, i.e. it must hold that $f(t) = g(t)$ for all $t \in I$.

Proof. Define $h := f - g$. $h$ is a differentiable function from $I$ to $\mathbb{R}$, and we have:

$$\dot{h}(t) = -\alpha \cdot h(t) \tag{25}$$

We know that $h(t_0) = 0$ for some $t_0 \in I$, and would like to show that $h(t) = 0$ for all $t \in I$. Assume by contradiction that this is not the case, so there exists $t_2 \in I$ for which $h(t_2) \neq 0$. Without loss of generality, suppose that $h(t_2) > 0$, and that $t_2 > t_0$. Let $S$ be the zero set of $h$, i.e. $S := \{t \in I : h(t) = 0\}$. Since $h$ is continuous in $I$, $S$ is topologically closed, therefore its intersection with the interval $[t_0, t_2]$ is compact. Denote by $t_1$ the maximal element in this intersection, and consider the interval $J := [t_1, t_2] \subset I$. By construction, $h$ is positive along $J$, besides on the endpoint $t_1$, where it assumes the value of zero. For $t_1 < t \le t_2$, we may solve as follows the differential equation of $h$ (Equation 25):

$$\frac{\dot{h}(t)}{h(t)} = -\alpha \implies h(t) = \beta e^{-\alpha t}$$

where $\beta$ is the positive constant defined by $h(t_2) = \beta e^{-\alpha t_2}$. Since in particular $h$ is bounded away from zero on $(t_1, t_2]$, and assumes zero at $t_1$, we obtain a contradiction to its continuity. This completes the proof.

$$A \otimes B := \begin{bmatrix} a_{11} \cdot B & \cdots & a_{1 n_a} \cdot B \\ \vdots & & \vdots \\ a_{m_a 1} \cdot B & \cdots & a_{m_a n_a} \cdot B \end{bmatrix} \tag{26}$$

where $a_{ij}$ stands for the element in row $i$ and column $j$ of $A$. The Kronecker product admits numerous useful properties. We will employ the following:

- If $A$ and $B$ are matrices such that the matrix product $AB$ is defined, then:
$$\mathrm{vec}(AB) = (B^\top \otimes I_{r_A}) \cdot \mathrm{vec}(A) = (I_{c_B} \otimes A) \cdot \mathrm{vec}(B) \tag{27}$$
where $I_{r_A}$ and $I_{c_B}$ are the identity matrices whose sizes correspond, respectively, to the number of rows in $A$ and the number of columns in $B$. $\mathrm{vec}(\cdot)$ here, as in claim statement, stands for matrix vectorization in column-first order.

- If $A_1$, $A_2$, $B_1$ and $B_2$ are matrices such that the matrix products $A_1 A_2$ and $B_1 B_2$ are defined, then:
$$(A_1 \otimes B_1)(A_2 \otimes B_2) = A_1 A_2 \otimes B_1 B_2 \tag{28}$$

- For any matrices $A$ and $B$:
$$(A \otimes B)^\top = A^\top \otimes B^\top \tag{29}$$

- Equations 28 and 29 imply that if $A$ and $B$ are orthogonal matrices, so is $A \otimes B$:
$$A^\top = A^{-1} \wedge B^\top = B^{-1} \implies (A \otimes B)^\top = (A \otimes B)^{-1} \tag{30}$$

With the Kronecker product in place, we proceed to the actual proof. It suffices to show that vectorizing:

$$\sum_{j=1}^{N} \left[ W_e(t) W_e^\top(t) \right]^{\frac{j-1}{N}} \cdot \frac{dL^1}{dW}\big(W_e(t)\big) \cdot \left[ W_e^\top(t) W_e(t) \right]^{\frac{N-j}{N}}$$

yields:

$$P_{W_e(t)} \cdot \mathrm{vec}\left( \frac{dL^1}{dW}\big(W_e(t)\big) \right)$$

where $P_{W_e(t)}$ is the preconditioning matrix defined in claim statement. Omitting the iteration index $t$:

$$\mathrm{vec}\left( \sum_{j=1}^{N} \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \frac{dL^1}{dW}(W_e) \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} \right) = \sum_{j=1}^{N} \left( I_d \otimes \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \right) \cdot \left( \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} \otimes I_k \right) \cdot \mathrm{vec}\left( \frac{dL^1}{dW}(W_e) \right) = \sum_{j=1}^{N} \left( \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} \otimes \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \right) \mathrm{vec}\left( \frac{dL^1}{dW}(W_e) \right)$$

The first equality here makes use of Equation 27, and the second of Equation 28. We will show that the matrix:

$$Q := \sum_{j=1}^{N} \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} \otimes \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \tag{31}$$

meets the characterization of $P_{W_e}$, thereby completing the proof. Let:

$$W_e = U D V^\top$$

be a singular value decomposition, i.e. $U \in \mathbb{R}^{k,k}$ and $V \in \mathbb{R}^{d,d}$ are orthogonal matrices, and $D$ is a rectangular-diagonal matrix holding (non-negative) singular values on its diagonal. Plug this into the definition of $Q$ (Equation 31):

$$Q = \sum_{j=1}^{N} \left[ V D^\top D V^\top \right]^{\frac{N-j}{N}} \otimes \left[ U D D^\top U^\top \right]^{\frac{j-1}{N}} = \sum_{j=1}^{N} V \left[ D^\top D \right]^{\frac{N-j}{N}} V^\top \otimes U \left[ D D^\top \right]^{\frac{j-1}{N}} U^\top = \sum_{j=1}^{N} (V \otimes U) \left( \left[ D^\top D \right]^{\frac{N-j}{N}} \otimes \left[ D D^\top \right]^{\frac{j-1}{N}} \right) (V^\top \otimes U^\top) = (V \otimes U) \left( \sum_{j=1}^{N} \left[ D^\top D \right]^{\frac{N-j}{N}} \otimes \left[ D D^\top \right]^{\frac{j-1}{N}} \right) (V \otimes U)^\top$$

The third equality here is based on the relation in Equation 28, and the last equality is based on Equation 29. Denoting:

$$O := V \otimes U \tag{32}$$

$$\Lambda := \sum_{j=1}^{N} \left[ D^\top D \right]^{\frac{N-j}{N}} \otimes \left[ D D^\top \right]^{\frac{j-1}{N}} \tag{33}$$

we have:

$$Q = O \Lambda O^\top \tag{34}$$

Now, since by definition $U$ and $V$ are orthogonal, $O$ is orthogonal as well (follows from the relation in Equation 30). Additionally, the fact that $D$ is rectangular-diagonal implies that the square matrix $\Lambda$ is also diagonal. Equation 34 is thus an orthogonal eigenvalue decomposition of $Q$. Finally, denote the columns of $U$ (left singular vectors of $W_e$) by $u_1 \ldots u_k$, those of $V$ (right singular vectors of $W_e$) by $v_1 \ldots v_d$, and the diagonal elements of $D$ (singular values of $W_e$) by $\sigma_1 \ldots \sigma_{\max\{k,d\}}$ (by definition $\sigma_r = 0$ if $r > \min\{k,d\}$). The definitions in Equations 32 and 33 imply that the columns of $O$ are:

$$\mathrm{vec}(u_r v_{r'}^\top) \;, \quad r = 1 \ldots k \;,\; r' = 1 \ldots d$$

with corresponding diagonal elements in $\Lambda$ being:

$$\sum_{j=1}^{N} \sigma_r^{2\frac{N-j}{N}} \sigma_{r'}^{2\frac{j-1}{N}} \;, \quad r = 1 \ldots k \;,\; r' = 1 \ldots d$$

We conclude that $Q$ indeed meets the characterization of $P_{W_e}$ in claim statement. This completes the proof.

A.3. Proof of Claim 2

We disregard the trivial case $N = 1$, as well as the scenario $W_e = 0$ (both lead Equations 10 and 12 to equate). Omitting the iteration index $t$ from our notation, it suffices to show that:

$$\sum_{j=1}^{N} \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \cdot \frac{dL^1}{dW}(W_e) \cdot \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} = \|W_e\|_2^{2-\frac{2}{N}} \left( \frac{dL^1}{dW}(W_e) + (N-1) \cdot \mathrm{Pr}_{W_e}\left\{ \frac{dL^1}{dW}(W_e) \right\} \right) \tag{35}$$

where $\mathrm{Pr}_{W_e}\{\cdot\}$ is the projection operator defined in claim statement (Equation 13), and we recall that by assumption $k = 1$ ($W_e \in \mathbb{R}^{1,d}$). $\left[ W_e W_e^\top \right]^{\frac{j-1}{N}}$ is a scalar, equal to $\|W_e\|_2^{2\frac{j-1}{N}}$ for every $j = 1 \ldots N$. $\left[ W_e^\top W_e \right]^{\frac{N-j}{N}}$ on the other hand is a $d \times d$ matrix, by definition equal to identity for $j = N$, and otherwise, for $j = 1 \ldots N-1$, it is equal to $\|W_e\|_2^{2\frac{N-j}{N}} \left( W_e / \|W_e\|_2 \right)^\top \left( W_e / \|W_e\|_2 \right)$. Plugging these equalities into the first line of Equation 35 gives:

$$\sum_{j=1}^{N} \left[ W_e W_e^\top \right]^{\frac{j-1}{N}} \frac{dL^1}{dW}(W_e) \left[ W_e^\top W_e \right]^{\frac{N-j}{N}} = \sum_{j=1}^{N-1} \|W_e\|_2^{2\frac{j-1}{N}} \frac{dL^1}{dW}(W_e) \, \|W_e\|_2^{2\frac{N-j}{N}} \left( \frac{W_e}{\|W_e\|_2} \right)^\top \frac{W_e}{\|W_e\|_2} + \|W_e\|_2^{2\frac{N-1}{N}} \cdot \frac{dL^1}{dW}(W_e) = (N-1) \|W_e\|_2^{2\frac{N-1}{N}} \left[ \frac{dL^1}{dW}(W_e) \left( \frac{W_e}{\|W_e\|_2} \right)^\top \right] \frac{W_e}{\|W_e\|_2} + \|W_e\|_2^{2\frac{N-1}{N}} \cdot \frac{dL^1}{dW}(W_e)$$

The latter expression is precisely the second line of Equation 35, thus our proof is complete.

A.4. Proof of Theorem 2

Our proof relies on elementary differential geometry: curves, arc length and line integrals (see Chapters 8 and 9 in Buck (2003)).

Let $\mathcal{U} \subset \mathbb{R}^{1,d}$ be a neighborhood of $W = 0$ (i.e. an open set that includes this point) on which $\frac{dL^1}{dW}$ is continuous ($\mathcal{U}$ exists by assumption). It is not difficult to see that $F(\cdot)$ (Equation 14) is continuous on $\mathcal{U}$ as well. The strategy of our proof will be to show that $F(\cdot)$ does not admit the gradient theorem (also known as the fundamental theorem for line integrals). According to the theorem, if $h : \mathcal{U} \to \mathbb{R}$ is a continuously differentiable function, and $\Gamma$ is a piecewise smooth curve lying in $\mathcal{U}$ with start-point $\gamma_s$ and end-point $\gamma_e$, then:

$$\int_{\Gamma} \frac{dh}{dW} = h(\gamma_e) - h(\gamma_s)$$

In words, the line integral of the gradient of $h$ over $\Gamma$ is equal to the difference between the value taken by $h$ at the end-point of $\Gamma$, and that taken at the start-point. A direct implication of the theorem is that if $\Gamma$ is closed ($\gamma_e = \gamma_s$), the line integral vanishes:

$$\oint_{\Gamma} \frac{dh}{dW} = 0$$

We conclude that if $F(\cdot)$ is the gradient field of some function, its line integral over any closed (piecewise smooth) curve lying in $\mathcal{U}$ must vanish. We will show that this is not the case.

For notational conciseness we hereinafter identify $\mathbb{R}^{1,d}$ and $\mathbb{R}^d$, so in particular $\mathcal{U}$ is now a subset of $\mathbb{R}^d$. To further simplify, we omit the subindex from the Euclidean norm, writing $\|\cdot\|$ instead of $\|\cdot\|_2$. Given an arbitrary continuous vector field $\phi : \mathcal{U} \to \mathbb{R}^d$, we define a respective (continuous) vector field as follows:

$$F_\phi : \mathcal{U} \to \mathbb{R}^d$$

$$F_\phi(w) = \begin{cases} \|w\|^{2-\frac{2}{N}} \left( \phi(w) + (N-1) \left\langle \phi(w), \frac{w}{\|w\|} \right\rangle \frac{w}{\|w\|} \right) & , \; w \neq 0 \\ 0 & , \; w = 0 \end{cases} \tag{36}$$

Notice that for $\phi = \frac{dL^1}{dW}$, we get exactly the vector field $F(\cdot)$ defined in theorem statement (Equation 14) – the subject of our inquiry. As an operator on (continuous) vector fields, the mapping $\phi \mapsto F_\phi$ is linear.⁵ This, along with the linearity of line integrals, implies that for any piecewise smooth curve $\Gamma$ lying in $\mathcal{U}$, the functional $\phi \mapsto \int_\Gamma F_\phi$, a mapping of (continuous) vector fields to scalars, is linear. Lemma 2 below provides an upper bound on this linear functional in terms of the length of $\Gamma$, its maximal distance from the origin, and the maximal norm $\phi$ takes on it.

In light of the above, to show that $F(\cdot)$ contradicts the gradient theorem, thereby completing the proof, it suffices to find a closed (piecewise smooth) curve $\Gamma$ for which the linear functional $\phi \mapsto \oint_\Gamma F_\phi$ does not vanish at $\phi = \frac{dL^1}{dW}$. By assumption $\frac{dL^1}{dW}(W=0) \neq 0$, and so we may define the unit vector in the direction of $\frac{dL^1}{dW}(W=0)$:

$$e := \frac{\frac{dL^1}{dW}(W=0)}{\left\| \frac{dL^1}{dW}(W=0) \right\|} \in \mathbb{R}^d \tag{37}$$

Let $R$ be a positive constant small enough such that the Euclidean ball of radius $R$ around the origin is contained in $\mathcal{U}$. Let $r$ be a positive constant smaller than $R$. Define $\Gamma_{r,R}$ to be a curve as follows (see illustration in Figure 1):⁶

$$\Gamma_{r,R} := \Gamma^1_{r,R} \to \Gamma^2_{r,R} \to \Gamma^3_{r,R} \to \Gamma^4_{r,R} \tag{38}$$

where:

- $\Gamma^1_{r,R}$ is the line segment from $-R \cdot e$ to $-r \cdot e$.
- $\Gamma^2_{r,R}$ is a geodesic on the sphere of radius $r$, starting from $-r \cdot e$ and ending at $r \cdot e$.
- $\Gamma^3_{r,R}$ is the line segment from $r \cdot e$ to $R \cdot e$.
- $\Gamma^4_{r,R}$ is a geodesic on the sphere of radius $R$, starting from $R \cdot e$ and ending at $-R \cdot e$.

$\Gamma_{r,R}$ is a piecewise smooth, closed curve that fully lies within $\mathcal{U}$. Consider the linear functional it induces: $\phi \mapsto \oint_{\Gamma_{r,R}} F_\phi$. We will evaluate this functional on $\phi = \frac{dL^1}{dW}$. To do so, we decompose the latter as follows:

$$\frac{dL^1}{dW}(\cdot) = c \cdot e(\cdot) + \xi(\cdot) \tag{39}$$

where:

- $c$ is a scalar equal to $\left\| \frac{dL^1}{dW}(W=0) \right\|$.
- $e(\cdot)$ is a vector field returning the constant $e$ (Equation 37).
- $\xi(\cdot)$ is a vector field returning the values of $\frac{dL^1}{dW}(\cdot)$ shifted by the constant $-\frac{dL^1}{dW}(W=0)$. It is continuous on $\mathcal{U}$ and vanishes at the origin.

Applying Lemma 2 to $\xi$ over $\Gamma_{r,R}$ gives:

$$\left| \oint_{\Gamma_{r,R}} F_\xi \right| \le N \cdot \mathrm{len}(\Gamma_{r,R}) \cdot \max_{\gamma \in \Gamma_{r,R}} \|\gamma\|^{2-\frac{2}{N}} \cdot \max_{\gamma \in \Gamma_{r,R}} \|\xi(\gamma)\| = N \cdot \left( \pi r + \pi R + 2(R - r) \right) \cdot R^{2-\frac{2}{N}} \cdot \max_{\gamma \in \Gamma_{r,R}} \|\xi(\gamma)\| \le N \cdot 2\pi \cdot R^{3-\frac{2}{N}} \cdot \max_{\gamma \in \Gamma_{r,R}} \|\xi(\gamma)\|$$

On the other hand, by Lemma 3:

$$\oint_{\Gamma_{r,R}} F_e = \left( \frac{2N}{3 - 2/N} - 2 \right) \left( R^{3-\frac{2}{N}} - r^{3-\frac{2}{N}} \right)$$

⁵For any $\phi_1, \phi_2 : \mathcal{U} \to \mathbb{R}^d$ and $c \in \mathbb{R}$, it holds that $F_{\phi_1 + \phi_2} = F_{\phi_1} + F_{\phi_2}$ and $F_{c \cdot \phi_1} = c \cdot F_{\phi_1}$.

⁶The proof would have been slightly simplified had we used a curve that passes directly through the origin. We avoid this in order to emphasize that the result is not driven by some point-wise singularity (the origin received special treatment in the definition of $F(\cdot)$ – see Equations 14 and 13).
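The closed-form value of the circulation $\oint_{\Gamma_{r,R}} F_e$ given by Lemma 3 can be confirmed by direct numerical integration in $d = 2$. The sketch below is our own discretization; the choices of $N$, $r$, $R$, the midpoint rule and the upper semicircles are illustrative.

```python
import numpy as np

N, r, R = 3, 0.5, 1.0
e = np.array([1.0, 0.0])

def F_e(w):
    """the field F_phi of Eq. 36, evaluated on the constant field phi = e"""
    n = np.linalg.norm(w)
    if n == 0:
        return np.zeros_like(w)
    u = w / n
    return n ** (2 - 2 / N) * (e + (N - 1) * (e @ u) * u)

def line_integral(points):
    """midpoint-rule line integral of F_e along a polyline"""
    mids = (points[1:] + points[:-1]) / 2
    dgs = points[1:] - points[:-1]
    return sum(F_e(m) @ dg for m, dg in zip(mids, dgs))

t = np.linspace(0, 1, 20001)[:, None]
seg1 = (1 - t) * (-R * e) + t * (-r * e)         # -R e -> -r e
seg2 = r * np.c_[np.cos(np.pi * (1 - t)), np.sin(np.pi * (1 - t))]  # -r e -> r e
seg3 = (1 - t) * (r * e) + t * (R * e)           #  r e ->  R e
seg4 = R * np.c_[np.cos(np.pi * t), np.sin(np.pi * t)]              #  R e -> -R e

total = sum(line_integral(s) for s in (seg1, seg2, seg3, seg4))
closed_form = (2 * N / (3 - 2 / N) - 2) * (R ** (3 - 2 / N) - r ** (3 - 2 / N))
assert abs(total - closed_form) < 1e-4
```

On the circular arcs only the component of $F_e$ along $e$ contributes (the radial part is perpendicular to the motion), which is why any geodesic between the endpoints gives the same value.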
Starting from $\Gamma^1_{r,R}$, notice that for every point $w$ lying on this curve: $\left\langle e, \frac{w}{\|w\|} \right\rangle \frac{w}{\|w\|} = e$. Therefore:

$$\int_{\Gamma^1_{r,R}} F_e = \int_{\Gamma^1_{r,R}} \|w\|^{2-\frac{2}{N}} \big( e + (N-1) e \big) = N \int_{\Gamma^1_{r,R}} \|w\|^{2-\frac{2}{N}} e$$

The line integral on the right translates into a simple univariate integral:

$$\int_{\Gamma^1_{r,R}} \|w\|^{2-\frac{2}{N}} e = \int_{-R}^{-r} |\rho|^{2-\frac{2}{N}} d\rho = \int_{r}^{R} \rho^{2-\frac{2}{N}} d\rho = \frac{1}{3 - 2/N} \left( R^{3-\frac{2}{N}} - r^{3-\frac{2}{N}} \right)$$

We thus have:

$$\int_{\Gamma^1_{r,R}} F_e = \frac{N}{3 - 2/N} \left( R^{3-\frac{2}{N}} - r^{3-\frac{2}{N}} \right) \tag{41}$$

Turning to $\Gamma^2_{r,R}$, note that for any point $w$ along this curve $\|w\|^{2-\frac{2}{N}} = r^{2-\frac{2}{N}}$, and $\frac{w}{\|w\|}$ is perpendicular to the direction of motion. This implies:

$$\int_{\Gamma^2_{r,R}} F_e = r^{2-\frac{2}{N}} \int_{\Gamma^2_{r,R}} e$$

The line integral $\int_{\Gamma^2_{r,R}} e$ is simply equal to the progress the curve makes in the direction of $e$, namely $2r$:

$$\int_{\Gamma^2_{r,R}} F_e = r^{2-\frac{2}{N}} \cdot 2r = 2 r^{3-\frac{2}{N}} \tag{42}$$

As for $\Gamma^3_{r,R}$ and $\Gamma^4_{r,R}$, their line integrals may be computed similarly to those of $\Gamma^1_{r,R}$ and $\Gamma^2_{r,R}$, respectively. Such computations yield:

$$\int_{\Gamma^3_{r,R}} F_e = \frac{N}{3 - 2/N} \left( R^{3-\frac{2}{N}} - r^{3-\frac{2}{N}} \right) \tag{43}$$

$$\int_{\Gamma^4_{r,R}} F_e = -2 R^{3-\frac{2}{N}} \tag{44}$$

Combining Equation 40 with Equations 41, 42, 43 and 44, we obtain the desired result.

B. A Concrete Acceleration Bound

In Section 7 we illustrated qualitatively, on a family of very simple hypothetical learning problems, the potential of overparameterization (use of depth-$N$ linear network in place of classic linear model) to accelerate optimization. In this appendix we demonstrate how the illustration can be made formal, by considering a special case and deriving a concrete bound on the acceleration.

In the context of Section 7, we will treat the setting of $p = 4$ ($\ell_4$ loss) and $N = 2$ (depth-2 network). We will also assume, in accordance with the problem being ill-conditioned – $y_1 \gg y_2$ – that initialization values are ill-conditioned as well, and in particular $\epsilon_1 / \epsilon_2 \approx y_1 / y_2$, where $\epsilon_i := |w_i^{(0)}|$. An additional assumption we make is that $y_2$ is on the order of $1$, and thus the near-zero initialization of $w_1$ and $w_2$ implies $y_2 \gg \epsilon_1, \epsilon_2$. Finally, we assume that $\epsilon_1 y_1 \ll 1$.

As shown in Section 7, under gradient descent, $w_1$ and $w_2$ move independently, and to prevent divergence, the learning rate must satisfy $\eta < \min\{2/y_1^{p-2}, 2/y_2^{p-2}\}$. In our setting, this translates to (GD below stands for gradient descent):

$$\eta^{GD} < 2/y_1^2 \tag{45}$$

For $w_2$, the optimal learning rate (convergence in a single step) is $1/y_2^2$, and the constraint above will lead to very slow convergence (see Equation 15 and its surrounding text).

Suppose now that we optimize via overparameterization, i.e. with the update rule in Equation 12 (single output). In our particular setting (recall, in addition to the above, that we omitted weight decay for simplicity – $\lambda = 0$), this update rule translates to:

$$[w_1^{(t+1)}, w_2^{(t+1)}]^\top \leftarrow [w_1^{(t)}, w_2^{(t)}]^\top - \eta \left( (w_1^{(t)})^2 + (w_2^{(t)})^2 \right)^{1/2} \cdot \left[ (w_1^{(t)} - y_1)^3, (w_2^{(t)} - y_2)^3 \right]^\top - \eta \left( (w_1^{(t)})^2 + (w_2^{(t)})^2 \right)^{-1/2} \left( w_1^{(t)} (w_1^{(t)} - y_1)^3 + w_2^{(t)} (w_2^{(t)} - y_2)^3 \right) \cdot [w_1^{(t)}, w_2^{(t)}]^\top \tag{46}$$

For the first iteration ($t = 0$), substituting $\epsilon_i := |w_i^{(0)}|$, while recalling that $y_1 \gg y_2 \gg \epsilon_1 \gg \epsilon_2$, we obtain:

$$[w_1^{(1)}, w_2^{(1)}]^\top \approx \eta \cdot \epsilon_1 \cdot [y_1^3, y_2^3]^\top + \eta \cdot y_1^3 \cdot [\epsilon_1, \epsilon_2]^\top = \eta \cdot [2\epsilon_1 y_1^3, \; \epsilon_1 y_2^3 + \epsilon_2 y_1^3]^\top$$

Set $\eta = \frac{1}{2\epsilon_1 y_1^2}$. Then $w_1^{(1)} \approx y_1$ and $w_2^{(1)} \approx \frac{y_2^3}{2 y_1^2} + \frac{\epsilon_2 y_1}{2 \epsilon_1}$. Our assumptions thus far ($y_1 \gg y_2$ and $\epsilon_1 \gg \epsilon_2$) imply $w_1^{(1)} \gg w_2^{(1)}$. Moreover, since $\epsilon_2 / \epsilon_1 \approx y_2 / y_1$, it holds that $w_2^{(1)} \in \mathcal{O}(y_2) = \mathcal{O}(1)$. Taking all of this into account, the second iteration ($t = 1$) of the overparameterized update rule (Equation 46) becomes:

$$[w_1^{(2)}, w_2^{(2)}]^\top \approx [y_1, w_2^{(1)}]^\top - \frac{1}{2\epsilon_1 y_1} \left[ (w_1^{(1)} - y_1)^3, (w_2^{(1)} - y_2)^3 \right]^\top - \frac{y_1 (w_1^{(1)} - y_1)^3 + w_2^{(1)} (w_2^{(1)} - y_2)^3}{2\epsilon_1 y_1^3} \, [y_1, w_2^{(1)}]^\top \approx \left[ y_1, \; w_2^{(1)} - \frac{1}{2\epsilon_1 y_1} (w_2^{(1)} - y_2)^3 \right]^\top$$

In words, $w_1$ will stay approximately equal to $y_1$, whereas $w_2$ will take a step that corresponds to gradient descent with learning rate (OP below stands for overparameterization):

$$\eta^{OP} := \frac{1}{2\epsilon_1 y_1} \tag{47}$$
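The two-iteration derivation above can be replayed exactly. In the following sketch the numeric constants are our own choices, made to satisfy the stated assumptions ($y_1 \gg y_2$, $\epsilon_1/\epsilon_2 \approx y_1/y_2$, $\epsilon_1 y_1 \ll 1$); the code applies the exact update of Equation 46 and checks only the two approximate conclusions derived above, making no claim about subsequent iterations.

```python
import numpy as np

# illustrative constants satisfying the stated assumptions
y1, y2 = 30.0, 1.0
eps1 = 1e-3
eps2 = eps1 * y2 / y1

eta = 1 / (2 * eps1 * y1 ** 2)            # the step size chosen in the text

def op_step(w):
    """one overparameterized update (Eq. 46; lambda = 0, p = 4, N = 2)"""
    grad = (w - np.array([y1, y2])) ** 3  # per-coordinate l4-loss gradient
    n = np.linalg.norm(w)
    return w - eta * (n * grad + (grad @ w) * w / n)

w0 = np.array([eps1, eps2])
w1 = op_step(w0)
w2 = op_step(w1)

# first-iteration claim: w1 lands approximately on y1
assert abs(w1[0] - y1) < 0.01 * y1

# second-iteration claim: w2 moves like gradient descent with
# effective learning rate eta_op = 1 / (2 eps1 y1)
eta_op = 1 / (2 * eps1 * y1)
pred = w1[1] - eta_op * (w1[1] - y2) ** 3
assert abs(w2[1] - pred) < 0.01 * abs(pred)
```

With $\epsilon_1 y_1 = 0.03$, the effective rate $\eta^{OP} \approx 16.7$ far exceeds the gradient descent bound $\eta^{GD} < 2/y_1^2 \approx 0.002$, which is the acceleration the appendix quantifies.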
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
7
https://github.com/tensorflow/models/
tree/master/tutorials/image/mnist
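The single-output update rule exercised above rests on Claim 2 (Equation 35). That identity can be checked numerically for a generic row vector; the sketch below is our own verification (the dimensions and the eigenvalue-cleanup threshold are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 3, 5
We = rng.normal(size=(1, d))              # k = 1: a single-output end-to-end row
G = rng.normal(size=(1, d))               # stands in for dL^1/dW(W_e)

def frac_power(M, p):
    """fractional power of a symmetric PSD matrix; tiny eigenvalues are zeroed,
    and 0**0 evaluates to 1, so exponent 0 yields the identity as in the text"""
    lam, Q = np.linalg.eigh(M)
    lam = np.where(lam > 1e-9, lam, 0.0)
    return Q @ np.diag(lam ** p) @ Q.T

# left-hand side of Eq. 35; [We We^T]^{(j-1)/N} is a scalar here
lhs = sum(
    (We @ We.T).item() ** ((j - 1) / N) * G @ frac_power(We.T @ We, (N - j) / N)
    for j in range(1, N + 1)
)

# right-hand side: ||We||^{2-2/N} (G + (N-1) Pr_{We}{G}),
# with Pr_{We}{G} the projection of G onto the direction of We
norm = np.linalg.norm(We)
u = We / norm
rhs = norm ** (2 - 2 / N) * (G + (N - 1) * (G @ u.T) * u)
assert np.allclose(lhs, rhs)
```

The check exercises both closed forms used in the proof: the fractional power of the rank-one matrix $W_e^\top W_e$ and its degeneration to the identity at $j = N$.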