Abstract

This work aims to provide understandings on the remarkable success of deep convolutional neural networks (CNNs) by theoretically analyzing their generalization performance and establishing optimization guarantees for gradient descent based training algorithms. Specifically, for a CNN model consisting of l convolutional layers and one fully connected layer, we prove that its generalization error is bounded by O(√(θϱ̃/n)), where θ denotes the freedom degree of the network parameters and ϱ̃ = O(log(∏_{i=1}^{l} b_i(k_i − s_i + 1)/p) + log(b_{l+1})) encapsulates architecture parameters including the kernel size k_i, stride s_i, pooling size p and parameter magnitude b_i. To our best knowledge, this is the first generalization bound that only depends on O(log(∏_{i=1}^{l+1} b_i)), tighter than existing ones that all involve an exponential term like O(∏_{i=1}^{l+1} b_i). Besides, we prove that for an arbitrary gradient descent based algorithm, the approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk. This well explains why gradient descent based training algorithms usually perform sufficiently well in practice. Furthermore, we prove one-to-one correspondence and convergence guarantees for the non-degenerate stationary points between the empirical and population risks. This implies that the computed local minimum for the empirical risk is also close to a local minimum for the population risk, thus ensuring the good generalization performance of CNNs.

¹Department of Electrical & Computer Engineering (ECE), National University of Singapore, Singapore. Correspondence to: Pan Zhou <pzhou@u.nus.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Deep convolutional neural networks (CNNs) have been successfully applied to various fields, such as image classification (Szegedy et al., 2015; He et al., 2016; Wang et al., 2017), speech recognition (Sainath et al., 2013; Abdel-Hamid et al., 2014), and classic games (Silver et al., 2016; Brown & Sandholm, 2017). However, theoretical analyses and understandings of deep CNNs still largely lag their practical applications. Although many recent works establish theoretical understandings of deep feedforward neural networks (DNNs) from various aspects, e.g. (Neyshabur et al., 2015; Kawaguchi, 2016; Zhou & Feng, 2018; Tian, 2017; Lee et al., 2017), only a few (Sun et al., 2016; Kawaguchi et al., 2017; Du et al., 2017a;b) provide explanations for deep CNNs, due to their more complex architectures and operations. Besides, these existing works all suffer certain discrepancies between their theories and practice. For example, the developed generalization error bound either grows exponentially with the depth of a CNN model (Sun et al., 2016) or is data-dependent (Kawaguchi et al., 2017), and the convergence guarantees for optimization algorithms over CNNs are achieved by assuming an over-simplified CNN model consisting of only one non-overlapping convolutional layer (Du et al., 2017a;b).

As an attempt to explain the practical success of deep CNNs and mitigate the gap between theory and practice, this work aims to provide a tighter data-independent generalization error bound and algorithmic optimization guarantees for the deep CNN models commonly used in practice. Specifically, we theoretically analyze deep CNNs from the following two aspects: (1) how their generalization performance varies with different network architecture choices, and (2) why gradient descent based algorithms, such as stochastic gradient descent (SGD) (Robbins & Monro, 1951), adam (Kingma & Ba, 2015) and RMSProp (Tieleman & Hinton, 2012), on minimizing the empirical risk usually offer models with satisfactory performance. Moreover, we theoretically demonstrate the benefits of (strided) convolution and pooling operations, which are unique to CNNs, for the generalization performance, compared with feedforward networks.

Formally, we consider a CNN model g(w; D) parameterized by w ∈ R^d, consisting of l convolutional layers and one subsequent fully connected layer. It maps the input D ∈ R^{r₀×c₀} to an output vector v ∈ R^{d_{l+1}}. Its i-th convolutional layer takes Z^(i−1) ∈ R^{r̃_{i−1}×c̃_{i−1}×d_{i−1}} as input and outputs Z^(i) ∈ R^{r̃_i×c̃_i×d_i} through spatial convolution, non-linear activation and pooling operations sequentially.
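The per-layer computation just described (a strided spatial convolution, an element-wise sigmoid, and a non-overlapping p × p pooling, followed by a fully connected softmax layer) can be sketched numerically. The following is a minimal numpy sketch, not the authors' code: the function names are ours, max pooling is assumed for the pooling operation (the pooling type is not fixed in the text), and shapes are simplified to a single example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(Z, W, stride, pool_size):
    """One convolutional layer: strided spatial convolution with the
    kernels in W, sigmoid activation, then non-overlapping pooling.
    Z: (r, c, d_in) input tensor; W: (k, k, d_in, d_out) kernels."""
    r, c, d_in = Z.shape
    k, _, _, d_out = W.shape
    ro = (r - k) // stride + 1
    co = (c - k) // stride + 1
    X = np.zeros((ro, co, d_out))
    for j in range(d_out):               # one feature map per kernel
        for a in range(ro):
            for b in range(co):
                patch = Z[a*stride:a*stride+k, b*stride:b*stride+k, :]
                X[a, b, j] = np.sum(patch * W[:, :, :, j])
    Y = sigmoid(X)                       # element-wise activation
    p = pool_size                        # non-overlapping p x p pooling
    Zo = Y[:ro - ro % p, :co - co % p, :]
    Zo = Zo.reshape(ro // p, p, co // p, p, d_out).max(axis=(1, 3))
    return Zo

def fc_softmax(Z, W_fc):
    """Final fully connected layer: u = W z, v = softmax(u)."""
    u = W_fc @ Z.reshape(-1)
    e = np.exp(u - u.max())
    return e / e.sum()
```

For example, an 8×8 single-channel input convolved with two 3×3 kernels at stride 1 and pooled with size 2 yields a 3×3×2 output, which `fc_softmax` then maps to class probabilities.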
Understanding Generalization and Optimization Performance of Deep CNNs
Here r̃_i × c̃_i and d_i respectively denote the resolution and the number of feature maps. Specifically, the computation of the i-th convolutional layer is described as

X^(i)(:, :, j) = Z^(i−1) ⊛ W^j_(i) ∈ R^{r_i×c_i}, ∀j = 1, · · · , d_i,
Y^(i) = σ₁(X^(i)) ∈ R^{r_i×c_i×d_i},
Z^(i) = pool(Y^(i)) ∈ R^{r̃_i×c̃_i×d_i},

where X^(i)(:, :, j) denotes the j-th feature map output by the i-th layer; W^j_(i) ∈ R^{k_i×k_i×d_{i−1}} denotes the j-th convolutional kernel of size k_i × k_i, and there are in total d_i kernels in the i-th layer; ⊛, pool(·) and σ₁(·) respectively denote the convolutional operation with stride s_i, the pooling operation with window size p × p without overlap, and the sigmoid function. In particular, Z^(0) = D is the input sample. The last layer is a fully connected one and is formulated as

u = W_(l+1) z^(l) ∈ R^{d_{l+1}} and v = σ₂(u) ∈ R^{d_{l+1}},

where z^(l) ∈ R^{r̃_l c̃_l d_l} is the vectorization of the output Z^(l) of the last convolutional layer; W_(l+1) ∈ R^{d_{l+1}×r̃_l c̃_l d_l} denotes the connection weight matrix; σ₂(·) is a softmax activation function (for classification) and d_{l+1} is the class number.

In practice, a deep CNN model is trained by minimizing the following empirical risk in terms of the squared loss on the training data pairs (D^(i), y^(i)) drawn from an unknown distribution D:

Q̃_n(w) ≜ (1/n) Σ_{i=1}^{n} f(g(w; D^(i)), y^(i)),   (1)

where f(g(w; D), y) = ½‖v − y‖₂² is the squared loss function. One can obtain the model parameter w̃ via SGD or its variants like adam and RMSProp. However, this empirical solution is different from the desired optimum w* that minimizes the population risk:

Q(w) ≜ E_{(D,y)∼D} f(g(w; D), y).

This raises an important question: why do CNNs trained by minimizing the empirical risk usually perform well in practice, considering the high model complexity and non-convexity? This work answers this question by (1) establishing the generalization performance guarantee for CNNs and (2) expounding why the solution w̃ computed by gradient descent based algorithms for minimizing the empirical risk usually performs sufficiently well in practice.

To be specific, we present three new theoretical guarantees for CNNs. First, we prove that the generalization error of deep CNNs decreases at the rate of O(√(θϱ̃/(2n))), where θ denotes the parameter freedom degree¹ and ϱ̃ depends on the network architecture parameters, including the convolutional kernel size k_i, stride s_i, pooling size p, channel number d_i and parameter magnitudes. It is worth mentioning that our generalization error bound is the first one that does not grow exponentially with depth.

¹We use the terminology of "parameter freedom degree" here for characterizing the redundancy of parameters. For example, for a rank-r matrix A ∈ R^{m₁×m₂}, the parameter freedom degree in this work is r(m₁ + m₂ + 1) instead of the commonly used r(m₁ + m₂ − r).

Secondly, we prove that for any gradient descent based optimization algorithm, e.g. SGD, RMSProp or adam, if its output w̃ is an approximate stationary point of the empirical risk Q̃_n(w), then w̃ is also an approximate stationary point of the population risk Q(w). This result is important as it explains why CNNs trained by minimizing the empirical risk have good generalization performance on test samples. We achieve such results by analyzing the convergence behavior of the empirical gradient to its population counterpart.

Finally, we go further and quantitatively bound the distance between w̃ and w*. We prove that when the samples are sufficient, a non-degenerate stationary point w_n of Q̃_n(w) uniquely corresponds to a non-degenerate stationary point w* of the population risk Q(w), with a distance shrinking at the rate of O((β/ζ)√(dϱ̃/n)), where β also depends on the CNN architecture parameters (see Theorem 2). Here ζ accounts for the geometric topology of the non-degenerate stationary points. Besides, the corresponding pair (w_n, w*) shares the same geometrical property: if one of (w_n, w*) is a local minimum or saddle point, so is the other. All these results guarantee that for an arbitrary algorithm provided with sufficient samples, if the computed w̃ is close to the stationary point w_n, then w̃ is also close to the optimum w* and they share the same geometrical property.

To sum up, we make multiple contributions to understanding deep CNNs theoretically. To our best knowledge, this work presents the first theoretical guarantees for deep CNNs on both a generalization error bound without exponential growth in the network depth and optimization performance. We substantially extend prior works on CNNs (Du et al., 2017a;b) from over-simplified single-layer models to multi-layer ones, which is of more practical significance. Our generalization error bound is much tighter than the one derived from Rademacher complexity (Sun et al., 2016) and is also independent of the data and of any specific training procedure, which distinguishes it from (Kawaguchi et al., 2017).

2. Related Works

Recently, many works have been devoted to explaining the remarkable success of deep neural networks. However, most works only focus on analyzing fully feedforward networks from aspects like generalization performance (Bartlett & Maass, 2003; Neyshabur et al., 2015), loss surface (Saxe et al., 2014; Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016; Nguyen & Hein, 2017; Zhou & Feng, 2018), optimization algorithm convergence (Tian, 2017; Li
& Yuan, 2017) and expression ability (Eldan & Shamir, 2016; Soudry & Hoffer, 2017; Lee et al., 2017).

The literature analyzing CNNs is very limited, mainly because CNNs have much more complex architectures and computation. Among the few existing works, Du et al. (2017b) presented results for a simple and shallow CNN consisting of only one non-overlapping convolutional layer and ReLU activations, showing that gradient descent (GD) algorithms with weight normalization can converge to the global minimum. Similarly, Du et al. (2017a) also analyzed the optimization performance of GD and SGD with non-Gaussian inputs for CNNs with only one non-overlapping convolutional layer. By utilizing the kernel method, Zhang et al. (2017) transformed a CNN model into a single-layer convex model which has almost the same loss as the original CNN with high probability, and proved that the transformed model has higher learning efficiency.

Regarding the generalization performance of CNNs, Sun et al. (2016) provided the Rademacher complexity of a deep CNN model, which is then used to establish a generalization error bound. But the Rademacher complexity depends exponentially on the magnitude of the total parameters per layer, leading to loose results. In contrast, the generalization error bound established in this work is much tighter, as discussed in detail in Sec. 3. Kawaguchi et al. (2017) introduced two generalization error bounds for CNNs, but both depend on a specific dataset as they involve the validation error or the intermediate outputs of the network model on a provided dataset. They also presented a dataset-independent generalization error bound, but it requires a specific two-phase training procedure whose second phase needs to fix the states of the ReLU activation functions. However, such a two-phase training procedure is not used in practice.

There are also some works focusing on the convergence behavior of the nonconvex empirical risk of a single-layer model to the population risk. Our proof techniques essentially differ from theirs. For example, Gonen & Shalev-Shwartz (2017) proved that the empirical risk converges to the population risk for those nonconvex problems with no degenerate saddle points. Unfortunately, due to the existence of degenerate saddle points in deep networks (Dauphin et al., 2014; Kawaguchi, 2016), their results are not applicable here. A very recent work (Mei et al., 2017) focuses on single-layer nonconvex problems but requires the gradient and Hessian of the empirical risk to be strongly sub-Gaussian and sub-exponential, respectively. Besides, it assumes a linearity property for the gradient which hardly holds in practice. Comparatively, our assumptions are much milder: we only assume the magnitude of the parameters to be bounded. Furthermore, we also explore the parameter structures of optimized CNNs, i.e. the low-rankness property, and derive bounds matching empirical observations better. Finally, we analyze the convergence rate of the empirical risk and the generalization error of CNNs, which is absent in (Mei et al., 2017).

Our work is also critically different from the recent work (Zhou & Feng, 2018), although we adopt a similar analysis road map. Zhou & Feng (2018) analyzed DNNs, while this work focuses on CNNs, whose more complex architectures and operations are more challenging and require different analysis techniques. Besides, this work provides stronger results in the sense of several tighter bounds under much milder assumptions. (1) For nonlinear DNNs, Zhou & Feng (2018) assumed the input data to be Gaussian, while this work gets rid of such a restrictive assumption. (2) The generalization error bound O(√(r̂^{l+1} d/n)) in (Zhou & Feng, 2018) depends exponentially on the upper magnitude bound r̂ of the weight matrix per layer and linearly on the total parameter number d, while ours is O(√(θϱ̃/n)), which only depends on the logarithmic term ϱ̃ = log(∏_{i=1}^{l+1} b_i) and the freedom degree θ of the network parameters, where b_i and b_{l+1} respectively denote the upper magnitude bounds of each kernel per layer and of the weight matrix of the fully connected layer. Note that the exponential term O(r̂^{l+1}) in (Zhou & Feng, 2018) cannot be further improved due to their proof techniques. The results on empirical gradients and stationary point pairs in (Zhou & Feng, 2018) rely on O(r̂^{2(l+1)}), while ours rely on O(∏_{i=1}^{l+1} b_i), which only depends on b_i instead of b_i². (3) This work explores the parameter structures, i.e. the low-rankness property, and derives tighter bounds, as the parameter freedom degree θ is usually smaller than the total parameter number d.

3. Generalization Performance of Deep CNNs

In this section, we present the generalization error bound for deep CNNs and reveal the effects of different architecture parameters on their generalization performance, providing some principles for model architecture design. We derive these results by establishing uniform convergence of the empirical risk Q̃_n(w) to its population counterpart Q(w).

We start with explaining our assumptions. Similar to (Xu & Mannor, 2012; Tian, 2017; Zhou & Feng, 2018), we assume that the parameters of the CNN have bounded magnitude. But we get rid of the Gaussian assumptions on the input data, meaning our assumption is milder than the ones in (Tian, 2017; Soudry & Hoffer, 2017; Zhou & Feng, 2018).

Assumption 1. The magnitudes of the j-th kernel W^j_(i) ∈ R^{k_i×k_i×d_{i−1}} in the i-th convolutional layer and of the weight matrix W_(l+1) ∈ R^{d_{l+1}×r̃_l c̃_l d_l} in the fully connected layer are respectively bounded as follows:

‖W^j_(i)‖_F ≤ b_i (1 ≤ j ≤ d_i; 1 ≤ i ≤ l),  ‖W_(l+1)‖_F ≤ b_{l+1},

where b_i (1 ≤ i ≤ l) and b_{l+1} are positive constants.

We also assume that the entry value of the target output y
always falls in [0, 1], which can be achieved by conveniently scaling the entry values in y.

In this work, we also consider possible emerging structure in the network parameters, i.e. the low-rankness property.

The total freedom degree θ of the network is θ = a_{l+1}(d_{l+1} + r̃_l c̃_l d_l + 1) + Σ_{i=1}^{l} a_i(k_i² d_{i−1} + d_i + 1), and ϱ = Σ_{i=1}^{l} log(d_i b_i(k_i − s_i + 1)/(4p)) + log(b_{l+1}) + log(√n/(128p²)).

To achieve the same generalization error, more samples are needed to train a deep CNN model having a larger freedom degree θ. As aforementioned, although the results in Theorem 1 are obtained under the low-rankness condition defined on the parameter matrix consisting of the kernels per layer, they are easily extended to the (tensor) low-rankness defined on each kernel individually. The low-rankness captures common parameter redundancy in practice. For instance, Lebedev et al. (2014) and Jaderberg et al. (2014) showed that parameter redundancy exists in a trained network model and can be squeezed out by low-rank tensor decomposition. The classic residual function (He et al., 2016; Zagoruyko & Komodakis, 2016) with a three-layer bottleneck architecture (1×1, 3×3 and 1×1 convolutions) has rank 1 in the generalized block term decomposition (Chen et al., 2017; Cohen & Shashua, 2016). Similarly, inception networks (Szegedy et al., 2017) explicitly decompose a convolutional kernel of large size into two separate convolutional kernels of smaller size (e.g. a 7×7 kernel is replaced by two multiplying kernels of size 7×1 and 1×7). Employing these low-rank approximation techniques helps reduce the freedom degree and yields a smaller generalization error. Notice that the low-rankness assumption only affects the freedom degree θ; without it, θ is replaced by the total parameter number of the network.

From the factor ϱ, one can observe that the kernel size k_i and its stride s_i determine the generalization error in opposite ways: a larger kernel size k_i leads to a larger generalization error, while a larger stride s_i provides a smaller one. This is because both a larger kernel and a smaller stride increase the model complexity, since a larger kernel means more trainable parameters and a smaller stride implies larger feature maps in the subsequent layer. Also, the pooling operation in the first l convolutional layers helps reduce the generalization error, as reflected by the factor 1/p in ϱ.

Furthermore, the number of feature maps (i.e. channels) d_i appearing in θ and ϱ also affects the generalization error. A wider network with larger d_i requires more samples for training such that it can generalize well. This is because (1) a larger d_i indicates more trainable parameters, which usually increases the freedom degree θ, and (2) a larger d_i also requires larger kernels W^j_(i) with more channel-wise dimensions, since there are more channels to convolve, leading to a larger magnitude bound b_i for the kernel W^j_(i). Therefore, as suggested by Theorem 1, a thin network is preferable to a fat one. Such an observation is consistent with other analysis works on network expression ability (Eldan & Shamir, 2016; Lu et al., 2017) and with architecture-engineering practice, such as (He et al., 2016; Szegedy et al., 2015). By comparing the contributions of the architecture and of the parameter magnitudes to the generalization performance, we find that the generalization error usually depends on the network architecture parameters linearly or more heavily, and on the parameter magnitudes only through a logarithmic term log b_i. This implies that the architecture plays a more important role than the parameter magnitudes. Therefore, for achieving better generalization performance in practice, architecture engineering is indeed essential.

Finally, by observing the factor ϱ, we find that imposing certain regularization, such as ‖w‖₂², on the trainable parameters is useful. The effectiveness of such regularization is more significant when it is imposed on the weight matrix of the fully connected layer due to its large size. Such a regularization technique is well known in the deep learning literature as "weight decay". This conclusion is consistent with other analysis works on deep feedforward networks, such as (Bartlett & Maass, 2003; Neyshabur et al., 2015; Zhou & Feng, 2018).

3.2. Discussions

Sun et al. (2016) also analyzed the generalization error bound of deep CNNs, but employing different techniques. They proved that the Rademacher complexity R_m(F) of a deep CNN model with sigmoid activation functions is O(b̃_x (2b̃)^{l+1} √(log(r₀c₀))/√n), where F denotes the function hypothesis that maps the input data D to v ∈ R^{d_{l+1}} by the analyzed CNN model. Here b̃_x denotes the upper bound of the absolute entry values in the input datum D, i.e. b̃_x ≥ |D_{i,j}| (∀i, j), and b̃ obeys b̃ ≥ max{max_i Σ_{j=1}^{d_i} ‖W^j_(i)‖₁, ‖W_(l+1)‖₁}. Sun et al. (2016) showed that with probability at least 1 − ε, the difference between the empirical margin error err^e_γ(g) (g ∈ F) and the population margin error err^p_γ(g) can be bounded as

err^p_γ(g) ≤ inf_{γ>0} [ err^e_γ(g) + (8 d_{l+1}(2 d_{l+1} − 1)/γ) R_m(F) + √(log log₂(2/γ)/n) + √(log(2/ε)/n) ],   (3)

where γ controls the error margin since it obeys γ ≥ v_y − max_{k≠y} v_k, and y denotes the label of v. However, the bound in Eqn. (3) is practically loose, since R_m(F) involves the exponential factor (2b̃)^{l+1}, which is usually very large; in this case, R_m(F) is extremely large. By comparison, the bound provided in our Theorem 1 only depends on Σ_{i=1}^{l+1} log(b_i), which avoids the exponential growth with the depth l, giving a much tighter and more practically meaningful bound. The generalization error bounds in (Kawaguchi et al., 2017) either depend on a specific dataset or rely on a restrictive and rarely used training procedure, while our Theorem 1 is independent of any specific dataset or training procedure, rendering itself more general. More importantly, the results in Theorem 1 make the roles of the network parameters transparent, which could benefit understanding and architecture design of CNNs.
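The gap between an exponential and a logarithmic dependence on depth can be made concrete with a toy computation. The numbers below are hypothetical, chosen only to illustrate the scaling, not the actual constants in either bound:

```python
import math

# Hypothetical magnitude bounds b_i for a network with l = 9 convolutional
# layers plus one fully connected layer; b_tilde plays the role of the
# per-layer magnitude in the Rademacher-complexity bound of Sun et al. (2016).
l = 9
b = [2.0] * (l + 1)          # b_1, ..., b_{l+1}
b_tilde = 2.0

# Exponential factor (2*b_tilde)^(l+1) appearing in R_m(F).
exp_factor = (2 * b_tilde) ** (l + 1)

# Logarithmic term sum_i log(b_i) appearing in our Theorem 1.
log_term = sum(math.log(bi) for bi in b)

print(exp_factor)   # 4^10 = 1048576: grows exponentially with depth
print(log_term)     # 10 * log 2 ≈ 6.93: grows only linearly with depth
```

Even at this modest depth and with per-layer magnitudes of only 2, the exponential factor is about five orders of magnitude larger than the logarithmic term, which is why the Rademacher-complexity bound becomes vacuous for deep models.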
4. Optimization Guarantees for Deep CNNs

Although deep CNNs are highly non-convex, gradient descent based algorithms usually perform quite well at optimizing the models in practice. After characterizing the roles of the different network parameters in the generalization performance, here we present optimization guarantees for gradient descent based algorithms in training CNNs.

Specifically, in practice one usually adopts SGD or its variants, such as adam and RMSProp, to optimize CNN models. Such algorithms usually terminate when the gradient magnitude decreases to a low level and the training hardly proceeds. This implies that the algorithms in fact compute an ε-approximate stationary point w̃ of the loss function Q̃_n(w), i.e. ‖∇_w Q̃_n(w̃)‖₂² ≤ ε. Here we explore the following problem: by computing an ε-stationary point w̃ of the empirical risk Q̃_n(w), can we also expect w̃ to be sufficiently good for generalization, or in other words, expect it to also be an approximate stationary point of the population risk Q(w)? To answer this question, we first analyze the relationship between the empirical gradient ∇_w Q̃_n(w) and its population counterpart ∇_w Q(w). Founded on this, we further establish convergence of the empirical gradient of the computed solution to its corresponding population gradient. Finally, we bound the distance between the computed solution w̃ and the optimum w*.

To our best knowledge, this work is the first to analyze the optimization behavior of gradient descent based algorithms for training multi-layer CNN models with the commonly used convolutional and pooling operations.

4.1. Convergence Guarantees on Gradients

Here we present guarantees on the convergence of the empirical gradient to the population one in Theorem 2. As aforementioned, such results imply good generalization performance of the computed solution w̃ to the empirical risk Q̃_n(w).

Theorem 2. Assume that in CNNs, σ₁ and σ₂ are respectively the sigmoid and softmax functions, and the loss function f(g(w; D), y) is the squared loss. Suppose Assumptions 1 and 2 hold. Then the empirical gradient uniformly converges to the population gradient in Euclidean norm. More specifically, there exist universal constants c_g0 and c_g such that if

n ≥ c_g0 · l² b_{l+1}² (b_{l+1} + Σ_{i=1}^{l} d_i b_i)² (r₀c₀d₀)⁴ max_i(r_i c_i) / (d₀⁴ b₁⁸ (d log(6) + θϱ) ε²),

then

sup_{w∈Ω} ‖∇_w Q̃_n(w) − ∇_w Q(w)‖₂ ≤ c_g β √((2d + θϱ + log(4/ε))/(2n))

holds with probability at least 1 − ε, where ϱ is provided in Lemma 1. Here β and d are defined as

β = [ r_l c_l d_l b_{l+1}²/(8p²) + Σ_{i=1}^{l} (b_{l+1}² d_{i−1} r_{i−1} c_{i−1})/(8p² b_i² d_i²) ∏_{j=i}^{l} d_j b_j² (k_j − s_j + 1)²/(16p²) ]^{1/2}

and d = r̃_l c̃_l d_l d_{l+1} + Σ_{i=1}^{l} k_i² d_{i−1} d_i, respectively.

Its proof is given in Sec. D.3 of the supplement. From Theorem 2, the empirical gradient converges to the population one at the rate of O(1/√n) (up to a log factor). In Sec. 3.1, we have discussed the roles of the network architecture parameters in ϱ. Here we further analyze the effects of the network parameters on the optimization behavior through the factor β. The roles of the kernel size k_i, the stride s_i, the pooling size p and the channel number d_i in β are consistent with those in Theorem 1. The extra factor r_i c_i advocates against building CNNs with extremely large feature map sizes. The total number of parameters d is involved here instead of the freedom degree because the gradient ∇_w Q̃_n(w) may not have low-rank structures.

Based on Theorem 2, we can further conclude that if the computed solution w̃ is an ε-approximate stationary point of the empirical risk, then it is also a 4ε-approximate stationary point of the population risk. We state this result in Corollary 1, with proof in Sec. D.4 of the supplement.

Corollary 1. Suppose the assumptions in Theorem 2 hold and n ≥ (dϱ + log(4/ε))β²/ε. Then if the solution w̃ computed by minimizing the empirical risk obeys ‖∇Q̃_n(w̃)‖₂² ≤ ε, we have ‖∇Q(w̃)‖₂² ≤ 4ε with probability at least 1 − ε.

Corollary 1 shows that by using full gradient descent algorithms to minimize the empirical risk, the computed approximate stationary point w̃ is also close to the desired stationary point w* of the population risk. This guarantee is also applicable to other stochastic gradient descent based algorithms, like SGD, adam and RMSProp, by applying recent results on obtaining ε-approximate stationary points for nonconvex problems (Ghadimi & Lan, 2013; Tieleman & Hinton, 2012; Kingma & Ba, 2015). Accordingly, the computed solution w̃ has guaranteed generalization performance on new data. This partially explains the success of gradient descent based optimization algorithms for CNNs.

4.2. Convergence of Stationary Points

Here we go further and directly characterize the distance between stationary points of the empirical risk Q̃_n(w) and of its population counterpart Q(w). Compared with the results for the risk and gradient, the results on stationary points give more direct performance guarantees for CNNs. Here we only analyze the non-degenerate stationary points, including local minima/maxima and non-degenerate saddle points, as they are geometrically isolated and thus unique in local regions. We first introduce some necessary definitions.

Definition 2. (Non-degenerate stationary points and saddle points) (Gromoll & Meyer, 1969) A stationary point w is said to be a non-degenerate stationary point of Q(w) if

inf_i |λ_i(∇²Q(w))| ≥ ζ,

where λ_i(∇²Q(w)) is the i-th eigenvalue of the Hessian
∇²Q(w) and ζ is a positive constant. A stationary point is said to be a saddle point if the smallest eigenvalue of its Hessian ∇²Q(w) is negative.

Suppose Q(w) has m non-degenerate stationary points, denoted {w^(1), w^(2), · · · , w^(m)}. We have the following results on the geometry of these stationary points in Theorem 3. The proof is given in Sec. D.5 of the supplement.

Theorem 3. Assume that in CNNs, σ₁ and σ₂ are respectively the sigmoid and softmax activation functions, and the loss f(g(w; D), y) is the squared loss. Suppose Assumptions 1 and 2 hold. Then if

n ≥ c_h max( (d + θϱ)/ζ², l² b_{l+1}² (b_{l+1} + Σ_{i=1}^{l} d_i b_i)² (r₀c₀d₀)⁴ max_i(r_i c_i) / (d₀⁴ b₁⁸ d ϱ ε²) ),

where c_h is a constant, then for each k ∈ {1, · · · , m}, there exists a non-degenerate stationary point w_n^(k) of Q̃_n(w) which uniquely corresponds to the non-degenerate stationary point w^(k) of Q(w) with probability at least 1 − ε. Moreover, with the same probability, the distance between w_n^(k) and w^(k) is bounded as

‖w_n^(k) − w^(k)‖₂ ≤ (2 c_g β/ζ) √((2d + θϱ + log(4/ε))/(2n)),  (1 ≤ k ≤ m),

where ϱ and β are given in Lemma 1 and Theorem 2, respectively.

Theorem 3 shows that there exists an exact one-to-one correspondence between the non-degenerate stationary points of the empirical risk Q̃_n(w) and those of the population risk Q(w) for CNNs, if the sample size n is sufficiently large. Moreover, the non-degenerate stationary point w_n^(k) of Q̃_n(w) is very close to its corresponding non-degenerate stationary point w^(k) of Q(w). More importantly, their distance shrinks at the rate of O(1/√n) (up to a log factor). The network parameters have a similar influence on the distance bounds as explained in the above subsection. Compared with the gradient convergence rate in Theorem 2, the convergence rate of corresponding stationary point pairs in Theorem 3 has an extra factor 1/ζ that accounts for the geometric topology of non-degenerate stationary points, similar to other works like stochastic optimization analysis (Duchi & Ruan, 2016).

For degenerate stationary points, whose Hessian matrices have zero eigenvalues, one cannot expect to establish a unique correspondence between the stationary points of the empirical and population risks, since such points are not isolated anymore and may reside in flat regions. But Theorem 2 guarantees that the gradients of Q̃_n(w) and Q(w) at these points are close. This implies that a degenerate stationary point of Q(w) also gives a near-zero gradient for Q̃_n(w), indicating that it is also an approximate stationary point of Q̃_n(w).

Du et al. (2017a;b) showed that for a simple and shallow CNN consisting of only one non-overlapping convolutional layer, (stochastic) gradient descent algorithms with weight normalization can converge to the global minimum. In contrast to their simplified models, we analyze complex multi-layer CNNs with the commonly used convolutional and pooling operations. Besides, we provide results on both the gradient and the distance between the computed solution and the desired stationary points, which are applicable to arbitrary gradient descent based algorithms.

Next, based on Theorem 3, we derive that the corresponding pair (w_n^(k), w^(k)) in the empirical and population risks shares the same geometrical property, stated in Corollary 2.

Corollary 2. Suppose the assumptions in Theorem 3 hold. If either one of the pair (w_n^(k), w^(k)) in Theorem 3 is a local minimum or saddle point, so is the other.

See the proof of Corollary 2 in Sec. D.6 of the supplement. Corollary 2 tells us that the corresponding pair, w_n^(k) and w^(k), has the same geometric property: if either one of the pair is a local minimum or saddle point, so is the other. This result is important for optimization. If the computed solution w̃ obtained by minimizing the empirical risk Q̃_n(w) is a local minimum, then it is also a local minimum of the population risk Q(w). This partially explains why the computed solution w̃ can generalize well on new data. This property also benefits the design of new optimization algorithms. For example, Saxe et al. (2014) and Kawaguchi (2016) pointed out that degenerate stationary points indeed exist for deep linear neural networks, and Dauphin et al. (2014) empirically validated that saddle points are usually surrounded by high-error plateaus in deep feedforward networks. So it is necessary to avoid saddle points and find a local minimum of the population risk. From Theorem 3, it is clear that one only needs to find a local minimum of the empirical risk by using saddle-point-escaping algorithms, e.g. (Ge et al., 2015; Jin et al., 2017; Agarwal et al., 2017).

5. Comparison of DNNs and CNNs

Here we compare deep feedforward neural networks (DNNs) with deep CNNs in terms of their generalization error and optimization guarantees, to theoretically explain why CNNs are, to some extent, preferable to DNNs.

By assuming the input to be standard Gaussian N(0, τ²), Zhou & Feng (2018) proved that if n ≥ 18r̂²/(dτ²ε² log(l+1)), then with probability 1 − ε, the generalization error of an (l+1)-layer DNN model with sigmoid activation functions is bounded by ε_n:

ε_n ≜ c_n τ (1 + c_r l) √(max_i d_i) √((d log(n(l+1)) + log(4/ε))/n),

where c_n is a universal constant; d_i denotes the width of the i-th layer; d is the total parameter number of the network; and c_r = max(r̂²/16, r̂^{2l}/16^l), where r̂ upper bounds the Frobenius norm of the weight matrix in each
Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In COLT, pp. 907–940, 2016.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, pp. 797–842, 2015.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In COLT, pp. 1376–1401, 2015.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In ICML, 2017.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. Deep convolutional neural networks for LVCSR. In ICASSP, pp. 8614–8618. IEEE, 2013.

Saxe, A., McClelland, J., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010.

Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G. Van Den, Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sun, S., Chen, W., Wang, L., Liu, X., and Liu, T. On the depth of deep neural networks: A theoretical view. In AAAI, pp. 2066–2072, 2016.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.

Wang, Y., Xu, C., Xu, C., and Tao, D. Beyond filters: Compact feature map for portable deep model. In ICML, pp. 3703–3711, 2017.

Zhou, P. and Feng, J. Outlier-robust tensor PCA. In CVPR, pp. 1–9, 2017.

Zhou, P. and Feng, J. Empirical risk landscape analysis for understanding deep neural networks. In ICLR, 2018.

Zhou, P., Lu, C., Lin, Z., and Zhang, C. Tensor factorization for low-rank tensor completion. IEEE TIP, 27(3):1152–1163, 2017.