Traditional and Heavy-Tailed Self Regularization in Neural Network Models
Abstract

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of regularization, such as Dropout or Weight Norm constraints. Building on recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit Self-Regularization can depend strongly on the many knobs of the training process. We demonstrate that we can cause a small model to exhibit all 5+1 phases simply by changing the batch size.

¹Calculation Consulting, 8 Locksley Ave, 6B, San Francisco, CA 94122. ²ICSI and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720. Correspondence to: Charles H. Martin <charles@CalculationConsulting.com>, Michael W. Mahoney <mmahoney@stat.berkeley.edu>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

The inability of optimization and learning theory to explain and predict the properties of NNs is not a new phenomenon. From the earliest days of DNNs, it was suspected that VC theory did not apply to these systems (Vapnik et al., 1994). It was originally assumed that local minima in the energy/loss surface were responsible for the inability of VC theory to describe NNs (Vapnik et al., 1994), and that the mechanism for this was that getting trapped in local minima during training limited the number of possible functions realizable by the network. However, it was very soon realized that the presence of local minima in the energy function was not a problem in practice (LeCun et al., 1998; Duda et al., 2001). Thus, another reason for the inapplicability of VC theory was needed. At the time, there did exist other theories of generalization based on statistical mechanics (Seung et al., 1992; Watkin et al., 1993; Haussler et al., 1996; Engel & den Broeck, 2001), but for various technical and nontechnical reasons these fell out of favor in the ML/NN communities. Instead, VC theory and related techniques continued to remain popular, in spite of their obvious problems.

More recently, theoretical results of Choromanska et al. (Choromanska et al., 2014) (which are related to (Seung et al., 1992; Watkin et al., 1993; Haussler et al., 1996; Engel & den Broeck, 2001)) suggested that the Energy/optimization Landscape of modern DNNs resembles the Energy Landscape of a zero-temperature Gaussian Spin Glass; and empirical results of Zhang et al. (Zhang et al., 2016) have again pointed out that VC theory does not describe the properties of DNNs. Martin and Mahoney then suggested that the Spin Glass analogy may be useful to understand severe overtraining versus the inability to overtrain in modern DNNs (Martin & Mahoney, 2017).

Motivated by this, we are interested here in two questions.

• Theoretical Question. Why is regularization in deep learning seemingly quite different than regularization in other areas of ML; and what is the right theoretical framework with which to investigate regularization for DNNs?

• Practical Question. How can one control and adjust, in a theoretically-principled way, the many knobs and switches that exist in modern DNN systems, e.g., to train these models efficiently and effectively, to monitor their effects on the global Energy Landscape, etc.?

That is, we seek a Practical Theory of Deep Learning, one that is prescriptive and not just descriptive. This (phenomenological) theory would provide useful tools for practitioners wanting to know How to characterize and control the Energy Landscape to engineer larger and better DNNs;
and it would also provide theoretical answers to broad open questions such as Why Deep Learning even works.

Main Empirical Results. Our main empirical results consist in evaluating empirically the ESDs (and related RMT-based statistics) for weight matrices for a suite of DNN models, thereby probing the Energy Landscapes of these DNNs. For older and/or smaller models, these results are consistent with implicit Self-Regularization that is Tikhonov-like; and for modern state-of-the-art models, these results suggest novel forms of Heavy-Tailed Self-Regularization.

• Self-Regularization in old/small models. The ESDs of older/smaller DNN models (like LeNet5 and a toy MLP3 model) exhibit weak Self-Regularization, well-modeled by a perturbative variant of Marchenko-Pastur (MP) theory, the Spiked-Covariance model. Here, a small number of eigenvalues pull out from the random bulk, and thus the MP Soft Rank (defined in Section 4) and Stable Rank both decrease. This weak form of Self-Regularization is like Tikhonov regularization, in that there is a "size scale" that cleanly separates "signal" from "noise," but it is different than explicit Tikhonov regularization in that it arises implicitly due to the DNN training process itself.

• Heavy-Tailed Self-Regularization. The ESDs of larger, modern DNN models (including AlexNet and Inception and nearly every other large-scale model we have examined) deviate strongly from the common Gaussian-based MP model. Instead, they appear to lie in one of the very different Universality classes of Heavy-Tailed random matrix models. We call this Heavy-Tailed Self-Regularization. The ESD appears Heavy-Tailed, but with finite support. In this case, there is not a "size scale" (even in the theory) that cleanly separates "signal" from "noise."

Main Theoretical Results. Our main theoretical results consist in a phenomenological theory for DNN Self-Regularization. Our theory uses ideas from RMT—both vanilla MP-based RMT as well as extensions to other Universality classes based on Heavy-Tailed distributions—to provide a visual taxonomy for 5+1 Phases of Training, corresponding to increasing amounts of Self-Regularization.

• Modeling Noise and Signal. We assume that a weight matrix W can be modeled as W ≃ Wrand + ∆sig, where Wrand is "noise" and where ∆sig is "signal." For small to medium sized signal, W is well-approximated by an MP distribution—with elements drawn from the Gaussian Universality class—perhaps after removing a few eigenvectors. For large and strongly-correlated signal, Wrand gets progressively smaller, but we can model the non-random strongly-correlated signal ∆sig by a Heavy-Tailed random matrix, i.e., a random matrix with elements drawn from a Heavy-Tailed (rather than Gaussian) Universality class.

• 5+1 Phases of Regularization. Based on this, we construct a practical, visual taxonomy for 5+1 Phases of Training. Each phase is characterized by stronger, visually distinct signatures in the ESD of DNN weight matrices, and successive phases correspond to decreasing MP Soft Rank and increasing amounts of Self-Regularization. The 5+1 phases are: RANDOM-LIKE, BLEEDING-OUT, BULK+SPIKES, BULK-DECAY, HEAVY-TAILED, and RANK-COLLAPSE.

Based on these results, we speculate that all well optimized, large DNNs will display Heavy-Tailed Self-Regularization.

Evaluating the Theory. We also provide a detailed evaluation of our theory using a smaller MiniAlexNet model.

• Effect of Explicit Regularization. We analyze ESDs of MiniAlexNet by removing all explicit regularization (Dropout, Weight Norm constraints, Batch Normalization, etc.) and characterizing how the ESDs of weight matrices behave during and at the end of Backprop training, as we systematically add back in different forms of explicit regularization.

• Exhibiting the 5+1 Phases. We demonstrate that we can exhibit all 5+1 phases by appropriate modification of the various knobs of the training process. In particular, by decreasing the batch size from 500 to 2, we can make the ESDs of the fully-connected layers of MiniAlexNet vary continuously from RANDOM-LIKE to HEAVY-TAILED, while increasing generalization accuracy along the way. These results illustrate the Generalization Gap phenomenon (Hoffer et al., 2017; Keskar et al., 2016; Goyal et al., 2017), and they explain that phenomenon as being caused by the implicit Self-Regularization associated with models trained with smaller and smaller batch sizes.

We should note that since the initial dissemination of these results, our theory has been used to develop a Universal capacity control metric to predict trends in test accuracies for large-scale pre-trained DNNs (Martin & Mahoney, 2019). A longer and more detailed version of this paper is available in technical report form (Martin & Mahoney, 2018).

2. Basic Random Matrix Theory (RMT)

In this section, we summarize results from RMT that we use.

2.1. Marchenko-Pastur (MP) theory

MP theory considers the density of singular values ρ(νi) of random rectangular matrices W. This is equivalent to considering the density of eigenvalues ρ(λi), i.e., the ESD, of matrices of the form X = (1/N) WᵀW. MP theory then makes strong statements about such quantities as the shape of the distribution in the infinite limit, its bounds, expected finite-size effects, such as fluctuations near the edge, and rates of convergence.

To apply RMT, we need only specify the number of rows and columns of W and assume that the elements Wij are drawn from a distribution that is a member of a certain Universality class (there are different results for different Universality classes). RMT then describes properties of the ESD, even at finite size; and one can compare predictions of RMT with empirical results. Most well-known is the Universality class of Gaussian distributions. This leads to the basic or vanilla MP theory. More esoteric—but ultimately more useful for us—are Universality classes of Heavy-Tailed distributions.

Gaussian Universality class. We start by modeling W as an N × M random matrix, with elements from a Gaussian distribution, such that: Wij ∼ N(0, σ²mp). Then, MP theory states that the ESD of the correlation matrix, X, has the limiting density given by the MP distribution ρ(λ):

    ρN(λ) → [Q / (2πσ²mp)] · √((λ+ − λ)(λ − λ−)) / λ,  as N → ∞ with Q fixed,   (1)

if λ ∈ [λ−, λ+], and 0 otherwise. Here, σ²mp is the element-wise variance of the original matrix, Q = N/M ≥ 1 is the aspect ratio of the matrix, and the minimum and maximum eigenvalues, λ±, are given by

    λ± = σ²mp (1 ± 1/√Q)².   (2)

Finite-size Fluctuations at the MP Edge. In the infinite limit, all fluctuations in ρN(λ) concentrate very sharply at the MP edge, λ±, and the distribution of the maximum eigenvalue ρ∞(λmax) is governed by the Tracy-Widom (TW) Law. Even for a single finite-sized matrix, however, MP theory states the upper edge of ρ(λ) is very sharp; and even when the MP Law is violated, the TW Law, with finite-size corrections, works very well at describing the edge statistics. When these laws are violated, this is very strong evidence for the onset of more regular non-random structure in the DNN weight matrices, which we will interpret as evidence of Self-Regularization.

2.2. Heavy-Tailed extensions of MP theory

MP-based RMT is applicable to a wide range of matrices; but it is not in general applicable when matrix elements are strongly-correlated. Strong correlations appear to be the case for many well-trained, production-quality DNNs. In statistical physics, it is common to model strongly-correlated systems by Heavy-Tailed distributions (Sornette, 2006). The reason is that these models exhibit, more or less, the same large-scale statistical behavior as natural phenomena in which strong correlations exist (Sornette, 2006; Bouchaud & Potters, 2011). Moreover, recent results from MP/RMT have shown that new Universality classes exist for matrices with elements drawn from certain Heavy-Tailed distributions (Bouchaud & Potters, 2011).

We use these Heavy-Tailed extensions of basic MP/RMT to build an operational and phenomenological theory of Regularization in Deep Learning; and we use these extensions to justify our analysis of both Self-Regularization and Heavy-Tailed Self-Regularization. Briefly, our theory for simple Self-Regularization is inspired by the Spiked-Covariance model of Johnstone (Johnstone, 2001) and its interpretation as a form of Self-Organization by Sornette (Malevergne & Sornette, 2002); and our theory for more sophisticated Heavy-Tailed Self-Regularization is inspired by the application of MP/RMT tools in quantitative finance by Bouchaud, Potters, and coworkers (Galluccio et al., 1998; Laloux et al., 1999; 2005; Biroli et al., 2007b;a; Bouchaud & Potters, 2011; Bun et al., 2017), as well as the relation of Heavy-Tailed phenomena more generally to Self-Organized Criticality in Nature (Sornette, 2006).

Universality classes for modeling strongly correlated matrices. Consider modeling W as an N × M random matrix, with elements drawn from a Heavy-Tailed—e.g., a Pareto or Power Law (PL)—distribution:

    Wij ∼ P(x) ∼ 1/x^(1+µ),  µ > 0.   (3)

In these cases, if W is element-wise Heavy-Tailed, then the ESD ρN(λ) exhibits Heavy-Tailed properties, either globally for the entire ESD and/or locally at the bulk edge.

Table 1 summarizes these recent results, comparing basic MP theory, the Spiked-Covariance model, and Heavy-Tailed extensions of MP theory, including associated Universality classes. To apply the MP theory, at finite sizes, to matrices with elements drawn from a Heavy-Tailed distribution of the form given in Eqn. (3), we have one of the following three Universality classes.

• (Weakly) Heavy-Tailed, 4 < µ: Here, the ESD ρN(λ) exhibits "vanilla" MP behavior in the infinite limit, and the expected mean value of the bulk edge is λ+ ∼ M^(−2/3). Unlike standard MP theory, which exhibits TW statistics at the bulk edge, here the edge exhibits PL / Heavy-Tailed fluctuations at finite N. These finite-size effects appear in the edge / tail of the ESD, and they make it hard or impossible to distinguish the edge versus the tail at finite N.

• (Moderately) Heavy-Tailed, 2 < µ < 4: Here, the ESD ρN(λ) is Heavy-Tailed / PL in the infinite limit, approaching ρ(λ) ∼ λ^(−1−µ/2). In this regime, there is no bulk edge. At finite size, the global ESD can be modeled by ρN(λ) ∼ λ^(−(aµ+b)), for all λ > λmin, but the slope a and intercept b must be fit, as they display large finite-size effects. The maximum eigenvalues follow Frechet (not TW) statistics, with λmax ∼ M^(4/µ−1) (1/Q)^(1−2/µ), and they have large finite-size effects. Thus, at any finite N, ρN(λ) is Heavy-Tailed, but the tail decays moderately quickly.

• (Very) Heavy-Tailed, 0 < µ < 2: Here, the ESD ρN(λ) is Heavy-Tailed / PL for all finite N, and as N → ∞ it converges more quickly to a PL distribution with tails ρ(λ) ∼ λ^(−1−µ/2). In this regime, there is no bulk edge, and the maximum eigenvalues follow Frechet (not TW) statistics. Finite-size effects exist, but they are much smaller here than in the 2 < µ < 4 regime.

Fitting PL distributions to ESD plots. Once we have identified PL distributions visually, we can fit the ESD to a PL in order to obtain the exponent α. We use the Clauset-Shalizi-Newman (CSN) approach (Clauset et al., 2009),
Table 1. Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked "∗" are best described as following "TW with large finite-size corrections" that are likely Heavy-Tailed (Biroli et al., 2007b), leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked "∗∗" are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior. See (Davis et al., 2014; Biroli et al., 2007b;a; Péché; Auffinger et al., 2009; Edelman et al., 2016; Auffinger & Tang, 2016; Burda & Jurkiewicz, 2009; Bouchaud & Potters, 2011; Bouchaud & Mézard, 1997) for additional details.
as implemented in the python PowerLaw package (Alstott et al., 2014).

Identifying the Universality class. Given α, we identify the corresponding µ and thus which of the three Heavy-Tailed Universality classes (0 < µ < 2 or 2 < µ < 4 or 4 < µ, as described in Table 1) is appropriate to describe the system. The following are particularly important points. First, observing a Heavy-Tailed ESD may indicate the presence of a scale-free DNN. This suggests that the underlying DNN is strongly-correlated, and that we need more than just a few separated spikes, plus some random-like bulk structure, to model the DNN and to understand DNN regularization. Second, this does not necessarily imply that the matrix elements of Wl form a Heavy-Tailed distribution. Rather, the Heavy-Tailed distribution arises since we posit it as a model of the strongly correlated, highly non-random matrix Wl. Third, we conjecture that this is more general, and that very well-trained DNNs will exhibit Heavy-Tailed behavior in their ESDs for many of the weight matrices.

3. Empirical Results: ESDs for Existing, Pretrained DNNs

Early on, we observed that small DNNs and large DNNs have very different ESDs. For smaller models, ESDs tend to fit the MP theory well, with well-understood deviations, e.g., low-rank perturbations. For larger models, the ESDs ρN(λ) almost never fit the theoretical ρmp(λ), and they frequently have a completely different form. We use RMT to compare and contrast the ESDs of a smaller, older NN and many larger, modern DNNs. For the small model, we retrain a modern variant of one of the very early and well-known Convolutional Nets—LeNet5. For the larger, modern models, we examine selected layers from AlexNet, InceptionV3, and many other models (as distributed with pyTorch).

Example: LeNet5 (1998). LeNet5 is the prototype early model for DNNs (LeCun et al., 1998). Since LeNet5 is older, we actually recoded and retrained it. We used Keras 2.0, using 20 epochs of the AdaDelta optimizer, on the MNIST data set. This model has 100.00% training accuracy, and 99.25% test accuracy on the default MNIST split. We analyze the ESD of the FC1 Layer. The FC1 matrix WFC1 is a 2450 × 500 matrix, with Q = 4.9, and thus it yields 500 eigenvalues.

Figures 1(a) and 1(b) present the ESD for FC1 of LeNet5, with Figure 1(a) showing the full ESD and Figure 1(b) zoomed-in along the X-axis. We show (red curve) our fit to the MP distribution ρemp(λ). Several things are striking. First, the bulk of the density ρemp(λ) has a large, MP-like shape for eigenvalues λ < λ+ ≈ 3.5, and the MP distribution fits this part of the ESD very well, including the fact that the ESD just below the best fit λ+ is concave. Second, some eigenvalue mass is bleeding out from the MP bulk for λ ∈ [3.5, 5], although it is quite small. Third, beyond the MP bulk and this bleeding-out region, are several clear outliers, or spikes, ranging from ≈ 5 to λmax ≲ 25. Overall, the shape of ρemp(λ), the quality of the global bulk fit, and the statistics and crisp shape of the local bulk edge all agree well with MP theory plus a low-rank perturbation.

Example: AlexNet (2012). AlexNet was the first modern DNN (Krizhevsky et al., 2012). AlexNet resembles a scaled-up version of the LeNet5 architecture; it consists of 5 layers, 2 convolutional, followed by 3 FC layers (the last being a softmax classifier). We refer to the last 2 layers before the

Empirical results for other pre-trained DNNs. We have also examined the properties of a wide range of other pre-trained models, and we have observed similar Heavy-Tailed properties to AlexNet in all of the larger, state-of-the-art DNNs, including VGG16, VGG19, ResNet50, InceptionV3, etc. Several observations can be made. First, all of our fits, except for certain layers in InceptionV3, appear to be in the range 1.5 < α ≲ 3.5 (where the CSN method is known to perform well). Second, we also check to see whether a PL is the best fit by comparing the distribution to a Truncated Power Law (TPL), as well as to exponential, stretched-exponential, and log-normal distributions. In all cases, we find that either a PL or TPL fits best (with a p-value ≤ 0.05), with TPL being more common for smaller values of α. Third, even when taking into account the large finite-size effects in the range 2 < α < 4, nearly all of the ESDs appear to fall into the 2 < µ < 4 Universality class.

Towards a Theory of Self-Regularization. For older and/or smaller models, like LeNet5, the bulk of their ESDs (ρN(λ); λ ≲ λ+) can be well-fit to the theoretical MP density ρmp(λ), potentially with distinct, outlying spikes (λ > λ+). This is consistent with the Spiked-Covariance model of Johnstone (Johnstone, 2001), a simple perturbative extension of

4. 5+1 Phases of Regularized Training

In this section, we develop an operational/phenomenological theory for DNN Self-Regularization.

MP Soft Rank. We first define the MP Soft Rank (Rmp), which is designed to capture the "size scale" of the noise part of Wl, relative to the largest eigenvalue of WlᵀWl. Assume that MP theory fits at least a bulk of ρN(λ). Then, we can identify a bulk edge λ+ and a bulk variance σ²bulk, and define the MP Soft Rank as the ratio of λ+ and λmax: Rmp(W) := λ+/λmax. Clearly, Rmp ∈ [0, 1]; Rmp = 1 for a purely random matrix; and for a matrix with an ESD with outlying spikes, λmax > λ+, and Rmp < 1. If there is no good MP fit because the entire ESD is well-approximated by a Heavy-Tailed distribution, then we can define λ+ = 0, in which case Rmp = 0.

Visual Taxonomy. We characterize implicit Self-Regularization, both for DNNs during SGD training as well as for pre-trained DNNs, as a visual taxonomy of 5+1 Phases of Training (RANDOM-LIKE, BLEEDING-

¹For DNNs, these correlations arise in the weight matrices during Backprop training. That is, the weight matrices "learn" the correlations in the data.
Table 2. The 5+1 phases of learning we identified in DNN training. We observed BULK+SPIKES and HEAVY-TAILED in existing trained models (LeNet5 and AlexNet/InceptionV3, respectively; see Section 3); and we exhibited all 5+1 phases in a simple model (MiniAlexNet; see Section 6).
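As a side note, the MP Soft Rank defined in Section 4, Rmp(W) := λ+/λmax, is simple enough to state directly in code. The following few lines are our own illustrative sketch, with made-up eigenvalues; the spike values loosely echo the LeNet5 FC1 spikes (≈ 5 to ≈ 25) from Section 3:

```python
def mp_soft_rank(eigenvalues, lam_plus):
    """MP Soft Rank Rmp(W) := lam_plus / lam_max.

    lam_plus is the fitted MP bulk edge; by convention, pass lam_plus = 0
    when the whole ESD is Heavy-Tailed and no MP bulk fit exists (Rmp = 0).
    """
    return lam_plus / max(eigenvalues)

bulk = [0.3, 0.8, 1.5, 2.1]

# RANDOM-LIKE: the largest eigenvalue sits at the bulk edge, so Rmp = 1.
assert mp_soft_rank(bulk, lam_plus=2.1) == 1.0

# BULK+SPIKES: outlying spikes push lam_max above lam_plus, so Rmp < 1.
spiked = bulk + [5.0, 25.0]
print(mp_soft_rank(spiked, lam_plus=2.1))   # lam_plus / lam_max = 2.1 / 25.0, well below 1

# Entirely Heavy-Tailed ESD: no bulk edge, so Rmp = 0 by convention.
assert mp_soft_rank(spiked, lam_plus=0.0) == 0.0
```

Successive phases in the taxonomy of Table 2 correspond to this ratio shrinking from 1 toward 0.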
OUT, BULK+SPIKES, BULK-DECAY, HEAVY-TAILED, and RANK-COLLAPSE). See Table 2 for a summary. The 5+1 phases can be ordered, with each successive phase corresponding to a smaller Stable Rank / MP Soft Rank and to progressively more Self-Regularization than previous phases. Figure 2 depicts typical ESDs for each phase, with the MP fits (in red). Earlier phases of training correspond to the final state of older and/or smaller models like LeNet5 and MLP3. Later phases correspond to the final state of more modern models like AlexNet, Inception, etc. While we can describe this in terms of SGD training, this taxonomy allows us to compare different architectures and/or amounts of regularization in a trained—or even pre-trained—DNN.

Each phase is visually distinct, and each has a natural interpretation in terms of RMT. One consideration is the global properties of the ESD: how well all or part of the ESD is fit by an MP distribution, for some value of λ+, or how well all or part of the ESD is fit by a Heavy-Tailed or PL distribution, for some value of a PL parameter. A second consideration is local properties of the ESD: the form of fluctuations, in particular around the edge λ+ or around the largest eigenvalue λmax. For example, the shape of the ESD near to and immediately above λ+ is very different in Figure 2(a) and Figure 2(c) (where there is a crisp edge) versus Figure 2(b) (where the ESD is concave) versus Figure 2(d) (where the ESD is convex).

Theory of Each Phase. RMT provides more than simple visual insights, and we can use RMT to differentiate between the 5+1 Phases of Training using simple models that qualitatively describe the shape of each ESD. We model the weight matrices W as "noise plus signal," where the "noise" is modeled by a random matrix Wrand, with entries drawn from the Gaussian Universality class (well-described by traditional MP theory) and the "signal" is a (small or large) correction ∆sig:

    W ≃ Wrand + ∆sig.   (4)

Table 2 summarizes the theoretical model for each phase. Each model uses RMT to describe the global shape of ρN(λ), the local shape of the fluctuations at the bulk edge, and the statistics and information in the outlying spikes, including possible Heavy-Tailed behaviors.

In the first phase (RANDOM-LIKE), the ESD is well-described by traditional MP theory, in which a random matrix has entries drawn from the Gaussian Universality class. In the next phases (BLEEDING-OUT, BULK+SPIKES), and/or for small networks such as LeNet5, ∆sig is a relatively-small perturbative correction to Wrand, and vanilla MP theory (as reviewed in Section 2.1) can be applied, at least to the bulk of the ESD. In these phases, we will model the Wrand matrix by a vanilla Wmp matrix (for appropriate parameters), and the MP Soft Rank is relatively large (Rmp(W) ≫ 0). In the BULK+SPIKES phase, the model resembles a Spiked-Covariance model, and the Self-Regularization resembles Tikhonov regularization.

In later phases (BULK-DECAY, HEAVY-TAILED), and/or for modern DNNs such as AlexNet and InceptionV3, ∆sig becomes more complex and increasingly dominates over
(a) RANDOM-LIKE. (b) BLEEDING-OUT. (c) BULK+SPIKES. (d) BULK-DECAY. (e) HEAVY-TAILED. (f) RANK-COLLAPSE.

Figure 2. Taxonomy of trained models. Starting off with an initial random or RANDOM-LIKE model (2(a)), training can lead to a BULK+SPIKES model (2(c)), with data-dependent spikes on top of a random-like bulk. Depending on the network size and architecture, properties of training data, etc., additional training can lead to a HEAVY-TAILED model (2(e)), a high-quality model with long-range correlations. An intermediate BLEEDING-OUT model (2(b)), where spikes start to pull out from the bulk, and an intermediate BULK-DECAY model (2(d)), where correlations start to degrade the separation between the bulk and spikes, leading to a decay of the bulk, are also possible. In extreme cases, a severely over-regularized model (2(f)) is possible.
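The "noise plus signal" model of Eqn. (4) can also be simulated directly: adding a rank-one ∆sig to a Gaussian Wrand pushes λmax of X = (1/N)WᵀW far above the MP bulk edge λ+ of Eqn. (2), so Rmp = λ+/λmax drops below 1, as in the BULK+SPIKES phase. The sketch below is ours and purely illustrative — the sizes N = 200, M = 100 and signal strength β = 50 are arbitrary choices, and power iteration stands in for a full eigensolver:

```python
import random

random.seed(0)
N, M = 200, 100                    # Q = N/M = 2
sigma2, Q = 1.0, N / M

# "Noise": Gaussian Wrand, element-wise variance sigma2 (vanilla MP setting).
W_rand = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]

# "Signal": rank-one Delta_sig = beta * u v^T, with unit vectors u and v.
beta = 50.0
u = [1.0 / N ** 0.5] * N
v = [1.0 / M ** 0.5] * M
W = [[W_rand[i][j] + beta * u[i] * v[j] for j in range(M)] for i in range(N)]

def lambda_max(W, iters=60):
    # Largest eigenvalue of X = (1/N) W^T W, by power iteration (inf-norm normalized).
    x = [1.0] * M
    lam = 0.0
    for _ in range(iters):
        y = [sum(W[i][j] * x[j] for j in range(M)) for i in range(N)]        # y = W x
        z = [sum(W[i][j] * y[i] for i in range(N)) / N for j in range(M)]    # z = (1/N) W^T y
        lam = max(abs(c) for c in z)
        x = [c / lam for c in z]
    return lam

lam_plus = sigma2 * (1 + 1 / Q ** 0.5) ** 2    # MP bulk edge, Eqn. (2)
lam_mx = lambda_max(W)
print(lam_plus, lam_mx, lam_plus / lam_mx)     # the spike drives the MP Soft Rank well below 1
```

Setting β = 0 recovers a RANDOM-LIKE matrix with λmax near λ+; the HEAVY-TAILED phases correspond instead to replacing the rank-one ∆sig by a matrix with Pareto-distributed elements, as in Eqn. (3).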