A Law of Data Separation in Deep Learning.17020

A Law of Data Separation in Deep Learning
Hangfeng Hea,b and Weijie J. Suc,1

a
Department of Computer Science, University of Rochester, Rochester, NY 14627; b Goergen Institute for Data Science, University of Rochester, Rochester, NY 14627;
c
Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104
This manuscript was compiled on August 14, 2023
While deep learning has enabled significant advances in many areas k, and x̄ denote the mean of all n := n1 + · · · + nK data
of science, its black-box nature hinders architecture design for future points. We define the between-class sum of squares and the
artificial intelligence applications and interpretation for high-stakes within-class sum of squares as
decision makings. We addressed this issue by studying the funda-
1X
K
mental question of how deep neural networks process data in the
SSb := nk (x̄k − x̄)(x̄k − x̄)⊤
intermediate layers. Our finding is a simple and quantitative law that n
arXiv:2210.17020v2 [cs.LG] 11 Aug 2023
k=1
governs how deep neural networks separate data according to class
K nk
membership throughout all layers for classification. This law shows 1 XX
SSw := (xki − x̄k )(xki − x̄k )⊤ ,
that each layer improves data separation at a constant geometric rate, n
k=1 i=1
and its emergence is observed in a collection of network architec-
tures and datasets during training. This law offers practical guidelines respectively. The former matrix represents the between-class
for designing architectures, improving model robustness and out-of- “signal” for classification, whereas the latter denotes the within-
sample performance, as well as interpreting the predictions. class variability. Writing SS+ b for the Moore–Penrose inverse
of SSb (The matrix SSb has rank at most K − 1 and is not
deep learning | data separation | constant geometric rate | intermediate invertible in general since the dimension of the data is typically
layers larger than the number of classes.), the ratio matrix SSw SS+ b
can be thought of as the inverse signal-to-noise ratio. We use
eep learning methodologies have achieved remarkable suc- its trace
D cess across a wide range of data-intensive tasks in image
recognition, biological research, and scientific computing (2–5).
D := Tr(SSw SS+ b) [1]
to measure how well the data are separated (13, 14). This
In contrast to other machine learning techniques (6), however, value, which is referred to as the separation fuzziness, is large
the practice of deep learning relies heavily on a plethora of when the data points are not concentrated to their class means
heuristics and tricks that are not well justified. This situation or, equivalently, are not well separated, and vice versa.
often makes deep learning-based approaches ungrounded for
some applications or necessitates the need for huge computa-
1. Main Results
tional resources for exhaustive search, making it difficult to
fully realize the potential of this set of methodologies (7). Given an L-layer feedforward neural network, let Dl denote
This unfortunate situation is in part owing to a lack of un- the separation fuzziness (Eq. 1) of the training data passing
derstanding of how the prediction depends on the intermediate through the first l layers for 0 ≤ l ≤ L − 1.∗ Fig. 1 suggests
layers of deep neural networks (8–10). In particular, little is that the dynamics of Dl follows the relation
known about how the data of different classes (e.g., images .
Dl = ρl D0 [2]
of cats and dogs) in classification problems are gradually sep-
arated from the bottom layers to the top layers in modern ∗
For clarification, D0 is calculated from the raw data, and D1 is calculated from the data that have
architectures such as AlexNet (2) and residual neural net- passed through the first layer but not the second layer.
works (11). Any knowledge about data separation, especially

quantitative characterization, would offer useful principles and
insights for designing network architectures, training processes, Significance Statement
and model interpretation. The practice of deep learning has long been shrouded in mys-
The main finding of this paper is a quantitative delineation tery, leading many to believe that the inner workings of these
of the data separation process throughout all layers of deep black-box models are chaotic during training. In this paper, we
neural networks. As an illustration, Fig. 1 plots a certain value challenge this belief by presenting a simple and approximate
that measures how well the data are separated according to law that deep neural networks follow when processing data in
their class membership at each layer for feedforward neural the intermediate layers. This empirical law is observed in a
networks trained on the Fashion-MNIST dataset (12). This class of modern network architectures for vision tasks, and its
value in the logarithmic scale decays, in a distinct manner, emergence is shown to bring important benefits for the trained
linearly in the number of layers the data have passed through. models. The significance of this law is that it allows for a new
The Pearson correlation coefficients between the logarithm of perspective that provides useful insights into the practice of
this value and the layer index range from −0.997 and −1 in deep learning.
Fig. 1.
The measure shown in Fig. 1 is canonical for measuring H.H and W.J.S. designed research, performed research, analyzed data, and wrote the paper.
data separation in classification problems. Let xki denote an The authors declare no competing interest.
intermediate output of a neural network on the ith point of
Class k for 1 ≤ i ≤ nk , x̄k denote the sample mean of Class 1
To whom correspondence should be addressed. E-mail: suw@wharton.upenn.edu.
PNAS | August 14, 2023 | vol. XXX | no. XX | 1–14

SGD
Depth=4 Depth=8 Depth=20

Momentum

Adam
0 1 2 3
4 5 6 7
Fig. 1. Illustration of the law of equi-separation in feedforward neural networks with ReLU activations trained on the Fashion-MNIST dataset. The three rows correspond to three
Fig. 1. Illustration of the law of equi-separation in feedforward neural networks with ReLU activations trained on the Fashion-MNIST dataset. The three rows correspond to three
different training methods, stochastic gradient descent (SGD), SGD with momentum, and Adam (12). Throughout the paper, the x axis represents the layer index and the y axis
different training methods, stochastic gradient descent (SGD), SGD with momentum, and Adam (1). Throughout the paper, the x axis represents the layer index and the y axis
represents the separation fuzziness defined in Eq. 1, unless otherwise specified. The Pearson correlation coefficients, by row first, are ≠1.000, ≠0.998, ≠0.997, ≠0.999,
represents the separation fuzziness defined in Eq. 1, unless otherwise specified. The Pearson correlation coefficients, by row first, are −1.000, −0.998, −0.997, −0.999,
≠0.998, ≠0.998, ≠1.000, ≠0.997, and ≠0.999. The last two rows show the intermediate data points on the plane of the first two principal components for the 8-layer
−0.998, −0.998, −1.000, −0.997, and −0.999. The last two rows show the intermediate data points on the plane of the first two principal components for the 8-layer
network trained using Adam. Layer 0 corresponds to the raw input. More details can be found in the supplementary materials.
network trained using Adam. Layer 0 corresponds to the raw input. More details can be found in SI Appendix.
2 2| | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX He He
andand
Su Su
Epoch=0 Epoch=10 Epoch=20
Fig.
Fig. 2. 2. A 20-layer
A 20-layer feedforward
feedforward neural
neural networktrained
network trainedononFashion-MNIST.
Fashion-MNIST.The Thelaw
lawofofequi-separation
equi-separation starts
starts to
to emerge
emerge at
at epoch
epoch 100
100 and
and becomes
becomes more
more clear
clear as
astraining
training
proceeds.
proceeds. AsAs
in in Fig.
Fig. 1 and
1 and allall
thethe other
other figuresininthe
figures themain
maintext,
text,the
thexxaxis
axisrepresents
representsthe
thelayer
layerindex
indexand
andthe
the yy axis
axis represents
represents the
the separation
separation fuzziness.
fuzziness.
60
forfor somedecay
some decayratio
ratio0 0<<ρ fl<<1.1.Alternatively,
Alternatively,thisthislaw
lawimplies
implies layers have
roughly learned
equally the necessary
capable of reducing features. Finally, each
the separation layer
fuzziness 85
. . 1
61
loglog
DD l+1 ≠ log Dl = ≠ log1 , showing that the neural network
l+1 − log Dl = − log ρ ,fl showing that the neural network
becomes roughly equally
multiplicatively. capable of
This dynamics of reducing the separation
data separation during 86
62 makesequal
makes equalprogress
progressininreducing
reducingloglogDDover
overeach
eachlayer
layerononthe
the fuzzinessis multiplicatively.
training illustrated in Fig.This dynamics
2. Neural of data
networks separation
in the terminal 87
63 training data. Hence, we call this the law of

training data. Hence, we call this the law of equi-separation.equi-separation. during training is illustrated in Fig. 2.
phase of training also exhibit certain symmetric geometries Neural networks in 88
64 Thislaw
This lawis isthe
thefirst
firstquantitative
quantitativeandandgeometric
geometriccharacteriza-
characteriza- inthe
theterminal
last layerphasesuchofas training
neural also exhibit
collapse (14,certain
15) and symmetric
minority 89
65 tion of the data separation process in the intermediate

tion of the data separation process in the intermediate layers. layers. geometries
collapse in the
(10). last layer
However, such
it is as neural mentioning
worthwhile collapse (14, that15) andthe 90
Indeed, it is unexpected because the intermediate output of minority collapse (9). However, it is worthwhile mentioning
Indeed, it is unexpected because the intermediate output of law of equi-separation emerges earlier than neural collapse
66 91
the neural network does not exhibit any quantitative patterns, that the law of equi-separation emerges earlier than neural
the neural network does not exhibit any quantitative patterns, during training (see Fig. S1 in SI Appendix).
67 92
as shown by the last two rows of Fig. 1. collapse during training (see Fig. 10 in the supplementary
as shown by the last two rows of Fig. 1.
68 93
The pervasive law of equi-separation consistently prevails
materials).
Thedecay
decayratio
ratioρ fldepends
dependson onthethedepth
depth ofof thethe neural
neural
94
69
The across diverse datasets, learning rates, and class imbalances,
network,dataset,
network, dataset,training
trainingtime,
time,and
andnetwork
networkarchitecture,
architecture,and and as illustrated in Fig.
The pervasive law3.of Additionally,
equi-separation Fig. S6 in SI Appendix
consistently prevails
70
95
is also affected, to a lesser extent, by optimization
is also affected, to a lesser extent, by optimization methods and methods and demonstrates
across diverse its applicability
datasets, learning inrates,
a finer,
andclass-wise context.
class imbalances,
71
96
many other hyperparameters. For the 20-layer network trained
many other hyperparameters. For the 20-layer network trained This law is further
as illustrated in Fig.exemplified in contemporary
3. Additionally, Fig. 15 in network archi-
the supple-
72
97
using Adam (12) (the bottom-right plot of Fig. 1), the decay
using Adam (1) (the bottom-right plot ofln Fig. 1), the decay mentaryfor
tectures materials
vision tasks, demonstrates
such as AlexNetits applicability
and VGGNet in a(16),
finer,
as
73
98
2 0.693
ratio fl is 0.818. Thus, the half-life is ln fl2≠1 = 0.693 = 3.45,
ratio ρ is 0.818. Thus, the half-life is lnln −1 = ln ρ−1 = 3.45, class-wise
shown context.
in Fig. 4 (seeThis law is further
SI Appendix exemplified
for additional in contem-
convolutional
74
ln fl≠1 99
suggesting that this 20-layer neural network reduces the value
ρ
suggesting that this 20-layer neural network reduces the value porarynetwork
neural networkexperiments
architecturesinfor vision
Fig. tasks,
S2 and Fig.such
S3).asMoreover,
AlexNet
75
100
of the separation fuzziness in every three and a half layers.
of the separation fuzziness in every three and a half layers. andlaw
the VGGNet
manifests (16),inas shown neural
residual in Fig. networks
4 (see theand supplementary
densely con-
76
101
77 Thislaw
This lawmanifests
manifestsininthe theterminal
terminalphase
phaseofoftraining
training(14),
(14), materials
nected for additional
convolutional convolutional
networks (17) when neural networkfuzziness
separation experi- 102
where we continue to train the model to interpolate in-sample isments in Fig.

assessed 11 and
at each Fig. 12).
block, Moreover,
as depicted the law
in Fig. 7 andmanifests
Fig. 8,
where we continue to train the model to interpolate in-sample
78 103
data. At initialization, the separation fuzziness may even in- in residual neural
respectively. networks
Intriguingly, thisandlawdensely
appearsconnected
slightly moreconvolu-
pro-
data. At initialization, the separation fuzziness may even
79 104
crease from the bottom to top layers. During the early stages tional networks
nounced in feedforward(17) whenneuralseparation
networks fuzziness
when compared is assessed
with
increase from the bottom to top layers. During the early
80 105
of training, the bottom layers tend to learn faster at reduc- at each

other block, architectures.
network as depicted in Fig. 7 and Fig. 8, respectively.
stages of training, the bottom layers tend to learn faster at
81 106
ing the separation fuzziness compared to the top layers (see Intriguingly, this law appears slightly more pronounced in feed-
reducing the separation fuzziness compared to the top layers The separation dynamics of neural networks have been
82 107
Fig. 16 in the supplementary materials). However, as training forward neural networks when compared with other network
(see Fig. S7 in SI Appendix). However, as training progresses, extensively investigated in prior research studies (18–22). For
83 108
84 progresses, the top layers eventually catch up as the bottom architectures. 109
the top layers eventually catch up as the bottom layers have instance, (18) employed linear classifiers as probes to assess the
learned the necessary features. Finally, each layer becomes separability of intermediate outputs. In (19), the author scru-
He and Su PNAS | March 22, 2023 | vol. XXX | no. XX | 3
He and Su PNAS | August 14, 2023 | vol. XXX | no. XX | 3

CIFAR-10
CIFAR-10

Rate Rate
Learning
Learning
Learning rate=0.01 Learning rate=0.03 Learning rate=0.1
Learning rate=0.01 Learning rate=0.03 Learning rate=0.1

Imbalance
Imbalance

Illustration
Fig. 3.Fig. showing
3. Illustration how the
showing howlaw
theoflaw
equi-separation holds holds
of equi-separation in a wide
in a range of settings.
wide range The experimental
of settings. detailsdetails
The experimental are provided in thein
are provided supplementary
SI Appendix. materials.
Fig. 3. Illustration showing how the law of equi-separation holds in a wide range of settings. The experimental details are provided in the supplementary materials.
110 The
tinized theseparation
separationdynamics
capabilities of ofneural
neuralnetworks
networkshave throughbeen
The separation
extensively dynamics
investigated in of neural
prior research networks
studies have beenFor
(18–22).
empirical spectral analysis. More recently, (20–22) explored
110
111
111 extensively
instance, investigated
(18) employed in priorclassifiers
linear research as studies
probes (18–22).
to assessForthe
112
the separation
instance, (18)
ability
employed
of neural
linear
networksasby
classifiers
examining
probes to
neural
assess the
112
113 separability
collapse of intermediate
at intermediate layers outputs.
and itsInIn (19), the author
relationship withscru- scru-
gen-
separability of intermediate outputs. (19), the author
tinized the separation capabilities of neural networks through
113
114
eralization. In particular, (22) provided crucial
tinized the separation capabilities of neural networks through experimental
empirical spectral analysis. More recently, (20–22) explored
114
115
evidence
empiricalillustrating the progression
spectral analysis. of neural
More recently, collapse
(20–22) within
explored
theinterior
separation abilitynetworks,
of neural networks by examining neural
115
116
116the of neural which is perhaps
the separation ability of neural networks by examining neural the most
117
117
collapse
relevant at
work intermediate
to our paper. layers and its relationship
collapse at intermediate layers and its relationship with gen- with gen-
118
118
eralization. Importantly, (22)
eralization. Importantly, (22)experimentally
experimentallydemonstrated
demonstrated
the progression of neural
neuralcollapse
collapsewithin
withinthetheinterior
interiorof of neu-
2. the progression
fromofthe neu-
119
119
Insights Law
120
120
ral
ral networks. Nevertheless,this
networks. Nevertheless, thisworkworkhas hasnot
notyetyet derived
derived a a AlexNet-Fashion-MNIST VGG13-Fashion-MNIST
121Next, we demonstrate
quantitative law, such
suchthe asbenefits
Eq.2,2,toof the
togovern lawthe
govern of equi-separation
thedata
data separation AlexNet-Fashion-MNIST VGG13-Fashion-MNIST
121 quantitative law, as Eq. separation
122and
122 the perspective of data separation in regard to the three
process.
process.
fundamental aspects of deep learning practice, namely, archi-
123
123
tecture
3. design,from
3. Insights
Insights training,
from and interpretation.
the Law
the Law
124
124 Next,
A.Next, we demonstrate
we
Network demonstrate
architecture.the
thebenefits
The law ofofequi-separation
benefits ofthe
thelawlawofof
equi-separation
equi-separation
can inform
125
125
the design principles for network architecture. toFirst,
and
and the
the perspective
perspective of
of data
data separation
separation ininregard
regard thethe
to three
three
the
126
126
lawfundamental
fundamental
implies thataspects
aspects of
of deep
neural deeplearning
learning
networks practice,
shouldpractice, namely,
be deep for archi-
namely, archi-
good
tecture design,
tecture design, training,
training, and
andinterpretation.
interpretation.
performance, thereby reflecting the hierarchical nature of deep
127
127
learning (3, 23–25). This is because, as shown by Eq. 2, all AlexNet-CIFAR-10

AlexNet-CIFAR-10 VGG13-CIFAR-10
VGG13-CIFAR-10
128 A. Network
A. Network architecture.
architecture. The
The law
lawofofequi-separation
equi-separation can inform
can inform
128
layers
the reduce
design the separation
principles for fuzziness
network D0 of the raw
architecture. input
First, theto
129
129
ρ theDdesign
L−1
at theprinciples
last layer.for network
When L isarchitecture.
small—say, LFirst,
= 2 orthe Fig. 4. Illustration of the equi-separation law on convolutional neural networks. A
Fig. 4. 4.
Fig. Illustration of of
Illustration thethe
equi-separation
equi-separationlawlaw
onon
convolutional neural
convolutional networks.
neural networks.A A
law implies
0 that neural networks should be deep for good more in-depth examination of convolutional neural networks can be found in Figs. 11
130
130 law implies
3—the ratio ρ thatwould
L−1 neural networks
generally not should
be beand,
small deeptherefore,
for good more in-depth
more in-depthexamination
examination of convolutional neural
of convolutional networks
neural cancan
networks be be
found in Figs.
found S2 11
in Figs.
performance, thereby reflecting the hierarchical nature of deep and 12 of the supplementary materials.
performance, thereby reflectingtothe hierarchical nature of deep and S312
in of
SI the
Appendix.
131 and supplementary materials.
131
the neural (2,
learning network
23–25).is This
unlikely separate
is because, the data
as shown well.
by Eq. In
2, all
learning (2, 23–25). This is because, as shown by Eq. 2,byall
132
132
133the literature, the fundamental role of depth is recognized
layers reduce the separation fuzziness D0 of the raw input to
133 layers reduce
analyzing thefunctions
the loss separation(11, fuzziness
16, 23, D 0 of
26, theand
27), rawourinput
lawto
4 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX He and Su
4 4| | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX HeHe and
and SuSu
MNIST Fashion-MNIST CIFAR-10
Fig.5.
Fig. Illustrationof
5. Illustration ofthe
thelaw
lawof
ofequi-separation
equi-separationwith
withvarying
varyingnetwork
networkdepths.
depths. The
Thelegend
legendshows
showshow
howdepths
depthsare
arecolor
colorcoded.
coded. The
Thedepth
depththat
thatyields
yieldsthe
thelowest
lowestseparation
separation
fuzzinessdepends
fuzziness dependsononthe
thedataset
datasetcomplexity.
complexity.
134 flL≠1 D0 at the last layer. When L is small—say, L = 2 or in network weights when the law manifests. To see this point, 179
135 3—the ratio flL≠1 would generally not be small and, therefore, let 180
the neural network is unlikely to separate the data well. In DL≠1 DL≠2 D1 DL≠1
136 R := ◊ ◊ ··· ◊ =
Width
181
137 the literature, the fundamental role of depth is recognized by DL≠2 DL≠3 D0 D0
138 analyzing the loss functions (10, 16, 23, 26, 27), and our law denote the reduction ratio of the separation fuzziness made 182
139 of equi-separation offers a new perspective on network depth. by the entire network. Assuming that a perturbation of the 183
140 For completeness, the deeper the better is not necessarily network weights leads to a change of Á in the ratio for each 184
141 true because as the depth L increase, fl might become larger. layer, the perturbed reduction ratio becomes 185
142 Moreover, a very large depth would render the optimization 3 43 4 1 2

143 problem challenging.Width=20 As shown in Fig. 1, the 20-layer neural Width=100 RÁ =
DL≠1
+Á
DL≠2
+ Á ···
Width=1000
D1
+Á
144 networks have higher values of the final separation fuzziness DL≠2 DL≠3 D0
145 than those of their 8-layer counterparts. The Fashion-MNIST 3 4 186
DL≠2 DL≠3 D0 2
146 dataset is not very complex and, therefore, an 8-layer net- = R + R + + · · · + Á + O(Á ).
DL≠1 DL≠2 D1
work will suffice to classify the data well. Adding additional
Shape
147
148 layers, therefore, would only give rise to optimization bur- Applying the inequality of arithmetic and geometric means, 187
1 2
dens. Indeed, Fig. 5 illustrates the law of equi-separation with
the perturbation term R DL≠2 + + + Á is min- 188
149 D DL≠3 D0
· · ·
150 varying network depths, showing that different datasets corre- L≠1 DL≠2 D1
spond to different optimal depths. Therefore, as a practical imized in absolute value when DL≠2 = DL≠3 = · · · = D . 189
DL≠1 DL≠2 1
151
D0
152 guideline, the choice of depth should consider the complexity This result is summarized in the following proposition: 190
153 of the applications.Narrow-Wide For instance, this perspective of data Wide-Narrow

separation shows that the optimal depths for MNIST (28), Proposition 1. For small Á, |RÁ ≠ R| is minimized when
Mixed 191
154
Fashion-MNIST,
6. Illustration and
of law
the of
lawCIFAR-10 (29)
of equi-separation are 6, 10, and 12, respec-
Fig. Fig. with with different widths
and and shapes of neural
the neural networks. For example, “Narrow-Wide” means that the
1 bottom layers are narrower
155
6. Illustration of the equi-separation different widths shapes of the networks. DL≠1
For example, DL≠2
“Narrow-Wide” means the D
156 tively.
than This
the is consistent with the increasing complexity from
top layers. = = ·that
·· = bottom
. layers are narrower 192
than the top layers. DL≠2 DL≠3 D0
157 MNIST to Fashion-MNIST to CIFAR-10 (10, 30, 31).
158 Moreover, the law of equi-separation indicates that depth This condition is precisely the case when the equi-separation 193
159 of equi-separation
bears a more significant offers aconnection
new perspective on network
to training depth. than
performance law holds. Thus, of
the “shape” theneural
data separation
networks. In ability
the firstof the rowclassifier
of Fig. 6,is 194
160 than the “shape” of neural networks. In
For completeness, the deeper the better is not necessarily the first row of Fig. 6, not
the very
network sensitive
does to
not perturbations
separate the in
data network
well when weights
the when
width 195
width of theeachlawlayer
manifests.
is merelyTherefore,
20. This is to because
improvethe thenetwork
robustness,has
Two-layer block
161 the network

true because does as thenot separate
depth the data
L increase, well when
ρ might become the larger. 196
162 of each layer is merely 20. This is

Moreover, a very large depth would render the optimization because the network has we
to be need
wide to train
enough a toneural
pass network
the useful at least
features until
of the the law
data of
for 197
163 to be wide enough to pass the useful

problem challenging. As shown in Fig. 1, the 20-layer neural features of the data for equi-separation
classification. comes
However, into
when effect.
the In
network the literature,
is wide enough,however,
the 198
164 classification. However, when the network

networks have higher values of the final separation fuzziness is wide enough, the connections
network with
performance robustness
may be are often
saturated established
even if through
more neuronsloss 199
165 network
than those performance
of their 8-layer maycounterparts.
be saturated The if more neurons are
even Fashion-MNIST functions
added rather
to eachthan layerdata(32).separation
In addition, (34).the second row of 200
166 are added to each layer (32). In addition,

dataset is not very complex and, therefore, an 8-layer net- the second row of Fig. The
6 law
shows of equi-separation
that it is better to can
startalso
with shed light
relatively on the
wide out-of-
layers 201
167 Fig. 6 shows that it is better to start with

work will suffice to classify the data well. Adding additional relatively wide layers sample
at the performance
bottom layers of neural
followed networks.
by narrow Fig.
top 9 considers
layers. Thesetwo 202
168 at the therefore,

layers, bottom layers would followed
Layerwise only giveby narrow top layers.
rise to optimization BlockwiseThese neural networks
bur- observations suggestwhere
Layerwisethat one exhibits
very the law
wide neural ofBlockwise
networks equi-separation
in general 203
169 observations
dens. Indeed, suggest that very the
Fig. 5 illustrates widelaw
neural networks in general
of equi-separation with and
should the other
not be does not.
recommended The results
as they of our
consume experiments
more showed
time for 204
should not be recommended as they consume more time for that the
optimization. former Apart one has
from better
encoding test performance.
purposes, the width Although
should
Three-layer block
170
varying network depths, showing that different datasets corre- 205
171 optimization. Apart from encoding purposes,

spond to different optimal depths. Therefore, as a practical the width should this
be setlawto is sometimes
about the same not observed
across all in
layersneural
or benetworks
made with
larger 206
172 be set to about

guideline, the choice the sameof depth across all layers
should consider or thebe made larger for
complexity good thetest
bottomperformance,
layers (30,interestingly,
33). one can fine-tune the 207
173 forthe
of theapplications.
bottom layersFor (30,instance,
33). this perspective of data parameters to reactivate the law with about the same or better 208
separation shows that the optimal depths for MNIST (28), B. test performance. 209
B. Training. The emergence of the law of equi-separation dur- Training.

Moreover, Thethe emergence
law of of the law of equi-separation
equi-separation can be dur- 210
potentially
174
Fashion-MNIST, and CIFAR-10 (29) are 6, 10, and 12, respec- ing training is indicative of good model performance. For ex-
175 ing training
tively. This isisconsistent
indicativewith of goodthe model
increasingperformance.
complexityFor fromex- used to facilitate the training of transfer learning. In transfer 211
ample, its manifestation can improve the robustness to model ample,
learning, itsfor
manifestation
example, bottom can improve the robustness
layers trained from an to model
upstream
176
MNIST to Fashion-MNIST Layerwise to CIFAR-10 (11, 30,Blockwise 31). shifts. Indeed, the
212
part of a network that will be trained onasa 213

data separation ability of the network,
Layerwise Blockwise
177 shifts. Indeed, the data separation ability of the network, as task can form
Moreover, by the law of equi-separation indicates that depth measured by task. ,A becomes least sensitive to perturbations
DL−1
measured , becomes least sensitive to perturbations downstream challenge arising in practice is how to 214
DL≠1
178 D0
bears a more significant D0
connection to training performance in network weights when the law manifests. To see this point,
Mixed
He and Su PNAS | March 22, 2023 | vol. XXX | no. XX | 5

Fig. 6. Illustration of the law of equi-separation with different widths and shapes of the neural networks. For example, “Narrow-Wide” means that the bottom layers are narrower
than the top layers.
Two-layer block
Layerwise Blockwise Layerwise Blockwise

Three-layer block

Mixed
Fashion-MNIST CIFAR-10
Fig. 7. Illustration of the law of equi-separation in residual neural networks: Each block contains either two or three layers. In the “Mixed” configuration, the first two blocks
Fig. 7. Illustration
consist of the each,
of three layers law of while
equi-separation
the last twoinblocks
residual
haveneural networks:
two layers each.Each block
In each contains
plot, a layereither two or
or a block is three layers.
identified as aInmodule
the “Mixed”
whenconfiguration,
presenting thethe first two
x axis. Theblocks
first two
consist of three
columns layers each,
are evaluated on while the last two blocks
the Fashion-MNIST haveand
dataset, twothe
layers
last each. In eachare
two columns plot, a layer orona the
evaluated block is identified
CIFAR-10 as a module when presenting the x axis. The first two
dataset.
columns are evaluated on the Fashion-MNIST dataset, and the last two columns are evaluated on the CIFAR-10 dataset.
let6 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX He and Su last layer in

DL−1 DL−2 D1 DL−1
R := × × ··· × = we have ext
DL−2 DL−3 D0 D0 black-box m
denote the reduction ratio of the separation fuzziness made law that qua
by the entire network. Assuming that a perturbation of the neural netwo
network weights leads to a change of ε in the ratio for each has been de
layer, the perturbed reduction ratio becomes sights into de
design, train

DL−1 DL−2 D1 Fashion-MNIST CIFAR-10 One aven
Rε := +ε + ε ··· +ε
DL−2 DL−3 D0 ity of the law
[3] Illustrationofofthe
Fig.8.8.Illustration
Fig. thelaw
law ofof equi-separation
equi-separation in
indensely
denselyconnected
connectedconvolutional
convolutional architectures
DL−2 DL−3 D0 networks(DenseNet161)
(DenseNet161)by by identifying
identifying a
a block
block as
asaamodule.
=R+R + + ··· + ε + O(ε2 ). networks module. esting to see
DL−1 DL−2 D1 equations an
understandin
Applying the inequality of arithmetic and geometric
means, train a neural network at least until the law of equi-separation Additionally,
the perturbation term R DL−1 + DL−2 + · · · + D1 ε is min-
DL−2 DL−3 D0
comes into effect. In the literature, however, connections with sion settings
robustness are often established through loss functions rather that remain
imized in absolute value when DL−1 = = = .
D DL−2 D1
· · ·
L−2 DL−3 D0
than the data separation perspective (34). is most accu
This result is summarized in the following proposition: somewhat fu
This proposition suggests that the law of equi-separation
Proposition 1. For small ε, |Rε − R| is minimized when Next, while
could be considered a demonstration of the neural networks’ variety of vis
DL−1 DL−2 D1 “homogeneity” across all layers. An essential component in models such
= = ··· = . Standard training
the derivation of Proposition 1 involves the assumption that
Frozen training
DL−2 DL−3 D0 materials). T
the
Fig.magnitude of thetheperturbation
9. Illustration showing remains
law of equi-separation uniform
manifesting across
in the left networkall learn token-l
This condition is precisely the case when the equi-separation layers,
but not inastheseen from because
right network Eq. 3.ofEven though
frozen training. Bothitnetworks
is improbable for
have about the each layer. A
law holds. Thus, the data separation ability of the classifier is the
sameperturbation
training losses as towell
remain constant
as training inHowever,
accuracies. practice, it is
the left intriguing
network has a
the law of eq
higher test accuracy (23.85%) than that of the right network (19.67%).
not very sensitive to perturbations in network weights when the to note that the emergence of this law is contingent on the use sentence-leve
law emerges. Therefore, to improve the robustness, we need to of batch normalization (35), which enhances the homogeneity Perhaps
215 balance between the training times on the source and the target. derlying me
6 | 216 In light of the law of equi-separation, a sensible balance is to
He and Su some comm
217 find when the improvement in reducing separation fuzziness is learning theo
218 roughly the same between the bottom and top layers. linearity of a
the separatio
746 could be considered a demonstration of the neural networks’ law networks by identifying
that quantitatively governs a block howaswell-trained
a module. For comparison,
real-world 249 808
747 “homogeneity” across all layers. An essential component in neural when networks separate data throughout all layers. This law (36),
the perspective of data separation is not considered 250 809
748 the derivation of Proposition 1 involves the assumption that has the been correct module cannot
demonstrated to provide be identified.
useful guidelines For completeness,
and in- 251it 810
749 the magnitude of the perturbation remains uniform across sights should
into deepbe notedlearning that the law
practice becomes
through network less architecture
clear for deeper 252 811
†
750 all layers, as seen from Eq. 3. Even though it is improbable design, residual
training,neural andnetworks
interpretation , as shown in Fig. 10.
of predictions. 253 812
751 for the perturbation
Fashion-MNIST to remain constantCIFAR-10 in practice, it is One avenue for future research is to explore the applicabil- 254 813
752
of the networks
intriguing to note that in a the certainemergencesense.ofIndeed, this law Fig. S10 in SIity of the law of equi-separation to a wider range of network 255
is contingent 814
753 onAppendix
Fig.the use of
8. Illustration showsbatch
of the lawthat the law does
normalization
of equi-separation in(35), notwhich
densely appear
connected when batch
enhances
convolutionalthe architectures and applications. For example, it would be inter- 256 815
754
normalization
homogeneity
networks (DenseNet161)
ofisthe not used. a block
networks
by identifying
inasaa certain
module.
sense. Indeed, esting to see if this law holds for neural ordinary differential 257 816
755 Fig. S10 in SI Appendix shows that the law does not appear equations and, if so, how it could be used to improve our 258 817
756 when batch normalization is not used. understanding of the depth of these continuous-depth models. 259 818
757
Additionally, it would be valuable to examine the law in regres- 260 819
758
sion settings and multi-modal tasks (39). A crucial question 261 820
that remains to be addressed is understanding why the law 262
759 821
is most accurate in feedforward neural networks and becomes 263
760 Fashion-MNIST CIFAR-10 822
somewhat fuzzier in certain convolutional neural networks. 264
761 823
Next,
Fig.while
10. The thelaw law
law of
of equi-separation
of equi-separation
equi-separation in in ResNet18
has been
ResNet18 isisnot notas
observed
aspronounced
in aasititis265
pronouncedas isinin
762 824
variety of vision
shallower residual tasks,
neuralit does not appear to hold for language 266
networks.
763
Standard training Frozen training models such as BERT (40) (see Fig. 19 in the supplementary 267 825
764
materials). This could be due to the fact that language models 268 826
Because each module makes equal but small contributions,
Fig. 9. Illustration showing the law of equi-separation manifesting in the left networklearn token-level
crucial to take features rather than sentence-level
allalayers
single collectively
features at (37).
for interpretation
765 Fig. 9. Illustration showing the law of equi-separation manifesting in the left network 827
a few neurons in layer are unlikely to explain how
269
but not in the right network because of frozen training. Both networks have about theeach layer. As such, it would be important to studyused whether
but not in theStandard
right networktraining
because of frozen training. Both Frozen
networkstraining
have about the
This view, however, challenges the widely layer-wise
766 828
a deep learning model makes a particular prediction. It is
270
same training losses as well as training accuracies. However, the left network has a
same training losses as well as training accuracies. However, the left network has athe law of equi-separation could be utilized to derive better
approaches to deep learning interpretation (38, 39).
767 271 829
768
higher
Fig.
higher
test accuracy
9. Illustration
test accuracy
(23.85%)
showing than
the law
(23.85%) than
that
of thatof the right
equi-separation network (19.67%).
manifesting
of the right network in the left network
(19.67%). therefore crucial
sentence-level features to fromtake all layersof collectively
a sequence for interpre-
token-level features. 272 830
but not in the right network because of frozen training. Both networks have about tation (37). This view, however, challenges the widely used
T
769 Perhaps the most pressing question is to reveal the un- 273 831
the same training losses as well as training accuracies. However, the left network
layer-wise
3. Discussion approaches to deep learning interpretation (38, 39).
770215 balance
has higherbetween
aThe lawtest of the (23.85%)
trainingthan
equi-separation
accuracy timesthaton
can theright
ofalso
the source
shed and(19.67%).
light
network thethe
on out-of-derlying
target. mechanism of this data-separation law. However, 274 832
771216 In light of
sample the law of equi-separation,
performance of neural networks. a sensibleFig. 9balanceconsiders is totwosome Due
commonly used assumptions in the literature of deep 275
to (14) and follow-up work, it has been known that neural
833
find when the improvement fuzziness is learning 3. Discussion

theory may not be applicable. For instance, the non- 276
neural networks where oneinexhibits reducing theseparation
law of equi-separation
F
772217 834
The law of equi-separation can also shed light on the out-of- networks can exhibit precise mathematical structures at the
roughly the same between the bottom and top layers. linearity of (14)
Due layer
to activation functions is crucial to thisknown
law because
and theperformance
other does not. The results of our experiments
9 considersshowed inand thefollow-up
terminalwork, it has been In that
thisneural
773218 835
two the last phase of training. paper,
277
sample of neural networks. Fig. separation fuzziness is roughly unchanged by linear trans-
774
that the
neural formerwhere
networks one has better
exhibitstest theperformance. Although we networks can exhibit
have Additionally,
extended thisprecise
viewpoint mathematical thestructures
from networks surface ofatthese
the
278 836
C. Interpretation. The one equi-separation law offers
law of equi-separation
a new per- formations. analyses of neural using
this law is sometimes not observed in neural networks with last layer in the terminal phase of training. In this paper, we
775219 279
RA
837
and the other
spective does not. The
for interpreting deep results
learning of our experiments
predictions, showed the black-box
especially neural tangent modelskernel to their (41)interior
are notbyable introducing
to capture an the
empirical
good test performance, interestingly, one can fine-tune the have extended this viewpoint from the surface of these black-
280
776220 838
that the formerdecision
in high-stakes one hasmakings. better test performance.
An important Although
ingredient in lawofthat
effect depth, quantitatively
which is essential governs how well-trained
for understanding real-world
the data 281
777221
parameters to reactivate notthe law with about neuralthe same or with better neuralbox models to their interior bythroughout
introducingallanlayers. empirical Thislaw
839
222 this law is sometimes
interpretation is to pinpoint observed
the basicinoperational networks
modules in separation networks
process across separate data
all layers. This suggests that more law
282
778
223
test
good performance.
neuraltest performance,
networks and subsequentlyinterestingly, to reflectone that can fine-tune
“all modules that quantitatively governs to how
the innovative approaches may be necessary to fully understand 283in-
has been demonstrated provide well-trained
useful real-world
guidelines neural
and
840
779
224 are Moreover,
created equal.”
parameters to the law of the
For feedforward
reactivate equi-separation
law with can the
and convolutional
about be potentially
same networks
neuralor the sights intoseparate
mechanisms deepbehind data
learningthis law. throughout
practice throughall layers.
network This law has
architecture 284
841
780
225
used
networks,
better to facilitate
test each the training of transfer
layer is a module as it reduces the separation
performance. learning. In transfer been demonstrated
design, training,
Broadly speaking,and to provide
interpretation
a perspective useful
informed guidelines
of predictions. and
by this law is 285 insights 842
781
226
learning,
fuzziness
Moreover, for an
by example,
theequal bottom
law multiplicative
of layersfactor
equi-separation trained (for
can from
bemore andetails,
upstreamto focus
potentially intoOne deep thelearning
on avenue for future
non-temporal practice researchthrough
dynamics to network
is of explore
the datathe architecture
applicabil-
separa- 286
843
782
task
see Fig.can 18 form
in the part of
supplementary a network
used to facilitate the training of transfer learning. In transfer that
materials). will be
Interestingly,trained even on a tiondesign,
ity of
process thetraining,
law
throughout of and interpretation
equi-separation
all layers when to aof predictions.
wider
studying range
deep of
neural network
844
D
227 287
783
228 downstream
when
learning, thefor task.
lawexample,
no longer Abottom
challenge
manifests layers arising
in residual
trained in from
practice
neural is how tonetworks.
annetworks,
upstream OneThis
architectures avenue andfor
perspective future
applications.probes research is to
intoexample,
For the explore
interior thebeappli-
of deep
it would inter-
288
845
784
229 balance
Fig.
task 7 shows
can between
formthat the
part it training
cana be
of times on
restored
network that theidentifying
by source
will beand athe
trained target.
block cability
on learning
esting toofseethe
models by law of equi-separation
perceiving
if this law holds thefor neuraltoordinary
black-box a widerdifferential
classifier asrange
a 289 of 846
785
230 In light
rather thanof the
a law
layer asof a equi-separation,
module.
a downstream task. A challenge arising in practice is This a
figure sensible
also balance
suggests thatis to network
sequence
equationsof architectures
transformations,
and, if so, and
how applications.
each it of
couldwhich be For example,
corresponds
used to to
improveita would our
290
847
786231
find to
blocks
how when balancethe improvement
containing
between theintraining
more layers are reducing separation
more capable
times fuzziness
of reducing
on the source islayer beorinteresting
a block in the
understanding to the
of seedepth
case if residual
of thisoflaw these holds
neural for neural
networks.
continuous-depth ordinary
In con- models.291
848
787232 the
roughlyseparation the samefuzziness.
between Likewise,
the bottomFig. 8and shows top that
layers. the law trast, differential
existing equations
analyses often and, if
consider
Additionally, it would be valuable to examine the law in so, thehow it could
“exterior” of be
theseused to
regres-
849
and the target. In light of the law of equi-separation, a 292
788233 of equi-separation holds in densely connected convolutional models improve
by our
focusing understanding
on the loss of the
function
sion settings and multi-modal tasks (40). A crucial question depth
or test of these continuous-
performance, 850
sensible balance is to find when the improvement in reducing
293
networks by identifying The aequi-separation

block as a module. lawFor offerscomparison,
a new per-without considering the
C. Interpretation. depth be intermediate
models.toAdditionally, islayers
it would (41–43).
be valuable Informed
to examine
789234 851
separation fuzziness is roughly the same between the bottom that remains addressed understanding why the 294 law
790235 when
spective the for perspective
interpreting of data deepseparation
learningis predictions,
not considered (35), by the
especially the law
law of
in equi-separation,
regression settings thisand perspective
multi-modal may tasks
provide (40). A 852
and top layers. is most accurate in feedforward neural networks and becomes 295
791236 the correct module

in high-stakes decisioncannot be identified.
makings. An important ingredient informal insights
crucial question into that
the innerremains workings
to be of deep learning.
addressed is understanding 853
somewhat fuzzier in certain convolutional neural networks.
296
Because eachismodule
interpretation to pinpoint makes the equal but small
basic operationalcontributions,modules a in why the lawthat is most the accurate in feedforward neural networks
792237 854
C.few Interpretation.
neurons in a The
single equi-separation
layer are unlikely law
to offers
explain a
how newa per-
deep Next, given law is most evident in feedforward neural
793238 neural networks and subsequently to reflect that “all modulesData and becomes
Availability.
networks, an Our code fuzzier
somewhat
interesting isdirection
publicly isavailable
in certain to at new
GitHub
convolutional
explore neural
measures
855
spective for interpreting deep learning predictions, especially
297
learning
are created model makes
equal.” For a particular
feedforward prediction.
and It is therefore
convolutional neural(https://github.com/HornHehhf/Equi-Separation ).
794239
networks. Next, given that the
in lieu of the separation fuzziness defined in Eq. 1. The hope law is most evident in
298 856
incrucial
high-stakes
networks, to takeeach decision
all
layerlayersismakings.moduleAn
acollectively as important
for interpretation
it reduces ingredient
the separation(36). in
795240
feedforward neural networks,
is that this could potentially make the law clearer in otheran interesting direction is to 857
interpretation
This view, however,
fuzziness by an is to pinpoint
equal challenges the basic
multiplicative the widelyoperational
factorused (for more modules
layer-wise in
details,
796241
explore new
network
ACKNOWLEDGMENTS. measuresWe
architectures. in
Inarelieu
doing of so,
grateful theitto separation
would
X.Y. Han be fuzziness
beneficial
for 299
858
242 neural
approaches networks to deep and subsequently
learning to
interpretation
see Fig. S11 in SI Appendix). Interestingly, even when theveryto reflect that
(37, “all
38). modules
797
defined
helpful
take in Eq.
comments
into 1. and
account Thenetwork hope structures,
feedback isonthat thissuch
an early could
versionas potentially
of the 300
convolution
859
are
lawcreated
no longer equal.” For feedforward
manifests in residual and convolutional
neural networks, neural
Fig. 7manuscript. Welaw also clearer
thank Ben Zhou
798
make the
kernels in (convolutional otherfor
inneural reproducing
network
networks, during
our experi- 301
architectures.
the develop- In 860
networks,
4. Discussion each layer is a module
shows that it can be restored by identifying a block rather ment as it reduces the separation mental results https://github.com/Slash0BZ/data-separation/tree/main/
799243
doing so, it would be beneficial to take into account network
302 861
fuzziness by anas equal multiplicative factor (for of). new
Thisseparation measures.
supportedFurthermore,
in part by NSFwhile grantsthe 303 law
also more details, reproduction research was
800244 thantoa(14)
Due layer and follow-upa module. work, This it has figure
been known suggests
that neural thatCCF-1934876
structures,
of equi-separation
and such as convolution
CAREER has DMS-1847415,
been observed kernelsand ininconvolutional
ana Alfred
variety Sloanofneural
vision
862
see Fig. S11
blocks containing in SI Appendix).
more layers Interestingly,
are more capable even when
of at the
reducing
304
801245 networks can exhibit precise mathematical structures the networks,
Research
tasks, during
Fellowship.
it does notthe development
appear to hold for of new separation
language models measures.
such305as
863
law
the no longer manifests
separation fuzziness.inLikewise, residual Fig. neural networks,
8 shows that Fig.the law7
802
Furthermore, while the law
BERT (41) (see Fig. S9 in SI Appendix). This could be due of equi-separation has beento 864
shows
of and that it can beholds
equi-separation restored by identifying
in densely connected a block rather
convolutional
803
He Su observed
the fact thatin PNAS
a variety
language | March of 22,
models visionlearn
2023 tasks,
| token-level
vol. XXX it does
| no. not| appear
features
XX 7 rather
865
than
networks a layer as a module.
by identifying a block Thisasfigurea module. also For suggests that
comparison,
804
to hold
than for languagefeatures
sentence-level models such at each as BERT
layer. As (41)such,
(see it Fig.wouldS9 866
805 blocks containing more layers are more capable of reducing
when the perspective of data separation is not considered (36), †be important to study whether the law of equi-separation 867
806 the
theseparation
correct module fuzziness. cannot Likewise, Fig. 8 shows
be identified. that the law it
For completeness, See the architecture in https://github.com/kuangliu/pytorch-cifar/tree/master/models. 868
could be utilized to derive better sentence-level features from
should be noted that the law becomes less clear for deeper a sequence of token-level features.
Heresidual
and Su et neural
al. networks† , as shown in Fig. 10. PNAS | July 22, 2023 | vol. XXX | no. XX | 7
Perhaps the most pressing question is to reveal the un-
Because each module makes equal but small contributions, a
derlying mechanism of this data-separation law. However,
few neurons in a single layer are unlikely to explain how a deep
some commonly used assumptions in the literature of deep
learning model makes a particular prediction. It is therefore
learning theory may not be applicable. For instance, the non-
†
See the architecture in https://github.com/kuangliu/pytorch-cifar/tree/master/models. linearity of activation functions is crucial to this law because

the separation fuzziness is roughly unchanged by linear trans- 22. I Ben-Shaul, S Dekel, Nearest class-center simplification through intermediate layers in
formations. Additionally, analyses of neural networks using Topological, Algebraic and Geometric Learning Workshops 2022. (PMLR), pp. 37–47 (2022).
23. Y Bengio, Learning deep architectures for AI. Foundations trends Mach. Learn. 2, 1–127
the neural tangent kernel (42) are not able to capture the (2009).
effect of depth, which is essential for understanding the data 24. J Schmidhuber, Deep learning in neural networks: An overview. Neural Networks 61, 85–117
separation process across all layers. This suggests that more (2015).
25. S Hihi, Y Bengio, Hierarchical recurrent neural networks for long-term dependencies in
innovative approaches may be necessary to fully understand Advances in Neural Information Processing Systems, eds. D Touretzky, M Mozer, M Hasselmo.
the mechanisms behind this law. (MIT Press), Vol. 8, (1995).
26. X Glorot, Y Bengio, Understanding the difficulty of training deep feedforward neural networks in
Broadly speaking, a perspective informed by this law is Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
to focus on the non-temporal dynamics of the data separa- (PMLR), Vol. 9, pp. 249–256 (2010).
tion process throughout all layers when studying deep neural 27. D Yarotsky, Error bounds for approximations with deep ReLU networks. Neural Networks 94,
103–114 (2017).
networks. This perspective probes into the interior of deep 28. Y LeCun, The MNIST database of handwritten digits. http://yann. lecun. com/exdb/mnist/
learning models by perceiving the black-box classifier as a (1998).
sequence of transformations, each of which corresponds to a 29. A Krizhevsky, Master’s thesis (University of Toronto) (2009).
30. K He, J Sun, Convolutional neural networks at constrained time cost in 2015 IEEE Conference
layer or a block in the case of residual neural networks. In con- on Computer Vision and Pattern Recognition. pp. 5353–5360 (2015).
trast, existing analyses often consider the “exterior” of these 31. RK Srivastava, K Greff, J Schmidhuber, Training very deep networks in Advances in Neural
Information Processing Systems. Vol. 28, (2015).
models by focusing on the loss function or test performance, 32. AG Howard, et al., Mobilenets: Efficient convolutional neural networks for mobile vision
without considering the intermediate layers (42–44). Informed applications. arXiv preprint arXiv:1704.04861 (2017).
by the law of equi-separation, this perspective may provide 33. M Tan, Q Le, EfficientNet: Rethinking model scaling for convolutional neural networks in
Proceedings of the 36th International Conference on Machine Learning. (PMLR), Vol. 97, pp.
formal insights into the inner workings of deep learning. 6105–6114 (2019).
34. S Bubeck, M Sellke, A universal law of robustness via isoperimetry. Adv. Neural Inf. Process.
Syst. 34, 28811–28822 (2021).
Data Availability. Our code is publicly available at GitHub 35. S Ioffe, C Szegedy, Batch normalization: Accelerating deep network training by reducing
(https://github.com/HornHehhf/Equi-Separation). internal covariate shift in Proceedings of the 32nd International Conference on Machine
Learning. (PMLR), Vol. 37, pp. 448–456 (2015).
36. C Zhang, S Bengio, Y Singer, Are all layers created equal? J. Mach. Learn. Res. 23, 1–28
ACKNOWLEDGMENTS. We are grateful to X.Y. Han for (2022).
37. WJ Su, Neurashed: A phenomenological model for imitating deep learning training. arXiv
very helpful comments and feedback on an early version of preprint arXiv:2112.09741 (2021).
the manuscript. We also thank Hangyu Lin (https://github.com/ 38. MD Zeiler, R Fergus, Visualizing and understanding convolutional networks in European
avalonstrel/NeuralCollapse/tree/mlp), Cheng Shi (https://github.com/ Conference on Computer Vision. pp. 818–833 (2014).
DaDaCheng/Re-equi-sepa), and Ben Zhou (https://github.com/Slash0BZ/ 39. I Tenney, D Das, E Pavlick, Bert rediscovers the classical nlp pipeline in Proceedings of the
data-separation/tree/main/reproduction) for reproducing our experimen- 57th Annual Meeting of the Association for Computational Linguistics. pp. 4593–4601 (2019).
tal results. This research was supported in part by NSF grants 40. VW Liang, Y Zhang, Y Kwon, S Yeung, JY Zou, Mind the gap: Understanding the modality
CCF-1934876 and CAREER DMS-1847415, and an Alfred Sloan gap in multi-modal contrastive representation learning. Adv. Neural Inf. Process. Syst. 35,
Research Fellowship. 17612–17625 (2022).
41. J Devlin, MW Chang, K Lee, K Toutanova, BERT: Pre-training of deep bidirectional transform-
ers for language understanding in Proceedings of the 2019 Conference of the North American
1. DP Kingma, J Ba, Adam: A method for stochastic optimization in International Conference on
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Learning Representations. (2015).
Volume 1 (Long and Short Papers). pp. 4171–4186 (2019).
2. A Krizhevsky, I Sutskever, GE Hinton, ImageNet classification with deep convolutional neural
42. A Jacot, F Gabriel, C Hongler, Neural tangent kernel: Convergence and generalization in
networks in Advances in Neural Information Processing Systems. Vol. 25, (2012).
neural networks. Adv. Neural Inf. Process. Syst. 31 (2018).
3. Y LeCun, Y Bengio, G Hinton, Deep learning. Nature 521, 436–444 (2015).
43. H He, WJ Su, The local elasticity of neural networks in International Conference on Learning
4. D Silver, et al., Mastering the game of go with deep neural networks and tree search. Nature
Representations. (2020).
529, 484–489 (2016).
44. M Belkin, D Hsu, S Ma, S Mandal, Reconciling modern machine-learning practice and the
5. A Fawzi, et al., Discovering faster matrix multiplication algorithms with reinforcement learning.
classical bias–variance trade-off. Proc. Natl. Acad. Sci. 116, 15849–15854 (2019).
Nature 610, 47–53 (2022).
45. T Wolf, et al., Transformers: State-of-the-art natural language processing in Proceedings
6. T Hastie, R Tibshirani, JH Friedman, The elements of statistical learning: data mining,
of the 2020 Conference on Empirical Methods in Natural Language Processing: System
inference, and prediction. (Springer) Vol. 2, (2009).
Demonstrations. pp. 38–45 (2020).
7. M Hutson, Has artificial intelligence become alchemy? Science 360, 478–478 (2018).
46. R Socher, et al., Recursive deep models for semantic compositionality over a sentiment
8. PL Bartlett, PM Long, G Lugosi, A Tsigler, Benign overfitting in linear regression. Proc. Natl.
treebank in Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Acad. Sci. 117, 30063–30070 (2020).
Processing. pp. 1631–1642 (2013).
9. Y Lu, A Zhong, Q Li, B Dong, Beyond finite layer neural networks: Bridging deep architectures
and numerical differential equations in International Conference on Machine Learning. (PMLR),
pp. 3276–3285 (2018).
10. C Fang, H He, Q Long, WJ Su, Exploring deep neural networks via layer-peeled model:
Minority collapse in imbalanced training. Proc. Natl. Acad. Sci. 118, e2103091118 (2021).
11. K He, X Zhang, S Ren, J Sun, Deep residual learning for image recognition in 2016 IEEE
Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016).
12. H Xiao, K Rasul, R Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine
learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
13. JP Stevens, Applied multivariate statistics for the social sciences. (Routledge), (2012).
14. V Papyan, X Han, DL Donoho, Prevalence of neural collapse during the terminal phase of
deep learning training. Proc. Natl. Acad. Sci. 117, 24652–24663 (2020).
15. X Han, V Papyan, DL Donoho, Neural collapse under mse loss: Proximity to and dynamics on
the central path in International Conference on Learning Representations. (2022).
16. K Simonyan, A Zisserman, Very deep convolutional networks for large-scale image recognition
in International Conference on Learning Representations. (2015).
17. G Huang, Z Liu, L Van Der Maaten, KQ Weinberger, Densely connected convolutional networks
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp.
4700–4708 (2017).
18. G Alain, Y Bengio, Understanding intermediate layers using linear classifier probes in Interna-
tional Conference on Learning Representations. (2017).
19. V Papyan, Traces of class/cross-class structure pervade deep learning spectra. The J. Mach.
Learn. Res. 21, 10197–10260 (2020).
20. T Galanti, A György, M Hutter, On the role of neural collapse in transfer learning in International
Conference on Learning Representations. (2022).
21. T Galanti, On the implicit bias towards minimal depth of deep neural networks. arXiv preprint
arXiv:2202.09028 (2022).
8 | He and Su
Supporting Information Imbalanced data. As shown in Fig. 3, we experiment with imbal-
anced data. Specifically, we consider 5 majority classes with 500
In this appendix, we first detail the general setup for all experiments, examples in each class, and 5 minority classes with 100 examples
then briefly highlight some distinctive settings for figures in the in each class. The law of equi-separation also emerges in neural
main text, after that mention the argument for robustness, and networks trained with imbalanced Fashion-MNIST data, though its
finally show some additional results. More details can be found in half-life is significantly larger than that in the balanced case. This
our code.‡ might be caused by the collapse of minority classifiers as shown in
(10).
General Setup. In this subsection, we detail the general setup that
is used for the experiments throughout the paper. Convolutional neural networks (CNNs). As shown in Fig. 4, we
experiment with two canonical CNNs, AlexNet and VGG (16). We
Network architecture. In our experiments, we mainly consider use the PyTorch implementation of both models.§ We experiment
feedforward neural networks (FNNs). Unless otherwise stated, the with both Fashion-MNIST and CIFAR-10, and the original images
corresponding hidden size is 100. For simplicity, we use neural are resized to 32 × 32 pixels. Given that AlexNet and VGG are
networks with 4, 8, 20 layers as representatives for shallow, mid, designed for ImageNet images with 224 × 224 pixels, we make
and deep neural networks. By default, batch normalization (35) is some small modifications to handle images with 32 × 32 pixels. As
inserted after fully connected layers and before the ReLU nonlinear for AlexNet, we change the original convolution filters to 5 × 5
activation functions. convolution filters with padding 2. The color channel number is
Dataset. Our experiments are mainly conducted on Fashion- multiplied by 32 at the beginning. As for VGG, we change the
MNIST. Unless otherwise stated, we resize the original images original convolution filters to 5 × 5 convolution filters with padding
to 10 × 10 pixels. By default, we randomly sample a total of 1000 2, and the last average pooling layer before fully connected layers
training examples evenly distributed over the 10 classes. is replaced by an average pooling layer over a 1 × 1 pixel window.
The color channel number is multiplied by 16 at the beginning.
Optimization methodology. In our experiments, three popular Moreover, we examined VGG employing four distinct config-
optimization methods are considered: SGD, SGD with momentum, urations¶ on the Fashion-MNIST and CIFAR-10 datasets. As
and Adam. The weight decay is set to 5 × 10−4 , and the momentum demonstrated in Fig. S2, an approximate equi-separation law is also
term is set to 0.9 for SGD with momentum. The neural networks observed across different VGG configurations. It is important to
are trained for 600 epochs with a batch size of 128. The initial note that the equi-separation law can be further refined by modify-
learning rate is annealed by a factor of 10 at 1/3 and 2/3 of the ing convolution filter size and input channel number, as illustrated
600 epochs. As for SGD and SGD with momentum, we pick the in Fig. 4.
model resulting in the best equi-separation law in the last epoch To gain further insights into CNNs, we explored a variant of
among the seven learning rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, VGG11 (referred to as pure CNNs) by retaining all convolutional
and 1.0. Similarity, we pick the model resulting in the best equi- layers, the final fully connected layer, and removing all pooling
separation law in the last epoch for Adam among the five learning layers along with the initial two fully connected layers. Additionally,
rates: 3 × 10−5 , 1 × 10−4 , 3 × 10−4 , 1 × 10−3 , 3 × 10−3 . Unless we ensured an equal number of channels in each layer for our
otherwise stated, Adam is adopted in the experiments. analysis. In this simplified network, convolutional layers serve to
extract features, which are subsequently input into the final fully
Detailed Experimental Settings. In this subsection, we show detailed connected layer for classification. As presented in Fig. S3, the
experimental settings for the figures in the main text. equi-separation law is present in pure CNNs for various datasets
with an appropriate number of channels; however, the separation
Optimization. As shown in Fig. 1, for FNNs with different depths, fuzziness of the last-layer features is considerably higher than that
three different optimization methods are considered: SGD, SGD of FNNs and canonical CNNs (e.g., AlexNet and VGG). We further
with momentum, and Adam. investigated the influence of other factors in CNNs, such as depth,
Visualization. As shown in Fig. 1, we visualize the features of kernel size, pooling layer, and image size. Our findings indicate that
different layers in the 8-layer FNNs trained on Fashion-MNIST. In the equi-separation law is indeed present in CNNs but necessitates
particular, principal component analysis (PCA) is used to visualize a delicate balance among these factors. This suggests that an
the features in a two-dimensional plane. alternative measure, rather than separation fuzziness (D in Eq. 1),
may be required to describe the information flow in CNNs, given
Training epoch. As shown in Fig. 2, we show the separability of the unique structure of features following convolutional layers.
features among layers in different training epochs. In this part, we
simply consider the 20-layer FNNs trained on Fashion-MNIST. Depth. As shown in Fig. 5, besides Fashion-MNIST and
The equi-separation law doesn’t emerge at the beginning, and CIFAR-10, we also consider MNIST to illustrate the optimal depth
gradually emerges during training. After that the decay ratio is for different datasets. Similar to Fashion-MNIST, we resize the
decreased along the training process until the network converges. original MNIST images to 10 × 10 pixels, and randomly sample a
Furthermore, we conducted an analysis of the last-layer features total of 1000 training examples evenly distributed over the 10 classes.
in the aforementioned experiments. As illustrated in Fig. S1, it
is evident that the separation fuzziness of the last-layer features As shown in Fig. 6, we consider FNNs with different widths and
does not reach convergence until after 400 epochs. Conversely, shapes.
the equi-separation law manifests at epoch 100, as demonstrated
Width. As shown in Fig. 6, we consider FNNs with three different
in Fig. 2, with a Pearson correlation coefficient of −0.996. This
widths. All layers have the same width in each setting. For simplicity,
observation indicates that the equi-separation law emerges prior to
we consider the 8-layer FNNs trained on Fashion-MNIST.
the occurrence of neural collapse.
Shape. As shown in Fig. 6, we consider three different shapes of
As shown in Fig. 3, we further consider the impact of the following FNNs: Narrow-Wide, Wide-Narrow, and Mixed. As for Narrow-
factors on the equi-separation law: dataset, learning rate, and class Wide, the FNNs have narrow hidden layers at the beginning, and
distribution. then have wide hidden layers in the later layers. Similarity, we use
Dataset. As shown in Fig. 3, we experiment with CIFAR-10. We Wide-Narrow to denote FNNs that start with wide hidden layers
use the second channel of the images and resize the original images followed by narrow hidden layers. In the case of Mixed, the FNNs
to 10 × 10. Similar to Fashion-MNIST, we randomly sample a total have more complicated patterns of the widths of hidden layers. For
of 1000 training examples evenly distributed over the 10 classes. simplicity, we only consider 8-layer FNNs trained on Fashion-MNIST.
The corresponding widths of hidden layers for Narrow-Wide, Wide-
Learning rate. As shown in Fig. 3, we compare the 9-layer FNNs Narrow, and Mixed we considered in our experiments are as follows:
trained using SGD with different leaning rates on Fashion-MNIST.
§
More details are in https://github.com/pytorch/vision/tree/main/torchvision/models.
‡ ¶
Our code is publicly available at GitHub (https://github.com/HornHehhf/Equi-Separation). More details can be found at: https://github.com/kuangliu/pytorch-cifar/tree/master/models.

(100, 100, 100, 100, 1000, 1000, 1000), (1000, 1000, 1000, 1000, 100, BERT. As shown in Fig. S9, we experiment with BERT features.
100, 100), and (100, 500, 500, 2500, 2500, 500, 500). We use the pretrained case-sensitive BERT-base PyTorch implemen-
tation (45), and the common hyperparameters for BERT. Specifi-
Residual neural networks (ResNets). As shown in Fig. 7, we con- cally, we fine-tuned the pretrained BERT model on a binary senti-
sider three different types of ResNets: ResNets with 2-layer building ment classification task (SST-2) (46), where we randomly sampled
blocks, ResNets with 3-layer bottleneck blocks, and ResNets with 500 sentences for each class. Given a sequence of token-level BERT
mixed blocks. For 3-layer blocks, the expansion rate is set to 1 features, two most popular approaches are used to get the sentence-
instead of 4, and the original 1 × 1 and 3 × 3 convolution filters are level features at each layer: 1) using the features of the first token∗∗
replaced by 5 × 5 convolution filters. For all types of ResNets, the (i.e., the [CLS] token); 2) averaging the features among all tokens
channel number is set to 8, and the color channel number is also in the sentence.
multiplied by 8 at the beginning. In this part, the Fashion-MNIST
and CIFAR-10 images are resized to 32 × 32 pixels instead of 10 × 10 Batch normalization. As shown in Fig. S10, it is difficult to opti-
pixels. mize a neural network without batch normalization, let alone achieve
the law of equi-separation. Even when the network is well-optimized,
Dense Convolutional Network (DenseNet). As shown in Fig. 8, the law is often not clear.
we investigate the law when applied to DenseNet (17), namely
DenseNet161.‖ In this part, the images of Fashion-MNIST and As shown in Fig. S11, we consider the impact of the
Pretraining.
CIFAR-10 are resized to 32 × 32 pixels, i.e., (3, 32, 32) for CIFAR-10 pretraining on the equi-separation law. Specifically, we first pre-
and (32, 32) for Fashion-MNIST. trained the 20-layer FNNs on FashionMNNIST as shown in Fig. S11
(Depth=20). After that we fix the first 10 layers of the pretrained
Out-of-sample performance. As shown in Fig. 9, given 20-layer model and train additional 5 layers as in Fig. S11 (Pretraining),
FNNs, instead of the standard training, we also consider the fol- which is quite different from training 15-layer FNNs from scratch as
lowing two-stage frozen training procedure: 1) freeze the last 10 in Fig. S11 (Depth=15). At the same time, the equi-separation law
layers and train the first 10 layers; 2) freeze the first 10 layers and also emerges in the additional 5 layers in the pretraining setting.
train the last 10 layers. When we conduct the frozen and standard
training on CIFAR-10, we find that the our-of-sample performance
of standard training (test accuracy: 23.85%) is better than that of
frozen training (test accuracy: 19.67%), even though the two train- 1.8
ing procedures have similar training losses (standard: 0.0021; frozen:

0.0019) and training accuracies (standard: 100%; frozen: 100%). 1.6
This is consistent with the equi-separation law: the neural networks
trained using standard training have much clearer equi-separation
law compared to the neural networks trained using frozen training. 1.4
Specifically, the corresponding Pearson correlation coefficients for

standard and frozen training are −0.997 and −0.971, respectively. 1.2
Additional Results. In this subsection, we provide some additional 1.0

results for the equi-separation law.
Image size. As depicted in Fig. S4, we examine FNNs on images 0.8
with a resolution of 32 × 32 pixels, specifically (3, 32, 32) for CIFAR-
10 and (32, 32) for Fashion-MNIST. It is important to note that 100 200 300 400 500 600
Fashion-MNIST, despite its original size of (28, 28), is frequently
resized to (32, 32) in the same manner as MNIST. In this part, Epoch
we set the width as 1000 instead of 100 for FNNs, given that the
images with 32 × 32 pixel resolution possess a higher dimension in Fig. S1. Separation fuzziness of last-layer features throughout the training process.
comparison to those with 10 × 10 pixel resolution. The x axis represents the epoch number, while the y axis indicates the separation
Sample size. As shown in Fig. S5, we experiment with different fuzziness (D in Eq. 1). The x axis commences from approximately 100 epochs
number of randomly sampled training examples. For each setting, due to the exceedingly high initial separation fuzziness, which would otherwise mask
we have the same number of examples for each of the 10 classes. For discernible trends.
simplicity, we only consider the 8-layer FNNs trained on Fashion-
MNIST. The law of equi-separation emerges in neural networks
trained with different number of training examples, though the half-
life can be larger for larger datasets due to the increased optimization
difficulty.
Class-wise data separation. As shown in Fig. S6, we further con-
sider separation fuzziness for each class, i.e., SSkw SS+
b , where SSw
k
indicates the within-class covariance for Class k. For simplicity, we

only consider the 8-layer FNNs trained on Fashion-MNIST. The
equi-separation law emerges in each class. Even though the equi-
separation law can be a little noisy for some classes, the noise is
reduced when we consider the separation fuzziness for all classes.
Training dynamics. As shown in Fig. S7, we consider the conver-
gence rates of different layers (without the last-layer classifier) with
Dl+1
respect to the relative improvement D of separability of features.
l
For simplicity, we consider the 8-layer FNNs trained on CIFAR-10
in this part. The relative improvement of bottom layers converges
earlier compared to those of top layers.
Test data. As shown in Fig. S8, we further show how features
separate across layers in the test data. For simplicity, we only
consider the 8-layer FNNs trained on Fashion-MNIST here. A fuzzy
version of the equi-separation law also exists in the test data.
∗∗
The input layer is not considered here since the inputs of the first token do not take other tokens into
‖
Further details can be found at https://github.com/kuangliu/pytorch-cifar/tree/master/models. account.
10 | He and Su
102 102 102 102
Fashion-MNIST
101 101 101 101
100 100 100 100
10−1 10−1 10−1 10−1

0 5 10 0 5 10 0 5 10 15 0 5 10 15
VGG11 VGG13 VGG16 VGG19

103 103 103 103
102 102 102 102

CIFAR-10
101 101 101 101
100 100 100 100
10−1 10−1 10−1 10−1

0 5 10 0 5 10 0 5 10 15 0 5 10 15
VGG11 VGG13 VGG16 VGG19
Fig. S2. Emergence of a fuzzy equi-separation law in VGG networks with varied depths on Fashion-MNIST and CIFAR-10.
102 102 102 102

Fashion-MNIST
101 101 101 101
100 100 100 100

0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8
16 Channels 32 Channels 64 Channels 128 Channels

103 103 103 103
CIFAR-10
102 102 102 102
101 101 101 101

0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8
16 Channels 32 Channels 64 Channels 128 Channels
Fig. S3. Equi-separation law emerges in pure CNNs with appropriate numbers of channels on Fashion-MNIST (64 channels) and CIFAR-10 (16 channels).

102 102 102
101 101 101

Fashion-MNIST
100 100 100
10−1 10−1 10−1

0 1 2 3 0 2 4 6 0 5 10 15
103 103 103
102 102 102
101 101 101

CIFAR-10
100 100 100
10−1 10−1 10−1

0 1 2 3 0 2 4 6 0 5 10 15
Fig. S4. Equi-separation law emerges in FNNs across varied depths on 32 × 32 pixel images of Fashion-MNIST and CIFAR-10.
102 102 102
101 101 101
100 100 100
10−1 10−1 10−1

0 2 4 6 0 2 4 6 0 2 4 6
Sample size=1000 Sample size=2000 Sample size=5000
Fig. S5. The impact of the sample size on the law of equi-separation.
101 101 101 101 101
100 100 100 100 100
10−1 10−1 10−1 10−1 10−1
10−2 10−2 10−2 10−2 10−2

0 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6
Class 1 Class 2 Class 3 Class 4 Class 5

101 101 101 101 101
100 100 100 100 100
10−1 10−1 10−1 10−1 10−1
10−2 10−2 10−2 10−2 10−2

0 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6
Class 6 Class 7 Class 8 Class 9 Class 10
Fig. S6. The law of equi-separation emerges for each class.
12 | He and Su
1.0 1.0 1.0 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

0 200 400 600 0 200 400 600 0 200 400 600 0 200 400 600
Layer=0 Layer=1 Layer=2 Layer=3

1.0 1.0 1.0
0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0

0 200 400 600 0 200 400 600 0 200 400 600
Layer=4 Layer=5 Layer=6
Dl+1
Fig. S7. Convergence rates of different layers (without the last-layer classifier) with respect to the relative improvement ( Dl ) of separability of features. The x axis represents
Dl+1
the training epoch, while the y-axis indicates the relative improvement ( D ).
l
102 102 102
101 101 101
100 100 100

0 1 2 3 0 2 4 6 0 5 10 15
Fig. S8. A fuzzy version of the equi-separation law also emerges in the test data.
102 102
101
100 101
10−1
10−2 100
10−3
10−4 10−1
0 5 10 0 5 10
BERT-CLS BERT-AVG
Fig. S9. The equi-separation law does not exist in BERT features. Given a sequence of token-level BERT features, two most popular approaches are used to get the
sentence-level features at each layer: 1) using the features of the first token (i.e., the [CLS] token); 2) averaging the features among all tokens in the sentence (denoted as
BERT-AVG).

102 102 106
105
104
101 101
103
102
100 100 101

0 1 2 3 0 2 4 6 0 5 10 15
Fig. S10. The equi-separation law is not clear on Fashion-MNIST for FNNs trained without batch normalization.
102 102 102
101 101 101
100 100 100
10−1 10−1 10−1

0 5 10 15 0 5 10 0 5 10
Depth=20 Depth=15 Pretraining
Fig. S11. The impact of pretraining on the equi-separation law. We first pretrained the 20-layer FNNs as shown in Fig. S11 (Depth=20). After that we fix the first 10 layers of
the pretrained model and train additional 5 layers as in Fig. S11 (Pretraining), which is quite different from training the 15-layer neural networks from scratch as in Fig. S11
(Depth=15).
14 | He and Su

A Law of Data Separation in Deep Learning.17020

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Law of Data Separation in Deep Learning.17020

Uploaded by

Copyright:

Available Formats

A Law of Data Separation in Deep Learning

Hangfeng Hea,b and Weijie J. Suc,1

This manuscript was compiled on August 14, 2023

works (11). Any knowledge about data separation, especially

PNAS | August 14, 2023 | vol. XXX | no. XX | 1–14

Depth=4 Depth=8 Depth=20

Depth=4 Depth=8 Depth=20

Depth=4 Depth=8 Depth=20

Epoch=30 Epoch=50 Epoch=100

Epoch=200 Epoch=300 Epoch=600

63 training data. Hence, we call this the law of

65 tion of the data separation process in the intermediate

where we continue to train the model to interpolate in-sample isments in Fig.

of training, the bottom layers tend to learn faster at reduc- at each

He and Su PNAS | August 14, 2023 | vol. XXX | no. XX | 3

Depth=4 Depth=8 Depth=20

Depth=4 Depth=8 Depth=20

Learning rate=0.01 Learning rate=0.03 Learning rate=0.1

Learning rate=0.01 Learning rate=0.03 Learning rate=0.1

Depth=4 Depth=8 Depth=20

learning (3, 23–25). This is because, as shown by Eq. 2, all AlexNet-CIFAR-10

142 Moreover, a very large depth would render the optimization 3 43 4 1 2

153 of the applications.Narrow-Wide For instance, this perspective of data Wide-Narrow

161 the network

162 of each layer is merely 20. This is

163 to be wide enough to pass the useful

164 classification. However, when the network

166 are added to each layer (32). In addition,

167 Fig. 6 shows that it is better to start with

168 at the therefore,

171 optimization. Apart from encoding purposes,

172 be set to about

B. Training. The emergence of the law of equi-separation dur- Training.

part of a network that will be trained onasa 213

He and Su PNAS | March 22, 2023 | vol. XXX | no. XX | 5

Layerwise Blockwise Layerwise Blockwise

Layerwise Blockwise Layerwise Blockwise

Layerwise Blockwise Layerwise Blockwise

let6 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX He and Su last layer in

find when the improvement fuzziness is learning 3. Discussion

networks by identifying The aequi-separation

791236 the correct module

He and Su PNAS | August 14, 2023 | vol. XXX | no. XX | 7

He and Su PNAS | August 14, 2023 | vol. XXX | no. XX | 9

ing procedures have similar training losses (standard: 0.0021; frozen:

Specifically, the corresponding Pearson correlation coefficients for

Additional Results. In this subsection, we provide some additional 1.0

indicates the within-class covariance for Class k. For simplicity, we

101 101 101 101

100 100 100 100

10−1 10−1 10−1 10−1

VGG11 VGG13 VGG16 VGG19

102 102 102 102

101 101 101 101

100 100 100 100

10−1 10−1 10−1 10−1

VGG11 VGG13 VGG16 VGG19

102 102 102 102

101 101 101 101

100 100 100 100

16 Channels 32 Channels 64 Channels 128 Channels

102 102 102 102

101 101 101 101