A Law of Data Separation in Deep Learning
arXiv:2210.17020v2 [cs.LG] 11 Aug 2023

While deep learning has enabled significant advances in many areas of science, its black-box nature hinders architecture design for future artificial intelligence applications and interpretation for high-stakes decision making. We addressed this issue by studying the fundamental question of how deep neural networks process data in the intermediate layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in a collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions.

deep learning | data separation | constant geometric rate | intermediate layers

Deep learning methodologies have achieved remarkable success across a wide range of data-intensive tasks in image recognition, biological research, and scientific computing (2–5). In contrast to other machine learning techniques (6), however, the practice of deep learning relies heavily on a plethora of heuristics and tricks that are not well justified. This situation often makes deep learning-based approaches ungrounded for some applications or necessitates huge computational resources for exhaustive search, making it difficult to fully realize the potential of this set of methodologies (7).

This unfortunate situation is in part owing to a lack of understanding of how the prediction depends on the intermediate layers of deep neural networks (8–10). In particular, little is known about how the data of different classes (e.g., images of cats and dogs) in classification problems are gradually separated from the bottom layers to the top layers in modern architectures such as AlexNet (2) and residual neural networks.

Let x_{ki} denote the i-th of the n_k data points in class k for 1 ≤ k ≤ K, let x̄_k denote the mean of class k, and x̄ denote the mean of all n := n_1 + · · · + n_K data points. We define the between-class sum of squares and the within-class sum of squares as

SS_b := (1/n) ∑_{k=1}^{K} n_k (x̄_k − x̄)(x̄_k − x̄)^⊤

SS_w := (1/n) ∑_{k=1}^{K} ∑_{i=1}^{n_k} (x_{ki} − x̄_k)(x_{ki} − x̄_k)^⊤,

respectively. The former matrix represents the between-class "signal" for classification, whereas the latter denotes the within-class variability. Writing SS_b^+ for the Moore–Penrose inverse of SS_b (the matrix SS_b has rank at most K − 1 and is not invertible in general since the dimension of the data is typically larger than the number of classes), the ratio matrix SS_w SS_b^+ can be thought of as the inverse signal-to-noise ratio. We use its trace

D := Tr(SS_w SS_b^+)   [1]

to measure how well the data are separated (13, 14). This value, which is referred to as the separation fuzziness, is large when the data points are not concentrated around their class means or, equivalently, are not well separated, and vice versa.

1. Main Results

Given an L-layer feedforward neural network, let D_l denote the separation fuzziness (Eq. 1) of the training data after passing through the first l layers, for 0 ≤ l ≤ L − 1.* Fig. 1 suggests that the dynamics of D_l follows the relation

D_l ≈ ρ^l D_0   [2]

*For clarification, D_0 is calculated from the raw data, and D_1 is calculated from the data that have passed through the first layer but not the second layer.
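The separation fuzziness of Eq. 1 can be computed directly from its definition. A minimal sketch in plain NumPy (function and variable names are our own, not from the paper's released code):

```python
import numpy as np

def separation_fuzziness(X, y):
    """Separation fuzziness D = Tr(SS_w SS_b^+) of Eq. 1.

    X: (n, p) array of features; y: (n,) array of class labels.
    """
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    ssb = np.zeros((p, p))
    ssw = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        class_mean = Xk.mean(axis=0)
        d = (class_mean - grand_mean)[:, None]
        ssb += len(Xk) * (d @ d.T)        # between-class scatter
        centered = Xk - class_mean
        ssw += centered.T @ centered      # within-class scatter
    ssb /= n
    ssw /= n
    # SS_b has rank at most K - 1, hence the Moore-Penrose pseudoinverse
    return float(np.trace(ssw @ np.linalg.pinv(ssb)))
```

As a sanity check, D shrinks as the class means move apart, and rescaling all features by a constant c leaves D unchanged, since pinv(c²·SS_b) = SS_b^+/c².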
Fig. 1. Illustration of the law of equi-separation in feedforward neural networks with ReLU activations trained on the Fashion-MNIST dataset. The three rows correspond to three different training methods: stochastic gradient descent (SGD), SGD with momentum, and Adam (1). Throughout the paper, the x axis represents the layer index and the y axis represents the separation fuzziness defined in Eq. 1, unless otherwise specified. The Pearson correlation coefficients, by row first, are −1.000, −0.998, −0.997, −0.999, −0.998, −0.998, −1.000, −0.997, and −0.999. The last two rows show the intermediate data points on the plane of the first two principal components for the 8-layer network trained using Adam. Layer 0 corresponds to the raw input. More details can be found in SI Appendix.
2 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX | He and Su
Epoch=0 Epoch=10 Epoch=20

Fig. 2. A 20-layer feedforward neural network trained on Fashion-MNIST. The law of equi-separation starts to emerge at epoch 100 and becomes more clear as training proceeds. As in Fig. 1 and all the other figures in the main text, the x axis represents the layer index and the y axis represents the separation fuzziness.
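The layer-wise curves in Figs. 1 and 2 come from evaluating Eq. 1 on the intermediate representations after each layer. A minimal sketch of that bookkeeping, using a random untrained ReLU network as a stand-in for a trained one (so the printed D_l values illustrate only the mechanics, not the law itself):

```python
import numpy as np

def separation_fuzziness(X, y):
    # D = Tr(SS_w SS_b^+) from Eq. 1
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    ssb = np.zeros((p, p))
    ssw = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        d = (Xk.mean(axis=0) - grand_mean)[:, None]
        ssb += len(Xk) * (d @ d.T)
        ssw += (Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0))
    return float(np.trace((ssw / n) @ np.linalg.pinv(ssb / n)))

rng = np.random.default_rng(0)
# toy 3-class data in 10 dimensions
X = np.vstack([rng.normal(m, 1.0, size=(50, 10)) for m in (-2.0, 0.0, 2.0)])
y = np.repeat([0, 1, 2], 50)

# hypothetical untrained 4-layer ReLU network with random weights
weights = [rng.normal(scale=10 ** -0.5, size=(10, 10)) for _ in range(4)]

H = X
D = [separation_fuzziness(H, y)]     # D_0: raw input (layer 0)
for W in weights:
    H = np.maximum(H @ W, 0.0)       # one ReLU layer
    D.append(separation_fuzziness(H, y))

print([round(d, 3) for d in D])      # D_0, D_1, ..., D_4
```

For a trained network, plotting log D_l against l and checking that the points fall on a line (Pearson correlation close to −1) is exactly the diagnostic shown in Fig. 1.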
for some decay ratio 0 < ρ < 1. Alternatively, this law implies log D_{l+1} − log D_l = −log(1/ρ), showing that the neural network makes equal progress in reducing log D over each layer. This law is the first quantitative and geometric characterization of data separation in deep learning. Indeed, it is unexpected because the intermediate output of the neural network does not exhibit any quantitative patterns, as shown by the last two rows of Fig. 1.

The decay ratio ρ depends on the depth of the neural network, dataset, training time, and network architecture, and is also affected, to a lesser extent, by optimization methods and many other hyperparameters. For the 20-layer network trained using Adam (1) (the bottom-right plot of Fig. 1), the decay ratio ρ is 0.818. Thus, the half-life is ln 2 / ln ρ⁻¹ = 0.693/0.201 = 3.45, suggesting that this 20-layer neural network reduces the value of the separation fuzziness by half every three and a half layers.

This law manifests in the terminal phase of training (14), that is, when the network continues to be trained after fitting the training data. At initialization, the separation fuzziness may even increase from the bottom to the top layers. During the early stages of training, the bottom layers lag behind the top layers in reducing the separation fuzziness (see Fig. S7 in SI Appendix). However, as training progresses, the top layers eventually catch up as the bottom layers have learned the necessary features. Finally, each layer becomes roughly equally capable of reducing the separation fuzziness multiplicatively. This dynamics of data separation during training is illustrated in Fig. 2. Neural networks in the terminal phase of training also exhibit certain symmetric structures in the last layer, such as neural collapse (14, 15) and minority collapse. However, it is worthwhile mentioning that the law of equi-separation emerges earlier than neural collapse during training (see Fig. S1 in SI Appendix).

The pervasive law of equi-separation consistently prevails across diverse datasets, learning rates, and class imbalances, as illustrated in Fig. 3. Additionally, Fig. S6 in SI Appendix demonstrates its applicability in a finer, class-wise context. This law is further exemplified in contemporary network architectures for vision tasks, such as AlexNet and VGGNet (16), as shown in Fig. 4 (see Fig. S2 and Fig. S3 in SI Appendix for additional convolutional neural network experiments). Moreover, the law manifests in residual neural networks and densely connected convolutional networks (17) when separation fuzziness is assessed at the level of residual blocks and dense blocks, respectively. Intriguingly, this law appears slightly more pronounced in feedforward neural networks when compared with other network architectures.
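The decay ratio and half-life quoted above can be recovered from measured fuzziness values by least squares on the log scale. A sketch with synthetic D_l that follow Eq. 2 exactly, using ρ = 0.818 as reported for the 20-layer Adam-trained network (the initial value D_0 = 5 is a placeholder of our own):

```python
import numpy as np

rho_true = 0.818                      # decay ratio reported in the text
D0 = 5.0                              # hypothetical initial fuzziness
D = [D0 * rho_true ** l for l in range(20)]   # Eq. 2: D_l = rho^l D_0

layers = np.arange(len(D))
# fit log D_l = l * log(rho) + log(D_0) by least squares
slope, intercept = np.polyfit(layers, np.log(D), 1)
rho_hat = np.exp(slope)

# Pearson correlation between layer index and log-fuzziness
r = np.corrcoef(layers, np.log(D))[0, 1]

# half-life: layers needed to halve the fuzziness, ln 2 / ln(1/rho)
half_life = np.log(2.0) / np.log(1.0 / rho_hat)

print(rho_hat, r, half_life)          # ~0.818, ~-1.0, ~3.45
```

On real measurements the fit is only approximate, and the correlation coefficient quantifies how well the equi-separation law holds, as in the captions of Fig. 1.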
He and Su PNAS | March 22, 2023 | vol. XXX | no. XX | 3
The separation dynamics of neural networks have been extensively investigated in prior research studies (18–22). For instance, (18) employed linear classifiers as probes to assess the separability of intermediate outputs. In (19), the author scrutinized the separation capabilities of neural networks through empirical spectral analysis. More recently, (20–22) explored the separation ability of neural networks by examining neural collapse at intermediate layers and its relationship with generalization. In particular, (22) provided crucial experimental evidence illustrating the progression of neural collapse within the interior of neural networks, which is perhaps the most relevant work to our paper. Nevertheless, this work has not yet derived a quantitative law, such as Eq. 2, to govern the data separation process.

Fig. 4 panels: AlexNet-Fashion-MNIST, VGG13-Fashion-MNIST.

2. Insights from the Law

Next, we demonstrate the benefits of the law of equi-separation and the perspective of data separation in regard to the three fundamental aspects of deep learning practice, namely, architecture design, training, and interpretation.

A. Network architecture. The law of equi-separation can inform the design principles for network architecture. First, the law implies that neural networks should be deep for good performance, thereby reflecting the hierarchical nature of deep learning.
Fig. 5. Illustration of the law of equi-separation with varying network depths. The legend shows how depths are color coded. The depth that yields the lowest separation fuzziness depends on the dataset complexity.
By Eq. 2, the network reduces the separation fuzziness to ρ^{L−1} D_0 at the last layer. When L is small—say, L = 2 or 3—the ratio ρ^{L−1} would generally not be small and, therefore, the neural network is unlikely to separate the data well. In the literature, the fundamental role of depth is recognized by analyzing the loss functions (10, 16, 23, 26, 27), and our law of equi-separation offers a new perspective on network depth. For completeness, the deeper the better is not necessarily true because, as the depth L increases, ρ might become larger. Unnecessary extra layers, therefore, would only give rise to optimization burdens. Indeed, Fig. 5 illustrates the law of equi-separation with varying network depths, showing that different datasets correspond to different optimal depths. Therefore, as a practical guideline, the choice of depth should consider the complexity of the dataset. The depths that yield the lowest separation fuzziness for MNIST (28), Fashion-MNIST, and CIFAR-10 (29) are 6, 10, and 12, respectively. This is consistent with the increasing complexity from MNIST to Fashion-MNIST to CIFAR-10 (10, 30, 31).

Moreover, the law of equi-separation indicates that depth bears a more significant connection to training performance than the "shape" of neural networks. In the first row of Fig. 6, the network does not separate the data well when the width of each layer is merely 20. Network performance may be saturated if more neurons are added to each layer (32). These observations suggest that very wide neural networks in general should not be recommended, as they consume more time for optimization; apart from encoding purposes, wide layers appear to serve functions other than data separation (34).

Fig. 6. Illustration of the law of equi-separation with different widths and shapes of the neural networks. For example, "Narrow-Wide" means that the bottom layers are narrower than the top layers.

The law also sheds light on the robustness of the data separation ability to perturbations in network weights when the law manifests. To see this point, let

R := (D_{L−1}/D_{L−2}) × (D_{L−2}/D_{L−3}) × · · · × (D_1/D_0) = D_{L−1}/D_0

denote the reduction ratio of the separation fuzziness made by the entire network. Assuming that a perturbation of the network weights leads to a change of ε in the ratio for each layer, the perturbed reduction ratio becomes, to first order in ε,

(D_{L−1}/D_{L−2} + ε) · · · (D_1/D_0 + ε) ≈ R + R (D_{L−2}/D_{L−1} + D_{L−3}/D_{L−2} + · · · + D_0/D_1) ε.

Applying the inequality of arithmetic and geometric means, the perturbation term R (D_{L−2}/D_{L−1} + · · · + D_0/D_1) ε is minimized in absolute value when

D_{L−1}/D_{L−2} = D_{L−2}/D_{L−3} = · · · = D_1/D_0.

This condition is precisely the case when the equi-separation law holds. Thus, the data separation ability of the classifier is not very sensitive to perturbations in network weights when the law manifests.
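The AM-GM step above can be checked numerically: among fuzziness profiles sharing the same endpoints D_0 and D_{L−1}, the geometric (equi-separation) profile minimizes the first-order perturbation coefficient. A sketch with hypothetical depth and reduction values of our own choosing:

```python
import numpy as np

def perturbation_coeff(D):
    # first-order coefficient R * sum_l D_{l-1}/D_l from the argument above
    D = np.asarray(D, dtype=float)
    return (D[-1] / D[0]) * np.sum(D[:-1] / D[1:])

L = 8                                  # hypothetical depth
R = 1e-3                               # overall reduction D_{L-1} / D_0

# equi-separation profile: D_l = rho^l with rho = R^(1/(L-1))
rho = R ** (1.0 / (L - 1))
equi = rho ** np.arange(L)
c_equi = perturbation_coeff(equi)      # equals (L-1) * R^((L-2)/(L-1))

# random decreasing profiles sharing D_0 = 1 and D_{L-1} = R
rng = np.random.default_rng(1)
for _ in range(200):
    interior = np.sort(rng.uniform(np.log(R), 0.0, size=L - 2))[::-1]
    D = np.exp(np.concatenate(([0.0], interior, [np.log(R)])))
    assert perturbation_coeff(D) >= c_equi - 1e-12  # AM-GM lower bound

print(c_equi)
```

Every randomly drawn profile has a larger perturbation coefficient than the geometric one, matching the equality condition of the AM-GM inequality.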
Fig. 7. Illustration of the law of equi-separation in residual neural networks: Each block contains either two or three layers. In the "Mixed" configuration, the first two blocks consist of three layers each, while the last two blocks have two layers each. In each plot, a layer or a block is identified as a module when presenting the x axis. The first two columns are evaluated on the Fashion-MNIST dataset, and the last two columns are evaluated on the CIFAR-10 dataset.
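The block-as-module bookkeeping behind Fig. 7 amounts to measuring Eq. 1 at block boundaries (after each skip connection) rather than after every layer. A toy untrained residual network of our own construction, so the printed values only illustrate where D is evaluated:

```python
import numpy as np

def separation_fuzziness(X, y):
    # D = Tr(SS_w SS_b^+) from Eq. 1
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    ssb = np.zeros((p, p))
    ssw = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        d = (Xk.mean(axis=0) - grand_mean)[:, None]
        ssb += len(Xk) * (d @ d.T)
        ssw += (Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0))
    return float(np.trace((ssw / n) @ np.linalg.pinv(ssb / n)))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(40, 8)) for m in (-1.0, 1.0)])
y = np.repeat([0, 1], 40)

# three residual blocks of two ReLU layers each (random weights)
blocks = [[rng.normal(scale=8 ** -0.5, size=(8, 8)) for _ in range(2)]
          for _ in range(3)]

H = X
D_per_block = [separation_fuzziness(H, y)]   # module 0 = raw input
for block in blocks:
    Z = H
    for W in block:
        Z = np.maximum(Z @ W, 0.0)           # layers inside the block
    H = H + Z                                # skip connection
    D_per_block.append(separation_fuzziness(H, y))  # measure per block

print([round(d, 3) for d in D_per_block])
```

With a trained residual network, it is this block-level sequence, not the layer-level one, that the paper reports as following the equi-separation law.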
but not in the right network because of frozen training. Both networks have about theeach layer. As such, it would be important to studyused whether
but not in theStandard
right networktraining
because of frozen training. Both Frozen
networkstraining
have about the
This view, however, challenges the widely layer-wise
766 828
a deep learning model makes a particular prediction. It is
270
same training losses as well as training accuracies. However, the left network has a
same training losses as well as training accuracies. However, the left network has athe law of equi-separation could be utilized to derive better
approaches to deep learning interpretation (38, 39).
767 271 829
768
higher
Fig.
higher
test accuracy
9. Illustration
test accuracy
(23.85%)
showing than
the law
(23.85%) than
that
of thatof the right
equi-separation network (19.67%).
manifesting
of the right network in the left network
(19.67%). therefore crucial
sentence-level features to fromtake all layersof collectively
a sequence for interpre-
token-level features. 272 830
but not in the right network because of frozen training. Both networks have about tation (37). This view, however, challenges the widely used
T
769 Perhaps the most pressing question is to reveal the un- 273 831
the same training losses as well as training accuracies. However, the left network
layer-wise
3. Discussion approaches to deep learning interpretation (38, 39).
770215 balance
has higherbetween
aThe lawtest of the (23.85%)
trainingthan
equi-separation
accuracy timesthaton
can theright
ofalso
the source
shed and(19.67%).
light
network thethe
on out-of-derlying
target. mechanism of this data-separation law. However, 274 832
771216 In light of
sample the law of equi-separation,
performance of neural networks. a sensibleFig. 9balanceconsiders is totwosome Due
commonly used assumptions in the literature of deep 275
to (14) and follow-up work, it has been known that neural
833
F
772217 834
The law of equi-separation can also shed light on the out-of- networks can exhibit precise mathematical structures at the
roughly the same between the bottom and top layers. linearity of (14)
Due layer
to activation functions is crucial to thisknown
law because
and theperformance
other does not. The results of our experiments
9 considersshowed inand thefollow-up
terminalwork, it has been In that
thisneural
773218 835
two the last phase of training. paper,
277
sample of neural networks. Fig. separation fuzziness is roughly unchanged by linear trans-
774
that the
neural formerwhere
networks one has better
exhibitstest theperformance. Although we networks can exhibit
have Additionally,
extended thisprecise
viewpoint mathematical thestructures
from networks surface ofatthese
the
278 836
C. Interpretation. The one equi-separation law offers
law of equi-separation
a new per- formations. analyses of neural using
this law is sometimes not observed in neural networks with last layer in the terminal phase of training. In this paper, we
775219 279
RA
837
and the other
spective does not. The
for interpreting deep results
learning of our experiments
predictions, showed the black-box
especially neural tangent modelskernel to their (41)interior
are notbyable introducing
to capture an the
empirical
good test performance, interestingly, one can fine-tune the have extended this viewpoint from the surface of these black-
280
776220 838
that the formerdecision
in high-stakes one hasmakings. better test performance.
An important Although
ingredient in lawofthat
effect depth, quantitatively
which is essential governs how well-trained
for understanding real-world
the data 281
777221
parameters to reactivate notthe law with about neuralthe same or with better neuralbox models to their interior bythroughout
introducingallanlayers. empirical Thislaw
839
222 this law is sometimes
interpretation is to pinpoint observed
the basicinoperational networks
modules in separation networks
process across separate data
all layers. This suggests that more law
282
778
223
test
good performance.
neuraltest performance,
networks and subsequentlyinterestingly, to reflectone that can fine-tune
“all modules that quantitatively governs to how
the innovative approaches may be necessary to fully understand 283in-
has been demonstrated provide well-trained
useful real-world
guidelines neural
and
840
779
224 are Moreover,
created equal.”
parameters to the law of the
For feedforward
reactivate equi-separation
law with can the
and convolutional
about be potentially
same networks
neuralor the sights intoseparate
mechanisms deepbehind data
learningthis law. throughout
practice throughall layers.
network This law has
architecture 284
841
780
225
used
networks,
better to facilitate
test each the training of transfer
layer is a module as it reduces the separation
performance. learning. In transfer been demonstrated
design, training,
Broadly speaking,and to provide
interpretation
a perspective useful
informed guidelines
of predictions. and
by this law is 285 insights 842
781
226
learning,
fuzziness
Moreover, for an
by example,
theequal bottom
law multiplicative
of layersfactor
equi-separation trained (for
can from
bemore andetails,
upstreamto focus
potentially intoOne deep thelearning
on avenue for future
non-temporal practice researchthrough
dynamics to network
is of explore
the datathe architecture
applicabil-
separa- 286
843
782
task
see Fig.can 18 form
in the part of
supplementary a network
used to facilitate the training of transfer learning. In transfer that
materials). will be
Interestingly,trained even on a tiondesign,
ity of
process thetraining,
law
throughout of and interpretation
equi-separation
all layers when to aof predictions.
wider
studying range
deep of
neural network
844
D
227 287
783
228 downstream
when
learning, thefor task.
lawexample,
no longer Abottom
challenge
manifests layers arising
in residual
trained in from
practice
neural is how tonetworks.
annetworks,
upstream OneThis
architectures avenue andfor
perspective future
applications.probes research is to
intoexample,
For the explore
interior thebeappli-
of deep
it would inter-
288
845
784
229 balance
Fig.
task 7 shows
can between
formthat the
part it training
cana be
of times on
restored
network that theidentifying
by source
will beand athe
trained target.
block cability
on learning
esting toofseethe
models by law of equi-separation
perceiving
if this law holds thefor neuraltoordinary
black-box a widerdifferential
classifier asrange
a 289 of 846
785
230 In light
rather thanof the
a law
layer asof a equi-separation,
module.
a downstream task. A challenge arising in practice is This a
figure sensible
also balance
suggests thatis to network
sequence
equationsof architectures
transformations,
and, if so, and
how applications.
each it of
couldwhich be For example,
corresponds
used to to
improveita would our
290
847
786231
find to
blocks
how when balancethe improvement
containing
between theintraining
more layers are reducing separation
more capable
times fuzziness
of reducing
on the source islayer beorinteresting
a block in the
understanding to the
of seedepth
case if residual
of thisoflaw these holds
neural for neural
networks.
continuous-depth ordinary
In con- models.291
848
787232 the
roughlyseparation the samefuzziness.
between Likewise,
the bottomFig. 8and shows top that
layers. the law trast, differential
existing equations
analyses often and, if
consider
Additionally, it would be valuable to examine the law in so, thehow it could
“exterior” of be
theseused to
regres-
849
and the target. In light of the law of equi-separation, a 292
788233 of equi-separation holds in densely connected convolutional models improve
by our
focusing understanding
on the loss of the
function
sion settings and multi-modal tasks (40). A crucial question depth
or test of these continuous-
performance, 850
sensible balance is to find when the improvement in reducing
293
Because eachismodule
interpretation to pinpoint makes the equal but small
basic operationalcontributions,modules a in why the lawthat is most the accurate in feedforward neural networks
792237 854
C.few Interpretation.
neurons in a The
single equi-separation
layer are unlikely law
to offers
explain a
how newa per-
deep Next, given law is most evident in feedforward neural
793238 neural networks and subsequently to reflect that “all modulesData and becomes
Availability.
networks, an Our code fuzzier
somewhat
interesting isdirection
publicly isavailable
in certain to at new
GitHub
convolutional
explore neural
measures
855
spective for interpreting deep learning predictions, especially
297
learning
are created model makes
equal.” For a particular
feedforward prediction.
and It is therefore
convolutional neural(https://github.com/HornHehhf/Equi-Separation ).
794239
networks. Next, given that the
in lieu of the separation fuzziness defined in Eq. 1. The hope law is most evident in
298 856
incrucial
high-stakes
networks, to takeeach decision
all
layerlayersismakings.moduleAn
acollectively as important
for interpretation
it reduces ingredient
the separation(36). in
795240
feedforward neural networks,
is that this could potentially make the law clearer in otheran interesting direction is to 857
interpretation
This view, however,
fuzziness by an is to pinpoint
equal challenges the basic
multiplicative the widelyoperational
factorused (for more modules
layer-wise in
details,
796241
explore new
network
ACKNOWLEDGMENTS. measuresWe
architectures. in
Inarelieu
doing of so,
grateful theitto separation
would
X.Y. Han be fuzziness
beneficial
for 299
858
242 neural
approaches networks to deep and subsequently
learning to
interpretation
see Fig. S11 in SI Appendix). Interestingly, even when theveryto reflect that
(37, “all
38). modules
797
defined
helpful
take in Eq.
comments
into 1. and
account Thenetwork hope structures,
feedback isonthat thissuch
an early could
versionas potentially
of the 300
convolution
859
are
lawcreated
no longer equal.” For feedforward
manifests in residual and convolutional
neural networks, neural
Fig. 7manuscript. Welaw also clearer
thank Ben Zhou
798
make the
kernels in (convolutional otherfor
inneural reproducing
network
networks, during
our experi- 301
architectures.
the develop- In 860
networks,
4. Discussion each layer is a module
shows that it can be restored by identifying a block rather ment as it reduces the separation mental results https://github.com/Slash0BZ/data-separation/tree/main/
799243
doing so, it would be beneficial to take into account network
302 861
fuzziness by anas equal multiplicative factor (for of). new
Thisseparation measures.
supportedFurthermore,
in part by NSFwhile grantsthe 303 law
Due to (14) and follow-up work, it has been known that neural networks can exhibit precise mathematical structures in the features they learn; such structures are a natural reference during the development of new separation measures. The law of equi-separation has likewise been observed in a variety of neural structures, such as convolution kernels in convolutional networks. Interestingly, even when the early layers of a network are fixed at their pretrained values and only newly added layers are trained, the law still emerges (see Fig. S11 in SI Appendix).

Fig. 7 shows that the law no longer manifests in residual neural networks, and that it can be restored by identifying a block rather than a layer as a module. This figure also suggests that blocks containing more layers are more capable of reducing the separation fuzziness. Likewise, Fig. 8 shows that the law of equi-separation holds in densely connected convolutional networks by identifying a block as a module. For comparison, when the perspective of data separation is not considered (36), the correct module cannot be identified. For completeness, it should be noted that the law becomes less clear for deeper residual neural networks†, as shown in Fig. 10.

Furthermore, while the law of equi-separation has been observed in a variety of vision tasks, it does not appear to hold for language models such as BERT (41) (see Fig. S9 in SI Appendix). This could be due to the fact that language models learn token-level features rather than sentence-level features at each layer. As such, it would be important to study whether the law of equi-separation could be utilized to derive better sentence-level features from a sequence of token-level features.

†See the architecture in https://github.com/kuangliu/pytorch-cifar/tree/master/models.

He and Su PNAS | March 22, 2023 | vol. XXX | no. XX | 7
Because each module makes equal but small contributions, a few neurons in a single layer are unlikely to explain how a deep learning model makes a particular prediction. It is therefore advisable to consider all layers collectively when interpreting the model's predictions.

Perhaps the most pressing question is to reveal the underlying mechanism of this data-separation law. However, some commonly used assumptions in the literature of deep learning theory may not be applicable. For instance, the non-linearity of activation functions is crucial to this law, because an invertible linear layer leaves the separation fuzziness unchanged and thus cannot drive the separation process across all layers. This suggests that more innovative approaches may be necessary to fully understand the mechanisms behind this law.

Broadly speaking, a perspective informed by this law is to focus on the non-temporal dynamics of the data separation process throughout all layers when studying deep neural networks. This perspective probes into the interior of deep learning models by perceiving the black-box classifier as a sequence of transformations, each of which corresponds to a layer or a block in the case of residual neural networks. In contrast, existing analyses often consider the "exterior" of these models by focusing on the loss function or test performance, without considering the intermediate layers (42–44). Informed by the law of equi-separation, this perspective may provide formal insights into the inner workings of deep learning.

Data Availability. Our code is publicly available at GitHub (https://github.com/HornHehhf/Equi-Separation).

ACKNOWLEDGMENTS. We are grateful to X.Y. Han for very helpful comments and feedback on an early version of the manuscript. We also thank Hangyu Lin (https://github.com/avalonstrel/NeuralCollapse/tree/mlp), Cheng Shi (https://github.com/DaDaCheng/Re-equi-sepa), and Ben Zhou (https://github.com/Slash0BZ/data-separation/tree/main/reproduction) for reproducing our experimental results. This research was supported in part by NSF grants CCF-1934876 and CAREER DMS-1847415, and an Alfred Sloan Research Fellowship.

1. DP Kingma, J Ba, Adam: A method for stochastic optimization in International Conference on Learning Representations. (2015).
2. A Krizhevsky, I Sutskever, GE Hinton, ImageNet classification with deep convolutional neural networks in Advances in Neural Information Processing Systems. Vol. 25, (2012).
3. Y LeCun, Y Bengio, G Hinton, Deep learning. Nature 521, 436–444 (2015).
4. D Silver, et al., Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016).
5. A Fawzi, et al., Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022).
6. T Hastie, R Tibshirani, JH Friedman, The elements of statistical learning: data mining, inference, and prediction. (Springer) Vol. 2, (2009).
7. M Hutson, Has artificial intelligence become alchemy? Science 360, 478–478 (2018).
8. PL Bartlett, PM Long, G Lugosi, A Tsigler, Benign overfitting in linear regression. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
9. Y Lu, A Zhong, Q Li, B Dong, Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations in International Conference on Machine Learning. (PMLR), pp. 3276–3285 (2018).
10. C Fang, H He, Q Long, WJ Su, Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proc. Natl. Acad. Sci. 118, e2103091118 (2021).
11. K He, X Zhang, S Ren, J Sun, Deep residual learning for image recognition in 2016 IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016).
12. H Xiao, K Rasul, R Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
13. JP Stevens, Applied multivariate statistics for the social sciences. (Routledge), (2012).
14. V Papyan, X Han, DL Donoho, Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl. Acad. Sci. 117, 24652–24663 (2020).
15. X Han, V Papyan, DL Donoho, Neural collapse under mse loss: Proximity to and dynamics on the central path in International Conference on Learning Representations. (2022).
16. K Simonyan, A Zisserman, Very deep convolutional networks for large-scale image recognition in International Conference on Learning Representations. (2015).
17. G Huang, Z Liu, L Van Der Maaten, KQ Weinberger, Densely connected convolutional networks in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017).
18. G Alain, Y Bengio, Understanding intermediate layers using linear classifier probes in International Conference on Learning Representations. (2017).
19. V Papyan, Traces of class/cross-class structure pervade deep learning spectra. The J. Mach. Learn. Res. 21, 10197–10260 (2020).
20. T Galanti, A György, M Hutter, On the role of neural collapse in transfer learning in International Conference on Learning Representations. (2022).
21. T Galanti, On the implicit bias towards minimal depth of deep neural networks. arXiv preprint arXiv:2202.09028 (2022).
25. S Hihi, Y Bengio, Hierarchical recurrent neural networks for long-term dependencies in Advances in Neural Information Processing Systems, eds. D Touretzky, M Mozer, M Hasselmo. (MIT Press), Vol. 8, (1995).
26. X Glorot, Y Bengio, Understanding the difficulty of training deep feedforward neural networks in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. (PMLR), Vol. 9, pp. 249–256 (2010).
27. D Yarotsky, Error bounds for approximations with deep ReLU networks. Neural Networks 94, 103–114 (2017).
28. Y LeCun, The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
29. A Krizhevsky, Master's thesis (University of Toronto) (2009).
30. K He, J Sun, Convolutional neural networks at constrained time cost in 2015 IEEE Conference on Computer Vision and Pattern Recognition. pp. 5353–5360 (2015).
31. RK Srivastava, K Greff, J Schmidhuber, Training very deep networks in Advances in Neural Information Processing Systems. Vol. 28, (2015).
32. AG Howard, et al., MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
33. M Tan, Q Le, EfficientNet: Rethinking model scaling for convolutional neural networks in Proceedings of the 36th International Conference on Machine Learning. (PMLR), Vol. 97, pp. 6105–6114 (2019).
34. S Bubeck, M Sellke, A universal law of robustness via isoperimetry. Adv. Neural Inf. Process. Syst. 34, 28811–28822 (2021).
35. S Ioffe, C Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift in Proceedings of the 32nd International Conference on Machine Learning. (PMLR), Vol. 37, pp. 448–456 (2015).
36. C Zhang, S Bengio, Y Singer, Are all layers created equal? J. Mach. Learn. Res. 23, 1–28 (2022).
37. WJ Su, Neurashed: A phenomenological model for imitating deep learning training. arXiv preprint arXiv:2112.09741 (2021).
38. MD Zeiler, R Fergus, Visualizing and understanding convolutional networks in European Conference on Computer Vision. pp. 818–833 (2014).
39. I Tenney, D Das, E Pavlick, BERT rediscovers the classical NLP pipeline in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 4593–4601 (2019).
40. VW Liang, Y Zhang, Y Kwon, S Yeung, JY Zou, Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Adv. Neural Inf. Process. Syst. 35, 17612–17625 (2022).
41. J Devlin, MW Chang, K Lee, K Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019).
42. A Jacot, F Gabriel, C Hongler, Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 31 (2018).
43. H He, WJ Su, The local elasticity of neural networks in International Conference on Learning Representations. (2020).
44. M Belkin, D Hsu, S Ma, S Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. 116, 15849–15854 (2019).
45. T Wolf, et al., Transformers: State-of-the-art natural language processing in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45 (2020).
46. R Socher, et al., Recursive deep models for semantic compositionality over a sentiment treebank in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1631–1642 (2013).
Supporting Information

In this appendix, we first detail the general setup for all experiments, then briefly highlight some distinctive settings for the figures in the main text, then mention the argument for robustness, and finally show some additional results. More details can be found in our code.‡

General Setup. In this subsection, we detail the general setup that is used for the experiments throughout the paper.

Network architecture. In our experiments, we mainly consider feedforward neural networks (FNNs). Unless otherwise stated, the corresponding hidden size is 100. For simplicity, we use neural networks with 4, 8, and 20 layers as representatives of shallow, mid, and deep neural networks. By default, batch normalization (35) is inserted after fully connected layers and before the ReLU nonlinear activation functions.

Dataset. Our experiments are mainly conducted on Fashion-MNIST. Unless otherwise stated, we resize the original images to 10 × 10 pixels. By default, we randomly sample a total of 1000 training examples evenly distributed over the 10 classes.

Optimization methodology. In our experiments, three popular optimization methods are considered: SGD, SGD with momentum, and Adam. The weight decay is set to 5 × 10−4, and the momentum term is set to 0.9 for SGD with momentum. The neural networks are trained for 600 epochs with a batch size of 128. The initial learning rate is annealed by a factor of 10 at 1/3 and 2/3 of the 600 epochs. As for SGD and SGD with momentum, we pick the model resulting in the best equi-separation law in the last epoch among the seven learning rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, and 1.0. Similarly, we pick the model resulting in the best equi-separation law in the last epoch for Adam among the five learning rates: 3 × 10−5, 1 × 10−4, 3 × 10−4, 1 × 10−3, and 3 × 10−3. Unless otherwise stated, Adam is adopted in the experiments.

Detailed Experimental Settings. In this subsection, we show detailed experimental settings for the figures in the main text.

Optimization. As shown in Fig. 1, for FNNs with different depths, three different optimization methods are considered: SGD, SGD with momentum, and Adam.

Visualization. As shown in Fig. 1, we visualize the features of different layers in the 8-layer FNNs trained on Fashion-MNIST. In particular, principal component analysis (PCA) is used to visualize the features in a two-dimensional plane.

Training epoch. As shown in Fig. 2, we show the separability of features among layers in different training epochs. In this part, we simply consider the 20-layer FNNs trained on Fashion-MNIST. The equi-separation law does not emerge at the beginning of training but gradually emerges as training proceeds. After that, the decay ratio decreases along the training process until the network converges.

Furthermore, we conducted an analysis of the last-layer features in the aforementioned experiments. As illustrated in Fig. S1, it is evident that the separation fuzziness of the last-layer features does not reach convergence until after 400 epochs. Conversely, the equi-separation law manifests at epoch 100, as demonstrated in Fig. 2, with a Pearson correlation coefficient of −0.996. This observation indicates that the equi-separation law emerges prior to the occurrence of neural collapse.

As shown in Fig. 3, we further consider the impact of the following factors on the equi-separation law: dataset, learning rate, and class distribution.

Dataset. As shown in Fig. 3, we experiment with CIFAR-10. We use the second channel of the images and resize the original images to 10 × 10. Similar to Fashion-MNIST, we randomly sample a total of 1000 training examples evenly distributed over the 10 classes.

Learning rate. As shown in Fig. 3, we compare the 9-layer FNNs trained using SGD with different learning rates on Fashion-MNIST.

Imbalanced data. As shown in Fig. 3, we experiment with imbalanced data. Specifically, we consider 5 majority classes with 500 examples in each class, and 5 minority classes with 100 examples in each class. The law of equi-separation also emerges in neural networks trained with imbalanced Fashion-MNIST data, though its half-life is significantly larger than that in the balanced case. This might be caused by the collapse of minority classifiers as shown in (10).

Convolutional neural networks (CNNs). As shown in Fig. 4, we experiment with two canonical CNNs, AlexNet and VGG (16). We use the PyTorch implementation of both models.§ We experiment with both Fashion-MNIST and CIFAR-10, and the original images are resized to 32 × 32 pixels. Given that AlexNet and VGG are designed for ImageNet images with 224 × 224 pixels, we make some small modifications to handle images with 32 × 32 pixels. As for AlexNet, we change the original convolution filters to 5 × 5 convolution filters with padding 2. The color channel number is multiplied by 32 at the beginning. As for VGG, we change the original convolution filters to 5 × 5 convolution filters with padding 2, and the last average pooling layer before fully connected layers is replaced by an average pooling layer over a 1 × 1 pixel window. The color channel number is multiplied by 16 at the beginning.

Moreover, we examined VGG employing four distinct configurations¶ on the Fashion-MNIST and CIFAR-10 datasets. As demonstrated in Fig. S2, an approximate equi-separation law is also observed across different VGG configurations. It is important to note that the equi-separation law can be further refined by modifying convolution filter size and input channel number, as illustrated in Fig. 4.

To gain further insights into CNNs, we explored a variant of VGG11 (referred to as pure CNNs) by retaining all convolutional layers and the final fully connected layer, and removing all pooling layers along with the initial two fully connected layers. Additionally, we ensured an equal number of channels in each layer for our analysis. In this simplified network, convolutional layers serve to extract features, which are subsequently input into the final fully connected layer for classification. As presented in Fig. S3, the equi-separation law is present in pure CNNs for various datasets with an appropriate number of channels; however, the separation fuzziness of the last-layer features is considerably higher than that of FNNs and canonical CNNs (e.g., AlexNet and VGG). We further investigated the influence of other factors in CNNs, such as depth, kernel size, pooling layer, and image size. Our findings indicate that the equi-separation law is indeed present in CNNs but necessitates a delicate balance among these factors. This suggests that an alternative measure, rather than separation fuzziness (D in Eq. 1), may be required to describe the information flow in CNNs, given the unique structure of features following convolutional layers.

Depth. As shown in Fig. 5, besides Fashion-MNIST and CIFAR-10, we also consider MNIST to illustrate the optimal depth for different datasets. Similar to Fashion-MNIST, we resize the original MNIST images to 10 × 10 pixels, and randomly sample a total of 1000 training examples evenly distributed over the 10 classes.

As shown in Fig. 6, we consider FNNs with different widths and shapes.

Width. As shown in Fig. 6, we consider FNNs with three different widths. All layers have the same width in each setting. For simplicity, we consider the 8-layer FNNs trained on Fashion-MNIST.

Shape. As shown in Fig. 6, we consider three different shapes of FNNs: Narrow-Wide, Wide-Narrow, and Mixed. As for Narrow-Wide, the FNNs have narrow hidden layers at the beginning, and then have wide hidden layers in the later layers. Similarly, we use Wide-Narrow to denote FNNs that start with wide hidden layers followed by narrow hidden layers. In the case of Mixed, the FNNs have more complicated patterns of the widths of hidden layers. For simplicity, we only consider 8-layer FNNs trained on Fashion-MNIST. The corresponding widths of hidden layers for Narrow-Wide, Wide-Narrow, and Mixed we considered in our experiments are as follows:
‡Our code is publicly available at GitHub (https://github.com/HornHehhf/Equi-Separation).
§More details are in https://github.com/pytorch/vision/tree/main/torchvision/models.
¶More details can be found at: https://github.com/kuangliu/pytorch-cifar/tree/master/models.
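The learning-rate schedule described in Optimization methodology (annealing by a factor of 10 at 1/3 and 2/3 of the 600 epochs) can be sketched as a small helper. This is our illustration, not the authors' released code, and `annealed_lr` is a hypothetical name:

```python
def annealed_lr(initial_lr, epoch, total_epochs=600):
    """Piecewise-constant schedule: the initial rate for the first
    third of training, one tenth of it for the middle third, and
    one hundredth of it for the final third."""
    if epoch < total_epochs / 3:
        return initial_lr
    if epoch < 2 * total_epochs / 3:
        return initial_lr / 10
    return initial_lr / 100
```

With 600 total epochs, the drops occur at epochs 200 and 400, matching the annealing points used in the experiments.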
Fig. S2. Emergence of a fuzzy equi-separation law in VGG networks with varied depths on Fashion-MNIST and CIFAR-10.
Fig. S3. Equi-separation law emerges in pure CNNs with appropriate numbers of channels on Fashion-MNIST (64 channels) and CIFAR-10 (16 channels).
Fig. S4. Equi-separation law emerges in FNNs across varied depths on 32 × 32 pixel images of Fashion-MNIST and CIFAR-10.
Fig. S5. The impact of the sample size on the law of equi-separation.
Fig. S7. Convergence rates of different layers (without the last-layer classifier) with respect to the relative improvement (D_{l+1}/D_l) of separability of features. The x-axis represents the training epoch, while the y-axis indicates the relative improvement (D_{l+1}/D_l).
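The quantities tracked here can be computed directly from per-layer features. Below is a minimal NumPy sketch (our illustration, not the authors' released code; `separation_fuzziness` and `decay_ratios` are hypothetical names) of the separation fuzziness D = Tr(SS_w SS_b^+) from Eq. 1 and the relative improvement D_{l+1}/D_l:

```python
import numpy as np

def separation_fuzziness(features, labels):
    """D = Tr(SS_w SS_b^+): within-class scatter measured against the
    pseudo-inverted between-class scatter, as in Eq. 1.

    features: (n, d) array of one layer's activations.
    labels:   (n,)  array of integer class labels.
    """
    n, d = features.shape
    global_mean = features.mean(axis=0)
    ss_b = np.zeros((d, d))
    ss_w = np.zeros((d, d))
    for k in np.unique(labels):
        class_feats = features[labels == k]
        class_mean = class_feats.mean(axis=0)
        diff = (class_mean - global_mean)[:, None]
        ss_b += len(class_feats) * (diff @ diff.T)
        centered = class_feats - class_mean
        ss_w += centered.T @ centered
    # SS_b has rank at most K - 1, hence the Moore-Penrose inverse.
    return np.trace((ss_w / n) @ np.linalg.pinv(ss_b / n))

def decay_ratios(d_values):
    """Relative improvements D_{l+1}/D_l across consecutive layers.
    Under the law of equi-separation these are roughly constant."""
    d_values = np.asarray(d_values, dtype=float)
    return d_values[1:] / d_values[:-1]
```

Applying `separation_fuzziness` to each layer's features and inspecting `decay_ratios` of the resulting sequence is one way to check for the constant geometric decay the law predicts.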
Fig. S8. A fuzzy version of the equi-separation law also emerges in the test data.
Fig. S9. The equi-separation law does not exist in BERT features. Given a sequence of token-level BERT features, the two most popular approaches are used to obtain the sentence-level features at each layer: 1) using the features of the first token, i.e., the [CLS] token (denoted as BERT-CLS); 2) averaging the features among all tokens in the sentence (denoted as BERT-AVG).
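The two pooling strategies compared in this figure can be sketched as follows. Here `token_features` stands for a generic (seq_len, hidden) array of one layer's token-level features; the BERT model itself is not involved, and `sentence_feature` is a hypothetical helper name:

```python
import numpy as np

def sentence_feature(token_features, mode="cls"):
    """Collapse token-level features of shape (seq_len, hidden) into a
    single sentence-level vector: 'cls' keeps the first ([CLS]) token,
    'avg' averages over all tokens."""
    token_features = np.asarray(token_features, dtype=float)
    if mode == "cls":
        return token_features[0]
    if mode == "avg":
        return token_features.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")
```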
Fig. S10. The equi-separation law is not clear on Fashion-MNIST for FNNs trained without batch normalization.
Fig. S11. The impact of pretraining on the equi-separation law. We first pretrain the 20-layer FNNs shown in Fig. S11 (Depth=20). After that, we fix the first 10 layers of the pretrained model and train 5 additional layers, as in Fig. S11 (Pretraining); this is quite different from training the 15-layer neural networks from scratch, as in Fig. S11 (Depth=15).