
PHYSICAL REVIEW E 98, 062313 (2018)

Mechanisms of dimensionality reduction and decorrelation in deep neural networks


Haiping Huang*
School of Physics, Sun Yat-sen University, Guangzhou 510275, People’s Republic of China
and Laboratory for Neural Computation and Adaptation, RIKEN Center for Brain Science, Wako-shi, Saitama 351-0198, Japan

(Received 30 October 2017; revised manuscript received 26 November 2018; published 14 December 2018)

Deep neural networks are widely used in various domains. However, the nature of computations at each layer
of the deep networks is far from being well understood. Increasing the interpretability of deep neural networks
is thus important. Here, we construct a mean-field framework to understand how compact representations are
developed across layers, not only in deterministic deep networks with random weights but also in generative deep
networks where unsupervised learning is carried out. Our theory shows that the deep computation implements
a dimensionality reduction while maintaining a finite level of weak correlations between neurons for possible
feature extraction. Mechanisms of dimensionality reduction and decorrelation are unified in the same framework.
This work may pave the way for understanding how a sensory hierarchy works.

DOI: 10.1103/PhysRevE.98.062313

*physhuang@gmail.com; www.labxing.com/hphuang2018

I. INTRODUCTION

The sensory cortex in the brain encodes the structure of the environment in an efficient way. This is achieved by creating progressively better representations of sensory inputs, and these representations finally become easily decoded without any reward or supervision signals [1-3]. This kind of learning is called unsupervised learning, which has long been thought of as a fundamental function of the sensory cortex [4]. Based on a similar computational principle, many layers of artificial neural networks were designed to perform a nonlinear dimensionality reduction of high-dimensional data [5], which later triggered the resurgence of deep neural networks. By stacking unsupervised modules on top of each other, one can produce a deep feature hierarchy, in which high-level features can be constructed from less abstract ones along the hierarchy. However, these empirical results still lack a principled understanding. Understanding what each layer exactly computes may shed light on how sensory systems work in general.

Recent theoretical efforts focused on the layerwise propagation of one input vector's length, of correlations between two inputs [6], and of clustered noisy inputs in supervised classification tasks [7,8], generalizing a theoretical work on layered feedforward neural networks that studied the iteration of the overlap between the layer's activity and embedded random patterns [9]. However, these studies did not address the covariance of neural activity, an important feature of neural data modeling [10] that is directly related to the dimensionality and complexity of hierarchical representations. Therefore, a clear understanding of hierarchical representations has been lacking, which makes deep computation extremely nontransparent. Here, we propose a mean-field theory of input dimensionality reduction in deep neural networks. In this theory, we capture how a deep nonlinear transformation reduces the dimensionality of a data representation and, moreover, how the covariance level (redundancy) varies along the hierarchy. Both features are fundamental properties of deep neural networks and even of information processing in vision [11].

Our theory helps to advance the understanding of deep computation in the following two aspects. (i) There exists an operating point where input and output covariance levels are equal. This point keeps the covariance level from either diverging or decaying to zero, given sufficiently strong connections between layers. (ii) The dimensionality of the data representation is reduced across layers, due to an additive positive term (contributed by the previous layer) that affects the dimensionality in a divisive way. These computational principles are revealed not only in deterministic deep networks with random weights but also in trained generative deep networks. Our analytical findings coincide with numerical simulations, demonstrating that the previously observed dimensionality reduction [1,2,5] and the redundancy reduction hypothesis [12] could be theoretically explained within the same mean-field framework.

II. A DETERMINISTIC DEEP NETWORK

A deep network is a multilayered neural network performing hierarchical nonlinear transformations of sensory inputs (Fig. 1). The number of hidden layers is defined as the depth of the network, and the number of neurons at each layer is called the width of that layer. For simplicity, we assume an equal width (N). Weights between the (l-1)th and lth layers are specified by a matrix w^l, in which the ith row corresponds to the incoming connections of neuron i at the higher layer. Biases of neurons at the lth layer are denoted by b^l. The input data vector is denoted by v, and h^l (l = 1, ..., d) denotes the hidden representation of the lth layer, in which each entry h_i^l is a nonlinear transformation of its preactivation ã_i^l ≡ [w^l h^{l-1}]_i, as h_i^l = φ(ã_i^l + b_i^l). Without loss of generality, we choose the nonlinear transfer function as φ(x) = tanh(x) and assume that the weights follow a normal distribution N(0, g/N) and the biases follow N(0, σ_b). The random-weight assumption plays an important role in recent studies of artificial neural networks [13-18], and the weight distribution of trained networks may appear random [19].
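As an illustration of this setup, the following minimal Python sketch draws such a random network and propagates a batch of inputs through it; the function names and array shapes are illustrative choices, not code from the original study.

import numpy as np

def make_network(N, depth, g, sigma_b, rng):
    """Random weights w^l ~ N(0, g/N) and biases b^l ~ N(0, sigma_b)."""
    weights = [rng.normal(0.0, np.sqrt(g / N), size=(N, N)) for _ in range(depth)]
    biases = [rng.normal(0.0, np.sqrt(sigma_b), size=N) for _ in range(depth)]
    return weights, biases

def propagate(v, weights, biases):
    """Forward pass h^l = tanh(w^l h^{l-1} + b^l), starting from h^0 = v (rows = samples)."""
    h = v
    representations = []
    for w, b in zip(weights, biases):
        h = np.tanh(h @ w.T + b)
        representations.append(h)
    return representations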

FIG. 1. Schematic illustration of a deep neural network. The deep neural network performs a layer-by-layer nonlinear transformation of the original input data (a high-dimensional vector v). During the transformation, a cascade of internal representations (h^1, ..., h^d) is created. Here, d = 3 denotes the depth of the deep network.

We consider a Gaussian input ensemble with zero mean, covariance ⟨v_i v_j⟩ = r_ij/√N for all i ≠ j (r_ij is a uniformly distributed random variable drawn from [-ρ, ρ]), and variance ⟨v_i^2⟩ = 1. In the following derivations, we define the weighted sum ã_i^l subtracted by its mean as a_i^l = Σ_j w_ij^l (h_j^{l-1} - ⟨h_j^{l-1}⟩); thus a_i^l has zero mean. As a result, the covariance of a^l can be expressed as Λ_ij^l = ⟨a_i^l a_j^l⟩ = [w^l C^{l-1} (w^l)^T]_ij, where C^{l-1} defines the covariance matrix of the neural activity at the (l-1)th layer (also called the connected correlation matrix in physics). Because the deep network defined in Fig. 1 is a fully connected feedforward network, where each neuron at an intermediate layer receives a large number of inputs, the central limit theorem implies that the mean of the hidden neural activity m^l and the covariance C^l are given separately by

m_i^l = ⟨h_i^l⟩ = ∫ Dt φ(√(Λ_ii^l) t + [w^l m^{l-1}]_i + b_i^l),   (1a)

C_ij^l = ∫ Dx Dy φ(√(Λ_ii^l) x + b_i^l + [w^l m^{l-1}]_i) φ(√(Λ_jj^l) (Ψ x + y √(1 - Ψ^2)) + b_j^l + [w^l m^{l-1}]_j) - m_i^l m_j^l,   (1b)

where Dx = e^{-x^2/2} dx/√(2π) and Ψ = Λ_ij^l/√(Λ_ii^l Λ_jj^l). To derive Eq. (1), we parametrize a_i^l and a_j^l by independent normal random variables (see Appendix A). Equation (1) forms an iterative mean-field equation across layers that describes the transformation of the activity statistics in deep networks.

To characterize the collective property of the entire hidden representation, we define an intrinsic dimensionality of the representation as D = (Σ_{i=1}^N λ_i)^2 / Σ_{i=1}^N λ_i^2 [20], where {λ_i} is the eigenspectrum of the covariance matrix C^l. It is expected that D = N if each component of the representation is generated independently with the same variance. Generally speaking, nontrivial correlations in the representation will result in D < N. Therefore, we can use the above mean-field equation together with the dimensionality to address how the complexity of hierarchical representations changes along the depth.
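The iteration of Eq. (1) together with the dimensionality D can be sketched numerically as follows. In this illustrative Python sketch (helper names are my own), the Gaussian averages in Eq. (1) are estimated by Monte Carlo sampling of the zero-mean preactivation fluctuations rather than by quadrature; parameter values follow the setting of Fig. 2 but are otherwise assumptions.

import numpy as np

def participation_ratio(C):
    """Intrinsic dimensionality D = (sum_i lambda_i)^2 / sum_i lambda_i^2."""
    lam = np.linalg.eigvalsh(C)
    return lam.sum() ** 2 / (lam ** 2).sum()

def mean_field_step(m_prev, C_prev, w, b, n_samples=100000, rng=None):
    """One layer of Eq. (1): propagate the mean m^l and the covariance C^l.

    The Gaussian averages are estimated by sampling a^l ~ N(0, Lambda^l),
    with Lambda^l = w C^{l-1} w^T, and averaging h = tanh(a^l + w m^{l-1} + b).
    """
    rng = np.random.default_rng() if rng is None else rng
    Lam = w @ C_prev @ w.T                 # covariance of preactivation fluctuations
    mean_pre = w @ m_prev + b              # deterministic part of the preactivation
    a = rng.multivariate_normal(np.zeros(len(b)), Lam, size=n_samples)
    h = np.tanh(a + mean_pre)
    return h.mean(axis=0), np.cov(h, rowvar=False)

# Usage: start from the correlated Gaussian input ensemble of the main text.
N, depth, g, sigma_b = 100, 8, 0.8, 0.1
rho = 0.05 * np.sqrt(N)                    # so that rho/sqrt(N) = 0.05 as in Fig. 2
rng = np.random.default_rng(0)
C = rng.uniform(-rho, rho, size=(N, N)) / np.sqrt(N)
C = np.triu(C, 1)
C = C + C.T + np.eye(N)                    # symmetric, unit variances (positive definite here)
m = np.zeros(N)
for l in range(depth):
    w = rng.normal(0.0, np.sqrt(g / N), size=(N, N))
    b = rng.normal(0.0, np.sqrt(sigma_b), size=N)
    m, C = mean_field_step(m, C, w, b, rng=rng)
    print(l + 1, participation_ratio(C))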
FIG. 2. Representation dimensionality versus depth in deterministic deep networks with random weights. Ten network realizations are considered for each network width. ρ/√N = 0.05, g = 0.8, and σ_b = 0.1. The right inset shows how the overall strength of covariance changes with depth and connection strength (g), and the left inset is a mechanism illustration (Δ^{0,1,*} has been scaled by N). The crosses show simulation results (g = 0.8) obtained from 10^5 sampled configurations at each layer, compared with the theoretical predictions.

Based on this mean-field framework, we first study the aforementioned deterministic deep neural networks. Regardless of which network width is used, we find that the representation dimensionality progressively decreases across layers (Fig. 2). The theoretical results agree very well with numerical simulations (indicated by crosses in Fig. 2). This shows that, even in a random multilayered neural network, a more compact representation of the correlated input is gradually computed as the network becomes deeper, which is also one of the basic properties of biological hierarchical computations [2,21].

To get deeper insight into the hidden representation, we study how the overall strength of covariance at each layer changes with the network depth and the connection strength (g). The overall covariance strength is measured by Δ = [2/(N(N-1))] Σ_{i<j} C_ij^2, which is related to the dimensionality via Δ = [1/(N-1)][((1/N) Σ_i C_ii)^2/D̃ - (1/N) Σ_i C_ii^2], where D̃ = D/N; this relation follows from noting that tr(C) = Σ_i λ_i and tr(C^2) = Σ_i λ_i^2. We find that, to support an effective representation in which neurons are not completely independent, the connection strength must be sufficiently strong (Fig. 2), such that weakly correlated neural activities are still maintained at later stages of processing. Otherwise, the information will be blocked from passing through the layer where the neural activity becomes completely independent. High correlations imply strong statistical dependence and thus redundancy. An efficient representation must not be highly redundant [12], because a highly redundant representation cannot be easily disentangled and is thus not useful for computation; e.g., coadaptation of neural activities is harmful for feature extraction [22].
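The identity relating Δ to D̃ can be checked numerically for any symmetric covariance matrix; the short sketch below (helper names are mine, the test matrix is arbitrary) computes both sides.

import numpy as np

def covariance_strength(C):
    """Overall covariance strength Delta = 2/(N(N-1)) * sum_{i<j} C_ij^2."""
    N = C.shape[0]
    off = C - np.diag(np.diag(C))
    return (off ** 2).sum() / (N * (N - 1))

def strength_from_dimensionality(C):
    """Right-hand side of the identity relating Delta to D~ = D/N."""
    N = C.shape[0]
    lam = np.linalg.eigvalsh(C)
    D_tilde = lam.sum() ** 2 / ((lam ** 2).sum() * N)
    diag = np.diag(C)
    return (diag.mean() ** 2 / D_tilde - (diag ** 2).mean()) / (N - 1)

# Both routes agree for any symmetric C, e.g. a random Wishart-type matrix:
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 200))
C = A @ A.T / 200
print(covariance_strength(C), strength_from_dimensionality(C))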
The dimensionality reduction results from the nested nonlinear transformation of the input data. For a mechanistic explanation, by noting that Λ_ij^l is of the order O(1/√N), we expand C_ij^l = K_ij Λ_ij^l + O[(Λ_ij^l)^2] in the large-N limit (Appendix B), where K_ij ≡ φ'(x_i^0) φ'(x_j^0) and x_{i,j}^0 ≡ b_{i,j}^l + [w^l m^{l-1}]_{i,j}. Then Δ^l ≈ g^2 κ^2 Δ^{l-1} + (g^2 κ^2/N^2) Σ_i (C_ii^{l-1})^2, where κ ≡ E[(φ'(x_i^0))^2] and the average E[·] is taken over the random network parameters (see Appendix B). For the random model, NΔ^1 = g^2 κ^2 (NΔ^0 + 1), which determines a critical NΔ* = g^2 κ^2/(1 - g^2 κ^2) (the so-called operating point), such that a first boost of the correlation strength is observed when Δ^0 < Δ* (Fig. 2); otherwise, the correlation level is maintained, or decorrelation is achieved. The iteration of Δ^l can be used to derive D̃^1 = 1/[(N-1)Δ^0 + 1 + ϒ], where ϒ > 0 and its value depends on the layer's parameters. Thus ϒ determines how significantly the dimensionality is reduced, and the dimensionality reduction is explained by D̃^1 < D̃^0 = 1/[(N-1)Δ^0 + 1]. This relationship carries over to deeper layers (Appendix B), owing to an additive positive term in the denominator of the dimension formula. Therefore, the decorrelation of redundant inputs together with the dimensionality reduction is theoretically explained.
dimensionality reduction is theoretically explained. of the data, and upper layers are conjectured to learn more
abstract (complex) concepts, which is a key step in object
III. A STOCHASTIC DEEP NETWORK recognition problems [25]. One typical learning trajectory
for each layer is shown in the top inset of Fig. 3(a), where
It is of practical interest to see whether a deep generative
the reconstruction error decreases with the learning epoch.
model trained in an unsupervised way has a similar collective
The input data and subsequent representation complexity are
behavior. We consider a deep belief network (DBN) as a
captured very well by the theory [Fig. 3(b)]. We use the mean-
typical example of stochastic deep networks [5], in which
field framework derived for deterministic networks to study
each neuron’s activity at one hidden layer takes a binary
the complexity propagation (starting from the statistics of the
value (±1) according to a stochastic function of the neuron’s
input data), which is reasonable, because to suppress the noise
preactivation. Specifically, the DBN is composed of multiple
due to sampling, the mean activities at the intermediate layer
restricted Boltzmann machines (RBMs) stacked on top of
are used as the input data when the next layer is learned [24].
each other (Fig. 1). RBM is a two-layered neural network,
Therefore, the stochasticity of the neural response is implicitly
where there are no lateral connections within each layer, and
encoded into the learned parameters during training.
the bottom (top) layer is also named the visible (hidden)
Compared to the initial input dimensionality, the represen-
layer. Therefore, given the input hl at the lth layer, the neural
tation dimensionality the successive layers create becomes
representation at a higher (l + 1-th) layer is determined by a
lower [Fig. 3(a)], which coincides with observations in the
conditional probability:
deterministic random deep networks. This feature does not
  change when more neurons are used in each layer. Dur-

e hl+1
i [wl+1 hl ]i +bil+1
P (hl+1 |hl ) =  . (2) ing learning, the evolution of the dimensionality displays a
i
2 cosh [wl+1 hl ]i + bil+1 nonmonotonic behavior [Fig. 3(c)]: the dimensionality first
increases and then decreases to a stationary value. Moreover,
Similarly, P (hl |hl+1 ) is also factorized. the learning decorrelates the correlated input, whereas, after
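Equation (2) can be sampled directly, since P(h_i^{l+1} = +1 | h^l) is a logistic function of the preactivation; the following minimal sketch (function name mine) draws one layer of ±1 activities.

import numpy as np

def sample_next_layer(h, w_next, b_next, rng):
    """Sample h^{l+1} ~ P(h^{l+1} | h^l) of Eq. (2) for +/-1 units.

    For each unit, P(h_i = +1 | h^l) = 1 / (1 + exp(-2 * ([w^{l+1} h^l]_i + b_i^{l+1}))).
    h may be a single vector or a batch of row vectors.
    """
    field = h @ w_next.T + b_next
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
    return np.where(rng.random(p_plus.shape) < p_plus, 1.0, -1.0)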
The DBN as a generative model, once its network parameters (weights and biases) are learned (so-called training) from a data distribution, can be used to reproduce samples mimicking that data distribution. With deep layers, the network becomes more expressive in capturing high-order interdependence among components of a high-dimensional input, compared with a shallow RBM network. To study the expressive property of the DBN, we first specify a data distribution generated by a random RBM whose parameters follow the normal distribution N(0, g/N) for weights and N(0, σ_b) for biases. Using the random RBM as a data generator allows us to calculate analytically the complexity of the input data. In the random RBM, the hidden neural activity h at the top layer can be marginalized over using the conditional independence [Eq. (2)]; thus the distribution of the representation v at the bottom layer can be expressed as (Appendix C)

P(v) = (1/Z) ∏_a {2 cosh([w^{l+1} v]_a + b_a)} e^{Σ_i v_i b_i},   (3)

where a is the site index of hidden neurons and Z is the partition function. Based on the Bethe approximation [23], which captures weak correlations among neurons, the covariance of neural activity (with the same definition as before) under Eq. (3) can be computed from the approximate free energy using linear response theory (Appendix D). The estimated statistics of the random-RBM representation are used as a starting point from which the mean-field complexity-propagation equation [Eq. (1)] iterates, for the investigation of the dimensionality and redundancy reduction in the deep generative model.

Finally, we study the generative deep network whose network parameters are learned in a bottom-up pass from the representations at lower layers. The network parameters of each stacked RBM in the DBN are updated by the contrastive divergence procedure truncated to one step [24]. With this layerwise training, each layer learns a nonlinear transformation of the data, and upper layers are conjectured to learn more abstract (complex) concepts, which is a key step in object recognition problems [25]. One typical learning trajectory for each layer is shown in the top inset of Fig. 3(a), where the reconstruction error decreases with the learning epoch. The input data and the subsequent representation complexity are captured very well by the theory [Fig. 3(b)]. We use the mean-field framework derived for deterministic networks to study the complexity propagation (starting from the statistics of the input data), which is reasonable because, to suppress the noise due to sampling, the mean activities at the intermediate layer are used as the input data when the next layer is learned [24]. Therefore, the stochasticity of the neural response is implicitly encoded into the learned parameters during training.

Compared to the initial input dimensionality, the representation dimensionality that the successive layers create becomes lower [Fig. 3(a)], which coincides with the observations in the deterministic random deep networks. This feature does not change when more neurons are used in each layer. During learning, the evolution of the dimensionality displays a nonmonotonic behavior [Fig. 3(c)]: the dimensionality first increases and then decreases to a stationary value. Moreover, the learning decorrelates the correlated input, whereas, after the first drop, the learning seems to preserve a finite level of correlations [the bottom inset of Fig. 3(a)]. These compact representations may remove some irrelevant factors in the input, which facilitates the formation of easily decoded representations at deeper layers. Our theoretical analysis of random neural networks will likely carry over to this unsupervised
learning system; e.g., the operating point may explain the low level of preserved correlations.

FIG. 3. (a) Representation behavior as a function of depth in generative deep networks. Ten network realizations are considered for each network width. The top inset shows an example (N = 150) of the reconstruction error (ε ≡ ||h - h'||_2^2) between the input h and the reconstruction h' for each layer during learning. The bottom inset shows the overall strength of covariance as a function of depth. (b) Numerically estimated off-diagonal correlation versus its theoretical prediction (N = 150). In these plots, we generate M = 60 000 training examples (each example is an N-dimensional vector) from the random RBM whose parameters follow the normal distribution N(0, g/N) for weights and N(0, σ_b) for biases. g = 0.8 and σ_b = 0.1. These examples are then learned by the DBN (see simulation details in Appendix C). (c) One typical learning trial shows how the estimated dimensionality evolves.

By looking at the eigenvalue distribution of the covariance matrix, we find that the distribution for the unsupervised learning system deviates significantly from the Marchenko-Pastur law of a Wishart ensemble [26] (Appendix E). For the random neural networks, the eigenvalue distribution at deep layers seems to assign a higher probability density when the eigenvalue gets close to zero, yet a lower density at the tail of the distribution, compared with the Marchenko-Pastur law of a random-sample covariance matrix [27] (Appendix E). Therefore, the dimensionality reduction and its relationship with decorrelation are a nontrivial result of the deep computation.

IV. SUMMARY

Brain computation can be thought of as a transformation of internal representations along different stages of a hierarchy [3,21]. Deep artificial neural networks can also be interpreted as a way of creating progressively better representations of input sensory data. Our work provides mean-field evidence for this picture: compact representations of relatively low dimensionality are progressively created by deep computation, while a small level of correlations is still maintained to make feature extraction possible, in accord with the redundancy reduction hypothesis [12]. In deep computation, more abstract concepts captured at higher layers of the hierarchy are typically built upon less abstract ones at lower layers, and high-level representations are generally invariant to local changes of the input [2], which coincides with our theory demonstrating a compact (compressed) representation formed by a series of dimensionality reductions. Unwanted variability may be suppressed in this compressed representation. It has been hypothesized that neuronal manifolds at lower layers are strongly entangled with each other, while at later stages, manifolds are flattened so that relevant information can be easily decoded by downstream areas [1,2,21], which connects to the small level of correlations preserved in the network for a representation that may be maximally disentangled [28].

Our work thus provides a theoretical underpinning of hierarchical representations, through a physics explanation of dimensionality reduction and decorrelation, which encourages several directions, such as generalization of this theory to more complex architectures and data distributions, demonstration of how the compact representation helps generalization (invariance) or discrimination (selectivity) in a neural system [29,30], and using the revealed principles to control the complexity of internal representations for an engineering application.
ACKNOWLEDGMENTS

I thank Hai-Jun Zhou and Chang-Song Zhou for their insightful comments. This research was supported by AMED under Grant No. JP18dm020700 and the start-up budget Grant No. 74130-18831109 of the 100 Talents Program of Sun Yat-sen University.

APPENDIX A: DERIVATION OF MEAN-FIELD EQUATIONS FOR THE COMPLEXITY PROPAGATION IN DEEP RANDOM NEURAL NETWORKS

We first derive the mean-field equation for the mean activity m_i^l by noting that its preactivation ã_i^l + b_i^l = a_i^l + [w^l m^{l-1}]_i + b_i^l. Here a_i^l behaves like a Gaussian random variable with zero mean and variance Λ_ii^l, depending on the fluctuating input; thus the average for the mean activity can be computed as a Gaussian integral:

m_i^l = ⟨h_i^l⟩ = ∫ Dt φ(√(Λ_ii^l) t + [w^l m^{l-1}]_i + b_i^l).   (A1)

Analogously, to compute the covariance C_ij^l, one first evaluates the statistics of the preactivations of units i and j, i.e., a_i^l and a_j^l, which follow a joint Gaussian distribution with zero mean, variances Λ_ii^l and Λ_jj^l, respectively, and covariance Λ_ij^l. Therefore, one can use two independent standard Gaussian random variables with zero mean and unit variance to parametrize this joint distribution, which results in

C_ij^l = ∫ Dx Dy φ(√(Λ_ii) x + b_i^l + [w^l m^{l-1}]_i) × φ(√(Λ_jj) (Ψ x + y √(1 - Ψ^2)) + b_j^l + [w^l m^{l-1}]_j) - m_i^l m_j^l,   (A2)

where Dx = e^{-x^2/2} dx/√(2π) and Ψ = Λ_ij/√(Λ_ii Λ_jj). The superscript l is omitted for the covariance of a^l. It is easy to verify that the above parametrization of a_i^l and a_j^l reproduces the statistics mentioned above.
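As a quick numerical check of this parametrization (an added sketch with arbitrary example values of the covariances, not part of the original derivation), the pair (√Λ_ii x, √Λ_jj (Ψx + y√(1-Ψ^2))) reproduces the required variances and covariance:

import numpy as np

rng = np.random.default_rng(0)
lam_ii, lam_jj, lam_ij = 0.8, 0.5, 0.2        # assumed example values
psi = lam_ij / np.sqrt(lam_ii * lam_jj)
x, y = rng.normal(size=(2, 1_000_000))
a_i = np.sqrt(lam_ii) * x
a_j = np.sqrt(lam_jj) * (psi * x + y * np.sqrt(1 - psi ** 2))
# Empirical second moments should match (lam_ii, lam_jj, lam_ij).
print(np.mean(a_i * a_i), np.mean(a_j * a_j), np.mean(a_i * a_j))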
APPENDIX B: THEORETICAL ANALYSIS IN THE LARGE-N LIMIT

In the thermodynamic limit, the covariance of the activation a^l, Λ_ij^l, is of the order of O(1/√N), where N is the network width. This is because both the weights and the (connected) correlations are of the order of O(1/√N). Thus, one can expand the covariance equation [Eq. (A2)] in the small-Λ_ij limit. After the expansion and some simple algebra, one obtains

C_ij^l = K_ij Λ_ij + O(Λ_ij^2),   (B1)

where K_ij = φ'(x_i^0) φ'(x_j^0) and x_{i,j}^0 = b_{i,j}^l + [w^l m^{l-1}]_{i,j}. It follows that

Δ^l = [2/(N(N-1))] Σ_{i<j} K_ij^2 Λ_ij^2 ≈ g^2 κ^2 Δ^{l-1} + (g^2 κ^2/N^2) Σ_i (C_ii^{l-1})^2,   (B2)

where κ ≡ E[(φ'(x_i^0))^2], the expectation E[·] being taken over the quenched disorder, and we used E[K_ij^2] ≈ κ^2, based on the facts that the correlation between different weights is negligible and that the covariance of the mean preactivations of different units is negligible as well in the thermodynamic limit.

Finally, E[(φ'(x_i^0))^2] = ∫ Dt ∫ Du [φ'(√(σ_b) u + √(g Q^{l-1}) t)]^2, where one independent Gaussian random variable (u) accounts for the randomness of the bias, and the other Gaussian random variable (t) accounts for the random mean preactivation [w^l m^{l-1}]_i induced by the random weights. In addition, gQ specifies the variance of the mean preactivation, with the definition Q = (1/N) Σ_i m_i^2 (the spin-glass order parameter in physics). Clearly, Q can be iterated from one layer to the next as follows:

Q^l = ∫ Dt ∫ Du φ^2(√(σ_b) u + √(g Q^{l-1}) t).   (B3)

Note that the initial Q^0 = 0 by construction of the random model.
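The double Gaussian integral in Eq. (B3) can be evaluated by Gauss-Hermite quadrature; the sketch below (helper names and node count are my own choices) iterates Q^l starting from Q^0 = 0:

import numpy as np

def gauss_std_normal(n=80):
    """Nodes/weights such that E_{z ~ N(0,1)}[f(z)] = sum_k w_k f(z_k)."""
    x, w = np.polynomial.hermite.hermgauss(n)
    return np.sqrt(2.0) * x, w / np.sqrt(np.pi)

def next_Q(Q_prev, g, sigma_b, nodes, weights):
    """Q^l = int Dt Du tanh^2(sqrt(sigma_b) u + sqrt(g Q^{l-1}) t), Eq. (B3)."""
    t = nodes[:, None]
    u = nodes[None, :]
    val = np.tanh(np.sqrt(sigma_b) * u + np.sqrt(g * Q_prev) * t) ** 2
    return (weights[:, None] * weights[None, :] * val).sum()

nodes, weights = gauss_std_normal()
g, sigma_b, Q = 0.8, 0.1, 0.0
for l in range(1, 9):
    Q = next_Q(Q, g, sigma_b, nodes, weights)
    print(l, Q)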
Looking at l = 1 for the random model, one finds E[(φ'(x_i^0))^2] = ∫ Du [φ'(√(σ_b) u)]^2 ≡ κ, as defined in the main text, because Q^0 = 0. It follows that Δ^1 = g^2 κ^2 Δ^0 + g^2 κ^2/N. Following the definition of the normalized dimensionality [D̃ = [tr(C)]^2/(N tr(C^2)), an alternative form of the dimensionality definition in the main text], one easily arrives at the relationship between D̃^1 = 1/[(N-1)Δ^0 + 1 + ϒ] and D̃^0 = 1/[(N-1)Δ^0 + 1], noting that ϒ ≡ E[(φ'(x_i^0))^4]/(E[(φ'(x_i^0))^2])^2 ≥ 1.

To derive the relationship between the dimensionalities of consecutive layers, we first define K_1^l = (1/N) Σ_i C_ii^l and K_2^l = (1/N) Σ_i (C_ii^l)^2, and then the normalized dimensionality of layer l reads

D̃^l = (K_1^l)^2 / [(N-1)Δ^l + K_2^l],   (B4)

which is compared with its counterpart at the higher layer l+1, given by

D̃^{l+1} = (K_1^l)^2 / [(N-1)Δ^l + K_2^l + (K_1^l)^2 ϒ].   (B5)

To derive Eq. (B5), we used Eq. (B2). Because the additive term (K_1^l)^2 ϒ in the denominator is always positive, Eq. (B5) explains the dimensionality reduction across layers. The value of the additive term thus determines how significantly the dimensionality is reduced. Its behavior with an increasing number of layers can also be analyzed within the large-N expansion.
First, we derive the recursion equation for K_1^l. Using the fact that Λ_ii^l ≈ g K_1^{l-1}, one derives

K_1^l = ∫ Du ∫ Dt φ^2(√(g(K_1^{l-1} + Q^{l-1})) t + √(σ_b) u) - Q^l,   (B6)

by following the same principle as mentioned above. Second, according to the definition, it is easy to write

ϒ = ∫ Dt ∫ Du [φ'(√(σ_b) u + √(g Q^{l-1}) t)]^4 / {∫ Dt ∫ Du [φ'(√(σ_b) u + √(g Q^{l-1}) t)]^2}^2.   (B7)

Last, we find that the additive positive term tends to a very small value as the number of layers increases (Fig. 4), which is consistent with the observation in a finite-N system. This implies that the estimated dimensionality at deep layers becomes nearly constant, due to the nearly vanishing additive term.

FIG. 4. The behavior of the additive term (K_1^l)^2 ϒ as a function of the network depth.
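The behavior summarized in Fig. 4 can be explored by iterating Eqs. (B3), (B6), and (B7) with Gauss-Hermite quadrature; the sketch below is an illustrative implementation under the assumption of unit input variances (K_1^0 = 1, Q^0 = 0), not the original code, and prints the additive term (K_1^l)^2 ϒ layer by layer.

import numpy as np

x, w = np.polynomial.hermite.hermgauss(80)
z, wz = np.sqrt(2.0) * x, w / np.sqrt(np.pi)       # E_{N(0,1)}[f] = sum wz f(z)

def gauss2(f):
    """Double average over two independent standard Gaussians t and u."""
    t, u = z[:, None], z[None, :]
    return (wz[:, None] * wz[None, :] * f(t, u)).sum()

g, sigma_b = 0.8, 0.1
Q, K1 = 0.0, 1.0                                    # previous-layer values: Q^0 = 0, K_1^0 = 1
for l in range(1, 9):
    # Eq. (B7): Upsilon from the fourth and second moments of phi'(x^0) at Q^{l-1}.
    dphi2 = gauss2(lambda t, u: (1 - np.tanh(np.sqrt(sigma_b) * u + np.sqrt(g * Q) * t) ** 2) ** 2)
    dphi4 = gauss2(lambda t, u: (1 - np.tanh(np.sqrt(sigma_b) * u + np.sqrt(g * Q) * t) ** 2) ** 4)
    upsilon = dphi4 / dphi2 ** 2
    # Eqs. (B3) and (B6): update Q^l and K_1^l.
    Q_new = gauss2(lambda t, u: np.tanh(np.sqrt(sigma_b) * u + np.sqrt(g * Q) * t) ** 2)
    K1_new = gauss2(lambda t, u: np.tanh(np.sqrt(sigma_b) * u + np.sqrt(g * (K1 + Q)) * t) ** 2) - Q_new
    print(l, K1_new ** 2 * upsilon)                 # additive term at layer l
    Q, K1 = Q_new, K1_new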


APPENDIX C: TRAINING PROCEDURE OF DEEP BELIEF NETWORKS

A deep belief network (DBN) is composed of multiple restricted Boltzmann machines (RBMs) stacked on top of each other. It is a probabilistic deep generative model: after the network parameters (weights and biases) are learned (so-called training) from a data distribution, the model can be used to reproduce samples mimicking the data distribution. With deep layers, the network becomes more expressive in capturing high-order interdependence among components of a high-dimensional input, compared with a shallow RBM network.

Learning in a deep belief network can be achieved by layerwise training of each RBM in a bottom-up pass, which was justified as improving a variational lower bound on the data log-likelihood [24]. An RBM is a two-layered neural network in which there are no lateral connections within each layer, and the bottom (top) layer is also named the visible (hidden) layer. Therefore, the RBM is described by the following energy function (also named the Hamiltonian in physics):

E(s, σ) = - Σ_{i,a} s_a w_{ia} σ_i - Σ_a b_a^h s_a - Σ_i b_i^v σ_i,   (C1)

where s and σ are the hidden and visible activity vectors, respectively, and b^{h,v} is the hidden (h) or visible (v) bias vector. In statistical mechanics, the neural activity follows a Boltzmann distribution, P(s, σ) = exp[-E(s, σ)]/Z, where Z is the partition function of the model. For a large network, Z can only be computed by approximate methods [16]. This distribution can be used to fit an arbitrary discrete distribution, following the maximum likelihood learning principle; i.e., the network parameters are updated according to gradient ascent of the data log-likelihood, defined as follows:

L = ⟨Σ_a ln{2 cosh(Σ_i w_{ia} σ_i + b_a^h)}⟩_data + ⟨Σ_i b_i^v σ_i⟩_data - ln Z,   (C2)

where the average is performed over all training data samples (or over a mini-batch of the entire data set in the case where stochastic gradient ascent is used). The gradient ascent leads to the following learning equations for updating the network parameters:

Δw_{ia} = η[⟨σ_i ⟨s_a⟩_{p(s_a|σ)}⟩_data - ⟨s_a σ_i⟩_model],   (C3a)

Δb_i^v = η[⟨σ_i⟩_data - ⟨σ_i⟩_model],   (C3b)

Δb_a^h = η[⟨⟨s_a⟩_{p(s_a|σ)}⟩_data - ⟨s_a⟩_model],   (C3c)

where η specifies the learning rate. Since there are no lateral connections between neurons within each layer, given one layer's activity, the other layer's activity is factorized as

p(s|σ) = ∏_a p(s_a|σ) = ∏_a e^{s_a([wσ]_a + b_a^h)} / {2 cosh([wσ]_a + b_a^h)}.   (C4)

Similarly, p(σ|s) is also factorized.

There are many approximate methods to evaluate the model-dependent terms in the learning equations. Here, we use the most popular one, namely contrastive divergence [24]. More precisely, the RBMs are trained in a feedforward fashion using the contrastive divergence algorithm [24], where Gibbs sampling of the model starting from each data point is truncated to a few steps and then used to compute the model-dependent statistics for learning. The upper layer is trained with the lower layer's parameters frozen. During the training of each RBM, the visible inputs are set to the mean activities of the hidden neurons at the lower layer, while the hidden neurons of the upper layer still adopt stochastic binary values according to Eq. (C4). With this layerwise training, each layer learns a nonlinear transformation of the data, and upper layers are conjectured to learn more abstract (complex) concepts, which is a key step in object and speech recognition problems [25].
The DBN learns a data distribution generated by a random RBM whose parameters follow the normal distribution N(0, g/N) for weights and N(0, σ_b) for biases. g = 0.8 and σ_b = 0.1 unless otherwise specified. Using the RBM as a data generator allows us to control the complexity of the input data. In addition, the RBM has been used to model many real data sets (e.g., handwritten digits [24]). In numerical simulations (Fig. 3 in the main text), we generate M = 60 000 training examples (each example is an N-dimensional vector) from the random RBM. These examples are then learned by the RBMs in the DBN. We divide the entire data set into mini-batches of size B = 150. One epoch corresponds to one sweep of the entire data set. Each RBM is trained for tens of epochs, until the reconstruction error (ε ≡ ||h - h'||_2^2) between the input h and the reconstructed input h' no longer decreases. We use an initial learning rate of 0.12 divided by t/10 at the tth epoch, and an L2 weight-decay parameter of 0.0025.
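Such training data can be generated from the random RBM by block Gibbs sampling; the following sketch (function name, number of sweeps, and the use of independent parallel chains are assumptions on my part) produces M visible configurations.

import numpy as np

def sample_random_rbm(N, M, g, sigma_b, n_sweeps=200, rng=None):
    """Draw M visible configurations (+/-1) from a random RBM with N visible and
    N hidden units, weights ~ N(0, g/N) and biases ~ N(0, sigma_b), by alternating
    the factorized conditionals [Eq. (C4) and its visible counterpart]."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(0.0, np.sqrt(g / N), size=(N, N))
    b_v = rng.normal(0.0, np.sqrt(sigma_b), size=N)
    b_h = rng.normal(0.0, np.sqrt(sigma_b), size=N)
    v = np.where(rng.random((M, N)) < 0.5, 1.0, -1.0)      # M independent chains
    for _ in range(n_sweeps):
        h = np.where(rng.random((M, N)) < (1 + np.tanh(v @ W + b_h)) / 2, 1.0, -1.0)
        v = np.where(rng.random((M, N)) < (1 + np.tanh(h @ W.T + b_v)) / 2, 1.0, -1.0)
    return v, (W, b_v, b_h)

# e.g., the setting of Fig. 3: N = 150, M = 60000, g = 0.8, sigma_b = 0.1.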
APPENDIX D: ESTIMATING THE COVARIANCE STRUCTURE OF A RANDOM RESTRICTED BOLTZMANN MACHINE

To study the complexity propagation in the DBN, it is necessary to evaluate the statistics of the input data distribution, which is provided by a random RBM in the model setup of stochastic neural networks. This is because the estimated covariance can be used as a starting point from which the mean-field complexity-propagation equation iterates. In addition, characterizing the RBM representation may provide insights into deep representations, since the RBM is a building block for deep models and, moreover, a universal approximator of discrete distributions [31].

Given the RBM, the hidden neural activity at the higher layer (e.g., s) can be marginalized over using the conditional independence [Eq. (C4)]; thus the distribution of the representation at the lower layer (e.g., σ) can be expressed as

P(σ) = Σ_s P(s, σ) = (1/Z) ∏_a {2 cosh([w^{l+1} σ]_a + b_a^h)} e^{Σ_i σ_i b_i^v},   (D1)

where Z is the partition function, which is intractable for large N. To study the statistics of the RBM representation, we need to compute the free energy of Eq. (D1), defined as F = -ln Z, where a unit inverse temperature is assumed. We use the Bethe approximation to compute an approximate free energy F_Bethe. In physics, the Bethe approximation assumes P(σ) ≈ ∏_a P_a(σ_∂a) ∏_i P_i(σ_i)^{1-N} [32], where a (∂a) indicates a factor node (its neighbors) representing the contribution of one hidden neuron to the joint probability [Eq. (D1)] in a factor-graph representation [16]. P_a and P_i can be obtained from a variational principle of free-energy optimization [16]. This approximation takes into account the correlations induced by the nearest neighbors of each neuron in the factor graph, and thus improves on the naive mean-field approximation, in which neurons are assumed independent.

The covariance of neural activity under Eq. (D1) (the so-called connected correlation in physics) can be computed from the approximate free energy using linear response theory. However, due to the approximation, there exists a statistical inconsistency for the diagonal terms computed under the Bethe approximation, i.e., C_ii = 1 - m_i^2. Therefore, we impose the statistical consistency of the diagonal terms on a corrected free energy, F̃_Bethe = F_Bethe - (1/2) Σ_i Ω_i (1 - m_i^2) [33-35]. Following a procedure similar to our previous work [16], we obtain the following mean-field iterative equations:

m_{i→a} = tanh(b_i^v - Ω_i m_i + Σ_{a'∈∂i\a} u_{a'→i}),   (D2a)

u_{a'→i} = (1/2) ln[cosh(b_{a'}^h + G_{a'→i} + w_{ia'}) / cosh(b_{a'}^h + G_{a'→i} - w_{ia'})],   (D2b)

where G_{a'→i} ≡ Σ_{j∈∂a'\i} w_{ja'} m_{j→a'} and the correction introduces an Onsager term (-Ω_i m_i). The cavity magnetization m_{i→a} can be understood as the message passed from the visible node i to the factor node a, while the cavity bias u_{a'→i} is interpreted as the message passed from the factor node a' to the visible node i. In fact, Eq. (D2) is not closed: {Ω_i} must be computed based on correlations. Therefore, we define the cavity susceptibility χ_{i→a,k} ≡ ∂m_{i→a}/∂b_k^v [36]. According to this definition and linear response theory, we close Eq. (D2) by obtaining the following susceptibility-propagation equations:

χ_{i→a,k} = (1 - m_{i→a}^2) Σ_{a'∈∂i\a} Γ_{a'→i} P_{a'→i,k} + δ_{ik}(1 - m_{i→a}^2) - Ω_i C_{ik},   (D3a)

C_{ik} = (1 - m_i^2) F_{ik} / [1 + (1 - m_i^2) Ω_i],   (D3b)

Ω_i = (F_{ii} - 1)/(1 - m_i^2),   (D3c)

where the full magnetization m_i = tanh(b_i^v - Ω_i m_i + Σ_{a'∈∂i} u_{a'→i}), Γ_{a→i} ≡ tanh(w_{ia})[1 - tanh^2(b_a^h + G_{a→i})] / [1 - tanh^2(b_a^h + G_{a→i}) tanh^2(w_{ia})], P_{a→i,k} ≡ Σ_{j∈∂a\i} χ_{j→a,k} w_{ja}, and F_{ik} ≡ Σ_{a∈∂i} Γ_{a→i} P_{a→i,k} + δ_{ik}. It is easy to verify that Eq. (D3) leads to consistency for the diagonal terms. Adding the diagonal constraint through the Lagrange multiplier Ω not only solves the diagonal-inconsistency problem but also improves the accuracy of estimating the off-diagonal terms. After the RBM parameters (weights and biases) are specified, we run the above iterative equations [Eqs. (D2) and (D3)] from a random initialization of the messages and estimate the covariance and the associated representation dimensionality from the fixed point. These statistics are used as an initialization condition for the complexity-propagation equation [Eq. (1) in the main text] that is used to study the expressive property of the DBN with trained weights and biases.
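For orientation, a simplified sketch of the message-passing iteration in Eq. (D2) is given below, with the correction Ω_i set to zero (i.e., plain Bethe messages without the diagonal-consistency closure of Eq. (D3)); the damping, initialization, and function names are my own choices rather than the scheme used for the published results.

import numpy as np

def bethe_messages(W, b_v, b_h, n_iter=200, damping=0.5, rng=None):
    """Iterate the cavity messages of Eq. (D2) with Omega_i = 0.

    W: (Nv, Nh) couplings w_{ia}.  M[i, a] stores m_{i->a}; U[i, a] stores u_{a->i}.
    Returns the visible magnetizations m_i at the (approximate) fixed point.
    """
    rng = np.random.default_rng() if rng is None else rng
    Nv, Nh = W.shape
    M = rng.uniform(-0.1, 0.1, size=(Nv, Nh))
    U = np.zeros((Nv, Nh))
    for _ in range(n_iter):
        # Cavity fields G_{a->i} = sum_{j != i} w_{ja} m_{j->a}.
        G = (W * M).sum(axis=0)[None, :] - W * M
        theta = b_h[None, :] + G
        U_new = 0.5 * (np.log(np.cosh(theta + W)) - np.log(np.cosh(theta - W)))
        U = damping * U + (1 - damping) * U_new
        # Cavity magnetizations m_{i->a} = tanh(b_i^v + sum_{a' != a} u_{a'->i}).
        total_u = U.sum(axis=1)[:, None]
        M = np.tanh(b_v[:, None] + total_u - U)
    return np.tanh(b_v + U.sum(axis=1))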
We finally remark that, for a trained RBM, some components of the correlation matrix may lose the symmetry property (C_ij = C_ji), likely because the above Bethe (cavity-based) approximation is incapable of dealing with an irregular distribution of learned connection weights. The irregularity means that the distribution is divided into two parts: the bulk part is concentrated around zero, while the other part is dominated by a few large values of weights (as also observed recently in spectral
dynamics of learning in RBMs [37]). Our mean-field formula [Eqs. (D2) and (D3)] may offer a basis for further improvement to address this interesting special property, although one can enforce the symmetry by taking [C_ij + C_ji]/2 in our mean-field formula.

APPENDIX E: THE EIGENVALUE DISTRIBUTION OF THE COVARIANCE MATRIX

To analyze the eigenvalue distribution of the covariance matrix at each layer of deep networks, we first construct a random-sample covariance matrix. More precisely, we consider a real Wishart ensemble, where a random-sample covariance matrix is defined as (1/N) ξ ξ^T, in which ξ is an N × P matrix whose entries independently follow a normal distribution N(0, ς^2). In fact, ξ can be thought of as a random uncorrelated pattern matrix. To compare the real Wishart ensemble with the covariance matrix estimated from the mean-field theory of deep networks, we choose P = N, and ς^2 is obtained by matching the range of the eigenvalues. The designed random-sample covariance matrix obeys the Marchenko-Pastur law for the density of eigenvalues [26,27]:

μ(λ) = (1/(2π λ ς^2)) [(λ - λ_-)(λ_+ - λ)]^{1/2},   (E1)

where λ ∈ [λ_-, λ_+], λ_- = 0, and λ_+ = 4ς^2.

FIG. 5. The eigenvalue distribution of the covariance matrix estimated from the deep computation. (a) The distribution for the deep networks with random weights (N = 100). One hundred instances are used. In the inset, the distribution at deeper layers is compared with the Marchenko-Pastur law of a random-sample covariance matrix. The tail part is enlarged for comparison. (b) The distribution obtained from an unsupervised learning system of deep belief networks is strongly different from the Marchenko-Pastur law. Ten instances are used.

For the random neural networks, the eigenvalue distribution at deep layers seems to assign a higher probability density when the eigenvalue gets close to zero, yet a lower density at the tail of the distribution [Fig. 5(a)], compared with the Marchenko-Pastur law of a random-sample covariance matrix. For the unsupervised learning system, we find that the distribution deviates significantly from the Marchenko-Pastur law of a Wishart ensemble: the distribution has a Gaussian-like bulk part together with a long tail [Fig. 5(b)]. Therefore, the dimensionality reduction and its relationship with decorrelation are a nontrivial result of the deep computation.
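The Marchenko-Pastur density of Eq. (E1) can be evaluated and compared with the spectrum of a sampled random-sample covariance matrix; a short sketch (parameter choices are illustrative, function name mine) follows.

import numpy as np

def marchenko_pastur_density(lam, var):
    """Eq. (E1) for P = N: support [0, 4*var]."""
    lam_plus = 4.0 * var
    inside = (lam > 0) & (lam < lam_plus)
    mu = np.zeros_like(lam, dtype=float)
    mu[inside] = np.sqrt(lam[inside] * (lam_plus - lam[inside])) / (2 * np.pi * lam[inside] * var)
    return mu

# Compare with the spectrum of an actual random-sample covariance matrix (1/N) xi xi^T.
N, var = 100, 1.0
rng = np.random.default_rng(0)
xi = rng.normal(0.0, np.sqrt(var), size=(N, N))
eigs = np.linalg.eigvalsh(xi @ xi.T / N)
grid = np.linspace(1e-3, 4 * var, 400)
print(eigs.min(), eigs.max(), marchenko_pastur_density(grid, var).max())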

[1] J. J. DiCarlo and D. D. Cox, Trends Cognit. Sci. 11, 333 (2007).
[2] J. J. DiCarlo, D. Zoccolan, and N. C. Rust, Neuron 73, 415 (2012).
[3] N. Kriegeskorte and R. A. Kievit, Trends Cognit. Sci. 17, 401 (2013).
[4] D. Marr, Proc. R. Soc. London, Ser. B 176, 161 (1970).
[5] G. E. Hinton and R. R. Salakhutdinov, Science 313, 504 (2006).
[6] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Red Hook, NY, 2016), pp. 3360-3368.
[7] B. Babadi and H. Sompolinsky, Neuron 83, 1213 (2014).
[8] J. Kadmon and H. Sompolinsky, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Red Hook, NY, 2016), pp. 4781-4789.
[9] E. Domany, W. Kinzel, and R. Meir, J. Phys. A: Math. Gen. 22, 2081 (1989).
[10] G. F. Elsayed and J. P. Cunningham, Nat. Neurosci. 20, 1310 (2017).
[11] U. Guclu and M. A. J. van Gerven, J. Neurosci. 35, 10005 (2015).
[12] H. Barlow, in Sensory Communication, edited by W. Rosenblith (MIT, Cambridge, MA, 1961), pp. 217-234.
[13] J. Tubiana and R. Monasson, Phys. Rev. Lett. 118, 138301 (2017).
[14] S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, arXiv:1710.06570.
[15] A. Barra, G. Genovese, P. Sollich, and D. Tantari, Phys. Rev. E 97, 022310 (2018).
[16] H. Huang and T. Toyoizumi, Phys. Rev. E 91, 050101 (2015).
[17] H. Huang and T. Toyoizumi, Phys. Rev. E 94, 062310 (2016).
[18] H. Huang, J. Stat. Mech.: Theory Exp. (2017) 053302.
[19] J. Pennington and Y. Bahri, in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research, edited by D. Precup and Y. W. Teh (PMLR, International Convention Centre, Sydney, Australia, 2017), pp. 2798-2806.
[20] K. Rajan, L. Abbott, and H. Sompolinsky, in Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Curran Associates, Red Hook, NY, 2010), pp. 1975-1983.
[21] D. L. Yamins and J. J. DiCarlo, Nat. Neurosci. 19, 356 (2016).
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014).
[23] H. A. Bethe, Proc. R. Soc. London, Ser. A 150, 552 (1935).
[24] G. Hinton, S. Osindero, and Y. Teh, Neural Comput. 18, 1527 (2006).
[25] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, Commun. ACM 54, 95 (2011).
[26] V. Marcenko and L. Pastur, Math. USSR-Sb. 1, 457 (1967).
[27] R. Cherrier, D. S. Dean, and A. Lefévre, Phys. Rev. E 67, 046112 (2003).
[28] A. Achille and S. Soatto, arXiv:1706.01350.
[29] H. Hong, D. L. Yamins, N. J. Majaj, and J. J. DiCarlo, Nat. Neurosci. 19, 613 (2016).
[30] M. Pagan, L. S. Urban, M. P. Wohl, and N. C. Rust, Nat. Neurosci. 16, 1132 (2013).
[31] N. Le Roux and Y. Bengio, Neural Comput. 20, 1631 (2008).
[32] M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University, Oxford, 2009).
[33] H. Huang and Y. Kabashima, Phys. Rev. E 87, 062129 (2013).
[34] J. Raymond and F. Ricci-Tersenghi, Phys. Rev. E 87, 052111 (2013).
[35] M. Yasuda and K. Tanaka, Phys. Rev. E 87, 012134 (2013).
[36] S. Higuchi and M. Mezard, J. Phys.: Conf. Ser. 233, 012003 (2010).
[37] A. Decelle, G. Fissore, and C. Furtlehner, Europhys. Lett. 119, 60001 (2017).
