
TherML: Thermodynamics of Machine Learning

Alexander A. Alemi 1 Ian Fischer 1

Google. Correspondence to: Alexander A. Alemi <alemi@google.com>. Presented at the ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models. Copyright 2018 by the author(s).

arXiv:1807.04162v1 [cs.LG] 11 Jul 2018

Abstract

In this work we offer a framework for reasoning about a wide class of existing objectives in machine learning. We develop a formal correspondence between this work and thermodynamics and discuss its implications.

The traditional approach to learning involves modelling. A practitioner describes a graphical model they believe generated the data, and then attempts to maximize the likelihood that the observed data was drawn from this model.

    Essentially all models are wrong, but some are useful. — George Box (Box & Draper, 1987)

Ideally we would be able to quantify, control and/or steer the wrongness of our model, but to do so requires a formal separation between the real world of our data and the imaginary world of our model. To connect the two we augment the real world with something under our direct control. To that end, we formulate representation learning.

1. A Tale of Two Worlds

Let X, Y be some paired data, where X is raw high dimensional data and Y is some relevant low dimensional description or target: for example, a set of images X and their labels Y. We imagine this data comes from some true unknown distribution, under which the data is conditionally independent given some true governing parameters φ. That is, we imagine the joint distribution:

    p({x, y}, φ) = p(φ) ∏_i p(x_i|φ) p(y_i|x_i, φ).    (1)

We factor the conditional joint distribution p(x_i, y_i|φ) as p(y_i|x_i, φ) p(x_i|φ). Here φ stands in for the rest of the universe, insomuch as it is relevant to our data generating procedure. Generally, this distribution is intractable, but we imagine we have or can generate a finite dataset of fair samples from it.

If the data was all we had, we couldn't do much in the way of learning or inference, especially considering we know nothing about the distribution describing our data. Since we want to shy away from directly modelling the data itself, our only choice is to augment the universe with some new random variables under our direct control. We are primarily interested in learning a stochastic representation of X, call it Z, defined by some parametric distribution of our own design, p(z_i|x_i, θ), with its own parameters θ. A training procedure is a process that assigns a distribution p(θ|{x, y}) (where {x, y} is shorthand for the entire set of x_i, y_i) to the parameters conditioned on the observed dataset. With our augmentations, the world now looks like the graphical model in Figure 1a, described by the joint density:

    p(x, y, φ, z, θ) = p(φ) p(θ|{x, y}) ∏_i p(x_i|φ) p(y_i|x_i, φ) p(z_i|x_i, θ).    (2)

World P is what we have. It is not necessarily what we want. What we have to contend with is an unknown distribution of our data. What we want is a world in which Z causally factors X and Y, acting as a latent variable for the pair, leaving no correlations unexplained. Similarly, we would prefer if we could easily marginalize out the dependence on our universal (Φ) and model specific (Θ) parameters. World Q in Figure 1b is the world we want.¹ It satisfies the joint density:

    q(x, y, φ, z, θ) = q(φ) q(θ) ∏_i q(z_i) q(x_i|z_i) q(y_i|z_i).    (3)

¹ We could consider different alternatives, deciding to relax some of the constraints we imposed in world Q, or generalizing world P by letting the representation depend on X and Y jointly, for instance. What follows demonstrates a general sort of calculus that we can invoke for any specified pair of graphical models. In particular, Appendix A discusses a closely related alternative where we desire q(x, y, z) = q(y|z)q(z|x).

Having expressed both the situation we are in and the one we desire, we can formally express how similar the two worlds are by computing the relative information, or relative entropy,
or KL divergence, between them. The relative information between two distributions,

    KL(p || q) = ∫ dχ p(χ) log [p(χ)/q(χ)] = ⟨log p(χ)/q(χ)⟩_P,    (4)

measures how much information (in bits, or nats in the case the natural logarithm is used) you gain when you discover you are in world P when what you expected was world Q (Cover & Thomas, 2012). This is often expressed as the excess bits required to encode a signal from P if you use an entropic code optimized for Q. Here and throughout, ⟨·⟩_A is used to denote expectations with respect to distribution A. It is worth noting that while the continuous analog of Shannon entropy, −∫ dx p(x) log p(x), is neither positive semidefinite nor reparameterization independent (Marsh, 2013), the relative entropy ∫ dx p(x) log [p(x)/q(x)] is both positive semidefinite and reparameterization independent, and so Equation 4 is perfectly well defined and behaved for continuous distributions.

Figure 1. Graphical models. (a) Graphical model for world P (nodes φ, X, Y, Z, θ), satisfying the joint density in Equation 2. Here blue denotes nodes outside of our control, while red nodes are under our direct control. (b) Graphical model for world Q, the world we desire, which satisfies the joint density in Equation 3.

Another useful quantity worth defining is the mutual information,

    I(X; Y) ≡ ⟨log p(x, y)/(p(x) p(y))⟩_P = ⟨log p(x|y)/p(x)⟩_P = ⟨log p(y|x)/p(y)⟩_P,    (5)

which measures the relative information between the joint distribution for X and Y and the product of their marginal distributions. That is, how much information you gain about X by observing Y, or vice versa. It is similarly well defined, positive semidefinite and reparameterization independent for continuous distributions (Cover & Thomas, 2012). The mutual information vanishes if and only if the two variables are independent, and diverges for continuous variables if they are related by an invertible transformation.

The relative information between worlds P and Q is:

    KL(p || q) = ⟨ log p(φ)/q(φ) + log p(θ|{x, y})/q(θ)
               + Σ_i [ log p(z_i|x_i, θ)/q(z_i) + log p(x_i|φ)/q(x_i|z_i) + log p(y_i|x_i, φ)/q(y_i|z_i) ] ⟩_P.    (6)

Here we have split the terms into reparameterization independent components.

In practice, worlds P and Q will often be inconsistent, so that it will not be possible to achieve a vanishing relative information (KL divergence) between the distributions. But amongst all possible distributions Q consistent with the conditional structure in the graphical model, we can ask what the minimum achievable KL is. For this we need to know the optimal Q distributions. The optimal q distributions are simply the corresponding distributions in our true world P, taking whichever marginalizations or conditionings are required. For instance, the q(x_i|z_i) that achieves the minimal KL is p(x_i|z_i), the marginal posterior for x_i given z_i, integrating out φ and θ. Setting all Q distributions to their optimal values, it can be shown (Friedman et al., 2001) that the minimal relative information obtained is given by the difference in the multi-informations of each graph (I_P, I_Q). This is the well known information projection (Csiszár & Matúš, 2003). The multi-information in a graph is a sum of mutual informations between each node and its parents:

    I_P ≡ ⟨log p(x, y, φ, z, θ)/(p(x) p(y) p(φ) p(z) p(θ))⟩_P
        = I(Θ; {X, Y}) + Σ_i [ I(X_i; Φ) + I(Y_i; X_i, Φ) + I(Z_i; X_i, Θ) ]    (7)

    I_Q = Σ_i [ I(X_i; Z_i) + I(Y_i; Z_i) ].    (8)

In our case:

    KL(p || Q) ≡ min_{q∈Q} KL(p || q) = I_P − I_Q
               = I(Θ; {X, Y}) + Σ_i [ I(X_i; Φ) + I(Y_i; X_i, Φ) + I(Z_i; X_i, Θ) − I(X_i; Z_i) − I(Y_i; Z_i) ].    (9)
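To make these definitions concrete, here is a minimal numerical sketch of the discrete forms of Equations 4 and 5; the toy joint distributions are illustrative inputs, not anything from the paper's experiments:

```python
import numpy as np

def kl(p, q):
    # Relative information KL(p || q) of Equation 4, in nats, for discrete
    # distributions given as arrays of probabilities.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(pxy):
    # I(X; Y) of Equation 5: the KL between the joint distribution and the
    # product of its marginals.
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

# Perfectly correlated bits carry log 2 nats; independent bits carry none.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # ~0.6931 = log 2
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```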
This minimal relative information has two terms outside our control, which we can take to be constant: I(X_i; Φ) and I(Y_i; X_i, Φ). These terms measure the intrinsic complexity of our data. The remaining four terms are:

• I(X_i; Z_i) - which measures how much information our representation contains about the input (X). This should be maximized to make world P more resemble a world that satisfies the relations in Q.

• I(Y_i; Z_i) - which measures how much information our representation contains about our auxiliary data. This should be maximized as well.

• I(Z_i; X_i, Θ) - which measures how much information the parameters and input determine about our representation. This should be minimized to ensure consistency between worlds. Notice that this is similar to, but distinct from, the first term above:

    I(Z_i; X_i, Θ) = I(Z_i; X_i) + I(Z_i; Θ|X_i),    (10)

by the Chain Rule for mutual information.²

• I(Θ; {X, Y}) - which measures how much information we store about our training data in the parameters of our encoder. This should also be minimized.

² Given this relationship, we could actually reduce the total number of functionals we consider from 4 to 3, as discussed in Appendix A.

These mutual informations are all intractable in general, since we cannot compute the necessary marginals in closed form.

1.1. Functionals

Despite their intractability, we can compute variational bounds on these mutual informations.

• R ≡ Σ_i ⟨log [p(z_i|x_i, θ)/q(z_i)]⟩_P ≥ Σ_i I(Z_i; X_i, Θ)
The rate measures the complexity of our representation. It is the relative information of a sample specific representation z_i ∼ p(z|x_i, θ) with respect to our variational marginal q(z). It measures how many bits we actually encode about each sample.

• C ≡ −Σ_i ⟨log q(y_i|z_i)⟩_P ≥ Σ_i [H(Y_i) − I(Y_i; Z_i)] = Σ_i H(Y_i|Z_i)
The classification error measures the conditional entropy of Y left after conditioning on Z. It is a measure of how much information about Y is left unspecified in our representation. This functional measures our supervised learning performance.

• D ≡ −Σ_i ⟨log q(x_i|z_i)⟩_P ≥ Σ_i [H(X_i) − I(X_i; Z_i)] = Σ_i H(X_i|Z_i)
The distortion measures the conditional entropy of X left after conditioning on Z. It is a measure of how much information about X is left unspecified in our representation. This functional measures our unsupervised learning performance.

• S ≡ ⟨log [p(θ|{x, y})/q(θ)]⟩_P ≥ I(Θ; {X, Y})
The relative entropy in our parameters, or just entropy for short, measures the relative information between the distribution we assign our parameters in world P after learning from the data {X, Y}, with respect to some data independent prior q(θ) on the parameters. This is an upper bound on the mutual information between the data and our parameters, and as such can measure our risk of overfitting.

1.2. Inequalities

Notice that S and R are both themselves upper bounds on mutual informations, and so must be positive semidefinite. Also, if our data is discrete, or if we have discretized it,³ D and C, which are both upper bounds on conditional entropies, must be positive as well.

³ More generally, if we choose some measure m(x), m(y) on both X and Y, we can define D and C in terms of that measure, e.g. D ≡ −⟨log [q(x|z)/m(x)]⟩_P ≥ H_m(X) − I(X; Z) = H_m(X|Z).

The distributions p(z|x, θ), p(θ|{x, y}), q(z), q(x|z), q(y|z) can be chosen arbitrarily. Once chosen, the functionals R, C, D, S take on well described values. The choice of the five distributional families specifies a single point in a four-dimensional space. This phase space is trivially bounded by the positive semi-definiteness of the functionals mentioned above and, less trivially, by the tighter bounds set by the mutual informations they estimate in the previous section. Furthermore, since these mutual informations are all defined in world P, which itself satisfies a particular set of conditional dependencies, the mutual informations are all interdependent. This means that not even all points in the positive quadrant of the R, C, D, S space can be reached. For example:

    R + C + D + S ≥ KL(p || Q) + Σ_i [ H(X_i) + H(Y_i) − I(X_i; Φ) − I(Y_i; X_i, Φ) ]
                  = KL(p || Q) + Σ_i [ H(X_i, Y_i) − I(X_i, Y_i; Φ) ].    (11)

This shows that even if the distributional choices agree perfectly (so that KL(p || Q) = 0), because the data and its governing distribution are outside our control and can be assumed to have non-vanishing entropies and mutual informations, combinations of these functionals will not be able to take on all values. In this case, the inherent complexity in the dataset forbids points too near the origin in the R, C, D, S plane.
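As an illustration of how these bounds are typically estimated in practice, the following is a minimal Monte Carlo sketch of the per-batch versions of R, C and D. The callables encode, log_q_z, log_q_x_given_z and log_q_y_given_z are hypothetical stand-ins for whatever distributional families are chosen; S is omitted because it requires access to a distribution over the parameters θ themselves (see Section 2.1):

```python
import numpy as np

def batch_functionals(x, y, encode, log_q_z, log_q_x_given_z, log_q_y_given_z):
    """Monte Carlo estimates of R, C, D on a minibatch.

    encode(x)             -> (z, log_p) with z_i ~ p(z|x_i, theta) and
                             log_p = log p(z_i|x_i, theta)
    log_q_z(z)            -> log of the variational marginal q(z)
    log_q_x_given_z(x, z) -> decoder log-likelihood log q(x|z)
    log_q_y_given_z(y, z) -> classifier log-likelihood log q(y|z)
    """
    z, log_p = encode(x)
    rate = np.mean(log_p - log_q_z(z))      # R: bounds I(Z; X, Theta)
    cls = -np.mean(log_q_y_given_z(y, z))   # C: bounds H(Y|Z), supervised term
    dist = -np.mean(log_q_x_given_z(x, z))  # D: bounds H(X|Z), unsupervised term
    return rate, cls, dist
```

Summing rather than averaging over examples recovers the functionals exactly as defined above; the averaged form is the per-example version commonly optimized.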
Similarly,

    R + D ≥ Σ_i [ H(X_i) + I(Z_i; X_i, Θ) − I(X_i; Z_i) ] = Σ_i [ H(X_i) + I(Z_i; Θ|X_i) ].    (12)

This mirrors the bound given in Alemi et al. (2018), where R + D ≥ H(X), which is still true given that all conditional mutual informations are positive semidefinite (H(X) + I(Z; Θ|X) ≥ H(X)), but here we obtain a tighter bound, with a term measuring how much information about our encoding is revealed by the parameters after conditioning on the input itself. This term, I(Z_i; Θ|X_i), is precisely the difference between our two closely related mutual informations governing the distortion between worlds (Equation 10).⁴

⁴ In Appendix A we consider taking this equality seriously, limiting the space to only three functionals: S, C and V ∼ I(Z_i; Θ|X_i).

Assuming we have access to infinitely powerful variational distributions for world Q, we can map out the optimal frontier in terms of R, C, D, S. It forms a convex polytope whose edges, faces and corners are identifiable as well known objectives and their optimal solutions.

1.3. Bounds on True Learning

We can also relate R, C, D and S to mutual informations of interest that are otherwise out of reach, for instance I(Z; Φ), which measures how well our Zs approximate the true latent variables for the data, Φ. Notice that given the conditional dependencies in world P, we have the following Markov chain:

    Z_i ↔ (X_i, Y_i, Θ) ↔ Φ,    (13)

and so by the Data Processing Inequality (Cover & Thomas, 2012):

    I(Z_i; Φ) ≤ I(Z_i; Θ, X_i, Y_i) = I(Z_i; X_i, Θ) + I(Z_i; Y_i|X_i, Θ) ≤ R,    (14)

where the term I(Z_i; Y_i|X_i, Θ) vanishes by the conditional independencies of world P. The rate R forms an upper bound on the mutual information between our encoding Z_i and the true governing parameters of our data Φ. Similarly, by the Data Processing Inequality we can establish that:

    I(Θ; Φ) ≤ I(Θ; {X, Y}) = S.    (15)

S upper bounds the amount of information our encoder's parameters Θ can contain about the true parameters Φ.

2. Optimal Frontier

As in Alemi et al. (2018), under mild assumptions about the distributional families, it can be argued that the surface is monotonic in all of its arguments. The optimal surface in the infinite family limit can be characterized as a convex polytope. In practice we will be in the realistic setting corresponding to finite parametric families, such as neural network approximators. We then expect an irrevocable gap to open up in the variational bounds. Any failure of the distributional families to model the correct corresponding marginal in P means that the space of all achievable R, C, D, S values will be some convex relaxation of the optimal surface. This surface will be described by some function f(R, C, D, S) = 0, which means we can identify points on the surface as a function of one functional with respect to the others (e.g. R = R(C, D, S)). Finding points on this surface equates to solving a constrained optimization problem, e.g.

    min_{q(z), q(x|z), q(y|z), p(z|x,θ), p(θ|{x,y})} R   such that   D = D₀, S = S₀, C = C₀.    (16)

Equivalently, we could solve the unconstrained Lagrange multipliers problem:

    min_{q(z), q(x|z), q(y|z), p(z|x,θ), p(θ|{x,y})} R + δD + γC + σS.    (17)

Here δ, γ, σ are Lagrange multipliers that impose the constraints. They each correspond to the partial derivative of the rate at the solution with respect to their corresponding functional, keeping the others fixed.

Notice that this single objective encompasses a wide range of existing techniques:

• If we retain C alone, we are doing traditional supervised learning, and our network will learn to be deterministic in its activations and parameters.

• If δ = 0, we no longer require a variational reconstruction network q(x|z) and are doing some form of supervised learning generally, though a more general form of VIB that similarly regularizes the parameters of our network.

• If δ = 0, σ = 0, we exactly recover the Variational Information Bottleneck (VIB) objective of Alemi et al. (2016) (where β = 1/γ), a form of stochastically regularized supervised learning that imposes a bottleneck on how much information our representation can retain about the input, while simultaneously maximizing the amount of information the representation contains about the target.
• If δ = 0 and σ, γ → ∞ in such a way as to keep the ratio β ≡ σ/γ fixed (that is, if we drop the R term and keep only C + βS as our objective), we recover the loss of Achille & Soatto (2017), presented as an alternative way to do Information Bottleneck (Tishby et al., 1999), but one that is stochastic in the parameters rather than the activations as in VIB.

• As a special case, if our objective is set to C + S (δ = 0, σ, γ → ∞, σ/γ → 1), we obtain the objective for a Bayesian neural network, à la Blundell et al. (2015).

• If we retain only D, we are training an autoencoder.

• If σ = 0, γ = 0, δ = 1, the objective is equivalent to the ELBO used to train a VAE (Kingma & Welling, 2014).

• If σ = 0, γ = 0 more generally, the objective is equivalent to a β-VAE (Higgins et al., 2017), where β = 1/δ.

• If γ = 0, all terms involving the auxiliary data Y drop out and we are doing some form of unsupervised learning without any variational classifier q(y|z). The presence of the S term makes this more general than a usual β-VAE, and should offer better generalization properties and control of overfitting by bottlenecking how much information we allow the parameters of our encoder to extract from the training data.

• Setting σ = 0, γ = α, δ = 1 recovers the semi-supervised objective of Kingma et al. (2014).

These correspondences are collected in the short code sketch below.
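This is an illustrative sketch only: the preset values are hypothetical, the cases that formally require σ, γ → ∞ are stood in for by large finite multipliers (the overall scale of the objective does not affect its minimizer), and the preset names are our own labels:

```python
def therml_objective(R, C, D, S, gamma, delta, sigma):
    # The Lagrangian of Equation 17; each preset below selects a point
    # (or a limit) on the optimal frontier.
    return R + delta * D + gamma * C + sigma * S

# Hypothetical presets, not values from the paper.
PRESETS = {
    "vib":             dict(gamma=2.0, delta=0.0, sigma=0.0),  # beta = 1/gamma
    "achille_soatto":  dict(gamma=1e6, delta=0.0, sigma=2e6),  # C + beta*S, beta = sigma/gamma
    "bayesian_nn":     dict(gamma=1e6, delta=0.0, sigma=1e6),  # C + S limit
    "autoencoder":     dict(gamma=0.0, delta=1e6, sigma=0.0),  # D dominates
    "vae_elbo":        dict(gamma=0.0, delta=1.0, sigma=0.0),
    "beta_vae":        dict(gamma=0.0, delta=0.5, sigma=0.0),  # beta = 1/delta
    "semi_supervised": dict(gamma=0.5, delta=1.0, sigma=0.0),  # gamma = alpha
}
```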
effective temperature σ can be controlled.
Examples of the behavior of all of these objectives on a simple toy model are shown in Appendix B.

Notice that all of these previous approaches describe low dimensional sub-surfaces of the optimal three dimensional frontier. These approaches were all interested in different domains; some focused on supervised prediction accuracy, others on learning a generative model. Depending on your specific problem and downstream tasks, different points on the optimal frontier will be desirable. However, instead of choosing a single point on the frontier, we can now explore a region on the surface to see what class of solutions is possible within the modeling choices. By simply adjusting the three control parameters δ, γ, σ, we can smoothly move across the entire frontier and interpolate between all of these objectives and beyond.

2.1. Optimization

So far we've considered explicit forms of the objective in terms of the four functionals. For S this would require some kind of tractable approximation to the posterior over the parameters of our encoding distribution. Alternatively, we can formally describe the exact solution to our minimization problem:

    min S   such that   R = R₀, C = C₀, D = D₀.    (18)

Recall that S measures the relative entropy of our parameter distribution with respect to the q(θ) prior. As such, the solution that minimizes the relative entropy subject to some constraints is a generalized Boltzmann distribution (Jaynes, 1957):

    p*(θ|{x, y}) = [q(θ)/Z] e^{−(R + δD + γC)/σ}.    (19)

Here Z is the partition function, the normalization constant for the distribution:

    Z = ∫ dθ q(θ) e^{−(R + δD + γC)/σ}.    (20)

This suggests an alternative method for finding points on the optimal frontier. We could turn the unconstrained Lagrange optimization problem, which required some explicit choice of tractable posterior distribution over parameters, into a sampling problem for a richer implicit distribution.

A naive way to draw samples from this posterior would be to use Stochastic Gradient Langevin Dynamics or its cousins (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015), which, in practice, would look like ordinary stochastic gradient descent (or its cousins, like momentum) on the objective R + δD + γC, with injected noise. By choosing the magnitude of the noise relative to the learning rate, the effective temperature σ can be controlled.
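A minimal sketch of such an update is below. The grad_energy callable is a hypothetical stand-in for the gradient of R + δD + γC − σ log q(θ); nothing here is specific to any particular model:

```python
import numpy as np

def sgld_step(theta, grad_energy, lr, sigma, rng):
    """One Langevin step targeting p*(theta) of Equation 19.

    grad_energy(theta) returns the gradient of
    U(theta) = R + delta*D + gamma*C - sigma*log q(theta),
    so the stationary distribution is proportional to exp(-U/sigma).
    The injected Gaussian noise has variance 2*lr*sigma: its scale relative
    to the learning rate sets the effective temperature, and sigma -> 0
    recovers plain gradient descent.
    """
    noise = rng.standard_normal(theta.shape) * np.sqrt(2.0 * lr * sigma)
    return theta - lr * grad_energy(theta) + noise
```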
There is increasing evidence that the stochastic part of stochastic gradient descent is itself enough to turn SGD less into an optimization procedure and more into an approximate posterior sampler (Mandt et al., 2017; Smith & Le, 2017; Achille & Soatto, 2017; Zhang et al., 2018), where hyperparameters such as the learning rate and batch size set the effective temperature. If ordinary stochastic gradient descent is doing something more akin to sampling from a posterior and less like optimizing to some minimum, it would help explain the improved performance obtained from ensemble averages of different points along training trajectories (Huang et al., 2017).

When viewed in this light, Equation 19 describes the optimal posterior for the parameters so as to ensure the minimal divergence between worlds P and Q. q(θ) plays the role of the prior over parameters, but our overall objective is minimized when

    q(θ) = p(θ) = ⟨p(θ|{x, y})⟩_{x,y},    (21)

that is, when our prior is the marginal of the posteriors over all possible datasets drawn from the true distribution. A fair draw from this marginal is to take a sample from the posterior obtained on a different but related dataset. Insomuch as ordinary SGD training is an approximate method for drawing a posterior sample, the common practice of fine-tuning a pretrained network on a related dataset amounts to using a sample from the optimal prior as our initial parameters. That fine-tuning approximates use of an optimal prior presumably helps explain its broad success.

If we identify our true goal not as optimizing some objective but instead as directly sampling from Equation 19, we can consider alternative approaches to define our learning dynamics, such as parallel tempering or population annealing (Machta & Ellis, 2011).

3. Thermodynamics

So far we have described a framework for learning that involves finding points that lie on a convex three-dimensional surface in terms of four functional coordinates R, C, D, S. Interestingly, this is all that is required to establish a formal connection to thermodynamics, which similarly is little more than the study of exact differentials (Sethna, 2006; Finn, 1993).

Whereas previous approaches connecting thermodynamics and learning (Parrondo et al., 2015; Still, 2017; Still et al., 2012) have focused on describing the thermodynamics and statistical mechanics of physical realizations of learning systems (i.e. the heat bath in these papers is a physical heat bath at finite temperature), in this work we make a formal analogy to the structure of the theory of thermodynamics, without any physical content.

3.1. First Law of Learning

The optimal frontier creates an equivalence class of states: the set of all states that minimize, as much as possible, the distortion introduced in projecting world P onto a set of distributions that respect the conditions in Q. The surface satisfies some equation f(R, C, D, S) = 0, which we can use to describe any one of these functionals in terms of the rest, e.g. R = R(C, D, S). This function is entire, and so we can equate partial derivatives of the function with differentials of the functionals:⁵

    dR = (∂R/∂C)_{D,S} dC + (∂R/∂D)_{C,S} dD + (∂R/∂S)_{C,D} dS.    (22)

⁵ (∂X/∂Y)_Z denotes the partial derivative of X with respect to Y holding Z constant.

Since the function is smooth and convex, instead of identifying the surface of optimal rates in terms of the functionals C, D, S, we could just as well describe the surface in terms of the partial derivatives by applying a Legendre transformation. We will name the partial derivatives:

    γ ≡ −(∂R/∂C)_{D,S}    (23)
    δ ≡ −(∂R/∂D)_{C,S}    (24)
    σ ≡ −(∂R/∂S)_{C,D}.    (25)

These measure the exchange rate for turning rate into reduced classification error, reduced distortion, or increased entropy, respectively.

The functionals R, C, D, S are analogous to extensive thermodynamic variables, such as volume, entropy, particle number, magnetization, charge, surface area, length and energy, which grow as the system grows, while the named partial derivatives γ, δ, σ are analogous to the intensive, generalized forces in thermodynamics corresponding to their paired state variables, such as pressure, temperature, chemical potential, magnetic field, electromotive force, surface tension, elastic force, etc. Just as in thermodynamics, the extensive functionals are defined for any state, while the intensive partial derivatives are only well defined for equilibrium states, which in our language are the states lying on the optimal surface.

Recasting our total differential,

    dR = −γ dC − δ dD − σ dS,    (26)

we create a law analogous to the First Law of Thermodynamics. In thermodynamics the First Law is often taken to be a statement about the conservation of energy, and by analogy here we could think about this law as a statement about the conservation of information. Granted, the actual content of the law is fairly vacuous, equivalent only to the statement that there exists a scalar function R = R(C, D, S) defining our surface and its partial derivatives.

3.2. Maxwell Relations

Requiring that Equation 26 be an exact differential has mathematically trivial but intuitively non-obvious implications that relate various partial derivatives of the system to one another, akin to the Maxwell Relations in thermodynamics. For example, requiring that mixed second partial derivatives be symmetric establishes that:

    ∂²R/∂D∂C = ∂²R/∂C∂D   ⟹   (∂δ/∂C)_D = (∂γ/∂D)_C.    (27)

This equates the result of two very different experiments. In the experiment encoded in the partial derivative on the left, one would measure the change in the derivative of the R−D curve (δ) as a function of the classification error (C) at fixed distortion (D). On the right, one would measure the change in the derivative of the R−C curve (γ) as a function of the distortion (D) at fixed classification error (C). As different as these scenarios appear, they are mathematically equivalent.
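The content of such a relation is easy to check numerically. The sketch below verifies Equation 27 by finite differences on a hypothetical smooth stand-in for the frontier R(C, D, S); any twice-differentiable surface would do:

```python
import numpy as np

def R(C, D, S):
    # Hypothetical smooth, monotone stand-in for the optimal frontier;
    # not derived from any model.
    return np.exp(-C - D) + np.exp(-S)

def d(f, i, args, h=1e-4):
    # Central finite difference of f in its i-th argument.
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

gamma = lambda C, D, S: -d(R, 0, (C, D, S))  # gamma = -(dR/dC) at fixed D, S
delta = lambda C, D, S: -d(R, 1, (C, D, S))  # delta = -(dR/dD) at fixed C, S

pt = (0.3, 0.7, 1.1)
print(d(delta, 0, pt))  # (d delta / dC) at fixed D
print(d(gamma, 1, pt))  # (d gamma / dD) at fixed C: agrees, per Equation 27
```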
We can also define other potentials, analogous to the alternative thermodynamic potentials such as enthalpy, free energy, and Gibbs free energy, by performing partial Legendre transformations. For instance, we can define a free rate:

    F(C, D, σ) ≡ R + σS    (28)

    dF = −γ dC − δ dD + S dσ.    (29)

The free rate measures the rate of our system, not as a function of S (something difficult to keep fixed), but in terms of σ, a parameter in our loss or optimal posterior. The free rate gives rise to other Maxwell relations, such as

    (∂S/∂C)_σ = −(∂γ/∂σ)_C,    (30)

which equates how much each additional bit of entropy (S) buys you in terms of classification error (C) at fixed effective temperature (σ), to a seemingly very different experiment where you measure the change in the effective supervised tension (γ, the slope of the R−C curve) with effective temperature (σ) at a fixed classification error (C).

We can additionally take and name higher order partial derivatives, analogous to the susceptibilities of thermodynamics like the bulk modulus, the thermal expansion coefficient, or the heat capacities. For instance, we can define the analog of a heat capacity for our system, a sort of rate capacity at constant distortion:

    K_D ≡ (∂R/∂σ)_D.    (31)

Just as in thermodynamics, these susceptibilities may offer useful ways to characterize and quantify the systematic differences between model families. Perhaps general scaling laws can be found relating susceptibilities to network widths, depths, numbers of parameters or dataset sizes. Divergences or discontinuities in the susceptibilities are the hallmark of phase transitions in physical systems, and it is reasonable to expect to see similar phenomena for certain models.

A great many first, second and third order partial derivatives in thermodynamics are given unique names, because those quantities are particularly useful for comparing different physical systems. We expect a subset of the first, second and higher order partial derivatives of the base functionals to prove similarly useful for comparing, quantifying, and understanding differences between modelling choices.

3.3. Zeroth Law of Learning

A central concept in thermodynamics is the notion of equilibrium. The so-called Zeroth Law of thermodynamics defines thermal equilibrium as a sort of reflexive property of systems (Finn, 1993): if system A is in thermal equilibrium with system C, and system B is separately in thermal equilibrium with system C, then systems A and B are in thermal equilibrium with each other. When any subpart of a system is in thermal equilibrium with any other subpart, the system is said to be in an equilibrium state.

In our framework, the points on the optimal surface are analogous to the equilibrium states, for which we have well defined partial derivatives. We can demonstrate that this notion of equilibrium agrees with a more intuitive notion of equilibrium between coupled systems. Imagine we have two different models, each characterized by its own set of distributions: model A is defined by p_A(z|x, θ), p_A(θ|{x, y}), q_A(z), and model B by p_B(z|x, θ), p_B(θ|{x, y}), q_B(z). Both models will have their own values for each of the functionals, R_A, S_A, D_A, C_A and R_B, S_B, D_B, C_B, and each model defines its own representation Z_A, Z_B. Now imagine coupling the models by forming the joint representation Z_C = (Z_A, Z_B), the concatenation of the two representations. The governing distributions over Z are then simply the products of the two models' distributions, e.g. q_C(z_C) = q_A(z_A) q_B(z_B). Thus the rate R_C and entropy S_C for the combined model are the sums of the individual models': R_C = R_A + R_B, S_C = S_A + S_B.

Now imagine we sample new states for the combined system which are maximally entropic, with the constraint that the combined rate stays constant:

    min S   s.t.   R = R_C   ⟹   p(θ|{x, y}) = [q(θ)/Z] e^{−R/σ}.    (32)

For the expectations of the two rates to be unchanged after the models have been coupled and evolved holding their total rate fixed, we must have

    −R_A/σ_A − R_B/σ_B = −R_C/σ_C = −(R_A + R_B)/σ_C   ⟹   σ_A = σ_B = σ_C.    (33)

Therefore, we can see that σ, the effective temperature, allows us to identify whether two systems are in thermal equilibrium with one another. Just as in thermodynamics, if two systems at different temperatures are coupled, some transfer takes place.

3.4. Second Law of Learning?

Even when doing deterministic training, training is non-invertible (Maclaurin et al., 2015), and we need to contend
with and track the entropy (S) term. We set the parameters of our networks initially with a fair draw from some prior distribution q(θ). The training procedure acts as a Markov process on the distribution of parameters, transforming it from the prior distribution into some modified distribution, the posterior p(θ|{x, y}). Optimization is a many-to-one function that, in the ideal limiting case, maps all possible initializations to a single global optimum. In this limiting case S would be divergent, and there is nothing to prevent us from memorizing the training set.

The Second Law of Thermodynamics states that the entropy of an isolated system tends to increase. All systems tend to disorder, and this places limits on the maximum possible efficiency of heat engines.

Formally, there are many statements akin to the Second Law of Thermodynamics that can be made about Markov chains generally (Cover & Thomas, 2012). The central one is that for any two distributions p_n, q_n both evolving according to the same Markov process (n marks the time step), the relative entropy KL(p_n || q_n) is monotonically decreasing with time. This establishes that for a stationary Markov chain, the relative entropy to the stationary state, KL(p_n || p_∞), monotonically decreases.⁶

⁶ For discrete state Markov chains, this implies that if the stationary distribution is uniform, the entropy of the distribution H(p_n) is strictly increasing.

In our language, we can make strong statements about dynamics that target points on the optimal frontier, or dynamics that implement a relaxation towards equilibrium. There is a fundamental distinction between states that live on the frontier and those off of it, analogous to the distinction between equilibrium and non-equilibrium states in thermodynamics.

Any equilibrium distribution can be expressed in the form of Equation (19) and identified by its partial derivatives γ, δ, σ. If we name the objective in Equation (17),

    J(γ, δ, σ) ≡ R + δD + γC + σS,    (34)

the value this objective takes for any equilibrium distribution can be shown to be given by the log partition function (Equation (20)):

    min J(γ, δ, σ) = −σ log Z(γ, δ, σ),    (35)

and the KL divergence between any distribution over parameters p(θ) and an equilibrium distribution is:

    KL(p(θ) || p*(θ; γ, δ, σ)) = ΔJ/σ    (36)

    ΔJ ≡ J^noneq(p; γ, δ, σ) − J(γ, δ, σ),    (37)

where J^noneq is the non-equilibrium objective:

    J^noneq(p; γ, δ, σ) = ⟨R + δD + γC + σS⟩_{p(θ)}.    (38)

For a stationary Markov process whose stationary distribution is an equilibrium distribution, the KL divergence to the stationary distribution must monotonically decrease at each step. This means ΔJ/σ must decrease monotonically; that is, our objective J must decrease monotonically:

    J_{t=0} ≥ J_t ≥ J_{t+1} ≥ J_{t=∞}.    (39)

Furthermore, if we use q(θ) as our prior over parameters, we know:

    J_{t=0} = ⟨R + δD + γC⟩_{q(θ)}    (40)
    J_{t=∞} = −σ log Z.    (41)

4. Conclusion

We have formalized representation learning as the process of minimizing the distortion introduced when we project the real world (World P) onto the world we desire (World Q). The projection is naturally described by a set of four functionals which variationally bound relevant mutual informations in the real world. Relations between the functionals describe an optimal three-dimensional surface in a four dimensional space of optimal states. A single learning objective targeting points on this optimal surface can express a wide array of existing learning objectives, spanning from unsupervised learning to supervised learning and everywhere in between. The geometry of the optimal frontier suggests a wide array of identities involving the functionals and their partial derivatives. This offers a direct analogy to thermodynamics, independent of any physical content. By analogy to thermodynamics, we can begin to develop new quantitative measures and relationships amongst properties of our models that we believe will offer a new class of theoretical understanding of learning behavior.

ACKNOWLEDGEMENTS

The authors would like to thank Jascha Sohl-Dickstein, Mallory Alemi and Rif A. Saurous for helpful discussions and feedback on the draft.

References

Achille, A. and Soatto, S. Emergence of invariance and disentangling in deep representations. Proceedings of the ICML Workshop on Principled Approaches to Deep Learning, 2017.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv:1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.

Alemi, A. A., Poole, B., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. ICML 2018, 2018. URL http://arxiv.org/abs/1711.00464.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv:1505.05424, May 2015. URL https://arxiv.org/abs/1505.05424.

Box, G. E. P. and Draper, N. R. Empirical Model-Building and Response Surfaces. John Wiley & Sons, 1987.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. arXiv:1402.4102, February 2014. URL https://arxiv.org/abs/1402.4102.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.

Csiszár, I. and Matúš, F. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.

Finn, C. B. P. Thermal Physics. CRC Press, 1993.

Friedman, N., Mosenzon, O., Slonim, N., and Tishby, N. Multivariate information bottleneck. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 152–161. Morgan Kaufmann Publishers Inc., 2001.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. 2017.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get M for free. arXiv:1704.00109, March 2017. URL https://arxiv.org/abs/1704.00109.

Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. 2014.

Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-supervised learning with deep generative models. arXiv:1406.5298, June 2014. URL https://arxiv.org/abs/1406.5298.

Ma, Y.-A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient MCMC. arXiv:1506.04696, June 2015. URL https://arxiv.org/abs/1506.04696.

Machta, J. and Ellis, R. S. Monte Carlo methods for rough free energy landscapes: Population annealing and parallel tempering. Journal of Statistical Physics, 144:541–553, August 2011. doi: 10.1007/s10955-011-0249-0. URL https://arxiv.org/abs/1104.1138.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 2113–2122. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.

Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. arXiv:1704.04289, April 2017. URL https://arxiv.org/abs/1704.04289.

Marsh, C. Introduction to continuous entropy. 2013. URL http://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf.

Parrondo, J. M. R., Horowitz, J. M., and Sagawa, T. Thermodynamics of information. Nature Physics, 11(2):131–139, 2015. URL http://jordanmhorowitz.mit.edu/sites/default/files/documents/natureInfo.pdf.

Sethna, J. Statistical Mechanics: Entropy, Order Parameters, and Complexity, volume 14. Oxford University Press, 2006. URL http://pages.physics.cornell.edu/~sethna/StatMech/EntropyOrderParametersComplexity.pdf.

Smith, S. L. and Le, Q. V. A Bayesian perspective on generalization and stochastic gradient descent. arXiv:1710.06451, October 2017. URL https://arxiv.org/abs/1710.06451.

Still, S. Thermodynamic cost and benefit of data representations. arXiv:1705.00612, April 2017. URL https://arxiv.org/abs/1705.00612.

Still, S., Sivak, D. A., Bell, A. J., and Crooks, G. E. Thermodynamics of prediction. Physical Review Letters, 109(12):120604, September 2012. doi: 10.1103/PhysRevLett.109.120604. URL https://arxiv.org/abs/1203.3271.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368–377, 1999. URL https://arxiv.org/abs/physics/0004057.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Zhang, Y., Saxe, A. M., Advani, M. S., and Lee, A. A. Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. arXiv:1803.01927, March 2018. URL https://arxiv.org/abs/1803.01927.
A. Reconstruction Free Formulation

We can utilize the Chain Rule of Mutual Information (Equation (10)),

    I(Z_i; X_i, Θ) = I(Z_i; X_i) + I(Z_i; Θ|X_i),    (42)

to simplify our expression for the minimum possible KL between worlds (Equation (9)) and consider a reduced set of functionals (compare to Section 1.1):

• C ≡ −Σ_i ⟨log q(y_i|z_i)⟩_P ≥ Σ_i [H(Y_i) − I(Y_i; Z_i)] = Σ_i H(Y_i|Z_i)
The classification error, as before.

• S ≡ ⟨log [p(θ|{x, y})/q(θ)]⟩_P ≥ I(Θ; {X, Y})
The entropy, as before.

• V ≡ ⟨log [p(z_i|x_i, θ)/q(z_i|x_i)]⟩_P ≥ I(Z_i; Θ|X_i)
The volume of the representation (for lack of a better term), which measures the mutual information between our representation Z and the parameters Θ, conditioned on the input X. That is, this functional bounds how much of the information in our representation can come from the learning algorithm, independent of the actual input.

In principle, these three functionals still fully characterize the distortion introduced in our information projection. Notice that this new functional requires the variational approximation q(z_i|x_i), a variational approximation to the encoder marginalized over our parameter distribution. Notice also that we no longer require a variational approximation to p(x_i|z_i). That is, in this formulation we no longer require any form of decoder, or synthesis in our original data space X. While equivalent in its information projection, this more naturally corresponds to the model of our desired world Q,

    q(x, y, φ, z, θ) = q(φ) q(θ) ∏_i q(z_i|x_i) q(y_i|z_i),    (43)

depicted in Figure 2. Here we desire not the joint generative model X ← Z → Y, but the predictive model X → Z → Y.

Figure 2. Modified graphical model for world Q, the world we desire, which satisfies the joint density in Equation 43 (nodes φ, X, Z, Y, θ).

B. Experiments

We show examples of models trained on a toy dataset for all of the different objectives we define above. The dataset has both an infinite data variant, where overfitting is not a problem, and a finite data variant, where overfitting can be clearly observed for both reconstruction and classification.

Data generation. We follow the toy model from Alemi et al. (2018), but add an additional classification label in order to explore supervised and semi-supervised objectives. The true data generating distribution is as follows. We first sample a latent binary variable, z ∼ Ber(0.7), then sample a latent 1D continuous value from that variable, h|z ∼ N(h|μ_z, σ_z), and finally we observe a discretized value, x = discretize(h; B), where B is a set of 30 equally spaced bins, together with a discrete label, y = z (so the true label is the latent variable that generated x). We set μ_z and σ_z such that R* ≡ I(x; z) = 0.5 nats in the true generative process, representing the ideal rate target for a latent variable model. For the finite dataset, we select 50 examples randomly from the joint p(x, y, z). For the infinite dataset, we directly supply the true full marginal p(x, y) at each iteration during training. When training on the finite dataset, we evaluate model performance against the infinite dataset so that there is no error in the evaluation metrics due to a finite test set.

Model details. We choose a discrete latent representation with K = 30 values, with an encoder of the form q(z_i|x_j) ∝ exp[−(w_i^e x_j − b_i^e)²], where z is the one-hot encoding of the latent categorical variable and x is the one-hot encoding of the observed categorical variable. We use a decoder of the same form, but with different parameters, q(x_j|z_i) ∝ exp[−(w_i^d x_j − b_i^d)²], and a classifier of the same form as well, q(y_j|z_i) ∝ exp[−(w_i^c y_j − b_i^c)²]. Finally, we use a variational marginal, q(z_i) = π_i. Given this, the true joint distribution has the form p(x, y, z) = p(x)p(z|x)p(y|x), with marginal p(z) = Σ_x p(x, z) and conditionals p(x|z) = p(x, z)/p(z) and p(y|z) = p(y, z)/p(z).

The encoder is additionally parameterized, following Achille & Soatto (2017), by α, a set of learned parameters for a Log Normal distribution of the form log N(−α_i/2, α_i). In total, the model has 184 parameters: 60 weights and biases in the encoder and decoder, 4 weights and biases in the classifier, 30 weights in the marginal, and an additional 30 weights for the α_i parameterizing the stochastic encoder. We initialize the weights so that when σ = 0, there is no noticeable effect on the encoder during training or testing.
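For concreteness, a minimal sketch of the data generating process described above is given below; the μ_z and σ_z values are placeholders rather than the ones tuned to make I(x; z) = 0.5 nats:

```python
import numpy as np

rng = np.random.default_rng(0)
MU, SIGMA = {0: -1.0, 1: 1.0}, {0: 1.0, 1: 1.0}  # placeholder values
edges = np.linspace(-4.0, 4.0, 31)               # 30 equally spaced bins

def sample(n):
    z = rng.binomial(1, 0.7, size=n)                           # z ~ Ber(0.7)
    h = rng.normal([MU[v] for v in z], [SIGMA[v] for v in z])  # h | z
    x = np.clip(np.digitize(h, edges) - 1, 0, 29)              # x = discretize(h; B)
    y = z                                                      # the label is the latent
    return x, y, z

x, y, z = sample(50)  # the finite-data variant draws 50 examples
```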

Experiments. In Figure 3, we show the optimal, hand-crafted model for the toy dataset, as well as a selection of parameterizations of the TherML objective that correspond to commonly-used objective functions and a few new objective functions not previously described. In the captions, the parameters are specified with γ, δ, σ as in the main text, as well as ρ, a corresponding Lagrange multiplier for R, in order to simplify the parameterization. This is still valid, so long as there is always one functional whose multiplier is 1; it just parameterizes the optimal surface slightly differently. We train all objectives for 10,000 gradient steps. For all of the objectives described, the model has converged, or come close to convergence, by that point.

Because the model is sufficiently powerful to memorize the dataset, most of the objectives are very susceptible to overfitting. Only the objective variants that are "regularized" by the S term (parameterized by σ) are able to avoid overfitting in the decoder and classifier.

Figure 3. Hand-crafted optimal model. Toy model illustrating the difference between selected points on the three dimensional optimal surface defined by γ, δ, and σ. See Section 2 for more description of the objectives, and Appendix B for details on the experiment setup. Top (i): Three distributions in data space: the true data distribution p(x), the model's generative distribution g(x) = Σ_z q(z)q(x|z), and the empirical data reconstruction distribution d(x) = Σ_{x′} Σ_z p(x′)q(z|x′)q(x|z). Middle (ii): Four distributions in latent space: the learned (or computed) marginal q(z), the empirical induced marginal e(z) = Σ_x p(x)q(z|x), the empirical distribution over z values for data vectors in the set X₀ = {x_n : z_n = 0}, denoted e(z₀) (purple), and the empirical distribution over z values for data vectors in the set X₁ = {x_n : z_n = 1}, denoted e(z₁) (yellow). Bottom: Three K × K distributions: (iii) q(z|x), (iv) q(x|z) and (v) q(x′|x) = Σ_z q(z|x)q(x′|z).
Figure 4. Supervised Learning approaches. (a) Deterministic Supervised Classifier: δ = ρ = σ = 0, γ = 1. (b) Entropy-regularized Deterministic Classifier: δ = ρ = 0, γ = 1, σ = 0.1. (c) Entropy-regularized IB: δ = 0, ρ = 0, γ = 1, σ = 0.01. (d) Bayesian Neural Network Classifier: δ = 0, ρ = 0, σ = γ = 1.

Figure 5. VIB style objectives. (a) VIB: δ = 0, σ = 0, γ = 1, ρ(β) = 0.5. (b) Entropy-regularized VIB: δ = 0, γ = 1, ρ = 0.9, σ = 0.1.

Figure 6. Autoencoder objectives. (a) Deterministic Autoencoder: γ = ρ = σ = 0, δ = 1. (b) Entropy-regularized Deterministic Autoencoder: γ = ρ = 0, δ = 1, σ = 0.01.

Figure 7. VAE style objectives. (a) VAE: σ = 0, γ = 0, δ = ρ = 1. (b) β-VAE: σ = 0, γ = 0, δ = 1, ρ(β) = 0.5. (c) Entropy-regularized β-VAE: σ = 0.5, γ = 0, δ = 1, ρ(β) = 0.9. (d) Semi-supervised VAE: σ = 0, γ(α) = 0.5, δ = ρ = 1.

Figure 8. Full Objective: σ = 0.5, γ = 1000, δ = 1, ρ = 0.9. A simple demonstration of the behavior with all terms present in the objective.
