TherML: Thermodynamics of Machine Learning
• I(Yi; Zi) - which measures how much information our representation contains about our auxiliary data. This should be maximized as well.

• I(Zi; Xi, Θ) - which measures how much information the parameters and input determine about our representation. This should be minimized to ensure consistency between worlds. Notice that this is similar to, but distinct from, the first term above.

I(Zi; Xi, Θ) = I(Zi; Xi) + I(Zi; Θ|Xi) (10)

by the Chain Rule for mutual information (see footnote 2).

• I(Θ; {X, Y}) - which measures how much information we store about our training data in the parameters of our encoder. This should also be minimized.

These mutual informations are all intractable in general, since we cannot compute the necessary marginals in closed form.

1.1. Functionals

Despite their intractability, we can compute variational bounds on these mutual informations.

• R ≡ Σi ⟨log p(zi|xi, θ)/q(zi)⟩ ≥ Σi I(Zi; Xi, Θ)

The rate measures the complexity of our representation. It is the relative information of a sample specific representation zi ∼ p(z|xi, θ) with respect to our variational marginal q(z). It measures how many bits we actually encode about each sample.

• C ≡ −Σi ⟨log q(yi|zi)⟩ ≥ Σi H(Yi) − Σi I(Yi; Zi) = Σi H(Yi|Zi)

The classification error measures the conditional entropy of Y left after conditioning on Z. It is a measure of how much information about Y is left unspecified in our representation. This functional measures our supervised learning performance.
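To make the two bounds above concrete, here is a minimal numerical sketch of how R and C might be estimated on a minibatch. It is not code from the paper: the diagonal-Gaussian encoder, the fixed marginal q(z) = N(0, I), the softmax classifier head, and all names and shapes are assumptions for illustration, and the quantities are per-example averages rather than the dataset sums used in the text.

# Hedged sketch: Monte Carlo estimates of the rate R and classification error C
# bounds for a diagonal-Gaussian encoder and a standard-normal marginal q(z) = N(0, I).
import numpy as np

def diag_gaussian_log_prob(z, mu, log_var):
    # log N(z; mu, diag(exp(log_var))), summed over latent dimensions
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (z - mu) ** 2 / np.exp(log_var), axis=-1)

def rate_R(z, mu, log_var):
    # R per example: log p(z_i|x_i, theta) - log q(z_i), with q(z) = N(0, I)
    log_p = diag_gaussian_log_prob(z, mu, log_var)
    log_q = diag_gaussian_log_prob(z, np.zeros_like(mu), np.zeros_like(log_var))
    return np.mean(log_p - log_q)

def class_error_C(class_logits, labels):
    # C per example: -log q(y_i|z_i), i.e. the cross-entropy of the classifier head
    log_probs = class_logits - np.log(np.exp(class_logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy usage with stand-in "encoder outputs" for a batch of 8 examples
rng = np.random.default_rng(0)
mu, log_var = rng.normal(size=(8, 4)), 0.1 * rng.normal(size=(8, 4))
z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)  # reparameterized sample
logits = rng.normal(size=(8, 3))
labels = rng.integers(0, 3, size=8)
print("R ~", rate_R(z, mu, log_var), " C ~", class_error_C(logits, labels))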
The relative entropy in our parameters, or just entropy for short, measures the relative information between the distribution we assign our parameters in world P after learning from the data {X, Y}, with respect to some data independent q(θ) prior on the parameters. This is an upper bound on the mutual information between the data and our parameters and as such can measure our risk of overfitting.

1.2. Inequalities

Notice that S and R are both themselves upper bounds on mutual informations, and so must be positive semidefinite. Also, if our data is discrete, or if we have discretized it (see footnote 3), D and C, which are both upper bounds on conditional entropies, must be positive as well.

The distributions p(z|x, θ), p(θ|{x, y}), q(z), q(x|z), q(y|z) can be chosen arbitrarily. Once chosen, the functionals R, C, D, S take on well described values. The choice of the five distributional families specifies a single point in a four-dimensional space. This phase space is trivially bounded by the positive semi-definiteness of the functionals mentioned above and, less trivially, by the tighter bounds set by the mutual informations they estimate in the previous section. Furthermore, since these mutual informations are all defined in world P, which itself satisfies a particular set of conditional dependencies, the mutual informations are all interdependent. This means that not even all points in the positive quadrant of the R, C, D, S space can be reached. For example:

R + C + D + S ≥ KL(p || Q) + Σi [H(Xi) + H(Yi) − I(Xi; Φ) − I(Yi; Xi, Φ)]
             = KL(p || Q) + Σi [H(Xi, Yi) − I(Xi, Yi; Φ)]. (11)

Showing that even if the distributional choices agree perfectly (so that KL(p || Q) = 0), because the data and its governing distribution are outside our control and can be
Footnote 2: Given this relationship, we could actually reduce the total number of functionals we consider from 4 to 3, as discussed in Appendix A.

Footnote 3: More generally, if we choose some measure m(x), m(y) on both X and Y, we can define D and C in terms of that measure, e.g. D ≡ ⟨− log q(x|z)/m(x)⟩ ≥ H_m(X) − I(X; Z) = H_m(X|Z).
• If δ = 0 and σ, γ → ∞ but in such a way as to keep the ratio β ≡ σ/γ fixed (that is, if we drop the R term and only keep C + βS as our objective) we recover the loss of Achille & Soatto (2017), presented as an alternative way to do Information Bottleneck (Tishby et al., 1999) but being stochastic on the parameters rather than the activations as in VIB.

• As a special case, if our objective is set to C + S (δ = 0, σ, γ → ∞, σ/γ → 1), we obtain the objective for a Bayesian neural network, a la Blundell et al. (2015).

• If we retain only D, we are training an autoencoder.

• If σ = 0, γ = 0, δ = 1 the objective is equivalent to the ELBO used to train a VAE (Kingma & Welling, 2014).

• If σ = 0, γ = 0 more generally, the objective is equivalent to a β-VAE (Higgins et al., 2017) where β = 1/δ.

• If γ = 0 all terms involving the auxiliary data Y drop out and we are doing some form of unsupervised learning without any variational classifier q(y|z). The presence of the S term makes this more general than a usual β-VAE and should offer better generalization properties and control of overfitting by bottlenecking how much information we allow the parameters of our encoder to extract from the training data.

• σ = 0, γ = α, δ = 1 recovers the semi-supervised objective of Kingma et al. (2014).
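To make the role of the control parameters concrete, here is a small illustrative sketch (not code from the paper; the per-batch values and knob settings are made up) showing how a single loss of the form R + δD + γC reproduces a few of the limits listed above. The σS term is omitted on the assumption that, as discussed in Section 2.1, it is handled implicitly by the optimizer rather than computed explicitly.

# Illustrative sketch only: one objective, several named special cases.
# R, D, C stand in for per-batch estimates of the rate, distortion and
# classification error; in a real model they would come from the encoder,
# decoder and classifier heads.
def therml_objective(R, D, C, delta=1.0, gamma=0.0):
    return R + delta * D + gamma * C

R, D, C = 2.3, 150.0, 0.7                                           # made-up per-batch values
elbo     = therml_objective(R, D, C, delta=1.0, gamma=0.0)          # VAE / ELBO limit
beta     = 4.0
beta_vae = therml_objective(R, D, C, delta=1.0 / beta, gamma=0.0)   # beta-VAE, with beta = 1/delta
semi_sup = therml_objective(R, D, C, delta=1.0, gamma=0.5)          # semi-supervised flavour
print(elbo, beta_vae, semi_sup)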
Examples of the behavior of all of these objectives on a simple toy model are shown in Appendix B.

Notice that all of these previous approaches describe low-dimensional sub-surfaces of the optimal three-dimensional frontier. These approaches were all interested in different domains: some were focused on supervised prediction accuracy, others on learning a generative model. Depending on your specific problem and downstream tasks, different points on the optimal frontier will be desirable. However, instead of choosing a single point on the frontier, we can now explore a region on the surface to see what class of solutions is possible within the modeling choices. By simply adjusting the three control parameters δ, γ, σ, we can smoothly move across the entire frontier and smoothly interpolate between all of these objectives and beyond.

2.1. Optimization

So far we've considered explicit forms of the objective in terms of the four functionals. For S this would require some kind of tractable approximation to the posterior over the parameters of our encoding distribution. Alternatively, we can formally describe the exact solution to our minimization problem:

min S  s.t.  R = R0, C = C0, D = D0. (18)

Recall that S measures the relative entropy of our parameter distribution with respect to the q(θ) prior. As such, the solution that minimizes the relative entropy subject to some constraints is a generalized Boltzmann distribution (Jaynes, 1957):

p∗(θ|{x, y}) = (q(θ)/Z) exp(−(R + δD + γC)/σ). (19)

Here Z is the partition function, the normalization constant for the distribution:

Z = ∫ dθ q(θ) exp(−(R + δD + γC)/σ). (20)

This suggests an alternative method for finding points on the optimal frontier. We could turn the unconstrained Lagrange optimization problem, which required some explicit choice of tractable posterior distribution over parameters, into a sampling problem for a richer implicit distribution.

A naive way to draw samples from this posterior would be to use Stochastic Gradient Langevin Dynamics or its cousins (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015), which, in practice, would look like ordinary stochastic gradient descent (or its cousins like momentum) for the objective R + δD + γC, with injected noise. By choosing the magnitude of the noise relative to the learning rate, the effective temperature σ can be controlled.
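As a rough sketch of what such a sampler could look like (my own illustration, not the paper's procedure; grad_objective and the toy quadratic target are assumptions), a Langevin-style update adds Gaussian noise whose scale, relative to the learning rate, sets the effective temperature σ:

# Hedged sketch of a stochastic-gradient-Langevin-style parameter update for the
# objective R + delta*D + gamma*C. The noise standard deviation sqrt(2*lr*sigma)
# follows the usual Langevin prescription, so sigma plays the role of a temperature.
import numpy as np

def sgld_step(theta, grad_objective, batch, lr, sigma, rng):
    grad = grad_objective(theta, batch)            # minibatch gradient of R + delta*D + gamma*C
    noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * lr * sigma)
    return theta - lr * grad + noise

# Toy usage: "sample" around the minimum of a quadratic objective 0.5*||theta||^2,
# whose gradient is simply theta. Larger sigma gives wider fluctuations.
rng = np.random.default_rng(0)
theta = np.ones(3)
for _ in range(1000):
    theta = sgld_step(theta, lambda th, b: th, batch=None, lr=1e-2, sigma=0.5, rng=rng)
print(theta)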
There is increasing evidence that the stochastic part of stochastic gradient descent itself is enough to turn SGD less into an optimization procedure and more into an approximate posterior sampler (Mandt et al., 2017; Smith & Le, 2017; Achille & Soatto, 2017; Zhang et al., 2018), where hyperparameters such as the learning rate and batch size set the effective temperature. If ordinary stochastic gradient descent is doing something more akin to sampling from a posterior and less like optimizing to some minimum, it would help explain improved performance through ensemble averages of different points along trajectories (Huang et al., 2017).

When viewed in this light, Equation 19 describes the optimal posterior for the parameters so as to ensure the minimal divergence between worlds P and Q. q(θ) plays the role of the prior over parameters, but our overall objective is minimized when

q(θ) = p(θ) = ⟨p(θ|{x, y})⟩_{x,y}. (21)

That is, when our prior is the marginal of the posteriors over all possible datasets drawn from the true distribution.
A fair draw from this marginal is to take a sample from the posterior obtained on a different but related dataset. Insomuch as ordinary SGD training is an approximate method for drawing a posterior sample, the common practice of fine-tuning a pretrained network on a related dataset is using a sample from the optimal prior as our initial parameters. The fact that fine-tuning approximates use of an optimal prior presumably helps explain its broad success.

If we identify our true goal not as optimizing some objective but instead directly sampling from Equation 19, we can consider alternative approaches to define our learning dynamics, such as parallel tempering or population annealing (Machta & Ellis, 2011).

3. Thermodynamics

So far we have described a framework for learning that involves finding points that lie on a convex three-dimensional surface, described in terms of four functional coordinates R, C, D, S. Interestingly, this is all that is required to establish a formal connection to thermodynamics, which similarly is little more than the study of exact differentials (Sethna, 2006; Finn, 1993).

Whereas previous approaches connecting thermodynamics and learning (Parrondo et al., 2015; Still, 2017; Still et al., 2012) have focused on describing the thermodynamics and statistical mechanics of physical realizations of learning systems (i.e. the heat bath in these papers is a physical heat bath at finite temperature), in this work we make a formal analogy to the structure of the theory of thermodynamics, without any physical content.

3.1. First Law of Learning

The optimal frontier creates an equivalence class of states, being the set of all states that minimize as much as possible the distortion introduced in projecting world P onto a set of distributions that respect the conditions in Q. The surface satisfies some equation f(R, C, D, S) = 0, which we can use to describe any one of these functionals in terms of the rest, e.g. R = R(C, D, S). This function is entire, and so we can equate partial derivatives of the function with differentials of the functionals (see footnote 5):

dR = (∂R/∂C)|D,S dC + (∂R/∂D)|C,S dD + (∂R/∂S)|C,D dS. (22)

Footnote 5: (∂X/∂Y)|Z denotes the partial derivative of X with respect to Y holding Z constant.

Since the function is smooth and convex, instead of identifying the surface of optimal rates in terms of the functionals C, D, S, we could just as well describe the surface in terms of the partial derivatives by applying a Legendre transformation. We will name the partial derivatives:

γ ≡ −(∂R/∂C)|D,S (23)
δ ≡ −(∂R/∂D)|C,S (24)
σ ≡ −(∂R/∂S)|C,D. (25)

These measure the exchange rate for turning rate into reduced classification error, reduced distortion, or increased entropy, respectively.
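Since γ, δ, σ are just slopes of the optimal surface, they can in principle be read off numerically. The sketch below is purely illustrative: it uses a made-up convex stand-in for R(C, D, S) and central finite differences; for a real model family one would instead fit the frontier from a sweep of trained models.

# Illustrative sketch: reading off the "forces" gamma, delta, sigma as negative
# partial derivatives of R(C, D, S) by central finite differences. R_of is a toy,
# smooth, convex surrogate for the optimal frontier, not a quantity from the paper.
import numpy as np

def R_of(C, D, S):
    return 10.0 / (1.0 + C) + 5.0 / (1.0 + D) + 2.0 * np.exp(-S)

def forces(C, D, S, eps=1e-4):
    gamma = -(R_of(C + eps, D, S) - R_of(C - eps, D, S)) / (2 * eps)   # -dR/dC at fixed D, S
    delta = -(R_of(C, D + eps, S) - R_of(C, D - eps, S)) / (2 * eps)   # -dR/dD at fixed C, S
    sigma = -(R_of(C, D, S + eps) - R_of(C, D, S - eps)) / (2 * eps)   # -dR/dS at fixed C, D
    return gamma, delta, sigma

print(forces(C=1.0, D=2.0, S=0.5))   # exchange rates at one point on the toy surface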
The functionals R, C, D, S are analogous to extensive thermodynamic variables such as volume, entropy, particle number, magnetic field, charge, surface area, length and energy, which grow as the system grows, while the named partial derivatives γ, δ, σ are analogous to the intensive, generalized forces in thermodynamics corresponding to their paired state variable, such as pressure, temperature, chemical potential, magnetization, electromotive force, surface tension, elastic force, etc. Just as in thermodynamics, the extensive functionals are defined for any state, while the intensive partial derivatives are only well defined for equilibrium states, which in our language are the states lying on the optimal surface.

Recasting our total differential:

dR = −γ dC − δ dD − σ dS, (26)

we create a law analogous to the First Law of Thermodynamics. In thermodynamics the First Law is often taken to be a statement about the conservation of energy, and by analogy here we could think about this law as a statement about the conservation of information. Granted, the actual content of the law is fairly vacuous, equivalent only to the statement that there exists a scalar function R = R(C, D, S) defining our surface and its partial derivatives.

3.2. Maxwell Relations

Requiring that Equation 26 be an exact differential has mathematically trivial but intuitively non-obvious implications that relate various partial derivatives of the system to one another, akin to the Maxwell Relations in thermodynamics. For example, requiring that mixed second partial derivatives are symmetric establishes that:

∂²R/∂D∂C = ∂²R/∂C∂D  ⟹  (∂δ/∂C)|D = (∂γ/∂D)|C. (27)

This equates the result of two very different experiments. In the experiment encoded in the partial derivative on the left, one would measure the change in the derivative of the R − D curve (δ) as a function of the classification error (C)
at fixed distortion (D). On the right one would measure the change in the derivative of the R − C curve (γ) as a function of the distortion (D) at fixed classification error (C). As different as these scenarios appear, they are mathematically equivalent.
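The equivalence is easy to check numerically. The sketch below is illustrative only: it uses an invented, non-separable stand-in for the frontier R(C, D, S) and estimates both sides of Equation 27 with nested finite differences.

# Illustrative sketch: checking the Maxwell-style relation
# d(delta)/dC at fixed D  =  d(gamma)/dD at fixed C
# on a toy stand-in for the optimal surface. Not code from the paper.
import numpy as np

def R_of(C, D, S):
    return 10.0 / (1.0 + C + 0.5 * D) + 3.0 * np.exp(-S) / (1.0 + D)

eps = 1e-4
def gamma(C, D, S):   # gamma = -dR/dC at fixed D, S
    return -(R_of(C + eps, D, S) - R_of(C - eps, D, S)) / (2 * eps)

def delta(C, D, S):   # delta = -dR/dD at fixed C, S
    return -(R_of(C, D + eps, S) - R_of(C, D - eps, S)) / (2 * eps)

C0, D0, S0 = 1.0, 2.0, 0.5
lhs = (delta(C0 + eps, D0, S0) - delta(C0 - eps, D0, S0)) / (2 * eps)   # d(delta)/dC
rhs = (gamma(C0, D0 + eps, S0) - gamma(C0, D0 - eps, S0)) / (2 * eps)   # d(gamma)/dD
print(lhs, rhs)   # agree up to finite-difference error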
We can also define other potentials analogous to the alternative thermodynamic potentials such as enthalpy, free energy, and Gibbs free energy by performing partial Legendre transformations. For instance, we can define a free rate:

F(C, D, σ) ≡ R − σS (28)

dF = −γ dC − δ dD − S dσ. (29)

The free rate measures the rate of our system, not as a function of S (something difficult to keep fixed), but in terms of σ, a parameter in our loss or optimal posterior. The free rate gives rise to other Maxwell relations such as

(∂S/∂C)|σ = (∂γ/∂σ)|C, (30)

which equates how much each additional bit of entropy (S) buys you in terms of classification error (C) at fixed effective temperature (σ), to a seemingly very different experiment where you measure the change in the effective supervised tension (γ, the slope of the R − C curve) versus effective temperature (σ) at a fixed classification error (C).

We can additionally take and name higher order partial derivatives, analogous to the susceptibilities of thermodynamics like the bulk modulus, the thermal expansion coefficient, or the heat capacities. For instance, we can define the analog of heat capacity for our system, a sort of rate capacity at constant distortion:

KD ≡ (∂R/∂σ)|D. (31)

Just as in thermodynamics, these susceptibilities may offer useful ways to characterize and quantify the systematic differences between model families. Perhaps general scaling laws can be found between susceptibilities and network widths, or depths, or number of parameters, or dataset size. Divergences or discontinuities in the susceptibilities are the hallmark of phase transitions in physical systems, and it is reasonable to expect to see similar phenomena for certain models.

Many of the first, second and third order partial derivatives in thermodynamics are given unique names. This is because the quantities are particularly useful for comparing different physical systems. We expect a subset of the first, second and higher order partial derivatives of the base functionals will prove similarly useful for comparing, quantifying, and understanding differences between modelling choices.

3.3. Zeroth Law of Learning

A central concept in thermodynamics is a notion of equilibrium. The so-called Zeroth Law of thermodynamics defines thermal equilibrium as a sort of transitive property of systems (Finn, 1993). If system A is in thermal equilibrium with system C, and system B is separately in thermal equilibrium with system C, then systems A and B are in thermal equilibrium with each other.

When any subpart of a system is in thermal equilibrium with any other subpart, the system is said to be in an equilibrium state.

In our framework, the points on the optimal surface are analogous to the equilibrium states, for which we have well defined partial derivatives. We can demonstrate that this notion of equilibrium agrees with a more intuitive notion of equilibrium between coupled systems. Imagine we have two different models, characterized by their own sets of distributions: model A is defined by pA(z|x, θ), pA(θ|{x, y}), qA(z), and model B by pB(z|x, θ), pB(θ|{x, y}), qB(z). Both models will have their own value for each of the functionals: RA, SA, DA, CA and RB, SB, DB, CB. Each model defines its own representation ZA, ZB. Now imagine coupling the models by forming the joint representation ZC = (ZA, ZB), obtained by concatenating the two representations together. Now the governing distributions over Z are simply the products of the two models' distributions, e.g. qC(zC) = qA(zA) qB(zB). Thus the rate RC and entropy SC for the combined model are the sums of the individual models': RC = RA + RB, SC = SA + SB.
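A small numerical check of this additivity (an illustration under the stated independence assumption, not code from the paper): for two diagonal-Gaussian codes with N(0, I) marginals, the closed-form rate of the concatenated code equals the sum of the individual rates.

# Illustrative sketch: the rate of a concatenated representation Z_C = (Z_A, Z_B)
# is the sum of the individual rates when the governing distributions factorize.
# Here the rate is the closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
import numpy as np

def gaussian_rate(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(1)
mu_a, log_var_a = rng.normal(size=4), 0.2 * rng.normal(size=4)
mu_b, log_var_b = rng.normal(size=6), 0.2 * rng.normal(size=6)

R_A = gaussian_rate(mu_a, log_var_a)
R_B = gaussian_rate(mu_b, log_var_b)
R_C = gaussian_rate(np.concatenate([mu_a, mu_b]),               # concatenated code
                    np.concatenate([log_var_a, log_var_b]))
print(R_A + R_B, R_C)    # identical up to floating point rounding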
Now imagine we sample new states for the combined system which are maximally entropic with the constraint that the combined rate stay constant:

min S  s.t.  R = RC  ⟹  p(θ|{x, y}) = (q(θ)/Z) exp(−R/σ). (32)

For the expectations of the two rates to be unchanged after they have been coupled and evolved holding their total rate fixed, we must have

−(1/σA) RA − (1/σB) RB = −(1/σC) RC = −(1/σC)(RA + RB)  ⟹  σA = σB = σC. (33)

Therefore, we can see that σ, the effective temperature, allows us to identify whether two systems are in thermal equilibrium with one another. Just as in thermodynamics, if two systems at different temperatures are coupled, some transfer takes place.

3.4. Second Law of Learning?

Even when doing deterministic training, training is non-invertible (Maclaurin et al., 2015), and we need to contend
with and track the entropy (S) term. We set the parameters of our networks initially with a fair draw from some prior distribution q(θ). The training procedure acts as a Markov process on the distribution of parameters, transforming it from the prior distribution into some modified distribution, the posterior p(θ|{x, y}). Optimization is a many-to-one function that, in the ideal limiting case, maps all possible initializations to a single global optimum. In this limiting case S would be divergent, and there is nothing to prevent us from memorizing the training set.

The Second Law of Thermodynamics states that the entropy of an isolated system tends to increase. All systems tend to disorder, and this places limits on the maximum possible efficiency of heat engines.

Formally, there are many statements akin to the Second Law of Thermodynamics that can be made about Markov chains generally (Cover & Thomas, 2012). The central one is that for any two distributions pn, qn both evolving according to the same Markov process (n marks the time step), the relative entropy KL(pn || qn) is monotonically decreasing with time. This establishes that for a stationary Markov chain, the relative entropy to the stationary state KL(pn || p∞) monotonically decreases (see footnote 6).

Footnote 6: For discrete state Markov chains, this implies that if the stationary distribution is uniform, the entropy of the distribution H(pn) is strictly increasing.
namics that target points on the optimal frontier, or dynamics supervised learning to supervised learning and everywhere
that implement a relaxation towards equilibrium. There is a in between. The geometry of the optimal frontier suggests a
fundamental distinction between states that live on the fron- wide array of identities involving the functionals and their
tier and those off of it, analogous to the distinction between partial derivatives. This offers a direct analogy to thermo-
equilibrium and non-equilibrium states in thermodynamics. dynamics independent of any physical content. By analogy
Any equilibrium distribution can be expressed in the to thermodynamics, we can begin to develop new quantita-
form Equation (19) and identified by its partial derviatives tive measures and relationships amongst properties of our
γ, δ, σ. If name the objective in Equation (17): models that we believe will offer a new class of theoretical
understanding of learning behavior.
J(γ, δ, σ) ≡ R + δD + γC + σS, (34)
ACKNOWLEDGEMENTS
The value this objective takes for any equilibrium distri-
bution can be shown to be given by the log partition func- The authors would like to Jascha Sohl-Dickstein, Mallory
tion (Equation (20)): Alemi and Rif A. Saurous for helpful discussions and feed-
back on the draft.
min J(γ, δ, σ) = −σ log Z(γ, δ, σ) (35)
References
and the KL divergence between any distribution over param-
eters p(θ) and an equilibrium distribution is: Achille, A. and Soatto, S. Emergence of Invariance and
Disentangling in Deep Representations. Proceedings of
KL(p(θ) || p∗ (θ; γ, δ, σ)) = ∆J/σ (36) the ICML Workshop on Principled Approaches to Deep
noneq Learning, 2017.
∆J ≡ J (p; γ, δ, σ) − J(γ, δ, σ) (37)
Where J noneq is the non-equilibrium objective: Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, and
Murphy, Kevin. Deep variational information bottleneck.
J noneq (p; γ, δ, σ) = hR + δD + γC + σSip(θ) . (38) arXiv:1612.00410, 2016. URL http://arxiv.org/
abs/1612.00410.
6
For discrete state Markov chains, this implies that if the sta-
tionary distribution is uniform, the entropy of the distribution Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous,
H(pn ) is strictly increasing. Rif A, and Murphy, Kevin. Fixing a broken ELBO. ICML
2018, 2018. URL http://arxiv.org/abs/1711.00464.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight Uncertainty in Neural Networks. arXiv:1505.05424, May 2015. URL https://arxiv.org/abs/1505.05424.

Box, George EP and Draper, Norman R. Empirical model-building and response surfaces. John Wiley & Sons, 1987.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. arXiv:1402.4102, February 2014. URL https://arxiv.org/abs/1402.4102.

Cover, Thomas M and Thomas, Joy A. Elements of information theory. John Wiley & Sons, 2012.

Csiszár, Imre and Matúš, František. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474-1490, 2003.

Finn, Colin BP. Thermal physics. CRC Press, 1993.

Friedman, Nir, Mosenzon, Ori, Slonim, Noam, and Tishby, Naftali. Multivariate information bottleneck. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 152-161. Morgan Kaufmann Publishers Inc., 2001.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. 2017.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109, March 2017. URL https://arxiv.org/abs/1704.00109.

Jaynes, Edwin T. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv:1406.5298, June 2014. URL https://arxiv.org/abs/1406.5298.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. 2014.

Ma, Y.-A., Chen, T., and Fox, E. B. A Complete Recipe for Stochastic Gradient MCMC. arXiv:1506.04696, June 2015. URL https://arxiv.org/abs/1506.04696.

Machta, J. and Ellis, R. S. Monte Carlo Methods for Rough Free Energy Landscapes: Population Annealing and Parallel Tempering. Journal of Statistical Physics, 144:541-553, August 2011. doi: 10.1007/s10955-011-0249-0. URL https://arxiv.org/abs/1104.1138.

Maclaurin, Dougal, Duvenaud, David, and Adams, Ryan P. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), Volume 37, pp. 2113-2122. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.

Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289, April 2017. URL https://arxiv.org/abs/1704.04289.

Marsh, Charles. Introduction to continuous entropy. 2013. URL http://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf.

Parrondo, Juan MR, Horowitz, Jordan M, and Sagawa, Takahiro. Thermodynamics of information. Nature Physics, 11(2):131-139, 2015. URL http://jordanmhorowitz.mit.edu/sites/default/files/documents/natureInfo.pdf.

Sethna, James. Statistical mechanics: entropy, order parameters, and complexity, volume 14. Oxford University Press, 2006. URL http://pages.physics.cornell.edu/~sethna/StatMech/EntropyOrderParametersComplexity.pdf.

Smith, S. L. and Le, Q. V. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv:1710.06451, October 2017. URL https://arxiv.org/abs/1710.06451.

Still, S. Thermodynamic cost and benefit of data representations. arXiv:1705.00612, April 2017. URL https://arxiv.org/abs/1705.00612.

Still, S., Sivak, D. A., Bell, A. J., and Crooks, G. E. Thermodynamics of Prediction. Physical Review Letters, 109(12):120604, September 2012. doi: 10.1103/PhysRevLett.109.120604. URL https://arxiv.org/abs/1203.3271.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-377, 1999. URL https://arxiv.org/abs/physics/0004057.
Figure 8. Full Objective. σ = 0.5, γ = 1000, δ = 1, ρ = 0.9. Simple demonstration of the behavior with all terms present in the
objective.