On the Mathematics of Diffusion Models

David McAllester
Toyota Technological Institute at Chicago (TTIC)

arXiv:2301.11108v2 [cs.LG] 16 Feb 2023

Abstract

This paper presents the stochastic differential equations of diffusion models assuming only familiarity with Gaussian distributions. This treatment of the diffusion model SDE and the associated reverse-time SDEs unifies the VAE and score-matching treatments. It also yields the contribution of this paper: a novel likelihood formula derived from a non-variational VAE analysis (equations (10) and (12) in the text). The paper presents the mathematics directly, with attributions saved for a final section.

1 The Diffusion Stochastic Differential Equation (SDE)

We assume a population density pop(y) for y ∈ R^d. Of particular interest is a population of images. For a given time resolution ∆t we consider the discrete-time process z(0), z(∆t), z(2∆t), z(3∆t), … defined by

$$z(0) = y, \quad y \sim \mathrm{pop}(y)$$

$$z(t + \Delta t) = z(t) + \sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim N(0, I) \tag{1}$$

For ε ∼ N(0, I), each coordinate of the random variable √∆t ε has standard deviation √∆t and variance ∆t. A sum of Gaussians is a Gaussian with variance equal to the sum of the variances. This implies that the variance of the coordinates of z(n∆t) conditioned on z(0) is n∆t. Taking this to the limit as ∆t → 0, we get that the variance of the coordinates of z(t) given z(0) is t. More generally, in the limit of ∆t → 0, equation (1) holds for sampling z(t + ∆t) given z(t) for any (continuous) non-negative t and ∆t. This is the density (measure) on continuous functions z(t) defined by the diffusion SDE (1).
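As a concrete check of this variance calculation, the following sketch simulates the discrete-time process (1) and confirms that the per-coordinate variance of z(t) given z(0) is t. All names and constants here are illustrative, not from the paper; the population is a point mass at y = 0.

```python
import numpy as np

# Simulate the forward diffusion (1) and check Var[z(T) | z(0)] = T
# per coordinate. The population is a point mass at y = 0.
rng = np.random.default_rng(0)
d, n_samples, dt, T = 4, 100_000, 0.01, 1.0
steps = int(T / dt)

z = np.zeros((n_samples, d))          # z(0) = y = 0
for _ in range(steps):
    z += np.sqrt(dt) * rng.standard_normal((n_samples, d))  # one step of (1)

print(z.var())   # ≈ T = 1.0
```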
2 The VAE Analysis

We first review general variational auto-encoders (VAEs). Let pop(y) be any population distribution or density on any set Y. Let p(z|y) be any distribution or density on any set Z conditioned on y ∈ Y. For the joint distribution defined by pop(y) and p(z|y) we have the following for p(y, z) > 0 and where p_gen(z) and p_gen(y|z) are arbitrary "generator" distributions or densities.

$$\ln \mathrm{pop}(y) = \ln \frac{\mathrm{pop}(y)\,p(z|y)}{p(z|y)} = \ln \frac{p(z)\,p(y|z)}{p(z|y)}$$

$$-\ln \mathrm{pop}(y) = E_{z|y}\!\left[\ln \frac{p(z|y)}{p(z)}\right] - E_{z|y}[\ln p(y|z)] = \mathrm{KL}(p(z|y),\, p(z)) - E_{z|y}[\ln p(y|z)] \tag{2}$$

$$H(\mathrm{pop}(y)) \le E_y\, \mathrm{KL}(p(z|y),\, p_{\mathrm{gen}}(z)) - E_{y,z} \ln p_{\mathrm{gen}}(y|z) \tag{3}$$

Here (3) follows from (2) and the fact that cross-entropies upper bound entropies. Typically the generator distributions are trained by minimizing the "variational bound" in (3). One also typically optimizes the encoder distribution or density p(z|y). However, for diffusion models the encoder is defined by the diffusion process.
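Identity (2) can be verified numerically in a fully tractable special case. The sketch below is an illustration under assumed choices: a 1-D Gaussian population pop(y) = N(0, 1) with Gaussian encoder p(z|y) = N(y, s2), for which p(z) = N(0, 1 + s2) and p(y|z) = N(z/(1 + s2), s2/(1 + s2)) are closed-form.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of identity (2) for a tractable 1-D Gaussian VAE.
rng = np.random.default_rng(0)
y, s2, n = 0.7, 0.25, 1_000_000

z = y + np.sqrt(s2) * rng.standard_normal(n)     # z ~ p(z|y)
# KL(N(y, s2), N(0, 1+s2)) in closed form:
kl = 0.5 * (np.log((1 + s2) / s2) + (s2 + y**2) / (1 + s2) - 1)
# Monte Carlo estimate of E_{z|y} ln p(y|z):
rec = norm.logpdf(y, loc=z / (1 + s2), scale=np.sqrt(s2 / (1 + s2))).mean()

print(kl - rec)            # right-hand side of (2)
print(-norm.logpdf(y))     # -ln pop(y); the two numbers agree
```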
The diffusion process can be analyzed as a Markovian VAE. In a Markovian VAE we have z = z_1, ..., z_N such that the distribution on y, z_1, ..., z_N is defined by pop(y), p(z_1|y), and p(z_{i+1}|z_i). For a Markovian VAE we have the following for p(y, z_1, ..., z_N) > 0.

$$\ln \mathrm{pop}(y) = \ln \frac{p(z_N)\,p(z_{N-1}|z_N) \cdots p(z_1|z_2)\,p(y|z_1)}{p(z_N|y)\,p(z_{N-1}|z_N, y) \cdots p(z_1|z_2, y)}$$

$$-\ln \mathrm{pop}(y) = \mathrm{KL}(p(z_N|y),\, p(z_N)) + \sum_{i=2}^{N} E_{z_i|y}\, \mathrm{KL}(p(z_{i-1}|z_i, y),\, p(z_{i-1}|z_i)) - E_{z_1|y}[\ln p(y|z_1)] \tag{4}$$

$$H(\mathrm{pop}(y)) \le E_y\, \mathrm{KL}(p(z_N|y),\, p_{\mathrm{gen}}(z_N)) + \sum_{i=2}^{N} E_{y,z_i}\, \mathrm{KL}(p(z_{i-1}|z_i, y),\, p_{\mathrm{gen}}(z_{i-1}|z_i)) - E_{y,z_1} \ln p_{\mathrm{gen}}(y|z_1) \tag{5}$$

This can be mapped to diffusion models by taking z_i = z(i∆t) with N∆t ≫ 1. VAE analyses of diffusion models typically construct a reverse-diffusion process by training generator models to minimize the upper bound in (5).
However, here we focus on (4) rather than (5). For the limit ∆t → 0, section 2.2 gives a direct derivation of the following (without use of the Fokker-Planck equation).

$$p(z(t - \Delta t)\,|\,z(t), y) = N\!\left(z(t) + \frac{\Delta t\,(y - z(t))}{t},\; \Delta t\, I\right) \tag{6}$$

$$p(z(t - \Delta t)\,|\,z(t)) = N\!\left(z(t) + \frac{\Delta t\,(E[y|t, z(t)] - z(t))}{t},\; \Delta t\, I\right) \tag{7}$$

Equation (7) defines a reverse-diffusion SDE which we can write as

$$z(t - \Delta t) = z(t) + \frac{E[y|t, z(t)] - z(t)}{t}\,\Delta t + \sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim N(0, I) \tag{8}$$
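The following sketch runs the reverse-diffusion SDE (8) for the simplest possible population, a point mass, for which E[y|t, z(t)] is known in closed form. The closed-form denoiser `posterior_mean` is an illustrative stand-in for a trained network.

```python
import numpy as np

# Sampling by the reverse-diffusion SDE (8) for a point-mass population.
rng = np.random.default_rng(0)
y0 = np.array([1.0, -2.0])

def posterior_mean(t, z):
    return y0                 # E[y|t,z] = y0 when the population is a point mass

T, dt = 10.0, 1e-3
z = y0 + np.sqrt(T) * rng.standard_normal(2)   # z(T) = y + sqrt(T) * noise
t = T
while t > dt:
    z = (z + (posterior_mean(t, z) - z) / t * dt
           + np.sqrt(dt) * rng.standard_normal(2))   # one step of (8)
    t -= dt

print(z)   # close to y0: the reverse SDE concentrates back on the data
```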
For two Gaussian distributions with the same isotropic covariance we have

$$\mathrm{KL}\!\left(N(\mu_1, \sigma^2 I),\; N(\mu_2, \sigma^2 I)\right) = \frac{||\mu_1 - \mu_2||^2}{2\sigma^2}$$

Applying this to (6) and (7) we get

$$\mathrm{KL}\!\left(p(z(t - \Delta t)|z(t), y),\; p(z(t - \Delta t)|z(t))\right) = \frac{||y - E[y|t, z(t)]||^2\,\Delta t^2}{2 t^2\,\Delta t} = \frac{||y - E[y|t, z(t)]||^2}{2 t^2}\,\Delta t \tag{9}$$

We will consider (4) under the diffusion interpretation in the limit as ∆t → 0 and N∆t → ∞. Taking N∆t → ∞, we have that the first term in (4) converges to zero. Balancing this, we have that for a continuous population density, as ∆t goes to zero, the third term in (4), which can be written as −ln p(y|z(∆t)), goes to negative infinity. To see this we can consider the one-dimensional case where the population is uniformly distributed on the interval [0, 1]. For small ∆t, and a given sample of z(∆t) given y, the density p(y|z(∆t)) will be a Gaussian centered at z(∆t) with standard deviation √∆t. As ∆t → 0 the central density will be proportional to 1/√∆t and the third term diverges to negative infinity at the rate of (1/2) ln ∆t. To handle this difficulty we can fix a time t0 > 0 at which to evaluate the third term in (4). We then have that (4) and (9) imply

$$-\ln \mathrm{pop}(y) = \int_{t_0}^{\infty} dt\; E_{z(t)|y}\left[\frac{||y - E[y|t, z(t)]||^2}{2t^2}\right] + E_{z(t_0)|y}\left[-\ln p(y|z(t_0))\right] \tag{10}$$

Taking the expectation over y of (10) we get

$$H(\mathrm{pop}(y)) = \int_{t_0}^{\infty} dt\; E_{y, z(t)}\left[\frac{||y - E[y|t, z(t)]||^2}{2t^2}\right] + H(y|z(t_0)) \tag{11}$$

In (11), as t0 → 0 we have that the second term goes to negative infinity while the first term goes to positive infinity, with both diverging at the rate of ln t0. This pathological behavior is due to the pathological nature of differential entropy. A real number actually carries an infinite amount of information, and the mutual information of a continuous density with itself is infinite. In the continuous case it is better to work with mutual information, as in Shannon's channel capacity theorem for continuous signals. The mutual information I(y, z(t0)) = H(y) − H(y|z(t0)) is

$$I(y, z(t_0)) = \int_{t_0}^{\infty} dt\; E_{y, z(t)}\left[\frac{||y - E[y|t, z(t)]||^2}{2t^2}\right] \tag{12}$$

Note that as t0 → 0 we have that I(y, z(t0)) approaches I(y, y), which is infinite for continuous densities.

Equations (10) and (12) are the main results of this paper.

For images a natural choice of t0 is the variance of camera noise. A natural choice of discrete levels of t in sampling or numerical integration seems to be uniform in ln t.
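The integral in (12) can be estimated by Monte Carlo over a grid of t-levels that is uniform in ln t, as suggested above. The sketch below does this for an assumed 1-D Gaussian population, where E[y|t, z] = z/(1 + t) is closed-form and the exact mutual information is known; all constants are illustrative.

```python
import numpy as np

# Monte Carlo estimate of (12) for pop(y) = N(0,1), where E[y|t,z] = z/(1+t).
rng = np.random.default_rng(0)
t0, t_max, n_levels, n_mc = 0.01, 1e4, 400, 200_000

ts = np.exp(np.linspace(np.log(t0), np.log(t_max), n_levels))  # uniform in ln t
y = rng.standard_normal(n_mc)

mi = 0.0
for t, dlnt in zip(ts, np.gradient(np.log(ts))):
    z = y + np.sqrt(t) * rng.standard_normal(n_mc)             # z(t) via (1)
    integrand = np.mean((y - z / (1 + t)) ** 2) / (2 * t**2)   # E[...]/(2 t^2)
    mi += integrand * t * dlnt                                 # dt = t d(ln t)

print(mi)                                # ≈ 0.5 * ln(1 + 1/t0)
print(0.5 * np.log(1 + 1 / t0))          # exact I(y, z(t0)) for this model
```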
2.1 Estimating E[y|t, z(t)]

Equations (8), (10) and (12) are stated in terms of E[y|t, z(t)]. We can train a network ŷ(t, z) to estimate E[y|t, z(t)] using

$$\hat{y}^* = \operatorname*{argmin}_{\hat{y}}\; E_{t,z(t)}\,(\hat{y}(t, z(t)) - y)^2 \tag{13}$$

In practice it is better to train a network on values of the same scale. If the population values are scaled so as to have scale 1, then the scale of z(t) is √(1 + t). A constant-scale network can then take z/√(1 + t) as an argument rather than z.

$$\hat{y}^* = \operatorname*{argmin}_{\hat{y}}\; E_{t,z(t)}\,(\hat{y}(t, z(t)/\sqrt{1 + t}) - y)^2 \tag{14}$$

We then have

$$E[y|t, z(t)] = \hat{y}^*(t, z(t)/\sqrt{1 + t}) \tag{15}$$
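A minimal training sketch for objective (14) follows. The two-layer network, the two-point stand-in population, the log-uniform schedule for t, and the choice of feeding ln t as the conditioning input are all illustrative assumptions, not specifications from the paper.

```python
import torch, torch.nn as nn

# Training sketch for the denoising objective (14); everything is illustrative.
net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_population(n):             # stand-in population: two point masses
    signs = torch.randint(0, 2, (n, 1)) * 2.0 - 1.0
    return signs * torch.tensor([[1.0, -1.0]])

for step in range(2000):
    y = sample_population(256)
    t = torch.exp(torch.empty(256, 1).uniform_(-5, 5))  # log-uniform t levels
    z = y + torch.sqrt(t) * torch.randn_like(y)         # z(t) via (1)
    x = torch.cat([z / torch.sqrt(1 + t), torch.log(t)], dim=1)  # scaled input
    loss = ((net(x) - y) ** 2).mean()                   # objective (14)
    opt.zero_grad(); loss.backward(); opt.step()

# net(x) now approximates E[y|t, z(t)] per (15)
```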

2.2 The Derivation of (6) and (7)

We first condition the diffusion process on a particular value of y. For t > 0 we can jointly draw z(t) and z(t + ∆t) conditioned on y as follows (taking t > 0 ensures that as ∆t → 0 we have t ≫ ∆t).

$$z(t) = y + \sqrt{t}\,\delta, \quad \delta \sim N(0, I)$$

$$z(t + \Delta t) = z(t) + \sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim N(0, I)$$

Next we additionally condition on the value of z(t + ∆t). The choice of y and z(t + ∆t) places a constraint on δ and ε.

$$\sqrt{t}\,\delta + \sqrt{\Delta t}\,\epsilon = z(t + \Delta t) - y$$
This allows us to solve for δ as a function of ε.

$$\sqrt{t}\,\delta + \sqrt{\Delta t}\,\epsilon = a \quad \text{for } a = z(t + \Delta t) - y$$

$$\delta(\epsilon) = \frac{1}{\sqrt{t}}\left(a - \sqrt{\Delta t}\,\epsilon\right)$$

Given δ as a function of ε we can compute the density of ε given y and z(t + ∆t).

$$
\begin{aligned}
-\ln p(\epsilon\,|\,z(t + \Delta t), y) &= \frac{1}{2}||\delta(\epsilon)||^2 + \frac{1}{2}||\epsilon||^2 + C_1 \\
&= \frac{1}{2}\left\|\frac{1}{\sqrt{t}}\left(a - \sqrt{\Delta t}\,\epsilon\right)\right\|^2 + \frac{1}{2}||\epsilon||^2 + C_1 \\
&= \frac{\Delta t}{2t}||\epsilon||^2 - \frac{\sqrt{\Delta t}\,a^\top \epsilon}{t} + \frac{1}{2}||\epsilon||^2 + C_2 \\
&= \eta\left(||\epsilon||^2 - \frac{\sqrt{\Delta t}\,a^\top \epsilon}{\eta t}\right) + C_2, \qquad \eta = \frac{\Delta t}{2t} + \frac{1}{2} \approx \frac{1}{2} \\
&= \eta\left\|\epsilon - \frac{\sqrt{\Delta t}\,a}{2\eta t}\right\|^2 + C_3 \\
&\approx \frac{1}{2}\left\|\epsilon - \frac{\sqrt{\Delta t}\,a}{t}\right\|^2 + C_3
\end{aligned}
$$

This gives

$$p(\epsilon\,|\,z(t + \Delta t), y) = N\!\left(\frac{\sqrt{\Delta t}\,(z(t + \Delta t) - y)}{t},\; I\right)$$

We have z(t) = z(t + ∆t) − √∆t ε, which now gives

$$p(z(t)\,|\,z(t + \Delta t), y) = N\!\left(z(t + \Delta t) + \frac{\Delta t\,(y - z(t + \Delta t))}{t},\; \Delta t\, I\right)$$

This is (6). This implies

$$E[z(t)\,|\,z(t + \Delta t)] = z(t + \Delta t) + \frac{\Delta t\,(E[y|t, z(t + \Delta t)] - z(t + \Delta t))}{t}$$

We also have

$$p(z(t)\,|\,z(t + \Delta t)) = E_y\left[p(z(t)\,|\,z(t + \Delta t), y)\right]$$

This implies that p(z(t)|z(t + ∆t)) is a mixture of Gaussians. The mean of the mixture equals z(t + ∆t) plus a term proportional to ∆t, while each component's standard deviation (the square root of the variance) is proportional to √∆t, which, in the limit of small ∆t, is infinitely larger than ∆t. The component means therefore coincide to leading order, so the mixture is Gaussian to leading order and the variance remains ∆t I. So we now have

$$p(z(t)\,|\,z(t + \Delta t)) = N\!\left(z(t + \Delta t) + \frac{\Delta t\,(E[y|t, z(t + \Delta t)] - z(t + \Delta t))}{t},\; \Delta t\, I\right)$$

This is (7).
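The leading-order conditional (6) can be checked against exact Gaussian conditioning in one dimension. The sketch below, with arbitrary illustrative values of y, t, and ∆t, shows the means and variances agreeing to O(∆t²).

```python
import numpy as np

# Check of (6) in 1-D: exact conditioning of the joint Gaussian
# (z(t), z(t+dt)) given y versus the leading-order form in (6).
y, t, dt, zt_next = 2.0, 0.5, 1e-3, 2.9

# Exact: z(t) ~ N(y, t), z(t+dt) = z(t) + N(0, dt), so Cov = t, Var = t+dt.
mean_exact = y + (t / (t + dt)) * (zt_next - y)
var_exact = t - t**2 / (t + dt)              # = t*dt/(t+dt)

# Leading-order form (6):
mean_6 = zt_next + dt * (y - zt_next) / t
var_6 = dt

print(mean_exact, mean_6)   # agree to O(dt^2)
print(var_exact, var_6)     # agree to O(dt^2)
```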
For the special case of (1) the Fokker-Planck equation

We have z(t) = z(t + ∆t) − ∆tǫ which now gives becomes
 
∂pt (z) 1
p(z(t)|z(t + ∆t), y) = −∇ · − ∇z pt (z) (18)
  ∂t 2
∆t(y − z(t + ∆t))
= N z(t + ∆t) + , ∆tI
t One can gain intuition for this equation by considering
the density of perfume in the air over time. Perfume ex-
This is (6). This implies pands into air by a diffusion process. There is a diffusion-
flow of the perfume. The perfume flows in a direction from
E[z(t)|z(t + ∆t)]
higher concentration to lower concentrations. For the diffu-
∆t(E[y|t, z(t + ∆t)] − z(t + ∆t)) sion process defined here the diffusion flow vector at “posi-
= z(t + ∆t) +
t tion” z and time t is −(1/2)∇z pt (z). The time derivative
We also have of the concentration at a particular position z and time t is
determined by the difference between the flow-out and the
p(z(t)|z(t + ∆t)) = Ey [p(z(t)|z(t + ∆t), y)] flow-in for a little ball about z. For a flow vector F the
flow-out minus flow-in (per unit volume) is given by the di-
This implies that p(z(t)|z(t + ∆t)) is a mixture of Gaus- vergence ∇ · F . The flow-in minus the flow-out is −∇ · F .
sians and hence is Gaussian. The mean of the distribution We can rewrite (18) as
equals z(t + ∆t) plus a term proportional to ∆t. The stan-
dard deviation
√ (the square root of the variance) is propor-   
tional to ∆t which, in the limit of small ∆t, is infinitely ∂pt (z) 1
= −∇ · − ∇z ln pt (z) pt (z) (19)
∂t 2

3
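As a sanity check of (18), the heat kernel pt(z) = N(0, t), the density of z(t) when the population is a point mass at 0, should satisfy it. A symbolic verification in one dimension (a sketch using sympy; the point-mass population is an assumed special case):

```python
import sympy as sp

# Check that p_t(z) = N(0, t) satisfies dp/dt = (1/2) d^2 p / dz^2, i.e. (18).
z = sp.symbols('z', real=True)
t = sp.symbols('t', positive=True)
p = sp.exp(-z**2 / (2 * t)) / sp.sqrt(2 * sp.pi * t)

lhs = sp.diff(p, t)
rhs = sp.Rational(1, 2) * sp.diff(p, z, 2)
print(sp.simplify(lhs - rhs))    # 0
```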
Comparing this with the general case of the Fokker-Planck equation (17), we can see that (19) can be interpreted as the case of a velocity vector with zero noise. We have that the time evolution of pt(z) under (1) is the same as that given by (18), which is the same as that given by (19), which, in turn, is the same as the time evolution of pt(z) under a deterministic velocity field given by

$$\frac{dz}{dt} = -\frac{1}{2}\nabla_z \ln p_t(z) \tag{20}$$

The time-reversal of this deterministic differential equation gives a deterministic reverse-diffusion process.

For any density p(z) the gradient vector ∇z ln p(z) is called the score function. For the diffusion process we can solve for the score function as follows.

$$p_t(z) = E_y\, p_t(z|y) = E_y\, \frac{1}{Z(t)}\, e^{-\frac{||z - y||^2}{2t}}$$

$$\nabla_z p_t(z) = E_y\, p_t(z|y)\,(y - z)/t = p_t(z)\, E_{y|z}\,(y - z)/t = p_t(z)\,(E[y|t, z] - z)/t$$

$$\nabla_z \ln p_t(z) = \frac{\nabla_z p_t(z)}{p_t(z)} = \frac{E[y|t, z] - z}{t} \tag{21}$$
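The score formula (21) can be checked numerically. For an illustrative two-point population y ∈ {−1, +1} with equal weights, pt(z) is a two-component Gaussian mixture and E[y|t, z] = tanh(z/t) in closed form; the sketch compares (21) with a finite-difference derivative of ln pt(z).

```python
import numpy as np

# Numerical check of (21) for the two-point population y in {-1, +1}.
def p_t(z, t):
    g = lambda m: np.exp(-(z - m) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return 0.5 * g(-1.0) + 0.5 * g(1.0)

z, t, h = 0.37, 0.8, 1e-6
score_fd = (np.log(p_t(z + h, t)) - np.log(p_t(z - h, t))) / (2 * h)
score_21 = (np.tanh(z / t) - z) / t       # (E[y|t,z] - z)/t, equation (21)

print(score_fd, score_21)                 # agree to finite-difference error
```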
The deterministic reverse-diffusion process (20) can now be written as

$$z(t - \Delta t) = z(t) + \Delta t\, \frac{E[y|t, z(t)] - z(t)}{2t} \tag{22}$$

It is useful to compare (22) with the reverse-diffusion SDE (8) derived by direct analysis. The reverse-diffusion SDE (8) has exactly twice the linear term of the deterministic reverse-diffusion process (22). Intuitively, noise broadens the distribution whether it is applied forward or backward in time. To reverse the process we can apply the noise for ∆t, which "diffuses" the distribution by ∆t, but then reverse that diffusion by doubling the reverse-diffusion term. More rigorously, if we define t′ = −t then (8) can be written as

$$z(t' + \Delta t') = z(t') + \Delta t'\, \frac{E[y|t, z(t)] - z(t)}{t} + \sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim N(0, I)$$

Applying the Fokker-Planck equation to this yields

$$\frac{\partial p_t(z)}{\partial t'} = -\nabla \cdot \left( (\nabla_z \ln p_t(z))\, p_t(z) - \frac{1}{2}\nabla_z p_t(z) \right) = -\nabla \cdot \left( \frac{1}{2}\, (\nabla_z \ln p_t(z))\, p_t(z) \right)$$

This last equation corresponds to the time-reversal of the deterministic differential equation (20). Hence the Fokker-Planck analysis of (8) also gives that (8) reverses the time evolution of the density pt(z).

We can generalize the reverse-diffusion process to include any degree of stochasticity. For any λ ≥ 0 we can use the SDE

$$z(t - \Delta t) = z(t) + \frac{1 + \lambda^2}{2}\, \frac{E[y|t, z(t)] - z(t)}{t}\, \Delta t + \lambda \sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim N(0, I) \tag{23}$$

This provides a potentially useful hyper-parameter for implementations, which are necessarily discrete-time approximations to the differential equations.
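A sketch of the λ-interpolated sampler (23) follows, reusing the closed-form E[y|t, z] = tanh(z/t) of the two-point population from the previous sketch: λ = 0 gives the deterministic rule (22) and λ = 1 gives the reverse-diffusion SDE (8). The time grid, endpoints, and step count are illustrative choices (the grid is uniform in ln t, as suggested in section 2).

```python
import numpy as np

# The λ-interpolated reverse process (23) for the two-point population.
rng = np.random.default_rng(0)

def reverse_sample(lam, t_max=100.0, t_min=1e-4, n_steps=2000):
    z = np.sqrt(t_max) * rng.standard_normal()          # z(t_max), roughly
    ts = np.exp(np.linspace(np.log(t_max), np.log(t_min), n_steps))
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next
        drift = (1 + lam**2) / 2 * (np.tanh(z / t) - z) / t   # per (23)
        z = z + drift * dt + lam * np.sqrt(dt) * rng.standard_normal()
    return z

print([round(reverse_sample(lam), 3) for lam in (0.0, 0.5, 1.0)])
# each run lands near -1 or +1, the two population points
```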
4 Conclusions

Both the VAE analysis and the Fokker-Planck analysis yield reverse-diffusion processes for sampling from a population. Fokker-Planck yields reverse-diffusion processes parameterized by a reverse-diffusion noise level. The VAE analysis yields likelihood formulas. In contrast to some discussions in the literature, Langevin dynamics and simulated annealing seem unrelated to reverse-diffusion.

5 Attributions

The Fokker-Planck equation is named after Adriaan Fokker and Max Planck, who described it in 1914 and 1917 respectively. It is also known as the Kolmogorov forward equation, after Andrey Kolmogorov, who independently discovered it in 1931. Feller (1946), cited by Sohl-Dickstein et al. (2015), gives a reverse-time variant of the Fokker-Planck equation but not a reverse-time SDE. Feller (1946) cites a more detailed analysis in Feller (1936). Reverse-time SDEs derived from the Fokker-Planck equation can be found in Anderson (1982), which is cited by Song et al. (2021).

The idea of sampling by reverse-diffusion appears in Sohl-Dickstein et al. (2015). They employ a VAE analysis in which generator distributions are trained by optimizing the variational bound (5). More effective methods for optimizing the variational bound were given in Ho et al. (2020), and a large number of empirical refinements to this approach have appeared in the literature. A variational bound related to (10) appears in Huang et al. (2021). To my knowledge the expressions for ln pop(y) and I(y, z(t)) given in (10) and (12) are original here.

Song et al. (2021) cites the results in Anderson (1982) to motivate the score-matching interpretation. They use a score-matching objective from Hyvarinen (2005) to train an approximate score network. Neither the score formula (21) nor the deterministic reverse-diffusion ODE (22) appear in Song et al. (2021). The Fokker-Planck derivations, the deterministic reverse-diffusion ODE (22), and the reverse-diffusion SDE (8) appear in Karras et al. (2022).

References
Anderson, B. D. O. Reverse-time diffusion equation models. Stochastic Process. Appl., 12(3), May 1982.

Feller, W. Zur Theorie der stochastischen Prozesse. Mathematische Annalen, 113:113–160, 1936.

Feller, W. On the theory of stochastic processes, with particular reference to applications. In Proceedings of the [First] Berkeley Symposium on Mathematical Statistics and Probability, 1946.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In NeurIPS, 2020.

Huang, C., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. In NeurIPS, 2021.

Hyvarinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 2005.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. arXiv:2206.00364, 2022. URL https://arxiv.org/abs/2206.00364.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
