Demystifying Variational Diffusion Models


Fabio De Sousa Ribeiro and Ben Glocker

Department of Computing, Imperial College London, UK
{fdesousa,b.glocker}@imperial.ac.uk

arXiv:2401.06281v1 [cs.LG] 11 Jan 2024

January 15, 2024

Abstract

Despite the growing popularity of diffusion models, gaining a deep understanding of the model
class remains somewhat elusive for the uninitiated in non-equilibrium statistical physics. With that
in mind, we present what we believe is a more straightforward introduction to diffusion models
using directed graphical modelling and variational Bayesian principles, which imposes relatively
fewer prerequisites on the average reader. Our exposition constitutes a comprehensive technical
review spanning from foundational concepts like deep latent variable models to recent advances
in continuous-time diffusion-based modelling, highlighting theoretical connections between model
classes along the way. We provide additional mathematical insights that were omitted in the seminal
works whenever possible to aid in understanding, while avoiding the introduction of new notation. We
envision this article serving as a useful educational supplement for both researchers and practitioners
in the area, and we welcome feedback and contributions from the community.1

1 https://github.com/biomedia-mira/demystifying-diffusion

Contents

0 Notation

1 Introduction

1.1 Variational Autoencoder
1.2 Hierarchical VAE
1.3 Generative Feedback
1.4 Top-down Inference

2 Variational Diffusion Models

2.1 Forward Process: Gaussian Diffusion
2.1.1 Linear Gaussian Transitions: q(zt | zs)
2.1.2 Top-down Posterior: q(zs | zt, x)
2.1.3 Learning the Noise Schedule
2.2 Reverse Process: Discrete-Time Generative Model
2.2.1 Generative Transitions: p(zs | zt)
2.2.2 Variational Lower Bound
2.2.3 Deriving DKL (q(zs | zt, x) ∥ p(zs | zt))
2.2.4 Monte Carlo Estimator of LT (x)
2.3 Reverse Process: Continuous-Time Generative Model
2.3.1 On Infinite Depth
2.3.2 Monte Carlo Estimator of L∞ (x)
2.3.3 Invariance to the Noise Schedule
2.4 Understanding Diffusion Objectives
2.4.1 Weighted Diffusion Loss
2.4.2 Noise Schedule Density
2.4.3 Importance Sampling Distribution
2.4.4 ELBO with Data Augmentation

3 Discussion

Notation

Symbol Description Section
x Observed datapoint, e.g. input image §1
t Time variable t ∈ {1, . . . , T }, or t ∈ [0, 1] for continuous-time §1.2, §2.1
zt Latent variable at time t §1.2
z1:T Finite set of latent variables z1 , z2 , . . . , zT §1.2
z0:1 Set of latent variables in continuous-time from t=0 to t=1 §2.1
αt Noise schedule coefficient αt ∈ (0, 1) §2.1
σt2 Noise schedule variance σt2 ∈ (0, 1) §2.1
ϵt Isotropic random noise, ϵt ∼ N (0, I) §1, §2.1
SNR(t) Signal-to-noise ratio function, defined as αt2 /σt2 §2.1.3
q(zt | x) Latent variable distribution §2.1
q(zt | zs ) Transition distribution from time s to time t, where s < t §2.1.1
αt|s Transition coefficient from time s to t §2.1.1
σt|s2 Variance of transition distribution §2.1.1
q(zs | zt , x) Top-down posterior distribution at time s < t §1.4, §2.1.2
µQ (zt , x; s, t) Mean of top-down posterior distribution at time s; µQ for short §2.1.2
σQ2 (s, t) Variance of top-down posterior distribution; σQ2 for short §2.1.2
p(zs | zt ) Generative transition distribution defined as q(zs | zt , x = x̂θ (zt , t)) §2.2
p(x | z0 ) Image likelihood, equiv. p(x | z1 ) in discrete-time §2.2, §1.2
ϕ Variational parameters pertaining to qϕ §1
θ Model parameters pertaining to pθ §1, §2.2
x̂θ (zt , t) Denoising model for mapping any zt to x §2.2.1
ϵ̂θ (zt , t) Noise prediction model, approximates ∇zt log q(zt ) §2.2.1
ŝθ (zt , t) Score model, equal to −ϵ̂θ (zt , t)/σt §2.2.1
µθ (zt ; s, t) Predicted posterior mean at time s; µθ for short §2.2.1
VLB(x) Single-datapoint variational lower bound; equiv. ELBO(x) §1, §2.2.2
LT (x) Discrete-time diffusion loss §2.2.2
L∞ (x) Continuous-time diffusion loss §2.3.1, §2.3.2
Lw (x) Weighted diffusion loss; also L∞ (x, w) §2.4.1, §2.4.2
γ η (t) Neural network with parameters η for learning the noise schedule §2.1.3
w(·) Noise level weighting function §2.4.1
λ Logarithm of the signal-to-noise ratio SNR(t); also λt §2.4.2
λmin Lowest log signal-to-noise ratio given by fλ (t = 1) §2.4.2
λmax Highest log signal-to-noise ratio given by fλ (t = 0) §2.4.2
p(λ) Density over noise levels §2.4.2
fλ (t) Noise schedule function, mapping t to λ §2.4.2
L(t; x) Joint KL divergence up to time t §2.4.4
pw (t) Data augmentation kernel specified by w(·) §2.4.4

1 Introduction

The basic setup of generative modelling involves using a dataset of observations of x to estimate the
marginal distribution p(x). Estimating p(x) accurately enables various useful things such as: (i) sample
generation; (ii) density estimation; (iii) compression; (iv) data imputation; (v) model selection, etc. Since
p(x) is unknown, we must approximate it with a model pθ (x) ≈ p(x), by optimizing some parameters θ.
There are many ways to estimate p(x); we focus on Variational Diffusion Models (VDMs) (Kingma et al.,
2021; Kingma and Gao, 2023), which are a family of diffusion-based generative models (Sohl-Dickstein
et al., 2015). Despite the growing popularity of diffusion models, gaining a deep understanding of the
model class remains somewhat elusive for the uninitiated in non-equilibrium statistical physics. Hence,
we present a more straightforward introduction to diffusion models using directed graphical modelling
and variational inference principles, which imposes relatively fewer prerequisites on the average reader.

With that in mind, the goal of this paper is to provide a thorough introduction to VDMs, with-
out overlooking mathematical details or introducing new notation relative to the seminal works. We
start by reviewing the fundamental principles and motivations behind Variational Autoencoders
(VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), and their hierarchical counterparts (Salimans
et al., 2015; Sønderby et al., 2016; Kingma et al., 2016). We then introduce diffusion probabilistic mod-
els as a natural extension of discrete-time hierarchical VAEs with a particular choice of inference and
generative model, before delving into continuous-time variants which represent infinitely deep VAEs.

Variational perspectives on diffusion have also been studied by Tzen and Raginsky (2019); Huang et al.
(2021); Vahdat et al. (2021); we focus on VDMs since they represent the smoothest transition from VAEs
to diffusion models. Our work is complementary to Luo (2022), as they too provide an introduction to
diffusion models. However, our exposition is far more comprehensive, up-to-date, and mathematically
consistent with the seminal works on VDMs (Kingma et al., 2021; Kingma and Gao, 2023). Furthermore,
we cover recent material on VDMs++ (Kingma and Gao, 2023) and provide additional instructive insights
which we hope will contribute to the dissemination and understanding of this model class.

1.1 Variational Autoencoder

Variational autoencoder models assume that data x ∈ X^D are generated by some random process
involving an unobserved random variable z ∈ Z^K. The marginal distribution of x is: p(x) = ∫ p(x, z) dz.

The generative process is straightforward: (i) sample a latent variable from a prior distribution z ∼ p(z);
(ii) sample an observation from a conditional distribution x ∼ p(x | z). If we choose z to be a discrete
random variable and p(x | z) to be a Gaussian distribution, then p(x) is a Gaussian mixture. If we instead
choose z to be a continuous random variable, then p(x) represents an infinite mixture of Gaussians.

For complicated non-linear likelihood functions where p(x | z) is parameterized by a deep neural network,
integrating out the latent variable z to compute p(x) has no analytic solution, so we must rely on
approximations. A straightforward Monte Carlo approximation of p(x) is certainly possible:

p(x) = Ez∼p(z) [p(x | z)] ≈ (1/N) ∑_{i=1}^{N} p(x | zi ),   where z1 , . . . , zN are i.i.d. samples from p(z),  (1)

but is subject to the curse of dimensionality, since the number of samples needed to properly cover the
latent space grows exponentially with the dimensionality of the latent variable z.
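As a concrete illustration of the estimator in Equation 1, the following sketch (our own, not from the paper) assumes a one-dimensional toy model z ∼ N(0, 1), x | z ∼ N(z, 1), for which the marginal p(x) = N(x; 0, 2) is available in closed form, and compares the Monte Carlo estimate against it:

```python
import numpy as np

# Illustrative sketch: Monte Carlo estimate of p(x) for an assumed toy model
# with a single latent dimension, where the exact marginal is known.
# Model: z ~ N(0, 1), x | z ~ N(z, 1)  =>  p(x) = N(x; 0, 2).
rng = np.random.default_rng(0)

def mc_marginal_likelihood(x, n_samples=100_000):
    z = rng.standard_normal(n_samples)                       # z_i ~ p(z), i.i.d.
    lik = np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)   # p(x | z_i)
    return lik.mean()                                        # (1/N) sum_i p(x | z_i)

x = 0.7
exact = np.exp(-0.25 * x**2) / np.sqrt(2 * np.pi * 2)        # N(x; 0, 2)
approx = mc_marginal_likelihood(x)
print(exact, approx)  # the two agree closely in one dimension
```

In one latent dimension the estimate is accurate with modest sample sizes; the point of the surrounding text is that this breaks down as the dimensionality of z grows.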
1.1 Variational Autoencoder 5

Alternatively, we can turn to variational methods, which pose probabilistic inference as an optimization
problem (Jordan et al., 1999). The first thing to note is that the intractability of p(x) is related to the
intractability of the true posterior over the latent variable p(z | x) through a basic identity:

p(x) = p(x | z) p(z) / p(z | x) ,   where   p(z | x) = p(x | z) p(z) / ∫ p(x | z) p(z) dz .  (2)

Using a complicated neural network-based likelihood renders the integral on the RHS intractable. To
estimate p(x) we can approximate the true posterior p(z | x) via a parametric inference model q(z | x) of
our choice, such that q(z | x) ≈ p(z | x). Learning a single function with shared variational parameters
ϕ to map each datapoint x to a posterior distribution q(z | x) is known as amortized inference.

We may optionally write qϕ (z | x) and pθ (x | z) to explicitly state that these are parametric distributions
realized by an encoder-decoder setup with variational parameters ϕ and model parameters θ. The typical
VAE setup specifies a prior p(z) with no learnable parameters, and it is often chosen to be standard
Gaussian: p(z) = N (z; 0, I). It is important to note that unlike the latent variable(s) z which are local,
the parameters {ϕ, θ} are global since they are shared for all datapoints. To improve our approximation,
we’d like to minimize the Kullback-Leibler (KL) divergence DKL (q(z | x) ∥ p(z | x)) w.r.t. q(z | x), but it
is not possible to do so directly as we do not have access to the true posterior p(z | x) for evaluation.

VAEs maximise the Variational Lower Bound (VLB) of log p(x):

DKL (q(z | x) ∥ p(z | x)) = ∫ q(z | x) log [ q(z | x) / p(z | x) ] dz  (3)

= Eq(z|x) [ log q(z | x) − log ( p(x, z) / p(x) ) ]  (4)

= Eq(z|x) [ log q(z | x) − log p(x, z) ] + log p(x)  (5)

= −VLB(x) + log p(x),  (6)

adding VLB(x) to both sides reveals:

DKL (q(z | x) ∥ p(z | x)) + VLB(x) = log p(x) =⇒ log p(x) ≥ VLB(x), (7)

as DKL (q(z | x) ∥ p(z | x)) ≥ 0 by Gibbs’ inequality. Hence, maximizing the VLB implicitly minimizes
the KL divergence of q(z | x) from the true posterior p(z | x) as desired. The VLB is also known as the
Evidence Lower BOund (ELBO) since p(x) is called the evidence. The VLB optimized by VAEs is:
 
VLB(x) = Eq(z|x) [ log ( p(x, z) / q(z | x) ) ]  (8)

= Eq(z|x) [log p(x | z)] + Eq(z|x) [ log ( p(z) / q(z | x) ) ]  (9)

= Eq(z|x) [log p(x | z)] − DKL (q(z | x) ∥ p(z)) ,  (10)

which amounts to an expected likelihood objective regularized by the KL of the posterior from the prior.
If we let D be a dataset of i.i.d. data, then VLB(D) = ∑_{x∈D} VLB(x). We can use stochastic variational
inference (Hoffman et al., 2013) and the reparameterization trick (Kingma and Welling, 2013; Rezende
et al., 2014) to jointly optimize the VLB w.r.t. the model parameters θ, and variational parameters ϕ.
For more details on this procedure, the reader may refer to Kingma et al. (2019) and Blei et al. (2017).
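The pieces above can be sketched in code. The toy example below (our own; the linear "encoder" and "decoder" and all parameter values are hypothetical stand-ins for the neural networks of a real VAE) evaluates the single-datapoint VLB of Equation 10 using a reparameterized Monte Carlo estimate of the likelihood term and the analytic Gaussian KL:

```python
import numpy as np

# Minimal sketch of the single-datapoint VLB (Equation 10) with the
# reparameterization trick. Linear maps stand in for neural networks.
rng = np.random.default_rng(0)

def elbo(x, enc_w, dec_w, n_samples=64):
    # Amortized inference: q(z | x) = N(mu_phi(x), sigma_phi(x)^2)
    mu, log_sigma = enc_w[0] * x, enc_w[1] * x + enc_w[2]
    sigma = np.exp(log_sigma)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1)
    z = mu + sigma * rng.standard_normal(n_samples)
    # Monte Carlo estimate of E_q[log p(x | z)] with p(x | z) = N(dec_w * z, 1)
    log_lik = -0.5 * ((x - dec_w * z) ** 2 + np.log(2 * np.pi))
    # Analytic KL(q(z | x) || p(z)) against the standard Gaussian prior
    kl = 0.5 * (mu**2 + sigma**2 - 2 * log_sigma - 1)
    return log_lik.mean() - kl

value = elbo(x=1.0, enc_w=(0.5, 0.0, -1.0), dec_w=0.8)
print(value)  # a finite lower bound on log p(x = 1.0)
```

Because z is written as a deterministic function of (mu, sigma) and noise eps, gradients of this estimate w.r.t. the variational parameters are well defined, which is what enables joint optimization of θ and ϕ.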


Figure 1: Hierarchical latent variable graphical models. (a) The generative model p(x, z1:T ) of a hier-
archical VAE with T latent variables is a Markov chain. (b) The standard bottom-up inference model
q(z1:T | x) of a hierarchical VAE is a Markov chain in the reverse direction. (c) The top-down infer-
ence model follows the same topological ordering of the latent variables as the generative model. This
top-down structure is used to specify diffusion models. In diffusion models the posterior q(z1:T | x) is
tractable due to Gaussian conjugacy, which enables us to specify the generative model transitions as
p(zt−1 | zt ) = q(zt−1 | zt , x = x̂θ (zt ; t)), where the data x is replaced by a denoising model x̂θ (zt ; t).

1.2 Hierarchical VAE

A hierarchical VAE is a deep latent variable model comprised of a hierarchy of latent variables z1 , z2 . . . , zT .
Introducing additional (auxiliary) latent variables significantly improves the flexibility of both inference
and generative models (Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016).

The joint distribution p(x, z1:T ) specifying a generative model of x is a Markov chain
zT → zT −1 → · · · → z1 → x:

p(x, z1:T ) = p(zT ) p(zT −1 | zT ) · · · p(z1 | z2 ) p(x | z1 )  (11)

= p(zT ) [ ∏_{t=2}^{T} p(zt−1 | zt ) ] p(x | z1 ).  (12)

The approximate posterior q(z1:T | x) is a Markov chain in the reverse (bottom-up) direction
zT ← zT −1 ← · · · ← z1 ← x:

q(z1:T | x) = q(z1 | x) q(z2 | z1 ) q(z3 | z2 ) · · · q(zT | zT −1 )  (13)

= q(z1 | x) ∏_{t=2}^{T} q(zt | zt−1 ).  (14)

The marginal likelihood p(x) is obtained by marginalizing out the latent variables:
p(x) = ∫ p(x | z1 ) p(z1 ) dz1 ,   p(zt ) = ∫ p(zt | zt+1 ) p(zt+1 ) dzt+1 ,   t = 1, 2, . . . , T − 1,  (15)

and the model is fit by maximizing the VLB of log p(x):

log p(x) ≥ Eq(z1:T |x) [ log ( p(x, z1:T ) / q(z1:T | x) ) ] =: VLB(x).  (16)
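Ancestral sampling from the factorization in Equation 12 can be sketched as follows; the linear-Gaussian transitions and their coefficients below are arbitrary illustrative choices standing in for learned conditionals:

```python
import numpy as np

# Sketch of ancestral sampling from the hierarchical generative model in
# Equation 12: z_T ~ p(z_T), then z_{t-1} ~ p(z_{t-1} | z_t), then x ~ p(x | z_1).
rng = np.random.default_rng(0)
T, dim = 4, 2

def sample_x():
    z = rng.standard_normal(dim)                  # z_T ~ p(z_T) = N(0, I)
    for _ in range(T - 1):                        # assumed p(z_{t-1} | z_t)
        z = 0.9 * z + 0.1 * rng.standard_normal(dim)
    return z + 0.05 * rng.standard_normal(dim)    # assumed p(x | z_1)

samples = np.stack([sample_x() for _ in range(1000)])
print(samples.shape)  # (1000, 2)
```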

1.3 Generative Feedback

The trouble with hierarchical latent variable models with bottom-up inference is bottom-up inference.
Burda et al. (2015) and Sønderby et al. (2016) both found hierarchical models with purely bottom-up
inference are typically not capable of utilizing more than two layers of latent variables. This often
manifests as posterior collapse, whereby the posterior distribution (of the top-most layer, say) collapses
to a standard Gaussian prior, failing to learn meaningful representations and effectively deactivating
latent variables.

Figure 2: A Ladder Network. The latent variables z1 , z2 , . . . , zT are noisy representations of x, and
d1 , d2 , . . . , dT are clean representations; both sets are produced by a shared encoder (blue arrows). The
variables ẑ1 , ẑ2 , . . . , ẑT are outputs of denoising functions where ẑt = gt (zt , ẑt+1 ). Notice how gt (·)
receives both bottom-up and top-down information. The dashed horizontal lines denote local cost
functions used to minimize ∥ẑt − dt ∥22 . The main difference compared to denoising diffusion models is
that here the denoising targets dt are learned representations of x.

To understand why bottom-up inference is challenging for even modestly deep hierarchies, we start by
noting the asymmetry between the associated generative and inference models in Equations 12 and 14
respectively. Burda et al. (2015); Sohl-Dickstein et al. (2015) point this out as a source of difficulty in
training the inference model efficiently, since there is no way to express each term in the VLB as an
expectation under a distribution over a single latent variable. Luo (2022) present a similar efficiency-based
argument against using bottom-up inference in hierarchical latent variable models.

We claim that efficiency arguments paint an incomplete picture; the main reason one should avoid
bottom-up inference is the lack of direct feedback from the generative model. To show why generative
feedback is important, we stress that the purpose of the inference model is to perform Bayesian inference
at any given layer in the hierarchy. That is, to compute the posterior distribution q(zt | x) over each
latent variable zt :

q(zt | x) = p(x | zt ) p(zt ) / p(x) ∝ p(x | zt ) ∫ p(zt | zt+1 ) p(zt+1 ) dzt+1 ,  (17)

which clearly shows that the posterior is not only proportional to the current layer’s prior p(zt ) but also to
the layer above’s p(zt+1 ), and so on, following the reverse of the generative Markov chain (Equation 12).

It therefore stands to reason that interleaving feedback from each transition in the generative model into
each respective transition in the inference model can only make the inference network more accurate.
To that end, we can take Equation 17 and rewrite the posterior distribution over each zt such that it
contains a more explicit dependency on the preceding latent variables as prescribed:
q(zt | x) = ∫ q(zt | zt+1 , x) dzt+1 ∝ ∫ p(x | zt ) p(zt | zt+1 ) p(zt+1 ) dzt+1  (18)

=⇒ q(zt−1 | zt , x) ∝ p(x | zt−1 )p(zt−1 | zt )p(zt ). (19)

The posterior q(zt−1 | zt , x) now follows the same topological ordering of the latent variables as the
prior p(zt−1 | zt ), and it coincides with the top-down inference model in HVAEs (Sønderby et al., 2016;
Kingma et al., 2016). Figure 1 shows how this top-down structure compares to the bottom-up approach.
An added benefit of the top-down approach is that the generative model can also receive data-dependent
feedback from the inference procedure, which Sønderby et al. (2016) found to be beneficial in practice.

Valpola (2015); Rasmus et al. (2015) were the first to introduce such lateral feedback connections between
the inference and generative paths in hierarchical latent variable models. They called their denoising
autoencoder a ladder network (see Figure 2), which later inspired the ladder VAE (Sønderby et al.,
2016). Valpola (2015) argue that incorporating lateral feedback connections enables the higher layers
to learn abstract invariant representations as they no longer have to retain all the details about the
input. Concretely, as depicted in Figure 2, each denoised variable ẑt := gt (zt , ẑt+1 ) is computed using a
denoising function gt (·) which receives bottom-up feedback from zt and top-down feedback from ẑt+1 .

1.4 Top-down Inference

The joint approximate posterior q(z1:T | x) can be alternatively factorized into the top-down inference
model (Sønderby et al., 2016; Kingma et al., 2016). As mentioned in Section 1.3, the top-down inference
model follows the same topological ordering of the latent variables as the generative model, that is:

q(z1:T | x) = q(zT | x) q(zT −1 | zT , x) · · · q(z2 | z3 , x) q(z1 | z2 , x)  (20)

= q(zT | x) ∏_{t=2}^{T} q(zt−1 | zt , x).  (21)

Variants of the top-down inference model have featured in much deeper state-of-the-art HVAEs for
sample generation (Maaløe et al., 2019; Vahdat and Kautz, 2020; Child, 2020; Shu and Ermon, 2022)
and approximate counterfactual inference (De Sousa Ribeiro et al., 2023; Monteiro et al., 2022). As we
will explain later, the top-down hierarchical latent variable model serves as the basis for parameterizing
denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Kingma et al., 2021).

For now, we derive the corresponding VLB to obtain a concrete optimization objective:
 
VLB(x) = Eq(z1:T |x) [ log ( p(x, z1:T ) / q(z1:T | x) ) ]  (22)

= Eq(z1:T |x) [ log ( p(zT ) p(x | z1 ) ∏_{t=2}^{T} p(zt−1 | zt ) / ( q(zT | x) ∏_{t=2}^{T} q(zt−1 | zt , x) ) ) ]  (factor the joint)  (23)

= Eq(z1:T |x) [ log p(x | z1 ) + log ( p(zT ) / q(zT | x) ) ] + Eq(z1:T |x) [ ∑_{t=2}^{T} log ( p(zt−1 | zt ) / q(zt−1 | zt , x) ) ]  (24)

= Eq(z1 |x) [log p(x | z1 )] + Eq(zT |x) [ log ( p(zT ) / q(zT | x) ) ]
+ ∑_{t=2}^{T} Eq(zt−1 ,zt |x) [ log ( p(zt−1 | zt ) / q(zt−1 | zt , x) ) ]  (25)

= Eq(z1 |x) [log p(x | z1 )] + Eq(zT |x) [ log ( p(zT ) / q(zT | x) ) ]
+ ∑_{t=2}^{T} Eq(zt |x) [ Eq(zt−1 |zt ,x) [ log ( p(zt−1 | zt ) / q(zt−1 | zt , x) ) ] ]  (26)

= Eq(z1 |x) [log p(x | z1 )] − DKL (q(zT | x) ∥ p(zT ))
− ∑_{t=2}^{T} Eq(zt |x) [ DKL (q(zt−1 | zt , x) ∥ p(zt−1 | zt )) ] .  (27)


Figure 3: Demonstration of the hole problem in VAEs. Results are from a single stochastic layer VAE
trained on a 2D toy dataset with five clusters. The latent variable z is also 2-dimensional for illustration
purposes. The leftmost column shows the dataset, overlaid with reconstructed datapoints (red border)
and random samples from the generative model (blue border). The remaining columns show the assumed
prior p(z) = N (z; 0, I) (blue contours) overlaid with the aggregate posterior q(z) = (1/N) ∑_{i=1}^{N} q(z | xi ).
As shown, there are regions with high density under the prior which are assigned low density under the
aggregate posterior. This affects the quality of the random samples since we are likely to sample from
regions in p(z) not covered by the data. Further, the bottom row shows a common occurrence in VAEs
where latent variable(s) are not activated/used at all by the model; in this case, z2 was not used.

It is well worth dedicating some time to understanding the details of the above derivation and the
resulting expression, as it is the exact objective optimized by diffusion models as well.

One thing to notice is that it comprises the familiar trade-off between minimizing input reconstruction
error and keeping the (hierarchical) approximate posterior q(z1:T | x) close to the prior p(z1:T ). In
contrast to standard VAEs, the prior in HVAEs is learned from data rather than being fixed, as this
affords greater flexibility (Kingma et al., 2016; Hoffman and Johnson, 2016; Tomczak and Welling, 2018).

The Problem with VAEs. A primary issue with VAEs is the hole problem (Rezende and Viola,
2018). The hole problem refers to the mismatch between the so-called aggregate posterior q(z) and the
prior p(z) over the latent variables (Makhzani et al., 2015; Hoffman and Johnson, 2016). The aggregate
posterior is simply the average posterior distribution over the dataset D = {xi }_{i=1}^{N} , that is:

q(z) = ∫ q(z | x) pD (x) dx ,   pD (x) = (1/N) ∑_{i=1}^{N} δ(x − xi ),  (28)

where pD (x) is the empirical distribution, constructed by a Dirac delta function δ(·) centered on each
training datapoint xi . As shown in Figure 3, there can be regions with high probability density under the
prior which have low density under the aggregate posterior. This affects the quality of generated samples
when the decoder receives z’s sampled from regions not covered by the data. As we will show, diffusion
models circumvent this by defining the aggregate posterior to be equal to the prior by construction.
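A minimal one-dimensional sketch of this mismatch (our own; the per-datapoint posterior means and scale are hypothetical, chosen to mimic clustered data as in Figure 3) shows a "hole" at the prior's mode:

```python
import numpy as np

# Illustrative 1D sketch of the hole problem: the aggregate posterior q(z)
# of Equation 28 (a uniform mixture of per-datapoint posteriors) can assign
# near-zero density where the prior p(z) = N(0, 1) is high.
def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

post_means = np.array([-2.0, -1.8, 1.8, 2.0])  # hypothetical posterior means
post_sigma = 0.2

def aggregate_posterior(z):
    return np.mean([normal_pdf(z, m, post_sigma) for m in post_means])

prior_at_origin = normal_pdf(0.0, 0.0, 1.0)    # ≈ 0.399: the prior's mode
agg_at_origin = aggregate_posterior(0.0)       # ≈ 0: a "hole" under q(z)
print(prior_at_origin, agg_at_origin)
```

Sampling z = 0 from the prior is likely, yet no training datapoint's posterior covers it, so the decoder has never seen such a z; this is precisely the failure mode the figure illustrates.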


Figure 4: Probabilistic graphical models of HVAEs and diffusion models. (a) The general top-down
hierarchical latent variable model. (b) The top-down model used to specify diffusion models, where
q(zT | x) = q(zT ) by construction. Here the posterior q(z1:T | x) is a fixed noising process, so the
modelling task is bottom-up prediction of x from each zt , i.e. denoising (dashed lines). (c) The top-
down model used for posterior inference in HVAEs. It consists of a deterministic bottom-up pass to
compute d1 , . . . , dT , followed by a stochastic top-down pass to compute zT , . . . , z1 . (d) The reverse process
of a diffusion model, i.e. the generative model. The main differences compared to (c) are that here the
deterministic variables dT −1 , . . . , d1 do not depend on x nor have their own hierarchical dependencies.
Further, the blue lines represent a denoising model x̂θ : zt → dt which is shared across the hierarchy.

2 Variational Diffusion Models

A diffusion probabilistic model (Sohl-Dickstein et al., 2015) can be understood as a hierarchical VAE
with a particular choice of inference and generative model. Like HVAEs, diffusion models are deep latent
variable models that maximize the variational lower bound of the log-likelihood of the data (i.e. the
ELBO). Diffusion models were largely inspired by ideas from statistical physics rather than variational
Bayesian methods, so they come with a different set of modelling choices and advantages. The general
idea behind diffusion models is to define a fixed forward (inference) diffusion process that converts any
complex data distribution into a tractable distribution, and then learn a generative model that reverses
this diffusion process. Figure 4 compares diffusion models with (top-down inference) HVAEs.

Diffusion probabilistic models have the following distinctive properties:

(i) The joint posterior q(z1:T | x) is fixed rather than learned from observed data. This amounts to
having a fixed encoder defining a Gaussian diffusion process of the data;

(ii) Each latent variable zt has the same dimensionality as the input data x;

(iii) The aggregate posterior q(zT ) is equal to the prior p(zT ) by construction;

(iv) The functional form of the inference model is identical to that of the generative model. This
corresponds exactly to the top-down inference model structure used in HVAEs;

(v) A single neural network is shared across all levels of the latent variable hierarchy, and each layer
can be trained without having to compute the preceding ones;

(vi) They maximize a particular weighted objective which seems to better align with human perception
by suppressing modelling effort on imperceptible details.



Figure 5: Graphical model(s) describing a discrete-time Gaussian diffusion process (T timesteps in total).
(a) Parameterization of the forward process in terms of the conditionals q(zt | x) (ref. Section 2.1). Each
latent variable zt is a noisy version of x given by: zt = αt x + σt ϵt , and ϵt ∼ N (0, I). (b) Markov chain
formed by a sequence of transition distributions q(zt | zt−1 ) (ref. Section 2.1.1). Each latent variable is
given by: zt = αt|t−1 zt−1 + σt|t−1 ϵt , with parameters αt|t−1 := αt /αt−1 and σt|t−12 := σt2 − αt|t−12 σt−12 . (c)
The top-down posterior is tractable due to Gaussian conjugacy: q(zt−1 | zt , x) ∝ q(zt | zt−1 ) q(zt−1 | x)
(ref. Section 2.1.2), where q(zt−1 | x) acts as a Gaussian prior and q(zt | zt−1 ) as a Gaussian likelihood.
This top-down posterior is used to specify the generative model transitions as p(zt−1 | zt ) = q(zt−1 |
zt , x = x̂θ (zt ; t)), where the data x is replaced by a learnable denoising model x̂θ (zt ; t).

Recent model innovations (Ho et al., 2020) – along with insights from stochastic processes (Anderson,
1982) and score-based generative modelling (Hyvärinen and Dayan, 2005; Vincent, 2011; Song and Er-
mon, 2019; Song et al., 2021b) – have yielded a myriad of impressive synthesis results at scale (Nichol
and Dhariwal, 2021; Dhariwal and Nichol, 2021; Nichol et al., 2022; Ho et al., 2022; Rombach et al.,
2022; Saharia et al., 2022; Hoogeboom et al., 2022).

Kingma et al. (2021); Kingma and Gao (2023) introduced a family of diffusion-based generative models
they call Variational Diffusion Models (VDMs), and showed us that:

(i) The latent hierarchy can be made infinitely deep2 via a continuous-time model where T → ∞;

(ii) The continuous-time VLB is invariant to the noise schedule3 , meaning we can learn/adapt our noise
schedule such that it minimizes the variance of the resulting Monte Carlo estimator of the loss;

(iii) Although weighted diffusion objectives appear markedly different from regular maximum likelihood
training, they all implicitly optimize some instance of the ELBO;

(iv) VDMs are capable of state-of-the-art image synthesis, showing that standard maximum likelihood-
based training objectives (i.e. the ELBO) are not inherently at odds with perceptual quality.

One important distinction to make between HVAEs and diffusion probabilistic models at this stage is
that the role of the latent variables z1:T are very different from a representation learning perspective.
In HVAEs, the posterior latents z1:T are useful learned representations of x, which increase in semantic
informativeness w.r.t. x as we go from z1 to zT . In diffusion probabilistic models, the latent variables
z1:T generally have no semantic meaning, and they decrease in informativeness w.r.t. x as we go from
z1 to zT . This is because each zt is simply a noisy version of x following a Gaussian diffusion process.
2 This notion was concurrently explored by Song et al. (2021b); Huang et al. (2021); Vahdat et al. (2021).
3 Except for the signal-to-noise ratio at its endpoints (see Section 2.3.3).


Figure 6: Gaussian diffusion process (T =100). Showing two popular noise schedules in terms of α, σ 2 as
per Section 2.1: (top) cosine (Nichol and Dhariwal, 2021); (bottom) EDM (Karras et al., 2022).

2.1 Forward Process: Gaussian Diffusion

A Gaussian diffusion process gradually transforms data x into random noise by adding increasing
amounts of Gaussian noise at each timestep t = 0, . . . , 1 resulting in a set of latent variables z0 , . . . , z1 .4
Each latent variable zt is simply a noisy version of x, and its distribution conditional on x is given by:

q(zt | x) = N (zt ; αt x, σt2 I) ,   zt = αt x + σt ϵt ,   ϵt ∼ N (ϵt ; 0, I) ,  (29)

where αt ∈ (0, 1) and σt2 ∈ (0, 1) are chosen scalar valued functions of time t ∈ [0, 1]. See Figures 5a, 6.
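Sampling from q(zt | x) in Equation 29 is a one-liner. The sketch below (our own) assumes the cosine schedule of Figure 6 as the choice of αt , σt ; any variance-preserving pair with αt2 + σt2 = 1 would do:

```python
import numpy as np

# Illustrative forward-process sampling z_t = alpha_t * x + sigma_t * eps
# (Equation 29), with an assumed cosine noise schedule (cf. Figure 6).
rng = np.random.default_rng(0)

def cosine_schedule(t):
    # (alpha_t, sigma_t) with alpha_t^2 + sigma_t^2 = 1 (variance preserving)
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def q_sample(x, t):
    alpha, sigma = cosine_schedule(t)
    return alpha * x + sigma * rng.standard_normal(x.shape)  # z_t ~ q(z_t | x)

x = rng.standard_normal(1000)  # a toy standardized "image"
corr_early = np.corrcoef(x, q_sample(x, 0.1))[0, 1]  # mostly signal
corr_late = np.corrcoef(x, q_sample(x, 0.9))[0, 1]   # mostly noise
print(corr_early, corr_late)
```

The correlation between x and zt decays towards zero as t → 1, matching the progression from x to zT shown in Figure 6.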

The key idea is to define the forward diffusion process such that the noisiest latent variable z1 at time
t = 1 is standard Gaussian distributed: q(z1 | x) = N (z1 ; 0, I), thus q(z1 | x) = q(z1 ). To that end,
the scaling coefficients α0 > . . . > α1 decrease w.r.t. time t, whereas the noise variances σ02 < . . . < σ12
increase w.r.t. t. As we will show, this enables us to learn a generative Markov chain which starts from
z1 ∼ q(z1 ) and reverses the forward diffusion process to obtain samples from the data distribution. The
implications of this are profound; the aggregate posterior q(z1 ) is equal to the prior p(z1 ) by construction,
which circumvents the hole problem in VAEs (see Figure 3). Hoffman and Johnson (2016) showed that
the optimal prior is the aggregate posterior, as long as our posterior approximation is good enough.

A variance-preserving process is achieved by solving for the value of αt such that the variance of the
respective latent variable V[zt ] is equal to the variance of the input data V[x]. This can be important
from a modelling perspective, as adding increasing amounts of noise to the input alters its statistics.

We can first apply some basic properties of variance to simplify V[zt] as follows:

V[zt] = V[αt x + σt ϵt] = V[αt x] + V[σt ϵt] = αt² V[x] + σt² V[ϵt] = αt² V[x] + σt², (30)

since V[ϵt] = 1 by definition. Taking the result and solving for αt yields

αt² V[x] + σt² = V[x] (31)
αt² = (V[x] − σt²) / V[x] (32)
=⇒ V[zt] = V[x] ⇐⇒ αt² = 1 − σt²/V[x], (33)

which further simplifies to αt² = 1 − σt² as long as our input data is standardized.
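The variance-preservation argument above is easy to verify numerically. In the sketch below, the schedule σt² = t is an arbitrary illustrative assumption, not one prescribed by the text:

```python
import numpy as np

# Numerical check of Eq. (33): for standardized data (V[x] ~ 1), the
# choice alpha_t^2 = 1 - sigma_t^2 keeps V[z_t] = V[x] at every t.
# sigma_t^2 = t is an assumed toy schedule, purely for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)              # standardized data, V[x] ~ 1

for t in [0.1, 0.5, 0.9]:
    sigma2 = t                                # toy noise schedule
    alpha = np.sqrt(1.0 - sigma2)             # variance-preserving scaling
    z_t = alpha * x + np.sqrt(sigma2) * rng.standard_normal(x.shape)
    assert abs(z_t.var() - x.var()) < 0.05    # V[z_t] matches V[x]
```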
4 For consistency with the continuous-time case where T → ∞, we denote the latent variables as z0:1 rather than z1:T .

2.1.1 Linear Gaussian Transitions: q(zt | zs )

The conditional distribution of zt given a preceding latent variable zs, for any timestep s < t, is:

q(zt | zs) = N(zt; αt|s zs, σ²t|s I),   zt = αt|s zs + σt|s ϵt,   ϵt ∼ N(ϵt; 0, I), (34)

which forms a Markov chain: z1 ← z(T−1)/T ← z(T−2)/T ← · · · ← z0 ← x; see Figure 5b for an example.
In the continuous-time case where T → ∞, each transition is w.r.t. an infinitesimal change in time dt.

The transition distribution q(zt | zs ) is useful for computing closed-form expressions for the parameters of
the posterior q(zs | zt , x), which defines our reverse-process, i.e. the generative model (ref. Section 2.1.2).

Let's focus on deriving αt|s first. By construction, we know that each zt is given by:

zt = αt x + σt ϵt = αt (zs − σs ϵs)/αs + σt ϵt, (35)

since x = (zs − σs ϵs)/αs for any s < t. The conditional mean of q(zt | zs) is then readily given by:

E[zt | zs] = αt (zs − σs E[ϵs])/αs + σt E[ϵt] (36)
= (αt/αs) zs  (since E[ϵt] = 0, ∀t) (37)
=: αt|s zs. (38)

To compute a closed-form expression for the variance σ²t|s of the transition distribution q(zt | zs), we can start by rewriting the equation for zt in terms of the preceding latent zs as follows:

zt = αt|s zs + σt|s ϵt (39)
= (αt/αs)(αs x + σs ϵs) + σt|s ϵt  (substitute αt|s and zs) (40)
= αt x + (αt/αs) σs ϵs + σt|s ϵt (41)
=⇒ σt ϵt = (αt/αs) σs ϵs + σt|s ϵt.  (since zt = αt x + σt ϵt) (42)

The above implication allows us to compute the variance σ²t|s straightforwardly. Firstly, recall that variance is invariant to changes in a location parameter, therefore: V[cX] = c² V[X] for some constant c and random variable X. Secondly, the variance of a sum of n independent random variables is simply the sum of their variances: V[Σᵢ₌₁ⁿ Xᵢ] = Σᵢ₌₁ⁿ V[Xᵢ]. Using these two properties we can show that:

V[σt ϵt] = V[(αt/αs) σs ϵs + σt|s ϵt] (43)
σt² V[ϵt] = (αt/αs)² σs² V[ϵs] + σ²t|s V[ϵt] (44)
σt² = (αt/αs)² σs² + σ²t|s (45)
σ²t|s = σt² − α²t|s σs². (46)

2.1.2 Top-down Posterior: q(zs | zt , x)

Since the forward process is a Markov chain, the joint distribution of any two latent variables zt and zs
where t > s factorizes as: q(zs , zt | x) = q(zt | zs )q(zs | x). Using Bayes’ theorem, it is then possible to
derive closed-form expressions for the parameters of the posterior distribution q(zs | zt , x), which is itself
Gaussian due to conjugacy, where q(zs | x) acts as a Gaussian prior and q(zt | zs ) a Gaussian likelihood:

q(zs | zt, x) = N(zs; μQ(zt, x; s, t), σ²Q(s, t) I),   zs = μQ(zt, x; s, t) + σQ(s, t) ϵt, (47)

with ϵt ∼ N(ϵt; 0, I). In the following, we will derive closed-form expressions for the posterior parameters μQ(zt, x; s, t) and σ²Q(s, t) in detail. For a graphical model of the posterior see Figure 5c.

Before proceeding, we note that this posterior distribution will be instrumental in defining our generative
model (i.e. the reverse process) as explained later on in Section 2.2. Furthermore, notice that the posterior
q(zs | zt , x) coincides with the top-down inference model specification of a hierarchical VAE.

For simplicity, let D denote the dimensionality of zt, satisfying dim(zt) = dim(x), ∀t. Furthermore, recall that our covariance matrix of choice is isotropic/spherical: σ²Q I. The posterior is then given by

q(zs | zt, x) = q(zt | zs) q(zs | x) / q(zt | x) (48)
∝ q(zt | zs) q(zs | x) (49)
= N(zt; αt|s zs, σ²t|s I) · N(zs; αs x, σs² I) (50)
= ∏ᵢ₌₁ᴰ (1/(σt|s √(2π))) exp{−(zt,i − αt|s zs,i)²/(2σ²t|s)} · ∏ᵢ₌₁ᴰ (1/(σs √(2π))) exp{−(zs,i − αs xi)²/(2σs²)} (51)
∝ ∏ᵢ₌₁ᴰ exp{−(zt,i − αt|s zs,i)²/(2σ²t|s)} · ∏ᵢ₌₁ᴰ exp{−(zs,i − αs xi)²/(2σs²)} (52)
= ∏ᵢ₌₁ᴰ exp{−(1/(2σ²t|s))(z²t,i − 2zt,i αt|s zs,i + α²t|s z²s,i) − (1/(2σs²))(z²s,i − 2zs,i αs xi + αs² x²i)} (53)
= ∏ᵢ₌₁ᴰ exp{−(1/2)[(z²t,i − 2zt,i αt|s zs,i + α²t|s z²s,i)/σ²t|s + (z²s,i − 2zs,i αs xi + αs² x²i)/σs²]} (54)
= ∏ᵢ₌₁ᴰ exp{−(1/2)[z²s,i (α²t|s/σ²t|s + 1/σs²) − 2zs,i (αt|s zt,i/σ²t|s + αs xi/σs²) + z²t,i/σ²t|s + αs² x²i/σs²]}. (55)

The next step is to 'match the moments' from Equation (55) with what we expect to see in a Gaussian distribution, i.e. something of the form: N(x; μ, σ²) ∝ exp{−x²/(2σ²) + μx/σ² − μ²/(2σ²)}. This exercise yields closed-form expressions for the parameters of the posterior distribution as desired. Without loss of generality, consider the D = 1 dimensional case for brevity.
Matching the first term in Eq. (55) with −x²/(2σ²) we can see that:

−(z²s/2)(α²t|s/σ²t|s + 1/σs²) =⇒ 1/σ²Q = α²t|s/σ²t|s + 1/σs², (56)

where σ²Q is the variance of the posterior q(zs | zt, x). Matching the second term in Eq. (55) with μx/σ² we get:

zs (αt|s zt/σ²t|s + αs x/σs²) =⇒ μQ/σ²Q = αt|s zt/σ²t|s + αs x/σs² =⇒ μQ = σ²Q (αt|s zt/σ²t|s + αs x/σs²), (57)

where μQ is the mean of the posterior q(zs | zt, x).

The closed-form expressions for μQ, σ²Q simplify quite significantly:

1/σ²Q = (σs²/σs²) · (α²t|s/σ²t|s) + (σ²t|s/σ²t|s) · (1/σs²) (58)
= (α²t|s σs² + σ²t|s)/(σ²t|s σs²)  =⇒  σ²Q = σ²t|s σs²/(α²t|s σs² + σ²t|s), (59)

and for the posterior mean we then have:

μQ = σ²Q (αt|s zt/σ²t|s + αs x/σs²) (60)
= [σ²t|s σs²/(α²t|s σs² + σ²t|s)] · [(σs² αt|s zt + σ²t|s αs x)/(σ²t|s σs²)] (61)
= (σs² αt|s zt + σ²t|s αs x)/(α²t|s σs² + σ²t|s) (62)
= [αt|s σs²/(α²t|s σs² + σ²t|s)] zt + [αs σ²t|s/(α²t|s σs² + σ²t|s)] x. (63)

Using the fact that σ²t|s = σt² − α²t|s σs² as in Equation 46 (so that the denominator α²t|s σs² + σ²t|s = σt²), we get the final expression:

μQ(zt, x; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x, (64)

revealing that the posterior mean µQ , equivalently denoted as µQ (zt , x; s, t) by Kingma et al. (2021), is
essentially a weighted average of the conditioning set {zt , x} of the posterior distribution q(zs | zt , x).

In summary, the top-down posterior distribution is given by:

q(zs | zt, x) = N(zs; (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x, [σ²t|s σs²/(α²t|s σs² + σ²t|s)] I) (65)
= N(zs; μQ(zt, x; s, t), σ²Q(s, t) I). (66)

To conclude, Table 1 provides a concise breakdown of all the distributions involved in defining a Gaussian
diffusion, along with the respective closed-form expressions of their parameters.

Distribution | Mean | Covariance
q(zt | x) (§2.1) | αt x | σt² I
q(zt | zs) (§2.1.1) | αt|s zs | σ²t|s I
q(zs | zt, x) (§2.1.2) | μQ(zt, x; s, t) | σ²Q(s, t) I

Parameter | Expression
αt|s | αt/αs
σ²t|s | σt² − α²t|s σs²
μQ(zt, x; s, t) | (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x
σ²Q(s, t) | σ²t|s σs²/(α²t|s σs² + σ²t|s)

Table 1: Breakdown of the distributions involved in defining a typical Gaussian diffusion (top), along with closed-form expressions for their respective parameters (bottom). Note that s denotes a preceding timestep relative to timestep t, i.e. s < t. The top-down posterior distribution q(zs | zt, x) is tractable due to Gaussian conjugacy: q(zs | zt, x) ∝ q(zt | zs) q(zs | x), where q(zs | x) plays the role of a conjugate (Gaussian) prior and q(zt | zs) plays the role of a Gaussian likelihood.
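The posterior parameters in Table 1 can also be cross-checked against the standard conditional-Gaussian formulas: given x, the pair (zs, zt) is jointly Gaussian, so E[zs | zt, x] and V[zs | zt, x] follow from ordinary Gaussian conditioning. The toy schedule σt² = t below is an assumption for illustration:

```python
import numpy as np

# Check Table 1 against standard Gaussian conditioning. Given x:
#   z_s ~ N(alpha_s x, s),  z_t = a_ts z_s + noise,
# so Cov(z_s, z_t | x) = a_ts * s and V[z_t | x] = t.
s, t = 0.3, 0.7
a = lambda u: np.sqrt(1.0 - u)                  # alpha_u (toy VP schedule)
a_ts = a(t) / a(s)
var_ts = t - a_ts**2 * s                        # sigma_{t|s}^2

# Table 1 expressions for the posterior q(z_s | z_t, x):
mu_w_zt = a_ts * s / t                          # weight on z_t
mu_w_x = a(s) * var_ts / t                      # weight on x
var_Q = var_ts * s / t                          # sigma_Q^2(s, t)

# Conditional-Gaussian formulas: regression coefficient beta = Cov / Var.
beta = a_ts * s / t
assert np.isclose(mu_w_zt, beta)
assert np.isclose(mu_w_x, a(s) - beta * a(t))   # coefficient on x
assert np.isclose(var_Q, s - beta * a_ts * s)   # posterior variance
```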

2.1.3 Learning the Noise Schedule

Perturbing data with multiple noise scales and choosing an appropriate noise schedule is instrumental
to the success of diffusion models. The noise schedule of the forward process is typically pre-specified
and has no learnable parameters, however, VDMs learn the noise schedule via the parameterization:

σt2 = sigmoid (γη (t)) , (67)

where γη(t) is a monotonic neural network comprised of linear layers with weights η restricted to be positive. A monotonic function is a function defined on a subset of the real numbers which is either entirely non-increasing or entirely non-decreasing. As explained later, the noise schedule can be conveniently parameterized in terms of the signal-to-noise ratio. The signal-to-noise ratio (SNR) is defined as SNR(t) = αt²/σt², and since zt grows noisier over time we have that: SNR(t) < SNR(s) for any t > s.

For now, we provide some straightforward derivations of the expressions for αt² and SNR(t) as a function of γη(t). Recall that in a variance-preserving diffusion process αt² = 1 − σt², therefore:

αt² = 1 − σt² = 1 − sigmoid(γη(t)) =⇒ αt² = sigmoid(−γη(t)), (68)

as for an input x ∈ R the following holds:

1 − sigmoid(x) = 1 − 1/(1 + e⁻ˣ) = (1 + e⁻ˣ)/(1 + e⁻ˣ) − 1/(1 + e⁻ˣ) = [e⁻ˣ/(1 + e⁻ˣ)] · (eˣ/eˣ) = sigmoid(−x). (69)

To derive SNR(t) as a function of γη(t), we simply substitute in the above equations and simplify:

SNR(t) = αt²/σt² = sigmoid(−γη(t))/sigmoid(γη(t))  (by definition) (70)
= (1 + e^γη(t))⁻¹/(1 + e^−γη(t))⁻¹ = (1 + e^−γη(t))/(1 + e^γη(t)) = [(e^γη(t) + 1)/(e^γη(t)(1 + e^γη(t)))]  (multiply by e^γη(t)/e^γη(t)) (71)
= 1/e^γη(t), (72)

which is equivalently expressed as SNR(t) = exp(−γη(t)).
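The identities αt² + σt² = 1 and SNR(t) = exp(−γη(t)) can be confirmed numerically. The linear γη below is an assumed toy stand-in for the learned monotonic network:

```python
import numpy as np

# Check Section 2.1.3: with sigma_t^2 = sigmoid(gamma(t)) and
# alpha_t^2 = sigmoid(-gamma(t)), the process is variance preserving
# and SNR(t) = exp(-gamma(t)). gamma is a toy increasing schedule here.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
gamma = lambda t: 10.0 * t - 5.0               # assumed monotonic schedule

for t in np.linspace(0.0, 1.0, 5):
    alpha2 = sigmoid(-gamma(t))
    sigma2 = sigmoid(gamma(t))
    assert np.isclose(alpha2 + sigma2, 1.0)                 # VP property
    assert np.isclose(alpha2 / sigma2, np.exp(-gamma(t)))   # SNR(t)
```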



2.2 Reverse Process: Discrete-Time Generative Model

The generative model in diffusion models inverts the Gaussian diffusion process outlined in Section 2.1. In other words, it estimates the reverse-time variational Markov chain relative to a corresponding forward-time diffusion process. An interesting aspect of VDMs is that they admit continuous-time generative models (T → ∞) in a principled manner, and these correspond to the infinitely deep limit of a hierarchical VAE with a fixed encoder. We describe the discrete-time model for finite T first – since it is more closely linked to the material we have already covered – and describe the continuous-time version thereafter.

Notation. To unify the notation for both the discrete and continuous-time model versions, Kingma
et al. (2021) uniformly discretize time into T segments of width τ = 1/T . Each time segment corresponds
to a level/step in the hierarchy of latent variables defined as follows:

i i−1
t(i) = , s(i) = , (73)
T T

where s(i) precedes t(i) in the timestep hierarchy, for an index i. For simplicity, we may sometimes use
s and t as shorthand notation for s(i) and t(i) when our intentions are clear from context.

As previously mentioned, the discrete-time generative model of a variational diffusion model is identical
to the hierarchical VAE’s generative model described in Section 1.2. Using the new index notation
defined above, we can re-express the discrete-time generative model as:

p(x, z0:1) = p(z1) p(z(T−1)/T | z1) p(z(T−2)/T | z(T−1)/T) · · · p(z0 | z1/T) p(x | z0) (74)
= p(z1) p(x | z0) ∏ᵢ₌₁ᵀ p(zs(i) | zt(i)), (75)

where p(z1) is the prior, p(x | z0) the likelihood, and the factors p(zs(i) | zt(i)) are the transitions.

This corresponds to a Markov chain: z1 → z(T −1)/T → z(T −2)/T → · · · → z0 → x, which is equivalent
in principle to the hierarchical VAE’s Markov chain: zT → zT −1 → · · · → z1 → x, for equal T .

Each component of the discrete-time generative model is defined as follows:

(i) The prior term can be safely set to p(z1 ) = N (0, I) in a variance preserving diffusion process since
– for small enough SNR(t = 1) – the noisiest latent z1 holds almost no information about the input
x. In other words, this means that q(z1 | x) ≈ N (z1 ; 0, I) by construction, and as such there exists
a distribution p(z1 ) such that DKL (q(z1 | x) ∥ p(z1 )) ≈ 0.

(ii) The likelihood term p(x | z0) factorizes over the number of elements D (e.g. pixels) in x, z0 as:

p(x | z0) = ∏ᵢ₌₁ᴰ p(x⁽ⁱ⁾ | z0⁽ⁱ⁾), (76)

such as a product of (potentially discretized) Gaussian distributions. This distribution could conceivably be modelled autoregressively, but there is little advantage in doing so, as z0 (the least noisy latent) is almost identical to x by construction. This means that p(x | z0) ≈ q(x | z0) for sufficiently large SNR(t = 0). Intuitively, since z0 is almost equal to x by construction, modelling p(z0) is practically equivalent to modelling p(x), so the likelihood term p(x | z0) is typically omitted, as learning p(z0 | z1/T) has proven to be sufficient in practice.

Model | μθ(zt; s, t)
Image Denoising x̂θ(zt; t) | (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x̂θ(zt; t)
Noise Prediction ϵ̂θ(zt; t) | zt/αt|s − (σ²t|s/(αt|s σt)) ϵ̂θ(zt; t)
Score-based sθ(zt; t) | zt/αt|s + (σ²t|s/αt|s) sθ(zt; t)
Energy-based Eθ(zt; t) | zt/αt|s − (σ²t|s/αt|s) ∇zt Eθ(zt; t)

Table 2: Four ways of parameterizing a diffusion-based generative model (ref. Section 2.2.1), where μθ(zt; s, t) is our estimate of the true mean μQ(zt, x; s, t) of the tractable top-down posterior q(zs | zt, x).

(iii) The transition conditional distributions p(zs | zt ) are defined to be the same as the top-down
posteriors q(zs | zt , x) presented in Section 2.1.2, but with the observed data x replaced by the
output of a time-dependent denoising model x̂θ (zt ; t), that is:

p(zs | zt ) = q(zs | zt , x = x̂θ (zt ; t)). (77)

The role of the denoising model is to predict x from each of its noisy versions zt in turn. There are four main (equally valid) ways of parameterizing this component of the generative model, as we describe next.

2.2.1 Generative Transitions: p(zs | zt )

The conditional distributions of the generative model are given by:

p(zs | zt) = N(zs; μθ(zt; s, t), σ²Q(s, t) I), (78)

where σ²Q(s, t) is the posterior variance we derived in Equation 59, and μθ(zt; s, t) is analogous to the posterior mean we derived in Equation 64, that is:

q(zs | zt, x) = N(zs; μQ(zt, x; s, t), σ²Q(s, t) I), (79)

where the posterior mean is given by

μQ(zt, x; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x. (80)

The crucial difference between µQ (zt , x; s, t) and µθ (zt ; s, t) is that, in the latter, the observed data x is
replaced by our predictive model with parameters θ. There are four main (equivalently valid) ways of
operationalizing this model as summarized in Table 3 and derived in detail below:

(i) A denoising model x̂θ(zt; t):

μθ(zt; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x̂θ(zt; t), (81)

which as mentioned earlier, simply predicts x from its noisy versions zt, i.e. performs denoising.

(ii) A noise prediction model ϵ̂θ(zt; t):

μθ(zt; s, t) = zt/αt|s − (σ²t|s/(αt|s σt)) ϵ̂θ(zt; t), (82)

which we can derive in detail starting from the denoising model:

μθ(zt; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x̂θ(zt; t) (83)
= (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) · (zt − σt ϵ̂θ(zt; t))/αt  (since x = (zt − σt ϵt)/αt) (84)
= (αt|s/αt|s) · [αt|s σs² zt + (αs/αt) σ²t|s zt − (αs/αt) σ²t|s σt ϵ̂θ(zt; t)]/σt²  (recall that αt|s = αt/αs) (85)
= [α²t|s σs² zt + (αt|s αs/αt) σ²t|s zt − (αt|s αs/αt) σ²t|s σt ϵ̂θ(zt; t)]/(αt|s σt²)  (cancel common factors) (86)
= [α²t|s σs² zt + σ²t|s zt − σ²t|s σt ϵ̂θ(zt; t)]/(αt|s σt²) (87)
= zt (σ²t|s + α²t|s σs²)/(αt|s σt²) − σ²t|s σt ϵ̂θ(zt; t)/(αt|s σt²)  (combine like terms) (88)
= zt (σt² − α²t|s σs² + α²t|s σs²)/(αt|s σt²) − σ²t|s σt ϵ̂θ(zt; t)/(αt|s σt²)  (recall that σ²t|s = σt² − α²t|s σs²) (89)
= zt σt²/(αt|s σt²) − σ²t|s ϵ̂θ(zt; t)/(αt|s σt) (90)
= zt/αt|s − (σ²t|s/(αt|s σt)) ϵ̂θ(zt; t). (91)

(iii) A score model sθ(zt; t):

μθ(zt; s, t) = zt/αt|s + (σ²t|s/αt|s) sθ(zt; t), (92)

which approximates ∇zt log q(zt), and is closely related to noise-prediction in the following way:

sθ(zt; t) ≈ ∇zt log q(zt) (93)
= Eq(x|zt)[∇zt log q(zt | x)]  (expectation under the posterior q(x | zt)) (94)
= Eq(x|zt)[∇zt log N(zt; αt x, σt² I)] (95)
= Eq(x|zt)[∇zt log ∏ᵢ₌₁ᴰ (1/(σt √(2π))) exp{−(zt,i − αt xi)²/(2σt²)}]  (isotropic covariance) (96)
= Eq(x|zt)[∇zt (−(D/2) log(2πσt²) − (1/(2σt²)) Σᵢ₌₁ᴰ (zt,i − αt xi)²)] (97)
= Eq(x|zt)[−(1/σt²)(zt − αt x)]  (expected gradient) (98)
= Eq(x|zt)[−(1/σt) · (zt − αt x)/σt] (99)
= Eq(x|zt)[−(1/σt) ϵ]  (due to ϵ = (zt − αt x)/σt) (100)
≈ −(1/σt) ϵ̂θ(zt; t), (101)

where the last step replaces the expected noise with the noise-prediction model.

Parameterization | x̂θ(zt; t) | ϵ̂θ(zt; t) | sθ(zt; t)
x̂θ(zt; t) | – | (zt − σt ϵ̂θ(zt; t))/αt | (zt + σt² sθ(zt; t))/αt
ϵ̂θ(zt; t) | (zt − αt x̂θ(zt; t))/σt | – | −σt sθ(zt; t)
sθ(zt; t) | (αt x̂θ(zt; t) − zt)/σt² | −ϵ̂θ(zt; t)/σt | –

Table 3: Translating between the three main equivalently valid ways to parameterize a diffusion model. All the operations are linear because zt = αt x + σt ϵt by the definition of the forward diffusion process.

The optimal score model (with parameters θ ∗ ) is equal to the gradient of the log-probability density
w.r.t. the data at each noise scale, i.e. we have that: sθ∗ (zt ; t) = ∇zt log q(zt ), for any t. This stems
from a Score Matching with Langevin Dynamics (SMLD) perspective on generative modelling (Song
and Ermon, 2019; Song et al., 2021b). SMLD is closely related to probabilistic diffusion models (Ho
et al., 2020). For continuous state spaces, diffusion models implicitly compute the score at each
noise scale, so the two approaches can be categorized jointly as Score-based Generative Models or
Gaussian Diffusion Processes. For a more detailed discussion on score-based generative modelling
the reader may refer to Song et al. (2021b).
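The linear translations collected in Table 3 are easy to sanity-check numerically. In the sketch below the coefficients, data, and noise are arbitrary test values (assumptions), and the score used is the conditional score ∇zt log q(zt | x) for a single fixed x:

```python
import numpy as np

# Numerical check of the identities in Table 3: with z_t = alpha_t x
# + sigma_t eps, the denoising, noise-prediction, and score views are
# linear re-expressions of one another.
rng = np.random.default_rng(0)
alpha_t, sigma_t = 0.8, 0.6              # assumed (alpha^2 + sigma^2 = 1)
x = rng.standard_normal(4)
eps = rng.standard_normal(4)
z_t = alpha_t * x + sigma_t * eps

# Conditional score of q(z_t | x) = N(alpha_t x, sigma_t^2 I):
score = -(z_t - alpha_t * x) / sigma_t**2

# Table 3, row by row:
assert np.allclose(x, (z_t - sigma_t * eps) / alpha_t)
assert np.allclose(x, (z_t + sigma_t**2 * score) / alpha_t)
assert np.allclose(eps, -sigma_t * score)
assert np.allclose(score, (alpha_t * x - z_t) / sigma_t**2)
```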

(iv) An energy-based model Eθ(zt; t):

μθ(zt; s, t) = zt/αt|s − (σ²t|s/αt|s) ∇zt Eθ(zt; t), (102)

since the score model can be parameterized with the gradient of an energy-based model:

sθ(zt; t) ≈ ∇zt log q(zt) (103)
= ∇zt log((1/Z) exp(−Eθ(zt; t)))  (Boltzmann distribution) (104)
= ∇zt(−Eθ(zt; t) − log Z)  (∇zt log Z = 0) (105)
= −∇zt Eθ(zt; t), (106)

which we can use to substitute sθ (zt ; t) in Equation 92 to get the new expression in Equation 102.
For a detailed review of energy-based models and their relationship with score-based generative
models refer to e.g. Song and Kingma (2021) and Salimans and Ho (2021).

Two other notable parameterizations not elaborated upon in this article but certainly worth learning
about include v-prediction (Salimans and Ho, 2022), and F-prediction (Karras et al., 2022). There are
also interesting links to Flow Matching (Lipman et al., 2023), specifically with the Optimal Transport
(OT) flow path, which can be interpreted as a type of Gaussian diffusion. Kingma and Gao (2023)
formalize this relation under what they call the o-prediction parameterization – for further details on
this and more the reader may refer to appendix D.3 in Kingma and Gao (2023).

Simplifying p(zs | zt ). The closed-form expressions for the mean and variance of p(zs | zt ) can be
further simplified to include more numerically stable functions like expm1(·) = exp(·) − 1, which are
available in standard numerical packages. The resulting simplified expressions – which we derive in
detail next – enable more numerically stable implementations as highlighted by Kingma et al. (2021).

Recall from Section 2.1.3 that: σt2 = sigmoid(γη (t)), and αt2 = sigmoid(−γη (t)), for any t. For brevity,
let s and t be shorthand notation for γη (s) and γη (t) respectively. The posterior variance simplifies to:

σ²Q(s, t) = σ²t|s σs²/σt² = (σt² − (αt²/αs²) σs²) σs²/σt² (107)
= [1/(1 + e⁻ˢ)] · [1/(1 + e⁻ᵗ) − ((1 + eᵗ)⁻¹/(1 + eˢ)⁻¹) · 1/(1 + e⁻ˢ)] / [1/(1 + e⁻ᵗ)]  (substitute sigmoids; cancel denominator) (108)
= (1 + e⁻ᵗ) · [1/(1 + e⁻ˢ)] · [1/(1 + e⁻ᵗ) − (1/(1 + eᵗ)) · (1 + eˢ)/(1 + e⁻ˢ)]  (distribute 1 + e⁻ᵗ) (109)
= [1/(1 + e⁻ˢ)] · [1 − ((1 + eˢ)/(1 + eᵗ)) · ((1 + e⁻ᵗ)/(1 + e⁻ˢ))] (110)
= [1/(1 + e⁻ˢ)] · [1 − (eˢ(1 + e⁻ˢ)/(1 + e⁻ˢ)) · (e⁻ᵗ(1 + eᵗ)/(1 + eᵗ))]  (since 1 + eˢ = eˢ(1 + e⁻ˢ) and 1 + e⁻ᵗ = e⁻ᵗ(1 + eᵗ); cancel common factors) (111)
= [1/(1 + e⁻ˢ)] · (1 − eˢ⁻ᵗ) (112)
= σs² · (−expm1(γη(s) − γη(t))).  (expm1(·) = exp(·) − 1) (113)
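The expm1 form of the posterior variance can be checked against the direct expression σ²t|s σs²/σt². The γη values below are arbitrary (an assumption), with γη(s) < γη(t):

```python
import math

# Check Eq. (113): sigma_Q^2(s, t) written with expm1 matches the
# direct expression sigma_{t|s}^2 sigma_s^2 / sigma_t^2.
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
g_s, g_t = -1.2, 0.7                     # gamma_eta(s), gamma_eta(t)

sigma2 = lambda g: sigmoid(g)            # sigma^2 = sigmoid(gamma)
alpha2 = lambda g: sigmoid(-g)           # alpha^2 = sigmoid(-gamma)

a2_ts = alpha2(g_t) / alpha2(g_s)                  # alpha_{t|s}^2
var_ts = sigma2(g_t) - a2_ts * sigma2(g_s)         # Eq. (46)
direct = var_ts * sigma2(g_s) / sigma2(g_t)        # Eq. (107)
stable = sigma2(g_s) * (-math.expm1(g_s - g_t))    # Eq. (113)
assert math.isclose(direct, stable)
```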

The posterior mean – under a noise-prediction model ϵ̂θ(zt; t) – simplifies in a similar fashion to:

μθ(zt; s, t) = zt/αt|s − (σ²t|s/(αt|s σt)) ϵ̂θ(zt; t) (114)
= (αs/αt)(zt − (σ²t|s/σt) ϵ̂θ(zt; t)) (115)
= (αs/αt)(zt − [(σt² − (αt²/αs²) σs²)/σt] ϵ̂θ(zt; t))  (substituting σ²t|s = σt² − α²t|s σs²) (116)
= (αs/αt)(zt − [(1/(1 + e⁻ᵗ) − ((1 + eˢ)/((1 + eᵗ)(1 + e⁻ˢ))))/√(1/(1 + e⁻ᵗ))] ϵ̂θ(zt; t)) (117)
= (αs/αt)(zt − √(1/(1 + e⁻ᵗ)) · (1 + e⁻ᵗ) · [1/(1 + e⁻ᵗ) − (1 + eˢ)/((1 + eᵗ)(1 + e⁻ˢ))] ϵ̂θ(zt; t)) (118)
= (αs/αt)(zt − σt (1 − eˢ⁻ᵗ) ϵ̂θ(zt; t)) (119)
= (αs/αt)(zt + σt expm1(γη(s) − γη(t)) ϵ̂θ(zt; t)), (120)

where Equation 118 simplifies significantly via the same logical steps as in Equations 109–112 above.

Ancestral Sampling. To generate random samples from our generative model p(x | z0) ∏ᵢ₌₁ᵀ p(zs(i) | zt(i)), we can perform what's known as ancestral sampling, i.e. starting from z1 ∼ N(0, I) and following the estimated reverse Markov chain: z1 → z(T−1)/T → z(T−2)/T → · · · → z0 → x, according to:

zs = (αs/αt)(zt − σt c ϵ̂θ(zt; t)) + √((1 − αs²) c) ϵ (121)
= μθ(zt; s, t) + σQ(s, t) ϵ, (122)

where c = −expm1(γη(s) − γη(t)), ϵ ∼ N(0, I), and we used the fact that σs = √(1 − αs²) by definition in a variance-preserving diffusion process. In summary, since the forward process transitions are Markovian and linear Gaussian, the top-down posterior is tractable due to Gaussian conjugacy. Furthermore, our generative model is defined to be equal to the top-down posterior, p(zs | zt) = q(zs | zt, x = x̂θ(zt; t)), but with a denoising model x̂θ(zt; t) in place of x, so we can use our estimate of the posterior mean μθ(zt; s, t) to sample in reverse order following the Markov chain: z1 → z(T−1)/T → · · · → z0 → x.
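The sampling recursion can be sketched as follows. The schedule γ(t) = 10t − 5 and the zero-output noise predictor are placeholders (assumptions); a real ϵ̂θ is a trained network:

```python
import numpy as np

# Sketch of ancestral sampling (Eqs. 121-122) over T discrete steps,
# with a toy schedule and a placeholder noise-prediction "network".
rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
gamma = lambda t: 10.0 * t - 5.0           # assumed gamma_eta(t)

def eps_hat(z, t):                         # placeholder for a trained model
    return np.zeros_like(z)

T, D = 100, 2
z = rng.standard_normal(D)                 # z_1 ~ N(0, I)
for i in range(T, 0, -1):
    t, s = i / T, (i - 1) / T
    alpha_t = np.sqrt(sigmoid(-gamma(t)))
    alpha_s = np.sqrt(sigmoid(-gamma(s)))
    sigma_t = np.sqrt(sigmoid(gamma(t)))
    c = -np.expm1(gamma(s) - gamma(t))     # -expm1(gamma(s) - gamma(t))
    mu = (alpha_s / alpha_t) * (z - sigma_t * c * eps_hat(z, t))  # Eq. (121)
    z = mu + np.sqrt((1.0 - alpha_s**2) * c) * rng.standard_normal(D)
# z now holds the final latent z_0, i.e. an (approximate) model sample.
```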

2.2.2 Variational Lower Bound

The optimization objective of a discrete-time variational diffusion model is the ELBO in Equation 27,
i.e. the same as a hierarchical VAE’s with a top-down inference model. For consistency, we re-express
the VLB here using the discrete-time index notation: s(i) = (i − 1)/T , t(i) = i/T , as follows:

− log p(x) ≤ −Eq(z0|x)[log p(x | z0)] + DKL(q(z1 | x) ∥ p(z1)) + LT(x) = −VLB(x), (123)

where the so-called diffusion loss term LT(x) is given by:

LT(x) = Σᵢ₌₁ᵀ Eq(zt(i)|x)[DKL(q(zs(i) | zt(i), x) ∥ p(zs(i) | zt(i)))]. (124)

The remaining terms are the familiar expected reconstruction loss and KL of the posterior from the
prior. For reasons explained in detail in Section 2.2, under a well-specified diffusion process, these terms
can be safely omitted in practice as they do not provide meaningful contributions to the loss.

2.2.3 Deriving DKL (q(zs | zt , x) ∥ p(zs | zt ))

Minimizing the diffusion loss LT(x) involves computing the (expected) KL divergence of the posterior from the prior, at each noise level. Kingma et al. (2021) provide a relatively detailed derivation of DKL(q(zs(i) | zt(i), x) ∥ p(zs(i) | zt(i))); we re-derive it here for completeness, whilst adding some additional instructive details to aid in understanding.

Using s and t as shorthand notation for s(i) and t(i), recall that the posterior is given by:

q(zs | zt, x) = N(zs; μQ(zt, x; s, t), σ²Q(s, t) I),   μQ(zt, x; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x, (125)

and since we have defined our generative model as p(zs | zt) = q(zs | zt, x = x̂θ(zt; t)) we have

p(zs | zt) = N(zs; μθ(zt; s, t), σ²Q(s, t) I),   μθ(zt; s, t) = (αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x̂θ(zt; t). (126)

It is arguably more intuitive to parameterize µθ (zt ; s, t) in terms of the denoising model x̂θ (zt ; t) as
shown above, but as outlined in Section 2.2.1, using a noise-prediction model ϵ̂θ (zt ; t) or a score model
sθ (zt ; t) would be equally valid.

Of particular importance is the fact that the variances of both q(zs | zt, x) and p(zs | zt) are equal:

σ²Q(s, t) = σ²t|s σs²/(α²t|s σs² + σ²t|s) = σ²t|s σs²/σt², (127)

where the result in Equation 46, σ²t|s = σt² − α²t|s σs², simplifies the denominator. Furthermore, both distributions have identical isotropic/spherical covariances: σ²Q(s, t) I, which we denote as σ²Q I for short.

The KL divergence between D-dimensional Gaussian distributions is available in closed form, thus:

DKL(q(zs | zt, x) ∥ p(zs | zt)) = (1/2)[Tr((1/σ²Q) I σ²Q I) − D + (μθ − μQ)ᵀ (1/σ²Q) I (μθ − μQ) + log(det σ²Q I / det σ²Q I)] (128)
= (1/2)[D − D + (1/σ²Q)(μθ − μQ)ᵀ(μθ − μQ) + 0] (129)
= (1/(2σ²Q)) Σᵢ₌₁ᴰ (μQ,i − μθ,i)² (130)
= (1/(2σ²Q(s, t))) ∥μQ(zt, x; s, t) − μθ(zt; s, t)∥²₂. (131)

It is possible to simplify the above equation quite significantly, resulting in a short expression involving the signal-to-noise ratio of the diffused data.

To that end, expressing Equation 131 in terms of the denoising model x̂θ(zt; t) we get:

DKL(q(zs | zt, x) ∥ p(zs | zt)) = (1/(2σ²Q(s, t))) ∥μQ(zt, x; s, t) − μθ(zt; s, t)∥²₂ (132)
= (1/(2σ²Q(s, t))) ∥(αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x − ((αt|s σs²/σt²) zt + (αs σ²t|s/σt²) x̂θ(zt; t))∥²₂ (133)
= (1/(2σ²Q(s, t))) ∥(αs σ²t|s/σt²) x − (αs σ²t|s/σt²) x̂θ(zt; t)∥²₂ (134)
= (1/(2σ²Q(s, t))) (αs σ²t|s/σt²)² ∥x − x̂θ(zt; t)∥²₂ (135)
= (σt²/(2σ²t|s σs²)) · (αs² σ⁴t|s/σt⁴) ∥x − x̂θ(zt; t)∥²₂  (recall σ²Q(s, t) = σ²t|s σs²/σt²) (136)
= (1/(2σs²)) (αs² σ²t|s/σt²) ∥x − x̂θ(zt; t)∥²₂  (exponents cancel) (137)
= (1/(2σs²)) (αs² (σt² − α²t|s σs²)/σt²) ∥x − x̂θ(zt; t)∥²₂  (recall σ²t|s = σt² − α²t|s σs²) (138)
= (1/2) σs⁻² ((αs² σt² − αs² (αt²/αs²) σs²)/σt²) ∥x − x̂θ(zt; t)∥²₂ (139)
= (1/2) ((αs² σt² σs⁻² − αt²)/σt²) ∥x − x̂θ(zt; t)∥²₂ (140)
= (1/2) (αs² σt²/(σs² σt²) − αt²/σt²) ∥x − x̂θ(zt; t)∥²₂ (141)
= (1/2) (αs²/σs² − αt²/σt²) ∥x − x̂θ(zt; t)∥²₂ (142)
= (1/2) (SNR(s) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂. (143)

In words, the final expression shows that the diffusion loss, at timestep t, consists of a squared error term
involving the data x and the model x̂θ (zt ; t), weighted by a difference in signal-to-noise ratio at s and t.

Parameterizations. Translating between different loss parameterizations is straightforward due to the linearity of the forward diffusion process zt = αt x + σt ϵ. This will be particularly useful for analyzing diffusion loss objectives later on. For now, we provide the derivations of each reparameterization and summarize the results in Table 4. Firstly, we rewrite image prediction in terms of noise prediction by:

∥x − x̂θ(zt; t)∥²₂ = ∥(zt − σt ϵ)/αt − (zt − σt ϵ̂θ(zt; t))/αt∥²₂  (since zt = αt x + σt ϵ) (144)
= (σt²/αt²) ∥ϵ − ϵ̂θ(zt; t)∥²₂.  (cancel terms and factor) (145)

Similarly, in terms of v-prediction (Salimans and Ho, 2022) we first have:

v := αt ϵ − σt x  (by definition) (146)
= αt (zt − αt x)/σt − σt x  (substituting ϵ = (zt − αt x)/σt) (147)
=⇒ αt zt − σt v = (1 − σt²) x + σt² x = x  (αt² = 1 − σt² in a variance-preserving process) (148)
=⇒ x = αt zt − σt v, (149)

which we can now substitute into the image prediction loss, with a v-prediction model v̂θ(zt; t), to get:

∥x − x̂θ(zt; t)∥²₂ = ∥αt zt − σt v − (αt zt − σt v̂θ(zt; t))∥²₂ (150)
= ∥σt v̂θ(zt; t) − σt v∥²₂  (αt zt terms cancel) (151)
= σt² ∥v̂θ(zt; t) − v∥²₂.  (by factoring) (152)

To rewrite the noise prediction loss in terms of image prediction we have:

∥ϵ − ϵ̂θ(zt; t)∥²₂ = ∥(zt − αt x)/σt − (zt − αt x̂θ(zt; t))/σt∥²₂  (recall that zt = αt x + σt ϵ) (153)
= (αt²/σt²) ∥x̂θ(zt; t) − x∥²₂  (cancel zt terms and factor) (154)

= SNR(t) ∥x̂θ(zt; t) − x∥²₂.  (recall SNR(t) = αt²/σt²) (155)

Loss | Image Denoising ∥x − x̂θ(zt; t)∥²₂ | Noise Prediction ∥ϵ − ϵ̂θ(zt; t)∥²₂ | Velocity Prediction ∥v − v̂θ(zt; t)∥²₂
∥x − x̂θ(zt; t)∥²₂ | 1 | σt²/αt² | σt²
∥ϵ − ϵ̂θ(zt; t)∥²₂ | αt²/σt² | 1 | 1/αt²
∥v − v̂θ(zt; t)∥²₂ | σt²(αt²/σt² + 1)² | αt²(σt²/αt² + 1)² | 1

Table 4: Translating between three main ways to parameterize a diffusion model loss. Each loss on the LHS column can be rewritten in terms of the other parameterizations weighted by a specific constant. For example, the image prediction loss can be written in terms of noise prediction weighted by σt²/αt², that is: ∥x − x̂θ(zt; t)∥²₂ = (σt²/αt²) ∥ϵ − ϵ̂θ(zt; t)∥²₂, whereas the v-prediction (Salimans and Ho, 2022) loss can be written in terms of image prediction by: ∥v − v̂θ(zt; t)∥²₂ = σt²(αt²/σt² + 1)² ∥x − x̂θ(zt; t)∥²₂.

whereas in terms of v-prediction we get:

∥ϵ − ϵ̂θ(zt; t)∥²₂ = ∥(v + σt x)/αt − (v̂θ(zt; t) + σt x)/αt∥²₂  (solving v = αt ϵ − σt x for ϵ) (156)
= ∥(v − v̂θ(zt; t))/αt∥²₂  (cancel x terms and factor) (157)
= (1/αt²) ∥v − v̂θ(zt; t)∥²₂. (158)

Lastly, we can rewrite v-prediction in terms of image prediction as follows:

∥v − v̂θ(zt; t)∥²₂ = ∥αt ϵ − σt x − (αt ϵ̂θ(zt; t) − σt x̂θ(zt; t))∥²₂  (since v := αt ϵ − σt x) (159)
= ∥αt (zt − αt x)/σt − σt x − αt (zt − αt x̂θ(zt; t))/σt + σt x̂θ(zt; t)∥²₂ (160)
= ∥αt² x̂θ(zt; t)/σt + σt x̂θ(zt; t) − αt² x/σt − σt x∥²₂  (cancel zt terms and factor) (161)
= (αt²/σt + σt)² ∥x̂θ(zt; t) − x∥²₂ (162)
= σt² (SNR(t) + 1)² ∥x̂θ(zt; t) − x∥²₂.  (recall SNR(t) = αt²/σt²) (163)

We proceed similarly for noise prediction, instead substituting x-related terms to get:

∥v − v̂θ(zt; t)∥²₂ = ∥αt ϵ − σt x − (αt ϵ̂θ(zt; t) − σt x̂θ(zt; t))∥²₂ (164)
= ∥αt ϵ − σt (zt − σt ϵ)/αt − αt ϵ̂θ(zt; t) + σt (zt − σt ϵ̂θ(zt; t))/αt∥²₂ (165)
= (σt²/αt + αt)² ∥ϵ − ϵ̂θ(zt; t)∥²₂  (cancel zt terms and factor) (166)
= αt² (σt²/αt² + 1)² ∥ϵ − ϵ̂θ(zt; t)∥²₂. (167)
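The cleanest of these conversions, the x ↔ ϵ weight of Eqs. (145) and (155), can be confirmed numerically. The values and the stand-in model output x̂ below are arbitrary assumptions, with ϵ̂ linked to x̂ through zt as in Table 3:

```python
import numpy as np

# Check Eqs. (145)/(155): with z_t = alpha_t x + sigma_t eps and the
# linked reparameterization eps_hat = (z_t - alpha_t x_hat)/sigma_t,
# noise- and image-prediction losses differ by exactly SNR(t).
rng = np.random.default_rng(0)
alpha_t, sigma_t = 0.8, 0.6
x = rng.standard_normal(3)
eps = rng.standard_normal(3)
z_t = alpha_t * x + sigma_t * eps

x_hat = x + 0.1 * rng.standard_normal(3)          # stand-in model output
eps_hat = (z_t - alpha_t * x_hat) / sigma_t       # Table 3 translation

loss_x = np.sum((x - x_hat) ** 2)
loss_eps = np.sum((eps - eps_hat) ** 2)
snr_t = alpha_t**2 / sigma_t**2
assert np.isclose(loss_eps, snr_t * loss_x)       # Eq. (155)
```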
αt2

2.2.4 Monte Carlo Estimator of LT (x)

To calculate the diffusion loss LT (x) in practice, we can use an unbiased Monte Carlo estimator by:

(i) Using the reparameterisation gradient estimator (Kingma and Welling, 2013; Rezende et al., 2014)
to reparameterize zt ∼ q(zt | x) following:

zt = αt x + σt ϵ := gαt ,σt (ϵ, x), where ϵ ∼ p(ϵ), and p(ϵ) = N (0, I). (168)

(ii) Avoiding having to compute all T loss terms by selecting a single timestep, sampled uniformly at random as i ∼ U{1, T}, to use at each training iteration for estimating the diffusion loss.

Under this setup, the estimator of the diffusion loss LT(x) is given by:

LT(x) = Σᵢ₌₁ᵀ Eq(zt(i)|x)[DKL(q(zs(i) | zt(i), x) ∥ p(zs(i) | zt(i)))] (169)
= Σᵢ₌₁ᵀ Eq(zt|x)[DKL(q(zs | zt, x) ∥ p(zs | zt))]  (shorthand notation s, t) (170)
= Σᵢ₌₁ᵀ ∫ (1/2)(SNR(s) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂ q(zt | x) dzt  (from Equation 143) (171)
= ∫ (1/2) Σᵢ₌₁ᵀ (SNR(s) − SNR(t)) ∥x − x̂θ(gαt,σt(ϵ, x); t)∥²₂ p(ϵ) dϵ  (as zt = αt x + σt ϵ) (172)
= (1/2) Eϵ∼N(0,I)[T · Ei∼U{1,T}[(SNR(s) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂]]  (MC estimate) (173)
= (T/2) Eϵ∼N(0,I), i∼U{1,T}[(SNR(s) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂]. (174)

For total clarity, we used Monte Carlo estimation and a basic identity to arrive at Equation 173:

Eq[f(x)] ≈ (1/T) Σᵢ₌₁ᵀ f(xᵢ)  =⇒  T · Eq[f(x)] ≈ Σᵢ₌₁ᵀ f(xᵢ), (175)

where xᵢ ∼ q are random samples from a distribution q, which is representative of U{1, T} in our case.
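A single training-step estimate of the diffusion loss can be sketched as follows. The schedule γ(t) = 10t − 5 and the crude "denoiser" below are placeholder assumptions; in practice x̂θ is a trained network:

```python
import numpy as np

# Single-sample Monte Carlo estimate of the diffusion loss L_T(x)
# (Eq. 174): draw i ~ U{1, T} and eps ~ N(0, I), then weight the
# squared error by T * (SNR(s) - SNR(t)) / 2.
rng = np.random.default_rng(0)
gamma = lambda t: 10.0 * t - 5.0           # assumed gamma_eta(t)
snr = lambda t: np.exp(-gamma(t))          # SNR(t) = exp(-gamma(t))
sigma2 = lambda t: 1.0 / (1.0 + np.exp(-gamma(t)))

def x_hat(z_t, t):                         # placeholder denoising model
    return z_t / np.sqrt(1.0 - sigma2(t))  # naive rescaling of z_t

T = 100
x = rng.standard_normal(4)
i = rng.integers(1, T + 1)                 # i ~ U{1, T}
s, t = (i - 1) / T, i / T
alpha_t, sigma_t = np.sqrt(1.0 - sigma2(t)), np.sqrt(sigma2(t))
z_t = alpha_t * x + sigma_t * rng.standard_normal(4)

loss = T / 2.0 * (snr(s) - snr(t)) * np.sum((x - x_hat(z_t, t)) ** 2)
```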

Equation 174 can be rewritten in terms of the more commonly used noise-prediction model ϵ̂θ(zt; t) as:

LT(x) = (T/2) Eϵ∼N(0,I), i∼U{1,T}[(SNR(s) − SNR(t)) ∥(zt − σt ϵ)/αt − (zt − σt ϵ̂θ(zt; t))/αt∥²₂] (176)
= (T/2) Eϵ∼N(0,I), i∼U{1,T}[(SNR(s) − SNR(t)) ∥(σt/αt)(ϵ̂θ(zt; t) − ϵ)∥²₂] (177)
= (T/2) Eϵ∼N(0,I), i∼U{1,T}[(σt²/αt²)(SNR(s) − SNR(t)) ∥ϵ − ϵ̂θ(zt; t)∥²₂] (178)
= (T/2) Eϵ∼N(0,I), i∼U{1,T}[SNR(t)⁻¹ (SNR(s) − SNR(t)) ∥ϵ − ϵ̂θ(zt; t)∥²₂] (179)
= (T/2) Eϵ∼N(0,I), i∼U{1,T}[(SNR(s)/SNR(t) − 1) ∥ϵ − ϵ̂θ(zt; t)∥²₂]. (180)

The constant term inside the expectation can be re-expressed in more numerically stable primitives as:

SNR(s)/SNR(t) − 1 = (αs²/σs²) ÷ (αt²/σt²) − 1   (181)

= (αs² σt²)/(αt² σs²) − 1   (182)

= [sigmoid(−γη(s)) · sigmoid(γη(t))] / [sigmoid(−γη(t)) · sigmoid(γη(s))] − 1,   (183)

letting s and t denote γη(s) and γη(t) for brevity, we have:

[ 1/(1 + e^s) · 1/(1 + e^{−t}) ] / [ 1/(1 + e^t) · 1/(1 + e^{−s}) ] − 1 = [ (1 + e^t)(1 + e^{−s}) ] / [ (1 + e^s)(1 + e^{−t}) ] − 1   (184)

= [ e^t (1 + e^{−t}) · e^{−s} (1 + e^s) ] / [ (1 + e^s)(1 + e^{−t}) ] − 1   (185)

= e^t e^{−s} − 1   (186)

= expm1( γη(t) − γη(s) ).   (187)

Substituting the above back into the (noise-prediction-based) diffusion loss estimator gives:

LT(x) = (T/2) E_{ϵ∼N(0,I), i∼U{1,T}}[ expm1( γη(t) − γη(s) ) ∥ϵ − ϵ̂θ(αt x + σt ϵ; t)∥²₂ ].   (188)
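As a quick numerical sanity check (a standalone NumPy sketch of ours, not code from the paper), one can verify that the naive ratio from Equation 181 and the expm1 form of Equation 187 agree, with expm1 being the numerically stabler primitive when the two log-SNR values are close:

```python
import numpy as np

def snr_ratio_naive(gamma_s, gamma_t):
    # SNR(s)/SNR(t) - 1 computed directly from SNR(u) = exp(-gamma(u))
    return np.exp(-gamma_s) / np.exp(-gamma_t) - 1.0

def snr_ratio_stable(gamma_s, gamma_t):
    # expm1(gamma_t - gamma_s) evaluates e^{gamma_t - gamma_s} - 1 accurately near zero
    return np.expm1(gamma_t - gamma_s)

gamma_s, gamma_t = -3.7, 2.1   # arbitrary log-variance values with gamma_t > gamma_s
naive = snr_ratio_naive(gamma_s, gamma_t)
stable = snr_ratio_stable(gamma_s, gamma_t)
```

For adjacent timesteps, γη(t) − γη(s) is tiny, which is exactly the regime where `expm1` avoids catastrophic cancellation.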

2.3 Reverse Process: Continuous-Time Generative Model

A continuous-time variational diffusion model (T → ∞) corresponds to the infinitely deep limit of a hierarchical VAE, when the diffusion process (noise schedule) is learned rather than fixed. As previously alluded to, the extension of diffusion models to continuous time has been shown to be advantageous by various authors (Song et al., 2021b; Kingma et al., 2021; Huang et al., 2021; Vahdat et al., 2021).

In this section, we begin by explaining why using a continuous-time VLB is strictly preferable to a discrete-time version, and provide detailed derivations of its estimator in terms of a denoising and a noise-prediction model. We then explain why the continuous-time VLB is invariant to the noise schedule of the forward diffusion process, except at its endpoints. In other words, the VLB is unaffected by the shape of the signal-to-noise ratio function SNR(t) between t = 0 and t = 1. Lastly, we explain how this invariance also holds for models that optimize a weighted diffusion loss rather than the standard VLB.

Note that, due to the shared notation between discrete and continuous-time models introduced in Sec-
tion 2.2, the various derivations and results therein (e.g. for p(zs | zt )) are equally applicable for the
continuous-time version presented in this section.

2.3.1 On Infinite Depth

Kingma et al. (2021) showed that doubling the number of timesteps T always improves the diffusion loss,
which suggests we should optimize a continuous-time VLB, with T → ∞. This finding is straightforward
to verify; we start by recalling that the discrete-time diffusion loss using T steps is given by:
" T #
1 X 2
LT (x) = Eϵ∼N (0,I) (SNR(s(i)) − SNR(t(i))) x − x̂θ (zt(i) ; t(i)) 2
, (189)
2 i=1

where s(i) = (i − 1)/T and t(i) = i/T . To double the number of timesteps T , we can introduce a new
symbol t′ (i) to represent an interpolation between s(i) and t(i), defined as:

t′(i) = (s(i) + t(i))/2 = (1/2)( (i − 1)/T + i/T ) = (i − 0.5)/T = t(i) − 0.5/T.   (190)

Using shorthand notation s, t and t′ for s(i), t(i) and t′(i), the diffusion loss with T timesteps can be written equivalently to Equation 189 as:

LT(x) = (1/2) E_{ϵ∼N(0,I)}[ Σ_{i=1}^T (SNR(s) − SNR(t′) + SNR(t′) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂ ],   (191)

whereas the new diffusion loss with 2T timesteps is given by:

L2T(x) = (1/2) E_{ϵ∼N(0,I)}[ Σ_{i=1}^T (SNR(s) − SNR(t′)) ∥x − x̂θ(z_{t′}; t′)∥²₂ + (SNR(t′) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂ ].   (192)

If we then subtract the two losses and cancel out common terms we get the following:
" T
1 Xn 2 2
L2T (x) − LT (x) = Eϵ∼N (0,I) SNR(s) ∥x − x̂θ (zt′ ; t′ )∥2 − SNR(t′ ) ∥x − x̂θ (zt′ ; t′ )∥2
2 i=1
2.3 Reverse Process: Continuous-Time Generative Model 29

′ ((((
((( 2 (( (2((o
+ SNR(t
( )
( ∥x − x̂θ (z t ; t)∥ 2 − SNR(t)
( ∥x
( −
( x̂
( θ (z t ; t)∥2
(( (((
T
X 2 2
− SNR(s) ∥x − x̂θ (zt ; t)∥2 − SNR(t′ ) ∥x − x̂θ (zt ; t)∥2
i=1
!#

((( 2 ( (( (
2
) ∥x
( − x̂θ (zt ; t)∥2 − SNR(t) ∥x
(− x̂θ (zt ; t)∥2
(
+ SNR(t ( ( ( ( (193)
(((
( ((((
" T #
1 X


′ 2 2

= Eϵ∼N (0,I) (SNR(s) − SNR(t )) ∥x − x̂θ (zt′ ; t )∥2 − ∥x − x̂θ (zt ; t)∥2 . (194)
2 i=1

We can use Equation 194 to justify optimizing a continuous-time objective. Since t′ < t, the prediction
error term with zt′ will be lower than the one with zt , as zt′ is a less noisy version of x from earlier
on in the diffusion process. In other words, it is always easier to predict x from zt′ than from zt , given
an adequately trained model. More formally, doubling the number of timesteps T always improves the
VLB:

L2T (x) − LT (x) < 0 =⇒ VLB2T (x) > VLBT (x), ∀T ∈ N+ . (195)

Thus it is strictly advantageous to optimize a continuous-time VLB, where T → ∞ and time t is treated
as continuous rather than discrete.

2.3.2 Monte Carlo Estimator of L∞ (x)

To arrive at an unbiased Monte Carlo estimator of the continuous-time diffusion loss L∞ (x), we can first
take the discrete-time version and substitute in the time segment width τ = 1/T to reveal:

LT(x) = (T/2) E_{ϵ∼N(0,I), i∼U{1,T}}[ (SNR(s) − SNR(t)) ∥x − x̂θ(zt; t)∥²₂ ]   (196)

= (1/2) E_{ϵ∼N(0,I), i∼U{1,T}}[ T ( SNR(t − 1/T) − SNR(t) ) ∥x − x̂θ(zt; t)∥²₂ ]   (since s = (i − 1)/T)   (197)

= (1/2) E_{ϵ∼N(0,I), i∼U{1,T}}[ (SNR(t − τ) − SNR(t))/τ · ∥x − x̂θ(zt; t)∥²₂ ],   (substitute τ = 1/T)   (198)

again using the shorthand notation s and t for s(i) = (i − 1)/T and t(i) = i/T , respectively.

The constant inside the expectation in Equation 198 is readily recognized as the (negative) backward
difference numerical approximation to the derivative of SNR(t) w.r.t t, since:

d SNR(t)/dt = lim_{τ→0} (SNR(t + τ) − SNR(t))/τ   (forward difference)   (199)

= lim_{τ→0} (SNR(t) − SNR(t − τ))/τ,   (backward difference)   (200)

and therefore

lim_{τ→0} (SNR(t − τ) − SNR(t))/τ = − lim_{τ→0} (SNR(t) − SNR(t − τ))/τ   (201)

= − d SNR(t)/dt.   (202)
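As an illustrative check (our own NumPy sketch, with a toy schedule of our choosing), the backward difference appearing in Equation 198 does indeed converge to −d SNR(t)/dt as the step width τ shrinks:

```python
import numpy as np

def snr(t):
    # toy schedule: SNR(t) = exp(-gamma(t)) with gamma(t) = -6 + 12t
    return np.exp(6.0 - 12.0 * t)

def snr_prime(t):
    # analytic derivative of the toy SNR(t): d/dt exp(6 - 12t) = -12 exp(6 - 12t)
    return -12.0 * snr(t)

def backward_diff(t, tau):
    # the constant inside the expectation of the discrete-time loss (Equation 198)
    return (snr(t - tau) - snr(t)) / tau

# backward_diff(t, tau) approaches -snr_prime(t) as tau -> 0 (i.e. as T -> infinity)
approx_coarse = backward_diff(0.5, 1e-2)
approx_fine = backward_diff(0.5, 1e-6)
```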
2.3 Reverse Process: Continuous-Time Generative Model 30

Thus taking the limit as T → ∞ of the discrete-time diffusion loss we get


" T #
1 X 2
L∞ (x) = lim Eϵ∼N (0,I) (SNR(s) − SNR(t)) ∥x − x̂θ (zt ; t)∥2 (203)
T →∞ 2
i=1

SNR(t − τ ) − SNR(t)
 
1 2
= lim Eϵ∼N (0,I),i∼U {1,T } ∥x − x̂θ (zt ; t)∥2 (204)
T →∞ 2 τ
Z 1 
1 d SNR(t) 2
= Eϵ∼N (0,I) − ∥x − x̂θ (zt ; t)∥2 dt (205)
2 0 dt
1 h
2
i
= − Eϵ∼N (0,I),t∼U (0,1) SNR′ (t) ∥x − x̂θ (zt ; t)∥2 . (206)
2

We can express the above in terms of the noise-prediction model ϵ̂θ (zt ; t) as follows:

SNR′(t) ∥x − x̂θ(zt; t)∥²₂ = SNR′(t) ∥ (zt − σt ϵ)/αt − (zt − σt ϵ̂θ(zt; t))/αt ∥²₂   (207)

= SNR′(t) ∥ (σt/αt)(ϵ − ϵ̂θ(zt; t)) ∥²₂   (cancel zt terms and factor)   (208)

= SNR′(t) · (σt²/αt²) ∥ϵ − ϵ̂θ(zt; t)∥²₂   (209)

= SNR′(t) · SNR(t)⁻¹ ∥ϵ − ϵ̂θ(zt; t)∥²₂   (SNR(t) = αt²/σt²)   (210)

= SNR(t)⁻¹ · (d/dt) e^{−γη(t)} · ∥ϵ − ϵ̂θ(zt; t)∥²₂   (211)

= SNR(t)⁻¹ · e^{−γη(t)} · ( −(d/dt) γη(t) ) ∥ϵ − ϵ̂θ(zt; t)∥²₂   (chain rule)   (212)

= (1/e^{−γη(t)}) · e^{−γη(t)} · ( −(d/dt) γη(t) ) ∥ϵ − ϵ̂θ(zt; t)∥²₂   (SNR(t) = e^{−γη(t)})   (213)

= −γη′(t) ∥ϵ − ϵ̂θ(zt; t)∥²₂,   (214)

where the simplified form of SNR(t) = exp(−γη (t)) derived in Equation 72 was used to arrive at the
final result. Plugging the final expression back into the expected loss in Equation 206 we get

L∞(x) = (1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ γη′(t) ∥ϵ − ϵ̂θ(zt; t)∥²₂ ]   (215)

= E_{q(z0|x)}[ log p(x | z0) ] − DKL(q(z1 | x) ∥ p(z1)) − VLB(x)   (216)

= −VLB(x) + c,   (217)

where c ≈ 0 is constant with respect to the model parameters of ϵ̂θ (zt ; t).

2.3.3 Invariance to the Noise Schedule

An important result established that the continuous-time VLB is invariant to the noise schedule of the
forward diffusion process (Kingma et al., 2021). To explain this result, we begin by performing a change
of variables; i.e. we transform the integral w.r.t time t in the diffusion loss (Equation 205) into an
integral w.r.t the signal-to-noise ratio. Since the signal-to-noise ratio function SNR(t) = exp(−γη(t)) is strictly monotonically decreasing in time t, meaning SNR(t) < SNR(s) for any t > s (ref. Section 2.1.3), it is invertible. Using this fact, we can re-express our loss in terms of a new variable v ≡ SNR(t), such that time t is instead given by t = SNR⁻¹(v). Let zv = αv x + σv ϵ denote the latent variable whose noise-schedule functions αv and σv correspond to αt and σt evaluated at t = SNR⁻¹(v).

By applying the integration by substitution formula


∫_a^b f(g(t)) · g′(t) dt = ∫_{g(a)}^{g(b)} f(v) dv,   (218)

we can express the diffusion loss in terms of our new variable v as follows:

L∞(x) = −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ SNR′(t) ∥x − x̂θ(zt; t)∥²₂ ]   (219)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ ∥ x − x̂θ( σt ( x √SNR(t) + ϵ ); t ) ∥²₂ · SNR′(t) dt ]   (220)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫_{SNR(0)}^{SNR(1)} ∥x − x̂θ(zv; v)∥²₂ dv ]   (dv = SNR′(t) dt)   (221)

= (1/2) E_{ϵ∼N(0,I)}[ ∫_{SNR(1)}^{SNR(0)} ∥x − x̂θ(zv; v)∥²₂ dv ]   (swap limits)   (222)

= (1/2) E_{ϵ∼N(0,I)}[ ∫_{SNRmin}^{SNRmax} ∥x − x̂θ(zv; v)∥²₂ dv ],   (223)

where SNRmax = SNR(0) denotes the highest signal-to-noise ratio at time t = 0 resulting in the least
noisy latent z0 at the start of the diffusion process (i.e. essentially the same as x). Conversely, SNRmin =
SNR(1) denotes the lowest signal-to-noise ratio resulting in the noisiest latent z1 at time t = 1.

The above shows that the diffusion loss is determined by the endpoints SNRmin and SNRmax , and is
invariant to the shape of SNR(t) between t = 0 and t = 1. More precisely, the noise schedule function
exp(−γη (t)) which maps the time variable t ∈ [0, 1] to the signal-to-noise ratio SNR(t) does not influence
the diffusion loss integral in Equation 223, except at its endpoints SNRmax and SNRmin. Therefore,
given v, the shape of the noise schedule function exp(−γη (t)) does not affect the diffusion loss.
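The invariance can be checked numerically. In the NumPy sketch below (ours, not from the paper), we stand in a simple function `err(v)` for the expected squared error at SNR level v (in reality it depends on the trained model, but the argument holds for any such function), and integrate the loss over t under two different schedules that share the same endpoints SNR(0) = e⁶ and SNR(1) = e⁻⁶:

```python
import numpy as np

def err(v):
    # stand-in for E||x - x_hat(z_v; v)||^2 as a function of the SNR v
    return 1.0 / (1.0 + v)

def trapezoid(y, x):
    # simple trapezoidal quadrature, version-independent
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def diffusion_loss(snr_fn, dsnr_fn, n=100_001):
    # 0.5 * integral over t of err(SNR(t)) * (-dSNR/dt), cf. Equation 219
    t = np.linspace(0.0, 1.0, n)
    return 0.5 * trapezoid(err(snr_fn(t)) * -dsnr_fn(t), t)

# schedule A: linear log-SNR; schedule B: cubic log-SNR; identical endpoints
snr_a = lambda t: np.exp(6.0 - 12.0 * t)
dsnr_a = lambda t: -12.0 * np.exp(6.0 - 12.0 * t)
snr_b = lambda t: np.exp(6.0 - 12.0 * t ** 3)
dsnr_b = lambda t: -36.0 * t ** 2 * np.exp(6.0 - 12.0 * t ** 3)

loss_a = diffusion_loss(snr_a, dsnr_a)
loss_b = diffusion_loss(snr_b, dsnr_b)   # matches loss_a up to quadrature error
```

Both values equal ½[F(SNRmax) − F(SNRmin)] for the antiderivative F of `err`, regardless of the schedule's shape in between.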

Another way to understand the above result is by realizing that to compute the diffusion loss integral, it
suffices to evaluate the antiderivative F of the squared-error term at the endpoints SNRmin and SNRmax :
L∞(x) = −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ ∥ x − x̂θ( σt ( x √SNR(t) + ϵ ); t ) ∥²₂ · SNR′(t) dt ]   (224)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ F′(SNR(t)) · SNR′(t) dt ]   (225)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ (F ∘ SNR)′(t) dt ]   (226)

= (1/2) E_{ϵ∼N(0,I)}[ −( F(SNR(1)) − F(SNR(0)) ) ]   (227)

= (1/2) E_{ϵ∼N(0,I)}[ F(SNRmax) − F(SNRmin) ],   (228)

since every continuous function has an antiderivative. Furthermore, there are infinitely many antiderivatives of the mean-squared-error term, each of which, G, differs from F by only a constant c:

G(v) := ∫ ∥x − x̂θ(zv; v)∥²₂ dv = F(v) + c,   (229)

for all signal-to-noise ratio functions v ≡ SNR(t).

Diffusion Specifications. Kingma et al. (2021) elaborate on the equivalence of diffusion noise-schedule
specifications using the following straightforward example. Firstly, the change of variables we used implies
that σv is given by:

v = αv²/σv²  ⟹  √v = αv/σv  ⟹  σv = αv/√v,   (230)

therefore, zv can be equivalently expressed as

zv = αv x + σv ϵ = αv x + (αv/√v) ϵ = αv ( x + ϵ/√v ),   (231)

which holds for any diffusion specification (forward process) by definition.


 
Now, consider two distinct diffusion specifications denoted as (αv^A, σv^A, x̃θ^A) and (αv^B, σv^B, x̃θ^B). Due to Equation 231, any two diffusion specifications produce equivalent latents, up to element-wise rescaling:

zv^A = (αv^A/αv^B) zv^B   (232)

αv^A ( x + ϵ/√v ) = (αv^A/αv^B) · αv^B ( x + ϵ/√v ).   (233)

This implies that we can denoise from any latent zv^B using a model x̃θ^A trained under a different noise specification, by trivially rescaling the latent zv^B such that it is equivalent to denoising from zv^A:

x̃θ^B( zv^B, v ) ≡ x̃θ^A( (αv^A/αv^B) zv^B, v ).   (234)
αvB v

Furthermore, when two diffusion specifications have equal SNRmin and SNRmax, then the marginal distributions pA(x) and pB(x) defined by the two generative models are equal:

x̃θ^B( zv^B, v ) ≡ x̃θ^A( (αv^A/αv^B) zv^B, v )  ⟹  pA(x) = pB(x),   (235)

and both specifications yield identical diffusion loss in continuous time: L∞^A(x) = L∞^B(x), due to Equation 223. Importantly, this does not mean that training under different noise specifications will result in the same model. To be clear, the x̃θ^B model is fully determined by the x̃θ^A model and the rescaling operation αv^A/αv^B. Furthermore, this invariance to the noise schedule does not hold for the Monte Carlo estimator of the diffusion loss, as the noise schedule affects the variance of the estimator and therefore affects optimization efficiency, as explained in subsequent sections.
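A small NumPy sketch (ours, with arbitrary α values) makes the rescaling identity in Equation 232 concrete: two specifications evaluated at the same SNR v produce latents that differ only by the scalar αv^A/αv^B:

```python
import numpy as np

rng = np.random.default_rng(1)
x, eps = rng.standard_normal(5), rng.standard_normal(5)
v = 4.0                          # a shared signal-to-noise ratio value

def latent(alpha_v):
    # z_v = alpha_v * x + sigma_v * eps, with sigma_v = alpha_v / sqrt(v) (Equation 230)
    return alpha_v * x + (alpha_v / np.sqrt(v)) * eps

alpha_a, alpha_b = 0.9, 0.6      # two arbitrary specifications evaluated at the same v
z_a, z_b = latent(alpha_a), latent(alpha_b)
# z_a and z_b differ only by the scalar alpha_a / alpha_b (Equation 232)
```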

2.4 Understanding Diffusion Objectives

In this section, we provide a deeper understanding of the close connection between optimizing various weighted diffusion objectives and maximizing the variational lower bound (a.k.a. the ELBO). Our exposition is designed to be instructive and consistent with VDMs++ (Kingma and Gao, 2023), without departing too far from the material already covered and the notation already used.

In Sections 2.2 and 2.3 we established that when the weighting function is uniform, diffusion-based
objectives correspond directly to the ELBO. However, the relationship between the non-uniform weighted
diffusion objectives and the ELBO is less well understood, as on the face of it they appear to optimize
different things. This has led to the widely held belief that the ELBO (i.e. maximum likelihood) may
not be the correct objective to use if the goal is to obtain high-quality samples.

Although weighted diffusion model objectives appear markedly different from the ELBO, it turns out
that all commonly used diffusion objectives optimize a weighted integral of ELBOs over different noise
levels. Furthermore, if the weighting function is monotonic, then the diffusion objective equates to the
ELBO under simple Gaussian noise-based data augmentation (Kingma and Gao, 2023).

As detailed in subsequent sections, different diffusion objectives imply specific weighting functions w(·)
of the noise schedule. In the following, we provide a detailed introduction to these concepts, highlighting
the most pertinent examples along the way to aid in understanding. To avoid unnecessary repetition, we
refer the reader to Kingma and Gao (2023) for a detailed breakdown of the most commonly used diffusion
loss functions in the literature and the respective derivations of their implied weighting functions.

2.4.1 Weighted Diffusion Loss

The diffusion objectives used in practice can be understood as a weighted version of the diffusion loss:
L∞(x, w) = (1/2) ∫_{SNRmin}^{SNRmax} w(v) E_{ϵ∼N(0,I)}[ ∥x − x̂θ(zv; v)∥²₂ ] dv   (236)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ w(SNR(t)) SNR′(t) ∥x − x̂θ(zt; t)∥²₂ dt ],   (recall Eq. 206)   (237)

where w(v) = w(SNR(t)) is a chosen weighting function of the noise schedule. In intuitive terms, the
weighting function stipulates the relative importance of each noise level prescribed by the noise schedule.
Ideally, we would like to be able to adjust the weighting function such that the model focuses on modelling
perceptually important information and ignores imperceptible bits. In other words, by encouraging our
model to focus on some noise levels more than others using a weighting function, we are implicitly
specifying a preference for modelling low, mid, and/or high-frequency details at different levels.

When w(v) = 1, the diffusion objective is equivalent to maximizing the variational lower bound in
Section 2.2.2. As detailed later in Section 2.4, the invariance to the noise schedule property outlined in
Section 2.3.3 still holds for weighted diffusion objectives.

In terms of noise prediction, following Equation 215, the weighted diffusion objective becomes:
L∞(x, w) = (1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ w(SNR(t)) γη′(t) ∥ϵ − ϵ̂θ(zt; t)∥²₂ dt ],   (238)

where w(SNR(t)) = w(exp(−γη (t))), as per the definition of the (learned) noise schedule in Section 2.1.3.

It turns out that the main difference between most diffusion model objectives boils down to the implied
weighting function w(SNR(t)) being used (Kingma et al., 2021; Kingma and Gao, 2023). For instance, Ho
et al. (2020); Song and Ermon (2019, 2020); Nichol and Dhariwal (2021) choose to minimize a so-called
simple objective of the form:
L∞-simple(x) := E_{ϵ∼N(0,I), t∼U(0,1)}[ ∥ϵ − ϵ̂θ(zt; t)∥²₂ ],   (239)

or the analogous discrete-time version


LT-simple(x) := E_{ϵ∼N(0,I), i∼U{1,T}}[ ∥ϵ − ϵ̂θ(z_{t(i)}; t(i))∥²₂ ],   (240)

where t(i) = i/T for i ∈ {1, . . . , T}. Contrasting the above with Equation 238, we can deduce that the L∞-simple(x) objective implies the following weighting function:

L∞(x, w) = (1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ w(SNR(t)) γη′(t) ∥ϵ − ϵ̂θ(zt; t)∥²₂ ]   (241)

= (1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ (1/γη′(t)) γη′(t) ∥ϵ − ϵ̂θ(zt; t)∥²₂ ]   (242)

= (1/2) L∞-simple(x)  ⟹  w(SNR(t)) = 1/γη′(t).   (243)

It is worth restating that – in contrast to VDMs – the noise schedule specification in most commonly
used diffusion models is fixed rather than learned from data, i.e. there are no learnable parameters η.

Moreover, notice that Ho et al. (2020)’s popular noise prediction objective is an implicitly defined weighted
objective in image space, where the weighting is a function of the signal-to-noise ratio:
L∞-simple(x) = E_{ϵ∼N(0,I), t∼U(0,1)}[ ∥ϵ − ϵ̂θ(zt; t)∥²₂ ]   (244)

= E_{ϵ∼N(0,I), t∼U(0,1)}[ ∥ (zt − αt x)/σt − (zt − αt x̂θ(zt; t))/σt ∥²₂ ]   (245)

= E_{ϵ∼N(0,I), t∼U(0,1)}[ (αt²/σt²) ∥x − x̂θ(zt; t)∥²₂ ]   (246)

= E_{ϵ∼N(0,I), t∼U(0,1)}[ w(SNR(t)) ∥x − x̂θ(zt; t)∥²₂ ],   (247)

recalling that zt = αt x + σt ϵ, ϵ ∼ N (0, I) by definition (ref. Section 2.1). In this case, the implied
weighting function of the noise schedule (in image space) is the identity: w(SNR(t)) = SNR(t).
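The identity behind Equations 244-247, namely that the ε-space squared error equals SNR(t) times the x-space squared error, is easy to verify numerically. The NumPy sketch below is our own, with arbitrary αt, σt and a random stand-in for the model output:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_t, sigma_t = 0.8, 0.6                  # arbitrary schedule values at some t
x, eps = rng.standard_normal(8), rng.standard_normal(8)
z_t = alpha_t * x + sigma_t * eps            # forward sample z_t = alpha_t x + sigma_t eps

x_hat = rng.standard_normal(8)               # stand-in for the denoising model's output
eps_hat = (z_t - alpha_t * x_hat) / sigma_t  # the corresponding noise prediction

eps_err = np.sum((eps - eps_hat) ** 2)       # epsilon-space squared error
x_err = np.sum((x - x_hat) ** 2)             # image-space squared error
snr_t = alpha_t ** 2 / sigma_t ** 2
# eps_err equals snr_t * x_err, i.e. the implied image-space weighting is w(SNR(t)) = SNR(t)
```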

2.4.2 Noise Schedule Density



To remain consistent with Kingma and Gao (2023), let λ = log(αλ²/σλ²) denote the logarithm of the signal-to-noise ratio function SNR(t), where αλ² = sigmoid(λt) and σλ² = sigmoid(−λt), for a timestep t.
Let fλ : [0, 1] → R denote the noise schedule function, which maps from time t ∈ [0, 1] to the log-SNR
λ, which we may explicitly denote by λt . Like before, the noise schedule function is monotonic thus
invertible: t = fλ−1 (λ), and its endpoints are λmax := fλ (0) and λmin := fλ (1).

We can perform a change of variables to define a probability density over noise levels:

p(λ) = pT( fλ⁻¹(λ) ) | dfλ⁻¹(λ)/dλ |   (248)

= 1 · | dt/dλ |   (249)

= − dt/dλ,   (as fλ is monotonic)   (250)

where pT = U(0, 1) is a (continuous) uniform distribution over time, which we sample from during
training t ∼ pT to compute the log-SNR λ = fλ (t). In intuitive terms, the density p(λ) describes the
relative importance that the model assigns to different noise levels. Note that it can sometimes be
beneficial to use different noise schedules for training and sampling (Karras et al., 2022). Since fλ is
strictly monotonically decreasing in time and thus has negative slope, we can simplify the absolute value
in Equation 249 with a negative sign to ensure the density p(λ) remains positive.
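For a concrete schedule (our toy example of a linear log-SNR, fλ(t) = λmax + (λmin − λmax)t), the density p(λ) = −dt/dλ is uniform on [λmin, λmax] and integrates to 1, as the following NumPy sketch confirms:

```python
import numpy as np

lam_max, lam_min = 6.0, -6.0

def f_lambda(t):
    # toy linear noise schedule: log-SNR, strictly decreasing in t on [0, 1]
    return lam_max + (lam_min - lam_max) * t

def f_lambda_inv(lam):
    # inverse schedule: t = f_lambda^{-1}(lambda)
    return (lam - lam_max) / (lam_min - lam_max)

def p_lambda(lam):
    # p(lambda) = -dt/dlambda; constant 1/(lam_max - lam_min) for the linear schedule
    return np.full_like(np.asarray(lam, dtype=float), 1.0 / (lam_max - lam_min))

lams = np.linspace(lam_min, lam_max, 10_001)
dens = p_lambda(lams)
mass = float(np.sum((dens[1:] + dens[:-1]) / 2.0 * np.diff(lams)))   # total probability mass
```

A curved schedule would instead concentrate p(λ) on the noise levels where fλ changes slowly.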

Noting that SNR(t) = e^λ, the diffusion objective can be trivially expressed in terms of λ as:

L∞(x) = −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ SNR′(t) ∥x − x̂θ(zt; t)∥²₂ ]   (251)

= −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ e^λ (dλ/dt) ∥x − x̂θ(zt; t)∥²₂ ]   (chain rule)   (252)

= −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ e^λ (dλ/dt) ∥ (zt − σt ϵ)/αt − (zt − σt ϵ̂θ(zt; t))/αt ∥²₂ ]   (253)

= −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ e^λ (dλ/dt) (σt²/αt²) ∥ϵ − ϵ̂θ(zt; λt)∥²₂ ]   (254)

= −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ (dλ/dt) ∥ϵ − ϵ̂θ(zt; λt)∥²₂ ].   (since e^λ = αt²/σt²)   (255)
2 dt

For complete clarity, the negative sign in front comes from the fact that λt = −γη (t) in the previous
parameterization; so the negative sign in front of the original denoising objective in Equation 206 no
longer cancels out with the −γη′ (t) term from the noise-prediction derivation in Equation 214.

As in Section 2.3.3, we can perform a change of variables to transform our integral w.r.t. time t into an integral w.r.t. our new variable λ – the logarithm of the signal-to-noise ratio:
 
L∞(x) = −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ (dλ/dt) ∥ϵ − ϵ̂θ(zt; λt)∥²₂ ]   (256)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫₀¹ ∥ ϵ − ϵ̂θ( σt ( x √exp(λt) + ϵ ); t ) ∥²₂ · (dλ/dt) dt ]   (257)

= −(1/2) E_{ϵ∼N(0,I)}[ ∫_{fλ(0)}^{fλ(1)} ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ dλ ]   (258)

= (1/2) E_{ϵ∼N(0,I)}[ ∫_{fλ(1)}^{fλ(0)} ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ dλ ]   (swap limits)   (259)

= (1/2) E_{ϵ∼N(0,I)}[ ∫_{λmin}^{λmax} ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ dλ ].   (260)

The weighted version of the objective is then simply

Lw(x) = (1/2) E_{ϵ∼N(0,I)}[ ∫_{λmin}^{λmax} w(λ) ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ dλ ],   (261)

which once again shows that the diffusion loss integral does not depend directly on the noise schedule
function fλ except at its endpoints λmin and λmax, and through the choice of weighting function w(λ). In
other words, given the value of λ, the value of t = fλ−1 (λ) is simply irrelevant for evaluating the integral.

Therefore, the only meaningful difference between diffusion objectives is the choice of weighting function
used (Kingma and Gao, 2023).

2.4.3 Importance Sampling Distribution

Although the invariance to the noise schedule still holds under different weighting functions w(λ) in
Equation 261, it does not hold for the Monte Carlo estimator we use during training (e.g. Equation 255),
which is based on random samples from our distribution over the time variable t ∼ U(0, 1), and Gaussian
noise distribution ϵ ∼ N (0, I). Indeed, the choice of noise schedule affects the variance of the Monte Carlo
estimator of the diffusion loss. To demonstrate this fact, we first briefly review Importance Sampling
(IS); which is a set of Monte Carlo methods used to estimate expectations under a target distribution p
using a weighted average of samples from an importance distribution q of our choosing.

Let p(x) be a probability density for a random variable X, and f (X) be some function we would like to
compute the expectation of µ = Ep [f (X)]. The basic probability result of IS stipulates that whenever
sampling from some target distribution p(x) directly is inefficient or impossible (e.g. we only know p(x)
up to a normalizing constant), we can choose any density q(x) to compute µ:
µ = ∫ f(x) p(x) dx = ∫ f(x) p(x) ( q(x)/q(x) ) dx = E_q[ f(X) p(X)/q(X) ],   (262)

as long as q(x) > 0 whenever f(x)p(x) ̸= 0. Concretely, we can estimate µ using samples from q:

µ̂ = (1/N) Σ_{i=1}^N ( p(Xi)/q(Xi) ) f(Xi),   X1, . . . , XN iid∼ q,   (263)

where, by the weak law of large numbers, µ̂ → µ in probability as N → ∞.
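As a self-contained illustration of Equation 263 (our own toy example, unrelated to diffusion specifics), we estimate µ = E_p[X²] = 1 for the standard normal target p = N(0, 1) using samples from a deliberately different importance distribution q = N(1, 2²):

```python
import numpy as np

rng = np.random.default_rng(3)

def p_pdf(x):
    # target density: standard normal N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def q_pdf(x):
    # importance density: N(1, 2^2), chosen wider than the target
    return np.exp(-0.5 * ((x - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

def f(x):
    # function whose expectation under p we want: E_p[X^2] = 1
    return x ** 2

n = 200_000
xs = rng.normal(1.0, 2.0, size=n)                    # X_i ~ q
mu_hat = float(np.mean(p_pdf(xs) / q_pdf(xs) * f(xs)))   # weighted average (Eq. 263)
```

The estimate converges to 1 as n grows; a poorly chosen q (e.g. one much narrower than p) would instead blow up the variance of the importance weights.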

Now, observe that it is possible to rewrite the weighted diffusion objective above (i.e. Equation 255)
such that the noise schedule density p(λ) is revealed to be an importance sampling distribution:
 
Lw(x) = −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ w(λ) · (dλ/dt) · ∥ϵ − ϵ̂θ(zt; λ)∥²₂ ]   (264)

= − ∫₀¹ ( ∫ (1/2) ∥ϵ − ϵ̂θ(zt; λ)∥²₂ p(ϵ) dϵ ) w(λ) (dλ/dt) dt   (265)

=: − ∫₀¹ h(t; x) w(λ) (dλ/dt) dt   (define h(·) for brevity)   (266)

= ∫_{fλ(1)}^{fλ(0)} h(λ; x) w(λ) dλ   (change-of-variables)   (267)

= ∫_{fλ(1)}^{fλ(0)} h(λ; x) w(λ) ( p(λ)/p(λ) ) dλ   (introduce IS distribution)   (268)

= E_{λ∼p(λ)}[ h(λ; x) w(λ)/p(λ) ]   (269)

= E_{λ∼p(λ)}[ ( w(λ)/p(λ) ) E_{ϵ∼N(0,I)}[ (1/2) ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ ] ]   (270)

= (1/2) E_{ϵ∼N(0,I), λ∼p(λ)}[ ( w(λ)/p(λ) ) ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ ].   (271)

It is clear then that different choices of the noise schedule affect the variance of the Monte Carlo estimator
of the diffusion loss because the noise schedule density p(λ) acts as an importance sampling distribution.
Importantly, judicious choices of the importance distribution can substantially increase the efficiency of
Monte Carlo algorithms for numerically evaluating integrals.

A natural question to ask at this stage is how one may select p(λ), such that a variance reduction is
obtained. Variance reduction is obtained if and only if the difference between the variance of the original
estimator h(λ; x) and the importance sampling estimator ĥ(λ; x) := h(λ; x)w(λ)/p(λ) is strictly positive.
Formally, the following expression should evaluate to a value greater than 0:
Vw(h(λ; x)) − Vp(ĥ(λ; x)) = ∫ h²(λ; x) w(λ) dλ − ( ∫ h(λ; x) w(λ) dλ )²   (272)

− ( ∫ ( h(λ; x) w(λ)/p(λ) )² p(λ) dλ − ( ∫ ( h(λ; x) w(λ)/p(λ) ) p(λ) dλ )² )   (273)

= ∫ h²(λ; x) w(λ) dλ − ( ∫ h(λ; x) w(λ) dλ )²   (274)

− ∫ ĥ²(λ; x) p(λ) dλ + ( ∫ h(λ; x) w(λ) dλ )²   (cancel terms)   (275)

= ∫ ( h²(λ; x) w(λ) − h²(λ; x) w²(λ)/p(λ) ) dλ   (substitute out ĥ)   (276)

= E_w[ ( 1 − w(λ)/p(λ) ) h²(λ; x) ],   (277)

revealing a concise expression that may be useful for practical evaluation. It is a well-known result that the optimal IS distribution is of the form p*(λ) ∝ |h(λ; x)| w(λ), since it minimizes the variance of the IS estimator (Wasserman, 2004). However, this result is mostly of theoretical rather than practical interest, as it requires knowledge of the integral we are aiming to estimate in the first place.

Furthermore, our current setting is somewhat different from the type of problem one would typically
attack with importance sampling, as here we get to choose both distributions involved:

(i) The weighting function w(λ) acts as the target distribution. It stipulates the relative importance
of each noise level and ensures the model is focusing on perceptually important information. How-
ever, w(λ) may not be a valid probability density as the most commonly used (implied) weighting
functions do not integrate to 1 over their support.

(ii) The importance distribution is the noise schedule density p(λ), which specifies the noise schedule
of the Gaussian diffusion process.

This means we technically ought to retune the noise schedule for different choices of weighting function. To avoid this, Kingma and Gao (2023) propose an adaptive noise schedule where:

p(λ) ∝ E_{x∼D, ϵ∼N(0,I)}[ w(λ) ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ ],   (278)

thereby ensuring that the magnitude of the loss (Equation 271) is approximately invariant to λ, and
spread evenly across time t. This approach is often found to speed up optimization significantly.
Both Song et al. (2021a) and Vahdat et al. (2021) have also explored variance reduction techniques
for (latent) diffusion model objectives from a score-based diffusion perspective.

2.4.4 ELBO with Data Augmentation

In this section, we dissect the main result presented by Kingma and Gao (2023); that when the weighting
function of the diffusion loss is monotonic, the resulting objective is equivalent to the ELBO under simple
data augmentation using Gaussian additive noise. We provide an instructive derivation of this result,
discuss its implications, and consider an extension to the setting of non-monotonic weighting functions.

The general goal is to inspect the behaviour of the weighted diffusion objective across time t, and
manipulate the expression such that we end up with an expectation under a valid probability distribution
specified by the weighting function w(λ). This then allows us to examine the integrand and reveal that
it corresponds to the expected negative ELBO of noise-perturbed data.

To that end, let q(zt:1 | x) := q(zt , zt+dt , . . . , z1 | x) denote the joint distribution of the posterior (forward
process) for a subset of timesteps: {t, t + dt, . . . , 1}, where t > 0 and dt denotes an infinitesimal change
in time. Analogously, let p(zt:1 ) denote the prior (generative model) for the same subset of timesteps.

The KL divergence of the joint posterior q(zt:1 | x) from the joint prior p(zt:1 ) is given by

L(t; x) := DKL(q(z_{t:1} | x) ∥ p(z_{t:1}))   (279)

= E_{ϵ∼N(0,I)}[ −(1/2) ∫_{fλ(t)}^{fλ(1)} ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ dλ ].   (from Eq. 260)   (280)

Next, we rearrange and differentiate under the integral sign w.r.t. time t to give:
dL(t; x)/dt = (d/dt) ( − ∫_{fλ(t)}^{fλ(1)} (1/2) E_{ϵ∼N(0,I)}[ ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ ] dλ )   (281)

=: (d/dt) ( − ∫_{fλ(t)}^{fλ(1)} h(λ; x) dλ )   (282)

= −(d/dt) [ F(fλ(1)) − F(fλ(t)) ]   (F is an antiderivative of h)   (283)

= −( 0 − F′(fλ(t)) · fλ′(t) )   (chain rule)   (284)

= F′(λ) · (dλ/dt)   (recall λ = fλ(t))   (285)

= (1/2) (dλ/dt) E_{ϵ∼N(0,I)}[ ∥ϵ − ϵ̂θ(zλ; λ)∥²₂ ],   (since F′(λ) = h(λ; x))   (286)

which allows us to rewrite the weighted diffusion objective by substituting in the above result:
 
Lw(x) = −(1/2) E_{ϵ∼N(0,I), t∼U(0,1)}[ w(λt) · (dλ/dt) · ∥ϵ − ϵ̂θ(zt; λ)∥²₂ ]   (287)

= E_{t∼U(0,1)}[ w(λt) · ( −(1/2) (dλ/dt) E_{ϵ∼N(0,I)}[ ∥ϵ − ϵ̂θ(zt; λ)∥²₂ ] ) ]   (288)

= E_{t∼U(0,1)}[ −w(λt) · dL(t; x)/dt ].   (time derivative)   (289)

After some simple manipulation, we can see that the resulting expression is an expectation of the time
derivative of the joint KL divergence L(t; x), weighted by the weighting function w(λt ). This result is
not particularly interesting or surprising by itself, but it enables the next step; using integration by parts
to turn the above expression into an expectation under a valid probability distribution specified by the
weighting function. Recall that the formula for integration by parts is given by:
∫_a^b u(t) v′(t) dt = u(b)v(b) − u(a)v(a) − ∫_a^b u′(t) v(t) dt.   (290)

Setting u(t) = w(λt ) and v ′ (t) = d/dt L(t; x) then gives:


Lw(x) = − ∫₀¹ w(λt) ( dL(t; x)/dt ) dt   (291)

= −( w(λ1) L(1; x) − w(λ0) L(0; x) − ∫₀¹ ( dw(λt)/dt ) L(t; x) dt )   (292)

= ∫₀¹ ( dw(λt)/dt ) L(t; x) dt + w(λ0) L(0; x) − w(λ1) L(1; x)   (293)

= ∫₀¹ ( dw(λt)/dt ) L(t; x) dt + c,   (absorb constants into c)   (294)

where c is a small constant for two simple reasons:

(i) w(λ0 )L(0; x) = w(λmax )DKL (q(z0:1 | x) ∥ p(z0:1 )) is small due to the weighting function acting on
λmax always being very small by construction (Kingma and Gao, 2023);

(ii) − w(λ1 )L(1; x) = − w(λmin )DKL (q(z1 | x) ∥ p(z1 )) includes the KL between the posterior of the
noisiest latent z1 and the prior, which is both independent of the parameters θ of the model
ϵ̂θ (zt ; λ), and very close to 0 for a well-specified forward diffusion process.

The astute reader may notice that the derivative term d/dt w(λt ) in Equation 294 is a valid probability
density function (PDF) specified by the weighting function, so long as w(λt ) is monotonically increasing
w.r.t. time t, and w(λt=1 ) = 1. The proof is straightforward: by the Fundamental Theorem of Calculus,
the PDF f (x) of a random variable X is obtained by differentiating the cumulative distribution function
(CDF) F (x), that is: f (x) = d/dx F (x), where F : R → [0, 1], limx→−∞ F (x) = 0 and limx→∞ F (x) = 1.

Therefore, in our context, w is a valid CDF if it satisfies three conditions:

(i) w : R → [0, 1] (maps the real line to [0, 1]) (295)


(ii) t > t − dt ⟹ w(λt) ≥ w(λ_{t−dt}), ∀t ∈ [0, 1]   (non-decreasing w.r.t. time t)   (296)
(iii) lim_{t→0} w(λt) = 0, and lim_{t→1} w(λt) = 1.   (w is normalized)   (297)

If the above holds, we can define a valid probability distribution pw (t) specified by the weighting function:
pw(t) := dw(λt)/dt,   where   w(λt) = ∫₀ᵗ pw(t′) dt′,   (298)

with support on the range [0, 1], thus ∫₀¹ pw(t) dt = 1.
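Since w(λt), viewed as a function of t, then behaves like a CDF, one can sample t ∼ pw(t) by inverse-transform sampling: draw u ∼ U(0, 1) and return the t for which w(λt) = u. A small NumPy sketch with a toy weighting W(t) = t² (our own choice; any monotone map with W(0) = 0 and W(1) = 1 works):

```python
import numpy as np

rng = np.random.default_rng(4)

def w_cdf(t):
    # toy monotone weighting viewed as a CDF in t: w(0) = 0, w(1) = 1, non-decreasing
    return t ** 2

def w_cdf_inv(u):
    # inverse CDF, used for inverse-transform sampling
    return np.sqrt(u)

u = rng.uniform(size=500_000)
t_samples = w_cdf_inv(u)        # t ~ p_w(t) = dW/dt = 2t on [0, 1]
```

Here pw(t) = 2t up-weights late (high-noise) timesteps; the empirical mean of the samples should approach E[t] = ∫₀¹ t · 2t dt = 2/3.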

This then permits us to rewrite the diffusion loss as an expectation under pw (t) by substituting:
Lw(x) = ∫₀¹ ( dw(λt)/dt ) L(t; x) dt + c   (from Eq. 294)   (299)

= E_{t∼pw(t)}[ L(t; x) ] + c.   (300)

The final step is to show that the joint KL divergence L(t; x) for any subset of timesteps {t, t + dt, . . . , 1}
decomposes into the expected negative ELBO of noisy data zt ∼ q(zt | x) at any particular timestep t:

L(t; x) = DKL (q(zt:1 | x) ∥ p(zt:1 )) (301)

q(zt:1 | x)
Z
= q(zt:1 | x) log dzt:1 (302)
p(zt:1 )
q(zt | x)q(zt+dt:1 | x)
Z
= q(zt | x)q(zt+dt:1 | x) log dzt:1 (factor the joint) (303)
p(zt | zt+dt )p(zt+dt:1 )
q(zt+dt:1 | x)
  
= Eq(zt |x) Eq(zt+dt:1 |x) log − log p(zt | zt+dt ) + Eq(zt |x) [log q(zt | x)] (304)
p(zt+dt:1 )
 
= Eq(zt |x) Eq(zt+dt |x) [− log p(zt | zt+dt )] + DKL (q(zt+dt:1 | x) ∥ p(zt+dt:1 ))

− H(q(zt | x)) (constant entropy term H(·)) (305)

= Eq(zt |x) [−ELBOt (zt )] − H(q(zt | x)). (expected free energy) (306)

As shown, factoring the joint distributions into infinitesimal transitions between zt and zt+dt reveals
an expected variational free energy term (negative ELBO), which is an upper bound on the negative
log-likelihood of noisy data: −ELBOt (zt ) ≥ − log p(zt ), where zt ∼ q(zt | x) for any timestep t. The
entropy term H(q(zt | x)) is constant since our forward process is fixed, i.e. it is a Gaussian diffusion.

Finally, substituting the above result into the (expected) weighted diffusion loss gives

L_w(x) = E_{p_w(t)}[L(t; x)] + c   (from Equation 300)   (307)

= E_{p_w(t)}[E_{q(z_t|x)}[−ELBO_t(z_t)] − H(q(z_t | x))] + c   (substitute)   (308)

= −E_{p_w(t), q(z_t|x)}[ELBO_t(z_t)] + c   (absorb entropy constant)   (309)

≥ −E_{p_w(t), q(z_t|x)}[log p(z_t)] + c,   (noisy data log-likelihood)   (310)



which proves that when the weighting function w(λt ) is monotonically increasing w.r.t. time t, diffusion
objectives are equivalent to the ELBO under simple data augmentation using Gaussian additive noise.
To be clear, the Gaussian additive noise comes from the fact that the forward diffusion specification is
linear Gaussian, and as such, each zt ∼ q(zt | x) is simply a noisy version of the data x. The distribution
pw (t) acts as a sort of data augmentation kernel, specifying the importance of different noise levels. It is
worth noting that this type of data augmentation setup resembles distribution augmentation (DistAug)
and distribution smoothing methods (Meng et al., 2020; Jun et al., 2020), which have previously been
shown to improve the sample quality of autoregressive generative models.
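
The data augmentation view can be made concrete with a short sketch. The schedule below assumes a variance-preserving parameterization with α_t² = sigmoid(λ_t), a linear log-SNR schedule, and the same sigmoid weighting as before; all of these are illustrative choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variance-preserving setup: alpha_t^2 = sigmoid(lambda_t),
# sigma_t^2 = 1 - alpha_t^2, with an assumed linear log-SNR schedule.
lam_max, lam_min = 10.0, -10.0
t_grid = np.linspace(0.0, 1.0, 10_001)
w_grid = 1.0 / (1.0 + np.exp(lam_max + t_grid * (lam_min - lam_max)))

def augment(x):
    """Draw one noisy view z_t ~ q(z_t | x) at a level t ~ p_w(t)."""
    u = rng.uniform()
    t = np.interp(u, w_grid, t_grid)           # inverse-CDF sample of t
    lam_t = lam_max + t * (lam_min - lam_max)  # log-SNR at time t
    alpha_t = np.sqrt(1.0 / (1.0 + np.exp(-lam_t)))
    sigma_t = np.sqrt(1.0 - alpha_t ** 2)
    eps = rng.standard_normal(x.shape)
    return alpha_t * x + sigma_t * eps, t

x = rng.standard_normal(16)   # stand-in for a data example
z_t, t = augment(x)
assert z_t.shape == x.shape and 0.0 <= t <= 1.0
```

Each call produces a noisy version of x at a noise level whose frequency of occurrence is dictated by p_w(t), i.e. by the weighting function.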

Now going back to Equation 294, we see that the diffusion loss is a weighted integral of ELBOs:
L_w(x) = ∫_0^1 dw(λ_t)/dt · L(t; x) dt + c = ∫_0^1 E_{q(z_t|x)}[−ELBO_t(z_t)] dw(λ_t) + c,   (311)

since L(t; x) equates to the expected negative ELBO for noise-perturbed data zt ∼ q(zt | x) as explained
above, and the dw(λt ) term simply weights the ELBO at each noise level.
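
The equality between the weighted integral and the expectation under t ∼ p_w(t) can be verified numerically. The sketch below uses the assumed schedule and weighting from before, plus a dummy stand-in for L(t; x), and samples t by inverse-CDF transform, which is valid because w(λ_t) is the CDF of p_w and increasing in t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule/weighting and a dummy per-timestep loss L(t)
# standing in for E_q[-ELBO_t(z_t)]; all are illustrative choices.
lam_max, lam_min = 10.0, -10.0
t_grid = np.linspace(0.0, 1.0, 20_001)
w_grid = 1.0 / (1.0 + np.exp(lam_max + t_grid * (lam_min - lam_max)))

def L(t):
    return 1.0 + np.sin(3.0 * t) ** 2  # dummy stand-in loss

# LHS: weighted integral \int_0^1 (dw/dt) L(t) dt, via trapezoidal quadrature
dw_dt = np.gradient(w_grid, t_grid)
integrand = dw_dt * L(t_grid)
dt = t_grid[1] - t_grid[0]
lhs = np.sum(0.5 * (integrand[:-1] + integrand[1:])) * dt

# RHS: E_{t ~ p_w(t)}[L(t)], with t sampled by inverse-CDF transform
u = rng.uniform(size=200_000)
t_samples = np.interp(u, w_grid, t_grid)
rhs = L(t_samples).mean()

assert abs(lhs - rhs) < 1e-2
```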

Non-monotonic Weighting. Several works have observed impressive synthesis results when using
non-monotonic weighting functions (Nichol and Dhariwal, 2021; Karras et al., 2022; Choi et al., 2022;
Hang et al., 2023). What are the theoretical implications of using such weighting functions?

Looking again at Equation 311, we observe that regardless of the weighting function, diffusion objectives
boil down to a weighted integral of ELBOs. However, if the weighting function is non-monotonic (thus
w is not a valid CDF) then the derivative term d/dt w(λt ) will be negative for some points in time,
meaning we end up minimizing the ELBO at those noise levels rather than maximizing it! This is
somewhat inconvenient in light of the practical success of non-monotonic weighting functions, and seems
to reaffirm the widespread belief that maximum likelihood may not be the appropriate objective for
generating high-quality samples. One sensible explanation for this success is that the non-monotonic
weighting functions sacrifice some modes of the likelihood in exchange for better perceptual synthesis.
Indeed, the majority of bits in images are allocated to imperceptible details and can therefore be largely
ignored if what we care about is perceptual quality. This aspect was discussed by Ho et al. (2020); they
found that although their diffusion models were not competitive with the state-of-the-art likelihood-based
models in terms of lossless codelengths, the samples were of high quality nonetheless.
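
The sign issue is easy to exhibit numerically. The sketch below uses an assumed bell-shaped weighting over log-SNR, loosely in the spirit of perception-prioritized weightings but not the exact function from any cited work, and shows that its time-derivative changes sign:

```python
import numpy as np

# Assumed bell-shaped (non-monotonic) weighting over log-SNR; the exact
# form is illustrative only.
lam_max, lam_min = 10.0, -10.0
t = np.linspace(0.0, 1.0, 10_001)
lam = lam_max + t * (lam_min - lam_max)
w_bump = np.exp(-0.5 * (lam / 3.0) ** 2)  # peaks at intermediate noise levels

# w is not a valid CDF here: its time-derivative changes sign, so under
# the weighted-integral-of-ELBOs view, the ELBO is *minimized* at the
# noise levels where dw/dt < 0.
dw_dt = np.gradient(w_bump, t)
assert (dw_dt > 0).any() and (dw_dt < 0).any()
```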

With that said, Kingma and Gao (2023) showed that contrary to popular belief, likelihood maximization
(i.e. maximizing the ELBO) and high-quality image synthesis are not mutually exclusive in diffusion
models. They were able to achieve state-of-the-art FID (Heusel et al., 2017)/Inception (Salimans et al.,
2016) scores on the high-resolution ImageNet benchmark using monotonic weighting functions and some
practical/architectural improvements proposed by Hoogeboom et al. (2023). As we have learned, opti-
mizing the weighted diffusion loss with a monotonic weighting function is equivalent to maximizing the
ELBO under simple data augmentation using Gaussian additive noise.

3 Discussion

Despite the growing popularity of diffusion models, gaining a deep understanding of the model class
remained somewhat elusive for the uninitiated in non-equilibrium statistical physics. With that in mind,
we have presented a more straightforward introduction to diffusion models using directed graphical
modelling and variational inference principles, which imposes relatively fewer prerequisites on the reader.

Our exposition began with a basic review of latent variable models like VAEs. We then reviewed their
deep hierarchical counterparts and established a unifying graphical modelling-based perspective on their
connection with diffusion models. We showed that diffusion models share a specific top-down latent vari-
able hierarchy structure with ladder networks (Valpola, 2015) and top-down inference HVAEs (Sønderby
et al., 2016), which among other things explains why they share the same optimization objective. Al-
though introducing additional (auxiliary) latent variables significantly improves the flexibility of both
the inference and generative models, it comes with additional challenges. We highlighted the difficulties
with using purely bottom-up inference procedures in deep latent variable hierarchies, including posterior
collapse for instance, whereby the posterior distribution (of the top-most layer, say) may collapse to a
standard Gaussian prior, failing to learn meaningful representations and deactivating latent variables.
Both Burda et al. (2015) and Sohl-Dickstein et al. (2015) point to the asymmetry between the associ-
ated generative and inference models in HVAEs as a source of difficulty in training the inference model
efficiently, since there is no way to express each term in the variational lower bound as an expectation
under a distribution over a single latent variable. Luo (2022); Bishop and Bishop (2023) present a similar
efficiency-based argument against using bottom-up inference in hierarchical latent variable models.

We claim that efficiency arguments paint an incomplete picture; the main reason one should avoid
bottom-up inference in hierarchical latent variable models is the lack of direct feedback from the generative
model. We argue that since the purpose of the inference model is to perform Bayesian inference at any
given layer in the hierarchy, it stands to reason that interleaving feedback from each transition in the
generative model into each respective transition in the inference model can only make both the inference
and generative models more aligned and accurate. Although this rationale may not apply to diffusion
models in quite the same way as it applies to HVAEs, because the inference model of the former is
typically fixed, the top-down latent hierarchy structure is nonetheless ubiquitous. Moreover, since the
top-down posterior in diffusion models is tractable by definition, and follows the same topological ordering
of the latent variables as the generative model, it can be used to specify the generative model transitions
by simply replacing the data in our conditioning set with a denoising model. This offers an intuitive view
of diffusion models as a specific instantiation of ladder networks and/or HVAEs with top-down inference.

A major problem with VAEs and their hierarchical counterparts, which is not present in diffusion models,
is the hole problem. The hole problem refers to the mismatch between the so-called aggregate posterior
(i.e. simply the average posterior distribution over the dataset) and the prior over the latent variables.
As shown in Figure 3, there can be regions with high probability density under the prior which have low
probability density under the aggregate posterior. This then affects the quality of generated samples,
as the decoder may be tasked with decoding latent variables sampled from regions not covered by the
training data. Moreover, the higher the dimensionality of our input data, the less likely it is that our finite
dataset covers the entirety of the input space. The manifold hypothesis posits that high-dimensional
datasets lie along a much lower-dimensional latent manifold. However, since providing latent variable
identifiability guarantees is very challenging for most interesting problems, in practice, we often resort
to unfalsifiable assumptions about both the functional form and dimensionality of the latent space.

Diffusion models cleverly circumvent both of the aforementioned issues by: (i) defining the aggregate
posterior to be equal to the prior by construction; and (ii) sacrificing the ability to learn reusable
representations by fixing the posterior distribution according to a predefined noise schedule. The first
point (i) ensures a smooth transition between the prior p(zT ) and the model p(x | z1 ), so we avoid
sampling latent variables from regions of low density under the aggregate posterior q(zT ). The second
point (ii) entails manually specifying how smooth this transition is by defining the latent variables z1:T
to simply be incrementally noisier versions of the input according to a judicious choice of noise schedule.
This added noise can be interpreted as a kind of data augmentation technique, which helps smooth
out the data density landscape and connect distant modes of the underlying data distribution. As
explained in Section 2.4.1, the noise schedule is specified by the parameters α_t and σ_t², and, in combination
with a weighting function w(α_t²/σ_t²), it stipulates the relative importance of each noise level in the diffusion
objective. It turns out that the main difference between most diffusion model objectives boils down to
the implied weighting function being used as a result (Kingma et al., 2021; Kingma and Gao, 2023). If
our primary goal is high-quality sample generation, it suffices to adjust the noise schedule and weighting
function such that the model focuses on perceptually important information and ignores imperceptible
bits. Indeed, the majority of bits in images are allocated to imperceptible details and can in principle
be ignored. Moreover, by encouraging the model to focus on some noise levels more than others, we are
implicitly prescribing a preference for modelling low, mid, and/or high-frequency details at different noise
levels. Ho et al. (2020) found that although their diffusion models were not competitive with state-of-the-
art likelihood-based models in terms of lossless codelengths, the samples were of high quality nonetheless.
This demonstrates that diffusion models possess excellent inductive biases for image data.

Since the Markovian transitions between latent states q(zt | zt−1 ) are chosen to be linear Gaussian with
isotropic covariances, the top-down posterior q(zt−1 | zt , x) is tractable through Gaussian conjugacy
and the KL divergence terms in the associated VLB simplify significantly down to squared-error terms
(Section 2.2.3). Given that the Gaussian diffusion process can be defined directly in terms of the con-
ditionals q(zt | x) (ref. Section 2.1), it is possible to: (i) train any level of the latent variable hierarchy
independently of the others; (ii) share the same denoising model across the whole hierarchy. This consti-
tutes a critical advantage over ladder networks and top-down HVAEs, as they both induce hierarchical
dependencies between the latent states which prevent training individual layers independently. This
advantage is particularly salient for infinitely deep latent variable hierarchies. As explained in detail
in Section 2, diffusion models provide a principled framework for making the latent variable hierarchy
infinitely deep. Such models are trained in continuous-time where T → ∞, and can be shown to always
improve the diffusion loss compared to hierarchical latent variable models like top-down HVAEs and
discrete-time diffusion models (Kingma et al., 2021) (ref. Section 2.3.1). It is also worth reiterating the
ease with which it is possible to recast the denoising task in diffusion models in terms of noise predic-
tion (Ho et al., 2020) rather than image prediction (see e.g. Section 2.2), that noise prediction seems to
perform better in practice, and that the resulting setup has close connections to score-based generative
modeling (Hyvärinen and Dayan, 2005; Vincent, 2011; Song and Ermon, 2019; Song et al., 2021b).
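
The collapse of the KL terms into squared errors is easy to verify for the shared isotropic-covariance case, where D_KL(N(μ_q, σ²I) ∥ N(μ_p, σ²I)) = ∥μ_q − μ_p∥² / (2σ²). A quick numerical check with arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussians with a shared isotropic covariance sigma2 * I; their KL
# collapses to a scaled squared error between the means.
d, sigma2 = 5, 0.3
mu_q = rng.standard_normal(d)   # e.g. mean of q(z_{t-1} | z_t, x)
mu_p = rng.standard_normal(d)   # e.g. mean of p(z_{t-1} | z_t)

closed_form = np.sum((mu_q - mu_p) ** 2) / (2.0 * sigma2)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] (normalizers cancel)
z = mu_q + np.sqrt(sigma2) * rng.standard_normal((200_000, d))
log_ratio = (np.sum((z - mu_p) ** 2, axis=1)
             - np.sum((z - mu_q) ** 2, axis=1)) / (2.0 * sigma2)
assert np.isclose(closed_form, log_ratio.mean(), rtol=0.05)
```

This is the simplification that turns each KL term in the VLB into a weighted mean-prediction (or, after reparameterization, noise-prediction) squared error.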

The success of diffusion models can be partly attributed to an additional reduction in degrees of free-
dom compared to top-down HVAEs. In VAEs, several simplifying assumptions are made to ensure the
inference problem is both tractable and scalable: (i) amortized variational inference; (ii) mean-field
variational family assumption; (iii) assumed parametric distributions for both the prior and likelihood;
(iv) stochastic optimization of a Monte Carlo estimator of the evidence lower bound. Given the close
connection between top-down HVAEs and diffusion models established in Section 2, we can see that
comparatively speaking diffusion models constitute yet another simplifying assumption by fixing the

inference distribution to follow a pre-defined noise schedule. This transforms the learning problem from
minimizing the reverse KL divergence to improve our posterior approximation, arg min_{q∈Q} D_KL(q(z_{1:T} | x) ∥ p(z_{1:T})),
where the prior p may be fixed, into minimizing the forward KL divergence, arg min_{p∈P} D_KL(q(z_{1:T} | x) ∥ p(z_{1:T})),
where the posterior q is fixed:

arg min_{p∈P} D_KL(q(z_{1:T} | x) ∥ p(z_{1:T})) = arg min_{p∈P} E_{q(z_{1:T}|x)}[− log p(z_{1:T})] − H(q(z_{1:T} | x))   (312)

= arg max_{p∈P} E_{q(z_{1:T}|x)}[log p(z_{1:T})] + c,   (313)

which essentially amounts to a supervised learning problem with noise-augmented i.i.d. data, optimized
via maximum likelihood. Indeed, if all we care about is image synthesis quality, then it is intuitively
advantageous to sacrifice the ability to learn reusable representations by fixing the posterior q and focusing
on leveraging the tried-and-tested machinery of supervised learning to train a good generative model.
This rationale also motivated ladder networks (Valpola, 2015). Purely unsupervised learning methods
try to represent all the information about p(x), which includes imperceptible details and complicates the
learning problem. Conversely, supervised learning is effective at filtering out unnecessary information for
the task at hand, which in combination with carefully weighted diffusion objectives, explains how/why
diffusion models are capable of high-quality image synthesis that better aligns with human perception.
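
A minimal sketch of the resulting supervised learning problem is shown below; the linear "denoiser" W and the linear log-SNR schedule are placeholders (assumptions for illustration, not a real model), but the structure of the step, regressing onto noise-augmented i.i.d. data, is the familiar one:

```python
import numpy as np

rng = np.random.default_rng(0)

def training_step(x_batch, W):
    """One supervised step on noise-augmented data (epsilon-prediction MSE).

    W is a placeholder for a denoising network; the linear log-SNR
    schedule is an assumed illustrative choice.
    """
    t = rng.uniform(size=(x_batch.shape[0], 1))   # t ~ U[0, 1] (unweighted)
    lam_t = 10.0 - 20.0 * t                       # assumed log-SNR schedule
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-lam_t)))
    sigma = np.sqrt(1.0 - alpha ** 2)             # variance preserving
    eps = rng.standard_normal(x_batch.shape)
    z_t = alpha * x_batch + sigma * eps           # z_t ~ q(z_t | x)
    eps_hat = z_t @ W                             # predicted noise
    return np.mean((eps_hat - eps) ** 2)          # regression objective

x = rng.standard_normal((32, 8))                  # stand-in i.i.d. "data"
W = np.zeros((8, 8))                              # trivial model: eps_hat = 0
loss = training_step(x, W)
assert loss > 0.0
```

Note that nothing in this step involves an inference network: q is fixed, so each step is plain maximum-likelihood-style regression on augmented data.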

In Section 2.4, we provided a deeper understanding of the various weighted diffusion objectives in the
literature. By further analyzing the objective in Equation 313, it is possible to show that the diffusion loss
is equivalent to the ELBO under simple data augmentation using Gaussian additive noise (Kingma and
Gao, 2023), so long as the weighting function w(λt ) is monotonically increasing w.r.t. time t. We showed
that if w is a valid CDF, then we can define a valid probability distribution pw (t) specified by the weight-
ing function, which acts as a data augmentation kernel and dictates the importance of different noise
levels as outlined by Kingma and Gao (2023). However, multiple works report impressive image synthesis
results using non-monotonic weighting functions (Nichol and Dhariwal, 2021; Karras et al., 2022), which
somewhat peculiarly implies that the ELBO is being minimized at certain noise levels. This seems to
reaffirm the widespread belief that maximum likelihood may not be the appropriate objective to use for
high-quality sample generation. However, Kingma and Gao (2023) showed that likelihood maximization
need not be intrinsically at odds with high-quality image synthesis, as they achieved state-of-the-art FID
scores on the high-resolution ImageNet benchmark by optimizing the ELBO (under data augmentation).

We further argue that reconnecting weighted diffusion objectives with maximum likelihood (i.e. the
ELBO) is particularly important because we know from Shannon’s source coding theorem that the
average codelength of the optimal compression scheme is the entropy of the data H(X) = E[− log p(x)].
Thus, as long as we are minimizing codelengths given by the information content − log pθ (x) defined
by a probabilistic model pθ (e.g. by maximizing likelihood), then the resulting average codelength
E[− log pθ (x)] approaches the entropy of the true data distribution. This is the fundamental goal of
generative modelling and compression, a goal with which maximum likelihood is well aligned.
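
A toy illustration of this source-coding argument, with arbitrarily chosen distributions: the average codelength under any model q is the cross-entropy E_p[−log₂ q(x)], which is lower-bounded by the entropy H(p), with equality only when q = p:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" data distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched model

entropy = -np.sum(p * np.log2(p))          # H(p) = 1.75 bits (optimal)
codelen = -np.sum(p * np.log2(q))          # 2.0 bits under model q

assert codelen >= entropy                  # equality only when q = p
```

Maximizing likelihood shrinks exactly this gap (the KL divergence D_KL(p ∥ q)), pushing the model's average codelength towards the entropy of the data.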

To conclude, the success of diffusion models is arguably as much a product of collective engineering effort
and scale as it is a product of algorithmic and theoretical insight. Nonetheless, identifying analogies be-
tween model classes undoubtedly aids in understanding, and recognizing the unique properties of specific
models helps refine our intuitions about what may or may not work in the future. Fertile ground for
future work includes enabling diffusion models to learn semantic latent representations, expanding for-
ward diffusion processes beyond linear Gaussian transitions, and leveraging the identifiability guarantees
of probability flow ODEs (Song et al., 2021b) for causal representation learning and inference.

Acknowledgements. We would like to thank Charles Jones, Rajat Rasal and Avinash Kori for fruitful
discussions and valuable feedback. The authors are supported with funding from the ERC under the
EU’s Horizon 2020 research and innovation programme (Project MIRA, grant No. 757173).

References
Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Appli-
cations, 12(3):313–326.

Bishop, C. M. and Bishop, H. (2023). Deep Learning: Foundations and Concepts. Springer.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians.
Journal of the American Statistical Association, 112(518):859–877.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint
arXiv:1509.00519.

Child, R. (2020). Very deep VAEs generalize autoregressive models and can outperform them on images.
In International Conference on Learning Representations.

Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., and Yoon, S. (2022). Perception prioritized training
of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 11472–11481.

De Sousa Ribeiro, F., Xia, T., Monteiro, M., Pawlowski, N., and Glocker, B. (2023). High fidelity image
counterfactuals with probabilistic causal models. In Proceedings of the 40th International Conference
on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7390–7425.
PMLR.

Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in neural
information processing systems, 34:8780–8794.

Hang, T., Gu, S., Li, C., Bao, J., Chen, D., Hu, H., Geng, X., and Guo, B. (2023). Efficient diffusion
training via min-snr weighting strategy. arXiv preprint arXiv:2303.09556.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two
time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing
systems, 30.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural
information processing systems, 33:6840–6851.

Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. (2022). Cascaded diffusion
models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal
of Machine Learning Research.

Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational
evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, volume 1.

Hoogeboom, E., Heek, J., and Salimans, T. (2023). simple diffusion: End-to-end diffusion for high
resolution images. In Proceedings of the 40th International Conference on Machine Learning, volume
202 of Proceedings of Machine Learning Research, pages 13213–13232. PMLR.

Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. (2022). Equivariant diffusion for molecule
generation in 3d. In International conference on machine learning, pages 8867–8887. PMLR.

Huang, C.-W., Lim, J. H., and Courville, A. C. (2021). A variational perspective on diffusion-based
generative models and score matching. Advances in Neural Information Processing Systems, 34:22863–
22876.

Hyvärinen, A. and Dayan, P. (2005). Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research, 6(4).

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational
methods for graphical models. Machine learning, 37:183–233.

Jun, H., Child, R., Chen, M., Schulman, J., Ramesh, A., Radford, A., and Sutskever, I. (2020). Dis-
tribution augmentation for generative modeling. In International Conference on Machine Learning,
pages 5006–5019. PMLR.

Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the design space of diffusion-based
generative models. Advances in Neural Information Processing Systems, 35:26565–26577.

Kingma, D., Salimans, T., Poole, B., and Ho, J. (2021). Variational diffusion models. Advances in neural
information processing systems, 34:21696–21707.

Kingma, D. P. and Gao, R. (2023). Understanding the diffusion objective as a weighted integral of ELBOs.
arXiv preprint arXiv:2303.00848.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved
variational inference with inverse autoregressive flow. Advances in neural information processing sys-
tems, 29.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kingma, D. P., Welling, M., et al. (2019). An introduction to variational autoencoders. Foundations and
Trends® in Machine Learning, 12(4):307–392.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative
modeling. In The Eleventh International Conference on Learning Representations.

Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.

Maaløe, L., Fraccaro, M., Liévin, V., and Winther, O. (2019). BIVA: A very deep hierarchy of latent
variables for generative modeling. Advances in neural information processing systems, 32.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. (2016). Auxiliary deep generative models.
In International conference on machine learning, pages 1445–1453. PMLR.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv
preprint arXiv:1511.05644.

Meng, C., Song, J., Song, Y., Zhao, S., and Ermon, S. (2020). Improved autoregressive modeling with
distribution smoothing. In International Conference on Learning Representations.

Monteiro, M., Ribeiro, F. D. S., Pawlowski, N., Castro, D. C., and Glocker, B. (2022). Measuring
axiomatic soundness of counterfactual image models. In The Eleventh International Conference on
Learning Representations.

Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In International
Conference on Machine Learning, pages 8162–8171. PMLR.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., and Chen,
M. (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion
models. In International Conference on Machine Learning, pages 16784–16804. PMLR.

Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International conference
on machine learning, pages 324–333. PMLR.

Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. (2015). Semi-supervised learning
with ladder networks. Advances in neural information processing systems, 28.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. In International conference on machine learning, pages 1278–1286.
PMLR.

Rezende, D. J. and Viola, F. (2018). Taming vaes. arXiv preprint arXiv:1810.00597.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image
synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 10684–10695.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes,
R., Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-to-image diffusion models with
deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved
techniques for training gans. Advances in neural information processing systems, 29.

Salimans, T. and Ho, J. (2021). Should EBMs model the energy or the score? In Energy Based Models
Workshop - ICLR 2021.

Salimans, T. and Ho, J. (2022). Progressive distillation for fast sampling of diffusion models. In Inter-
national Conference on Learning Representations.

Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain monte carlo and variational inference:
Bridging the gap. In International conference on machine learning, pages 1218–1226. PMLR.

Shu, R. and Ermon, S. (2022). Bit prioritization in variational autoencoders via progressive coding. In
International Conference on Machine Learning, pages 20141–20155. PMLR.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning
using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–
2265. PMLR.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational
autoencoders. Advances in neural information processing systems, 29.

Song, Y., Durkan, C., Murray, I., and Ermon, S. (2021a). Maximum likelihood training of score-based
diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428.

Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution.
Advances in neural information processing systems, 32.

Song, Y. and Ermon, S. (2020). Improved techniques for training score-based generative models. Advances
in neural information processing systems, 33:12438–12448.

Song, Y. and Kingma, D. P. (2021). How to train your energy-based models. arXiv preprint
arXiv:2101.03288.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based
generative modeling through stochastic differential equations. In International Conference on Learning
Representations.

Tomczak, J. and Welling, M. (2018). VAE with a VampPrior. In International Conference on Artificial
Intelligence and Statistics, pages 1214–1223. PMLR.

Tzen, B. and Raginsky, M. (2019). Neural stochastic differential equations: Deep latent gaussian models
in the diffusion limit. arXiv preprint arXiv:1905.09883.

Vahdat, A. and Kautz, J. (2020). NVAE: A deep hierarchical variational autoencoder. Advances in Neural
Information Processing Systems, 33:19667–19679.

Vahdat, A., Kreis, K., and Kautz, J. (2021). Score-based generative modeling in latent space. Advances
in Neural Information Processing Systems, 34:11287–11302.

Valpola, H. (2015). From neural PCA to deep unsupervised learning. In Advances in independent compo-
nent analysis and learning machines, pages 143–171. Elsevier.

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computa-
tion, 23(7):1661–1674.

Wasserman, L. (2004). All of statistics: a concise course in statistical inference, volume 26. Springer.
