Lecture 12

Score-Based Generative Models
Volodymyr Kuleshov
Cornell Tech
Lecture 12
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 1 / 38

Announcements
Assignment 3 has been released

Please make sure to have signed up for a presentation slot by the
end of the week.

Recap So Far
We have seen several generative model families so far:

1 Autoregressive and Latent Variable Models:
  Can perform generation, representation learning, outlier detection, etc.
  Max-likelihood may not give best samples, imposes model constraints
2 Generative Adversarial Networks:
  Can optimize arbitrary divergences; produce good samples
  Training is difficult and unstable; no clear stopping criterion
3 Energy-Based Models:
  Make very few modeling assumptions
  Maximum likelihood training with MCMC is slow
This lecture: can we train energy-based models without slow MCMC?
Energy-Based Models
Energy-based models specify distributions of the form

$$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\,dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$
Pros:
extreme flexibility: can use pretty much any function fθ (x) you want
Cons: Computing Z (θ) is generally intractable. As a result:
Sampling from pθ (x) is hard
Evaluating and optimizing likelihood pθ (x) is hard (learning is hard)
No feature learning (but can add latent variables)
What if we could train models without using Z ?

Lecture Outline
1 Score Functions
Motivation
Definitions
Score estimation
2 Score Matching
Fisher Divergences
Score Matching
Sliced Score Estimation
3 Sample Generation
Langevin Dynamics
Manifold Hypothesis
Gaussian Smoothing
4 Connections to Diffusion Models
Figures and contents adapted from Stefano Ermon and Lily Weng
The (Stein) Score Function: Definition
Given a probability density p(x), its (Stein) score function is defined as
∇x log p(x)
This is the gradient of the log-density with respect to the input x.

Note that this is different from the gradient of the log-density with respect to
the parameters (which is what we normally use in gradient descent).
The score function points in the direction in which the model probability increases fastest.

The (Stein) Score Function: 1D Example
On the left is a mixture of Gaussians. On the right is its score function in 1D.
For points between the two modes, the score function is negative (points towards the left) near the left
mode and is positive (points towards the right) near the right mode, so it pulls x towards a nearby mode.
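To make the sign pattern concrete, here is a minimal sketch that evaluates the score of a 1D mixture of two Gaussians in closed form. The mixture parameters (means at -2 and +2, unit variance, equal weights) are assumptions chosen for illustration and are not taken from the slide's figure.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_score(x, mus=(-2.0, 2.0), sigma=1.0):
    """Score d/dx log p(x) of an equal-weight 1D Gaussian mixture.

    The means and sigma are illustrative assumptions.
    """
    comps = [gaussian_pdf(x, mu, sigma) for mu in mus]
    p = sum(comps) / len(mus)
    # d/dx of each component density is pdf * (-(x - mu) / sigma^2)
    dp = sum(c * (-(x - mu) / sigma ** 2) for c, mu in zip(comps, mus)) / len(mus)
    return dp / p  # d/dx log p = p'(x) / p(x)

for x in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:+.1f}   score = {mixture_score(x):+.3f}")
```

Between the modes the sign flips at the midpoint, while outside the modes the score points back inwards.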

Score Functions Do Not Involve Normalizing Constants
Suppose we have an energy-based model of the form

$$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\,dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$
Computing the score function of the model does not involve Z, because Z(θ) does not depend on x:
$$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x) - \nabla_x \log Z(\theta) = \nabla_x f_\theta(x)$$
Idea: Instead of learning pθ(x), learn a model ∇x log pθ(x) of the score function of the data ∇x log pdata(x)!
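The cancellation of Z can be checked numerically. The sketch below uses a hypothetical 1D energy function f (a double well, chosen only for illustration), approximates Z with a Riemann sum, and compares finite-difference gradients of f and of log p.

```python
import math

def f(x):
    # Hypothetical unnormalized log-density, chosen for illustration only.
    return -0.25 * x ** 4 + x ** 2

def grad(fn, x, h=1e-5):
    """Central finite-difference approximation of d fn / dx."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

# Approximate Z = integral of exp(f(x)) dx with a midpoint Riemann sum on [-6, 6].
n, lo, hi = 12000, -6.0, 6.0
width = (hi - lo) / n
Z = sum(math.exp(f(lo + (i + 0.5) * width)) for i in range(n)) * width

def log_p(x):
    """Normalized log-density log p(x) = f(x) - log Z."""
    return f(x) - math.log(Z)

for x in (-1.5, 0.3, 2.0):
    print(f"x = {x:+.1f}   grad f = {grad(f, x):+.6f}   grad log p = {grad(log_p, x):+.6f}")
```

The two columns agree because log Z is a constant shift in x, so the expensive integral is never needed to evaluate the score.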

Score Function Estimation: Definition
The idea of score function estimation is to learn a model sθ(x) : ℝ^d → ℝ^d
of the score function from a dataset of sample points.
We are given a dataset D = {x1, x2, ..., xn}
Our task is to estimate ∇x log pdata(x)
Our model is a parameterized function sθ(x) : ℝ^d → ℝ^d
Our objective is that ∇x log pdata(x) ≈ sθ(x)
Score Matching via Fisher Divergences
How do we fit score functions?
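Data and model scores are vector fields; we want them to be aligned.
Formally, we use the Fisher divergence:
$$\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \| \nabla_x \log p_{\text{data}}(x) - s_\theta(x) \|_2^2 \right]$$
It’s (half) the squared Euclidean distance between the two score vectors, averaged over x ∼ pdata(x).
The Fisher divergence is zero if and only if ∇x log pdata(x) = sθ(x) for (almost) all x with pdata(x) > 0.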
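As a small illustration, the Fisher divergence can be estimated by Monte Carlo when the data score happens to be known. The sketch below assumes pdata = N(0, 1), whose score is -x, and a hypothetical one-parameter model score s_a(x) = -a x; in this toy setting the divergence equals (a - 1)^2 / 2.

```python
import random

def fisher_divergence(data_score, model_score, samples):
    """Monte Carlo estimate of 0.5 * E[(data_score(x) - model_score(x))^2] in 1D."""
    total = 0.0
    for x in samples:
        diff = data_score(x) - model_score(x)
        total += 0.5 * diff * diff
    return total / len(samples)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]

def data_score(x):
    return -x  # score of N(0, 1), known here only because the example is synthetic

for a in (0.5, 1.0, 2.0):
    estimate = fisher_divergence(data_score, lambda x, a=a: -a * x, samples)
    print(f"a = {a:.1f}   estimated Fisher divergence = {estimate:.4f}")
```

With real data this estimator is unavailable because data_score is unknown.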
A Practical Objective for Score Matching
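The Fisher divergence doesn’t give us an objective we can easily optimize.
$$\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \| \nabla_x \log p_{\text{data}}(x) - s_\theta(x) \|_2^2 \right]$$
We don’t know log pdata(x), much less its gradient.
However, via integration by parts, we can obtain an objective that is equivalent
up to a constant that does not depend on θ (Hyvarinen, 2005):
$$\mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \frac{1}{2} \| s_\theta(x) \|_2^2 + \mathrm{tr}(\nabla_x s_\theta(x)) \right]$$
This is the score matching objective. It can be further approximated via Monte Carlo:
$$\frac{1}{n} \sum_{k=1}^{n} \left[ \frac{1}{2} \| s_\theta(x_k) \|_2^2 + \mathrm{tr}(\nabla_x s_\theta(x_k)) \right]$$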
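The following is a minimal sketch of the Monte Carlo score matching objective in 1D, assuming a hypothetical Gaussian score model s(x) = -(x - mu) / var. For this model the trace term is simply -1 / var, and the objective is minimized by the sample mean and the (biased) sample variance, so no access to the data score is needed.

```python
import random

def score_matching_loss(mu, var, data):
    """Monte Carlo score matching objective for the 1D model s(x) = -(x - mu) / var."""
    total = 0.0
    for x in data:
        s = -(x - mu) / var   # model score at x
        ds_dx = -1.0 / var    # trace of the 1x1 Jacobian of s
        total += 0.5 * s * s + ds_dx
    return total / len(data)

random.seed(0)
data = [random.gauss(3.0, 2.0) for _ in range(5000)]  # synthetic data, true variance 4

# Closed-form minimizer of the objective for this particular model family.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

best = score_matching_loss(mu_hat, var_hat, data)
print(f"mu_hat = {mu_hat:.3f}, var_hat = {var_hat:.3f}, loss = {best:.4f}")
print("loss at shifted mean:   ", score_matching_loss(mu_hat + 0.5, var_hat, data))
print("loss at inflated var:   ", score_matching_loss(mu_hat, 1.5 * var_hat, data))
```

Note that the objective can be negative; only differences between parameter settings are meaningful.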

Integration by Parts: 1D Proof
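We apply integration by parts as follows. First, we expand the square (in 1D, ∇x denotes d/dx):
$$\frac{1}{2} \mathbb{E}_{x \sim p(x)} \left[ (\nabla_x \log p(x) - s_\theta(x))^2 \right]
= \underbrace{\frac{1}{2} \int_{-\infty}^{\infty} p(x) (\nabla_x \log p(x))^2 \, dx}_{\text{const}}
+ \frac{1}{2} \int_{-\infty}^{\infty} p(x) s_\theta(x)^2 \, dx
- \int_{-\infty}^{\infty} p(x) \nabla_x \log p(x) \, s_\theta(x) \, dx$$
We drop the first term, which does not depend on θ, and apply integration by parts to the third term:
$$\int_{-\infty}^{\infty} p(x) \nabla_x \log p(x) \, s_\theta(x) \, dx
= \int_{-\infty}^{\infty} p(x) \frac{\nabla_x p(x)}{p(x)} s_\theta(x) \, dx
= \int_{-\infty}^{\infty} \nabla_x p(x) \, s_\theta(x) \, dx
= \underbrace{p(x) s_\theta(x) \Big|_{-\infty}^{\infty}}_{=0} - \int_{-\infty}^{\infty} p(x) \nabla_x s_\theta(x) \, dx$$
The boundary term is zero because we assume that p(x)sθ(x) vanishes as |x| → ∞ (e.g., x has bounded support). Plugging the above
into the first expression yields the desired result:
$$\frac{1}{2} \mathbb{E}_{x \sim p(x)} \left[ (\nabla_x \log p(x) - s_\theta(x))^2 \right]
= \text{const} + \mathbb{E}_{x \sim p(x)} \left[ \frac{1}{2} s_\theta(x)^2 + \nabla_x s_\theta(x) \right]$$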

Score Matching Doesn’t Scale
We may now try to fit a score matching model:
We parameterize sθ(x) using a neural net and optimize the objective
using gradient descent:
$$\frac{1}{n} \sum_{k=1}^{n} \left[ \frac{1}{2} \| s_\theta(x_k) \|_2^2 + \mathrm{tr}(\nabla_x s_\theta(x_k)) \right]$$
However, this doesn’t scale: gradient descent optimization requires us
to compute a nested backprop pass for each diagonal element of ∇x sθ(x), i.e., d extra backprop passes for x ∈ ℝ^d.

Sliced Score Matching: Intuition
We need to find a more clever approximation to the score matching
objective:
$$\frac{1}{n} \sum_{k=1}^{n} \left[ \frac{1}{2} \| s_\theta(x_k) \|_2^2 + \mathrm{tr}(\nabla_x s_\theta(x_k)) \right]$$
Note that in low dimensions, the problem is tractable.
Therefore, we will try to optimize 1D projections of the score function.

Sliced Fisher Divergence
Sliced score matching optimizes the sliced Fisher divergence:
$$\frac{1}{2} \mathbb{E}_{v \sim q(v)} \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ (v^\top \nabla_x \log p_\theta(x) - v^\top \nabla_x \log p_{\text{data}}(x))^2 \right]$$
where q(v) is a distribution over projection vectors v that we choose.
As before, this is not practical because we don’t know pdata. However,
via integration by parts, we can obtain an equivalent objective (up to a constant that does not depend on θ):
$$\mathbb{E}_{v \sim q(v)} \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ v^\top \nabla_x^2 \log p_\theta(x) \, v + \frac{1}{2} (v^\top \nabla_x \log p_\theta(x))^2 \right]$$
Each term in the above objective can be computed efficiently.
q(v) can be multivariate standard normal or Rademacher.
The resulting estimator is consistent and asymptotically normal.
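The reason random projections help is that, for Rademacher v, the quadratic form v^T A v has expectation tr(A), so the trace can be estimated from a few projections instead of d diagonal entries. The sketch below checks this identity exactly by enumerating all sign vectors for a small made-up matrix A, which stands in for the Jacobian of the score model at one point.

```python
import itertools

def quad_form(A, v):
    """Compute v^T A v for a square matrix A given as a list of rows."""
    d = len(v)
    return sum(v[i] * A[i][j] * v[j] for i in range(d) for j in range(d))

# A stands in for the Jacobian of s_theta at a single x; the values are made up.
A = [[2.0, -1.0, 0.5],
     [0.3, 1.0, 4.0],
     [-2.0, 0.7, 3.0]]
trace = sum(A[i][i] for i in range(len(A)))

# Enumerate all 2^d Rademacher vectors so the expectation over v is exact.
vs = list(itertools.product((-1.0, 1.0), repeat=len(A)))
avg = sum(quad_form(A, v) for v in vs) / len(vs)

print("trace(A)        =", trace)
print("E_v[v^T A v]    =", avg)
```

In practice one would sample only a handful of vectors v per data point; with automatic differentiation, each v^T ∇x sθ(x) v costs roughly one extra backward pass regardless of d.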
Sample Generation: Intuition
Next, we want to generate samples from a trained score function. We can
do this by following the gradient of the log-density of the data distribution.
Intuitively, we start with noise z and we follow this gradient
until we generate an accurate x.
Langevin Dynamics
Langevin dynamics is an MCMC sampling process that allows us to sample
from pθ(x) using only its score ∇x log pθ(x).
Initialize x0 from a noise distribution
For steps t = 1, 2, . . . , T :
  Sample a noise variable εt ∼ N(0, I)
  $$x_t = x_{t-1} + \frac{\alpha_t}{2} \nabla_x \log p_\theta(x_{t-1}) + \sqrt{\alpha_t} \, \epsilon_t$$
This process is guaranteed to produce samples from pθ(x) when the step
size αt → 0 and T → ∞.
We can use this process for sampling from a score function
sθ(x) ≈ ∇x log pθ(x) by plugging sθ into the above algorithm.
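The update rule above can be sketched in a few lines. This example assumes a 1D target N(0, 1), whose score -x is known in closed form and stands in for a learned sθ; the constant step size, number of steps, and uniform initialization are arbitrary illustrative choices.

```python
import math
import random

def langevin_sample(score, x0, step_size, num_steps, rng):
    """Run unadjusted Langevin dynamics in 1D and return the final state."""
    x = x0
    for _ in range(num_steps):
        eps = rng.gauss(0.0, 1.0)  # fresh standard normal noise at every step
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * eps
    return x

def target_score(x):
    return -x  # score of N(0, 1); a stand-in for a trained score model

rng = random.Random(0)
samples = [
    langevin_sample(target_score, rng.uniform(-5.0, 5.0), 0.05, 500, rng)
    for _ in range(1000)
]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Because the step size is finite, the chain targets the distribution only approximately; the sample variance ends up slightly above 1, and the bias shrinks as the step size decreases.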

Score Function Modeling
Thus, our high-level strategy is to learn sθ(x) ≈ ∇x log pdata(x) from a set of
samples, and then generate new x via Langevin dynamics.

Manifold Hypothesis
The above sampling approach faces challenges arising from the manifold
structure of the data.
The manifold hypothesis states that most of our data lives on a
low-dimensional manifold (e.g., a subspace) in a high-dimensional space.
This hypothesis can be verified empirically by examining the PCA
reconstructions on typical datasets.
This is a problem as ∇x log pdata(x) is not defined where pdata(x) = 0, i.e., off the manifold.

Learning and Sampling Scores in Low Density Regions
Thus, learning score functions and generating samples from them is
challenging in low density regions.
We don’t have enough samples to learn sθ in low density regions.
Even if we knew ∇x log pθ exactly, it may be too weak a signal in these regions to generate good samples.

Learning and Sampling Scores in Low Density Regions
One solution to this problem is to add Gaussian noise to the data (i.e.,
smooth the data).
The score function is now well-defined everywhere.
However, it’s defined on a different (noisy) distribution.
There is a tradeoff: smooth more and get a better score estimate vs. smooth less
and better approximate the data distribution.
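To illustrate the tradeoff, the sketch below computes the exact score of a tiny dataset after Gaussian smoothing, i.e., of the mixture (1/n) Σ N(x; x_i, σ²). The two data points and the noise levels are made-up values for illustration.

```python
import math

def smoothed_score(x, data, sigma):
    """Score of the empirical distribution of `data` convolved with N(0, sigma^2), in 1D."""
    # Responsibilities w_i proportional to N(x; x_i, sigma^2); subtracting the
    # max logit avoids underflow when x is far from every data point.
    logits = [-0.5 * ((x - xi) / sigma) ** 2 for xi in data]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Weighted average of the per-component scores (x_i - x) / sigma^2.
    return sum(w * (xi - x) for w, xi in zip(weights, data)) / (total * sigma ** 2)

data = [-1.0, 1.0]  # a made-up dataset concentrated on two points
for sigma in (0.1, 1.0, 3.0):
    scores = [smoothed_score(x, data, sigma) for x in (-4.0, 0.5, 4.0)]
    print(f"sigma = {sigma:.1f}   scores at x = -4, 0.5, 4: "
          + ", ".join(f"{s:+.3f}" for s in scores))
```

Far from the data, the smoothed score is finite and points back towards the data for every sigma, but a larger sigma gives a smoother field that describes a blurrier distribution.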
Multi-Scale Noise
In order to trade off approximation accuracy and score function stability,
we want to add noise at multiple scales.
We train using both small and large amounts of noise.
At the initial stages of sampling, we use a larger noise level.
We gradually reduce noise over the course of sampling.
Noise Conditional Score Networks
We implement this strategy by training a score function model sθ(x, σ)
that is conditioned on the noise level σ.
The same model is trained for all σ.
Parameters are shared across noise levels.

Annealed Langevin Dynamics: Intuition
We can run Langevin Dynamics with decreasing noise levels. At each new
noise level, we start where we left off with the previous noise level.

Annealed Langevin Dynamics
Annealed Langevin dynamics is an MCMC sampling process that allows us
to sample from pθ(x) using its score ∇x log pθ(x).
Initialize x0 from a noise distribution
For noise levels σℓ ∈ {σ1, σ2, ..., σL}, ordered from largest to smallest:
  Set step size to αt = βt · σℓ / σL
  For steps t = 1, 2, . . . , T :
    Sample a noise variable εt ∼ N(0, I)
    $$x_t = x_{t-1} + \frac{\alpha_t}{2} s_\theta(x_{t-1}, \sigma_\ell) + \sqrt{\alpha_t} \, \epsilon_t$$
  Set x0 = xT : start the next run at the end of the last run.
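A minimal sketch of this loop in 1D follows. It assumes N(0, 1) data, so the score of the data smoothed at level σ is -x / (1 + σ²); this closed form stands in for a trained sθ(x, σ). The noise levels, the constant base step (playing the role of βt), and the number of steps per level are hypothetical choices, and the step size follows the σℓ / σL scaling written above.

```python
import math
import random

def perturbed_score(x, sigma):
    # Stand-in for a trained s_theta(x, sigma): exact score of N(0, 1 + sigma^2),
    # i.e., N(0, 1) data smoothed with Gaussian noise of scale sigma.
    return -x / (1.0 + sigma ** 2)

def annealed_langevin(score, sigmas, base_step, steps_per_level, rng):
    """Run Langevin dynamics over decreasing noise levels; return the final state."""
    x = rng.uniform(-5.0, 5.0)  # initialize from a simple noise distribution
    sigma_last = sigmas[-1]
    for sigma in sigmas:  # sigmas are ordered from largest to smallest
        step = base_step * sigma / sigma_last
        for _ in range(steps_per_level):
            eps = rng.gauss(0.0, 1.0)
            x = x + 0.5 * step * score(x, sigma) + math.sqrt(step) * eps
        # x carries over, so the next level starts where this one ended
    return x

rng = random.Random(0)
sigmas = [3.0, 1.0, 0.3, 0.1]
samples = [annealed_langevin(perturbed_score, sigmas, 0.01, 200, rng) for _ in range(1000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The early, high-noise levels move the chain quickly across the space with large steps, while the final low-noise level refines samples towards the (lightly smoothed) data distribution.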

Score-Based Generative Models: Samples
Score-based generative models can generate high-resolution samples that
approach (and in recent versions match) the quality of GANs.

Diffusion Models: Intuition
The intuition behind diffusion models is to define a process for gradually
turning data into noise, and to learn the inverse of this process.
The noising (forward) process q(xt | xt−1) is defined by the user.
The denoising (reverse) process q(xt−1 | xt) is unknown.
We approximate it by learning a model pθ(xt−1 | xt) of q(xt−1 | xt).

Forward Diffusion (Noising) Process
To define a diffusion model, we start by defining the noising or forward
diffusion process.
We start with data samples x0 ∼ q(x0)
We run a Markov chain that gradually adds noise to the data,
producing a sequence of increasingly noisy samples x1, ..., xT.
At each step t, we sample xt from the following Markov operator:
$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)$$
where βt ∈ (0, 1), βt → 1.
At the end, xT is (approximately) distributed as a standard Gaussian N(0, I).
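The noising chain is easy to simulate. The sketch below runs it in 1D from a fixed starting point; the linear schedule for βt, the number of steps, and the starting value x0 = 5 are assumptions made for illustration only.

```python
import math
import random

def forward_diffusion(x0, betas, rng):
    """Simulate x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t) in 1D; return [x_0, ..., x_T]."""
    xs = [x0]
    for beta in betas:
        eps = rng.gauss(0.0, 1.0)
        xs.append(math.sqrt(1.0 - beta) * xs[-1] + math.sqrt(beta) * eps)
    return xs

rng = random.Random(0)
T = 200
# Hypothetical linear noise schedule from 0.001 to 0.1.
betas = [0.001 + (0.1 - 0.001) * t / (T - 1) for t in range(T)]

# Every chain starts from the same "data point" x0 = 5.0.
finals = [forward_diffusion(5.0, betas, rng)[-1] for _ in range(2000)]

mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
print(f"x_T sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Even though every chain starts at the same point, the final states look like draws from N(0, 1): the scaling by sqrt(1 - βt) shrinks the signal while the added noise keeps the total variance bounded.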

Reverse Diffusion (Denoising) Process
Next, we want to learn a model that undoes the noising process.
The ideal denoising process is the inverse of the above Markov chain.
At time t we sample from q(xt−1 | xt)
We don’t know this process. Therefore, we will try to learn it from
running the noising process.
We define our model as a distribution
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
with the following structure:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
To generate from this model, we sample noise xT, and sample from
pθ(xt−1 | xt) until we get an image x0.

Forward and Backward Diffusion: Example
Here is the result of running forward (top) and backward (bottom)

diffusion processes on a 2D spiral dataset:

Learning Diffusion Models via Max Likelihood
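We may learn diffusion models by maximizing an evidence lower bound on
the likelihood log pθ(x0) of a data point x0:
$$\begin{aligned}
-\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + D_{KL}(q(x_{1:T} \mid x_0) \,\|\, p_\theta(x_{1:T} \mid x_0)) \\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \right] \\
&= -\log p_\theta(x_0) + \mathbb{E}_q \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0) \right] \\
&= \mathbb{E}_q \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right]
\end{aligned}$$
Let
$$L_{\text{VLB}} = \mathbb{E}_{q(x_{0:T})} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right] \geq -\mathbb{E}_{q(x_0)} \log p_\theta(x_0)$$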

Transforming the ELBO Bound
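We can use the conditional independence structure of p and q to transform the ELBO
into a sum of simpler terms:
$$\begin{aligned}
L_{\text{VLB}} &= \mathbb{E}_{q(x_{0:T})} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right]
= \mathbb{E}_q \left[ \log \frac{\prod_{t=1}^{T} q(x_t \mid x_{t-1})}{p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)} \right] \\
&= \mathbb{E}_q \left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right] \\
&= \mathbb{E}_q \left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \cdot \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right] \\
&= \mathbb{E}_q \left[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} + \log \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right] \\
&= \mathbb{E}_q \left[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \right] \\
&= \mathbb{E}_q \Big[ \underbrace{D_{KL}(q(x_T \mid x_0) \,\|\, p_\theta(x_T))}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 \mid x_1)}_{L_0} \Big]
\end{aligned}$$
The third line uses the Markov property and Bayes’ rule, q(xt | xt−1) = q(xt | xt−1, x0) = q(xt−1 | xt, x0) q(xt | x0) / q(xt−1 | x0).
In the fourth line, the sum of the log-ratios log q(xt | x0) − log q(xt−1 | x0) telescopes to log q(xT | x0) − log q(x1 | x0).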

A More Convenient ELBO
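This transformation yields the following ELBO:
$$L_{\text{VLB}} = L_T + L_{T-1} + \cdots + L_0$$
where
$$L_T = D_{KL}(q(x_T \mid x_0) \,\|\, p_\theta(x_T))$$
$$L_t = D_{KL}(q(x_t \mid x_{t+1}, x_0) \,\|\, p_\theta(x_t \mid x_{t+1})) \quad \text{for } 1 \leq t \leq T - 1$$
$$L_0 = -\log p_\theta(x_0 \mid x_1)$$
It’s not hard to show that q(xt | x0) has an analytic form:
$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)$$
where αt = 1 − βt and ᾱt = ∏_{i=1}^{t} αi, and from that we can derive that
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t I)$$
Since every KL term compares two Gaussians, each can be computed in closed form.
This gives us all the ingredients to learn these models by gradient descent.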
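The analytic form of q(xt | x0) can be verified by propagating the mean coefficient and variance through the one-step kernel and comparing against the closed form. The schedule below is a hypothetical choice; the recursion itself follows from q(xt | xt−1) = N(√(1 − βt) xt−1, βt I).

```python
import math

T = 50
betas = [0.01 + 0.19 * t / (T - 1) for t in range(T)]  # hypothetical schedule
alphas = [1.0 - b for b in betas]

mean_coef, var, alpha_bar = 1.0, 0.0, 1.0  # q(x_0 | x_0) is a point mass at x_0
history = []
for alpha, beta in zip(alphas, betas):
    # One noising step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * eps
    mean_coef = math.sqrt(alpha) * mean_coef  # coefficient multiplying x_0
    var = alpha * var + beta                  # variance of x_t given x_0
    alpha_bar *= alpha                        # running product of alpha_1 ... alpha_t
    history.append((mean_coef, var, alpha_bar))

for t in (1, 10, 25, 50):
    m, v, ab = history[t - 1]
    print(f"t = {t:2d}   recursion: ({m:.6f}, {v:.6f})   "
          f"closed form: ({math.sqrt(ab):.6f}, {1.0 - ab:.6f})")
```

This is what makes training efficient: xt can be sampled for any t directly from x0, without simulating the intermediate steps.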

Connection to Score-Based Models
Diffusion and score-based models are closely related:

We may train score-based models using an alternative denoising score
matching objective (Vincent, 2011).
There also exists an alternative reparameterization of the diffusion
model objective in which we try to learn the “noise” zt = xt+1 − xt
instead of the parameters of the Gaussian.
These two objectives are almost identical to each other. See Ho et
al., 2020 for details.
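One way to see the connection is a short worked derivation from the analytic form of q(xt | x0). Here ε denotes the total standard Gaussian noise that takes x0 to xt; this symbol is introduced for illustration and is a different parameterization from the per-step difference zt mentioned above.

```latex
% Reparameterize a sample from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I):
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% The score of this Gaussian with respect to x_t is a rescaled copy of the noise:
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar{\alpha}_t} \, x_0}{1 - \bar{\alpha}_t}
  = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
```

Under this parameterization, a network that predicts ε from xt also gives an estimate of the score of the noised data, up to the factor −1/√(1 − ᾱt).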

Score-Based Generative Models: Pros and Cons
Pros:
1 Make very few modeling assumptions
2 Can train energy-based models without computing the normalizing
constant
3 Can produce very high quality samples (matching GANs)
Cons:
1 Sampling is still quite slow (need to run an MCMC chain)
2 Do not always provide likelihoods (though we get them for certain
models)
