Download as pdf or txt
Download as pdf or txt
You are on page 1of 38

Score-Based Generative Models

Volodymyr Kuleshov

Cornell Tech

Lecture 12

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 1 / 38


Announcements

Assignment 3 has been released


Please make sure to have signed up for a presentation slot by the
end of the week.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 2 / 38


Recap So Far

We have seen several generative model families so far:


1 Autoregressive and Latent Variable Models:

Can perform generation, representation learning, outlier detection, etc.


Max-likelihood may not give best samples, imposes model constraints
2 Generative Adversarial Networks:
Can optimize arbitrary divergences; produces good samples
Training is difficult and unstable; no clear stopping criterion
3 Energy-Based Models:
Make very little modeling assumptions
Maximum likelihood training with MCMC is slow
This lecture: can we train energy-based models without slow MCMC?
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 3 / 38
Energy-Based Models

Energy-based models specify distributions of the form


1 1
pθ (x) = R exp(fθ (x)) = exp(fθ (x))
exp(fθ (x))dx Z (θ)

Pros:
extreme flexibility: can use pretty much any function fθ (x) you want
Cons: Computing Z (θ) is generally intractable. As a result:
Sampling from pθ (x) is hard
Evaluating and optimizing likelihood pθ (x) is hard (learning is hard)
No feature learning (but can add latent variables)
What if we could train models without using Z ?

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 4 / 38


Lecture Outline

1 Score Functions
Motivation
Definitions
Score estimation
2 Score Matching
Fisher Divergences
Score Matching
Sliced Score Estimation
3 Sample Generation
Langevin Dynamics
Manifold Hypothesis
Gaussian Smoothing
4 Connections to Diffusion Models

Figures and contents adapted from Stefano Ermon and Lily Weng
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 5 / 38
The (Stein) Score Function: Definition
Given a probability density p(x), its (Stein) score function is defined as
∇x log p(x)

This is the gradient of the density with respect to the input.


Note that this is different from the gradient of the density with respect to
the parameters (which is that we normally use in gradient descent).
The score function points in the direction of increasing model probability.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 6 / 38


The (Stein) Score Function: 1D Example

On the left is a mixture of Gaussians. One the right is its score function in 1D.

Note that the score function is negative (points towards the left) near the left
mode and is positive (points towards the right) near the right mode.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 7 / 38


Score Functions Do Not Involve Normalizing Constants

Suppose we have an energy-based model of the form


1 1
pθ (x) = R exp(fθ (x)) = exp(fθ (x))
exp(fθ (x))dx Z (θ)

Computing a score function of the model does not involve Z :

∇x log pθ (x) = ∇x fθ (x) − ∇x log Z (θ) = ∇x fθ (x)

Idea: Instead of learning pθ (x), learn a model ∇x log pθ (x) of the


score function of the data ∇x log pdata (x)!

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 8 / 38


Score Function Estimation: Definition
The idea of score function estimation is to learn a model sθ (x) : Rd → Rd
of the score function from a dataset of sample points.

We are given a dataset D = {x1 , x2 , ..., xn }


Our task is to estimate ∇x log pdata (x)
Our model is a parameterized function sθ (x) : Rd → Rd
Our objective is that ∇x log pdata (x) ≈ sθ (x)
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 9 / 38
Lecture Outline

1 Score Functions
Motivation
Definitions
Score estimation
2 Score Matching
Fisher Divergences
Score Matching
Sliced Score Estimation
3 Sample Generation
Langevin Dynamics
Manifold Hypothesis
Gaussian Smoothing
4 Connections to Diffusion Models

Figures and contents adapted from Stefano Ermon and Lily Weng
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 10 / 38
Score Matching via Fisher Divergences
How do we fit score functions?
Data and model scores are vector fields; we want them to be aligned:

Formally, we use the Fisher divergence:


1
Ex∼pdata (x) ||∇x log pdata (x) − sθ (x)||22
 
2

It’s the average Euclidean distance between two vectors over all x
The Fisher divergence is zero if and only if ∇x log pdata (x) ≈ sθ (x)
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 11 / 38
A Practical Objective for Score Matching
The Fisher divergence doesn’t give us an objective we can easily optimize.
1
Ex∼pdata (x) ||∇x log pdata (x) − sθ (x)||22
 
2

We don’t know log pdata (x), and much less its gradient.
However, via a change of variables trick, we can obtain an equivalent
objective (Hyvarinen, 2005)
 
1 2
Ex∼pdata (x) ||sθ (x)||2 + tr (∇x sθ (x))
2

This is the score matching objective. It can be further approximated


via Monte-Carlo
n  
1X 1 2
||sθ (xk )||2 + tr (∇x sθ (xk ))
n 2
i=1

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 12 / 38


Integration by Parts: 1D Proof

We apply integration by parts as follows:


1 h i
Ex∼p(x) ||∇x log p(x) − sθ (x)||22
2
1 ∞ 1 ∞
Z Z Z ∞
= p(x)(∇x log p(x))2 dx + p(x)sθ (x)2 dx − p(x)∇x log p(x)sθ (x)2 dx
2 −∞ 2 −∞ −∞
| {z }
const

We drop the first term and apply integration parts to the second term:
Z ∞ Z ∞ Z ∞
∇x p(x)
p(x)∇x log p(x)sθ (x)2 dx = p(x) sθ (x)2 dx = ∇x p(x)sθ (x)2 dx
−∞ −∞ p(x) −∞
Z ∞
= p(x)sθ (x)|∞
−∞ − p(x)∇x sθ (x)dx
| {z } −∞
=0

The first term is zero because we assume x has bounded support. Plugging the above
into the first expression yields the desired result.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 13 / 38


Score Matching Doesn’t Scale

We may now try to fit a score matching model:


We parameterize sθ (x) using a neural net and optimize the objective
using gradient descent:
n  
1X 1
||sθ (xk )||22 + tr (∇x sθ (xk ))
n 2
i=1

However, this doesn’t scale: gradient descent optimization requires us


to compute nested backrop on each diagonal element of ∇x sθ (x):

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 14 / 38


Sliced Score Matching: Intuition
We need to find more clever approximation to the score matching
objective.
n  
1X 1 2
||sθ (xk )||2 + tr (∇x sθ (xk ))
n 2
i=1

Note that in low-dimensions, the problem is tractable.


Therefore, we will try to optimize 1D projections of the score function

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 15 / 38


Sliced Fisher Divergence
Sliced score matching optimized the sliced Fisher divergence:
1 h i
Ev∼q(v) Ex∼pdata (x) (v> ∇x log pθ (x) − v> ∇x log pdata (x))2
2

Where q(v) is a distribution over projection vectors v that we choose.


As before, this is not practical because we don’t know pdata . However,
via a change of variables trick, we can obtain an equivalent objective:
 
> 2 1 > 2
Ev∼q(v) Ex∼pdata (x) v ∇x log pθ (x) + (v ∇x log pθ (x))
2

Each term in the above objective can be computed efficiently.


q(v) can be multivariate standard normal or Rademacher.
Objective features consistency and asymptotic normality.
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 16 / 38
Lecture Outline

1 Score Functions
Motivation
Definitions
Score estimation
2 Score Matching
Fisher Divergences
Score Matching
Sliced Score Estimation
3 Sample Generation
Langevin Dynamics
Manifold Hypothesis
Gaussian Smoothing
4 Connections to Diffusion Models

Figures and contents adapted from Stefano Ermon and Lily Weng
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 17 / 38
Sample Generation: Intuition
Next, we want to generate samples from a trained score function. We can
do this by following the gradient of the data distribution

Intuitively, we start with noise z and we follow the gradient of the data
distribution until we generate accurate x
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 18 / 38
Langevin Dynamics

Langevin dynamics is an MCMC sampling process that allows us to sample


from pθ (x) using only its score ∇x log pθ (x).
Initialize x0 from a noise distribution
For steps t = 1, 2, . . . , T :
Sample a noise variable t ∼ N (0, I)

xt = xt−1 + α2t ∇x log pθ (x) + αt t

This process is guaranteed to produce samples from pθ (x) when the step
size αt → 0 and T → ∞.

We can use this process for sampling from a score function


sθ (x) ≈ ∇x log pθ (x) by plugging sθ into the above algorithm.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 19 / 38


Score Function Modeling

Thus, our high-level strategy is to learn sθ (x) ≈ ∇x log pθ (x) from a set of
samples, and then generate new x via Langevin dynamics.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 20 / 38


Manifold Hypothesis
The above sampling approach faces challenges arising from the manifold
structure of the data.
The manifold approach states that most of our data lives on a low
dimensional manifold (subspace) in a high-dimensional space.

This hypothesis can be verified empirically by examining the PCA


reconstructions on typical datasets.

This is a problem as ∇x log pdata (x) is not defined where pdata (x) = 0

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 21 / 38


Learning and Sampling Scores in Low Density Regions

Thus, learning scores functions and generating samples from them is


challenging in low density regions will be hard.

We don’t have enough samples to learn sθ in low density regions


Even if we knew ∇x log pθ , it may be too weak to generate samples.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 22 / 38


Learning and Sampling Scores in Low Density Regions
One solution to this problem is to add Gaussian noise to the data (i.e.,
smooth the data):

The score function is now well-defined everywhere.


However, it’s defined on a different distribution.
There is a tradeoff: smooth more and get better score vs. smooth less
and better approximate the data distribution
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 23 / 38
Multi-Scale Noise
In order to trade off approximation accuracy and score function stability,
we want to add noise at multiple scales

We train using both small and large amounts of noise.


At the initial stages of sampling, we use a larger loss.
We gradually reduce noise over the course of sampling.
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 24 / 38
Noise Conditional Score Networks

We implement this strategy by training a score function model sθ (x, σ)


that is conditioned on the noise σ.

The same model is trained for all σ.


Parameters are shared across noise levels

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 25 / 38


Annealed Langevin Dynamics: Intuition

We can run Langevin Dynamics with decreasing noise levels. At each new
noise level, we start where we left off with the previous noise level.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 26 / 38


Annealed Langevin Dynamics

Annealed Langevin dynamics is an MCMC sampling process that allows us


to sample from pθ (x) using its score ∇x log pθ (x).
Initialize x0 from a noise distribution
For noise levels σ` ∈ {σ1 , σ2 , ..., σL } :
Set step size to αt = βt · σσ`L
For steps t = 1, 2, . . . , T :
Sample a noise variable t ∼ N (0, I)

xt = xt−1 + α2t sθ (x, σ` ) + αt t
Set x0 = xT : start next run at the end of last run.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 27 / 38


Score-Based Generative Models: Samples

Score-based generative models can generate high-resolution samples that


approach (and in recent version match) the quality of GANs:

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 28 / 38


Lecture Outline

1 Score Functions
Motivation
Definitions
Score estimation
2 Score Matching
Fisher Divergences
Score Matching
Sliced Score Estimation
3 Sample Generation
Langevin Dynamics
Manifold Hypothesis
Gaussian Smoothing
4 Connections to Diffusion Models

Figures and contents adapted from Stefano Ermon and Lily Weng
Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 29 / 38
Diffusion Models: Intuition
The intuition behind diffusion models is to define a process for gradually
turning data to noise, and learning the inverse of this process.

Noise process q(xt |xt−1 ) is defined by user.


Forward process q(xt−1 |xt ) is unknown.
We approximate it by learning a model p(xt−1 |xt ) of q(xt−1 |xt ).

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 30 / 38


Forward Diffusion (Noising) Process

To define a diffusion model, we start by defining the noising or forward


diffusion process.
We start with data samples x0 ∼ q(x0 )
We run a Markov chain that gradually adds noise to the data,
producing sequence of increasingly noisy samples x1 , ...xT .
At each step t, we sample xt from the following Markov operator:
p
q(xt |xt−1 ) = N (xt ; 1 − βt xt−1 , βt I)

where βt ∈ (0, 1), βt → 1.


At the end, xT is a standard Gaussian distribution N (0, I)

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 31 / 38


Forward Diffusion (Noising) Process

Next, we want to learn a model that undoes the noising process.


The ideal denoising process is the inverse of the above Markov chain.
At time t we sample from q(xt−1 |xt )
We don’t know this process. Therefore, we will try to learn it from
running the noising process.
We define out modelQT as a distribution
pθ (x0:T ) = p(xT ) t=1 pθ (xt−1 |xt ) with the following structure:

pθ (xt−1 |xt ) = N (xt−1 ; µθ (xt , t), Σθ (xt , t))

To generate from this model, we sample noise xT , and sample from


pθ (xt−1 |xt ) until we get an image x0 .

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 32 / 38


Forward and Backward Diffusion: Example

Here is the result of running forward (top) and backward (bottom)


diffusion processes on a 2D spiral dataset:

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 33 / 38


Learning Diffusion Models via Max Likelihood

We may learn diffusion models by maximizing an evidence lower bound on


the likelihood log p(x0 ) of a data point x0 :

− log pθ (x0 ) ≤ − log pθ (x0 ) + DKL (q(x1:T |x0 )kpθ (x1:T |x0 ))
h q(x1:T |x0 ) i
= − log pθ (x0 ) + Ex1:T ∼q(x1:T |x0 ) log
pθ (x0:T )/pθ (x0 )
h q(x1:T |x0 ) i
= − log pθ (x0 ) + Eq log + log pθ (x0 )
pθ (x0:T )
h q(x1:T |x0 ) i
= Eq log
pθ (x0:T )
h q(x1:T |x0 ) i
Let LVLB = Eq(x0:T ) log ≥ −Eq(x0 ) log pθ (x0 )
pθ (x0:T )

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 34 / 38


Transforming the ELBO Bound
We can use the conditional independence structure of p and q to transform the ELBO
into a sum of simpler terms:
QT
q(x1:T |x0 ) i t=1 q(xt |xt−1 )
h h i
LVLB = Eq(x0:T ) log = Eq log QT
pθ (x0:T ) pθ (xT ) t=1 pθ (xt−1 |xt )
T
h X q(xt |xt−1 ) q(x1 |x0 ) i
= Eq − log pθ (xT ) + log + log
t=2
pθ (xt−1 |xt ) pθ (x0 |x1 )
T  q(x |x , x )
h X t−1 t 0 q(xt |x0 )  q(x1 |x0 ) i
= Eq − log pθ (xT ) + log · + log
t=2
pθ (xt−1 |xt ) q(xt−1 |x0 ) pθ (x0 |x1 )
T
h X q(xt−1 |xt , x0 ) q(xT |x0 ) q(x1 |x0 ) i
= Eq − log pθ (xT ) + log + log + log
t=2
pθ (xt−1 |xt ) q(x1 |x0 ) pθ (x0 |x1 )
T
h q(xT |x0 ) X q(xt−1 |xt , x0 ) i
= Eq log + log − log pθ (x0 |x1 )
pθ (xT ) t=2
pθ (xt−1 |xt )
T
X
= Eq [DKL (q(xT |x0 ) k pθ (xT )) + DKL (q(xt−1 |xt , x0 ) k pθ (xt−1 |xt )) − log pθ (x0 |x1 )]
| {z } t=2
| {z }| {z }
LT Lt−1 L0

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 35 / 38


A More Convenient ELBO

This transformation yields the following ELBO:

LVLB = LT + LT −1 + · · · + L0
where LT = DKL (q(xT |x0 ) k pθ (xT ))
Lt = DKL (q(xt |xt+1 , x0 ) k pθ (xt |xt+1 )) for 1 ≤ t ≤ T − 1
L0 = − log pθ (x0 |x1 )

It’s not hard to show that q(xt |x0 ) has an analytic form:

q(xt |x0 ) = N (xt ; ᾱt x0 , (1 − ᾱt )I)
QT
where αt = 1 − βt and ᾱt = i=1 αi and from that we can derive that

q(xt−1 |xt , x0 ) = N (xt−1 ; µ̃(xt , x0 ), β̃t I)

This gives us all the ingredients to learn these models by gradient descent.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 36 / 38


Connection to Score-Based Models

Diffusion and score-based models are closely related:


We may train score-based models using an alternative denoised score
matching objective (Vincent et al., 2011)
There also exists an alternative reparameterization of the diffusion
model objective in which we try to learn the ”noise” zt = xt+1 − xt
instead of the parameters of the Gaussian.
These two objectives are almost identical to each other. See Ho et
al., 2020 for details.

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 37 / 38


Score-Based Generative Models: Pros and Cons

Pros:
1 Make very few model assumptions
2 Can train energy-based models without manipulating normalizing
constant
3 Can produce very high quality samples (matching GANs)
Cons:
1 Sampling is still quite slow (need to run MCMC chain)
2 Do not always provide likelihoods (though we get them for certain
models)

Volodymyr Kuleshov (Cornell Tech) Deep Generative Models, 2022 Lecture 12 38 / 38

You might also like