
Normalizing Flows: An Introduction and Review of Current Methods

Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker, Member, IEEE

The authors are with Borealis AI, Montreal H2S 3H1, Canada. E-mail: {ivan.kobyzev, simon.prince}@borealisai.com, mab@eecs.yorku.ca. Manuscript received 8 Dec. 2019; revised 21 Apr. 2020; accepted 1 May 2020; date of publication 7 May 2020; date of current version 1 Oct. 2021. (Corresponding author: Ivan Kobyzev.) Recommended for acceptance by B. Kingsbury. Digital Object Identifier no. 10.1109/TPAMI.2020.2992934

Abstract—Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation
can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the
construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review
current state-of-the-art literature, and identify open questions and promising future directions.

Index Terms—Generative models, normalizing flows, density estimation, variational inference, invertible neural networks

1 INTRODUCTION
A major goal of statistics and machine learning has been to model a probability distribution given samples drawn from that distribution. This is an example of unsupervised learning and is sometimes called generative modelling. Its importance derives from the relative abundance of unlabelled data compared to labelled data. Applications include density estimation, outlier detection, prior construction, and dataset summarization.

Many methods for generative modeling have been proposed. Direct analytic approaches approximate observed data with a fixed family of distributions. Variational approaches and expectation maximization introduce latent variables to explain the observed data. They provide additional flexibility but can increase the complexity of learning and inference. Graphical models [59] explicitly model the conditional dependence between random variables. Recently, generative neural approaches have been proposed including generative adversarial networks (GANs) [33] and variational auto-encoders (VAEs) [54].

GANs and VAEs have demonstrated impressive performance on challenging tasks such as learning distributions of natural images. However, several issues limit their application in practice. Neither allows for exact evaluation of the probability density of new points. Furthermore, training can be challenging due to a variety of phenomena including mode collapse, posterior collapse, vanishing gradients and training instability [11], [82].

Normalizing Flows (NF) are a family of generative models with tractable distributions where both sampling and density evaluation can be efficient and exact. Applications include image generation [41], [57], noise modelling [1], video generation [60], audio generation [27], [53], [77], graph generation [65], reinforcement learning [67], [70], [93], computer graphics [69], and physics [51], [58], [71], [104], [105].

There are several survey papers for VAEs [55] and GANs [17], [100]. This article aims to provide a comprehensive review of the literature around Normalizing Flows for distribution learning. Our goals are to 1) provide context and explanation to enable a reader to become familiar with the basics, 2) review the current literature, and 3) identify open questions and promising future directions. Since this article was first made public, an excellent complementary treatment has been provided by Papamakarios et al. [75]. Their article is more tutorial in nature and provides many details concerning implementation, whereas our treatment is more formal and focuses mainly on the families of flow models.

In Section 2, we introduce Normalizing Flows and describe how they are trained. In Section 3 we review constructions for Normalizing Flows. In Section 4 we describe datasets for testing Normalizing Flows and discuss the performance of different approaches. Finally, in Section 5 we discuss open problems and possible research directions.

2 BACKGROUND

Normalizing Flows were popularised by Rezende and Mohamed [78] in the context of variational inference and by Dinh et al. [19] for density estimation. However, the framework was previously defined in Tabak and Vanden-Eijnden [89] and Tabak and Turner [88], and explored for clustering and classification [2], and density estimation [61], [80].

A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings. The density of a sample can be evaluated by transforming it back to the original simple distribution and then computing the product of i) the density of the inverse-transformed sample under this distribution and ii) the associated change in volume induced by the sequence of inverse transformations. The change in volume is the product of the absolute values of the determinants of the Jacobians for each transformation, as required by the change of variables formula.
The result of this approach is a mechanism to construct new families of distributions by choosing an initial density and then chaining together some number of parameterized, invertible and differentiable transformations. The new density can be sampled from (by sampling from the initial density and applying the transformations) and the density at a sample (i.e., the likelihood) can be computed as above.

Fig. 1. Change of variables (Equation (1)). Top-left: the density of the source p_Z. Top-right: the density function of the target distribution p_Y(y). There exists a bijective function g, such that p_Y = g_* p_Z, with inverse f. Bottom-left: the inverse function f. Bottom-right: the absolute Jacobian (derivative) of f.

2.1 Basics

Let Z ∈ R^D be a random variable with a known and tractable probability density function p_Z : R^D → R. Let g be an invertible function and Y = g(Z). Then using the change of variables formula, one can compute the probability density function of the random variable Y:

    p_Y(y) = p_Z(f(y)) |det Df(y)|
           = p_Z(f(y)) |det Dg(f(y))|^{-1},    (1)

where f is the inverse of g, Df(y) = ∂f/∂y is the Jacobian of f and Dg(z) = ∂g/∂z is the Jacobian of g. This new density function p_Y(y) is called a pushforward of the density p_Z by the function g and denoted by g_* p_Z (Fig. 1).

In the context of generative models, the above function g (a generator) "pushes forward" the base density p_Z (sometimes referred to as the "noise") to a more complex density. This movement from base density to final complicated density is the generative direction. Note that to generate a data point y, one can sample z from the base distribution and then apply the generator: y = g(z).

The inverse function f moves (or "flows") in the opposite, normalizing direction: from a complicated and irregular data distribution towards the simpler, more regular or "normal" form of the base measure p_Z. This view is what gives rise to the name "normalizing flows", as f is "normalizing" the data distribution. This term is doubly accurate if the base measure p_Z is chosen as a Normal distribution, as it often is in practice.

Intuitively, if the transformation g can be arbitrarily complex, one can generate any distribution p_Y from any base distribution p_Z under reasonable assumptions on the two distributions. This has been proven formally [9], [68], [99]. See Section 3.4.3.

Constructing arbitrarily complicated non-linear invertible functions (bijections) can be difficult. By the term Normalizing Flows people mean bijections which are convenient to compute, invert, and calculate the determinant of their Jacobian. One approach to this is to note that the composition of invertible functions is itself invertible and the determinant of its Jacobian has a specific form. In particular, let g_1, ..., g_N be a set of N bijective functions and define g = g_N ∘ g_{N−1} ∘ ... ∘ g_1 to be the composition of the functions. Then it can be shown that g is also bijective, with inverse

    f = f_1 ∘ ... ∘ f_{N−1} ∘ f_N,    (2)

and the determinant of the Jacobian is

    det Df(y) = ∏_{i=1}^{N} det Df_i(x_i),    (3)

where Df_i(x_i) = ∂f_i/∂x_i is the Jacobian of f_i. We denote the value of the i-th intermediate flow as x_i = g_i ∘ ... ∘ g_1(z) = f_{i+1} ∘ ... ∘ f_N(y), and so x_N = y. Thus, a set of nonlinear bijective functions can be composed to construct successively more complicated functions.
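To make the change of variables formula and the composition rule (Equations (1)-(3)) concrete, the following minimal NumPy sketch composes two scalar bijections and evaluates the resulting density by flowing points back to a standard normal base. The specific maps and the comparison against a closed-form log-normal density are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.stats import lognorm

# A minimal 1-D illustration of Equations (1)-(3): two bijections are
# composed in the generative direction, and the density of a point y is
# evaluated by flowing it back to the standard normal base while
# accumulating the log absolute derivatives of the inverse maps.

def base_logpdf(z):
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)   # standard normal p_Z

# generative direction: g = g2 o g1, so Y = exp(Z / 2) is log-normal
g1 = lambda z: 0.5 * z
g2 = lambda x: np.exp(x)

# normalizing direction: f = f1 o f2 with f_i = g_i^{-1}
f2 = lambda y: np.log(y)
f1 = lambda x: 2.0 * x

def flow_logpdf(y):
    # log p_Y(y) = log p_Z(f(y)) + sum_i log |f_i'|  (Equations (1) and (3))
    x1 = f2(y)
    z = f1(x1)
    log_det = -np.log(y) + np.log(2.0)   # |df2/dy| = 1/y, |df1/dx| = 2
    return base_logpdf(z) + log_det

y = np.linspace(0.1, 5.0, 50)
print(np.allclose(flow_logpdf(y), lognorm(s=0.5).logpdf(y)))   # True
```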
2.1.1 More Formal Construction

In this section we explain normalizing flows from a more formal perspective. Readers unfamiliar with measure theory can safely skip to Section 2.2. First, let us recall the general definition of a pushforward.

Definition 1. If (Z, Σ_Z) and (Y, Σ_Y) are measurable spaces, g is a measurable mapping between them, and μ is a measure on Z, then one can define a measure on Y (called the pushforward measure and denoted by g_*μ) by the formula

    g_*μ(U) = μ(g^{-1}(U)),  for all U ∈ Σ_Y.    (4)

This notion gives a general formulation of a generative model. Data can be understood as a sample from a measured "data" space (Y, Σ_Y, ν), which we want to learn. To do that one can introduce a simpler measured space (Z, Σ_Z, μ) and find a function g : Z → Y such that ν = g_*μ. This function g can be interpreted as a "generator", and Z as a latent space. This view puts generative models in the context of transportation theory [99].

In this survey we will assume that Z = R^D, all sigma-algebras are Borel, and all measures are absolutely continuous with respect to Lebesgue measure (i.e., μ = p_Z dz).

Definition 2. A function g : R^D → R^D is called a diffeomorphism if it is bijective, differentiable, and its inverse is differentiable as well.

The pushforward of an absolutely continuous measure p_Z dz by a diffeomorphism g is also absolutely continuous, with a density function given by Equation (1). Note that this more general approach is important for studying generative models on non-euclidean spaces (see Section 5.2).

Remark 3. It is common in the normalizing flows literature to simply refer to diffeomorphisms as "bijections" even though this is formally incorrect. In general, it is not necessary that g is everywhere differentiable; rather it is sufficient that it is differentiable only almost everywhere with respect to the Lebesgue measure on R^D. This allows, for instance, piecewise differentiable functions to be used in the construction of g.

2.2 Applications

2.2.1 Density Estimation and Sampling

The natural and most obvious use of normalizing flows is to perform density estimation. For simplicity assume that only a single flow, g, is used and it is parameterized by the vector θ. Further, assume that the base measure p_Z is given and is parameterized by the vector φ. Given a set of data observed from some complicated distribution, D = {y^(i)}_{i=1}^M, we can then perform likelihood-based estimation of the parameters Θ = (θ, φ). The data likelihood in this case simply becomes

    log p(D | Θ) = Σ_{i=1}^{M} log p_Y(y^(i) | Θ)
                 = Σ_{i=1}^{M} [ log p_Z(f(y^(i) | θ) | φ) + log |det Df(y^(i) | θ)| ],    (5)

where the first term is the log likelihood of the sample under the base measure and the second term, sometimes called the log-determinant or volume correction, accounts for the change of volume induced by the transformation of the normalizing flows (see Equation (1)). During training, the parameters of the flow (θ) and of the base distribution (φ) are adjusted to maximize the log-likelihood.

Note that evaluating the likelihood of a distribution modelled by a normalizing flow requires computing f (i.e., the normalizing direction), as well as its log determinant. The efficiency of these operations is particularly important during training, where the likelihood is repeatedly computed. However, sampling from the distribution defined by the normalizing flow requires evaluating the inverse g (i.e., the generative direction). Thus sampling performance is determined by the cost of the generative direction. Even though a flow must be theoretically invertible, computation of the inverse may be difficult in practice; hence, for density estimation it is common to model a flow in the normalizing direction (i.e., f).¹

¹ To ensure both efficient density estimation and sampling, van den Oord et al. [98] proposed an approach called Probability Density Distillation which trains the flow f as normal and then uses this as a teacher network to train a tractable student network g.

Finally, while maximum likelihood estimation is often effective (and statistically efficient under certain conditions), other forms of estimation can and have been used with normalizing flows. In particular, adversarial losses can be used with normalizing flow models (e.g., in Flow-GAN [36]).
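The following sketch illustrates likelihood-based training (Equation (5)) for a toy elementwise affine flow in PyTorch. The flow, its interface and the synthetic data are assumptions made for illustration; they do not correspond to any specific model from the literature reviewed here.

```python
import math
import torch

# Minimal sketch of maximum-likelihood training of a flow (Equation (5)).
class AffineFlow(torch.nn.Module):
    """f(y) = (y - b) * exp(-a) elementwise, so log|det Df| = -sum(a)."""
    def __init__(self, dim):
        super().__init__()
        self.a = torch.nn.Parameter(torch.zeros(dim))   # log-scale
        self.b = torch.nn.Parameter(torch.zeros(dim))   # shift

    def forward(self, y):
        z = (y - self.b) * torch.exp(-self.a)
        return z, -self.a.sum()                         # f(y), log|det Df(y)|

def nll(flow, y):
    z, log_det = flow(y)
    # standard normal base density (phi is kept fixed here for simplicity)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
    return -(log_pz + log_det).mean()                   # negative of Equation (5)

flow = AffineFlow(dim=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)
data = 3.0 + 0.5 * torch.randn(1024, 2)                 # toy "complicated" data
for step in range(500):
    optimizer.zero_grad()
    loss = nll(flow, data)
    loss.backward()
    optimizer.step()
# after training, flow.b should approach 3 and exp(flow.a) approximately 0.5
```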
2.2.2 Variational Inference

Consider a latent variable model p(x) = ∫ p(x, y) dy where x is an observed variable and y the latent variable. The posterior distribution p(y | x) is used when estimating the parameters of the model, but its computation is usually intractable in practice. One approach is to use variational inference and introduce the approximate posterior q(y | x, θ), where θ are the parameters of the variational distribution. Ideally this distribution should be as close to the real posterior as possible. This is done by minimizing the KL divergence D_KL(q(y | x, θ) || p(y | x)), which is equivalent to maximizing the evidence lower bound L(θ) = E_{q(y|x,θ)}[log p(y, x) − log q(y | x, θ)]. The latter optimization can be done with gradient descent; however, for that one needs to compute gradients of the form ∇_θ E_{q(y|x,θ)}[h(y)], which is not straightforward.

As was observed by Rezende and Mohamed [78], one can reparametrize q(y | x, θ) = p_Y(y | θ) with normalizing flows. Assume for simplicity that only a single flow g with parameters θ is used, y = g(z | θ), and the base distribution p_Z(z) does not depend on θ. Then

    E_{p_Y(y|θ)}[h(y)] = E_{p_Z(z)}[h(g(z | θ))],    (6)

and the gradient of the right hand side with respect to θ can be computed. This general approach to computing gradients of an expectation is often called the "reparameterization trick".

In this scenario evaluating the likelihood is only required at points which have been sampled. Here the sampling performance and evaluation of the log determinant are the only relevant metrics, and computing the inverse of the mapping may not be necessary. Indeed, the planar and radial flows introduced in Rezende and Mohamed [78] are not easily invertible (see Section 3.3).
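A minimal sketch of Equation (6) in PyTorch, with an illustrative one-dimensional affine flow; the particular g and h below are arbitrary choices used only to show how gradients with respect to θ are obtained by sampling from the base distribution.

```python
import torch

# Sketch of the reparameterization trick in Equation (6): the expectation of
# h(y) under p_Y(y|theta) is rewritten over the fixed base distribution, so
# gradients with respect to theta flow through g(z|theta).
theta = torch.tensor([0.5, 1.0], requires_grad=True)    # shift and log-scale

def g(z, theta):
    return theta[0] + torch.exp(theta[1]) * z            # simple affine flow

def h(y):
    return (y - 2.0) ** 2                                 # arbitrary test function

z = torch.randn(100_000)                                  # samples from p_Z
estimate = h(g(z, theta)).mean()                          # Monte Carlo estimate of the left side of (6)
estimate.backward()
print(estimate.item(), theta.grad)   # value and gradient estimates w.r.t. theta
```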
3 METHODS

Normalizing Flows should satisfy several conditions in order to be practical. They should:

- be invertible; for sampling we need g while for computing likelihood we need f,
- be sufficiently expressive to model the distribution of interest,
- be computationally efficient, both in terms of computing f and g (depending on the application) but also in terms of the calculation of the determinant of the Jacobian.

In the following section, we describe different types of flows and comment on the above properties. An overview of the methods discussed can be seen in Fig. 2.

3.1 Elementwise Flows

A basic form of bijective non-linearity can be constructed given any bijective scalar function. That is, let h : R → R be a scalar valued bijection. Then, if x = (x_1, x_2, ..., x_D)^T,

    g(x) = (h(x_1), h(x_2), ..., h(x_D))^T,    (7)

is also a bijection whose inverse simply requires computing h^{-1} and whose Jacobian is the product of the absolute values of the derivatives of h. This can be generalized by allowing each element to have its own distinct bijective function, which might be useful if we wish to only modify portions of our parameter vector. In deep learning terminology, h could be viewed as an "activation function". Note that the most commonly used activation function, ReLU, is not bijective and so cannot be directly applied; however, the (Parametric) Leaky ReLU [39], [64] can be used instead, among others. Note that recently spline-based activation functions have also been considered [24], [25] and will be discussed in Section 3.4.4.4.
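As an illustration, a Leaky ReLU elementwise flow of the form of Equation (7) can be implemented in a few lines; the choice of slope α below is arbitrary.

```python
import numpy as np

# Sketch of an elementwise flow (Equation (7)) built from a Leaky ReLU
# bijection h(x) = x for x >= 0 and alpha * x otherwise (alpha > 0).
ALPHA = 0.1

def g(x):
    return np.where(x >= 0, x, ALPHA * x)

def g_inv(y):
    return np.where(y >= 0, y, y / ALPHA)

def log_abs_det_jacobian(x):
    # The Jacobian is diagonal; its log absolute determinant is the sum of
    # the log absolute derivatives of h applied to each coordinate.
    return np.sum(np.where(x >= 0, 0.0, np.log(ALPHA)), axis=-1)

x = np.random.randn(4, 3)
assert np.allclose(g_inv(g(x)), x)
print(log_abs_det_jacobian(x))
```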

Fig. 2. Overview of flows discussed in this review. We start with elementwise bijections, linear flows, and planar and radial flows. All of these have drawbacks and are limited in utility. We then discuss two architectures (coupling flows and autoregressive flows) which support invertible non-linear transformations. These both use a coupling function, and we summarize the different coupling functions available. Finally, we discuss residual flows and their continuous extension, infinitesimal flows.

3.2 Linear Flows

Elementwise operations alone are insufficient as they cannot express any form of correlation between dimensions. Linear mappings can express correlation between dimensions:

    g(x) = Ax + b,    (8)

where A ∈ R^{D×D} and b ∈ R^D are parameters. If A is an invertible matrix, the function is invertible.

Linear flows are limited in their expressiveness. Consider a Gaussian base distribution: p_Z(z) = N(z; μ, Σ). After transformation by a linear flow, the distribution remains Gaussian with distribution p_Y = N(y; Aμ + b, AΣA^T). More generally, a linear flow of a distribution from the exponential family remains in the exponential family. However, linear flows are an important building block as they form the basis of affine coupling flows (Section 3.4.4.1).

Note that the determinant of the Jacobian is simply det(A), which can be computed in O(D³), as can the inverse. Hence, using linear flows can become expensive for large D. By restricting the form of A we can avoid these practical problems at the expense of expressive power. In the following sections we discuss different ways of limiting the form of linear transforms to make them more practical.

3.2.1 Diagonal

If A is diagonal with nonzero diagonal entries, then its inverse can be computed in linear time and its determinant is the product of the diagonal entries. However, the result is an elementwise transformation and hence cannot express correlation between dimensions. Nonetheless, a diagonal linear flow can still be useful for representing normalization transformations [20], which have become a ubiquitous part of modern neural networks [46].

3.2.2 Triangular

The triangular matrix is a more expressive form of linear transformation whose determinant is the product of its diagonal. It is non-singular so long as its diagonal entries are non-zero. Inversion is relatively inexpensive, requiring a single pass of back-substitution costing O(D²) operations.

Tomczak and Welling [91] combined K triangular matrices T_i, each with ones on the diagonal, and a K-dimensional probability vector v to define a more general linear flow y = (Σ_{i=1}^K v_i T_i) z. The determinant of this bijection is one. However, finding the inverse has O(D³) complexity if some of the matrices are upper- and some are lower-triangular.

3.2.3 Permutation and Orthogonal

The expressiveness of triangular transformations is sensitive to the ordering of dimensions. Reordering the dimensions can be done easily using a permutation matrix, which has an absolute determinant of 1. Different strategies have been tried, including reversing and a fixed random permutation [20], [57]. However, the permutations cannot be directly optimized and so remain fixed after initialization, which may not be optimal.

A more general alternative is the use of orthogonal transformations. The inverse and absolute determinant of an orthogonal matrix are both trivial to compute, which makes them efficient. Tomczak and Welling [92] used orthogonal matrices parameterized by the Householder transform. The idea is based on the fact from linear algebra that any orthogonal matrix can be written as a product of reflections. To parameterize a reflection matrix H in R^D one fixes a non-zero vector v ∈ R^D and then defines H = 1 − (2 / ‖v‖²) v v^T.
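The following NumPy sketch builds an orthogonal linear flow from a product of Householder reflections; it is a simplified illustration of the idea, not the implementation of [92].

```python
import numpy as np

# Sketch of an orthogonal linear flow parameterized by Householder
# reflections H = I - 2 v v^T / ||v||^2.
def householder(v):
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

D, K = 4, 3
rng = np.random.default_rng(0)
Q = np.eye(D)
for _ in range(K):                      # a product of reflections is orthogonal
    Q = Q @ householder(rng.normal(size=D))

x = rng.normal(size=D)
y = Q @ x                               # forward linear flow (no bias, for brevity)
x_rec = Q.T @ y                         # the inverse is just the transpose
print(np.allclose(x_rec, x), np.abs(np.linalg.det(Q)))   # True, 1.0 (up to rounding)
```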
3.2.4 Factorizations

Instead of limiting the form of A, Kingma and Dhariwal [57] proposed using the LU factorization:

    g(x) = PLUx + b,    (9)

where L is lower triangular with ones on the diagonal, U is upper triangular with non-zero diagonal entries, and P is a permutation matrix. The determinant is the product of the diagonal entries of U, which can be computed in O(D). The inverse of the function g can be computed using two passes of backward substitution in O(D²). However, the discrete permutation P cannot be easily optimized. To avoid this, P is randomly generated initially and then fixed. Hoogeboom et al. [42] noted that fixing the permutation matrix limits the flexibility of the transformation, and proposed using the QR decomposition instead, where the orthogonal matrix Q is described with Householder transforms.
3.2.5 Convolution

Another form of linear transformation is a convolution, which has been a core component of modern deep learning architectures. While convolutions are easy to compute, their inverse and determinant are non-obvious. Several approaches have been considered. Kingma and Dhariwal [57] restricted themselves to "1 × 1" convolutions for flows, which are simply a full linear transformation but applied only across channels. Zheng et al. [107] used 1D convolutions (ConvFlow) and exploited the triangular structure of the resulting transform to efficiently compute the determinant. However, Hoogeboom et al. [42] have provided a more general solution for modelling d × d convolutions, either by stacking together masked autoregressive convolutions (referred to as Emerging Convolutions) or by exploiting the Fourier domain representation of convolution to efficiently compute inverses and determinants (referred to as Periodic Convolutions).
3.3 Planar and Radial Flows

Rezende and Mohamed [78] introduced planar and radial flows. They are relatively simple, but their inverses aren't easily computed. These flows are not widely used in practice, yet they are reviewed here for completeness.

3.3.1 Planar Flows

Planar flows expand and contract the distribution along certain specific directions and take the form

    g(x) = x + u h(w^T x + b),    (10)

where u, w ∈ R^D and b ∈ R are parameters and h : R → R is a smooth non-linearity. The Jacobian determinant for this transformation is

    det(∂g/∂x) = det(1_D + u h'(w^T x + b) w^T)
               = 1 + h'(w^T x + b) u^T w,    (11)

where the last equality comes from the application of the matrix determinant lemma. This can be computed in O(D) time. The inversion of this flow isn't possible in closed form and may not exist for certain choices of h(·) and certain parameter settings [78].

The term u h(w^T x + b) can be interpreted as a multilayer perceptron with a bottleneck hidden layer with a single unit [56]. This bottleneck means that one needs to stack many planar flows to get high expressivity. Hasenclever et al. [38] and van den Berg et al. [97] introduced Sylvester flows to resolve this problem:

    g(x) = x + U h(W^T x + b),    (12)

where U and W are D × M matrices, b ∈ R^M, h : R^M → R^M is an elementwise smooth nonlinearity, and M ≤ D is a hyperparameter which can be interpreted as the dimension of a hidden layer. In this case the Jacobian determinant is

    det(∂g/∂x) = det(1_D + U diag(h'(W^T x + b)) W^T)
               = det(1_M + diag(h'(W^T x + b)) W^T U),    (13)

where the last equality is Sylvester's determinant identity (which gives these flows their name). This can be computationally efficient if M is small. Some sufficient conditions for the invertibility of Sylvester transformations are discussed in Hasenclever et al. [38] and van den Berg et al. [97].
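The following sketch evaluates a planar flow and its Jacobian determinant via the matrix determinant lemma (Equation (11)); the parameters are random and no attempt is made to enforce invertibility, which, as noted above, requires additional constraints.

```python
import numpy as np

# Sketch of a planar flow (Equations (10)-(11)): the Jacobian determinant is
# obtained in O(D) with the matrix determinant lemma and checked against the
# full determinant.
rng = np.random.default_rng(2)
D = 6
u, w, b = rng.normal(size=D), rng.normal(size=D), rng.normal()
h, h_prime = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2

def planar(x):
    return x + u * h(w @ x + b)

def det_jacobian(x):
    return 1.0 + h_prime(w @ x + b) * (u @ w)        # matrix determinant lemma

x = rng.normal(size=D)
J = np.eye(D) + np.outer(u, w) * h_prime(w @ x + b)  # full Jacobian for comparison
print(np.isclose(det_jacobian(x), np.linalg.det(J)))  # True
```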
3.3.2 Radial Flows

Radial flows instead modify the distribution around a specific point, so that

    g(x) = x + (β / (α + ‖x − x_0‖)) (x − x_0),    (14)

where x_0 ∈ R^D is the point around which the distribution is distorted, and α, β ∈ R are parameters with α > 0. As for planar flows, the Jacobian determinant can be computed relatively efficiently. The inverse of radial flows cannot be given in closed form but does exist under suitable constraints on the parameters [78].

3.4 Coupling and Autoregressive Flows

In this section we describe coupling and autoregressive flows, which are the two most widely used flow architectures. We first present them in the general form, and then in Section 3.4.4 we give specific examples.

Fig. 3. Coupling architecture. a) A single coupling flow described in Equation (15). A coupling function h is applied to one part of the space, while its parameters depend on the other part. b) Two subsequent multi-scale flows in the generative direction. A flow is applied to a relatively low dimensional vector z; its parameters no longer depend on the rest part z_aux. Then new dimensions are gradually introduced to the distribution.

3.4.1 Coupling Flows

Dinh et al. [19] introduced a coupling method to enable highly expressive transformations for flows (Fig. 3a). Consider a disjoint partition of the input x ∈ R^D into two subspaces, (x^A, x^B) ∈ R^d × R^{D−d}, and a bijection h(· ; θ) : R^d → R^d, parameterized by θ. Then one can define a function g : R^D → R^D by the formula

    y^A = h(x^A; Θ(x^B))
    y^B = x^B,    (15)

where the parameters θ are defined by any arbitrary function Θ(x^B) which only uses x^B as input. This function is called a conditioner. The bijection h is called a coupling function, and the resulting function g is called a coupling flow. A coupling flow is invertible if and only if h is invertible, and it has inverse

    x^A = h^{-1}(y^A; Θ(x^B))
    x^B = y^B.    (16)

The Jacobian of g is a block triangular matrix where the diagonal blocks are Dh and the identity matrix respectively.
Hence the determinant of the Jacobian of the coupling flow is simply the determinant of Dh.

Most coupling functions are applied to x^A element-wise:

    h(x^A; θ) = (h_1(x^A_1; θ_1), ..., h_d(x^A_d; θ_d)),    (17)

where each h_i(· ; θ_i) : R → R is a scalar bijection. In this case a coupling flow is a triangular transformation (i.e., has a triangular Jacobian matrix). See Section 3.4.4 for examples.

The power of a coupling flow resides in the ability of a conditioner Θ(x^B) to be arbitrarily complex. In practice it is usually modelled as a neural network. For example, Kingma and Dhariwal [57] used a shallow ResNet architecture.

Sometimes, however, the conditioner can be constant (i.e., not depend on x^B at all). This allows for the construction of a "multi-scale flow" [20] which gradually introduces dimensions to the distribution in the generative direction (Fig. 3b). In the normalizing direction, the dimension reduces by half after each iteration step, such that most of the semantic information is retained. This reduces the computational costs of transforming high dimensional distributions and can capture the multi-scale structure inherent in certain kinds of data like natural images.

The question remains of how to partition x. This is often done by splitting the dimensions in half [19], potentially after a random permutation. However, more structured partitioning has also been explored and is common practice, particularly when modelling images. For instance, Dinh et al. [20] used "masked" flows that take alternating pixels or blocks of channels in the case of an image in non-volume preserving flows (RealNVP). In place of permutation, Kingma and Dhariwal [57] used 1 × 1 convolution (Glow). For the partition for the multi-scale flow in the normalizing direction, Das et al. [18] suggested selecting features at which the Jacobian of the flow has higher values for the propagated part.
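A compact PyTorch sketch of a coupling flow (Equation (15)) with an affine coupling function and a small fully-connected conditioner. The network sizes and the interface are illustrative assumptions, not the RealNVP or Glow code.

```python
import torch

# Sketch of a coupling flow (Equation (15)) with an affine coupling function.
class AffineCoupling(torch.nn.Module):
    def __init__(self, d, D):
        super().__init__()
        self.d = d
        self.net = torch.nn.Sequential(              # conditioner Theta(x^B)
            torch.nn.Linear(D - d, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 2 * d),
        )

    def forward(self, x):
        xa, xb = x[:, :self.d], x[:, self.d:]
        log_s, t = self.net(xb).chunk(2, dim=1)
        ya = xa * torch.exp(log_s) + t               # y^A = h(x^A; Theta(x^B))
        log_det = log_s.sum(dim=1)                   # determinant of the triangular Jacobian
        return torch.cat([ya, xb], dim=1), log_det

    def inverse(self, y):
        ya, yb = y[:, :self.d], y[:, self.d:]
        log_s, t = self.net(yb).chunk(2, dim=1)      # x^B = y^B, so Theta is known
        xa = (ya - t) * torch.exp(-log_s)
        return torch.cat([xa, yb], dim=1)

layer = AffineCoupling(d=2, D=4)
x = torch.randn(8, 4)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))   # True
```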

Fig. 4. Autoregressive flows. On the left is the direct autoregressive flow given in Equation (18). Each output depends on the current and previous inputs, and so this operation can be easily parallelized. On the right is the inverse autoregressive flow from Equation (20). Each output depends on the current input and the previous outputs, and so computation is inherently sequential and cannot be parallelized.

3.4.2 Autoregressive Flows

Kingma et al. [56] used autoregressive models as a form of normalizing flow. These are non-linear generalizations of multiplication by a triangular matrix (Section 3.2.2).

Let h(· ; θ) : R → R be a bijection parameterized by θ. Then an autoregressive model is a function g : R^D → R^D, which outputs each entry of y = g(x) conditioned on the previous entries of the input:

    y_t = h(x_t; Θ_t(x_{1:t−1})),    (18)

where x_{1:t} = (x_1, ..., x_t). For t = 2, ..., D we choose arbitrary functions Θ_t(·) mapping R^{t−1} to the set of all parameters, and Θ_1 is a constant. The functions Θ_t(·) are called conditioners.

The Jacobian matrix of the autoregressive transformation g is triangular. Each output y_t only depends on x_{1:t}, and so the determinant is just a product of its diagonal entries:

    det(Dg) = ∏_{t=1}^{D} ∂y_t/∂x_t.    (19)

In practice, it's possible to efficiently compute all the entries of the direct flow (Equation (18)) in one pass using a single network with appropriate masks [31]. This idea was used by Papamakarios et al. [74] to create masked autoregressive flows (MAF).

However, the computation of the inverse is more challenging. Given the inverse of h, the inverse of g can be found with recursion: we have x_1 = h^{-1}(y_1; Θ_1), and for any t = 2, ..., D, x_t = h^{-1}(y_t; Θ_t(x_{1:t−1})). This computation is inherently sequential, which makes it difficult to implement efficiently on modern hardware as it cannot be parallelized.

Note that the functional form for the autoregressive model is very similar to that for the coupling flow. In both cases a bijection h is used, which has as an input one part of the space and which is parameterized conditioned on the other part. We call this bijection a coupling function in both cases. Note that Huang, Krueger, Lacoste, and Courville [45] used the name "transformer" (which has nothing to do with transformers in NLP).

Alternatively, Kingma et al. [56] introduced the "inverse autoregressive flow" (IAF), which outputs each entry of y conditioned on the previous entries of y (with respect to the fixed ordering). Formally,

    y_t = h(x_t; θ_t(y_{1:t−1})).    (20)

One can see that the functional form of the inverse autoregressive flow is the same as the form of the inverse of the flow in Equation (18), hence the name. Computation of the IAF is sequential and expensive, but the inverse of IAF (which is a direct autoregressive flow) can be computed relatively efficiently (Fig. 4).

In Section 2.2.1 we noted that papers typically model flows in the "normalizing flow" direction (i.e., in terms of f from data to the base density) to enable efficient evaluation of the log-likelihood during training. In this context one can think of IAF as a flow in the generative direction, i.e., in terms of g from base density to data. Hence Papamakarios et al. [74] noted that one should use IAFs if fast sampling is needed (e.g., for stochastic variational inference), and MAFs if fast density estimation is desirable. The two methods are closely related and, under certain circumstances, are theoretically equivalent [74].
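The sequential nature of the inverse is easy to see in a small sketch of a direct autoregressive flow with an affine coupling function; the conditioners below are arbitrary fixed functions chosen only for illustration.

```python
import numpy as np

# Sketch of a direct autoregressive flow with an affine coupling function,
# y_t = x_t * exp(s_t(x_{1:t-1})) + m_t(x_{1:t-1}), and its sequential inverse.
def conditioner(x_prev):
    # returns (log-scale, shift) as a function of the previous inputs
    s = 0.1 * np.sum(np.tanh(x_prev)) if len(x_prev) else 0.0
    m = 0.5 * np.sum(x_prev) if len(x_prev) else 0.3
    return s, m

def forward(x):
    y = np.empty_like(x)
    for t in range(len(x)):                 # with masks, all y_t can be computed in parallel
        s, m = conditioner(x[:t])           # depends only on x_{1:t-1}
        y[t] = x[t] * np.exp(s) + m
    return y

def inverse(y):
    x = np.empty_like(y)
    for t in range(len(y)):                 # inherently sequential recursion
        s, m = conditioner(x[:t])           # previous x's have already been recovered
        x[t] = (y[t] - m) * np.exp(-s)
    return x

x = np.random.randn(5)
print(np.allclose(inverse(forward(x)), x))  # True
# log|det Dg| is the sum of the log-scales s_t along the diagonal (Equation (19))
```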
3.4.3 Universality

For several autoregressive flows the universality property has been proven [45], [49]. Informally, universality means that the flow can learn any target density to any required precision given sufficient capacity and data. We will provide a formal proof of the universality theorem following [49]. This section requires some knowledge of measure theory and functional analysis and can be safely skipped.

First, recall that a mapping T = (T_1, ..., T_D) : R^D → R^D is called triangular if T_i is a function of x_{1:i} for each i = 1, ..., D. Such a triangular map T is called increasing if T_i is an increasing function of x_i for each i.

Proposition 4 ([9], Lemma 2.1). If μ and ν are absolutely continuous Borel probability measures on R^D, then there exists an increasing triangular transformation T : R^D → R^D, such that ν = T_*μ. This transformation is unique up to null sets of μ. A similar result holds for measures on [0, 1]^D.

Proposition 5. If μ is an absolutely continuous Borel probability measure on R^D and {T_n} is a sequence of maps R^D → R^D which converges pointwise to a map T, then the sequence of measures (T_n)_*μ weakly converges to T_*μ.

Proof. See [45], Lemma 4. The result follows from the dominated convergence theorem. □

As a corollary, to claim that a class of autoregressive flows g(· ; θ) : R^D → R^D is universal, it is enough to demonstrate that the family of coupling functions h used in the class is dense in the set of all monotone functions in the pointwise convergence topology. In particular, [45] used neural monotone networks for coupling functions, and [49] used monotone polynomials. Using the theory outlined in this section, universality could also be proved for spline flows [24], [25] with splines for coupling functions (see Section 3.4.4.4).

3.4.4 Coupling Functions

As described in the previous sections, coupling flows and autoregressive flows have a similar functional form and both have coupling functions as building blocks. A coupling function is a bijective differentiable function h(· ; θ) : R^d → R^d, parameterized by θ. In coupling flows, these functions are typically constructed by applying a scalar coupling function h(· ; θ) : R → R elementwise. In autoregressive flows, d = 1 and hence they are also scalar valued. Note that scalar coupling functions are necessarily (strictly) monotone. In this section we describe the scalar coupling functions commonly used in the literature.

3.4.4.1 Affine coupling. Two simple forms of coupling functions h : R → R were proposed by Dinh et al. [19] in NICE (nonlinear independent component estimation). These were the additive coupling function

    h(x; θ) = x + θ,  θ ∈ R,    (21)

and the affine coupling function

    h(x; θ) = θ_1 x + θ_2,  θ_1 ≠ 0, θ_2 ∈ R.    (22)

Affine coupling functions are used for coupling flows in NICE [19], RealNVP [20], Glow [57] and for autoregressive architectures in IAF [56] and MAF [74]. They are simple and computation is efficient. However, they are limited in expressiveness, and many flows must be stacked to represent complicated distributions.

3.4.4.2 Nonlinear squared flow. Ziegler and Rush [108] proposed an invertible non-linear squared transformation defined by

    h(x; θ) = ax + b + c / (1 + (dx + h)²).    (23)

Under some constraints on the parameters θ = [a, b, c, d, h] ∈ R^5, the coupling function is invertible and its inverse is analytically computable as a root of a cubic polynomial (with only one real root). Experiments showed that these coupling functions facilitate learning multimodal distributions.

3.4.4.3 Continuous mixture CDFs. Ho et al. [41] proposed the Flow++ model, which contained several improvements, including a more expressive coupling function. The layer is almost like a linear transformation, but one also applies a monotone function to x:

    h(x; θ) = θ_1 F(x; θ_3) + θ_2,    (24)

where θ_1 ≠ 0, θ_2 ∈ R and θ_3 = [π, μ, s] ∈ R^K × R^K × R^K_+. The function F(x; π, μ, s) is the CDF of a mixture of K logistics, postcomposed with an inverse sigmoid:

    F(x; π, μ, s) = σ^{-1}( Σ_{j=1}^{K} π_j σ((x − μ_j) / s_j) ).    (25)

Note that the post-composition with σ^{-1} : [0, 1] → R is used to ensure the right range for h. Computation of the inverse is done numerically with the bisection algorithm. The derivative of the transformation with respect to x is expressed in terms of the PDF of the logistic mixture (i.e., a linear combination of hyperbolic secant functions), and its computation is not expensive. An ablation study demonstrated that switching from an affine coupling function to a logistic mixture improved performance slightly.
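A NumPy sketch of Equations (24) and (25) for a single scalar input, with the inverse computed by bisection; the mixture parameters are fixed by hand here, whereas in Flow++ they would be produced by the conditioner.

```python
import numpy as np

# Sketch of the Flow++-style coupling function of Equations (24)-(25): a
# mixture-of-logistics CDF post-composed with an inverse sigmoid, followed by
# an affine map. The inverse is found numerically by bisection.
pi = np.array([0.3, 0.7])          # mixture weights
mu = np.array([-1.0, 2.0])         # component locations
s = np.array([0.5, 1.5])           # component scales
theta1, theta2 = 1.5, -0.2         # affine parameters (theta1 != 0)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
logit = lambda p: np.log(p) - np.log1p(-p)

def h(x):
    F = logit(np.sum(pi * sigmoid((x - mu) / s)))
    return theta1 * F + theta2

def h_inverse(y, lo=-50.0, hi=50.0, iters=80):
    # h is strictly increasing here (theta1 > 0), so bisection converges
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

x = 0.7
print(abs(h_inverse(h(x)) - x))    # small (bisection tolerance)
```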
3.4.4.4 Splines. A spline is a piecewise-polynomial or a piecewise-rational function which is specified by K + 1 points (x_i, y_i)_{i=0}^K, called knots, through which the spline passes. To make a useful coupling function, the spline should be monotone, which will be the case if x_i < x_{i+1} and y_i < y_{i+1}. Usually splines are considered on a compact interval.

Piecewise-linear and piecewise-quadratic. Müller et al. [69] used linear splines for coupling functions h : [0, 1] → [0, 1]. They divided the domain into K equal bins. Instead of defining increasing values for y_i, they modeled h as the integral of a positive piecewise-constant function:

    h(x; θ) = α θ_b + Σ_{k=1}^{b−1} θ_k,    (26)

where θ ∈ R^K is a probability vector, b = ⌊Kx⌋ (the bin that contains x), and α = Kx − b (the position of x in bin b). This map is invertible, if all θ_k > 0, with derivative ∂h/∂x = θ_b K.

Müller et al. [69] also used a monotone quadratic spline on the unit interval for a coupling function and modeled this as the integral of a positive piecewise-linear function. A monotone quadratic spline is invertible; finding its inverse map requires solving a quadratic equation.
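A short NumPy sketch of the piecewise-linear coupling function of Equation (26), with a hand-chosen probability vector θ (in practice θ is the output of the conditioner).

```python
import numpy as np

# Sketch of the piecewise-linear spline coupling function of Equation (26):
# h on [0, 1] is the integral of a positive piecewise-constant density whose
# K bin probabilities theta would normally be produced by the conditioner.
theta = np.array([0.1, 0.4, 0.2, 0.3])            # probability vector, all entries > 0
K = len(theta)
cum = np.concatenate([[0.0], np.cumsum(theta)])   # h at the bin edges

def h(x):
    b = np.minimum((K * x).astype(int), K - 1)    # bin containing x
    alpha = K * x - b                             # position of x inside the bin
    return cum[b] + alpha * theta[b]

def h_inverse(y):
    b = np.searchsorted(cum, y, side="right") - 1
    b = np.clip(b, 0, K - 1)
    return (b + (y - cum[b]) / theta[b]) / K

x = np.random.rand(6)
print(np.allclose(h_inverse(h(x)), x))            # True
# derivative (needed for the log-determinant): dh/dx = theta[b] * K
```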

Cubic Splines. Durkan et al. [24] proposed using monotone cubic splines for a coupling function. They do not restrict the domain to the unit interval, but instead use the form h(· ; θ) = σ^{-1}(ĥ(σ(·); θ)), where ĥ(· ; θ) : [0, 1] → [0, 1] is a monotone cubic spline and σ is a sigmoid. Steffen's method is used to construct the spline. Here, one specifies K + 1 knots of the spline and the boundary derivatives ĥ'(0) and ĥ'(1). These quantities are modelled as the output of a neural network.

Computation of the derivative is easy as it is piecewise-quadratic. A monotone cubic polynomial has only one real root, and for inversion one can find this either analytically or numerically. However, the procedure is numerically unstable if not treated carefully. The flow can be trained by gradient descent by differentiating through the numerical root finding method. However, Durkan et al. [25] noted numerical difficulties when the sigmoid saturates for values far from zero.

Rational Quadratic Splines. Durkan et al. [25] model a coupling function h(x; θ) as a monotone rational-quadratic spline on an interval, and as the identity function otherwise. They define the spline using the method of Gregory and Delbourgo [35], by specifying K + 1 knots {h(x_i)}_{i=0}^K and the derivatives at the inner points {h'(x_i)}_{i=1}^{K−1}. The locations of the knots and their derivatives are modelled as the output of a neural network.

The derivative with respect to x is a quotient derivative, and the function can be inverted by solving a quadratic equation. Durkan, Bekasov, Murray, and Papamakarios [25] used this coupling function with both a coupling architecture RQ-NSF(C) and an auto-regressive architecture RQ-NSF(AR).

3.4.4.5 Neural autoregressive flow. Huang et al. [45] introduced Neural Autoregressive Flows (NAF) where a coupling function h(· ; θ) is modelled with a deep neural network. Typically such a network is not invertible, but they proved a sufficient condition for it to be bijective:

Proposition 6. If NN(·) : R → R is a multilayer perceptron such that all weights are positive and all activation functions are strictly monotone, then NN(·) is a strictly monotone function.

They proposed two forms of neural networks: the deep sigmoidal coupling function (NAF-DSF) and the deep dense sigmoidal coupling function (NAF-DDSF). Both are MLPs with layers of sigmoid and logit units and non-negative weights; the former has a single hidden layer of sigmoid units, whereas the latter is more general and does not have this bottleneck. By Proposition 6, the resulting h(· ; θ) is a strictly monotone function. They also proved that a DSF network can approximate any strictly monotone univariate function, and so NAF-DSF is a universal flow.

Wehenkel and Louppe [102] noted that imposing positivity of weights on a flow makes training harder and requires more complex conditioners. To mitigate this, they introduced unconstrained monotonic neural networks (UMNN). The idea is that in order to model a strictly monotone function, one can describe a strictly positive (or negative) function with a neural network and then integrate it numerically. They demonstrated that UMNN requires fewer parameters than NAF to reach similar performance, and so is more scalable for high-dimensional datasets.
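Proposition 6 can be checked numerically with a tiny sketch: an MLP whose weights are made positive (here via exponentiation) and whose activation is strictly monotone is itself strictly increasing. This is a simplified illustration, not the DSF architecture itself.

```python
import numpy as np

# Sketch of Proposition 6: a scalar MLP with strictly positive weights and
# strictly monotone activations is strictly increasing.
rng = np.random.default_rng(3)
W1 = np.exp(rng.normal(size=(8, 1)))     # positivity enforced via exp
b1 = rng.normal(size=(8, 1))
W2 = np.exp(rng.normal(size=(1, 8)))
b2 = rng.normal()

def monotone_net(x):
    hidden = np.tanh(W1 * x + b1)        # strictly monotone activation
    return (W2 @ hidden + b2).item()

xs = np.linspace(-3, 3, 200)
ys = np.array([monotone_net(x) for x in xs])
print(np.all(np.diff(ys) > 0))           # True: the map is strictly increasing
```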
Fig. 5. Piecewise bijective coupling. The target domain (right) is divided into disjoint sections (colors) and each mapped by a monotone function (center) to the base distribution (left). For inverting the function, one samples a component of the base distribution using a gating network.

3.4.4.6 Sum-of-squares polynomial flow. Jaini et al. [49] modeled h(· ; θ) as a strictly increasing polynomial. They proved such polynomials can approximate any strictly monotonic univariate continuous function. Hence, the resulting flow (SOS, the sum-of-squares polynomial flow) is a universal flow.

The authors observed that the derivative of an increasing single-variable polynomial is a positive polynomial. Then they used a classical result from algebra: all positive single-variable polynomials are the sum of squares of polynomials. To get the coupling function, one needs to integrate the sum of squares:

    h(x; θ) = c + ∫_0^x Σ_{k=1}^{K} ( Σ_{l=0}^{L} a_{kl} u^l )² du,    (27)

where L and K are hyperparameters (and, as noted in the paper, can be chosen to be 2).

SOS is easier to train than NAF, because there are no restrictions on the parameters (like positivity of weights). For L = 0, SOS reduces to the affine coupling function, and so it is a generalization of the basic affine flow.

3.4.4.7 Piecewise-bijective coupling. Dinh et al. [21] explore the idea that a coupling function does not need to be bijective, but just piecewise-bijective (Fig. 5). Formally, they consider a function h(· ; θ) : R → R and a covering of the domain into K disjoint subsets, R = ⊔_{i=1}^K A_i, such that the restriction of the function onto each subset, h(· ; θ)|_{A_i}, is injective.

Dinh et al. [21] constructed a flow f : R^D → R^D with a coupling architecture and piecewise-bijective coupling function in the normalizing direction, from the data distribution to the (simpler) base distribution. There is a covering of the data domain, and each subset of this covering is separately mapped to the base distribution. Each part of the base distribution now receives contributions from each subset of the data domain. For sampling, [21] proposed a probabilistic mapping from the base to the data domain.

More formally, denote the target y and the base z, and consider a lookup function φ : R → [K] = {1, ..., K}, such that φ(y) = k if y ∈ A_k. One can define a new map R → R × [K], given by the rule y ↦ (h(y), φ(y)), and a density on the target space p_{Z,[K]}(z, k) = p_{[K]|Z}(k | z) p_Z(z). One can think of this as an unfolding of the non-injective map h. In particular, for each point z one can find its pre-image by sampling from p_{[K]|Z}, which is called a gating network. Pushing forward along this unfolded map is now well-defined and one gets the formula for the density p_Y:

    p_Y(y) = p_{Z,[K]}(h(y), φ(y)) |Dh(y)|.    (28)

This real and discrete (RAD) flow efficiently learns distributions with discrete structures (multimodal distributions, distributions with holes, discrete symmetries etc).
3.5 Residual Flows

Residual networks [40] are compositions of functions of the form

    g(x) = x + F(x).    (29)

Such a function is called a residual connection, and here the residual block F(·) is a feed-forward neural network of any kind (a CNN in the original paper).

The first attempts to build a reversible network architecture based on residual connections were made in RevNets [32] and iRevNets [47]. Their main motivation was to save memory during training and to stabilize computation. The central idea is a variation of additive coupling functions: consider a disjoint partition of R^D = R^d × R^{D−d}, denoted by x = (x^A, x^B) for the input and y = (y^A, y^B) for the output, and define a function

    y^A = x^A + F(x^B)
    y^B = x^B + G(y^A),    (30)

where F : R^{D−d} → R^d and G : R^d → R^{D−d} are residual blocks. This network is invertible (by re-arranging the equations in terms of x^A and x^B and reversing their order), but computation of the Jacobian is inefficient.

A different point of view on reversible networks comes from a dynamical systems perspective, via the observation that a residual connection is a discretization of a first order ordinary differential equation (see Section 3.6 for more details). [12], [13] proposed several architectures, and some of these networks were demonstrated to be invertible. However, the Jacobian determinants of these networks cannot be computed efficiently.

Other research has focused on making the residual connection g(·) invertible. A sufficient condition for the invertibility was found in [7]. They proved the following statement:

Proposition 7. A residual connection (29) is invertible if the Lipschitz constant of the residual block is Lip(F) < 1.

There is no analytically closed form for the inverse, but it can be found numerically using fixed-point iterations (which, by the Banach theorem, converge if we assume Lip(F) < 1).

Controlling the Lipschitz constant of a neural network is not simple. The specific architecture proposed by Behrmann et al. [7], called iResNet, uses a convolutional network for the residual block. It constrains the spectral radius of each convolutional layer in this network to be less than one.

The Jacobian determinant of the iResNet cannot be computed directly, so the authors propose to use a (biased) stochastic estimate. The Jacobian of the residual connection g in Equation (29) is Dg = I + DF. Because the function F is assumed to be Lipschitz with Lip(F) < 1, one has |det(I + DF)| = det(I + DF). Using the linear algebra identity ln det A = Tr ln A, we have

    ln |det Dg| = ln det(I + DF) = Tr(ln(I + DF)).    (31)

Then one considers a power series for the trace of the matrix logarithm:

    Tr(ln(I + DF)) = Σ_{k=1}^{∞} (−1)^{k+1} Tr((DF)^k) / k.    (32)

By truncating this series one can calculate an approximation to the log Jacobian determinant of g. To efficiently compute each member of the truncated series, the Hutchinson trick was used. This trick provides a stochastic estimation of the trace of a matrix A ∈ R^{D×D}, using the relation Tr A = E_{p(v)}[v^T A v], where v ∈ R^D, E[v] = 0, and cov(v) = I.

Truncating the power series gives a biased estimate of the log Jacobian determinant (the bias depends on the truncation error). An unbiased stochastic estimator was proposed by Chen et al. [16] in a model they called a Residual flow. The authors used a Russian roulette estimator instead of truncation. Informally, every time one adds the next term a_{n+1} to the partial sum Σ_{i=1}^n a_i while calculating the series Σ_{i=1}^∞ a_i, one flips a coin to decide if the calculation should be continued or stopped. During this process one needs to re-weight terms for an unbiased estimate.
(30)
yB ¼ xB þ GðyA Þ; 3.6 Infinitesimal (Continuous) Flows
The residual connections discussed in the previous section
where F : RDd ! Rd and G : Rd ! RDd are residual
can be viewed as discretizations of a first order ordinary dif-
blocks. This network is invertible (by re-arranging the equa-
ferential equation (ODE) [26], [37]
tions in terms of xA and xB and reversing their order) but
computation of the Jacobian is inefficient. d
A different point of view on reversible networks comes xðtÞ ¼ F ðxðtÞ; uðtÞÞ; (33)
dt
from a dynamical systems perspective via the observation
that a residual connection is a discretization of a first order where F : RD  Q ! RD is a function which determines the
ordinary differential equation (see Section 3.6 for more dynamic (the evolution function), Q is a set of parameters
details). [12], [13] proposed several architectures, some of and u : R ! Q is a parameterization. The discretization of
these networks were demonstrated to be invertible. How- this equation (Euler’s method) is
ever, the Jacobian determinants of these networks cannot be
xnþ1  xn ¼ "F ðxn ; un Þ; (34)
computed efficiently.
Other research has focused on making the residual connec- and this is equivalent to a residual connection with a resid-
tion gðÞ invertible. A sufficient condition for the invertibility ual block "F ð; un Þ.
was found in [7]. They proved the following statement: In this section we consider the case where we do not dis-
Proposition 7. A residual connection (29) is invertible, if the cretize but try to learn the continuous dynamical system
Lipschitz constant of the residual block is LipðF Þ < 1. instead. Such flows are called infinitesimal or continuous. We
consider two distinct types. The formulation of the first type
There is no analytically closed form for the inverse, but it comes from ordinary differential equations, and of the sec-
can be found numerically using fixed-point iterations ond type from stochastic differential equations.
(which, by the Banach theorem, converge if we assume
LipðF Þ < 1). 3.6.1 ODE-Based Methods
Controlling the Lipschitz constant of a neural network is
Consider an ODE as in Equation (33), where t 2 ½0; 1. Assum-
not simple. The specific architecture proposed by Behrmann
ing uniform Lipschitz continuity in x and continuity in t, the
et al. [7], called iResNet, uses a convolutional network for the
solution exists (at least, locally) and, given an initial condition
residual block. It constrains the spectral radius of each con-
xð0Þ ¼ z, is unique (Picard-Lindel€ of-Lipschitz-Cauchy theo-
volutional layer in this network to be less than one.
rem [5]). We denote the solution at each time t as Ft ðzÞ.
The Jacobian determinant of the iResNet cannot be com-
puted directly, so the authors propose to use a (biased) sto- Remark 8. At each time t, Ft ðÞ : RD ! RD is a diffeomor-
chastic estimate. The Jacobian of the residual connection g phism and satisfies the group law: Ft  Fs ¼ Ftþs . Mathe-
in Equation (29) is: Dg ¼ I þ DF . Because the function F is matically speaking, an ODE (33) defines a one-parameter
assumed to be Lipschitz with LipðF Þ < 1, one has: group of diffeomorphisms on RD . Such a group is called
Authorized licensed use limited to: CNR Biblioteca Centrale. Downloaded on October 18,2023 at 20:56:44 UTC from IEEE Xplore. Restrictions apply.
When t = 1, the diffeomorphism Φ_1(·) is called a time one map. The idea to model a normalizing flow as a time one map, y = g(z) = Φ_1(z), was presented by Chen et al. [15] under the name Neural ODE (NODE). From a deep learning perspective this can be seen as an "infinitely deep" neural network with input z, output y and continuous weights θ(t). The invertibility of such networks naturally comes from the theorem of the existence and uniqueness of the solution of the ODE.

Training these networks for a supervised downstream task can be done by the adjoint sensitivity method, which is the continuous analog of backpropagation. It computes the gradients of the loss function by solving a second (augmented) ODE backwards in time. For a loss L(x(t)), where x(t) is a solution of ODE (33), its sensitivity or adjoint is a(t) = dL/dx(t). This is the analog of the derivative of the loss with respect to a hidden layer. In a standard neural network, the backpropagation formula computes this derivative: dL/dh_n = (dL/dh_{n+1}) (dh_{n+1}/dh_n). For an "infinitely deep" neural network, this formula changes into an ODE:

    d a(t) / dt = − a(t) ∂F(x(t), θ(t)) / ∂x(t).    (35)

For density estimation learning, we do not have a loss, but instead seek to maximize the log likelihood. For normalizing flows, the change of variables formula is given by another ODE:

    d log(p(x(t))) / dt = − Tr( ∂F(x(t)) / ∂x(t) ).    (36)

Note that we no longer need to compute the determinant. To train the model and sample from p_Y we solve these ODEs, which can be done with any numerical ODE solver.

Grathwohl et al. [34] used the Hutchinson estimator to calculate an unbiased stochastic estimate of the trace term. This approach, which they termed FFJORD, reduces the complexity even further. Finlay, Jacobsen et al. [29] added two regularization terms into the loss function of FFJORD: the first term forces solution trajectories to follow straight lines with constant speed, and the second term is the Frobenius norm of the Jacobian. This regularization decreased the training time significantly and reduced the need for multiple GPUs. An interesting side-effect of using continuous ODE-type flows is that one needs fewer parameters to achieve similar performance. For example, Grathwohl et al. [34] show that for comparable performance on CIFAR10, FFJORD uses less than 2 percent as many parameters as Glow.
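A minimal sketch of Equations (33) and (36): the state and its log-density are integrated jointly with fixed-step Euler updates, using the exact Jacobian trace since the dimension is tiny. The dynamics function is an arbitrary illustrative choice; FFJORD instead uses adaptive solvers and a stochastic trace estimator.

```python
import numpy as np

# Sketch of a continuous flow: integrate the state (Equation (33)) together
# with its log-density (Equation (36)) in the generative direction.
D = 2
W = np.array([[0.3, -0.2], [0.1, 0.4]])

def F(x, t):
    return np.tanh(W @ x) * (1.0 - t)            # simple time-dependent dynamics

def trace_dF_dx(x, t):
    u = W @ x
    J = ((1.0 - np.tanh(u) ** 2)[:, None] * W) * (1.0 - t)   # exact Jacobian
    return np.trace(J)

def integrate(z, steps=1000):
    x = z.copy()
    log_p = -0.5 * z @ z - 0.5 * D * np.log(2 * np.pi)       # standard normal base
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        log_p -= trace_dF_dx(x, t) * dt          # d/dt log p(x(t)) = -Tr(dF/dx)
        x = x + F(x, t) * dt                     # Euler step for the state
    return x, log_p

y, log_py = integrate(np.random.randn(D))
print(y, log_py)
```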
Not all diffeomorphisms can be presented as a time one map of an ODE (see [3], [52]). For example, one necessary condition is that the map is orientation preserving, which means that the Jacobian determinant must be positive. This can be seen because the solution Φ_t is a (continuous) path in the space of diffeomorphisms from the identity map Φ_0 = Id to the time one map Φ_1. Since the Jacobian determinant of a diffeomorphism is nonzero, its sign cannot change along the path. Hence, a time one map must have a positive Jacobian determinant. For example, consider the map f : R → R such that f(x) = −x. It is obviously a diffeomorphism, but it can not be presented as a time one map of any ODE, because it is not orientation preserving.

Dupont et al. [23] suggested how one can improve Neural ODE in order to be able to represent a broader class of diffeomorphisms. Their model is called Augmented Neural ODE (ANODE). They add variables x̂(t) ∈ R^p and consider a new ODE:

    d/dt (x(t), x̂(t)) = F̂((x(t), x̂(t)), θ(t)),    (37)

with initial conditions x(0) = z and x̂(0) = 0. The addition of x̂(t) in particular gives freedom for the Jacobian determinant to remain positive. As was demonstrated in the experiments, ANODE is capable of learning distributions that the Neural ODE cannot, and the training time is shorter. Zhang et al. [106] proved that any diffeomorphism can be represented as a time one map of ANODE, and so this is a universal flow.

A similar ODE-based approach was taken by Salman et al. [83] in Deep Diffeomorphic Flows. In addition to modelling a path Φ_t(·) in the space of all diffeomorphic transformations, for t ∈ [0, 1], they proposed geodesic regularisation in which longer paths are punished.

3.6.2 SDE-Based Methods (Langevin Flows)

The idea of the Langevin flow is simple; we start with a complicated and irregular data distribution p_Y(y) on R^D, and then mix it to produce the simple base distribution p_Z(z). If this mixing obeys certain rules, then this procedure can be invertible. This idea was explored by Chen et al. [87], Jankowiak and Obermeyer [103], Rezende and Mohamed [78], Salimans et al. [84], Sohl-Dickstein et al. [81], Suykens et al. [50], Welling and Teh [14]. We provide a high-level overview of the method, including the necessary mathematical background.

A stochastic differential equation (SDE), or Itô process, describes the change of a random variable x ∈ R^D as a function of time t ∈ R_+:

    dx(t) = b(x(t), t) dt + σ(x(t), t) dB_t,    (38)

where b(x, t) ∈ R^D is the drift coefficient, σ(x, t) ∈ R^{D×D} is the diffusion coefficient, and B_t is D-dimensional Brownian motion. One can interpret the drift term as a deterministic change and the diffusion term as providing the stochasticity and mixing. Given some assumptions about these functions, the solution exists and is unique [72].

Given a time-dependent random variable x(t), we can consider its density function p(x, t), and this is also time dependent. If x(t) is a solution of Equation (38), its density function satisfies two partial differential equations describing the forward and backward evolution [72]. The forward evolution is given by the Fokker-Planck equation or Kolmogorov's forward equation:
3974 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 11, NOVEMBER 2021

Given a time-dependent random variable x(t) we can consider its density function p(x, t), which is also time dependent. If x(t) is a solution of Equation (38), its density function satisfies two partial differential equations describing the forward and backward evolution [72]. The forward evolution is given by the Fokker-Planck equation, or Kolmogorov's forward equation,

    \frac{\partial}{\partial t} p(x, t) = -\nabla_x \cdot \big(b(x, t)\, p(x, t)\big) + \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\big(D_{ij}(x, t)\, p(x, t)\big),    (39)

where D = \frac{1}{2}\sigma\sigma^T, with the initial condition p(·, 0) = p_Y(·). The reverse is given by Kolmogorov's backward equation,

    -\frac{\partial}{\partial t} p(x, t) = b(x, t) \cdot \nabla_x p(x, t) + \sum_{i,j} D_{ij}(x, t)\, \frac{\partial^2}{\partial x_i \partial x_j} p(x, t),    (40)

where 0 < t < T, and the terminal condition is p(·, T) = p_Z(·).
Asymptotically the Langevin flow can learn any distribution if one picks the drift and diffusion coefficients appropriately [87]. However, this result is not very practical, because one needs to know the (unnormalized) density function of the data distribution.

One can see that if the diffusion coefficient is zero, the Itô process reduces to the ODE (33), and the Fokker-Planck equation becomes Liouville's equation, which is connected to Equation (36) (see [15]). It is also equivalent to the form of the transport equation considered in [50] for stochastic optimization.
Sohl-Dickstein et al. [84] and Salimans et al. [81] suggested using MCMC methods to model the diffusion. They considered discrete time t = 0, ..., T. For each time t, x_t is a random variable, where x_0 = y is the data point and x_T = z is the base point. The forward transition probability q(x_t | x_{t-1}) is taken to be either a normal or a binomial distribution with trainable parameters. Kolmogorov's backward equation implies that the backward transition p(x_{t-1} | x_t) must have the same functional form as the forward transition (i.e., be either normal or binomial). Denote by q(x_0) = p_Y(y) the data distribution and by p(x_T) = p_Z(z) the base distribution. Applying the backward transition to the base distribution, one obtains a new density p(x_0), which one wants to match with q(x_0). Hence, the optimization objective is the log likelihood L = \int q(x_0) \log p(x_0)\, dx_0. This is intractable, but one can find a lower bound as in variational inference.
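To make the discrete-time construction concrete, here is a small NumPy sketch of the forward corruption chain. It is our own illustration and uses the common Gaussian parameterization q(x_t | x_{t-1}) = N(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) with an arbitrary noise schedule; the specific schedule and values are not taken from any of the cited papers.

```python
import numpy as np

def forward_diffusion(y, betas, rng=None):
    """Apply Gaussian forward transitions q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I).

    y: data point of shape (D,); betas: per-step noise variances.
    Returns the trajectory x_0, ..., x_T; with enough small steps, x_T is
    approximately a standard normal, i.e., the base distribution."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(y, dtype=float)
    traj = [x.copy()]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
        traj.append(x.copy())
    return np.stack(traj)

# e.g., 1000 small steps mixing a 2-D data point toward N(0, I)
traj = forward_diffusion(np.array([4.0, -1.0]), betas=np.full(1000, 0.005))
```

The learned model then parameterizes the backward transitions p(x_{t-1} | x_t) and is trained on the variational lower bound mentioned above.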
Several papers have worked explicitly with the SDE [14], [62], [63], [76], [96]. Chen et al. [14] use SDEs to create an interesting posterior for variational inference. They sample a latent variable z_0 conditioned on the input x, and then evolve z_0 with an SDE. In practice this evolution is computed by discretization. By analogy to Neural ODEs, Neural Stochastic Differential Equations were proposed [76], [96]. In this approach the coefficients of the SDE are modelled as neural networks, and black-box SDE solvers are used for inference. To train a Neural SDE one needs an analog of backpropagation; Tzen and Raginsky [96] proposed the use of Kunita's theory of stochastic flows. Following this, Li et al. [62] derived the adjoint SDE whose solution gives the gradient of the original Neural SDE.

Note that even though Langevin flows exhibit nice mathematical properties, they have not found practical applications. In particular, none of these methods has been tested on the baseline datasets for flows.

4 DATASETS AND PERFORMANCE
In this section we discuss datasets commonly used for training and testing normalizing flows. We provide comparison tables of the results as they were presented in the corresponding papers. The list of the flows for which we report performance results is given in Table 1.

TABLE 1
List of Normalizing Flows for Which We Show Performance Results

4.1 Tabular Datasets
We describe the datasets as they were preprocessed in [74] (Table 2).² These datasets are relatively small and so are a reasonable first test of unconditional density estimation models. All datasets were cleaned and de-quantized by adding uniform noise, so they can be considered samples from an absolutely continuous distribution.

2. See https://github.com/gpapamak/maf

TABLE 2
Tabular Datasets: Data Dimensionality and Number of Training Examples

          POWER   GAS    HEPMASS   MINIBOONE   BSDS300
Dims      6       8      21        43          63
#Train    1.7M    800K   300K      30K         1M

We use a collection of datasets from the UC Irvine machine learning repository [22].

1) POWER: a collection of electric power consumption measurements in one house over 47 months.
2) GAS: a collection of measurements from chemical sensors in several gas mixtures.
3) HEPMASS: measurements from high-energy physics experiments aiming to detect particles with unknown mass.
4) MINIBOONE: measurements from the MiniBooNE experiment for observing neutrino oscillations.

In addition we consider the Berkeley segmentation dataset [66], which contains segmentations of natural images; [74] extracted 8 × 8 random monochrome patches from it.

In Table 3 we compare the performance of flows on these tabular datasets.
TABLE 3
Average Test Log-Likelihood (in Nats) for Density Estimation on Tabular Datasets (Higher Is Better)

                        POWER          GAS            HEPMASS          MINIBOONE        BSDS300
MAF(5)                  0.14 ± 0.01    9.07 ± 0.02    -17.70 ± 0.02    -11.75 ± 0.44    155.69 ± 0.28
MAF(10)                 0.24 ± 0.01    10.08 ± 0.02   -17.73 ± 0.02    -12.24 ± 0.45    154.93 ± 0.28
MAF MoG                 0.30 ± 0.01    9.59 ± 0.02    -17.39 ± 0.02    -11.68 ± 0.44    156.36 ± 0.28
RealNVP(5)              -0.02 ± 0.01   4.78 ± 1.8     -19.62 ± 0.02    -13.55 ± 0.49    152.97 ± 0.28
RealNVP(10)             0.17 ± 0.01    8.33 ± 0.14    -18.71 ± 0.02    -13.84 ± 0.52    153.28 ± 1.78
Glow                    0.17           8.15           -18.92           -11.35           155.07
FFJORD                  0.46           8.59           -14.92           -10.43           157.40
NAF(5)                  0.62 ± 0.01    11.91 ± 0.13   -15.09 ± 0.40    -8.86 ± 0.15     157.73 ± 0.04
NAF(10)                 0.60 ± 0.02    11.96 ± 0.33   -15.32 ± 0.23    -9.01 ± 0.01     157.43 ± 0.30
UMNN                    0.63 ± 0.01    10.89 ± 0.70   -13.99 ± 0.21    -9.67 ± 0.13     157.98 ± 0.01
SOS(7)                  0.60 ± 0.01    11.99 ± 0.41   -15.15 ± 0.10    -8.90 ± 0.11     157.48 ± 0.41
Quadratic Spline (C)    0.64 ± 0.01    12.80 ± 0.02   -15.35 ± 0.02    -9.35 ± 0.44     157.65 ± 0.28
Quadratic Spline (AR)   0.66 ± 0.01    12.91 ± 0.02   -14.67 ± 0.03    -9.72 ± 0.47     157.42 ± 0.28
Cubic Spline            0.65 ± 0.01    13.14 ± 0.02   -14.59 ± 0.02    -9.06 ± 0.48     157.24 ± 0.07
RQ-NSF(C)               0.64 ± 0.01    13.09 ± 0.02   -14.75 ± 0.03    -9.67 ± 0.47     157.54 ± 0.28
RQ-NSF(AR)              0.66 ± 0.01    13.09 ± 0.02   -14.01 ± 0.03    -9.22 ± 0.48     157.31 ± 0.28

A number in parentheses next to a flow indicates the number of layers. MAF MoG is MAF with a mixture of Gaussians as the base density.
For experimental details, see the following papers: RealNVP [20] and MAF [74], Glow [57] and FFJORD [34], NAF [45], UMNN [102], SOS [49], Quadratic Spline flow and RQ-NSF [25], and Cubic Spline Flow [24].

Table 3 shows that universal flows (NAF, SOS, Splines) demonstrate relatively better performance.

4.2 Image Datasets
These datasets are summarized in Table 4. They are of increasing complexity and are preprocessed as in [20] by dequantizing with uniform noise (except for Flow++).

TABLE 4
Image Datasets: Data Dimensionality and Number of Training Examples for MNIST, CIFAR-10, ImageNet32 and ImageNet64 Datasets

          MNIST   CIFAR-10   ImNet32   ImNet64
Dims      784     3072       3072      12288
#Train    50K     90K        1.3M      1.3M

Table 5 compares performance on the image datasets for unconditional density estimation. For experimental details, see: RealNVP for CIFAR-10 and ImageNet [20], Glow for CIFAR-10 and ImageNet [57], RealNVP and Glow for MNIST, MAF and FFJORD [34], SOS [49], RQ-NSF [25], UMNN [102], iResNet [7], Residual Flow [16], and Flow++ [41].

TABLE 5
Average Test Negative Log-Likelihood (in Bits per Dimension) for Density Estimation on Image Datasets (Lower Is Better)

                MNIST   CIFAR-10   ImNet32   ImNet64
RealNVP         1.06    3.49       4.28      3.98
Glow            1.05    3.35       4.09      3.81
MAF             1.89    4.31
FFJORD          0.99    3.40
SOS             1.81    4.18
RQ-NSF(C)               3.38                 3.82
UMNN            1.13
iResNet         1.06    3.45
Residual Flow   0.97    3.28       4.01      3.76
Flow++                  3.08       3.86      3.69

As of this writing, Flow++ [41] is the best performing approach. Besides using more expressive coupling layers (see Section 3.4.4.3) and a different architecture for the conditioner, variational dequantization was used instead of uniform. An ablation study shows that the change in the dequantization approach gave the most significant improvement.
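For reference, the bits-per-dimension numbers in Table 5 are a simple transformation of a model's log-density. The sketch below is our own and assumes the most common convention: 8-bit images that are uniformly dequantized and rescaled to [0, 1]; the function names are ours.

```python
import numpy as np

def uniform_dequantize(pixels, rng=None):
    """Map 8-bit integer pixels to continuous values in [0, 1) by adding uniform noise."""
    rng = np.random.default_rng() if rng is None else rng
    return (pixels.astype(np.float64) + rng.uniform(size=pixels.shape)) / 256.0

def bits_per_dim(log_px, num_dims):
    """Convert a log-density (in nats) over dequantized data in [0, 1]^D
    to the bits-per-dimension value reported for 8-bit images."""
    return -log_px / (num_dims * np.log(2.0)) + 8.0
```

The +8 term accounts for the rescaling by 1/256; conventions differ slightly between papers, so reported numbers are not always exactly comparable.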
5 DISCUSSION AND OPEN PROBLEMS

5.1 Inductive Biases
5.1.1 Role of the Base Measure
The base measure of a normalizing flow is generally assumed to be a simple distribution (e.g., uniform or Gaussian). However, this does not need to be the case. Any distribution from which we can easily draw samples and for which we can compute the log probability density function is possible, and the parameters of this distribution can be learned during training.

Theoretically the base measure should not matter: any distribution for which a CDF can be computed can be simulated by applying the inverse CDF to draws from the uniform distribution. However, in practice, if structure is provided in the base measure, the resulting transformations may become easier to learn. In other words, the choice of base measure can be viewed as a form of prior or inductive bias on the distribution and may be useful in its own right. For example, a trade-off between the complexity of the generative transformation and the form of the base measure was explored in [48] in the context of modelling tail behaviour.
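The inverse-CDF argument above can be made concrete in a few lines. This sketch is our own; it uses SciPy's Student-t purely as an arbitrary example of a non-Gaussian base measure.

```python
import numpy as np
from scipy import stats

def sample_via_inverse_cdf(base, n, rng=None):
    """Simulate an arbitrary 1-D base distribution by pushing uniform draws
    through its inverse CDF (percent point function)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n)
    return base.ppf(u)

# e.g., a heavy-tailed Student-t base instead of a Gaussian one
samples = sample_via_inverse_cdf(stats.t(df=3), n=10_000)
```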
5.1.2 Form of Diffeomorphisms
The majority of the flows explored are triangular flows (with either coupling or autoregressive architectures). Residual networks and Neural ODEs are also being actively investigated and applied. A natural question to ask is: are there other ways to model diffeomorphisms which are efficient for computation? What inductive bias does the architecture impose? For instance, Spantini, Bigoni, and Marzouk [85] investigate the relation between the sparsity of a triangular flow and the Markov properties of the target distribution.
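As a reminder of why triangular (coupling or autoregressive) maps are so convenient, the toy sketch below, which is our own construction and not taken from any cited model, builds a map in which output i depends only on inputs 1..i; its Jacobian is lower triangular, so the log-determinant is just the sum of the log-diagonal entries.

```python
import torch

def triangular_map(x):
    """A toy autoregressive (triangular) transformation: each output dimension
    depends only on input dimensions with index <= its own, so the Jacobian is
    lower triangular and its determinant is the product of the diagonal."""
    z1 = x[:, 0:1]
    z2 = x[:, 1:2] * torch.exp(torch.tanh(x[:, 0:1]))
    z3 = x[:, 2:3] + torch.sin(x[:, 0:1] + x[:, 1:2])
    return torch.cat([z1, z2, z3], dim=1)

x = torch.randn(1, 3)
J = torch.autograd.functional.jacobian(lambda v: triangular_map(v).squeeze(0), x)
print(J.squeeze())   # a lower-triangular 3x3 matrix
```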
A related question concerns the best way to model conditional normalizing flows when one needs to learn a conditional probability distribution. Trippe and Turner [95] suggested using different flows for each condition, but this approach does not leverage weight sharing, and so is inefficient in terms of memory and data usage. Atanov, Volokhova, Ashukha, Sosnovik, and Vetrov [6] proposed using affine coupling layers where the parameters θ depend on the condition. Conditional distributions are useful in particular for time series modelling, where one needs to find p(y_t | y_{<t}) [60].
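A minimal sketch of the idea of condition-dependent coupling parameters follows. This is our own generic PyTorch illustration, not the architecture of [6]; the class and argument names are ours.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Affine coupling layer whose scale/shift parameters depend on a conditioning vector.

    The input y is split into (y_a, y_b); y_a passes through unchanged, while y_b is
    transformed elementwise with parameters predicted from (y_a, condition)."""
    def __init__(self, dim, cond_dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, y, cond):
        y_a, y_b = y[:, :self.d], y[:, self.d:]
        s, t = self.net(torch.cat([y_a, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales in a stable range
        z_b = y_b * torch.exp(s) + t
        log_det = s.sum(dim=1)                 # log|det J| of the transformation
        return torch.cat([y_a, z_b], dim=1), log_det

    def inverse(self, z, cond):
        z_a, z_b = z[:, :self.d], z[:, self.d:]
        s, t = self.net(torch.cat([z_a, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)
        y_b = (z_b - t) * torch.exp(-s)
        return torch.cat([z_a, y_b], dim=1)
```

Because the conditioner network receives the conditioning vector as an extra input, a single set of weights is shared across all conditions, which addresses the memory and data inefficiency mentioned above.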
5.1.3 Loss Function
The majority of existing flows are trained by minimization of the KL-divergence between the source and the target distributions (or, equivalently, by log-likelihood maximization). However, other losses could be used, which would put normalizing flows in the broader context of optimal transport theory [99]. Interesting work has been done in this direction, including Flow-GAN [36] and the minimization of the Wasserstein distance as suggested by [4], [90].
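For completeness, the standard maximum-likelihood objective referred to above looks as follows. This is a generic sketch; forward_fn and base_log_prob are placeholders for any invertible transformation and base density.

```python
import torch

def flow_nll(x, forward_fn, base_log_prob):
    """Average negative log-likelihood (in nats) of data x under a flow.

    forward_fn maps data to base samples and returns the log|det| of its Jacobian;
    base_log_prob evaluates the log-density of the base distribution."""
    z, log_det = forward_fn(x)
    log_px = base_log_prob(z) + log_det        # change of variables formula
    return -log_px.mean()

# Example with a standard normal base distribution:
standard_normal_log_prob = lambda z: torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=1)
```

Minimizing this quantity is equivalent to minimizing the KL-divergence from the data distribution to the model, which is the sense in which adversarial or Wasserstein objectives change the geometry of the training problem.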
5.2 Generalisation to Non-Euclidean Spaces
5.2.1 Flows on Manifolds
Modelling probability distributions on manifolds has applications in many fields including robotics, molecular biology, optics, fluid mechanics, and plasma physics [30], [79]. How best to construct a normalizing flow on a general differentiable manifold remains an open question. One approach to applying the normalizing flow framework on manifolds is to find a base distribution on the Euclidean space and transfer it to the manifold of interest. There are two main approaches: 1) embed the manifold in the Euclidean space and "restrict" the measure, or 2) induce the measure from the tangent space to the manifold. We will briefly discuss each in turn.

One can also use differential structure to define measures on manifolds [86]. Every differentiable and orientable manifold M has a volume form ω; then for a Borel subset U ⊆ M one can define its measure as μ_ω(U) = ∫_U ω. A Riemannian manifold has a natural volume form given by its metric tensor: ω = √|g| dx^1 ∧ ⋯ ∧ dx^D. Gemici et al. [30] explore this approach considering an immersion of a D-dimensional manifold M into a Euclidean space, f : M → R^N, where N ≥ D. In this case one pulls back the Euclidean metric, and locally a volume form on M is ω = √(det((Df)^T Df)) dx^1 ∧ ⋯ ∧ dx^D, where Df is the Jacobian matrix of f. Rezende et al. [79] pointed out that the realization of this method is computationally hard, and proposed an alternative construction of flows on tori and spheres using diffeomorphisms of the one-dimensional circle as building blocks.

As another option, one can consider exponential maps exp_x : T_x M → M, mapping the tangent space of a Riemannian manifold (at some point x) to the manifold itself. If the manifold is geodesically complete, this map is globally defined, and locally it is a diffeomorphism. A tangent space has the structure of a vector space, so one can choose an isomorphism T_x M ≅ R^D. Then for a base distribution with density p_Z on R^D, one can push it forward to M via the exponential map. Additionally, applying a normalizing flow to the base measure before pushing it to M helps to construct multimodal distributions on M. If the manifold M is a hyperbolic space, the exponential map is a global diffeomorphism and all the formulas can be written explicitly. Using this method, Ovinnikov [73] introduced the Gaussian reparameterization trick in a hyperbolic space, and Bose et al. [10] constructed hyperbolic normalizing flows.

Instead of a Riemannian structure, one can impose a Lie group structure on a manifold G. In this case there also exists an exponential map exp : 𝔤 → G, mapping the Lie algebra to the Lie group, and one can use it to construct a normalizing flow on G. Falorsi et al. [28] introduced an analog of the Gaussian reparameterization trick for a Lie group.
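To illustrate the "restrict the measure" route numerically, the sketch below, a toy example of our own, computes the volume-correction factor √det((Df)^T Df) for the usual spherical-coordinate embedding of the unit sphere; dividing an intrinsic-coordinate density by this factor gives a density with respect to the volume form on the embedded manifold.

```python
import torch

def sphere_embedding(u):
    """Embed intrinsic coordinates u = (theta, phi) as a point on the unit sphere in R^3."""
    theta, phi = u[0], u[1]
    return torch.stack([torch.sin(theta) * torch.cos(phi),
                        torch.sin(theta) * torch.sin(phi),
                        torch.cos(theta)])

def log_volume_factor(f, u):
    """0.5 * log det((Df)^T Df): subtract this from an intrinsic log-density
    to obtain the log-density with respect to the volume form on the manifold."""
    J = torch.autograd.functional.jacobian(f, u)   # shape (N, D)
    return 0.5 * torch.logdet(J.T @ J)

u = torch.tensor([0.7, 1.2])
# For the unit sphere this factor equals log|sin(theta)|:
print(log_volume_factor(sphere_embedding, u).item(), torch.log(torch.sin(u[0])).item())
```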
5.2.2 Discrete Distributions
Modelling distributions over discrete spaces is important in a range of problems; however, the generalization of normalizing flows to discrete distributions remains an open problem in practice. Discrete latent variables were used by Dinh et al. [21] as an auxiliary tool to pushforward continuous random variables along piecewise-bijective maps (see Section 3.4.4.7). However, can we define normalizing flows if one or both of our distributions are discrete? This could be useful for many applications including natural language modelling, graph generation and others.

To this end, Tran et al. [94] model bijective functions on a finite set and show that, in this case, the change of variables is given by the formula p_Y(y) = p_Z(g^{-1}(y)), i.e., with no Jacobian term (compare with Definition 1). For backpropagation of functions with discrete variables they use the straight-through gradient estimator [8]. However this method is not scalable to distributions with large numbers of elements.

Alternatively, Hoogeboom et al. [43] model bijections on Z^D directly with additive coupling layers. Other approaches transform a discrete variable into a continuous latent variable with a variational autoencoder, and then apply normalizing flows in the continuous latent space [101], [108].

A different approach is dequantization (i.e., adding noise to discrete data to make it continuous), which can be used with ordinal variables, e.g., discretized pixel intensities. The noise can be uniform, but other forms are possible, and the dequantization can even be learned as a latent variable model [41], [44]. Hoogeboom et al. [44] analyzed how different choices of dequantization objectives and dequantization distributions affect the performance.
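A minimal sketch of a discrete bijection in the spirit of [43], [94] follows; it is our own illustration, and the shift predictor is an arbitrary stand-in for a neural network. The map is an additive coupling acting modulo K on integer data; since it is a bijection on the finite set, no Jacobian term is needed.

```python
import numpy as np

def discrete_coupling_forward(x, shift_fn, K=256):
    """Additive coupling on integer data: the second half of x is shifted (mod K)
    by an integer amount predicted from the first half, so the map is a bijection
    on {0, ..., K-1}^D and p_Y(y) = p_Z(g^{-1}(y)) with no Jacobian term."""
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    t = np.round(shift_fn(x_a)).astype(int)        # integer-valued shift
    y_b = (x_b + t) % K
    return np.concatenate([x_a, y_b], axis=-1)

def discrete_coupling_inverse(y, shift_fn, K=256):
    d = y.shape[-1] // 2
    y_a, y_b = y[..., :d], y[..., d:]
    t = np.round(shift_fn(y_a)).astype(int)
    x_b = (y_b - t) % K
    return np.concatenate([y_a, x_b], axis=-1)

# e.g., with a fixed linear map standing in for the shift-predicting network
shift_fn = lambda a: a @ np.array([[1.0, 2.0], [0.5, -1.0]])
x = np.array([[3, 7, 250, 12]])
assert np.array_equal(discrete_coupling_inverse(discrete_coupling_forward(x, shift_fn), shift_fn), x)
```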
ACKNOWLEDGMENTS
The authors would like to thank Matt Taylor and Kry Yik-Chau Lui for their insightful comments.

REFERENCES
[1] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3165–3173.
[2] J. Agnelli, M. Cadeiras, E. Tabak, T. Cristina, and E. Vanden-Eijnden, [30] M. C. Gemici, D. Rezende, and S. Mohamed, “Normalizing flows
“Clustering and classification through normalizing flows in feature on riemannian manifolds,” 2016, arXiv:1611.02304.
space,” Multiscale Model. Simul., vol. 8, pp. 1784–1802, 2010. [31] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “MADE:
[3] J. Arango and A. Gómez, “Diffeomorphisms as time one maps,” Masked autoencoder for distribution estimation,” in Proc. 32nd
Aequationes Math., vol. 64, pp. 304–314, 2002. Int. Conf. Mach. Learn., 2015, pp. 881–889.
[4] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein Generative [32] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, “The revers-
Adversarial Networks,” in Proc. 34th Int. Conf. Mach. Learn., 2017, ible residual network: Backpropagation without storing
pp. 214–223. activations,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017,
[5] V. Arnold, Ordinary Differential Equations. Cambridge, MA, USA: pp. 2211–2221.
The MIT Press, 1978. [33] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th
[6] A. Atanov, A. Volokhova, A. Ashukha, I. Sosnovik, and Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
D. Vetrov, “Semi-conditional normalizing flows for semi-super- [34] W. Grathwohl, R. T. Q Chen, J. Bettencourt, I. Sutskever, and
vised learning,” in Workshop Invertible Neural Nets Normalizing D. Duvenaud, “FFJORD: Free-form continuous dynamics for
Flows (ICML), 2019. scalable reversible generative models,” in Proc. Int. Conf. Learn.
[7] J. Behrmann, D. Duvenaud, and J.-H. Jacobsen, “Invertible resid- Representations, 2019.
ual networks,” in Proc. 36th Int. Conf. Mach. Learn., 2019. [35] J. Gregory and R. Delbourgo, “Piecewise rational quadratic inter-
[8] Y. Bengio, N. Leonard, and A. Courville, “Estimating or propa- polation to monotonic data,” IMA J. Numer. Anal., vol. 2, no. 2,
gating gradients through stochastic neurons for conditional pp. 123–130, 1982.
computation,” 2013, arXiv:1308.3432. [36] A. Grover, M. Dhar, and S. Ermon, “Flow-GAN: Combining
[9] V. Bogachev, A. Kolesnikov, and K. Medvedev, “Triangular maximum likelihood and adversarial learning in generative
transformations of measures,” Sbornik Math., vol. 196, no. 3/4, models,” in Proc. AAAI Conf. Artif. Intell., 2018.
pp. 309–335, 2005. [37] E. Haber, L. Ruthotto, and E. Holtham, “Learning across scales -
[10] A. J. Bose, A. Smofsky, R. Liao, P. Panangaden, and W. L. Hamilton, A multiscale method for convolution neural networks,” in Proc.
“Latent variable modelling with hyperbolic normalizing flows,” AAAI Conf. Artif. Intell., 2018.
2020, arXiv: 2002.06336. [38] L. Hasenclever, J. M. Tomczak, R. Van Den Berg, and M. Welling,
[11] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. [38] L. Hasenclever, J. M. Tomczak, R. Van Den Berg, and M. Welling,
Bengio, “Generating sentences from a continuous space,” in Proc. in Workshop Bayesian Deep Learn. (NeurIPS), 2017.
20th SIGNLL Conf. Comput. Natural Lang. Learn., 2015, pp. 10–21. [39] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
[12] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and Surpassing human-level performance on ImageNet classification,”
E. Holtham, “Reversible architectures for arbitrarily deep resid- in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1026–1034.
ual neural networks,” in Proc. AAAI Conf. Artif. Intell., 2018, [40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
pp. 2811–2818. image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
[13] B. Chang, M. Chen, E. Haber, and E. H. Chi, Recognit., 2016, pp. 770–778.
“AntisymmetricRNN: A dynamical system view on recurrent [41] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel, “Flow++:
neural networks,” in Proc. Int. Conf. Learn. Representations, 2019. Improving flow-based generative models with variational
[14] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin, dequantization and architecture design,” in Proc. 36th Int. Conf.
“Continuous-time flows for efficient inference and density Mach. Learn., 2019, pp. 2722–2730.
estimation,” in Proc. Int. Conf. Mach. Learn., 2018. [42] E. Hoogeboom, R. V. D. Berg, and M. Welling, “Emerging convo-
[15] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, lutions for generative normalizing flows,” in Proc. 36th Int. Conf.
“Neural ordinary differential equations,” in Proc. 32nd Int. Conf. Mach. Learn., 2019, pp. 2771–2780.
Neural Inf. Process. Syst., 2018, pp. 6572–6583. [43] E. Hoogeboom, J. W. Peters, R. van den Berg, and M. Welling,
[16] R. T. Q. Chen, J. Behrmann, D. Duvenaud, and J.-H. Jacobsen, “Integer discrete flows and lossless compression,” in Proc. 33rd
“Residual flows for invertible generative modeling,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019.
33rd Int. Conf. Neural Inf. Process. Syst., 2019. [44] E. Hoogeboom, T. S. Cohen, and J. M. Tomczak, “Learning dis-
[17] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, crete distributions by dequantization,” 2020, arXiv: 2001.11235.
and A. A. Bharath, “Generative adversarial networks: An overview,” [45] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville, “Neural
IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, Jan. 2018. autoregressive flows,” in Proc. 35th Int. Conf. Mach. Learn., 2018,
[18] H. P. Das, P. Abbeel, and C. J. Spanos, “Dimensionality reduction pp. 2078–2087.
flows,” 2019, arXiv: 1908.01686. [46] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
[19] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear indepen- network training by reducing internal covariate shift,” in Proc.
dent components estimation,” in Proc. Int. Conf. Learn. Representa- 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
tions Workshop, 2015. [47] J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon, “i-RevNet: Deep
[20] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation invertible networks,” in Proc. Int. Conf. Learn. Representations, 2018.
using real NVP,” in Proc. Int. Conf. Learn. Representations, 2017. [48] P. Jaini, I. Kobyzev, M. Brubaker, and Y. Yu, “Tails of triangular
[21] L. Dinh, J. Sohl-Dickstein, R. Pascanu, and H. Larochelle, “A flows,” 2019, arXiv: 1907.04481.
RAD approach to deep mixture models,” in Proc. Int. Conf. Learn. [49] P. Jaini, K. A. Selby, and Y. Yu, “Sum-of-squares polynomial
Representations Workshop, 2019. flow,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 3009–3018.
[22] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [50] M. Jankowiak and F. Obermeyer, “Pathwise derivatives beyond
[23] E. Dupont, A. Doucet, and Y. W. Teh, “Augmented neural ODEs,” in the reparameterization trick,” in Proc. 35th Int. Conf. Mach. Learn.,
Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 3140–3150. 2018, pp. 2235–2244.
[24] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, “Cubic- [51] G. Kanwar et al., “Equivariant flow-based sampling for lattice
spline flows,” in Workshop Invertible Neural Networks Normalizing gauge theory,” 2020, arXiv: 2003.06413.
Flows (ICML), 2019. [52] A. Katok and B. Hasselblatt, Introduction to the Modern Theory of
[25] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, “Neural Dynamical Systems. New York, NY, USA: Cambridge Univ. Press,
spline flows,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, 1995.
pp. 7511–7522. [53] S. Kim, S. Gil Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet: A
[26] Weinan E, “A proposal on machine learning via dynamical sys- generative flow for raw audio,” in Proc. 36th Int. Conf. Mach.
tems,” Commun. Math. Statist., vol. 5, pp. 1–11, 2017. Learn., 2018, pp. 3370–3378.
[27] P. Esling, N. Masuda, A. Bardet, R. Despres, and [54] D. P. Kingma and M. Welling, “Auto-encoding variational
A. Chemla-Romeu-Santos, “Universal audio synthesizer control bayes,” in Proc. 2nd Int. Conf. Learn. Representations, 2014.
with normalizing flows,” 2019, arXiv: 1907.00971. [55] D. P. Kingma and M. Welling, “An introduction to variational
[28] L. Falorsi, P. de Haan, T. R. Davidson, and P. Forre, autoencoders,” 2019, arXiv: 1906.02691.
“Reparameterizing distributions on lie groups,” 2019, arXiv: [56] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever,
1903.02958. and M. Welling, “Improved variational inference with inverse
[29] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman, autoregressive flow,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
“How to train your neural ODE,” 2020, arXiv: 2002.02798. 2016, pp. 4743–4751.
[57] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with [83] H. Salman, P. Yadollahpour, T. Fletcher, and N. Batmanghelich,
invertible 1x1 convolutions,” in Proc. Int. Conf. Neural Inf. Process. “Deep diffeomorphic normalizing flows,” 2018, arXiv: 1810.03256.
Syst., 2018, pp. 10 215–10 224. [84] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S.
[58] J. Köhler, L. Klein, and F. Noé, “Equivariant flows: Sampling [84] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S.
configurations for multi-body systems with symmetric ener- thermodynamics,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp.
gies,” in Workshop Mach. Learn. Physical Sciences (NeurIPS), 2256–2265.
2019. [85] A. Spantini, D. Bigoni, and Y. Marzouk, “Inference via low-dimen-
[59] D. Koller and N. Friedman, Probabilistic Graphical Models. sional couplings,” J. Mach. Learn. Res., vol. 19, pp. 2639–2709,
Cambridge, MA, USA: MIT Press, 2009. Mar. 2017.
[60] M. Kumar et al., “VideoFlow: A flow-based generative model for [86] M. Spivak, Calculus on Manifolds: A Modern Approach to Classical
video,” in Workshop Invertible Neural Nets Normalizing Flows (ICML), Theorems of Advanced Calculus. San Francisco, CA, USA: Print, 1965.
2019. [87] J. Suykens, H. Verrelst, and J. Vandewalle, “On-line learning
[61] P. M. Laurence, R. J. Pignol, and E. G. Tabak, “Constrained Fokker-Planck machine,” Neural Process. Lett., vol. 7, pp. 81–89,
density estimation,” in Proc. Wolfgang Pauli Inst. Conf. Energy 1998.
Commodity Trading, 2014, pp. 259–284. [88] E. G. Tabak and C. V. Turner, “A family of nonparametric den-
[62] X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. Duvenaud, “Scalable sity estimation algorithms,” Commun. Pure Appl. Math., vol. 66,
gradients for stochastic differential equations,” 2020, arXiv: no. 2, pp. 145–164, 2013.
2001.01328. [89] E. G. Tabak and E. Vanden-Eijnden, “Density estimation by dual
[63] A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, and ascent of the log-likelihood,” Commun. Math. Sci., vol. 8, no. 1,
F.-R. St€oter, “Sliced-wasserstein flows: Nonparametric genera- pp. 217–233, 2010.
tive modeling via optimal transport and diffusions,” in Proc. 36th [90] I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Sch€olkopf, “Wasserstein
Int. Conf. Mach. Learn., 2019, pp. 4104–4113. auto-encoders,” in Proc. Int. Conf. Learn. Representations, 2018.
[64] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities [91] J. Tomczak and M. Welling, “Improving variational auto-
improve neural network acoustic models,” in Proc. Int. Conf. encoders using convex combination linear inverse autoregressive
Mach. Learn., 2013. flow,” Benelearn, 2017.
[65] K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe, “GraphNVP: [92] J. M. Tomczak and M. Welling, “Improving variational auto-
An invertible flow model for generating molecular graphs,” encoders using householder flow,” 2016, arXiv:1611.09630.
2019, arXiv: 1905.11600. [93] A. Touati, H. Satija, J. Romoff, J. Pineau, and P. Vincent,
[66] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of “Randomized value functions via multiplicative normalizing
human segmented natural images and its application to evaluat- flows,” in Proc. Conf. Uncertainty Artif. Intell., 2019.
ing segmentation algorithms and measuring ecological [94] D. Tran, K. Vafa, K. Agrawal, L. Dinh, and B. Poole, “Discrete
statistics,” in Proc. 8th Int. Conf. Comput. Vis., 2001, pp. 416–423. flows: Invertible generative models of discrete data,” in Proc. Int.
[67] B. Mazoure, T. Doan, A. Durand, J. Pineau, and R. D. Hjelm, Conf. Learn. Representations Workshop, 2019.
“Leveraging exploration in off-policy algorithms via normalizing [95] B. L. Trippe and R. E. Turner, “Conditional density estimation
flows,” in Proc. 3rd Conf. Robot Learn., 2019. with Bayesian normalising flows,” in Workshop Bayesian Deep
[68] K. V. Medvedev, “Certain properties of triangular transformations of Learn. (NeurIPS), 2017.
measures,” Theory Stochastic Processes, vol. 14, no. 30, pp. 95–99, 2008. [96] B. Tzen and M. Raginsky, “Neural stochastic differential equa-
[69] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novak, tions: Deep latent gaussian models in the diffusion limit,” 2019,
“Neural importance sampling,” ACM Trans. Graph., vol. 38, 2018, arXiv: 1905.09883.
Art. no. 145. [97] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling,
[70] P. Nadeem Ward, A. Smofsky, and A. Joey Bose, “Improving “Sylvester normalizing flows for variational inference,” in Proc.
exploration in soft-actor-critic with normalizing flows policies,” 34th Conf. Uncertainty Artif. Intell., 2018.
in Workshop Invertible Neural Nets Normalizing Flows (ICML), [98] A. van den Oord et al., “Parallel wavenet: Fast high-fidelity
2019. speech synthesis,” in Proc. 35th Int. Conf. Mach. Learn., 2017,
[71] F. Noé, S. Olsson, J. Köhler, and H. Wu, “Boltzmann generators: “Sylvester normalizing flows for variational inference,” in Proc.
Sampling equilibrium states of many-body systems with deep [99] C. Villani, Topics in Optimal Transportation (Graduate Studies in
learning,” Science, vol. 365, 2019, Art. no. eaaw1147. Mathematics 58). Providence, RI, USA: American Mathematical
[72] B. Øksendal, Stochastic Differential Equations (3rd Ed.): An Intro- speech synthesis,” in Proc. 35th Int. Conf. Mach. Learn., 2017,
duction With Applications. Berlin, Germany: Springer, 1992. [100] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F. Yue Wang,
[73] I. Ovinnikov, “Poincaré Wasserstein autoencoder,” in Workshop IEEE/CAA J. Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017.
on Bayesian Deep Learning, NeurIPS, 2018. IEEE/CAA J. Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017.
[74] G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autore- [101] P. Z. Wang and W. Y. Wang, “Riemannian normalizing flow on
gressive flow for density estimation,” in Proc. 31st Int. Conf. variational wasserstein autoencoder for text modeling,” 2019,
Neural Inf. Process. Syst., 2017, pp. 2335–2344. arXiv: 1904.02399.
[75] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and [102] A. Wehenkel and G. Louppe, “Unconstrained monotonic neural
B. Lakshminarayanan, “Normalizing flows for probabilistic networks,” 2019, arXiv: 1908.05164.
modeling and inference,” 2019, arXiv: 1912.02762. [103] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gra-
[76] S. Peluchetti and S. Favaro, “Neural stochastic differential equa- dient langevin dynamics,” in Proc. 28th Int. Conf. Mach. Learn.,
tions,” 2019, arXiv: 1905.11065. 2011, pp. 681–688.
[77] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based [104] P. Wirnsberger et al., “Targeted free energy estimation via
generative network for speech synthesis,” in Proc. IEEE Int. Conf. learned mappings,” 2020, arXiv: 2002.04913.
Acoust. Speech Signal Process., 2019, pp. 3617–3621. [105] K. W. K. Wong, G. Contardo, and S. Ho, “Gravitational wave
[78] D. J. Rezende and S. Mohamed, “Variational inference with nor- population inference with deep flow-based generative network,”
malizing flows,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, 2020, arXiv: 2002.09491.
pp. 1530–1538. [106] H. Zhang, X. Gao, J. Unterman, and T. Arodz, “Approximation
[79] D. J. Rezende et al., “Normalizing flows on tori and spheres,” capabilities of neural ordinary differential equations,” 2019,
2020, arXiv: 2002.02428. arXiv: 1907.12998.
[80] O. Rippel and R. P. Adams, “High-dimensional probability esti- [107] G. Zheng, Y. Yang, and J. Carbonell, “Convolutional normalizing
mation with deep density models,” 2013, arXiv:1302.5125. flows,” in Workshop Theoretical Foundations Applications Deep Gen-
[81] T. Salimans, D. P. Kingma, and M. Welling, gradient langevin dynamics,” in Proc. 28th Int. Conf. Mach. Learn.,
“Markov chain Monte Carlo and variational inference: Bridging [108] Z. M. Ziegler and A. M. Rush, “Latent normalizing flows for discrete
the gap,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 1218– sequences,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7673–7682.
1226.
[82] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford,
and X. Chen, “Improved techniques for training GANs,” in Proc.
30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2234–2242.

Ivan Kobyzev received the master's degree in mathematical physics from St Petersburg State University, Russia, in 2011, and the PhD degree in mathematics from Western University, Canada, in 2016. He did two postdocs in mathematics and in computer science at the University of Waterloo, Canada. Currently he is a researcher at Borealis AI, Canada. His research interests include algebra, generative models, cognitive computing, and natural language processing.

Simon J.D. Prince received the master's degree from University College London, United Kingdom, and the doctorate degree from the University of Oxford, United Kingdom. He has a diverse research background and has published in wide-ranging areas including Computer Vision, Neuroscience, HCI, Computer Graphics, Medical Imaging, and Augmented Reality. He is also the author of a popular textbook on Computer Vision. From 2005–2012, he was a tenured faculty member with the Department of Computer Science, University College London, where he taught courses in Computer Vision, Image Processing and Advanced Statistical Methods. During this time, he was director of the MSc in Computer Vision, Graphics and Imaging. He worked in industry applying AI to computer graphics software. Currently he is a research director of Borealis AI's Montreal office.

Marcus A. Brubaker (Member, IEEE) received the PhD degree from the University of Toronto, Canada, in 2011. He did postdocs at the Toyota Technological Institute, Chicago, Toronto Rehabilitation Hospital, and the University of Toronto, Canada. His research interests include computer vision, machine learning and statistics. He is currently an assistant professor at York University, Toronto, Canada, an adjunct professor at the University of Toronto, Canada, and a faculty affiliate of the Vector Institute. He is also an academic advisor to Borealis AI, Canada, where he previously worked as the research director of the Toronto office. He is also an associate editor for the journal IET Computer Vision and has served as a reviewer and an area chair for many computer vision and machine learning conferences.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.