Normalizing Flows: An Introduction and Review of Current Methods
Abstract—Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation
can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the
construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review
current state-of-the-art literature, and identify open questions and promising future directions.
Index Terms—Generative models, normalizing flows, density estimation, variational inference, invertible neural networks
1 INTRODUCTION
A major goal of statistics and machine learning has been to model a probability distribution given samples drawn from that distribution. This is an example of unsupervised learning and is sometimes called generative modelling. Its importance derives from the relative abundance of unlabelled data compared to labelled data. Applications include density estimation, outlier detection, prior construction, and dataset summarization.

Many methods for generative modeling have been proposed. Direct analytic approaches approximate observed data with a fixed family of distributions. Variational approaches and expectation maximization introduce latent variables to explain the observed data; they provide additional flexibility but can increase the complexity of learning and inference. Graphical models [59] explicitly model the conditional dependence between random variables. Recently, generative neural approaches have been proposed, including generative adversarial networks (GANs) [33] and variational auto-encoders (VAEs) [54].

GANs and VAEs have demonstrated impressive performance on challenging tasks such as learning distributions of natural images. However, several issues limit their application in practice. Neither allows for exact evaluation of the probability density of new points. Furthermore, training can be challenging due to a variety of phenomena including mode collapse, posterior collapse, vanishing gradients, and training instability [11], [82].

Normalizing Flows (NF) are a family of generative models with tractable distributions where both sampling and density evaluation can be efficient and exact. Applications include image generation [41], [57], noise modelling [1], video generation [60], audio generation [27], [53], [77], graph generation [65], reinforcement learning [67], [70], [93], computer graphics [69], and physics [51], [58], [71], [104], [105].

There are several survey papers for VAEs [55] and GANs [17], [100]. This article aims to provide a comprehensive review of the literature around Normalizing Flows for distribution learning. Our goals are to 1) provide context and explanation to enable a reader to become familiar with the basics, 2) review the current literature, and 3) identify open questions and promising future directions. Since this article was first made public, an excellent complementary treatment has been provided by Papamakarios et al. [75]. Their article is more tutorial in nature and provides many details concerning implementation, whereas our treatment is more formal and focuses mainly on the families of flow models.

In Section 2, we introduce Normalizing Flows and describe how they are trained. In Section 3 we review constructions for Normalizing Flows. In Section 4 we describe datasets for testing Normalizing Flows and discuss the performance of different approaches. Finally, in Section 5 we discuss open problems and possible research directions.

The authors are with Borealis AI, Montreal H2S 3H1, Canada. E-mail: {ivan.kobyzev, simon.prince}@borealisai.com, mab@eecs.yorku.ca.
Manuscript received 8 Dec. 2019; revised 21 Apr. 2020; accepted 1 May 2020. Date of publication 7 May 2020; date of current version 1 Oct. 2021. (Corresponding author: Ivan Kobyzev.) Recommended for acceptance by B. Kingsbury. Digital Object Identifier no. 10.1109/TPAMI.2020.2992934.
0162-8828 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

2 BACKGROUND

Normalizing Flows were popularised by Rezende and Mohamed [78] in the context of variational inference and by Dinh et al. [19] for density estimation. However, the framework was previously defined in Tabak and Vanden-Eijnden [89] and Tabak and Turner [88], and explored for clustering and classification [2] and density estimation [61], [80].

A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings. The density of a sample can be evaluated by transforming it back to the original simple distribution and then computing the product of i) the density of the inverse-transformed sample under this distribution and ii) the associated change in volume induced by the sequence of inverse transformations. The change in volume is the product of the absolute values of the determinants of
the Jacobians for each transformation, as required by the change of variables formula.

The result of this approach is a mechanism to construct new families of distributions by choosing an initial density and then chaining together some number of parameterized, invertible and differentiable transformations. The new density can be sampled from (by sampling from the initial density and applying the transformations), and the density at a sample (i.e., the likelihood) can be computed as above.

2.1 Basics

Let Z \in R^D be a random variable with a known and tractable probability density function p_Z : R^D \to R. Let g be an invertible function and Y = g(Z). Then, using the change of variables formula, one can compute the probability density function of the random variable Y:

p_Y(y) = p_Z(f(y)) \, |\det Df(y)| = p_Z(f(y)) \, |\det Dg(f(y))|^{-1}, \quad (1)

where f is the inverse of g, Df(y) = \frac{\partial f}{\partial y} is the Jacobian of f, and Dg(z) = \frac{\partial g}{\partial z} is the Jacobian of g. This new density function p_Y(y) is called the pushforward of the density p_Z by the function g and denoted by g_* p_Z (Fig. 1).

Fig. 1. Change of variables (Equation (1)). Top-left: the density of the source p_Z. Top-right: the density function of the target distribution p_Y(y). There exists a bijective function g, such that p_Y = g_* p_Z, with inverse f. Bottom-left: the inverse function f. Bottom-right: the absolute Jacobian (derivative) of f.

In the context of generative models, the above function g (a generator) "pushes forward" the base density p_Z (sometimes referred to as the "noise") to a more complex density. This movement from base density to final complicated density is the generative direction. Note that to generate a data point y, one can sample z from the base distribution and then apply the generator: y = g(z).

The inverse function f moves (or "flows") in the opposite, normalizing direction: from a complicated and irregular data distribution towards the simpler, more regular or "normal" form of the base measure p_Z. This view is what gives rise to the name "normalizing flows", as f is "normalizing" the data distribution. The term is doubly apt when the base measure p_Z is chosen to be a Normal distribution, as it often is in practice.

Intuitively, if the transformation g can be arbitrarily complex, one can generate any distribution p_Y from any base distribution. Since invertible, differentiable functions compose, a flow is often built as a chain g = g_N \circ g_{N-1} \circ \dots \circ g_1, with inverse

f = f_1 \circ \dots \circ f_{N-1} \circ f_N, \quad (2)

and the determinant of the Jacobian is

\det Df(y) = \prod_{i=1}^{N} \det Df_i(x_i), \quad (3)

where Df_i(x_i) = \frac{\partial f_i}{\partial x_i} is the Jacobian of f_i. We denote the value of the ith intermediate flow as x_i = g_i \circ \dots \circ g_1(z) = f_{i+1} \circ \dots \circ f_N(y), and so x_N = y. Thus, a set of nonlinear bijective functions can be composed to construct successively more complicated functions.

2.1.1 More Formal Construction

In this section we explain normalizing flows from a more formal perspective. Readers unfamiliar with measure theory can safely skip to Section 2.2. First, let us recall the general definition of a pushforward.

Definition 1. If (Z, \Sigma_Z) and (Y, \Sigma_Y) are measurable spaces, g is a measurable mapping between them, and \mu is a measure on Z, then one can define a measure on Y (called the pushforward measure and denoted by g_* \mu) by the formula

g_* \mu(U) = \mu(g^{-1}(U)), \quad \text{for all } U \in \Sigma_Y. \quad (4)

This notion gives a general formulation of a generative model. Data can be understood as a sample from a measured "data" space (Y, \Sigma_Y, \nu), which we want to learn. To do that, one can introduce a simpler measured space (Z, \Sigma_Z, \mu) and find a function g : Z \to Y such that \nu = g_* \mu. This function g can be interpreted as a "generator", and Z as a latent space. This view puts generative models in the context of transportation theory [99].

In this survey we will assume that Z = R^D, all sigma-algebras are Borel, and all measures are absolutely continuous with respect to Lebesgue measure (i.e., \mu = p_Z \, dz).

Definition 2. A function g : R^D \to R^D is called a diffeomorphism if it is bijective, differentiable, and its inverse is differentiable as well.

The pushforward of an absolutely continuous measure p_Z \, dz by a diffeomorphism g is also absolutely continuous, with a density function given by Equation (1). Note that this more general approach is important for studying generative models on non-Euclidean spaces (see Section 5.2).
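As a concrete check of Equation (1), the change of variables formula can be verified numerically for a toy flow. The sketch below is illustrative only (it is not from the survey): the base density is a standard normal and the generator is g(z) = exp(z), so the pushforward is the standard log-normal distribution.

```python
import numpy as np

def g(z):
    # generator (generative direction): pushes base samples forward
    return np.exp(z)

def f(y):
    # inverse flow (normalizing direction)
    return np.log(y)

def log_p_base(z):
    # log density of the standard normal base p_Z
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def log_p_Y(y):
    # Equation (1): p_Y(y) = p_Z(f(y)) |det Df(y)|, with Df(y) = 1/y here
    return log_p_base(f(y)) + np.log(np.abs(1.0 / y))

# sampling: draw z from the base distribution, then apply the generator
rng = np.random.default_rng(0)
samples = g(rng.standard_normal(1000))
```

Sampling uses the generative direction, while density evaluation of a point uses only the normalizing direction f and the log-determinant correction.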
3966 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 11, NOVEMBER 2021
Remark 3. It is common in the normalizing flows literature to simply refer to diffeomorphisms as "bijections", even though this is formally incorrect. In general, it is not necessary that g is everywhere differentiable; rather, it is sufficient that it is differentiable only almost everywhere with respect to the Lebesgue measure on R^D. This allows, for instance, piecewise differentiable functions to be used in the construction of g.

2.2 Applications

2.2.1 Density Estimation and Sampling

The natural and most obvious use of normalizing flows is to perform density estimation. For simplicity, assume that only a single flow, g, is used, and that it is parameterized by the vector \theta. Further, assume that the base measure p_Z is given and is parameterized by the vector \phi. Given a set of data observed from some complicated distribution, D = \{ y^{(i)} \}_{i=1}^{M}, we can then perform likelihood-based estimation of the parameters \Theta = (\theta, \phi). The data likelihood in this case simply becomes

\log p(D \mid \Theta) = \sum_{i=1}^{M} \log p_Y(y^{(i)} \mid \Theta) = \sum_{i=1}^{M} \left[ \log p_Z(f(y^{(i)} \mid \theta) \mid \phi) + \log \left| \det Df(y^{(i)} \mid \theta) \right| \right], \quad (5)

where the first term is the log likelihood of the sample under the base measure and the second term, sometimes called the log-determinant or volume correction, accounts for the change of volume induced by the transformation of the normalizing flows (see Equation (1)). During training, the parameters of the flow (\theta) and of the base distribution (\phi) are adjusted to maximize the log-likelihood.

Note that evaluating the likelihood of a distribution modelled by a normalizing flow requires computing f (i.e., the normalizing direction), as well as its log determinant. The efficiency of these operations is particularly important during training, where the likelihood is repeatedly computed. However, sampling from the distribution defined by the normalizing flow requires evaluating the inverse g (i.e., the generative direction). Thus sampling performance is determined by the cost of the generative direction. Even though a flow must be theoretically invertible, computation of the inverse may be difficult in practice; hence, for density estimation it is common to model a flow in the normalizing direction (i.e., f).^1

Finally, while maximum likelihood estimation is often effective (and statistically efficient under certain conditions), other forms of estimation can and have been used with normalizing flows. In particular, adversarial losses can be used with normalizing flow models (e.g., in Flow-GAN [36]).

1. To ensure both efficient density estimation and sampling, van den Oord et al. [98] proposed an approach called Probability Density Distillation, which trains the flow f as normal and then uses it as a teacher network to train a tractable student network g.

2.2.2 Variational Inference

Consider a latent variable model p(x) = \int p(x, y) \, dy, where x is an observed variable and y the latent variable. The posterior distribution p(y \mid x) is used when estimating the parameters of the model, but its computation is usually intractable in practice. One approach is to use variational inference and introduce the approximate posterior q(y \mid x, \theta), where \theta are the parameters of the variational distribution. Ideally this distribution should be as close to the real posterior as possible. This is done by minimizing the KL divergence D_{KL}(q(y \mid x, \theta) \,\|\, p(y \mid x)), which is equivalent to maximizing the evidence lower bound L(\theta) = E_{q(y|x,\theta)}[\log p(y, x) - \log q(y \mid x, \theta)]. The latter optimization can be done with gradient descent; however, for that one needs to compute gradients of the form \nabla_\theta E_{q(y|x,\theta)}[h(y)], which is not straightforward.

As was observed by Rezende and Mohamed [78], one can reparameterize q(y \mid x, \theta) = p_Y(y \mid \theta) with normalizing flows. Assume, for simplicity, that only a single flow g with parameters \theta is used, y = g(z \mid \theta), and that the base distribution p_Z(z) does not depend on \theta. Then

E_{p_Y(y|\theta)}[h(y)] = E_{p_Z(z)}[h(g(z \mid \theta))], \quad (6)

and the gradient of the right-hand side with respect to \theta can be computed. This general approach to computing gradients of an expectation is often called the "reparameterization trick".

In this scenario, evaluating the likelihood is only required at points which have been sampled. Here the sampling performance and the evaluation of the log determinant are the only relevant metrics, and computing the inverse of the mapping may not be necessary. Indeed, the planar and radial flows introduced in Rezende and Mohamed [78] are not easily invertible (see Section 3.3).

3 METHODS

Normalizing Flows should satisfy several conditions in order to be practical. They should:

- be invertible: for sampling we need g, while for computing likelihood we need f;
- be sufficiently expressive to model the distribution of interest;
- be computationally efficient, both in terms of computing f and g (depending on the application), but also in terms of the calculation of the determinant of the Jacobian.

In the following section, we describe different types of flows and comment on the above properties. An overview of the methods discussed can be seen in Fig. 2.

3.1 Elementwise Flows

A basic form of bijective non-linearity can be constructed given any bijective scalar function. That is, let h : R \to R be a scalar valued bijection. Then, if x = (x_1, x_2, \dots, x_D)^T,

g(x) = (h(x_1), h(x_2), \dots, h(x_D))^T, \quad (7)

is also a bijection, whose inverse simply requires computing h^{-1}, and whose Jacobian determinant has absolute value equal to the product of the absolute values of the derivatives of h. This can be generalized by allowing each element to have its own distinct bijective function, which might be useful if we wish to only modify portions of our parameter vector. In deep learning terminology, h could be viewed as an "activation function".
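As an illustration of the elementwise flow of Equation (7), the sketch below (illustrative, not from the survey) uses the Leaky ReLU as the scalar bijection h; the negative-slope parameter is an arbitrary illustrative choice. Because the Jacobian is diagonal, its log-determinant is a simple sum.

```python
import numpy as np

ALPHA = 0.1  # negative slope of the Leaky ReLU (illustrative choice)

def h(x):
    # scalar bijection applied elementwise (an "activation function")
    return np.where(x >= 0, x, ALPHA * x)

def h_inv(y):
    # inverse of the scalar bijection
    return np.where(y >= 0, y, y / ALPHA)

def g(x):
    # elementwise flow of Eq. (7)
    return h(x)

def log_abs_det_jacobian(x):
    # the Jacobian is diagonal, so log|det Dg| = sum_i log |h'(x_i)|
    deriv = np.where(x >= 0, 1.0, ALPHA)
    return np.sum(np.log(deriv))
```

Only the negative entries of x contribute to the log-determinant, since h has unit slope on the positive half-line.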
Note that the most commonly used activation function, ReLU, is not bijective and so cannot be directly applied; however, the (Parametric) Leaky ReLU [39], [64] can be used instead, among others. Note that recently spline-based activation functions have also been considered [24], [25]; these will be discussed in Section 3.4.4.4.

Fig. 2. Overview of flows discussed in this review. We start with elementwise bijections, linear flows, and planar and radial flows. All of these have drawbacks and are limited in utility. We then discuss two architectures (coupling flows and autoregressive flows) which support invertible non-linear transformations. These both use a coupling function, and we summarize the different coupling functions available. Finally, we discuss residual flows and their continuous extension, infinitesimal flows.

3.2 Linear Flows

Elementwise operations alone are insufficient, as they cannot express any form of correlation between dimensions. Linear mappings can express correlation between dimensions:

g(x) = Ax + b, \quad (8)

where A \in R^{D \times D} and b \in R^D are parameters. If A is an invertible matrix, the function is invertible.

Linear flows are limited in their expressiveness. Consider a Gaussian base distribution p_Z(z) = N(z; \mu, \Sigma). After transformation by a linear flow, the distribution remains Gaussian, with distribution p_Y = N(y; A\mu + b, A \Sigma A^T). More generally, a linear flow of a distribution from the exponential family remains in the exponential family. However, linear flows are an important building block, as they form the basis of affine coupling flows (Section 3.4.4.1).

Note that the determinant of the Jacobian is simply \det(A), which can be computed in O(D^3), as can the inverse. Hence, using linear flows can become expensive for large D. By restricting the form of A we can avoid these practical problems at the expense of expressive power. In the following sections we discuss different ways of limiting the form of linear transforms to make them more practical.

3.2.2 Triangular

The triangular matrix is a more expressive form of linear transformation, whose determinant is the product of its diagonal entries. It is non-singular so long as its diagonal entries are non-zero. Inversion is relatively inexpensive, requiring a single pass of back-substitution costing O(D^2) operations.

Tomczak and Welling [91] combined K triangular matrices T_i, each with ones on the diagonal, and a K-dimensional probability vector v to define a more general linear flow y = (\sum_{i=1}^{K} v_i T_i) z. The determinant of this bijection is one. However, finding the inverse has O(D^3) complexity if some of the matrices are upper- and some are lower-triangular.

3.2.3 Permutation and Orthogonal

The expressiveness of triangular transformations is sensitive to the ordering of dimensions. Reordering the dimensions can be done easily using a permutation matrix, which has an absolute determinant of 1. Different strategies have been tried, including reversing the order and using a fixed random permutation [20], [57]. However, the permutations cannot be directly optimized and so remain fixed after initialization, which may not be optimal.

A more general alternative is the use of orthogonal transformations. The inverse and absolute determinant of an orthogonal matrix are both trivial to compute, which makes them efficient. Tomczak and Welling [92] used orthogonal matrices parameterized by the Householder transform. The idea is based on the fact from linear algebra that any orthogonal matrix can be written as a product of reflections. To parameterize a reflection matrix H in R^D, one fixes a non-zero vector v \in R^D and then defines

H = I - \frac{2}{\|v\|^2} v v^T.

3.2.4 Factorizations

Instead of limiting the form of A, Kingma and Dhariwal [57] proposed using the LU factorization

g(x) = PLUx + b, \quad (9)

where L is lower triangular with ones on the diagonal, U is upper triangular with non-zero diagonal entries, and P is a permutation matrix. The determinant is the product of the diagonal entries of U, which can be computed in O(D). The inverse of the function g can be computed using two passes of backward substitution in O(D^2). However, the discrete permutation P cannot be easily optimized. To avoid this, P is randomly generated initially and then fixed. Hoogeboom et al. [42] noted that fixing the permutation matrix limits the flexibility of the transformation, and proposed using the QR decomposition instead, where the orthogonal matrix Q is described with Householder transforms.
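A minimal numerical sketch of the LU-factorized linear flow of Equation (9) follows (illustrative, not from the survey; for brevity it uses `np.linalg.solve` where a dedicated triangular back-substitution routine would give the O(D^2) cost discussed above).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6

# parameters of g(x) = P L U x + b  (the LU-factorized linear flow of Eq. (9))
P = np.eye(D)[rng.permutation(D)]                       # fixed random permutation
L = np.tril(rng.normal(size=(D, D)), -1) + np.eye(D)    # unit lower-triangular
U = np.triu(rng.normal(size=(D, D)), 1) + np.diag(rng.uniform(0.5, 2.0, size=D))
b = rng.normal(size=D)

def g(x):
    return P @ (L @ (U @ x)) + b

def g_inv(y):
    # P^{-1} = P^T; the two triangular systems can be solved by
    # back-substitution in O(D^2) (np.linalg.solve is used here for brevity)
    z = np.linalg.solve(L, P.T @ (y - b))
    return np.linalg.solve(U, z)

# log |det Dg| in O(D): |det P| = 1 and det L = 1, so only diag(U) contributes
log_abs_det = np.sum(np.log(np.abs(np.diag(U))))
```

The O(D) log-determinant agrees with the dense computation, which is the point of the factorization.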
precision given sufficient capacity and data. We will provide a formal proof of the universality theorem following [49]. This section requires some knowledge of measure theory and functional analysis and can be safely skipped.

First, recall that a mapping T = (T_1, \dots, T_D) : R^D \to R^D is called triangular if T_i is a function of x_{1:i} for each i = 1, \dots, D. Such a triangular map T is called increasing if T_i is an increasing function of x_i for each i.

Proposition 4 ([9], Lemma 2.1). If \mu and \nu are absolutely continuous Borel probability measures on R^D, then there exists an increasing triangular transformation T : R^D \to R^D such that \nu = T_* \mu. This transformation is unique up to null sets of \mu. A similar result holds for measures on [0, 1]^D.

Proposition 5. If \mu is an absolutely continuous Borel probability measure on R^D and \{T_n\} is a sequence of maps R^D \to R^D which converges pointwise to a map T, then the sequence of measures (T_n)_* \mu converges weakly to T_* \mu.

Proof. See [45], Lemma 4. The result follows from the dominated convergence theorem. □

As a corollary, to claim that a class of autoregressive flows g(\cdot; \theta) : R^D \to R^D is universal, it is enough to demonstrate that the family of coupling functions h used in the class is dense in the set of all monotone functions in the pointwise convergence topology. In particular, [45] used neural monotone networks for coupling functions, and [49] used monotone polynomials. Using the theory outlined in this section, universality could also be proved for spline flows [24], [25] with splines for coupling functions (see Section 3.4.4.4).

3.4.4 Coupling Functions

As described in the previous sections, coupling flows and autoregressive flows have a similar functional form and both have coupling functions as building blocks. A coupling function is a bijective differentiable function h(\cdot; \theta) : R^d \to R^d, parameterized by \theta. In coupling flows, these functions are typically constructed by applying a scalar coupling function h(\cdot; \theta) : R \to R elementwise. In autoregressive flows, d = 1 and hence they are also scalar valued. Note that scalar coupling functions are necessarily (strictly) monotone. In this section we describe the scalar coupling functions commonly used in the literature.

3.4.4.1 Affine coupling. Two simple forms of coupling functions h : R \to R were proposed by Dinh et al. [19] in NICE (nonlinear independent component estimation). These were the additive coupling function

h(x; \theta) = x + \theta, \quad \theta \in R, \quad (21)

and the affine coupling function

h(x; \theta) = \theta_1 x + \theta_2, \quad \theta_1 \neq 0, \; \theta_2 \in R. \quad (22)

Affine coupling functions are used for coupling flows in NICE [19], RealNVP [20], and Glow [57], and for autoregressive architectures in IAF [56] and MAF [74]. They are simple and computation is efficient. However, they are limited in expressiveness, and many flows must be stacked to represent complicated distributions.

3.4.4.2 Nonlinear squared flow. Ziegler and Rush [108] proposed an invertible non-linear squared transformation defined by

h(x; \theta) = ax + b + \frac{c}{1 + (dx + h)^2}. \quad (23)

Under some constraints on the parameters \theta = [a, b, c, d, h] \in R^5, the coupling function is invertible and its inverse is analytically computable as a root of a cubic polynomial (with only one real root). Experiments showed that these coupling functions facilitate learning multimodal distributions.

3.4.4.3 Continuous mixture CDFs. Ho et al. [41] proposed the Flow++ model, which contained several improvements, including a more expressive coupling function. The layer is almost like a linear transformation, but one also applies a monotone function to x:

h(x; \theta) = \theta_1 F(x; \theta_3) + \theta_2, \quad (24)

where \theta_1 \neq 0, \theta_2 \in R and \theta_3 = [\pi, \mu, s] \in R^K \times R^K \times R^K_+. The function F(x; \pi, \mu, s) is the CDF of a mixture of K logistics, postcomposed with an inverse sigmoid:

F(x; \pi, \mu, s) = \sigma^{-1}\!\left( \sum_{j=1}^{K} \pi_j \, \sigma\!\left( \frac{x - \mu_j}{s_j} \right) \right). \quad (25)

Note that the post-composition with \sigma^{-1} : [0, 1] \to R is used to ensure the right range for h. Computation of the inverse is done numerically with the bisection algorithm. The derivative of the transformation with respect to x is expressed in terms of the PDF of the logistic mixture (i.e., a linear combination of hyperbolic secant functions), and its computation is not expensive. An ablation study demonstrated that switching from an affine coupling function to a logistic mixture improved performance slightly.

3.4.4.4 Splines. A spline is a piecewise-polynomial or piecewise-rational function which is specified by K + 1 points (x_i, y_i)_{i=0}^{K}, called knots, through which the spline passes. To make a useful coupling function, the spline should be monotone, which will be the case if x_i < x_{i+1} and y_i < y_{i+1}. Usually splines are considered on a compact interval.

Piecewise-linear and piecewise-quadratic. Müller et al. [69] used linear splines for coupling functions h : [0, 1] \to [0, 1]. They divided the domain into K equal bins. Instead of defining increasing values for y_i, they modeled h as the integral of a positive piecewise-constant function:

h(x; \theta) = \alpha \theta_b + \sum_{k=1}^{b-1} \theta_k, \quad (26)

where \theta \in R^K is a probability vector, b = \lfloor Kx \rfloor (the bin that contains x), and \alpha = Kx - b (the position of x within bin b). This map is invertible if all \theta_k > 0, with derivative \frac{\partial h}{\partial x} = \theta_b K.

Müller et al. [69] also used a monotone quadratic spline on the unit interval for a coupling function and modeled this as the integral of a positive piecewise-linear function. A monotone quadratic spline is invertible; finding its inverse map requires solving a quadratic equation.
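The piecewise-linear spline coupling function of Equation (26) can be sketched as follows (illustrative, not from the survey; it uses 0-based bin indices, and the inverse is found by searching the cumulative knot heights):

```python
import numpy as np

def make_spline(theta):
    # Monotone piecewise-linear coupling h : [0,1] -> [0,1] (Eq. (26)):
    # h is the integral (CDF) of a piecewise-constant density with K equal
    # bins of positive mass theta_k (bins indexed from 0 here).
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    cum = np.concatenate([[0.0], np.cumsum(theta)])  # knot heights y_0..y_K

    def h(x):
        x = np.asarray(x, dtype=float)
        b = np.minimum((K * x).astype(int), K - 1)   # bin containing x
        alpha = K * x - b                            # position of x within bin b
        return cum[b] + alpha * theta[b]

    def h_inv(y):
        y = np.asarray(y, dtype=float)
        b = np.minimum(np.searchsorted(cum, y, side='right') - 1, K - 1)
        return (b + (y - cum[b]) / theta[b]) / K

    return h, h_inv
```

Because theta is a probability vector, h maps 0 to 0 and 1 to 1, and the derivative in bin b is theta_b * K, matching the text above.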
along this unfolded map is now well-defined, and one gets the formula for the density p_Y:

p_Y(y) = p_{Z,[K]}(h(y), \varphi(y)) \, |Dh(y)|. \quad (28)

This real and discrete (RAD) flow efficiently learns distributions with discrete structure (multimodal distributions, distributions with holes, discrete symmetries, etc.).

3.5 Residual Flows

Residual networks [40] are compositions of functions of the form

g(x) = x + F(x). \quad (29)

Such a function is called a residual connection, and here the residual block F(\cdot) is a feed-forward neural network of any kind (a CNN in the original paper).

The first attempts to build a reversible network architecture based on residual connections were made in RevNets [32] and iRevNets [47]. Their main motivation was to save memory during training and to stabilize computation. The central idea is a variation of additive coupling functions: consider a disjoint partition of R^D = R^d \times R^{D-d}, denoted by x = (x_A, x_B) for the input and y = (y_A, y_B) for the output, and define a function

y_A = x_A + F(x_B), \quad y_B = x_B + G(y_A), \quad (30)

where F : R^{D-d} \to R^d and G : R^d \to R^{D-d} are residual blocks. This network is invertible (by rearranging the equations in terms of x_A and x_B and reversing their order), but computation of the Jacobian is inefficient.

A different point of view on reversible networks comes from a dynamical systems perspective, via the observation that a residual connection is a discretization of a first order ordinary differential equation (see Section 3.6 for more details). [12], [13] proposed several architectures along these lines, some of which were demonstrated to be invertible. However, the Jacobian determinants of these networks cannot be computed efficiently.

Other research has focused on making the residual connection g(\cdot) invertible. A sufficient condition for invertibility was found in [7]. They proved the following statement:

Proposition 7. A residual connection (29) is invertible if the Lipschitz constant of the residual block satisfies Lip(F) < 1.

There is no analytically closed form for the inverse, but it can be found numerically using fixed-point iterations (which, by the Banach fixed-point theorem, converge if we assume Lip(F) < 1).

Controlling the Lipschitz constant of a neural network is not simple. The specific architecture proposed by Behrmann et al. [7], called iResNet, uses a convolutional network for the residual block and constrains the spectral radius of each convolutional layer in this network to be less than one.

The Jacobian determinant of the iResNet cannot be computed directly, so the authors proposed to use a (biased) stochastic estimate. The Jacobian of the residual connection g in Equation (29) is Dg = I + DF. Because the function F is assumed to be Lipschitz with Lip(F) < 1, one has |\det(I + DF)| = \det(I + DF). Using the linear algebra identity \ln \det A = \mathrm{Tr} \ln A, we have

\ln |\det Dg| = \ln \det(I + DF) = \mathrm{Tr}(\ln(I + DF)). \quad (31)

Then one considers a power series for the trace of the matrix logarithm:

\mathrm{Tr}(\ln(I + DF)) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{\mathrm{Tr}[(DF)^k]}{k}. \quad (32)

By truncating this series one can calculate an approximation to the log Jacobian determinant of g. To efficiently compute each member of the truncated series, the Hutchinson trick was used. This trick provides a stochastic estimate of the trace of a matrix A \in R^{D \times D} using the relation \mathrm{Tr}\,A = E_{p(v)}[v^T A v], where v \in R^D, E[v] = 0, and \mathrm{cov}(v) = I.

Truncating the power series gives a biased estimate of the log Jacobian determinant (the bias depends on the truncation error). An unbiased stochastic estimator was proposed by Chen et al. [16] in a model they called a Residual Flow. The authors used a Russian roulette estimator instead of truncation. Informally, every time one adds the next term a_{n+1} to the partial sum \sum_{i=1}^{n} a_i while calculating the series \sum_{i=1}^{\infty} a_i, one flips a coin to decide whether the calculation should be continued or stopped. During this process one needs to re-weight the terms to obtain an unbiased estimate.

3.6 Infinitesimal (Continuous) Flows

The residual connections discussed in the previous section can be viewed as discretizations of a first order ordinary differential equation (ODE) [26], [37]:

\frac{d}{dt} x(t) = F(x(t), \theta(t)), \quad (33)

where F : R^D \times \Theta \to R^D is a function which determines the dynamics (the evolution function), \Theta is a set of parameters, and \theta : R \to \Theta is a parameterization. The discretization of this equation (Euler's method) is

x_{n+1} - x_n = \varepsilon F(x_n, \theta_n), \quad (34)

which is equivalent to a residual connection with residual block \varepsilon F(\cdot, \theta_n).

In this section we consider the case where we do not discretize, but instead try to learn the continuous dynamical system. Such flows are called infinitesimal or continuous. We consider two distinct types: the formulation of the first type comes from ordinary differential equations, and that of the second type from stochastic differential equations.

3.6.1 ODE-Based Methods

Consider an ODE as in Equation (33), where t \in [0, 1]. Assuming uniform Lipschitz continuity in x and continuity in t, the solution exists (at least locally) and, given an initial condition x(0) = z, is unique (the Picard-Lindelöf-Lipschitz-Cauchy theorem [5]). We denote the solution at each time t as \Phi_t(z).

Remark 8. At each time t, \Phi_t(\cdot) : R^D \to R^D is a diffeomorphism and satisfies the group law \Phi_t \circ \Phi_s = \Phi_{t+s}. Mathematically speaking, an ODE (33) defines a one-parameter group of diffeomorphisms on R^D. Such a group is called a smooth flow in dynamical systems theory and differential geometry [52].
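The correspondence between the ODE (33), its Euler discretization (34), and the flow map \Phi_t can be illustrated with the simple linear dynamic F(x) = -x, whose flow map is known in closed form (an illustrative example, not from the survey):

```python
import numpy as np

def F(x):
    # evolution function of Eq. (33): the simple linear dynamic F(x) = -x
    return -x

def phi(t, z):
    # exact flow map of dx/dt = -x with x(0) = z: Phi_t(z) = exp(-t) * z
    return np.exp(-t) * z

def euler_time_one_map(z, n_steps):
    # Eq. (34): each Euler step x + eps*F(x) is a residual connection, so
    # the time one map is approximated by a stack of n_steps residual blocks
    eps = 1.0 / n_steps
    x = z
    for _ in range(n_steps):
        x = x + eps * F(x)
    return x
```

The exact map satisfies the group law of Remark 8, is inverted by integrating backwards in time, and is recovered by the residual stack as the step size shrinks.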
a smooth flow in dynamical systems theory and differential geometry [52].

When t = 1, the diffeomorphism Φ_1(·) is called a time one map. The idea to model a normalizing flow as a time one map y = g(z) = Φ_1(z) was presented by Chen et al. [15] under the name Neural ODE (NODE). From a deep learning perspective this can be seen as an "infinitely deep" neural network with input z, output y and continuous weights θ(t). The invertibility of such networks follows naturally from the theorem of existence and uniqueness of the solution of the ODE.

Training these networks for a supervised downstream task can be done by the adjoint sensitivity method, which is the continuous analog of backpropagation. It computes the gradients of the loss function by solving a second (augmented) ODE backwards in time. For a loss L(x(t)), where x(t) is a solution of ODE (33), its sensitivity or adjoint is a(t) = dL/dx(t). This is the analog of the derivative of the loss with respect to a hidden layer. In a standard neural network, the backpropagation formula computes this derivative: dL/dh_n = (dL/dh_{n+1}) (dh_{n+1}/dh_n). For an "infinitely deep" neural network, this formula changes into an ODE:

da(t)/dt = −a(t) ∂F(x(t), θ(t))/∂x(t). (35)

For density estimation learning, we do not have a loss, but instead seek to maximize the log likelihood. For normalizing flows, the change of variables formula is given by another ODE:

d log(p(x(t)))/dt = −Tr( ∂F(x(t))/∂x(t) ). (36)

Note that we no longer need to compute the determinant. To train the model and sample from pY we solve these ODEs, which can be done with any numerical ODE solver.

Grathwohl et al. [34] used the Hutchinson estimator to calculate an unbiased stochastic estimate of the trace term. This approach, which they termed FFJORD, reduces the complexity even further. Finlay et al. [29] added two regularization terms to the loss function of FFJORD: the first forces solution trajectories to follow straight lines with constant speed, and the second is the Frobenius norm of the Jacobian. This regularization decreased the training time significantly and reduced the need for multiple GPUs. An interesting side effect of using continuous ODE-type flows is that one needs fewer parameters to achieve similar performance. For example, Grathwohl et al. [34] show that for comparable performance on CIFAR-10, FFJORD uses less than 2 percent as many parameters as Glow.

Not all diffeomorphisms can be presented as a time one map of an ODE (see [3], [52]). For example, one necessary condition is that the map is orientation preserving, which means that the Jacobian determinant must be positive. This can be seen because the solution Φ_t is a (continuous) path in the space of diffeomorphisms from the identity map Φ_0 = Id to the time one map Φ_1. Since the Jacobian determinant of a diffeomorphism is nonzero, its sign cannot change along the path. Hence, a time one map must have a positive Jacobian determinant. For example, consider the map φ: R → R with φ(x) = −x. It is obviously a diffeomorphism, but it cannot be presented as a time one map of any ODE, because it is not orientation preserving.

Dupont et al. [23] suggested how one can improve Neural ODE in order to represent a broader class of diffeomorphisms. Their model is called Augmented Neural ODE (ANODE). They add variables x̂(t) ∈ R^p and consider a new ODE

d/dt ( x(t), x̂(t) ) = F( (x(t), x̂(t)), θ(t) ), (37)

with initial conditions x(0) = z and x̂(0) = 0. The addition of x̂(t) in particular gives the Jacobian determinant the freedom to remain positive. As was demonstrated in the experiments, ANODE is capable of learning distributions that the Neural ODE cannot, and the training time is shorter. Zhang et al. [106] proved that any diffeomorphism can be represented as a time one map of ANODE, so this is a universal flow.

A similar ODE-based approach was taken by Salman et al. [83] in Deep Diffeomorphic Flows. In addition to modelling a path Φ_t(·) in the space of all diffeomorphic transformations, for t ∈ [0, 1], they proposed geodesic regularisation, in which longer paths are penalised.

3.6.2 SDE-Based Methods (Langevin Flows)

The idea of the Langevin flow is simple: we start with a complicated and irregular data distribution pY(y) on R^D, and then mix it to produce the simple base distribution pZ(z). If this mixing obeys certain rules, then the procedure can be invertible. This idea was explored by Chen et al. [14], Jankowiak and Obermeyer [50], Rezende and Mohamed [78], Salimans et al. [81], Sohl-Dickstein et al. [84], Suykens et al. [87], and Welling and Teh [103]. We provide a high-level overview of the method, including the necessary mathematical background.

A stochastic differential equation (SDE), or Itô process, describes the change of a random variable x ∈ R^D as a function of time t ∈ R+:

dx(t) = b(x(t), t) dt + σ(x(t), t) dB_t, (38)

where b(x, t) ∈ R^D is the drift coefficient, σ(x, t) ∈ R^{D×D} is the diffusion coefficient, and B_t is D-dimensional Brownian motion. One can interpret the drift term as a deterministic change and the diffusion term as providing stochasticity and mixing. Given some assumptions about these functions, the solution exists and is unique [72].

Given a time-dependent random variable x(t), we can consider its density function p(x, t), which is also time dependent. If x(t) is a solution of Equation (38), its density function satisfies two partial differential equations describing the forward and backward evolution [72]. The forward evolution is given by the Fokker–Planck equation (Kolmogorov's forward equation).
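To make the mixing intuition concrete, the sketch below simulates Equation (38) with the Euler–Maruyama scheme for the simplest textbook choice of coefficients, b(x, t) = −x and σ(x, t) = √2 (an Ornstein–Uhlenbeck process, chosen here for illustration rather than taken from the works cited above), whose stationary distribution is the standard normal base pZ:

```python
import numpy as np

rng = np.random.default_rng(0)

# An irregular, bimodal "data" distribution p_Y on R (D = 1 for simplicity).
n = 20_000
x = np.concatenate([rng.normal(-3.0, 0.3, n // 2),
                    rng.normal(3.0, 0.3, n // 2)])

# Euler-Maruyama discretization of Equation (38) with drift b(x, t) = -x and
# constant diffusion sigma = sqrt(2): each step applies a deterministic pull
# toward the origin plus Gaussian noise of variance 2*dt.
dt, n_steps = 0.01, 1000
for _ in range(n_steps):
    x = x - x * dt + np.sqrt(2.0 * dt) * rng.normal(size=n)

# After t = n_steps * dt = 10 time units the samples are approximately N(0, 1).
```

Running the loop long enough leaves the sample mean near 0 and the sample variance near 1; inverting such a mixing process exactly is what the Langevin-flow constructions above formalize.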
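Similarly, the deterministic ODE flow of Section 3.6.1 can be sketched in a few lines: the snippet integrates Equations (33) and (36) jointly with forward Euler, using toy linear dynamics F(x, t) = Ax as a stand-in for a neural network (an assumption made so the result can be checked; for linear dynamics the log-density change over t ∈ [0, 1] is exactly −Tr A):

```python
import numpy as np

# Toy linear dynamics F(x, t) = A @ x; a real CNF would use a neural network
# and estimate the trace stochastically (Hutchinson estimator, as in FFJORD).
A = np.array([[0.1, 0.3],
              [0.0, -0.2]])

def F(x, t):
    return A @ x

def jacobian_trace(x, t):
    return np.trace(A)  # dF/dx = A everywhere for linear dynamics

def integrate(z, log_pz, n_steps=1000):
    """Forward-Euler integration of dx/dt = F(x, t) and
    d log p / dt = -Tr(dF/dx) from t = 0 to t = 1."""
    x, log_p = z.copy(), log_pz
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        log_p -= jacobian_trace(x, t) * dt
        x = x + F(x, t) * dt
    return x, log_p

z = np.array([1.0, -1.0])
y, log_py = integrate(z, log_pz=0.0)
# y approximates expm(A) @ z; log_py equals -trace(A) for this linear F.
```

No Jacobian determinant is ever formed: only the trace of dF/dx enters the update, which is the computational advantage noted after Equation (36).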
TABLE 1
List of Normalizing Flows for Which We Show Performance Results
[table body not reproduced]

TABLE 2
Tabular Datasets: Data Dimensionality and Number of Training Examples
[table body not reproduced]

TABLE 3
Average Test Log-Likelihood (in Nats) for Density Estimation on Tabular Datasets (Higher Is Better)
A number in parentheses next to a flow indicates the number of layers. MAF MoG is MAF with a mixture of Gaussians as the base density.
[table body not reproduced]

…[34], NAF [45], UMNN [102], SOS [49], Quadratic Spline flow and RQ-NSF [25], Cubic Spline Flow [24].

Table 3 shows that universal flows (NAF, SOS, Splines) demonstrate relatively better performance.

4.2 Image Datasets

These datasets are summarized in Table 4. They are of increasing complexity and are preprocessed as in [20] by dequantizing with uniform noise (except for Flow++).

TABLE 4
Image Datasets: Data Dimensionality and Number of Training Examples for MNIST, CIFAR-10, ImageNet32 and ImageNet64 Datasets

         MNIST   CIFAR-10   ImNet32   ImNet64
Dims     784     3072       3072      12288
#Train   50K     90K        1.3M      1.3M

Table 5 compares performance on the image datasets for unconditional density estimation. For experimental details, see: RealNVP for CIFAR-10 and ImageNet [20], Glow for CIFAR-10 and ImageNet [57], RealNVP and Glow for MNIST, MAF and FFJORD [34], SOS [49], RQ-NSF [25], UMNN [102], iResNet [7], Residual Flow [16], Flow++ [41].

TABLE 5
Average Test Negative Log-Likelihood (in Bits per Dimension) for Density Estimation on Image Datasets (Lower Is Better)

               MNIST   CIFAR-10   ImNet32   ImNet64
RealNVP        1.06    3.49       4.28      3.98
Glow           1.05    3.35       4.09      3.81
MAF            1.89    4.31
FFJORD         0.99    3.40
SOS            1.81    4.18
RQ-NSF(C)              3.38                 3.82
UMNN           1.13
iResNet        1.06    3.45
Residual Flow  0.97    3.28       4.01      3.76
Flow++                 3.08       3.86      3.69

As of this writing, Flow++ [41] is the best performing approach. Besides using more expressive coupling layers (see Section 3.4.4.3) and a different architecture for the conditioner, variational dequantization was used instead of uniform. An ablation study shows that the change in dequantization approach gave the most significant improvement.

5 DISCUSSION AND OPEN PROBLEMS

5.1 Inductive Biases

5.1.1 Role of the Base Measure

The base measure of a normalizing flow is generally assumed to be a simple distribution (e.g., uniform or Gaussian). However, this doesn't need to be the case. Any distribution where we can easily draw samples and compute the log probability density function is possible, and the parameters of this distribution can be learned during training.

Theoretically the base measure shouldn't matter: any distribution for which a CDF can be computed can be simulated by applying the inverse CDF to draws from the uniform distribution. However, in practice, if structure is provided in the base measure, the resulting transformations may become easier to learn. In other words, the choice of base measure can be viewed as a form of prior or inductive bias on the distribution and may be useful in its own right. For example, a trade-off between the complexity of the generative transformation and the form of the base measure was explored in [48] in the context of modelling tail behaviour.

5.1.2 Form of Diffeomorphisms

The majority of the flows explored are triangular flows (either coupling or autoregressive architectures). Residual networks and Neural ODEs are also being actively investigated and applied. A natural question to ask is: are there other ways to model diffeomorphisms which are efficient for computation? What inductive bias does the architecture impose? For instance, Spantini, Bigoni, and Marzouk [85] investigate the relation between the sparsity of the triangular flow and the Markov property of the target distribution.
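To illustrate the triangular/coupling structure discussed above, here is a minimal affine coupling layer in numpy. The conditioners s and t below are fixed toy functions standing in for the small neural networks a real flow would learn (an assumption for illustration only); the triangular Jacobian makes both inversion and the log-determinant cheap:

```python
import numpy as np

# Toy conditioners; in a real coupling flow these are learned networks.
def s(x1):  # log-scale conditioner
    return np.tanh(x1)

def t(x1):  # shift conditioner
    return 0.5 * x1

def coupling_forward(x):
    """Affine coupling: the first half passes through unchanged and
    parameterizes an affine transform of the second half."""
    x1, x2 = np.split(x, 2)
    y2 = x2 * np.exp(s(x1)) + t(x1)
    log_det = np.sum(s(x1))  # Jacobian is triangular: det = prod(exp(s(x1)))
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    """Exact inverse, no iteration needed, since y1 = x1 is available."""
    y1, y2 = np.split(y, 2)
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])

x = np.array([0.3, -1.2, 0.8, 2.0])
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)  # recovers x exactly
```

Stacking such layers with permutations of the dimensions between them is the basic recipe behind the RealNVP/Glow family of architectures.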
A related question concerns the best way to model conditional normalizing flows when one needs to learn a conditional probability distribution. Trippe and Turner [95] suggested using different flows for each condition, but this approach doesn't leverage weight sharing, and so is inefficient in terms of memory and data usage. Atanov, Volokhova, Ashukha, Sosnovik, and Vetrov [6] proposed using affine coupling layers where the parameters θ depend on the condition. Conditional distributions are useful in particular for time series modelling, where one needs to find p(y_t | y_{<t}) [60].

5.1.3 Loss Function

The majority of the existing flows are trained by minimization of the KL-divergence between the source and target distributions (or, equivalently, by log-likelihood maximization). However, other losses could be used, which would put normalizing flows in the broader context of optimal transport theory [99]. Interesting work has been done in this direction, including Flow-GAN [36] and the minimization of the Wasserstein distance as suggested by [4], [90].

5.2 Generalisation to Non-Euclidean Spaces

5.2.1 Flows on Manifolds

Modelling probability distributions on manifolds has applications in many fields including robotics, molecular biology, optics, fluid mechanics, and plasma physics [30], [79]. How best to construct a normalizing flow on a general differentiable manifold remains an open question. One approach to applying the normalizing flow framework on manifolds is to find a base distribution on Euclidean space and transfer it to the manifold of interest. There are two main approaches: 1) embed the manifold in Euclidean space and "restrict" the measure, or 2) induce the measure from the tangent space to the manifold. We briefly discuss each in turn.

One can use differential structure to define measures on manifolds [86]. Every differentiable and orientable manifold M has a volume form ω; then for a Borel subset U ⊂ M one can define its measure as μ_ω(U) = ∫_U ω. A Riemannian manifold has a natural volume form given by its metric tensor: ω = √|g| dx_1 ∧ … ∧ dx_D. Gemici et al. [30] explore this approach, considering an immersion of a D-dimensional manifold M into a Euclidean space, f: M → R^N, where N ≥ D. In this case one pulls back the Euclidean metric, and locally a volume form on M is ω = √(det((Df)^T Df)) dx_1 ∧ … ∧ dx_D, where Df is the Jacobian matrix of f. Rezende et al. [79] pointed out that the realization of this method is computationally hard, and proposed an alternative construction of flows on tori and spheres using diffeomorphisms of the one-dimensional circle as building blocks.

As another option, one can consider exponential maps exp_x: T_x M → M, mapping the tangent space of a Riemannian manifold (at a point x) to the manifold itself. If the manifold is geodesically complete, this map is globally defined, and locally it is a diffeomorphism. A tangent space has the structure of a vector space, so one can choose an isomorphism T_x M ≅ R^D. Then, for a base distribution with density pZ on R^D, one can push it forward to M via the exponential map. Additionally, applying a normalizing flow to a base measure before pushing it to M helps to construct multimodal distributions on M. If the manifold M is a hyperbolic space, the exponential map is a global diffeomorphism and all the formulas can be written explicitly. Using this method, Ovinnikov [73] introduced the Gaussian reparameterization trick in a hyperbolic space, and Bose et al. [10] constructed hyperbolic normalizing flows.

Instead of a Riemannian structure, one can impose a Lie group structure on a manifold G. In this case there also exists an exponential map exp: g → G mapping the Lie algebra to the Lie group, and one can use it to construct a normalizing flow on G. Falorsi et al. [28] introduced an analog of the Gaussian reparameterization trick for a Lie group.

5.2.2 Discrete Distributions

Modelling distributions over discrete spaces is important in a range of problems; however, the generalization of normalizing flows to discrete distributions remains an open problem in practice. Discrete latent variables were used by Dinh et al. [21] as an auxiliary tool to pushforward continuous random variables along piecewise-bijective maps (see Section 3.4.4.7). However, can we define normalizing flows if one or both of our distributions are discrete? This could be useful for many applications including natural language modelling, graph generation and others.

To this end, Tran et al. [94] model bijective functions on a finite set and show that, in this case, the change of variables is given by the formula pY(y) = pZ(g^{-1}(y)), i.e., with no Jacobian term (compare with Definition 1). For backpropagation of functions with discrete variables they use the straight-through gradient estimator [8]. However, this method is not scalable to distributions with large numbers of elements.

Alternatively, Hoogeboom et al. [43] model bijections on Z^D directly with additive coupling layers. Other approaches transform a discrete variable into a continuous latent variable with a variational autoencoder and then apply normalizing flows in the continuous latent space [101], [108].

A different approach is dequantization (i.e., adding noise to discrete data to make it continuous), which can be used with ordinal variables, e.g., discretized pixel intensities. The noise can be uniform, but other forms are possible, and the dequantization can even be learned as a latent variable model [41], [44]. Hoogeboom et al. [44] analyzed how different choices of dequantization objectives and dequantization distributions affect performance.

ACKNOWLEDGMENTS

The authors would like to thank Matt Taylor and Kry Yik-Chau Lui for their insightful comments.

REFERENCES

[1] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, "Noise flow: Noise modeling with conditional normalizing flows," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3165–3173.
[2] J. Agnelli, M. Cadeiras, E. Tabak, C. V. Turner, and E. Vanden-Eijnden, "Clustering and classification through normalizing flows in feature space," Multiscale Model. Simul., vol. 8, pp. 1784–1802, 2010.
[3] J. Arango and A. Gómez, "Diffeomorphisms as time one maps," Aequationes Math., vol. 64, pp. 304–314, 2002.
[4] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 214–223.
[5] V. Arnold, Ordinary Differential Equations. Cambridge, MA, USA: The MIT Press, 1978.
[6] A. Atanov, A. Volokhova, A. Ashukha, I. Sosnovik, and D. Vetrov, "Semi-conditional normalizing flows for semi-supervised learning," in Workshop Invertible Neural Nets Normalizing Flows (ICML), 2019.
[7] J. Behrmann, D. Duvenaud, and J.-H. Jacobsen, "Invertible residual networks," in Proc. 36th Int. Conf. Mach. Learn., 2019.
[8] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432.
[9] V. Bogachev, A. Kolesnikov, and K. Medvedev, "Triangular transformations of measures," Sbornik Math., vol. 196, no. 3/4, pp. 309–335, 2005.
[10] A. J. Bose, A. Smofsky, R. Liao, P. Panangaden, and W. L. Hamilton, "Latent variable modelling with hyperbolic normalizing flows," 2020, arXiv:2002.06336.
[11] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio, "Generating sentences from a continuous space," in Proc. 20th SIGNLL Conf. Comput. Natural Lang. Learn., 2015, pp. 10–21.
[12] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham, "Reversible architectures for arbitrarily deep residual neural networks," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 2811–2818.
[13] B. Chang, M. Chen, E. Haber, and E. H. Chi, "AntisymmetricRNN: A dynamical system view on recurrent neural networks," in Proc. Int. Conf. Learn. Representations, 2019.
[14] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin, "Continuous-time flows for efficient inference and density estimation," in Proc. Int. Conf. Mach. Learn., 2018.
[15] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 6572–6583.
[16] R. T. Q. Chen, J. Behrmann, D. Duvenaud, and J.-H. Jacobsen, "Residual flows for invertible generative modeling," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019.
[17] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, Jan. 2018.
[18] H. P. Das, P. Abbeel, and C. J. Spanos, "Dimensionality reduction flows," 2019, arXiv:1908.01686.
[19] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in Proc. Int. Conf. Learn. Representations Workshop, 2015.
[20] L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using real NVP," in Proc. Int. Conf. Learn. Representations, 2017.
[21] L. Dinh, J. Sohl-Dickstein, R. Pascanu, and H. Larochelle, "A RAD approach to deep mixture models," in Proc. Int. Conf. Learn. Representations Workshop, 2019.
[22] D. Dua and C. Graff, "UCI machine learning repository," 2017.
[23] E. Dupont, A. Doucet, and Y. W. Teh, "Augmented neural ODEs," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 3140–3150.
[24] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, "Cubic-spline flows," in Workshop Invertible Neural Nets Normalizing Flows (ICML), 2019.
[25] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, "Neural spline flows," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 7511–7522.
[26] Weinan E, "A proposal on machine learning via dynamical systems," Commun. Math. Statist., vol. 5, pp. 1–11, 2017.
[27] P. Esling, N. Masuda, A. Bardet, R. Despres, and A. Chemla-Romeu-Santos, "Universal audio synthesizer control with normalizing flows," 2019, arXiv:1907.00971.
[28] L. Falorsi, P. de Haan, T. R. Davidson, and P. Forré, "Reparameterizing distributions on Lie groups," 2019, arXiv:1903.02958.
[29] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman, "How to train your neural ODE," 2020, arXiv:2002.02798.
[30] M. C. Gemici, D. Rezende, and S. Mohamed, "Normalizing flows on Riemannian manifolds," 2016, arXiv:1611.02304.
[31] M. Germain, K. Gregor, I. Murray, and H. Larochelle, "MADE: Masked autoencoder for distribution estimation," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 881–889.
[32] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, "The reversible residual network: Backpropagation without storing activations," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 2211–2221.
[33] I. J. Goodfellow et al., "Generative adversarial nets," in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[34] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud, "FFJORD: Free-form continuous dynamics for scalable reversible generative models," in Proc. Int. Conf. Learn. Representations, 2019.
[35] J. Gregory and R. Delbourgo, "Piecewise rational quadratic interpolation to monotonic data," IMA J. Numer. Anal., vol. 2, no. 2, pp. 123–130, 1982.
[36] A. Grover, M. Dhar, and S. Ermon, "Flow-GAN: Combining maximum likelihood and adversarial learning in generative models," in Proc. AAAI Conf. Artif. Intell., 2018.
[37] E. Haber, L. Ruthotto, and E. Holtham, "Learning across scales - A multiscale method for convolution neural networks," in Proc. AAAI Conf. Artif. Intell., 2018.
[38] L. Hasenclever, J. M. Tomczak, R. van den Berg, and M. Welling, "Variational inference with orthogonal normalizing flows," in Workshop Bayesian Deep Learn. (NeurIPS), 2017.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1026–1034.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[41] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel, "Flow++: Improving flow-based generative models with variational dequantization and architecture design," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 2722–2730.
[42] E. Hoogeboom, R. van den Berg, and M. Welling, "Emerging convolutions for generative normalizing flows," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 2771–2780.
[43] E. Hoogeboom, J. W. Peters, R. van den Berg, and M. Welling, "Integer discrete flows and lossless compression," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019.
[44] E. Hoogeboom, T. S. Cohen, and J. M. Tomczak, "Learning discrete distributions by dequantization," 2020, arXiv:2001.11235.
[45] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville, "Neural autoregressive flows," in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 2078–2087.
[46] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
[47] J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon, "i-RevNet: Deep invertible networks," in Proc. Int. Conf. Learn. Representations, 2018.
[48] P. Jaini, I. Kobyzev, M. Brubaker, and Y. Yu, "Tails of triangular flows," 2019, arXiv:1907.04481.
[49] P. Jaini, K. A. Selby, and Y. Yu, "Sum-of-squares polynomial flow," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 3009–3018.
[50] M. Jankowiak and F. Obermeyer, "Pathwise derivatives beyond the reparameterization trick," in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 2235–2244.
[51] G. Kanwar et al., "Equivariant flow-based sampling for lattice gauge theory," 2020, arXiv:2003.06413.
[52] A. Katok and B. Hasselblatt, Introduction to the Modern Theory of Dynamical Systems. New York, NY, USA: Cambridge Univ. Press, 1995.
[53] S. Kim, S. Gil Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," in Proc. 36th Int. Conf. Mach. Learn., 2018, pp. 3370–3378.
[54] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. 2nd Int. Conf. Learn. Representations, 2014.
[55] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," 2019, arXiv:1906.02691.
[56] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4743–4751.
[57] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 10215–10224.
[58] J. Köhler, L. Klein, and F. Noé, "Equivariant flows: Sampling configurations for multi-body systems with symmetric energies," in Workshop Mach. Learn. Physical Sciences (NeurIPS), 2019.
[59] D. Koller and N. Friedman, Probabilistic Graphical Models. Cambridge, MA, USA: MIT Press, 2009.
[60] M. Kumar et al., "VideoFlow: A flow-based generative model for video," in Workshop Invertible Neural Nets Normalizing Flows (ICML), 2019.
[61] P. M. Laurence, R. J. Pignol, and E. G. Tabak, "Constrained density estimation," in Proc. Wolfgang Pauli Inst. Conf. Energy Commodity Trading, 2014, pp. 259–284.
[62] X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. Duvenaud, "Scalable gradients for stochastic differential equations," 2020, arXiv:2001.01328.
[63] A. Liutkus, U. Şimşekli, S. Majewski, A. Durmus, and F.-R. Stöter, "Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 4104–4113.
[64] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. Int. Conf. Mach. Learn., 2013.
[65] K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe, "GraphNVP: An invertible flow model for generating molecular graphs," 2019, arXiv:1905.11600.
[66] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th Int. Conf. Comput. Vis., 2001, pp. 416–423.
[67] B. Mazoure, T. Doan, A. Durand, J. Pineau, and R. D. Hjelm, "Leveraging exploration in off-policy algorithms via normalizing flows," in Proc. 3rd Conf. Robot Learn., 2019.
[68] K. V. Medvedev, "Certain properties of triangular transformations of measures," Theory Stochastic Processes, vol. 14, no. 30, pp. 95–99, 2008.
[69] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, "Neural importance sampling," ACM Trans. Graph., vol. 38, 2018, Art. no. 145.
[70] P. Nadeem Ward, A. Smofsky, and A. Joey Bose, "Improving exploration in soft-actor-critic with normalizing flows policies," in Workshop Invertible Neural Nets Normalizing Flows (ICML), 2019.
[71] F. Noé, S. Olsson, J. Köhler, and H. Wu, "Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning," Science, vol. 365, 2019, Art. no. eaaw1147.
[72] B. Øksendal, Stochastic Differential Equations (3rd Ed.): An Introduction With Applications. Berlin, Germany: Springer, 1992.
[73] I. Ovinnikov, "Poincaré Wasserstein autoencoder," in Workshop Bayesian Deep Learn. (NeurIPS), 2018.
[74] G. Papamakarios, T. Pavlakou, and I. Murray, "Masked autoregressive flow for density estimation," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 2335–2344.
[75] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, "Normalizing flows for probabilistic modeling and inference," 2019, arXiv:1912.02762.
[76] S. Peluchetti and S. Favaro, "Neural stochastic differential equations," 2019, arXiv:1905.11065.
[77] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 3617–3621.
[78] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 1530–1538.
[79] D. J. Rezende et al., "Normalizing flows on tori and spheres," 2020, arXiv:2002.02428.
[80] O. Rippel and R. P. Adams, "High-dimensional probability estimation with deep density models," 2013, arXiv:1302.5125.
[81] T. Salimans, D. P. Kingma, and M. Welling, "Markov chain Monte Carlo and variational inference: Bridging the gap," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 1218–1226.
[82] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
[83] H. Salman, P. Yadollahpour, T. Fletcher, and N. Batmanghelich, "Deep diffeomorphic normalizing flows," 2018, arXiv:1810.03256.
[84] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 2256–2265.
[85] A. Spantini, D. Bigoni, and Y. Marzouk, "Inference via low-dimensional couplings," J. Mach. Learn. Res., vol. 19, pp. 2639–2709, Mar. 2017.
[86] M. Spivak, Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. San Francisco, CA, USA, 1965.
[87] J. Suykens, H. Verrelst, and J. Vandewalle, "On-line learning Fokker-Planck machine," Neural Process. Lett., vol. 7, pp. 81–89, 1998.
[88] E. G. Tabak and C. V. Turner, "A family of nonparametric density estimation algorithms," Commun. Pure Appl. Math., vol. 66, no. 2, pp. 145–164, 2013.
[89] E. G. Tabak and E. Vanden-Eijnden, "Density estimation by dual ascent of the log-likelihood," Commun. Math. Sci., vol. 8, no. 1, pp. 217–233, 2010.
[90] I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf, "Wasserstein auto-encoders," in Proc. Int. Conf. Learn. Representations, 2018.
[91] J. Tomczak and M. Welling, "Improving variational auto-encoders using convex combination linear inverse autoregressive flow," Benelearn, 2017.
[92] J. M. Tomczak and M. Welling, "Improving variational auto-encoders using Householder flow," 2016, arXiv:1611.09630.
[93] A. Touati, H. Satija, J. Romoff, J. Pineau, and P. Vincent, "Randomized value functions via multiplicative normalizing flows," in Proc. Conf. Uncertainty Artif. Intell., 2019.
[94] D. Tran, K. Vafa, K. Agrawal, L. Dinh, and B. Poole, "Discrete flows: Invertible generative models of discrete data," in Proc. Int. Conf. Learn. Representations Workshop, 2019.
[95] B. L. Trippe and R. E. Turner, "Conditional density estimation with Bayesian normalising flows," in Workshop Bayesian Deep Learn. (NeurIPS), 2017.
[96] B. Tzen and M. Raginsky, "Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit," 2019, arXiv:1905.09883.
[97] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling, "Sylvester normalizing flows for variational inference," in Proc. 34th Conf. Uncertainty Artif. Intell., 2018.
[98] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. 35th Int. Conf. Mach. Learn., 2017, pp. 3918–3926.
[99] C. Villani, Topics in Optimal Transportation (Graduate Studies in Mathematics 58). Providence, RI, USA: American Mathematical Society, 2003.
[100] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, "Generative adversarial networks: Introduction and outlook," IEEE/CAA J. Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017.
[101] P. Z. Wang and W. Y. Wang, "Riemannian normalizing flow on variational Wasserstein autoencoder for text modeling," 2019, arXiv:1904.02399.
[102] A. Wehenkel and G. Louppe, "Unconstrained monotonic neural networks," 2019, arXiv:1908.05164.
[103] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 681–688.
[104] P. Wirnsberger et al., "Targeted free energy estimation via learned mappings," 2020, arXiv:2002.04913.
[105] K. W. K. Wong, G. Contardo, and S. Ho, "Gravitational wave population inference with deep flow-based generative network," 2020, arXiv:2002.09491.
[106] H. Zhang, X. Gao, J. Unterman, and T. Arodz, "Approximation capabilities of neural ordinary differential equations," 2019, arXiv:1907.12998.
[107] G. Zheng, Y. Yang, and J. Carbonell, "Convolutional normalizing flows," in Workshop Theoretical Foundations Applications Deep Generative Models (ICML), 2018.
[108] Z. M. Ziegler and A. M. Rush, "Latent normalizing flows for discrete sequences," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7673–7682.
Authorized licensed use limited to: CNR Biblioteca Centrale. Downloaded on October 18,2023 at 20:56:44 UTC from IEEE Xplore. Restrictions apply.
KOBYZEV ET AL.: NORMALIZING FLOWS: AN INTRODUCTION AND REVIEW OF CURRENT METHODS 3979
Ivan Kobyzev received the master's degree in mathematical physics from St Petersburg State University, Russia, in 2011, and the PhD degree in mathematics from Western University, Canada, in 2016. He did two postdocs in mathematics and in computer science at the University of Waterloo, Canada. Currently he is a researcher at Borealis AI, Canada. His research interests include algebra, generative models, cognitive computing, and natural language processing.

Simon J.D. Prince received the master's degree from University College London, United Kingdom, and the doctorate degree from the University of Oxford, United Kingdom. He has a diverse research background and has published in wide-ranging areas including Computer Vision, Neuroscience, HCI, Computer Graphics, Medical Imaging, and Augmented Reality. He is also the author of a popular textbook on Computer Vision. From 2005–2012, he was a tenured faculty member with the Department of Computer Science, University College London, where he taught courses in Computer Vision, Image Processing, and Advanced Statistical Methods. During this time, he was director of the MSc in Computer Vision, Graphics and Imaging. He worked in industry applying AI to computer graphics software. Currently he is a research director of Borealis AI's Montreal office.

Marcus A. Brubaker (Member, IEEE) received the PhD degree from the University of Toronto, Canada, in 2011. He did postdocs at the Toyota Technological Institute, Chicago, the Toronto Rehabilitation Hospital, and the University of Toronto, Canada. His research interests include computer vision, machine learning, and statistics. He is currently an assistant professor at York University, Toronto, Canada, an adjunct professor at the University of Toronto, Canada, and a faculty affiliate of the Vector Institute. He is also an academic advisor to Borealis AI, Canada, where he previously worked as the research director of the Toronto office. He is also an associate editor for the journal IET Computer Vision and has served as a reviewer and an area chair for many computer vision and machine learning conferences.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.