Montesuma Et Al. - 2023 - Recent Advances in Optimal Transport For Machine L

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 1

Recent Advances in Optimal Transport for


Machine Learning
Eduardo Fernandes Montesuma, Fred Ngolè Mboula, and Antoine Souloumiac,

Abstract—Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and
manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in
machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for
Machine Learning over the period 2012 – 2022, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer
and reinforcement learning. We further highlight the recent development in computational Optimal Transport, and its interplay with
Machine Learning practice.
arXiv:2306.16156v1 [cs.LG] 28 Jun 2023

Index Terms—Optimal Transport, Wasserstein Distance, Machine Learning

1 I NTRODUCTION

O PTIMAL transportation theory is a well-established


field of mathematics founded by the works of Gaspard
Monge [1] and economist Leonid Kantorovich [2]. Since
reinforcement learning (section 7). Section 8 concludes this
paper with general remarks and future research directions.

its genesis, this theory has made significant contributions 1.1 Related Work
to mathematics [3], physics [4], and computer science [5]. Before presenting the recent contributions of OT for ML,
Here, we study how Optimal Transport (OT) contributes to we briefly review previous surveys on related themes. Con-
different problems within Machine Learning (ML). Optimal cerning OT as a field of mathematics, a broad literature is
Transport for Machine Learning (OTML) is a growing re- available. For instance, [12] and [13], present OT theory from
search subject in the ML community. Indeed, OT is useful a continuous point of view. The most complete text on the
for ML through at least two viewpoints: (i) as a loss function, field remains [3], but it is far from an introductory text.
and (ii) for manipulating probability distributions. As we cover in this survey, the contributions of OT for
First, OT defines a metric between distributions, known ML are computational in nature. In this sense, practitioners
by different names, such as Wasserstein distance, Dudley will be focused on discrete formulations and computational
metric, Kantorovich metric, or Earth Mover Distance (EMD). aspects of how to implement OT. This is the case of surveys [9]
This metric belongs to the family of Integral Probability and [10]. While [9] serves as an introduction to OT and
Metrics (IPMs) (see section 2.1.1). In many problems (e.g., its discrete formulation, [10] gives a more broad view on
generative modeling), the Wasserstein distance is preferable the computational methods for OT and how it is applied
over other notions of dissimilarity between distributions, to data sciences in general. In terms of OT for ML, we
such as the Kullback-Leibler (KL) divergence, due its topo- compare our work to [7] and [11]. The first presents the
logical and statistical properties. theory of OT and briefly discusses applications in signal
Second, OT presents itself as a toolkit or framework processing and ML (e.g., image denoising). The second
for ML practitioners to manipulate probability distribu- provides a broad overview of areas OT has contributed to
tions. For instance, researchers can use OT for aggregating ML, classified by how data is structured (i.e., histograms,
or interpolating between probability distributions, through empirical distributions, graphs).
Wasserstein barycenters and geodesics. Hence, OT is a prin- Due to the nature of our work, there is a relevant level of
cipled way to study the space of probability distributions. intersection between the themes treated in previous OTML
This survey provides an updated view of how OTML in surveys and ours. Whenever that is the case, we provide an
the recent years. Even though previous surveys exist [6]– updated view of the related topic. We highlight a few topics
[11], the rapid growth of the field justifies a closer look that are new. In section 3, projection robust, structured,
at OTML. The rest of this paper is organized as follows. neural and mini-batch OT are novel ways of computing OT.
Section 2 discusses the fundamentals of different areas of We further analyze in section 8.1 the statistical challenges of
ML, as well as an overview of OT. Section 3 reviews recent estimating OT from finite samples. In section 4 we discuss
developments in computational optimal transport. The further recent advances in OT for structured data and fairness. In
sections explore OT for 4 ML problems: supervised (sec- section 5 we discuss Wasserstein autoencoders, and nor-
tion 4), unsupervised (section 5), transfer (section 6), and malizing flows. Also, the presentation of OT for dictionary
learning, and its applications to graph data and natural
language are new. In section 6 we present an introduction to
• The authors are with the Université Paris-Saclay, CEA, LIST, F-91120, domain adaptation, and discuss generalizations of the stan-
Palaiseau, France.E-mail: eduardo.fernandesmontesuma@cea.fr
dard domain adaptation setting (i.e., when multiple sources
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 2

are available) and transferability. Finally, in section 7 we and the effort of transportation can be translated as the
discuss distributional and Bayesian reinforcement learning, kinetic energy throughout the transportation,
as well as policy optimization. To the best of our knowledge, Z 1Z
Reinforcement Learning (RL) was not present in previous LB (ρ, v) = ∥v(t, x)∥2 ρ(t, x)dxdt. (4)
related reviews. 0 Rd

In addition, [16] established the connection between dy-


namic and static OT through the equations,
2 BACKGROUND
v(t, x) = ∇φ(t, x),
2.1 Optimal Transport ∂φ 1 (5)
+ ∥∇φ∥2 = 0.
Let P, Q ∈ P(Rd ). The Monge formulation [14] relies on a ∂t 2
mapping T : Rd → Rd such that T♯ P = Q, where T♯ is the This is known as Benamou-Brenier (BB) formulation.
push-forward operator associated with T , that is, (T♯ P )(A) = See [12, Section 5.3] for further theoretical details.
P (T −1 (A)) for A ⊂ Rd . An OT mapping T must minimize When c is a metric on Rd , OT defines a metric over dis-
the effort of transportation, tributions known as Wasserstein distance Wc . This distance
corresponds to the minimizer/maximizer of each effort
LM (T ) = E [c(x, T (x))], functional defined so far. For instance, in the case of the
x∼P
MK formulation,
where c : Rd × Rd → R+ is called ground-cost. The Monge
formulation is notoriously difficult to analyze, partly due Wc (P, Q) = arg inf E [c(x1 , x2 )].
γ∈Γ(P,Q) (x1 ,x2 )∼γ
to the constraint involving T♯ . A simpler formulation was
presented by [2] and relies on an OT plan γ : Rd × Rd → This has interesting consequences. For instance, the Wasser-
[0, 1]. Such a plan must preserve mass. For sets A, B ⊂ Rd , stein distance provides a principled way of interpolating
and averaging distributions, through Wasserstein geodesics
γ(Rd , B) = Q(B), and, γ(A, Rd ) = P (A). and barycenters. Concerning geodesics, as shown in [3,
Chapter 7], a geodesic between P and Q is a distribution
The set of all γ sufficing these requirements is denoted by Pt defined in terms of the OT plan γ ,
Γ(P, Q). In this case, the effort of transportation is,
Pt = πt,♯ γ, (6)
LK (γ) = E [c(x1 , x2 )].
(x1 ,x2 )∼γ where πt (x1 , x2 ) = (1 − t)x1 + tx2 . Likewise, let P =
{Pi }N
i=1 , the Wasserstein barycenter [17] of P , weighted by
This is known as Monge-Kantorovich (MK) formulation. PN
For a theoretical proof of equivalence between Monge and α ∈ ∆N = {a ∈ RN + : i=1 ai = 1} is,
Kantorovich approaches see [12, Section 1.5]. N
X
The MK formulation is easier to analyze because Γ(P, Q) B(α; P) = arg inf αi Wc (Pi , Q). (7)
and LK (γ) are linear w.r.t. γ . This remark characterizes Q i=1
this formulation as a linear program. As such, the MK Even though this constitutes a solid theory for OT, in
formulation admits a dual problem [12, Section 1.2] in terms ML one seldom knows P and Q a priori. In practice, one
of Kantorovich potentials φ : Rd → R and ψ : Rd → R. As (P ) (Q)
rather has access to samples xi ∼ P and xj ∼ Q. This
such, one maximizes, leads to two consequences: (i) one solves OT approximately
through the available samples, and (ii) one develops a
L∗K (φ, ψ) = E [φ(x)] + E [ψ(x)], (1)
x∼P x∼Q computational theory of OT. Indeed both consequences are
quite challenging, which kept OT from being used in ML
over the constraint Φc = {(φ, ψ) : φ(x1 ) + ψ(x2 ) ≤ for a long time. We discuss these in section 3.
c(x1 , x2 )}. When c(x1 , x2 ) = ∥x1 − x2 ∥22 , the celebrated
Brenier theorem [15] establishes a connection between the 2.1.1 Families of Probability Metrics
Monge and MK formulations: T = ∇φ. Alternatively, when
Probability metrics are functionals that compare probability
c(x1 , x2 ) = ∥x1 − x2 ∥1 , one has the so-called Kantorovich-
distributions. These can be either proper metrics (e.g., the
Rubinstein (KR) duality, which seeks to maximize
Wasserstein distance) or divergences (e.g., the KL diver-
L∗KR (φ) = E [φ(x)] − E [φ(x)], (2) gence). They quantify how different the two distributions
x∼P x∼Q are. In probability theory, there are two prominent fam-
over φ ∈ Lip1 , the set of functions with unitary Lipschitz ilies of distributions metrics, Integral Probability Metrics
norm. (IPMs) [18] and f −divergences [19]. For a family of func-
tions F , an IPM [20] dF between distributions P and Q is
In a parallel direction, one could see the process of
given by,
transportation dynamically. This line of work was proposed
in [16]. Let ρt (x) = ρ(t, x) be such that ρ(0, ·) = P and
ρ(1, ·) = Q and v(t, x) be a velocity field. Mass preservation dF (P, Q) = sup E [f (x)] − E [f (x)] .
f ∈F x∼P x∼Q
corresponds to the continuity equation,
As such, IPMs measure the distance between distributions
∂ρt based on the difference P − Q. Immediately, L∗KR is an IPM
+ ∇ · (ρt v) = 0, (3)
∂t for F = Lip1 . Another important metric is the Maximum
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 3

M = {θ = (m, σ 2) : m ∈ R, σ ∈ R+}
2-Wasserstein Distance KL-Divergence MMD (Linear Kernel) MMD (Gaussian Kernel)

0.300
0.400

0.100
2.50 2.50 2.50 2.50

0.700

0
0.80
00 1.0 00

1.2 050
2
50

1.
1.

00
2.25 2.25 5 2.25 2.25 0.05
0.12

0.500
0.600
0.700
2.00 2.00 2.00 2.00

0.200
0.04
0.9

1.75 00 1.75 1.75 1.75

0.900
0.600
0.3

µ(x)
θ0 θ0 θ0 θ0 0.03
1.50 1.50 1.50 1.50
σ

0.02
0.600
0.500
0.300
0.100

5
1.25 1.25 1.25 1.25 0.02

5
0.062

.07
50
0.450 0.2

0.1 .12 .100 0


0
0.5

0.900

05
1.00 1.00 0 1.00 1.00
0.

0.
0.01
75

0.
0
1.0 200

10

0.1.15

50 5
0.75 1. 050 0.75 2.0 1.000 0.75 0.75

0 0
0
0
1.
50

25 0
00
0
1.
20

0.200
0.800

0.400
0.00
0.50 0.50 0.50 0.50
−1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 −5.0 −2.5 0.0 2.5 5.0
µ µ µ µ x

(a) (b)

Fig. 1: Comparison on how different metrics and divergences (a) calculate discrepancies on the manifold of Gaussian
distributions (b).

Mean Discrepancy (MMD) [21], defined for F = {f ∈ Hk : into a label space Y . Henceforth, we will assume X = Rd . Y
∥f ∥Hk ≤ 1}, where Hk is a Reproducing Kernel Hilbert is, for instance, {1, · · · , nc }, as in nc classes classification, or
Space (RKHS) with kernel k : Rd × Rd → R. These metrics R, in regression. The supervision in this setting comes from
play an important role in generative modeling and domain each x having an annotation y , coming from a ground-truth
adaptation. A third distance important in the latter field is labeling function h0 . For a distribution P , a loss function L,
the H−distance [22]. Conversely, f −divergences measure and hypothesis h, h′ ∈ H, the risk functional is,
the discrepancy between P and Q based on the ratio of their
probabilities. For a convex, lower semi-continuous function RP (h, h′ ) = E [L(h(x), h′ (x))].
x∼P
f : R+ → R with f (1) = 0,
   We further denote RP (h) = RP (h, h0 ), in short. As pro-
P (x) posed in [24], one chooses h ∈ H by risk minimization,
Df (P ||Q) = E f , (8)
x∼Q Q(x) i.e., the minimizer of RP . Nonetheless, solving a supervised
learning problem with this principle is difficult, since neither
where, with an abuse of notation, P (x) (resp. Q) denote
P nor h0 are known beforehand. In practice, one has a
the density of P . An example of f −divergence is the KL (P ) (P ) (P )
divergence, with f (u) = u log u. dataset {(xi , yi )}n i=1 , xi ∼ P , independently and
(P ) (P )
As discussed in [20], two properties favor IPMs over identically distributed (i.i.d.), and yi = h0 (xi ). This is
f-divergences. First, dF is defined even when P and Q known as Empirical Risk Minimization (ERM) principle,
have disjoint supports. For instance, at the beginning of n
1X (P ) (P )
training, Generative Adversarial Networks (GANs) gener- ĥ = arg min R̂P (h) = L(h(xi ), yi ).
ate poor samples, so Pmodel and Pdata have disjoint support. h∈H n i=1
In this sense, IPMs provide a meaningful metric, whereas In addition, in deep learning one implements classifiers
Df (Pmodel ||Pdata ) = +∞ irrespective of how bad Pmodel is. through Neural Networks (NNs). A deep NN, f : X → Y
Second, IPMs account for the geometry of the space where can be understood as the composition of a feature extractor
the samples live. As an example, consider the manifold of g : X → Z and a classifier h : Z → Y . In this sense, the
Gaussian distributions M = {N (µ, σ 2 ) : µ ∈ R, σ ∈ R+ }. gain with deep learning is that one automatizes the feature
As discussed in [10, Remark 8.2], When restricted to M, the extraction process.
Wasserstein distance with a Euclidean ground-cost is,
2.2.2 Unsupervised Learning
q
W2 (P, Q) = (µP − µQ )2 + (σP − σQ )2 ,
A key characteristic of unsupervised learning is the absence
whereas the KL divergence is associated with an hyperbolic (P )
of supervision. In this sense, samples xi ∼ P do not have
geometry. This is shown in Figure 1. Overall, the choice any annotation. In this survey we consider three problems
of discrepancy between distributions heavily influences the in this category, namely, generative modeling, dictionary
success of learning algorithms (e.g., GANs). Indeed, each learning, and clustering.
choice of metric/divergence induces a different geometry First, generative models seek to learn to represent prob-
in the space of probability distributions, thus changing the ability distributions over complex objects, such as images,
underlying optimization problems in ML. text and time series [25]. This goal is embodied, in many
(P )
cases, by learning how to reproduce samples xi from
2.2 Machine Learning an unknown distribution P . Examples of algorithms are
As defined by [23], ML can be broadly understood as data- GANs [25], Variational Autoencoders (VAEs) [26], and Nor-
driven algorithms for making predictions. In what follows malizing Flows (NFs) [27].
we define different problems within this field. Second, for a set of n observations {xi }n i=1 , dictionary
learning seeks to decompose each xi = DT ai , where D ∈
2.2.1 Supervised Learning Rk×d is a dictionary of atoms {dj }kj=1 , and ai ∈ Rk is a
Following [24], supervised learning seeks to estimate a representation. In this sense, one learns a representation for
hypothesis h : X → Y , h ∈ H, from a feature space X data points xi .
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 4

Finally, clustering tries to find groups of data-points in a 2.3 The Curse of Dimensionality
multi-dimensional space. In this case, one assigns each xi to The curse of dimensionality is a term coined by [35], and
one cluster ci ∈ {1, · · · , K}. Hence clustering is similar to loosely refers to phenomena occurring in high-dimensions
classification, except that in the former, one does not know that seem counter-intuitive in low dimensions (e.g., [36]).
beforehand the number of classes/clusters K . Due to the term curse, these phenomena often have a neg-
2.2.3 Transfer Learning ative connotation. For instance, consider binning Rd into n
distinct values [37, Section 5.11.1]. The number of distinct
Even though the ERM framework helps to formalize super-
bins grows exponentially with the dimension. As a conse-
vised learning, data seldom is i.i.d. As remarked by [28], the
quence, for assessing the probability P (X = xk ) accurately,
conditions on which data were acquired for training a model
one would need exponentially many samples (i.e., imagine
are likely to change when deploying it. As a consequence,
at least one sample per bin).
the statistical properties of the data change, a case known as
This kind of phenomenon also arises in OT. As studied
distributional shift. Examples of this happen, for instance, in
by [38], for n samples, the convergence rate of the empirical
vision [29] and signal processing [30].
estimator with for the Wasserstein distance decreases expo-
Transfer Learning (TL) was first formalized by [31], and
nentially with the dimension, i.e., W1 (P, P̂ ) = O(n− /d ),
1

rely on concepts of domains. A domain D = (X , P (X)) is a


where f (x) = O(g(x)) if ∃M ≥ 0 and x0 s.t. |f (x)| ≤
pair of a feature space X and a marginal feature distribution
M g(x) ∀x ≥ x0 . Henceforth we refer to the convergence
P (X). A task T = (Y, P (Y |X)) is a pair of a label space T
rate as sample complexity. This result hinders the applicability
and a predictive distribution P (Y |X). Within TL, a problem
of the Wasserstein distance for ML, but as we will discuss,
that received attention from the ML community is Unsuper-
other OT tools enjoy better sample complexity (see e.g., 8.1
vised Domain Adaptation (UDA), where one has the same
and table 1).
task, but different source and target domains DS ̸= DT .
Assuming the same feature space, this corresponds to a shift
in distribution, PS (X) ̸= PT (X).
3 C OMPUTATIONAL O PTIMAL T RANSPORT
2.2.4 Reinforcement Learning Computational OT is an active field of research that has seen
RL deals with dynamic learning scenarios and sequential significant development in the past ten years. This trend is
decision-making. Following [32], one assumes an environ- represented by publications [9], [39], and [10], and software
ment modeled through a Markov Decision Process (MDP), such as [40] and [41]. These resources set the foundations
which is a 5-tuple (S, A, P, R, ρ0 , λ) of a state space S , an for the recent spread of OT throughout ML and related
action space A, a transition distribution P : S × A × S → R, fields. One has mainly three strategies in computational
a reward function R : S × A → R, a distribution over the OT; (i) discretizing the ambient space; (ii) approximating
initial state ρ0 , and a discount factor λ ∈ (0, 1). the distributions P and Q from samples; (iii) assuming
RL revolves around an agent acting on the state space, a parametric model for P or Q. The first two strategies
changing the state of the environment. The actions are approximate distributions with a mixture of Dirac delta
chosen according to a policy π : S → R, which is a (P )
functions. Let xi ∼ P with probability pi > 0,
distribution π(·|s) over actions a ∈ A, for s ∈ S . Policies can
n
be evaluated with value functions Vπ (st ), and state-value X
P̂ = pi δx(P ) . (12)
functions Qπ (st , at ), which measure how good a state or i
i=1
state-action pair is under π . These are defined as, Pn
X∞  Naturally, i pi = 1 or p ∈ ∆n in short. In this case, we
η(π) = E λt R(st , at ) , say that P̂ is an empirical approximation of P . As follows, dis-
P,π
t=0 cretizing the ambient space is equivalent to binning it, thus

X  (P )
ℓ assuming a fixed grid of points xi . The sample weights pi
Vπ (s) = E λ R(st+ℓ , at+ℓ )|st = s , (9) correspond to the number of points that are assigned to the
P,π
ℓ=0
X∞  i-th bin. In this sense, the parameters of P̂ are the weights
Qπ (s, a) = E ℓ
λ R(st+ℓ , at+ℓ )|st = s, at = a , pi . In the literature, this strategy is known as Eulerian
P,π (P ) i.i.d.
ℓ=0 discretization. Conversely, one can sample xi ∼ P (resp.
where the expectation over P and π corresponds to s0 ∼ ρ0 , Q). In this case, pi = 1/n, and the parameters of P̂ are the
(P )
at ∼ π(·|st ), and st+1 ∼ P (·|st , at ). These quantities are locations xi . In the literature, this strategy is known as
related through the so-called Bellman’s equation [33], Lagrangian discretization. Finally, one may assume P follows
a known law for which the Wasserstein distance has a closed
Tπ Vπ (st ) = E [R(st , at )] + γ E [Vπ (s)],
P,π P,π formula (e.g., normal distributions).
(10) (P )
Tπ Qπ (st , at ) = E [R(st , at )] + γ E [Qπ (s, a)], We now discuss discrete OT. We assume {xi }n i=1 (resp.
P,π P,π (Q)
{xj }m j=1 ) sampled from P (resp. Q) with probability pi
where Tπ is called Bellman operator. Qπ can be learned from (resp. qj ). As in section 2.1, the Monge problem seeks a
experience [34] (i.e., tuples (s, a, r, s′ )). For a learning rate mapping T , that is the solution of,
αt > 0, the updates are as follows,
n
(P ) (P )
  X

Qπ (s, a) ← (1 − αt )Qπ (s, a) + αt r + λVπ (s ) . (11) T ⋆ = arg min LM (T ) = c(xi , T (xi )), (13)
T♯ P̂ =Q̂ i=1
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 5
P
where the constraint implies i∈I pi = qj , for I = {i : which implies that SW(P, Q) has O(Lnd + Ln log n) time
(Q) (P ) complexity. As shown in [48], SW (P, Q) is indeed a metric.
xj = T (xi )}. This formulation is non-linear w.r.t. T .
In addition, for m > n, it does not have a solution. The In addition, [49] and [50] proposed variants of SW , namely,
MK formulation, on the other hand, seeks an OT plan γ ∈ the max-SW distance and the generalized SW distance re-
Rn×m , where γij denotes the amount of mass transported spectively. Contrary to the averaging procedure in eq. 16,
from sample i to sample j . In this case γ must minimize, the max-SW of [49] is a mini-max procedure,
n X
m max-SW(P, Q) = max Wc (πu,♯ P, πu,♯ Q). (18)
(P ) (Q)
X
γ ⋆ = arg min LK (γ) = γij c(xi , xj ), (14) u∈Sd−1
γ∈Γ(p,q) i=1 j=1 This metric has the same advantage in sample complexity
n×m
P P as the SW distance, while being easier to compute.
where Γ(p, q) = {γ ∈ R : i γij = qj and j γij =
On another direction, [50] takes inspiration on the gener-
pi }. This is a linear program, which can be solved through
alized Radon transform [51] for defining a new metric. Let
the Simplex algorithm [42]. Nonetheless this implies on a
time complexity O(n3 log n). Alternatively, one can use the
g : Rd × Ωθ → R be a function sufficing, for ∀x ∈ Rd ,
approximation introduced by [39], by solving a regularized
∀θ ∈ Ωθ \{0}; (i) g ∈ C ∞ ; (ii) g(x, λθ) = λg(x, θ); (iii)
∂g/∂x(x, θ) ̸= 0; (iv) det(∂ 2 g/∂xi ∂θj )
problem, ij > 0. The Generalized
Sliced Wasserstein (GSW) distances are defined by,
n X
m Z
(P ) (Q)
X
γϵ⋆ = arg min γij c(xi , xj ) + ϵH(γ), GSW(P, Q) = Wc (gθ,♯ P, gθ,♯ Q)dθ.
γ∈Γ(p,q) i=1 j=1 Ωθ

which provides a faster way to estimate γ . Provided ei- One may define the max-GSW in analogy to eq. 18.
ther T ⋆ or γ ⋆ , OT defines a metric known as Wasserstein In addition, one can project samples on a sub-space
distance, e.g., Wc (P̂ , Q̂) = LK (γ ⋆ ). Likewise, for γϵ⋆ one 1 < k < d. For instance, [52] proposed the Subspace Robust
has Wc,ϵ (P̂ , Q̂) = LK (γϵ⋆ ). Associated with this approxima- Wasserstein (SRW) distances,
tion, [43] introduced the so-called Sinkhorn divergence,
SRWk (P, Q) = inf sup E [∥πE (x1 − x2 )∥22 ],
γ∈Γ(P,Q) E∈Gk (x1 ,x2 )∼γ
Wc,ϵ (P̂ , P̂ ) + Wc,ϵ (Q̂, Q̂)
Sc,ϵ (P̂ , Q̂) = Wc,ϵ (P̂ , Q̂) − , (15)
2 where Gk = {E ⊂ Rd : dim(E) = k} is the Grassmannian
which has interesting properties, such as interpolating be- manifold of k−dimensional subspaces of Rd , and πE denote
tween the MMD of [21] and the Wasserstein distance. Over- the orthogonal projector onto E . This can be equivalently
all, Sc,ϵ has 4 advantages: (i) its calculations can be done on formulated through a projection matrix U ∈ Rk×d , that is,
a GPU, (ii) For L > 0 iterations, it has complexity O(Ln2 ), SRWk (P, Q) = inf max E [Ux1 − Ux2 ]
(iii) Sc,ϵ is a smooth approximator of Wc (see e.g., [44]), (iv) Γ(P,Q) U∈Rk×d (x1 ,x2 )∼γ
UUT =Ik
it has better sample complexity (see e.g., [45]).
In the following, we discuss recent innovations on com- In practice, SRWk is based on a projected super gradient
putational OT. Section 3.1 present projection-based meth- method [52, Algorithm 1] that updates Ω = UUT , for a
ods. Section 3.2 discusses OT formulations with prescribed fixed γ , until convergence. These updates are computation-
structures. Section 3.3 presents OT through Input Convex ally complex, as they rely on eigendecomposition. Further
Neural Networks (ICNNs). Section 3.4 explores how to developments using Riemannian [53] and Block Coordinate
compute OT between mini-batches. Descent (BCD) [54] circumvent this issue by optimizing over
St(d, k) = {U ∈ Rk×d : UUT = Ik }.
3.1 Projection-based Optimal Transport
Projection-based OT relies on projecting data x ∈ Rd into 3.2 Structured Optimal Transport
sub-spaces Rk , k < d. A natural choice is k = 1, for which In some cases, it is desirable for the OT plan to have
computing OT can be done by sorting [12, chapter 2]. This is additional structure, e.g., in color transfer [55] and domain
called Sliced Wasserstein (SW) distance [46], [47]. Let Sd−1 = adaptation [56]. Based on this problem, [57] introduced a
{u ∈ Rd : ∥u∥22 = 1} denote the unit-sphere in Rd , and principled way to compute structured OT through sub-
πu : Rd → R denote πu (x) = ⟨u, x⟩. The Sliced-Wasserstein modular costs.
distance is, As defined in [57], a set function F : 2V → R is sub-
Z modular if ∀S ⊂ T ⊂ V and ∀v ∈ V \T ,
SW(P, Q) = Wc (πu,♯ P̂ , πu,♯ Q̂)du. (16)
Sd−1 F (S ∪ {v}) − F (S) ≥ F (T ∪ {v}) − F (T ).
This distance enjoys nice computational properties. First, These types of functions arise in combinatorial optimiza-
Wc (πu,♯ P, πu,♯ Q) can be computed in O(n log n) [10]. Sec- tion, and further OT since OT mappings and plans can be
ond, the integration in equation 16 can be computed using seen as a matching between 2 sets, namely, samples from P
Monte Carlo estimation. For samples {uℓ }L ℓ=1 , uℓ ∈ S
d−1
and Q. In addition, F defines a base polytope,
uniformly,
BF = {G ∈ R|V | : G(V ) = F (V ); G(S) ≤ F (S) ∀S ⊂ V }.
L
1X
SW(P̂ , Q̂) = Wc (πuℓ ,♯ P̂ , πuℓ ,♯ Q̂), (17) Based on these concepts, note that OT can be formulated
L ℓ=1
in terms of set functions. Indeed, suppose X(P ) ∈ Rn×d and
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 6

X(Q) ∈ Rn×d . In this case, γ ⋆ or T ⋆ can be interpreted as a A0 W1 W2 WL−1


x σ1 σ2 ··· σL−1 f (x; θ)
graph with edge set E = {(ui , vi )}ni=1 , where ui represents
a sample in P and vi , a sample in Q. In this P case, the cost
of transportation is represented by F (S) = (u,v)∈S cuv . A1 AL−1
Hence, adding a new (u, v) to S is the same regardless of
the elements of S .
The insight of [57] is using the sub-modular property Fig. 2: Network architecture of an ICNN, as proposed
for acquiring structured OT plans. Through Lovász exten- by [63].
sion [58], this leads to,

(γ ⋆ , κ⋆ ) = arg min arg max ⟨γ, κ⟩F . As follows, [61] uses [62, Theorem 2.9], for rewriting the
γ∈Γ(p,q) κ∈BF
Wasserstein distance as,
A similar direction was explored by [59], who proposed
W2 (P, Q)2 = EP,Q − inf J(f, f ∗ ),
to impose a low rank structure on OT plans. This was done f ∈CVX(P )
with the purpose of tackling the curse of dimensionality 2 2
in OT. They introduce the transport rank of γ ∈ Γ(P, Q) where EP,Q = Ex∼P [∥x∥2/2] + Ex∼Q [∥x∥2/2] does not
defined as the smallest integer K for which, depend on the f , and J(f, f ∗ ) = Ex∼P [f (x)] +
Ex∼Q [f ∗ (x)] is minimized over the set CVX(P ) = {f :
K
X f is convex and Ex∼P [f (x)] < ∞}. f ∗ denotes the convex
γ= λk (Pk ⊗ Qk ), conjugate of f , i.e., f ∗ (x2 ) = supx1 ⟨x1 , x2 ⟩ − f (x1 ). Opti-
k=1
mizing J(f, f ∗ ) is, nonetheless, complicated due to the con-
where Pk , Qk , k = 1, · · · , K are distributions over Rd , and vex conjugation. The insight of [61] is to introduce a second
Pk ⊗ Qk denotes the (independent) joint distribution with ICNN, g ∈ CVX, that maximizes J(f, g). As consequence,
marginals Pk and Qk , i.e. (Pk ⊗ Qk )(x, y) = Pk (x)Qk (y).
For empirical P̂ and Q̂, K coincides with the non-negative W2 (P, Q)2 = sup inf VP,Q (f, g) + EP,Q , (19)
f ∈CVX(P ) g∈CVX(Q)
rank [60] of γ ∈ Rn×m . As follows, [59] denotes the set of
γ with transport rank at most K as ΓK (P, Q) (ΓK (p, q) for VP,Q (f, g) = − E [f (x)] − E [⟨x, ∇g(x)⟩ − f (∇g(x))].
x∼P x∼Q
empirical P̂ and Q̂). The robust Wasserstein distance is thus,
Therefore, [61] proposes solving the sup-inf problem in
FWK,2 (P, Q)2 = inf E [∥x1 − x2 ∥2 ]. eq. 19 by parametrizing f ∈ CVX(P ) and g ∈ CVX(Q)
γ∈ΓK (P,Q) (x1 ,x)∼γ
through ICNNs. This becomes a mini-max problem, and the
In practice, this optimization problem is difficult due to the expectations can be approximated empirically using data
constraint γ ∈ ΓK . As follows, [59] propose estimating it from P and Q. An interesting feature from this work is that
using the Wasserstein barycenter B = B([1/2, 1/2]; {P, Q}) one solves OT for mappings f and g , rather than an OT plan.
(B) This has interesting applications for generative modeling
supported on K points, that is {xk }K k=1 , also called hubs.
As follows, the authors show how to construct a trans- (e.g. [64]) and domain adaptation.
port plan in Γk (P, Q), by exploiting the transport plans
γ1 ∈ Γ(P, B) and γ2 ∈ Γ(B, Q). This establishes a link
3.4 Mini-batch Optimal Transport
between the robustness of the Wasserstein distance, Wasser-
Pn (BP )
stein barycenters and clustering. Let λk = i=1 γki and A major challenge in OT is its time complexity. This mo-
(P ) (BP ) (P )
µk = 1/λk ni=1 γki xi , the authors in [59] propose the tivated different authors [43], [65], [66] to compute the
P
following proxy for the Wasserstein distance, Wasserstein distance between mini-batches rather than com-
plete datasets. For a dataset with n samples, this strategy
K
X (P ) (Q) leads to a dramatic speed-up, since for K mini-batches
FW2k,2 (P̂ , Q̂) = λk ∥µk − µk ∥22 . of size m ≪ n, one reduces the time complexity of OT
k=1
from O(n3 log n) to O(Km3 log m) [66]. This choice is key
when using OT as a loss in learning [67] and inference [68].
3.3 Neural Optimal Transport Henceforth we describe the mini-batch framework of [69],
In [61], the authors proposed to leverage the so-called con- for using OT as a loss.
vexification trick of [62] for solving OT through neural nets, Let LOT denote an OT loss (e.g. Wc or Sc,ϵ ). Assum-
using the ICNN architecture proposed in [63]. Formally, ing continuous distributions P and Q, the Mini-batch OT
an ICNN implements a function f : Rd → R such that (MBOT) loss is given by,
x 7→ f (x; θ) is convex. This is achieved through a special
structure, shown in figure 2. An L−layer ICNN is defined
LMBOT (P, Q) = E [LOT (X(P ) , X(Q) )],
(X(P ) ,X(Q) )∼P ⊗m ⊗Q⊗m
through the operations,
(P )
where X(P ) ∼ P ⊗m indicates xi ∼ P , i = 1, · · · , m. This
zℓ+1 = σℓ (Wℓ zℓ + Aℓ x + bℓ ); f (x; θ) = zL ,
loss inherits some properties from OT, i.e., it is positive and
(P ) P
where (i) all entries of Wℓ are non-negative; (ii) σ0 is convex; symmetric, but LMBOT (P, P ) > 0. In practice, let {xi }n i=1
(Q) nQ
(iii) σℓ , ℓ = 1, · · · , L − 1 is convex and non-decreasing. This and {xj }i=1 be iid samples from P and Q respectively. Let
is shown in figure 2. Im ⊂ {1, · · · , nP }m denote a set of m indices. We denote by
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 7

P̂Im to the empirical approximation of P with iid samples


(P )
X(P ) = {xi : i ∈ Im }. Therefore,
(k,m) 1 X
LMBOT (P̂ , Q̂) = LOT (P̂Ib , Q̂Ib′ ), (20)
k ′(Ib ,Ib )∈Ik

where Ik is a random set of k mini-batches of size m from


P and Q. This constitutes an estimator for LM BOT (P, Q),
which converges as n and k → ∞. We highlight 3 ad-
vantages that favor the MBOT for ML: (i) it is faster to
compute and computationally scalable; (ii) the deviation
(k,m)
bound between LMBOT (P, Q) and LMBOT does not depend
on the dimensionality of the space; (iii) it has unbiased
gradients, i.e., the expected gradient of the sample loss
equals the gradient of the true loss [70].
Nonetheless, as remarked in [71], using mini-batches
introduces artifacts in the OT plans, as they become less Fig. 3: Transportation plan between the image I1 of a 0, and
sparse. Indeed, at the level of a mini-batch, OT forces a the image I2 of a 5. An edge is drawn between each pair of
matching between samples that would not happen at a pixels (i, j) in I1 and (ℓ, k) in I2 for which γk1 k2 > 0.0.
global level. The solution proposed by [71] is to use unbal-
anced OT [72], as it allows for the creation and destruction of
mass during transportation. This effectively makes OT more where pi,ℓ is the probability of word wi occurring on doc-
robust to outliers in mini-batches. (P )
ument Pℓ (e.g., normalized frequency), and xi ℓ is the
embedding of word wi . This corresponds to the setting for
4 S UPERVISED L EARNING the Word Mover Distance (WMD) of [74], which relies on
In the following, we discuss OT for supervised learning the Wasserstein distance between P̂ℓ1 and P̂ℓ2 .
(see section 2.2.1 for the basic definitions). We divide our In addition, OT has been used for comparing data with
discussion into: (i) OT as a metric between samples; (ii) OT a graph structure. Formally, a graph G is a pair (V, E)
as a loss function; (iii) OT as a metric or loss for fairness. of a set of vertices and edges. Additional to the graph
In this section, we review the applications of OT for su- structure, one can have two mappings, ℓf : V → Rnf and
pervised learning (see section 2.2.1 for the basic definitions ℓx : V → Rd . For each vertex vi , fi = ℓf (vi ) is its associated
of this setting). We divide applications into two categories, feature vector in some metric space and xi = ℓx (vi ) its
(i) the use of OT as a metric and (ii) the use of OT as a loss, associated representation with respect to the graph structure.
which correspond to the two following sections. For a graph with N vertices, one can furthermore consider
a vector of vertex importance, p ∈ ∆N . This constitutes the
4.1 Optimal Transport as a metric definition of a structured object [75], S = (V, E, F, X, p),
In the following, we consider OT-based distances between F ∈ RN ×nf and X ∈ RN ×d . This defines multiple empirical
samples, rather than distributions of samples. This is the distributions,
case when data points are modeled using histograms over a
N N N
fixed grid. As such, OT can be used to define a matrix Mij X X X
of distances between samples i and j , which can later be P̂ = pi δ(xi ,fi ) , P̂ F = pi δfi , P̂ X = pi δxi .
i=1 i=1 i=1
integrated in distance-based classifiers such as k-neighbors
or Support Vector Machines (SVMs). For two structured objects Sℓ = (Vℓ , Eℓ , Fℓ , Xℓ , hℓ ), ℓ =
A natural example of histogram data are images. Indeed, 1, 2, a major challenge when comparing two graphs is that
a gray-scale image I ∈ Rh×w represents a histogram over there is no canonical way of corresponding their vertices. In
the pixel grid (i, j), i = 1, · · · , h and j = 1, · · · , w. In this what follows we discuss three approaches.
sense an OT plan between two images I1 and I2 is a matrix First, one may compare 2 structured objects by their
γ ∈ Rhw×hw , where γk1 ,k2 corresponds to how much mass feature distributions, P̂1F and P̂2F . Using this approach,
is sent from pixel k1 = (i, j) to k2 = (ℓ, k). An example of one loses the information encoded by representations xi ,
this is shown in figure 3. Such problem appears in several vertices vi and edges euv . Second, one may resort to the
foundational works of OT in signal processing, such as [5] Gromov-Wasserstein distance [76]. For a similarity function
and [39]. (ℓ) (P ) (P )
s, let Sij = s(xi ℓ , xj ℓ ). An example is the shortest path
A similar example occurs with text data. Let C =
distance matrix of Gℓ = (Vℓ , Eℓ ). The Gromov-Wasserstein
[D1 , · · · , DN ] represent a corpus of N documents. As-
(GW) distance is defined as,
sociated with C there is a vocabulary of words V =
[w1 , · · · , wn ].. In addition, each word wi can be embedded X (1) (2)
GW(P̂1X , P̂2X ) = min L(Sij , Skl )γik γjl , (22)
in a Euclidean space through word embeddings [73]. Hence γ∈Γ(p1 ,p2 )
ijkl
a document can be represented as,
n
X where L is a loss function (i.e. the squared error). In com-
P̂ℓ = pi,ℓ δx(Pℓ ) , (21) parison to the first case, one now disregards the features
i
i=1 rather than the graph structure. Third, [75] proposed the
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 8

Fused Gromov-Wasserstein (FGW) distance, which interpo- where Q = Q(X, S, Y ). In particular, the notion of statistical
lates between the previous choices, parity [84] is expressed in probabilistic terms as,

FGWα (P̂1 , P̂2 ) = min


X
Lijkl γik γjl , P (h(X) = 1|S = 0) = P (h(X) = 1|S = 1).
γ∈Γ(p1 ,p2 )
ijkl (23) As proposed by [85], this notion can be integrated into
(P1 ) (P2 ) (P1 ) (P2 ) ERM through OT. Indeed, for a transformation T sufficing
for Lijkl = (1 − α)c(fi , fj ) + αL(Sij , Skl ),
P (T (X)|S = 1) = P (T (X)|S = 0), a classifier trained on
T (X) naturally suffices the statistical parity criterion. [85]
4.2 Optimal Transport as a loss thus propose the strategy of random repair, which relies
on the Wasserstein barycenter for ensuring P (T (X)|S =
In section 2.2.1 we reviewed the ERM principle, which, for (P )
a suitable loss function L, seeks to minimize the empirical 1) = P (T (X)|S = 0). Let X(Ps ) = {xi s : si = s}
risk RP (h) for h ∈ H. In the context of neural nets, h = hθ where Ps = P (X|S = s). The OT plan between P0 and
P1 corresponds to γ ∈ Rn0 ×n1 , which is itself an empirical
is parametrized by the weightsPncθ of the network. An usual distribution,
choice in ML is L(ŷi , yi ) = k=1 yik log ŷik where yik = 1
if and only if the sample i belongs to class c, and ŷik ∈ [0, 1] X n1
n0 X

is the predicted probability of sample i belonging to c. This γ= γij δ(x(P0 ) ,x(P1 ) ) ,


i j
choice of loss is related to Maximum Likelihood Estimation i=1 j=1

(MLE), and hence to the KL divergence (see [77, Chapter 4]) hence the geodesic or barycenter P̂t between P̂0 and P̂1 is
between P (Y |X) and Pθ (Y |X). an empirical distribution with support,
As remarked by [78], this choice disregards similarities
(P0 ) (P1 )
in the label space. As such, a prediction of semantically x̃t,ij = (1 − t)xi + txj , ∀(ij) s.t. γij > 0.
close classes (e.g. a Siberian husky and a Eskimo dog)
The procedure of random repair corresponds to drawing
yields the same loss as semantically distant ones (e.g. a
b1 , · · · , bn0 +n1 ∼ B(λ), where B(λ) is a Bernoulli distribu-
dog and a cat). To remedy this issue, [78] proposes using
tion with parameter λ. For P0 (resp. P1 ),
the Wasserstein distance between the network’s predictions
n0
(
(P )
and the ground-truth vector, L(ŷ, y) = minγ∈U (ŷ,y) ⟨γ, C⟩F , (P0 )
[ {xi 0 } if bi = 0
where C ∈ Rnc ×nc is a metric between labels. X = ,
i=1
{xt,ij : γij > 0} if bi = 1
This idea has been employed in multiple applications.
For instance, [78] employs this cost between classes k1 where the amount of repair λ regulates the trade-off between
and k2 as Ck1 ,k2 = ∥xk1 − xk2 ∥22 , where xk ∈ Rd is the fairness (λ = 1) and classification performance (λ = 0). In
d−dimensional embedding corresponding to the class k . addition, t = 1/2, so as to avoid disparate impact.
In addition [79], [80] propose using this idea for seman- In another direction, one may use OT for devising a
tic segmentation, by employing a pre-defined ground-cost regularization term to enforce fairness in ERM. Two works
reflecting the importance of each class and the severity of follow this strategy, namely [86] and [87]. In the first case,
miss-prediction (e.g. confusing the road with a car and vice- the authors explore fairness in representation learning (see
versa in autonomous driving). for instance section 2.2.1). In this sense, let f = h ◦ g , be
the composition of a feature extractor g and a classifier h.
Therefore [86] proposes minimizing,
4.3 Fairness n
1X
The analysis of biases in ML algorithms has received in- (h⋆ , g ⋆ ) = arg min L(h(g(xi )), yi ) + βd(P0 , P1 ),
h,g n i=1
creasing attention, due to the now widespread use of these
systems for automatizing decision-making processes. In this for β > 0, P0 = P (X|S = 0) (resp S = 1) and d being either
context, fairness [81] is a sub-field of ML concerned with the MMD or the Sinkhorn divergence (see sections 2.1).
achieving fair treatment to individuals in a population. On another direction, [87] considers the output distribution
Throughout this section, we focus on fairness in binary Ps = P (h(g(X))|S = s). Their approach is more general,
classification. as they suppose that attributes can take more than 2 values,
The most basic notion of fairness is concerned with namely s ∈ {1, · · · , NS }. Their insight is that the output dis-
a sensitive or protected attribute s ∈ {0, 1}, which can tribution should be transported to the distribution closest to
be generalized to multiple categorical attributes. We focus each group’s conditional, namely Pk = P (h(g(X))|S = k).
on the basic case in the following discussion. The values This corresponds to Ps̄ = B(α; Pk n k=1 ), αs = /n. As in [86],
s ns

s = 0 and s = 1 correspond to the unfavored and favored the authors enforce this condition through regularization,
groups, respectively. This is complementary to the theory namely,
of classification, namely, ERM (see section 2.2.1), where one n NS
wants to find h ∈ H that predicts a label ŷ = h(x) to match
1X X nk
min L(h(g(xi )), y) + β W2 (Pk , P̄ ).
y ∈ {0, 1}. In this context different notions of fairness exist h,g n i=1 k=1
n
(e.g. [82, Table 1]). As shown in [83], these can be unified in Finally, we remark that OT can be used for devising a
the following equation, statistical test of fairness [83]. To that end, consider,
       
E 1h(x)=1 ϕ U, E [U ] = 0, F = Q: E 1h(x)=1 ϕ U, E [U ] = 0 ,
(x,s,y)∼Q (x,s,y)∼Q (x,s,y)∼Q (x,s,y)∼Q
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 9

the set of distributions Q for which a classifier h is fair. though (ii) and (iii) improve over (i), several works have
Now, let {(xi , si , yi )}N
i=1 be samples from a distribution P . confirmed that the WGAN and its variants do not estimate
Pn
Let P̂ = 1/n i=1 δ(xi ,si ,yi ) . The fairness test of [83] consists the Wasserstein distance [94]–[96].
on a hypothesis test with H0 : P ∈ F and H1 : P ∈ / F . The In another direction, one can calculate ∇θ Wp (Pθ , P0 )
test is then based on, using the primal problem (equation 14). To do so, [43] and
further [69] consider estimating Wp (Pθ , P0 ) through aver-
d(P, F) = inf W2 (P, Q), aging the loss over mini-batches (see section 3.4). For calcu-
Q∈F
lating derivatives ∇θ Wp (Pθ , P0 ), one has two approaches.
for which on rejects H0 if t = n × d(P, F) > η1−α , where
First, [43] advocate to back-propagate through Sinkhorn
η1−α is the (1 − α) × 100% quantile of the generalized chi- iterations. This is numerically unstable and computationally
squared distribution.
complex. Second, as discussed in [97], [98], one can use
the Envelope Theorem [99] to take derivatives of Wp . In
5 U NSUPERVISED L EARNING its primal form, it amounts to calculate γ ⋆ = OTϵ (p, q, C)
then differentiating w.r.t. C.
In this section, we discuss unsupervised learning techniques
that leverage the Wasserstein distance as a loss function. We In addition, as discussed in [43], an Euclidean ground-
consider three cases: generative modeling (section 5.1), rep- cost is likely not meaningful for complex data (e.g., images).
resentation learning (section 5.2) and clustering (section 5.3). The authors thus propose learning a parametric ground cost
(P ) (P )
(Cη )ij = ∥fη (xi θ ) − fη (xj 0 )∥, where fη : X → Z is
a NN that learns a representation for x ∈ X . Overall the
5.1 Generative Modeling optimization problem proposed by [43] is,
There are mainly three types of generative models that
benefit from OT, namely, GANs (section 5.1.1), VAEs (sec- min max Scη ,ϵ (Pθ , P0 ). (24)
θ η
tion 5.1.2) and normalizing flows (section 5.1.3). As dis-
cussed in [88], [89], OT provides an unifying framework We stress the fact that engineering/learning the ground-cost
for these principles through the Minimum Kantorovich Es- cη is important for having a meaningful metric between
timator (MKE) framework, introduced by [90], which, given distributions, since it serves to compute distances between
samples from P0 , tries to find θ that minimizes W2 (Pθ , P0 ). samples. For instance, the Euclidean distance is known to
not correlate well with perceptual or semantic similarity
5.1.1 Generative Adversarial Networks between images [100].
GAN is a NN architecture proposed by [25] for sampling Finally, as in [49], [50], [101], [102], one can employ
from a complex probability distribution P0 . The principle sliced Wasserstein metrics (see section 3.1). This has two
behind this architecture is building a function gθ , that gen- advantages: (i) computing the sliced Wasserstein distances is
erates samples in a input space X from latent vectors z ∈ Z , computationally less complex, (ii) these distances are more
sampled from a simple distribution Q (e.g. N (0, σ 2 )). A robust w.r.t. the curse of dimensionality [103], [104]. These
GAN is composed by two networks: a generator gθ : Z → properties favour their usage in generative modeling, as
X , and a discriminator hξ : X → [0, 1]. X is called the input data is commonly high dimensional.
space (e.g. image space) and Z is the latent space (e.g. an
Euclidean space Rp ). The architecture is shown in Figure 4 5.1.2 Autoencoders
(a). Following [25], the GAN objective function is,
In a parallel direction, one can leverage autoencoders for
LG (θ, ξ) = E [log(hξ (x))] + E [log(1 − hξ (gθ (z)))], generative modeling. This was introduced in [26], who
x∼P0 z∼Q
proposed using stochastic encoding and decoding functions.
which is a minimized w.r.t. θ and maximized w.r.t. ξ . In other Let fη : X → Z be an encoder network. Instead of mapping
words, hξ is trained to maximize the probability of assigning z = fη (x) deterministically, fη predicts a mean mη (x) and
the correct label to x (labeled 1) and gθ (z) (labeled 0). gθ , on a variance ση (x)2 from which the code z is sampled, that is,
the other hand, is trained to minimize log(1 − hξ (gθ (z))). As z ∼ Qη (z|x) = N (mη (x), ση (x)2 ). The stochastic decoding
consequence it minimizes the probability that hξ guesses its function gθ : Z → X works similarly for reconstructing the
samples correctly. input x̂. This is shown in figure 4 (b). In this framework, the
As shown in [25], for an optimal discriminator hξ⋆ , decoder plays the role of generator.
the generator cost is equivalent to the so-called Jensen- The VAE framework is built upon variational inference,
Shannon (JS) divergence. In this sense, [91] proposed the which is a method for approximating probability densi-
Wasserstein GAN (WGAN) algorithm, which substitutes the ties [105]. Indeed, for a parameter vector θ, generating new
JS divergence by the KR metric (equation 2), samples x is done in two steps: (i) sample z from the
    prior Q(z) (e.g. a Gaussian), then (ii) sample x from the
min max LW (θ, ξ) = E hξ (x) − E hξ (gθ (z)) . conditional Pθ (x|z). The issue comes from calculating the
θ hξ ∈Lip1 x∼P0 z∼Q
marginal,
Nonetheless, the WGAN involves maximizing LW over Z
hξ ∈ Lip1 . Nonetheless, this is not straightforward for NNs. Pθ (x) = Q(z)Pθ (x|z)dz,
Possible solutions are: (i) clipping the weights ξ [91]; (ii)
penalizing the gradients of hξ [92]; (iii) normalizing the which is intractable. This hinders the MLE, which relies on
spectrum of the weight matrices [93]. Surprisingly, even log Pθ (x). VAEs tackle this problem by first considering an
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 10

fη gθ

z∼Q
gθ x ∼ Pθ = gθ♯ Q σθ (z)
x x̂
ση (x)2

hξ Real or
Fake? fη ∼ gθ ∼ x̂
hξ Real or
mη (x) z Fake?
 x ∼ P0
x mθ (z)
Dataset

(a) (b) (c)

Fig. 4: Architecture of different generative models studiend in this section; (a) GANs; (b) VAEs; (c) AAEs.

approximation Qη (z|x) ≈ Pθ (z|x). Secondly, one uses the Flows (CNFs). We thus focus on this class of algorithms. For
Evidence Lower Bound (ELBO), a broader review, we refer the readers to [27]. Normalizing
flows rely on the chain rule for computing changes in
ELBO(Qη ) = E [log Pθ (x, z)] − E [log Qη (z)], probability distributions exactly. Let T : Rd → Rd be a
z∼Qη z∼Qη
diffeomorphism. If, for z ∼ Q and x ∼ P , z = T(x), one
instead of the log-likelihood. Indeed, as shown in [26], the has,
ELBO is a lower bound for the log-likelihood. As follows,
one turns to the maximization of the ELBO, which can be log P (x) = log Q(z) − log | det ∇T−1 |.
rewritten as,
  Following [27], a NF is characterized by a sequence of
L(θ, η) = E E [log Pθ (x|z)] − KL(Qη ∥Q) . (25) diffeomorphisms {Ti }N i=1 that transform a simple probabil-
x∼P0 z∼Qη ity distribution (e.g. a Gaussian distribution) into a complex
one. If one understands this sequence continuously, they can
A somewhat middle ground between the frameworks of [25]
be modeled through dynamic systems. In this context, CNFs
and [26] is the Adversarial Autoencoder (AAE) architec-
are a class of generative models based on dynamic systems,
ture [106], shown in figure 4 (c). AAEs are different from
GANs in two points, (i) an encoder fη is added, for mapping z(x, 0) = x, and, ż(x, t) = Fθ (z(x, t)). (26)
x ∈ X into z ∈ Z ; (ii) the adversarial component is
done in the latent space Z . While the first point puts the which is an Ordinary Differential Equation (ODE). Hence-
AAE framework conceptually near VAEs, the second shows forth we denote zt (x) = z(x, t). Using Euler’s method
similarities with GANs. and setting z0 = x, one can discretize equation 26 as
Based on both AAEs and VAEs, [107] proposed the zn+1 = zn + τ Fθ (zn ), for a step-size τ > 0. This equation
Wasserstein Autoencoder (WAE) framework, which is a is similar to the ResNet structure [109]. Indeed, this is the
generalization of AAEs. Their insight is that, when using insight behind the seminal work of [110], who proposed
a deterministic decoding function gθ , one may simplify the the Neural ODE (NODE) algorithm for parametrizing ODEs
MK formulation, through NNs. CNFs are thus the intersection of NFs with
  NODE.
inf E [c(x1 , x2 )] = inf E E [c(x, gθ (z))] . According to [110], if a Random Variable (RV) z(x, t)
γ∈Γ (x1 ,x2 )∼γ Qη =Q x∼P0 z∼Qη suffices equation 26, then its log-probability follows,
As follows, [107] suggests enforcing the constraint in the ∂
infimum through a penalty term Ω, leading to, log | det ∇z| = Tr(∇z Fθ ) = div(Fθ ). (27)
∂t
 
With this workaround, minimizing the negative log-
L(θ, η) = E E [c(x, gθ (z))] + λΩ(Qη , Q),
x∼P0 z∼Qη likelihood amounts to minimizing,
n Z T
for λ > 0. This expression is minimized jointly w.r.t. θ and X
η . As choices for the penalty Ω, [107] proposes using either − log Q(zT (xi )) − div(Fθ (zt (xi )))dt, (28)
i=1 0
the JS divergence, or the MMD distance. In the first case the
formulation falls back to a mini-max problem much alike which can be solved using the automatic differentia-
the AAE framework. A third choice was proposed by [108], tion [110]. One limitation of CNFs is that the generic flow
which relies on the Sinkhorn loss Sc,ϵ , thus leveraging the can be highly irregular, having unnecessarily fluctuating
work of [43]. dynamics. As remarked by [111], this can pose difficulties
to numerical integration. These issues motivated the propo-
5.1.3 Continuous Normalizing Flows sition of two regularization terms for simpler dynamics. The
OT contributes to the class of generative models known first term is based dynamic OT. Let,
as NFs through its dynamic formulation (e.g. section 2.1 1X
n
and equations 3 through 5). Especially, they contribute to ρt = zt,♯ P = δz (x ) ,
n i=1 t i
a specific class of NFs known as Continuous Normalizing
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 11

and recalling the concepts in section 2.1, this simplifies eq. 4, This can be understood as a barycenter under the Euclidean
n Z metric. This remark motivated [117] to use the Wasserstein
1X T barycenter B(ai , D) of histograms in D, with weights ai , for
min L(θ) = ∥Fθ (zt (xi ))∥2 dt,
θ n i=1 0 (29) approximating xi , that is,
subject to z0 = x, and zT,♯ P = Q, N
X
2
where ∥Fθ (zt (xi ))∥ is the kinetic energy of the particle xi arg min Wc,ϵ (B(ai , D), xi ) + λΩ(D, A)
D,A i=1
at time t, and L to the whole kinetic energy. In addition, ρt
has the properties of an OT map. Among these, OT enforces Graph Dictionary Learning. An interesting use case of
that: (i) the flow trajectories do not intersect, and (ii) the this framework is Graph Dictionary Learning (GDL) (see
particles follow geodesics w.r.t. the ground cost (e.g. straight section 4.1). Let {Gℓ : (hℓ , S(ℓ) )}N
ℓ=1 denote a dataset
lines for an Euclidean cost), thus enforcing the regularity of the flow. Nonetheless, as [111] remarks, these properties are enforced only on training data. To effectively generalize them, the authors propose a second regularization term that consists of the Frobenius norm of the Jacobian,

\Omega(F) = \frac{1}{n}\sum_{i=1}^{n} \lVert \nabla F_\theta(z_T(x_i)) \rVert_F^2. \quad (30)

Thus, the Regularized Neural ODE (RNODE) algorithm combines equation 27 with the penalties in equations 29 and 30. Ultimately, this is equivalent to minimizing the KL divergence KL(z_{T,♯}P ‖ Q) under the said constraints.

5.2 Dictionary Learning

Dictionary learning [112] represents data X ∈ R^{n×d} through a set of K atoms D ∈ R^{k×d} and n representations A ∈ R^{n×k}. For a loss L and a regularization term Ω,

\underset{D,A}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(x_i, D^T a_i) + \lambda\,\Omega(D, A). \quad (31)

In this setting, the data points x_i are approximated linearly through the matrix-vector product D^T a_i. In other words, X ≈ AD. The practical interest is finding a faithful and sparse representation for X. In this sense, L is often the Euclidean distance, and Ω(A) the ℓ1 or ℓ0 norm of the representation vectors.

When the elements of X are non-negative, the dictionary learning problem is known as Non-negative Matrix Factorization (NMF) [113]. This scenario is particularly useful when the data X are histograms, that is, x_i ∈ ∆_d. As such, one needs to choose an appropriate loss function for histograms. OT provides such a loss through the Wasserstein distance [114], in which case the problem can be solved using BCD: for fixed D, solve for A, and vice-versa.

Nonetheless, the BCD iterations scale poorly, since NMF is equivalent to solving n linear programs with d² variables. To bypass the complexity involved in the exact calculation of the Wasserstein distance, [114] resort to the wavelet approximation of [115], which has a closed form. In another direction, [116] propose using the Sinkhorn algorithm [39] for NMF. Besides reducing complexity, using the Sinkhorn distance W_{c,ϵ} makes NMF smooth, so gradient descent methods can be applied successfully. The optimization problem of Wasserstein Dictionary Learning (WDL) reads as

\underset{D,A}{\arg\min}\; \frac{1}{N}\sum_{i=1}^{N} W_{c,\epsilon}(x_i, D^T a_i) + \lambda\,\Omega(D, A), \quad (32)

subject to AD ∈ (∆_d)^N. Assuming a_i ∈ ∆_k implies that D^T a_i represents a weighted average of dictionary elements.

This idea extends to graphs. Given a dataset of N graphs encoded by their node similarity matrices S^{(ℓ)} ∈ R^{n_ℓ×n_ℓ} and histograms over nodes h_ℓ ∈ ∆_{n_ℓ}, [118] proposes

\underset{\{S_k\}_{k=1}^{K},\,\{a_\ell\}_{\ell=1}^{N}}{\arg\min}\; \sum_{\ell=1}^{N} GW_2^2\Big(S^{(\ell)}, \sum_{k=1}^{K} a_{\ell,k} S^{(k)}\Big) - \lambda \lVert a_\ell \rVert_2^2,

in analogy to eq. 31, using the GW distance (see eq. 22) as loss function. Parallel to this line of work, [119] proposes Gromov-Wasserstein Factorization (GWF), which approximates S^{(ℓ)} non-linearly through GW-barycenters [120].

Topic Modeling. Another application of dictionary learning is topic modeling [121], in which a set of documents is expressed in terms of a dictionary of topics weighted by mixture coefficients. As explored in section 4.1, documents can be interpreted as probability distributions over words (e.g., eq. 21). In this context, [122] proposes an OT-inspired algorithm for learning a set of distributions Q = {Q̂_k}_{k=1}^{K} over words in the whole vocabulary. In this sense, each Q̂_k is a topic. Learning is based on a hierarchical OT problem,

\min_{\{q_k\}_{k=1}^{K},\, b,\, \gamma}\; \sum_{k=1}^{K}\sum_{i=1}^{N} \gamma_{ik}\, W_{2,\epsilon}(\hat{Q}_k, \hat{P}_i) - H(\gamma),

where Q̂_k = Σ_{v=1}^{V} q_{k,v} δ_{x_v^{(Q_k)}}, and b = [b_1, · · · , b_K] is the vector of topic coefficients. In this sense, Σ_i γ_{ik} = b_k and Σ_k γ_{ik} = p_i, with p_i = n_i/V the proportion of words in document P̂_i.

5.3 Clustering

Clustering is a problem within unsupervised learning that deals with the aggregation of features into groups [123]. From the perspective of OT, this is linked to the notion of quantization [124], [125]; indeed, from a distributional viewpoint, given X^{(P)} ∈ R^{n×d}, quantization corresponds to finding the matrix X^{(Q)} ∈ R^{K×d} minimizing W_2(P, Q). This has been explored in a number of works, such as [126], [127] and [128]. In this sense, OT serves as a framework for quantization/clustering, thus allowing for theoretical results such as convergence bounds for the K-means algorithm [125].

Conversely, the Wasserstein distance can be used for performing K-means in the space of distributions. Given P_1, · · · , P_N, this consists of finding R ∈ {0, 1}^{N×K} and Q = {Q_k}_{k=1}^{K} such that

(R^\star, Q^\star) = \underset{R,Q}{\arg\min}\; \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik}\, W_2(P_i, Q_k).

This problem can be optimized in two steps: (i) one sets r_{ik} = 1 if Q_k is the closest distribution to P_i (w.r.t. W_2); (ii) one computes the centroids using P_k = {P_i : r_{ik} = 1}, in which case Q_k = B(1/Σ_i r_{ik}; P_k). This formulation is related to the robust clustering of [129], who proposed a version of Wasserstein barycenter-based clustering analogous to the trimmed K-means of [130].
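To make the assignment step of this distributional K-means concrete, the following is a minimal sketch using the POT library (pip install pot); the synthetic point clouds, the fixed candidate centroids and all variable names are illustrative assumptions, not the procedure of [129] or [130].

import numpy as np
import ot

rng = np.random.default_rng(0)
# N = 4 empirical distributions (point clouds in R^2) to be clustered.
datasets = [rng.normal(loc=m, size=(40, 2)) for m in (0.0, 0.2, 3.0, 3.3)]
# K = 2 candidate centroid distributions (here kept fixed for simplicity).
centroids = [rng.normal(loc=0.1, size=(40, 2)), rng.normal(loc=3.1, size=(40, 2))]

w = np.full(40, 1 / 40)  # uniform weights on every support
assignments = []
for X in datasets:
    # Squared 2-Wasserstein distance between X and each centroid:
    # exact OT cost (ot.emd2) with a squared Euclidean ground cost (ot.dist).
    costs = [ot.emd2(w, w, ot.dist(X, Q)) for Q in centroids]
    assignments.append(int(np.argmin(costs)))
print(assignments)  # step (i); step (ii) would re-estimate each Q_k as a barycenter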

6 TRANSFER LEARNING

As discussed in section 2.2.3, TL is a framework for ML when data follow different probability distributions. The most general case of distributional shift corresponds to P_S(X, Y) ≠ P_T(X, Y). Since P(X, Y) = P(X)P(Y|X) = P(Y)P(X|Y), three types of shift may occur [131]: (i) covariate shift, that is, P_S(X) ≠ P_T(X); (ii) concept shift, namely, P_S(Y|X) ≠ P_T(Y|X) or P_S(X|Y) ≠ P_T(X|Y); and (iii) target shift, for which P_S(Y) ≠ P_T(Y).

This section is structured as follows. In sections 6.1 and 6.2 we present how OT is used in both shallow and deep Domain Adaptation (DA). For a broader view on the subject, encompassing methods other than OT-based ones, one may consult surveys on deep domain adaptation such as [132] and [29]. In section 6.3 we present how OT has contributed to more general DA settings (e.g. multi-source and heterogeneous DA). Finally, section 6.4 covers methods for predicting the success of TL methods.

6.1 Shallow Domain Adaptation

Assuming P_S(X) = P_T(ϕ(X)) for an unknown, possibly nonlinear mapping ϕ, OT can be used for matching P_S and P_T. Since the Monge problem may not have a solution for discrete distributions, [56] proposes doing so through the Kantorovich formulation, i.e., by calculating γ = OT(p_S, p_T, C). The OT plan γ has an associated mapping known as the barycentric mapping or projection [55], [56], [133],

T_\gamma(x_i^{(P_S)}) = \underset{x \in \mathbb{R}^d}{\arg\min}\; \sum_{j=1}^{n_T} \gamma_{ij}\, c(x, x_j^{(P_T)}). \quad (33)

The case where c(x_1, x_2) = ‖x_1 − x_2‖²₂ is particularly interesting, as T_γ has a closed form in terms of the support X^{(P_T)} ∈ R^{n_T×d} of P̂_T: X̂_S = T_γ(X^{(P_S)}) = diag(p_S)^{-1} γ X^{(P_T)}. As follows, each point x_i^S is mapped into x̂_i^S, which is a convex combination of the points x_j^T that receive mass from x_i^S, namely, those with γ_{ij} > 0. This effectively generates a new training dataset {(x̂_i^{(P_S)}, y_i^{(P_S)})}_{i=1}^{n_S}, which hopefully leads to a classifier f̂_S that works well on D_T. An illustration is shown in Figure 5.

Nonetheless, the barycentric projection works by mapping x_i^{(P_S)} to the convex hull of {x_j^{(P_T)} : γ_{ij} > 0}. As a consequence, if x_i^{(P_S)} sends mass to an x_j^{(P_T)} from a different class, it ends up on the wrong side of the classification boundary. Based on this issue, [56] proposes using class-based regularizers for inducing class sparsity on the transportation plan, namely, γ_{ij} ≠ 0 ⟺ y_i^{(P_S)} = y_j^{(P_T)}. This is done by introducing additional regularizers,

\gamma^\star = \underset{\gamma \in U(p_S, p_T)}{\arg\min}\; \langle C, \gamma \rangle_F - \epsilon H(\gamma) + \eta\, \Omega(\gamma; y^{(P_S)}, y^{(P_T)}),

where Ω is the class sparsity regularizer. Here we discuss two cases proposed in [56]: (i) semi-supervised and (ii) group-lasso. The semi-supervised regularizer assumes that y^{(P_T)} is known at training time, so that

\Omega(\gamma; y^{(P_S)}, y^{(P_T)}) = \sum_{i=1}^{n_S}\sum_{j=1}^{n_T} \gamma_{ij}\, \delta(y_i^{(P_S)} - y_j^{(P_T)}); \quad (34)

this corresponds to adding η to the ground-cost entries C_{ij} for which y_i^{(P_S)} ≠ y_j^{(P_T)}. In practice, [56] proposes η → ∞, which implies that transporting samples from different classes is infinitely costly. Nonetheless, in many cases y^{(P_T)} is not known during training. This leads to proxy regularizers such as the group-lasso. The idea is that samples x_i^S and x_j^S from the same class should be transported nearby. Hence, the group-lasso is defined as

\Omega(\gamma; y^{(P_S)}) = \sum_{j}\sum_{c=0}^{n_c - 1} \lVert \gamma_{I_c, j} \rVert_2,

where, for a given class c, I_c corresponds to the indices of the samples of that class. In this notation, γ_{I_c, j} is a vector with as many components as there are samples in class c.

In addition, note that the barycentric projection T_γ is defined only on the support of P̂_S. When the number of samples n_S or n_T is large, finding γ is not feasible (either due to time or memory constraints). This is the case in many applications in deep learning, as datasets are usually large. A few workarounds exist, such as interpolating the values of T_γ through nearest neighbors [55], or parametrizing T_γ (e.g. as a linear or kernel-based map) [133], [134].

So far, note that one assumes a weaker version of the covariate shift hypothesis. One may as well assume that the shift occurs on the joint distributions P_S(X, Y) and P_T(X, Y), and then try to match those. In UDA this is not directly possible, since the labels from P_T(X, Y) are not available. A workaround was introduced in [135], where the authors propose a proxy distribution P_T^h(X, h(X)),

\hat{P}_T^h = \frac{1}{n_T}\sum_{j=1}^{n_T} \delta_{(x_j^{(P_T)},\, h(x_j^{(P_T)}))},

where h ∈ H is a classifier. As follows, [135] propose minimizing the Wasserstein distance W_c(P̂_S, P̂_T^h) over possible classifiers h ∈ H. This yields a joint minimization problem,

\min_{h \in \mathcal{H},\; \gamma \in U(p_S, p_T)}\; \sum_{i=1}^{n_S}\sum_{j=1}^{n_T} c(x_i^{(P_S)}, y_i^{(P_S)}, x_j^{(P_T)})\, \gamma_{ij} + \lambda\,\Omega(h).

The cost c can be designed in terms of two factors: a feature loss c_f(x_i^{(P_S)}, x_j^{(P_T)}) (e.g. the Euclidean distance), and a label loss c_ℓ(y_i^{(P_S)}, h(x_j^{(P_T)})) (e.g. the Hinge loss). As proposed in [135], the importance of c_f and c_ℓ can be controlled by a parameter α,

C_{ij} = \alpha\, c_f(x_i^{(P_S)}, x_j^{(P_T)}) + c_\ell(y_i^{(P_S)}, h(x_j^{(P_T)})).
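Looking back at the barycentric projection of eq. 33 (illustrated in Fig. 5), a minimal sketch of this shallow OTDA pipeline with the POT library could read as follows; the synthetic arrays and variable names are illustrative assumptions, and this is not the reference implementation of [56].

# Minimal sketch of OT-based domain adaptation via the barycentric projection.
# Assumes the POT library (pip install pot); Xs, ys, Xt are illustrative data.
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))            # source samples
ys = rng.integers(0, 2, size=50)         # source labels
Xt = rng.normal(loc=2.0, size=(60, 2))   # target samples (shifted domain)

p_s = np.full(50, 1 / 50)                # uniform source weights
p_t = np.full(60, 1 / 60)                # uniform target weights

C = ot.dist(Xs, Xt)                      # squared Euclidean ground cost
gamma = ot.emd(p_s, p_t, C)              # exact OT plan (Kantorovich problem)

# Barycentric projection: each source point becomes a convex combination of the
# target points it sends mass to, i.e. X_hat_S = diag(p_s)^{-1} gamma X_T.
Xs_hat = (gamma / p_s[:, None]) @ Xt
# (Xs_hat, ys) can now be used to fit a classifier for the target domain.

For the class-regularized variants (semi-supervised or group-lasso penalties), POT also provides dedicated solvers in its ot.da module, whose exact interface should be checked against the library documentation.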

[Figure 5 comprises three scatter plots, panels (a)-(c), with Feature 1 on the horizontal axis and Feature 2 on the vertical axis.]

Fig. 5: Optimal Transport for Domain Adaptation (OTDA) methodology: (a) Source (red) and target (blue) samples follow different distributions; a classifier (dashed black line) fit on the source is not adequate for the target; (b) OT plan visualization: an edge connects source and target points (i, j) whenever γ_{ij} > 0; (c) T_γ generates samples on the target domain (red).

6.2 Deep Domain Adaptation

Recalling the concepts of section 2.2.1, a deep NN is a composition f = h_ξ ∘ g_θ of a feature extractor g_θ with parameters θ and a classifier h_ξ with parameters ξ. The intuition behind deep DA is to force the feature extractor to learn domain-invariant features. Given x_i^{(P_S)} ∼ P_S and x_j^{(P_T)} ∼ P_T, an invariant feature extractor g_θ should satisfy

(g_\theta)_\sharp \hat{P}_S = \frac{1}{n_S}\sum_{i=1}^{n_S} \delta_{z_i^{(P_S)}} \overset{W_c}{\approx} \frac{1}{n_T}\sum_{j=1}^{n_T} \delta_{z_j^{(P_T)}} = (g_\theta)_\sharp \hat{P}_T,

where z_i^{(P_S)} = g_θ(x_i^{(P_S)}) (resp. P̂_T), and P̂_S ≈_{W_c} P̂_T means W_c(P̂_S, P̂_T) ≈ 0. This implies that, after the application of g_θ, the distributions P̂_S and P̂_T are close. In this sense, this condition is enforced by adding an additional term to the classification loss function (e.g. the MMD, as in [136]). In this context, [137] proposes the Domain Adversarial Neural Network (DANN) algorithm, based on the loss function

\mathcal{L}_{DANN}(\theta, \xi, \eta) = \hat{R}_{P_S}(h_\xi \circ g_\theta) - \lambda\, \mathcal{L}_{\mathcal{H}}(\theta, \eta),

where L_H denotes

\frac{1}{n_S}\sum_{i=1}^{n_S} \log h_\eta(z_i^{(P_S)}) + \frac{1}{n_T}\sum_{j=1}^{n_T} \log(1 - h_\eta(z_j^{(P_T)})),

and h_η is a supplementary classifier that discriminates between source (labeled 0) and target (labeled 1) samples. The DANN algorithm is a mini-max optimization problem,

\min_{\theta, \xi}\max_{\eta}\; \mathcal{L}_{DANN}(\theta, \xi, \eta),

so as to minimize classification error and maximize domain confusion. This draws a parallel with the GAN algorithm of [25], presented in section 5.1.1. This remark motivated [138] to use the Wasserstein distance instead of d_H. Their algorithm is called Wasserstein Distance Guided Representation Learning (WDGRL). The Wasserstein domain adversarial loss is then

\mathcal{L}_W(\theta, \eta) = \frac{1}{n_S}\sum_{i=1}^{n_S} h_\eta(g_\theta(x_i^{(P_S)})) - \frac{1}{n_T}\sum_{j=1}^{n_T} h_\eta(g_\theta(x_j^{(P_T)})).

Following our discussion on KR duality (see section 2.1) as well as the Wasserstein GAN (see section 5.1.1), one needs to maximize L_W over h_η ∈ Lip_1. [138] proposes doing so using the gradient penalty term of [92],

\min_{\theta, \xi}\; \hat{R}_{P_S}(h_\xi \circ g_\theta) + \lambda_2 \max_{\eta}\big(\mathcal{L}_W(\theta, \eta) - \lambda_1\, \Omega(h_\eta)\big),

where λ_1 controls the gradient penalty term, and λ_2 controls the importance of the domain loss term.

Finally, we highlight that the Joint Distribution Optimal Transport (JDOT) framework can be extended to deep DA. First, one includes the feature extractor g_θ in the ground-cost,

C_{ij}(\theta, \xi) = \alpha \lVert z_i^{(P_S)} - z_j^{(P_T)} \rVert^2 + c_\ell(y_i^{(P_S)}, h_\xi(z_j^{(P_T)}));

second, the objective function includes a classification loss R_{P_S} and the OT loss, that is,

\mathcal{L}_{JDOT}(\theta, \xi) = \hat{R}_{P_S}(h_\xi \circ g_\theta) + \sum_{i=1}^{n_S}\sum_{j=1}^{n_T} \gamma_{ij}\, C_{ij}(\theta, \xi), \quad (35)

which is jointly minimized w.r.t. γ ∈ R^{n_S×n_T}, θ and ξ. [67] proposes minimizing L_JDOT jointly w.r.t. θ and ξ using mini-batches (see section 3.4). Nonetheless, the authors noted that large mini-batches were needed for stable training. To circumvent this issue, [71] proposed using unbalanced OT, allowing for smaller batch sizes and improved performance.

6.3 Generalizations of Domain Adaptation

In this section we highlight two generalizations of the DA problem. First, Multi-Source Domain Adaptation (MSDA) (e.g. [139]) deals with the case where source data come from multiple, distributionally heterogeneous domains, namely P_{S_1}, · · · , P_{S_K}. Second, Heterogeneous Domain Adaptation (HDA) is concerned with the case where source and target domain data live in incomparable spaces (e.g. [140]).

For MSDA, [141] proposes using the Wasserstein barycenter for building an intermediate domain P̂_B = B(α; {P̂_{S_k}}_{k=1}^{N_S}) (see equation 7), with α_k = 1/N_S. Since P̂_B ≠ P̂_T, the authors propose using an additional adaptation step for transporting the barycenter towards the target.

In a different direction, [142] proposes the Weighted JDOT algorithm, which generalizes the work of [135] to MSDA,

\min_{\alpha,\, g_\theta,\, h_\xi}\; \frac{1}{K}\sum_{k=1}^{K} \hat{R}_{P_{S_k}}(h_\xi \circ g_\theta) + W_{c_{\theta,\xi}}\Big(\hat{P}_T^{h_\xi},\; \sum_{k=1}^{K} \alpha_k \hat{P}_{S_k}\Big).

For HDA, Co-Optimal Transport (COOT) [140] tackles DA when the distributions P̂_S and P̂_T are supported on incomparable spaces, that is, X^{(P_S)} ∈ R^{n_S×d_S} and X^{(P_T)} ∈ R^{n_T×d_T}, with n_S ≠ n_T and d_S ≠ d_T. The authors' strategy relies on the GW OT formalism,

(\gamma^{(s)}, \gamma^{(f)}) = \min_{\gamma^{(s)} \in \Gamma(w_S, w_T),\; \gamma^{(f)} \in \Gamma(v_S, v_T)}\; \sum_{i,j,k,l} \mathcal{L}(X_{i,k}^{(P_S)}, X_{j,l}^{(P_T)})\, \gamma_{ij}^{(s)} \gamma_{kl}^{(f)},

where γ^{(s)} ∈ R^{n_S×n_T} and γ^{(f)} ∈ R^{d_S×d_T} are OT plans between samples and features, respectively. Using γ^{(s)}, one can estimate labels on the target domain using label propagation [143]: Ŷ^{(P_T)} = diag(p_T)^{-1} (γ^{(s)})^T Y^{(P_S)}.

6.4 Transferability

Transferability concerns the task of predicting whether TL will be successful. From a theoretical perspective, different IPMs bound the target risk R_{P_T} by the source risk R_{P_S}, such as the H, Wasserstein and MMD distances (see e.g., [22], [131], [144]). Furthermore, there is a practical interest in accurately measuring transferability a priori, before training or fine-tuning a network. A candidate measure coming from OT is the Wasserstein distance, but in its original formulation it only takes features into account. [145] propose the Optimal Transport Dataset Distance (OTDD), which relies on a Wasserstein-distance-based metric on the label space,

\mathrm{OTDD}(P_S, P_T) = \min_{\gamma \in \Gamma(P_S, P_T)}\; \mathbb{E}_{(z_1, z_2) \sim \gamma}[c(z_1, z_2)], \qquad c(z_1, z_2) = \lVert x_1 - x_2 \rVert_2^2 + W_2(P_{S, y_1}, P_{T, y_2}),

where z = (x, y), and P_{S,y} is the conditional distribution P_S(X|Y = y). As the authors show in [145], this distance is highly correlated with transferability and can even be used to interpolate between datasets with different classes [146].

7 REINFORCEMENT LEARNING

In this section, we review three contributions of OT for RL, namely: (i) Distributional Reinforcement Learning (DRL) (section 7.1), (ii) Bayesian Reinforcement Learning (BRL) (section 7.2), and (iii) Policy Optimization (PO) (section 7.3). A brief review of the foundations of RL was presented in section 2.2.4.

7.1 Distributional Reinforcement Learning

RL under a distributional perspective is covered in various works, e.g. [147]–[149]. It differs from the traditional framework by considering distributions over RVs instead of their expected values. In this context, [150] proposed studying the random return Z_π, such that Q_π(s_t, a_t) = E_{P,π}[Z_π(s_t, a_t)], where the expectation is taken in the sense of equation 9. In this sense, Z_π is a RV whose expectation is Q_π. [150] defines DRL by analogy with equation 10,

Z_\pi(s_t, a_t) \overset{D}{=} R(s_t, a_t) + \lambda Z_\pi(s_{t+1}, a_{t+1}),

where \overset{D}{=} means equality in distribution. Note that Z_π(s_t, a_t) is a distribution over the value of the state-action pair (s_t, a_t), hence a distribution over the line. [150] then suggests parametrizing Z_π using Z_θ with a discrete support of size N between the values V_min and V_max ∈ R, so that Z_θ(s, a) = z_i with probability p(s, a) = softmax(θ(s, a)). As commented by the authors, N = 51 yields the best empirical results, hence the name of their algorithm, c51.

For a pair (s_t, a_t), learning Z_θ corresponds to sampling (s_{t+1}, a_{t+1}) according to the dynamics P and the policy π, and minimizing a measure of divergence between Z_θ and the application of the Bellman operator, i.e., T_π Z_θ = R(s_t, a_t) + λZ_θ(s_{t+1}, a_{t+1}). OT intersects distributional RL precisely at this point. Indeed, [150] use the Wasserstein distance as a theoretical framework for their method. This suggests that one should employ W_c in learning as well; however, as remarked by [70], this idea does not work in practice, since the stochastic gradients ∇_θ W_p are biased. The solution [150] proposes is to use the KL divergence for optimization. This yields practical complications, as Z_θ and T_π Z_θ may have disjoint supports. To tackle this issue, the authors suggest using an intermediate step for projecting T_π Z_θ onto the support of Z_θ.

To avoid this projection step, [151] further proposes to change the modeling of Z_θ. Instead of using fixed atoms with variable weights, the authors propose parametrizing the support of Z_θ, that is, Z_θ(s, a) = (1/N) Σ_{i=1}^{N} δ_{θ_i(s,a)}. This corresponds to estimating z_i = θ_i(s, a). Since Z_θ is a distribution over the line, and due to the relationship between the 1-D Wasserstein distance and the quantile functions of distributions, the authors call this technique quantile regression. This choice presents three advantages: (i) discretizing Z_θ's support is more flexible, as the support is now free, potentially leading to more accurate predictions; (ii) this further implies that one does not need to perform the projection step; (iii) it allows the use of the Wasserstein distance as a loss without suffering from biased gradients.
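The relationship between the 1-D Wasserstein distance and quantile functions mentioned above can be made explicit in a few lines of NumPy; the sketch below only illustrates the sorting-based computation on synthetic samples, not the training procedure of [151].

# 1-D Wasserstein distance between two empirical distributions with the same
# number of samples: sort both sample sets (their empirical quantile functions)
# and average the p-th power of the differences.
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between (1/N) sum_i delta_{x_i} and (1/N) sum_i delta_{y_i}."""
    xs, ys = np.sort(x), np.sort(y)   # empirical quantiles at levels (i+0.5)/N
    return (np.abs(xs - ys) ** p).mean() ** (1 / p)

rng = np.random.default_rng(0)
z_target = rng.normal(loc=1.0, scale=2.0, size=51)   # e.g. samples of T_pi Z_theta
z_model = rng.normal(loc=0.0, scale=1.0, size=51)    # e.g. atoms theta_i(s, a)
print(wasserstein_1d(z_model, z_target, p=1))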

7.2 Bayesian Reinforcement Learning

Similarly to DRL, BRL [152] adopts a distributional view reflecting the uncertainty over a given variable. In the remainder of this section, we discuss the Wasserstein Q-Learning (WQL) algorithm of [153], a modification of the Q-Learning algorithm of [34] (see section 2.2.4). This strategy considers a distribution Q(s, a) ∈ P, called the Q-posterior, which represents the posterior distribution of the Q-function estimate. P is a family of distributions (e.g. the Gaussian family). For each state, and associated with Q, there is a V-posterior V(s), defined in terms of the Wasserstein barycenter,

V(s) \in \underset{V \in \mathcal{P}}{\arg\inf}\; \mathbb{E}_{a \sim \pi(\cdot|s)}\big[W_2(V, Q(s, a))^2\big], \quad (36)

where the inclusion highlights that the minimizer may be non-unique. Upon a transition (s, a, s′, r), [153] defines the Wasserstein Temporal Difference, which is the distributional analogue of equation 11,

Q_{t+1}(s, a) \in B([1 - \alpha_t, \alpha_t]; \{Q_t(s, a), \mathcal{T}_\pi Q_t(s, a)\}), \quad (37)

where α_t > 0 is the learning rate, and T_π Q_t = r + λV_t. In certain cases, when P is specified, equations 36 and 37 have known solutions. This is the case for Gaussian distributions parametrized by a mean m(s, a) and variance σ(s, a)², and for empirical distributions with an N-point support given by x_i(s, a) and weights a_i = 1/N.

7.3 Policy Optimization

PO focuses on manipulating the policy π for maximizing η(π). In this section we focus on gradient-based approaches (e.g. [154]). The idea of these methods relies on parametrizing π = π_θ, so that training consists of maximizing η(θ) = η(π_θ). However, as [155] remarks, these algorithms suffer from high variance and slow convergence. [155] thus propose a distributional view. Let P be a distribution over θ; they propose the following optimization problem,

P^\star = \underset{P}{\arg\max}\; \mathbb{E}_{\theta \sim P}[\eta(\pi_\theta)] - \alpha\, \mathrm{KL}(P \| P_0), \quad (38)

where P_0 is a prior and α > 0 controls regularization. The maximizer of equation 38 satisfies P(θ) ∝ exp(η(π_θ)/α). From a Bayesian perspective, P(θ) is the posterior distribution, whereas exp(η(π_θ)/α) is the likelihood function. As discussed in [156], this formulation can be written in terms of gradient flows in the space of probability distributions (see [157] for further details). First, consider the functional

F(P) = -\int P(\theta)\log P_0(\theta)\, d\theta + \int P(\theta)\log P(\theta)\, d\theta.

For Itô diffusions [158], optimizing F w.r.t. P becomes

P_{k+1}^{(h)} = \underset{P}{\arg\min}\; \mathrm{KL}(P \| P_0) + \frac{W_2^2(P, P_k^{(h)})}{2h}, \quad (39)

which corresponds to the Jordan-Kinderlehrer-Otto (JKO) scheme [158]. In this context, [156] introduces two formulations for learning an optimal policy: indirect and direct learning. In the first case, one defines the gradient flow of equation 39 in terms of θ. This setting is particularly suited when the parameter space does not have many dimensions. In the second case, one uses policies directly in the described framework, which is related to [159].

8 CONCLUSIONS AND FUTURE DIRECTIONS

8.1 Computational Optimal Transport

A few trends have emerged in the recent computational OT literature. First, there is an increasing use of OT as a loss function (e.g., [43], [65], [67], [91]). Second, mini-max optimization appears in seemingly unrelated strategies, such as the SRW [52], structured OT [57] and ICNN-based techniques [61]. The first two cases consist of works that propose new problems by considering the worst-case scenario w.r.t. the ground-cost. A somewhat common framework was proposed by [160], through a convex set C of ground-cost matrices,

\mathrm{RKP}(\Gamma, \mathcal{C}) = \min_{\gamma \in \Gamma}\max_{c \in \mathcal{C}}\; \mathbb{E}_{(x_1, x_2) \sim \gamma}[c(x_1, x_2)].

A third trend, useful in generative modeling, is the family of sliced Wasserstein distances. These metrics provide an easy-to-compute variant of the Wasserstein distance that benefits from the fact that, in 1-D, OT has a closed-form solution. In addition, a variety of works highlight the sample complexity and topological properties [103], [104] of these distances. We further highlight that ICNNs are gaining increasing attention in the ML community, especially in generative modeling [161], [162]. Furthermore, ICNNs represent a good case of ML for OT. Even though unexplored, this technique would be useful in DA and TL as well, for mapping samples between different domains.

There are two persistent challenges in OTML. First is the time complexity of OT. This is partially solved by considering sliced methods, entropic regularization, or, more recently, mini-batch OT. Second is the sample complexity of OT. In particular, the Wasserstein distance is known to suffer from the curse of dimensionality [38], [163]. This has led researchers to consider other estimators. Possible solutions include: (i) the SW [104]; (ii) the SRW [164]; (iii) entropic OT [45]; and (iv) mini-batch OT [69], [71], [165]. Table 1 summarizes the time and sample complexity of different strategies in computational OT.

TABLE 1: Time and sample complexity of empirical OT estimators in terms of the number of samples n. † indicates the complexity of a single iteration; ∗ indicates that the complexity is affected by the dimension of the samples.

Loss                    | Estimator                | Time Complexity | Sample Complexity
Wc(P, Q)                | Wc(P̂, Q̂)                 | O(n³ log n)     | O(n^{-1/d})
Wc,ϵ(P, Q)              | Wc,ϵ(P̂, Q̂)               | O(n²)†          | O(n^{-1/2}(1 + ϵ^{⌊d/2⌋}))
W2(P, Q)                | SRW_k(P̂, Q̂)              | O(n²)†,∗        | O(n^{-1/k})
SW(P, Q)                | SW(P̂, Q̂)                 | O(n log n)∗     | O(n^{-1/2})
W2(P, Q)                | FW_{K,2}(P̂, Q̂)           | O(n²)†          | O(n^{-1/2})
L^{(k,m)}_{MBOT}(P, Q)  | L^{(k,m)}_{MBOT}(P̂, Q̂)   | O(km³ log m)    | O(n^{-1/2})

8.2 Supervised Learning

OT contributes to supervised learning on three fronts. First, it can serve as a metric for comparing objects. This can be integrated into nearest-neighbors classifiers or kernel methods, and it has the advantage of defining a rich metric [74]. However, due to the time complexity of OT, this approach does not scale with sample size. Second, the Wasserstein distance can be used to define a semantic-aware metric between labels [78], [80], which is useful for classification. Third, OT contributes to fairness, which tries to find classifiers that are insensitive w.r.t. protected variables. As such, OT presents three contributions: (i) defining a transformation of the data so that classifiers trained on the transformed data fulfill the fairness requirement; (ii) defining a regularization term for enforcing fairness; (iii) defining a test for fairness, given a classifier. The first point is conceptually close to the contributions of OT for DA, as both depend on the barycentric projection for generating new data.
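To illustrate the second front (a semantic-aware metric between labels), the sketch below computes an entropic OT loss between a predicted class distribution and a target label distribution under a label ground cost, in the spirit of [78], [80]; it assumes the POT library, and the cost matrix, class names and smoothing are purely illustrative choices.

# Minimal sketch: Wasserstein-style classification loss with a label ground
# cost. Assumes the POT library (pip install pot); all values are synthetic.
import numpy as np
import ot

# Illustrative label metric for 3 classes (say cat, dog, truck): confusing
# cat with dog is cheaper than confusing either with truck.
M = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 4.0],
              [4.0, 4.0, 0.0]])

prediction = np.array([0.6, 0.3, 0.1])   # softmax output of a classifier
target = np.array([1.0, 0.0, 0.0])       # ground-truth class "cat"
target = 0.9 * target + 0.1 / 3          # smooth so both marginals are positive

loss = ot.sinkhorn2(prediction, target, M, reg=0.1)  # entropic OT loss
print(float(loss))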

8.3 Unsupervised Learning

Generative modeling. This subject has been one of the most active OTML topics. This is evidenced by the impact that papers such as [91] and [92] had on how GANs are understood and trained. Nonetheless, the Wasserstein distance comes at a cost: an intense computational burden in its primal formulation, and a complicated constraint in its dual. This computational burden has driven early applications of OT to consider the dual formulation, leaving open the question of how to enforce the Lipschitz constraint. On this matter, [92], [93] propose regularizers to enforce this constraint indirectly. In addition, recent experiments and theoretical discussions have shown that WGANs do not estimate the Wasserstein distance [94]–[96].

In contrast to GANs, OT contributions to autoencoders revolve around the primal formulation, which substitutes the log-likelihood term in vanilla algorithms. In practice, using W_c partially solves the blurriness issue in VAEs. Moreover, the work of [107] serves as a bridge between VAEs [26] and AAEs [106].

Furthermore, dynamic OT contributes to CNFs, especially through the BB formulation. In this context, a series of papers [111], [166] have shown that OT enforces flow regularity, thus avoiding stiffness in the related ODEs. As a consequence, these strategies improve training time.

Dictionary Learning. OT contributed to this subject by defining meaningful metrics between histograms [116], as well as a way to calculate averages in probability spaces [117]; examples include [118] and [119]. Future contributions could analyze the manifold of interpolations of dictionary atoms in the Wasserstein space.

Clustering. OT provides a distributional view of clustering, by approximating an empirical distribution with n samples by another with k samples representing clusters. In addition, the Wasserstein distance can be used to perform K-means in the space of distributions.

8.4 Transfer Learning

In UDA, OT has shown state-of-the-art performance in various fields, such as image classification [56], sentiment analysis [135], fault diagnosis [30], [167], and audio classification [141], [142]. Furthermore, advances in the field include methods and architectures that can handle large-scale datasets, such as WDGRL [138] and DeepJDOT [67]. This effectively consolidated OT as a valid and useful framework for transferring knowledge between different domains. Furthermore, heterogeneous DA [140] is a new, challenging and relatively unexplored problem that OT is particularly suited to solve. In addition, OT contributes to theoretical developments in DA. For instance, [144] provides a bound for the difference |R̂_{P_T}(h) − R̂_{P_S}(h)| in terms of the Wasserstein distance, akin to [22]. The authors' results highlight the fact that the class-sparsity structure in the OT plan plays a prominent role in the success of DA. Finally, OT defines a rich toolbox for analyzing heterogeneous datasets as probability distributions. For instance, OTDD [145] defines a hierarchical metric between datasets, by exploiting features and labels in the ground-cost. Emerging topics in this field include adaptation when one has many sources (e.g. [139], [142], [168]), and when distributions are supported on incomparable domains (e.g. [140]). Furthermore, DA is linked to other ML problems, such as generative modeling (i.e., WGANs and WDGRL) and supervised learning (i.e., the works of [85] and [56]).

8.5 Reinforcement Learning

The distributional RL framework of [150] re-introduced previous ideas into modern RL, such as formulating the goal of an agent in terms of distributions over rewards. This turned out to be successful, as the authors' method surpassed the state of the art in a variety of game benchmarks. Furthermore, [169] studied how the framework is related to dopamine-based RL, which highlights the importance of this formulation.

In a similar spirit, [153] proposes a distributional method for propagating the uncertainty of the value function. Here, it is important to highlight that while [150] focuses on the randomness of the reward function, [153] studies the stochasticity of value functions. In this sense, the authors propose a variant of Q-Learning, called WQL, that leverages Wasserstein barycenters for updating the Q-function distribution. From a practical perspective, the WQL algorithm is theoretically grounded and has good performance. Further research could focus on the common points between [150], [151] and [153], either by establishing a common framework or by comparing the methodologies empirically. Finally, [156] explores policy optimization as gradient flows in the space of distributions. This formalism can be understood as the continuous-time counterpart of gradient descent.

8.6 Final Remarks

TABLE 2: Principled division of works considered in this survey, according to their OT formulation and the component they represent in ML.

Method                     | Section | OT Formulation | ML Component
EMD [5]                    | 4.1     | MK             | Metric
WMD [74]                   | 4.1     | MK             | Metric
GW [76]                    | 4.1     | MK             | Metric
FGW [75]                   | 4.1     | MK             | Metric
Random Repair [85]         | 4.3     | MK             | Loss
FRL [86], [87]             | 4.3     | MK             | Transformation
Fairness Test [83]         | 4.3     | MK             | Metric
WGAN [91]                  | 5.1.1   | KR             | Loss
SinkhornAutodiff [43]      | 5.1.1   | MK             | Loss
WAE [107]                  | 5.1.2   | MK             | Loss/Regularization
RNODE [111]                | 5.1.3   | BB             | Regularization
Dictionary Learning [116]  | 5.2     | MK             | Loss
WDL [117]                  | 5.2     | MK             | Loss/Aggregation
GDL [118]                  | 5.2     | GW             | Loss
GWF [119]                  | 5.2     | GW             | Loss/Aggregation
OTLDA [122]                | 6.2     | MK             | Loss/Metric
Distr. K-Means [129]       | 5.3     | MK             | Metric
OTDA [56]                  | 6.1     | MK             | Transformation
JDOT [135]                 | 6.1     | MK             | Loss
WDGRL [167]                | 6.2     | KR             | Loss/Regularization
DeepJDOT [67]              | 6.2     | MK             | Loss/Regularization
WBT [139], [141]           | 6.3     | MK             | Transformation/Aggregation
WJDOT [142]                | 6.3     | MK             | Loss
COOT [140]                 | 6.3     | MK             | Transformation
OTCE [170]                 | 6.4     | MK             | Metric
OTDD [145]                 | 6.4     | MK             | Metric
DRL [150]                  | 7.1     | MK             | Loss
BRL [153]                  | 7.2     | MK             | Aggregation
PO [156]                   | 7.3     | BB             | Regularization

OT for ML. As we covered throughout this paper, OT is a component in different ML algorithms. In particular, in Table 2 we summarize the methods described in sections 4 through 7. We categorize methods in terms of their purpose in ML: (i) as a metric between samples; (ii) as a loss function; (iii) as a regularizer; (iv) as a transformation of data; (v) as a way to aggregate heterogeneous probability distributions. Furthermore, most algorithms leverage the MK

formalism (equation 14), and a few consider the KR formalism (equation 2). Whenever a notion of dynamics is involved, the BB formulation is preferred (equation 4).

ML for OT. In the inverse direction, a few works have considered ML, especially NNs, as a component within OT. Arguably, one of the first works to do so was [134], which proposed finding a parametric Monge mapping. This was followed by [133], which proposed using a NN for parametrizing the OT map. These two works are connected to the notion of barycentric mapping, which, as discussed in [133], is linked to the Monge map under mild conditions. In a novel direction, [61] exploited recent developments in NN architectures, notably ICNNs, for finding a Monge map. Differently from previous works, [61] provably finds a Monge map, which can be useful in DA.

REFERENCES

[21] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, “A kernel approach to comparing distributions,” in Proceedings of the National Conference on Artificial Intelligence, vol. 22, no. 2. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2007, p. 1637.
[22] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1, pp. 151–175, 2010.
[23] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018.
[24] V. Vapnik, The nature of statistical learning theory. Springer science & business media, 2013.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
[26] D. P. Kingma and M. Welling, “Autoencoding variational bayes,” in International Conference on Learning Representations, 2020.
[27] I. Kobyzev, S. Prince, and M. Brubaker, “Normalizing flows: An introduction and review of current methods,” IEEE Transactions
R EFERENCES on Pattern Analysis and Machine Intelligence, 2020.
[28] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D.
[1] G. Monge, “Mémoire sur la théorie des déblais et des remblais,” Lawrence, Dataset shift in machine learning. Mit Press, 2008.
Histoire de l’Académie Royale des Sciences de Paris, 1781. [29] M. Wang and W. Deng, “Deep visual domain adaptation: A
[2] L. Kantorovich, “On the transfer of masses (in russian),” in survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
Doklady Akademii Nauk, vol. 37, no. 2, 1942, pp. 227–229. [30] E. F. Montesuma, M. Mulas, F. Corona, and F. M. N. Mboula,
[3] C. Villani, Optimal transport: old and new. Springer, 2009, vol. 338. “Cross-domain fault diagnosis through optimal transport for a
[4] U. Frisch, S. Matarrese, R. Mohayaee, and A. Sobolevski, “A cstr process,” IFAC-PapersOnLine, 2022.
reconstruction of the initial conditions of the universe by optimal [31] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
mass transportation,” Nature, vol. 417, no. 6886, pp. 260–262, Transactions on knowledge and data engineering, vol. 22, no. 10, pp.
2002. 1345–1359, 2009.
[5] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s [32] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduc-
distance as a metric for image retrieval,” International journal of tion. MIT press, 2018.
computer vision, vol. 40, no. 2, pp. 99–121, 2000. [33] R. Bellman, “On the theory of dynamic programming,” Proceed-
[6] L. Ambrosio and N. Gigli, “A user’s guide to optimal transport,” ings of the National Academy of Sciences of the United States of
in Modelling and optimisation of flows on networks. Springer, 2013, America, vol. 38, no. 8, p. 716, 1952.
pp. 1–155.
[34] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
[7] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Ro- no. 3-4, pp. 279–292, 1992.
hde, “Optimal mass transport: Signal processing and machine-
[35] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731,
learning applications,” IEEE signal processing magazine, vol. 34,
pp. 34–37, 1966.
no. 4, pp. 43–59, 2017.
[36] M. Köppen, “The curse of dimensionality,” in 5th online world
[8] J. Solomon, “Optimal transport on discrete domains,” AMS Short
conference on soft computing in industrial applications (WSC5), vol. 1,
Course on Discrete Differential Geometry, 2018.
2000, pp. 4–8.
[9] B. Lévy and E. L. Schwindt, “Notions of optimal transport theory
and how to implement them on a computer,” Computers & [37] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT
Graphics, vol. 72, pp. 135–148, 2018. press, 2016.
[10] G. Peyré, M. Cuturi et al., “Computational optimal transport: [38] R. M. Dudley, “The speed of mean glivenko-cantelli conver-
With applications to data science,” Foundations and Trends® in gence,” The Annals of Mathematical Statistics, vol. 40, no. 1, pp.
Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019. 40–50, 1969.
[11] R. Flamary, “Transport optimal pour l’apprentissage statistique,” [39] M. Cuturi, “Sinkhorn distances: Lightspeed computation of opti-
Habilitation à diriger des recherches. Université Côte d’Azur, 2019. mal transport,” Advances in neural information processing systems,
[12] F. Santambrogio, “Optimal transport for applied mathemati- vol. 26, pp. 2292–2300, 2013.
cians,” Birkäuser, NY, vol. 55, no. 58-63, p. 94, 2015. [40] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon,
[13] A. Figalli and F. Glaudo, An Invitation to Optimal Transport, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier
Wasserstein Distances, and Gradient Flows. European Mathemati- et al., “Pot: Python optimal transport,” Journal of Machine Learning
cal Society/AMS, 2021. Research, vol. 22, no. 78, pp. 1–8, 2021.
[14] G. Monge, Mémoire sur le calcul intégral des équations aux différences [41] M. Cuturi, L. Meng-Papaxanthos, Y. Tian, C. Bunne, G. Davis,
partielles. Imprimerie royale, 1784. and O. Teboul, “Optimal transport tools (ott): A jax toolbox for
[15] Y. Brenier, “Polar factorization and monotone rearrangement all things wasserstein,” arXiv preprint arXiv:2201.12324, 2022.
of vector-valued functions,” Communications on pure and applied [42] G. B. Dantzig, “Reminiscences about the origins of linear pro-
mathematics, vol. 44, no. 4, pp. 375–417, 1991. gramming,” in Mathematical programming the state of the art.
[16] J.-D. Benamou and Y. Brenier, “A computational fluid mechan- Springer, 1983, pp. 78–86.
ics solution to the monge-kantorovich mass transfer problem,” [43] A. Genevay, G. Peyré, and M. Cuturi, “Learning generative
Numerische Mathematik, vol. 84, no. 3, pp. 375–393, 2000. models with sinkhorn divergences,” in International Conference on
[17] M. Agueh and G. Carlier, “Barycenters in the wasserstein space,” Artificial Intelligence and Statistics. PMLR, 2018, pp. 1608–1617.
SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, [44] G. Luise, A. Rudi, M. Pontil, and C. Ciliberto, “Differential
2011. properties of sinkhorn approximation for learning with wasser-
[18] A. Müller, “Integral probability metrics and their generating stein distance,” Advances in Neural Information Processing Systems,
classes of functions,” Advances in Applied Probability, vol. 29, no. 2, vol. 31, 2018.
pp. 429–443, 1997. [45] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, “Sample
[19] I. Csiszár, “Information-type measures of difference of proba- complexity of sinkhorn divergences,” in The 22nd international
bility distributions and indirect observation,” studia scientiarum conference on artificial intelligence and statistics. PMLR, 2019, pp.
Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967. 1574–1583.
[20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, [46] J. Rabin, G. Peyré, J. Delon, and M. Bernot, “Wasserstein barycen-
and G. R. Lanckriet, “On the empirical estimation of integral ter and its application to texture mixing,” in International Con-
probability metrics,” Electronic Journal of Statistics, vol. 6, pp. ference on Scale Space and Variational Methods in Computer Vision.
1550–1599, 2012. Springer, 2011, pp. 435–446.

[47] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, “Sliced and radon as a solution to biased wasserstein gradients,” arXiv preprint
wasserstein barycenters of measures,” Journal of Mathematical arXiv:1705.10743, 2017.
Imaging and Vision, vol. 51, no. 1, pp. 22–45, 2015. [71] K. Fatras, T. Sejourne, R. Flamary, and N. Courty, “Unbalanced
[48] S. Kolouri, S. R. Park, and G. K. Rohde, “The radon cumulative minibatch optimal transport; applications to domain adaptation,”
distribution transform and its application to image classifica- in International Conference on Machine Learning. PMLR, 2021, pp.
tion,” IEEE transactions on image processing, vol. 25, no. 2, pp. 3186–3197.
920–934, 2015. [72] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard, “Scaling algo-
[49] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, rithms for unbalanced optimal transport problems,” Mathematics
Z. Zhao, D. Forsyth, and A. G. Schwing, “Max-sliced wasserstein of Computation, vol. 87, no. 314, pp. 2563–2609, 2018.
distance and its use for gans,” in Proceedings of the IEEE/CVF [73] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
Conference on Computer Vision and Pattern Recognition, 2019, pp. “Distributed representations of words and phrases and their
10 648–10 656. compositionality,” in Advances in neural information processing
[50] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde, systems, 2013, pp. 3111–3119.
“Generalized sliced wasserstein distances,” Advances in neural [74] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word
information processing systems, vol. 32, 2019. embeddings to document distances,” in International conference
[51] G. Beylkin, “The inversion problem and applications of the on machine learning. PMLR, 2015, pp. 957–966.
generalized radon transform,” Communications on pure and applied [75] V. Titouan, N. Courty, R. Tavenard, and R. Flamary, “Optimal
mathematics, vol. 37, no. 5, pp. 579–599, 1984. transport for structured data with application on graphs,” in
[52] F.-P. Paty and M. Cuturi, “Subspace robust wasserstein dis- International Conference on Machine Learning. PMLR, 2019, pp.
tances,” in International conference on machine learning. PMLR, 6275–6284.
2019, pp. 5072–5081. [76] F. Mémoli, “Gromov–wasserstein distances and the metric ap-
[53] T. Lin, C. Fan, N. Ho, M. Cuturi, and M. Jordan, “Projection proach to object matching,” Foundations of computational mathe-
robust wasserstein distance and riemannian optimization,” Ad- matics, vol. 11, no. 4, pp. 417–487, 2011.
vances in neural information processing systems, vol. 33, pp. 9383– [77] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine
9397, 2020. learning. Springer, 2006, vol. 4, no. 4.
[54] M. Huang, S. Ma, and L. Lai, “A riemannian block coordinate [78] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio,
descent method for computing the projection robust wasserstein “Learning with a wasserstein loss,” Advances in Neural Information
distance,” in International Conference on Machine Learning. PMLR, Processing Systems, vol. 28, pp. 2053–2061, 2015.
2021, pp. 4446–4455. [79] X. Liu, Y. Han, S. Bai, Y. Ge, T. Wang, X. Han, S. Li, J. You, and
[55] S. Ferradans, N. Papadakis, G. Peyré, and J.-F. Aujol, “Reg- J. Lu, “Importance-aware semantic segmentation in self-driving
ularized discrete optimal transport,” SIAM Journal on Imaging with discrete wasserstein training,” in Proceedings of the AAAI
Sciences, vol. 7, no. 3, pp. 1853–1882, 2014. Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 629–
[56] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal 11 636.
transport for domain adaptation,” IEEE Transactions on Pattern [80] X. Liu, W. Ji, J. You, G. E. Fakhri, and J. Woo, “Severity-aware
Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1853–1865, semantic segmentation with reinforced wasserstein training,” in
2017. Proceedings of the IEEE/CVF Conference on Computer Vision and
[57] D. Alvarez-Melis, T. Jaakkola, and S. Jegelka, “Structured optimal Pattern Recognition, 2020, pp. 12 566–12 575.
transport,” in International Conference on Artificial Intelligence and [81] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi,
Statistics. PMLR, 2018, pp. 1771–1780. “Fairness beyond disparate treatment & disparate impact: Learn-
[58] L. Lovász, “Mathematical programming–the state of the art,” ing classification without disparate mistreatment,” in Proceedings
1983. of the 26th international conference on world wide web, 2017, pp.
[59] A. Forrow, J.-C. Hütter, M. Nitzan, P. Rigollet, G. Schiebinger, and 1171–1180.
J. Weed, “Statistical optimal transport via factored couplings,” [82] K. Makhlouf, S. Zhioua, and C. Palamidessi, “On the applicability
in The 22nd International Conference on Artificial Intelligence and of machine learning fairness notions,” ACM SIGKDD Explorations
Statistics. PMLR, 2019, pp. 2454–2465. Newsletter, vol. 23, no. 1, pp. 14–23, 2021.
[60] J. E. Cohen and U. G. Rothblum, “Nonnegative ranks, decom- [83] N. Si, K. Murthy, J. Blanchet, and V. A. Nguyen, “Testing group
positions, and factorizations of nonnegative matrices,” Linear fairness via optimal transport projections,” in International Con-
Algebra and its Applications, vol. 190, pp. 149–168, 1993. ference on Machine Learning. PMLR, 2021, pp. 9649–9659.
[61] A. Makkuva, A. Taghvaei, S. Oh, and J. Lee, “Optimal transport [84] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel,
mapping via input convex neural networks,” in International “Fairness through awareness,” in Proceedings of the 3rd innovations
Conference on Machine Learning. PMLR, 2020, pp. 6672–6681. in theoretical computer science conference, 2012, pp. 214–226.
[62] C. Villani, Topics in optimal transportation. American Mathemati- [85] P. Gordaliza, E. Del Barrio, G. Fabrice, and J.-M. Loubes, “Ob-
cal Soc., 2021, vol. 58. taining fairness using optimal transport theory,” in International
[63] B. Amos, L. Xu, and J. Z. Kolter, “Input convex neural networks,” Conference on Machine Learning. PMLR, 2019, pp. 2357–2365.
in International Conference on Machine Learning. PMLR, 2017, pp. [86] L. Oneto, M. Donini, G. Luise, C. Ciliberto, A. Maurer, and
146–155. M. Pontil, “Exploiting mmd and sinkhorn divergences for fair
[64] J. Fan, A. Taghvaei, and Y. Chen, “Scalable computations of and transferable representation learning.” in NeurIPS, 2020.
wasserstein barycenter via input convex neural networks,” in [87] S. Chiappa, R. Jiang, T. Stepleton, A. Pacchiano, H. Jiang, and
International Conference on Machine Learning. PMLR, 2021, pp. J. Aslanides, “A general approach to fairness with optimal trans-
1571–1581. port.” in AAAI, 2020, pp. 3633–3640.
[65] G. Montavon, K.-R. Müller, and M. Cuturi, “Wasserstein training [88] A. Genevay, G. Peyré, and M. Cuturi, “Gan and vae from an
of restricted boltzmann machines.” in NIPS, 2016, pp. 3711–3719. optimal transport point of view,” arXiv preprint arXiv:1706.01807,
[66] M. Sommerfeld, J. Schrieber, Y. Zemel, and A. Munk, “Optimal 2017.
transport: Fast probabilistic approximation with exact solvers.” J. [89] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu, “A geometric view
Mach. Learn. Res., vol. 20, pp. 105–1, 2019. of optimal transportation and generative model,” Computer Aided
[67] B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and Geometric Design, vol. 68, pp. 1–21, 2019.
N. Courty, “Deepjdot: Deep joint distribution optimal transport [90] F. Bassetti, A. Bodini, and E. Regazzini, “On minimum kan-
for unsupervised domain adaptation,” in Proceedings of the Euro- torovich distance estimators,” Statistics & probability letters,
pean Conference on Computer Vision (ECCV), 2018, pp. 447–463. vol. 76, no. 12, pp. 1298–1302, 2006.
[68] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert, “On param- [91] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative
eter estimation with the wasserstein distance,” Information and adversarial networks,” in International conference on machine learn-
Inference: A Journal of the IMA, vol. 8, no. 4, pp. 657–676, 2019. ing. PMLR, 2017, pp. 214–223.
[69] K. Fatras, Y. Zine, R. Flamary, R. Gribonval, and N. Courty, [92] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C.
“Learning with minibatch wasserstein: asymptotic and gradient Courville, “Improved training of wasserstein gans,” in Advances
properties,” in AISTATS, 2020. in Neural Information Processing Systems, I. Guyon, U. V. Luxburg,
[70] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lak- S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
shminarayanan, S. Hoyer, and R. Munos, “The cramer distance nett, Eds., vol. 30. Curran Associates, Inc., 2017.

[93] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral nor- [116] A. Rolet, M. Cuturi, and G. Peyré, “Fast dictionary learning
malization for generative adversarial networks,” in International with a smoothed wasserstein loss,” in Artificial Intelligence and
Conference on Learning Representations, 2018. Statistics. PMLR, 2016, pp. 630–638.
[94] T. Pinetz, D. Soukup, and T. Pock, “On the estimation of the [117] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly,
wasserstein distance in generative models,” in German Conference M. Cuturi, G. Peyré, and J.-L. Starck, “Wasserstein dictionary
on Pattern Recognition. Springer, 2019, pp. 156–170. learning: Optimal transport-based unsupervised nonlinear dic-
[95] A. Mallasto, G. Montúfar, and A. Gerolin, “How well do wgans tionary learning,” SIAM Journal on Imaging Sciences, vol. 11, no. 1,
estimate the wasserstein metric?” arXiv preprint arXiv:1910.03875, pp. 643–678, 2018.
2019. [118] C. Vincent-Cuaz, T. Vayer, R. Flamary, M. Corneli, and N. Courty,
[96] J. Stanczuk, C. Etmann, L. M. Kreusser, and C.-B. Schönlieb, “Online graph dictionary learning,” in International Conference on
“Wasserstein gans work because they fail (to approximate the Machine Learning. PMLR, 2021, pp. 10 564–10 574.
wasserstein distance),” arXiv preprint arXiv:2103.01678, 2021. [119] H. Xu, “Gromov-wasserstein factorization models for graph clus-
[97] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and tering,” in Proceedings of the AAAI conference on artificial intelli-
G. Peyré, “Interpolating between optimal transport and mmd gence, vol. 34, no. 04, 2020, pp. 6478–6485.
using sinkhorn divergences,” in The 22nd International Conference [120] G. Peyré, M. Cuturi, and J. Solomon, “Gromov-wasserstein aver-
on Artificial Intelligence and Statistics. PMLR, 2019, pp. 2681–2690. aging of kernel and distance matrices,” in International Conference
[98] Y. Xie, X. Wang, R. Wang, and H. Zha, “A fast proximal point on Machine Learning. PMLR, 2016, pp. 2664–2672.
method for computing exact wasserstein distance,” in Uncertainty [121] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allo-
in artificial intelligence. PMLR, 2020, pp. 433–453. cation,” Journal of machine Learning research, vol. 3, no. Jan, pp.
[99] D. Bertsekas, Network optimization: continuous and discrete models. 993–1022, 2003.
Athena Scientific, 1998, vol. 8. [122] V. Huynh, H. Zhao, and D. Phung, “Otlda: A geometry-aware
[100] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image optimal transport approach for topic modeling,” Advances in
quality assessment: from error visibility to structural similarity,” Neural Information Processing Systems, vol. 33, pp. 18 573–18 582,
IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2020.
2004. [123] C. M. Bishop, Pattern Recognition and Machine Learning (Informa-
[101] S. Kolouri, G. K. Rohde, and H. Hoffmann, “Sliced wasserstein tion Science and Statistics). Berlin, Heidelberg: Springer-Verlag,
distance for learning gaussian mixture models,” in Proceedings 2006.
of the IEEE Conference on Computer Vision and Pattern Recognition, [124] D. Pollard, “Quantization and the method of k-means,” IEEE
2018, pp. 3427–3436. Transactions on Information theory, vol. 28, no. 2, pp. 199–205, 1982.
[125] G. Canas and L. Rosasco, “Learning probability measures with
[102] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and
respect to optimal transport metrics,” Advances in Neural Informa-
L. V. Gool, “Sliced wasserstein generative models,” in Proceedings
tion Processing Systems, vol. 25, 2012.
of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, 2019, pp. 3713–3722. [126] P. M. Gruber, “Optimum quantization and its applications,”
Advances in Mathematics, vol. 186, no. 2, pp. 456–497, 2004.
[103] K. Nadjahi, A. Durmus, U. Simsekli, and R. Badeau, “Asymp-
[127] S. Graf and H. Luschgy, Foundations of quantization for probability
totic guarantees for learning generative models with the sliced-
distributions. Springer, 2007.
wasserstein distance,” Advances in Neural Information Processing
Systems, vol. 32, 2019. [128] M. Cuturi and A. Doucet, “Fast computation of wasserstein
barycenters,” in International conference on machine learning.
[104] K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour,
PMLR, 2014, pp. 685–693.
and U. Simsekli, “Statistical and topological properties of sliced
[129] E. Del Barrio, J. A. Cuesta-Albertos, C. Matrán, and A. Mayo-
probability divergences,” Advances in Neural Information Process-
ing Systems, vol. 33, pp. 20 802–20 812, 2020. Íscar, “Robust clustering tools based on optimal transportation,”
Statistics and Computing, vol. 29, no. 1, pp. 139–160, 2019.
[105] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational infer-
[130] J. A. Cuesta-Albertos, A. Gordaliza, and C. Matrán, “Trimmed
ence: A review for statisticians,” Journal of the American statistical
k-means: an attempt to robustify quantizers,” The Annals of
Association, vol. 112, no. 518, pp. 859–877, 2017.
Statistics, vol. 25, no. 2, pp. 553–576, 1997.
[106] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Ad-
[131] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani,
versarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
Advances in domain adaptation theory. Elsevier, 2019.
[107] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasser- [132] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain
stein auto-encoders,” in International Conference on Learning Rep- adaptation: A survey of recent advances,” IEEE signal processing
resentations, 2018. magazine, vol. 32, no. 3, pp. 53–69, 2015.
[108] G. Patrini, R. van den Berg, P. Forre, M. Carioni, S. Bhargav, [133] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet,
M. Welling, T. Genewein, and F. Nielsen, “Sinkhorn autoen- and M. Blondel, “Large scale optimal transport and mapping
coders,” in Uncertainty in Artificial Intelligence. PMLR, 2020, pp. estimation,” in International Conference on Learning Representations,
733–743. 2018.
[109] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning [134] M. Perrot, N. Courty, R. Flamary, and A. Habrard, “Mapping
for image recognition,” in Proceedings of the IEEE conference on estimation for discrete optimal transport,” Advances in Neural
computer vision and pattern recognition, 2016, pp. 770–778. Information Processing Systems, vol. 29, pp. 4197–4205, 2016.
[110] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, [135] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy,
“Neural ordinary differential equations,” in Advances in Neural “Joint distribution optimal transportation for domain adap-
Information Processing Systems, vol. 31, 2018. tation,” in Advances in Neural Information Processing Systems,
[111] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. Oberman, “How I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
to train your neural ode: the world of jacobian and kinetic wanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc.,
regularization,” in International Conference on Machine Learning. 2017.
PMLR, 2020, pp. 3154–3164. [136] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neu-
[112] B. A. Olshausen and D. J. Field, “Sparse coding with an over- ral networks for object recognition,” in Pacific Rim international
complete basis set: A strategy employed by v1?” Vision research, conference on artificial intelligence. Springer, 2014, pp. 898–904.
vol. 37, no. 23, pp. 3311–3325, 1997. [137] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle,
[113] P. Paatero and U. Tapper, “Positive matrix factorization: A non- F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-
negative factor model with optimal utilization of error estimates adversarial training of neural networks,” The journal of machine
of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994. learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
[114] R. Sandler and M. Lindenbaum, “Nonnegative matrix factoriza- [138] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided
tion with earth mover’s distance metric,” in 2009 IEEE Conference representation learning for domain adaptation,” in Thirty-Second
on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1873– AAAI Conference on Artificial Intelligence, 2018.
1880. [139] E. F. Montesuma and F. M. N. Mboula, “Wasserstein barycen-
[115] S. Shirdhonkar and D. W. Jacobs, “Approximate earth mover’s ter for multi-source domain adaptation,” in Proceedings of the
distance in linear time,” in 2008 IEEE Conference on Computer IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Vision and Pattern Recognition. IEEE, 2008, pp. 1–8. 2021, pp. 16 785–16 793.

Eduardo Fernandes Montesuma was born in Fortaleza, Brazil, in 1996. He received a Bachelor's degree in Computer Engineering from the Federal University of Ceará, and an M.S. degree in Electronics and Industrial Informatics from the National Institute of Applied Sciences of Rennes. He is currently working towards a PhD degree in Data Science at the University Paris-Saclay, and with the Department DM2I of the CEA-LIST since 2021.

Fred Ngolè Mboula was born in Bandja city, Cameroon, in 1990. He received an M.S. degree in Signal Processing from IMT Atlantique in 2013, and a PhD degree in astronomical image processing from Université Paris-Saclay in 2016. He joined the Department DM2I of the CEA-LIST in 2017 as a researcher in ML and Signal Processing. His research interests include weakly supervised learning, transfer learning, and information geometry. He is also a teaching assistant in undergraduate mathematics at Université de Versailles Saint-Quentin.

Antoine Souloumiac was born in Bordeaux, France, in 1964. He received the M.S. and Ph.D. degrees in signal processing from Télécom Paris, France, in 1987 and 1993, respectively. From 1993 until 2001, he was a Research Scientist with Schlumberger. He is currently with the Laboratory LS2D, CEA-LIST. His current research interests include statistical signal processing and ML, with emphasis on point processes, biomedical signal processing, and independent component analysis or blind source separation.