Montesuma et al. - 2023 - Recent Advances in Optimal Transport for Machine Learning
Abstract—Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore the contributions of Optimal Transport for Machine Learning over the period 2012–2022, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight recent developments in computational Optimal Transport and its interplay with Machine Learning practice.
arXiv:2306.16156v1 [cs.LG] 28 Jun 2023
1 INTRODUCTION

its genesis, this theory has made significant contributions to mathematics [3], physics [4], and computer science [5]. Here, we study how Optimal Transport (OT) contributes to different problems within Machine Learning (ML). Optimal Transport for Machine Learning (OTML) is a growing research subject in the ML community. Indeed, OT is useful for ML from at least two viewpoints: (i) as a loss function, and (ii) as a tool for manipulating probability distributions.

First, OT defines a metric between distributions, known under different names, such as the Wasserstein distance, Dudley metric, Kantorovich metric, or Earth Mover's Distance (EMD). This metric belongs to the family of Integral Probability Metrics (IPMs) (see section 2.1.1). In many problems (e.g., generative modeling), the Wasserstein distance is preferable over other notions of dissimilarity between distributions, such as the Kullback-Leibler (KL) divergence, due to its topological and statistical properties.

Second, OT presents itself as a toolkit or framework for ML practitioners to manipulate probability distributions. For instance, researchers can use OT for aggregating or interpolating between probability distributions, through Wasserstein barycenters and geodesics. Hence, OT is a principled way to study the space of probability distributions.

This survey provides an updated view of OTML in recent years. Even though previous surveys exist [6]–[11], the rapid growth of the field justifies a closer look at OTML. The rest of this paper is organized as follows. Section 2 discusses the fundamentals of different areas of ML, as well as an overview of OT. Section 3 reviews recent developments in computational optimal transport. The further sections explore OT for 4 ML problems: supervised (section 4), unsupervised (section 5), transfer (section 6), and reinforcement learning (section 7).

• The authors are with the Université Paris-Saclay, CEA, LIST, F-91120, Palaiseau, France. E-mail: eduardo.fernandesmontesuma@cea.fr

1.1 Related Work

Before presenting the recent contributions of OT for ML, we briefly review previous surveys on related themes. Concerning OT as a field of mathematics, a broad literature is available. For instance, [12] and [13] present OT theory from a continuous point of view. The most complete text on the field remains [3], but it is far from an introductory text.

As we cover in this survey, the contributions of OT for ML are computational in nature. In this sense, practitioners will be focused on discrete formulations and computational aspects of how to implement OT. This is the case of surveys [9] and [10]. While [9] serves as an introduction to OT and its discrete formulation, [10] gives a broader view on the computational methods for OT and how it is applied to data science in general. In terms of OT for ML, we compare our work to [7] and [11]. The first presents the theory of OT and briefly discusses applications in signal processing and ML (e.g., image denoising). The second provides a broad overview of areas OT has contributed to ML, classified by how data is structured (i.e., histograms, empirical distributions, graphs).

Due to the nature of our work, there is a relevant level of intersection between the themes treated in previous OTML surveys and ours. Whenever that is the case, we provide an updated view of the related topic. We highlight a few topics that are new. In section 3, projection robust, structured, neural and mini-batch OT are novel ways of computing OT. We further analyze in section 8.1 the statistical challenges of estimating OT from finite samples. In section 4 we discuss recent advances in OT for structured data and fairness. In section 5 we discuss Wasserstein autoencoders and normalizing flows. Also, the presentation of OT for dictionary learning, and its applications to graph data and natural language, are new. In section 6 we present an introduction to domain adaptation, and discuss generalizations of the standard domain adaptation setting (i.e., when multiple sources are available) and transferability. Finally, in section 7 we discuss distributional and Bayesian reinforcement learning, as well as policy optimization. To the best of our knowledge, Reinforcement Learning (RL) was not present in previous related reviews.

and the effort of transportation can be translated as the kinetic energy throughout the transportation,

L_B(\rho, v) = \int_{0}^{1}\int_{\mathbb{R}^d} \|v(t, x)\|^2 \rho(t, x)\,dx\,dt. \quad (4)
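As a toy illustration of the kinetic-energy functional above, consider a single particle, i.e. ρ_t = δ_{x(t)}: the integral reduces to ∫₀¹ ‖ẋ(t)‖² dt, and the straight line traversed at constant speed attains the minimum ‖x₁ − x₀‖². The sketch below (hypothetical paths, plain NumPy; not code from the survey) discretizes this energy:

```python
import numpy as np

def path_energy(path, T=1.0):
    # Discretized kinetic energy of a single-particle path, i.e. the
    # Benamou-Brenier functional with rho_t a Dirac at x(t):
    # sum_k ||(x_{k+1} - x_k)/tau||^2 * tau  ~  int_0^T ||x'(t)||^2 dt.
    tau = T / (len(path) - 1)
    velocities = np.diff(path, axis=0) / tau
    return float(np.sum(np.sum(velocities ** 2, axis=1)) * tau)

x0, x1 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
t = np.linspace(0.0, 1.0, 1001)[:, None]

straight = (1 - t) * x0 + t * x1                          # constant speed
detour = straight + np.sin(np.pi * t) * np.array([1.0, -1.0])

print(path_energy(straight))   # ~ 25.0 = ||x1 - x0||^2
print(path_energy(detour))     # strictly larger
```

Any deviation from the constant-speed straight line only adds kinetic energy, which is the intuition behind the dynamic formulation of OT.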
[Figure 1 (graphics omitted): four contour plots over the manifold of Gaussian distributions M = {θ = (m, σ²) : m ∈ ℝ, σ ∈ ℝ₊}, with axes μ and σ; panel titles: 2-Wasserstein Distance, KL-Divergence, MMD (Linear Kernel), MMD (Gaussian Kernel); a side panel (b) shows the corresponding densities μ(x).]
Fig. 1: Comparison of how different metrics and divergences (a) compute discrepancies on the manifold of Gaussian distributions (b).
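The contrast in Figure 1 can be checked numerically in the univariate Gaussian case, where both quantities have closed forms (the W₂ expression is the one given for the Gaussian manifold in this section; the script itself is an illustrative sketch, not from the paper):

```python
import math

def w2_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    # Closed-form 2-Wasserstein distance between univariate Gaussians
    # (Euclidean ground cost).
    return math.sqrt((mu_p - mu_q) ** 2 + (sigma_p - sigma_q) ** 2)

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    # Closed-form KL(P || Q) between univariate Gaussians.
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Two narrow Gaussians translated apart: W2 grows linearly with the
# shift, while the KL divergence blows up as the supports separate.
for shift in (0.5, 2.0, 8.0):
    w2 = w2_gaussian(0.0, 0.1, shift, 0.1)
    kl = kl_gaussian(0.0, 0.1, shift, 0.1)
    print(f"shift={shift}: W2={w2:.2f}, KL={kl:.2f}")
```

This is the "meaningful metric under (near-)disjoint supports" property discussed above: the Wasserstein distance stays proportional to the shift, whereas the KL divergence explodes.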
Mean Discrepancy (MMD) [21], defined for F = {f ∈ H_k : ‖f‖_{H_k} ≤ 1}, where H_k is a Reproducing Kernel Hilbert Space (RKHS) with kernel k : ℝ^d × ℝ^d → ℝ. These metrics play an important role in generative modeling and domain adaptation. A third distance important in the latter field is the H-distance [22]. Conversely, f-divergences measure the discrepancy between P and Q based on the ratio of their probabilities. For a convex, lower semi-continuous function f : ℝ₊ → ℝ with f(1) = 0,

D_f(P\|Q) = \mathbb{E}_{x\sim Q}\left[f\left(\frac{P(x)}{Q(x)}\right)\right], \quad (8)

where, with an abuse of notation, P(x) (resp. Q(x)) denotes the density of P (resp. Q). An example of f-divergence is the KL divergence, with f(u) = u log u.

As discussed in [20], two properties favor IPMs over f-divergences. First, d_F is defined even when P and Q have disjoint supports. For instance, at the beginning of training, Generative Adversarial Networks (GANs) generate poor samples, so P_model and P_data have disjoint supports. In this sense, IPMs provide a meaningful metric, whereas D_f(P_model‖P_data) = +∞ irrespective of how bad P_model is. Second, IPMs account for the geometry of the space where the samples live. As an example, consider the manifold of Gaussian distributions M = {N(μ, σ²) : μ ∈ ℝ, σ ∈ ℝ₊}. As discussed in [10, Remark 8.2], when restricted to M, the Wasserstein distance with a Euclidean ground-cost is,

W_2(P, Q) = \sqrt{(\mu_P - \mu_Q)^2 + (\sigma_P - \sigma_Q)^2},

whereas the KL divergence is associated with a hyperbolic geometry. This is shown in Figure 1. Overall, the choice of discrepancy between distributions heavily influences the success of learning algorithms (e.g., GANs). Indeed, each choice of metric/divergence induces a different geometry in the space of probability distributions, thus changing the underlying optimization problems in ML.

2.2 Machine Learning
As defined by [23], ML can be broadly understood as data-driven algorithms for making predictions. In what follows we define different problems within this field.

2.2.1 Supervised Learning
Following [24], supervised learning seeks to estimate a hypothesis h : X → Y, h ∈ H, from a feature space X into a label space Y. Henceforth, we will assume X = ℝ^d. Y is, for instance, {1, ..., n_c}, as in classification with n_c classes, or ℝ, in regression. The supervision in this setting comes from each x having an annotation y, coming from a ground-truth labeling function h_0. For a distribution P, a loss function L, and hypotheses h, h′ ∈ H, the risk functional is,

R_P(h, h') = \mathbb{E}_{x\sim P}[L(h(x), h'(x))].

We further denote R_P(h) = R_P(h, h_0), in short. As proposed in [24], one chooses h ∈ H by risk minimization, i.e., as the minimizer of R_P. Nonetheless, solving a supervised learning problem with this principle is difficult, since neither P nor h_0 are known beforehand. In practice, one has a dataset {(x_i^{(P)}, y_i^{(P)})}_{i=1}^{n}, with x_i^{(P)} ∼ P independently and identically distributed (i.i.d.), and y_i^{(P)} = h_0(x_i^{(P)}). This leads to the Empirical Risk Minimization (ERM) principle,

\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_P(h) = \frac{1}{n}\sum_{i=1}^{n} L(h(x_i^{(P)}), y_i^{(P)}).

In addition, in deep learning one implements classifiers through Neural Networks (NNs). A deep NN f : X → Y can be understood as the composition of a feature extractor g : X → Z and a classifier h : Z → Y. In this sense, the gain with deep learning is that one automatizes the feature extraction process.

2.2.2 Unsupervised Learning
A key characteristic of unsupervised learning is the absence of supervision. In this sense, samples x_i^{(P)} ∼ P do not have any annotation. In this survey we consider three problems in this category, namely, generative modeling, dictionary learning, and clustering.

First, generative models seek to learn to represent probability distributions over complex objects, such as images, text and time series [25]. This goal is embodied, in many cases, by learning how to reproduce samples x_i^{(P)} from an unknown distribution P. Examples of algorithms are GANs [25], Variational Autoencoders (VAEs) [26], and Normalizing Flows (NFs) [27].

Second, for a set of n observations {x_i}_{i=1}^{n}, dictionary learning seeks to decompose each x_i = D^T a_i, where D ∈ ℝ^{k×d} is a dictionary of atoms {d_j}_{j=1}^{k}, and a_i ∈ ℝ^k is a representation. In this sense, one learns a representation for data points x_i.
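As a minimal sketch of the decomposition x_i = D^T a_i (synthetic data and a least-squares coding step; actual dictionary learning methods would additionally optimize D, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionary D in R^{k x d}: k atoms of dimension d.
k, d, n = 4, 10, 50
D = rng.normal(size=(k, d))

# Synthetic observations built exactly as x_i = D^T a_i.
A_true = rng.normal(size=(n, k))         # representations a_i
X = A_true @ D                           # rows are x_i = D^T a_i

def code(X, D):
    # Least-squares coding: a_i = argmin_a ||x_i - D^T a||^2,
    # solved in closed form via the normal equations (D D^T) a = D x.
    G = D @ D.T                          # k x k Gram matrix
    return np.linalg.solve(G, D @ X.T).T

A_hat = code(X, D)
print("reconstruction error:", np.linalg.norm(A_hat @ D - X))
```

Since the synthetic data lies exactly in the span of the atoms, the coding step recovers the representations up to numerical precision.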
Finally, clustering tries to find groups of data points in a multi-dimensional space. In this case, one assigns each x_i to one cluster c_i ∈ {1, ..., K}. Hence clustering is similar to classification, except that in the former, one does not know beforehand the number of classes/clusters K.

2.2.3 Transfer Learning
Even though the ERM framework helps to formalize supervised learning, data is seldom i.i.d. As remarked by [28], the conditions under which data were acquired for training a model are likely to change when deploying it. As a consequence, the statistical properties of the data change, a case known as distributional shift. Examples of this happen, for instance, in vision [29] and signal processing [30]. Transfer Learning (TL) was first formalized by [31], and

2.3 The Curse of Dimensionality
The curse of dimensionality is a term coined by [35], and loosely refers to phenomena occurring in high dimensions that seem counter-intuitive in low dimensions (e.g., [36]). Due to the term curse, these phenomena often have a negative connotation. For instance, consider binning ℝ^d into n distinct values [37, Section 5.11.1]. The number of distinct bins grows exponentially with the dimension. As a consequence, for assessing the probability P(X = x_k) accurately, one would need exponentially many samples (i.e., imagine at least one sample per bin).

This kind of phenomenon also arises in OT. As studied by [38], for n samples, the convergence rate of the empirical estimator of the Wasserstein distance decreases exponentially with the dimension, i.e., W_1(P, \hat{P}) = O(n^{-1/d}),
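The rate above concerns general dimension d; in one dimension the empirical Wasserstein distance is computable exactly by sorting, so the decay with n can be observed directly (a hypothetical sketch, not from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def w1_1d(a, b):
    # Exact 1-D Wasserstein-1 distance between equal-size empirical
    # distributions: match order statistics after sorting.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Proxy for W1(P, P_hat_n): compare two independent size-n samples
# from the same Gaussian P, averaged over repetitions.
for n in (100, 1000, 10000):
    est = np.mean([w1_1d(rng.normal(size=n), rng.normal(size=n))
                   for _ in range(20)])
    print(f"n={n:6d}: W1 ~ {est:.4f}")
```

The printed estimates shrink as n grows, consistent with the n^{-1/d} picture for d = 1; in higher dimensions the same experiment (with a generic OT solver) decays much more slowly.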
which provides a faster way to estimate γ . Provided ei- One may define the max-GSW in analogy to eq. 18.
ther T ⋆ or γ ⋆ , OT defines a metric known as Wasserstein In addition, one can project samples on a sub-space
distance, e.g., Wc (P̂ , Q̂) = LK (γ ⋆ ). Likewise, for γϵ⋆ one 1 < k < d. For instance, [52] proposed the Subspace Robust
has Wc,ϵ (P̂ , Q̂) = LK (γϵ⋆ ). Associated with this approxima- Wasserstein (SRW) distances,
tion, [43] introduced the so-called Sinkhorn divergence,
SRWk (P, Q) = inf sup E [∥πE (x1 − x2 )∥22 ],
γ∈Γ(P,Q) E∈Gk (x1 ,x2 )∼γ
Wc,ϵ (P̂ , P̂ ) + Wc,ϵ (Q̂, Q̂)
Sc,ϵ (P̂ , Q̂) = Wc,ϵ (P̂ , Q̂) − , (15)
2 where Gk = {E ⊂ Rd : dim(E) = k} is the Grassmannian
which has interesting properties, such as interpolating be- manifold of k−dimensional subspaces of Rd , and πE denote
tween the MMD of [21] and the Wasserstein distance. Over- the orthogonal projector onto E . This can be equivalently
all, Sc,ϵ has 4 advantages: (i) its calculations can be done on formulated through a projection matrix U ∈ Rk×d , that is,
a GPU, (ii) For L > 0 iterations, it has complexity O(Ln2 ), SRWk (P, Q) = inf max E [Ux1 − Ux2 ]
(iii) Sc,ϵ is a smooth approximator of Wc (see e.g., [44]), (iv) Γ(P,Q) U∈Rk×d (x1 ,x2 )∼γ
UUT =Ik
it has better sample complexity (see e.g., [45]).
In the following, we discuss recent innovations on com- In practice, SRWk is based on a projected super gradient
putational OT. Section 3.1 present projection-based meth- method [52, Algorithm 1] that updates Ω = UUT , for a
ods. Section 3.2 discusses OT formulations with prescribed fixed γ , until convergence. These updates are computation-
structures. Section 3.3 presents OT through Input Convex ally complex, as they rely on eigendecomposition. Further
Neural Networks (ICNNs). Section 3.4 explores how to developments using Riemannian [53] and Block Coordinate
compute OT between mini-batches. Descent (BCD) [54] circumvent this issue by optimizing over
St(d, k) = {U ∈ Rk×d : UUT = Ik }.
3.1 Projection-based Optimal Transport
Projection-based OT relies on projecting data x ∈ Rd into 3.2 Structured Optimal Transport
sub-spaces Rk , k < d. A natural choice is k = 1, for which In some cases, it is desirable for the OT plan to have
computing OT can be done by sorting [12, chapter 2]. This is additional structure, e.g., in color transfer [55] and domain
called Sliced Wasserstein (SW) distance [46], [47]. Let Sd−1 = adaptation [56]. Based on this problem, [57] introduced a
{u ∈ Rd : ∥u∥22 = 1} denote the unit-sphere in Rd , and principled way to compute structured OT through sub-
πu : Rd → R denote πu (x) = ⟨u, x⟩. The Sliced-Wasserstein modular costs.
distance is, As defined in [57], a set function F : 2V → R is sub-
Z modular if ∀S ⊂ T ⊂ V and ∀v ∈ V \T ,
SW(P, Q) = Wc (πu,♯ P̂ , πu,♯ Q̂)du. (16)
Sd−1 F (S ∪ {v}) − F (S) ≥ F (T ∪ {v}) − F (T ).
This distance enjoys nice computational properties. First, These types of functions arise in combinatorial optimiza-
Wc (πu,♯ P, πu,♯ Q) can be computed in O(n log n) [10]. Sec- tion, and further OT since OT mappings and plans can be
ond, the integration in equation 16 can be computed using seen as a matching between 2 sets, namely, samples from P
Monte Carlo estimation. For samples {uℓ }L ℓ=1 , uℓ ∈ S
d−1
and Q. In addition, F defines a base polytope,
uniformly,
BF = {G ∈ R|V | : G(V ) = F (V ); G(S) ≤ F (S) ∀S ⊂ V }.
L
1X
SW(P̂ , Q̂) = Wc (πuℓ ,♯ P̂ , πuℓ ,♯ Q̂), (17) Based on these concepts, note that OT can be formulated
L ℓ=1
in terms of set functions. Indeed, suppose X(P ) ∈ Rn×d and
(\gamma^\star, \kappa^\star) = \arg\min_{\gamma \in \Gamma(p,q)} \arg\max_{\kappa \in B_F} \langle \gamma, \kappa \rangle_F.

A similar direction was explored by [59], who proposed to impose a low-rank structure on OT plans. This was done with the purpose of tackling the curse of dimensionality in OT. They introduce the transport rank of γ ∈ Γ(P, Q), defined as the smallest integer K for which,

\gamma = \sum_{k=1}^{K} \lambda_k (P_k \otimes Q_k),

where P_k, Q_k, k = 1, ..., K, are distributions over ℝ^d, and P_k ⊗ Q_k denotes the (independent) joint distribution with marginals P_k and Q_k, i.e., (P_k ⊗ Q_k)(x, y) = P_k(x)Q_k(y). For empirical P̂ and Q̂, K coincides with the non-negative rank [60] of γ ∈ ℝ^{n×m}. As follows, [59] denotes the set of γ with transport rank at most K as Γ_K(P, Q) (Γ_K(p, q) for empirical P̂ and Q̂). The robust Wasserstein distance is thus,

FW_{K,2}(P, Q)^2 = \inf_{\gamma \in \Gamma_K(P,Q)} \mathbb{E}_{(x_1,x_2)\sim\gamma}[\|x_1 - x_2\|^2].

In practice, this optimization problem is difficult due to the constraint γ ∈ Γ_K. As follows, [59] propose estimating it using the Wasserstein barycenter B = B([1/2, 1/2]; {P, Q}) supported on K points {x_k^{(B)}}_{k=1}^{K}, also called hubs. As follows, the authors show how to construct a transport plan in Γ_K(P, Q) by exploiting the transport plans γ_1 ∈ Γ(P, B) and γ_2 ∈ Γ(B, Q). This establishes a link between the robustness of the Wasserstein distance, Wasserstein barycenters and clustering. Letting λ_k = Σ_{i=1}^{n} γ_{ki}^{(BP)} and μ_k^{(P)} = (1/λ_k) Σ_{i=1}^{n} γ_{ki}^{(BP)} x_i^{(P)}, the authors in [59] propose the following proxy for the Wasserstein distance,

FW_{K,2}(\hat{P}, \hat{Q})^2 = \sum_{k=1}^{K} \lambda_k \|\mu_k^{(P)} - \mu_k^{(Q)}\|_2^2.

3.3 Neural Optimal Transport
In [61], the authors proposed to leverage the so-called convexification trick of [62] for solving OT through neural nets, using the ICNN architecture proposed in [63]. Formally, an ICNN implements a function f : ℝ^d → ℝ such that x ↦ f(x; θ) is convex. This is achieved through a special structure, shown in figure 2. An L-layer ICNN is defined through the operations,

z_{\ell+1} = \sigma_\ell(W_\ell z_\ell + A_\ell x + b_\ell); \quad f(x; \theta) = z_L,

where (i) all entries of W_ℓ are non-negative; (ii) σ_0 is convex; (iii) σ_ℓ, ℓ = 1, ..., L−1, is convex and non-decreasing. This is shown in figure 2.

As follows, [61] uses [62, Theorem 2.9] for rewriting the Wasserstein distance as,

W_2(P, Q)^2 = E_{P,Q} - \inf_{f \in \text{CVX}(P)} J(f, f^*),

where E_{P,Q} = \mathbb{E}_{x\sim P}[\|x\|^2/2] + \mathbb{E}_{x\sim Q}[\|x\|^2/2] does not depend on f, and J(f, f^*) = \mathbb{E}_{x\sim P}[f(x)] + \mathbb{E}_{x\sim Q}[f^*(x)] is minimized over the set CVX(P) = {f : f is convex and 𝔼_{x∼P}[f(x)] < ∞}. f^* denotes the convex conjugate of f, i.e., f^*(x_2) = \sup_{x_1} \langle x_1, x_2 \rangle - f(x_1). Optimizing J(f, f^*) is, nonetheless, complicated due to the convex conjugation. The insight of [61] is to introduce a second ICNN, g ∈ CVX, that maximizes J(f, g). As a consequence,

W_2(P, Q)^2 = \sup_{f \in \text{CVX}(P)} \inf_{g \in \text{CVX}(Q)} V_{P,Q}(f, g) + E_{P,Q}, \quad (19)

V_{P,Q}(f, g) = -\mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{x\sim Q}[\langle x, \nabla g(x)\rangle - f(\nabla g(x))].

Therefore, [61] proposes solving the sup-inf problem in eq. 19 by parametrizing f ∈ CVX(P) and g ∈ CVX(Q) through ICNNs. This becomes a mini-max problem, and the expectations can be approximated empirically using data from P and Q. An interesting feature of this work is that one solves OT for mappings f and g, rather than for an OT plan. This has interesting applications for generative modeling (e.g. [64]) and domain adaptation.

3.4 Mini-batch Optimal Transport
A major challenge in OT is its time complexity. This motivated different authors [43], [65], [66] to compute the Wasserstein distance between mini-batches rather than complete datasets. For a dataset with n samples, this strategy leads to a dramatic speed-up, since for K mini-batches of size m ≪ n, one reduces the time complexity of OT from O(n^3 log n) to O(Km^3 log m) [66]. This choice is key when using OT as a loss in learning [67] and inference [68]. Henceforth we describe the mini-batch framework of [69], for using OT as a loss.

Let L_OT denote an OT loss (e.g. W_c or S_{c,ε}). Assuming continuous distributions P and Q, the Mini-batch OT (MBOT) loss is given by,

L_{MBOT}(P, Q) = \mathbb{E}_{(X^{(P)}, X^{(Q)}) \sim P^{\otimes m} \otimes Q^{\otimes m}}[L_{OT}(X^{(P)}, X^{(Q)})],

where X^{(P)} ∼ P^{⊗m} indicates x_i^{(P)} ∼ P, i = 1, ..., m. This loss inherits some properties from OT, i.e., it is positive and symmetric, but L_{MBOT}(P, P) > 0. In practice, let {x_i^{(P)}}_{i=1}^{n_P} and {x_j^{(Q)}}_{j=1}^{n_Q} be i.i.d. samples from P and Q, respectively. Let I_m ⊂ {1, ..., n_P}^m denote a set of m indices. We denote by
Fused Gromov-Wasserstein (FGW) distance, which interpolates between the previous choices,

(MLE), and hence to the KL divergence (see [77, Chapter 4]) between P(Y|X) and P_θ(Y|X).

As remarked by [78], this choice disregards similarities in the label space. As such, a prediction of semantically close classes (e.g. a Siberian husky and an Eskimo dog) yields the same loss as semantically distant ones (e.g. a dog and a cat). To remedy this issue, [78] proposes using the Wasserstein distance between the network's predictions and the ground-truth vector, L(ŷ, y) = min_{γ ∈ U(ŷ, y)} ⟨γ, C⟩_F, where C ∈ ℝ^{n_c × n_c} is a metric between labels.

This idea has been employed in multiple applications. For instance, [78] employs the cost between classes k_1 and k_2 given by C_{k_1,k_2} = ‖x_{k_1} − x_{k_2}‖₂², where x_k ∈ ℝ^d is the d-dimensional embedding corresponding to class k. In addition, [79], [80] propose using this idea for semantic segmentation, by employing a pre-defined ground-cost reflecting the importance of each class and the severity of mis-prediction (e.g. confusing the road with a car and vice-versa in autonomous driving).

4.3 Fairness
The analysis of biases in ML algorithms has received increasing attention, due to the now widespread use of these systems for automatizing decision-making processes. In this context, fairness [81] is a sub-field of ML concerned with achieving fair treatment of individuals in a population. Throughout this section, we focus on fairness in binary classification.

The most basic notion of fairness is concerned with a sensitive or protected attribute s ∈ {0, 1}, which can be generalized to multiple categorical attributes. We focus on the basic case in the following discussion. The values s = 0 and s = 1 correspond to the unfavored and favored groups, respectively. This is complementary to the theory of classification, namely, ERM (see section 2.2.1), where one wants to find h ∈ H that predicts a label ŷ = h(x) to match y ∈ {0, 1}. In this context different notions of fairness exist (e.g. [82, Table 1]). As shown in [83], these can be unified in the following equation,

\mathbb{E}_{(x,s,y)\sim Q}[\mathbb{1}_{h(x)=1}\,\phi(U, \mathbb{E}[U])] = 0,

where Q = Q(X, S, Y). In particular, the notion of statistical parity [84] is expressed in probabilistic terms as,

hence the geodesic or barycenter P̂_t between P̂_0 and P̂_1 is an empirical distribution with support,

\tilde{x}_{t,ij} = (1 - t)x_i^{(P_0)} + t x_j^{(P_1)}, \quad \forall (i, j)\ \text{s.t.}\ \gamma_{ij} > 0.

The procedure of random repair corresponds to drawing b_1, ..., b_{n_0+n_1} ∼ B(λ), where B(λ) is a Bernoulli distribution with parameter λ. For P_0 (resp. P_1),

X = \bigcup_{i=1}^{n_0} \begin{cases} \{x_i^{(P_0)}\} & \text{if } b_i = 0, \\ \{\tilde{x}_{t,ij} : \gamma_{ij} > 0\} & \text{if } b_i = 1, \end{cases}

where the amount of repair λ regulates the trade-off between fairness (λ = 1) and classification performance (λ = 0). In addition, t = 1/2, so as to avoid disparate impact.

In another direction, one may use OT for devising a regularization term to enforce fairness in ERM. Two works follow this strategy, namely [86] and [87]. In the first case, the authors explore fairness in representation learning (see for instance section 2.2.1). In this sense, let f = h ∘ g be the composition of a feature extractor g and a classifier h. Therefore [86] proposes minimizing,

(h^\star, g^\star) = \arg\min_{h,g} \frac{1}{n}\sum_{i=1}^{n} L(h(g(x_i)), y_i) + \beta d(P_0, P_1),

for β > 0, P_0 = P(X|S = 0) (resp. P_1 for S = 1), and d being either the MMD or the Sinkhorn divergence (see section 2.1). In another direction, [87] considers the output distribution P_s = P(h(g(X))|S = s). Their approach is more general, as they suppose that attributes can take more than 2 values, namely s ∈ {1, ..., N_S}. Their insight is that the output distribution should be transported to the distribution closest to each group's conditional, namely P_k = P(h(g(X))|S = k). This corresponds to \bar{P} = B(\alpha; \{P_k\}_{k=1}^{N_S}), with α_s = n_s/n. As in [86], the authors enforce this condition through regularization, namely,

\min_{h,g} \frac{1}{n}\sum_{i=1}^{n} L(h(g(x_i)), y_i) + \beta \sum_{k=1}^{N_S} \frac{n_k}{n} W_2(P_k, \bar{P}).

Finally, we remark that OT can be used for devising a statistical test of fairness [83]. To that end, consider,

\mathcal{F} = \{Q : \mathbb{E}_{(x,s,y)\sim Q}[\mathbb{1}_{h(x)=1}\,\phi(U, \mathbb{E}[U])] = 0\},
the set of distributions Q for which a classifier h is fair. Now, let {(x_i, s_i, y_i)}_{i=1}^{n} be samples from a distribution P, and let \hat{P} = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i, s_i, y_i)}. The fairness test of [83] consists of a hypothesis test with H_0 : P ∈ F and H_1 : P ∉ F. The test is then based on,

d(P, \mathcal{F}) = \inf_{Q \in \mathcal{F}} W_2(P, Q),

for which one rejects H_0 if t = n × d(P, F) > η_{1−α}, where η_{1−α} is the (1 − α) × 100% quantile of the generalized chi-squared distribution.

5 UNSUPERVISED LEARNING

In this section, we discuss unsupervised learning techniques that leverage the Wasserstein distance as a loss function. We consider three cases: generative modeling (section 5.1), representation learning (section 5.2) and clustering (section 5.3).

5.1 Generative Modeling
There are mainly three types of generative models that benefit from OT, namely, GANs (section 5.1.1), VAEs (section 5.1.2) and normalizing flows (section 5.1.3). As discussed in [88], [89], OT provides a unifying framework for these principles through the Minimum Kantorovich Estimator (MKE) framework, introduced by [90], which, given samples from P_0, tries to find θ that minimizes W_2(P_θ, P_0).

5.1.1 Generative Adversarial Networks
The GAN is a NN architecture proposed by [25] for sampling from a complex probability distribution P_0. The principle behind this architecture is building a function g_θ that generates samples in an input space X from latent vectors z ∈ Z, sampled from a simple distribution Q (e.g. N(0, σ²)). A GAN is composed of two networks: a generator g_θ : Z → X, and a discriminator h_ξ : X → [0, 1]. X is called the input space (e.g. image space) and Z is the latent space (e.g. a Euclidean space ℝ^p). The architecture is shown in Figure 4 (a). Following [25], the GAN objective function is,

L_G(\theta, \xi) = \mathbb{E}_{x \sim P_0}[\log(h_\xi(x))] + \mathbb{E}_{z \sim Q}[\log(1 - h_\xi(g_\theta(z)))],

which is minimized w.r.t. θ and maximized w.r.t. ξ. In other words, h_ξ is trained to maximize the probability of assigning the correct label to x (labeled 1) and g_θ(z) (labeled 0). g_θ, on the other hand, is trained to minimize log(1 − h_ξ(g_θ(z))). As a consequence, it minimizes the probability that h_ξ guesses its samples correctly.

As shown in [25], for an optimal discriminator h_ξ^⋆, the generator cost is equivalent to the so-called Jensen-Shannon (JS) divergence. In this sense, [91] proposed the Wasserstein GAN (WGAN) algorithm, which substitutes the JS divergence by the KR metric (equation 2),

\min_\theta \max_{h_\xi \in \text{Lip}_1} L_W(\theta, \xi) = \mathbb{E}_{x \sim P_0}[h_\xi(x)] - \mathbb{E}_{z \sim Q}[h_\xi(g_\theta(z))].

The WGAN thus involves maximizing L_W over h_ξ ∈ Lip_1, which is not straightforward for NNs. Possible solutions are: (i) clipping the weights ξ [91]; (ii) penalizing the gradients of h_ξ [92]; (iii) normalizing the spectrum of the weight matrices [93]. Surprisingly, even though (ii) and (iii) improve over (i), several works have confirmed that the WGAN and its variants do not estimate the Wasserstein distance [94]–[96].

In another direction, one can calculate ∇_θ W_p(P_θ, P_0) using the primal problem (equation 14). To do so, [43] and further [69] consider estimating W_p(P_θ, P_0) through averaging the loss over mini-batches (see section 3.4). For calculating the derivatives ∇_θ W_p(P_θ, P_0), one has two approaches. First, [43] advocate back-propagating through the Sinkhorn iterations. This is numerically unstable and computationally complex. Second, as discussed in [97], [98], one can use the Envelope Theorem [99] to take derivatives of W_p. In its primal form, this amounts to calculating γ^⋆ = OT_ε(p, q, C), then differentiating w.r.t. C.

In addition, as discussed in [43], a Euclidean ground-cost is likely not meaningful for complex data (e.g., images). The authors thus propose learning a parametric ground cost (C_η)_{ij} = ‖f_η(x_i^{(P_θ)}) − f_η(x_j^{(P_0)})‖, where f_η : X → Z is a NN that learns a representation for x ∈ X. Overall, the optimization problem proposed by [43] is,

\min_\theta \max_\eta S_{c_\eta, \epsilon}(P_\theta, P_0). \quad (24)

We stress the fact that engineering/learning the ground-cost c_η is important for having a meaningful metric between distributions, since it serves to compute distances between samples. For instance, the Euclidean distance is known not to correlate well with perceptual or semantic similarity between images [100].

Finally, as in [49], [50], [101], [102], one can employ sliced Wasserstein metrics (see section 3.1). This has two advantages: (i) computing the sliced Wasserstein distances is computationally less complex; (ii) these distances are more robust w.r.t. the curse of dimensionality [103], [104]. These properties favour their usage in generative modeling, as input data is commonly high-dimensional.

5.1.2 Autoencoders
In a parallel direction, one can leverage autoencoders for generative modeling. This was introduced in [26], who proposed using stochastic encoding and decoding functions. Let f_η : X → Z be an encoder network. Instead of mapping z = f_η(x) deterministically, f_η predicts a mean m_η(x) and a variance σ_η(x)² from which the code z is sampled, that is, z ∼ Q_η(z|x) = N(m_η(x), σ_η(x)²). The stochastic decoding function g_θ : Z → X works similarly for reconstructing the input x̂. This is shown in figure 4 (b). In this framework, the decoder plays the role of generator.

The VAE framework is built upon variational inference, which is a method for approximating probability densities [105]. Indeed, for a parameter vector θ, generating new samples x is done in two steps: (i) sample z from the prior Q(z) (e.g. a Gaussian), then (ii) sample x from the conditional P_θ(x|z). The issue comes from calculating the marginal,

P_\theta(x) = \int Q(z) P_\theta(x|z)\,dz,

which is intractable. This hinders MLE, which relies on log P_θ(x). VAEs tackle this problem by first considering an
[Figure 4 (graphics omitted): (a) GAN: a latent code z ∼ Q is mapped by the generator g_θ to x ∼ P_θ = g_{θ♯}Q, and the discriminator h_ξ labels samples "real or fake" against x ∼ P_0 from the dataset; (b) VAE: the encoder f_η outputs m_η(x) and σ_η(x)², a code z is sampled and decoded by g_θ (with m_θ(z), σ_θ(z)) into x̂; (c) AAE: an encoder/decoder pair f_η, g_θ with the adversarial component h_ξ acting in the latent space.]
Fig. 4: Architecture of the different generative models studied in this section; (a) GANs; (b) VAEs; (c) AAEs.
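For the Gaussian encoder Q_η(z|x) = N(m_η(x), σ_η(x)²) of Figure 4 (b), sampling is typically done by reparameterization, and the KL term between the code distribution and a standard normal prior has a closed form. A small sketch with toy numbers (not tied to any trained model from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_code(m, sigma, n, rng=rng):
    # Reparameterized sampling from Q(z|x) = N(m, sigma^2):
    # z = m + sigma * eps, with eps ~ N(0, 1).
    return m + sigma * rng.normal(size=n)

def kl_to_std_normal(m, sigma):
    # Closed-form KL( N(m, sigma^2) || N(0, 1) ), the term that
    # regularizes the encoder toward the prior in the ELBO.
    return 0.5 * (m ** 2 + sigma ** 2 - 1.0) - np.log(sigma)

# Toy encoder outputs for a single input x.
m, sigma = 0.8, 0.5
z = sample_code(m, sigma, n=100_000)
print(f"sample mean ~ {z.mean():.3f}, KL to prior = {kl_to_std_normal(m, sigma):.3f}")
```

Reparameterization keeps the sampling step differentiable in (m, σ), which is what allows gradient-based maximization of the ELBO discussed below the figure.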
approximation Qη(z|x) ≈ Pθ(z|x). Secondly, one uses the Evidence Lower Bound (ELBO),

ELBO(Qη) = E_{z∼Qη}[log Pθ(x, z)] − E_{z∼Qη}[log Qη(z)],

instead of the log-likelihood. Indeed, as shown in [26], the ELBO is a lower bound for the log-likelihood. One thus turns to the maximization of the ELBO, which can be rewritten as,

L(θ, η) = E_{x∼P0}[ E_{z∼Qη}[log Pθ(x|z)] − KL(Qη ∥ Q) ]. (25)

A middle ground between the frameworks of [25] and [26] is the Adversarial Autoencoder (AAE) architecture [106], shown in figure 4 (c). AAEs differ from GANs in two points: (i) an encoder fη is added, for mapping x ∈ X into z ∈ Z; (ii) the adversarial component acts in the latent space Z. While the first point puts the AAE framework conceptually near VAEs, the second shows similarities with GANs.

Based on both AAEs and VAEs, [107] proposed the Wasserstein Autoencoder (WAE) framework, which is a generalization of AAEs. Their insight is that, when using a deterministic decoding function gθ, one may simplify the MK formulation,

inf_{γ∈Γ} E_{(x1,x2)∼γ}[c(x1, x2)] = inf_{Qη=Q} E_{x∼P0} E_{z∼Qη}[c(x, gθ(z))].

Thus, [107] suggests enforcing the constraint in the infimum through a penalty term Ω, leading to,

L(θ, η) = E_{x∼P0} E_{z∼Qη}[c(x, gθ(z))] + λΩ(Qη, Q),

for λ > 0. This expression is minimized jointly w.r.t. θ and η. As choices for the penalty Ω, [107] proposes using either the JS divergence or the MMD distance. In the first case, the formulation falls back to a mini-max problem much like the AAE framework. A third choice was proposed by [108], which relies on the Sinkhorn loss Sc,ϵ, thus leveraging the work of [43].

5.1.3 Continuous Normalizing Flows

OT contributes to the class of generative models known as NFs through its dynamic formulation (e.g. section 2.1 and equations 3 through 5). Especially, it contributes to a specific class of NFs known as Continuous Normalizing Flows (CNFs). We thus focus on this class of algorithms. For a broader review, we refer the readers to [27]. Normalizing flows rely on the chain rule for computing changes in probability distributions exactly. Let T : R^d → R^d be a diffeomorphism. If, for z ∼ Q and x ∼ P, z = T(x), one has,

log P(x) = log Q(z) − log |det ∇T^{−1}|.

Following [27], a NF is characterized by a sequence of diffeomorphisms {T_i}_{i=1}^{N} that transform a simple probability distribution (e.g. a Gaussian distribution) into a complex one. If one understands this sequence continuously, it can be modeled through a dynamic system. In this context, CNFs are a class of generative models based on the dynamic system,

z(x, 0) = x, and ż(x, t) = Fθ(z(x, t)), (26)

which is an Ordinary Differential Equation (ODE). Henceforth we denote z_t(x) = z(x, t). Using Euler's method and setting z_0 = x, one can discretize equation 26 as z_{n+1} = z_n + τ Fθ(z_n), for a step-size τ > 0. This equation is similar to the ResNet structure [109]. Indeed, this is the insight behind the seminal work of [110], who proposed the Neural ODE (NODE) algorithm for parametrizing ODEs through NNs. CNFs are thus the intersection of NFs with NODEs.

According to [110], if a Random Variable (RV) z(x, t) satisfies equation 26, then its log-probability follows,

(∂/∂t) log |det ∇z| = Tr(∇_z Fθ) = div(Fθ). (27)

With this workaround, minimizing the negative log-likelihood amounts to minimizing,

Σ_{i=1}^{n} ( −log Q(z_T(x_i)) − ∫_0^T div(Fθ(z_t(x_i))) dt ), (28)

which can be solved using automatic differentiation [110]. One limitation of CNFs is that the generic flow can be highly irregular, with unnecessarily fluctuating dynamics. As remarked by [111], this can pose difficulties for numerical integration. These issues motivated the proposition of two regularization terms yielding simpler dynamics. The first term is based on dynamic OT. Let,

ρ_t = z_{t,♯}P = (1/n) Σ_{i=1}^{n} δ_{z_t(x_i)},
and recalling the concepts in section 2.1, this simplifies eq. 4,

min_θ L(θ) = (1/n) Σ_{i=1}^{n} ∫_0^T ∥Fθ(z_t(x_i))∥² dt, subject to z_0 = x, and z_{T,♯}P = Q, (29)

where ∥Fθ(z_t(x_i))∥² is the kinetic energy of the particle x_i at time t, and L is the total kinetic energy. In addition, ρ_t has the properties of an OT map. Among these, OT enforces that: (i) the flow trajectories do not intersect, and (ii) the particles follow geodesics w.r.t. the ground cost (e.g. straight lines for an Euclidean cost), thus enforcing the regularity of the flow. Nonetheless, as [111] remarks, these properties are enforced only on training data. To effectively generalize them, the authors propose a second regularization term, which consists of the Frobenius norm of the Jacobian,

Ω(F) = (1/n) Σ_{i=1}^{n} ∥∇Fθ(z_T(x_i))∥²_F. (30)

Thus, the Regularized Neural ODE (RNODE) algorithm combines equation 27 with the penalties in equations 29 and 30. Ultimately, this is equivalent to minimizing the KL divergence KL(z_{T,♯}P ∥ Q) under the said constraints.

5.2 Dictionary Learning

Dictionary learning [112] represents data X ∈ R^{n×d} through a set of K atoms D ∈ R^{K×d} and n representations A ∈ R^{n×K}. For a loss L and a regularization term Ω,

arg min_{D,A} (1/n) Σ_{i=1}^{n} L(x_i, D^T a_i) + λΩ(D, A). (31)

In this setting, the data points x_i are approximated linearly through the matrix-vector product D^T a_i. In other words, X ≈ AD. The practical interest is finding a faithful and

This can be understood as a barycenter under the Euclidean metric. This remark motivated [117] to use the Wasserstein barycenter B(a_i, D) of histograms in D, with weights a_i, for approximating x_i, that is,

arg min_{D,A} Σ_{i=1}^{N} W_{c,ϵ}(B(a_i, D), x_i) + λΩ(D, A).

Graph Dictionary Learning. An interesting use case of this framework is Graph Dictionary Learning (GDL) (see section 4.1). Let {G_ℓ : (h_ℓ, S^{(ℓ)})}_{ℓ=1}^{N} denote a dataset of N graphs encoded by their node similarity matrices S^{(ℓ)} ∈ R^{nℓ×nℓ}, and histograms over nodes h_ℓ ∈ Δ^{nℓ}. [118] proposes,

arg min_{{S_k}_{k=1}^{K}, {a_ℓ}_{ℓ=1}^{N}} Σ_{ℓ=1}^{N} GW_2²( S^{(ℓ)}, Σ_{k=1}^{K} a_{ℓ,k} S^{(k)} ) − λ∥a_ℓ∥²₂,

in analogy to eq. 31, using the GW distance (see eq. 22) as loss function. Parallel to this line of work, [119] proposes Gromov-Wasserstein Factorization (GWF), which approximates S^{(ℓ)} non-linearly through GW barycenters [120].

Topic Modeling. Another application of dictionary learning consists of topic modeling [121], in which a set of documents is expressed in terms of a dictionary of topics weighted by mixture coefficients. As explored in section 4.1, documents can be interpreted as probability distributions over words (e.g., eq. 21). In this context, [122] proposes an OT-inspired algorithm for learning a set of distributions Q = {Q̂_k}_{k=1}^{K} over words in the whole vocabulary. In this sense, each Q̂_k is a topic. Learning is based on a hierarchical OT problem,

min_{{q_k}_{k=1}^{K}, b, γ} Σ_{k=1}^{K} Σ_{i=1}^{N} γ_{ik} W_{2,ϵ}(Q̂_k, P̂_i) − H(γ),
This problem can be optimized in two steps: (i) one sets r_{ik} = 1 if Q_k is the closest distribution to P_i (w.r.t. W_2); (ii) one computes the centroids using P_k = {P_i : r_{ik} = 1}. In this case, Q_k = B(1/Σ_i r_{ik}; P_k). This formulation is related

where Ω is the class sparsity regularizer. Here we discuss 2 cases proposed in [56]: (i) semi-supervised and (ii) group-lasso. The semi-supervised regularizer assumes that y^{(PT)} are known at training time, so that,
Fig. 5: Optimal Transport for Domain Adaptation (OTDA) methodology: (a) source (red) and target (blue) samples follow different distributions; a classifier (dashed black line) fit on the source is not adequate for the target; (b) OT plan visualization: an edge connects source and target points (i, j) whenever γ_{ij} > 0; (c) Tγ generates samples on the target domain (red).
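The transport step of Fig. 5 can be sketched numerically. The snippet below is a minimal illustration, not the implementation of [56]: it assumes uniform weights, equal sample sizes, and a squared Euclidean ground cost, so the exact OT plan reduces to an assignment problem (solved here with SciPy; POT's `ot.emd` [40] handles the general case). The function name `otda_barycentric_map` is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def otda_barycentric_map(Xs, Xt):
    """Illustrative OTDA step: exact OT plan + barycentric projection.

    With uniform weights and n_s == n_t, the optimal plan of the discrete
    Monge-Kantorovich problem is (1/n times) a permutation matrix, which
    linear_sum_assignment recovers from the pairwise cost matrix.
    """
    # Squared Euclidean ground cost between source and target samples.
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(C)
    n = Xs.shape[0]
    gamma = np.zeros((n, n))
    gamma[rows, cols] = 1.0 / n  # OT plan with uniform marginals
    # Barycentric projection: T(x_i) = (sum_j gamma_ij x_j^t) / (sum_j gamma_ij).
    T = (gamma @ Xt) / gamma.sum(axis=1, keepdims=True)
    return gamma, T

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))          # source samples (red in Fig. 5 (a))
Xt = Xs + np.array([4.0, 1.0])         # target domain: a shifted copy
gamma, T = otda_barycentric_map(Xs, Xt)
print(np.abs(T - Xt).max())            # ~0: transported samples lie on the target cloud
```

For a pure translation, the recovered map is the translation itself, matching the intuition of Fig. 5 (c).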
intuition behind deep DA is to force the feature extractor to learn domain invariant features. Given x_i^{(PS)} ∼ PS and x_j^{(PT)} ∼ PT, an invariant feature extractor gθ should satisfy,

(gθ)♯P̂S = (1/nS) Σ_{i=1}^{nS} δ_{z_i^{(PS)}} ≈_{Wc} (1/nT) Σ_{j=1}^{nT} δ_{z_j^{(PT)}} = (gθ)♯P̂T,

where z_i^{(PS)} = gθ(x_i^{(PS)}) (resp. P̂T), and P̂S ≈_{Wc} P̂T means Wc(P̂S, P̂T) ≈ 0. This implies that, after the application of gθ, the distributions P̂S and P̂T are close. In this sense, this condition is enforced by adding an additional term to the classification loss function (e.g. the MMD as in [136]). In this context, [137] proposes the Domain Adversarial Neural Network (DANN) algorithm, based on the loss function,

L_DANN(θ, ξ, η) = R̂_PS(hξ ◦ gθ) − λL_H(θ, η),

where L_H denotes,

(1/nS) Σ_{i=1}^{nS} log hη(z_i^{(PS)}) + (1/nT) Σ_{j=1}^{nT} log(1 − hη(z_j^{(PT)})),

and hη is a supplementary classifier that discriminates between source (labeled 0) and target (labeled 1). The DANN algorithm is a mini-max optimization problem, so as to minimize classification error and maximize domain confusion. This draws a parallel with the GAN algorithm of [25], presented in section 5.1.1. This remark motivated [138] to use the Wasserstein distance instead of d_H. Their algorithm is called Wasserstein Distance Guided Representation Learning (WDGRL). The Wasserstein domain adversarial loss is then,

L_W(θ, η) = (1/nS) Σ_{i=1}^{nS} hη(gθ(x_i^{(PS)})) − (1/nT) Σ_{j=1}^{nT} hη(gθ(x_j^{(PT)})).

Following our discussion on KR duality (see section 2.1) as well as the Wasserstein GAN (see section 5.1.1), one needs to maximize L_W over hη ∈ Lip1. [138] proposes doing so using the gradient penalty term of [92],

min_θ min_ξ R̂_PS(hξ ◦ gθ) + λ₂ max_η L_W(θ, η) − λ₁Ω(hη),

where λ₁ controls the gradient penalty term, and λ₂ controls the importance of the domain loss term.

Finally, we highlight that the Joint Distribution Optimal Transport (JDOT) framework can be extended to deep DA. First, one includes the feature extractor gθ in the ground cost,

C_ij(θ, ξ) = α∥z_i^{(PS)} − z_j^{(PT)}∥² + c_ℓ(y_i^{(PS)}, hξ(z_j^{(PT)})),

second, the objective function includes a classification loss R_PS and the OT loss, that is,

L_JDOT(θ, ξ) = R̂_PS(hξ ◦ gθ) + Σ_{i=1}^{nS} Σ_{j=1}^{nT} γ_ij C_ij(θ, ξ), (35)

which is jointly minimized w.r.t. γ ∈ R^{nS×nT}, θ and ξ. [67] proposes minimizing L_JDOT jointly w.r.t. θ and ξ using mini-batches (see section 3.4). Nonetheless, the authors noted that one needs large mini-batches for stable training. To circumvent this issue, [71] proposed using unbalanced OT, allowing for smaller batch sizes and improved performance.

In this section we highlight 2 generalizations of the DA problem. First, Multi-Source Domain Adaptation (MSDA) (e.g. [139]) deals with the case where source data comes from multiple, distributionally heterogeneous domains, namely PS1, · · · , PSK. Second, Heterogeneous Domain Adaptation (HDA) is concerned with the case where source and target domain data live in incomparable spaces (e.g. [140]).

For MSDA, [141] proposes using the Wasserstein barycenter for building an intermediate domain P̂B = B(α; {P̂Sk}_{k=1}^{NS}) (see equation 7), for αk = 1/NS. Since P̂B ≠ P̂T, the authors propose an additional adaptation step for transporting the barycenter towards the target.
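The ingredients of the JDOT objective (eq. 35) can be sketched for fixed networks. This is an illustrative simplification, not DeepJDOT itself: the classifier is a toy softmax, weights are uniform with nS = nT, and the coupling comes from an exact assignment rather than the mini-batch strategy of [67]; the names `jdot_ot_term` and `predict` are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jdot_ot_term(Zs, Ys, Zt, predict, alpha=1.0):
    """OT term of eq. 35 for fixed networks: sum_ij gamma_ij C_ij with
    C_ij = alpha * ||z_i^s - z_j^t||^2 + CE(y_i^s, h(z_j^t))."""
    Pt = predict(Zt)                                  # h_xi(z_j^t), rows sum to 1
    sq = ((Zs[:, None, :] - Zt[None, :, :]) ** 2).sum(axis=-1)
    ce = -(Ys @ np.log(Pt + 1e-12).T)                 # cross-entropy, shape (n_s, n_t)
    C = alpha * sq + ce                               # ground cost of eq. 35
    rows, cols = linear_sum_assignment(C)             # exact plan (uniform, n_s == n_t)
    return C[rows, cols].sum() / Zs.shape[0]

# Toy softmax classifier standing in for h_xi (illustrative only).
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
def predict(Z):
    logits = Z @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Zs = rng.normal(size=(30, 2))                  # source features g_theta(x^s)
Ys = np.eye(3)[rng.integers(0, 3, size=30)]    # one-hot source labels
Zt = Zs + 0.1 * rng.normal(size=(30, 2))       # nearby target features
loss = jdot_ot_term(Zs, Ys, Zt, predict)
```

In the full method, this term is added to the source classification risk and minimized over the networks as well, with the plan re-estimated alternately.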
In a different direction, [142] proposes the Weighted JDOT algorithm, which generalizes the work of [135] for MSDA,

min_{α,gθ,hξ} (1/K) Σ_{k=1}^{K} R̂_{PSk}(hξ ◦ gθ) + W_{cθ,ξ}( P̂T, Σ_{k=1}^{K} αk P̂Sk ).

For HDA, Co-Optimal Transport (COOT) [140] tackles DA when the distributions P̂S and P̂T are supported on incomparable spaces, that is, X^{(PS)} ∈ R^{nS×dS} and X^{(PT)} ∈ R^{nT×dT}, with nS ≠ nT and dS ≠ dT. The authors' strategy relies on the GW OT formalism,

(γ^{(s)}, γ^{(f)}) = min_{γ^{(s)}∈Γ(wS,wT), γ^{(f)}∈Γ(vS,vT)} Σ_{ijkl} L(X_{i,k}^{(PS)}, X_{j,l}^{(PT)}) γ_{ij}^{(s)} γ_{kl}^{(f)},

where γ^{(s)} ∈ R^{nS×nT} and γ^{(f)} ∈ R^{dS×dT} are OT plans between samples and features, respectively. Using γ^{(s)}, one can estimate labels on the target domain using label propagation [143]: Ŷ^{(PT)} = diag(p_T)^{−1}(γ^{(s)})^T Y^{(PS)}.

6.4 Transferability

Transferability concerns the task of predicting whether TL will be successful. From a theoretical perspective, different IPMs bound the target risk R_PT by the source risk R_PS, such as the H, Wasserstein and MMD distances (see e.g., [22], [131], [144]). Furthermore, there is a practical interest in accurately measuring transferability a priori, before training or fine-tuning a network. A candidate measure coming from OT is the Wasserstein distance, but in its original formulation it only takes features into account. [145] proposes the Optimal Transport Dataset Distance (OTDD), which relies on a Wasserstein-distance based metric on the label space,

OTDD(PS, PT) = min_{γ∈Γ(PS,PT)} E_{(z1,z2)∼γ}[c(z1, z2)],
c(z1, z2) = ∥x1 − x2∥²₂ + W2(P_{S,y1}, P_{T,y2}),

where z = (x, y), and P_{S,y} is the conditional distribution PS(X|Y = y). As the authors show in [145], this distance is highly correlated with transferability, and can even be used to interpolate between datasets with different classes [146].

random return Zπ, such that Qπ(st, at) = E_{P,π}[Zπ(st, at)], where the expectation is taken in the sense of equation 9. In this sense, Zπ is a RV whose expectation is Qπ. [150] defines DRL by analogy with equation 10,

Zπ(st, at) =_D R(st, at) + λZπ(st+1, at+1),

where =_D means equality in distribution. Note that Zπ(st, at) is a distribution over the value of the state (st, at), hence a distribution over the line. [150] then suggests parametrizing Zπ using Zθ with discrete support of size N, between values Vmin and Vmax ∈ R, so that Zθ(s, a) = z_i with probability p(s, a) = softmax(θ(s, a)). As commented by the authors, N = 51 yields the best empirical results, hence the name of their algorithm, C51.

For a pair (st, at), learning Zθ corresponds to sampling (st+1, at+1) according to the dynamics P and the policy π, and minimizing a measure of divergence between Zθ and the application of the Bellman operator, i.e., TπZθ = R(st, at) + λZθ(st+1, at+1). OT intersects distributional RL precisely at this point. Indeed, [150] use the Wasserstein distance as a theoretical framework for their method. This suggests that one should employ Wc in learning as well; however, as remarked by [70], this idea does not work in practice, since the stochastic gradients of ∇θWp are biased. The solution [150] proposes is to use the KL divergence for optimization. This yields practical complications, as Zθ and TπZθ may have disjoint supports. To tackle this issue, the authors suggest an intermediate step for projecting TπZθ onto the support of Zθ.

To avoid this projection step, [151] further proposes to change the modeling of Zθ. Instead of using fixed atoms with variable weights, the authors propose parametrizing the support of Zθ, that is, Zθ(s, a) = (1/N) Σ_{i=1}^{N} δ_{θi(s,a)}. This corresponds to estimating z_i = θi(s, a). Since Zθ is a distribution over the line, and due to the relationship between the 1-D Wasserstein distance and the quantile functions of distributions, the authors call this technique quantile regression. This choice presents 3 advantages: (i) discretizing Zθ's support is more flexible, as the support is now free, potentially leading to more accurate predictions; (ii) this further implies that one does not need to perform the projection step; (iii) it allows the use of the Wasserstein distance as a loss without suffering from biased gradients.

where the inclusion highlights that the minimizer may be non-unique. Upon a transition (s, a, s′, r), [153] defines the
Wasserstein Temporal Difference, which is the distributional analogue of equation 11,

Q_{t+1}(s, a) ∈ B([1 − αt, αt]; {Q_t(s, a), TπQ_t(s, a)}), (37)

where αt > 0 is the learning rate, whereas TπQ_t = r + λV_t. In certain cases when P is specified, equations 36 and 37 have known solutions. This is the case for Gaussian distributions parametrized by a mean m(s, a) and variance σ(s, a)², and for empirical distributions with N-point support given by x_i(s, a) and weights a_i = 1/N.

7.3 Policy Optimization

PO focuses on manipulating the policy π for maximizing η(π). In this section we focus on gradient-based approaches (e.g. [154]). The idea of these methods relies on parametrizing π = πθ, so that training consists of maximizing η(θ) = η(πθ). However, as [155] remarks, these algorithms suffer from high variance and slow convergence. [155] thus proposes a distributional view. Let P be a distribution over θ; they propose the following optimization problem,

P⋆ = arg max_P E_{θ∼P}[η(πθ)] − αKL(P ∥ P0), (38)

where P0 is a prior, and α > 0 controls regularization. The maximizer of equation 38 satisfies P(θ) ∝ exp(η(πθ)/α). From a Bayesian perspective, P(θ) is the posterior distribution, whereas exp(η(πθ)/α) is the likelihood function. As discussed in [156], this formulation can be written in terms of gradient flows in the space of probability distributions (see [157] for further details). First, consider the functional,

F(P) = −∫ P(θ) log P0(θ) dθ + ∫ P(θ) log P(θ) dθ.

For Itô diffusions [158], optimizing F w.r.t. P becomes,

P_{k+1}^{(h)} = arg min_P KL(P ∥ P0) + W2²(P, P_k^{(h)})/(2h), (39)

which corresponds to the Jordan-Kinderlehrer-Otto (JKO) scheme [158]. In this context, [156] introduces 2 formulations for learning an optimal policy: indirect and direct learning. In the first case, one defines the gradient flow for equation 39 in terms of θ. This setting is particularly suited when the parameter space does not have many dimensions. In the second case, one uses policies directly in the described framework, which is related to [159].

8 CONCLUSIONS AND FUTURE DIRECTIONS

8.1 Computational Optimal Transport

A few trends have emerged in the recent computational OT literature. First, there is an increasing use of OT as a loss function (e.g., [43], [65], [67], [91]). Second, mini-max optimization appears in seemingly unrelated strategies, such as the SRW [52], structured OT [57] and ICNN-based techniques [61]. The first 2 cases consist of works that propose new problems by considering the worst-case scenario w.r.t. the ground cost. A somewhat common framework was proposed by [160], through a convex set C of ground-cost matrices,

RKP(Γ, C) = min_{γ∈Γ} max_{c∈C} E_{(x1,x2)∼γ}[c(x1, x2)].

A third trend, useful in generative modeling, is sliced Wasserstein distances. This family of metrics provides an easy-to-compute variant of the Wasserstein distance that benefits from the fact that in 1-D, OT has a closed-form solution. In addition, a variety of works highlight sample complexity and topological properties [103], [104] of these distances. We further highlight that ICNNs are gaining increasing attention in the ML community, especially in generative modeling [161], [162]. Furthermore, ICNNs represent a good case of ML for OT. Even though unexplored, this technique would be useful in DA and TL as well, for mapping samples between different domains.

There are two persistent challenges in OTML. First is the time complexity of OT. This is partially solved by considering sliced methods, entropic regularization, or, more recently, mini-batch OT. Second is the sample complexity of OT. Especially, the Wasserstein distance is known to suffer from the curse of dimensionality [38], [163]. This has led researchers to consider other estimators. Possible solutions include: (i) the SW [104]; (ii) the SRW [164]; (iii) entropic OT [45]; and (iv) mini-batch OT [69], [71], [165]. Table 1 summarizes the time and sample complexity of different strategies in computational OT.

TABLE 1: Time and sample complexity of empirical OT estimators in terms of the number of samples n. † indicates the complexity of a single iteration; ∗ indicates that the complexity is affected by the samples' dimension.

Loss          | Estimator              | Time Complexity | Sample Complexity
Wc(P, Q)      | Wc(P̂, Q̂)               | O(n³ log n)     | O(n^{−1/d})
Wc,ϵ(P, Q)    | Wc,ϵ(P̂, Q̂)             | O(n²)†          | O(n^{−1/2}(1 + ϵ^{−⌊d/2⌋}))
W2(P, Q)      | SRW_k(P̂, Q̂)            | O(n²)†,∗        | O(n^{−1/k})
SW(P, Q)      | SW(P̂, Q̂)               | O(n log n)∗     | O(n^{−1/2})
W2(P, Q)      | FW_{K,2}(P̂, Q̂)         | O(n²)†          | O(n^{−1/2})
L_MBOT(P, Q)  | L_MBOT^{(k,m)}(P̂, Q̂)   | O(km³ log m)    | O(n^{−1/2})

8.2 Supervised Learning

OT contributes to supervised learning on three fronts. First, it can serve as a metric for comparing objects. This can be integrated into nearest-neighbors classifiers or kernel methods, and it has the advantage of defining a rich metric [74]. However, due to the time complexity of OT, this approach does not scale with sample size. Second, the Wasserstein distance can be used to define a semantically aware metric between labels [78], [80], which is useful for classification. Third, OT contributes to fairness, which tries to find classifiers that are insensitive w.r.t. protected variables. As such, OT presents three contributions: (i) defining a transformation of data so that classifiers trained on transformed data fulfill the fairness requirement; (ii) defining a regularization term for enforcing fairness; (iii) defining a test for fairness, given a classifier. The first point is conceptually close to the contributions of OT for DA, as both depend on the barycentric projection for generating new data.

8.3 Unsupervised Learning

Generative modeling. This subject has been one of the most active OTML topics. This is evidenced by the impact papers such as [91] and [92] had in how GANs are understood
and trained. Nonetheless, the Wasserstein distance comes at a cost: an intense computational burden in its primal formulation, and a complicated constraint in its dual. This computational burden has driven early applications of OT to consider the dual formulation, leaving open the question of how to enforce the Lipschitz constraint. On this matter, [92], [93] propose regularizers to enforce this constraint indirectly. In addition, recent experiments and theoretical discussions have shown that WGANs do not estimate the Wasserstein distance [94]–[96].

In contrast to GANs, OT contributions to autoencoders revolve around the primal formulation, which substitutes the log-likelihood term in vanilla algorithms. In practice, using Wc partially solves the blurriness issue in VAEs. Moreover, the work of [107] serves as a bridge between VAEs [26] and AAEs [106].

Furthermore, dynamic OT contributes to CNFs, especially through the BB formulation. In this context, a series of papers [111], [166] have shown that OT enforces flow regularity, thus avoiding stiffness in the related ODEs. As a consequence, these strategies improve training time.

Dictionary Learning. OT contributed to this subject by defining meaningful metrics between histograms [116], as well as ways to calculate averages in probability spaces [117]; examples include [118] and [119]. Future contributions could analyze the manifold of interpolations of dictionary atoms in the Wasserstein space.

Clustering. OT provides a distributional view of clustering, by approximating an empirical distribution with n samples by another with k samples representing clusters. In addition, the Wasserstein distance can be used to perform K-means in the space of distributions.

8.4 Transfer Learning

In UDA, OT has shown state-of-the-art performance in various fields, such as image classification [56], sentiment analysis [135], fault diagnosis [30], [167], and audio classification [141], [142]. Furthermore, advances in the field include methods and architectures that can handle large-scale datasets, such as WDGRL [138] and DeepJDOT [67]. This effectively consolidated OT as a valid and useful framework for transferring knowledge between different domains. Furthermore, heterogeneous DA [140] is a new, challenging and relatively unexplored problem that OT is particularly suited to solve. In addition, OT contributes to theoretical developments in DA. For instance, [144] provides a bound for the difference |R̂_PT(h) − R̂_PS(h)| in terms of the Wasserstein distance, akin to [22]. The authors' results highlight the fact that the class-sparsity structure in the OT plan plays a prominent role in the success of DA.

8.5 Reinforcement Learning

The distributional RL framework of [150] re-introduced previous ideas into modern RL, such as formulating the goal of an agent in terms of distributions over rewards. This turned out to be successful, as the authors' method surpassed the state-of-the-art in a variety of game benchmarks. Furthermore, [169] studied how the framework is related to dopamine-based RL, which highlights the importance of this formulation.

In a similar spirit, [153] proposes a distributional method for propagating uncertainty of the value function. Here, it is important to highlight that while [150] focuses on the randomness of the reward function, [153] studies the stochasticity of value functions. In this sense, the authors propose a variant of Q-Learning, called WQL, that leverages Wasserstein barycenters for updating the Q-function distribution. From a practical perspective, the WQL algorithm is theoretically grounded and has good performance. Further research could focus on the common points between [150], [151] and [153], either by establishing a common framework or by comparing the 2 methodologies empirically. Finally, [156] explores policy optimization as gradient flows in the space of distributions. This formalism can be understood as the continuous-time counterpart of gradient descent.

8.6 Final Remarks

TABLE 2: Principled division of works considered in this survey, according to their OT formulation and the component they represent in ML.

Method                    | Section | OT Formulation | ML Component
EMD [5]                   | 4.1     | MK             | Metric
WMD [74]                  | 4.1     | MK             | Metric
GW [76]                   | 4.1     | MK             | Metric
FGW [75]                  | 4.1     | MK             | Metric
Random Repair [85]        | 4.3     | MK             | Loss
FRL [86], [87]            | 4.3     | MK             | Transformation
Fairness Test [83]        | 4.3     | MK             | Metric
WGAN [91]                 | 5.1.1   | KR             | Loss
SinkhornAutodiff [43]     | 5.1.1   | MK             | Loss
WAE [107]                 | 5.1.2   | MK             | Loss/Regularization
RNODE [111]               | 5.1.3   | BB             | Regularization
Dictionary Learning [116] | 5.2     | MK             | Loss
WDL [117]                 | 5.2     | MK             | Loss/Aggregation
GDL [118]                 | 5.2     | GW             | Loss
GWF [119]                 | 5.2     | GW             | Loss/Aggregation
OTLDA [122]               | 6.2     | MK             | Loss/Metric
Distr. K-Means [129]      | 5.3     | MK             | Metric
OTDA [56]                 | 6.1     | MK             | Transformation
JDOT [135]                | 6.1     | MK             | Loss
WDGRL [167]               | 6.2     | KR             | Loss/Regularization
DeepJDOT [67]             | 6.2     | MK             | Loss/Regularization
WBT [139], [141]          | 6.3     | MK             | Transformation/Aggregation
WJDOT [142]               | 6.3     | MK             | Loss
COOT [140]                | 6.3     | MK             | Transformation
OTCE [170]                | 6.4     | MK             | Metric
OTDD [145]                | 6.4     | MK             | Metric
DRL [150]                 | 7.1     | MK             | Loss
BRL [153]                 | 7.2     | MK             | Aggregation
PO [156]                  | 7.3     | BB             | Regularization
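Several entries above (SW in Table 1, the quantile parametrization of section 7) rest on the closed form of OT in 1-D, where sorting samples amounts to evaluating quantile functions. A minimal numpy sketch, with function names of our choosing:

```python
import numpy as np

def w2_1d(x, y):
    """Squared 2-Wasserstein distance between equal-size 1-D samples:
    the optimal coupling matches sorted samples (quantile functions)."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean((xs - ys) ** 2)

def sliced_w2(X, Y, n_proj=200, seed=0):
    """Monte-Carlo sliced distance: average the 1-D closed form over
    random directions on the unit sphere."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        total += w2_1d(X @ theta, Y @ theta)
    return total / n_proj

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w = w2_1d(x, x + 3.0)      # a pure shift by 3 gives W2^2 = 9
X = rng.normal(size=(200, 5))
s = sliced_w2(X, X)        # identical clouds: distance 0
```

The O(n log n) cost of the sort is what makes the sliced estimator in Table 1 cheap relative to exact OT.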
formalism (equation 2). Whenever a notion of dynamics is [21] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J.
involved, the BB formulation is preferred (equation 4). Smola, “A kernel approach to comparing distributions,” in Pro-
ceedings of the National Conference on Artificial Intelligence, vol. 22,
ML for OT. In the inverse direction, a few works have no. 2. Menlo Park, CA; Cambridge, MA; London; AAAI Press;
considered ML, especially NNs, as a component within MIT Press; 1999, 2007, p. 1637.
OT. Arguably, one of the first works that considered so [22] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and
was [134], who proposed finding a parametric Monge map- J. W. Vaughan, “A theory of learning from different domains,”
Machine learning, vol. 79, no. 1, pp. 151–175, 2010.
ping. This was followed by [133], who proposed using a [23] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of
NN for parametrizing the OT map. These two works are machine learning. MIT press, 2018.
connected to the notion of barycentric mapping, which as [24] V. Vapnik, The nature of statistical learning theory. Springer science
discussed in [133] is linked to the Monge map, under mild & business media, 2013.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
conditions. In a novel direction [61] exploited recent devel- Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative ad-
opments in NN architectures, notably through ICNNs, for versarial nets,” Advances in neural information processing systems,
finding a Monge map. Differently from previous work, [61] vol. 27, 2014.
provably find a Monge map, which can be useful in DA. [26] D. P. Kingma and M. Welling, “Autoencoding variational bayes,”
in International Conference on Learning Representations, 2020.
[27] I. Kobyzev, S. Prince, and M. Brubaker, “Normalizing flows: An
introduction and review of current methods,” IEEE Transactions
R EFERENCES on Pattern Analysis and Machine Intelligence, 2020.
[28] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D.
[1] G. Monge, “Mémoire sur la théorie des déblais et des remblais,” Lawrence, Dataset shift in machine learning. Mit Press, 2008.
Histoire de l’Académie Royale des Sciences de Paris, 1781. [29] M. Wang and W. Deng, “Deep visual domain adaptation: A
[2] L. Kantorovich, “On the transfer of masses (in russian),” in survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
Doklady Akademii Nauk, vol. 37, no. 2, 1942, pp. 227–229. [30] E. F. Montesuma, M. Mulas, F. Corona, and F. M. N. Mboula,
[3] C. Villani, Optimal transport: old and new. Springer, 2009, vol. 338. “Cross-domain fault diagnosis through optimal transport for a
[4] U. Frisch, S. Matarrese, R. Mohayaee, and A. Sobolevski, “A cstr process,” IFAC-PapersOnLine, 2022.
reconstruction of the initial conditions of the universe by optimal [31] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
mass transportation,” Nature, vol. 417, no. 6886, pp. 260–262, Transactions on knowledge and data engineering, vol. 22, no. 10, pp.
2002. 1345–1359, 2009.
[5] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s [32] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduc-
distance as a metric for image retrieval,” International journal of tion. MIT press, 2018.
computer vision, vol. 40, no. 2, pp. 99–121, 2000. [33] R. Bellman, “On the theory of dynamic programming,” Proceed-
[6] L. Ambrosio and N. Gigli, “A user’s guide to optimal transport,” ings of the National Academy of Sciences of the United States of
in Modelling and optimisation of flows on networks. Springer, 2013, America, vol. 38, no. 8, p. 716, 1952.
pp. 1–155.
[34] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
[7] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Ro- no. 3-4, pp. 279–292, 1992.
hde, “Optimal mass transport: Signal processing and machine-
[35] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731,
learning applications,” IEEE signal processing magazine, vol. 34,
pp. 34–37, 1966.
no. 4, pp. 43–59, 2017.
[36] M. Köppen, “The curse of dimensionality,” in 5th online world
[8] J. Solomon, “Optimal transport on discrete domains,” AMS Short
conference on soft computing in industrial applications (WSC5), vol. 1,
Course on Discrete Differential Geometry, 2018.
2000, pp. 4–8.
[9] B. Lévy and E. L. Schwindt, “Notions of optimal transport theory
and how to implement them on a computer,” Computers & [37] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT
Graphics, vol. 72, pp. 135–148, 2018. press, 2016.
[10] G. Peyré, M. Cuturi et al., “Computational optimal transport: [38] R. M. Dudley, “The speed of mean glivenko-cantelli conver-
[10] G. Peyré and M. Cuturi, "Computational optimal transport: With applications to data science," Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019.
[11] R. Flamary, "Transport optimal pour l'apprentissage statistique," Habilitation à diriger des recherches, Université Côte d'Azur, 2019.
[12] F. Santambrogio, "Optimal transport for applied mathematicians," Birkhäuser, NY, vol. 55, no. 58-63, p. 94, 2015.
[13] A. Figalli and F. Glaudo, An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows. European Mathematical Society/AMS, 2021.
[14] G. Monge, Mémoire sur le calcul intégral des équations aux différences partielles. Imprimerie royale, 1784.
[15] Y. Brenier, "Polar factorization and monotone rearrangement of vector-valued functions," Communications on Pure and Applied Mathematics, vol. 44, no. 4, pp. 375–417, 1991.
[16] J.-D. Benamou and Y. Brenier, "A computational fluid mechanics solution to the monge-kantorovich mass transfer problem," Numerische Mathematik, vol. 84, no. 3, pp. 375–393, 2000.
[17] M. Agueh and G. Carlier, "Barycenters in the wasserstein space," SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
[18] A. Müller, "Integral probability metrics and their generating classes of functions," Advances in Applied Probability, vol. 29, no. 2, pp. 429–443, 1997.
[19] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967.
[20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet, "On the empirical estimation of integral probability metrics," Electronic Journal of Statistics, vol. 6, pp. 1550–1599, 2012.
[38] … "…gence," The Annals of Mathematical Statistics, vol. 40, no. 1, pp. 40–50, 1969.
[39] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," Advances in Neural Information Processing Systems, vol. 26, pp. 2292–2300, 2013.
[40] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier et al., "Pot: Python optimal transport," Journal of Machine Learning Research, vol. 22, no. 78, pp. 1–8, 2021.
[41] M. Cuturi, L. Meng-Papaxanthos, Y. Tian, C. Bunne, G. Davis, and O. Teboul, "Optimal transport tools (ott): A jax toolbox for all things wasserstein," arXiv preprint arXiv:2201.12324, 2022.
[42] G. B. Dantzig, "Reminiscences about the origins of linear programming," in Mathematical Programming: The State of the Art. Springer, 1983, pp. 78–86.
[43] A. Genevay, G. Peyré, and M. Cuturi, "Learning generative models with sinkhorn divergences," in International Conference on Artificial Intelligence and Statistics. PMLR, 2018, pp. 1608–1617.
[44] G. Luise, A. Rudi, M. Pontil, and C. Ciliberto, "Differential properties of sinkhorn approximation for learning with wasserstein distance," Advances in Neural Information Processing Systems, vol. 31, 2018.
[45] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, "Sample complexity of sinkhorn divergences," in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1574–1583.
[46] J. Rabin, G. Peyré, J. Delon, and M. Bernot, "Wasserstein barycenter and its application to texture mixing," in International Conference on Scale Space and Variational Methods in Computer Vision. Springer, 2011, pp. 435–446.
RECENT ADVANCES IN OPTIMAL TRANSPORT FOR MACHINE LEARNING 18
[47] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, "Sliced and radon wasserstein barycenters of measures," Journal of Mathematical Imaging and Vision, vol. 51, no. 1, pp. 22–45, 2015.
[48] S. Kolouri, S. R. Park, and G. K. Rohde, "The radon cumulative distribution transform and its application to image classification," IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 920–934, 2015.
[49] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing, "Max-sliced wasserstein distance and its use for gans," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10648–10656.
[50] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde, "Generalized sliced wasserstein distances," Advances in Neural Information Processing Systems, vol. 32, 2019.
[51] G. Beylkin, "The inversion problem and applications of the generalized radon transform," Communications on Pure and Applied Mathematics, vol. 37, no. 5, pp. 579–599, 1984.
[52] F.-P. Paty and M. Cuturi, "Subspace robust wasserstein distances," in International Conference on Machine Learning. PMLR, 2019, pp. 5072–5081.
[53] T. Lin, C. Fan, N. Ho, M. Cuturi, and M. Jordan, "Projection robust wasserstein distance and riemannian optimization," Advances in Neural Information Processing Systems, vol. 33, pp. 9383–9397, 2020.
[54] M. Huang, S. Ma, and L. Lai, "A riemannian block coordinate descent method for computing the projection robust wasserstein distance," in International Conference on Machine Learning. PMLR, 2021, pp. 4446–4455.
[55] S. Ferradans, N. Papadakis, G. Peyré, and J.-F. Aujol, "Regularized discrete optimal transport," SIAM Journal on Imaging Sciences, vol. 7, no. 3, pp. 1853–1882, 2014.
[56] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, "Optimal transport for domain adaptation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1853–1865, 2017.
[57] D. Alvarez-Melis, T. Jaakkola, and S. Jegelka, "Structured optimal transport," in International Conference on Artificial Intelligence and Statistics. PMLR, 2018, pp. 1771–1780.
[58] L. Lovász, "Mathematical programming–the state of the art," 1983.
[59] A. Forrow, J.-C. Hütter, M. Nitzan, P. Rigollet, G. Schiebinger, and J. Weed, "Statistical optimal transport via factored couplings," in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 2454–2465.
[60] J. E. Cohen and U. G. Rothblum, "Nonnegative ranks, decompositions, and factorizations of nonnegative matrices," Linear Algebra and its Applications, vol. 190, pp. 149–168, 1993.
[61] A. Makkuva, A. Taghvaei, S. Oh, and J. Lee, "Optimal transport mapping via input convex neural networks," in International Conference on Machine Learning. PMLR, 2020, pp. 6672–6681.
[62] C. Villani, Topics in Optimal Transportation. American Mathematical Soc., 2021, vol. 58.
[63] B. Amos, L. Xu, and J. Z. Kolter, "Input convex neural networks," in International Conference on Machine Learning. PMLR, 2017, pp. 146–155.
[64] J. Fan, A. Taghvaei, and Y. Chen, "Scalable computations of wasserstein barycenter via input convex neural networks," in International Conference on Machine Learning. PMLR, 2021, pp. 1571–1581.
[65] G. Montavon, K.-R. Müller, and M. Cuturi, "Wasserstein training of restricted boltzmann machines," in NIPS, 2016, pp. 3711–3719.
[66] M. Sommerfeld, J. Schrieber, Y. Zemel, and A. Munk, "Optimal transport: Fast probabilistic approximation with exact solvers," J. Mach. Learn. Res., vol. 20, pp. 105–1, 2019.
[67] B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty, "Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 447–463.
[68] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert, "On parameter estimation with the wasserstein distance," Information and Inference: A Journal of the IMA, vol. 8, no. 4, pp. 657–676, 2019.
[69] K. Fatras, Y. Zine, R. Flamary, R. Gribonval, and N. Courty, "Learning with minibatch wasserstein: asymptotic and gradient properties," in AISTATS, 2020.
[70] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos, "The cramer distance as a solution to biased wasserstein gradients," arXiv preprint arXiv:1705.10743, 2017.
[71] K. Fatras, T. Sejourne, R. Flamary, and N. Courty, "Unbalanced minibatch optimal transport; applications to domain adaptation," in International Conference on Machine Learning. PMLR, 2021, pp. 3186–3197.
[72] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard, "Scaling algorithms for unbalanced optimal transport problems," Mathematics of Computation, vol. 87, no. 314, pp. 2563–2609, 2018.
[73] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[74] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, "From word embeddings to document distances," in International Conference on Machine Learning. PMLR, 2015, pp. 957–966.
[75] V. Titouan, N. Courty, R. Tavenard, and R. Flamary, "Optimal transport for structured data with application on graphs," in International Conference on Machine Learning. PMLR, 2019, pp. 6275–6284.
[76] F. Mémoli, "Gromov–wasserstein distances and the metric approach to object matching," Foundations of Computational Mathematics, vol. 11, no. 4, pp. 417–487, 2011.
[77] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning. Springer, 2006, vol. 4, no. 4.
[78] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, "Learning with a wasserstein loss," Advances in Neural Information Processing Systems, vol. 28, pp. 2053–2061, 2015.
[79] X. Liu, Y. Han, S. Bai, Y. Ge, T. Wang, X. Han, S. Li, J. You, and J. Lu, "Importance-aware semantic segmentation in self-driving with discrete wasserstein training," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11629–11636.
[80] X. Liu, W. Ji, J. You, G. E. Fakhri, and J. Woo, "Severity-aware semantic segmentation with reinforced wasserstein training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12566–12575.
[81] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, "Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment," in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1171–1180.
[82] K. Makhlouf, S. Zhioua, and C. Palamidessi, "On the applicability of machine learning fairness notions," ACM SIGKDD Explorations Newsletter, vol. 23, no. 1, pp. 14–23, 2021.
[83] N. Si, K. Murthy, J. Blanchet, and V. A. Nguyen, "Testing group fairness via optimal transport projections," in International Conference on Machine Learning. PMLR, 2021, pp. 9649–9659.
[84] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, "Fairness through awareness," in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 2012, pp. 214–226.
[85] P. Gordaliza, E. Del Barrio, G. Fabrice, and J.-M. Loubes, "Obtaining fairness using optimal transport theory," in International Conference on Machine Learning. PMLR, 2019, pp. 2357–2365.
[86] L. Oneto, M. Donini, G. Luise, C. Ciliberto, A. Maurer, and M. Pontil, "Exploiting mmd and sinkhorn divergences for fair and transferable representation learning," in NeurIPS, 2020.
[87] S. Chiappa, R. Jiang, T. Stepleton, A. Pacchiano, H. Jiang, and J. Aslanides, "A general approach to fairness with optimal transport," in AAAI, 2020, pp. 3633–3640.
[88] A. Genevay, G. Peyré, and M. Cuturi, "Gan and vae from an optimal transport point of view," arXiv preprint arXiv:1706.01807, 2017.
[89] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu, "A geometric view of optimal transportation and generative model," Computer Aided Geometric Design, vol. 68, pp. 1–21, 2019.
[90] F. Bassetti, A. Bodini, and E. Regazzini, "On minimum kantorovich distance estimators," Statistics & Probability Letters, vol. 76, no. 12, pp. 1298–1302, 2006.
[91] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning. PMLR, 2017, pp. 214–223.
[92] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of wasserstein gans," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[93] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," in International Conference on Learning Representations, 2018.
[94] T. Pinetz, D. Soukup, and T. Pock, "On the estimation of the wasserstein distance in generative models," in German Conference on Pattern Recognition. Springer, 2019, pp. 156–170.
[95] A. Mallasto, G. Montúfar, and A. Gerolin, "How well do wgans estimate the wasserstein metric?" arXiv preprint arXiv:1910.03875, 2019.
[96] J. Stanczuk, C. Etmann, L. M. Kreusser, and C.-B. Schönlieb, "Wasserstein gans work because they fail (to approximate the wasserstein distance)," arXiv preprint arXiv:2103.01678, 2021.
[97] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré, "Interpolating between optimal transport and mmd using sinkhorn divergences," in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 2681–2690.
[98] Y. Xie, X. Wang, R. Wang, and H. Zha, "A fast proximal point method for computing exact wasserstein distance," in Uncertainty in Artificial Intelligence. PMLR, 2020, pp. 433–453.
[99] D. Bertsekas, Network Optimization: Continuous and Discrete Models. Athena Scientific, 1998, vol. 8.
[100] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[101] S. Kolouri, G. K. Rohde, and H. Hoffmann, "Sliced wasserstein distance for learning gaussian mixture models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3427–3436.
[102] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool, "Sliced wasserstein generative models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3713–3722.
[103] K. Nadjahi, A. Durmus, U. Simsekli, and R. Badeau, "Asymptotic guarantees for learning generative models with the sliced-wasserstein distance," Advances in Neural Information Processing Systems, vol. 32, 2019.
[104] K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Simsekli, "Statistical and topological properties of sliced probability divergences," Advances in Neural Information Processing Systems, vol. 33, pp. 20802–20812, 2020.
[105] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
[106] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, "Adversarial autoencoders," arXiv preprint arXiv:1511.05644, 2015.
[107] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, "Wasserstein auto-encoders," in International Conference on Learning Representations, 2018.
[108] G. Patrini, R. van den Berg, P. Forre, M. Carioni, S. Bhargav, M. Welling, T. Genewein, and F. Nielsen, "Sinkhorn autoencoders," in Uncertainty in Artificial Intelligence. PMLR, 2020, pp. 733–743.
[109] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[110] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, "Neural ordinary differential equations," in Advances in Neural Information Processing Systems, vol. 31, 2018.
[111] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. Oberman, "How to train your neural ode: the world of jacobian and kinetic regularization," in International Conference on Machine Learning. PMLR, 2020, pp. 3154–3164.
[112] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by v1?" Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[113] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[114] R. Sandler and M. Lindenbaum, "Nonnegative matrix factorization with earth mover's distance metric," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1873–1880.
[115] S. Shirdhonkar and D. W. Jacobs, "Approximate earth mover's distance in linear time," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[116] A. Rolet, M. Cuturi, and G. Peyré, "Fast dictionary learning with a smoothed wasserstein loss," in Artificial Intelligence and Statistics. PMLR, 2016, pp. 630–638.
[117] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck, "Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning," SIAM Journal on Imaging Sciences, vol. 11, no. 1, pp. 643–678, 2018.
[118] C. Vincent-Cuaz, T. Vayer, R. Flamary, M. Corneli, and N. Courty, "Online graph dictionary learning," in International Conference on Machine Learning. PMLR, 2021, pp. 10564–10574.
[119] H. Xu, "Gromov-wasserstein factorization models for graph clustering," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6478–6485.
[120] G. Peyré, M. Cuturi, and J. Solomon, "Gromov-wasserstein averaging of kernel and distance matrices," in International Conference on Machine Learning. PMLR, 2016, pp. 2664–2672.
[121] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[122] V. Huynh, H. Zhao, and D. Phung, "Otlda: A geometry-aware optimal transport approach for topic modeling," Advances in Neural Information Processing Systems, vol. 33, pp. 18573–18582, 2020.
[123] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[124] D. Pollard, "Quantization and the method of k-means," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 199–205, 1982.
[125] G. Canas and L. Rosasco, "Learning probability measures with respect to optimal transport metrics," Advances in Neural Information Processing Systems, vol. 25, 2012.
[126] P. M. Gruber, "Optimum quantization and its applications," Advances in Mathematics, vol. 186, no. 2, pp. 456–497, 2004.
[127] S. Graf and H. Luschgy, Foundations of Quantization for Probability Distributions. Springer, 2007.
[128] M. Cuturi and A. Doucet, "Fast computation of wasserstein barycenters," in International Conference on Machine Learning. PMLR, 2014, pp. 685–693.
[129] E. Del Barrio, J. A. Cuesta-Albertos, C. Matrán, and A. Mayo-Íscar, "Robust clustering tools based on optimal transportation," Statistics and Computing, vol. 29, no. 1, pp. 139–160, 2019.
[130] J. A. Cuesta-Albertos, A. Gordaliza, and C. Matrán, "Trimmed k-means: an attempt to robustify quantizers," The Annals of Statistics, vol. 25, no. 2, pp. 553–576, 1997.
[131] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani, Advances in Domain Adaptation Theory. Elsevier, 2019.
[132] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, "Visual domain adaptation: A survey of recent advances," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 53–69, 2015.
[133] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel, "Large scale optimal transport and mapping estimation," in International Conference on Learning Representations, 2018.
[134] M. Perrot, N. Courty, R. Flamary, and A. Habrard, "Mapping estimation for discrete optimal transport," Advances in Neural Information Processing Systems, vol. 29, pp. 4197–4205, 2016.
[135] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy, "Joint distribution optimal transportation for domain adaptation," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[136] M. Ghifary, W. B. Kleijn, and M. Zhang, "Domain adaptive neural networks for object recognition," in Pacific Rim International Conference on Artificial Intelligence. Springer, 2014, pp. 898–904.
[137] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[138] J. Shen, Y. Qu, W. Zhang, and Y. Yu, "Wasserstein distance guided representation learning for domain adaptation," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[139] E. F. Montesuma and F. M. N. Mboula, "Wasserstein barycenter for multi-source domain adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16785–16793.
[140] V. Titouan, I. Redko, R. Flamary, and N. Courty, "Co-optimal transport," Advances in Neural Information Processing Systems, vol. 33, pp. 17559–17570, 2020.
[141] E. F. Montesuma and F. M. N. Mboula, "Wasserstein barycenter transport for acoustic adaptation," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3405–3409.
[142] R. Turrisi, R. Flamary, A. Rakotomamonjy, and M. Pontil, "Multi-source domain adaptation via weighted joint distributions optimal transport," in Uncertainty in Artificial Intelligence. PMLR, 2022, pp. 1970–1980.
[143] I. Redko, N. Courty, R. Flamary, and D. Tuia, "Optimal transport for multi-source domain adaptation under target shift," in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 849–858.
[144] I. Redko, A. Habrard, and M. Sebban, "Theoretical analysis of domain adaptation with optimal transport," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 737–753.
[145] D. Alvarez Melis and N. Fusi, "Geometric dataset distances via optimal transport," Advances in Neural Information Processing Systems, vol. 33, 2020.
[146] D. Alvarez-Melis and N. Fusi, "Dataset dynamics via gradient flows in probability space," in International Conference on Machine Learning. PMLR, 2021, pp. 219–230.
[147] D. White, "Mean, variance, and probabilistic criteria in finite markov decision processes: A review," Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988.
[148] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka, "Parametric return density estimation for reinforcement learning," in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2010, pp. 368–375.
[149] M. G. Azar, R. Munos, and B. Kappen, "On the sample complexity of reinforcement learning with a generative model," in ICML, 2012.
[150] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in International Conference on Machine Learning. PMLR, 2017, pp. 449–458.
[151] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[152] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, "Bayesian reinforcement learning: A survey," Foundations and Trends® in Machine Learning, vol. 8, no. 5-6, pp. 359–483, 2015.
[153] A. M. Metelli, A. Likmeta, and M. Restelli, "Propagating uncertainty in reinforcement learning via wasserstein barycenters," in 33rd Conference on Neural Information Processing Systems, NeurIPS 2019. Curran Associates, Inc., 2019, pp. 4335–4347.
[154] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, vol. 12, 1999.
[155] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng, "Stein variational policy gradient," UAI, 2017.
[156] R. Zhang, C. Chen, C. Li, and L. Carin, "Policy optimization as wasserstein gradient flows," in International Conference on Machine Learning. PMLR, 2018, pp. 5737–5746.
[157] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
[158] R. Jordan, D. Kinderlehrer, and F. Otto, "The variational formulation of the fokker–planck equation," SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1–17, 1998.
[159] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in International Conference on Machine Learning. PMLR, 2017, pp. 1352–1361.
[160] S. Dhouib, I. Redko, T. Kerdoncuff, R. Emonet, and M. Sebban, "A swiss army knife for minimax optimal transport," in International Conference on Machine Learning. PMLR, 2020, pp. 2504–2513.
[161] A. Korotin, V. Egiazarian, A. Asadulaev, A. Safin, and E. Burnaev, "Wasserstein-2 generative networks," in International Conference on Learning Representations, 2020.
[162] A. Korotin, L. Li, J. Solomon, and E. Burnaev, "Continuous wasserstein-2 barycenter estimation without minimax optimization," in International Conference on Learning Representations, 2020.
[163] J. Weed and F. Bach, "Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance," Bernoulli, vol. 25, no. 4A, pp. 2620–2648, 2019.
[164] J. Niles-Weed and P. Rigollet, "Estimation of wasserstein distances in the spiked transport model," arXiv preprint arXiv:1909.07513, 2019.
[165] K. Nguyen, D. Nguyen, T. Pham, N. Ho et al., "Improving mini-batch optimal transport via partial transportation," in International Conference on Machine Learning. PMLR, 2022, pp. 16656–16690.
[166] D. Onken, S. Wu Fung, X. Li, and L. Ruthotto, "Ot-flow: Fast and accurate continuous normalizing flows via optimal transport," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1-18, 2021.
[167] C. Cheng, B. Zhou, G. Ma, D. Wu, and Y. Yuan, "Wasserstein distance based deep adversarial transfer learning for intelligent fault diagnosis," arXiv preprint arXiv:1903.06753, 2019.
[168] T. Nguyen, T. Le, H. Zhao, Q. H. Tran, T. Nguyen, and D. Phung, "Most: Multi-source domain adaptation via optimal transport for student-teacher learning," in Uncertainty in Artificial Intelligence. PMLR, 2021, pp. 225–235.
[169] W. Dabney, Z. Kurth-Nelson, N. Uchida, C. K. Starkweather, D. Hassabis, R. Munos, and M. Botvinick, "A distributional code for value in dopamine-based reinforcement learning," Nature, vol. 577, no. 7792, pp. 671–675, 2020.
[170] Y. Tan, Y. Li, and S.-L. Huang, "Otce: A transferability metric for cross-domain cross-task representations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15779–15788.

Eduardo Fernandes Montesuma was born in Fortaleza, Brazil, in 1996. He received a Bachelor's degree in Computer Engineering from the Federal University of Ceará, and an M.S. degree in Electronics and Industrial Informatics from the National Institute of Applied Sciences of Rennes. He is currently working towards a PhD degree in Data Science at Université Paris-Saclay, and has been with the Department DM2I of the CEA-LIST since 2021.

Fred Ngolè Mboula was born in Bandja city, Cameroon, in 1990. He received an M.S. degree in Signal Processing from IMT Atlantique in 2013, and a PhD degree in astronomical image processing from Université Paris-Saclay in 2016. He joined the Department DM2I of the CEA-LIST in 2017 as a researcher in ML and Signal Processing. His research interests include weakly supervised learning, transfer learning and information geometry. He is also a teaching assistant in undergraduate mathematics at Université de Versailles Saint-Quentin.

Antoine Souloumiac was born in Bordeaux, France, in 1964. He received the M.S. and Ph.D. degrees in signal processing from Télécom Paris, France, in 1987 and 1993, respectively. From 1993 until 2001, he was a Research Scientist with Schlumberger. He is currently with the Laboratory LS2D, CEA-LIST. His current research interests include statistical signal processing and ML, with emphasis on point processes, biomedical signal processing, and independent component analysis or blind source separation.