
Smooth and Sparse Optimal Transport

Mathieu Blondel Vivien Seguy* Antoine Rolet*


NTT Communication Science Laboratories Kyoto University Kyoto University

* Work performed during an internship at NTT.

Abstract

Entropic regularization is quickly emerging as a new standard in optimal transport (OT). It makes it possible to cast the OT computation as a differentiable and unconstrained convex optimization problem, which can be efficiently solved using the Sinkhorn algorithm. However, entropy keeps the transportation plan strictly positive and therefore completely dense, unlike unregularized OT. This lack of sparsity can be problematic in applications where the transportation plan itself is of interest. In this paper, we explore regularizing the primal and dual OT formulations with a strongly convex term, which corresponds to relaxing the dual and primal constraints with smooth approximations. We show how to incorporate squared 2-norm and group-lasso regularizations within that framework, leading to sparse and group-sparse transportation plans. On the theoretical side, we bound the approximation error introduced by regularizing the primal and dual formulations. Our results suggest that, for the regularized primal, the approximation error can often be smaller with squared 2-norm than with entropic regularization. We showcase our proposed framework on the task of color transfer.

1 Introduction

Optimal transport (OT) distances (a.k.a. Wasserstein or earth mover's distances) are a powerful computational tool to compare probability distributions and have recently found widespread use in machine learning (Cuturi, 2013; Solomon et al., 2014; Kusner et al., 2015; Courty et al., 2016; Arjovsky et al., 2017). While OT distances exhibit a unique ability to capture the geometry of the data, their application to large-scale problems has been largely hampered by their high computational cost. Indeed, computing OT distances involves a linear program, which takes super-cubic time in the data size to solve using state-of-the-art network-flow algorithms. Related to the Schrödinger problem (Schrödinger, 1931; Léonard, 2012), entropy-regularized OT distances have recently gained popularity due to their desirable properties (Cuturi, 2013). Their computation involves a comparatively easier differentiable and unconstrained convex optimization problem, which can be solved using the Sinkhorn algorithm (Sinkhorn and Knopp, 1967). Unlike unregularized OT distances, entropy-regularized OT distances are also differentiable w.r.t. their inputs, enabling their use as a loss function in a machine learning pipeline (Frogner et al., 2015; Rolet et al., 2016).

Despite its considerable merits, however, entropy-regularized OT has some limitations, such as introducing blurring in the optimal transportation plan. While this nuisance can be reduced by using small regularization, doing so requires a carefully engineered implementation, since the naive Sinkhorn algorithm is numerically unstable in this regime (Schmitzer, 2016). More importantly, the entropy term keeps the transportation plan strictly positive and therefore completely dense, unlike unregularized OT. This lack of sparsity can be problematic when the optimal transportation plan itself is of interest, e.g., in color transfer (Pitié et al., 2007), domain adaptation (Courty et al., 2016) and ecological inference (Muzellec et al., 2017). Sparsity in these applications is motivated by the principle of parsimony (simple solutions should be preferred) and by the enhanced interpretability of transportation plans.

Our contributions. This background motivates us to study regularization schemes that lead to smooth optimization problems (i.e., differentiable everywhere and with Lipschitz continuous gradient) while retaining the desirable property of sparse transportation plans. To do so, we make the following contributions.
[Figure 1 panels, left to right: Unregularized (sparsity 94%), Smoothed semi-dual (ent.) (sparsity 0%), Smoothed semi-dual (sq. 2-norm) (sparsity 90%), Semi-relaxed primal (Eucl.) (sparsity 91%).]

Figure 1: Comparison of transportation plans obtained by different formulations on the application of color transfer. The top and right histograms represent the color distributions a ∈ Δ^m and b ∈ Δ^n of two images. For the sake of illustration, the number of colors is reduced to m = n = 32, using k-means clustering. Small squares indicate non-zero elements in the obtained transportation plan, denoted by T throughout this paper. The sparsity indicated below each graph is the percentage of zero elements in T. The weight of the elements of T indicates the extent to which colors from one image must be transferred to colors from the other image. Like unregularized OT (first from left), but unlike entropy-regularized OT (second from left), our squared 2-norm regularized OT (third from left) is able to produce sparse transportation plans. This is also the case of our relaxed primal (not shown) and semi-relaxed primal (fourth from left) formulations.

We regularize the primal with an arbitrary strongly convex term and derive the corresponding smoothed dual and semi-dual. Our derivations abstract away regularization-specific terms in an intuitive way (§3). We show how incorporating squared 2-norm and group-lasso regularizations within that framework leads to sparse solutions. This is illustrated in Figure 1 for squared 2-norm regularization.

Next, we explore the opposite direction: replacing one or both of the primal marginal constraints with approximate smooth constraints. When using the squared Euclidean distance to approximate the constraints, we show that this can be interpreted as adding squared 2-norm regularization to the dual (§4). As illustrated in Figure 1, that approach also produces sparse transportation plans.

For both directions, we bound the approximation error caused by regularizing the original OT problem. For the regularized primal, we show that the approximation error of squared 2-norm regularization can be smaller than that of entropic regularization (§5). Finally, we showcase the proposed approaches empirically on the task of color transfer (§6).

An open-source Python implementation is available at https://github.com/mblondel/smooth-ot.

Notation. We denote scalars, vectors and matrices using lower-case, bold lower-case and upper-case letters, e.g., t, t and T, respectively. Given a matrix T, we denote its elements by t_{i,j} and its columns by t_j. We denote the set {1, ..., m} by [m]. We use ‖·‖_p to denote the p-norm. When p = 2, we simply write ‖·‖. We denote the (m−1)-dimensional probability simplex by Δ^m := {y ∈ R^m_+ : ‖y‖₁ = 1} and the Euclidean projection onto it by P_{Δ^m}(x) := argmin_{y ∈ Δ^m} ‖y − x‖². We denote [x]_+ := max(x, 0), performed element-wise.

2 Background

Convex analysis. The convex conjugate of a function f : R^m → R ∪ {∞} is defined by

    f*(x) := sup_{y ∈ dom f} y^⊤ x − f(y).    (1)

If f is strictly convex, then the supremum in (1) is uniquely achieved. Then, from Danskin's theorem (1966), it is equal to the gradient of f*:

    ∇f*(x) = argmax_{y ∈ dom f} y^⊤ x − f(y).

The dual of a norm ‖·‖ is defined by ‖x‖_* := sup_{‖y‖ ≤ 1} y^⊤ x. We say that a function is γ-smooth w.r.t. a norm ‖·‖ if it is differentiable everywhere and its gradient is γ-Lipschitz continuous w.r.t. that norm. Strong convexity plays a crucial role in this paper due to its well-known duality with smoothness: f is γ-strongly convex w.r.t. a norm ‖·‖ if and only if f* is 1/γ-smooth w.r.t. ‖·‖_* (Kakade et al., 2012).

Optimal transport. We focus throughout this paper on OT between discrete probability distributions a ∈ Δ^m and b ∈ Δ^n. Rather than performing a pointwise comparison of the distributions, OT distances compute the minimal effort, according to some ground cost, for moving the probability mass of one distribution to the other. The modern OT formulation, due to
Kantorovich [1942], is cast as a linear program (LP):

    OT(a, b) := min_{T ∈ U(a,b)} ⟨T, C⟩,    (2)

where U(a, b) is the transportation polytope

    U(a, b) := {T ∈ R^{m×n}_+ : T 1_n = a, T^⊤ 1_m = b}

and C ∈ R^{m×n}_+ is a cost matrix. The former can be interpreted as the set of all joint probability distributions with marginals a and b. Without loss of generality, we will assume a > 0 and b > 0 throughout this paper (if a_i = 0 or b_j = 0, then the i-th row or the j-th column of T* is zero). When n = m and C is a distance matrix raised to the power p, OT(·, ·)^{1/p} is a distance on Δ^n, called the Wasserstein distance of order p (Villani, 2003, Theorem 7.3). The dual LP is

    OT(a, b) = max_{α, β ∈ P(C)} α^⊤ a + β^⊤ b,    (3)

where P(C) := {α ∈ R^m, β ∈ R^n : α_i + β_j ≤ c_{i,j}}. Keeping α fixed, an optimal solution w.r.t. β is

    β_j = min_{i ∈ [m]} c_{i,j} − α_i   ∀j ∈ [n],

which is the so-called c-transform. Plugging it back into the dual, we get the "semi-dual"

    OT(a, b) = max_{α ∈ R^m} α^⊤ a − Σ_{j=1}^n b_j max_{i ∈ [m]} (α_i − c_{i,j}).    (4)

For a recent and comprehensive survey of computational OT, see (Peyré and Cuturi, 2017).

3 Strong primal ↔ Relaxed dual

We study in this section adding strongly convex regularization to the primal problem (2). We define

Definition 1 Strongly convex primal

    OTΩ(a, b) := min_{T ∈ U(a,b)} ⟨T, C⟩ + Σ_{j=1}^n Ω(t_j),    (5)

where we assume that Ω is strongly convex over the intersection of dom Ω and either R^m_+ or Δ^m.

These assumptions are sufficient for (5) to be strongly convex w.r.t. T ∈ U(a, b). On first sight, solving (5) does not seem easier than (2). As we shall now see, the main benefit occurs when switching to the dual.

3.1 Smooth relaxed dual formulation

Let the (non-smooth) indicator function of the non-positive orthant be defined as

    δ(x) := 0 if x ≤ 0, ∞ otherwise  =  sup_{y ≥ 0} y^⊤ x.

To define a smoothed version of δ, we take the convex conjugate of Ω, restricted to the non-negative orthant:

    δΩ(x) := sup_{y ≥ 0} y^⊤ x − Ω(y).    (6)

If Ω is γ-strongly convex over R^m_+ ∩ dom Ω, then δΩ is 1/γ-smooth and its gradient is ∇δΩ(x) = y*, where y* is the supremum of (6). We next show that δΩ plays a crucial role in expressing the dual of (5), which is a smooth optimization problem in α and β.

Proposition 1 Smooth relaxed dual

    OTΩ(a, b) = max_{α ∈ R^m, β ∈ R^n} α^⊤ a + β^⊤ b − Σ_{j=1}^n δΩ(α + β_j 1_m − c_j).    (7)

The optimal solution T* of (5) can be recovered from (α*, β*) by t*_j = ∇δΩ(α* + β*_j 1_m − c_j) ∀j ∈ [n].

For a proof, see Appendix A.1. Intuitively, the hard dual constraints α_i + β_j − c_{i,j} ≤ 0 ∀i ∈ [m] ∀j ∈ [n], which we can write Σ_{j=1}^n δ(α + β_j 1_m − c_j), are now relaxed with soft ones by substituting δ with δΩ.

3.2 Smoothed semi-dual formulation

We now derive the semi-dual of (5), i.e., the dual (7) with one of the two variables eliminated. Without loss of generality, we proceed to eliminate β. To do so, we use the notion of smoothed max operator. Notice that

    max(x) := max_{i ∈ [m]} x_i = sup_{y ∈ Δ^m} y^⊤ x   ∀x ∈ R^m.

This is indeed true, since the supremum is always achieved at one of the simplex vertices. To define a smoothed max operator (Nesterov, 2005), we take the conjugate of Ω, this time restricted to the simplex:

    maxΩ(x) := sup_{y ∈ Δ^m} y^⊤ x − Ω(y).    (8)

If Ω is γ-strongly convex over Δ^m ∩ dom Ω, then maxΩ is 1/γ-smooth and its gradient is defined by ∇maxΩ(x) = y*, where y* is the supremum of (8). We next show that maxΩ plays a crucial role in expressing the conjugate of OTΩ.

Lemma 1 Conjugate of OTΩ w.r.t. its first argument

    OT*Ω(α, b) = sup_{a ∈ Δ^m} α^⊤ a − OTΩ(a, b) = Σ_{j=1}^n b_j maxΩj(α − c_j),    (9)

where Ωj(y) := (1/b_j) Ω(b_j y).

A proof is given in Appendix A.2. With the conjugate, we can now easily express the semi-dual of (5), which involves a smooth optimization problem in α.
Table 1: Closed forms for δΩ (used in smoothed dual), maxΩj (used in smoothed semi-dual) and their gradients.

Negative entropy, Ω(y) = γ Σ_{i=1}^m y_i log y_i:
    δΩ(x) = γ Σ_{i=1}^m e^{x_i/γ − 1};    ∇δΩ(x) = (e^{x_i/γ − 1})_{i=1}^m;
    maxΩj(x) = γ log Σ_{i=1}^m e^{x_i/γ} − γ log b_j;    ∇maxΩj(x) = e^{x/γ} / Σ_{i=1}^m e^{x_i/γ}.

Squared 2-norm, Ω(y) = (γ/2) ‖y‖²:
    δΩ(x) = (1/(2γ)) Σ_{i=1}^m [x_i]²_+;    ∇δΩ(x) = (1/γ) [x]_+;
    maxΩj(x) = x^⊤ y* − (γ b_j / 2) ‖y*‖²  with  y* = P_{Δ^m}(x / (γ b_j));    ∇maxΩj(x) = y*.

Group lasso, Ω(y) = γ (½ ‖y‖² + μ Σ_{G ∈ G} ‖y_G‖):
    δΩ(x) given in (11);    ∇δΩ(x) given in (12);    no closed form available for maxΩj.
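To make the closed forms in Table 1 concrete, here is a minimal NumPy sketch of δΩ, maxΩj and their gradients for negative entropy and the squared 2-norm. The function names and the sort-based simplex projection are illustrative choices, not the interface of the released package.

```python
import numpy as np
from scipy.special import logsumexp


def project_simplex(v):
    """Euclidean projection onto the probability simplex (Michelot, 1986)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


# Negative entropy: Omega(y) = gamma * sum_i y_i log y_i.
def delta_omega_ent(x, gamma):
    return gamma * np.sum(np.exp(x / gamma - 1.0))

def grad_delta_omega_ent(x, gamma):
    return np.exp(x / gamma - 1.0)

def max_omega_j_ent(x, gamma, b_j):
    return gamma * logsumexp(x / gamma) - gamma * np.log(b_j)

def grad_max_omega_j_ent(x, gamma, b_j):
    return np.exp(x / gamma - logsumexp(x / gamma))  # softmax(x / gamma)


# Squared 2-norm: Omega(y) = (gamma / 2) * ||y||^2.
def delta_omega_l2(x, gamma):
    return np.sum(np.maximum(x, 0.0) ** 2) / (2.0 * gamma)

def grad_delta_omega_l2(x, gamma):
    return np.maximum(x, 0.0) / gamma

def max_omega_j_l2(x, gamma, b_j):
    y_star = project_simplex(x / (gamma * b_j))
    return x @ y_star - 0.5 * gamma * b_j * (y_star @ y_star)

def grad_max_omega_j_l2(x, gamma, b_j):
    return project_simplex(x / (gamma * b_j))  # y* = P_simplex(x / (gamma * b_j))
```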

Proposition 2 Smoothed semi-dual

    OTΩ(a, b) = max_{α ∈ R^m} α^⊤ a − OT*Ω(α, b).    (10)

The optimal solution T* of (5) can be recovered from α* by t*_j = b_j ∇maxΩj(α* − c_j) ∀j ∈ [n].

Proof. OTΩ(a, b) is a closed and convex function of a. Therefore, OTΩ(a, b) = OT**Ω(a, b). □

We can interpret this semi-dual as a variant of (4), where the max operator has been replaced with its smoothed counterpart, maxΩj. Note that α*, as obtained by solving the smoothed dual (7) or semi-dual (10), is the gradient of OTΩ(a, b) w.r.t. a when α* is unique, or a sub-gradient otherwise. This is useful when learning with OTΩ as a loss, as done with entropic regularization in (Frogner et al., 2015).

Solving the optimization problems. The dual and semi-dual we derived are unconstrained, differentiable and concave optimization problems. They can therefore be solved using gradient-based algorithms, as long as we know how to compute ∇δΩ and ∇maxΩ. In our experiments, we use L-BFGS (Liu and Nocedal, 1989) for both the dual and semi-dual formulations. A minimal sketch of the semi-dual case is given below.
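As an illustration (not the reference implementation), the following sketch maximizes the smoothed semi-dual (10) with SciPy's L-BFGS for squared 2-norm regularization and recovers T from α*. Helper names are illustrative, and the sort-based simplex projection is one of several valid choices.

```python
import numpy as np
from scipy.optimize import minimize


def project_simplex(v):
    # Sort-based Euclidean projection onto the simplex, worst case O(m log m).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)


def semi_dual_l2(alpha, a, b, C, gamma):
    """Negated semi-dual objective (10) and its gradient, squared 2-norm case."""
    obj = alpha @ a
    grad = a.copy()
    for j in range(len(b)):
        x = alpha - C[:, j]
        y_star = project_simplex(x / (gamma * b[j]))      # = grad of max_{Omega_j}
        obj -= b[j] * (x @ y_star - 0.5 * gamma * b[j] * (y_star @ y_star))
        grad -= b[j] * y_star
    return -obj, -grad                                     # minimize the negative


def solve_semi_dual_l2(a, b, C, gamma):
    res = minimize(semi_dual_l2, np.zeros(len(a)), args=(a, b, C, gamma),
                   jac=True, method="L-BFGS-B")
    alpha = res.x
    # Recover T column by column: t_j = b_j * grad max_{Omega_j}(alpha - c_j).
    T = np.empty(C.shape)
    for j in range(len(b)):
        T[:, j] = b[j] * project_simplex((alpha - C[:, j]) / (gamma * b[j]))
    return alpha, T
```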
3.3 Closed-form expressions

We derive in this section closed-form expressions for δΩ, maxΩ and their gradients for specific choices of Ω.

Negative entropy. We choose Ω(y) = −γH(y), where H(y) := −Σ_i y_i log y_i is the entropy. For that choice, we get analytical expressions for δΩ, maxΩ and their gradients (cf. Table 1). Since Ω is γ-strongly convex w.r.t. the 1-norm over Δ^m (Shalev-Shwartz, 2007, Lemma 16), maxΩ is 1/γ-strongly smooth w.r.t. the ∞-norm. However, since Ω is only strictly convex over R^m_{>0}, δΩ is differentiable but not smooth. The dual and semi-dual with this Ω were derived in (Cuturi and Doucet, 2014) and (Genevay et al., 2016), respectively.

Next, we present two choices of Ω that induce sparsity in transportation plans. The resulting dual and semi-dual expressions are new, to our knowledge.

Squared 2-norm. We choose Ω(y) = (γ/2)‖y‖². We again obtain closed-form expressions for δΩ, maxΩ and their gradients (cf. Table 1). Since Ω is γ-strongly convex w.r.t. the 2-norm over R^m, both δΩ and maxΩ are 1/γ-strongly smooth w.r.t. the 2-norm. Projecting a vector onto the simplex, as required to compute maxΩ and its gradient, can be done exactly in worst-case O(m log m) time using the algorithm of (Michelot, 1986) and in expected O(m) time using the randomized pivot algorithm of (Duchi et al., 2008). Squared 2-norm regularization can output exactly sparse transportation plans (the primal-dual relationship for (7) is t*_{i,j} = (1/γ)[α*_i + β*_j − c_{i,j}]_+) and is numerically stable without any particular implementation trick.

Group lasso. Courty et al. (2016) recently proposed to use Ω(y) = γ(Σ_i y_i log y_i + μ Σ_{G ∈ G} ‖y_G‖), where y_G denotes the subvector of y restricted to the set G, and showed that this regularization improves accuracy in the context of domain adaptation. Since Ω includes a negative entropy term, the same remarks as for negative entropy apply regarding the differentiability of δΩ and the smoothness of maxΩ. Unfortunately, a closed-form solution is available for neither (6) nor (8). However, since the log keeps y in the strictly positive orthant and ‖y_G‖ is differentiable everywhere in that orthant, we can use any proximal gradient algorithm to solve these problems to arbitrary precision.

A drawback of this choice of Ω, however, is that group sparsity is never truly achieved. To address this issue, we propose to use Ω(y) = γ(½‖y‖² + μ Σ_{G ∈ G} ‖y_G‖) instead. For that choice, δΩ is smooth and is equal to

    δΩ(x) = x^⊤ y* − Ω(y*),    (11)

where y* decomposes over groups G ∈ G and equals

    y*_G = argmin_{y_G ≥ 0} ½ ‖y_G − x_G/γ‖² + μ ‖y_G‖ = ∇δΩ(x)_G.    (12)
As noted in the context of group-sparse NMF (Kim et al., 2012), (12) admits a closed-form solution

    y*_G = argmin_{y_G} ½ ‖y_G − x_{+,G}‖² + μ ‖y_G‖ = [1 − μ / ‖x_{+,G}‖]_+ x_{+,G},

where we defined x_+ := (1/γ)[x]_+. We have thus obtained an efficient way to compute exact gradients of δΩ, making it possible to solve the dual using gradient-based algorithms. In contrast, Courty et al. (2016) use a generalized conditional gradient algorithm whose iterations require expensive calls to Sinkhorn. Finally, because t*_j = ∇δΩ(α* + β*_j 1_m − c_j) ∀j ∈ [n], the obtained transportation plan will be truly group-sparse.

4 Relaxed primal ↔ Strong dual

We now explore the opposite way to define smooth OT problems while retaining sparse transportation plans: replace the marginal constraints in the primal with approximate constraints. When relaxing both marginal constraints, we define the next formulation:

Definition 2 Relaxed smooth primal

    ROTΦ(a, b) := min_{T ≥ 0} ⟨T, C⟩ + ½ Φ(T 1_n, a) + ½ Φ(T^⊤ 1_m, b),    (13)

where Φ(x, y) is a smooth divergence measure.

We may also relax only one of the marginal constraints:

Definition 3 Semi-relaxed smooth primal

    RÕTΦ(a, b) := min_{T ≥ 0, T^⊤ 1_m = b} ⟨T, C⟩ + Φ(T 1_n, a),    (14)

where Φ(x, y) is defined as in Definition 2.

For both (13) and (14), the transportation plans will typically be sparse. As discussed in more detail in §7, these formulations are similar to (Frogner et al., 2015; Chizat et al., 2016), with the key difference that we do not regularize T with an entropic term. In addition, for Φ, we propose to use Φ(x, y) = (1/(2γ))‖x − y‖², which is 1/γ-smooth, while these works use a generalized Kullback-Leibler (KL) divergence, which is not smooth. Relaxing the marginal constraints is useful when normalizing input measures to unit mass is not suitable (Gramfort et al., 2015) or to allow for only partial displacement of mass. Relaxing only one of the two constraints is useful in color transfer (Rabin et al., 2014), where we would like all the probability mass of the source image to be accounted for, but not necessarily that of the reference image.

Dual interpretation. As we show in Appendix A.3, in the case Φ(x, y) = (1/(2γ))‖x − y‖², the dual of (13) can be interpreted as the original dual with additional squared 2-norm regularization on the dual variables α and β. For the dual of (14), the additional regularization is on α only (on the original dual or, equivalently, on the original semi-dual). For that choice of Φ, the duals of (13) and (14) are strongly convex. The dual formulations are crucial to derive our bounds in §5.

Solving the optimization problems. While the relaxed and semi-relaxed primals (13) and (14) are still constrained problems, it is much easier to project on their constraint domain than on U(a, b). For the relaxed primal, in our experiments we use L-BFGS-B, a variant of L-BFGS suitable for box-constrained problems (Byrd et al., 1995). For the semi-relaxed primal, we use FISTA (Beck and Teboulle, 2009). Since the constraint domain of (14) has the structure of a Cartesian product b₁Δ^m × ··· × b_nΔ^m, we can easily project any T on it by column-wise projection onto the (scaled) simplex. Although not explored in this paper, the block Frank-Wolfe algorithm (Lacoste-Julien et al., 2012) is also a good fit for the semi-relaxed primal. A sketch of such a projection-based solver is given below.
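To make the column-wise projection concrete, here is a minimal sketch of a solver for the semi-relaxed primal (14) with Φ(x, y) = (1/(2γ))‖x − y‖². For simplicity, it uses plain projected gradient with a conservative step size rather than the FISTA variant used in our experiments; helper names are illustrative.

```python
import numpy as np


def project_simplex_scaled(v, scale):
    # Euclidean projection of v onto {y >= 0 : sum(y) = scale}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - scale)[0][-1]
    theta = (css[rho] - scale) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def semi_relaxed_primal(C, a, b, gamma, n_iter=500):
    m, n = C.shape
    T = np.outer(a, b)                 # feasible start: column j sums to b_j
    step = gamma / n                   # 1/L with L <= n/gamma (smoothness of the quadratic term)
    for _ in range(n_iter):
        residual = T.sum(axis=1) - a   # T 1_n - a
        grad = C + np.outer(residual, np.ones(n)) / gamma
        T = T - step * grad
        for j in range(n):             # project each column onto b_j * simplex
            T[:, j] = project_simplex_scaled(T[:, j], b[j])
    value = np.sum(T * C) + 0.5 * np.linalg.norm(T.sum(axis=1) - a) ** 2 / gamma
    return T, value
```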
5 Theoretical bounds

Convergence rates. The dual (7) is not smooth in α and β when using entropic regularization, but it is when using the squared 2-norm, with constant upper-bounded by n/γ w.r.t. α and m/γ w.r.t. β. The semi-dual (10) is smooth for both regularizations, with the same constant of 1/γ, albeit not in the same norm. The relaxed and semi-relaxed primals (13) and (14) are both 1/γ-smooth when using Φ(x, y) = (1/(2γ))‖x − y‖². However, none of these problems are strongly convex. From the standard convergence analysis of (projected) gradient descent for smooth but non-strongly convex problems, the number of iterations to reach an ε-accurate solution w.r.t. the smoothed problems is O(1/(εγ)), or O(1/√(εγ)) with Nesterov acceleration.

Approximation error. Because the smoothed problems approach unregularized OT as γ → 0, there is a trade-off between the convergence rate w.r.t. the smoothed problem and the approximation error w.r.t. unregularized OT. A question is then which smoothed formulations and which regularizations have better approximation error. Our first theorem bounds OTΩ − OT in the case of entropic and squared 2-norm regularization.

Theorem 1 Approximation error of OTΩ

Let a ∈ Δ^m and b ∈ Δ^n. Then,

    γL ≤ OTΩ(a, b) − OT(a, b) ≤ γU,

where we defined L and U as follows.

    Negative entropy:  L = −H(a) − H(b);   U = −max{H(a), H(b)}.
    Squared 2-norm:    L = ½ Σ_{i,j=1}^{m,n} (a_i/n + b_j/m − 1/(mn))²;   U = ½ min{‖a‖², ‖b‖²}.
[Figure 2 panels: Source; Unregularized (sparsity 99%); Smoothed semi-dual (l22) (sparsity 98%); Semi-relaxed primal (Eucl.) (sparsity 99%); Reference; Smoothed semi-dual (ent.) (sparsity 0%); Relaxed primal (Eucl.) (sparsity 99%); Semi-relaxed primal (KL) (sparsity 96%).]

Figure 2: Result comparison for different formulations on the task of color transfer. For regularized formulations, we solve the optimization problem with γ ∈ {10⁻⁴, 10⁻², ..., 10⁴} and choose the most visually pleasing result. The sparsity indicated below each image is the percentage of zero elements in the optimal transportation plan.

Proof is given in Appendix A.4. Our result suggests that, for the same γ, the approximation error can often be smaller with squared 2-norm than with entropic regularization. In particular, this is true whenever min{H(a), H(b)} > ½ min{‖a‖², ‖b‖²}, which is often the case in practice since 0 ≤ min{H(a), H(b)} ≤ min{log m, log n} while 0 ≤ ½ min{‖a‖², ‖b‖²} ≤ ½. Our second theorem bounds OT − ROTΦ and OT − RÕTΦ when Φ is the squared Euclidean distance.

Theorem 2 Approximation error of ROTΦ, RÕTΦ

Let a ∈ Δ^m, b ∈ Δ^n and Φ(x, y) = (1/(2γ))‖x − y‖². Then,

    0 ≤ OT(a, b) − ROTΦ(a, b) ≤ γL
    0 ≤ OT(a, b) − RÕTΦ(a, b) ≤ γL̃,

where we defined

    L := ‖C‖²_∞ min{ν₁ + n, ν₂ + m}²
    ν₁ := max{(2 + n/m) ‖a⁻¹‖_∞, ‖b⁻¹‖_∞}
    ν₂ := max{‖a⁻¹‖_∞, (2 + m/n) ‖b⁻¹‖_∞}
    L̃ := 2 ‖C‖²_∞ ‖a⁻¹‖²_∞.

Proof is given in Appendix A.5. While the bound for RÕTΦ is better than that of ROTΦ, both are worse than that of OTΩ, suggesting that the smoothed dual formulations are the way to go when low approximation error w.r.t. unregularized OT is important.

6 Experimental results

We showcase our formulations on color transfer, which is a classical OT application (Pitié et al., 2007). More experimental results are presented in Appendix C.

6.1 Application to color transfer

Experimental setup. Given an image of size u × v, we represent its pixels in RGB color space. We apply k-means clustering to quantize the image down to m colors. This produces m color centroids x₁, ..., x_m ∈ R³. We can count how many pixels were assigned to each centroid, and normalizing by uv gives us a color histogram a ∈ Δ^m. We repeat the same process with a second image to obtain y₁, ..., y_n ∈ R³ and b ∈ Δ^n. Next, we apply any of the proposed methods with cost matrix c_{i,j} = d(x_i, y_j), where d is some discrepancy measure, to obtain a (possibly relaxed) transportation plan T ∈ R^{m×n}_+. For each color centroid x_i, we apply a barycentric projection to obtain a new color centroid

    x̂_i := argmin_{x ∈ R³} Σ_{j=1}^n t_{i,j} d(x, y_j).

When d(x, y) = ‖x − y‖², as used in our experiments, the above admits a closed-form solution: x̂_i = Σ_{j=1}^n t_{i,j} y_j / Σ_{j=1}^n t_{i,j}. Finally, we use the new color x̂_i for all pixels assigned to x_i. The same process can be performed with respect to the y_j, in order to transfer the colors in the other direction. We use two public domain images, "fall foliage" by Bernard Spragg and "comunion" by Abel Maestro Garcia, and reduce the number of colors to m = n = 4096. We compare smoothed dual approaches and (semi-)relaxed primal approaches. For the semi-relaxed primal, we also compared with Φ(x, y) = (1/γ) KL(x‖y), where KL(x‖y) is the generalized KL divergence, x^⊤ log(x/y) − x^⊤ 1 + y^⊤ 1. This choice is differentiable but not smooth. We ran the aforementioned solvers for up to 1000 epochs.

[Figure 3 panels: dual objective value vs. CPU time (seconds) for γ = 10⁻³, γ = 10⁻¹ and γ = 10; curves: Alt. min. (dual), L-BFGS (dual), L-BFGS (semi-dual).]

Figure 3: Solver comparison for the smoothed dual and semi-dual, with squared 2-norm regularization. With γ = 10, which was also the best value selected in Figure 2, the maximum is reached in less than 4 minutes.
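For concreteness, the following sketch assembles the pipeline of §6.1 with scikit-learn and NumPy: quantize both images by k-means, build the squared Euclidean cost matrix, and recolor the source by barycentric projection. The argument solve_plan stands in for any of the proposed formulations (e.g., the semi-dual sketch of §3) and is a placeholder, not part of the released package.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist


def quantize(pixels, n_colors=4096, seed=0):
    """pixels: (u*v, 3) RGB array. Returns centroids, labels and histogram."""
    km = KMeans(n_clusters=n_colors, random_state=seed).fit(pixels)
    counts = np.bincount(km.labels_, minlength=n_colors)
    return km.cluster_centers_, km.labels_, counts / counts.sum()


def color_transfer(src_pixels, ref_pixels, solve_plan, n_colors=4096):
    X, labels, a = quantize(src_pixels, n_colors)   # source centroids, histogram a
    Y, _, b = quantize(ref_pixels, n_colors)        # reference centroids, histogram b
    C = cdist(X, Y, metric="sqeuclidean")           # c_ij = ||x_i - y_j||^2
    T = solve_plan(a, b, C)                         # any (possibly relaxed) plan, shape (m, n)
    # Barycentric projection: x_hat_i = sum_j t_ij y_j / sum_j t_ij.
    row_sums = T.sum(axis=1, keepdims=True)
    X_hat = np.where(row_sums > 0, T @ Y / np.maximum(row_sums, 1e-12), X)
    return X_hat[labels]                            # recolored source pixels
```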

Results. Our results are presented in Figure 2. All formulations clearly produced better results than unregularized OT. With the exception of the entropy-smoothed semi-dual formulation, all formulations produced extremely sparse transportation plans. The semi-relaxed primal formulation with Φ set to the squared Euclidean distance was the only one to produce colors with a darker tone.

6.2 Solver and objective comparison

We compared the smoothed dual and semi-dual when using squared 2-norm regularization. In addition to L-BFGS on both objectives, we also compared with alternating minimization in the dual. As we show in Appendix B, exact block minimization w.r.t. α and β can be carried out by projection onto the simplex.

Results. We ran the comparison using the same data as in §6.1. Results are indicated in Figure 3. When the problem is loosely regularized, we made two key findings: i) L-BFGS converges much faster in the semi-dual than in the dual, ii) alternating minimization converges extremely slowly. The reason for i) could be the better smoothness constant of the semi-dual (cf. §5). Since alternating minimization and the semi-dual have roughly the same cost per iteration (cf. Appendix B), the reason for ii) is not iteration cost but a convergence issue of alternating minimization. When using larger regularization, L-BFGS appears to converge slightly faster on the dual than on the semi-dual, likely thanks to its cheap-to-compute gradients.

6.3 Approximation error comparison

We compared empirically the approximation error of smoothed formulations w.r.t. unregularized OT according to four criteria: transportation plan error, marginal constraint error, value error and regularized value error (cf. Figure 4 for a precise definition). For the dual approaches, we solved the smoothed semi-dual objective (10), since, as we discussed in §5, it has the same smoothness constant of 1/γ for both entropic and squared 2-norm regularizations, implying similar convergence rates in theory. In addition, in the case of entropic regularization, the expressions of maxΩ and ∇maxΩ are trivial to stabilize numerically using standard log-sum-exp implementation tricks.

Results. We ran the comparison using the same data as in §6.1. Results are indicated in Figure 4. For the transportation plan error and the (regularized) value error, entropic regularization required a roughly 100 times smaller γ to achieve the same error. This confirms, as suggested by Theorem 1, that squared 2-norm regularization is typically tighter. Unsurprisingly, the semi-relaxed primal was tighter than the relaxed primal in all four criteria. A runtime comparison of smoothed formulations is also important. However, a rigorous comparison would require carefully engineered implementations and is therefore left for future work.

7 Related work

Regularized OT. Problems similar to (5) for general Ω were considered in (Dessein et al., 2016). Their work focuses on strictly convex and differentiable Ω for which there exists an associated Bregman divergence. Following (Benamou et al., 2015), they show that (5) can then be reformulated as a Bregman projection onto the transportation polytope and solved using Dykstra's algorithm [1985]. While Dykstra's algorithm can be interpreted implicitly as a two-block alternating minimization scheme on the dual problem, neither the dual nor the semi-dual expressions were derived. These expressions allow us to make use of arbitrary solvers, including quasi-Newton ones like L-BFGS, which, as we showed empirically, converge much faster on loosely regularized problems. Our framework can also accommodate non-differentiable regularizations for which there does not exist an associated Bregman divergence, such as those that include a group lasso term.
[Figure 4 panels: transportation plan error, marginal constraint error, value error and regularized value error (log scale), each plotted against the regularization parameter γ ∈ [10⁻³, 10³], for the smoothed semi-dual (ent.), smoothed semi-dual (l22), relaxed primal (Eucl.) and semi-relaxed primal (Eucl.).]

Figure 4: Approximation error w.r.t. unregularized OT empirically achieved by different smoothed formulations on the task of color transfer. Let T* be an optimal solution of the unregularized LP (2) and T*_γ be an optimal solution of one of the smoothed formulations with regularization parameter γ. The transportation plan error is ‖T*_γ − T*‖/‖T*‖. The marginal constraint error is ‖T*_γ 1_n − a‖ + ‖(T*_γ)^⊤ 1_m − b‖. The value error is |⟨T*_γ, C⟩ − ⟨T*, C⟩|/⟨T*, C⟩. The regularized value error is |v − ⟨T*, C⟩|, where v is one of OTΩ(a, b), ROTΦ(a, b) and RÕTΦ(a, b). For the regularized value error, our empirical findings confirm what Theorem 1 suggested, namely that, for the same value of γ, squared 2-norm regularization is considerably tighter than entropic regularization.
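As a companion to the definitions in the caption above, a small helper (with illustrative names) computing the four criteria, where ‖·‖ is taken to be the Frobenius norm for matrices:

```python
import numpy as np


def approximation_errors(T_gamma, T_star, C, a, b, v_reg):
    # Criteria from the Figure 4 caption: T_star solves the unregularized LP (2),
    # T_gamma a smoothed formulation, v_reg its (regularized) optimal value.
    plan_err = np.linalg.norm(T_gamma - T_star) / np.linalg.norm(T_star)
    marginal_err = (np.linalg.norm(T_gamma.sum(axis=1) - a)
                    + np.linalg.norm(T_gamma.sum(axis=0) - b))
    value_star = np.sum(T_star * C)
    value_err = abs(np.sum(T_gamma * C) - value_star) / value_star
    reg_value_err = abs(v_reg - value_star)
    return plan_err, marginal_err, value_err, reg_value_err
```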

Squared 2-norm regularization was recently considered in (Li et al., 2016), as well as in (Essid and Solomon, 2017), but for a reformulation of the Wasserstein distance of order 1 as a min cost flow problem along the edges of a graph.

Relaxed OT. There has been a large number of proposals to extend OT to unbalanced positive measures. Static formulations with approximate marginal constraints based on the KL divergence have been proposed in (Frogner et al., 2015; Chizat et al., 2016). The main difference with our work is that these formulations include an additional entropic regularization on T. While this entropic term enables a Sinkhorn-like algorithm, it also prevents one from obtaining a sparse T and requires the tuning of an additional hyper-parameter. Relaxing only one of the two marginal constraints with an inequality was investigated for color transfer in (Rabin et al., 2014). Benamou (2003) considered an interpolation between OT and squared Euclidean distances:

    min_{x ∈ Δ^m} OT(x, b) + (1/(2γ)) ‖x − a‖².    (15)

While on first sight this looks quite different, it is in fact equivalent to our semi-relaxed primal formulation when Φ(x, y) = (1/(2γ))‖x − y‖², since (15) is equal to

    min_{x ∈ Δ^m} min_{T ≥ 0, T 1_n = x, T^⊤ 1_m = b} ⟨T, C⟩ + (1/(2γ)) ‖x − a‖²
    = min_{T ≥ 0, T^⊤ 1_m = b} ⟨T, C⟩ + (1/(2γ)) ‖T 1_n − a‖² = RÕTΦ(a, b).

However, the bounds in §5 are, to our knowledge, new. A similar formulation, but with a group-lasso penalty on T instead of (1/(2γ))‖T 1_n − a‖², was considered in the context of convex clustering (Carli et al., 2013).

Smoothed LPs. Smoothed linear programs have been investigated in other contexts. The two closest works to ours are (Meshi et al., 2015b) and (Meshi et al., 2015a), in which smoothed LP relaxations based on the squared 2-norm are proposed for maximum a-posteriori inference. One innovation we make compared to these works is to abstract away the regularization by introducing the δΩ and maxΩ functions.

8 Conclusion

We proposed in this paper to regularize both the primal and dual OT formulations with a strongly convex term, and showed that this corresponds to relaxing the dual and primal constraints with smooth approximations. There are several important avenues for future work. The conjugate expression (9) should be useful for barycenter computation (Cuturi and Peyré, 2016) or dictionary learning (Rolet et al., 2016) with squared 2-norm instead of entropic regularization. On the theoretical side, while we provided convergence guarantees w.r.t. the OT distance value as the regularization vanishes, which suggested the advantage of squared 2-norm regularization, it would also be important to study the convergence w.r.t. the transportation plan, as was done for entropic regularization by Cominetti and San Martín (1994). Finally, studying optimization algorithms that can cope with large-scale data is important. We believe SAGA (Defazio et al., 2014) is a good candidate since it is stochastic, supports proximity operators, is adaptive to non-strongly convex problems and can be parallelized (Leblond et al., 2017).
Acknowledgements

We thank Arthur Mensch and the anonymous reviewers for constructive comments.

References

Leonid Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227–229, 1942.
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proc. of NIPS, 2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proc. of ICML, volume 70, pages 214–223, 2017.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Jean-David Benamou. Numerical resolution of an unbalanced mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis, 37(5):851–868, 2003.
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Gilberto Calvillo and David Romero. On the closest point to the origin in transportation polytopes. Discrete Applied Mathematics, 210:88–102, 2016.
Francesca P. Carli, Lipeng Ning, and Tryphon T. Georgiou. Convex clustering via optimal mass transport. arXiv preprint arXiv:1307.5459, 2013.
Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced transport problems. arXiv preprint arXiv:1607.05816, 2016.
Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2006.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Proc. of NIPS, pages 2292–2300, 2013.
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proc. of ICML, pages 685–693, 2014.
Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. of NIPS, pages 1646–1654, 2014.
Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularized optimal transport and the rot mover's distance. arXiv preprint arXiv:1610.06447, 2016.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. of ICML, 2008.
Richard L. Dykstra. An iterative procedure for obtaining I-projections onto the intersection of convex sets. The Annals of Probability, pages 975–984, 1985.
Montacer Essid and Justin Solomon. Quadratically-regularized optimal transport on graphs. arXiv preprint arXiv:1704.08200, 2017.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In Proc. of NIPS, pages 2053–2061, 2015.
Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Proc. of NIPS, pages 3440–3448, 2016.
Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In Proc. of International Conference on Information Processing in Medical Imaging, pages 261–272, 2015.
Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.
Jingu Kim, Renato D. C. Monteiro, and Haesun Park. Group sparsity in nonnegative matrix factorization. In Proc. of SIAM International Conference on Data Mining, pages 851–862, 2012.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In Proc. of ICML, pages 957–966, 2015.
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. of ICML, 2012.
Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Proc. of AISTATS, volume 54, pages 46–54, 2017.
Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34:1879–1920, 2012.
Wuchen Li, Stanley Osher, and Wilfrid Gangbo. A fast algorithm for earth mover's distance based on optimal transport and L1 type regularization. arXiv preprint arXiv:1609.07092, 2016.
Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.
Ofer Meshi, Amir Globerson, and Tommi S. Jaakkola. Convergence rate analysis of MAP coordinate minimization algorithms. In Proc. of NIPS, pages 3014–3022, 2012.
Ofer Meshi, Mehrdad Mahdavi, and Alexander G. Schwing. Smooth and strong: MAP inference with linear convergence. In Proc. of NIPS, 2015a.
Ofer Meshi, Nathan Srebro, and Tamir Hazan. Efficient training of structured SVMs via soft constraints. In Proc. of AISTATS, pages 699–707, 2015b.
Christian Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis regularized optimal transport and ecological inference. In Proc. of AAAI, pages 2387–2393, 2017.
Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. 2017.
François Pitié, Anil C. Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.
Julien Rabin, Sira Ferradans, and Nicolas Papadakis. Adaptive color transfer with relaxed optimal transport. In Proc. of International Conference on Image Processing, pages 4852–4856, 2014.
Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In Proc. of AISTATS, pages 630–638, 2016.
David Romero. Easy transportation-like problems on K-dimensional arrays. Journal of Optimization Theory and Applications, 66:137–147, 1990.
Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.
Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Phys. Math., 144:144–153, 1931.
Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In Proc. of ICML, pages 306–314, 2014.
Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.

Appendix
A Proofs
A.1 Derivation of the smooth relaxed dual

Recall that

    OTΩ(a, b) = min_{T ∈ U(a,b)} Σ_{j=1}^n t_j^⊤ c_j + Ω(t_j).    (16)

We now add Lagrange multipliers for the two equality constraints but keep the constraint T ≥ 0 explicitly:

    OTΩ(a, b) = min_{T ≥ 0} max_{α ∈ R^m, β ∈ R^n} Σ_{j=1}^n [t_j^⊤ c_j + Ω(t_j)] + α^⊤(T 1_n − a) + β^⊤(T^⊤ 1_m − b).

Since (16) is a convex optimization problem with only linear equality and inequality constraints, Slater's conditions reduce to feasibility (Boyd and Vandenberghe, 2004, §5.2.3) and hence strong duality holds:

    OTΩ(a, b) = max_{α ∈ R^m, β ∈ R^n} min_{T ≥ 0} Σ_{j=1}^n [t_j^⊤ c_j + Ω(t_j)] + α^⊤(T 1_n − a) + β^⊤(T^⊤ 1_m − b)
              = max_{α ∈ R^m, β ∈ R^n} Σ_{j=1}^n min_{t_j ≥ 0} [t_j^⊤(c_j + α + β_j 1_m) + Ω(t_j)] − α^⊤ a − β^⊤ b
              = max_{α ∈ R^m, β ∈ R^n} − Σ_{j=1}^n max_{t_j ≥ 0} [t_j^⊤(−c_j − α − β_j 1_m) − Ω(t_j)] − α^⊤ a − β^⊤ b
              = max_{α ∈ R^m, β ∈ R^n} α^⊤ a + β^⊤ b − Σ_{j=1}^n max_{t_j ≥ 0} [t_j^⊤(α + β_j 1_m − c_j) − Ω(t_j)],

where the last equality follows from the change of variables (α, β) → (−α, −β), which leaves the maximization over R^m × R^n unchanged.
Finally, plugging the expression of (6) gives the claimed result.

A.2 Derivation of the convex conjugate

The convex conjugate of OTΩ(a, b) w.r.t. the first argument is

    OT*Ω(g, b) = sup_{a ∈ Δ^m} g^⊤ a − OTΩ(a, b).

Following a similar argument as (Cuturi and Peyré, 2016, Theorem 2.4), we have

    OT*Ω(g, b) = max_{T ≥ 0, T^⊤ 1_m = b} ⟨T, g 1_n^⊤ − C⟩ − Σ_{j=1}^n Ω(t_j).

Notice that this is an easier optimization problem than (5), since there are equality constraints only in one direction. Cuturi and Peyré (2016) showed that this optimization problem admits a closed form in the case of entropic regularization. Here, we show how to compute OT*Ω for any strongly convex regularization.

The problem clearly decomposes over columns and we can rewrite it as

    OT*Ω(g, b) = Σ_{j=1}^n max_{t_j ≥ 0, t_j^⊤ 1_m = b_j} t_j^⊤(g − c_j) − Ω(t_j)
               = Σ_{j=1}^n b_j max_{τ_j ∈ Δ^m} τ_j^⊤(g − c_j) − (1/b_j) Ω(b_j τ_j)
               = Σ_{j=1}^n b_j maxΩj(g − c_j),

where we defined Ωj(y) := (1/b_j) Ω(b_j y) and where maxΩ is defined in (8).

A.3 Expression of the strongly-convex duals

Using a similar derivation as before, we obtain the duals of (13) and (14).

Proposition 3 Duals of (13) and (14)

    ROTΦ(a, b) = max_{α, β ∈ P(C)} −½ Φ*(−2α, a) − ½ Φ*(−2β, b)

    RÕTΦ(a, b) = max_{α, β ∈ P(C)} −Φ*(−α, a) + β^⊤ b
               = max_{α ∈ R^m} −Φ*(−α, a) − Σ_{j=1}^n b_j max_{i ∈ [m]} (α_i − c_{i,j}),

where Φ* is the conjugate of Φ in the first argument.

The duals are strongly convex if Φ is smooth.

When Φ(x, y) = (1/(2γ))‖x − y‖², Φ*(−α, a) = (γ/2)‖α‖² − α^⊤ a. Plugging that expression in the above, we get

    ROTΦ(a, b) = max_{α, β ∈ P(C)} α^⊤ a + β^⊤ b − γ (‖α‖² + ‖β‖²)    (17)

and

    RÕTΦ(a, b) = max_{α, β ∈ P(C)} α^⊤ a + β^⊤ b − (γ/2) ‖α‖²
               = max_{α ∈ R^m} α^⊤ a − Σ_{j=1}^n b_j max_{i ∈ [m]} (α_i − c_{i,j}) − (γ/2) ‖α‖².

This corresponds to the original dual and semi-dual with squared 2-norm regularization on the variables.

A.4 Proof of Theorem 1

Before proving the theorem, we introduce the next two lemmas, which bound the regularization value achieved
by any transportation plan.

Lemma 2 Bounding the entropy of a transportation plan

Let H(a) := −Σ_i a_i log a_i and H(T) := −Σ_{i,j} t_{i,j} log t_{i,j} be the joint entropy. Let a ∈ Δ^m, b ∈ Δ^n and T ∈ U(a, b). Then,

    max{H(a), H(b)} ≤ H(T) ≤ H(a) + H(b).

Proof. See, for instance, (Cover and Thomas, 2006). □

Together with 0 ≤ H(a) ≤ log m and 0 ≤ H(b) ≤ log n, this provides lower and upper bounds for the entropy of a transportation plan. As noted in (Cuturi, 2013), the upper bound is tight since

    max_{T ∈ U(a,b)} H(T) = H(a b^⊤) = H(a) + H(b).

Lemma 3 Bounding the squared 2-norm of a transportation plan

Let a ∈ Δ^m, b ∈ Δ^n and T ∈ U(a, b). Then,

    Σ_{i=1}^m Σ_{j=1}^n (a_i/n + b_j/m − 1/(mn))² ≤ ‖T‖² ≤ min{‖a‖², ‖b‖²}.

Proof. The tightest lower bound is given by min_{T ∈ U(a,b)} ‖T‖². An exact iterative algorithm was proposed in (Calvillo and Romero, 2016) to solve this problem. However, since we are interested in an explicit formula, we consider instead the lower bound min_{T 1_n = a, T^⊤ 1_m = b} ‖T‖² (i.e., we ignore the non-negativity constraint). It is known (Romero, 1990) that this minimum is achieved at t_{i,j} = a_i/n + b_j/m − 1/(mn), hence our lower bound. For the upper bound, we have

    ‖T‖² = Σ_{i=1}^m Σ_{j=1}^n t_{i,j}²
         = Σ_{i=1}^m Σ_{j=1}^n a_i² (t_{i,j}/a_i)²
         = Σ_{i=1}^m a_i² Σ_{j=1}^n (t_{i,j}/a_i)²
         ≤ Σ_{i=1}^m a_i² Σ_{j=1}^n (t_{i,j}/a_i)
         = ‖a‖².

We can do the same with b ∈ Δ^n to obtain ‖T‖² ≤ ‖b‖², yielding the claimed result. □

Together with 0 ≤ ‖a‖² ≤ 1 and 0 ≤ ‖b‖² ≤ 1, this provides lower and upper bounds for the squared 2-norm of a transportation plan.
Proof of the theorem. Let T* and T*_Ω be optimal solutions of (2) and (5), respectively. Then,

    OT(a, b) + Ω(T*_Ω) = ⟨T*, C⟩ + Ω(T*_Ω) ≤ ⟨T*_Ω, C⟩ + Ω(T*_Ω) = OTΩ(a, b).

Likewise,

    OTΩ(a, b) = ⟨T*_Ω, C⟩ + Ω(T*_Ω) ≤ ⟨T*, C⟩ + Ω(T*) = OT(a, b) + Ω(T*).

Combining the two, we obtain

    OT(a, b) + Ω(T*_Ω) ≤ OTΩ(a, b) ≤ OT(a, b) + Ω(T*).

Using T*, T*_Ω ∈ U(a, b) together with Lemma 2 and Lemma 3 gives the claimed results.

A.5 Proof of Theorem 2

To prove the theorem, we first need the following two lemmas.

Lemma 4 Bounding the 1-norm of α and β for (α, β) ∈ P(C)

Let α, β ∈ P(C) with the extra constraints α^⊤ 1_m = 0 and α^⊤ a + β^⊤ b ≥ 0, where a ∈ Δ^m and b ∈ Δ^n. Then,

    0 ≤ ‖α‖₁ + ‖β‖₁ ≤ ‖C‖_∞ (ν + n),

where ν = max{(2 + n/m) ‖a⁻¹‖_∞, ‖b⁻¹‖_∞}.

Proof. The proof technique is inspired by (Meshi et al., 2012, Supplementary material, Lemma 1.2). The 1-norm can be rewritten as

    ‖α‖₁ + ‖β‖₁ = max_{r ∈ {−1,1}^m, s ∈ {−1,1}^n} r^⊤ α + s^⊤ β.

Our goal is to upper bound the following objective

    max_{α ∈ R^m, β ∈ R^n} r^⊤ α + s^⊤ β   s.t.   0 ≤ α^⊤ a + β^⊤ b,   α_i + β_j ≤ c_{i,j},   α^⊤ 1_m = 0,

with a constant that does not depend on r and s. We call the above the dual problem. Its Lagrangian is

    L(α, β, μ, ν, T) = r^⊤ α + s^⊤ β + μ α^⊤ 1_m + ν (α^⊤ a + β^⊤ b) + Σ_{i,j=1}^{m,n} t_{i,j} (c_{i,j} − α_i − β_j)
                     = (r + μ 1_m + ν a − T 1_n)^⊤ α + (s + ν b − T^⊤ 1_m)^⊤ β + ⟨T, C⟩,

with μ ∈ R, ν ≥ 0 and T ≥ 0. Maximizing the Lagrangian w.r.t. α and β gives the corresponding primal problem

    min_{T ≥ 0, μ ∈ R, ν ≥ 0} ⟨T, C⟩   s.t.   T 1_n = ν a + r + μ 1_m,   T^⊤ 1_m = ν b + s.

By weak duality, any feasible primal point provides an upper bound on the dual problem. We start by choosing μ = (1/m)(Σ_j s_j − Σ_i r_i) so that Σ_{i,j} t_{i,j} takes the same value under the two constraints. Next, we choose

    ν = max{ max_i (2 + n/m)/a_i, max_j 1/b_j },

which ensures the non-negativity of ν a + r + μ 1_m and ν b + s regardless of r and s. It follows that the transportation plan T defined by

    T = (1/((ν b + s)^⊤ 1_n)) (ν a + r + μ 1_m)(ν b + s)^⊤

is feasible. We finally bound the objective: ⟨T, C⟩ ≤ ‖C‖_∞ Σ_{i,j} t_{i,j} ≤ ‖C‖_∞ (ν + n). □

Lemma 5 Bounding the 1-norm of α for (α, ·) ∈ P(C)

Let α, β ∈ P(C) with the extra constraints Σ_{i=1}^m α_i = 0 and α^⊤ a + β^⊤ b ≥ 0, where a ∈ Δ^m and b ∈ Δ^n. Then,

    0 ≤ ‖α‖₁ ≤ 2 ‖C‖_∞ ‖a⁻¹‖_∞.

Proof. Similarly as before, our goal is to upper bound

    max_{α ∈ R^m, β ∈ R^n} r^⊤ α   s.t.   0 ≤ α^⊤ a + β^⊤ b,   α_i + β_j ≤ c_{i,j},   α^⊤ 1_m = 0,

with a constant which does not depend on r. The corresponding primal is

    min_{T ≥ 0, μ ∈ R, ν ≥ 0} ⟨T, C⟩   s.t.   T 1_n = ν a + r + μ 1_m,   T^⊤ 1_m = ν b.

By weak duality, any feasible primal point gives us an upper bound. We start by choosing μ = −(1/m) Σ_i r_i so that Σ_{i,j} t_{i,j} takes the same value under the two constraints. Next, we choose ν = max_i 2/a_i, which ensures the non-negativity of ν a + r + μ 1_m (ν b ≥ 0 is also satisfied since ν ≥ 0), which appears in the r.h.s. of the second constraint, independently of r. It follows that the transportation plan T defined by

    T = (1/(ν b^⊤ 1_n)) (ν a + r + μ 1_m)(ν b)^⊤ = (ν a + r + μ 1_m) b^⊤

is feasible. We finally bound the objective

    ⟨T, C⟩ ≤ ‖C‖_∞ Σ_{i,j} t_{i,j} ≤ ν ‖C‖_∞ = 2 ‖C‖_∞ ‖a⁻¹‖_∞,

which concludes the proof. □


Proof of the theorem. We begin by deriving the bound for the relaxed primal. Let (α*, β*) and (α*_Φ, β*_Φ) be optimal solutions of (3) and (17), respectively. Since (α*_Φ)^⊤ a + (β*_Φ)^⊤ b ≤ (α*)^⊤ a + (β*)^⊤ b, we have

    ROTΦ(a, b) ≤ OT(a, b) − (γ/2)(‖α*_Φ‖² + ‖β*_Φ‖²).

Likewise,

    OT(a, b) − (γ/2)(‖α*‖² + ‖β*‖²) ≤ ROTΦ(a, b).

Combining the two, we get

    OT(a, b) − (γ/2)(‖α*‖² + ‖β*‖²) ≤ ROTΦ(a, b) ≤ OT(a, b) − (γ/2)(‖α*_Φ‖² + ‖β*_Φ‖²).    (18)

Hence we need to bound variables α, β ∈ P(C). Since ‖·‖₂ ≤ ‖·‖₁, we can upper bound ‖α*‖₁ + ‖β*‖₁. In addition, we can always add the additional constraint that α^⊤ a + β^⊤ b ≥ 0^⊤ a + 0^⊤ b = 0, since (0, 0) is dual feasible for (3). Since for any optimal pair (α*, β*), the pair (α* − σ1, β* + σ1) is also feasible and optimal for any σ ∈ R, we can also add the constraint α^⊤ 1_m = 0. The obtained bound will obviously hold for any optimal pair (α*, β*). Hence, we can apply Lemma 4. By the same reasoning, but using the constraint β^⊤ 1_n = 0 in place of α^⊤ 1_m = 0, we can obtain a similar bound. By combining these two bounds, we obtain our final bound:

    ‖α‖₁ + ‖β‖₁ ≤ ‖C‖_∞ min{ν₁ + n, ν₂ + m},

where

    ν₁ = max{(2 + n/m) ‖a⁻¹‖_∞, ‖b⁻¹‖_∞}
    ν₂ = max{‖a⁻¹‖_∞, (2 + m/n) ‖b⁻¹‖_∞}.

Taking the square of this bound and plugging the result in (18) gives the claimed result. Applying the same reasoning with Lemma 5 gives the claimed result for the semi-relaxed primal.

B Alternating minimization with exact block updates

General case. Let β(α) be an optimal solution of (7) given α fixed, and similarly for α(β). From the first-order optimality conditions,

    ∇δΩ(α + β_j(α) 1_m − c_j)^⊤ 1_m = b_j   ∀j ∈ [n],    (19)

and similarly for α given β fixed. Solving these equations is non-trivial in general. However, because

    ∇δΩ(α + β_j(α) 1_m − c_j) = b_j ∇maxΩj(α − c_j)

holds ∀α ∈ R^m, j ∈ [n], we can retrieve β_j(α) if we know how to compute ∇maxΩ(x) and the inverse map (∇δΩ)⁻¹(y) exists. That map exists and equals ∇Ω(y) provided that Ω is differentiable and y > 0.
Entropic regularization. It is easy to verify that (19) is satisfied with

    β(α) = γ log( b / (K^⊤ e^{α/γ − 1_m}) ),   where K := e^{−C/γ} (element-wise),

and similarly for α(β). These updates recover the iterates of the Sinkhorn algorithm (Cuturi, 2013).
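A minimal log-domain sketch of these entropic block updates (the Sinkhorn iterates written on the dual potentials); the function name is illustrative:

```python
import numpy as np
from scipy.special import logsumexp


def sinkhorn_block_updates(C, a, b, gamma, n_iter=1000):
    # Two-block exact minimization of the smoothed dual (7) under entropic
    # regularization, written in the log domain for numerical stability.
    alpha = np.zeros(len(a))
    for _ in range(n_iter):
        beta = gamma * np.log(b) - gamma * logsumexp(
            (alpha[:, None] - C) / gamma - 1.0, axis=0)
        alpha = gamma * np.log(a) - gamma * logsumexp(
            (beta[None, :] - C) / gamma - 1.0, axis=1)
    # T recovered from grad delta_Omega: t_ij = exp((alpha_i + beta_j - c_ij)/gamma - 1).
    T = np.exp((alpha[:, None] + beta[None, :] - C) / gamma - 1.0)
    return alpha, beta, T
```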
Squared 2-norm regularization. Plugging the expression of ∇δΩ into (19), we get that β(α) must satisfy

    [α + β_j(α) 1_m − c_j]_+^⊤ 1_m = γ b_j   ∀j ∈ [n].

Close inspection shows that this is exactly the same optimality condition as the one the Euclidean projection onto the simplex, argmin_{y ∈ Δ^m} ‖y − x‖², must satisfy, with x = (α − c_j)/(γ b_j). Let x_[1] ≥ ··· ≥ x_[m] be the values of x in sorted order. Following (Michelot, 1986; Duchi et al., 2008), if we let

    ρ := max{ i ∈ [m] : x_[i] − (1/i)(Σ_{r=1}^i x_[r] − 1) > 0 },

then y* is exactly achieved at [x + (β_j(α)/(γ b_j)) 1_m]_+, where

    β_j(α) = −(γ b_j / ρ)(Σ_{r=1}^ρ x_[r] − 1).

The expression for α(β) is completely symmetrical. While a projection onto the simplex is required for each coordinate, as discussed in §3.3, this can be done in expected linear time. In addition, each coordinate-wise solution can be computed in parallel.
Alternating minimization. Once we know how to compute β(α) and α(β), there are a number of ways
we can build a proper algorithm to solve the smoothed dual. Perhaps the simplest is to alternate between
β ← β(α) and α ← α(β). For entropic regularization, this two-block coordinate descent (CD) scheme is known
as the Sinkhorn algorithm and was recently popularized in the context of optimal transport by Cuturi (2013).
A disadvantage of this approach, however, is that computational effort is spent updating coordinates that may
already be near-optimal. To address this issue, we can instead adopt a greedy CD scheme as recently proposed
for entropic regularization by Altschuler et al. (2017).
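Putting the squared 2-norm case together, here is a minimal sketch of the two-block alternating scheme in which each exact block update reduces to the simplex-projection threshold derived above; names and the simple sweep order are illustrative choices:

```python
import numpy as np


def simplex_threshold(x):
    # Returns theta such that sum([x - theta]_+) = 1 (Michelot, 1986; Duchi et al., 2008).
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > css - 1)[0][-1]
    return (css[rho] - 1.0) / (rho + 1.0)


def update_beta(alpha, C, b, gamma):
    # Exact block minimization w.r.t. beta: beta_j = -gamma * b_j * theta_j.
    beta = np.empty(len(b))
    for j in range(len(b)):
        x = (alpha - C[:, j]) / (gamma * b[j])
        beta[j] = -gamma * b[j] * simplex_threshold(x)
    return beta


def update_alpha(beta, C, a, gamma):
    # Symmetric exact block minimization w.r.t. alpha.
    alpha = np.empty(len(a))
    for i in range(len(a)):
        x = (beta - C[i, :]) / (gamma * a[i])
        alpha[i] = -gamma * a[i] * simplex_threshold(x)
    return alpha


def alternating_minimization_l2(C, a, b, gamma, n_iter=500):
    alpha = np.zeros(len(a))
    for _ in range(n_iter):
        beta = update_beta(alpha, C, b, gamma)
        alpha = update_alpha(beta, C, a, gamma)
    # Primal-dual relationship for (7): t_ij = [alpha_i + beta_j - c_ij]_+ / gamma.
    T = np.maximum(alpha[:, None] + beta[None, :] - C, 0.0) / gamma
    return alpha, beta, T
```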

C Additional experiments

We ran the same experiments as Figure 2 and Figure 3 on one more image pair: “Grafiti” by Jon Ander and
“Rainbow Bridge National Monument Utah”, by Bernard Spragg. Both images are in the public domain. The
results, presented in Figure 5 and Figure 6 below, confirm the empirical findings described in §6.1 and §6.2. The
images are available at https://github.com/mblondel/smooth-ot/tree/master/data.

[Figure 5 panels: dual objective value vs. CPU time (seconds) for γ = 10⁻³, γ = 10⁻¹ and γ = 10; curves: Alt. min. (dual), L-BFGS (dual), L-BFGS (semi-dual).]

Figure 5: Same experiment as Figure 3 on one more image pair.

[Figure 6 panels: Source; Unregularized (sparsity 99%); Smoothed semi-dual (l22) (sparsity 98%); Semi-relaxed primal (Eucl.) (sparsity 99%); Reference; Smoothed semi-dual (ent.) (sparsity 0%); Relaxed primal (Eucl.) (sparsity 99%); Semi-relaxed primal (KL) (sparsity 92%).]

Figure 6: Same experiment as Figure 2 on one more image pair.
