Smooth and Sparse Optimal Transport

Mathieu Blondel, Vivien Seguy, Antoine Rolet
[Figure 1 panels, left to right: Unregularized (sparsity: 94%); Smoothed semi-dual, entropic (sparsity: 0%); Smoothed semi-dual, squared 2-norm (sparsity: 90%); Semi-relaxed primal, Euclidean (sparsity: 91%).]

Figure 1: Comparison of transportation plans obtained by different formulations on the application of color transfer. The top and right histograms represent the color distributions a ∈ △^m and b ∈ △^n of two images. For the sake of illustration, the number of colors is reduced to m = n = 32 using k-means clustering. Small squares indicate non-zero elements in the obtained transportation plan, denoted by T throughout this paper. The sparsity indicated below each graph is the percentage of zero elements in T. The weight of the elements of T indicates the extent to which colors from one image must be transferred to colors from the other image. Like unregularized OT (first from left), but unlike entropy-regularized OT (second from left), our squared 2-norm regularized OT (third from left) is able to produce sparse transportation plans. This is also the case for our relaxed primal (not shown) and semi-relaxed primal (fourth from left) formulations.
dual and semi-dual. Our derivations abstract away regularization-specific terms in an intuitive way (§3). We show how incorporating squared 2-norm and group-lasso regularizations within that framework leads to sparse solutions. This is illustrated in Figure 1 for squared 2-norm regularization.

Next, we explore the opposite direction: replacing one or both of the primal marginal constraints with approximate smooth constraints. When using the squared Euclidean distance to approximate the constraints, we show that this can be interpreted as adding squared 2-norm regularization to the dual (§4). As illustrated in Figure 1, that approach also produces sparse transportation plans.

For both directions, we bound the approximation error caused by regularizing the original OT problem. For the regularized primal, we show that the approximation error of squared 2-norm regularization can be smaller than that of entropic regularization (§5). Finally, we showcase the proposed approaches empirically on the task of color transfer (§6).

An open-source Python implementation is available at https://github.com/mblondel/smooth-ot.

Notation. We denote scalars, vectors and matrices using lower-case, bold lower-case and upper-case letters, e.g., t, t and T, respectively. Given a matrix T, we denote its elements by t_{i,j} and its columns by t_j. We denote the set {1, ..., m} by [m]. We use ‖·‖_p to denote the p-norm; when p = 2, we simply write ‖·‖. We denote the (m−1)-dimensional probability simplex by △^m := {y ∈ R^m_+ : ‖y‖_1 = 1} and the Euclidean projection onto it by P_{△^m}(x) := argmin_{y ∈ △^m} ‖y − x‖^2. We denote [x]_+ := max(x, 0), performed element-wise.

2 Background

Convex analysis. The convex conjugate of a function f : R^m → R ∪ {∞} is defined by

    f^*(x) := sup_{y ∈ dom f} y^⊤x − f(y).    (1)

If f is strictly convex, then the supremum in (1) is uniquely achieved. Then, by Danskin's theorem (1966), it is equal to the gradient of f^*:

    ∇f^*(x) = argmax_{y ∈ dom f} y^⊤x − f(y).

The dual of a norm ‖·‖ is defined by ‖x‖_* := sup_{‖y‖ ≤ 1} y^⊤x. We say that a function is γ-smooth w.r.t. a norm ‖·‖ if it is differentiable everywhere and its gradient is γ-Lipschitz continuous w.r.t. that norm. Strong convexity plays a crucial role in this paper due to its well-known duality with smoothness: f is γ-strongly convex w.r.t. a norm ‖·‖ if and only if f^* is (1/γ)-smooth w.r.t. ‖·‖_* (Kakade et al., 2012).

Optimal transport. We focus throughout this paper on OT between discrete probability distributions a ∈ △^m and b ∈ △^n. Rather than performing a pointwise comparison of the distributions, OT distances compute the minimal effort, according to some ground cost, for moving the probability mass of one distribution to the other. The modern OT formulation, due to Kantorovich (1942), is cast as a linear program (LP):

    OT(a, b) := min_{T ∈ U(a,b)} ⟨T, C⟩,    (2)

where U(a, b) is the transportation polytope.

To define a smoothed version of δ, we take the convex conjugate of Ω, restricted to the non-negative orthant:

    δ_Ω(x) := sup_{y ≥ 0} y^⊤x − Ω(y).    (6)
Table 1: Closed forms for δΩ (used in smoothed dual), maxΩj (used in smoothed semi-dual) and their gradients.
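The closed forms in Table 1 are easy to sanity-check numerically. Below is a minimal sketch for the squared 2-norm row, assuming Ω(y) = (γ/2)‖y‖², for which δ_Ω(x) = ‖[x]_+‖²/(2γ) and ∇δ_Ω(x) = (1/γ)[x]_+; the function names are ours, not those of the reference implementation.

```python
import numpy as np

def delta_omega(x, gamma):
    """delta_Omega(x) = sup_{y >= 0} y.x - (gamma/2)||y||^2 = ||[x]_+||^2 / (2*gamma)."""
    xp = np.maximum(x, 0.0)
    return xp @ xp / (2.0 * gamma)

def grad_delta_omega(x, gamma):
    """Gradient: the maximizer itself, (1/gamma)[x]_+ (by Danskin's theorem)."""
    return np.maximum(x, 0.0) / gamma

# Sanity check: the closed form matches the supremum evaluated at the maximizer.
rng = np.random.default_rng(0)
x, gamma = rng.normal(size=6), 0.5
y_star = grad_delta_omega(x, gamma)
val = y_star @ x - 0.5 * gamma * y_star @ y_star
assert np.isclose(val, delta_omega(x, gamma))
```

Because the supremum is reached at a sparse point, δ_Ω and its gradient cost O(m) per evaluation, which is what makes gradient-based dual solvers practical.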
where we defined x_+ := (1/γ)[x]_+. We have thus obtained an efficient way to compute exact gradients of δ_Ω, making it possible to solve the dual using gradient-based algorithms. In contrast, Courty et al. (2016) use a generalized conditional gradient algorithm whose iterations require expensive calls to Sinkhorn. Finally, because t*_j = ∇δ_Ω(α* + β*_j 1_m − c_j) ∀j ∈ [n], the obtained transportation plan will be truly group-sparse.

4 Relaxed primal ↔ Strong dual

We now explore the opposite way to define smooth OT problems while retaining sparse transportation plans: replace marginal constraints in the primal with approximate constraints. When relaxing both marginal constraints, we define the next formulation:

duals of (13) and (14) are strongly convex. The dual formulations are crucial to derive our bounds in §5.

Solving the optimization problems. While the relaxed and semi-relaxed primals (13) and (14) are still constrained problems, it is much easier to project on their constraint domain than on U(a, b). For the relaxed primal, in our experiments we use L-BFGS-B, a variant of L-BFGS suitable for box-constrained problems (Byrd et al., 1995). For the semi-relaxed primal, we use FISTA (Beck and Teboulle, 2009). Since the constraint domain of (14) has the structure of a Cartesian product b_1△^m × ··· × b_n△^m, we can easily project any T on it by column-wise projection on the (scaled) simplex. Although not explored in this paper, the block Frank-Wolfe algorithm (Lacoste-Julien et al., 2012) is also a good fit for the semi-relaxed primal.
Proof is given in Appendix A.4. Our result suggests that, for the same γ, the approximation error can often be smaller with squared 2-norm than with entropic regularization. In particular, this is true whenever min{H(a), H(b)} > (1/2) min{‖a‖^2, ‖b‖^2}, which is often the case in practice since 0 ≤ min{H(a), H(b)} ≤ min{log m, log n} while 0 ≤ (1/2) min{‖a‖^2, ‖b‖^2} ≤ 1/2.

Our second theorem bounds OT − ROT_Φ and OT − R̃OT_Φ when Φ is the squared Euclidean distance.

Theorem 2 (Approximation error of ROT_Φ and R̃OT_Φ). Let a ∈ △^m, b ∈ △^n and Φ(x, y) = (1/2γ)‖x − y‖^2. Then

    0 ≤ OT(a, b) − ROT_Φ(a, b) ≤ γL
    0 ≤ OT(a, b) − R̃OT_Φ(a, b) ≤ γL̃,

where we defined L := ‖C‖^2_∞ min{ν_1 + n, ν_2 + m}^2.

Proof is given in Appendix A.5. While the bound for R̃OT_Φ is better than that of ROT_Φ, both are worse than that of OT_Ω, suggesting that the smoothed dual formulations are the way to go when low approximation error w.r.t. unregularized OT is important.

6 Experimental results

We showcase our formulations on color transfer, which is a classical OT application (Pitié et al., 2007). More experimental results are presented in Appendix C.

6.1 Application to color transfer

Experimental setup. Given an image of size u × v, we represent its pixels in RGB color space. We apply k-means clustering to quantize the image down to m colors. This produces m color centroids x_1, ..., x_m ∈ R^3. Counting how many pixels were assigned to each centroid and normalizing by uv gives us a color histogram a ∈ △^m. We repeat the same process with a second image to obtain y_1, ..., y_n ∈ R^3 and b ∈ △^n. Next, we apply any of the proposed methods with cost matrix c_{i,j} = d(x_i, y_j), where d is some discrepancy measure, to obtain a (possibly relaxed) transportation plan T ∈ R^{m×n}_+. For each color centroid x_i, we apply a barycentric projection to obtain a new color centroid

    x̂_i := argmin_{x ∈ R^3} Σ_{j=1}^n t_{i,j} d(x, y_j).

The same projection can then be performed with respect to the y_j, in order to transfer the colors in the other direction. We use two public-domain images, "fall foliage" by Bernard Spragg and "comunion" by Abel Maestro Garcia, and reduce the number of colors to m = n = 4096. We compare smoothed dual approaches and (semi-)relaxed primal approaches. For the semi-relaxed primal, we also compared with Φ(x, y) = (1/γ)KL(x‖y), where KL(x‖y) := x^⊤ log(x/y) − x^⊤1 + y^⊤1 is the generalized KL divergence. This choice is differentiable but not smooth. We ran the aforementioned solvers for up to 1000 epochs.
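When d is the squared Euclidean distance, the barycentric projection above has a closed form: x̂_i is the t_{i,j}-weighted average of the y_j. A minimal sketch under that assumption (our own naming):

```python
import numpy as np

def barycentric_projection(T, Y):
    """Row i is argmin_x sum_j T[i, j] * ||x - Y[j]||^2, i.e. the
    T[i, :]-weighted mean of the target color centroids Y (shape n x 3).
    Assumes strictly positive row sums (marginals may be relaxed)."""
    return (T @ Y) / T.sum(axis=1, keepdims=True)
```

Each quantized pixel of the source image is then recolored with the row of the output corresponding to its centroid.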
[Three panels: dual convergence for γ = 10^−3, γ = 10^−1 and γ = 10.]
Figure 3: Solver comparison for the smoothed dual and semi-dual, with squared 2-norm regularization. With
γ = 10, which was also the best value selected in Figure 2, the maximum is reached in less than 4 minutes.
Results. Our results are presented in Figure 2. All formulations clearly produced better results than unregularized OT. With the exception of the entropy-smoothed semi-dual formulation, all formulations produced extremely sparse transportation plans. The semi-relaxed primal formulation with Φ set to the squared Euclidean distance was the only one to produce colors with a darker tone.

6.2 Solver and objective comparison

We compared the smoothed dual and semi-dual when using squared 2-norm regularization. In addition to L-BFGS on both objectives, we also compared with alternating minimization in the dual. As we show in Appendix B, exact block minimization w.r.t. α and β can be carried out by projection onto the simplex.

Results. We ran the comparison using the same data as in §6.1. Results are indicated in Figure 3. When the problem is loosely regularized, we made two key findings: i) L-BFGS converges much faster in the semi-dual than in the dual; ii) alternating minimization converges extremely slowly. The reason for i) could be the better smoothness constant of the semi-dual (cf. §5). Since alternating minimization and the semi-dual have roughly the same cost per iteration (cf. Appendix B), the reason for ii) is not iteration cost but a convergence issue of alternating minimization. When using larger regularization, L-BFGS appears to converge slightly faster on the dual than on the semi-dual, likely thanks to its cheap-to-compute gradients.

6.3 Approximation error comparison

We compared empirically the approximation error of smoothed formulations w.r.t. unregularized OT according to four criteria: transportation plan error, marginal constraint error, value error and regularized value error (cf. Figure 4 for a precise definition). For the dual approaches, we solved the smoothed semi-dual objective (10), since, as we discussed in §5, it has the same smoothness constant of 1/γ for both entropic and squared 2-norm regularizations, implying similar convergence rates in theory. In addition, in the case of entropic regularization, the expressions of max_Ω and ∇max_Ω are trivial to stabilize numerically using standard log-sum-exp implementation tricks.

Results. We ran the comparison using the same data as in §6.1. Results are indicated in Figure 4. For the transportation plan error and the (regularized) value error, entropic regularization required a 100 times smaller γ to achieve the same error. This confirms, as suggested by Theorem 1, that squared 2-norm regularization is typically tighter. Unsurprisingly, the semi-relaxed primal was tighter than the relaxed primal in all four criteria. A runtime comparison of smoothed formulations is also important; however, a rigorous comparison would require carefully engineered implementations and is therefore left for future work.

7 Related work

Regularized OT. Problems similar to (5) for general Ω were considered in (Dessein et al., 2016). Their work focuses on strictly convex and differentiable Ω for which there exists an associated Bregman divergence. Following (Benamou et al., 2015), they show that (5) can then be reformulated as a Bregman projection onto the transportation polytope and solved using Dykstra's (1985) algorithm. While Dykstra's algorithm can be interpreted implicitly as a two-block alternating minimization scheme on the dual problem, neither the dual nor the semi-dual expressions were derived. These expressions allow us to make use of arbitrary solvers, including quasi-Newton ones like L-BFGS, which, as we showed empirically, converge much faster on loosely regularized problems. Our framework can also accommodate non-differentiable regularizations for which there does not exist an associated Bregman divergence, such as those that include a group-lasso term. Squared 2-norm regularization was recently considered in (Li et al., 2016) as well as in (Essid and Solomon, 2017), but for a reformulation of the Wasser-
Figure 4: Approximation error w.r.t. unregularized OT empirically achieved by different smoothed formulations on the task of color transfer. Let T* be an optimal solution of the unregularized LP (2) and T*_γ be an optimal solution of one of the smoothed formulations with regularization parameter γ. The transportation plan error is ‖T*_γ − T*‖/‖T*‖. The marginal constraint error is ‖T*_γ 1_n − a‖ + ‖(T*_γ)^⊤ 1_m − b‖. The value error is |⟨T*_γ, C⟩ − ⟨T*, C⟩|/⟨T*, C⟩. The regularized value error is |v − ⟨T*, C⟩|, where v is one of OT_Ω(a, b), ROT_Φ(a, b) and R̃OT_Φ(a, b). For the regularized value error, our empirical findings confirm what Theorem 1 suggested, namely that, for the same value of γ, squared 2-norm regularization is considerably tighter than entropic regularization.
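The first three criteria are straightforward to compute from T*_γ and T*; a minimal sketch (our own helper, using Frobenius/Euclidean norms as in the caption):

```python
import numpy as np

def approximation_errors(T_gamma, T_star, C, a, b):
    """Transportation plan, marginal constraint and value errors of Figure 4.
    (The regularized value error additionally needs the regularized optimal
    value v and is omitted here.)"""
    plan = np.linalg.norm(T_gamma - T_star) / np.linalg.norm(T_star)
    marginal = (np.linalg.norm(T_gamma.sum(axis=1) - a)
                + np.linalg.norm(T_gamma.sum(axis=0) - b))
    value = abs((T_gamma * C).sum() - (T_star * C).sum()) / (T_star * C).sum()
    return plan, marginal, value
```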
stein distance of order 1 as a min cost flow problem along the edges of a graph.

Relaxed OT. There has been a large number of proposals to extend OT to unbalanced positive measures. Static formulations with approximate marginal constraints based on the KL divergence have been proposed in (Frogner et al., 2015; Chizat et al., 2016). The main difference with our work is that these formulations include an additional entropic regularization on T. While this entropic term enables a Sinkhorn-like algorithm, it also prevents from obtaining sparse T and requires the tuning of an additional hyper-parameter. Relaxing only one of the two marginal constraints with an inequality was investigated for color transfer in (Rabin et al., 2014). Benamou (2003) considered an interpolation between OT and squared Euclidean distances:

    min_{x ∈ △^m} OT(x, b) + (1/2γ)‖x − a‖^2.    (15)

While on first sight this looks quite different, this is in fact equivalent to our semi-relaxed primal formulation when Φ(x, y) = (1/2γ)‖x − y‖^2, since (15) is equal to

    min_{x ∈ △^m} min_{T ≥ 0, T1_n = x, T^⊤1_m = b} ⟨T, C⟩ + (1/2γ)‖x − a‖^2
    = min_{T ≥ 0, T^⊤1_m = b} ⟨T, C⟩ + (1/2γ)‖T1_n − a‖^2 = R̃OT_Φ(a, b).

However, the bounds in §5 are to our knowledge new. A similar formulation, but with a group-lasso penalty on T instead of (1/2γ)‖T1_n − a‖^2, was considered in the context of convex clustering (Carli et al., 2013).

Smoothed LPs. Smoothed linear programs have been investigated in other contexts. The two closest works to ours are (Meshi et al., 2015b) and (Meshi et al., 2015a), in which smoothed LP relaxations based on the squared 2-norm are proposed for maximum a-posteriori inference. One innovation we make compared to these works is to abstract away the regularization by introducing the δ_Ω and max_Ω functions.

8 Conclusion

We proposed in this paper to regularize both the primal and dual OT formulations with a strongly convex term, and showed that this corresponds to relaxing the dual and primal constraints with smooth approximations. There are several important avenues for future work. The conjugate expression (9) should be useful for barycenter computation (Cuturi and Peyré, 2016) or dictionary learning (Rolet et al., 2016) with squared 2-norm instead of entropic regularization. On the theoretical side, while we provided convergence guarantees w.r.t. the OT distance value as the regularization vanishes, which suggested the advantage of squared 2-norm regularization, it would also be important to study the convergence w.r.t. the transportation plan, as was done for entropic regularization by Cominetti and San Martín (1994). Finally, studying optimization algorithms that can cope with large-scale data is important. We believe SAGA (Defazio et al., 2014) is a good candidate since it is stochastic, supports proximity operators, is adaptive to non-strongly convex problems and can be parallelized (Leblond et al., 2017).
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.

Gilberto Calvillo and David Romero. On the closest point to the origin in transportation polytopes. Discrete Appl. Math., 210:88–102, 2016.

Francesca P Carli, Lipeng Ning, and Tryphon T Georgiou. Convex clustering via optimal mass transport. arXiv preprint arXiv:1307.5459, 2013.

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced transport problems. arXiv preprint arXiv:1607.05816, 2016.

Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.

Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation, 2016.

Richard L Dykstra. An iterative procedure for obtaining I-projections onto the intersection of convex sets. The Annals of Probability, pages 975–984, 1985.

Montacer Essid and Justin Solomon. Quadratically-regularized optimal transport on graphs. arXiv preprint arXiv:1704.08200, 2017.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein loss. In Proc. of NIPS, pages 2053–2061, 2015.

Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Proc. of NIPS, pages 3440–3448, 2016.

Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In Proc. of International Conference on Information Processing in Medical Imaging, pages 261–272, 2015.

Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.

Jingu Kim, Renato DC Monteiro, and Haesun Park. Group sparsity in nonnegative matrix factorization. In Proc. of SIAM International Conference on Data Mining, pages 851–862, 2012.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In Proc. of ICML, pages 957–966, 2015.

Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. of ICML, 2012.

Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: asynchronous parallel SAGA. In Proc. of AISTATS, volume 54, pages 46–54, 2017.

Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34:1879–1920, 2012.

Wuchen Li, Stanley Osher, and Wilfrid Gangbo. A fast algorithm for earth mover's distance based on optimal transport and L1 type regularization. arXiv preprint arXiv:1609.07092, 2016.

Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.

Ofer Meshi, Amir Globerson, and Tommi S. Jaakkola. Convergence rate analysis of MAP coordinate minimization algorithms. In Proc. of NIPS, pages 3014–3022, 2012.

Ofer Meshi, Mehrdad Mahdavi, and Alexander G. Schwing. Smooth and strong: MAP inference with linear convergence. In Proc. of NIPS, 2015a.

Ofer Meshi, Nathan Srebro, and Tamir Hazan. Efficient training of structured SVMs via soft constraints. In Proc. of AISTATS, pages 699–707, 2015b.

Christian Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.

Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis regularized optimal transport and ecological inference. In AAAI, pages 2387–2393, 2017.

Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. 2017.

François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.

Julien Rabin, Sira Ferradans, and Nicolas Papadakis. Adaptive color transfer with relaxed optimal transport. In Proc. of International Conference on Image Processing, pages 4852–4856, 2014.

Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In Proc. of AISTATS, pages 630–638, 2016.

David Romero. Easy transportation-like problems on k-dimensional arrays. Journal of Optimization Theory and Applications, 66:137–147, 1990.

Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.

Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Phys. Math., 144:144–153, 1931.

Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In Proc. of ICML, pages 306–314, 2014.

Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003.
Appendix
A Proofs
A.1 Derivation of the smooth relaxed dual
Recall that

    OT_Ω(a, b) = min_{T ∈ U(a,b)} Σ_{j=1}^n t_j^⊤ c_j + Ω(t_j).    (16)
We now add Lagrange multipliers for the two equality constraints but keep the constraint T ≥ 0 explicit:

    OT_Ω(a, b) = min_{T ≥ 0} max_{α ∈ R^m, β ∈ R^n} Σ_{j=1}^n t_j^⊤ c_j + Ω(t_j) + α^⊤(T1_n − a) + β^⊤(T^⊤1_m − b).

Since (16) is a convex optimization problem with only linear equality and inequality constraints, Slater's conditions reduce to feasibility (Boyd and Vandenberghe, 2004, §5.2.3) and hence strong duality holds:

    OT_Ω(a, b) = max_{α ∈ R^m, β ∈ R^n} min_{T ≥ 0} Σ_{j=1}^n t_j^⊤ c_j + Ω(t_j) + α^⊤(T1_n − a) + β^⊤(T^⊤1_m − b)
    = max_{α ∈ R^m, β ∈ R^n} min_{t_j ≥ 0} Σ_{j=1}^n t_j^⊤(c_j + α + β_j 1_m) + Ω(t_j) − α^⊤a − β^⊤b
    = max_{α ∈ R^m, β ∈ R^n} − Σ_{j=1}^n max_{t_j ≥ 0} t_j^⊤(−c_j − α − β_j 1_m) − Ω(t_j) − α^⊤a − β^⊤b
    = max_{α ∈ R^m, β ∈ R^n} α^⊤a + β^⊤b − Σ_{j=1}^n max_{t_j ≥ 0} t_j^⊤(α + β_j 1_m − c_j) − Ω(t_j),

where the last line follows from the change of variables (α, β) → (−α, −β).
Following a similar argument as (Cuturi and Peyré, 2016, Theorem 2.4), we have

    OT*_Ω(g, b) = max_{T ≥ 0, T^⊤1_m = b} ⟨T, g1_n^⊤ − C⟩ − Σ_{j=1}^n Ω(t_j).

Notice that this is an easier optimization problem than (5), since there are equality constraints only in one direction. Cuturi and Peyré (2016) showed that this optimization problem admits a closed form in the case of entropic regularization. Here, we show how to compute OT*_Ω for any strongly-convex regularization.

The problem clearly decomposes over columns and we can rewrite it as

    OT*_Ω(g, b) = Σ_{j=1}^n max_{t_j ≥ 0, t_j^⊤1_m = b_j} t_j^⊤(g − c_j) − Ω(t_j)
    = Σ_{j=1}^n b_j max_{τ_j ∈ △^m} τ_j^⊤(g − c_j) − (1/b_j) Ω(b_j τ_j)
    = Σ_{j=1}^n b_j max_{Ω_j}(g − c_j),

where we defined Ω_j(y) := (1/b_j) Ω(b_j y) and where max_Ω is defined in (8).
Using a similar derivation as before, we obtain the duals of (13) and (14):

    ROT_Φ(a, b) = max_{α, β ∈ P(C)} −(1/2)Φ*(−2α, a) − (1/2)Φ*(−2β, b),

    R̃OT_Φ(a, b) = max_{α, β ∈ P(C)} −Φ*(−α, a) + β^⊤b
    = max_{α ∈ R^m} −Φ*(−α, a) − Σ_{j=1}^n b_j max_{i ∈ [m]} (α_i − c_{i,j}),

and

    R̃OT_Φ(a, b) = max_{α, β ∈ P(C)} α^⊤a + β^⊤b − (γ/2)‖α‖^2
    = max_{α ∈ R^m} α^⊤a − Σ_{j=1}^n b_j max_{i ∈ [m]} (α_i − c_{i,j}) − (γ/2)‖α‖^2.

This corresponds to the original dual and semi-dual with squared 2-norm regularization on the variables.
Before proving the theorem, we introduce the next two lemmas, which bound the regularization value achieved by any transportation plan.

Proof. The tightest lower bound is given by min_{T ∈ U(a,b)} ‖T‖^2. An exact iterative algorithm was proposed in (Calvillo and Romero, 2016) to solve this problem. However, since we are interested in an explicit formula, we consider instead the lower bound min_{T1_n = a, T^⊤1_m = b} ‖T‖^2 (i.e., we ignore the non-negativity constraint). It is known (Romero, 1990) that the minimum is achieved at t_{i,j} = a_i/n + b_j/m − 1/(mn), hence our lower bound. For the upper bound, we have

    ‖T‖^2 = Σ_{i=1}^m Σ_{j=1}^n t_{i,j}^2
    = Σ_{i=1}^m a_i^2 Σ_{j=1}^n (t_{i,j}/a_i)^2
    ≤ Σ_{i=1}^m a_i^2 Σ_{j=1}^n t_{i,j}/a_i
    = ‖a‖^2,

where the inequality uses the fact that t_{i,j}/a_i ∈ [0, 1], so that (t_{i,j}/a_i)^2 ≤ t_{i,j}/a_i. We can do the same with b ∈ △^n to obtain ‖T‖^2 ≤ ‖b‖^2, yielding the claimed result.

Together with 0 ≤ ‖a‖^2 ≤ 1 and 0 ≤ ‖b‖^2 ≤ 1, this provides lower and upper bounds for the squared 2-norm of a transportation plan.
Proof of the theorem. Let T* and T*_Ω be optimal solutions of (2) and (5), respectively. Then,

Likewise,

    OT_Ω(a, b) = ⟨T*_Ω, C⟩ + Ω(T*_Ω) ≤ ⟨T*, C⟩ + Ω(T*) = OT(a, b) + Ω(T*).

Combining the two, we obtain

Using T*, T*_Ω ∈ U(a, b) together with Lemma 2 and Lemma 3 gives the claimed results.
Proof. The proof technique is inspired by (Meshi et al., 2012, Supplementary material, Lemma 1.2).

The 1-norm can be rewritten as

    α_i + β_j ≤ c_{i,j},
    α^⊤1_m = 0,

with a constant that does not depend on r and s. We call the above the dual problem. Its Lagrangian is

    L(α, β, µ, ν, T) = r^⊤α + s^⊤β + µ α^⊤1_m + ν(α^⊤a + β^⊤b) + Σ_{i,j=1}^{m,n} t_{i,j}(c_{i,j} − α_i − β_j),

with µ ∈ R, ν ≥ 0, T ≥ 0. Maximizing the Lagrangian w.r.t. α and β gives the corresponding primal problem

    min_{T ≥ 0, µ ∈ R, ν ≥ 0} ⟨T, C⟩   s.t.   T1_n = νa + r + µ1_m,   T^⊤1_m = νb + s.

By weak duality, any feasible primal point provides an upper bound of the dual problem. We start by choosing µ = (1/m)(Σ_j s_j − Σ_i r_i) so that Σ_{i,j} t_{i,j} provides the same values w.r.t. the last two constraints. Next, we choose

    ν = max{ max_i (2 + n/m)/a_i, max_j 1/b_j },

which ensures the non-negativity of νa + r + µ1_m and νb + s regardless of r and s. It follows that the transportation plan T defined by

    T = (1/((νb + s)^⊤1_n)) (νa + r + µ1_m)(νb + s)^⊤

is feasible. We finally bound the objective: ⟨T, C⟩ ≤ ‖C‖_∞ Σ_{i,j} t_{i,j} ≤ ‖C‖_∞ (ν + n).
    α_i + β_j ≤ c_{i,j},
    α^⊤1_m = 0,
    T^⊤1_m = νb.

By weak duality, any feasible primal point gives us an upper bound. We start by choosing µ = (1/m) Σ_i r_i so that Σ_{i,j} t_{i,j} provides the same values w.r.t. the last two constraints. Next, we choose ν = max_i 2/a_i, which ensures the non-negativity of νa + r + µ1_m (νb ≥ 0 is also satisfied since ν ≥ 0), which appears in the r.h.s. of the second constraint, independently of r. It follows that the transportation plan T defined by

    T = (1/(νb^⊤1_n)) (νa + r + µ1_m)(νb)^⊤ = (νa + r + µ1_m) b^⊤
where

    ν_1 = max{ (2 + n/m)‖a^{−1}‖_∞, ‖b^{−1}‖_∞ },
    ν_2 = max{ ‖a^{−1}‖_∞, (2 + m/n)‖b^{−1}‖_∞ }.

Taking the square of this bound and plugging the result in (18) gives the claimed result. Applying the same reasoning with Lemma 5 gives the claimed result for the semi-relaxed primal.
General case. Let β(α) be an optimal solution of (7) given α fixed, and similarly for α(β). From the first-order optimality conditions,

    ∇δ_Ω(α + β_j(α)1_m − c_j)^⊤ 1_m = b_j   ∀j ∈ [n],    (19)

and similarly for α given β fixed. Solving these equations is non-trivial in general. However, because

holds ∀α ∈ R^m, j ∈ [n], we can retrieve β_j(α) if we know how to compute ∇max_Ω(x) and the inverse map (∇δ_Ω)^{−1}(y) exists. That map exists and equals ∇Ω(y) provided that Ω is differentiable and y > 0.
Entropic regularization. It is easy to verify that (19) is satisfied with

    β(α) = γ log( b / (K^⊤ e^{α/γ − 1_m}) ),   where K := e^{−C/γ},

with exponentiation, division and logarithm performed element-wise, and similarly for α(β). These updates recover the iterates of the Sinkhorn algorithm (Cuturi, 2013).
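This block update is easy to check numerically. The sketch below assumes the entropic regularizer of Table 1, for which ∇δ_Ω(x) = e^{x/γ − 1} element-wise, so that the j-th column of the plan is t_j = ∇δ_Ω(α + β_j 1_m − c_j); the assertion verifies that β(α) makes the column sums match b, i.e. that condition (19) holds:

```python
import numpy as np

def beta_from_alpha(alpha, C, b, gamma):
    """Exact block minimizer beta(alpha) for entropic regularization:
    beta = gamma * log( b / (K^T exp(alpha/gamma - 1)) ), with K = exp(-C/gamma)."""
    K = np.exp(-C / gamma)
    return gamma * (np.log(b) - np.log(K.T @ np.exp(alpha / gamma - 1.0)))

rng = np.random.default_rng(0)
m, n, gamma = 3, 4, 0.8
C = rng.random((m, n))
b = np.full(n, 1.0 / n)
alpha = rng.normal(size=m)
beta = beta_from_alpha(alpha, C, b, gamma)
# Plan entries t_ij = exp((alpha_i + beta_j - c_ij)/gamma - 1); check column sums.
T = np.exp((alpha[:, None] + beta[None, :] - C) / gamma - 1.0)
assert np.allclose(T.sum(axis=0), b)
```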
Squared 2-norm regularization. Plugging the expression of ∇δ_Ω in (19), we get that β(α) must satisfy

    [α + β_j(α)1_m − c_j]_+^⊤ 1_m = γ b_j   ∀j ∈ [n].
Close inspection shows that this is exactly the same optimality condition that the Euclidean projection onto the simplex, argmin_{y ∈ △^m} ‖y − x‖^2, must satisfy, with x = (α − c_j)/(γ b_j). Let x_[1] ≥ ··· ≥ x_[m] be the values of x in sorted order. Following (Michelot, 1986; Duchi et al., 2008), if we let

    ρ := max{ i ∈ [m] : x_[i] − (1/i)(Σ_{r=1}^i x_[r] − 1) > 0 },

then y* is exactly achieved at [x + β_j(α)/(γ b_j) 1_m]_+, where

    β_j(α) = −(γ b_j / ρ)(Σ_{r=1}^ρ x_[r] − 1).
The expression for α(β) is completely symmetrical. While a projection onto the simplex is required for each
coordinate, as discussed in §3.3, this can be done in expected linear time. In addition, each coordinate-wise
solution can be computed in parallel.
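The sorted-threshold rule above can be sketched in a few lines of numpy (our own naming; a sketch of the Michelot/Duchi et al. procedure, not the reference implementation):

```python
import numpy as np

def projection_simplex(x):
    """Euclidean projection of x onto the simplex {y >= 0, sum(y) = 1},
    via the sorted-threshold rule y = [x - tau]_+, with
    tau = (sum of the rho largest entries of x - 1) / rho."""
    u = np.sort(x)[::-1]                      # x_[1] >= ... >= x_[m]
    css = np.cumsum(u)
    idx = np.arange(1, len(x) + 1)
    rho = idx[u - (css - 1.0) / idx > 0][-1]  # largest index kept active
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(x - tau, 0.0)
```

In the notation above, the threshold is τ = (Σ_{r≤ρ} x_[r] − 1)/ρ, so β_j(α) = −γ b_j τ with x = (α − c_j)/(γ b_j), and (for the squared 2-norm regularizer) the j-th column of the plan is b_j times the projection.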
Alternating minimization. Once we know how to compute β(α) and α(β), there are a number of ways
we can build a proper algorithm to solve the smoothed dual. Perhaps the simplest is to alternate between
β ← β(α) and α ← α(β). For entropic regularization, this two-block coordinate descent (CD) scheme is known
as the Sinkhorn algorithm and was recently popularized in the context of optimal transport by Cuturi (2013).
A disadvantage of this approach, however, is that computational effort is spent updating coordinates that may
already be near-optimal. To address this issue, we can instead adopt a greedy CD scheme as recently proposed
for entropic regularization by Altschuler et al. (2017).
C Additional experiments
We ran the same experiments as Figure 2 and Figure 3 on one more image pair: “Grafiti” by Jon Ander and
“Rainbow Bridge National Monument Utah”, by Bernard Spragg. Both images are in the public domain. The
results, presented in Figure 5 and Figure 6 below, confirm the empirical findings described in §6.1 and §6.2. The
images are available at https://github.com/mblondel/smooth-ot/tree/master/data.