
Notes on the Optimal Transport Problem and Metrics

Yang YANG, EE 681

April 27, 2019

Email: yang-yan16@mails.tsinghua.edu.cn
Chapter 1

Background

The optimal transport (OT) problem concerns both the distance between two distributions and the allocation of resources, and it plays an important role in mathematics and economics.
The problem was formalized by the French mathematician Gaspard Monge in 1781. The initial formulation is the so-called Monge formulation, discussed below (Section 2.1). The relaxed and more general form, the Kantorovich formulation, enlarges the class of transport plans from maps to joint distributions. The Kantorovich formulation admits a dual, which allows discrete OT problems to be solved as linear programs (Section 2.2). The Monge formulation has an explicit solution when the two distributions are supported on one-dimensional sets, and the minimum value of the optimal transport problem induces the Wasserstein metrics.

However, the high complexity of the linear program limited its application until entropy regularization was introduced, which enables the Sinkhorn algorithm to approximately solve optimal transport in near-linear time. On the other hand, balanced optimal transport requires the two measures to be normalized to the same total mass.

Unlike the Monge formulation, which takes a geometric point of view, the Benamou-Brenier formulation starts from a fluid mechanics problem. It also has a potential form, called the Hamilton-Jacobi formulation. Considering the mass transported over a time period, the Benamou-Brenier formulation is an optimization over dynamics, under the constraint of the continuity equation

∂_t ρ_t + ∇ · (ρ_t v_t) = 0    (1.1)

which describes conservation of mass in the fluid, while optimizing a kinetic-energy (Lagrangian) term. By adding a source term to the right-hand side, the balanced optimal transport problem can be generalized to the unbalanced one, which induces the Wasserstein-Fisher-Rao metric (Chizat et al., 2018). This metric interpolates between the Wasserstein metric and the Fisher-Rao metric. Another equivalent form, which relaxes the marginal constraints on the joint distribution, is the Hellinger-Kantorovich formulation (Liero et al., 2018); it will be discussed later.

Chapter 2

Formulations

2.1 Monge Problem


Given two distributions µ ∈ P(X) and ν ∈ P(Y ) and a cost function on the
product space c : X × Y → [0, +∞], solve the optimization:
(MP)  inf { ∫_X c(x, T(x)) dµ(x) : T_# µ = ν }    (2.1)

where T : X → Y and T_# denotes the pushforward (image measure), which maps measures on X to measures on Y. T_# µ is given by (T_# µ)(B) = µ(T^{-1}(B)) for any measurable set B ⊂ Y.
In the one-dimensional case it is possible to solve the optimal transportation problem explicitly in terms of the cumulative distribution functions. Let f and g be the density functions of the two distributions. The distribution functions are

F(x) = ∫_{-∞}^x f(t) dt,   G(x) = ∫_{-∞}^x g(t) dt

Then the optimal transportation cost is

W_2^2(f, g) = ∫_0^1 |F^{-1}(t) − G^{-1}(t)|^2 dt.    (2.2)

If additionally the target density g is positive, then the optimal map from f to g is given by

T(x) = G^{-1}(F(x))    (2.3)

Remark: Equation (2.2) can be obtained from the optimal map above by a change of variables.
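The one-dimensional formulas above are easy to evaluate numerically. The following is a minimal sketch (assuming NumPy and densities sampled on a common uniform grid; the helper name w2_squared_1d and the test densities are illustrative, not from these notes):

import numpy as np

def w2_squared_1d(x, f, g, n_quantiles=2000):
    # Approximate W_2^2(f, g) = int_0^1 |F^{-1}(t) - G^{-1}(t)|^2 dt, eq. (2.2).
    dx = x[1] - x[0]
    f = f / (f.sum() * dx)                 # normalize both densities to unit mass
    g = g / (g.sum() * dx)
    F = np.cumsum(f) * dx                  # cumulative distribution functions
    G = np.cumsum(g) * dx
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles
    F_inv = np.interp(t, F, x)             # generalized inverses by interpolation
    G_inv = np.interp(t, G, x)
    return np.mean((F_inv - G_inv) ** 2)

# Example: a unit Gaussian and a copy shifted by s; W_2^2 should be close to s^2.
x = np.linspace(-10.0, 10.0, 4001)
f = np.exp(-0.5 * x ** 2)
s = 1.5
g = np.exp(-0.5 * (x - s) ** 2)
print(w2_squared_1d(x, f, g))              # approximately 2.25

The same interpolation of G^{-1} composed with F also gives a discrete approximation of the optimal map (2.3).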

2.2 Kantorovich-Dual formulations


The Monge problem is difficult because of its constraint on the existence of a transport map. Thus the starting point of all modern approaches is the natural relaxed formulation due to Kantorovich.

Given two probability measures µ ∈ P(X) and ν ∈ P(Y ) and a cost function
c : X × Y → [0, +∞], solve the optimization:
(KP)  inf { ∫_{X×Y} c(x, y) dγ(x, y) : γ ∈ Π(µ, ν) }    (2.4)

where Π(µ, ν) is the set of so-called transport plans, constrained to have the prescribed marginal distributions:

Π(µ, ν) = { γ ∈ P(X × Y) : (π_x)_# γ = µ, (π_y)_# γ = ν }

where π_x and π_y are the two projections of X × Y onto X and Y respectively.
The Kantorovich problem has linear constraints. After discretization, it becomes a standard linear program.

Given marginal constraints (a_i) ∈ P(X) and (b_j) ∈ P(Y) for the discrete probability space P(X × Y) and a family of costs (c_ij)_{i∈X, j∈Y}:

min { Σ_{i,j} c_ij γ_ij : γ_ij ≥ 0, Σ_j γ_ij = a_i, Σ_i γ_ij = b_j }    (2.5)

The discretized Kantorovich problem can be solved with any algorithm for standard linear programs; a small sketch follows.
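For concreteness, here is a hedged sketch of (2.5) as a standard linear program using scipy.optimize.linprog (SciPy is assumed available; the marginals a, b and the cost matrix C are made-up illustrative data):

import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    # min <C, gamma> s.t. gamma >= 0, row sums = a, column sums = b.
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j gamma_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i gamma_ij = b_j
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

a = np.array([0.5, 0.5])
b = np.array([0.25, 0.25, 0.5])
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])
cost, gamma = kantorovich_lp(a, b, C)
print(cost)     # optimal transport cost
print(gamma)    # optimal transport plan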
By now we have made no assumptions for the Kantorovich Problem.
The assumptions below are widely used in applications.

• X = Y = Ω ⊂ R^d
• c(x, y) = ||x − y||^p (1 ≤ p ≤ ∞)
• We restrict our analysis to the family of probability distributions

  P_p(Ω) = { µ ∈ P(Ω) : ∫_Ω ||x||^p dµ < +∞ }

Using the same notation as above, given marginal constraints (a_i) ∈ P(X) and (b_j) ∈ P(Y) for the discrete probability space P(X × Y) and a family of costs (c_ij)_{i∈X, j∈Y}, we deduce the duality formulation in the discrete case. The key observation is the fact that

sup_{ϕ_i, ψ_j} { Σ_i ϕ_i (a_i − Σ_j γ_ij) + Σ_j ψ_j (b_j − Σ_i γ_ij) } = 0 if a_i = Σ_j γ_ij and b_j = Σ_i γ_ij for all i, j, and +∞ otherwise.

With the observation above, we can reformulate the Kantorovich problem as

inf_{γ_ij ≥ 0} { Σ_{i,j} c_ij γ_ij + sup_{ϕ_i, ψ_j} [ Σ_i ϕ_i (a_i − Σ_j γ_ij) + Σ_j ψ_j (b_j − Σ_i γ_ij) ] }

Formally interchanging the infimum and the supremum leads to

sup_{ϕ_i, ψ_j} { Σ_i ϕ_i a_i + Σ_j ψ_j b_j + inf_{γ_ij ≥ 0} Σ_{i,j} (c_ij − (ϕ_i + ψ_j)) γ_ij }

And we notice the fact that

inf_{γ_ij ≥ 0} Σ_{i,j} (c_ij − (ϕ_i + ψ_j)) γ_ij = 0 if ϕ_i + ψ_j ≤ c_ij for all i, j, and −∞ otherwise.

As a consequence we have the dual optimization problem (DP):

sup_{ϕ_i, ψ_j} { Σ_i ϕ_i a_i + Σ_j ψ_j b_j : ϕ_i + ψ_j ≤ c_ij }    (2.6)

Similarly, in the continuous case we have:

sup_{ϕ ∈ C_b(X), ψ ∈ C_b(Y)} { ∫_X ϕ dµ + ∫_Y ψ dν : ϕ(x) + ψ(y) ≤ c(x, y) }    (2.7)

In the dual problem, ϕ and ψ are known as Kantorovich potentials.


• Under our assumptions, the duality sup(DP) = min(KP) holds, and for (x, y) ∈ supp(γ*), ϕ(x) + ψ(y) = c(x, y).
• Even under our assumptions, the constraint ϕ(x) + ψ(y) ≤ c(x, y) (∀x, y ∈ Ω) is very complicated to handle.
• The Kantorovich potentials ϕ and ψ are not determined where there is no mass, and they can be very singular near the support of the mass.

2.3 Benamou–Brenier formulations


Consider a time-dependent density ρ_t transported by a velocity field v_t satisfying the continuity equation

∂_t ρ_t + ∇ · (v_t ρ_t) = 0

where v_t denotes the velocity of the fluid and ρ_t denotes its density. The boundary conditions are given by:

ρ_0 = µ,  ρ_1 = ν

We try to minimize the total kinetic energy involved in this process, which gives the Benamou-Brenier formulation (BP):

inf ∫_0^1 ∫_Ω (1/2) |v(t, x)|^2 ρ(t, x) dx dt    (2.8)
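For the quadratic cost, the value of this dynamic problem is, up to the factor 1/2, the squared Wasserstein distance (the classical Benamou-Brenier result), so the formulation can be written compactly, together with its constraints, as

(1/2) W_2^2(µ, ν) = inf_{(ρ, v)} { ∫_0^1 ∫_Ω (1/2) |v(t, x)|^2 ρ(t, x) dx dt : ∂_t ρ_t + ∇ · (ρ_t v_t) = 0, ρ_0 = µ, ρ_1 = ν }.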

2.4 Hamilton-Jacobi formulations

Chapter 3

Convexity of the Wasserstein metric and insensitivity with respect to noise

We note that any shifts, dilations, and amplitude changes in a signal f will
correspond to a similar transformation in the positive and negative parts of f.
Thus in the results below, it is sufficient to assume that the original profile f is
non-negative.

3.1 Convexity with respect to shift


Definition 2.1 (Cyclical monotonicity). We say that a map T : X ⊂ R^n → R^n is cyclically monotone if for any m ∈ N+, x_i ∈ X, 1 ≤ i ≤ m, with x_0 = x_m,

Σ_{i=1}^m x_i · T(x_i) ≥ Σ_{i=1}^m x_i · T(x_{i−1})

Theorem 2.1 (Optimality criterion for quadratic cost). Let f and g be probability density functions supported on sets X and Y, respectively. A mass-preserving map T : X → Y minimizes the quadratic optimal transport cost if and only if it is cyclically monotone.
We shall not prove this theorem.

Theorem 2.2 (Convexity of shift). Suppose f and g are probability density functions of bounded second moment. Let T be the optimal map that rearranges f into g. If f_s(x) = f(x − sη) for η ∈ R^n, then the optimal map from f_s to g is T_s(x) = T(x − sη). Moreover, W_2^2(f_s, g) is convex with respect to the shift size s.

Proof: First, we shall prove that T_s is the optimal map rearranging f_s to g by checking its cyclical monotonicity:

Σ_{i=1}^m x_i · T_s(x_i) = Σ_{i=1}^m (x_i − sη) · T(x_i − sη) + Σ_{i=1}^m sη · T(x_i − sη)
  ≥ Σ_{i=1}^m (x_i − sη) · T(x_{i−1} − sη) + Σ_{i=1}^m sη · T(x_{i−1} − sη)
  = Σ_{i=1}^m x_i · T_s(x_{i−1})

Then the squared Wasserstein metric can be expressed as

W_2^2(f_s, g) = ∫ |x − T_s(x)|^2 f_s(x) dx = ∫ |x − T(x − sη)|^2 f(x − sη) dx
  = ∫ |x + sη − T(x)|^2 f(x) dx
  = W_2^2(f, g) + s^2 |η|^2 + 2s ∫ η · (x − T(x)) f(x) dx.

The convexity with respect to s is evident from the last expression.
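A quick numerical illustration of Theorem 2.2 in one dimension, reusing the illustrative helper w2_squared_1d sketched in Section 2.1 (again only a sketch, with made-up test profiles): the values of W_2^2(f_s, g) over a range of shifts s should form a convex, in fact quadratic, curve.

import numpy as np

x = np.linspace(-15.0, 15.0, 6001)
g = np.exp(-0.5 * (x - 1.0) ** 2 / 0.5)       # a fixed target profile
shifts = np.linspace(-3.0, 3.0, 13)
vals = [w2_squared_1d(x, np.exp(-0.5 * (x - s) ** 2), g) for s in shifts]
# Non-negative second differences indicate convexity in the shift size s.
print(np.all(np.diff(vals, 2) >= -1e-8))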

3.2 Convexity with respect to dilation


Lemma 2.1 (Optimal map for dilation). Assume f and g are probability density functions of bounded second moment satisfying f(x) = det(A) g(Ax), where A is a symmetric positive definite matrix. Then the optimal transport map rearranging f(x) into g(y) is T(x) = Ax.

Proof: Since A is symmetric positive definite, it has a unique Cholesky decomposition A = L^T L for some upper triangular matrix L. Then for any x_i ∈ X,

Σ_{i=1}^n x_i · (T(x_i) − T(x_{i−1})) = (1/2) Σ_{i=1}^n (x_i^T L^T L x_i + x_{i−1}^T L^T L x_{i−1} − 2 x_{i−1}^T L^T L x_i)
  = (1/2) Σ_{i=1}^n |L x_i − L x_{i−1}|^2
  ≥ 0

which verifies the optimality condition. □

Theorem 2.3 (Convexity with respect to dilation). Assume f(x) is a probability density function and g(y) = f(A^{-1} y), where A is a symmetric positive definite matrix. Then the squared Wasserstein metric W_2^2(f, g/⟨g⟩) is convex with respect to the eigenvalues λ_1, ..., λ_n of A.

Proof. In order to define the Wasserstein metric, it is necessary to work with the normalized density g/⟨g⟩ = det(A)^{-1} f(A^{-1} y). By Lemma 2.1, the optimal map is

T(x) = Ax = OΛO^T x

where O is an orthogonal matrix and Λ is a diagonal matrix whose entries are the eigenvalues λ_1, ..., λ_n. Then the squared Wasserstein metric can be expressed as

W_2^2(f, g/⟨g⟩) = ∫ f(x) |x − Ax|^2 dx
  = ∫ f(x) x^T O (I − Λ)^2 O^T x dx
  = ∫ f(Oz) z^T (I − Λ)^2 z dz

which is convex in λ_1, ..., λ_n. □
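As a simple one-parameter illustration (an example added here for concreteness, with the isotropic dilation A = λI, λ > 0), the last integral reduces to

W_2^2(f, g/⟨g⟩) = (1 − λ)^2 ∫ |x|^2 f(x) dx,

which is a quadratic, hence convex, function of λ; for the standard Gaussian in R^n it equals n(1 − λ)^2.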

3.3 Convexity with respect to partial amplitude change

Finally, we consider the problem where a profile f is derived from g, but with a decreased amplitude in part of the domain. That is, we suppose that the domain is decomposed into Ω = Ω_1 ∪ Ω_2 with Ω_1 ∩ Ω_2 = ∅. For an amplitude loss parameter 0 ≤ β ≤ 1 we suppose that f depends on the probability density function g via

f_β(x) = βg(x) for x ∈ Ω_1,   f_β(x) = g(x) for x ∈ Ω_2
Theorem 2.4 (Convexity with respect to partial amplitude loss). The squared Wasserstein metric W_2^2(f_β/⟨f_β⟩, g) is a convex function of the parameter β.

In order to prove this result, we introduce an alternative form of the rescaled density in terms of a parameter −1 ≤ α ≤ 0:

h_α(x) = (1 + α)g(x) for x ∈ Ω_1,   h_α(x) = (1 − γ_α)g(x) for x ∈ Ω_2

where

γ_α = α ∫_{Ω_1} g / ∫_{Ω_2} g.

Note that h_α is non-negative and has unit mass by construction, with h_0 = g.

Lemma 2.2 (Parameterization of amplitude loss). Define the parameterization function

α(β) = β / (β ∫_{Ω_1} g + ∫_{Ω_2} g) − 1.

Then α : [0, 1] → [−1, 0] is concave and the associated density functions are related through

f̂_β ≡ f_β/⟨f_β⟩ = h_{α(β)}.

The proof of convexity with respect to β will come via convexity with respect to α.

Lemma 2.3 (Convexity with respect to partial amplitude change). The squared Wasserstein metric W_2^2(h_α, g) is a convex function of the parameter α.

Proof: Choose any α_1, α_2 ∈ [−1, 0] and s ∈ [0, 1]. From the convexity of the Monge-Kantorovich minimization problem, we have

W_2^2(s h_{α_1} + (1 − s) h_{α_2}, g) ≤ s W_2^2(h_{α_1}, g) + (1 − s) W_2^2(h_{α_2}, g)    (3.1)

We can calculate

s h_{α_1} + (1 − s) h_{α_2} = (s(1 + α_1) + (1 − s)(1 + α_2)) g = (1 + sα_1 + (1 − s)α_2) g   on Ω_1
s h_{α_1} + (1 − s) h_{α_2} = (s(1 − γ_{α_1}) + (1 − s)(1 − γ_{α_2})) g = (1 − γ_{sα_1 + (1 − s)α_2}) g   on Ω_2

so that s h_{α_1} + (1 − s) h_{α_2} = h_{sα_1 + (1 − s)α_2}.

Thus we can rewrite Equation (3.1) as

W_2^2(h_{sα_1 + (1 − s)α_2}, g) ≤ s W_2^2(h_{α_1}, g) + (1 − s) W_2^2(h_{α_2}, g)

and the Wasserstein metric W_2^2(h_α, g) is convex with respect to α. □

Lemma 2.4 (Misfit is non-increasing). Let −1 ≤ α_1 < α_2 ≤ 0. Then

W_2^2(h_{α_2}, g) ≤ W_2^2(h_{α_1}, g).

Proof. Define the parameter s ∈ [0, 1] by s = α_2/α_1. Then we can use the convexity result of Lemma 2.3 to compute

W_2^2(h_{α_2}, g) = W_2^2(h_{sα_1 + (1 − s)·0}, g)
  ≤ s W_2^2(h_{α_1}, g) + (1 − s) W_2^2(h_0, g)
  ≤ W_2^2(h_{α_1}, g)

where we have used the fact that h_0 = g. □

Proof (Proof of Theorem 2.4). Choose any β_1, β_2, s ∈ [0, 1]. From the concavity of α(β) we have

α(sβ_1 + (1 − s)β_2) ≥ sα(β_1) + (1 − s)α(β_2).

Applying Lemmas 2.3-2.4 we can compute

W_2^2(f̂_{sβ_1 + (1 − s)β_2}, g) = W_2^2(h_{α(sβ_1 + (1 − s)β_2)}, g)
  ≤ W_2^2(h_{sα(β_1) + (1 − s)α(β_2)}, g)
  ≤ s W_2^2(h_{α(β_1)}, g) + (1 − s) W_2^2(h_{α(β_2)}, g)
  = s W_2^2(f̂_{β_1}, g) + (1 − s) W_2^2(f̂_{β_2}, g),

which establishes the convexity. □

3.4 Insensitivity with respect to noise
We shall only consider the one-dimensional case. In one dimension, it is possible to solve the optimal transportation problem exactly in terms of the cumulative distribution functions

F(x) = ∫_{-∞}^x f(t) dt,   G(x) = ∫_{-∞}^x g(t) dt.

Then it is well known that the optimal transportation cost is

W_2^2(f, g) = ∫_0^1 |F^{-1}(t) − G^{-1}(t)|^2 dt.    (3.2)

If additionally the target density g is positive, then the optimal map from f to g is given by

T(x) = G^{-1}(F(x))

Remark: Equation (3.2) can be obtained from the optimal map above by a change of variables.

Theorem 2.5 (Insensitivity to noise in 1D). Let g be a positive probability density function on [0, 1] and choose 0 < c < min g. Let f_N(x) = g(x) + r^N(x), which contains piecewise constant additive noise r^N drawn from the uniform distribution U[−c, c]. Then E W_2^2(f_N/⟨f_N⟩, g) = O(1/N).

Without loss of generality, we take g = 1 on [0, 1]. For x ∈ ((i−1)/N, i/N], r^N(x) ≡ r_i, with each r_i drawn from U[−c, c]. As N → ∞, r^N(x) approximates a noise function r(x) on [0, 1]. For any x_0 ∈ [0, 1], r(x_0) is a random variable with uniform distribution U[−c, c].

Proof (Proof of Theorem 2.5). For each i, r_i is a random variable with uniform distribution U[−c, c], 0 < c < min g. Thus we have E r_i = 0 and E r̄ = 0, where

r̄ = (Σ_{i=1}^N r_i) / N.

Let h = 1/N and x_i = ih for i = 0, ..., N. Then the noisy density function is given by

f_N(x) = 1 + r_i,  x ∈ (x_{i−1}, x_i]

We begin by calculating the Wasserstein metric between f_N and the constant g_N = 1 + r̄, which share the same mass. We now derive the cumulative distribution function and its inverse for both f_N and g_N:

F_N(x) = Σ_{j=1}^{i−1} (1 + r_j) h + (1 + r_i)(x − x_{i−1}),  x ∈ (x_{i−1}, x_i]
G_N(x) = (1 + r̄) x,  x ∈ [0, 1]
F_N^{-1}(x) = ( x + ((i−1) r_i − Σ_{j=1}^{i−1} r_j) h ) / (1 + r_i),  x ∈ ( Σ_{j=1}^{i−1} (1 + r_j) h, Σ_{j=1}^{i} (1 + r_j) h ]
G_N^{-1}(x) = x / (1 + r̄),  x ∈ [0, 1 + r̄]

where 1 ≤ i ≤ N.
Then we can bound the squared Wasserstein metric by

W_2^2(f_N, g_N) = ∫_0^{1+r̄} |F_N^{-1}(t) − G_N^{-1}(t)|^2 dt ≤ (2h^3/(1 − c)^2) Σ_{i=1}^N ( Σ_{l=1}^i r_l − ih Σ_{k=1}^N r_k )^2.

Since the noise {r_i}_{i=1}^N is i.i.d., we obtain the following upper bound for the expectation of the Wasserstein metric:

E W_2^2(f_N, g_N) ≤ C h^3 Σ_{i=1}^N i E r_1^2 ≤ C_2/N.

We can similarly establish a lower bound so that

C_1/N ≤ E W_2^2(f_N, g_N) ≤ C_2/N,

where C_1 and C_2 depend only on c.
Recalling that g = g_N/(1 + r̄), we can rescale the squared Wasserstein metric to obtain

W_2^2(f_N/⟨f_N⟩, g) = (1/(1 + r̄))^2 W_2^2(f_N, g_N),

where

(1/(1 + c))^2 ≤ (1/(1 + r̄))^2 ≤ (1/(1 − c))^2.

Thus we conclude that E W_2^2(f_N/⟨f_N⟩, g) = O(1/N). □
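The O(1/N) decay is easy to observe numerically. Below is a small Monte Carlo sketch (illustrative only; it reuses the helper w2_squared_1d assumed in Section 2.1, which already normalizes its inputs and therefore handles the rescaling f_N/⟨f_N⟩ automatically):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 2001)
g = np.ones_like(x)                          # g = 1 on [0, 1]
c = 0.2                                      # noise amplitude, 0 < c < min g

def expected_misfit(N, trials=200):
    vals = []
    for _ in range(trials):
        r = rng.uniform(-c, c, size=N)
        idx = np.minimum((x * N).astype(int), N - 1)
        fN = 1.0 + r[idx]                    # piecewise constant noisy density
        vals.append(w2_squared_1d(x, fN, g))
    return np.mean(vals)

for N in (10, 20, 40, 80):
    print(N, expected_misfit(N))             # roughly halves as N doubles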

The next chapter introduces a practical numerical scheme for handling the balanced and unbalanced OT problems that arise in applications. It mainly focuses on balanced OT problems and gives a brief introduction to entropic regularization and Sinkhorn's algorithm. The final chapter will deal with unbalanced OT problems and show that unbalanced OT shares a similar numerical scheme with balanced OT.

Chapter 4

Algorithms

4.1 Entropic Regularization


First, we define the discrete entropy of a transport plan X:

H(X) = − Σ_{i,j} X_{i,j} (log(X_{i,j}) − 1)    (4.1)

Note that if X contains a non-positive entry, H(X) is taken to be negative infinity. Moreover, since X_{i,j} ≤ 1 and the Hessian of H is ∂^2 H(X) = −diag(1/X_{i,j}), H is guaranteed to be concave. So we can use −H as a regularizing function to approximately solve the original OT problem. Suppose µ and ν are two given probability distributions, α and β are dual variables, η is the regularization factor, and C is the cost matrix. We obtain the regularized form of OT:

min_X max_{α,β} L(X, α, β) = ⟨C, X⟩ − ηH(X) − ⟨α, X1 − µ⟩ − ⟨β, X^T 1 − ν⟩    (4.2)

where 1 denotes an all-ones vector. To approach the minimum, ∂_{X_{i,j}} L(X, α, β) has to be zero, and the following relationships can be derived:

X_{i,j} = e^{(−C_{i,j} + α_i + β_j)/η}
Σ_j X_{i,j} = µ_i
Σ_i X_{i,j} = ν_j    (4.3)
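Writing u_i = e^{α_i/η}, v_j = e^{β_j/η} and K_{i,j} = e^{−C_{i,j}/η} (the Gibbs kernel), the first relation in (4.3) says the optimizer is a diagonal scaling of K:

X = diag(u) K diag(v),   with  diag(u) K v = µ  and  diag(v) K^T u = ν.

Sinkhorn's algorithm alternately enforces the two marginal constraints through the updates u ← µ ⊘ (K v) and v ← ν ⊘ (K^T u), where ⊘ denotes entrywise division.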

At the end of this section, we make two comments on the regularized loss function. First, the regularization is very beneficial. As with the Lagrange multiplier approach, the formulation reduces the number of unknowns in the optimization problem from n^2 to 2n. Moreover, as the sketch below illustrates, the numerical scheme that handles it involves only simple matrix-vector products, which is memory efficient and well suited to execution on GPUs. Second, the dual problem is still a convex problem, so it can be handled with tools from convex optimization.
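As a concrete illustration, here is a minimal Python sketch of Sinkhorn's iteration for the regularized problem (4.2); the inputs mu, nu, C, and eta below are made-up, and no convergence check is included:

import numpy as np

def sinkhorn(mu, nu, C, eta, n_iters=500):
    # Returns an approximate optimal plan X = diag(u) K diag(v).
    K = np.exp(-C / eta)                    # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)                    # enforce row marginals
        v = nu / (K.T @ u)                  # enforce column marginals
    return u[:, None] * K * v[None, :]

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.25, 0.5])
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])
X = sinkhorn(mu, nu, C, eta=0.05)
print(X.sum(axis=1), X.sum(axis=0))         # marginals approach mu and nu
print((C * X).sum())                        # transport cost of the plan

Each iteration only uses matrix-vector products with K, which is exactly the memory- and GPU-friendly structure mentioned above.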

Chapter 5

Unbalanced
