Notes on the Optimal Transport Problem and Metrics
Yang YANG, EE 68
April 27, 2019
Email: yang-yan16@mails.tsinghua.edu.cn
Chapter 1
Background
However, the high complexity of the resulting linear program limited its application until entropy regularization was introduced, which enables the Sinkhorn algorithm to solve optimal transport approximately in near-linear time. On the other hand, balanced optimal transport requires the measures to be normalized. In the dynamic formulation, the time-dependent density ρ_t is constrained by the continuity equation
\[ \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0 \qquad (1.1) \]
which describes conservation of mass in a fluid; one then minimizes a kinetic-energy (Lagrangian) term subject to this constraint. By adding a source term to the right-hand side, we can generalize balanced optimal transport to the unbalanced case, which induces the Wasserstein-Fisher-Rao metric (Chizat et al., 2018). This metric is an interpolation between the Wasserstein metric and the Fisher-Rao metric. Another, equivalent formulation, which relaxes the marginal constraints on the joint distribution and is called the Hellinger-Kantorovich distance (Liero et al., 2018), will be discussed later.
Chapter 2
Formulations
If, in addition, the target density g is positive, then the optimal map from f to g is given by

\[ T(x) = G^{-1}(F(x)) \qquad (2.3) \]

where F and G are the cumulative distribution functions of f and g (see Section 3.4).
Remark: We can obtain Equation 2 from the optimal map given above by a change of variables.
We now turn to the relaxed formulation by Kantorovich.
Given two probability measures µ ∈ P(X) and ν ∈ P(Y ) and a cost function
c : X × Y → [0, +∞], solve the optimization:
\[ (KP) \qquad \inf \left\{ \int_{X \times Y} c(x, y) \, d\gamma(x, y) \;:\; \gamma \in \Pi(\mu, \nu) \right\} \qquad (2.4) \]
where Π(µ, ν) is the set of so-called transport plans satisfying the marginal constraints:

\[ \Pi(\mu, \nu) = \left\{ \gamma \in P(X \times Y) : (\pi_x)_\# \gamma = \mu, \; (\pi_y)_\# \gamma = \nu \right\} \]

where π_x and π_y are the two projections of X × Y onto X and Y, respectively.
The Kantorovich problem has linear constraints; after discretization, it becomes a standard linear program.
Given marginal constraints (a_i)_i ∈ P(X) and (b_j)_j ∈ P(Y) for the discrete probability space P(X × Y) and a family of costs (c_{ij})_{i∈X, j∈Y}, solve

\[ \min \left\{ \sum_{i,j} c_{ij} \gamma_{ij} \;:\; \gamma_{ij} \ge 0, \; \sum_j \gamma_{ij} = a_i, \; \sum_i \gamma_{ij} = b_j \right\} \qquad (2.5) \]
The discretized Kantorovich problem can be solved with any algorithm for standard linear programs, as the sketch below illustrates.
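As an illustration (a minimal sketch of our own, not from the notes: the helper name kantorovich_lp and the use of scipy.optimize.linprog are assumptions), problem (2.5) can be handed directly to an off-the-shelf LP solver:

import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Solve min <C, gamma> over gamma >= 0 with row sums a and column sums b."""
    n, m = C.shape
    # Row-marginal constraints sum_j gamma_ij = a_i, acting on gamma.ravel()
    A_rows = np.kron(np.eye(n), np.ones(m))
    # Column-marginal constraints sum_i gamma_ij = b_j
    A_cols = np.kron(np.ones(n), np.eye(m))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    return res.x.reshape(n, m), res.fun

# Example: two 2-bin histograms; 0.25 units of mass must move one step, so cost = 0.25.
a = np.array([0.5, 0.5]); b = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
plan, cost = kantorovich_lp(a, b, C)

This direct approach is only practical for small problem sizes, which is part of what motivates the regularized algorithms of Chapter 4.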
So far we have made no assumptions on the Kantorovich problem. The assumptions below are widely used in applications.
• X = Y = Ω ⊂ R^d
• c(x, y) = ‖x − y‖^p (1 ≤ p ≤ ∞)
• We restrict our analysis to the family of probability distributions

\[ P_p(\Omega) = \left\{ \mu \in P(\Omega) : \int_{\Omega} \|x\|^p \, d\mu < +\infty \right\} \]
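Under these assumptions, the optimal value of (KP) induces a metric on P_p(Ω). For later reference (Chapter 3 works with W_2^2), we record the standard definition:

\[ W_p(\mu, \nu) = \left( \min_{\gamma \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}, \qquad \mu, \nu \in P_p(\Omega). \]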
Using the same notation as in the discussion above, we are given marginal constraints (a_i)_i ∈ P(X) and (b_j)_j ∈ P(Y) for the discrete probability space P(X × Y) and a family of costs (c_{ij})_{i∈X, j∈Y}. We deduce the dual formulation in the discrete case; the key observation is the fact that

\[ \sup_{\varphi_i, \psi_j} \sum_i \varphi_i \Big( a_i - \sum_j \gamma_{ij} \Big) + \sum_j \psi_j \Big( b_j - \sum_i \gamma_{ij} \Big) = \begin{cases} 0 & \text{if } a_i = \sum_j \gamma_{ij} \text{ and } b_j = \sum_i \gamma_{ij}, \\ +\infty & \text{otherwise.} \end{cases} \]
With the observation above, we can reformulate the Kantorovich problem as

\[ \inf_{\gamma_{ij} \ge 0} \; \sum_{i,j} c_{ij} \gamma_{ij} + \sup_{\varphi_i, \psi_j} \sum_i \varphi_i \Big( a_i - \sum_j \gamma_{ij} \Big) + \sum_j \psi_j \Big( b_j - \sum_i \gamma_{ij} \Big). \]
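To complete this derivation (a standard step, supplied here rather than taken from the notes): exchanging inf and sup is justified by linear-programming duality, and the infimum over γ_{ij} ≥ 0 of \(\sum_{i,j} \gamma_{ij} (c_{ij} - \varphi_i - \psi_j)\) equals 0 when all these coefficients are non-negative and −∞ otherwise. The Kantorovich problem therefore becomes the discrete dual

\[ \sup_{\varphi, \psi} \left\{ \sum_i \varphi_i a_i + \sum_j \psi_j b_j \;:\; \varphi_i + \psi_j \le c_{ij} \ \text{for all } i, j \right\}. \]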
2.4 Hamilton-Jacobi formulations
Chapter 3
Convexity of the Wasserstein metric and insensitivity with respect to noise
We note that any shift, dilation, or amplitude change in a signal f will correspond to a similar transformation in the positive and negative parts of f. Thus, in the results below, it is sufficient to assume that the original profile f is non-negative.
Theorem 2.1 (Optimality criterion for quadratic cost). Let f and g be probability density functions supported on sets X and Y, respectively. A mass-preserving map T : X → Y minimizes the quadratic optimal transport cost if and only if it is cyclically monotone.
We shall not prove this theorem.
Theorem 2.2 (Convexity of shift). Suppose f and g are probability density functions with bounded second moments. Let T be the optimal map that rearranges f into g. If f_s(x) = f(x − sη) for η ∈ R^n, then the optimal map from f_s to g is T_s(x) = T(x − sη). Moreover, W_2^2(f_s, g) is convex with respect to the shift size s.
To verify that T_s is cyclically monotone, take any cyclic sequence x_1, …, x_n (indices taken modulo n, so x_0 = x_n). Using the cyclical monotonicity of T,

\[ \begin{aligned} \sum_{i=1}^{n} x_i \cdot T_s(x_i) &= \sum_{i=1}^{n} (x_i - s\eta) \cdot T(x_i - s\eta) + \sum_{i=1}^{n} s\eta \cdot T(x_i - s\eta) \\ &\ge \sum_{i=1}^{n} (x_i - s\eta) \cdot T(x_{i-1} - s\eta) + \sum_{i=1}^{n} s\eta \cdot T(x_{i-1} - s\eta) \\ &= \sum_{i=1}^{n} x_i \cdot T_s(x_{i-1}), \end{aligned} \]

where the second sum is unchanged because the cyclic reindexing i ↦ i − 1 merely permutes its terms.
Hence T_s is cyclically monotone and, by Theorem 2.1, optimal. Therefore

\[ \begin{aligned} W_2^2(f_s, g) &= \int |x - T_s(x)|^2 f_s(x) \, dx = \int |x - T(x - s\eta)|^2 f(x - s\eta) \, dx \\ &= \int |x + s\eta - T(x)|^2 f(x) \, dx \\ &= W_2^2(f, g) + s^2 |\eta|^2 + 2s \int \eta \cdot (x - T(x)) f(x) \, dx, \end{aligned} \]

which is quadratic in s with non-negative leading coefficient, hence convex in s.
A similar computation applies to dilations. If T(x) = L^T L x is a symmetric positive semidefinite linear map, then for any cyclic sequence x_1, …, x_n (with x_0 = x_n),

\[ \begin{aligned} \sum_{i=1}^{n} x_i \cdot \big( T(x_i) - T(x_{i-1}) \big) &= \frac{1}{2} \sum_{i=1}^{n} \big( x_i^T L^T L x_i + x_{i-1}^T L^T L x_{i-1} - 2 x_{i-1}^T L^T L x_i \big) \\ &= \frac{1}{2} \sum_{i=1}^{n} |L x_i - L x_{i-1}|^2 \\ &\ge 0, \end{aligned} \]

so such a map is cyclically monotone.
Writing the symmetric map as A = O Λ O^T with O orthogonal, we obtain

\[ \begin{aligned} W_2^2\Big( f, \frac{g}{\langle g \rangle} \Big) &= \int f(x) |x - Ax|^2 \, dx \\ &= \int f(x) \, x^T O (I - \Lambda)^2 O^T x \, dx \\ &= \int f(Oz) \, z^T (I - \Lambda)^2 z \, dz, \end{aligned} \]

where the last step substitutes x = Oz and uses |det O| = 1.
Given a loss parameter 0 ≤ β ≤ 1, we suppose that f depends on the probability density function g via

\[ f_\beta(x) = \begin{cases} \beta \, g(x), & x \in \Omega_1 \\ g(x), & x \in \Omega_2 \end{cases} \]
Theorem 2.4 (Convexity with respect to partial amplitude loss). The squared Wasserstein metric W_2^2(f_β/⟨f_β⟩, g) is a convex function of the parameter β.
The proof relies on the auxiliary profile

\[ h_\alpha(x) = \begin{cases} (1 + \alpha) \, g(x), & x \in \Omega_1 \\ (1 - \gamma_\alpha) \, g(x), & x \in \Omega_2 \end{cases} \qquad \text{where} \qquad \gamma_\alpha = \alpha \, \frac{\int_{\Omega_1} g}{\int_{\Omega_2} g}. \]

Note that h_α is non-negative and has unit mass by construction, with h_0 = g.
Proof: Choose any α_1, α_2 ∈ [−1, 0] and s ∈ [0, 1]. From the convexity of the Monge-Kantorovich minimization problem in its first argument, we have

\[ W_2^2\big( s \, h_{\alpha_1} + (1 - s) \, h_{\alpha_2}, \, g \big) \le s \, W_2^2(h_{\alpha_1}, g) + (1 - s) \, W_2^2(h_{\alpha_2}, g). \]

We can calculate
\[ \begin{aligned} s \, h_{\alpha_1} + (1 - s) \, h_{\alpha_2} &= \begin{cases} \big[ s(1 + \alpha_1) + (1 - s)(1 + \alpha_2) \big] g, & x \in \Omega_1 \\ \big[ s(1 - \gamma_{\alpha_1}) + (1 - s)(1 - \gamma_{\alpha_2}) \big] g, & x \in \Omega_2 \end{cases} \\ &= \begin{cases} \big( 1 + s\alpha_1 + (1 - s)\alpha_2 \big) g, & x \in \Omega_1 \\ \big( 1 - s\gamma_{\alpha_1} - (1 - s)\gamma_{\alpha_2} \big) g, & x \in \Omega_2 \end{cases} \\ &= h_{s\alpha_1 + (1 - s)\alpha_2}, \end{aligned} \]

where the last equality uses the linearity of α ↦ γ_α.
Proof. Define the parameter s ∈ [0, 1] by s = α_2/α_1. Then we can use the convexity result of Lemma 2.3 to compute
3.4 Insensitivity with respect to noise
We shall only consider the one-dimensional case. In one dimension, it is possible to solve the optimal transportation problem exactly in terms of the cumulative distribution functions

\[ F(x) = \int_{-\infty}^{x} f(t) \, dt, \qquad G(x) = \int_{-\infty}^{x} g(t) \, dt. \]
If, in addition, the target density g is positive, then the optimal map from f to g is given by

\[ T(x) = G^{-1}(F(x)) \]

Remark: We can obtain Equation 2 from the optimal map given above by a change of variables. A numerical sketch of this map follows.
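As a small numerical sketch (our own, not from the notes; the grid x and the helper name optimal_map_1d are illustrative assumptions), the map T = G^{-1} ∘ F can be approximated from sampled densities:

import numpy as np

def optimal_map_1d(x, f, g):
    """Approximate T(x) = G^{-1}(F(x)) for densities f, g sampled on a grid x."""
    F = np.cumsum(f); F /= F[-1]  # cumulative distribution of f
    G = np.cumsum(g); G /= G[-1]  # cumulative distribution of g
    # G is strictly increasing when g > 0, so invert it by interpolation.
    return np.interp(F, G, x)     # T(x_i) = G^{-1}(F(x_i))

Here np.interp inverts G by piecewise-linear interpolation, which is well defined precisely under the positivity assumption on g made above.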
where 1 ≤ i ≤ N. Then we can bound the squared Wasserstein metric by

\[ W_2^2(f_N, g_N) = \int_0^{1 + \bar{r}} \big| F_N^{-1}(t) - G_N^{-1}(t) \big|^2 \, dt \;\le\; \frac{2 h^3}{(1 - c)^2} \sum_{i=1}^{N} \left( \sum_{l=1}^{i} r_l - i h \sum_{k=1}^{N} r_k \right)^2. \]
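This insensitivity can be probed numerically. The sketch below is our own illustration, not the setting of the bound above: the noise level, grid, and renormalization are assumptions. It evaluates W_2^2 through the quantile formula used in this section and shows that additive high-frequency noise perturbs the metric only mildly:

import numpy as np

def w2_squared_1d(x, f, g, n_quantiles=2000):
    """W_2^2 between densities f, g on a grid x, via the quantile formula."""
    F = np.cumsum(f); F /= F[-1]
    G = np.cumsum(g); G /= G[-1]
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.mean((np.interp(t, F, x) - np.interp(t, G, x)) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 1000)
f = np.exp(-x**2 / 2)                                          # clean profile
g = np.clip(f + 0.1 * rng.standard_normal(x.size), 0.0, None)  # noisy profile
f, g = f / f.sum(), g / g.sum()                                # renormalize to unit mass
print(w2_squared_1d(x, f, g))                                  # remains small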
Chapter 4
Algorithms
The regularized problem admits a solution of the form

\[ X_{i,j} = e^{(\varphi_i + \psi_j - c_{i,j})/\eta} \qquad (4.3) \]

subject to the marginal constraints

\[ \sum_j X_{i,j} = \mu_i, \qquad \sum_i X_{i,j} = \nu_j. \]
To close this section, we make two comments on the regularized loss function. First, the regularization is very beneficial: as in the Lagrange-multiplier approach, the dual formulation reduces the number of unknowns of the optimization problem from n² to 2n. Besides, it will be shown in the next section that the numerical scheme handling it involves only simple matrix-vector products, which is memory-efficient and well suited to execution on GPUs (see the sketch below). Second, the dual problem is still a convex problem; thus, it can be dealt with using tools from convex optimization.
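That scheme is the Sinkhorn iteration mentioned in Chapter 1. A minimal sketch follows (our own: the function name, the value of η, and the fixed iteration count are illustrative choices), with u_i = e^{φ_i/η} and v_j = e^{ψ_j/η}, so that X = diag(u) K diag(v) matches the form (4.3):

import numpy as np

def sinkhorn(mu, nu, C, eta=0.1, n_iter=1000):
    """Alternately rescale K = exp(-C/eta) until X = diag(u) K diag(v) has marginals mu, nu."""
    K = np.exp(-C / eta)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)  # enforce column marginals: sum_i X_ij = nu_j
        u = mu / (K @ v)    # enforce row marginals:    sum_j X_ij = mu_i
    X = u[:, None] * K * v[None, :]
    return X, float(np.sum(X * C))

Each iteration consists of two matrix-vector products, which is exactly the memory-efficient, GPU-friendly structure referred to above.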
Chapter 5
Unbalanced