Optimization - Applications to Image Processing
Contents

1 Generalities
  1.1 Optimization problems
  1.2 Analysis of the optimization problem
    1.2.1 Reminders
    1.2.2 Existence, uniqueness of the solution
    1.2.3 Equivalent optimization problems
  1.3 Optimization algorithms
    1.3.1 Iterative minimization methods
    1.3.2 Local convergence rate

2 Unconstrained differentiable problems

3 Constrained optimization
  3.1 Preliminaries
  3.2 Optimality conditions
    3.2.1 Projection theorem
  3.3 General methods
    3.3.1 Gauss-Seidel under hyper-cube constraint
    3.3.2 Gradient descent with projection and varying step-size
    3.3.3 Penalty
  3.4 Equality constraints
    3.4.1 Lagrange multipliers
  3.5
  3.6
  3.7
  3.8

4 Non differentiable problems

5 Appendix
Objectives: obtain a good knowledge of the most important optimization methods, provide tools to help reading the literature, and conceive and analyze new methods. At the implementation stage, numerical methods always operate on $\mathbb{R}^n$.

Main references: [7, 14, 26, 44, 45, 46, 53, 69, 70, 72]

For reminders, check Connexions: http://cnx.org/
Chapter 1

Generalities

[Figures: original image $u$, data $d$ (Gamma noise), restored image $\hat u$; original image ("Bebe volant"), jitter, restoration.]
1.1 Optimization problems

$$(P)\qquad \text{find } \hat u \in U \ \text{ such that } \ F(\hat u) = \inf_{u \in U} F(u).$$

Equality constraints:
$$U = \{u \in V : g_i(u) = 0,\ i = 1, \ldots, p\}, \quad p < n. \tag{1.1}$$

Inequality constraints:
$$U = \{u \in V : h_i(u) \le 0,\ i = 1, \ldots, q\}, \quad q \in \mathbb{N}. \tag{1.2}$$

Quadratic objective:
$$F(u) = \frac{1}{2}\langle Bu, u\rangle - \langle c, u\rangle. \tag{1.4}$$
A typical criterion combines a data-fidelity term with a regularization term,
$$F(u) = \|Au - v\|^2 + \beta \sum_i \varphi(\|D_i u\|), \tag{1.5}$$
where $D_i u$ is a discrete approximation of the gradient or the Laplacian of the image or signal at $i$, or of finite differences, $\|\cdot\|$ is usually the $\ell_2$ norm, $\beta > 0$ is a parameter, and $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ is an increasing function, e.g. $\varphi(t) = t^\alpha$, $1 \le \alpha \le 2$.
A $p \times q$ image is column-stacked into $u \in \mathbb{R}^n$, $n = pq$:
$$\begin{pmatrix}
u[1] & u[p+1] & \cdots & u[(q-1)p+1] \\
u[2] & u[p+2] & \cdots & u[(q-1)p+2] \\
\vdots & & \ddots & \vdots \\
u[p] & u[2p] & \cdots & u[n]
\end{pmatrix}$$
so that the upper neighbor of $u[i]$ in the image is $u[i-1]$ and its left neighbor is $u[i-p]$.
Often
$$\|D_i u\| = \Big( (u[i] - u[i-m])^2 + (u[i] - u[i-1])^2 \Big)^{1/2} \tag{1.8}$$
or
$$D_k u = u[i] - u[i-m] \quad \text{and} \quad D_{k+1} u = u[i] - u[i-1]. \tag{1.9}$$

Recall the $\ell_p$ norms: for $p \ge 1$,
$$\|v\|_p = \Big( \sum_i |v[i]|^p \Big)^{1/p},$$
and for $p = +\infty$, $\|v\|_\infty = \max_i |v[i]|$.
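The definitions (1.8) and of the $\ell_p$ norms are easy to make concrete. Below is a small Python sketch (NumPy assumed available; the function names are ours, not from the text) that computes the difference magnitudes of a column-stacked image with $m$ rows, and $\ell_p$ norms including $p = +\infty$:

```python
import numpy as np

def first_order_diffs(u, m):
    """Difference magnitudes (1.8) of a column-stacked m-row image u.
    Boundary samples (first row / first column) are skipped for simplicity."""
    n = u.size
    Du = []
    for i in range(n):
        if i - m >= 0 and i % m != 0:      # has a left and an upper neighbor
            dv = u[i] - u[i - m]           # left neighbor (previous column)
            dh = u[i] - u[i - 1]           # upper neighbor (same column)
            Du.append(np.hypot(dv, dh))    # ((dv)^2 + (dh)^2)^(1/2)
    return np.array(Du)

def lp_norm(v, p):
    """l_p norm; p = np.inf gives the max norm."""
    if p == np.inf:
        return np.max(np.abs(v))
    return np.sum(np.abs(v) ** p) ** (1.0 / p)
```

On a constant image all difference magnitudes vanish, and `lp_norm([3, 4], 2)` recovers the familiar Euclidean length 5.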
Illustration of the role of $\varphi$, for $F(u) = \|a \star u - v\|_2^2 + \beta\sum_i \varphi(D_i u)$, with data $v = a \star u + n$, $n$ normal noise, $a$ a blur, SNR = 20 dB.

[Figure: original image and data $v$; restorations of rows 54 and 90 for several choices of $\varphi$, including $\varphi(t) = \min\{t^2, 1\}$ and $\varphi(t) = 1 - 1\mathrm{l}(t = 0)$, and with no regularization ($\beta = 0$ in (1.5)). $D$ is a first-order difference operator, i.e. $D_i u = u[i] - u[i+1]$, $1 \le i \le p-1$. Data (dashed), restored signal (solid). Constant pieces in (a) are emphasized, as are data samples that are equal to the corresponding samples of the minimizer in (b).]
Table of Contents
1. Generalities
2. Unconstrained differentiable problems
3. Constrained optimization
4. Non differentiable problems
1.2 Analysis of the optimization problem

1.2.1 Reminders

[Figure: a nonconvex set $U$ and a strictly convex set $U$.]

- If $F$ is l.s.c.$^2$ and $u_k \to u$ as $k \to \infty$, then
$$\liminf_{k \to \infty} F(u_k) \ge F(u).$$
- If $(F_i)_{i \in I}$, where $I$ is an index set, is a family of l.s.c. functions, then the superior envelope of $(F_i)_{i \in I}$ is l.s.c. In words, the function $F$ defined by
$$F(u) = \sup_{i \in I} F_i(u)$$
is l.s.c.
- If $(F_i)_{i \in I}$ is a family of l.s.c. convex functions, then the superior envelope $F$ of $(F_i)_{i \in I}$ is l.s.c. and convex.

[Figure: functions $F_1$, $F_2$, $F_3$ and their superior envelope.]

1.2.2 Existence, uniqueness of the solution
Then $\tilde U \stackrel{\rm def}{=} B(u_0, r) \cap U$ is compact. The conclusion is obtained as in (a).

Let us underline that the conditions in the theorem are only strong sufficient conditions. Much weaker existence conditions can be found e.g. in [10, p. 96].

Alternative proof using minimizing sequences.

The theorem extends to separable Hilbert spaces under additional conditions, see e.g. [26, 5].

Optimization problems are often called feasible when $U$ and $F$ are convex and the conditions of Theorem 1 are met. Recall that $F$ can be convex but not coercive (Fig. 1.1(b)).

Theorem 2 For $U \subset V$ convex, let $F : U \to \mathbb{R}$ be proper, convex and l.s.c.
1. If $F$ has a relative (local) minimum at $\hat u \in U$, it is a (global) minimum w.r.t. $U$. [26]
2. The set of minimizers $\hat U = \{\hat u \in U : F(\hat u) = \inf_{u \in U} F(u)\}$ is convex and closed. [38, p. 35]
3. If $F$ is strictly convex, then $F$ admits at most one minimum and the latter is strict.
4. In addition, suppose that either $F$ is coercive or $U$ is bounded. Then $\hat U \ne \emptyset$. [38, p. 35]

$^3$ Weierstrass theorem: Let $V$ be an n.v.s. and $K \subset V$ compact. If $F : V \to \mathbb{R}$ is l.s.c. on $K$, then $F$ achieves a minimum on $K$. If $F$ is upper semicontinuous on $K$, then $F$ achieves a maximum on $K$. If $F$ is continuous on $K$, then $F$ achieves both a minimum and a maximum on $K$. See e.g. [52, p. 40].
[Figure 1.1: Illustration of Definition 6 and Theorem 2; minimizers are depicted with a thick point. (a) $F$ nonconvex on $U = [-1, 8]$: one case with no minimizer, one case with two local minimizers. (b) $F$ convex on $U = [-1, 8]$: a set of minimizers.]

1.2.3 Equivalent optimization problems
$$(P)\qquad \text{find } \hat u \ \text{ such that } \ F(\hat u) = \inf_{u \in U} F(u).$$

There are many different optimization problems that are in some way equivalent to $(P)$. For instance
$$(\hat u, \hat b) = \arg\min_{u \in U}\max_{b \in X} \tilde F(u, b), \quad \text{where } \tilde F : V \times W \to \mathbb{R},\ U \subset V,\ X \subset W, \text{ and } \hat u \text{ solves } (P).$$

Some of these equivalent optimization problems are easier to solve than the original $(P)$. We shall see such reformulations in what follows. How to find them is in general an open question. In many cases, such a reformulation (found by intuition and proven mathematically) is valid only for a particular problem $(P)$.

There are several duality principles in optimization theory relating a problem expressed in terms of vectors in an n.v.s. $V$ to a problem expressed in terms of hyperplanes in the n.v.s.

Definition 7 A hyperplane (affine) is a set of the form
$$[h = \alpha] \stackrel{\rm def}{=} \{w \in V \mid \langle h, w\rangle = \alpha\},$$
where $h : V \to \mathbb{R}$ is a nonzero linear functional and $\alpha \in \mathbb{R}$ is a constant.

A milestone is the Hahn-Banach theorem, stated below in its geometric form [15, 52].
Definition 8 Let $U \subset V$ and $K \subset V$. The hyperplane $[h = \alpha]$ separates $K$ and $U$
- nonstrictly if $\langle h, w\rangle \le \alpha,\ \forall w \in U$ and $\langle h, w\rangle \ge \alpha,\ \forall w \in K$;

[Figure 1.2: Green line: $[h = \alpha] = \{w \mid \langle h, w\rangle = \alpha\}$ separates strictly $\{u\}$ and $U$: for $v \in U$ we have $\langle h, v\rangle < \alpha$, whereas $\langle h, u\rangle > \alpha$. The orthogonal projection of $u$ onto $U$ is $\hat u$ (red dots). Lines in magenta provide nonstrict separation between $u$ and $U$.]
One systematic approach to derive equivalent optimization problems is centered on the interrelation between an n.v.s. $V$ and its dual.

Definition 9 The dual $V^*$ of an n.v.s. $V$ is composed of all continuous linear functionals on $V$; it is equipped with the norm
$$\|f\|_{V^*} \stackrel{\rm def}{=} \sup_{u \in V,\ \|u\| \le 1} |f(u)|.$$

$^4$ The hyperplane $[h = \alpha]$ is closed iff $h$ is continuous [15, p. 4], [52, p. 130]. If $V = \mathbb{R}^n$, any hyperplane is closed.
$^5$ Recall that if $K$ is a subset of a finite-dimensional space, it is compact iff $K$ is closed and bounded.
If $V = \mathbb{R}^n$ then $V^* = \mathbb{R}^n$ (see e.g. [52, p. 107]). The dual of the Hilbert space$^6$ $L^2$ (resp. $\ell^2$) is again $L^2$ (resp. $\ell^2$), see e.g. [52, pp. 107-109]. Clearly, all these spaces are reflexive as well.

Definition 10 Let $F$ defined on a real n.v.s. $V$ be proper. The (convex conjugate) function $F^* : V^* \to \mathbb{R}$ is given by
$$F^*(v) \stackrel{\rm def}{=} \sup_{u \in V} \big( \langle u, v\rangle - F(u) \big). \tag{1.10}$$

[Figure 1.3: The convex conjugate at $v$ determines a hyperplane that separates nonstrictly ${\rm epi}\,F$ and its complement in $\mathbb{R}^n$.]

The application $v \mapsto \langle u, v\rangle - F(u)$ is convex and continuous, hence l.s.c., for any fixed $u \in V$. According to Property 1, $F^*$ is convex and l.s.c. (being a superior convex envelope).
Example 1 Let $f(u) = \frac{1}{\alpha}|u|^\alpha$ where $u \in \mathbb{R}$ and $\alpha \in\ ]1, +\infty[$. Then
$$f^*(v) = \sup_{u \in \mathbb{R}} \Big( uv - \frac{1}{\alpha}|u|^\alpha \Big).$$
The term between the parentheses obviously has a unique global maximizer $\hat u$. The latter satisfies $v = |\hat u|^{\alpha-1}\,{\rm sign}(\hat u)$, hence ${\rm sign}(\hat u) = {\rm sign}(v)$. Then
$$\hat u = |v|^{\frac{1}{\alpha-1}}\,{\rm sign}(v). \tag{1.12}$$
Set $\beta = \frac{\alpha}{\alpha-1}$, i.e. $\frac{1}{\alpha} + \frac{1}{\beta} = 1$. Then
$$f^*(v) = {\rm sign}(v)\,v\,|v|^{\frac{1}{\alpha-1}} - \frac{1}{\alpha}|v|^{\frac{\alpha}{\alpha-1}} = \Big(1 - \frac{1}{\alpha}\Big)|v|^{\beta} = \frac{1}{\beta}|v|^{\beta}. \tag{1.13}$$
More generally, let $F(u) = \frac{1}{\alpha}\|u\|_\alpha^\alpha$ for $u \in \mathbb{R}^n$. Then for $v \in \mathbb{R}^n$,
$$F^*(v) = \sup_{u \in \mathbb{R}^n} \Big( \langle u, v\rangle - \frac{1}{\alpha}\|u\|_\alpha^\alpha \Big) = \sup_{u \in \mathbb{R}^n} \sum_{i=1}^n \Big( u[i]v[i] - \frac{1}{\alpha}|u[i]|^\alpha \Big) = \sum_{i=1}^n \sup_{u \in \mathbb{R}} \Big( u\,v[i] - \frac{1}{\alpha}|u|^\alpha \Big) = \sum_{i=1}^n f^*(v[i]) = \frac{1}{\beta}\|v\|_\beta^\beta.$$
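The closed form (1.13) is easy to verify numerically. A brute-force sketch (the grid-search bounds and test values are our choice, not from the text) compares $\sup_u (uv - \frac{1}{\alpha}|u|^\alpha)$ against $\frac{1}{\beta}|v|^\beta$:

```python
import numpy as np

def conjugate_num(v, alpha, grid):
    """Brute-force evaluation of f*(v) = sup_u (u v - |u|^alpha / alpha)."""
    return np.max(grid * v - np.abs(grid) ** alpha / alpha)

alpha = 3.0
beta = alpha / (alpha - 1.0)              # conjugate exponent: 1/alpha + 1/beta = 1
grid = np.linspace(-10.0, 10.0, 200001)   # fine grid containing the maximizer (1.12)
vals_num = [conjugate_num(v, alpha, grid) for v in (-1.5, 0.3, 2.0)]
vals_cf = [np.abs(v) ** beta / beta for v in (-1.5, 0.3, 2.0)]
```

The maximizer $\hat u = |v|^{1/(\alpha-1)}{\rm sign}(v)$ lies well inside the grid for these values, so the two lists agree to grid accuracy.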
The biconjugate of $F$ is
$$F^{**}(u) = \sup_{v \in V^*} \big( \langle u, v\rangle - F^*(v) \big), \quad u \in V,\ v \in V^*.$$
The proof can be found in [15, p. 11] or in [52, p. 201] (where the assumptions on $\varphi$ and $\psi$ are slightly different).
Example 2 $V$ is a real reflexive n.v.s. Let $U \subset V$ and $K \subset V^*$ be convex compact nonempty subsets. Define
$$\sigma(u) \stackrel{\rm def}{=} \max_{v \in K} \langle v, u\rangle, \tag{1.14}$$
where the Schwarz inequality shows that $\sigma(u)$ is finite. Note that we can replace $\sup$ with $\max$ in (1.14) because $K$ is compact and $v \mapsto \langle v, u\rangle$ is continuous (Weierstrass theorem). Clearly $\max_{v \in K}\langle v, u\rangle$ is finite, and $\sigma$ is continuous and convex.

Let $w \in V^* \setminus K$. The set $\{w\}$ being compact, the Hahn-Banach theorem tells us that there exists $h \in V^{**} = V$ enabling us to separate strictly $\{w\}$ and $K$, that is
$$\langle v, h\rangle < \langle w, h\rangle, \quad \forall v \in K,$$
and $c \stackrel{\rm def}{=} \langle w, h\rangle - \max_{v \in K}\langle v, h\rangle$ satisfies $c > 0$. Then, for any $\lambda > 0$,
$$\sigma^*(w) = \sup_{u \in V}\big( \langle w, u\rangle - \sigma(u) \big) \ \ge\ \langle w, \lambda h\rangle - \max_{v \in K}\langle v, \lambda h\rangle = \lambda c \ \to\ +\infty \ \text{as } \lambda \to \infty, \tag{1.16}$$
hence $\sigma^*(w) = +\infty$, $\forall w \notin K$.

Define the indicator function of $U$,
$$\chi(u) = \begin{cases} 0 & \text{if } u \in U, \\ +\infty & \text{if } u \notin U. \end{cases}$$
By the definition of $\sigma$ in (1.14) and that of $\chi$, the Fenchel-Rockafellar Theorem 5 yields
$$\min_{u \in U}\max_{v \in K}\langle u, v\rangle = \min_{u \in U}\sigma(u) = \min_{u \in V}\big( \sigma(u) + \chi(u) \big) = \max_{v \in V^*}\big( -\sigma^*(v) - \chi^*(-v) \big). \tag{1.17}$$
The maximum on the right side cannot be reached if $v \notin K$ because in that case $\sigma^*(v) = +\infty$. Hence $\max_{v \in V^*}(-\sigma^*(v) - \chi^*(-v)) = \max_{v \in K}(-\chi^*(-v))$. Furthermore
$$-\chi^*(-v) = -\sup_{u \in U}\big( \langle -v, u\rangle - \chi(u) \big) = \inf_{u \in U}\langle v, u\rangle = \min_{u \in U}\langle v, u\rangle.$$
It follows that
$$\max_{v \in V^*}\big( -\sigma^*(v) - \chi^*(-v) \big) = \max_{v \in K}\min_{u \in U}\langle v, u\rangle. \tag{1.18}$$
Combining (1.17) and (1.18) shows that $\min_{u \in U}\max_{v \in K}\langle u, v\rangle = \max_{v \in K}\min_{u \in U}\langle v, u\rangle$.

In Example 2 we have proven the fundamental Min-Max theorem, often used in classical game theory. The precise statement is given below.

Theorem 6 (Min-Max) Let $V$ be a reflexive real n.v.s. Let $U \subset V$ and $K \subset V^*$ be convex, compact nonempty subsets. Then
$$\min_{u \in U}\max_{v \in K}\langle u, v\rangle = \max_{v \in K}\min_{u \in U}\langle v, u\rangle.$$
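Theorem 6 can be checked numerically on a toy instance. Below, $U = [1, 2]$ and $K = [-1, 1]$ (our choice of intervals, not from the text) with the pairing $\langle u, v\rangle = uv$; both sides of the min-max identity are computed by grid search:

```python
import numpy as np

U = np.linspace(1.0, 2.0, 101)    # convex compact U
K = np.linspace(-1.0, 1.0, 101)   # convex compact K
P = np.outer(U, K)                # P[i, j] = u_i * v_j

minmax = P.max(axis=1).min()      # min over U of max over K
maxmin = P.min(axis=0).max()      # max over K of min over U
```

Both quantities equal 1 here: for $u \ge 1$ the inner max is attained at $v = 1$, and the inner min at $v = 1$ is attained at $u = 1$.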
1.3 Optimization algorithms
Usually the solution $\hat u$ is defined implicitly and cannot be computed in an explicit way, in one step.

1.3.1 Iterative minimization methods

$$u_1 = G(u_0); \qquad u_k = \underbrace{G \circ \cdots \circ G}_{k \text{ times}}(u_0) \stackrel{\rm def}{=} G^k(u_0) = G\big(G^{k-1}(u_0)\big). \tag{1.19}$$

Key questions:
- Given an iterative method $G$, determine if there is convergence;
- If convergence, to what kind of accumulation point?
- Given two iterative methods $G_1$ and $G_2$, choose the faster one.

Definition 11 $\hat u$ is a fixed point for $G$ if $G(\hat u) = \hat u$.

Let $X$ be a metric space equipped with distance $d$.

Definition 12 $G : X \to X$ is a contraction if there exists $\theta \in (0, 1)$ such that
$$d\big(G(u_1), G(u_2)\big) \le \theta\, d(u_1, u_2), \quad \forall u_1, u_2 \in X.$$
$G$ contraction $\Rightarrow$ $G$ Lipschitzian $\Rightarrow$ $G$ uniformly continuous.

Theorem 7 (Fixed point theorem) Let $X$ be complete and $G : X \to X$ a contraction. Then $G$ admits a unique fixed point, $\hat u = G(\hat u)$ (see [77, p. 141]).

Theorem 8 (Fixed point theorem, bis) Let $X$ be complete and $G : X \to X$ be such that $\exists k_0$ for which $G^{k_0}$ is a contraction. Then $G$ admits a unique fixed point (see [77, p. 142]).

Note that in the latter case $G$ is not necessarily a contraction.
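Theorem 7 suggests the obvious algorithm: iterate $u_{k+1} = G(u_k)$ (Picard iteration) until successive iterates stop moving. A minimal sketch with $G = \cos$ on $[0, 1]$, which is a contraction there since $|G'(u)| = |\sin u| < 1$ (the tolerance and iteration cap are our choices):

```python
import math

def fixed_point(G, u0, tol=1e-12, max_iter=1000):
    """Picard iteration u_{k+1} = G(u_k); converges when G is a contraction
    on a complete metric space containing the iterates (Theorem 7)."""
    u = u0
    for _ in range(max_iter):
        u_next = G(u)
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u

u_hat = fixed_point(math.cos, 1.0)   # the unique fixed point of cos
```

The iteration converges geometrically with ratio roughly $|\sin \hat u| \approx 0.67$, so a few dozen iterations suffice.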
1.3.2 Local convergence rate

$$Q_k = \frac{\|u_{k+1} - \hat u\|}{\|u_k - \hat u\|}, \ k \in \mathbb{N}, \qquad Q = \limsup_{k \to \infty} Q_k.$$

- $0 < R < 1$: R-linear convergence; this means geometric or exponential convergence, since $\forall \varepsilon \in (0, 1 - R)$, $\exists k_0$ such that $R_k \le R + \varepsilon$, $\forall k \ge k_0$, hence $\|u_k - \hat u\| \le (R + \varepsilon)^k$;
- $R = 0$: R-superlinear convergence.

$G_1$ is faster than $G_2$ in the sense of R if $R(G_1, \hat u) < R(G_2, \hat u)$.

Remark 1 Sublinear convergence if $Q_k \to 1$ or $R_k \to 1$: convergence is too slow, choose another $G$.

Theorem 9 Let $\|u_k - \hat u\| \le v_k$ where $(v_k)_{k \in \mathbb{N}}$ converges to 0 Q-superlinearly. Then $(u_k)_{k \in \mathbb{N}}$ converges to $\hat u$ R-superlinearly.

Theorem 10 Let $u_k \to \hat u$ R-linearly. Then $\exists (v_k)_{k \in \mathbb{N}}$ that tends to 0 Q-linearly and such that $\|u_k - \hat u\| \le v_k$.

Proofs of the last two theorems: see [14, p. 15].
Theorem 11 Let $G$ be differentiable at its fixed point $\hat u$ and
$$\rho\big(\nabla G(\hat u)\big) < 1. \tag{1.20}$$
Then there is a neighborhood $O(\hat u)$ such that if $u_0 \in O(\hat u)$, we have $u_k \in O(\hat u)$, $\forall k$, and $u_k \to \hat u$.

Proof. $\nabla G(\hat u)$ is not necessarily symmetric and positive definite. Condition (1.20) says that for $\varepsilon > 0$ there is an induced matrix norm$^8$ $\|\cdot\|$ on $\mathbb{R}^{n \times n}$ such that
$$\|\nabla G(\hat u)\| \le \rho\big(\nabla G(\hat u)\big) + \varepsilon. \tag{1.21}$$
Choose
$$\varepsilon < \frac{1}{2}\Big( 1 - \rho\big(\nabla G(\hat u)\big) \Big).$$
The differentiability of $G$ at $\hat u$ implies that $\exists \eta > 0$ so that $B(\hat u, \eta) \subset U$ and
$$\big\| G(u) - G(\hat u) - \nabla G(\hat u)(u - \hat u) \big\| \le \varepsilon\|u - \hat u\|, \quad \forall u \in B(\hat u, \eta). \tag{1.22}$$
Using that $G(\hat u) = \hat u$, and combining (1.21) and (1.22), we get
$$\|G(u) - \hat u\| = \big\| G(u) - G(\hat u) - \nabla G(\hat u)(u - \hat u) + \nabla G(\hat u)(u - \hat u) \big\|$$
$$\le \big\| G(u) - G(\hat u) - \nabla G(\hat u)(u - \hat u) \big\| + \|\nabla G(\hat u)\|\,\|u - \hat u\|$$
$$\le \big( \rho(\nabla G(\hat u)) + 2\varepsilon \big)\|u - \hat u\| < \|u - \hat u\|, \quad \forall u \in B(\hat u, \eta).$$
The conclusion follows from the observation that $\rho(\nabla G(\hat u)) + 2\varepsilon < 1$ and Lemma 1.

$^8$ Let $B$ be an $n \times n$ real matrix. Its spectral radius is defined as its largest eigenvalue in absolute value,
$$\rho(B) \stackrel{\rm def}{=} \max_{1 \le i \le n} |\lambda_i(B)|.$$
Given a vector norm $\|\cdot\|$ on $\mathbb{R}^n$, the corresponding induced matrix norm on the space of all $n \times n$ matrices reads
$$\|B\| \stackrel{\rm def}{=} \sup\{\|Bu\| : u \in \mathbb{R}^n,\ \|u\| \le 1\}.$$
Theorem 13 (linear convergence) Under the conditions of Theorem 11, $\max_{1 \le i \le n}|\lambda_i(\nabla G(\hat u))| = R$, the rate of R-linear convergence.
[Figure: (a) original, (b) noisy, (d) no convergence.]
Chapter 2

Unconstrained differentiable problems

2.1 Preliminaries (regularity of F)

2. If $Y = \mathbb{R}$ (i.e. $f : V \to \mathbb{R}$) and the norm on $V$ is derived from an inner product $\langle \cdot, \cdot\rangle$, then $L(V; \mathbb{R})$ is identified with $V$ (via a canonical isomorphism). Then $\nabla f(u) \in V$, the gradient of $f$ at $u$, is defined by
$$\langle \nabla f(u), v\rangle = Df(u)v, \quad \forall v \in V.$$
Note that if $V = \mathbb{R}^n$ we have $Df(u) = (\nabla f(u))^T$.

3. Under the conditions given above in 2., if $f$ is twice differentiable, we can identify the second differential $D^2 f$ and the Hessian $\nabla^2 f$ via $D^2 f(u)(v, w) = \langle \nabla^2 f(u)v, w\rangle$.

Property 2 Let $U \subset V$ be convex. Suppose $F : V \to \mathbb{R}$ is differentiable in $O(U)$.
1. $F$ convex on $U$ $\Leftrightarrow$ $F(v) \ge F(u) + \langle \nabla F(u), v - u\rangle$, $\forall u, v \in U$.
2. $F$ strictly convex on $U$ $\Leftrightarrow$ $F(v) > F(u) + \langle \nabla F(u), v - u\rangle$, $\forall u, v \in U$, $v \ne u$.
A proof of statement 1 can be found in the Appendix, p. 106.

Property 3 Let $U \subset V$ be convex. Suppose $F$ twice differentiable in $O(U)$.
1. $F$ convex on $U$ $\Leftrightarrow$ $\langle \nabla^2 F(u)(v - u), (v - u)\rangle \ge 0$, $\forall u, v \in U$.
2. If $\langle \nabla^2 F(u)(v - u), (v - u)\rangle > 0$, $\forall u, v \in U$, $v \ne u$, then $F$ is strictly convex on $U$.

Definition 13 $F$ is called strongly convex, or equivalently elliptic, if $F \in C^1$ and if $\exists \alpha > 0$ such that
$$\langle \nabla F(v) - \nabla F(u),\ v - u\rangle \ge \alpha\|v - u\|^2, \quad \forall v, u \in V.$$
2.2

[Fragment: minimization proceeds coordinate-wise, each component $i$ being updated from $u_{k+1}[1], \ldots, u_{k+1}[i-1], u_k[i+1], \ldots, u_k[n]$ (see [44]).]
2.3

Descent methods use $F(u)$ and $\nabla F(u)$ and rely on $F(u_k - \alpha_k d_k) - F(u_k) \approx -\alpha_k\langle \nabla F(u_k), d_k\rangle$. Different tasks:
- Choose a descent direction $d_k$, such that $F(u_k - \alpha d_k) < F(u_k)$ for $\alpha > 0$ small enough;
- Line search: find $\alpha_k$ such that $F(u_k - \alpha_k d_k) - F(u_k) < 0$ is sufficiently negative;
- Stopping rule: introduce an error function which measures the quality of an approximate solution $u$, e.g. choose a norm $\|\cdot\|$, $\varepsilon > 0$ (small enough) and test:
  - Iterates: if $\|u_{k+1} - u_k\| \le \varepsilon$, then stop;
  - Gradient: if $\|\nabla F(u_k)\| \le \varepsilon$, then stop;
  - Objective: if $F(u_k) - F(u_{k+1}) \le \varepsilon$, then stop;
  or a combination of these.
Descent direction $d_k$

$F$ differentiable: $d_k$ is a descent direction at $u_k$ if
$$F(u_k - \alpha d_k) < F(u_k) \ \text{ for all } \alpha > 0 \text{ small enough.} \tag{2.1}$$

[Figure: $F : \mathbb{R}^2 \to \mathbb{R}$ smooth nonconvex, two minimizers.]

By differentiability,
$$F(u_k - \alpha d_k) = F(u_k) - \alpha\langle \nabla F(u_k), d_k\rangle + \alpha\|d_k\|\,\varepsilon(\alpha d_k), \tag{2.3}$$
where
$$\varepsilon(\alpha d_k) \to 0 \ \text{ as } \ \alpha \to 0. \tag{2.4}$$
Then (2.3) can be rewritten as
$$F(u_k) - F(u_k - \alpha d_k) = \alpha\big( \langle \nabla F(u_k), d_k\rangle - \|d_k\|\,\varepsilon(\alpha d_k) \big).$$
Consider that $\langle \nabla F(u_k), d_k\rangle > 0$. By (2.4), there is $\bar\alpha > 0$ such that $\langle \nabla F(u_k), d_k\rangle > \|d_k\|\,\varepsilon(\alpha d_k)$, $\forall\, 0 < \alpha < \bar\alpha$. Then $d_k$ is a descent direction since $F(u_k) - F(u_k - \alpha d_k) > 0$, $\forall\, 0 < \alpha < \bar\alpha$.

When $F$ is convex, we know by Property 2-1 that $F(u_k - \alpha d_k) \ge F(u_k) - \alpha\langle \nabla F(u_k), d_k\rangle$, i.e.
$$F(u_k) - F(u_k - \alpha d_k) \le \alpha\langle \nabla F(u_k), d_k\rangle.$$
If $\langle \nabla F(u_k), d_k\rangle \le 0$ then $F(u_k) - F(u_k - \alpha d_k) \le 0$, hence $d_k$ is not a descent direction. It follows that for $d_k$ to be a descent direction, it is necessary that $\langle \nabla F(u_k), d_k\rangle > 0$.

Note that some methods construct $\alpha_k d_k$ in an automatic way.
2.3.1 Steepest descent (gradient with optimal stepsize)

Motivation: make $F(u_k) - F(u_{k+1})$ as large as possible. In (2.3) we have $\langle \nabla F(u_k), d_k\rangle \le \|\nabla F(u_k)\|\,\|d_k\|$ (Schwarz inequality), where equality is reached if $d_k \parallel \nabla F(u_k)$.

The steepest descent method (gradient with optimal stepsize) is defined by
$$F\big(u_k - \alpha_k\nabla F(u_k)\big) = \inf_{\alpha \in \mathbb{R}} F\big(u_k - \alpha\nabla F(u_k)\big), \tag{2.5}$$
$$u_{k+1} = u_k - \alpha_k\nabla F(u_k), \quad k \in \mathbb{N}. \tag{2.6}$$
Theorem 15 If $F : \mathbb{R}^n \to \mathbb{R}$ is strongly convex, $(u_k)_{k \in \mathbb{N}}$ given by (2.5)-(2.6) converges to the unique minimizer $\hat u$ of $F$.

Proof. Suppose that $\nabla F(u_k) \ne 0$ (otherwise $\hat u = u_k$). Proof in 5 steps.

(a) Denote
$$f_k(\alpha) \stackrel{\rm def}{=} F\big(u_k - \alpha\nabla F(u_k)\big), \quad \alpha > 0.$$
$f_k$ is coercive and strictly convex, hence it admits a unique minimizer $\alpha_k = \alpha(u_k)$, and the latter solves the equation $f_k'(\alpha) = 0$. Thus
$$f_k'(\alpha_k) = -\big\langle \nabla F\big(u_k - \alpha_k\nabla F(u_k)\big),\ \nabla F(u_k)\big\rangle = 0. \tag{2.7}$$
Since
$$u_{k+1} = u_k - \alpha_k\nabla F(u_k) \quad \Longleftrightarrow \quad \nabla F(u_k) = \frac{1}{\alpha_k}(u_k - u_{k+1}), \tag{2.8}$$
(2.7) shows that
$$\langle \nabla F(u_{k+1}), \nabla F(u_k)\rangle = 0, \tag{2.9}$$
i.e. two consecutive directions are orthogonal, while the right-hand side of (2.8), inserted into (2.7), shows that
$$\langle \nabla F(u_{k+1}),\ u_k - u_{k+1}\rangle = 0.$$
Using the last equation and the assumption that $F$ is strongly convex,
$$F(u_k) - F(u_{k+1}) \ge \langle \nabla F(u_{k+1}), u_k - u_{k+1}\rangle + \frac{\alpha}{2}\|u_k - u_{k+1}\|^2 = \frac{\alpha}{2}\|u_k - u_{k+1}\|^2. \tag{2.10}$$

(b) Since $(F(u_k))_{k \in \mathbb{N}}$ is decreasing and bounded from below by $F(\hat u)$, we deduce that $(F(u_k))_{k \in \mathbb{N}}$ converges, hence
$$\lim_{k \to \infty}\big( F(u_k) - F(u_{k+1}) \big) = 0, \tag{2.11}$$
and, by (2.10),
$$\lim_{k \to \infty}\|u_k - u_{k+1}\| = 0. \tag{2.12}$$
Note that (2.12) does not mean that the sequence $(u_k)$ converges! Consider $u_k = \sum_{i=0}^{k}\frac{1}{i+1}$. It is well known that $(u_k)_{k \in \mathbb{N}}$ diverges. Nevertheless, $u_{k+1} - u_k = \frac{1}{k+2} \to 0$ as $k \to \infty$.

(d) The facts that $(F(u_k))_{k \in \mathbb{N}}$ is decreasing and that $F$ is coercive imply that $\exists r > 0$ such that $u_k \in B(0, r)$, $\forall k \in \mathbb{N}$. Since $F \in C^1$, $\nabla F$ is uniformly continuous on the compact $\overline{B(0, r)}$. Using (2.12), $\forall \varepsilon > 0$ there are $\eta > 0$ and $k_0 \in \mathbb{N}$ such that
$$\|u_k - u_{k+1}\| < \eta \ \Rightarrow\ \|\nabla F(u_k) - \nabla F(u_{k+1})\| < \varepsilon, \quad \forall k \ge k_0.$$
Consequently,
$$\lim_{k \to \infty}\|\nabla F(u_k) - \nabla F(u_{k+1})\| = 0. \tag{2.13}$$
Remark 5 Note the role of the assumption that $V = \mathbb{R}^n$ is of finite dimension in this proof.

Quadratic case. Let $F(u) = \frac{1}{2}\langle Bu, u\rangle - \langle c, u\rangle$ with $B = B^T$, so that
$$\nabla F(u) = Bu - c. \tag{2.14}$$
Take $d_k = \nabla F(u_k)$, $k = 0, 1, \ldots$ and
$$f(\alpha) = F\big(u_k - \alpha\nabla F(u_k)\big) = F\big(u_k - \alpha(Bu_k - c)\big).$$
The steepest descent requires $\alpha_k$ such that $f'(\alpha_k) = 0$:
$$0 = f'(\alpha_k) = -\big\langle \nabla F\big(u_k - \alpha_k\nabla F(u_k)\big),\ \nabla F(u_k)\big\rangle$$
$$= -\big\langle B\big(u_k - \alpha_k\nabla F(u_k)\big) - c,\ \nabla F(u_k)\big\rangle$$
$$= -\big\langle (Bu_k - c) - \alpha_k B\nabla F(u_k),\ \nabla F(u_k)\big\rangle$$
$$= -\|\nabla F(u_k)\|^2 + \alpha_k\langle B\nabla F(u_k), \nabla F(u_k)\rangle$$
$$= -\|d_k\|^2 + \alpha_k\langle Bd_k, d_k\rangle$$
$$\Longrightarrow\quad \alpha_k = \frac{\|d_k\|^2}{\langle Bd_k, d_k\rangle}.$$
Using that $B \succ 0$ and that $B = B^T$, the least eigenvalue of $B$, denoted by $\lambda_{\min}$, obeys $\lambda_{\min} > 0$. Using Definition 13,
$$\langle \nabla F(v) - \nabla F(u),\ v - u\rangle = (v - u)^T B(v - u) \ge \lambda_{\min}\|v - u\|_2^2,$$
so $F$ is strongly convex.
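The derivation above translates directly into code. A sketch of steepest descent with the exact step $\alpha_k = \|d_k\|^2/\langle Bd_k, d_k\rangle$ for a small symmetric positive definite $B$ (the matrix values and iteration count are our choices):

```python
import numpy as np

def steepest_descent_quadratic(B, c, u0, n_iter=200):
    """Minimize F(u) = 0.5 <Bu,u> - <c,u> (B symmetric positive definite)
    by steepest descent with the exact step (2.5)-(2.6)."""
    u = u0.astype(float)
    for _ in range(n_iter):
        d = B @ u - c                       # gradient (2.14)
        if np.linalg.norm(d) < 1e-14:
            break
        alpha = (d @ d) / (d @ (B @ d))     # exact line-search step
        u = u - alpha * d
    return u

B = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
u_hat = steepest_descent_quadratic(B, c, np.zeros(2))
```

At convergence the iterate satisfies the optimality condition $B\hat u = c$, i.e. $\hat u = (0.2, 0.4)$ for this instance.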
$\nabla F$ being continuous and using (2.13), $\lim_j \nabla F(u_{k_j}) = \nabla F(u) = 0$. Recall that $F$ admits a unique minimizer $\hat u$ and the latter satisfies $\nabla F(\hat u) = 0$. It follows that $u = \hat u$.

Note that we do not have any bound on the error $\|u_k - \hat u\|$ as in the proof of Theorem 16.

Q-linear convergence of $(F(u_k) - F(\hat u))_{k \in \mathbb{N}}$ towards zero holds in a neighborhood of $\hat u$, under additional conditions, see [14, p. 33].

The steepest descent method can be very bad: the sequence of iterates is subject to zigzags, and the step-size can decrease considerably. However, the steepest descent method serves as a basis for all the methods actually used.

Remark 6 Consider
$$F(u[1], u[2]) = \frac{1}{2}\big( \lambda_1 u[1]^2 + \lambda_2 u[2]^2 \big), \quad 0 < \lambda_1 < \lambda_2.$$
Clearly, $\hat u = 0$ and
$$\nabla F(u) = \begin{pmatrix} \lambda_1 u[1] \\ \lambda_2 u[2] \end{pmatrix}, \qquad u_{k+1} = u_k - \alpha_k\begin{pmatrix} \lambda_1 u_k[1] \\ \lambda_2 u_k[2] \end{pmatrix}.$$
2.3.2 Gradient descent with varying stepsize

Here $\|u\| = \sqrt{\langle u, u\rangle}$, $u \in V$. The iteration reads
$$u_{k+1} = u_k - \rho_k\nabla F(u_k), \quad \rho_k > 0,\ k \in \mathbb{N}. \tag{2.15}$$

Theorem 17 Let $F : V \to \mathbb{R}$ be differentiable in $V$. Suppose there exist $\alpha$ and $M$ such that $0 < \alpha \le M$, and
(i) $\langle \nabla F(v) - \nabla F(u),\ v - u\rangle \ge \alpha\|v - u\|^2$, $\forall u, v \in V$;
(ii) $\|\nabla F(v) - \nabla F(u)\| \le M\|v - u\|$, $\forall u, v \in V$.
If
$$\rho_k \in [a, b] \subset \Big( 0,\ \frac{2\alpha}{M^2} \Big), \quad \forall k \in \mathbb{N},$$
then $(u_k)$ in (2.15) converges to the unique minimizer $\hat u$ of $F$, and
$$\|u_{k+1} - \hat u\| \le \gamma^k\|u_0 - \hat u\|, \quad \text{where } \ \gamma = \max_{\rho \in [a, b]}\sqrt{\rho^2 M^2 - 2\alpha\rho + 1} < 1.$$

If $F$ is twice differentiable, condition (ii) becomes $\sup_{u \in V}\|\nabla^2 F(u)\| \le M$.

With $G(u) = u - \rho\nabla F(u)$, Theorem 11, p. 19 ensures convergence if $\max_i |\lambda_i(I - \rho\nabla^2 F(u))| < 1$.
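Theorem 17 can be tried on the quadratic model of the previous pages: for $F(u) = \frac{1}{2}\langle Bu,u\rangle - \langle c,u\rangle$ one may take $\alpha = \lambda_{\min}(B)$ and $M = \lambda_{\max}(B)$, so any constant step $\rho \in (0, 2\alpha/M^2)$ converges. A sketch (the 0.9 safety factor and matrix are our choices):

```python
import numpy as np

B = np.array([[3.0, 1.0], [1.0, 2.0]])   # grad F(u) = Bu - c
c = np.array([1.0, 1.0])

lam = np.linalg.eigvalsh(B)              # ascending eigenvalues
rho = 0.9 * 2 * lam[0] / lam[-1] ** 2    # constant step inside (0, 2*alpha/M^2)

u = np.zeros(2)
for _ in range(500):
    u = u - rho * (B @ u - c)            # fixed-step iteration (2.15)
```

The iteration map $u \mapsto (I - \rho B)u + \rho c$ is a contraction here, so the iterates converge geometrically to the solution of $Bu = c$.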
2.4 Line search

2.4.1 Introduction

Merit function $f : \mathbb{R} \to \mathbb{R}$:
$$f(\alpha) = F(u_k - \alpha d_k), \quad \alpha > 0,$$
where $d_k$ is a descent direction, e.g. $d_k = \nabla F(u_k)$, and in any case $\langle \nabla F(u_k), d_k\rangle > 0$. The goal is to choose $\alpha > 0$ such that $f$ is decreased enough. Line search is of crucial importance since it is done at each iteration.

We fix the iteration $k$ and drop indexes. Thus we write
$$f(\alpha) = F(u - \alpha d), \quad \text{then} \quad f'(\alpha) = -\langle \nabla F(u - \alpha d),\ d\rangle.$$
In particular,
$$f'(0) = -\langle \nabla F(u), d\rangle < 0,$$
since $d$ is a descent direction, see (2.1), p. 23.

Usually, line search is a subprogram where $f$ can be evaluated only point-wise. In practice, $\alpha$ is found by trial and error. General scheme with 3 possible exits:
2.4.2

2.4.3

Arguments:
- Devise tolerant stopping rules: we minimize $F$, not $f$!
- Striving to minimize $f$ along the current direction is useless.

Intuitions:
- If $f$ is quadratic with minimum at $\bar\alpha$, then$^3$
$$f(\bar\alpha) = f(0) + \frac{1}{2}f'(0)\,\bar\alpha;$$
- (extrapolation step)
- (interpolation step)

Theorem 18 Suppose that $f$ is $C^1$ and bounded from below. Then Wolfe's line search terminates (i.e. the number of line-search iterations is finite).

Proof. Recall that $f'(0) < 0$ because we are on a descent direction. The proof is in two steps.

(a) If test 2 (extrapolation) is done infinitely many times, then $\alpha \to \infty$. Using 1(a), we deduce that
$$f(\alpha) \le f(0) - c_0\,\alpha\,|f'(0)| \ \to\ -\infty.$$
This contradicts the assumption that $f$ is bounded from below.

$^3$ For $f(\alpha) = \frac{c}{2}\alpha^2 + f'(0)\alpha + f(0)$ with $c > 0$, the minimum is at $\bar\alpha = -f'(0)/c$, so
$$f(\bar\alpha) = -\frac{f'(0)^2}{2c} + f(0), \quad \text{hence} \quad f(\bar\alpha) - f(0) = \frac{1}{2}f'(0)\,\bar\alpha.$$
(b) Otherwise, let $\tilde\alpha$ be such that
$$f(\tilde\alpha) = f(0) + c_0 f'(0)\,\tilde\alpha. \tag{2.16}$$
By test 2, no $\alpha_R$ can be equal to $\tilde\alpha$: indeed, test 2 means that $\alpha_R > \tilde\alpha$. Using that $f'(0) < 0$ and (2.16), test 2 can be put in the form
$$f(\alpha_R) > f(0) + c_0 f'(0)\big( \tilde\alpha + (\alpha_R - \tilde\alpha) \big) = f(\tilde\alpha) - c_0|f'(0)|(\alpha_R - \tilde\alpha).$$
Hence
$$f(\alpha_R) - f(\tilde\alpha) > -c_0|f'(0)|(\alpha_R - \tilde\alpha).$$
Divide both sides by $\alpha_R - \tilde\alpha > 0$:
$$\frac{f(\alpha_R) - f(\tilde\alpha)}{\alpha_R - \tilde\alpha} > -c_0|f'(0)|.$$
For $\alpha_R \downarrow \tilde\alpha$, the above inequality yields $f'(\tilde\alpha) \ge -c_0|f'(0)| > -c_1|f'(0)|$. On the other hand, test 3(b) for $\alpha_L \uparrow \tilde\alpha$ entails $f'(\tilde\alpha) \le -c_1|f'(0)|$. The last two inequalities are contradictory.

It is impossible to repeat steps 1 and 3 alternatively either. Hence the number of line-search iterations is finite.
Wolfe's line search can be combined with any kind of descent direction $d$. However, line search is helpless if $d_k$ is too orthogonal to $\nabla F(u_k)$. The angle $\theta_k$ between the direction and the gradient is crucial. Put
$$\cos\theta_k = \frac{\langle \nabla F(u_k), d_k\rangle}{\|\nabla F(u_k)\|\,\|d_k\|}. \tag{2.17}$$
We can say that $d_k$ is a definite descent direction if $\cos\theta_k > 0$ is large enough.

Theorem 19 If $\nabla F(u)$ is Lipschitz-continuous with constant $\ell$ on $\{u : F(u) \le F(u_0)\}$ and the minimization algorithm uses Wolfe's rule, then
$$r(\cos\theta_k)^2\,\|\nabla F(u_k)\|^2 \le F(u_k) - F(u_{k+1}), \quad \forall k \in \mathbb{N}, \tag{2.18}$$
where
$$r = \frac{c_0(1 - c_1)}{\ell} > 0. \tag{2.19}$$

Note that $F(u_k) - F(u_{k+1})$ is bounded from below by a positive number (descent is guaranteed by test 1). Recalling that $f'(\alpha_k) = -\langle \nabla F(u_k - \alpha_k d_k), d_k\rangle = -\langle \nabla F(u_{k+1}), d_k\rangle$, condition 1(b) reads
$$f'(\alpha_k) = -\langle \nabla F(u_{k+1}), d_k\rangle \ge c_1 f'(0) = -c_1\langle \nabla F(u_k), d_k\rangle,$$
which leads to
$$r(\cos\theta_k)^2\,\|\nabla F(u_k)\|^2 \le F(u_k) - F(u_{k+1}).$$
Theorem 20 Let $F \in C^1$ be bounded from below. Assume that the iterative scheme is such that (2.18), namely
$$r(\cos\theta_k)^2\,\|\nabla F(u_k)\|^2 \le F(u_k) - F(u_{k+1}), \quad \forall k \in \mathbb{N},$$
holds true for a constant $r > 0$ independent of $k$. If the series
$$\sum_{k=0}^{\infty}(\cos\theta_k)^2 \quad \text{diverges,} \tag{2.20}$$
then $\liminf_{k \to \infty}\|\nabla F(u_k)\| = 0$.

Proof. Since $(F(u_k))_{k \in \mathbb{N}}$ is decreasing and bounded from below, there is a finite number $\bar F \in \mathbb{R}$ such that $F(u_k) \downarrow \bar F$, $k \in \mathbb{N}$. Then (2.18) yields
$$\sum_{k=0}^{\infty}(\cos\theta_k)^2\,\|\nabla F(u_k)\|^2 \le \frac{1}{r}\sum_{k=0}^{\infty}\big( F(u_k) - F(u_{k+1}) \big) = \frac{1}{r}\Big( F(u_0) - F(u_1) + F(u_1) - F(u_2) + \cdots \Big) = \frac{1}{r}\big( F(u_0) - \bar F \big) < +\infty. \tag{2.21}$$
If there were $\varepsilon > 0$ such that $\|\nabla F(u_k)\| \ge \varepsilon$, $\forall k \ge 0$, then (2.21) would yield
$$\sum_{k=0}^{\infty}(\cos\theta_k)^2 \le \frac{1}{r\varepsilon^2}\big( F(u_0) - \bar F \big) < +\infty,$$
which contradicts (2.20).
Remark 8 Wolfe's rule needs to compute the value of $f'(\alpha) = -\langle \nabla F(u_k - \alpha d_k), d_k\rangle$ at each cycle of the line search. Computing $\nabla F(u_k - \alpha d_k)$ is costly when compared to the computation of $F(u_k - \alpha d_k)$. Other line-search methods avoid the computation of $\nabla F(u_k - \alpha d_k)$.

Other line searches

We can evoke e.g. the Goldstein and Price method and Armijo's method. More details: e.g. [69, 14].

Armijo's method

Choose $c \in (0, 1)$. Tests for $\alpha$:
1. $f(\alpha) \le f(0) + c\,\alpha f'(0)$ $\Rightarrow$ terminate;
2. otherwise, set $\alpha_R = \alpha$.

It is easy to see that this line search terminates as well. The interest of the method is that it does not need to compute $f'(\alpha)$. It ensures that $\alpha$ is not too large. However, it can be dangerous since it never increases $\alpha$; thus it relies on the initial step-size $\alpha_0$.

In practice it often turns out that taking a good constant step-size yields a faster convergence.
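Armijo's test is simple to implement as backtracking. A sketch in the text's sign convention $f(\alpha) = F(u - \alpha d)$ with $\langle \nabla F(u), d\rangle > 0$ (the constant $c$ and the shrink factor are typical choices, not prescribed by the text):

```python
import numpy as np

def armijo(F, grad_F, u, d, alpha0=1.0, c=1e-4, shrink=0.5, max_iter=50):
    """Backtracking line search: accept the first alpha with
    F(u - alpha d) <= F(u) - c * alpha * <grad F(u), d>.
    Only ever decreases alpha, as noted in the text."""
    g_d = grad_F(u) @ d          # <grad F(u), d> = -f'(0) > 0
    alpha = alpha0
    for _ in range(max_iter):
        if F(u - alpha * d) <= F(u) - c * alpha * g_d:
            return alpha
        alpha *= shrink          # test failed: set alpha_R = alpha and reduce
    return alpha

F = lambda u: 0.5 * (u @ u)      # toy strongly convex objective
grad_F = lambda u: u
u = np.array([2.0, -1.0])
alpha = armijo(F, grad_F, u, grad_F(u))
```

On this toy quadratic the unit step already satisfies the sufficient-decrease test, so no backtracking occurs.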
2.5 Second-order methods

2.5.1 Newton's method

Newton's method:
$$u_{k+1} = u_k - \big( \nabla^2 F(u_k) \big)^{-1}\nabla F(u_k). \tag{2.22}$$

Theorem 21 Suppose $F$ is twice differentiable in an open neighborhood $O \ni \hat u$ with $\nabla F(\hat u) = 0$, and that$^4$ $\|(\nabla^2 F(u))^{-1}\| \le M$, $\forall u \in O$. Then
$$\|u_{k+1} - \hat u\| \le M\,\|\hat u - u_k\|\,\varepsilon(\hat u - u_k)$$
and $Q_k = \frac{\|u_{k+1} - \hat u\|}{\|u_k - \hat u\|} \le M\varepsilon(\hat u - u_k) \to 0$ as $\|\hat u - u_k\| \to 0$. Hence the Q-superlinearity of the convergence.

Comments
- Under the conditions of the theorem, if in addition $F$ is $C^3$, then convergence is Q-quadratic.
- Direction and step-size are automatically chosen ($(\nabla^2 F(u_k))^{-1}\nabla F(u_k)$);
- If $F$ is nonconvex, $-(\nabla^2 F(u_k))^{-1}\nabla F(u_k)$ may not be a descent direction, hence the $\hat u$ found in this way is not necessarily a minimizer of $F$;
- Convergence is very fast;
- Computing $(\nabla^2 F(u_k))^{-1}$ can be difficult, i.e. solving the equation $\nabla^2 F(u_k)z = \nabla F(u_k)$ with respect to $z$ can require a considerable computational effort;
- Numerically it can be very unstable, thus entailing a violent divergence;
- Preconditioning can help if $\nabla^2 F(u_k)$ has a favorable structure, see [19];
- In practice: efficient to get a high-precision $\hat u$ if we are close enough to the sought-after $\hat u$.
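A minimal sketch of iteration (2.22) on a strictly convex test function of our choosing; `np.linalg.solve` replaces the explicit inverse, i.e. it solves $\nabla^2 F(u_k)z = \nabla F(u_k)$:

```python
import numpy as np

def newton(grad, hess, u0, tol=1e-10, max_iter=50):
    """Newton iteration u_{k+1} = u_k - (hess F(u_k))^{-1} grad F(u_k)."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad(u)
        if np.linalg.norm(g) < tol:
            break
        u = u - np.linalg.solve(hess(u), g)   # solve, never invert explicitly
    return u

# Test function (ours): F(u) = exp(u[0]) + u[0]^2 + u[1]^2, strictly convex
grad = lambda u: np.array([np.exp(u[0]) + 2 * u[0], 2 * u[1]])
hess = lambda u: np.array([[np.exp(u[0]) + 2.0, 0.0], [0.0, 2.0]])
u_hat = newton(grad, hess, [1.0, 1.0])
```

Starting close enough, a handful of iterations drives the gradient norm to machine precision, illustrating the fast local convergence discussed above.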
2.5.2

The inverse Hessian $(\nabla^2 F(u_k))^{-1}$ is approximated by $H_k^{-1}(u_k)$:
$$u_{k+1} = u_k - H_k^{-1}(u_k)\,\nabla F(u_k). \tag{2.23}$$

Assumptions, for $\theta \in (0, 1)$ and $B = B(u_0, r)$:
1. $\|H_k^{-1}(u)\| \le M$, $\forall k \in \mathbb{N}$;
2. $\sup_{u \in B}\|\nabla^2 F(u) - H_0(u_0)\| \le \dfrac{\theta}{M}$; (2.24)
3. $\|\nabla F(u_0)\| \le \dfrac{r}{M}(1 - \theta)$.

One shows by induction that, $\forall k \in \mathbb{N}$,
$$\|u_{k+1} - u_k\| \le \theta^k\,\|u_1 - u_0\|, \tag{2.25}$$
$$u_{k+1} \in B, \quad \forall k \in \mathbb{N}, \tag{2.26}$$
$$\|\nabla F(u_{k+1})\| \le \frac{\theta}{M}\,\|u_{k+1} - u_k\|. \tag{2.27}$$

(a) By assumptions 1 and 3, $\|u_1 - u_0\| \le M\|\nabla F(u_0)\| \le r(1 - \theta) < r$, hence $u_1 \in B$.

Consider the application $u \mapsto \nabla F(u) - H_0(u_0)u$. Its gradient is $\nabla^2 F(u) - H_0(u_0)$. By the generalized mean-value theorem$^5$ and assumption 2 we obtain (2.27) for $k = 1$:
$$\|\nabla F(u_1)\| = \big\| \nabla F(u_1) - H_0(u_0)u_1 - \nabla F(u_0) + H_0(u_0)u_0 \big\| \le \sup_{u \in B}\|\nabla^2 F(u) - H_0(u_0)\|\,\|u_1 - u_0\| \le \frac{\theta}{M}\|u_1 - u_0\|,$$
where we used that $\nabla F(u_0) + H_0(u_0)(u_1 - u_0) = 0$, by (2.23).

(b) Suppose that, up to the current $k$,
$$\|\nabla F(u_k)\| \le \frac{\theta}{M}\,\|u_k - u_{k-1}\|. \tag{2.28}$$
We check that these relations hold for $k$ as well. Using (2.28) and assumption 1, iteration (2.23) yields (2.25):
$$\|u_{k+1} - u_k\| \le \|H_k^{-1}(u_k)\|\,\|\nabla F(u_k)\| \le M\,\frac{\theta}{M}\|u_k - u_{k-1}\| \le \theta^k\|u_1 - u_0\|. \tag{2.29}$$
Using the triangle inequality, (2.25), and assumption 3, we get (2.26) for the current $k$:
$$\|u_{k+1} - u_0\| \le \|u_{k+1} - u_k\| + \|u_k - u_{k-1}\| + \cdots + \|u_1 - u_0\| \le \sum_{i=0}^{k}\theta^i\,\|u_1 - u_0\| \le \frac{\|u_1 - u_0\|}{1 - \theta} \le \frac{M\|\nabla F(u_0)\|}{1 - \theta} \le r,$$
hence $u_{k+1} \in B$. The same argument as in (a) gives (2.27): $\|\nabla F(u_{k+1})\| \le \frac{\theta}{M}\|u_{k+1} - u_k\|$.

Moreover, for any $j \ge 1$,
$$\|u_{k+j} - u_k\| \le \sum_{i=0}^{j-1}\|u_{k+i+1} - u_{k+i}\| \le \sum_{i=0}^{j-1}\theta^{k+i}\,\|u_1 - u_0\| \le \frac{\theta^k}{1 - \theta}\,\|u_1 - u_0\|, \tag{2.30}$$
hence $(u_k)_{k \in \mathbb{N}}$ is a Cauchy sequence. The latter, combined with the fact that $\bar B \subset V$ is complete, shows that $\exists \hat u \in \bar B$ such that $\lim_k u_k = \hat u$.

(c) Since $\nabla F$ is continuous on $O(U) \supset B$, using (2.27) we obtain the existence result:
$$\nabla F(\hat u) = \lim_k \nabla F(u_k) = 0, \quad \text{since} \quad \frac{\theta}{M}\lim_k\|u_{k+1} - u_k\| = 0.$$

(d) Uniqueness of $\hat u$ in $\bar B$. Suppose $u \in \bar B$, $u \ne \hat u$, is such that $\nabla F(u) = 0 = \nabla F(\hat u)$. We can hence write
$$u - \hat u = -H_0^{-1}(u_0)\big( \nabla F(u) - \nabla F(\hat u) - H_0(u_0)(u - \hat u) \big),$$
$$\|u - \hat u\| \le \|H_0^{-1}(u_0)\|\,\sup_{w \in B}\|\nabla^2 F(w) - H_0(u_0)\|\,\|u - \hat u\| \le M\,\frac{\theta}{M}\,\|u - \hat u\| < \|u - \hat u\|,$$
a contradiction.

$^5$ Generalized mean-value theorem. Let $f : U \subset V \to W$ and $a \in U$, $b \in U$ be such that the segment $[a, b] \subset U$. Assume $f$ is continuous on $[a, b]$ and differentiable on $]a, b[$. Then $\|f(b) - f(a)\| \le \sup_{u \in ]a, b[}\|Df(u)\|\,\|b - a\|$.

Iteration (2.23) is quite general. In particular, Newton's method (2.22) corresponds to $H_k(u_k) = \nabla^2 F(u_k)$, while (variable) stepsize gradient descent corresponds to $H_k(u_k) = \frac{1}{\rho_k}I$.
2.5.3

It is a quasi-Newton method (with linearization of the gradient); see [89, 86, 23].

Assumptions: $F : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable, strictly convex, bounded from below and coercive. ($F$ can be constrained to $U$, a nonempty convex polyhedral set.) $F$ is approximated from above by a quadratic function $\tilde F$ of the form
$$\tilde F(u, v) = F(v) + \langle u - v, \nabla F(v)\rangle + \frac{1}{2}\langle u - v, H(v)(u - v)\rangle$$
under the assumptions that, $\forall u \in \mathbb{R}^n$:
- $\tilde F(u, v) \ge F(u)$, for any fixed $v$;
- $H : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ is continuous and symmetric;
- $0 < \nu_0 \le \lambda_i(H(u)) \le \nu_1 < \infty$, $1 \le i \le n$.

Note that $\tilde F(u, u) = F(u)$. The iteration for this method reads
$$u_{k+1} = \arg\min_u \tilde F(u, u_k), \quad k \in \mathbb{N}.$$
For $u_k$ fixed, $\tilde F$ is $C^2$, coercive, bounded below and strictly convex. The minimizer $u_{k+1}$ exists and satisfies the quasi-Newton linear equation
$$H(u_k)(u_{k+1} - u_k) = -\nabla F(u_k).$$
2.5.4 Half-quadratic regularization

Consider
$$F(u) = \|Au - v\|^2 + \beta\sum_{i=1}^{r}\varphi(\|D_i u\|_2), \tag{2.31}$$
where $D_i \in \mathbb{R}^{s \times n}$ for $s \in \{1, 2\}$: $s = 2$, e.g. discrete gradients, see (1.8), and $s = 1$, e.g. finite differences, see (1.9), p. 7. Let us denote
$$D \stackrel{\rm def}{=} \begin{pmatrix} D_1 \\ D_2 \\ \vdots \\ D_r \end{pmatrix} \in \mathbb{R}^{rs \times n}. \qquad \text{Assumption: } \ker(A^T A) \cap \ker(D^T D) = \{0\}.$$
Furthermore, $\varphi : \mathbb{R}_+ \to \mathbb{R}$ is a convex edge-preserving potential function (see [7]), e.g.
$$\varphi(t) = \sqrt{\alpha + t^2}, \qquad \varphi(t) = \log\cosh\frac{t}{\alpha}, \qquad \varphi(t) = |t|^\alpha,\ 1 < \alpha \le 2, \qquad \varphi(t) = \begin{cases} t^2/2 & \text{if } |t| \le \alpha, \\ \alpha|t| - \alpha^2/2 & \text{if } |t| > \alpha, \end{cases}$$
together with the corresponding derivatives $\varphi'$, e.g. $\varphi'(t) = t/\sqrt{\alpha + t^2}$ for the first entry. The gradient of the regularization term involves the matrix
$$\beta\sum_{i=1}^{r}\frac{\varphi'(\|D_i u\|)}{\|D_i u\|}\,D_i^T D_i.$$

For the class of $F$ considered here, $\nabla^2 F(u)$ typically has a bad condition number, so $(\nabla^2 F(u))^{-1}$ is difficult (unstable) or impossible to obtain numerically. Newton's method is practically unfeasible.

Half-quadratic regularization comes in two forms, multiplicative ($*$) and additive ($+$), introduced in [42] and [43], respectively. Main idea: using an auxiliary variable $b$, construct an augmented criterion $\tilde F(u, b)$ which is quadratic in $u$ and separable in $b$, such that
$$(\hat u, \hat b) = \arg\min_{u, b}\tilde F(u, b)$$
yields the sought-after $\hat u$. Alternate minimization: $\forall k \in \mathbb{N}$, minimize w.r.t. $u$ for $b$ fixed to obtain $u_{k+1}$, then w.r.t. each $b[i]$, $1 \le i \le r$, for $u$ fixed.
Multiplicative form.

Usual assumptions on $\varphi$ in order to guarantee global convergence:
(a) $t \mapsto \varphi(t)$ is convex and increasing on $\mathbb{R}_+$, $\varphi \ge 0$ and $\varphi(0) = 0$;
(b) $t \mapsto \varphi(\sqrt{t})$ is concave on $\mathbb{R}_+$;
(c) $\varphi$ is $C^1$ on $\mathbb{R}_+$ and $\varphi'(0) = 0$;
(d) $\varphi''(0^+) > 0$;
(e) $\lim_{t \to \infty}\varphi(t)/t^2 = 0$.

Proposition 1 We have
$$\varphi(t) = \min_{b \ge 0}\Big( \frac{1}{2}t^2 b + \psi(b) \Big), \qquad \psi(b) = \sup_{t \in \mathbb{R}}\Big( -\frac{1}{2}bt^2 + \varphi(t) \Big).$$
Moreover,
$$\hat b \stackrel{\rm def}{=} \begin{cases} \dfrac{\varphi'(t)}{t} & \text{if } t > 0, \\[2mm] \varphi''(0^+) & \text{if } t = 0, \end{cases} \tag{2.32}$$
is the unique point yielding $\varphi(t) = \frac{1}{2}t^2\hat b + \psi(\hat b)$.

The $u$-step solves a linear system with
$$H(b) = 2A^T A + \beta\sum_{i=1}^{r} b[i]\,D_i^T D_i,$$
and the $b$-step reads
$$b_{k+1}[i] = \frac{\varphi'(\|D_i u_{k+1}\|_2)}{\|D_i u_{k+1}\|_2}, \quad 1 \le i \le r.$$

Comments
- References [24, 34, 47, 7, 68, 1].
- Interpretation: $b_k$ is an edge variable ($\approx 0$ at a discontinuity).
- $u_k$ can be computed efficiently using CG (see 2.6).
- $\tilde F$ is nonconvex w.r.t. $(u, b)$. However, under (a)-(e) the minimum is unique and convergence is ensured, $u_k \to \hat u$ (see [68, 1]).
- Quasi-Newton view:
$$u_{k+1} = u_k - \big( H(u_k) \big)^{-1}\nabla F(u_k), \qquad H(u) = 2A^T A + \beta\sum_{i=1}^{r}\frac{\varphi'(\|D_i u\|_2)}{\|D_i u\|_2}\,D_i^T D_i.$$
Compare with $H$ in Newton's method.
- Iterates amount to the relaxed fixed point iteration described next. In order to solve the equation $\nabla F(\hat u) = 0$, we write
$$\nabla F(u) = L(u)u - z,$$
where $z$ is independent of $u$. At each iteration $k$, one finds $u_{k+1}$ by solving the linear problem
$$L(u_k)u_{k+1} = z.$$
In words, $\nabla F$ is linearized at each iteration (see [67] for the proof).
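A sketch of the multiplicative-form alternation on a tiny 1-D denoising instance: $A = I$, first-order differences for $D$, and $\varphi(t) = \sqrt{\alpha + t^2}$, for which $\varphi'(t)/t = 1/\sqrt{\alpha + t^2}$ is defined everywhere. All sizes and parameter values below are our choices:

```python
import numpy as np

def half_quadratic_mult(A, v, beta, alph=1e-2, n_iter=100):
    """Multiplicative half-quadratic minimization of
    F(u) = ||Au - v||^2 + beta * sum_i phi(|D_i u|), phi(t) = sqrt(alph + t^2).
    Alternates the u-step (linear solve with H(b) = 2 A^T A + beta D^T diag(b) D)
    and the b-step b[i] = phi'(|D_i u|) / |D_i u| = 1 / sqrt(alph + (D_i u)^2)."""
    n = A.shape[1]
    D = np.zeros((n - 1, n))                    # first-order differences
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = 1.0, -1.0
    u = np.linalg.lstsq(A, v, rcond=None)[0]    # start from the data fit
    for _ in range(n_iter):
        t = D @ u
        b = 1.0 / np.sqrt(alph + t * t)                  # b-step (2.32)
        H = 2 * A.T @ A + beta * D.T @ (b[:, None] * D)  # H(b)
        u = np.linalg.solve(H, 2 * A.T @ v)              # u-step
    return u

A = np.eye(5)
v = np.array([0.0, 0.05, 1.0, 0.95, 1.0])
u_hat = half_quadratic_mult(A, v, beta=0.1)
```

At a fixed point, $H(b)u = 2A^Tv$ with $b[i] = \varphi'(|D_iu|)/|D_iu|$ is exactly the stationarity condition $\nabla F(u) = 0$, which the test below checks.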
39
Additive form.
Usual assumptions on to ensure convergence:
(a)
(b)
(c)
(d)
Notice that is dierentiable on R+ . Indeed, (a) implies that (t ) (t+ ), for all t. If for some t
the latter inequality is strict, (b) cannot be satised. If follows that (t ) = (t+ ) = (t).
In order to avoid additional notations, we write (.) and (.) next.
Proposition 2 For . = .2 and b Rs , t Rs , s {1, 2}, we have the equivalence:
1
1
2
2
t b + (b)
(t) = mins
(b) = maxs b t + (t)
tR
bR
2
2
(2.33)
t
b def
is the unique point yielding (t) = 12 t b2 + (b).
= t (t)
t
Proof. By (b) and (d), $\frac12 t^2 - \varphi(t)$ is convex and coercive, so the maximum of $\psi$ is reached. The formula for $\psi$ in (2.33) is equivalent to
$$\psi(b) + \tfrac12\|b\|^2 = \max_{t\in\mathbb{R}^s}\Big(\langle b, t\rangle - \big(\tfrac12\|t\|^2 - \varphi(t)\big)\Big) \qquad(2.34)$$
The term on the left is the convex conjugate of $\frac12\|t\|^2 - \varphi(t)$, see (1.10), p. 14. The latter term being convex and continuous, using convex bi-conjugacy (Theorem 4, p. 15), (2.34) is equivalent to
$$\tfrac12\|t\|^2 - \varphi(t) = \sup_{b\in\mathbb{R}^s}\Big(\langle b, t\rangle - \big(\psi(b) + \tfrac12\|b\|^2\big)\Big) \qquad(2.35)$$
Noticing that the supremum in (2.35) is reached, (2.35) is equivalent to
$$\varphi(t) = \min_{b\in\mathbb{R}^s}\Big({-\langle b, t\rangle} + \psi(b) + \tfrac12\|b\|^2 + \tfrac12\|t\|^2\Big) = \min_{b\in\mathbb{R}^s}\Big(\tfrac12\|t-b\|^2 + \psi(b)\Big)$$
Hence (2.33) is proven. Using (2.33), the maximum of $\psi$ and the minimum of $\varphi$ are reached by a pair $(t,b)$ such that
$$b - t + \varphi'(\|t\|_2)\,\frac{t}{\|t\|_2} = 0.$$
For $t$ fixed, the solution for $b$, denoted $\bar b$, is unique and is as stated in the proposition.
Based on Proposition 2, we consider
$$F(u,b) = \|Au - v\|_2^2 + \sum_{i=1}^{r}\Big(\tfrac12\|D_i u - b_i\|_s^2 + \psi(b_i)\Big), \qquad b_i \in \mathbb{R}^s.$$
The iterates solve $H u_{k+1} = 2A^T v + \sum_{i=1}^{r} D_i^T b_i^k$, where
$$H = 2A^T A + \sum_{i=1}^{r} D_i^T D_i \qquad(2.36)$$
$$b_i^{k+1} = D_i u_k - \varphi'(\|D_i u_k\|_2)\,\frac{D_i u_k}{\|D_i u_k\|_2}\,, \qquad 1 \le i \le r.$$
Comments
References: [8, 47, 68, 1].
Quasi-Newton:
Convergence: $u_k \to \bar u$.
Comparison between the two forms. See [68].
The multiplicative form: fewer iterations, but each iteration is expensive.
The additive form: more iterations, but each iteration is cheaper and it is possible to precondition $H$.
For both forms, convergence is faster than Gradient, DFP and BFGS (§2.5.5), and non-linear CG (§2.6.2) for functions of the form (2.31).
2.5.5 Quasi-Newton methods
References: [14, 54, 69]. Here $F : \mathbb{R}^n \to \mathbb{R}$ with $\|u\| = \sqrt{\langle u, u\rangle}$, $u \in \mathbb{R}^n$.
$$H_k \approx \nabla^2 F(u_k) \succ 0 \quad\text{and}\quad M_k \approx \big(\nabla^2 F(u_k)\big)^{-1} \succ 0, \qquad \forall k \in \mathbb{N}.$$
$$\delta_k \stackrel{\rm def}{=} u_{k+1} - u_k \qquad\qquad g_k \stackrel{\rm def}{=} \nabla F(u_{k+1}) - \nabla F(u_k)$$
Definition 14. Quasi-Newton (or secant) equation:
$$M_{k+1}\, g_k = \delta_k \qquad(2.37)$$
Interpretation: the mean value $\bar H$ of $\nabla^2 F$ between $u_k$ and $u_{k+1}$ satisfies $g_k = \bar H \delta_k$. So (2.37) forces $M_{k+1}$ to have the same action as $\bar H^{-1}$ on $g_k$ (a subspace of dimension one).
There are infinitely many quasi-Newton matrices satisfying Definition 14.
General quasi-Newton scheme
0. Fix $u_0$, tolerance $\varepsilon \ge 0$, $M_0 \succ 0$ such that $M_0 = M_0^T$. If $\|\nabla F(u_0)\| \le \varepsilon$ stop, else go to 1.
For $k = 1, 2, \ldots$, do:
1. $d_k = M_k \nabla F(u_k)$
2. Line search along $d_k$ to find $\rho_k$ (e.g. Wolfe, $\rho_0 = 1$)
3. $u_{k+1} = u_k - \rho_k d_k$
4. If $\|\nabla F(u_{k+1})\| \le \varepsilon$ stop, else go to step 5
5. $M_{k+1} = M_k + C_k \succ 0$ so that $M_{k+1} = M_{k+1}^T$ and (2.37) holds; then loop to step 1
$C_k \ne 0$ is a correction matrix. For stability and simplicity, it should be minimal in some sense. Various choices for $C_k$ can be made.
DFP (Davidon-Fletcher-Powell)
Historically the first (1959). Apply the general quasi-Newton scheme along with $M_{k+1} = M_k + C_k$, where
$$C_k = \frac{\delta_k \delta_k^T}{g_k^T \delta_k} - \frac{M_k g_k g_k^T M_k}{g_k^T M_k g_k}$$
Each of the matrices composing $C_k$ is of rank 1, hence $\operatorname{rank} C_k \le 2$, $\forall k \in \mathbb{N}$. Note that $C_k = C_k^T$, so $M_{k+1}$ is symmetric, and it satisfies the quasi-Newton equation (2.37):
$$M_{k+1} g_k = M_k g_k + \delta_k\,\frac{\delta_k^T g_k}{g_k^T \delta_k} - M_k g_k\,\frac{g_k^T M_k g_k}{g_k^T M_k g_k} = M_k g_k + \delta_k - M_k g_k = \delta_k.$$
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
Proposed in 1970. Apply the general quasi-Newton scheme along with $M_{k+1} = M_k + C_k$, where
$$C_k = \Big(1 + \frac{g_k^T M_k g_k}{g_k^T \delta_k}\Big)\frac{\delta_k \delta_k^T}{g_k^T \delta_k} - \frac{\delta_k g_k^T M_k + M_k g_k \delta_k^T}{g_k^T \delta_k}$$
Obviously $M_{k+1}$ satisfies (2.37) as well. BFGS is often preferred to DFP.
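The scheme above can be sketched in a few lines of pure Python on a 2-D quadratic. The matrix $B$, the vector $c$ and the exact line search are illustrative assumptions; the update implements the BFGS correction $C_k$ exactly as written, so the quasi-Newton equation (2.37) holds at every step.

```python
# BFGS on F(u) = 1/2 <Bu,u> - <c,u>, following the notation of the notes:
# d_k = M_k grad F(u_k), u_{k+1} = u_k - rho_k d_k, M_k ~ inverse Hessian.
B = [[2.0, 0.0], [0.0, 10.0]]
c = [2.0, 10.0]                  # minimizer: u = (1, 1)

def matvec(A, x):
    return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]

def dot(x, y):
    return x[0]*y[0] + x[1]*y[1]

def grad(u):
    Bu = matvec(B, u)
    return [Bu[0] - c[0], Bu[1] - c[1]]

def bfgs_update(M, delta, g):
    # M+ = M + C_k, chosen so that M+ g = delta (quasi-Newton equation 2.37)
    gd = dot(g, delta)
    Mg = matvec(M, g)
    gMg = dot(g, Mg)
    C = [[(1.0 + gMg / gd) * delta[i] * delta[j] / gd
          - (delta[i] * Mg[j] + Mg[i] * delta[j]) / gd
          for j in range(2)] for i in range(2)]
    return [[M[i][j] + C[i][j] for j in range(2)] for i in range(2)]

u = [0.0, 0.0]
M = [[1.0, 0.0], [0.0, 1.0]]     # M_0 = Id (symmetric positive definite)
for _ in range(20):
    gr = grad(u)
    if dot(gr, gr) < 1e-18:
        break
    d = matvec(M, gr)
    rho = dot(gr, d) / dot(matvec(B, d), d)   # exact line search (quadratic F)
    u_new = [u[0] - rho * d[0], u[1] - rho * d[1]]
    delta = [u_new[0] - u[0], u_new[1] - u[1]]
    gr_new = grad(u_new)
    g = [gr_new[0] - gr[0], gr_new[1] - gr[1]]
    M = bfgs_update(M, delta, g)
    u = u_new
```

For a quadratic with exact line search, BFGS generates conjugate directions and terminates in at most $n$ steps, as CG does.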
Comment: One can approximate $\nabla^2 F$ using $H_{k+1} = H_k + \widetilde C_k$, where $H_k$ must satisfy the so-called dual quasi-Newton equation, $H_{k+1}\delta_k = g_k$, $\forall k$ (see [14, p. 55]).
Theorem 23. Let $M_0 \succ 0$ (resp. $H_0 \succ 0$). Then $g_k^T \delta_k > 0$ is a necessary and sufficient condition for the DFP and BFGS formulae to give $M_k \succ 0$ (resp. $H_k \succ 0$), $\forall k \in \mathbb{N}$.
Proof: see [14, p. 56].
Remark 9. It can be shown that the DFP and BFGS formulae are mutually dual. See [14, p. 56].
Theorem 24. Let $F$ be convex, bounded from below and $\nabla F$ Lipschitzian on $\{u : F(u) \le F(u_0)\}$. Then the BFGS algorithm with Wolfe's line-search and $M_0 \succ 0$, $M_0 = M_0^T$ yields
$$\liminf_{k\to\infty} \|\nabla F(u_k)\| = 0.$$
http://www.cs.nyu.edu/overton/software/hanso/
2.6 Conjugate gradient methods
Due to Hestenes and Stiefel, 1952; the topic still contains a lot of material for research. Preconditioned CG is an important technique to solve large systems of equations.
2.6.1
For $B = B^T$ and $B \succ 0$, a powerful method to minimize, for $n$ very large,
$$F(u) = \tfrac12\langle Bu, u\rangle - \langle c, u\rangle, \qquad u \in \mathbb{R}^n$$
or, equivalently, to solve $Bu = c$. (Remind that $B \succ 0$ implies $\lambda_{\min}(B) > 0$.) Then $F$ is strongly convex. By the definition of $F$, we have a very useful relation:
$$\nabla F(u - v) = B(u - v) - c = \nabla F(u) - Bv, \qquad \forall u, v \in \mathbb{R}^n. \qquad(2.38)$$
At each iteration, $u_{k+1}$ solves
$$\inf_{u \in u_k + \mathcal{H}_k} F(u), \qquad \mathcal{H}_k = \operatorname{Span}\{\nabla F(u_i),\ 0 \le i \le k\},$$
so $u_{k+1}$ minimizes $F$ over an affine subspace (and not only along one direction, as in Gradient methods).
Theorem 25. The CG method converges after $n$ iterations at most and it provides the unique exact minimizer $\bar u$ obeying $F(\bar u) = \min_{u\in\mathbb{R}^n} F(u)$.
Set
$$g(\lambda) = \sum_{i=0}^{k} \lambda[i]\,\nabla F(u_i), \qquad \lambda[i] \in \mathbb{R},\ 0 \le i \le k. \qquad(2.39)$$
$\mathcal{H}_k$ is a closed convex set. Set $f(\lambda) \stackrel{\rm def}{=} F\big(u_k - g(\lambda)\big)$. Note that $f$ is convex and coercive.
$$F(u_{k+1}) = \inf_{\lambda\in\mathbb{R}^{k+1}} F\big(u_k - g(\lambda)\big) = \inf_{\lambda\in\mathbb{R}^{k+1}} f(\lambda) = f(\lambda_k). \qquad(2.40)$$
$$\langle \nabla F(u_{k+1}), \nabla F(u_i)\rangle = 0, \qquad 0 \le i \le k \qquad(2.41)$$
$$\langle \nabla F(u_{k+1}), u\rangle = 0, \qquad \forall u \in \mathcal{H}_k \qquad(2.42)$$
It follows that the $\nabla F(u_i)$ are linearly independent. Conclusions about convergence:
If $\nabla F(u_k) = 0$ then $\bar u = u_k$ (terminate). Otherwise $\nabla F(u_n) = 0$ and $\bar u = u_n$.
$$\rho_k = \frac{\langle \nabla F(u_k), d_k\rangle}{\langle B d_k, d_k\rangle} \qquad\text{(where } B = \nabla^2 F(u_k),\ \forall k\text{)}$$
Main properties:
$\forall k$, $\langle \nabla F(u_{k+1}), \nabla F(u_i)\rangle = 0$ if $0 \le i \le k$;
$\forall k$, $\langle B d_{k+1}, d_i\rangle = 0$ if $0 \le i \le k$: the directions are conjugate w.r.t. $\nabla^2 F = B$;
since $B \succ 0$, it follows that $(d_k)_{k=1}^{\tilde n}$ are linearly independent, where $\tilde n \le n$ is such that $\nabla F(u_{\tilde n}) = 0$.
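A compact pure-Python sketch of linear CG follows, illustrating Theorem 25 (exact convergence in at most $n$ iterations). The $3\times 3$ symmetric positive definite matrix is an assumption chosen for the demonstration.

```python
# Linear CG for Bu = c, B symmetric positive definite.
B = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
c = [1.0, 2.0, 3.0]

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cg(B, c, iters=3):
    n = len(c)
    u = [0.0] * n
    r = list(c)              # residual c - Bu = -grad F(u) at u = 0
    d = list(r)              # first search direction
    for _ in range(iters):
        rr = dot(r, r)
        if rr < 1e-30:
            break
        Bd = matvec(B, d)
        rho = rr / dot(d, Bd)             # exact line search
        u = [u[i] + rho * d[i] for i in range(n)]
        r = [r[i] - rho * Bd[i] for i in range(n)]
        beta = dot(r, r) / rr             # Fletcher-Reeves ratio
        d = [r[i] + beta * d[i] for i in range(n)]
    return u

u = cg(B, c)
Bu = matvec(B, u)
residual = [c[i] - Bu[i] for i in range(3)]
```

After $n = 3$ iterations the residual is zero up to rounding, as the theorem states.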
2.6.2
$$\beta_k = \frac{\|\nabla F(u_k)\|^2}{\|\nabla F(u_{k-1})\|^2} \qquad(2.43)$$
3. $d_k = \nabla F(u_k) + \beta_k d_{k-1}$
4. Test: if $\langle \nabla F(u_k), d_k\rangle < 0$ ($d_k$ is not a descent direction), set $d_k = \nabla F(u_k)$
5. Line-search along $d_k$ to obtain $\rho_k > 0$
$$\beta_k = \frac{\big\langle \nabla F(u_k) - \nabla F(u_{k-1}),\ \nabla F(u_k)\big\rangle}{\|\nabla F(u_{k-1})\|^2} \qquad\text{(2.44, the PR formula)}$$
Let $\bar B_k$ denote the mean value of $\nabla^2 F$ between $u_{k-1}$ and $u_k$, so that $\bar B_k d_{k-1} = \frac{1}{\rho_{k-1}}\big(\nabla F(u_{k-1}) - \nabla F(u_k)\big)$, since $u_k = u_{k-1} - \rho_{k-1} d_{k-1}$. Requiring that $d_k = \nabla F(u_k) + \beta_k d_{k-1}$ be conjugate to $d_{k-1}$ w.r.t. $\bar B_k$ yields
$$\beta_k = -\frac{\langle \bar B_k d_{k-1}, \nabla F(u_k)\rangle}{\langle \bar B_k d_{k-1}, d_{k-1}\rangle} = \frac{\big\langle \nabla F(u_k) - \nabla F(u_{k-1}),\ \nabla F(u_k)\big\rangle}{\big\langle \nabla F(u_{k-1}) - \nabla F(u_k),\ d_{k-1}\big\rangle}.$$
If $\rho_{k-1}$ is optimal, i.e. if $F(u_{k-1} - \rho_{k-1} d_{k-1}) = \inf_\rho F(u_{k-1} - \rho\, d_{k-1})$, then
$$\big\langle \nabla F(u_{k-1} - \rho_{k-1} d_{k-1}),\ d_{k-1}\big\rangle = 0 = \big\langle \nabla F(u_k),\ d_{k-1}\big\rangle.$$
Then
$$\beta_k = \frac{\big\langle \nabla F(u_k) - \nabla F(u_{k-1}),\ \nabla F(u_k)\big\rangle}{\big\langle \nabla F(u_{k-1}),\ d_{k-1}\big\rangle}.$$
If $\rho_{k-2}$ is optimal as well, in the same way we have $\langle \nabla F(u_{k-1}), d_{k-2}\rangle = 0$. Using this result and the formula for $d_{k-1}$ according to step 3 in PR, the denominator on the right-hand side is
$$\big\langle \nabla F(u_{k-1}),\ d_{k-1}\big\rangle = \big\langle \nabla F(u_{k-1}),\ \nabla F(u_{k-1}) + \beta_{k-1} d_{k-2}\big\rangle = \|\nabla F(u_{k-1})\|^2.$$
It follows that $\beta_k$ equals the PR formula for $k$ as in the Lemma. Using the expression for $\beta_k$ established above, we find:
$$\langle \bar B_k d_{k-1}, d_k\rangle = \langle \bar B_k d_{k-1}, \nabla F(u_k) + \beta_k d_{k-1}\rangle = \langle \bar B_k d_{k-1}, \nabla F(u_k)\rangle - \frac{\langle \bar B_k d_{k-1}, \nabla F(u_k)\rangle}{\langle \bar B_k d_{k-1}, d_{k-1}\rangle}\,\langle \bar B_k d_{k-1}, d_{k-1}\rangle = 0.$$
PR method
Apply the FR method, p. 44, where $\beta_k$ in step 2 is replaced by the PR formula
$$\beta_k = \frac{\big\langle \nabla F(u_k) - \nabla F(u_{k-1}),\ \nabla F(u_k)\big\rangle}{\|\nabla F(u_{k-1})\|^2}.$$
2.6.3
Consider the system $Au = v$, where
$$A = \begin{pmatrix} 10 & 7 & 8 & 7\\ 7 & 5 & 6 & 5\\ 8 & 6 & 10 & 9\\ 7 & 5 & 9 & 10 \end{pmatrix}, \qquad \bar u = A^{-1} v = [1,\ 1,\ 1,\ 1]^T,$$
$$\operatorname{cond}(A) = \left(\frac{\max_{1\le i\le n} \lambda_i(A^T A)}{\min_{1\le i\le n} \lambda_i(A^T A)}\right)^{1/2}.$$
A perturbation $\delta v$ of the data yields a perturbation $\delta u$ of the solution with
$$\frac{\|\delta u\|}{\|u\|} \le \operatorname{cond}(A)\,\frac{\|\delta v\|}{\|v\|}.$$
An $n\times n$ circulant matrix $C$ is defined by its first column $(c_0, c_1, \ldots, c_{n-1})$:
$$C = \begin{pmatrix} c_0 & c_{n-1} & c_{n-2} & \ldots & c_1\\ c_1 & c_0 & c_{n-1} & \ldots & c_2\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ c_{n-2} & & \ddots & c_0 & c_{n-1}\\ c_{n-1} & c_{n-2} & \ldots & c_1 & c_0 \end{pmatrix}$$
Its eigenvalues are
$$\lambda_p(C) = \sum_{j=0}^{n-1} c_j \exp\Big({-\frac{2i\pi j p}{n}}\Big), \qquad p = 0, \ldots, n-1,$$
so that $C = \mathrm{F}^H \Lambda \mathrm{F}$, where $\mathrm{F}$ is the Fourier matrix and $\mathrm{F}^H$ is the conjugate transpose of $\mathrm{F}$. $\lambda(C)$ can be obtained in $O(n \log n)$ operations using the fast Fourier transform (FFT) of the first column of $C$. Note that $\mathrm{F}^{-1} = \mathrm{F}^H$. Solving
$$Cu = v$$
amounts to solving $\Lambda \tilde u = \tilde v$, where $\tilde v = \mathrm{F}v$; then $u = \mathrm{F}^H \tilde u$. Here $\mathrm{F}v$ and $\mathrm{F}^H \tilde u$ are computed using the FFT and the inverse FFT, respectively.
For preconditioning of block-circulant matrices with applications to image restoration, see [43], with a focus on p. 938.
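The diagonalization argument above can be checked numerically. The sketch below solves $Cu = v$ through the Fourier transform; a plain $O(n^2)$ DFT written with `cmath` stands in for the FFT for clarity, and the first column of $C$ is an assumption chosen for the demonstration.

```python
import cmath

# Solving Cu = v for a circulant C via C = F^H Lambda F.
col = [4.0, 1.0, 0.0, 1.0]          # assumed first column of C (n = 4)
n = len(col)

def dft(x):
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * p / n)
                for j in range(n)) for p in range(n)]

def idft(x):
    return [sum(x[p] * cmath.exp(2j * cmath.pi * j * p / n)
                for p in range(n)) / n for j in range(n)]

v = [1.0, 2.0, 3.0, 4.0]
lam = dft(col)                       # eigenvalues of C (DFT of first column)
v_hat = dft(v)
u_hat = [v_hat[p] / lam[p] for p in range(n)]
u = [z.real for z in idft(u_hat)]    # C is real symmetric here, so u is real

# cross-check: build C explicitly and verify that Cu = v
C = [[col[(i - j) % n] for j in range(n)] for i in range(n)]
Cu = [sum(C[i][j] * u[j] for j in range(n)) for i in range(n)]
```

With a real FFT routine the two transforms cost $O(n \log n)$ instead of the $O(n^2)$ loops used here.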
Preconditioning
The speed of algorithms can be substantially improved using preconditioning.
References: [81, 19, 69, 20].
Instead of solving $Bu = c$, we solve the preconditioned system
$$P^{-1} B u = P^{-1} c,$$
where the $n\times n$ matrix $P$ is called the preconditioner. The latter is chosen according to the following criteria:
$P$ should be constructed within $O(n \log n)$ operations;
$Pv = w$ should be solved in $O(n \log n)$ operations;
the eigenvalues of $P^{-1}B$ should be clustered (i.e. well concentrated near 1);
if $B \succ 0$ then $P^{-1}B \succ 0$.
Toeplitz Systems
An $n\times n$ Toeplitz matrix $A$ is defined using $2n-1$ scalars, say $a_p$, $1-n \le p \le n-1$. It is constant along its diagonals:
$$A = \begin{pmatrix} a_0 & a_{-1} & \ldots & a_{-n+2} & a_{-n+1}\\ a_1 & a_0 & a_{-1} & \ldots & a_{-n+2}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ a_{n-2} & & \ddots & a_0 & a_{-1}\\ a_{n-1} & a_{n-2} & \ldots & a_1 & a_0 \end{pmatrix} \qquad(2.45)$$
Toeplitz matrices are important since they arise in deblurring, recursive filtering, auto-regressive (AR) modeling and many others.
Circulant preconditioners
We emphasize that the use of circulant matrices as preconditioners for Toeplitz systems allows the use of the FFT throughout the computations; the FFT is highly parallelizable and has been efficiently implemented on multiprocessors.
Given an integer $n$, we denote by $\mathcal{C}_n$ the set of all circulant $n\times n$ matrices. Below we sketch several classical circulant preconditioners.
Strang's preconditioner [80]
It is the first circulant preconditioner that was proposed in the literature. $P$ is defined to be the matrix that copies the central diagonals of $A$ and reflects them around to complete the circulant requirement. For $A$ given by (2.45), the diagonals $p_j$ of the Strang preconditioner $P = [p_{m-\ell}]_{0\le m,\ell < n}$ are given by
$$p_j = \begin{cases} a_j & 0 \le j \le n/2\\ a_{j-n} & n/2 < j < n \end{cases}$$
T. Chan's circulant preconditioner
It is defined by $\|P - A\|_F = \min_{C\in\mathcal{C}_n} \|C - A\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm. The $j$th diagonals of $P$ are shown to be equal to
$$p_j = \begin{cases} \dfrac{(n-j)\,a_j + j\,a_{j-n}}{n} & 0 \le j < n\\[1mm] p_{n+j} & 0 < -j < n \end{cases}$$
Preconditioners by embedding [17]
For $A$ as in (2.45), let $B$ be such that the matrix below is $2n\times 2n$ circulant:
$$\begin{pmatrix} A & B^H\\ B & A \end{pmatrix}$$
R. Chan's circulant preconditioner $P$ is defined as $P = A + B$.
Preconditioners by minimization of norms
Tyrtyshnikov's circulant preconditioner is defined by
$$\|\mathrm{Id} - P^{-1}A\|_F = \min_{C\in\mathcal{C}_n} \|\mathrm{Id} - C^{-1}A\|_F.$$
It has some nice properties and is called the superoptimal circulant preconditioner. See [84]. Other circulant preconditioners are derived from kernel functions.
Various types of preconditioners are known for different classes of matrices.
Remind that for an $n\times n$ matrix $B$ with elements $B[i,j]$ ($i$ for row, $j$ for column):
$$\|B\|_1 = \max_{1\le j\le n} \sum_{i=1}^{n} |B[i,j]|, \qquad \|B\|_\infty = \max_{1\le i\le n} \sum_{j=1}^{n} |B[i,j]|, \qquad \|B\|_F = \Big(\sum_{i=1}^{n}\sum_{j=1}^{n} B[i,j]^2\Big)^{1/2}.$$
Chapter 3
Constrained optimization
3.1
Preliminaries
3.2
Optimality conditions
Theorem 27 (Euler (in)equalities). Let $V$ be a real n.v.s., $U \subset V$ convex, $F : O(U) \to \mathbb{R}$ proper (see the definition on p. 10) and differentiable at $\bar u \in U$.
1. If $F$ admits at $\bar u$ a minimum w.r.t. $U$, then
$$\langle \nabla F(\bar u), u - \bar u\rangle \ge 0, \qquad \forall u \in U. \qquad(3.1)$$
2. Assume in addition that $F$ is convex. Necessary and sufficient condition (NSC) for a minimum:
$$F(\bar u) = \min_{u\in U} F(u) \iff \text{(3.1) holds.} \qquad(3.2)$$
Proof. Statement 1. Let $u \in U$ and set $v = u - \bar u$. Suppose that $\langle \nabla F(\bar u), v\rangle < 0$ and note that $v$ is fixed. Since $\varepsilon(\theta v) \to 0$ as $\theta \searrow 0$, we can find $\theta$ small enough such that $\langle \nabla F(\bar u), v\rangle + \|v\|\,\varepsilon(\theta v) < 0$, hence $F(\bar u) > F(\bar u + \theta v)$, i.e. there is no minimum at $\bar u$. It follows that necessarily
$$\langle \nabla F(\bar u), u - \bar u\rangle = \langle \nabla F(\bar u), v\rangle \ge 0.$$
Statement 2. The necessity of (3.1) follows from statement 1. To show its sufficiency, combine (3.1) with the convexity of $F$, Property 2, statement 1 (p. 21): $F(u) - F(\bar u) \ge \langle \nabla F(\bar u), u - \bar u\rangle$, $\forall u \in U$. If $F$ is strictly convex, the uniqueness of $\bar u$ follows from (3.1) and Property 2-2, namely $F(u) - F(\bar u) > \langle \nabla F(\bar u), u - \bar u\rangle$, $\forall u \in U$, $u \ne \bar u$.
Statement 3 follows directly from statement 1.
3.2.1 Projection theorem
Theorem 28 (Projection). Let $V$ be a real Hilbert space, $U \subset V$ nonempty, convex and closed, and $\|u\| = \sqrt{\langle u, u\rangle}$, $u \in V$. Given $v \in V$, the following statements are equivalent:
1. $\bar u = \Pi_U v \in U$ is unique such that $\|v - \bar u\| = \inf_{u\in U} \|v - u\|$, where $\Pi_U$ is the projection onto $U$;
2. $\bar u \in U$ and $\langle v - \bar u, u - \bar u\rangle \le 0$, $\forall u \in U$.
Classical proof: [77, p. 391] or [52]. A shorter proof using Theorem 27 is given below.
Proof. If $v \in U$, then $\bar u = v$ and $\Pi_U = \mathrm{Id}$. Consider next $v \in V \setminus U$ and
$$F(u) = \tfrac12 \|v - u\|^2.$$
$F$ is clearly $C^\infty$, strictly convex and coercive. The projection $\bar u$ of $v$ onto $U$ solves the problem
$$F(\bar u) = \inf_{u\in U} F(u) = \min_{u\in U} F(u).$$
By Theorem 27, this is equivalent to $\langle \nabla F(\bar u), u - \bar u\rangle = \langle \bar u - v, u - \bar u\rangle \ge 0$, i.e.
$$\langle v - \bar u, u - \bar u\rangle \le 0, \qquad \forall u \in U.$$
3.3 General methods
3.3.1 Gauss-Seidel under hyper-cube constraint
$$U = \{u \in \mathbb{R}^n : a_i \le u[i] \le b_i,\ 1 \le i \le n\} \qquad(3.3)$$
Iterations: $\forall k \in \mathbb{N}$, $i = 1, \ldots, n$:
$$F\big(u_{k+1}[1], \ldots, u_{k+1}[i-1], u_{k+1}[i], u_k[i+1], \ldots, u_k[n]\big) = \inf_{a_i \le t \le b_i} F\big(u_{k+1}[1], \ldots, u_{k+1}[i-1], t, u_k[i+1], \ldots, u_k[n]\big)$$
Theorem 29. If $F$ is strongly convex and $U$ is of the form (3.3), then $(u_k)$ converges to the unique $\bar u$ such that $F(\bar u) = \min_{u\in U} F(u)$.
Remark 12. The method cannot be extended to a general $U$. E.g., consider $F(u) = u[1]^2 + u[2]^2$ and $U = \{u \in \mathbb{R}^2 : u[1] + u[2] \ge 2\}$, and initialize with $u_0[1] = -1$ or $u_0[2] = -1$. The algorithm is blocked at a point of the boundary of $U$ different from the solution.
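For a box constraint each scalar subproblem of the Gauss-Seidel iteration has a closed form: the unconstrained coordinate minimizer clipped to $[a_i, b_i]$. A minimal sketch for a quadratic $F(u) = \frac12\langle Bu,u\rangle - \langle c,u\rangle$ follows; the matrix and bounds are illustrative assumptions.

```python
# Gauss-Seidel under the hyper-cube constraint (3.3) for a 2-D quadratic.
B = [[3.0, 1.0], [1.0, 2.0]]
c = [1.0, 1.0]
lo = [0.3, 0.2]          # a_i  (chosen so the first bound is active)
hi = [1.0, 1.0]          # b_i

def clip(t, a, b):
    return max(a, min(b, t))

u = [0.0, 0.0]
for _ in range(100):
    for i in range(2):
        # minimize F over u[i] with the other coordinate fixed:
        # stationarity gives B[i][i] u[i] + sum_{j != i} B[i][j] u[j] = c[i],
        # then clip the scalar minimizer to [lo[i], hi[i]]
        s = sum(B[i][j] * u[j] for j in range(2) if j != i)
        u[i] = clip((c[i] - s) / B[i][i], lo[i], hi[i])
```

Here the iterates converge to $(0.3,\ 0.35)$: the first coordinate is blocked at its lower bound while the second takes its interior optimum, exactly the situation Theorem 29 covers (and which fails for a general $U$ as in Remark 12).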
3.3.2 Gradient with projection
$V$ a Hilbert space, $U \subset V$ convex, closed, non-empty, $F : V \to \mathbb{R}$ convex and differentiable on $V$. Here again, $\|u\| = \sqrt{\langle u, u\rangle}$, $u \in V$.
Motivation:
$$\bar u \in U \ \text{and}\ F(\bar u) = \inf_{u\in U} F(u)$$
$$\iff\quad \bar u \in U \ \text{and}\ \big\langle \rho\nabla F(\bar u), u - \bar u\big\rangle \ge 0,\ \forall u \in U,\ \forall\rho > 0 \quad\text{(the NSC for a minimum)}$$
$$\iff\quad \bar u \in U \ \text{and}\ \big\langle \bar u - \big(\bar u - \rho\nabla F(\bar u)\big),\ u - \bar u\big\rangle \ge 0,\ \forall u \in U,\ \forall\rho > 0$$
$$\iff\quad \bar u = \Pi_U\big(\bar u - \rho\nabla F(\bar u)\big),\ \forall\rho > 0.$$
In words, $\bar u$ is the fixed point of the application
$$G(u) = \Pi_U\big(u - \rho\nabla F(u)\big), \quad\text{where } \Pi_U \text{ is the projection operator onto } U.$$
Theorem 30. Let $F : V \to \mathbb{R}$ be differentiable on $V$ and $U \subset V$ a nonempty convex and closed subset. Suppose that there exist $\alpha$ and $M$ such that $0 < \alpha < M$, and
(i) $\langle \nabla F(u) - \nabla F(v), u - v\rangle \ge \alpha \|u - v\|^2$, $\forall (u,v) \in V^2$ (i.e. $F$ is strongly convex);
(ii) $\|\nabla F(u) - \nabla F(v)\| \le M \|u - v\|$, $\forall (u,v) \in V^2$.
Consider the iteration
$$u_{k+1} = G_{\rho_k}(u_k), \quad \forall k \in \mathbb{N}, \qquad G_\rho(u) = \Pi_U\big(u - \rho\nabla F(u)\big), \qquad(3.4)$$
where
$$\rho_k \in \Big[\frac{\alpha}{M^2} - \xi,\ \frac{\alpha}{M^2} + \xi\Big], \quad \forall k \in \mathbb{N}, \qquad \xi \in \Big(0, \frac{\alpha}{M^2}\Big) \ \text{(fixed)}. \qquad(3.5)$$
Then
$$\|u_{k+1} - \bar u\| \le \gamma\, \|u_k - \bar u\|, \qquad(3.6)$$
where
$$\gamma^2 = \xi^2 M^2 - \frac{\alpha^2}{M^2} + 1 < 1. \qquad(3.7)$$
Remark 13. If we fix $\xi \approx 0$, $\gamma$ is nearly optimal but the range for $\rho_k$ is decreased; and vice versa, for $\xi \approx \frac{\alpha}{M^2}$, the range for $\rho_k$ is nearly maximal, $\big(0, \frac{2\alpha}{M^2}\big)$, but $\gamma \approx 1$. Ideally, one would wish the largest range for $\rho_k$ and the least value for $\gamma$...
Proof. By its strong convexity, $F$ admits a unique minimizer $\bar u$. Let $u \in V$ and $v \in V$.
$$\begin{aligned} \|G_{\rho_k}(u) - G_{\rho_k}(v)\|^2 &= \big\|\Pi_U\big(u - \rho_k\nabla F(u)\big) - \Pi_U\big(v - \rho_k\nabla F(v)\big)\big\|^2\\ &\le \big\|u - v - \rho_k\big(\nabla F(u) - \nabla F(v)\big)\big\|^2\\ &= \|u - v\|^2 - 2\rho_k\big\langle \nabla F(u) - \nabla F(v),\ u - v\big\rangle + \rho_k^2\,\|\nabla F(u) - \nabla F(v)\|^2\\ &\le \|u - v\|^2 - 2\rho_k\alpha\|u - v\|^2 + \rho_k^2 M^2 \|u - v\|^2\\ &= \big(\rho_k^2 M^2 - 2\rho_k\alpha + 1\big)\,\|u - v\|^2 = f(\rho_k)\,\|u - v\|^2, \end{aligned}$$
where $f$ is convex and quadratic and reads
$$f(\rho) \stackrel{\rm def}{=} \rho^2 M^2 - 2\rho\alpha + 1.$$
Since $0 < \alpha < M$, the discriminant of $f$ is negative, $4\alpha^2 - 4M^2 < 0$, hence
$$f(\rho) > 0, \qquad \forall\rho \ge 0.$$
Then $\sqrt{f(\rho)}$ is real and positive. It is easy to check that $\sqrt{f(\rho)}$ is strictly convex on $\mathbb{R}_+$ when $0 < \alpha < M$ and that it admits a unique minimizer on $\mathbb{R}_+$. More precisely:
$$\arg\min_{\rho}\sqrt{f(\rho)} = \frac{\alpha}{M^2}, \qquad f(0) = f\Big(\frac{2\alpha}{M^2}\Big) = 1, \qquad 0 < \sqrt{f\Big(\frac{\alpha}{M^2}\Big)} < 1.$$
For any $\xi$ as given in (3.5), we check that
$$f\Big(\frac{\alpha}{M^2} - \xi\Big) = f\Big(\frac{\alpha}{M^2} + \xi\Big) = \xi^2 M^2 - \frac{\alpha^2}{M^2} + 1 = \gamma^2,$$
where the last equality comes from (3.7). By (3.5) yet again,
$$\gamma^2 = \xi^2 M^2 - \frac{\alpha^2}{M^2} + 1 < \frac{\alpha^2}{M^4}\,M^2 - \frac{\alpha^2}{M^2} + 1 = 1,$$
hence $f(\rho_k) \le \gamma^2 < 1$. It follows that $G_{\rho_k}$ is a contraction, $\forall k \in \mathbb{N}$, since $\|G_{\rho_k}(u) - G_{\rho_k}(v)\| \le \gamma\|u - v\|$, $\gamma < 1$, for any $(u,v) \in V^2$; hence $G_{\rho_k}(\bar u) = \bar u$ for all $k \in \mathbb{N}$. We can write down
$$\|u_{k+1} - \bar u\| = \|G_{\rho_k}(u_k) - G_{\rho_k}(\bar u)\| \le \gamma\|u_k - \bar u\| \le \gamma^{k+1}\|u_0 - \bar u\|, \qquad(3.8)$$
where $\gamma < 1$ is given in (3.7).
Application to $F(u) = \frac12\langle Bu, u\rangle - \langle c, u\rangle$, $B = B^T \succ 0$, so that $\nabla F(u) = Bu - c$. Drawback: $\alpha/M = \lambda_{\min}(B)/\lambda_{\max}(B)$ can be very small. For any $u, v \in \mathbb{R}^n$,
$$\big\|\Pi_U\big(u - \rho_k(Bu - c)\big) - \Pi_U\big(v - \rho_k(Bv - c)\big)\big\|_2 \le \big\|(u - v) - \rho_k B(u - v)\big\|_2 \le \|\mathrm{Id} - \rho_k B\|_2\,\|u - v\|_2.$$
Remind that for any square matrix $B$, if $Bv = \lambda_i(B)v$ then $(\mathrm{Id} - \rho B)v = v - \rho\lambda_i(B)v = \big(1 - \rho\lambda_i(B)\big)v$. Since $\mathrm{Id} - \rho B$, $\rho > 0$, is symmetric, its spectral radius is
$$f(\rho) = \max_{1\le i\le n}\big|\lambda_i(\mathrm{Id} - \rho B)\big| = \max\big\{\,|1 - \rho\lambda_{\min}(B)|,\ |1 - \rho\lambda_{\max}(B)|\,\big\},$$
so that $f(\rho_k) < 1$ holds whenever $0 < \rho_k < \dfrac{2}{\lambda_{\max}(B)}$. This bound for $\rho_k$ is much better than the one established in Theorem 30 in a more general case.
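When $U$ is a box the projection $\Pi_U$ is an explicit coordinate-wise clipping, and the whole iteration fits in a few lines. The quadratic data below are illustrative assumptions; the step obeys $0 < \rho < 2/\lambda_{\max}(B)$.

```python
# Projected gradient u_{k+1} = P_U(u_k - rho * grad F(u_k)) for
# F(u) = 1/2 <Bu,u> - <c,u> and U = [0,1]^2.
B = [[2.0, 0.0], [0.0, 1.0]]
c = [4.0, 1.0]                     # unconstrained minimizer: (2, 1)
lo, hi = [0.0, 0.0], [1.0, 1.0]

def grad(u):
    return [B[0][0]*u[0] + B[0][1]*u[1] - c[0],
            B[1][0]*u[0] + B[1][1]*u[1] - c[1]]

def proj(u):
    # projection onto the box: clip each coordinate
    return [max(lo[i], min(hi[i], u[i])) for i in range(2)]

rho = 0.9                          # 0 < rho < 2 / lambda_max(B) = 1
u = [0.5, 0.5]
for _ in range(200):
    g = grad(u)
    u = proj([u[0] - rho * g[0], u[1] - rho * g[1]])
```

The iterates converge to the constrained minimizer $(1, 1)$, the projection of the unconstrained minimizer onto the box.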
Remark 16. Difficulty: in general $\Pi_U$ has no explicit form. In such a case we can use a penalization method, or another iterative method (see later sections).
3.3.3 Penalty
Let $G : V \to \mathbb{R}$ be convex, continuous and such that
$$G(u) \ge 0,\ \forall u \in V \qquad\text{and}\qquad G(u) = 0 \iff u \in U. \qquad(3.9)$$
For $\varepsilon > 0$, set
$$F_\varepsilon(u) = F(u) + \frac{1}{\varepsilon}\,G(u)$$
and let $u_\varepsilon$ denote its minimizer.
Proof. $F_\varepsilon(u) \ge F(u)$ and $F_\varepsilon$ is strictly convex and coercive, $\forall\varepsilon > 0$, hence statement 1. Noticing that $u_\varepsilon$ is the minimizer of $F_\varepsilon$ and not of $F$, as well as that $G(\bar u) = 0$,
$$F(u_\varepsilon) \le F(u_\varepsilon) + \frac{1}{\varepsilon}G(u_\varepsilon) = F_\varepsilon(u_\varepsilon) \le F_\varepsilon(\bar u) = F(\bar u) + \frac{1}{\varepsilon}G(\bar u) = F(\bar u). \qquad(3.10)$$
Since $F$ is coercive, $(u_\varepsilon)_{\varepsilon>0}$ is bounded. Hence there is a convergent subsequence $(u_{\varepsilon_k})_{k\in\mathbb{N}}$, say
$$\lim_{k} u_{\varepsilon_k} = \tilde u. \qquad(3.11)$$
Using (3.10),
$$F(u_{\varepsilon_k}) \le F_{\varepsilon_k}(u_{\varepsilon_k}) \le F(\bar u). \qquad(3.12)$$
Since $F$ is continuous,
$$F(\tilde u) = \lim_{k} F(u_{\varepsilon_k}) \le F(\bar u). \qquad(3.13)$$
It follows that
$$\lim_{\varepsilon\searrow 0} u_\varepsilon = \bar u.$$
When $U$ is given by constraint functions, a usual choice is
$$G(u) = \sum_{i=1}^{q} g_i(u),$$
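A one-dimensional sketch of the penalty scheme follows. The choices $F(u) = (u-3)^2$, $U = \{u \le 1\}$ and the quadratic penalty $G(u) = \max(u-1, 0)^2$ are illustrative assumptions; each $F_\varepsilon$ is minimized by plain gradient descent with a step adapted to $1/\varepsilon$.

```python
# Penalty method: minimize F_eps(u) = (u - 3)^2 + (1/eps) * max(u - 1, 0)^2.
# The constrained minimizer is u = 1; analytically u_eps = (3 eps + 1)/(eps + 1).
def u_eps(eps):
    u = 0.0
    step = 0.1 * eps / (eps + 1.0)   # safe: the gradient is (2 + 2/eps)-Lipschitz
    for _ in range(20000):
        g = 2.0 * (u - 3.0) + (2.0 / eps) * max(u - 1.0, 0.0)
        u -= step * g
    return u

sols = [u_eps(e) for e in (1.0, 0.1, 0.01)]   # u_eps decreases toward 1
```

The computed minimizers violate the constraint less and less as $\varepsilon \searrow 0$, matching the convergence statement above.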
3.4 Equality constraints
3.4.1 Lagrange multipliers
Here $V_1$, $V_2$ and $Y$ are n.v.s. Optimality conditions are based on the Implicit functions theorem.
Theorem 32 (Implicit functions theorem). Let $G : O \subset V_1\times V_2 \to Y$ be $C^1$ in $O$, where $V_2$ and $Y$ are complete (i.e. Banach spaces). Suppose that $\bar u = (\bar u_1, \bar u_2) \in O$ and $v \in Y$ are such that $G(\bar u_1, \bar u_2) = v$ and $D_2 G(\bar u_1, \bar u_2) \in \mathrm{Isom}(V_2; Y)$.
Then there exist $O_1 \subset V_1$, $O_2 \subset V_2$ such that $(\bar u_1, \bar u_2) \in O_1\times O_2 \subset O$ and there is a unique continuous function $g : O_1 \subset V_1 \to V_2$, called an implicit function, such that
$$\{(u_1, u_2) \in O_1\times O_2 : G(u_1, u_2) = v\} = \{(u_1, u_2) \in O_1\times V_2 : u_2 = g(u_1)\}.$$
Moreover $g$ is differentiable at $\bar u_1$ and
$$Dg(\bar u_1) = -\big(D_2 G(\bar u_1, \bar u_2)\big)^{-1} D_1 G(\bar u_1, \bar u_2). \qquad(3.14)$$
The optimality condition derived from it reads: there exists $\Lambda(\bar u)$ such that
$$DF(\bar u) + \Lambda(\bar u)\, DG(\bar u) = 0. \qquad(3.15)$$
Proof. By the Implicit Functions Theorem 32, there are two open sets $O_1 \subset V_1$ and $O_2 \subset V_2$, with $\bar u \in O_1\times O_2 \subset O$, and a continuous application $g : O_1 \to O_2$ such that
$$(O_1\times O_2) \cap U = \{(u_1, u_2) \in O_1\times V_2 \,|\, u_2 = g(u_1)\}$$
and
$$Dg(\bar u_1) = -\big(D_2 G(\bar u)\big)^{-1} D_1 G(\bar u). \qquad(3.16)$$
Define
$$\widetilde F(u_1) \stackrel{\rm def}{=} F\big(u_1, g(u_1)\big), \qquad u_1 \in O_1.$$
Then $\widetilde F(\bar u_1) = \inf_{u_1\in O_1} \widetilde F(u_1)$ entails $F(\bar u) = \inf_{u\in U} F(u)$ for $\bar u = \big(\bar u_1, g(\bar u_1)\big)$. Since $O_1$ is open, $\bar u_1$ satisfies
$$0 = D\widetilde F(\bar u_1) = D_1 F\big(\bar u_1, g(\bar u_1)\big) + D_2 F\big(\bar u_1, g(\bar u_1)\big)\,Dg(\bar u_1) = D_1 F\big(\bar u_1, g(\bar u_1)\big) - D_2 F\big(\bar u_1, g(\bar u_1)\big)\big(D_2 G(\bar u)\big)^{-1} D_1 G(\bar u),$$
where the last equality comes from (3.16). Using that $\bar u = \big(\bar u_1, g(\bar u_1)\big)$, we can hence write down
$$D_1 F(\bar u) = D_2 F(\bar u)\big(D_2 G(\bar u)\big)^{-1} D_1 G(\bar u)$$
$$D_2 F(\bar u) = D_2 F(\bar u)\big(D_2 G(\bar u)\big)^{-1} D_2 G(\bar u)$$
where the second equality is just an identity. We obtain (3.15) by setting
$$\Lambda(\bar u) = -D_2 F(\bar u)\big(D_2 G(\bar u)\big)^{-1}.$$
Since $D_2 F(\bar u) \in L(V_2; \mathbb{R})$ and $\big(D_2 G(\bar u)\big)^{-1} \in L(V_2; V_2)$, we have $\Lambda(\bar u) \in L(V_2; \mathbb{R})$.
The most usual case arising in practice is considered in the next theorem.
Theorem 34. Let $O \subset \mathbb{R}^n$ be open and $g_i : O \to \mathbb{R}$, $1 \le i \le p$, be $C^1$-functions in $O$. Set
$$U = \{u \in O : g_i(u) = 0,\ 1 \le i \le p\} \subset O.$$
Let $Dg_i(\bar u) \in L(\mathbb{R}^n, \mathbb{R})$, $1 \le i \le p$, be linearly independent and $F : O \to \mathbb{R}$ differentiable at $\bar u$. If $\bar u \in U$ solves the problem $\inf_{u\in U} F(u)$ then there exist $p$ real numbers $\lambda_i(\bar u) \in \mathbb{R}$, $1 \le i \le p$, uniquely defined, such that
$$DF(\bar u) + \sum_{i=1}^{p} \lambda_i(\bar u)\,Dg_i(\bar u) = 0. \qquad(3.17)$$
Proof. Set $G = (g_1, \ldots, g_p)$, so that
$$DG(\bar u) = \begin{pmatrix} Dg_1(\bar u)\\ \vdots\\ Dg_p(\bar u) \end{pmatrix} = \begin{pmatrix} \dfrac{\partial g_1}{\partial u_1} & \cdots & \dfrac{\partial g_1}{\partial u_p} & \cdots & \dfrac{\partial g_1}{\partial u_n}\\ \vdots & & \vdots & & \vdots\\ \dfrac{\partial g_p}{\partial u_1} & \cdots & \dfrac{\partial g_p}{\partial u_p} & \cdots & \dfrac{\partial g_p}{\partial u_n} \end{pmatrix}$$
Since the $Dg_i(\bar u)$ are linearly independent, $\operatorname{rank}(DG(\bar u)) = p \le n$, so we can assume that the first $p\times p$ submatrix of $DG(\bar u)$ is invertible. If $p = n$, $\bar u$ is uniquely determined by $G$. Consider next that $p < n$. Let $\{e_1, \ldots, e_n\}$ be the canonical basis of $\mathbb{R}^n$. Define
$$V_1 \stackrel{\rm def}{=} \operatorname{Span}\{e_{p+1}, \ldots, e_n\}, \qquad V_2 \stackrel{\rm def}{=} \operatorname{Span}\{e_1, \ldots, e_p\}.$$
Redefine $G$ in the following way:
$$G : V_1\times V_2 \to V_2, \qquad (u_1, u_2) \mapsto G(u) = \sum_{i=1}^{p} g_i(u)\,e_i.$$
The uniqueness of $\lambda(\bar u)$ is due to the fact that $\operatorname{rank}(DG(\bar u)) = p$.
The numbers $\lambda_i(\bar u)$, $i \in \{1, 2, \ldots, p\}$, are called Lagrange multipliers relevant to a constrained extremum (minimum or maximum) of $F$ on $U$.
To resume: $\bar u \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^p$, hence $n + p$ unknowns:
$$\nabla F(\bar u) + \sum_{i=1}^{p} \lambda_i \nabla g_i(\bar u) = 0 \qquad n \text{ equations}$$
$$g_i(\bar u) = 0,\ 1 \le i \le p \qquad p \text{ equations}$$
3.4.2
The optimality conditions (3.18) yield
$$B\bar u + A^T\bar\lambda = c, \qquad A\bar u = v. \qquad(3.20)$$
For $B = \mathrm{Id}$ this gives the projection onto the affine subspace $U = \{u : Au = v\}$:
$$\bar u = \Pi_U(w) = \big(\mathrm{Id} - A^T(AA^T)^{-1}A\big)w + A^T(AA^T)^{-1}v. \qquad(3.21)$$
Using the SVD $A = Q_1\Sigma Q_2^T$, the pseudo-inverse is $A^\dagger = Q_2\Sigma^\dagger Q_1^T$, where
$$\Sigma^\dagger[i,j] = \begin{cases} \dfrac{1}{\Sigma[i,i]} & \text{if } i = j \text{ and } 1 \le i \le r\\[1mm] 0 & \text{else.} \end{cases}$$
The minimum-norm solution of $Au = v$ reads
$$\bar u : \ \|\bar u\| = \inf_{u\in U}\|u\| \iff \bar u = A^\dagger v.$$
3.5 Inequality constraints
3.5.1
$C(u)$ is a cone with vertex $0$, not necessarily convex. $C(u)$ is the set of all directions such that, starting from $u$, we can reach another point $v \in U$.
For $w \in U$, consider
$$u_k = u + \frac{1}{k+2}\,(w - u) \in U, \qquad \forall k \in \mathbb{N}.$$
Then
$$\frac{u_k - u}{\|u_k - u\|} = \frac{w - u}{\|w - u\|} = \frac{v}{\|v\|}, \qquad \forall k \in \mathbb{N}.$$
Hence $v \in C(u)$.
3.5.2
$$K = \Big\{\sum_{i\in I} \lambda_i a_i \ :\ \lambda_i \ge 0,\ i \in I\Big\} \qquad(3.23)$$
The proof of $1 \Rightarrow 2$ is aimed at showing that $b \in K$. The proof is conducted ad absurdum and is based on 3 substatements.
(a) $K$ in (3.23) has the properties stated below.
$K$ is a cone with vertex at $0$:
$$\mu > 0,\ \lambda_i \ge 0,\ i \in I \ \Rightarrow\ \mu\sum_{i\in I}\lambda_i a_i = \sum_{i\in I}(\mu\lambda_i)\,a_i \in K.$$
$K$ is convex. Let $0 < \theta < 1$, $\lambda_i \ge 0$, $i \in I$, and $\nu_i \ge 0$, $i \in I$:
$$\theta\sum_{i\in I}\lambda_i a_i + (1-\theta)\sum_{i\in I}\nu_i a_i = \sum_{i\in I}\big(\theta\lambda_i + (1-\theta)\nu_i\big)a_i \in K,$$
because $\theta\lambda_i + (1-\theta)\nu_i \ge 0$.
$K$ is closed in $V$. Suppose that $\{a_i, i\in I\}$ are linearly independent. Set $A = \operatorname{Span}\{a_i : i\in I\}$, a finite-dimensional vector subspace (hence closed). Consider an arbitrary convergent sequence $(v_j)_{j\in\mathbb{N}} \subset K$. For any $j \in \mathbb{N}$, there is a unique set $\{\lambda_j[i] \ge 0, i\in I\}$ so that
$$v_j = \sum_{i\in I}\lambda_j[i]\,a_i \in K \cap A.$$
Then
$$\lim_j v_j = \sum_{i\in I}\big(\lim_j \lambda_j[i]\big)a_i \in K \cap A.$$
Hence $K$ is closed in $V$. Let $\{a_i, i\in I\}$ be linearly dependent. Then $K$ is a finite union of cones corresponding to linearly independent subsets of $\{a_i\}$. These cones are closed, hence $K$ is closed.
(b) Let $K \subset V$ be convex, closed and $K \ne \emptyset$; let $b \in V \setminus K$. Using the Hahn-Banach theorem 3, there exist $h \in V$ and $\delta \in \mathbb{R}$ such that
$$\langle u, h\rangle > \delta,\ \forall u \in K, \qquad \langle b, h\rangle < \delta.$$
By the Projection Theorem 28 (p. 50), there is a unique $c \in K$ such that
$$\|b - c\| = \inf_{u\in K}\|b - u\| > 0 \qquad(> 0 \text{ because } b \notin K)$$
and $c \in V$ satisfies
$$\langle u - c, b - c\rangle \le 0, \qquad \forall u \in K. \qquad(3.24)$$
Then $\|c - b\|^2 > 0$.
Since $K$ is a cone,
$$\langle \mu u, h\rangle > \delta,\ \forall\mu > 0,\ \forall u \in K \ \Rightarrow\ \langle u, h\rangle > \frac{\delta}{\mu},\ \forall\mu > 0,\ \forall u \in K.$$
3.5.3 Constraint qualification
$$Q(u) = \big\{w \in V : \langle \nabla h_i(u), w\rangle \le 0,\ \forall i \in I(u)\big\} \qquad\text{(convex cone)} \qquad(3.25)$$
This cone is much more practical than $C$, even though it still depends on $u$. The same figures show that in some cases $Q(u) \ne C(u)$. Definition 18 below is constructed in such a way that these cones are equal in the most important cases.
Definition 18. The constraints are qualified at $u \in U$ if one of the following conditions holds:
there exists $w \in V\setminus\{0\}$ such that $\forall i \in I(u)$, $\langle \nabla h_i(u), w\rangle \le 0$, where the inequality is strict if $h_i$ is not affine;
$h_i$ is affine, $\forall i \in I(u)$.
Naturally, $Q(u) = V$ if $I(u) = \emptyset$.
Lemma 5 ([26]). Let $h_i$ for $i \in I(u)$ be differentiable at $u$. Then $C(u) \subset Q(u)$. If moreover $h_i$ is differentiable at $u$ for $i \in I(u)$, $h_i$ is continuous at $u$ for $i \in I\setminus I(u)$, and the constraints are qualified at $u$, then $C(u) = Q(u)$.
3.5.4
Putting together all previous results leads to one of the most important statements in optimization theory: the Kuhn-Tucker (KT) relations, stated below.
Theorem 38 (Necessary conditions for a minimum). $V$ a Hilbert space, $O \subset V$ open, $h_i : O \to \mathbb{R}$, $1 \le i \le q$, and
$$U = \{u \in O : h_i(u) \le 0,\ 1 \le i \le q\}. \qquad(3.26)$$
Suppose that:
(i) $h_i$ is differentiable at $\bar u$, $\forall i \in I(\bar u)$;
(ii) $h_i$ is continuous at $\bar u$, $\forall i \in I\setminus I(\bar u)$;
(iii) the constraints are qualified at $\bar u \in U$;
(iv) $F : O \to \mathbb{R}$ is differentiable at $\bar u$;
(v) $F$ has a relative minimum at $\bar u$ w.r.t. $U$.
Then
$$\exists\,\lambda_i(\bar u) \ge 0,\ 1 \le i \le q,\ \text{such that}\quad \nabla F(\bar u) + \sum_{i=1}^{q}\lambda_i(\bar u)\nabla h_i(\bar u) = 0, \qquad \sum_{i=1}^{q}\lambda_i(\bar u)\,h_i(\bar u) = 0. \qquad(3.27)$$
The $\lambda_i(\bar u)$ are called generalized Lagrange multipliers.
Proof. Assumptions (i), (ii) and (iii) are as in Theorem 37, hence
$$C(\bar u) = Q(\bar u).$$
By Theorem 35 (p. 61), or equivalently by Theorem 27 (p. 49), since $Q(\bar u)$ is convex,
$$\langle \nabla F(\bar u), v\rangle \ge 0, \qquad \forall v \stackrel{\rm def}{=} u - \bar u \in Q(\bar u). \qquad(3.28)$$
By the Farkas-Minkowski lemma, this yields
$$-\nabla F(\bar u) = \sum_{i\in I(\bar u)}\lambda_i(\bar u)\nabla h_i(\bar u), \qquad \lambda_i(\bar u) \ge 0. \qquad(3.29)$$
Since $h_i(\bar u) = 0$ for $i \in I(\bar u)$, we get $\sum_{i\in I(\bar u)}\lambda_i(\bar u)\,h_i(\bar u) = 0$. Fixing $\lambda_i(\bar u) = 0$ whenever $i \in I\setminus I(\bar u)$ yields (3.27).
Remarks
The $\lambda_i(\bar u) \ge 0$, $i \in I(\bar u)$, are defined in a unique way if and only if the set $\{\nabla h_i(\bar u), i \in I(\bar u)\}$ is linearly independent (recall the Farkas-Minkowski lemma).
The KT relations (3.27) depend on $\bar u$, hence they are difficult to exploit directly.
(3.27) yields a system of (often nonlinear) equations and inequations, not easy to solve.
3.6
3.6.1
Theorem 39 (Necessary and sufficient conditions for a minimum on $U$). Let $V$ be a Hilbert space, $O \subset V$ (open) and $U \subset O$ convex. Suppose $\bar u \in U$ and that:
(i) $F : O \subset V \to \mathbb{R}$ and $h_i : O \subset V \to \mathbb{R}$, $1 \le i \le q$, are differentiable at $\bar u$;
(ii) $h_i : V \to \mathbb{R}$, $1 \le i \le q$, are convex;
(iii) the convex constraints qualification holds (Definition 19).
Then we have the following statements:
1. If $F$ admits at $\bar u \in U$ a relative minimum w.r.t. $U$, then there exist $\{\lambda_i(\bar u) \in \mathbb{R}_+ : 1 \le i \le q\}$ such that
$$\nabla F(\bar u) + \sum_{i=1}^{q}\lambda_i(\bar u)\nabla h_i(\bar u) = 0 \qquad(3.30)$$
and
$$\sum_{i=1}^{q}\lambda_i(\bar u)\,h_i(\bar u) = 0. \qquad(3.31)$$
Proof. To prove 1, we have to show that if the constraints are qualified in the convex sense (Definition 19) then they are qualified in the general sense (Definition 18, p. 64), for any $u \in U$, which will allow us to apply the KT Theorem 38.
Let $w \ne u$, $w \in U$, satisfy Definition 19. Set $v = w - u$. Using that $h_i(u) = 0$, $i \in I(u)$, and that the $h_i$ are convex (see Property 2-1, p. 21),
$$i \in I(u) \ \Rightarrow\ \langle \nabla h_i(u), v\rangle = h_i(u) + \langle \nabla h_i(u), w - u\rangle \le h_i(w).$$
By Definition 19, we know that $h_i(w) \le 0$ and that $h_i(w) < 0$ if $h_i$ is not affine. Then
$$\langle \nabla h_i(u), w - u\rangle = \langle \nabla h_i(u), v\rangle \le 0,$$
where the inequality is strict if $h_i$ is not affine. Hence Definition 18 is satisfied for $v$.
Let $u = w \in U$. Since $h_i(u) < 0$ is impossible for any $i \in I(u)$, Definition 19 means that all $h_i$ for $i \in I(u)$ are affine. Then Definition 18 is trivially satisfied.
Statement 1 is a straightforward consequence of the KT relations (Theorem 38, p. 64).
To prove statement 2, we have to check that $F(\bar u) \le F(u)$, $\forall u \in U$. Let $u \in U$ (arbitrary). The convexity of the $h_i$ (Property 2-1) and the definition of $U$ yield the two inequalities below:
$$i \in I(\bar u) \ \Rightarrow\ \langle \nabla h_i(\bar u), u - \bar u\rangle \le h_i(u) - h_i(\bar u) \le 0, \qquad \forall u \in U.$$
Since $\lambda_i(\bar u) \ge 0$,
$$\sum_{i\in I(\bar u)}\lambda_i(\bar u)\big\langle \nabla h_i(\bar u),\ u - \bar u\big\rangle \le \sum_{i\in I(\bar u)}\lambda_i(\bar u)\big(h_i(u) - h_i(\bar u)\big) \le 0.$$
Hence
$$\begin{aligned} F(\bar u) &\le F(\bar u) - \sum_{i\in I(\bar u)}\lambda_i(\bar u)\big\langle \nabla h_i(\bar u),\ u - \bar u\big\rangle\\ &= F(\bar u) - \Big\langle \sum_{i=1}^{q}\lambda_i(\bar u)\nabla h_i(\bar u),\ u - \bar u\Big\rangle \qquad\text{by (3.31), } \lambda_i(\bar u) = 0 \text{ if } i \notin I(\bar u)\\ &= F(\bar u) + \big\langle \nabla F(\bar u),\ u - \bar u\big\rangle \qquad\text{(by (3.30))}\\ &\le F(u), \quad \forall u \in U \qquad\text{($F$ is convex).} \end{aligned}$$
If the $\lambda_i(\bar u) \in \mathbb{R}^q_+$ were known, then we would have an unconstrained minimization problem.
Example 8. $U = \{u \in \mathbb{R}^n : Au \le b \in \mathbb{R}^q\}$: the constraints are qualified if and only if $U$ is non-empty. Let $F$ be convex. The necessary and sufficient conditions for $F$ to have a constrained minimum at $\bar u \in U$: there exists $\lambda \in \mathbb{R}^q_+$ such that $\nabla F(\bar u) + A^T\lambda = 0$ and $\lambda_i \ge 0$, $1 \le i \le q$, with $\lambda_i = 0$ if $\langle a_i, \bar u\rangle - b_i < 0$.
3.6.2 Duality
Let $V$, $W$ be any sets.
Definition 20. A saddle point of $L : V\times W \to \mathbb{R}$ at $(\bar u, \bar\lambda) \in V\times W$:
$$\sup_{\lambda\in W} L(\bar u, \lambda) = L(\bar u, \bar\lambda) = \inf_{u\in V} L(u, \bar\lambda).$$
Set $K(\lambda) \stackrel{\rm def}{=} \inf_{u\in V} L(u, \lambda)$, $\lambda \in W$. Then weak duality always holds:
$$\sup_{\lambda\in W}\,\inf_{u\in V} L(u, \lambda) \le \inf_{u\in V}\,\sup_{\lambda\in W} L(u, \lambda).$$
Proof. If $(\bar u, \bar\lambda)$ is a saddle point of $L$,
$$\inf_{u\in V}\,\sup_{\lambda\in W} L(u, \lambda) \le \sup_{\lambda\in W} L(\bar u, \lambda) \stackrel{\text{Def. Saddle Pt.}}{=} L(\bar u, \bar\lambda) = \inf_{u\in V} L(u, \bar\lambda) \le \sup_{\lambda\in W}\,\inf_{u\in V} L(u, \lambda).$$
Consider the Lagrangian
$$L(u, \lambda) = F(u) + \sum_{i=1}^{q}\lambda_i h_i(u), \qquad \lambda \in \mathbb{R}^q_+.$$
Then:
$$\bar u \in U \text{ solves (P)} \iff \exists\,\bar\lambda \in \mathbb{R}^q_+ \text{ such that } (\bar u, \bar\lambda) \text{ is a saddle point of } L.$$
Proof. By $L(\bar u, \lambda) \le L(\bar u, \bar\lambda)$, $\forall\lambda \in \mathbb{R}^q_+$:
$$F(\bar u) + \sum_{i=1}^{q}\lambda_i h_i(\bar u) \le F(\bar u) + \sum_{i=1}^{q}\bar\lambda_i h_i(\bar u), \qquad \forall\lambda \in \mathbb{R}^q_+,$$
that is,
$$\sum_{i=1}^{q}(\lambda_i - \bar\lambda_i)\,h_i(\bar u) \le 0, \qquad \forall\lambda \in \mathbb{R}^q_+. \qquad(3.32)$$
Letting $\lambda_i \to \infty$ in (3.32) shows that $h_i(\bar u) \le 0$, $1 \le i \le q$, hence $\bar u \in U$. Let $\lambda_i = 0$, $1 \le i \le q$; then (3.32) yields
$$\sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) \ge 0.$$
Since $h_i(\bar u) \le 0$ and $\bar\lambda_i \ge 0$, $1 \le i \le q$, we also have $\sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) \le 0$, hence
$$\sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) = 0.$$
Using that $L(\bar u, \bar\lambda) \le L(u, \bar\lambda)$, $\forall u \in V$:
$$F(\bar u) = F(\bar u) + \sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) = L(\bar u, \bar\lambda) \le L(u, \bar\lambda) = F(u) + \sum_{i=1}^{q}\bar\lambda_i h_i(u) \le F(u), \qquad \forall u \in U$$
(remind that $\sum_{i=1}^{q}\bar\lambda_i h_i(u) \le 0$, $\forall u \in U$). Hence statement 1.
Let $\bar u$ solve (P). By the KT theorem (see (3.30)-(3.31), p. 66), there exists $\bar\lambda \in \mathbb{R}^q_+$ such that
$$\nabla F(\bar u) + \sum_{i=1}^{q}\bar\lambda_i\nabla h_i(\bar u) = 0 \quad\text{and}\quad \sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) = 0. \qquad(3.33)$$
We have to check that $(\bar u, \bar\lambda)$ is a saddle point of $L$. For any $\lambda \in \mathbb{R}^q_+$,
$$L(\bar u, \lambda) = F(\bar u) + \sum_{i=1}^{q}\lambda_i h_i(\bar u) \le F(\bar u) = F(\bar u) + \sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) = L(\bar u, \bar\lambda).$$
The function $u \mapsto F(u) + \sum_{i=1}^{q}\bar\lambda_i h_i(u) = L(u, \bar\lambda)$ is convex and, by (3.33), it reaches its minimum at $\bar u$. Hence
$$L(\bar u, \bar\lambda) = F(\bar u) + \sum_{i=1}^{q}\bar\lambda_i h_i(\bar u) \le L(u, \bar\lambda) = F(u) + \sum_{i=1}^{q}\bar\lambda_i h_i(u), \qquad \forall u \in V.$$
Hence $(\bar u, \bar\lambda)$ is a saddle point of $L$.
For $\lambda \in \mathbb{R}^q_+$ (called the dual variable) we define $u_\lambda \in V$ ($V$ a Hilbert space) and $K : \mathbb{R}^q_+ \to \mathbb{R}$ by:
$$(P_\lambda)\qquad u_\lambda \in V \ |\ L(u_\lambda, \lambda) = \inf_{u\in V} L(u, \lambda), \qquad K(\lambda) \stackrel{\rm def}{=} L(u_\lambda, \lambda).$$
Dual Problem:
$$(P^*)\qquad \bar\lambda \in \mathbb{R}^q_+ \ |\ K(\bar\lambda) = \sup_{\lambda\in\mathbb{R}^q_+} K(\lambda). \qquad(3.34)$$
Lemma: $K$ is differentiable, with
$$\big\langle \nabla K(\lambda), \mu\big\rangle = \sum_{i=1}^{q}\mu_i h_i(u_\lambda), \qquad \forall\mu \in \mathbb{R}^q.$$
Proof. Restate $(P_\lambda)$: $K(\lambda) = \inf_{u\in V} L(u, \lambda) = L(u_\lambda, \lambda)$. Let $\lambda,\ \lambda+\mu \in \mathbb{R}^q_+$. We have:
$$\text{(a)}\quad K(\lambda) \le F(u_{\lambda+\mu}) + \sum_{i=1}^{q}\lambda_i h_i(u_{\lambda+\mu}) = F(u_{\lambda+\mu}) + \sum_{i=1}^{q}(\lambda_i+\mu_i)h_i(u_{\lambda+\mu}) - \sum_{i=1}^{q}\mu_i h_i(u_{\lambda+\mu}) = K(\lambda+\mu) - \sum_{i=1}^{q}\mu_i h_i(u_{\lambda+\mu})$$
and
$$\text{(b)}\quad K(\lambda+\mu) \le F(u_\lambda) + \sum_{i=1}^{q}\lambda_i h_i(u_\lambda) + \sum_{i=1}^{q}\mu_i h_i(u_\lambda) = K(\lambda) + \sum_{i=1}^{q}\mu_i h_i(u_\lambda).$$
Combining (a) and (b),
$$\sum_{i=1}^{q}\mu_i h_i(u_{\lambda+\mu}) \le K(\lambda+\mu) - K(\lambda) \le \sum_{i=1}^{q}\mu_i h_i(u_\lambda).$$
Since $\lambda \mapsto u_\lambda$ is continuous and $h_i$, $1 \le i \le q$, are continuous, for any $\mu \in \mathbb{R}^q_+$ we can write down
$$K(\lambda+\mu) - K(\lambda) = \sum_{i=1}^{q}\mu_i h_i(u_\lambda) + \|\mu\|\,\varepsilon(\mu), \qquad \lim_{\mu\to 0}\varepsilon(\mu) = 0.$$
1. Assume that:
(i) $h_i : V \to \mathbb{R}$, $i = 1, \ldots, q$, are continuous;
(ii) $\forall\lambda \in \mathbb{R}^q_+$, $(P_\lambda)$ admits a unique solution $u_\lambda$;
(iii) $\lambda \mapsto u_\lambda$ is continuous on $\mathbb{R}^q_+$;
(iv) $\bar\lambda \in \mathbb{R}^q_+$ solves $(P^*)$.
Then $(u_{\bar\lambda}, \bar\lambda)$ is a saddle point of $L$.
2. Assume that:
(i) (P) admits a solution $\bar u$;
(ii) $F : V \to \mathbb{R}$ and $h_i : V \to \mathbb{R}$, $1 \le i \le q$, are convex;
(iii) $F$ and $h_i$, $1 \le i \le q$, are differentiable at $\bar u$;
(iv) the (convex) constraints are qualified.
Then: $\bar u$ solves (P) $\Rightarrow$ $(P^*)$ admits a solution $\bar\lambda$.
Proof. Statement 1. Let $\bar\lambda$ solve $(P^*)$. Then
$$K(\bar\lambda) = L(u_{\bar\lambda}, \bar\lambda) = \inf_{u\in V} L(u, \bar\lambda).$$
Since $\bar\lambda$ maximizes $K$ over $\mathbb{R}^q_+$,
$$\big\langle \nabla K(\bar\lambda),\ \lambda - \bar\lambda\big\rangle = \sum_{i=1}^{q}(\lambda_i - \bar\lambda_i)\,h_i(u_{\bar\lambda}) \le 0, \qquad \forall\lambda \in \mathbb{R}^q_+.$$
Hence
$$L(u_{\bar\lambda}, \lambda) = F(u_{\bar\lambda}) + \sum_{i=1}^{q}\lambda_i h_i(u_{\bar\lambda}) \le F(u_{\bar\lambda}) + \sum_{i=1}^{q}\bar\lambda_i h_i(u_{\bar\lambda}) = L(u_{\bar\lambda}, \bar\lambda), \qquad \forall\lambda \in \mathbb{R}^q_+,$$
so $(u_{\bar\lambda}, \bar\lambda)$ is a saddle point of $L$.
Statement 2. Using Theorem 41-2 (p. 68), there exists $\bar\lambda \in \mathbb{R}^q_+$ such that $(\bar u, \bar\lambda)$ is a saddle point of $L$:
$$L(\bar u, \bar\lambda) = \inf_{u\in V} L(u, \bar\lambda) = K(\bar\lambda), \qquad L(\bar u, \bar\lambda) = \sup_{\lambda\in\mathbb{R}^q_+} L(\bar u, \lambda) \ge \sup_{\lambda\in\mathbb{R}^q_+} K(\lambda),$$
hence $\bar\lambda$ solves $(P^*)$.
Example: $F(u) = \frac12\langle Bu, u\rangle - \langle c, u\rangle$ subject to $Au \le b$, with
$$\operatorname{rank} A = q < n. \qquad(3.35)$$
$$0 = \nabla_1 L(u_\lambda, \lambda) = B u_\lambda - c + A^T\lambda \ \Rightarrow\ u_\lambda = B^{-1}(c - A^T\lambda)$$
$$K(\lambda) = L(u_\lambda, \lambda) = -\frac12\big\langle AB^{-1}A^T\lambda,\ \lambda\big\rangle + \big\langle AB^{-1}c - b,\ \lambda\big\rangle - \frac12\big\langle B^{-1}c,\ c\big\rangle, \qquad \lambda \in \mathbb{R}^q_+.$$
$$\bar u = B^{-1}(c - A^T\bar\lambda).$$
3.6.3 Uzawa's Method
Compute $\bar\lambda$, a solution of $(P^*)$, by maximization of $K$ using gradient with projection:
$$\lambda_{k+1} = \Pi_+\big(\lambda_k + \rho\,\nabla K(\lambda_k)\big), \quad\text{i.e.}\quad \lambda_{k+1}[i] = \max\big\{\lambda_k[i] + \rho\, h_i(u_k),\ 0\big\},\ \ 1 \le i \le q,$$
where $u_k = u_{\lambda_k}$ solves $L(u_k, \lambda_k) = \inf_{u\in V} L(u, \lambda_k)$. Consider affine constraints
$$h(u) = Au - b \le 0 \in \mathbb{R}^q \qquad(3.36)$$
and suppose that $F$ is strongly convex with constant $\alpha$ and that
$$0 < \rho < \frac{2\alpha}{\|A\|_2^2}. \qquad(3.37)$$
Then
$$\lim_{k} u_k = \bar u. \qquad(3.38)$$
If moreover $\operatorname{rank} A = q$, then $\lim_k \lambda_k = \bar\lambda$.
By the Projection theorem, $\bar\lambda$ is the projection of $\bar\lambda + \rho\, h(\bar u)$ on $\mathbb{R}^q_+$, i.e.
$$\bar\lambda = \Pi_+\big(\bar\lambda + \rho\, h(\bar u)\big).$$
The iterates solve the system, for $k \in \mathbb{N}$:
$$\nabla F(u_k) + A^T\lambda_k = 0, \qquad \lambda_{k+1} = \Pi_+\big(\lambda_k + \rho\, h(u_k)\big).$$
Subtracting $\nabla F(\bar u) + A^T\bar\lambda = 0$ from the first equation,
$$\nabla F(u_k) - \nabla F(\bar u) = -A^T(\lambda_k - \bar\lambda). \qquad(3.39)$$
Convergence of $u_k$:
$$\begin{aligned} \|\lambda_{k+1} - \bar\lambda\|_2^2 &= \big\|\Pi_+\big(\lambda_k + \rho\, h(u_k)\big) - \Pi_+\big(\bar\lambda + \rho\, h(\bar u)\big)\big\|_2^2\\ &\le \big\|\lambda_k - \bar\lambda + \rho A(u_k - \bar u)\big\|_2^2 \qquad\text{by (3.36): } h(u_k) - h(\bar u) = A(u_k - \bar u)\\ &= \|\lambda_k - \bar\lambda\|_2^2 + 2\rho\big\langle A^T(\lambda_k - \bar\lambda),\ u_k - \bar u\big\rangle + \rho^2\|A(u_k - \bar u)\|_2^2\\ &= \|\lambda_k - \bar\lambda\|_2^2 - 2\rho\big\langle \nabla F(u_k) - \nabla F(\bar u),\ u_k - \bar u\big\rangle + \rho^2\|A(u_k - \bar u)\|_2^2 \qquad\text{(using (3.39))}\\ &\le \|\lambda_k - \bar\lambda\|_2^2 - 2\rho\alpha\|u_k - \bar u\|_2^2 + \rho^2\|A\|_2^2\,\|u_k - \bar u\|_2^2 \qquad\text{($F$ is strongly convex, see (i))}\\ &= \|\lambda_k - \bar\lambda\|_2^2 - \rho\big(2\alpha - \rho\|A\|_2^2\big)\|u_k - \bar u\|_2^2. \end{aligned}$$
Note that by (iii), we have $2\alpha - \rho\|A\|_2^2 > 0$. It follows that $\|\lambda_{k+1} - \bar\lambda\|_2 \le \|\lambda_k - \bar\lambda\|_2$, hence $\big(\|\lambda_{k+1} - \bar\lambda\|_2\big)_{k\in\mathbb{N}}$ is decreasing. Being bounded from below, $\big(\|\lambda_{k+1} - \bar\lambda\|_2\big)_{k\in\mathbb{N}}$ converges (not necessarily to 0), that is
$$\lim_{k}\Big(\|\lambda_{k+1} - \bar\lambda\|_2^2 - \|\lambda_k - \bar\lambda\|_2^2\Big) = 0.$$
Then
$$0 \le \rho\big(2\alpha - \rho\|A\|_2^2\big)\|u_k - \bar u\|_2^2 \le \|\lambda_k - \bar\lambda\|_2^2 - \|\lambda_{k+1} - \bar\lambda\|_2^2 \to 0 \ \text{ as } k \to \infty,$$
hence $u_k \to \bar u$. If $\operatorname{rank} A = q$, then
$$\operatorname{Range}(A) = \mathbb{R}^q, \qquad \ker A^T = \{0\},$$
and the convergence of $\lambda_k$ follows.
Remark 21. The method of Uzawa amounts to projected gradient maximization with respect to $\lambda$ (solving the dual problem).
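A minimal sketch of Uzawa's iteration on a small quadratic program follows. The data ($B = 2\,\mathrm{Id}$, $c$, one affine constraint $u_1 + u_2 \le 1$) are illustrative assumptions; the $u$-step is explicit because $L(\cdot,\lambda)$ is quadratic, and the step $\rho$ satisfies $0 < \rho < 2\alpha/\|A\|_2^2$.

```python
# Uzawa's method for min u1^2 + u2^2 - 4 u1 - 2 u2  s.t.  u1 + u2 <= 1.
# Exact solution: u = (1, 0), multiplier lambda = 2.
A = [[1.0, 1.0]]
b = [1.0]
c = [4.0, 2.0]

lam = [0.0]
rho = 0.5                  # 0 < rho < 2*alpha/||A||^2 = 2*2/2 = 2
for _ in range(300):
    # u-step: grad_u L = 2u - c + A^T lam = 0  =>  u = (c - A^T lam)/2
    u = [(c[0] - A[0][0] * lam[0]) / 2.0,
         (c[1] - A[0][1] * lam[0]) / 2.0]
    # lambda-step: projected gradient ascent on the dual
    viol = A[0][0] * u[0] + A[0][1] * u[1] - b[0]
    lam[0] = max(lam[0] + rho * viol, 0.0)
```

The multiplier iteration here is a contraction with factor $1/2$, so both $\lambda_k$ and $u_k$ converge geometrically to the KKT pair.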
3.7
$$U = \left\{u \in \mathbb{R}^n : \begin{array}{l} g_i(u) = 0,\ 1 \le i \le p\\ h_i(u) \le 0,\ 1 \le i \le q \end{array}\right\} \ne \emptyset$$
Associated Lagrangian:
$$L(u, \lambda, \mu) = F(u) + \sum_{i=1}^{p}\lambda_i g_i(u) + \sum_{i=1}^{q}\mu_i h_i(u), \qquad \mu_i \ge 0,\ 1 \le i \le q.$$
3.7.1
Assume that:
1. $F : \mathbb{R}^n \to \mathbb{R}$, $\{g_i\}$ and $\{h_i\}$ are $C^1$ on $O(\bar u)$;
2. $\{\nabla g_i(\bar u),\ 1 \le i \le p\} \cup \{\nabla h_i(\bar u),\ i \in I(\bar u)\}$ are linearly independent.
Then there exist $\bar\lambda \in \mathbb{R}^p$ and $\bar\mu \in \mathbb{R}^q_+$ such that
$$\nabla_u L(\bar u, \bar\lambda, \bar\mu) = 0$$
$$1 \le i \le p:\ g_i(\bar u) = 0$$
$$1 \le i \le q:\ h_i(\bar u) \le 0,\ \ \bar\mu_i \ge 0\ \text{ and }\ \bar\mu_i h_i(\bar u) = 0.$$
The relations $\bar\mu_i h_i(\bar u) = 0$, $1 \le i \le q$, and $\bar\mu_i > 0$, $i \in I(\bar u)$, are the complementarity conditions (facilities for numerical methods).
$\bar\lambda$, $\bar\mu$: uniqueness by assumption 2.
3.7.2 (Verify if $\bar u$ is a minimum indeed)
Critical cone:
$$\bar C = \Big\{v \in C(\bar u) :\ \langle \nabla g_i(\bar u), v\rangle = 0,\ 1 \le i \le p;\ \ \langle \nabla h_i(\bar u), v\rangle = 0\ \text{ if } i \in I(\bar u)\ \text{and}\ \bar\mu_i > 0\Big\}$$
Theorem 45 (CN). Suppose that $F : \mathbb{R}^n \to \mathbb{R}$, $\{g_i\}$ and $\{h_i\}$ are $C^2$ on $O(\bar u)$ and that all conditions of Theorem 44 hold.
(CN) Then $\nabla^2_{uu} L(\bar u, \bar\lambda, \bar\mu)(v, v) \ge 0$, $\forall v \in \bar C$.
(CS) If in addition $\nabla^2_{uu} L(\bar u, \bar\lambda, \bar\mu)(v, v) > 0$, $\forall v \in \bar C\setminus\{0\}$, then $\bar u$ is a strict (local) minimum of $\inf_{u\in U} F(u)$.
3.7.3
$$Au - b = 0 \in \mathbb{R}^p, \qquad Cu - f \le 0 \in \mathbb{R}^q$$
3.7.4
Constraints yield numerical difficulties. (When a constraint is satisfied, it can be difficult to realize a good decrease at the next iteration.)
Interior point methods, main idea: satisfy the constraints non-strictly; the constraints are satisfied asymptotically. At each iteration one realizes an important step along the direction given by $\nabla^2 F$, even though the calculations are more tricky.
Actually, interior point methods are considered among the most powerful methods in the presence of constraints (both linear and non-linear). See [69] (chapters 14 and 19) or [90] (chapter 1).
Sketch of example: minimize $F : \mathbb{R}^n \to \mathbb{R}$ subject to $Au \le b$, $b \in \mathbb{R}^q$. Set $y = b - Au$, then $y \ge 0$.
Notations: $\mathbb{1} = [1, \ldots, 1]^T$, $Y = \operatorname{diag}(y_1, \ldots, y_q)$, $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_q)$.
We need to solve:
$$S(u, y, \lambda) \stackrel{\rm def}{=} \begin{pmatrix} \nabla F(u) + A^T\lambda\\ Au - b + y\\ Y\Lambda\mathbb{1} \end{pmatrix} = 0 \quad\text{subject to}\quad y \ge 0,\ \lambda \ge 0.$$
Note that $Y\Lambda\mathbb{1} = 0$ is the complementarity condition.
Duality measure: $\mu = \frac{1}{q}\, y^T\lambda$.
Central path defined using $\tau\mu$ for $\tau \in [0,1]$ fixed.
At each iteration, one derives a point of the central path $(u_\tau, y_\tau, \lambda_\tau)$, $\tau > 0$, by solving
$$S(u_\tau, y_\tau, \lambda_\tau) = \begin{pmatrix} 0\\ 0\\ \tau\mu\mathbb{1} \end{pmatrix}, \qquad y_\tau > 0,\ \lambda_\tau > 0.$$
The step to a new point is done so that we remain in the interior of $U$. The complementarity condition is relaxed using $\tau\mu$.
The new direction in the variables $u$, $\lambda$, $y$ is found using Newton's method. Often some variables can be solved explicitly and eliminated from $S$.
Interior point methods are polynomial-time methods, and were definitely one of the main events in optimization during the last decade, in particular in linear programming.
3.8 Nesterov's approach
For $V$ a Hilbert space, consider $F : V \to \mathbb{R}$ convex and Lipschitz differentiable (with constant $L$) and $U \subset V$ a closed and convex domain. The problem is to solve
$$\inf_{u\in U} F(u). \qquad(3.40)$$
The scheme involves a prox-function $d$ satisfying $d(u) \ge \frac12\|u_0 - u\|^2$, $\forall u \in U$, and two auxiliary sequences $(w_k)$ and $(z_k)$; the new iterate is the convex combination
$$u_{k+1} = \frac{k+1}{k+3}\,w_k + \frac{2}{k+3}\,z_k,$$
and the scheme guarantees the convergence rate
$$F(u_k) - F(\bar u) \le \frac{4L\,d(\bar u)}{(k+1)(k+2)}.$$
The rationale: similarly to CG, one computes the direction at iteration $k$ by using the information in $\{\nabla F(u_{k-1}), \ldots, \nabla F(u_0)\}$.
For details and applications in image processing, see [9] and [88].
Y. Nesterov proposes an accelerated algorithm in [59] where the estimation of the Lipschitz constant is adaptive and improved.
Other solverssee next chapter.
Chapter 4
Non-differentiable problems
Textbooks: [38, 73, 45, 46, 93, 14].
4.1 Specificities
4.1.1 Examples
Property 7. Typically $D_i\bar u = 0$ for many indexes $i$, see [63, 66]. This effect is known as staircasing if $\varphi = \mathrm{TV}$. It arises with probability zero if $\varphi$ is differentiable.
Non-differentiable data-fitting
It is becoming quite popular [3, 64, 65, 18, 22].
$$F(u) = \sum_{i=1}^{m}\ \cdots \qquad(4.1)$$
4.1.2 Kinks
Definition 21. A kink is a point $u$ where $\nabla F(u)$ is not defined (in the usual sense).
Theorem 46 (Rademacher). Let $F : \mathbb{R}^n \to ]-\infty, +\infty]$ be convex and $F \not\equiv +\infty$. Then the subset $\big\{u \in \operatorname{int}\operatorname{dom} F : \nexists\,\nabla F(u)\big\}$ is of Lebesgue measure zero in $\mathbb{R}^n$.
Proof: see [45, p. 189]. The statement extends to nonconvex functions [75, p. 403] and to mappings $F : \mathbb{R}^n \to \mathbb{R}^m$ [39, p. 81].
Hence $F$ is differentiable at almost every $u$. However, minimizers are frequently located at kinks.
Example 9. Consider $F(u) = \frac12(u - w)^2 + \beta|u|$ for $\beta > 0$ and $u, w \in \mathbb{R}$. The minimizer $\bar u$ of $F$ reads
$$\bar u = \begin{cases} w + \beta & \text{if } w < -\beta\\ 0 & \text{if } |w| \le \beta\\ w - \beta & \text{if } w > \beta \end{cases}$$
($\bar u$ is shrunk w.r.t. $w$.)
(Plots for $\beta = 1$ and several values of $w$.)
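The closed form of Example 9 is the soft-thresholding operation; a direct sketch follows. The sample values of $w$ are illustrative assumptions.

```python
# Minimizer of F(u) = 1/2 (u - w)^2 + beta |u|  ("soft thresholding"):
# the kink of |u| at 0 captures the minimizer for all |w| <= beta.
def soft_threshold(w, beta):
    if w < -beta:
        return w + beta
    if w > beta:
        return w - beta
    return 0.0

examples = [soft_threshold(w, 1.0) for w in (0.9, -0.5, 2.0, -3.0)]
```

The first two inputs fall inside the threshold and are mapped exactly to the kink $u = 0$; the others are shrunk toward $0$ by $\beta$.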
4.2 Basic notions
Definition 22. $\Gamma_0(V)$ is the class of all l.s.c. convex proper functions on $V$ (a real Hilbert space). (Remind Definitions 2, 4 and 6 on p. 10.)
4.2.1 Preliminaries
Proposition 4. $f$ is sublinear if and only if
$$f(\alpha u + \beta v) \le \alpha f(u) + \beta f(v), \qquad \forall\alpha > 0,\ \forall\beta > 0. \qquad(4.2)$$
Proof. ($\Rightarrow$) Set $\gamma = \alpha + \beta$. Using the convexity and the positive homogeneity of $f$,
$$f(\alpha u + \beta v) = \gamma f\Big(\frac{\alpha}{\gamma}u + \frac{\beta}{\gamma}v\Big) \le \gamma\Big(\frac{\alpha}{\gamma}f(u) + \frac{\beta}{\gamma}f(v)\Big) = \alpha f(u) + \beta f(v).$$
($\Leftarrow$) Taking $\alpha + \beta = 1$ in (4.2) shows that $f$ is convex. Furthermore,
$$f(\alpha u) = f\Big(\frac{\alpha}{2}u + \frac{\alpha}{2}u\Big) \le 2\,\frac{\alpha}{2}\,f(u) = \alpha f(u).$$
Sublinearity is stronger than convexity (it does not require $\alpha + \beta = 1$) and weaker than linearity (it requires only $\alpha > 0$, $\beta > 0$).
Definition 25. Suppose that $U \subset V$ is nonempty.
Support function: $\sigma_U(u) = \sup_{s\in U}\langle s, u\rangle \in \mathbb{R}\cup\{\infty\}$
Indicator function:
$$\iota_U(u) = \begin{cases} 0 & \text{if } u \in U\\ +\infty & \text{if } u \notin U \end{cases} \qquad(4.3)$$
$\sigma_U(u) = \sup_{v\in U}\langle v, u\rangle$ is finite for all $u$ if $U \subset V$ is bounded.
Proposition 6 (p. 12, [38]). Let $f$ be a proper convex function on a real n.v.s. $V$. The statements below are equivalent:
1. there exists an open set $O \subset V$ over which $f$ is bounded above;
2. $\operatorname{int}\operatorname{dom} f \ne \emptyset$ and $f$ is locally Lipschitz there.
Note that this is the kind of functions we deal with in practical optimization problems.
4.2.2 Directional derivatives
Lemma 10. Let $F$ be convex and proper. The function
$$t \mapsto \frac{F(u + tv) - F(u)}{t}, \qquad t \in \mathbb{R}_+^*$$
is increasing.
Proof. For any $t > 0$, choose an arbitrary $\tau \ge 0$. Set
$$\theta \stackrel{\rm def}{=} \frac{\tau}{t + \tau}, \qquad\text{then}\quad 1 - \theta = \frac{t}{t + \tau}.$$
We check that
$$\theta u + (1 - \theta)\big(u + (t + \tau)v\big) = u + (1 - \theta)(t + \tau)v = u + tv.$$
Hence
$$F(u + tv) = F\Big(\theta u + (1 - \theta)\big(u + (t + \tau)v\big)\Big) \le \theta F(u) + (1 - \theta)F\big(u + (t + \tau)v\big) = \frac{\tau}{t + \tau}\,F(u) + \frac{t}{t + \tau}\,F\big(u + (t + \tau)v\big).$$
Subtracting $F(u)$ on both sides and dividing by $t$, it follows that
$$\frac{F(u + tv) - F(u)}{t} \le \frac{F\big(u + (t + \tau)v\big) - F(u)}{t + \tau}, \qquad \forall t > 0,\ \forall\tau \in \mathbb{R}_+.$$
F′(u)(v) = lim_{t↘0} ( F(u + tv) − F(u) ) / t   (4.4)
         = inf_{t>0} ( F(u + tv) − F(u) ) / t   (4.5)

Whenever F is convex and proper, the limit in (4.4) always exists (see [38, p. 23]). The definition in (4.4) is equivalent to (4.5) since, by Lemma 10 (p. 81), the function t ↦ ( F(u + tv) − F(u) )/t, t ∈ R⁺, decreases to its unique infimum as t ↘ 0.

Remark 22 For nonconvex functions, defined on more general spaces, directional derivatives can be defined only using (4.4); they exist whenever the limit in (4.4) exists.

F′(u)(v) is the right-hand side derivative. The left-hand side derivative is −F′(u)(−v).

F convex ⟹ −F′(u)(−v) ≤ F′(u)(v).

Indeed, u = ½(u − tv) + ½(u + tv), so convexity gives 2F(u) ≤ F(u − tv) + F(u + tv), i.e.

( F(u − tv) − F(u) ) / t + ( F(u + tv) − F(u) ) / t ≥ 0, ∀t > 0.

Taking the limit when t ↘ 0 yields the inequality.
Proposition 9 (1st-order approximation) Let F : Rⁿ → R be convex and proper, with Lipschitz constant ℓ > 0 near u ∈ Rⁿ. Then ∀ε > 0, ∃δ > 0 such that

‖v‖ < δ ⟹ F(u + v) − F(u) − F′(u)(v) ≤ ε ‖v‖.   (4.6)

Proof. Suppose the claim fails for some ε > 0. Then there exist v_k with t_k := ‖v_k‖ ≤ 1/(k + 1) such that

F(u + v_k) − F(u) − F′(u)(v_k) > ε t_k = ε ‖v_k‖, ∀k ∈ N.

Since ‖v_k / t_k‖ = 1, ∀k ∈ N, and {w ∈ Rⁿ : ‖w‖ = 1} is compact, there is a subsequence (v_{k_j}) such that

lim_{j→∞} v_{k_j} / t_{k_j} = v̄,  ‖v̄‖ = 1.

Using the Lipschitz continuity of F and of w ↦ F′(u)(w),

ε ≤ lim_j ( F(u + v_{k_j}) − F(u) − F′(u)(v_{k_j}) ) / t_{k_j}
  ≤ 2ℓ lim_j ‖ v_{k_j}/t_{k_j} − v̄ ‖ + lim_j ( F(u + t_{k_j} v̄) − F(u) ) / t_{k_j} − F′(u)(v̄) = 0.

This contradicts the assumption that ε > 0.
4.2.3 Subdifferentials

∂F(u) = { g ∈ V : ⟨g, v⟩ ≤ F′(u)(v), ∀v ∈ V }   (4.7)
      = { g ∈ V : ⟨g, v⟩ ≤ ( F(u + tv) − F(u) ) / t, ∀v ∈ V, ∀t > 0 }   (using (4.5))
      = { g ∈ V : ⟨g, z − u⟩ ≤ ( F(z) − F(u) ), ∀z ∈ V, ∀t > 0 }   (use z = u + tv ⟺ v = (z − u)/t)
      = { g ∈ V : ⟨g, z − u⟩ ≤ F(z) − F(u), ∀z ∈ V }   (4.8)

The conclusion follows from the observation that when (u, v) describes V² and t describes R⁺, then z = u + tv describes V.

Observe that by the definition of ∂F in (4.7) (p. 83),

F′(u)(v) = sup { ⟨g, v⟩ : g ∈ ∂F(u) } = σ_{∂F(u)}(v),

where σ is the support function (see Definition 25, p. 80). Moreover,

−F′(u)(−v) ≤ ⟨g, v⟩ ≤ F′(u)(v), ∀(g, v) ∈ ∂F(u) × V.
Property 9 If F ∈ Γ₀(V), then ∂F is a maximal monotone mapping; in particular it is monotone:

⟨g₁ − g₂, u₁ − u₂⟩ ≥ 0, ∀u₁, u₂ ∈ V, ∀g₁ ∈ ∂F(u₁), ∀g₂ ∈ ∂F(u₂).

Proposition 10 Let F(u) = ‖Gu‖₂ for G ∈ R^{m×n}. Then

∂F(u) = { Gᵀ Gu / ‖Gu‖₂ }                       if Gu ≠ 0
        { Gᵀ h : ‖h‖₂ ≤ 1, h ∈ Rᵐ }             if Gu = 0     (4.9)

Proof. Let f := ‖·‖₂, i.e. f(u) = ( Σᵢ u[i]² )^{1/2}. Then

u ≠ 0 ⟹ ∇f(u) = u / ‖u‖₂.

We have F = f ∘ G. For Gu ≠ 0, noticing that ∇F(u) = Gᵀ ∇f(Gu), the chain rule entails the first equation in (4.9).

By (4.8), at a point u with Gu = 0,

∂F(u) = { g ∈ Rⁿ : ‖Gz‖₂ ≥ ⟨g, z − u⟩, ∀z ∈ Rⁿ }.   (4.10)

Clearly, ‖Gz‖₂ = ‖G(z − u)‖₂. After setting w = z − u, (4.10) equivalently reads

∂F(u) = { g ∈ Rⁿ : ‖Gw‖₂ ≥ ⟨g, w⟩, ∀w ∈ Rⁿ }.   (4.11)

Define the set

S := { Gᵀ h : ‖h‖₂ ≤ 1, h ∈ Rᵐ }

and suppose that there is g ∈ ∂F(u) with g ∉ S.   (4.12)

Then using the Hahn-Banach (H-B) separation Theorem 3 (p. 13), there exist w ∈ Rⁿ and δ ∈ R so that the hyperplane {x ∈ Rⁿ : ⟨w, x⟩ = δ} separates g and S:

⟨g, w⟩ > δ > ⟨z, w⟩, ∀z ∈ S.

Consequently,

⟨g, w⟩ > sup_h { ⟨Gᵀ h, w⟩ : ‖h‖₂ ≤ 1, h ∈ Rᵐ } = ‖Gw‖₂.

A comparison with (4.11) shows that g ∉ ∂F(u). Hence the assumption in (4.12) is false. Consequently, ∂F(u) = S.
In particular, Proposition 10 gives rise to the following well-known result:

Corollary 1 Let u ∈ Rⁿ. Then

∂( ‖u‖₂ ) = { u / ‖u‖₂ }                 if u ≠ 0
            { h ∈ Rⁿ : ‖h‖₂ ≤ 1 }        if u = 0     (4.13)

In words, the subdifferential of u ↦ ‖u‖₂ at the origin is the closed unit ball with respect to the ℓ₂ norm, centered at zero.

Example 10 Let F : R → R read F(u) = |u|. By (4.13),

u ≠ 0 ⟹ ∂F(u) = { u/|u| } = { sign(u) }
u = 0 ⟹ ∂F(0) = { h ∈ R : |h| ≤ 1 } = [−1, +1].
Some calculus rules:

• φ ∈ Γ₀(V) and ψ ∈ Γ₀(V): for the separable function (u, v) ↦ φ(u) + ψ(v),
  ∂(φ + ψ)(u, v) = ∂φ(u) × ∂ψ(v), ∀u ∈ V, ∀v ∈ V.

• A : V → W a bounded linear operator, A* its adjoint, b ∈ W and F ∈ Γ₀(W):
  ∂( F(A· + b) )(u) = A* ∂F(Au + b).

• φ ∈ Γ₀(W), ψ ∈ Γ₀(V), A : V → W a bounded linear operator, 0 ∈ int( dom φ − A(dom ψ) ):
  ∂( φ ∘ A + ψ ) = A* (∂φ) ∘ A + ∂ψ.

• Let U ⊂ V be nonempty, convex and closed. For any u ∈ V \ U we have
  ∇d_U(u) = ( u − Π_U u ) / d_U(u),
  where Π_U is the orthogonal projection onto U and d_U the distance to U.
Duality relations

Theorem 47 Let F ∈ Γ₀(V) and F* : V → R be its convex conjugate (see (1.10), p. 14). Then

v ∈ ∂F(u) ⟺ F(u) + F*(v) = ⟨u, v⟩ ⟺ u ∈ ∂F*(v).   (4.14)

Indeed, v ∈ ∂F(u) means precisely that u maximizes z ↦ ⟨z, v⟩ − F(z) over z ∈ V, i.e. that F*(v) = ⟨u, v⟩ − F(u); symmetrically for u ∈ ∂F*(v).
4.3 Optimality conditions

4.3.1

Let F ∈ Γ₀(V). Then û minimizes F on V if and only if

0 ∈ ∂F(û) ⟺ F′(û)(v) ≥ 0, ∀v ∈ V.   (4.15)

Indeed,

0 ∈ ∂F(û) ⟺ F(v) ≥ F(û) + ⟨0, v − û⟩, ∀v ∈ V
          ⟺ F(v) ≥ F(û), ∀v ∈ V
          ⟺ 0 ≤ F′(û)(v), ∀v ∈ V.
Equivalently, û = (Id + θ∂F)^{-1}(û),   (4.16)

where (Id + θ∂F)^{-1} is the resolvent operator associated to ∂F and θ > 0 is a stepsize. The schematic algorithm resulting from (4.16), namely

u^{(k+1)} = (Id + θ∂F)^{-1}( u^{(k)} ),

is a fundamental tool for finding a root of any maximal monotone operator [74, 36], such as e.g. the subdifferential of a convex function. Often, (Id + θ∂F)^{-1} cannot be calculated in closed form.
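For F(u) = |u| the resolvent is computable in closed form (it is the soft threshold at θ, cf. Example 9), so the proximal point scheme can be run directly; a minimal sketch with hypothetical values:

```python
def resolvent_abs(w, theta):
    """Resolvent (Id + theta*dF)^(-1) for F(u) = |u|: the soft threshold at theta."""
    if w > theta:
        return w - theta
    if w < -theta:
        return w + theta
    return 0.0

u = 5.0
for _ in range(20):
    u = resolvent_abs(u, theta=0.5)   # proximal point step u <- (Id + theta*dF)^(-1)(u)
print(u)   # 0.0, the minimizer of F(u) = |u|
```

Each step decreases |u| by θ until the iterate hits the root 0 of the maximal monotone operator ∂F.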
4.3.2

The constrained problem min_{u∈U} F(u), with U ⊂ V nonempty, convex and closed, can be restated as

min_{u∈V} { F(u) + ι_U(u) },  where ι_U(u) = 0 if u ∈ U, +∞ if u ∉ U.   (4.17)

The normal cone to U at u is

N_U(u) := ∂ι_U(u) = { v ∈ V : ⟨v, z − u⟩ ≤ 0, ∀z ∈ U }  if u ∈ U  (and N_U(u) = ∅ if u ∉ U).   (4.18)

Proof. By Corollary 2 (p. 86) and using that (ι_U)*(v) = sup_{z∈U} ⟨z, v⟩ = σ_U(v); for σ_U(v) recall Definition 25 on p. 80 and (4.3) on p. 81:

∂ι_U(u) = { v ∈ V : ι_U(u) + (ι_U)*(v) = ⟨u, v⟩ } = { v ∈ V : ι_U(u) + sup_{z∈U} ⟨z, v⟩ = ⟨u, v⟩ }
        = { v ∈ V : ⟨z, v⟩ ≤ ⟨u, v⟩, ∀z ∈ U }.
Hence û solves the constrained problem if and only if

0 ∈ ∂F(û) + N_U(û).

Equivalently, in terms of directional derivatives: for û ∈ U,

inf_{t∈]0,1]} ( F(û + t(v − û)) − F(û) ) / t ≥ 0, ∀v ∈ U
  ⟺ F′(û)(v − û) ≥ 0, ∀v ∈ U.   (4.19)
4.3.3

U = { u ∈ Rⁿ : ⟨aᵢ, u⟩ = bᵢ, 1 ≤ i ≤ p (i.e. Au = b ∈ Rᵖ)  and  hᵢ(u) ≤ 0, 1 ≤ i ≤ q }   (4.20)

Let I(û) := { i ∈ {1, …, q} : hᵢ(û) = 0 } denote the inequality constraints active at û. Under the assumptions of Theorem 52, the following are equivalent:

1. û solves min_{u∈U} F(u);

2. ∃λ ∈ Rᵖ, ∃μ ∈ R^q such that

   μᵢ ≥ 0 and μᵢ hᵢ(û) = 0, 1 ≤ i ≤ q,   (4.21)

   and  0 ∈ ∂F(û) + Σ_{i=1}^{p} λᵢ aᵢ + Σ_{i∈I(û)} μᵢ zᵢ, with zᵢ ∈ ∂hᵢ(û);

3. 0 ∈ ∂F(û) + N_U(û), where

   N_U(u) = Aᵀλ + { Σ_{i∈I(u)} μᵢ zᵢ : μᵢ ≥ 0, zᵢ ∈ ∂hᵢ(u), i ∈ I(u) }, λ ∈ Rᵖ.   (4.22)

Remark 26 Note that the condition in 2, namely (4.21), ensures that the normal cone at û w.r.t. U reads as in (4.22), which entails that the constraints are qualified.

The existence of coefficients satisfying (4.21) is called the Karush-Kuhn-Tucker (KKT) conditions. The requirement μᵢ hᵢ(û) = 0, 1 ≤ i ≤ q, is called the transversality or complementarity condition (or, equivalently, slackness). (λ, μ) ∈ Rᵖ × R^q₊ are called Lagrange multipliers. The Lagrangian function reads

L(u, λ, μ) = F(u) + Σ_{i=1}^{p} λᵢ ( ⟨aᵢ, u⟩ − bᵢ ) + Σ_{i=1}^{q} μᵢ hᵢ(u).   (4.23)
Theorem 53 ([45], Theorem 4.4.3, p. 337.) We posit the assumptions of Theorem 52. Then the following statements are equivalent:

1. û solves the problem F(û) = min_{u∈U} F(u);

2. (û, (λ̂, μ̂)) is a saddle point of L in (4.23) over Rⁿ × Rᵖ × R^q₊.
4.4 Minimization methods

Descent direction d: ∃θ > 0 such that F(u − θd) < F(u). A sufficient condition: ⟨g_u, d⟩ > 0, ∀g_u ∈ ∂F(u).

Note that d = g_u for some g_u ∈ ∂F(u) is not necessarily a descent direction.

Example  F(u) = |u[1]| + 2|u[2]|.

F′((1, 0))(v) = lim_{t↘0} ( |1 + t v[1]| − 1 + 2|t v[2]| ) / t = v[1] + 2|v[2]|

∂F(1, 0) = { g ∈ R² : ⟨g, v⟩ ≤ F′((1, 0))(v), ∀v ∈ R² }
         = { g ∈ R² : g[1]v[1] + g[2]v[2] ≤ v[1] + 2|v[2]|, ∀v ∈ R² }
         = {1} × [−2, 2].

The steepest descent method:

d_k ∈ arg min_{‖d‖=1} max_{g∈∂F(u_k)} ⟨g, d⟩.
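The example can be checked numerically: taking the particular subgradient g = (1, 2) ∈ ∂F(1, 0), every step along −g increases F, confirming that the negative of an arbitrary subgradient need not be a descent direction. A minimal sketch:

```python
def F(u):                       # F(u) = |u[1]| + 2|u[2]| from the example above
    return abs(u[0]) + 2.0 * abs(u[1])

u = (1.0, 0.0)
g = (1.0, 2.0)                  # g belongs to dF(1, 0) = {1} x [-2, 2]
for t in [0.5, 0.1, 0.01]:
    increase = F((u[0] - t * g[0], u[1] - t * g[1])) - F(u)
    print(increase)             # = 3t > 0 for every stepsize t in ]0, 1[
```

Indeed F((1 − t, −2t)) = (1 − t) + 4t = 1 + 3t for t < 1, which is strictly larger than F(1, 0) = 1.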
4.4.1 Subgradient methods

Subgradient projections are significantly easier to implement than exact projections and have been used for solving a wide range of problems.

Subgradient algorithm with projection
Minimize F : Rⁿ → R subject to the constraint set U (closed, convex, nonempty, possibly U = Rⁿ).
For k ∈ N:
1. if 0 ∈ ∂F(u_k), stop (difficult to verify); else
2. obtain g_k ∈ ∂F(u_k) (easy in general);
3. (possibly) line-search to find θ_k > 0;
4. u_{k+1} = Π_U( u_k − θ_k g_k / ‖g_k‖ ).

Remark 27 d_k = g_k / ‖g_k‖ is not necessarily a descent direction. This entails oscillations of the values (F(u_k))_{k∈N}.

Theorem 54 Suppose that F : Rⁿ → R is convex and reaches its minimum at û ∈ Rⁿ. If (θ_k)_{k≥0} satisfies

Σ_k θ_k = +∞ and Σ_k θ_k² < +∞,   (4.24)

then lim_{k→∞} u_k = û.
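The projected subgradient scheme is short to implement; a sketch on a toy problem of our own choosing (minimize F(u) = |u[0] − 2| + |u[1] + 1| over the box ‖u‖_∞ ≤ 1, whose solution is (1, −1)), with the stepsizes θ_k = 1/(k + 1) satisfying (4.24):

```python
import numpy as np

def F(u):          # toy objective, constrained minimizer at (1, -1)
    return abs(u[0] - 2.0) + abs(u[1] + 1.0)

def subgrad(u):    # one subgradient of F (with the convention sign(0) = 0)
    return np.array([np.sign(u[0] - 2.0), np.sign(u[1] + 1.0)])

def proj_U(u):     # projection onto U = {u : ||u||_inf <= 1}
    return np.clip(u, -1.0, 1.0)

u = np.zeros(2)
for k in range(200):
    g = subgrad(u)
    if np.linalg.norm(g) == 0.0:
        break                                   # 0 in dF(u): stop
    theta = 1.0 / (k + 1)                       # satisfies (4.24)
    u = proj_U(u - theta * g / np.linalg.norm(g))
print(u)  # [ 1. -1.]
```

As Remark 27 predicts, the values F(u_k) need not decrease monotonically, yet the iterates converge.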
(P)  min_{u∈U} F(u).

Let F̂ denote the minimal value of F. The subgradient projection of u reads

g_u(u) = { u                                            if 0 ∈ ∂F(u) or F(u) = F̂,
         { u − ( ( F(u) − F̂ ) / ‖g_u‖² ) g_u            otherwise, for g_u ∈ ∂F(u).   (4.25)

Indeed, ⟨g_u, u − g_u(u)⟩ = F(u) − F̂.

If F̂ is known (Polyak's method, cf. [79]): for k ∈ N,
1. obtain g_k ∈ ∂F(u_k); if g_k = 0, stop;
2. set v = u_k − ( ( F(u_k) − F̂ ) / ‖g_k‖²₂ ) g_k;
3. u_{k+1} = Π_U(v).
4.4.2

Given the form of F one can often design specific (well-adapted) optimization algorithms. There are numerous examples in the literature. We present some outstanding samples.
TV minimization (Chambolle 2004)

Reference: [16].

Minimize F(u) = ½ ‖u − v‖²₂ + λ TV(u), where

TV(u) = sup { ∫_Ω u(x) div ξ(x) dx : ξ ∈ C_c¹(Ω, R²), |ξ(x)| ≤ 1, ∀x ∈ Ω }.   (4.26)

Discrete equivalent: ∇ : R^{n×n} → (R^{n×n})², and −div is the adjoint of ∇:

(∇u)[i, j] = ( u[i+1, j] − u[i, j],  u[i, j+1] − u[i, j] ),  1 ≤ i, j ≤ n,

(div ξ)[i, j] = ξ¹[i, j] − ξ¹[i−1, j] + ξ²[i, j] − ξ²[i, j−1],  1 ≤ i, j ≤ n,

under the boundary conditions specified in [16]. Define

K = { λ div ξ : ξ ∈ (R^{n×n})², ‖ξ[i, j]‖₂ ≤ 1, 1 ≤ i, j ≤ n } ⊂ R^{n×n},   (4.27)

where ξ[i, j] = ( ξ¹[i, j], ξ²[i, j] ). Let us emphasize that K is convex and closed.
The discrete equivalent of the TV regularization in (4.26) reads

λ TV(u) = sup { ⟨u, w⟩ : w ∈ K } = sup_{w∈K} ⟨u, w⟩ = σ_K(u).

The minimizer û of F is then characterized as follows, where we set w := v − û (use Theorem 47, p. 86):

0 ∈ û − v + ∂σ_K(û)
⟺ v − û ∈ ∂σ_K(û)
⟺ û ∈ ∂(σ_K)*(v − û) = ∂ι_K(w)
⟺ 0 ∈ w − v + ∂ι_K(w)
⟺ w minimizes z ↦ ½ ‖z − v‖²₂ + ι_K(z)
⟺ w = Π_K(v)
⟺ û = v − Π_K(v),   (4.28)
where Π_K is the orthogonal projection onto K. To calculate û, we need to find ξ such that λ div ξ solves the nonlinear projection onto K:

min { ‖ λ div ξ − v ‖²₂ : ξ ∈ (R^{n×n})², ‖ξ[i, j]‖²₂ − 1 ≤ 0, 1 ≤ i, j ≤ n }.

The KT optimality conditions (Theorem 39, p. 66) yield a fixed-point scheme; see [16] for details.

Algorithm: for a suitable stepsize τ > 0,

ξ_{k+1}[i, j] = ( ξ_k[i, j] + τ ( ∇( div ξ_k − v/λ ) )[i, j] ) / ( 1 + τ | ( ∇( div ξ_k − v/λ ) )[i, j] | ).
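The fixed-point iteration can be sketched compactly in NumPy. The boundary conventions for ∇ and div below follow common implementations of [16], and τ = 1/8 is the stepsize for which convergence is usually stated; this is an illustrative sketch, not the reference implementation:

```python
import numpy as np

def grad(u):                       # forward differences; last row/column of each component is 0
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):                   # discrete divergence, the negative adjoint of grad
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def chambolle_tv(v, lam, tau=0.125, n_iter=100):
    """Dual fixed-point iteration; returns u = v - Pi_K(v), cf. (4.28)."""
    px = np.zeros_like(v); py = np.zeros_like(v)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - v / lam)
        denom = 1.0 + tau * np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / denom
        py = (py + tau * gy) / denom
    return v - lam * div(px, py)

v = np.zeros((16, 16)); v[4:12, 4:12] = 1.0                        # piecewise-constant image
noisy = v + 0.1 * np.random.default_rng(0).standard_normal(v.shape)
u = chambolle_tv(noisy, lam=0.2)                                   # denoised image
```

The constraint ‖ξ[i, j]‖₂ ≤ 1 is maintained implicitly by the normalization in `denom`.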
One coordinate at a time

Consider an objective with a separable nonsmooth part,

F(u) = Ψ(u) + Σ_{i=1}^{n} λᵢ φᵢ(|u[i]|),  λᵢ ≥ 0, φᵢ′(0⁺) > 0, 1 ≤ i ≤ n,   (4.29)

where Ψ denotes the differentiable part of the objective. An objective involving terms ⟨aᵢ, u⟩ can be rewritten in the form of (4.29), provided that {aᵢ, 1 ≤ i ≤ n} is linearly independent. Let A ∈ R^{n×n} be the matrix whose columns are aᵢ, 1 ≤ i ≤ n. Using the change of variables z = Au − v, we consider

F̃(z) = Σ_{i=1}^{n} φ(|zᵢ|) + Ψ( A⁻¹(z + v) ).

Assume moreover that one of the following conditions holds:

1. …;
2. ∂F(u)/∂uᵢ … + t², 1 ≤ i ≤ n.

Then the one-coordinate-at-a-time method given in §2.2 (p. 22) converges to a minimizer û of F.

Proof: under condition 1, see [44]; under condition 2, see [65].

Comments
Condition 2 is quite loose since it does not require φ to be globally coercive.
The method cannot be extended to nonseparable φ.
Algorithm.
∀k ∈ N, each iteration k has n steps. For 1 ≤ i ≤ n, compute

ξ = − ∂/∂u[i] of the differentiable part of F, evaluated at ( u^{(k)}[1], …, u^{(k)}[i−1], 0, u^{(k−1)}[i+1], …, u^{(k−1)}[n] );

if |ξ| ≤ λᵢ φᵢ′(0⁺), set u^{(k)}[i] = 0;

otherwise, solve for u^{(k)}[i]:   (*)

λᵢ φᵢ′(|u^{(k)}[i]|) sign(u^{(k)}[i]) + ∂/∂u[i] of the differentiable part of F at ( u^{(k)}[1], …, u^{(k)}[i−1], u^{(k)}[i], u^{(k−1)}[i+1], …, u^{(k−1)}[n] ) = 0,

where sign(u^{(k)}[i]) = sign(ξ).

The components located at kinks are found exactly in (*). Note that for t ≠ 0 we have (d/dt) φᵢ(|t|) = φᵢ′(|t|) sign(t). In particular, for φᵢ(t) = |t| we have good simplifications: φᵢ′(0⁺) = 1 and (d/dt) φᵢ(|t|) = sign(t) if t ≠ 0.

Useful reformulation
For any w ∈ Rⁿ we consider the decomposition

w = w⁺ − w⁻,  w⁺[i] = max(w[i], 0), w⁻[i] = max(−w[i], 0).

Then

minimize ‖w‖₁ ⟺ minimize Σᵢ ( w⁺[i] + w⁻[i] ) subject to w⁺[i] ≥ 0, w⁻[i] ≥ 0, 1 ≤ i ≤ n.
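This decomposition is the standard trick that turns an ℓ1 objective into a smooth linear one with positivity constraints; a two-line numeric check:

```python
import numpy as np

w = np.array([1.5, -2.0, 0.0, 3.0])
w_plus = np.maximum(w, 0.0)     # w+[i] = max(w[i], 0) >= 0
w_minus = np.maximum(-w, 0.0)   # w-[i] = max(-w[i], 0) >= 0

assert np.allclose(w, w_plus - w_minus)            # w = w+ - w-
print(np.abs(w).sum(), (w_plus + w_minus).sum())   # both equal ||w||_1 = 6.5
```

At a minimizer at most one of w⁺[i], w⁻[i] is nonzero, so Σ(w⁺ + w⁻) indeed equals ‖w‖₁.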
4.4.3

We will describe the approach in the context of an objective function F : Rⁿ → R of the form

F(u) = ‖Au − v‖²₂ + β Σᵢ ‖Dᵢ u‖₂,   (4.30)

where for any i, Dᵢu ∈ R² is the discrete approximation of the gradient of u at i. The second summand is hence the nonsmooth TV regularization [76].

The idea of variable-splitting and penalty techniques in optimization is as follows: with each nonsmooth term ‖Dᵢu‖₂ an auxiliary variable bᵢ ∈ R² is associated in order to transfer Dᵢu out of the nondifferentiable term ‖·‖₂, and the difference between bᵢ and Dᵢu is penalized. The resultant approximation to (4.30) is

F(u, b) = ‖Au − v‖²₂ + γ Σᵢ ‖bᵢ − Dᵢu‖²₂ + β Σᵢ ‖bᵢ‖₂   (4.31)

with a sufficiently large penalty parameter γ. This quadratic penalty approach can be traced back as early as Courant [32] in 1943. It is well known that the solution of (4.31) converges to that of
(4.30) as γ → ∞. The motivation for this formulation is that, while either one of the two variables u and b is fixed, minimizing the function with respect to the other has a closed-form solution with low computational complexity. Indeed, F is quadratic in u and separable in b. A connection with the additive form of half-quadratic regularization, see (2.36) on p. 39, was investigated in [87]. For γ fixed, an alternating minimization scheme is applied to F:

u_{k+1} = arg min_{u∈Rⁿ} { ‖Au − v‖²₂ + γ Σᵢ ‖bᵢᵏ − Dᵢu‖²₂ },   (4.32)

bᵢ^{k+1} = arg min_{bᵢ∈R²} { γ ‖bᵢ − Dᵢ u_{k+1}‖²₂ + β ‖bᵢ‖₂ }, ∀i.   (4.33)

Finding u_{k+1} in (4.32) amounts to solving a linear problem. According to the content of A, fast solvers have been exhibited (see e.g. [62]).

The computation of bᵢ^{k+1} in (4.33) can be done in an exact and fast way using the multidimensional shrinkage suggested in [87] and proven in [91], presented next.
Lemma 13 ([91], p. 577) For γ > 0, β > 0 and v ∈ Rⁿ, where n is any positive integer, set

f(u) = (γ/2) ‖u − v‖²₂ + β ‖u‖₂.

Then the unique minimizer of f reads

û = max( ‖v‖₂ − β/γ, 0 ) v / ‖v‖₂,   (4.34)

with the convention 0 · (0/0) = 0.

Proof. By Corollary 1 (p. 86),

∂f(u) = { γ(u − v) + β u/‖u‖₂ }                     if u ≠ 0
        { γ(u − v) + β { h ∈ Rⁿ : ‖h‖₂ ≤ 1 }         if u = 0.

According to the optimality conditions, 0 ∈ ∂f(û), hence

γ(û − v) + β û/‖û‖₂ = 0                  if û ≠ 0,
γ v ∈ β { h ∈ Rⁿ : ‖h‖₂ ≤ 1 }            if û = 0.

From the second condition, it is obvious that

û = 0 ⟺ ‖v‖₂ ≤ β/γ.   (4.35)

Consider next that û ≠ 0. Then

û ( 1 + β / (γ ‖û‖₂) ) = v.

It follows that there exists a positive constant c such that û = c v. The latter obeys

c + β/(γ ‖v‖₂) − 1 = 0 ⟺ c = 1 − β/(γ ‖v‖₂),

hence

û = ( 1 − β/(γ ‖v‖₂) ) v = ( ‖v‖₂ − β/γ ) v/‖v‖₂,

and c > 0 requires ‖v‖₂ > β/γ,   (4.36)

which is consistent with (4.35) and yields (4.34).
A convergence analysis of the relevant numerical scheme is provided in [87]. In the same reference, it is exhibited how to apply continuation on γ (i.e. increasing γ) in order to improve the convergence.

Remark 29 If n = 1 in Lemma 13, that is f(u) = (γ/2)(u − v)² + β|u|, u ∈ R, then

û = max( |v| − β/γ, 0 ) sign(v) = { 0                       if |v| ≤ β/γ,
                                   { v − (β/γ) sign(v)      otherwise.
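The multidimensional shrinkage (4.34) is a one-liner in NumPy (function name ours); the b-update (4.33) applies it to v = Dᵢu_{k+1} with the same β, γ:

```python
import numpy as np

def shrink_nd(v, beta, gamma):
    """Minimizer of f(u) = (gamma/2)*||u - v||_2^2 + beta*||u||_2  (Lemma 13)."""
    n = np.linalg.norm(v)
    if n <= beta / gamma:
        return np.zeros_like(v)
    return (1.0 - beta / (gamma * n)) * v   # = max(||v||_2 - beta/gamma, 0) * v/||v||_2

v = np.array([3.0, 4.0])                    # ||v||_2 = 5
print(shrink_nd(v, beta=1.0, gamma=1.0))    # (1 - 1/5)*v = [2.4, 3.2]
print(shrink_nd(v, beta=10.0, gamma=1.0))   # ||v||_2 <= beta/gamma -> [0., 0.]
```

The vector is shrunk radially toward the origin, and is set exactly to zero below the threshold β/γ.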
4.5 Splitting methods

In this section V is a real Hilbert space; hence V* = V.

4.5.1

Proposition 11 For f ∈ Γ₀(V) and any w ∈ V, the function

P_f(z) := ½ ‖z − w‖² + f(z)   (4.37)

admits a unique minimizer over V, denoted

prox_f w := arg min_{z∈V} { ½ ‖z − w‖² + f(z) },   (4.38)

and characterized by

u = prox_f w ⟺ w − u ∈ ∂f(u).   (4.39)

Examples:
• f = μ ‖·‖²₂ : prox_f v = v / (1 + 2μ);
• f = μ ‖·‖₂ : prox_f v = max( ‖v‖₂ − μ, 0 ) v/‖v‖₂ (see Lemma 13, p. 97);
• f = μ |·| : prox_f = T_μ, the soft threshold, where ∀μ > 0,

T_μ(t) = { 0               if |t| < μ,
         { t − μ sign(t)   otherwise.
Theorem 59 ([56], p. 280.) Let f ∈ Γ₀(V) and f* ∈ Γ₀(V) be its convex conjugate. For any u, v, w ∈ V we have the equivalence

w = u + v and f(u) + f*(v) = ⟨u, v⟩  ⟺  u = prox_f w, v = prox_{f*} w.

Proof. (⇒) By Theorem 47, f(u) + f*(v) = ⟨u, v⟩ means that v ∈ ∂f(u), i.e.

f(z) − f(u) ≥ ⟨z − u, v⟩, ∀z ∈ V.   (4.40)

Then, setting v = w − u,

½ ‖z − w‖² + f(z) − ½ ‖u − w‖² − f(u)
  = ½ ‖z − u − v‖² − ½ ‖v‖² + f(z) − f(u), ∀z ∈ V
  ≥ ½ ‖z − u − v‖² − ½ ‖v‖² + ⟨z − u, v⟩, ∀z ∈ V   (by (4.40))
  = ½ ‖z − u‖² − ⟨z − u, v⟩ + ½ ‖v‖² − ½ ‖v‖² + ⟨z − u, v⟩, ∀z ∈ V
  = ½ ‖z − u‖² ≥ 0, ∀z ∈ V.

It follows that u is the unique point such that P_f(u) ≤ P_f(z), ∀z ∈ V, with strict inequality for z ≠ u. Hence u = prox_f w. In a similar way one establishes that v = prox_{f*} w.

(⇐) Since u = prox_f w for w ∈ V, we know that u is the unique minimizer of P_f on V for this w (see Proposition 11 on p. 99). Hence

P_f( t(z − u) + u ) = ½ ‖t(z − u) + u − w‖² + f( t(z − u) + u )
  ≥ ½ ‖u − w‖² + f(u) = P_f(u) = ½ ‖v̄‖² + f(u), ∀z ∈ V, ∀t ∈ R,

where we set v̄ = w − u. Rearranging the obtained inequality yields

½ ‖v̄‖² − ½ ‖t(z − u) − v̄‖² ≤ f( t(z − u) + u ) − f(u), ∀z ∈ V, ∀t ∈ R.

Noticing that ½ ‖t(z − u) − v̄‖² = ½ t² ‖z − u‖² − t ⟨z − u, v̄⟩ + ½ ‖v̄‖², the last inequality entails that

t ⟨z − u, v̄⟩ − ½ t² ‖z − u‖² ≤ f( t(z − u) + u ) − f(u), ∀z ∈ V, ∀t ∈ R.

Remind that t(z − u) + u = tz + (1 − t)u. Since f is convex, for t ∈ ]0, 1[ we have

f( t(z − u) + u ) ≤ t f(z) + (1 − t) f(u) = f(u) + t ( f(z) − f(u) ), ∀t ∈ ]0, 1[, ∀z ∈ V.   (4.41)

Combining the last two inequalities and dividing by t ∈ ]0, 1[ gives

⟨z − u, v̄⟩ − ½ t ‖z − u‖² ≤ f(z) − f(u), ∀z ∈ V;

letting t ↘ 0 shows that v̄ ∈ ∂f(u), hence

sup_{z∈V} { ⟨z, v̄⟩ − f(z) } = f*(v̄) = ⟨u, v̄⟩ − f(u).

Taking v = v̄ = w − u achieves the proof.
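Theorem 59 is the Moreau decomposition w = prox_f w + prox_{f*} w. It can be checked numerically for f(u) = μ|u|, whose conjugate is the indicator of [−μ, μ] (so prox_{f*} is the clipping to that interval):

```python
import numpy as np

mu = 1.0

def prox_f(w):       # prox of f(u) = mu*|u| : soft threshold at mu
    return np.sign(w) * np.maximum(np.abs(w) - mu, 0.0)

def prox_fstar(w):   # f*(v) = indicator of [-mu, mu]; its prox is the projection (clipping)
    return np.clip(w, -mu, mu)

for w in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    u, v = prox_f(w), prox_fstar(w)
    assert np.isclose(u + v, w)      # Moreau decomposition: w = prox_f(w) + prox_f*(w)
print("decomposition verified")
```

This identity is what lets one compute the prox of a function from the prox of its conjugate, and vice versa.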
Definition 31 An operator T : V → V is

1. nonexpansive if ‖Tu − Tv‖ ≤ ‖u − v‖, ∀u ∈ V, ∀v ∈ V;

2. firmly nonexpansive if ‖Tu − Tv‖² ≤ ⟨Tu − Tv, u − v⟩, ∀u ∈ V, ∀v ∈ V.

Hence a firmly nonexpansive operator is always nonexpansive.

Lemma 14 If T : V → V is firmly nonexpansive, then the operator (Id − T) : V → V is firmly nonexpansive.

Proof. Using that T is firmly nonexpansive, we obtain, ∀u ∈ V, ∀v ∈ V:

‖(Id − T)u − (Id − T)v‖² − ⟨(Id − T)u − (Id − T)v, u − v⟩
  = ‖u − v − (Tu − Tv)‖² − ⟨u − v − (Tu − Tv), u − v⟩
  = ‖Tu − Tv‖² + ‖u − v‖² − 2⟨Tu − Tv, u − v⟩ + ⟨Tu − Tv, u − v⟩ − ‖u − v‖²
  = ‖Tu − Tv‖² − ⟨Tu − Tv, u − v⟩ ≤ 0,

where the last inequality comes from Definition 31-2. Hence the result.

Proposition 12 ([31], Lemma 2.4.) For f ∈ Γ₀(V), prox_f and (Id − prox_f) are firmly nonexpansive.

Proposition 13 Let f ∈ Γ₀(V). The operator prox_f is monotone in the sense that

⟨w − z, prox_f w − prox_f z⟩ ≥ 0, ∀w ∈ V, ∀z ∈ V.
Proof. Setting

u = prox_f w, so that w − u ∈ ∂f(u),   (4.42)
ũ = prox_f z, so that z − ũ ∈ ∂f(ũ),   (4.43)

we obtain

⟨w − z, prox_f w − prox_f z⟩ = ⟨(w − u) − (z − ũ), u − ũ⟩ + ‖u − ũ‖² ≥ 0,

where the first term is nonnegative by the monotonicity of ∂f (Property 9), using (4.42) and (4.43).
4.5.2

Forward-backward splitting can be seen as a generalization of the classical gradient projection method for constrained convex optimization. One must assume that either φ or ψ is differentiable.

Proposition 14 ([31], Proposition 3.1.) Assume that ψ ∈ Γ₀(V), φ ∈ Γ₀(V), where V is a Hilbert space, that φ is differentiable and that F = φ + ψ is coercive. Let γ ∈ ]0, +∞[. Then

F(û) = min_{u∈V} F(u) ⟺ û = prox_{γψ}( û − γ ∇φ(û) ).

Proof.

0 ∈ ∂(φ + ψ)(û) = ∂ψ(û) + { ∇φ(û) }
⟺ −γ ∇φ(û) ∈ γ ∂ψ(û)
⟺ ( û − γ ∇φ(û) ) − û ∈ γ ∂ψ(û)
⟺ û = prox_{γψ}( û − γ ∇φ(û) )   (use (4.39), p. 99, and identify û − γ∇φ(û) with w).

This characterization suggests the iteration

u_{k+½} = u_k − γ ∇φ(u_k)         (forward, explicit)
u_{k+1} = prox_{γψ}( u_{k+½} )     (backward, implicit)
Formally, the second step amounts to solving an inclusion, hence its implicit nature.

A practical algorithm proposed by Combettes and Wajs in [31] is a more general iteration where γ is iteration-dependent, errors are allowed in the evaluation of the operators prox and ∇φ, and a relaxation sequence (λ_k)_{k∈N} is introduced. The admission of errors allows some tolerance in the numerical implementation of the algorithm, while the iteration-dependent parameters (γ_k)_{k∈N} and (λ_k)_{k∈N} can improve its convergence speed.

Theorem 60 ([31], Theorem 3.4.) Assume that ψ ∈ Γ₀(V), φ ∈ Γ₀(V), where V is a Hilbert space, that φ is differentiable with a 1/β-Lipschitz gradient, γ ∈ ]0, 2β[, and that F = φ + ψ is coercive. Let the sequences (γ_k)_{k∈N} ⊂ R⁺, (λ_k)_{k∈N} ⊂ R⁺, (a_k)_{k∈N} ⊂ V and (b_k)_{k∈N} ⊂ V satisfy

0 < inf_{k∈N} γ_k ≤ sup_{k∈N} γ_k < 2β;  0 < inf_{k∈N} λ_k ≤ sup_{k∈N} λ_k ≤ 1;
Σ_{k∈N} ‖a_k‖ < +∞ and Σ_{k∈N} ‖b_k‖ < +∞.

Then the relaxed, inexact forward-backward iteration

u_{k+1} = u_k + λ_k ( prox_{γ_k ψ}( u_k − γ_k ( ∇φ(u_k) + b_k ) ) + a_k − u_k )

converges weakly to a minimizer of F.
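With ψ = λ‖·‖₁ the backward step is soft thresholding, and the exact, unrelaxed scheme is the well-known iterative shrinkage/thresholding (ISTA) iteration. A small synthetic sketch (the data A, v and all parameter values are ours, chosen only for illustration; the smooth part here is φ(u) = ½‖Au − v‖²):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10); x_true[1], x_true[5] = 2.0, -1.5
v = A @ x_true                                   # noiseless data

lam = 0.1
gamma = 1.0 / np.linalg.norm(A, 2) ** 2          # step <= 1/L, L = ||A||^2: Lipschitz const of grad(phi)

u = np.zeros(10)
for _ in range(500):
    z = u - gamma * A.T @ (A @ u - v)            # forward (explicit) gradient step on phi
    u = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # backward step: prox of gamma*lam*||.||_1
print(np.round(u, 2))   # close to x_true, supported on {1, 5}
```

The nonsmooth ℓ1 term is handled only through its prox, never through a (nonexistent) gradient.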
4.5.3 Douglas-Rachford splitting

Douglas-Rachford splitting, generalized and formalized in [51], is a much more general class of monotone operator splitting methods. A crucial property of the Douglas-Rachford splitting scheme is its high robustness to numerical errors that may occur when computing the proximity operators prox_{γφ} and prox_{γψ}, see [29].
û minimizes F = φ + ψ ⟺ 0 ∈ γ∂φ(û) + γ∂ψ(û) (for any γ > 0); splitting the zero as a sum of two opposite subgradients and setting z = û + g for g ∈ γ∂ψ(û):

⟺ ∃z ∈ V : û − z ∈ γ∂φ(û) and z − û ∈ γ∂ψ(û)
⟺ ∃z ∈ V : 2û − z ∈ (Id + γ∂φ)(û) and z ∈ (Id + γ∂ψ)(û)
⟺ ∃z ∈ V : 2û − z ∈ (Id + γ∂φ)(û) and û = (Id + γ∂ψ)^{-1}(z)   (a)
⟺ ∃z ∈ V : ( 2(Id + γ∂ψ)^{-1} − Id )(z) ∈ (Id + γ∂φ)(û) and û = (Id + γ∂ψ)^{-1}(z)
⟺ ∃z ∈ V : û = (Id + γ∂φ)^{-1} ( 2(Id + γ∂ψ)^{-1} − Id )(z) = prox_{γφ}( 2 prox_{γψ} − Id )(z)   (b)
           and û = prox_{γψ}(z).   (c)

Since ( 2 prox_{γψ} − Id )(z) = 2û − z by (c) and prox_{γφ}(2û − z) = û by (b),

( 2 prox_{γφ} − Id )( 2 prox_{γψ} − Id )(z) = 2û − (2û − z) = z,

i.e.

z = ( 2 prox_{γφ} − Id )( 2 prox_{γψ} − Id )(z) and û = prox_{γψ}(z):

û is recovered from any fixed point z of the composed reflected operators.
Given a fixed scalar γ > 0 and a sequence λ_k ∈ (0, 2), this class of methods can be expressed via the following recursion, written in compact form:

z^{(k+1)} = [ (1 − λ_k/2) Id + (λ_k/2) ( 2 prox_{γφ} − Id )( 2 prox_{γψ} − Id ) ] ( z^{(k)} ).
A robust DR algorithm is proposed by Combettes and Pesquet in [30].

Theorem 61 ([30], Theorem 20.) Assume that ψ ∈ Γ₀(V), φ ∈ Γ₀(V), that F = φ + ψ is coercive and that γ ∈ ]0, +∞[. Let the sequences (λ_k)_{k∈N} ⊂ R⁺, (a_k)_{k∈N} ⊂ V and (b_k)_{k∈N} ⊂ V satisfy

(i) 0 < λ_k < 2, ∀k ∈ N, and Σ_{k∈N} λ_k (2 − λ_k) = +∞;

(ii) Σ_{k∈N} λ_k ( ‖a_k‖ + ‖b_k‖ ) < +∞.

Then the sequence (z_k)_{k∈N} converges weakly to some z such that prox_{γψ}(z) minimizes F. When V is of finite dimension, e.g. V = Rⁿ, the sequence (z_k)_{k∈N} converges strongly.

The sequences (a_k) and (b_k) model the numerical errors made when computing the proximity operators. They are kept under control using (ii). Conversely, if the convergence rate can be established, one can easily derive a rule on the number of inner iterations at each outer iteration k such that (ii) is verified.

This algorithm naturally inherits the splitting property of the Douglas-Rachford iteration, in the sense that the operators prox_{γφ} and prox_{γψ} are used in separate steps. Note that the algorithm allows for the inexact implementation of these two proximal steps via the incorporation of the error terms a_k and b_k. Moreover, a variable relaxation parameter λ_k enables additional flexibility.
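The recursion is easy to sketch on a problem whose answer is known in closed form: with φ(u) = ½‖u − v‖² and ψ = λ‖·‖₁, the minimizer of φ + ψ is the soft threshold of v. All numeric values below are illustrative choices of ours (λ_k ≡ 1, γ = 1):

```python
import numpy as np

v = np.array([3.0, -0.2, 1.0])
lam, gamma = 0.5, 1.0

def prox_psi(z):   # prox_{gamma*psi}, psi = lam*||.||_1  ->  soft thresholding
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def prox_phi(z):   # prox_{gamma*phi}, phi(u) = 0.5*||u - v||_2^2
    return (z + gamma * v) / (1.0 + gamma)

z = np.zeros(3)
for _ in range(100):
    r = 2.0 * prox_psi(z) - z                       # reflected step (2*prox_psi - Id)(z)
    z = z + 0.5 * ((2.0 * prox_phi(r) - r) - z)     # lambda_k = 1: z <- ((1/2)Id + (1/2)R_phi R_psi)(z)
u = prox_psi(z)
print(u)   # the minimizer of 0.5*||u - v||^2 + lam*||u||_1, i.e. [2.5, 0., 0.5]
```

Note that û is read off the fixed point z through prox_{γψ}, exactly as in the derivation above.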
Chapter 5
Appendix
Proof of Property 2-1

Proof of Proposition 1

Put

ψ(t) = φ(√t),   (5.1)

then ψ is convex by (b) and continuous on R⁺ by (c). Its convex conjugate (see (1.10), p. 14) is ψ*(b) = sup_{t≥0} { bt − ψ(t) }, where b ∈ R. Define φ̃(b) := ψ*(½ b), which means that

φ̃(b) = sup_{t≥0} { ½ bt − ψ(t) }.   (5.2)
d_j = Σ_{i=0}^{j} β_j[i] ∇F(u_i), 0 ≤ j ≤ k, d_j ∈ Rⁿ, β_j[i] ∈ R,   (5.4)

d_j = Σ_{i=0}^{j} β_j[i] ∇F(u_i), 0 ≤ j ≤ k.   (5.5)
(5.6)
(5.7)
We will say that the directions d_j and d_i are conjugated with respect to B. Since B ≻ 0, this result shows as well that d_j and d_i, 0 ≤ i < j ≤ k, are linearly independent. Till iteration k + 1, (5.4) can be put into a matrix form:

[ d_0  d_1  ⋯  d_k ] = [ ∇F(u_0)  ∇F(u_1)  ⋯  ∇F(u_k) ] ×
    ⎡ β_0[0]  β_1[0]  ⋯  β_k[0] ⎤
    ⎢         β_1[1]  ⋯  β_k[1] ⎥
    ⎢                  ⋱   ⋮    ⎥
    ⎣                     β_k[k] ⎦

The linear independence of ( d_i )_{i=0}^{j} and ( ∇F(u_i) )_{i=0}^{j} for any 0 ≤ j ≤ k implies

β_j[j] ≠ 0, 0 ≤ j ≤ k.   (5.8)

(5.9)

Using that β_j[j] ≠ 0 (see (5.8)) and that θ_j ≠ 0 in (5.4), for all 0 ≤ j ≤ k, the constants below are well defined:

γ_j[i] = β_j[i] / ( θ_j β_j[j] ), 1 ≤ i ≤ j ≤ k.   (5.10)
Using (5.5),

d_k = Σ_{i=0}^{k} β_k[i] ∇F(u_i) = β_k[k] ( Σ_{i=0}^{k−1} ( β_k[i] / β_k[k] ) ∇F(u_i) + ∇F(u_k) ).

The conjugacy conditions yield

γ_k[k−1] ‖∇F(u_{k−1})‖² = ‖∇F(u_k)‖²  and  γ_k[j] ‖∇F(u_j)‖² = γ_k[j+1] ‖∇F(u_{j+1})‖²,

whence

γ_k[j] = ‖∇F(u_k)‖² / ‖∇F(u_j)‖², 0 ≤ j ≤ k − 1.   (5.11)

Note that with the help of γ_j[i] in (5.10), d_j in (5.4) can equivalently be written down as

d_j = Σ_{i=0}^{j} γ_j[i] ∇F(u_i), 0 ≤ j ≤ k,   (5.12)

provided that θ_j solves the problem min_θ F(u_j − θ d_j), 0 ≤ j ≤ k. From (5.11) and (5.12),

d_k = Σ_{i=0}^{k−1} γ_k[i] ∇F(u_i) + ∇F(u_k)
    = ∇F(u_k) + Σ_{i=0}^{k−1} ( ‖∇F(u_k)‖² / ‖∇F(u_i)‖² ) ∇F(u_i)
    = ∇F(u_k) + ( ‖∇F(u_k)‖² / ‖∇F(u_{k−1})‖² ) [ Σ_{i=0}^{k−2} ( ‖∇F(u_{k−1})‖² / ‖∇F(u_i)‖² ) ∇F(u_i) + ∇F(u_{k−1}) ]
    = ∇F(u_k) + ( ‖∇F(u_k)‖² / ‖∇F(u_{k−1})‖² ) d_{k−1}.

The latter result provides a simple way to compute the new d_k using the previous d_{k−1}:

d_0 = ∇F(u_0) and d_i = ∇F(u_i) + ( ‖∇F(u_i)‖² / ‖∇F(u_{i−1})‖² ) d_{i−1}, 1 ≤ i ≤ k.

(c) The optimal θ_k for d_k. The optimal step-size θ_k is the unique solution of the problem

F(u_k − θ_k d_k) = inf_{θ∈R} F(u_k − θ d_k),

where

F(u_k − θ d_k) = ½ ⟨B(u_k − θ d_k), (u_k − θ d_k)⟩ − ⟨c, (u_k − θ d_k)⟩.

Setting to zero the derivative with respect to θ yields

θ_k = ⟨∇F(u_k), d_k⟩ / ⟨B d_k, d_k⟩.
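For the quadratic objective F(u) = ½⟨Bu, u⟩ − ⟨c, u⟩, the recursion for d_k together with the optimal stepsize θ_k gives the conjugate gradient method (in Fletcher-Reeves form); a minimal sketch:

```python
import numpy as np

def conjugate_gradient(B, c, u0, n_iter=None):
    """Minimize F(u) = 0.5*<Bu, u> - <c, u> for B symmetric positive definite.

    d_0 = grad F(u_0);  d_k = grad F(u_k) + (||grad F(u_k)||^2/||grad F(u_{k-1})||^2) d_{k-1};
    theta_k = <grad F(u_k), d_k> / <B d_k, d_k>.
    """
    u = u0.astype(float)
    g = B @ u - c           # gradient of F
    d = g.copy()
    for _ in range(n_iter or len(c)):
        theta = (g @ d) / (d @ B @ d)           # optimal stepsize along d
        u = u - theta * d
        g_new = B @ u - c
        d = g_new + (g_new @ g_new) / (g @ g) * d   # conjugate direction update
        g = g_new
    return u

B = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
u = conjugate_gradient(B, c, np.zeros(2))
print(np.round(u, 6))   # solves Bu = c, here (1/11, 7/11)
```

In exact arithmetic the method terminates in at most n iterations, n being the dimension of u.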
Bibliography
[1] M. Allain, J. Idier, and Y. Goussard, On global and local convergence of half-quadratic algorithms, IEEE Transactions on Image Processing, 15 (2006), pp. 1130–1142.
[2] E. Allen, R. Helgason, J. Kennington, and B. Shetty, A generalization of Polyak's convergence result for subgradient optimization, Math. Progr., 37 (1987).
[3] S. Alliney, Digital filters as absolute norm regularizers, IEEE Transactions on Signal Processing, SP-40 (1992), pp. 1548–1562.
[4] S. Alliney and S. A. Ruzinsky, An algorithm for the minimization of mixed l1 and l2 norms with application to Bayesian estimation, IEEE Transactions on Signal Processing, SP-42 (1994), pp. 618–627.
[5] L. Ambrosio, N. Fusco, and D. Pallara, Functions of Bounded Variation and Free Discontinuity Problems, Oxford Mathematical Monographs, Oxford University Press, 2000.
[6] A. Antoniadis and J. Fan, Regularization of wavelet approximations, Journal of the American Statistical Association, 96 (2001), pp. 939–967.
[7] G. Aubert and P. Kornprobst, Mathematical problems in image processing, Springer-Verlag, Berlin, 2nd ed., 2006.
[8] G. Aubert and L. Vese, A variational method in image recovery, SIAM Journal on Numerical Analysis, 34 (1997), pp. 1948–1979.
[9] J.-F. Aujol, Some first-order algorithms for total variation based image restoration, Journal of Mathematical Imaging and Vision, 34 (2009), pp. 307–327.
[10] A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational Inequalities, Springer, New York, 2003.
[11] A. Avez, Calcul différentiel, Masson, 1991.
[12] M. Belge, M. Kilmer, and E. Miller, Wavelet domain image restoration with adaptive edge-preserving regularization, IEEE Transactions on Image Processing, 9 (2000), pp. 597–608.
[13] J. E. Besag, Digital image processing: Towards Bayesian image analysis, Journal of Applied Statistics, 16 (1989), pp. 395–407.
[14] J. F. Bonnans, J.-C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal, Numerical optimization (theoretical and practical aspects), Springer, Berlin; New York; Hong Kong, 2003.
[15] H. Brézis, Analyse fonctionnelle, Collection mathématiques appliquées pour la maîtrise, Masson, Paris, 1992.
[16] A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision, 20 (2004).
[17] R. Chan, Circulant preconditioners for Hermitian Toeplitz systems, SIAM J. Matrix Anal. Appl., (1989), pp. 542–550.
[18] R. Chan, C. Hu, and M. Nikolova, An iterative procedure for removing random-valued impulse noise, IEEE Signal Processing Letters, 11 (2004), pp. 921–924.
[19] R. Chan and M. Ng, Conjugate gradient methods for Toeplitz systems, SIAM Review, 38 (1996), pp. 427–482.
[20] R. H.-F. Chan and X.-Q. Jin, An Introduction to Iterative Toeplitz Solvers, Cambridge University Press, 2011.
[21] T. Chan, An optimal circulant preconditioner for Toeplitz systems, SIAM J. Sci. Stat. Comput., (1988), pp. 766–771.
[22] T. Chan and S. Esedoglu, Aspects of total variation regularized l1 function approximation, SIAM Journal on Applied Mathematics, 65 (2005), pp. 1817–1837.
[23] T. Chan and P. Mulet, On the convergence of the lagged diffusivity fixed point method in total variation image restoration, SIAM Journal on Numerical Analysis, 36 (1999), pp. 354–367.
[24] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, Deterministic edge-preserving regularization in computed imaging, IEEE Transactions on Image Processing, 6 (1997), pp. 298–311.
[25] P. G. Ciarlet, Introduction to Numerical Linear Algebra and Optimization, Cambridge University Press, 1989.
[26] P. G. Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, Collection mathématiques appliquées pour la maîtrise, Dunod, Paris, 5th ed., 2000.
[27] P. Combettes, Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections, IEEE Transactions on Image Processing, 6 (1997), pp. 493–506.
[28] P. Combettes and J. Luo, An adaptive level set method for nondifferentiable constrained image recovery, IEEE Transactions on Image Processing, 11 (2002), pp. 1295–1304.
[29] P. L. Combettes, Solving monotone inclusions via compositions of nonexpansive averaged operators, Optimization, 53 (2004).
[30] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564–574.
[31] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM Multiscale Model. Simul., 4 (2005), pp. 1168–1200.
[32] R. Courant, Variational methods for the solution of problems with equilibrium and vibration, Bull. Amer. Math. Soc., (1943), pp. 1–23.
[33] P. Davis, Circulant matrices, John Wiley, New York, 3rd ed., 1979.
[34] A. H. Delaney and Y. Bresler, Globally convergent edge-preserving regularized reconstruction: an application to limited-angle tomography, IEEE Transactions on Image Processing, 7 (1998), pp. 204–221.
[35] D. Donoho, I. Johnstone, J. Hoch, and A. Stern, Maximum entropy and the nearly black object, Journal of the Royal Statistical Society B, 54 (1992), pp. 41–81.
[36] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Math. Programming: Series A and B, 55 (1992), pp. 293–318.
[37] J. Eckstein and B. F. Svaiter, A family of projective splitting methods for the sum of two maximal monotone operators, Math. Program., Ser. B, 111 (2008).
[38] I. Ekeland and R. Temam, Convex Analysis and Variational Problems, SIAM, Amsterdam: North Holland, 1976.
[39] L. C. Evans and R. F. Gariepy, Measure theory and fine properties of functions, Studies in Advanced Mathematics, CRC Press, Boca Raton, FL, 1992.
[40] H. Fu, M. K. Ng, M. Nikolova, and J. L. Barlow, Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration, SIAM Journal on Scientific Computing, 27 (2005), pp. 1881–1902.
[41] D. Gabay, Applications of the method of multipliers to variational inequalities, in M. Fortin and R. Glowinski, editors, North-Holland, Amsterdam, 1983.
[42] D. Geman and G. Reynolds, Constrained restoration and recovery of discontinuities, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-14 (1992), pp. 367–383.
[43] D. Geman and C. Yang, Nonlinear image recovery with half-quadratic regularization, IEEE Transactions on Image Processing, IP-4 (1995), pp. 932–946.
[44] R. Glowinski, J. Lions, and R. Trémolières, Analyse numérique des inéquations variationnelles, vol. 1, Dunod, Paris, 1st ed., 1976.
[45] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex analysis and Minimization Algorithms, vol. I, Springer-Verlag, Berlin, 1996.
[46] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex analysis and Minimization Algorithms, vol. II, Springer-Verlag, Berlin, 1996.
[47] J. Idier, Convex half-quadratic criteria and auxiliary interacting variables for image restoration, IEEE Transactions on Image Processing, 10 (2001), pp. 1001–1009.
[48] S. Kim, H. Ahn, and S. C. Cho, Variable target value subgradient method, Math. Progr., 49 (1991).
[49] A. Lewis and M. Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, online 2012.
[50] P.-L. Lions, Une méthode itérative de résolution d'une inéquation variationnelle, Israel Journal of Mathematics, 31 (1978), pp. 204–208.
[51] P.-L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964–979.
[52] D. Luenberger, Optimization by Vector Space Methods, J. Wiley, New York, 1st ed., 1969.
[53] D. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, New York, 1st ed., 1973.
[54] D. Luenberger and Y. Ye, Linear and Nonlinear Programming, Springer, New York, 3rd ed., 2008.
[55] M. Minoux, Programmation mathématique. Théorie et algorithmes, T.1, Dunod, Paris, 1983.
[56] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bulletin de la S. M. F., 93 (1965), pp. 273–299.
[57] J.-J. Moreau, Fonctions convexes duales et points proximaux dans un espace hilbertien, CRAS Sér. A Math., 255 (1962), pp. 2897–2899.
[58] P. Moulin and J. Liu, Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors, IEEE Transactions on Information Theory, 45 (1999), pp. 909–919.
[59] Y. Nesterov, Gradient methods for minimizing composite objective function, tech. report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), CORE Discussion Papers 2007076, Sep. 2007.
[60]
[61] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. (A), 103 (2005), pp. 127–152.
[62] M. Ng, R. Chan, and W. Tang, A fast algorithm for deblurring models with Neumann boundary conditions, SIAM Journal on Scientific Computing, (1999).
[63] M. Nikolova, Local strong homogeneity of a regularized estimator, SIAM Journal on Applied Mathematics, 61 (2000), pp. 633–658.
[64] M. Nikolova, Minimizers of cost-functions involving nonsmooth data-fidelity terms. Application to the processing of outliers, SIAM Journal on Numerical Analysis, 40 (2002), pp. 965–994.
[65] M. Nikolova, A variational approach to remove outliers and impulse noise, Journal of Mathematical Imaging and Vision, 20 (2004).
[66] M. Nikolova, Weakly constrained minimization. Application to the estimation of images and signals involving constant regions, Journal of Mathematical Imaging and Vision, 21 (2004), pp. 155–175.
[67] M. Nikolova and R. Chan, The equivalence of half-quadratic minimization and the gradient linearization iteration, IEEE Transactions on Image Processing, 16 (2007), pp. 1623–1627.
[68] M. Nikolova and M. Ng, Analysis of half-quadratic minimization methods for signal and image recovery, SIAM Journal on Scientific Computing, 27 (2005), pp. 937–966.
[69] J. Nocedal and S. Wright, Numerical Optimization, Springer, New York, 1999.
[70] J. Ortega and W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.
[71] G. B. Passty, Ergodic convergence to a zero of the sum of monotone operators in Hilbert space, Journal of Mathematical Analysis and Applications, 72 (1979).
[72] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes, the art of scientific computing, Cambridge Univ. Press, New York, 1992.
[73] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[74] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14 (1976), pp. 877–898.
[75] R. T. Rockafellar and J. B. Wets, Variational analysis, Springer-Verlag, New York, 1997.
[76] L. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D, 60 (1992), pp. 259–268.
[77] L. Schwartz, Analyse: Topologie générale et analyse fonctionnelle, Hermann, Paris, 1993.
[78] L. Schwartz, Analyse II. Calcul différentiel et équations différentielles, vol. 2 of Enseignement des sciences, Hermann, Paris, 1997.
[79] N. Z. Shor, Minimization Methods for Non-Differentiable Functions, vol. 3, Springer-Verlag, 1985.
[80] G. Strang, A proposal for Toeplitz matrix calculations, Stud. Appl. Math., 74 (1986).
[81] G. Strang, Linear Algebra and its Applications, Brooks/Cole, 3rd ed., 1988.
[82] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM Journal on Control and Optimization, 29 (1991), pp. 119–138.
[83] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM Journal on Control and Optimization, 38 (2000), pp. 431–446.
[84] E. Tyrtyshnikov, Optimal and superoptimal circulant preconditioners, SIAM J. Matrix Anal. Appl., (1992), pp. 459–473.
[85] C. Vogel, Computational Methods for Inverse Problems, Frontiers in Applied Mathematics Series, Number 23, SIAM, 2002.
[86] H. E. Voss and U. Eckhardt, Linear convergence of generalized Weiszfeld's method, Computing, 25 (1980).
[87] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total variation image reconstruction, SIAM Journal on Imaging Sciences, 1 (2008), pp. 248–272.
[88] P. Weiss, L. Blanc-Féraud, and G. Aubert, Efficient schemes for total variation minimization under constraints in image processing, SIAM Journal on Scientific Computing, (2009).
[89] E. Weiszfeld, Sur le point pour lequel la somme des distances de n points donnés est minimum, Tohoku Math. J., (1937), pp. 355–386.
[90] S. Wright, Primal-Dual Interior-Point Methods, SIAM Publications, Philadelphia, 1997.
[91] J. Yang, W. Yin, Y. Zhang, and Y. Wang, A fast algorithm for edge-preserving variational multichannel image restoration, SIAM Journal on Imaging Sciences, 2 (2009), pp. 569–592.
[92] J. Yang, Y. Zhang, and W. Yin, An efficient TVL1 algorithm for deblurring multichannel images corrupted by impulsive noise, SIAM Journal on Scientific Computing, 31 (2009), pp. 2842–2865.
[93] C. Zalinescu, Convex analysis in general vector spaces, World Scientific, River Edge, N.J., 2002.