Optimization.

Applications to image processing


Mila NIKOLOVA
CMLA (CNRS-UMR 8536), ENS de Cachan, 61 av. du President Wilson, 94235 Cachan Cedex, France
nikolova@cmla.ens-cachan.fr
http://www.cmla.ens-cachan.fr/nikolova/
Textbook : http://www.cmla.ens-cachan.fr/nikolova/Optim-MVD12.pdf

MONTEVIDEO, April 2012

Contents

1 Generalities . . . 5
  1.1 Optimization problems . . . 6
  1.2 Analysis of the optimization problem . . . 10
    1.2.1 Reminders . . . 10
    1.2.2 Existence, uniqueness of the solution . . . 11
    1.2.3 Equivalent optimization problems . . . 12
  1.3 Optimization algorithms . . . 17
    1.3.1 Iterative minimization methods . . . 17
    1.3.2 Local convergence rate . . . 18

2 Unconstrained differentiable problems . . . 21
  2.1 Preliminaries (regularity of F) . . . 21
  2.2 One coordinate at a time (Gauss-Seidel) . . . 22
  2.3 First-order (Gradient) methods . . . 22
    2.3.1 The steepest descent method . . . 23
    2.3.2 Gradient with variable step-size . . . 26
  2.4 Line search . . . 27
    2.4.1 Introduction . . . 27
    2.4.2 Schematic algorithm for line-search . . . 28
    2.4.3 Modern line-search methods . . . 29
  2.5 Second-order methods . . . 32
    2.5.1 Newton's method . . . 32
    2.5.2 General quasi-Newton methods . . . 34
    2.5.3 Generalized Weiszfeld's method (1937) . . . 36
    2.5.4 Half-quadratic regularization . . . 37
    2.5.5 Standard quasi-Newton methods . . . 40
  2.6 Conjugate gradient method (CG) . . . 42
    2.6.1 Quadratic strongly convex functionals (linear CG) . . . 42
    2.6.2 Non-quadratic functionals (non-linear CG) . . . 43
    2.6.3 Condition number and preconditioning . . . 46

3 Constrained optimization . . . 49
  3.1 Preliminaries . . . 49
  3.2 Optimality conditions . . . 49
    3.2.1 Projection theorem . . . 50
  3.3 General methods . . . 51
    3.3.1 Gauss-Seidel under hyper-cube constraint . . . 51
    3.3.2 Gradient descent with projection and varying step-size . . . 51
    3.3.3 Penalty . . . 54
  3.4 Equality constraints . . . 56
    3.4.1 Lagrange multipliers . . . 56
    3.4.2 Application to linear systems . . . 58
  3.5 Inequality constraints . . . 60
    3.5.1 Abstract optimality conditions . . . 60
    3.5.2 Farkas-Minkowski (F-M) theorem . . . 61
    3.5.3 Constraint qualification . . . 63
    3.5.4 Kuhn & Tucker relations . . . 64
  3.6 Convex inequality constraint problems . . . 65
    3.6.1 Adaptation of previous results . . . 65
    3.6.2 Duality . . . 67
    3.6.3 Uzawa's method . . . 72
  3.7 Unifying framework and second-order conditions . . . 74
    3.7.1 Karush-Kuhn-Tucker conditions (1st order) . . . 74
    3.7.2 Second order conditions . . . 74
    3.7.3 Standard forms (QP, LP) . . . 75
    3.7.4 Interior point methods . . . 75
  3.8 Nesterov's approach . . . 76

4 Non differentiable problems . . . 78
  4.1 Specificities . . . 78
    4.1.1 Examples . . . 78
    4.1.2 Kinks . . . 79
  4.2 Basic notions . . . 80
    4.2.1 Preliminaries . . . 80
    4.2.2 Directional derivatives . . . 81
    4.2.3 Subdifferentials . . . 83
  4.3 Optimality conditions . . . 87
    4.3.1 Unconstrained minimization problems . . . 87
    4.3.2 General constrained minimization problems . . . 88
    4.3.3 Minimality conditions under explicit constraints . . . 89
  4.4 Minimization methods . . . 90
    4.4.1 Subgradient methods . . . 91
    4.4.2 Some algorithms specially adapted to the shape of F . . . 93
    4.4.3 Variable-splitting combined with penalty techniques . . . 96
  4.5 Splitting methods . . . 98
    4.5.1 Moreau's decomposition and proximal calculus . . . 99
    4.5.2 Forward-Backward (FB) splitting . . . 102
    4.5.3 Douglas-Rachford splitting . . . 103

5 Appendix . . . 106
Objectives: obtain a good knowledge of the most important optimization methods, provide tools to help reading the literature, and conceive and analyze new methods. At the implementation stage, numerical methods always work on R^n.
Main references: [7, 14, 26, 44, 45, 46, 53, 69, 70, 72]
For reminders, check Connexions: http://cnx.org/


Notations and abbreviations

Abbreviations:
n.v.s.: normed vector space (V).
l.s.c.: lower semi-continuous (for a function).
w.r.t.: with respect to.
resp.: respectively.
dim(.): dimension.
<. , .>: inner product (= scalar product) on an n.v.s. V.
⌊a⌋ denotes the integer part of a ∈ R.
O is an open subset (arbitrary).
O(U) denotes an open subset of V containing U ⊂ V.
int(U) stands for the interior of U.
B(u, r) is the open ball centered at u with radius r. The relevant closed ball is B̄(u, r).
N is the set of non-negative integers.
R^q_+ = {v ∈ R^q : v[i] ≥ 0, 1 ≤ i ≤ q} for any positive integer q.
R̄ = R ∪ {+∞}.
(e_i)_{i=1}^n is the canonical basis of R^n.
For an n × n matrix B:
  B^T is its transpose and B^H its conjugate transpose.
  λ_i(B): an eigenvalue of B.
  λ_min(B) (resp. λ_max(B)): the minimal (resp. the maximal) eigenvalue of B.
  Reminder: the spectral radius of B is ρ(B) = max_{1≤i≤n} |λ_i(B)|.
  B ≻ 0: B is positive definite; B ⪰ 0: B is positive semi-definite.
Id is the identity operator.
diag(b[1], ..., b[n]) is an n × n diagonal matrix whose diagonal entries are b[i], 1 ≤ i ≤ n.
For f : V_1 × V_2 × ... × V_m → Y, where V_j, j ∈ {1, 2, ..., m}, and Y are n.v.s., D_j f(u_1, ..., u_m) is the differential of f at u = (u_1, ..., u_m) with respect to u_j.
L(V; Y), for V and Y n.v.s., is the set of all linear continuous applications from V to Y.
For A ∈ L(V; Y), its adjoint is A*.
Isom(V; Y) ⊂ L(V; Y) are all bijective applications that have a continuous inverse application.
o-function: satisfies lim_{t→0} o(t)/t = 0.
O-function: satisfies lim_{t→0} O(t)/t = K, where K is a constant.

Chapter 1
Generalities

[Figures: an original image u, the data d corrupted by sampling errors, blur and noise (Gamma noise), and the restored image û; a second example shows an original image ("Bebe volant") degraded by jitter and its restoration.]

Formulate your problem as the minimization / maximization of an objective function (energy, criterion) whose solution is the sought-after object (a signal, an image).

1.1 Optimization problems

General form: find a solution û ∈ V such that

    û ∈ U and F(û) = inf_{u∈U} F(u)   (equivalently, −F(û) = sup_{u∈U} (−F(u)))        (P)

F : V → R is the functional (objective function, criterion, energy) to minimize.
V is a real Hilbert space, if not explicitly specified.
U ⊂ V is the constraint set, supposed nonempty and closed, often convex.

Important cases:
- equality constraints
    U = {u ∈ V : g_i(u) = 0, i = 1, ..., p},  p < n        (1.1)
- inequality constraints
    U = {u ∈ V : h_i(u) ≤ 0, i = 1, ..., q},  q ∈ N        (1.2)

Minimizer û and minimum F(û).
Relative (local) minimum (resp. minimizer) and global minimum (resp. minimizer).

Standard form for a quadratic functional:
    F(u) = (1/2) B(u, u) − c(u),  for B ∈ L(V × V; R) and c ∈ L(V; R),        (1.3)
where B is bilinear and symmetric (i.e. B(u, v) = B(v, u), ∀u, v ∈ V).
If V is equipped with an inner product <., .>, we can write down (Riesz's representation theorem)
    F(u) = (1/2) <Bu, u> − <c, u>        (1.4)
If V = R^n in (1.4), B ∈ R^{n×n} is a symmetric matrix (i.e. B = B^T) and c ∈ R^n.
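To make the standard quadratic form concrete, here is a minimal numerical sketch (the matrix B and vector c are arbitrary illustrative choices, not data from the course) checking that, for F(u) = (1/2)<Bu, u> − <c, u> with B symmetric positive definite, the gradient Bu − c vanishes exactly at the solution of Bu = c.

# Python sketch: quadratic functional F(u) = 0.5*<Bu,u> - <c,u> on R^n.
# B and c are arbitrary illustrative data (assumption, not from the course).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)            # symmetric positive definite
c = rng.standard_normal(n)

F = lambda u: 0.5 * u @ B @ u - c @ u
grad_F = lambda u: B @ u - c           # gradient of the quadratic functional

u_star = np.linalg.solve(B, c)         # the minimizer solves Bu = c
print(np.linalg.norm(grad_F(u_star)))  # ~0: the gradient vanishes at u_star
print(F(u_star) <= F(u_star + 0.1 * rng.standard_normal(n)))  # True: it is a minimum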

Regularized objective functional on R^n:

    F(u) = Ψ(u) + β Φ(u)        (1.5)
    Φ(u) = Σ_i φ(‖D_i u‖)        (1.6)

D_i u: discrete approximation of the gradient or the Laplacian of the image or signal at i, or finite differences; ‖.‖ is usually the ℓ2 norm (see footnote 1); β > 0 is a parameter and φ : R_+ → R_+ is an increasing function, e.g.
    φ(t) = t^α, 1 ≤ α ≤ 2,   φ(t) = √(t² + α),   φ(t) = min{t², α},   φ(t) = αt/(1 + αt),   α > 0.

Ψ is the data-fitting term for data v ∈ R^m; usually A ∈ R^{m×n} and
    Ψ(u) = ‖Au − v‖₂²,        (1.7)
in some cases Ψ(u) = ‖Au − v‖₁ or another function.

E.g. an image u of size p × q is rearranged columnwise into an n-length vector, n = pq: the first column gives u[1], ..., u[p], the second u[p+1], ..., u[2p], and so on up to u[(q−1)p+1], ..., u[n]; inside the array, the neighbors of a pixel u[i] are u[i−1] (within a column) and u[i−p] (across columns).

Often
    D_i u = ( (u[i] − u[i−p])² + (u[i] − u[i−1])² )^{1/2}        (1.8)
or
    D_k u = u[i] − u[i−p]   and   D_{k+1} u = u[i] − u[i−1].        (1.9)

If u is a one-dimensional signal, (1.8) and (1.9) amount to
    D_i u = u[i] − u[i−1],  i = 2, ..., n.

D_i can also be any other linear mapping. If
    D_i = e_i^T,  i ∈ {1, ..., n},
then
    Φ(u) = Σ_i φ(|u[i]|).

¹ For any p ∈ [1, +∞[, the ℓp norm of a vector v is defined by ‖v‖_p = ( Σ_i |v[i]|^p )^{1/p}, and for p = +∞, ‖v‖_∞ = max_i |v[i]|.
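A minimal sketch of the regularized functional (1.5)-(1.7) for a one-dimensional signal, using first-order differences D_i u = u[i] − u[i−1] and the edge-preserving choice φ(t) = √(t² + α); the data A, v and the parameters β, α below are illustrative assumptions, not values from the course.

# Python sketch: F(u) = ||A u - v||_2^2 + beta * sum_i phi(|D_i u|), 1D case.
import numpy as np

def regularized_objective(u, A, v, beta, alpha):
    phi = lambda t: np.sqrt(t**2 + alpha)      # smooth edge-preserving potential
    Du = np.diff(u)                            # D_i u = u[i] - u[i-1], i = 2..n
    data_fit = np.sum((A @ u - v) ** 2)        # Psi(u) = ||Au - v||_2^2
    regularization = np.sum(phi(np.abs(Du)))   # Phi(u)
    return data_fit + beta * regularization

rng = np.random.default_rng(1)
n, m = 50, 40
A = rng.standard_normal((m, n))                # illustrative observation operator
v = rng.standard_normal(m)                     # illustrative data
u = rng.standard_normal(n)
print(regularized_objective(u, A, v, beta=0.5, alpha=1e-3))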

Illustration of the role of φ for
    F(u) = ‖a ⊛ u − v‖₂² + β Σ_i φ(D_i u).

[Figure: original image; data v = a ⊛ u + n, where a is a blur kernel (SNR = 20 dB) and n is Normal noise; restorations, shown with cross-sections along rows 54 and 90, obtained with φ(t) = t^α (two values of α), φ(t) = min{t², 1}, φ(t) = 1 − 1l(t=0), no regularization (β = 0 in (1.5)), φ(t) = t²/(1+t²) and φ(t) = t/(1+t).]

Illustration of the role of ‖.‖:

(a) F(u) = ‖u − v‖₂² + β‖Du‖₁ : stair-casing;
(b) F(u) = ‖u − v‖₁ + β‖Du‖₂² : exact data-fit;
(c) F(u) = ‖u − v‖₂² + β‖Du‖₂².

[Figure: D is a first-order difference operator, i.e. D_i u = u[i] − u[i+1], 1 ≤ i ≤ p − 1. Data (- - -), restored signal (solid). Constant pieces in (a), and the data samples that are equal to the relevant samples of the minimizer in (b), are emphasized with markers.]


1.2 Analysis of the optimization problem

1.2.1 Reminders

Definition 1 Let F : V → ]−∞, +∞], where V is a real topological space.
- The domain of F is the set dom F = {u ∈ V | F(u) < +∞}.
- The epigraph of F is the set epi F = {(u, λ) ∈ V × R | F(u) ≤ λ}.

Definition 2 A function F on a real n.v.s. V is proper if F : V → ]−∞, +∞] and if it is not identically equal to +∞.

Definition 3 F : V → R is coercive if lim_{‖u‖→∞} F(u) = +∞.

Definition 4 F : V → ]−∞, +∞], for V a real topological space, is lower semi-continuous (l.s.c.) if
    ∀λ ∈ R, the set {u ∈ V | F(u) ≤ λ} is closed in V.
F is upper semi-continuous if −F is l.s.c.
If F is continuous, then it is both l.s.c. and upper semi-continuous.

Definition 5 (Convex subset) Let V be any real vector space. A subset U ⊂ V is convex if ∀u, v ∈ U and ∀θ ∈ ]0, 1[ we have θu + (1 − θ)v ∈ U.

[Figure: examples of a nonconvex set U, a convex but not strictly convex set U, and a strictly convex set U.]

Remind that F can be convex but not coercive.

Definition 6 (Convex function) Let V be any real vector space. A proper function F : U ⊂ V → R is convex if ∀u, v ∈ U and ∀θ ∈ ]0, 1[
    F(θu + (1 − θ)v) ≤ θF(u) + (1 − θ)F(v).
F is strictly convex when the inequality is strict whenever u ≠ v.

Property 1 Important properties [15, p. 8]:
- F is l.s.c. if and only if ∀u ∈ V and ∀ε > 0 there is a neighborhood O of u such that
    F(v) ≥ F(u) − ε,  ∀v ∈ O.
- If F is l.s.c.² and u_k → u as k → ∞, then
    lim inf_{k→∞} F(u_k) ≥ F(u).

² If F is upper semi-continuous and u_k → u as k → ∞, then lim sup_{k→∞} F(u_k) ≤ F(u).

- If (F_i)_{i∈I}, where I is an index set, is a family of l.s.c. functions, then the superior envelope of (F_i)_{i∈I} is l.s.c. In words, the function F defined by
    F(u) = sup_{i∈I} F_i(u)
is l.s.c.
- If (F_i)_{i∈I} is a family of l.s.c. convex functions, then the superior envelope F of (F_i)_{i∈I} is l.s.c. and convex.

[Figure: the superior envelope of three functions F1, F2, F3.]

1.2.2 Existence, uniqueness of the solution

Theorem 1 (Existence) Let U ⊂ R^n be non-empty and closed, F : R^n → R l.s.c. and proper. If U is not bounded, we suppose that F is coercive. Then
    ∃û ∈ U such that F(û) = inf_{u∈U} F(u).
Note that F can be non-convex; moreover, û is not necessarily unique.

Proof. Two parts:
(a) U bounded ⇒ U is compact; since F is l.s.c., the Weierstrass theorem³ yields the result.
(b) U not necessarily bounded. Choose u₀ ∈ U. F coercive ⇒ ∃r > 0 such that
    ‖u‖ > r  ⇒  F(u₀) < F(u).
Then û ∈ Ũ := B̄(u₀, r) ∩ U, which is compact. The conclusion is obtained as in (a). □

Let us underline that the conditions in the theorem are only strong sufficient conditions. Much weaker existence conditions can be found e.g. in [10, p. 96].
Alternative proof using minimizing sequences.
The theorem extends to separable Hilbert spaces under additional conditions, see e.g. [26, 5].
Optimization problems are often called feasible when U and F are convex and the conditions of Theorem 1 are met. Remind that F can be convex but not coercive (Fig. 1.1(b)).

Theorem 2 For U ⊂ V convex, let F : U → R be proper, convex and l.s.c.
1. If F has a relative (local) minimum at û ∈ U, it is a (global) minimum w.r.t. U. [26]
2. The set of minimizers Û = {û ∈ U : F(û) = inf_{u∈U} F(u)} is convex and closed. [38, p. 35]
3. If F is strictly convex, then F admits at most one minimum and the latter is strict.
4. In addition, suppose that either F is coercive or U is bounded. Then Û ≠ ∅. [38, p. 35]

³ Weierstrass theorem: Let V be an n.v.s. and K ⊂ V a compact. If F : V → R is l.s.c. on K, then F achieves a minimum on K. If F : V → R is upper semi-continuous on K, then F achieves a maximum on K. If F is continuous on K, then F achieves a minimum and a maximum on K. See e.g. [52, p. 40].

[Figure 1.1: Illustration of Definition 6 and Theorem 2. Minimizers are depicted with a thick point. Panels: (a) F nonconvex (two local minimizers); (b) F convex non coercive, U = R (no minimizer); (c) F convex non coercive, U compact; (d) F strongly convex, U compact; (e) F non strictly convex (a set of minimizers); (f) F strongly convex on R.]

1.2.3 Equivalent optimization problems

Given an optimization problem defined by F : V → R and U ⊂ V,

    (P)  find û such that F(û) = inf_{u∈U} F(u)   ⇔   find û = arg inf_{u∈U} F(u),

there are many different optimization problems that are in some way equivalent to (P). For instance:
- ŵ = arg min_{w∈X} F̃(w), where F̃ : W → R, X ⊂ W and there is f : W → V such that û = f(ŵ);
- (û, b̂) = arg min_{u∈U} max_{b∈X} F̃(u, b), where F̃ : V × W → R, U ⊂ V, X ⊂ W and û solves (P).

Some of these equivalent optimization problems are easier to solve than the original (P). We shall see such reformulations in what follows. The way enabling to recover them is in general an open question. In many cases, such a reformulation (found by intuition and proven mathematically) is valid only for a particular problem (P).

There are several duality principles in optimization theory relating a problem expressed in terms of vectors in an n.v.s. V to a problem expressed in terms of hyperplanes in the n.v.s.

Definition 7 A hyperplane (affine) is a set of the form
    [h = δ] := {w ∈ V | <h, w> = δ},
where h : V → R is a linear nonzero functional and δ ∈ R is a constant.

A milestone is the Hahn-Banach theorem, stated below in its geometric form [15, 52].

Definition 8 Let U ⊂ V and K ⊂ V. The hyperplane [h = δ] separates K and U
- nonstrictly if <h, w> ≤ δ, ∀w ∈ U and <h, w> ≥ δ, ∀w ∈ K;

- strictly if there exists ε > 0 such that <h, w> ≤ δ − ε, ∀w ∈ U and <h, w> ≥ δ + ε, ∀w ∈ K.

Theorem 3 (Hahn-Banach theorem, geometric form) Let U ⊂ V and K ⊂ V be convex, nonempty and disjoint sets.
(i) If U is open, there is a closed hyperplane⁴ [h = δ] that separates K and U nonstrictly;
(ii) If U is closed and K is compact⁵, then there exists a closed hyperplane [h = δ] that separates K and U in a strict way.

Definition 8 and Theorem 3 are illustrated in Fig. 1.2, where K = {u}. Note that U and K in Definition 8 are not required to be convex.
An example of equivalent optimization problems is seen in Fig. 1.2: the shortest distance from a point u to a convex closed set U is equal to the maximum of the distances from the point u to a hyperplane separating the point u from the set U.

[Figure 1.2: Green line: [h = δ] = {w | <h, w> = δ} separates strictly {u} and U: for v ∈ U we have <h, v> < δ, whereas <h, u> > δ. The orthogonal projection of u onto U is û (red dots). Lines in magenta provide nonstrict separation between u and U.]

One systematic approach to derive equivalent optimization problems is centered about the interrelation between an n.v.s. V and its dual.

Definition 9 The dual V* of an n.v.s. V is composed of all continuous linear functionals on V, that is
    V* := {f : V → R | f linear and continuous}.
V* is endowed with the norm
    ‖f‖_{V*} = sup_{u∈V, ‖u‖≤1} |f(u)|.
The n.v.s. V is reflexive if V** = V, where V** is the dual of V*.

⁴ The hyperplane [h = δ] is closed iff h is continuous [15, p. 4], [52, p. 130]. If V = R^n, any hyperplane is closed.
⁵ Let us remind that if K is a subset of a finite dimensional space, it is compact iff K is closed and bounded.

If V = R^n then V* = R^n (see e.g. [52, p. 107]). The dual of the Hilbert space⁶ L² (resp. ℓ²) is yet again L² (resp. ℓ²), see e.g. [52, pp. 107-109]. Clearly, all these spaces are reflexive as well.

Definition 10 Let F, defined on a real n.v.s. V, be proper. The function F* : V* → R̄ given by
    F*(v) := sup_{u∈V} ( <u, v> − F(u) )        (1.10)
is called the convex conjugate or the polar function of F.

The Fenchel-Young inequality:
    ∀u ∈ V, ∀v ∈ V*:  <u, v> ≤ F(u) + F*(v)        (1.11)

[Figure 1.3: The convex conjugate at v determines a hyperplane that separates nonstrictly epi F and its complement in R^n.]

The application v ↦ <u, v> − F(u) is convex and continuous, hence l.s.c., for any fixed u ∈ V. According to Property 1, F* is convex and l.s.c. (as being a superior convex envelope).

Example 1 Let f(u) = (1/α)|u|^α, where u ∈ R and α ∈ ]1, +∞[.
    f*(v) = sup_{u∈R} ( uv − (1/α)|u|^α ).
The term between the parentheses obviously has a unique global maximizer û. The latter meets v = |û|^{α−1} sign(û), hence sign(û) = sign(v). Then
    û = |v|^{1/(α−1)} sign(v).        (1.12)
Set α* = α/(α−1), which is the same as
    1/α + 1/α* = 1.
We have α* ∈ ]1, +∞[ and
    f*(v) = |v|^{1/(α−1)} sign(v) v − (1/α)|v|^{α/(α−1)} = (1 − 1/α)|v|^{α/(α−1)} = (1/α*)|v|^{α*}.        (1.13)

⁶ Remind that R^n is a Hilbert space as well.
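The closed form (1.13) can be checked numerically by computing the supremum in (1.10) by brute force over a grid; the grid bounds and the values of α and v below are illustrative assumptions, not part of the course.

# Python sketch: check f*(v) = (1/alpha*)|v|^{alpha*} for f(u) = (1/alpha)|u|^alpha.
import numpy as np

alpha = 1.5
alpha_star = alpha / (alpha - 1.0)              # conjugate exponent: 1/alpha + 1/alpha* = 1
u_grid = np.linspace(-50, 50, 2_000_001)        # brute-force search grid (assumption)

for v in [-2.0, -0.3, 0.7, 3.0]:
    numeric = np.max(u_grid * v - (1.0 / alpha) * np.abs(u_grid) ** alpha)
    closed_form = (1.0 / alpha_star) * abs(v) ** alpha_star
    print(v, numeric, closed_form)              # the two values agree up to grid error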

Extension: For u ∈ R^n and α ∈ ]1, +∞[, let
    F(u) = (1/α) ‖u‖_α^α.
Then, for v ∈ R^n,
    F*(v) = sup_{u∈R^n} ( <u, v> − (1/α)‖u‖_α^α ) = sup_{u∈R^n} Σ_{i=1}^n ( u[i]v[i] − (1/α)|u[i]|^α )
          = Σ_{i=1}^n sup_{u[i]∈R} ( u[i]v[i] − (1/α)|u[i]|^α ) = Σ_{i=1}^n f*(v[i]) = (1/α*) Σ_{i=1}^n |v[i]|^{α*} = (1/α*) ‖v‖_{α*}^{α*},
where f* is given by (1.13) and α* satisfies 1/α + 1/α* = 1.

We can repeat the process in (1.10), thereby leading to the bi-conjugate F** : V → R̄:
    F**(u) := sup_{v∈V*} ( <u, v> − F*(v) ).
One can note that F**(u) ≤ F(u), ∀u ∈ V.

Theorem 4 (Fenchel-Moreau) Let F, defined on a real n.v.s. V, be proper, convex and l.s.c. Then
    F** = F.
For the proof, see [15, p. 10] or [73, 38].

The theorem below is an important tool to equivalently reformulate optimization problems.

Theorem 5 (Fenchel-Rockafellar) Let φ and ψ be convex on V. Assume that there exists u₀ such that φ(u₀) < +∞, ψ(u₀) < +∞ and φ is continuous at u₀. Then
    inf_{u∈V} ( φ(u) + ψ(u) ) = max_{v∈V*} ( −φ*(v) − ψ*(−v) ).
The proof can be found in [15, p. 11] or in [52, p. 201] (where the assumptions on φ and ψ are slightly different).

Example 2 V is a real reflexive n.v.s. Let U ⊂ V and K ⊂ V* be convex, compact, nonempty subsets. Define
    φ(u) := sup_{v∈K} <v, u> = max_{v∈K} <v, u> ≤ ( max_{v∈K} ‖v‖ ) ‖u‖,        (1.14)
where the Schwarz inequality was used. Note that we can replace sup with max in (1.14) because K is compact and v ↦ <v, u> is continuous (Weierstrass theorem). Clearly max_{v∈K} ‖v‖ is finite and φ is continuous and convex.

Let w ∈ V* \ K. The set {w} being compact, the Hahn-Banach theorem 3 tells us that there exists h ∈ V** = V enabling us to separate strictly {w} and K ⊂ V*, that is
    <v, h> < <w, h>,  ∀v ∈ K,
and the constant c := inf_{v∈K} ( <w, h> − <v, h> ) meets c > 0. Then
    <w, h> − <v, h> ≥ c > 0,  ∀v ∈ K   ⇒   <w, h> − sup_{v∈K} <v, h> ≥ c > 0.
Using that λh ∈ V for any λ > 0, we have
    φ*(w) = sup_{u∈V} ( <w, u> − φ(u) ) ≥ sup_{λ>0} ( <w, λh> − φ(λh) )
          = sup_{λ>0} λ ( <w, h> − sup_{v∈K} <v, h> ) ≥ c sup_{λ>0} λ = +∞.        (1.15)
Let now w ∈ K ⊂ V*. Then <w, u> − max_{v∈K} <v, u> ≤ 0, ∀u ∈ V, hence
    φ*(w) = sup_{u∈V} ( <w, u> − φ(u) ) = sup_{u∈V} ( <w, u> − max_{v∈K} <v, u> ) = 0,        (1.16)
where the upper bound 0 is always reached for u = 0.

Combining (1.15) and (1.16) shows that the convex conjugate of φ reads
    φ*(v) = +∞ if v ∉ K,   φ*(v) = 0 if v ∈ K.
Set
    ψ(u) = +∞ if u ∉ U,   ψ(u) = 0 if u ∈ U.
By the definition of φ in (1.14) and the one of ψ, the Fenchel-Rockafellar Theorem 5 yields
    min_{u∈U} max_{v∈K} <u, v> = min_{u∈U} φ(u) = min_{u∈V} ( φ(u) + ψ(u) ) = max_{v∈V*} ( −φ*(v) − ψ*(−v) ).        (1.17)
The maximum in the right side cannot be reached if v ∉ K because in this case −φ*(v) = −∞. Hence max_{v∈V*} ( −φ*(v) − ψ*(−v) ) = max_{v∈K} ( −ψ*(−v) ). Furthermore,
    ψ*(−v) = sup_{u∈U} ( <−v, u> − ψ(u) ) = −inf_{u∈U} ( <v, u> + ψ(u) ) = −inf_{u∈U} <v, u> = −min_{u∈U} <v, u>.
It follows that
    max_{v∈V*} ( −φ*(v) − ψ*(−v) ) = max_{v∈K} min_{u∈U} <v, u>.        (1.18)
Combining (1.17) and (1.18) shows that min_{u∈U} max_{v∈K} <u, v> = max_{v∈K} min_{u∈U} <v, u>.

In Example 2 we have proven the fundamental Min-Max theorem, used often in classical game theory. The precise statement is given below.

Theorem 6 (Min-Max) Let V be a reflexive real n.v.s. Let U ⊂ V and K ⊂ V* be convex, compact, nonempty subsets. Then
    min_{u∈U} max_{v∈K} <u, v> = max_{v∈K} min_{u∈U} <v, u>.
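For a finite-dimensional illustration of Theorem 6, here is a brute-force sketch where U and K are taken as boxes in R² (an illustrative assumption chosen only for the demonstration): both sides of the min-max identity are approximated on grids and agree up to discretization error.

# Python sketch: min_{u in U} max_{v in K} <u,v> = max_{v in K} min_{u in U} <u,v>.
# U and K are boxes in R^2 (illustrative convex compact sets, an assumption).
import numpy as np

grid = np.linspace(0.0, 1.0, 41)
Us = np.array([[0.5 + 1.5 * s, -1.0 + 2.0 * t] for s in grid for t in grid])  # U = [0.5,2] x [-1,1]
Ks = np.array([[1.0 + 1.0 * s, -0.5 + 1.0 * t] for s in grid for t in grid])  # K = [1,2] x [-0.5,0.5]

P = Us @ Ks.T                      # P[i, j] = <u_i, v_j>
min_max = P.max(axis=1).min()      # min over U of max over K
max_min = P.min(axis=0).max()      # max over K of min over U
print(min_max, max_min)            # both equal 1.0 on this example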

1.3 Optimization algorithms

Usually the solution û is defined implicitly and cannot be computed in an explicit way, in one step.

1.3.1 Iterative minimization methods

Construct a sequence (u_k)_{k∈N}, initialized by u₀, converging to û, a solution of (P):
    u_{k+1} = G(u_k),  k = 0, 1, ...,        (1.19)
where G : V → U is called the iterative scheme (often defined implicitly). The solution û can be local (relative) if (P) is nonconvex. The choice of u₀ (= the initialization) can be crucial if (P) is nonconvex.
G is constructed using information on (P), e.g. F(u_k), ∇F(u_k), g_i(u_k), ∇g_i(u_k), h_i(u_k) and ∇h_i(u_k) (or subgradients instead of ∇ for nonsmooth functions).
By (1.19), the iterates of G read
    u₁ = G(u₀);
    u₂ = G(u₁) = G∘G(u₀) := G²(u₀);
    ...
    u_k = G∘...∘G(u₀) (k times) := G^k(u₀) = G(G^{k−1}(u₀)).

Key questions:
- Given an iterative method G, determine if there is convergence;
- If convergence, to what kind of accumulation point?
- Given two iterative methods G₁ and G₂, choose the faster one.

Definition 11 û is a fixed point for G if G(û) = û.

Let X be a metric space equipped with distance d.

Definition 12 G : X → X is a contraction if there exists γ ∈ (0, 1) such that
    d(G(u₁), G(u₂)) ≤ γ d(u₁, u₂),  ∀u₁, u₂ ∈ X.
G is Lipschitzian⁷ ⇒ uniformly continuous.

Theorem 7 (Fixed point theorem) Let X be complete and G : X → X a contraction. Then G admits a unique fixed point, û = G(û). (See [77, p. 141].)

Theorem 8 (Fixed point theorem, bis) Let X be complete and G : X → X be such that ∃k₀ for which G^{k₀} is a contraction. Then G admits a unique fixed point. (See [77, p. 142].)

Note that in the latter case G is not necessarily a contraction.

⁷ A function f : V → X is ℓ-Lipschitz continuous if ∀(u, v) ∈ V × V we have ‖f(u) − f(v)‖ ≤ ℓ‖u − v‖.
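A minimal sketch of the fixed-point iteration (1.19) for a contraction on R²: the affine map G below (an illustrative assumption with ‖M‖ < 1) has the unique fixed point guaranteed by Theorem 7, and the iterates reach it.

# Python sketch: fixed-point iteration u_{k+1} = G(u_k) for a contraction G on R^2.
import numpy as np

M = np.array([[0.4, 0.2],
              [0.1, 0.3]])                    # ||M||_2 < 1, so G is a contraction (assumption)
b = np.array([1.0, -2.0])
G = lambda u: M @ u + b

u = np.zeros(2)
for k in range(100):
    u = G(u)

u_fixed = np.linalg.solve(np.eye(2) - M, b)    # exact fixed point of the affine map
print(u, u_fixed, np.linalg.norm(u - u_fixed)) # the iterates reach the unique fixed point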

1.3.2 Local convergence rate

In general, a nonlinear problem (P) cannot be solved exactly in a finite number of iterations.
Goal: attach to G precise indicators of the asymptotic rate of convergence of u_k towards û. Refs. [70, 69, 26, 14].
Here V = R^n, ‖.‖ = ‖.‖₂, and we simply assume that u_k converges to û.

Q-convergence studies the quotient
    Q_k := ‖u_{k+1} − û‖ / ‖u_k − û‖,  k ∈ N,    Q = lim sup_{k→∞} Q_k.
If Q < ∞ then ∀ε > 0 ∃k₀ such that
    ‖u_{k+1} − û‖ ≤ (Q + ε)‖u_k − û‖,  ∀k ≥ k₀,
a bound on the error at iteration k + 1 in terms of the error at iteration k. (Crucial if Q < 1.)
- 0 < Q < 1: Q-linear convergence;
- Q = 0: Q-superlinear convergence (also called superlinear convergence);
- in particular, if ‖u_{k+1} − û‖ = O(‖u_k − û‖²): Q-quadratic convergence.
Compare how G₁ and G₂ converge towards the same û: if Q(G₁, û) < Q(G₂, û), then G₁ is faster than G₂ in the sense of Q.

R-convergence studies the rate of the root
    R_k := ‖u_k − û‖^{1/k},  k ∈ N,    R = lim sup_{k→∞} R_k.
- 0 < R < 1: R-linear convergence; this means geometric or exponential convergence since ∀ε ∈ (0, 1 − R) ∃k₀ such that R_k ≤ R + ε, ∀k ≥ k₀, i.e. ‖u_k − û‖ ≤ (R + ε)^k;
- R = 0: R-superlinear convergence.
G₁ is faster than G₂ in the sense of R if R(G₁, û) < R(G₂, û).

Remark 1 Sublinear convergence if Q_k → 1 or R_k → 1. Convergence is too slow; choose another G.

Theorem 9 Let ‖u_k − û‖ ≤ v_k, where (v_k)_{k∈N} converges to 0 Q-superlinearly. Then (u_k)_{k∈N} converges to û R-superlinearly.

Theorem 10 Let u_k → û R-linearly. Then ∃(v_k)_{k∈N} that tends to 0 Q-linearly such that ‖u_k − û‖ ≤ v_k.

Proofs of the last two theorems: see [14, p. 15].
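A small sketch (illustrative, not from the course) estimating Q_k and R_k along the iterates of a simple linearly convergent scheme, the fixed-step gradient iteration for a 1D quadratic; both observed rates tend to the same linear factor |1 − aλ|.

# Python sketch: empirical Q_k and R_k for u_{k+1} = u_k - a*F'(u_k), F(u) = 0.5*lam*u^2.
import numpy as np

lam, a = 2.0, 0.3                 # illustrative values (assumption); factor |1 - a*lam| = 0.4
u_hat = 0.0                       # unique minimizer
u = 1.0
errors = []
for k in range(30):
    errors.append(abs(u - u_hat))
    u = u - a * (lam * u)         # gradient step, F'(u) = lam*u

errors = np.array(errors)
Q = errors[1:] / errors[:-1]                          # Q_k = ||u_{k+1}-u^|| / ||u_k-u^||
R = errors[1:] ** (1.0 / np.arange(1, len(errors)))   # R_k = ||u_k-u^||^{1/k}
print(Q[:5])    # constant, equal to 0.4 -> Q-linear convergence
print(R[-1])    # tends to the same factor -> R-linear convergence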

Lemma 1 Let G : U ⊂ R^n → R^n, ‖.‖ be any norm on R^n, B(û, δ) ⊂ U and γ ∈ [0, 1) be such that
    ‖G(u) − û‖ ≤ γ‖u − û‖,  ∀u ∈ B(û, δ).
Then ∀u₀ ∈ B(û, δ) the iterates given by (1.19) remain in B(û, δ) and converge to û.

Proof. Let u₀ ∈ B(û, δ); then ‖u₁ − û‖ = ‖G(u₀) − û‖ ≤ γ‖u₀ − û‖ < ‖u₀ − û‖, hence u₁ ∈ B(û, δ). Using induction, u_k ∈ B(û, δ) and ‖u_k − û‖ ≤ γ^k ‖u₀ − û‖. Thus lim_{k→∞} u_k = û. □

Theorem 11 (Ostrowski) [70, p. 300] Let G : U ⊂ R^n → R^n be differentiable at û ∈ int(U) and have a fixed point G(û) = û ∈ U. If
    γ := max_{1≤i≤n} |λ_i(∇G(û))| < 1,        (1.20)
then ∃O(û) such that ∀u₀ ∈ O(û) we have u_k ∈ O(û) and u_k → û.

Proof. ∇G(û) is not necessarily symmetric and positive definite. Condition (1.20) says that ∀ε > 0 there is an induced matrix norm ‖.‖ on R^{n×n} such that⁸
    ‖∇G(û)‖ ≤ γ + ε.        (1.21)
Let us choose ε such that
    ε < (1/2)(1 − γ).
The differentiability of G at û implies that ∃δ > 0 so that B(û, δ) ⊂ U and
    ‖G(u) − G(û) − ∇G(û)(u − û)‖ ≤ ε‖u − û‖,  ∀u ∈ B(û, δ).        (1.22)
Using that G(û) = û, and combining (1.21) and (1.22), we get
    ‖G(u) − û‖ = ‖G(u) − G(û) − ∇G(û)(u − û) + ∇G(û)(u − û)‖
               ≤ ‖G(u) − G(û) − ∇G(û)(u − û)‖ + ‖∇G(û)‖ ‖u − û‖
               < (γ + 2ε)‖u − û‖ < ‖u − û‖,  ∀u ∈ B(û, δ).
The conclusion follows from the observation that γ + 2ε < 1 and Lemma 1. □

⁸ Let B be an n × n real matrix. Its spectral radius is defined as its largest eigenvalue in absolute value,
    ρ(B) := max_{1≤i≤n} |λ_i(B)|.
Given a vector norm on R^n, say ‖.‖, the corresponding induced matrix norm on the space of all n × n matrices reads
    ‖B‖ := sup{‖Bu‖ : u ∈ R^n, ‖u‖ ≤ 1}.

Theorem 12 [26, p. 18]
(1) Let B be an n × n matrix and ‖.‖ any matrix norm. Then ρ(B) ≤ ‖B‖.
(2) Given a matrix B and a number ε > 0, there exists at least one induced matrix norm ‖.‖ such that ‖B‖ ≤ ρ(B) + ε.

Theorem 13 (linear convergence) Under the conditions of Theorem 11, max_{1≤i≤n} |λ_i(∇G(û))| = R, where R is the root convergence factor.
Proof: see [70, p. 301].

Illustration of the role of convergence conditions:
[Figure 1.4: (a) Original, (b) Noisy, (c) Correctly restored, (d) No convergence. Convergence conditions satisfied in (c), not satisfied in (d).]

Chapter 2
Unconstrained differentiable problems

2.1 Preliminaries (regularity of F)

Reminders (see e.g. [78, 11, 26]):
1. For V and Y real n.v.s., f : V → Y is differentiable at u ∈ V if ∃Df(u) ∈ L(V, Y) (linear continuous application from V to Y) such that
    f(u + v) = f(u) + Df(u)v + ‖v‖ε(v),  where  lim_{v→0} ε(v) = 0.
2. If Y = R (i.e. f : V → R) and the norm on V is derived from an inner product <., .>, then L(V; R) is identified with V (via a canonical isomorphism). Then ∇f(u) ∈ V, the gradient of f at u, is defined by
    <∇f(u), v> = Df(u)v,  ∀v ∈ V.
Note that if V = R^n we have Df(u) = (∇f(u))^T.
3. Under the conditions given above in 2., if f is twice differentiable, we can identify the second differential D²f and the Hessian ∇²f via D²f(u)(v, w) = <∇²f(u)v, w>.

Property 2 Let U ⊂ V be convex. Suppose F : V → R is differentiable in O(U).
1. F convex on U ⇔ F(v) ≥ F(u) + <∇F(u), v − u>, ∀u, v ∈ U.
2. F strictly convex on U ⇔ F(v) > F(u) + <∇F(u), v − u>, ∀u, v ∈ U, v ≠ u.
A proof of statement 1 can be found in the Appendix, p. 106.

Property 3 Let U ⊂ V be convex. Suppose F twice differentiable in O(U).
1. F convex on U ⇔ <∇²F(u)(v − u), v − u> ≥ 0, ∀u, v ∈ U.
2. If <∇²F(u)(v − u), v − u> > 0, ∀u, v ∈ U, v ≠ u, then F is strictly convex on U.

Definition 13 F is called strongly convex or, equivalently, elliptic if F ∈ C¹ and if ∃μ > 0 such that
    <∇F(v) − ∇F(u), v − u> ≥ μ‖v − u‖²,  ∀v, u ∈ V.

F in (1.4) is strongly convex if and only if B ≻ 0.

Property 4 Suppose F : V → R is differentiable in V.
1. F : V → R strongly convex ⇒ F is strictly convex, coercive and
    F(v) ≥ F(u) + <∇F(u), v − u> + (μ/2)‖v − u‖²,  ∀v, u ∈ V.
2. Suppose F twice differentiable in V:
    F strongly convex ⇔ <∇²F(u)v, v> ≥ μ‖v‖²,  ∀v ∈ V.

Remark 2 If F : R^n → R is twice differentiable, then ∇²F(u) (known as the Hessian of F at u) is an n × n symmetric matrix, ∀u ∈ R^n. Indeed, (∇²F(u))[i, j] = ∂²F(u)/(∂u[i]∂u[j]) = (∇²F(u))[j, i].

2.2 One coordinate at a time (Gauss-Seidel)

We consider the minimization of a coercive, proper, convex function F : R^n → R.
For each k ∈ N, the algorithm involves n steps: for i = 1, ..., n, u_{k+1}[i] is updated according to
    F(u_{k+1}[1], ..., u_{k+1}[i−1], u_{k+1}[i], u_k[i+1], ..., u_k[n])
      = inf_{ρ∈R} F(u_{k+1}[1], ..., u_{k+1}[i−1], ρ, u_k[i+1], ..., u_k[n]).

Theorem 14 F is C¹, strictly convex and coercive ⇒ (u_k)_{k∈N} converges to û. (See [44].)
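A minimal sketch of the one-coordinate-at-a-time scheme on a smooth, strictly convex quadratic F(u) = (1/2)<Bu, u> − <c, u>, for which each one-dimensional minimization has the closed form used below; B and c are illustrative assumptions.

# Python sketch: Gauss-Seidel (one coordinate at a time) on F(u) = 0.5*<Bu,u> - <c,u>.
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)        # symmetric positive definite (assumption)
c = rng.standard_normal(n)

u = np.zeros(n)
for k in range(200):               # outer iterations
    for i in range(n):             # exact minimization along coordinate i
        u[i] = (c[i] - B[i] @ u + B[i, i] * u[i]) / B[i, i]

print(np.linalg.norm(u - np.linalg.solve(B, c)))   # ~0: converges to the minimizer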

Remark 3 Differentiability is essential. E.g. apply the method to
    F(u[1], u[2]) = u[1]² + u[2]² − 2(u[1] + u[2]) + 2|u[1] − u[2]|   with u₀ = (0, 0).
The iterates remain stuck at u₀, while inf_{R²} F(u) = F(1, 1) = −2.
Remark 4 Extension for F(u) = ‖Au − v‖² + β Σ_i |u[i]|, see [44].

2.3 First-order (Gradient) methods

They use F(u) and ∇F(u) and rely on F(u_k − α_k d_k) ≈ F(u_k) − α_k <∇F(u_k), d_k>. Different tasks:
- Choose a descent direction d_k, such that F(u_k − αd_k) < F(u_k) for α > 0 small enough;
- Line search: find α_k such that F(u_k − α_k d_k) − F(u_k) < 0 is sufficiently negative.
- Stopping rule: introduce an error function which measures the quality of an approximate solution u, e.g. choose a norm ‖.‖, ε > 0 (small enough) and test:
  Iterates: if ‖u_{k+1} − u_k‖ ≤ ε, then stop;
  Gradient: if ‖∇F(u_k)‖ ≤ ε, then stop;
  Objective: if F(u_k) − F(u_{k+1}) ≤ ε, then stop;

or a combination of these.

Descent direction d_k for F at u_k: F(u_k − αd_k) < F(u_k), ∀α > 0 (small enough).
F differentiable:
    <∇F(u_k), d_k> > 0  ⇐  d_k is a descent direction for F at u_k        (2.1)
F convex and differentiable:
    <∇F(u_k), d_k> > 0  ⇔  d_k is a descent direction for F at u_k        (2.2)

[Figure: level lines of F : R² → R, smooth strictly convex (one minimizer) and smooth nonconvex (two minimizers).]

Proof. Using a first-order expansion,
    F(u_k − αd_k) = F(u_k) − α<∇F(u_k), d_k> + α‖d_k‖ ε(αd_k),        (2.3)
where
    ε(αd_k) → 0 as α → 0.        (2.4)
Then (2.3) can be rewritten as
    F(u_k) − F(u_k − αd_k) = α( <∇F(u_k), d_k> − ‖d_k‖ ε(αd_k) ).
Consider that <∇F(u_k), d_k> > 0. By (2.4), there is ᾱ > 0 such that <∇F(u_k), d_k> > ‖d_k‖ ε(αd_k), ∀α ∈ ]0, ᾱ[. Then d_k is a descent direction, since F(u_k) − F(u_k − αd_k) > 0, ∀α ∈ ]0, ᾱ[.
When F is convex, we know by Property 2-1 that F(u_k − αd_k) ≥ F(u_k) − α<∇F(u_k), d_k>, i.e.
    F(u_k) − F(u_k − αd_k) ≤ α<∇F(u_k), d_k>.
If <∇F(u_k), d_k> ≤ 0, then F(u_k) − F(u_k − αd_k) ≤ 0, hence d_k is not a descent direction. It follows that for d_k to be a descent direction, it is necessary that <∇F(u_k), d_k> > 0. □

Note that some methods construct α_k d_k in an automatic way.

2.3.1 The steepest descent method

Motivation: make F(u_k) − F(u_{k+1}) as large as possible. In (2.3) we have <∇F(u_k), d_k> ≤ ‖∇F(u_k)‖ ‖d_k‖ (Schwarz inequality), where the equality is reached if d_k ∝ ∇F(u_k).
The steepest descent method (= gradient with optimal stepsize) is defined by
    F(u_k − α_k ∇F(u_k)) = inf_{α∈R} F(u_k − α∇F(u_k)),        (2.5)
    u_{k+1} = u_k − α_k ∇F(u_k),  k ∈ N.        (2.6)

Theorem 15 If F : R^n → R is strongly convex, (u_k)_{k∈N} given by (2.5)-(2.6) converges to the unique minimizer û of F.

Proof. Suppose that ∇F(u_k) ≠ 0 (otherwise û = u_k). Proof in 5 steps.
(a) Denote
    f_k(α) := F(u_k − α∇F(u_k)),  α > 0.
f_k is coercive and strictly convex, hence it admits a unique minimizer α = α(u_k), and the latter solves the equation f_k'(α) = 0. Thus
    f_k'(α_k) = −<∇F(u_k − α_k∇F(u_k)), ∇F(u_k)> = 0.        (2.7)
Since
    u_{k+1} = u_k − α_k∇F(u_k)   ⇔   ∇F(u_k) = (1/α_k)(u_k − u_{k+1}),        (2.8)
the left-hand side of (2.8) inserted into (2.7) yields
    <∇F(u_{k+1}), ∇F(u_k)> = 0,        (2.9)
i.e. two consecutive directions are orthogonal, while the right-hand side of (2.8) inserted into (2.7) shows that
    <∇F(u_{k+1}), u_k − u_{k+1}> = 0.
Using the last equation and the assumption that F is strongly convex,
    F(u_k) − F(u_{k+1}) ≥ <∇F(u_{k+1}), u_k − u_{k+1}> + (μ/2)‖u_k − u_{k+1}‖² = (μ/2)‖u_k − u_{k+1}‖².        (2.10)
(b) Since (F(u_k))_{k∈N} is decreasing and bounded from below by F(û), we deduce that (F(u_k))_{k∈N} converges, hence
    lim_{k→∞} ( F(u_k) − F(u_{k+1}) ) = 0.
Inserting this result into (2.10) shows that
    lim_{k→∞} ‖u_k − u_{k+1}‖ = 0.¹        (2.11)
(c) Using (2.9) allows us to write down
    ‖∇F(u_k)‖² = <∇F(u_k), ∇F(u_k)> − <∇F(u_k), ∇F(u_{k+1})> = <∇F(u_k), ∇F(u_k) − ∇F(u_{k+1})>.
By Schwarz's inequality,
    ‖∇F(u_k)‖² ≤ ‖∇F(u_k)‖ ‖∇F(u_k) − ∇F(u_{k+1})‖,
hence
    ‖∇F(u_k)‖ ≤ ‖∇F(u_k) − ∇F(u_{k+1})‖.        (2.12)

¹ Remind that (2.11) does not mean that the sequence (u_k) converges! Consider u_k = Σ_{i=0}^k 1/(i+1). It is well known that (u_k)_{k∈N} diverges. Nevertheless, u_{k+1} − u_k = 1/(k+2) → 0 as k → ∞.

(d) The facts that (F(u_k))_{k∈N} is decreasing and that F is coercive imply that ∃r > 0 such that u_k ∈ B(0, r), ∀k ∈ N. Since F ∈ C¹, ∇F is uniformly continuous on the compact B̄(0, r). Using (2.11), ∀ε > 0 there are δ > 0 and k₀ ∈ N such that
    ‖u_k − u_{k+1}‖ < δ  ⇒  ‖∇F(u_k) − ∇F(u_{k+1})‖ < ε,  ∀k ≥ k₀.
Consequently,
    lim_{k→∞} ‖∇F(u_k) − ∇F(u_{k+1})‖ = 0.
Combining this result with (2.12) shows that
    lim_{k→∞} ‖∇F(u_k)‖ = 0.        (2.13)
(e) Using that F is strongly convex, that ∇F(û) = 0 and Schwarz's inequality,
    μ‖u_k − û‖² ≤ <∇F(u_k) − ∇F(û), u_k − û> = <∇F(u_k), u_k − û> ≤ ‖∇F(u_k)‖ ‖u_k − û‖.
Thus we have a bound on the error at iteration k:
    ‖u_k − û‖ ≤ (1/μ)‖∇F(u_k)‖ → 0 as k → ∞,
where the convergence result is due to (2.13). □

Remark 5 Note the role of the assumption that V = R^n is of finite dimension in this proof.

Quadratic strictly convex problem. Consider F : R^n → R of the form
    F(u) = (1/2)<Bu, u> − <c, u>,  B ≻ 0,  B = B^T.        (2.14)
Since B is symmetric and B ≻ 0, F is strongly convex².
    ∇F(u) = Bu − c  (remind B = B^T),
    d_k = ∇F(u_k),  k = 0, 1, ...,
    f(α) = F(u_k − α∇F(u_k)) = F(u_k − α(Bu_k − c)).
The steepest descent requires α_k such that f'(α_k) = 0:
    0 = f'(α_k) = −<∇F(u_k − α_k∇F(u_k)), ∇F(u_k)>
                = −<B(u_k − α_k∇F(u_k)) − c, ∇F(u_k)>
                = −<(Bu_k − c) − α_k B∇F(u_k), ∇F(u_k)>
                = −‖∇F(u_k)‖² + α_k <B∇F(u_k), ∇F(u_k)>
                = −‖d_k‖² + α_k <Bd_k, d_k>
    ⇒  α_k = ‖d_k‖² / <Bd_k, d_k>.

² Using that B ≻ 0 and that B = B^T, the least eigenvalue of B, denoted by λ_min, obeys λ_min > 0. Using Definition 13,
    <∇F(v) − ∇F(u), v − u> = (v − u)^T B(v − u) ≥ λ_min ‖v − u‖₂².

The full algorithm: for any k ∈ N, do
    d_k = Bu_k − c,
    α_k = ‖d_k‖² / <Bd_k, d_k>,
    u_{k+1} = u_k − α_k d_k,
a method to solve a linear system Bu = c when B ≻ 0 and B^T = B.

Theorem 16 The statement of Theorem 15 holds true if F is C¹, strictly convex and coercive.

Proof. (Sketch.) (F(u_k))_{k∈N} is decreasing and bounded from below. Then ∃ρ > 0 such that u_k ∈ B(0, ρ), ∀k ∈ N, i.e. (u_k)_{k∈N} is bounded. Hence there exists a convergent subsequence (u_{k_j})_{j∈N}; let us denote
    ũ = lim_{j→∞} u_{k_j}.
∇F being continuous and using (2.13), lim_{j→∞} ∇F(u_{k_j}) = ∇F(ũ) = 0. Remind that F admits a unique minimizer û, and the latter satisfies ∇F(û) = 0. It follows that ũ = û. □

Note that we do not have any bound on the error ‖u_k − û‖ as in the proof of Theorem 15.
Q-linear convergence of (F(u_k) − F(û))_{k∈N} towards zero in a neighborhood of û holds under additional conditions, see [14, p. 33].
The steepest descent method can be very bad: the sequence of iterates is subject to zigzags, and the step-size can decrease considerably.
However, the steepest descent method serves as a basis for all the methods actually used.
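A minimal sketch of the full algorithm above (steepest descent with optimal stepsize for F(u) = (1/2)<Bu, u> − <c, u>); the test matrix B and vector c are illustrative assumptions.

# Python sketch: steepest descent with optimal stepsize for Bu = c, B symmetric, B > 0.
import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                # symmetric positive definite (assumption)
c = rng.standard_normal(n)

u = np.zeros(n)
for k in range(500):
    d = B @ u - c                          # d_k = grad F(u_k) = B u_k - c
    if np.linalg.norm(d) < 1e-12:
        break
    alpha = (d @ d) / (d @ (B @ d))        # alpha_k = ||d_k||^2 / <B d_k, d_k>
    u = u - alpha * d                      # u_{k+1} = u_k - alpha_k d_k

print(np.linalg.norm(B @ u - c))           # ~0: solves the linear system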
Remark 6 Consider
    F(u[1], u[2]) = (1/2)( λ₁u[1]² + λ₂u[2]² ),  0 < λ₁ < λ₂.
Clearly, û = 0 and
    ∇F(u) = ( λ₁u[1], λ₂u[2] )^T.
Initialize the steepest descent method with u₀ such that u₀[1] ≠ 0 and u₀[2] ≠ 0. Iterations read
    u_{k+1} = u_k − α_k∇F(u_k) = ( u_k[1] − α_kλ₁u_k[1], u_k[2] − α_kλ₂u_k[2] )^T.
In order to get u_{k+1} = 0 we need α_kλ₁ = 1 and α_kλ₂ = 1, which is impossible (λ₁ ≠ λ₂). Finding the solution û = 0 needs an infinite number of iterations.

2.3.2 Gradient with variable step-size

V Hilbert space, ‖u‖ = √<u, u>, ∀u ∈ V.
    u_{k+1} = u_k − α_k∇F(u_k),  α_k > 0,  k ∈ N        (2.15)

Theorem 17 Let F : V → R be differentiable in V. Suppose ∃μ and M such that 0 < μ ≤ M, and
(i) <∇F(u) − ∇F(v), u − v> ≥ μ‖u − v‖², ∀(u, v) ∈ V² (i.e. F is strongly convex);
(ii) ‖∇F(u) − ∇F(v)‖ ≤ M‖u − v‖, ∀(u, v) ∈ V².
Consider the iteration (2.15) where
    β ≤ α_k ≤ 2μ/M² − β,  ∀k ∈ N,   with   β ∈ ]0, μ/M²].
Then (u_k)_{k∈N} in (2.15) converges to the unique minimizer û of F and
    ‖u_{k+1} − û‖ ≤ γ^k ‖u₀ − û‖,   where   γ = √(β²M² − 2βμ + 1) < 1.
Proof. See the proof of Theorem 30, p. 51, for U = Id. □

If F is twice differentiable, condition (ii) becomes sup_{u∈V} ‖∇²F(u)‖ ≤ M.

Remark 7 Denote by û the fixed point of
    G(u) = u − α∇F(u).
∇G(u) = I − α∇²F(u). By Theorem 11, p. 19, convergence is ensured if max_i |λ_i( I − α∇²F(û) )| < 1.

Quadratic strictly convex problem (2.14), p. 25. Convergence is improved: it is ensured whenever α_k stays in a closed interval inside ]0, 2/λ_max(B)[, e.g.
    β ≤ α_k ≤ 2/λ_max(B) − β,   with   β ∈ ]0, 1/λ_max(B)].
Proof. See the proof of Proposition 16, p. 54. □
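A minimal sketch of the gradient iteration (2.15) with a constant step-size chosen inside ]0, 2/λ_max(B)[ for the quadratic problem (2.14); B, c and the step value are illustrative assumptions.

# Python sketch: gradient descent with constant step alpha in ]0, 2/lambda_max(B)[.
import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)                     # symmetric positive definite (assumption)
c = rng.standard_normal(n)

alpha = 1.0 / np.linalg.eigvalsh(B).max()   # safe constant step (< 2/lambda_max)
u = np.zeros(n)
for k in range(5000):
    u = u - alpha * (B @ u - c)             # u_{k+1} = u_k - alpha * grad F(u_k)

print(np.linalg.norm(u - np.linalg.solve(B, c)))   # small: converges to the minimizer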

2.4 Line search

2.4.1 Introduction

Merit function f : R → R:
    f(α) = F(u_k − αd_k),  α > 0,
where d_k is a descent direction, e.g. d_k = ∇F(u_k), and in any case <∇F(u_k), d_k> > 0.
The goal is to choose α > 0 such that f is decreased enough.
Line search is of crucial importance since it is done at each iteration.
We fix k and drop indexes. Thus we write
    f(α) = F(u − αd),   then   f'(α) = −<∇F(u − αd), d>.
In particular,
    f'(0) = −<∇F(u), d> < 0,
since d is a descent direction, see (2.1), p. 23.
Usually, line search is a subprogram where f can be evaluated only point-wise. In practice, α is found by trial and error.
General scheme with 3 possible exits:
(a) f'(α) = 0: α seems to minimize f (true if f is convex);
(b) f'(α) > 0: f seems to have a minimum for a smaller α;
(c) f'(α) < 0: f seems to have a minimum for a larger α.

2.4.2 Schematic algorithm for line-search

α_L: a too small α;  α_R: a too large α.
Step (0). Initialize α_L = 0, α_R such that f'(α_R) > 0, and α > 0. An initial α_R can be found using extrapolation.
Step (1). Test α > 0:
  if f'(α) = 0, then stop;
  if f'(α) < 0, set α_L = α and go to Step (2);
  if f'(α) > 0, set α_R = α and go to Step (2).
Step (2). Compute a new α ∈ ]α_L, α_R[. Loop to Step (1).

For some functions F, α can be calculated explicitly; such algorithms are faster.
Fletcher's initialization: assume that f is locally quadratic, then take α = 2(F(u_k) − F(u_{k−1}))/f'(0) > 0.
Once an α_R is found, a line-search algorithm is a sequence of interpolations that reduce the bracket [α_L, α_R], i.e. α_L increases and α_R decreases.

Property 5 Each [α_L, α_R] contains an ᾱ such that f'(ᾱ) = 0. Infinite interpolations entail α_R − α_L → 0.

Historically, one has tried to find ᾱ such that f'(ᾱ) = 0. Such an ᾱ is called the optimal stepsize. Looking for this ᾱ is not a good strategy in practice.

Interpolation methods. There are many choices. For instance:
- Bisection method: α = (α_L + α_R)/2;
- Polynomial fitting: fit a polynomial that coincides with the points α_i already tested and compute an α that minimizes this new function explicitly. E.g. use a 2nd or 3rd order polynomial.

Precautions:
- Attention to round-off errors.
- Avoid infinite loops: impose emergency exits.
- Attention when programming the computation of f'.
- Mathematical proofs need assumptions on f. It may happen that they are not satisfied or that they exclude the round-off errors.
Line-search is time consuming!

2.4.3 Modern line-search methods

Arguments:
- Devise tolerant stopping rules: we minimize F and not f!
- Striving to minimize f along the current direction is useless.
Intuitions:
- If f is quadratic with minimum at ᾱ, then³
    f(ᾱ) = f(0) + (1/2) f'(0) ᾱ;
- If f is affine, then f(α) = f(0) + f'(0) α.
Goal: predict the decrease of f with respect to f(0).

Wolfe's conditions

It seems the best actually [69, 14]. Here f is the usual merit function, f(α) = F(u_k − αd), α > 0. (To simplify, indexes are dropped.) Goals at each iteration:
- decrease f enough (hence u_{k+1} will not be too far from u_k);
- increase f' enough (hence u_{k+1} will not be too close to u_k).
Choose two coefficients 0 < c₀ < c₁ < 1, e.g. c₀ < 1/2 and c₁ > 1/2. Do the following 3 tests:
1. (a) f(α) ≤ f(0) + c₀ α f'(0)  and  (b) f'(α) ≥ c₁ f'(0)   ⇒   terminate;
2. f(α) > f(0) + c₀ α f'(0)   ⇒   set α_R = α  (interpolation step);
3. (a) f(α) ≤ f(0) + c₀ α f'(0)  and  (b) f'(α) < c₁ f'(0)   ⇒   set α_L = α  (extrapolation step).

Theorem 18 Suppose that f is C¹ and bounded from below. Then Wolfe's line-search terminates (i.e. the number of line-search iterations is finite).

Proof. Remind that f'(0) < 0 because we are on a descent direction. The proof is in two steps.
(a) If the extrapolation test 3 is done infinitely many times, then α → ∞. Using 3(a), we deduce that
    f(α) ≤ f(0) − c₀ α |f'(0)| → −∞.
This contradicts the assumption that f is bounded from below.

³ For some c > 0, the quadratic f can be written
    f(t) = (1/2)c t² + f'(0) t + f(0).
The minimizer ᾱ satisfies f'(ᾱ) = cᾱ + f'(0) = 0. Hence
    ᾱ = −f'(0)/c,   f(ᾱ) = −f'(0)²/(2c) + f(0)   ⇒   f(ᾱ) − f(0) = (1/2) f'(0) ᾱ.
Hence the number of extrapolations is finite.
(b) Suppose now that the interpolations go on forever, i.e. the bracket [α_L, α_R] is updated infinitely many times by tests 2 and 3 without exit 1. Then the two sequences {α_L} and {α_R} are adjacent and have a common limit ᾱ. Passing to the limit in 2 and 3(a) yields
    f(ᾱ) = f(0) + c₀ ᾱ f'(0).        (2.16)
By test 2, no α_R can be equal to ᾱ: indeed, 2 means that α_R > ᾱ. Using that f'(0) < 0 and (2.16), test 2 can be put in the form
    f(α_R) > f(0) + c₀ ᾱ f'(0) + c₀ (α_R − ᾱ) f'(0) = f(ᾱ) − c₀ |f'(0)| (α_R − ᾱ).
Hence
    f(α_R) − f(ᾱ) > −c₀ |f'(0)| (α_R − ᾱ).
Divide both sides by α_R − ᾱ > 0:
    ( f(α_R) − f(ᾱ) ) / (α_R − ᾱ) > −c₀ |f'(0)|.
For α_R → ᾱ, the above inequality yields f'(ᾱ) ≥ −c₀ |f'(0)| > −c₁ |f'(0)|. On the other hand, test 3(b) for α_L → ᾱ entails f'(ᾱ) ≤ −c₁ |f'(0)|. The last two inequalities are contradictory.
Nor is it possible to alternate the remaining tests forever. Hence the number of line-search iterations is finite. □
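A minimal sketch of a line search driven by the three Wolfe tests above; the bisection/doubling rules used to produce the new trial α and all numerical constants are illustrative assumptions, not the course's prescription.

# Python sketch: Wolfe line search on the merit function f(a) = F(u - a*d).
import numpy as np

def wolfe_line_search(f, fprime, c0=1e-4, c1=0.9, a=1.0, max_iter=50):
    """Return a step a with f(a) <= f(0) + c0*a*f'(0) and f'(a) >= c1*f'(0)."""
    f0, g0 = f(0.0), fprime(0.0)           # g0 < 0 on a descent direction
    aL, aR = 0.0, np.inf
    for _ in range(max_iter):
        if f(a) > f0 + c0 * a * g0:        # test 2: decrease fails -> shrink from the right
            aR = a
        elif fprime(a) < c1 * g0:          # test 3: slope too negative -> push from the left
            aL = a
        else:                              # test 1: both Wolfe conditions hold
            return a
        a = 0.5 * (aL + aR) if np.isfinite(aR) else 2.0 * aL   # interpolate or extrapolate
    return a

# Example: F(u) = 0.5*||u||^2 along d = grad F(u) at u = (3, -1) (illustrative).
u = np.array([3.0, -1.0]); d = u.copy()
f = lambda a: 0.5 * np.dot(u - a * d, u - a * d)
fprime = lambda a: -np.dot(u - a * d, d)
print(wolfe_line_search(f, fprime))        # returns the optimal step a = 1 here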

Wolfe's line search can be combined with any kind of descent direction d. However, line search is helpless if d_k is too orthogonal to ∇F(u_k). The angle θ_k between the direction and the gradient is crucial. Put
    cos θ_k = <∇F(u_k), d_k> / ( ‖∇F(u_k)‖ ‖d_k‖ ).        (2.17)
We can say that d_k is a definite descent direction if cos θ_k > 0 is large enough.

Theorem 19 If ∇F(u) is Lipschitz-continuous with constant ℓ on {u : F(u) ≤ F(u₀)} and the minimization algorithm uses Wolfe's rule, then
    r (cos θ_k)² ‖∇F(u_k)‖² ≤ F(u_k) − F(u_{k+1}),  ∀k ∈ N,        (2.18)
where the constant r > 0 is independent of k.

Proof. Note that condition 1 in Wolfe's rule corresponds to a descent scenario. By the expression for cos θ_k in (2.17) and using that u_{k+1} = u_k − α_k d_k, we get
    α_k f'(0) = −α_k <∇F(u_k), d_k> = −cos(θ_k) ‖∇F(u_k)‖ α_k ‖d_k‖ = −cos(θ_k) ‖∇F(u_k)‖ ‖u_k − u_{k+1}‖ < 0.
Combining this result with condition 1(a) yields
    0 < c₀ α_k |f'(0)| = c₀ cos(θ_k) ‖∇F(u_k)‖ ‖u_k − u_{k+1}‖ ≤ f(0) − f(α_k) = F(u_k) − F(u_{k+1}).        (2.19)
Note that F(u_k) − F(u_{k+1}) is bounded from below by a positive number (descent is guaranteed by test 1). Reminding that f(α_k) = F(u_k − α_k d_k) = F(u_{k+1}), condition 1(b) reads
    f'(α_k) = −<∇F(u_{k+1}), d_k> ≥ c₁ f'(0) = −c₁ <∇F(u_k), d_k>.
Add <∇F(u_k), d_k> to both sides of the above inequality:
    <∇F(u_k) − ∇F(u_{k+1}), d_k> ≥ (1 − c₁) <∇F(u_k), d_k>.
Using Schwarz's inequality and the definition of cos(θ_k),
    ‖∇F(u_{k+1}) − ∇F(u_k)‖ ‖d_k‖ ≥ (1 − c₁) cos(θ_k) ‖∇F(u_k)‖ ‖d_k‖.
Division of both sides by ‖d_k‖ ≠ 0 leads to
    ‖∇F(u_{k+1}) − ∇F(u_k)‖ ≥ (1 − c₁) cos(θ_k) ‖∇F(u_k)‖.
Combining this result with the Lipschitz property, i.e. ‖∇F(u_{k+1}) − ∇F(u_k)‖ ≤ ℓ‖u_{k+1} − u_k‖, yields
    ℓ‖u_{k+1} − u_k‖ ≥ (1 − c₁) cos(θ_k) ‖∇F(u_k)‖.
Multiplying both sides by cos(θ_k)‖∇F(u_k)‖ and using (2.19) leads to
    (1 − c₁)(cos θ_k)² ‖∇F(u_k)‖² ≤ ℓ cos(θ_k)‖∇F(u_k)‖ ‖u_{k+1} − u_k‖ ≤ (ℓ/c₀)( F(u_k) − F(u_{k+1}) ).
For
    r = c₀(1 − c₁)/ℓ > 0
we obtain
    r (cos θ_k)² ‖∇F(u_k)‖² ≤ F(u_k) − F(u_{k+1}).
The proof is complete. □

Why do we use only test 1 in the proof?

Theorem 20 Let F ∈ C¹ be bounded from below. Assume that the iterative scheme is such that (2.18), namely
    r (cos θ_k)² ‖∇F(u_k)‖² ≤ F(u_k) − F(u_{k+1}),  ∀k ∈ N,
holds true for a constant r > 0 independent of k. If the series
    Σ_{k=0}^∞ (cos θ_k)²   diverges,        (2.20)
then lim_{k→∞} ‖∇F(u_k)‖ = 0.

Proof. Since (F(u_k))_{k∈N} is decreasing and bounded from below, there is a finite number F̄ ∈ R such that F(u_k) ≥ F̄, ∀k ∈ N. Then (2.18) yields
    Σ_{k=0}^∞ (cos θ_k)² ‖∇F(u_k)‖² ≤ (1/r) Σ_{k=0}^∞ ( F(u_k) − F(u_{k+1}) ) ≤ (1/r)( F(u₀) − F̄ ) < +∞.        (2.21)
If there was ε > 0 such that ‖∇F(u_k)‖ ≥ ε, ∀k ≥ 0, then (2.20) would yield
    Σ_{k=0}^∞ (cos θ_k)² ‖∇F(u_k)‖² ≥ ε² Σ_{k=0}^∞ (cos θ_k)² = +∞,
which contradicts (2.21). Hence the conclusion. □

Property (2.20) depends on the way the direction is computed.

Remark 8 Wolfe's rule needs to compute the value of f'(α) = −<∇F(u_k − αd_k), d_k> at each cycle of the line-search. Computing ∇F(u_k − αd_k) is costly when compared to the computation of F(u_k − αd_k). Other line-search methods avoid the computation of ∇F(u_k − αd_k).

Other line-searches

We can evoke e.g. the Goldstein and Price method and Armijo's method. More details: e.g. [69, 14].

Armijo's method
Choose c ∈ (0, 1). Tests for α:
1. f(α) ≤ f(0) + cα f'(0)   ⇒   terminate;
2. f(α) > f(0) + cα f'(0)   ⇒   set α_R = α.
It is easy to see that this line-search terminates as well. The interest of the method is that it does not need to compute f'(α). It ensures that α is not too large. However, it can be dangerous since it never increases α; thus it relies on the initial step-size α₀.
In practice it has often turned out that taking a good constant step-size yields faster convergence.
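A minimal backtracking sketch of Armijo's rule (the halving factor and the constants are illustrative assumptions): as noted above, the step is only ever decreased.

# Python sketch: Armijo backtracking on f(a) = F(u - a*d), with slope0 = f'(0) < 0.
def armijo(f, slope0, a0=1.0, c=0.1, shrink=0.5, max_iter=50):
    f0, a = f(0.0), a0
    for _ in range(max_iter):
        if f(a) <= f0 + c * a * slope0:    # test 1: sufficient decrease -> terminate
            return a
        a *= shrink                        # test 2: a too large -> reduce it (never increased)
    return a

# Example on the 1D merit function f(a) = (1 - a)**2 with f'(0) = -2 (illustrative).
print(armijo(lambda a: (1.0 - a) ** 2, slope0=-2.0, a0=4.0))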

2.5 Second-order methods

Based on F(u_{k+1}) ≈ F(u_k) + <∇F(u_k), u_{k+1} − u_k> + (1/2)<∇²F(u_k)(u_{k+1} − u_k), u_{k+1} − u_k>.
Crucial approach to derive a descent direction at each iteration of a scheme minimizing a smooth objective F.

2.5.1 Newton's method

Example 3 Let F : R → R; find û such that F'(û) = ∇F(û) = 0. We choose u_{k+1} such that
    ∇F(u_k) / (u_k − u_{k+1}) = ∇²F(u_k) = F''(u_k).

Newton's method:
    u_{k+1} = u_k − ( ∇²F(u_k) )^{-1} ∇F(u_k)        (2.22)

Theorem 21 Let V be a Hilbert space. Assume that ∇²F is continuous and invertible near û, say on O ∋ û, and that⁴ ‖(∇²F(u))^{-1}‖_{L(V)} is bounded above on O. Then Newton's method converges towards û and the convergence is Q-superlinear.

⁴ Remind that L(V) ≡ L(V, V) and that for B ∈ L(V) we have ‖B‖_{L(V)} = sup_{u∈V\{0}} ‖Bu‖_V / ‖u‖_V, where ‖.‖_V is the norm on V. If V = R^n, ‖B‖_{L(V)} is just the induced matrix norm.

Proof. Using a Taylor expansion of ∇F about u_k:
    0 = ∇F(û) = ∇F(u_k) + ∇²F(u_k)(û − u_k) + ‖û − u_k‖ ε(û − u_k),
where ε(û − u_k) → 0 as ‖û − u_k‖ → 0. By Newton's method (2.22),
    ∇F(u_k) + ∇²F(u_k)(u_{k+1} − u_k) = 0.
Subtracting this equation from the previous one yields
    ∇²F(u_k)(û − u_{k+1}) + ‖û − u_k‖ ε(û − u_k) = 0.
Since ∇²F(u_k) is invertible, the latter equation is equivalent to
    û − u_{k+1} = −( ∇²F(u_k) )^{-1} ‖û − u_k‖ ε(û − u_k).
From the assumptions, the constant M below is finite:
    M := sup_{u∈O(û)} ‖( ∇²F(u) )^{-1}‖_{L(V)}.
Then
    ‖u_{k+1} − û‖ ≤ M ‖û − u_k‖ ‖ε(û − u_k)‖
and Q_k = ‖u_{k+1} − û‖ / ‖u_k − û‖ ≤ M ‖ε(û − u_k)‖ → 0 as ‖û − u_k‖ → 0. Hence the Q-superlinearity of the convergence. □

Comments
- Under the conditions of the theorem, if in addition F is C³, then convergence is Q-quadratic.
- Direction and step-size are automatically chosen ( −(∇²F(u_k))^{-1}∇F(u_k) );
- If F is nonconvex, −(∇²F(u_k))^{-1}∇F(u_k) may not be a descent direction, hence the û found in this way is not necessarily a minimizer of F;
- Convergence is very fast;
- Computing (∇²F(u_k))^{-1} can be difficult, i.e. solving the equation ∇²F(u_k)z = ∇F(u_k) with respect to z can require a considerable computational effort;
- Numerically it can be very unstable, thus entailing a violent divergence;
- Preconditioning can help if ∇²F(u_k) has a favorable structure, see [19];
- In practice, efficient to get a high-precision û if we are close enough to the sought-after û.
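A minimal sketch of Newton's iteration (2.22) on a smooth convex functional of the regularized form used later in (2.31), with φ(t) = √(1 + t²); the data A, v, the weight beta and the 1D difference operator D are illustrative assumptions. Each step solves ∇²F(u_k) z = ∇F(u_k).

# Python sketch: Newton's method u_{k+1} = u_k - (Hess F(u_k))^{-1} grad F(u_k)
# applied to F(u) = ||A u - v||^2 + beta * sum_i phi(D_i u), phi(t) = sqrt(1 + t^2).
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 20
A = rng.standard_normal((m, n))                  # illustrative data (assumption)
v = rng.standard_normal(m)
beta = 0.5
D = np.eye(n)[:-1] - np.eye(n, k=1)[:-1]         # first-order differences, (n-1) x n

def grad(u):
    t = D @ u
    return 2 * A.T @ (A @ u - v) + beta * D.T @ (t / np.sqrt(1 + t**2))

def hess(u):
    t = D @ u
    w = (1 + t**2) ** (-1.5)                     # phi''(t)
    return 2 * A.T @ A + beta * D.T @ (w[:, None] * D)

u = np.zeros(n)
for k in range(15):
    u = u - np.linalg.solve(hess(u), grad(u))    # Newton step

print(np.linalg.norm(grad(u)))                   # ~0 at the minimizer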

2.5.2 General quasi-Newton methods

Approximation: H_k^{-1}(u_k) ≈ ( ∇²F(u_k) )^{-1},
    u_{k+1} = u_k − H_k^{-1}(u_k) ∇F(u_k).        (2.23)
It is possible to keep H_k^{-1}(u_k) constant for several consecutive iterations.

Sufficient conditions for convergence:

Theorem 22 F : U ⊂ V → R twice differentiable in O(U), V complete n.v.s. Suppose there exist constants r > 0, M > 0 and γ ∈ ]0, 1[ such that B := B̄(u₀, r) ⊂ U and
1. sup_{k∈N} sup_{u∈B} ‖H_k^{-1}(u)‖_{L(V)} ≤ M;
2. sup_{k∈N} sup_{u,v∈B} ‖∇²F(u) − H_k(v)‖_{L(V)} ≤ γ/M;
3. ‖∇F(u₀)‖_V ≤ (r/M)(1 − γ).
Then the sequence generated by (2.23) satisfies:
(i) u_k ∈ B, ∀k ∈ N;
(ii) lim_{k→∞} u_k = û; moreover, û is the unique zero of ∇F on B;
(iii) the convergence is geometric, with
    ‖u_k − û‖_V ≤ ( γ^k/(1 − γ) ) ‖u₁ − u₀‖_V.

Proof. We will use that iteration (2.23) is equivalent to
    H_k(u_k)( u_{k+1} − u_k ) + ∇F(u_k) = 0,  ∀k ∈ N.        (2.24)
There are several steps.
(a) We show 3 preliminary results:
    ‖u_{k+1} − u_k‖ ≤ M‖∇F(u_k)‖, ∀k ∈ N;        (2.25)
    u_{k+1} ∈ B, ∀k ∈ N;        (2.26)
    ‖∇F(u_{k+1})‖ ≤ (γ/M)‖u_{k+1} − u_k‖.        (2.27)
We start with k = 0. Iteration (2.23) yields (2.25) and (2.26):
    ‖u₁ − u₀‖ ≤ ‖H₀^{-1}(u₀)‖ ‖∇F(u₀)‖ ≤ M ‖∇F(u₀)‖ ≤ M (r/M)(1 − γ) < r  ⇒  u₁ ∈ B.
Using (2.24) for k = 0,
    ∇F(u₁) = ∇F(u₁) − ∇F(u₀) − H₀(u₀)( u₁ − u₀ ).
Consider the application u ↦ ∇F(u) − H₀(u₀)u. Its gradient is ∇²F(u) − H₀(u₀). By the generalized mean-value theorem⁵ and assumption 2 we obtain (2.27) for k = 0:
    ‖∇F(u₁)‖ ≤ sup_{u∈B} ‖∇²F(u) − H₀(u₀)‖ ‖u₁ − u₀‖ ≤ (γ/M)‖u₁ − u₀‖.
Suppose that (2.25)-(2.27) hold till k − 1 inclusive, which means we have
    ‖u_k − u_{k−1}‖ ≤ M‖∇F(u_{k−1})‖;   u_k ∈ B;   ‖∇F(u_k)‖ ≤ (γ/M)‖u_k − u_{k−1}‖.        (2.28)
We check that they hold for k as well. Using (2.28) and assumption 1, iteration (2.23) yields (2.25):
    ‖u_{k+1} − u_k‖ ≤ ‖H_k^{-1}(u_k)‖ ‖∇F(u_k)‖ ≤ M‖∇F(u_k)‖ ≤ M (γ/M)‖u_k − u_{k−1}‖ = γ‖u_k − u_{k−1}‖.
Thus we have established that
    ‖u_{k+1} − u_k‖ ≤ γ‖u_k − u_{k−1}‖ ≤ ... ≤ γ^k ‖u₁ − u₀‖.        (2.29)
Using the triangle inequality, (2.25) for k = 0 and assumption 3, we get (2.26) for the actual k:
    ‖u_{k+1} − u₀‖ ≤ Σ_{i=0}^k ‖u_{i+1} − u_i‖ ≤ Σ_{i=0}^∞ γ^i ‖u₁ − u₀‖ = (1/(1 − γ))‖u₁ − u₀‖ ≤ M‖∇F(u₀)‖/(1 − γ) ≤ r
    ⇒ u_{k+1} ∈ B.
Using (2.24) yet again,
    ∇F(u_{k+1}) = ∇F(u_{k+1}) − ∇F(u_k) − H_k(u_k)( u_{k+1} − u_k ).
Applying in a similar way the generalized mean-value theorem to the application u ↦ ∇F(u) − H_k(u_k)u and using assumption 2 entails (2.27):
    ‖∇F(u_{k+1})‖ ≤ sup_{u∈B} ‖∇²F(u) − H_k(u_k)‖ ‖u_{k+1} − u_k‖ ≤ (γ/M)‖u_{k+1} − u_k‖.

⁵ Generalized mean-value theorem. Let f : U ⊂ V → W and a ∈ U, b ∈ U be such that the segment [a, b] ⊂ U. Assume f is continuous on [a, b] and differentiable on ]a, b[. Then
    ‖f(b) − f(a)‖_W ≤ sup_{u∈]a,b[} ‖∇f(u)‖_{L(V,W)} ‖b − a‖_V.

(b) Let us prove the existence of a zero of ∇F in B. Using (2.29), ∀k ∈ N, ∀j ∈ N we have
    ‖u_{k+j} − u_k‖ ≤ Σ_{i=0}^{j−1} ‖u_{k+i+1} − u_{k+i}‖ ≤ Σ_{i=0}^{j−1} γ^{k+i} ‖u₁ − u₀‖ ≤ ( γ^k/(1 − γ) )‖u₁ − u₀‖,        (2.30)
hence (u_k)_{k∈N} is a Cauchy sequence. The latter, combined with the fact that B ⊂ V is complete, shows that ∃û ∈ B such that lim_{k→∞} u_k = û.
(c) Since ∇F is continuous on O(U) ⊃ B and using (2.27), we obtain the existence result:
    ‖∇F(û)‖ = lim_{k→∞} ‖∇F(u_k)‖ ≤ (γ/M) lim_{k→∞} ‖u_{k+1} − u_k‖ = 0.
(d) Uniqueness of û in B. Suppose ∃ū ∈ B, ū ≠ û, such that ∇F(ū) = 0 = ∇F(û). We can hence write
    ū − û = −H₀^{-1}(u₀)( ∇F(ū) − ∇F(û) − H₀(u₀)(ū − û) )
    ⇒ ‖ū − û‖ ≤ ‖H₀^{-1}(u₀)‖ sup_{u∈B} ‖∇²F(u) − H₀(u₀)‖ ‖ū − û‖ ≤ M (γ/M) ‖ū − û‖ = γ‖ū − û‖ < ‖ū − û‖.
This is impossible, hence û is unique in B.
(e) Geometric convergence: using (2.30),
    ‖û − u_k‖ = lim_{j→∞} ‖u_{k+j} − u_k‖ ≤ ( γ^k/(1 − γ) )‖u₁ − u₀‖. □

Iteration (2.23) is quite general. In particular, Newton's method (2.22) corresponds to H_k(u_k) = ∇²F(u_k), while (variable) stepsize gradient descent corresponds to H_k(u_k) = (1/α_k) I.

2.5.3 Generalized Weiszfeld's method (1937)

It is a quasi-Newton method (with linearization of the gradient); see [89, 86, 23].
Assumptions: F : R^n → R is twice continuously differentiable, strictly convex, bounded from below and coercive. (F can be constrained to U, a nonempty convex polyhedral set.)
F is approximated from above by a quadratic function F̃ of the form
    F̃(u, v) = F(v) + <u − v, ∇F(v)> + (1/2)<u − v, H(v)(u − v)>,
under the assumptions that, ∀u ∈ R^n,
- F̃(u, v) ≥ F(u), for any fixed v;
- H : R^n → R^{n×n} is continuous and symmetric;
- 0 < ν₀ ≤ λ_i(H(u)) ≤ ν₁ < ∞, 1 ≤ i ≤ n.
Note that F̃(u, u) = F(u). The iteration for this method reads
    u_{k+1} = arg min_u F̃(u, u_k),  k ∈ N.
For u_k fixed, F̃ is C², coercive, bounded below and strictly convex. The minimizer u_{k+1} exists and satisfies the quasi-Newton linear equation
    0 = ∇₁F̃(u_{k+1}, u_k) = ∇F(u_k) + H(u_k)(u_{k+1} − u_k).
Here H(.) can be seen as an approximation of ∇²F(.).

2.5.4 Half-quadratic regularization

Minimize F : R^n → R of the form
    F(u) = ‖Au − v‖₂² + Σ_{i=1}^r φ(‖D_i u‖₂),        (2.31)
where D_i ∈ R^{s×n} for s ∈ {1, 2}: s = 2, e.g. discrete gradients, see (1.8); s = 1, e.g. finite differences, see (1.9), p. 7. Let us denote by D ∈ R^{rs×n} the matrix obtained by stacking D₁, D₂, ..., D_r. Assumption: ker(AᵀA) ∩ ker(DᵀD) = {0}.
Furthermore, φ : R_+ → R is a convex edge-preserving potential function (see [7]), e.g.:

    φ(t)                                      φ'(t)                               φ''(t)
    √(α + t²)                                 t/√(α + t²)                         α(α + t²)^{-3/2}
    α log cosh(t/α)                           tanh(t/α)                           (1/α)(1 − tanh²(t/α))
    |t| − α log(1 + |t|/α)                    t/(α + |t|)                         α/(α + |t|)²
    t²/2 if |t| ≤ α, α|t| − α²/2 if |t| > α   t if |t| ≤ α, α sign(t) if |t| > α  1 if |t| ≤ α, 0 if |t| > α
    |t|^α, 1 < α ≤ 2                          α|t|^{α−1} sign(t)                  α(α − 1)|t|^{α−2}

where α > 0 is a parameter. Note that F is differentiable.

However, φ' is (nearly) bounded, so φ'' is close to zero on large regions; if A is not well conditioned (a common case in practice), ∇²F has nearly null regions and the convergence of usual methods can be extremely slow. Newton's method reads u_{k+1} = u_k − (∇²F(u_k))^{-1}∇F(u_k). E.g., for s = 1 we have D_i u ∈ R, so ‖D_i u‖₂ = |D_i u| and
    ∇²F(u) = 2AᵀA + Σ_{i=1}^r φ''(|D_i u|) D_iᵀ D_i.
For the class of F considered here, ∇²F(u) typically has a bad condition number, so (∇²F(u))^{-1} is difficult (unstable) or impossible to obtain numerically. Newton's method is practically unfeasible.

Half-quadratic regularization: two forms, multiplicative (×) and additive (+), introduced in [42] and [43], respectively. Main idea: using an auxiliary variable b, construct an augmented criterion F̃(u, b) which is quadratic in u and separable in b, such that
    (û, b̂) = arg min_{u,b} F̃(u, b)   ⇒   û = arg min_u F(u).
Alternate minimization: ∀k ∈ N,
    u_{k+1} = arg min_u F̃(u, b_k),
    b_{k+1}[i] = arg min_b F̃(u_{k+1}, b),  1 ≤ i ≤ r.
The construction of these augmented criteria relies on Theorem 4 (Fenchel-Moreau, p. 15).

Multiplicative form.
Usual assumptions on φ in order to guarantee global convergence:
(a) t ↦ φ(t) is convex and increasing on R_+, φ ≥ 0 and φ(0) = 0;
(b) t ↦ φ(√t) is concave on R_+;
(c) φ is C¹ on R_+ and φ'(0) = 0;
(d) φ''(0⁺) > 0;
(e) lim_{t→∞} φ(t)/t² = 0.

Proposition 1 We have
    φ(t) = min_{b≥0} ( (1/2) t² b + ψ(b) )   where   ψ(b) = sup_{t∈R} ( −(1/2) b t² + φ(t) ).
Moreover,
    b̄ := φ'(t)/t if t > 0,   b̄ := φ''(0⁺) if t = 0,        (2.32)
is the unique point yielding φ(t) = (1/2) t² b̄ + ψ(b̄).
The proof of the proposition is outlined in the Appendix, p. 106.

Based on Proposition 1 we minimize w.r.t. (u, b) an F̃ of the form given below:
    F̃(u, b) = ‖Au − v‖₂² + Σ_{i=1}^r ( (b[i]/2) ‖D_i u‖₂² + ψ(b[i]) ),
and the iterates read, for k ∈ N:
    u_{k+1} = ( H(b_k) )^{-1} 2Aᵀv,   where   H(b) = 2AᵀA + Σ_{i=1}^r b[i] D_iᵀ D_i,
    b_{k+1}[i] = φ'(‖D_i u_{k+1}‖₂) / ‖D_i u_{k+1}‖₂,  1 ≤ i ≤ r.

Comments
- References [24, 34, 47, 7, 68, 1].
- Interpretation: b̂ is an edge variable (≈ 0 at a discontinuity).
- u_k can be computed efficiently using CG (see 2.6).
- F̃ is nonconvex w.r.t. (u, b). However, under (a)-(e) the minimum is unique and convergence is ensured, u_k → û (see [68, 1]).
- Quasi-Newton: u_{k+1} = u_k − (H̃(u_k))^{-1}∇F(u_k), where H̃(u) = H(b(u)) with b(u)[i] = φ'(‖D_i u‖₂)/‖D_i u‖₂. Compare H̃ with ∇²F in Newton.
- The iterates amount to the relaxed fixed point iteration described next. In order to solve the equation ∇F(û) = 0, we write
    ∇F(u) = L(u)u − z,
where z is independent of u. At each iteration k, one finds u_{k+1} by solving the linear problem
    L(u_k)u_{k+1} = z.
In words, ∇F is linearized at each iteration (see [67] for the proof).

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

39

Additive form.
Usual assumptions on to ensure convergence:
(a)
(b)
(c)
(d)

t (t) is convex and increasing on R+ ,  0 and (0) = 0


t t2 /2 (t) is convex,
 (0) = 0,
lim (t)/t2 < 1/2.
t

Notice that is dierentiable on R+ . Indeed, (a) implies that  (t )  (t+ ), for all t. If for some t
the latter inequality is strict, (b) cannot be satised. If follows that  (t ) =  (t+ ) =  (t).
In order to avoid additional notations, we write (.) and (.) next.
Proposition 2 For . = .2 and b Rs , t Rs , s {1, 2}, we have the equivalence:




1
1
2
2
t b + (b)
(t) = mins
(b) = maxs b t + (t)
tR
bR
2
2

(2.33)

t
b def
is the unique point yielding (t) = 12 t b2 + (b).
= t  (t)
t


Proof. By (b) and (d), 12 t2 (t) is convex and coercive, so the maximum of is reached.
The formula for in (2.33) is equivalent to



1 2
1
2
(b) + b = maxs b, t
t (t)
(2.34)
tR
2
2
Moreover,

The term on the left is the convex conjugate of 12 t2 (t), see (1.10), p. 14. The latter term being
convex continuous and  , using convex bi-conjugacy (Theorem 4, p. 15), (2.34) is equivalent to



1 2
1
2
(2.35)
t (t) = sup b, t (b) + b
2
2
bRs
Noticing that the supremum in (2.35) is reached, (2.35) is equivalent to






1
1
1 2
2
2
(t) = mins b, t + (b) + b + t = mins
t b + (b)
bR
bR
2
2
2
Hence (2.33) is proven.
Using (2.33), the maximum of and the minimum of are reached by a pair (t, b) such that
b t +  (t)

t
= 0.
t

For t xed, the solution for b, denoted b is unique and is as stated in the proposition.
Based on Proposition 2, we consider
F (u, b) = Au

v22

r 

1
i=1

and the iterations read, k N,



uk+1 = H

2A v +

r



Di u

bi 2s

+ (bi ) , bi Rs


DTi bik

H = 2A A +
T

i=1

bik+1

Di uk
= Di uk  (Di uk 2 )
, 1 i r.
Di uk 2

r

i=1

DTi Di

(2.36)

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

40

Comments
References: [8, 47, 68, 1].
Quasi-Newton :

uk+1 = uk H1 F (u) (compare with 2 F )

Convergence: uk u.
Comparison between the two forms. See [68].
 -form: less iterations but each iteration is expensive;
+  -form: more iterations but each iteration is cheaper and it is possible to precondition H.
For both forms, convergence is faster than Gradient, DFP and BFGC ( 2.5.5), and non-linear
CG ( 2.6.2) for functions of the form (2.31).

2.5.5

Standard quasi-Newton methods

"
References: [14, 54, 69]. Here F : Rn R with u = u, u, u Rn .

1
 0, k N.
Hk 2 F (uk )  0 and Mk 2 F (uk )
k

def

= uk+1 uk

def

gk = F (uk+1) F (uk )
Denition 14 Quasi-Newton (or secant) equation:
Mk+1 gk = k

(2.37)

Interpretation: the mean value H of 2 F between uk and uk+1 satises gk = Hk . So (2.37) forces
Mk+1 to have the same action as H 1 on gk (subspace of dimension one).
There are innitely many quasi-Newton matrices satisfying Denition 14.
General quasi-Newton scheme
0. Fix u0 , tolerance  0, M0  0 such that M0 = M0T . If F (u0) stop, else go to 1.
For k = 1, 2, . . ., do:
1. dk = Mk F (uk )
2. Line search along dk to nd k (e.g. Wolfe, 0 = 1)
3. uk+1 = uk k dk
4. if F (uk+1) stop else go to step 5
T
and (2.37) holds; then loop to step 1
5. Mk+1 = Mk + Ck  0 so that Mk+1 = Mk+1

Ck  0 is a correction matrix. For stability and simplicity, it should be minimal in some sense.
Various choices for Ck can be done.

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

41

DFP (Davidon-Fletcher-Powel)
Historically the rst (1959). Apply the General quasi-Newton scheme along with
Mk+1 = Mk + Ck
k Tk
Mk gk gkT Mk
Ck =

gkT k
gkT Mk gk
Each of the matrices composing Ck is of rank 1, hence rank Ck 2, k N. Note that Ck = CkT .
Mk+1 is symmetric and satises the quasi-Newton equation (2.37):
Mk+1 gk = Mk gk + k

Tk gk
gkT Mk gk

M
g
= Mk gk + k Mk gk = k
k
k
gkT k
gkT Mk gk

BFGC (Broyden-Fletcher-Goldfarb-Shanno)
Proposed in 1970. Apply the General quasi-Newton scheme along with
Mk+1 = Mk + Ck


k gkT Mk + Mk gk Tk
gkT Mk gk k Tk
Ck =
+ 1+ T
gkT k
gk k
gkT k
Obviously Mk+1 satises (2.37) as well. BFGC is often preferred to DFP.
k where Hk must satisfy the so called
Comment: One can approximate 2 F using Hk+1 = Hk + C
dual quasi-Newton equation, Hk+1 k = gk , k (see [14, p. 55]).
Theorem 23 Let M0  0 (resp. H0  0). Then gkT k > 0 is a necessary and sucient condition for
DFP and BFGS formulae to give Mk  0 (resp. Hk  0), k N.
Proofsee [14, p. 56.]
Remark 9 It can be shown that DFP and BFGC formulae are mutually dual. See [14, p. 56].
Theorem 24 Let F be convex, bounded from below and F Lipschitzian on {u : F (u) F (u0)}.
Then the BFGS algorithm with Wolfes line-search and M0  0, M0 = M0T yields
lim inf |F (uk )| = 0.
k

Proofsee [14, p. 58.]


Locally - Q superlinear convergence;
Drawback: we have to store n n matrices;
DFP, BFGC available in Matlab Optimization toolbox.
Recently the BFGS minimization method was extended in [49] to handle nonsmooth, not necessarily
convex problems. The Matlab package HANSO developed by the authors is freely available6.
6

http://www.cs.nyu.edu/overton/software/hanso/

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

2.6

42

Conjugate gradient method (CG)

Due to Hestenes and Stiefel, 1952, still contains a lot of material for research.
Preconditioned CG is an important technique to solve large systems of equations.

2.6.1

Quadratic strongly convex functionals (Linear CG)

For B = B T and B  0, powerful method to minimize F given below for n very large
F (u) =

1
Bu, u c, u ,
2

u Rn

or equivalently to solve Bu = c. (Remind that B  0 implies min (B) > 0.) Then F is strongly
convex. By the denition of F , we have a very useful relation:
F (u v) = B(u v) c = F (u) Bv, u, v Rn .

(2.38)

Main idea : at each iteration, compute uk+1 such that


F (uk+1) =

inf

uuk +Hk

F (u)

Hk = Span{F (ui ), 0 i k}
uk+1 minimizes F over an ane subspace (and not only along one direction, as in Gradient methods.)
Theorem 25 The CG method converges after n iterations at most and it provides the unique exact
minimiser u obeying F (
u) = minn F (u).
uR

Proof. Dene the subspace


+
Hk =

g() =

k


,
[i]F (ui ) : [i] R, 0 i k

(2.39)

i=0



def
Hk is a closed convex set. Set f () = F uk g() . Note that f is convex and coercive.


F (uk+1) = inf F uk g() = inf f () = f (k ).
Rk+1

Rk+1

k is the unique solution of f () = 0. By (2.39), g()/[i] = F (ui ). Then


f ()
= 0 = F (uk g()) , F (ui) = F (uk+1) , F (ui ) , 0 i k.
[i]

(2.40)

Hence for 0 k n 1 we have

F (uk+1) , F (ui ) = 0, 0 i k

(2.41)

F (uk+1) , u = 0, u Hk

(2.42)



It follows that F (ui ) are linearly independent. Conclusions about convergence:
If F (uk ) = 0 then u = uk (terminate).

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

43

If F (uk ) = 0 then dimHk = k + 1.


Suppose that F (un1 ) = 0, then Hn1 = Rn . By (2.42),
F (un ), u = 0, u Rn

F (un ) = 0

u = un .

The proof is complete.

Remark 10 In practice, numerical errors can require a higher number of iterations.


The derivation of the CG algorithm is outlined in Appendix on p. 107
CG Algorithm:
0. Initialization: d1 = 0, F (u1) = 1 and u0 Rn . Then for k = 0, 1, 2, do:
1. if F (uk ) = 0 then terminate; else go to step 2
F (uk )2
(where F (uk ) = Buk c, k)
F (uk1)2

2.

k =

3.

dk = F (uk ) + k dk1 (by step 0, we have d0 = F (u0))

4.

k =

5.

uk+1 = uk k dk ; then loop to step 1.

F (uk ), dk 
(where B = 2 F (uk ), k)
Bdk , dk 

Main Properties :
k, F (uk+1), F (ui ) = 0 if 0 i k
k, Bdk+1 , di  = 0 if 0 i k directions are conjugated w.r.t. 2 F = B
since B  0
F (un ) = 0.


are linearly independent where n
 n is such that
it follows that (dk )nk=1

CG does orthogonalization: D T BD is diagonal where D = [d1 , . . . , dn ] (the directions).


Theorem 26 If B has only r dierent eigenvalues, then the CG method converges after r iterations
at most.
Proofsee [69]. Hence faster convergence if the eigenvalues of B are concentrated.

2.6.2

Non-quadratic Functionals (non-linear CG)

References: [69, 14]


Nonlinear variants of the CG are well studied and have proved to be quite successful in practice.
However, in general there is no guarantee to converge in a nite number of iterations.
Main idea: cumulate past information when choosing the new descent direction.

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

44

Fletcher-Reeves method (FR)


step 0. Initialization: d1 = 0, F (u1 ) = 1 and u0 Rn .
k N:
1. Compute F (uk ); if F (uk ) then stop, otherwise go to step 2
2. k =

F (uk )2
F (uk1)2

3. dk = F (uk ) + k dk1
4. Test: if F (uk ), dk  < 0 (dk is not a descent direction) set dk = F (uk )
5. Line-search along dk to obtain k > 0
6.

uk+1 = uk k dk , then loop to step 1.

C. Comments (see [14, p. 72])


The new direction dk still involves memory from previous directions (Markovian property).
Conjugacy has little meaning in the non-quadratic case.
FR is heavily based on the locally quadratic shape of F (only step 4 is new w.r.t. CG).
If the line-search is exact, the new direction dk is a descent direction.
Polak-Ribi`
ere (PR) method
Remark 11 There are many variants of the FR method that dier from each other mainly in the
choice of the parameter k . An important variant is the PR method.
By the mean-value formula there is  ]0, k1 ] such that along the line Span(dk1) we have
F (uk ) = F (uk1) k1 Bk dk1
for Bk = 2 F (uk1 dk1 )

(2.43)
(2.44)

Note that Bk = BkT .


Main idea: choose a k in the FR method that conjugates dk and dk1 w.r.t. Bk , i.e. that yields
Bk dk1 , dk  = 0.
However, Bk is unknown. By way of compromise, the result given below will be used.
Lemma 2 Let Bk be given by (2.43)-(2.44) and consider the FR methods where
(i) k =

F (uk ) F (uk1), F (uk )


F (uk1)2

(ii) k1 and k2 are optimal.

(the PR formula)

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

45

Then Bk dk1 , dk  = 0.


Proof. By (2.43), we have Bk dk1 =

1
(F (uk )
k

F (uk1)). Then we can write







F (uk ) F (uk1) , F (uk )
F (uk ) F (uk1) , F (uk )
Bk dk1, F (uk )

 =
= 
Bk dk1, dk1 
F (uk ), dk1 + F (uk1), dk1
F (uk ) + F (uk1) , dk1
If k1 is optimal, i.e. if F (uk1 k dk1) = inf F (uk1 dk1 ), then
F (uk1 k1 dk1), dk1  = 0 = F (uk ), dk1
Then

Bk dk1, F (uk )


=
Bk dk1, dk1 





F (uk ) F (uk1) , F (uk )
F (uk1), dk1

If k2 is optimal as well, in the same way we have F (uk1), dk2 = 0. Using this result and the
formula for dk1 according to step 3 in PR, the denominator on the right hand side is
F (uk1), dk1 = F (uk1), F (uk1) + k1 dk2 = F (uk1)2 .
It follows that

Bk dk1 , F (uk )


= k
Bk dk1 , dk1

for k as in the Lemma. Using the expression for k established above, we nd:
Bk dk1 , dk  = Bk dk1 , F (uk ) + k dk1  = Bk dk1 , F (uk ) + k Bk dk1 , dk1
Bk dk1, F (uk )
Bk dk1 , dk1 = 0.
= Bk dk1 , F (uk )
Bk dk1, dk1 


Hence the result.

PR method
Apply the FR method, p. 44, where k in step 2 is replaced by the PR formula
k =

F (uk ) F (uk1), F (uk )


.
F (uk1)2

Comparison between FR and PR (see [14, p. 75])


FR converges globally (F convex, coercive).
There is a counter-example where PR does not converge
PR converges if F is locally strongly convex
PR converges much faster than FR (so the latter is rarely used in practice)
Relation with quasi-Newton

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

2.6.3

46

Condition number and preconditioning

Example 4 Consider the linear system (example due to R.S.Wilson):


Au = v

where

10
7
A=
8
7

7 8 7
5 6 5

6 10 9
5 9 10

u = A1 v = [1, 1, 1, 1 ]T

v = [32, 23, 33, 31 ]T

v = [32.1, 22.9, 33.1, 30.9]T u = A1 v = [9.2, 12.6, 4.5, 1.1]T


Denition 15 The condition number of an n n matrix A, denoted by cond(A), is dened by

def

cond(A) =

 1/2

max1in i (A)


.
min1in i (A)

Solving Au = v under perturbations corresponds to A(u+u) = v+v, then


Preconditioning of circulant matrices
An n n matrix C is said to be circulant if its (j + 1)th
0 j n 1:

c0 cn1 cn2 . . .
c1
c0 cn1 . . .
.
..
.
.
C= .
c1
c0

..
..
cn2
.
.
cn1 cn2 cn1 . . .

u
v
cond(A)
.
u
v

row is a cyclic right shift of the jth row,


c2
..

c1

c1
c2
..
.
cn1
c0

Note that C is determined by n components only, ci for 0 i n 1.


Circulant matrices are diagonalized by the discrete Fourier matrix F (see e.g. [33, p. 73]):


1
2i pq
def
F[p, q] = exp
, 0 p, q n 1, i = 1
n
n
We have
FCF = = diag(1 (C), , n (C)), where p (C) =
H

n1



cj exp

j=0


2i jp
, p = 0, . . . , n 1
n

where FH is the conjugate transpose of F. (C) can be obtained in O(n log n) operations using the
fast Fourier transform (FFT) of the rst column of C.
Note that F1 = FH . Solving
Cu = v
amounts to solve

u=
v where 
v = Fv; then u = FH u
.

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS

47

Here Fv and FH u
 are computed using the FFT and the inverse FFT, respectively.
For preconditioning of block-circulant matrices with applications to image restoration, see [43],
focus on p. 938.
Preconditioning
The speed of algorithms can be substantially improved using preconditioning.
References: [81, 19, 69, 20].
Instead of solving Bu = c, we solve the preconditioned system
P 1 Bu = P 1 c,
where the n n matrix P is called the preconditioner. The latter is chosen according to the following
criteria:
P should be constructed within O(n log n) operations;
P v = w should be solved in O(n log n) operations;
The eigenvalues of P 1B should be clustered (i.e. well concentrated near 1);
If B  0 then P 1 B  0.
Toeplitz Systems
An n n Toeplitz matrix A is dened using 2n 1 scalars, say ap , 1 n p n 1. It is
constant along its diagonals:

a1 . . . an+2 an+1
a0
a1
a0 a1
. . . an+2
.

..
..
.

.
.
a1
a0
A= .
(2.45)

.
.
..
..
an2
a1
an1 an2 . . .
a1
a0
Toeplitz matrices are important since they arise in deblurring, recursive ltering, auto-regressive
(AR) modeling and many others.
Circulant preconditioners
We emphasize that the use of circulant matrices as preconditioners for Toeplitz systems allows the
use of FFT throughout the computations, and FFT is highly parallelizable and has been eciently
implemented on multiprocessors.
Given an integer n, we denote by Cn the set of all circulant n n matrices.
Below we sketch several classical circulant preconditioners.
Strangs preconditioner [80]
It is the rst circulant preconditioner that was proposed in the literature. P is dened by to
be the matrix that copies the central diagonals of A and reects them around to complete the
circulant requirement. For A given by (2.45), the diagonals pj of the Strang preconditioner
P = [pm ]0m,<n are given by

0 < j n/2
aj
ajn n/2 < j < n
pj =

pn+j 0 < j < n

CHAPTER 2. UNCONSTRAINED DIFFERENTIABLE PROBLEMS


P (of size n n) satises

48

P A1 = min C A1 and P A = min C A .


CCn

CCn

T. Chans preconditioner for A as in (2.45) [21]


For A as in (2.45), P is dened by
P AF = min C AF
CCn

where .F denotes the Frobenius norm8 . The jth diagonals of P are shown to be equal to
 (nj)aj +jajn
0j<n
n
pj =
pn+j
0 < j < n
Preconditioners by embedding [17]
For A as in (2.45), let B be such that the matrix below is 2n 2n circulant:
!
A BH
B A
R. Chans circulant preconditioner P is dened as P = A + B.
Preconditioners by minimization of norms
Tyrtyshnikovs circulant preconditioner is dened by
Id P 1AF = min Id C 1 AF .
CCn

It has some nice properties and is called the superoptimal circulant preconditioner. See [84].
Other circulant preconditioners are derived from kernel functions.
Various types of preconditioners are known for dierent classes of matrices.

Remind that for an n n matrix B with elements B[i, j] (i for row, j for column)
B1
B

=
=

max

1jn

n



B[i, j]
i=1

n



B[i, j]
max

1in

j=1

The Frobenius norm of an n n matrix B is dened by


1/2
n 
n


2
B[i, j] .
BF =

i=1 j=1

Chapter 3
Constrained optimization
3.1

Preliminaries

References : [25, 26, 14], also [44, 55, 72].

3.2

Optimality conditions

Theorem 27 (Euler (in)equalities) Let V be a real n.v.s., U V convex, F : O(U) R proper (see
the denition on p. 10) and dierentiable at u U.
1. If F admits at u a minimum w.r.t. U, then
F (
u), u u 0, u U

(3.1)

2. Assume in addition that F is convex. Necessary and sucient condition (NSC) for a minimum:
F (
u) = min F (u)
uU

(3.1) holds.

Moreover, if F is strictly convex, the minimizer u is unique.


3. If U is open, then : (3.1) F (
u) = 0.
Proof. We have F (
u) F (
u + v) for any v satisfying u + v U. Consider an arbitrary v such that
u = u + v U.
Since U is convex,
[0, 1] u + (1 )
u = (
u + v) + (1 )
u = u + v U.
The rst-order expansion of F about u + v reads
u) = F (
u) , v + v(v).
F (
u + v) F (

(3.2)

Suppose that F (
u) , v < 0 and note that v is xed. Since (v) 0 as 0, we can nd


u) > F (
u + v) by (3.2), i.e. there
small enough such that F (
u) , v + v (v) < 0, hence F (
is no minimum at u. It follows that necessarily
F (
u) , u u = F (
u) , v  0.

CHAPTER 3. CONSTRAINED OPTIMIZATION

50

Statement 2. The necessity of (3.1) follows from statement 1. To show its suciency, combine
(3.1) with the convexity of F , Property 2, statement 1 (p. 21): F (u) F (
u) F (
u) , u u,
u U. If F is strictly convex, the uniqueness of u follows from (3.1) and Property 2-2, namely
F (u) F (
u) > F (
u) , u u, u U, u = u.
Statement 3 follows directly from statement 1.


3.2.1

Projection theorem

Theorem 28 (Projection) Let V be real Hilbert space, U V nonempty, convex and closed, and
"
u = u, u, u V . Given v V , the following statements are equivalent:
1.
u = U v U unique such that v u = inf v u, where U is the projection onto U;
uU

2. u U and v u, u u 0, u U.
Classical proof [77, p.391] or [52]. Shorter proof using Theorem 27see below.
Proof. If v U, then u = v and U = Id. Consider next v V \ U and
1
F (u) = v u2 .
2
F is clearly C , strictly convex and coercive. The projection u of v onto U solves the problem:
F (
u) = inf F (u) = min F (u).
uU

uU

By Theorem 27-2, such an u exists and it is unique; it satises F (


u) , u u 0, u U.
Noticing that
F (u) = u v,
we get

u v, u u 0, u U

v u, u u 0, u U.

Property 6 The application U : V U satises


1. v U v = 0 v U
2. U v1 U v2  v1 v2 , v1 , v2 V .
(U is uniformly continuous, Lipschitz)
3. U linear U is a sub-vector space. Then statement 2 reads (v U v)u, u U.
A typical constraint is U = {u Rn ; u[i] 0, i}
Then U is not a vector subspace, U is nonlinear

CHAPTER 3. CONSTRAINED OPTIMIZATION

3.3
3.3.1

51

General methods
Gauss-Seidel under Hyper-cube constraint

Consider the minimization of a coercive proper convex function F : Rn R under a hyper-cube


constraint:
U = {u Rn : ai u[i] bi , 1 i n} with ai [, [, bi ] , ], 1 i n.

(3.3)

Iterations: k N, i = 1, . . . , n
F (uk+1[1], . . . , uk+1 [i 1], uk+1[i], uk [i + 1], . . . , uk [n])
=

inf

ai bi

F (uk+1[1], . . . , uk+1 [i 1],

uk [i + 1], . . . , uk [n])

Theorem 29 If F is strongly convex and U is of the form (3.3), then (uk ) converges to the unique u
such that F (
u) = minuU F (u).
Remark 12 The method cannot be extended to a general U. E.g., consider F (u) = (u[1]2 + u[2]2 )
and U = {u R2 : u[1] + u[2] 2}, and initialize with u0 [1] = 1 or u0 [2] = 1.
The algorithm is blocked at the boundary of U
at a point dierent from the solution.

3.3.2

Gradient descent with projection and varying step-size

V Hilbert space, U V convex, closed, non empty, F : V R is convex and dierentiable in V . Here
"
again, u = u, u, u V .
Motivation:

u U and F (
u) = inf F (u)
uU

u U and F (
u), u u 0, u U, > 0 (the NSC for a minimum)
u U and 
u F (
u) u, u u 0, u U, > 0


u) , > 0.
u = U u F (
In words, u is the xed point of the application


G(u) = U u F (u) , where U the projection operator onto U
Theorem 30 Let F : V R be dierentiable in V and U V a nonempty convex and closed subset.
Suppose that and M such that 0 < < M, and
(i) F (u) F (v), u v u v2 , (u, v) V 2 (i.e. F is strongly convex);
(ii) F (u) F (v) Mu v, (u, v) V 2

CHAPTER 3. CONSTRAINED OPTIMIZATION

52

For U the projection operator onto U, consider




def
Gk (u) = U u k F (u)

k 2 + , k N
where
2
M #
M
$
and
0, 2
(xed)
M
uk+1 = Gk (uk ), k N.

(3.4)
(3.5)
(3.6)

Then (uk )kN in (3.6) converges to the unique minimizer u of F and


uk+1 u k u0 u,
%

where

2M 2

2
+ 1 < 1.
M2

(3.7)

Remark 13 If we x  0, is nearly optimal but the range for k is decreased; and vice-versa, for
1 2 2
but  1.
 M2 , the range for k is nearly maximal 0, M
2
Ideally, one would wish the largest range for k and the least value for ...
Proof. By its strongly convexity, F admits a unique minimizer. Let u V and v V .




Gk (u) Gk (v)2 = U u k F (u) U v k F (v) 2


u v k F (u) F (v) 2
= u v2 2k F (u) F (v), u v + 2k F (u) F (v)2
u v2 2k u v2 + 2k M 2 u v2
= (2k M 2 2k + 1) u v2
= f (k ) u v2 ,
where f is convex and quadratic and read
def

f () = 2 M 2 2 + 1.
Since 0 < < M, the discriminant of f is negative, 42 4M 2 < 0, hence
f () > 0, 0.
"
"
Then f () is real and positive. It is easy to check that f () is strictly convex on R+ when
0 < < M and that it admits a unique minimizer on R+ .
More precisely:
1
"
3 
f ()
f ()

"
2
=1
f (0) = f
M2
arg min

"

f () =

M2

CHAPTER 3. CONSTRAINED OPTIMIZATION

53

%

0< f
<1
M2
For any as given in (3.5), we check that
%
%
%

2
f ( 2 ) = f ( 2 + ) = 2M 2 2 + 1 = ,
M
M
M
where the last equality comes from (3.7). By (3.5) yet again,
2M 2

2
2
2
+
1
<

+ 1 1 < 1.
M2
M2 M2

Hence for any k as specied by (3.4)-(3.5)


"

f (k ) < 1.

It follows that Gk is a contraction, k N since Gk (u) Gk (v) u v, < 1, for any
u) = u for all k N. We can write down
(u, v) V 2 , hence Gk (
u) uk u k u0 u
uk+1 u = Gk (uk ) Gk (
where < 1 is given in (3.7).

Remark 14 If U = V then U = Id.


Remark 15 (About the Assumptions) If 2 F then 2 F  0 because F is strongly convex. Then
we have M = max (2 F ) = min (2 F ) > 0. Note that most usually, M > .

Constrained quadratic strictly convex problem: F (u) = Bu, u c, u, u Rn , B = B T ,


B  0 (hence min(B) > 0), U Rn nonempty, closed and convex. Using that F (u) = Bu c,
iterations are


uk+1 = U uk k (Buk c) , k N.
Since 2 F (u) = B, u Rn , Theorem 30 ensures convergence if
4
5
min (B)
min (B)
min(B)

2 k 
2 + , 0, 
2
max (B)
max (B)
max (B)

(3.8)


2
Drawback: min (B)/ max (B) can be very-very small.
For any u, v Rn ,
& 

&

&U u k (Bu c) U v k (Bv c) & (u v) k B(u v)2 Id k B2 u v2 .
2
Remind that for any square matrix B, if Bv = i (B)v then (Id B)v = v i (B)v = (1 i (B))v.
Since Id B, > 0, is symmetric, its spectral radius is
6

 
7

f () = max i (Id B) = max 1 min (B) , 1 max (B)
1in

CHAPTER 3. CONSTRAINED OPTIMIZATION

54

We have f convex and


f () > 0 if min (B) < max (B);
2
<1
min (B) + max (B)


2
f (0) = f
=1
max (B)
arg min f () =

2
. This bound for k is much better than (3.8) that was
max (B)
established in Theorem 30 in a more general case.

Convergence is ensured if 0 < k <

Remark 16 Diculty : in general U has no an explicit form. In such a case we can use a penalization method, or another iterative method (see later sections).

3.3.3

Penalty

Main idea : Replace the constrained minimization of F : Rn R by an unconstrained problem.


Construct G : Rn R continuous, convex, G(u) 0, u Rn and such that
G(u) = 0 u U
> 0 dene
(P )

(3.9)

1
F (u) = F (u) + G(u)

We will consider that 0.


Theorem 31 Let F be continuous, coercive and strictly convex, and U convex, dened by (3.9). Then
(1) > 0, u unique such that F (u ) = infn F (u) ;
uR

(2) lim u = u where u is the unique solution of F (


u) = inf F (u).
0

uU

Proof. F (u) F (u) F is strictly convex and coercive, > 0, hence statement 1.
Noticing that u is the minimizer of F and not of F , as well as that G(
u) = 0,
1
1
def
u) = F (
u)
F (u) F (u) + G(u ) = F (u ) F (
u) = F (
u) + G(

(3.10)

CHAPTER 3. CONSTRAINED OPTIMIZATION

55

Since F is coercive, (u )>0 is bounded. Hence there is a convergent subsequence (uk )kN , say
lim uk = u

(3.11)

u)
F (uk ) Fk (uk ) F (

(3.12)

F (
u) = lim F (uk ) F (
u)

(3.13)

Using (3.10)

Since F is continuous,
k

By the expression for (P ) and (3.12),






u) F (uk ) k K,
0 G(uk ) = k Fk (uk ) F (uk ) k F (
where K is un upper bound for F (
u) F (uk ), independent of k. By the continuity of G, (3.11)
yields
lim G(uk ) = G(
u),
k

and reminding that k 0 as k ,


lim G(uk ) lim k K = 0

The last two results show that G(


u) = 0, hence u U. Using (3.13) and the fact that F has a
unique minimizer in U show that
u = u

lim u = u.

The proof is complete.

Convex programming Problem : U given by (1.2), i.e.


U = {u Rn | hi (u) 0, 1 i q}
with hi : Rn R, i = 1, . . . , q convex functions. Consider
G(u) =

q


gi (u),

i=1

gi (u) = max{hi (u), 0}


G clearly satises the requirements for continuity, convexity and (3.9). Let us check the last point
(3.9). If for all 1 i q we have hi (u) 0, i.e. u U, then gi (u) = 0 for all 1 i q and thus
G(u) = 0. Note that G(u)  0 for all u Rn . If u  U, there is at least one index i such that
hi (u) > 0 which leads to gi (u) > 0 and G(u) > 0. Thus G(u) = 0 shows that u U.
Comments Note that G is not necessarily dierentiable.
Diculty : construct good functions G (dierentiable, convex). This constitutes the main
limitation of penalization methods.

CHAPTER 3. CONSTRAINED OPTIMIZATION

3.4
3.4.1

56

Equality constraints
Lagrange multipliers

Here V1 , V2 and Y are n.v.s. Optimality conditions are based on the Implicit functions theorem.
Theorem 32 (Implicit functions theorem) Let G : O V1 V2 Y be C 1 in O, where V2 and Y are
2) O and v Y are such that G(
u1 , u2 ) = v
complete (i.e. Banach spaces). Suppose that u = (
u1 , u
u1 , u2 ) Isom(V2 ; Y ).
and D2 G(
u1 , u2 ) O1 O2 O and there is a unique continuous
Then O1 V1 , O2 V2 such that (
function g : O1 V1 V2 , called an implicit function, such that
{(u1 , u2) O1 O2 : G(u1 , u2) = v} = {( u1 , u2) O1 V2 : u2 = g(u1 )}.
Moreover g is dierentiable at u1 and
u1 , u2 ))1 D1 G(
u1 , u2 ).
Dg(
u1) = (D2 G(

(3.14)

Proof. See [11, p. 30], [78, p. 176]


As to dierentiability (note that g is dierentiable at u1 ):
G(u1 , g(u1)) v = 0 Y, u1 O1 .
Dierentiation of both sides of this identity w.r.t. u1 = u1 yields
u1 , g(
u1)) + D2 G(
u1 , g(
u1))Dg(
u1) = 0.
D1 G(
Hence (3.14).
Example 5 V1 = V2 = R, G(u1 , u2 ) = u21 u2 = 0 = v Y = R. Then D2 G(u) = 1 Isom(R, R)
and u2 = g(u1) = u21 . Thus G(u1, u2 ) = 0 = G(u1 , g(u1)), u1 R.
Theorem 33 (Necessary condition for a constrained minimum) Let O V = V1 V2 be open where V1
and V2 are real n.v.s., and V2 be complete. Consider that G : O V2 is C 1 and set
U = {u O : G(u1 , u2) = 0}.
Suppose that F : V R is dierentiable at u U and that D2 G(
u) Isom(V2 , V2 ). If F has a
relative minimum w.r.t. U at u, then there is an application (
u) L(V2 , R) such that
DF (
u) + (
u)DG(
u) = 0.

(3.15)

CHAPTER 3. CONSTRAINED OPTIMIZATION

57

Proof. By the Implicit Functions Theorem 32, there are two open sets O1 V1 and O2 V2 , with
u O1 O2 O, and a continuous application g : O1 O2 such that
(O1 O2 ) U = {(u1 , u2) (O1 V2 ) | u2 = g(u1)}
and
u))1 D1 G(
u).
Dg(
u1) = (D2 G(

(3.16)

Dene



def
F (u1) = F u1 , g(u1) , u1 O1 .


Then F (
u1 ) = inf F (u1) entails F (
u) = inf F (u) for u = u1 , g(
u1) . Since O1 is open, u1 satises
u1 O1

uU





u1) + D2 F u1 , g(
u1) Dg(
u1 )
0 = DF (
u1) = D1 F u1 , g(




= D1 F u1 , g(
u1) D2 F u1 , g(
u1) (D2 G(
u))1 D1 G(
u),


where the last equality comes from (3.16). Using that u = u1 , g(
u1) , we can hence write down
u) = D2 F (
u) (D2 G(
u))1 D1 G(
u)
D1 F (
D2 F (
u) = D2 F (
u) (D2 G(
u))1 D2 G(
u)
where the second equality is just an identity. We obtain (3.15) by setting
u) (D2 G(
u))1 .
(
u) = D2 F (
Since D2 F (
u) L(V2 ; R) and (D2 G(
u))1 L(V2 ; V2 ), we have (
u) L(V2 ; R).

The most usual case arising in practice is considered in the next theorem.
Theorem 34 Let O Rn be open and gi : O R, 1 i p be C 1 -functions in O.
U = {u O : gi (u) = 0, 1 i p} O
u) L(Rn , R), 1 i p be linearly independent and F : O R dierentiable at u. If
Let Dgi (
u) R, 1 i p, uniquely
u U solves the problem inf F (u) then there exist p real numbers i (
uU
dened, such that
p

i (
u) Dgi (
u) = 0.
(3.17)
DF (
u) +
i=1

u), 1 i p are called the Lagrange multipliers.


i (
The problem considered in the last theorem is a particular case of Theorem 33.
def
Proof. Put G = (g1 , , gp ). Then

DG(
u) =

g1
u
1

Dg1 (
u)

=
gp
Dgp (
u)

u
1

g1
u
p

g1
u
n

gp
u
p

gp
u
n

CHAPTER 3. CONSTRAINED OPTIMIZATION

58

Since Dgi (
u) are linearly independent, rank(DG(
u)) = p n, so we can assume that the rst p p
submatrix of DG(
u) is invertible. If p = n, u is uniquely determined by G. Consider next that p < n.
Let {e1 , en } be the canonical basis of Rn . Dene
def

V1 = Span{ep+1 , , en }
def

V2 = Span{e1 , , ep }
Redene G in the following way:
G : V1 V2 V2
(u1 , u2) G(u) =

p


gi (u)ei .

i=1

Since the rst p p submatrix of DG(


u) is invertible, D2 G(
u) Isom(V2 , V2 ). Noticing that the
u) Rp (i.e. real numbers i (
u),
elements of V2 belong to Rp , Theorem 33 shows that there exists (
1 i p) such that
p

i (
u)Dgi (
u) = 0
DF (
u) + (
u)DG(u) = 0 DF (
u) +
i=1

The uniqueness of (
u) is due to the fact that rank(DG(
u)) = p.

Remark 17 Since F : O Rn R and gi : O Rn R, 1 i p, we can identify dierentials


with gradients using the scalar product on R ,  {p, n}. Then (3.17) can be rewritten as
p

F (
u) +
i (
u) gi (
u) = 0.
(3.18)
i=1

The numbers i (
u), i {1, 2, , p} are called Lagrange multipliers relevant to a constraint
extremum (minimum or maximum) of F in U.
To resume: u Rn , Rp hence n + p unknowns


u) = 0
F (
u) + pi=1 i gi (
u) = 0, 1 i p
gi (

n equations
p equations

System of n + p (nonlinear) equations.


These are necessary conditions. Complete the resolution of the problem by analyzing if u indeed
is a minimizer. In particular, if F is convex and continuous, and if U is convex and {gi } linearly
independent on V , then (3.18) yields the minimum of F .

3.4.2

Application to linear systems

Quadratic function under ane constraints


For B Rnn , B = B T , B  0, and c Rn , consider
1
minimize F (u) =
Bu, u c, u subject to u U
2
where U = {u Rn : Au = v} with A Rpn , rankA = p < n.

(3.19)

(3.18) yields
Bu + AT = c
Au
= v

(3.20)

CHAPTER 3. CONSTRAINED OPTIMIZATION

59

Projection onto U in (3.19) The projection u = U (w) of w (Rn \ U) onto U satises




1
1
2
u = arg min u w = arg min
u, u w, u .
uU 2
uU
2
(3.20) yields
u + AT = w
Au
= v
Hence
Au + AAT = Aw = (AAT )1 A(w u) = (AAT )1 (Aw v)
Since AAT is invertible (see (3.19)), we get
u + AT = u + AT (AAT )1 (Aw v) = w
We deduce



u = U (w) = I AT (AAT )1 A w + AT (AAT )1 v

(3.21)

If p = 1, then A R1n is a row-vector, v R and AAT = A22 .


If v = 0, then U = ker A is a vector sub-space and U is linear
U = I AT (AAT )1 A

(3.22)

Solving a linear system: Au = v Rm where u Rn


rankA = m n : F (u) = u2, U = {u Rn : Au = v}
u = AT (AAT )1 v (the minimum norm solution). Compare with (3.22).
rankA = n m : F (u) = Au v2 , U = Rn
u = (AT A)1 AT v (the least-squares (LS) solution)

Generalized inverse, pseudoinverse if A has linearly dependent rows or columns


SVD (singular value decomposition) of A Rmn :
A = Q1 Q2
Q1 QT1 = QT1 Q1 = Im ,

Q2 QT2 = QT2 Q2 = In (orthonormal matrices)

the columns of Q1 = eigenvectors of AAT


the columns of Q2 = eigenvectors of de AT A
, m n, diagonal, [i, i], 1 i r singular values =
r = rankA

eigenvalues = 0 of AAT and AT A,

CHAPTER 3. CONSTRAINED OPTIMIZATION

60

1
if i = j and 1 i r
The pseudo-inverse A =
where [i, j] =
[i, i]
0
else

It can be shown that this A corresponds to




n
U = u R : Au v = infn Aw v

QT2 QT1

wR

u : 
u = inf u
uU

u = A v

See [81, 85].


Remark 18 Generalized inversein a similar way when A : V1 V2 is compact and V1 , V2 are
Hilbert spaces. (see e.g. [85].)

3.5

Inequality constraints

Set of constraints: U V , U = , where V is a real n.v.s.

3.5.1

Abstract optimality conditions

Denition 16 Let V be a n.v.s., U V , = . The cone of all feasible directions at u U reads




v
uk u
=
C(u) = {0} v V \ {0} : (uk )k0, u = uk U, k 0, lim uk = u, lim
k
k uk u
v

C(u) is a cone with vertex 0, not necessarily convex. C(u) is the set of all directions such that
starting from u, we can reach another point v U.

Lemma 3 ([26]) C(u) is closed, u U.


Lemma 4 If U is convex then U u + C(u), for all u U.
def

Proof. We have to show that w U, w = u, v = w u C(u).



uk =

1
1
k+2


u+

1
w U, k N.
k+2

because U is convex. Moreover, uk = u, k N and


lim uk = u

CHAPTER 3. CONSTRAINED OPTIMIZATION


Then uk u =

1
(w
k+2

61

u). Then
wu
v
uk u
=
=
, k N.
uk u
w u
v

Hence v C(u).

Theorem 35 (Necessary condition for constrained minimum) V real n.v.s., U V , U = (arbitrary)


and F : O(U) R dierentiable at u U. If F admits at u U a constrained minimum, then
F (
u), u u 0, u {
u + C(
u)}.
Proof. see e.g. [26].
Remind that by Theorem 27, p. 49, if U is convex, then F (
u), u u 0, u U.

3.5.2

Farkas-Minkowski (F-M) theorem

Theorem 36 Let V be a Hilbert space, Ia nite set of indexes, ai V, i I and b V . Then


1. {u V : ai , u 0, i I} {u V : b, u 0}
if and only if
2. i 0, i I | b =

i ai

iI

If {ai , i I} are linearly independent,


then {i , i I} are determined in a unique way.
Proof. Let 2 hold. Using that ai , u 0, i I, we obtain

b, u =
i ai , u 0.
iI

Thus 2 implies the inclusion stated in 1.


Dene the subset
+
,

def
K =
i ai V | i 0, i I

(3.23)

iI

The proof of 1 2 is aimed at showing that b K. The proof is conducted ad absurdum and is
based of 3 substatements.
(a) K in (3.23) have the properties stated below.
K is a cone with vertex at 0.
> 0 i 0, i I


iI

i ai =


iI

(i )ai K.

CHAPTER 3. CONSTRAINED OPTIMIZATION

62

K is convex.
Let 0 < < 1, i 0, i I and i 0, i I.




i + (1 )i ai K
i ai + (1 )
i ai =

iI

iI

iI

because i + (1 )i 0.
K is closed in V .
Suppose that {ai , i I} are linearly independent. Set A = Span{ai : i I}a nite
dimensional vector subspace (hence closed). Consider an arbitrary (vj )jN K. For any
j N, there is a unique set {j [i] 0, i I} so that

j [i]ai K A.
vj =
iI

Let lim vj = v V then v V A which is equivalent to


j

lim vj =


iI


lim j [i] ai K A.

Hence K is closed in V .
Let {ai , i I} be linearly dependent. Then K is a nite union of cones corresponding
to linearly independent subsets of {ai }. These cones are closed, hence K is closed.
(b) Let K V be convex, closed and K = . Let b V \ K. Using the Hahn-Banach theorem 3,
h V and R such that
u, h > , u K,
b, h < .
By the Projection Theorem 28 (p. 50), there is a unique c K such that
b c = inf b u > 0
uK

(> 0 because b  K)

c V satises
u c, b c 0, u K.

(3.24)

Then
c b2 > 0

c2 c, b c, b + b2 > 0

c2 c, b > c, b b2

c, c b > b, c b

Set h = c b and such that


c, c b = c, h > > b, h = b, c b
From (3.24)
u c, h 0 u, h c, h > , u K.

CHAPTER 3. CONSTRAINED OPTIMIZATION

63

Remark 19 The hyperplane {u V : h, u = } separates strictly {b} and K.


(c) Let K read as in (3.23). Then
b  K h V such that ai , h 0, i I and b, h < 0.
Since b  K, then (b) shows that h V and R such that (Theorem 3)
u, h > , u K,
b, h < .
In particular
u=0K

0 = u, h > < 0.

Since K is a cone
u, h > , > 0, u K

u, h >

, > 0, u K.

Since / ! 0 as ! it follows that


u, h 0, u K.
Choose arbitrarily ai K for i I. Then ai , h 0, i I.
(d) Conclusion. Statement 2 means that b K. By (c), we have
statement 2 fails statement 1 fails.


Consequently, statement 1 implies statement 2.

Remark 20 Link with : g1 , . . . , gp linearly independent and [gi , u = 0, i f, u = 0] then 1 , . . . , p



such that f = pi=1 i gi .

3.5.3

Constraint qualication

U = {u O : hi (u) 0, 1 i q} where O V (open), V is a n.v.s., hi : V R, 1 i


q, q N (nite). We always assume that U = .
How to describe C(u) in an easier way?
Denition 17 The set of active constraints at u U is dened by:
I(u) = {1 i q : hi (u) = 0}
Since hi (u) 0, u U, the gures on p. 60 (below Denition 16) suggest that in some cases,
if i I(u), then hi reaches at u its maximum w.r.t. U, hence hi (u), v u 0 for some v U.
(We say some because U can be nonconvex.) This observation underlies the denition of Q:
Q(u) = {v V : hi (u), v 0, i I(u)}

(convex cone)

(3.25)

This cone is much more practical that C, even though it still depends on u. The same gures
show that in some cases Q(u) = C(u). Denition 18 below is constructed in such a way so that
these cones are equal in the most important cases.

CHAPTER 3. CONSTRAINED OPTIMIZATION

64

Denition 18 The constraints are qualied at u U if one of the following conditions hold:
w V \ {0} such that i I(u),
not ane;

hi (u), w 0, where the inequality is strict if hi is

hi is ane, i I(u).
Naturally, Q(u) = V if I(u) = .
Lemma 5 ([26]) Let hi for i I(u) be dierentiable at u

C(u) Q(u)

Theorem 37 ([26]) Assume that


(i)
(ii)
(iii)

hi dierentiable at u, i I(u)
hi continuous at u, i I \ I(u)
Constraints are qualied at u

C(u) = Q(u)

Example 6 U = {v Rn : ai , v bi , 1 i q} = (then the constraints are qualied v U)


and we have C(u) = Q(u) = {v Rn : ai , v 0, i I(u)}

3.5.4

Kuhn & Tucker Relations

Putting together all previous results leads to one of the most important statements in Optimization
theoreythe Kuhn-Tucker (KT) relations, stated below.
Theorem 38 (Necessary conditions for a minimum) V Hilbert space, O V open, hi : O R, i
U = {u O : hi (u) 0, 1 i q}
(i)
(ii)
(iii)
(iv)
(v)

(3.26)

hi dierentiable at u, i I(
u)
hi continuous at u, i I \ I(
u)
Constraints are qualied at u U
F : O R dierentiable at u
F has a relative minimum at u w.r.t. U

Then
i (
u) 0, 1 i q such that F (
u) +

q


i (
u)hi (
u) = 0,

i=1

q


i (
u)hi (
u) = 0 .

(3.27)

i=1

i (
u) are called Generalized Lagrange Multipliers
Proof. Assumptions (i), (ii) and (iii) are as in Theorem 37, hence
C(
u) = Q(
u).
By Theorem 35 (p. 61), or equivalently by Theorem 27 (p. 49)since Q(
u) is convex,
def

u) .
F (
u), v 0, v = u u Q(

(3.28)

CHAPTER 3. CONSTRAINED OPTIMIZATION

65

The denition of Q at usee (3.25) (p. 63)can be rewritten as


Q(
u) = {v V : hi (
u), v 0, i I(
u)}.
Combining the latter with (3.28), we can write down:

 

Q(
u) = v V : hi (
u), v 0, i I(
u) v V : F (
u), v 0 .
By the F-M Theorem 36 (p. 61),
u) such that F (
u) =
i 0, i I(

i (
u)hi (
u).

(3.29)

iI(
u)

From the denition of I(


u), we see that

i (
u)hi (
u) = 0. Fixing i (
u) = 0 whenever i I \ I(
u)

iI(
u)

leads to the last equation in (3.27).

Remarks
i (
u) 0, i I(
u) are dened in a unique way if and only if the set {hi (
u), i I(
u)} is
linearly independent (recall the Farkas-Minkowski lemma).
The KT relation (3.27) depend on udicult to exploit directly.
(3.27) yields a system of (often nonlinear) equations and inequationsnot easy to solve.

3.6
3.6.1

Convex inequality constraint problems


Adaptation of previous results

Lemma 6 If O is convex and hi : O V R, 1 i q are convex


U in (3.26) (namely U = {u O : hi (u) 0, 1 i q}) is convex.
Proof. Straightforward (apply the basic denition for convexity).
Denition 19 Convex constraints hi : O V R, 1 i q are qualied if U = and if
Either w O such that hi (w) 0, i = 1, . . . , q and hi (w) < 0 if hi is not ane.
Or hi , 1 i q, are ane.
Emphasize that these conditions are independent of u, hence they are much easier to use.
Example 7 U = {v Rn | ai , v bi , 1 i q} = .
For u U the set of active constraints ai , u = bi
has a geometric representation as seen next.

CHAPTER 3. CONSTRAINED OPTIMIZATION

66

Theorem 39 (Necessary and sucient conditions for a minimum on U) Let V be a Hilbert space, O
V (open) and U O convex. Suppose u U and that
(i) F : O V R and hi : O V R, 1 i q are dierentiable at u;
(ii) hi : V R, 1 i q, are convex;
(iii) Convex constraints qualication holds (Denition 19).
Then we have the following statements:
1. If F admits at u U a relative minimum w.r.t. U, then
u) R+ : 1 i q}
{i (

such that
F (
u) +

q


i (
u)hi (
u) = 0 and

(3.30)

i=1
q


i (
u)hi (
u) = 0 .

(3.31)

i=1

2. If F is convex on U and (3.30)-(3.31) holds then F has at u a constrained minimum w.r.t. U.

Proof. To prove 1, we have to show that if the constraints are qualied in the convex sense (Denition
19) then they are qualied in the general sense (Denition 18, p. 64), for any u U, which will
allows us to apply the KT Theorem 38.
u) = 0, i I(
u), and
Let w = u, w U, satisfy Denition 19. Set v = w u. Using that hi (
that hi are convex (see Property 2-1, p. 21),
i I(
u)

hi (
u), v = hi (
u) + hi (
u), w u hi (w).

By Denition 19, we know that hi (w) 0 and that hi (w) < 0 is hi is not ane. Then
u), w u = hi (
u), v 0 ,
hi (
where the inequality is strict if hi is not ane. Hence Denition 18 is satised for v.
u) < 0 is impossible for any i I(
u), Denition 19 means that all hi
Let u = w U. Since hi (
for i I(
u) are ane. Then Denition 18 is trivially satised.
Statement 1 is a straightforward consequence of the KT relations (Theorem 38, p. 64).
To prove statement 2, we have to check that F (
u) F (u), u U. Let u U (arbitrary). The
convexity of hi (Property 2-1) and the denition of U yield the two inequalities below:
i I(
u)

hi (
u), u u hi (u) hi (
u) 0, u U.

CHAPTER 3. CONSTRAINED OPTIMIZATION

67

Using these inequalities and noticing that i (


u) 0, we can write down
F (
u) F (
u)



i (
u) hi (u) hi (
u)

iI(
u)

F (
u)

i (
u) hi (
u), u u

iI(
u)

8
F (
u)

q


9
i (
u)hi (
u) , u u

by (3.31), i (
u) = 0 if i  I(
u)

i=1

= F (
u) + F (
u) , u u
F (u), u U

(by (3.30))


(F is convex).

u) Rq+ were known, then we would have and unconstrained minimization problem.
If i (
Example 8 U = {u Rn : Au b Rq }: the constraints are qualied if and only if U non-empty.
F convex. The necessary and sucient conditions for F to have a constrained minimum at
u U: Rq+ such that F (
u) + AT = 0 and i 0, 1 i q, with i = 0 if ai , u bi < 0.

3.6.2

Duality

V, W any subsets
= inf L(u, )

sup L(
u, ) = L(
u, )

:
Denition 20 Saddle point of L : V W R at (
u, )

In our context: W the space of Generalized Lagrange Multipliers

L has a saddle point at (0, 0).


Lemma 7 For any L : V W R, u V and W we have:
sup inf L(u, ) inf sup L(u, )

W uV

uV W

uV

CHAPTER 3. CONSTRAINED OPTIMIZATION

68

Proof. Take u V , W arbitrary.


inf L(u, ) L(u, ) sup L(u, ), u V

uV

W
def

inf L(u, ) inf sup L(u, ) = K

uV

uV W

inf L(u, ) K, W

uV

sup inf L(u, ) K = inf sup L(u, )

W uV

uV W

The proof is complete.


be a saddle-point for L : V W R. Then
Theorem 40 Let (
u, )
= inf sup L(u, ).
u, )
sup inf L(u, ) = L(

W uV

uV W

Proof.

inf

uV


Def.Saddle Pt.
sup inf L(u, ).
u, )
=
inf L(u, )
sup L(u, ) sup L(
W

uV

W uV

Compare with Lemma 7 (p. 67) to conclude.


Remind:
(P ) nd u such that F (
u) = min F (u) where U = {u : hi (u) 0, 1 i q}
uU

The Lagrangian associated to (P ):

L(u, ) = F (u) +

q


i hi (u)

i=1

Theorem 41 Let F : V R and hi : V R, 1 i q, where V is a Hilbert space.


V Rq+ is a saddle-point of L
1. (
u, )

u U solves (P)

F and hi , 1 i q are dierentiable at u


F and hi , 1 i q are convex
2. Let u U solve (P). Suppose also that

constraints are qualied (convex sense)


q

u, ) is a saddle point of L
R+ such that (
Rq+ ,
Proof. By L(
u, ) L(
u, ),
0, Rq+ .
L(
u, ) L(
u, )
By the denition of L the latter reads
F (
u) +

q

q

i=1

i hi (
u) F (
u)

q


i hi (
u) 0, Rq+

i=1

i hi (
u) 0, Rq+ .
i

i=1

Rq+ , for any i {1, , q} apply (3.32) with


Since
j , j = i hi (
i , j =
u) 0.
Then hi (
u) 0, 1 i q, hence u U .

(3.32)

CHAPTER 3. CONSTRAINED OPTIMIZATION


q


(3.32), uU
Let i = 0, 1 i q
=

69

i hi (
u) 0

i=1
q
#
$


i hi (
u) 0 and i 0 , 1 i q
u) 0

hi (
i=1

q


i hi (

u) = 0.

i=1

L(u, ),
u U
Using that L(
u, )
F (
u) = F (
u) +

q


L(u, )

i hi (
u) = L(
u, )

i=1
q

= F (u) +

i hi (u) F (u), u U

i hi (u) 0, u U)
(remind

i=1

Hence statement 1.
Rq+ such that
Let u solve (P ). By KT theorem (see (3.30)-(3.31), p. 66),
F (
u) +

q


i hi (
u) = 0 and

q


i hi (
u) = 0 .



i=1

i=1

(3.33)

is a saddle point of L.
We have to check that (
u, )

Rq+ ,

L(
u, ) = F (
u) +

q


i hi (
u)F (
u) = F (
u) +

u F (u) +

i hi (
u, ).

u) = L(

i=1 

i=1
q


q


i hi (u) = L(u, )
is convex and reaches its minimum at u. Hence

i=1

= F (
u) +
L(
u, )

q


= F (u) +
i hi (
u) L(u, )

q


i=1

i hi (u), u U.

i=1

is a saddle point of L.
Hence (
u, )

For Rq+ (called the dual variable) we dene u V (Hilbert space) and K : Rq+ R by:
(P )

u V | L(u , ) =

inf L(u, )

uV

def

K() = L(u , ).
Dual Problem:
(P )

Rq+ | K()
= sup K()

Lemma 8 Assume that


(i) hi : V R, i = 1, . . . , q, are continuous;

Rq+

(3.34)

CHAPTER 3. CONSTRAINED OPTIMIZATION

70

(ii) Rq+ (P ) admits a unique solution u ;


(iii) u is continuous on Rq+
Then K in (3.34) is dierentiable and K(),  =

q


i hi (u ), Rq .

i=1

Proof. Restate (P ):
K() = inf L(u, ) = L(u , ).
uV

Let , +

Rq+ .

We have:

(a) L(u , ) L(u+ , )

(a)

K() F (u+ ) +

q


(b) L(u+ , + ) L(u , + )

and

(i + i )hi (u+ )

i=1

(b)

K( + ) F (u ) +

i hi (u+ ) = K( + )

i=1

q


i hi (u ) +

i=1
q


q


q


q


i hi (u+ ).

i=1

i hi (u ) = K() +

q


i=1

i hi (u ).

i=1

i hi (u+ ) K( + ) K()

q


i=1

i hi (u).

i=1

Then [0, 1] such that


K( + ) K() = (1 )

q


i hi (u ) +

i=1

q

i=1

q


i hi (u+ )

i=1

q



i hi (u ) +
i hi (u+ ) hi (u)
i=1

Since u is continuous and hi , 1 i q are continuous, for any Rq+ we can write down
K( + ) K() =

q


i hi (u) + () where

i=1

It follows that K is dierentiable and that K(),  =

q


lim () = 0

i hi (u ).

i=1

Theorem 42 Two reciprocal statements.


1. Assume

(i)

(ii)
(iii)

(iv)

that
hi : V R, i = 1, . . . , q, are continuous;
Rq+ (P ) admits a unique solution u ;
u is continuous on Rq+ ;
Rq+ solves (P )

2. Assume

(i)

(ii)
(iii)

(iv)

that
(P) admits a solution u;
F : V R and hi : V R, 1 i q are convex;
F and hi , 1 i q, are dierentiable at u;
(convex) constraints are qualied.

u solves (P)

(P ) admits a solution

CHAPTER 3. CONSTRAINED OPTIMIZATION

71

solve (P ). Then
Proof. Statement 1. Let
= L(u , )
= inf L(u, )

K()

uV

a maximum w.r.t. Rq+ (convex set), hence


K is dierentiable (Lemma 8 (p. 69)) and has at
:
;

K(), 0, Rq+ .


This, combined with Lemma 8, yields
:

q
q
; 
:
;


K(), =
i hi (u )
i hi (u ) = K(), , Rq+ .
i=1

L(u , ) = F (u ) +
  

q

i=1

i=1

i hi (u ) F (u ) +


q

i=1

, Rq+ .
i hi (u ) = L(u , )

 
   

sup L(u , ) = L(u , ).

Rq+

is a saddle point of L. By Theorem 41-1 (p. 68), u solves (P ).


Consequently (u , )

u, ) is a saddle point of L.
Statement 2. Using Theorem 41-2 (p. 68), R+ such that (
= inf L(u, )
= sup L(
= sup L(
L(
u, )
u, ) K()
u, ).
Rq+

uV

Rq+

The proof is complete.

Quadratic function under ane constraints


F (u) = 12 Bu, u c, u, B Rnn B  0, B T = B, c Rn and
U = {u Rn : Au b Rq },

rankA = q < n

(3.35)

hence U is non-empty convex constraints qualied.


1
Bu, u c, u + Au b, 
2
u L(u, ) is dierentiable, strictly convex and coercive, hence (P ) (p. 69) has a unique solution
u , Rq+ . It solves 1 L(u, ) = 0:
L(u, ) =

0 = 1 L(u , ) = Bu c + AT u = B 1 (c AT )
K() = L(u , ) =

 
 1  1 
1
AB 1 AT , + AB 1 c b,
B c, c
2
2

= max K() admits at least one solution. If rankA = q, the solution of


(P ), p. 69, namely K()
q

R+

(P ) is unique and reads


= (AB 1 AT )1 (AB 1 c b)

.
u = B 1 (c AT )

CHAPTER 3. CONSTRAINED OPTIMIZATION

3.6.3

72

Uzawas Method

= a solution of (P )
Compute
Maximization of K using gradient with projection:
k+1 = + (k + K(k ))

+ > 0 because we maximize

1 i q, Rq

(+ )[i] = max{[i], 0},

K(k )[i] = hi (uk ), 1 i q


def

uk = uk = inf L(u, k )
uV

Alternate Optimization (uk , k ) V Rq+ . For k N


,
+
q

k [i] hi (u)
uk = arg inf F (u) +
uV

i=1

k+1 [i] = max {0, k [i] + hi (uk )} , 1 i q


Theorem 43 Assume that
(i) F : Rn R is strongly convex with constant > 0
(ii) U = {u Rn : Au b} = , A Rqn , b Rq
(iii) 0 < <

2
A22

Then
lim uk = u

If moreover rankA = q

(the unique solution of (P ))

lim k =

(the unique solution of (P ))

Emphasize that (uk ) converges even if (k ) diverges.


"
2
, . Euclidean norm u2 = u, u.
Remind: A2 = supu Au
u2
Proof. Denote by h : Rn Rq the function
def

h(u) = Au b Rq .

(3.36)

Then the constraint set U reads






U = u Rn | h(u) [i] 0, 1 i q .
For Rq+ ,
L(u, ) = F (u) + , h(u) = F (u) + , Au b
is a saddle point of L. The latter is dened by the system
Rq+ such that (
u, )
Then
=0
F (
u) + AT
:
;
0, Rq+ .
h(
u),

(3.37)
(3.38)

CHAPTER 3. CONSTRAINED OPTIMIZATION

73

For any > 0, (3.38) is equivalent to


:
;



+ h(
0, Rq+ .

u) ,
is the projection of
+ h(
By the Projection theorem,
u) on Rq+ , i.e.


= +
+ h(

u) .
Iterates solve the system for k N:
F (uk ) + AT k = 0


k+1 = + k + h(uk ) .


= 0 AT (k )
= F (uk ) F (
u) + AT (k )
u)
F (uk ) F (

(3.39)

Convergence of uk :




2 = + k + h(uk ) +
+ h(
u) 22
k+1 
2
+ A(uk u)2
k
by (3.36): h(uk ) h(
u) = A(uk u)
2
:
;
2 + 2 AT (k )
, uk u + 2 A(uk u)2
= k 
2
2
2 2 F (uk ) F (
= k 
u) , uk u + 2 A(uk u)22 (using (3.39))
2
2 2 uk u2 + 2 A2 uk u2 (F is strongly convex, see (i))
k 
2
2
2
2


2
2
2
2 A uk u
= k 
2
2
2
2 k 
2 , hence
Note that by (iii), we have 2 A22 > 0. It follows that k+1 


2
2
k+1 
is decreasing. Being bounded from below, k+1 
converges (not
kN
kN
necessarily to 0), that is


2
2

lim k+1 2 k 2 = 0
k

Then



2 k+1 
2 0 as k .
0 2 A22 uk u2 k 
2
2

Consequently, uk u2 0 as k .


Possible convergence for k
decreasing) subsequence k
 such that
(k )k0 bounded (because k 


 = lim F (uk ) + AT k
F (
u) + AT

k

If
$ rankA = q, then:
Range(A) = Rq

ker AT = {0}

= 0 has a unique solution.


hence F (
u) + AT

Remark 21 The method of Uzawa amounts to projected gradient maximization with respect to
(solving the dual problem).

CHAPTER 3. CONSTRAINED OPTIMIZATION

3.7

74

Unifying framework and second-order conditions

Find u such that F (


u) = inf F (u) where
uU


U=

g (u) = 0, 1 i p
uR : i
hi (u) 0, 1 i q

=

Associated Lagrangian:
L(u, , ) = F (u) +

p


i gi (u) +

i=1

3.7.1

q


i hi (u), i 0, 1 q.

i=1

Karush-Kuhn-Tucker Conditions (1st order)

Theorem 44 (KKT) Let u U be a solution (local) of inf F (u) and


uU

u)
1. F : Rn R, {gi } and {hi } be C 1 on O(
u), 1 i p & hi (
u), i I(
u)} linearly independent
2. {gi (

Rp and

Rq+

such that for


u L(
u, ,
) = 0
i gi (
u) = 0 and
u) = 0
1 i p : gi (
u) 0,
i 0 and
i hi (
u) = 0
1 i q : hi (

i hi (
u) = 0, 1 i q and
i > 0, i I(
u) - complementarity conditions (facilities for numerical
methods)

,
uniqueness by assumption 2.

3.7.2

Second order conditions

(Verify if u is a minimum
 indeed)

(
u
),
v
=
0,
1

p
g
i
Critical cone C = v C(
u) :
hi (
u), v = 0, if i I(
u) and
i > 0
u) and that all conditions
Theorem 45 (CN) Suppose that F : Rn R, {gi } and {hi } are C 2 on O(
of Theorem 44 hold.

u, ,
)(v, v)  0, v C.
(CN) Then 2uu L(

u, ,
)(v, v)  0, v C \ {0} then u = strict (local) minimum of
(CS) If in addition 2uu L(
inf uU F (u).

Lagrangian convex (non-negative curvature) for all directions in C.

CHAPTER 3. CONSTRAINED OPTIMIZATION

3.7.3

75

Standard forms (QP, LP)

A. Linear programming (LP) We are given: c Rn , A Rpn , b Rp , C Rqn and f Rq .


Assumption: the constraint set is nonempty and it is not reduced to one point.

min c, u subject to

Au b = 0 Rp
Cu f 0 Rq

L(u, , ) = c, u + , Au b + , Cu f . Dual problem (KKT):


c + AT + C T = 0
Au b = 0
, Cu f  = 0
0
B. Quadratic programming (QP) We are given: B Rnn with B  0 and B T = B, A Rpn ,
b Rp , C Rqn and f Rq . The same assumption as in LP.
1
F (u) = Bu, u c, u subject to
2

Au b = 0 Rp
Cu f 0 Rq

There is exactly one solution.


Associated Lagrangian
L(u, , ) = 12 Bu, u c, u + , Au b + , Cu f . Dual problem (KKT), similarly.
N.B. Various algorithms to solve LP or QP can be found on the web. Otherwise, see next subsection.

3.7.4

Interior point methods

Constraints yield numerical diculties. (When a constraint is satised, it can be dicult to realize
a good decrease at the next iteration.)
Interior point methodsmain idea: satisfy constraints non-strictly. Constraints are satised
asymptotically. At each iteration one realizes an important step along the direction given by 2 F
even though calculations are more tricky.
Actually interior point methods are considered among the most powerful methods in presence of
constraints (both linear or non-linear). See [69] (chapters 14 and 19) or [90] (chapter 1).
Sketch of example: minimize F : Rn R subject to Au b, b Rq .
Set y = b Au then y 0.
Notations: 1l = [1, . . . , 1]T , Y = diag(y1 , . . . , yq ), = diag(1 , . . . , q )
We need to solve:

F (u) + AT
def
S(u, y, ) = Au b + y = 0 subject to y 0, 0.
Y 1l
Note that Y 1l = 0 is the complementarity condition.

CHAPTER 3. CONSTRAINED OPTIMIZATION

76

1
Duality measure: = y T .
q
Central path dened using = for [0, 1] xed.
At each iteration, one derives a central path (u , y , ), > 0 by solving

0
S(u , y , ) = 0 , y > 0, > 0
1l
The step to a new point is done so that we remain in the interior of U. The complementary condition
is relaxed using .
The new direction in the variables u, , y is found using Newtons method.
Often some variables can be solved explicitly and eliminated from S.
Interior point methods are polynomial time methods, and were denitely one of the main events
in optimization during the last decade, in particular, in linear programming.

3.8

Nesterovs approach

For V a Hilbert space, consider F : V R convex and Lipschitz dierentiable and U V a closed
and convex domain. The problem is to solve
inf F (u)

(3.40)

uU

The Lipschitz constant of F is denoted by .


It is shown in 2004 by Nesterov, [60, Theorem 2.1.7], that no algorithm that uses only values
F (uk ) and gradients F (uk ) has a better rate of convergence than O(1/k 2) uniformly on all problems
of the form (3.40) where k is the iterations number. The convergence rate is in term of objective
function, that is |F (uk ) F (
u)| C/k 2 where C is a constant proportional to u0 u2 . This
result alone gives no information on the convergence rate of uk to u.
Nestorovs algorithm [61].
 is the Lipschitz constant of F ;
  is a norm and d is a convex function such that there exists > 0 satisfying
d(u)

u0 u2 ,
2

For k = 0 set u0 U and x0 = 0.


For k 1:
1. k = F (uk )



2. zk = arg min k , z uk  + z uk 2
zU
2
3. xk = xk1 +

k+1
k
2

u U.

CHAPTER 3. CONSTRAINED OPTIMIZATION



4. wk = arg min
wU

5. uk+1 =

77


d(w)
+ xk , w

2
k+1
wk +
zk
k+3
k+3

Proposition 3 ([61]) This algorithm satises


0 F (uk ) F (
u)

4d(
u)
(k + 1)(k + 2)

The rationale: similarly to CG, one computes the direction at iteration k by using the information


in F (uk1), , F (u0 ) .
For details and applications in image processing, see [9] and [88].
Y. Nesterov proposes an accelerated algorithm in [59] where the estimation of the Lipschitz
constant is adaptive and improved.
Other solverssee next chapter.

Chapter 4
Non dierentiable problems
Textbooks: [38, 73, 45, 46, 93, 14].

4.1

If not specied, V is a real Hilbert space again.

Specicities

4.1.1

Examples

Non-dierentiable functions frequently arise in practice.


Minimax problems
Minimize F (u) = maxiI i (u) where I is a nite set of indexes and i I, i is convex and smooth.
Regularized objective functionals on Rn
The minimizers of functionals of the form F (u) = (u) + (u) (remind (1.5)-(1.6), p. 7, and the
explanations that follow) are very popular to dene a restored image or signal. They are often
used in learning theory, in approximation and in many other elds. Nonsmooth F have attracted a
particular attention because of the specic properties of their solutions.
Non-dierentiable regularization term
(u) =

(Di u), with  (0+ ) > 0

i


Note that (0 ) > 0 entails that is nonsmooth.


For (t) = t and Di the discrete approximation of the gradient of u at i, along with
. = .2 = Total Variation (TV) regularization [76];
For (t) = t and Di R1n = median pixel regularization (approximate TV) [13];
Other non-dierentiable (nonconvex) potential functions :
t
, see [42, 43] :
(t) =
+t
(t) = t , ]0, 1[
If u = the coecients of the decomposition of the image in a wavelet basis or a frame
and Di = ei , i, then u = shrinkage estimation of the coecients [35, 58, 12, 6]). See
Examples 9 (p. 79) and 8 (p. 99).

CHAPTER 4. NON DIFFERENTIABLE PROBLEMS

79

Property 7 Typically D_i û = 0 for many indexes i, see [63, 66]. This effect is known as staircasing when Φ = TV. It arises with probability zero if Φ is differentiable.

Non-differentiable data-fitting
Starts to be quite popular, [3, 64, 65, 18, 22]:

  F(u) = Σ_{i=1}^m ψ(|a_i^T u − v_i|) + βΦ(u),  ψ′(0⁺) > 0,   (4.1)

where A is an m × n matrix with rows a_i^T, 1 ≤ i ≤ m.

Property 8 Typically, û is such that a_i^T û = v_i for a certain number of indexes i, see [64]. This effect arises with probability zero if ψ′(0⁺) = 0 (i.e. t ↦ ψ(|t|) is smooth at zero).

Constrained nonsmooth problems
Typical forms are
  minimize ‖u‖₁ subject to ‖Au − v‖ ≤ ε
or
  minimize ‖u‖₁ subject to Au = v.
These are frequently used in Compression, in Coding and in Compressive Sensing. The reason is that they lead to sparse solutions, i.e. û[i] = 0 for many indexes i.

4.1.2 Kinks

Definition 21 A kink is a point u where ∇F(u) is not defined (in the usual sense).

Theorem 46 (Rademacher) Let F : R^n → ]−∞, +∞] be convex and F ≢ +∞. Then the subset
  { u ∈ int dom F : ∇F(u) does not exist }
is of Lebesgue measure zero in R^n.
Proof: see [45, p. 189]. The statement extends to nonconvex functions [75, p. 403] and to mappings F : R^n → R^m [39, p. 81].
Hence F is differentiable at almost every u. However, minimizers are frequently located at kinks.

Example 9 Consider F(u) = ½(u − w)² + β|u| for β > 0 and u, w ∈ R. The minimizer û of F reads

  û = w + β if w < −β;  û = 0 if |w| ≤ β;  û = w − β if w > β.   (û is shrunk w.r.t. w.)

(Figure: plots of F for β = 1 and w = 0.9, 0.2, 0.5, 0.95; in each case the minimizer sits at or near the kink u = 0.)
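A minimal sketch of the closed-form minimizer of Example 9 (plain Python, no assumptions beyond the formula above):

```python
def soft_threshold(w, beta):
    """Minimizer of 0.5*(u - w)**2 + beta*|u| from Example 9: w is shrunk towards 0."""
    if w > beta:
        return w - beta
    if w < -beta:
        return w + beta
    return 0.0

# e.g. soft_threshold(0.9, 1.0) returns 0.0: the minimizer sits exactly at the kink.
```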

4.2 Basic notions

Definition 22 Γ₀(V) is the class of all l.s.c. convex proper functions on V (a real Hilbert space). (Remind Definitions 2, 4 and 6 on p. 10.)

4.2.1 Preliminaries

Definition 23 f : V → R ∪ {+∞} is said to be positively homogeneous if f(λu) = λ f(u), ∀λ > 0, ∀u ∈ V.

Definition 24 f : V → R ∪ {+∞} is said to be sublinear if it is convex and positively homogeneous.

Lemma 9 f is positively homogeneous ⇔ f(λu) ≤ λ f(u), ∀λ > 0, ∀u ∈ V.
Proof. (⇒) is obvious. Next (⇐). Let f(λu) ≤ λ f(u), ∀λ > 0, ∀u ∈ V. Since λu ∈ V and 1/λ > 0,
  f(u) = f(λ⁻¹ λu) ≤ λ⁻¹ f(λu)  ⇒  λ f(u) ≤ f(λu), ∀λ > 0, ∀u ∈ V,
hence f(λu) = λ f(u), ∀λ > 0, ∀u ∈ V.

Proposition 4 f is sublinear ⇔ f(αu + βv) ≤ α f(u) + β f(v), ∀(u, v) ∈ V², ∀α > 0, ∀β > 0.

Proof. (⇒) Set γ = α + β > 0. Using the positive homogeneity of f and its convexity for α/γ + β/γ = 1,
  f(αu + βv) = f( γ ( (α/γ)u + (β/γ)v ) ) = γ f( (α/γ)u + (β/γ)v ) ≤ γ ( (α/γ) f(u) + (β/γ) f(v) ) = α f(u) + β f(v).   (4.2)
(⇐) Taking α + β = 1 in (4.2) shows that f is convex. Furthermore, for any λ > 0, taking α = β = λ/2 and v = u yields
  f(λu) = f( (λ/2)u + (λ/2)u ) ≤ (λ/2) f(u) + (λ/2) f(u) = λ f(u),
hence f is positively homogeneous by Lemma 9.

Sublinearity is stronger than convexity (it does not require α + β = 1) and weaker than linearity (it requires that α > 0, β > 0).
Definition 25 Suppose that U ⊂ V is nonempty.
• Support function: σ_U(u) = sup_{s∈U} ⟨s, u⟩ ∈ R ∪ {∞}.
• Indicator function: ι_U(u) = 0 if u ∈ U, ι_U(u) = +∞ if u ∉ U.
• Distance from u ∈ V to U: d_U(u) = inf_{v∈U} ‖u − v‖.


We denote by f* : V → R the convex conjugate of the function f : V → R (see (1.10), p. 14).

When U ≠ ∅ is convex and closed, we have

  ι_U*(u) = sup_{v∈V} ( ⟨u, v⟩ − ι_U(v) ) = sup_{v∈U} ⟨u, v⟩ = σ_U(u)  and  σ_U*(v) = ι_U**(v) = ι_U(v)   (4.3)

where the second part results from Theorem 4 (p. 15).


Proposition 5 (p. 19, [38]) σ_U is l.s.c. and sublinear. Moreover
  σ_U(u) < ∞, ∀u ∈ V  ⇔  U ⊂ V is bounded.

Proposition 6 (p. 12, [38]) Let F be a proper convex function on a real n.v.s. V. The statements below are equivalent:
1. there is a nonempty open set O ⊂ V over which F is bounded above;
2. int dom F ≠ ∅ and F is locally Lipschitz there.
Note that this is the kind of functions we deal with in practical optimization problems.

4.2.2 Directional derivatives

Lemma 10 Let F : V → R be convex, where V is a n.v.s. The function

  t ↦ ( F(u + tv) − F(u) ) / t,  t ∈ R₊ \ {0},

is increasing.
Proof. For any t > 0, choose an arbitrary ε ≥ 0. Set
  λ := ε / (t + ε),  so that  1 − λ = t / (t + ε).
We check that
  λ u + (1 − λ)( u + (t + ε)v ) = u + ( t/(t+ε) )(t + ε) v = u + tv.
Hence F( λ u + (1 − λ)( u + (t + ε)v ) ) = F(u + tv). Using the last result and the convexity of F yields
  F(u + tv) ≤ λ F(u) + (1 − λ) F( u + (t + ε)v ) = F(u) + ( t/(t+ε) ) ( F( u + (t + ε)v ) − F(u) ).
It follows that
  ( F(u + tv) − F(u) ) / t ≤ ( F( u + (t + ε)v ) − F(u) ) / (t + ε),  ∀t > 0, ∀ε ≥ 0.
Hence the result.

Definition 26 Let F : V → R be convex and proper. The (one-sided) directional derivative of F at u ∈ V along the direction v ∈ V reads

  F′(u)(v) = lim_{t↘0} ( F(u + tv) − F(u) ) / t,  ∀u ∈ V   (4.4)
           = inf_{t>0} ( F(u + tv) − F(u) ) / t,  ∀u ∈ V   (4.5)

Whenever F is convex and proper, the limit in (4.4) always exists (see [38, p. 23]). The definition in (4.4) is equivalent to (4.5) since, by Lemma 10 (p. 81), the function t ↦ ( F(u+tv) − F(u) ) / t, t ∈ R₊ \ {0}, goes to its unique infimum as t ↘ 0.

Remark 22 For nonconvex functions, defined on more general spaces, directional derivatives can be defined only using (4.4); they exist whenever the limit in (4.4) exists.
• F′(u)(v) is the right-hand side derivative. The left-hand side derivative is −F′(u)(−v).
• F convex ⇒ −F′(u)(−v) ≤ F′(u)(v).
• When F is differentiable at u in the usual sense, F′(u)(v) = ⟨∇F(u), v⟩, ∀v ∈ V.

Proposition 7 Let F : V → R be convex and proper. For any fixed u ∈ V, the function v ↦ F′(u)(v) is sublinear.

Proof. By Definition 26 (4.4), it is obvious that v ↦ F′(u)(v) is positively homogeneous, i.e. F′(u)(λv) = λ F′(u)(v), ∀λ > 0. Let us check that it is convex. Choose α > 0 and β > 0 such that α + β = 1, and v ∈ V and w ∈ V. For all t > 0,
  F( u + t(αv + βw) ) − F(u) = F( α(u + tv) + β(u + tw) ) − ( αF(u) + βF(u) )
  ≤ α ( F(u + tv) − F(u) ) + β ( F(u + tw) − F(u) ).
Divide by t > 0:
  ( F( u + t(αv + βw) ) − F(u) ) / t ≤ α ( F(u + tv) − F(u) ) / t + β ( F(u + tw) − F(u) ) / t, ∀t > 0.
For t ↘ 0 we get F′(u)(αv + βw) ≤ α F′(u)(v) + β F′(u)(w).

Proposition 8 If F ∈ Γ₀(V) is Lipschitz with constant ℓ > 0 on B(u, ρ), ρ > 0, then
  ‖z − u‖ < ρ  ⇒  |F′(z)(v) − F′(z)(w)| ≤ ℓ ‖v − w‖, ∀v, w ∈ V.

Proof. Let η > 0 be such that if 0 < t < η then z + tv ∈ B(u, ρ) and z + tw ∈ B(u, ρ). Then
  |F(z + tv) − F(z + tw)| ≤ ℓ t ‖v − w‖, ∀t ∈ ]0, η[
  ⇒ | ( F(z + tv) − F(z) ) − ( F(z + tw) − F(z) ) | ≤ ℓ t ‖v − w‖, ∀t ∈ ]0, η[
  ⇒ | ( F(z + tv) − F(z) ) / t − ( F(z + tw) − F(z) ) / t | ≤ ℓ ‖v − w‖, ∀t ∈ ]0, η[.
Taking the limit as t ↘ 0 yields the result.


Proposition 9 (1st-order approximation) Let F : R^n → R be convex and proper, with Lipschitz constant ℓ > 0, and let u ∈ R^n. Then ∀ε > 0, ∃ρ > 0 such that
  ‖v‖ < ρ  ⇒  |F(u + v) − F(u) − F′(u)(v)| ≤ ε ‖v‖.

Proof. By contradiction: let ε > 0 and (v_k)_{k∈N} with
  t_k := ‖v_k‖ ≤ 1/(k+1)
be such that
  |F(u + v_k) − F(u) − F′(u)(v_k)| > ε t_k = ε ‖v_k‖, ∀k ∈ N.
Since ‖v_k‖/t_k = 1, ∀k ∈ N, the sequence v_k/t_k lives in {w ∈ R^n : ‖w‖ = 1} (a compact set), so there is a subsequence (v_{k_j}) such that
  lim_j v_{k_j}/t_{k_j} = v̄,  ‖v̄‖ = 1.   (4.6)
For ℓ > 0 a local Lipschitz constant of F we have
  ε t_{k_j} < |F(u + v_{k_j}) − F(u) − F′(u)(v_{k_j})|
  ≤ |F(u + v_{k_j}) − F(u + t_{k_j} v̄)| + |F(u + t_{k_j} v̄) − F(u) − F′(u)(t_{k_j} v̄)| + |F′(u)(t_{k_j} v̄) − F′(u)(v_{k_j})|
  ≤ 2 ℓ ‖v_{k_j} − t_{k_j} v̄‖ + |F(u + t_{k_j} v̄) − F(u) − t_{k_j} F′(u)(v̄)|.
Note that by Proposition 8 (p. 82) we have |F′(u)(t_{k_j} v̄) − F′(u)(v_{k_j})| ≤ ℓ ‖t_{k_j} v̄ − v_{k_j}‖, which is used to get the last inequality.
Using that lim_j t_{k_j} = 0, along with the definition of F′ in (4.4) and (4.6),
  ε ≤ 2 ℓ lim_j ‖ v_{k_j}/t_{k_j} − v̄ ‖ + lim_j | ( F(u + t_{k_j} v̄) − F(u) ) / t_{k_j} − F′(u)(v̄) | = 0.
This contradicts the assumption that ε > 0.

4.2.3 Subdifferentials

For a given space V, the class of all subsets of V is denoted by 2^V.

Definition 27 Let F : V → R be convex and proper. The subdifferential of F is the set-valued operator ∂F : V → 2^V whose value at u ∈ V is

  ∂F(u) = { g ∈ V : ⟨g, v⟩ ≤ F′(u)(v), ∀v ∈ V }   (4.7)
        = { g ∈ V : F(z) ≥ F(u) + ⟨g, z − u⟩, ∀z ∈ V }   (4.8)

A subgradient of F at u is a selection g ∈ V such that g ∈ ∂F(u).
If F is differentiable at u, then ∂F(u) = {∇F(u)}.

Lemma 11 The two formulations for ∂F in (4.7) and in (4.8), Definition 27, are equivalent.

Proof. Since by Lemma 10 the function t ↦ ( F(u+tv) − F(u) ) / t, t ∈ R₊ \ {0}, attains its infimum as t ↘ 0, using the definition of F′ in (4.5) we can write
  ∂F(u) = { g ∈ V : ⟨g, v⟩ ≤ F′(u)(v), ∀v ∈ V }
        = { g ∈ V : ⟨g, v⟩ ≤ ( F(u + tv) − F(u) ) / t, ∀v ∈ V, ∀t > 0 }
        = { g ∈ V : ⟨g, (z − u)/t⟩ ≤ ( F(z) − F(u) ) / t, ∀z ∈ V, ∀t > 0 }   (use z = u + tv ⇔ v = (z − u)/t)
        = { g ∈ V : ⟨g, z − u⟩ ≤ F(z) − F(u), ∀z ∈ V }.
The conclusion follows from the observation that when (u, v) describes V² and t describes R₊ \ {0}, then z = u + tv describes V.

Observe that, by the definition of ∂F in (4.7) (p. 83),
  F′(u)(v) = sup { ⟨g, v⟩ : g ∈ ∂F(u) } = σ_{∂F(u)}(v)
where σ is the support function (see Definition 25, p. 80). Moreover,
  −F′(u)(−v) ≤ ⟨g, v⟩ ≤ F′(u)(v), ∀(g, v) ∈ ∂F(u) × V.

Property 9 If F ∈ Γ₀(V), then ∂F is a maximal monotone mapping, i.e.
• it is monotone: ∀u₁, u₂ ∈ V, ⟨g₂ − g₁, u₂ − u₁⟩ ≥ 0, ∀g₁ ∈ ∂F(u₁), ∀g₂ ∈ ∂F(u₂);
• and maximal: g₁ ∈ ∂F(u₁) whenever ⟨g₂ − g₁, u₂ − u₁⟩ ≥ 0 holds for all u₂ ∈ V and all g₂ ∈ ∂F(u₂).

Property 10 (p. 277, [56]) Let F ∈ Γ₀(V). Then the set ∂F(u) is closed and convex.
Proposition 10 (p. 263, [87]) Let G ∈ R^{m×n} and F : R^n → R read F(u) = ‖Gu‖₂. Then

  ∂F(u) = { G^T G u / ‖Gu‖₂ }  if Gu ≠ 0,
  ∂F(u) = { G^T h : ‖h‖₂ ≤ 1, h ∈ R^m }  if Gu = 0.   (4.9)

Proof. From the definition of the subdifferential (see Definition 27),
  ∂F(u) = { g ∈ R^n : ‖Gz‖₂ − ‖Gu‖₂ ≥ ⟨g, z − u⟩, ∀z ∈ R^n }.   (4.10)
If Gu ≠ 0, then F is differentiable at u and ∂F(u) = {∇F(u)}, hence the result¹.

¹ Let f(u) := ‖u‖₂ = ( Σ_i u[i]² )^{1/2}. Then u ≠ 0 ⇒ ∇f(u) = u/‖u‖₂. We have F = f ∘ G. For Gu ≠ 0, noticing that the Jacobian of u ↦ Gu is G, the chain rule ∇F(u) = G^T ∇f(Gu) entails the first equation in (4.9).


Consider that Gu = 0. Clearly, ‖Gz‖₂ = ‖G(z − u)‖₂. After setting w = z − u, (4.10) equivalently reads
  ∂F(u) = { g ∈ R^n : ‖Gw‖₂ ≥ ⟨g, w⟩, ∀w ∈ R^n }.   (4.11)
Define the set
  S := { G^T h : ‖h‖₂ ≤ 1, h ∈ R^m }.
If G^T h ∈ S, then the Schwarz inequality yields
  ⟨G^T h, w⟩ = ⟨h, Gw⟩ ≤ ‖Gw‖₂, ∀w ∈ R^n.
By (4.11), G^T h ∈ ∂F(u), and hence S ⊂ ∂F(u).
Suppose that
  ∃ g ∈ ∂F(u) obeying g ∉ S.   (4.12)
Then, using the H-B separation Theorem 3 (p. 13), there exist w ∈ R^n and α ∈ R such that the hyperplane {x ∈ R^n : ⟨w, x⟩ = α} separates g and S, i.e.
  ⟨g, w⟩ > α > ⟨z, w⟩, ∀z ∈ S.
Consequently,
  ⟨g, w⟩ > sup_h { ⟨G^T h, w⟩ : ‖h‖₂ ≤ 1, h ∈ R^m } = ‖Gw‖₂.
A comparison with (4.11) shows that g ∉ ∂F(u). Hence the assumption in (4.12) is false. Consequently, ∂F(u) = S.

In particular, Proposition 10 gives rise to the following well-known result:

Corollary 1 Let u ∈ R^n. Then

  ∂(‖u‖₂) = { u/‖u‖₂ }  if u ≠ 0,
  ∂(‖u‖₂) = { h ∈ R^n : ‖h‖₂ ≤ 1 }  if u = 0.   (4.13)

In words, the subdifferential of u ↦ ‖u‖₂ at the origin is the closed unit ball with respect to the ℓ₂ norm centered at zero.

Example 10 Let F : R → R read F(u) = |u|. By (4.13),
  u ≠ 0 ⇒ ∂F(u) = { u/|u| } = { sign(u) },
  u = 0 ⇒ ∂F(0) = { h ∈ R : |h| ≤ 1 } = [−1, +1].

Some calculus rules (see e.g. [93, 45])
• φ ∈ Γ₀(V) and ψ ∈ Γ₀(V) ⇒ ∂(αφ + βψ)(u) = α ∂φ(u) + β ∂ψ(u), ∀α, β ≥ 0, ∀u ∈ V.
• φ ∈ Γ₀(V) and ψ ∈ Γ₀(V) ⇒ ∂( φ(u) + ψ(v) ) = ∂φ(u) × ∂ψ(v), ∀u ∈ V, ∀v ∈ V.
• A : V → W a bounded linear operator, A* its adjoint, b ∈ W and F ∈ Γ₀(W) ⇒ ∂( F(Au + b) )(u) = A* ∂F(Au + b).
• φ ∈ Γ₀(W), ψ ∈ Γ₀(V), A : V → W a bounded linear operator, 0 ∈ int( dom φ − A(dom ψ) ) ⇒ ∂(φ ∘ A + ψ) = A* ∘ (∂φ) ∘ A + ∂ψ.
• Let U ⊂ V be nonempty, convex and closed. For any u ∈ V \ U we have ∂d_U(u) = { (u − Π_U u) / d_U(u) }.

Duality relations

Theorem 47 Let F ∈ Γ₀(V) and let F* : V → R be its convex conjugate (see (1.10), p. 14). Then

  v ∈ ∂F(u)  ⇔  F(u) + F*(v) = ⟨u, v⟩  ⇔  u ∈ ∂F*(v).   (4.14)

Proof. By (4.8) and using that F = F** (see Theorem 4, p. 15):
  v ∈ ∂F(u) ⇔ F(z) ≥ F(u) + ⟨v, z − u⟩, ∀z ∈ V
            ⇔ ⟨v, z⟩ − F(z) ≤ ⟨v, u⟩ − F(u), ∀z ∈ V
            ⇔ F*(v) = sup_{z∈V} ( ⟨v, z⟩ − F(z) ) = ⟨v, u⟩ − F(u)
            ⇔ F(u) + F*(v) = ⟨u, v⟩.
Since F** = F, the same argument applied to F* gives F*(v) + F**(u) = ⟨u, v⟩ ⇔ u ∈ ∂F*(v).

Corollary 2 We posit the conditions of Theorem 47. Then
  ∂F(u) = { v ∈ V : F(u) + F*(v) = ⟨u, v⟩ }.
Continuity properties of the subdifferential of F ∈ Γ₀(R^n)

Theorem 48 (Mean-value theorem) Let F : R^n → R be convex and proper. If u ≠ v, then there exist θ ∈ (0, 1) and g ∈ ∂F( θv + (1 − θ)u ) such that
  F(v) − F(u) = ⟨g, v − u⟩.
Proof: see [45, p. 257].

Property 11 ([45]) ∂F(u) is compact for any u ∈ R^n.

Property 12 (p. 282, [45]) u ↦ ∂F(u) is locally bounded: B ⊂ R^n bounded ⇒ ∂F(B) ⊂ R^n bounded.

Moreover [45, p. 282]:
• B ⊂ R^n compact ⇒ ∂F(B) compact.
• B ⊂ R^n compact and connected ⇒ ∂F(B) compact and connected.
• B ⊂ R^n convex ⇒ in general ∂F(B) is not convex.

Theorem 49 (Continuity) ∂F is outer semi-continuous at any u ∈ R^n:
  ∀ε > 0, ∃ρ > 0 : ‖u − v‖ ≤ ρ ⇒ ∂F(v) ⊂ ∂F(u) + B(0, ε).
Proof: see [45, p. 283].

Corollary 3 (p. 283, [45]) For F : R^n → R convex, the function u ↦ F′(u)(v) is upper semicontinuous:
  ∀u ∈ R^n, F′(u)(v) = lim sup_{z→u} F′(z)(v), ∀v ∈ R^n.

4.3 Optimality conditions

4.3.1 Unconstrained minimization problems

Theorem 50 (Necessary and sufficient condition for a minimum) Let F ∈ Γ₀(V). Then

  F(v) ≥ F(û), ∀v ∈ V  ⇔  0 ∈ ∂F(û)  ⇔  F′(û)(v) ≥ 0, ∀v ∈ V.

Proof. Using Definition 27, (4.8) (p. 83),
  g = 0 ∈ ∂F(û) ⇔ F(v) ≥ F(û) + ⟨0, v − û⟩, ∀v ∈ V ⇔ F(v) ≥ F(û), ∀v ∈ V.
Using Definition 27, (4.7), namely ∂F(u) = {g ∈ V : ⟨g, v⟩ ≤ F′(u)(v), ∀v ∈ V},
  g = 0 ∈ ∂F(û) ⇔ 0 ≤ F′(û)(v), ∀v ∈ V.
The proof is complete.

Denote the set of minimizers of F by

  Û = { û : û ∈ (∂F)⁻¹(0) }.   (4.15)

The set Û is closed and convex.

Remark 23 If F is strictly convex and coercive, then Û = {û}, i.e. the minimizer is unique.

Minimizing F amounts to solving the inclusion
  0 ∈ ∂F(u),
or equivalently, to finding a solution of the fixed point equation
  u = (Id + λ∂F)⁻¹(u),   (4.16)
where (Id + λ∂F)⁻¹ is the resolvent operator associated to ∂F and λ > 0 is a stepsize. The schematic algorithm resulting from (4.16), namely
  u^(k+1) = (Id + λ∂F)⁻¹(u^(k)),
is a fundamental tool for finding the root of any maximal monotone operator [74, 36], such as e.g. the subdifferential of a convex function. Often, (Id + λ∂F)⁻¹ cannot be calculated in closed form.
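To make the iteration concrete, here is a minimal sketch for the one case where the resolvent is explicit in this chapter, F(u) = |u| on R: by Example 10 and the shrinkage formula of Example 9, (Id + λ∂F)⁻¹ is soft-thresholding at level λ. The function names are illustrative, not part of any library.

```python
def resolvent_abs(v, lam):
    """(Id + lam*dF)^{-1}(v) for F(u) = |u|: soft-thresholding at level lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def proximal_point(u0, lam=0.5, n_iter=20):
    """Schematic iteration u_{k+1} = (Id + lam*dF)^{-1}(u_k) for F(u) = |u|."""
    u = u0
    for _ in range(n_iter):
        u = resolvent_abs(u, lam)
    return u   # converges (here in finitely many steps) to the minimizer u = 0
```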

4.3.2 General constrained minimization problems

Consider the problem

  F(û) = min_{u∈U} F(u)   (4.17)

where U ⊂ V is closed, convex and nonempty.

Remark 24 For any F ∈ Γ₀(V) and U ⊂ V convex, closed and nonempty, the problem in (4.17) can be redefined as an unconstrained nondifferentiable minimization problem via
  F̃(u) = F(u) + ι_U(u)  ⇒  arg min_{u∈V} F̃(u) = arg min_{u∈U} F(u),
where ι stands for the indicator function (see Definition 25, p. 80).


Definition 28 A direction v is normal to U ⊂ V at u if ⟨w − u, v⟩ ≤ 0, ∀w ∈ U.

Definition 29 The normal cone operator for U reads, for any u ∈ V,

  N_U(u) := { v ∈ V : ⟨w − u, v⟩ ≤ 0, ∀w ∈ U }  if u ∈ U,  and  N_U(u) = ∅  if u ∉ U.   (4.18)

N_U(u) is convex and closed. Note that if u = Π_U(v), then v − u ∈ N_U(u), ∀v ∈ V.

Lemma 12 For ι_U the indicator function of U and N_U as given in (4.18), we have
  ∂ι_U = N_U.

Proof. By Corollary 2 (p. 86) and using that ι_U*(v) = sup_{z∈U} ⟨z, v⟩ = σ_U(v) (for σ_U remind Definition 25 on p. 80 and (4.3) on p. 81):
  ∂ι_U(u) = { v ∈ V : ι_U(u) + ι_U*(v) = ⟨u, v⟩ } = { v ∈ V : ι_U(u) + sup_{z∈U} ⟨z, v⟩ = ⟨u, v⟩ }.
If u ∉ U, obviously ∂ι_U(u) = ∅. Consider next that u ∈ U, in which case ι_U(u) = 0. We obtain:
  ∂ι_U(u) = { v ∈ V : sup_{z∈U} ⟨z, v⟩ = ⟨u, v⟩ } = { v ∈ V : ⟨z, v⟩ ≤ ⟨u, v⟩, ∀z ∈ U } = N_U(u).

Theorem 51 Let F ∈ Γ₀(V) and let U ⊂ V be closed, convex and nonempty. Then

  F(û) = min_{u∈U} F(u)  ⇔  F′(û)(v − û) ≥ 0, ∀v ∈ U  ⇔  0 ∈ ∂F(û) + N_U(û).   (4.19)

Proof. ∀v ∈ U and ∀t ∈ ]0, 1] we have û + t(v − û) ∈ U. Then
  F(û) = min_{u∈U} F(u) ⇔ F( û + t(v − û) ) ≥ F(û), ∀t ∈ ]0, 1], ∀v ∈ U
  ⇔ ( F( û + t(v − û) ) − F(û) ) / t ≥ 0, ∀t ∈ ]0, 1], ∀v ∈ U
  ⇔ inf_{t∈]0,1]} ( F( û + t(v − û) ) − F(û) ) / t ≥ 0, ∀v ∈ U
  ⇔ F′(û)(v − û) ≥ 0, ∀v ∈ U.
By Lemma 12 (p. 88) we get
  ∂F(u) + N_U(u) = ∂F(u) + ∂ι_U(u) = ∂( F + ι_U )(u).
Using Theorem 50 (p. 87) and Remark 24 (p. 88),
  F(û) = min_{u∈U} F(u) ⇔ 0 ∈ ∂F(û) + ∂ι_U(û) = ∂F(û) + N_U(û),
which establishes the last equivalence in the theorem.

Remark 25 If U = V, the condition of Theorem 51 reads: F′(û)(v) ≥ 0, ∀v ∈ V.

4.3.3 Minimality conditions under explicit constraints

Consider problem (4.17), namely F(û) = min_{u∈U} F(u), for h_i ∈ Γ₀(R^n), 1 ≤ i ≤ q, and

  U = { u ∈ R^n : ⟨a_i, u⟩ = b_i, 1 ≤ i ≤ p (i.e. Au = b ∈ R^p), and h_i(u) ≤ 0, 1 ≤ i ≤ q }.   (4.20)

The set of active constraints at u reads I(u) := { i : h_i(u) = 0 }.

Theorem 52 ([45], Theorem 2.1.4 (p. 305) and Proposition 2.2.1 (p. 308)) Consider U as in (4.20) and F ∈ Γ₀(R^n). Then the following statements are equivalent:
1. û solves the problem F(û) = min_{u∈U} F(u);
2. ∃λ ∈ R^p, ∃µ ∈ R^q such that
   0 ∈ ∂F(û) + Σ_{i=1}^p λ_i a_i + Σ_{i=1}^q µ_i ∂h_i(û),  with µ_i ≥ 0 and µ_i h_i(û) = 0, 1 ≤ i ≤ q;   (4.21)
3. 0 ∈ ∂F(û) + N_U(û), where
   N_U(u) = { A^T λ + Σ_{i∈I(u)} µ_i z_i : λ ∈ R^p, µ_i ≥ 0, z_i ∈ ∂h_i(u), i ∈ I(u) }.   (4.22)

Remark 26 Note that the condition in 2, namely (4.21), ensures that the normal cone at û w.r.t. U reads as in (4.22), which entails that the constraints are qualified.
The existence of coefficients satisfying (4.21) is called the Karush-Kuhn-Tucker (KKT) conditions. The requirement µ_i h_i(û) = 0, 1 ≤ i ≤ q, is called the transversality or complementarity condition (or equivalently slackness).
(λ, µ) ∈ R^p × R^q₊ are called Lagrange multipliers. The Lagrangian function reads

  L(u, λ, µ) = F(u) + Σ_{i=1}^p λ_i ( ⟨a_i, u⟩ − b_i ) + Σ_{i=1}^q µ_i h_i(u).   (4.23)

Theorem 53 ([45], Theorem 4.4.3 (p. 337)) We posit the assumptions of Theorem 52. Then the following statements are equivalent:
1. û solves the problem F(û) = min_{u∈U} F(u);
2. ( û, (λ̂, µ̂) ) is a saddle point of L in (4.23) over R^n × R^p × R^q₊.

4.4 Minimization methods

Descent direction d: ∃ρ > 0 such that F(u − ρd) < F(u) ⇔ ⟨g_u, d⟩ > 0, ∀g_u ∈ ∂F(u).
Note that d = g_u for some g_u ∈ ∂F(u) is not necessarily a descent direction.

Example. F(u) = |u[1]| + 2|u[2]|.
  F′((1, 0))(v) = lim_{t↘0} ( |1 + t v[1]| − 1 + 2|t v[2]| ) / t = v[1] + 2|v[2]|,
  ∂F(1, 0) = { g ∈ R² : ⟨g, v⟩ ≤ F′((1, 0))(v), ∀v ∈ R² }
           = { g ∈ R² : g[1]v[1] + g[2]v[2] ≤ v[1] + 2|v[2]|, ∀v ∈ R² }
           = {1} × [−2, 2].
For instance, d = (1, 2) ∈ ∂F(1, 0) is not a descent direction: for g = (1, −2) ∈ ∂F(1, 0) one has ⟨g, d⟩ = 1 − 4 < 0, and indeed F( (1, 0) − ρ(1, 2) ) = 1 + 3ρ > F(1, 0) for all small ρ > 0.

The steepest descent method
  d_k ∈ arg min_{‖d‖=1} max_{g∈∂F(u_k)} ⟨g, d⟩
is unstable and may converge to non-optimal points [45, 14].

Some difficulties inherent to the minimization of nonsmooth functions
• Usually one cannot compute ∂F(u), but only some elements g ∈ ∂F(u).
• Stopping rule: 0 ∈ ∂F(u_k) is difficult to implement, and an approximation like ‖g_k‖ ≤ ε may never be satisfied.
• u_k → û and g_û ∈ ∂F(û) do not imply that subgradients g_k ∈ ∂F(u_k) converge to g_û.
• Minimizing a smooth approximation of F may generate large numerical errors, while the properties of the minimizer of F and those of the minimizer of its smoothed version are usually very different.
Hence specialized minimization algorithms are needed.

4.4.1 Subgradient methods

Subgradient projections are significantly easier to implement than exact projections and have been used for solving a wide range of problems.

Subgradient algorithm with projection
Minimize F : R^n → R subject to the constraint set U (closed, convex, nonempty, possibly = R^n).
For k ∈ N:
1. if 0 ∈ ∂F(u_k), stop (difficult to verify); else
2. obtain g_k ∈ ∂F(u_k) (easy in general);
3. (possible) line-search to find ρ_k > 0;
4. u_{k+1} = Π_U( u_k − ρ_k g_k/‖g_k‖ ).

Remark 27 d_k = g_k/‖g_k‖ is not necessarily a descent direction. This entails oscillations of the values (F(u_k))_{k∈N}.

Theorem 54 Suppose that F : R^n → R is convex and reaches its minimum at û ∈ R^n. If (ρ_k)_{k≥0} satisfies

  Σ_k ρ_k = +∞ and Σ_k ρ_k² < +∞,   (4.24)

then lim_k u_k = û.
Proof: see [79, 14].

By (4.24), ρ_k converges fast to 0, which means that the convergence of u_k is slow (sub-linear).
(E.g., ρ_k = 1/(k + 1) for k ∈ N.)
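A minimal sketch of the projected subgradient algorithm above, with the diminishing steps ρ_k = 1/(k+1); `subgrad_F` and `project_U` are placeholders assumed to be supplied by the user.

```python
import numpy as np

def projected_subgradient(subgrad_F, project_U, u0, n_iter=500):
    """Subgradient algorithm with projection and steps rho_k = 1/(k+1), satisfying (4.24).

    subgrad_F(u) : returns one element g of the subdifferential of F at u (assumed)
    project_U(u) : projection onto the closed convex set U                (assumed)
    """
    u = u0.copy()
    for k in range(n_iter):
        g = subgrad_F(u)
        gnorm = np.linalg.norm(g)
        if gnorm == 0.0:                  # 0 in dF(u_k): u_k is a minimizer
            return u
        rho = 1.0 / (k + 1)
        u = project_U(u - rho * g / gnorm)
    return u
```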


An adaptive level-set method for constrained nonsmooth minimization

Following Combettes in [28]:

  (P)  find û such that F(û) = min_{u∈U} F(u) =: F̂,

where F : R^n → R is convex and proper and U ⊂ R^n is closed, convex and ≠ ∅.

The lower level set of F at height ξ ∈ R,
  L(ξ) := { u ∈ R^n : F(u) ≤ ξ },
is convex and closed. Note that L(ξ) = ∅ if ξ < F̂.
For ξ ∈ R, u ∈ R^n and g_u ∈ ∂F(u), define the half-space
  H(u, g_u, ξ) = { v ∈ R^n : ⟨g_u, u − v⟩ ≥ F(u) − ξ }.
Let z ∈ L(ξ), so F(z) ≤ ξ. From the definition of ∂F in (4.8), ⟨g_u, u − z⟩ ≥ F(u) − F(z) ≥ F(u) − ξ. Then z ∈ H(u, g_u, ξ), hence
  L(ξ) ⊂ H(u, g_u, ξ).
The subgradient projection of u onto H(u, g_u, ξ) reads (see [27])
  P^ξ_{g_u}(u) = u  if F(u) ≤ ξ or 0 = g_u ∈ ∂F(u),
  P^ξ_{g_u}(u) = u + ( (ξ − F(u)) / ‖g_u‖² ) g_u  otherwise.
Indeed, ⟨g_u, u − P^ξ_{g_u}(u)⟩ = F(u) − ξ.

If F̂ is known: Polyak's method (cf. [79]):
  u_0 ∈ U, ∀k ∈ N: g_k ∈ ∂F(u_k), u_{k+1} = Π_U( P^{F̂}_{g_k}(u_k) ) → û.

In general F̂ is unknown. Idea: estimate F̂ adaptively. Adaptive level-set approach:

  u_0 ∈ U, ∀k ∈ N: g_k ∈ ∂F(u_k), u_{k+1} = Π_U( P^{ξ_k}_{g_k}(u_k) ),   (4.25)

where (ξ_k)_{k∈N} is a sequence of level estimates.

Theorem 55 ([48]) Let ξ̄ be an upper bound of F̂ and u_0 ∈ U. Let (u_k)_{k∈N} be generated by (4.25) with ξ_k ↘ ξ̄ as k → ∞. Then lim_k u_k ∈ L(ξ̄) ∩ U.

Theorem 56 ([2]) Suppose ξ ≤ F̂ and u_0 ∈ U. Let (u_k)_{k∈N} be generated by (4.25) with ξ_k = ξ, ∀k ∈ N. Then ∀ε ∈ ]0, ∞[, ∃n ∈ N such that F(u_n) ≤ F̂ + (F̂ − ξ) + ε.

These theorems show that if a reasonably tight estimate of F̂ is available, then (P) can be solved approximately by (4.25).
Comments on the algorithm proposed in [28]:
• Adaptive refinement of the estimated levels ξ_k;
• Explicit criterion to detect infeasibility (ξ_k < F̂).

Implementable algorithm
Determine Ũ ⊇ U such that û ∈ Ũ, with Ũ as tight as possible. Fix ε > 0 and 0 < θ < 1.
0. u_0 ∈ U, η_0 > ε and ζ_0 = F(u_0).
1. Set k = 0, y = u_k and δ_0 = d(y, Ũ)²; if η_k ≤ ε, stop.
2. Obtain g_k ∈ ∂F(u_k); if g_k = 0, stop.
3. ξ_k = ζ_k − η_k.
4. v = u_k + ( (ξ_k − F(u_k)) / ‖g_k‖² ) g_k;  u_{k+1} = Π_U(v);  δ_{k+1} = δ_k + ‖v − u_k‖² + ‖u_{k+1} − v‖².
5. Test: if δ_{k+1} > ‖y − u_{k+1}‖², go to 6; otherwise go to 7.
6. η_{k+1} = θ η_k, ζ_{k+1} = ζ_k, u_{k+1} = u_k; go to 1.
7. η_{k+1} = η_k, ζ_{k+1} = min{ F(u_{k+1}), ζ_k }; go to 2.

We need to determine a Ũ as tight as possible in order to get a better convergence rate. Note that ξ_{k+1} ≤ ξ_k. Step 5 is a test to check whether ξ_k < F̂ (in which case L(ξ_k) = ∅); if so, one decreases η in step 6, otherwise one updates ζ in step 7.
Remark 27 still holds true, i.e. (F(u_k))_{k∈N} oscillates. The choice of step-size is, however, improved with respect to (4.24).
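Here is a minimal sketch of one iteration of (4.25), i.e. the subgradient projection onto H(u, g, ξ) followed by the projection onto U; with ξ = F̂ it reduces to Polyak's step. `F`, `subgrad_F` and `project_U` are placeholders assumed to be supplied by the user.

```python
import numpy as np

def level_subgradient_step(u, F, subgrad_F, project_U, xi):
    """One iteration of (4.25): P_g^xi(u) followed by projection onto U."""
    g = subgrad_F(u)
    if F(u) > xi and np.dot(g, g) > 0.0:
        u = u + (xi - F(u)) / np.dot(g, g) * g   # subgradient projection onto H(u, g, xi)
    return project_U(u)
```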

4.4.2 Some algorithms specially adapted to the shape of F

Given the form of F, one can often design specific (well adapted) optimization algorithms. There are numerous examples in the literature. We present some outstanding samples.

TV minimization (Chambolle 2004)
Reference: [16].
Minimize F(u) = (1/2)‖u − v‖₂² + λ TV(u), where

  TV(u) = sup { ∫_Ω u(x) div ξ(x) dx : ξ ∈ C_c¹(Ω, R²), |ξ(x)| ≤ 1, ∀x ∈ Ω }.   (4.26)

Let u ∈ R^{n×n} and v ∈ R^{n×n}. Notations:
• ∇ : R^{n×n} → (R^{n×n})² is the discrete equivalent of the gradient; −∇* is the adjoint of ∇.
• Div is the discrete equivalent of div, defined by Div = −∇*.
∇ and Div are matrix operators as described below:

  (∇u)[i, j] = ( u[i+1, j] − u[i, j] ; u[i, j+1] − u[i, j] ),  1 ≤ i, j ≤ n,

with boundary conditions u[n+1, i] = u[n, i], u[i, n+1] = u[i, n]. Div is hence defined by ⟨Div ξ, u⟩_{R^{n×n}} = −⟨ξ, ∇u⟩_{(R^{n×n})²}. One checks easily that

  (Div ξ)[i, j] = ( ξ¹[i, j] − ξ¹[i−1, j] ) + ( ξ²[i, j] − ξ²[i, j−1] ),  1 ≤ i, j ≤ n,

with boundary conditions ξ¹[0, i] = ξ¹[n, i] = ξ²[i, 0] = ξ²[i, n] = 0.
Define

  K = { λ Div ξ : ξ ∈ (R^{n×n})², ‖ξ[i, j]‖₂ ≤ 1, 1 ≤ i, j ≤ n } ⊂ R^{n×n},   (4.27)

where ξ[i, j] = ( ξ¹[i, j], ξ²[i, j] ). Let us emphasize that K is convex and closed².
The discrete equivalent of the TV regularization in (4.26) reads
  λ TV(u) = sup { ⟨u, w⟩ : w ∈ K } = σ_K(u).
Its convex conjugate (see (1.10), p. 14) is
  (λTV)*(w) = sup_u ( ⟨u, w⟩ − λTV(u) ) = ι_K(w),
where we use σ_K* = ι_K, see (4.3) (p. 81). Since λTV ∈ Γ₀(R^{n×n}), using the Fenchel-Moreau Theorem 4 (p. 15), λTV(u) = (λTV)**(u).
Using that
  ∂F(u) = u − v + λ∂TV(u),
Theorem 50 (p. 87) implies that F has a minimum at û if and only if
  0 ∈ ∂F(û) ⇔ 0 ∈ û − v + λ∂TV(û) ⇔ v − û ∈ λ∂TV(û)
  ⇔ û ∈ ∂(λTV)*(v − û)   (use Theorem 47, p. 86).
Set w̄ := v − û. Then
  v − w̄ ∈ ∂(λTV)*(w̄) ⇔ 0 ∈ w̄ − v + ∂(λTV)*(w̄)
  ⇔ w̄ minimizes the function F̃ below:
  min_{w∈R^{n×n}} F̃(w) = min_{w∈R^{n×n}} { (1/2)‖w − v‖² + (λTV)*(w) } = min_{w∈K} (1/2)‖w − v‖².
Hence
  w̄ = Π_K(v)  and  û = v − Π_K(v).   (4.28)

² The set K is composed of oscillatory components.

where Π_K is the orthogonal projection onto K. To calculate û, we need to find the λ Div ξ that solves the nonlinear projection onto K:
  min { ‖λ Div ξ − v‖² : ξ ∈ (R^{n×n})², ‖ξ[i, j]‖₂² − 1 ≤ 0, 1 ≤ i, j ≤ n }.
KT optimality conditions (Theorem 39, p. 66): see [16] for details.
Algorithm:

  ξ_{k+1}[i, j] = ( ξ_k[i, j] + τ ( ∇( Div ξ_k − v/λ ) )[i, j] ) / ( 1 + τ ‖( ∇( Div ξ_k − v/λ ) )[i, j]‖ ).

Theorem 57 (p. 91, [16]) For τ ≤ 1/8, λ Div ξ_k converges to w̄ = Π_K(v) as k → ∞.

This algorithm amounts to a special instance of Bermùdez and Moreno's algorithm (1979), see Aujol 2009 [9]. Hence convergence holds for τ ≤ 1/4.
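A minimal NumPy sketch of the fixed-point iteration above, under the assumption that the discrete gradient and divergence are implemented with the boundary conditions stated earlier; the function names are illustrative.

```python
import numpy as np

def grad(u):
    """Discrete gradient (last row/column difference set to 0)."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def chambolle_tv(v, lam, tau=0.125, n_iter=100):
    """Fixed-point iteration of Theorem 57; returns u ~ v - Pi_K(v)."""
    px = np.zeros_like(v); py = np.zeros_like(v)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - v / lam)
        norm = np.sqrt(gx**2 + gy**2)
        px = (px + tau * gx) / (1.0 + tau * norm)
        py = (py + tau * gy) / (1.0 + tau * norm)
    return v - lam * div(px, py)
```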
One coordinate at a time when the non-differentiable term is separable

Consider

  F(u) = Ψ(u) + Σ_{i=1}^n β_i φ_i(|u[i]|),  β_i ≥ 0, φ_i′(0⁺) > 0, 1 ≤ i ≤ n.   (4.29)

Remark 28 Note that (4.1), namely F(u) = Σ_{i=1}^n ψ(|⟨a_i, u⟩ − v_i|) + βΦ(u) with ψ′(0⁺) > 0, can be rewritten in the form of (4.29), provided that {a_i, 1 ≤ i ≤ n} is linearly independent. Let A ∈ R^{n×n} be the matrix whose rows are a_i^T, 1 ≤ i ≤ n. Using the change of variables z = Au − v, we consider
  F̃(z) = Σ_{i=1}^n ψ(|z_i|) + βΦ( A⁻¹(z + v) ).

Theorem 58 Let F : R^n → R in (4.29) be convex, proper and coercive, let Ψ be convex and C¹, β_i ≥ 0, and let φ_i : R₊ → R be convex, C¹ with φ_i′(0⁺) > 0, 1 ≤ i ≤ n. Suppose that Ψ satisfies one of the conditions below:
1. Ψ is coercive;
2. ∀ρ > 0, ∃θ > 0 such that ‖u‖ ≤ ρ, |t| ≤ ρ ⇒ Ψ(u + t e_i) − Ψ(u) ≤ t ∂Ψ(u)/∂u_i + θ t², 1 ≤ i ≤ n.
Then the one coordinate at a time method given in 2.2 (p. 22) converges to a minimizer û of F.
Proof: under condition 1, see [44]; under condition 2, see [65].

Comments
• Condition 2 is quite loose since it does not require that Ψ is globally coercive.
• The method cannot be extended to non-separable Φ.

Algorithm.
∀k ∈ N, each iteration k has n steps. For 1 ≤ i ≤ n, compute
  ξ = (∂Ψ/∂u[i]) ( u^(k)[1], …, u^(k)[i−1], 0, u^(k−1)[i+1], …, u^(k−1)[n] );
  if |ξ| ≤ β_i φ_i′(0⁺), then u^(k)[i] = 0;   (★)
  else u^(k)[i] solves, on R \ {0}, the equation
    β_i φ_i′(|u^(k)[i]|) sign(u^(k)[i]) + (∂Ψ/∂u[i]) ( u^(k)[1], …, u^(k)[i−1], u^(k)[i], u^(k−1)[i+1], …, u^(k−1)[n] ) = 0,
  where sign(u^(k)[i]) = −sign(ξ).
The components located at kinks are found exactly in (★). Note that for t ≠ 0 we have d/dt φ_i(|t|) = φ_i′(|t|) sign(t). In particular, for φ_i(t) = t we have good simplifications: φ_i′(0⁺) = 1 and d/dt φ_i(|t|) = sign(t) if t ≠ 0.
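A minimal sketch for the illustrative special case Ψ(u) = ½‖Au − v‖₂², φ_i(t) = t, β_i = β (an assumption, not the general setting): the test (★) becomes a soft-threshold condition and the scalar updates are exact.

```python
import numpy as np

def coord_descent_l1(A, v, beta, n_iter=50):
    """One-coordinate-at-a-time minimization of 0.5*||A u - v||^2 + beta*||u||_1."""
    n = A.shape[1]
    u = np.zeros(n)
    col_sq = (A**2).sum(axis=0)              # ||a_i||^2 for each column a_i
    for _ in range(n_iter):
        for i in range(n):
            r = v - A @ u + A[:, i] * u[i]   # residual with the i-th contribution removed
            xi = A[:, i] @ r                 # minus dPsi/du[i] evaluated at u[i] = 0
            if abs(xi) <= beta:              # the test (*): the exact minimizer sits at the kink
                u[i] = 0.0
            else:
                u[i] = (xi - beta * np.sign(xi)) / col_sq[i]
    return u
```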
Useful reformulation
For any w ∈ R^n we consider the decomposition
  w = w⁺ − w⁻, where w⁺[i] ≥ 0, w⁻[i] ≥ 0, 1 ≤ i ≤ n.
Then
  minimize ‖w‖₁ ⇔ minimize Σ_i ( w⁺[i] + w⁻[i] ) subject to w⁺[i] ≥ 0, w⁻[i] ≥ 0, 1 ≤ i ≤ n.
This can be used to minimize TV(u) = ‖Du‖₁, see [4, 40].

4.4.3 Variable-splitting combined with penalty techniques

We will describe the approach in the context of an objective function F : R^n → R of the form

  F(u) = ‖Au − v‖₂² + β Σ_i ‖D_i u‖₂ = ‖Au − v‖₂² + β TV(u),   (4.30)

where, for any i, D_i u ∈ R² is the discrete approximation of the gradient of u at i. The second summand is hence the nonsmooth TV regularization [76].
The idea of variable-splitting and penalty techniques in optimization is as follows: with each nonsmooth term ‖D_i u‖₂ an auxiliary variable b_i ∈ R² is associated in order to transfer D_i u out of the nondifferentiable term ‖·‖₂, and the difference between b_i and D_i u is penalized. The resultant approximation to (4.30) is

  F̃(u, b) = ‖Au − v‖₂² + (γ/2) Σ_i ‖b_i − D_i u‖₂² + β Σ_i ‖b_i‖₂   (4.31)

with a sufficiently large penalty parameter γ. This quadratic penalty approach can be traced back as early as Courant [32] in 1943. It is well known that the solution of (4.31) converges to that of (4.30) as γ → ∞. The motivation for this formulation is that, while either one of the two variables u and b is fixed, minimizing the function with respect to the other has a closed-form formula with low computational complexity. Indeed, F̃ is quadratic in u and separable in b. A connection with the additive form of half-quadratic regularization, see (2.36) on p. 39, was investigated in [87]. For γ fixed, an alternating minimization scheme is applied to F̃:

  u_{k+1} = arg min_{u∈R^n} { ‖Au − v‖₂² + (γ/2) Σ_i ‖b_i^k − D_i u‖₂² },   (4.32)
  b_i^{k+1} = arg min_{b_i} { (γ/2) ‖b_i − D_i u_{k+1}‖₂² + β ‖b_i‖₂ }.   (4.33)

Finding u_{k+1} in (4.32) amounts to solving a linear problem. According to the content of A, fast solvers have been exhibited (see e.g. [62]).
The computation of b_i^{k+1} in (4.33) can be done in an exact and fast way using the multidimensional shrinkage suggested in [87] and proven in [91], presented next.
Lemma 13 ([91], p. 577) For β > 0, γ > 0 and v ∈ R^n, where n is any positive integer, set

  f(u) = (γ/2) ‖u − v‖₂² + β ‖u‖₂.

The unique minimizer of f is given by

  û = max{ ‖v‖₂ − β/γ, 0 } v/‖v‖₂,   (4.34)

where the convention 0 · (0/0) = 0 is followed.


Proof. Since f is strictly convex, bounded below and coercive, it has a unique minimizer û. Using Corollary 1 (p. 85), the subdifferential of f reads
  ∂f(u) = γ(u − v) + β u/‖u‖₂  if u ≠ 0,
  ∂f(u) = γ(u − v) + β { h ∈ R^n : ‖h‖₂ ≤ 1 }  if u = 0.
According to the optimality conditions, 0 ∈ ∂f(û), hence
  γ(û − v) + β û/‖û‖₂ = 0  if û ≠ 0,
  γ v ∈ β { h ∈ R^n : ‖h‖₂ ≤ 1 }  if û = 0.
From the second condition, it is obvious that
  ‖v‖₂ ≤ β/γ  ⇒  û = 0.   (4.35)
Consider next that û ≠ 0. Then
  û ( 1 + β/(γ‖û‖₂) ) = v  ⇒  ‖û‖₂ + β/γ = ‖v‖₂.
It follows that there exists a positive constant c such that û = c v. The latter obeys
  v ( c + β/(γ‖v‖₂) − 1 ) = 0  ⇒  c = 1 − β/(γ‖v‖₂).
Note that c > 0 if and only if ‖v‖₂ > β/γ. Hence
  û = ( 1 − β/(γ‖v‖₂) ) v = ( ‖v‖₂ − β/γ ) v/‖v‖₂  if ‖v‖₂ > β/γ.   (4.36)
Combining (4.35) and (4.36) leads to (4.34).


Using Lemma 13, each bik+1 in (4.33) read


Di uk+1

bik+1 = max Di uk+12 , 0


2
Di uk+1 2

A convergence analysis of the relevant numerical scheme is provided in [87]. In the same reference,
it is exhibited how to apply the continuation on (i.e. the increasing of ) in order to improve the
convergence.
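The sketch below illustrates the alternating scheme (4.32)-(4.33) on the 1D analogue of (4.30) with A = Id and D_i u = u[i+1] − u[i]; this simplification (and the fixed penalty γ) is an assumption made only to keep the example small and self-contained.

```python
import numpy as np

def shrink(x, t):
    """Shrinkage (4.34); componentwise here since each D_i u is a scalar in 1D."""
    return np.maximum(np.abs(x) - t, 0.0) * np.sign(x)

def split_penalty_tv1d(v, beta, gamma, n_iter=100):
    """Alternating minimization of ||u - v||^2 + (gamma/2)*||b - D u||^2 + beta*||b||_1."""
    n = len(v)
    D = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]              # forward differences, (n-1) x n
    b = np.zeros(n - 1)
    M = 2 * np.eye(n) + gamma * D.T @ D                   # normal equations of the u-step
    for _ in range(n_iter):
        u = np.linalg.solve(M, 2 * v + gamma * D.T @ b)   # (4.32): quadratic in u
        b = shrink(D @ u, beta / gamma)                   # (4.33): exact shrinkage step
    return u
```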
Remark 29 If n = 1 in Lemma 13, that is,
  f(u) = (γ/2)(u − v)² + β|u|, u ∈ R,
then (4.34) amounts to the soft-shrinkage operator
  û = max{ |v| − β/γ, 0 } sign(v) = 0 if |v| ≤ β/γ,  û = v − (β/γ) sign(v) if |v| > β/γ.
Remind also Example 9 on p. 79.

Similar ideas have recently been used to minimize F(u) = ‖Au − v‖₁ + β Σ_i ‖D_i u‖₂, where the D_i are quite general matrices, in [92] for multichannel image restoration.

4.5 Splitting methods

Let V be a Hilbert space³ and let Ψ ∈ Γ₀(V) and Φ ∈ Γ₀(V). Consider the problem

  find û = arg min_u F(u),  F(u) = Ψ(u) + Φ(u).

The astuteness of splitting methods for F = Ψ + Φ is to use separately
  prox_Ψ = (Id + ∂Ψ)⁻¹ and prox_Φ = (Id + ∂Φ)⁻¹.
Usually these are much easier to obtain than the resolvent for F, namely prox_F = (Id + ∂F)⁻¹.
Even though the literature is abundant, these methods can basically be systematized into three main classes:
• forward-backward, see [41, 82, 83];
• Douglas/Peaceman-Rachford, see [51];
• double-backward (little used), see [50, 71].
A recent theoretical overview of all these methods can be found in [29, 37]. They are essentially based on proximal calculus.

³ Hence V = V′.

4.5.1 Moreau's decomposition and proximal calculus

Proximity operators were inaugurated in [57] as a generalization of convex projection operators.

Proposition 11 (p. 278, [56]) Let f ∈ Γ₀(V). For any v ∈ V, the function P_f : V → R below,

  P_f(u) = (1/2)‖v − u‖² + f(u),   (4.37)

admits a unique minimizer.

Definition 30 The unique minimizer in Proposition 11 is denoted by

  prox_f v = arg min_{u∈V} P_f(u).   (4.38)

prox_f is the proximity operator for f.

Remark 30 From (4.37) and (4.38), the optimality conditions (Theorem 50, p. 87) yield
  û = prox_f v ⇔ 0 ∈ ∂P_f(û) ⇔ 0 ∈ û − v + ∂f(û) ⇔ v − û ∈ ∂f(û) ⇔ û = (Id + ∂f)⁻¹(v)   (4.39)
  ⇒ (Id + ∂f)⁻¹ = prox_f.

Example 11 Several examples of prox_f for f ∈ Γ₀(V):
1. f(u) = ⟨a, u⟩ + α, a ∈ V, α ∈ R. Then prox_f v = v − a (a translation).
2. U ⊂ V closed, convex and ≠ ∅, f(u) = ι_U(u). Then prox_{ι_U} v = Π_U(v).
3. f(u) = α‖u‖₂², α ≥ 0. Then prox_f v = v / (1 + 2α).
4. f(u) = α‖u‖₂, α ≥ 0. Then prox_f v = max{ ‖v‖₂ − α, 0 } v/‖v‖₂ (see Lemma 13, p. 97).
5. f(u) = φ(−u). Then prox_f v = −prox_φ(−v).
6. f(u) = φ(u − z), z ∈ V. Then prox_f v = z + prox_φ(v − z).
7. f(u) = φ(u/ρ), ρ ∈ R \ {0}. Then prox_f v = ρ prox_{φ/ρ²}(v/ρ).
8. f(u) = Σ_{i=1}^n β_i |u[i] − z[i]|, β_i ≥ 0, ∀i. Then (prox_f v)[i] = z[i] + T_{β_i}( v[i] − z[i] ), ∀i, where for β > 0,
   T_β(t) := 0 if |t| ≤ β;  T_β(t) := t − β sign(t) otherwise.
   (T_β is a soft shrinkage operator.) See Remark 29 on p. 98 and Example 9 on p. 79.
9. prox_{λTV} v = v − Π_K(v), where K is given in (4.27), p. 94; see (4.28).
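A few items of Example 11 written out in NumPy (a minimal sketch; the box constraint in the last function is only an illustrative choice of U):

```python
import numpy as np

def prox_sq_l2(v, alpha):
    """Item 3: prox of alpha*||u||_2^2 is a pure scaling."""
    return v / (1.0 + 2.0 * alpha)

def prox_l2(v, alpha):
    """Item 4: prox of alpha*||u||_2 (multidimensional shrinkage, Lemma 13)."""
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= alpha else (1.0 - alpha / nv) * v

def prox_l1(v, alpha):
    """Item 8 with z = 0 and beta_i = alpha: componentwise soft shrinkage T_alpha."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def prox_indicator_box(v, lo, hi):
    """Item 2 for U = a box: the proximity operator of the indicator is the projection."""
    return np.clip(v, lo, hi)
```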

Some important properties

Relationship with the classical Projection Theorem:

Theorem 59 ([56], p. 280) Let f ∈ Γ₀(V) and let f* ∈ Γ₀(V) be its convex conjugate. For any u, v, w ∈ V we have the equivalence

  w = u + v and f(u) + f*(v) = ⟨u, v⟩  ⇔  u = prox_f w, v = prox_{f*} w.

Meaning: link with the Projection theorem.


Proof. Two parts.
(⇒) By the definition of the convex conjugate⁴, f*(v) = sup_{z∈V} ( ⟨z, v⟩ − f(z) ), we have
  ⟨u, v⟩ − f(u) = f*(v) ≥ ⟨z, v⟩ − f(z), ∀z ∈ V  ⇒  f(z) − f(u) ≥ ⟨z − u, v⟩, ∀z ∈ V.   (4.40)
Then, for all z ∈ V,
  P_f(z) − P_f(u) = (1/2)‖z − w‖² + f(z) − (1/2)‖u − w‖² − f(u)
    (set v = w − u)
  = (1/2)‖z − u − v‖² − (1/2)‖v‖² + f(z) − f(u)
  ≥ (1/2)‖z − u − v‖² − (1/2)‖v‖² + ⟨z − u, v⟩   (by (4.40))
  = (1/2)‖z − u‖² − ⟨z − u, v⟩ + (1/2)‖v‖² − (1/2)‖v‖² + ⟨z − u, v⟩
  = (1/2)‖z − u‖² ≥ 0.
It follows that u is the unique point such that P_f(u) ≤ P_f(z), ∀z ∈ V, with strict inequality for z ≠ u. Hence u = prox_f w. In a similar way one establishes that v = prox_{f*} w.
(⇐) Since u = prox_f w for w ∈ V, we know that u is the unique minimizer of P_f on V for this w (see Proposition 11 on p. 99). Hence, setting ṽ = w − u, for all z ∈ V and t ∈ R,
  P_f( t(z − u) + u ) = (1/2)‖t(z − u) + u − w‖² + f( t(z − u) + u )
  ≥ (1/2)‖u − w‖² + f(u) = P_f(u) = (1/2)‖ṽ‖² + f(u).   (4.41)
Rearranging the obtained inequality yields
  (1/2)‖ṽ‖² − (1/2)‖t(z − u) − ṽ‖² ≤ f( t(z − u) + u ) − f(u), ∀z ∈ V, ∀t ∈ R.
Noticing that (1/2)‖ṽ‖² − (1/2)‖t(z − u) − ṽ‖² = −(1/2)t²‖z − u‖² + t⟨z − u, ṽ⟩, the last inequality entails that
  −(1/2)t²‖z − u‖² + t⟨z − u, ṽ⟩ ≤ f( t(z − u) + u ) − f(u), ∀z ∈ V, ∀t ∈ R.
Remind that t(z − u) + u = tz + (1 − t)u. Since f is convex, for t ∈ ]0, 1[ we have
  f( t(z − u) + u ) ≤ t f(z) + (1 − t) f(u) = f(u) + t ( f(z) − f(u) ), ∀t ∈ ]0, 1[, ∀z ∈ V.
Inserting the last inequality into the previous one leads to
  −(1/2)t²‖z − u‖² + t⟨z − u, ṽ⟩ ≤ t ( f(z) − f(u) ), ∀t ∈ ]0, 1[, ∀z ∈ V,
  ⇒ −(1/2)t‖z − u‖² + ⟨z − u, ṽ⟩ ≤ f(z) − f(u), ∀t ∈ ]0, 1[, ∀z ∈ V,
  ⇒ ⟨z, ṽ⟩ − f(z) ≤ ⟨u, ṽ⟩ − f(u) + (1/2)t‖z − u‖², ∀t ∈ ]0, 1[, ∀z ∈ V.
Letting t ↘ 0 yields
  ⟨z, ṽ⟩ − f(z) ≤ ⟨u, ṽ⟩ − f(u), ∀z ∈ V.
Then
  f*(ṽ) = sup_{z∈V} ( ⟨z, ṽ⟩ − f(z) ) = ⟨u, ṽ⟩ − f(u).
Taking v = ṽ = w − u achieves the proof.

⁴ This definition was already given in (1.10) on p. 14.

In words, it was proven that ∀w ∈ V, for a given f ∈ Γ₀(V), we have a unique decomposition

  w = prox_f w + prox_{f*} w.

Definition 31 An operator T : V → V is
1. nonexpansive if ‖Tu − Tv‖ ≤ ‖u − v‖, ∀u ∈ V, ∀v ∈ V;
2. firmly nonexpansive if ‖Tu − Tv‖² ≤ ⟨Tu − Tv, u − v⟩, ∀u ∈ V, ∀v ∈ V.
Hence a firmly nonexpansive operator is always nonexpansive.

Lemma 14 If T : V → V is firmly nonexpansive, then the operator (Id − T) : V → V is firmly nonexpansive.

Proof. Using that T is firmly nonexpansive, we obtain, for all u ∈ V and v ∈ V,
  ‖(Id − T)u − (Id − T)v‖² − ⟨(Id − T)u − (Id − T)v, u − v⟩
  = ‖u − v − (Tu − Tv)‖² − ⟨u − v − (Tu − Tv), u − v⟩
  = ‖Tu − Tv‖² + ‖u − v‖² − 2⟨Tu − Tv, u − v⟩ + ⟨Tu − Tv, u − v⟩ − ‖u − v‖²
  = ‖Tu − Tv‖² − ⟨Tu − Tv, u − v⟩ ≤ 0,
where the last inequality comes from Definition 31-2. Hence the result.

Proposition 12 ([31], Lemma 2.4) For f ∈ Γ₀(V), prox_f and (Id − prox_f) are firmly nonexpansive.

Proposition 13 Let f ∈ Γ₀(V). The operator prox_f is monotone in the sense that
  ⟨ prox_f w − prox_f z, prox_{f*} w − prox_{f*} z ⟩ ≥ 0, ∀w ∈ V, ∀z ∈ V.

Proof. For w ∈ V and z ∈ V set
  u = prox_f w,  u′ = prox_f z,  v = prox_{f*} w,  v′ = prox_{f*} z.
Noticing that by Theorem 59 we have
  f(u) + f*(v) = ⟨u, v⟩ and f(u′) + f*(v′) = ⟨u′, v′⟩,   (4.42)
and that by the Fenchel-Young inequality (1.11) (p. 14)
  f*(v′) ≥ ⟨u, v′⟩ − f(u) and f(u′) ≥ ⟨v, u′⟩ − f*(v),   (4.43)
we obtain
  ⟨ prox_f w − prox_f z, prox_{f*} w − prox_{f*} z ⟩ = ⟨u − u′, v − v′⟩
  = ⟨u, v⟩ + ⟨u′, v′⟩ − ⟨u, v′⟩ − ⟨u′, v⟩   (use (4.42))
  = f(u) + f*(v) + f(u′) + f*(v′) − ⟨u, v′⟩ − ⟨u′, v⟩   (use (4.43))
  ≥ f(u) + f*(v) + f(u′) + f*(v′) − ( f(u) + f*(v′) ) − ( f(u′) + f*(v) ) = 0.
The proof is complete.

4.5.2 Forward-Backward (FB) splitting

Forward-backward splitting can be seen as a generalization of the classical gradient projection method for constrained convex optimization. One must assume that either Ψ or Φ is differentiable.

Proposition 14 ([31], Proposition 3.1) Assume that Ψ ∈ Γ₀(V), Φ ∈ Γ₀(V), where V is a Hilbert space, that Ψ is differentiable and that F = Ψ + Φ is coercive. Let γ ∈ ]0, +∞[. Then

  F(û) = min_{u∈V} F(u)  ⇔  û = prox_{γΦ}( û − γ∇Ψ(û) ).

Moreover, û is unique if either Ψ or Φ is strictly convex.

Proof. A sequence of equivalences:
  F(û) = min_{u∈V} F(u)
  ⇔ 0 ∈ ∂(Ψ + Φ)(û) = ∂Φ(û) + {∇Ψ(û)}
  ⇔ −γ∇Ψ(û) ∈ γ∂Φ(û)
  ⇔ û − γ∇Ψ(û) ∈ û + γ∂Φ(û)   (use (4.39), p. 99, identifying û − γ∇Ψ(û) with v)
  ⇔ û = prox_{γΦ}( û − γ∇Ψ(û) ).
F is strictly convex if either Ψ or Φ is so, in which case û is unique.

FB splitting relies on the fixed-point equation
  u = prox_{γΦ}( u − γ∇Ψ(u) ).
The basic iteration consists of two steps:
  u_{k+1/2} = u_k − γ∇Ψ(u_k)   (forward, explicit)
  u_{k+1} = prox_{γΦ}( u_{k+1/2} )   (backward, implicit)
Formally, the second step amounts to solving an inclusion, hence its implicit nature.
A practical algorithm proposed by Combettes and Wajs in [31] is a more general iteration where γ is iteration-dependent, errors are allowed in the evaluation of the operators prox_{γΦ} and ∇Ψ, and a relaxation sequence (λ_k)_{k∈N} is introduced. Admission of errors allows some tolerance in the numerical implementation of the algorithm, while the iteration-dependent parameters (γ_k)_{k∈N} and (λ_k)_{k∈N} can improve its convergence speed.
Theorem 60 ([31], Theorem 3.4) Assume that Ψ ∈ Γ₀(V), Φ ∈ Γ₀(V), where V is a Hilbert space, that Ψ is differentiable with a 1/β-Lipschitz gradient, β ∈ ]0, ∞[, and that F = Ψ + Φ is coercive. Let the sequences (γ_k)_{k∈N} ⊂ R₊, (λ_k)_{k∈N} ⊂ R₊, (a_k)_{k∈N} ⊂ V and (b_k)_{k∈N} ⊂ V satisfy
• 0 < inf_{k∈N} γ_k ≤ sup_{k∈N} γ_k < 2β;
• 0 < inf_{k∈N} λ_k ≤ sup_{k∈N} λ_k ≤ 1;
• Σ_{k∈N} ‖a_k‖ < +∞ and Σ_{k∈N} ‖b_k‖ < +∞.
Consider the iteration:
  u_{k+1} = u_k + λ_k ( prox_{γ_kΦ}( u_k − γ_k(∇Ψ(u_k) + b_k) ) + a_k − u_k ), k ∈ N.
Then
1. (u_k)_{k∈N} converges weakly to some û ∈ Û.
2. (u_k)_{k∈N} converges strongly to some û ∈ Û if one of the following conditions holds:
   • int Û ≠ ∅;
   • either Ψ or Φ satisfies the following condition on Û: for f ∈ Γ₀(V), all sequences (v_k)_{k∈N} and (w_k)_{k∈N} in V and all v ∈ V with v ∈ ∂f(w),
     [ w_k ⇀ w, v_k → v, v_k ∈ ∂f(w_k), ∀k ∈ N ] ⇒ w is a strong cluster point of (w_k)_{k∈N}.
Let us emphasize that if V = R^n, convergence is always strong.
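A minimal sketch of the basic (unrelaxed, error-free) forward-backward iteration for the composite objective Ψ(u) = ½‖Au − v‖₂², Φ = α‖·‖₁; this particular choice of Ψ and Φ is an illustrative assumption. Here ∇Ψ is Lipschitz with constant ‖A‖₂², so γ = 1/‖A‖₂² is admissible, and prox_{γΦ} is the soft shrinkage of Example 11, item 8.

```python
import numpy as np

def forward_backward_l1(A, v, alpha, n_iter=200):
    """Forward-backward splitting (lambda_k = 1, a_k = b_k = 0) for 0.5*||Au - v||^2 + alpha*||u||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        u_half = u - gamma * A.T @ (A @ u - v)                                  # forward (explicit) step
        u = np.sign(u_half) * np.maximum(np.abs(u_half) - gamma * alpha, 0.0)   # backward step: prox of gamma*Phi
    return u
```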

4.5.3 Douglas-Rachford splitting

Douglas-Rachford splitting, generalized and formalized in [51], is a much more general class of monotone operator splitting methods. A crucial property of the Douglas-Rachford splitting scheme is its high robustness to numerical errors that may occur when computing the proximity operators prox_{γΨ} and prox_{γΦ}, see [29].

Proposition 15 ([30], Proposition 18) Assume that Ψ ∈ Γ₀(V), Φ ∈ Γ₀(V) and that F = Ψ + Φ is coercive. Let γ ∈ ]0, +∞[. Then

  F(û) = min_{u∈V} F(u)  ⇔  û = prox_{γΦ} ẑ  where  ẑ = ( (2prox_{γΨ} − Id) ∘ (2prox_{γΦ} − Id) )(ẑ).

Moreover, û is unique if either Ψ or Φ is strictly convex.

Given a function f, the expression (2prox_f − Id) is also called the reflection operator.
Proof. Since F is coercive, Û ≠ ∅. The following chain of equivalences yields the result.
  F(û) = inf_{u∈V} F(u) ⇔ 0 ∈ ∂F(û) ⇔ 0 ∈ ∂(Ψ + Φ)(û) ⇔ 0 ∈ γ∂Ψ(û) + γ∂Φ(û)
  ⇔ ∃z ∈ V : û − z ∈ γ∂Ψ(û) and z − û ∈ γ∂Φ(û).
From the second relation:
  z ∈ (Id + γ∂Φ)(û) ⇔ û = (Id + γ∂Φ)⁻¹(z) = prox_{γΦ}(z).   (a)
From the first relation:
  2û − z ∈ (Id + γ∂Ψ)(û), i.e. (using (a) for û) ( 2prox_{γΦ} − Id )(z) ∈ (Id + γ∂Ψ)(û),
  ⇔ û = (Id + γ∂Ψ)⁻¹( (2prox_{γΦ} − Id)(z) ) = prox_{γΨ}( (2prox_{γΦ} − Id)(z) ).   (b)
  û = prox_{γΦ} z.   (c)
Combining (b) and (c):
  z = 2û − (2û − z) = 2prox_{γΨ}( (2prox_{γΦ} − Id)(z) ) − (2prox_{γΦ} − Id)(z) = (2prox_{γΨ} − Id)( (2prox_{γΦ} − Id)(z) ),
  with û = prox_{γΦ} z
(above: insert (b) for the first 2û and (c) for the second term).
F is coercive by assumption. When either Ψ or Φ is strictly convex, F is strictly convex as well, hence F admits a unique minimizer û.

Thus DR splitting is based on the fixed point equation
  z = (2prox_{γΨ} − Id)( (2prox_{γΦ} − Id)(z) ),
and F reaches its minimum at û = prox_{γΦ} ẑ, where ẑ is a fixed point of the equation above.

Remark 31 Note that the roles of Ψ and Φ can be interchanged since they share the same assumptions.

Given a fixed scalar γ > 0 and a sequence λ_k ∈ (0, 2), this class of methods can be expressed via the following recursion, written in compact form:
  z^(k+1) = ( (1 − λ_k/2) Id + (λ_k/2) (2prox_{γΨ} − Id)(2prox_{γΦ} − Id) ) z^(k).
A robust DR algorithm is proposed by Combettes and Pesquet in [30].

Theorem 61 ([30], Theorem 20) Assume that Ψ ∈ Γ₀(V), Φ ∈ Γ₀(V), that F = Ψ + Φ is coercive and that γ ∈ ]0, ∞[. Let the sequences (λ_k)_{k∈N} ⊂ R₊, (a_k)_{k∈N} ⊂ V and (b_k)_{k∈N} ⊂ V satisfy
(i) 0 < λ_k < 2, ∀k ∈ N, and Σ_{k∈N} λ_k(2 − λ_k) = +∞;
(ii) Σ_{k∈N} λ_k ( ‖a_k‖ + ‖b_k‖ ) < +∞.
Consider the iteration, for k ∈ N:
  z_{k+1/2} = prox_{γΦ} z_k + b_k,
  z_{k+1} = z_k + λ_k ( prox_{γΨ}( 2z_{k+1/2} − z_k ) + a_k − z_{k+1/2} ).
Then (z_k)_{k∈N} converges weakly to a point ẑ ∈ V and
  û = prox_{γΦ} ẑ = arg min_{u∈V} F(u).
When V is of finite dimension, e.g. V = R^n, the sequence (z_k)_{k∈N} converges strongly.
The sequences (a_k) and (b_k) model the numerical errors made when computing the proximity operators. They are kept under control using (ii). Conversely, if the convergence rate can be established, one can easily derive a rule on the number of inner iterations at each outer iteration k such that (ii) is verified.
This algorithm naturally inherits the splitting property of the Douglas-Rachford iteration, in the sense that the operators prox_{γΨ} and prox_{γΦ} are used in separate steps. Note that the algorithm allows for the inexact implementation of these two proximal steps via the incorporation of the error terms a_k and b_k. Moreover, a variable relaxation parameter λ_k enables additional flexibility.
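A minimal sketch of the DR iteration above with λ_k = 1 and a_k = b_k = 0, for the illustrative (assumed) choice Φ = α‖·‖₁ and Ψ(u) = ½‖Au − v‖₂²: prox_{γΦ} is a soft shrinkage and prox_{γΨ} is a linear solve.

```python
import numpy as np

def douglas_rachford_l1(A, v, alpha, gamma=1.0, n_iter=200):
    """Douglas-Rachford splitting (lambda_k = 1, no errors) for alpha*||u||_1 + 0.5*||Au - v||^2."""
    n = A.shape[1]
    M = gamma * A.T @ A + np.eye(n)                      # matrix of the prox_{gamma*Psi} step

    def prox_phi(x):                                     # prox of gamma*alpha*||.||_1
        return np.sign(x) * np.maximum(np.abs(x) - gamma * alpha, 0.0)

    def prox_psi(x):                                     # prox of gamma*0.5*||A. - v||^2
        return np.linalg.solve(M, gamma * A.T @ v + x)

    z = np.zeros(n)
    for _ in range(n_iter):
        z_half = prox_phi(z)                             # z_{k+1/2}
        z = z + prox_psi(2 * z_half - z) - z_half        # relaxed step with lambda_k = 1
    return prox_phi(z)                                   # minimizer: u = prox_{gamma*Phi}(z_hat)
```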

Chapter 5

Appendix

Proof of Property 2-1

Proof of Proposition 1
Put
  θ(t) = −φ(√t),   (5.1)
then θ is convex by (b) and continuous on R₊ by (c). Its convex conjugate (see (1.10), p. 14) is θ*(b) = sup_{t≥0} { bt − θ(t) }, where b ∈ R. Define ψ(b) = θ*(−½b), which means that

  ψ(b) = sup_{t≥0} { −½bt − θ(t) } = sup_{t≥0} { φ(√t) − ½bt }.   (5.2)

By Theorem 4, p. 15, we have (θ*)* = θ. Calculating (θ*)* at t² and using (5.1) yields
  −φ(t) = θ(t²) = sup_b { bt² − θ*(b) } = sup_{b≤0} { bt² − θ*(b) }.
Since θ(t) ≤ 0 on R₊, we have θ*(b) = +∞ if b > 0, and then the supremum of b ↦ bt² − θ*(b) in the middle expression above necessarily corresponds to b ≤ 0. Finally, with the change of variable b ↦ −½b,

  φ(t) = inf_{b≥0} { ½bt² + ψ(b) }.   (5.3)

The statement comes from the fact that (θ*)* = θ.
Next we focus on the possibility to achieve the supremum in ψ jointly with the infimum in φ. For any b > 0, define f_b : R₊ → R by f_b(t) = ½bt + θ(t); then (5.2) yields ψ(b) = −inf_{t≥0} f_b(t). Observe that f_b is convex by (b), with f_b(0) = 0 by (a) and lim_{t→∞} f_b(t) = +∞ by (e); hence f_b has a unique minimum, reached at some t̄ ≥ 0. According to (5.2), ψ(b) = −½b t̄ + φ(√t̄), and equivalently the infimum in (5.3) is reached for this b since φ(√t̄) = ½b t̄ + ψ(b). Notice that θ′(t) = −φ′(√t)/(2√t) and that f_b′(t) = ½b + θ′(t) is increasing on R₊ by (b). If f_b′(0⁺) ≥ 0, i.e. if b ≥ −2θ′(0⁺), f_b reaches its minimum at t̄ = 0. Otherwise, its minimum is reached for a t̄ > 0 such that f_b′(t̄) = 0, i.e. b = −2θ′(t̄) = φ′(√t̄)/√t̄. In this case, the last expression of (5.2) reaches its supremum at this t̄. Hence b̂ is as given in the proposition.
Derivation of the CG algorithm
Assume that ∇F(u_k) ≠ 0; then the optimal θ_k ∈ R^{k+1} with g(θ_k) ≠ 0 is the unique point defined by (2.40). Then for all iterates 0 ≤ j ≤ k we have g(θ_j) ≠ 0. We will express g(θ_j) in the form

  g(θ_j) = ρ_j d_j = Σ_{i=0}^j θ_j[i] ∇F(u_i),  0 ≤ j ≤ k, d_j ∈ R^n, ρ_j ∈ R.   (5.4)

Since g(θ_j) ≠ 0, we have d_j ≠ 0 and ρ_j ≠ 0, 0 ≤ j ≤ k. We will choose d_j in such a way that successive directions are easy to compute. The iterates read

  u_{j+1} = u_j − ρ_j d_j = u_j − Σ_{i=0}^j θ_j[i] ∇F(u_i),  0 ≤ j ≤ k.   (5.5)

(a) A major result. Combining (2.38) and (5.5),

  ∇F(u_{j+1}) = ∇F(u_j − ρ_j d_j) = ∇F(u_j) − ρ_j B d_j,  0 ≤ j ≤ k.   (5.6)

For k ≥ 1, combining (2.41) and (5.6) yields
  0 = ⟨∇F(u_{j+1}), ∇F(u_i)⟩ = ⟨∇F(u_j), ∇F(u_i)⟩ − ρ_j ⟨B d_j, ∇F(u_i)⟩ = −ρ_j ⟨B d_j, ∇F(u_i)⟩,  0 ≤ i < j ≤ k.
By (5.12), d_j is a linear combination of ∇F(u_i), 0 ≤ i ≤ j; this, joined to (2.41), yields

  ⟨B d_j, d_i⟩ = 0,  0 ≤ i < j ≤ k.   (5.7)

We will say that the directions d_j and d_i are conjugate with respect to B. Since B ≻ 0, this result shows as well that the d_j, 0 ≤ j ≤ k, are linearly independent. Up to iteration k+1, (5.4) can be put into matrix form:
  [ ρ_0 d_0 | ρ_1 d_1 | … | ρ_k d_k ] = [ ∇F(u_0) | ∇F(u_1) | … | ∇F(u_k) ] T,
where T is the upper triangular matrix with entries T[i, j] = θ_j[i], 0 ≤ i ≤ j ≤ k. The linear independence of (ρ_j d_j)_{j=0}^k and of (∇F(u_j))_{j=0}^k implies

  θ_j[j] ≠ 0,  0 ≤ j ≤ k.   (5.8)

(b) Calculus of successive directions. By (5.6) we see that
  ρ_j B d_j = ∇F(u_j) − ∇F(u_{j+1}),  0 ≤ j ≤ k.
Using this result, the fact that B = B^T, along with (5.7), yields
  0 = ρ_k ρ_j ⟨B d_k, d_j⟩ = ⟨ρ_k d_k, ρ_j B d_j⟩ = ⟨ρ_k d_k, ∇F(u_j) − ∇F(u_{j+1})⟩,  0 ≤ j ≤ k−1.
Introducing ρ_k d_k according to (5.4) in the last term above entails

  0 = ⟨ Σ_{i=0}^k θ_k[i] ∇F(u_i), ∇F(u_j) − ∇F(u_{j+1}) ⟩,  0 ≤ j ≤ k−1.   (5.9)

Using that θ_j[j] ≠ 0 (see (5.8)) and that ρ_j ≠ 0 in (5.4), for all 0 ≤ j ≤ k, the constants below are well defined:

  β_j[i] = θ_j[i] / θ_j[j],  0 ≤ i ≤ j ≤ k,  so that β_j[j] = 1.   (5.10)

Dividing (5.9) by θ_k[k], it is equivalent to
  0 = ⟨ Σ_{i=0}^{k−1} β_k[i] ∇F(u_i) + ∇F(u_k), ∇F(u_j) − ∇F(u_{j+1}) ⟩,  0 ≤ j ≤ k−1.
Using (2.41), this equation yields:
  j = k−1 :  −β_k[k−1] ‖∇F(u_{k−1})‖² + ‖∇F(u_k)‖² = 0,
  0 ≤ j ≤ k−2 :  −β_k[j] ‖∇F(u_j)‖² + β_k[j+1] ‖∇F(u_{j+1})‖² = 0.
It is easy to check that the solution is

  β_k[j] = ‖∇F(u_k)‖² / ‖∇F(u_j)‖²,  0 ≤ j ≤ k−1.   (5.11)

Note that, with the help of the β_j[i] in (5.10), d_j in (5.4) can equivalently be written (up to a nonzero scaling, which does not change the direction) as

  d_j = Σ_{i=0}^j β_j[i] ∇F(u_i),  0 ≤ j ≤ k,   (5.12)

provided that ρ_j solves the problem min_ρ F(u_j − ρ d_j), 0 ≤ j ≤ k. From (5.11) and (5.12),
  d_k = Σ_{i=0}^{k−1} β_k[i] ∇F(u_i) + ∇F(u_k)
      = ∇F(u_k) + ( ‖∇F(u_k)‖² / ‖∇F(u_{k−1})‖² ) [ Σ_{i=0}^{k−2} ( ‖∇F(u_{k−1})‖² / ‖∇F(u_i)‖² ) ∇F(u_i) + ∇F(u_{k−1}) ]
      = ∇F(u_k) + ( ‖∇F(u_k)‖² / ‖∇F(u_{k−1})‖² ) d_{k−1}.
The latter result provides a simple way to compute the new d_k using the previous d_{k−1}:

  d_0 = ∇F(u_0)  and  d_i = ∇F(u_i) + ( ‖∇F(u_i)‖² / ‖∇F(u_{i−1})‖² ) d_{i−1},  1 ≤ i ≤ k.

(c) The optimal ρ_k for d_k. The optimal step-size ρ_k is the unique solution of the problem
  F(u_k − ρ_k d_k) = inf_{ρ∈R} F(u_k − ρ d_k).
ρ_k is hence the unique minimizer of f given below,
  f(ρ) = ½ ⟨B(u_k − ρ d_k), u_k − ρ d_k⟩ − ⟨c, u_k − ρ d_k⟩,
hence the unique solution of
  0 = f′(ρ) = ρ ⟨B d_k, d_k⟩ − ⟨B u_k, d_k⟩ + ⟨c, d_k⟩ = ρ ⟨B d_k, d_k⟩ − ⟨∇F(u_k), d_k⟩,
which reads
  ρ_k = ⟨∇F(u_k), d_k⟩ / ⟨B d_k, d_k⟩.
All ingredients of the algorithm are established.
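The derivation above assembles into the following short NumPy sketch of linear CG for F(u) = ½⟨Bu, u⟩ − ⟨c, u⟩, with B symmetric positive definite (the stopping tolerance is an arbitrary illustrative choice):

```python
import numpy as np

def conjugate_gradient(B, c, u0, n_iter=None, tol=1e-10):
    """Linear CG for F(u) = 0.5*<Bu, u> - <c, u>, following the derivation above."""
    u = u0.copy()
    g = B @ u - c                        # gradient of F at u
    d = g.copy()                         # d_0 = grad F(u_0)
    n_iter = n_iter if n_iter is not None else len(c)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        rho = (g @ d) / (d @ (B @ d))    # optimal step on the line u - rho*d
        u = u - rho * d
        g_new = B @ u - c
        beta = (g_new @ g_new) / (g @ g) # ||grad F(u_k)||^2 / ||grad F(u_{k-1})||^2
        d = g_new + beta * d             # new conjugate direction
        g = g_new
    return u
```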

Bibliography
[1] M. Allain, J. Idier, and Y. Goussard, On global and local convergence of half-quadratic algorithms, IEEE Transactions on Image Processing, 15 (2006), pp. 11301142.
[2] E. Allen, R. Helgason, J. Kennington, and S. B., A generalization of Polyak's convergence result for subgradient optimization, Math. Progr., 37 (1987).
[3] S. Alliney, Digital filters as absolute norm regularizers, IEEE Transactions on Signal Processing, SP-40 (1992), pp. 1548-1562.
[4] S. Alliney and S. A. Ruzinsky, An algorithm for the minimization of mixed l1 and l2 norms with
application to Bayesian estimation, IEEE Transactions on Signal Processing, SP-42 (1994), pp. 618
627.
[5] L. Ambrosio, N. Fusco, and D. Pallara, Functions of Bounded Variation and Free Discontinuity
Problems, Oxford Mathematical Monographs, Oxford University Press, 2000.
[6] A. Antoniadis and J. Fan, Regularization of wavelet approximations, Journal of Acoustical Society
America, 96 (2001), pp. 939967.
[7] G. Aubert and P. Kornprobst, Mathematical problems in image processing, Springer-Verlag,
Berlin, 2 ed., 2006.
[8] G. Aubert and L. Vese, A variational method in image recovery, SIAM Journal on Numerical
Analysis, 34 (1997), pp. 19481979.
[9] J.-F. Aujol, Some first-order algorithms for total variation based image restoration, Journal of Mathematical Imaging and Vision, 34 (2009), pp. 307-327.
[10] A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational
Inequalities, Springer, New York, 2003.
[11] A. Avez, Calcul dierentiel, Masson, 1991.
[12] M. Belge, M. Kilmer, and E. Miller, Wavelet domain image restoration with adaptive edge-preserving regularization, IEEE Transactions on Image Processing, 9 (2000), pp. 597-608.
[13] J. E. Besag, Digital image processing : Towards Bayesian image analysis, Journal of Applied Statistics, 16 (1989), pp. 395407.
[14] J. F. Bonnans, J.-C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal, Numerical optimization (theoretical and practical aspects), Springer, Berlin; New York; Hong Kong, 2003.
[15] H. Brezis, Analyse fonctionnelle, Collection mathématiques appliquées pour la maîtrise, Masson, Paris, 1992.
[16] A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision, 20 (2004).
[17] R. Chan, Circulant preconditioners for hermitian toeplitz systems, SIAM J. Matrix Anal. Appl.,
(1989), pp. 542550.


[18] R. Chan, C. Hu, and M. Nikolova, An iterative procedure for removing random-valued impulse
noise, IEEE Signal Processing Letters, 11 (2004), pp. 921924.
[19] R. Chan and M. Ng, Conjugate gradient methods for toeplitz systems, SIAM Review, 38 (1996),
pp. 427482.
[20] R. H.-F. Chan and X.-Q. Jin, An Introduction to Iterative Toeplitz Solvers, Cambridge University
Press, 2011.
[21] T. Chan, An optimal circulant preconditioner for toeplitz systems, SIAM J. Sci. Stat. Comput.,
(1988), pp. 766771.
[22] T. Chan and S. Esedoglu, Aspects of total variation regularized l1 function approximation, SIAM
Journal on Applied Mathematics, 65 (2005), pp. 18171837.
[23] T. Chan and P. Mulet, On the convergence of the lagged diffusivity fixed point method in total variation image restoration, SIAM Journal on Numerical Analysis, 36 (1999), pp. 354-367.
[24] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, Deterministic edge-preserving regularization in computed imaging, IEEE Transactions on Image Processing, 6 (1997), pp. 298-311.
[25] P. G. Ciarlet, Introduction to Numerical Linear Algebra and Optimization, Cambridge University
Press, 1989.
[26] P. G. Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, Collection mathématiques appliquées pour la maîtrise, Dunod, Paris, 5e ed., 2000.
[27] P. Combettes, Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections, IEEE Transactions on Image Processing, 6 (1997), pp. 493-506.
[28] P. Combettes and J. Luo, An adaptive level set method for nondifferentiable constrained image recovery, IEEE Transactions on Image Processing, 11 (2002), pp. 1295-1304.
[29] P. L. Combettes, Solving monotone inclusions via compositions of nonexpansive averaged operators,
Optimization, 53 (2004).
[30] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564-574.
[31] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM
Multiscale Model. Simul., 4 (2005), pp. 11681200.
[32] R. Courant, Variational methods for the solution of problems with equilibrium and vibration, Bull.
Amer. Math. Soc., (1943), pp. 123.
[33] P. Davis, Circulant matrices, John Wiley, New York, 3 ed., 1979.
[34] A. H. Delaney and Y. Bresler, Globally convergent edge-preserving regularized reconstruction: an
application to limited-angle tomography, IEEE Transactions on Image Processing, 7 (1998), pp. 204
221.
[35] D. Donoho, I. Johnstone, J. Hoch, and A. Stern, Maximum entropy and the nearly black
object, Journal of the Royal Statistical Society B, 54 (1992), pp. 4181.
[36] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators, Math. Programming: Series A and B, 55 (1992),
pp. 293318.
[37] J. Eckstein and B. F. Svaiter, A family of projective splitting methods for the sum of two maximal
monotone operators, Math. Program., Ser. B, 111 (2008).

BIBLIOGRAPHY

112

[38] I. Ekeland and R. Temam, Convex Analysis and Variational Problems, SIAM, Amsterdam: North
Holland, 1976.
[39] L. C. Evans and R. F. Gariepy, Measure theory and fine properties of functions, Studies in Advanced Mathematics, CRC Press, Boca Raton, FL, 1992.
[40] H. Fu, M. K. Ng, M. Nikolova, and J. L. Barlow, Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration, SIAM Journal on Scientific Computing, 27 (2005), pp. 1881-1902.
[41] D. Gabay, Applications of the method of multipliers to variational inequalities, M. Fortin and
R. Glowinski, editors, North-Holland, Amsterdam, 1983.
[42] D. Geman and G. Reynolds, Constrained restoration and recovery of discontinuities, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-14 (1992), pp. 367383.
[43] D. Geman and C. Yang, Nonlinear image recovery with half-quadratic regularization, IEEE Transactions on Image Processing, IP-4 (1995), pp. 932946.
[44] R. Glowinski, J. Lions, and R. Trémolières, Analyse numérique des inéquations variationnelles, vol. 1, Dunod, Paris, 1 ed., 1976.
[45] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex analysis and Minimization Algorithms, vol. I, Springer-Verlag, Berlin, 1996.
[46] ——, Convex analysis and Minimization Algorithms, vol. II, Springer-Verlag, Berlin, 1996.

[47] J. Idier, Convex half-quadratic criteria and auxiliary interacting variables for image restoration, IEEE
Transactions on Image Processing, 10 (2001), pp. 10011009.
[48] S. Kim, H. Ahn, and S. C. Cho, Variable target value subgradient method, Math. Progr., 49 (1991).
[49] A. Lewis and M. Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, online 2012.
[50] P.-L. Lions, Une méthode itérative de résolution d'une inéquation variationnelle, Israel Journal of Mathematics, 31 (1978), pp. 204-208.
[51] P.-L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM
Journal on Numerical Analysis, 16 (1979), pp. 964979.
[52] D. Luenberger, Optimization by Vector Space Methods, Wiley, J., New York, 1 ed., 1969.
[53]

, Introduction to Linear and Nonlinear Programming, Addison-Wesley, New York, 1 ed., 1973.

[54] D. Luenberger and Y. Ye, Linear and Nonlinear Programming, Springer, New York, 3 ed., 2008.
[55] M. Minoux, Programmation mathématique. Théorie et algorithmes, T.1, Dunod, Paris, 1983.
[56] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bulletin de la S. M. F.
[57] ——, Fonctions convexes duales et points proximaux dans un espace hilbertien, CRAS Sér. A Math., 255 (1962), pp. 2897-2899.

[58] P. Moulin and J. Liu, Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors, IEEE Transactions on Image Processing, 45 (1999), pp. 909919.
[59] Y. Nesterov, Gradient methods for minimizing composite objective function, tech. report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), CORE Discussion Papers 2007076, Sep. 2007.
[60] ——, Introductory Lectures on Convex Optimization: a Basic Course, Kluwer Academic, Dordrecht, 2004.
[61] ——, Smooth minimization of non-smooth functions, Math. Program. (A), 103 (2005), pp. 127-152.

[62] M. Ng, R. Chan, and W. Tang, A fast algorithm for deblurring models with Neumann boundary conditions, SIAM Journal on Scientific Computing, (1999).
[63] M. Nikolova, Local strong homogeneity of a regularized estimator, SIAM Journal on Applied Mathematics, 61 (2000), pp. 633-658.
[64] ——, Minimizers of cost-functions involving nonsmooth data-fidelity terms. Application to the processing of outliers, SIAM Journal on Numerical Analysis, 40 (2002), pp. 965-994.
[65] ——, A variational approach to remove outliers and impulse noise, Journal of Mathematical Imaging and Vision, 20 (2004).
[66] ——, Weakly constrained minimization. Application to the estimation of images and signals involving constant regions, Journal of Mathematical Imaging and Vision, 21 (2004), pp. 155-175.

[67] M. Nikolova and R. Chan, The equivalence of half-quadratic minimization and the gradient linearization iteration, IEEE Transactions on Image Processing, 16 (2007), pp. 16231627.
[68] M. Nikolova and M. Ng, Analysis of half-quadratic minimization methods for signal and image
recovery, SIAM Journal on Scientic Computing, 27 (2005), pp. 937966.
[69] J. Nocedal and S. Wright, Numerical Optimization, Springer, New York, 1999.
[70] J. Ortega and W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables,
Academic Press, New York, 1970.
[71] G. B. Passty, Ergodic convergence to a zero of the sum of monotone operators in hilbert space,
Journal of Mathematical Analysis and Applications, 72 (1979).
[72] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes, the art of scientific computing, Cambridge Univ. Press, New York, 1992.
[73] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[74] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control
and Optimization, 14 (1976), pp. 877898.
[75] R. T. Rockafellar and J. B. Wets, Variational analysis, Springer-Verlag, New York, 1997.
[76] L. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D, 60 (1992), pp. 259-268.
[77] L. Schwartz, Analyse: Topologie générale et analyse fonctionnelle, Hermann, Paris, 1993.
[78] ——, Analyse II. Calcul différentiel et équations différentielles, vol. 2 of Enseignement des sciences, Hermann, Paris, 1997.

[79] N. Z. Shor, Minimization Methods for Non-Dierentiable Functions, vol. 3, Springer-Verlag, 1985.
[80] G. Strang, A proposal for toeplitz matrix calculations, Stud. Appl. Math., 74 (1986).
[81] G. Strang, Linear Algebra and its Applications, Brooks / Cole, 3 ed., 1988.
[82] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM Journal on Control and Optimization, 29 (1991), pp. 119138.
[83]

, A modied forward-backward splitting method for maximal monotone mappings, SIAM Journal
on Control and Optimization, 38 (2000), pp. 431446.

[84] E. Tyrtyshnikov, Optimal and superoptimal circulant preconditioners, SIAM J. Matrix Anal. Appl,
(1992), pp. 459473.


[85] C. Vogel, Computational Methods for Inverse Problems, Frontiers in Applied Mathematics Series,
Number 23, SIAM, 2002.
[86] H. E. Voss and U. Eckhardt, Linear convergence of generalized Weiszfeld's method, Computing, 25 (1980).
[87] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total
variation image reconstruction, SIAM Journal on Imaging Sciences, 1 (2008), pp. 248272.
[88] P. Weiss, L. Blanc-Féraud, and G. Aubert, Efficient schemes for total variation minimization under constraints in image processing, SIAM Journal on Scientific Computing, (2009).
[89] E. Weiszfeld, Sur le point pour lequel la somme des distances de n points donnés est minimum, Tohoku Math. J, (1937), pp. 355-386.
[90] S. Wright, Primal-Dual Interior-Point Methods, SIAM Publications, Philadelphia, 1997.
[91] J. Yang, W. Yin, Y. Zhang, and Y. Wang, A fast algorithm for edge-preserving variational
multichannel image restoration, SIAM Journal on Imaging Sciences, 2 (2009), pp. 569592.
[92] J. Yang, Y. Zhang, and W. Yin, An efficient TVL1 algorithm for deblurring multichannel images corrupted by impulsive noise, SIAM Journal on Scientific Computing, 31 (2009), pp. 2842-2865.
[93] C. Zalinescu, Convex analysis in general vector spaces, World Scientific, River Edge, N.J., 2002.
