
APG

DSA5103 Lecture 4

Yangjing Zhang
05-Sep-2023
NUS
Today’s content

1. Basic convex analysis


2. Proximal operator
3. (Accelerated) proximal gradient method

Basic convex analysis
Norms

A vector norm on $\mathbb{R}^n$ is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ satisfying the following properties:
(1) $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$, and $\|x\| = 0 \iff x = 0$
(2) $\|\alpha x\| = |\alpha|\,\|x\|$ for all $\alpha \in \mathbb{R}$, $x \in \mathbb{R}^n$
(3) $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$

Example.
1. $\|x\|_1 = \sum_{i=1}^n |x_i|$ ($\ell_1$ norm)
2. $\|x\|_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$ ($\ell_2$ norm)
3. $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, where $1 \le p < \infty$ ($\ell_p$ norm)
4. $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$ ($\ell_\infty$ norm)
5. $\|x\|_{W,p} = \|Wx\|_p$, where $W$ is a fixed nonsingular matrix, $1 \le p \le \infty$
Inner product

For the space of $m \times n$ matrices, $\mathbb{R}^{m \times n}$, we define the standard inner product, for any $A, B \in \mathbb{R}^{m \times n}$,
$\langle A, B \rangle = \mathrm{Tr}(A^T B) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij}$

Recap: the trace of a square matrix $C \in \mathbb{R}^{n \times n}$ is $\mathrm{Tr}(C) = \sum_{i=1}^n C_{ii}$.

For the space of $n$-vectors, $\mathbb{R}^n$, we define the standard inner product, for any $x, y \in \mathbb{R}^n$,
$\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i$
Inner product

From the definition of inner products, the following properties hold for any matrices $A, B, C \in \mathbb{R}^{m \times n}$ and any scalars $a, b \in \mathbb{R}$:

• $\|A\|_F^2 := \langle A, A \rangle \ge 0$
• $\langle A, A \rangle = 0$ if and only if $A = 0$
• $\langle C, aA + bB \rangle = a\langle C, A \rangle + b\langle C, B \rangle$
• $\langle A + B, A + B \rangle = \langle A, A \rangle + 2\langle A, B \rangle + \langle B, B \rangle$
  (i.e., $\|A + B\|_F^2 = \|A\|_F^2 + 2\langle A, B \rangle + \|B\|_F^2$)
Projection onto a closed convex set

Theorem (Projection theorem). Let $C$ be a closed convex set.
(1) For every $z$, there exists a unique minimizer of
$\min_{x \in C} \ \tfrac{1}{2}\|x - z\|^2,$
denoted $\Pi_C(z)$ and called the projection of $z$ onto $C$.
(2) $x^* := \Pi_C(z)$ is the projection of $z$ onto $C$ if and only if
$\langle z - x^*, x - x^* \rangle \le 0 \quad \forall x \in C.$
Projection onto a closed convex set

Example.
1. $C = \mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x_i \ge 0, \ \forall i = 1, 2, \dots, n\}$ (nonnegative orthant)
$\Pi_C(z) = \Pi_{\mathbb{R}^n_+}(z) = \max\{z, 0\}$ (componentwise)

2. $C = \{x \in \mathbb{R}^n \mid \|x\|_2 \le 1\}$ ($\ell_2$-norm ball)
$\Pi_C(z) = \begin{cases} z, & \text{if } \|z\|_2 \le 1 \\ z/\|z\|_2, & \text{if } \|z\|_2 > 1 \end{cases} \;=\; \dfrac{z}{\max\{\|z\|_2, 1\}}$

[Figure 1: $C = \{x \in \mathbb{R}^2 \mid \|x\|_2 \le 1\}$, the $\ell_2$-norm ball.]
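As a quick numerical illustration (not part of the original slides), here is a minimal NumPy sketch of these two projections:

```python
# Minimal sketch, assuming NumPy: projections onto the nonnegative orthant and the unit l2-ball.
import numpy as np

def proj_nonneg_orthant(z):
    """Componentwise max with 0."""
    return np.maximum(z, 0.0)

def proj_l2_ball(z, radius=1.0):
    """Project z onto the l2-ball of given radius centered at 0."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else (radius / nrm) * z

z = np.array([1.5, -0.4, 0.3])
print(proj_nonneg_orthant(z))   # [1.5 0.  0.3]
print(proj_l2_ball(z))          # z / ||z||_2, since ||z||_2 > 1
```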


Projection onto a closed convex set

Example.

3. $C = \mathbb{S}^n_+$, the cone of $n \times n$ symmetric positive semidefinite matrices. For a given $A \in \mathbb{S}^n$ with eigenvalue decomposition
$A = Q \,\mathrm{Diag}(\lambda_1, \dots, \lambda_n)\, Q^T,$
the projection is
$\Pi_C(A) = \Pi_{\mathbb{S}^n_+}(A) = Q \,\mathrm{Diag}(\max\{\lambda_1, 0\}, \dots, \max\{\lambda_n, 0\})\, Q^T.$
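A hedged sketch (assuming NumPy) of this projection via an eigendecomposition:

```python
# Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues at 0.
import numpy as np

def proj_psd(A):
    A_sym = 0.5 * (A + A.T)            # symmetrize to guard against round-off
    lam, Q = np.linalg.eigh(A_sym)     # A_sym = Q diag(lam) Q^T
    return (Q * np.maximum(lam, 0.0)) @ Q.T

A = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
print(proj_psd(A))                       # the negative eigenvalue is replaced by 0
```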

arg min

$\arg\min_x f(x)$ denotes the set of $x$ at which $f(x)$ attains its minimum (the argument of the minimum).

[Two plots. Left: $\min_x f(x) = 0$, $\arg\min_x f(x) = \{2\}$. Right: $\min_x f(x) = 0$, $\arg\min_x f(x) = [-1, 1]$.]

With this notation, $\Pi_C(z) = \arg\min_{x \in C} \tfrac{1}{2}\|x - z\|^2$.
Extended real-valued function

Definition.
Let $\mathcal{X}$ be a Euclidean space (e.g., $\mathcal{X} = \mathbb{R}^n$ or $\mathbb{R}^{m \times n}$). Let $f : \mathcal{X} \to (-\infty, +\infty]$ be an extended real-valued function.¹
(1) The (effective) domain of $f$ is defined to be the set
$\mathrm{dom}(f) := \{x \in \mathcal{X} \mid f(x) < +\infty\}.$
(2) $f$ is said to be proper if $\mathrm{dom}(f) \neq \emptyset$.
(3) $f$ is said to be closed if its epigraph
$\mathrm{epi}(f) := \{(x, \alpha) \in \mathcal{X} \times \mathbb{R} \mid f(x) \le \alpha\}$
is closed.
(4) $f$ is said to be convex if its epigraph is convex.
¹ Here $f$ is allowed to take the value $+\infty$, but not the value $-\infty$.
Extended real-valued function

• For a real-valued function $f : \mathcal{X} \to \mathbb{R}$, this definition of convexity,
  $\mathrm{epi}(f)$ is convex,   (1)
coincides with the one we have used:
  $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y \in \mathrm{dom}(f)$, $\lambda \in [0, 1]$.   (2)
[Exercise] (1) $\iff$ (2)
• A convex function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ can be extended to a convex function on all of $\mathbb{R}^n$ by setting $f(x) = +\infty$ for $x \notin D$.
Example: extended real-valued function

$f(x) = \begin{cases} 0, & \text{if } x = 0 \\ 1, & \text{if } x > 0 \\ +\infty, & \text{if } x < 0 \end{cases}$

• $\mathrm{dom}(f) = [0, +\infty)$, so $f$ is proper
• $\mathrm{epi}(f) = \{0\} \times [0, +\infty) \,\cup\, (0, +\infty) \times [1, +\infty)$ is closed, i.e., $f$ is closed
• $\mathrm{epi}(f)$ is not convex, i.e., $f$ is not convex
Indicator function

Let $C$ be a nonempty set in $\mathcal{X}$. The indicator function of $C$ is
$\delta_C(x) = \begin{cases} 0, & \text{if } x \in C \\ +\infty, & \text{if } x \notin C \end{cases}$

[Figure: an indicator function $\delta_C : \mathbb{R}^2 \to (-\infty, +\infty]$.]

• $\mathrm{dom}(\delta_C) = C$, so $\delta_C$ is proper
• $\mathrm{epi}(\delta_C) = C \times [0, +\infty)$ is closed if $C$ is closed, i.e., $\delta_C(\cdot)$ is closed if $C$ is closed
• $\mathrm{epi}(\delta_C)$ is convex if $C$ is convex, i.e., $\delta_C(\cdot)$ is convex if $C$ is convex
Dual/polar cone

Definition (Cone)
A set $C \subseteq \mathcal{X}$ is called a cone if $\lambda x \in C$ whenever $x \in C$ and $\lambda \ge 0$.

Definition (Dual and polar cone)
The dual cone of a set $C \subseteq \mathcal{X}$ (not necessarily convex) is defined by
$C^* = \{y \in \mathcal{X} \mid \langle x, y \rangle \ge 0 \ \forall x \in C\}.$
The polar cone of $C$ is $C^\circ = -C^*$.
If $C^* = C$, then $C$ is said to be self-dual.

• $C^*$ is always a convex cone, even if $C$ is neither convex nor a cone.
Dual/polar cone

Figure 2: Left: A set C and its dual cone C ∗ . Right: A set C and its polar
cone C o . The dual cone and the polar cone are symmetric to each other with
respect to the origin. Image from internet

Example: self-dual cones

1. $\mathcal{X} = \mathbb{R}^n$. $C = \mathbb{R}^n_+$ is a self-dual closed convex cone.
   Proof: $C^* = \{y \in \mathbb{R}^n \mid \langle x, y \rangle \ge 0 \ \forall x \in \mathbb{R}^n_+\} = \mathbb{R}^n_+$.

2. $\mathcal{X} = \mathbb{S}^n$. $C = \mathbb{S}^n_+$ is a self-dual closed convex cone (the PSD cone).
   Proof: we want to show that $\{B \in \mathbb{S}^n \mid \langle A, B \rangle \ge 0 \ \forall A \in \mathbb{S}^n_+\} = \mathbb{S}^n_+$.
   [LHS ⊆ RHS] Take $B \in C^*$. For any $x \in \mathbb{R}^n$, we have $xx^T \in \mathbb{S}^n_+$ and thus $\langle xx^T, B \rangle = x^T B x \ge 0$, which implies $B \in \mathbb{S}^n_+$.
   [RHS ⊆ LHS] Take $B \in \mathbb{S}^n_+$. Compute its eigenvalue decomposition $B = \sum_{i=1}^n \lambda_i v_i v_i^T$, $\lambda_i \ge 0$. For any $A \in \mathbb{S}^n_+$, we have $\langle A, B \rangle = \langle A, \sum_{i=1}^n \lambda_i v_i v_i^T \rangle = \sum_{i=1}^n \lambda_i v_i^T A v_i \ge 0$.
Example: self-dual cones
3. $\mathcal{X} = \mathbb{R}^n$. $C := \{x \in \mathbb{R}^n \mid \sqrt{x_2^2 + \cdots + x_n^2} \le x_1,\ x_1 \ge 0\}$ is a self-dual closed convex cone (second-order cone).
   Proof: we want to show that $\{y \in \mathbb{R}^n \mid \langle x, y \rangle \ge 0 \ \forall x \in C\} = C$.
   [RHS ⊆ LHS] Take $y \in C$. For any $x \in C$,
   $\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \ \ge\ x_1 y_1 - \sqrt{\textstyle\sum_{i=2}^n x_i^2}\,\sqrt{\textstyle\sum_{i=2}^n y_i^2} \ \ge\ 0.$
   The first inequality follows from the Cauchy-Schwarz inequality, and the last inequality follows from the fact that $x, y \in C$.
   [LHS ⊆ RHS] Take $y \in C^*$. If $[y_2; \dots; y_n] = 0$, we take $x = [1; 0; \dots; 0] \in C$; then $\langle x, y \rangle = y_1 \ge 0$, so $y \in C$. Otherwise, we take
   $x = \left[\sqrt{\textstyle\sum_{i=2}^n y_i^2};\; -y_2;\; \dots;\; -y_n\right] \in C;$
   then
   $\langle x, y \rangle = y_1 \sqrt{\textstyle\sum_{i=2}^n y_i^2} - y_2^2 - \cdots - y_n^2 \ \ge\ 0 \ \Rightarrow\ y_1 \ge \sqrt{\textstyle\sum_{i=2}^n y_i^2} \ \Rightarrow\ y \in C.$
Example: self-dual cones

[Figure 3: Left: second-order cone $\{x \in \mathbb{R}^2 \mid x_1 \ge |x_2|\}$. Right: second-order cone $\{x \in \mathbb{R}^3 \mid x_1 \ge \sqrt{x_2^2 + x_3^2}\}$. Image from internet.]
Normal cone

Definition (Normal cone)
Let $C$ be a convex set in $\mathcal{X}$ and $\bar{x} \in C$. The normal cone of $C$ at $\bar{x} \in C$ is defined by
$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}.$
By convention, we let $N_C(\bar{x}) = \emptyset$ if $\bar{x} \notin C$.
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 1. $C = [0, 1] \subseteq \mathbb{R}$
$N_C(\bar{x}) = \begin{cases} (-\infty, 0], & \text{if } \bar{x} = 0 \\ [0, +\infty), & \text{if } \bar{x} = 1 \\ \{0\}, & \text{if } \bar{x} \in (0, 1) \\ \emptyset, & \text{if } \bar{x} \notin C \end{cases}$
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 2. $C = \{x \in \mathbb{R}^2 \mid \|x\| \le 1\} \subseteq \mathbb{R}^2$
$N_C(\bar{x}) = \begin{cases} \{\lambda \bar{x} \mid \lambda \ge 0\}, & \text{if } \|\bar{x}\| = 1 \\ \{0\}, & \text{if } \|\bar{x}\| < 1 \\ \emptyset, & \text{if } \bar{x} \notin C \end{cases}$

In the Normal cone proposition a few slides below, we show that $u \in N_C(\bar{x}) \iff \bar{x} = \Pi_C(\bar{x} + u)$. If $\|\bar{x}\| = 1$ and $u \neq 0$, then $\bar{x} = \Pi_C(\bar{x} + u)$ implies that $\|\bar{x} + u\| > 1$ and $\bar{x} = \Pi_C(\bar{x} + u) = \frac{\bar{x} + u}{\|\bar{x} + u\|}$. It follows that $u = (\|\bar{x} + u\| - 1)\bar{x}$ with $\|\bar{x} + u\| - 1 > 0$. That is, $N_C(\bar{x}) = \{\lambda \bar{x} \mid \lambda \ge 0\}$, since it is a cone.
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 3. $C$ = a triangle $\subseteq \mathbb{R}^2$ (illustrated in the lecture figure).

[Exercise] (1) $C = \{x \in \mathbb{R}^2 \mid x_1 + x_2 \le 1,\ x_1 \ge 0,\ x_2 \ge 0\}$. Find $N_C(\bar{x})$ for $\bar{x} = [0; 1]$, $\bar{x} = [0.5; 0.5]$, $\bar{x} = [0.1; 0.2]$.
(2) $C = \mathbb{R}^n_+$. Find $N_C(\bar{x})$ for $\bar{x} = [1; 1; 0; 0; \dots; 0]$.
Normal cone

Proposition. Let $C \subseteq \mathcal{X}$ be a nonempty convex set and $\bar{x} \in C$. Then
(1) $N_C(\bar{x})$ is a closed convex cone.
(2) If $\bar{x} \in \mathrm{int}(C)$ ($\bar{x}$ is an interior point of $C$), then $N_C(\bar{x}) = \{0\}$.
(3) If $C$ is a cone, then $N_C(\bar{x}) \subseteq C^\circ$.
Normal cone

Proof*: (1) Closedness is easy to prove [Exercise]. First we prove that it is a cone. Consider $z \in N_C(\bar{x})$ and $\lambda \ge 0$. By definition, $\langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C \Rightarrow \langle \lambda z, x - \bar{x} \rangle \le 0 \ \forall x \in C$. Thus $\lambda z \in N_C(\bar{x})$. Next we show that if $z_1, z_2 \in N_C(\bar{x})$, then $z_1 + z_2 \in N_C(\bar{x})$. By definition, $\langle z_1, x - \bar{x} \rangle \le 0$ and $\langle z_2, x - \bar{x} \rangle \le 0$ for all $x \in C$. Thus $\langle z_1 + z_2, x - \bar{x} \rangle \le 0$ for all $x \in C$, which implies that $z_1 + z_2 \in N_C(\bar{x})$ and that $N_C(\bar{x})$ is convex.
(2) Let $z \in N_C(\bar{x})$. Since $\bar{x} \in \mathrm{int}(C)$, there exists $\epsilon > 0$ such that $\bar{x} + tz \in C$ for $|t| < \epsilon$. By the definition of the normal cone, $0 \ge \langle z, (\bar{x} + tz) - \bar{x} \rangle = t\|z\|^2$. By considering both positive and negative $t$, we get $\|z\|^2 = 0$. Hence $z = 0$.
(3) Suppose $C$ is a cone. Then $\bar{x} + x \in C$ for any $x \in C$. Thus for any $z \in N_C(\bar{x})$, we have $\langle z, x \rangle = \langle z, (x + \bar{x}) - \bar{x} \rangle \le 0 \ \forall x \in C$. Hence $z \in C^\circ$. That is, $N_C(\bar{x}) \subseteq C^\circ$.
Normal cone

Proposition. Let $C \subseteq \mathcal{X}$ be a nonempty closed convex set. Then for any $y \in C$,
$u \in N_C(y) \iff y = \Pi_C(y + u).$

Proof. "⇒" Suppose $u \in N_C(y)$. Then $\langle u, x - y \rangle \le 0$ for all $x \in C$. Thus
$\langle (y + u) - y, x - y \rangle \le 0$ for all $x \in C$,
which implies (by the projection theorem) that $y = \Pi_C(y + u)$.
"⇐" Suppose $y = \Pi_C(y + u)$. Then we know that
$\langle u, x - y \rangle = \langle (y + u) - y, x - y \rangle \le 0$ for all $x \in C$,
which implies that $u \in N_C(y)$.
Subdifferential

Definition. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a convex function.
We call $v$ a subgradient of $f$ at $x \in \mathrm{dom}(f)$ if
$f(z) \ge f(x) + \langle v, z - x \rangle \quad \forall z \in \mathcal{X}.$
The set of all subgradients at $x$ is called the subdifferential of $f$ at $x$, denoted
$\partial f(x) = \{v \mid f(z) \ge f(x) + \langle v, z - x \rangle \ \forall z \in \mathcal{X}\}.$
By convention, $\partial f(x) = \emptyset$ for any $x \notin \mathrm{dom}(f)$.
Subdifferential and optimization

The subgradient is an extension of the gradient:
• If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
  Proof: If $v \in \partial f(x)$, then $f(x + h) \ge f(x) + \langle v, h \rangle$ for all $h$. Taking $h = t(v - \nabla f(x))$, $t > 0$, and using the first-order Taylor expansion gives $t\|v - \nabla f(x)\|^2 \le o(t)$ for all $t > 0$, which implies $v = \nabla f(x)$.

Theorem. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a proper convex function. Then $\bar{x} \in \mathcal{X}$ is a global minimizer of $\min_{x \in \mathcal{X}} f(x)$ if and only if $0 \in \partial f(\bar{x})$.

Proof. By the subgradient inequality, $f(z) \ge f(\bar{x}) + \langle v, z - \bar{x} \rangle$ for all $z \in \mathcal{X}$ and $v \in \partial f(\bar{x})$. Taking $v = 0$: $0 \in \partial f(\bar{x})$ if and only if $f(z) \ge f(\bar{x})$ for all $z \in \mathcal{X}$, i.e., $\bar{x}$ is a global minimizer.
Example: subdifferential

Example 1. $f(x) = |x|$, $x \in \mathbb{R}$.
$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0 \\ [-1, 1], & \text{if } x = 0 \\ \{1\}, & \text{if } x > 0 \end{cases}$

[Figure: graph of $f(x) = |x|$ on $[-1, 1]$.]

The $\ell_1$ norm promotes sparsity: for example, in lasso
$\min_{\beta \in \mathbb{R}^p} \ \tfrac{1}{2}\|X\beta - Y\|^2 + \lambda\|\beta\|_1.$
Optimality condition: $0 \in X^T(X\beta - Y) + \lambda\,\partial(\|\cdot\|_1)(\beta)$, i.e., componentwise
$\begin{cases} \beta_i < 0: & 0 = [X^T(X\beta - Y)]_i - \lambda \\ \beta_i = 0: & 0 \in [X^T(X\beta - Y)]_i + \lambda[-1, 1] \quad (*) \\ \beta_i > 0: & 0 = [X^T(X\beta - Y)]_i + \lambda \end{cases}$
$(*) \iff \beta_i = 0,\ |[X^T(X\beta - Y)]_i| \le \lambda$. Sparsity!
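A hedged sketch (assuming NumPy; the function name is illustrative, not from the slides) of checking this componentwise optimality condition for a candidate solution:

```python
# Check whether beta satisfies 0 in X^T(X beta - Y) + lam * d||.||_1(beta), componentwise.
import numpy as np

def check_lasso_optimality(X, Y, beta, lam, tol=1e-8):
    g = X.T @ (X @ beta - Y)                              # gradient of the smooth part
    ok_pos = np.all(np.abs(g[beta > 0] + lam) <= tol)     # beta_i > 0: g_i = -lam
    ok_neg = np.all(np.abs(g[beta < 0] - lam) <= tol)     # beta_i < 0: g_i = +lam
    ok_zero = np.all(np.abs(g[beta == 0]) <= lam + tol)   # beta_i = 0: |g_i| <= lam
    return ok_pos and ok_neg and ok_zero
```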


Example: subdifferential

Example 2. $f(x) = \max\{x^2 - 1, 0\}$, $x \in \mathbb{R}$.
$\partial f(x) = \begin{cases} \{2x\}, & \text{if } x < -1 \text{ or } x > 1 \\ \{0\}, & \text{if } -1 < x < 1 \\ [-2, 0], & \text{if } x = -1 \\ [0, 2], & \text{if } x = 1 \end{cases}$

[Figure: graph of $f(x) = \max\{x^2 - 1, 0\}$ on $[-1.5, 1.5]$.]
Example: subdifferential

Example 3. Let $C$ be a convex set.
$\partial \delta_C(x) = \begin{cases} N_C(x), & \text{if } x \in C \\ \emptyset, & \text{if } x \notin C. \end{cases}$

Proof: Take $x \in C$. Then
$v \in \partial \delta_C(x)$
$\iff \delta_C(z) \ge \delta_C(x) + \langle v, z - x \rangle \ \forall z$
$\iff 0 \ge \langle v, z - x \rangle \ \forall z \in C$
$\iff v \in N_C(x)$
Lipschitz continuous

Definition (Lipschitz continuous)
A function $F : \mathbb{R}^n \to \mathbb{R}^m$ is said to be locally Lipschitz continuous if for any bounded open set $O \subseteq \mathbb{R}^n$, there exists a constant $L$ (depending on $O$) such that
$\|F(x) - F(y)\| \le L\|x - y\| \quad \forall x, y \in O.$
If the same holds with $O = \mathbb{R}^n$ (a single $L$ for all $x, y$), then $F$ is said to be globally Lipschitz continuous.

Example
(1) $f(x) = |x|$, $x \in \mathbb{R}$, is globally Lipschitz continuous with Lipschitz constant $L = 1$.
(2) $f(x) = x^2$, $x \in \mathbb{R}$, is locally Lipschitz continuous but not globally Lipschitz continuous.
Fenchel conjugate

Definition. Let $f : \mathcal{X} \to [-\infty, +\infty]$. The (Fenchel) conjugate of $f$ is defined by
$f^*(y) = \sup\{\langle y, x \rangle - f(x) \mid x \in \mathcal{X}\}, \quad y \in \mathcal{X}.$

Remark
• $f^*$ is always closed and convex, even if $f$ is neither convex nor closed.
• If $f : \mathcal{X} \to (-\infty, +\infty]$ is a closed proper convex function, then $(f^*)^* = f$.
Fenchel conjugate

Proposition. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function. The following are equivalent:
(1) $f(x) + f^*(y) = \langle x, y \rangle$
(2) $y \in \partial f(x)$
(3) $x \in \partial f^*(y)$

• "$y \in \partial f(x) \iff x \in \partial f^*(y)$" means that $\partial f^*$ is the inverse of $\partial f$ in the sense of multi-valued mappings.
Example: conjugate function

Example 1. Let $C \subseteq \mathcal{X}$ be a nonempty convex set. Compute the conjugate of the indicator function of $C$.
Solution. Recall that the indicator function of $C$ is
$\delta_C(x) = \begin{cases} 0, & \text{if } x \in C \\ +\infty, & \text{if } x \notin C \end{cases}$
Its conjugate
$\delta_C^*(y) = \sup\{\langle y, x \rangle - \delta_C(x) \mid x \in \mathcal{X}\} = \sup\{\langle y, x \rangle \mid x \in C\}$
is called the support function of $C$.
Example: conjugate function

Example 2. Let $f(x) = \|x\|_1$, $x \in \mathbb{R}^n$. Compute $f^*$.
Solution. $f^*(y) = \sup_x \{\langle y, x \rangle - \|x\|_1\}$.
Case 1: if $\|y\|_\infty \le 1$, we have $\langle y, x \rangle - \|x\|_1 \le \|x\|_1\|y\|_\infty - \|x\|_1 \le 0$, with equality at $x = 0$. Therefore $f^*(y) = 0$.
Case 2: if $\|y\|_\infty > 1$, there exists $k$ with $|y_k| > 1$. For a positive integer $m$, we construct
$\bar{x} = [0; \dots; 0; m\,\mathrm{sign}(y_k); 0; \dots; 0]$ (the $k$-th entry is $m\,\mathrm{sign}(y_k)$),
and therefore $f^*(y) \ge \langle y, \bar{x} \rangle - \|\bar{x}\|_1 = m(|y_k| - 1) \to +\infty$ as $m \to +\infty$. Therefore $f^*(y) = +\infty$.
In conclusion, $f^*(y) = \delta_C(y)$ with $C = \{y \in \mathbb{R}^n \mid \|y\|_\infty \le 1\}$.
[Exercise] Let $f(x) = \lambda\|x\|_p$, $x \in \mathbb{R}^n$, $1 < p < \infty$, $\lambda > 0$. Show that $f^*(y) = \delta_C(y)$, $C = \{y \in \mathbb{R}^n \mid \|y\|_q \le \lambda\}$, $\frac{1}{p} + \frac{1}{q} = 1$.
Proximal operator
Moreau envelope and proximal mapping

Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function. We define:
• the Moreau envelope (Moreau-Yosida regularization) of $f$ at $x$,
$M_f(x) = \min_y \left\{ f(y) + \tfrac{1}{2}\|y - x\|^2 \right\}$
• the proximal mapping of $f$ at $x$,
$P_f(x) = \arg\min_y \left\{ f(y) + \tfrac{1}{2}\|y - x\|^2 \right\}$

Properties:
• $M_f(x)$ is differentiable, with gradient $\nabla M_f(x) = x - P_f(x)$ (the Moreau envelope is a way to smooth a possibly non-differentiable convex function)
• $P_f(x)$ exists and is unique
• $M_f(x) \le f(x)$
• $\arg\min f(x) = \arg\min M_f(x)$
Example

Let $C \subseteq \mathcal{X}$ be a nonempty closed convex set and $f(x) = \delta_C(x)$ the indicator function of $C$. Its proximal mapping is
$P_f(x) = \arg\min_{y \in \mathcal{X}} \left\{ \delta_C(y) + \tfrac{1}{2}\|y - x\|^2 \right\} = \arg\min_{y \in C} \tfrac{1}{2}\|y - x\|^2 = \Pi_C(x).$
Its Moreau envelope is
$M_f(x) = \tfrac{1}{2}\|x - \Pi_C(x)\|^2.$
Example

Let $f(x) = \lambda|x|$, $x \in \mathbb{R}$. Its Moreau envelope (known as the Huber function) is
$M_f(x) = \begin{cases} \tfrac{1}{2}x^2, & |x| \le \lambda \\ \lambda|x| - \tfrac{\lambda^2}{2}, & |x| > \lambda \end{cases}$
Its proximal mapping (known as soft thresholding) is
$P_f(x) = \mathrm{sign}(x)\max\{|x| - \lambda, 0\}.$

[Figure: left, the soft-thresholding map $P_f$; middle and right, $f$ and its Moreau envelope $M_f$.]

We can see that $M_f$ is a smoothing of $f$, $M_f \le f$, and $\arg\min f(x) = \arg\min M_f(x)$.
Soft thresholding

The soft thresholding operator $S_\lambda : \mathbb{R}^n \to \mathbb{R}^n$ is defined as
$S_\lambda(x) = \begin{bmatrix} \mathrm{sign}(x_1)\max\{|x_1| - \lambda, 0\} \\ \mathrm{sign}(x_2)\max\{|x_2| - \lambda, 0\} \\ \vdots \\ \mathrm{sign}(x_n)\max\{|x_n| - \lambda, 0\} \end{bmatrix}$
for any $x = [x_1; \dots; x_n] \in \mathbb{R}^n$ and $\lambda > 0$.

Example. Given $x = [1.5; -0.4; 3; -2; 0.8]$,
$S_{0.5}(x) = [1; 0; 2.5; -1.5; 0.3], \qquad S_2(x) = [0; 0; 1; 0; 0].$
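A minimal sketch (assuming NumPy) of the soft-thresholding operator, i.e., the proximal mapping of $\lambda\|\cdot\|_1$:

```python
# Componentwise soft thresholding: sign(x_i) * max(|x_i| - lam, 0).
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([1.5, -0.4, 3.0, -2.0, 0.8])
print(soft_threshold(x, 0.5))   # [ 1.   0.   2.5 -1.5  0.3]
print(soft_threshold(x, 2.0))   # only the third entry survives, matching S_2(x) above
```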

Moreau envelope and proximal mapping

Theorem (Moreau decomposition)
Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function and $f^*$ its conjugate. For any $x \in \mathcal{X}$, it holds that
$x = P_f(x) + P_{f^*}(x)$
$\tfrac{1}{2}\|x\|^2 = M_f(x) + M_{f^*}(x)$

Example. Let $C \subseteq \mathcal{X}$ be a nonempty closed convex cone, $f(x) = \delta_C(x)$ and $f^*(x) = \delta_C^*(x) = \delta_{C^\circ}(x)$. Therefore
$x = \Pi_C(x) + \Pi_{C^\circ}(x).$

• $M_f(\cdot)$ is always differentiable even though $f$ may be non-differentiable
• $P_f(\cdot)$ is important in many optimization algorithms (e.g., the accelerated proximal gradient method introduced later)
• For many widely used regularizers, $P_f(\cdot)$ and $M_f(\cdot)$ have explicit expressions
(Accelerated) proximal gradient method
A proximal point view of gradient methods

To minimize a differentiable function, $\min_\beta f(\beta)$, the gradient method iterates
$\beta^{(k+1)} = \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}).$
The gradient step can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}$
(the first two terms are the linear approximation of $f$ at $\beta^{(k)}$; the last term is the proximal term).
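A hedged numerical check (assuming NumPy and SciPy; the toy quadratic and variable names are illustrative) that the gradient step indeed minimizes this linearization-plus-proximal-term model:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda beta: 0.5 * beta @ A @ beta - b @ beta   # smooth toy objective
grad_f = lambda beta: A @ beta - b

beta_k, alpha = np.array([1.0, 1.0]), 0.1
gradient_step = beta_k - alpha * grad_f(beta_k)

# linear approximation of f at beta_k plus the proximal term
model = lambda beta: f(beta_k) + grad_f(beta_k) @ (beta - beta_k) \
                     + np.sum((beta - beta_k) ** 2) / (2 * alpha)
model_min = minimize(model, beta_k).x
print(np.allclose(gradient_step, model_min, atol=1e-4))   # True
```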

Optimizing composite functions

$\min_{\beta \in \mathbb{R}^p} \ f(\beta) + g(\beta)$

• $f : \mathbb{R}^p \to \mathbb{R}$ is convex and differentiable, and $\nabla f$ is $L$-Lipschitz continuous
• $g : \mathbb{R}^p \to (-\infty, +\infty]$ is closed proper convex and non-differentiable
• For example, in lasso,
$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)}$
• Since $g$ is non-differentiable, we cannot apply gradient methods directly
Proximal gradient step

$\min_{\beta \in \mathbb{R}^p} \ f(\beta) + g(\beta)$

Gradient step (suppose the term $g(\beta)$ is absent):
$\beta^{(k+1)} = \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}),$
which can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}.$

Proximal gradient step (keep $g$ inside the subproblem):
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + g(\beta) + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}.$
Proximal gradient step

Proximal gradient step:
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + g(\beta) + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}$
After ignoring constant terms and completing the square, the above step can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ \tfrac{1}{2\alpha_k}\left\| \beta - \left(\beta^{(k)} - \alpha_k \nabla f(\beta^{(k)})\right) \right\|^2 + g(\beta) \right\} = P_{\alpha_k g}\left( \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}) \right)$

Derivation* (completing the square):
$\langle \nabla f(\beta^{(k)}), \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 = \langle \nabla f(\beta^{(k)}) - \tfrac{1}{\alpha_k}\beta^{(k)}, \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta\|^2 + \text{constant} = \tfrac{1}{\alpha_k}\langle \alpha_k \nabla f(\beta^{(k)}) - \beta^{(k)}, \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta\|^2 + \text{constant}$
Proximal gradient methods

Algorithm (Proximal gradient (PG) method)
Choose $\beta^{(0)}$ and a constant step length $\alpha > 0$. Set $k \leftarrow 0$.
repeat until convergence
    $\beta^{(k+1)} = P_{\alpha g}\left( \beta^{(k)} - \alpha \nabla f(\beta^{(k)}) \right)$
    $k \leftarrow k + 1$
end (repeat)
return $\beta^{(k)}$
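A minimal sketch (assuming NumPy; `grad_f` and `prox_g` are illustrative user-supplied callables, not names from the slides) of this loop:

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, alpha, max_iter=1000, tol=1e-6):
    """Iterate beta <- prox_{alpha g}(beta - alpha * grad_f(beta))."""
    beta = beta0.copy()
    for _ in range(max_iter):
        beta_new = prox_g(beta - alpha * grad_f(beta), alpha)
        if np.linalg.norm(beta_new - beta) < tol:   # simple convergence check
            return beta_new
        beta = beta_new
    return beta
```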

Proximal gradient methods

(Informal) in convex problems ($f$ and $g$ are convex), the iteration complexity of the PG method is $O(\tfrac{1}{k})$:
$f(\beta^{(k)}) + g(\beta^{(k)}) - \underbrace{\min_{\beta \in \mathbb{R}^p} \{f(\beta) + g(\beta)\}}_{\text{optimal value}} \le O\left(\tfrac{1}{k}\right)$
If we adopt the stopping condition
$f(\beta^{(k)}) + g(\beta^{(k)}) - \text{optimal value} \le 10^{-4},$
we need $O(10^4)$ PG iterations.
Nesterov’s accelerated method

Nesterov's idea: include a momentum term for acceleration.
$\bar{\beta}^{(k)} = \beta^{(k)} + \underbrace{\tfrac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)}_{\text{momentum term}}$
$\beta^{(k+1)} = P_{\alpha g}\left( \bar{\beta}^{(k)} - \alpha \nabla f(\bar{\beta}^{(k)}) \right)$
$\{t_k\}$ is a positive sequence such that $t_0 = t_1 = 1$ and $t_{k+1}^2 - t_{k+1} \le t_k^2$.

• e.g., $t_k = 1$ for all $k$. In this case the momentum term is $0$ and the method reduces to PG without acceleration.
• e.g., $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$ $\Rightarrow$ $t_0 = 1$, $t_1 = 1$, $t_2 = \frac{1 + \sqrt{5}}{2}, \dots$
• There are many other sequences satisfying the condition.
• Next we simply take $t_0 = t_1 = 1$, $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$.

Sequence tk


Take $t_0 = t_1 = 1$, $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$.

[Figure: left, the sequence $t_k$ versus $k$; right, the momentum coefficient $\frac{t_k - 1}{t_{k+1}}$ versus $k$, which increases towards 1.]
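A hedged sketch (assuming NumPy; the helper name is illustrative) generating the sequence $t_k$ and the momentum coefficients plotted above:

```python
import numpy as np

def nesterov_sequence(num_iters):
    t = np.ones(num_iters + 1)          # t_0 = t_1 = 1
    for k in range(1, num_iters):
        t[k + 1] = (1.0 + np.sqrt(1.0 + 4.0 * t[k] ** 2)) / 2.0
    momentum = (t[:-1] - 1.0) / t[1:]   # (t_k - 1) / t_{k+1}
    return t, momentum

t, momentum = nesterov_sequence(80)
print(t[:4])         # [1.  1.  1.618...  2.193...]
print(momentum[-1])  # close to 1
```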

Accelerated proximal gradient methods

Algorithm (Accelerated proximal gradient (APG) method)
Choose $\beta^{(0)}$ and a constant step length $\alpha > 0$. Set $t_0 = t_1 = 1$, $k \leftarrow 0$.
repeat until convergence
    $\bar{\beta}^{(k)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)$
    $\beta^{(k+1)} = P_{\alpha g}\left( \bar{\beta}^{(k)} - \alpha \nabla f(\bar{\beta}^{(k)}) \right)$
    $k \leftarrow k + 1$
    $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
end (repeat)
return $\beta^{(k)}$

The algorithmic framework follows [1]. It is built on Nesterov's accelerated method from 1983 [2].
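A generic sketch of this loop (assuming NumPy; `grad_f` and `prox_g` are illustrative user-supplied callables, and note that at $k = 0$ the momentum coefficient is $0$, so no $\beta^{(-1)}$ is needed):

```python
import numpy as np

def apg(grad_f, prox_g, beta0, alpha, max_iter=1000, tol=1e-6):
    beta_prev = beta0.copy()
    beta = beta0.copy()
    t, t_next = 1.0, 1.0                                # t_0 = t_1 = 1
    for _ in range(max_iter):
        beta_bar = beta + (t - 1.0) / t_next * (beta - beta_prev)   # momentum step
        beta_new = prox_g(beta_bar - alpha * grad_f(beta_bar), alpha)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta_prev, beta = beta, beta_new
        t, t_next = t_next, (1.0 + np.sqrt(1.0 + 4.0 * t_next ** 2)) / 2.0
    return beta
```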
Accelerated proximal gradient methods

(Informal) in convex problems ($f$ and $g$ are convex), the iteration complexity of the APG method is $O(\tfrac{1}{k^2})$:
$f(\beta^{(k)}) + g(\beta^{(k)}) - \underbrace{\min_{\beta \in \mathbb{R}^p} \{f(\beta) + g(\beta)\}}_{\text{optimal value}} \le O\left(\tfrac{1}{k^2}\right)$
If we adopt the stopping condition
$f(\beta^{(k)}) + g(\beta^{(k)}) - \text{optimal value} \le 10^{-4},$
we need only $O(10^2)$ APG iterations.
Accelerated proximal gradient methods

• Backtracking line search is also applicable for finding the step length $\alpha_k$.
• For simplicity, we take a constant step length. It should satisfy $\alpha \in (0, \tfrac{1}{L})$, where $L$ is the Lipschitz constant of $\nabla f(\cdot)$ (typically unknown).
• APG enjoys the same computational cost per iteration as PG.
• Iteration complexity: APG $O(\tfrac{1}{k^2})$; PG $O(\tfrac{1}{k})$.
APG for lasso

$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)} \qquad \text{(lasso problem)}$

Then $\nabla f(\beta) = X^T(X\beta - Y)$ with Lipschitz constant $L = \lambda_{\max}(X^T X)$. Choose step length $\alpha = 1/L$. APG iterations:
$\bar{\beta}^{(k)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)$
$\beta^{(k+1)} = S_{\lambda/L}\left( \bar{\beta}^{(k)} - \tfrac{1}{L} X^T(X\bar{\beta}^{(k)} - Y) \right)$

APG is also applicable to "logistic regression + lasso regularization".
APG for lasso

Design a stopping criterion for the lasso problem
$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)}$
We know that $\beta$ is a global minimizer of the lasso problem if and only if
$0 \in \nabla f(\beta) + \partial g(\beta),$
which is equivalent to [check]
$\beta = P_g(\beta - \nabla f(\beta)).$
Therefore, we can choose a tolerance $\varepsilon > 0$ and stop the method at $\beta^{(k)}$ once the stopping criterion is satisfied:
$\left\| \beta^{(k)} - S_\lambda\left( \beta^{(k)} - X^T(X\beta^{(k)} - Y) \right) \right\| < \varepsilon.$
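A hedged end-to-end sketch (assuming NumPy, random illustrative data, and the `soft_threshold` and `apg` helpers sketched earlier; this is not the lecture's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, lam = rng.standard_normal((50, 20)), rng.standard_normal(50), 0.5

L = np.linalg.eigvalsh(X.T @ X).max()                      # Lipschitz constant of grad f
grad_f = lambda beta: X.T @ (X @ beta - Y)
prox_g = lambda v, alpha: soft_threshold(v, alpha * lam)   # prox of alpha * lam * ||.||_1

beta = apg(grad_f, prox_g, np.zeros(20), alpha=1.0 / L)

# residual of the fixed-point optimality condition beta = S_lambda(beta - grad f(beta))
residual = np.linalg.norm(beta - soft_threshold(beta - grad_f(beta), lam))
print(residual)            # small when beta is (nearly) optimal
print(np.sum(beta != 0))   # lasso typically yields a sparse solution
```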

Restart

• A strategy to speed up APG is to restart the algorithm after a fixed number of iterations,
• using the latest iterate as the starting point of the new round of APG iterations (and resetting the momentum sequence $t_k$).
• A reasonable choice is to perform a restart every 100 or 200 iterations; a minimal sketch follows below.
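A minimal sketch of restarted APG, assuming the illustrative `apg` helper from the earlier sketch:

```python
# Each round starts from the latest iterate and resets the momentum sequence.
def apg_with_restart(grad_f, prox_g, beta0, alpha, rounds=10, iters_per_round=100):
    beta = beta0
    for _ in range(rounds):
        beta = apg(grad_f, prox_g, beta, alpha, max_iter=iters_per_round)
    return beta
```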

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[2] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, volume 269, pages 543-547, 1983.
