
APG

DSA5103 Lecture 4

Yangjing Zhang
05-Sep-2023
NUS
Today’s content

1. Basic convex analysis


2. Proximal operator
3. (Accelerated) proximal gradient method

Basic convex analysis
Norms

A vector norm on $\mathbb{R}^n$ is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ satisfying the following properties:
(1) $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$, and $\|x\| = 0 \iff x = 0$
(2) $\|\alpha x\| = |\alpha|\,\|x\|$ for all $\alpha \in \mathbb{R}$, $x \in \mathbb{R}^n$
(3) $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$

Example.
1. $\|x\|_1 = \sum_{i=1}^n |x_i|$ ($\ell_1$ norm)
2. $\|x\|_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$ ($\ell_2$ norm)
3. $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, where $1 \le p < \infty$ ($\ell_p$ norm)
4. $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$ ($\ell_\infty$ norm)
5. $\|x\|_{W,p} = \|Wx\|_p$, where $W$ is a fixed nonsingular matrix, $1 \le p \le \infty$
Inner product

For the space of $m \times n$ matrices, $\mathbb{R}^{m \times n}$, we define the standard inner product, for any $A, B \in \mathbb{R}^{m \times n}$,
$\langle A, B \rangle = \mathrm{Tr}(A^T B) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij}$

Recap: the trace of a square matrix $C \in \mathbb{R}^{n \times n}$ is $\mathrm{Tr}(C) = \sum_{i=1}^n C_{ii}$.

For the space of $n$-vectors, $\mathbb{R}^n$, we define the standard inner product, for any $x, y \in \mathbb{R}^n$,
$\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i$
Inner product

From the definition of inner products, the following properties hold for any matrices $A, B, C \in \mathbb{R}^{m \times n}$ and any scalars $a, b \in \mathbb{R}$:

• $\|A\|_F^2 := \langle A, A \rangle \ge 0$
• $\langle A, A \rangle = 0$ if and only if $A = 0$
• $\langle C, aA + bB \rangle = a\langle C, A \rangle + b\langle C, B \rangle$
• $\langle A + B, A + B \rangle = \langle A, A \rangle + 2\langle A, B \rangle + \langle B, B \rangle$
  (i.e., $\|A + B\|_F^2 = \|A\|_F^2 + 2\langle A, B \rangle + \|B\|_F^2$)
Projection onto a closed convex set

Theorem (Projection theorem). Let $C$ be a closed convex set.
(1) For every $z$, there exists a unique minimizer of
$\min_{x \in C} \ \tfrac{1}{2}\|x - z\|^2,$
denoted $\Pi_C(z)$ and called the projection of $z$ onto $C$.
(2) $x^* := \Pi_C(z)$ is the projection of $z$ onto $C$ if and only if
$\langle z - x^*, x - x^* \rangle \le 0 \quad \forall x \in C.$
Projection onto a closed convex set

Example.
1. $C = \mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x_i \ge 0, \ \forall i = 1, 2, \dots, n\}$ (nonnegative orthant)
$\Pi_C(z) = \Pi_{\mathbb{R}^n_+}(z) = \max\{z, 0\}$ (componentwise)

2. $C = \{x \in \mathbb{R}^n \mid \|x\|_2 \le 1\}$ ($\ell_2$-norm ball)
$\Pi_C(z) = \begin{cases} z, & \text{if } \|z\|_2 \le 1 \\ z/\|z\|_2, & \text{if } \|z\|_2 > 1 \end{cases} \;=\; \dfrac{z}{\max\{\|z\|_2, 1\}}$

[Figure 1: $C = \{x \in \mathbb{R}^2 \mid \|x\|_2 \le 1\}$, the $\ell_2$-norm ball.]
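As a quick numerical illustration (not part of the original slides), here is a minimal NumPy sketch of these two projections:

```python
# Minimal sketch, assuming NumPy: projections onto the nonnegative orthant and the unit l2-ball.
import numpy as np

def proj_nonneg_orthant(z):
    """Componentwise max with 0."""
    return np.maximum(z, 0.0)

def proj_l2_ball(z, radius=1.0):
    """Project z onto the l2-ball of given radius centered at 0."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else (radius / nrm) * z

z = np.array([1.5, -0.4, 0.3])
print(proj_nonneg_orthant(z))   # [1.5 0.  0.3]
print(proj_l2_ball(z))          # z / ||z||_2, since ||z||_2 > 1
```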


Projection onto a closed convex set

Example.

3. $C = \mathbb{S}^n_+$, the cone of $n \times n$ symmetric positive semidefinite matrices. For a given $A \in \mathbb{S}^n$ with eigenvalue decomposition
$A = Q \,\mathrm{Diag}(\lambda_1, \dots, \lambda_n)\, Q^T,$
the projection is
$\Pi_C(A) = \Pi_{\mathbb{S}^n_+}(A) = Q \,\mathrm{Diag}(\max\{\lambda_1, 0\}, \dots, \max\{\lambda_n, 0\})\, Q^T.$
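A hedged sketch (assuming NumPy) of this projection via an eigendecomposition:

```python
# Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues at 0.
import numpy as np

def proj_psd(A):
    A_sym = 0.5 * (A + A.T)            # symmetrize to guard against round-off
    lam, Q = np.linalg.eigh(A_sym)     # A_sym = Q diag(lam) Q^T
    return (Q * np.maximum(lam, 0.0)) @ Q.T

A = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
print(proj_psd(A))                       # the negative eigenvalue is replaced by 0
```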

arg min

$\arg\min_x f(x)$ denotes the set of $x$ at which $f(x)$ attains its minimum (the argument of the minimum).

[Two plots. Left: $\min_x f(x) = 0$, $\arg\min_x f(x) = \{2\}$. Right: $\min_x f(x) = 0$, $\arg\min_x f(x) = [-1, 1]$.]

With this notation, $\Pi_C(z) = \arg\min_{x \in C} \tfrac{1}{2}\|x - z\|^2$.
Extended real-valued function

Definition.
Let $\mathcal{X}$ be a Euclidean space (e.g., $\mathcal{X} = \mathbb{R}^n$ or $\mathbb{R}^{m \times n}$). Let $f : \mathcal{X} \to (-\infty, +\infty]$ be an extended real-valued function.¹
(1) The (effective) domain of $f$ is defined to be the set
$\mathrm{dom}(f) := \{x \in \mathcal{X} \mid f(x) < +\infty\}.$
(2) $f$ is said to be proper if $\mathrm{dom}(f) \neq \emptyset$.
(3) $f$ is said to be closed if its epigraph
$\mathrm{epi}(f) := \{(x, \alpha) \in \mathcal{X} \times \mathbb{R} \mid f(x) \le \alpha\}$
is closed.
(4) $f$ is said to be convex if its epigraph is convex.
¹ Here $f$ is allowed to take the value $+\infty$, but not the value $-\infty$.
Extended real-valued function

• For a real-valued function $f : \mathcal{X} \to \mathbb{R}$, this definition of convexity,
  $\mathrm{epi}(f)$ is convex,   (1)
coincides with the one we have used:
  $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y \in \mathrm{dom}(f)$, $\lambda \in [0, 1]$.   (2)
[Exercise] (1) $\iff$ (2)
• A convex function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ can be extended to a convex function on all of $\mathbb{R}^n$ by setting $f(x) = +\infty$ for $x \notin D$.
Example: extended real-valued function

$f(x) = \begin{cases} 0, & \text{if } x = 0 \\ 1, & \text{if } x > 0 \\ +\infty, & \text{if } x < 0 \end{cases}$

• $\mathrm{dom}(f) = [0, +\infty)$, so $f$ is proper
• $\mathrm{epi}(f) = \{0\} \times [0, +\infty) \,\cup\, (0, +\infty) \times [1, +\infty)$ is closed, i.e., $f$ is closed
• $\mathrm{epi}(f)$ is not convex, i.e., $f$ is not convex
Indicator function

Let $C$ be a nonempty set in $\mathcal{X}$. The indicator function of $C$ is
$\delta_C(x) = \begin{cases} 0, & \text{if } x \in C \\ +\infty, & \text{if } x \notin C \end{cases}$

[Figure: an indicator function $\delta_C : \mathbb{R}^2 \to (-\infty, +\infty]$.]

• $\mathrm{dom}(\delta_C) = C$, so $\delta_C$ is proper
• $\mathrm{epi}(\delta_C) = C \times [0, +\infty)$ is closed if $C$ is closed, i.e., $\delta_C(\cdot)$ is closed if $C$ is closed
• $\mathrm{epi}(\delta_C)$ is convex if $C$ is convex, i.e., $\delta_C(\cdot)$ is convex if $C$ is convex
Dual/polar cone

Definition (Cone)
A set $C \subseteq \mathcal{X}$ is called a cone if $\lambda x \in C$ whenever $x \in C$ and $\lambda \ge 0$.

Definition (Dual and polar cone)
The dual cone of a set $C \subseteq \mathcal{X}$ (not necessarily convex) is defined by
$C^* = \{y \in \mathcal{X} \mid \langle x, y \rangle \ge 0 \ \forall x \in C\}.$
The polar cone of $C$ is $C^\circ = -C^*$.
If $C^* = C$, then $C$ is said to be self-dual.

• $C^*$ is always a convex cone, even if $C$ is neither convex nor a cone.
Dual/polar cone

Figure 2: Left: A set C and its dual cone C ∗ . Right: A set C and its polar
cone C o . The dual cone and the polar cone are symmetric to each other with
respect to the origin. Image from internet

Example: self-dual cones

1. $\mathcal{X} = \mathbb{R}^n$. $C = \mathbb{R}^n_+$ is a self-dual closed convex cone.
   Proof: $C^* = \{y \in \mathbb{R}^n \mid \langle x, y \rangle \ge 0 \ \forall x \in \mathbb{R}^n_+\} = \mathbb{R}^n_+$.

2. $\mathcal{X} = \mathbb{S}^n$. $C = \mathbb{S}^n_+$ is a self-dual closed convex cone (the PSD cone).
   Proof: we want to show that $\{B \in \mathbb{S}^n \mid \langle A, B \rangle \ge 0 \ \forall A \in \mathbb{S}^n_+\} = \mathbb{S}^n_+$.
   [LHS ⊆ RHS] Take $B \in C^*$. For any $x \in \mathbb{R}^n$, we have $xx^T \in \mathbb{S}^n_+$ and thus $\langle xx^T, B \rangle = x^T B x \ge 0$, which implies $B \in \mathbb{S}^n_+$.
   [RHS ⊆ LHS] Take $B \in \mathbb{S}^n_+$. Compute its eigenvalue decomposition $B = \sum_{i=1}^n \lambda_i v_i v_i^T$, $\lambda_i \ge 0$. For any $A \in \mathbb{S}^n_+$, we have $\langle A, B \rangle = \langle A, \sum_{i=1}^n \lambda_i v_i v_i^T \rangle = \sum_{i=1}^n \lambda_i v_i^T A v_i \ge 0$.
Example: self-dual cones
3. $\mathcal{X} = \mathbb{R}^n$. $C := \{x \in \mathbb{R}^n \mid \sqrt{x_2^2 + \cdots + x_n^2} \le x_1,\ x_1 \ge 0\}$ is a self-dual closed convex cone (second-order cone).
   Proof: we want to show that $\{y \in \mathbb{R}^n \mid \langle x, y \rangle \ge 0 \ \forall x \in C\} = C$.
   [RHS ⊆ LHS] Take $y \in C$. For any $x \in C$,
   $\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \ \ge\ x_1 y_1 - \sqrt{\textstyle\sum_{i=2}^n x_i^2}\,\sqrt{\textstyle\sum_{i=2}^n y_i^2} \ \ge\ 0.$
   The first inequality follows from the Cauchy-Schwarz inequality, and the last inequality follows from the fact that $x, y \in C$.
   [LHS ⊆ RHS] Take $y \in C^*$. If $[y_2; \dots; y_n] = 0$, we take $x = [1; 0; \dots; 0] \in C$; then $\langle x, y \rangle = y_1 \ge 0$, so $y \in C$. Otherwise, we take
   $x = \left[\sqrt{\textstyle\sum_{i=2}^n y_i^2};\; -y_2;\; \dots;\; -y_n\right] \in C;$
   then
   $\langle x, y \rangle = y_1 \sqrt{\textstyle\sum_{i=2}^n y_i^2} - y_2^2 - \cdots - y_n^2 \ \ge\ 0 \ \Rightarrow\ y_1 \ge \sqrt{\textstyle\sum_{i=2}^n y_i^2} \ \Rightarrow\ y \in C.$
Example: self-dual cones

[Figure 3: Left: second-order cone $\{x \in \mathbb{R}^2 \mid x_1 \ge |x_2|\}$. Right: second-order cone $\{x \in \mathbb{R}^3 \mid x_1 \ge \sqrt{x_2^2 + x_3^2}\}$. Image from internet.]
Normal cone

Definition (Normal cone)
Let $C$ be a convex set in $\mathcal{X}$ and $\bar{x} \in C$. The normal cone of $C$ at $\bar{x} \in C$ is defined by
$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}.$
By convention, we let $N_C(\bar{x}) = \emptyset$ if $\bar{x} \notin C$.
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 1. $C = [0, 1] \subseteq \mathbb{R}$
$N_C(\bar{x}) = \begin{cases} (-\infty, 0], & \text{if } \bar{x} = 0 \\ [0, +\infty), & \text{if } \bar{x} = 1 \\ \{0\}, & \text{if } \bar{x} \in (0, 1) \\ \emptyset, & \text{if } \bar{x} \notin C \end{cases}$
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 2. $C = \{x \in \mathbb{R}^2 \mid \|x\| \le 1\} \subseteq \mathbb{R}^2$
$N_C(\bar{x}) = \begin{cases} \{\lambda \bar{x} \mid \lambda \ge 0\}, & \text{if } \|\bar{x}\| = 1 \\ \{0\}, & \text{if } \|\bar{x}\| < 1 \\ \emptyset, & \text{if } \bar{x} \notin C \end{cases}$

In the Normal cone proposition a few slides below, we show that $u \in N_C(\bar{x}) \iff \bar{x} = \Pi_C(\bar{x} + u)$. If $\|\bar{x}\| = 1$ and $u \neq 0$, then $\bar{x} = \Pi_C(\bar{x} + u)$ implies that $\|\bar{x} + u\| > 1$ and $\bar{x} = \Pi_C(\bar{x} + u) = \frac{\bar{x} + u}{\|\bar{x} + u\|}$. It follows that $u = (\|\bar{x} + u\| - 1)\bar{x}$ with $\|\bar{x} + u\| - 1 > 0$. That is, $N_C(\bar{x}) = \{\lambda \bar{x} \mid \lambda \ge 0\}$, since it is a cone.
Example: normal cones

$N_C(\bar{x}) := \{z \in \mathcal{X} \mid \langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C\}$

Example 3. $C$ = a triangle $\subseteq \mathbb{R}^2$ (illustrated in the lecture figure).

[Exercise] (1) $C = \{x \in \mathbb{R}^2 \mid x_1 + x_2 \le 1,\ x_1 \ge 0,\ x_2 \ge 0\}$. Find $N_C(\bar{x})$ for $\bar{x} = [0; 1]$, $\bar{x} = [0.5; 0.5]$, $\bar{x} = [0.1; 0.2]$.
(2) $C = \mathbb{R}^n_+$. Find $N_C(\bar{x})$ for $\bar{x} = [1; 1; 0; 0; \dots; 0]$.
Normal cone

Proposition. Let $C \subseteq \mathcal{X}$ be a nonempty convex set and $\bar{x} \in C$. Then
(1) $N_C(\bar{x})$ is a closed convex cone.
(2) If $\bar{x} \in \mathrm{int}(C)$ ($\bar{x}$ is an interior point of $C$), then $N_C(\bar{x}) = \{0\}$.
(3) If $C$ is a cone, then $N_C(\bar{x}) \subseteq C^\circ$.
Normal cone

Proof*: (1) Closedness is easy to prove [Exercise]. First we prove that it is a cone. Consider $z \in N_C(\bar{x})$ and $\lambda \ge 0$. By definition, $\langle z, x - \bar{x} \rangle \le 0 \ \forall x \in C \Rightarrow \langle \lambda z, x - \bar{x} \rangle \le 0 \ \forall x \in C$. Thus $\lambda z \in N_C(\bar{x})$. Next we show that if $z_1, z_2 \in N_C(\bar{x})$, then $z_1 + z_2 \in N_C(\bar{x})$. By definition, $\langle z_1, x - \bar{x} \rangle \le 0$ and $\langle z_2, x - \bar{x} \rangle \le 0$ for all $x \in C$. Thus $\langle z_1 + z_2, x - \bar{x} \rangle \le 0$ for all $x \in C$, which implies that $z_1 + z_2 \in N_C(\bar{x})$ and that $N_C(\bar{x})$ is convex.
(2) Let $z \in N_C(\bar{x})$. Since $\bar{x} \in \mathrm{int}(C)$, there exists $\epsilon > 0$ such that $\bar{x} + tz \in C$ for $|t| < \epsilon$. By the definition of the normal cone, $0 \ge \langle z, (\bar{x} + tz) - \bar{x} \rangle = t\|z\|^2$. By considering both positive and negative $t$, we get $\|z\|^2 = 0$. Hence $z = 0$.
(3) Suppose $C$ is a cone. Then $\bar{x} + x \in C$ for any $x \in C$. Thus for any $z \in N_C(\bar{x})$, we have $\langle z, x \rangle = \langle z, (x + \bar{x}) - \bar{x} \rangle \le 0 \ \forall x \in C$. Hence $z \in C^\circ$. That is, $N_C(\bar{x}) \subseteq C^\circ$.
Normal cone

Proposition. Let $C \subseteq \mathcal{X}$ be a nonempty closed convex set. Then for any $y \in C$,
$u \in N_C(y) \iff y = \Pi_C(y + u).$

Proof. "⇒" Suppose $u \in N_C(y)$. Then $\langle u, x - y \rangle \le 0$ for all $x \in C$. Thus
$\langle (y + u) - y, x - y \rangle \le 0$ for all $x \in C$,
which implies (by the projection theorem) that $y = \Pi_C(y + u)$.
"⇐" Suppose $y = \Pi_C(y + u)$. Then we know that
$\langle u, x - y \rangle = \langle (y + u) - y, x - y \rangle \le 0$ for all $x \in C$,
which implies that $u \in N_C(y)$.
Subdifferential

Definition. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a convex function.
We call $v$ a subgradient of $f$ at $x \in \mathrm{dom}(f)$ if
$f(z) \ge f(x) + \langle v, z - x \rangle \quad \forall z \in \mathcal{X}.$
The set of all subgradients at $x$ is called the subdifferential of $f$ at $x$, denoted
$\partial f(x) = \{v \mid f(z) \ge f(x) + \langle v, z - x \rangle \ \forall z \in \mathcal{X}\}.$
By convention, $\partial f(x) = \emptyset$ for any $x \notin \mathrm{dom}(f)$.
Subdifferential and optimization

The subgradient is an extension of the gradient:
• If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
  Proof: If $v \in \partial f(x)$, then $f(x + h) \ge f(x) + \langle v, h \rangle$ for all $h$. Taking $h = t(v - \nabla f(x))$, $t > 0$, and using the first-order Taylor expansion gives $t\|v - \nabla f(x)\|^2 \le o(t)$ for all $t > 0$, which implies $v = \nabla f(x)$.

Theorem. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a proper convex function. Then $\bar{x} \in \mathcal{X}$ is a global minimizer of $\min_{x \in \mathcal{X}} f(x)$ if and only if $0 \in \partial f(\bar{x})$.

Proof. By the subgradient inequality, $f(z) \ge f(\bar{x}) + \langle v, z - \bar{x} \rangle$ for all $z \in \mathcal{X}$ and $v \in \partial f(\bar{x})$. Taking $v = 0$: $0 \in \partial f(\bar{x})$ if and only if $f(z) \ge f(\bar{x})$ for all $z \in \mathcal{X}$, i.e., $\bar{x}$ is a global minimizer.
Example: subdifferential

Example 1. $f(x) = |x|$, $x \in \mathbb{R}$.
$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0 \\ [-1, 1], & \text{if } x = 0 \\ \{1\}, & \text{if } x > 0 \end{cases}$

[Figure: graph of $f(x) = |x|$ on $[-1, 1]$.]

The $\ell_1$ norm promotes sparsity: for example, in lasso
$\min_{\beta \in \mathbb{R}^p} \ \tfrac{1}{2}\|X\beta - Y\|^2 + \lambda\|\beta\|_1.$
Optimality condition: $0 \in X^T(X\beta - Y) + \lambda\,\partial(\|\cdot\|_1)(\beta)$, i.e., componentwise
$\begin{cases} \beta_i < 0: & 0 = [X^T(X\beta - Y)]_i - \lambda \\ \beta_i = 0: & 0 \in [X^T(X\beta - Y)]_i + \lambda[-1, 1] \quad (*) \\ \beta_i > 0: & 0 = [X^T(X\beta - Y)]_i + \lambda \end{cases}$
$(*) \iff \beta_i = 0,\ |[X^T(X\beta - Y)]_i| \le \lambda$. Sparsity!
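A hedged sketch (assuming NumPy; the function name is illustrative, not from the slides) of checking this componentwise optimality condition for a candidate solution:

```python
# Check whether beta satisfies 0 in X^T(X beta - Y) + lam * d||.||_1(beta), componentwise.
import numpy as np

def check_lasso_optimality(X, Y, beta, lam, tol=1e-8):
    g = X.T @ (X @ beta - Y)                              # gradient of the smooth part
    ok_pos = np.all(np.abs(g[beta > 0] + lam) <= tol)     # beta_i > 0: g_i = -lam
    ok_neg = np.all(np.abs(g[beta < 0] - lam) <= tol)     # beta_i < 0: g_i = +lam
    ok_zero = np.all(np.abs(g[beta == 0]) <= lam + tol)   # beta_i = 0: |g_i| <= lam
    return ok_pos and ok_neg and ok_zero
```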


Example: subdifferential

Example 2. $f(x) = \max\{x^2 - 1, 0\}$, $x \in \mathbb{R}$.
$\partial f(x) = \begin{cases} \{2x\}, & \text{if } x < -1 \text{ or } x > 1 \\ \{0\}, & \text{if } -1 < x < 1 \\ [-2, 0], & \text{if } x = -1 \\ [0, 2], & \text{if } x = 1 \end{cases}$

[Figure: graph of $f(x) = \max\{x^2 - 1, 0\}$ on $[-1.5, 1.5]$.]
Example: subdifferential

Example 3. Let $C$ be a convex set.
$\partial \delta_C(x) = \begin{cases} N_C(x), & \text{if } x \in C \\ \emptyset, & \text{if } x \notin C. \end{cases}$

Proof: Take $x \in C$. Then
$v \in \partial \delta_C(x)$
$\iff \delta_C(z) \ge \delta_C(x) + \langle v, z - x \rangle \ \forall z$
$\iff 0 \ge \langle v, z - x \rangle \ \forall z \in C$
$\iff v \in N_C(x)$
Lipschitz continuous

Definition (Lipschitz continuous)
A function $F : \mathbb{R}^n \to \mathbb{R}^m$ is said to be locally Lipschitz continuous if for any bounded open set $O \subseteq \mathbb{R}^n$, there exists a constant $L$ (depending on $O$) such that
$\|F(x) - F(y)\| \le L\|x - y\| \quad \forall x, y \in O.$
If the same holds with $O = \mathbb{R}^n$ (a single $L$ for all $x, y$), then $F$ is said to be globally Lipschitz continuous.

Example
(1) $f(x) = |x|$, $x \in \mathbb{R}$, is globally Lipschitz continuous with Lipschitz constant $L = 1$.
(2) $f(x) = x^2$, $x \in \mathbb{R}$, is locally Lipschitz continuous but not globally Lipschitz continuous.
Fenchel conjugate

Definition. Let $f : \mathcal{X} \to [-\infty, +\infty]$. The (Fenchel) conjugate of $f$ is defined by
$f^*(y) = \sup\{\langle y, x \rangle - f(x) \mid x \in \mathcal{X}\}, \quad y \in \mathcal{X}.$

Remark
• $f^*$ is always closed and convex, even if $f$ is neither convex nor closed.
• If $f : \mathcal{X} \to (-\infty, +\infty]$ is a closed proper convex function, then $(f^*)^* = f$.
Fenchel conjugate

Proposition. Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function. The following are equivalent:
(1) $f(x) + f^*(y) = \langle x, y \rangle$
(2) $y \in \partial f(x)$
(3) $x \in \partial f^*(y)$

• "$y \in \partial f(x) \iff x \in \partial f^*(y)$" means that $\partial f^*$ is the inverse of $\partial f$ in the sense of multi-valued mappings.
Example: conjugate function

Example 1. Let $C \subseteq \mathcal{X}$ be a nonempty convex set. Compute the conjugate of the indicator function of $C$.
Solution. Recall that the indicator function of $C$ is
$\delta_C(x) = \begin{cases} 0, & \text{if } x \in C \\ +\infty, & \text{if } x \notin C \end{cases}$
Its conjugate
$\delta_C^*(y) = \sup\{\langle y, x \rangle - \delta_C(x) \mid x \in \mathcal{X}\} = \sup\{\langle y, x \rangle \mid x \in C\}$
is called the support function of $C$.
Example: conjugate function

Example 2. Let $f(x) = \|x\|_1$, $x \in \mathbb{R}^n$. Compute $f^*$.
Solution. $f^*(y) = \sup_x \{\langle y, x \rangle - \|x\|_1\}$.
Case 1: if $\|y\|_\infty \le 1$, we have $\langle y, x \rangle - \|x\|_1 \le \|x\|_1\|y\|_\infty - \|x\|_1 \le 0$, with equality at $x = 0$. Therefore $f^*(y) = 0$.
Case 2: if $\|y\|_\infty > 1$, there exists $k$ with $|y_k| > 1$. For a positive integer $m$, we construct
$\bar{x} = [0; \dots; 0; m\,\mathrm{sign}(y_k); 0; \dots; 0]$ (the $k$-th entry is $m\,\mathrm{sign}(y_k)$),
and therefore $f^*(y) \ge \langle y, \bar{x} \rangle - \|\bar{x}\|_1 = m(|y_k| - 1) \to +\infty$ as $m \to +\infty$. Therefore $f^*(y) = +\infty$.
In conclusion, $f^*(y) = \delta_C(y)$ with $C = \{y \in \mathbb{R}^n \mid \|y\|_\infty \le 1\}$.
[Exercise] Let $f(x) = \lambda\|x\|_p$, $x \in \mathbb{R}^n$, $1 < p < \infty$, $\lambda > 0$. Show that $f^*(y) = \delta_C(y)$, $C = \{y \in \mathbb{R}^n \mid \|y\|_q \le \lambda\}$, $\frac{1}{p} + \frac{1}{q} = 1$.
Proximal operator
Moreau envelope and proximal mapping

Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function. We define:
• the Moreau envelope (Moreau-Yosida regularization) of $f$ at $x$,
$M_f(x) = \min_y \left\{ f(y) + \tfrac{1}{2}\|y - x\|^2 \right\}$
• the proximal mapping of $f$ at $x$,
$P_f(x) = \arg\min_y \left\{ f(y) + \tfrac{1}{2}\|y - x\|^2 \right\}$

Properties:
• $M_f(x)$ is differentiable, with gradient $\nabla M_f(x) = x - P_f(x)$ (the Moreau envelope is a way to smooth a possibly non-differentiable convex function)
• $P_f(x)$ exists and is unique
• $M_f(x) \le f(x)$
• $\arg\min f(x) = \arg\min M_f(x)$
Example

Let $C \subseteq \mathcal{X}$ be a nonempty closed convex set and $f(x) = \delta_C(x)$ the indicator function of $C$. Its proximal mapping is
$P_f(x) = \arg\min_{y \in \mathcal{X}} \left\{ \delta_C(y) + \tfrac{1}{2}\|y - x\|^2 \right\} = \arg\min_{y \in C} \tfrac{1}{2}\|y - x\|^2 = \Pi_C(x).$
Its Moreau envelope is
$M_f(x) = \tfrac{1}{2}\|x - \Pi_C(x)\|^2.$
Example

Let $f(x) = \lambda|x|$, $x \in \mathbb{R}$. Its Moreau envelope (known as the Huber function) is
$M_f(x) = \begin{cases} \tfrac{1}{2}x^2, & |x| \le \lambda \\ \lambda|x| - \tfrac{\lambda^2}{2}, & |x| > \lambda \end{cases}$
Its proximal mapping (known as soft thresholding) is
$P_f(x) = \mathrm{sign}(x)\max\{|x| - \lambda, 0\}.$

[Figure: left, the soft-thresholding map $P_f$; middle and right, $f$ and its Moreau envelope $M_f$.]

We can see that $M_f$ is a smoothing of $f$, $M_f \le f$, and $\arg\min f(x) = \arg\min M_f(x)$.
Soft thresholding

The soft thresholding operator $S_\lambda : \mathbb{R}^n \to \mathbb{R}^n$ is defined as
$S_\lambda(x) = \begin{bmatrix} \mathrm{sign}(x_1)\max\{|x_1| - \lambda, 0\} \\ \mathrm{sign}(x_2)\max\{|x_2| - \lambda, 0\} \\ \vdots \\ \mathrm{sign}(x_n)\max\{|x_n| - \lambda, 0\} \end{bmatrix}$
for any $x = [x_1; \dots; x_n] \in \mathbb{R}^n$ and $\lambda > 0$.

Example. Given $x = [1.5; -0.4; 3; -2; 0.8]$,
$S_{0.5}(x) = [1; 0; 2.5; -1.5; 0.3], \qquad S_2(x) = [0; 0; 1; 0; 0].$
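A minimal sketch (assuming NumPy) of the soft-thresholding operator, i.e., the proximal mapping of $\lambda\|\cdot\|_1$:

```python
# Componentwise soft thresholding: sign(x_i) * max(|x_i| - lam, 0).
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([1.5, -0.4, 3.0, -2.0, 0.8])
print(soft_threshold(x, 0.5))   # [ 1.   0.   2.5 -1.5  0.3]
print(soft_threshold(x, 2.0))   # only the third entry survives, matching S_2(x) above
```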

Moreau envelope and proximal mapping

Theorem (Moreau decomposition)
Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a closed proper convex function and $f^*$ its conjugate. For any $x \in \mathcal{X}$, it holds that
$x = P_f(x) + P_{f^*}(x)$
$\tfrac{1}{2}\|x\|^2 = M_f(x) + M_{f^*}(x)$

Example. Let $C \subseteq \mathcal{X}$ be a nonempty closed convex cone, $f(x) = \delta_C(x)$ and $f^*(x) = \delta_C^*(x) = \delta_{C^\circ}(x)$. Therefore
$x = \Pi_C(x) + \Pi_{C^\circ}(x).$

• $M_f(\cdot)$ is always differentiable even though $f$ may be non-differentiable
• $P_f(\cdot)$ is important in many optimization algorithms (e.g., the accelerated proximal gradient method introduced later)
• For many widely used regularizers, $P_f(\cdot)$ and $M_f(\cdot)$ have explicit expressions
(Accelerated) proximal gradient method
A proximal point view of gradient methods

To minimize a differentiable function, $\min_\beta f(\beta)$, the gradient method iterates
$\beta^{(k+1)} = \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}).$
The gradient step can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}$
(the first two terms are the linear approximation of $f$ at $\beta^{(k)}$; the last term is the proximal term).
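A hedged numerical check (assuming NumPy and SciPy; the toy quadratic and variable names are illustrative) that the gradient step indeed minimizes this linearization-plus-proximal-term model:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda beta: 0.5 * beta @ A @ beta - b @ beta   # smooth toy objective
grad_f = lambda beta: A @ beta - b

beta_k, alpha = np.array([1.0, 1.0]), 0.1
gradient_step = beta_k - alpha * grad_f(beta_k)

# linear approximation of f at beta_k plus the proximal term
model = lambda beta: f(beta_k) + grad_f(beta_k) @ (beta - beta_k) \
                     + np.sum((beta - beta_k) ** 2) / (2 * alpha)
model_min = minimize(model, beta_k).x
print(np.allclose(gradient_step, model_min, atol=1e-4))   # True
```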

Optimizing composite functions

$\min_{\beta \in \mathbb{R}^p} \ f(\beta) + g(\beta)$

• $f : \mathbb{R}^p \to \mathbb{R}$ is convex and differentiable, and $\nabla f$ is $L$-Lipschitz continuous
• $g : \mathbb{R}^p \to (-\infty, +\infty]$ is closed proper convex and non-differentiable
• For example, in lasso,
$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)}$
• Since $g$ is non-differentiable, we cannot apply gradient methods directly
Proximal gradient step

$\min_{\beta \in \mathbb{R}^p} \ f(\beta) + g(\beta)$

Gradient step (suppose the term $g(\beta)$ is absent):
$\beta^{(k+1)} = \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}),$
which can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}.$

Proximal gradient step (keep $g$ inside the subproblem):
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + g(\beta) + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}.$
Proximal gradient step

Proximal gradient step:
$\beta^{(k+1)} = \arg\min_\beta \left\{ f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + g(\beta) + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 \right\}$
After ignoring constant terms and completing the square, the above step can be written equivalently as
$\beta^{(k+1)} = \arg\min_\beta \left\{ \tfrac{1}{2\alpha_k}\left\| \beta - \left(\beta^{(k)} - \alpha_k \nabla f(\beta^{(k)})\right) \right\|^2 + g(\beta) \right\} = P_{\alpha_k g}\left( \beta^{(k)} - \alpha_k \nabla f(\beta^{(k)}) \right)$

Derivation* (completing the square):
$\langle \nabla f(\beta^{(k)}), \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta - \beta^{(k)}\|^2 = \langle \nabla f(\beta^{(k)}) - \tfrac{1}{\alpha_k}\beta^{(k)}, \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta\|^2 + \text{constant} = \tfrac{1}{\alpha_k}\langle \alpha_k \nabla f(\beta^{(k)}) - \beta^{(k)}, \beta \rangle + \tfrac{1}{2\alpha_k}\|\beta\|^2 + \text{constant}$
Proximal gradient methods

Algorithm (Proximal gradient (PG) method)
Choose $\beta^{(0)}$ and a constant step length $\alpha > 0$. Set $k \leftarrow 0$.
repeat until convergence
    $\beta^{(k+1)} = P_{\alpha g}\left( \beta^{(k)} - \alpha \nabla f(\beta^{(k)}) \right)$
    $k \leftarrow k + 1$
end (repeat)
return $\beta^{(k)}$
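A minimal sketch (assuming NumPy; `grad_f` and `prox_g` are illustrative user-supplied callables, not names from the slides) of this loop:

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, alpha, max_iter=1000, tol=1e-6):
    """Iterate beta <- prox_{alpha g}(beta - alpha * grad_f(beta))."""
    beta = beta0.copy()
    for _ in range(max_iter):
        beta_new = prox_g(beta - alpha * grad_f(beta), alpha)
        if np.linalg.norm(beta_new - beta) < tol:   # simple convergence check
            return beta_new
        beta = beta_new
    return beta
```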

Proximal gradient methods

(Informal) in convex problems ($f$ and $g$ are convex), the iteration complexity of the PG method is $O(\tfrac{1}{k})$:
$f(\beta^{(k)}) + g(\beta^{(k)}) - \underbrace{\min_{\beta \in \mathbb{R}^p} \{f(\beta) + g(\beta)\}}_{\text{optimal value}} \le O\left(\tfrac{1}{k}\right)$
If we adopt the stopping condition
$f(\beta^{(k)}) + g(\beta^{(k)}) - \text{optimal value} \le 10^{-4},$
we need $O(10^4)$ PG iterations.
Nesterov’s accelerated method

Nesterov's idea: include a momentum term for acceleration.
$\bar{\beta}^{(k)} = \beta^{(k)} + \underbrace{\tfrac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)}_{\text{momentum term}}$
$\beta^{(k+1)} = P_{\alpha g}\left( \bar{\beta}^{(k)} - \alpha \nabla f(\bar{\beta}^{(k)}) \right)$
$\{t_k\}$ is a positive sequence such that $t_0 = t_1 = 1$ and $t_{k+1}^2 - t_{k+1} \le t_k^2$.

• e.g., $t_k = 1$ for all $k$. In this case the momentum term is $0$ and the method reduces to PG without acceleration.
• e.g., $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$ $\Rightarrow$ $t_0 = 1$, $t_1 = 1$, $t_2 = \frac{1 + \sqrt{5}}{2}, \dots$
• There are many other sequences satisfying the condition.
• Next we simply take $t_0 = t_1 = 1$, $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$.

Sequence tk


Take $t_0 = t_1 = 1$, $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$.

[Figure: left, the sequence $t_k$ versus $k$; right, the momentum coefficient $\frac{t_k - 1}{t_{k+1}}$ versus $k$, which increases towards 1.]
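A hedged sketch (assuming NumPy; the helper name is illustrative) generating the sequence $t_k$ and the momentum coefficients plotted above:

```python
import numpy as np

def nesterov_sequence(num_iters):
    t = np.ones(num_iters + 1)          # t_0 = t_1 = 1
    for k in range(1, num_iters):
        t[k + 1] = (1.0 + np.sqrt(1.0 + 4.0 * t[k] ** 2)) / 2.0
    momentum = (t[:-1] - 1.0) / t[1:]   # (t_k - 1) / t_{k+1}
    return t, momentum

t, momentum = nesterov_sequence(80)
print(t[:4])         # [1.  1.  1.618...  2.193...]
print(momentum[-1])  # close to 1
```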

Accelerated proximal gradient methods

Algorithm (Accelerated proximal gradient (APG) method)
Choose $\beta^{(0)}$ and a constant step length $\alpha > 0$. Set $t_0 = t_1 = 1$, $k \leftarrow 0$.
repeat until convergence
    $\bar{\beta}^{(k)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)$
    $\beta^{(k+1)} = P_{\alpha g}\left( \bar{\beta}^{(k)} - \alpha \nabla f(\bar{\beta}^{(k)}) \right)$
    $k \leftarrow k + 1$
    $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
end (repeat)
return $\beta^{(k)}$

The algorithmic framework follows [1]. It is built on Nesterov's accelerated method from 1983 [2].
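A generic sketch of this loop (assuming NumPy; `grad_f` and `prox_g` are illustrative user-supplied callables, and note that at $k = 0$ the momentum coefficient is $0$, so no $\beta^{(-1)}$ is needed):

```python
import numpy as np

def apg(grad_f, prox_g, beta0, alpha, max_iter=1000, tol=1e-6):
    beta_prev = beta0.copy()
    beta = beta0.copy()
    t, t_next = 1.0, 1.0                                # t_0 = t_1 = 1
    for _ in range(max_iter):
        beta_bar = beta + (t - 1.0) / t_next * (beta - beta_prev)   # momentum step
        beta_new = prox_g(beta_bar - alpha * grad_f(beta_bar), alpha)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta_prev, beta = beta, beta_new
        t, t_next = t_next, (1.0 + np.sqrt(1.0 + 4.0 * t_next ** 2)) / 2.0
    return beta
```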
Accelerated proximal gradient methods

(Informal) in convex problems ($f$ and $g$ are convex), the iteration complexity of the APG method is $O(\tfrac{1}{k^2})$:
$f(\beta^{(k)}) + g(\beta^{(k)}) - \underbrace{\min_{\beta \in \mathbb{R}^p} \{f(\beta) + g(\beta)\}}_{\text{optimal value}} \le O\left(\tfrac{1}{k^2}\right)$
If we adopt the stopping condition
$f(\beta^{(k)}) + g(\beta^{(k)}) - \text{optimal value} \le 10^{-4},$
we need only $O(10^2)$ APG iterations.
Accelerated proximal gradient methods

• Backtracking line search is also applicable for finding the step length $\alpha_k$.
• For simplicity, we take a constant step length. It should satisfy $\alpha \in (0, \tfrac{1}{L})$, where $L$ is the Lipschitz constant of $\nabla f(\cdot)$ (typically unknown).
• APG enjoys the same computational cost per iteration as PG.
• Iteration complexity: APG $O(\tfrac{1}{k^2})$; PG $O(\tfrac{1}{k})$.
APG for lasso

$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)} \qquad \text{(lasso problem)}$

Then $\nabla f(\beta) = X^T(X\beta - Y)$ with Lipschitz constant $L = \lambda_{\max}(X^T X)$. Choose step length $\alpha = 1/L$. APG iterations:
$\bar{\beta}^{(k)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\left( \beta^{(k)} - \beta^{(k-1)} \right)$
$\beta^{(k+1)} = S_{\lambda/L}\left( \bar{\beta}^{(k)} - \tfrac{1}{L} X^T(X\bar{\beta}^{(k)} - Y) \right)$

APG is also applicable to "logistic regression + lasso regularization".
APG for lasso

Design a stopping criterion for the lasso problem
$\min_{\beta \in \mathbb{R}^p} \ \underbrace{\tfrac{1}{2}\|X\beta - Y\|^2}_{= f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{= g(\beta)}$
We know that $\beta$ is a global minimizer of the lasso problem if and only if
$0 \in \nabla f(\beta) + \partial g(\beta),$
which is equivalent to [check]
$\beta = P_g(\beta - \nabla f(\beta)).$
Therefore, we can choose a tolerance $\varepsilon > 0$ and stop the method at $\beta^{(k)}$ once the stopping criterion is satisfied:
$\left\| \beta^{(k)} - S_\lambda\left( \beta^{(k)} - X^T(X\beta^{(k)} - Y) \right) \right\| < \varepsilon.$
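A hedged end-to-end sketch (assuming NumPy, random illustrative data, and the `soft_threshold` and `apg` helpers sketched earlier; this is not the lecture's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, lam = rng.standard_normal((50, 20)), rng.standard_normal(50), 0.5

L = np.linalg.eigvalsh(X.T @ X).max()                      # Lipschitz constant of grad f
grad_f = lambda beta: X.T @ (X @ beta - Y)
prox_g = lambda v, alpha: soft_threshold(v, alpha * lam)   # prox of alpha * lam * ||.||_1

beta = apg(grad_f, prox_g, np.zeros(20), alpha=1.0 / L)

# residual of the fixed-point optimality condition beta = S_lambda(beta - grad f(beta))
residual = np.linalg.norm(beta - soft_threshold(beta - grad_f(beta), lam))
print(residual)            # small when beta is (nearly) optimal
print(np.sum(beta != 0))   # lasso typically yields a sparse solution
```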

Restart

• A strategy to speed up APG is to restart the algorithm after a fixed number of iterations,
• using the latest iterate as the starting point of the new round of APG iterations (and resetting the momentum sequence $t_k$).
• A reasonable choice is to perform a restart every 100 or 200 iterations; a minimal sketch follows below.
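A minimal sketch of restarted APG, assuming the illustrative `apg` helper from the earlier sketch:

```python
# Each round starts from the latest iterate and resets the momentum sequence.
def apg_with_restart(grad_f, prox_g, beta0, alpha, rounds=10, iters_per_round=100):
    beta = beta0
    for _ in range(rounds):
        beta = apg(grad_f, prox_g, beta, alpha, max_iter=iters_per_round)
    return beta
```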

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[2] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, volume 269, pages 543-547, 1983.
