
L. Vandenberghe, ECE236C (Spring 2019)

1. Gradient method

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Gradient method

to minimize a convex differentiable function f: choose an initial point x₀ and repeat

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, . . .

the step size t_k is constant or determined by line search
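The iteration can be sketched in a few lines of Python; here `grad` is a user-supplied gradient oracle and the constant step size `t` is illustrative (the slides leave the step-size rule open):

```python
import numpy as np

def gradient_method(grad, x0, t, iters):
    # repeat x_{k+1} = x_k - t * grad(x_k) with a constant step size t
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# example: f(x) = ||x||_2^2 / 2 has gradient x and minimizer x* = 0
x_final = gradient_method(lambda x: x, x0=[4.0, -2.0], t=0.5, iters=100)
```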

Advantages

• every iteration is inexpensive


• does not require second derivatives

Notation

• x_k can refer to the kth element of a sequence, or to the kth component of vector x

• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence

Gradient method 1.2


Quadratic example

f(x) = (1/2)(x₁² + γx₂²)   (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1)

‖x^(k) − x^⋆‖₂ / ‖x^(0) − x^⋆‖₂ = ((γ − 1)/(γ + 1))^k

where x^⋆ = 0

[figure: iterates zigzag in the (x₁, x₂) plane toward x^⋆ = 0]

gradient method is often slow; convergence very dependent on scaling
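This rate can be checked numerically. The sketch below is illustrative: it uses the closed-form exact line search step t = gᵀg / (gᵀAg) for a quadratic f(x) = xᵀAx/2 and recovers the per-iteration factor (γ − 1)/(γ + 1) for this particular starting point:

```python
import numpy as np

def exact_ls_gradient(A, x0, iters):
    # gradient method with exact line search for f(x) = x^T A x / 2
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = A @ x
        t = (g @ g) / (g @ (A @ g))  # exact minimizer of f(x - t g) over t
        x = x - t * g
    return x

gamma = 10.0
x0 = np.array([gamma, 1.0])
k = 20
xk = exact_ls_gradient(np.diag([1.0, gamma]), x0, k)
per_step = (np.linalg.norm(xk) / np.linalg.norm(x0)) ** (1.0 / k)
# per_step equals (gamma - 1) / (gamma + 1) = 9/11 for this starting point
```

Scaling x₂ by √γ would make the problem perfectly conditioned and the method converge in one step, which is the sense in which convergence depends on scaling.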

Gradient method 1.3


Nondifferentiable example

f(x) = √(x₁² + γx₂²) if |x₂| ≤ x₁,    f(x) = (x₁ + γ|x₂|)/√(1 + γ) if |x₂| > x₁

with exact line search, starting point x^(0) = (γ, 1), converges to non-optimal point

[figure: iterates zigzag in the (x₁, x₂) plane toward a non-optimal point]

gradient method does not handle nondifferentiable problems

Gradient method 1.4


First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Methods with improved convergence

• conjugate gradient method


• accelerated gradient method
• quasi-Newton methods

Gradient method 1.5


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f, Jensen’s inequality can be replaced with

f(y) ≥ f(x) + ∇f(x)^T(y − x)   for all x, y ∈ dom f

Second-order condition

for twice differentiable f, Jensen’s inequality can be replaced with

∇²f(x) ⪰ 0   for all x ∈ dom f

Gradient method 1.6


Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, x ≠ y, and θ ∈ (0, 1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f, strict Jensen’s inequality can be replaced with

f(y) > f(x) + ∇f(x)^T(y − x)   for all x, y ∈ dom f, x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)

Gradient method 1.7


Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ 0   for all x, y ∈ dom f

i.e., the gradient ∇f : Rⁿ → Rⁿ is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) > 0   for all x, y ∈ dom f, x ≠ y

i.e., the gradient ∇f : Rⁿ → Rⁿ is a strictly monotone mapping

Gradient method 1.8


Proof

• if f is differentiable and convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x),   f(x) ≥ f(y) + ∇f(y)^T(x − y)

combining the inequalities gives (∇f(x) − ∇f(y))^T(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T(y − x)

hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T(y − x)

this is the first-order condition for convexity

Gradient method 1.9


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖   for all x, y ∈ dom f

• functions f with this property are also called L-smooth

• the definition does not assume convexity of f (and holds for −f if it holds for f)

• in the definition, ‖ · ‖ and ‖ · ‖_* are a pair of dual norms:

‖u‖_* = sup_{v ≠ 0} (u^T v)/‖v‖ = sup_{‖v‖ = 1} u^T v

this implies a generalized Cauchy–Schwarz inequality

|u^T v| ≤ ‖u‖_* ‖v‖   for all u, v

Gradient method 1.10


Choice of norm

Equivalence of norms

• for any two norms ‖ · ‖_a, ‖ · ‖_b, there exist positive constants c₁, c₂ such that

c₁‖x‖_b ≤ ‖x‖_a ≤ c₂‖x‖_b   for all x

• the constants depend on the dimension; for example, for x ∈ Rⁿ,

‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂,   (1/√n)‖x‖₂ ≤ ‖x‖_∞ ≤ ‖x‖₂

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm ‖ · ‖ = ‖ · ‖_* = ‖ · ‖₂

• the parameter L depends on the choice of norm

• in complexity bounds, the choice of norm can simplify the dependence on dimensions

Gradient method 1.11


Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

(∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²   for all x, y ∈ dom f   (1)

• if dom f is convex, (1) is equivalent to

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖²   for all x, y ∈ dom f   (2)

[figure: f(y) lies below the quadratic upper bound, which touches the graph at (x, f(x))]

Gradient method 1.12


Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))

• g(t) is defined for t ∈ [0, 1] because dom f is convex

• if (1) holds, then

g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))^T(y − x) ≤ tL‖x − y‖²

integrating from t = 0 to t = 1 gives (2):

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖²
     = f(x) + ∇f(x)^T(y − x) + (L/2)‖x − y‖²

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

f(x) ≤ f(y) + ∇f(y)^T(x − y) + (L/2)‖x − y‖²,

can be combined to give (∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²

Gradient method 1.13


Consequence of quadratic upper bound

if dom f = Rⁿ and f has a minimizer x^⋆, then

(1/(2L))‖∇f(z)‖_*² ≤ f(z) − f(x^⋆) ≤ (L/2)‖z − x^⋆‖²   for all z

• the right-hand inequality follows from upper bound property (2) at x = x^⋆, y = z

• the left-hand inequality follows by minimizing the quadratic upper bound for x = z:

inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)^T(y − z) + (L/2)‖y − z‖² )
          = inf_{‖v‖=1} inf_t ( f(z) + t∇f(z)^T v + Lt²/2 )
          = inf_{‖v‖=1} ( f(z) − (∇f(z)^T v)²/(2L) )
          = f(z) − (1/(2L))‖∇f(z)‖_*²

Gradient method 1.14


Co-coercivity of gradient

if f is convex with dom f = Rⁿ and ∇f is L-Lipschitz continuous, then

(∇f(x) − ∇f(y))^T(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖_*²   for all x, y

• this property is known as co-coercivity of ∇f (with parameter 1/L)

• co-coercivity in turn implies Lipschitz continuity of ∇f (by the generalized Cauchy–Schwarz inequality)

• hence, for differentiable convex f with dom f = Rⁿ,

Lipschitz continuity of ∇f ⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of ∇f
⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent

Gradient method 1.15


Proof of co-coercivity: define two convex functions f_x, f_y with domain Rⁿ

f_x(z) = f(z) − ∇f(x)^T z,   f_y(z) = f(z) − ∇f(y)^T z

• the two functions have L-Lipschitz continuous gradients

• z = x minimizes f_x(z); from the left-hand inequality on page 1.14,

f(y) − f(x) − ∇f(x)^T(y − x) = f_x(y) − f_x(x)
                             ≥ (1/(2L))‖∇f_x(y)‖_*²
                             = (1/(2L))‖∇f(y) − ∇f(x)‖_*²

• similarly, z = y minimizes f_y(z); therefore

f(x) − f(y) − ∇f(y)^T(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖_*²

combining the two inequalities shows co-coercivity


Gradient method 1.16
Lipschitz continuity with respect to Euclidean norm

suppose f is convex with dom f = Rⁿ, and L-smooth for the Euclidean norm:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂   for all x, y

• the equivalent property (1) states that

(∇f(x) − ∇f(y))^T(x − y) ≤ L(x − y)^T(x − y)   for all x, y

• this is monotonicity of Lx − ∇f(x), i.e., equivalent to the property that

(L/2)‖x‖₂² − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x); hence

λ_max(∇²f(x)) ≤ L   for all x

is an equivalent characterization of L-smoothness
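For a quadratic f(x) = xᵀAx/2 the Hessian is the constant matrix A, so this characterization gives L = λ_max(A) directly; a small numerical sanity check (the setup is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                       # symmetric PSD: Hessian of f(x) = x^T A x / 2
L = np.linalg.eigvalsh(A).max()   # L-smoothness constant lambda_max(A)

x = rng.standard_normal(5)
y = rng.standard_normal(5)
# Lipschitz inequality: ||grad f(x) - grad f(y)||_2 <= L ||x - y||_2
lhs = np.linalg.norm(A @ x - A @ y)
rhs = L * np.linalg.norm(x - y)
```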

Gradient method 1.17


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f, θ ∈ [0, 1]

• this is a stronger version of Jensen’s inequality

• it holds if and only if it holds for f restricted to arbitrary lines:

f(x + t(y − x)) − (m/2)t²‖x − y‖²   (3)

is a convex function of t, for all x, y ∈ dom f

• without loss of generality, we can take ‖ · ‖ = ‖ · ‖₂

• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18


Quadratic lower bound

if f is differentiable and m-strongly convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖²   for all x, y ∈ dom f   (4)

[figure: f(y) lies above the quadratic lower bound, which touches the graph at (x, f(x))]

• follows from the first-order condition for convexity of (3)

• this implies that the sublevel sets of f are bounded

• if f is closed (has closed sublevel sets), it has a unique minimizer x^⋆ and

(m/2)‖z − x^⋆‖² ≤ f(z) − f(x^⋆) ≤ (1/(2m))‖∇f(z)‖_*²   for all z ∈ dom f

(proof as on page 1.14)
Gradient method 1.19
Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ m‖x − y‖²   for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched

• for the other direction, assume ∇f is strongly monotone and define

g(t) = f(x + t(y − x)) − (m/2)t²‖x − y‖²

then g′(t) is nondecreasing, so g is convex

Gradient method 1.20


Strong convexity with respect to Euclidean norm

suppose f is m-strongly convex for the Euclidean norm:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖₂²

for x, y ∈ dom f, θ ∈ [0, 1]

• this is Jensen’s inequality for the function

h(x) = f(x) − (m/2)‖x‖₂²

• therefore f is strongly convex if and only if h is convex

• if f is twice differentiable, h is convex if and only if ∇²f(x) − mI ⪰ 0, i.e.,

λ_min(∇²f(x)) ≥ m   for all x ∈ dom f
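For a positive definite quadratic, m can therefore be read off as the smallest Hessian eigenvalue, and the quadratic lower bound (4) can be checked at arbitrary points; a sketch (the random setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B.T @ B + 0.5 * np.eye(4)     # positive definite Hessian of f(x) = x^T A x / 2
m = np.linalg.eigvalsh(A).min()   # strong convexity parameter for the 2-norm

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x = rng.standard_normal(4)
y = rng.standard_normal(4)
# quadratic lower bound (4): f(y) >= f(x) + grad f(x)^T (y - x) + (m/2)||y - x||^2
lower = f(x) + grad(x) @ (y - x) + 0.5 * m * np.linalg.norm(y - x) ** 2
```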

Gradient method 1.21


Extension of co-coercivity

suppose f is m-strongly convex and L-smooth for ‖ · ‖₂, and dom f = Rⁿ

• then the function

h(x) = f(x) − (m/2)‖x‖₂²

is convex and (L − m)-smooth:

0 ≤ (∇h(x) − ∇h(y))^T(x − y)
  = (∇f(x) − ∇f(y))^T(x − y) − m‖x − y‖₂²
  ≤ (L − m)‖x − y‖₂²

• co-coercivity of ∇h can be written as

(∇f(x) − ∇f(y))^T(x − y) ≥ (mL/(m + L))‖x − y‖₂² + (1/(m + L))‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f

Gradient method 1.22


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Analysis of gradient method

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, . . .

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rⁿ

2. ∇f(x) is L-Lipschitz continuous with respect to the Euclidean norm, with L > 0

3. the optimal value f^⋆ = inf_x f(x) is finite and attained at x^⋆

Gradient method 1.23


Basic gradient step

• from the quadratic upper bound (page 1.12) with y = x − t∇f(x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖₂²

• therefore, if x⁺ = x − t∇f(x) and 0 < t ≤ 1/L,

f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖₂²   (5)

• from (5) and convexity of f,

f(x⁺) − f^⋆ ≤ ∇f(x)^T(x − x^⋆) − (t/2)‖∇f(x)‖₂²
           = (1/(2t))(‖x − x^⋆‖₂² − ‖x − x^⋆ − t∇f(x)‖₂²)
           = (1/(2t))(‖x − x^⋆‖₂² − ‖x⁺ − x^⋆‖₂²)   (6)

Gradient method 1.24


Descent properties

assume ∇f(x) ≠ 0

• the inequality (5) shows that

f(x⁺) < f(x)

• the inequality (6) shows that

‖x⁺ − x^⋆‖₂ < ‖x − x^⋆‖₂

in the gradient method, the function value and the distance to the optimal set decrease

Gradient method 1.25


Gradient method with constant step size

x_{k+1} = x_k − t∇f(x_k),   k = 0, 1, . . .

• take x = x_{i−1}, x⁺ = x_i in (6) and add the bounds for i = 1, . . . , k:

Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2t)) Σ_{i=1}^k (‖x_{i−1} − x^⋆‖₂² − ‖x_i − x^⋆‖₂²)
                        = (1/(2t))(‖x₀ − x^⋆‖₂² − ‖x_k − x^⋆‖₂²)
                        ≤ (1/(2t))‖x₀ − x^⋆‖₂²

• since f(x_i) is non-increasing (see (5)),

f(x_k) − f^⋆ ≤ (1/k) Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2kt))‖x₀ − x^⋆‖₂²

Conclusion: the number of iterations to reach f(x_k) − f^⋆ ≤ ε is O(1/ε)
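The 1/(2kt) bound can be observed directly; the sketch below runs the constant-step method with t = 1/L on a simple quadratic (where f^⋆ = 0 and x^⋆ = 0) and checks the bound at every iteration (the specific problem is illustrative):

```python
import numpy as np

L = 10.0
A = np.diag([1.0, L])          # f(x) = x^T A x / 2, so f* = 0 at x* = 0
t = 1.0 / L
x0 = np.array([10.0, 1.0])

x = x0.copy()
bound_holds = True
for k in range(1, 201):
    x = x - t * (A @ x)                              # gradient step
    f_k = 0.5 * x @ A @ x                            # f(x_k) - f*
    bound = np.linalg.norm(x0) ** 2 / (2 * k * t)    # (1/(2kt)) ||x0 - x*||^2
    bound_holds = bound_holds and f_k <= bound + 1e-12
```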


Gradient method 1.26
Backtracking line search

initialize t_k at t̂ > 0 (for example, t̂ = 1) and take t_k := βt_k until

f(x_k − t_k∇f(x_k)) < f(x_k) − αt_k‖∇f(x_k)‖₂²

[figure: f(x_k − t∇f(x_k)) as a function of t, compared with the lines f(x_k) − αt‖∇f(x_k)‖₂² and f(x_k) − t‖∇f(x_k)‖₂²]

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
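A direct translation of this rule into Python, with α = 1/2 and β = 1/2 (the function names are illustrative):

```python
import numpy as np

def backtracking_step(f, grad, x, t_hat=1.0, alpha=0.5, beta=0.5):
    # shrink t until f(x - t g) < f(x) - alpha * t * ||g||_2^2
    g = grad(x)
    fx = f(x)
    t = t_hat
    while f(x - t * g) >= fx - alpha * t * (g @ g):
        t *= beta
    return t, x - t * g

f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)   # L = 10 for this f
grad = lambda x: np.array([x[0], 10.0 * x[1]])
t, x1 = backtracking_step(f, grad, np.array([10.0, 1.0]))
# the accepted step satisfies t >= min(t_hat, beta / L) = 0.05
```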

Gradient method 1.27


Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

[figure: f(x_k − t∇f(x_k)), its quadratic upper bound f(x_k) − t(1 − tL/2)‖∇f(x_k)‖₂², and the acceptance line f(x_k) − (t/2)‖∇f(x_k)‖₂², which intersect at t = 1/L]

the selected step size satisfies t_k ≥ t_min = min{t̂, β/L}

Gradient method 1.28


Gradient method with backtracking line search

• from the line search condition and convexity of f,

f(x_{i+1}) ≤ f(x_i) − (t_i/2)‖∇f(x_i)‖₂²
          ≤ f^⋆ + ∇f(x_i)^T(x_i − x^⋆) − (t_i/2)‖∇f(x_i)‖₂²
          = f^⋆ + (1/(2t_i))(‖x_i − x^⋆‖₂² − ‖x_{i+1} − x^⋆‖₂²)

• this implies ‖x_{i+1} − x^⋆‖₂ ≤ ‖x_i − x^⋆‖₂, so we can replace t_i with t_min ≤ t_i:

f(x_{i+1}) − f^⋆ ≤ (1/(2t_min))(‖x_i − x^⋆‖₂² − ‖x_{i+1} − x^⋆‖₂²)

• adding the upper bounds gives the same 1/k bound as with constant step size:

f(x_k) − f^⋆ ≤ (1/k) Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2kt_min))‖x₀ − x^⋆‖₂²

Gradient method 1.29


Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on page 1.23

Analysis for constant step size

if x⁺ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

‖x⁺ − x^⋆‖₂² = ‖x − t∇f(x) − x^⋆‖₂²
            = ‖x − x^⋆‖₂² − 2t∇f(x)^T(x − x^⋆) + t²‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m + L))‖x − x^⋆‖₂² + t(t − 2/(m + L))‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m + L))‖x − x^⋆‖₂²

(the third step follows from the result on page 1.22)

Gradient method 1.30


Distance to optimum

‖x_k − x^⋆‖₂² ≤ c^k ‖x₀ − x^⋆‖₂²,   c = 1 − t·2mL/(m + L)

• this implies (linear) convergence

• for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1.14)

f(x_k) − f^⋆ ≤ (L/2)‖x_k − x^⋆‖₂² ≤ (c^k L/2)‖x₀ − x^⋆‖₂²

Conclusion: the number of iterations to reach f(x_k) − f^⋆ ≤ ε is O(log(1/ε))
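For a diagonal quadratic with eigenvalues m and L, the contraction factor c is attained exactly at every step with t = 2/(m + L), which makes the bound easy to verify numerically (the setup is illustrative):

```python
import numpy as np

m, L = 1.0, 10.0
A = np.diag([m, L])                     # f(x) = x^T A x / 2: m-strongly convex, L-smooth
t = 2.0 / (m + L)
gamma = L / m
c = ((gamma - 1) / (gamma + 1)) ** 2    # contraction factor per iteration

x0 = np.array([1.0, 1.0])
x = x0.copy()
linear_rate = True
for k in range(1, 51):
    x = x - t * (A @ x)
    linear_rate = linear_rate and (
        np.linalg.norm(x) ** 2 <= c ** k * np.linalg.norm(x0) ** 2 + 1e-12
    )
```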

Gradient method 1.31


Limits on convergence rate of first-order methods

First-order method: any iterative algorithm that selects x_{k+1} in the set

x₀ + span{∇f(x₀), ∇f(x₁), . . . , ∇f(x_k)}

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x₀, there exist functions in the problem class such that for any first-order method

f(x_k) − f^⋆ ≥ (3L‖x₀ − x^⋆‖₂²) / (32(k + 1)²)

• this suggests that the 1/k rate for the gradient method is not optimal

• more recent accelerated gradient methods have 1/k² convergence (see later)

Gradient method 1.32


References

• A. Beck, First-Order Methods in Optimization (2017), chapter 5.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4.

• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.

Gradient method 1.33
