
L. Vandenberghe, ECE236C (Spring 2019)

1. Gradient method

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Gradient method

to minimize a convex differentiable function f: choose an initial point x₀ and repeat

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, . . .

the step size t_k is constant or determined by line search
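The iteration can be sketched in a few lines of Python; here `grad` is a user-supplied gradient oracle and the constant step size `t` is illustrative (the slides leave the step-size rule open):

```python
import numpy as np

def gradient_method(grad, x0, t, iters):
    # repeat x_{k+1} = x_k - t * grad(x_k) with a constant step size t
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# example: f(x) = ||x||_2^2 / 2 has gradient x and minimizer x* = 0
x_final = gradient_method(lambda x: x, x0=[4.0, -2.0], t=0.5, iters=100)
```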

Advantages

• every iteration is inexpensive


• does not require second derivatives

Notation

• x_k can refer to the kth element of a sequence, or to the kth component of vector x

• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence

Gradient method 1.2


Quadratic example

f(x) = (1/2)(x₁² + γx₂²)   (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1)

‖x^(k) − x^⋆‖₂ / ‖x^(0) − x^⋆‖₂ = ((γ − 1)/(γ + 1))^k

where x^⋆ = 0

[figure: iterates zigzag in the (x₁, x₂) plane toward x^⋆ = 0]

gradient method is often slow; convergence very dependent on scaling
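This rate can be checked numerically. The sketch below is illustrative: it uses the closed-form exact line search step t = gᵀg / (gᵀAg) for a quadratic f(x) = xᵀAx/2 and recovers the per-iteration factor (γ − 1)/(γ + 1) for this particular starting point:

```python
import numpy as np

def exact_ls_gradient(A, x0, iters):
    # gradient method with exact line search for f(x) = x^T A x / 2
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = A @ x
        t = (g @ g) / (g @ (A @ g))  # exact minimizer of f(x - t g) over t
        x = x - t * g
    return x

gamma = 10.0
x0 = np.array([gamma, 1.0])
k = 20
xk = exact_ls_gradient(np.diag([1.0, gamma]), x0, k)
per_step = (np.linalg.norm(xk) / np.linalg.norm(x0)) ** (1.0 / k)
# per_step equals (gamma - 1) / (gamma + 1) = 9/11 for this starting point
```

Scaling x₂ by √γ would make the problem perfectly conditioned and the method converge in one step, which is the sense in which convergence depends on scaling.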

Gradient method 1.3


Nondifferentiable example

f(x) = √(x₁² + γx₂²) if |x₂| ≤ x₁,    f(x) = (x₁ + γ|x₂|)/√(1 + γ) if |x₂| > x₁

with exact line search, starting point x^(0) = (γ, 1), converges to non-optimal point

[figure: iterates zigzag in the (x₁, x₂) plane toward a non-optimal point]

gradient method does not handle nondifferentiable problems

Gradient method 1.4


First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Methods with improved convergence

• conjugate gradient method


• accelerated gradient method
• quasi-Newton methods

Gradient method 1.5


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f, Jensen’s inequality can be replaced with

f(y) ≥ f(x) + ∇f(x)^T(y − x)   for all x, y ∈ dom f

Second-order condition

for twice differentiable f, Jensen’s inequality can be replaced with

∇²f(x) ⪰ 0   for all x ∈ dom f

Gradient method 1.6


Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, x ≠ y, and θ ∈ (0, 1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f, strict Jensen’s inequality can be replaced with

f(y) > f(x) + ∇f(x)^T(y − x)   for all x, y ∈ dom f, x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)

Gradient method 1.7


Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ 0   for all x, y ∈ dom f

i.e., the gradient ∇f : Rⁿ → Rⁿ is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) > 0   for all x, y ∈ dom f, x ≠ y

i.e., the gradient ∇f : Rⁿ → Rⁿ is a strictly monotone mapping

Gradient method 1.8


Proof

• if f is differentiable and convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x),   f(x) ≥ f(y) + ∇f(y)^T(x − y)

combining the inequalities gives (∇f(x) − ∇f(y))^T(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T(y − x)

hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T(y − x)

this is the first-order condition for convexity

Gradient method 1.9


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖   for all x, y ∈ dom f

• functions f with this property are also called L-smooth

• the definition does not assume convexity of f (and holds for −f if it holds for f)

• in the definition, ‖ · ‖ and ‖ · ‖_* are a pair of dual norms:

‖u‖_* = sup_{v ≠ 0} (u^T v)/‖v‖ = sup_{‖v‖ = 1} u^T v

this implies a generalized Cauchy–Schwarz inequality

|u^T v| ≤ ‖u‖_* ‖v‖   for all u, v

Gradient method 1.10


Choice of norm

Equivalence of norms

• for any two norms ‖ · ‖_a, ‖ · ‖_b, there exist positive constants c₁, c₂ such that

c₁‖x‖_b ≤ ‖x‖_a ≤ c₂‖x‖_b   for all x

• the constants depend on the dimension; for example, for x ∈ Rⁿ,

‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂,   (1/√n)‖x‖₂ ≤ ‖x‖_∞ ≤ ‖x‖₂

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm ‖ · ‖ = ‖ · ‖_* = ‖ · ‖₂

• the parameter L depends on the choice of norm

• in complexity bounds, the choice of norm can simplify the dependence on dimensions

Gradient method 1.11


Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

(∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²   for all x, y ∈ dom f   (1)

• if dom f is convex, (1) is equivalent to

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖²   for all x, y ∈ dom f   (2)

[figure: f(y) lies below the quadratic upper bound, which touches the graph at (x, f(x))]

Gradient method 1.12


Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))

• g(t) is defined for t ∈ [0, 1] because dom f is convex

• if (1) holds, then

g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))^T(y − x) ≤ tL‖x − y‖²

integrating from t = 0 to t = 1 gives (2):

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖²
     = f(x) + ∇f(x)^T(y − x) + (L/2)‖x − y‖²

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

f(x) ≤ f(y) + ∇f(y)^T(x − y) + (L/2)‖x − y‖²,

can be combined to give (∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²

Gradient method 1.13


Consequence of quadratic upper bound

if dom f = Rⁿ and f has a minimizer x^⋆, then

(1/(2L))‖∇f(z)‖_*² ≤ f(z) − f(x^⋆) ≤ (L/2)‖z − x^⋆‖²   for all z

• the right-hand inequality follows from upper bound property (2) at x = x^⋆, y = z

• the left-hand inequality follows by minimizing the quadratic upper bound for x = z:

inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)^T(y − z) + (L/2)‖y − z‖² )
          = inf_{‖v‖=1} inf_t ( f(z) + t∇f(z)^T v + Lt²/2 )
          = inf_{‖v‖=1} ( f(z) − (∇f(z)^T v)²/(2L) )
          = f(z) − (1/(2L))‖∇f(z)‖_*²

Gradient method 1.14


Co-coercivity of gradient

if f is convex with dom f = Rⁿ and ∇f is L-Lipschitz continuous, then

(∇f(x) − ∇f(y))^T(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖_*²   for all x, y

• this property is known as co-coercivity of ∇f (with parameter 1/L)

• co-coercivity in turn implies Lipschitz continuity of ∇f (by the generalized Cauchy–Schwarz inequality)

• hence, for differentiable convex f with dom f = Rⁿ,

Lipschitz continuity of ∇f ⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of ∇f
⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent

Gradient method 1.15


Proof of co-coercivity: define two convex functions f_x, f_y with domain Rⁿ

f_x(z) = f(z) − ∇f(x)^T z,   f_y(z) = f(z) − ∇f(y)^T z

• the two functions have L-Lipschitz continuous gradients

• z = x minimizes f_x(z); from the left-hand inequality on page 1.14,

f(y) − f(x) − ∇f(x)^T(y − x) = f_x(y) − f_x(x)
                             ≥ (1/(2L))‖∇f_x(y)‖_*²
                             = (1/(2L))‖∇f(y) − ∇f(x)‖_*²

• similarly, z = y minimizes f_y(z); therefore

f(x) − f(y) − ∇f(y)^T(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖_*²

combining the two inequalities shows co-coercivity


Gradient method 1.16
Lipschitz continuity with respect to Euclidean norm

suppose f is convex with dom f = Rⁿ, and L-smooth for the Euclidean norm:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂   for all x, y

• the equivalent property (1) states that

(∇f(x) − ∇f(y))^T(x − y) ≤ L(x − y)^T(x − y)   for all x, y

• this is monotonicity of Lx − ∇f(x), i.e., equivalent to the property that

(L/2)‖x‖₂² − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x); hence

λ_max(∇²f(x)) ≤ L   for all x

is an equivalent characterization of L-smoothness
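For a quadratic f(x) = xᵀAx/2 the Hessian is the constant matrix A, so this characterization gives L = λ_max(A) directly; a small numerical sanity check (the setup is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                       # symmetric PSD: Hessian of f(x) = x^T A x / 2
L = np.linalg.eigvalsh(A).max()   # L-smoothness constant lambda_max(A)

x = rng.standard_normal(5)
y = rng.standard_normal(5)
# Lipschitz inequality: ||grad f(x) - grad f(y)||_2 <= L ||x - y||_2
lhs = np.linalg.norm(A @ x - A @ y)
rhs = L * np.linalg.norm(x - y)
```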

Gradient method 1.17


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f, θ ∈ [0, 1]

• this is a stronger version of Jensen’s inequality

• it holds if and only if it holds for f restricted to arbitrary lines:

f(x + t(y − x)) − (m/2)t²‖x − y‖²   (3)

is a convex function of t, for all x, y ∈ dom f

• without loss of generality, we can take ‖ · ‖ = ‖ · ‖₂

• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18


Quadratic lower bound

if f is differentiable and m-strongly convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖²   for all x, y ∈ dom f   (4)

[figure: f(y) lies above the quadratic lower bound, which touches the graph at (x, f(x))]

• follows from the first-order condition for convexity of (3)

• this implies that the sublevel sets of f are bounded

• if f is closed (has closed sublevel sets), it has a unique minimizer x^⋆ and

(m/2)‖z − x^⋆‖² ≤ f(z) − f(x^⋆) ≤ (1/(2m))‖∇f(z)‖_*²   for all z ∈ dom f

(proof as on page 1.14)
Gradient method 1.19
Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ m‖x − y‖²   for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched

• for the other direction, assume ∇f is strongly monotone and define

g(t) = f(x + t(y − x)) − (m/2)t²‖x − y‖²

then g′(t) is nondecreasing, so g is convex

Gradient method 1.20


Strong convexity with respect to Euclidean norm

suppose f is m-strongly convex for the Euclidean norm:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖₂²

for x, y ∈ dom f, θ ∈ [0, 1]

• this is Jensen’s inequality for the function

h(x) = f(x) − (m/2)‖x‖₂²

• therefore f is strongly convex if and only if h is convex

• if f is twice differentiable, h is convex if and only if ∇²f(x) − mI ⪰ 0, i.e.,

λ_min(∇²f(x)) ≥ m   for all x ∈ dom f
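For a positive definite quadratic, m can therefore be read off as the smallest Hessian eigenvalue, and the quadratic lower bound (4) can be checked at arbitrary points; a sketch (the random setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B.T @ B + 0.5 * np.eye(4)     # positive definite Hessian of f(x) = x^T A x / 2
m = np.linalg.eigvalsh(A).min()   # strong convexity parameter for the 2-norm

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x = rng.standard_normal(4)
y = rng.standard_normal(4)
# quadratic lower bound (4): f(y) >= f(x) + grad f(x)^T (y - x) + (m/2)||y - x||^2
lower = f(x) + grad(x) @ (y - x) + 0.5 * m * np.linalg.norm(y - x) ** 2
```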

Gradient method 1.21


Extension of co-coercivity

suppose f is m-strongly convex and L-smooth for ‖ · ‖₂, and dom f = Rⁿ

• then the function

h(x) = f(x) − (m/2)‖x‖₂²

is convex and (L − m)-smooth:

0 ≤ (∇h(x) − ∇h(y))^T(x − y)
  = (∇f(x) − ∇f(y))^T(x − y) − m‖x − y‖₂²
  ≤ (L − m)‖x − y‖₂²

• co-coercivity of ∇h can be written as

(∇f(x) − ∇f(y))^T(x − y) ≥ (mL/(m + L))‖x − y‖₂² + (1/(m + L))‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f

Gradient method 1.22


Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method


Analysis of gradient method

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, . . .

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rⁿ

2. ∇f(x) is L-Lipschitz continuous with respect to the Euclidean norm, with L > 0

3. the optimal value f^⋆ = inf_x f(x) is finite and attained at x^⋆

Gradient method 1.23


Basic gradient step

• from the quadratic upper bound (page 1.12) with y = x − t∇f(x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖₂²

• therefore, if x⁺ = x − t∇f(x) and 0 < t ≤ 1/L,

f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖₂²   (5)

• from (5) and convexity of f,

f(x⁺) − f^⋆ ≤ ∇f(x)^T(x − x^⋆) − (t/2)‖∇f(x)‖₂²
           = (1/(2t))(‖x − x^⋆‖₂² − ‖x − x^⋆ − t∇f(x)‖₂²)
           = (1/(2t))(‖x − x^⋆‖₂² − ‖x⁺ − x^⋆‖₂²)   (6)

Gradient method 1.24


Descent properties

assume ∇f(x) ≠ 0

• the inequality (5) shows that

f(x⁺) < f(x)

• the inequality (6) shows that

‖x⁺ − x^⋆‖₂ < ‖x − x^⋆‖₂

in the gradient method, the function value and the distance to the optimal set decrease

Gradient method 1.25


Gradient method with constant step size

x_{k+1} = x_k − t∇f(x_k),   k = 0, 1, . . .

• take x = x_{i−1}, x⁺ = x_i in (6) and add the bounds for i = 1, . . . , k:

Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2t)) Σ_{i=1}^k (‖x_{i−1} − x^⋆‖₂² − ‖x_i − x^⋆‖₂²)
                        = (1/(2t))(‖x₀ − x^⋆‖₂² − ‖x_k − x^⋆‖₂²)
                        ≤ (1/(2t))‖x₀ − x^⋆‖₂²

• since f(x_i) is non-increasing (see (5)),

f(x_k) − f^⋆ ≤ (1/k) Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2kt))‖x₀ − x^⋆‖₂²

Conclusion: the number of iterations to reach f(x_k) − f^⋆ ≤ ε is O(1/ε)
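The 1/(2kt) bound can be observed directly; the sketch below runs the constant-step method with t = 1/L on a simple quadratic (where f^⋆ = 0 and x^⋆ = 0) and checks the bound at every iteration (the specific problem is illustrative):

```python
import numpy as np

L = 10.0
A = np.diag([1.0, L])          # f(x) = x^T A x / 2, so f* = 0 at x* = 0
t = 1.0 / L
x0 = np.array([10.0, 1.0])

x = x0.copy()
bound_holds = True
for k in range(1, 201):
    x = x - t * (A @ x)                              # gradient step
    f_k = 0.5 * x @ A @ x                            # f(x_k) - f*
    bound = np.linalg.norm(x0) ** 2 / (2 * k * t)    # (1/(2kt)) ||x0 - x*||^2
    bound_holds = bound_holds and f_k <= bound + 1e-12
```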


Gradient method 1.26
Backtracking line search

initialize t_k at t̂ > 0 (for example, t̂ = 1) and take t_k := βt_k until

f(x_k − t_k∇f(x_k)) < f(x_k) − αt_k‖∇f(x_k)‖₂²

[figure: f(x_k − t∇f(x_k)) as a function of t, compared with the lines f(x_k) − αt‖∇f(x_k)‖₂² and f(x_k) − t‖∇f(x_k)‖₂²]

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
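A direct translation of this rule into Python, with α = 1/2 and β = 1/2 (the function names are illustrative):

```python
import numpy as np

def backtracking_step(f, grad, x, t_hat=1.0, alpha=0.5, beta=0.5):
    # shrink t until f(x - t g) < f(x) - alpha * t * ||g||_2^2
    g = grad(x)
    fx = f(x)
    t = t_hat
    while f(x - t * g) >= fx - alpha * t * (g @ g):
        t *= beta
    return t, x - t * g

f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)   # L = 10 for this f
grad = lambda x: np.array([x[0], 10.0 * x[1]])
t, x1 = backtracking_step(f, grad, np.array([10.0, 1.0]))
# the accepted step satisfies t >= min(t_hat, beta / L) = 0.05
```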

Gradient method 1.27


Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

[figure: f(x_k − t∇f(x_k)), its quadratic upper bound f(x_k) − t(1 − tL/2)‖∇f(x_k)‖₂², and the acceptance line f(x_k) − (t/2)‖∇f(x_k)‖₂², which intersect at t = 1/L]

the selected step size satisfies t_k ≥ t_min = min{t̂, β/L}

Gradient method 1.28


Gradient method with backtracking line search

• from the line search condition and convexity of f,

f(x_{i+1}) ≤ f(x_i) − (t_i/2)‖∇f(x_i)‖₂²
          ≤ f^⋆ + ∇f(x_i)^T(x_i − x^⋆) − (t_i/2)‖∇f(x_i)‖₂²
          = f^⋆ + (1/(2t_i))(‖x_i − x^⋆‖₂² − ‖x_{i+1} − x^⋆‖₂²)

• this implies ‖x_{i+1} − x^⋆‖₂ ≤ ‖x_i − x^⋆‖₂, so we can replace t_i with t_min ≤ t_i:

f(x_{i+1}) − f^⋆ ≤ (1/(2t_min))(‖x_i − x^⋆‖₂² − ‖x_{i+1} − x^⋆‖₂²)

• adding the upper bounds gives the same 1/k bound as with constant step size:

f(x_k) − f^⋆ ≤ (1/k) Σ_{i=1}^k (f(x_i) − f^⋆) ≤ (1/(2kt_min))‖x₀ − x^⋆‖₂²

Gradient method 1.29


Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on page 1.23

Analysis for constant step size

if x⁺ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

‖x⁺ − x^⋆‖₂² = ‖x − t∇f(x) − x^⋆‖₂²
            = ‖x − x^⋆‖₂² − 2t∇f(x)^T(x − x^⋆) + t²‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m + L))‖x − x^⋆‖₂² + t(t − 2/(m + L))‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m + L))‖x − x^⋆‖₂²

(the third step follows from the result on page 1.22)

Gradient method 1.30


Distance to optimum

‖x_k − x^⋆‖₂² ≤ c^k ‖x₀ − x^⋆‖₂²,   c = 1 − t·2mL/(m + L)

• this implies (linear) convergence

• for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1.14)

f(x_k) − f^⋆ ≤ (L/2)‖x_k − x^⋆‖₂² ≤ (c^k L/2)‖x₀ − x^⋆‖₂²

Conclusion: the number of iterations to reach f(x_k) − f^⋆ ≤ ε is O(log(1/ε))
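For a diagonal quadratic with eigenvalues m and L, the contraction factor c is attained exactly at every step with t = 2/(m + L), which makes the bound easy to verify numerically (the setup is illustrative):

```python
import numpy as np

m, L = 1.0, 10.0
A = np.diag([m, L])                     # f(x) = x^T A x / 2: m-strongly convex, L-smooth
t = 2.0 / (m + L)
gamma = L / m
c = ((gamma - 1) / (gamma + 1)) ** 2    # contraction factor per iteration

x0 = np.array([1.0, 1.0])
x = x0.copy()
linear_rate = True
for k in range(1, 51):
    x = x - t * (A @ x)
    linear_rate = linear_rate and (
        np.linalg.norm(x) ** 2 <= c ** k * np.linalg.norm(x0) ** 2 + 1e-12
    )
```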

Gradient method 1.31


Limits on convergence rate of first-order methods

First-order method: any iterative algorithm that selects x_{k+1} in the set

x₀ + span{∇f(x₀), ∇f(x₁), . . . , ∇f(x_k)}

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x₀, there exist functions in the problem class such that for any first-order method

f(x_k) − f^⋆ ≥ (3L‖x₀ − x^⋆‖₂²) / (32(k + 1)²)

• this suggests that the 1/k rate for the gradient method is not optimal

• more recent accelerated gradient methods have 1/k² convergence (see later)

Gradient method 1.32


References

• A. Beck, First-Order Methods in Optimization (2017), chapter 5.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4.

• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.

Gradient method 1.33
