
Lecture 14

Newton Algorithm for Unconstrained Optimization

October 21, 2008



Outline

• Newton Method for System of Nonlinear Equations

• Newton’s Method for Optimization

• Classic Analysis


Newton’s Method for System of Equations


• A numerical method for solving a system of equations

    G(x) = 0,   G : R^n → R^n

• When G is continuously differentiable, the classical Newton method is based on a natural (local) approximation of G: linearization
• Given an iterate x_k, the map G is approximated at x_k by the linear map

    L(x; x_k) = G(x_k) + J_G(x_k)(x − x_k),

  where J_G(x) is the Jacobian of G at x
• We have L(x; x_k) ≈ G(x) for x near x_k
• The system G(x) = 0 is "replaced" by the system L(x; x_k) = 0
• The resulting solution defines the new iterate x_{k+1},

    x_{k+1} = x_k − J_G(x_k)^{-1} G(x_k)

  (a minimal numerical sketch of this iteration is given below)
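To make the iteration concrete, here is a minimal Python sketch of the pure Newton iteration for a system G(x) = 0; the function name newton_system, the tolerance, and the small test system are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def newton_system(G, JG, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration for G(x) = 0 (illustrative sketch).

    G  : callable returning the residual vector G(x)
    JG : callable returning the Jacobian matrix J_G(x)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = G(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the linearized system L(x; x_k) = 0, i.e., J_G(x_k) d = -G(x_k),
        # rather than forming J_G(x_k)^{-1} explicitly.
        d = np.linalg.solve(JG(x), -g)
        x = x + d
    return x

# Hypothetical test system: x0^2 + x1^2 = 1 and x0 = x1.
G = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
JG = lambda x: np.array([[2*x[0], 2*x[1]], [1.0, -1.0]])
print(newton_system(G, JG, x0=[1.0, 0.5]))   # approx. [0.7071, 0.7071]
```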


Properties of Newton’s Method

    x_{k+1} = x_k − J_G(x_k)^{-1} G(x_k)

• Fast local convergence, but globally the method can fail:
• When started far from a solution


• Numerical instabilities occur when J_G(x*) is singular (or nearly singular)

• Two main properties make the process work:
• A linear model L(x; x_k) that provides a good approximation of G near x_k, when x_k is near a solution
• Solvability of the linear equation L(x; x_k) = 0 when J_G(x_k) is invertible

• These two properties are “guaranteed” when


• G is continuously differentiable and
• J_G(x*) is invertible (nonsingular)


Convergence Rate Terminology

Definition 7.2.1 Let {x_k} ⊆ R^n be a sequence converging to some x* ∈ R^n. The convergence rate is said to be

• Q-linear if

    lim sup_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ < 1

• Q-superlinear if

    lim sup_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 0

• Q-quadratic if

    lim sup_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖² < ∞

• R-linear if

    lim sup_{k→∞} ‖x_{k+1} − x*‖^{1/k} < 1
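A quick illustration of these definitions (an example added here, not on the slides): the scalar sequence x_k = ρ^k with ρ ∈ (0, 1) converges to x* = 0 Q-linearly, since ‖x_{k+1} − x*‖/‖x_k − x*‖ = ρ < 1, while x_k = ρ^{2^k} converges Q-quadratically, since ‖x_{k+1} − x*‖/‖x_k − x*‖² = 1 < ∞ and the number of correct digits roughly doubles at each step.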


Unconstrained Minimization
    minimize f(x)

Suppose that:
• The function f is convex and twice continuously differentiable over dom f
• The optimal value is attained: there exists x* such that

    f(x*) = inf_x f(x)

Newton's method can be applied to solve the corresponding optimality condition

    ∇f(x*) = 0,

resulting in

    x_{k+1} = x_k − ∇²f(x_k)^{-1} ∇f(x_k).

This is known as the pure Newton method.
As discussed, in this form the method may not always converge.


Newton’s direction
dk = −∇2f (xk )−1∇f (xk )
Interpretations:
• Second-order Taylor’s expansion at xk yields

T 1 T 2 
2

f (xk + d) = f (xk ) + ∇f (xk ) d + d ∇ f (xk )d + o kdk
2

• The right-hand side without the small order term, provides a (quadratic)
approximation of f in a (small) neighborhood of xk
1
f (xk + d) ≈ f (xk ) + ∇f (xk )T d + dT ∇2f (xk )d
2
• Minimizing the quadratic approximation w/r to d yields:
∇f (xk )T + ∇2f (xk )d = 0
• Newton’s dk can be also viewed as solving linearized optimality condition
∇f (xk + d) ≈ ∇f (xk ) + ∇2f (xk )d = 0
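A small numerical check of these interpretations (a sketch using an assumed strictly convex quadratic test function; the names Q, b, grad, and hess are mine, not from the lecture): for such an f the quadratic approximation is exact, so the Newton direction solves ∇f(x_k) + ∇²f(x_k)d = 0 and a single full step reaches the minimizer.

```python
import numpy as np

# Assumed test function: f(x) = 0.5 x^T Q x - b^T x with Q positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: Q @ x - b        # gradient of f
hess = lambda x: Q                # Hessian of f, constant for a quadratic

x_k = np.array([5.0, -7.0])       # arbitrary starting iterate
# Newton direction: solve the linear system  Hessian(x_k) d = -grad(x_k)
d_k = np.linalg.solve(hess(x_k), -grad(x_k))

x_next = x_k + d_k
print(np.allclose(grad(x_next), 0.0))   # True: one full Newton step solves grad f = 0
```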


Newton Decrement

• Newton's decrement at x_k is defined by

    λ(x_k) = ( ∇f(x_k)^T ∇²f(x_k)^{-1} ∇f(x_k) )^{1/2}

• Provides a measure of the proximity of x_k to x*
• Obtained by evaluating the difference between f(x_k) and the quadratic approximation of f at x_k, evaluated at the optimal d (Newton's direction):

    f(x_k) − ( f(x_k) + ∇f(x_k)^T d_k + (1/2) d_k^T ∇²f(x_k) d_k ) = (1/2) λ(x_k)²

Properties:
• Equal to the norm of the Newton step in the quadratic norm defined by the Hessian:

    λ(x_k) = ( d_k^T ∇²f(x_k) d_k )^{1/2} = ‖d_k‖_{∇²f(x_k)}

• Affine invariant (unlike ‖∇f(x)‖); a small numerical check is sketched below
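The following sketch (reusing the assumed quadratic test function from the previous example; the helper names are mine, not from the lecture) computes the decrement both from its definition and as the Hessian norm of the Newton step, and checks the affine invariance numerically against the reparametrized function g(y) = f(Ay).

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: Q @ x - b
hess = lambda x: Q

def newton_decrement(grad, hess, x):
    g, H = grad(x), hess(x)
    return np.sqrt(g @ np.linalg.solve(H, g))    # ( g^T H^{-1} g )^{1/2}

x = np.array([2.0, 1.0])
d = np.linalg.solve(hess(x), -grad(x))
print(np.isclose(newton_decrement(grad, hess, x),
                 np.sqrt(d @ hess(x) @ d)))      # True: equals the Hessian norm of d

# Affine invariance: for g(y) = f(A y) with A invertible, the decrement of g at
# y = A^{-1} x equals the decrement of f at x (the gradient norm does not).
A = np.array([[2.0, 0.5], [0.0, 1.0]])
grad_g = lambda y: A.T @ grad(A @ y)
hess_g = lambda y: A.T @ hess(A @ y) @ A
y = np.linalg.solve(A, x)
print(np.isclose(newton_decrement(grad, hess, x),
                 newton_decrement(grad_g, hess_g, y)))    # True
print(np.isclose(np.linalg.norm(grad(x)),
                 np.linalg.norm(grad_g(y))))              # False in general
```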


Newton’s Method

Given a starting point x ∈ dom f, an error tolerance ε > 0, and parameters σ ∈ (0, 1/2) and β ∈ (0, 1).

Repeat:
1. Compute the Newton direction and decrement:
   d := −∇²f(x)^{-1} ∇f(x),   λ² := ∇f(x)^T ∇²f(x)^{-1} ∇f(x)
2. Stopping criterion: quit if λ²/2 ≤ ε
3. Line search: choose the stepsize α by backtracking, i.e., starting with α = 1:
   (a) If f(x + αd) < f(x) + σα∇f(x)^T d, go to Step 4
   (b) Else set α := βα and go to (a)
4. Update x := x + αd.
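A minimal Python sketch of this damped Newton method with backtracking line search; the function newton_method, the default parameter values, and the smooth convex test problem are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-8, sigma=0.25, beta=0.5, max_iter=100):
    """Damped Newton method with backtracking line search (sketch).

    sigma in (0, 1/2) and beta in (0, 1) are the line-search parameters;
    the stopping test uses the Newton decrement: quit when lambda^2 / 2 <= eps.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        d = np.linalg.solve(H, -g)        # Newton direction
        lam2 = -g @ d                     # squared Newton decrement, g^T H^{-1} g
        if lam2 / 2.0 <= eps:
            break
        alpha = 1.0                       # backtracking: steps 3(a)-3(b)
        while f(x + alpha * d) >= f(x) + sigma * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d
    return x

# Illustrative smooth convex test problem (not from the slides):
# f(x) = exp(x0 + 3 x1) + exp(x0 - 3 x1) + exp(-x0)
f = lambda x: np.exp(x[0] + 3*x[1]) + np.exp(x[0] - 3*x[1]) + np.exp(-x[0])
def grad(x):
    a, b, c = np.exp(x[0] + 3*x[1]), np.exp(x[0] - 3*x[1]), np.exp(-x[0])
    return np.array([a + b - c, 3*a - 3*b])
def hess(x):
    a, b, c = np.exp(x[0] + 3*x[1]), np.exp(x[0] - 3*x[1]), np.exp(-x[0])
    return np.array([[a + b + c, 3*a - 3*b], [3*a - 3*b, 9*a + 9*b]])

print(newton_method(f, grad, hess, x0=[1.0, 1.0]))   # approx. [-0.3466, 0.0]
```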


Classical Convergence Analysis - Main Results


Assumption 1:
• The level set L_0 = {x | f(x) ≤ f(x_0)} is closed
• f is strongly convex on L_0 with a constant m > 0
• ∇²f is Lipschitz continuous on L_0, with a constant L > 0:

    ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖

  (L measures how well f can be approximated by a quadratic function)

Analysis outline: there exists a constant η ∈ (0, 2m²/L) such that
• When ‖∇f(x_k)‖ ≥ η, then f(x_{k+1}) − f(x_k) ≤ −γ for γ = σβη² m/M², where M is a constant with ∇²f(x) ⪯ M I on L_0 (introduced in the proof)
• When ‖∇f(x_k)‖ < η, then

    (L/(2m²)) ‖∇f(x_{k+1})‖ ≤ ( (L/(2m²)) ‖∇f(x_k)‖ )²


Two Phases of Newton's Method

Damped Newton phase (‖∇f(x)‖ ≥ η; far from x*)

• Most iterations require backtracking steps
• The function value decreases by at least γ at each iteration
• This phase ends after at most (f(x_0) − f*)/γ iterations

Quadratically convergent phase (‖∇f(x)‖ < η; locally near x*)

• All iterations in this phase use the stepsize α = 1
• The gradient ‖∇f(x)‖ converges to zero quadratically: for l ≥ k,

    (L/(2m²)) ‖∇f(x_l)‖ ≤ ( (L/(2m²)) ‖∇f(x_k)‖ )^{2^{l−k}} ≤ (1/2)^{2^{l−k}}
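For a rough sense of scale (an illustrative calculation, not on the slides): five iterations after entering this phase the bound gives (L/(2m²)) ‖∇f(x_{k+5})‖ ≤ (1/2)^{2^5} = 2^{−32} ≈ 2.3·10^{−10}, and one further iteration squares this to about 5.4·10^{−20}.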


Analysis of Newton Method

Theorem Let Assumption 1 hold. Then there exists a constant η ∈ (0, 2m²/L) such that
• When ‖∇f(x_k)‖ ≥ η, then f(x_{k+1}) − f(x_k) ≤ −γ for γ = σβη² m/M²
• When ‖∇f(x_k)‖ < η, then

    (L/(2m²)) ‖∇f(x_{k+1})‖ ≤ ( (L/(2m²)) ‖∇f(x_k)‖ )²

Proof Let ‖∇f(x_k)‖ ≥ η. Let us estimate the stepsize α_k at the end of the backtracking line search. First note that {x_k} ⊂ L_0. Since f is strongly convex, the level set L_0 is bounded, hence ∇f is Lipschitz continuous over L_0 with some constant M (equivalently, ∇²f(x) ⪯ M I on L_0).
By the Q-approximation lemma, we have for any α > 0

    f(x_k + α d_k) ≤ f(x_k) + α ∇f(x_k)^T d_k + (α²M/2) ‖d_k‖².


Using the Newton decrement λ_k = ( ∇f(x_k)^T ∇²f(x_k)^{-1} ∇f(x_k) )^{1/2} and the fact that d_k = −∇²f(x_k)^{-1} ∇f(x_k), we can write

    f(x_k + α d_k) ≤ f(x_k) − α λ_k² + (α²M/2) ‖d_k‖².

Note that

    λ_k² = ∇f(x_k)^T ∇²f(x_k)^{-1} ∇f(x_k) = d_k^T ∇²f(x_k) d_k ≥ m ‖d_k‖²,

by strong convexity of f. Hence ‖d_k‖² ≤ λ_k²/m, implying that

    f(x_k + α d_k) ≤ f(x_k) − α λ_k² + (α²M/(2m)) λ_k²
                  ≤ f(x_k) − α (1 − αM/(2m)) λ_k².


Thus the stepsize α = m/M satisfies the exit condition in the backtracking line search (with σ < 1/2), since for α = m/M we have

    f(x_k + α d_k) ≤ f(x_k) − (m/(2M)) λ_k² < f(x_k) − σ (m/M) λ_k².

Therefore, the backtracking line search stops with some α_k ≥ β m/M. Thus, when the line search is exited with the step α_k, we have

    f(x_k + α_k d_k) < f(x_k) − σ α_k λ_k² ≤ f(x_k) − σβ (m/M) λ_k².

Since ∇²f(x) ⪯ M I, we have

    λ_k² = ∇f(x_k)^T ∇²f(x_k)^{-1} ∇f(x_k) ≥ (1/M) ‖∇f(x_k)‖².


Hence, when ‖∇f(x_k)‖ ≥ η, we have

    f(x_k + α_k d_k) < f(x_k) − σβ (m/M²) η²,

i.e., f(x_{k+1}) − f(x_k) < −γ with γ = σβη² m/M².

Suppose now ‖∇f(x_k)‖ < η.

Strong Q-lemma: For a continuously differentiable (possibly vector-valued) function g over R^n whose derivative is Lipschitz continuous with constant L, we have

    ‖g(x + y) − g(x) − ∇g(x)^T y‖ ≤ (L/2) ‖y‖².

Applying this lemma to g = ∇f (whose derivative ∇²f is Lipschitz with constant L by Assumption 1), we have

    ‖∇f(x_k + d) − ∇f(x_k) − ∇²f(x_k) d‖ ≤ (L/2) ‖d‖².


Letting d = d_k be the Newton direction (in this phase the unit stepsize α = 1 is used, so x_{k+1} = x_k + d_k and ∇f(x_k) + ∇²f(x_k) d_k = 0), we obtain

    ‖∇f(x_{k+1})‖ ≤ (L/2) ‖∇²f(x_k)^{-1} ∇f(x_k)‖² ≤ (L/2) ‖∇²f(x_k)^{-1}‖² ‖∇f(x_k)‖².

Since m I ⪯ ∇²f(x), it follows that

    ‖∇f(x_{k+1})‖ ≤ (L/(2m²)) ‖∇f(x_k)‖².

Hence, ‖∇f(x_{k+1})‖ ≤ η. Furthermore, for any K ≥ k,

    ‖∇f(x_{K+1})‖ ≤ (2m²/L) ( (L/(2m²)) ‖∇f(x_K)‖ )² ≤ ··· ≤ (2m²/L) ( (L/(2m²)) ‖∇f(x_k)‖ )^{2^{K+1−k}},

showing the quadratic convergence rate.


Conclusions

The number of iterations until f(x) − f* ≤ ε is bounded above by

    (f(x_0) − f*)/γ + log₂ log₂ (ε₀/ε)

• γ = σβη² m/M²,  ε₀ = 2m³/L²
• The second term is small (on the order of 6) and almost constant for practical purposes: six iterations of the quadratically convergent phase result in an accuracy of ≈ 5·10^{−20} ε₀ (see the short calculation below)
• In practice, the constants m and L (and hence γ and ε₀) are usually unknown
• The analysis provides qualitative insight into the convergence properties (e.g., it explains the two phases of the algorithm)
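A quick check of the second term (an illustrative calculation, not from the slides): requiring ε/ε₀ = 2^{−2^6} = 2^{−64} ≈ 5.4·10^{−20} gives log₂ log₂(ε₀/ε) = log₂ 64 = 6, which is where the six-iteration figure comes from.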
