Optimization
J. Zico Kolter
Outline
Introduction to optimization
Unconstrained optimization
Constrained optimization
Practical optimization
Beyond linear programming

Linear programming:

    minimize_x   c^T x
    subject to   Gx ≤ h
                 Ax = b

General (continuous) optimization:

    minimize_x   f(x)
    subject to   g_i(x) ≤ 0,  i = 1, …, m
                 h_i(x) = 0,  i = 1, …, p
Example: image deblurring

[Figure: original and blurred/noisy test images (Lena and Man), together with reconstructions by ForWaRD, deconvwnr, deconvreg, and FTVd]
Given a corrupted image represented as a vector y ∈ R^{m·n}, find x ∈ R^{m·n} by solving the optimization problem

    minimize_x   ∥K ∗ x − y∥_2^2 + λ ( Σ_{i=1}^{m−1} |x_{m,i} − x_{m,(i+1)}| + Σ_{i=1}^{n−1} |x_{n,i} − x_{n,(i+1)}| )

where K ∗ x denotes convolution of the image with a blurring kernel K, and the absolute-value terms penalize differences between neighboring pixels.

Another example: machine learning, where we fit model parameters θ to training data (x^(i), y^(i)) by minimizing a loss ℓ:

    minimize_θ   Σ_{i=1}^m ℓ( h_θ(x^(i)), y^(i) )
Example: robot trajectory planning

[Figure: motion planning benchmark scenes, including an 18-DOF full-body planning problem]

With robot states x_t and control inputs u_t:

    minimize_{x_{1:T}, u_{1:T−1}}   Σ_{t=1}^{T−1} ∥x_{t+1} − x_t∥_2^2 + ∥u_t∥_2^2

    subject to   x_{t+1} = f_dynamics(x_t, u_t)   (robot dynamics)
                 f_collision(x_t) ≥ 0.1           (avoid collisions)
                 x_1 = x_init,  x_T = x_goal

Many other applications
Classes of optimization problems

Several distinctions matter when choosing a solution method, among them constrained vs. unconstrained and smooth vs. nonsmooth. Unconstrained vs. constrained:

    minimize_x   f(x)

vs.

    minimize_x   f(x)
    subject to   g_i(x) ≤ 0,  i = 1, …, m
                 h_i(x) = 0,  i = 1, …, p
Convex vs. nonconvex optimization

Originally, researchers distinguished between linear (easy) and nonlinear (hard) optimization problems. But in the 80s and 90s, it became clear that this wasn't the right line: the real distinction is between convex (easy) and nonconvex (hard) problems.

A function f is convex if the chord between any two points (x, f(x)) and (y, f(y)) on its graph lies on or above the graph; f is concave if −f is convex.

[Figure: f1(x), a convex function, vs. f2(x), a nonconvex function]

For a convex problem (convex objective, convex feasible set), these properties imply that any local optimum must also be a global optimum. Thus, for convex problems we can use local methods to find the globally optimal solution (cf. linear programming vs. integer programming).
Smooth vs. nonsmooth optimization

[Figure: f1(x), a smooth function, vs. f2(x), a nonsmooth function]
Solving optimization problems

Starting with the unconstrained, smooth, one-dimensional case.

[Figure: a function f(x), iterates x0, x1, x2, the gradient ∇x f(x), and the tangent line f(x0) + ∇x f(x0)^T (x − x0)]

The first-order Taylor expansion gives a local approximation,

    f(x) ≈ f(x0) + ∇x f(x0)^T (x − x0)

and, when f is convex, a global underestimator:

    f(x) ≥ f(x0) + ∇x f(x0)^T (x − x0)
Some common gradients

For the linear function f(x) = Σ_{i=1}^n a_i x_i = a^T x:

    ∂f(x)/∂x_i = ∂/∂x_i Σ_{j=1}^n a_j x_j = a_i,   i.e.,   ∇x f(x) = a
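A quick way to sanity-check a gradient like the one above is to compare it against a finite-difference approximation. A minimal sketch (the random test point and tolerance are my own choices, not from the slides):

```python
import numpy as np

# Check the analytic gradient of f(x) = a^T x, namely ∇x f(x) = a,
# against a central finite-difference approximation.
rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda x: a @ x
grad_analytic = a

eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)      # perturb one coordinate at a time
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```

The same check works for any hand-derived gradient, which makes it a useful habit before plugging a gradient into an iterative method.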
How do we find ∇x f(x) = 0?

Direct solution: in some cases, it is possible to analytically compute the x⋆ such that ∇x f(x⋆) = 0.

Example:

Iterative methods: more commonly, the condition that the gradient equal zero will not have an analytical solution, so we require iterative methods.
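As one concrete illustration of a direct solution (least squares is my choice of example here, not necessarily the one from the slides): for f(x) = ∥Ax − b∥_2^2, the gradient is ∇x f(x) = 2 A^T (Ax − b), and setting it to zero gives the closed-form solution x⋆ = (A^T A)^{−1} A^T b.

```python
import numpy as np

# Illustrative example (assumed, not from the slides): solve
# ∇f(x) = 2 A^T (A x - b) = 0 analytically for least squares.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# x* = (A^T A)^{-1} A^T b, computed via a linear solve
x_star = np.linalg.solve(A.T @ A, A.T @ b)

grad = 2 * A.T @ (A @ x_star - b)
print(np.allclose(grad, 0, atol=1e-8))  # the optimality condition holds
```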
Gradient descent

The gradient doesn't just give us the optimality condition; it also points in the direction of "steepest ascent" for the function f.

[Figure: contour plot with the gradient ∇x f(x) drawn at iterates x1, x2]

Repeat:  x ← x − α ∇x f(x)

[Figure: 100 iterations of gradient descent on a function, shown as a path on a contour plot over (x1, x2) and as a semilog plot of f − f⋆ vs. iteration]
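The update rule above is only a few lines of code. A minimal sketch, using as a stand-in objective the quadratic f(x) = ∥x∥_2^2 (an assumption for illustration; the slides' exact test function is not shown):

```python
import numpy as np

# Repeat: x ← x − α ∇f(x)
def grad_descent(grad_f, x0, alpha, iters):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

grad_f = lambda x: 2 * x          # ∇f for f(x) = ||x||_2^2
x = grad_descent(grad_f, [3.0, 2.5], alpha=0.1, iters=100)
print(np.allclose(x, 0, atol=1e-6))  # converged to the minimizer x* = 0
```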
How do we choose step size α?

[Figure: gradient descent paths on contour plots over (x1, x2) for α = 0.05 and α = 0.42]
[Figure: convergence of gradient descent for different step sizes (α = 0.05, 0.2, 0.42), plotting f − f⋆ vs. iteration on a log scale]
If we know the gradient is Lipschitz continuous with constant L, step size α = 1/L is good in theory and practice.

Idea #1 ("exact" line search): choose α to minimize f(x0 − α∇x f(x0)) for the current iterate x0; this is just another optimization problem, but with a single variable α.

Idea #2 (backtracking line search): try a few α's on each iteration until we get one that causes a suitable decrease in the function.
Backtracking line search

function α = Backtracking-Line-Search(x, f, ∆x, α0, γ, β)
    // x: current iterate
    // f: function being optimized
    // ∆x: descent direction (i.e., ∆x = −∇x f(x))
    // α0: initial step size
    // γ ∈ (0, 0.5), β ∈ (0, 1): backtracking parameters
    α ← α0
    while f(x + α∆x) > f(x) + γα ∇x f(x)^T ∆x
        α ← α · β
    return α
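The pseudocode above transcribes almost directly into Python. A sketch (the default parameter values γ = 0.3, β = 0.5 are typical choices, not from the slides):

```python
import numpy as np

# Shrink alpha until the sufficient-decrease condition from the
# pseudocode holds: f(x + α∆x) ≤ f(x) + γ α ∇f(x)^T ∆x
def backtracking_line_search(x, f, grad_fx, dx, alpha0=1.0, gamma=0.3, beta=0.5):
    alpha = alpha0
    while f(x + alpha * dx) > f(x) + gamma * alpha * grad_fx @ dx:
        alpha *= beta
    return alpha

# Usage on f(x) = ||x||_2^2 with the steepest-descent direction ∆x = −∇f(x)
f = lambda x: x @ x
x = np.array([3.0, 2.5])
grad_fx = 2 * x
alpha = backtracking_line_search(x, f, grad_fx, -grad_fx)

# The returned step satisfies the sufficient-decrease condition
print(f(x - alpha * grad_fx) <= f(x) + 0.3 * alpha * grad_fx @ (-grad_fx))  # True
```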
[Figure: 100 iterations of gradient descent with backtracking line search, shown as a path on a contour plot over (x1, x2) and as a semilog plot of f − f⋆ vs. iteration]
Trouble with gradient descent

[Figure: zig-zagging path of gradient descent with backtracking line search on a contour plot over (x1, x2)]

Gradient descent with backtracking line search on the function

    f(x) = 0.01 x_1^2 + x_2^2 − 0.5 x_1 − x_2
Newton's method

[Figure: a function f(x), its first-order approximation f(x0) + f′(x0)(x − x0), and its second-order approximation at x0]

The second-order Taylor expansion is

    f(x) ≈ f(x0) + ∇x f(x0)^T (x − x0) + ½ (x − x0)^T ∇x^2 f(x0) (x − x0)

and Newton's method repeatedly minimizes this quadratic approximation:

    x ← x − α (∇x^2 f(x))^{−1} ∇x f(x)

Because ∂^2 f(x)/∂x_i ∂x_j = ∂^2 f(x)/∂x_j ∂x_i (i.e., it doesn't matter in which order we differentiate), the Hessian of any function is always a symmetric matrix.
[Figure: Newton's method path on the previous example, plotted on a contour over (x1, x2)]

For our previous example, Newton's method finds the exact solution in a single step (this holds true for any convex quadratic function).
[Figure: convergence of Newton's method, plotting f − f⋆ vs. iteration on a log scale]
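The one-step claim is easy to check numerically on the quadratic from the previous slides, f(x) = 0.01 x_1^2 + x_2^2 − 0.5 x_1 − x_2 (the starting point below is an arbitrary choice):

```python
import numpy as np

# One Newton step x ← x − (∇²f(x))⁻¹ ∇f(x) on the quadratic
# f(x) = 0.01 x1² + x2² − 0.5 x1 − x2
grad = lambda x: np.array([0.02 * x[0] - 0.5, 2 * x[1] - 1.0])
hess = np.array([[0.02, 0.0], [0.0, 2.0]])  # constant Hessian for a quadratic

x = np.array([0.0, 4.0])
x = x - np.linalg.solve(hess, grad(x))  # a single Newton step

# For a convex quadratic, one step lands exactly on the minimizer
print(np.allclose(grad(x), 0))  # True: x* = (25, 0.5)
```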
Handling nonconvexity

[Figure: f1(x), a convex function, vs. f2(x), a nonconvex function]

In practice, for the vast majority of practical nonconvex problems (e.g., deep networks, probabilistic graphical models with missing variables), people are satisfied with finding a local optimum.
Handling nonsmooth functions

[Figure: the absolute value function |x|, which is nonsmooth at x = 0]
Subgradients

Although nonsmooth convex functions do not have gradients everywhere, they do have a subgradient at every point: any vector g such that

    f(x) ≥ f(x0) + g^T (x − x0)

[Figure: a nonsmooth convex function f(x) with a supporting line at x0]

One subtle point: a constant step size (no matter how small) may never converge exactly (similar issues arise for backtracking line search); a common fix is a diminishing step size such as

    α_t = α_0 / (n_0 + t)

[Figure: subgradient method iterates on f(x) = |x|]
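A minimal sketch of the subgradient method on f(x) = |x|, using the diminishing step size α_t = α_0 / (n_0 + t) above (the values of α_0 and n_0 here are illustrative assumptions):

```python
# Subgradient method on f(x) = |x| with step size α_t = α0 / (n0 + t)
def subgradient(x):
    # any element of the subdifferential of |x|
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x, alpha0, n0 = 1.0, 0.5, 1.0
for t in range(1000):
    alpha = alpha0 / (n0 + t)
    x = x - alpha * subgradient(x)

print(abs(x) < 0.01)  # True: the iterates settle near the minimizer x* = 0
```

With a constant step size the iterates would keep overshooting zero by a fixed amount; the diminishing schedule shrinks that oscillation toward the optimum.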
There is no "easy" analogue of Newton's method when we only have subgradients.
Constrained optimization

Recall the constrained optimization problem

    minimize_x   f(x)
    subject to   g_i(x) ≤ 0,  i = 1, …, m
                 h_i(x) = 0,  i = 1, …, p

and its convex variant

    minimize_x   f(x)
    subject to   g_i(x) ≤ 0,  i = 1, …, m
                 Ax = b

with f and all g_i convex.

One natural approach is to combine gradient steps with projections onto the feasible set:

    Proj(v) = argmin_x ∥x − v∥_2^2   subject to   g_i(x) ≤ 0, i = 1, …, m;  h_i(x) = 0, i = 1, …, p

But doesn't solving the projection seem just as hard as solving the problem to begin with?
The good news: some constraints have very simple forms of projection.

Example, x ≥ 0: clip negative entries to zero, Proj(v) = max(v, 0) elementwise.

Example, Ax = b: for A with full row rank, Proj(v) = v − A^T (A A^T)^{−1} (A v − b).
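Both projections are one-liners in numpy. A sketch (the random A, b, v are illustrative, and the affine projection assumes A has full row rank):

```python
import numpy as np

def proj_nonneg(v):
    """Projection onto {x : x >= 0}: clip negative entries to zero."""
    return np.maximum(v, 0)

def proj_affine(v, A, b):
    """Projection onto {x : Ax = b}: v - A^T (A A^T)^{-1} (A v - b)."""
    return v - A.T @ np.linalg.solve(A @ A.T, A @ v - b)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
v = rng.standard_normal(6)

x = proj_affine(v, A, b)
print(np.allclose(A @ x, b))        # the projected point is feasible
print(np.all(proj_nonneg(v) >= 0))  # and nonnegative for the orthant
```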
Duality and the KKT conditions

Define the Lagrangian

    L(x, λ, ν) = f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{i=1}^p ν_i h_i(x)
Because an optimal (x⋆, λ⋆, ν⋆) must have f(x⋆) = L(x⋆, λ⋆, ν⋆) and ∇x L(x⋆, λ⋆, ν⋆) = 0, we have the following (KKT) conditions:

1. g_i(x⋆) ≤ 0, ∀i = 1, …, m
2. h_i(x⋆) = 0, ∀i = 1, …, p
3. λ⋆_i ≥ 0, ∀i = 1, …, m
4. Σ_{i=1}^m λ⋆_i g_i(x⋆) = 0  ⟹  λ⋆_i g_i(x⋆) = 0, ∀i = 1, …, m
5. ∇x f(x⋆) + Σ_{i=1}^m λ⋆_i ∇x g_i(x⋆) + Σ_{i=1}^p ν⋆_i ∇x h_i(x⋆) = 0
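A hand-worked check of these conditions on a toy problem (my own example, not from the slides): minimize x² subject to g(x) = 1 − x ≤ 0, whose solution is x⋆ = 1 with multiplier λ⋆ = 2.

```python
# Verify the KKT conditions for: minimize x^2 subject to 1 - x <= 0
x_star, lam_star = 1.0, 2.0

g = 1 - x_star                                  # constraint value at x*
stationarity = 2 * x_star + lam_star * (-1.0)   # ∇f + λ∇g = 2x - λ

print(g <= 0)              # 1. primal feasibility
print(lam_star >= 0)       # 3. dual feasibility
print(lam_star * g == 0)   # 4. complementary slackness
print(stationarity == 0)   # 5. stationarity
```

All four prints are True; the constraint is active (g = 0), which is why the multiplier can be strictly positive.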
Equality constrained optimization

    minimize_x   f(x)
    subject to   Ax = b

Here the KKT conditions reduce to

    ∇x f(x⋆) + A^T ν⋆ = 0
    Ax⋆ − b = 0
Just like in the unconstrained case, after a bit of derivation, we can use Newton's method to find a zero of this system by repeatedly solving the linear system

    [ ∇x^2 f(x)   A^T ] [ ∆x ]   [ ∇x f(x) + A^T ν ]
    [ A           0   ] [ ∆ν ] = [ Ax − b          ]

and updating

    x ← x − α ∆x,   ν ← ν − α ∆ν
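As a sketch, consider an equality-constrained quadratic f(x) = ½ xᵀPx + qᵀx (my choice of test problem; the slides don't specify one). For a quadratic, a single solve of the KKT linear system above with α = 1 reaches the optimum:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)        # positive definite Hessian
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)

x = np.zeros(n)
nu = np.zeros(p)
grad = P @ x + q               # ∇f(x) for the quadratic

# Assemble and solve the KKT linear system from the slide above
KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([grad + A.T @ nu, A @ x - b])
step = np.linalg.solve(KKT, rhs)
x, nu = x - step[:n], nu - step[n:]

print(np.allclose(A @ x, b))                 # primal feasibility
print(np.allclose(P @ x + q + A.T @ nu, 0))  # stationarity
```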
Inequality constrained optimization

Consider the full convex constrained optimization problem, with convex f, g_i:

    minimize_x   f(x)
    subject to   g_i(x) ≤ 0,  i = 1, …, m
                 Ax = b

where

    Dg(x) = [ ∇x g_1(x)^T ]
            [     ⋮       ]
            [ ∇x g_m(x)^T ]

is the Jacobian of the inequality constraints. Applying Newton's method to a perturbed version of the resulting KKT system gives the primal-dual interior point method, which is the state of the art for exactly solving convex optimization problems.
Handling nonconvexity

Nonetheless, there are a few methods that can work well in practice to (locally) find hopefully feasible solutions.

One popular approach is sequential convex programming: iteratively attempt to solve the penalized, unconstrained, convexified objective

    minimize_x   f̃(x, x0) + µ Σ_{i=1}^m |g̃_i(x, x0)|_+ + µ Σ_{i=1}^p |h̃_i(x, x0)|

where f̃(x, x0) = f(x0) + ∇x f(x0)^T (x − x0) (and similarly for g̃_i, h̃_i) are first-order Taylor approximations, and |·|_+ denotes the positive part.
Consider the problem

    minimize_x   ∥Ax − b∥_2^2 + µ Σ_{i=1}^n |x_i|
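One simple way to solve this ℓ1-regularized problem is proximal gradient descent (ISTA) — my choice of method here, since the slides don't name one: alternate a gradient step on ∥Ax − b∥_2^2 with soft-thresholding, the proximal operator of µ∥x∥_1.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0)

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
mu = 5.0

obj = lambda x: np.sum((A @ x - b) ** 2) + mu * np.sum(np.abs(x))

# Step size from the Lipschitz constant of ∇||Ax - b||^2 = 2A^T(Ax - b)
alpha = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
x = np.zeros(10)
for _ in range(500):
    x = soft_threshold(x - alpha * 2 * A.T @ (A @ x - b), alpha * mu)

print(obj(x) <= obj(np.zeros(10)))  # the objective improved over x = 0
```

The soft-threshold step is what produces exactly-zero entries in x, which a plain (sub)gradient step would not.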
Practically solving optimization problems

cvxpy

Available at http://www.cvxpy.org
Image deblurring

[Figure: the blurry and noisy image, and recoveries by ForWaRD, FTVd, deconvwnr, and deconvreg]

    minimize_x   ∥K ∗ x − y∥_2^2 + µ ( Σ_{i=1}^{m−1} |x_{m,i} − x_{m,(i+1)}| + Σ_{i=1}^{n−1} |x_{n,i} − x_{n,(i+1)}| )

cvxpy code (Y, K, and mu must be supplied):

import cvxpy as cp

Y, K = ...
X = cp.Variable(Y.shape)
f = (cp.sum_squares(K @ cp.vec(X) - cp.vec(Y)) +
     mu * cp.sum(cp.abs(X[:, :-1] - X[:, 1:])) +
     mu * cp.sum(cp.abs(X[:-1, :] - X[1:, :])))
constraints = []
result = cp.Problem(cp.Minimize(f), constraints).solve()