Convex Optimization
Lieven Vandenberghe
Introduction
• mathematical optimization
• recent history
Mathematical optimization
• generally intractable
The famous exception: linear programming
minimize    c^T x
subject to  a_i^T x ≤ b_i,   i = 1, . . . , m
Convex optimization problem
minimize    f_0(x)
subject to  f_i(x) ≤ 0,   i = 1, . . . , m
History
minimize    c^T x
subject to  a_i^T x ≤ b_i,   i = 1, . . . , m
New applications since 1990
Advances in convex optimization algorithms
interior-point methods
Overview
3. First-order methods
• gradient algorithms
• dual techniques
Convex optimization — MLSS 2011
• convex sets
• convex functions
a convex set contains the line segment between any two points in the set
(figure: a function p(t) plotted for t ∈ [0, π], and a set C in the (x_1, x_2)-plane)
• convex sets
• convex functions
(figure: the chord between (x, f(x)) and (y, f(y)) lies above the graph of a convex function)

f is concave if −f is convex

(figure: the epigraph epi f; first-order condition: f(y) ≥ f(x) + ∇f(x)^T (y − x), the tangent at (x, f(x)) lies below the graph)
• convex sets
• convex functions
1. verify definition
examples
f(x) = − ∑_{i=1}^m log( b_i − a_i^T x )
proof:
f(x) = max{ x_{i_1} + x_{i_2} + · · · + x_{i_r} | 1 ≤ i_1 < i_2 < · · · < i_r ≤ n }
λ_max(X) = sup_{‖y‖_2 = 1} y^T X y
examples
h(x) = inf_{y : Ay ≤ x} c^T y
follows by taking
composition of g : Rn → R and h : R → R:
f (x) = h(g(x))
f is convex if
g convex, h convex and nondecreasing
g concave, h convex and nonincreasing
examples
g(x, t) = tf (x/t)
examples
• perspective of f(x) = x^T x is the quadratic-over-linear function g(x, t) = x^T x / t
(figure: the graph of f(x), the line xy, and the point (0, −f^*(y)), illustrating the conjugate function)
convex quadratic

f(x) = (1/2) x^T Q x,     f^*(y) = (1/2) y^T Q^{−1} y
negative entropy

f(x) = ∑_{i=1}^n x_i log x_i,     f^*(y) = ∑_{i=1}^n e^{y_i − 1}
norm

f(x) = ‖x‖,     f^*(y) = 0 if ‖y‖_* ≤ 1,  +∞ otherwise
• linear programming
• quadratic programming
• geometric programming
• semidefinite programming
• modeling software
Convex optimization problem
minimize    f_0(x)
subject to  f_i(x) ≤ 0,   i = 1, . . . , m
            Ax = b
minimize    c^T x + d
subject to  Gx ≤ h
            Ax = b
(figure: LP geometry — the feasible polyhedron P, the optimal point x⋆, and the objective direction −c)

piecewise-linear minimization of f(x) = max_i ( a_i^T x + b_i ), formulated as the LP
minimize    t
subject to  a_i^T x + b_i ≤ t,   i = 1, . . . , m

an LP with variables x ∈ R^n and t ∈ R
ℓ1-norm approximation: the problem

minimize    ‖Ax − b‖_1

is equivalent to the LP

minimize    ∑_i y_i = 1^T y
subject to  −y ≤ Ax − b ≤ y

(1 is the vector of ones)
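A minimal sketch of this LP reformulation in MATLAB/CVX (the modeling package used later in these slides), assuming A, b and the dimensions m, n are already defined:

cvx_begin
    variables x(n) y(m)
    minimize( sum(y) )
    subject to
        A*x - b <= y;     % the two one-sided constraints together give -y <= A*x - b <= y
        b - A*x <= y;
cvx_end

The same problem can also be written directly as minimize( norm(A*x - b, 1) ).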
a^T x_i + b > 0,  i = 1, . . . , N,     a^T y_i + b < 0,  i = 1, . . . , M

since these inequalities are homogeneous in a, b, they are equivalent to

a^T x_i + b ≥ 1,  i = 1, . . . , N,     a^T y_i + b ≤ −1,  i = 1, . . . , M
minimize    ∑_{i=1}^N max{ 0, 1 − a^T x_i − b } + ∑_{i=1}^M max{ 0, 1 + a^T y_i + b }
minimize    (1/2) x^T P x + q^T x + r
subject to  Gx ≤ h
(figure: QP geometry — the negative gradient −∇f_0(x⋆) at the optimal point x⋆)
minimize    c^T x
subject to  Gx ≤ h
H_1 = { z | a^T z + b = 1 },     H_2 = { z | a^T z + b = −1 }

minimize    ‖a‖_2^2 = a^T a
subject to  a^T x_i + b ≥ 1,   i = 1, . . . , N
            a^T y_i + b ≤ −1,   i = 1, . . . , M
a quadratic program in a, b
minimize    γ ‖a‖_2^2 + ∑_{i=1}^N max{ 0, 1 − a^T x_i − b } + ∑_{i=1}^M max{ 0, 1 + a^T y_i + b }

(figure: classifiers obtained for γ = 0 and γ = 10)

equivalent to a QP
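A minimal CVX sketch of this support vector classifier, assuming the two point sets are stored as rows of matrices X (N-by-n) and Y (M-by-n) and gam holds the weight γ:

cvx_begin
    variables a(n) b
    minimize( gam*sum_square(a) + sum(pos(1 - X*a - b)) + sum(pos(1 + Y*a + b)) )
cvx_end
% pos(u) = max(u, 0) gives the hinge penalties; sum_square(a) = a'*a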
φ_quad(x̂) = ∑_{i=1}^{n−1} ( x̂_{i+1} − x̂_i )^2,     φ_tv(x̂) = ∑_{i=1}^{n−1} | x̂_{i+1} − x̂_i |
(figure: the corrupted signal x_cor and the reconstructions by quadratic and total variation smoothing, plotted against i = 0, . . . , 2000)
• quadratic smoothing smooths out noise and sharp transitions in signal
• total variation smoothing preserves sharp transitions in signal
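A minimal CVX sketch of a scalarized version of the trade-off (lambda is an assumed weight on the smoothing term; xcor is the corrupted signal):

n = length(xcor);
cvx_begin
    variable xhat(n)
    minimize( norm(xhat - xcor) + lambda*norm(xhat(2:n) - xhat(1:n-1), 1) )
cvx_end
% replacing the second term by lambda*sum_square(xhat(2:n) - xhat(1:n-1))
% gives quadratic smoothing instead of total variation smoothing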
posynomial function

f(x) = ∑_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} · · · x_n^{a_{nk}},     dom f = R^n_{++}

with c_k > 0
minimize f0(x)
subject to fi(x) ≤ 1, i = 1, . . . , m
with fi posynomial
change variables to y_i = log x_i; the problem becomes

minimize    log ( ∑_{k=1}^K exp( a_{0k}^T y + b_{0k} ) )
subject to  log ( ∑_{k=1}^K exp( a_{ik}^T y + b_{ik} ) ) ≤ 0,   i = 1, . . . , m
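For illustration, a minimal CVX sketch in geometric-programming mode for a hypothetical box-design GP (maximize volume subject to assumed wall and floor area limits Awall, Afloor); CVX performs the log-variable transformation internally:

cvx_begin gp
    variables w h d            % box width, height, depth
    maximize( w * h * d )      % monomial objective
    subject to
        2*(h*w + h*d) <= Awall;   % posynomial constraints
        w*d <= Afloor;
cvx_end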
minimize    f^T x
subject to  ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i,   i = 1, . . . , m

• ‖ · ‖_2 is the Euclidean norm ‖y‖_2 = √( y_1^2 + · · · + y_n^2 )
• constraints are nonlinear, nondifferentiable, convex
{ y | √( y_1^2 + · · · + y_{p−1}^2 ) ≤ y_p }

(figure: boundary of the second-order cone in R^3, axes y_1, y_2)
minimize    c^T x
subject to  prob( a_i^T x ≤ b_i ) ≥ η,   i = 1, . . . , m

for Gaussian a_i with mean ā_i and covariance Σ_i (and η ≥ 1/2), each constraint is equivalent to

ā_i^T x + Φ^{−1}(η) ‖ Σ_i^{1/2} x ‖_2 ≤ b_i

(figure: the Gaussian CDF Φ(t), with Φ(0) = 0.5 and the quantile Φ^{−1}(η))
minimize    c^T x
subject to  a_i^T x ≤ b_i for all a_i ∈ E_i,   i = 1, . . . , m

SOCP formulation, with E_i = { ā_i + P_i u | ‖u‖_2 ≤ 1 }:

minimize    c^T x
subject to  ā_i^T x + ‖ P_i^T x ‖_2 ≤ b_i,   i = 1, . . . , m

follows from   sup_{‖u‖_2 ≤ 1} ( ā_i + P_i u )^T x = ā_i^T x + ‖ P_i^T x ‖_2
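A minimal CVX sketch of this SOCP, assuming the ellipsoid data are stored in cell arrays abar{i} (centers) and P{i} (shape matrices):

cvx_begin
    variable x(n)
    minimize( c'*x )
    subject to
        for i = 1:m
            abar{i}'*x + norm(P{i}'*x) <= b(i);   % worst case over the ellipsoid E_i
        end
cvx_end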
x^T A x + 2 b^T x + c ≤ 0
⇕   (with A = L L^T ≻ 0)
‖ L^T x + L^{−1} b ‖_2 ≤ ( b^T A^{−1} b − c )^{1/2}
hyperbolic constraint

x^T x ≤ yz,   y, z ≥ 0
⇕
‖ (2x, y − z) ‖_2 ≤ y + z,   y, z ≥ 0
negative powers

x^{−3} ≤ t,   x > 0
⇕
∃ z :  1 ≤ xz,   z^2 ≤ tx,   x, z ≥ 0
minimize    c^T x
subject to  x_1 A_1 + x_2 A_2 + · · · + x_n A_n ⪯ B
(figure: the set of (x, y, z) with [ x, y; y, z ] ⪰ 0, the positive semidefinite cone in S^2)
matrix-fractional function

nuclear norm approximation ( ‖X‖_* = ∑_k σ_k(X) )
minimize    x^T A^T A x − 2 b^T A x + b^T b
subject to  x_i^2 = 1,   i = 1, . . . , n

equivalent formulation with Y = x x^T:

minimize    tr( A^T A Y ) − 2 b^T A x + b^T b
subject to  Y = x x^T,   Y_{ii} = 1,  i = 1, . . . , n

• cost function and second constraint are linear (in the variables Y, x)
• first constraint is nonlinear and nonconvex
(figure: histogram of ‖Ax − b‖_2 / (SDP bound), frequency versus ratio)
• a convex constraint on x
• holds if and only if f is a sum of squares of (two) polynomials:
f(t) = ∑_k ( y_{k0} + y_{k1} t + · · · + y_{km} t^m )^2
     = ( 1, t, . . . , t^m ) ( ∑_k y_k y_k^T ) ( 1, t, . . . , t^m )^T

     = ( 1, t, . . . , t^m ) Y ( 1, t, . . . , t^m )^T

where Y = ∑_k y_k y_k^T ⪰ 0
x_0 = Y_{11}
x_1 = Y_{12} + Y_{21}
x_2 = Y_{13} + Y_{22} + Y_{31}
  ...
x_{2m} = Y_{m+1, m+1}
x^T p(t) = ∑_{k=1}^s ( y_k^T q(t) )^2 = q(t)^T ( ∑_{k=1}^s y_k y_k^T ) q(t)
related packages
MATLAB code
cvx_begin
variable x(3);
minimize(norm(A*x - b, 1))
subject to
x >= 0;
x <= 1;
cvx_end
Conic optimization
• modeling
• duality
Generalized (conic) inequalities
• closed
• pointed: K ∩ (−K) = {0}
• with nonempty interior: int K ≠ ∅; equivalently, K + (−K) = R^m

notation

x ⪰_K y  ⟺  x − y ∈ K,     x ≻_K y  ⟺  x − y ∈ int K
Cone linear program
minimize    c^T x
subject to  Ax ⪯_K b
Norm cones
K = { (x, y) ∈ R^{m−1} × R | ‖x‖ ≤ y }

(figure: boundary of the norm cone in R^3, axes x_1, x_2, y)
for the Euclidean norm this is the second-order cone (notation: Qm)
Second-order cone program
minimize    c^T x
subject to  ‖B_{k0} x + d_{k0}‖_2 ≤ B_{k1} x + d_{k1},   k = 1, . . . , r
Vector notation for symmetric matrices
coefficients √2 are added so that standard inner products are preserved:

tr(XY) = vec(X)^T vec(Y)
Positive semidefinite cone
(figure: boundary of the positive semidefinite cone in S^2, in the coordinates (x, y, z))

S^2_+ = { (x, y, z) | [ x,  y/√2;  y/√2,  z ] ⪰ 0 }
Semidefinite program
minimize    c^T x
subject to  x_1 A_{11} + x_2 A_{12} + · · · + x_n A_{1n} ⪯ B_1
            ...
            x_1 A_{r1} + x_2 A_{r2} + · · · + x_n A_{rn} ⪯ B_r

K = S^{p_1}_+ × S^{p_2}_+ × · · · × S^{p_r}_+

A = [ vec(A_{11})  vec(A_{12})  · · ·  vec(A_{1n}) ]        b = [ vec(B_1) ]
    [ vec(A_{21})  vec(A_{22})  · · ·  vec(A_{2n}) ]            [ vec(B_2) ]
    [     ...          ...                 ...     ]            [   ...    ]
    [ vec(A_{r1})  vec(A_{r2})  · · ·  vec(A_{rn}) ]            [ vec(B_r) ]
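A minimal CVX sketch (sdp mode) of an SDP with a single LMI block, assuming n = 3 and that the symmetric matrices A1, A2, A3, B and the cost vector c are given:

cvx_begin sdp
    variable x(3)
    minimize( c'*x )
    subject to
        x(1)*A1 + x(2)*A2 + x(3)*A3 <= B;   % in sdp mode this means B - sum_i x(i)*Ai is PSD
cvx_end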
Exponential cone
the epigraph of the perspective of exp x is a non-proper cone
K = { (x, y, z) ∈ R^3 | y e^{x/y} ≤ z, y > 0 }

(figure: boundary of the exponential cone, axes x, y, z)
Geometric program
minimize    c^T x
subject to  log ( ∑_{k=1}^{n_i} exp( a_{ik}^T x + b_{ik} ) ) ≤ 0,   i = 1, . . . , r
cone LP formulation

minimize    c^T x
subject to  ( a_{ik}^T x + b_{ik},  1,  z_{ik} ) ∈ K_exp,   k = 1, . . . , n_i,  i = 1, . . . , r
            ∑_{k=1}^{n_i} z_{ik} ≤ 1,   i = 1, . . . , r
Power cone
definition: for α = (α_1, α_2, . . . , α_m) > 0 with ∑_{i=1}^m α_i = 1,

K_α = { (x, y) ∈ R^m_+ × R | |y| ≤ x_1^{α_1} · · · x_m^{α_m} }
examples for m = 2
α = (1/2, 1/2),   α = (2/3, 1/3),   α = (3/4, 1/4)

(figure: boundaries of the three power cones, axes x_1, x_2, y)
Outline
• modeling
• duality
Modeling tools
f (x) ≤ t
Examples of second-order cone representable functions
• convex quadratic

  f(x) = x^T P x + q^T x + r     (P ⪰ 0)

• quadratic-over-linear function

  f(x, y) = x^T x / y,   dom f = R^n × R_+ (assume 0/0 = 0)

• powers

  f(x) = |x|^α,     f(x) = x^β if x > 0, +∞ if x ≤ 0
Examples of SD cone representable functions
• matrix-fractional function f(x, Y) = x^T Y^{−1} x

• nuclear norm f(X) = ‖X‖_* = ∑_i σ_i(X)

  ‖X‖_* ≤ t  ⟺  ∃ U, V :   [ U,  X;  X^T,  V ] ⪰ 0,   (1/2)( tr U + tr V ) ≤ t
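A minimal CVX sketch (sdp mode) of this semidefinite representation for an assumed given p-by-q matrix X; in practice CVX's built-in norm_nuc(X) expresses the same function directly:

cvx_begin sdp
    variable t
    variable U(p,p) symmetric
    variable V(q,q) symmetric
    minimize( t )
    [U, X; X', V] >= 0;                    % the block matrix is PSD
    0.5*(trace(U) + trace(V)) <= t;
cvx_end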
Functions representable with exponential and power cone
exponential cone
power cone
Outline
• modeling
• duality
Linear programming duality
duality theorem
Dual cone
definition
K^* = { y | x^T y ≥ 0 for all x ∈ K }
H −1K ∗
Examples
K_exp^* = { (u, v, w) ∈ R_− × R × R_+ | −u log(−u/w) + u − v ≤ 0 }

(with 0 log(0/w) = 0 if w ≥ 0)
K_α^* = { (u, v) ∈ R^m_+ × R | |v| ≤ (u_1/α_1)^{α_1} · · · (u_m/α_m)^{α_m} }
Primal and dual cone LP
minimize    c^T x
subject to  Ax ⪯_K b

maximize    −b^T z
subject to  A^T z + c = 0
            z ⪰_{K^*} 0
Strong duality
p⋆ = d⋆ if the primal or the dual problem is strictly feasible
Optimality conditions
optimality conditions
[ 0 ]   [ 0    A^T ] [ x ]   [ c ]
[ s ] = [ −A    0  ] [ z ] + [ b ]

s ⪰_K 0,   z ⪰_{K^*} 0,   z^T s = 0

duality gap: z^T s = c^T x + b^T z
Convex optimization — MLSS 2011
Barrier methods
• normal barriers
minimize    f_0(x)
subject to  f_i(x) ≤ 0,   i = 1, . . . , m
Logarithmic barrier function for linear inequalities
ψ(x) = φ(b − Ax),     φ(s) = − ∑_{i=1}^m log s_i

gradient and Hessian, with s = b − Ax:

∇φ(s) = − ( 1/s_1, . . . , 1/s_m ),     ∇²φ(s) = diag( 1/s_1^2, . . . , 1/s_m^2 )
Central path for linear program
(figure: the central path x⋆(t) approaching the optimal point x⋆)

x⋆(t) minimizes f_t(x) = t c^T x + φ(b − Ax), i.e.,

∇f_t(x) = t c − A^T ∇φ(s) = 0,     s = b − Ax
Central path and duality
on the central path, with z = −(1/t) ∇φ(s):

c^T x + b^T z = s^T z = m/t
Barrier method
x^+ = x − α ∇²f_t(x)^{−1} ∇f_t(x)
• increase t
Outline
• normal barriers
self-concordance: for g(α) = φ(x + αv),   |g′′′(α)| ≤ 2 g′′(α)^{3/2}
Examples
nonnegative orthant: K = R^m_+

φ(x) = − ∑_{i=1}^m log x_i     (θ = m)

second-order cone: φ(x, y) = − log( y^2 − x^T x )     (θ = 2)
exponential cone: K_exp = cl{ (x, y, z) ∈ R^3 | y e^{x/y} ≤ z, y > 0 }

power cone (m = 2): φ(x, y) = − log( x_1^{2α_1} x_2^{2α_2} − y^2 ) − log x_1 − log x_2     (θ = 4)
Central path
minimize    c^T x
subject to  Ax ⪯_K b

central path: x⋆(t) minimizes t c^T x + φ(b − Ax)
Newton step
centering problem
Newton step at x
Δx = −∇²f_t(x)^{−1} ∇f_t(x)

Newton decrement

λ_t(x) = ( Δx^T ∇²f_t(x) Δx )^{1/2} = ( −∇f_t(x)^T Δx )^{1/2}
Damped Newton method
algorithm
select ε ∈ (0, 1/2), η ∈ (0, 1/4], and a starting point x ∈ dom f_t
repeat:
1. compute Newton step Δx and Newton decrement λ_t(x)
2. if λ_t(x)^2 ≤ ε, return x
3. otherwise, set x := x + αΔx with
   α = 1/(1 + λ_t(x))  if λ_t(x) ≥ η,     α = 1  if λ_t(x) < η
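A minimal MATLAB sketch of this damped Newton method applied to the LP centering problem minimize t*c'*x − sum(log(b − A*x)); the data A, b, c, the barrier parameter t, a strictly feasible starting x, and the tolerances eps_, eta are assumed given:

function x = lp_centering_newton(A, b, c, t, x, eps_, eta)
% damped Newton method for  minimize  t*c'*x - sum(log(b - A*x))
for iter = 1:100
    s = b - A*x;                        % slacks, stay positive
    grad = t*c + A'*(1./s);             % gradient of f_t
    H = A' * diag(1./s.^2) * A;         % Hessian of f_t
    dx = -H \ grad;                     % Newton step
    lambda = sqrt(-grad'*dx);           % Newton decrement
    if lambda^2 <= eps_, break; end
    if lambda >= eta
        alpha = 1/(1 + lambda);         % damped step
    else
        alpha = 1;                      % full step in the quadratic convergence region
    end
    x = x + alpha*dx;
end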
Convergence results for damped Newton method
2 λ_t(x^+) ≤ ( 2 λ_t(x) )^2     if λ_t(x) < η
Outline
• normal barriers
duality point on central path: x⋆(t) defines a strictly feasible dual point z⋆(t)

z⋆(t) = −(1/t) ∇φ(s),     s = b − Ax⋆(t)

c^T x + b^T z = s^T z = θ/t,     c^T x − p⋆ ≤ θ/t
Short-step barrier method
algorithm: parameters ε ∈ (0, 1), β = 1/8

• select initial x and t with λ_t(x) ≤ β
• repeat until 2θ/t ≤ ε:

    t := ( 1 + 1/(1 + 8√θ) ) t,     x := x − ∇²f_t(x)^{−1} ∇f_t(x)

properties

• increase t slowly so x stays in the region of quadratic convergence (λ_t(x) ≤ β)
• iteration complexity

    #iterations = O( √θ log( θ/(ε t_0) ) )
Predictor-corrector methods
predictor-corrector method
Convex optimization — MLSS 2011
Primal-dual methods
• symmetric cones
• implementation
Primal-dual interior-point methods
differences
Primal-dual central path for linear programming
optimality conditions
Ax + s = b,   A^T z + c = 0,   (s, z) ≥ 0,   s ∘ z = 0

central path: replace the complementarity condition s ∘ z = 0 with

Ax + s = b,   A^T z + c = 0,   (s, z) ≥ 0,   s ∘ z = (1/t) 1
Primal-dual search direction
steps solve central path equations linearized around current iterates x, s, z
z ∘ Δs + s ∘ Δz = σμ1 − s ∘ z
Outline
• symmetric cones
• implementation
Symmetric cones
symmetric primal-dual solvers for cone LPs are limited to symmetric cones
• second-order cone
• positive semidefinite cone
• direct products of these ‘primitive’ symmetric cones (such as Rp+)
Vector product and identity element
nonnegative orthant: componentwise product

x ∘ y = diag(x) y,     identity element is e = 1

positive semidefinite cone: symmetrized matrix product

x ∘ y = (1/2) vec( XY + YX )   with X = mat(x), Y = mat(y)

identity element is e = vec(I)
second-order cone: the product of x = (x0, x1) and y = (y0, y1) is
x ∘ y = (1/√2) ( x^T y,  x_0 y_1 + y_0 x_1 )

identity element is e = ( √2, 0, . . . , 0 )
Classification
practical implication
can focus on Qp, S p and study these cones using elementary linear algebra
Spectral decomposition
x = ∑_{i=1}^θ λ_i q_i,     with ∑_{i=1}^θ q_i = e   and   q_i ∘ q_j = q_i if i = j,  0 if i ≠ j
positive semidefinite cone (θ = p):   mat(x) = ∑_{i=1}^p λ_i v_i v_i^T,   q_i = vec( v_i v_i^T )
second-order cone (θ = 2):   λ_i = ( x_0 ± ‖x_1‖_2 ) / √2,     q_i = (1/√2) ( 1, ±x_1/‖x_1‖_2 ),   i = 1, 2
nonnegativity: x ⪰_K 0 if and only if all eigenvalues λ_i ≥ 0

log-det barrier

φ(x) = − log det x = − ∑_{i=1}^θ log λ_i

• a θ-normal barrier
• gradient is ∇φ(x) = −x^{−1}
• symmetric cones
• implementation
Symmetric parametrization of central path
centering problem
Ax + s = b,   A^T z + c = 0,   (s, z) ≻ 0,   z = (1/t) s^{−1}

equivalently,

Ax + s = b,   A^T z + c = 0,   (s, z) ≻ 0,   s ∘ z = (1/t) e

scaled parametrization:   s ∘ z = μe  ⟺  (Hs) ∘ (H^{−1} z) = μe

example (K = S^p):

A^T ∇²φ(w) A Δx = r,     w = u^2
• symmetric cones
• implementation
Software implementations
Customized implementations
[ A^T A   0 ]   [ D_1 + D_2    D_2 − D_1 ]
[   0     0 ] + [ D_2 − D_1    D_1 + D_2 ]        (D_1, D_2 positive diagonal)

( A D^{−1} A^T + I ) Δu = r
m n custom general-purpose
50 200 0.02 0.32
50 400 0.03 0.59
100 1000 0.12 1.69
100 2000 0.24 3.43
500 1000 1.19 7.54
500 2000 2.38 17.6
Gradient methods
Classical gradient method
advantages
• every iteration is inexpensive
• does not require second derivatives
disadvantages
• often very slow; very sensitive to scaling
• does not handle nondifferentiable functions
f(x) = (1/2) ( x_1^2 + γ x_2^2 )     (γ > 1)

with exact line search and starting point x^(0) = (γ, 1)

‖x^(k) − x⋆‖_2 / ‖x^(0) − x⋆‖_2 = ( (γ − 1)/(γ + 1) )^k
(figure: iterates of the gradient method in the (x_1, x_2)-plane)
Nondifferentiable example
f(x) = √( x_1^2 + γ x_2^2 )  if |x_2| ≤ x_1,     f(x) = ( x_1 + γ |x_2| ) / √(1 + γ)  if |x_2| > x_1

with exact line search and x^(0) = (γ, 1); converges to a non-optimal point

(figure: iterates of the gradient method in the (x_1, x_2)-plane)
• smoothing methods
• subgradient method
• proximal gradient method
(figure: subgradients of a nondifferentiable convex function at points x_1 and x_2)

example, f(x) = ‖x‖_2:   ∂f(x) = { x/‖x‖_2 }  if x ≠ 0,     ∂f(x) = { g | ‖g‖_2 ≤ 1 }  if x = 0
weak calculus
subdifferentiability
• f = α_1 f_1 + α_2 f_2 (with α_1, α_2 ≥ 0):   g = α_1 g_1 + α_2 g_2,   g_1 ∈ ∂f_1(x), g_2 ∈ ∂f_2(x)
• f(x) = h(Ax + b):   g = A^T g̃,   g̃ ∈ ∂h(Ax + b)
results
(figure: relative accuracy (f_best^(k) − f⋆)/f⋆ versus iteration k; left panel: fixed step lengths s = 0.1, 0.01, 0.001; right panel: diminishing step sizes t_k = 0.01/√k and t_k = 0.01/k)
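A minimal MATLAB sketch of the subgradient method with a diminishing step size, for the assumed example problem f(x) = ‖Ax − b‖_1 (a subgradient at x is A'*sign(A*x − b); A and b are assumed given):

x = zeros(size(A,2), 1);
f_best = inf;
for k = 1:5000
    g = A' * sign(A*x - b);        % a subgradient of f at x
    x = x - (0.01/sqrt(k)) * g;    % diminishing step size t_k = 0.01/sqrt(k)
    f_best = min(f_best, norm(A*x - b, 1));
end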
• h(x) = 0: proxh(x) = x
• h(x) = IC (x) (indicator function of C): proxh is projection on C
proximal gradient update:   x^+ = prox_{th}( x − t∇g(x) )

• h = 0: gradient step x^+ = x − t∇g(x)
• h = I_C: projected gradient step x^+ = P_C( x − t∇g(x) )

(figure: projecting the gradient step x − t∇g(x) onto C gives x^+)
for h(x) = ‖x‖_1, prox_{th} is soft-thresholding:

prox_{th}(u)_i =  u_i − t    if u_i ≥ t
                  0          if −t ≤ u_i ≤ t
                  u_i + t    if u_i ≤ −t

(figure: the soft-thresholding function prox_{th}(u)_i versus u_i; it is zero on [−t, t])
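A minimal MATLAB sketch of the proximal gradient method for the assumed problem minimize (1/2)‖Ax − b‖_2^2 + lam‖x‖_1, using the soft-thresholding prox above (A, b and lam assumed given):

t = 1 / norm(A)^2;                          % step size 1/L with L = ||A||_2^2
x = zeros(size(A,2), 1);
for k = 1:500
    u = x - t * (A'*(A*x - b));             % gradient step on the smooth term
    x = sign(u) .* max(abs(u) - t*lam, 0);  % prox of t*lam*||.||_1 (soft thresholding)
end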
• proxh is nonexpansive
• Moreau decomposition: x = prox_h(x) + prox_{h^*}(x)
example: h = I_L, the indicator of a subspace L

h^*(v) = sup_{u ∈ L} v^T u  =  0 if v ∈ L^⊥,  +∞ otherwise  =  I_{L^⊥}(v)
h(x) = − ∑_{i=1}^n log x_i,     prox_{th}(x)_i = ( x_i + √( x_i^2 + 4t ) ) / 2,   i = 1, . . . , n
distance to a convex set C, d(x) = inf_{y ∈ C} ‖x − y‖_2:

prox_{td}(x) = θ P_C(x) + (1 − θ) x,     θ = t / max{ d(x), t }

squared distance, h(x) = d(x)^2 / 2:

prox_{th}(x) = ( 1/(1 + t) ) x + ( t/(1 + t) ) P_C(x)
example: norms — for h(x) = ‖x‖, h^* = I_C with C the unit ball of the dual norm ‖ · ‖_*

• prox_{h^*} is projection on C
• the decomposition gives the prox-operator of a norm whenever projection on C is inexpensive
assumptions
f(x^(k)) − f⋆ ≤ ( L/(2k) ) ‖x^(0) − x⋆‖_2^2

• compare with the 1/√k rate of the subgradient method
• can be extended to account for line searches
sequence x(k) remains feasible (in dom h); sequence y (k) not necessarily
assumptions
f(x^(k)) − f⋆ ≤ ( 2L/(k + 1)^2 ) ‖x^(0) − x⋆‖_2^2
randomly generated data with m = 2000, n = 1000, same fixed step size
(figure: relative error ( f(x^(k)) − f⋆ ) / |f⋆| versus iteration k for the gradient method and FISTA)
Dual methods
• Lagrange duality
• dual decomposition
• multiplier methods
Dual function
convex problem (with linear constraints for simplicity)
minimize f (x)
subject to Gx ≤ h
Ax = b
optimal value p⋆
Lagrangian
L(x, λ, ν) = f(x) + λ^T (Gx − h) + ν^T (Ax − b)

dual function
g(λ, ν) = inf_x L(x, λ, ν) = −f^*( −G^T λ − A^T ν ) − h^T λ − b^T ν
maximize g(λ, ν)
subject to λ ≥ 0
optimal value d⋆
dual problem
minimize    b^T z
subject to  ‖A^T z‖_* ≤ 1
minimize    ‖Ax − b‖

reformulated problem

minimize    ‖y‖
subject to  y = Ax − b
dual function
g(ν) = inf_{x,y} ( ‖y‖ + ν^T y − ν^T Ax + b^T ν )
     = b^T ν    if A^T ν = 0 and ‖ν‖_* ≤ 1
     = −∞       otherwise
dual problem
minimize    b^T z
subject to  A^T z = 0,   ‖z‖_* ≤ 1
1. primal feasibility:  x ∈ dom f,  Gx ≤ h,  Ax = b
2. dual feasibility:  λ ≥ 0
3. complementary slackness:  λ^T (h − Gx) = 0
4. stationarity:  ∇f(x) + G^T λ + A^T ν = 0
• Lagrange dual
• dual decomposition
• multiplier methods
Dual methods
primal problem
minimize f (x)
subject to Gx ≤ h
Ax = b
dual problem: maximize g(λ, ν) over λ ≥ 0, with g expressed in terms of the conjugate

f^*(y) = sup_x ( y^T x − f(x) )
f(x) − (μ/2) x^T x is convex     (f is strongly convex with parameter μ)
minimize f (x)
subject to Ax = b
algorithm
x^+ = argmin_x ( f(x) + ν^T Ax )

ν^+ = ν + t ( Ax^+ − b )
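A minimal MATLAB sketch of this dual ascent iteration for an assumed strongly convex example f(x) = (1/2)‖x − a‖_2^2, for which the x-update has the closed form x = a − A'ν (A, b, a, and the step size t assumed given):

nu = zeros(size(A,1), 1);
for k = 1:200
    x = a - A'*nu;               % x+ = argmin_x f(x) + nu'*A*x
    nu = nu + t*(A*x - b);       % dual (gradient ascent) update
end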
f_j^*( −G_j^T λ ) = − inf_{x_j} ( f_j(x_j) + λ^T G_j x_j )

x_j^+ = argmin_{x_j} ( f_j(x_j) + λ^T G_j x_j ),   j = 1, 2

λ^+ = ( λ + t ( G_1 x_1^+ + G_2 x_2^+ − h ) )_+
• Lagrange dual
• dual decomposition
• multiplier methods
First-order dual methods
proximal gradient method (this section): dual costs split in two terms
dual problem
has the composite structure required for the proximal gradient method if f is strongly convex, so that f^* is differentiable with

∇f^*( A^T z ) = argmin_x ( f(x) − z^T Ax )
x̂ := argmin_x ( f(x) − z^T Ax )

z := P_C( z + t ( b − Ax̂ ) )
• PC is projection on C = {y | kyk∗ ≤ 1}
• step size t is constant or from backtracking line search
• can use accelerated gradient projection algorithm (FISTA) for z-update
• first step decouples if f is separable
• Lagrange dual
• dual decomposition
• multiplier methods
Moreau-Yosida regularization of the dual
a general technique for smoothing the dual of
minimize f (x)
subject to Ax = b
x^+ = argmin_x L_t(x, ν)
ν^+ = ν + t ( Ax^+ − b )

where L_t is the augmented Lagrangian

L_t(x, ν) = f(x) + ν^T (Ax − b) + (t/2) ‖Ax − b‖_2^2
augmented Lagrangian
L_t(x, y, ν) = f(x) + h(y) + ν^T (Ax + By − b) + (t/2) ‖Ax + By − b‖_2^2

alternating minimization (ADMM):

1. x-update: x^(k) := argmin_x L_t(x, y^(k−1), ν^(k−1))
2. y-update: y^(k) := argmin_y L_t(x^(k), y, ν^(k−1))
3. dual update: ν^(k) := ν^(k−1) + t ( Ax^(k) + By^(k) − b )
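A minimal MATLAB sketch of these three steps for an assumed splitting of the lasso problem, minimize (1/2)‖Ax − b‖_2^2 + lam‖y‖_1 subject to x − y = 0 (A, b, lam and the penalty t assumed given):

n = size(A, 2);
x = zeros(n,1); y = zeros(n,1); nu = zeros(n,1);
R = chol(A'*A + t*eye(n));                  % factor once, reuse every iteration
for k = 1:200
    x = R \ (R' \ (A'*b + t*y - nu));       % 1. x-update: minimize L_t over x
    u = x + nu/t;
    y = sign(u) .* max(abs(u) - lam/t, 0);  % 2. y-update: soft thresholding
    nu = nu + t*(x - y);                    % 3. dual update
end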
reformulation
augmented Lagrangian
L_t(X, Y, Z) = tr(CX) − log det X + ‖Y‖_1 + tr( Z(X − Y) ) + (t/2) ‖X − Y‖_F^2

• minimization over X:

X̂ = argmin_X ( − log det X + (t/2) ‖ X − Y + (1/t)(C + Z) ‖_F^2 )
see the websites for expanded notes, references to literature and software