
Convex Optimization — Boyd & Vandenberghe

1. Introduction

• mathematical optimization

• least-squares and linear programming

• convex optimization

• example

• course goals and topics

• nonlinear optimization

• brief history of convex optimization

1–1
Mathematical optimization

(mathematical) optimization problem

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m

• x = (x1, . . . , xn): optimization variables

• f0 : Rn → R: objective function

• fi : Rn → R, i = 1, . . . , m: constraint functions

optimal solution x⋆ has smallest value of f0 among all vectors that
satisfy the constraints

Introduction 1–2
Examples
portfolio optimization
• variables: amounts invested in different assets
• constraints: budget, max./min. investment per asset, minimum return
• objective: overall risk or return variance

device sizing in electronic circuits


• variables: device widths and lengths
• constraints: manufacturing limits, timing requirements, maximum area
• objective: power consumption

data fitting
• variables: model parameters
• constraints: prior information, parameter limits
• objective: measure of misfit or prediction error

Introduction 1–3
Solving optimization problems

general optimization problem

• very difficult to solve


• methods involve some compromise, e.g., very long computation time, or
not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably

• least-squares problems
• linear programming problems
• convex optimization problems

Introduction 1–4
Least-squares

minimize ‖Ax − b‖₂²

solving least-squares problems

• analytical solution: x⋆ = (AᵀA)⁻¹Aᵀb


• reliable and efficient algorithms and software
• computation time proportional to n²k (A ∈ Rk×n); less if structured
• a mature technology

using least-squares

• least-squares problems are easy to recognize


• a few standard techniques increase flexibility (e.g., including weights,
adding regularization terms)
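As a quick numerical check of the analytical formula (a NumPy sketch with arbitrary random data, not from the slides):

```python
import numpy as np

# A small random least-squares instance (sizes are arbitrary for illustration).
rng = np.random.default_rng(0)
k, n = 20, 5
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Analytical solution x* = (A^T A)^{-1} A^T b ...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ... agrees with the numerically preferred library routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_lstsq)
```

(In practice one solves via QR or `lstsq` rather than forming AᵀA explicitly.)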

Introduction 1–5
Linear programming

minimize cᵀx
subject to aiᵀx ≤ bi, i = 1, . . . , m

solving linear programs


• no analytical formula for solution
• reliable and efficient algorithms and software
• computation time proportional to n²m if m ≥ n; less with structure
• a mature technology

using linear programming


• not as easy to recognize as least-squares problems
• a few standard tricks used to convert problems into linear programs
(e.g., problems involving ℓ₁- or ℓ∞-norms, piecewise-linear functions)
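One such trick, minimizing an ℓ∞-norm residual, can be sketched with SciPy's `linprog` (the random data and the variable split z = (x, t) are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

# Standard trick: minimize ||Ax - b||_inf by introducing t with
# -t <= a_i^T x - b_i <= t, then minimizing t over (x, t).
rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.r_[np.zeros(n), 1.0]                # objective: minimize t
A_ub = np.block([[A, -np.ones((m, 1))],    #  a_i^T x - t <= b_i
                 [-A, -np.ones((m, 1))]])  # -a_i^T x - t <= -b_i
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)])

x, t = res.x[:n], res.x[n]
assert res.success
assert np.isclose(t, np.max(np.abs(A @ x - b)), atol=1e-5)
```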

Introduction 1–6
Convex optimization problem

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m

• objective and constraint functions are convex:

fi(αx + βy) ≤ αfi(x) + βfi(y)

if α + β = 1, α ≥ 0, β ≥ 0

• includes least-squares problems and linear programs as special cases

Introduction 1–7
solving convex optimization problems

• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n³, n²m, F}, where F
is cost of evaluating fi’s and their first and second derivatives
• almost a technology

using convex optimization

• often difficult to recognize


• many tricks for transforming problems into convex form
• surprisingly many problems can be solved via convex optimization

Introduction 1–8
Example

m lamps illuminating n (small, flat) patches


[figure: m lamps, each with power pj, at distance rkj and angle θkj from patch k, producing illumination Ik]

intensity Ik at patch k depends linearly on lamp powers pj :


Ik = Σ_{j=1}^m akj pj,    akj = rkj⁻² max{cos θkj, 0}

problem: achieve desired illumination Ides with bounded lamp powers

minimize max_{k=1,...,n} |log Ik − log Ides|


subject to 0 ≤ pj ≤ pmax, j = 1, . . . , m

Introduction 1–9
how to solve?
1. use uniform power: pj = p, vary p
2. use least-squares:

minimize Σ_{k=1}^n (Ik − Ides)²

round pj if pj > pmax or pj < 0


3. use weighted least-squares:

minimize Σ_{k=1}^n (Ik − Ides)² + Σ_{j=1}^m wj (pj − pmax/2)²

iteratively adjust weights wj until 0 ≤ pj ≤ pmax


4. use linear programming:
minimize max_{k=1,...,n} |Ik − Ides|
subject to 0 ≤ pj ≤ pmax, j = 1, . . . , m

which can be solved via linear programming

of course these are approximate (suboptimal) ‘solutions’

Introduction 1–10
5. use convex optimization: problem is equivalent to

minimize f0(p) = max_{k=1,...,n} h(Ik /Ides)


subject to 0 ≤ pj ≤ pmax, j = 1, . . . , m

with h(u) = max{u, 1/u}


[figure: plot of h(u) = max{u, 1/u} for 0 ≤ u ≤ 4]

f0 is convex because maximum of convex functions is convex

exact solution obtained with effort ≈ modest factor × least-squares effort
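A quick numerical spot-check (not a proof) that h satisfies the midpoint convexity inequality on a grid of positive points, using NumPy:

```python
import numpy as np

# h(u) = max(u, 1/u) from the slide; check h((u+v)/2) <= (h(u)+h(v))/2
# on all pairs drawn from a positive grid.
h = lambda u: np.maximum(u, 1.0 / u)
u = np.linspace(0.1, 4.0, 50)
U, V = np.meshgrid(u, u)
assert np.all(h((U + V) / 2) <= (h(U) + h(V)) / 2 + 1e-12)
```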

Introduction 1–11
additional constraints: does adding 1 or 2 below complicate the problem?

1. no more than half of total power is in any 10 lamps


2. no more than half of the lamps are on (pj > 0)

• answer: with (1), still easy to solve; with (2), extremely difficult
• moral: (untrained) intuition doesn’t always work; without the proper
background very easy problems can appear quite similar to very difficult
problems

Introduction 1–12
Course goals and topics

goals

1. recognize/formulate problems (such as the illumination problem) as


convex optimization problems
2. develop code for problems of moderate size (1000 lamps, 5000 patches)
3. characterize optimal solution (optimal power distribution), give limits of
performance, etc.

topics

1. convex sets, functions, optimization problems


2. examples and applications
3. algorithms

Introduction 1–13
Nonlinear optimization

traditional techniques for general nonconvex problems involve compromises

local optimization methods (nonlinear programming)


• find a point that minimizes f0 among feasible points near it
• fast, can handle large problems
• require initial guess
• provide no information about distance to (global) optimum

global optimization methods


• find the (global) solution
• worst-case complexity grows exponentially with problem size

these algorithms are often based on solving convex subproblems

Introduction 1–14
Brief history of convex optimization
theory (convex analysis): ca. 1900–1970
algorithms
• 1947: simplex algorithm for linear programming (Dantzig)
• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )
• 1970s: ellipsoid method and other subgradient methods
• 1980s: polynomial-time interior-point methods for linear programming
(Karmarkar 1984)
• late 1980s–now: polynomial-time interior-point methods for nonlinear
convex optimization (Nesterov & Nemirovski 1994)

applications
• before 1990: mostly in operations research; few in engineering
• since 1990: many new applications in engineering (control, signal
processing, communications, circuit design, . . . ); new problem classes
(semidefinite and second-order cone programming, robust optimization)

Introduction 1–15
Convex Optimization — Boyd & Vandenberghe

2. Convex sets

• affine and convex sets

• some important examples

• operations that preserve convexity

• generalized inequalities

• separating and supporting hyperplanes

• dual cones and generalized inequalities

2–1
Affine set

line through x1, x2: all points

x = θx1 + (1 − θ)x2    (θ ∈ R)

[figure: points on the line through x1 and x2 at θ = 1.2, 1, 0.6, 0, −0.2]

affine set: contains the line through any two distinct points in the set

example: solution set of linear equations {x | Ax = b}

(conversely, every affine set can be expressed as solution set of system of


linear equations)

Convex sets 2–2


Convex set

line segment between x1 and x2: all points

x = θx1 + (1 − θ)x2

with 0 ≤ θ ≤ 1

convex set: contains line segment between any two points in the set

x1, x2 ∈ C, 0≤θ≤1 =⇒ θx1 + (1 − θ)x2 ∈ C

examples (one convex, two nonconvex sets)

Convex sets 2–3


Convex combination and convex hull

convex combination of x1,. . . , xk : any point x of the form

x = θ1x1 + θ2x2 + · · · + θk xk

with θ1 + · · · + θk = 1, θi ≥ 0

convex hull conv S: set of all convex combinations of points in S

Convex sets 2–4


Convex cone

conic (nonnegative) combination of x1 and x2: any point of the form

x = θ1x1 + θ2x2

with θ1 ≥ 0, θ2 ≥ 0

[figure: conic combinations of x1 and x2 sweep out a pie-slice cone with apex 0]

convex cone: set that contains all conic combinations of points in the set

Convex sets 2–5


Hyperplanes and halfspaces
hyperplane: set of the form {x | aᵀx = b} (a ≠ 0)

[figure: hyperplane aᵀx = b with normal vector a, through point x0]

halfspace: set of the form {x | aᵀx ≤ b} (a ≠ 0)

[figure: halfspace aᵀx ≤ b; the normal a points toward the side aᵀx ≥ b]

• a is the normal vector


• hyperplanes are affine and convex; halfspaces are convex

Convex sets 2–6


Euclidean balls and ellipsoids

(Euclidean) ball with center xc and radius r:

B(xc, r) = {x | ‖x − xc‖₂ ≤ r} = {xc + ru | ‖u‖₂ ≤ 1}

ellipsoid: set of the form

{x | (x − xc)ᵀP⁻¹(x − xc) ≤ 1}

with P ∈ Sn++ (i.e., P symmetric positive definite)

[figure: ellipsoid with center xc]

other representation: {xc + Au | ‖u‖₂ ≤ 1} with A square and nonsingular

Convex sets 2–7


Norm balls and norm cones
norm: a function ‖ · ‖ that satisfies

• ‖x‖ ≥ 0; ‖x‖ = 0 if and only if x = 0

• ‖tx‖ = |t| ‖x‖ for t ∈ R
• ‖x + y‖ ≤ ‖x‖ + ‖y‖

notation: ‖ · ‖ is general (unspecified) norm; ‖ · ‖symb is particular norm


norm ball with center xc and radius r: {x | ‖x − xc‖ ≤ r}

norm cone: {(x, t) | ‖x‖ ≤ t}

Euclidean norm cone is called second-order cone

[figure: second-order cone {(x, t) ∈ R³ | ‖x‖₂ ≤ t}]

norm balls and cones are convex

Convex sets 2–8


Polyhedra

solution set of finitely many linear inequalities and equalities

Ax ⪯ b,    Cx = d

(A ∈ Rm×n, C ∈ Rp×n, ⪯ is componentwise inequality)

[figure: polyhedron P bounded by hyperplanes with outward normals a1, . . . , a5]

polyhedron is intersection of finite number of halfspaces and hyperplanes

Convex sets 2–9


Positive semidefinite cone
notation:
• Sn is set of symmetric n × n matrices
• Sn+ = {X ∈ Sn | X ⪰ 0}: positive semidefinite n × n matrices

X ∈ Sn+ ⇐⇒ zᵀXz ≥ 0 for all z

Sn+ is a convex cone


• Sn++ = {X ∈ Sn | X ≻ 0}: positive definite n × n matrices

example: [x y; y z] ∈ S2+

[figure: boundary of the positive semidefinite cone in S2, plotted in (x, y, z)-space]

Convex sets 2–10


Operations that preserve convexity

practical methods for establishing convexity of a set C

1. apply definition

x1, x2 ∈ C, 0≤θ≤1 =⇒ θx1 + (1 − θ)x2 ∈ C

2. show that C is obtained from simple convex sets (hyperplanes,


halfspaces, norm balls, . . . ) by operations that preserve convexity
• intersection
• affine functions
• perspective function
• linear-fractional functions

Convex sets 2–11


Intersection

the intersection of (any number of) convex sets is convex

example:
S = {x ∈ Rm | |p(t)| ≤ 1 for |t| ≤ π/3}

where p(t) = x1 cos t + x2 cos 2t + · · · + xm cos mt

for m = 2:

[figure: left, the bounds |p(t)| ≤ 1 for 0 ≤ t ≤ π/3; right, the resulting set S in the (x1, x2)-plane]

Convex sets 2–12


Affine function
suppose f : Rn → Rm is affine (f (x) = Ax + b with A ∈ Rm×n, b ∈ Rm)

• the image of a convex set under f is convex

S ⊆ Rn convex =⇒ f (S) = {f (x) | x ∈ S} convex

• the inverse image f −1(C) of a convex set under f is convex

C ⊆ Rm convex =⇒ f −1(C) = {x ∈ Rn | f (x) ∈ C} convex

examples
• scaling, translation, projection
• solution set of linear matrix inequality {x | x1A1 + · · · + xmAm ⪯ B}
(with Ai, B ∈ Sp)
• hyperbolic cone {x | xᵀP x ≤ (cᵀx)², cᵀx ≥ 0} (with P ∈ Sn+)

Convex sets 2–13


Perspective and linear-fractional function

perspective function P : Rn+1 → Rn:

P (x, t) = x/t, dom P = {(x, t) | t > 0}

images and inverse images of convex sets under perspective are convex

linear-fractional function f : Rn → Rm:

f (x) = (Ax + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

images and inverse images of convex sets under linear-fractional functions


are convex

Convex sets 2–14


example of a linear-fractional function

f (x) = (1/(x1 + x2 + 1)) x

[figure: a set C in the (x1, x2)-plane (left) and its image f (C) (right)]

Convex sets 2–15


Generalized inequalities

a convex cone K ⊆ Rn is a proper cone if

• K is closed (contains its boundary)


• K is solid (has nonempty interior)
• K is pointed (contains no line)

examples
• nonnegative orthant K = Rn+ = {x ∈ Rn | xi ≥ 0, i = 1, . . . , n}
• positive semidefinite cone K = Sn+
• nonnegative polynomials on [0, 1]:

K = {x ∈ Rn | x1 + x2t + x3t² + · · · + xntⁿ⁻¹ ≥ 0 for t ∈ [0, 1]}

Convex sets 2–16


generalized inequality defined by a proper cone K:

x ⪯K y ⇐⇒ y − x ∈ K,    x ≺K y ⇐⇒ y − x ∈ int K

examples
• componentwise inequality (K = Rn+)

x ⪯Rn+ y ⇐⇒ xi ≤ yi, i = 1, . . . , n

• matrix inequality (K = Sn+)

X ⪯Sn+ Y ⇐⇒ Y − X positive semidefinite

these two types are so common that we drop the subscript in ⪯K


properties: many properties of ⪯K are similar to ≤ on R, e.g.,

x ⪯K y, u ⪯K v =⇒ x + u ⪯K y + v

Convex sets 2–17


Minimum and minimal elements

⪯K is not in general a linear ordering: we can have x ⋠K y and y ⋠K x

x ∈ S is the minimum element of S with respect to ⪯K if

y ∈ S =⇒ x ⪯K y

x ∈ S is a minimal element of S with respect to ⪯K if

y ∈ S, y ⪯K x =⇒ y = x

example (K = R2+)

[figure: sets S1 and S2 in R2]

x1 is the minimum element of S1
x2 is a minimal element of S2

Convex sets 2–18


Separating hyperplane theorem

if C and D are disjoint convex sets, then there exists a ≠ 0, b such that

aᵀx ≤ b for x ∈ C,    aᵀx ≥ b for x ∈ D

[figure: disjoint convex sets C and D separated by the hyperplane aᵀx = b]

the hyperplane {x | aᵀx = b} separates C and D

strict separation requires additional assumptions (e.g., C is closed, D is a


singleton)

Convex sets 2–19


Supporting hyperplane theorem

supporting hyperplane to set C at boundary point x0:

{x | aᵀx = aᵀx0}

where a ≠ 0 and aᵀx ≤ aᵀx0 for all x ∈ C

[figure: supporting hyperplane with normal a at boundary point x0 of C]

supporting hyperplane theorem: if C is convex, then there exists a


supporting hyperplane at every boundary point of C

Convex sets 2–20


Dual cones and generalized inequalities
dual cone of a cone K:

K∗ = {y | yᵀx ≥ 0 for all x ∈ K}

examples

• K = Rn+: K∗ = Rn+
• K = Sn+: K∗ = Sn+
• K = {(x, t) | ‖x‖₂ ≤ t}: K∗ = {(x, t) | ‖x‖₂ ≤ t}
• K = {(x, t) | ‖x‖₁ ≤ t}: K∗ = {(x, t) | ‖x‖∞ ≤ t}

first three examples are self-dual cones


dual cones of proper cones are proper, hence define generalized inequalities:

y ⪰K∗ 0 ⇐⇒ yᵀx ≥ 0 for all x ⪰K 0

Convex sets 2–21


Minimum and minimal elements via dual inequalities

minimum element w.r.t. ⪯K

x is minimum element of S iff for all λ ≻K∗ 0, x is the unique minimizer
of λᵀz over S

[figure: minimum element x of S; every hyperplane with normal λ ≻K∗ 0 supports S at x]

minimal element w.r.t. ⪯K

• if x minimizes λᵀz over S for some λ ≻K∗ 0, then x is minimal

[figure: minimal elements x1, x2 of S with supporting normals λ1, λ2]

• if x is a minimal element of a convex set S, then there exists a nonzero
λ ⪰K∗ 0 such that x minimizes λᵀz over S

Convex sets 2–22


optimal production frontier

• different production methods use different amounts of resources x ∈ Rn


• production set P : resource vectors x for all possible production methods
• efficient (Pareto optimal) methods correspond to resource vectors x
that are minimal w.r.t. Rn+

example (n = 2): x1, x2, x3 are efficient; x4, x5 are not

[figure: production set P in the (labor, fuel)-plane; efficient points lie on the lower-left boundary]

Convex sets 2–23


Convex Optimization — Boyd & Vandenberghe

3. Convex functions

• basic properties and examples

• operations that preserve convexity

• the conjugate function

• quasiconvex functions

• log-concave and log-convex functions

• convexity with respect to generalized inequalities

3–1
Definition
f : Rn → R is convex if dom f is a convex set and

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y)

for all x, y ∈ dom f , 0 ≤ θ ≤ 1

[figure: the chord from (x, f (x)) to (y, f (y)) lies above the graph of f ]

• f is concave if −f is convex
• f is strictly convex if dom f is convex and

f (θx + (1 − θ)y) < θf (x) + (1 − θ)f (y)

for x, y ∈ dom f , x 6= y, 0 < θ < 1

Convex functions 3–2


Examples on R

convex:
• affine: ax + b on R, for any a, b ∈ R
• exponential: e^{ax}, for any a ∈ R
• powers: xα on R++, for α ≥ 1 or α ≤ 0
• powers of absolute value: |x|p on R, for p ≥ 1
• negative entropy: x log x on R++

concave:
• affine: ax + b on R, for any a, b ∈ R
• powers: xα on R++, for 0 ≤ α ≤ 1
• logarithm: log x on R++

Convex functions 3–3


Examples on Rn and Rm×n
affine functions are convex and concave; all norms are convex
examples on Rn
• affine function f (x) = aᵀx + b
• norms: ‖x‖p = (Σ_{i=1}^n |xi|^p)^{1/p} for p ≥ 1; ‖x‖∞ = maxk |xk|

examples on Rm×n (m × n matrices)


• affine function

f (X) = tr(AᵀX) + b = Σ_{i=1}^m Σ_{j=1}^n Aij Xij + b

• spectral (maximum singular value) norm

f (X) = ‖X‖₂ = σmax(X) = (λmax(XᵀX))^{1/2}

Convex functions 3–4


Restriction of a convex function to a line

f : Rn → R is convex if and only if the function g : R → R,

g(t) = f (x + tv), dom g = {t | x + tv ∈ dom f }

is convex (in t) for any x ∈ dom f , v ∈ Rn

can check convexity of f by checking convexity of functions of one variable

example. f : Sn → R with f (X) = log det X, dom f = Sn++

g(t) = log det(X + tV ) = log det X + log det(I + tX^{−1/2}V X^{−1/2})
     = log det X + Σ_{i=1}^n log(1 + tλi)

where λi are the eigenvalues of X^{−1/2}V X^{−1/2}

g is concave in t (for any choice of X ≻ 0, V ); hence f is concave
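The eigenvalue identity above can be spot-checked numerically (a NumPy sketch with random X ≻ 0 and a small symmetric V; not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
X = M @ M.T + np.eye(n)              # positive definite
V = rng.standard_normal((n, n))
V = 0.1 * (V + V.T)                  # small symmetric perturbation

# eigenvalues of X^{-1/2} V X^{-1/2}
w, Q = np.linalg.eigh(X)
Xmh = Q @ np.diag(w ** -0.5) @ Q.T
lam = np.linalg.eigvalsh(Xmh @ V @ Xmh)

g = lambda s: np.linalg.slogdet(X + s * V)[1]
# identity: g(t) = log det X + sum_i log(1 + t*lam_i)
assert np.isclose(g(0.5), g(0.0) + np.sum(np.log(1 + 0.5 * lam)))
# concavity of g at a midpoint
assert g(0.25) >= (g(0.0) + g(0.5)) / 2 - 1e-9
```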

Convex functions 3–5


Extended-value extension

extended-value extension f̃ of f is

f̃(x) = f (x), x ∈ dom f,    f̃(x) = ∞, x ∉ dom f

often simplifies notation; for example, the condition

0 ≤ θ ≤ 1 =⇒ f̃(θx + (1 − θ)y) ≤ θf̃(x) + (1 − θ)f̃(y)

(as an inequality in R ∪ {∞}), means the same as the two conditions

• dom f is convex
• for x, y ∈ dom f ,

0≤θ≤1 =⇒ f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y)

Convex functions 3–6


First-order condition

f is differentiable if dom f is open and the gradient


 
∇f (x) = (∂f (x)/∂x1, ∂f (x)/∂x2, . . . , ∂f (x)/∂xn)

exists at each x ∈ dom f

1st-order condition: differentiable f with convex domain is convex iff

f (y) ≥ f (x) + ∇f (x)T (y − x) for all x, y ∈ dom f

[figure: the graph of f lies above the tangent line f (x) + ∇f (x)ᵀ(y − x) at (x, f (x))]

first-order approximation of f is global underestimator

Convex functions 3–7


Second-order conditions

f is twice differentiable if dom f is open and the Hessian ∇2f (x) ∈ Sn,

∇²f (x)ij = ∂²f (x)/∂xi∂xj ,    i, j = 1, . . . , n,

exists at each x ∈ dom f

2nd-order conditions: for twice differentiable f with convex domain

• f is convex if and only if

∇²f (x) ⪰ 0 for all x ∈ dom f

• if ∇²f (x) ≻ 0 for all x ∈ dom f , then f is strictly convex

Convex functions 3–8


Examples
quadratic function: f (x) = (1/2)xᵀP x + qᵀx + r (with P ∈ Sn)

∇f (x) = P x + q,    ∇²f (x) = P

convex if P ⪰ 0
least-squares objective: f (x) = ‖Ax − b‖₂²

∇f (x) = 2Aᵀ(Ax − b),    ∇²f (x) = 2AᵀA

convex (for any A)

quadratic-over-linear: f (x, y) = x²/y

∇²f (x, y) = (2/y³) [y; −x][y; −x]ᵀ ⪰ 0

convex for y > 0

[figure: graph of f (x, y) = x²/y]
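The least-squares gradient formula above can be checked against central finite differences (a NumPy sketch with random data):

```python
import numpy as np

# f(x) = ||Ax - b||_2^2, gradient 2 A^T (Ax - b) from the slide.
rng = np.random.default_rng(6)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
f = lambda x: np.sum((A @ x - b) ** 2)

x = rng.standard_normal(3)
grad = 2 * A.T @ (A @ x - b)
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(grad, num, atol=1e-4)
```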

Convex functions 3–9


log-sum-exp: f (x) = log Σ_{k=1}^n exp xk is convex

∇²f (x) = (1/(1ᵀz)) diag(z) − (1/(1ᵀz)²) zzᵀ    (zk = exp xk)

to show ∇²f (x) ⪰ 0, we must verify that vᵀ∇²f (x)v ≥ 0 for all v:

vᵀ∇²f (x)v = ((Σk zk vk²)(Σk zk) − (Σk vk zk)²) / (Σk zk)² ≥ 0

since (Σk vk zk)² ≤ (Σk zk vk²)(Σk zk) (from Cauchy–Schwarz inequality)

geometric mean: f (x) = (Π_{k=1}^n xk)^{1/n} on Rn++ is concave

(similar proof as for log-sum-exp)
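The log-sum-exp Hessian formula can be verified to be positive semidefinite at a random point (NumPy sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(6)
z = np.exp(x)
# Hessian from the slide: diag(z)/1'z - zz'/(1'z)^2
H = np.diag(z) / z.sum() - np.outer(z, z) / z.sum() ** 2
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)
```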

Convex functions 3–10


Epigraph and sublevel set

α-sublevel set of f : Rn → R:

Cα = {x ∈ dom f | f (x) ≤ α}

sublevel sets of convex functions are convex (converse is false)

epigraph of f : Rn → R:

epi f = {(x, t) ∈ Rn+1 | x ∈ dom f, f (x) ≤ t}

[figure: epigraph of f , the region on and above the graph]

f is convex if and only if epi f is a convex set

Convex functions 3–11


Jensen’s inequality

basic inequality: if f is convex, then for 0 ≤ θ ≤ 1,

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y)

extension: if f is convex, then

f (E z) ≤ E f (z)

for any random variable z

basic inequality is special case with discrete distribution

prob(z = x) = θ, prob(z = y) = 1 − θ
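The extension can be illustrated by Monte Carlo for a simple convex f (a sketch; any convex f and any distribution would do):

```python
import numpy as np

# f(u) = u^2 is convex, so f(E z) <= E f(z) for any random z.
rng = np.random.default_rng(4)
z = rng.standard_normal(100_000)
f = np.square
assert f(z.mean()) <= f(z).mean()
```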

Convex functions 3–12


Operations that preserve convexity

practical methods for establishing convexity of a function

1. verify definition (often simplified by restricting to a line)

2. for twice differentiable functions, show ∇²f (x) ⪰ 0

3. show that f is obtained from simple convex functions by operations


that preserve convexity
• nonnegative weighted sum
• composition with affine function
• pointwise maximum and supremum
• composition
• minimization
• perspective

Convex functions 3–13


Positive weighted sum & composition with affine function

nonnegative multiple: αf is convex if f is convex, α ≥ 0

sum: f1 + f2 convex if f1, f2 convex (extends to infinite sums, integrals)

composition with affine function: f (Ax + b) is convex if f is convex

examples

• log barrier for linear inequalities

f (x) = −Σ_{i=1}^m log(bi − aiᵀx),    dom f = {x | aiᵀx < bi, i = 1, . . . , m}

• (any) norm of affine function: f (x) = ‖Ax + b‖

Convex functions 3–14


Pointwise maximum

if f1, . . . , fm are convex, then f (x) = max{f1(x), . . . , fm(x)} is convex

examples

• piecewise-linear function: f (x) = max_{i=1,...,m}(aiᵀx + bi) is convex


• sum of r largest components of x ∈ Rn:

f (x) = x[1] + x[2] + · · · + x[r]

is convex (x[i] is ith largest component of x)


proof:
f (x) = max{xi1 + xi2 + · · · + xir | 1 ≤ i1 < i2 < · · · < ir ≤ n}
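The identity in the proof can be spot-checked by brute force over all index subsets (NumPy sketch):

```python
import numpy as np
from itertools import combinations

# Sum of the r largest entries equals the max of x_{i1}+...+x_{ir}
# over all r-element index subsets — a max of linear functions.
rng = np.random.default_rng(8)
x = rng.standard_normal(7)
r = 3
top_r = np.sort(x)[-r:].sum()
brute = max(sum(x[list(c)]) for c in combinations(range(7), r))
assert np.isclose(top_r, brute)
```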

Convex functions 3–15


Pointwise supremum
if f (x, y) is convex in x for each y ∈ A, then

g(x) = sup_{y∈A} f (x, y)

is convex
examples
• support function of a set C: SC(x) = sup_{y∈C} yᵀx is convex
• distance to farthest point in a set C:

f (x) = sup_{y∈C} ‖x − y‖

• maximum eigenvalue of symmetric matrix: for X ∈ Sn,

λmax(X) = sup_{‖y‖₂=1} yᵀXy

Convex functions 3–16


Composition with scalar functions

composition of g : Rn → R and h : R → R:

f (x) = h(g(x))

f is convex if
    g convex, h convex, h̃ nondecreasing, or
    g concave, h convex, h̃ nonincreasing

• proof (for n = 1, differentiable g, h)

f ′′(x) = h′′(g(x)) g′(x)² + h′(g(x)) g′′(x)

• note: monotonicity must hold for extended-value extension h̃

examples
• exp g(x) is convex if g is convex
• 1/g(x) is convex if g is concave and positive

Convex functions 3–17


Vector composition

composition of g : Rn → Rk and h : Rk → R:

f (x) = h(g(x)) = h(g1(x), g2(x), . . . , gk (x))

f is convex if
    gi convex, h convex, h̃ nondecreasing in each argument, or
    gi concave, h convex, h̃ nonincreasing in each argument

proof (for n = 1, differentiable g, h)

f ′′(x) = g′(x)ᵀ∇²h(g(x)) g′(x) + ∇h(g(x))ᵀg′′(x)

examples
• Σ_{i=1}^m log gi(x) is concave if gi are concave and positive
• log Σ_{i=1}^m exp gi(x) is convex if gi are convex

Convex functions 3–18


Minimization

if f (x, y) is convex in (x, y) and C is a convex set, then

g(x) = inf_{y∈C} f (x, y)

is convex
examples
• f (x, y) = xᵀAx + 2xᵀBy + yᵀCy with

[A B; Bᵀ C] ⪰ 0,    C ≻ 0

minimizing over y gives g(x) = inf_y f (x, y) = xᵀ(A − BC⁻¹Bᵀ)x

g is convex, hence Schur complement A − BC⁻¹Bᵀ ⪰ 0
• distance to a set: dist(x, S) = inf_{y∈S} ‖x − y‖ is convex if S is convex

Convex functions 3–19


Perspective

the perspective of a function f : Rn → R is the function g : Rn × R → R,

g(x, t) = tf (x/t), dom g = {(x, t) | x/t ∈ dom f, t > 0}

g is convex if f is convex

examples
• f (x) = xᵀx is convex; hence g(x, t) = xᵀx/t is convex for t > 0
• negative logarithm f (x) = − log x is convex; hence relative entropy
g(x, t) = t log t − t log x is convex on R2++
• if f is convex, then

g(x) = (cᵀx + d) f ((Ax + b)/(cᵀx + d))

is convex on {x | cᵀx + d > 0, (Ax + b)/(cᵀx + d) ∈ dom f }

Convex functions 3–20


The conjugate function

the conjugate of a function f is

f ∗(y) = sup_{x∈dom f} (yᵀx − f (x))

[figure: f ∗(y) is the maximum gap between the line xy and f (x); the supporting line with slope y meets the vertical axis at (0, −f ∗(y))]

• f ∗ is convex (even if f is not)


• will be useful in chapter 5

Convex functions 3–21


examples

• negative logarithm f (x) = − log x

f ∗(y) = sup_{x>0} (xy + log x)
       = −1 − log(−y) if y < 0,  ∞ otherwise

• strictly convex quadratic f (x) = (1/2)xᵀQx with Q ∈ Sn++

f ∗(y) = sup_x (yᵀx − (1/2)xᵀQx)
       = (1/2) yᵀQ⁻¹y
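The quadratic conjugate can be checked numerically: the supremum of the concave objective is attained at x = Q⁻¹y (NumPy sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)          # Q in S^n_{++}
y = rng.standard_normal(n)

x_star = np.linalg.solve(Q, y)   # stationary point: y - Q x = 0
val = y @ x_star - 0.5 * x_star @ Q @ x_star
assert np.isclose(val, 0.5 * y @ np.linalg.solve(Q, y))
```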

Convex functions 3–22


Quasiconvex functions

f : Rn → R is quasiconvex if dom f is convex and the sublevel sets

Sα = {x ∈ dom f | f (x) ≤ α}

are convex for all α

[figure: quasiconvex function on R with convex sublevel sets [a, b] and (−∞, c] at levels α and β]

• f is quasiconcave if −f is quasiconvex
• f is quasilinear if it is quasiconvex and quasiconcave

Convex functions 3–23


Examples
• √|x| is quasiconvex on R
• ceil(x) = inf{z ∈ Z | z ≥ x} is quasilinear
• log x is quasilinear on R++
• f (x1, x2) = x1x2 is quasiconcave on R2++
• linear-fractional function

f (x) = (aᵀx + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

is quasilinear
• distance ratio

f (x) = ‖x − a‖₂/‖x − b‖₂,    dom f = {x | ‖x − a‖₂ ≤ ‖x − b‖₂}

is quasiconvex

Convex functions 3–24


internal rate of return

• cash flow x = (x0, . . . , xn); xi is payment in period i (to us if xi > 0)


• we assume x0 < 0 and x0 + x1 + · · · + xn > 0
• present value of cash flow x, for interest rate r:

PV(x, r) = Σ_{i=0}^n (1 + r)⁻ⁱ xi

• internal rate of return is smallest interest rate for which PV(x, r) = 0:

IRR(x) = inf{r ≥ 0 | PV(x, r) = 0}

IRR is quasiconcave: superlevel set is intersection of halfspaces

IRR(x) ≥ R ⇐⇒ Σ_{i=0}^n (1 + r)⁻ⁱ xi ≥ 0 for 0 ≤ r ≤ R
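PV and a simple bisection for the rate where PV crosses zero can be sketched as follows (the cash flow is an illustrative assumption satisfying x0 < 0 and sum(x) > 0):

```python
import numpy as np

def pv(x, r):
    """Present value of cash flow x at interest rate r."""
    i = np.arange(len(x))
    return np.sum(x / (1 + r) ** i)

x = np.array([-1.0, 0.3, 0.4, 0.5])   # invest 1, receive 1.2 over 3 periods

lo, hi = 0.0, 1.0                     # PV(x, 0) > 0 and PV(x, 1) < 0 here
for _ in range(60):                   # bisect: PV is decreasing in r
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if pv(x, mid) > 0 else (lo, mid)

assert abs(pv(x, (lo + hi) / 2)) < 1e-9
```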

Convex functions 3–25


Properties
modified Jensen inequality: for quasiconvex f

0≤θ≤1 =⇒ f (θx + (1 − θ)y) ≤ max{f (x), f (y)}

first-order condition: differentiable f with cvx domain is quasiconvex iff

f (y) ≤ f (x) =⇒ ∇f (x)T (y − x) ≤ 0

[figure: ∇f (x) defines a supporting hyperplane to the sublevel set {y | f (y) ≤ f (x)} at x]

sums of quasiconvex functions are not necessarily quasiconvex

Convex functions 3–26


Log-concave and log-convex functions
a positive function f is log-concave if log f is concave:

f (θx + (1 − θ)y) ≥ f (x)^θ f (y)^{1−θ} for 0 ≤ θ ≤ 1

f is log-convex if log f is convex

• powers: x^a on R++ is log-convex for a ≤ 0, log-concave for a ≥ 0


• many common probability densities are log-concave, e.g., normal:

f (x) = (1/√((2π)ⁿ det Σ)) exp(−(1/2)(x − x̄)ᵀΣ⁻¹(x − x̄))

• cumulative Gaussian distribution function Φ is log-concave

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du

Convex functions 3–27


Properties of log-concave functions

• twice differentiable f with convex domain is log-concave if and only if

f (x)∇²f (x) ⪯ ∇f (x)∇f (x)ᵀ

for all x ∈ dom f

• product of log-concave functions is log-concave

• sum of log-concave functions is not always log-concave

• integration: if f : Rn × Rm → R is log-concave, then

g(x) = ∫ f (x, y) dy

is log-concave (not easy to show)

Convex functions 3–28


consequences of integration property

• convolution f ∗ g of log-concave functions f , g is log-concave

(f ∗ g)(x) = ∫ f (x − y)g(y) dy

• if C ⊆ Rn convex and y is a random variable with log-concave pdf then

f (x) = prob(x + y ∈ C)

is log-concave
proof: write f (x) as integral of product of log-concave functions

f (x) = ∫ g(x + y)p(y) dy,    g(u) = 1 if u ∈ C, 0 if u ∉ C,

p is pdf of y

Convex functions 3–29


example: yield function

Y (x) = prob(x + w ∈ S)

• x ∈ Rn: nominal parameter values for product

• w ∈ Rn: random variations of parameters in manufactured product

• S: set of acceptable values

if S is convex and w has a log-concave pdf, then

• Y is log-concave

• yield regions {x | Y (x) ≥ α} are convex

Convex functions 3–30


Convexity with respect to generalized inequalities

f : Rn → Rm is K-convex if dom f is convex and

f (θx + (1 − θ)y) ⪯K θf (x) + (1 − θ)f (y)

for x, y ∈ dom f , 0 ≤ θ ≤ 1

example f : Sm → Sm, f (X) = X² is Sm+-convex

proof: for fixed z ∈ Rm, zᵀX²z = ‖Xz‖₂² is convex in X, i.e.,

zᵀ(θX + (1 − θ)Y )²z ≤ θzᵀX²z + (1 − θ)zᵀY ²z

for X, Y ∈ Sm, 0 ≤ θ ≤ 1

therefore (θX + (1 − θ)Y )² ⪯ θX² + (1 − θ)Y ²

Convex functions 3–31


Convex Optimization — Boyd & Vandenberghe

4. Convex optimization problems

• optimization problem in standard form


• convex optimization problems
• quasiconvex optimization
• linear optimization
• quadratic optimization
• geometric programming
• generalized inequality constraints
• semidefinite programming
• vector optimization

4–1
Optimization problem in standard form

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

• x ∈ Rn is the optimization variable


• f0 : Rn → R is the objective or cost function
• fi : Rn → R, i = 1, . . . , m, are the inequality constraint functions
• hi : Rn → R are the equality constraint functions

optimal value:

p⋆ = inf{f0(x) | fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p}

• p⋆ = ∞ if problem is infeasible (no x satisfies the constraints)

• p⋆ = −∞ if problem is unbounded below

Convex optimization problems 4–2


Optimal and locally optimal points

x is feasible if x ∈ dom f0 and it satisfies the constraints


a feasible x is optimal if f0(x) = p⋆; Xopt is the set of optimal points
x is locally optimal if there is an R > 0 such that x is optimal for

minimize (over z) f0(z)


subject to fi(z) ≤ 0, i = 1, . . . , m, hi(z) = 0, i = 1, . . . , p
‖z − x‖₂ ≤ R

examples (with n = 1, m = p = 0)
• f0(x) = 1/x, dom f0 = R++: p⋆ = 0, no optimal point
• f0(x) = − log x, dom f0 = R++: p⋆ = −∞
• f0(x) = x log x, dom f0 = R++: p⋆ = −1/e, x = 1/e is optimal
• f0(x) = x³ − 3x, p⋆ = −∞, local optimum at x = 1

Convex optimization problems 4–3


Implicit constraints

the standard form optimization problem has an implicit constraint

x ∈ D = ∩_{i=0}^m dom fi ∩ ∩_{i=1}^p dom hi,

• we call D the domain of the problem


• the constraints fi(x) ≤ 0, hi(x) = 0 are the explicit constraints
• a problem is unconstrained if it has no explicit constraints (m = p = 0)

example:

minimize f0(x) = −Σ_{i=1}^k log(bi − aiᵀx)

is an unconstrained problem with implicit constraints aiᵀx < bi

Convex optimization problems 4–4


Feasibility problem

find x
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

can be considered a special case of the general problem with f0(x) = 0:

minimize 0
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

• p⋆ = 0 if constraints are feasible; any feasible x is optimal

• p⋆ = ∞ if constraints are infeasible

Convex optimization problems 4–5


Convex optimization problem

standard form convex optimization problem

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
aiᵀx = bi, i = 1, . . . , p

• f0, f1, . . . , fm are convex; equality constraints are affine


• problem is quasiconvex if f0 is quasiconvex (and f1, . . . , fm convex)

often written as

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b

important property: feasible set of a convex optimization problem is convex

Convex optimization problems 4–6


example

minimize f0(x) = x1² + x2²

subject to f1(x) = x1/(1 + x2²) ≤ 0
h1(x) = (x1 + x2)² = 0

• f0 is convex; feasible set {(x1, x2) | x1 = −x2 ≤ 0} is convex

• not a convex problem (according to our definition): f1 is not convex, h1


is not affine

• equivalent (but not identical) to the convex problem

minimize x1² + x2²

subject to x1 ≤ 0
x1 + x2 = 0

Convex optimization problems 4–7


Local and global optima
any locally optimal point of a convex problem is (globally) optimal
proof: suppose x is locally optimal and y is optimal with f0(y) < f0(x)
x locally optimal means there is an R > 0 such that

z feasible, ‖z − x‖₂ ≤ R =⇒ f0(z) ≥ f0(x)

consider z = θy + (1 − θ)x with θ = R/(2‖y − x‖₂)

• ‖y − x‖₂ > R, so 0 < θ < 1/2

• z is a convex combination of two feasible points, hence also feasible
• ‖z − x‖₂ = R/2 and

f0(z) ≤ θf0(x) + (1 − θ)f0(y) < f0(x)

which contradicts our assumption that x is locally optimal

Convex optimization problems 4–8


Optimality criterion for differentiable f0

x is optimal if and only if it is feasible and

∇f0(x)T (y − x) ≥ 0 for all feasible y

[figure: at optimal x, −∇f0(x) defines a supporting hyperplane to the feasible set X]

if nonzero, ∇f0(x) defines a supporting hyperplane to feasible set X at x

Convex optimization problems 4–9


• unconstrained problem: x is optimal if and only if

x ∈ dom f0, ∇f0(x) = 0

• equality constrained problem

minimize f0(x) subject to Ax = b

x is optimal if and only if there exists a ν such that

x ∈ dom f0, Ax = b, ∇f0(x) + AT ν = 0

• minimization over nonnegative orthant

minimize f0(x) subject to x ⪰ 0

x is optimal if and only if

x ∈ dom f0, x ⪰ 0, and for each i: ∇f0(x)i ≥ 0 if xi = 0, ∇f0(x)i = 0 if xi > 0

Convex optimization problems 4–10


Equivalent convex problems
two problems are (informally) equivalent if the solution of one is readily
obtained from the solution of the other, and vice-versa
some common transformations that preserve convexity:
• eliminating equality constraints

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b

is equivalent to

minimize (over z) f0(F z + x0)


subject to fi(F z + x0) ≤ 0, i = 1, . . . , m

where F and x0 are such that

Ax = b ⇐⇒ x = F z + x0 for some z

Convex optimization problems 4–11


• introducing equality constraints

minimize f0(A0x + b0)


subject to fi(Aix + bi) ≤ 0, i = 1, . . . , m

is equivalent to

minimize (over x, yi) f0(y0)


subject to fi(yi) ≤ 0, i = 1, . . . , m
yi = Aix + bi, i = 0, 1, . . . , m

• introducing slack variables for linear inequalities

minimize f0(x)
subject to aTi x ≤ bi, i = 1, . . . , m

is equivalent to

minimize (over x, s) f0(x)


subject to aTi x + si = bi, i = 1, . . . , m
si ≥ 0, i = 1, . . . , m

Convex optimization problems 4–12


• epigraph form: standard form convex problem is equivalent to

minimize (over x, t) t
subject to f0(x) − t ≤ 0
fi(x) ≤ 0, i = 1, . . . , m
Ax = b

• minimizing over some variables

minimize f0(x1, x2)


subject to fi(x1) ≤ 0, i = 1, . . . , m

is equivalent to

minimize f̃0(x1)
subject to fi(x1) ≤ 0, i = 1, . . . , m

where f̃0(x1) = inf_{x2} f0(x1, x2)

Convex optimization problems 4–13


Quasiconvex optimization

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b
with f0 : Rn → R quasiconvex, f1, . . . , fm convex

can have locally optimal points that are not (globally) optimal


Convex optimization problems 4–14


convex representation of sublevel sets of f0

if f0 is quasiconvex, there exists a family of functions φt such that:


• φt(x) is convex in x for fixed t
• t-sublevel set of f0 is 0-sublevel set of φt, i.e.,

f0(x) ≤ t ⇐⇒ φt(x) ≤ 0

example
p(x)
f0(x) =
q(x)

with p convex, q concave, and p(x) ≥ 0, q(x) > 0 on dom f0

can take φt(x) = p(x) − tq(x):


• for t ≥ 0, φt convex in x
• p(x)/q(x) ≤ t if and only if φt(x) ≤ 0

Convex optimization problems 4–15


quasiconvex optimization via convex feasibility problems

φt(x) ≤ 0, fi(x) ≤ 0, i = 1, . . . , m, Ax = b (1)

• for fixed t, a convex feasibility problem in x


• if feasible, we can conclude that t ≥ p?; if infeasible, t ≤ p?

Bisection method for quasiconvex optimization

given l ≤ p?, u ≥ p?, tolerance ε > 0.

repeat
1. t := (l + u)/2.
2. Solve the convex feasibility problem (1).
3. if (1) is feasible, u := t; else l := t.
until u − l ≤ ε.

requires exactly ⌈log2((u − l)/ε)⌉ iterations (where u, l are initial values)

Convex optimization problems 4–16
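The bisection loop above can be sketched in a few lines of Python. The 1-D instance here is made up for illustration: f0(x) = (x² + 1)/(x + 1) on [0, 4], which fits the p/q form on the previous slide, so each feasibility problem is minimizing a convex quadratic over an interval (solvable in closed form):

```python
import math

# minimize f0(x) = (x^2 + 1)/(x + 1) over x in [0, 4] (made-up instance);
# p(x) = x^2 + 1 convex >= 0, q(x) = x + 1 concave > 0, so f0 is quasiconvex
def feasible(t, lo=0.0, hi=4.0):
    # phi_t(x) = x^2 + 1 - t(x + 1) is a quadratic with vertex t/2; clip to [lo, hi]
    x = min(max(t / 2.0, lo), hi)
    return x * x + 1.0 - t * (x + 1.0) <= 0.0

l, u, eps = 0.0, 1.0, 1e-9          # f0(0) = 1 gives the initial upper bound
while u - l > eps:
    t = 0.5 * (l + u)
    if feasible(t):
        u = t                        # t >= p*
    else:
        l = t                        # t <= p*

# true optimum: p* = 2*sqrt(2) - 2, attained at x = sqrt(2) - 1
assert abs(u - (2.0 * math.sqrt(2.0) - 2.0)) < 1e-6
```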


Linear program (LP)

minimize cT x + d
subject to Gx ⪯ h
Ax = b

• convex problem with affine objective and constraint functions


• feasible set is a polyhedron


Convex optimization problems 4–17


Examples
diet problem: choose quantities x1, . . . , xn of n foods
• one unit of food j costs cj , contains amount aij of nutrient i
• healthy diet requires nutrient i in quantity at least bi

to find cheapest healthy diet,

minimize cT x
subject to Ax ⪰ b, x ⪰ 0

piecewise-linear minimization

minimize maxi=1,...,m(aTi x + bi)

equivalent to an LP

minimize t
subject to aTi x + bi ≤ t, i = 1, . . . , m

Convex optimization problems 4–18


Chebyshev center of a polyhedron
Chebyshev center of

P = {x | aTi x ≤ bi, i = 1, . . . , m}

is the center xcheb of the largest inscribed ball

B = {xc + u | kuk2 ≤ r}

• aTi x ≤ bi for all x ∈ B if and only if

sup{aTi (xc + u) | kuk2 ≤ r} = aTi xc + rkaik2 ≤ bi

• hence, xc, r can be determined by solving the LP

maximize r
subject to aTi xc + rkaik2 ≤ bi, i = 1, . . . , m

Convex optimization problems 4–19
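The key step above is the supremum formula: sup over the ball of aTi x equals aTi xc + r kai k2, attained at u = r ai/kai k2. A small numerical check in Python/numpy (random data, seeded, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, xc, r = rng.standard_normal(3), rng.standard_normal(3), 0.7

# the sup is attained at u* = r a / ||a||_2
u_star = r * a / np.linalg.norm(a)
rhs = a @ xc + r * np.linalg.norm(a)
assert np.isclose(a @ (xc + u_star), rhs)

# no other point of the ball does better
for _ in range(1000):
    u = rng.standard_normal(3)
    u *= r * rng.uniform() / np.linalg.norm(u)   # random point with ||u||_2 <= r
    assert a @ (xc + u) <= rhs + 1e-12
```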


(Generalized) linear-fractional program

minimize f0(x)
subject to Gx ⪯ h
Ax = b
linear-fractional program

f0(x) = (cT x + d)/(eT x + f ),    dom f0 = {x | eT x + f > 0}

• a quasiconvex optimization problem; can be solved by bisection


• also equivalent to the LP (variables y, z)

minimize cT y + dz
subject to Gy  hz
Ay = bz
eT y + f z = 1
z≥0

Convex optimization problems 4–20


generalized linear-fractional program

f0(x) = max_{i=1,...,r} (cTi x + di)/(eTi x + fi),    dom f0 = {x | eTi x + fi > 0, i = 1, . . . , r}

a quasiconvex optimization problem; can be solved by bisection

example: Von Neumann model of a growing economy

maximize (over x, x+) mini=1,...,n x+i /xi
subject to x+ ⪰ 0, Bx+ ⪯ Ax

• x, x+ ∈ Rn: activity levels of n sectors, in current and next period

• (Ax)i, (Bx+)i: produced, resp. consumed, amounts of good i
• x+i /xi: growth rate of sector i

allocate activity to maximize growth rate of slowest growing sector

Convex optimization problems 4–21


Quadratic program (QP)

minimize (1/2)xT P x + q T x + r
subject to Gx ⪯ h
Ax = b

• P ∈ Sn+, so objective is convex quadratic


• minimize a convex quadratic function over a polyhedron


Convex optimization problems 4–22


Examples

least-squares
minimize kAx − bk22

• analytical solution x? = A†b (A† is pseudo-inverse)


• can add linear constraints, e.g., l  x  u

linear program with random cost

minimize c̄T x + γxT Σx = E cT x + γ var(cT x)


subject to Gx  h, Ax = b

• c is random vector with mean c̄ and covariance Σ


• hence, cT x is random variable with mean c̄T x and variance xT Σx
• γ > 0 is risk aversion parameter; controls the trade-off between
expected cost and variance (risk)

Convex optimization problems 4–23
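The analytic least-squares solution x? = A†b on this slide can be cross-checked against numpy's built-in solver, and against the optimality condition AT (Ax? − b) = 0. Random data, seeded, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

x_pinv = np.linalg.pinv(A) @ b                      # x* = pinv(A) b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]      # numpy's LS solver

assert np.allclose(x_pinv, x_lstsq)
assert np.allclose(A.T @ (A @ x_pinv - b), 0.0)     # gradient of ||Ax - b||^2 is zero
```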


Quadratically constrained quadratic program (QCQP)

minimize (1/2)xT P0x + q0T x + r0


subject to (1/2)xT Pix + qiT x + ri ≤ 0, i = 1, . . . , m
Ax = b

• Pi ∈ Sn+; objective and constraints are convex quadratic

• if P1, . . . , Pm ∈ Sn++, feasible region is intersection of m ellipsoids and


an affine set

Convex optimization problems 4–24


Second-order cone programming

minimize f T x
subject to kAix + bik2 ≤ cTi x + di, i = 1, . . . , m
Fx = g

(Ai ∈ Rni×n, F ∈ Rp×n)

• inequalities are called second-order cone (SOC) constraints:

(Aix + bi, cTi x + di) ∈ second-order cone in Rni+1

• for ni = 0, reduces to an LP; if ci = 0, reduces to a QCQP

• more general than QCQP and LP

Convex optimization problems 4–25


Robust linear programming
the parameters in optimization problems are often uncertain, e.g., in an LP

minimize cT x
subject to aTi x ≤ bi, i = 1, . . . , m,

there can be uncertainty in c, ai, bi


two common approaches to handling uncertainty (in ai, for simplicity)
• deterministic model: constraints must hold for all ai ∈ Ei

minimize cT x
subject to aTi x ≤ bi for all ai ∈ Ei, i = 1, . . . , m,

• stochastic model: ai is random variable; constraints must hold with


probability η

minimize cT x
subject to prob(aTi x ≤ bi) ≥ η, i = 1, . . . , m

Convex optimization problems 4–26


deterministic approach via SOCP

• choose an ellipsoid as Ei:

Ei = {āi + Piu | kuk2 ≤ 1} (āi ∈ Rn, Pi ∈ Rn×n)

center is āi, semi-axes determined by singular values/vectors of Pi

• robust LP

minimize cT x
subject to aTi x ≤ bi ∀ai ∈ Ei, i = 1, . . . , m

is equivalent to the SOCP

minimize cT x
subject to āTi x + kPiT xk2 ≤ bi, i = 1, . . . , m

(follows from supkuk2≤1(āi + Piu)T x = āTi x + kPiT xk2)

Convex optimization problems 4–27
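The identity in the last parenthesis, sup over the ellipsoid of (āi + Piu)T x = āTi x + kPiT xk2, is what makes the SOCP reformulation work; it can be checked directly in Python/numpy (seeded random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a_bar = rng.standard_normal(4)
P = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# the sup over ||u||_2 <= 1 is attained at u* = P^T x / ||P^T x||_2
u_star = P.T @ x / np.linalg.norm(P.T @ x)
rhs = a_bar @ x + np.linalg.norm(P.T @ x)
assert np.isclose((a_bar + P @ u_star) @ x, rhs)

# any other u in the unit ball does no better
for _ in range(1000):
    u = rng.standard_normal(4)
    u /= max(np.linalg.norm(u), 1.0)
    assert (a_bar + P @ u) @ x <= rhs + 1e-12
```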


stochastic approach via SOCP

• assume ai is Gaussian with mean āi, covariance Σi (ai ∼ N (āi, Σi))

• aTi x is Gaussian r.v. with mean āTi x, variance xT Σix; hence

prob(aTi x ≤ bi) = Φ((bi − āTi x)/kΣi^{1/2} xk2)

where Φ(x) = (1/√2π) ∫_{−∞}^{x} e^{−t²/2} dt is CDF of N (0, 1)

• robust LP
minimize cT x
subject to prob(aTi x ≤ bi) ≥ η, i = 1, . . . , m,

with η ≥ 1/2, is equivalent to the SOCP

minimize cT x
subject to āTi x + Φ−1(η)kΣi^{1/2} xk2 ≤ bi, i = 1, . . . , m

Convex optimization problems 4–28
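The chance-constraint reformulation above uses only Φ and Φ−1, which the Python standard library provides via statistics.NormalDist. A sketch with made-up data, constructing b so the SOCP constraint is tight and checking that the probability equals η:

```python
from statistics import NormalDist
import numpy as np

Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

# made-up data: a ~ N(a_bar, Sigma) with Sigma = L L^T, so ||Sigma^{1/2} x|| = ||L^T x||
a_bar = np.array([1.0, -2.0])
L = np.array([[0.5, 0.0], [0.2, 0.3]])
x = np.array([0.3, 0.1])
eta = 0.95

# choose b so the SOCP constraint a_bar^T x + Phi^{-1}(eta) ||L^T x||_2 <= b is tight
b = a_bar @ x + Phi_inv(eta) * np.linalg.norm(L.T @ x)

# prob(a^T x <= b) = Phi((b - a_bar^T x)/||L^T x||_2), which equals eta here
prob = Phi((b - a_bar @ x) / np.linalg.norm(L.T @ x))
assert abs(prob - eta) < 1e-9
```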


Geometric programming
monomial function

f (x) = c x1^{a1} x2^{a2} · · · xn^{an},    dom f = Rn++

with c > 0; exponents ai can be any real numbers

posynomial function: sum of monomials

f (x) = Σ_{k=1}^{K} ck x1^{a1k} x2^{a2k} · · · xn^{ank},    dom f = Rn++

geometric program (GP)

minimize f0(x)
subject to fi(x) ≤ 1, i = 1, . . . , m
hi(x) = 1, i = 1, . . . , p

with fi posynomial, hi monomial

Convex optimization problems 4–29


Geometric program in convex form
change variables to yi = log xi, and take logarithm of cost, constraints

• monomial f (x) = c x1^{a1} · · · xn^{an} transforms to

log f (ey1 , . . . , eyn ) = aT y + b    (b = log c)

• posynomial f (x) = Σ_{k=1}^{K} ck x1^{a1k} x2^{a2k} · · · xn^{ank} transforms to

log f (ey1 , . . . , eyn ) = log(Σ_{k=1}^{K} exp(akT y + bk))    (bk = log ck )

• geometric program transforms to convex problem

minimize    log(Σ_{k=1}^{K} exp(a0kT y + b0k ))
subject to  log(Σ_{k=1}^{K} exp(aikT y + bik )) ≤ 0, i = 1, . . . , m
            Gy + d = 0

Convex optimization problems 4–30
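The change of variables is easy to verify numerically: log of a monomial becomes affine in y = log x, and log of a posynomial becomes log-sum-exp. A Python/numpy check on made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(3)
x = np.exp(y)                                   # x = e^y, so y = log x

# monomial f(x) = c * x1^a1 * x2^a2 * x3^a3  ->  log f(e^y) = a^T y + log c
c, a = 2.5, np.array([1.0, -0.5, 3.0])
f = c * np.prod(x ** a)
assert np.isclose(np.log(f), a @ y + np.log(c))

# posynomial (sum of two monomials) -> log-sum-exp of affine functions of y
ck = np.array([1.0, 0.5])
A = np.array([[1.0, 2.0, 0.0], [0.5, -1.0, 1.0]])
g = np.sum(ck * np.prod(x ** A, axis=1))
assert np.isclose(np.log(g), np.log(np.sum(np.exp(A @ y + np.log(ck)))))
```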


Design of cantilever beam

(figure: beam built from segments 4, 3, 2, 1, left to right, with force F applied at the right end)

• N segments with unit lengths, rectangular cross-sections of size wi × hi


• given vertical force F applied at the right end

design problem

minimize total weight


subject to upper & lower bounds on wi, hi
upper bound & lower bounds on aspect ratios hi/wi
upper bound on stress in each segment
upper bound on vertical deflection at the end of the beam

variables: wi, hi for i = 1, . . . , N

Convex optimization problems 4–31


objective and constraint functions

• total weight w1h1 + · · · + wN hN is posynomial

• aspect ratio hi/wi and inverse aspect ratio wi/hi are monomials

• maximum stress in segment i is given by 6iF/(wi hi^2), a monomial

• the vertical deflection yi and slope vi of central axis at the right end of
segment i are defined recursively as

vi = 12(i − 1/2) F/(E wi hi^3) + vi+1
yi = 6(i − 1/3) F/(E wi hi^3) + vi+1 + yi+1

for i = N, N − 1, . . . , 1, with vN+1 = yN+1 = 0 (E is Young’s modulus)


vi and yi are posynomial functions of w, h

Convex optimization problems 4–32


formulation as a GP
minimize    w1 h1 + · · · + wN hN
subject to  wmax^{−1} wi ≤ 1,  wmin wi^{−1} ≤ 1,  i = 1, . . . , N
            hmax^{−1} hi ≤ 1,  hmin hi^{−1} ≤ 1,  i = 1, . . . , N
            Smax^{−1} wi^{−1} hi ≤ 1,  Smin wi hi^{−1} ≤ 1,  i = 1, . . . , N
            6iF σmax^{−1} wi^{−1} hi^{−2} ≤ 1,  i = 1, . . . , N
            ymax^{−1} y1 ≤ 1

note
• we write wmin ≤ wi ≤ wmax and hmin ≤ hi ≤ hmax as

wmin/wi ≤ 1, wi/wmax ≤ 1, hmin/hi ≤ 1, hi/hmax ≤ 1

• we write Smin ≤ hi/wi ≤ Smax as

Sminwi/hi ≤ 1, hi/(wiSmax) ≤ 1

Convex optimization problems 4–33


Minimizing spectral radius of nonnegative matrix

Perron-Frobenius eigenvalue λpf (A)

• exists for (elementwise) positive A ∈ Rn×n


• a real, positive eigenvalue of A, equal to spectral radius maxi |λi(A)|
• determines asymptotic growth (decay) rate of Ak : Ak ∼ λpf^k as k → ∞
• alternative characterization: λpf (A) = inf{λ | Av ⪯ λv for some v ≻ 0}

minimizing spectral radius of matrix of posynomials

• minimize λpf (A(x)), where the elements A(x)ij are posynomials of x


• equivalent geometric program:

minimize    λ
subject to  Σ_{j=1}^{n} A(x)ij vj /(λvi) ≤ 1, i = 1, . . . , n

variables λ, v, x

Convex optimization problems 4–34


Generalized inequality constraints

convex problem with generalized inequality constraints

minimize f0(x)
subject to fi(x) ⪯Ki 0, i = 1, . . . , m
Ax = b

• f0 : Rn → R convex; fi : Rn → Rki Ki-convex w.r.t. proper cone Ki


• same properties as standard convex problem (convex feasible set, local
optimum is global, etc.)

conic form problem: special case with affine objective and constraints

minimize cT x
subject to F x + g ⪯K 0
Ax = b

extends linear programming (K = Rm+) to nonpolyhedral cones

Convex optimization problems 4–35


Semidefinite program (SDP)

minimize cT x
subject to x1F1 + x2F2 + · · · + xnFn + G ⪯ 0
Ax = b
with Fi, G ∈ Sk

• inequality constraint is called linear matrix inequality (LMI)


• includes problems with multiple LMI constraints: for example,

x1F̂1 + · · · + xnF̂n + Ĝ ⪯ 0,    x1F̃1 + · · · + xnF̃n + G̃ ⪯ 0

is equivalent to single LMI

x1 [F̂1 0; 0 F̃1] + x2 [F̂2 0; 0 F̃2] + · · · + xn [F̂n 0; 0 F̃n] + [Ĝ 0; 0 G̃] ⪯ 0

Convex optimization problems 4–36


LP and SOCP as SDP

LP and equivalent SDP

LP:  minimize   cT x          SDP:  minimize   cT x
     subject to Ax ⪯ b              subject to diag(Ax − b) ⪯ 0

(note different interpretation of generalized inequality ⪯)

SOCP and equivalent SDP

SOCP: minimize f T x
subject to kAix + bik2 ≤ cTi x + di, i = 1, . . . , m

SDP: minimize   f T x
     subject to [(cTi x + di)I  Aix + bi; (Aix + bi)T  cTi x + di] ⪰ 0, i = 1, . . . , m

Convex optimization problems 4–37


Eigenvalue minimization

minimize λmax(A(x))

where A(x) = A0 + x1A1 + · · · + xnAn (with given Ai ∈ Sk )

equivalent SDP
minimize t
subject to A(x) ⪯ tI

• variables x ∈ Rn, t ∈ R
• follows from
λmax(A) ≤ t ⇐⇒ A ⪯ tI

Convex optimization problems 4–38
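The equivalence λmax(A) ≤ t ⇐⇒ A ⪯ tI can be checked numerically by testing semidefiniteness of A − tI at values of t just below, at, and above λmax. A Python/numpy sketch on a random symmetric matrix (seeded, illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2.0                      # symmetric matrix
lam_max = np.linalg.eigvalsh(A).max()

# A <= tI (negative semidefinite difference) iff t >= lambda_max(A)
for t in (lam_max - 0.1, lam_max, lam_max + 0.1):
    lmi_holds = np.linalg.eigvalsh(A - t * np.eye(5)).max() <= 1e-12
    assert lmi_holds == (t >= lam_max - 1e-12)
```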


Matrix norm minimization

minimize kA(x)k2 = (λmax(A(x)T A(x)))^{1/2}

where A(x) = A0 + x1A1 + · · · + xnAn (with given Ai ∈ Rp×q )


equivalent SDP

minimize   t
subject to [tI  A(x); A(x)T  tI] ⪰ 0

• variables x ∈ Rn, t ∈ R
• constraint follows from

kAk2 ≤ t ⇐⇒ AT A ⪯ t²I, t ≥ 0
         ⇐⇒ [tI  A; AT  tI] ⪰ 0

Convex optimization problems 4–39
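The block-matrix characterization of the spectral norm is also easy to test: the matrix [tI, A; AT, tI] is positive semidefinite exactly when t ≥ kAk2. A Python/numpy check on seeded random data:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))
nrm = np.linalg.norm(A, 2)               # largest singular value of A

def block_psd(t):
    # [[tI, A], [A^T, tI]] is PSD iff its smallest eigenvalue is >= 0
    M = np.block([[t * np.eye(3), A], [A.T, t * np.eye(4)]])
    return np.linalg.eigvalsh(M).min() >= -1e-12

assert not block_psd(nrm - 0.1)
assert block_psd(nrm)
assert block_psd(nrm + 0.1)
```

(the eigenvalues of the block matrix are t ± σi(A) plus copies of t, so the smallest one is t − σmax, which is nonnegative exactly when t ≥ kAk2)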


Vector optimization

general vector optimization problem

minimize (w.r.t. K) f0(x)
subject to          fi(x) ≤ 0, i = 1, . . . , m
                    hi(x) = 0, i = 1, . . . , p

vector objective f0 : Rn → Rq , minimized w.r.t. proper cone K ⊆ Rq

convex vector optimization problem

minimize (w.r.t. K) f0(x)


subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b

with f0 K-convex, f1, . . . , fm convex

Convex optimization problems 4–40


Optimal and Pareto optimal points

set of achievable objective values

O = {f0(x) | x feasible}

• feasible x is optimal if f0(x) is a minimum value of O


• feasible x is Pareto optimal if f0(x) is a minimal value of O

(figures: left, set O with its minimum value f0(x?), where x? is optimal; right, set O with a minimal value f0(xpo), where xpo is Pareto optimal)

Convex optimization problems 4–41


Multicriterion optimization

vector optimization problem with K = Rq+

f0(x) = (F1(x), . . . , Fq (x))

• q different objectives Fi; roughly speaking we want all Fi’s to be small


• feasible x? is optimal if

y feasible =⇒ f0(x?) ⪯ f0(y)

if there exists an optimal point, the objectives are noncompeting


• feasible xpo is Pareto optimal if

y feasible, f0(y) ⪯ f0(xpo) =⇒ f0(xpo) = f0(y)

if there are multiple Pareto optimal values, there is a trade-off between


the objectives

Convex optimization problems 4–42


Regularized least-squares

minimize (w.r.t. R2+) (kAx − bk22, kxk22)

(figure: achievable set O in the plane of F1(x) = kAx − bk22 vs. F2(x) = kxk22)

example for A ∈ R100×10; heavy line is formed by Pareto optimal points

Convex optimization problems 4–43


Risk return trade-off in portfolio optimization
minimize (w.r.t. R2+) (−p̄T x, xT Σx)
subject to 1T x = 1, x ⪰ 0

• x ∈ Rn is investment portfolio; xi is fraction invested in asset i


• p ∈ Rn is vector of relative asset price changes; modeled as a random
variable with mean p̄, covariance Σ
• p̄T x = E r is expected return; xT Σx = var r is return variance

example
(figures: left, mean return vs. standard deviation of return along the trade-off curve; right, allocation x among assets x(1), . . . , x(4) vs. standard deviation of return)
Convex optimization problems 4–44
Scalarization

to find Pareto optimal points: choose λ ≻K∗ 0 and solve scalar problem

minimize λT f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

if x is optimal for scalar problem, then it is Pareto-optimal for vector
optimization problem

(figure: set O with points f0(x1), f0(x2), f0(x3) and weight vectors λ1, λ2)

for convex vector optimization problems, can find (almost) all Pareto
optimal points by varying λ ≻K∗ 0

Convex optimization problems 4–45


Scalarization for multicriterion problems
to find Pareto optimal points, minimize positive weighted sum

λT f0(x) = λ1F1(x) + · · · + λq Fq (x)

examples

• regularized least-squares problem of page 4–43

take λ = (1, γ) with γ > 0:

minimize kAx − bk22 + γkxk22

for fixed γ, a LS problem

(figure: trade-off curve of kxk22 vs. kAx − bk22, with the point for γ = 1 marked)

Convex optimization problems 4–46


• risk-return trade-off of page 4–44

minimize −p̄T x + γxT Σx


subject to 1T x = 1, x ⪰ 0

for fixed γ > 0, a quadratic program

Convex optimization problems 4–47


Convex Optimization — Boyd & Vandenberghe

5. Duality

• Lagrange dual problem


• weak and strong duality
• geometric interpretation
• optimality conditions
• perturbation and sensitivity analysis
• examples
• generalized inequalities

5–1
Lagrangian

standard form problem (not necessarily convex)

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

variable x ∈ Rn, domain D, optimal value p?


Lagrangian: L : Rn × Rm × Rp → R, with dom L = D × Rm × Rp,
L(x, λ, ν) = f0(x) + Σ_{i=1}^{m} λifi(x) + Σ_{i=1}^{p} νihi(x)

• weighted sum of objective and constraint functions


• λi is Lagrange multiplier associated with fi(x) ≤ 0
• νi is Lagrange multiplier associated with hi(x) = 0

Duality 5–2
Lagrange dual function

Lagrange dual function: g : Rm × Rp → R,

g(λ, ν) = inf_{x∈D} L(x, λ, ν)
        = inf_{x∈D} (f0(x) + Σ_{i=1}^{m} λifi(x) + Σ_{i=1}^{p} νihi(x))

g is concave, can be −∞ for some λ, ν

lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p?

proof: if x̃ is feasible and λ ⪰ 0, then

f0(x̃) ≥ L(x̃, λ, ν) ≥ inf_{x∈D} L(x, λ, ν) = g(λ, ν)

minimizing over all feasible x̃ gives p? ≥ g(λ, ν)

Duality 5–3
Least-norm solution of linear equations

minimize xT x
subject to Ax = b
dual function
• Lagrangian is L(x, ν) = xT x + ν T (Ax − b)
• to minimize L over x, set gradient equal to zero:

∇xL(x, ν) = 2x + AT ν = 0 =⇒ x = −(1/2)AT ν

• plug into L to obtain g:

g(ν) = L((−1/2)AT ν, ν) = −(1/4)ν T AAT ν − bT ν
a concave function of ν

lower bound property: p? ≥ −(1/4)ν T AAT ν − bT ν for all ν

Duality 5–4
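The lower bound property on this slide can be verified directly: for the least-norm problem, g(ν) ≤ p? for every ν, with equality at the dual optimum ν? = −2(AAT)−1b. A Python/numpy check on seeded random data:

```python
import numpy as np

rng = np.random.default_rng(6)
A, b = rng.standard_normal((3, 6)), rng.standard_normal(3)

# primal optimum of  minimize x^T x  s.t.  Ax = b:  x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

def g(nu):
    # dual function g(nu) = -(1/4) nu^T A A^T nu - b^T nu
    return -0.25 * nu @ (A @ A.T) @ nu - b @ nu

# lower bound property: g(nu) <= p* for any nu ...
for _ in range(100):
    assert g(rng.standard_normal(3)) <= p_star + 1e-9

# ... with equality at the dual optimum nu* = -2 (A A^T)^{-1} b
nu_star = -2.0 * np.linalg.solve(A @ A.T, b)
assert np.isclose(g(nu_star), p_star)
```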
Standard form LP

minimize cT x
subject to Ax = b, x ⪰ 0
dual function
• Lagrangian is

L(x, λ, ν) = cT x + ν T (Ax − b) − λT x
= −bT ν + (c + AT ν − λ)T x
• L is linear in x, hence

g(λ, ν) = inf_x L(x, λ, ν) = −bT ν if AT ν − λ + c = 0, −∞ otherwise

g is linear on affine domain {(λ, ν) | AT ν − λ + c = 0}, hence concave

lower bound property: p? ≥ −bT ν if AT ν + c ⪰ 0

Duality 5–5
Equality constrained norm minimization

minimize kxk
subject to Ax = b

dual function

g(ν) = inf_x (kxk − ν T Ax + bT ν) = bT ν if kAT νk∗ ≤ 1, −∞ otherwise
where kvk∗ = supkuk≤1 uT v is dual norm of k · k

proof: follows from inf x(kxk − y T x) = 0 if kyk∗ ≤ 1, −∞ otherwise


• if kyk∗ ≤ 1, then kxk − y T x ≥ 0 for all x, with equality if x = 0
• if kyk∗ > 1, choose x = tu where kuk ≤ 1, uT y = kyk∗ > 1:

kxk − y T x = t(kuk − kyk∗) → −∞ as t → ∞

lower bound property: p? ≥ bT ν if kAT νk∗ ≤ 1

Duality 5–6
Two-way partitioning

minimize xT W x
subject to x2i = 1, i = 1, . . . , n

• a nonconvex problem; feasible set contains 2n discrete points


• interpretation: partition {1, . . . , n} in two sets; Wij is cost of assigning
i, j to the same set; −Wij is cost of assigning to different sets

dual function

g(ν) = inf_x (xT W x + Σi νi(xi² − 1)) = inf_x xT (W + diag(ν))x − 1T ν
     = −1T ν if W + diag(ν) ⪰ 0, −∞ otherwise

lower bound property: p? ≥ −1T ν if W + diag(ν) ⪰ 0

example: ν = −λmin(W )1 gives bound p? ≥ nλmin(W )

Duality 5–7
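For small n the bound p? ≥ nλmin(W) can be compared against a brute-force enumeration of all 2^n partitions. A Python/numpy sketch on a seeded random cost matrix:

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n = 8
B = rng.standard_normal((n, n))
W = (B + B.T) / 2.0                     # symmetric cost matrix

# brute-force p* over all x in {-1, 1}^n (256 points for n = 8)
p_star = min(x @ W @ x for x in
             (np.array(s) for s in itertools.product((-1.0, 1.0), repeat=n)))

# dual bound from nu = -lambda_min(W) 1
bound = n * np.linalg.eigvalsh(W).min()
assert bound <= p_star + 1e-9
```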
Lagrange dual and conjugate function

minimize f0(x)
subject to Ax ⪯ b, Cx = d
dual function

g(λ, ν) = inf_{x∈dom f0} (f0(x) + (AT λ + C T ν)T x − bT λ − dT ν)
        = −f0∗(−AT λ − C T ν) − bT λ − dT ν

• recall definition of conjugate f ∗(y) = supx∈dom f (y T x − f (x))


• simplifies derivation of dual if conjugate of f0 is known

example: entropy maximization

f0(x) = Σ_{i=1}^{n} xi log xi,    f0∗(y) = Σ_{i=1}^{n} e^{yi−1}

Duality 5–8
The dual problem

Lagrange dual problem

maximize g(λ, ν)
subject to λ ⪰ 0

• finds best lower bound on p?, obtained from Lagrange dual function
• a convex optimization problem; optimal value denoted d?
• λ, ν are dual feasible if λ ⪰ 0, (λ, ν) ∈ dom g
• often simplified by making implicit constraint (λ, ν) ∈ dom g explicit

example: standard form LP and its dual (page 5–5)

minimize   cT x          maximize   −bT ν
subject to Ax = b        subject to AT ν + c ⪰ 0
           x ⪰ 0

Duality 5–9
Weak and strong duality
weak duality: d? ≤ p?
• always holds (for convex and nonconvex problems)
• can be used to find nontrivial lower bounds for difficult problems
for example, solving the SDP

maximize −1T ν
subject to W + diag(ν) ⪰ 0

gives a lower bound for the two-way partitioning problem on page 5–7

strong duality: d? = p?
• does not hold in general
• (usually) holds for convex problems
• conditions that guarantee strong duality in convex problems are called
constraint qualifications

Duality 5–10
Slater’s constraint qualification

strong duality holds for a convex problem

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b

if it is strictly feasible, i.e.,

∃x ∈ int D : fi(x) < 0, i = 1, . . . , m, Ax = b

• also guarantees that the dual optimum is attained (if p? > −∞)
• can be sharpened: e.g., can replace int D with relint D (interior
relative to affine hull); linear inequalities do not need to hold with strict
inequality, . . .
• there exist many other types of constraint qualifications

Duality 5–11
Inequality form LP

primal problem
minimize cT x
subject to Ax ⪯ b

dual function

g(λ) = inf_x ((c + AT λ)T x − bT λ) = −bT λ if AT λ + c = 0, −∞ otherwise

dual problem
maximize −bT λ
subject to AT λ + c = 0, λ ⪰ 0

• from Slater’s condition: p? = d? if Ax̃ ≺ b for some x̃


• in fact, p? = d? except when primal and dual are infeasible

Duality 5–12
Quadratic program
primal problem (assume P ∈ Sn++)
minimize xT P x
subject to Ax ⪯ b

dual function

g(λ) = inf_x (xT P x + λT (Ax − b)) = −(1/4)λT AP −1AT λ − bT λ

dual problem
maximize −(1/4)λT AP −1AT λ − bT λ
subject to λ ⪰ 0

• from Slater’s condition: p? = d? if Ax̃ ≺ b for some x̃


• in fact, p? = d? always

Duality 5–13
A nonconvex problem with strong duality

minimize xT Ax + 2bT x
subject to xT x ≤ 1
A ⋡ 0, hence nonconvex

dual function: g(λ) = inf_x (xT (A + λI)x + 2bT x − λ)

• unbounded below if A + λI ⋡ 0, or if A + λI ⪰ 0 and b ∉ R(A + λI)
• minimized by x = −(A + λI)†b otherwise: g(λ) = −bT (A + λI)†b − λ

dual problem and equivalent SDP:

maximize   −bT (A + λI)†b − λ        maximize   −t − λ
subject to A + λI ⪰ 0                subject to [A + λI  b; bT  t] ⪰ 0
           b ∈ R(A + λI)

strong duality although primal problem is not convex (not easy to show)

Duality 5–14
Geometric interpretation
for simplicity, consider problem with one constraint f1(x) ≤ 0
interpretation of dual function:

g(λ) = inf_{(u,t)∈G} (t + λu),    where G = {(f1(x), f0(x)) | x ∈ D}

(figures: set G in the (u, t)-plane, with p?, d?, g(λ), and the line λu + t = g(λ))

• λu + t = g(λ) is (non-vertical) supporting hyperplane to G


• hyperplane intersects t-axis at t = g(λ)

Duality 5–15
epigraph variation: same interpretation if G is replaced with

A = {(u, t) | f1(x) ≤ u, f0(x) ≤ t for some x ∈ D}


(figure: set A in the (u, t)-plane, with the supporting line λu + t = g(λ) intersecting the t-axis at g(λ))

strong duality
• holds if there is a non-vertical supporting hyperplane to A at (0, p?)
• for convex problem, A is convex, hence has supp. hyperplane at (0, p?)
• Slater’s condition: if there exist (ũ, t̃) ∈ A with ũ < 0, then supporting
hyperplanes at (0, p?) must be non-vertical

Duality 5–16
Complementary slackness

assume strong duality holds, x? is primal optimal, (λ?, ν ?) is dual optimal


f0(x?) = g(λ?, ν ?) = inf_x (f0(x) + Σ_{i=1}^{m} λ?i fi(x) + Σ_{i=1}^{p} νi? hi(x))
                    ≤ f0(x?) + Σ_{i=1}^{m} λ?i fi(x?) + Σ_{i=1}^{p} νi? hi(x?)
                    ≤ f0(x?)

hence, the two inequalities hold with equality

• x? minimizes L(x, λ?, ν ?)


• λ?ifi(x?) = 0 for i = 1, . . . , m (known as complementary slackness):

λ?i > 0 =⇒ fi(x?) = 0, fi(x?) < 0 =⇒ λ?i = 0

Duality 5–17
Karush-Kuhn-Tucker (KKT) conditions

the following four conditions are called KKT conditions (for a problem with
differentiable fi, hi):

1. primal constraints: fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p


2. dual constraints: λ ⪰ 0
3. complementary slackness: λifi(x) = 0, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:

∇f0(x) + Σ_{i=1}^{m} λi∇fi(x) + Σ_{i=1}^{p} νi∇hi(x) = 0

from page 5–17: if strong duality holds and x, λ, ν are optimal, then they
must satisfy the KKT conditions

Duality 5–18
KKT conditions for convex problem

if x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal:

• from complementary slackness: f0(x̃) = L(x̃, λ̃, ν̃)


• from 4th condition (and convexity): g(λ̃, ν̃) = L(x̃, λ̃, ν̃)

hence, f0(x̃) = g(λ̃, ν̃)

if Slater’s condition is satisfied:

x is optimal if and only if there exist λ, ν that satisfy KKT conditions

• recall that Slater implies strong duality, and dual optimum is attained
• generalizes optimality condition ∇f0(x) = 0 for unconstrained problem

Duality 5–19
example: water-filling (assume αi > 0)
Pn
minimize − i=1 log(xi + αi)
subject to x  0, 1T x = 1

x is optimal iff x ⪰ 0, 1T x = 1, and there exist λ ∈ Rn, ν ∈ R such that

λ ⪰ 0,    λixi = 0,    1/(xi + αi) + λi = ν

• if ν < 1/αi: λi = 0 and xi = 1/ν − αi
• if ν ≥ 1/αi: λi = ν − 1/αi and xi = 0
• determine ν from 1T x = Σ_{i=1}^{n} max{0, 1/ν − αi} = 1

interpretation

• n patches; level of patch i is at height αi
• flood area with unit amount of water
• resulting level is 1/ν?

(figure: patches at heights αi flooded with water up to level 1/ν?; xi is the depth of water above patch i)
Duality 5–20
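The water level 1/ν can be found by bisection on the equation above, after which x follows from the KKT conditions. A Python/numpy sketch with made-up patch heights:

```python
import numpy as np

alpha = np.array([0.3, 0.8, 1.1, 2.0])   # made-up patch heights, alpha_i > 0

# bisect on the water level w = 1/nu so that sum_i max(0, w - alpha_i) = 1
lo, hi = alpha.min(), alpha.min() + 1.0  # total water is 1, so level <= min + 1
for _ in range(100):
    w = 0.5 * (lo + hi)
    if np.sum(np.maximum(0.0, w - alpha)) < 1.0:
        lo = w
    else:
        hi = w

x = np.maximum(0.0, w - alpha)
assert np.isclose(x.sum(), 1.0)                      # 1^T x = 1
# KKT: x_i > 0 => 1/(x_i + alpha_i) = nu = 1/w;  x_i = 0 => alpha_i >= w
assert np.allclose((x + alpha)[x > 1e-9], w)
assert np.all(alpha[x <= 1e-9] >= w - 1e-9)
```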
Perturbation and sensitivity analysis
(unperturbed) optimization problem and its dual

minimize   f0(x)                          maximize   g(λ, ν)
subject to fi(x) ≤ 0, i = 1, . . . , m    subject to λ ⪰ 0
           hi(x) = 0, i = 1, . . . , p

perturbed problem and its dual

min. f0(x)                          max. g(λ, ν) − uT λ − v T ν
s.t. fi(x) ≤ ui, i = 1, . . . , m   s.t. λ ⪰ 0
     hi(x) = vi, i = 1, . . . , p

• x is primal variable; u, v are parameters


• p?(u, v) is optimal value as a function of u, v
• we are interested in information about p?(u, v) that we can obtain from
the solution of the unperturbed problem and its dual

Duality 5–21
global sensitivity result
assume strong duality holds for unperturbed problem, and that λ?, ν ? are
dual optimal for unperturbed problem
apply weak duality to perturbed problem:

p?(u, v) ≥ g(λ?, ν ?) − uT λ? − v T ν ?
= p?(0, 0) − uT λ? − v T ν ?

sensitivity interpretation

• if λ?i large: p? increases greatly if we tighten constraint i (ui < 0)


• if λ?i small: p? does not decrease much if we loosen constraint i (ui > 0)
• if νi? large and positive: p? increases greatly if we take vi < 0;
if νi? large and negative: p? increases greatly if we take vi > 0
• if νi? small and positive: p? does not decrease much if we take vi > 0;
if νi? small and negative: p? does not decrease much if we take vi < 0

Duality 5–22
local sensitivity: if (in addition) p?(u, v) is differentiable at (0, 0), then

λ?i = −∂p?(0, 0)/∂ui,    νi? = −∂p?(0, 0)/∂vi

proof (for λ?i): from global sensitivity result,

∂p?(0, 0)/∂ui = lim_{t↘0} (p?(tei, 0) − p?(0, 0))/t ≥ −λ?i
∂p?(0, 0)/∂ui = lim_{t↗0} (p?(tei, 0) − p?(0, 0))/t ≤ −λ?i

hence, equality

(figure: p?(u) for a problem with one (inequality) constraint, lying above the affine lower bound p?(0) − λ?u and touching it at u = 0)
Duality 5–23
Duality and problem reformulations

• equivalent formulations of a problem can lead to very different duals


• reformulating the primal problem can be useful when the dual is difficult
to derive, or uninteresting

common reformulations

• introduce new variables and equality constraints


• make explicit constraints implicit or vice-versa
• transform objective or constraint functions
e.g., replace f0(x) by φ(f0(x)) with φ convex, increasing

Duality 5–24
Introducing new variables and equality constraints

minimize f0(Ax + b)

• dual function is constant: g = inf x L(x) = inf x f0(Ax + b) = p?


• we have strong duality, but dual is quite useless

reformulated problem and its dual

minimize   f0(y)                  maximize   bT ν − f0∗(ν)
subject to Ax + b − y = 0         subject to AT ν = 0

dual function follows from

g(ν) = inf_{x,y} (f0(y) − ν T y + ν T Ax + bT ν)
     = −f0∗(ν) + bT ν if AT ν = 0, −∞ otherwise

Duality 5–25
norm approximation problem: minimize kAx − bk

minimize kyk
subject to y = Ax − b

can look up conjugate of k · k, or derive dual directly

g(ν) = inf_{x,y} (kyk + ν T y − ν T Ax + bT ν)
     = bT ν + inf_y (kyk + ν T y) if AT ν = 0, −∞ otherwise
     = bT ν if AT ν = 0, kνk∗ ≤ 1, −∞ otherwise

(see page 5–4)


dual of norm approximation problem

maximize bT ν
subject to AT ν = 0, kνk∗ ≤ 1

Duality 5–26
Implicit constraints
LP with box constraints: primal and dual problem

minimize   cT x              maximize   −bT ν − 1T λ1 − 1T λ2
subject to Ax = b            subject to c + AT ν + λ1 − λ2 = 0
           −1 ⪯ x ⪯ 1                   λ1 ⪰ 0, λ2 ⪰ 0

reformulation with box constraints made implicit


minimize   f0(x) = cT x if −1 ⪯ x ⪯ 1, ∞ otherwise
subject to Ax = b

dual function

g(ν) = inf_{−1⪯x⪯1} (cT x + ν T (Ax − b)) = −bT ν − kAT ν + ck1

dual problem: maximize −bT ν − kAT ν + ck1

Duality 5–27
Problems with generalized inequalities

minimize f0(x)
subject to fi(x) ⪯Ki 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p

⪯Ki is generalized inequality on Rki

definitions are parallel to scalar case:
• Lagrange multiplier for fi(x) ⪯Ki 0 is vector λi ∈ Rki
• Lagrangian L : Rn × Rk1 × · · · × Rkm × Rp → R, is defined as

L(x, λ1, . . . , λm, ν) = f0(x) + Σ_{i=1}^{m} λiT fi(x) + Σ_{i=1}^{p} νihi(x)

• dual function g : Rk1 × · · · × Rkm × Rp → R, is defined as

g(λ1, . . . , λm, ν) = inf_{x∈D} L(x, λ1, . . . , λm, ν)

Duality 5–28
lower bound property: if λi ⪰Ki∗ 0, then g(λ1, . . . , λm, ν) ≤ p?

proof: if x̃ is feasible and λi ⪰Ki∗ 0, then

f0(x̃) ≥ f0(x̃) + Σ_{i=1}^{m} λiT fi(x̃) + Σ_{i=1}^{p} νihi(x̃)
       ≥ inf_{x∈D} L(x, λ1, . . . , λm, ν)
       = g(λ1, . . . , λm, ν)

minimizing over all feasible x̃ gives p? ≥ g(λ1, . . . , λm, ν)


dual problem

maximize   g(λ1, . . . , λm, ν)
subject to λi ⪰Ki∗ 0, i = 1, . . . , m

• weak duality: p? ≥ d? always


• strong duality: p? = d? for convex problem with constraint qualification
(for example, Slater’s: primal problem is strictly feasible)

Duality 5–29
Semidefinite program
primal SDP (Fi, G ∈ Sk )

minimize cT x
subject to x1F1 + · · · + xnFn ⪯ G

• Lagrange multiplier is matrix Z ∈ Sk


• Lagrangian L(x, Z) = cT x + tr (Z(x1F1 + · · · + xnFn − G))
• dual function

g(Z) = inf_x L(x, Z) = − tr(GZ) if tr(FiZ) + ci = 0, i = 1, . . . , n; −∞ otherwise

dual SDP
maximize − tr(GZ)
subject to Z ⪰ 0, tr(FiZ) + ci = 0, i = 1, . . . , n

p? = d? if primal SDP is strictly feasible (∃x with x1F1 + · · · + xnFn ≺ G)

Duality 5–30
Convex Optimization — Boyd & Vandenberghe

6. Approximation and fitting

• norm approximation

• least-norm problems

• regularized approximation

• robust approximation

6–1
Norm approximation

minimize kAx − bk

(A ∈ Rm×n with m ≥ n, k · k is a norm on Rm)

interpretations of solution x? = argminx kAx − bk:


• geometric: Ax? is point in R(A) closest to b
• estimation: linear measurement model

y = Ax + v

y are measurements, x is unknown, v is measurement error


given y = b, best guess of x is x?
• optimal design: x are design variables (input), Ax is result (output)
x? is design that best approximates desired result b

Approximation and fitting 6–2


examples

• least-squares approximation (k · k2): solution satisfies normal equations

AT Ax = AT b

(x? = (AT A)−1AT b if rank A = n)

• Chebyshev approximation (k · k∞): can be solved as an LP

minimize t
subject to −t1 ⪯ Ax − b ⪯ t1

• sum of absolute residuals approximation (k · k1): can be solved as an LP

minimize 1T y
subject to −y ⪯ Ax − b ⪯ y

Approximation and fitting 6–3


Penalty function approximation

minimize φ(r1) + · · · + φ(rm)


subject to r = Ax − b

(A ∈ Rm×n, φ : R → R is a convex penalty function)

examples

• quadratic: φ(u) = u²
• deadzone-linear with width a: φ(u) = max{0, |u| − a}
• log-barrier with limit a: φ(u) = −a² log(1 − (u/a)²) if |u| < a, ∞ otherwise

(figure: quadratic, deadzone-linear, and log-barrier penalties φ(u) for u ∈ [−1.5, 1.5])

Approximation and fitting 6–4


example (m = 100, n = 30): histogram of residuals for penalties

φ(u) = |u|,    φ(u) = u²,    φ(u) = max{0, |u| − a},    φ(u) = −log(1 − u²)

(figure: histograms of the residuals r ∈ [−2, 2] for the p = 1, p = 2, deadzone, and log-barrier penalties)

shape of penalty function has large effect on distribution of residuals

Approximation and fitting 6–5


Huber
frag replacements penalty function (with parameter M )
2
PSfrag u |u| ≤ M
φhub(u) = replacements
M (2|u| − M ) |u| > M

linear growth for large u makes approximation less sensitive to outliers


(figures: left, φhub(u) versus u; right, f (t) versus t with the data points)

• left: Huber penalty for M = 1


• right: affine function f (t) = α + βt fitted to 42 points ti, yi (circles)
using quadratic (dashed) and Huber (solid) penalty

Approximation and fitting 6–6
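A direct transcription of φhub (a sketch; the parameter M is free, here M = 1):

```python
import numpy as np

def huber(u, M=1.0):
    """Huber penalty: quadratic for |u| <= M, linear growth M(2|u| - M) beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= M, u ** 2, M * (2.0 * np.abs(u) - M))

# quadratic region, boundary, and linear region
print(huber(0.5), huber(1.0), huber(2.0))   # 0.25, 1.0, 3.0
# a large residual is penalized far less than under the quadratic penalty
print(huber(10.0), 10.0 ** 2)               # 19.0 vs 100.0
```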


Least-norm problems

minimize kxk
subject to Ax = b

(A ∈ Rm×n with m ≤ n, k · k is a norm on Rn)

interpretations of solution x? = argminAx=b kxk:

• geometric: x? is point in affine set {x | Ax = b} with minimum distance to 0

• estimation: b = Ax are (perfect) measurements of x; x? is smallest (‘most plausible’) estimate consistent with measurements

• design: x are design variables (inputs); b are required results (outputs); x? is smallest (‘most efficient’) design that satisfies requirements

Approximation and fitting 6–7


examples
• least-squares solution of linear equations (k · k2):
can be solved via optimality conditions

2x + AT ν = 0, Ax = b

• minimum sum of absolute values (k · k1): can be solved as an LP

minimize 1T y
subject to −y ⪯ x ⪯ y, Ax = b

tends to produce sparse solution x?

extension: least-penalty problem

minimize φ(x1) + · · · + φ(xn)


subject to Ax = b

φ : R → R is convex penalty function

Approximation and fitting 6–8
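For k · k2 the optimality conditions above can be solved directly; a numpy sketch (random data as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 12))     # wide matrix, m <= n
b = rng.standard_normal(5)

# eliminate nu from 2x + A^T nu = 0, Ax = b:  x = A^T (A A^T)^{-1} b
x_ln = A.T @ np.linalg.solve(A @ A.T, b)

# the pseudoinverse gives the same minimum-norm solution
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(A @ x_ln, b), np.allclose(x_ln, x_pinv))
```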


Regularized approximation

minimize (w.r.t. R2+) (kAx − bk, kxk)

A ∈ Rm×n, norms on Rm and Rn can be different

interpretation: find good approximation Ax ≈ b with small x

• estimation: linear measurement model y = Ax + v, with prior


knowledge that kxk is small
• optimal design: small x is cheaper or more efficient, or the linear
model y = Ax is only valid for small x
• robust approximation: good approximation Ax ≈ b with small x is
less sensitive to errors in A than good approximation with large x

Approximation and fitting 6–9


Scalarized problem

minimize kAx − bk + γkxk

• solution for γ > 0 traces out optimal trade-off curve


• other common method: minimize kAx − bk2 + δkxk2 with δ > 0

Tikhonov regularization

minimize kAx − bk22 + δkxk22

can be solved as a least-squares problem


minimize k [ A ; √δ I ] x − [ b ; 0 ] k22    ([ A ; √δ I ] denotes A stacked above √δ I)

solution x? = (AT A + δI)−1AT b

Approximation and fitting 6–10
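Both routes to the Tikhonov solution agree; a small numpy check (sizes and δ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
delta = 0.1

# stacked least-squares formulation
A_stack = np.vstack([A, np.sqrt(delta) * np.eye(20)])
b_stack = np.concatenate([b, np.zeros(20)])
x_stack, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)

# closed-form solution (A^T A + delta I)^{-1} A^T b
x_cf = np.linalg.solve(A.T @ A + delta * np.eye(20), A.T @ b)

print(np.allclose(x_stack, x_cf))
```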


Optimal input design
linear dynamical system with impulse response h:

y(t) = Σ_{τ =0}^{t} h(τ )u(t − τ ), t = 0, 1, . . . , N

input design problem: multicriterion problem with 3 objectives


1. tracking error with desired output ydes: Jtrack = Σ_{t=0}^{N} (y(t) − ydes(t))2
2. input magnitude: Jmag = Σ_{t=0}^{N} u(t)2
3. input variation: Jder = Σ_{t=0}^{N −1} (u(t + 1) − u(t))2
track desired output using a small and slowly varying input signal
regularized least-squares formulation

minimize Jtrack + δJder + ηJmag

for fixed δ, η, a least-squares problem in u(0), . . . , u(N )

Approximation and fitting 6–11


example: 3 solutions on optimal trade-off curve

(top) δ = 0, small η; (middle) δ = 0, larger η; (bottom) large δ


(figures: input u(t) and output y(t), 0 ≤ t ≤ 200, for each of the three solutions)

Approximation and fitting 6–12


Signal reconstruction

minimize (w.r.t. R2+) (kx̂ − xcork2, φ(x̂))

• x ∈ Rn is unknown signal
• xcor = x + v is (known) corrupted version of x, with additive noise v
• variable x̂ (reconstructed signal) is estimate of x
• φ : Rn → R is regularization function or smoothing objective

examples: quadratic smoothing, total variation smoothing:

φquad(x̂) = Σ_{i=1}^{n−1} (x̂i+1 − x̂i)2,   φtv (x̂) = Σ_{i=1}^{n−1} |x̂i+1 − x̂i|

Approximation and fitting 6–13
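Scalarizing the quadratic-smoothing trade-off as minimize kx̂ − xcork22 + µ φquad(x̂) gives a linear system; a numpy sketch (the signal, noise level, and weight µ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.sin(np.linspace(0, 4 * np.pi, n))        # smooth 'true' signal
xcor = x + 0.3 * rng.standard_normal(n)         # corrupted signal

# optimality condition of ||xhat - xcor||^2 + mu*phi_quad(xhat):
# (I + mu D^T D) xhat = xcor, with D the (n-1) x n first-difference matrix
mu = 10.0
D = np.diff(np.eye(n), axis=0)
xhat = np.linalg.solve(np.eye(n) + mu * D.T @ D, xcor)

phi_quad = lambda v: np.sum(np.diff(v) ** 2)
print(phi_quad(xhat) < phi_quad(xcor),
      np.linalg.norm(xhat - x) < np.linalg.norm(xcor - x))
```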


quadratic smoothing example
(figures: left, original signal x and noisy signal xcor; right, three solutions on the trade-off curve of kx̂ − xcork2 versus φquad(x̂))

Approximation and fitting 6–14


total variation reconstruction example
(figures: left, original signal x and noisy signal xcor; right, three solutions on the trade-off curve of kx̂ − xcork2 versus φquad(x̂))

quadratic smoothing smooths out noise and sharp transitions in signal

Approximation and fitting 6–15


(figures: left, original signal x and noisy signal xcor; right, three solutions on the trade-off curve of kx̂ − xcork2 versus φtv (x̂))

total variation smoothing preserves sharp transitions in signal

Approximation and fitting 6–16


Robust approximation
minimize kAx − bk with uncertain A
two approaches:

• stochastic: assume A is random, minimize E kAx − bk


• worst-case: set A of possible values of A, minimize supA∈A kAx − bk
tractable only in special cases (certain norms k · k, distributions, sets A)

example: A(u) = A0 + uA1

• xnom minimizes kA0x − bk22
• xstoch minimizes E kA(u)x − bk22 with u uniform on [−1, 1]
• xwc minimizes sup−1≤u≤1 kA(u)x − bk22

(figure: r(u) = kA(u)x − bk2 versus u for the three solutions)
Approximation and fitting 6–17
stochastic robust LS with A = Ā + U , U random, E U = 0, E U T U = P

minimize E k(Ā + U )x − bk22

• explicit expression for objective:

E kAx − bk22 = E kĀx − b + U xk22


= kĀx − bk22 + E xT U T U x
= kĀx − bk22 + xT P x

• hence, robust LS problem is equivalent to LS problem

minimize kĀx − bk22 + kP 1/2xk22

• for P = δI, get Tikhonov regularized problem

minimize kĀx − bk22 + δkxk22

Approximation and fitting 6–18
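The equivalence can be confirmed numerically; a sketch with a diagonal P (all data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
Abar = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
P = np.diag(rng.uniform(0.1, 1.0, 10))     # E U^T U = P (diagonal here for simplicity)

# regularized LS formulation: stack Abar on P^{1/2}
A_stack = np.vstack([Abar, np.sqrt(P)])    # entrywise sqrt is P^{1/2} for diagonal P
b_stack = np.concatenate([b, np.zeros(10)])
x_rls, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)

# same solution from the normal equations (Abar^T Abar + P) x = Abar^T b
x_ne = np.linalg.solve(Abar.T @ Abar + P, Abar.T @ b)
print(np.allclose(x_rls, x_ne))
```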


worst-case robust LS with A = {Ā + u1A1 + · · · + upAp | kuk2 ≤ 1}

minimize supA∈A kAx − bk22 = supkuk2≤1 kP (x)u + q(x)k22


 
where P (x) = [ A1x A2x · · · Apx ], q(x) = Āx − b

• from page 5–14, strong duality holds between the following two problems

maximize kP u + qk22
subject to kuk22 ≤ 1

minimize t + λ
subject to
  [ I     P    q ]
  [ P T   λI   0 ] ⪰ 0
  [ q T   0    t ]

• hence, robust LS problem is equivalent to SDP

minimize t + λ
subject to
  [ I        P (x)   q(x) ]
  [ P (x)T   λI      0    ] ⪰ 0
  [ q(x)T    0       t    ]

Approximation and fitting 6–19



example: histogram of residuals

r(u) = k(A0 + u1A1 + u2A2)x − bk2

with u uniformly distributed on unit disk, for three values of x


(figure: histograms of r(u) for xls, xtik, and xrls)

• xls minimizes kA0x − bk2


• xtik minimizes kA0x − bk22 + kxk22 (Tikhonov solution)
• xrls minimizes supkuk2≤1 k(A0 + u1A1 + u2A2)x − bk22

Approximation and fitting 6–20


Convex Optimization — Boyd & Vandenberghe

7. Statistical estimation

• maximum likelihood estimation

• optimal detector design

• experiment design

7–1
Parametric distribution estimation

• distribution estimation problem: estimate probability density p(y) of a


random variable from observed values
• parametric distribution estimation: choose from a family of densities
px(y), indexed by a parameter x

maximum likelihood estimation

maximize (over x) log px(y)

• y is observed value
• l(x) = log px(y) is called log-likelihood function
• can add constraints x ∈ C explicitly, or define px(y) = 0 for x ∉ C
• a convex optimization problem if log px(y) is concave in x for fixed y

Statistical estimation 7–2


Linear measurements with IID noise

linear measurement model

yi = aTi x + vi, i = 1, . . . , m

• x ∈ Rn is vector of unknown parameters


• vi is IID measurement noise, with density p(z)
• yi is measurement: y ∈ Rm has density px(y) = ∏_{i=1}^{m} p(yi − aTi x)

maximum likelihood estimate: any solution x of


maximize l(x) = Σ_{i=1}^{m} log p(yi − aTi x)

(y is observed value)

Statistical estimation 7–3


examples

• Gaussian noise N (0, σ2): p(z) = (2πσ2)−1/2 e−z2/(2σ2),

l(x) = −(m/2) log(2πσ2) − (1/(2σ2)) Σ_{i=1}^{m} (aTi x − yi)2

ML estimate is LS solution

• Laplacian noise: p(z) = (1/(2a)) e−|z|/a,

l(x) = −m log(2a) − (1/a) Σ_{i=1}^{m} |aTi x − yi|

ML estimate is `1-norm solution

• uniform noise on [−a, a]:

l(x) = −m log(2a) if |aTi x − yi| ≤ a for i = 1, . . . , m; l(x) = −∞ otherwise

ML estimate is any x with |aTi x − yi| ≤ a for all i

Statistical estimation 7–4


Logistic regression
random variable y ∈ {0, 1} with distribution

p = prob(y = 1) = exp(aT u + b) / (1 + exp(aT u + b))

• a, b are parameters; u ∈ Rn are (observable) explanatory variables


• estimation problem: estimate a, b from m observations (ui, yi)

log-likelihood function (for y1 = · · · = yk = 1, yk+1 = · · · = ym = 0):

l(a, b) = log ( ∏_{i=1}^{k} exp(aT ui + b)/(1 + exp(aT ui + b)) · ∏_{i=k+1}^{m} 1/(1 + exp(aT ui + b)) )

= Σ_{i=1}^{k} (aT ui + b) − Σ_{i=1}^{m} log(1 + exp(aT ui + b))

concave in a, b

Statistical estimation 7–5
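Since l(a, b) is concave, it can be maximized by Newton's method; a sketch on synthetic data (the true parameter values, sample size, and number of iterations are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200
u = rng.uniform(0, 10, m)
p_true = 1 / (1 + np.exp(-(1.0 * u - 5.0)))          # 'true' a = 1, b = -5
y = (rng.uniform(size=m) < p_true).astype(float)

X = np.column_stack([u, np.ones(m)])                 # parameters theta = (a, b)
theta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ theta))
    grad = X.T @ (y - p)                             # gradient of l
    H = X.T @ (X * (p * (1 - p))[:, None])           # -Hessian of l (positive definite)
    theta = theta + np.linalg.solve(H, grad)         # Newton step

p = 1 / (1 + np.exp(-X @ theta))
print(theta, np.linalg.norm(X.T @ (y - p)))          # gradient ~ 0 at the ML estimate
```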


example (n = 1, m = 50 measurements)

(figure: the data points (ui, yi) and the fitted curve prob(y = 1) versus u, 0 ≤ u ≤ 10)

• circles show 50 points (ui, yi)


• solid curve is ML estimate of p = exp(au + b)/(1 + exp(au + b))

Statistical estimation 7–6


(Binary) hypothesis testing

detection (hypothesis testing) problem

given observation of a random variable X ∈ {1, . . . , n}, choose between:

• hypothesis 1: X was generated by distribution p = (p1, . . . , pn)


• hypothesis 2: X was generated by distribution q = (q1, . . . , qn)

randomized detector

• a nonnegative matrix T ∈ R2×n, with 1T T = 1T


• if we observe X = k, we choose hypothesis 1 with probability t1k ,
hypothesis 2 with probability t2k
• if all elements of T are 0 or 1, it is called a deterministic detector

Statistical estimation 7–7


detection probability matrix:

D = [ T p  T q ] =
  [ 1 − Pfp   Pfn     ]
  [ Pfp       1 − Pfn ]

• Pfp is probability of selecting hypothesis 2 if X is generated by


distribution 1 (false positive)
• Pfn is probability of selecting hypothesis 1 if X is generated by
distribution 2 (false negative)

multicriterion formulation of detector design

minimize (w.r.t. R2+) (Pfp, Pfn) = ((T p)2, (T q)1)


subject to t1k + t2k = 1, k = 1, . . . , n
tik ≥ 0, i = 1, 2, k = 1, . . . , n

variable T ∈ R2×n

Statistical estimation 7–8


scalarization (with weight λ > 0)

minimize (T p)2 + λ(T q)1


subject to t1k + t2k = 1, tik ≥ 0, i = 1, 2, k = 1, . . . , n

an LP with a simple analytical solution

(t1k , t2k ) = (1, 0) if pk ≥ λqk ,   (t1k , t2k ) = (0, 1) if pk < λqk

• a deterministic detector, given by a likelihood ratio test


• if pk = λqk for some k, any value 0 ≤ t1k ≤ 1, t1k = 1 − t2k is optimal
(i.e., Pareto-optimal detectors include non-deterministic detectors)

minimax detector

minimize max{Pfp, Pfn} = max{(T p)2, (T q)1}


subject to t1k + t2k = 1, tik ≥ 0, i = 1, 2, k = 1, . . . , n

an LP; solution is usually not deterministic

Statistical estimation 7–9
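The analytical solution is easy to evaluate; a sketch using the two columns of the example matrix on page 7–10 as the distributions p and q:

```python
import numpy as np

# columns of the example matrix on page 7-10
p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.10, 0.10, 0.70, 0.10])

def lrt_detector(p, q, lam):
    """Deterministic likelihood ratio test: hypothesis 1 wherever p_k >= lam * q_k."""
    t1 = (p >= lam * q).astype(float)
    return np.vstack([t1, 1.0 - t1])        # T in R^{2 x n}; columns sum to 1

T = lrt_detector(p, q, lam=1.0)
P_fp = (T @ p)[1]                            # choose hypothesis 2 while X ~ p
P_fn = (T @ q)[0]                            # choose hypothesis 1 while X ~ q
print(P_fp, P_fn)                            # 0.10, 0.20
```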



example

        [ 0.70  0.10 ]
    P = [ 0.20  0.10 ]
        [ 0.05  0.70 ]
        [ 0.05  0.10 ]

(figure: trade-off curve of Pfn versus Pfp , with solutions 1, 2, 3, 4 marked)

solutions 1, 2, 3 (and endpoints) are deterministic; 4 is minimax detector

Statistical estimation 7–10


Experiment design
m linear measurements yi = aTi x + wi, i = 1, . . . , m of unknown x ∈ Rn
• measurement errors wi are IID N (0, 1)
• ML (least-squares) estimate is

x̂ = ( Σ_{i=1}^{m} ai aTi )−1 Σ_{i=1}^{m} yi ai

• error e = x̂ − x has zero mean and covariance

E = E eeT = ( Σ_{i=1}^{m} ai aTi )−1

confidence ellipsoids are given by {x | (x − x̂)T E −1(x − x̂) ≤ β}

experiment design: choose ai ∈ {v1, . . . , vp} (a set of possible test vectors) to make E ‘small’

Statistical estimation 7–11


vector optimization formulation
minimize (w.r.t. Sn+) E = ( Σ_{k=1}^{p} mk vk vkT )−1
subject to mk ≥ 0, m1 + · · · + mp = m, mk ∈ Z

• variables are mk (# vectors ai equal to vk )


• difficult in general, due to integer constraint

relaxed experiment design


assume m ≫ p, use λk = mk /m as (continuous) real variable

minimize (w.r.t. Sn+) E = (1/m) ( Σ_{k=1}^{p} λk vk vkT )−1
subject to λ ⪰ 0, 1T λ = 1

• common scalarizations: minimize log det E, tr E, λmax(E), . . .


• can add other convex constraints, e.g., bound experiment cost cT λ ≤ B

Statistical estimation 7–12


D-optimal design
minimize log det ( Σ_{k=1}^{p} λk vk vkT )−1
subject to λ ⪰ 0, 1T λ = 1

interpretation: minimizes volume of confidence ellipsoids

dual problem

maximize log det W + n log n
subject to vkT W vk ≤ 1, k = 1, . . . , p

interpretation: {x | xT W x ≤ 1} is minimum volume ellipsoid centered at origin, that includes all test vectors vk

complementary slackness: for λ, W primal and dual optimal,

λk (1 − vkT W vk ) = 0, k = 1, . . . , p

optimal experiment uses vectors vk on boundary of ellipsoid defined by W

Statistical estimation 7–13
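One classical way to compute a relaxed D-optimal design numerically is the multiplicative fixed-point update λk ← λk (vkT M (λ)−1vk )/n, where M (λ) = Σ_{k} λk vk vkT (this particular algorithm is not from the slides; the random test vectors are placeholders):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 2, 20
V = rng.standard_normal((p, n))            # test vectors v_1, ..., v_p as rows

lam = np.full(p, 1.0 / p)                  # start from the uniform design
for _ in range(500):
    M = V.T @ (V * lam[:, None])           # M(lambda) = sum_k lambda_k v_k v_k^T
    w = np.sum((V @ np.linalg.inv(M)) * V, axis=1)   # w_k = v_k^T M^{-1} v_k
    lam *= w / n                           # keeps lam >= 0 and sum(lam) == 1

# with W = M^{-1}/n, near-optimal designs satisfy v_k^T W v_k <= 1,
# with equality on the vectors that carry weight (complementary slackness)
print(np.sort(lam)[-4:], w.max() / n)
```

Each update is known to increase log det M (λ) monotonically, so the scheme is a simple (if slow) alternative to a general convex solver.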


example (p = 20)
λ1 = 0.5

PSfrag replacements
λ2 = 0.5

design uses two vectors, on boundary of ellipse defined by optimal W

Statistical estimation 7–14


derivation of dual of page 7–13
first reformulate primal problem with new variable X:

minimize log det X −1
subject to X = Σ_{k=1}^{p} λk vk vkT , λ ⪰ 0, 1T λ = 1

L(X, λ, Z, z, ν) = log det X −1 + tr( Z ( X − Σ_{k=1}^{p} λk vk vkT ) ) − z T λ + ν(1T λ − 1)

• minimize over X by setting gradient to zero: −X −1 + Z = 0
• minimum over λk is −∞ unless −vkT Zvk − zk + ν = 0

dual problem

maximize n + log det Z − ν


subject to vkT Zvk ≤ ν, k = 1, . . . , p

change variable W = Z/ν, and optimize over ν to get dual of page 7–13

Statistical estimation 7–15


Convex Optimization — Boyd & Vandenberghe

8. Geometric problems

• extremal volume ellipsoids

• centering

• classification

• placement and facility location

8–1
Minimum volume ellipsoid around a set

Löwner-John ellipsoid of a set C: minimum volume ellipsoid E s.t. C ⊆ E


• parametrize E as E = {v | kAv + bk2 ≤ 1}; w.l.o.g. assume A ∈ Sn++
• vol E is proportional to det A−1; to compute minimum volume ellipsoid,

minimize (over A, b) log det A−1


subject to supv∈C kAv + bk2 ≤ 1

convex, but evaluating the constraint can be hard (for general C)

finite set C = {x1, . . . , xm}:

minimize (over A, b) log det A−1


subject to kAxi + bk2 ≤ 1, i = 1, . . . , m

also gives Löwner-John ellipsoid for polyhedron conv{x1, . . . , xm}

Geometric problems 8–2


Maximum volume inscribed ellipsoid
maximum volume ellipsoid E inside a convex set C ⊆ Rn

• parametrize E as E = {Bu + d | kuk2 ≤ 1}; w.l.o.g. assume B ∈ Sn++


• vol E is proportional to det B; can compute E by solving

maximize log det B


subject to supkuk2≤1 IC (Bu + d) ≤ 0

(where IC (x) = 0 for x ∈ C and IC (x) = ∞ for x 6∈ C)


convex, but evaluating the constraint can be hard (for general C)

polyhedron {x | aTi x ≤ bi, i = 1, . . . , m}:

maximize log det B


subject to kBaik2 + aTi d ≤ bi, i = 1, . . . , m

(constraint follows from supkuk2≤1 aTi (Bu + d) = kBaik2 + aTi d)

Geometric problems 8–3


Efficiency of ellipsoidal approximations

C ⊆ Rn convex, bounded, with nonempty interior

• Löwner-John ellipsoid, shrunk by a factor n, lies inside C


• maximum volume inscribed ellipsoid, expanded by a factor n, covers C

example (for two polyhedra in R2)



factor n can be improved to √n if C is symmetric

Geometric problems 8–4


Centering

some possible definitions of ‘center’ of a convex set C:

• center of largest inscribed ball (’Chebyshev center’)


for polyhedron, can be computed via linear programming (page 4–19)
• center of maximum volume inscribed ellipsoid (page 8–3)

(figure: Chebyshev center xcheb and maximum volume ellipsoid center xmve of a polyhedron)

MVE center is invariant under affine coordinate transformations

Geometric problems 8–5


Analytic center of a set of inequalities

the analytic center of set of convex inequalities and linear equations

fi(x) ≤ 0, i = 1, . . . , m, Fx = g

is defined as the optimal point of


minimize − Σ_{i=1}^{m} log(−fi(x))
subject to F x = g

• more easily computed than MVE or Chebyshev center (see later)


• not just a property of the feasible set: two sets of inequalities can
describe the same set, but have different analytic centers

Geometric problems 8–6


analytic center of linear inequalities aTi x ≤ bi, i = 1, . . . , m

xac is minimizer of

φ(x) = − Σ_{i=1}^{m} log(bi − aTi x)

(figure: a polyhedron and its analytic center xac)

inner and outer ellipsoids from analytic center:

Einner ⊆ {x | aTi x ≤ bi, i = 1, . . . , m} ⊆ Eouter

where

Einner = {x | (x − xac)T ∇2φ(xac)(x − xac) ≤ 1}


Eouter = {x | (x − xac)T ∇2φ(xac)(x − xac) ≤ m(m − 1)}

Geometric problems 8–7


Linear discrimination
separate two sets of points {x1, . . . , xN }, {y1, . . . , yM } by a hyperplane:

aT xi + b > 0, i = 1, . . . , N, aT yi + b < 0, i = 1, . . . , M

homogeneous in a, b, hence equivalent to

aT xi + b ≥ 1, i = 1, . . . , N, aT yi + b ≤ −1, i = 1, . . . , M

a set of linear inequalities in a, b

Geometric problems 8–8


Robust linear discrimination

(Euclidean) distance between hyperplanes

H1 = {z | aT z + b = 1}
H2 = {z | aT z + b = −1}

is dist(H1, H2) = 2/kak2

to separate two sets of points by maximum margin,

minimize (1/2)kak2
subject to aT xi + b ≥ 1, i = 1, . . . , N (1)
aT yi + b ≤ −1, i = 1, . . . , M

(after squaring objective) a QP in a, b

Geometric problems 8–9


Lagrange dual of maximum margin separation problem (1)

maximize 1T λ + 1T µ
subject to 2 k Σ_{i=1}^{N} λi xi − Σ_{i=1}^{M} µi yi k2 ≤ 1   (2)
1T λ = 1T µ, λ ⪰ 0, µ ⪰ 0

from duality, optimal value is inverse of maximum margin of separation

interpretation

• change variables to θi = λi/1T λ, γi = µi/1T µ, t = 1/(1T λ + 1T µ)


• invert objective to minimize 1/(1T λ + 1T µ) = t

minimize t
subject to k Σ_{i=1}^{N} θi xi − Σ_{i=1}^{M} γi yi k2 ≤ t
θ ⪰ 0, 1T θ = 1, γ ⪰ 0, 1T γ = 1

optimal value is distance between convex hulls

Geometric problems 8–10


Approximate linear separation of non-separable sets

minimize 1T u + 1T v
subject to aT xi + b ≥ 1 − ui, i = 1, . . . , N
aT yi + b ≤ −1 + vi, i = 1, . . . , M
u ⪰ 0, v ⪰ 0
• an LP in a, b, u, v
• at optimum, ui = max{0, 1 − aT xi − b}, vi = max{0, 1 + aT yi + b}
• can be interpreted as a heuristic for minimizing #misclassified points

Geometric problems 8–11
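The LP can be handed to any LP solver; a sketch with scipy.optimize.linprog on made-up overlapping point clouds (the variable ordering z = (a, b, u, v) is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
N = M = 30
X = rng.standard_normal((N, 2)) + 1.0      # class 1 points x_i (overlapping clouds)
Y = rng.standard_normal((M, 2)) - 1.0      # class 2 points y_i

ndim = 2
c = np.concatenate([np.zeros(ndim + 1), np.ones(N + M)])   # minimize 1^T u + 1^T v
A_ub = np.zeros((N + M, ndim + 1 + N + M))
A_ub[:N, :ndim] = -X                       # -(a^T x_i + b) - u_i <= -1
A_ub[:N, ndim] = -1.0
A_ub[:N, ndim + 1:ndim + 1 + N] = -np.eye(N)
A_ub[N:, :ndim] = Y                        #  (a^T y_i + b) - v_i <= -1
A_ub[N:, ndim] = 1.0
A_ub[N:, ndim + 1 + N:] = -np.eye(M)
b_ub = -np.ones(N + M)
bounds = [(None, None)] * (ndim + 1) + [(0, None)] * (N + M)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a, b0 = res.x[:ndim], res.x[ndim]
errors = np.sum(X @ a + b0 < 0) + np.sum(Y @ a + b0 > 0)
print(res.status, res.fun, errors)
```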


Support vector classifier

minimize kak2 + γ(1T u + 1T v)


subject to aT xi + b ≥ 1 − ui, i = 1, . . . , N
aT yi + b ≤ −1 + vi, i = 1, . . . , M
u ⪰ 0, v ⪰ 0
produces point on trade-off curve between inverse of margin 2/kak2 and
classification error, measured by total slack 1T u + 1T v

same example as previous page, with γ = 0.1 (figure: resulting classifier)

Geometric problems 8–12


Nonlinear discrimination

separate two sets of points by a nonlinear function:

f (xi) > 0, i = 1, . . . , N, f (yi) < 0, i = 1, . . . , M

• choose a linearly parametrized family of functions

f (z) = θ T F (z)

F = (F1, . . . , Fk ) : Rn → Rk are basis functions

• solve a set of linear inequalities in θ:

θT F (xi) ≥ 1, i = 1, . . . , N, θ T F (yi) ≤ −1, i = 1, . . . , M

Geometric problems 8–13


quadratic discrimination: f (z) = z T P z + q T z + r

xTi P xi + q T xi + r ≥ 1, yiT P yi + q T yi + r ≤ −1

can add additional constraints (e.g., P ⪯ −I to separate by an ellipsoid)


polynomial discrimination: F (z) are all monomials up to a given degree

(figures: separation by ellipsoid; separation by 4th degree polynomial)

Geometric problems 8–14


Placement and facility location

• N points with coordinates xi ∈ R2 (or R3)


• some positions xi are given; the other xi’s are variables
• for each pair of points, a cost function fij (xi, xj )

placement problem
minimize Σ_{i≠j} fij (xi, xj )

variables are positions of free points


interpretations
• points represent plants or warehouses; fij is transportation cost between
facilities i and j
• points represent cells on an IC; fij represents wirelength

Geometric problems 8–15


example: minimize Σ_{(i,j)∈A} h(kxi − xj k2), with 6 free points, 27 links

optimal placement for h(z) = z, h(z) = z 2, h(z) = z 4


(figures: the three optimal placements, and histograms of the connection lengths kxi − xj k2 for each)

Geometric problems 8–16


Convex Optimization — Boyd & Vandenberghe

9. Numerical linear algebra background

• matrix structure and algorithm complexity

• solving linear equations with factored matrices

• LU, Cholesky, LDLT factorization

• block elimination and the matrix inversion lemma

• solving underdetermined equations

9–1
Matrix structure and algorithm complexity

cost (execution time) of solving Ax = b with A ∈ Rn×n

• for general methods, grows as n3


• less if A is structured (banded, sparse, Toeplitz, . . . )

flop counts

• flop (floating-point operation): one addition, subtraction,


multiplication, or division of two floating-point numbers
• to estimate complexity of an algorithm: express number of flops as a
(polynomial) function of the problem dimensions, and simplify by
keeping only the leading terms
• not an accurate predictor of computation time on modern computers
• useful as a rough estimate of complexity

Numerical linear algebra background 9–2


vector-vector operations (x, y ∈ Rn)

• inner product xT y: 2n − 1 flops (or 2n if n is large)


• sum x + y, scalar multiplication αx: n flops

matrix-vector product y = Ax with A ∈ Rm×n

• m(2n − 1) flops (or 2mn if n large)


• 2N if A is sparse with N nonzero elements
• 2p(n + m) if A is given as A = U V T , U ∈ Rm×p, V ∈ Rn×p

matrix-matrix product C = AB with A ∈ Rm×n, B ∈ Rn×p

• mp(2n − 1) flops (or 2mnp if n large)


• less if A and/or B are sparse
• (1/2)m(m + 1)(2n − 1) ≈ m2n if m = p and C symmetric

Numerical linear algebra background 9–3


Linear equations that are easy to solve

diagonal matrices (aij = 0 if i 6= j): n flops

x = A−1b = (b1/a11, . . . , bn/ann)

lower triangular (aij = 0 if j > i): n2 flops

x1 := b1/a11
x2 := (b2 − a21x1)/a22
x3 := (b3 − a31x1 − a32x2)/a33
..

xn := (bn − an1x1 − an2x2 − · · · − an,n−1xn−1)/ann

called forward substitution


upper triangular (aij = 0 if j < i): n2 flops via backward substitution

Numerical linear algebra background 9–4
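Forward substitution in a few lines of numpy (the test matrix is random, made well conditioned by boosting the diagonal):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L in about n^2 flops."""
    n = len(b)
    x = np.empty(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(9)
L = np.tril(rng.standard_normal((6, 6))) + 6.0 * np.eye(6)
b = rng.standard_normal(6)
print(np.allclose(forward_substitution(L, b), np.linalg.solve(L, b)))
```

Backward substitution for an upper triangular system is the same loop run from the last row up.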


orthogonal matrices: A−1 = AT
• 2n2 flops to compute x = AT b for general A
• less with structure, e.g., if A = I − 2uuT with kuk2 = 1, we can
compute x = AT b = b − 2(uT b)u in 4n flops

permutation matrices:

aij = 1 if j = πi ,   aij = 0 otherwise

where π = (π1, π2, . . . , πn) is a permutation of (1, 2, . . . , n)


• interpretation: Ax = (xπ1 , . . . , xπn )
• satisfies A−1 = AT , hence cost of solving Ax = b is 0 flops
example:

        [ 0 1 0 ]                   [ 0 0 1 ]
    A = [ 0 0 1 ] ,   A−1 = AT =    [ 1 0 0 ]
        [ 1 0 0 ]                   [ 0 1 0 ]

Numerical linear algebra background 9–5


The factor-solve method for solving Ax = b

• factor A as a product of simple matrices (usually 2 or 3):

A = A 1 A2 · · · A k

(Ai diagonal, upper or lower triangular, etc)

• compute x = A−1b = Ak⁻¹ · · · A2⁻¹A1⁻¹b by solving k ‘easy’ equations

A1x1 = b, A2x2 = x1, . . . , Ak x = xk−1

cost of factorization step usually dominates cost of solve step


equations with multiple righthand sides

Ax1 = b1, Ax2 = b2, ..., Axm = bm

cost: one factorization plus m solves

Numerical linear algebra background 9–6
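With scipy, the factor-solve split is explicit: lu_factor does the O(n3) work once, and each lu_solve is a cheap pair of triangular solves (sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(10)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 5))          # five right-hand sides

lu, piv = lu_factor(A)                     # one factorization A = PLU
X = np.column_stack([lu_solve((lu, piv), B[:, j]) for j in range(B.shape[1])])

print(np.allclose(A @ X, B))
```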


LU factorization
every nonsingular matrix A can be factored as

A = P LU

with P a permutation matrix, L lower triangular, U upper triangular


cost: (2/3)n3 flops

Solving linear equations by LU factorization.

given a set of linear equations Ax = b, with A nonsingular.


1. LU factorization. Factor A as A = P LU ((2/3)n3 flops).
2. Permutation. Solve P z1 = b (0 flops).
3. Forward substitution. Solve Lz2 = z1 (n2 flops).
4. Backward substitution. Solve U x = z2 (n2 flops).

cost: (2/3)n3 + 2n2 ≈ (2/3)n3 for large n

Numerical linear algebra background 9–7


sparse LU factorization

A = P1LU P2

• adding permutation matrix P2 offers possibility of sparser L, U (hence,


cheaper factor and solve steps)

• P1 and P2 chosen (heuristically) to yield sparse L, U

• choice of P1 and P2 depends on sparsity pattern and values of A

• cost is usually much less than (2/3)n3; exact value depends in a


complicated way on n, number of zeros in A, sparsity pattern

Numerical linear algebra background 9–8


Cholesky factorization

every positive definite A can be factored as

A = LLT

with L lower triangular


cost: (1/3)n3 flops

Solving linear equations by Cholesky factorization.

given a set of linear equations Ax = b, with A ∈ Sn++.


1. Cholesky factorization. Factor A as A = LLT ((1/3)n3 flops).
2. Forward substitution. Solve Lz1 = b (n2 flops).
3. Backward substitution. Solve LT x = z1 (n2 flops).

cost: (1/3)n3 + 2n2 ≈ (1/3)n3 for large n

Numerical linear algebra background 9–9
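The same factor-solve pattern with the Cholesky factor and the two triangular solves written out by hand (pure numpy; the matrix is a random positive definite stand-in):

```python
import numpy as np

rng = np.random.default_rng(11)
G = rng.standard_normal((8, 8))
A = G @ G.T + 8.0 * np.eye(8)              # positive definite
b = rng.standard_normal(8)

L = np.linalg.cholesky(A)                  # A = L L^T, L lower triangular

z = np.empty(8)                            # forward substitution: L z = b
for i in range(8):
    z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]

x = np.empty(8)                            # backward substitution: L^T x = z
for i in reversed(range(8)):
    x[i] = (z[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]

print(np.allclose(A @ x, b))
```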


sparse Cholesky factorization

A = P LLT P T

• adding permutation matrix P offers possibility of sparser L

• P chosen (heuristically) to yield sparse L

• choice of P only depends on sparsity pattern of A (unlike sparse LU)

• cost is usually much less than (1/3)n3; exact value depends in a


complicated way on n, number of zeros in A, sparsity pattern

Numerical linear algebra background 9–10


LDLT factorization

every nonsingular symmetric matrix A can be factored as

A = P LDLT P T

with P a permutation matrix, L lower triangular, D block diagonal with


1 × 1 or 2 × 2 diagonal blocks

cost: (1/3)n3

• cost of solving symmetric sets of linear equations by LDLT factorization:


(1/3)n3 + 2n2 ≈ (1/3)n3 for large n
• for sparse A, can choose P to yield sparse L; cost ≪ (1/3)n3

Numerical linear algebra background 9–11


Equations with structured sub-blocks
    
[ A11  A12 ] [ x1 ]   =   [ b1 ]      (1)
[ A21  A22 ] [ x2 ]       [ b2 ]

• variables x1 ∈ Rn1 , x2 ∈ Rn2 ; blocks Aij ∈ Rni×nj


• if A11 is nonsingular, can eliminate x1: x1 = A11⁻¹(b1 − A12x2);
to compute x2, solve

(A22 − A21A11⁻¹A12)x2 = b2 − A21A11⁻¹b1

Solving linear equations by block elimination.

given a nonsingular set of linear equations (1), with A11 nonsingular.


1. Form A11⁻¹A12 and A11⁻¹b1.
2. Form S = A22 − A21A11⁻¹A12 and b̃ = b2 − A21A11⁻¹b1.
3. Determine x2 by solving Sx2 = b̃.
4. Determine x1 by solving A11x1 = b1 − A12x2.

Numerical linear algebra background 9–12
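A sketch of the four steps for a diagonal A11, checked against a direct solve (block sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
n1, n2 = 6, 4
d = rng.uniform(1.0, 2.0, n1)              # A11 = diag(d): structured, cheap to invert
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2)) + 5.0 * np.eye(n2)
b1, b2 = rng.standard_normal(n1), rng.standard_normal(n2)

S = A22 - A21 @ (A12 / d[:, None])         # Schur complement A22 - A21 A11^{-1} A12
x2 = np.linalg.solve(S, b2 - A21 @ (b1 / d))
x1 = (b1 - A12 @ x2) / d                   # back-substitute for x1

A = np.block([[np.diag(d), A12], [A21, A22]])
x_direct = np.linalg.solve(A, np.concatenate([b1, b2]))
print(np.allclose(np.concatenate([x1, x2]), x_direct))
```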


dominant terms in flop count

• step 1: f + n2s (f is cost of factoring A11; s is cost of solve step)


• step 2: 2n22n1 (cost dominated by product of A21 and A−1
11 A12 )

• step 3: (2/3)n32

total: f + n2s + 2n22n1 + (2/3)n32

examples
• general A11 (f = (2/3)n31, s = 2n21): no gain over standard method

#flops = (2/3)n31 + 2n21n2 + 2n22n1 + (2/3)n32 = (2/3)(n1 + n2)3

• block elimination is useful for structured A11 (f  n31)


for example, diagonal (f = 0, s = n1): #flops ≈ 2n22n1 + (2/3)n32

Numerical linear algebra background 9–13


Structured matrix plus low rank term

(A + BC)x = b

• A ∈ Rn×n, B ∈ Rn×p, C ∈ Rp×n


• assume A has structure (Ax = b easy to solve)

first write as

[ A   B  ] [ x ]   =   [ b ]
[ C   −I ] [ y ]       [ 0 ]

now apply block elimination: solve

(I + CA−1B)y = CA−1b,

then solve Ax = b − By
this proves the matrix inversion lemma: if A and A + BC nonsingular,

(A + BC)−1 = A−1 − A−1B(I + CA−1B)−1CA−1

Numerical linear algebra background 9–14
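A numeric check of the lemma-based solve for diagonal A (dimensions are made up; note that only a small p × p system is factored):

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 300, 4
d = rng.uniform(1.0, 2.0, n)               # A = diag(d): 'easy' structure
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
b = rng.standard_normal(n)

# solve (A + BC)x = b via the matrix inversion lemma: cost O(p^2 n), not O(n^3)
Ainv_b = b / d
Ainv_B = B / d[:, None]
y = np.linalg.solve(np.eye(p) + C @ Ainv_B, C @ Ainv_b)   # small p x p system
x = Ainv_b - Ainv_B @ y

x_direct = np.linalg.solve(np.diag(d) + B @ C, b)
print(np.allclose(x, x_direct))
```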


example: A diagonal, B, C dense

• method 1: form D = A + BC, then solve Dx = b


cost: (2/3)n3 + 2pn2

• method 2 (via matrix inversion lemma): solve

(I + CA−1B)y = CA−1b, (2)

then compute x = A−1b − A−1By


total cost is dominated by (2): 2p2n + (2/3)p3 (i.e., linear in n)

Numerical linear algebra background 9–15


Underdetermined linear equations

if A ∈ Rp×n with p < n, rank A = p,

{x | Ax = b} = {F z + x̂ | z ∈ Rn−p}

• x̂ is (any) particular solution


• columns of F ∈ Rn×(n−p) span nullspace of A
• there exist several numerical methods for computing F
(QR factorization, rectangular LU factorization, . . . )

Numerical linear algebra background 9–16
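One of the QR routes: a full QR factorization of AT yields an orthonormal basis F for the nullspace (random data as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(14)
p, n = 3, 7
A = rng.standard_normal((p, n))            # full row rank with probability 1
b = rng.standard_normal(p)

xhat = np.linalg.pinv(A) @ b               # a particular solution

# full QR of A^T: the last n - p columns of Q are orthogonal to range(A^T),
# i.e., they span the nullspace of A
Q, _ = np.linalg.qr(A.T, mode='complete')
F = Q[:, p:]

z = rng.standard_normal(n - p)             # any z gives a solution F z + xhat
print(np.allclose(A @ F, 0), np.allclose(A @ (F @ z + xhat), b))
```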


Convex Optimization — Boyd & Vandenberghe

10. Unconstrained minimization

• terminology and assumptions

• gradient descent method

• steepest descent method

• Newton’s method

• self-concordant functions

• implementation

10–1
Unconstrained minimization

minimize f (x)

• f convex, twice continuously differentiable (hence dom f open)


• we assume optimal value p? = inf x f (x) is attained (and finite)

unconstrained minimization methods

• produce sequence of points x(k) ∈ dom f , k = 0, 1, . . . with

f (x(k)) → p?

• can be interpreted as iterative methods for solving optimality condition

∇f (x?) = 0

Unconstrained minimization 10–2


Initial point and sublevel set

algorithms in this chapter require a starting point x(0) such that


• x(0) ∈ dom f
• sublevel set S = {x | f (x) ≤ f (x(0))} is closed

2nd condition is hard to verify, except when all sublevel sets are closed:
• equivalent to condition that epi f is closed
• true if dom f = Rn
• true if f (x) → ∞ as x → bd dom f

examples of differentiable functions with closed sublevel sets:

f (x) = log( Σ_{i=1}^{m} exp(aTi x + bi) ),   f (x) = − Σ_{i=1}^{m} log(bi − aTi x)

Unconstrained minimization 10–3


Strong convexity and implications
f is strongly convex on S if there exists an m > 0 such that

∇2f (x) ⪰ mI for all x ∈ S

implications

• for x, y ∈ S,

f (y) ≥ f (x) + ∇f (x)T (y − x) + (m/2)kx − yk22

hence, S is bounded
• p? > −∞, and for x ∈ S,

f (x) − p? ≤ (1/(2m)) k∇f (x)k22

useful as stopping criterion (if you know m)

Unconstrained minimization 10–4


Descent methods

x(k+1) = x(k) + t(k)∆x(k) with f (x(k+1)) < f (x(k))

• other notations: x+ = x + t∆x, x := x + t∆x


• ∆x is the step, or search direction; t is the step size, or step length
• from convexity, f (x+) < f (x) implies ∇f (x)T ∆x < 0
(i.e., ∆x is a descent direction)

General descent method.


given a starting point x ∈ dom f .
repeat
1. Determine a descent direction ∆x.
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.

Unconstrained minimization 10–5


Line search types

exact line search: t = argmint>0 f (x + t∆x)

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))


• starting at t = 1, repeat t := βt until

f (x + t∆x) < f (x) + αt∇f (x)T ∆x

• graphical interpretation: backtrack until t ≤ t0

(figure: f (x + t∆x) versus t, with the lines f (x) + t∇f (x)T ∆x and f (x) + αt∇f (x)T ∆x; accepted steps satisfy t ≤ t0)

Unconstrained minimization 10–6


Gradient descent method
general descent method with ∆x = −∇f (x)

given a starting point x ∈ dom f .


repeat
1. ∆x := −∇f (x).
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.

• stopping criterion usually of the form k∇f (x)k2 ≤ ε


• convergence result: for strongly convex f ,

f (x(k)) − p? ≤ ck (f (x(0)) − p?)

c ∈ (0, 1) depends on m, x(0), line search type


• very simple, but often very slow; rarely used in practice

Unconstrained minimization 10–7


quadratic problem in R2

f (x) = (1/2)(x1² + γx2²) (γ > 0)

with exact line search, starting at x(0) = (γ, 1):

x1(k) = γ ( (γ − 1)/(γ + 1) )^k ,   x2(k) = ( −(γ − 1)/(γ + 1) )^k

• very slow if γ ≫ 1 or γ ≪ 1
• example for γ = 10: (figure: iterates x(0), x(1), . . . in the (x1, x2) plane)

Unconstrained minimization 10–8
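The closed-form iterates can be reproduced by running gradient descent with the exact step t = ∇f T ∇f /(∇f T H∇f ) for a quadratic (a small sketch):

```python
import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])                  # f(x) = (1/2)(x1^2 + gamma x2^2)
x = np.array([gamma, 1.0])                 # x(0) = (gamma, 1)

c = (gamma - 1.0) / (gamma + 1.0)
ok = True
for k in range(1, 8):
    g = H @ x                              # gradient at the current iterate
    t = (g @ g) / (g @ H @ g)              # exact line search step for a quadratic
    x = x - t * g
    ok &= np.allclose(x, [gamma * c ** k, (-c) ** k])

print(ok, x)
```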


nonquadratic example

f (x1, x2) = exp(x1 + 3x2 − 0.1) + exp(x1 − 3x2 − 0.1) + exp(−x1 − 0.1)

(figures: iterates with backtracking line search and with exact line search)

Unconstrained minimization 10–9


a problem in R100

f (x) = cT x − Σ_{i=1}^{500} log(bi − aTi x)

(figure: f (x(k)) − p? versus k on a semilog scale, for exact and backtracking line searches)

‘linear’ convergence, i.e., a straight line on a semilog plot

Unconstrained minimization 10–10


Steepest descent method

normalized steepest descent direction (at x, for norm k · k):

∆xnsd = argmin{∇f (x)T v | kvk = 1}

interpretation: for small v, f (x + v) ≈ f (x) + ∇f (x)T v;


direction ∆xnsd is unit-norm step with most negative directional derivative

(unnormalized) steepest descent direction

∆xsd = k∇f (x)k∗∆xnsd

satisfies ∇f (x)T ∆xsd = −k∇f (x)k2∗

steepest descent method


• general descent method with ∆x = ∆xsd
• convergence properties similar to gradient descent

Unconstrained minimization 10–11


examples

• Euclidean norm: ∆xsd = −∇f(x)
• quadratic norm ‖x‖P = (x^T P x)^{1/2} (P ∈ S^n_{++}): ∆xsd = −P^{−1}∇f(x)
• ℓ1-norm: ∆xsd = −(∂f(x)/∂xi) ei, where |∂f(x)/∂xi| = ‖∇f(x)‖∞

unit balls and normalized steepest descent directions for a quadratic norm
and the ℓ1-norm:

[figure: unit balls with −∇f(x) and ∆xnsd shown for each norm]

Unconstrained minimization 10–12


choice of norm for steepest descent

[figures: iterates x(0), x(1), x(2) for two different quadratic norms]

• steepest descent with backtracking line search for two quadratic norms
• ellipses show {x | ‖x − x(k)‖P = 1}
• equivalent interpretation of steepest descent with quadratic norm ‖ · ‖P :
  gradient descent after change of variables x̄ = P^{1/2} x

shows choice of P has strong effect on speed of convergence

Unconstrained minimization 10–13


Newton step

∆xnt = −∇2f (x)−1∇f (x)

interpretations
• x + ∆xnt minimizes second order approximation

  f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v

• x + ∆xnt solves linearized optimality condition

  ∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0


[figures: f and its quadratic model f̂ at x (left); f′ and its linearization
at x (right), with (x + ∆xnt, f(x + ∆xnt)) and (x + ∆xnt, f′(x + ∆xnt)) marked]

Unconstrained minimization 10–14


• ∆xnt is steepest descent direction at x in local Hessian norm

  ‖u‖_{∇²f(x)} = (u^T ∇²f(x) u)^{1/2}

[figure: x + ∆xnsd and x + ∆xnt on the Hessian-norm ellipse]

dashed lines are contour lines of f ; ellipse is {x + v | v^T ∇²f(x) v = 1};
arrow shows −∇f(x)

Unconstrained minimization 10–15


Newton decrement

  λ(x) = (∇f(x)^T ∇²f(x)^{−1} ∇f(x))^{1/2}

a measure of the proximity of x to x⋆

properties
• gives an estimate of f(x) − p⋆, using quadratic approximation f̂:

  f(x) − inf_y f̂(y) = (1/2) λ(x)²

• equal to the norm of the Newton step in the quadratic Hessian norm

  λ(x) = (∆xnt^T ∇²f(x) ∆xnt)^{1/2}

• directional derivative in the Newton direction: ∇f(x)^T ∆xnt = −λ(x)²
• affine invariant (unlike ‖∇f(x)‖2)

Unconstrained minimization 10–16


Newton’s method

given a starting point x ∈ dom f , tolerance ε > 0.

repeat
1. Compute the Newton step and decrement.
   ∆xnt := −∇²f(x)^{−1}∇f(x);  λ² := ∇f(x)^T ∇²f(x)^{−1}∇f(x).
2. Stopping criterion. quit if λ²/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + t∆xnt.

affine invariant, i.e., independent of linear changes of coordinates:
Newton iterates for f̃(y) = f(T y) with starting point y(0) = T^{−1}x(0) are

  y(k) = T^{−1} x(k)
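the algorithm above can be sketched in NumPy on the nonquadratic example of page 10–9 (a minimal illustration, not part of the original slides; function names and the starting point are ours):

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-10, alpha=0.1, beta=0.7, max_iter=50):
    """Newton's method with backtracking line search (sketch of the slide's algorithm)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)            # Newton step
        lam2 = -g @ dx                         # lambda(x)^2 = grad^T H^{-1} grad
        if lam2 / 2 <= eps:                    # stopping criterion
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                          # backtracking line search
        x = x + t * dx
    return x

# nonquadratic example f(x1,x2) = e^{x1+3x2-0.1} + e^{x1-3x2-0.1} + e^{-x1-0.1}
f = lambda x: np.exp(x[0]+3*x[1]-0.1) + np.exp(x[0]-3*x[1]-0.1) + np.exp(-x[0]-0.1)
def grad(x):
    a, b, c = np.exp(x[0]+3*x[1]-0.1), np.exp(x[0]-3*x[1]-0.1), np.exp(-x[0]-0.1)
    return np.array([a + b - c, 3*a - 3*b])
def hess(x):
    a, b, c = np.exp(x[0]+3*x[1]-0.1), np.exp(x[0]-3*x[1]-0.1), np.exp(-x[0]-0.1)
    return np.array([[a + b + c, 3*a - 3*b], [3*a - 3*b, 9*a + 9*b]])

x_star = newton(f, grad, hess, [-1.0, 1.0])    # minimizer is (-log(2)/2, 0)
```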

Unconstrained minimization 10–17


Classical convergence analysis

assumptions
• f strongly convex on S with constant m
• ∇2f is Lipschitz continuous on S, with constant L > 0:

  ‖∇²f(x) − ∇²f(y)‖2 ≤ L‖x − y‖2

(L measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, m²/L), γ > 0 such that

• if ‖∇f(x)‖2 ≥ η, then f(x(k+1)) − f(x(k)) ≤ −γ
• if ‖∇f(x)‖2 < η, then

  (L/(2m²)) ‖∇f(x(k+1))‖2 ≤ ((L/(2m²)) ‖∇f(x(k))‖2)²

Unconstrained minimization 10–18


damped Newton phase (‖∇f(x)‖2 ≥ η)

• most iterations require backtracking steps
• function value decreases by at least γ
• if p⋆ > −∞, this phase ends after at most (f(x(0)) − p⋆)/γ iterations

quadratically convergent phase (‖∇f(x)‖2 < η)

• all iterations use step size t = 1
• ‖∇f(x)‖2 converges to zero quadratically: if ‖∇f(x(k))‖2 < η, then

  (L/(2m²)) ‖∇f(x(l))‖2 ≤ ((L/(2m²)) ‖∇f(x(k))‖2)^(2^(l−k)) ≤ (1/2)^(2^(l−k)),   l ≥ k

Unconstrained minimization 10–19


conclusion: number of iterations until f(x) − p⋆ ≤ ε is bounded above by

  (f(x(0)) − p⋆)/γ + log2 log2(ε0/ε)

• γ, ε0 are constants that depend on m, L, x(0)
• second term is small (of the order of 6) and almost constant for
  practical purposes
• in practice, constants m, L (hence γ, ε0) are usually unknown
• provides qualitative insight in convergence properties (i.e., explains two
  algorithm phases)

Unconstrained minimization 10–20


Examples

example in R2 (page 10–9)

[figure: iterates x(0), x(1) on the contour plot, and f(x(k)) − p⋆ versus k
on a semilog scale]

• backtracking parameters α = 0.1, β = 0.7


• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10–21


example in R100 (page 10–10)

[figures: f(x(k)) − p⋆ versus k (left) and step size t(k) versus k (right),
for exact line search and backtracking]

• backtracking parameters α = 0.01, β = 0.5


• backtracking line search almost as fast as exact l.s. (and much simpler)
• clearly shows two phases in algorithm

Unconstrained minimization 10–22


example in R10000 (with sparse ai)

  f(x) = −∑_{i=1}^{10000} log(1 − xi²) − ∑_{i=1}^{100000} log(bi − a_i^T x)

[figure: f(x(k)) − p⋆ versus k]

• backtracking parameters α = 0.01, β = 0.5.


• performance similar to that for small examples

Unconstrained minimization 10–23


Self-concordance

shortcomings of classical convergence analysis

• depends on unknown constants (m, L, . . . )


• bound is not affinely invariant, although Newton’s method is

convergence analysis via self-concordance (Nesterov and Nemirovski)

• does not depend on any unknown constants


• gives affine-invariant bound
• applies to special class of convex functions (‘self-concordant’ functions)
• developed to analyze polynomial-time interior-point methods for convex
optimization

Unconstrained minimization 10–24


Self-concordant functions

definition
• f : R → R is self-concordant if |f′′′(x)| ≤ 2f′′(x)^{3/2} for all x ∈ dom f
• f : Rn → R is self-concordant if g(t) = f (x + tv) is self-concordant for
all x ∈ dom f , v ∈ Rn

examples on R
• linear and quadratic functions
• negative logarithm f (x) = − log x
• negative entropy plus negative logarithm: f (x) = x log x − log x

affine invariance: if f : R → R is s.c., then f̃(y) = f(ay + b) is s.c.:

  f̃′′′(y) = a³ f′′′(ay + b),   f̃′′(y) = a² f′′(ay + b)
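the defining inequality can be checked numerically for the examples on this slide (a small verification, not part of the original slides; for −log x it holds with equality, so a tiny relative tolerance is used):

```python
import numpy as np

def selfconcordant_ok(f2, f3, xs):
    """Check |f'''(x)| <= 2 f''(x)^(3/2) on a grid of points xs (f'' > 0 assumed)."""
    return all(abs(f3(x)) <= 2 * f2(x) ** 1.5 * (1 + 1e-9) for x in xs)

xs = np.linspace(0.1, 10.0, 100)

# f(x) = -log x:  f''(x) = 1/x^2, f'''(x) = -2/x^3  (equality holds)
ok_neglog = selfconcordant_ok(lambda x: 1 / x**2, lambda x: -2 / x**3, xs)

# f(x) = x log x - log x:  f''(x) = 1/x + 1/x^2, f'''(x) = -1/x^2 - 2/x^3
ok_entlog = selfconcordant_ok(lambda x: 1 / x + 1 / x**2,
                              lambda x: -1 / x**2 - 2 / x**3, xs)
```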

Unconstrained minimization 10–25


Self-concordant calculus

properties
• preserved under positive scaling α ≥ 1, and sum
• preserved under composition with affine function
• if g is convex with dom g = R++ and |g′′′(x)| ≤ 3g′′(x)/x then

f (x) = log(−g(x)) − log x

is self-concordant

examples: properties can be used to show that the following are s.c.

• f(x) = −∑_{i=1}^{m} log(bi − a_i^T x) on {x | a_i^T x < bi, i = 1, . . . , m}
• f(X) = − log det X on S^n_{++}
• f(x) = − log(y² − x^T x) on {(x, y) | ‖x‖2 < y}

Unconstrained minimization 10–26


Convergence analysis for self-concordant functions

summary: there exist constants η ∈ (0, 1/4], γ > 0 such that

• if λ(x) > η, then

  f(x(k+1)) − f(x(k)) ≤ −γ

• if λ(x) ≤ η, then

  2λ(x(k+1)) ≤ (2λ(x(k)))²

(η and γ only depend on backtracking parameters α, β)

complexity bound: number of Newton iterations bounded by

  (f(x(0)) − p⋆)/γ + log2 log2(1/ε)

for α = 0.1, β = 0.8, ε = 10^{−10}, bound evaluates to 375(f(x(0)) − p⋆) + 6

Unconstrained minimization 10–27


numerical example: 150 randomly generated instances of

  minimize f(x) = −∑_{i=1}^{m} log(bi − a_i^T x)

◦: m = 100, n = 50;   □: m = 1000, n = 500;   ♦: m = 1000, n = 50

[figure: number of Newton iterations versus f(x(0)) − p⋆]

• number of iterations much smaller than 375(f (x(0)) − p?) + 6


• bound of the form c(f (x(0)) − p?) + 6 with smaller c (empirically) valid

Unconstrained minimization 10–28


Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

H∆x = g

where H = ∇2f (x), g = −∇f (x)

via Cholesky factorization

  H = LL^T,   ∆xnt = L^{−T} L^{−1} g,   λ(x) = ‖L^{−1} g‖2

• cost (1/3)n³ flops for unstructured system
• cost ≪ (1/3)n³ if H sparse, banded
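the factor-and-solve step can be sketched in NumPy (a minimal illustration, not part of the original slides; `np.linalg.solve` is used in place of dedicated triangular solves):

```python
import numpy as np

def newton_step_chol(H, g):
    """Solve H dx = g via Cholesky and return dx and the Newton decrement lambda(x).

    Here H is the Hessian and g = -gradient, as on the slide."""
    L = np.linalg.cholesky(H)                  # H = L L^T, ~ (1/3) n^3 flops
    y = np.linalg.solve(L, g)                  # y = L^{-1} g (forward substitution)
    dx = np.linalg.solve(L.T, y)               # dx = L^{-T} L^{-1} g
    lam = np.linalg.norm(y)                    # lambda(x) = ||L^{-1} g||_2
    return dx, lam

# quick check against the defining equations
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5 * np.eye(5)                    # a positive definite Hessian
g = rng.standard_normal(5)
dx, lam = newton_step_chol(H, g)
assert np.allclose(H @ dx, g)
assert np.isclose(lam**2, g @ dx)              # lambda^2 = g^T H^{-1} g
```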

Unconstrained minimization 10–29


example of dense Newton system with structure

  f(x) = ∑_{i=1}^{n} ψi(xi) + ψ0(Ax + b),   H = D + A^T H0 A

• assume A ∈ R^{p×n}, dense, with p ≪ n
• D diagonal with diagonal elements ψi′′(xi); H0 = ∇²ψ0(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost (1/3)n³)

method 2 (page 9–15): factor H0 = L0 L0^T; write Newton system as

  D∆x + A^T L0 w = −g,   L0^T A∆x − w = 0

eliminate ∆x from first equation; compute w and ∆x from

  (I + L0^T A D^{−1} A^T L0) w = −L0^T A D^{−1} g,   D∆x = −g − A^T L0 w

cost: 2p²n (dominated by computation of L0^T A D^{−1} A^T L0)
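method 2 can be sketched in NumPy and checked against a direct solve (a minimal illustration, not part of the original slides; function and variable names are ours):

```python
import numpy as np

def newton_struct(D_diag, A, H0, g):
    """Solve (D + A^T H0 A) dx = -g by the slide's block elimination (p << n)."""
    L0 = np.linalg.cholesky(H0)                # H0 = L0 L0^T
    Dinv = 1.0 / D_diag
    B = L0.T @ (A * Dinv)                      # B = L0^T A D^{-1}   (p x n)
    S = np.eye(len(H0)) + B @ (A.T @ L0)       # I + L0^T A D^{-1} A^T L0  (p x p)
    w = np.linalg.solve(S, -B @ g)
    dx = Dinv * (-g - A.T @ (L0 @ w))          # D dx = -g - A^T L0 w
    return dx

rng = np.random.default_rng(1)
p, n = 3, 50
A = rng.standard_normal((p, n))
D_diag = rng.uniform(1.0, 2.0, n)
M = rng.standard_normal((p, p))
H0 = M @ M.T + np.eye(p)
g = rng.standard_normal(n)

dx = newton_struct(D_diag, A, H0, g)
H = np.diag(D_diag) + A.T @ H0 @ A             # full Hessian, for verification only
assert np.allclose(H @ dx, -g)
```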

Unconstrained minimization 10–30


Convex Optimization — Boyd & Vandenberghe

11. Equality constrained minimization

• equality constrained minimization

• eliminating equality constraints

• Newton’s method with equality constraints

• infeasible start Newton method

• implementation

11–1
Equality constrained minimization

minimize f (x)
subject to Ax = b

• f convex, twice continuously differentiable


• A ∈ Rp×n with rank A = p
• we assume p? is finite and attained

optimality conditions: x? is optimal iff there exists a ν ? such that

∇f (x?) + AT ν ? = 0, Ax? = b

Equality constrained minimization 11–2


equality constrained quadratic minimization (with P ∈ S^n_+)

  minimize   (1/2) x^T P x + q^T x + r
  subject to Ax = b

optimality condition:

  [ P  A^T ] [ x⋆ ]   [ −q ]
  [ A   0  ] [ ν⋆ ] = [  b ]

• coefficient matrix is called KKT matrix
• KKT matrix is nonsingular if and only if

  Ax = 0, x ≠ 0 =⇒ x^T P x > 0

• equivalent condition for nonsingularity: P + A^T A ≻ 0

Equality constrained minimization 11–3


Eliminating equality constraints

represent solution of {x | Ax = b} as

  {x | Ax = b} = {F z + x̂ | z ∈ R^{n−p}}

• x̂ is (any) particular solution
• range of F ∈ R^{n×(n−p)} is nullspace of A (rank F = n − p and AF = 0)

reduced or eliminated problem

minimize f (F z + x̂)

• an unconstrained problem with variable z ∈ R^{n−p}
• from solution z⋆, obtain x⋆ and ν⋆ as

  x⋆ = F z⋆ + x̂,   ν⋆ = −(AA^T)^{−1} A∇f(x⋆)

Equality constrained minimization 11–4


example: optimal allocation with resource constraint

minimize f1(x1) + f2(x2) + · · · + fn(xn)


subject to x1 + x2 + · · · + xn = b

eliminate xn = b − x1 − · · · − xn−1, i.e., choose

  x̂ = b en,   F = [  I   ]  ∈ R^{n×(n−1)}
                  [ −1^T ]

reduced problem:

minimize f1(x1) + · · · + fn−1(xn−1) + fn(b − x1 − · · · − xn−1)

(variables x1, . . . , xn−1)
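the choice of x̂ and F above can be verified numerically (a small check, not part of the original slides): AF = 0 and every point F z + x̂ satisfies the resource constraint.

```python
import numpy as np

n, b = 5, 3.0
A = np.ones((1, n))                            # constraint x1 + ... + xn = b
xhat = b * np.eye(n)[:, -1]                    # particular solution b * e_n
F = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])   # range(F) = nullspace(A)

assert np.allclose(A @ F, 0)                   # AF = 0
z = np.random.default_rng(2).standard_normal(n - 1)
x = F @ z + xhat
assert np.isclose(x.sum(), b)                  # every x = F z + xhat is feasible
```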

Equality constrained minimization 11–5


Newton step

Newton step of f at feasible x is given by (1st block of) solution of

  [ ∇²f(x)  A^T ] [ ∆xnt ]   [ −∇f(x) ]
  [ A        0  ] [  w   ] = [    0   ]

interpretations

• ∆xnt solves second order approximation (with variable v)

  minimize   f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v
  subject to A(x + v) = b

• equations follow from linearizing optimality conditions

  ∇f(x + ∆xnt) + A^T w = 0,   A(x + ∆xnt) = b

Equality constrained minimization 11–6


Newton decrement

  λ(x) = (∆xnt^T ∇²f(x) ∆xnt)^{1/2} = (−∇f(x)^T ∆xnt)^{1/2}

properties

• gives an estimate of f(x) − p⋆ using quadratic approximation f̂:

  f(x) − inf_{Ay=b} f̂(y) = (1/2) λ(x)²

• directional derivative in Newton direction:

  (d/dt) f(x + t∆xnt) |_{t=0} = −λ(x)²

• in general, λ(x) ≠ (∇f(x)^T ∇²f(x)^{−1} ∇f(x))^{1/2}

Equality constrained minimization 11–7


Newton’s method with equality constraints

given starting point x ∈ dom f with Ax = b, tolerance ε > 0.

repeat
1. Compute the Newton step and decrement ∆xnt, λ(x).
2. Stopping criterion. quit if λ²/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + t∆xnt.

• a feasible descent method: x(k) feasible and f (x(k+1)) < f (x(k))


• affine invariant

Equality constrained minimization 11–8


Newton’s method and elimination

Newton’s method for reduced problem

  minimize f̃(z) = f(F z + x̂)

• variables z ∈ R^{n−p}
• x̂ satisfies Ax̂ = b; rank F = n − p and AF = 0
• Newton’s method for f̃, started at z(0), generates iterates z(k)

Newton’s method with equality constraints

when started at x(0) = F z (0) + x̂, iterates are

  x(k+1) = F z(k+1) + x̂

hence, don’t need separate convergence analysis

Equality constrained minimization 11–9


Newton step at infeasible points
2nd interpretation of page 11–6 extends to infeasible x (i.e., Ax ≠ b)

linearizing optimality conditions at infeasible x (with x ∈ dom f ) gives

  [ ∇²f(x)  A^T ] [ ∆xnt ]     [ ∇f(x)  ]
  [ A        0  ] [  w   ] = − [ Ax − b ]     (1)

primal-dual interpretation
• write optimality condition as r(y) = 0, where

y = (x, ν), r(y) = (∇f (x) + AT ν, Ax − b)

• linearizing r(y) = 0 gives r(y + ∆y) ≈ r(y) + Dr(y)∆y = 0:

  [ ∇²f(x)  A^T ] [ ∆xnt ]     [ ∇f(x) + A^T ν ]
  [ A        0  ] [ ∆νnt ] = − [     Ax − b    ]

same as (1) with w = ν + ∆νnt

Equality constrained minimization 11–10


Infeasible start Newton method

given starting point x ∈ dom f , ν, tolerance ε > 0, α ∈ (0, 1/2), β ∈ (0, 1).
repeat
1. Compute primal and dual Newton steps ∆xnt, ∆νnt.
2. Backtracking line search on ‖r‖2.
   t := 1.
   while ‖r(x + t∆xnt, ν + t∆νnt)‖2 > (1 − αt)‖r(x, ν)‖2, t := βt.
3. Update. x := x + t∆xnt, ν := ν + t∆νnt.
until Ax = b and ‖r(x, ν)‖2 ≤ ε.

• not a descent method: f (x(k+1)) > f (x(k)) is possible


• directional derivative of ‖r(y)‖2 in direction ∆y = (∆xnt, ∆νnt) is

  (d/dt) ‖r(y + t∆y)‖2 |_{t=0} = −‖r(y)‖2

Equality constrained minimization 11–11


Solving KKT systems

  [ H  A^T ] [ v ]     [ g ]
  [ A   0  ] [ w ] = − [ h ]

solution methods

• LDLT factorization
• elimination (if H nonsingular)

AH −1AT w = h − AH −1g, Hv = −(g + AT w)

• elimination with singular H: write as

  [ H + A^T QA  A^T ] [ v ]     [ g + A^T Qh ]
  [ A            0  ] [ w ] = − [     h      ]

  with Q ⪰ 0 for which H + A^T QA ≻ 0, and apply elimination
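the elimination method (nonsingular H) can be sketched in NumPy and checked against the KKT equations (a minimal illustration, not part of the original slides):

```python
import numpy as np

def solve_kkt(H, A, g, h):
    """Solve [[H, A^T], [A, 0]] [v; w] = -[g; h] by elimination (H nonsingular)."""
    Hinv_AT = np.linalg.solve(H, A.T)
    Hinv_g = np.linalg.solve(H, g)
    S = A @ Hinv_AT                            # Schur complement A H^{-1} A^T
    w = np.linalg.solve(S, h - A @ Hinv_g)     # A H^{-1} A^T w = h - A H^{-1} g
    v = -np.linalg.solve(H, g + A.T @ w)       # H v = -(g + A^T w)
    return v, w

rng = np.random.default_rng(3)
n, p = 6, 2
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                        # positive definite H
A = rng.standard_normal((p, n))
g, h = rng.standard_normal(n), rng.standard_normal(p)

v, w = solve_kkt(H, A, g, h)
assert np.allclose(H @ v + A.T @ w, -g)        # first block row
assert np.allclose(A @ v, -h)                  # second block row
```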

Equality constrained minimization 11–12


Equality constrained analytic centering

primal problem: minimize −∑_{i=1}^{n} log xi subject to Ax = b

dual problem: maximize −b^T ν + ∑_{i=1}^{n} log(A^T ν)i + n

three methods for an example with A ∈ R^{100×500}, different starting points

1. Newton method with equality constraints (requires x(0) ≻ 0, Ax(0) = b)

[figure: f(x(k)) − p⋆ versus k]

Equality constrained minimization 11–13


2. Newton method applied to dual problem (requires A^T ν(0) ≻ 0)

[figure: p⋆ − g(ν(k)) versus k]

3. infeasible start Newton method (requires x(0) ≻ 0)

[figure: ‖r(x(k), ν(k))‖2 versus k]

Equality constrained minimization 11–14
Equality constrained minimization 11–14
complexity per iteration of three methods is identical

1. use block elimination to solve KKT system

  [ diag(x)^{−2}  A^T ] [ ∆x ]   [ diag(x)^{−1} 1 ]
  [ A              0  ] [ w  ] = [       0        ]

  reduces to solving A diag(x)² A^T w = b

2. solve Newton system A diag(A^T ν)^{−2} A^T ∆ν = −b + A diag(A^T ν)^{−1} 1

3. use block elimination to solve KKT system

  [ diag(x)^{−2}  A^T ] [ ∆x ]   [ diag(x)^{−1} 1 − A^T ν ]
  [ A              0  ] [ ∆ν ] = [          b − Ax        ]

  reduces to solving A diag(x)² A^T w = 2Ax − b (with w = ν + ∆ν)

conclusion: in each case, solve ADA^T w = h with D positive diagonal
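method 1's block elimination can be sketched in NumPy (a minimal illustration, not part of the original slides; the function name is ours):

```python
import numpy as np

def acent_newton_step(A, b, x):
    """Feasible Newton step for: minimize -sum(log x) s.t. Ax = b (method 1)."""
    d = x**2                                   # diagonal of diag(x)^2
    w = np.linalg.solve((A * d) @ A.T, b)      # A diag(x)^2 A^T w = b
    dx = x - d * (A.T @ w)                     # from diag(x)^{-2} dx + A^T w = diag(x)^{-1} 1
    return dx, w

rng = np.random.default_rng(4)
p, n = 3, 8
A = rng.standard_normal((p, n))
x = rng.uniform(0.5, 2.0, n)                   # strictly positive point
b = A @ x                                      # makes x feasible by construction

dx, w = acent_newton_step(A, b, x)
# check both KKT equations: diag(x)^{-2} dx + A^T w = diag(x)^{-1} 1, A dx = 0
assert np.allclose(dx / x**2 + A.T @ w, 1 / x)
assert np.allclose(A @ dx, 0, atol=1e-8)
```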

Equality constrained minimization 11–15


Network flow optimization

  minimize   ∑_{i=1}^{n} φi(xi)
  subject to Ax = b

• directed graph with n arcs, p + 1 nodes
• xi: flow through arc i; φi: cost flow function for arc i (with φi′′(x) > 0)
• node-incidence matrix Ã ∈ R^{(p+1)×n} defined as

  Ãij = 1 if arc j leaves node i;  −1 if arc j enters node i;  0 otherwise

• reduced node-incidence matrix A ∈ R^{p×n} is Ã with last row removed
• b ∈ R^p is (reduced) source vector
• rank A = p if graph is connected

Equality constrained minimization 11–16


KKT system

  [ H  A^T ] [ v ]     [ g ]
  [ A   0  ] [ w ] = − [ h ]

• H = diag(φ1′′(x1), . . . , φn′′(xn)), positive diagonal
• solve via elimination:

  A H^{−1} A^T w = h − A H^{−1} g,   Hv = −(g + A^T w)

sparsity pattern of coefficient matrix is given by graph connectivity

  (A H^{−1} A^T)ij ≠ 0 ⇐⇒ (AA^T)ij ≠ 0 ⇐⇒ nodes i and j are connected by an arc

Equality constrained minimization 11–17


Analytic center of linear matrix inequality

  minimize   − log det X
  subject to tr(Ai X) = bi,   i = 1, . . . , p

variable X ∈ S^n

optimality conditions

  X⋆ ≻ 0,   −(X⋆)^{−1} + ∑_{j=1}^{p} νj⋆ Aj = 0,   tr(Ai X⋆) = bi, i = 1, . . . , p

Newton equation at feasible X:

  X^{−1} ∆X X^{−1} + ∑_{j=1}^{p} wj Aj = X^{−1},   tr(Ai ∆X) = 0, i = 1, . . . , p

• follows from linear approximation (X + ∆X)^{−1} ≈ X^{−1} − X^{−1} ∆X X^{−1}
• n(n + 1)/2 + p variables ∆X, w

Equality constrained minimization 11–18


solution by block elimination

• eliminate ∆X from first equation: ∆X = X − ∑_{j=1}^{p} wj X Aj X
• substitute ∆X in second equation

  ∑_{j=1}^{p} tr(Ai X Aj X) wj = bi,   i = 1, . . . , p     (2)

  a dense positive definite set of linear equations with variable w ∈ R^p

flop count (dominant terms) using Cholesky factorization X = LL^T :

• form p products L^T Aj L: (3/2)pn³
• form p(p + 1)/2 inner products tr((L^T Ai L)(L^T Aj L)): (1/2)p²n²
• solve (2) via Cholesky factorization: (1/3)p³

Equality constrained minimization 11–19


Convex Optimization — Boyd & Vandenberghe

12. Interior-point methods

• inequality constrained minimization


• logarithmic barrier function and central path
• barrier method
• feasibility and phase I methods
• complexity analysis via self-concordance
• generalized inequalities

12–1
Inequality constrained minimization

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m (1)
Ax = b

• fi convex, twice continuously differentiable


• A ∈ Rp×n with rank A = p
• we assume p? is finite and attained
• we assume problem is strictly feasible: there exists x̃ with

x̃ ∈ dom f0, fi(x̃) < 0, i = 1, . . . , m, Ax̃ = b

hence, strong duality holds and dual optimum is attained

Interior-point methods 12–2


Examples

• LP, QP, QCQP, GP

• entropy maximization with linear inequality constraints


  minimize   ∑_{i=1}^{n} xi log xi
  subject to F x ⪯ g
             Ax = b

  with dom f0 = R^n_{++}

• differentiability may require reformulating the problem, e.g.,
  piecewise-linear minimization or ℓ∞-norm approximation via LP

• SDPs and SOCPs are better handled as problems with generalized
  inequalities (see later)

Interior-point methods 12–3


Logarithmic barrier
reformulation of (1) via indicator function:

  minimize   f0(x) + ∑_{i=1}^{m} I−(fi(x))
  subject to Ax = b

where I−(u) = 0 if u ≤ 0, I−(u) = ∞ otherwise (indicator function of R−)

approximation via logarithmic barrier

  minimize   f0(x) − (1/t) ∑_{i=1}^{m} log(−fi(x))
  subject to Ax = b

• an equality constrained problem
• for t > 0, −(1/t) log(−u) is a smooth approximation of I−
• approximation improves as t → ∞

[figure: −(1/t) log(−u) versus u for several values of t, approaching I−]

Interior-point methods 12–4
Interior-point methods 12–4
logarithmic barrier function

  φ(x) = −∑_{i=1}^{m} log(−fi(x)),   dom φ = {x | f1(x) < 0, . . . , fm(x) < 0}

• convex (follows from composition rules)
• twice continuously differentiable, with derivatives

  ∇φ(x) = ∑_{i=1}^{m} (1/(−fi(x))) ∇fi(x)

  ∇²φ(x) = ∑_{i=1}^{m} (1/fi(x)²) ∇fi(x)∇fi(x)^T + ∑_{i=1}^{m} (1/(−fi(x))) ∇²fi(x)
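for linear constraints fi(x) = a_i^T x − bi these formulas take a simple matrix form, sketched below with a finite-difference check of the gradient (an illustration, not part of the original slides):

```python
import numpy as np

def log_barrier(A, b, x):
    """phi, grad, hess of phi(x) = -sum_i log(b_i - a_i^T x) (linear fi)."""
    s = b - A @ x                              # slacks, must be > 0
    phi = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)                     # sum_i a_i / (b_i - a_i^T x)
    hess = A.T @ ((1.0 / s**2)[:, None] * A)   # sum_i a_i a_i^T / (b_i - a_i^T x)^2
    return phi, grad, hess

rng = np.random.default_rng(5)
m, n = 10, 4
A = rng.standard_normal((m, n))
x = np.zeros(n)
b = A @ x + rng.uniform(1.0, 2.0, m)           # makes x strictly feasible
phi, grad, hess = log_barrier(A, b, x)

# central-difference check of the gradient
eps = 1e-6
for j in range(n):
    e = np.zeros(n); e[j] = eps
    fd = (log_barrier(A, b, x + e)[0] - log_barrier(A, b, x - e)[0]) / (2 * eps)
    assert abs(fd - grad[j]) < 1e-4
```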

Interior-point methods 12–5


Central path

• for t > 0, define x?(t) as the solution of

minimize tf0(x) + φ(x)


subject to Ax = b

(for now, assume x?(t) exists and is unique for each t > 0)
• central path is {x?(t) | t > 0}

example: central path for an LP

  minimize   c^T x
  subject to a_i^T x ≤ bi,   i = 1, . . . , 6

hyperplane c^T x = c^T x⋆(t) is tangent to level curve of φ through x⋆(t)

[figure: polyhedron with the central path from the analytic center x⋆ toward
the optimum; x⋆(10) and the direction −c marked]

Interior-point methods 12–6


Dual points on central path

x = x⋆(t) if there exists a w such that

  t∇f0(x) + ∑_{i=1}^{m} (1/(−fi(x))) ∇fi(x) + A^T w = 0,   Ax = b

• therefore, x⋆(t) minimizes the Lagrangian

  L(x, λ⋆(t), ν⋆(t)) = f0(x) + ∑_{i=1}^{m} λi⋆(t) fi(x) + ν⋆(t)^T (Ax − b)

  where we define λi⋆(t) = 1/(−t fi(x⋆(t))) and ν⋆(t) = w/t

• this confirms the intuitive idea that f0(x⋆(t)) → p⋆ if t → ∞:

  p⋆ ≥ g(λ⋆(t), ν⋆(t))
     = L(x⋆(t), λ⋆(t), ν⋆(t))
     = f0(x⋆(t)) − m/t

Interior-point methods 12–7


Interpretation via KKT conditions

x = x⋆(t), λ = λ⋆(t), ν = ν⋆(t) satisfy

1. primal constraints: fi(x) ≤ 0, i = 1, . . . , m, Ax = b
2. dual constraints: λ ⪰ 0
3. approximate complementary slackness: −λi fi(x) = 1/t, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:

  ∇f0(x) + ∑_{i=1}^{m} λi ∇fi(x) + A^T ν = 0

difference with KKT is that condition 3 replaces λifi(x) = 0

Interior-point methods 12–8


Force field interpretation

centering problem (for problem with no equality constraints)

  minimize tf0(x) − ∑_{i=1}^{m} log(−fi(x))

force field interpretation

• tf0(x) is potential of force field F0(x) = −t∇f0(x)
• − log(−fi(x)) is potential of force field Fi(x) = (1/fi(x))∇fi(x)

the forces balance at x⋆(t):

  F0(x⋆(t)) + ∑_{i=1}^{m} Fi(x⋆(t)) = 0

Interior-point methods 12–9


example

  minimize   c^T x
  subject to a_i^T x ≤ bi,   i = 1, . . . , m

• objective force field is constant: F0(x) = −tc
• constraint force field decays as inverse distance to constraint hyperplane:

  Fi(x) = −ai/(bi − a_i^T x),   ‖Fi(x)‖2 = 1/dist(x, Hi)

  where Hi = {x | a_i^T x = bi}

[figures: force balance at x⋆(t) for t = 1 (objective force −c) and t = 3
(objective force −3c)]

Interior-point methods 12–10


Barrier method

given strictly feasible x, t := t(0) > 0, µ > 1, tolerance ε > 0.

repeat
1. Centering step. Compute x⋆(t) by minimizing tf0 + φ, subject to Ax = b.
2. Update. x := x⋆(t).
3. Stopping criterion. quit if m/t < ε.
4. Increase t. t := µt.

• terminates with f0(x) − p⋆ ≤ ε (stopping criterion follows from
  f0(x⋆(t)) − p⋆ ≤ m/t)
• centering usually done using Newton’s method, starting at current x
• choice of µ involves a trade-off: large µ means fewer outer iterations,
  more inner (Newton) iterations; typical values: µ = 10–20
• several heuristics for choice of t(0)
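the barrier method can be sketched in NumPy for an inequality form LP (an illustration under our own naming and parameter choices, not part of the original slides; the centering step uses a basic Newton method with backtracking, and there are no equality constraints):

```python
import numpy as np

def barrier_lp(c, A, b, x, t=1.0, mu=10.0, eps=1e-6):
    """Barrier method for: minimize c^T x s.t. Ax <= b, from strictly feasible x."""
    m = len(b)
    while m / t >= eps:                        # duality gap bound m/t
        # centering step: minimize t*c^T x + phi(x) by Newton's method
        for _ in range(50):
            s = b - A @ x                      # slacks
            g = t * c + A.T @ (1 / s)          # gradient of t f0 + phi
            H = A.T @ ((1 / s**2)[:, None] * A)  # Hessian (f0 linear, so phi's)
            dx = np.linalg.solve(H, -g)
            if -g @ dx / 2 <= 1e-8:            # Newton decrement test
                break
            step = 1.0                         # backtracking line search,
            while np.min(b - A @ (x + step * dx)) <= 0:  # staying strictly feasible
                step *= 0.5
            def val(z):
                return t * c @ z - np.sum(np.log(b - A @ z))
            while val(x + step * dx) > val(x) + 0.01 * step * g @ dx:
                step *= 0.5
            x = x + step * dx
        t *= mu                                # increase t
    return x

# tiny LP: minimize x1 + x2 subject to 0 <= x <= 1 (optimal value 0 at the origin)
c = np.array([1.0, 1.0])
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
x_opt = barrier_lp(c, A, b, np.array([0.5, 0.5]))
```

on exit, f0(x) − p⋆ is below the m/t bound, so c^T x_opt is within the requested tolerance of 0.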

Interior-point methods 12–11


Convergence analysis

number of outer (centering) iterations: exactly

  ⌈ log(m/(εt(0))) / log µ ⌉

plus the initial centering step (to compute x⋆(t(0)))

centering problem

minimize tf0(x) + φ(x)

see convergence analysis of Newton’s method


• tf0 + φ must have closed sublevel sets for t ≥ t(0)
• classical analysis requires strong convexity, Lipschitz condition
• analysis via self-concordance requires self-concordance of tf0 + φ

Interior-point methods 12–12


Examples

inequality form LP (m = 100 inequalities, n = 50 variables)

[figures: duality gap versus Newton iterations for µ = 2, 50, 150 (left);
total Newton iterations versus µ (right)]

• starts with x on central path (t(0) = 1, duality gap 100)
• terminates when t = 10^8 (gap 10^{−6})
• centering uses Newton’s method with backtracking
• total number of Newton iterations not very sensitive for µ ≥ 10

Interior-point methods 12–13


geometric program (m = 100 inequalities and n = 50 variables)

  minimize   log( ∑_{k=1}^{5} exp(a0k^T x + b0k) )
  subject to log( ∑_{k=1}^{5} exp(aik^T x + bik) ) ≤ 0,   i = 1, . . . , m

[figure: duality gap versus Newton iterations for µ = 2, 50, 150]

Interior-point methods 12–14


family of standard LPs (A ∈ R^{m×2m})

  minimize   c^T x
  subject to Ax = b,   x ⪰ 0

m = 10, . . . , 1000; for each m, solve 100 randomly generated instances

[figure: number of Newton iterations versus m]

number of iterations grows very slowly as m ranges over a 100 : 1 ratio

Interior-point methods 12–15


Feasibility and phase I methods

feasibility problem: find x such that

fi(x) ≤ 0, i = 1, . . . , m, Ax = b (2)

phase I: computes strictly feasible starting point for barrier method


basic phase I method

minimize (over x, s) s
subject to fi(x) ≤ s, i = 1, . . . , m (3)
Ax = b

• if x, s feasible, with s < 0, then x is strictly feasible for (2)


• if optimal value p̄? of (3) is positive, then problem (2) is infeasible
• if p̄? = 0 and attained, then problem (2) is feasible (but not strictly);
if p̄? = 0 and not attained, then problem (2) is infeasible

Interior-point methods 12–16


sum of infeasibilities phase I method

  minimize   1^T s
  subject to s ⪰ 0,   fi(x) ≤ si,   i = 1, . . . , m
             Ax = b

for infeasible problems, produces a solution that satisfies many more
inequalities than basic phase I method

example (infeasible set of 100 linear inequalities in 50 variables)

[figures: histograms of the margins bi − a_i^T xmax (left) and
bi − a_i^T xsum (right)]

left: basic phase I solution; satisfies 39 inequalities
right: sum of infeasibilities phase I solution; satisfies 79 inequalities

Interior-point methods 12–17


example: family of linear inequalities Ax ⪯ b + γ∆b

• data chosen to be strictly feasible for γ > 0, infeasible for γ ≤ 0
• use basic phase I, terminate when s < 0 or dual objective is positive

[figures: number of Newton iterations versus γ; infeasible cases (left) and
feasible cases (right), with |γ| on a log scale]

number of iterations roughly proportional to log(1/|γ|)

Interior-point methods 12–18


Complexity analysis via self-concordance

same assumptions as on page 12–2, plus:

• sublevel sets (of f0, on the feasible set) are bounded


• tf0 + φ is self-concordant with closed sublevel sets

second condition

• holds for LP, QP, QCQP


• may require reformulating the problem, e.g.,
  minimize   ∑_{i=1}^{n} xi log xi        minimize   ∑_{i=1}^{n} xi log xi
  subject to F x ⪯ g           −→         subject to F x ⪯ g,  x ⪰ 0

• needed for complexity analysis; barrier method works even when


self-concordance assumption does not apply

Interior-point methods 12–19


Newton iterations per centering step: from self-concordance theory

  #Newton iterations ≤ (µtf0(x) + φ(x) − µtf0(x+) − φ(x+))/γ + c

• bound on effort of computing x+ = x⋆(µt) starting at x = x⋆(t)
• γ, c are constants (depend only on Newton algorithm parameters)
• from duality (with λ = λ⋆(t), ν = ν⋆(t)):

  µtf0(x) + φ(x) − µtf0(x+) − φ(x+)
    = µtf0(x) − µtf0(x+) + ∑_{i=1}^{m} log(−µtλi fi(x+)) − m log µ
    ≤ µtf0(x) − µtf0(x+) − µt ∑_{i=1}^{m} λi fi(x+) − m − m log µ
    ≤ µtf0(x) − µtg(λ, ν) − m − m log µ
    = m(µ − 1 − log µ)

Interior-point methods 12–20


total number of Newton iterations (excluding first centering step)

  #Newton iterations ≤ N = ⌈ log(m/(t(0)ε)) / log µ ⌉ ( m(µ − 1 − log µ)/γ + c )

figure shows N for typical values of γ, c, with m = 100, m/(t(0)ε) = 10^5

[figure: N versus µ]

• confirms trade-off in choice of µ


• in practice, #iterations is in the tens; not very sensitive for µ ≥ 10

Interior-point methods 12–21


polynomial-time complexity of barrier method

• for µ = 1 + 1/√m:

  N = O( √m log(m/(t(0)ε)) )

• number of Newton iterations for fixed gap reduction is O(√m)
• multiply with cost of one Newton iteration (a polynomial function of
  problem dimensions), to get bound on number of flops

this choice of µ optimizes worst-case complexity; in practice we choose µ
fixed (µ = 10, . . . , 20)

Interior-point methods 12–22


Generalized inequalities

  minimize   f0(x)
  subject to fi(x) ⪯_{Ki} 0,   i = 1, . . . , m
             Ax = b

• f0 convex; fi : R^n → R^{ki}, i = 1, . . . , m, convex with respect to proper
  cones Ki ⊆ R^{ki}
• fi twice continuously differentiable
• A ∈ R^{p×n} with rank A = p
• we assume p? is finite and attained
• we assume problem is strictly feasible; hence strong duality holds and
dual optimum is attained

examples of greatest interest: SOCP, SDP

Interior-point methods 12–23


Generalized logarithm for proper cone

ψ : R^q → R is generalized logarithm for proper cone K ⊆ R^q if:

• dom ψ = int K and ∇²ψ(y) ≺ 0 for y ≻_K 0
• ψ(sy) = ψ(y) + θ log s for y ≻_K 0, s > 0 (θ is the degree of ψ)

examples

• nonnegative orthant K = R^n_+: ψ(y) = ∑_{i=1}^{n} log yi, with degree θ = n
• positive semidefinite cone K = S^n_+:

  ψ(Y ) = log det Y    (θ = n)

• second-order cone K = {y ∈ R^{n+1} | (y1² + · · · + yn²)^{1/2} ≤ y_{n+1}}:

  ψ(y) = log(y_{n+1}² − y1² − · · · − yn²)    (θ = 2)

Interior-point methods 12–24


properties (without proof): for y ≻_K 0,

  ∇ψ(y) ⪰_{K∗} 0,   y^T ∇ψ(y) = θ

• nonnegative orthant R^n_+: ψ(y) = ∑_{i=1}^{n} log yi

  ∇ψ(y) = (1/y1, . . . , 1/yn),   y^T ∇ψ(y) = n

• positive semidefinite cone S^n_+: ψ(Y ) = log det Y

  ∇ψ(Y ) = Y^{−1},   tr(Y ∇ψ(Y )) = n

• second-order cone K = {y ∈ R^{n+1} | (y1² + · · · + yn²)^{1/2} ≤ y_{n+1}}:

  ∇ψ(y) = (2/(y_{n+1}² − y1² − · · · − yn²)) (−y1, . . . , −yn, y_{n+1}),   y^T ∇ψ(y) = 2
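the semidefinite-cone case can be checked numerically (a small verification, not part of the original slides): the gradient identity tr(Y ∇ψ(Y)) = n and the logarithm property ψ(sY) = ψ(Y) + n log s.

```python
import numpy as np

# psi(Y) = log det Y on S^n_++: grad psi(Y) = Y^{-1}, degree theta = n
rng = np.random.default_rng(6)
n = 4
M = rng.standard_normal((n, n))
Y = M @ M.T + n * np.eye(n)                    # a positive definite point

grad = np.linalg.inv(Y)
assert np.isclose(np.trace(Y @ grad), n)       # y^T grad psi(y) = theta

# logarithm property: psi(sY) = psi(Y) + theta * log s
s = 2.5
logdet = lambda X: np.linalg.slogdet(X)[1]
assert np.isclose(logdet(s * Y), logdet(Y) + n * np.log(s))
```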

Interior-point methods 12–25


Logarithmic barrier and central path

logarithmic barrier for f1(x) ⪯_{K1} 0, . . . , fm(x) ⪯_{Km} 0:

  φ(x) = −∑_{i=1}^{m} ψi(−fi(x)),   dom φ = {x | fi(x) ≺_{Ki} 0, i = 1, . . . , m}

• ψi is generalized logarithm for Ki, with degree θi
• φ is convex, twice continuously differentiable

central path: {x⋆(t) | t > 0} where x⋆(t) solves

  minimize   tf0(x) + φ(x)
  subject to Ax = b

Interior-point methods 12–26


Dual points on central path

x = x⋆(t) if there exists w ∈ R^p,

  t∇f0(x) + ∑_{i=1}^{m} Dfi(x)^T ∇ψi(−fi(x)) + A^T w = 0

(Dfi(x) ∈ R^{ki×n} is derivative matrix of fi)

• therefore, x⋆(t) minimizes Lagrangian L(x, λ⋆(t), ν⋆(t)), where

  λi⋆(t) = (1/t) ∇ψi(−fi(x⋆(t))),   ν⋆(t) = w/t

• from properties of ψi: λi⋆(t) ≻_{Ki∗} 0, with duality gap

  f0(x⋆(t)) − g(λ⋆(t), ν⋆(t)) = (1/t) ∑_{i=1}^{m} θi

Interior-point methods 12–27


example: semidefinite programming (with Fi ∈ S^p)

  minimize   c^T x
  subject to F (x) = ∑_{i=1}^{n} xi Fi + G ⪯ 0

• logarithmic barrier: φ(x) = log det(−F (x)^{−1})
• central path: x⋆(t) minimizes tc^T x − log det(−F (x)); hence

  tci − tr(Fi F (x⋆(t))^{−1}) = 0,   i = 1, . . . , n

• dual point on central path: Z⋆(t) = −(1/t) F (x⋆(t))^{−1} is feasible for

  maximize   tr(GZ)
  subject to tr(Fi Z) + ci = 0,   i = 1, . . . , n
             Z ⪰ 0

• duality gap on central path: c^T x⋆(t) − tr(GZ⋆(t)) = p/t

Interior-point methods 12–28


Barrier method

given strictly feasible x, t := t(0) > 0, µ > 1, tolerance ε > 0.

repeat
1. Centering step. Compute x⋆(t) by minimizing tf0 + φ, subject to Ax = b.
2. Update. x := x⋆(t).
3. Stopping criterion. quit if (∑i θi)/t < ε.
4. Increase t. t := µt.

• only difference is duality gap m/t on central path is replaced by (∑i θi)/t
• number of outer iterations:

  ⌈ log((∑i θi)/(εt(0))) / log µ ⌉

• complexity analysis via self-concordance applies to SDP, SOCP

Interior-point methods 12–29


Examples

second-order cone program (50 variables, 50 SOC constraints in R6)

[figures: duality gap versus Newton iterations for µ = 2, 50, 200 (left);
total Newton iterations versus µ (right)]

semidefinite program (100 variables, LMI constraint in S100)

[figures: duality gap versus Newton iterations for µ = 2, 50, 150 (left);
total Newton iterations versus µ (right)]

Interior-point methods 12–30
Interior-point methods 12–30
family of SDPs (A ∈ S^n, x ∈ R^n)

  minimize   1^T x
  subject to A + diag(x) ⪰ 0

n = 10, . . . , 1000; for each n solve 100 randomly generated instances

[figure: number of Newton iterations versus n]

Interior-point methods 12–31


Primal-dual interior-point methods

more efficient than barrier method when high accuracy is needed

• update primal and dual variables at each iteration; no distinction


between inner and outer iterations
• often exhibit superlinear asymptotic convergence
• search directions can be interpreted as Newton directions for modified
KKT conditions
• can start at infeasible points
• cost per iteration same as barrier method

Interior-point methods 12–32


Convex Optimization — Boyd & Vandenberghe

13. Conclusions

• main ideas of the course


• importance of modeling in optimization

13–1
Modeling

mathematical optimization

• problems in engineering design, data analysis and statistics, economics,


management, . . . , can often be expressed as mathematical
optimization problems
• techniques exist to take into account multiple objectives or uncertainty
in the data

tractability

• roughly speaking, tractability in optimization requires convexity


• algorithms for nonconvex optimization find local (suboptimal) solutions,
or are very expensive
• surprisingly many applications can be formulated as convex problems

Conclusions 13–2
Theoretical consequences of convexity

local optima are global

extensive duality theory

• systematic way of deriving lower bounds on optimal value


• necessary and sufficient optimality conditions
• certificates of infeasibility
• sensitivity analysis

Conclusions 13–3
Practical consequences of convexity

convex problems can be solved globally and efficiently

• algorithms with a polynomial worst-case complexity

• basic algorithms (e.g., Newton, barrier method, . . . ) are easy to


implement and work well for small and medium size problems (larger
problems if structure is exploited)

• more and more high-quality implementations of advanced algorithms


and modeling tools are becoming available

Conclusions 13–4
Convex optimization examples

• force/moment generation with thrusters
• minimum-time optimal control
• optimal transmitter power allocation
• phased-array antenna beamforming
• optimal receiver location
• power allocation in FDM system
• optimizing structural dynamics
1
Force/moment generation with thrusters

• rigid body with center of mass at origin p = 0 ∈ R2

• n forces with magnitude ui, acting at pi = (pix, piy ), in direction θi

[figure: force of magnitude ui applied at (pix, piy ) in direction θi]

Convex optimization examples 2


• resulting horizontal force: Fx = Σ_{i=1}^n ui cos θi

• resulting vertical force: Fy = Σ_{i=1}^n ui sin θi

• resulting torque: T = Σ_{i=1}^n (piy ui cos θi − pix ui sin θi)

• force limits: 0 ≤ ui ≤ 1 (thrusters)

• fuel usage: u1 + · · · + un

problem: find thruster forces ui that yield given desired forces and torques
Fxdes, Fydes, T des, and minimize fuel usage (if feasible)

Convex optimization examples 3


can be expressed as LP:

minimize   1T u
subject to F u = f des
           0 ≤ ui ≤ 1, i = 1, . . . , n

where

    [     cos θ1                 · · ·        cos θn           ]
F = [     sin θ1                 · · ·        sin θn           ]
    [ p1y cos θ1 − p1x sin θ1    · · ·   pny cos θn − pnx sin θn ]

f des = (Fxdes, Fydes, T des),   1 = (1, 1, . . . , 1)

Convex optimization examples 4
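The LP above can be solved directly with an off-the-shelf LP solver. Below is a minimal Python sketch using scipy.optimize.linprog; the thruster positions, directions, and desired wrench are made-up illustration data, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical thruster geometry: positions p_i and firing directions theta_i
p = np.array([[1, 0], [1, 0], [-1, 0], [-1, 0], [0, 1], [0, 1]], float)
theta = np.array([np.pi / 2, -np.pi / 2, np.pi / 2, -np.pi / 2, 0, np.pi])
n = len(theta)

# rows of F: net horizontal force, vertical force, and torque per unit u_i
F = np.vstack([
    np.cos(theta),
    np.sin(theta),
    p[:, 1] * np.cos(theta) - p[:, 0] * np.sin(theta),
])
f_des = np.array([0.3, 0.2, 0.1])          # desired (Fx, Fy, T), made up

# minimize fuel 1^T u  subject to  F u = f_des,  0 <= u_i <= 1
res = linprog(c=np.ones(n), A_eq=F, b_eq=f_des, bounds=[(0, 1)] * n)
print(res.x)
```

If the desired forces and torque are not achievable within the force limits, linprog reports infeasibility instead of returning a point.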


Extensions of thruster problem

• opposing thruster pairs:

  minimize   ‖u‖1 = Σ_{i=1}^n |ui|
  subject to F u = f des
             |ui| ≤ 1, i = 1, . . . , n

  can express as LP

• more accurate fuel use model:

  minimize   Σ_{i=1}^n φi(ui)
  subject to F u = f des
             0 ≤ ui ≤ 1, i = 1, . . . , n

  φi are piecewise-linear increasing convex functions

  can express as LP

Convex optimization examples 5


• minimize maximum force/moment error:

  minimize   ‖F u − f des‖∞
  subject to 0 ≤ ui ≤ 1, i = 1, . . . , n

  can express as LP

• minimize number of thrusters used:

  minimize   # thrusters on
  subject to F u = f des
             0 ≤ ui ≤ 1, i = 1, . . . , n

  can’t express as LP
  (but we could check feasibility of each of the 2^n subsets of thrusters)

Convex optimization examples 6


Minimum-time optimal control

• linear dynamical system:

  x(t + 1) = Ax(t) + Bu(t),   t = 0, 1, . . . , K,   x(0) = x0

• inputs limited to range [−1, 1]:

  ‖u(t)‖∞ ≤ 1,   t = 0, 1, . . . , K

• settling time:

  f (u(0), . . . , u(K)) = min {T | x(t) = 0 for T ≤ t ≤ K + 1}

Convex optimization examples 7


settling time f is quasiconvex function of (u(0), . . . , u(K)):

f (u(0), u(1), . . . , u(K)) ≤ T

if and only if for all t = T, . . . , K + 1

x(t) = A^t x0 + A^{t−1} B u(0) + · · · + B u(t − 1) = 0

i.e., sublevel sets are affine


minimum-time optimal control problem:

minimize   f (u(0), u(1), . . . , u(K))
subject to ‖u(t)‖∞ ≤ 1,   t = 0, . . . , K

with variables u(0), . . . , u(K)

a quasiconvex problem; can be solved via bisection with LPs

Convex optimization examples 8
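The key step is that, for a fixed candidate settling time T, checking f ≤ T is a linear feasibility problem. A Python sketch (scipy), on a made-up double-integrator system rather than the spring example below; since feasibility is monotone in T, a simple scan over T stands in for bisection:

```python
import numpy as np
from scipy.optimize import linprog

def settles_by(T, A, B, x0):
    """Feasibility LP: do inputs with |u(t)_i| <= 1 exist giving x(T) = 0?

    Stacking u(0), ..., u(T-1): x(T) = A^T x0 + sum_s A^(T-1-s) B u(s)."""
    n, m = B.shape
    if T == 0:
        return bool(np.allclose(x0, 0))
    blocks = [np.linalg.matrix_power(A, T - 1 - s) @ B for s in range(T)]
    res = linprog(c=np.zeros(m * T),
                  A_eq=np.hstack(blocks),
                  b_eq=-np.linalg.matrix_power(A, T) @ x0,
                  bounds=[(-1, 1)] * (m * T))
    return res.status == 0     # status 2 means the LP is infeasible

# hypothetical double integrator, driven from (-3, 0) to the origin
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
x0 = np.array([-3.0, 0.0])

T_opt = next(T for T in range(1, 50) if settles_by(T, A, B, x0))
print(T_opt)
```

Bisection on T (rather than a linear scan) needs only O(log K) feasibility LPs.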


Minimum-time control example

three unit masses, connected by two unit springs with equilibrium length one

u(t) ∈ R2 is force on left & right masses over time interval (0.15t, 0.15(t + 1)]

[figure: forces u1(t) and u2(t) applied to the left and right masses]

problem: pick u(0), . . . , u(K) to bring masses to positions (−1, 0, 1) (at rest),
as quickly as possible, from initial condition (−3, 0, 2) (at rest)

Convex optimization examples 9


optimal solution:

[figure: optimal inputs u1(t) and u2(t) (switching between ±1) and the resulting
positions of the left, center, and right masses, for 0 ≤ t ≤ 5]

Convex optimization examples 10


Optimal transmitter power allocation

• m transmitters, mn receivers, all at same frequency

• transmitter i wants to transmit to n receivers labeled (i, j), j = 1, . . . , n

[figure: transmitters i and k and receiver (i, j)]

• Aijk is path gain from transmitter k to receiver (i, j)

• Nij is (self) noise power of receiver (i, j)

• variables: transmitter powers pk , k = 1, . . . , m

Convex optimization examples 11


at receiver (i, j):

• signal power:

  Sij = Aiji pi

• noise plus interference power:

  Iij = Σ_{k≠i} Aijk pk + Nij

• signal to interference/noise ratio (SINR): Sij /Iij


problem: choose pi to maximize smallest SINR:

maximize   min_{i,j}  Aiji pi / ( Σ_{k≠i} Aijk pk + Nij )
subject to 0 ≤ pi ≤ pmax

. . . a (generalized) linear fractional program

Convex optimization examples 12
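A generalized linear fractional program can be solved by bisection: for a fixed level t, the constraints SINRij ≥ t are linear in p. A Python sketch with made-up path gains and noise powers:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 3, 2                      # transmitters, receivers per transmitter
A = rng.uniform(0.05, 0.2, (m, n, m))
for i in range(m):
    A[i, :, i] = rng.uniform(0.8, 1.0, n)   # direct-path gains dominate
N = 0.05 * np.ones((m, n))
p_max = 1.0

def feasible(t):
    """LP feasibility: does some 0 <= p <= p_max give SINR_ij >= t for all (i,j)?"""
    # t * (sum_{k != i} A[i,j,k] p_k + N[i,j]) - A[i,j,i] p_i <= 0
    G, h = [], []
    for i in range(m):
        for j in range(n):
            row = t * A[i, j, :].copy()
            row[i] = -A[i, j, i]            # the k = i term carries no factor t
            G.append(row)
            h.append(-t * N[i, j])
    res = linprog(np.zeros(m), A_ub=np.array(G), b_ub=np.array(h),
                  bounds=[(0, p_max)] * m)
    return res.status == 0

lo, hi = 0.0, 100.0
for _ in range(40):              # bisection on the worst-case SINR level t
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
print(lo)                        # approximately the optimal min-SINR
```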


Phased-array antenna beamforming

[figure: antenna elements at positions (xi, yi); plane wave incident from angle θ]

• omnidirectional antenna elements at positions (x1, y1), . . . , (xn, yn)

• unit plane wave incident from angle θ induces in ith element a signal

  e^{j(xi cos θ + yi sin θ − ωt)}

  (j = √−1, frequency ω, wavelength 2π)
Convex optimization examples 13


• demodulate to get output e^{j(xi cos θ + yi sin θ)} ∈ C

• linearly combine with complex weights wi:

  y(θ) = Σ_{i=1}^n wi e^{j(xi cos θ + yi sin θ)}

• y(θ) is (complex) antenna array gain pattern

• |y(θ)| gives sensitivity of array as function of incident angle θ

• depends on design variables Re w, Im w
  (called antenna array weights or shading coefficients)

design problem: choose w to achieve desired gain pattern

Convex optimization examples 14


Sidelobe level minimization

make |y(θ)| small for |θ − θtar | > α
(θtar : target direction; 2α: beamwidth)

via least-squares (discretize angles):

minimize   Σi |y(θi)|^2
subject to y(θtar ) = 1

(sum is over angles outside beam)

least-squares problem with two (real) linear equality constraints

Convex optimization examples 15


[figure: gain pattern |y(θ)| of the least-squares design; θtar = 30◦, beamwidth 10◦,
sidelobe level indicated]
Convex optimization examples 16


minimize sidelobe level (discretize angles):

minimize   maxi |y(θi)|
subject to y(θtar ) = 1

(max over angles outside beam)

can be cast as SOCP:

minimize   t
subject to |y(θi)| ≤ t
           y(θtar ) = 1

Convex optimization examples 17


[figure: gain pattern |y(θ)| of the minimax design; θtar = 30◦, beamwidth 10◦,
sidelobe level indicated]
PSfrag replacements

Convex optimization examples 18


Extensions

convex (& quasiconvex) extensions:

• y(θ0) = 0 (null in direction θ0)

• w is real (amplitude only shading)

• |wi| ≤ 1 (attenuation only shading)

• minimize σ^2 Σ_{i=1}^n |wi|^2 (thermal noise power in y)

• minimize beamwidth given a maximum sidelobe level

nonconvex extension:

• maximize number of zero weights
Convex optimization examples 19


Optimal receiver location

• N transmitter frequencies 1, . . . , N
• transmitters at locations ai, bi ∈ R2 use frequency i
• transmitters at a1, a2, . . . , aN are the wanted ones
• transmitters at b1, b2, . . . , bN are interfering
• receiver at position x ∈ R2

[figure: wanted transmitters a1, a2, a3, interfering transmitters b1, b2, b3, and
the receiver at x in the plane]

Convex optimization examples 20


• (signal) receiver power from ai: ‖x − ai‖^−α (α ≈ 2.1)

• (interfering) receiver power from bi: ‖x − bi‖^−α (α ≈ 2.1)

• worst signal to interference ratio, over all frequencies, is

  S/I = min_i  ‖x − ai‖^−α / ‖x − bi‖^−α

• what receiver location x maximizes S/I?

Convex optimization examples 21


S/I is quasiconcave on {x | S/I ≥ 1}, i.e., on

{x | ‖x − ai‖ ≤ ‖x − bi‖, i = 1, . . . , N }

[figure: the region where S/I ≥ 1, bounded by the bisectors between each ai and bi]

can use bisection; every iteration is a convex quadratic feasibility problem

Convex optimization examples 22
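The bisection can be sketched as follows, with made-up transmitter locations. For t ≥ 1, S/I ≥ t at x means ‖x − ai‖² ≤ t^(−2/α)‖x − bi‖² for every i, an intersection of disks; here a coarse grid check stands in for the convex quadratic feasibility solve, to keep the sketch dependency-free:

```python
import numpy as np

# hypothetical transmitter locations (wanted a_i, interfering b_i)
a = np.array([[0.0, 0.0], [2.0, 0.5], [1.0, 2.0]])
b = np.array([[3.0, 3.0], [-2.0, 2.5], [2.5, -2.0]])
alpha = 2.1

def feasible(t):
    """Is there an x with S/I >= t?  For t >= 1 each condition
    ||x - a_i||^2 <= t**(-2/alpha) ||x - b_i||^2 is a disk; a grid search
    stands in for the convex feasibility problem of the slides."""
    s2 = t ** (-2.0 / alpha)
    X, Y = np.meshgrid(np.linspace(-4, 4, 201), np.linspace(-4, 4, 201))
    ok = np.ones_like(X, dtype=bool)
    for ai, bi in zip(a, b):
        ok &= ((X - ai[0]) ** 2 + (Y - ai[1]) ** 2
               <= s2 * ((X - bi[0]) ** 2 + (Y - bi[1]) ** 2))
    return bool(ok.any())

lo, hi = 1.0, 1e4
for _ in range(40):                 # bisection (geometric) on the S/I level t
    mid = np.sqrt(lo * hi)
    lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
print(lo)                           # approximately the best achievable S/I
```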


Power allocation in FDM system

frequency division multiplex (FDM) system

[diagram: signals u1, . . . , un are modulated onto carriers, sent over the channel,
and demodulated to outputs y1, . . . , yn]

• signal ui modulates carrier frequency fi with power pi

• channel is slightly nonlinear

• powers affect signal power, interference power at each yi

• problem: choose powers to maximize minimum SINR
  (signal to noise & interference ratio)

Convex optimization examples 23


• demodulated signal power in yi proportional to pi

• noise power in yi is σi^2

• interference power in yi is sum of crosstalk & intermodulation products
  from nonlinearity

• crosstalk power ci is linear in powers:

  c = Cp,   Cij ≥ 0

  C is often tridiagonal, i.e., have crosstalk from adjacent channels only

Convex optimization examples 24


• intermodulation power: kth order IM products have frequencies

  ±fi1 ± fi2 ± · · · ± fik

  with power proportional to pi1 pi2 · · · pik

  e.g., for frequencies 1, 2, 3:

  frequency   IM product   pwr. prop. to
  2           1+1          p1^2
  3           1+2          p1 p2
  1           2−1          p2 p1
  1           3−2          p3 p2
  2           3−1          p3 p1
  3           1+1+1        p1^3
  1           1+1−1        p1^3
  2           2+1−1        p2 p1^2
  ...         ...          ...

• total IM power at fi is (complicated) polynomial of p1, . . . , pn, with
  nonnegative coefficients

Convex optimization examples 25


inverse SINR at frequency i,

  (noise + crosstalk + IM power) / (signal power),

is posynomial function of p1, . . . , pn

hence, problem such as

maximize   mini SINRi
subject to 0 < pi ≤ Pmax

is geometric program

Convex optimization examples 26
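The geometric program becomes convex under the change of variables p = e^y: the log of each inverse SINR (a posynomial) is then a log-sum-exp, hence convex. A toy Python sketch with crosstalk only (no IM terms) and made-up numbers, smoothing the max with one more log-sum-exp so a generic smooth solver applies:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# toy 2-channel instance (hypothetical data): crosstalk only
sigma2 = np.array([0.1, 0.1])
C = np.array([[0.0, 0.2], [0.2, 0.0]])   # crosstalk from the adjacent channel
P_max = 1.0

def worst_inverse_sinr_log(y):
    """log of max_i (sigma_i^2 + (C p)_i) / p_i with p = e^y.

    Each inverse SINR is a posynomial, so its log is convex in y; the max
    is smoothed with a scaled log-sum-exp, which keeps convexity."""
    p = np.exp(y)
    inv_sinr_log = np.log(sigma2 + C @ p) - y
    return logsumexp(50 * inv_sinr_log) / 50

res = minimize(worst_inverse_sinr_log, x0=np.log([0.5, 0.5]),
               bounds=[(None, np.log(P_max))] * 2)
p_opt = np.exp(res.x)
print(p_opt, 1 / np.exp(res.fun))   # optimal powers, (approx) worst-case SINR
```

By symmetry the optimum here uses full power on both channels; real GP solvers handle the exact (unsmoothed) problem.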


Optimizing structural dynamics

linear elastic structure

[figure: structure loaded by forces f1, f2, f3, f4]

dynamics (ignoring damping): M d̈ + Kd = 0

• d(t) ∈ Rk : vector of displacements
• M = M T ≻ 0 is mass matrix; K = K T ≻ 0 is stiffness matrix

Convex optimization examples 27


Fundamental frequency

• solutions have form

  di(t) = Σ_{j=1}^k αij cos(ωj t − φj )

  where 0 ≤ ω1 ≤ ω2 ≤ · · · ≤ ωk are the modal frequencies, i.e., positive
  solutions of det(ω^2 M − K) = 0

• fundamental frequency:

  ω1 = λmin(K, M )^{1/2} = λmin(M^{−1/2} K M^{−1/2})^{1/2}

  – structure behaves like mass at frequencies below ω1

  – gives stiffness measure (the larger ω1, the stiffer the structure)

• ω1 ≥ Ω ⇐⇒ Ω^2 M − K ⪯ 0, so ω1 is quasiconcave function of M , K

Convex optimization examples 28


• design variables: xi, cross-sectional area of structural member i
  (geometry of structure fixed)

• M (x) = M0 + Σ_i xi Mi ,   K(x) = K0 + Σ_i xi Ki

• structure weight w = w0 + Σ_i xi wi

• problem: minimize weight s.t. ω1 ≥ Ω, limits on cross-sectional areas

as SDP:

minimize   w0 + Σ_i xi wi
subject to Ω^2 M (x) − K(x) ⪯ 0
           li ≤ xi ≤ ui

Convex optimization examples 29
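The LMI characterization is easy to check numerically: ω1² is the smallest generalized eigenvalue of (K, M), and Ω²M − K ⪯ 0 exactly when Ω ≤ ω1. A quick Python check on random positive definite matrices (placeholder data, not a real structure):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
k = 4
# random positive definite mass and stiffness matrices (hypothetical)
Q = rng.standard_normal((k, k)); M = Q @ Q.T + k * np.eye(k)
R = rng.standard_normal((k, k)); K = R @ R.T + k * np.eye(k)

# fundamental frequency: sqrt of the smallest generalized eigenvalue of (K, M)
omega1 = np.sqrt(eigh(K, M, eigvals_only=True)[0])

# LMI test: omega1 >= Omega  <=>  Omega^2 M - K is negative semidefinite
def lmi_holds(Omega):
    return bool(np.all(np.linalg.eigvalsh(Omega ** 2 * M - K) <= 1e-9))

print(omega1, lmi_holds(0.9 * omega1), lmi_holds(1.1 * omega1))
```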


Filter design

• FIR filters

• Chebychev design

• linear phase filter design

• equalizer design

• filter magnitude specifications
FIR filters

finite impulse response (FIR) filter:

y(t) = Σ_{τ=0}^{n−1} hτ u(t − τ),   t ∈ Z

• (sequence) u : Z → R is input signal
• (sequence) y : Z → R is output signal
• hi are called filter coefficients
• n is filter order or length

Filter design 2
filter frequency response: H : R → C

H(ω) = h0 + h1 e^{−jω} + · · · + h_{n−1} e^{−j(n−1)ω}

     = Σ_{t=0}^{n−1} ht cos tω − j Σ_{t=0}^{n−1} ht sin tω

• j means √−1 here (EE tradition)
• H is periodic and conjugate symmetric, so only need to know/specify
  for 0 ≤ ω ≤ π

FIR filter design problem: choose h so it and H satisfy/optimize specs

Filter design 3
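Evaluating the frequency response on a grid is a one-liner in Python with scipy.signal.freqz; the 5-tap moving average below is just a hypothetical example filter:

```python
import numpy as np
from scipy.signal import freqz

# a small hypothetical FIR filter: 5-tap moving average
h = np.ones(5) / 5

# H(omega) = sum_t h_t e^{-j t omega}, evaluated on a grid of [0, pi)
omega, H = freqz(h, worN=512)

print(abs(H[0]))        # DC gain equals the sum of the coefficients
```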
example: (lowpass) FIR filter, order n = 21

impulse response h:

[figure: h(t) for t = 0, . . . , 20; values roughly within ±0.2]
Filter design 4
frequency response magnitude (i.e., |H(ω)|):

[figure: |H(ω)| on a log scale (10^−3 to 10), 0 ≤ ω ≤ π]

frequency response phase (i.e., ∠H(ω)):

[figure: ∠H(ω), 0 ≤ ω ≤ π]
Filter design 5
Chebychev design

minimize   max_{ω∈[0,π]} |H(ω) − Hdes(ω)|

• h is optimization variable
• Hdes : R → C is (given) desired transfer function
• convex problem
• can add constraints, e.g., |hi| ≤ 1

sample (discretize) frequency:

minimize   max_{k=1,...,m} |H(ωk ) − Hdes(ωk )|

• sample points 0 ≤ ω1 < · · · < ωm ≤ π are fixed (e.g., ωk = kπ/m)

• m ≫ n (common rule-of-thumb: m = 15n)
• yields approximation (relaxation) of problem above

Filter design 6
Chebychev design via SOCP:

minimize   t
subject to ‖A^{(k)} h − b^{(k)}‖ ≤ t,   k = 1, . . . , m

where

A^{(k)} = [ 1    cos ωk    · · ·    cos(n−1)ωk ]
          [ 0   −sin ωk    · · ·   −sin(n−1)ωk ]

b^{(k)} = [ Re Hdes(ωk ) ]
          [ Im Hdes(ωk ) ]

h = (h0, . . . , h_{n−1})

Filter design 7
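The SOCP data are easy to sanity-check: the two rows of A^(k) applied to h must reproduce the real and imaginary parts of H(ωk). A quick numeric verification (arbitrary test coefficients):

```python
import numpy as np

n = 8
rng = np.random.default_rng(3)
h = 0.2 * rng.standard_normal(n)        # arbitrary test coefficients

def A_k(w):
    # the 2 x n matrix whose rows pick out Re H(w) and Im H(w)
    t = np.arange(n)
    return np.vstack([np.cos(t * w), -np.sin(t * w)])

# A(k) h must equal (Re H, Im H) for H(w) = sum_t h_t e^{-jtw}
for w in np.linspace(0, np.pi, 5):
    H = np.sum(h * np.exp(-1j * np.arange(n) * w))
    assert np.allclose(A_k(w) @ h, [H.real, H.imag])
print("ok")
```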
Linear phase filters

suppose
• n = 2N + 1 is odd
• impulse response is symmetric about midpoint:

  ht = h_{n−1−t},   t = 0, . . . , n − 1

then

H(ω) = h0 + h1 e^{−jω} + · · · + h_{n−1} e^{−j(n−1)ω}

     = e^{−jNω} (2h0 cos Nω + 2h1 cos(N−1)ω + · · · + hN )

     = e^{−jNω} H̃(ω)   (this defines H̃)

Filter design 8
• term e^{−jNω} represents N -sample delay

• H̃(ω) is real

• |H̃(ω)| = |H(ω)|

• called linear phase filter (∠H(ω) is linear except for jumps of ±π)

Filter design 9
Lowpass filter specifications

[figure: lowpass spec mask — |H(ω)| between 1/δ1 and δ1 on the passband [0, ωp],
below δ2 on the stopband [ωs, π]]

idea:
• pass frequencies in passband [0, ωp]
• block frequencies in stopband [ωs, π]
Filter design 10
specifications:
• maximum passband ripple (±20 log10 δ1 in dB):

  1/δ1 ≤ |H(ω)| ≤ δ1,   0 ≤ ω ≤ ωp

• minimum stopband attenuation (−20 log10 δ2 in dB):

  |H(ω)| ≤ δ2,   ωs ≤ ω ≤ π

Filter design 11
Linear phase lowpass filter design

• sample frequency

• can assume wlog H̃(0) > 0, so ripple spec is

  1/δ1 ≤ H̃(ωk ) ≤ δ1

design for maximum stopband attenuation:

minimize   δ2
subject to 1/δ1 ≤ H̃(ωk ) ≤ δ1,   0 ≤ ωk ≤ ωp
           −δ2 ≤ H̃(ωk ) ≤ δ2,   ωs ≤ ωk ≤ π

Filter design 12
• passband ripple δ1 is given

• an LP in variables h, δ2

• known (and used) since 1960’s

• can add other constraints, e.g., |hi| ≤ α

variations and extensions:

• fix δ2, minimize δ1 (convex, but not LP)

• fix δ1 and δ2, minimize ωs (quasiconvex)

• fix δ1 and δ2, minimize order n (quasiconvex)

Filter design 13
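The LP can be written out directly. In the Python sketch below, H̃(ω) is parameterized as Σ_{i=0}^N a_i cos(iω) (an equivalent reparameterization of the symmetric taps), with the specs taken from the example that follows and an assumed 300-point frequency grid:

```python
import numpy as np
from scipy.optimize import linprog

N = 10                                   # H~(w) = sum_{i=0}^{N} a_i cos(i w), n = 21 taps
delta1 = 1.012                           # given passband ripple (about +-0.1 dB)
wp, ws = 0.12 * np.pi, 0.24 * np.pi
omega = np.linspace(0, np.pi, 300)
C = np.cos(np.outer(omega, np.arange(N + 1)))   # rows c(w) = (1, cos w, ..., cos Nw)
pb, sb = omega <= wp, omega >= ws

# variables z = (a, d2): minimize d2 subject to the linear passband/stopband specs
A_ub, b_ub = [], []
for c in C[pb]:
    A_ub += [np.r_[c, 0], np.r_[-c, 0]]          # 1/delta1 <= c.a <= delta1
    b_ub += [delta1, -1 / delta1]
for c in C[sb]:
    A_ub += [np.r_[c, -1], np.r_[-c, -1]]        # -d2 <= c.a <= d2
    b_ub += [0.0, 0.0]

res = linprog(np.r_[np.zeros(N + 1), 1], A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * (N + 1) + [(0, None)])
print(res.fun)                                   # optimal stopband level delta2
```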
example
• linear phase filter, n = 21
• passband [0, 0.12π]; stopband [0.24π, π]
• max ripple δ1 = 1.012 (±0.1 dB)
• design for maximum stopband attenuation

impulse response h:

[figure: h(t) for t = 0, . . . , 20]

Filter design 14
frequency response magnitude (i.e., |H(ω)|):

[figure: |H(ω)| on a log scale (10^−3 to 10), 0 ≤ ω ≤ π]

frequency response phase (i.e., ∠H(ω)):

[figure: ∠H(ω), 0 ≤ ω ≤ π — linear except for jumps of ±π]

Filter design 15
Equalizer design

[diagram: cascade of equalizer H(ω) and unequalized system G(ω)]

equalization: given
• G (unequalized frequency response)
• Gdes (desired frequency response)

design (FIR equalizer) H so that G̃ ≜ GH ≈ Gdes

• common choice: Gdes(ω) = e^{−jDω} (delay),
  i.e., equalization is deconvolution (up to delay)
• can add constraints on H, e.g., limits on |hi| or maxω |H(ω)|

Filter design 16
Chebychev equalizer design:

minimize   max_{ω∈[0,π]} |G̃(ω) − Gdes(ω)|

convex; SOCP after sampling frequency

Filter design 17
time-domain equalization: optimize impulse response g̃ of equalized system

e.g., with Gdes(ω) = e^{−jDω},

gdes(t) = 1 for t = D,   gdes(t) = 0 for t ≠ D

sample design:

minimize   max_{t≠D} |g̃(t)|
subject to g̃(D) = 1

• an LP
• can use Σ_{t≠D} g̃(t)^2 or Σ_{t≠D} |g̃(t)| instead

Filter design 18
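The time-domain LP writes g̃ = conv(g, h) = T h with T a Toeplitz (convolution) matrix. A Python sketch with a made-up 10-tap system g, a 30-tap equalizer, and delay D = 10:

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.optimize import linprog

rng = np.random.default_rng(4)
g = rng.standard_normal(10) * np.exp(-0.3 * np.arange(10))  # hypothetical system
n, D = 30, 10                       # equalizer length, target delay

# g~ = conv(g, h) = T h, with T the (len(g)+n-1) x n convolution matrix
T = toeplitz(np.r_[g, np.zeros(n - 1)], np.zeros(n))
m = T.shape[0]
keep = np.arange(m) != D            # off-target indices t != D

# variables (h, q): minimize q  s.t.  |(T h)_t| <= q for t != D, (T h)_D = 1
A_ub = np.vstack([np.c_[T[keep], -np.ones(keep.sum())],
                  np.c_[-T[keep], -np.ones(keep.sum())]])
res = linprog(np.r_[np.zeros(n), 1], A_ub=A_ub, b_ub=np.zeros(2 * keep.sum()),
              A_eq=np.c_[T[D:D + 1], 0], b_eq=[1.0],
              bounds=[(None, None)] * n + [(0, None)])
h = res.x[:n]
print(res.fun)                      # worst off-target tap of the equalized system
```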
extensions:
• can impose (convex) constraints

• can mix time- and frequency-domain specifications

• can equalize multiple systems, i.e., choose H so

  G^{(k)} H ≈ Gdes,   k = 1, . . . , K

• can equalize multi-input multi-output systems (i.e., G and H are matrices)

• extends to multidimensional systems, e.g., image processing

Filter design 19
Equalizer design example

unequalized system G is 10th order FIR:

[figure: impulse response g(t), t = 0, . . . , 9]

Filter design 20
[figure: unequalized magnitude |G(ω)| and phase ∠G(ω), 0 ≤ ω ≤ π]

design 30th order FIR equalizer with G̃(ω) ≈ e^{−j10ω}

Filter design 21
Chebychev equalizer design:

minimize   max_ω |G̃(ω) − e^{−j10ω}|

equalized system impulse response g̃:

[figure: g̃(t), t = 0, . . . , 35 — approximately a unit impulse at t = 10]

Filter design 22
[figure: equalized frequency response magnitude |G̃(ω)| (≈ 1) and phase ∠G̃(ω),
0 ≤ ω ≤ π]

Filter design 23
time-domain equalizer design:

minimize   max_{t≠10} |g̃(t)|

equalized system impulse response g̃:

[figure: g̃(t), t = 0, . . . , 35 — approximately a unit impulse at t = 10]

Filter design 24
[figure: equalized frequency response magnitude |G̃(ω)| and phase ∠G̃(ω),
0 ≤ ω ≤ π]

Filter design 25
Filter magnitude specifications

transfer function magnitude spec has form

L(ω) ≤ |H(ω)| ≤ U (ω),   ω ∈ [0, π]

where L, U : R → R+ are given

• lower bound is not convex in filter coefficients h

• arises in many applications, e.g., audio, spectrum shaping

• can change variables to solve via convex optimization

Filter design 26
Autocorrelation coefficients

autocorrelation coefficients associated with impulse response
h = (h0, . . . , h_{n−1}) ∈ Rn are

rt = Σ_τ hτ hτ+t

(we take hk = 0 for k < 0 or k ≥ n)

• rt = r−t;  rt = 0 for |t| ≥ n

• hence suffices to specify r = (r0, . . . , r_{n−1}) ∈ Rn

Filter design 27
Fourier transform of autocorrelation coefficients is

R(ω) = Σ_τ e^{−jωτ} rτ = r0 + 2 Σ_{t=1}^{n−1} rt cos ωt = |H(ω)|^2

• always have R(ω) ≥ 0 for all ω

• can express magnitude specification as

  L(ω)^2 ≤ R(ω) ≤ U (ω)^2,   ω ∈ [0, π]

  . . . convex in r

Filter design 28
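The identity R(ω) = |H(ω)|² is easy to confirm numerically for any coefficient vector (the h below is arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
h = rng.standard_normal(n)

# autocorrelation coefficients r_t = sum_tau h_tau h_{tau+t}
r = np.array([np.dot(h[:n - t], h[t:]) for t in range(n)])

# R(w) = r_0 + 2 sum_{t>=1} r_t cos(w t) should equal |H(w)|^2
for w in np.linspace(0, np.pi, 7):
    R = r[0] + 2 * np.sum(r[1:] * np.cos(w * np.arange(1, n)))
    H = np.sum(h * np.exp(-1j * np.arange(n) * w))
    assert np.isclose(R, abs(H) ** 2)
print("R(w) = |H(w)|^2 verified")
```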
Spectral factorization

question: when is r ∈ Rn the autocorrelation coefficients of some h ∈ Rn?

answer: (spectral factorization theorem) if and only if R(ω) ≥ 0 for all ω

• spectral factorization condition is convex in r

• many algorithms for spectral factorization, i.e., finding an h s.t.
  R(ω) = |H(ω)|^2

magnitude design via autocorrelation coefficients:

• use r as variable (instead of h)
• add spec. fact. condition R(ω) ≥ 0 for all ω
• optimize over r
• use spectral factorization to recover h

Filter design 29
log-Chebychev magnitude design

choose h to minimize

max_ω |20 log10 |H(ω)| − 20 log10 D(ω)|

• D is desired transfer function magnitude (D(ω) > 0 for all ω)
• find minimax logarithmic (dB) fit

reformulate as

minimize   t
subject to D(ω)^2/t ≤ R(ω) ≤ t D(ω)^2,   0 ≤ ω ≤ π

• convex in variables r, t
• constraint includes spectral factorization condition

Filter design 30

example: 1/f (pink noise) fllter (i.e., D(ω) = 1/ ω), n = 50,
log-Chebychev design over 0.01… • ω • …
1
10

0
10
|H(ω)|2

−1
10

PSfrag replacements
−2
10 −1 0
10 10
ω

optimal flt: ±0.5dB

Filter design 31
Disciplined convex programming
and the cvx modeling framework
Michael C. Grant
Information Systems Laboratory
Stanford University

April 27, 2006


Agenda

Today’s topics:

• disciplined convex programming, a methodology for practical convex


optimization; and

• the cvx modeling system: a MATLAB-based software package which supports


this methodology

Disciplined convex programming and the cvx modeling framework 1


Where you are now

What you have seen thus far:

• Fundamental definitions of convex sets and convex/concave functions

• An extensive convexity calculus: a collection of operations, combinations and


transformations that are known to preserve convexity

• An introduction to convex optimization problems, or convex programs

• Several specific classes of convex programs: LPs, SDPs, GPs, SOCPs... hopefully
you can identify a problem from one of these classes when you see one

You have not yet seen:

• How convex optimization problems are solved

(well, that and a few other things)

Disciplined convex programming and the cvx modeling framework 2


Solvers

A solver is an engine for solving a particular type of mathematical problem, such as a


convex program

Solvers typically handle only a certain class of problems, such as LPs, or SDPs, or GPs

They also require that problems be expressed in a standard form

Most problems do not immediately present themselves in standard form, so they must
be transformed into standard form

Disciplined convex programming and the cvx modeling framework 3


Solver example: MATLAB’s linprog

A program for solving linear programs:

x = linprog( c, A, b, A_eq, B_eq, l, u )

Problems must be expressed in the following standard form:

minimize cT x
subject to Ax  b
Aeqx = beq
`xu

As standard forms go, this one is quite flexible: in fact, the first thing linprog often
does is convert this problem into a more restricted standard form!

Disciplined convex programming and the cvx modeling framework 4


Conversion to standard form: common tricks

Representing free variables as the difference of nonnegative variables:

x free =⇒ x+ − x−, x+ ≥ 0, x− ≥ 0

Eliminating inequality constraints using slack variables:

aT x ≤ b =⇒ aT x + s = b, s ≥ 0

Splitting equality constraints into inequalities:

aT x = b =⇒ aT x ≤ b, aT x ≥ b

Disciplined convex programming and the cvx modeling framework 5


Solver example: SeDuMi

A program for solving LPs, SOCPs, SDPs, and related problems:

x = sedumi( A, b, c, K )

Solves problems of the form:

minimize   cT x
subject to Ax = b
           x ∈ K ≜ K1 × K2 × · · · × KL

where each set Ki ⊆ R^{ni}, i = 1, 2, . . . , L is chosen from a very short list of cones (see
next slide...)

The Matlab variable K gives the number, types, and dimensions of the cones Ki

Disciplined convex programming and the cvx modeling framework 6


Cones supported by SeDuMi

• free variables: R^{ni}

• a nonnegative orthant: R^{ni}_+ (for linear inequalities)

• a real or complex second-order cone:

  Q^n ≜ { (x, y) ∈ Rn × R | ‖x‖2 ≤ y }
  Q^n_c ≜ { (x, y) ∈ Cn × R | ‖x‖2 ≤ y }

• a real or complex semidefinite cone:

  S^n_+ ≜ { X ∈ Rn×n | X = X T , X ⪰ 0 }
  H^n_+ ≜ { X ∈ Cn×n | X = X H , X ⪰ 0 }

The cones must be arranged in this order: i.e., the free variables first, then the
nonnegative orthants, then the second-order cones, then the semidefinite cones

Disciplined convex programming and the cvx modeling framework 7


Example: Norm approximation

Consider the problem


minimize kAx − bk

An optimal value x∗ minimizes the residuals

rk ≜ aTk x − bk ,   k = 1, 2, . . . , m

according to the measure defined by the norm ‖ · ‖

Obviously, the value of x∗ depends significantly upon the choice of that norm...

...and so does the process of conversion to standard form

Disciplined convex programming and the cvx modeling framework 8


Euclidean (`2) norm

minimize   ‖Ax − b‖2 = ( Σ_{k=1}^m (aTk x − bk )^2 )^{1/2}

No need to use linprog or SeDuMi here: this is a least squares problem, with an
analytic solution

x∗ = (AT A)−1AT b

In MATLAB or Octave, a single command computes the solution:

>> x = A \ b;

Disciplined convex programming and the cvx modeling framework 9
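The same computation in Python (NumPy), for readers following along outside MATLAB; A and b are random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# normal-equations solution x* = (A^T A)^{-1} A^T b ...
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
# ... matches the factorization-based solver (the analogue of MATLAB's backslash)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_ne, x_ls))
```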


Chebyshev (`∞) norm

minimize   f (Ax − b) ≜ ‖Ax − b‖∞ = max_{1≤k≤m} |aTk x − bk |

This can be expressed as a linear program in the variables (x, q):

minimize   q
subject to −q1 ≤ Ax − b ≤ +q1

i.e., in standard form,

minimize   [0; 1]T [x; q]
subject to [ A  −1 ; −A  −1 ] [x; q] ≤ [ b ; −b ]

The linprog call:

xq = linprog( [zeros(n,1);1], ...


[A,-ones(m,1);-A,-ones(m,1)], [b;-b] );
x = xq(1:n);

Disciplined convex programming and the cvx modeling framework 10


Manhattan (`1) norm

minimize   f (Ax − b) ≜ ‖Ax − b‖1 = Σ_{k=1}^m |aTk x − bk |

LP formulation, in the variables (x, w):

minimize   1T w
subject to −w ≤ Ax − b ≤ w

i.e., in standard form,

minimize   [0; 1]T [x; w]
subject to [ A  −I ; −A  −I ] [x; w] ≤ [ b ; −b ]

The linprog call:

xw = linprog( [zeros(n,1);ones(m,1)], ...
              [A,-eye(m);-A,-eye(m)], [b;-b] );
x = xw(1:n);

Disciplined convex programming and the cvx modeling framework 11


Largest-L norm

Let |w|[k] represent the k-th largest element (in magnitude) of w ∈ Rm:

|w|[1] ≥ |w|[2] ≥ · · · ≥ |w|[m]

Then define the largest-L norm as

‖w‖[L] ≜ |w|[1] + |w|[2] + · · · + |w|[L] = Σ_{k=1}^L σk (diag(w))

for any integer 1 ≤ L ≤ m. Special cases include

‖w‖[1] ≡ ‖w‖∞,   ‖w‖[m] ≡ ‖w‖1

but novel results are produced for all 1 < L < m

Disciplined convex programming and the cvx modeling framework 12


Largest-L norm (continued)

This norm can also be represented in a linear program:

minimize   1T v + Lq
subject to −v − 1q ≤ Ax − b ≤ +v + 1q
           v ≥ 0

The linprog call:

xvq = linprog( [zeros(n,1);ones(m,1);L], ...
               [A,-eye(m),-ones(m,1);-A,-eye(m),-ones(m,1)], [b;-b], ...
               [], [], [-Inf*ones(n,1);zeros(m,1);-Inf] );
x = xvq(1:n);

This is not much more complex than the `1 and `∞ cases—but of course, you have to
know this trick (and few do)

Disciplined convex programming and the cvx modeling framework 13
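The trick is easy to verify numerically. In the Python sketch below the residual Ax − b is replaced by a fixed random vector w, so the LP's optimal value should equal ‖w‖[L] computed directly:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
m, L = 8, 3
w = rng.standard_normal(m)

# direct evaluation: sum of the L largest magnitudes
direct = np.sort(np.abs(w))[::-1][:L].sum()

# LP over (v, q): minimize 1^T v + L q  s.t.  -v - 1q <= w <= v + 1q,  v >= 0
A_ub = np.vstack([np.c_[-np.eye(m), -np.ones(m)],    #  w <= v + 1q
                  np.c_[-np.eye(m), -np.ones(m)]])   # -w <= v + 1q
b_ub = np.r_[-w, w]
res = linprog(np.r_[np.ones(m), L], A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * m + [(None, None)])
print(direct, res.fun)       # the two values agree
```

At the optimum, q sits at the L-th largest magnitude and each v_i picks up the excess max(|w_i| − q, 0), which is why the LP value collapses to the sum of the L largest magnitudes.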


Constrained Euclidean (`2) norm

Add some constraints:

minimize   ‖Ax − b‖2
subject to Cx = d
           ` ≤ x ≤ u

This is not a least-squares problem—but it is an SOCP, and SeDuMi can handle it,
once it is converted to standard form:

minimize   z
subject to Ax − b = y,   Cx = d
           x − s` = `,   x + su = u
           s` , su ≥ 0,   ‖y‖2 ≤ z

i.e., minimize (0, 0, 0, 0, 1)T x̄ subject to

[ A   0   0  −I  0 ]        [ b ]
[ C   0   0   0  0 ] x̄  =   [ d ]
[ I  −I   0   0  0 ]        [ ` ]
[ I   0   I   0  0 ]        [ u ]

x̄ = (x, s` , su , y, z) ∈ Rn × Rn+ × Rn+ × Qm

s` , su ∈ Rn+ are slack variables, which are used quite often to convert inequalities to
equations

Disciplined convex programming and the cvx modeling framework 14


Constrained Euclidean (`2) norm

The SeDuMi call:

AA = [ A, zeros(m,n), zeros(m,n), -eye(m), 0 ;


C, zeros(p,n), zeros(p,n), zeros(p,n), 0 ;
eye(n), -eye(n), zeros(n,n), zeros(n,n), 0 ;
eye(n), zeros(n,n), eye(n), zeros(n,n), 0 ];
bb = [ b ; d ; l ; u ];
cc = [ zeros( 3 * n + m, 1 ) ; 1 ];
K.f = n; K.l = 2 * n; K.q = m + 1;
xsyz = sedumi( AA, bb, cc, K );
x = xsyz( 1 : n );

Hopefully it is getting clear: this can be cumbersome

Disciplined convex programming and the cvx modeling framework 15


Modeling frameworks

A modeling framework simplifies the use of a numerical technology by shielding the


user from the underlying mathematical details. Frameworks for LP, QP, NLP, and SDP
are directly responsible for increased popularity of these problem classes
Most common modeling frameworks provide a language that allows problems to be
specified in a natural syntax, and automatically call an underlying numerical solver
Examples:

• AMPL, GAMS, AIMMS, LINDO, etc.: LP, QP, NLP

• SDPSOL, LMITOOL, LMILAB: SDP

• YALMIP: LP, SOCP, SDP

cvx is designed to support convex optimization; or more specifically, disciplined convex


programming

Disciplined convex programming and the cvx modeling framework 16


Disciplined convex programming

People don’t simply write down optimization problems and hope that they are convex;
instead, they draw from a “mental library” of functions and sets with known convexity
properties, and combine them in ways that convex analysis guarantees will produce
convex results...

...i.e., using the calculus rules discussed in previous lectures

Disciplined convex programming formalizes this methodology

Two key elements:


• An expandable atom library : a collection of functions and sets, or atoms, with
known properties of convexity, monotonicity, and range.

• A ruleset which governs how those atoms can be used and combined to form a valid
problem

Compliant problems are called disciplined convex programs, or DCPs

Disciplined convex programming and the cvx modeling framework 17


The DCP ruleset

A subset of the calculus rules you have already learned:

• objective functions: convex for minimization, concave for maximization


• constraints:
– equality and inequality constraints
– each must be separately convex
• expressions:
– addition, subtraction, scalar multiplication
– composition with affine functions
– nonlinear compositions (with appropriate monotonicity conditions), including
  elementwise maximums, etc.

Not included (for now): perspective transformations, conjugates, pointwise supremum,


quasiconvexity

Disciplined convex programming and the cvx modeling framework 18


Is the ruleset too limited?

These rules constitute a set of sufficient conditions for convexity. For example,

sqrt( sum( square( x ) ) )

is convex, but does not obey the ruleset, because it is the composition of a concave
nondecreasing function and a convex function

In this case, the workaround is simple: use norm( x ) instead

In some cases, it may be necessary to add new functions (more later)

A practical hurdle? Perhaps, but it does mirror the mental process

Disciplined convex programming and the cvx modeling framework 19


Disciplined convex programming: benefits

• Verification of convexity is replaced with enforcement of the ruleset, which can be


performed reliably and quickly

• Conversion to solvable form is fully automated

• Because the atom library is extensible, generality is not compromised

• New atoms can be created by experts and shared with novices

Disciplined convex programming and the cvx modeling framework 20


Demonstration

Disciplined convex programming and the cvx modeling framework 21


Implementing atoms

Obviously, the usefulness of disciplined convex programming depends upon the size of
the atom library available to us... or at least, on the ability to expand that library as
needed

The manner in which atoms are implemented depends heavily on the capabilities of the
underlying solver

Currently, cvx uses SeDuMi, whose standard form permits four constraint types:
• Equality constraints
• Inequality constraints
• Second-order cone constraints
• Linear matrix inequalities

How do you implement a variety of functions under these limited conditions?

Disciplined convex programming and the cvx modeling framework 22


Graph implementations

A fundamental principle of convex analysis: the close relationship between convex


functions and convex sets via the epigraph

The epigraph of a function f : Rn → R is

epi f ≜ { (x, y) ∈ Rn × R | f (x) ≤ y }

A function is convex if and only if its epigraph is a convex set

For concave functions, the corresponding construction is the hypograph

hypo g ≜ { (x, y) ∈ Rn × R | g(x) ≥ y }

A function is concave if and only if its hypograph is a convex set
Disciplined convex programming and the cvx modeling framework 23


Graph implementations (continued)

The absolute value function (blue line) and its epigraph (grey region):

f (x) ≜ |x|
epi f ≜ { (x, y) | |x| ≤ y } = { (x, y) | x ≤ y, −x ≤ y }

[figure: |x| and its epigraph over −2 ≤ x ≤ 2]

Key observation: for the purposes of convex optimization, the epigraph/hypograph can
be used to provide a complete description of a function

Graph implementations are, in effect, rules for “rewriting” a function in terms of other
atoms by describing its epigraph or hypograph

Disciplined convex programming and the cvx modeling framework 24


Example: the two-element maximum

f (x, y) ≜ max{x, y} = inf { z | x ≤ z, y ≤ z }

function cvx_optval = max( x, y )


cvx_begin
variable z
minimize( z )
subject to
x <= z;
y <= z;
cvx_end

Disciplined convex programming and the cvx modeling framework 25


Example: `1 norm


f (x) ≜ ‖x‖1 = inf { 1T v | −v ≤ x ≤ v }

function cvx_optval = norm_1( x )


n = length( x );
cvx_begin
variable v(n)
minimize( sum(v) )
subject to
+x <= v;
-x <= v;
cvx_end

This implementation effectively encapsulates the very transformation employed in the


norm approximation example

Disciplined convex programming and the cvx modeling framework 26


Example: `2 norm

f (x) ≜ ‖x‖2 =⇒ epi f ≡ Qn

Obviously, the second-order (Lorentz) cone can represent the 2-norm:

function cvx_optval = norm_2( x )


n = length( x );
cvx_begin
variable y
{ x, y } == lorentz( n );
cvx_end

(In the actual implementation, the complex case is handled too)

Disciplined convex programming and the cvx modeling framework 27


Example: Simple squaring

SeDuMi does not even support the basic convex function f (x) = x2—or does it?

x2 ≤ y  ⇐⇒  [ y  x ] ⪰ 0
             [ x  1 ]

So f (x) can be represented using a semidefinite program:

function cvx_optval = square( x )


cvx_begin sdp
variable y
minimize( y )
[ y, x ; x, 1 ] >= 0
cvx_end

Of course, if SeDuMi were replaced with a different solver, this implementation might
not be necessary

Disciplined convex programming and the cvx modeling framework 28
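The Schur-complement identity behind this implementation is easy to spot-check numerically:

```python
import numpy as np

def psd(M):
    # positive semidefinite test via the (real symmetric) eigenvalues
    return bool(np.all(np.linalg.eigvalsh(M) >= -1e-12))

# Schur-complement check: x^2 <= y  <=>  [[y, x], [x, 1]] is PSD
for x, y in [(1.0, 2.0), (2.0, 3.0), (0.5, 0.25), (-1.0, 1.0)]:
    assert psd(np.array([[y, x], [x, 1.0]])) == (x ** 2 <= y)
print("LMI representation of x^2 verified")
```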


Example: Square root

Graph implementations can freely refer to other graph implementations

For example, the square root function can be implemented using square:


f (x) ≜ √x:   f (x) ≥ y  ⇐⇒  y2 ≤ x   (for y ≥ 0)

function cvx_optval = sqrt( x )


cvx_begin
variable y
maximize( y )
subject to
square( y ) <= x
cvx_end

Disciplined convex programming and the cvx modeling framework 29


Example: eigenvalue functions

If X is symmetric:

λmin(X) ≥ y ⇐⇒ X − yI ⪰ 0
λmax(X) ≤ y ⇐⇒ yI − X ⪰ 0

function cvx_optval = lambda_min( X )


n = size(X,1);
cvx_begin
variable y
minimize( y )
X - y * eye( n ) == semidefinite(n);
cvx_end

Disciplined convex programming and the cvx modeling framework 30


Example: matrix fractional function

f : Rn × Sn → R,   f (x, Y ) ≜ { xT Y −1 x   if Y ≻ 0
                                { +∞          otherwise

Convex in both x and Y , and can be represented using an LMI:

f (x, Y ) ≤ z  ⇐⇒  [ Y   x ] ⪰ 0
                    [ xT  z ]

function cvx_optval = matrix_frac( x, Y )


n = length(x);
cvx_begin
variable z
minimize( z )
subject to
[(Y+Y’)/2 x; x’ z] == semidefinite(n+1);
cvx_end

Disciplined convex programming and the cvx modeling framework 31


Incomplete specifications

cvx actually supports a more general concept we call incomplete specifications, which is

f (x) = inf_y { g(x, y) | (x, y) ∈ S }   or   f (x) = sup_y { g(x, y) | (x, y) ∈ S }
        (f, g convex)                         (f, g concave)

where S is a convex set

Graph implementations are just special cases of this more general framework

Disciplined convex programming and the cvx modeling framework 32


Example: Huber penalty

f (x) ≜ { x2          |x| ≤ 1
        { 2|x| − 1    |x| ≥ 1

[figures: the Huber penalty f (x) over −2 ≤ x ≤ 2; and "Least-squares fit vs
robust least-squares fit (Huber-penalized)" — data points with outliers, with the
Huber-penalty and regular LS fit lines]

Useful for robust regression—to suppress outliers

Disciplined convex programming and the cvx modeling framework 33


Example: Huber penalty
f (x) ≜ { x2          |x| ≤ 1
        { 2|x| − 1    |x| ≥ 1

     = inf { w2 + 2v | |x| ≤ w + v, w ≤ 1, v ≥ 0 }

function cvx_optval = huber( x )


cvx_begin
variables v w
minimize( square( w ) + 2 * v )
subject to
abs( x ) <= w + v
w <= 1
v >= 0
cvx_end

Disciplined convex programming and the cvx modeling framework 34
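The incomplete-specification formula can be checked against the closed form by solving the small inner problem numerically (a sketch using scipy's generic constrained solver, not cvx):

```python
import numpy as np
from scipy.optimize import minimize

def huber(x):
    # closed-form Huber penalty
    return x ** 2 if abs(x) <= 1 else 2 * abs(x) - 1

# f(x) = inf { w^2 + 2v : |x| <= w + v, w <= 1, v >= 0 }
for x in [-2.5, -0.7, 0.0, 0.4, 1.0, 3.0]:
    res = minimize(lambda z: z[0] ** 2 + 2 * z[1], x0=[0.5, 0.5],
                   constraints=[{"type": "ineq",
                                 "fun": lambda z: z[0] + z[1] - abs(x)}],
                   bounds=[(None, 1), (0, None)])
    assert abs(res.fun - huber(x)) < 1e-4
print("graph representation of the Huber penalty verified")
```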


Example: distance from an ellipsoid

Let fE (x) be the distance from a point x to an ellipsoid E, where E is described by a
pair (P, yc) ∈ Rn×n × Rn:

fE (x) ≜ inf { ‖x − y‖ | y ∈ E }
E ≜ { y ∈ Rn | ‖P (y − yc)‖2 ≤ 1 }

Then g(x, y) ≜ ‖x − y‖2, S ≜ Rn × E:

function cvx_optval = dist_ellipse( x, P, yc )


n = length( x );
cvx_begin
variable y(n)
minimize( norm( x - y, 2 ) )
subject to
norm( P * ( y - yc ), 2 ) <= 1;
cvx_end

Disciplined convex programming and the cvx modeling framework 35
