
Alternating Direction Method of Multipliers

Ryan Tibshirani
Convex Optimization 10-725
Last time: dual ascent
Consider the problem

    min_x  f(x)   subject to  Ax = b

where f is strictly convex and closed. Denote the Lagrangian

    L(x, u) = f(x) + u^T (Ax − b)

Dual gradient ascent repeats, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L(x, u^(k-1))
    u^(k) = u^(k-1) + t_k (A x^(k) − b)

Good: the x update decomposes when f does. Bad: requires stringent
assumptions (strong convexity of f) to ensure convergence
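
As a concrete illustration (not from the slides), here is a minimal dual gradient
ascent sketch for the toy case f(x) = (1/2)||x||_2^2, where the x update has the
closed form x = −A^T u; the data A, b and the step size are made up for the example:

    import numpy as np

    # Toy instance (made-up data): min (1/2)||x||_2^2 subject to Ax = b
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))
    b = rng.standard_normal(5)

    u = np.zeros(5)    # dual variable
    t = 0.01           # small fixed step size (would need tuning in general)
    for k in range(1000):
        # x update: argmin_x (1/2)||x||^2 + u^T (Ax - b) has closed form x = -A^T u
        x = -A.T @ u
        # dual gradient ascent step on u
        u = u + t * (A @ x - b)

    print(np.linalg.norm(A @ x - b))   # primal residual; should be near zero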

2
Augmented Lagrangian method modifies the problem, for ρ > 0,

    min_x  f(x) + (ρ/2) ||Ax − b||_2^2   subject to  Ax = b

uses a modified Lagrangian

    L_ρ(x, u) = f(x) + u^T (Ax − b) + (ρ/2) ||Ax − b||_2^2

and repeats, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L_ρ(x, u^(k-1))
    u^(k) = u^(k-1) + ρ (A x^(k) − b)

Good: better convergence properties. Bad: lose decomposability

3
Alternating direction method of multipliers
Alternating direction method of multipliers, or ADMM, tries for the
best of both methods. Consider a problem of the form:

    min_{x,z}  f(x) + g(z)   subject to  Ax + Bz = c

We define the augmented Lagrangian, for a parameter ρ > 0,

    L_ρ(x, z, u) = f(x) + g(z) + u^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||_2^2

We repeat, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L_ρ(x, z^(k-1), u^(k-1))
    z^(k) = argmin_z  L_ρ(x^(k), z, u^(k-1))
    u^(k) = u^(k-1) + ρ (A x^(k) + B z^(k) − c)

4
Convergence guarantees

Under modest assumptions on f, g (these do not require A, B to
be full rank), the ADMM iterates satisfy, for any ρ > 0:
• Residual convergence: r^(k) = A x^(k) + B z^(k) − c → 0 as
k → ∞, i.e., primal iterates approach feasibility
• Objective convergence: f(x^(k)) + g(z^(k)) → f* + g*, where
f* + g* is the optimal primal objective value
• Dual convergence: u^(k) → u*, where u* is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically
get primal convergence, but this is true under more assumptions

Convergence rate: roughly, ADMM behaves like a first-order method.
Theory is still being developed; see, e.g., Hong and Luo (2012),
Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015)

5
Scaled form ADMM
Scaled form: denote w = u/ρ, so the augmented Lagrangian becomes

    L_ρ(x, z, w) = f(x) + g(z) + (ρ/2) ||Ax + Bz − c + w||_2^2 − (ρ/2) ||w||_2^2

and the ADMM updates become

    x^(k) = argmin_x  f(x) + (ρ/2) ||A x + B z^(k-1) − c + w^(k-1)||_2^2
    z^(k) = argmin_z  g(z) + (ρ/2) ||A x^(k) + B z − c + w^(k-1)||_2^2
    w^(k) = w^(k-1) + A x^(k) + B z^(k) − c

Note that here the kth iterate w^(k) is just a running sum of residuals:

    w^(k) = w^(0) + Σ_{i=1}^{k} (A x^(i) + B z^(i) − c)
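
As a rough sketch (not part of the slides), the scaled-form updates can be written
as a generic loop that takes the two subproblem minimizers as callables; `x_min`
and `z_min` are placeholders for problem-specific solvers supplied by the user:

    import numpy as np

    def admm_scaled(x_min, z_min, A, B, c, z0, rho=1.0, n_iter=100):
        """Generic scaled-form ADMM sketch for
            min_{x,z} f(x) + g(z)  subject to  Ax + Bz = c.
        x_min(z, w) should return argmin_x f(x) + (rho/2)||Ax + Bz - c + w||^2,
        and z_min(x, w) the analogous z minimizer; both are user-supplied."""
        z, w = z0.copy(), np.zeros_like(c)
        for _ in range(n_iter):
            x = x_min(z, w)
            z = z_min(x, w)
            # w is a running sum of the primal residuals Ax + Bz - c
            w = w + A @ x + B @ z - c
        return x, z, w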

6
Outline

Today:
• Examples, practicalities
• Consensus ADMM
• Special decompositions

7
Connection to proximal operators
Consider

    min_x  f(x) + g(x)   ⟺   min_{x,z}  f(x) + g(z)   subject to  x = z

ADMM steps (equivalent to Douglas-Rachford, here):

    x^(k) = prox_{f,1/ρ}(z^(k-1) − w^(k-1))
    z^(k) = prox_{g,1/ρ}(x^(k) + w^(k-1))
    w^(k) = w^(k-1) + x^(k) − z^(k)

where prox_{f,1/ρ} is the proximal operator for f at parameter 1/ρ,
and similarly for prox_{g,1/ρ}

In general, the update for a block of variables reduces to a prox
update whenever the corresponding linear transformation is the identity
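
A minimal numpy sketch of these prox-based steps, using made-up choices
f(x) = (1/2)||x − a||_2^2 and g(x) = λ||x||_1 so the answer can be checked
against direct soft-thresholding (the prox formulas for these two functions
are standard; the data and parameters are illustrative):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def admm_prox(prox_f, prox_g, x0, n_iter=100):
        """ADMM for min_x f(x) + g(x), written via prox operators at
        parameter 1/rho (equivalently, Douglas-Rachford splitting)."""
        x = z = x0.copy()
        w = np.zeros_like(x0)
        for _ in range(n_iter):
            x = prox_f(z - w)
            z = prox_g(x + w)
            w = w + x - z
        return z

    # Made-up instance: f(x) = (1/2)||x - a||^2, g(x) = lam*||x||_1,
    # whose exact solution is the soft-thresholding S_lam(a)
    a = np.array([3.0, -0.5, 1.2, -2.0])
    lam, rho = 1.0, 1.0
    prox_f = lambda v: (a + rho * v) / (1.0 + rho)      # prox of f at 1/rho
    prox_g = lambda v: soft_threshold(v, lam / rho)     # prox of g at 1/rho
    print(admm_prox(prox_f, prox_g, np.zeros_like(a)))  # ~ soft_threshold(a, lam)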

8
Example: lasso regression

Given y ∈ R^n, X ∈ R^{n×p}, recall the lasso problem:

    min_β  (1/2) ||y − Xβ||_2^2 + λ ||β||_1

We can rewrite this as:

    min_{β,α}  (1/2) ||y − Xβ||_2^2 + λ ||α||_1   subject to  β − α = 0

ADMM steps:

    β^(k) = (X^T X + ρ I)^{-1} (X^T y + ρ (α^(k-1) − w^(k-1)))
    α^(k) = S_{λ/ρ}(β^(k) + w^(k-1))
    w^(k) = w^(k-1) + β^(k) − α^(k)

9
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops,
then each β update takes O(p^2) flops
• The α update applies the soft-thresholding operator S_t, which
recall is defined as (see the sketch after this list)

      [S_t(x)]_j = { x_j − t,   if x_j > t
                   { 0,         if −t ≤ x_j ≤ t       j = 1, . . . , p
                   { x_j + t,   if x_j < −t

• ADMM steps are “almost” like repeated soft-thresholding of
ridge regression coefficients
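
A compact numpy/scipy sketch of these lasso ADMM steps on a made-up instance;
it factors X^T X + ρI once (Cholesky), so each β update is a pair of triangular
solves, matching the note above. The data and λ, ρ values are illustrative:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
        """ADMM for the lasso: min (1/2)||y - X beta||_2^2 + lam*||beta||_1."""
        n, p = X.shape
        beta = np.zeros(p)
        alpha = np.zeros(p)
        w = np.zeros(p)
        # Factor X^T X + rho*I once (O(p^3)); each beta update is then O(p^2)
        chol = cho_factor(X.T @ X + rho * np.eye(p))
        Xty = X.T @ y
        for _ in range(n_iter):
            beta = cho_solve(chol, Xty + rho * (alpha - w))
            alpha = soft_threshold(beta + w, lam / rho)
            w = w + beta - alpha
        return alpha

    # Made-up example
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    beta_true = np.zeros(50); beta_true[:5] = 2.0
    y = X @ beta_true + rng.standard_normal(200)
    print(np.round(lasso_admm(X, y, lam=10.0), 2)[:10])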

10
Comparison of various algorithms for lasso regression: 100 random
instances with n = 200, p = 50

[Figure: suboptimality f^(k) − f* versus iteration k, for coordinate descent,
proximal gradient, accelerated proximal gradient, and ADMM with
ρ = 50, 100, 200]

11
Practicalities

In practice, ADMM usually obtains a relatively accurate solution in
a handful of iterations, but it requires a large number of iterations
for a highly accurate solution (like a first-order method)

Choice of ρ can greatly influence practical convergence of ADMM:


• ρ too large → not enough emphasis on minimizing f + g
• ρ too small → not enough emphasis on feasibility
Boyd et al. (2010) give a strategy for varying ρ; it can work well in
practice, but does not have convergence guarantees

Like deriving duals, transforming a problem into one that ADMM
can handle is sometimes a bit subtle, since different forms can lead
to different algorithms

12
Example: group lasso regression

Given y ∈ R^n, X ∈ R^{n×p}, recall the group lasso problem:

    min_β  (1/2) ||y − Xβ||_2^2 + λ Σ_{g=1}^{G} c_g ||β_g||_2

Rewrite as:

    min_{β,α}  (1/2) ||y − Xβ||_2^2 + λ Σ_{g=1}^{G} c_g ||α_g||_2   subject to  β − α = 0

ADMM steps:

    β^(k) = (X^T X + ρ I)^{-1} (X^T y + ρ (α^(k-1) − w^(k-1)))
    α_g^(k) = R_{c_g λ/ρ}(β_g^(k) + w_g^(k-1)),   g = 1, . . . , G
    w^(k) = w^(k-1) + β^(k) − α^(k)

13
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops,
then each β update takes O(p^2) flops
• The α update applies the group soft-thresholding operator R_t,
which recall is defined as (see the sketch after this list)

      R_t(x) = (1 − t / ||x||_2)_+ x

• Similar ADMM steps follow for a sum of arbitrary norms as the
regularizer, provided we know the prox operator of each norm
• The ADMM algorithm can be rederived when groups have overlap
(a hard problem to optimize in general!). See Boyd et al. (2010)
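
A small sketch of the group soft-thresholding operator R_t and of how the α
update would apply it group by group; the group index lists, weights c_g, and
parameter names below are illustrative rather than part of the slides:

    import numpy as np

    def group_soft_threshold(x, t):
        """R_t(x) = (1 - t/||x||_2)_+ x, with R_t(0) = 0."""
        nrm = np.linalg.norm(x)
        if nrm <= t:
            return np.zeros_like(x)
        return (1.0 - t / nrm) * x

    def group_lasso_alpha_update(beta, w, groups, c, lam, rho):
        """alpha update in group lasso ADMM: apply R_{c_g*lam/rho} groupwise.
        `groups` is a list of index arrays, one per group."""
        alpha = np.empty_like(beta)
        for g, idx in enumerate(groups):
            alpha[idx] = group_soft_threshold(beta[idx] + w[idx], c[g] * lam / rho)
        return alpha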

14
Example: sparse subspace estimation

Given S ∈ S^p (typically S ⪰ 0 is a covariance matrix), consider the
sparse subspace estimation problem (Vu et al., 2013):

    max_Y  tr(SY) − λ ||Y||_1   subject to  Y ∈ F_k

where F_k is the Fantope of order k, namely

    F_k = {Y ∈ S^p : 0 ⪯ Y ⪯ I, tr(Y) = k}

Note that when λ = 0, the above problem is equivalent to ordinary
principal component analysis (PCA)

The above is an SDP and in principle solvable with interior-point
methods, though these can be complicated to implement and quite
slow for large problem sizes

15
Rewrite as:

    min_{Y,Z}  −tr(SY) + I_{F_k}(Y) + λ ||Z||_1   subject to  Y = Z

where I_{F_k} is the indicator function of F_k. ADMM steps are:

    Y^(k) = P_{F_k}(Z^(k-1) − W^(k-1) + S/ρ)
    Z^(k) = S_{λ/ρ}(Y^(k) + W^(k-1))
    W^(k) = W^(k-1) + Y^(k) − Z^(k)

Here P_{F_k} is the Fantope projection operator, computed by clipping the
eigendecomposition A = U Σ U^T, Σ = diag(σ_1, . . . , σ_p):

    P_{F_k}(A) = U Σ_θ U^T,   Σ_θ = diag(σ_1(θ), . . . , σ_p(θ))

where each σ_i(θ) = min{max{σ_i − θ, 0}, 1}, and θ is chosen so that Σ_{i=1}^{p} σ_i(θ) = k
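
A sketch of this Fantope projection: eigendecompose, then bisect on θ so that
the clipped eigenvalues sum to k. The bracketing bounds, tolerance, and
iteration cap below are arbitrary choices for the sketch:

    import numpy as np

    def fantope_projection(A, k, tol=1e-8, max_iter=100):
        """Project a symmetric matrix A onto the Fantope
        F_k = {Y : 0 <= Y <= I, tr(Y) = k}, assuming 0 <= k <= p."""
        sigma, U = np.linalg.eigh(A)

        def clipped_sum(theta):
            return np.clip(sigma - theta, 0.0, 1.0).sum()

        # g(theta) = sum_i min(max(sigma_i - theta, 0), 1) is nonincreasing,
        # with g(min(sigma) - 1) = p >= k and g(max(sigma)) = 0 <= k,
        # so bisection brackets the root of g(theta) = k
        lo, hi = sigma.min() - 1.0, sigma.max()
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if clipped_sum(mid) > k:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        sigma_theta = np.clip(sigma - 0.5 * (lo + hi), 0.0, 1.0)
        return (U * sigma_theta) @ U.T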

16
Example: sparse + low rank decomposition
Given M ∈ R^{n×m}, consider the sparse plus low rank decomposition
problem (Candes et al., 2009):

    min_{L,S}  ||L||_tr + λ ||S||_1   subject to  L + S = M

ADMM steps:

    L^(k) = S^tr_{1/ρ}(M − S^(k-1) + W^(k-1))
    S^(k) = S^ℓ1_{λ/ρ}(M − L^(k) + W^(k-1))
    W^(k) = W^(k-1) + M − L^(k) − S^(k)

where, to distinguish them, we use S^tr_{1/ρ} for matrix soft-thresholding
and S^ℓ1_{λ/ρ} for elementwise soft-thresholding
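
A brief numpy sketch of the two thresholding operators and of these ADMM
steps; λ, ρ, the iteration count, and the data M are placeholders:

    import numpy as np

    def soft_threshold(V, t):                      # elementwise S_t
        return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

    def matrix_soft_threshold(A, t):               # S_t^tr: threshold singular values
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return (U * np.maximum(s - t, 0.0)) @ Vt

    def sparse_plus_low_rank(M, lam, rho=1.0, n_iter=100):
        """ADMM sketch for min ||L||_tr + lam*||S||_1 subject to L + S = M."""
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        W = np.zeros_like(M)
        for _ in range(n_iter):
            L = matrix_soft_threshold(M - S + W, 1.0 / rho)
            S = soft_threshold(M - L + W, lam / rho)
            W = W + M - L - S
        return L, S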

17
Example from Candes et al. (2009):

[Figure: original video frames, alongside the low-rank L̂ and sparse Ŝ
components recovered by convex optimization (this work) and by an
alternating approach]
18
Consensus ADMM
Consider a problem of the form:

    min_x  Σ_{i=1}^{B} f_i(x)

The consensus ADMM approach begins by reparametrizing:

    min_{x_1,...,x_B, x}  Σ_{i=1}^{B} f_i(x_i)   subject to  x_i = x,  i = 1, . . . , B

This yields the decomposable ADMM steps:

    x_i^(k) = argmin_{x_i}  f_i(x_i) + (ρ/2) ||x_i − x^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    x^(k) = (1/B) Σ_{i=1}^{B} (x_i^(k) + w_i^(k-1))
    w_i^(k) = w_i^(k-1) + x_i^(k) − x^(k),   i = 1, . . . , B

19
Write x̄ = (1/B) Σ_{i=1}^{B} x_i, and similarly for other variables. It is not
hard to see that w̄^(k) = 0 for all iterations k ≥ 1

Hence the ADMM steps can be simplified, by taking x^(k) = x̄^(k):

    x_i^(k) = argmin_{x_i}  f_i(x_i) + (ρ/2) ||x_i − x̄^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    w_i^(k) = w_i^(k-1) + x_i^(k) − x̄^(k),   i = 1, . . . , B

To reiterate, the x_i, i = 1, . . . , B updates here are done in parallel

Intuition:
• Try to minimize each f_i(x_i), using (squared) ℓ2 regularization to
pull each x_i towards the average x̄
• If a variable x_i is bigger than the average, then w_i is increased
• So the regularization in the next step pulls x_i even closer
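
A minimal sketch of these simplified consensus updates on a toy split least
squares problem, f_i(x_i) = (1/2)||y_i − X_i x_i||_2^2, where each x_i update
has the closed form (X_i^T X_i + ρI)^{-1}(X_i^T y_i + ρ(x̄ − w_i)); the data
split and serial loop are illustrative (the x_i updates would run in parallel):

    import numpy as np

    def consensus_admm_lsq(X_blocks, y_blocks, rho=1.0, n_iter=100):
        """Consensus ADMM sketch for min_x sum_i (1/2)||y_i - X_i x||_2^2,
        where block i holds the data (X_i, y_i)."""
        B = len(X_blocks)
        p = X_blocks[0].shape[1]
        x_blocks = [np.zeros(p) for _ in range(B)]
        w_blocks = [np.zeros(p) for _ in range(B)]
        xbar = np.zeros(p)
        systems = [X.T @ X + rho * np.eye(p) for X in X_blocks]
        rhs_fixed = [X.T @ y for X, y in zip(X_blocks, y_blocks)]

        for _ in range(n_iter):
            # x_i updates: independent across blocks, so parallelizable
            for i in range(B):
                x_blocks[i] = np.linalg.solve(
                    systems[i], rhs_fixed[i] + rho * (xbar - w_blocks[i]))
            # average (the x update); wbar = 0 here, so it drops out
            xbar = np.mean(x_blocks, axis=0)
            # dual updates push each block toward the consensus average
            for i in range(B):
                w_blocks[i] += x_blocks[i] - xbar
        return xbar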

20
General consensus ADMM
Consider a problem of the form:

    min_x  Σ_{i=1}^{B} f_i(a_i^T x + b_i) + g(x)

For consensus ADMM, we again reparametrize:

    min_{x_1,...,x_B, x}  Σ_{i=1}^{B} f_i(a_i^T x_i + b_i) + g(x)   subject to  x_i = x,  i = 1, . . . , B

This yields the decomposable ADMM updates:

    x_i^(k) = argmin_{x_i}  f_i(a_i^T x_i + b_i) + (ρ/2) ||x_i − x^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    x^(k) = argmin_x  (Bρ/2) ||x − x̄^(k) − w̄^(k-1)||_2^2 + g(x)
    w_i^(k) = w_i^(k-1) + x_i^(k) − x^(k),   i = 1, . . . , B
21
Notes:
• It is no longer true that w̄^(k) = 0 at a general iteration k, so
the ADMM steps do not simplify as before
• To reiterate, the x_i, i = 1, . . . , B updates are done in parallel
• Each x_i update can be thought of as a loss minimization on
part of the data, with ℓ2 regularization
• The x update is a proximal operation in the regularizer g
• The w update drives the individual variables into consensus
• A different initial reparametrization will give rise to a different
ADMM algorithm

See Boyd et al. (2010), Parikh and Boyd (2013) for more details
on consensus ADMM, strategies for splitting up into subproblems,
and implementation tips

22
Special decompositions

ADMM can exhibit much faster convergence than usual, when we
parametrize subproblems in a “special way”
• ADMM updates relate closely to block coordinate descent, in
which we optimize a criterion in an alternating fashion across
blocks of variables
• With this in mind, get fastest convergence when minimizing
over blocks of variables leads to updates in nearly orthogonal
directions
• Suggests we should design the ADMM form (auxiliary constraints)
so that the primal updates decorrelate as much as possible
• This is done in, e.g., Ramdas and Tibshirani (2014), Wytock
et al. (2014), Barbero and Sra (2014)

23
Example: 2d fused lasso

Given an image Y ∈ R^{d×d}, equivalently written as y ∈ R^n, the 2d
fused lasso or 2d total variation denoising problem is:

    min_Θ  (1/2) ||Y − Θ||_F^2 + λ Σ_{i,j} (|Θ_{i,j} − Θ_{i+1,j}| + |Θ_{i,j} − Θ_{i,j+1}|)

    ⟺  min_θ  (1/2) ||y − θ||_2^2 + λ ||Dθ||_1
Here D ∈ R^{m×n} is a 2d difference operator giving the appropriate
differences (across horizontally and vertically adjacent positions)

[Figure: variables of the 2d grid shown as black dots; the partitions P and Q
are in orange and cyan]
24
First way to rewrite:

    min_{θ,z}  (1/2) ||y − θ||_2^2 + λ ||z||_1   subject to  z = Dθ

Leads to ADMM steps:

    θ^(k) = (I + ρ D^T D)^{-1} (y + ρ D^T (z^(k-1) + w^(k-1)))
    z^(k) = S_{λ/ρ}(D θ^(k) − w^(k-1))
    w^(k) = w^(k-1) + z^(k) − D θ^(k)

Notes:
• The θ update solves a linear system in I + ρL, with L = D^T D
the graph Laplacian matrix of the 2d grid, so this can be done
efficiently, in roughly O(n) operations
• The z update applies the soft-thresholding operator S_t
• Hence one entire ADMM cycle uses roughly O(n) operations (a
rough sketch of this approach follows below)
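
A rough numpy/scipy sketch of this first rewrite for a square d × d image
(row-major vectorization); a generic sparse factorization of I + ρ D^T D stands
in for the specialized, roughly O(n) solver mentioned in the notes above:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def grid_difference_operator(d):
        """Sparse 2d difference operator D for a d x d grid, acting on the
        row-major vectorization of Theta (one row per adjacent pair)."""
        I = sp.identity(d, format="csr")
        D1 = sp.diags([-np.ones(d - 1), np.ones(d - 1)], [0, 1], shape=(d - 1, d))
        # differences within each image row, then between adjacent rows
        return sp.vstack([sp.kron(I, D1), sp.kron(D1, I)], format="csr")

    def fused_lasso_2d_admm(y, d, lam, rho=1.0, n_iter=200):
        """Sketch of the 'first rewrite' ADMM for the 2d fused lasso."""
        D = grid_difference_operator(d)
        m = D.shape[0]
        theta, z, w = y.copy(), np.zeros(m), np.zeros(m)
        # pre-factor I + rho * D^T D (a sparse, Laplacian-like system)
        solve = spla.factorized((sp.identity(d * d) + rho * (D.T @ D)).tocsc())
        for _ in range(n_iter):
            theta = solve(y + rho * (D.T @ (z + w)))
            Dtheta = D @ theta
            z = np.sign(Dtheta - w) * np.maximum(np.abs(Dtheta - w) - lam / rho, 0.0)
            w = w + z - Dtheta
        return theta.reshape(d, d)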
25
Second way to rewrite:

    min_{H,V}  (1/2) ||Y − H||_F^2 + λ Σ_{i,j} (|H_{i,j} − H_{i+1,j}| + |V_{i,j} − V_{i,j+1}|)
    subject to  H = V

Leads to ADMM steps:

    H_{·,j}^(k) = FL^1d_{λ/(1+ρ)}( (Y_{·,j} + ρ (V_{·,j}^(k-1) − W_{·,j}^(k-1))) / (1 + ρ) ),   j = 1, . . . , d
    V_{i,·}^(k) = FL^1d_{λ/ρ}( H_{i,·}^(k) + W_{i,·}^(k-1) ),   i = 1, . . . , d
    W^(k) = W^(k-1) + H^(k) − V^(k)

Notes:
• Both H, V updates solve a (sequence of) 1d fused lassos, where we write

      FL^1d_τ(a) = argmin_x  (1/2) ||a − x||_2^2 + τ Σ_{i=1}^{d−1} |x_i − x_{i+1}|

26
• Critical: each 1d fused lasso solution can be computed exactly
in O(d) operations with specialized algorithms (e.g., Johnson,
2013; Davies and Kovac, 2001); a sketch treating such a solver as
a black box follows below
• Hence one entire ADMM cycle again uses roughly O(n) operations
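
A sketch of this specialized ADMM cycle; `solve_fl1d(a, tau)` is a placeholder
for a fast exact 1d fused lasso solver (e.g., in the spirit of Johnson, 2013)
and is not implemented here, so this is only the outer structure:

    import numpy as np

    def specialized_admm_2d(Y, lam, solve_fl1d, rho=1.0, n_iter=20):
        """Sketch of the 'second rewrite' ADMM; solve_fl1d(a, tau) is assumed
        to return the exact 1d fused lasso solution for data a at parameter tau."""
        n_rows, n_cols = Y.shape
        H, V, W = Y.copy(), Y.copy(), np.zeros_like(Y)
        for _ in range(n_iter):
            # H update: one 1d fused lasso down each column
            for j in range(n_cols):
                a = (Y[:, j] + rho * (V[:, j] - W[:, j])) / (1.0 + rho)
                H[:, j] = solve_fl1d(a, lam / (1.0 + rho))
            # V update: one 1d fused lasso across each row
            for i in range(n_rows):
                V[i, :] = solve_fl1d(H[i, :] + W[i, :], lam / rho)
            W += H - V
        return H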

27
Comparison of 2d fused lasso algorithms: an image of dimension
300 × 200 (so n = 60,000)

[Figure: the noisy data (left) and the exact solution (right)]
28
Two ADMM algorithms, (say) standard and specialized ADMM:

[Figure: criterion gap f^(k) − f* versus iteration k, for standard and
specialized ADMM]
29
ADMM iterates visualized after k = 10, 30, 50, 100 iterations:

[Figures: standard ADMM iterates (left) versus specialized ADMM iterates
(right), shown after 10, 30, 50, and 100 iterations]

30
References and further reading
• A. Barbero and S. Sra (2014), “Modular proximal optimization
for multidimensional total-variation regularization”
• S. Boyd and N. Parikh and E. Chu and B. Peleato and J.
Eckstein (2010), “Distributed optimization and statistical
learning via the alternating direction method of multipliers”
• E. Candes and X. Li and Y. Ma and J. Wright (2009),
“Robust principal component analysis?”
• N. Parikh and S. Boyd (2013), “Proximal algorithms”
• A. Ramdas and R. Tibshirani (2014), “Fast and flexible
ADMM algorithms for trend filtering”
• V. Vu and J. Cho and J. Lei and K. Rohe (2013), “Fantope
projection and selection: a near-optimal convex relaxation of
sparse PCA”
• M. Wytock and S. Sra and Z. Kolter (2014), “Fast Newton
methods for the group fused lasso”
31
