
Alternating Direction Method of Multipliers

Ryan Tibshirani
Convex Optimization 10-725
Last time: dual ascent
Consider the problem

    min_x  f(x)   subject to  Ax = b

where f is strictly convex and closed. Denote the Lagrangian

    L(x, u) = f(x) + u^T (Ax − b)

Dual gradient ascent repeats, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L(x, u^(k-1))
    u^(k) = u^(k-1) + t_k (A x^(k) − b)

Good: the x update decomposes when f does. Bad: requires stringent
assumptions (strong convexity of f) to ensure convergence
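
As a concrete illustration (not from the slides), here is a minimal dual gradient
ascent sketch for the toy case f(x) = (1/2)||x||_2^2, where the x update has the
closed form x = −A^T u; the data A, b and the step size are made up for the example:

    import numpy as np

    # Toy instance (made-up data): min (1/2)||x||_2^2 subject to Ax = b
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))
    b = rng.standard_normal(5)

    u = np.zeros(5)    # dual variable
    t = 0.01           # small fixed step size (would need tuning in general)
    for k in range(1000):
        # x update: argmin_x (1/2)||x||^2 + u^T (Ax - b) has closed form x = -A^T u
        x = -A.T @ u
        # dual gradient ascent step on u
        u = u + t * (A @ x - b)

    print(np.linalg.norm(A @ x - b))   # primal residual; should be near zero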

2
Augmented Lagrangian method modifies the problem, for ρ > 0,

    min_x  f(x) + (ρ/2) ||Ax − b||_2^2   subject to  Ax = b

uses a modified Lagrangian

    L_ρ(x, u) = f(x) + u^T (Ax − b) + (ρ/2) ||Ax − b||_2^2

and repeats, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L_ρ(x, u^(k-1))
    u^(k) = u^(k-1) + ρ (A x^(k) − b)

Good: better convergence properties. Bad: lose decomposability

3
Alternating direction method of multipliers
Alternating direction method of multipliers, or ADMM, tries for the
best of both methods. Consider a problem of the form:

    min_{x,z}  f(x) + g(z)   subject to  Ax + Bz = c

We define the augmented Lagrangian, for a parameter ρ > 0,

    L_ρ(x, z, u) = f(x) + g(z) + u^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||_2^2

We repeat, for k = 1, 2, 3, . . .

    x^(k) = argmin_x  L_ρ(x, z^(k-1), u^(k-1))
    z^(k) = argmin_z  L_ρ(x^(k), z, u^(k-1))
    u^(k) = u^(k-1) + ρ (A x^(k) + B z^(k) − c)

4
Convergence guarantees

Under modest assumptions on f, g (these do not require A, B to
be full rank), the ADMM iterates satisfy, for any ρ > 0:
• Residual convergence: r^(k) = A x^(k) + B z^(k) − c → 0 as
k → ∞, i.e., primal iterates approach feasibility
• Objective convergence: f(x^(k)) + g(z^(k)) → f* + g*, where
f* + g* is the optimal primal objective value
• Dual convergence: u^(k) → u*, where u* is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically
get primal convergence, but this is true under more assumptions

Convergence rate: roughly, ADMM behaves like a first-order method.
Theory is still being developed; see, e.g., Hong and Luo (2012),
Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015)

5
Scaled form ADMM
Scaled form: denote w = u/ρ, so the augmented Lagrangian becomes

    L_ρ(x, z, w) = f(x) + g(z) + (ρ/2) ||Ax + Bz − c + w||_2^2 − (ρ/2) ||w||_2^2

and the ADMM updates become

    x^(k) = argmin_x  f(x) + (ρ/2) ||A x + B z^(k-1) − c + w^(k-1)||_2^2
    z^(k) = argmin_z  g(z) + (ρ/2) ||A x^(k) + B z − c + w^(k-1)||_2^2
    w^(k) = w^(k-1) + A x^(k) + B z^(k) − c

Note that here the kth iterate w^(k) is just a running sum of residuals:

    w^(k) = w^(0) + Σ_{i=1}^{k} (A x^(i) + B z^(i) − c)
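
As a rough sketch (not part of the slides), the scaled-form updates can be written
as a generic loop that takes the two subproblem minimizers as callables; `x_min`
and `z_min` are placeholders for problem-specific solvers supplied by the user:

    import numpy as np

    def admm_scaled(x_min, z_min, A, B, c, z0, rho=1.0, n_iter=100):
        """Generic scaled-form ADMM sketch for
            min_{x,z} f(x) + g(z)  subject to  Ax + Bz = c.
        x_min(z, w) should return argmin_x f(x) + (rho/2)||Ax + Bz - c + w||^2,
        and z_min(x, w) the analogous z minimizer; both are user-supplied."""
        z, w = z0.copy(), np.zeros_like(c)
        for _ in range(n_iter):
            x = x_min(z, w)
            z = z_min(x, w)
            # w is a running sum of the primal residuals Ax + Bz - c
            w = w + A @ x + B @ z - c
        return x, z, w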

6
Outline

Today:
• Examples, practicalities
• Consensus ADMM
• Special decompositions

7
Connection to proximal operators
Consider

    min_x  f(x) + g(x)   ⟺   min_{x,z}  f(x) + g(z)   subject to  x = z

ADMM steps (equivalent to Douglas-Rachford, here):

    x^(k) = prox_{f,1/ρ}(z^(k-1) − w^(k-1))
    z^(k) = prox_{g,1/ρ}(x^(k) + w^(k-1))
    w^(k) = w^(k-1) + x^(k) − z^(k)

where prox_{f,1/ρ} is the proximal operator for f at parameter 1/ρ,
and similarly for prox_{g,1/ρ}

In general, the update for a block of variables reduces to a prox
update whenever the corresponding linear transformation is the identity
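
A minimal numpy sketch of these prox-based steps, using made-up choices
f(x) = (1/2)||x − a||_2^2 and g(x) = λ||x||_1 so the answer can be checked
against direct soft-thresholding (the prox formulas for these two functions
are standard; the data and parameters are illustrative):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def admm_prox(prox_f, prox_g, x0, n_iter=100):
        """ADMM for min_x f(x) + g(x), written via prox operators at
        parameter 1/rho (equivalently, Douglas-Rachford splitting)."""
        x = z = x0.copy()
        w = np.zeros_like(x0)
        for _ in range(n_iter):
            x = prox_f(z - w)
            z = prox_g(x + w)
            w = w + x - z
        return z

    # Made-up instance: f(x) = (1/2)||x - a||^2, g(x) = lam*||x||_1,
    # whose exact solution is the soft-thresholding S_lam(a)
    a = np.array([3.0, -0.5, 1.2, -2.0])
    lam, rho = 1.0, 1.0
    prox_f = lambda v: (a + rho * v) / (1.0 + rho)      # prox of f at 1/rho
    prox_g = lambda v: soft_threshold(v, lam / rho)     # prox of g at 1/rho
    print(admm_prox(prox_f, prox_g, np.zeros_like(a)))  # ~ soft_threshold(a, lam)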

8
Example: lasso regression

Given y ∈ R^n, X ∈ R^{n×p}, recall the lasso problem:

    min_β  (1/2) ||y − Xβ||_2^2 + λ ||β||_1

We can rewrite this as:

    min_{β,α}  (1/2) ||y − Xβ||_2^2 + λ ||α||_1   subject to  β − α = 0

ADMM steps:

    β^(k) = (X^T X + ρ I)^{-1} (X^T y + ρ (α^(k-1) − w^(k-1)))
    α^(k) = S_{λ/ρ}(β^(k) + w^(k-1))
    w^(k) = w^(k-1) + β^(k) − α^(k)

9
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops,
then each β update takes O(p^2) flops
• The α update applies the soft-thresholding operator S_t, which
recall is defined as (see the sketch after this list)

      [S_t(x)]_j = { x_j − t,   if x_j > t
                   { 0,         if −t ≤ x_j ≤ t       j = 1, . . . , p
                   { x_j + t,   if x_j < −t

• ADMM steps are “almost” like repeated soft-thresholding of
ridge regression coefficients
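
A compact numpy/scipy sketch of these lasso ADMM steps on a made-up instance;
it factors X^T X + ρI once (Cholesky), so each β update is a pair of triangular
solves, matching the note above. The data and λ, ρ values are illustrative:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
        """ADMM for the lasso: min (1/2)||y - X beta||_2^2 + lam*||beta||_1."""
        n, p = X.shape
        beta = np.zeros(p)
        alpha = np.zeros(p)
        w = np.zeros(p)
        # Factor X^T X + rho*I once (O(p^3)); each beta update is then O(p^2)
        chol = cho_factor(X.T @ X + rho * np.eye(p))
        Xty = X.T @ y
        for _ in range(n_iter):
            beta = cho_solve(chol, Xty + rho * (alpha - w))
            alpha = soft_threshold(beta + w, lam / rho)
            w = w + beta - alpha
        return alpha

    # Made-up example
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    beta_true = np.zeros(50); beta_true[:5] = 2.0
    y = X @ beta_true + rng.standard_normal(200)
    print(np.round(lasso_admm(X, y, lam=10.0), 2)[:10])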

10
Comparison of various algorithms for lasso regression: 100 random
instances with n = 200, p = 50

[Figure: suboptimality f^(k) − f* versus iteration k, for coordinate descent,
proximal gradient, accelerated proximal gradient, and ADMM with
ρ = 50, 100, 200]

11
Practicalities

In practice, ADMM usually obtains a relatively accurate solution in
a handful of iterations, but it requires a large number of iterations
for a highly accurate solution (like a first-order method)

Choice of ρ can greatly influence practical convergence of ADMM:


• ρ too large → not enough emphasis on minimizing f + g
• ρ too small → not enough emphasis on feasibility
Boyd et al. (2010) give a strategy for varying ρ; it can work well in
practice, but does not have convergence guarantees

Like deriving duals, transforming a problem into one that ADMM
can handle is sometimes a bit subtle, since different forms can lead
to different algorithms

12
Example: group lasso regression

Given y ∈ R^n, X ∈ R^{n×p}, recall the group lasso problem:

    min_β  (1/2) ||y − Xβ||_2^2 + λ Σ_{g=1}^{G} c_g ||β_g||_2

Rewrite as:

    min_{β,α}  (1/2) ||y − Xβ||_2^2 + λ Σ_{g=1}^{G} c_g ||α_g||_2   subject to  β − α = 0

ADMM steps:

    β^(k) = (X^T X + ρ I)^{-1} (X^T y + ρ (α^(k-1) − w^(k-1)))
    α_g^(k) = R_{c_g λ/ρ}(β_g^(k) + w_g^(k-1)),   g = 1, . . . , G
    w^(k) = w^(k-1) + β^(k) − α^(k)

13
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops,
then each β update takes O(p^2) flops
• The α update applies the group soft-thresholding operator R_t,
which recall is defined as (see the sketch after this list)

      R_t(x) = (1 − t / ||x||_2)_+ x

• Similar ADMM steps follow for a sum of arbitrary norms as the
regularizer, provided we know the prox operator of each norm
• The ADMM algorithm can be rederived when groups have overlap
(a hard problem to optimize in general!). See Boyd et al. (2010)
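
A small sketch of the group soft-thresholding operator R_t and of how the α
update would apply it group by group; the group index lists, weights c_g, and
parameter names below are illustrative rather than part of the slides:

    import numpy as np

    def group_soft_threshold(x, t):
        """R_t(x) = (1 - t/||x||_2)_+ x, with R_t(0) = 0."""
        nrm = np.linalg.norm(x)
        if nrm <= t:
            return np.zeros_like(x)
        return (1.0 - t / nrm) * x

    def group_lasso_alpha_update(beta, w, groups, c, lam, rho):
        """alpha update in group lasso ADMM: apply R_{c_g*lam/rho} groupwise.
        `groups` is a list of index arrays, one per group."""
        alpha = np.empty_like(beta)
        for g, idx in enumerate(groups):
            alpha[idx] = group_soft_threshold(beta[idx] + w[idx], c[g] * lam / rho)
        return alpha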

14
Example: sparse subspace estimation

Given S ∈ S^p (typically S ⪰ 0 is a covariance matrix), consider the
sparse subspace estimation problem (Vu et al., 2013):

    max_Y  tr(SY) − λ ||Y||_1   subject to  Y ∈ F_k

where F_k is the Fantope of order k, namely

    F_k = {Y ∈ S^p : 0 ⪯ Y ⪯ I, tr(Y) = k}

Note that when λ = 0, the above problem is equivalent to ordinary
principal component analysis (PCA)

The above is an SDP and in principle solvable with interior-point
methods, though these can be complicated to implement and quite
slow for large problem sizes

15
Rewrite as:

    min_{Y,Z}  −tr(SY) + I_{F_k}(Y) + λ ||Z||_1   subject to  Y = Z

where I_{F_k} is the indicator function of F_k. ADMM steps are:

    Y^(k) = P_{F_k}(Z^(k-1) − W^(k-1) + S/ρ)
    Z^(k) = S_{λ/ρ}(Y^(k) + W^(k-1))
    W^(k) = W^(k-1) + Y^(k) − Z^(k)

Here P_{F_k} is the Fantope projection operator, computed by clipping the
eigendecomposition A = U Σ U^T, Σ = diag(σ_1, . . . , σ_p):

    P_{F_k}(A) = U Σ_θ U^T,   Σ_θ = diag(σ_1(θ), . . . , σ_p(θ))

where each σ_i(θ) = min{max{σ_i − θ, 0}, 1}, and θ is chosen so that Σ_{i=1}^{p} σ_i(θ) = k
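
A sketch of this Fantope projection: eigendecompose, then bisect on θ so that
the clipped eigenvalues sum to k. The bracketing bounds, tolerance, and
iteration cap below are arbitrary choices for the sketch:

    import numpy as np

    def fantope_projection(A, k, tol=1e-8, max_iter=100):
        """Project a symmetric matrix A onto the Fantope
        F_k = {Y : 0 <= Y <= I, tr(Y) = k}, assuming 0 <= k <= p."""
        sigma, U = np.linalg.eigh(A)

        def clipped_sum(theta):
            return np.clip(sigma - theta, 0.0, 1.0).sum()

        # g(theta) = sum_i min(max(sigma_i - theta, 0), 1) is nonincreasing,
        # with g(min(sigma) - 1) = p >= k and g(max(sigma)) = 0 <= k,
        # so bisection brackets the root of g(theta) = k
        lo, hi = sigma.min() - 1.0, sigma.max()
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if clipped_sum(mid) > k:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        sigma_theta = np.clip(sigma - 0.5 * (lo + hi), 0.0, 1.0)
        return (U * sigma_theta) @ U.T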

16
Example: sparse + low rank decomposition
Given M ∈ R^{n×m}, consider the sparse plus low rank decomposition
problem (Candes et al., 2009):

    min_{L,S}  ||L||_tr + λ ||S||_1   subject to  L + S = M

ADMM steps:

    L^(k) = S^tr_{1/ρ}(M − S^(k-1) + W^(k-1))
    S^(k) = S^ℓ1_{λ/ρ}(M − L^(k) + W^(k-1))
    W^(k) = W^(k-1) + M − L^(k) − S^(k)

where, to distinguish them, we use S^tr_{1/ρ} for matrix soft-thresholding
and S^ℓ1_{λ/ρ} for elementwise soft-thresholding
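
A brief numpy sketch of the two thresholding operators and of these ADMM
steps; λ, ρ, the iteration count, and the data M are placeholders:

    import numpy as np

    def soft_threshold(V, t):                      # elementwise S_t
        return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

    def matrix_soft_threshold(A, t):               # S_t^tr: threshold singular values
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return (U * np.maximum(s - t, 0.0)) @ Vt

    def sparse_plus_low_rank(M, lam, rho=1.0, n_iter=100):
        """ADMM sketch for min ||L||_tr + lam*||S||_1 subject to L + S = M."""
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        W = np.zeros_like(M)
        for _ in range(n_iter):
            L = matrix_soft_threshold(M - S + W, 1.0 / rho)
            S = soft_threshold(M - L + W, lam / rho)
            W = W + M - L - S
        return L, S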

17
Example from Candes et al. (2009):

[Figure: original video frames, alongside the low-rank L̂ and sparse Ŝ
components recovered by convex optimization (this work) and by an
alternating approach]
18
Consensus ADMM
Consider a problem of the form:

    min_x  Σ_{i=1}^{B} f_i(x)

The consensus ADMM approach begins by reparametrizing:

    min_{x_1,...,x_B, x}  Σ_{i=1}^{B} f_i(x_i)   subject to  x_i = x,  i = 1, . . . , B

This yields the decomposable ADMM steps:

    x_i^(k) = argmin_{x_i}  f_i(x_i) + (ρ/2) ||x_i − x^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    x^(k) = (1/B) Σ_{i=1}^{B} (x_i^(k) + w_i^(k-1))
    w_i^(k) = w_i^(k-1) + x_i^(k) − x^(k),   i = 1, . . . , B

19
Write x̄ = (1/B) Σ_{i=1}^{B} x_i, and similarly for other variables. It is not
hard to see that w̄^(k) = 0 for all iterations k ≥ 1

Hence the ADMM steps can be simplified, by taking x^(k) = x̄^(k):

    x_i^(k) = argmin_{x_i}  f_i(x_i) + (ρ/2) ||x_i − x̄^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    w_i^(k) = w_i^(k-1) + x_i^(k) − x̄^(k),   i = 1, . . . , B

To reiterate, the x_i, i = 1, . . . , B updates here are done in parallel

Intuition:
• Try to minimize each f_i(x_i), using (squared) ℓ2 regularization to
pull each x_i towards the average x̄
• If a variable x_i is bigger than the average, then w_i is increased
• So the regularization in the next step pulls x_i even closer
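
A minimal sketch of these simplified consensus updates on a toy split least
squares problem, f_i(x_i) = (1/2)||y_i − X_i x_i||_2^2, where each x_i update
has the closed form (X_i^T X_i + ρI)^{-1}(X_i^T y_i + ρ(x̄ − w_i)); the data
split and serial loop are illustrative (the x_i updates would run in parallel):

    import numpy as np

    def consensus_admm_lsq(X_blocks, y_blocks, rho=1.0, n_iter=100):
        """Consensus ADMM sketch for min_x sum_i (1/2)||y_i - X_i x||_2^2,
        where block i holds the data (X_i, y_i)."""
        B = len(X_blocks)
        p = X_blocks[0].shape[1]
        x_blocks = [np.zeros(p) for _ in range(B)]
        w_blocks = [np.zeros(p) for _ in range(B)]
        xbar = np.zeros(p)
        systems = [X.T @ X + rho * np.eye(p) for X in X_blocks]
        rhs_fixed = [X.T @ y for X, y in zip(X_blocks, y_blocks)]

        for _ in range(n_iter):
            # x_i updates: independent across blocks, so parallelizable
            for i in range(B):
                x_blocks[i] = np.linalg.solve(
                    systems[i], rhs_fixed[i] + rho * (xbar - w_blocks[i]))
            # average (the x update); wbar = 0 here, so it drops out
            xbar = np.mean(x_blocks, axis=0)
            # dual updates push each block toward the consensus average
            for i in range(B):
                w_blocks[i] += x_blocks[i] - xbar
        return xbar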

20
General consensus ADMM
Consider a problem of the form:

    min_x  Σ_{i=1}^{B} f_i(a_i^T x + b_i) + g(x)

For consensus ADMM, we again reparametrize:

    min_{x_1,...,x_B, x}  Σ_{i=1}^{B} f_i(a_i^T x_i + b_i) + g(x)   subject to  x_i = x,  i = 1, . . . , B

This yields the decomposable ADMM updates:

    x_i^(k) = argmin_{x_i}  f_i(a_i^T x_i + b_i) + (ρ/2) ||x_i − x^(k-1) + w_i^(k-1)||_2^2,   i = 1, . . . , B
    x^(k) = argmin_x  (Bρ/2) ||x − x̄^(k) − w̄^(k-1)||_2^2 + g(x)
    w_i^(k) = w_i^(k-1) + x_i^(k) − x^(k),   i = 1, . . . , B
21
Notes:
• It is no longer true that w̄^(k) = 0 at a general iteration k, so
the ADMM steps do not simplify as before
• To reiterate, the x_i, i = 1, . . . , B updates are done in parallel
• Each x_i update can be thought of as a loss minimization on
part of the data, with ℓ2 regularization
• The x update is a proximal operation in the regularizer g
• The w update drives the individual variables into consensus
• A different initial reparametrization will give rise to a different
ADMM algorithm

See Boyd et al. (2010), Parikh and Boyd (2013) for more details
on consensus ADMM, strategies for splitting up into subproblems,
and implementation tips

22
Special decompositions

ADMM can exhibit much faster convergence than usual, when we
parametrize subproblems in a “special way”
• ADMM updates relate closely to block coordinate descent, in
which we optimize a criterion in an alternating fashion across
blocks of variables
• With this in mind, get fastest convergence when minimizing
over blocks of variables leads to updates in nearly orthogonal
directions
• Suggests we should design the ADMM form (auxiliary constraints)
so that the primal updates decorrelate as much as possible
• This is done in, e.g., Ramdas and Tibshirani (2014), Wytock
et al. (2014), Barbero and Sra (2014)

23
Example: 2d fused lasso

Given an image Y ∈ R^{d×d}, equivalently written as y ∈ R^n, the 2d
fused lasso or 2d total variation denoising problem is:

    min_Θ  (1/2) ||Y − Θ||_F^2 + λ Σ_{i,j} (|Θ_{i,j} − Θ_{i+1,j}| + |Θ_{i,j} − Θ_{i,j+1}|)

    ⟺  min_θ  (1/2) ||y − θ||_2^2 + λ ||Dθ||_1
Here D ∈ R^{m×n} is a 2d difference operator giving the appropriate
differences (across horizontally and vertically adjacent positions)

[Figure: variables of the 2d grid shown as black dots; the partitions P and Q
are in orange and cyan]
24
First way to rewrite:

    min_{θ,z}  (1/2) ||y − θ||_2^2 + λ ||z||_1   subject to  z = Dθ

Leads to ADMM steps:

    θ^(k) = (I + ρ D^T D)^{-1} (y + ρ D^T (z^(k-1) + w^(k-1)))
    z^(k) = S_{λ/ρ}(D θ^(k) − w^(k-1))
    w^(k) = w^(k-1) + z^(k) − D θ^(k)

Notes:
• The θ update solves a linear system in I + ρL, with L = D^T D
the graph Laplacian matrix of the 2d grid, so this can be done
efficiently, in roughly O(n) operations
• The z update applies the soft-thresholding operator S_t
• Hence one entire ADMM cycle uses roughly O(n) operations (a
rough sketch of this approach follows below)
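
A rough numpy/scipy sketch of this first rewrite for a square d × d image
(row-major vectorization); a generic sparse factorization of I + ρ D^T D stands
in for the specialized, roughly O(n) solver mentioned in the notes above:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def grid_difference_operator(d):
        """Sparse 2d difference operator D for a d x d grid, acting on the
        row-major vectorization of Theta (one row per adjacent pair)."""
        I = sp.identity(d, format="csr")
        D1 = sp.diags([-np.ones(d - 1), np.ones(d - 1)], [0, 1], shape=(d - 1, d))
        # differences within each image row, then between adjacent rows
        return sp.vstack([sp.kron(I, D1), sp.kron(D1, I)], format="csr")

    def fused_lasso_2d_admm(y, d, lam, rho=1.0, n_iter=200):
        """Sketch of the 'first rewrite' ADMM for the 2d fused lasso."""
        D = grid_difference_operator(d)
        m = D.shape[0]
        theta, z, w = y.copy(), np.zeros(m), np.zeros(m)
        # pre-factor I + rho * D^T D (a sparse, Laplacian-like system)
        solve = spla.factorized((sp.identity(d * d) + rho * (D.T @ D)).tocsc())
        for _ in range(n_iter):
            theta = solve(y + rho * (D.T @ (z + w)))
            Dtheta = D @ theta
            z = np.sign(Dtheta - w) * np.maximum(np.abs(Dtheta - w) - lam / rho, 0.0)
            w = w + z - Dtheta
        return theta.reshape(d, d)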
25
Second way to rewrite:

    min_{H,V}  (1/2) ||Y − H||_F^2 + λ Σ_{i,j} (|H_{i,j} − H_{i+1,j}| + |V_{i,j} − V_{i,j+1}|)
    subject to  H = V

Leads to ADMM steps:

    H_{·,j}^(k) = FL^1d_{λ/(1+ρ)}( (Y_{·,j} + ρ (V_{·,j}^(k-1) − W_{·,j}^(k-1))) / (1 + ρ) ),   j = 1, . . . , d
    V_{i,·}^(k) = FL^1d_{λ/ρ}( H_{i,·}^(k) + W_{i,·}^(k-1) ),   i = 1, . . . , d
    W^(k) = W^(k-1) + H^(k) − V^(k)

Notes:
• Both H, V updates solve a (sequence of) 1d fused lassos, where we write

      FL^1d_τ(a) = argmin_x  (1/2) ||a − x||_2^2 + τ Σ_{i=1}^{d−1} |x_i − x_{i+1}|

26
• Critical: each 1d fused lasso solution can be computed exactly
in O(d) operations with specialized algorithms (e.g., Johnson,
2013; Davies and Kovac, 2001); a sketch treating such a solver as
a black box follows below
• Hence one entire ADMM cycle again uses roughly O(n) operations
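
A sketch of this specialized ADMM cycle; `solve_fl1d(a, tau)` is a placeholder
for a fast exact 1d fused lasso solver (e.g., in the spirit of Johnson, 2013)
and is not implemented here, so this is only the outer structure:

    import numpy as np

    def specialized_admm_2d(Y, lam, solve_fl1d, rho=1.0, n_iter=20):
        """Sketch of the 'second rewrite' ADMM; solve_fl1d(a, tau) is assumed
        to return the exact 1d fused lasso solution for data a at parameter tau."""
        n_rows, n_cols = Y.shape
        H, V, W = Y.copy(), Y.copy(), np.zeros_like(Y)
        for _ in range(n_iter):
            # H update: one 1d fused lasso down each column
            for j in range(n_cols):
                a = (Y[:, j] + rho * (V[:, j] - W[:, j])) / (1.0 + rho)
                H[:, j] = solve_fl1d(a, lam / (1.0 + rho))
            # V update: one 1d fused lasso across each row
            for i in range(n_rows):
                V[i, :] = solve_fl1d(H[i, :] + W[i, :], lam / rho)
            W += H - V
        return H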

27
Comparison of 2d fused lasso algorithms: an image of dimension
300 × 200 (so n = 60,000)

[Figure: the noisy data (left) and the exact solution (right)]
28
Two ADMM algorithms, (say) standard and specialized ADMM:

[Figure: criterion gap f^(k) − f* versus iteration k, for standard and
specialized ADMM]
29
ADMM iterates visualized after k = 10, 30, 50, 100 iterations:

[Figures: standard ADMM iterates (left) versus specialized ADMM iterates
(right), shown after 10, 30, 50, and 100 iterations]

30
References and further reading
• A. Barbero and S. Sra (2014), “Modular proximal optimization
for multidimensional total-variation regularization”
• S. Boyd and N. Parikh and E. Chu and B. Peleato and J.
Eckstein (2010), “Distributed optimization and statistical
learning via the alternating direction method of multipliers”
• E. Candes and X. Li and Y. Ma and J. Wright (2009),
“Robust principal component analysis?”
• N. Parikh and S. Boyd (2013), “Proximal algorithms”
• A. Ramdas and R. Tibshirani (2014), “Fast and flexible
ADMM algorithms for trend filtering”
• V. Vu and J. Cho and J. Lei and K. Rohe (2013), “Fantope
projection and selection: a near-optimal convex relaxation of
sparse PCA”
• M. Wytock and S. Sra and Z. Kolter (2014), “Fast Newton
methods for the group fused lasso”
31
