A General Analysis of The Convergence of ADMM

Abstract

We provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex. Our proof is based on a framework for analyzing optimization algorithms introduced in Lessard et al. (2014), reducing algorithm convergence to verifying the stability of a dynamical system. This approach generalizes a number of existing results and obviates any assumptions about specific choices of algorithm parameters. On a numerical example, we demonstrate that minimizing the derived bound on the convergence rate provides a practical approach to selecting algorithm parameters for particular ADMM instances. We complement our upper bound by constructing a nearly-matching lower bound on the worst-case rate of convergence.

et al. (2014); Wang & Banerjee (2012); Zhang et al. (2012); Meshi & Globerson (2011); Wang et al. (2013); Aslan et al. (2013); Forouzan & Ihler (2013); Romera-Paredes & Pontil (2013); Behmardi et al. (2014); Zhang & Kwok (2014). See Boyd et al. (2011) for an overview.

Part of the appeal of ADMM is the fact that, in many contexts, the algorithm updates lend themselves to parallel implementations. The algorithm is given in Algorithm 1. We refer to ρ > 0 as the step-size parameter.

Algorithm 1 Alternating Direction Method of Multipliers
1: Input: functions f and g, matrices A and B, vector c, parameter ρ
2: Initialize x₀, z₀, u₀
3: repeat
4:   x_{k+1} = argmin_x f(x) + (ρ/2)‖Ax + Bz_k − c + u_k‖²
5:   z_{k+1} = argmin_z g(z) + (ρ/2)‖Ax_{k+1} + Bz − c + u_k‖²
6:   u_{k+1} = u_k + Ax_{k+1} + Bz_{k+1} − c
7: until the stopping criterion is met
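As a concrete illustration (our own toy example, not an experiment from the paper), the sketch below runs Algorithm 1 on the scalar problem with f(x) = x²/2, g(z) = (z − 4)²/2, A = 1, B = −1, and c = 0, whose minimizer is x = z = 2. Both subproblems are quadratic, so the two updates are available in closed form.

```python
# Algorithm 1 (ADMM) on a toy scalar problem:
#   minimize f(x) + g(z) subject to x - z = 0,
# with f(x) = x^2/2 and g(z) = (z - 4)^2/2, so the minimizer is x = z = 2.

def admm_scalar(rho=1.0, iters=50):
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: argmin_x x^2/2 + (rho/2)(x - z + u)^2
        x = rho * (z - u) / (1.0 + rho)
        # z-update: argmin_z (z - 4)^2/2 + (rho/2)(x - z + u)^2
        z = (4.0 + rho * (x + u)) / (1.0 + rho)
        # dual update: u accumulates the running constraint residual x - z
        u = u + x - z
    return x, z, u

x, z, u = admm_scalar()  # x and z approach 2.0
```

With ρ = 1 this particular instance reaches its fixed point after only a couple of steps; in general the rate depends on the parameter choices, which is the subject of the analysis that follows.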
Algorithm 2 Over-Relaxed Alternating Direction Method of Multipliers
1: Input: functions f and g, matrices A and B, vector c, parameters ρ and α
2: Initialize x₀, z₀, u₀
3: repeat
4:   x_{k+1} = argmin_x f(x) + (ρ/2)‖Ax + Bz_k − c + u_k‖²
5:   z_{k+1} = argmin_z g(z) + (ρ/2)‖αAx_{k+1} − (1−α)Bz_k + Bz − αc + u_k‖²
6:   u_{k+1} = u_k + αAx_{k+1} − (1−α)Bz_k + Bz_{k+1} − αc
7: until the stopping criterion is met

Lemma 1. Suppose that f ∈ S_d(m, L), where 0 < m ≤ L < ∞. Suppose that b₁ = ∇f(a₁) and b₂ = ∇f(a₂). Then

    [a₁ − a₂; b₁ − b₂]ᵀ [−2mL·I_d  (m+L)·I_d; (m+L)·I_d  −2·I_d] [a₁ − a₂; b₁ − b₂] ≥ 0.

Proof. The Lipschitz continuity of ∇f implies the co-coercivity of ∇f, that is,

    (a₁ − a₂)ᵀ(b₁ − b₂) ≥ (1/L)‖b₁ − b₂‖².

Note that f(x) − (m/2)‖x‖² is convex and its gradient is Lipschitz continuous with parameter L − m.
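Lemma 1 can be checked numerically on a concrete function. In the illustrative example below (our own choice, not from the paper), f(x) = ½xᵀQx with Q = diag(1, 2, 5), so ∇f(a) = Qa, f is strongly convex with m = 1, and ∇f is Lipschitz with L = 5; the quadratic form from the lemma is evaluated directly and comes out nonnegative.

```python
# Numerical check of Lemma 1 for f(x) = (1/2) x^T Q x with Q = diag(1, 2, 5).
# Then grad f(a) = Q a, m = 1 (smallest eigenvalue), L = 5 (largest eigenvalue).
m, L = 1.0, 5.0
eigs = [1.0, 2.0, 5.0]                       # diagonal of Q

a1 = [1.0, -2.0, 0.5]
a2 = [0.0, 1.0, -1.0]
d = [p - q for p, q in zip(a1, a2)]          # a1 - a2
b = [lam * di for lam, di in zip(eigs, d)]   # b1 - b2 = Q (a1 - a2)

# Quadratic form from Lemma 1 with the block matrix
# [[-2mL I, (m+L) I], [(m+L) I, -2 I]] applied to the stacked vector (d, b):
form = sum(-2 * m * L * di ** 2 + 2 * (m + L) * di * bi - 2 * bi ** 2
           for di, bi in zip(d, b))
assert form >= -1e-9  # Lemma 1 predicts nonnegativity
```

Componentwise the form equals −2 dᵢ²(λᵢ − m)(λᵢ − L), which is nonnegative precisely because every eigenvalue of Q lies in [m, L].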
for particular matrices Â, B̂, Ĉ¹, D̂¹, Ĉ², and D̂² (whose dimensions do not depend on any problem parameters). First define the functions f̂, ĝ : Rʳ → R via

    f̂ = (ρ⁻¹f) ∘ A⁻¹
    ĝ = (ρ⁻¹g) ∘ B† + I_{im B},    (3)

where B† is any left inverse of B and where I_{im B} is the {0, ∞}-indicator function of the image of B. We define κ = κ_f κ_A², and to normalize we define

    m̂ = m / σ₁(A)²,    L̂ = L / σ_p(A)²,    ρ = (m̂L̂)^{1/2} ρ₀.    (4)

Note that under Assumption 3,

    f̂ ∈ S_p(ρ₀⁻¹κ^{−1/2}, ρ₀⁻¹κ^{1/2})    (5a)
    ĝ ∈ S_q(0, ∞).    (5b)

To define the relevant sequences, let the sequences (x_k), (z_k), and (u_k) be generated by Algorithm 2 with parameters α and ρ. Define the sequences (r_k) and (s_k) by r_k = Ax_k and s_k = Bz_k and the sequence (ξ_k) by

    ξ_k = [s_k; u_k].

We define the sequence (ν_k) as in Proposition 4.

Proposition 4. There exist sequences (β_k) and (γ_k) with β_k = ∇f̂(r_k) and γ_k ∈ ∂ĝ(s_k) such that when we define the sequence (ν_k) by

    ν_k = [β_{k+1}; γ_{k+1}],

then (ξ_k) and (ν_k) satisfy (2a) with the matrices

    Â = [1  α−1; 0  0],    B̂ = [α  −1; 0  −1].    (6)

Proof. Using the fact that A has full rank, we rewrite the update rule for x from Algorithm 2 as

    x_{k+1} = A⁻¹ argmin_r f(A⁻¹r) + (ρ/2)‖r + s_k − c + u_k‖².

Multiplying through by A, we can write

    r_{k+1} = argmin_r f̂(r) + (1/2)‖r + s_k − c + u_k‖².

This implies that

    0 = ∇f̂(r_{k+1}) + r_{k+1} + s_k − c + u_k,

and so

    r_{k+1} = −s_k − u_k + c − β_{k+1},    (7)

where β_{k+1} = ∇f̂(r_{k+1}). In the same spirit, we rewrite the update rule for z as

    s_{k+1} = argmin_s ĝ(s) + (1/2)‖αr_{k+1} − (1−α)s_k + s − αc + u_k‖².

It follows that there exists some γ_{k+1} ∈ ∂ĝ(s_{k+1}) such that

    0 = γ_{k+1} + αr_{k+1} − (1−α)s_k + s_{k+1} − αc + u_k.

It follows then that

    s_{k+1} = −αr_{k+1} + (1−α)s_k + αc − u_k − γ_{k+1}
            = s_k − (1−α)u_k + αβ_{k+1} − γ_{k+1},    (8)

where the second equality follows by substituting in (7). Combining (7) and (8) to simplify the u update, we have

    u_{k+1} = u_k + αr_{k+1} − (1−α)s_k + s_{k+1} − αc
            = −γ_{k+1}.    (9)

Together, (8) and (9) confirm the relation in (2a).

Corollary 5. Define the sequences (β_k) and (γ_k) as in Proposition 4. Define the sequences (w_k¹) and (w_k²) via

    w_k¹ = [r_{k+1} − c; β_{k+1}],    w_k² = [s_{k+1}; γ_{k+1}].

Then the sequences (ξ_k), (ν_k), (w_k¹), and (w_k²) satisfy (2b) and (2c) with the matrices

    Ĉ¹ = [−1  −1; 0  0],    D̂¹ = [−1  0; 1  0],
    Ĉ² = [1  α−1; 0  0],    D̂² = [α  −1; 0  1].    (10)

4. Convergence Rates from Semidefinite Programming

Now, in Theorem 6, we make use of the perspective developed in Section 3 to obtain convergence rates for Algorithm 2. This is essentially the same as the main result of Lessard et al. (2014), and we include it because it is simple and self-contained.

Theorem 6. Suppose that Assumption 3 holds. Let the sequences (x_k), (z_k), and (u_k) be generated by running Algorithm 2 with step size ρ = (m̂L̂)^{1/2}ρ₀ and with over-relaxation parameter α. Suppose that (x*, z*, u*) is a fixed point of Algorithm 2, and define

    φ_k = [z_k; u_k],    φ* = [z*; u*].
Fix 0 < τ < 1, and suppose that there exist a 2 × 2 positive definite matrix P ≻ 0 and nonnegative constants λ₁, λ₂ ≥ 0 such that the 4 × 4 linear matrix inequality

    0 ⪰ [ÂᵀPÂ − τ²P  ÂᵀPB̂; B̂ᵀPÂ  B̂ᵀPB̂] + [Ĉ¹ D̂¹; Ĉ² D̂²]ᵀ [λ₁M¹  0; 0  λ₂M²] [Ĉ¹ D̂¹; Ĉ² D̂²]    (11)

is satisfied, where M¹ and M² are given by

    M¹ = [−2ρ₀⁻²  ρ₀⁻¹(κ^{−1/2} + κ^{1/2}); ρ₀⁻¹(κ^{−1/2} + κ^{1/2})  −2]
    M² = [0  1; 1  0].

Then for all k ≥ 0, we have

    ‖φ_k − φ*‖ ≤ κ_B √κ_P ‖φ₀ − φ*‖ τᵏ.

Proof. Define r_k, s_k, β_k, γ_k, ξ_k, ν_k, w_k¹, and w_k² as before. Choose r* = Ax*, s* = Bz*, and

    w*¹ = [r* − c; β*],    w*² = [s*; γ*],    ξ* = [s*; u*],    ν* = [β*; γ*],

such that (ξ*, ν*, w*¹, w*²) is a fixed point of the dynamical system. Then

    (ξ_k − ξ*)ᵀP(ξ_k − ξ*) ≤ τ²ᵏ (ξ₀ − ξ*)ᵀP(ξ₀ − ξ*)

for all k. It follows that

    ‖ξ_k − ξ*‖ ≤ √κ_P ‖ξ₀ − ξ*‖ τᵏ.

The conclusion follows.

For fixed values of α, ρ₀, m̂, L̂, and τ, the feasibility of (11) is a semidefinite program with variables P, λ₁, and λ₂. We perform a binary search over τ to find the minimal rate τ such that the linear matrix inequality in (11) is satisfied. The results are shown in Figure 1 for a wide range of condition numbers κ, for α = 1.5, and for several choices of ρ₀. In Figure 2, we plot the values −1/log τ to show the number of iterations required to achieve a desired accuracy.

Figure 1. For α = 1.5 and for several choices of ε in ρ₀ = κ^ε, we plot the minimal rate τ for which the linear matrix inequality in (11) is satisfied as a function of κ.

When ρ₀ = κ^ε, the quantities in (11) can be expressed in terms of κ alone, and so the linear matrix inequality in (11) depends only on κ and not on m̂ and L̂. Therefore, we will consider step sizes of this form (recall from (4) that ρ = (m̂L̂)^{1/2}ρ₀). The choice ε = 0 is common in the literature (Giselsson & Boyd, 2014), but requires the user to know the strong-convexity parameter m̂. We also consider the choice ε = 0.5, which produces worse guarantees, but does not require knowledge of m̂.

One weakness of Theorem 6 is the fact that the rate we produce is not given as a function of κ. To use Theorem 6 as stated, we first specify the condition number (for example, κ = 1000). Then we search for the minimal τ such that (11) is feasible. This produces an upper bound on the convergence rate of Algorithm 2. To remedy this problem, in Section 5, we demonstrate how Theorem 6 can be used to obtain the convergence rate of Algorithm 2 as a symbolic function of the step size ρ and the over-relaxation parameter α:

    ‖φ_k − φ*‖ ≤ C ‖φ₀ − φ*‖ (1 − α/(2κ^{0.5+|ε|}))ᵏ,

where

    C = κ_B √(max{α/(2−α), (2−α)/α}).

Proof. We claim that for all sufficiently large κ, the linear matrix inequality in (11) is satisfied with the rate τ = 1 − α/(2κ^{0.5+|ε|}) and with certificate

    λ₁ = ακ^{ε−0.5},    λ₂ = α,    P = [1  α−1; α−1  1].

The matrix on the right hand side of (11) can be expressed as −(1/4)ακ⁻²M, where M is a symmetric 4 × 4 matrix whose last row and column consist of zeros. We wish to prove that M is positive semidefinite for all sufficiently large κ. To do so, we consider the cases ε ≥ 0 and ε < 0 separately, though the two cases will be nearly identical. First suppose that ε ≥ 0. In this case, the nonzero entries of M are specified by

    M₁₁ = ακ^{1−2ε} + 4κ^{3/2−ε}
    M₁₂ = α²κ^{1−2ε} − ακ^{1−2ε} + 12κ^{3/2−ε} − 4ακ^{3/2−ε}
    M₁₃ = 4κ + 8κ^{3/2−ε}
    M₃₃ = 8κ + 8κ² − 4ακ² + 8κ^{3/2−ε} + 8κ^{3/2+ε}.

As before, we show that each of the first three leading principal minors of M is positive. For large κ, the first leading principal minor (which is simply M₁₁) is dominated by the term 8κ^{3/2−ε}, which is positive. Similarly, the second leading principal minor is dominated by the term 32(2 − α)κ^{7/2−ε}, which is positive. The third leading principal minor is dominated by the term 128(2 − α)κ⁵, which is positive. Since these leading coefficients are all positive, it follows that for all sufficiently large κ, the matrix M is positive semidefinite.

The result now follows from Theorem 6 by noting that P has eigenvalues α and 2 − α.
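The certificate in the proof above can be checked numerically for a concrete parameter setting. The sketch below (our own illustration, with arbitrarily chosen values κ = 10⁴, α = 1.5, ε = 0.5) assembles the right hand side of (11) from the matrices in (6), (10), and Theorem 6, plugs in τ = 1 − α/(2κ^{0.5+|ε|}) together with λ₁, λ₂, and P from the proof, and verifies that the result is negative semidefinite.

```python
import numpy as np

# Check the LMI (11) at the certificate from the proof, for one parameter setting.
kappa, alpha, eps = 1e4, 1.5, 0.5
rho0 = kappa ** eps
tau = 1.0 - alpha / (2.0 * kappa ** (0.5 + abs(eps)))

# Matrices from (6) and (10).
A = np.array([[1.0, alpha - 1.0], [0.0, 0.0]])
B = np.array([[alpha, -1.0], [0.0, -1.0]])
K1 = np.array([[-1.0, -1.0, -1.0, 0.0], [0.0, 0.0, 1.0, 0.0]])          # [C1 D1]
K2 = np.array([[1.0, alpha - 1.0, alpha, -1.0], [0.0, 0.0, 0.0, 1.0]])  # [C2 D2]

# M1 and M2 from Theorem 6.
s = (kappa ** -0.5 + kappa ** 0.5) / rho0
M1 = np.array([[-2.0 / rho0 ** 2, s], [s, -2.0]])
M2 = np.array([[0.0, 1.0], [1.0, 0.0]])

# Certificate from the proof.
lam1 = alpha * kappa ** (eps - 0.5)
lam2 = alpha
P = np.array([[1.0, alpha - 1.0], [alpha - 1.0, 1.0]])

top = np.hstack([A.T @ P @ A - tau ** 2 * P, A.T @ P @ B])
bot = np.hstack([B.T @ P @ A, B.T @ P @ B])
T = np.vstack([top, bot]) + lam1 * K1.T @ M1 @ K1 + lam2 * K2.T @ M2 @ K2

max_eig = np.max(np.linalg.eigvalsh(T))
assert max_eig <= 1e-8  # (11) holds: the matrix is negative semidefinite
```

For this setting the matrix's last row and column also vanish, matching the structure of M noted in the proof.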
Note that since the matrix P doesn't depend on ρ, the proof holds even when the step size changes at each iteration.

6. Lower Bounds

In this section, we probe the tightness of the upper bounds on the convergence rate of Algorithm 2 given by Theorem 6. The construction of the lower bound in this section is similar to a construction given in Ghadimi et al. (2015).

Let Q be a d-dimensional symmetric positive-definite matrix whose largest and smallest eigenvalues are L and m respectively. Let f(x) = (1/2)xᵀQx be a quadratic and let g(z) = (δ/2)‖z‖² for some δ ≥ 0. Let A = I_d, B = −I_d, and c = 0. With these definitions, the optimization problem in (1) is solved by x = z = 0. The updates for Algorithm 2 are given by

    x_{k+1} = ρ(Q + ρI)⁻¹(z_k − u_k)    (13a)
    z_{k+1} = (ρ/(δ + ρ))(αx_{k+1} + (1 − α)z_k + u_k)    (13b)
    u_{k+1} = u_k + αx_{k+1} + (1 − α)z_k − z_{k+1}.    (13c)

Solving for z_k in (13b) and substituting the result into (13c) gives u_{k+1} = (δ/ρ)z_{k+1}. Then eliminating x_{k+1} and u_k from the updates yields a linear operator T governing the evolution of z_k.

When initialized with z as a suitable eigenvector, Algorithm 2 will converge linearly with rate given exactly by (16), which is lower bounded by the expression in (15) when ε ≥ 0. Now suppose that ε < 0. Choosing δ = L and λ = L, after multiplying the numerator and denominator of (14) by κ^{0.5−ε}, we see that T has eigenvalue

    1 − 2α/((1 + κ^{0.5−ε})(κ^{−0.5+ε} + 1)) ≥ 1 − 2α/(1 + κ^{0.5−ε}).    (17)

When initialized with z as the eigenvector corresponding to this eigenvalue, Algorithm 2 will converge linearly with rate given exactly by the left hand side of (17), which is lower bounded by the expression in (15) when ε < 0.

Figure 3 compares the lower bounds given by (16) with the upper bounds given by Theorem 6 for α = 1.5 and for several choices of ρ = (m̂L̂)^{1/2}κ^ε satisfying ε ≥ 0. The upper and lower bounds agree visually on the range of choices of ε depicted, demonstrating the practical tightness of the upper bounds given by Theorem 6 for a large range of choices of parameter values.
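The lower-bound construction can be simulated directly. The following sketch (an illustrative instance of our own choosing, with d = 3, m = 1, L = 10, δ = 1, ρ = 2, and α = 1.5) iterates the closed-form updates (13a)–(13c) and checks the identity u_{k+1} = (δ/ρ)z_{k+1} derived above.

```python
import numpy as np

# Simulate Algorithm 2 on the lower-bound instance: f(x) = x^T Q x / 2,
# g(z) = (delta/2)||z||^2, A = I, B = -I, c = 0, using the closed forms (13a)-(13c).
d = 3
Q = np.diag([1.0, 3.0, 10.0])   # m = 1, L = 10
delta, rho, alpha = 1.0, 2.0, 1.5
I = np.eye(d)
Xinv = np.linalg.inv(Q + rho * I)

z = np.ones(d)
u = np.zeros(d)
for k in range(50):
    x = rho * Xinv @ (z - u)                                           # (13a)
    z_new = (rho / (delta + rho)) * (alpha * x + (1 - alpha) * z + u)  # (13b)
    u = u + alpha * x + (1 - alpha) * z - z_new                        # (13c)
    z = z_new
    # The dual iterate is a fixed multiple of z after every step:
    assert np.allclose(u, (delta / rho) * z)

# The iterates contract toward the solution x = z = 0.
assert np.linalg.norm(z) < 1e-10
```

For these particular parameter values the iteration contracts quickly; the point of the construction in the text is that for ill-conditioned Q and the step sizes considered there, the contraction factor approaches the certified upper bound.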
convergence of a generalization of ADMM to a multiterm objective in the setting where each term can be decomposed as a strictly convex function and a polyhedral function. In particular, this result does not require strong convexity.

More generally, there are a number of results for operator splitting methods in the literature. Lions & Mercier (1979) and Eckstein & Ferris (1998) analyze the convergence of several operator splitting schemes. More recently, Patrinos et al. (2014a;b) prove the equivalence of forward-backward splitting and Douglas–Rachford splitting with a scaled version of the gradient method applied to unconstrained nonconvex surrogate functions (called the forward-backward envelope and the Douglas–Rachford envelope respectively). Goldstein et al. (2012) propose an accelerated version of ADMM in the spirit of Nesterov, and prove an O(1/k²) convergence rate in the case where f and g are both strongly convex and g is quadratic.

The theory of over-relaxed ADMM is more limited. Eckstein & Bertsekas (1992) prove the convergence of over-relaxed ADMM but do not give a rate. More recently, Davis & Yin (2014a;b) analyze the convergence rates of ADMM in a variety of settings. Giselsson & Boyd (2014) prove the linear convergence of Douglas–Rachford splitting in the strongly-convex setting. They use the fact that ADMM is Douglas–Rachford splitting applied to the dual problem (Eckstein & Bertsekas, 1992) to derive a linear convergence rate for over-relaxed ADMM with a specific choice of step size ρ. Eckstein (1994) gives convergence results for several specializations of ADMM, and finds that over-relaxation with α = 1.5 empirically speeds up convergence. Ghadimi et al. (2015) give some guidance on tuning over-relaxed ADMM in the quadratic case.

Unlike prior work, our framework requires no assumptions on the parameter choices in Algorithm 2. For example, Theorem 6 certifies the linear convergence of Algorithm 2 even for values α > 2. In our framework, certifying a convergence rate for an arbitrary choice of parameters amounts to checking the feasibility of a 4 × 4 semidefinite program, which is essentially instantaneous, as opposed to formulating a proof.

8. Selecting Algorithm Parameters

In this section, we show how to use the results of Section 4 to select the parameters α and ρ in Algorithm 2, and we show the effect on a numerical example.

Recall that given a choice of parameters α and ρ and given the condition number κ, Theorem 6 gives an upper bound on the convergence rate of Algorithm 2. Therefore, one approach to parameter selection is to do a grid search over the space of parameters for the choice that minimizes the upper bound provided by Theorem 6. We demonstrate this approach numerically for a distributed Lasso problem, but first we demonstrate that the usual range of (0, 2) for the over-relaxation parameter α is too limited, and that more choices of α lead to linear convergence. In Figure 4, we plot the largest value of α found through binary search such that (11) is satisfied for some τ < 1 as a function of κ. Proof techniques in prior work do not extend as easily to values of α > 2. In our framework, we simply change some constants in a small semidefinite program.

Figure 4. As a function of κ, we plot the largest value of α such that (11) is satisfied for some τ < 1. In this figure, we set ε = 0 in ρ₀ = κ^ε.

8.1. Distributed Lasso

Following Deng & Yin (2012), we give a numerical demonstration with a distributed Lasso problem of the form

    minimize    Σ_{i=1}^N (1/(2μ))‖A_i x_i − b_i‖² + ‖z‖₁
    subject to  x_i − z = 0 for all i = 1, …, N.

Each A_i is a tall matrix with full column rank, and so the first term in the objective will be strongly convex and its gradient will be Lipschitz continuous. As in Deng & Yin (2012), we choose N = 5 and μ = 0.1. Each A_i is generated by populating a 600 × 500 matrix with independent standard normal entries and normalizing the columns. We generate each b_i via b_i = A_i x₀ + ε_i, where x₀ is a sparse 500-dimensional vector with 250 independent standard normal entries, and ε_i ∼ N(0, 10⁻³I).

In Figure 5, we compute the upper bounds on the convergence rate given by Theorem 6 for a grid of values of α and ρ. Each line corresponds to a fixed choice of α, and we plot only a subset of the values of α to keep the plot manageable. We omit points corresponding to parameter values for which the linear matrix inequality in (11) was not feasible for any value of τ < 1.

In Figure 6, we run Algorithm 2 for the same values of α and ρ.
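The ADMM updates for this consensus formulation are a ridge-regularized least-squares solve for each x_i and a soft-thresholding step for z. The sketch below is our own minimal implementation (α = 1 for simplicity, i.e., no over-relaxation), and the tiny instance at the end is illustrative, far smaller than the 600 × 500 experiment described above.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (elementwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def consensus_lasso_admm(As, bs, mu, rho=1.0, iters=200):
    # ADMM for: minimize sum_i (1/(2 mu)) ||A_i x_i - b_i||^2 + ||z||_1
    #           subject to x_i - z = 0 for all i.
    N, n = len(As), As[0].shape[1]
    us = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    # Pre-factor each x-update: (A_i^T A_i / mu + rho I) x = A_i^T b_i / mu + rho (z - u_i).
    invs = [np.linalg.inv(A.T @ A / mu + rho * np.eye(n)) for A in As]
    lin = [A.T @ b / mu for A, b in zip(As, bs)]
    for _ in range(iters):
        xs = [invs[i] @ (lin[i] + rho * (z - us[i])) for i in range(N)]
        # z-update: prox of ||.||_1 at the average of x_i + u_i.
        z = soft_threshold(sum(x + u for x, u in zip(xs, us)) / N, 1.0 / (N * rho))
        for i in range(N):
            us[i] = us[i] + xs[i] - z
    return z

# Tiny illustrative instance (N = 1, scalar): minimize 5(x - 1)^2 + |x|,
# whose closed-form solution soft-thresholds to 0.9.
z = consensus_lasso_admm([np.array([[1.0]])], [np.array([1.0])], mu=0.1)
```

Here the z-update reduces to shrinkage of the average of x_i + u_i because the quadratic penalties from all N constraints combine into a single quadratic centered at that average.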
Figure 5. We compute the upper bounds on the convergence rate given by Theorem 6 for eighty-five values of α evenly spaced between 0.1 and 2.2 and fifty values of ρ geometrically spaced between 0.1 and 10. Each line corresponds to a fixed choice of α, and we show only a subset of the values of α to keep the plot manageable. We omit points corresponding to parameter values for which (11) is not feasible for any value of τ < 1. This analysis suggests choosing α = 2.0 and ρ = 1.7.

Figure 6. We run Algorithm 2 for eighty-five values of α evenly spaced between 0.1 and 2.2 and fifty values of ρ geometrically spaced between 0.1 and 10. We plot the number of iterations required for z_k to reach within 10⁻⁶ of a precomputed reference solution. We show only a subset of the values of α to keep the plot manageable. We omit points corresponding to parameter values for which Algorithm 2 exceeded 1000 iterations.
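The empirical grid search behind Figure 6 can be mimicked on a small instance. This sketch (our own toy, not the paper's Lasso experiment) runs the scalar quadratic instance from Section 6, with f(x) = qx²/2 and g(z) = (δ/2)z², over a grid of (α, ρ) values and records the number of iterations needed to drive |z_k| below a tolerance.

```python
# Empirical grid search over (alpha, rho) on a scalar instance of Algorithm 2 with
# f(x) = q x^2 / 2, g(z) = delta z^2 / 2, A = 1, B = -1, c = 0 (minimizer x = z = 0).
def iterations_to_tol(q, delta, rho, alpha, tol=1e-8, max_iter=1000):
    z, u = 1.0, 0.0
    for k in range(1, max_iter + 1):
        x = rho * (z - u) / (q + rho)                                    # x-update
        z_new = rho * (alpha * x + (1 - alpha) * z + u) / (delta + rho)  # z-update
        u = u + alpha * x + (1 - alpha) * z - z_new                      # u-update
        z = z_new
        if abs(z) <= tol:
            return k
    return max_iter  # did not converge within the budget

q, delta = 5.0, 1.0
grid = [(alpha, rho)
        for alpha in (1.0, 1.5, 1.9)
        for rho in (0.5, 1.0, 2.0, 5.0)]
results = {(a, r): iterations_to_tol(q, delta, r, a) for a, r in grid}
best = min(results, key=results.get)  # parameter pair with the fewest iterations
```

This is the empirical analogue of minimizing the certified bound: instead of searching for the minimal feasible τ in (11), one simply counts iterations on a representative instance.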
tion Processing Systems, pp. 2985–2993, 2013.

Behmardi, B., Archambeau, C., and Bouchard, G. Overlapping trace norms in multi-view learning. arXiv preprint arXiv:1404.6163, 2014.

Bioucas-Dias, J. M. and Figueiredo, M. A. T. Alternating direction algorithms for constrained sparse regression: application to hyperspectral unmixing. In Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, pp. 1–4. IEEE, 2010.

Bird, S. Optimizing Resource Allocations for Dynamic Interactive Applications. PhD thesis, University of California, Berkeley, 2014.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Corless, M. Guaranteed rates of exponential convergence for uncertain systems. Journal of Optimization Theory and Applications, 64(3):481–494, 1990.

D'Alto, L. and Corless, M. Incremental quadratic stability. Numerical Algebra, Control and Optimization, 3(1):175–201, 2013.

Davis, D. and Yin, W. Convergence rate analysis of several splitting schemes. arXiv preprint arXiv:1406.4834, 2014a.

Davis, D. and Yin, W. Convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210, 2014b.

Deng, W. and Yin, W. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

Eckstein, J. Parallel alternating direction multiplier decomposition of convex programs. Journal of Optimization Theory and Applications, 80(1):39–62, 1994.

Eckstein, J. and Bertsekas, D. P. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

Eckstein, J. and Ferris, M. C. Operator-splitting methods for monotone affine variational inequalities, with a parallel application to optimal control. INFORMS Journal on Computing, 10(2):218–235, 1998.

Forero, P. A., Cano, A., and Giannakis, G. B. Consensus-based distributed support vector machines. The Journal of Machine Learning Research, 11:1663–1707, 2010.

Forouzan, S. and Ihler, A. Linear approximation to ADMM for MAP inference. In Asian Conference on Machine Learning, pp. 48–61, 2013.

Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

Ghadimi, E., Teixeira, A., Shames, I., and Johansson, M. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.

Giselsson, P. and Boyd, S. Diagonal scaling in Douglas–Rachford splitting and ADMM. In IEEE Conference on Decision and Control, pp. 5033–5039, 2014.

Glowinski, R. and Marroco, A. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975.

Goldstein, T., O'Donoghue, B., and Setzer, S. Fast alternating direction optimization methods. CAM report, pp. 12–35, 2012.

Hong, M. and Luo, Z.-Q. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

Iutzeler, F., Bianchi, P., Ciblat, Ph., and Hachem, W. Linear convergence rate for distributed optimization with the alternating direction method of multipliers. In IEEE Conference on Decision and Control, pp. 5046–5051, 2014.

Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. arXiv preprint arXiv:1408.3595, 2014.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014.

Lions, P.-L. and Mercier, B. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.

Meshi, O. and Globerson, A. An alternating direction method for dual MAP LP relaxation. In Machine Learning and Knowledge Discovery in Databases, pp. 470–483. Springer, 2011.