Asynchrony&delays PDF

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

1

Decentralized Consensus Optimization with


Asynchrony and Delays
Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H. Sayed

Abstract—We propose an asynchronous, decentralized algo-


rithm for consensus optimization. The algorithm runs over a
network in which the agents communicate with their neighbors
and perform local computation.
arXiv:1612.00150v2 [math.OC] 2 Mar 2017

In the proposed algorithm, each agent can compute and com-


municate independently at different times, for different durations,
with the information it has even if the latest information from its
neighbors is not yet available. Such an asynchronous algorithm
reduces the time that agents would otherwise waste idle because
of communication delays or because their neighbors are slower.
It also eliminates the need for a global clock for synchronization. Fig. 1: Network and uncoordinated computing.
Mathematically, the algorithm involves both primal and dual
variables, uses fixed step-size parameters, and provably converges
to the exact solution under a random agent assumption and
both bounded and unbounded delay assumptions. When running be inefficient or impractical to synchronize multiple nodes and
synchronously, the algorithm performs just as well as existing links. To see this, let xi,k ∈ Rp be the local variable of agent
competitive synchronous algorithms such as PG-EXTRA, which i at iteration k, and let X k = [x1,k , x2,k , . . . , xn,k ]> ∈ Rn×p
diverges without synchronization. Numerical experiments con-
collect all local variables, where k is the iteration index. In a
firm the theoretical findings and illustrate the performance of
the proposed algorithm. synchronous implementation, in order to perform an iteration
that updates the entire X k to X k+1 , all agents will need to
Index Terms—decentralized, asynchronous, delay, consensus
optimization.
wait for the slowest agent or be held back by the slowest
communication. In addition, a clock coordinator is necessary
for synchronization, which can be expensive and demanding
I. I NTRODUCTION AND R ELATED W ORK
to maintain in a large-scale decentralized network.
This paper considers a connected network of n agents that Motivated by these considerations, this paper proposes an
cooperatively solve the consensus optimization problem asynchronous decentralized algorithm where actions by agents
n are not required to run synchronously. To allow agents to
1X
minimize f¯(x) := fi (x), compute and communicate at different moments, for different
p
x∈R n i=1
durations, the proposed algorithm introduces delays into the
where fi (x) := si (x) + ri (x), i = 1, . . . , n. (1) iteration — the update of X k can rely on delayed information
We assume that the functions si and ri : Rp → R are convex received from neighbors. The information may be several-
differentiable and possibly nondifferentiable functions, respec- iteration out of date. Under uniformly bounded (but arbitrary)
tively. We call fi = si + ri a composite objective function. delays and that the next update is done by a random agent ,
Each si and ri are kept private by agent i = 1, 2, · · · , n, and ri this paper will show that the sequence {X k }k≥0 converges to
often serves as the regularization term or the indicator function a solution to Problem (1) with probability one.
to a certain constraint on the optimization variable x ∈ Rp that What can cause delays? Clearly, communication latency
is common to all the agents. Decentralized algorithms rely on introduces delays. Furthermore, as agents start and finish
agents’ local computation, as well as the information exchange their iterations independently, one agent may have updated
between agents. Such algorithms are generally robust to failure its variables while its neighbors are working on their current
of critical relaying agents and scalable with network size. iterations that still use old (i.e., delayed) copies of those
In decentralized computation, especially with heterogeneous variables; this situation is illustrated in Fig. 1. For example,
agents or due to processing and communication delays, it can before iteration 3, agents 1 and 2 have finished updating
their local variables x1,2 and x2,1 respectively, but agent 3
This work was supported in part by NSF grants CCF-1524250, DMS- is still relying on delayed neighboring variables {x1,0 , x2,0 },
1317602, ECCS-1407712, and ECCS-1462397, NSF China grant 61573331, rather than the updated variables {x1,2 , x2,1 }, to update x3,3 .
and DARPA project N66001-14-2-4029. A short version of this work appeared
in the conference publication [1]. Therefore, both computation and communication cause delays.
T. Wu and W. Yin are with the Department of Mathematics, University of
California, Los Angeles, CA90095, USA. K. Yuan and A. H. Sayed are with
the Department of Electrical Engineering, University of California, Los Ange- A. Relationship to certain synchronous algorithms
les, CA90095, USA. Q. Ling is with the Department of Automation, Univer- The proposed algorithm, if running synchronously, can be
sity of Science and Technology of China, Hefei, Anhui 230026, China. Emails:
{wuty11,kunyuan,wotaoyin,sayed}@ucla.edu, qingling@mail.ustc.edu.cn algebraically reduced to PG-EXTRA [2]; they solve problem
(1) with a fixed step-size parameter and are typically faster
2

than algorithms using diminishing step sizes. Also, both al- transferred into no communication delay by replacing an edge
gorithms generalize EXTRA [3], which only deals with dif- with a chain of dummy nodes. Information passing through a
ferentiable functions. However, divergence (or convergence to chain of τ dummy nodes simulates an edge with a τ -iteration
wrong solutions) can be observed when one runs EXTRA and delay. The computation in this setting is still synchronous, so a
PG-EXTRA in the asynchronous setting, where the proposed coordinator or global clock is still needed. [30], [32] consider
algorithm works correctly. The proposed algorithm in this random communication delays in their setting. However, they
paper must use additional dual variables that are associated are only suitable for consensus averaging, not generalized for
with edges, thus leading to a moderate cost of updating and the optimization problem (1).
communicating the dual variables. Our setting is identical to the setting outlined in Section
The proposed algorithm is also very different from decen- 2.6 of [33], the asynchronous decentralized ADMM. Our
tralized ADMM [3]–[5] except that both algorithms can use algorithm, however, handles composite functions and avoids
fixed parameters. Distributed consensus methods [6], [7] that solving possibly complicated subproblems. The recent paper
rely on fixed step-sizes can also converge fast, albeit only to [34] also considers both random computation and commu-
approximate solutions [6], [7]. nication delays. However, as the focus of that paper is a
Several useful diffusion strategies [8]–[13] have also been quasi-Newton method, its analysis assumes twice continuously
developed for solving stochastic decentralized optimization differentiable functions and thus excludes nondifferentiable
problems where realizations of random cost functions are functions si in our problem. In fact, [34] solves a different
observed at every iteration. To keep continuous learning alive, problem and cannot be directly used to solve our Problem (1)
these strategies also employ fixed step-sizes, and they converge even if its objective functions are smooth. It can, however,
fast to a small neighborhood of the true solution. The diffusion solve an approximation problem of (1) with either reduced
strategies operate on the primal domain, but they can out- solution accuracy or low speed.
perform some primal-dual strategies in the stochastic setting Our setting is also related with [35], in which an asyn-
due to the influence of gradient noise [14]. These studies chronous framework is proposed to solve a composite non-
focused on synchronous implementations. Here, our emphasis convex optimization problem, where delays and inconsistent
is on asynchronous networks, where delays are present and reads are allowed. However, the framework in [35] is designed
become critical, and also on deterministic optimization where for parallel computing within one single agent rather than a
convergence to the true solution of Problem (1) is desired. multi-agent network, which is a fundamental difference with
this paper. A summary of various decentralized asynchronous
B. Related decentralized algorithms under different settings
algorithms is listed in Table I.
Our setting of asynchrony is different from randomized
single activation, which is assumed for the randomized gossip
algorithm [15], [16]. Their setting activates only one edge at C. Contributions
a time and does not allow any delay. That is, before each
activation, computation and communication associated with This paper introduces an asynchronous, decentralized algo-
previous activations must be completed, and only one edge rithm for problem (1) that provably converges to an optimal
in each neighborhood can be activated at any time. Likewise, solution assuming that the next update is performed by a
our setting is different from randomized multi-activation such random agent and that communication is subject to arbitrary
as [17], [18] for consensus averaging, and [19]–[25] for and possibly unbounded delays. If running synchronously, the
consensus optimization, which activate multiple edges each proposed algorithm is as fast as the competitive PG-EXTRA
time and still do not allow any delay. These algorithms can algorithms except, for asynchrony, the proposed algorithm
be alternatively viewed as synchronous algorithms running in involves updating and communicating additional edge variable.
a sequence of varying subgraphs. Since each iteration waits Our asynchronous setting is considerably less restrictive
for the slowest agent or longest communication, a certain than the settings under which existing non-synchronous or
coordinator or global clock is needed. non-deterministic decentralized algorithms are proposed. In
Our setting is also different from [26]–[29], in which other our setting, the computation and communication of agents are
sources of asynchronous behavior in networks are allowed, uncoordinated. A global clock is not needed.
such as different arrival times of data at the agents, random Technical contributions are also made. We borrow ideas
on/off activation of edges and neighbors, random on/off up- from monotone operator theory and primal-dual operator split-
dates by the agents, and random link failures, etc. Although ting to derive the proposed algorithms in compact steps. To
the results in these works provide notable evidence for the establish convergence under delays, we cannot follow the
resilience of decentralized algorithms to network uncertainties, existing analysis of PG-EXTRA [2]; instead, motivated by
they do not consider delays. [33], [36], a new non-Euclidean metric is introduced to absorb
We also distinguish our setting from the fixed communica- the delays. In particular, the foundation of the analysis is not
tion delay setting [30], [31], where the information passing the monotonicity of the objective value sequence. Instead, the
through each edge takes a fixed number of iterations to arrive. proposed approach establishes monotonic conditional expecta-
Different edges can have different such numbers, and agents tions of certain distances to the solution. We believe this new
can compute with only the information they have, instead analysis can extend to a general set of primal-dual algorithms
of waiting. As demonstrated in [30], this setting can be beyond decentralized optimization.
3

TABLE I: A comparison among asynchronous decentralized algorithms


Algorithm Synchronization Cost Allow delays in Recursions Idle Time Cost Function Exact Convergence∗
Gossip [15], [16] Reduced No Yes Average Consensus Yes
Randomized Consensus [17], [18] Reduced No Yes Average Consensus Yes
Randomized DGD [19], [26] Reduced No Yes Aggregate Optimization No
Asynchronous Diffusion [27]–[29] Reduced No Yes Aggregate stochastic Optimization No
Randomized ADMM [20], [22], [25] Reduced No Yes Aggregate Optimization Yes
Randomized Prox-DGD [23] Reduced No Yes Composite Aggregate Optimization No
Asynchronous Newton [34] None Yes No Aggregate Optimization No
Asynchronous ADMM [33] None Yes No Aggregate Optimization Yes
Asynchronous Prox-DGD (22) None Yes No Composite Aggregate Optimization No
Proposed Alg. 2 None Yes No Composite Aggregate Optimization Yes
∗ Convergence to the exact optimal solution when step size is constant.

D. Notation through the maximum-degree or Metropolis-Hastings rules


i
Each agent i holds a local variable x ∈ R , whose value p [37]. It is easy to verify that null{I − W } = span{1}.
at iteration k is denoted by xi,k . We introduce variable X to The proposed algorithm involves a matrix V , which we
m×m
stack all local variables xi : now define. Introduce the p diagonal matrix D ∈ R with
 >  diagonal entries De,e = wij /2 for each edge e = (i, j). Let
x1 C = [cei ] ∈ Rm×n be the incidence matrix of G:
..  ∈ Rn×p .
 
X :=  (2)

 .   +1, i is the lower indexed agent connected to e,

>
(xn ) cei = −1, i is the higher indexed agent connected to e,

0, otherwise.

In this paper, we denote the j-th row of a matrix A as Aj . (3)
Therefore, for X defined in (2), it holds that Xi = (xi )> ,
which indicates that the ith row of X is the local variable With incidence matrix (3), we further define
xi ∈ Rp of agent i. Now we define functions V := DC ∈ Rm×n . (4)
n n
X X as the scaled incidence matrix. It is easy to verify:
s(X) := si (xi ), r(X) := ri (xi ),
i=1 i=1 Proposition 1 (Matrix factorization identity). When W and
as well as V is generated according to the above description, we have
n
X V > V = (I − W )/2.
f (X) := fi (xi ) = s(X) + r(X).
i=1 A. Problem formulation
The gradient of s(X) is defined as Let us reformulate Problem (1). First, it is equivalent to
 >  n n
∇s1 (x1 )
X X
minimize
1 n p
si (xi ) + ri (xi ),
x ,··· ,x ∈R
..  ∈ Rn×p .
 
∇s(X) :=  i=1 i=1
 . 
subject to x1 = x2 = · · · = xn . (5)
>
∇sn (xn )
Since null{I − W } = span{1} and recall the definition of X
product on Rn×p is defined as hX, Xi
The inner P e := in (2), Problem (5) is equivalent to
> e n i > i
tr(X X)= i=1 (x ) x e , where X, X are arbitrary matrices.
e
minimize s(X) + r(X),
X∈Rn×p

II. A LGORITHM D EVELOPMENT subject to (I − W )X = 0. (6)


Consider a strongly connected network G = {V, E} By Proposition 1, we have (I − W )X = 0 ⇒ V > V X = 0 ⇒
with agents V = {1, 2, · · · , n} and undirected edges E = X > V > V X = 0 ⇒ V X = 0. On the other hand, V X = 0 ⇒
{1, 2, · · · , m}. By convention, all edges (i, j) ∈ E obey V > V X = 0 ⇒ (I − W )X = 0. Therefore, we conclude that
i < j. To each edge (i, j) ∈ E, we assign a weight (I − W )X = 0 ⇔ V X = 0. Therefore, Problem (6) is further
wij > 0, which is used by agent i to scale the data xj it equivalent to
receives from agent j. Likewise, let wji = wij for agent
minimize s(X) + r(X),
j. If (i, j) ∈ / E, then wij = wji = 0. For each i, we let n×p
x∈R
Ni := {j|(i, j) ∈ E or (j, i) ∈ E} ∪ {i} denote the set of subject to V X = 0. (7)
neighbors of agent i (including agent i itself). We also let
Ei := {(i, j)|(i, j) ∈ E or (j, i) ∈ E} denote the set of all Now we denote
 > 
edges connected to i. Moreover, we also assume that there y1
exists at least one agent index i such that wii > 0. 
..  ∈ Rm×p

Y := 
Let W = [wij ] ∈ Rn×n denote the weight matrix, which is  . 
>
symmetric and doubly stochastic. Such W can be generated (y m )
4

as the dual variable, Problem (7) can be reformulated into the Algorithm 1: Synchronous algorithm based on (12)
saddle-point problem Input: Starting point {xi,0 }, {y e,0 }. Set counter k = 0;
1 while all agents i ∈ V in parallel do
max min s(X) + r(X) + hY, V Xi, (8)
m×p
Y ∈R n×p
X∈R α Wait until {xj,k }j∈Ni and {y e,k }e∈Ei are received;
Update xi,k+1 according to (12);
where α > 0 is a constant parameter. Notice that a similar
formulation using the incidence and Laplacian matrices was Update {y e,k+1 }e∈Li according to (12);
employed in [14] to derive primal-dual distributed optimiza- Wait until all neighbors finish computing;
tion strategies over networks. Set k ← k + 1;
Send out xi,k+1 and {y e,k+1 }e∈Li to neighbors;

B. Synchronous algorithm
Problem (8) can be solved iteratively by the primal-dual
C. Asynchronous algorithm
algorithm that is adapted from [38] [39]:
( In the asynchronous setting, each agent computes and com-
X k+1 =proxαr [X k−α∇s(X k )−V > (2Y k+1−Y k )], municates independently without any coordinator. Whenever
(9) an arbitrary agent finishes a round of its variables’ updates, we
Y k+1 = Y k + V X k ,
let the iteration index k increase by 1 (see Fig. 1). As discussed
where the proximal operator is defined as in Sec. I, both communication latency and uncoordinated
computation result in delays. Let τ k ∈ Rn+ and δ k ∈ Rm + be
1 k
vectors of delays at iteartion k. We define X k−τ and Y k−δ
k
kX − U k2F .

proxαr (U ) := arg min r(X) +
X∈Rn×p 2α as delayed primal and dual variables occurring at iteration k:
k >
   
Next, in the X-update, we eliminate Y k+1 and use I − x1,k−τ1
2V > V = W to arrive at:  
k−τ k .  ∈ Rn×p ,
 
( X :=  ..
X k+1 = proxαr [W X k −α∇s(X k )−V > Y k ],

k >
   
(10) xn,k−τn
Y k+1 = Y k + V X k ,
k >
   
which computes (Y k+1 , X k+1 ) from (Y k , X k ). Algorithm y 1,k−δ1
 
(10) is essentially equivalent to PG-EXTRA developed in [2]. k
Y k−δ := 

..  ∈ Rm×p ,

Algorithm (10) can run in a decentralized manner. To do so,  . 
  k
> 
we associate each row of the dual variable Y with an edge y m,k−δm
e = (i, j) ∈ E, and for simplicity we let agent i store and
update the variable y e (the choice is arbitrary). We also define where τjk is the j-th element of τk and δek is the e-th element
of δ k . In the asynchronous setting, recursion (10) is calculated
Li := {e = (i, j) ∈ E, ∀ j > i}, (11) with delayed variables, i.e.,
(
as the index set of dual variables that agent i needs to update. Xe k+1 =proxαr [WX k−τ k−α∇s(X k−τ k )−V > Y k−δk ],
k k (13)
Recall V is the scaled incidence matrix defined in (4), while Ye k+1 =Y k−δ + V X k−τ .
W is the weight matrix associated with the network. The
calculations of V X, V > Y and W X require communication, Suppose agent i finishes update i,k+1
k + 1. To guarantee conver-
and other operations are local. For agent i, it updates its local gence, instead of letting x = ei,k+1 and y e,k+1 = yee,k+1
x
variables xi,k+1 and {y e,k+1 }e∈Li according to directly, we propose a relaxation step
( k
 X  xi,k+1 = xi,k + ηi x ei,k+1 − xi,k−τi ,
(14)
X
xi,k+1 = proxαri
 wij xj,k − α∇si (xi,k )− vei y e,k , k
y e,k+1 = y e,k + ηi yee,k+1 − y e,k−δe , ∀e ∈ Li .
j∈Ni e∈Ei
y e,k+1 = y e,k + v xi,k +v xj,k , ∀ e ∈ L ,

where x
k k
ei,k+1 − xi,k−τi and yee,k+1 − y e,k−δe behave as
ei ej i
(12) updating directions, and ηi ∈ (0, 1) behaves as a step size.
where vei and vej are the (e, i)-th and (e, j)-th entries of the To distinguish from the step size α, we call ηi the relaxation
matrix V , respectively. parameter of agent i. Its value depends on how out of date
Algorithm 1 implements recursion (12) in the synchronous agent i receives information from its neighbors. Longer delays
fashion, which requires two synchronization barriers in each require a smaller ηi , which leads to slower convergence. Since
iteration k. The first one holds computing until an agent the remaining agents have not finished their updates yet, it
receives all necessary input; after the agent finishes computing holds that
its variables, the second barrier prevents it from sending out (
xj,k+1 = xj,k ∀j 6= i,
information until all of its neighbors finish their computation (15)
(otherwise, the information intended for iteration k + 1 may y e,k+1 = y e,k ∀e ∈ / Li .
arrive at a neighbor too early).
5

To write (14) and (15) compactly, we let Spi : Rn×p → Rn×p where α > 0 is a penalty constant. By V > V = 12 (I − W ) in
be the primal selection operator associated with agent i. For Proposition 1, Problem (18) is exactly
any matrix A ∈ Rn×p , [Spi (A)]j = Aj if j = i; otherwise 1 >
[Spi (A)]j = 0, where [Spi (A)]j and Aj denote the j-th row of minimize s(X) + X (I − W )X +r(X), (19)
X∈Rn×p 2α
matrix Spi (A) and A, respectively. Similarly, we also let SdLi : | {z
smooth
}
Rm×p → Rm×p be the dual selection operator associated with
which has the smooth-plus-proximable form. Therefore, ap-
agent i. For any matrix B ∈ Rm×p , [SdLi (B)]e = Be if e ∈ Li ;
plying the proximal gradient method with step-size α yields
otherwise [SdLi (B)]e = 0. With the above defined selection
the iteration:
operators, (14) and (15) are equivalent to  
k+1 k k 1 k
x = proxαr X − α[∇s(X ) − (I − W )X ]
(
e k+1 − X k−τ k ,

X k+1 = X k + ηi Spi X α
k (16)
Y k+1 = Y k + ηi SdLi Ye k+1 − Y k−δ . = proxαr W X k − α∇s(X k ) ,

(20)
Recursions (13) and (16) constitute the asynchronous algo- Recursion (20) is ready for decentralized implementation —
rithm. each agent i ∈ V performs
Similar to the synchronous algorithm, the asynchronous X 
recursion (13) and (16) can also be implemented in a decen- xi,k+1 = proxαri wij xj,k − α∇si (xi,k ) . (21)
tralized manner. When agent i is activated at iteration k: j∈Ni
 Since algorithm (21) solves the penalty problem (19) instead
Compute:
of the original problem (1), one must reduce α during the

 X 
j,k−τjk i,k−τik e,k−δek
 X
i,k+1
−α∇s



 x
e =prox αri w ij x i (x )− v ei y iterations or use a small value. Recursion (21) is similar to the

j∈Ni e∈Ei proximal diffusion gradient descent algorithm derived in [12],


 k
yee,k+1=y e,k−δek + vei xi,k−τik + vej xj,k−τj , ∀e ∈ Li ;
 
[13] in which each agent first applies local proximal gradient

descent and then performs weighted averaging with neighbors.
 Relaxed updates: 
According to [11], [37], the diffusion strategy removes the

 

i,k+1 i,k i,k+1 i,k−τik




 x = x + η i x
e x , asymmetry problem that can cause consensus implementations

to become unstable for stochastic optimization problems.

  
y e,k+1 = y e,k + η yee,k+1 − y e,k−δek , ∀e ∈ L ,


i i The asynchronous version of (21) allows delayed variables
(17) and adds relaxation to (21); each agent i performs
 P 
The entries of X, Y not held by agent i remain unchanged j,k−τjk k
 x ei,k+1 =proxαri j∈Niwij x −α∇si (xi,k−τi ) ,
from k to k + 1. Algorithm 2 summarizes the asynchronous  k

 xi,k+1 =xi,k + ηi x ei,k+1 − xi,k−τi .
updates.
(22)
Algorithm 2: Asynchronous algorithm based on (17) Its convergence follows from treating the proximal gradient
i,0 e,0
Input: Starting point {x }, {y }. Set counter k = 0; iteration as a fixed-point iteration and applying results from
while each agent i asynchronously do [33]. Unlike Algorithm 2, the asynchronous algorithm (22)
Compute per (17) using the information it has uses only primal variables, so it is easier to implement. Like
available; all other distributed gradient descent algorithms [6], [11]–
Send out xi,k+1 and {y e,k+1 }e∈Li to neighbors; [13], [40], however, algorithm (21) and its asynchronous
version (22) must use diminishing step sizes to guarantee exact
convergence, which causes slow convergence.
D. Asynchronous proximal decentralized gradient descent
III. C ONVERGENCE A NALYSIS
Sec. II-C has presented a decentralized primal-dual algo-
A. Preliminaries
rithm for problem (1). In [12], [13], [40], proximal gradient
descent algorithms are derived to solve stochastic composite Our convergence analysis is based on the theory of nonex-
optimization problems in a distributed manner by networked pansive operators, as used in e.g., [24].
agents. Using techniques similar to those developed in [12], In this subsection we first present several definitions, lem-
[13], [40] we can likewise derive a proximal variant for the mas and theorems that underlie the convergence analysis.
consensus gradient descent algorithm in [6]; see (21) below. First we introduce a new symmetric matrix
V>
 
Following Sec. II-C, its asynchronous version will also be I
developed; see (22) below. G := n ∈ Rm+n .
V Im
To derive the synchronous algorithm, we penalize the con-
straints of Problem (7) (which is equivalent to Problem (1)) As we will discuss later, G is an important auxiliary matrix that
and obtain an auxiliary problem: helps establish the convergence properties of the synchronous
Algorithm 1 and asynchronous Algorithm 2. Meanwhile, it
1 also determines the range of step size α that enables the algo-
minimize s(X) + r(X) + kV Xk2 , (18)
X∈Rn×p α rithms to converge. Recall that the weight matrix W associated
6

with the network is symmetric and doubly stochastic. Besides, Let ρmin := λmin (G) > 0 be the smallest eigenvalue of G.
there exists at least one agent i such that wii > 0. Under these In the following assumption, we specify the properties of the
conditions, it is shown in [37] that W has an eigenvalue 1 with cost functions si and ri , and the step-size α.
multiplicity 1, and all other eigenvalues are strictly inside the
Assumption 1.
unit circle. With such W , we can show that G is positive
definite. 1) The functions si and ri are closed, proper and convex.
2) The functions si are differentiable and satisfy:
Lemma 1. It holds that G  0.
k∇si (x) − ∇si (e
x)k ≤ Li kx − x
ek, e ∈ Rp ,
∀x, x
Proof: According to the Schur Complement condition on
positive definiteness, we known G  0 ⇐⇒ I  0, I − where Li > 0 is the Lipschitz constant.
V > V  0. Recall Proposition 1 and λmin (W ) > −1, it holds 3) The parameter α in synchronous algorithm (12) and
that I − V > V = 21 (I + W )  0, which proves G  0. asynchronous algorithm (17) satisfies 0 < α < 2ρmin /L,
To analyze the convergence of the proposed Algorithms 1 where L := maxi Li .
and 2, we still need a classic result from the nonexpansive Since r and s are convex, the solution to the saddle point
operator theory. In the following we provide related definitions problem (8) is the same as the solution to the following
and preliminary results. Karush-Kuhn-Tucker (KKT) system:
Definition 1. Let H be a finite dimensional vector space (e.g. 1 >
       
∇s 0 ∂r 0 0 αV X
Rp or Rn×p ) equipped with 0∈ + + , (23)
p a certain inner product h·, ·i and 0 0 0 0 − α1 V 0 Y
its induced norm kxk = hx, xi, ∀x ∈ H. An operator P : | {z } | {z } |{z}
operator A operator B Z
H → H is called nonexpansive if
which can be written as
kPx − Pyk ≤ kx − yk, ∀x, y ∈ H.  
X
0 ∈ AZ + BZ, with Z := ∈ R(n+m)×p .
Definition 2. For any operator P : H → H (not necessarily Y
nonexpansive), we define FixP := {x ∈ H|x = Px} to be
For any positive definite matrix M ∈ R(n+m)×(n+m) , we have
the set of fixed points of operator P.
0 ∈ AZ + BZ ⇔ M Z − AZ ∈ M Z + BZ (24)
Definition 3. When the operator P is nonexpansive and β ∈ −1 −1
(0, 1), the combination operator ⇔ Z −M AZ ∈ Z + M BZ
−1 −1
⇔ Z = (I + M B) (I − M −1 A)Z,
Q := (1 − β)I + βP
Note that (I + M −1 B)−1 is well-defined and single-valued;
is defined as the averaged operator associated with P. it is called resolvent [42]. From now on, we let
With Definition 2 and 3, it is easy to verify FixQ = FixP. 1
M := G  0 (25)
The next classical theorem states the convergence property α
of an averaged operator. and define the operator
Theorem 1 (KM iteration [41], [42]). Let P : H → H be a T := (I + M −1 B)−1 (I − M −1 A). (26)
nonexpansive operator and β ∈ (0, 1). Let Q be the averaged
operator associated with P. Suppose that the set FixQ is From (23), (24) and the definition of operator T in (26), we
nonempty. From any starting point z0 , the iteration conclude that the solutions to the KKT system (23) coincide
with the fixed points of T .
zk+1 = Qzk = (1 − β)zk + βPzk Now we consider the fixed-point iteration
produces a sequence {zk }k≥0 converging to a point z∗ ∈ Z k+1 = T Z k = (I + M −1 B)−1 (I − M −1 A)Z k , (27)
FixQ (and hence also to a point z∗ ∈ FixP).
which reduces to
For the remainder of the paper, we consider a specific
M Z k − AZ k ∈ M Z k+1 + BZ k+1 . (28)
Hilbert space H for the purpose of analysis.
Substituting M defined in (25) into (28) and multiplying α to
Definition 4. Define H to be the Hilbert space R(n+m)×p
both sides, we will achieve
e tr(Z > AZ)
p product hZ, ZiA := (n+m)×(n+m)
endowed with the inner e and
(
the norm kZkA := hZ, ZiA , where A ∈ R is X k+V >Y k−α∇s(X k ) ∈ X k+1+2V >Y k+1 +α∂r(X k+1 ),
a positive definite matrix, and Z ∈ R(n+m)×p . Y k + V X k = Y k+1 + V X k+1 − V X k+1 ,
which, by cancellation, is further equivalent to Recursion (9).
B. Convergence of Algorithm 1 Until now, we have rewritten the primal-dual algorithm (9)
In this subsection we prove the convergence of Algorithm 1. as a fixed-point iteration (27) with operator T defined in
We first re-derive the iteration (9) using operator splitting (26). Besides, we also established that FixT coincide with
techniques and identify it as an averaged operator. Next, we the solutions to the KKT system (23). What still needs to be
apply Theorem 1 and establish its convergence property. proved is that {Z k } generated through the fixed-point iteration
7

(27) will converge to FixT . Below is a classic important From recursion (33) we further have
result of the operator T . For a proof, see [43].  (31)
Z k+1 = Z k − ηik Zbk − Tik Zbk = Z k − ηik Sik Zbk . (34)
Theorem 2. Under Assumption 1 and the above-defined norm
with M = G/α, there exists a nonexpansive operator O and Therefore, the asynchronous update algorithm (17) can be
some β ∈ (0, 1) such that the operator T defined in (26) viewed as a stochastic coordinate-update iteration based on
satisfies T = (1−β)I+βO. Hence, T is an averaged operator. delayed variables [33], [36], with each coordinate update
taking the form of (34).
The convergence of the synchronous update (9) follows
2) Relation between Zbk and Z k : The following assumption
directly from Theorems 1 and 2.
introduces a uniform upper bound for the random delays.
Corollary 1. Under Assumption 1, Algorithm 1 produces Z k
Assumption 2. At any iteration k, the delays τjk , j =
that converges to a solution Z ∗ = [X ∗ ; Y ∗ ] to the KKT
1, 2, . . . , n and δek , e = 1, 2, . . . , m defined in (17) have a
system (23), which is also a saddle point to Problem (8).
uniform upper bound τ > 0.
Therefore, X ∗ is a solution to Problem (7).
As we are finishing this paper, the recent work [44] has
Corollary 1 states that if we run algorithm 1 in the syn-
relaxed the above assumption by associating step sizes to
chronous fashion, the agents’ local variables will converge to
potentially unbounded delays. We still keep Assumption 2 in
a consensual solution to the problem (1).
our main analysis and shall discuss the case of unbounded
C. Convergence of Algorithm 2 delays in Section III-E
Under Assumption 2, the relation between Z bk and Z k can
Algorithm 2 is an asynchronous version that involves ran- be characterized in the following lemma [33]:
dom variables. In this subsection we will establish its almost
sure (a.s.) convergence (i.e., with probability 1). The analysis Lemma 2. Under Assumption 2, it holds that
is inspired by [33] and [36]. X
Zbk = Z k + (Z d − Z d+1 ), (35)
1) Algorithm reformulation: It was shown in Sec. III-B that
d∈J(k)
Algorithm 1 can be rewritten as a fixed-point iteration (27).
Now we let Ti be the operator corresponding to the update where J(k) ⊆ {k − 1, ..., k − τ } is an index set.
performed by agent i, i.e., Ti only updates xi and {y e }e∈Li
The proof of relation (35) can be referred to [33].
(the definition of Li can be referred to (11)) and leaves the
other variables unchanged. We define 3) Convergence analysis: Now we introduce an assumption
about the activation probability of each agent.
Ii = {i} ∪ {n + e| e is the edge index of (i, j) ∈ Li }
Assumption 3. For any k > 0, let ik be the index of the agent
to be the set of all row indices of Z that agent i needs to that is responsible for the k-th completed update. It is assumed
update. The operator Ti : R(m+n)×p → R(m+n)×p is defined that each ik is a random variable. The random variable ik is
as follows. For any Z ∈ R(m+n)×p , Ti Z ∈ R(m+n)×p is a independent of i1 , i2 , · · · , ik−1 as
matrix with
( P (ik = i) =: qi > 0,
(T Z)j , if j ∈ Ii ;
(Ti Z)j := (29)
Zj , if j ∈
/ Ii , where qi ’s are constants.

where Zj , (T Z)j and (Ti Z)j are the j-th row of Z, T Z and This assumption is satisfied under either of the following
Ti Z, respectively (see the notation section I-D). Moreover, we scenarios: (i) every agent i is activated following an indepen-
define dent Poisson process with parameter λi , and any computation
Pn
occurring at agent i is instant, leading to qi = λi /( i=1 λi );
S := I − T , Si := I − Ti . (30) (ii) every agent i runs continuously, and the duration of
Based on the definition of the operator Ti , for any Z ∈ each round of computation follows Pnthe exponential distribution
R(m+n)×p , Si Z ∈ R(m+n)×p is also a matrix with exp(βi ), leading to qi = βi−1 /( i=1 βi−1 ). Scenarios (i) and
( (ii) often appear as assumptions in the existing literature.
(Z − Ti Z)j , if j ∈ Ii ;
(Si Z)j := (31) Definition 5. Let (Ω, F, P ) be the probability space we work
0, if j ∈
/ Ii .
with, where Ω, F and P are the sample space, σ-algebra and
We also define the delayed variable used in iteration k as the probability measure, respectively. Define
" #
k−τ k
bk := X Z k := σ(Z 0 , Z
b0 , Z 1 , Zb1 , · · · , Z k , Z
bk )
Z k ∈ R(m+n)×p . (32)
Y k−δ
to be the σ-algebra generated by Z 0 , Zb0 , · · · , Z k , Z
bk .
Suppose agent ik is activated at iteration k, with definitions
(29) and (32) it can be verified that recursions (13) and (16) Assumption 4. Throughout our analysis, we assume
are equivalent to: P (ik = i|Z k ) = P (ik = i) = qi , ∀i, k.
(
Zek = Ti Zbk ,
k
 (33) Remark 1. Assumption 4 indicates that the index responsible
Z k+1 = Z k − ηik Z bk − Zek . for the k-th completed update, ik , is independent of the delays
8

bk . Although not always practical, it is a


in different rows of Z Lemma 5. Let {Z k }k≥0 be the sequence generated by Algo-
key assumption for our proof to go through. One of the cases rithm 2. Then for any Z ∗ ∈ FixT , we have
for this assumption to hold is when every row of Zbk always
E kZ k+1 − Z ∗ k2M Z k

has the maximum delay τ . In reality, what happens is between
ξ X
this worst case and the no-delay case. Besides, Assumption 4 is ≤ kZ k − Z ∗ k2M + kZ d − Z d+1 k2M
also a common assumption in the recent literature of stochastic n
k−τ ≤d<k
asynchronous algorithms; see [33] and the references therein. 
1 τ κ 1

+ + − kZ k − Z̄ k+1 k2M , (42)
Now we are ready to prove the convergence of Algorithm 2. n ξ nqmin η
For simplicity, we define where E(· | Z k ) denotes conditional expectation on Z k and ξ
Z̄ k+1 := Z k − η S Z
bk , qmin := min{qi } > 0, (36) is an arbitrary positive number.
i
λmax := λmax (M ), λmin := λmin (M ), Proof: We have
E kZ k+1 − Z ∗ k2M | Z k

where η > 0 is some constant, and M  0 is defined in (25). 2 
Note that Z̄ k+1 is the full update as if all agents synchronously
 η bk − Z ∗
= E Z k − Sik Z Zk (by (34), (41))

compute at the k-th iteration. It is used only for analysis. nqik M

Lemma 3. Define Q := I − P. When P : R(m+n)×p →  2η


η2 
Sik Zbk , Z ∗ − Z k M + 2 2 kSik Zbk k2M Z k

R(m+n)×p is nonexpansive under the norm k · kA , we have =E
nqik n qik
hZ − Z, e A ≥ 1 kQZ − Q Zk
e QZ − Q Zi e 2, (37) + kZ k − Z ∗ k2M
A
2 n D n
(a) 2η X
E η2 X 1
= Si Zbk , Z ∗ − Z k + 2 kSi Zbk k2M
where Z, Ze ∈ R(m+n)×p are two arbitrary matrices. n i=1 M n i=1 qi
Proof: By nonexpansiveness of P, we have + kZ k − Z ∗ k2M
n
e 2A ≤ kZ − Zk
kPZ − P Zk e 2A , ∀ Z, Ze ∈ R(m+n)×p . (38) (40) 2ηD E η2 X 1
= kZ k
−Z ∗ k2M + k ∗
S Z , Z −Z
b k
+ 2 kSi Zbk k2M ,
n M n i=1 qi
Notice that
(43)
e 2 = kZ − Ze − (QZ − Q Z)k
kPZ − P Zk e 2
A A where the equality (a) holds because of Assumptions 3 and 4.
e 2 −2hZ−Z,
= kZ−Zk e QZ−Q Zi e A +kQZ−Q Zke 2 . (39) On the other hand, note that
A A
n n n
Substituting (39) into (38), we achieve (37).
X 1 1 X (40) κ X
kSi Zbk k2M ≤ kSi Zbk k2M ≤ kS Zbk k2M
Let κ be its condition number of matrix G. Since M = i=1
qi q min i=1
q min i=1
G/α, κ is also the condition number of M . The following (36) κ k k+1 2
= 2 kZ − Z̄ kM , (44)
lemma establishes the relation between Si Zb and S Z.
b η qmin
Lemma 4. Recall the definition of S, Si and M in (30) and and
D E
(25). It holds that S Zbk , Z ∗ − Z k
n n M
(35)
X X X
Si Zbk = S Z
bk , kSi Zbk k2M ≤ κkS Zbk k2M . (40) k ∗
= hS Zb , Z − Zb + k
(Z d − Z d+1 )iM
i=1 i=1 d∈J(k)

Proof: The first part comes immediately from the defini- (36) 1 X
= hS Zbk , Z ∗ − Zbk iM + hZ k − Z̄ k+1 , Z d − Z d+1 iM
tion of S and Si in (30) and (34). For the second part, η
d∈J(k)
n
X n
X (b)
bk k2M ≤
kSi Z bk k2F = λmax kS Zbk k2F
λmax kSi Z ≤hS Zbk − SZ ∗ , Z ∗ − Zbk iM
i=1 i=1
 
1 X 1 k k+1 2 d d+1 2
λmax + kZ − Z̄ kM + ξkZ − Z kM
≤ kS Zbk k2M = κkS Zbk k2M . 2η
d∈J(k)
ξ
λmin  
(37) 1 1 X 1 k
k 2 k+1 2 d d+1 2
≤− kS Z kM +
b kZ −Z̄ kM +ξkZ −Z kM
The next lemma shows that the conditional expectation of 2 2η ξ
d∈J(k)
the distance between Z k+1 and any Z ∗ ∈ FixT for given Z k (36) 1 |J(k)| k
has an upper bound that depends on Z k and Z ∗ only. From = − 2 kZ k − Z̄ k+1 k2M + kZ − Z̄ k+1 k2M
2η 2ξη
now on, the relaxation parameters will be set as
ξ X
η + kZ d − Z d+1 k2M , (45)
ηi = , (41) 2η
nqi d∈J(k)

where η > 0 is the constant appearing at (36). where the inequality (b) follows from Young’s inequality and
the fact that SZ ∗ = (I − T )Z ∗ = 0 since Z ∗ is a fixed
9

point of T . Substituting (44) and (45) into (43) and using Theorem 3 (Fundamental inequality). Let {Z k }k≥0 be the
J(k) ⊆ {k − 1, P..., k − τ } we get the desired result. sequence generated by Algorithm 2. Then for any Z∗ =
The term nξ k−τ ≤d<k kZ d −Z d+1 k2 in the inequality (42) (Z ∗ , . . . , Z ∗ ), it holds that
appears because of the delay. Next we stack τ + 1 iterates
E kZk+1 − Z∗ k2U Z k

together to form a new vector and introduce a new metric in  √ 
order to absorb these Q terms. k ∗ 2 1 1 2τ κ κ
τ ≤ kZ −Z kU − − √ − kZ̄ k+1−Z k k2M .(48)
Let Hτ +1 = i=0 H be a product space (Re- n η n qmin nqmin
call H denotes R(m+n)×p , see Definition 4). For any p
e0 , . . . , Zeτ ) ∈ Hτ +1 , we let h·, ·i be the
(Z0 , . . . , Zτ ), (Z Proof: Let ξ = n qmin /κ. We have
induced inner product,Pτ i.e., h(Z 0 , . . . , Zτ ), (Z0 , . . . , Zτ )i := E(kZk+1 − Z∗ k2U |Z k )
e e
Pτ > 0
i=0 hZi , Zi iM = i=0 tr(Zi M Zi ). Define a matrix U ∈
e e
(47)
R (τ +1)×(τ +1)
as = E(kZ k+1 −Z ∗k2M |Z k)
Pk d−(k−τ )
U 0 := U1 + U2 , + ξ d=k+1−τ n E(kZ d −Z d+1k2M |Z k)
(34) 2

where = E(kZ k+1 − Z ∗ k2M |Z k ) + ξτ η bk 2 k


n E( n2 qi2 kSik Z kM |Z )
k
  Pk−1
1 0 ··· 0 + ξ d=k+1−τ d−(k−τ n
)
kZ d − Z d+1 k2M
0 0 ··· 0 (44)
U1 :=  . ≤ E(kZ k+1 − Z ∗ k2M |Z k ) + n3ξτqmin κ
kZ k − Z̄ k+1 k2M
 
.. .. .. 
 .. . . . Pk−1
0 0 ··· 0 + ξ d=k+1−τ d−(k−τ n
)
kZ d − Z d+1 k2M
(42)  
  ≤ kZ k−Z ∗ k2M + n1 τξ + n2ξτqmin
κ
+ nqκmin − η1 kZ k−Z̄ k+1 k2M
τ −τ
−τ 2τ − 1 1−τ + nξ k−τ ≤d<k kZ d − Z d+1 k2M
 P
 
1−τ 2τ − 3 2−τ
r
qmin   Pk−1
+ξ d=k+1−τ d−(k−τ )
kZ d − Z d+1 k2M
U2 := ,
 
κ 
 .. .. ..  n
. . .  
  = kZ k−Z ∗ k2M + n1 τξ + n2ξτqmin
κ
+ nqκmin − η1 kZ k−Z̄ k+1 k2M
 −2 3 −1
Pk−1
−1 1 + nξ d=k−τ (d − (k − τ ) + 1)kZ d − Z d+1 k2M
 √ 
(47)
= kZk − Z∗ k2U + n1 n2τ √
κ κ 1
qmin + nqmin − η kZ −Z̄
k k+1 2
kM .
and let U = U 0 ⊗ I where ⊗ represents the Kronecker product
and I is the identity operator. For a given (A0 , · · · , Aτ ) ∈ Hence, the desired inequality (48) holds.
Hτ +1 ,
Remark 2 (Stochastic Fejér monotonicity [45]). From (48),
(B0 , · · · , Bτ ) = U (A0 , · · · , Aτ )
suppose
is given by: nqmin
0<η< √ , (49)
2τ κqmin + κ
r
qmin
B 0 = A0 + τ (A0 − A1 ),
κ then we have
r
qmin
Bi = [(i−τ −1)Ai−1 +(2τ −2i+1)Ai + (i − τ )Ai+1 ], E(kZk+1 − Z∗ k2U |Z k ) ≤ kZk − Z∗ k2U ,
κ
i.e., the sequence {Zk }k≥0 is stochastically Fejér monotone,
r
qmin
Bτ = (Aτ − Aτ −1 ), (46)
κ which means the covariance of Zk is non-increasing as the
iteation k evolves.
where the index i for Bi is from 1 to τ +1.The linear operator
U is a self-adjoint and positive definite since U 0 is symmetric Based on Theorem 3, we have the following corollary:
and positive definite. We define h·, ·iU = h·, U ·i as the U -
weighted inner product and k · kU as the induced norm. We Corollary 2. When η satisfies (49), we have
P∞ k k+1 2
further let 1) k=0 kZ − Z̄ kM < ∞ a.s.;

Zk := (Z k , Z k−1 , . . . , Z k−τ ) ∈ Hτ +1 , k ≥ 0, 2) the sequence {kZk − Z∗ k2U }k≥0 converges to a [0, +∞)-
∗ ∗ ∗ τ +1
Z := (Z , . . . , Z ) ∈ H , valued random variable a.s.

where Z k = Z 0 for k < 0. We have Proof: The condition (49) implies that η1 − n2τ√
κ
qmin −
κ
kZk − Z∗ k2U nqmin > 0. Applying [46, Theorem 1] to 48 directly gives
the two conclusions.
r k−1
(46) qmin X The following lemma establishes the convergence properties
= kZ k−Z ∗ k2M + (d−(k−τ )+1)kZ d−Z d+1 k2M ,(47)
κ of the sequences {Zk }∞ k ∞
k=1 and {Z }k=1 .
d=k−τ

and the following fundamental inequality. Lemma 6. Define S∗ = {(Z ∗ , . . . , Z ∗ )|Z ∗ ∈ FixT }, let
(Z k )k≥0 ⊂ H be the sequence generated by Algorithm 2,
10

η ∈ (0, ηmax ] for certain ηmax satisfying (49). Let Z (Zk ) be we can see that we must relax the update more if maximum
the set of cluster points of {Zk }k≥0 . Then, Z (Zk ) ⊆ S∗ a.s. delay τ becomes larger, or if the matrix M becomes more
ill-conditioned. The relative computation speed of the slowest
Proof: We take several steps to complete the proof of this
agent will also affect the relaxation parameter.
lemma.
k k+1
(i) Firstly, from Corollary √ 2, we have Z − Z̄ → 0 a.s.. D. New step size rules
k k+1 κ k k+1
Since kZ − Z kF ≤ nqmin kZ − Z̄ kM (c.f. (44)), we
For both the synchronous and asynchronous algorithms, we
have Z k −Z k+1 → 0 a.s. Then from (35), we have Z bk −Z k →
require the step size α be less than 2ρmin /L (Assumption 1).
0 a.s. Both ρmin and L involve global properties across the network:
(ii) From Corollary 2, we have that (kZk − Z∗ k2U )k≥0 ρmin is related to the property of the matrix V and L is related
converges a.s., and so does (kZk −Z∗ kU )k≥0 . Hence, we have to all the functions si . Unless they are known a priori, we need
limk→∞ kZk − Z∗ kU = γ a.s., where γ is a [0, +∞)-valued to apply a consensus algorithm to obtain them.
random variable. Hence, (kZk − Z∗ kU )k≥0 must be bounded In this section we introduce a step size αi for each agent
a.s., and so is (Zk )k≥0 . i that depends only on its local properties. The following
(iii) We claim that there exists Ω e ∈ F such that P (Ω)e =1 theorem is a consequence of combining results of [48, Lemma
∗ ∗ k ∗
and, for every ω ∈ Ω e and every Z ∈ S , (kZ (ω)−Z kU )k≥0 10] and the monotone operator theory.
converges.
The proof follows directly from [45, Proposition 2.3 (iii)]. It Theorem 5. Let the iteration Z k+1 = T Z k be defined as
is worth noting that Ω e in the statement works for all Z∗ ∈ S∗ ,  X X 
xi,k+1 =proxαi ri
 wij xj,k−αi ∇si (xi,k)− vei y e,k , ∀i,
namely, Ω does not depend on Z∗ . j∈Ni e∈Ei
(iv) By (i), there exists Ω b ∈ F such that P (Ω) b =
e,k+1 e,k
vei xi,k j,k
 
y =y + + vej x , ∀e = (i, j) ∈ Li ,

k k+1
1 and Z (w) − Z (w) → 0, ∀w ∈ Ω. b For any
ω ∈ Ω, b let (Zkl (ω))l≥0 be a convergent subsequence 1
with αi = Li /γ+1−w for an arbitrary 0 < γ < 2. Then
ii
of (Zk (ω))k≥0 , i.e., Zkl (ω) → Z, where Zkl (ω) = under assumption 1, the operator T is an averaged operator.
(Z kl (ω), Z kl −1 (ω)..., Z kl −τ (ω)) and Z = (u0 , ..., uτ ). Note
that Zkl (ω) → Z implies Z kl −j (ω) → uj , ∀j. Therefore, Because this new operator T is still an averaged operator,
ui = uj , for any i, j ∈ {0, · · · , τ } because Z kl −i (ω) − the convergence results, Corollary 1 and Theorem 4, still hold.
Z kl −j (ω) → 0. Furthermore, observing η > 0, we have
bkl (ω) − T Z
bkl (ω) E. Allowing unbounded delays
liml→∞ Z
1 Using the analytical tools developed in [44], we can fur-
= lim S Zbkl (ω) = lim (Z kl (ω) − Z̄ kl +1 (ω)) = 0. ther prove that our algorithm allows unbounded delays. The
l→∞ l→∞ η
detailed proofs are omitted due to the page limit; interested
From the triangle inequality and the nonexpansiveness of T , readers are referred to [44].
it follows that We make the following assumption.
kZ kl (ω) − T Z kl (ω)kM Assumption 5.
=kZ kl(ω)−Z bkl(ω)+Z bkl(ω)−T Zbkl(ω)+T Zbkl(ω)−T Z kl(ω)kM 1
, ∀i = 1, 2, ..., n.
qi =
≤kZ kl (ω) − Zbkl (ω)kM + kZbkl (ω) − T Z
bkl (ω)kM n
+ kT Zbkl (ω) − T Z kl (ω)kM Assumption 5 is made for simplicity of formulas. Only qi ≥
 > 0 for i is needed.
≤2kZ kl (ω) − Z
bkl (ω)kM + kZbkl (ω) − T Zbkl (ω)kM
Stochastic unbounded delays
≤2 d∈J(kl ) kZ d (ω)−Z d+1 (ω)kM +kZbkl (ω)−T Z bkl (ω)kM .
P
Instead of Assumption 2, we make the following assumptions
on the delay vectors τ k , δ k .
By (III-C3) and (III-C3), we have from the above inequality
that liml→∞ Z kl (ω)−T Z kl (ω) = 0. Now the demiclosedness Assumption 6. The delay vectors {τ k }k≥0 are i.i.d. random
principle [42, Theorem 4.17] implies u0 ∈ Fix T , and this variables; so are the delay vectors {δ k }k≥0 . In addition, they
implies the statement of the lemma. are independent of the agents i1 , . . . , ik , . . .
From Lemma 6 and Opial’s Lemma [47], we have the
Assumption 7 (Evenly old delays). We assume that the delays
convergence theorem of the asynchronous algorithm.
are evenly old, meaning that there exists an integer B > 0,
Theorem 4. Let (Z k )k≥0 ⊂ H be the sequence generated by such that for all k ≥ 0, all i, j = 1, ..., n, all e, f = 1, ..., m,
Algorithm 2, η ∈ (0, ηmax ] for certain ηmax satisfying (49).
If Assumptions 3, 2 and 4 hold, then (Z k )k≥0 converges to a |τik − τjk | ≤ B,
FixT -valued random variable a.s.. |τik − δek | ≤ B,
This theorem guarantees that, if we run the asynchronous |δek − δfk | ≤ B.
algorithm 2 with an arbitrary starting point Z 0 , then the In Assumption 7, only the existence of B is required. It can
sequence {Z k } produced will converge to one of the solutions be as long as it needs to be.
to problem (8) almost surely. From the upper bound of ηmax ,
11

Definition 6. We define distance of 15 units, they are regarded as neighbors and one
k edge is assumed to connect them. Once the network topology
∆ := max{τ1k , . . . , τnk , δ1k , . . . , δm
k
}; (50)
is generated, the associated weighting matrix W is produced
k
Pl : = P (∆ ≥ l), l = 0, 1, . . . according to the Metropolis-Hastings rule [37].
We have the following theorem. In all simulations, we let each agent start a new round of
update, with available neighboring variables which involves
Theorem 6. Let (Z k )k≥0 ⊂ H be the sequence generated delays in general, instantly after the finish of last one. To
by Algorithm 2, Let Assumptions 5, 4, 6 and 7 hold. Then guarantee each agent to be activated i.i.d, the computation
(Z k )k≥0 converges to a FixT -valued random variable a.s. if time of agent i is sampled from an exponential distribution
either of the following holds: exp(1/µi ). For agent i, µi is set as 2 + |µ̄| where µ̄ follows
P∞ 1/2
1) l=1 (lPl ) < ∞; the step sizes for each agent i are the standard normal distribution N (0, 1). The communication
equal andsatisfy −1 times is also simulated. The communication time between
P∞ 1/2
0 < ηi < κ + √1n l=1 Pl (l1/2 + l−1/2 ) ; agents are independently sampled from exp(1/0.6). After
P∞ 1/2
the computation time is generated, the probability qi can be
2) (P
l=1 l ) < ∞, the η ’s are equal and satisfy 0<
i −1 computed accordingly.
∞ 1/2
ηi < κ + √2n l=1 Pl
P
. In all curves in the following figures, the relative error
kX k −X ∗ kF ∗
In both scenarios of Theorem 6, Pl are required to decay kX 0 −X ∗ kF against time is plotted, where X is the exact
solution to (1).
fast when l grows. Bigger step sizes can be taken when the
delay distribution has thinner tails. In particular, bounded de- A. Decentralized compressed sensing
lays and delays following a geometric distribution are special For decentralized compressed sensing, each agent i ∈
cases of both scenarios. {1, 2, · · · , n} holds some measurements: bi = Ai x + ei ∈
Unbounded deterministic delays Rmi , where Ai ∈ Rmi ×p is a sensing matrix, x ∈ Rp is
In the following, we allow the delays to be arbitrary. the common unknown sparse signal, and ei is i.i.d. Gaussian
Assumption 8. The delay vectors τ k , δ k are arbitrary, with Pn The goal is to recover x. The number of measurements
noise.
lim inf ∆k < ∞ where ∆k is defined in (50). i=1 mi may be less than the number of unknowns p, so we
solve the `1 -regularized least squares:
Assumption 8 means that there exists an bound B, as large Pn
as it needs to be, such that for infinite many iterations, the minimize n1 i=1 si (x) + ri (x),
x
delays are no larger than B.
where si (x) = 21 kAi x − bi k22 , ri (x) = θi kxk1 , and θi is the
Definition 7. Let (Z k )k≥0 ⊂ H be the sequence generated regularization parameter with agent i.
by Algorithm 2. For all positive integers T , define QT be a The tested network has 10 nodes and 14 edges. We set
subsequence of (Z k )k≥0 obtained by removing the iterates mi = 3 for i = 1, · · · , 10 and p = 50. The entries of Ai , ei are
with ∆k > T . independently sampled from the standard normal distribution
N (0, 1), and Ai is normalized so that kAi k2 = 1. The signal
We have the following convergence result.
x is generated randomly with 20% nonzero elements. We set
Theorem 7. Let (Z k )k≥0 ⊂ H be the sequence generated by the regularization parameters θi = 0.01.
Algorithm 2. Let Assumptions 5, 4 and 8 hold. Fix γ > 0, and The step sizes of all the four algorithms are tuned by hand
let the step sizes ηi vary by each iteration and satisfy and are nearly optimal. The step size α for both the primal
 −1 synchronous algorithm (21) and the asynchronous algorithm
k 1 1 1 k 2+γ (22) are set to be 0.05. This choice is a compromise between
0 < ηi < κ + √ (1 + ) + (∆ + 1) ) .
n γ 2+γ convergence speed and accuracy. The relaxation parameters
Then for all T ≥ lim inf ∆k , the sequence QT converges to for the asynchronous algorithm (22), are chosen to be ηi =
the same FixT -valued random variable a.s. 0.036/qi . The step size α for both synchronous PG-EXTRA
Algorithm 1 and asynchronous Algorithm 2 are set to be
Theorem 7 requires the step sizes to change according 1. The relaxation parameters for asynchronous Algorithm 2
to the current delay and the convergence is applied to the are chosen to be ηi = 0.0288/qi . From Fig. 2 we can see
subsequence with some iterates excluded as in Definition 6. that asynchronous primal algorithm (22) is faster than its
synchronous version (21), but both algorithms are far slower
IV. N UMERICAL E XPERIMENTS than synchronous PG-EXTRA Algorithm 1 and asynchronous
In this part we simulate the performance of the proposed Algorithm 2. The latter two algorithms exhibit linear conver-
asynchronous algorithm (17). We will compare it with its gence and Algorithm 2 converges significantly faster. Within
synchronous counterpart (12) (synchronous PG-EXTRA) as the same period (roughtly 2760ms), the two asynchronous
well as the proximal gradient descent algorithms (synchronous algorithm finishes 21 times as many rounds of computation
algorithm (21) and asynchronous algorithm (22)). and communication as the synchronous counterparts, due to
In the following experimental settings, we generate the the elimination of waiting time.
network as follows. The location of the n agents are randomly To better illustrate the reason why the asynchronous Al-
generated in a 30 × 30 area. If any two agents are within a gorithm 2 is much faster than the synchronous Algorithm
12

agent 1 2 3 4 5
0
time (ms) 0.497 0.033 0.944 0.551 1.152 10

agent 6 7 8 9 10 −1
time (ms) 0.072 0.112 0.996 0.049 0.025 10

−2
TABLE II: Sampled computation time in the 1st iteration. 10

Relative error
−3
10
edge (1,2) (1,10) (1,8) (2,3) (2,5) (2,8) (2,9)
−4
time(ms) 0.489 1.425 0.024 1.191 2.862 2.140 0.091 10
edge (3,6) (4,6) (4,7) (4,9) (6,7) (7,8) (8,10)
−5
time(ms) 1.429 0.018 2.359 2.233 0.003 1.952 2.412 10

edge (2,1) (10,1) (8,1) (3,2) (5,2) (8,2) (9,2)


−6
time (ms) 2.762 1.165 1.672 1.828 0.569 4.592 0.617 10
Synchronous PG−EXTRA Alg. 1
edge (6,3) (6,4) (7,4) (9,4) (7,6) (8,7) (10,8) Asynchronous Primal−Dual Alg. 2
−7
10
time (ms) 0.385 0.887 1.152 0.744 2.716 0.649 3.031 0 100 200 300 400 500 600 700
Time(ms)
TABLE III: Sampled communication time in the 1st iteration.
Fig. 3: Convergence comparison between synchronous PG-EXTRA
and the asynchronous Algorithm 2 for sparse logistic regression.

1, we list the computation and communication time during B. Decentralized sparse logistic regression
the first iteration in Tables II and III, respectively. In Table In this subsection the tested problem is decentralized sparse
III each edge corresponds to two rounds of communication logistic regression. Each agent i ∈ {1, 2, · · · , n} holds local
time: communications from agent i to j and from j to i. data samples {hij , dij }m
j=1 , where the supscript i indiates the
i

From Table II it is observed that the longest computation time agent index and subscript j indicates the data index. hij ∈ Rp
is 1.152 ms. From Table III it is observed that the longest is a feature vector and dij ∈ {+1, −1} is the corresponding
communication time is 4.592 ms. Therefore, the duration label. mi is the number of local samples kept by agent i.
for the first iteration in the synchronous Algorithm 1 is All agents in the network will cooperatively solve the sparse
1.152 + 4.592 = 5.744 ms. However, in the asynchornous logistic regression problem
Algorithm 2, each agent will start a new iteration right after the Pn
finish of itsPprevious update. The average duration for the first minimizex∈Rp n1 i=1 [si (x) + ri (x)],
10
iteration is i=1 ti /10 = 0.443 ms. Therefore, during the first Pmi
where si (x)= m1i j=1 ln 1+exp(−dij (hij )>x) , ri (x)=θi kxk1 .

iteration the asynchronous Algorithm 2 is 5.744/0.443 ' 13 In the simulation, we set n = 10, p = 50, and mi = 3
times faster than the synchronous Algorithm 1. The data listed for all i. For local data samples {hij , dij }m
j=1 at agent i, each
i

in Tables II and III illustrates the necessity to remove the idle hij is generated from the standard normal distribution N (0, 1).
time appearing in the synchronous algorithm.
To generate dij , we first generate a random vector xo ∈ Rp
Since both synchronous and asynchronous proximal- with 80% entries being zeros. Next, we generate dij from a
gradient-descent-type algorithms are much slower compared to uniform distribution U (0, 1). If dij ≤ 1/[1 + exp(−(hij )> xo )]
the synchronous PG-EXTRA Algorithm 1 and asynchronous then dij is set as +1; otherwise dij is set as −1. We set the
Algorithm 2, in the following two simulations we just show regularization parameters θi = 0.1.
convergence performance of PG-EXTRA Algorithm 1 and The step sizes of both algorithms are tuned by hand and
asynchronous Algorithm 2. are nearly optimal. The step sizes α for both synchronous
PG-EXTRA Algorithm 1 and asynchronous Algorithm 2 are
set to be 0.4. The relaxation parameters for asynchronous
0
10
Algorithm 2 are chosen to be ηi = 0.0224 qi . From Fig. 3
we can see that both algorithms exhibit convergence almost
−1
10 linearly and that Algorithm 2 converges significantly faster.
−2
In our experiments, within the same period, the asynchronous
10
algorithm finishes 21 times as many rounds of computation
Relative error

−3
10
and communication as that performed by the synchronous
algorithm, due to the elimination of waiting time.
−4
10

−5
10 C. Decentralized low-rank matrix completion
Synchronous Primal (23)
−6 Asynchronous Primal (24) Consider a low-rank matrix A = [A1 , · · · , An ] ∈ RN ×K of
10
Synchronous PG−EXTRA Alg. 1 rank r  min{N, K}. In a network, Pn each agent i observes
Asynchronous Primal−Dual Alg. 2
−7
10 some entries of Ai ∈ RN ×Ki , i=1 Ki = K. The set of
0 500 1000 1500 2000
Time(ms) observations is Ω = ∪ni=1 Ωi . To recover the unknown entries
of A, we introduce a public matrix X ∈ RN ×r , which is
Fig. 2: Convergence comparison between synchronous algorithm known to all agents, and a private matrix Y = [Y 1 , · · · , Y n ] ∈
(21), asynchronous algorithm (22), synchronous PG-EXTRA Algo- Rr×K , where each Y i ∈ RN ×Ki corresponds to Ai and is
rithm 1 and the asynchronous Algorithm 2 for compressed sensing.
13

held by agent i. The supscript i indicates the agent index. We 0


10
reconstruct A = XY , which is at most rank r, by recovering
−1
X and Y in a decentralized fashion. The problem is formulated 10

as follows (see [49] for reference). −2


10
1
Pn i i 2
minimize i=1 kXY − Z kF ,

Relative error
n n
X,{Yi }i=1 ,{Zi }i=12 −3
10

subject to (Z i )ab = (Ai )ab , ∀(a, b) ∈ Ωi , (51) −4


10

where Z i ∈ RN ×Ki is an auxiliary matrix, and (Z i )ab is the −5


10
(a, b)-th element of Ai .
−6
1) Synchronous algorithm: [49] proposes a decentralized 10
Synchronous PG−EXTRA Alg. 1
algorithm to solve Problem (51). Let each agent hold X i as a −7
Asynchronous Primal−Dual Alg. 2
10
local copy of the public matrix X, the algorithm is: 0 200 400 600 800 1000
Time(ms)
i,0 i,0
Step 1: Initialization. Agent i initializes X and Y as
random matrices, Z i,0 is also initialized as a random matrix Fig. 4: Convergence comparison between synchronous PG-EXTRA
and the asynchronous Algorithm 2 for matrix completion.
with (Z i,0 )ab = (Ai )ab for any (a, b) ∈ Ωi .
Step 2: Update of X i . Each agent i updates X i,k+1 by solving entry of Ω is less than 0.8, it will be set as 1, which indicates
the following average consensus problem: this entry is known; otherwise it will be set as 0.
1
Pn i i,k
(Y i,k )> k2F , We run the synchronous PG-EXTRA Algorithm 1 and the
minimize 2 i=1 kX − Z
i n
{X }i=1 proposed asynchronous primal-dual Algorithm 2 and plot the
k
subject to X 1 = X 2 = · · · = X n . (52) relative error kZ −AkF
kZ 0 −AkF against time, as depicted in Fig. 4. The
step sizes of both algorithms are chosen to be α = 0.1. The
Step 3: Update of Y i . Each agent i updates relaxation parameters for asynchronous Algorithm 2 are cho-
Y i,k+1 = (X i,k+1 )> X i,k+1
 −1 i,k+1 > i,k
(X ) Z . sen to be ηi = 0.0102
qi . The performance of the algorithms are
similar to the previous experiments. In this 20-node network,
Step 4: Update of Zi . Each agent i updates within the same period, the asynchronous algorithm finishes
29 times as many rounds of computation and communication
Z i,k+1 =X i,k+1 Y i,k+1 + PΩi Ai − X i,k+1 Y i,k+1 ,

as those finished by synchronous Algorithm 1.
where PΩ (·) is defined as follows: for any matrix A, if (a, b) ∈
Ω, then [PΩ (A)]ab = Aab ; otherwise [PΩ (A)]ab = 0. D. Decentralized geometric median
In problem (52) appearing at Step 2, we can let ri (X i ) = In the literature there are limited asynchronous algorithms
1 i
2 kX − Z
i,k
(Y i,k )> k2F and si (Xi ) = 0 and hence problem proposed to solve problems with composite cost function as
(52) falls into the general form of problem (1), for which we in Problem (1). When there is no differentiable term, i.e.,
can apply the synchronous PG-EXTRA Algorithm 1 to solve si (x) = 0, the most closely related algorithm to the proposed
it. We introduce matrices {Qe , ∀e ∈ E} as dual variables. asynchronous primal-dual Alg. 2 is the asynchronous ADMM
Instead of solving (52) exactly, we just run (12) once for each [33]. In this subsection, we compare these two algorithms
iteration k. Therefore, Step 2 becomes when solving the geometric-median problem:
Step 20 : Update of X i . Each agent i updates X i,k+1 by
n
1 X 1X
min kx − bi k2 ,
X 
X i,k+1 = wij X j,k − vei Qe,k +αZ i,k (Y i,k )> , x∈Rp n i=1
α+1
j∈Ni e∈Ei

Qe,k+1
= Qe,k + vei X i,k +vej X j,k , ∀e = (i, j) ∈ Li .
 where {bi }ni=1 are given constants. Computing decentralized
geometric medians have various interesting applications, see
2) Asynchronous algorithm: For the asynchronous algo- [2] for more detail. In this simulation, we set n = 11,
rithm, agent i, when activated, updates Xi and {Qe , ∀e = p = 4, and each bi is generated from the Gaussian distribution
(i, j) ∈ Li } per Step 2’ using the neighboring information it N (0, Λ), where Λ is a diagonal matrix with each entry
has available, then performs Step 3 and Step 4 to update its follows uniform distribution U (0, 10). The parameters of both
private matrices Yi and Zi , and sends out the updated Xi and algorithms are hand-optimized. The augmented coefficient in
Qe to neighbors. the asynchronous ADMM is 0.3, and the step-size in Alg. 2 is
We test our algorithms on a network with 20 nodes and 41 1. The relaxation parameter for both algorithms is set as 0.4. It
edges. In the simulation, we generate a rank 4 matrix A ∈ is observed in Fig. 5 that Alg. 2 is faster than the asynchronous
R40×140 , hence each Ai ∈ R40×7 . To generate A, we first ADMM.
produce E ∈ R40×4 ∼ N (0, 1) and F ∈ R140×4 ∼ N (0, 1).
We also produce a diagonal matrix D ∈ R4×4 ∼ N (0, 1). V. C ONCLUSIONS
With E, F and D, we let A = EDF > . A total of 80% entries This paper developed an asynchronous, decentralized al-
of A are known. To generate Ω, we first sample its entries gorithm for concensus optimization. The agents in this al-
independently from the uniform distribution U (0, 1). If any gorithm can compute and communicate in an uncoordinated
14

2
10 [10] J. Chen and A. H. Sayed, “On the learning behavior of adaptive
Asynchronous Primal−Dual Alg.2 networks—part II: Performance analysis,” IEEE Transactions on In-
Asynchronous ADMM [33] formation Theory, vol. 61, no. 6, pp. 3518–3548, 2015.
0
10
[11] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102,
−2
no. 4, pp. 460–497, April 2014.
10
[12] S. Vlaski and A. H. Sayed, “Proximal diffusion for stochastic costs with
non-differentiable regularizers,” in Proc. International Conference on
Relative error

−4
10 Acoustic, Speech and Signal Processing (ICASSP), Brisbane, Australia,
April 2015, pp. 3352–3356.
−6
10 [13] S. Vlaski, L. Vandenberghe, and A. H Sayed, “Diffusion stochastic
optimization with non-smooth regularizers,” in 2016 IEEE International
−8
10 Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2016, pp. 4149–4153.
−10
10
[14] Z. J. Towfic and A. H. Sayed, “Stability and performance limits of
adaptive primal-dual networks,” IEEE Transactions on Signal Process-
−12 ing, vol. 63, no. 11, pp. 2888–2903, 2015.
10
0 100 200 300 400 500 600 [15] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip
Time(ms) algorithms,” IEEE/ACM Transactions on Networking (TON), vol. 14,
no. SI, pp. 2508–2530, 2006.
Fig. 5: Convergence comparison between asynchronous ADMM and [16] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione,
the proposed Alg. 2 for decentralized geometric median problem. “Gossip algorithms for distributed signal processing,” Proceedings of
the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[17] J. M. Kar, S.and Moura, “Sensor networks with random links: Topol-
fashion; local variables are updated with possibly out-of- ogy design for distributed consensus,” IEEE Transactions on Signal
date information. Mathematically, the developed algorithm Processing, vol. 56, no. 7, pp. 3315–3326, 2008.
extends the existing algorithm PG-EXTRA by adding an [18] F. Fagnani and S. Zampieri, “Randomized consensus algorithms over
large scale networks,” IEEE Journal on Selected Areas in Communica-
explicit dual variable for each edge and to take asynchronous tions, vol. 26, no. 4, pp. 634–649, 2008.
steps. The convergence of our algorithm is established under [19] A. Nedic, “Asynchronous broadcast-based convex optimization over a
certain statistical assumptions. Although not all assumptions network,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp.
1337–1351, 2011.
are satisfied in practice, the algorithm is practical and efficient. [20] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, “Asynchronous
In particular, step size parameters are fixed and depend only distributed optimization using a randomized alternating direction method
on local information. of multipliers,” in Proc. IEEE Conference on Decision and Control
(CDC), Florence, Italy, 2013, pp. 3671–3676.
Numerical simulations were performed on both convex [21] P. D. Lorenzo, S. Barbarossa, and A. H. Sayed, “Decentralized resource
and nonconvex problems, and synchronous and asynchronous assignment in cognitive networks based on swarming mechanisms over
algorithms were compared. In addition, we introduced an random graphs,” IEEE Transactions on Signal Processing, vol. 60, no.
7, pp. 3755–3769, 2012.
asynchronous algorithm without dual variables by extending [22] E. Wei and A. Ozdaglar, “On the o(1/k) convergence of asynchronous
an existing algorithm and included it in the simulation. All distributed alternating direction method of multipliers,” in Proc. IEEE
simulation results clearly show the advantages of the devel- Global Conference on Signal and Information Processing (GlobalSIP),
Austin, USA, 2013, pp. 551–554.
oped asynchronous primal-dual algorithm. [23] M. Hong and T. Chang, “Stochastic proximal gradient consensus over
random networks,” arXiv preprint arXiv:1511.08905, 2015.
[24] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-
R EFERENCES dual algorithm and application to distributed asynchronous optimiza-
tion,” IEEE Transactions on Automatic Control, vol. 61, no. 10, pp.
[1] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, “Decentralized 2947–2957, 2016.
consensus optimization with asynchrony and delays,” in 2016 Asilomar [25] G. Zhang and R. Heusdens, “Bi-alternating direction method of
Conference on Signals, Systems, and Computers. IEEE, 2016. multipliers over graphs,” in IEEE International Conference on Acoustics,
[2] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm Speech and Signal Processing (ICASSP), 2015, pp. 3571–3575.
for decentralized composite optimization,” IEEE Transactions on Signal [26] A. Nedic and A. Olshevsky, “Distributed optimization over time-varying
Processing, vol. 63, no. 22, pp. 6013–6023, 2015. directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no.
[3] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear 3, pp. 601–615, 2015.
convergence of the ADMM in decentralized consensus optimization,” [27] X. Zhao and A. H. Sayed, “Asynchronous adaptation and learning over
IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, networks—Part I: Modeling and stability analysis,” IEEE Transactions
2014. on Signal Processing, vol. 63, no. 4, pp. 811–826, 2015.
[4] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, “Consensus in Ad hoc [28] X. Zhao and A. H. Sayed, “Asynchronous adaptation and learning over
WSNs with noisy links – Part I: Distributed estimation of deterministic networks—Part II: Performance analysis,” IEEE Transactions on Signal
signals,” IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. Processing,, vol. 63, no. 4, pp. 827–842, 2015.
350–364, 2008. [29] X. Zhao and A. H. Sayed, “Asynchronous adaptation and learning over
[5] T.-H. Chang, M. Hong, and X. Wang, “Multi-agent distributed opti- networks—Part III: Comparison analysis,” IEEE Transactions on Signal
mization via inexact consensus admm,” IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 843–858, 2015.
Processing, vol. 63, no. 2, pp. 482–497, 2015. [30] K. I. Tsianos and M. G. Rabbat, “Distributed consensus and optimiza-
[6] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi- tion under communication delays,” in Proc. Allerton Conference on
agent optimization,” IEEE Transactions on Automatic Control, vol. 54, Communication, Control, and Computing (Allerton), Monticello, USA,
no. 1, pp. 48–61, 2009. 2011, pp. 974–982.
[7] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized [31] K. I. Tsianos and M. G. Rabbat, “Distributed dual averaging for convex
gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. optimization under communication delays,” in Proc. IEEE American
1835–1854, 2016. Control Conference (ACC), Montréal, Canada, 2012, pp. 1067–1072.
[8] J. Chen and A. H. Sayed, “Distributed Pareto optimization via diffusion [32] L. Liu, S.and Xie and H. Zhang, “Distributed consensus for multi-agent
strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. systems with delays and noises in transmission channels,” Automatica,
7, no. 2, pp. 205–220, 2013. vol. 47, no. 5, pp. 920–934, 2011.
[9] J. Chen and A. H. Sayed, “On the learning behavior of adaptive [33] Z. Peng, Y. Xu, M. Yan, and W. Yin, “ARock: an algorithmic
networks—Part I: Transient analysis,” IEEE Transactions on Information framework for asynchronous parallel coordinate updates,” SIAM Journal
Theory, vol. 61, no. 6, pp. 3487–3517, 2015. on Scientific Computing, vol. 38, no. 5, pp. A2851–A2879, 2016.
15

[34] M. Eisen, A. Mokhtari, and A. Ribeiro, “Decentralized quasi-Newton


methods,” arXiv preprint arXiv:1605.00933, 2016.
[35] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous
parallel algorithms for nonconvex big-data optimization: Model and
convergence,” arXiv preprint arXiv:1607.04818, 2016.
[36] Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin, “Coordinate friendly
structures, algorithms and applications,” Annals of Mathematical Sci-
ences and Applications, vol. 1, no. 1, pp. 57–119, 2016.
[37] A. H. Sayed, “Adaptation, learning, and optimization over networks,”
Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–
801, 2014.
[38] L. Condat, “A primal–dual splitting method for convex optimization
involving Lipschitzian, proximable and linear composite terms,” Journal
of Optimization Theory and Applications, vol. 158, no. 2, pp. 460–479,
2013.
[39] B. C. Vũ, “A splitting algorithm for dual monotone inclusions involving
cocoercive operators,” Advances in Computational Mathematics, vol. 38,
no. 3, pp. 667–681, 2013.
[40] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient
method,” in Proc. IEEE Annual Allerton Conference on Communication,
Control, and Computing (Allerton), Monticello, USA, 2012, pp. 601–
608.
[41] M. Krasnosel’skii, “Two remarks on the method of successive approxi-
mations,” Uspekhi Matematicheskikh Nauk, vol. 10, no. 1, pp. 123–127,
1955.
[42] H. H. Bauschke and P. L. Combettes, Convex analysis and monotone
operator theory in Hilbert spaces, Springer Science & Business Media,
2011.
[43] D. Davis, “Convergence rate analysis of primal-dual splitting schemes,”
SIAM Journal on Optimization, vol. 25, no. 3, pp. 1912–1943, 2015.
[44] R. Hannah and W. Yin, “On unbounded delays in asynchronous parallel
fixed-point algorithms,” UCLA CAM 16-64, 2016.
[45] P. L. Combettes and J.-C. Pesquet, “Stochastic quasi-fejér block-
coordinate fixed point iterations with random sweeping,” SIAM Journal
on Optimization, vol. 25, no. 2, pp. 1221–1248, 2015.
[46] H. Robbins and D. Siegmund, “A convergence theorem for non negative
almost supermartingales and some applications,” in Herbert Robbins
Selected Papers, pp. 111–135. Springer, 1985.
[47] Zdzisław Opial, “Weak convergence of the sequence of successive
approximations for nonexpansive mappings,” Bulletin of the American
Mathematical Society, vol. 73, no. 4, pp. 591–597, 1967.
[48] D. A. Lorenz and T. Pock, “An inertial forward-backward algorithm for
monotone inclusions,” Journal of Mathematical Imaging and Vision,
vol. 51, no. 2, pp. 311–325, 2015.
[49] Q. Ling, Y. Xu, W. Yin, and Z. Wen, “Decentralized low-rank matrix
completion,” in Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 2925–
2928.

You might also like