
Chapter 14 Parallel and Distributed Algorithms

Peng Li, Lanzhou University


May 27, 2023

We introduce two generic optimization problems, consensus and sharing [1].

1 Global Variable Consensus Optimization


We first consider the case of a single global variable, with the objective and constraint terms split into $N$ parts:
\[
\min_{x\in\mathbb{R}^n} f(x) = \sum_{j=1}^{N} f_j(x). \tag{1.1}
\]

where $f_j : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ are convex. The goal is to solve the problem in such a way that each term can be handled by its own processing element.

For example, $x$ may represent the parameters in a model and $f_j$ the loss function associated with the $j$th block of data or measurements. We then say that $x$ is found by collaborative filtering, since the data sources "collaborate" to develop a global model. (Compare the generic two-block form $\min_x f(x) = g(x) + h(x)$ to which ADMM is usually applied.)

The problem can be rewritten with local variables $x_j \in \mathbb{R}^n$ and a common global variable $z \in \mathbb{R}^n$:
\[
\begin{aligned}
\min_{\tilde{x},\,z}\quad & \sum_j f_j(x_j) \\
\text{s.t.}\quad & x_j - z = 0, \quad j = 1,\dots,N,
\end{aligned} \tag{1.2}
\]
where $\tilde{x} = (x_1, x_2, \cdots, x_N)^\top \in \mathbb{R}^{nN}$. This is called the global consensus problem, since the constraint is that all the local variables should agree, i.e., be equal:


\[
\begin{cases}
x_1 - z = 0 \\
x_2 - z = 0 \\
\quad\vdots \\
x_N - z = 0.
\end{cases} \tag{1.3}
\]
ADMM for (1.2) can be derived either directly from the augmented Lagrangian
\[
L_\rho(x_1,\cdots,x_N;z) = \sum_j f_j(x_j) + \langle y_j, x_j - z\rangle + \frac{\rho}{2}\left\|x_j - z\right\|_2^2,
\]

or simply as a special case of constrained optimization:
\[
\min_{x\in C} f(x) = \sum_j f_j(x_j), \tag{1.4}
\]

where the constraint set is
\[
C = \left\{(x_1,\cdots,x_N) \in \mathbb{R}^{nN} : x_1 = x_2 = \cdots = x_N\right\}. \tag{1.5}
\]
The resulting ADMM algorithm is as follows.
\[
\begin{cases}
x_i^{k+1} = \arg\min_{x_i}\; f_i(x_i) + \frac{\rho}{2}\left\|x_i - z^k + y_i^k/\rho\right\|_2^2 = \operatorname{Prox}_{f_i/\rho}\left(z^k - y_i^k/\rho\right), & i = 1,\dots,N \\[2pt]
z^{k+1} = \arg\min_z\; \sum_i \frac{\rho}{2}\left\|x_i^{k+1} - z + y_i^k/\rho\right\|_2^2 = \frac{1}{N}\sum_{j=1}^N\left(x_j^{k+1} + y_j^k/\rho\right) \\[2pt]
y_i^{k+1} = y_i^k + \rho\left(x_i^{k+1} - z^{k+1}\right), & i = 1,\dots,N.
\end{cases} \tag{1.6}
\]
The processing element that handles the global variable $z$ is sometimes called the central collector or fusion center. Note that the $z$-update is simply the projection of $x^{k+1} + y^k/\rho$ onto the constraint set $C$ of "block constant" vectors, i.e., $z^{k+1}$ is the average of the $N$ blocks of $x^{k+1} + y^k/\rho$:
\[
z^{k+1} = \frac{1}{N}\sum_j x_j^{k+1} + \frac{1}{N}\sum_j y_j^k/\rho = \bar{x}^{k+1} + \bar{y}^k/\rho. \tag{1.7}
\]

Averaging the $y$-update gives
\[
\bar{y}^{k+1} = \bar{y}^k + \rho\left(\bar{x}^{k+1} - z^{k+1}\right). \tag{1.8}
\]
Moreover, substituting (1.7) into (1.8) shows that
\[
\bar{y}^{k+1} = \bar{y}^k + \rho\left(-\bar{y}^k/\rho\right) = 0, \tag{1.9}
\]
i.e., the dual variables have average value zero after the first iteration. Using $z^k = \bar{x}^k$, ADMM (1.6) can be written as
\[
\begin{cases}
x_i^{k+1} = \arg\min_{x_i}\; f_i(x_i) + \frac{\rho}{2}\left\|x_i - \bar{x}^k + y_i^k/\rho\right\|_2^2 = \operatorname{Prox}_{f_i/\rho}\left(\bar{x}^k - y_i^k/\rho\right) \\[2pt]
y_i^{k+1} = y_i^k + \rho\left(x_i^{k+1} - \bar{x}^{k+1}\right), \quad i = 1,\dots,N \\[2pt]
\bar{x}^{k+1} = \frac{1}{N}\sum_{j=1}^N x_j^{k+1},
\end{cases} \tag{1.10}
\]

which is consensus ADMM. Here each pair $(x_i^{k+1}, y_i^{k+1})$ can be updated in parallel for $i = 1,\dots,N$. This is similar to the Jacobi-style ADMM.
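As a concrete illustration, here is a minimal NumPy sketch of the consensus ADMM iteration (1.10). It assumes quadratic local terms $f_i(x) = \frac{1}{2}\|A_i x - b_i\|_2^2$ (an illustrative choice, not fixed by the text above), for which the proximal step reduces to a linear solve.

```python
import numpy as np

def consensus_admm(A_blocks, b_blocks, rho=1.0, iters=100):
    """Sketch of consensus ADMM (1.10) for f_i(x) = 0.5*||A_i x - b_i||^2.

    Here Prox_{f_i/rho}(v) solves (A_i^T A_i + rho*I) x = A_i^T b_i + rho*v.
    """
    N = len(A_blocks)
    n = A_blocks[0].shape[1]
    x = np.zeros((N, n))   # local variables x_i
    y = np.zeros((N, n))   # dual variables y_i
    xbar = np.zeros(n)     # average playing the role of z
    for _ in range(iters):
        for i in range(N):                       # parallelizable over i
            v = xbar - y[i] / rho                # prox argument xbar^k - y_i^k/rho
            Ai, bi = A_blocks[i], b_blocks[i]
            x[i] = np.linalg.solve(Ai.T @ Ai + rho * np.eye(n),
                                   Ai.T @ bi + rho * v)
        xbar = x.mean(axis=0)                    # averaging step
        y += rho * (x - xbar)                    # dual ascent step
    return xbar
```

The inner loop over $i$ is written sequentially for clarity, but each pass touches only agent $i$'s data, so it can be dispatched to $N$ processing elements.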
For consensus ADMM, the primal and dual residuals are
\[
\begin{cases}
\gamma^k = \left(x_1^k - \bar{x}^k, \cdots, x_N^k - \bar{x}^k\right)^\top \\[2pt]
s^k = -\rho\left(\bar{x}^{k-1} - \bar{x}^k, \cdots, \bar{x}^{k-1} - \bar{x}^k\right)^\top,
\end{cases} \tag{1.11}
\]
so that the squared norms are
\[
\begin{cases}
\left\|\gamma^k\right\|_2^2 = \sum_{j=1}^N\left\|x_j^k - \bar{x}^k\right\|_2^2 \\[2pt]
\left\|s^k\right\|_2^2 = N\rho^2\left\|\bar{x}^k - \bar{x}^{k-1}\right\|_2^2.
\end{cases} \tag{1.12}
\]
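In practice these norms give a simple stopping test for the iteration (1.10); below is a small sketch (the tolerance values are arbitrary placeholders).

```python
import numpy as np

def converged(x, xbar, xbar_prev, rho, eps_pri=1e-4, eps_dual=1e-4):
    """Stopping test built from the squared residual norms (1.12).

    x    : (N, n) array of local iterates x_i^k
    xbar : current average; xbar_prev: previous average
    """
    N = x.shape[0]
    r_sq = np.sum((x - xbar) ** 2)                         # ||gamma^k||_2^2
    s_sq = N * rho ** 2 * np.sum((xbar - xbar_prev) ** 2)  # ||s^k||_2^2
    return np.sqrt(r_sq) <= eps_pri and np.sqrt(s_sq) <= eps_dual
```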

2 Global Variable Consensus with Regularization

A simple variant is as follows:

\[
\min_{x,z}\; \sum_j f_j(x_j) + g(z) \quad \text{s.t.} \quad x_j - z = 0, \quad j = 1,\dots,N, \tag{2.1}
\]
where $g$ represents a simple constraint or regularization handled by the central collector; equivalently, $\min_x \sum_j f_j(x) + g(x)$.
The resulting ADMM algorithm is as follows:
\[
\begin{cases}
x_j^{k+1} = \arg\min_{x_j}\; f_j(x_j) + \frac{\rho}{2}\left\|x_j - z^k + y_j^k/\rho\right\|_2^2 = \operatorname{Prox}_{f_j/\rho}\left(z^k - y_j^k/\rho\right) \\[2pt]
z^{k+1} = \arg\min_z\; g(z) + \sum_j \frac{\rho}{2}\left\|x_j^{k+1} - z + y_j^k/\rho\right\|_2^2 = \operatorname{Prox}_{g/(\rho N)}\left(\bar{x}^{k+1} + \bar{y}^k/\rho\right) \\[2pt]
y_j^{k+1} = y_j^k + \rho\left(x_j^{k+1} - z^{k+1}\right).
\end{cases} \tag{2.2}
\]
Here
\[
z^{k+1} = \arg\min_z\; g(z) + \frac{\rho N}{2}\cdot\frac{1}{N}\sum_j\left\|z - x_j^{k+1} - y_j^k/\rho\right\|_2^2. \tag{2.3}
\]
The optimality condition is
\[
0 \in \partial g\left(z^{k+1}\right) + \frac{\rho N}{N}\sum_j\left(z^{k+1} - x_j^{k+1} - y_j^k/\rho\right) \tag{2.4}
\]
\[
= \partial g\left(z^{k+1}\right) + \rho N\left(z^{k+1} - \bar{x}^{k+1} - \bar{y}^k/\rho\right), \tag{2.5}
\]
so that
\[
z^{k+1} = \arg\min_z\; g(z) + \frac{\rho N}{2}\left\|z - \left(\bar{x}^{k+1} + \bar{y}^k/\rho\right)\right\|_2^2. \tag{2.6}
\]
Please note that averaging the $y$-update gives
\[
\bar{y}^{k+1} = \bar{y}^k + \rho\left(\bar{x}^{k+1} - z^{k+1}\right).
\]
In this case we do not in general have $\bar{y}^k = 0$, so we cannot drop the $z$-update.
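For a concrete instance, take $g(z) = \lambda\|z\|_1$ (an assumption made here purely for illustration); then the $z$-update (2.6) is an elementwise soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def z_update(xbar, ybar, rho, N, lam):
    """z^{k+1} = Prox_{g/(rho*N)}(xbar^{k+1} + ybar^k/rho) for g = lam*||.||_1."""
    return soft_threshold(xbar + ybar / rho, lam / (rho * N))
```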
Remark. For the problem
\[
\begin{cases}
\min\; \sum_j f_0(x_j) + r_j(x_j) \\
\text{s.t.} \quad x_i = x_j,
\end{cases} \tag{2.7}
\]
Zeng and Yin solved it by a parallel proximal gradient descent method [3].

3 Sharing

We consider the following sharing problem:
\[
\min_x\; \sum_j f_j(x_j) + g\left(\sum_j x_j\right), \tag{3.1}
\]

where $f_j$ is the local cost function for subsystem $j$, and $g$ is the shared objective. The sharing problem is important because (a) many useful problems can be put into this form, and (b) it enjoys a dual relationship with the consensus problem. Sharing can be written in ADMM form by copying the variable:
\[
\begin{cases}
\min_{x,z}\; \sum_j f_j(x_j) + g\left(\sum_j z_j\right) \\
\text{s.t.} \quad x_j - z_j = 0, \quad j = 1,\dots,N,
\end{cases} \tag{3.2}
\]
with $x = (x_1,\dots,x_N)^\top \in \mathbb{R}^{nN}$ and $z = (z_1,\dots,z_N)^\top \in \mathbb{R}^{nN}$. The ADMM iteration is
\[
\begin{cases}
x_j^{k+1} = \arg\min_{x_j}\; f_j(x_j) + \frac{\rho}{2}\left\|x_j - z_j^k + y_j^k/\rho\right\|_2^2 = \operatorname{Prox}_{f_j/\rho}\left(z_j^k - y_j^k/\rho\right) \\[2pt]
z^{k+1} = \arg\min_z\; g\left(\sum_j z_j\right) + \frac{\rho}{2}\sum_j\left\|x_j^{k+1} - z_j + y_j^k/\rho\right\|_2^2 \\[2pt]
y_j^{k+1} = y_j^k + \rho\left(x_j^{k+1} - z_j^{k+1}\right).
\end{cases} \tag{3.3}
\]

The $x_j$- and $y_j$-steps can be carried out independently in parallel for each $j = 1,\dots,N$. Next, we show how to carry out the $z$-subproblem. Let $v_j^k = x_j^{k+1} + y_j^k/\rho$; the $z$-subproblem can then be written as
\[
\begin{cases}
\min_{z,\bar{z}}\; g(N\bar{z}) + \frac{\rho}{2}\sum_j\left\|z_j - v_j^k\right\|_2^2 \\
\text{s.t.} \quad \bar{z} = \frac{1}{N}\sum_j z_j.
\end{cases} \tag{3.4}
\]
Minimizing over $z_1,\dots,z_N$ with $\bar{z}$ fixed amounts to projecting $v^k = (v_1^k,\dots,v_N^k)$ onto the affine set $\{z : \frac{1}{N}\sum_j z_j = \bar{z}\}$, which gives
\[
z_j = v_j^k + \left(\bar{z}^{k+1} - \bar{v}^k\right), \qquad \bar{v}^k = \frac{1}{N}\sum_j v_j^k. \tag{3.5}
\]
Then the $\bar{z}$-update and $y_j$-update can be computed by
\[
\begin{aligned}
\bar{z}^{k+1} &= \arg\min_{\bar{z}}\; g(N\bar{z}) + \frac{\rho}{2}\sum_j\left\|\bar{z} - \bar{v}^k\right\|_2^2 \\
&= \arg\min_{\bar{z}}\; g(N\bar{z}) + \frac{\rho N}{2}\left\|\bar{z} - \bar{v}^k\right\|_2^2 \\
&= \frac{1}{N}\arg\min_w\; g(w) + \frac{\rho}{2N}\left\|w - N\bar{v}^k\right\|_2^2 \\
&= \frac{1}{N}\operatorname{Prox}_{Ng/\rho}\left(N\bar{v}^k\right)
\end{aligned} \tag{3.6}
\]

and
\[
\begin{aligned}
y_j^{k+1} &= y_j^k + \rho\left(x_j^{k+1} - z_j^{k+1}\right) \\
&= y_j^k + \rho\left(x_j^{k+1} - v_j^k - \bar{z}^{k+1} + \bar{v}^k\right) \\
&= y_j^k + \rho\left(x_j^{k+1} - x_j^{k+1} - y_j^k/\rho - \bar{z}^{k+1} + \bar{x}^{k+1} + \bar{y}^k/\rho\right) \\
&= y_j^k + \rho\left(-y_j^k/\rho - \bar{z}^{k+1} + \bar{x}^{k+1} + \bar{y}^k/\rho\right) \\
&= \bar{y}^k + \rho\left(\bar{x}^{k+1} - \bar{z}^{k+1}\right).
\end{aligned} \tag{3.7}
\]
Equation (3.7) shows that the dual variables $y_j^{k+1}$ are all equal (i.e., in consensus) and can be replaced by a single dual variable $y \in \mathbb{R}^n$:
\[
y^{k+1} = y^k + \rho\left(\bar{x}^{k+1} - \bar{z}^{k+1}\right). \tag{3.8}
\]
Substituting the equivalent expression for $z_j$,
\[
\begin{aligned}
z_j^{k+1} &= v_j^k + \bar{z}^{k+1} - \bar{v}^k \\
&= x_j^{k+1} + y_j^k/\rho + \bar{z}^{k+1} - \bar{x}^{k+1} - \bar{y}^k/\rho \\
&= x_j^{k+1} + y^k/\rho + \bar{z}^{k+1} - \bar{x}^{k+1} - y^k/\rho \\
&= x_j^{k+1} + \bar{z}^{k+1} - \bar{x}^{k+1},
\end{aligned} \tag{3.9}
\]
into the $x$-update, we get
\[
\begin{aligned}
x_j^{k+1} &= \arg\min_{x_j}\; f_j(x_j) + \frac{\rho}{2}\left\|x_j - z_j^k + y^k/\rho\right\|_2^2 \\
&= \arg\min_{x_j}\; f_j(x_j) + \frac{\rho}{2}\left\|x_j - \left(x_j^k + \bar{z}^k - \bar{x}^k - y^k/\rho\right)\right\|_2^2 \\
&= \operatorname{Prox}_{f_j/\rho}\left(x_j^k + \bar{z}^k - \bar{x}^k - y^k/\rho\right).
\end{aligned} \tag{3.10}
\]
Combining (3.10), (3.6), and (3.8), we get the final algorithm:
\[
\begin{cases}
x_j^{k+1} = \operatorname{Prox}_{f_j/\rho}\left(x_j^k + \bar{z}^k - \bar{x}^k - y^k/\rho\right) \\[2pt]
\bar{z}^{k+1} = \frac{1}{N}\operatorname{Prox}_{Ng/\rho}\left(N\left(\bar{x}^{k+1} + y^k/\rho\right)\right) \\[2pt]
y^{k+1} = y^k + \rho\left(\bar{x}^{k+1} - \bar{z}^{k+1}\right) \\[2pt]
\bar{x}^{k+1} = \frac{1}{N}\sum_j x_j^{k+1}.
\end{cases} \tag{3.11}
\]
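A minimal sketch of (3.11), assuming quadratic local costs $f_j(x) = \frac{1}{2}\|x - c_j\|_2^2$ and a user-supplied prox for $g$ (both are illustrative assumptions):

```python
import numpy as np

def sharing_admm(c, prox_g, rho=1.0, iters=100):
    """Sketch of sharing ADMM (3.11) with f_j(x) = 0.5*||x - c_j||^2.

    c      : (N, n) array of local targets c_j
    prox_g : callable, prox_g(v, lam) = argmin_w g(w) + ||w - v||^2 / (2*lam)
    """
    N, n = c.shape
    x = np.zeros((N, n))
    y = np.zeros(n)                          # single shared dual variable
    xbar, zbar = np.zeros(n), np.zeros(n)
    for _ in range(iters):
        v = x + zbar - xbar - y / rho        # prox arguments, one row per agent
        x = (c + rho * v) / (1.0 + rho)      # Prox_{f_j/rho}(v_j), parallel in j
        xbar = x.mean(axis=0)
        zbar = prox_g(N * (xbar + y / rho), N / rho) / N  # (1/N)*Prox_{Ng/rho}(...)
        y = y + rho * (xbar - zbar)          # dual update
    return x, zbar
```

Note that only the averages $\bar{x}$, $\bar{z}$ and the single dual $y$ need to be communicated; the $x$-update is embarrassingly parallel across $j$.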

4 Duality of the Sharing Problem

The Lagrangian of (3.2) is
\[
L = \sum_j\left(f_j(x_j) + \langle y_j, x_j - z_j\rangle\right) + g\left(\sum_j z_j\right).
\]

The dual function is
\[
\begin{aligned}
\Gamma(y_1,\cdots,y_N) &= \inf_{x,z}\; L(x_1,\cdots,x_N,z_1,\cdots,z_N,y_1,\cdots,y_N) \\
&= \sum_{j=1}^N \inf_{x_j}\left[f_j(x_j) + \langle y_j, x_j\rangle\right] - \sup_z\left(\sum_j\langle y_j, z_j\rangle - g\Big(\sum_j z_j\Big)\right) \\
&= -\sum_j f_j^*(-y_j) - \sup_z\left(\sum_j\langle y_j, z_j\rangle - g\Big(\sum_j z_j\Big)\right) \\
&= \begin{cases} -\sum_j f_j^*(-y_j) - g^*(y_1), & \text{if } y_1 = \cdots = y_N, \\ -\infty, & \text{otherwise.} \end{cases}
\end{aligned} \tag{4.1}
\]

Letting $\psi = g^*$ and $h_j(y_j) = f_j^*(-y_j)$, the dual problem can be written as
\[
\begin{cases}
-\min_{y,w}\; \sum_j h_j(y_j) + \psi(w) \\
\text{s.t.} \quad y_j - w = 0, \quad j = 1,\cdots,N,
\end{cases} \tag{4.2}
\]
which is itself a consensus problem.

5 Optimal Exchange

We highlight an important special case of the sharing problem, the exchange problem, which has an appealing economic interpretation:
\[
\begin{cases}
\min_x\; \sum_j f_j(x_j) \\
\text{s.t.} \quad \sum_j x_j = 0.
\end{cases} \tag{5.1}
\]

This is a sharing problem with $g$ the indicator function of the set $\left\{(x_1,\cdots,x_N) : \sum_j x_j = 0\right\}$. The components of the vector $x_i$ represent quantities of commodities exchanged among $N$ subsystems.
(a) When $(x_i)_j \geq 0$, it can be viewed as the amount of commodity $j$ contributed by subsystem $i$ to the exchange.
(b) When $(x_i)_j < 0$, its magnitude $|(x_i)_j|$ can be viewed as the amount of commodity $j$ received by subsystem $i$ from the exchange.
The exchange problem can be treated as a generic constrained convex problem:
\[
\begin{cases}
\min\; \sum_j f_j(x_j) \\
\text{s.t.} \quad x \in C,
\end{cases} \tag{5.2}
\]
where
\[
C = \left\{x = (x_1,\cdots,x_N) \in \mathbb{R}^{nN} : \sum_j x_j = 0\right\}.
\]
The exchange ADMM algorithm is

\[
\begin{cases}
x_j^{k+1} = \operatorname{Prox}_{f_j/\rho}\left(x_j^k - \bar{x}^k - y^k/\rho\right) \\[2pt]
y^{k+1} = y^k + \rho\,\bar{x}^{k+1} \\[2pt]
\bar{x}^{k+1} = \frac{1}{N}\sum_j x_j^{k+1}.
\end{cases} \tag{5.3}
\]
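A sketch of the exchange iteration (5.3), again with quadratic costs $f_j(x) = \frac{1}{2}\|x - c_j\|_2^2$ assumed so that the prox is closed-form:

```python
import numpy as np

def exchange_admm(c, rho=1.0, iters=200):
    """Sketch of exchange ADMM (5.3) with f_j(x) = 0.5*||x - c_j||^2."""
    N, n = c.shape
    x = np.zeros((N, n))
    y = np.zeros(n)                       # dual vector, interpretable as prices
    for _ in range(iters):
        xbar = x.mean(axis=0)             # xbar^k
        v = x - xbar - y / rho            # prox argument x_j^k - xbar^k - y^k/rho
        x = (c + rho * v) / (1.0 + rho)   # Prox_{f_j/rho}(v_j), parallel in j
        xbar = x.mean(axis=0)             # xbar^{k+1}
        y = y + rho * xbar                # price update drives sum_j x_j -> 0
    return x
```

In the economic interpretation, $y$ acts as a price vector that is adjusted until the market clears, i.e., until $\sum_j x_j = 0$.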

6 Decentralized Consensus

When we solve the global consensus optimization problem
\[
\min_x\; f(x) = \sum_{j=1}^N f_j(x), \tag{6.1}
\]
we use the constrained form
\[
\begin{cases}
\min_{\hat{x}}\; \sum_{i=1}^N f_i(x_i) \\
\text{s.t.} \quad x_i - z = 0, \quad i = 1,\dots,N,
\end{cases} \tag{6.2}
\]
where $\hat{x} = (x_1,\cdots,x_N)$.
Here $z$ is the central collector. However, in many scenarios the data is collected or stored in a distributed manner, and a fusion center is either disallowed or not economical. Consequently, any computing task must be accomplished in a decentralized and collaborative manner by the agents. This approach can be powerful and efficient, as
(a) the computing tasks are distributed over all the agents;
(b) information exchange occurs only between agents with direct communication links.
There is no risk of central computation overload or network congestion.

(W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, IEEE TSP, 2014)


For the global consensus optimization (6.1), we give a local copy $x_i$ of $x$ to agent $i$ and get
\[
\begin{cases}
\min\; \sum_i f_i(x_i) \\
\text{s.t.} \quad x_i = x_j, \quad \forall (i,j) \in \varepsilon,
\end{cases} \tag{6.3}
\]
where $x = (x_1,\cdots,x_N)^\top \in \mathbb{R}^{nN}$, or
\[
\begin{cases}
\min\; \sum_{j=1}^N f_j(x_j) \\
\text{s.t.} \quad x_j = \bar{x},
\end{cases} \tag{6.4}
\]
where $\bar{x} = \frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}} x_i = \frac{1}{N}\sum_{i=1}^N x_i$.
Here we consider a network consisting of $N$ agents bidirectionally connected by $E$ edges. We can describe the network as a symmetric undirected graph $G = (\mathcal{V}, \varepsilon)$, where $\mathcal{V} = \{1,\cdots,N\}$ is the set of vertices with $|\mathcal{V}| = N$, and $\varepsilon$ is the set of edges with $|\varepsilon| = E$. For any vertex $i \in \mathcal{V}$, we denote by $N_i = \{j : (i,j) \in \varepsilon\}$ its neighborhood.

For example: $\mathcal{V} = \{v_1,\cdots,v_5\}$, $\varepsilon = \{(1,2), (1,5), (1,3), (2,5), (3,2), (3,4), (4,5)\}$.

7 Distributed ADMM
In the optimization problem (6.3), the constraint can be written as $x_1 = x_2 = x_3 = \cdots = x_N$, i.e.,
\[
\begin{cases}
x_1 - x_2 = 0 \\
x_2 - x_3 = 0 \\
\quad\vdots \\
x_{N-1} - x_N = 0.
\end{cases}
\]
This can be represented via the edge-node incidence matrix $A$ as
\[
Ax = 0,
\]
where
\[
A = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix} \in \mathbb{R}^{(N-1)\times N}
\]
is the incidence matrix of a chain over the agents (its Gram matrix $A^\top A$ is the corresponding graph Laplacian).
Therefore, we get the following problem:
\[
\begin{cases}
\min\; \sum_{j=1}^N f_j(x_j) \\
\text{s.t.} \quad Ax = 0,
\end{cases} \tag{7.1}
\]
or
\[
\begin{cases}
\min_{x_1,\cdots,x_N}\; \sum_{j=1}^N f_j(x_j) \\
\text{s.t.} \quad \sum_{j=1}^N A_j x_j = 0.
\end{cases} \tag{7.2}
\]
Next, we give the distributed ADMM for solving (7.1). For agent $i$:
\[
\begin{cases}
x_i^{k+1} = \arg\min_{x_i}\; f_i(x_i) + \frac{\rho}{2}\sum_{j\in N_i}\left\|x_j^k - x_i + y_{ij}^k/\rho\right\|_2^2, & i = 1,\cdots,N \\[2pt]
y_{ji}^{k+1} = y_{ji}^k + \rho\left(x_j^{k+1} - x_i^{k+1}\right).
\end{cases} \tag{7.3}
\]

Given $N_i = N_i^+ \cup N_i^-$ with
\[
\begin{cases}
N_i^+ = \{j : (i,j) \in \varepsilon,\ i < j\} \\
N_i^- = \{j : (i,j) \in \varepsilon,\ i > j\},
\end{cases} \tag{7.4}
\]
we design the following distributed ADMM:
\[
\begin{cases}
0 \in \partial f_i\left(x_i^{k+1}\right) + \rho\sum_{j\in N_i^-}\left(x_i^{k+1} - x_j^{k+1} + y_{ji}^k/\rho\right) + \rho\sum_{j\in N_i^+}\left(x_i^{k+1} - x_j^k + y_{ji}^k/\rho\right) \\[2pt]
y_{ji}^{k+1} = y_{ji}^k + \rho\left(x_j^{k+1} - x_i^{k+1}\right),
\end{cases} \tag{7.5}
\]
where the $x_i^{k+1}$-update only involves the neighborhood $N_i$: neighbors in $N_i^-$ have already been updated at iteration $k+1$, while neighbors in $N_i^+$ still carry their iteration-$k$ values.
Each such update has the generic form
\[
x_i^{k+1} = \arg\min_{x_i}\; f_i(x_i) + \frac{\rho}{2}\Big\|\sum_{j\neq i} A_j x_j^k + A_i x_i - b + y^k/\rho\Big\|_2^2.
\]

For the model (6.4), we can write it as follows [2]:
\[
\begin{cases}
\min_{x,z}\; F(x) + g(z) \\
\text{s.t.} \quad Ax + Bz = 0,
\end{cases} \tag{7.6}
\]
where
\[
\begin{cases}
A = [A_1; A_2] \in \mathbb{R}^{2En\times Nn} \\
B = [-I_{En}; -I_{En}] \in \mathbb{R}^{2En\times En}.
\end{cases} \tag{7.7}
\]
Through some transformations, we get
\[
\begin{cases}
0 \in \partial f_i\left(x_i^{k+1}\right) + w_i^k + 2\rho|N_i|\,x_i^{k+1} - \rho\left(|N_i|\,x_i^k + \sum_{j\in N_i} x_j^k\right) \\[2pt]
w_i^{k+1} = w_i^k + \rho\left(|N_i|\,x_i^{k+1} - \sum_{j\in N_i} x_j^{k+1}\right),
\end{cases} \tag{7.8}
\]
where $w = My$ with the dual variable $y$ and $M = A_1^\top - A_2^\top$.

8 Distributed Gradient Method

Next, we show how to solve (6.3) via the gradient descent method.

At iteration $k$, every agent $i$ sends its current iterate $x_i^k$ to its neighbors $j \in N_i$ and receives $x_j^k$ from them. Then every agent $i$ executes the consensus update step
\[
x_i^{k+1} = a_{ii}x_i^k + \sum_{j\in N_i} a_{ij}x_j^k = \sum_{j\in N_i\cup\{i\}} a_{ij}x_j^k,
\]
where the weights satisfy $a_{ij} > 0$ for $j \in N_i \cup \{i\}$ and $\sum_{j\in N_i\cup\{i\}} a_{ij} = 1$. Here the positive scalars $a_{ij}$, $j \in N_i \cup \{i\}$, are referred to as convex weights, and the vector $x_i^{k+1}$ is said to be a convex combination (or weighted average) of the points $x_j^k$, $j \in N_i \cup \{i\}$. Note that the weights $a_{ij}$, $j \in N_i \cup \{i\}$, are selected by agent $i$.
Let us define an $N \times N$ weight matrix $A$ with entries
\[
a_{ij}\begin{cases} > 0, & j \in N_i \cup \{i\} \\ = 0, & j \notin N_i \cup \{i\}, \end{cases} \qquad \sum_{j\in N_i\cup\{i\}} a_{ij} = 1. \tag{8.1}
\]

The sum of each row of $A$ is equal to $1$, i.e., $\sum_{j=1}^N a_{ij} = 1$; we refer to such a matrix as row stochastic. The update then becomes
\[
x_i^{k+1} = \frac{1}{\sum_{j=1}^N a_{ij}}\sum_{j=1}^N a_{ij}x_j^k. \tag{8.2}
\]

Then every agent $i$ executes the following two steps:
\[
\begin{cases}
\tilde{x}_i^k = \sum_{j=1}^N a_{ij}x_j^k \\[2pt]
x_i^{k+1} = \tilde{x}_i^k - \alpha_k\nabla f_i\left(\tilde{x}_i^k\right),
\end{cases} \tag{8.3}
\]
where $\alpha_k > 0$ is the step size. This combines consensus averaging with the local gradient step $x_i^{k+1} = x_i^k - \alpha_k\nabla f_i\left(x_i^k\right)$.
We can also have the following variant:
\[
\begin{cases}
\bar{x}_i^k = x_i^k - \alpha_k\nabla f_i\left(x_i^k\right) \\[2pt]
x_i^{k+1} = \sum_{j=1}^N a_{ij}\bar{x}_j^k.
\end{cases} \tag{8.4}
\]
Taking the average over agents,
\[
\begin{aligned}
\frac{1}{N}\sum_i x_i^{k+1} &= \frac{1}{N}\sum_i \tilde{x}_i^k - \frac{\alpha_k}{N}\sum_i\nabla f_i\left(\tilde{x}_i^k\right) \\
&= \frac{1}{N}\sum_{j=1}^N\left(\sum_{i=1}^N a_{ij}\right)x_j^k - \frac{\alpha_k}{N}\sum_{i=1}^N\nabla f_i\left(\tilde{x}_i^k\right).
\end{aligned} \tag{8.5}
\]

If the matrix $A$ is also column stochastic, i.e., $\sum_{i=1}^N a_{ij} = 1$ for all $j$ (so that $A$ is doubly stochastic), then
\[
\begin{cases}
\tilde{x}_i^k = \sum_j a_{ij}x_j^k \\[2pt]
\bar{x}^{k+1} = \bar{x}^k - \frac{\alpha_k}{N}\sum_{j=1}^N\nabla f_j\left(\tilde{x}_j^k\right),
\end{cases} \tag{8.6}
\]
so the network average follows a centralized gradient step on $\frac{1}{N}\sum_j f_j$, evaluated at the mixed iterates.

9 The Relationship Between Parallel and Distributed Computing

Figure 1: Parallel Computing

Figure 2: Distributed Computing
In the distributed setting, each agent updates using only information from its neighbors:
\[
x_i^{k+1} = \arg\min_{x_i}\; f_i(x_i) + \sum_{j\in N_i}\frac{\rho}{2}\left\|x_j^k - x_i + y_{ij}^k/\rho\right\|_2^2, \quad i = 1,\cdots,N. \tag{9.1}
\]

Example. Given a training data set
\[
S = \{(a_i, b_i) \in \mathbb{R}^n \times \mathbb{R}\}_{i=1}^N,
\]
we want to learn a parameter $x$ that minimizes
\[
\min_{x\in\chi}\; \sum_{i=1}^N f_i(x; a_i, b_i), \tag{9.2}
\]

where $\chi = [-1,1]^n$ and the $f_i$ are the loss functions associated with the data. For example, the $f_i$ may be quadratic, i.e.,
\[
\min_{x\in\chi}\; \sum_{i=1}^N\left(a_i^\top x - b_i\right)^2. \tag{9.3}
\]

The stochastic matrix $C$ is chosen as a lazy Metropolis matrix corresponding to $G$, i.e.,
\[
C = [c_{ij}] = \begin{cases}
\frac{1}{2\max\{|N_i|,\,|N_j|\}}, & (i,j) \in \varepsilon \\
0, & (i,j) \notin \varepsilon,\ i \neq j \\
1 - \sum_{j\in N_i} c_{ij}, & i = j.
\end{cases} \tag{9.4}
\]
Then we get
\[
\begin{cases}
\min_{x_i\in\chi}\; \sum_{i=1}^N\left(a_i^\top x_i - b_i\right)^2 \\
\text{s.t.} \quad (I - C)x = 0,
\end{cases} \tag{9.5}
\]
and solve it by the distributed gradient descent method with the weight matrix $C$.
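The following sketch builds the lazy Metropolis matrix (9.4) from an edge list and runs distributed projected gradient descent on the quadratic losses (9.3); the edge list, step size, and data shapes are illustrative placeholders.

```python
import numpy as np

def lazy_metropolis(edges, N):
    """Build the lazy Metropolis matrix C of (9.4) from an undirected edge list."""
    deg = np.zeros(N, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    C = np.zeros((N, N))
    for i, j in edges:
        w = 1.0 / (2.0 * max(deg[i], deg[j]))
        C[i, j] = C[j, i] = w
    np.fill_diagonal(C, 1.0 - C.sum(axis=1))   # c_ii = 1 - sum_{j != i} c_ij
    return C                                   # symmetric, hence doubly stochastic

def dgd_quadratic(C, a, b, alpha=0.01, iters=500):
    """Distributed projected gradient descent for f_i(x) = (a_i^T x - b_i)^2 on [-1,1]^n."""
    N, n = a.shape
    X = np.zeros((N, n))
    for _ in range(iters):
        X = C @ X                                            # consensus mixing
        G = 2.0 * (np.sum(a * X, axis=1) - b)[:, None] * a   # local gradients
        X = np.clip(X - alpha * G, -1.0, 1.0)                # step + projection onto chi
    return X

# Example usage on the five-node graph from Section 6 (illustrative data):
# edges = [(0, 1), (0, 4), (0, 2), (1, 4), (2, 1), (2, 3), (3, 4)]
# C = lazy_metropolis(edges, N=5)
```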

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[2] E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

[3] J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.
