
14.4 Non-Blocking Send and Receive, Avoiding Deadlocks

14.4.2 Mutual communication and avoiding deadlock


Non-blocking operations can also be used to avoid deadlocks.
A deadlock is a situation where processes wait for each other, none of them able to do anything useful. Deadlocks can occur:

• because of a wrong ordering of send and receive calls

• because the system send buffer fills up

In the case of mutual communication there are 3 possibilities:

1. Both processes start with a send followed by a receive

2. Both processes start with a receive followed by a send

3. One process starts with a send followed by a receive, the other vice versa

Depending on whether blocking or non-blocking calls are used, there are different outcomes:



1. Send followed by receive.


Analyse the following code:
 
# assumes: from mpi4py import MPI; import numpy as np
#          comm = MPI.COMM_WORLD; rank = comm.Get_rank()
#          sendbuf, recvbuf are NumPy float32 arrays; tag is an integer; 2 processes
if rank==0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)


Works
Does this
OKwork?
for small messages only (if sendbuf is smaller then system message
send-bubber)
Large
But what
messages
about large
produce
messages?
Deadlock
Why? Would using Irecv help out?

The following is deadlock-free:

if rank==0:
    request=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    request.Wait()
else:
    request=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    request.Wait()


Question: Why can Wait() not be placed right after Isend(...)?



2. Receive followed by send


 
if rank==0:
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)


Is this OK? This produces a deadlock for any system message buffer size.

But have a look at:


 
if rank==0:
    request=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    request.Wait()
else:
    request=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    request.Wait()


Is it deadlock-free? (This is deadlock-free.)

3. One process starts with a send, the other with a receive


 
if rank==0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)

Could we use non-blocking commands instead? (Non-blocking commands can be used in whichever call here as well.)

Generally, the following communication pattern is advised:


 
if rank==0:
    req1=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    req2=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
else:
    req1=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    req2=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
req1.Wait()
req2.Wait()
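
For reference, here is a minimal self-contained sketch of this pattern (the buffer size, data values and tag are illustrative assumptions; run with two processes, e.g. mpiexec -n 2 python exchange.py):

# Minimal sketch of the advised Isend/Irecv exchange between ranks 0 and 1.
# Buffer size, tag and contents are illustrative assumptions.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
tag = 7

sendbuf = np.full(10, rank, dtype=np.float32)   # data to send
recvbuf = np.empty(10, dtype=np.float32)        # space for the received data

other = 1 - rank                                # the partner process (0 <-> 1)
req1 = comm.Isend([sendbuf, MPI.FLOAT], other, tag)
req2 = comm.Irecv([recvbuf, MPI.FLOAT], other, tag)
req1.Wait()
req2.Wait()

print(rank, recvbuf[:3])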


15 Design and Evaluation of Parallel Programs


15.1 Two main approaches in parallel program design
• Data-parallel approach

– Data is divided between processes so that each process does the same operations but with different data

• Control-parallel approach

– Each process has access to all pieces of data, but the processes perform different operations on them

The majority of parallel programs use a mix of the two approaches.

In this course the data-parallel approach dominates.

What obstacles can be encountered?

• Data dependencies

Example: we need to solve a 2 × 2 system of linear equations


    [ a  0 ] [ x ]   [ f ]
    [ b  c ] [ y ] = [ g ]

Data partitioning: the first row goes to process 0, the second row to process 1.


Isn’t it easy! −→ NOT!
Writing it out as a system of two equations:

    a x = f
    b x + c y = g.

Computation of y on process 1 depends on x computed on process 0. Therefore we need communication between the processes (see the code sketch after the list of obstacles below).

• Communication (more expensive than arithmetic operations or memory references)

• Synchronisation (processes that are just waiting are useless)

• Extra work (needed to avoid communication, or simply to write the parallel program)
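
As promised above, a sketch of how the data dependency in the 2 × 2 example shows up in code (the coefficient values are made up for illustration; rank 0 owns the first row, rank 1 the second):

# Sketch: solving  a*x = f,  b*x + c*y = g  with the two rows on different processes.
# Coefficient values are illustrative assumptions; run with two processes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    a, f = 2.0, 4.0
    x = f / a                       # process 0 can compute x immediately
    comm.send(x, dest=1, tag=0)     # ... and has to communicate it to process 1
elif rank == 1:
    b, c, g = 1.0, 3.0, 5.0
    x = comm.recv(source=0, tag=0)  # process 1 must wait for x
    y = (g - b * x) / c             # only then can it compute y
    print("x =", x, "y =", y)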

Which way to go?

• Divide data evenly between machines −→ load balancing

• Reduce data dependencies −→ good partitioning (geometric, for example)

• Make the communication-free parts of the calculations as large as possible −→ algorithms with large granularity

These requirements are not easily achievable. Some problems are easily parallelisable, others are not.

15.2 Assessing parallel programs


15.2.1 Speedup

    S(N, P) := t_seq(N) / t_par(N, P)

• t_seq(N) – time for solving the problem with the best known sequential algorithm

– =⇒ the sequential algorithm often differs from the parallel one!

– If the same parallel algorithm is timed on just one processor (despite the existence of a better sequential algorithm), the corresponding ratio is called relative speedup

• 0 < S(N, P) ≤ P

• If S(N, P) = P, the parallel program has linear or optimal speedup. (Example: calculating π with a parallel quadrature formula)

• Sometimes it may happen that S(N, P) > P. How is that possible? Due to processor cache effects – this is called superlinear speedup

• But sometimes S(N, P) < 1 – slowdown instead of speedup! What does this mean?

15.2.2 Efficiency

    E(N, P) := t_seq(N) / (P · t_par(N, P))

Presumably, 0 < E(N, P) ≤ 1.
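
A tiny illustration of both definitions (the timings are made-up numbers, not measurements):

# Illustrative speedup and efficiency computation; t_seq and t_par are assumed values.
t_seq = 12.0             # best sequential time for problem size N (seconds)
t_par = 2.0              # parallel time with P processes (seconds)
P = 8

S = t_seq / t_par        # speedup    S(N, P)
E = t_seq / (P * t_par)  # efficiency E(N, P) = S(N, P) / P
print(S, E)              # 6.0 0.75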

15.2.3 Amdahl’s law


Each algorithm has some part(s) that cannot be parallelised

• Let σ (0 < σ ≤ 1) denote the sequential part of a parallel program, i.e. the part that cannot be parallelised

• Assume the remaining 1 − σ part is parallelised optimally; then

    S(N, P) = t_seq / ((σ + (1 − σ)/P) · t_seq) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.

If e.g. 5% of the algorithm is not parallelisable (i.e. σ = 0.05), we get:

    P     S(N, P)
    2     1.9
    4     3.5
    10    6.9
    20    10.3
    100   16.8
    ∞     20

=⇒ using a large number of processors seems useless for gaining any reasonable speedup increase!
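
The table can be reproduced directly from the formula; a quick sketch:

# Amdahl's law: S(P) = 1 / (sigma + (1 - sigma)/P), here with sigma = 0.05.
sigma = 0.05
for P in (2, 4, 10, 20, 100):
    S = 1.0 / (sigma + (1.0 - sigma) / P)
    print(P, round(S, 1))
# In the limit P -> infinity, S -> 1/sigma = 20.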

15.2.4 Validity of Amdahl’s law


John Gustafson & Ed Barsis (Sandia Laboratories): a 1024-processor nCube/10 beat Amdahl's law! Is that possible at all?
In their problem they had: σ ≈ 0.004...0.008
They got S ≈ 1000
(According to Amdahl's law, S should have been only 125...250 !!!)
Does Amdahl's law hold? Mathematically, Amdahl's law holds, of course.
Does it make sense to solve a problem with fixed problem size N on an arbitrarily large number of processors? (What if there are only 1000 operations in the whole calculation – does using 1001 processors help? ;-)

• The point is that usually σ is not constant, but decreases as N grows

• An algorithm is said to be efficiently parallel if σ → 0 as N → ∞



To avoid problems with these terms, the term scaled efficiency is often used.

15.2.5 Scaled efficiency

    E_S(N, P) := t_seq(N) / t_par(P · N, P)

• Does the solution time remain the same when the problem size grows proportionally with P?

• 0 < E_S(N, P) ≤ 1

• If E_S(N, P) = 1, we have linear speedup



16 Parallel preconditioners

16.1 Comparing different solution algorithms


Poisson problem:

Method       Memory    Flops      Time [#IT], n = 200   growth   Time [#IT], n = 400   Parallel P = 8, n = 400   Speedup
LU           O(n^4)    O(n^6)     -                     -        -                     -                         -
Banded*)     O(n^3)    O(n^4)     3.6s                  ×25.1    90.4s                 92.4s                     0.98
Jacobi       O(n^2)    O(n^4)     > 1000s               -        > 1000s               > 1000s                   -
CG           O(n^2)    O(n^3)     3.8s [357]            ×7.9     30.1s [702]           6.1s                      4.9
PCG(ILU)     O(n^2)    O(n^3)     2.9s [147]            ×6.8     19.6s [249]           3.9s                      5.0
DOUG*)       O(n^2)    O(n^7/3)   6.5s [15]             ×4.9     32.1s [17]            4.3s                      7.5
BoomerAMG    O(n^2)    O(n^2)     1.8s [3]              ×4       7.3s [3]              4.1s                      1.8

*) Time taken on a 1.5-2 times slower machine

• In all cases the stopping criterion was chosen as ε := 10^-8



• BoomerAMG is a parallel multigrid algorithm (Lawrence Livermore Nat. Lab., CA, Hypre, http://www.llnl.gov/CASC/hypre/). For the Poisson equation solver the multigrid preconditioner is optimal (regarding flops versus the number of unknowns)

• DOUG – Domain Decomposition on Unstructured Grids (University of Tartu and University of Bath (UK)), http://dougdevel.org

16.2 PCG
Recall the CG method:
 
Calculate r(0) = b − A x(0) with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)    # M^{-1} is called the preconditioner
    ρ_{i−1} = r(i−1)^T z(i−1)
    if i == 1
        p(1) = z(0)
    else
        β_{i−1} = ρ_{i−1} / ρ_{i−2}
        p(i) = z(i−1) + β_{i−1} p(i−1)
    endif
    q(i) = A p(i) ;  α_i = ρ_{i−1} / (p(i)^T q(i))
    x(i) = x(i−1) + α_i p(i) ;  r(i) = r(i−1) − α_i q(i)
    check convergence ; continue if needed
end
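
A straightforward serial NumPy sketch of the same iteration (apply_Minv is a function applying M^{-1} to a vector; the names and the Jacobi example at the end are illustrative assumptions):

# Serial sketch of the PCG iteration above; `apply_Minv(r)` returns M^{-1} r.
import numpy as np

def pcg(A, b, x0, apply_Minv, tol=1e-8, maxit=1000):
    x = x0.copy()
    r = b - A @ x                     # r(0) = b - A x(0)
    rho_old = None
    for i in range(1, maxit + 1):
        z = apply_Minv(r)             # solve M z = r
        rho = r @ z
        if i == 1:
            p = z.copy()
        else:
            beta = rho / rho_old
            p = z + beta * p
        q = A @ p
        alpha = rho / (p @ q)
        x += alpha * p
        r -= alpha * q
        rho_old = rho
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):   # convergence check
            break
    return x, i

# Example use with a simple Jacobi (diagonal) preconditioner:
#   x, nit = pcg(A, b, np.zeros_like(b), lambda r: r / np.diag(A))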


16.3 Block-Jacobi method

[Figure: domain divided into two subdomains Ω1 and Ω2]

    M = [ M1   0  ]
        [ 0    M2 ]

where M1 and M2 are preconditioners for the diagonal blocks A11 and A22 of the matrix

    A = [ A11  A12 ]
        [ A21  A22 ]

(For example, M1 = A11^{-1}, M2 = A22^{-1}.)
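
A small sketch of applying such a block-Jacobi preconditioner in the "solve M z = r" step of PCG (splitting the unknowns into two halves is an illustrative assumption):

# Sketch: block-Jacobi preconditioner solve  M z = r  with M = diag(A11, A22).
# The split of the unknowns into two halves is an assumption for illustration.
import numpy as np

def block_jacobi_solve(A, r):
    n = A.shape[0]
    i1, i2 = np.arange(n // 2), np.arange(n // 2, n)    # index sets of the two blocks
    z = np.zeros_like(r)
    z[i1] = np.linalg.solve(A[np.ix_(i1, i1)], r[i1])   # z1 = A11^{-1} r1
    z[i2] = np.linalg.solve(A[np.ix_(i2, i2)], r[i2])   # z2 = A22^{-1} r2
    return z

# Used with the PCG sketch above:  pcg(A, b, x0, lambda r: block_jacobi_solve(A, r))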

16.4 Overlapping Additive Schwarz method

[Figure: domain divided into two overlapping subdomains Ω1 and Ω2]

    M = [ M1   0  ]
        [ 0    M2 ]

Both of the last examples are Domain Decomposition methods

• The Schwarz method

– dates back to the year 1870

– was meant for solving a PDE on a region composed of two regions

16.5 Domain Decomposition Method (DDM)


Domain Decomposition – a class of parallelisation methods where the solution domain is partitioned into subdomains.

Partitioning examples: [figures showing several example partitionings]

DDM Classification

• Non-overlapping methods

– Block-Jacobi method
– Additive Average method

• Overlapping methods

– Additive Schwarz method


– Substructuring method

• One-level method / Two-level (or multiple-level) methods



Non-overlapping, Block-Jacobi method:

[Figure: 9 × 9 grid of nodes numbered 1-81, divided into non-overlapping subdomains]

Method with minimal overlap (optimal overlap):

[Figure: 3 × 3 array of subdomains Ω1, ..., Ω9 with minimal overlap]

Subdomain restriction operators


The restriction operator R_i is a matrix which, applied to a global vector x, returns the components x_i = R_i x belonging to subdomain Ω_i.
Define

    M_i^{-1} := R_i^T (R_i A R_i^T)^{-1} R_i,   i = 1, ..., P

The one-level Additive Schwarz preconditioner is defined by

    M_{1AS}^{-1} := ∑_{i=1}^{P} M_i^{-1}
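
Expressed with index sets instead of explicit restriction matrices, a minimal sketch of applying M_{1AS}^{-1} to a vector (the overlapping index sets are assumed to be given):

# Sketch: apply the one-level Additive Schwarz preconditioner
#   M_{1AS}^{-1} r = sum_i R_i^T (R_i A R_i^T)^{-1} R_i r,
# where each R_i is represented by the index set of the (possibly overlapping) subdomain.
import numpy as np

def additive_schwarz_solve(A, r, subdomains):
    """subdomains: list of integer index arrays, one per Omega_i."""
    z = np.zeros_like(r)
    for idx in subdomains:
        Ai = A[np.ix_(idx, idx)]               # R_i A R_i^T
        z[idx] += np.linalg.solve(Ai, r[idx])  # accumulate R_i^T (R_i A R_i^T)^{-1} R_i r
    return z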

16.6 Multilevel methods


Introduce a coarse grid

Coarse grid Restriction and Interpolation operators


The restriction operator R_0 is defined as the transpose of the interpolation operator R_0^T, which interpolates values linearly (or bilinearly) from the coarse grid to the fine grid nodes.
The coarse grid matrix A_0 is

• assembled

– analytically, or
– through the formula A_0 = R_0 A R_0^T

• Define

    M_0^{-1} := R_0^T A_0^{-1} R_0 = R_0^T (R_0 A R_0^T)^{-1} R_0

The two-level Additive Schwarz method is defined by

    M_{2AS}^{-1} := ∑_{i=0}^{P} M_i^{-1}
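
Building on the one-level sketch above, a possible sketch of adding the coarse-grid term (R0, the restriction to the coarse grid, is assumed to be given as a matrix):

# Sketch: two-level Additive Schwarz apply = one-level sum + coarse-grid correction.
import numpy as np

def two_level_as_solve(A, r, subdomains, R0):
    z = additive_schwarz_solve(A, r, subdomains)   # terms i = 1, ..., P (see sketch above)
    A0 = R0 @ A @ R0.T                             # Galerkin coarse matrix A0 = R0 A R0^T
    z += R0.T @ np.linalg.solve(A0, R0 @ r)        # coarse term M0^{-1} r = R0^T A0^{-1} R0 r
    return z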

Requirements on the coarse grid:

• The coarse grid should cover all the fine grid nodes

• Each coarse grid cell must contain fine grid nodes

• If the fine grid has uneven density, the coarse grid must adapt to the fine grid in terms of the number of nodes in each cell

If all the requirements are fulfilled, it can be shown that

• the condition number κ(B) of the Additive Schwarz method does not depend on the discretisation parameter n.

16.7 Parallelising PCG


In addition to the parallel preconditioner (given above) we need to parallelise:

• the dot product operation

• the Ax-operation

The first is easily done with MPI_ALLREDUCE (attention is needed only in the case of overlaps); see the sketch below.
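
A sketch of the parallel dot product (each process is assumed to hold its local, non-overlapping part of the vectors in x_loc and y_loc; the data here is random just for illustration):

# Sketch: global dot product over distributed vectors via Allreduce (MPI_ALLREDUCE).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
x_loc = np.random.rand(100)       # local parts owned by this process (illustrative data)
y_loc = np.random.rand(100)

local_dot = np.dot(x_loc, y_loc)
global_dot = comm.allreduce(local_dot, op=MPI.SUM)   # same value on every process
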
Parallelising the Ax-operation depends on the existence of overlap.
In the case of no overlap (as in the Block-Jacobi method), the shadow-node technique is usually used:

[Figure: fine grid nodes near the boundary between two subdomains; e.g. nodes 3, 12, 21 belong to process P0 and nodes 4, 13, 22 to process P1 (global numbering), each side keeping shadow copies of the other's boundary nodes]

Before the Ax-operation, neighbours exchange values at the nodes for which the matrix A has elements A_ij ≠ 0 with i ∈ Ω_k, j ∈ Ω_l, k ≠ l. For example, process P1 sends the values at nodes 4, 13, 22 (in global numbering) to process P0 and receives the values at nodes 3, 12 and 21 (saving these into shadow-node variables).
Non-blocking communication can be used!
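
A sketch of such an exchange followed by the local matrix-vector product (the neighbour list, the per-neighbour index arrays send_idx/recv_idx and the local matrix A_loc are placeholders for what the real code sets up; A_loc is assumed to have its shadow columns last):

# Sketch: exchange shadow-node values with non-blocking calls, then do the local A x.
from mpi4py import MPI
import numpy as np

def parallel_matvec(comm, A_loc, x_loc, n_shadow, neighbours, send_idx, recv_idx):
    sendbufs, recvbufs, reqs = {}, {}, []
    for p in neighbours:
        sendbufs[p] = np.ascontiguousarray(x_loc[send_idx[p]])   # values the neighbour needs
        recvbufs[p] = np.empty(len(recv_idx[p]))                 # room for its values
        reqs.append(comm.Isend([sendbufs[p], MPI.DOUBLE], dest=p, tag=1))
        reqs.append(comm.Irecv([recvbufs[p], MPI.DOUBLE], source=p, tag=1))
    MPI.Request.Waitall(reqs)

    x_shadow = np.empty(n_shadow)
    for p in neighbours:
        x_shadow[recv_idx[p]] = recvbufs[p]                      # fill the shadow nodes
    return A_loc @ np.concatenate([x_loc, x_shadow])             # local rows of A times (own + shadow) values
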
RealKind.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/RealKind.f90.html)
sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/sparse_mat.f90.html)
par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_mat.f90.html)
par_sparse_lahenda.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_lahenda.f90.html)
