
14.4 Non-Blocking Send and Receive, Avoiding Deadlocks

14.4.2 Mutual communication and avoiding deadlock


Non-blocking operations can also be used to avoid deadlocks.
A deadlock is a situation where processes wait for each other, none of them able to do anything useful. Deadlocks can occur:

• because of a wrong ordering of send and receive calls

• because the system send buffer fills up

In the case of mutual communication there are 3 possibilities:

1. Both processes start with a send followed by a receive

2. Both processes start with a receive followed by a send

3. One process starts with a send followed by a receive, the other vice versa

Depending on whether blocking or non-blocking calls are used, there are different outcomes:



1. Send followed by receive.


Analyse the following code:
 
# assumes: from mpi4py import MPI; import numpy as np
#          comm = MPI.COMM_WORLD; rank = comm.Get_rank()
#          sendbuf, recvbuf are NumPy float32 arrays; tag is an integer; 2 processes
if rank==0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)


Works
Does this
OKwork?
for small messages only (if sendbuf is smaller then system message
send-bubber)
Large
But what
messages
about large
produce
messages?
Deadlock
Why? Would using Irecv help out?

The following is deadlock-free:

if rank==0:
    request=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    request.Wait()
else:
    request=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    request.Wait()


Question: Why can Wait() not be placed right after Isend(...)?



2. Receive followed by send


 
if rank==0:
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)


Is this OK? This produces a deadlock for any system message buffer size.

But have a look at:


 
if rank==0:
    request=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    request.Wait()
else:
    request=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    request.Wait()


Is it deadlock-free? (This is deadlock-free.)

3. One process starts with a send, the other with a receive


 
if rank==0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)

Could we use non-blocking commands instead? (Non-blocking commands can be used in whichever call here as well.)

Generally, the following communication pattern is advised:


 
if rank==0:
    req1=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    req2=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
else:
    req1=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    req2=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
req1.Wait()
req2.Wait()
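
For reference, here is a minimal self-contained sketch of this pattern (the buffer size, data values and tag are illustrative assumptions; run with two processes, e.g. mpiexec -n 2 python exchange.py):

# Minimal sketch of the advised Isend/Irecv exchange between ranks 0 and 1.
# Buffer size, tag and contents are illustrative assumptions.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
tag = 7

sendbuf = np.full(10, rank, dtype=np.float32)   # data to send
recvbuf = np.empty(10, dtype=np.float32)        # space for the received data

other = 1 - rank                                # the partner process (0 <-> 1)
req1 = comm.Isend([sendbuf, MPI.FLOAT], other, tag)
req2 = comm.Irecv([recvbuf, MPI.FLOAT], other, tag)
req1.Wait()
req2.Wait()

print(rank, recvbuf[:3])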


15 Design and Evaluation of Parallel Programs


15.1 Two main approaches in parallel program design
• Data-parallel approach

– Data is divided between processes so that each process does the same operations but with different data

• Control-parallel approach

– Each process has access to all pieces of data, but the processes perform different operations on them

The majority of parallel programs use a mix of the two approaches.

In this course the data-parallel approach dominates.

What obstacles can be encountered?

• Data dependencies

Example: we need to solve a 2 × 2 system of linear equations


    [ a  0 ] [ x ]   [ f ]
    [ b  c ] [ y ] = [ g ]

Data partitioning: the first row goes to process 0, the second row to process 1.


Isn’t it easy! −→ NOT!
Writing it out as a system of two equations:

    a x = f
    b x + c y = g.

Computation of y on process 1 depends on x computed on process 0. Therefore we need communication between the processes (see the code sketch after the list of obstacles below).

• Communication (more expensive than arithmetic operations or memory references)

• Synchronisation (processes that are just waiting are useless)

• Extra work (needed to avoid communication, or simply to write the parallel program)
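
As promised above, a sketch of how the data dependency in the 2 × 2 example shows up in code (the coefficient values are made up for illustration; rank 0 owns the first row, rank 1 the second):

# Sketch: solving  a*x = f,  b*x + c*y = g  with the two rows on different processes.
# Coefficient values are illustrative assumptions; run with two processes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    a, f = 2.0, 4.0
    x = f / a                       # process 0 can compute x immediately
    comm.send(x, dest=1, tag=0)     # ... and has to communicate it to process 1
elif rank == 1:
    b, c, g = 1.0, 3.0, 5.0
    x = comm.recv(source=0, tag=0)  # process 1 must wait for x
    y = (g - b * x) / c             # only then can it compute y
    print("x =", x, "y =", y)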

Which way to go?

• Divide data evenly between machines −→ load balancing

• Reduce data dependencies −→ good partitioning (geometric, for example)

• Make the communication-free parts of the calculations as large as possible −→ algorithms with large granularity

These requirements are not easily achievable. Some problems are easily parallelisable, others are not.

15.2 Assessing parallel programs


15.2.1 Speedup

    S(N, P) := t_seq(N) / t_par(N, P)

• t_seq(N) – time for solving the problem with the best known sequential algorithm

– =⇒ the sequential algorithm often differs from the parallel one!

– If the same parallel algorithm is timed on just one processor (despite the existence of a better sequential algorithm), the corresponding ratio is called relative speedup

• 0 < S(N, P) ≤ P

• If S(N, P) = P, the parallel program has linear or optimal speedup. (Example: calculating π with a parallel quadrature formula)

• Sometimes it may happen that S(N, P) > P. How is that possible? Due to processor cache effects – this is called superlinear speedup

• But sometimes S(N, P) < 1 – slowdown instead of speedup! What does this mean?

15.2.2 Efficiency

    E(N, P) := t_seq(N) / (P · t_par(N, P))

Presumably, 0 < E(N, P) ≤ 1.
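
A tiny illustration of both definitions (the timings are made-up numbers, not measurements):

# Illustrative speedup and efficiency computation; t_seq and t_par are assumed values.
t_seq = 12.0             # best sequential time for problem size N (seconds)
t_par = 2.0              # parallel time with P processes (seconds)
P = 8

S = t_seq / t_par        # speedup    S(N, P)
E = t_seq / (P * t_par)  # efficiency E(N, P) = S(N, P) / P
print(S, E)              # 6.0 0.75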

15.2.3 Amdahl’s law


Each algorithm has some part(s) that cannot be parallelised

• Let σ (0 < σ ≤ 1) denote the sequential part of a parallel program, i.e. the part that cannot be parallelised

• Assume the remaining 1 − σ part is parallelised optimally; then

    S(N, P) = t_seq / ((σ + (1 − σ)/P) · t_seq) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.

If e.g. 5% of the algorithm is not parallelisable (i.e. σ = 0.05), we get:

    P     S(N, P)
    2     1.9
    4     3.5
    10    6.9
    20    10.3
    100   16.8
    ∞     20

=⇒ using a large number of processors seems useless for gaining any reasonable speedup increase!
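
The table can be reproduced directly from the formula; a quick sketch:

# Amdahl's law: S(P) = 1 / (sigma + (1 - sigma)/P), here with sigma = 0.05.
sigma = 0.05
for P in (2, 4, 10, 20, 100):
    S = 1.0 / (sigma + (1.0 - sigma) / P)
    print(P, round(S, 1))
# In the limit P -> infinity, S -> 1/sigma = 20.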

15.2.4 Validity of Amdahl’s law


John Gustafson & Ed Barsis (Sandia Laboratories): a 1024-processor nCube/10 beat Amdahl's law! Is that possible at all?
In their problem they had: σ ≈ 0.004...0.008
They got S ≈ 1000
(According to Amdahl's law, S should have been only 125...250 !!!)
Does Amdahl's law hold? Mathematically, Amdahl's law holds, of course.
Does it make sense to solve a problem with fixed problem size N on an arbitrarily large number of processors? (What if there are only 1000 operations in the whole calculation – does using 1001 processors help? ;-)

• The point is that usually σ is not constant, but decreases as N grows

• An algorithm is said to be efficiently parallel if σ → 0 as N → ∞



To avoid problems with these terms, the term scaled efficiency is often used.

15.2.5 Scaled efficiency

    E_S(N, P) := t_seq(N) / t_par(P · N, P)

• Does the solution time remain the same when the problem size grows proportionally with P?

• 0 < E_S(N, P) ≤ 1

• If E_S(N, P) = 1, we have linear speedup



16 Parallel preconditioners

16.1 Comparing different solution algorithms


Poisson problem:

Method       Memory    Flops      Time [#IT], n = 200   growth   Time [#IT], n = 400   Parallel P = 8, n = 400   Speedup
LU           O(n^4)    O(n^6)     -                     -        -                     -                         -
Banded*)     O(n^3)    O(n^4)     3.6s                  ×25.1    90.4s                 92.4s                     0.98
Jacobi       O(n^2)    O(n^4)     > 1000s               -        > 1000s               > 1000s                   -
CG           O(n^2)    O(n^3)     3.8s [357]            ×7.9     30.1s [702]           6.1s                      4.9
PCG(ILU)     O(n^2)    O(n^3)     2.9s [147]            ×6.8     19.6s [249]           3.9s                      5.0
DOUG*)       O(n^2)    O(n^7/3)   6.5s [15]             ×4.9     32.1s [17]            4.3s                      7.5
BoomerAMG    O(n^2)    O(n^2)     1.8s [3]              ×4       7.3s [3]              4.1s                      1.8

*) Time taken on a 1.5-2 times slower machine

• In all cases the stopping criterion was chosen as ε := 10^-8



• BoomerAMG is a parallel multigrid algorithm (Lawrence Livermore Nat. Lab., CA, Hypre, http://www.llnl.gov/CASC/hypre/). For the Poisson equation solver the multigrid preconditioner is optimal (regarding flops versus the number of unknowns)

• DOUG – Domain Decomposition on Unstructured Grids (University of Tartu and University of Bath (UK)), http://dougdevel.org

16.2 PCG
Recall the CG method:
 
Calculate r(0) = b − A x(0) with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)    # M^{-1} is called the preconditioner
    ρ_{i−1} = r(i−1)^T z(i−1)
    if i == 1
        p(1) = z(0)
    else
        β_{i−1} = ρ_{i−1} / ρ_{i−2}
        p(i) = z(i−1) + β_{i−1} p(i−1)
    endif
    q(i) = A p(i) ;  α_i = ρ_{i−1} / (p(i)^T q(i))
    x(i) = x(i−1) + α_i p(i) ;  r(i) = r(i−1) − α_i q(i)
    check convergence ; continue if needed
end
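
A straightforward serial NumPy sketch of the same iteration (apply_Minv is a function applying M^{-1} to a vector; the names and the Jacobi example at the end are illustrative assumptions):

# Serial sketch of the PCG iteration above; `apply_Minv(r)` returns M^{-1} r.
import numpy as np

def pcg(A, b, x0, apply_Minv, tol=1e-8, maxit=1000):
    x = x0.copy()
    r = b - A @ x                     # r(0) = b - A x(0)
    rho_old = None
    for i in range(1, maxit + 1):
        z = apply_Minv(r)             # solve M z = r
        rho = r @ z
        if i == 1:
            p = z.copy()
        else:
            beta = rho / rho_old
            p = z + beta * p
        q = A @ p
        alpha = rho / (p @ q)
        x += alpha * p
        r -= alpha * q
        rho_old = rho
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):   # convergence check
            break
    return x, i

# Example use with a simple Jacobi (diagonal) preconditioner:
#   x, nit = pcg(A, b, np.zeros_like(b), lambda r: r / np.diag(A))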


16.3 Block-Jacobi method

[Figure: domain divided into two subdomains Ω1 and Ω2]

    M = [ M1   0  ]
        [ 0    M2 ]

where M1 and M2 are preconditioners for the diagonal blocks A11 and A22 of the matrix

    A = [ A11  A12 ]
        [ A21  A22 ]

(For example, M1 = A11^{-1}, M2 = A22^{-1}.)
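
A small sketch of applying such a block-Jacobi preconditioner in the "solve M z = r" step of PCG (splitting the unknowns into two halves is an illustrative assumption):

# Sketch: block-Jacobi preconditioner solve  M z = r  with M = diag(A11, A22).
# The split of the unknowns into two halves is an assumption for illustration.
import numpy as np

def block_jacobi_solve(A, r):
    n = A.shape[0]
    i1, i2 = np.arange(n // 2), np.arange(n // 2, n)    # index sets of the two blocks
    z = np.zeros_like(r)
    z[i1] = np.linalg.solve(A[np.ix_(i1, i1)], r[i1])   # z1 = A11^{-1} r1
    z[i2] = np.linalg.solve(A[np.ix_(i2, i2)], r[i2])   # z2 = A22^{-1} r2
    return z

# Used with the PCG sketch above:  pcg(A, b, x0, lambda r: block_jacobi_solve(A, r))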

16.4 Overlapping Additive Schwarz method

[Figure: domain divided into two overlapping subdomains Ω1 and Ω2]

    M = [ M1   0  ]
        [ 0    M2 ]

Both of the last examples are Domain Decomposition methods

• The Schwarz method

– dates back to the year 1870

– was meant for solving a PDE on a region composed of two regions

16.5 Domain Decomposition Method (DDM)


Domain Decomposition – a class of parallelisation methods where the solution domain is partitioned into subdomains.

Partitioning examples: [figures showing several example partitionings]

DDM Classification

• Non-overlapping methods

– Block-Jacobi method
– Additive Average method

• Overlapping methods

– Additive Schwarz method


– Substructuring method

• One-level method / Two-level (or multiple-level) methods



Non-overlapping, Block-Jacobi method:

[Figure: 9 × 9 grid of nodes numbered 1-81, divided into non-overlapping subdomains]

Method with minimal overlap (optimal overlap):

[Figure: 3 × 3 array of subdomains Ω1, ..., Ω9 with minimal overlap]

Subdomain restriction operators


The restriction operator R_i is a matrix which, applied to a global vector x, returns the components x_i = R_i x belonging to subdomain Ω_i.
Define

    M_i^{-1} := R_i^T (R_i A R_i^T)^{-1} R_i,   i = 1, ..., P

The one-level Additive Schwarz preconditioner is defined by

    M_{1AS}^{-1} := ∑_{i=1}^{P} M_i^{-1}
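
Expressed with index sets instead of explicit restriction matrices, a minimal sketch of applying M_{1AS}^{-1} to a vector (the overlapping index sets are assumed to be given):

# Sketch: apply the one-level Additive Schwarz preconditioner
#   M_{1AS}^{-1} r = sum_i R_i^T (R_i A R_i^T)^{-1} R_i r,
# where each R_i is represented by the index set of the (possibly overlapping) subdomain.
import numpy as np

def additive_schwarz_solve(A, r, subdomains):
    """subdomains: list of integer index arrays, one per Omega_i."""
    z = np.zeros_like(r)
    for idx in subdomains:
        Ai = A[np.ix_(idx, idx)]               # R_i A R_i^T
        z[idx] += np.linalg.solve(Ai, r[idx])  # accumulate R_i^T (R_i A R_i^T)^{-1} R_i r
    return z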

16.6 Multilevel methods


Introduce a coarse grid

Coarse grid Restriction and Interpolation operators


The restriction operator R_0 is defined as the transpose of the interpolation operator R_0^T, which interpolates values linearly (or bilinearly) from the coarse grid to the fine grid nodes.
The coarse grid matrix A_0 is

• assembled

– analytically, or
– through the formula A_0 = R_0 A R_0^T

• Define

    M_0^{-1} := R_0^T A_0^{-1} R_0 = R_0^T (R_0 A R_0^T)^{-1} R_0

The two-level Additive Schwarz method is defined by

    M_{2AS}^{-1} := ∑_{i=0}^{P} M_i^{-1}
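
Building on the one-level sketch above, a possible sketch of adding the coarse-grid term (R0, the restriction to the coarse grid, is assumed to be given as a matrix):

# Sketch: two-level Additive Schwarz apply = one-level sum + coarse-grid correction.
import numpy as np

def two_level_as_solve(A, r, subdomains, R0):
    z = additive_schwarz_solve(A, r, subdomains)   # terms i = 1, ..., P (see sketch above)
    A0 = R0 @ A @ R0.T                             # Galerkin coarse matrix A0 = R0 A R0^T
    z += R0.T @ np.linalg.solve(A0, R0 @ r)        # coarse term M0^{-1} r = R0^T A0^{-1} R0 r
    return z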

Requirements on the coarse grid:

• The coarse grid should cover all the fine grid nodes

• Each coarse grid cell must contain fine grid nodes

• If the fine grid has uneven density, the coarse grid must adapt to the fine grid in terms of the number of nodes in each cell

If all the requirements are fulfilled, it can be shown that

• the condition number κ(B) of the Additive Schwarz method does not depend on the discretisation parameter n.

16.7 Parallelising PCG


In addition to the parallel preconditioner (given above) we need to parallelise:

• the dot product operation

• the Ax-operation

The first is easily done with MPI_ALLREDUCE (attention is needed only in the case of overlaps); see the sketch below.
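
A sketch of the parallel dot product (each process is assumed to hold its local, non-overlapping part of the vectors in x_loc and y_loc; the data here is random just for illustration):

# Sketch: global dot product over distributed vectors via Allreduce (MPI_ALLREDUCE).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
x_loc = np.random.rand(100)       # local parts owned by this process (illustrative data)
y_loc = np.random.rand(100)

local_dot = np.dot(x_loc, y_loc)
global_dot = comm.allreduce(local_dot, op=MPI.SUM)   # same value on every process
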
Parallelising the Ax-operation depends on the existence of overlap.
In the case of no overlap (as in the Block-Jacobi method), the shadow-node technique is usually used:

[Figure: fine grid nodes near the boundary between two subdomains; e.g. nodes 3, 12, 21 belong to process P0 and nodes 4, 13, 22 to process P1 (global numbering), each side keeping shadow copies of the other's boundary nodes]

Before the Ax-operation, neighbours exchange values at the nodes for which the matrix A has elements A_ij ≠ 0 with i ∈ Ω_k, j ∈ Ω_l, k ≠ l. For example, process P1 sends the values at nodes 4, 13, 22 (in global numbering) to process P0 and receives the values at nodes 3, 12 and 21 (saving these into shadow-node variables).
Non-blocking communication can be used!
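
A sketch of such an exchange followed by the local matrix-vector product (the neighbour list, the per-neighbour index arrays send_idx/recv_idx and the local matrix A_loc are placeholders for what the real code sets up; A_loc is assumed to have its shadow columns last):

# Sketch: exchange shadow-node values with non-blocking calls, then do the local A x.
from mpi4py import MPI
import numpy as np

def parallel_matvec(comm, A_loc, x_loc, n_shadow, neighbours, send_idx, recv_idx):
    sendbufs, recvbufs, reqs = {}, {}, []
    for p in neighbours:
        sendbufs[p] = np.ascontiguousarray(x_loc[send_idx[p]])   # values the neighbour needs
        recvbufs[p] = np.empty(len(recv_idx[p]))                 # room for its values
        reqs.append(comm.Isend([sendbufs[p], MPI.DOUBLE], dest=p, tag=1))
        reqs.append(comm.Irecv([recvbufs[p], MPI.DOUBLE], source=p, tag=1))
    MPI.Request.Waitall(reqs)

    x_shadow = np.empty(n_shadow)
    for p in neighbours:
        x_shadow[recv_idx[p]] = recvbufs[p]                      # fill the shadow nodes
    return A_loc @ np.concatenate([x_loc, x_shadow])             # local rows of A times (own + shadow) values
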
RealKind.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/RealKind.f90.html)
sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/sparse_mat.f90.html)
par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_mat.f90.html)
par_sparse_lahenda.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_lahenda.f90.html)
