Lec 9 DR Marwa Abbas
MPI
▪ Fortran and C interfaces (C++ deprecated)
▪ Features for supporting parallel libraries
[Figure: software stack – parallel libraries, drivers, hardware]
▪ Library:
  ▪ C: error = MPI_Xxxx(...);
  ▪ C routines return an int – may be ignored
▪ Arrays:
  ▪ C: indexed from 0

int MPI_Init(int *argc, char ***argv);
int MPI_Finalize();
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
▪ Running
  ▪ Details are implementation specific
  ▪ Startup wrappers: mpirun, mpiexec
Compiling and running the code
▪ Compiling/linking
  ▪ Headers and libs must be found by the compiler
  ▪ Most implementations provide wrapper scripts, e.g., mpicc / mpiCC
  ▪ Behave like normal compilers/linkers
▪ Running
  ▪ Details are implementation specific
  ▪ Startup wrappers: mpirun, mpiexec

$ mpiCC -o hello hello.cc
$ mpirun -np 4 ./hello
Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4
Hello World! I am 2 of 4
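The source of the hello program is not shown on the slide; a minimal sketch that would produce output like the above (the ordering of the lines is nondeterministic) and uses only the calls introduced so far:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start up the MPI environment   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank id         */
    printf("Hello World! I am %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down MPI                  */
    return 0;
}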
MPI point-to-point communication
Point-to-point communication: message envelope
▪ Which process is sending the message?
▪ Where is the data on the sending process?
▪ What kind of data is being sent?
▪ How much data is there?
[Figure: a message travels from the sender (rank i) to the receiver (rank j)]
▪ Data types
  ▪ Basic (e.g., a character: 8 binary digits)
  ▪ MPI derived types
At completion
▪ Send buffer can be reused as you see fit
▪ Status of destination is unknown – the message could be anywhere
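To make the envelope concrete, a hedged sketch of a matching send/receive pair (rank 0 sends one double to rank 1; variable names are illustrative, and rank is assumed to have been obtained with MPI_Comm_rank in an initialized MPI program):

double payload = 3.14;      /* data on the sending process */
int tag = 42;               /* user-chosen message tag     */
if (rank == 0) {
    /* envelope: destination rank 1, tag 42, communicator MPI_COMM_WORLD */
    MPI_Send(&payload, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
} else if (rank == 1) {
    /* envelope must match: source rank 0, same tag, same communicator */
    MPI_Recv(&payload, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}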
Buffered send
▪ MPI_Bsend(buf, count, datatype, dest, tag, comm) – blocking
[Figure: Rank 0 (app/MPI) and Rank 1 (MPI/app) – Rank 0 calls Bsend; MPI copies the message into a buffer and the call completes; the actual message transfer to Rank 1 happens once the receive has been posted]
▪ Send completes when the message has been copied
  ▪ Predictable & no synchronization
  ▪ Caveat: comes at the cost of additional copy operations

Synchronous send
▪ MPI_Ssend – blocking
[Figure: Rank 0 posts the send and blocks; the send only completes after Rank 1 has posted the matching receive and the message transfer has taken place]
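A hedged sketch of how a buffered send is typically set up: the user attaches a buffer large enough for the message plus MPI_BSEND_OVERHEAD (this fragment assumes an initialized MPI program and a matching receive posted at rank 1; names are illustrative):

#include <mpi.h>
#include <stdlib.h>

double data[100];
int pack_size, buf_size;
void *bsend_buf, *detached;

/* how much buffer space does one message of 100 doubles need? */
MPI_Pack_size(100, MPI_DOUBLE, MPI_COMM_WORLD, &pack_size);
buf_size = pack_size + MPI_BSEND_OVERHEAD;

bsend_buf = malloc(buf_size);
MPI_Buffer_attach(bsend_buf, buf_size);   /* MPI copies messages into this buffer */

/* completes as soon as the message has been copied into the attached buffer */
MPI_Bsend(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

MPI_Buffer_detach(&detached, &buf_size);  /* blocks until buffered sends are done */
free(detached);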
▪ Actual source and tag values are available in the status object:
MPI_Status s;
MPI_Recv(buf, count, datatype, MPI_ANY_SOURCE,
         MPI_ANY_TAG, MPI_COMM_WORLD, &s);
printf("Received from rank %d with tag %d\n",
       s.MPI_SOURCE, s.MPI_TAG);

int count;                              // number of elements actually received
MPI_Get_count(&s, MPI_DOUBLE, &count);
▪ Send/receive buffer may safely be reused after the call has completed
▪ MPI_Send() must have a specific target/tag, MPI_Recv() does not
▪ So far no explicit synchronization!
Example: parallel integration in MPI

Task: calculate ∫ f(x) dx over [a,b] using the (existing) function integrate(x,y)
▪ Split up interval [a,b] into equal disjoint chunks
▪ Compute partial results in parallel
▪ Collect global sum at rank 0

int rank, size;
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// integration limits
double a=0., b=2., res=0., tmp;
// limits for "me"
double mya = a + rank * (b-a)/size;
double myb = mya + (b-a)/size;
// integrate f(x) over my own chunk
double psum = integrate(mya, myb);

// rank 0 collects partial results
if (0 == rank) {
  res = psum;                    // local result
  for (int i = 1; i < size; ++i) {
    MPI_Recv(&tmp,               // receive buffer
             1,                  // array length
             MPI_DOUBLE,         // data type
             i,                  // rank of source
             0,                  // tag (unused here)
             MPI_COMM_WORLD,
             &status);           // status object
    res += tmp;
  }
  printf("Result: %f\n", res);
} else { // ranks != 0 send results to rank 0
  MPI_Send(&psum,                // send buffer
           1,                    // message length
           MPI_DOUBLE,           // data type
           0,                    // rank of destination
           0,                    // tag (unused here)
           MPI_COMM_WORLD);
}
MPI_Finalize();
Example: parallel integration in MPI
▪ Gathering results from processes is a very common task in MPI – there are more efficient and elegant ways to do this (see later).
▪ This is a reduction operation (summation). There are more efficient and elegant ways to do this (see later).
▪ The "master" process waits for one receive operation to be completed before the next one is initiated. There are more efficient ways...
▪ Every process has its own res variable, but only the master process actually uses it → it is typical for MPI codes to use more memory than actually needed.
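As the first two remarks suggest, the manual receive loop can be replaced by a single collective reduction. A hedged sketch of the same summation using MPI_Reduce (introduced later in the lecture), reusing the variables of the example above:

// each rank contributes psum; rank 0 receives the element-wise sum in res
MPI_Reduce(&psum,            // send buffer (every rank)
           &res,             // receive buffer (significant only on root)
           1,                // one double per rank
           MPI_DOUBLE,
           MPI_SUM,          // reduction operator
           0,                // root rank
           MPI_COMM_WORLD);
if (rank == 0) printf("Result: %f\n", res);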
Barrier
MPI_Barrier(comm);
▪ Ranks only return from the call after every rank has called the function
Broadcast
count = 3
MPI_Bcast(buffer, 3, MPI_INT, 1, MPI_COMM_WORLD)
[Figure: root rank 1 broadcasts its int buffer {1, 2, 3} to ranks 0–3; afterwards every rank's buffer holds 1 2 3]
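A short sketch matching the figure (buffer contents and root rank taken from the example above; rank is assumed to be the result of MPI_Comm_rank):

int buffer[3];
if (rank == 1) { buffer[0] = 1; buffer[1] = 2; buffer[2] = 3; }  /* root fills the data */
/* after the call, every rank in MPI_COMM_WORLD holds {1, 2, 3} in buffer */
MPI_Bcast(buffer, 3, MPI_INT, 1, MPI_COMM_WORLD);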
Scatter / Gather
[Figure: before/after view of rooted collectives over ranks 0–3 – in Scatter, the root's sendbuf {1, 2, 3, 4} is split so each rank receives one int element in its recvbuf; in Gather, one element from each rank is collected into the root's recvbuf {1, 2, 3, 4}]
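A hedged sketch of the scatter/gather pattern the figure appears to show (root rank 0, one int per rank, four ranks; variable names are illustrative):

int sendbuf[4] = {1, 2, 3, 4};   /* significant only on the root            */
int elem;                        /* each rank receives one element          */
int gathered[4];                 /* significant only on the root            */

/* root 0 distributes sendbuf: rank i ends up with sendbuf[i] in elem */
MPI_Scatter(sendbuf, 1, MPI_INT, &elem, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* the inverse operation: one element from every rank is collected on root 0 */
MPI_Gather(&elem, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);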
Allgather
[Figure: MPI_Allgather over ranks 0–3 – each rank contributes one element (rank i sends i, sendcount = recvcount = 1); afterwards every rank's recvbuf holds {0, 1, 2, 3}]
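A sketch matching the figure above (it assumes the four-rank setting shown; rank comes from MPI_Comm_rank):

int myval = rank;     /* each rank contributes its own rank id          */
int all[4];           /* one element from every rank                    */

/* after the call, all[] = {0, 1, 2, 3} on every rank */
MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);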
Alltoall
[Figure: MPI_Alltoall over ranks 0–3 – rank i's sendbuf holds {4i, 4i+1, 4i+2, 4i+3}; each rank sends its j-th element to rank j (sendcount = recvcount = 1), so rank j ends up with recvbuf {j, j+4, j+8, j+12}, i.e., a transpose of the data]
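A sketch reproducing the figure's data layout (again assuming four ranks; variable names are illustrative):

int sendbuf[4], recvbuf[4];
for (int j = 0; j < 4; ++j)
    sendbuf[j] = 4 * rank + j;   /* rank 0: 0..3, rank 1: 4..7, ...          */

/* element j of sendbuf goes to rank j; rank i receives {i, i+4, i+8, i+12} */
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);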
Reduction
[Figure: MPI_Reduce with op = MPI_MAX and count = 4 – max() is applied element-wise across all ranks' send buffers, and the root's recvbuf ends up as {8, 9, 6, 8}]
▪ If all ranks require the result, use MPI_Allreduce()
▪ If the 12 predefined ops are not enough, use MPI_Op_create / MPI_Op_free to create your own
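A hedged sketch of the element-wise maximum shown in the figure, using MPI_Allreduce so that every rank (not just the root) receives the result; the buffer contents are illustrative:

int local[4];   /* each rank's contribution                    */
int gmax[4];    /* element-wise maximum over all ranks         */

/* gmax[k] = max over all ranks of local[k]; every rank gets the result */
MPI_Allreduce(local, gmax, 4, MPI_INT, MPI_MAX, MPI_COMM_WORLD);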
Global operations – predefined operators
Name Operation Name Operation
MPI_SUM Sum MPI_PROD Product
MPI_MAX Maximum MPI_MIN Minimum
MPI_LAND Logical AND MPI_BAND Bit-AND
MPI_LOR Logical OR MPI_BOR Bit-OR
MPI_LXOR Logical XOR MPI_BXOR Bit-XOR
MPI_MAXLOC Maximum+Position MPI_MINLOC Minimum+Position
▪ Timing functions
▪ Global operations
▪ Reduce, allreduce, there are more…
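Regarding the timing functions mentioned above, a minimal sketch using MPI_Wtime() (wall-clock seconds since an arbitrary point in the past) and MPI_Wtick() (timer resolution); do_work() is a hypothetical function being timed:

double t0 = MPI_Wtime();     /* wall-clock time stamp before the work        */
do_work();                   /* hypothetical function whose runtime we measure */
double t1 = MPI_Wtime();
printf("rank %d: elapsed %.6f s (tick = %.3g s)\n",
       rank, t1 - t0, MPI_Wtick());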