Unit 3 - Parallel Communication
College of Engineering
Subject :- High Performance Computing
Unit 3
Parallel Communication
By- Prof. Gunjan Deshmukh
SNJB’s KBJ CoE | Civil | Computer | E&TC | Mechanical | AIDS | MBA | Visit Us @: www.snjb.org
SNJB’s Late Sau. K. B. J. College of Engineering
Syllabus
1. Basic Communication:
2. One-to-All Broadcast
3. All-to-One Reduction
4. All-to-All Broadcast and Reduction
5. All-Reduce and Prefix-Sum Operations
6. Collective Communication using MPI:
7. Scatter
8. Gather
9. Blocking and non blocking MPI
10. All-to-All Personalized Communication
11. Circular Shift
12. Improving the speed of some communication operations.
Basic Communication
● Many interactions in practical parallel programs occur in well-defined patterns involving
groups of processors.
● Group communication operations are built using point-to-point messaging primitives.
● Communicating a message of size m (i.e., m words) over an uncongested network takes time
t_s + t_w m, where t_s is the startup time and t_w is the per-word transfer time.
Broadcast and Reduction: Example
● Reduction on an eight-node ring with node 0 as the destination
of the reduction.
In the final step, the results of these products are accumulated to the first row using n concurrent
all-to-one reduction operations along the columns (using the sum operation).
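The eight-node reduction with node 0 as the destination can be sketched as a small simulation. This is only an illustration under assumptions: it uses one common scheme (pairs at distance 1, then 2, then 4 combine their partial results), assumes the number of nodes is a power of two, and the name `ring_reduce` is hypothetical rather than taken from the slides.

```python
import operator


def ring_reduce(values, op=operator.add):
    """Simulate an all-to-one reduction onto node 0.

    values[i] is the value initially held by node i.
    Returns (result_at_node_0, steps), where steps[k] lists the
    (sender, receiver) pairs that communicate in step k.
    """
    p = len(values)
    # Assumption of this sketch: p is a power of two.
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"

    partial = list(values)
    steps = []
    dist = 1
    while dist < p:
        sends = []
        # Nodes dist, 3*dist, 5*dist, ... send their partial result
        # to the node dist positions to their left.
        for src in range(dist, p, 2 * dist):
            dst = src - dist
            partial[dst] = op(partial[dst], partial[src])
            sends.append((src, dst))
        steps.append(sends)
        dist *= 2
    return partial[0], steps


total, steps = ring_reduce(list(range(8)))
print(total)       # sum of 0..7
print(len(steps))  # number of communication steps
```

With eight nodes the loop runs three times, which matches the log p steps usually quoted for this operation.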
Broadcast and Reduction on a Mesh: Example
● Performed in two phases. In the first phase, each row of the mesh performs an all-to-all
broadcast using the procedure for the linear array.
● In this phase, all nodes collect the √p messages corresponding to the √p nodes of their
respective rows. Each node consolidates this information into a single message of size m√p.
● The second communication phase is a columnwise all-to-all broadcast of the consolidated
messages.
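The two phases described above can be sketched as a simulation on a q x q mesh (q = √p). The sketch makes some assumptions that are not in the slides: each row and column is treated as a ring with wraparound, messages are represented by the integer id of the node that created them, and the function names are hypothetical.

```python
def ring_all_to_all_broadcast(msgs):
    """msgs[i] is the list of items held by node i of a ring.

    In each of the q - 1 steps every node passes on the message it
    received most recently, so in the end every node holds everything.
    """
    q = len(msgs)
    collected = [list(m) for m in msgs]
    in_flight = [list(m) for m in msgs]
    for _ in range(q - 1):
        # Node i receives what node i - 1 sent in this step.
        in_flight = [in_flight[(i - 1) % q] for i in range(q)]
        for i in range(q):
            collected[i].extend(in_flight[i])
    return collected


def mesh_all_to_all_broadcast(q):
    """Simulate all-to-all broadcast on a q x q mesh."""
    # Node (r, c) starts with a single message, identified by r * q + c.
    mesh = [[[r * q + c] for c in range(q)] for r in range(q)]

    # Phase 1: row-wise all-to-all broadcast.
    for r in range(q):
        mesh[r] = ring_all_to_all_broadcast(mesh[r])

    # Phase 2: column-wise all-to-all broadcast of the consolidated
    # messages (each now contains q original messages).
    for c in range(q):
        column = [mesh[r][c] for r in range(q)]
        result = ring_all_to_all_broadcast(column)
        for r in range(q):
            mesh[r][c] = result[r]
    return mesh


mesh = mesh_all_to_all_broadcast(3)
print(sorted(mesh[0][0]))  # every node ends up with all 9 messages
```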
All-to-all Broadcast on a Mesh
All-to-all broadcast on a Hypercube
● Generalization of the mesh algorithm to log p dimensions.
● Message size doubles at each of the log p steps.
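The doubling of the message size can be seen in a short simulation. This is a sketch under assumptions: p is a power of two, messages are represented by node ids, and in each step a node exchanges everything it holds with the partner whose label differs in one bit. The function name is hypothetical.

```python
def hypercube_all_to_all_broadcast(p):
    """Simulate all-to-all broadcast on a p-node hypercube.

    Returns (held, sizes): held[i] is the list of messages at node i
    at the end, sizes[k] is the number of messages each node sends
    in step k.
    """
    # Assumption of this sketch: p is a power of two.
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"

    held = [[i] for i in range(p)]
    sizes = []
    d = 1
    while d < p:
        sizes.append(len(held[0]))
        # Partner of node i along this dimension is i XOR d; both
        # nodes end up with the concatenation of the two buffers.
        held = [held[i] + held[i ^ d] for i in range(p)]
        d *= 2
    return held, sizes


held, sizes = hypercube_all_to_all_broadcast(8)
print(sizes)  # message size (in units of m) sent in each step
```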
All-Reduce and Prefix-Sum Operations
● All-Reduce and Prefix-Sum are two common operations in High-Performance Computing (HPC) that are
used to efficiently communicate data and perform computations among multiple processes in a
parallel program.
● All-Reduce: All-Reduce is a collective operation in which all processes in a system contribute a value,
which is then combined using a reduction operation such as sum, product, or maximum, and the
result is then distributed to all processes.
● This is useful when each process has a local value that needs to be combined with the values from all
the other processes.
● All-Reduce can be used to implement a variety of parallel algorithms, including parallel sorting,
matrix multiplication, and graph algorithms.
● For example, consider a parallel program that is computing the dot product of two large vectors. Each
process has a subset of the vectors and computes the partial dot product of its subset. An All-Reduce
with the sum operation then combines the partial results, so every process ends up with the full dot product.
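The dot-product example can be sketched without an MPI runtime by simulating the processes in one Python program. The helper names `all_reduce_sum` and `parallel_dot` are hypothetical, and the block partitioning of the vectors is an assumption of this sketch; in a real MPI program the combining step would be a single collective call.

```python
def all_reduce_sum(local_values):
    """Simulated All-Reduce with the sum operation.

    local_values[r] is the value contributed by process r; every
    process receives the same combined result.
    """
    total = sum(local_values)
    return [total] * len(local_values)


def parallel_dot(x, y, p):
    """Dot product of x and y split across p simulated processes."""
    n = len(x)
    chunk = (n + p - 1) // p  # block size, rounded up
    partials = []
    for rank in range(p):
        lo, hi = rank * chunk, min((rank + 1) * chunk, n)
        # Each process only touches its own slice of the vectors.
        partials.append(sum(a * b for a, b in zip(x[lo:hi], y[lo:hi])))
    # After the All-Reduce every process holds the full dot product.
    return all_reduce_sum(partials)


print(parallel_dot([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], p=3))
```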
● Prefix-Sum: Prefix-Sum (also called scan) is a parallel computation of the cumulative sums of a
sequence of values: the i-th element of the result is the sum of the first i elements of the
original sequence.
● Prefix-Sum is useful in a variety of parallel algorithms, including parallel sorting, discrete Fourier
transform, and graph algorithms.
● For example, consider a parallel program that is computing the prefix sum of a large array. Each
process has a subset of the array, and needs to compute the prefix sum of its subset.
● This can be done using Prefix-Sum, where each process first computes the local prefix sum of its
subset. The subset totals are then combined across processes (for example with a binary tree
algorithm) so that each process obtains the sum of all preceding subsets, which it adds to its local results.
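The block-wise prefix sum described above can be sketched as follows. This is a sequential simulation under assumptions: `blocks[r]` stands for the part of the array owned by process r, the function name is hypothetical, and the cross-process step is written as a plain exclusive scan over the block totals rather than a tree algorithm.

```python
from itertools import accumulate


def block_prefix_sum(blocks):
    """Prefix sum of an array that is split into per-process blocks."""
    # Step 1: every process computes the prefix sum of its own block.
    local = [list(accumulate(b)) for b in blocks]

    # Step 2: exclusive scan over the block totals. offsets[r] is the
    # sum of all elements owned by processes 0 .. r - 1.
    totals = [l[-1] if l else 0 for l in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Step 3: every process adds its offset to its local results.
    return [[v + off for v in l] for l, off in zip(local, offsets)]


print(block_prefix_sum([[1, 2], [3, 4], [5, 6]]))
```

Only one number per process (the block total) takes part in the communication step, which is why the local scan is done first.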
Prefix-Sum Operations Example
Scatter
● Scatter is a communication operation in which a single process distributes data to all other processes in
a parallel program.
● In Scatter, the data is partitioned into equal-sized smaller chunks, and each chunk is sent to a
different process.
● Scatter is often used when a single process has a large amount of data that needs to be distributed
among all the processes in the system.
● For example, consider a parallel program that is reading data from a file, and needs to distribute the data
among all the processes in the system for processing. The data can be partitioned into smaller chunks, and
then each chunk can be sent to a different process using Scatter.
● Scatter is implemented using point-to-point communication, where the sender process sends a message
to each receiver process, and each receiver process receives a message from the sender process.
● Scatter is often used in combination with other communication operations such as Gather, All-to-All, and
All-Reduce to efficiently exchange data and perform computations among multiple processes in a parallel
program.
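The partitioning into equal-sized chunks can be sketched as a simulation in which the return value lists what each process would receive. The function name `scatter` is hypothetical, and the sketch assumes the data length is a multiple of the number of processes. Note that in MPI's own scatter the root process also keeps one chunk for itself, which is the convention used here.

```python
def scatter(data, p):
    """Split data into p equal chunks; chunk r goes to process r."""
    if p <= 0 or len(data) % p != 0:
        # Assumption of this sketch: equal-sized chunks only.
        raise ValueError("data length must be a multiple of p")
    chunk = len(data) // p
    return [data[r * chunk:(r + 1) * chunk] for r in range(p)]


received = scatter(list(range(8)), p=4)
for rank, chunk in enumerate(received):
    print(rank, chunk)
```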
Gather
● Gather is a communication operation in which all processes in a parallel program send their local data
to a single process for aggregation.
● In Gather, the data is partitioned into smaller chunks, and each chunk is sent from a different process
to a single process that aggregates the data.
● Gather is often used when all processes in the system generate local data that needs to be combined
into a single dataset for analysis or output.
● Gather is implemented using point-to-point communication, where each sender process sends a
message to the receiver process, and the receiver process receives messages from all the sender
processes.
● It is often used in combination with other communication operations such as Scatter, All-to-All, and
All-Reduce to efficiently exchange data and perform computations among multiple processes in a
parallel program.
Scatter and Gather
● In the scatter operation, a single node sends a unique message of size m to every other node (also
called a one-to-all personalized communication).
● In the gather operation, a single node collects a unique message from each node.
● While the scatter operation is fundamentally different from broadcast, the algorithmic structure is
similar, except for differences in message sizes (messages get smaller in scatter and stay constant in
broadcast).
● The gather operation is exactly the inverse of the scatter operation and can be executed as such.
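The shrinking message sizes in scatter can be illustrated with a simulation of a broadcast-like scheme on a hypercube. This is a sketch under assumptions: node 0 is the source, p is a power of two, and in every step each node that holds data passes on the half destined for the other half of its subcube. The function name is hypothetical. Running the same steps in reverse order, with the direction of every transfer flipped, would give the corresponding gather.

```python
def hypercube_scatter(messages):
    """Simulate a scatter from node 0 on a hypercube.

    messages[j] is the message destined for node j.
    Returns (received, sizes): received[j] is what node j ends up
    with, sizes[k] is the number of messages per transfer in step k.
    """
    p = len(messages)
    # Assumption of this sketch: p is a power of two.
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"

    # held[node] maps destination -> message currently stored at node.
    held = {0: dict(enumerate(messages))}
    sizes = []
    d = p // 2
    while d >= 1:
        for node in list(held):
            # Messages whose destination has bit d set go to the partner.
            outgoing = {dst: m for dst, m in held[node].items() if dst & d}
            held[node] = {dst: m for dst, m in held[node].items() if not dst & d}
            held[node ^ d] = outgoing
        sizes.append(len(outgoing))
        d //= 2
    return [held[j][j] for j in range(p)], sizes


received, sizes = hypercube_scatter(list("abcdefgh"))
print(received)
print(sizes)  # transfers get smaller in every step
```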
Example of the Scatter Operation
All-to-All Personalized Communication
● In all-to-all personalized communication, each node sends a distinct message of size m to every other node.
● Each node sends different messages to different nodes, unlike all-to-all broadcast, in which each node sends
the same message to all other nodes.
● This operation is equivalent to transposing a two-dimensional array of data distributed among p processes.
● All-to-all personalized communication is also known as total exchange.
● This operation is used in a variety of parallel algorithms such as fast Fourier transform, matrix transpose,
sample sort, and some parallel database join operations.
All-to-All Personalized Communication : Matrix transposition
● Consider an n x n matrix mapped onto n processors such that each processor contains one full row of the matrix.
● With this mapping, processor Pi initially contains the elements of the matrix with indices [i, 0], [i, 1], ..., [i, n - 1].
● After the transposition, element [i, 0] belongs to P0, element [i, 1] belongs to P1, and so on.
● In general, element [i, j] initially resides on Pi , but moves to Pj during the transposition.
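The row-per-processor transposition can be sketched as a simulated all-to-all personalized exchange. The function name is hypothetical, and the list `send[i]` stands for the row held by processor Pi, so `send[i][j]` is the message Pi addresses to Pj.

```python
def all_to_all_personalized(send):
    """Simulated total exchange among p processes.

    send[i][j] is the message from process i to process j.
    Returns recv, where recv[j][i] is the message process j
    received from process i.
    """
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]


# Processor Pi initially holds row i of the matrix.
rows = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# After the exchange, Pj holds column j, i.e. row j of the transpose.
print(all_to_all_personalized(rows))
```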
All-to-All Personalized Communication : Ring
● Each node first sends its data to its neighbor in one direction around the ring. The
receiving node keeps the part addressed to it and forwards the rest to its next neighbor in the same
direction, and this continues until every node in the ring has received a message from every other node.
● To perform this operation, every node sends p - 1 pieces of data, each of size m.
● These pieces of data are identified by pairs of integers of the form {i, j}, where i is the source of the
message and j is its final destination.
● First, each node sends all pieces of data as one consolidated message of size m(p - 1) to one of its
neighbors (all nodes communicate in the same direction).
● Of the m(p - 1) words of data received by a node in this step, one m-word packet belongs to it.
● Therefore, each node extracts the information meant for it from the data received, and forwards
the remaining (p - 2) pieces of size m each to the next node.
● This process continues for p - 1 steps. The total size of data being transferred between nodes decreases by
m words in each successive step. In every step, each node adds to its collection one m-word packet
originating from a different node.
● Hence, in p - 1 steps, every node receives the information from all other nodes in the ensemble.
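The ring procedure can be traced with a small simulation. It is a sketch under assumptions: packets are represented by (source, destination) pairs as in the {i, j} notation above, every node sends to node (i + 1) mod p, and the function name is hypothetical.

```python
def ring_all_to_all_personalized(p):
    """Simulate all-to-all personalized communication on a p-node ring.

    Returns (received, sizes): received[i] lists the packets delivered
    to node i, sizes[k] is the number of packets each node sends in
    step k.
    """
    # Node i starts with one packet (i, j) for every other node j.
    bundle = [[(i, j) for j in range(p) if j != i] for i in range(p)]
    received = [[] for _ in range(p)]
    sizes = []
    for _ in range(p - 1):
        sizes.append(len(bundle[0]))
        # All nodes send in the same direction: i receives from i - 1.
        incoming = [bundle[(i - 1) % p] for i in range(p)]
        bundle = []
        for i, packets in enumerate(incoming):
            # Keep the packet addressed to this node, forward the rest.
            received[i].extend(pk for pk in packets if pk[1] == i)
            bundle.append([pk for pk in packets if pk[1] != i])
    return received, sizes


received, sizes = ring_all_to_all_personalized(4)
print(sizes)        # packets per transfer shrink by one each step
print(received[0])  # node 0 has one packet from every other node
```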
All-to-All Personalized Communication : Mesh
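● On a √p x √p mesh, each node first groups its p messages according to the columns of their destination nodes.
● All-to-all personalized communication is then performed independently in each row, with consolidated
messages of size m√p.
● Each node then regroups its messages according to the rows of their destination nodes, and the same operation
is performed independently in each column, after which every node has received a message from every other node.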
Blocking and non blocking MPI
● When you send an MPI message, the MPI runtime code handles the details of opening an appropriate
network connection with the remote task and moving the bytes of data from the buffer you have
designated in the send call into an appropriate buffer at the receiving end.
● MPI provides a variety of send and receive calls in its interface. These calls can be classified into one of
two groups: those that cause a task to pause while it waits for messages to be sent and received, and those that
do not.
● Blocking Communication:
● In blocking communication, a process waits until a communication operation completes
before proceeding to the next instruction.
● Blocking communication is often used when the data being transferred is small or the
process that sends or receives data has nothing else to do until the communication
operation completes.
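● Depending on the MPI implementation and the message size, the sending process may wait for the
matching receive before proceeding to the next instruction.
● Blocking communication is done using MPI_Send() and MPI_Recv(). The function call does not
return control to your program until the data you are sending have been copied out of your
send buffer (or, for MPI_Recv(), until the incoming data have been copied into your receive buffer).
● A process that is blocked does no useful work while it waits, so this is not a good situation in general.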
Blocking and non blocking MPI
● Non-Blocking Communication:
● In non-blocking communication, a process initiates a communication operation and then
proceeds to the next instruction without waiting for the operation to complete.
● Non-blocking communication is often used when the data being transferred is large, or the
process that sends or receives data can continue to perform other tasks while the
communication operation is in progress.
● The sending process does not wait for the receive operation to complete before proceeding
to the next instruction.
● Non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return
immediately (i.e., they do not block) even if the communication is not finished yet. You must call
MPI_Wait() or MPI_Test() to see whether the communication has finished.
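The start-now, wait-later pattern can be illustrated with an analogy in plain Python. This is not MPI code: it uses a standard-library thread pool so that it runs without an MPI installation, and the names `slow_transfer` and `overlap_demo` are hypothetical. The returned future plays the role of the request handle, `result()` plays the role of MPI_Wait(), and `done()` plays the role of MPI_Test().

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_transfer(payload, mailbox):
    """Stand-in for a message transfer that takes a while."""
    time.sleep(0.05)  # pretend the network is slow
    mailbox.append(payload)


def overlap_demo():
    mailbox = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Analogous to MPI_Isend(): returns at once with a handle.
        request = pool.submit(slow_transfer, "data", mailbox)

        # Useful local work overlaps with the transfer in progress.
        local_work = sum(range(1000))

        # Analogous to MPI_Wait(): block until the transfer is done.
        request.result()
        # Analogous to MPI_Test(): query completion without blocking.
        finished = request.done()
    return mailbox, local_work, finished


print(overlap_demo())
```

A blocking version would simply call `slow_transfer` directly, so the local work could only start after the transfer had finished.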
Circular Shift
● It involves data permutation or reordering.
● In a circular shift, each element of a dataset is shifted a fixed number of positions to the left
or right, and elements that move past one end wrap around to the other.
● A special permutation in which node i sends a data packet to node (i + q) mod p in a p-node
ensemble (0 < q < p).
● Circular shift operations can be useful in data reordering for parallel sorting or partitioning.
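The permutation behind a circular shift can be sketched in a few lines. The function name is hypothetical, and `data[i]` stands for the packet initially held by node i.

```python
def circular_shift(data, q):
    """Node i sends its packet to node (i + q) mod p."""
    p = len(data)
    shifted = [None] * p
    for i in range(p):
        shifted[(i + q) % p] = data[i]
    return shifted


# With p = 5 and q = 2, node 0's packet ends up on node 2, and the
# packets of nodes 3 and 4 wrap around to nodes 0 and 1.
print(circular_shift(list("abcde"), 2))
```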
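Improving the speed of some communication operations
● All-to-one reduction can be performed by performing all-to-all reduction (dual of all-to-all broadcast)
followed by a gather operation (dual of scatter).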