Unit 3 - Parallel Communication

SNJB’s Late Sau. K. B. J.
College of Engineering
Subject :- High Performance Computing
Unit 3
Parallel Communication
By- Prof. Gunjan Deshmukh
SNJB’s KBJ CoE | Civil | Computer | E&TC | Mechanical | AIDS | MBA | Visit Us @: www.snjb.org
Syllabus
1. Basic Communication:
2. One-to-All Broadcast
3. All-to-One Reduction
4. All-to-All Broadcast and Reduction
5. All-Reduce and Prefix-Sum Operations
6. Collective Communication using MPI:
7. Scatter
8. Gather
9. Blocking and Non-Blocking MPI
10. All-to-All Personalized Communication
11. Circular Shift
12. Improving the speed of some communication operations.
Course Objectives & Outcomes
CO3: Illustrate data communication operations on various parallel architectures
Basic Communication
● Many interactions in practical parallel programs occur in well-defined patterns involving
groups of processors.
● Efficient implementations of these operations can improve performance, reduce
development effort and cost, and improve software quality.
● Group communication operations are built using point-to-point messaging primitives.
● Communicating a message of size m (i.e., m words) over an uncongested network takes time
ts + tw m (startup time ts, plus per-word transfer time tw for each of the m words).
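The cost model above can be sketched as a small Python function. The values chosen for ts and tw are made-up illustrative numbers, not measurements of any real network; the point is only that the startup term is paid once per message.

```python
def message_time(m, ts, tw):
    """Time to send one m-word message over an uncongested link: ts + tw * m."""
    return ts + tw * m

# Hypothetical parameters: 50 time units of startup, 0.5 time units per word.
TS, TW = 50.0, 0.5

one_large = message_time(1000, TS, TW)       # a single 1000-word message
ten_small = 10 * message_time(100, TS, TW)   # the same data as ten 100-word messages

print(one_large, ten_small)
```

Under these assumed parameters, sending the data as one message is cheaper than sending it in ten pieces, because the startup time ts is paid only once.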

Basic Communication Operations
● Inter-Process Communication (IPC): IPC is the exchange of data between two or more
processes. IPC is used to enable processes to communicate and synchronize with each other.
● Message Passing Interface (MPI): MPI is a standardized message-passing library interface that
enables processes to communicate with each other. It is widely used in HPC applications and supports
point-to-point and collective communication operations.
● Parallel I/O: Parallel I/O is a technique used to improve the performance of input/output
operations. It involves distributing the I/O operations across multiple processors, which can
reduce the I/O time and improve the overall performance.
● Synchronization: Synchronization is a technique used to coordinate processes so that they
reach an agreed point in the program (for example, a barrier) before any of them proceeds.
● Shared Memory: Shared memory is a memory architecture used to allow multiple processes
to access the same memory space.
● Distributed Memory: Distributed memory is a memory architecture in which each processor has
its own private memory and processes exchange data by passing messages. This can be useful for
applications that require large amounts of memory.
● Remote Procedure Call (RPC): RPC is a communication mechanism used to enable
processes to call functions or procedures on remote processes.
● Load Balancing: Load balancing is a technique used to distribute the workload evenly
across multiple processors.
One-to-All Broadcast
● One process (the root, or source) sends the same message to all other processes in the system. Here's how it works:
● The processes in the group take one of two roles: the root process, which holds the data, and
the non-root processes, which receive it.
● The root process sends the message to all the non-root processes in the group, either directly
or through intermediate processes that forward it.
● The non-root processes then process the message as needed.
● This operation is typically used when a process has some data that needs to be shared with all
other processes in the system.
● One-to-All broadcast is provided by message-passing libraries such as the Message Passing
Interface (MPI), where it is available as MPI_Bcast. OpenMP, by contrast, targets shared memory,
where threads read shared data directly instead of receiving a broadcast message.
One-to-All Broadcast and All-to-One Reduction
● One processor has a piece of data (of size m) and it needs to send it to everyone (one-to-all
broadcast).
● The dual of one-to-all broadcast is all-to-one reduction.
● In all-to-one reduction, each processor has m units of data in an individual buffer. These data
items must be combined piece-wise (using an associative operator such as sum, product, maximum,
or minimum), and the result made available at a single target processor.
One-to-All Broadcast and All-to-One Reduction on Rings
● Relevant in the context of ring-based network topologies, where nodes are connected in a
circular manner.
● In a One-to-All Broadcast operation on a ring, a single node initiates the transmission of a
message to all other nodes in the ring.
● In an All-to-One Reduction operation on a ring, all nodes in the ring collaborate to combine a set
of values into a single value, which ends up at one destination node.
● One-to-all broadcast on an eight-node ring. Node 0 is the
source of the broadcast. Each message transfer step is shown by
a numbered, dotted arrow from the source of the message to its
destination.
Broadcast and Reduction: Example
● Reduction on an eight-node ring with node 0 as the destination
of the reduction.
● In the final step, the results of these products are accumulated to the first row using n concurrent
all-to-one reduction operations along the columns (using the sum operation).
Broadcast and Reduction on a Mesh: Example
One-to-all broadcast on a 16-node mesh

Broadcast and Reduction on a Hypercube
One-to-all broadcast on a three-dimensional hypercube. The
binary representations of node labels are shown in parentheses.
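The broadcast on a hypercube can be sketched as a short simulation. This is a minimal sketch, assuming the common scheme in which the nodes that already hold the message send it across one dimension per step, highest dimension first; the function name and return format are illustrative choices, not part of the slides.

```python
def hypercube_broadcast(d, source=0):
    """Return, for each of the d steps, the list of (sender, receiver) pairs."""
    p = 1 << d                       # number of nodes in a d-dimensional hypercube
    has_data = {source}
    steps = []
    for dim in reversed(range(d)):   # highest dimension first
        bit = 1 << dim
        # Every node that already holds the message sends it to its
        # neighbor across the current dimension (label differs in one bit).
        transfers = [(node, node ^ bit) for node in sorted(has_data)]
        has_data.update(dst for _, dst in transfers)
        steps.append(transfers)
    assert len(has_data) == p        # every node has been reached
    return steps

for number, transfers in enumerate(hypercube_broadcast(3), start=1):
    print(number, transfers)
```

The number of nodes holding the message doubles in every step, so p nodes are reached in log p steps.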

Broadcast and Reduction on a Balanced Binary Tree
● Consider a binary tree in which processors are (logically) at the leaves and internal nodes are
routing nodes.
● Assume that the source processor is the root of this tree. In the first step, the source sends the data
to the right child (assuming the source is also the left child). The problem has now been
decomposed into two problems with half the number of processors.
All-to-All Broadcast and Reduction
● All-to-All Broadcast: All-to-All Broadcast is a communication pattern in which each process sends a
message to all other processes in the system.
● This is useful when each process has information that needs to be shared with every other process.
● All-to-All Broadcast can be implemented using point-to-point communication or collective
communication, depending on the size of the system and the available communication infrastructure.
● All-to-All Reduction: All-to-All Reduction is the dual of all-to-all broadcast. Each process starts with
p messages, one destined for each process, and every process ends up with the combination (under a
reduction operation) of the p messages destined for it.
● This is useful when every process contributes a partial result toward a value that is owned by each of the
other processes.
● All-to-All Reduction can also be implemented using point-to-point communication or collective
communication (in MPI, MPI_Reduce_scatter).
All-to-All Broadcast and Reduction on a Ring
● Simplest approach: perform p one-to-all broadcasts. This is not the most efficient way, though.
● Each node sends to one of its neighbors the data it needs to broadcast.
● In subsequent steps, it forwards the data received from one of its neighbors to its other
neighbor.
● The algorithm terminates in p - 1 steps.
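The ring procedure above can be sketched as a simulation. This is a minimal sketch, assuming every node sends to its right neighbor (node i to node (i + 1) mod p) in lock step; the function and variable names are illustrative.

```python
def ring_all_to_all_broadcast(messages):
    """Simulate all-to-all broadcast on a ring of p = len(messages) nodes."""
    p = len(messages)
    collected = [{i: messages[i]} for i in range(p)]  # what each node knows so far
    in_flight = list(range(p))  # in_flight[i]: origin of the message node i sends next
    for _ in range(p - 1):
        # All nodes send to their right neighbor at the same time, so node i
        # receives whatever node i - 1 was about to send.
        arriving = [in_flight[(i - 1) % p] for i in range(p)]
        for i, origin in enumerate(arriving):
            collected[i][origin] = messages[origin]
        in_flight = arriving    # in the next step, forward what was just received
    return collected

print(ring_all_to_all_broadcast(["a", "b", "c", "d"])[0])
```

After p - 1 steps every node holds all p messages, and each link carries exactly one message of size m per step.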
All-to-all Broadcast on a Mesh
● Performed in two phases. In the first phase, each row of the mesh performs an all-to-all
broadcast using the procedure for the linear array.
● In this phase, all nodes collect the √p messages corresponding to the √p nodes of their
respective rows. Each node consolidates this information into a single message of size m√p.
● The second communication phase is a columnwise all-to-all broadcast of the consolidated
messages.
All-to-all broadcast on a Hypercube
● Generalization of the mesh algorithm to log p dimensions.
● Message size doubles at each of the log p steps.
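The doubling of the message size can be checked with a short simulation. This is a minimal sketch, assuming p = 2^d nodes that exchange everything they hold with their neighbor across one dimension per step; messages are represented only by the label of their origin node.

```python
def hypercube_all_to_all_broadcast(d):
    """Return (final data per node, number of messages each node sends per step)."""
    p = 1 << d
    data = [{i} for i in range(p)]   # node i starts with its own message i
    sizes = []
    for dim in range(d):
        bit = 1 << dim
        sizes.append(len(data[0]))   # every node holds the same count at this point
        # Pairwise exchange across the current dimension, then merge.
        data = [data[i] | data[i ^ bit] for i in range(p)]
    return data, sizes

data, sizes = hypercube_all_to_all_broadcast(3)
print(sizes)
```

For d = 3 the per-step message counts are 1, 2, and 4, matching the doubling described above.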
All-Reduce and Prefix-Sum Operations
● All-Reduce and Prefix-Sum are two common operations in High-Performance Computing (HPC) that are
used to efficiently communicate data and perform computations among multiple processes in a
parallel program.
● All-Reduce: All-Reduce is a collective operation in which all processes in a system contribute a value,
which is then combined using a reduction operation such as sum, product, or maximum, and the
result is then distributed to all processes.
● This is useful when each process has a local value that needs to be combined with the values from all
the other processes.
● All-Reduce can be used to implement a variety of parallel algorithms, including parallel sorting,
matrix multiplication, and graph algorithms.
● For example, consider a parallel program that is computing the dot product of two large vectors. Each
process holds a chunk of both vectors and computes the partial dot product of its own chunk. An
All-Reduce with the sum operation then adds the partial results and leaves the full dot product on
every process.
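A dot product with All-Reduce can be sketched as follows: each process computes the partial dot product of its own chunk, and a sum All-Reduce leaves the total on every process. This is a minimal simulation, assuming p is a power of two and divides the vector length, and using a butterfly (pairwise exchange) pattern as one possible All-Reduce scheme; the function names are illustrative.

```python
def all_reduce_sum(values):
    """Butterfly all-reduce over p = 2**d values; returns the result seen by each process."""
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    vals = list(values)
    bit = 1
    while bit < p:
        # Each process adds the value held by its partner across this dimension.
        vals = [vals[i] + vals[i ^ bit] for i in range(p)]
        bit <<= 1
    return vals

def parallel_dot(x, y, p):
    """Dot product of x and y with the vectors split evenly over p processes."""
    chunk = len(x) // p              # assumes p divides len(x)
    partial = [
        sum(a * b for a, b in zip(x[r * chunk:(r + 1) * chunk],
                                  y[r * chunk:(r + 1) * chunk]))
        for r in range(p)
    ]
    return all_reduce_sum(partial)   # every process ends up with the full dot product

print(parallel_dot(list(range(8)), [1] * 8, 4))
```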
● Prefix-Sum: Prefix-Sum (also called scan) is a parallel computation that produces the cumulative sums of a
sequence of values: the i-th element of the result is the sum of the first i elements of the
original sequence.
● Prefix-Sum is useful in a variety of parallel algorithms, including parallel sorting, discrete Fourier
transform, and graph algorithms.
● For example, consider a parallel program that is computing the prefix sum of a large array. Each
process has a contiguous subset of the array.
● Each process first computes the local prefix sum of its subset. The totals of the subsets are then
combined across processes (for example, using a binary tree algorithm) so that each process learns the
sum of all elements held by lower-ranked processes, and it adds that offset to its local prefix sums.
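The distributed prefix sum can be sketched in three steps: a local scan on each process, an exclusive scan over the per-process totals, and a final offset addition. This is a minimal sequential simulation in which the list `chunks` stands in for the data held by each process; the cross-process step is computed directly rather than with a tree.

```python
from itertools import accumulate

def distributed_prefix_sum(chunks):
    """chunks[r] is the part of the array held by process r (in rank order)."""
    # Step 1: each process computes the prefix sum of its own chunk.
    local = [list(accumulate(chunk)) for chunk in chunks]
    # Step 2: exclusive scan of the chunk totals gives each process the sum
    # of everything held by lower-ranked processes.
    totals = [scan[-1] if scan else 0 for scan in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Step 3: each process adds its offset to its local prefix sums.
    return [[v + off for v in scan] for scan, off in zip(local, offsets)]

print(distributed_prefix_sum([[1, 2], [3, 4], [5, 6]]))
```

Only step 2 needs communication, and it moves a single number per process regardless of the chunk size.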
Prefix-Sum Operations Example
Scatter
● Scatter is a communication operation in which a single process distributes data to all other processes in
a parallel program.
● In Scatter, the data is partitioned into equal-sized smaller chunks, and each chunk is sent to a
different process.
● Scatter is often used when a single process has a large amount of data that needs to be distributed
among all the processes in the system.
● For example, consider a parallel program that is reading data from a file, and needs to distribute the data
among all the processes in the system for processing. The data can be partitioned into smaller chunks, and
then each chunk can be sent to a different process using Scatter
● Scatter is implemented using point-to-point communication, where the sender process sends a message
to each receiver process, and each receiver process receives a message from the sender process.
● Scatter is often used in combination with other communication operations such as Gather, All-to-All, and
All-Reduce to efficiently exchange data and perform computations among multiple processes in a parallel
program.
Gather
● Gather is a communication operation in which all processes in a parallel program send their local data
to a single process for aggregation.
● In Gather, each process holds one chunk of the data, and each chunk is sent from its process
to a single process that assembles the chunks, typically in rank order.
● Gather is often used when all processes in the system generate local data that needs to be combined
into a single dataset for analysis or output.
● Gather is implemented using point-to-point communication, where each sender process sends a
message to the receiver process, and the receiver process receives messages from all the sender
processes.
● It is often used in combination with other communication operations such as Scatter, All-to-All, and
All-Reduce to efficiently exchange data and perform computations among multiple processes in a
parallel program.
Scatter and Gather
● In the scatter operation, a single node sends a unique message of size m to every other node (also
called a one-to-all personalized communication).
● In the gather operation, a single node collects a unique message from each node.
● While the scatter operation is fundamentally different from broadcast, the algorithmic structure is
similar, except for differences in message sizes (messages get smaller in scatter and stay constant in
broadcast).
● The gather operation is exactly the inverse of the scatter operation and can be executed as such.
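Scatter and gather as inverse operations can be sketched with plain lists. This is a minimal simulation, assuming the data divides evenly among p processes; it models only which process ends up with which piece, not the message routing.

```python
def scatter(data, p):
    """Split the root's data into p equal chunks; chunk r goes to process r."""
    if len(data) % p != 0:
        raise ValueError("sketch assumes len(data) is divisible by p")
    size = len(data) // p
    return [data[r * size:(r + 1) * size] for r in range(p)]

def gather(chunks):
    """Concatenate the per-process chunks at the root, in rank order."""
    return [item for chunk in chunks for item in chunk]

chunks = scatter(list(range(8)), 4)                 # root distributes the data
squared = [[v * v for v in c] for c in chunks]      # each process works on its chunk
print(gather(squared))                              # root collects the results
```

Gathering the unmodified chunks returns the original data, which is what "inverse" means here.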
Example of the Scatter Operation
All-to-All Personalized Communication
● In all-to-all personalized communication, each node sends a distinct message of size m to every other node.
● Each node sends different messages to different nodes, unlike all-to-all broadcast, in which each node sends
the same message to all other nodes.
● This operation is equivalent to transposing a two-dimensional array of data distributed among p processes.
● All-to-all personalized communication is also known as total exchange.
● This operation is used in a variety of parallel algorithms such as fast Fourier transform, matrix transpose,
sample sort, and some parallel database join operations.
All-to-All Personalized Communication : Matrix transposition
● Consider an n x n matrix mapped onto n processors such that each processor contains one full row of the matrix.
● With this mapping, processor Pi initially contains the elements of the matrix with indices [i, 0], [i, 1], ..., [i, n - 1].
● After the transposition, element [i, 0] belongs to P0, element [i, 1] belongs to P1, and so on.
● In general, element [i, j] initially resides on Pi , but moves to Pj during the transposition.
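The transposition can be sketched as an all-to-all personalized exchange in which process i sends element [i, j] to process j. This is a minimal simulation: the nested list `send` stands in for the rows held by the processes, and the function name is illustrative.

```python
def all_to_all_personalized(send):
    """send[i][j] is the message from process i to process j.

    Returns recv, where recv[j][i] == send[i][j]: process j ends up holding
    one distinct message from every process i.
    """
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]

# Process Pi holds row i of the matrix before the exchange.
rows = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(all_to_all_personalized(rows))
```

After the exchange, process Pj holds column j of the original matrix, that is, row j of the transpose.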
All-to-All Personalized Communication : Ring
● Each process first sends its data to its neighbor in one direction around the ring. The
receiving process keeps what is meant for it and forwards the rest in the same direction, and this
continues until all processes in the ring have received a message from every other process.
● To perform this operation, every node sends p - 1 pieces of data, each of size m.
● These pieces of data are identified by pairs of integers of the form {i, j}, where i is the source of the
message and j is its final destination.
● First, each node sends all pieces of data as one consolidated message of size m(p - 1) to one of its
neighbors (all nodes communicate in the same direction).
● Of the m(p - 1) words of data received by a node in this step, one m-word packet is meant for it.
● Therefore, each node extracts the information meant for it from the data received, and forwards
the remaining (p - 2) pieces of size m each to the next node.
● This process continues for p - 1 steps. The total size of data being transferred between nodes decreases by
m words in each successive step. In every step, each node adds to its collection one m-word packet
originating from a different node.
● Hence, in p - 1 steps, every node receives the information from all other nodes in the ensemble.
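The ring procedure can be sketched as a simulation in which each packet is a pair (i, j) of source and destination. This is a minimal sketch, assuming all nodes send to their right neighbor in lock step; the function name and return format are illustrative.

```python
def ring_personalized(p):
    """Simulate total exchange on a p-node ring.

    Returns (received, sizes): received[i] lists the packets kept by node i,
    and sizes[k] is the number of m-word packets per message in step k.
    """
    # Node i starts with one packet (i, j) for every other node j.
    outgoing = [[(i, j) for j in range(p) if j != i] for i in range(p)]
    received = [[] for _ in range(p)]
    sizes = []
    for _ in range(p - 1):
        sizes.append(len(outgoing[0]))    # same for every node in a given step
        # All nodes send in the same direction: node i receives from node i - 1.
        incoming = [outgoing[(i - 1) % p] for i in range(p)]
        outgoing = []
        for i, packets in enumerate(incoming):
            received[i] += [pk for pk in packets if pk[1] == i]     # keep own packet
            outgoing.append([pk for pk in packets if pk[1] != i])   # forward the rest
    return received, sizes

received, sizes = ring_personalized(4)
print(sizes)
```

For p = 4 the message shrinks from 3 packets to 2 to 1, which is the decrease of m words per step described above.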
All-to-All Personalized Communication : Mesh
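● Each node first groups its p messages according to the columns of their destination nodes.
● In the first phase, all-to-all personalized communication is performed independently in each row, using
consolidated messages of size m√p, so that every message reaches the column of its destination.
● In the second phase, each node regroups the messages by destination row and the same operation is
performed in each column, after which every node has received a message from every other node.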
Blocking and Non-Blocking MPI
● When you send an MPI message, the MPI runtime code handles the details of opening an appropriate
network connection with the remote task and moving the bytes of data from the buffer you have
designated in the send call into an appropriate buffer at the receiving end.
● MPI provides a variety of send and receive calls in its interface. These calls can be classified into one of
two groups: those that cause a task to pause while it waits for messages to be sent and received, and those that
do not.
● Blocking Communication:
● In blocking communication, a process waits until a communication operation completes
before proceeding to the next instruction.
● Blocking communication is often used when the data being transferred is small or the
process that sends or receives data has nothing else to do until the communication
operation completes.
● A blocking send returns only when the send buffer is safe to reuse. Depending on the MPI
implementation and the message size, this may or may not require the matching receive to have started.
● Blocking communication is done using MPI_Send() and MPI_Recv(). The function call does not
return control to your program until the data you are sending have been copied out of your
send buffer.
● Sitting idle during this wait is not a good situation in general, because the process could be
doing useful work instead.
● Non-Blocking Communication:
● In non-blocking communication, a process initiates a communication operation and then
proceeds to the next instruction without waiting for the operation to complete.
● Non-blocking communication is often used when the data being transferred is large, or the
process that sends or receives data can continue to perform other tasks while the
communication operation is in progress.
● The sending process does not wait for the receive operation to complete before proceeding
to the next instruction.
● Non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return
immediately (i.e., they do not block) even if the communication is not finished yet. You must call
MPI_Wait() or MPI_Test() to see whether the communication has finished, and the message buffer
must not be reused until it has.
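The difference in control flow can be illustrated with an analogy in plain Python. This is not MPI code: it is a sketch that uses a standard-library thread pool, where submitting a task plays the role of MPI_Isend() (returns at once with a handle) and waiting on the handle plays the role of MPI_Wait(). The helper `slow_transfer` is a hypothetical stand-in for a message transfer.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_transfer(payload):
    """Hypothetical stand-in for a message transfer that takes a while."""
    time.sleep(0.05)
    return len(payload)

def blocking_style(payload, work):
    sent = slow_transfer(payload)   # caller is stuck here for the whole transfer
    return sent, work()             # local work can only start afterwards

def non_blocking_style(payload, work):
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(slow_transfer, payload)  # returns at once with a handle
        result = work()                                # local work overlaps the transfer
        sent = request.result()                        # wait here, as with MPI_Wait()
    return sent, result

print(blocking_style(b"abcd", lambda: 42))
print(non_blocking_style(b"abcd", lambda: 42))
```

Both styles produce the same result; the non-blocking style simply lets the local work run while the transfer is in progress. Polling `request.done()` would play the role of MPI_Test().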
Circular Shift
● Circular shift is a permutation operation: it reorders data among the nodes.
● In a circular shift, each element of a dataset is shifted a fixed number of positions to the left
or right, and elements that move past one end wrap around to the other.
● It is a special permutation in which node i sends a data packet to node (i + q) mod p in a p-node
ensemble (0 < q < p).
● Circular shift operations can be useful in data reordering for parallel sorting or partitioning,
or as part of algorithms for image processing or numerical simulations.
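The permutation itself can be sketched in a few lines. This is a minimal simulation in which `data[i]` stands for the packet held by node i; it shows only where each packet ends up, not how it is routed.

```python
def circular_shift(data, q):
    """Node i sends its packet to node (i + q) mod p, where p = len(data)."""
    p = len(data)
    shifted = [None] * p
    for i in range(p):
        shifted[(i + q) % p] = data[i]
    return shifted

print(circular_shift(["a", "b", "c", "d", "e"], 2))
```

With p = 5 and q = 2, the packets from nodes 3 and 4 wrap around to nodes 0 and 1.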

Circular Shift : Mesh

● On a ring (a linear array with wraparound), a circular q-shift can be performed in min{q, p − q}
neighbor-to-neighbor communications, by shifting in whichever direction is shorter.
● Mesh algorithms follow from this as well.
● We shift in one direction (all processors) followed by
the next direction.
● The associated time has an upper bound of:
● T = (ts + tw m)(√p + 1).
Circular Shift : Hypercube
● To perform a q-shift, we expand q as a sum of distinct powers of 2
(for example, q = 5 = 4 + 1).
● If q is the sum of s distinct powers of 2, then the
circular q-shift on a hypercube is performed in s
phases.
● The time for this is upper bounded by:
● T = (ts + tw m)(2 log p − 1).
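The decomposition of q into phases can be sketched as follows. This is a minimal sketch that only computes the phases and checks that applying them one after another gives the full q-shift; it does not model the hypercube routing within each phase, and the function names are illustrative.

```python
def shift_phases(q):
    """Distinct powers of two that sum to q, largest first."""
    return [1 << b for b in reversed(range(q.bit_length())) if (q >> b) & 1]

def circular_shift(data, q):
    """Node i sends its packet to node (i + q) mod p."""
    p = len(data)
    shifted = [None] * p
    for i in range(p):
        shifted[(i + q) % p] = data[i]
    return shifted

def shift_in_phases(data, q):
    """Perform a q-shift as a sequence of power-of-two shifts."""
    for step in shift_phases(q):
        data = circular_shift(data, step)
    return data

print(shift_phases(5))
```

For q = 5 the phases are a 4-shift followed by a 1-shift, so s = 2.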
Improving the speed of some communication operations
● Splitting and routing messages into parts: If the message can be split into p parts, a one-to-all broadcast
can be implemented as a scatter operation followed by an all-to-all broadcast operation.
● All-to-one reduction can be performed by performing all-to-all reduction (dual of all-to-all broadcast)
followed by a gather operation (dual of scatter).
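The split broadcast can be sketched together with a rough cost comparison. The simulation models only which process holds which part. The two time formulas are assumptions that are not stated in the slides: they take a hypercube with the ts + tw m cost model, (ts + tw m) log p for the direct broadcast, and ts log p + tw (m/p)(p - 1) for each of the scatter and the all-to-all broadcast.

```python
import math

def broadcast_via_scatter_allgather(message, p):
    """Broadcast from process 0: scatter p parts, then all-to-all broadcast them."""
    if len(message) % p != 0:
        raise ValueError("sketch assumes the message splits into p equal parts")
    size = len(message) // p
    # Scatter: process r receives only part r of the message.
    parts = [message[r * size:(r + 1) * size] for r in range(p)]
    # All-to-all broadcast: every process collects all p parts in rank order.
    return [[x for part in parts for x in part] for _ in range(p)]

def time_direct(m, p, ts, tw):
    """Assumed hypercube cost of a direct one-to-all broadcast."""
    return (ts + tw * m) * math.log2(p)

def time_split(m, p, ts, tw):
    """Assumed hypercube cost of scatter followed by all-to-all broadcast."""
    return 2 * (ts * math.log2(p) + tw * m * (p - 1) / p)

print(time_direct(1024, 8, 10, 1), time_split(1024, 8, 10, 1))
```

Under these assumptions the split version pays twice the startup cost but moves far fewer words per step, so it tends to win for large messages and lose for small ones.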
● Use a high-performance network
● Reduce message size
● Use non-blocking communication
● Use collective communication
● Use specialized hardware

