UNIT V Scalable Multi-GPU Programming (T2 Chapter 6) - Parallel Programming With CUDA
Chapter 6
(Learn CUDA Programming)
Ajeet K Jain
CSE, KMIT, Hyderabad
So far, we have concentrated on getting optimal performance out of a
single GPU. Dense nodes with multiple GPUs have become a
pressing need for upcoming supercomputers, especially since
ExaFLOP (a quintillion operations per second) systems are becoming
a reality.
GPU architectures are energy-efficient; hence, in recent years,
systems with GPUs have taken the majority of the top spots on the
Green500 list.
Extracting the pivot column favors a column-major storage format; on
the other hand, extracting a pivot row prefers the row-major format.
• No matter how we store the AB (augmented) matrix, one coalesced and one
strided (non-coalesced) access to memory is unavoidable.
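As a minimal sketch (the kernel names, the row-major AB layout, and the N x (N+1) shape are assumptions for illustration, not the book's code), the following shows why one of the two accesses is always strided:

// Row-major AB: copying the pivot row is coalesced, since consecutive
// threads read consecutive addresses.
__global__ void extract_pivot_row(const double *AB, double *row, int n, int pivot)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n + 1)
        row[col] = AB[pivot * (n + 1) + col];     // stride 1 across threads
}

// Copying the pivot column is strided: consecutive threads read
// addresses n+1 elements apart, so the accesses cannot coalesce.
__global__ void extract_pivot_col(const double *AB, double *col_out, int n, int pivot)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < n)
        col_out[r] = AB[r * (n + 1) + pivot];     // stride n+1 across threads
}

With column-major storage the situation simply flips: the column copy coalesces and the row copy becomes strided.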
[Figure: PCIe topology with a 96-lane PCIe switch. Source: https://prodigytechno.com/pci-express-pcie-or-pci-e/]
The IOH (I/O Hub) does not support non-contiguous byte enables from
PCI Express for remote peer-to-peer MMIO transactions. Because the
two CPUs are connected by a QPI link rather than PCIe, a direct P2P
copy between GPU memories is not possible when the GPUs reside on
different PCIe domains. Thus, a copy from the memory of GPU0 to the
memory of GPU4 requires copying over the PCIe link to the memory
attached to CPU0, transferring it over the QPI link to CPU1, and then
over PCIe again to GPU4. As you can imagine, this process adds
significant overhead in terms of both latency and bandwidth.
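A hedged sketch of this staged path (device IDs 0 and 4 and the buffer size are assumptions): cudaMemcpyPeer works even without direct P2P support, because the runtime transparently stages the transfer through host memory, paying exactly the overhead described above.

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;   // 1 MiB test buffer
    void *src = NULL, *dst = NULL;

    cudaSetDevice(0);               // allocate on GPU0 ...
    cudaMalloc(&src, bytes);
    cudaSetDevice(4);               // ... and on GPU4
    cudaMalloc(&dst, bytes);

    // Copy GPU0 -> GPU4; without a direct P2P path this is
    // staged through CPU-attached memory by the runtime.
    cudaMemcpyPeer(dst, 4, src, 0, bytes);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}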
The following diagram shows another system, where the GPUs are connected to each
other via an NVLink interconnect that supports P2P transfers:
[Figure: GPUs connected to each other via NVLink, enabling direct P2P transfers]
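When the topology does support P2P, as with NVLink, each device must still opt in explicitly. A minimal sketch, assuming device IDs 0 and 1 (cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess are the standard runtime calls):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can GPU0 reach GPU1 directly?
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags argument must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("Direct P2P enabled: copies now bypass host memory.\n");
    } else {
        printf("No direct P2P path: transfers will be staged through the host.\n");
    }
    return 0;
}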
Basically, MPI is used for message passing across multiple MPI processes. The
processes that communicate with each other may reside on the same node or be
spread across multiple nodes.
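As a minimal illustration (not the book's code), every MPI process discovers its rank and the total number of processes at startup:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}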
We use the mpicc compiler to compile our code. mpicc is basically
a wrapper script that internally expands the compilation command
to include the paths to the relevant MPI libraries and header files.
Likewise, an MPI executable must be passed as an argument to
mpirun, a wrapper that sets up the environment across the multiple
nodes on which the application is to be executed.
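For example, assuming the hello-world source above is saved as mpi_hello.c (the file name and the process count of 4 are arbitrary choices):

mpicc -o mpi_hello mpi_hello.c    # expands to the host C compiler plus MPI include/library paths
mpirun -np 4 ./mpi_hello          # launches four MPI processes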