
UNIT V

Scalable Multi-GPU Programming

Chapter 6
(Learn CUDA Programming)
Ajeet K Jain
CSE, KMIT, Hyderabad
So far, we have concentrated on getting optimal performance on a
single GPU. Dense nodes with multiple GPUs have become a
pressing need for upcoming supercomputers, especially as
ExaFLOP (a quintillion floating-point operations per second)
systems become a reality.
GPU architectures are energy efficient and hence, in recent years,
systems with GPUs have taken the majority of the top spots in the
Green500 list.

(https://www.top500.org/green500)


The DGX system from NVIDIA now has 16 V100 32 GB
GPUs in one server. With the help of unified memory
and interconnect technologies such as NVLink and
NVSwitch, developers can see all the GPUs as one big
GPU with 512 GB of memory (16 GPUs × 32 GB each).
In this chapter, we will go into the details of writing
CUDA code and making use of CUDA-aware libraries to
efficiently achieve scalability in a multi-GPU
environment within and across nodes.
Topics covered:

Solving a linear equation using Gaussian elimination
GPUDirect peer to peer
A brief introduction to MPI
GPUDirect RDMA
CUDA streams
Additional tricks
Solving a Linear Equation using Gaussian Elimination

To demonstrate the usage of multiple GPUs within and across
nodes, we will start with some sequential code and then convert
it to run on multiple GPUs, first within a node and then across nodes.

We will be solving a linear system of equations containing
M equations and N unknowns.

The equation can be represented as follows:

A × x = b

Here:
A is a matrix with M rows and N columns,
x is a column vector (also referred to as the solution vector)
with N rows, and
b is also a column vector with M rows.

Finding a solution vector involves computing the vector x
when A and b are given. One of the standard methods
for solving a linear system of equations is Gaussian
elimination.
In Gaussian elimination, the matrix A is first reduced to
an upper (or lower) triangular matrix by performing
elementary row transformations. The resulting triangular
system of equations is then solved using the back
substitution step.
The following pseudocode explains the steps involved
in solving the linear system:

1. For iteration 1 to N (N: number of unknowns)
   1.1 Find a row with a non-zero pivot
   1.2 Extract the pivot row
   1.3 Reduce the other rows using the pivot row
2. Compute the solution vector through back substitution
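
As a reference point before moving to the GPU, a minimal host-side sketch of these steps on the augmented matrix AB (M rows, N+1 columns, row-major) could look like the following. The function name and the swap-based pivot extraction are illustrative assumptions, not the book's code, and the sketch assumes the system has a unique solution:

#include <cmath>
#include <utility>
#include <vector>

// Sequential sketch: AB holds [A | b] with M rows and N+1 columns (row-major).
std::vector<double> gaussianElimination(std::vector<double> AB, int M, int N)
{
    const int cols = N + 1;                                   // last column is b
    for (int n = 0; n < N; n++) {
        // 1.1 Find a row (>= n) with a non-zero pivot in column n
        int pr = n;
        while (pr < M && std::fabs(AB[pr * cols + n]) < 1e-12) pr++;
        if (pr == M) continue;                                // singular column, not handled here
        // 1.2 "Extract" the pivot row by swapping it into position n
        if (pr != n)
            for (int c = 0; c < cols; c++)
                std::swap(AB[n * cols + c], AB[pr * cols + c]);
        // 1.3 Reduce the remaining rows using the pivot row
        for (int r = n + 1; r < M; r++) {
            double m = AB[r * cols + n] / AB[n * cols + n];
            for (int c = n; c < cols; c++)
                AB[r * cols + c] -= m * AB[n * cols + c];
        }
    }
    // 2. Back substitution on the resulting upper triangular system
    std::vector<double> x(N);
    for (int n = N - 1; n >= 0; n--) {
        double sum = AB[n * cols + N];
        for (int c = n + 1; c < N; c++) sum -= AB[n * cols + c] * x[c];
        x[n] = sum / AB[n * cols + n];
    }
    return x;
}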

Let's take a look at a small example system of equations in order to
understand the algorithm.
Single GPU hotspot analysis of Gaussian elimination
Here, in this Gaussian elimination, the number of rows is equal to
the number of equations and the number of columns is equal to
the number of unknowns. The row labeled pr in the diagram is the
pivot row and will be used to reduce the other rows using the pivot
element.
The first observation we can make is that we operate on an
augmented matrix that merges the A matrix with the b vector.
Hence, the number of columns is N+1, as the augmented matrix
has the b vector as its last column.
Creating an augmented matrix helps us work on just one data
structure, that is, a matrix.
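
For illustration, building the augmented matrix from A and b could look like the sketch below; the helper name buildAugmentedMatrix is an assumption, and the book may construct AB differently:

// Row-major construction of AB = [A | b]: M rows, N+1 columns.
void buildAugmentedMatrix(const double *A, const double *b, double *AB, int M, int N)
{
    for (int r = 0; r < M; r++) {
        for (int c = 0; c < N; c++)
            AB[r * (N + 1) + c] = A[r * N + c];   // copy row r of A
        AB[r * (N + 1) + N] = b[r];               // the last column holds b
    }
}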

The following extract shows that, in a single GPU implementation,
the three steps are implemented as three kernels for finding the
N unknowns:

•findPivotRowAndMultipliers<<< ... >>>: This kernel finds the pivot
row and the multipliers that should be used for row elimination.

•extractPivotRow<<< ... >>>: This kernel extracts the pivot row,
which is then used to perform row elimination.

•rowElimination<<< ... >>>: This is the final kernel call and does
the row elimination in parallel on the GPU.

The following code snippet shows the three kernels being called
iteratively after the data has been copied to the GPU:

<Copy input augmented matrix AB to GPU>
...
for (int n = 0; n < N; n++) {
    // M: number of equations, N: number of unknowns
    findPivotRowAndMultipliers<<< ... >>>( );
    extractPivotRow<<< ... >>>( );
    rowElimination<<< ... >>>( );
}
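
The kernel bodies themselves are not reproduced here. Purely as an illustration of what the pivot search could look like, the following single-block sketch operates on the row-major augmented matrix AB; the signature, launch configuration, and multiplier layout are assumptions rather than the book's implementation:

// Illustrative only. Launch with one block, for example:
// findPivotRowAndMultipliers<<<1, 256>>>(n, M, N, d_AB, d_pivotRow, d_multipliers);
__global__ void findPivotRowAndMultipliers(int n, int M, int N, const double *AB,
                                           int *pivotRow, double *multipliers)
{
    __shared__ int pr;
    if (threadIdx.x == 0) {
        pr = n;
        for (int r = n; r < M; r++)                   // scan column n for a non-zero pivot
            if (AB[r * (N + 1) + n] != 0.0) { pr = r; break; }
        *pivotRow = pr;
    }
    __syncthreads();

    double pivot = AB[pr * (N + 1) + n];
    // Each thread computes the elimination multiplier for a strided set of rows.
    for (int r = threadIdx.x; r < M; r += blockDim.x)
        multipliers[r] = (r == pr) ? 0.0 : AB[r * (N + 1) + n] / pivot;
}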
The focus is on how this single GPU implementation can be
enhanced to support multiple GPUs. However, to fill in the
missing pieces of the GPU implementation, we need to make
some optimization changes to the single GPU implementation:

•The performance of the Gaussian elimination algorithm is heavily
influenced by the memory access pattern. Basically, it depends
on how the AB matrix is stored:

Finding the pivot row prefers the column-major format, as it
provides coalesced access if the matrix is stored in column-major
order.

On the other hand, extracting a pivot row prefers the row-major
format.

•No matter how we store the AB matrix, one coalesced and one
strided/non-coalesced memory access is unavoidable.

•The column-major format is also beneficial for the row elimination
kernel and hence, for our Gaussian elimination code, we decided
to store the transpose of the AB matrix instead of AB. The AB
matrix gets transposed once, at the beginning of the code, in the
transposeMatrixAB() function.
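
A one-time transpose like this can be implemented with a simple kernel; the following sketch is illustrative (the book's transposeMatrixAB() may instead use a tiled, shared-memory transpose):

// Transposes the M x (N+1) matrix AB into the (N+1) x M matrix ABT.
__global__ void transposeKernel(const double *AB, double *ABT, int rows, int cols)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // row index into AB
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column index into AB
    if (r < rows && c < cols)
        ABT[c * rows + r] = AB[r * cols + c];
}

// A possible host-side wrapper corresponding to transposeMatrixAB():
void transposeMatrixAB(const double *d_AB, double *d_ABT, int M, int N)
{
    dim3 block(16, 16);
    dim3 grid((N + 1 + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    transposeKernel<<<grid, block>>>(d_AB, d_ABT, M, N + 1);
}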

We will then enable multi-GPU P2P (peer-to-peer) access and split
the work among multiple GPUs.
GPUDirect peer to peer

The GPUDirect technology was created to allow high-bandwidth,
low-latency communication between GPUs within and across
nodes.

This technology was introduced to eliminate CPU overheads when
one GPU needs to communicate with another. GPUDirect can be
classified into the following major categories:

•Peer-to-peer (P2P) transfers between GPUs: Allows CUDA programs
to use high-speed Direct Memory Access (DMA) to copy data
between two GPUs in the same system. It also allows optimized
access to the memory of other GPUs within the same system.

•Accelerated communication between network and storage: This
technology helps with direct access to CUDA memory from third-
party devices such as InfiniBand network adapters or storage. It
eliminates unnecessary memory copies and CPU overhead and
hence reduces the latency of transfers and accesses. This feature
is supported from CUDA 3.1 onward.

•GPUDirect for video: This technology optimizes pipelines for frame-
based video devices. It allows low-latency communication with
OpenGL, DirectX, or CUDA and is supported from CUDA 4.2 onward.

•Remote Direct Memory Access (RDMA): This feature allows direct
communication between GPUs across a cluster and is supported
from CUDA 5.0 onward.
The access of one computer's memory by another over a network,
without involving either computer's operating system, processor,
or cache, is referred to as RDMA. Because many resources are
freed up, it helps improve system throughput and performance.

Read and write operations can be performed on a remote machine
without interrupting the CPU of that machine. Thanks to this
technology, the data transfer rate is increased and networking
latency is reduced. It uses zero-copy networking, enabling network
adapters to transfer data directly into system buffers.

RDMA was earlier used only in high-performance computing (HPC)
environments, where the cost of maintaining RDMA-capable network
fabrics such as InfiniBand was justified by the importance of
performance over cost.

Because RDMA can now be enabled on existing Ethernet fabrics for
IP network communication, thanks to standards such as RDMA over
Converged Ethernet (RoCE), the cost of adopting RDMA has come
down. The RDMA-capable NIC (RNIC) adapters are configured using
standard Ethernet management procedures.
Features of RDMA

Kernel bypass: Because the operating system is not engaged in data
transfers, applications can send data directly from user space,
eliminating context switching and latency.

Zero-copy: Applications can place data directly into the memory
buffer of the destination application and receive data directly into
their own buffers, without copying data between network layers.
This cuts down on unnecessary buffer transfers.

Reduced CPU involvement: Applications can retrieve data from
remote servers without using CPU time on those servers. The
accessed content does not fill the cache memory of the remote
server's CPU.

Efficient transactions: Rather than sending and receiving data as
streams, discrete messages can be sent and received, eliminating
the need to separate messages.
Now, we will convert our sequential code to make use of the P2P
feature of GPUDirect so that it can run on multiple GPUs within
the same system.

The GPUDirect P2P feature allows the following:

GPUDirect transfers: cudaMemcpy() initiates a DMA copy from
GPU 1's memory to GPU 2's memory.

Direct access: GPU 1 can read or write GPU 2's memory
(load/store).
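
The following self-contained sketch shows both modes on a system with at least two P2P-capable GPUs; the kernel name addPeer and the buffer size are illustrative and not taken from the book:

#include <cuda_runtime.h>
#include <cstdio>

// Direct access: a kernel running on GPU 0 dereferences a buffer resident on GPU 1.
__global__ void addPeer(const double *peerBuf, double *localBuf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) localBuf[i] += peerBuf[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    double *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaMemset(buf0, 0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);          // GPU 0 may access GPU 1's memory

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    cudaMemset(buf1, 0, bytes);
    cudaDeviceEnablePeerAccess(0, 0);          // GPU 1 may access GPU 0's memory

    // GPUDirect transfer: DMA copy from GPU 0's memory to GPU 1's memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    // Direct access: GPU 0 reads GPU 1's buffer with ordinary loads.
    cudaSetDevice(0);
    addPeer<<<(n + 255) / 256, 256>>>(buf1, buf0, n);
    cudaDeviceSynchronize();
    printf("P2P example finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}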
To understand the advantage of P2P, it is necessary to understand
the PCIe bus specification. PCIe was designed to communicate
optimally with other nodes through interconnects such as
InfiniBand, which is different from what we need when we want to
optimally send and receive data between individual GPUs. The
following is a sample PCIe topology where eight GPUs are
connected to various CPUs and NIC/InfiniBand cards:

https://prodigytechno.com/pci-express-pcie-or-pci-e/

P2P transfer is allowed between GPU0 and GPU1, as they are
both connected to the same PCIe switch.
However, GPU0 and GPU4 cannot perform a P2P transfer, as
PCIe P2P communication is not supported between two I/O
Hubs (IOHs).

The IOH does not support non-contiguous bytes from PCI Express
for remote peer-to-peer MMIO transactions. The nature of the QPI
link connecting the two CPUs means that a direct P2P copy
between GPU memories is not possible if the GPUs reside on
different PCIe domains. Thus, a copy from the memory of GPU0 to
the memory of GPU4 requires copying over the PCIe link to the
memory attached to CPU0, transferring it over the QPI link to CPU1,
and then over PCIe again to GPU4. As you can imagine, this process
adds a significant amount of overhead in terms of both latency and
bandwidth.
The following diagram shows another system, where the GPUs are connected to
each other via an NVLink interconnect that supports P2P transfers:

The preceding diagram shows a sample NVLink topology resulting in an eight-GPU
cube mesh, where each GPU is connected to another GPU with a maximum of one hop.
Single node – multi-GPU Gaussian elimination
Going from a single- to a multi-GPU implementation, the three
kernels we defined earlier are used as-is. However, the linear
system is split into a number of parts equal to the number of GPUs,
and these parts are distributed one part per GPU. Each GPU is
responsible for performing the operations on the part that has
been assigned to it. The matrix is split column-wise, which means
each GPU gets an equal number of consecutive columns from all
the rows.
The kernel for finding the pivot is launched on the GPU that holds
the column containing the pivot element. The row index of the pivot
element is then broadcast to the other GPUs. The extract pivot row
and row elimination kernels are launched on all the GPUs, with each
GPU working on its own part of the matrix. The following diagram
shows the matrix being split among multiple GPUs and how the
pivot row needs to be broadcast to the rest of the processes:
The diagram represents the division of work across multiple
GPUs. Currently, the pivot row belongs to GPU1, which is
responsible for broadcasting the pivot row to the other GPUs.
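
As an illustration, such a broadcast can be expressed with P2P copies; the helper name and buffer layout below are assumptions, not the book's code:

#include <cuda_runtime.h>

// Broadcast the extracted pivot row from the GPU that owns it to every other GPU.
// pivotRowBuf[g] is a device buffer on GPU g holding that GPU's copy of the pivot row.
void broadcastPivotRow(double *pivotRowBuf[], int ownerGpu, int nGpus, size_t rowBytes)
{
    cudaSetDevice(ownerGpu);
    for (int g = 0; g < nGpus; g++) {
        if (g == ownerGpu) continue;
        // P2P DMA copy from the owner GPU's buffer to GPU g's buffer.
        cudaMemcpyPeerAsync(pivotRowBuf[g], g,
                            pivotRowBuf[ownerGpu], ownerGpu,
                            rowBytes, 0);
    }
    cudaDeviceSynchronize();   // wait until all broadcast copies have completed
}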
Let's try to understand these code changes, as well as the CUDA
API that is used to enable the P2P feature:
1. Enable P2P access between the supported GPUs.
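A minimal sketch of this step checks every pair of GPUs with cudaDeviceCanAccessPeer() and enables access with cudaDeviceEnablePeerAccess(); the loop structure and names below are illustrative rather than the book's exact listing:

#include <cuda_runtime.h>

void enableAllPeerAccess()
{
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    for (int i = 0; i < nGpus; i++) {
        cudaSetDevice(i);                           // peer access is enabled per device
        for (int j = 0; j < nGpus; j++) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess)
                cudaDeviceEnablePeerAccess(j, 0);   // allow device i to access device j
        }
    }
}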
Now that we have achieved a multi-GPU
implementation on a single node, we will change
gears and run this code on multiple GPUs across
nodes. But before converting our code to run
across nodes, we will provide a short primer on
MPI programming, which is primarily used for
inter-node communication.
Brief introduction to MPI

The Message Passing Interface (MPI) standard is a message-passing library
standard and has become the industry standard for writing message-passing
programs on HPC platforms.

Basically, MPI is used for message passing across multiple MPI processes. The MPI
processes that communicate with each other may reside on the same node or
across multiple nodes.
We make use of the mpicc compiler to compile our code.
mpicc is basically a wrapper script that internally expands the
compilation instructions to include the paths to the relevant
libraries and header files. Also, running an MPI executable
requires it to be passed as an argument to mpirun. mpirun is a
wrapper that helps set up the environment across the multiple
nodes where the application is supposed to be executed.

The -n 4 argument says that we want to run four processes, and
these processes will run on the nodes whose hostnames are
stored in the hosts file.
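
As a quick illustration, a minimal MPI program and its build/run commands might look like this; the file name, executable name, and hosts file are placeholders, and the exact mpirun flags vary between MPI implementations:

/* hello_mpi.c - minimal MPI example */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank (ID) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of MPI processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Compile and run (hosts is a text file listing the node hostnames, one per line):

mpicc -o hello_mpi hello_mpi.c
mpirun -n 4 --hostfile hosts ./hello_mpi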
GPUDirect RDMA

In a cluster environment, we would like to make use of GPUs
across multiple nodes. We will allow our parallel solver to
integrate CUDA code with MPI to utilize multi-level parallelism on
multi-node, multi-GPU systems.

A CUDA-aware MPI is used to leverage GPUDirect RDMA for
optimized inter-node communication.

GPUDirect RDMA allows direct communication between GPUs
across a cluster. It was first supported in CUDA 5.0 with the
Kepler GPU architecture. In the following diagram, we can see
GPUDirect RDMA in action, that is, GPU 2 in Server 1
communicating directly with GPU 1 in Server 2:
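
With a CUDA-aware MPI, device pointers can be passed directly to MPI calls, and the library can use GPUDirect RDMA underneath when the hardware supports it. The following is a minimal sketch under that assumption; the buffer size and the two-rank layout are illustrative:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;                          // device memory passed directly to MPI
    cudaMalloc(&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0)
        // With GPUDirect RDMA, this transfer can go from GPU memory to the NIC
        // without being staged through host memory.
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}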
