CUDA Basics
Frauke Sprengel
Outline
1 GPU Architecture and Shader Programming
Hochschule Hannover, Fak. IV, F. Sprengel, Visual Computing – GPU Computing, CUDA Basics, 2
Graphics Rendering Pipeline
[Figure: the graphics rendering pipeline — vertex data passes through vertex shading and per-vertex operations, primitive assembly, and rasterization into fragment shading and per-fragment operations, ending in framebuffer operations; pixel data is packed/unpacked into texture memory, and transform feedback returns vertex results.]
Interesting Pipeline Stages
1. Vertex shading
2. Fragment shading
3. Geometry shading, i.e., primitive assembly operations on points, line segments, and triangles following the per-vertex operations
Unified Shader Architecture
Since 2006, GPUs have used a unified shader architecture: the same processors are used for vertex, geometry, and fragment shading in a three-step loop. The preceding figure shows the unified shader architecture of the NVIDIA GeForce 8800 GTX with 16 streaming multiprocessors (SMs), each of which consists of 8 streaming processors (SPs), summing up to 128 processor cores (called CUDA cores by NVIDIA).
The three loop steps are denoted as vertex (vtx), geometry, and pixel thread issue; the streaming processors (SPs) are drawn in green. In addition to the SPs, the GPU possesses texture filtering units (TFs) and render output units (or raster operation pipelines, ROPs), which are drawn in blue, the latter between the multiprocessors and the frame buffers (FBs). Finally, the orange blocks denoted L1 and L2 represent shared (local per multiprocessor) and device (global) memory, respectively.
GPU Generations in the Graphics Lab / Normal Labs
Room 1H.2.36/46
GPGPU Using GLSL
Task: multiply a 15 × 15 matrix A with a scalar c,
B = A · c, A, B ∈ R15×15 , c ∈ R
Method:
Create the Fragment Shader
// texUnit and c are passed in from the host as uniforms
uniform sampler2DRect texUnit;
uniform float c;

void main() {
    vec4 matrixValue = texture2DRect(texUnit, gl_TexCoord[0].xy);
    gl_FragColor = vec4(matrixValue.xyz * c, 1.0);
}
Initialize OpenGL
GLint program = glCreateProgram();
// compile and link shader ...
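The elided compile-and-link step can be sketched as follows (a hedged sketch, not the original slide code; `fragmentSource` is a hypothetical variable holding the GLSL source of the fragment shader, and error checking is omitted):

```c
// Compile the fragment shader and link it into the program object.
GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(shader, 1, &fragmentSource, NULL);
glCompileShader(shader);
glAttachShader(program, shader);
glLinkProgram(program);
glUseProgram(program);   // make the program current for subsequent draws
```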
Create the Texture
In the next step, the texture is created and filled with the matrix to be multiplied. Since an RGB texture is used, three matrix entries are mapped to one texture pixel, so the 15 × 15 matrix fits into a 5 × 15 texture. For other matrix sizes, an RGBA texture with a four-to-one mapping might fit better.

float matrix[15 * 15];
// define matrix ...
GLuint texture;
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGB32F_ARB,
             5, 15, 0, GL_RGB, GL_FLOAT, matrix);
Pass Parameters to Shader
Passing the texture and the scalar (as uniform parameters) to the shader is straightforward.

GLint texUnitId = glGetUniformLocation(program, "texUnit");
glUniform1i(texUnitId, 0);

float c = 2.0f;
GLint cId = glGetUniformLocation(program, "c");
glUniform1f(cId, c);
Invoke the Shader
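The slide body is missing here. A common way to invoke a fragment shader for GPGPU is to draw a quad covering the output region, so that rasterization generates exactly one fragment per output pixel (a hedged sketch; the viewport and projection are assumed to be set up to match the 5 × 15 output, and rectangle textures use non-normalized coordinates):

```c
// Draw a 5 x 15 quad; the fragment shader shown earlier runs once per texel.
glBegin(GL_QUADS);
glTexCoord2f(0.0f,  0.0f); glVertex2f(0.0f,  0.0f);
glTexCoord2f(5.0f,  0.0f); glVertex2f(5.0f,  0.0f);
glTexCoord2f(5.0f, 15.0f); glVertex2f(5.0f, 15.0f);
glTexCoord2f(0.0f, 15.0f); glVertex2f(0.0f, 15.0f);
glEnd();
```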
Read the Result
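The slide body is missing here; reading the result back typically uses glReadPixels (a hedged sketch, assuming the shader rendered into the current read framebuffer): the 5 × 15 RGB float pixels yield 225 floats, exactly the 15 × 15 result matrix.

```c
// Read the 5 x 15 RGB float region back into host memory (225 floats).
float result[15 * 15];
glReadPixels(0, 0, 5, 15, GL_RGB, GL_FLOAT, result);
```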
Drawbacks of GLSL-based Approach
CUDA stands for Compute Unified Device Architecture and was introduced by NVIDIA in November 2006 as a C-based computing language for NVIDIA GPUs. In recent years, CUDA has been promoted by NVIDIA with massive marketing efforts, including the GPU Technology Conference (GTC), which is mostly devoted to CUDA. As a result, CUDA (i.e., CUDA C) is better known and more widely used than any other GPU computing language. Today the CUDA toolkit comprises C- and Fortran-based computing languages as well as NVIDIA's implementations of OpenCL and DirectCompute.
CUDA: Compute Unified Device Architecture
Figure 2.3: CUDA C is one of the APIs for GPU computing (NVIDIA 2015a).
CUDA C Compiler
[Figure: compiling a CUDA program with the NVCC compiler.]
The CUDA C compiler nvcc is used to compile and link host (CPU) and
device (GPU) code:
• CUDA C is a subset of ANSI C (and of C++ for the latest GPUs) with special CUDA extensions.
• For host code (.c/.cpp/.cxx files), the usual compiler and linker of the operating system (GCC or Visual Studio) are called.
• Device code (.cu files) is compiled into PTX assembly and/or cubin binary code, which is finally linked with the host code.
• Debugging is possible via the Nsight IDE.
CUDA C Compiler Options for Compute Capability
nvcc has two options to define the compute capability of the target GPU (cf.
appendix A of the CUDA C Programming Guide (NVIDIA 2015a)):
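The two options themselves are missing on this slide; presumably they are nvcc's -arch (virtual architecture, for PTX generation) and -code (real architecture, for cubin generation). A hedged sketch of an invocation targeting compute capability 5.0 (multMatrix.cu is a hypothetical file name):

```shell
# Generate PTX for compute_50 and cubin for sm_50
nvcc -arch=compute_50 -code=sm_50,compute_50 -o multMatrix multMatrix.cu
```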
CUDA C APIs
Driver API Low-level API for loading compiled PTX or cubin code, similar to OpenCL.
Runtime API High-level API that allows kernels to be defined and launched from within the host code.
• Heterogeneous host (CPU) + device (GPU) application C program
  – Serial parts in host C code
  – Parallel parts in device SPMD kernel code
Figure 2.5: The execution of a CUDA program is shared between the host (CPU) for serial code and the device (GPU) for parallel kernels. (Nvidia 2016a).
GPU Computing Using CUDA
Task: multiply a 15 × 15 matrix A with a scalar c,
B = A · c, A, B ∈ R15×15 , c ∈ R
Method:
• Device code (GPU): create a CUDA kernel for the matrix-scalar multiplication.
• Host code (CPU):
  – Part 1a: Allocate GPU memory for the source and destination matrices.
  – Part 1b: Pass the matrix A to the GPU.
  – Part 2: Invoke the kernel with the scalar c.
  – Part 3: Read the result B from the GPU.
Figure 2.6: Data transfer between CPU and GPU (Nvidia 2016a).
CUDA Memories
Partial Overview of CUDA Memories
• Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory
• Host code can:
  – Transfer data to/from per-grid global memory
[Figure: a (device) grid with blocks of threads, per-thread registers, and the global memory accessed by both host and device.]
CUDA Device Memory Management API Functions
• cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters:
    – Address of a pointer to the allocated object
    – Size of the allocated object in bytes
• cudaFree()
  – Frees an object from the device global memory
  – One parameter:
    – Pointer to the freed object
Host-Device Data Transfer API Functions
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/direction of transfer
  – Transfer to the device is asynchronous
Allocate Memory and Pass the Matrix to the Device
size_t memSize = 15 * 15 * sizeof(float);

float *deviceMemPtrSrc = NULL;
cudaMalloc((void **) &deviceMemPtrSrc, memSize);

float *deviceMemPtrDest = NULL;
cudaMalloc((void **) &deviceMemPtrDest, memSize);

// Part 1b: pass the matrix A to the device
cudaMemcpy(deviceMemPtrSrc, matrix, memSize, cudaMemcpyHostToDevice);
Read the Result from the Device
After the computation (done by the device), the result can be copied from
the device memory to the host memory.
Host Code Part 3:
float resultMatrix[15 * 15];
cudaMemcpy(resultMatrix, deviceMemPtrDest, memSize,
           cudaMemcpyDeviceToHost);
Create the CUDA Kernel
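The kernel itself is missing on this slide. Given the launch multMatrix<<<9, 25>>>(...) used later, a matching kernel might look like this (a hedged sketch, not the original code; the parameter names follow the host code of the surrounding slides):

```cuda
// One thread per matrix entry; 9 blocks x 25 threads = 225 = 15 * 15 threads.
__global__ void multMatrix(const float *src, float c, float *dest)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < 15 * 15)                                // guard against excess threads
        dest[i] = src[i] * c;
}
```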
Function Type Specifiers
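The table is missing on this slide; CUDA C defines three function type specifiers (cf. the CUDA C Programming Guide), sketched here with hypothetical function names:

```cuda
__global__ void kernel(void);      // runs on the device, callable from the host (a kernel)
__device__ float helper(float x);  // runs on the device, callable from device code only
__host__   void hostFunc(void);    // runs on the host (the default for unannotated functions)
// __host__ __device__ may be combined to compile a function for both sides.
```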
Programming Model
Hardware Model
A Thread as a Von Neumann Processor
A thread is a "virtualized" or "abstracted" von Neumann processor.
[Figure: a von Neumann processor — memory, I/O, a processing unit with ALU and register file, and a control unit with program counter (PC) and instruction register (IR).]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads.
• All threads in a grid run the same kernel code (Single Program Multiple Data).
• Each thread has indices that it uses to compute memory addresses and make control decisions.
Figure 2.12: Parallel execution, indices of threads and blocks. (Nvidia 2016a).
Thread Blocks: Scalable Cooperation
• Divide the thread array into multiple blocks.
• Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
• Threads in different blocks do not interact.
In every block, each thread computes its global index and processes one element:
i = blockIdx.x * blockDim.x + threadIdx.x;
B[i] = A[i] * c;
Figure 2.13: Threads in blocks. (Nvidia 2016a).
Thread Organization
Block and Thread Organization
How many blocks shall be used in the example?
Hochschule Hannover, Fak. IV, F. Sprengel, Visual Computing – GPU Computing, CUDA Basics, 42
Invoke the Kernel and Read the Result
In the CUDA C runtime API, an angle bracket notation (<<< >>>) is used to define the grid and block layout. The first argument in angle brackets defines the number of blocks, the second the number of threads per block.

multMatrix<<<9, 25>>>(deviceMemPtrSrc, 2.0f, deviceMemPtrDest);

or, alternatively:

dim3 DimGrid(9, 1, 1);
dim3 DimBlock(25, 1, 1);
multMatrix<<<DimGrid, DimBlock>>>(deviceMemPtrSrc, 2.0f, deviceMemPtrDest);
Drawbacks of CUDA-based Approach
• A special compiler has to be used, so integration into your favorite IDE may be difficult.
• CUDA only works with NVIDIA GPUs. OpenCL provides a platform-independent alternative, but has so far been less widely used (and has had less marketing behind it) than CUDA.
• Meanwhile, ATI/AMD does the marketing for OpenCL . . .