
Visual Computing – GPU Computing

CUDA Basics

Frauke Sprengel
Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL

3 GPU Computing Using CUDA

Graphics Rendering Pipeline
[Block diagram: vertex data passes through vertex shading and per-vertex operations, primitive assembly and rasterization, and fragment shading and per-fragment operations into the framebuffer; pixel data enters via pixel pack/unpack and texture memory; transform feedback loops back from the vertex stage.]

Figure 2.1: Graphics Rendering Pipeline in the OpenGL specification (Segal and Akeley 2015).

Interesting Pipeline Stages

Target pipeline stages for programmable shaders:

1. Vertex shading
2. Fragment shading
3. Geometry shading, i. e., operations on assembled primitives (points, line
segments, triangles) following the per-vertex operations

Unified Shader Architecture

Figure 2.2: Unified shader architecture (Kirk and Hwu 2010).

Unified Shader Architecture
Since 2006, GPUs have used a unified shader architecture: the same
processors are used for vertex, geometry, and fragment shading in a
three-step loop. The preceding figure shows the unified shader architecture of
the NVIDIA GeForce 8800 GTX with 16 streaming multiprocessors (SMs),
each consisting of 8 streaming processors (SPs), thus summing up to
128 processor cores (called CUDA cores by NVIDIA).
The three loop steps are denoted as vertex (vtx), geometry, and pixel thread
issue; the streaming processors (SPs) are drawn in green. In addition to the
SPs, the GPU possesses texture filtering units (TFs) and render output units
(or raster operation pipelines, ROPs), drawn in blue, the latter located
between the multiprocessors and the frame buffers (FBs). Finally, the orange
blocks denoted as L1 and L2 represent shared (local per multiprocessor) and
device (global) memory, respectively.

GPU Generations in the Graphics Lab / Normal Labs
Room 1H.2.36/46

• Quadro 600 (GF108GL, Fermi, 2010), 64 CUDA cores, 245 GFlop/s,
  25.6 GB/s, CUDA capability 2.1

Room 1H.2.06/38 (computers 2017)

• Quadro P600 (GP107, Pascal, 2017), 384 CUDA cores, 1196 GFlop/s,
  64 GB/s, CUDA capability 6.1

Graphics lab 1H.2.30 (computers 2015)

• mostly Quadro M4000 (GM204, Maxwell 2), 1664 CUDA cores,
  2572 GFlop/s, 192 GB/s, CUDA capability 5.2
• one Quadro M6000 (GM200, Maxwell 2), 3072 CUDA cores,
  6070 GFlop/s, 317 GB/s, CUDA capability 5.2
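The compute capability and core count of the installed GPU can be queried at
run time, anticipating the CUDA runtime API introduced later; a minimal
sketch (assuming device 0 is the GPU of interest):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}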

Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL
   Example Program
   Evaluation

3 GPU Computing Using CUDA

GPGPU Using GLSL

Task: multiply a 15 × 15 matrix A with a scalar c,

B = A · c,   A, B ∈ R^(15×15), c ∈ R

Method:

• Create a fragment shader for the matrix-scalar multiplication.
• Pass the matrix A as a 5 × 15 RGB texture to the GPU.
• Pass the scalar c as a uniform parameter to the GPU.
• Draw a rectangle, thus invoking the shader.
• Read the result B from the framebuffer.

Create the Fragment Shader

A fragment shader is used for the computation because it provides convenient
access to the texture and allows writing the result directly to the framebuffer.
In the vertex shader, the projection transformation has to be applied to the
vertices.

uniform float c;
uniform sampler2DRect texUnit;

void main() {
    vec4 matrixValue = texture2DRect(texUnit,
        gl_TexCoord[0].xy);
    gl_FragColor = vec4(matrixValue.xyz * c, 1.0);
}

Initialize OpenGL

OpenGL has to be initialized with a viewport and an orthographic projection
which ensure that the texture and framebuffer pixels match exactly, in order
to avoid interpolation of texture values.

glViewport(0, 0, 5, 15);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 5, 0, 15, -1, 1);

GLuint program = glCreateProgram();
// compile and link shader ...

Create the Texture

In the next step the texture is created and filled with the matrix to be
multiplied. Since an RGB texture is used, three matrix entries are mapped to
one texture pixel. For other matrix sizes, an RGBA texture with a four to one
mapping might fit better.
float matrix[15 * 15];
// define matrix ...
GLuint texture;
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0,
    GL_RGB32F_ARB, 5, 15, 0, GL_RGB, GL_FLOAT, matrix);

Pass Parameters to Shader

Passing the texture and the scalar (as uniform parameter) to the shader is
straightforward.

glUseProgram(program); // program must be active before setting uniforms
GLint texUnitId = glGetUniformLocation(program, "texUnit");
glUniform1i(texUnitId, 0);

float c = 2.0f;
GLint cId = glGetUniformLocation(program, "c");
glUniform1f(cId, c);

Invoke the Shader

In order to invoke the shader, something has to be rendered. The simplest
geometric object is a rectangle in the z = 0 plane exactly filling the viewport.

glBegin(GL_QUADS);
glTexCoord2f(0,  0); glVertex2f(0,  0);
glTexCoord2f(5,  0); glVertex2f(5,  0);
glTexCoord2f(5, 15); glVertex2f(5, 15);
glTexCoord2f(0, 15); glVertex2f(0, 15);
glEnd();
glFinish();

Read the Result

Finally, the result can be read from the framebuffer.

float result[15 * 15];
glReadPixels(0, 0, 5, 15, GL_RGB, GL_FLOAT, result);

Drawbacks of GLSL-based Approach

• Inconvenient “API” working with textures and OpenGL rendering
• Inflexible programming language focused on the graphics rendering pipeline
• Unnecessary computational overhead:
  • vertex shading, projection, primitive assembly
  • clipping, perspective division, rasterization
  • texture filtering, depth buffer operations

Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL

3 GPU Computing Using CUDA
   CUDA: Compute Unified Device Architecture
   Example Program
   Example Program - Data Transfer
   Example Program - Defining the CUDA kernel
   Programming Model
   Invoking the Kernel - Example Program
   Evaluation
CUDA: Compute Unified Device Architecture

CUDA stands for Compute Unified Device Architecture and was
introduced by NVIDIA in November 2006 as a C-based computing language
for NVIDIA GPUs. In recent years, CUDA has been promoted by NVIDIA with
massive marketing efforts, including the GPU Technology Conference (GTC),
which is mostly devoted to CUDA. As a result, CUDA (i. e., CUDA C) is better
known and more widely used than any other GPU computing language. Today
the CUDA toolkit comprises C- and Fortran-based computing languages as
well as NVIDIA’s implementations of OpenCL and DirectCompute.

CUDA: Compute Unified Device Architecture

Figure 2.3: CUDA C is one of the APIs for GPU computing (NVIDIA 2015a).

CUDA C Compiler
Compiling a CUDA Program

[Compilation flow: integrated C programs with CUDA extensions are fed to the NVCC compiler, which splits them into host code, handled by the host C compiler/linker, and device code (PTX), handled by the device just-in-time compiler, targeting a heterogeneous computing platform with CPUs, GPUs, etc.]

Figure 2.4: CUDA C compilation process (Nvidia 2016a).
CUDA C Compiler

The CUDA C compiler nvcc is used to compile and link host (CPU) and
device (GPU) code:

• CUDA C is a subset of ANSI C (and C++ for latest GPUs) with special
CUDA extensions.
• For host code (.c/.cpp/.cxx files) the usual compiler and linker of the
operating system (GCC or Visual Studio) are called.
• Device code (.cu files) is compiled into PTX assembly form and/or cubin
binary code, which is finally linked with the host code.
• Debugging is possible via the Nsight IDE.

CUDA C Compiler Options for Compute Capability

nvcc has two options to define the compute capability of the target GPU (cf.
appendix A of the CUDA C Programming Guide (NVIDIA 2015a)):

• -arch: compatibility of the generated PTX assembly code, e. g.,
  -arch=compute_52 for compute capability 5.2
• -code: compatibility of the generated cubin binary code, e. g.,
  -code=sm_52 for compute capability 5.2
• PTX code compiled for a specific compute capability can be compiled to
  cubin code of equal or higher compute capability.
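For example, a typical invocation that compiles for compute capability 5.2
with both options set could look like this (the file name multmatrix.cu is
hypothetical):

nvcc -arch=compute_52 -code=sm_52 -o multmatrix multmatrix.cu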

CUDA C APIs

CUDA C provides two different APIs:

Driver API Low-level API loading compiled PTX or cubin code, similar to
OpenCL.
Runtime API High-level API that allows defining and launching kernels from
within the host code.

In the following, the runtime API will be used.

Host and Device

• Heterogeneous host (CPU) + device (GPU) application C program
  • Serial parts in host C code
  • Parallel parts in device SPMD kernel code

[Execution alternates: serial code (host), then a parallel kernel (device) launched as KernelA<<< nBlk, nTid >>>(args);, then serial code (host) again, then KernelB<<< nBlk, nTid >>>(args);, and so on.]

Figure 2.5: The execution of a CUDA program is shared between the host (CPU) for serial code and the device (GPU) for parallel kernels (Nvidia 2016a).
GPU Computing Using CUDA

Task: multiply a 15 × 15 matrix A with a scalar c,

B = A · c,   A, B ∈ R^(15×15), c ∈ R

Method:

• Device code (GPU)
  • Create a CUDA kernel for the matrix-scalar multiplication.
• Host code (CPU)
  • Part 1a: Allocate GPU memory for source and destination matrices.
  • Part 1b: Pass the matrix A to the GPU.
  • Part 2: Invoke the kernel with the scalar c.
  • Part 3: Read the result B from the GPU.

Figure 2.6: Data transfer between CPU and GPU (Nvidia 2016a).
CUDA Memories

Partial overview of CUDA memories:

• Device code can:
  • R/W per-thread registers
  • R/W all-shared global memory
• Host code can:
  • transfer data to/from per-grid global memory

Figure 2.7: Partial Overview of CUDA Memories (Nvidia 2016a).

We will cover more memory types and more sophisticated memory models later.
CUDA Device Memory Management API functions

• cudaMalloc()
  • Allocates an object in the device global memory
  • Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
• cudaFree()
  • Frees an object from the device global memory
  • One parameter:
    • Pointer to the freed object

Figure 2.8: CUDA Device Memory Management API functions (Nvidia 2016a).
Host-Device Data Transfer API functions

• cudaMemcpy()
  • Memory data transfer
  • Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  • Transfer to device is asynchronous

Figure 2.9: Host-Device Data Transfer API functions (Nvidia 2016a).
Allocate Memory and Pass the Matrix to the Device

Host Code Part 1:

float matrix[15 * 15];
// define matrix ...

size_t memSize = 15 * 15 * sizeof(float);
float *deviceMemPtrSrc = NULL;
cudaMalloc(&deviceMemPtrSrc, memSize);
float *deviceMemPtrDest = NULL;
cudaMalloc(&deviceMemPtrDest, memSize);

cudaMemcpy(deviceMemPtrSrc, matrix, memSize,
    cudaMemcpyHostToDevice);

Read the Result from the Device

After the computation (done by the device), the result can be copied from
the device memory to the host memory.
Host Code Part 3:
float resultMatrix[15 * 15];
cudaMemcpy(resultMatrix, deviceMemPtrDest, memSize,
    cudaMemcpyDeviceToHost);
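When the results are no longer needed on the device, the allocated memory
should be released with cudaFree() (cf. the memory management API above);
for this example:

cudaFree(deviceMemPtrSrc);
cudaFree(deviceMemPtrDest);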

Create the CUDA Kernel

A kernel is a program section that is applied concurrently to multiple data
elements on the GPU.
We define a CUDA kernel, indicated by the function type specifier
__global__.

__global__
void multMatrix(const float *A,
                float c, float *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = A[i] * c;
}
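The launch configuration used later (9 blocks of 25 threads) creates exactly
15 · 15 = 225 threads, so every index i is valid here. For sizes that do not
exactly match the grid, a bounds check is the usual safeguard; a sketch with a
hypothetical element count n:

__global__
void multMatrixN(const float *A, float c, float *B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard threads beyond the end of the data
        B[i] = A[i] * c;
}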

Function Type Specifiers

In CUDA C there are three types of functions, defined by the following
specifiers:

• __global__: executed on the device, callable from the host only
• __device__: executed on the device, callable from the device only
• __host__: executed on the host, callable from the host only (default,
  specifier can be omitted)
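A brief sketch illustrating the first two specifiers (scale() is a
hypothetical helper):

__device__ float scale(float x, float c) {  // callable from device code only
    return x * c;
}

__global__ void multMatrix2(const float *A, float c, float *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = scale(A[i], c);  // __global__ code may call __device__ functions
}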

Programming Model

SPMD architecture: single program, multiple data

Differences to SIMD (single instruction, multiple data):

• Programs may diverge, e. g., by using conditional sections (if clauses).
• Synchronization of threads running on the same multiprocessor is possible.
• Threads have access to shared memory within the multiprocessor and to
  global device memory.
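A sketch of these aspects (kernel and buffer names are hypothetical; it
assumes one element per launched thread and at most 256 threads per block):
threads may take different branches, and __syncthreads() acts as a barrier for
all threads of a block:

__global__ void spmdExample(float *data) {
    __shared__ float buf[256];  // shared memory, one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)  // threads of one block may diverge ...
        buf[threadIdx.x] = data[i];
    else
        buf[threadIdx.x] = -data[i];
    __syncthreads();  // ... and re-synchronize at a barrier
    data[i] = buf[threadIdx.x];
}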

Hardware Model

Figure 2.10: The CUDA hardware model. Details on memory usage will be
treated in the next chapter. (CUDA Programming Guide Version 2.3, 2009)

A Thread as a Von-Neumann Processor

[Diagram: a thread abstracts a Von-Neumann processor with memory, I/O, a processing unit (ALU, register file), and a control unit (PC, IR).]

Figure 2.11: A thread is a “virtualized” or “abstracted” Von-Neumann processor (Nvidia 2016a).

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads.
  • All threads in a grid run the same kernel code (single program,
    multiple data).
  • Each thread has indices that it uses to compute memory addresses and
    make control decisions.

[Diagram: threads 0, 1, 2, …, 254, 255 each execute]

i = blockIdx.x * blockDim.x + threadIdx.x;
B[i] = A[i] * c;

Figure 2.12: Parallel execution, indices of threads and blocks (Nvidia 2016a).

Thread Blocks in the Grid

[Diagram: the thread array is divided into Thread Block 0, Thread Block 1, …, Thread Block N−1; within each block, threads 0 … 255 execute
i = blockIdx.x * blockDim.x + threadIdx.x; B[i] = A[i] * c;]

Figure 2.13: Threads in blocks (Nvidia 2016a).

• Divide the thread array into multiple blocks.
• Threads within a block cooperate via shared memory, atomic operations,
  and barrier synchronization.
• Threads in different blocks do not interact.
Thread Organization

Block Group of threads which are assigned to a single streaming
multiprocessor.
  • May consist of up to 1024 threads.
  • Threads are arranged in 1D, 2D, or 3D.
Grid Group of blocks which execute a single kernel.
  • Blocks are distributed to different multiprocessors.
  • Blocks are arranged in 1D, 2D (or 3D from CUDA 4.0).
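For a 2D arrangement, each thread combines the x and y components of its
block and thread indices; a sketch for a hypothetical width × height matrix
stored in row-major order:

__global__ void scale2D(float *M, float c, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        M[row * width + col] *= c;  // row-major linearization
}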

Block and Thread Organization

Figure 2.14: Blocks in the Grid, Threads in a Block (NVIDIA 2015a).


Blocks on different types of GPUs

Figure 2.15: Blocks on different types of GPUs (NVIDIA 2015a).

How many blocks shall be used in the example?

• Within a block, threads are organized in warps of 32 threads each.
  Assigning 32 threads (or a multiple thereof) to a block minimizes the
  effects of memory latency.
• At the same time, multiprocessors should not be idle, so the work should
  be spread over enough blocks.
• Reasonable options for a 15 × 15 matrix:
  15 blocks of 15 threads each, or 9 blocks of 25 threads each.
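In general, the number of blocks for n elements and a chosen block size is
obtained by rounding up; the common integer idiom (here for our 225 elements
and 25 threads per block):

int threadsPerBlock = 25;
int numBlocks = (225 + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division: 9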

Invoke the Kernel and Read the Result
In the CUDA C runtime API, a triple angle bracket notation (<<< >>>) is
used to define the grid and block layout. The first argument in angle
brackets defines the number of blocks, the second argument defines the
number of threads per block.

multMatrix<<<9, 25>>>(deviceMemPtrSrc, 2.0f,
    deviceMemPtrDest);

or alternatively:

dim3 DimGrid(9, 1, 1);
dim3 DimBlock(25, 1, 1);
multMatrix<<<DimGrid, DimBlock>>>(deviceMemPtrSrc, 2.0f,
    deviceMemPtrDest);
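Kernel launches are asynchronous with respect to the host; the subsequent
cudaMemcpy() to the host implicitly waits for the kernel. To wait explicitly
and check for launch errors, a minimal sketch:

cudaDeviceSynchronize();  // block the host until the kernel has finished
cudaError_t err = cudaGetLastError();  // report launch/execution errors
if (err != cudaSuccess)
    printf("%s\n", cudaGetErrorString(err));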

Drawbacks of CUDA-based Approach

• A special compiler has to be used, so the integration into your favorite
  IDE may be difficult.
• CUDA only works with NVIDIA GPUs. OpenCL provides a
  platform-independent alternative, but has so far been less widely used
  (and less heavily marketed) than CUDA.
• Meanwhile, ATI/AMD does the marketing for OpenCL . . .
