
Visual Computing – GPU Computing

CUDA Basics

Frauke Sprengel
Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL

3 GPU Computing Using CUDA

Graphics Rendering Pipeline
[Block diagram: vertex data passes through vertex shading and per-vertex operations, primitive assembly and rasterization, and fragment shading and per-fragment operations into the framebuffer; pixel data enters via pixel pack/unpack and texture memory; transform feedback loops back from the vertex stage.]

Figure 2.1: Graphics Rendering Pipeline in the OpenGL specification (Segal and Akeley 2015).

Interesting Pipeline Stages

Target pipeline stages for programmable shaders:

1. Vertex shading
2. Fragment shading
3. Geometry shading, i. e., operations on assembled primitives (points, line
segments, triangles) following the per-vertex operations

Unified Shader Architecture

Figure 2.2: Unified shader architecture (Kirk and Hwu 2010).

Unified Shader Architecture
Since 2006, GPUs have used a unified shader architecture: the same
processors are used for vertex, geometry, and fragment shading in a
three-step loop. The preceding figure shows the unified shader architecture of
the NVIDIA GeForce 8800 GTX with 16 streaming multiprocessors (SMs),
each consisting of 8 streaming processors (SPs), thus summing up to
128 processor cores (called CUDA cores by NVIDIA).
The three loop steps are denoted as vertex (vtx), geometry, and pixel thread
issue; the streaming processors (SPs) are drawn in green. In addition to the
SPs, the GPU possesses texture filtering units (TFs) and render output units
(or raster operation pipelines, ROPs), drawn in blue, the latter located
between the multiprocessors and the frame buffers (FBs). Finally, the orange
blocks denoted as L1 and L2 represent shared (local per multiprocessor) and
device (global) memory, respectively.

GPU Generations in the Graphics Lab / Normal Labs
Room 1H.2.36/46

• Quadro 600 (GF108GL, Fermi, 2010), 64 CUDA cores, 245 GFlop/s,
  25.6 GB/s, CUDA capability 2.1

Room 1H.2.06/38 (computers 2017)

• Quadro P600 (GP107, Pascal, 2017), 384 CUDA cores, 1196 GFlop/s,
  64 GB/s, CUDA capability 6.1

Graphics lab 1H.2.30 (computers 2015)

• mostly Quadro M4000 (GM204, Maxwell 2), 1664 CUDA cores,
  2572 GFlop/s, 192 GB/s, CUDA capability 5.2
• one Quadro M6000 (GM200, Maxwell 2), 3072 CUDA cores,
  6070 GFlop/s, 317 GB/s, CUDA capability 5.2
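The compute capability and core count of the installed GPU can be queried at
run time, anticipating the CUDA runtime API introduced later; a minimal
sketch (assuming device 0 is the GPU of interest):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}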

Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL
   Example Program
   Evaluation

3 GPU Computing Using CUDA

GPGPU Using GLSL

Task: multiply a 15 × 15 matrix A with a scalar c,

B = A · c,   A, B ∈ R^(15×15), c ∈ R

Method:

• Create a fragment shader for the matrix-scalar multiplication.
• Pass the matrix A as a 5 × 15 RGB texture to the GPU.
• Pass the scalar c as a uniform parameter to the GPU.
• Draw a rectangle, thus invoking the shader.
• Read the result B from the framebuffer.

Create the Fragment Shader

A fragment shader is used for the computation because it provides convenient
access to the texture and allows writing the result directly to the framebuffer.
In the vertex shader, the projection transformation has to be applied to the
vertices.

uniform float c;
uniform sampler2DRect texUnit;

void main() {
    vec4 matrixValue = texture2DRect(texUnit,
        gl_TexCoord[0].xy);
    gl_FragColor = vec4(matrixValue.xyz * c, 1.0);
}

Initialize OpenGL

OpenGL has to be initialized with a viewport and an orthographic projection
which ensure that the texture and framebuffer pixels match exactly, in order
to avoid interpolation of texture values.

glViewport(0, 0, 5, 15);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 5, 0, 15, -1, 1);

GLuint program = glCreateProgram();
// compile and link shader ...

Create the Texture

In the next step the texture is created and filled with the matrix to be
multiplied. Since an RGB texture is used, three matrix entries are mapped to
one texture pixel. For other matrix sizes, an RGBA texture with a four to one
mapping might fit better.
float matrix[15 * 15];
// define matrix ...
GLuint texture;
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0,
    GL_RGB32F_ARB, 5, 15, 0, GL_RGB, GL_FLOAT, matrix);

Pass Parameters to Shader

Passing the texture and the scalar (as uniform parameter) to the shader is
straightforward.

glUseProgram(program); // program must be active before setting uniforms
GLint texUnitId = glGetUniformLocation(program, "texUnit");
glUniform1i(texUnitId, 0);

float c = 2.0f;
GLint cId = glGetUniformLocation(program, "c");
glUniform1f(cId, c);

Invoke the Shader

In order to invoke the shader, something has to be rendered. The simplest
geometric object is a rectangle in the z = 0 plane exactly filling the viewport.

glBegin(GL_QUADS);
glTexCoord2f(0,  0); glVertex2f(0,  0);
glTexCoord2f(5,  0); glVertex2f(5,  0);
glTexCoord2f(5, 15); glVertex2f(5, 15);
glTexCoord2f(0, 15); glVertex2f(0, 15);
glEnd();
glFinish();

Read the Result

Finally, the result can be read from the framebuffer.

float result[15 * 15];
glReadPixels(0, 0, 5, 15, GL_RGB, GL_FLOAT, result);

Drawbacks of GLSL-based Approach

• Inconvenient “API” working with textures and OpenGL rendering
• Inflexible programming language focused on the graphics rendering pipeline
• Unnecessary computational overhead:
  • vertex shading, projection, primitive assembly
  • clipping, perspective division, rasterization
  • texture filtering, depth buffer operations

Outline
1 GPU Architecture and Shader Programming

2 GPGPU Using GLSL

3 GPU Computing Using CUDA
   CUDA: Compute Unified Device Architecture
   Example Program
   Example Program - Data Transfer
   Example Program - Defining the CUDA kernel
   Programming Model
   Invoking the Kernel - Example Program
   Evaluation
CUDA: Compute Unified Device Architecture

CUDA stands for Compute Unified Device Architecture and was
introduced by NVIDIA in November 2006 as a C-based computing language
for NVIDIA GPUs. In recent years, CUDA has been promoted by NVIDIA with
massive marketing efforts, including the GPU Technology Conference (GTC),
which is mostly devoted to CUDA. As a result, CUDA (i. e., CUDA C) is better
known and more widely used than any other GPU computing language. Today
the CUDA toolkit comprises C- and Fortran-based computing languages as
well as NVIDIA’s implementations of OpenCL and DirectCompute.

CUDA: Compute Unified Device Architecture

Figure 2.3: CUDA C is one of the APIs for GPU computing (NVIDIA 2015a).

CUDA C Compiler
Compiling a CUDA Program

[Compilation flow: integrated C programs with CUDA extensions are fed to the NVCC compiler, which splits them into host code, handled by the host C compiler/linker, and device code (PTX), handled by the device just-in-time compiler, targeting a heterogeneous computing platform with CPUs, GPUs, etc.]

Figure 2.4: CUDA C compilation process (Nvidia 2016a).
CUDA C Compiler

The CUDA C compiler nvcc is used to compile and link host (CPU) and
device (GPU) code:

• CUDA C is a subset of ANSI C (and C++ for latest GPUs) with special
CUDA extensions.
• For host code (.c/.cpp/.cxx files) the usual compiler and linker of the
operating system (GCC or Visual Studio) are called.
• Device code (.cu files) is compiled into PTX assembly form and/or cubin
binary code, which is finally linked with the host code.
• Debugging is possible via the Nsight IDE.

CUDA C Compiler Options for Compute Capability

nvcc has two options to define the compute capability of the target GPU (cf.
appendix A of the CUDA C Programming Guide (NVIDIA 2015a)):

• -arch: compatibility of the generated PTX assembly code, e. g.,
  -arch=compute_52 for compute capability 5.2
• -code: compatibility of the generated cubin binary code, e. g.,
  -code=sm_52 for compute capability 5.2
• PTX code compiled for a specific compute capability can be compiled to
  cubin code of equal or higher compute capability.
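For example, a typical invocation that compiles for compute capability 5.2
with both options set could look like this (the file name multmatrix.cu is
hypothetical):

nvcc -arch=compute_52 -code=sm_52 -o multmatrix multmatrix.cu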

CUDA C APIs

CUDA C provides two different APIs:

Driver API Low-level API loading compiled PTX or cubin code, similar to
OpenCL.
Runtime API High-level API that allows defining and launching kernels from
within the host code.

In the following, the runtime API will be used.

Host and Device

• Heterogeneous host (CPU) + device (GPU) application C program
  • Serial parts in host C code
  • Parallel parts in device SPMD kernel code

[Execution alternates: serial code (host), then a parallel kernel (device) launched as KernelA<<< nBlk, nTid >>>(args);, then serial code (host) again, then KernelB<<< nBlk, nTid >>>(args);, and so on.]

Figure 2.5: The execution of a CUDA program is shared between the host (CPU) for serial code and the device (GPU) for parallel kernels (Nvidia 2016a).
GPU Computing Using CUDA

Task: multiply a 15 × 15 matrix A with a scalar c,

B = A · c,   A, B ∈ R^(15×15), c ∈ R

Method:

• Device code (GPU)
  • Create a CUDA kernel for the matrix-scalar multiplication.
• Host code (CPU)
  • Part 1a: Allocate GPU memory for source and destination matrices.
  • Part 1b: Pass the matrix A to the GPU.
  • Part 2: Invoke the kernel with the scalar c.
  • Part 3: Read the result B from the GPU.

Figure 2.6: Data transfer between CPU and GPU (Nvidia 2016a).
CUDA Memories

Partial overview of CUDA memories:

• Device code can:
  • R/W per-thread registers
  • R/W all-shared global memory
• Host code can:
  • transfer data to/from per-grid global memory

Figure 2.7: Partial Overview of CUDA Memories (Nvidia 2016a).

We will cover more memory types and more sophisticated memory models later.
CUDA Device Memory Management API functions

• cudaMalloc()
  • Allocates an object in the device global memory
  • Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
• cudaFree()
  • Frees an object from the device global memory
  • One parameter:
    • Pointer to the freed object

Figure 2.8: CUDA Device Memory Management API functions (Nvidia 2016a).
Host-Device Data Transfer API functions

• cudaMemcpy()
  • Memory data transfer
  • Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  • Transfer to device is asynchronous

Figure 2.9: Host-Device Data Transfer API functions (Nvidia 2016a).
Allocate Memory and Pass the Matrix to the Device

Host Code Part 1:

float matrix[15 * 15];
// define matrix ...

size_t memSize = 15 * 15 * sizeof(float);
float *deviceMemPtrSrc = NULL;
cudaMalloc(&deviceMemPtrSrc, memSize);
float *deviceMemPtrDest = NULL;
cudaMalloc(&deviceMemPtrDest, memSize);

cudaMemcpy(deviceMemPtrSrc, matrix, memSize,
    cudaMemcpyHostToDevice);

Read the Result from the Device

After the computation (done by the device), the result can be copied from
the device memory to the host memory.
Host Code Part 3:
float resultMatrix[15 * 15];
cudaMemcpy(resultMatrix, deviceMemPtrDest, memSize,
    cudaMemcpyDeviceToHost);
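When the results are no longer needed on the device, the allocated memory
should be released with cudaFree() (cf. the memory management API above);
for this example:

cudaFree(deviceMemPtrSrc);
cudaFree(deviceMemPtrDest);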

Create the CUDA Kernel

A kernel is a program section that is applied concurrently to multiple data
elements on the GPU.
We define a CUDA kernel, indicated by the function type specifier
__global__.

__global__
void multMatrix(const float *A,
                float c, float *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = A[i] * c;
}
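The launch configuration used later (9 blocks of 25 threads) creates exactly
15 · 15 = 225 threads, so every index i is valid here. For sizes that do not
exactly match the grid, a bounds check is the usual safeguard; a sketch with a
hypothetical element count n:

__global__
void multMatrixN(const float *A, float c, float *B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard threads beyond the end of the data
        B[i] = A[i] * c;
}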

Function Type Specifiers

In CUDA C there are three types of functions, defined by the following
specifiers:

• __global__: executed on the device, callable from the host only
• __device__: executed on the device, callable from the device only
• __host__: executed on the host, callable from the host only (default,
  specifier can be omitted)
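A brief sketch illustrating the first two specifiers (scale() is a
hypothetical helper):

__device__ float scale(float x, float c) {  // callable from device code only
    return x * c;
}

__global__ void multMatrix2(const float *A, float c, float *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    B[i] = scale(A[i], c);  // __global__ code may call __device__ functions
}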

Programming Model

SPMD architecture: single program, multiple data

Differences to SIMD (single instruction, multiple data):

• Programs may diverge, e. g., by using conditional sections (if clauses).
• Synchronization of threads running on the same multiprocessor is possible.
• Threads have access to shared memory within the multiprocessor and to
  global device memory.
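A sketch of these aspects (kernel and buffer names are hypothetical; it
assumes one element per launched thread and at most 256 threads per block):
threads may take different branches, and __syncthreads() acts as a barrier for
all threads of a block:

__global__ void spmdExample(float *data) {
    __shared__ float buf[256];  // shared memory, one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)  // threads of one block may diverge ...
        buf[threadIdx.x] = data[i];
    else
        buf[threadIdx.x] = -data[i];
    __syncthreads();  // ... and re-synchronize at a barrier
    data[i] = buf[threadIdx.x];
}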

Hardware Model

Figure 2.10: The CUDA hardware model. Details on memory usage will be
treated in the next chapter. (CUDA Programming Guide Version 2.3, 2009)

A Thread as a Von-Neumann Processor

[Diagram: a thread abstracts a Von-Neumann processor with memory, I/O, a processing unit (ALU, register file), and a control unit (PC, IR).]

Figure 2.11: A thread is a “virtualized” or “abstracted” Von-Neumann processor (Nvidia 2016a).

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads.
  • All threads in a grid run the same kernel code (single program,
    multiple data).
  • Each thread has indices that it uses to compute memory addresses and
    make control decisions.

[Diagram: threads 0, 1, 2, …, 254, 255 each execute]

i = blockIdx.x * blockDim.x + threadIdx.x;
B[i] = A[i] * c;

Figure 2.12: Parallel execution, indices of threads and blocks (Nvidia 2016a).

Thread Blocks in the Grid

[Diagram: the thread array is divided into Thread Block 0, Thread Block 1, …, Thread Block N−1; within each block, threads 0 … 255 execute
i = blockIdx.x * blockDim.x + threadIdx.x; B[i] = A[i] * c;]

Figure 2.13: Threads in blocks (Nvidia 2016a).

• Divide the thread array into multiple blocks.
• Threads within a block cooperate via shared memory, atomic operations,
  and barrier synchronization.
• Threads in different blocks do not interact.
Thread Organization

Block Group of threads which are assigned to a single streaming
multiprocessor.
  • May consist of up to 1024 threads.
  • Threads are arranged in 1D, 2D, or 3D.
Grid Group of blocks which execute a single kernel.
  • Blocks are distributed to different multiprocessors.
  • Blocks are arranged in 1D, 2D (or 3D from CUDA 4.0).
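For a 2D arrangement, each thread combines the x and y components of its
block and thread indices; a sketch for a hypothetical width × height matrix
stored in row-major order:

__global__ void scale2D(float *M, float c, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        M[row * width + col] *= c;  // row-major linearization
}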

Block and Thread Organization

Figure 2.14: Blocks in the Grid, Threads in a Block (NVIDIA 2015a).


Blocks on different types of GPUs

Figure 2.15: Blocks on different types of GPUs (NVIDIA 2015a).

How many blocks shall be used in the example?

• Within a block, threads are organized in warps of 32 threads each.
  Assigning 32 threads (or a multiple thereof) to a block minimizes the
  effects of memory latency.
• At the same time, multiprocessors should not be idle, so the work should
  be spread over enough blocks.
• Reasonable options for a 15 × 15 matrix:
  15 blocks of 15 threads each, or 9 blocks of 25 threads each.
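In general, the number of blocks for n elements and a chosen block size is
obtained by rounding up; the common integer idiom (here for our 225 elements
and 25 threads per block):

int threadsPerBlock = 25;
int numBlocks = (225 + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division: 9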

Invoke the Kernel and Read the Result
In the CUDA C runtime API, a triple angle bracket notation (<<< >>>) is
used to define the grid and block layout. The first argument in angle
brackets defines the number of blocks, the second argument defines the
number of threads per block.

multMatrix<<<9, 25>>>(deviceMemPtrSrc, 2.0f,
    deviceMemPtrDest);

or alternatively:

dim3 DimGrid(9, 1, 1);
dim3 DimBlock(25, 1, 1);
multMatrix<<<DimGrid, DimBlock>>>(deviceMemPtrSrc, 2.0f,
    deviceMemPtrDest);
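Kernel launches are asynchronous with respect to the host; the subsequent
cudaMemcpy() to the host implicitly waits for the kernel. To wait explicitly
and check for launch errors, a minimal sketch:

cudaDeviceSynchronize();  // block the host until the kernel has finished
cudaError_t err = cudaGetLastError();  // report launch/execution errors
if (err != cudaSuccess)
    printf("%s\n", cudaGetErrorString(err));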

Drawbacks of CUDA-based Approach

• A special compiler has to be used, so the integration into your favorite
  IDE may be difficult.
• CUDA only works with NVIDIA GPUs. OpenCL provides a
  platform-independent alternative, but has so far been less widely used
  (and less heavily marketed) than CUDA.
• Meanwhile, ATI/AMD does the marketing for OpenCL . . .
