
Master SDTW/DSWT – Year II

Visualization in Distributed Systems (VSD)

Conf. dr. ing. Simona Caraiman


VSD – Lecture 8

GPU Programming (II)

CUDA



CUDA
 “Compute Unified Device Architecture”
 General purpose programming model
 User kicks off batches of threads on the GPU
 GPU = dedicated super-threaded, massively data parallel co-processor
 Targeted software stack
 Compute oriented drivers, language, and tools
 Driver for loading computation programs into GPU
 Standalone Driver - Optimized for computation
 Interface designed for compute – graphics-free API
 Data sharing with OpenGL buffer objects
 Guaranteed maximum download & readback speeds
 Explicit GPU memory management



CUDA
 Co-designed hardware and software to expose the
computational power of NVIDIA GPUs for GPU computing

 Software
 small set of extensions to C language
 low learning curve

 Hardware
 shared memory – scalable thread cooperation





GPU Architecture – Tesla (2007)
GeForce 8800:
 16 highly threaded SMs
 128 FPUs, 367 GFLOPS
 768 MB DRAM
 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Block diagram: Host, Input Assembler, Thread Execution Manager, SMs with Parallel Data Caches and Texture units, Load/Store units, Global Memory]


GPU Architecture – Fermi (2010)
 ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
 230 GB/s DRAM bandwidth
 512 CUDA cores
 NVIDIA Parallel DataCache technology
 NVIDIA GigaThread™ engine
 ECC support


GPU Architecture – Kepler (2012)
 4 SMX (neXt generation streaming multiprocessor)
 192 cores/SMX
 3x performance/Watt
 dynamic parallelism
 Hyper-Q


GPU Architecture – Maxwell (2014)
 More efficient SM (SMM)
 Increased L2 cache
 More shared memory (64 KB)
 2x performance/Watt
 fast shared memory atomics


GPU Architecture – Pascal (2016)
 128 cores/SM
 HBM2 memory (3D stacked DRAM)
 Unified memory
 NVLink (new interconnect architecture)
 More registers
 More shared memory


GPU Architecture – Volta (2017)
Each SM contains:
 64 FP32 cores for single-precision arithmetic operations
 32 FP64 cores for double-precision arithmetic operations
 64 INT32 cores for integer math
 8 mixed-precision Tensor Cores for deep learning matrix arithmetic
 16 special function units for single-precision floating-point transcendental functions
 4 warp schedulers


CUDA Tools
Compiling a CUDA Program
 Parallel Thread eXecution (PTX)
 Virtual Machine and ISA
 Programming model
 Execution resources and state
 Example – CUDA C source:
float4 me = gx[gtid];
me.x += me.y * me.z;
 compiled to PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
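As a minimal sketch (the kernel and the file name kernel.cu are illustrative, not from the slides), the PTX that nvcc emits for a kernel can be inspected with nvcc -ptx kernel.cu, which writes kernel.ptx:

// kernel.cu (illustrative)
__global__ void compute(float4 *gx) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    float4 me = gx[gtid];     // becomes a vector load, e.g. ld.global.v4.f32
    me.x += me.y * me.z;      // becomes a multiply-add, e.g. mad.f32
    gx[gtid] = me;
}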



CUDA Tools
Compilation
 Any source file containing CUDA language
extensions must be compiled with NVCC
 NVCC is a compiler driver
 Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, …
 NVCC outputs:
 C code (host CPU Code)
 Must then be compiled with the rest of the application using
another tool
 PTX
 Object code directly
 Or, PTX source, interpreted at runtime



CUDA Tools
Linking

 Any executable with CUDA code requires two dynamic libraries:
 The CUDA core library (cuda)
 The CUDA runtime library (cudart)
 needed if the runtime API is used
 loads the cuda library
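A minimal sketch of host code that uses the runtime API and therefore pulls in cudart (nvcc links it automatically; the buffer name and size are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                    // runtime API call, provided by cudart
    printf("CUDA devices: %d\n", count);

    float *d_buf = NULL;
    if (cudaMalloc((void**)&d_buf, 1024 * sizeof(float)) == cudaSuccess) {
        cudaFree(d_buf);                           // release the device allocation
    }
    return 0;
}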



CUDA Tools
Debugging Using the Device Emulation Mode
 An executable compiled in device emulation mode
(nvcc -deviceemu) runs completely on the host using the
CUDA runtime
 No need for any device or CUDA driver
 Each device thread is emulated with a host thread

 Running in device emulation mode, one can:
 Use host native debug support (breakpoints, inspection, etc.)
 Access any device-specific data from host code and vice-versa
 Call any host function from device code (e.g. printf) and vice-versa
 Detect deadlock situations caused by improper usage of __syncthreads



CUDA Tools
Device Emulation Mode Pitfalls

 Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
 Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
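A small sketch of the second pitfall (buffer name and sizes are illustrative):

#include <cuda_runtime.h>

int main() {
    float *d_data = NULL;                            // pointer into device memory
    cudaMalloc((void**)&d_data, 16 * sizeof(float));
    // d_data[0] = 1.0f;                             // host-side dereference of a device pointer:
                                                     // may appear to work in emulation mode only
    float h_value = 1.0f;
    cudaMemcpy(d_data, &h_value, sizeof(float),
               cudaMemcpyHostToDevice);              // the correct way: an explicit copy
    cudaFree(d_data);
    return 0;
}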



CUDA workflow
 get a CUDA-enabled GPU

 write C/C++-like code (*.cu)

 compile with the CUDA compiler (nvcc)
 generates PTX code (“Parallel Thread Execution”)

 application runs on the GPU
 many, many parallel threads
 the CUDA driver translates the PTX code into hardware instructions
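An end-to-end sketch of this workflow (the kernel and file name are illustrative; compiled with something like nvcc -o add add.cu and then simply run):

// add.cu (illustrative)
#include <cuda_runtime.h>

__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;                               // threads per block
    int blocks  = (n + threads - 1) / threads;       // enough blocks to cover all n elements
    add_one<<<blocks, threads>>>(d_data, n);         // launch many parallel threads on the GPU
    cudaDeviceSynchronize();                         // wait for the kernel to finish

    cudaFree(d_data);
    return 0;
}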



CUDA - overview
 CUDA C/C++ language extensions
 small set of extensions for writing kernels (subroutines that run multithreaded on the GPU)

 CUDA programming model


 for fine-grained data / thread parallelism
 thread group hierarchy
 shared memories
 synchronization barriers




CUDA – C/C++ Language Extensions
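A hedged sketch illustrating the main C/C++ extensions covered here: function and variable type qualifiers, the built-in index variables, and the kernel launch syntax (the kernel itself is illustrative, not taken from the slides):

__constant__ float gain = 2.0f;                     // __constant__: read-only device memory

__device__ float clampf(float v) {                  // __device__: runs on the GPU, callable from kernels
    return v < 0.0f ? 0.0f : v;
}

__global__ void scale(float *out, const float *in, int n) {   // __global__: a kernel, callable from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables: blockIdx, blockDim, threadIdx
    if (i < n)
        out[i] = clampf(in[i]) * gain;
}

// host side, execution configuration syntax:
//   scale<<<(n + 255) / 256, 256>>>(d_out, d_in, n);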



CUDA - overview
 CUDA C/C++ language extensions
 small set of extensions for writing kernels (subroutines that run multithreaded on the GPU)

 CUDA programming model


 for fine-grained data / thread parallelism
 thread group hierarchy
 shared memories
 synchronization barriers



CUDA – Programming model
 GPU = compute device
 coprocessor for the host CPU
 has its own device memory on the card
 executes many threads in parallel
 GPU threads are extremely lightweight
 very little creation overhead
 instant switching
 GPU expects 1000s of threads for full utilization
 GOAL: saturate the GPU
 threads are used to hide memory access latency
 multicore CPUs need only a few threads



CUDA – Programming model
Kernels and Threads

 Parallel portions of an application are executed on the device as kernels
 one kernel is executed at a time
 many threads execute each kernel

 Device = GPU; Host = CPU

 Kernel = function called from the host that runs on the device



CUDA – Programming model
Arrays of Parallel Threads
 A CUDA kernel is executed by an array of threads
 all threads run the same code
 each thread has an ID that it uses to compute memory
addresses and make control decisions
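For instance, a minimal sketch of a kernel in which each thread derives its array index from its ID (names are illustrative):

__global__ void times_two(float *out, const float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID of this thread in the grid
    if (tid < n)                                        // control decision based on the ID
        out[tid] = 2.0f * in[tid];                      // memory address computed from the ID
}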



CUDA – Programming model
Thread Cooperation
 purpose:
 share results to avoid redundant computation
 share memory accesses (drastic bandwidth reduction)

 CUDA allows thread cooperation
 cooperation within one monolithic array of threads is not scalable
 CUDA therefore allows cooperation within smaller batches of threads



CUDA – Programming model
Thread Cooperation
 a kernel is launched as a grid of thread blocks

 threads within a block cooperate via shared memory

 threads in different blocks cannot cooperate
 this allows programs to transparently scale to different GPUs
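A sketch of in-block cooperation through shared memory (assumes a block size of 256 and a data length that is a multiple of 256; the operation is illustrative):

__global__ void block_reverse(float *data) {
    __shared__ float buf[256];                         // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i];                        // each thread stages one element
    __syncthreads();                                   // barrier: all writes to buf are finished
    data[i] = buf[blockDim.x - 1 - threadIdx.x];       // read an element written by another thread
}

// launched, e.g., as block_reverse<<<numBlocks, 256>>>(d_data);
// threads in different blocks never see each other's buf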



CUDA – Programming model
Thread Allocation
 hardware is free to schedule thread blocks on any processor
 a kernel scales across parallel multiprocessors: thread blocks get scheduled round-robin based on the number of processors available in the device
 you need a sufficient number of blocks, at least as many as the number of SMs, to fill the pipe



CUDA – Programming model
Thread Block Size
 how big should the block size be?
 Rule 1: Fill the pipe
 More threads make the GPU happy – they hide memory latency
 NVIDIA Tesla (G80) can take up to 768 threads per SM
 Rule 2: Don't overfill the pipe
 Threads compete for resources – registers, shared memory
 You can't anyway – the maximum block size is limited;
 use cudaGetDeviceProperties() to obtain the maximum size of a thread block and grid supported by a particular device (usually 512), as in the sketch below.
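A small sketch of querying these limits at run time with the runtime API:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}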



CUDA – Programming model
Block IDs and Thread IDs
 each thread has access to
 dim3 gridDim; (grid dimensions; up to 2D only)
 dim3 blockDim; (block dimensions)
 dim3 blockIdx; (position of the current block within the grid)
 dim3 threadIdx; (position of the current thread within the block)



CUDA – Programming model
Block IDs and Thread IDs
 Block ID: 1D or 2D
 Thread ID: 1D, 2D, or 3D

 Simplifies memory addressing when processing multidimensional data
 image processing
 solving PDEs on volumes
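For example, a sketch of 2D indexing in an image-processing kernel (the image layout and the operation are illustrative):

__global__ void brighten(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;     // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;     // row
    if (x < width && y < height) {
        int idx = y * width + x;                       // 2D coordinates -> linear address
        img[idx] = (unsigned char)min(255, img[idx] + 10);
    }
}

// launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_img, width, height);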



CUDA – Programming model
SPA, SM, SP, SFU, warp

 SPA (Streaming Processor Array) is the SHADER.

 The SPA contains a few SMs (Streaming Multiprocessors).

 An SM consists of several SPs (Streaming Processors) and SFUs (Special Function Units, or "SuperFunk"), an instruction scheduler and a shared memory block.

 Ratios of SPs/SFUs may change from generation to generation.



CUDA – Programming model
Warps
 Unrelated to the CUDA programming model – this is purely a hardware notion
 Designed to support efficient divergence handling and latency hiding
 “SIMT” scheduling
 A block is divided into several warps
 In G80, a warp == 32 threads
 Each warp is executed without divergence
 In an if…then…else statement, if a branch is taken by at least one thread in the warp, all threads of the warp step through that path (threads that did not take it are masked off); divergent paths are serialized
 An SM can execute only ONE warp at a time
 __syncthreads() is not necessary for threads within the same warp
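A sketch of a kernel with a potentially divergent branch inside a warp (the threshold value and names are illustrative):

__global__ void threshold(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] > 0.5f)            // threads of one warp may disagree here
            data[i] = 1.0f;            // path A: the warp steps through it, non-taking threads masked
        else
            data[i] = 0.0f;            // path B: executed afterwards, the other threads masked
    }
}

// divergence only costs time within a warp; different warps are independent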



Next time
 CUDA Memory Model
 CUDA Programming Basics

