
Master SDTW/DSWT – Year II

Visualization in Distributed Systems (VSD)

Conf. dr. ing. Simona Caraiman


VSD – Lecture 8

GPU Programming (II)

CUDA



CUDA
 “Compute Unified Device Architecture”
 General purpose programming model
 User kicks off batches of threads on the GPU
 GPU = dedicated super-threaded, massively data parallel co-processor
 Targeted software stack
 Compute oriented drivers, language, and tools
 Driver for loading computation programs into GPU
 Standalone Driver - Optimized for computation
 Interface designed for compute – graphics-free API
 Data sharing with OpenGL buffer objects
 Guaranteed maximum download & readback speeds
 Explicit GPU memory management



CUDA
 Co-designed hardware and software to expose the
computational power of NVIDIA GPUs for GPU computing

 Software
 small set of extensions to C language
 low learning curve

 Hardware
 shared memory – scalable thread cooperation





GPU Architecture – Tesla (2007)
GeForce 8800:
 16 highly threaded SMs
 128 FPUs, 367 GFLOPS
 768 MB DRAM
 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Block diagram: Host, Input Assembler, Thread Execution Manager, SMs with Parallel Data Caches and Texture units, Load/Store units, Global Memory]


GPU Architecture – Fermi (2010)
 ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
 230 GB/s DRAM bandwidth
 512 CUDA cores
 NVIDIA Parallel DataCache technology
 NVIDIA GigaThread™ engine
 ECC support


GPU Architecture – Kepler (2012)
 4 SMX (neXt generation streaming multiprocessor)
 192 cores/SMX
 3x performance/Watt
 dynamic parallelism
 Hyper-Q


GPU Architecture – Maxwell (2014)
 More efficient SM (SMM)
 Increased L2 cache
 More shared memory (64 KB)
 2x performance/Watt
 fast shared memory atomics


GPU Architecture – Pascal (2016)
 128 cores/SM
 HBM2 memory (3D stacked DRAM)
 Unified memory
 NVLink (new interconnect architecture)
 More registers
 More shared memory


GPU Architecture – Volta (2017)
Each SM contains:
 64 FP32 cores for single-precision arithmetic operations
 32 FP64 cores for double-precision arithmetic operations
 64 INT32 cores for integer math
 8 mixed-precision Tensor Cores for deep learning matrix arithmetic
 16 special function units for single-precision floating-point transcendental functions
 4 warp schedulers


CUDA Tools
Compiling a CUDA Program
 Parallel Thread eXecution (PTX)
 Virtual Machine and ISA
 Programming model
 Execution resources and state
 Example – CUDA C source:
float4 me = gx[gtid];
me.x += me.y * me.z;
 compiled to PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
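As a minimal sketch (the kernel and the file name kernel.cu are illustrative, not from the slides), the PTX that nvcc emits for a kernel can be inspected with nvcc -ptx kernel.cu, which writes kernel.ptx:

// kernel.cu (illustrative)
__global__ void compute(float4 *gx) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    float4 me = gx[gtid];     // becomes a vector load, e.g. ld.global.v4.f32
    me.x += me.y * me.z;      // becomes a multiply-add, e.g. mad.f32
    gx[gtid] = me;
}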



CUDA Tools
Compilation
 Any source file containing CUDA language
extensions must be compiled with NVCC
 NVCC is a compiler driver
 Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, …
 NVCC outputs:
 C code (host CPU Code)
 Must then be compiled with the rest of the application using
another tool
 PTX
 Object code directly
 Or, PTX source, interpreted at runtime



CUDA Tools
Linking

 Any executable with CUDA code requires two dynamic libraries:
 The CUDA core library (cuda)
 The CUDA runtime library (cudart)
 needed if the runtime API is used
 loads the cuda library
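A minimal sketch of host code that uses the runtime API and therefore pulls in cudart (nvcc links it automatically; the buffer name and size are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                    // runtime API call, provided by cudart
    printf("CUDA devices: %d\n", count);

    float *d_buf = NULL;
    if (cudaMalloc((void**)&d_buf, 1024 * sizeof(float)) == cudaSuccess) {
        cudaFree(d_buf);                           // release the device allocation
    }
    return 0;
}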



CUDA Tools
Debugging Using the Device Emulation Mode
 An executable compiled in device emulation mode
(nvcc -deviceemu) runs completely on the host using the
CUDA runtime
 No need for any device or CUDA driver
 Each device thread is emulated with a host thread

 Running in device emulation mode, one can:
 Use host native debug support (breakpoints, inspection, etc.)
 Access any device-specific data from host code and vice-versa
 Call any host function from device code (e.g. printf) and vice-versa
 Detect deadlock situations caused by improper usage of __syncthreads



CUDA Tools
Device Emulation Mode Pitfalls

 Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
 Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
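A small sketch of the second pitfall (buffer name and sizes are illustrative):

#include <cuda_runtime.h>

int main() {
    float *d_data = NULL;                            // pointer into device memory
    cudaMalloc((void**)&d_data, 16 * sizeof(float));
    // d_data[0] = 1.0f;                             // host-side dereference of a device pointer:
                                                     // may appear to work in emulation mode only
    float h_value = 1.0f;
    cudaMemcpy(d_data, &h_value, sizeof(float),
               cudaMemcpyHostToDevice);              // the correct way: an explicit copy
    cudaFree(d_data);
    return 0;
}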



CUDA workflow
 get a CUDA-enabled GPU

 write C/C++-like code (*.cu)

 compile with the CUDA compiler (nvcc)
 generates PTX code (“Parallel Thread Execution”)

 application runs on the GPU
 many, many parallel threads
 the CUDA driver translates the PTX code into hardware instructions
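An end-to-end sketch of this workflow (the kernel and file name are illustrative; compiled with something like nvcc -o add add.cu and then simply run):

// add.cu (illustrative)
#include <cuda_runtime.h>

__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;                               // threads per block
    int blocks  = (n + threads - 1) / threads;       // enough blocks to cover all n elements
    add_one<<<blocks, threads>>>(d_data, n);         // launch many parallel threads on the GPU
    cudaDeviceSynchronize();                         // wait for the kernel to finish

    cudaFree(d_data);
    return 0;
}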



CUDA - overview
 CUDA C/C++ language extensions
 small set of extensions for writing kernels (subroutines that run multithreaded on the GPU)

 CUDA programming model


 for fine-grained data / thread parallelism
 thread group hierarchy
 shared memories
 synchronization barriers




CUDA – C/C++ Language Extensions
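A hedged sketch illustrating the main C/C++ extensions covered here: function and variable type qualifiers, the built-in index variables, and the kernel launch syntax (the kernel itself is illustrative, not taken from the slides):

__constant__ float gain = 2.0f;                     // __constant__: read-only device memory

__device__ float clampf(float v) {                  // __device__: runs on the GPU, callable from kernels
    return v < 0.0f ? 0.0f : v;
}

__global__ void scale(float *out, const float *in, int n) {   // __global__: a kernel, callable from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables: blockIdx, blockDim, threadIdx
    if (i < n)
        out[i] = clampf(in[i]) * gain;
}

// host side, execution configuration syntax:
//   scale<<<(n + 255) / 256, 256>>>(d_out, d_in, n);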



CUDA - overview
 CUDA C/C++ language extensions
 small set of extensions for writing kernels (subroutines that run multithreaded on the GPU)

 CUDA programming model


 for fine-grained data / thread parallelism
 thread group hierarchy
 shared memories
 synchronization barriers



CUDA – Programming model
 GPU = compute device
 coprocessor for the host CPU
 has its own device memory on the card
 executes many threads in parallel
 GPU threads are extremely lightweight
 very little creation overhead
 instant switching
 GPU expects 1000s of threads for full utilization
 GOAL: saturate the GPU
 threads are used to hide memory access latency
 multicore CPUs need only a few threads



CUDA – Programming model
Kernels and Threads

 Parallel portions of an application are executed on the device as kernels
 one kernel is executed at a time
 many threads execute each kernel

 Device = GPU; Host = CPU

 Kernel = function called from the host that runs on the device



CUDA – Programming model
Arrays of Parallel Threads
 A CUDA kernel is executed by an array of threads
 all threads run the same code
 each thread has an ID that it uses to compute memory
addresses and make control decisions
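For instance, a minimal sketch of a kernel in which each thread derives its array index from its ID (names are illustrative):

__global__ void times_two(float *out, const float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID of this thread in the grid
    if (tid < n)                                        // control decision based on the ID
        out[tid] = 2.0f * in[tid];                      // memory address computed from the ID
}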



CUDA – Programming model
Thread Cooperation
 purpose:
 share results to avoid redundant computation
 share memory accesses (drastic bandwidth reduction)

 CUDA allows thread cooperation
 cooperation within one monolithic array of threads is not scalable
 CUDA therefore allows cooperation within smaller batches of threads



CUDA – Programming model
Thread Cooperation
 a kernel is launched as a grid of thread blocks

 threads within a block cooperate via shared memory

 threads in different blocks cannot cooperate
 this allows programs to transparently scale to different GPUs
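A sketch of in-block cooperation through shared memory (assumes a block size of 256 and a data length that is a multiple of 256; the operation is illustrative):

__global__ void block_reverse(float *data) {
    __shared__ float buf[256];                         // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i];                        // each thread stages one element
    __syncthreads();                                   // barrier: all writes to buf are finished
    data[i] = buf[blockDim.x - 1 - threadIdx.x];       // read an element written by another thread
}

// launched, e.g., as block_reverse<<<numBlocks, 256>>>(d_data);
// threads in different blocks never see each other's buf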



CUDA – Programming model
Thread Allocation
 hardware is free to schedule thread blocks on any processor
 a kernel scales across parallel multiprocessors: thread blocks get scheduled round-robin based on the number of processors available in the device
 you need a sufficient number of blocks, at least as many as the number of SMs, to fill the pipe



CUDA – Programming model
Thread Block Size
 how big should the block size be?
 Rule 1: Fill the pipe
 More threads make the GPU happy – they hide memory latency
 NVIDIA Tesla (G80) can take up to 768 threads per SM
 Rule 2: Don't overfill the pipe
 Threads compete for resources – registers, shared memory
 You can't anyway – the maximum block size is limited;
 use cudaGetDeviceProperties() to obtain the maximum size of a thread block and grid supported by a particular device (usually 512), as in the sketch below.
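A small sketch of querying these limits at run time with the runtime API:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}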



CUDA – Programming model
Block IDs and Thread IDs
 each thread has access to
 dim3 gridDim; (grid dimensions; up to 2D only)
 dim3 blockDim; (block dimensions)
 dim3 blockIdx; (position of the current block within the grid)
 dim3 threadIdx; (position of the current thread within the block)



CUDA – Programming model
Block IDs and Thread IDs
 Block ID: 1D or 2D
 Thread ID: 1D, 2D, or 3D

 Simplifies memory addressing when processing multidimensional data
 image processing
 solving PDEs on volumes
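For example, a sketch of 2D indexing in an image-processing kernel (the image layout and the operation are illustrative):

__global__ void brighten(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;     // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;     // row
    if (x < width && y < height) {
        int idx = y * width + x;                       // 2D coordinates -> linear address
        img[idx] = (unsigned char)min(255, img[idx] + 10);
    }
}

// launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_img, width, height);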



CUDA – Programming model
SPA, SM, SP, SFU, warp

 SPA (Streaming Processor Array) is the SHADER.

 The SPA contains a few SMs (Streaming Multiprocessors).

 An SM consists of several SPs (Streaming Processors) and SFUs (Special Function Units, or "SuperFunk"), an instruction scheduler and a shared memory block.

 Ratios of SPs/SFUs may change from generation to generation.



CUDA – Programming model
Warps
 Unrelated to the CUDA programming model – this is purely a hardware notion
 Designed to support efficient divergence handling and latency hiding
 “SIMT” scheduling
 A block is divided into several warps
 In G80, a warp == 32 threads
 Each warp is executed without divergence
 In an if…then…else statement, if a branch is taken by at least one thread in the warp, all threads of the warp step through that path (threads that did not take it are masked off); divergent paths are serialized
 An SM can execute only ONE warp at a time
 __syncthreads() is not necessary for threads within the same warp
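A sketch of a kernel with a potentially divergent branch inside a warp (the threshold value and names are illustrative):

__global__ void threshold(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] > 0.5f)            // threads of one warp may disagree here
            data[i] = 1.0f;            // path A: the warp steps through it, non-taking threads masked
        else
            data[i] = 0.0f;            // path B: executed afterwards, the other threads masked
    }
}

// divergence only costs time within a warp; different warps are independent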



Next time
 CUDA Memory Model
 CUDA Programming Basics

