Programming GPUs with CUDA
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@rice.edu
COMP 422
Why GPUs?

GPGPU = General-Purpose computation using GPUs
- applications beyond 3D graphics
- typically, data-intensive science and engineering applications

Directly exploit:
- pixel shaders
- vertex shaders
- video memory
CUDA

CUDA = Compute Unified Device Architecture
- SM = Streaming Multiprocessor
- SP = Streaming Processor (a scalar core within an SM)
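As a first illustration of the model (a hedged sketch; the kernel name and sizes are illustrative, not from the slides): each block of threads is scheduled onto an SM, and each thread within a block executes on an SP.

```cuda
// Minimal CUDA kernel: blocks are scheduled onto SMs; each thread
// within a block runs on an SP (a CUDA core).
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final block
        c[i] = a[i] + b[i];
}

// Launch enough 256-thread blocks to cover n elements:
//   vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```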
Introducing Fermi
- 512 CUDA cores
- configurable L1 data cache
- 8x the peak double-precision performance of the prior Tesla generation
- IEEE 754-2008 floating-point standard
- GigaThread engine with concurrent kernel execution
- full C++ support
- unified address space
- debugger support
- ECC support
Figure Credit:
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Why CUDA?

Business rationale
- opportunity for NVIDIA to sell more chips
- extend the demand from graphics into HPC

Technical rationale
- hides the GPU architecture behind the programming API
  - programmers never write directly to the metal
  - insulates programmers from the details of GPU hardware
  - enables NVIDIA to change the GPU architecture completely and transparently
  - preserves the investment in CUDA programs
Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1
Minimal Extensions to C

cudaThreadSynchronize() forces the runtime to wait until all preceding device tasks have finished.

Within a kernel, declare dynamically sized shared memory as
    extern __shared__ int shared[];
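A hedged sketch of how such a declaration is used (kernel and buffer names are illustrative): the array's size in bytes is supplied as the third launch-configuration argument, and cudaThreadSynchronize() then blocks the host until the kernel completes.

```cuda
__global__ void reverseBlock(const int *in, int *out)
{
    // dynamically sized shared memory: the length comes from the third
    // launch-configuration argument, not from the declaration
    extern __shared__ int buf[];

    buf[threadIdx.x] = in[threadIdx.x];
    __syncthreads();                      // wait for all loads to complete
    out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

// Host side: one block of 128 threads, 128 * sizeof(int) bytes of
// shared memory, then wait for the kernel to finish:
//   reverseBlock<<<1, 128, 128 * sizeof(int)>>>(d_in, d_out);
//   cudaThreadSynchronize();
```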
Variable qualifiers
- __device__ is optional with __local__, __shared__, or __constant__
- automatic variables without any qualifier reside in a register
  - except arrays, which reside in local memory
Scratchpad memory

    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // compute on scratch values
    begin[threadIdx.x] = scratch[threadIdx.x];
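Note that if threads read each other's scratch entries, a barrier is needed between the load and the computation. A hedged sketch (BLOCKSIZE and the neighbor-sum computation are illustrative additions):

```cuda
#define BLOCKSIZE 256

__global__ void neighborSum(float *begin)
{
    __shared__ float scratch[BLOCKSIZE];          // on-chip scratchpad
    scratch[threadIdx.x] = begin[threadIdx.x];    // stage into shared memory
    __syncthreads();                              // required before reading
                                                  // another thread's element
    float sum = scratch[threadIdx.x] +
                scratch[(threadIdx.x + 1) % BLOCKSIZE];
    __syncthreads();                              // reads done before overwrite
    begin[threadIdx.x] = sum;                     // write results back
}
```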
Runtime Support

Memory management for pointers to GPU memory:
    cudaMalloc(), cudaFree()
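A typical host-side allocate/copy/free pattern (a sketch with error checking omitted; h_a and n are assumed host-side names):

```cuda
float *d_a;                                   // pointer to device memory
size_t bytes = n * sizeof(float);

cudaMalloc((void **)&d_a, bytes);             // allocate on the GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that operate on d_a ...
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(d_a);                                // release device memory
```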
Extended C Summary
Compiling CUDA
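Compilation is driven by nvcc, which separates device code from host code; a sketch of typical invocations (file names are illustrative):

```sh
# compile and link a CUDA source file in one step
nvcc -O2 -o vecadd vecadd.cu

# emit PTX intermediate code for inspection
nvcc -ptx vecadd.cu
```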
Optimization Considerations

Kernel optimizations
- make use of shared memory
- minimize use of divergent control flow
  - SIMD execution must follow all paths taken within a thread group
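To illustrate divergence (a kernel-body sketch; i, a, b, and out are assumed names): when threads of one 32-thread warp take different branch directions, the warp serializes both paths, whereas warp-aligned branching keeps each warp on a single path.

```cuda
// Divergent: even and odd threads of the same warp take different
// paths, so the warp executes both sides serially.
if (threadIdx.x % 2 == 0)  out[i] = a[i];
else                       out[i] = b[i];

// Warp-uniform: the branch granularity matches the 32-thread warp,
// so every thread in a warp takes the same path.
if ((threadIdx.x / 32) % 2 == 0)  out[i] = a[i];
else                              out[i] = b[i];
```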
CPU/GPU interaction
- maximize PCIe throughput
- use asynchronous memory copies
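Asynchronous copies let transfers overlap with computation; a hedged sketch using a CUDA stream (kernel, h_a, d_a, and the sizes are assumed names, and h_a must be pinned memory from cudaMallocHost):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// enqueue copy, kernel, and copy-back in one stream: they execute in
// order relative to each other but asynchronously to the host
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
kernel<<<blocks, threads, 0, stream>>>(d_a);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);                // wait for all three to finish
cudaStreamDestroy(stream);
```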
CUDA Resources
General information about CUDA
www.nvidia.com/object/cuda_home.html
OpenCL

Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.
- supports both task and data parallelism
- based on a subset of ISO C99 with extensions for parallelism
- numerics based on the IEEE 754 floating-point standard
- interoperates efficiently with graphics APIs, e.g., OpenGL

OpenCL is managed by the non-profit Khronos Group
- initial specification approved for public release Dec. 8, 2008
- specification 1.0.33 released Feb. 4, 2009
Summary

CPU
- single-threaded, serial instruction stream
- superscalar: manages pipelines for multiple functional units
- SIMD short vector operations: 3-4 operations per cycle

GPGPU
- uses GPU pixel shaders as general-purpose processors
- operates on data in video memory
- threads interact with each other through off-chip memory

CUDA
- automatically manages threads
- divides the data set into smaller chunks stored in on-chip memory
  - reduces the need to access off-chip memory
  - improves performance
References

- Patrick LeGresley. High Performance Computing with CUDA. Stanford University Colloquium, October 2008. http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf
- Vivek Sarkar. Introduction to General-Purpose Computation on GPUs (GPGPUs). COMP 635, September 2007.
- Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11. Dr. Dobb's Portal, http://www.ddj.com/architect/207200659, April 2008-March 2009.
- Tom Halfhill. Parallel Processing with CUDA. Microprocessor Report, January 2008.
- N. Govindaraju et al. A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations Using Graphics Processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf
- http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort
- http://en.wikipedia.org/wiki/OpenCL
- http://www.khronos.org/opencl
- http://www.khronos.org/files/opencl-quick-reference-card.pdf