GPU Architecture & Implications
David Luebke
NVIDIA Research
GPU Architecture
1.5 GB DRAM
76 GB/s peak bandwidth
800 MHz GDDR3 clock
384-pin DRAM interface
[Block diagram: input assembler feeding eight clusters of thread processors, each cluster with per-block shared memory (PBSM), plus load/store units to DRAM]
Processing elements
8 scalar thread processors (SP) per SM
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32 KB) per SM
½ MB total register file space!
SP usual ops: float, int, branch, …

Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total

Shared memory
16 KB on-chip memory per SM
low latency storage
shared amongst threads of a block
supports thread communication

[Diagram: an SM with multithreaded instruction unit (MT IU), 8 SPs, and shared memory, running threads t0 t1 … tB]
© NVIDIA Corporation 2007
Goal: Scalability
Scalable execution
Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling
[Diagram: a row of SMs (MT IU + SPs + shared memory), all attached to device memory]
Goal: easy to program
Strategies:
Familiar programming language mechanics
C/C++ with small extensions
Simple parallel abstractions
Simple barrier synchronization
Shared memory semantics
Hardware-managed hierarchy of threads
Hardware Multithreading
Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is (basically) free – every cycle
Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance
[Diagram: SMs (I-cache, SPs, shared memory) attached to device memory; host memory connected over PCIe]
Myths of GPU Computing

                                    G80                            SSE                                   IBM Altivec                  Cell SPE
Rounding modes for FADD and FMUL    round to nearest and to zero   all 4 IEEE: nearest, zero, inf, -inf  round to nearest only        round to zero/truncate only
Denormal handling                   flush to zero                  supported, 1000’s of cycles           supported, 1000’s of cycles  flush to zero
Reciprocal estimate accuracy        24 bit                         12 bit                                12 bit                       12 bit
Reciprocal sqrt estimate accuracy   23 bit                         12 bit                                12 bit                       12 bit
David Luebke
dluebke@nvidia.com
Applications & Sweet Spots
GPU Computing Sweet Spots
Applications:
High bandwidth:
Sequencing (virus scanning, genomics), sorting, database…
Visual computing:
Graphics, image processing, tomography, machine vision…
[Application areas: Computational Geoscience, Computational Chemistry, Computational Medicine, Computational Modeling, Computational Science, Computational Biology, Computational Finance, Image Processing]
GPU-enabled clusters
A 100x or better speedup changes the science
Solve at different scales
Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
New Applications
Real-time options implied volatility engine
Ultrasound imaging
Manifold 8 GIS
Performance:
BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS

How:
Leveraging shared memory
GPU memory bandwidth
GPU GFLOPS performance
Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …
Software:
Scalable data-parallel execution/memory model
Program the GPU in C with minimal yet powerful extensions
The (legacy) graphics pipeline:
Application → Vertex Program → Rasterization → Pixel (Fragment) Program → Display
“Programs” created with raster operations
Each programmable stage reads Input Registers, Texture, and Constants, and writes Output Registers via temporary Registers
Limited texture size and dimension
[Diagram: the pipeline stages repeated with per-stage register and texture detail]
Global Memory
Features:
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support
[Diagram: a thread, identified by thread number, addressing registers, texture, and global memory]
Parallel Data Cache
Features:
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers
[Diagram: threads in a block sharing the parallel data cache alongside registers, texture, and global memory]
Example Algorithm - Fluids
Goal: Calculate PRESSURE in a fluid
Pressure depends on neighbors:
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
[Diagram: each thread computes Pn’ = P1 + P2 + P3 + P4; the neighbor pressures are staged once in the parallel data cache (shared data) instead of each thread fetching them separately from DRAM/video memory, so a single thread can read the others’ results out of cache – parallel execution through cache]
Parallel Data Cache
Problems of stream computing:
The data are far from the FLOPS; video RAM latency is high
Threads can only communicate their results through this high-latency video memory
Multiple passes through video memory
[Diagram: without the cache, each ALU recomputes Pn’ = P1 + P2 + P3 + P4, re-fetching P1…P4 from video memory on every pass]
[Diagrams: a unified shader core time-slicing shaders A…D; hardware organized as multiprocessors 1…N with texture units and memory]
CUDA compilation flow:
NVIDIA C Compiler emits NVIDIA Assembly for Computing plus CPU Host Code
GPU path: the assembly is consumed by the CUDA Debugger, Profiler, and Driver, and runs on the GPU
CPU path: the host code is built with a Standard C Compiler and runs on the CPU
Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }
Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];
Execution configuration:
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
CPU program:

void addMatrix(float *a, float *b, float *c, int N)
{
  int i, j, index;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      index = i + j * N;
      c[index] = a[index] + b[index];
    }
  }
}

void main()
{
  .....
  addMatrix(a, b, c, N);
}

CUDA program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int index = i + j * N;
  if (i < N && j < N)
    c[index] = a[index] + b[index];
}

void main()
{
  ..... // allocate & transfer data to GPU
  dim3 dimBlk(blocksize, blocksize);
  dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
  addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
Example: Vector Addition Kernel
Pre-Hibernation
First proposed in APL by Iverson (1962)
Used as a data parallel primitive in the Connection Machine (1990)
Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, “Prefix Sums and Their Applications”

Post-Democratization
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
Applied to Summed Area Tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
O(n) work & space GPU implementation by Harris et al. (2007)
NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables
Parallel Reduction Complexity
[Charts (figure residue): data growth measured in millions and in petabytes, 2002–2017. Sources: Alexa, YouTube 2006; Hedburg, CPI, Walmart; Jim Farnsworth, BP May 2005; John Bates, NOAA Nat. Climate Center]
Computational Horsepower
[Bar chart: “Pacemaker with Transmit Antenna” simulation performance (Mcells/s), axis 0–700: CPU 3.2 GHz = 1X, 1 GPU = 5X, 2 GPUs = 10X, 4 GPUs = 20X]
EvolvedMachines
130X speed up
Simulate brain circuitry
Sensory computing: vision, olfactory

MATLAB: 168 seconds
MATLAB: 992 seconds
240X speedup
Computational biology
Biomolecular simulations attempt to replicate in vivo conditions in silico
Model structures are initially constructed in vacuum
Solvent (water) and ions are added as necessary for the required biological conditions
Computational requirements scale with the size of the simulated structure
http://www.nvidia.com/page/professor_partnership.html