GPU Architecture & Implications: David Luebke NVIDIA Research


GPU Architecture & Implications

David Luebke
NVIDIA Research
GPU Architecture

CUDA provides a parallel programming model

The Tesla GPU architecture implements this

This talk will describe the characteristics, goals, and implications of that architecture

© NVIDIA Corporation 2007


G80 GPU Implementation: Tesla C870

681 million transistors
470 mm² in 90 nm CMOS

128 thread processors
518 GFLOPS peak
1.35 GHz processor clock

1.5 GB DRAM
76 GB/s peak
800 MHz GDDR3 clock
384-pin DRAM interface

ATX form factor card
PCI Express x16
170 W max with DRAM
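
As a sanity check on those peak numbers (my arithmetic, assuming the usual G80 counting of a dual-issue MAD + MUL as 3 flops per SP per clock): 128 SPs × 1.35 GHz × 3 flops/clock ≈ 518 GFLOPS, and a 384-bit GDDR3 interface at 800 MHz (1.6 GT/s effective) delivers 48 bytes × 1.6 GT/s ≈ 76.8 GB/s.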
Block Diagram Redux
G80 (launched Nov 2006)
128 Thread Processors execute kernel threads
Up to 12,288 parallel threads active
Per-block shared memory (PBSM) accelerates processing

[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of Thread Processors, each with per-block shared memory (PBSM), connected by a load/store path to Global Memory]

© NVIDIA Corporation 2007


Streaming Multiprocessor (SM)

Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32 KB)
½ MB total register file space!
usual ops: float, int, branch, …

Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total

16 KB on-chip shared memory
low-latency storage
shared amongst the threads of a block
supports thread communication

[Diagram: SM with multithreaded instruction unit (MT IU), SPs, shared memory, and thread contexts t0, t1, …, tB]
© NVIDIA Corporation 2007
Goal: Scalability

Scalable execution
Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling

Hierarchical execution model
Decompose problem into sequential steps (kernels)
Decompose kernel into computing parallel blocks
Decompose block into computing parallel threads

Hardware distributes independent blocks to SMs as available (see the sketch below)
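
A minimal sketch of what that scalability means in practice (my example, not from the deck; the kernel and names are assumptions): the launch expresses the work as many independent blocks, and the hardware maps them onto however many SMs the particular GPU has.

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= s;                                // one element per thread
}

// Host side: the grid size depends only on the problem size, never on the GPU.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);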
Blocks Run on Multiprocessors

Kernel launched by host

[Diagram: the grid of thread blocks is distributed across the device processor array; each multiprocessor (MT issue unit, SPs, shared memory) executes blocks independently, and all access Device Memory]
Goal: easy to program

Strategies:
Familiar programming language mechanics
C/C++ with small extensions
Simple parallel abstractions
Simple barrier synchronization
Shared memory semantics
Hardware-managed hierarchy of threads
(see the sketch below)
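
A toy sketch of those abstractions (my example, not from the deck): the threads of one block stage data in shared memory, meet at a barrier, then read each other's values.

__global__ void reverseBlock(float *d, int n)
{
    __shared__ float s[256];   // per-block shared memory (assumes a single block with blockDim.x == n <= 256)
    int t = threadIdx.x;
    s[t] = d[t];               // each thread writes one element
    __syncthreads();           // barrier: all writes now visible to the block
    d[t] = s[n - 1 - t];       // read an element written by a different thread
}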
Hardware Multithreading

Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don't run until resources are available

Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is (basically) free – every cycle

Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance

[Diagram: SM with MT issue unit, SPs, and shared memory]

© NVIDIA Corporation 2007


Goal: Performance per millimeter

For GPUs, performance == throughput

Strategy: hide latency with computation, not cache
Heavy multithreading – already discussed by Kevin

Implication: need many threads to hide latency
Occupancy – typically need 128 threads/SM minimum
Multiple thread blocks/SM are good to minimize the effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT)
Balances performance with ease of programming
SIMT Thread Execution

Groups of 32 threads formed into warps
always executing the same instruction
shared instruction fetch/dispatch
some threads become inactive when the code path diverges
hardware automatically handles divergence

Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot

SIMT execution is an implementation choice
sharing control logic leaves more space for ALUs
largely invisible to the programmer
must understand it for performance, not correctness

[Diagram: SM with MT issue unit, SPs, and shared memory]
© NVIDIA Corporation 2007
SIMT Multithreaded Execution

Weaving: the original parallel thread technology is about 10,000 years old
Warp: a set of 32 parallel threads that execute a SIMD instruction

SM hardware implements zero-overhead warp and thread scheduling
Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads

Threads can execute independently
SIMD warp automatically diverges and converges when threads branch (see the sketch below)
Best efficiency and performance when the threads of a warp execute together
SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

[Diagram: the SM multithreaded instruction scheduler issues, over time, instructions from different ready warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]

© NVIDIA Corporation 2007
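
A small sketch of divergence within a warp (my example, not from the deck): threads that take different sides of a data-dependent branch are serialized by hardware until the paths reconverge; correctness is automatic, only throughput is affected.

__global__ void clampNegatives(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)        // lanes of one warp may disagree here
            x[i] = 0.0f;        // divergent path runs with the other lanes inactive
        // the warp reconverges after the branch
    }
}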
Memory Architecture

Direct load/store access to device memory
treated as the usual linear sequence of bytes (i.e., not pixels)

Texture & constant caches are read-only access paths

On-chip shared memory is shared amongst the threads of a block
important for communication amongst threads
provides low-latency temporary storage (~100x less than DRAM)

[Diagram: SM (MT issue unit, SPs, shared memory, instruction cache) with texture and constant caches over Device Memory; Device Memory connects to Host Memory via PCIe]
© NVIDIA Corporation 2007
Myths of GPU Computing

GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware

GPU architectures are:
Very wide (1000s) SIMD machines…
NO: warps are 32-wide
…on which branching is impossible or prohibitive…
NOPE
…with 4-wide vector registers.
NO: scalar thread processors

GPUs are power-inefficient
No – 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don't do real floating point
(see the floating point features that follow)
GPU Floating Point Features

Feature: G80 | SSE | IBM Altivec | Cell SPE
Precision: IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL: Round to nearest and round to zero | All 4 IEEE (round to nearest, zero, inf, -inf) | Round to nearest only | Round to zero/truncate only
Denormal handling: Flush to zero | Supported, 1000's of cycles | Supported, 1000's of cycles | Flush to zero
NaN support: Yes | Yes | Yes | No
Overflow and Infinity support: Yes, only clamps to max norm | Yes | Yes | No infinity
Flags: No | Yes | Yes | Some
Square root: Software only | Hardware | Software only | Software only
Division: Software only | Hardware | Software only | Software only
Reciprocal estimate accuracy: 24 bit | 12 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy: 23 bit | 12 bit | 12 bit | 12 bit
log2(x) and 2^x estimate accuracy: 23 bit | No | 12 bit | No
Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754


Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways

GPU FP getting better every generation


Double precision support shortly
Goal: best of class by 2009
Questions?

David Luebke
dluebke@nvidia.com
Applications
&
Sweet Spots
GPU Computing Sweet Spots

Applications:

High arithmetic intensity:
Dense linear algebra, PDEs, n-body, finite difference, …

High bandwidth:
Sequencing (virus scanning, genomics), sorting, database, …

Visual computing:
Graphics, image processing, tomography, machine vision, …
© NVIDIA Corporation 2007


GPU Computing Example Markets

Computational Geoscience
Computational Chemistry
Computational Medicine
Computational Modeling
Computational Science
Computational Biology
Computational Finance
Image Processing

© NVIDIA Corporation 2007


Applications - Condensed
3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray

© NVIDIA Corporation 2007


GPU Computing Sweet Spots

From cluster to workstation
The "personal supercomputing" phase change
From lab to clinic
From machine room to engineer and grad-student desks
From batch processing to interactive
From interactive to real-time

GPU-enabled clusters
A 100x or better speedup changes the science
Solve at different scales
Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
© NVIDIA Corporation 2007
New Applications
Real-time options implied volatility engine

Ultrasound imaging

Swaption volatility cube calculator

HOOMD Molecular Dynamics

Manifold 8 GIS

SDK: Mandelbrot, computer vision


Also…
Image rotation/classification
Graphics processing toolbox
Seismic migration
Microarray data analysis
Data parallel primitives
Astrophysics simulations
© NVIDIA Corporation 2007
The Future of GPUs

GPU Computing drives new applications
Reducing "Time to Discovery"
100x speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing
Drives new GPU capabilities
Drives hunger for more performance

Some exciting new domains:
Vision, acoustic, and embedded applications
Large-scale simulation & physics
© NVIDIA Corporation 2007
Accuracy
&
Performance


CUDA Performance Advantages

Performance:
BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS

How:
Leveraging shared memory
GPU memory bandwidth
GPU GFLOPS performance
Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …

All benchmarks are compiled code!

© NVIDIA Corporation 2007


GPGPU
vs.
GPU Computing
Problem: GPGPU

OLD: GPGPU – trick the GPU into general-purpose computing by casting the problem as graphics
Turn data into images ("texture maps")
Turn algorithms into image synthesis ("rendering passes")

Promising results, but:
Tough learning curve, particularly for non-graphics experts
Potentially high overhead of graphics API
Highly constrained memory layout & access model
Need for many passes drives up bandwidth consumption

© NVIDIA Corporation 2007


Solution: CUDA

NEW: GPU Computing with CUDA
CUDA = Compute Unified Device Architecture
Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture
General thread launch
Global load-store
Parallel data cache
Scalar architecture
Integers, bit operations
Double precision (soon)

Software: program the GPU in C with minimal yet powerful extensions
Scalable data-parallel execution/memory model

© NVIDIA Corporation 2007


Graphics Programming Model
Graphics Application

Vertex Program

Rasterization

Fragment Program

Display

© NVIDIA Corporation 2007


Streaming GPGPU Programming

An OpenGL program to add A and B:
Start by creating a quad
"Programs" created with raster operation
Read textures as input to the OpenGL shader program
Write the answer to texture memory as a "color"
CPU reads texture memory for the results

[Pipeline diagram: Vertex Program → Rasterization → Fragment Program → CPU reads texture memory for results]

© NVIDIA Corporation 2007

All this just to do A + B


What's Wrong With GPGPU?

APIs are specific to graphics
Limited texture size and dimension
Limited instruction set
No thread communication
Limited local storage
Limited shader outputs
No scatter

[Diagram: application → vertex program → rasterization → fragment program → display; each fragment program sees only input registers, texture, constants, and temp registers, and can write only to its output registers]
© NVIDIA Corporation 2007
Building a Better Pixel

[Diagram of the fragment-program model: input registers, texture, and constants feed a fragment program with temp registers, which writes only to output registers]
© NVIDIA Corporation 2007


Building a Better Pixel Thread

Features
Millions of instructions
Full integer and bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation

[Diagram: thread number, texture, and constants feed a thread program with registers, which writes output registers]
© NVIDIA Corporation 2007
Global Memory

Features
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support

[Diagram: thread number, texture, and constants feed a thread program with registers, which reads and writes Global Memory]
© NVIDIA Corporation 2007
Parallel Data Cache

Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers

[Diagram: thread number, texture, and constants feed a thread program with registers, which shares a Parallel Data Cache with other threads and accesses Global Memory]
© NVIDIA Corporation 2007
Example Algorithm - Fluids

Goal: Calculate PRESSURE in a fluid

Pressure = sum of neighboring pressures
Pn' = P1 + P2 + P3 + P4

So the pressure for each particle is… (pressure depends on neighbors; a CUDA sketch of this neighbor sum follows below)
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
© NVIDIA Corporation 2007
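
A hedged CUDA sketch of the neighbor sum (my example; the names, tile size, and the simplified one-element-stride window are assumptions, not the deck's code). Each block stages a tile of pressures in per-block shared memory so neighboring sums reuse values instead of re-reading them from DRAM:

__global__ void neighborSum(const float *p, float *pOut, int n)
{
    __shared__ float tile[256 + 3];                  // tile plus a 3-element halo; assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global particle index
    int t = threadIdx.x;

    if (i < n)
        tile[t] = p[i];                              // each thread loads one value
    if (t < 3 && i + blockDim.x < n)
        tile[blockDim.x + t] = p[i + blockDim.x];    // a few threads load the halo
    __syncthreads();                                 // the tile is complete for the whole block

    if (i + 3 < n)                                   // Pn' = sum of four neighboring pressures
        pOut[i] = tile[t] + tile[t + 1] + tile[t + 2] + tile[t + 3];
}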


Example Fluid Algorithm: CPU vs. GPGPU vs. GPU Computing with CUDA

[Diagram comparing how the three models evaluate Pn' = P1 + P2 + P3 + P4:
CPU: control logic, cache, and an ALU – a single thread works out of the cache against DRAM
GPGPU: many fragment programs run in parallel, but partial results must make multiple passes through video memory
CUDA: a thread execution manager feeds many ALUs that share data in the Parallel Data Cache – parallel execution through the cache]

© NVIDIA Corporation 2007
Parallel Data Cache

Bring the data closer to the ALU

Addresses a fundamental problem of stream computing (GPGPU):
The data are far from the FLOPS; video RAM latency is high
Threads can only communicate their results through this high-latency RAM

[Diagram: in the GPGPU model each fragment program computes Pn' = P1 + P2 + P3 + P4 independently, and intermediate results make multiple passes through video memory]
© NVIDIA Corporation 2007


Parallel Data Cache

Bring the data closer to the ALU

Stage computation for the parallel data cache
Minimize trips to external memory
Share values to minimize overfetch and computation
Increases arithmetic intensity by keeping data close to the processors

User-managed generic memory; threads read/write arbitrarily

[Diagram: the thread execution manager feeds ALUs that compute Pn' = P1 + P2 + P3 + P4 from pressures P1…P5 staged in the shared Parallel Data Cache, backed by DRAM – parallel execution through the cache]

© NVIDIA Corporation 2007
Streaming vs. GPU Computing

Streaming (GPGPU):
Gather in, restricted write
Memory is far from the ALU
No inter-element communication

GPU Computing with CUDA:
More general data-parallel model
Full scatter / gather (see the sketch below)
PDC brings the data closer to the ALU
App decides how to decompose the problem across threads
Share and communicate between threads to solve problems efficiently
© NVIDIA Corporation 2007
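
A small sketch of scatter (my example, not from the deck): each thread writes to an address it computes itself, something the restricted-write fragment model could not express.

__global__ void scatterByKey(const int *keys, const float *vals, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[keys[i]] = vals[i];     // per-thread computed destination (scatter)
}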


GPU Design
CPU/GPU Parallelism

Moore's Law gives you more and more transistors
What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible
Tactics:
– Cache (area limiting)
– Instruction/Data prefetch
– Speculative execution
limited by "perimeter" – communication bandwidth
…then add task parallelism… multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible
Tactics:
– Parallelism (1000s of threads)
– Pipelining
limited by "area" – compute capability

© NVIDIA Corporation 2007


Background: Unified Design

[Diagram: in the discrete design, work flows through fixed stages Shader A → Shader B → Shader C → Shader D via intermediate buffers; in the unified design, a single shader core loops work through input and output buffers]
© NVIDIA Corporation 2007


Hardware Implementation:
Collection of SIMT Multiprocessors

Each multiprocessor is a set of SIMT thread processors
Single Instruction Multiple Thread

Each thread processor has:
program counter, register file, etc.
scalar data path
read/write memory access

Unit of SIMT execution: warp
execute same instruction/clock
Hardware handles thread scheduling and divergence transparently

Warps enable a friendly data-parallel programming model!

[Diagram: the device is a set of multiprocessors 1…N; each contains processors 1…M and an instruction unit]
© NVIDIA Corporation 2007
Hardware Implementation:
Memory Architecture

The device has local device memory
Can be read and written by the host and by the multiprocessors

Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory
A read-only constant cache
A read-only texture cache

[Diagram: each multiprocessor contains shared memory, per-processor registers, an instruction unit, a constant cache, and a texture cache; all multiprocessors connect to device memory]
© NVIDIA Corporation 2007


Hardware Implementation:
Memory Model

Each thread can:
Read/write per-block on-chip shared memory
Read per-grid cached constant memory
Read/write non-cached device memory:
per-grid global memory
per-thread local memory
Read cached texture memory
(see the sketch below)

[Diagram: a grid of blocks; each block has shared memory, each thread has registers and local memory; global, constant, and texture memory are shared by the whole grid and the host]
© NVIDIA Corporation 2007
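
A hedged sketch of where data lives in that model (my example; the names are assumptions):

__constant__ float coeffs[16];         // per-grid constant memory: cached, read-only in kernels
__device__   float table[1024];        // per-grid global memory: read/write, not cached

__global__ void memorySpaces(float *out)
{
    __shared__ float tile[128];        // per-block on-chip shared memory
    float t = coeffs[0];               // per-thread registers (spills go to local memory)
    tile[threadIdx.x] = t + table[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];   // result written back to global memory
}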


CUDA
Programming
CUDA SDK

Libraries: FFT, BLAS, …
Example source code
Integrated CPU and GPU C source code

[Compilation flow: integrated C source → NVIDIA C Compiler → NVIDIA assembly for computing (loaded onto the GPU through the CUDA driver, with debugger and profiler support) plus CPU host code (built with a standard C compiler, run on the CPU)]
© NVIDIA Corporation 2007


CUDA: Features available to kernels

Standard mathematical functions


sinf, powf, atanf, ceil, etc.

Built-in vector types


float4, int4, uint4, etc. for dimensions 1..4

Texture accesses in kernels


texture<float,2> my_texture; // declare texture reference

float4 texel = texfetch(my_texture, u, v);

© NVIDIA Corporation 2007


G8x CUDA = C with Extensions
Philosophy: provide minimal set of extensions necessary to expose power

Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

Built-in variables and functions valid in device code:


dim3 gridDim; // Grid dimension
dim3 blockDim; // Block dimension
dim3 blockIdx; // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization
© NVIDIA Corporation 2007
CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory


cudaMalloc(), cudaFree()

Explicit memory copy for host ↔ device, device ↔ device


cudaMemcpy(), cudaMemcpy2D(), ...

Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability


cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …

© NVIDIA Corporation 2007


Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    .....  // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
© NVIDIA Corporation 2007
Example: Vector Addition Kernel

// Compute vector sum C = A+B


// Each thread performs one pair-wise addition

__global__ void vecAdd(float* A, float* B, float* C)


{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}

© NVIDIA Corporation 2007


Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
© NVIDIA Corporation 2007


Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2007
A quick review

device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Memory Location Cached Access Who
Local Off-chip No Read/write One thread
Shared On-chip N/A - resident Read/write All threads in a block
Global Off-chip No Read/write All threads + host
Constant Off-chip Yes Read All threads + host
Texture Off-chip Yes Read All threads + host
© NVIDIA Corporation 2007
Data-Parallel
Programming
Scan Literature

Pre-Hibernation
First proposed in APL by Iverson (1962)
Used as a data parallel primitive in the Connection Machine (1990)
Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel
algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, “Prefix Sums and Their Applications”
Post-Democratization
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
Applied to Summed Area Tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al.
(EG06)
O(n) work & space GPU implementation by Harris et al. (2007)
NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables
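
For reference, a minimal sequential version of the scan (prefix sum) operation that the implementations cited above parallelize (my sketch, not from the deck):

void inclusiveScan(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += in[i];       // running total
        out[i] = sum;       // out[i] = in[0] + … + in[i]
    }
}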
Parallel Reduction Complexity

log(N) parallel steps; each step S does N/2^S independent ops
Step complexity is O(log N)

For N = 2^D, performs Σ_{S=1..D} 2^(D−S) = N − 1 operations
Work complexity is O(N) – it is work-efficient
i.e., does not perform more operations than a sequential algorithm

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
(a baseline kernel sketch follows below)
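
A hedged baseline sketch of that tree reduction (my example, assumed rather than copied from the deck); the following slides optimize the tail of this loop:

__global__ void reduceSum(const int *g_idata, int *g_odata)
{
    extern __shared__ int data[];                    // one element per thread
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    data[t] = g_idata[i];                            // load this block's slice into shared memory
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];                  // pairwise partial sums
        __syncthreads();                             // each step halves the active threads
    }
    if (t == 0)
        g_odata[blockIdx.x] = data[0];               // one partial sum per block
}

// launch: reduceSum<<<numBlocks, threads, threads * sizeof(int)>>>(d_in, d_out);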
Unrolling Last Steps

Only one warp is active during the last few steps
Unroll them and remove unneeded __syncthreads()

for (unsigned int s = bd/2; s > 32; s >>= 1)
{
    if (t < s) {
        data[t] += data[t + s];
    }
    __syncthreads();
}
// last steps: a single warp executes in lockstep, so no barriers
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];
Unrolling the Loop Completely

When the block size is known at compile time, we can completely unroll the loop
It often is, since the maximum thread block size of 512 constrains us
Use templates…

#define STEP(d) \
    if (t < (d)) data[t] += data[t + (d)];

#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{   ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) { STEP(32) STEP(16) STEP(8)
                        STEP(4)  STEP(2)  STEP(1) }
}
GPU Computing
Motivation
Computing Challenge

[Graphic contrasting task computing (CPU) with data computing (GPU)]
© NVIDIA Corporation 2007


Extreme Growth in Raw Data

[Charts: YouTube bandwidth growth in millions (source: Alexa, YouTube 2006); Walmart transaction tracking in millions (source: Hedburg, CPI, Walmart); BP oil and gas active data in terabytes (source: Jim Farnsworth, BP, May 2005); NOAA/NASA weather data in petabytes, 2002-2017 (source: John Bates, NOAA National Climate Center)]

© NVIDIA Corporation 2007
Computational Horsepower

GPU is a massively parallel computation engine


High memory bandwidth (5-10x CPU)
High floating-point performance (5-10x CPU)

© NVIDIA Corporation 2007


Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz


Measured against commercial CPU benchmarks when possible

© NVIDIA Corporation 2007


"Free" Massively Parallel Processors

It's not science fiction, it's just funded by them

[Images: Asst. Master Chief; Harvard]
Success
Stories
Success Stories: Data to Design

Acceleware: EM field simulation technology for the GPU
3D finite-difference (FDTD) and finite-element modeling of:
Cell phone irradiation
MRI design / modeling
Printed circuit boards
Radar cross section (military)
Pacemaker with transmit antenna

[Chart: performance in Mcells/s relative to a 3.2 GHz CPU – roughly 5X on 1 GPU, 10X on 2 GPUs, and 20X on 4 GPUs]
© NVIDIA Corporation 2007
EvolvedMachines
130X Speed up
Simulate brain circuitry
Sensory computing: vision, olfactory

EvolvedMachines

© NVIDIA Corporation 2007


Matlab: Language of Science

10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D Isotropic turbulence


http://developer.nvidia.com/object/matlab_cuda.html
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
© NVIDIA Corporation 2007
MATLAB Example:
Advection of an elliptic vortex
256x256 mesh, 512 RK4 steps, Linux, MATLAB file
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m

Matlab
168 seconds

Matlab with CUDA


(single precision FFTs)
20 seconds

© NVIDIA Corporation 2007


MATLAB Example:
Pseudo-spectral simulation of 2D Isotropic turbulence

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file


http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

MATLAB
992 seconds

MATLAB with CUDA


(single precision FFTs)
93 seconds

© NVIDIA Corporation 2007


NAMD/VMD Molecular Dynamics

240X speedup
Computational biology
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

© NVIDIA Corporation 2007


Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign
(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
Next slides stolen from a nice description of the problem, algorithms, and iterative optimization process available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

© NVIDIA Corporation 2007
Molecular Modeling: Ion Placement

Biomolecular simulations
attempt to replicate in vivo
conditions in silico.
Model structures are
initially constructed in
vacuum
Solvent (water) and ions are
added as necessary for the
required biological
conditions
Computational
requirements scale with the
size of the simulated
structure

© NVIDIA Corporation 2007


Evolution of Ion Placement Code

First implementation was sequential
Virus structure with 10^6 atoms would require 10 CPU days
Tuned for Intel C/C++ vectorization + SSE, ~20x speedup
Parallelized w/ pthreads: high data parallelism = linear speedup
Parallelized GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!
Virus structure now runs in 25 seconds on 3 GPUs!
Further speedups should still be possible…

© NVIDIA Corporation 2007


Multi-GPU CUDA
Coulombic Potential Map Performance

Host: Intel Core 2 Quad, 8 GB RAM, ~$3,000
3 GPUs: NVIDIA GeForce 8800 GTX, ~$550 each
32-bit RHEL4 Linux (want 64-bit CUDA!!)
235 GFLOPS per GPU for the current version of the coulombic potential map kernel
705 GFLOPS total for the multithreaded multi-GPU version

Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650
© NVIDIA Corporation 2007


Professor
Partnership
NVIDIA Professor Partnership

Support faculty research & teaching efforts

Easy:
Small equipment gifts (1-2 GPUs)
Significant discounts on GPU purchases, especially Quadro, Tesla equipment
Useful for cost matching

Competitive:
Research contracts
Small cash grants (typically ~$25K gifts)
Medium-scale equipment donations (10-30 GPUs)
Informal proposals, reviewed quarterly
Focus areas: GPU computing, especially with an educational mission or component

http://www.nvidia.com/page/professor_partnership.html
© NVIDIA Corporation 2007
