GPU Architecture & Implications: David Luebke NVIDIA Research


GPU Architecture & Implications

David Luebke
NVIDIA Research
GPU Architecture

CUDA provides a parallel programming model

The Tesla GPU architecture implements this

This talk will describe the characteristics, goals, and implications of that architecture

© NVIDIA Corporation 2007


G80 GPU Implementation: Tesla C870

681 million transistors
470 mm² in 90 nm CMOS

128 thread processors
518 GFLOPS peak
1.35 GHz processor clock

1.5 GB DRAM
76 GB/s peak
800 MHz GDDR3 clock
384-pin DRAM interface

ATX form factor card
PCI Express x16
170 W max with DRAM
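
As a sanity check on those peak numbers (my arithmetic, assuming the usual G80 counting of a dual-issue MAD + MUL as 3 flops per SP per clock): 128 SPs × 1.35 GHz × 3 flops/clock ≈ 518 GFLOPS, and a 384-bit GDDR3 interface at 800 MHz (1.6 GT/s effective) delivers 48 bytes × 1.6 GT/s ≈ 76.8 GB/s.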
Block Diagram Redux
G80 (launched Nov 2006)
128 Thread Processors execute kernel threads
Up to 12,288 parallel threads active
Per-block shared memory (PBSM) accelerates processing

[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of Thread Processors, each with per-block shared memory (PBSM), connected by a load/store path to Global Memory]

© NVIDIA Corporation 2007


Streaming Multiprocessor (SM)

Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32 KB)
½ MB total register file space!
usual ops: float, int, branch, …

Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total

16 KB on-chip shared memory
low-latency storage
shared amongst the threads of a block
supports thread communication

[Diagram: SM with multithreaded instruction unit (MT IU), SPs, shared memory, and thread contexts t0, t1, …, tB]
© NVIDIA Corporation 2007
Goal: Scalability

Scalable execution
Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling

Hierarchical execution model
Decompose problem into sequential steps (kernels)
Decompose kernel into computing parallel blocks
Decompose block into computing parallel threads

Hardware distributes independent blocks to SMs as available (see the sketch below)
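
A minimal sketch of what that scalability means in practice (my example, not from the deck; the kernel and names are assumptions): the launch expresses the work as many independent blocks, and the hardware maps them onto however many SMs the particular GPU has.

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= s;                                // one element per thread
}

// Host side: the grid size depends only on the problem size, never on the GPU.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);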
Blocks Run on Multiprocessors

Kernel launched by host

[Diagram: the grid of thread blocks is distributed across the device processor array; each multiprocessor (MT issue unit, SPs, shared memory) executes blocks independently, and all access Device Memory]
Goal: easy to program

Strategies:
Familiar programming language mechanics
C/C++ with small extensions
Simple parallel abstractions
Simple barrier synchronization
Shared memory semantics
Hardware-managed hierarchy of threads
(see the sketch below)
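
A toy sketch of those abstractions (my example, not from the deck): the threads of one block stage data in shared memory, meet at a barrier, then read each other's values.

__global__ void reverseBlock(float *d, int n)
{
    __shared__ float s[256];   // per-block shared memory (assumes a single block with blockDim.x == n <= 256)
    int t = threadIdx.x;
    s[t] = d[t];               // each thread writes one element
    __syncthreads();           // barrier: all writes now visible to the block
    d[t] = s[n - 1 - t];       // read an element written by a different thread
}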
Hardware Multithreading

Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don't run until resources are available

Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is (basically) free – every cycle

Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance

[Diagram: SM with MT issue unit, SPs, and shared memory]

© NVIDIA Corporation 2007


Goal: Performance per millimeter

For GPUs, performance == throughput

Strategy: hide latency with computation, not cache
Heavy multithreading – already discussed by Kevin

Implication: need many threads to hide latency
Occupancy – typically need 128 threads/SM minimum
Multiple thread blocks/SM are good to minimize the effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT)
Balances performance with ease of programming
SIMT Thread Execution

Groups of 32 threads formed into warps
always executing the same instruction
shared instruction fetch/dispatch
some threads become inactive when the code path diverges
hardware automatically handles divergence

Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot

SIMT execution is an implementation choice
sharing control logic leaves more space for ALUs
largely invisible to the programmer
must understand it for performance, not correctness

[Diagram: SM with MT issue unit, SPs, and shared memory]
© NVIDIA Corporation 2007
SIMT Multithreaded Execution

Weaving: the original parallel thread technology is about 10,000 years old
Warp: a set of 32 parallel threads that execute a SIMD instruction

SM hardware implements zero-overhead warp and thread scheduling
Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads

Threads can execute independently
SIMD warp automatically diverges and converges when threads branch (see the sketch below)
Best efficiency and performance when the threads of a warp execute together
SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

[Diagram: the SM multithreaded instruction scheduler issues, over time, instructions from different ready warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]

© NVIDIA Corporation 2007
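
A small sketch of divergence within a warp (my example, not from the deck): threads that take different sides of a data-dependent branch are serialized by hardware until the paths reconverge; correctness is automatic, only throughput is affected.

__global__ void clampNegatives(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)        // lanes of one warp may disagree here
            x[i] = 0.0f;        // divergent path runs with the other lanes inactive
        // the warp reconverges after the branch
    }
}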
Memory Architecture

Direct load/store access to device memory
treated as the usual linear sequence of bytes (i.e., not pixels)

Texture & constant caches are read-only access paths

On-chip shared memory is shared amongst the threads of a block
important for communication amongst threads
provides low-latency temporary storage (~100x less than DRAM)

[Diagram: SM (MT issue unit, SPs, shared memory, instruction cache) with texture and constant caches over Device Memory; Device Memory connects to Host Memory via PCIe]
© NVIDIA Corporation 2007
Myths of GPU Computing

GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware

GPU architectures are:
Very wide (1000s) SIMD machines…
NO: warps are 32-wide
…on which branching is impossible or prohibitive…
NOPE
…with 4-wide vector registers.
NO: scalar thread processors

GPUs are power-inefficient
No – 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don't do real floating point
(see the floating point features that follow)
GPU Floating Point Features

Feature: G80 | SSE | IBM Altivec | Cell SPE
Precision: IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL: Round to nearest and round to zero | All 4 IEEE (round to nearest, zero, inf, -inf) | Round to nearest only | Round to zero/truncate only
Denormal handling: Flush to zero | Supported, 1000's of cycles | Supported, 1000's of cycles | Flush to zero
NaN support: Yes | Yes | Yes | No
Overflow and Infinity support: Yes, only clamps to max norm | Yes | Yes | No infinity
Flags: No | Yes | Yes | Some
Square root: Software only | Hardware | Software only | Software only
Division: Software only | Hardware | Software only | Software only
Reciprocal estimate accuracy: 24 bit | 12 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy: 23 bit | 12 bit | 12 bit | 12 bit
log2(x) and 2^x estimate accuracy: 23 bit | No | 12 bit | No
Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754


Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways

GPU FP getting better every generation


Double precision support shortly
Goal: best of class by 2009
Questions?

David Luebke
dluebke@nvidia.com
Applications
&
Sweet Spots
GPU Computing Sweet Spots

Applications:

High arithmetic intensity:
Dense linear algebra, PDEs, n-body, finite difference, …

High bandwidth:
Sequencing (virus scanning, genomics), sorting, database, …

Visual computing:
Graphics, image processing, tomography, machine vision, …
© NVIDIA Corporation 2007


GPU Computing Example Markets

Computational Geoscience
Computational Chemistry
Computational Medicine
Computational Modeling
Computational Science
Computational Biology
Computational Finance
Image Processing

© NVIDIA Corporation 2007


Applications - Condensed
3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray

© NVIDIA Corporation 2007


GPU Computing Sweet Spots

From cluster to workstation
The "personal supercomputing" phase change
From lab to clinic
From machine room to engineer and grad-student desks
From batch processing to interactive
From interactive to real-time

GPU-enabled clusters
A 100x or better speedup changes the science
Solve at different scales
Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
© NVIDIA Corporation 2007
New Applications
Real-time options implied volatility engine

Ultrasound imaging

Swaption volatility cube calculator

HOOMD Molecular Dynamics

Manifold 8 GIS

SDK: Mandelbrot, computer vision


Also…
Image rotation/classification
Graphics processing toolbox
Seismic migration
Microarray data analysis
Data parallel primitives
Astrophysics simulations
© NVIDIA Corporation 2007
The Future of GPUs

GPU Computing drives new applications
Reducing "Time to Discovery"
100x speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing
Drives new GPU capabilities
Drives hunger for more performance

Some exciting new domains:
Vision, acoustic, and embedded applications
Large-scale simulation & physics
© NVIDIA Corporation 2007
Accuracy
&
Performance


CUDA Performance Advantages

Performance:
BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS

How:
Leveraging shared memory
GPU memory bandwidth
GPU GFLOPS performance
Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …

All benchmarks are compiled code!

© NVIDIA Corporation 2007


GPGPU
vs.
GPU Computing
Problem: GPGPU

OLD: GPGPU – trick the GPU into general-purpose computing by casting the problem as graphics
Turn data into images ("texture maps")
Turn algorithms into image synthesis ("rendering passes")

Promising results, but:
Tough learning curve, particularly for non-graphics experts
Potentially high overhead of graphics API
Highly constrained memory layout & access model
Need for many passes drives up bandwidth consumption

© NVIDIA Corporation 2007


Solution: CUDA

NEW: GPU Computing with CUDA
CUDA = Compute Unified Device Architecture
Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture
General thread launch
Global load-store
Parallel data cache
Scalar architecture
Integers, bit operations
Double precision (soon)

Software: program the GPU in C with minimal yet powerful extensions
Scalable data-parallel execution/memory model

© NVIDIA Corporation 2007


Graphics Programming Model
Graphics Application

Vertex Program

Rasterization

Fragment Program

Display

© NVIDIA Corporation 2007


Streaming GPGPU Programming

An OpenGL program to add A and B:
Start by creating a quad
"Programs" created with raster operation
Read textures as input to the OpenGL shader program
Write the answer to texture memory as a "color"
CPU reads texture memory for the results

[Pipeline diagram: Vertex Program → Rasterization → Fragment Program → CPU reads texture memory for results]

© NVIDIA Corporation 2007

All this just to do A + B


What's Wrong With GPGPU?

APIs are specific to graphics
Limited texture size and dimension
Limited instruction set
No thread communication
Limited local storage
Limited shader outputs
No scatter

[Diagram: application → vertex program → rasterization → fragment program → display; each fragment program sees only input registers, texture, constants, and temp registers, and can write only to its output registers]
© NVIDIA Corporation 2007
Building a Better Pixel

[Diagram of the fragment-program model: input registers, texture, and constants feed a fragment program with temp registers, which writes only to output registers]
© NVIDIA Corporation 2007


Building a Better Pixel Thread

Features
Millions of instructions
Full integer and bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation

[Diagram: thread number, texture, and constants feed a thread program with registers, which writes output registers]
© NVIDIA Corporation 2007
Global Memory

Features
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support

[Diagram: thread number, texture, and constants feed a thread program with registers, which reads and writes Global Memory]
© NVIDIA Corporation 2007
Parallel Data Cache

Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers

[Diagram: thread number, texture, and constants feed a thread program with registers, which shares a Parallel Data Cache with other threads and accesses Global Memory]
© NVIDIA Corporation 2007
Example Algorithm - Fluids

Goal: Calculate PRESSURE in a fluid

Pressure = sum of neighboring pressures
Pn' = P1 + P2 + P3 + P4

So the pressure for each particle is… (pressure depends on neighbors; a CUDA sketch of this neighbor sum follows below)
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
© NVIDIA Corporation 2007
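
A hedged CUDA sketch of the neighbor sum (my example; the names, tile size, and the simplified one-element-stride window are assumptions, not the deck's code). Each block stages a tile of pressures in per-block shared memory so neighboring sums reuse values instead of re-reading them from DRAM:

__global__ void neighborSum(const float *p, float *pOut, int n)
{
    __shared__ float tile[256 + 3];                  // tile plus a 3-element halo; assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global particle index
    int t = threadIdx.x;

    if (i < n)
        tile[t] = p[i];                              // each thread loads one value
    if (t < 3 && i + blockDim.x < n)
        tile[blockDim.x + t] = p[i + blockDim.x];    // a few threads load the halo
    __syncthreads();                                 // the tile is complete for the whole block

    if (i + 3 < n)                                   // Pn' = sum of four neighboring pressures
        pOut[i] = tile[t] + tile[t + 1] + tile[t + 2] + tile[t + 3];
}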


Example Fluid Algorithm: CPU vs. GPGPU vs. GPU Computing with CUDA

[Diagram comparing how the three models evaluate Pn' = P1 + P2 + P3 + P4:
CPU: control logic, cache, and an ALU – a single thread works out of the cache against DRAM
GPGPU: many fragment programs run in parallel, but partial results must make multiple passes through video memory
CUDA: a thread execution manager feeds many ALUs that share data in the Parallel Data Cache – parallel execution through the cache]

© NVIDIA Corporation 2007
Parallel Data Cache

Bring the data closer to the ALU

Addresses a fundamental problem of stream computing (GPGPU):
The data are far from the FLOPS; video RAM latency is high
Threads can only communicate their results through this high-latency RAM

[Diagram: in the GPGPU model each fragment program computes Pn' = P1 + P2 + P3 + P4 independently, and intermediate results make multiple passes through video memory]
© NVIDIA Corporation 2007


Parallel Data Cache

Bring the data closer to the ALU

Stage computation for the parallel data cache
Minimize trips to external memory
Share values to minimize overfetch and computation
Increases arithmetic intensity by keeping data close to the processors

User-managed generic memory; threads read/write arbitrarily

[Diagram: the thread execution manager feeds ALUs that compute Pn' = P1 + P2 + P3 + P4 from pressures P1…P5 staged in the shared Parallel Data Cache, backed by DRAM – parallel execution through the cache]

© NVIDIA Corporation 2007
Streaming vs. GPU Computing

Streaming (GPGPU):
Gather in, restricted write
Memory is far from the ALU
No inter-element communication

GPU Computing with CUDA:
More general data-parallel model
Full scatter / gather (see the sketch below)
PDC brings the data closer to the ALU
App decides how to decompose the problem across threads
Share and communicate between threads to solve problems efficiently
© NVIDIA Corporation 2007
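
A small sketch of scatter (my example, not from the deck): each thread writes to an address it computes itself, something the restricted-write fragment model could not express.

__global__ void scatterByKey(const int *keys, const float *vals, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[keys[i]] = vals[i];     // per-thread computed destination (scatter)
}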


GPU Design
CPU/GPU Parallelism

Moore's Law gives you more and more transistors
What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible
Tactics:
– Cache (area limiting)
– Instruction/Data prefetch
– Speculative execution
limited by "perimeter" – communication bandwidth
…then add task parallelism… multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible
Tactics:
– Parallelism (1000s of threads)
– Pipelining
limited by "area" – compute capability

© NVIDIA Corporation 2007


Background: Unified Design

[Diagram: in the discrete design, work flows through fixed stages Shader A → Shader B → Shader C → Shader D via intermediate buffers; in the unified design, a single shader core loops work through input and output buffers]
© NVIDIA Corporation 2007


Hardware Implementation:
Collection of SIMT Multiprocessors

Each multiprocessor is a set of SIMT thread processors
Single Instruction Multiple Thread

Each thread processor has:
program counter, register file, etc.
scalar data path
read/write memory access

Unit of SIMT execution: warp
execute same instruction/clock
Hardware handles thread scheduling and divergence transparently

Warps enable a friendly data-parallel programming model!

[Diagram: the device is a set of multiprocessors 1…N; each contains processors 1…M and an instruction unit]
© NVIDIA Corporation 2007
Hardware Implementation:
Memory Architecture

The device has local device memory
Can be read and written by the host and by the multiprocessors

Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory
A read-only constant cache
A read-only texture cache

[Diagram: each multiprocessor contains shared memory, per-processor registers, an instruction unit, a constant cache, and a texture cache; all multiprocessors connect to device memory]
© NVIDIA Corporation 2007


Hardware Implementation:
Memory Model

Each thread can:
Read/write per-block on-chip shared memory
Read per-grid cached constant memory
Read/write non-cached device memory:
per-grid global memory
per-thread local memory
Read cached texture memory
(see the sketch below)

[Diagram: a grid of blocks; each block has shared memory, each thread has registers and local memory; global, constant, and texture memory are shared by the whole grid and the host]
© NVIDIA Corporation 2007
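
A hedged sketch of where data lives in that model (my example; the names are assumptions):

__constant__ float coeffs[16];         // per-grid constant memory: cached, read-only in kernels
__device__   float table[1024];        // per-grid global memory: read/write, not cached

__global__ void memorySpaces(float *out)
{
    __shared__ float tile[128];        // per-block on-chip shared memory
    float t = coeffs[0];               // per-thread registers (spills go to local memory)
    tile[threadIdx.x] = t + table[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];   // result written back to global memory
}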


CUDA
Programming
CUDA SDK

Libraries: FFT, BLAS, …
Example source code
Integrated CPU and GPU C source code

[Compilation flow: integrated C source → NVIDIA C Compiler → NVIDIA assembly for computing (loaded onto the GPU through the CUDA driver, with debugger and profiler support) plus CPU host code (built with a standard C compiler, run on the CPU)]
© NVIDIA Corporation 2007


CUDA: Features available to kernels

Standard mathematical functions


sinf, powf, atanf, ceil, etc.

Built-in vector types


float4, int4, uint4, etc. for dimensions 1..4

Texture accesses in kernels


texture<float,2> my_texture; // declare texture reference

float4 texel = texfetch(my_texture, u, v);

© NVIDIA Corporation 2007


G8x CUDA = C with Extensions
Philosophy: provide minimal set of extensions necessary to expose power

Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

Built-in variables and functions valid in device code:


dim3 gridDim; // Grid dimension
dim3 blockDim; // Block dimension
dim3 blockIdx; // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization
© NVIDIA Corporation 2007
CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory


cudaMalloc(), cudaFree()

Explicit memory copy for host ↔ device, device ↔ device


cudaMemcpy(), cudaMemcpy2D(), ...

Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability


cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …

© NVIDIA Corporation 2007


Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    .....  // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
© NVIDIA Corporation 2007
Example: Vector Addition Kernel

// Compute vector sum C = A+B


// Each thread performs one pair-wise addition

__global__ void vecAdd(float* A, float* B, float* C)


{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}

© NVIDIA Corporation 2007


Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
© NVIDIA Corporation 2007


Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2007
A quick review

device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Memory Location Cached Access Who
Local Off-chip No Read/write One thread
Shared On-chip N/A - resident Read/write All threads in a block
Global Off-chip No Read/write All threads + host
Constant Off-chip Yes Read All threads + host
Texture Off-chip Yes Read All threads + host
© NVIDIA Corporation 2007
Data-Parallel
Programming
Scan Literature

Pre-Hibernation
First proposed in APL by Iverson (1962)
Used as a data parallel primitive in the Connection Machine (1990)
Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel
algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, “Prefix Sums and Their Applications”
Post-Democratization
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
Applied to Summed Area Tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al.
(EG06)
O(n) work & space GPU implementation by Harris et al. (2007)
NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables
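
For reference, a minimal sequential version of the scan (prefix sum) operation that the implementations cited above parallelize (my sketch, not from the deck):

void inclusiveScan(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += in[i];       // running total
        out[i] = sum;       // out[i] = in[0] + … + in[i]
    }
}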
Parallel Reduction Complexity

log(N) parallel steps; each step S does N/2^S independent ops
Step complexity is O(log N)

For N = 2^D, performs Σ_{S=1..D} 2^(D−S) = N − 1 operations
Work complexity is O(N) – it is work-efficient
i.e., does not perform more operations than a sequential algorithm

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
(a baseline kernel sketch follows below)
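
A hedged baseline sketch of that tree reduction (my example, assumed rather than copied from the deck); the following slides optimize the tail of this loop:

__global__ void reduceSum(const int *g_idata, int *g_odata)
{
    extern __shared__ int data[];                    // one element per thread
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    data[t] = g_idata[i];                            // load this block's slice into shared memory
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];                  // pairwise partial sums
        __syncthreads();                             // each step halves the active threads
    }
    if (t == 0)
        g_odata[blockIdx.x] = data[0];               // one partial sum per block
}

// launch: reduceSum<<<numBlocks, threads, threads * sizeof(int)>>>(d_in, d_out);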
Unrolling Last Steps

Only one warp is active during the last few steps
Unroll them and remove unneeded __syncthreads()

for (unsigned int s = bd/2; s > 32; s >>= 1)
{
    if (t < s) {
        data[t] += data[t + s];
    }
    __syncthreads();
}
// last steps: a single warp executes in lockstep, so no barriers
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];
Unrolling the Loop Completely

When the block size is known at compile time, we can completely unroll the loop
It often is, since the maximum thread block size of 512 constrains us
Use templates…

#define STEP(d) \
    if (t < (d)) data[t] += data[t + (d)];

#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{   ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) { STEP(32) STEP(16) STEP(8)
                        STEP(4)  STEP(2)  STEP(1) }
}
GPU Computing
Motivation
Computing Challenge

[Graphic contrasting task computing (CPU) with data computing (GPU)]
© NVIDIA Corporation 2007


Extreme Growth in Raw Data

[Charts: YouTube bandwidth growth in millions (source: Alexa, YouTube 2006); Walmart transaction tracking in millions (source: Hedburg, CPI, Walmart); BP oil and gas active data in terabytes (source: Jim Farnsworth, BP, May 2005); NOAA/NASA weather data in petabytes, 2002-2017 (source: John Bates, NOAA National Climate Center)]

© NVIDIA Corporation 2007
Computational Horsepower

GPU is a massively parallel computation engine


High memory bandwidth (5-10x CPU)
High floating-point performance (5-10x CPU)

© NVIDIA Corporation 2007


Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz


Measured against commercial CPU benchmarks when possible

© NVIDIA Corporation 2007


"Free" Massively Parallel Processors

It's not science fiction, it's just funded by them

[Images: Asst. Master Chief; Harvard]
Success
Stories
Success Stories: Data to Design

Acceleware: EM field simulation technology for the GPU
3D finite-difference (FDTD) and finite-element modeling of:
Cell phone irradiation
MRI design / modeling
Printed circuit boards
Radar cross section (military)
Pacemaker with transmit antenna

[Chart: performance in Mcells/s relative to a 3.2 GHz CPU – roughly 5X on 1 GPU, 10X on 2 GPUs, and 20X on 4 GPUs]
© NVIDIA Corporation 2007
EvolvedMachines
130X Speed up
Simulate brain circuitry
Sensory computing: vision, olfactory

EvolvedMachines

© NVIDIA Corporation 2007


Matlab: Language of Science

10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D Isotropic turbulence


http://developer.nvidia.com/object/matlab_cuda.html
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
© NVIDIA Corporation 2007
MATLAB Example:
Advection of an elliptic vortex
256x256 mesh, 512 RK4 steps, Linux, MATLAB file
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m

Matlab
168 seconds

Matlab with CUDA


(single precision FFTs)
20 seconds

© NVIDIA Corporation 2007


MATLAB Example:
Pseudo-spectral simulation of 2D Isotropic turbulence

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file


http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

MATLAB
992 seconds

MATLAB with CUDA


(single precision FFTs)
93 seconds

© NVIDIA Corporation 2007


NAMD/VMD Molecular Dynamics

240X speedup
Computational biology
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

© NVIDIA Corporation 2007


Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign
(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
Next slides stolen from a nice description of the problem, algorithms, and iterative optimization process available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

© NVIDIA Corporation 2007
Molecular Modeling: Ion Placement

Biomolecular simulations
attempt to replicate in vivo
conditions in silico.
Model structures are
initially constructed in
vacuum
Solvent (water) and ions are
added as necessary for the
required biological
conditions
Computational
requirements scale with the
size of the simulated
structure

© NVIDIA Corporation 2007


Evolution of Ion Placement Code

First implementation was sequential
Virus structure with 10^6 atoms would require 10 CPU days
Tuned for Intel C/C++ vectorization + SSE, ~20x speedup
Parallelized w/ pthreads: high data parallelism = linear speedup
Parallelized GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!
Virus structure now runs in 25 seconds on 3 GPUs!
Further speedups should still be possible…

© NVIDIA Corporation 2007


Multi-GPU CUDA
Coulombic Potential Map Performance

Host: Intel Core 2 Quad, 8 GB RAM, ~$3,000
3 GPUs: NVIDIA GeForce 8800 GTX, ~$550 each
32-bit RHEL4 Linux (want 64-bit CUDA!!)
235 GFLOPS per GPU for the current version of the coulombic potential map kernel
705 GFLOPS total for the multithreaded multi-GPU version

Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650
© NVIDIA Corporation 2007


Professor
Partnership
NVIDIA Professor Partnership

Support faculty research & teaching efforts

Easy:
Small equipment gifts (1-2 GPUs)
Significant discounts on GPU purchases, especially Quadro, Tesla equipment
Useful for cost matching

Competitive:
Research contracts
Small cash grants (typically ~$25K gifts)
Medium-scale equipment donations (10-30 GPUs)
Informal proposals, reviewed quarterly
Focus areas: GPU computing, especially with an educational mission or component

http://www.nvidia.com/page/professor_partnership.html
© NVIDIA Corporation 2007
