LN02 A
Data Parallelism
Vector Addition Example
Figure: vectors A, B, and C of N elements; each element of C is the sum of the corresponding elements of A and B (C[0] = A[0] + B[0], ..., C[N-1] = A[N-1] + B[N-1]).
Execution Model
CUDA C / OpenCL
CUDA C
New keywords and Application Programming Interface (API) functions
Device functions: functions that are executed on the device
NVCC Compiler
Figure: the NVCC compiler splits a CUDA C source into host code, which is handled by the host C compiler/linker, and device code, which is handled by the device just-in-time compiler.
The ISA
An Instruction Set Architecture (ISA) is a
contract between the hardware and the
software.
As the name suggests, it is a set of
instructions that the architecture (hardware)
can execute.
Figure: block diagram of a processor: a processing unit (ALU and register file) and a control unit (program counter PC and instruction register IR), connected to data memory and I/O.
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012
ECE408/CS483, University of Illinois, Urbana-Champaign
Figure: vector addition mapped onto thread blocks. Each block (Thread Block 0, Thread Block 1, ...) contains 256 threads (threadIdx.x = 0 ... 255 in this example), and each thread computes one element:

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
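The per-thread statements above can be wrapped into a complete kernel. A minimal sketch, assuming the conventional kernel name vecAddKernel and an explicit boundary check (neither is shown verbatim in the slides):

```c
// Compute C = A + B, one element per thread.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may have threads past n-1
        C[i] = A[i] + B[i];
}
```

The guard matters because the grid is sized in whole blocks, so the total thread count is usually rounded up past n.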
Vector Addition
Conceptual View
Figure: vectors A and B are added element-wise (C[0] = A[0] + B[0], ..., C[N-1] = A[N-1] + B[N-1]) to produce vector C.
Vector Addition
Traditional C Code
int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}
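The vecAdd called above is, in the traditional version, a plain sequential loop. A minimal sketch, following the h_ host-array naming convention used in the slides:

```c
// Sequential vector addition: h_C[i] = h_A[i] + h_B[i] for i = 0 .. n-1.
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    for (int i = 0; i < n; ++i)
        h_C[i] = h_A[i] + h_B[i];
}
```

Every loop iteration is independent of the others, which is exactly what makes the computation data-parallel.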
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // Part 1: Allocate device memory for A, B, and C; copy A and B to the device
    // Part 2: Launch the kernel so the device performs the actual vector addition
    // Part 3: Copy C back from the device memory; free the device vectors
}
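A filled-in sketch of the three parts, assuming the CUDA runtime API, 256-thread blocks, and a kernel named vecAddKernel (error checking omitted for brevity):

```c
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Part 2: launch enough 256-thread blocks to cover all n elements
    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);

    // Part 3: copy the result back and free the device vectors
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```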
Figure: CUDA device memory model. The host launches a (device) grid of blocks (Block (0, 0), Block (1, 0), ...); each block holds threads (Thread (0, 0), Thread (1, 0), ...) with their own registers, and all blocks share the device's global memory.
cudaMemcpy()
Figure: cudaMemcpy() moves data between host memory and the device's global memory, which is shared by all blocks of the grid.
Transfer to device is asynchronous
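A short usage sketch of cudaMemcpy(), assuming device pointer d_A and host array h_A have already been set up:

```c
// cudaMemcpy(dst, src, bytes, direction); the direction is one of
// cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);

// Since the transfer may complete asynchronously with respect to the host,
// synchronize before timing the copy or reusing the buffers:
cudaDeviceSynchronize();
```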
CUDA function declaration qualifiers:

                                 Executed on the:   Only callable from the:
 __device__ float DeviceFunc()   device             device
 __global__ void  KernelFunc()   device             host
 __host__   float HostFunc()     host               host
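A small sketch of how the three qualifiers combine (the function names here are illustrative, not from the slides):

```c
// __device__: runs on the device, callable only from device code.
__device__ float addOne(float x) { return x + 1.0f; }

// __global__: a kernel - runs on the device, launched from the host.
__global__ void incKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = addOne(data[i]);   // device-to-device call
}

// __host__ (the default for plain C functions): runs on the host.
__host__ void launchInc(float *d_data, int n)
{
    incKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // host-to-device launch
}
```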
Parallel Algorithm
Recipe to solve a problem on multiple processors
Typical steps for constructing a parallel algorithm
identify what pieces of work can be performed concurrently
partition concurrent work onto independent processors
distribute a program's input, output, and intermediate data
coordinate accesses to shared data: avoid conflicts
ensure proper order of work using synchronization
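The partitioning step can be made concrete. A minimal sketch, assuming a simple block distribution of n work items over p processors (the helper name blockRange is illustrative):

```c
// Compute the half-open range [*lo, *hi) of work items assigned to
// processor r when n items are block-distributed over p processors.
// The first n % p processors each receive one extra item.
void blockRange(int n, int p, int r, int *lo, int *hi)
{
    int base = n / p;
    int extra = n % p;
    *lo = r * base + (r < extra ? r : extra);
    *hi = *lo + base + (r < extra ? 1 : 0);
}
```

Each processor then works only on its own range, so no two processors touch the same item and no coordination is needed for the partitioned data.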
Figure: a task dependency graph; nodes are tasks (T3, T4, ..., T17) and edges impose the order in which dependent tasks may execute.
Observations
task size is uniform
no control dependences between tasks
tasks share b