
Objective

To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism:
- Hierarchical thread organization
- Main interfaces for launching parallel execution
- Mapping of thread index(es) to data index(es)

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012


ECE408/CS483, University of Illinois, Urbana-Champaign

Data Parallelism
Vector Addition Example

[Figure: vectors A and B are added element-wise to produce vector C: C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1]. Each element-wise addition can be performed independently.]


Execution Model

Integrated host + device application C program (CUDA C / OpenCL):
- Serial or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code

[Figure: execution alternates between host and device phases:
  Serial Code (host) ...
  Parallel Kernel (device)   KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host) ...
  Parallel Kernel (device)   KernelB<<< nBlk, nTid >>>(args); ...]
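A minimal runnable sketch of this pattern, assuming illustrative kernels KernelA and KernelB (these names, the data, and the launch configuration are placeholders taken from the slide, not a real API):

// Stand-in kernels for the KernelA / KernelB phases on the slide.
__global__ void KernelA(int *data) { data[threadIdx.x] += 1; }
__global__ void KernelB(int *data) { data[threadIdx.x] *= 2; }

int main(void)
{
    const int nTid = 256, nBlk = 1;              // launch configuration
    int *d_data;
    cudaMalloc((void **) &d_data, nTid * sizeof(int));
    cudaMemset(d_data, 0, nTid * sizeof(int));

    // Serial host code ... then a parallel kernel phase on the device:
    KernelA<<<nBlk, nTid>>>(d_data);
    cudaDeviceSynchronize();                     // host waits for the device phase

    // More serial host code ... then another kernel phase:
    KernelB<<<nBlk, nTid>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}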

CUDA C

Extension to the standard C language with new keywords and Application Programming Interface (API) functions:
- Device functions: functions that are executed by the device
- Device data declarations: variables and data structures processed by the device
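A hedged illustration of these extensions (the names d_scale, scaled, and scaleKernel are made up for this sketch):

__device__ float d_scale = 2.0f;              // device data declaration (lives in device memory)

__device__ float scaled(float x)              // device function: executed by the device
{
    return d_scale * x;
}

__global__ void scaleKernel(float *v, int n)  // kernel: launched from the host, runs on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = scaled(v[i]);
}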


Compiling A CUDA Program

Integrated C programs with CUDA extensions are compiled by the NVCC compiler, which splits them into:
- Host Code, handled by the host C compiler/linker
- Device Code (PTX), handled by the device just-in-time compiler

The result runs on a heterogeneous computing platform with CPUs and GPUs.
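As a hedged example (the file and executable names are illustrative), such a program is typically built and run with:

nvcc -o vecadd vecadd.cu    # NVCC separates host and device code, emits PTX for the
                            # device, and hands the host code to the host C compiler/linker
./vecadd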

From Natural Language to Electrons

Natural Language (e.g., English)
  -> Algorithm
  -> High-Level Language (e.g., C)
  -> Instruction Set Architecture   (the compiler translates the high-level language to the ISA)
  -> Microarchitecture
  -> Circuits
  -> Electrons

Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond


The ISA

An Instruction Set Architecture (ISA) is a contract between the hardware and the software. As the name suggests, it is the set of instructions that the architecture (hardware) can execute.


A program at the ISA level

A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware. Program instructions operate on data stored in memory or provided by Input/Output (I/O) devices.


The Von-Neumann Model

[Figure: the Von-Neumann model. Memory connects to I/O and exchanges data with the Processing Unit (ALU and Register File). The Control Unit (with Program Counter PC and Instruction Register IR) directs Memory and the Processing Unit; together the Processing Unit and Control Unit form the Processor.]

Arrays of Parallel Threads

A CUDA kernel is executed by a grid (array) of threads:
- A thread is conceptually a Von-Neumann processor
- All threads in a grid run the same kernel code (SPMD)
- Each thread has an index that it uses to compute memory addresses and make control decisions

[Figure: threads 0, 1, ..., 254, 255 of a grid, each executing]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
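Putting the index computation into a complete kernel, as a minimal sketch (the name vecAddKernel and the bounds check are illustrative):

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // map thread index to data index
    if (i < n)                                      // guard: the grid may have more threads than elements
        C[i] = A[i] + B[i];
}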


Thread Blocks: Scalable Cooperation

Divide the thread array into multiple blocks.

[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1; each block holds threads 0 .. 255, and every thread executes]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
- Threads in different blocks cannot cooperate
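A hedged sketch of block-level cooperation (not from the slides; assumes a block size of 256): threads stage data in shared memory, synchronize at a barrier, then read a neighbor's staged value.

__global__ void shiftAddKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];                       // per-block shared memory (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                                  // barrier: every thread in the block waits here
    if (i < n) {
        float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = tile[threadIdx.x] + right;           // reads a value written by another thread in the block
    }
}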


Vector Addition
Conceptual View

[Figure: vectors A and B are added element-wise to produce vector C: C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1].]


Vector Addition
Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}

Heterogeneous Computing vecAdd
CUDA Host Code

[Figure: Host (CPU, Host Memory) and Device (GPU, Device Memory); Parts 1-3 below move data and work between them.]

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: Allocate device memory for A, B, and C;
    //         copy A and B to device memory

    // Part 2: Kernel launch code to have the device
    //         perform the actual vector addition

    // Part 3: Copy C from the device memory;
    //         free device vectors
}

Partial Overview of CUDA Memories

Device code can:
- R/W per-thread registers
- R/W all-shared global memory

Host code can:
- Transfer data to/from per-grid global memory (see the sketch below)

[Figure: a (Device) Grid containing Block (0, 0) and Block (1, 0); each block contains Thread (0, 0) and Thread (1, 0), each with its own Registers; all blocks share Global Memory, which the Host can access.]

We will cover more later.
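A hedged sketch of the host-side calls for per-grid global memory (the variable names and size are illustrative):

float *d_A;                              // pointer to an object in device global memory
int size = 256 * sizeof(float);
cudaMalloc((void **) &d_A, size);        // allocate the object in global memory
// ... pass d_A to kernel launches ...
cudaFree(d_A);                           // free the global-memory object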



Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to device is asynchronous

[Figure: the same grid / block / thread / registers / global memory hierarchy as before; cudaMemcpy moves data between Host memory and device Global Memory.]
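For example (a hedged sketch; the pointers and size are illustrative):

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // destination, source, bytes, direction: host to device
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // direction: device to host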


void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    transfer A and B to device memory
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // 2. Kernel invocation code - to be shown later

    // 3. Transfer C from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
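The kernel invocation in step 2 is deferred in the slides; a hedged sketch of what it could look like, assuming the illustrative vecAddKernel shown earlier and a block size of 256:

// 2. Kernel invocation code (sketch; 256 threads per block is an assumption)
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);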


More on CUDA Function Declarations

                                     Executed on the:   Only callable from the:
__device__ float DeviceFunc()        device             device
__global__ void  KernelFunc()        device             host
__host__   float HostFunc()          host               host

- __global__ defines a kernel function
  - A kernel function must return void
- __device__ and __host__ can be used together
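A hedged illustration of the last point (the function names are made up): qualifying a function with both __host__ and __device__ compiles it for both the CPU and the GPU.

__host__ __device__ float square(float x) { return x * x; }  // usable on host and device

__global__ void squareKernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);                           // device-side call
}
// Ordinary host code can also call square(x) directly.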


Parallel Algorithm

Recipe to solve a problem on multiple processors.

Typical steps for constructing a parallel algorithm:
- identify what pieces of work can be performed concurrently
- partition concurrent work onto independent processors
- distribute a program's input, output, and intermediate data
- coordinate accesses to shared data: avoid conflicts
- ensure proper order of work using synchronization

Why "typical"? Some of the steps may be omitted:
- if data is in shared memory, distributing it may be unnecessary
- if using message passing, there may not be shared data
- the mapping of work to processors can be done statically by the programmer or dynamically by the runtime

Topics for Today

Introduction to parallel algorithms:
- tasks and decomposition
- threads and mapping
- threads versus cores

Decomposition techniques - part 1:
- recursive decomposition
- data decomposition

Decomposing Work for Parallel Execution

Divide work into tasks that can be executed concurrently:
- Many different decompositions possible for any computation
- Tasks may be same, different, or even indeterminate sizes
- Tasks may be independent or have non-trivial order
- Conceptualize tasks and ordering as a task dependency DAG
  - node = task
  - edge = control dependence

[Figure: an example task dependency DAG over tasks T1 through T17.]

Example: Dense Matrix-Vector Product

[Figure: matrix-vector product y = A b, with one task per element of y (Task 1, Task 2, ...).]

- Computing each element of output vector y is independent
- Easy to decompose dense matrix-vector product into tasks: one per element in y (sketched below)
- Observations:
  - task size is uniform
  - no control dependences between tasks
  - tasks share b
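A hedged CUDA sketch of this decomposition (the name matVecKernel is made up; A is assumed row-major): one thread per element of y, and every thread reads the shared vector b.

__global__ void matVecKernel(const float *A, const float *b, float *y,
                             int nRows, int nCols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one task per output element
    if (row < nRows) {
        float sum = 0.0f;
        for (int j = 0; j < nCols; ++j)
            sum += A[row * nCols + j] * b[j];          // row-major A; b is shared by all tasks
        y[row] = sum;
    }
}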