
Objective

To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism:
- Hierarchical thread organization
- Main interfaces for launching parallel execution
- Mapping of thread index(es) to data index(es)

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012


ECE408/CS483, University of Illinois, Urbana-Champaign

Data Parallelism
Vector Addition Example

[Figure: vectors A and B are added element-wise to produce vector C: C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1]. Each element-wise addition can be performed independently.]


Execution Model

Integrated host + device application C program (CUDA C / OpenCL):
- Serial or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code

[Figure: execution alternates between host and device phases:
  Serial Code (host) ...
  Parallel Kernel (device)   KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host) ...
  Parallel Kernel (device)   KernelB<<< nBlk, nTid >>>(args); ...]
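A minimal runnable sketch of this pattern, assuming illustrative kernels KernelA and KernelB (these names, the data, and the launch configuration are placeholders taken from the slide, not a real API):

// Stand-in kernels for the KernelA / KernelB phases on the slide.
__global__ void KernelA(int *data) { data[threadIdx.x] += 1; }
__global__ void KernelB(int *data) { data[threadIdx.x] *= 2; }

int main(void)
{
    const int nTid = 256, nBlk = 1;              // launch configuration
    int *d_data;
    cudaMalloc((void **) &d_data, nTid * sizeof(int));
    cudaMemset(d_data, 0, nTid * sizeof(int));

    // Serial host code ... then a parallel kernel phase on the device:
    KernelA<<<nBlk, nTid>>>(d_data);
    cudaDeviceSynchronize();                     // host waits for the device phase

    // More serial host code ... then another kernel phase:
    KernelB<<<nBlk, nTid>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}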

CUDA C

Extension to the standard C language with new keywords and Application Programming Interface (API) functions:
- Device functions: functions that are executed by the device
- Device data declarations: variables and data structures processed by the device
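A hedged illustration of these extensions (the names d_scale, scaled, and scaleKernel are made up for this sketch):

__device__ float d_scale = 2.0f;              // device data declaration (lives in device memory)

__device__ float scaled(float x)              // device function: executed by the device
{
    return d_scale * x;
}

__global__ void scaleKernel(float *v, int n)  // kernel: launched from the host, runs on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = scaled(v[i]);
}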


Compiling A CUDA Program

Integrated C programs with CUDA extensions are compiled by the NVCC compiler, which splits them into:
- Host Code, handled by the host C compiler/linker
- Device Code (PTX), handled by the device just-in-time compiler

The result runs on a heterogeneous computing platform with CPUs and GPUs.
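As a hedged example (the file and executable names are illustrative), such a program is typically built and run with:

nvcc -o vecadd vecadd.cu    # NVCC separates host and device code, emits PTX for the
                            # device, and hands the host code to the host C compiler/linker
./vecadd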

From Natural Language to Electrons

Natural Language (e.g., English)
  -> Algorithm
  -> High-Level Language (e.g., C)
  -> Instruction Set Architecture   (the compiler translates the high-level language to the ISA)
  -> Microarchitecture
  -> Circuits
  -> Electrons

Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond


The ISA

An Instruction Set Architecture (ISA) is a contract between the hardware and the software. As the name suggests, it is the set of instructions that the architecture (hardware) can execute.


A program at the ISA level

A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware. Program instructions operate on data stored in memory or provided by Input/Output (I/O) devices.


The Von-Neumann Model

[Figure: the Von-Neumann model. Memory connects to I/O and exchanges data with the Processing Unit (ALU and Register File). The Control Unit (with Program Counter PC and Instruction Register IR) directs Memory and the Processing Unit; together the Processing Unit and Control Unit form the Processor.]

Arrays of Parallel Threads

A CUDA kernel is executed by a grid (array) of threads:
- A thread is conceptually a Von-Neumann processor
- All threads in a grid run the same kernel code (SPMD)
- Each thread has an index that it uses to compute memory addresses and make control decisions

[Figure: threads 0, 1, ..., 254, 255 of a grid, each executing]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
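Putting the index computation into a complete kernel, as a minimal sketch (the name vecAddKernel and the bounds check are illustrative):

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // map thread index to data index
    if (i < n)                                      // guard: the grid may have more threads than elements
        C[i] = A[i] + B[i];
}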


Thread Blocks: Scalable Cooperation

Divide the thread array into multiple blocks.

[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1; each block holds threads 0 .. 255, and every thread executes]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
- Threads in different blocks cannot cooperate
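A hedged sketch of block-level cooperation (not from the slides; assumes a block size of 256): threads stage data in shared memory, synchronize at a barrier, then read a neighbor's staged value.

__global__ void shiftAddKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];                       // per-block shared memory (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                                  // barrier: every thread in the block waits here
    if (i < n) {
        float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = tile[threadIdx.x] + right;           // reads a value written by another thread in the block
    }
}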


Vector Addition
Conceptual View

[Figure: vectors A and B are added element-wise to produce vector C: C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1].]


Vector Addition
Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}

Heterogeneous Computing vecAdd
CUDA Host Code

[Figure: Host (CPU, Host Memory) and Device (GPU, Device Memory); Parts 1-3 below move data and work between them.]

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: Allocate device memory for A, B, and C;
    //         copy A and B to device memory

    // Part 2: Kernel launch code to have the device
    //         perform the actual vector addition

    // Part 3: Copy C from the device memory;
    //         free device vectors
}

Partial Overview of CUDA Memories

Device code can:
- R/W per-thread registers
- R/W all-shared global memory

Host code can:
- Transfer data to/from per-grid global memory (see the sketch below)

[Figure: a (Device) Grid containing Block (0, 0) and Block (1, 0); each block contains Thread (0, 0) and Thread (1, 0), each with its own Registers; all blocks share Global Memory, which the Host can access.]

We will cover more later.
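A hedged sketch of the host-side calls for per-grid global memory (the variable names and size are illustrative):

float *d_A;                              // pointer to an object in device global memory
int size = 256 * sizeof(float);
cudaMalloc((void **) &d_A, size);        // allocate the object in global memory
// ... pass d_A to kernel launches ...
cudaFree(d_A);                           // free the global-memory object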



Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to device is asynchronous

[Figure: the same grid / block / thread / registers / global memory hierarchy as before; cudaMemcpy moves data between Host memory and device Global Memory.]
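For example (a hedged sketch; the pointers and size are illustrative):

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // destination, source, bytes, direction: host to device
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // direction: device to host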


void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    transfer A and B to device memory
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // 2. Kernel invocation code - to be shown later

    // 3. Transfer C from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
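The kernel invocation in step 2 is deferred in the slides; a hedged sketch of what it could look like, assuming the illustrative vecAddKernel shown earlier and a block size of 256:

// 2. Kernel invocation code (sketch; 256 threads per block is an assumption)
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);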


More on CUDA Function Declarations

                                     Executed on the:   Only callable from the:
__device__ float DeviceFunc()        device             device
__global__ void  KernelFunc()        device             host
__host__   float HostFunc()          host               host

- __global__ defines a kernel function
  - A kernel function must return void
- __device__ and __host__ can be used together
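A hedged illustration of the last point (the function names are made up): qualifying a function with both __host__ and __device__ compiles it for both the CPU and the GPU.

__host__ __device__ float square(float x) { return x * x; }  // usable on host and device

__global__ void squareKernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);                           // device-side call
}
// Ordinary host code can also call square(x) directly.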


Parallel Algorithm

Recipe to solve a problem on multiple processors.

Typical steps for constructing a parallel algorithm:
- identify what pieces of work can be performed concurrently
- partition concurrent work onto independent processors
- distribute a program's input, output, and intermediate data
- coordinate accesses to shared data: avoid conflicts
- ensure proper order of work using synchronization

Why "typical"? Some of the steps may be omitted:
- if data is in shared memory, distributing it may be unnecessary
- if using message passing, there may not be shared data
- the mapping of work to processors can be done statically by the programmer or dynamically by the runtime

Topics for Today

Introduction to parallel algorithms:
- tasks and decomposition
- threads and mapping
- threads versus cores

Decomposition techniques - part 1:
- recursive decomposition
- data decomposition

Decomposing Work for Parallel Execution

Divide work into tasks that can be executed concurrently:
- Many different decompositions possible for any computation
- Tasks may be same, different, or even indeterminate sizes
- Tasks may be independent or have non-trivial order
- Conceptualize tasks and ordering as a task dependency DAG
  - node = task
  - edge = control dependence

[Figure: an example task dependency DAG over tasks T1 through T17.]

Example: Dense Matrix-Vector Product

[Figure: matrix-vector product y = A b, with one task per element of y (Task 1, Task 2, ...).]

- Computing each element of output vector y is independent
- Easy to decompose dense matrix-vector product into tasks: one per element in y (sketched below)
- Observations:
  - task size is uniform
  - no control dependences between tasks
  - tasks share b
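A hedged CUDA sketch of this decomposition (the name matVecKernel is made up; A is assumed row-major): one thread per element of y, and every thread reads the shared vector b.

__global__ void matVecKernel(const float *A, const float *b, float *y,
                             int nRows, int nCols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one task per output element
    if (row < nRows) {
        float sum = 0.0f;
        for (int j = 0; j < nCols; ++j)
            sum += A[row * nCols + j] * b[j];          // row-major A; b is shared by all tasks
        y[row] = sum;
    }
}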