
Multithreaded Architectures
(Applied Parallel Programming)

Lecture 4: Memory and Data Locality

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, 2007-2016
Programmer View of CUDA Memories
• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory (~5 cycles)
  – Read/write per-grid global memory (~500 cycles)
  – Read-only per-grid constant memory (~5 cycles with caching)

[Figure: the CUDA memory model — a Grid of Blocks; each Block has per-thread Registers and per-block Shared Memory; Global Memory and Constant Memory are shared by the whole Grid and accessible from the Host]
The von Neumann Model

[Figure: Memory connected to I/O and to a Processing Unit (ALU and Register File), with a Control Unit holding the PC and IR]
Going Back to the Program
• Every instruction needs to be fetched from memory, decoded, then executed.
  – The decode stage typically accesses the register file
• Instructions come in three flavors: Operate, Data Transfer, and Program Control Flow.
• An example instruction cycle is the following:

  Fetch | Decode | Execute | Memory
Operate Instructions
• Example of an operate instruction:
ADD R1, R2, R3

• Instruction cycle for an operate instruction:


Fetch | Decode | Execute | Memory

Memory Access Instructions
• Examples of memory access instructions:
LDR R1, R2, #2
STR R1, R2, #2

• Instruction cycle for a memory instruction:


Fetch | Decode | Execute | Memory

Registers vs Memory
• Registers are “free”
– No additional memory access instruction
– Very fast to use, however, there are very few of them
• Memory is slow, but very large

[Figure: the von Neumann model again — the small, fast Register File sits inside the Processing Unit next to the ALU, while the large, slow Memory sits outside, connected via I/O]
CUDA Variable Type Qualifiers

Variable declaration                        Memory    Scope    Lifetime
Automatic variables other than arrays       register  thread   kernel
Automatic array variables                   local     thread   kernel
__device__ __shared__ int SharedVar;        shared    block    kernel
__device__ int GlobalVar;                   global    grid     application
__device__ __constant__ int ConstantVar;    constant  grid     application

• __device__ is optional when used with __shared__ or __constant__
• Automatic variables without any qualifier reside in a register
  – Except per-thread arrays, which reside in local memory
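To make the table concrete, here is a minimal sketch (not from the slides; the variable and kernel names are invented for illustration, and it assumes blockDim.x <= 256) that places one variable in each kind of memory:

// Hypothetical example: one variable per memory space
__constant__ float coeff[16];        // constant memory: per grid, application lifetime
__device__   int   globalCounter;    // global memory:   per grid, application lifetime

__global__ void qualifierDemo(float* out)
{
  __shared__ float tileBuf[256];     // shared memory: one copy per block, kernel lifetime
  float regVal = 0.0f;               // automatic scalar: kept in a register, per thread
  float scratch[4];                  // automatic array: placed in per-thread local memory

  int t = threadIdx.x;
  scratch[t % 4] = coeff[t % 16];
  tileBuf[t] = scratch[t % 4];
  __syncthreads();
  regVal = tileBuf[(t + 1) % blockDim.x];
  out[blockIdx.x * blockDim.x + t] = regVal + (float)globalCounter;
}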
Matrix Multiplication Example
A Simple Host Version in C

// Matrix multiplication on the (CPU) host in single precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      float sum = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[i * Width + k];
        float b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: row i of M and column j of N are combined, element by element over index k, to produce P[i][j]; M, N, and P are all WIDTH x WIDTH]
Kernel Function – A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

[Figure: a 4x4 result matrix P divided among Block(0,0), Block(0,1), Block(1,0), and Block(1,1).
 WIDTH = 4, TILE_WIDTH = 2; each block has 2*2 = 4 threads;
 WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks]
A Slightly Bigger Example (TILE_WIDTH = 2)

[Figure: an 8x8 result matrix P divided into 2x2 tiles.
 WIDTH = 8, TILE_WIDTH = 2; each block has 2*2 = 4 threads;
 WIDTH/TILE_WIDTH = 4, so use 4*4 = 16 blocks]
A Slightly Bigger Example (cont.) (TILE_WIDTH = 4)

[Figure: the same 8x8 result matrix P divided into 4x4 tiles.
 WIDTH = 8, TILE_WIDTH = 4; each block has 4*4 = 16 threads;
 WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks]


Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
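This configuration assumes Width is an exact multiple of TILE_WIDTH. When it is not, a common variant (not shown in these slides) rounds the grid size up and relies on the kernel's boundary check to skip out-of-range threads:

// Sketch: ceiling division so partial tiles still get a block
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);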


Kernel Function

// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

  // ... (the complete kernel is shown a few slides later)
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration

Row = blockIdx.y * blockDim.y + threadIdx.y = 0 * 2 + threadIdx.y   (Row = 0 or 1)
Col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + threadIdx.x   (Col = 0 or 1)

[Figure: block (0,0) computes the top-left 2x2 tile of P (P0,0, P0,1, P1,0, P1,1), reading rows 0–1 of M and columns 0–1 of N]
Work for Block (0,1)

Row = 0 * 2 + threadIdx.y   (Row = 0 or 1)
Col = 1 * 2 + threadIdx.x   (Col = 2 or 3)

[Figure: block (0,1) computes the top-right 2x2 tile of P (P0,2, P0,3, P1,2, P1,3), reading rows 0–1 of M and columns 2–3 of N]
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
How About Performance on a Device with 150 GB/s Memory Bandwidth?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add (2 FLOPs)
  – So 4 bytes of memory traffic per FLOP
  – 150 GB/s therefore limits the code to 37.5 GFLOPS
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak of more than 1,000 GFLOPS

[Figure: the CUDA memory model diagram again — all operands come from Global Memory]
Matrix-Matrix Multiplication using
Shared Memory

A Common Programming Strategy
• Global memory resides in device memory (DRAM)
• A profitable way of performing computation on the device is to tile the input data to take advantage of fast shared memory:
  – Partition data into subsets (tiles) that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory to global memory
// Shared memory tile declarations (the full tiled kernel appears on a later slide)
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
Shared Memory Blocking: Basic Idea

[Figure: top — each thread (Thread 1, Thread 2, …) fetches its data directly from Global Memory; bottom — the data is first staged from Global Memory into On-chip Memory, and the threads then read it from there]
Outline of the Technique
• Identify a tile of global data that is accessed by multiple threads
• Load the tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile
Idea: Place Global Memory Data into Shared Memory for Reuse
• Each input element is read by Width threads
• Load each element into shared memory and have several threads use the local copy to reduce the global memory bandwidth demand

[Figure: the M x N = P diagram; the row of M and column of N needed by thread (ty, tx) are shared with many other threads]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of M and N

[Figure: P is partitioned into TILE_WIDTH x TILE_WIDTH tiles indexed by (by, bx); thread (ty, tx) of a block computes one P element, and each phase consumes one TILE_WIDTH-wide strip of M and one of N]
Loading a Tile
• All threads in a block participate
  – Each thread loads one M element and one N element in the basic tiled code
• Assign the loaded elements to threads such that the accesses within each warp are coalesced (more on this later).
Work for Block (0,0)

[Figure: threads (0,0), (0,1), (1,0), and (1,1) of block (0,0) each copy one element of the 2x2 M tile (M0,0 M0,1 / M1,0 M1,1) and one element of the 2x2 N tile (N0,0 N0,1 / N1,0 N1,1) from global memory into the block's shared memory (SM)]
Work for Block (0,0)
Threads use the shared memory data in step 0.

[Figure: in step k = 0, each thread of block (0,0) reads column 0 of the M tile and row 0 of the N tile from shared memory to update its P element]
Work for Block (0,0)
Threads use the shared memory data in step 1.

[Figure: in step k = 1, each thread reads column 1 of the M tile and row 1 of the N tile from shared memory to finish the contribution of this tile pair]
Loading an Input Tile 0

Accessing tile 0 with 2D indexing:
  M[Row][tx]
  N[ty][Col]

[Figure: the first TILE_WIDTH-wide strip of M's row block and the first TILE_WIDTH-tall strip of N's column block are loaded; Row and Col identify the P element each thread computes]
Loading an Input Tile 1

Accessing tile 1 with 2D indexing:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

[Figure: the second TILE_WIDTH-wide strip of M and the second TILE_WIDTH-tall strip of N are loaded]
Loading an Input Tile m

However, recall that M and N are dynamically allocated and can only use 1D indexing:

  M[Row][m*TILE_WIDTH + tx]   becomes   M[Row*Width + m*TILE_WIDTH + tx]
  N[m*TILE_WIDTH + ty][Col]   becomes   N[(m*TILE_WIDTH + ty)*Width + Col]

[Figure: the m-th TILE_WIDTH-wide strip of M's row block and the m-th TILE_WIDTH-tall strip of N's column block are loaded]
Barrier Synchronization
• An API function call in CUDA
  – __syncthreads()
• All threads in the same block must reach the __syncthreads() before any can move on
• Best used to coordinate tiled algorithms
  – To ensure that all elements of a tile are loaded
  – To ensure that all elements of a tile are consumed
  (a minimal sketch of this load/compute pattern follows below)
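Not from the slides — a minimal, self-contained sketch of the load / barrier / use / barrier pattern that the tiled kernel later in this lecture relies on (the kernel and array names are invented; assumes blockDim.x <= 256):

__global__ void barrierDemo(const float* in, float* out, int n)
{
  __shared__ float buf[256];

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Phase 1: every thread loads one element of the tile
  buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();                 // wait until the whole tile is loaded

  // Phase 2: each thread reads a neighbor's element from shared memory
  int left = (threadIdx.x == 0) ? blockDim.x - 1 : threadIdx.x - 1;
  if (i < n) out[i] = 0.5f * (buf[threadIdx.x] + buf[left]);
  __syncthreads();                 // needed before the tile is refilled in a loop
}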
[Figure: an example execution timing of barrier synchronization — Thread 0 through Thread N-1 reach the barrier at different times, and none proceeds until the last one arrives]
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of M and N tiles into shared memory
    subTileM[ty][tx] = M[Row*Width + m*TILE_WIDTH + tx];
    subTileN[ty][tx] = N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += subTileM[ty][k] * subTileN[k][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}
Compare with the Base Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  // Calculate the row index of the P element and M
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x * blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += M[Row*Width+k] * N[k*Width+Col];
    P[Row*Width+Col] = Pvalue;
  }
}
Shared Memory and Threading
• Each SM in Maxwell has 64KB of shared memory (48KB max per block)
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
    • Shared memory alone could support up to 32 thread blocks actively executing
    • The 2,048-thread-per-SM limit caps this at 8 blocks of 256 threads, allowing up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory per thread block, allowing 8 thread blocks to be active at the same time
• Using 16x16 tiling, we reduce accesses to global memory by a factor of 16
  – The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
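These hand calculations can also be cross-checked at run time with the CUDA occupancy API. A minimal host-side sketch (not from the slides), assuming the tiled MatrixMulKernel above and TILE_WIDTH = 16:

int numBlocksPerSM = 0;
// Resident blocks per SM for 16x16 = 256-thread blocks, given the kernel's
// actual register and (static) shared memory usage; dynamic shared memory is 0
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSM, MatrixMulKernel, TILE_WIDTH * TILE_WIDTH, 0);
printf("Resident thread blocks per SM: %d\n", numBlocksPerSM);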
Device Query
• Number of devices in the system
    int dev_count;
    cudaGetDeviceCount(&dev_count);
• Capability of devices
    cudaDeviceProp dev_prop;
    for (int i = 0; i < dev_count; i++) {
      cudaGetDeviceProperties(&dev_prop, i);
      // decide if device has sufficient resources and capabilities
    }
• cudaDeviceProp is a built-in C structure type
  – dev_prop.maxThreadsPerBlock
  – dev_prop.sharedMemPerBlock
  – ...
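As a small follow-up sketch (not from the slides): the same properties can be used to pick a device, for example one with enough shared memory per block for two 16x16 float tiles and the most SMs:

int best = 0, most_sms = 0;
size_t tile_bytes = 2 * 16 * 16 * sizeof(float);   // two 16x16 float tiles = 2KB
for (int i = 0; i < dev_count; i++) {
  cudaGetDeviceProperties(&dev_prop, i);
  if (dev_prop.sharedMemPerBlock >= tile_bytes &&
      dev_prop.multiProcessorCount > most_sms) {
    most_sms = dev_prop.multiProcessorCount;
    best = i;
  }
}
cudaSetDevice(best);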
