
Multithreaded Architectures
(Applied Parallel Programming)

Lecture 4: Memory and Data Locality

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, 2007-2016
Programmer View of CUDA Memories
• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory (~5 cycles)
  – Read/write per-grid global memory (~500 cycles)
  – Read-only per-grid constant memory (~5 cycles with caching)

[Figure: the CUDA memory model — a Grid of Blocks; each Block has per-thread Registers and per-block Shared Memory; Global Memory and Constant Memory are shared by the whole Grid and accessible from the Host]
The von Neumann Model

[Figure: Memory connected to I/O and to a Processing Unit (ALU and Register File), with a Control Unit holding the PC and IR]
Going Back to the Program
• Every instruction needs to be fetched from memory, decoded, then executed.
  – The decode stage typically accesses the register file
• Instructions come in three flavors: Operate, Data Transfer, and Program Control Flow.
• An example instruction cycle is the following:

  Fetch | Decode | Execute | Memory
Operate Instructions
• Example of an operate instruction:
ADD R1, R2, R3

• Instruction cycle for an operate instruction:


Fetch | Decode | Execute | Memory

Memory Access Instructions
• Examples of memory access instructions:
LDR R1, R2, #2
STR R1, R2, #2

• Instruction cycle for a memory instruction:


Fetch | Decode | Execute | Memory

Registers vs Memory
• Registers are “free”
– No additional memory access instruction
– Very fast to use, however, there are very few of them
• Memory is slow, but very large

[Figure: the von Neumann model again — the small, fast Register File sits inside the Processing Unit next to the ALU, while the large, slow Memory sits outside, connected via I/O]
CUDA Variable Type Qualifiers

Variable declaration                        Memory    Scope    Lifetime
Automatic variables other than arrays       register  thread   kernel
Automatic array variables                   local     thread   kernel
__device__ __shared__ int SharedVar;        shared    block    kernel
__device__ int GlobalVar;                   global    grid     application
__device__ __constant__ int ConstantVar;    constant  grid     application

• __device__ is optional when used with __shared__ or __constant__
• Automatic variables without any qualifier reside in a register
  – Except per-thread arrays, which reside in local memory
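To make the table concrete, here is a minimal sketch (not from the slides; the variable and kernel names are invented for illustration, and it assumes blockDim.x <= 256) that places one variable in each kind of memory:

// Hypothetical example: one variable per memory space
__constant__ float coeff[16];        // constant memory: per grid, application lifetime
__device__   int   globalCounter;    // global memory:   per grid, application lifetime

__global__ void qualifierDemo(float* out)
{
  __shared__ float tileBuf[256];     // shared memory: one copy per block, kernel lifetime
  float regVal = 0.0f;               // automatic scalar: kept in a register, per thread
  float scratch[4];                  // automatic array: placed in per-thread local memory

  int t = threadIdx.x;
  scratch[t % 4] = coeff[t % 16];
  tileBuf[t] = scratch[t % 4];
  __syncthreads();
  regVal = tileBuf[(t + 1) % blockDim.x];
  out[blockIdx.x * blockDim.x + t] = regVal + (float)globalCounter;
}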
Matrix Multiplication Example
A Simple Host Version in C

// Matrix multiplication on the (CPU) host in single precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      float sum = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[i * Width + k];
        float b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: row i of M and column j of N are combined, element by element over index k, to produce P[i][j]; M, N, and P are all WIDTH x WIDTH]
Kernel Function – A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

[Figure: a 4x4 result matrix P divided among Block(0,0), Block(0,1), Block(1,0), and Block(1,1).
 WIDTH = 4, TILE_WIDTH = 2; each block has 2*2 = 4 threads;
 WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks]
A Slightly Bigger Example (TILE_WIDTH = 2)

[Figure: an 8x8 result matrix P divided into 2x2 tiles.
 WIDTH = 8, TILE_WIDTH = 2; each block has 2*2 = 4 threads;
 WIDTH/TILE_WIDTH = 4, so use 4*4 = 16 blocks]
A Slightly Bigger Example (cont.) (TILE_WIDTH = 4)

[Figure: the same 8x8 result matrix P divided into 4x4 tiles.
 WIDTH = 8, TILE_WIDTH = 4; each block has 4*4 = 16 threads;
 WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks]


Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
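This configuration assumes Width is an exact multiple of TILE_WIDTH. When it is not, a common variant (not shown in these slides) rounds the grid size up and relies on the kernel's boundary check to skip out-of-range threads:

// Sketch: ceiling division so partial tiles still get a block
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);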


Kernel Function

// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

  // ... (the complete kernel is shown a few slides later)
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration

Row = blockIdx.y * blockDim.y + threadIdx.y = 0 * 2 + threadIdx.y   (Row = 0 or 1)
Col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + threadIdx.x   (Col = 0 or 1)

[Figure: block (0,0) computes the top-left 2x2 tile of P (P0,0, P0,1, P1,0, P1,1), reading rows 0–1 of M and columns 0–1 of N]
Work for Block (0,1)

Row = 0 * 2 + threadIdx.y   (Row = 0 or 1)
Col = 1 * 2 + threadIdx.x   (Col = 2 or 3)

[Figure: block (0,1) computes the top-right 2x2 tile of P (P0,2, P0,3, P1,2, P1,3), reading rows 0–1 of M and columns 2–3 of N]
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
How About Performance on a Device with 150 GB/s Memory Bandwidth?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add (2 FLOPs)
  – So 4 bytes of memory traffic per FLOP
  – 150 GB/s therefore limits the code to 37.5 GFLOPS
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak of more than 1,000 GFLOPS

[Figure: the CUDA memory model diagram again — all operands come from Global Memory]
Matrix-Matrix Multiplication using
Shared Memory

A Common Programming Strategy
• Global memory resides in device memory (DRAM)
• A profitable way of performing computation on the device is to tile the input data to take advantage of fast shared memory:
  – Partition data into subsets (tiles) that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory to global memory
// Shared memory tile declarations (the full tiled kernel appears on a later slide)
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
Shared Memory Blocking: Basic Idea

[Figure: top — each thread (Thread 1, Thread 2, …) fetches its data directly from Global Memory; bottom — the data is first staged from Global Memory into On-chip Memory, and the threads then read it from there]
Outline of the Technique
• Identify a tile of global data that is accessed by multiple threads
• Load the tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile
Idea: Place Global Memory Data into Shared Memory for Reuse
• Each input element is read by Width threads
• Load each element into shared memory and have several threads use the local copy to reduce the global memory bandwidth demand

[Figure: the M x N = P diagram; the row of M and column of N needed by thread (ty, tx) are shared with many other threads]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of M and N

[Figure: P is partitioned into TILE_WIDTH x TILE_WIDTH tiles indexed by (by, bx); thread (ty, tx) of a block computes one P element, and each phase consumes one TILE_WIDTH-wide strip of M and one of N]
Loading a Tile
• All threads in a block participate
  – Each thread loads one M element and one N element in the basic tiled code
• Assign the loaded elements to threads such that the accesses within each warp are coalesced (more on this later).
Work for Block (0,0)

[Figure: threads (0,0), (0,1), (1,0), and (1,1) of block (0,0) each copy one element of the 2x2 M tile (M0,0 M0,1 / M1,0 M1,1) and one element of the 2x2 N tile (N0,0 N0,1 / N1,0 N1,1) from global memory into the block's shared memory (SM)]
Work for Block (0,0)
Threads use the shared memory data in step 0.

[Figure: in step k = 0, each thread of block (0,0) reads column 0 of the M tile and row 0 of the N tile from shared memory to update its P element]
Work for Block (0,0)
Threads use the shared memory data in step 1.

[Figure: in step k = 1, each thread reads column 1 of the M tile and row 1 of the N tile from shared memory to finish the contribution of this tile pair]
Loading an Input Tile 0

Accessing tile 0 with 2D indexing:
  M[Row][tx]
  N[ty][Col]

[Figure: the first TILE_WIDTH-wide strip of M's row block and the first TILE_WIDTH-tall strip of N's column block are loaded; Row and Col identify the P element each thread computes]
Loading an Input Tile 1

Accessing tile 1 with 2D indexing:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

[Figure: the second TILE_WIDTH-wide strip of M and the second TILE_WIDTH-tall strip of N are loaded]
Loading an Input Tile m

However, recall that M and N are dynamically allocated and can only use 1D indexing:

  M[Row][m*TILE_WIDTH + tx]   becomes   M[Row*Width + m*TILE_WIDTH + tx]
  N[m*TILE_WIDTH + ty][Col]   becomes   N[(m*TILE_WIDTH + ty)*Width + Col]

[Figure: the m-th TILE_WIDTH-wide strip of M's row block and the m-th TILE_WIDTH-tall strip of N's column block are loaded]
Barrier Synchronization
• An API function call in CUDA
  – __syncthreads()
• All threads in the same block must reach the __syncthreads() before any can move on
• Best used to coordinate tiled algorithms
  – To ensure that all elements of a tile are loaded
  – To ensure that all elements of a tile are consumed
  (a minimal sketch of this load/compute pattern follows below)
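Not from the slides — a minimal, self-contained sketch of the load / barrier / use / barrier pattern that the tiled kernel later in this lecture relies on (the kernel and array names are invented; assumes blockDim.x <= 256):

__global__ void barrierDemo(const float* in, float* out, int n)
{
  __shared__ float buf[256];

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Phase 1: every thread loads one element of the tile
  buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();                 // wait until the whole tile is loaded

  // Phase 2: each thread reads a neighbor's element from shared memory
  int left = (threadIdx.x == 0) ? blockDim.x - 1 : threadIdx.x - 1;
  if (i < n) out[i] = 0.5f * (buf[threadIdx.x] + buf[left]);
  __syncthreads();                 // needed before the tile is refilled in a loop
}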
[Figure: an example execution timing of barrier synchronization — Thread 0 through Thread N-1 reach the barrier at different times, and none proceeds until the last one arrives]
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of M and N tiles into shared memory
    subTileM[ty][tx] = M[Row*Width + m*TILE_WIDTH + tx];
    subTileN[ty][tx] = N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += subTileM[ty][k] * subTileN[k][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}
Compare with the Base Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  // Calculate the row index of the P element and M
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x * blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += M[Row*Width+k] * N[k*Width+Col];
    P[Row*Width+Col] = Pvalue;
  }
}
Shared Memory and Threading
• Each SM in Maxwell has 64KB of shared memory (48KB max per block)
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
    • Shared memory alone could support up to 32 thread blocks actively executing
    • The 2,048-thread-per-SM limit caps this at 8 blocks of 256 threads, allowing up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory per thread block, allowing 8 thread blocks to be active at the same time
• Using 16x16 tiling, we reduce accesses to global memory by a factor of 16
  – The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
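These hand calculations can also be cross-checked at run time with the CUDA occupancy API. A minimal host-side sketch (not from the slides), assuming the tiled MatrixMulKernel above and TILE_WIDTH = 16:

int numBlocksPerSM = 0;
// Resident blocks per SM for 16x16 = 256-thread blocks, given the kernel's
// actual register and (static) shared memory usage; dynamic shared memory is 0
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSM, MatrixMulKernel, TILE_WIDTH * TILE_WIDTH, 0);
printf("Resident thread blocks per SM: %d\n", numBlocksPerSM);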
Device Query
• Number of devices in the system
    int dev_count;
    cudaGetDeviceCount(&dev_count);
• Capability of devices
    cudaDeviceProp dev_prop;
    for (int i = 0; i < dev_count; i++) {
      cudaGetDeviceProperties(&dev_prop, i);
      // decide if device has sufficient resources and capabilities
    }
• cudaDeviceProp is a built-in C structure type
  – dev_prop.maxThreadsPerBlock
  – dev_prop.sharedMemPerBlock
  – ...
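As a small follow-up sketch (not from the slides): the same properties can be used to pick a device, for example one with enough shared memory per block for two 16x16 float tiles and the most SMs:

int best = 0, most_sms = 0;
size_t tile_bytes = 2 * 16 * 16 * sizeof(float);   // two 16x16 float tiles = 2KB
for (int i = 0; i < dev_count; i++) {
  cudaGetDeviceProperties(&dev_prop, i);
  if (dev_prop.sharedMemPerBlock >= tile_bytes &&
      dev_prop.multiProcessorCount > most_sms) {
    most_sms = dev_prop.multiProcessorCount;
    best = i;
  }
}
cudaSetDevice(best);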
