Multithreaded Architectures: Memory and Data Locality
Lecture 4: Memory and Data Locality
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, 2007-2016
Programmer View of CUDA Memories
• Each thread can:
– Read/write per-thread registers (~1 cycle)
– Read/write per-block shared memory
[Figure: a grid of thread blocks — Block (0, 0) and Block (1, 0) — each with its own shared memory and per-thread registers.]
The Von Neumann Model
[Figure: Memory connected to I/O and to the Processing Unit (ALU plus Register File); the Control Unit holds the PC and IR.]
Going back to the program
• Every instruction needs to be fetched from memory, decoded, then executed.
– The decode stage typically accesses the register file
• Instructions come in three flavors: operate, data transfer, and program control flow.
• An example instruction cycle: fetch the instruction at the PC into the IR, decode it, then execute it.
Operate Instructions
• Example of an operate instruction:
ADD R1, R2, R3
– R1 ← R2 + R3: the source registers R2 and R3 are read from the register file, added in the ALU, and the result is written back to R1.
Memory Access Instructions
• Examples of memory access instructions:
LDR R1, R2, #2
STR R1, R2, #2
– LDR loads R1 from the memory address R2 + 2; STR stores R1 to the memory address R2 + 2 (both are tied together in the sketch below).
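To tie the instruction flavors to the fetch-decode-execute cycle, here is a hypothetical C sketch (added for illustration, not from the slides) of a toy machine running the three example instructions; for ADD, the off field is reused as the second source register:

#include <stdio.h>

enum { ADD, LDR, STR, HALT };                 /* operate, data transfer, stop */
typedef struct { int op, dst, src, off; } Inst;

int main(void) {
    int reg[8] = {0};
    int mem[16] = {0};
    reg[2] = 3;                               /* base address held in R2 */
    mem[5] = 42;                              /* the word at address R2 + 2 */
    Inst prog[] = {
        { LDR,  1, 2, 2 },                    /* LDR R1, R2, #2 : R1 <- mem[R2+2] */
        { ADD,  3, 1, 1 },                    /* ADD R3, R1, R1 : R3 <- R1 + R1   */
        { STR,  3, 2, 2 },                    /* STR R3, R2, #2 : mem[R2+2] <- R3 */
        { HALT, 0, 0, 0 },
    };
    int pc = 0;                               /* program counter */
    for (;;) {
        Inst ir = prog[pc++];                 /* fetch into the IR, advance the PC */
        if (ir.op == HALT) break;             /* decode, then execute: */
        else if (ir.op == ADD) reg[ir.dst] = reg[ir.src] + reg[ir.off];
        else if (ir.op == LDR) reg[ir.dst] = mem[reg[ir.src] + ir.off];
        else if (ir.op == STR) mem[reg[ir.src] + ir.off] = reg[ir.dst];
    }
    printf("mem[5] = %d\n", mem[5]);          /* prints mem[5] = 84 */
    return 0;
}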
Registers vs Memory
• Registers are “free”
– No additional memory access instruction is needed
– Very fast to use; however, there are very few of them
• Memory is slow, but very large
[Figure: the Von Neumann model again — Memory, I/O, Processing Unit with ALU and Register File, Control Unit with PC and IR.]
CUDA Variable Type Qualifiers
Variable declaration                  | Memory   | Scope  | Lifetime
Automatic variables other than arrays | register | thread | kernel
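For illustration (an added sketch, not from the slide), the qualifiers in source form; the automatic scalar corresponds to the register row above, and __shared__ reappears in the tiled kernels later in this lecture. The kernel and variable names here are hypothetical:

__device__ float GlobalVar;             // global memory: grid scope, application lifetime
__constant__ float ConstVal;            // constant memory: read-only in kernels
__global__ void example_kernel(float* out) {
    __shared__ float tile[64];          // shared memory: block scope, kernel lifetime
    float sum = GlobalVar + ConstVal;   // automatic scalar: lives in a per-thread register
    tile[threadIdx.x] = sum;            // assumes a launch with blockDim.x <= 64
    out[threadIdx.x] = tile[threadIdx.x];
}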
The heart of the base matrix multiplication kernel: each thread computes one P element as the dot product of row i of M and column j of N.
float sum = 0;
for (int k = 0; k < Width; ++k) {
    float a = M[i * Width + k];
    float b = N[k * Width + j];
    sum += a * b;
}
P[i * Width + j] = sum;
[Figure: Width x Width matrices M and P; index k walks along row i of M to produce P element (i, j).]
Kernel Function - A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
– Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks (a host-side launch sketch follows the figure below)
[Figure: the result matrix partitioned into four tiles — Block(0,0), Block(0,1), Block(1,0), Block(1,1).]
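A minimal sketch of the host-side launch this implies (added for illustration; assumes square matrices with Width divisible by TILE_WIDTH, and the MatrixMulKernel defined later in this lecture):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);             // (TILE_WIDTH)² threads per block
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);  // (WIDTH/TILE_WIDTH)² blocks
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);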
A Slightly Bigger Example (TILE_WIDTH = 2)
A Slightly Bigger Example (cont.) (TILE_WIDTH = 4)
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
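// A sketch of the remaining body (added for illustration), matching the
// dot-product loop shown earlier and the index computations on the
// "Compare with Base Kernel" slide:
    int Row = blockIdx.y * blockDim.y + threadIdx.y;   // row of P (and of M)
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // column of P (and of N)
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;                   // one output element per thread
    }
}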
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
• Each thread computes one P element: Row = blockIdx.y * blockDim.y + threadIdx.y and Col = blockIdx.x * blockDim.x + threadIdx.x
• For the block with blockIdx = (0,0): Col = 0 * 2 + threadIdx.x and Row = 0 * 2 + threadIdx.y, covering Col = 0 and 1
• For the next block along x: Col = 1 * 2 + threadIdx.x and Row = 0 * 2 + threadIdx.y, covering Col = 2 and 3
[Figure: threads (0,0) and (1,0) of each block reading row N0,0 N0,1 N0,2 N0,3 of N.]
Performance of the Base Kernel
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak of more than 1,000 GFLOPS
[Figure: the CUDA memory hierarchy — host, global memory, and constant memory shared by all thread blocks.]
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
• A profitable way of performing computation on the device is to tile the input data to take advantage of fast shared memory:
– Partition the data into subsets (tiles) that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
• Copying the results from shared memory back to global memory
Tiled Matrix Multiplication Kernel 1/2
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
__shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
__shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
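// A sketch of the rest of part 1/2 (added for illustration; consistent
// with the index formulas on the "Loading an Input Tile" slides below):
   int bx = blockIdx.x;  int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;
   // Identify the row and column of the P element to work on
   int Row = by * TILE_WIDTH + ty;
   int Col = bx * TILE_WIDTH + tx;
   float Pvalue = 0;
   // ... the phase loop continues in part 2/2 ...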
Shared Memory Blocking Basic Idea
[Figure: without blocking, Thread 1, Thread 2, … each fetch their data directly from global memory; with blocking, the data is staged once into on-chip memory and the threads read it from there.]
Outline of Technique
• Identify a tile of global data that is accessed by multiple threads
• Load the tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile (a generic sketch follows this list)
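An added, generic illustration of these four steps (a hypothetical kernel, not from the slides): each block stages one tile of a vector in shared memory, and every element is then read from on-chip memory by up to two threads.

#define BLOCK_SIZE 256
__global__ void pair_average(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE];     // assumes blockDim.x == BLOCK_SIZE
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Steps 1-2: identify the tile and load it into on-chip memory
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // the whole tile is now on chip
    // Step 3: each tile element is read by up to two threads from shared memory
    if (i < n) {
        int left = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
        out[i] = 0.5f * (tile[left] + tile[threadIdx.x]);
    }
    // Step 4: the next tile is handled by the next block
}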
Idea: Place global memory data into Shared Memory for reuse
• Each input element is read by Width threads
• Load each element into shared memory and have several threads use the local copy to reduce the demand on memory bandwidth
[Figure: Width x Width matrices M, N, and P; thread (tx, ty) reads a row of M and a column of N, each shared with Width - 1 other threads.]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of M and N
[Figure: M, N, and P divided into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) select one tile of P.]
Loading a Tile
• All threads in a block participate
– Each thread loads one M element and one N element in the basic tiled code
Work for Block (0,0)
[Figure: threads of Block (0,0), such as Thread (0,1) and Thread (1,1), copy the tile M0,0 M0,1 / M1,0 M1,1 of the 4×4 matrix M into shared memory (SM), on the way to computing P0,0 … P3,3.]
Work for Block (0,0)
• Threads use the shared memory data in step 0.
[Figure: the same 4×4 M, its 2×2 tile held in shared memory, and the P elements being accumulated.]
Loading an Input Tile 0
• 2D indexing for accessing tile 0: M[Row][tx] and N[ty][Col]
[Figure: M, N, and P divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) and thread (tx, ty) select the first tile along Row and Col.]
Loading an Input Tile 1
• 2D indexing for accessing tile 1: M[Row][1*TILE_WIDTH+tx] and N[1*TILE_WIDTH+ty][Col]
[Figure: the same tiling diagram, with the second tile along the shared dimension highlighted.]
Loading an Input Tile m
• 2D indexing for accessing tile m: M[Row][m*TILE_WIDTH+tx] and N[m*TILE_WIDTH+ty][Col]
• The same accesses with the matrices linearized in row-major order:
M[Row*Width + m*TILE_WIDTH + tx]
N[(m*TILE_WIDTH+ty) * Width + Col]
[Figure: tile m of M along Row and tile m of N along Col feed one TILE_WIDTH x TILE_WIDTH tile of P.]
Barrier Synchronization
• An API function call in CUDA:
– __syncthreads()
– All threads of a block must reach the barrier before any of them can proceed past it.
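A minimal added sketch of why the barrier matters (hypothetical kernel; assumes a launch with exactly 256 threads per block and at least 256 elements in data): no thread may read a shared-memory slot until every thread has finished writing its own.

__global__ void reverse_block(float* data) {
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = data[t];
    __syncthreads();          // every write to buf finishes before any read below
    data[t] = buf[255 - t];   // safe: slot 255 - t was written by another thread
}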
[Figure: an execution timeline of Thread 0 … Thread N-1 — threads reach the barrier at different times, and every thread waits until the last one arrives before proceeding.]
Tiled Matrix Multiplication Kernel 2/2
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
   __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
   __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
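   // A sketch of the rest of the listing (added for illustration), built
   // directly from the index formulas on the "Loading an Input Tile" slides;
   // assumes Width is divisible by TILE_WIDTH:
   int tx = threadIdx.x, ty = threadIdx.y;
   int Row = blockIdx.y * TILE_WIDTH + ty;
   int Col = blockIdx.x * TILE_WIDTH + tx;
   float Pvalue = 0;
   // Loop over the M and N tiles required to compute the P element
   for (int m = 0; m < Width/TILE_WIDTH; ++m) {
      // Collaboratively load one tile of M and one tile of N
      subTileM[ty][tx] = M[Row*Width + m*TILE_WIDTH + tx];
      subTileN[ty][tx] = N[(m*TILE_WIDTH + ty)*Width + Col];
      __syncthreads();          // wait until both tiles are loaded
      for (int k = 0; k < TILE_WIDTH; ++k)
         Pvalue += subTileM[ty][k] * subTileN[k][tx];
      __syncthreads();          // wait before the tiles are overwritten
   }
   P[Row*Width + Col] = Pvalue;
}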
Compare with Base Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
   // Calculate the row index of the P element and M
   int Row = blockIdx.y * blockDim.y + threadIdx.y;
   // Calculate the column index of P and N
   int Col = blockIdx.x * blockDim.x + threadIdx.x;
   // Each thread computes one P element; every M and N element it touches
   // is re-read from global memory (no shared-memory reuse)
   float Pvalue = 0;
   for (int k = 0; k < Width; ++k)
      Pvalue += M[Row*Width+k] * N[k*Width+Col];
   P[Row*Width+Col] = Pvalue;
}
Shared Memory and Threading
• Each SM in Maxwell has 64KB of shared memory (48KB max per block)
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
• Shared memory alone could support up to 64KB/2KB = 32 thread blocks actively executing
• With Maxwell's limit of 2,048 threads per SM, at most 2048/256 = 8 such blocks are active, allowing up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
– The next TILE_WIDTH, 32, would use 2*32*32*4B = 8KB of shared memory per thread block, allowing 8 thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
– The 150GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
Device Query
• Number of devices in the system:
int dev_count;
cudaGetDeviceCount(&dev_count);
• Capability of devices:
cudaDeviceProp dev_prop;
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&dev_prop, i);
    // decide if the device has sufficient resources and capabilities
}
• cudaDeviceProp is a built-in C structure type
– dev_prop.maxThreadsPerBlock
– dev_prop.sharedMemPerBlock
– …
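A self-contained sketch of the same pattern (added for illustration; the fields printed are real cudaDeviceProp members):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);
        printf("device %d: %s, %d threads/block, %zu bytes shared memory/block\n",
               i, dev_prop.name, dev_prop.maxThreadsPerBlock,
               dev_prop.sharedMemPerBlock);
    }
    return 0;
}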