Performance (Memory) Optimization: National Tsing-Hua University 2017, Summer Semester
Optimization
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory
Group transfers
One large transfer much better than many small ones
Asynchronous data transfer
Overlap communication and computation time
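The three guidelines above can be sketched as follows (a minimal sketch: the stream count, buffer names, and kernel `process` are illustrative assumptions; pinned host memory is required for truly asynchronous copies):

```cuda
#define N_STREAMS 3

float *h_in, *h_out, *d_in, *d_out;
cudaMallocHost(&h_in,  N * sizeof(float));   // pinned host buffers
cudaMallocHost(&h_out, N * sizeof(float));
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

cudaStream_t streams[N_STREAMS];
for (int i = 0; i < N_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

size_t chunk = N / N_STREAMS;
for (int i = 0; i < N_STREAMS; ++i) {
    size_t off = i * chunk;
    // The H2D copy, kernel, and D2H copy of chunk i all go to stream i,
    // so chunk i+1's transfer overlaps with chunk i's computation.
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<chunk / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();
```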
[Figure: 3-way concurrency using 3 streams: host-to-device copies (H1, H2, H3), kernels (K1, K2, K3), and device-to-host copies (D1, D2, D3) overlap across the CPU, the PCI-E bus, and the GPU]
Context based
cudaThreadSynchronize()
Blocks host until all issued CUDA calls from a CPU thread complete
Stream based
cudaStreamSynchronize(stream-id)
Blocks host until all CUDA calls in stream stream-id complete
cudaStreamQuery(stream-id)
Indicates whether all operations in the stream have completed
Returns cudaSuccess, cudaErrorNotReady
Does not block CPU thread
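A minimal sketch of the two call styles, assuming a stream `stream1` created earlier (note that cudaThreadSynchronize() has since been deprecated in favor of the equivalent cudaDeviceSynchronize()):

```cuda
// Context-wide sync: wait for everything the host has issued.
cudaDeviceSynchronize();          // modern name for cudaThreadSynchronize()

// Stream sync: wait only for work queued in one stream.
cudaStreamSynchronize(stream1);

// Non-blocking poll: is stream1 finished yet?
if (cudaStreamQuery(stream1) == cudaSuccess) {
    // all work in stream1 has completed
} else {
    // cudaErrorNotReady: do useful CPU work and poll again later
}
```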
NTHU LSA Lab 19
GPU/CPU Synchronization by Events
cudaEventRecord (event, stream-id )
Inserts an event into a stream
Event is recorded when GPU reaches it in a stream
Record = assigned a timestamp (GPU clock tick)
Useful for timing
cudaEventSynchronize (event)
Blocks CPU thread until event is recorded
cudaEventQuery (event)
Indicates whether event has recorded
Returns cudaSuccess, cudaErrorNotReady
Does not block CPU thread
cudaStreamWaitEvent (stream-id, event)
Block a GPU stream until event reports completion
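The "useful for timing" point above can be sketched like this (grid/block sizes, `kernel`, and its arguments are illustrative assumptions):

```cuda
// Time a kernel with a pair of events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream1);        // timestamped when the GPU reaches it
kernel<<<grid, block, 0, stream1>>>(d_in, d_out);
cudaEventRecord(stop, stream1);

cudaEventSynchronize(stop);             // block CPU until `stop` is recorded
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```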
Example: Explicit Sync
cudaEvent_t event;
cudaEventCreate(&event);                   // create event
// 1) H2D copy of new input
cudaMemcpyAsync(d_in, in, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(event, stream1);           // record event
// 2) D2H copy of previous result
cudaMemcpyAsync(out, d_out, size, cudaMemcpyDeviceToHost, stream2);
// wait for event in stream1
cudaStreamWaitEvent(stream2, event, 0);
// 3) must wait for 1) and 2)
kernel<<<grid, block, 0, stream2>>>(d_in, d_out);
asynchronousCPUmethod( … );                // CPU work overlaps with the GPU
Bank Conflict
Example: No Bank Conflict
[Figure: two conflict-free shared-memory access patterns for 8 threads and 8 banks: linear addressing (Thread i reads Bank i) and a random 1:1 permutation; in both cases each thread hits a distinct bank, so no conflicts occur]
Memory padding
Add additional memory space to avoid bank conflicts
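A common sketch of this trick: pad each row of a shared-memory tile by one element so that column accesses from a warp hit distinct banks. The transpose kernel below is an illustrative assumption, not taken from the slides:

```cuda
#define TILE 32

__global__ void transpose(float *out, const float *in, int n) {
    // The extra +1 column skews each row by one bank, so reading a
    // column (tile[threadIdx.x][i]) no longer maps all threads of a
    // warp to the same bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

Without the padding, the column read in the last line would make all 32 threads of a warp access the same bank, serializing the accesses.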
AN EXAMPLE OF CUDA
30x Speedup!
[Figure: execution breakdown of the speedup: thread T1 shown with 4 warps vs. 2 warps vs. 1 warp, and 4 reads vs. 2 reads per thread]
Highly divergent memory access locations have an effect similar to random reads
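The contrast above can be sketched with two kernels (illustrative names; the index array `idx` is an assumption standing in for any divergent access pattern):

```cuda
__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // thread i reads word i: consecutive addresses
                             // coalesce into a few transactions per warp
}

__global__ void divergent(float *out, const float *in, const int *idx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[idx[i]];     // gather through arbitrary indices: in the worst
                             // case, one separate transaction per thread
}
```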