Optimizing Parallel Reduction in CUDA
Mark Harris
NVIDIA Developer Technology
Parallel Reduction
Common and important data parallel primitive
Easy to implement in CUDA
Harder to get it right
Parallel Reduction
Tree-based approach used within each thread block
[Figure: tree-based reduction of the values 3 1 7 0 4 1 6 3: pairwise sums 4 7 5 9, then 11 14, then the final result 25.]
[Figure: kernel decomposition across blocks. Level 0: 8 blocks each produce a partial result; Level 1: 1 block reduces the 8 partial results.]
[Figure: Kernel 1, interleaved addressing with divergent branching. At each step the stride doubles (1, 2, 4, 8), and the threads whose IDs are multiples of twice the stride add in the element one stride away; the partial sums build up to the final result 41.]
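A minimal sketch of this first kernel, assuming int data, one element per thread, and a power-of-2 block size (the kernel name reduce0 is illustrative):

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];                  // one element per thread
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];                        // load one element from global memory
    __syncthreads();
    // interleaved addressing: stride doubles each step
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {                     // highly divergent branch within each warp
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // write this block's partial sum
}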
Kernel                                                Time       Bandwidth
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
[Figure: Kernel 2, interleaved addressing with strided indexing. The stride again doubles each step (1, 2, 4, 8), but the active threads are now consecutive, so no warp diverges; the partial sums again build up to 41.]
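Kernel 2 removes the divergence by replacing the divergent branch in the inner loop with strided indexing, so the active threads are consecutive. A sketch of the replacement loop, using the same sdata and tid as above; the strided shared-memory access is what introduces the bank conflicts named in the table below:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;                // consecutive threads stay active: no divergence
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];   // strided access: shared memory bank conflicts
    }
    __syncthreads();
}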
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
[Figure: Kernel 3, sequential addressing. The stride halves each step (8, 4, 2, 1), and the first stride-many threads each add in the element one stride away; accesses are conflict-free and the partial sums build up to 41.]
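Kernel 3 replaces the strided indexing with a reversed loop and thread-ID-based indexing, giving conflict-free sequential addressing. A sketch of the inner loop (the same loop examined under Idle Threads below):

for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];   // consecutive threads touch consecutive addresses
    }
    __syncthreads();
}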
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
3: sequential addressing                              1.722 ms   9.741 GB/s    2.01x          4.68x
Idle Threads
Problem: half of the threads are idle on the first loop iteration:

for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {                      // only threads with tid < s do any work
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

On the first iteration s = blockDim.x/2, so half the block does nothing. This is wasteful.
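The fix, Kernel 4 in the tables below, halves the number of blocks and replaces the single load with two loads and the first add of the reduction. A minimal sketch of the load phase; variable names follow the loop above, with g_idata as the input array:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
// perform the first level of the reduction while reading from global memory
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();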
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
3: sequential addressing                              1.722 ms   9.741 GB/s    2.01x          4.68x
4: first add during global load                       0.965 ms   17.377 GB/s   1.78x          8.34x
Instruction Bottleneck
At 17 GB/s, we're far from bandwidth bound, and we know reduction has low arithmetic intensity.
A likely bottleneck is therefore instruction overhead: ancillary instructions such as address arithmetic and loop overhead. Strategy: unroll loops.
Unrolling the Last Warp
As the reduction proceeds, the number of active threads decreases; when s <= 32, only one warp is left. Instructions are SIMD-synchronous within a warp, so the last six iterations can be unrolled without __syncthreads() or the if (tid < s) test.
Note: this saves useless work in all warps, not just the last one! Without unrolling, all warps execute every iteration of the for loop and if statement.
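A sketch of the unrolled last warp, assuming a block size of at least 64; the helper name warpReduce is illustrative. The volatile qualifier keeps the compiler from caching shared-memory operands in registers (and on modern GPUs, which no longer guarantee warp-synchronous execution, __syncwarp() would be required between the adds):

__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    // no __syncthreads() needed: the single remaining warp executes in lockstep
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

// in the kernel: stop the shared-memory loop early and let one warp finish
for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);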
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
3: sequential addressing                              1.722 ms   9.741 GB/s    2.01x          4.68x
4: first add during global load                       0.965 ms   17.377 GB/s   1.78x          8.34x
5: unroll last warp                                   0.536 ms   31.289 GB/s   1.8x           15.01x
Complete Unrolling
If we knew the number of iterations at compile time, we could completely unroll the reduction.
Luckily, the block size is limited by the GPU to 512 threads, and we are sticking to power-of-2 block sizes, so there are only a handful of possible iteration counts. C++ template parameters let the block size become a compile-time constant, as sketched below.
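A sketch of a fully unrolled kernel with the block size as a template parameter, reusing the warpReduce helper from above and assuming blockSize >= 64; the kernel name reduce6 and the int element type are illustrative. Every comparison against blockSize is resolved at compile time, so the dead branches disappear:

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = g_idata[i] + g_idata[i + blockSize];   // first add during load
    __syncthreads();
    // resolved at compile time: only the branches for this blockSize survive
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);               // unrolled last warp, as above
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Since the block size is only known at run time, the host can dispatch through a switch, e.g. case 256: reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;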
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
3: sequential addressing                              1.722 ms   9.741 GB/s    2.01x          4.68x
4: first add during global load                       0.965 ms   17.377 GB/s   1.78x          8.34x
5: unroll last warp                                   0.536 ms   31.289 GB/s   1.8x           15.01x
6: completely unrolled                                0.381 ms   43.996 GB/s   1.41x          21.16x
Algorithm Cascading
Combine sequential and parallel reduction: each thread loads and sums multiple elements into shared memory, then the block finishes with a tree-based reduction in shared memory (see the sketch below).
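A sketch of the cascaded load replacing the two-element load of Kernel 4, assuming the templated blockSize from the previous sketch; n is the total element count and gridSize is the number of elements the whole grid consumes per loop trip (both names are illustrative):

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockSize * 2) + tid;
unsigned int gridSize = blockSize * 2 * gridDim.x;
sdata[tid] = 0;
// each thread sequentially sums many elements before the tree reduction
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i + blockSize];
    i += gridSize;
}
__syncthreads();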
Kernel                                                Time       Bandwidth     Step Speedup   Cumulative Speedup
1: interleaved addressing with divergent branching    8.054 ms   2.083 GB/s
2: interleaved addressing with bank conflicts         3.456 ms   4.854 GB/s    2.33x          2.33x
3: sequential addressing                              1.722 ms   9.741 GB/s    2.01x          4.68x
4: first add during global load                       0.965 ms   17.377 GB/s   1.78x          8.34x
5: unroll last warp                                   0.536 ms   31.289 GB/s   1.8x           15.01x
6: completely unrolled                                0.381 ms   43.996 GB/s   1.41x          21.16x
7: multiple elements per thread                       0.268 ms   62.671 GB/s   1.42x          30.04x
Performance Comparison
[Chart: Time (ms, log scale) versus # Elements for the kernel versions, including 1: Interleaved Addressing: Divergent Branches, 2: Interleaved Addressing: Bank Conflicts, and 3: Sequential Addressing.]
Types of optimization
Interesting observation:
Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined!
Code optimizations (loop unrolling): 2.54x speedup, combined.
Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding