HPC Summit Digital 2020: GPU Experts Panel: Ampere Explained

Stephen Jones, CUDA Architect
Timothy Costa, Product Manager, HPC Software
Carter Edwards, Principal Systems Software Engineer
Olivier Giroux, Distinguished Engineer
Michael Houston, Chief Architect, AI Systems
Christopher Lamb, Vice President, Compute Software
CJ Newburn, Principal Architect, HPC
HUGE BREADTH OF PLATFORMS, SYSTEMS, LANGUAGES
- Data Analytics
- AI Training
- Cloud Gaming
- Machine Learning
- Graphics
- 5G Networks
- Genomics
- AI Inference
THE NVIDIA A100 GPU
THE NVIDIA AMPERE GPU ARCHITECTURE
NVIDIA SELENE
#7 on Top500 @ 27.58 Petaflops
THE A100 SM ARCHITECTURE
Sparsity acceleration
THE NVIDIA A100 GPU
                          V100      A100
SMs                       80        108
Tensor Core Precision     FP16      FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per Block   96 kB     160 kB
L2 Cache Size             6144 kB   40960 kB
THE NVIDIA A100 GPU
- Multi-Instance GPU
- Advanced barriers
- L2 cache management
CUDA 11.0 RELEASED
CUDA PLATFORM: TARGETS EACH LEVEL OF THE HIERARCHY
The CUDA Platform Advances State Of The Art From Data Center To The GPU
MULTI-INSTANCE GPU
[Diagram: one A100 partitioned into seven GPU instances (0-6), one per user, each with its own control pipe, data crossbar, L2 cache, and DRAM]
- Up to 7 GPU instances in a single A100
- Full software stack enabled on each instance, with dedicated SMs, memory, L2 cache & bandwidth
- Simultaneous workload execution with guaranteed quality of service
- All MIG instances run in parallel with predictable throughput & latency, fault & error isolation
- Supported with bare metal, Docker, Kubernetes Pod, virtualized environments
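Because each MIG instance is exposed to a CUDA process as an ordinary device, a plain device-enumeration loop is enough to see the per-instance SM and memory split. A minimal host-side sketch (the reported names and counts depend on the MIG configuration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // With MIG enabled, each visible GPU instance enumerates as a
    // separate CUDA device with its own SM count and memory size.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s | %d SMs | %zu MiB\n",
               i, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem >> 20);
    }
    return 0;
}
```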
ASYNCHRONOUS BARRIERS
[Diagram: a single-stage barrier vs. a split arrive/wait barrier across two groups of threads]
- Single-stage barrier: all threads block on the slowest arrival
- Asynchronous barrier: threads arrive, continue independent work, and wait only when the results are needed, enabling pipelined work processing
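A minimal sketch of the split arrive/wait pattern using cuda::barrier from libcu++ (CUDA 11, compute capability 7.0+); the producer work and the independent work below are placeholders:

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void pipelined(float* data) {
    auto block = cg::this_thread_block();

    // Block-scoped barrier shared by all threads in the block.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // expect one arrival per thread
    }
    block.sync();

    // Placeholder producer work.
    data[blockIdx.x * blockDim.x + block.thread_rank()] *= 2.0f;

    // Arrive without blocking; the token records this phase.
    auto token = bar.arrive();

    // ... independent work that does not depend on `data` goes here ...

    // Block only at the point the produced data is actually needed.
    bar.wait(cuda::std::move(token));
}
```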
ASYNCHRONOUS DIRECT DATA MOVEMENT
[Diagram: a two-step copy staged through registers vs. an asynchronous direct copy of the initial element into shared memory, tracked by a barrier]
- Two-step copy to shared memory via registers: a thread loads data from GPU memory into registers, then stores it into shared memory
- Asynchronous direct copy: direct transfer into shared memory, bypassing thread resources
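A minimal sketch of the direct copy path using the cooperative groups memcpy_async API from CUDA 11; on A100 the transfer goes straight from global to shared memory without staging through registers. The tile size here is an arbitrary illustration:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per block (illustrative)

__global__ void scale(const float* in, float* out, float factor) {
    auto block = cg::this_thread_block();
    __shared__ float tile[TILE];

    // Asynchronous direct copy: global -> shared without staging
    // through per-thread registers (hardware-accelerated on A100).
    cg::memcpy_async(block, tile, in + blockIdx.x * TILE,
                     sizeof(float) * TILE);
    cg::wait(block);  // block until the copy has landed in shared memory

    int i = block.thread_rank();
    out[blockIdx.x * TILE + i] = tile[i] * factor;
}
```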
MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL

            Shared Memory   L2 Cache   GPU Memory
Latency     1x              5x         15x
Bandwidth   13x             3x         1x

L2 Cache Residency Control
- Specify an address range of up to 128 MB for persistent caching
- Normal & streaming accesses cannot evict persistent data
- Loads/stores from the range persist in L2, even between kernel launches
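In the CUDA 11 runtime this is exposed as an access policy window on a stream; a sketch, assuming `ptr`, `num_bytes`, and `stream` describe a frequently reused device buffer:

```cuda
#include <cuda_runtime.h>

// Mark a device buffer as persisting in L2 for kernels on `stream`.
void make_persistent(cudaStream_t stream, void* ptr, size_t num_bytes) {
    // Reserve a portion of L2 for persisting accesses (device-wide knob).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, num_bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = num_bytes;  // capped at the 128 MB window
    attr.accessPolicyWindow.hitRatio  = 1.0f;       // cache the whole window
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream,
                           cudaStreamAttributeAccessPolicyWindow, &attr);
}

// When the phase ends: drop the window and release persisting lines.
void release_persistent(cudaStream_t stream) {
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.num_bytes = 0;  // disable the window
    cudaStreamSetAttribute(stream,
                           cudaStreamAttributeAccessPolicyWindow, &attr);
    cudaCtxResetPersistingL2Cache();
}
```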
LATENCIES & OVERHEADS: GRAPHS vs. STREAMS
[Chart: empty kernel launch times, investigating system overheads of graphs vs. streams]
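Stream capture turns such a launch sequence into a CUDA graph, so the per-launch overhead is paid once at instantiation; a minimal sketch with a placeholder empty kernel:

```cuda
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a burst of launches into a graph instead of paying
    // the per-launch overhead on every iteration.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 100; ++i) {
        empty_kernel<<<1, 1, 0, stream>>>();
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replaying the instantiated graph launches all 100 kernels
    // with a single host-side call.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```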
THIRD GENERATION TENSOR CORES
NVIDIA V100 vs NVIDIA A100
FLOATING POINT FORMATS & PRECISION
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0
For more information see: S21681 - How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU
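As one example of the new library paths, cuBLAS in CUDA 11.0 can route an ordinary FP32 GEMM through the TF32 tensor cores with a single math-mode switch; a sketch, assuming column-major device buffers dA, dB, dC set up by the caller:

```cuda
#include <cublas_v2.h>

// C = A * B in FP32, letting cuBLAS use TF32 tensor cores internally.
// dA is m x k, dB is k x n, dC is m x n, all column-major.
void gemm_tf32(int m, int n, int k,
               const float* dA, const float* dB, float* dC) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Opt in to TF32: FP32 inputs/outputs, tensor core math inside.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cublasDestroy(handle);
}
```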
TENSOR CORE ACCELERATED LIBRARIES
[Chart: library throughput in Tflop/s vs. matrix size (2k-40k), showing up to a 3.5x speedup; CUDA 11.0, A100 GPU]
libcu++ : THE CUDA C++ STANDARD LIBRARY
cuda::std::
Opt-in: does not interfere with or replace your host standard library
For more information see: S21262 - The CUDA C++ Standard Library
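A minimal sketch of the opt-in model: the same interfaces as std::, included from cuda/std/, usable from device code (the counter kernel is illustrative):

```cuda
#include <cstdio>
#include <cuda/std/atomic>
#include <cuda_runtime.h>

// cuda::std::atomic has a constexpr constructor, so it can live in
// statically initialized device memory.
__device__ cuda::std::atomic<int> counter{0};

__global__ void increment() {
    // Same interface and semantics as std::atomic, in device code.
    counter.fetch_add(1, cuda::std::memory_order_relaxed);
}

int main() {
    increment<<<4, 256>>>();
    cudaDeviceSynchronize();

    int result = 0;  // atomic<int> wraps a single int we can copy out
    cudaMemcpyFromSymbol(&result, counter, sizeof(int));
    printf("count = %d\n", result);  // expect 4 * 256 = 1024
    return 0;
}
```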