HPC SUMMIT DIGITAL 2020

GPU EXPERTS PANEL:


AMPERE EXPLAINED
Moderator:
Stephen Jones, CUDA Architect

Panelists:
Timothy Costa, Product Manager, HPC Software
Carter Edwards, Principal Systems Software Engineer
Olivier Giroux, Distinguished Engineer
Michael Houston, Chief Architect, AI Systems
Christopher Lamb, Vice President, Compute Software
CJ Newburn, Principal Architect, HPC

2
HUGE BREADTH OF PLATFORMS, SYSTEMS, LANGUAGES

3
HUGE BREADTH OF PLATFORMS, SYSTEMS, LANGUAGES
Data Analytics
AI Training
Scientific Computing
Video Analytics
Cloud Gaming
Machine Learning
Graphics
5G Networks
Genomics
AI Inference
4
THE NVIDIA A100 GPU

5
THE NVIDIA AMPERE GPU ARCHITECTURE

6
NVIDIA SELENE
#7 on Top500 @ 27.58 Petaflops

7
THE A100 SM ARCHITECTURE

Third-generation Tensor Core

Faster and more efficient

Comprehensive data types

Sparsity acceleration

Asynchronous data movement and synchronization


Increased L1/SMEM capacity

8
THE A100 SM ARCHITECTURE

Peak FP64                  9.7 TFLOPS
Peak FP64 Tensor Core     19.5 TFLOPS
Peak FP32                 19.5 TFLOPS
Peak FP16                   78 TFLOPS
Peak BF16                   39 TFLOPS
Peak TF32 Tensor Core      156 TFLOPS | 312 TFLOPS
Peak FP16 Tensor Core      312 TFLOPS | 624 TFLOPS
Peak BF16 Tensor Core      312 TFLOPS | 624 TFLOPS
Peak INT8 Tensor Core      624 TOPS   | 1,248 TOPS
Peak INT4 Tensor Core    1,248 TOPS   | 2,496 TOPS

(The second figure in each pair is the effective rate with structured sparsity.)

9
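As a rough cross-check of the dense FP16 Tensor Core figure above (assuming the published 1410 MHz boost clock and 1024 FP16 Tensor Core FMAs per SM per clock, neither of which appears on this slide): 108 SMs x 1024 FMA x 2 FLOP/FMA x 1.41 GHz ≈ 312 TFLOPS.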
THE NVIDIA A100 GPU

                          V100          A100
SMs                       80            108
Tensor Core Precisions    FP16          FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per Block   96 kB         160 kB
L2 Cache Size             6144 kB       40960 kB
Memory Bandwidth          900 GB/sec    1555 GB/sec
NVLink Interconnect       300 GB/sec    600 GB/sec

10
THE NVIDIA A100 GPU

NVIDIA GA100 Key Architectural Features

Multi-Instance GPU

Advanced barriers

Asynchronous data movement

L2 cache management

Task graph acceleration

New Tensor Core precisions

11
CUDA 11.0 RELEASED

Hierarchy: Programming and running systems at every scale

Asynchrony: Creating concurrency at every level of the hierarchy

Latency: Overcoming Amdahl with lower overheads for memory & processing

Language: Supporting and evolving Standard Languages

12
CUDA PLATFORM: TARGETS EACH LEVEL OF THE HIERARCHY
The CUDA Platform Advances State Of The Art From Data Center To The GPU

System Scope: FABRIC MANAGEMENT, DATA CENTER OPERATIONS, DEPLOYMENT, MONITORING, COMPATIBILITY, SECURITY

Node Scope: GPU-DIRECT, NVLINK, LIBRARIES, UNIFIED MEMORY, ARM, MIG

Program Scope: CUDA C++, OPENACC, STANDARD LANGUAGES, SYNCHRONIZATION, PRECISION, TASK GRAPHS
13
NEW MULTI-INSTANCE GPU (MIG)
Divide a Single GPU Into Multiple Instances, Each With
Isolated Paths Through the Entire Memory System

[Diagram: up to seven users (USER0-USER6), each mapped to its own GPU instance with dedicated SMs, control and data crossbars, L2 cache slice, and DRAM.]

Up To 7 GPU Instances In a Single A100
Full software stack enabled on each instance, with
dedicated SM, memory, L2 cache & bandwidth

Simultaneous Workload Execution With Guaranteed Quality Of Service
All MIG instances run in parallel with predictable
throughput & latency, fault & error isolation

Diverse Deployment Environments
Supported with Bare metal, Docker, Kubernetes Pod,
Virtualized Environments
14
ASYNCHRONOUS BARRIERS

Single-Stage Barrier:
Produce Data -> Arrive -> Wait -> Consume Data
All threads block on the slowest arrival
Single-stage barriers combine back-to-back arrive & wait

Asynchronous Barrier:
Produce Data -> Arrive -> Independent Work (pipelined processing) -> Wait -> Consume Data
Asynchronous barriers enable pipelined processing
15
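A minimal CUDA C++ sketch of the split arrive/wait pattern using libcu++'s cuda::barrier (the kernel name, data, and the "independent work" placeholder are illustrative; the block is assumed to launch with 256 threads): each thread arrives, keeps doing unrelated work, and only blocks when it actually needs the data produced by the other threads.

#include <cuda/barrier>
#include <utility>   // std::move

__global__ void split_arrive_wait(const float* in, float* out)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ float staging[256];
    __shared__ barrier_t bar;

    if (threadIdx.x == 0)
        init(&bar, blockDim.x);        // expected arrival count = threads in the block
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Produce data.
    staging[threadIdx.x] = in[i] * 2.0f;

    // Arrive without blocking.
    barrier_t::arrival_token token = bar.arrive();

    // ... independent work that does not touch 'staging' can overlap here ...

    // Block only when the produced data is needed; all arrivals complete the phase.
    bar.wait(std::move(token));

    // Consume data produced by a neighbouring thread.
    out[i] = staging[(threadIdx.x + 1) % blockDim.x];
}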
ASYNC MEMCOPY: DIRECT TRANSFER INTO SHARED MEMORY
[Diagram: a pre-A100 SM next to an A100 SM, each showing threads, registers, L1 cache, shared memory, and HBM GPU memory]

Two-step copy to shared memory via registers (pre-A100):
1. Thread loads data from GPU memory into registers
2. Thread stores data into shared memory

Asynchronous direct copy to shared memory (A100):
1. Direct transfer into shared memory, bypassing thread resources

16
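A minimal sketch of this path using cooperative_groups::memcpy_async (the tile size and kernel are illustrative, the block is assumed to have one thread per tile element, and the input is assumed to hold full tiles): on A100 the copy lands in shared memory without staging through registers; on earlier GPUs it falls back to a software copy.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

__global__ void add_one(const float* global_in, float* global_out)
{
    __shared__ float tile[TILE];
    auto block = cg::this_thread_block();
    const int base = blockIdx.x * TILE;

    // 1. Issue the asynchronous global -> shared copy for the whole block.
    cg::memcpy_async(block, tile, global_in + base, sizeof(float) * TILE);

    // 2. Wait for the copy to land in shared memory, then consume it.
    cg::wait(block);
    global_out[base + block.thread_rank()] = tile[block.thread_rank()] + 1.0f;
}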
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: shared memory with one async copy in flight, tracked by a barrier]

Async copy initial element into shared memory
17
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: two async copies in flight, each tracked by its own barrier]

Async copy initial element into shared memory
Async copy next element into shared memory
18
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: double-buffered async copies into shared memory, each tracked by its own barrier]

Async copy initial element into shared memory
1. Async copy next element into shared memory
2. Threads synchronize with current async copy
3. Compute using shared memory data
4. Repeat for next element
19
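Putting the steps above together, here is a hedged double-buffered sketch using cuda::memcpy_async with one cuda::barrier per buffer (the tile size, names, and trivial compute are illustrative; the block is assumed to have TILE threads and the input to hold num_tiles full tiles).

#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

__global__ void double_buffered(const float* in, float* out, int num_tiles)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ float buf[2][TILE];
    __shared__ barrier_t bar[2];             // one barrier per buffer/stage
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0) {
        init(&bar[0], block.size());
        init(&bar[1], block.size());
    }
    block.sync();

    // Prologue: async-copy the initial tile; completion is tracked by bar[0].
    cuda::memcpy_async(block, buf[0], in, sizeof(float) * TILE, bar[0]);

    for (int t = 0; t < num_tiles; ++t) {
        if (t + 1 < num_tiles) {
            // 1. Async-copy the next tile into the other buffer while this one is used.
            cuda::memcpy_async(block, buf[(t + 1) % 2], in + (t + 1) * TILE,
                               sizeof(float) * TILE, bar[(t + 1) % 2]);
        }
        // 2. Synchronize with the async copy for the current tile.
        bar[t % 2].arrive_and_wait();
        // 3. Compute using the shared-memory data.
        out[t * TILE + block.thread_rank()] = buf[t % 2][block.thread_rank()] * 2.0f;
        // 4. Make sure everyone is done reading before the buffer is reused, then repeat.
        block.sync();
    }
}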
MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL
              Shared Memory   L2 Cache   GPU Memory
Latency       1x              5x         15x
Bandwidth     13x             3x         1x

[Diagram: SMs with L1/shared memory, the shared L2 cache, and HBM GPU memory]

20
MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL
              Shared Memory   L2 Cache   GPU Memory
Latency       1x              5x         15x
Bandwidth     13x             3x         1x

L2 Cache Residency Control

Specify address range up to 128MB for persistent caching

Normal & streaming accesses cannot evict persistent data

Load/store from range persists in L2 even between kernel launches

Normal accesses can still use entire cache if no persistent data is present

21
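A hedged sketch of marking a frequently reused buffer as persisting in L2 via the CUDA runtime's access-policy window (the helper name, carve-out size, and the d_table buffer are illustrative):

#include <cuda_runtime.h>

// Hypothetical buffer that many kernels launched into 'stream' will reread.
void enable_persistent_window(cudaStream_t stream, float* d_table, size_t table_bytes)
{
    // Optionally set aside part of L2 for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 16 * 1024 * 1024);

    // Mark the address range (up to 128 MB on A100) as persisting for this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_table;
    attr.accessPolicyWindow.num_bytes = table_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction of the window treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // keep hits resident in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // other data streams through
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // Kernels launched into 'stream' now tend to keep loads/stores within the
    // window resident in L2, even across kernel launches.
}

// When the phase of reuse is over, release the carve-out for other work:
//   cudaCtxResetPersistingL2Cache();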
LATENCIES & OVERHEADS: GRAPHS vs. STREAMS
Empty Kernel Launches – Investigating System Overheads

Note: Empty kernel launches – timings show reduction in latency only

22
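A hedged sketch of the graph path being compared here: capture a repeated stream sequence into a CUDA graph once, then relaunch the instantiated graph so the per-kernel CPU launch overhead is paid only once (my_kernel, the launch shape, and the iteration count are illustrative).

#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* illustrative (empty) kernel */ }

void launch_as_graph(cudaStream_t stream, float* d_data)
{
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Capture 20 back-to-back launches into a graph; nothing executes during capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 20; ++i)
        my_kernel<<<256, 256, 0, stream>>>(d_data);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // One CPU-side submission now replays the whole sequence.
    cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}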
THIRD GENERATION TENSOR CORES
NVIDIA V100 vs NVIDIA A100

Peak FP64 Tensor Core FLOP/s = 19.5 TFLOPS

Peak TF32 Tensor Core FLOP/s = 156 TFLOPS (312 TFLOPS for sparse input data)
23
FLOATING POINT FORMATS & PRECISION

sign 11-bit exponent 52-bit mantissa


double
8-bit 23-bit
float
5-bit 10-bit
half
8-bit 7-bit
bfloat16
8-bit 10-bit
TF32

Numerical Range Numerical Precision

value = (-1)sign x 2exponent x (1 + mantissa)

24
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0

cuBLAS: BF16, TF32 and FP64 Tensor Cores
cuSPARSE: Increased memory BW, Shared Memory & L2
cuTENSOR: BF16, TF32 and FP64 Tensor Cores
cuSOLVER: BF16, TF32 and FP64 Tensor Cores

nvJPEG: Hardware Decoder
cuFFT: Increased memory BW, Shared Memory & L2
CUDA Math API: BF16 & TF32 Support
CUTLASS: BF16, TF32 and FP64 Tensor Cores

For more information see: S21681 - How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU

25
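A hedged sketch of opting an existing FP32 GEMM into TF32 Tensor Cores with cuBLAS (the helper name, dimensions, and device pointers are illustrative; buffers are assumed column-major and already on the device):

#include <cublas_v2.h>

void gemm_tf32(int m, int n, int k, const float* d_A, const float* d_B, float* d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allow cuBLAS to route eligible FP32 routines through TF32 Tensor Cores.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; inputs are rounded to TF32, accumulation stays in FP32.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);

    cublasDestroy(handle);
}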
TENSOR CORE ACCELERATED LIBRARIES

Performance of LU factorization with different precisions


44
FP16 getrf
40
BF16 getrf
36 TF32 getrf
FP64 getrf
32

28
T f l o p /s

24 3.5x
20

16

12

0
2k 4k 6k 8k 10k 14k 18k 22k 26k 30k 34k 40k
Matrix size
CUDA 11.0, A100 GPU
26
libcu++ : THE CUDA C++ STANDARD LIBRARY

ISO C++ == Language + Standard Library


CUDA C++ == Language + libcu++

Strictly conforming to ISO C++, plus conforming extensions

Opt-in, Heterogeneous, Incremental

27
cuda::std::

Opt-in: Does not interfere with or replace your host standard library

Heterogeneous: Copyable/Movable objects can migrate between host & device
Host & Device can call all member functions
Host & Device can concurrently use synchronization primitives*

Incremental: A subset of the standard library today
Each release adds more functionality

*Synchronization primitives must be in managed memory and be declared with cuda::thread_scope_system


28
libcu++ NAMESPACE HIERARCHY

// ISO C++, __host__ only
#include <atomic>
std::atomic<int> x;

// CUDA C++, __host__ __device__
// Strictly conforming to the ISO C++
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, __host__ __device__
// Conforming extensions to ISO C++
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

For more information see: S21262 - The CUDA C++ Standard Library

29
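A hedged usage sketch for the scoped atomics above (the kernel, counter placement, and launch shape are illustrative): a device-scoped cuda::atomic counter updated from a kernel and read from the host after synchronization. Placing the counter in managed memory is what also allows concurrent host/device use with a system-scoped atomic, per the footnote on the previous slide.

#include <cuda/atomic>
#include <new>

using counter_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void count_positive(const int* values, int n, counter_t* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0)
        counter->fetch_add(1, cuda::std::memory_order_relaxed);
}

int run_count(const int* d_values, int n)
{
    counter_t* counter;
    cudaMallocManaged(&counter, sizeof(counter_t));
    new (counter) counter_t(0);              // construct the atomic in managed memory

    count_positive<<<(n + 255) / 256, 256>>>(d_values, n, counter);
    cudaDeviceSynchronize();                 // device-scope atomic: read only after sync

    int total = counter->load(cuda::std::memory_order_relaxed);
    cudaFree(counter);
    return total;
}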
