HPC SUMMIT DIGITAL 2020

GPU EXPERTS PANEL:


AMPERE EXPLAINED
Moderator:
Stephen Jones, CUDA Architect

Panelists:
Timothy Costa, Product Manager, HPC Software
Carter Edwards, Principal Systems Software Engineer
Olivier Giroux, Distinguished Engineer
Michael Houston, Chief Architect, AI Systems
Christopher Lamb, Vice President, Compute Software
CJ Newburn, Principal Architect, HPC

2
HUGE BREADTH OF PLATFORMS, SYSTEMS, LANGUAGES

3
HUGE BREADTH OF PLATFORMS, SYSTEMS, LANGUAGES
Data Analytics
AI Training
Scientific Computing
Video Analytics
Cloud Gaming
Machine Learning
Graphics
5G Networks
Genomics
AI Inference
4
THE NVIDIA A100 GPU

5
THE NVIDIA AMPERE GPU ARCHITECTURE

6
NVIDIA SELENE
#7 on Top500 @ 27.58 Petaflops

7
THE A100 SM ARCHITECTURE

Third-generation Tensor Core

Faster and more efficient

Comprehensive data types

Sparsity acceleration

Asynchronous data movement and synchronization


Increased L1/SMEM capacity

8
THE A100 SM ARCHITECTURE

Peak FP64                  9.7 TFLOPS
Peak FP64 Tensor Core     19.5 TFLOPS
Peak FP32                 19.5 TFLOPS
Peak FP16                   78 TFLOPS
Peak BF16                   39 TFLOPS
Peak TF32 Tensor Core      156 TFLOPS | 312 TFLOPS
Peak FP16 Tensor Core      312 TFLOPS | 624 TFLOPS
Peak BF16 Tensor Core      312 TFLOPS | 624 TFLOPS
Peak INT8 Tensor Core      624 TOPS   | 1,248 TOPS
Peak INT4 Tensor Core    1,248 TOPS   | 2,496 TOPS

(The second figure in each pair is the effective rate with structured sparsity.)

9
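As a rough cross-check of the dense FP16 Tensor Core figure above (assuming the published 1410 MHz boost clock and 1024 FP16 Tensor Core FMAs per SM per clock, neither of which appears on this slide): 108 SMs x 1024 FMA x 2 FLOP/FMA x 1.41 GHz ≈ 312 TFLOPS.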
THE NVIDIA A100 GPU

                          V100          A100
SMs                       80            108
Tensor Core Precisions    FP16          FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per Block   96 kB         160 kB
L2 Cache Size             6144 kB       40960 kB
Memory Bandwidth          900 GB/sec    1555 GB/sec
NVLink Interconnect       300 GB/sec    600 GB/sec

10
THE NVIDIA A100 GPU

NVIDIA GA100 Key Architectural Features

Multi-Instance GPU

Advanced barriers

Asynchronous data movement

L2 cache management

Task graph acceleration

New Tensor Core precisions

11
CUDA 11.0 RELEASED

Hierarchy: Programming and running systems at every scale

Asynchrony: Creating concurrency at every level of the hierarchy

Latency: Overcoming Amdahl with lower overheads for memory & processing

Language: Supporting and evolving Standard Languages

12
CUDA PLATFORM: TARGETS EACH LEVEL OF THE HIERARCHY
The CUDA Platform Advances State Of The Art From Data Center To The GPU

System Scope: FABRIC MANAGEMENT, DATA CENTER OPERATIONS, DEPLOYMENT, MONITORING, COMPATIBILITY, SECURITY

Node Scope: GPU-DIRECT, NVLINK, LIBRARIES, UNIFIED MEMORY, ARM, MIG

Program Scope: CUDA C++, OPENACC, STANDARD LANGUAGES, SYNCHRONIZATION, PRECISION, TASK GRAPHS
13
NEW MULTI-INSTANCE GPU (MIG)
Divide a Single GPU Into Multiple Instances, Each With
Isolated Paths Through the Entire Memory System

[Diagram: up to seven users (USER0-USER6), each mapped to its own GPU instance with dedicated SMs, control and data crossbars, L2 cache slice, and DRAM.]

Up To 7 GPU Instances In a Single A100
Full software stack enabled on each instance, with
dedicated SM, memory, L2 cache & bandwidth

Simultaneous Workload Execution With Guaranteed Quality Of Service
All MIG instances run in parallel with predictable
throughput & latency, fault & error isolation

Diverse Deployment Environments
Supported with Bare metal, Docker, Kubernetes Pod,
Virtualized Environments
14
ASYNCHRONOUS BARRIERS

Single-Stage Barrier:
Produce Data -> Arrive -> Wait -> Consume Data
All threads block on the slowest arrival
Single-stage barriers combine back-to-back arrive & wait

Asynchronous Barrier:
Produce Data -> Arrive -> Independent Work (pipelined processing) -> Wait -> Consume Data
Asynchronous barriers enable pipelined processing
15
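A minimal CUDA C++ sketch of the split arrive/wait pattern using libcu++'s cuda::barrier (the kernel name, data, and the "independent work" placeholder are illustrative; the block is assumed to launch with 256 threads): each thread arrives, keeps doing unrelated work, and only blocks when it actually needs the data produced by the other threads.

#include <cuda/barrier>
#include <utility>   // std::move

__global__ void split_arrive_wait(const float* in, float* out)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ float staging[256];
    __shared__ barrier_t bar;

    if (threadIdx.x == 0)
        init(&bar, blockDim.x);        // expected arrival count = threads in the block
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Produce data.
    staging[threadIdx.x] = in[i] * 2.0f;

    // Arrive without blocking.
    barrier_t::arrival_token token = bar.arrive();

    // ... independent work that does not touch 'staging' can overlap here ...

    // Block only when the produced data is needed; all arrivals complete the phase.
    bar.wait(std::move(token));

    // Consume data produced by a neighbouring thread.
    out[i] = staging[(threadIdx.x + 1) % blockDim.x];
}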
ASYNC MEMCOPY: DIRECT TRANSFER INTO SHARED MEMORY
[Diagram: a pre-A100 SM next to an A100 SM, each showing threads, registers, L1 cache, shared memory, and HBM GPU memory]

Two-step copy to shared memory via registers (pre-A100):
1. Thread loads data from GPU memory into registers
2. Thread stores data into shared memory

Asynchronous direct copy to shared memory (A100):
1. Direct transfer into shared memory, bypassing thread resources

16
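A minimal sketch of this path using cooperative_groups::memcpy_async (the tile size and kernel are illustrative, the block is assumed to have one thread per tile element, and the input is assumed to hold full tiles): on A100 the copy lands in shared memory without staging through registers; on earlier GPUs it falls back to a software copy.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

__global__ void add_one(const float* global_in, float* global_out)
{
    __shared__ float tile[TILE];
    auto block = cg::this_thread_block();
    const int base = blockIdx.x * TILE;

    // 1. Issue the asynchronous global -> shared copy for the whole block.
    cg::memcpy_async(block, tile, global_in + base, sizeof(float) * TILE);

    // 2. Wait for the copy to land in shared memory, then consume it.
    cg::wait(block);
    global_out[base + block.thread_rank()] = tile[block.thread_rank()] + 1.0f;
}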
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: shared memory with one async copy in flight, tracked by a barrier]

Async copy initial element into shared memory
17
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: two async copies in flight, each tracked by its own barrier]

Async copy initial element into shared memory
Async copy next element into shared memory
18
ASYNCHRONOUS DIRECT DATA MOVEMENT

[Diagram: double-buffered async copies into shared memory, each tracked by its own barrier]

Async copy initial element into shared memory
1. Async copy next element into shared memory
2. Threads synchronize with current async copy
3. Compute using shared memory data
4. Repeat for next element
19
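Putting the steps above together, here is a hedged double-buffered sketch using cuda::memcpy_async with one cuda::barrier per buffer (the tile size, names, and trivial compute are illustrative; the block is assumed to have TILE threads and the input to hold num_tiles full tiles).

#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

__global__ void double_buffered(const float* in, float* out, int num_tiles)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ float buf[2][TILE];
    __shared__ barrier_t bar[2];             // one barrier per buffer/stage
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0) {
        init(&bar[0], block.size());
        init(&bar[1], block.size());
    }
    block.sync();

    // Prologue: async-copy the initial tile; completion is tracked by bar[0].
    cuda::memcpy_async(block, buf[0], in, sizeof(float) * TILE, bar[0]);

    for (int t = 0; t < num_tiles; ++t) {
        if (t + 1 < num_tiles) {
            // 1. Async-copy the next tile into the other buffer while this one is used.
            cuda::memcpy_async(block, buf[(t + 1) % 2], in + (t + 1) * TILE,
                               sizeof(float) * TILE, bar[(t + 1) % 2]);
        }
        // 2. Synchronize with the async copy for the current tile.
        bar[t % 2].arrive_and_wait();
        // 3. Compute using the shared-memory data.
        out[t * TILE + block.thread_rank()] = buf[t % 2][block.thread_rank()] * 2.0f;
        // 4. Make sure everyone is done reading before the buffer is reused, then repeat.
        block.sync();
    }
}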
MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL
              Shared Memory   L2 Cache   GPU Memory
Latency       1x              5x         15x
Bandwidth     13x             3x         1x

[Diagram: SMs with L1/shared memory, the shared L2 cache, and HBM GPU memory]

20
MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL
              Shared Memory   L2 Cache   GPU Memory
Latency       1x              5x         15x
Bandwidth     13x             3x         1x

L2 Cache Residency Control

Specify address range up to 128MB for persistent caching

Normal & streaming accesses cannot evict persistent data

Load/store from range persists in L2 even between kernel launches

Normal accesses can still use entire cache if no persistent data is present

21
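A hedged sketch of marking a frequently reused buffer as persisting in L2 via the CUDA runtime's access-policy window (the helper name, carve-out size, and the d_table buffer are illustrative):

#include <cuda_runtime.h>

// Hypothetical buffer that many kernels launched into 'stream' will reread.
void enable_persistent_window(cudaStream_t stream, float* d_table, size_t table_bytes)
{
    // Optionally set aside part of L2 for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 16 * 1024 * 1024);

    // Mark the address range (up to 128 MB on A100) as persisting for this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_table;
    attr.accessPolicyWindow.num_bytes = table_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction of the window treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // keep hits resident in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // other data streams through
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // Kernels launched into 'stream' now tend to keep loads/stores within the
    // window resident in L2, even across kernel launches.
}

// When the phase of reuse is over, release the carve-out for other work:
//   cudaCtxResetPersistingL2Cache();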
LATENCIES & OVERHEADS: GRAPHS vs. STREAMS
Empty Kernel Launches – Investigating System Overheads

Note: Empty kernel launches – timings show reduction in latency only

22
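A hedged sketch of the graph path being compared here: capture a repeated stream sequence into a CUDA graph once, then relaunch the instantiated graph so the per-kernel CPU launch overhead is paid only once (my_kernel, the launch shape, and the iteration count are illustrative).

#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* illustrative (empty) kernel */ }

void launch_as_graph(cudaStream_t stream, float* d_data)
{
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Capture 20 back-to-back launches into a graph; nothing executes during capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 20; ++i)
        my_kernel<<<256, 256, 0, stream>>>(d_data);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // One CPU-side submission now replays the whole sequence.
    cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}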
THIRD GENERATION TENSOR CORES
NVIDIA V100 vs NVIDIA A100

Peak FP64 Tensor Core FLOP/s = 19.5 TFLOPS

Peak TF32 Tensor Core FLOP/s = 156 TFLOPS (312 TFLOPS for sparse input data)
23
FLOATING POINT FORMATS & PRECISION

sign 11-bit exponent 52-bit mantissa


double
8-bit 23-bit
float
5-bit 10-bit
half
8-bit 7-bit
bfloat16
8-bit 10-bit
TF32

Numerical Range Numerical Precision

value = (-1)sign x 2exponent x (1 + mantissa)

24
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0

cuBLAS: BF16, TF32 and FP64 Tensor Cores
cuSPARSE: Increased memory BW, Shared Memory & L2
cuTENSOR: BF16, TF32 and FP64 Tensor Cores
cuSOLVER: BF16, TF32 and FP64 Tensor Cores

nvJPEG: Hardware Decoder
cuFFT: Increased memory BW, Shared Memory & L2
CUDA Math API: BF16 & TF32 Support
CUTLASS: BF16, TF32 and FP64 Tensor Cores

For more information see: S21681 - How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU

25
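A hedged sketch of opting an existing FP32 GEMM into TF32 Tensor Cores with cuBLAS (the helper name, dimensions, and device pointers are illustrative; buffers are assumed column-major and already on the device):

#include <cublas_v2.h>

void gemm_tf32(int m, int n, int k, const float* d_A, const float* d_B, float* d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allow cuBLAS to route eligible FP32 routines through TF32 Tensor Cores.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; inputs are rounded to TF32, accumulation stays in FP32.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);

    cublasDestroy(handle);
}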
TENSOR CORE ACCELERATED LIBRARIES

Performance of LU factorization with different precisions


44
FP16 getrf
40
BF16 getrf
36 TF32 getrf
FP64 getrf
32

28
T f l o p /s

24 3.5x
20

16

12

0
2k 4k 6k 8k 10k 14k 18k 22k 26k 30k 34k 40k
Matrix size
CUDA 11.0, A100 GPU
26
libcu++ : THE CUDA C++ STANDARD LIBRARY

ISO C++ == Language + Standard Library


CUDA C++ == Language + libcu++

Strictly conforming to ISO C++, plus conforming extensions

Opt-in, Heterogeneous, Incremental

27
cuda::std::

Opt-in: Does not interfere with or replace your host standard library

Heterogeneous: Copyable/Movable objects can migrate between host & device
Host & Device can call all member functions
Host & Device can concurrently use synchronization primitives*

Incremental: A subset of the standard library today
Each release adds more functionality

*Synchronization primitives must be in managed memory and be declared with cuda::thread_scope_system


28
libcu++ NAMESPACE HIERARCHY

// ISO C++, __host__ only
#include <atomic>
std::atomic<int> x;

// CUDA C++, __host__ __device__
// Strictly conforming to the ISO C++
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, __host__ __device__
// Conforming extensions to ISO C++
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

For more information see: S21262 - The CUDA C++ Standard Library

29
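A hedged usage sketch for the scoped atomics above (the kernel, counter placement, and launch shape are illustrative): a device-scoped cuda::atomic counter updated from a kernel and read from the host after synchronization. Placing the counter in managed memory is what also allows concurrent host/device use with a system-scoped atomic, per the footnote on the previous slide.

#include <cuda/atomic>
#include <new>

using counter_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void count_positive(const int* values, int n, counter_t* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0)
        counter->fetch_add(1, cuda::std::memory_order_relaxed);
}

int run_count(const int* d_values, int n)
{
    counter_t* counter;
    cudaMallocManaged(&counter, sizeof(counter_t));
    new (counter) counter_t(0);              // construct the atomic in managed memory

    count_positive<<<(n + 255) / 256, 256>>>(d_values, n, counter);
    cudaDeviceSynchronize();                 // device-scope atomic: read only after sync

    int total = counter->load(cuda::std::memory_order_relaxed);
    cudaFree(counter);
    return total;
}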
