GPU Bootcamp Samhar
GPU-Computing
[Chart: single-threaded CPU performance, 1980 to 2020, growing roughly 1.5X per year before flattening. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp. See also "Path to 2 nm May Not Be Worth It": https://www.eetimes.com/document.asp?doc_id=1333109]
VOLTA ARCHITECTURE
• 21B transistors
• 5120 CUDA cores
• 640 Tensor Cores, 120 TF of Tensor Core throughput
• 900 GB/s HBM2
• 300 GB/s NVLink
• Powers Summit and Sierra
DGX-1: 96X FASTER THAN CPU
ITERATE AND INNOVATE FASTER
Workload: ResNet-50, BS=256, 90 epochs to solution | CPU: dual Xeon Platinum 8180 | GPU: 8x NVIDIA Tesla V100 32GB
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
• Volta Architecture: Most Productive GPU
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms
• Tensor Core: 120 Programmable TFLOPS for Deep Learning
VOLTA TENSOR CORE
Mixed Precision Matrix Math
D = A x B + C, where A, B, C, and D are 4x4 matrices
What is Mixed Precision?
• Reduced-precision tensor math: FP16 multiplication with FP32 accumulation
• Successfully used to train a variety of networks:
• Well-known public networks
• NVIDIA research networks
• NVIDIA automotive networks
TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs
[Chart: FP32 vs. mixed-precision training curves. Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet]
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of roughly 1.5X to 2X compared with FP32
https://github.com/apache/incubator-mxnet/pull/14173
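Besides choosing per-op precision, AMP implementations apply loss scaling so that small FP16 gradients do not flush to zero. A NumPy sketch of the idea, with hypothetical gradient and scale values:

```python
import numpy as np

# A gradient this small flushes to zero when stored in FP16 ...
tiny_grad = 1e-8
lost = np.float16(tiny_grad)           # underflows to 0.0

# ... so the loss (and hence every gradient) is scaled up before the
# backward pass, then unscaled in FP32 when applying the update.
scale = 2.0 ** 16                       # a typical starting scale factor
kept = np.float16(tiny_grad * scale)    # now representable in FP16
recovered = np.float32(kept) / scale    # unscale at FP32 precision
```

Dynamic loss scaling grows or shrinks the factor at run time to stay just below overflow.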
NVIDIA GPU CLOUD
GPU-OPTIMIZED CONTAINERS
CHALLENGES WITH COMPLEX SOFTWARE
NVIDIA Driver
GPU-ACCELERATED CONTAINERS
10 at Launch, 35+ Today
ACCELERATED INFERENCING
CURRENT DEPLOYMENT WORKFLOW
TRAINING: data management → training framework → training → trained neural network → model assessment
UNOPTIMIZED DEPLOYMENT, three common options:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Framework or custom CPU-only application
TensorRT (Optimizer + Runtime) targets Tesla V100, Tesla T4, Jetson TX2, DRIVE PX 2, and NVIDIA DLA
developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE
40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50) | 140x faster language-translation RNNs on V100 vs. CPU-only inference (OpenNMT)
[Charts: ResNet-50 inference throughput (images/sec) and latency; OpenNMT inference throughput (sentences/sec) and latency; CPU-only vs. V100 + framework vs. V100 + TensorRT]
ResNet-50 footnote: inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.
OpenNMT footnote: inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, same host. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained neural network → import model → TensorRT Optimizer → serialize engine → optimized plans (Plan 1, Plan 2, Plan 3)
Optimizations include kernel auto-tuning and dynamic tensor memory management.
LAYER & TENSOR FUSION
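TensorRT's layer and tensor fusion collapses chains such as convolution + bias + ReLU into a single kernel, so intermediate tensors never round-trip through GPU memory. A CPU-side NumPy sketch of the idea, using a toy linear layer with hypothetical weights (not TensorRT's actual kernels):

```python
import numpy as np

def unfused(x, w, b):
    # Three separate "kernels": each step materializes an intermediate.
    y = x @ w                    # linear layer
    y = y + b                    # bias add
    return np.maximum(y, 0.0)    # ReLU

def fused(x, w, b):
    # One "kernel": identical math in a single pass, no intermediate
    # buffers between layers - which is what fusion buys on a GPU.
    return np.maximum(x @ w + b, 0.0)

# Hypothetical inputs for illustration only.
x = np.array([[1.0, -2.0, 0.5]])
w = np.arange(9, dtype=float).reshape(3, 3) - 4.0
b = np.array([0.1, -0.2, 0.3])
```

Fusion never changes the result, only how many kernel launches and memory round-trips the result costs.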
FP16, INT8 PRECISION CALIBRATION

Precision    Dynamic Range                Notes
FP32         -3.4x10^38 ~ +3.4x10^38     Training precision
FP16         -65504 ~ +65504             No calibration required; Tensor Core
INT8         -128 ~ +127                 Requires calibration

Reduced-precision inference accuracy (Top-1, FP32 vs. INT8):

Network      FP32      INT8      Difference
GoogLeNet    68.87%    68.49%    0.38%
VGG          68.56%    68.45%    0.11%
ResNet-50    73.11%    72.54%    0.57%
ResNet-152   75.18%    74.56%    0.61%

[Chart: reduced-precision inference performance (images/sec) on ResNet-50]
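A minimal sketch of symmetric INT8 quantization, assuming a naive max-based calibration; TensorRT's actual calibrator chooses the dynamic-range threshold more carefully (an entropy-based method) rather than using the raw maximum.

```python
import numpy as np

def int8_quantize(x, calib_max):
    """Symmetric linear quantization into INT8 [-128, 127].
    calib_max is the dynamic-range threshold chosen by calibration."""
    scale = calib_max / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical activation samples standing in for calibration data.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=1000).astype(np.float32)

calib_max = float(np.abs(acts).max())    # naive calibration: observed max
q, scale = int8_quantize(acts, calib_max)
recon = int8_dequantize(q, scale)
err = float(np.max(np.abs(acts - recon)))
```

With a max-based threshold the worst-case error is half a quantization step; entropy calibration trades a little clipping of outliers for finer steps where most values live.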
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
WIDELY ADOPTED
AUTONOMOUS MACHINES
NVIDIA Jetson: powerful and efficient AI, CV, HPC | Rich software development platform | Software-defined autonomous machines | Open platform | 200K developers
ACCELERATED MODULES: depth estimation, path planning, object detection, gesture recognition, pose estimation, speech recognition
JETSON COMPUTER
Deployed across warehouse, delivery, agriculture, retail, and industrial autonomous machines
Jetson stack: Jetson software → Jetson module → HW and sensors

Module                         Power     Compute                              Size            Price
JETSON NANO                    5 - 10W   0.5 TFLOPS (FP16)                    45mm x 70mm     $129
JETSON TX1 → JETSON TX2 4GB    7 - 15W   1 - 1.3 TFLOPS (FP16)                50mm x 87mm     $299
JETSON TX2 8GB | Industrial    7 - 15W   1.3 TFLOPS (FP16)                    50mm x 87mm     $399 - $749
JETSON AGX XAVIER              10 - 30W  10 TFLOPS (FP16) | 32 TOPS (INT8)    100mm x 87mm    $1099
DEEPSTREAM - FRAMEWORK FOR INTELLIGENT VIDEO ANALYTICS
https://www.nvidia.com/en-us/autonomous-machines/intelligent-video-analytics-platform/
IVA use cases, from the camera up: traffic management, public safety, smart buildings, airport security, parking entrances, law enforcement
PERCEPTION FOR INTELLIGENT VIDEO ANALYTICS

DeepStream SDK (on Linux, CUDA):
• Plugins: DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins, plugin templates, custom IP integration
• Streaming and batch analytics: DeepStream in containers, multi-GPU orchestration, tracking and analytics across large-scale/multi-camera deployments, event fabric
• Applications: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes and adaptation guides

Perception infra: Jetson, Tesla servers (edge and cloud) | Analytics infra: edge servers, NGC, AWS, Azure
DAY IN THE LIFE OF A DATA SCIENTIST
NVIDIA GPUs Supercharge The Way They Work

RAPIDS
Data preparation → model training → visualization, all operating on data held in GPU memory
RAPIDS LIBRARIES
cuDF
• GPU-accelerated, lightweight in-GPU-memory database used for data preparation
• Accelerates loading, filtering, and manipulation of data for model training
• Python drop-in Pandas replacement built on CUDA C++
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman filters, K-means, k-NN, DBSCAN, tSVD, and more
cuGraph
• Collection of graph analytics libraries, coming soon
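Because cuDF tracks the pandas API, porting data preparation is often just an import swap (import cudf in place of pandas). A sketch running on CPU pandas, with a hypothetical sensor table used only for illustration; the same lines are intended to run unchanged under cuDF:

```python
import pandas as pd   # on GPU, the port is: import cudf as pd

# Hypothetical sensor readings, for illustration only.
df = pd.DataFrame({
    "sensor": ["a", "b", "a", "b"],
    "reading": [1.0, 5.0, 3.0, 7.0],
})

# Typical data-prep steps cuDF accelerates: filter, derive, group.
df = df[df["reading"] > 1.0]                       # drop low readings
df["scaled"] = df["reading"] / df["reading"].max() # normalize
summary = df.groupby("sensor")["scaled"].mean()    # per-sensor average
```

Not every pandas corner is covered by cuDF, so a port should still be validated against the CPU results.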
HOW? DOWNLOAD AND DEPLOY
Source available on GitHub | Containers available on NGC and Docker Hub | Conda packages; pip available at a later date
https://ngc.nvidia.com
https://anaconda.org/rapidsai
https://github.com/rapidsai
https://hub.docker.com/u/rapidsai
Deploy from NGC on-premises or in the cloud
PORTING EXISTING CODE: CPU vs GPU
Principal Component Analysis (PCA): before… …now!
KNN: before… …now!
[Chart: end-to-end runtime split across cuDF (load and data preparation), data conversion, and XGBoost]
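The porting story for cuML mirrors the cuDF one: cuml.PCA exposes the same fit/transform interface as sklearn.decomposition.PCA, so CPU code moves over with an import swap. What both compute underneath can be sketched directly in NumPy (this is the textbook SVD route, not either library's implementation):

```python
import numpy as np

def pca_project(X, n_components):
    """PCA via SVD: center the data, take the top right singular
    vectors as principal axes, and project onto them."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # scores on top components

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z = pca_project(X, 2)
```

The projected columns are mutually orthogonal, one quick sanity check when validating a GPU port against CPU output.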
CLARA AI
Lowering the barriers to AI adoption
MRI
Clara AI
Rapid Data Curation
CLARA AI
Pre-Trained models * AI-Assisted Annotation * Transfer Learning * Ready to integrate
CLARA AI
Intelligent compute platform for medical imaging

Clara Train SDK: sample training pipelines | AI-assisted annotation | DICOM-to-NIfTI conversion | pre-trained models | training
Clara Deploy SDK: DICOM | web UI | pipeline manager | AI inference | streaming render | Kubernetes

SOFTWARE: cuBLAS, cuFFT, NPP, NCCL, cuDNN, DALI, TRT, OptiX, IndeX, NVENC, on CUDA
HARDWARE
PRE-TRAINED MODELS
1) Brain Tumor
2) Liver and Tumor
3) Hippocampus
4) Lung Tumor
5) Prostate
6) Left Atrium
7) Pancreas and Tumor
8) Colon Tumor
9) Hepatic Vessel
10) Spleen
11) Heart
12) Chest X-ray
DEEPSTREAM - MANY INDUSTRIES, FLEXIBLE DEPLOYMENT
Industries: construction, manufacturing, and more
Building blocks: TLT, NGC, analytics, visualization, NVRs
Deployment: servers, any cloud, edge, on-prem
DEEPSTREAM SOFTWARE STACK
DEEPSTREAM SDK: hardware-accelerated plugins | Docker containers | reference applications and orchestration recipes | Azure IoT Runtime
Built on CUDA-X
Runs on JETSON | TESLA
IVA APPLICATION WORKFLOW
Pixels → Insights
DEEPSTREAM GRAPH ARCHITECTURE
RTSP/RAW → DECODE → IMAGE PROCESSING → BATCHING → DNN(s) → TRACKING → VIZ → DISPLAY/STORAGE/CLOUD
Stage details: capture | decode | scale, dewarp, crop | stream management | detect, classify and segment | tracking | on-screen display | output
Stages map onto dedicated engines (e.g. VIC, ISP) as well as the CPU
WHAT’S NEW IN DEEPSTREAM 4.0
Unified SDK across all platforms | Turnkey IoT integration | Docker containers on NGC
ACHIEVING REAL-TIME PERFORMANCE
DEEPSTREAM ACCELERATED PLUGINS

Plugin Name          Functionality
Gst-nvvideo4linux2   Hardware-accelerated decode and encode
SCALE WITH DEEPSTREAM IN DOCKER
NGC
REAL TIME INSIGHTS, HIGHEST STREAM DENSITY
Perception → Analytics → Visualization
BRINGING REALTIME AI TO IOT
“Extracting actionable insights from a sea of data created by the world’s billions of cameras and sensors is a huge task, and maintaining a connection from these devices to the cloud for processing may be overly expensive or infeasible due to security, regulatory, or bandwidth restrictions.”
FULFILLMENT AND LOGISTICS MANAGEMENT WITH SMARTER VIDEO INSIGHTS
START DEVELOPING WITH DEEPSTREAM
Parabricks
GPU-Accelerated Analysis of DNA Sequencing Data
Current Use Cases
Clinical diagnosis pipeline: quality checking → alignment → pre-processing → variant calling → scientific conclusions
Stages: alignment with BWA-MEM, coordinate sorting, Picard MarkDuplicates, BQSR, HaplotypeCaller
[Bar chart: end-to-end runtimes in minutes for Sample1 (26X), Sample2 (42X), Sample3 (41X), NA12878 (43X), and NIST (41X); bar labels include a 1870-minute baseline and GPU runs in the 38-72 minute range]
Confidential: Do not distribute without
Features
[Chart: performance scaling vs. number of GPUs, 0 to 8]
Deep Learning in Genomics
DeepVariant
Google DeepVariant
Generate candidates → 6D pileup image → CNN
[Chart: speed and cost comparison]
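DeepVariant turns the reads aligned around each candidate site into a multi-channel image for the CNN. A toy pure-Python sketch of the simplest such channel, per-position base counts; the real encoder adds channels for base quality, strand, and more, and the reads and window size here are entirely hypothetical:

```python
# One pileup "channel": count each base observed at each window position.
def pileup_counts(reads, window_len):
    """reads: list of (start position, sequence) pairs, already
    aligned into a window of window_len positions."""
    counts = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(window_len)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < window_len and base in counts[pos]:
                counts[pos][base] += 1
    return counts

# Hypothetical aligned reads for illustration.
reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC")]
cols = pileup_counts(reads, 6)
```

Stacking several such per-position channels yields the image the CNN classifies into genotype likelihoods.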