GPU Bootcamp Samhar
GPU-Computing
[Chart: single-threaded CPU performance, 1980 to 2020, growing roughly 1.5X per year before flattening. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp. See also "Path to 2 nm May Not Be Worth It": https://www.eetimes.com/document.asp?doc_id=1333109]
VOLTA ARCHITECTURE
• 21B transistors
• 5120 CUDA cores
• 640 Tensor Cores, 120 TF of Tensor Core throughput
• 900 GB/s HBM2
• 300 GB/s NVLink
• Powers Summit and Sierra
DGX-1: 96X FASTER THAN CPU
ITERATE AND INNOVATE FASTER
Workload: ResNet-50, BS=256, 90 epochs to solution | CPU: dual Xeon Platinum 8180 | GPU: 8x NVIDIA Tesla V100 32GB
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
• Volta Architecture: Most Productive GPU
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms
• Tensor Core: 120 Programmable TFLOPS for Deep Learning
VOLTA TENSOR CORE
Mixed Precision Matrix Math
D = A x B + C, where A, B, C, and D are 4x4 matrices
What is Mixed Precision?
• Reduced-precision tensor math: FP16 multiplication with FP32 accumulation
• Successfully used to train a variety of networks:
• Well-known public networks
• NVIDIA research networks
• NVIDIA automotive networks
TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs
[Chart: FP32 vs. mixed-precision training curves. Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet]
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of roughly 1.5X to 2X compared with FP32
https://github.com/apache/incubator-mxnet/pull/14173
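Besides choosing per-op precision, AMP implementations apply loss scaling so that small FP16 gradients do not flush to zero. A NumPy sketch of the idea, with hypothetical gradient and scale values:

```python
import numpy as np

# A gradient this small flushes to zero when stored in FP16 ...
tiny_grad = 1e-8
lost = np.float16(tiny_grad)           # underflows to 0.0

# ... so the loss (and hence every gradient) is scaled up before the
# backward pass, then unscaled in FP32 when applying the update.
scale = 2.0 ** 16                       # a typical starting scale factor
kept = np.float16(tiny_grad * scale)    # now representable in FP16
recovered = np.float32(kept) / scale    # unscale at FP32 precision
```

Dynamic loss scaling grows or shrinks the factor at run time to stay just below overflow.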
NVIDIA GPU CLOUD
GPU-OPTIMIZED CONTAINERS
CHALLENGES WITH COMPLEX SOFTWARE
NVIDIA Driver
GPU-ACCELERATED CONTAINERS
10 at Launch, 35+ Today
ACCELERATED INFERENCING
CURRENT DEPLOYMENT WORKFLOW
TRAINING: data management → training framework → training → trained neural network → model assessment
UNOPTIMIZED DEPLOYMENT, three common options:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Framework or custom CPU-only application
TensorRT (Optimizer + Runtime) targets Tesla V100, Tesla T4, Jetson TX2, DRIVE PX 2, and NVIDIA DLA
developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE
40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50) | 140x faster language-translation RNNs on V100 vs. CPU-only inference (OpenNMT)
[Charts: ResNet-50 inference throughput (images/sec) and latency; OpenNMT inference throughput (sentences/sec) and latency; CPU-only vs. V100 + framework vs. V100 + TensorRT]
ResNet-50 footnote: inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.
OpenNMT footnote: inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, same host. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained neural network → import model → TensorRT Optimizer → serialize engine → optimized plans (Plan 1, Plan 2, Plan 3)
Optimizations include kernel auto-tuning and dynamic tensor memory management.
LAYER & TENSOR FUSION
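TensorRT's layer and tensor fusion collapses chains such as convolution + bias + ReLU into a single kernel, so intermediate tensors never round-trip through GPU memory. A CPU-side NumPy sketch of the idea, using a toy linear layer with hypothetical weights (not TensorRT's actual kernels):

```python
import numpy as np

def unfused(x, w, b):
    # Three separate "kernels": each step materializes an intermediate.
    y = x @ w                    # linear layer
    y = y + b                    # bias add
    return np.maximum(y, 0.0)    # ReLU

def fused(x, w, b):
    # One "kernel": identical math in a single pass, no intermediate
    # buffers between layers - which is what fusion buys on a GPU.
    return np.maximum(x @ w + b, 0.0)

# Hypothetical inputs for illustration only.
x = np.array([[1.0, -2.0, 0.5]])
w = np.arange(9, dtype=float).reshape(3, 3) - 4.0
b = np.array([0.1, -0.2, 0.3])
```

Fusion never changes the result, only how many kernel launches and memory round-trips the result costs.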
FP16, INT8 PRECISION CALIBRATION

Precision    Dynamic Range                Notes
FP32         -3.4x10^38 ~ +3.4x10^38     Training precision
FP16         -65504 ~ +65504             No calibration required; Tensor Core
INT8         -128 ~ +127                 Requires calibration

Reduced-precision inference accuracy (Top-1, FP32 vs. INT8):

Network      FP32      INT8      Difference
GoogLeNet    68.87%    68.49%    0.38%
VGG          68.56%    68.45%    0.11%
ResNet-50    73.11%    72.54%    0.57%
ResNet-152   75.18%    74.56%    0.61%

[Chart: reduced-precision inference performance (images/sec) on ResNet-50]
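A minimal sketch of symmetric INT8 quantization, assuming a naive max-based calibration; TensorRT's actual calibrator chooses the dynamic-range threshold more carefully (an entropy-based method) rather than using the raw maximum.

```python
import numpy as np

def int8_quantize(x, calib_max):
    """Symmetric linear quantization into INT8 [-128, 127].
    calib_max is the dynamic-range threshold chosen by calibration."""
    scale = calib_max / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical activation samples standing in for calibration data.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=1000).astype(np.float32)

calib_max = float(np.abs(acts).max())    # naive calibration: observed max
q, scale = int8_quantize(acts, calib_max)
recon = int8_dequantize(q, scale)
err = float(np.max(np.abs(acts - recon)))
```

With a max-based threshold the worst-case error is half a quantization step; entropy calibration trades a little clipping of outliers for finer steps where most values live.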
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
WIDELY ADOPTED
AUTONOMOUS MACHINES
NVIDIA Jetson: powerful and efficient AI, CV, HPC | Rich software development platform | Software-defined autonomous machines | Open platform | 200K developers
ACCELERATED MODULES: depth estimation, path planning, object detection, gesture recognition, pose estimation, speech recognition
JETSON COMPUTER
Deployed across warehouse, delivery, agriculture, retail, and industrial autonomous machines
Jetson stack: Jetson software → Jetson module → HW and sensors

Module                         Power     Compute                              Size            Price
JETSON NANO                    5 - 10W   0.5 TFLOPS (FP16)                    45mm x 70mm     $129
JETSON TX1 → JETSON TX2 4GB    7 - 15W   1 - 1.3 TFLOPS (FP16)                50mm x 87mm     $299
JETSON TX2 8GB | Industrial    7 - 15W   1.3 TFLOPS (FP16)                    50mm x 87mm     $399 - $749
JETSON AGX XAVIER              10 - 30W  10 TFLOPS (FP16) | 32 TOPS (INT8)    100mm x 87mm    $1099
DEEPSTREAM - FRAMEWORK FOR INTELLIGENT VIDEO ANALYTICS
https://www.nvidia.com/en-us/autonomous-machines/intelligent-video-analytics-platform/
IVA use cases, from the camera up: traffic management, public safety, smart buildings, airport security, parking entrances, law enforcement
PERCEPTION FOR INTELLIGENT VIDEO ANALYTICS

DeepStream SDK (on Linux, CUDA):
• Plugins: DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins, plugin templates, custom IP integration
• Streaming and batch analytics: DeepStream in containers, multi-GPU orchestration, tracking and analytics across large-scale/multi-camera deployments, event fabric
• Applications: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes and adaptation guides

Perception infra: Jetson, Tesla servers (edge and cloud) | Analytics infra: edge servers, NGC, AWS, Azure
DAY IN THE LIFE OF A DATA SCIENTIST
NVIDIA GPUs Supercharge The Way They Work

RAPIDS
Data preparation → model training → visualization, all operating on data held in GPU memory
RAPIDS LIBRARIES
cuDF
• GPU-accelerated, lightweight in-GPU-memory database used for data preparation
• Accelerates loading, filtering, and manipulation of data for model training
• Python drop-in Pandas replacement built on CUDA C++
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman filters, K-means, k-NN, DBSCAN, tSVD, and more
cuGraph
• Collection of graph analytics libraries, coming soon
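Because cuDF tracks the pandas API, porting data preparation is often just an import swap (import cudf in place of pandas). A sketch running on CPU pandas, with a hypothetical sensor table used only for illustration; the same lines are intended to run unchanged under cuDF:

```python
import pandas as pd   # on GPU, the port is: import cudf as pd

# Hypothetical sensor readings, for illustration only.
df = pd.DataFrame({
    "sensor": ["a", "b", "a", "b"],
    "reading": [1.0, 5.0, 3.0, 7.0],
})

# Typical data-prep steps cuDF accelerates: filter, derive, group.
df = df[df["reading"] > 1.0]                       # drop low readings
df["scaled"] = df["reading"] / df["reading"].max() # normalize
summary = df.groupby("sensor")["scaled"].mean()    # per-sensor average
```

Not every pandas corner is covered by cuDF, so a port should still be validated against the CPU results.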
HOW? DOWNLOAD AND DEPLOY
Source available on GitHub | Containers available on NGC and Docker Hub | Conda packages; pip available at a later date
https://ngc.nvidia.com
https://anaconda.org/rapidsai
https://github.com/rapidsai
https://hub.docker.com/u/rapidsai
Deploy from NGC on-premises or in the cloud
PORTING EXISTING CODE: CPU vs GPU
Principal Component Analysis (PCA): before… …now!
KNN: before… …now!
[Chart: end-to-end runtime split across cuDF (load and data preparation), data conversion, and XGBoost]
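The porting story for cuML mirrors the cuDF one: cuml.PCA exposes the same fit/transform interface as sklearn.decomposition.PCA, so CPU code moves over with an import swap. What both compute underneath can be sketched directly in NumPy (this is the textbook SVD route, not either library's implementation):

```python
import numpy as np

def pca_project(X, n_components):
    """PCA via SVD: center the data, take the top right singular
    vectors as principal axes, and project onto them."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # scores on top components

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z = pca_project(X, 2)
```

The projected columns are mutually orthogonal, one quick sanity check when validating a GPU port against CPU output.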
CLARA AI
Lowering the barriers to AI adoption
MRI
Clara AI
Rapid Data Curation
CLARA AI
Pre-Trained models * AI-Assisted Annotation * Transfer Learning * Ready to integrate
CLARA AI
Intelligent compute platform for medical imaging

Clara Train SDK: sample training pipelines | AI-assisted annotation | DICOM-to-NIfTI conversion | pre-trained models | training
Clara Deploy SDK: DICOM | web UI | pipeline manager | AI inference | streaming render | Kubernetes

SOFTWARE: cuBLAS, cuFFT, NPP, NCCL, cuDNN, DALI, TRT, OptiX, IndeX, NVENC, on CUDA
HARDWARE
PRE-TRAINED MODELS
1) Brain Tumor
2) Liver and Tumor
3) Hippocampus
4) Lung Tumor
5) Prostate
6) Left Atrium
7) Pancreas and Tumor
8) Colon Tumor
9) Hepatic Vessel
10) Spleen
11) Heart
12) Chest X-ray
DEEPSTREAM - MANY INDUSTRIES, FLEXIBLE DEPLOYMENT
Industries: construction, manufacturing, and more
Building blocks: TLT, NGC, analytics, visualization, NVRs
Deployment: servers, any cloud, edge, on-prem
DEEPSTREAM SOFTWARE STACK
DEEPSTREAM SDK: hardware-accelerated plugins | Docker containers | reference applications and orchestration recipes | Azure IoT Runtime
Built on CUDA-X
Runs on JETSON | TESLA
IVA APPLICATION WORKFLOW
Pixels → Insights
DEEPSTREAM GRAPH ARCHITECTURE
RTSP/RAW → DECODE → IMAGE PROCESSING → BATCHING → DNN(s) → TRACKING → VIZ → DISPLAY/STORAGE/CLOUD
Stage details: capture | decode | scale, dewarp, crop | stream management | detect, classify and segment | tracking | on-screen display | output
Stages map onto dedicated engines (e.g. VIC, ISP) as well as the CPU
WHAT’S NEW IN DEEPSTREAM 4.0
Unified SDK across all platforms | Turnkey IoT integration | Docker containers on NGC
ACHIEVING REAL-TIME PERFORMANCE
DEEPSTREAM ACCELERATED PLUGINS

Plugin Name          Functionality
Gst-nvvideo4linux2   Hardware-accelerated decode and encode
SCALE WITH DEEPSTREAM IN DOCKER
NGC
REAL TIME INSIGHTS, HIGHEST STREAM DENSITY
Perception → Analytics → Visualization
BRINGING REALTIME AI TO IOT
“Extracting actionable insights from a sea of data created by the world’s billions of cameras and sensors is a huge task, and maintaining a connection from these devices to the cloud for processing may be overly expensive or infeasible due to security, regulatory, or bandwidth restrictions.”
FULFILLMENT AND LOGISTICS MANAGEMENT WITH SMARTER VIDEO INSIGHTS
START DEVELOPING WITH DEEPSTREAM
Parabricks
GPU-Accelerated Analysis of DNA Sequencing Data
Current Use Cases
Clinical diagnosis pipeline: quality checking → alignment → pre-processing → variant calling → scientific conclusions
Stages: alignment with BWA-MEM, coordinate sorting, Picard MarkDuplicates, BQSR, HaplotypeCaller
[Bar chart: end-to-end runtimes in minutes for Sample1 (26X), Sample2 (42X), Sample3 (41X), NA12878 (43X), and NIST (41X); bar labels include a 1870-minute baseline and GPU runs in the 38-72 minute range]
Confidential: Do not distribute without
Features
[Chart: performance scaling vs. number of GPUs, 0 to 8]
Deep Learning in Genomics
DeepVariant
Google DeepVariant
Generate candidates → 6D pileup image → CNN
[Chart: speed and cost comparison]
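DeepVariant turns the reads aligned around each candidate site into a multi-channel image for the CNN. A toy pure-Python sketch of the simplest such channel, per-position base counts; the real encoder adds channels for base quality, strand, and more, and the reads and window size here are entirely hypothetical:

```python
# One pileup "channel": count each base observed at each window position.
def pileup_counts(reads, window_len):
    """reads: list of (start position, sequence) pairs, already
    aligned into a window of window_len positions."""
    counts = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(window_len)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < window_len and base in counts[pos]:
                counts[pos][base] += 1
    return counts

# Hypothetical aligned reads for illustration.
reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC")]
cols = pileup_counts(reads, 6)
```

Stacking several such per-position channels yields the image the CNN classifies into genotype likelihoods.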