
ACCELERATED COMPUTING

Sunil Patel, Sr. Data Scientist – Deep Learning


supatel@nvidia.com
https://www.linkedin.com/in/linus1/
RISE OF NVIDIA GPU COMPUTING

[Chart: 40 years of CPU trend data, 1980-2020. GPU computing performance has grown ~1.5x per year, while single-threaded CPU performance growth has slowed to ~1.1x per year.]

Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp. See also "Path to 2 nm May Not Be Worth It": https://www.eetimes.com/document.asp?doc_id=1333109
VOLTA ARCHITECTURE

• 21B transistors
• 5120 CUDA cores
• 640 Tensor Cores, 120 TF tensor-core throughput
• 900 GB/s HBM2 memory bandwidth
• 300 GB/s NVLink
• Powers the Summit and Sierra supercomputers

2.4x faster ResNet-50 training vs. P100
3.7x faster ResNet-50 inference vs. P100
DGX-1: 96X FASTER THAN CPU
ITERATE AND INNOVATE FASTER

Workload: ResNet-50, BS=256, 90 epochs to solution | CPU: dual Xeon Platinum 8180 | GPU: 8x NVIDIA Tesla V100 32GB

INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC

• Volta Architecture: the most productive GPU
• Improved NVLink & HBM2: efficient bandwidth
• Volta MPS: improved inference utilization
• Improved SIMT model: new algorithms
• Tensor Cores: 120 programmable TFLOPS for deep learning
VOLTA TENSOR CORE

TENSOR CORE
Mixed Precision Matrix Math on 4x4 Matrices

Each Tensor Core performs a fused matrix multiply-accumulate:

D = A x B + C

where A and B are 4x4 FP16 matrices, and the accumulators C and D are 4x4 FP16 or FP32 matrices.
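The accumulation behavior can be illustrated in plain NumPy (a toy sketch of the numerics, not how the hardware is programmed):

```python
import numpy as np

# Toy sketch of one Tensor Core operation: D = A*B + C.
# A and B are FP16; products are accumulated at FP32 precision.
A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)  # FP32 accumulator

# Emulate FP16 inputs with FP32 accumulation
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Compare against an all-FP16 computation to see the precision benefit
D_fp16 = (A @ B) + C.astype(np.float16)
print("max deviation:", np.abs(D - D_fp16.astype(np.float32)).max())
```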
What is Mixed Precision?
• Reduced-precision tensor math: FP16 multiplication with FP32 accumulation
• Successfully used to train a variety of networks:
  • Well-known public networks
  • NVIDIA research networks
  • NVIDIA automotive networks

Benefits of Mixed Precision Training

• Accelerates math
  • Tensor Cores have 8x higher throughput than FP32
  • 125 TFLOPS theoretical peak
• Reduces memory bandwidth pressure
  • FP16 halves the memory traffic compared to FP32
• Reduces memory consumption
  • Halves the size of activation and gradient tensors
  • Enables larger minibatches or larger input sizes
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3x Speedup

TensorFlow Medium post: "Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs"

All models can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow, except for ssd-rn50-fpn-640, which is at https://github.com/tensorflow/models/tree/master/research/object_detection. All performance collected on 1x V100-16GB, except bert-squadqa on 1x V100-32GB.
Speedup is the ratio of time to train for a fixed number of epochs in single precision vs. Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (both training sessions were also confirmed to reach the same model accuracy).
Batch sizes: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; NCF: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; GNMT: 128 for FP32, 192 for AMP.
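In the TensorFlow 1.x releases this slide targets, AMP can be enabled with a one-line optimizer wrapper (a minimal sketch; the optimizer here is a placeholder):

```python
import tensorflow as tf

# Minimal sketch of enabling AMP in TensorFlow 1.14+: the graph rewrite
# inserts FP16 casts and automatic loss scaling around the optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

# ...build the model and call opt.minimize(loss) as usual.
# In NVIDIA TF containers the same rewrite can be turned on with an
# environment variable: TF_ENABLE_AUTO_MIXED_PRECISION=1
```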
AUTOMATIC MIXED PRECISION IN PYTORCH
https://developer.nvidia.com/automatic-mixed-precision

• Plot shows the ResNet-50 result with and without automatic mixed precision (AMP): roughly 2x throughput with AMP enabled vs. FP32.
• AMP-enabled model scripts are available for all the popular models, such as Mask R-CNN, GNMT, NCF, etc.

Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet
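A minimal sketch with NVIDIA Apex, the AMP implementation this slide refers to (the model and optimizer are placeholders; torch.cuda.amp later superseded this API):

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level="O1" patches ops to run in FP16 where it is numerically safe
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(256, 1024, device="cuda")
loss = model(x).float().pow(2).mean()

# Loss scaling guards against FP16 gradient underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```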
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of ~1.5x to 2x compared with FP32

https://github.com/apache/incubator-mxnet/pull/14173
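A minimal sketch of the contrib API introduced in the PR above (the tiny Gluon network is a placeholder):

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

# amp.init() must run before the network is built so that FP16-safe
# operators are rewritten automatically.
amp.init()

net = gluon.nn.Dense(10)
net.initialize(ctx=mx.gpu())
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.01})
amp.init_trainer(trainer)  # enables dynamic loss scaling

x = mx.nd.random.uniform(shape=(32, 64), ctx=mx.gpu())
with autograd.record():
    loss = net(x).square().mean()
    # Scale the loss so FP16 gradients do not underflow
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(batch_size=32)
```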
NVIDIA GPU CLOUD
GPU-OPTIMIZED CONTAINERS

CHALLENGES WITH COMPLEX SOFTWARE

• Current DIY GPU-accelerated AI and HPC deployments are complex and time-consuming to build, test, and maintain
• Development of software frameworks by the community is moving very fast
• Managing driver, library, and framework dependencies requires a high level of expertise

The stack each application sits on:
AI Applications or Frameworks
NVIDIA Libraries
NVIDIA Container Runtime for Docker
NVIDIA Driver
NVIDIA GPU
WHY CONTAINERS?
Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
GPU-ACCELERATED CONTAINERS
10 at Launch, 35+ Today

Deep Learning: caffe, caffe2, cntk, cuda, digits, inferenceserver, mxnet, pytorch, tensorflow, tensorrt, theano, torch
HPC: bigdft, candle, chroma, gamess, gromacs, lammps, lattice-microbes, milc, namd, pgi, picongpu, relion
HPC Visualization: index, paraview-holodeck, paraview-index, paraview-optix, vmd
NVIDIA/K8s: Kubernetes on NVIDIA GPUs
Partners: chainer, h2oai-driverless, kinetica, mapd, PaddlePaddle, MATLAB
ACCELERATED INFERENCING

CURRENT DEPLOYMENT WORKFLOW

TRAINING: Data Management → Training (trained neural network) → Model Assessment, with training data feeding the loop.

UNOPTIMIZED DEPLOYMENT options:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Deploy a framework or custom CPU-only application

All built on CUDA and the NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL).
NVIDIA TENSORRT
Programmable Inference Accelerator

Trained models from all major FRAMEWORKS feed into the TensorRT Optimizer and Runtime, which deploy across GPU PLATFORMS: Tesla V100, Tesla T4, Jetson TX2, DRIVE PX 2, and NVIDIA DLA.

developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE

40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50):
• CPU-only: 140 images/sec (14 ms)
• V100 + TensorFlow: 305 images/sec
• V100 + TensorRT: 5,700 images/sec
(V100 latencies: 6.67-6.83 ms)

140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT):
• CPU-only + Torch: 4 sentences/sec (280 ms)
• V100 + Torch: 25 sentences/sec (153 ms)
• V100 + TensorRT: 550 sentences/sec (117 ms)

Inference throughput (images/sec) on ResNet-50. V100 + TensorRT: TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, same host. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, same GPU and host. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

developer.nvidia.com/tensorrt
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize the trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy the optimized plans with the runtime
Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy to data center, automotive, or embedded targets


TENSORRT OPTIMIZATIONS

➢ Optimizations are completely automatic
➢ Performed with a single function call

• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory
LAYER & TENSOR FUSION

TensorRT transforms the un-optimized network graph into an optimized one:
• Vertical fusion: convolution, bias, and ReLU merge into a single CBR kernel (e.g. 3x3 CBR, 5x5 CBR, 1x1 CBR)
• Horizontal fusion: layers with the same input and identical parameters are merged
• Layer elimination: redundant layers such as concat are removed

Layer counts before and after fusion:

Network        Layers before   Layers after
VGG19          43              27
Inception V3   309             113
ResNet-152     670             159
FP16, INT8 PRECISION CALIBRATION

Precision   Dynamic Range                Calibration
FP32        -3.4x10^38 ~ +3.4x10^38      Training precision; no calibration required
FP16        -65504 ~ +65504              No calibration required (Tensor Core)
INT8        -128 ~ +127                  Requires calibration

INT8 accuracy after calibration (Top-1):

Network      FP32     INT8     Difference
Googlenet    68.87%   68.49%   0.38%
VGG          68.56%   68.45%   0.11%
Resnet-50    73.11%   72.54%   0.57%
Resnet-152   75.18%   74.56%   0.61%

Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic

[Chart: reduced-precision inference throughput (images/sec) on ResNet-50, CPU-only FP32 vs. P4 INT8 vs. V100.]
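Conceptually, symmetric INT8 quantization maps a calibrated dynamic range onto [-127, 127] through a per-tensor scale. A toy NumPy sketch of the idea (TensorRT's actual calibrator picks the threshold by minimizing KL divergence over the calibration dataset; the percentile here is just a stand-in):

```python
import numpy as np

# Toy sketch of symmetric INT8 quantization with a calibrated threshold.
activations = np.random.randn(10000).astype(np.float32) * 3.0

# Stand-in for calibration: clip the rare outliers beyond the 99.9th percentile
threshold = np.percentile(np.abs(activations), 99.9)
scale = threshold / 127.0

# Quantize to INT8, then dequantize to measure the information loss
q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

print("max abs error:", np.abs(activations - dequant).max())
```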
7 STEPS TO DEPLOYMENT WITH TENSORRT

Step 1: Convert the trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize the model and create a runtime engine
Step 5: Serialize the optimized engine
Step 6: De-serialize the engine
Step 7: Perform inference

A sketch of these steps in the TensorRT Python API follows below.
developer.nvidia.com/tensorrt
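A minimal sketch of the seven steps with the TensorRT Python API of this era (the ONNX file name is a placeholder, and builder options vary across TensorRT versions):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Steps 1-3: parse the trained model; the parser registers inputs/outputs
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:          # hypothetical model file
    parser.parse(f.read())

# Step 4: optimize the model and build a runtime engine
builder.max_batch_size = 32
builder.max_workspace_size = 1 << 30          # 1 GiB for tactic selection
engine = builder.build_cuda_engine(network)

# Step 5: serialize the optimized engine ("plan") to disk
with open("model.plan", "wb") as f:
    f.write(engine.serialize())

# Steps 6-7: later, de-serialize the engine and perform inference
runtime = trt.Runtime(TRT_LOGGER)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# context.execute(batch_size, bindings=[...])  # device buffers via PyCUDA
```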
WIDELY ADOPTED

AUTONOMOUS MACHINES
NVIDIA Jetson: software-defined autonomous machines
Powerful and efficient AI, CV, HPC | Rich Software Development Platform
Open Platform | 200K Developers

ECOSYSTEM: Sensors, AI/Systems Software, Design Services

ACCELERATED MODULES
Depth estimation, path planning, object detection, gesture recognition, pose estimation, speech recognition

JETPACK SDK
Artificial Intelligence | Computer Vision | Accelerated Computing | Multimedia
BSP • Linux • Security Architecture

JETSON COMPUTER

AUTONOMOUS MACHINES
WAREHOUSE | DELIVERY | AGRICULTURE | RETAIL | INDUSTRIAL
JETSON SOFTWARE

Nsight Developer Tools: CUDA-aware editor, CPU/GPU debugger, visual profiler and system trace

Modules: depth estimation, object detection, pose estimation, gesture recognition, path planning, plus ecosystem modules

JetPack SDK:
• Deep Learning: TensorRT, cuDNN
• Computer Vision: VisionWorks, OpenCV
• Accelerated Computing: cuBLAS, cuFFT
• Multimedia: libargus, Video API
• Sensors: drivers, ecosystem
• CUDA • CUDA-X • Linux • RTOS

Jetson module

Jetson software: developer.nvidia.com/jetson


JETSON H/W PLATFORM
World's First Autonomous Machine Platform

Computer Vision Engines
• Vision accelerator
• Stereo & optical flow engine
• HDR ISP

Multimedia Engines
• Encode, decode, video image compositor
• H.264, H.265, VP9
• HDMI and DP display support

Carmel ARM v8.2 CPU
• 8 cores, 10-wide superscalar
• 4x 2MB L2, 4MB L3, cache-coherent CPU complex

Boot, Power & Security
• Boot and power management processor
• TEE + ARM TrustZone
• AES, RSA, SHA

Volta Tensor Core GPU
• 512 CUDA cores with Tensor Cores
• FP32/FP16/INT8 multi-precision
• 2.8 TFLOPS (FP16), 22.6 Tensor Core DL TOPS

DLA – Designed for Inference
• 5.7 TFLOPS FP16, 11.4 TOPS INT8

Industry Standard IO
• Always-on sensor processor engine (AON/SPE)
• CAN, DMIC, GPIO, I2C, I2S, PMC, SPI, UART

Industry Standard High-Speed IO
• PCIe Gen4 root and endpoint
• 16 lanes MIPI CSI-2 | 8 lanes SLVS-EC, supports C-PHY and D-PHY
• RGMII Ethernet
• USB 3.1 and 2.0, USB 3.1 Gen2 host and device

Memory
• 256-bit LPDDR4X, 16GB, 137 GB/s
THE JETSON FAMILY
From AI at the Edge to Autonomous Machines

              JETSON NANO        JETSON TX1 → TX2 4GB     JETSON TX2 8GB | Industrial   JETSON AGX XAVIER
Power         5 - 10W            7 - 15W                  7 - 15W                       10 - 30W
Performance   0.5 TFLOPS (FP16)  1 - 1.3 TFLOPS (FP16)    1.3 TFLOPS (FP16)             10 TFLOPS (FP16) | 32 TOPS (INT8)
Module size   45mm x 70mm        50mm x 87mm              50mm x 87mm                   100mm x 87mm
Price         $129               $299                     $399 - $749                   $1099

AI at the edge → fully autonomous machines. Multiple devices, same software.

Listed prices are for 1000u+ | Full specs at developer.nvidia.com/jetson
JETSON DEVELOPER TOOLS
Comprehensive tool suite to accelerate development

• System-wide application tuning and optimization
• Workload balancing across GPU, CPU, and DLA
• Multi-platform development
• CUDA-aware editor, CPU/GPU debugger, visual profiler and system trace
• Compute and graphics

Develop → Profile → Analyze → Optimize
JETSON ECOSYSTEM

SOFTWARE: ISV tools/systems SW, CSP-IoT, software services, distribution
HW AND SENSORS: cameras and sensors, hardware and design services
DEEPSTREAM - FRAMEWORK FOR INTELLIGENT VIDEO ANALYTICS
https://www.nvidia.com/en-us/autonomous-machines/intelligent-video-analytics-platform/

IVA use cases: access control, public transit, parking management, traffic engineering, retail analytics, securing critical infrastructure, managing logistics, forensic analysis
AI CITY NEEDS SCALABILITY - AN EDGE TO CLOUD ARCHITECTURE

CLOUD: 1000s of cameras (traffic management, public safety)
ON-PREM SERVER / APPLIANCE: 10s-100s of cameras (smart building, airport security)
CAMERA: parking entrance, law enforcement


INCREASING RESOLUTIONS
Full-quality images captured in smart cities

PERCEPTION FOR INTELLIGENT VIDEO ANALYTICS


DEEPSTREAM SDK 3.0

Plugins (build with open source, 3rd party, NVIDIA): DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins
Analytics (multi-camera, multi-sensor framework): DeepStream in containers, multi-GPU orchestration; tracking & analytics across large-scale multi-camera setups; streaming and batch analytics; event fabric
Development tools: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes & adaptation guides, plugin templates, custom IP integration

DeepStream SDK components: Multimedia APIs/Video Codec SDK, imaging & dewarping library, metadata & messaging, multi-camera tracking lib, TensorRT, NV containers, message bus clients; runs on Linux and CUDA.

Perception infra: Jetson, Tesla server (edge and cloud). Analytics infra: edge server, NGC, AWS, Azure.

DEEPSTREAM SDK 4.0 IS RELEASED
RAPIDS
EXPLORATION AND MODEL PROTOTYPING

DAY IN THE LIFE OF A DATA SCIENTIST
NVIDIA GPUs Supercharge The Way They Work

CPU-powered workflow: datasets download overnight; configure and start the data prep workflow and get a coffee; find unexpected null values stored as strings and restart data prep; realize a feature was forgotten and restart again (another coffee, then switch to decaf); finally train, test, and validate the model, and stay late.

GPU-powered workflow: datasets still download overnight, but the same number of data prep, training, and validation iterations complete in much less time, leaving room to experiment with optimizations and repeat, and to go home on time.

Pipeline stages: Dataset Collection → Analysis → Data Prep → Train → Inference
RAPIDS
GPU Accelerated End-to-End Data Science

RAPIDS is a set of open source libraries for GPU-accelerated data preparation and machine learning.

OSS website: rapids.ai

RAPIDS components, all operating on GPU memory:
• cuDF: data preparation
• cuML: machine learning
• cuGraph: graph analytics
• cuXfilter: visualization
RAPIDS LIBRARIES

cuDF
• GPU-accelerated, lightweight in-GPU-memory database used for data preparation
• Accelerates loading, filtering, and manipulation of data for model-training data preparation
• Python drop-in Pandas replacement built on CUDA C++ (see the sketch below)

cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD, ...

cuGraph
• Collection of graph analytics libraries, coming soon
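A minimal cuDF sketch of the Pandas-style workflow (the CSV file and column names are hypothetical):

```python
import cudf

# Hypothetical CSV; the API mirrors pandas but runs on the GPU.
gdf = cudf.read_csv("transactions.csv")

# Filtering and feature engineering happen in GPU memory
gdf = gdf[gdf["amount"].notna()]
gdf["amount_z"] = (gdf["amount"] - gdf["amount"].mean()) / gdf["amount"].std()

# Aggregations look identical to pandas
per_store = gdf.groupby("store_id")["amount"].mean()

# Move results back to the host only when needed
print(per_store.to_pandas().head())
```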
HOW? DOWNLOAD AND DEPLOY
Source available on GitHub | Containers available on NGC and Docker Hub | Conda and PIP

• NGC: https://ngc.nvidia.com
• GitHub: https://github.com/rapidsai
• Docker Hub: https://hub.docker.com/u/rapidsai
• Conda: https://anaconda.org/rapidsai
• PIP available at a later date

Source code, libraries, and packages run on-premises or in the cloud.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Porting existing code: the scikit-learn-style CPU script carries over to the GPU nearly unchanged.

Training and query results (CPU vs. GPU):
• CPU: ~5 minutes
• GPU: ~7 seconds
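A minimal cuML sketch; synthetic data stands in for the benchmark dataset, and the API deliberately mirrors scikit-learn:

```python
import numpy as np
from cuml import PCA  # drop-in for sklearn.decomposition.PCA

# Synthetic stand-in for the workload on the slide
X = np.random.rand(1_000_000, 40).astype(np.float32)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # fit and transform run on the GPU

print(pca.explained_variance_ratio_)
```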
K-NEAREST NEIGHBORS (KNN)
Porting existing code: as with PCA, the CPU version ports to cuML with minimal changes.

Training and query results (CPU vs. GPU):
• CPU: ~9 minutes
• GPU: ~5 seconds
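The k-NN port looks much the same (again with synthetic stand-in data):

```python
import numpy as np
from cuml.neighbors import NearestNeighbors  # mirrors sklearn.neighbors

X = np.random.rand(1_000_000, 16).astype(np.float32)  # synthetic data

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)                                   # index built on the GPU

# Query the 5 nearest neighbors of the first ten rows
distances, indices = nn.kneighbors(X[:10])
print(indices.shape)  # (10, 5)
```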
BENCHMARKS
Time in seconds (shorter is better)

Configuration    cuDF (Load and Data Prep)   cuML (XGBoost)   End-to-End
20 CPU Nodes     2,741                       2,290            8,763
30 CPU Nodes     1,675                       1,956            6,147
50 CPU Nodes     715                         1,999            3,926
100 CPU Nodes    379                         1,948            3,221
DGX-2            42                          169              322
5x DGX-1         19                          157              213

Benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
NEMO
https://github.com/NVIDIA/NeMo

CLARA
DEEP LEARNING IS ENTERING THE CLINIC

[Figure: tumor probability maps (0.0-1.0) produced by deep learning models.]

DL DIAGNOSTIC DEVICES: Samsung & GEHC
1st FDA DL CLOUD ALGORITHM: Arterys
DL PATHOLOGY: Philips & PathAI
CLARA AI
Lowering the barriers to AI adoption

Inputs from MRI, CT, X-ray, and ultrasound feed the Clara AI loop: unlabeled data → data annotation → transfer learning → deployment, with reference pipelines created by data scientists.

• Rapid data curation
• Accurate models with less data
• Integration into existing workflows

CLARA AI
Pre-Trained Models * AI-Assisted Annotation * Transfer Learning * Ready to Integrate
CLARA AI
Intelligent compute platform for medical imaging

Clara Train SDK: sample training pipelines, AI-assisted annotation, transfer learning, pre-trained models, DICOM-to-NIfTI conversion, training
Clara Deploy SDK: DICOM adapter, web UI, sample deployment pipelines, pipeline manager, AI inference, streaming render; orchestrated on Kubernetes

Software layers: COMPUTE | ARTIFICIAL INTELLIGENCE | VISUALIZATION
Accelerated libraries: cuBLAS, cuFFT, NPP, NCCL, cuDNN, DALI, TensorRT, OptiX, IndeX, NVENC, all on CUDA

Hardware: Tesla/Quadro, NVIDIA DGX family, cloud
CLARA TRAIN SDK
DATA CONVERSION → TRAINING → MODEL OUTPUT

Sample training pipelines with AI-assisted annotation, transfer learning, DICOM-to-NIfTI conversion, training, and model export.

Pre-trained models:
1) Brain tumor        7) Pancreas and tumor
2) Liver and tumor    8) Colon tumor
3) Hippocampus        9) Hepatic vessel
4) Lung tumor        10) Spleen
5) Prostate          11) Heart
6) Left atrium       12) Chest X-ray

Runs on Kubernetes, on Tesla/Quadro, the NVIDIA DGX family, or cloud.
DEEPSTREAM 4.0.2
Accelerating Scalable IVA

INTELLIGENT VIDEO ANALYTICS FOR EFFICIENCY AND SAFETY
Access control, public transit, industrial inspection, traffic engineering, retail analytics, logistics, critical infrastructure, public safety
DEEPSTREAM - MANY INDUSTRIES, FLEXIBLE DEPLOYMENT

TRAIN with TLT → PUBLISH to NGC → deploy from edge to cloud: NVRs, on-prem servers, any cloud.
Industries: security, retail, construction, manufacturing.
Outputs: alerts, analytics, visualization.
DEEPSTREAM SOFTWARE STACK

Applications and Services
DEEPSTREAM SDK: hardware-accelerated plugins, reference applications & orchestration recipes, Docker containers, Azure IoT Runtime
CUDA-X: Kubernetes on GPUs, NVIDIA Container RT, CUDA, Multimedia, TensorRT
NVIDIA COMPUTING PLATFORM - EDGE TO CLOUD: JETSON | TESLA
IVA APPLICATION WORKFLOW

Sensors at the IoT edge capture pixels → pre-processing → AI metadata to detect and track → datacenter/cloud → analytics & visualization → insights
DEEPSTREAM GRAPH ARCHITECTURE

RTSP/RAW → DECODE → IMAGE PROCESSING → BATCHING → DNN(s) → TRACKING → VIZ → DISPLAY/STORAGE/CLOUD

Stage by stage: capture → decode → scale, dewarp, crop → stream management → detect, classify, segment → tracking → on-screen display → output.

Across the pipeline, work is distributed over the CPU, NVDEC, GPU, DLA, PVA, VIC, and ISP engines, with output over HDMI or to SATA storage.
WHAT'S NEW IN DEEPSTREAM 4.0

• Unified SDK across all platforms, from Jetson Nano to Tesla T4: easy to scale and maintain
• Turnkey IoT integration: Microsoft Azure IoT Hub*
• Docker containers on NGC
• Support for image segmentation: enabling industrial inspection
• Monochrome and JPEG support: enabling retail & supply chain solutions
• Plugin sources for inference, decode, and messaging: greater control for your use case

* DeepStream container now available on Azure Marketplace


PYTHON BINDINGS

• Alpha version of the Python bindings is now available
• Download the bindings from the NVIDIA Developer Zone
• Get the Python sample apps from GitHub
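A minimal sketch of a DeepStream pipeline driven from Python via GStreamer (the input file and inference config path are placeholders; the element names come from the plugin table below):

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Decode a file, batch it, run inference, draw overlays, and render.
# Each nv* element is a DeepStream hardware-accelerated plugin.
pipeline = Gst.parse_launch(
    "filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=detector_config.txt ! "
    "nvvideoconvert ! nvdsosd ! nveglglessink"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```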
ACHIEVING REAL-TIME PERFORMANCE

NVIDIA Product       H.264   H.265
Jetson Nano†         8       8
Jetson TX1†          8       8
Jetson TX2†          14      14
Jetson AGX Xavier*   32      49
Tesla T4*            35      68

Number of 1080p/30FPS streams captured and processed with AI.
† Object detection using a 4-class ResNet10 and no classifiers
* Object detection using a 4-class ResNet10 + three ResNet10 classifiers
DEEPSTREAM ACCELERATED PLUGINS

Plugin Name              Functionality
Gst-nvvideo4linux2       Hardware-accelerated decode and encode
Gst-nvinfer              DL inference for detection, classification, and segmentation
Gst-nvtracker            Reference object trackers: KLT, IOU, NvDCF
Gst-nvmsgbroker          Messaging to the cloud
Gst-nvstreammux          Stream aggregation, multiplexing, and batching
Gst-nvdsosd              Draws boxes and text overlay
Gst-nvmultistreamtiler   Renders frames in a 2D grid array
Gst-nveglglessink        Accelerated X11 / EGL rendering
Gst-nvvideoconvert       Scaling, format conversion, rotation
Gst-nvdewarp             Dewarping for 360-degree (fish-eye) cameras
Gst-nvmsgconv            Metadata generation
Gst-nvsegvisual          Visualizes segmentation results
Gst-nvof                 Hardware-accelerated optical flow
DEEPSTREAM ON JETSON NANO

SCALE WITH DEEPSTREAM IN DOCKER
Development and deployment containers available on NGC
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY

The NVIDIA Metropolis application framework runs on the NVIDIA Edge Stack and NVIDIA EGX Server, publishing through NGC to any cloud for analytics, visualization, and cloud monitoring.

68 streams of 1080p per T4: pixels → information → dashboard

Perception → Analytics → Visualization

VIDEO: INTELLIGENT TRAFFIC SYSTEM
AI-POWERED LOSS PREVENTION SLASHES SHRINK RATE

"Improving operational efficiency and reducing loss are key issues facing many retailers.

Today's large supermarkets have numerous in-store cameras, which can be used to mitigate these problems, but real-time video processing of so many streams can be a challenge.

By leveraging NVIDIA T4 GPUs, DeepStream and TensorRT, Malong's state-of-the-art Intelligent Video Analytics (IVA) solution achieves 3x higher throughput with industry-leading accuracy to help their retail customers significantly improve their business performance."
BRINGING REAL-TIME AI TO IOT

"Extracting actionable insights from a sea of data created by the world's billions of cameras and sensors is a huge task, and maintaining a connection from these devices to the cloud for processing may be overly expensive or infeasible due to security, regulatory, or bandwidth restrictions.

Microsoft Azure IoT Edge deploys applications and services built using DeepStream to edge devices, allowing organizations to process data locally to trigger alerts and take actions automatically, and to upload to the cloud when needed.

Combining Azure IoT Edge, NVIDIA DeepStream and Azure IoT Central brings device management, monitoring and custom business logic to millions of edge devices for real-time insights and easy deployment."
FULFILLMENT AND LOGISTICS MANAGEMENT WITH SMARTER VIDEO INSIGHTS

"As a leader in fulfillment and logistics management, SF Express needed to track goods and vehicles across tens of thousands of locations.

Every site requires detailed analytics around fleet management, loading times, and other operational activities.

Using DeepStream and NVIDIA GPUs, they were able to increase the efficiency of AI Argus, an intelligent video analytics product that brings smarter video insights and can process 32 video streams simultaneously."
START DEVELOPING WITH DEEPSTREAM

DEEPSTREAM | EXPLORE METROPOLIS | SUPPORT FORUMS
PARABRICKS
GPU-Accelerated Analysis of DNA Sequencing Data

CURRENT USE CASES

• Population: used for multiple national-scale population studies
• Clinical: used in cancer settings, for newborn babies, ...
• Plants/Animals: used for Cannabis, Coral, Seaweed, ...

INTRODUCTION

Primary Analysis → Secondary Analysis (alignment, pre-processing, variant calling) → Tertiary Analysis (clinical diagnosis, scientific conclusions)

Parabricks-accelerated secondary analysis:
• Uses GPUs (cloud/on-premise) for computing
• Runs the entire analysis on a single node
• Reduces the cost of computing significantly
PARABRICKS NGS ANALYTICS
(current tools plus tools under development)

• Alignment: BWA-MEM, STAR
• Preprocessing: coordinate sorting, MarkDups, BQSR
• Variant callers: HaplotypeCaller, Mutect2, DeepVariant, Manta, CNVKit
• Joint genotyping and filtering: ImportGVCF, MergeGVCF, GenotypeGVCF, VQSR, CNNScoreVariants, VariantFiltration, SelectVariants, VCF Tools
• Quality checking: AlignmentSummary, WGSMetrics, RawWGSMetrics, InsertSize, SequencingArtifact, BaseDistributionByCycle, GcBias, MeanQualityByCycle, QualityScoreDistribution
• RNA: STAR-Fusion, kallisto
• Tertiary analysis: mutation signature, TMB & MSI


GERMLINE PIPELINE

BWA-MEM Alignment → Coordinate Sorting → Picard MarkDups → BQSR → HaplotypeCaller


PERFORMANCE COMPARISON

[Chart: execution time in minutes for five whole-genome samples: Sample1 (26X), Sample2 (42X), Sample3 (41X), NA12878 (43X), NIST (41X). The 32-vCPU baseline takes roughly 1,870-3,125 minutes per sample; the 8x V100 GPU server and AWS p3.16xlarge finish the same samples in 38-72 minutes.]
FEATURES

• 35-50x faster pipeline
• 30x whole genome in under 45 minutes
• Nearly 40 genomes/day in throughput mode (DGX-1)
• 100% reproducible and deterministic
• Flexible pipeline (add and remove steps easily)


SCALING

[Chart: speedup vs. number of GPUs, from 0 to 8 GPUs; speedup axis from 0 to 50x.]
DEEP LEARNING IN GENOMICS

DeepVariant
GOOGLE DEEPVARIANT

BAM → Data Preparation (make_examples): generate candidate variants and 6-dimensional pileup images, written as tfrecords
→ Inference (call_variants): a CNN classifies each candidate pileup
→ Create Output (post_process): mutations are written to a VCF
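A toy sketch of the pileup-image idea behind make_examples (array shapes and channel choices are illustrative, not DeepVariant's exact encoding):

```python
import numpy as np

# Toy pileup tensor for one candidate site: reads x window x channels.
# DeepVariant encodes channels such as base identity, base quality,
# mapping quality, and strand; the exact encoding differs.
READS, WINDOW, CHANNELS = 100, 221, 6
pileup = np.zeros((READS, WINDOW, CHANNELS), dtype=np.float32)

# Fill one read's worth of illustrative features
pileup[0, :, 0] = np.random.randint(0, 4, WINDOW) / 3.0  # base identity
pileup[0, :, 1] = np.random.uniform(0.9, 1.0, WINDOW)    # base quality
pileup[0, :, 2] = 0.95                                    # mapping quality

# The CNN in call_variants consumes batches of such tensors and outputs
# genotype likelihoods per candidate.
print(pileup.shape)  # (100, 221, 6)
```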


PARABRICKS PERFORMANCE (40X)

[Chart: execution time in minutes per pipeline stage (BWA + preprocessing, HaplotypeCaller, Mutect2, GenotypeGVCF, DeepVariant) on a 40X genome. The CPU server with 32 vCPUs takes roughly 330-1,100 minutes per stage; the GPU server with 8x V100 takes roughly 14-45 minutes.]

BAM: 100% matching, VCF: 99.99% matching
COMPARISON: PARABRICKS (NVIDIA GPUs) VS. HW SOLUTIONS (FPGA)

Parabricks advantages over FPGA-based hardware solutions:
• Speed
• Cost
• Same results as the base tools (GATK, DeepVariant, ...)
• Up to date (new versions of tools will be supported by Parabricks)
• Deep learning integration (DeepVariant, CNNScoreVariant, training with your own data, ...)
• General-purpose hardware
• Ease of use & flexibility
• Collaboration for new tools and pipelines
THANK YOU

Sunil Patel, Sr. Solution Architect – Deep Learning
supatel@nvidia.com