Accelerating Data Science With GPUs
AGENDA
• Session 1:
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity
CUSTOMER USE CASES: Speech | Translate | Recommender | Healthcare | Manufacturing | Finance | Molecular Simulations | Weather Forecasting | Seismic Mapping | Creative & Technical Workers | Knowledge Workers
TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA HGX | Every OEM | Every Major Cloud
ONE ARCHITECTURE – MULTIPLE USE CASES THROUGH NVIDIA SDKs
CLARA for Medical Imaging | DEEPSTREAM for Video Analytics | RAPIDS for Machine Learning
ISAAC for Robotics | DRIVE for Autonomous Vehicles | VRWorks for Virtual Reality
RAPIDS
GPU-ACCELERATED DATA SCIENCE
Use Cases in Every Industry
CONSUMER INTERNET: Ad Personalization | Click-Through Rate Optimization | Churn Reduction
OIL & GAS: Sensor Data Tag Mapping | Anomaly Detection | Robust Fault Prediction
HEALTHCARE: Improve Clinical Care | Drive Operational Efficiency | Speed Up Drug Discovery
TELCO: Detect Network/Security Anomalies | Forecast Network Performance | Network Resource Optimization (SON)
RETAIL: Supply Chain & Inventory Management | Price Management / Markdown Optimization | Promotion Prioritization & Ad Targeting
AUTOMOTIVE: Personalization & Intelligent Customer Interactions | Connected Vehicle Predictive Maintenance | Forecasting, Demand & Capacity Planning
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science
ARTIFICIAL INTELLIGENCE
Analytics | Traditional Machine Learning (regressions, decision trees, graph) | Deep Learning
DATA SCIENCE
ML WORKFLOW STIFLES INNOVATION
WHAT IS RAPIDS?
The New GPU Data Science Pipeline
rapids.ai
Built on CUDA
Scikit-learn-like API
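RAPIDS is designed so that cuDF mirrors the pandas DataFrame API and cuML mirrors scikit-learn's estimator API. As a sketch of that call pattern, the snippet below uses pandas and scikit-learn as CPU stand-ins; on a GPU machine essentially the same code shape runs with `cudf` and `cuml` imports instead. The column names and data are invented for illustration.

```python
# cuDF mirrors pandas DataFrames and cuML mirrors scikit-learn estimators.
# This CPU sketch shows the call pattern; on a GPU box you would swap the
# imports below for cudf and cuml and keep the rest of the code unchanged.
import pandas as pd                                 # stand-in for: import cudf
from sklearn.linear_model import LinearRegression   # stand-in for: from cuml import LinearRegression

# Build a small DataFrame (cuDF accepts the same constructor).
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                   "x2": [0.0, 1.0, 0.0, 1.0],
                   "y":  [3.0, 6.0, 7.0, 10.0]})

model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])     # same fit/predict API in cuML
pred = model.predict(df[["x1", "x2"]])
print(round(float(pred[0]), 2))          # → 3.0 (data fits y = 2*x1 + x2 + 1 exactly)
```

Because the APIs track each other so closely, much existing pandas/scikit-learn code ports to RAPIDS by changing imports rather than rewriting logic.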
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
DATA PREDICTIONS
VISUALIZATION (cuGraph)
• Effortless exploration of datasets: billions of records in milliseconds
• Dynamic interaction with data = faster ML model development
• Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS
DAY IN THE LIFE OF A DATA SCIENTIST
GPU-ACCELERATED MACHINE LEARNING CLUSTER
DGX-2 and RAPIDS for Predictive Analytics
1 DGX-2 | 10 kW | 1/8 the Cost | 1/15 the Space | 1/18 the Power
[Chart: end-to-end runtime of 1 DGX-2 and 5x DGX-1 vs. 20, 30, 50, and 100 CPU nodes]
FASTER SPEEDS, REAL WORLD BENEFITS
[Charts: cuIO/cuDF load and data preparation (100 CPU nodes: 379), cuML XGBoost (100 CPU nodes: 1,948), and end-to-end runtimes]
>80% Accuracy & Immediate Alert to Radiologists | 50% Reduction in Emergency Road Repair Costs | >$6M / Year Savings and Reduced Risk of Outage
NVIDIA BREAKS RECORDS IN AI PERFORMANCE
MLPerf Records Both At Scale And Per Accelerator
Per-accelerator comparison uses reported MLPerf 0.6 performance for NVIDIA DGX-2H (16 V100s) compared to other submissions at the same scale, except for MiniGo, where the NVIDIA DGX-1 (8 V100s) submission was used.
MLPerf ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10
NVIDIA DGX SUPERPOD BREAKS AT-SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark
MLPerf 0.6 Performance at Max Scale | MLPerf ID at Scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
UP TO 80% MORE PERFORMANCE ON SAME SERVER
Software Innovation Delivers Continuous MLPerf Improvements
[Chart: Image Classification RN50 v1.5 | Translation (non-recurrent) Transformer | Object Detection (Light Weight) SSD | Translation (recurrent) GNMT | Object Detection (Heavy Weight) Mask R-CNN]
Comparing the throughput of a single DGX-2H server on a single epoch (one pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20 | SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
DRAMATICALLY MORE FOR YOUR MONEY
Same Throughput | 1/18 the Power | 1/30 the Space
NVIDIA DGX-2
Designed To Train The Previously Impossible
• Two HGX-2 GPU motherboards, interconnected by a plane card
• 8 NVIDIA Tesla V100 32GB GPUs per board
• 6 NVSwitches per board
• 512GB total HBM2 memory
TESLA V100 TENSOR CORE GPU
World’s Most Powerful Data Center GPU
5,120 CUDA cores | 640 Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM register file | 16MB cache
32 GB HBM2 @ 900GB/s | 300GB/s NVLink
NVSWITCH
World’s Highest Bandwidth On-node Switch
7.2 Terabits/sec or 900 GB/sec
18 NVLink ports | 50GB/s per port, bidirectional
Fully connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records
Largest Transformer-Based Model Ever Trained
[Chart: EXPLODING MODEL SIZE, number of parameters by network: Image Recognition 26M | NLP (Q&A, Translation) 340M | 1.5Bn | GPT-2 8B with 8.3Bn parameters, the largest Transformer-based model trained]
[Chart: CONVERSATIONAL AI RECORDS: BERTLARGE trained in 53 minutes, fastest training; BERTBASE 2.2ms latency, fastest inference (18X faster than CPU)]
[Chart: near-linear training scaling (20X, 40X, 80X) from 0 to 1,500 V100 GPUs]
Complexity to Train | Code Available on GitHub | Requires Leading AI Infrastructure
BERTLARGE Training Record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H servers | 8 Mellanox InfiniBand adapters per node
BERTBASE Inference Record: SQuAD dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2
Training scaling measured on BERT | Speedups show performance scaling on 1x, 16x, 64x, and 92x DGX-2H servers with 16 NVIDIA V100 GPUs each
ML/DL INFRASTRUCTURE
AI PLATFORM CONSIDERATIONS
Factors impacting deep learning platform decisions
COMPARING AI COMPUTE ALTERNATIVES
Looking beyond the “spec sheet”
Innovation
Hardware Architecture
NVIDIA DGX POD™: HIGH-DENSITY COMPUTE REFERENCE ARCHITECTURE
• NVIDIA DGX POD
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)
Rack
• Framework? Libraries? O/S? GPU? Drivers? Open source / forum support only
SUPPORTING AI WITH DGX REFERENCE ARCHITECTURE SOLUTIONS
“Update to PyTorch
container XX.XX”
THE NEW NGC
GPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
NGC: 50+ containers for DL, ML, and HPC
• Model Training Scripts: NLP, Image Classification, Object Detection & more
• Pre-trained Models: NLP, Classification, Object Detection & more
• Industry Workflows: Medical Imaging, Intelligent Video Analytics
Simplify Deployments | Innovate Faster | Deploy Anywhere
ngc.nvidia.com
Solving the complexity of managing distributed computing on GPUs
AGENDA
• Session 2:
ML Pipelines: Kubeflow
Services: TensorFlow | PyTorch | RAPIDS | Dask | Pandas | Spark | Presto | Prometheus | Grafana
Shared resources: DL workloads, ML workloads, data & compute, and model inferencing, with persistent GPU sharing
DEMO
Q&A
Optimize GPU sharing
Enabling GPU at scale
Supporting DGX cluster
Running models in an inferencing layer with GPU
Quick deployment of models in a serving layer
§ Models run as functions at scale on a GPU cluster
§ High-performance parallel execution engine
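A minimal, stdlib-only sketch of the "models as functions" idea above: each model is a plain function behind a router, and a worker pool fans requests out in parallel the way a serving layer spreads them across GPU workers. The names here (`MODELS`, `serve`) are illustrative, not the actual serving API.

```python
# Sketch: models served as functions, dispatched in parallel by a worker pool.
from concurrent.futures import ThreadPoolExecutor

MODELS = {
    "double": lambda x: 2 * x,   # each "model" is just a callable
    "square": lambda x: x * x,
}

def serve(model_name, payload):
    """Route a request to the named model function."""
    return MODELS[model_name](payload)

# The pool executes many requests concurrently, standing in for a cluster
# of GPU-backed replicas behind the serving layer.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(serve, "double", i) for i in range(5)]
    results = [f.result() for f in futures]

print(results)  # → [0, 2, 4, 6, 8]
```

Treating models as stateless functions is what lets the serving layer scale them horizontally: any free worker can execute any request.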
Ease of management and orchestration
Easy access to GPUs
§ Self-service on a managed platform
§ Job scheduling
Stack: PYTHON → RAPIDS & DEEP LEARNING FRAMEWORKS → CUDA
Serverless & GPU – Better Performance
§ Linear scalability
How we Enable Large Scale Data Processing on GPUs
DB + native support
Pipeline: Raw Data (Terabytes) → Filter (TBs) → Partition (100s of GBs) → Chunk (10s-1000s of chunks, 1-10s of GBs each) → Merge → Final Results
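The filter → partition → chunk → merge flow above can be sketched with plain Python lists; in the real pipeline each chunk would be a GPU-sized slice handled by a worker, and the helper names below are invented for illustration.

```python
# Sketch of the filter -> partition -> chunk -> merge pipeline on toy data.
def filter_rows(rows, predicate):
    """Filter stage: keep only rows matching the predicate."""
    return [r for r in rows if predicate(r)]

def partition(rows, key, nparts):
    """Partition stage: hash rows into nparts buckets by key."""
    parts = [[] for _ in range(nparts)]
    for r in rows:
        parts[key(r) % nparts].append(r)
    return parts

def chunk(part, size):
    """Chunk stage: split a partition into fixed-size slices."""
    return [part[i:i + size] for i in range(0, len(part), size)]

def merge(chunk_results):
    """Merge stage: combine per-chunk results into the final answer."""
    return sum(chunk_results)

raw = list(range(100))                                      # "raw data"
kept = filter_rows(raw, lambda r: r % 2 == 0)               # filter
parts = partition(kept, key=lambda r: r, nparts=4)          # partition
chunk_sums = [sum(c) for p in parts for c in chunk(p, 5)]   # per-chunk work
print(merge(chunk_sums))  # → 2450, the sum of the even numbers 0..98
```

The per-chunk work is embarrassingly parallel, which is why shrinking data to GPU-memory-sized chunks before dispatch is the key step in the pipeline.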
Value for NVIDIA customers
§ Speed up data science projects
o Immediate access to GPU (training and inferencing)
§ Improved performance by 4x
o By creating a shared resource pool and load balancing across all GPUs
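One common way to realize the shared-pool load balancing mentioned above is least-loaded scheduling. The sketch below is a stdlib illustration of that idea (the GPU count and job costs are made up); it is not the platform's actual scheduler.

```python
# Least-loaded scheduling over a shared GPU pool, using a min-heap of
# (current load, gpu_id) pairs.
import heapq

def assign(jobs, n_gpus):
    """Assign each job (given as a cost) to the currently least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(n_gpus)]  # (load, gpu_id)
    heapq.heapify(heap)
    placement = []
    for cost in jobs:
        load, gpu = heapq.heappop(heap)   # pick the least-loaded GPU
        placement.append(gpu)
        heapq.heappush(heap, (load + cost, gpu))
    return placement

print(assign([5, 3, 2, 4], n_gpus=2))  # → [0, 1, 1, 0]
```

Pooling plus a policy like this keeps every GPU busy instead of binding each job to a fixed device, which is where the claimed utilization gains come from.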
Integrated GPU monitoring (coming soon)
§ Advanced troubleshooting
Identify which service/app is utilizing the GPU resource
Thank You
anantg@iguazio.com | www.iguazio.com