
ACCELERATING DATA SCIENCE WITH GPUs
AGENDA
• Session 1:
  ○ NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine Learning

Dr. Gabriel Noaje
Senior Solutions Architect
E-mail: gnoaje@nvidia.com
http://bit.ly/GabrielNoaje
AGENDA
• Session 2:
  ○ Data Science as a service using GPUs
  ○ Demo

• Anant Gandhi, Solutions Engineer, Iguazio
  He has 12 years of experience helping customers across banking, aerospace and telecom, with expertise in the big data and analytics ecosystem.
  https://www.linkedin.com/in/anant-gandhi-b5447614/
NVIDIA ACCELERATED SOLUTIONS FOR
DEEP LEARNING AND MACHINE LEARNING
Dr. Gabriel Noaje
Senior Solutions Architect, APAC South
gnoaje@nvidia.com
NVIDIA
The AI Computing Company

[Slide graphic: GAMING, TRANSPORTATION, DESIGN and HEALTHCARE market segments alongside HPC, DEEP LEARNING and MACHINE LEARNING (Scientific Computing, AI and Data Analytics), spanning visualization and industry verticals]
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES: Speech | Translate | Recommender | Healthcare | Manufacturing | Finance | Molecular Simulations | Weather Forecasting | Seismic Mapping | Creative & Technical | Knowledge Workers
(Consumer internet & industry applications, scientific applications, virtual graphics)

APPS & FRAMEWORKS: Amber, NAMD, and 600+ applications
(Machine learning, deep learning, HPC, virtual GPU)

CUDA-X & NVIDIA SDKs: cuDF | cuML | cuGRAPH | cuDNN | CUTLASS | TensorRT | OpenACC | cuFFT | vDWS | vPC | vAPPS
CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA HGX | every OEM | every major cloud
ONE ARCHITECTURE – MULTIPLE USE CASES THROUGH NVIDIA SDKs

CLARA for Medical Imaging | DEEPSTREAM for Video Analytics | RAPIDS for Machine Learning
ISAAC for Robotics | DRIVE for Autonomous Vehicles | VRWorks for Virtual Reality
RAPIDS
GPU-ACCELERATED DATA SCIENCE
Use Cases in Every Industry

CONSUMER INTERNET: Ad Personalization | Click-Through Rate Optimization | Churn Reduction
OIL & GAS: Sensor Data Tag Mapping | Anomaly Detection | Robust Fault Prediction
FINANCIAL SERVICES: Claim Fraud | Customer Service Chatbots/Routing | Risk Evaluation
MANUFACTURING: Remaining Useful Life Estimation | Failure Prediction | Demand Forecasting
HEALTHCARE: Improve Clinical Care | Drive Operational Efficiency | Speed Up Drug Discovery
TELCO: Detect Network/Security Anomalies | Forecast Network Performance | Network Resource Optimization (SON)
RETAIL: Supply Chain & Inventory Management | Price Management / Markdown Optimization | Promotion Prioritization and Ad Targeting
AUTOMOTIVE: Personalization & Intelligent Customer Interactions | Connected Vehicle Predictive Maintenance | Forecasting, Demand & Capacity Planning
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science

[Diagram: data science spans analytics, traditional machine learning (regressions, decision trees, graph) on tabular/sparse data, and deep learning on dense data types (images, video, voice), all under the umbrella of artificial intelligence]
ML WORKFLOW STIFLES INNOVATION

[Workflow diagram: Data Sources → ETL → Data Lake → Data Preparation (wrangle data) → Train → Evaluate → Deploy → Predictions]

Time-consuming, inefficient workflow that wastes data science productivity
WHAT IS RAPIDS?
The New GPU Data Science Pipeline

rapids.ai

• Suite of open-source, end-to-end data science tools
• Built on CUDA
• Pandas-like API for data cleaning and transformation
• Scikit-learn-like API
• A unifying framework for GPU data science
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

DATA PREPARATION (cuDF)
• GPU-accelerated compute for in-memory data preparation
• Simplified implementation using familiar data science tools
• Python drop-in Pandas replacement built on CUDA C++; GPU-accelerated Spark
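To make the cuDF step concrete, here is a minimal sketch of the pandas-like API (not taken from the slides; the file name and column names are illustrative, and it assumes a working RAPIDS/cuDF installation):

```python
import cudf

# Read a CSV straight into GPU memory (file and columns are hypothetical)
df = cudf.read_csv("transactions.csv")

# Familiar pandas-style cleaning and transformation, executed on the GPU
df = df.dropna(subset=["amount"])
df["amount_usd"] = df["amount"] * df["fx_rate"]
summary = df.groupby("customer_id").agg({"amount_usd": "sum"})

# Convert back to pandas only when a CPU-side library needs the result
print(summary.head().to_pandas())
```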
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

MODEL TRAINING (cuML)
• GPU acceleration of today's most popular ML algorithms
• XGBoost, PCA, Kalman filtering, K-means, k-NN, DBSCAN, tSVD, and more
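As an illustration of the scikit-learn-like estimator API, a minimal sketch of GPU K-means with cuML (toy data; assumes cuML is installed, and is not from the slides):

```python
import cudf
from cuml.cluster import KMeans

# Tiny illustrative feature table held in GPU memory
df = cudf.DataFrame({
    "x": [0.10, 0.20, 0.15, 5.00, 5.20, 5.10],
    "y": [1.00, 1.10, 0.90, 9.00, 9.20, 8.90],
})

# Scikit-learn-style estimator, fitted entirely on the GPU
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(df)

print(kmeans.cluster_centers_)   # cluster centroids
print(kmeans.predict(df))        # cluster assignment per row
```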
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

VISUALIZATION (cuGRAPH)
• Effortless exploration of datasets: billions of records in milliseconds
• Dynamic interaction with data = faster ML model development
• Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS
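For the graph side of the stack, a minimal cuGraph sketch (toy edge list; assumes cuGraph is installed, and PageRank is chosen only as a representative algorithm, not one the slide names):

```python
import cudf
import cugraph

# Toy edge list kept in GPU memory (values are illustrative)
edges = cudf.DataFrame({
    "src": [0, 1, 2, 2, 3],
    "dst": [1, 2, 0, 3, 0],
})

# Build a directed graph and run PageRank entirely on the GPU
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).to_pandas())
```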
DAY IN THE LIFE OF A DATA SCIENTIST

[Slide graphic: two clock faces contrast a CPU-powered workflow with a GPU-powered workflow across the stages Dataset Collection → Analysis → Data Prep → Train → Inference]

CPU-powered workflow: dataset downloads overnight; configure and start the data prep workflow; get a coffee; find unexpected null values stored as strings; restart the data prep workflow; get another coffee; @*#! forgot to add a feature; restart data prep again; switch to decaf; train, validate and test the model; stay late.

GPU-powered workflow: dataset downloads overnight; start data prep; get a coffee; train, validate and test the model; experiment with optimizations and repeat; go home on time.
TRADITIONAL
DATA SCIENCE
CLUSTER
Workload Profile:
Fannie Mae Mortgage Data:
• 192GB data set
• 16 years, 68 quarters
• 34.7 Million single family mortgage loans
• 1.85 Billion performance records
• XGBoost training set: 50 features

300 Servers | $3M | 180 kW

GPU-ACCELERATED MACHINE LEARNING CLUSTER
DGX-2 and RAPIDS for Predictive Analytics

1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space | 1/18 the Power

[Chart: end-to-end runtime for 20, 30, 50 and 100 CPU nodes versus a DGX-2 and 5x DGX-1]
FASTER SPEEDS, REAL WORLD BENEFITS
Time in seconds (shorter is better)

Configuration    cuIO/cuDF load & data preparation    cuML XGBoost
20 CPU Nodes     2,741 s                              2,290 s
30 CPU Nodes     1,675 s                              1,956 s
50 CPU Nodes     715 s                                1,999 s
100 CPU Nodes    379 s                                1,948 s
DGX-2            42 s                                 169 s
5x DGX-1         19 s                                 157 s

(A third chart shows end-to-end times, combining cuIO/cuDF load and data preparation, data conversion, and XGBoost training.)

Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes with 61 GB of memory, 8 vCPUs, 64-bit platform, running Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
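For context on the XGBoost step in this benchmark, a minimal sketch of GPU-accelerated training with cuDF input (the file, columns and parameters are illustrative and do not reproduce the benchmark's actual pipeline):

```python
import cudf
import xgboost as xgb

# Features prepared with cuDF stay in GPU memory (file/columns are hypothetical)
df = cudf.read_csv("mortgage_features.csv")
X = df.drop(columns=["default_flag"])
y = df["default_flag"]

# DMatrix accepts cuDF inputs; 'gpu_hist' selects the GPU tree-building algorithm
dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",
    "max_depth": 8,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```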
GTC2019 RAPIDS TRAINING CONTENT
S9801 - RAPIDS: Deep Dive Into How the Platform Works
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9801-rapids-deep-dive-into-how-the-platform-works.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9801/

S9577 - RAPIDS: The Platform Inside and Out


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9577-rapids-the-platform-inside-and-out.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9577/

S9793 - cuDF: RAPIDS GPU-Accelerated Data Frame Library


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9793-cudf-rapids-gpu-accelerated-data-frame-library.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9793/

S91043 - RAPIDS CUDA DataFrame Internals for C++ Developers


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s91043-rapids-cuda-dataframe-internals-for-c++-developers.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S91043/

S9817 - RAPIDS cuML: A Library for GPU Accelerated Machine Learning


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9817-rapids-cuml-a-library-for-gpu-accelerated-machine-learning.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9817/

S9783 - Accelerating Graph Algorithms with RAPIDS


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9783-accelerating-graph-algorithms-with-rapids.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9783/
Many more sessions: 26 GTC 2019 sessions cover RAPIDS-related topics.
DEEP LEARNING

AI TRANSFORMING EVERY INDUSTRY

HEALTHCARE: >80% accuracy and immediate alerts to radiologists
INFRASTRUCTURE: 50% reduction in emergency road repair costs
IOT: >$6M / year savings and reduced risk of outage
NVIDIA BREAKS RECORDS IN AI PERFORMANCE
MLPerf Records Both At Scale And Per Accelerator

Record Type                        Benchmark                                     Record
Max Scale (Minutes to Train)       Object Detection (Heavy Weight) Mask R-CNN    18.47 min
                                   Translation (Recurrent) GNMT                  1.8 min
                                   Reinforcement Learning (MiniGo)               13.57 min
Per Accelerator (Hours to Train)   Object Detection (Heavy Weight) Mask R-CNN    25.39 hrs
                                   Object Detection (Light Weight) SSD           3.04 hrs
                                   Translation (Recurrent) GNMT                  2.63 hrs
                                   Translation (Non-recurrent) Transformer       2.61 hrs
                                   Reinforcement Learning (MiniGo)               3.65 hrs

Per Accelerator comparison uses reported performance for MLPerf 0.6 on NVIDIA DGX-2H (16 V100s) compared to other submissions at the same scale, except for MiniGo, where the NVIDIA DGX-1 (8 V100s) submission was used.
MLPerf ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10
NVIDIA DGX SUPERPOD BREAKS AT-SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark

[Chart: MLPerf 0.6 at-scale submissions, minutes to train (lower is better), comparing NVIDIA GPU, Google TPU and Intel CPU results across Image Classification (ResNet-50 v1.5), Translation Non-recurrent (Transformer), Translation Recurrent (GNMT), Object Detection Light Weight (SSD), Reinforcement Learning (MiniGo, no TPU submission) and Object Detection Heavy Weight (Mask R-CNN); every NVIDIA benchmark trains in under 20 minutes, e.g. ResNet-50 in 1.33 minutes, GNMT in 1.8 minutes, MiniGo in 13.57 minutes and Mask R-CNN in 18.47 minutes]
MLPerf 0.6 Performance at Max Scale | MLPerf ID at Scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
UP TO 80% MORE PERFORMANCE ON SAME SERVER
Software Innovation Delivers Continuous MLPerf Improvements

[Chart: MLPerf on a DGX-2 server, relative speedup of MLPerf 0.6 over MLPerf 0.5 after 7 months of software improvements, ranging from 1.2x to 1.8x across Image Classification (ResNet-50 v1.5), Translation Non-recurrent (Transformer), Object Detection Light Weight (SSD), Translation Recurrent (GNMT) and Object Detection Heavy Weight (Mask R-CNN)]

Comparing the throughput of a single DGX-2H server on a single epoch (a single pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20 | SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
DRAMATICALLY MORE FOR YOUR MONEY

Same throughput for deep learning training (ResNet-50 image training):
• CPU-only cluster: 300 self-hosted Broadwell CPU servers, 180 kW
• GPU-accelerated: 1 DGX-2, 10 kW

1/8 the cost | 1/18 the power | 1/30 the space
NVIDIA DGX-2
Designed To Train The Previously Impossible

1. Sixteen NVIDIA Tesla V100 32GB GPUs, 512 GB total HBM2 memory, interconnected by a plane card
2. Two HGX-2 GPU motherboards: 8 V100 32GB GPUs and 6 NVSwitches per board
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR InfiniBand / 100 GigE adapters: 1,600 Gb/sec total bi-directional bandwidth
5. Two Intel Xeon Platinum CPUs
6. 1.5 TB system memory
7. 30 TB NVMe SSD internal storage
8. Two high-speed Ethernet ports: 10/25/40/100 GigE
TESLA V100 TENSOR CORE GPU
World's Most Powerful Data Center GPU
• 5,120 CUDA cores
• 640 Tensor Cores
• 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
• 20 MB SM register file | 16 MB cache
• 32 GB HBM2 @ 900 GB/s | 300 GB/s NVLink

NVSWITCH
World's Highest Bandwidth On-node Switch
• 7.2 Terabits/sec or 900 GB/sec
• 18 NVLink ports | 50 GB/s per port, bi-directional
• Fully-connected crossbar
• 2 billion transistors | 47.5 mm x 47.5 mm package
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records | Largest Transformer-Based Model Ever Trained

EXPLODING MODEL SIZE (number of parameters by network):
• Image recognition: ~26M
• NLP (Q&A, translation): ~340M
• NLP, generative tasks (chatbots, auto-completion): 1.5Bn (GPT-2) up to 8.3Bn (GPT-2 8B, the largest Transformer-based model trained)

CONVERSATIONAL AI RECORDS (code available on GitHub):
• BERT-Large training record: 53 minutes (fastest training)
• BERT-Base inference record: 2.2 ms latency (18x faster than CPU)

TRAINING GPUs: near-linear scaling, requires leading AI infrastructure
[Chart: normalized training speedup (1/time) of up to ~80x when scaling BERT across up to ~1,500 V100 GPUs]

BERT-Large training record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H servers | 8 Mellanox InfiniBand adapters per node
BERT-Base inference record: SQuAD dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2
Scaling training performance on BERT: speedups show performance scaling on 1x, 16x, 64x and 92x DGX-2H servers with 16 NVIDIA V100 GPUs each
ML/DL
INFRASTRUCTURE
AI PLATFORM CONSIDERATIONS
Factors impacting deep learning platform decisions

DEVELOPER PRODUCTIVITY: "Must get started now, line of business wants to deliver results yesterday"
SCALING PERFORMANCE: "I want the most GPU bang for the buck"
TOTAL COST OF OWNERSHIP: "I have a limited budget, need the lowest up-front cost possible"
COMPARING AI COMPUTE ALTERNATIVES
Looking beyond the "spec sheet"

Evaluation criteria, from the top of the stack down:
• AI/DL expertise & innovation
• AI/DL software stack
• Operating system image
• Hardware architecture
NVIDIA DGX POD™: HIGH-DENSITY COMPUTE REFERENCE ARCHITECTURE
• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Based on the proven SATURNV architecture

Nine DGX-1 servers
• Eight Tesla V100 GPUs each
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)

Twelve storage nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB total HDD)
• 50 GbE networking

Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)

Rack (4-POD design with cooling)
• 35 kW power
• 42U x 1200 mm x 700 mm (minimum)
• Rear door cooler
NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC

Test bed for highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50

Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software

Integrates fully optimized software stacks
• Freely available through NGC

System: 96 DGX-2H | 10 Mellanox EDR IB adapters per node | 1,536 V100 Tensor Core GPUs | 1 megawatt of power
SUPPORTING AI: ALTERNATIVE APPROACHES
Multiple paths to problem resolution

[Diagram: when a problem hits a running deployment, the team must chase it across separate support channels: framework and libraries (open source / forums), O/S (open source / forums), GPU and drivers, and server, network and storage (server, storage & network solution providers)]
SUPPORTING AI WITH DGX REFERENCE ARCHITECTURE SOLUTIONS

[Diagram: with a DGX reference-architecture solution (DGX servers plus partner storage), the IT admin reports "My PyTorch CNN model is running 30% slower than yesterday!" to a single support path; NPN partner and NVIDIA AI expertise respond ("Update to PyTorch container XX.XX") and the workload is running again]
THE NEW NGC
GPU-optimized Software Hub, Simplifying DL, ML and HPC Workflows

• 50+ containers: DL, ML, HPC
• Pre-trained models: NLP, classification, object detection & more
• Model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics

Simplify deployments | Innovate faster | Deploy anywhere
ngc.nvidia.com
Solving the complexity of managing distributed computing on GPUs

AGENDA
• Session 2:
  ○ Data Science as a service using GPUs
  ○ Demo
Iguazio: Integrated and Open Data Science Platform

[Platform diagram]
• ML pipelines: Kubeflow
• Serverless functions & notebooks: Nuclio, Jupyter Notebook
• Services: TensorFlow, PyTorch, RAPIDS, Dask, Pandas, Spark, Presto, Prometheus, Grafana
• Shared resources: data, persistent and GPU compute, GPU sharing
• Workloads: DL workloads, ML workloads, model inferencing
DEMO
Q&A
Optimize GPU sharing
Enabling GPU at scale

§ A quick way for data scientists to work on a cluster of GPUs
  o Built-in integration with GPUs
  o No DevOps required

§ Frees GPU resources after the Jupyter notebook becomes idle

§ Maximizes the efficiency of GPU usage across the data science team
Supporting DGX clusters

§ Running data science workloads on a DGX cluster

§ Running Jupyter, Spark, TensorFlow and distributed Python on a DGX cluster

§ Monitoring jobs at the cluster level
Running models in an inferencing layer with GPUs
Quick deployment of models in a serving layer

§ Models run as functions at scale on a GPU cluster

§ High-performance parallel execution engine

§ Easy control of GPU resources per function

§ Quick deployment of models from Jupyter notebooks
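As a sketch of what "models running as functions" can look like, here is a minimal Nuclio-style Python handler (the dummy model, JSON field names and response shape are illustrative, not Iguazio's actual code):

```python
import json

class DummyModel:
    """Stand-in for a real trained model (illustrative only)."""
    def predict(self, rows):
        return [sum(row) for row in rows]

model = None

def init_context(context):
    # Nuclio calls init_context once per function replica;
    # load the model here so each request only pays for inference.
    global model
    model = DummyModel()

def handler(context, event):
    # Each HTTP invocation arrives as `event`; the field name is hypothetical.
    features = json.loads(event.body)["features"]
    prediction = model.predict([features])[0]
    return context.Response(
        body=json.dumps({"prediction": prediction}),
        content_type="application/json",
        status_code=200,
    )
```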
Ease of management and orchestration
Easy access to GPUs

§ Self-service on a managed platform

§ Job scheduling

§ Cloud experience for on-prem

§ Full and open data science environment at the click of a button

§ Built-in integration for Jupyter and GPUs
Advanced integration with RAPIDS

§ Direct writes/reads into/from the GPU's memory using RAPIDS data frames
  o Users can read data from the database and analyze it directly on the GPU without any intermediate layer
§ Streaming data in chunks directly into the GPU
§ Full parallelism: multiple nodes can read data, each reading only its own shard

[Stack diagram: Python and deep learning frameworks on top of RAPIDS (Dask, cuDF, cuML, cuGraph, cuDNN), built on CUDA, with Apache Arrow on GPU memory]
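A minimal sketch of the sharded, parallel-read pattern described above, using Dask with cuDF (the scheduler address, dataset path and column names are illustrative, and it assumes a Dask cluster with GPU workers is already available, for example one provisioned by the platform):

```python
import dask_cudf
from dask.distributed import Client

# Connect to an existing Dask cluster with GPU workers (address is hypothetical)
client = Client("tcp://dask-scheduler:8786")

# Each worker reads only its own shard/partition of the dataset into GPU memory
ddf = dask_cudf.read_parquet("/data/events/*.parquet")

# Transformations run in parallel, one partition per GPU worker
daily_totals = ddf.groupby("day")["value"].sum().compute()
print(daily_totals.head())
```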
Serverless & GPU: Better Performance

§ Iguazio's serverless functions (Nuclio) improve GPU utilization and sharing, resulting in almost four times faster (4x) application performance compared to using NVIDIA GPUs within monolithic architectures.

§ Linear scalability
How we Enable Large Scale Data Processing on GPUs
DB + native support

[Pipeline diagram: raw data measured in terabytes is filtered, partitioned and split into chunks of roughly 1-10 GB (tens to thousands of chunks); each chunk is processed on the GPU and the per-chunk results are merged into the final results]
Value for NVIDIA customers

§ Speed up data science projects
  o Immediate access to GPUs (training and inferencing)

§ Increase overall GPU utilization (90%)
  o Helping customers maximize their GPU utilization

§ Fully managed PaaS with built-in GPU integration
  o Application provisioning, orchestration and managed notebooks enable training at scale on a shared GPU cluster
  o Tight integration with NVIDIA TensorRT, RAPIDS and DeepOps

§ Simplify management of GPUs & DGX
  o Automated workflow for a continuous data science pipeline

§ Improve performance by 4x
  o By creating a shared resource pool and load balancing across all GPUs
Integrated GPU monitoring (coming soon)

§ Built-in GPU monitoring dashboard integrated with NVIDIA DeepOps

§ Advanced troubleshooting: identify which service/app is utilizing the GPU resource
Thank You
anantg@iguazio.com | www.iguazio.com
