
ACCELERATING DATA SCIENCE WITH GPUs
AGENDA
• Session 1:
  ○ NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine Learning

Dr. Gabriel Noaje
Senior Solutions Architect
E-mail: gnoaje@nvidia.com
http://bit.ly/GabrielNoaje
AGENDA
• Session 2:
  ○ Data Science as a service using GPUs
  ○ Demo

• Anant Gandhi, Solutions Engineer, Iguazio
  He has 12 years of experience helping customers across banking, aerospace and telecom, with expertise in the big data and analytics ecosystem.
  https://www.linkedin.com/in/anant-gandhi-b5447614/
NVIDIA ACCELERATED SOLUTIONS FOR
DEEP LEARNING AND MACHINE LEARNING
Dr. Gabriel Noaje
Senior Solutions Architect, APAC South
gnoaje@nvidia.com
NVIDIA
The AI Computing Company

[Slide graphic: GAMING, TRANSPORTATION, DESIGN and HEALTHCARE market segments alongside HPC, DEEP LEARNING and MACHINE LEARNING (Scientific Computing, AI and Data Analytics), spanning visualization and industry verticals]
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES: Speech | Translate | Recommender | Healthcare | Manufacturing | Finance | Molecular Simulations | Weather Forecasting | Seismic Mapping | Creative & Technical | Knowledge Workers
(Consumer internet & industry applications, scientific applications, virtual graphics)

APPS & FRAMEWORKS: Amber, NAMD, and 600+ applications
(Machine learning, deep learning, HPC, virtual GPU)

CUDA-X & NVIDIA SDKs: cuDF | cuML | cuGRAPH | cuDNN | CUTLASS | TensorRT | OpenACC | cuFFT | vDWS | vPC | vAPPS
CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA HGX | every OEM | every major cloud
ONE ARCHITECTURE – MULTIPLE USE CASES THROUGH NVIDIA SDKs

CLARA for Medical Imaging | DEEPSTREAM for Video Analytics | RAPIDS for Machine Learning
ISAAC for Robotics | DRIVE for Autonomous Vehicles | VRWorks for Virtual Reality
RAPIDS
GPU-ACCELERATED DATA SCIENCE
Use Cases in Every Industry

CONSUMER INTERNET: Ad Personalization | Click-Through Rate Optimization | Churn Reduction
OIL & GAS: Sensor Data Tag Mapping | Anomaly Detection | Robust Fault Prediction
FINANCIAL SERVICES: Claim Fraud | Customer Service Chatbots/Routing | Risk Evaluation
MANUFACTURING: Remaining Useful Life Estimation | Failure Prediction | Demand Forecasting
HEALTHCARE: Improve Clinical Care | Drive Operational Efficiency | Speed Up Drug Discovery
TELCO: Detect Network/Security Anomalies | Forecast Network Performance | Network Resource Optimization (SON)
RETAIL: Supply Chain & Inventory Management | Price Management / Markdown Optimization | Promotion Prioritization and Ad Targeting
AUTOMOTIVE: Personalization & Intelligent Customer Interactions | Connected Vehicle Predictive Maintenance | Forecasting, Demand & Capacity Planning
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science

[Diagram: data science spans analytics, traditional machine learning (regressions, decision trees, graph) on tabular/sparse data, and deep learning on dense data types (images, video, voice), all under the umbrella of artificial intelligence]
ML WORKFLOW STIFLES INNOVATION

[Workflow diagram: Data Sources → ETL → Data Lake → Data Preparation (wrangle data) → Train → Evaluate → Deploy → Predictions]

Time-consuming, inefficient workflow that wastes data science productivity
WHAT IS RAPIDS?
The New GPU Data Science Pipeline

rapids.ai

• Suite of open-source, end-to-end data science tools
• Built on CUDA
• Pandas-like API for data cleaning and transformation
• Scikit-learn-like API
• A unifying framework for GPU data science
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

DATA PREPARATION (cuDF)
• GPU-accelerated compute for in-memory data preparation
• Simplified implementation using familiar data science tools
• Python drop-in Pandas replacement built on CUDA C++; GPU-accelerated Spark
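To make the cuDF step concrete, here is a minimal sketch of the pandas-like API (not taken from the slides; the file name and column names are illustrative, and it assumes a working RAPIDS/cuDF installation):

```python
import cudf

# Read a CSV straight into GPU memory (file and columns are hypothetical)
df = cudf.read_csv("transactions.csv")

# Familiar pandas-style cleaning and transformation, executed on the GPU
df = df.dropna(subset=["amount"])
df["amount_usd"] = df["amount"] * df["fx_rate"]
summary = df.groupby("customer_id").agg({"amount_usd": "sum"})

# Convert back to pandas only when a CPU-side library needs the result
print(summary.head().to_pandas())
```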
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

MODEL TRAINING (cuML)
• GPU acceleration of today's most popular ML algorithms
• XGBoost, PCA, Kalman filtering, K-means, k-NN, DBSCAN, tSVD, and more
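As an illustration of the scikit-learn-like estimator API, a minimal sketch of GPU K-means with cuML (toy data; assumes cuML is installed, and is not from the slides):

```python
import cudf
from cuml.cluster import KMeans

# Tiny illustrative feature table held in GPU memory
df = cudf.DataFrame({
    "x": [0.10, 0.20, 0.15, 5.00, 5.20, 5.10],
    "y": [1.00, 1.10, 0.90, 9.00, 9.20, 8.90],
})

# Scikit-learn-style estimator, fitted entirely on the GPU
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(df)

print(kmeans.cluster_centers_)   # cluster centroids
print(kmeans.predict(df))        # cluster assignment per row
```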
DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → PREDICTIONS

VISUALIZATION (cuGRAPH)
• Effortless exploration of datasets: billions of records in milliseconds
• Dynamic interaction with data = faster ML model development
• Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS
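For the graph side of the stack, a minimal cuGraph sketch (toy edge list; assumes cuGraph is installed, and PageRank is chosen only as a representative algorithm, not one the slide names):

```python
import cudf
import cugraph

# Toy edge list kept in GPU memory (values are illustrative)
edges = cudf.DataFrame({
    "src": [0, 1, 2, 2, 3],
    "dst": [1, 2, 0, 3, 0],
})

# Build a directed graph and run PageRank entirely on the GPU
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).to_pandas())
```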
DAY IN THE LIFE OF A DATA SCIENTIST

[Slide graphic: two clock faces contrast a CPU-powered workflow with a GPU-powered workflow across the stages Dataset Collection → Analysis → Data Prep → Train → Inference]

CPU-powered workflow: dataset downloads overnight; configure and start the data prep workflow; get a coffee; find unexpected null values stored as strings; restart the data prep workflow; get another coffee; @*#! forgot to add a feature; restart data prep again; switch to decaf; train, validate and test the model; stay late.

GPU-powered workflow: dataset downloads overnight; start data prep; get a coffee; train, validate and test the model; experiment with optimizations and repeat; go home on time.
TRADITIONAL
DATA SCIENCE
CLUSTER
Workload Profile:
Fannie Mae Mortgage Data:
• 192GB data set
• 16 years, 68 quarters
• 34.7 Million single family mortgage loans
• 1.85 Billion performance records
• XGBoost training set: 50 features

300 Servers | $3M | 180 kW

GPU-ACCELERATED MACHINE LEARNING CLUSTER
DGX-2 and RAPIDS for Predictive Analytics

1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space | 1/18 the Power

[Chart: end-to-end runtime for 20, 30, 50 and 100 CPU nodes versus a DGX-2 and 5x DGX-1]
FASTER SPEEDS, REAL WORLD BENEFITS
Time in seconds (shorter is better)

Configuration    cuIO/cuDF load & data preparation    cuML XGBoost
20 CPU Nodes     2,741 s                              2,290 s
30 CPU Nodes     1,675 s                              1,956 s
50 CPU Nodes     715 s                                1,999 s
100 CPU Nodes    379 s                                1,948 s
DGX-2            42 s                                 169 s
5x DGX-1         19 s                                 157 s

(A third chart shows end-to-end times, combining cuIO/cuDF load and data preparation, data conversion, and XGBoost training.)

Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes with 61 GB of memory, 8 vCPUs, 64-bit platform, running Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
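For context on the XGBoost step in this benchmark, a minimal sketch of GPU-accelerated training with cuDF input (the file, columns and parameters are illustrative and do not reproduce the benchmark's actual pipeline):

```python
import cudf
import xgboost as xgb

# Features prepared with cuDF stay in GPU memory (file/columns are hypothetical)
df = cudf.read_csv("mortgage_features.csv")
X = df.drop(columns=["default_flag"])
y = df["default_flag"]

# DMatrix accepts cuDF inputs; 'gpu_hist' selects the GPU tree-building algorithm
dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",
    "max_depth": 8,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```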
GTC2019 RAPIDS TRAINING CONTENT
S9801 - RAPIDS: Deep Dive Into How the Platform Works
PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9801-rapids-deep-dive-into-how-the-platform-works.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9801/

S9577 - RAPIDS: The Platform Inside and Out


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9577-rapids-the-platform-inside-and-out.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9577/

S9793 - cuDF: RAPIDS GPU-Accelerated Data Frame Library


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9793-cudf-rapids-gpu-accelerated-data-frame-library.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9793/

S91043 - RAPIDS CUDA DataFrame Internals for C++ Developers


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s91043-rapids-cuda-dataframe-internals-for-c++-developers.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S91043/

S9817 - RAPIDS cuML: A Library for GPU Accelerated Machine Learning


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9817-rapids-cuml-a-library-for-gpu-accelerated-machine-learning.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9817/

S9783 - Accelerating Graph Algorithms with RAPIDS


PDF: https://on-demand.gputechconf.com/gtc/2019/presentation/_/s9783-accelerating-graph-algorithms-with-rapids.pdf
RECORDING: https://on-demand.gputechconf.com/gtc/2019/video/_/S9783/
Many more sessions: 26 GTC 2019 sessions cover RAPIDS-related topics.
DEEP LEARNING

AI TRANSFORMING EVERY INDUSTRY

HEALTHCARE: >80% accuracy and immediate alerts to radiologists
INFRASTRUCTURE: 50% reduction in emergency road repair costs
IOT: >$6M / year savings and reduced risk of outage
NVIDIA BREAKS RECORDS IN AI PERFORMANCE
MLPerf Records Both At Scale And Per Accelerator

Record Type                        Benchmark                                     Record
Max Scale (Minutes to Train)       Object Detection (Heavy Weight) Mask R-CNN    18.47 min
                                   Translation (Recurrent) GNMT                  1.8 min
                                   Reinforcement Learning (MiniGo)               13.57 min
Per Accelerator (Hours to Train)   Object Detection (Heavy Weight) Mask R-CNN    25.39 hrs
                                   Object Detection (Light Weight) SSD           3.04 hrs
                                   Translation (Recurrent) GNMT                  2.63 hrs
                                   Translation (Non-recurrent) Transformer       2.61 hrs
                                   Reinforcement Learning (MiniGo)               3.65 hrs

Per Accelerator comparison uses reported performance for MLPerf 0.6 on NVIDIA DGX-2H (16 V100s) compared to other submissions at the same scale, except for MiniGo, where the NVIDIA DGX-1 (8 V100s) submission was used.
MLPerf ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10
NVIDIA DGX SUPERPOD BREAKS AT-SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark

[Chart: MLPerf 0.6 at-scale submissions, minutes to train (lower is better), comparing NVIDIA GPU, Google TPU and Intel CPU results across Image Classification (ResNet-50 v1.5), Translation Non-recurrent (Transformer), Translation Recurrent (GNMT), Object Detection Light Weight (SSD), Reinforcement Learning (MiniGo, no TPU submission) and Object Detection Heavy Weight (Mask R-CNN); every NVIDIA benchmark trains in under 20 minutes, e.g. ResNet-50 in 1.33 minutes, GNMT in 1.8 minutes, MiniGo in 13.57 minutes and Mask R-CNN in 18.47 minutes]
MLPerf 0.6 Performance at Max Scale | MLPerf ID at Scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
UP TO 80% MORE PERFORMANCE ON SAME SERVER
Software Innovation Delivers Continuous MLPerf Improvements

[Chart: MLPerf on a DGX-2 server, relative speedup of MLPerf 0.6 over MLPerf 0.5 after 7 months of software improvements, ranging from 1.2x to 1.8x across Image Classification (ResNet-50 v1.5), Translation Non-recurrent (Transformer), Object Detection Light Weight (SSD), Translation Recurrent (GNMT) and Object Detection Heavy Weight (Mask R-CNN)]

Comparing the throughput of a single DGX-2H server on a single epoch (a single pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20 | SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
DRAMATICALLY MORE FOR YOUR MONEY

Same throughput for deep learning training (ResNet-50 image training):
• CPU-only cluster: 300 self-hosted Broadwell CPU servers, 180 kW
• GPU-accelerated: 1 DGX-2, 10 kW

1/8 the cost | 1/18 the power | 1/30 the space
NVIDIA DGX-2
Designed To Train The Previously Impossible

1. Sixteen NVIDIA Tesla V100 32GB GPUs, 512 GB total HBM2 memory, interconnected by a plane card
2. Two HGX-2 GPU motherboards: 8 V100 32GB GPUs and 6 NVSwitches per board
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR InfiniBand / 100 GigE adapters: 1,600 Gb/sec total bi-directional bandwidth
5. Two Intel Xeon Platinum CPUs
6. 1.5 TB system memory
7. 30 TB NVMe SSD internal storage
8. Two high-speed Ethernet ports: 10/25/40/100 GigE
TESLA V100 TENSOR CORE GPU
World's Most Powerful Data Center GPU
• 5,120 CUDA cores
• 640 Tensor Cores
• 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
• 20 MB SM register file | 16 MB cache
• 32 GB HBM2 @ 900 GB/s | 300 GB/s NVLink

NVSWITCH
World's Highest Bandwidth On-node Switch
• 7.2 Terabits/sec or 900 GB/sec
• 18 NVLink ports | 50 GB/s per port, bi-directional
• Fully-connected crossbar
• 2 billion transistors | 47.5 mm x 47.5 mm package
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records | Largest Transformer-Based Model Ever Trained

EXPLODING MODEL SIZE (number of parameters by network):
• Image recognition: ~26M
• NLP (Q&A, translation): ~340M
• NLP, generative tasks (chatbots, auto-completion): 1.5Bn (GPT-2) up to 8.3Bn (GPT-2 8B, the largest Transformer-based model trained)

CONVERSATIONAL AI RECORDS (code available on GitHub):
• BERT-Large training record: 53 minutes (fastest training)
• BERT-Base inference record: 2.2 ms latency (18x faster than CPU)

TRAINING GPUs: near-linear scaling, requires leading AI infrastructure
[Chart: normalized training speedup (1/time) of up to ~80x when scaling BERT across up to ~1,500 V100 GPUs]

BERT-Large training record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H servers | 8 Mellanox InfiniBand adapters per node
BERT-Base inference record: SQuAD dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2
Scaling training performance on BERT: speedups show performance scaling on 1x, 16x, 64x and 92x DGX-2H servers with 16 NVIDIA V100 GPUs each
ML/DL
INFRASTRUCTURE
AI PLATFORM CONSIDERATIONS
Factors impacting deep learning platform decisions

DEVELOPER PRODUCTIVITY: "Must get started now, line of business wants to deliver results yesterday"
SCALING PERFORMANCE: "I want the most GPU bang for the buck"
TOTAL COST OF OWNERSHIP: "I have a limited budget, need the lowest up-front cost possible"
COMPARING AI COMPUTE ALTERNATIVES
Looking beyond the "spec sheet"

Evaluation criteria, from the top of the stack down:
• AI/DL expertise & innovation
• AI/DL software stack
• Operating system image
• Hardware architecture
NVIDIA DGX POD™: HIGH-DENSITY COMPUTE REFERENCE ARCHITECTURE
• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Based on the proven SATURNV architecture

Nine DGX-1 servers
• Eight Tesla V100 GPUs each
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)

Twelve storage nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB total HDD)
• 50 GbE networking

Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)

Rack (4-POD design with cooling)
• 35 kW power
• 42U x 1200 mm x 700 mm (minimum)
• Rear door cooler
NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC

Test bed for highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50

Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software

Integrates fully optimized software stacks
• Freely available through NGC

System: 96 DGX-2H | 10 Mellanox EDR IB adapters per node | 1,536 V100 Tensor Core GPUs | 1 megawatt of power
SUPPORTING AI: ALTERNATIVE APPROACHES
Multiple paths to problem resolution

[Diagram: when a problem hits a running deployment, the team must chase it across separate support channels: framework and libraries (open source / forums), O/S (open source / forums), GPU and drivers, and server, network and storage (server, storage & network solution providers)]
SUPPORTING AI WITH DGX REFERENCE ARCHITECTURE SOLUTIONS

[Diagram: with a DGX reference-architecture solution (DGX servers plus partner storage), the IT admin reports "My PyTorch CNN model is running 30% slower than yesterday!" to a single support path; NPN partner and NVIDIA AI expertise respond ("Update to PyTorch container XX.XX") and the workload is running again]
THE NEW NGC
GPU-optimized Software Hub, Simplifying DL, ML and HPC Workflows

• 50+ containers: DL, ML, HPC
• Pre-trained models: NLP, classification, object detection & more
• Model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics

Simplify deployments | Innovate faster | Deploy anywhere
ngc.nvidia.com
Solving the complexity of managing distributed computing on GPUs

AGENDA
• Session 2:
  ○ Data Science as a service using GPUs
  ○ Demo
Iguazio: Integrated and Open Data Science Platform

[Platform diagram]
• ML pipelines: Kubeflow
• Serverless functions & notebooks: Nuclio, Jupyter Notebook
• Services: TensorFlow, PyTorch, RAPIDS, Dask, Pandas, Spark, Presto, Prometheus, Grafana
• Shared resources: data, persistent and GPU compute, GPU sharing
• Workloads: DL workloads, ML workloads, model inferencing
DEMO
Q&A
Optimize GPU sharing
Enabling GPU at scale

§ A quick way for data scientists to work on a cluster of GPUs
  o Built-in integration with GPUs
  o No DevOps required

§ Frees GPU resources after the Jupyter notebook becomes idle

§ Maximizes the efficiency of GPU usage across the data science team
Supporting DGX clusters

§ Running data science workloads on a DGX cluster

§ Running Jupyter, Spark, TensorFlow and distributed Python on a DGX cluster

§ Monitoring jobs at the cluster level
Running models in an inferencing layer with GPUs
Quick deployment of models in a serving layer

§ Models run as functions at scale on a GPU cluster

§ High-performance parallel execution engine

§ Easy control of GPU resources per function

§ Quick deployment of models from Jupyter notebooks
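As a sketch of what "models running as functions" can look like, here is a minimal Nuclio-style Python handler (the dummy model, JSON field names and response shape are illustrative, not Iguazio's actual code):

```python
import json

class DummyModel:
    """Stand-in for a real trained model (illustrative only)."""
    def predict(self, rows):
        return [sum(row) for row in rows]

model = None

def init_context(context):
    # Nuclio calls init_context once per function replica;
    # load the model here so each request only pays for inference.
    global model
    model = DummyModel()

def handler(context, event):
    # Each HTTP invocation arrives as `event`; the field name is hypothetical.
    features = json.loads(event.body)["features"]
    prediction = model.predict([features])[0]
    return context.Response(
        body=json.dumps({"prediction": prediction}),
        content_type="application/json",
        status_code=200,
    )
```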
Ease of management and orchestration
Easy access to GPUs

§ Self-service on a managed platform

§ Job scheduling

§ Cloud experience for on-prem

§ Full and open data science environment at the click of a button

§ Built-in integration for Jupyter and GPUs
Advanced integration with RAPIDS

§ Direct writes/reads into/from the GPU's memory using RAPIDS data frames
  o Users can read data from the database and analyze it directly on the GPU without any intermediate layer
§ Streaming data in chunks directly into the GPU
§ Full parallelism: multiple nodes can read data, each reading only its own shard

[Stack diagram: Python and deep learning frameworks on top of RAPIDS (Dask, cuDF, cuML, cuGraph, cuDNN), built on CUDA, with Apache Arrow on GPU memory]
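A minimal sketch of the sharded, parallel-read pattern described above, using Dask with cuDF (the scheduler address, dataset path and column names are illustrative, and it assumes a Dask cluster with GPU workers is already available, for example one provisioned by the platform):

```python
import dask_cudf
from dask.distributed import Client

# Connect to an existing Dask cluster with GPU workers (address is hypothetical)
client = Client("tcp://dask-scheduler:8786")

# Each worker reads only its own shard/partition of the dataset into GPU memory
ddf = dask_cudf.read_parquet("/data/events/*.parquet")

# Transformations run in parallel, one partition per GPU worker
daily_totals = ddf.groupby("day")["value"].sum().compute()
print(daily_totals.head())
```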
Serverless & GPU: Better Performance

§ Iguazio's serverless functions (Nuclio) improve GPU utilization and sharing, resulting in almost four times faster (4x) application performance compared to using NVIDIA GPUs within monolithic architectures.

§ Linear scalability
How we Enable Large Scale Data Processing on GPUs
DB + native support

[Pipeline diagram: raw data measured in terabytes is filtered, partitioned and split into chunks of roughly 1-10 GB (tens to thousands of chunks); each chunk is processed on the GPU and the per-chunk results are merged into the final results]
Value for NVIDIA customers

§ Speed up data science projects
  o Immediate access to GPUs (training and inferencing)

§ Increase overall GPU utilization (90%)
  o Helping customers maximize their GPU utilization

§ Fully managed PaaS with built-in GPU integration
  o Application provisioning, orchestration and managed notebooks enable training at scale on a shared GPU cluster
  o Tight integration with NVIDIA TensorRT, RAPIDS and DeepOps

§ Simplify management of GPUs & DGX
  o Automated workflow for a continuous data science pipeline

§ Improve performance by 4x
  o By creating a shared resource pool and load balancing across all GPUs
Integrated GPU monitoring (coming soon)

§ Built-in GPU monitoring dashboard integrated with NVIDIA DeepOps

§ Advanced troubleshooting: identify which service/app is utilizing the GPU resource
Thank You
anantg@iguazio.com | www.iguazio.com
