Lez.b-06 - NVIDIA GPUs and Servers


DELL AND NVIDIA FOR YOUR AI WORKLOADS IN THE DATA CENTER


Helge Gose, NVIDIA Solution Architect, June 7, 2018
AGENDA
• What is Deep Learning?
• Volta and NVLink
• Inference to Training – Dell solutions
THE TIME HAS COME FOR GPU COMPUTING

[Chart: single-threaded CPU performance vs. GPU-accelerated computing, 1980–2020, log scale. Single-threaded performance growth has slowed from roughly 1.5X per year to about 1.1X per year, while GPU-accelerated computing continues on a far steeper curve.]
DEEP LEARNING IS SWEEPING ACROSS INDUSTRIES
• INTERNET SERVICES: image/video classification, speech recognition, natural language processing
• MEDICINE: cancer cell detection, diabetic grading, drug discovery
• MEDIA & ENTERTAINMENT: video captioning, content-based search, real-time translation
• SECURITY & DEFENSE: face recognition, video surveillance, cyber security
• AUTONOMOUS MACHINES: pedestrian detection, lane tracking, traffic sign recognition
DEFINITIONS
A NEW COMPUTING MODEL

MACHINE LEARNING: algorithms that learn from examples (e.g., labeling an image as Vehicle, Car, or Coupe)

TRADITIONAL APPROACH
• Requires domain experts
• Time-consuming experimentation
• Custom algorithms
• Not scalable to new problems

DEEP LEARNING (DEEP NEURAL NETWORKS)
• Learn from data
• Easy to extend
• Accelerated with GPUs
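As a concrete illustration of the "learn from data" approach, here is a minimal sketch of a small neural-network classifier trained on labeled examples. It is an illustrative assumption, not material from the deck: the synthetic features, layer sizes, and three output classes (standing in for the Vehicle/Car/Coupe labels above) are all made up.

```python
# Minimal sketch: a network learns the input-to-label mapping from examples
# instead of relying on hand-crafted rules. Runs on GPU when one is available.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 3),                 # 3 output classes (illustrative)
).to(device)

x = torch.randn(512, 128, device=device)        # synthetic "feature" inputs
y = torch.randint(0, 3, (512,), device=device)  # synthetic labels

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):               # the model improves purely from the examples
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print("final training loss:", loss.item())
```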
WHAT PROBLEM ARE YOU SOLVING?
Defining the AI/DL Task

Inputs: text, data, images, video, audio

Business question → AI/DL task (example outputs in Healthcare / Retail / Finance):
• Is "it" present or not? → Detection (cancer detection / targeted ads / cybersecurity)
• What type of thing is "it"? → Classification (image classification / basket analysis / credit scoring)
• To what extent is "it" present? → Segmentation (tumor size/shape analysis / 360º customer view / credit risk analysis)
• What is the likely outcome? → Prediction (survivability prediction / sentiment & behavior recognition / fraud detection)
• What will likely satisfy the objective? → Recommendations (therapy recommendation / recommendation engine / algorithmic trading)
VOLTA AND NVLINK
TESLA V100
WORLD'S MOST ADVANCED DATA CENTER GPU
• 5,120 CUDA cores
• 640 new Tensor Cores
• 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
• 20MB SM register file | 16MB cache
• 16GB/32GB HBM2 @ 900GB/s | 300GB/s NVLink
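The Tensor Cores deliver the quoted 125 TFLOPS on FP16 matrix math, which frameworks typically reach through mixed-precision training. The sketch below is a hedged illustration using PyTorch automatic mixed precision (the model, shapes, and hyperparameters are placeholder assumptions), showing the usual pattern that lets eligible operations run on Tensor Cores.

```python
# Hedged sketch: PyTorch automatic mixed precision routes eligible matrix
# multiplies and convolutions to FP16 so Volta Tensor Cores can be used.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()      # loss scaling keeps FP16 gradients stable

data = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(data), target)  # runs in FP16 where safe
    scaler.scale(loss).backward()          # gradients computed with the scaled loss
    scaler.step(optimizer)                 # unscales, then applies the update
    scaler.update()
```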

REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance

[Chart 1: Exponential performance over time. GoogLeNet training speedup vs. 1x K80 with cuDNN2 (Q1 2015): 4x M40 with cuDNN3 (Q3 2015), 8x P100 with cuDNN6 (Q2 2016), 8x V100 with cuDNN7 (Q2 2017) — over 80X DL training performance in 3 years.]

[Chart 2: Relative time to train an LSTM, neural machine translation for 13 epochs, German→English WMT15 subset, CPU = 2x Xeon E5-2699 v4: CPU 15 days, 8x P100 18 hours, 8x V100 6 hours — a 3X reduction in time to train over P100.]
END-TO-END PRODUCT FAMILY

TRAINING
• Desktop: TITAN V, DGX Station
• Data center: Dell PowerEdge C4140, Tesla V100

INFERENCE
• Data center: Tesla P4, Tesla V100
• Embedded: Jetson (JetPack SDK)
• Automotive: Drive PX (DriveWorks SDK)
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK Accelerates Every Major Framework

• Computer vision: object detection, image classification
• Speech & audio: voice recognition, language translation
• Natural language processing: recommendation engines, sentiment analysis

DEEP LEARNING FRAMEWORKS (Mocha.jl and the other major frameworks)

NVIDIA DEEP LEARNING SDK

developer.nvidia.com/deep-learning-software
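The frameworks pick up the NVIDIA Deep Learning SDK libraries (such as cuDNN) underneath. As a hedged illustration, and not anything from the slide itself, the snippet below shows how a PyTorch user can confirm the cuDNN backend is active and let its autotuner pick the fastest convolution algorithms.

```python
# Hedged sketch: checking and tuning the cuDNN backend that the NVIDIA
# Deep Learning SDK provides underneath the framework.
import torch

print("CUDA available:", torch.cuda.is_available())
print("cuDNN enabled:", torch.backends.cudnn.enabled)
print("cuDNN version:", torch.backends.cudnn.version())

# Let cuDNN benchmark its convolution algorithms and cache the fastest choice;
# this helps when input shapes stay the same from batch to batch.
torch.backends.cudnn.benchmark = True
```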
DELL AI SOLUTIONS
PowerEdge C4140 Server
Faster time to insights with an ultra-dense, accelerator-optimized server platform

TARGETED WORKLOADS
• Machine learning and deep learning
• Technical computing (research / life sciences)
• Low-latency, high-performance applications (FSI)

Platform: Intel Xeon Scalable processors + NVIDIA Tesla GPUs

Key Capabilities
• Unthrottled performance and superior thermal efficiency with patent-pending interleaved GPU system design*
• No-compromise (CPU + GPU) acceleration technology, up to 500 TFLOPS per U+, using the NVIDIA® Tesla™ V100 with NVLink™
• 2.4kW PSUs help future-proof for next-generation GPUs
• Simplified deployment with pre-configured Ready Bundles

* Based on Dell internal analyses and Principled Technologies Report, Jan 2015.
+ Based on V100 NVLink Tensor Core performance.
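To drive the four NVLink-connected V100s in a C4140 from a single training job, the usual pattern is data-parallel training. The sketch below is an illustrative assumption (a one-node, four-GPU launch via torchrun with a placeholder model), not a Dell- or NVIDIA-provided recipe.

```python
# Hedged sketch: data-parallel training across the 4 GPUs of one node,
# launched for example with: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL moves gradients over NVLink
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(64, 1024, device=local_rank)   # placeholder batch per GPU
    y = torch.randn(64, 1024, device=local_rank)
    for _ in range(10):
        optimizer.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()  # gradients all-reduced across GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```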


C4140 – Now with NVIDIA® Volta and NVLink™

Faster time to insights with an ultra-dense, accelerator-optimized server platform

The NVIDIA® Volta GPU has over 21 billion transistors and 640 Tensor Cores to deliver 100+ TFLOPS. NVIDIA® NVLink™ is a high-bandwidth interconnect enabling ultra-fast communication between CPU and GPU and between GPUs.

 Volta V100 delivers a 2.6X average speedup on DL workloads over Pascal P100
 Delivers 44X more throughput than CPU nodes, with lower latency
 NVLink is 5X–10X faster than the traditional PCIe Gen3 interconnect
 Volta-optimized software for important HPC applications

*Source: NVIDIA® Volta benchmarks for multiple applications, 2017

C4140 and NVLink™
PCIe topology vs. NVLink topology

 NVLink signaling runs at 25 Gb/s per lane versus 8 Gb/s per lane for PCIe Gen3
 Roughly 7% higher performance from the higher clock speed of the NVLink GPU modules
 A further 7%+ performance gain from peer-to-peer GPU communication over NVLink (see the sketch below)
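One quick way to see the NVLink topology and the peer-to-peer benefit on a machine like the C4140 is sketched below. It is a hedged illustration relying on `nvidia-smi topo -m` and a simple timed GPU-to-GPU copy in PyTorch; the transfer size and two-GPU assumption are placeholders, and the printed number is only a rough indicator, not a benchmark from the deck.

```python
# Hedged sketch: inspect the GPU interconnect topology, then time a direct
# GPU0 -> GPU1 tensor copy, which travels over NVLink when the GPUs are linked.
import subprocess
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

# Prints the link matrix; NV# entries indicate NVLink, PIX/PHB indicate PCIe paths.
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)

src = torch.randn(256 * 1024 * 1024 // 4, device="cuda:0")   # ~256 MB of FP32 data
torch.cuda.synchronize(0)

t0 = time.perf_counter()
dst = src.to("cuda:1")                    # device-to-device copy
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.perf_counter() - t0

print(f"~{src.numel() * 4 / elapsed / 1e9:.1f} GB/s GPU0 -> GPU1")
```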

INDUSTRY'S #1 SERVER PORTFOLIO*
PowerEdge – now introducing the C4140

• Towers
• Racks
• Modular infrastructure
• Extreme Scale infrastructure

OpenManage Enterprise – intelligent automation systems management

*Based on units sold (tie). IDC Worldwide Quarterly Server Tracker, Q1–Q3 2016.
