
Fuel AI Training with Ascend

Huawei Atlas Data Center Training Solution


Contents

1 Trends

2 Products and Solutions

3 Scenario

Huawei Confidential
Development Trends

Shifting from General to Heterogeneous Computing, Advancing Moore's Law

Unstructured data increases exponentially
• By 2020, 80% of data growth will come from unstructured data.
• Diversified applications: smart city, IoT, autonomous driving, smartphone
• Unstructured data: text, image, speech, video
[Chart: data volume over time, structured data vs. unstructured data growth]

A paradigm shift from general to heterogeneous computing
• Moore's Law is weakening.
• Innovations of heterogeneous computing architecture will break the bottlenecks and advance Moore's Law.
[Chart: Moore's law, 1980 to 2030, showing CPU, FPGA, GPU, and NPU; growth rates labeled 1.5x/year and 1.1x/year]

Data source: IDC; Huawei GIV 2025

Challenges

AI Computing Power Scarcity vs Exponential Growth in Demands

Demands for AI computing power are increasing at an unprecedented speed. However, current computing power supply is concentrated in some vendors.

• The training computing power required by AI models doubles every 3.4 months.
• AI computing power has increased by 300,000 times since 2012.
• AI computing power requirements will continue to increase by 100 times in the next 3–5 years.

[Chart: Requirement curve of computing power for AI model training. Unit: PetaFLOPS/day.]
• Before 2012 (doubles every 2 years): Perceptron, ALVINN, NETtalk, RNN for Speech, LeNet-5, BiLSTM for Speech, TD-Gammon v2.1, Deep Belief Nets and layer-wise pretraining
• Since 2012 (doubles every 3.4 months): DQN, 2.3 PF; AlexNet, 470 PF; ResNet, 10,000 PF; DeepSpeech2, 21,000 PF; Neural Machine Translation, 6.9 x 10^6 PF; VGG; TI7 Dota 1v1; AlphaGoZero; BERT; GPT-2

Data source: OpenAI
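
The doubling periods quoted on this slide can be turned into growth factors with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes clean exponential growth at a constant doubling period, which the slide implies but does not state.

```python
import math


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` when demand doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def months_to_reach(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a constant doubling period."""
    return math.log2(factor) * doubling_months


# Doubling periods quoted on the slide: 3.4 months vs. 2 years (24 months).
print(f"One year at 3.4-month doubling: {growth_factor(12, 3.4):.1f}x")
print(f"One year at 2-year doubling:    {growth_factor(12, 24):.2f}x")
print(f"Months to grow 300,000x:        {months_to_reach(300_000, 3.4):.1f}")
```

Under this assumption, a 300,000-fold increase takes roughly 62 months (a little over five years) at a 3.4-month doubling period.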


Huawei Data Center Solution: Building the Industry's Most Powerful AI Training Platform

Industries: Energy, Finance, Public utilities, Transportation, Carrier, Manufacturing, Education, …

Solution stack:
• Atlas distributed training platform: Cluster management/Model management/Data pre-processing
• Framework: MindSpore/TensorFlow/PyTorch/…
• Chip enablement: CANN (Compute Architecture for Neural Networks)
• Common components: AXE toolchain/Security subsystem/Unified O&M system

Hardware:
• Atlas 300T: 256 TFLOPS FP16, world's most powerful training card
• Atlas 800: 2 PFLOPS FP16, world's most powerful training server
• Atlas 900: 256 to 1024 PFLOPS FP16, world's fastest AI training cluster
• Ascend 910 AI processor
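
The headline figures for the card, server, and cluster all follow from the per-processor peak. The sketch below shows that arithmetic. The function name is hypothetical, and the conversion factor of 1024 TFLOPS per PFLOPS is an inference from the slide's round numbers, not something the source states.

```python
ASCEND_910_TFLOPS_FP16 = 256  # per-processor peak quoted on the slide


def peak_pflops(num_processors: int,
                tflops_each: float = ASCEND_910_TFLOPS_FP16,
                tflops_per_pflops: float = 1000.0) -> float:
    """Aggregate theoretical peak, ignoring any scaling losses."""
    return num_processors * tflops_each / tflops_per_pflops


# 8 processors per Atlas 800 server, 1024 per Atlas 900 cluster (from the slides).
for count in (1, 8, 1024):
    decimal = peak_pflops(count)
    binary = peak_pflops(count, tflops_per_pflops=1024)  # assumed binary prefix
    print(f"{count:>5} x Ascend 910: {decimal:.3f} PFLOPS (x1000), {binary:.3f} PFLOPS (x1024)")
```

With the assumed factor of 1024, eight processors give exactly 2 PFLOPS and 1024 processors give exactly 256 PFLOPS, matching the quoted figures. These are theoretical peaks, not measured training throughput.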

Ascend 910

Ascend 910: World's Most Powerful AI Processor

• Architecture: Da Vinci
• 256 TFLOPS FP16
• 128-channel full HD video decoding
• Max. power consumption: 310 W

Ultimate computing power
256 TFLOPS FP16, twice the industry counterpart

High integration
32 Huawei Da Vinci AI Cores + 16 TaiShan Cores + 100G RoCE v2 NIC, integrating AI computing, general computing, and I/O all in one processor

High-speed connection
Integrates the HCCS, PCIe 4.0, and 100G RoCE high-speed interfaces for faster data training and gradient synchronization

[Chart: FLOPS comparison, normalized to 16-bit: 45T, 90T, 125T, and 256T (Ascend 910)]

Ascend 910

Ascend 910 Chip Architecture

Highly integrated AI SoC design: AI computing, general computing, and I/O 3-in-1

[Diagram labels: 32 AI Cores; 16 CPU Cores; integrated high-speed network port; external DDR4 memory; pre-processing unit; high bandwidth memory (HBM)]


Atlas 300

Atlas 300T AI Accelerator Card: Industry's Most Powerful AI Training Card
Model: 9000

• 256 TFLOPS FP16
• 32 GB HBM + 16 GB DDR4 memory
• PCIe 4.0 x16, standard full-height 3/4-length PCIe card, applicable to general-purpose servers
• 100G RoCE ports directly provided by the processor

Applications: Deep learning | Astronomical exploration | Oil exploration | Autonomous driving

Ultimate computing power
32 built-in Da Vinci AI Cores, providing 256 TFLOPS FP16 performance, twice the industry counterparts

High-speed connection
Supports PCIe 4.0 and 100G RoCE high-speed interfaces, reducing the gradient synchronization latency by 10–70%

Large memory
Built-in 32 GB HBM and 16 GB two-level, large-capacity memory meets the bandwidth and capacity requirements of AI and general-purpose computing.

[Chart: Others at 112 TFLOPS and 125 TFLOPS vs. Atlas 300 at 256 TFLOPS]

Atlas 800

Atlas 800 AI Server: Industry's Most Powerful AI Training Server
Model: 9000

Atlas 800 AI server, Model: 9000
Deep learning | AI supercomputing | Distributed training platform

• 2 PFLOPS FP16
• 4U server, supporting 4 Kunpeng 920 and 8 Ascend 910 processors
• 32 DDR4 DIMMs and 10 x 2.5" hard drives
• 8 x 100GE + 2 x 100GE/4 x 25GE
• 5.5 kW maximum power consumption, supporting air cooling and liquid cooling

Industry's highest computing density
Up to 2 PFLOPS FP16 computing power in a 4U space, twice the industry counterpart

High perf./Watt
Supports air cooling and liquid cooling, and an up to 2 PFLOPS/5.5 kW ultra-high energy efficiency, which is 1.6x the industry counterpart, meeting the requirements of deployment in enterprise equipment rooms and high-density clusters

High-speed network bandwidth
8 x 100G RoCE v2 high-speed interfaces, doubling the industry counterpart and reducing the inter-chip cross-server interconnect latency by 10–70%

Atlas 800

Atlas 800 AI Server Architecture
Model: 9000

[Diagram labels: Kunpeng 920 processor; 32 DDR4 DIMMs; PCIe module; PSU; Ascend 910 AI processor; liquid cooling pipe; 8 x 100GE optical ports]

Front panel:
1 10 x 2.5" hard drives
2 Fan module
3 USB 3.0 port
4 VGA port
5 Slide-out label plate

Atlas 800
Atlas 800 AI Server: Industry's Highest Computing Density
Model: 9000

• Atlas 800 AI server: computing power 2 PFLOPS; height 4U; power consumption 5.5 kW
• Competitor's server: computing power 1 PFLOPS; height 4U; power consumption 4.4 kW

• Superior computing power: 8 Ascend 910 AI processors, up to 2 PFLOPS FP16 per server, outstripping competitor products by far
• Ultra-high density: 2 PFLOPS computing power in a 4U space, 2x that of the competing product
• Space-efficient: saves equipment room space by 50%, and drives down OPEX

[Chart: for the same 2 PFLOPS of server computing power, Atlas 800 occupies 4U and Vendor 1 occupies 8U of server height: 2x computing density]
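
The 2x density and 1.6x perf./Watt claims can be reproduced from the comparison figures on this slide. The sketch below is a minimal check; the `Server` class and its field names are hypothetical, and the inputs are the slide's quoted peak figures, not independent measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    pflops: float    # peak FP16 computing power
    height_u: int    # chassis height in rack units
    power_kw: float  # maximum power consumption

    @property
    def pflops_per_u(self) -> float:
        return self.pflops / self.height_u

    @property
    def pflops_per_kw(self) -> float:
        return self.pflops / self.power_kw


atlas = Server("Atlas 800", pflops=2.0, height_u=4, power_kw=5.5)
other = Server("Competitor", pflops=1.0, height_u=4, power_kw=4.4)

density_ratio = atlas.pflops_per_u / other.pflops_per_u
efficiency_ratio = atlas.pflops_per_kw / other.pflops_per_kw
print(f"Computing density ratio: {density_ratio:.2f}x")
print(f"Perf./Watt ratio:        {efficiency_ratio:.2f}x")
```

The ratios come out to 2x for density and 1.6x for energy efficiency, consistent with the claims in the deck.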
Atlas 800

Atlas 800 AI Server: Ultra-High Energy Efficiency
Model: 9000

• 2 PFLOPS/5.5 kW, over 1.6x the energy efficiency of industry counterparts, reducing OPEX
• A single server supports air cooling and liquid cooling, meeting the requirements of deployment in enterprise equipment rooms and high-density clusters

[Chart: 1.6x perf./Watt]

Cooling design:
• Cellular board for higher porosity rate of the front panel
• Counter-rotating fans for higher wind speeds
• Hybrid liquid cooling design: supports 50°C high-temperature inlet water, cooling the chips in spray mode

Atlas 800

Atlas 800 AI Server: Ultra-High Network Bandwidth
Model: 9000

Ports:
1 2 x 100GE or 4 x 25GE FlexIO card
2 Management network port and serial port
3 4 x GE LOM ports
4 8 x 100GE optical ports

• Ultra-high bandwidth: 8 x 100 Gbit/s high-speed interfaces + 4 x 25GE or 2 x 100GE, 2x the industry bandwidth
• High-speed connection: HCCS, PCIe 4.0, and 100G RoCE integrated
• Ultra-low latency: Provides 100G inter-node interconnection interface based on RoCE v2, improving the training data and gradient synchronization efficiency, and shortening the inter-chip cross-server interconnect latency by 10–70%
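
To give a feel for why link bandwidth matters for gradient synchronization, the sketch below estimates a bandwidth-only lower bound for a textbook ring all-reduce over one 100 Gbit/s link per node. This is an illustration under stated assumptions, not a description of how the Atlas platform synchronizes gradients: the slides do not name the algorithm, and the function name, the payload size, and the node counts are all hypothetical.

```python
def ring_allreduce_seconds(num_nodes: int, payload_bytes: float, link_gbit_s: float) -> float:
    """Bandwidth-only lower bound for a textbook ring all-reduce.

    Each node sends and receives 2 * (N - 1) / N times the payload;
    per-hop latency and protocol overhead are ignored.
    """
    if num_nodes < 2:
        return 0.0  # nothing to synchronize with a single node
    bytes_per_second = link_gbit_s * 1e9 / 8
    traffic = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    return traffic / bytes_per_second


# Hypothetical payload: about 25.6 million parameters stored as FP16 (2 bytes each).
payload = 25.6e6 * 2
for nodes in (2, 8, 64):
    ms = ring_allreduce_seconds(nodes, payload, link_gbit_s=100) * 1e3
    print(f"{nodes:>3} nodes: {ms:.2f} ms per synchronization")
```

Because the per-node traffic approaches twice the payload as the node count grows, the bound is dominated by link speed rather than cluster size, which is one reason interconnect bandwidth is highlighted for distributed training.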

Atlas 900

Atlas 900 AI Cluster: Supercharging AI Training

• Leading computing power: 1024 Ascend 910 AI processors, 256–1024 PFLOPS FP16
• Best cluster network: HCCS, PCIe 4.0, 100G Ethernet interconnect, > 80% linearity
• Ultimate heat dissipation: > 95% liquid cooled, PUE < 1.1

Atlas 900

Atlas 900 AI Cluster: World's Fastest AI Training Cluster

Atlas 900: world's fastest AI training cluster
World's No. 1: 59.8s

• Test benchmark: ResNet-50 V1.5 model, ImageNet-1k dataset
• Test time: September 2019

                                 Vendor 1      Vendor 2      Atlas 900
Time                             79.8s         70.2s         59.8s (15% shorter)
Computing power per processor    125 TFLOPS    125 TFLOPS    256 TFLOPS
Chip count per cluster           1536          2048          1024
Chip architecture                GPU           GPU           NPU
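
The 15% figure and the effect of chip count can be checked from the benchmark numbers on this slide. The sketch below is a rough illustration: the helper names are hypothetical, the pairing of times with vendors follows the column order on the slide, and "chip-seconds" is a simple product of time and chip count rather than a metric the source reports.

```python
# (training time in seconds, chips per cluster), as listed on the slide
results = {
    "Vendor 1": (79.8, 1536),
    "Vendor 2": (70.2, 2048),
    "Atlas 900": (59.8, 1024),
}


def time_reduction(baseline_s: float, new_s: float) -> float:
    """Fraction by which `new_s` is shorter than `baseline_s`."""
    return (baseline_s - new_s) / baseline_s


def chip_seconds(time_s: float, chips: int) -> float:
    """Total processor time consumed by the run."""
    return time_s * chips


atlas_time, _ = results["Atlas 900"]
for name, (seconds, chips) in results.items():
    saved = time_reduction(seconds, atlas_time)
    print(f"{name:<10} {seconds:>5.1f}s  {chip_seconds(seconds, chips):>10.1f} chip-s  "
          f"Atlas 900 is {saved:.1%} shorter")
```

Against the 70.2s result the reduction is about 14.8%, which rounds to the 15% shown on the slide.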

Atlas 900

Atlas 900 AI Cluster: Industry's Best Cluster Network

Top AI cluster network

The Atlas 900 AI training cluster uses three high-speed interconnect modes: HCCS, PCIe 4.0, and 100GE, and a dedicated 100 TB/s full-mesh, non-blocking synchronization network. This helps reduce the gradient synchronization latency by 10–70%.

Huawei Collective Communication Library (HCCL) provides distributed parallel libraries for training networks. Communication libraries, network topologies, and training algorithms are optimized at the system level, improving job scheduling efficiency and delivering > 80% linearity.

iLossless, a unique, intelligent lossless switching algorithm, learns and trains network traffic in real time, achieving zero packet loss and E2E μs-level latency.

[Diagram: CloudEngine switches connect the AI servers over 100G RoCE links (labeled 8x and 64x); inside each AI server, the processors (D) are linked by HCCS + PCIe interconnect. Legend: HCCS, PCIe 4.0, 100G RoCE]
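
The deck quotes "> 80% linearity" without defining the term. The sketch below assumes the common definition, measured speedup divided by the ideal speedup for N processors, and shows what 80% would mean at the cluster's scale. The function names and the sample throughput values are hypothetical.

```python
def linearity(throughput_n: float, throughput_1: float, n: int) -> float:
    """Scaling efficiency: measured speedup divided by the ideal speedup n."""
    return (throughput_n / throughput_1) / n


def effective_speedup(n: int, linearity_value: float) -> float:
    """Speedup over one processor implied by a given linearity."""
    return n * linearity_value


# Hypothetical measurement: 1 processor at 1,000 images/s, 1024 processors at 850,000 images/s.
print(f"Linearity: {linearity(850_000, 1_000, 1024):.1%}")

# What the quoted lower bound implies for a 1024-processor cluster.
print(f"Speedup at 80% linearity: {effective_speedup(1024, 0.80):.1f}x")
```

Under this definition, 80% linearity on 1024 processors corresponds to about 819 times the throughput of a single processor.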

Atlas 900

Atlas 900: Industry's First Fully Liquid Cooled AI Cluster, with PUE < 1.1

Hybrid liquid cooling, achieving the ultimate energy efficiency

AI cluster with PUE reduced by ~30%: Atlas 900 PUE < 1.1 vs. industry's air-cooled clusters PUE 1.5

• Board-level liquid cooling dissipates 70% heat
• Rack-scale enclosed adiabatic design dissipates 30% heat by air-to-liquid heat exchange
• High-performance fan modules and VC heat sinks combined with field synergistic heat exchangers to improve heat dissipation efficiency by 10%
• Supports 50°C high-temperature water inlet (30°C by industry counterparts), improving cooling efficiency
• Real-time leakage detection and quick automatic shutdown, ensuring reliability

                               Cooling capacity per rack   Rack quantity   Total power consumption
Atlas 900                      50 kW                       16 racks        704 kW
Industry's air-cooled clusters 30 kW                       86 racks        1728 kW
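
PUE (power usage effectiveness) is total facility power divided by IT equipment power, so a PUE of 1.1 means 0.1 W of overhead per watt of IT load. The sketch below works through the comparison figures on this slide. The function names are hypothetical, and it assumes the 50 kW/16 racks/704 kW row describes Atlas 900 and the 30 kW/86 racks/1728 kW row describes the air-cooled baseline, which the extracted slide does not label explicitly.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT power (cooling, power delivery) implied by a PUE at a given IT load."""
    return it_load_kw * (pue - 1.0)


# PUE figures from the slide: 1.5 for air-cooled clusters, 1.1 as the Atlas 900 upper bound.
print(f"PUE reduction:         {relative_reduction(1.5, 1.1):.1%}")
print(f"Total power reduction: {relative_reduction(1728, 704):.1%}")
print(f"Rack count reduction:  {relative_reduction(86, 16):.1%}")

# Overhead for a hypothetical 100 kW IT load at each PUE.
print(f"Overhead at PUE 1.5: {overhead_kw(100, 1.5):.0f} kW; at PUE 1.1: {overhead_kw(100, 1.1):.0f} kW")
```

Since the slide gives PUE < 1.1 as an upper bound, the computed PUE reduction of about 27% is a minimum, in line with the "~30%" quoted.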

Atlas Accelerates AI Model Training for Various Applications

• Model training: dataset, algorithms & models, parameter tuning, computing, model verification
• Trained models
• Model deployment: tailor, quantization, AI services

Applications: Video analysis | Gene research | Autonomous driving | Weather forecast | Oil exploration

Products:
• Atlas 300T AI accelerator card, Model: 9000
• Atlas 800 AI server, Model: 9000
• Atlas 900 AI cluster


AI Supercomputing: Builds the Infrastructure for Cloud Services

Atlas 900: AI supercomputing powered by Kunpeng and Ascend
• 256–1024 PFLOPS FP16
• 6195 x86 racks (40,268 kW) = 208 GPU racks (1,352 kW) = 16 Atlas racks (736 kW)

World's No. 1 in performance benchmark test: 59.8s
[Chart: time by vendor: Google 76.8s; Fujitsu 70.2s; Huawei Atlas 900 59.8s]

Benchmarking with NVIDIA: TCO reduced by 9.3% for the same computing power
1. Doubled computing power
Interconnects over 1024 Ascend 910 AI processors, providing double computing performance on a single chip compared with the industry
2. 70% shorter network latency
Integrates three high-speed interfaces: HCCS, PCIe 4.0, and 100G RoCE, reducing latency by up to 70%
3. Over 60% electricity saving and 80% smaller footprint
Hybrid liquid cooling system for 50 kW per rack, PUE < 1.1
Ultra-high-density prefabricated modular equipment room, low power consumption, fast deployment, and exascale CloudBrain cluster rollout in six months

AI applications boost the development of the Greater Bay Area (GBA)
Smart transportation | Smart healthcare | Smart finance and more

National strategy
• AI supercomputing platform with international influence
• National open source platform for AI basics
• Open & innovative ecosystem for AI

Serving Shenzhen
• Supports major AI application requirements such as intelligent computing system and robot system in the GBA
• Improves the basic position and innovation capability of AI research on open source platforms and intelligent applications in the GBA
• Attracts national AI resources and talents
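
The rack equivalence on this slide can be read as a power-density comparison. The sketch below divides each total power figure by its rack count. It assumes the three power figures pair with the three rack counts in the order x86, GPU, Atlas; the extracted slide does not make that pairing explicit, and the function name is hypothetical.

```python
# (rack count, total power in kW); the pairing is an assumption, see the note above
deployments = {
    "x86": (6195, 40_268),
    "GPU": (208, 1_352),
    "Atlas": (16, 736),
}


def kw_per_rack(racks: int, total_kw: float) -> float:
    """Average power drawn per rack."""
    return total_kw / racks


for name, (racks, total_kw) in deployments.items():
    print(f"{name:<5} {racks:>5} racks, {total_kw:>6} kW total, "
          f"{kw_per_rack(racks, total_kw):.1f} kW per rack")
```

Under this pairing, the x86 and GPU deployments both work out to about 6.5 kW per rack, while the Atlas racks average 46 kW, close to the 50 kW per-rack cooling capacity quoted for the liquid-cooled design.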

Thank you.

Bring digital to every person, home and organization for a fully connected, intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
