04.huawei Atlas Data Center AI Training

Fuel AI Training with Ascend
Huawei Atlas Data Center Training Solution

Contents
1 Development Trends
2 Products and Solutions
3 Scenario
Shifting from General to Heterogeneous Computing, Advancing Moore's Law

Unstructured data increases exponentially
• By 2020, 80% of data growth will come from unstructured data.
• Diversified applications: smart city, IoT, autonomous driving, smartphone
• Unstructured data: text, image, speech, video
[Figure: data volume over time, unstructured data growth versus structured data]

A paradigm shift from general to heterogeneous computing
• Moore's Law is weakening
• Innovations of heterogeneous computing architecture will break the bottlenecks and advance Moore's Law
[Figure: Moore's law curve, 1980 to 2030, covering CPU, FPGA, GPU, and NPU, annotated with growth rates of 1.5x/year and 1.1x/year]

Data source: IDC; Huawei GIV 2025
Challenges
AI Computing Power Scarcity vs Exponential Growth in Demands

Demands for AI computing power are increasing at an unprecedented speed. However, current computing power supply is concentrated in some vendors.

• The training computing power required by AI models doubles every 3.4 months.
• AI computing power has increased by 300,000 times since 2012.
• AI computing power requirements will continue to increase by 100 times in the next 3–5 years.

[Figure: Requirement curve of computing power for AI model training. Unit: PetaFLOPS/day. Before 2012 the requirement doubles every 2 years; from 2012 it doubles every 3.4 months.]
Models plotted: Perceptron; ALVINN; NETtalk; RNN for Speech; LeNet-5; BiLSTM for Speech; TD-Gammon v2.1; Deep Belief Nets and Layer-wise pretraining; DQN, 2.3 PF; AlexNet, 470 PF; VGG; ResNet, 10,000 PF; DeepSpeech2, 21,000 PF; TI7 Dota 1v1; Neural Machine Translation, 6.9 x 10^6 PF; AlphaGoZero; BERT; GPT-2
Data source: OpenAI
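
The doubling claims above are compound-growth arithmetic, which a few lines of Python can illustrate. This is a minimal sketch, not a reproduction of the OpenAI analysis: the function names are hypothetical, and the only inputs are the figures quoted on the slide (3.4-month doubling, 2-year doubling, 300,000x).

```python
import math

def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` if the quantity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)

def months_to_reach(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at the given doubling period."""
    return math.log2(factor) * doubling_months

# Time for a 300,000x increase when demand doubles every 3.4 months.
months_needed = months_to_reach(300_000, 3.4)

# Growth over the same time span at a 2-year (24-month) doubling period.
two_year_doubling_growth = growth_factor(months_needed, 24.0)

print(f"300,000x at 3.4-month doubling takes about {months_needed:.0f} months")
print(f"A 2-year doubling period gives about {two_year_doubling_growth:.1f}x in that time")
```

Under these assumptions, a 300,000x increase takes roughly five years at the 3.4-month rate, while a 2-year doubling period yields only about 6x over the same span.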
Huawei Data Center Solution: Building the Industry's Most Powerful AI Training Platform

Energy | Finance | Public utilities | Transportation | Carrier | Manufacturing | Education | …

• Atlas distributed training platform: Cluster management/Model management/Data pre-processing
• Framework: MindSpore/TensorFlow/PyTorch/…
• Chip enablement: CANN (Compute Architecture for Neural Networks)
• Common components: AXE toolchain/Security subsystem/Unified O&M system

• Atlas 300T: 256 TFLOPS FP16, world's most powerful training card
• Atlas 800: 2 PFLOPS FP16, world's most powerful training server
• Atlas 900: 256 to 1024 PFLOPS FP16, world's fastest AI training cluster
• Ascend 910 AI processor
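
The product figures above follow from multiplying the per-processor peak by the processor count. The sketch below assumes the counts quoted elsewhere in this deck (8 Ascend 910 processors per Atlas 800 server, 1024 per Atlas 900 cluster) and a hypothetical helper name. The choice of 1024 TFLOPS per PFLOPS is also an assumption, made only because it reproduces the rounded slide figures.

```python
ASCEND_910_TFLOPS_FP16 = 256  # peak FP16 figure quoted for one Ascend 910

def aggregate_tflops(processors: int, per_chip_tflops: int = ASCEND_910_TFLOPS_FP16) -> int:
    """Theoretical peak obtained by summing per-processor peaks (no scaling losses)."""
    return processors * per_chip_tflops

server_tflops = aggregate_tflops(8)      # one Atlas 800 server
cluster_tflops = aggregate_tflops(1024)  # one Atlas 900 cluster

# Assumption: the deck rounds with 1 PFLOPS = 1024 TFLOPS.
print(server_tflops, "TFLOPS =", server_tflops / 1024, "PFLOPS")
print(cluster_tflops, "TFLOPS =", cluster_tflops / 1024, "PFLOPS")
```

These are peak values only; sustained training throughput depends on how well the workload scales across processors.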
Ascend 910: World's Most Powerful AI Processor

• Architecture: Da Vinci
• 256 TFLOPS FP16
• 128-channel full HD video decoding
• Max. power consumption: 310 W

Ultimate computing power
256 TFLOPS FP16, twice the industry counterpart
[Chart: FLOPS, normalized to 16-bit. Ascend 910 at 256T versus counterparts at 125T, 90T, and 45T]

High integration
32 Huawei Da Vinci AI Cores + 16 TaiShan Cores + 100G RoCE v2 NIC, integrating AI computing, general computing, and I/O all in one processor

High-speed connection
Integrates the HCCS, PCIe 4.0, and 100G RoCE high-speed interfaces for faster data training and gradient synchronization
Ascend 910 Chip Architecture

Highly integrated AI SoC design: AI computing, general computing, and I/O 3-in-1
[Diagram labels: 32 AI Cores, 16 CPU Cores, Integrated high-speed network port, Pre-processing unit, High bandwidth memory (HBM), External DDR4 memory]

Atlas 300T AI Accelerator Card: Industry's Most Powerful AI Training Card
Model: 9000
Deep learning | Astronomical exploration | Oil exploration | Autonomous driving

• 256 TFLOPS FP16
• 32 GB HBM + 16 GB DDR4 memory
• PCIe 4.0 x16, standard full-height 3/4-length PCIe card, applicable to general-purpose servers
• 100G RoCE ports directly provided by the processor

Ultimate computing power
32 built-in Da Vinci AI Cores, providing 256 TFLOPS FP16 performance, twice the industry counterparts
[Chart: Atlas 300 at 256 TFLOPS versus others at 112 TFLOPS and 125 TFLOPS]

High-speed connection
Supports PCIe 4.0 and 100G RoCE high-speed interfaces, reducing the gradient synchronization latency by 10–70%

Large memory
Built-in 32 GB HBM and 16 GB two-level, large-capacity memory meets the bandwidth and capacity requirements of AI and general-purpose computing.
Atlas 800 AI Server: Industry's Most Powerful AI Training Server
Model: 9000
Deep learning | AI supercomputing
Distributed training platform

• 2 PFLOPS FP16
• 4U server, supporting 4 Kunpeng 920 and 8 Ascend 910 processors
• 32 DDR4 DIMMs and 10 x 2.5" hard drives
• 8 x 100GE + 2 x 100GE/4 x 25GE
• 5.5 kW maximum power consumption, supporting air cooling and liquid cooling

Industry's highest computing density
Up to 2 PFLOPS FP16 computing power in a 4U space, twice the industry counterpart

High perf./Watt
Supports air cooling and liquid cooling, with up to 2 PFLOPS/5.5 kW ultra-high energy efficiency, which is 1.6x the industry counterpart, meeting the requirements of deployment in enterprise equipment rooms and high-density clusters

High-speed network bandwidth
8 x 100G RoCE v2 high-speed interfaces, doubling the industry counterpart and reducing the inter-chip cross-server interconnect latency by 10–70%
Atlas 800 AI Server Architecture
Model: 9000

[Diagram labels: Kunpeng 920 processor, 32 DDR4 DIMMs, PCIe module, PSU, Ascend 910 AI processor, Liquid cooling pipe, 8 x 100GE optical ports]
Numbered callouts:
1 10 x 2.5" hard drives
2 Fan module
3 USB 3.0 port
4 VGA port
5 Slide-out label plate
Atlas 800 AI Server: Industry's Highest Computing Density
Model: 9000

Atlas 800 AI server
• Computing power: 2 PFLOPS
• Height: 4U
• Power consumption: 5.5 kW

Competitor's
• Computing power: 1 PFLOPS
• Height: 4U
• Power consumption: 4.4 kW

• Superior computing power: 8 Ascend 910 AI processors, up to 2 PFLOPS FP16 per server, outstripping competitor products by far
• Ultra-high density: 2 PFLOPS computing power in a 4U space, 2x that of the competing product
• Space-efficient: saves equipment room space by 50%, and drives down OPEX

[Chart: for the same 2 PFLOPS of server computing power, Vendor 1 occupies 8U and Atlas 800 occupies 4U, giving 2x computing density]
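
The 2x density and 1.6x perf./Watt ratios can be checked directly from the two spec lists on this slide. The following is a minimal sketch using only those quoted figures; the `Server` class and its property names are hypothetical, introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    name: str
    pflops_fp16: float  # peak computing power quoted on the slide
    height_u: int       # chassis height in rack units
    power_kw: float     # power consumption quoted on the slide

    @property
    def density(self) -> float:
        """Peak PFLOPS per rack unit."""
        return self.pflops_fp16 / self.height_u

    @property
    def efficiency(self) -> float:
        """Peak PFLOPS per kW."""
        return self.pflops_fp16 / self.power_kw

atlas = Server("Atlas 800", pflops_fp16=2.0, height_u=4, power_kw=5.5)
rival = Server("Competitor", pflops_fp16=1.0, height_u=4, power_kw=4.4)

density_ratio = atlas.density / rival.density
efficiency_ratio = atlas.efficiency / rival.efficiency
print(f"density: {density_ratio:.1f}x, perf./Watt: {efficiency_ratio:.1f}x")
```

Both ratios rest on peak figures at the stated power, so they describe the quoted specifications rather than measured training workloads.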
Atlas 800 AI Server: Ultra-High Energy Efficiency
Model: 9000

• 2 PFLOPS/5.5 kW, over 1.6x the energy efficiency (perf./Watt) of industry counterparts, reducing OPEX
• A single server supports air cooling and liquid cooling, meeting the requirements of deployment in enterprise equipment rooms and high-density clusters
• Counter-rotating fans for higher wind speeds
• Cellular board for higher porosity rate of the front panel
• Hybrid liquid cooling design: supports 50°C high-temperature inlet water, cooling the chips in spray mode
Atlas 800 AI Server: Ultra-High Network Bandwidth
Model: 9000

• Ultra-high bandwidth: 8 x 100 Gbit/s high-speed interfaces + 4 x 25GE or 2 x 100GE, 2x the industry bandwidth
• High-speed connection: HCCS, PCIe 4.0, and 100G RoCE integrated
• Ultra-low latency: provides a 100G inter-node interconnection interface based on RoCE v2, improving the training data and gradient synchronization efficiency, and shortening the inter-chip cross-server interconnect latency by 10–70%

Numbered callouts:
1 2 x 100GE or 4 x 25GE FlexIO card
2 Management network port and serial port
3 4 x GE LOM ports
4 8 x 100GE optical ports
Atlas 900 AI Cluster: Supercharging AI Training

• Leading computing power: 1024 Ascend 910 AI processors, 256–1024 PFLOPS FP16
• Best cluster network: HCCS, PCIe 4.0, 100G Ethernet interconnect, > 80% linearity
• Ultimate heat dissipation: > 95% liquid cooled, PUE < 1.1
Atlas 900 AI Cluster: World's Fastest AI Training Cluster

Atlas 900: world's fastest AI training cluster. World's No. 1: 59.8s
• Test benchmark: ResNet-50 V1.5 model, ImageNet-1k dataset
• Test time: September 2019

                                Vendor 1      Vendor 2      Atlas 900
Time                            79.8s         70.2s         59.8s
Computing power per processor   125 TFLOPS    125 TFLOPS    256 TFLOPS
Chip count per cluster          1536          2048          1024
Chip architecture               GPU           GPU           NPU

[Chart annotation: 15% between the 70.2s and 59.8s results]
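
The comparison above can be restated as two simple quantities: the relative time reduction against the next-best result, and the chip-seconds each cluster spent. The sketch below uses only the figures from the slide; the pairing of times to vendors follows the column order of the extracted table and is an assumption, as are the helper names.

```python
results = {
    "Vendor 1":  {"seconds": 79.8, "chips": 1536, "tflops_per_chip": 125},
    "Vendor 2":  {"seconds": 70.2, "chips": 2048, "tflops_per_chip": 125},
    "Atlas 900": {"seconds": 59.8, "chips": 1024, "tflops_per_chip": 256},
}

def time_reduction(baseline_s: float, candidate_s: float) -> float:
    """Fraction by which the candidate shortens the baseline time."""
    return (baseline_s - candidate_s) / baseline_s

def chip_seconds(entry: dict) -> float:
    """Wall-clock time multiplied by chip count, a rough measure of hardware spent."""
    return entry["seconds"] * entry["chips"]

fastest = min(results, key=lambda name: results[name]["seconds"])
reduction = time_reduction(results["Vendor 2"]["seconds"], results["Atlas 900"]["seconds"])

print(f"fastest: {fastest}, {reduction:.1%} shorter than the 70.2s result")
for name, entry in results.items():
    print(f"{name}: {chip_seconds(entry):,.1f} chip-seconds")
```

Chip-seconds ignore differences in per-chip peak performance, so they complement the time comparison rather than replace it.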
Atlas 900 AI Cluster: Industry's Best Cluster Network
Top AI cluster network

• The Atlas 900 AI training cluster uses three high-speed interconnect modes: HCCS, PCIe 4.0, and 100GE, and a dedicated 100 TB/s full-mesh, non-blocking synchronization network. This helps reduce the gradient synchronization latency by 10–70%.
• Huawei Collective Communication Library (HCCL) provides distributed parallel libraries for training networks. Communication libraries, network topologies, and training algorithms are optimized at the system level, improving job scheduling efficiency and delivering > 80% linearity.
• iLossless, a unique, intelligent lossless switching algorithm, learns and trains network traffic in real time, achieving zero packet loss and E2E μs-level latency.

[Diagram: CloudEngine switches linked to AI servers over 100G RoCE (labels: 8x, 64x); inside each AI server, processors (D) use the HCCS + PCIe interconnect. Legend: HCCS, PCIe 4.0, 100G RoCE]
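
Gradient synchronization and linearity are easier to picture with a small simulation. The source does not describe how HCCL implements its collectives, so the sketch below shows a generic ring all-reduce, a widely used scheme for summing gradients across workers, in plain Python. `ring_allreduce` and `linearity` are hypothetical names, and linearity is assumed here to mean measured N-worker throughput divided by N times the single-worker throughput.

```python
from typing import List

def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce (sum) over one gradient vector per worker."""
    n = len(grads)
    size = len(grads[0])
    if size % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    step = size // n
    # chunks[w][c] is worker w's local copy of chunk c.
    chunks = [[list(g[c * step:(c + 1) * step]) for c in range(n)] for g in grads]

    # Phase 1, reduce-scatter: after n - 1 steps, worker w holds the full sum
    # of chunk (w + 1) % n.
    for s in range(n - 1):
        sends = [(w, (w - s) % n, list(chunks[w][(w - s) % n])) for w in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n, list(chunks[w][(w + 1 - s) % n])) for w in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]

def linearity(throughput_n: float, throughput_1: float, n: int) -> float:
    """Scaling efficiency: measured N-worker throughput over the ideal N-fold speedup."""
    return throughput_n / (n * throughput_1)

synced = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(synced[0])  # every worker ends with the same summed gradient
```

In a ring, each worker sends and receives 2(n - 1)/n of the vector in total regardless of n, which is one reason link bandwidth and latency between neighbours matter for how close a cluster gets to linear scaling.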
Atlas 900: Industry's First Fully Liquid Cooled AI Cluster, with PUE < 1.1

Hybrid liquid cooling, achieving the ultimate energy efficiency
• Board-level liquid cooling dissipates 70% heat
• Rack-scale enclosed adiabatic design dissipates 30% heat by air-to-liquid heat exchange
• High-performance fan modules and VC heat sinks combined with field synergistic heat exchangers to improve heat dissipation efficiency by 10%
• Supports 50°C high-temperature water inlet (30°C by industry counterparts), improving cooling efficiency
• Real-time leakage detection and quick automatic shutdown, ensuring reliability

AI cluster with PUE reduced by ~30%
• Atlas 900: PUE < 1.1
• Industry's air-cooled clusters: PUE 1.5

Cooling capacity per rack   Rack quantity   Total power consumption
50 kW                       16 racks        704 kW
30 kW                       86 racks        1728 kW
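
The PUE and rack figures above reduce to a few ratios. The sketch below uses the standard definition of PUE (total facility energy divided by IT equipment energy) and the numbers on the slide. Treating the 16-rack row as the liquid-cooled Atlas 900 and the 86-rack row as an air-cooled counterpart is an assumption, since the extracted table carries no row labels; the helper names are hypothetical.

```python
def facility_power_kw(it_power_kw: float, pue: float) -> float:
    """Total facility power for a given IT load, from PUE = facility / IT."""
    return it_power_kw * pue

def relative_saving(before: float, after: float) -> float:
    """Fraction saved when going from `before` to `after`."""
    return (before - after) / before

# Same IT load at PUE 1.5 (air cooled) versus PUE 1.1 (liquid cooled).
pue_saving = relative_saving(facility_power_kw(1000.0, 1.5), facility_power_kw(1000.0, 1.1))

# Rows of the table; the row-to-system mapping is assumed, see above.
rack_saving = relative_saving(86, 16)
power_saving = relative_saving(1728.0, 704.0)

print(f"facility power saved by PUE 1.5 -> 1.1: {pue_saving:.1%}")
print(f"fewer racks: {rack_saving:.1%}, lower total power: {power_saving:.1%}")
```

With these inputs, the PUE change alone saves a little over a quarter of facility power for the same IT load, while the rack count and total power fall by roughly 81% and 59%.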
Atlas Accelerates AI Model Training for Various Applications

• Model training: Dataset, Algorithms & models, Parameter tuning, Computing verification, Trained models
• Model deployment: Model Tailor, Quantization, AI services
• Applications: Video analysis, Gene research, Autonomous driving, Weather forecast, Oil exploration

Atlas 300T AI accelerator card (Model: 9000) | Atlas 800 AI server (Model: 9000) | Atlas 900 AI cluster
AI Supercomputing: Builds the Infrastructure for Cloud Services

Atlas 900: AI supercomputing powered by Kunpeng and Ascend
• 256–1024 PFLOPS FP16
• World's No. 1 in performance benchmark test: 59.8s (time by vendor: Google 76.8s, Fujitsu 70.2s, Huawei 59.8s)
• 6195 x86 racks (40,268 kW) = 208 GPU racks (1,352 kW) = 16 Atlas racks (736 kW)

AI applications boost the development of the Greater Bay Area (GBA)
Smart transportation | Smart healthcare | Smart finance and more

National strategy
• AI supercomputing platform with international influence
• National open source platform for AI basics
• Open & innovative ecosystem for AI

Serving Shenzhen
• Supports major AI application requirements such as intelligent computing system and robot system in the GBA
• Improves the basic position and innovation capability of AI research on open source platforms and intelligent applications in the GBA
• Attracts national AI resources and talents

Benchmarking with NVIDIA: TCO reduced by 9.3% for the same computing power
1. Doubled computing power
Interconnects over 1024 Ascend 910 AI processors, providing double computing performance on a single chip compared with the industry
2. 70% shorter network latency
Integrates three high-speed interfaces: HCCS, PCIe 4.0, and 100G RoCE, reducing latency by up to 70%
3. Over 60% electricity saving and 80% smaller footprint
Hybrid liquid cooling system for 50 kW per rack, PUE < 1.1
Ultra-high-density prefabricated modular equipment room, low power consumption, fast deployment, and exascale CloudBrain cluster rollout in six months
Thank you.
Bring digital to every person, home and organization for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
