
AI/ML

Practical Techniques for Automating and Optimizing
Machine Learning on AWS

남궁영환, Data Scientist SA, Amazon Web Services
김대근, Data Scientist SA, Amazon Web Services
Agenda

• AI/ML at AWS

• Large-scale machine learning / deep learning training in the cloud


o Part 1
§ Infrastructure for ML on AWS
§ Horovod & TensorFlow distributed training on EC2, EKS, and SageMaker
o Part 2
§ fast.ai on AWS
§ MnasNet on AWS

• Summary

AWS ML Stack: the deepest and broadest set of ML capabilities and technologies

Vision | Speech | Language | Chatbots | Forecasting | Recommendations

AI SERVICES (app developers with little knowledge of ML):
REKOGNITION IMAGE, REKOGNITION VIDEO, TEXTRACT, POLLY, TRANSCRIBE,
TRANSLATE, COMPREHEND, LEX, FORECAST, PERSONALIZE

ML SERVICES (ML developers and data scientists):
AMAZON SAGEMAKER
  BUILD: pre-built algorithms & notebooks, data labeling (GROUND TRUTH)
  TRAIN: one-click model training & tuning, optimization (NEO), reinforcement learning
  DEPLOY: one-click deployment & hosting
  Algorithms & models (AWS MARKETPLACE FOR MACHINE LEARNING)

ML FRAMEWORKS & INFRASTRUCTURE (ML researchers and academics):
Frameworks, interfaces, and infrastructure — EC2 P3 & P3DN, EC2 C5, FPGAs,
GREENGRASS, ELASTIC INFERENCE, INFERENTIA

Scaling TensorFlow near-linearly to 256 GPUs (2018)

                        Stock TensorFlow    AWS-Optimized TensorFlow
Scaling efficiency
with 256 GPUs           65%                 90%

Training time           30 min              14 min

Available in Amazon SageMaker and the AWS Deep Learning AMIs.
https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/
https://www.slideshare.net/ExtractConf
https://eng.uber.com/horovod/

Why large-scale machine learning matters (1/3)

• Model performance keeps improving as more data accumulates.
• Deep learning adoption continues to grow across a wide range of domains.

  "How do data science techniques scale with amount of data?" – Andrew Ng

• Training ML/DL models on large volumes of data demands significant time and resources.
• The answer: "distributed training."

  The "data parallel" approach to distributed training – Uber

Why large-scale machine learning matters (2/3)

Choosing the right algorithm matters, but securing a large volume of
training data matters even more.

"These results suggest that we may want to reconsider the trade-off
between spending time and money on algorithm development versus
spending it on corpus development."

Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Microsoft Research (2001)
http://www.aclweb.org/anthology/P01-1005

Why large-scale machine learning matters (3/3)

Large-scale machine learning can call for very different solutions
depending on the problem and the approach.

• Common goals
  ü Computing, networking, containers, distributed-training performance tuning, . . .
  ü ML engineers focus on building models that drive business value, using their preferred ML/DL framework.

• Data management
  ü Scale of the data ∝ complexity of the task and the algorithm
  ü Durability and availability of the data

• Distributed computing frameworks
  ü Data-pipeline features (Dask, Ray, PyToolz, ipyparallel, etc.) — a minimal Dask sketch follows this list
  ü CPU ➝ GPU ➝ Multi-GPUs ➝ Multi-nodes
  ü TensorFlow, PyTorch, MXNet, . . .

• Build compute clusters to fit the workload!
Where to train and deploy deep learning models

"Choose the ML/DL model training and deployment environment that fits
the workload you are trying to solve."

• Amazon SageMaker
• AWS Deep Learning AMIs
• AWS Deep Learning Containers

• Amazon EC2
• Amazon Elastic Container Service for Kubernetes (EKS)
• Amazon Elastic Container Service (ECS)

Large-scale machine learning / deep learning training in the cloud

Infrastructure for ML on AWS

P3 instances
Three sizes, available in 14 regions.

Well suited to massively parallel workloads:
• Machine learning model training
• HPC (High Performance Computing) simulation
• 3D model rendering
• Video encoding

          P3.2xlarge    P3.8xlarge    P3.16xlarge
GPUs      1 x V100      4 x V100      8 x V100
vCPUs     8             32            64
Memory    61 GB         244 GB        488 GB

Up to 8 NVIDIA Tesla V100 GPUs per instance:
• 1 PetaFLOPS of compute (up to 14x more than P2 instances)
• 300 GB/s GPU-to-GPU interconnect via NVLink (9x faster than P2 instances)
• Supports every major ML framework and model type
• Available under a variety of purchase options
  (up to 70% cost savings with Spot instances)
https://aws.amazon.com/ko/ec2/instance-types/p3/
P3dn.24xlarge instance

• The most powerful GPU instance available in the cloud.
• Efficient large-scale ML training and HPC simulation: 100 Gbps network
  bandwidth enables multi-node clusters of 32+ instances.
• Fast access to training and simulation data
  (Amazon S3, network-based file systems, local instance storage).
• Large-model training and large-scale data processing with the latest
  NVIDIA V100 GPUs carrying 32 GB of GPU memory each.
• Well suited to optimizing data preprocessing
  (96 vCPUs using AWS custom Skylake CPUs and 768 GB of system memory).

Description               P3.16xlarge         P3dn.24xlarge       Improvement
Number and type of GPUs   8 x NVIDIA V100     8 x NVIDIA V100     -
GPU memory                16 GB/GPU           32 GB/GPU           100%
GPU peer-to-peer          NVLink - 300 GB/s   NVLink - 300 GB/s   -
CPU family                Broadwell           Skylake w/ AVX512
vCPUs                     64                  96                  50%
System memory             488 GB              768 GB              57%
Networking throughput     25 Gbps             100 Gbps            300%
EBS throughput            14 Gbps             14 Gbps             -
Local instance storage    No                  2.0 TB NVMe SSD
https://aws.amazon.com/ko/ec2/instance-types/p3/#Amazon_EC2_P3dn.24xlarge_Instances
Amazon FSx for Lustre

• A high-performance file system for machine learning, HPC, video
  processing, financial modeling, and similar workloads.
• Natively integrated with Amazon S3.
• Lustre delivers sub-millisecond latencies and throughput that scales to
  hundreds of gigabytes per second and millions of IOPS.
• POSIX-compatible, so existing Linux-based applications work without
  additional changes.
• Pay only for the resources you use (no minimum commitments or upfront fees).
• No client OS kernel module changes required.

(https://aws.amazon.com/ko/fsx/lustre/)
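As a hedged illustration of the S3 integration, the sketch below creates an S3-linked FSx for Lustre file system with boto3; the region, subnet ID, capacity, and bucket name are placeholders, not values from this deck:

```python
# Hypothetical sketch: create an FSx for Lustre file system that lazily
# hydrates from an S3 bucket (ImportPath). Identifiers are placeholders.
import boto3

fsx = boto3.client("fsx", region_name="us-west-2")
resp = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=3600,                       # in GiB
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet
    LustreConfiguration={
        "ImportPath": "s3://YOUR_BUCKET_NAME",  # S3-backed data repository
    },
)
print(resp["FileSystem"]["FileSystemId"])
```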
Infrastructure for ML on AWS (1/3)

Traditional HPC machine-learning cluster
• Deep learning application stack on instances in a placement group
• Cluster-wide persistent storage on a BeeGFS RAM-based storage array
  (Auto Scaling BeeGFS RAM storage nodes)
• Bastion host | BeeGFS management node | Cluster monitoring

Cloud-native machine-learning cluster
• Object store: Amazon S3 (model parameters are committed back to S3)
• Amazon FSx for Lustre hydrated from S3, plus Amazon EFS for shared files
• Multi-node TensorFlow containers (with the Lustre kernel driver) pulled
  from the Amazon ECR container registry
• AWS Batch multi-node parallel jobs on Auto Scaling P3 / P3dn container
  instances in a deep learning placement group
https://aws.amazon.com/ko/blogs/compute/distributed-deep-learning-made-easy/

Infrastructure for ML on AWS (2/3)

https://github.com/aws-samples/deep-learning-models/tree/master/hpc-cluster
https://github.com/awslabs/deeplearning-cfn

Traditional AWS Deep Learning Cluster

[Architecture: a VPC (10.0.0.0/16) with an Internet Gateway and a NAT
Gateway. An EC2 master instance sits in a public subnet (10.0.0.0/24) and
an Auto Scaling group of EC2 workers (10.0.1.1, 10.0.1.2, 10.0.1.3, ...)
sits in a private subnet (10.0.1.0/24); both share an AWS Elastic File
System. Amazon SQS master and worker queues coordinate worker setup, and
AWS Lambda, Amazon SNS, and Amazon S3 signal when the Auto Scaling setup
is complete.]
Infrastructure for ML on AWS (3/3)

Cloud-native AWS Deep Learning Cluster

[Architecture: an Amazon CloudWatch event triggers an AWS Step Functions
workflow, which submits an AWS Batch multi-node parallel job. FSx for
Lustre is hydrated from a TFRecord input bucket in S3 (archived to Amazon
Glacier), the TensorFlow image is pulled from the container registry, and
training runs on NVIDIA GPU-backed containers, writing results to an
output bucket.]

https://aws.amazon.com/ko/blogs/compute/scalable-deep-learning-training-using-multi-node-parallel-jobs-with-aws-batch-and-amazon-fsx-for-lustre/
Large-scale machine learning / deep learning training in the cloud

with Horovod & TensorFlow

Horovod (1/9)

• An open-source framework for distributed deep learning.
• Works with stock TensorFlow, Keras, PyTorch, and more.
• Quick and simple to install: `pip install horovod`
• Advanced all-reduce algorithms available.
• Supports high-performance networking (RDMA, GPUDirect).
• Separates ML engineering from infrastructure:
  ü The infrastructure team provides the container and MPI environment.
  ü ML engineers use their preferred deep learning framework.
  ü Both teams share common expectations for distributed training across frameworks.

horovod.ai

https://eng.uber.com/horovod/
Horovod (2/9)

[Figure: Ring-AllReduce among Workers A, B, and C — each worker's gradient
chunks are passed around the ring and accumulated step by step until all
workers hold the same reduced values.]

• Ring-AllReduce
  ü Scale of the exchanged data ∝ number of cluster nodes
• Synchronous updates
• NVIDIA's NCCL library (for GPU-level communication)
• Configurations
  ü Single-ring NCCL vs. Hierarchical AllReduce
    HOROVOD_HIERARCHICAL_ALLREDUCE=1
  ü Tensor Fusion
    HOROVOD_FUSION_THRESHOLD=67108864
    HOROVOD_CYCLE_TIME=5
  ü FP16 all-reduce
    hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
https://eng.uber.com/horovod/
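To make the ring-allreduce exchange sketched in the figure concrete, here is a toy single-process NumPy simulation (an illustration only, not Horovod's implementation; Horovod performs these exchanges between real workers over NCCL/MPI):

```python
# Toy ring-allreduce: n workers, each gradient split into n chunks.
# Phase 1 (scatter-reduce) leaves each worker owning one fully summed
# chunk; phase 2 (all-gather) circulates the summed chunks to everyone.
import numpy as np

def ring_allreduce(worker_grads):
    n = len(worker_grads)
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Phase 1: in step s, worker i sends chunk (i - s) % n to worker i + 1,
    # which accumulates it into its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # After phase 1, worker i holds the fully reduced chunk (i + 1) % n.
    # Phase 2: forward the completed chunks around the ring n - 1 times.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]  # identical on every worker

grads = [np.arange(6.0) * (w + 1) for w in range(3)]  # 3 simulated workers
print(ring_allreduce(grads)[0])  # -> [ 0.  6. 12. 18. 24. 30.]
```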
Horovod (3/9)

1. Initialize the library
import horovod.tensorflow as hvd
hvd.init()

2. Set the GPU to use
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

3. Scale the learning rate and add the Horovod distributed optimizer
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

4. Synchronize initial state between workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, ...) as mon_sess:
    ...
# OR
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)

5. Use checkpoints only on the first worker
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, ...) as mon_sess:
    ...

* Horovod for TensorFlow, Keras, and PyTorch
import horovod.tensorflow as hvd
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
# more frameworks coming

( source code from https://github.com/horovod/horovod )


Horovod (4/9)
Example run

# Use the AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit

laptop$ ssh ubuntu@<aws-ip-2>
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit

laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ ssh aws-ip-2
[will ask for prompt, say yes]
aws-ip-2$ exit

aws-ip-1$ wget https://raw.githubusercontent.com/uber/horovod/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 python tensorflow_mnist.py

aws-ip-1$ mpirun -bind-to none -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca btl_tcp_if_exclude lo,docker0 \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py

# Pro tip: hide mpirun args in mpirun.sh
aws-ip-1$ mpirun.sh -np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py

( source code from https://github.com/horovod/horovod )


Horovod (5/9)
[Reference] Example code – Horovod for TensorFlow

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing
# when done or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training
        mon_sess.run(train_op)

( source code from https://github.com/horovod/horovod )

Horovod (6/9)
[Reference] Example code – Estimator API

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
def model_fn(features, labels, mode):
    loss = ...
    opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())

    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)

    return tf.estimator.EstimatorSpec(...)

# Broadcast initial variable state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=ckpt_dir,
    config=tf.estimator.RunConfig(session_config=config))

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=hooks)

( source code from https://github.com/horovod/horovod )

Horovod (7/9)
[Reference] Example code – Horovod for MXNet

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)

( source code from https://github.com/horovod/horovod )

Horovod (8/9)
[Reference] Example code – Horovod for Keras

import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Build model...
model = ...
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())

# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(
    loss='categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'])

# Broadcast initial variable state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

...
model.fit(
    x_train,
    y_train,
    callbacks=callbacks,
    epochs=10,
    validation_data=(x_test, y_test))

( source code from https://github.com/horovod/horovod )

Horovod (9/9)
[Reference] Example code – Horovod for PyTorch

import torch
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Horovod: pin GPU to local rank
torch.cuda.set_device(hvd.local_rank())

# Build model...
model = Net()
model.cuda()
optimizer = optim.SGD(model.parameters())

# Wrap optimizer with DistributedOptimizer
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters())

# Horovod: broadcast parameters
hvd.broadcast_parameters(
    model.state_dict(),
    root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in ...:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

( source code from https://github.com/horovod/horovod )
Large-scale machine learning / deep learning training in the cloud

Scalable multi-node training (EC2)

Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Example training run with ImageNet:

• Preprocess on an instance dedicated to TFRecord conversion
  (a minimal sketch of the conversion step follows this list):
  ü t2.large instance with a 1.0 TB EBS sc1 volume
  ü Download the ImageNet dataset
  ü Transform the raw dataset into TFRecord files
  ü Upload the transformed dataset to Amazon S3:
    nohup aws s3 sync /data s3://YOUR_BUCKET_NAME >& upload.log &
• Set up all EC2 instances with the same instance type, AMI, data path,
  and model path.
• Check GPU utilization on the P3dn.24xlarge (and/or P3.16xlarge) instances.
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Example training run with ImageNet:

• Time-to-train: around 45 minutes
• 8 x P3dn.24xlarge instances
• ML model: ResNet-50
• Top-1 validation accuracy: 75.59%

https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Time-to-train: 47–50 minutes training on P3 instances.

[Figure: training throughput (images/second, up to ~50,000) vs. number of
GPUs (1–64), scaling near-linearly on P3 instances.]

Configuration (ResNet-50 & ImageNet):
• 8 x P3.16xlarge instances
• DL frameworks: TensorFlow, MXNet
• ML model: ResNet-50
• Dataset: ImageNet (1.2 million images)
• Top-1 validation accuracy: 76%

https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Time-to-train: 14.6 minutes.

[Figures: training performance (images/sec) w.r.t. TensorFlow & CUDA
versions; time to train vs. number of GPUs; images/sec, efficiency, and
communication overhead.]

Configuration (ResNet-50 & ImageNet):
• 32 x P3.16xlarge instances
• DL framework: TensorFlow
• ML model: ResNet-50
• Dataset: ImageNet
• Top-1 validation accuracy: 75.4%
• Top-5 validation accuracy: 92.6%

https://aws.amazon.com/ko/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/
Large-scale machine learning / deep learning training in the cloud

Optimizing distributed deep learning performance on Amazon EKS

Optimizing distributed deep learning performance on Amazon EKS (1/11)

[Reference] Modular and Scalable Amazon EKS Architecture

https://aws.amazon.com/ko/quickstart/architecture/amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (2/11)
Using Horovod on Amazon EKS
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html

• STEP 1. Install Kubeflow to set up a cluster for distributed training.

• STEP 2. Set the app name and initialize it.

• STEP 3. Install mpi-operator from Kubeflow.

• STEP 4. Create an MPIJob template; define the number of nodes (replicas)
  and the number of GPUs each node has (gpusPerReplica).

• STEP 5. Apply the manifest to the default environment.
  The MPIJob will create a launcher pod.
Optimizing distributed deep learning performance on Amazon EKS (3/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Automated benchmark workflow, from cluster creation to teardown.
• Supports multiple backend storage systems
  (e.g., Amazon EFS, Amazon FSx for Lustre).
• Integrates with S3 to store configuration and results.
• Backed by Kubeflow operators and Kubebench.
• Supports multiple deep learning frameworks
  (TF, TF + Horovod + OpenMPI, PyTorch, MXNet).
• Kubernetes cluster environments configurable to user requirements.
• Saves intermediate results and terminates clusters automatically.
• Multiple experiments can run in parallel.
Optimizing distributed deep learning performance on Amazon EKS (4/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Set up NFS
kubectl create -f deploy/benchmark-nfs-svc.yaml
kubectl get svc benchmark-nfs-svc -o=jsonpath={.spec.clusterIP}

# Replace the IP in `deploy/benchmark-nfs-volume.yaml` before the following step
kubectl create -f deploy/benchmark-nfs-volume.yaml

• Install Argo Workflows
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml

# You can forward a port to localhost and look at the Argo UI
kubectl port-forward deployment/argo-ui 8001:8001 -n argo

• Configure AWS credentials
• Configure your GitHub token
• Set up S3 buckets for your benchmark results and your training data
• Configure your Kubernetes cluster
Optimizing distributed deep learning performance on Amazon EKS (5/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Run the benchmark jobs
  1. Update your workflow settings using the ks command, or
  2. Update the benchmark workflow manifest directly:

s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/',
s3DatasetPath: 's3://eks-dl-benchmark/imagenet/',
clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml',
experiments: [{
  experiment: 'experiment-20190415-01',
  trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
  trainingJobPkg: 'mpi-job',
  trainingJobPrototype: 'mpi-job-custom',
  // Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged
  trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow',
}],
githubSecretName: 'github-token',
githubSecretTokenKeyName: 'GITHUB_TOKEN',
image: 'seedjeffwan/benchmark-runner:20190424',
name: '20190424-00',
namespace: 'default',
nfsVolume: 'benchmark-pv',
nfsVolumeClaim: 'benchmark-pvc',
region: 'us-west-2',
trainingDatasetVolume: 'dataset-claim',
s3SecretName: 'aws-secret',
s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID',
s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY',
storageBackend: 'fsx',
kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
Optimizing distributed deep learning performance on Amazon EKS (6/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

Optimizing distributed deep learning performance on Amazon EKS (7/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Kubernetes
  ü Supports a wide range of container-based ML/DL frameworks
  ü Elastic and easy to scale
  ü Increasingly adopted as an environment for training deep neural networks

• Amazon EKS
  ü Fully managed Kubernetes service
  ü Runs Kubernetes workloads easily on EC2 P2 and P3 instances

• Kubeflow
  ü Kubernetes-native platform for efficiently developing, managing, and deploying ML workloads
  ü Supports distributed training
    (native TensorFlow architecture or MPI AllReduce (NVIDIA NCCL library or Horovod))

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (8/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Amazon FSx for Lustre
  ü High-performance file system
  ü Best for workloads that demand fast processing (e.g., machine learning, HPC)
  ü Integrates natively with Amazon S3

• AWS FSx CSI driver
  ü Kubernetes-native access to FSx for Lustre file systems from containers
  ü Static and dynamic volume provisioning
  ü Containers on multiple nodes within a cluster can connect to the same Lustre file system
  ü S3 can serve as the Lustre data repository

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (9/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

"We observed 90%–100% near-linear scaling performance."

Machines               • 20 x p3.16xlarge (mixed precision)

Amazon EKS-optimized   • Kubernetes v1.11.8
AMI with GPU support   • MPI Operator Alpha from Kubeflow 0.4.1
                       • CUDA 10 with NVIDIA Tesla 410.104 driver
                       • Docker 18.06.1-ce (incl. nvidia-docker2)
                       • FSx CSI Driver v0.1

AWS FSx for Lustre     • Hydrated from an S3 bucket
filesystem               (for ImageNet TFRecords)

TensorFlow             • TENSORFLOW_VERSION: v1.13.1
(customized image)     • HOROVOD_VERSION: 0.16.0
                       • CUDNN_VERSION: 7.4.2.24-1+cuda10.0
                       • NCCL_VERSION: 2.4.2-1+cuda10.0
                       • OPENMPI 4.0.0

Dataset (ImageNet)     • 1.28 million images (1,000 classes)
                       • 1024 training files & 128 validation files
                         (TFRecords)

Relevant tools         • awscli, eksctl, ksonnet, and
                         aws-iam-authenticator

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (10/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Checklist for performance optimization (Part #1)

  ü Use the latest deep learning toolkits (e.g., the AMI for EKS)

  ü Set the GPU clock speed to its maximum (see: bootstrap command)

  ü Create instances within a placement group for low latency
    (a hedged boto3 sketch follows this list)

  ü Use the latest AWS VPC CNI plugin so that all NICs use jumbo frames
    by default across the EKS cluster
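The placement-group item above could look like the following boto3 sketch; the AMI ID, region, and names are placeholders, not values prescribed by the checklist:

```python
# Create a cluster placement group and launch GPU instances into it so
# that inter-node traffic stays low-latency.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.create_placement_group(GroupName="dl-pg", Strategy="cluster")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Deep Learning AMI
    InstanceType="p3.16xlarge",
    MinCount=2, MaxCount=2,
    Placement={"GroupName": "dl-pg"},
)
```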

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (11/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Checklist for performance optimization (Part #2)

  ü Choose the appropriate storage backend (EBS, EFS, FSx for Lustre, etc.)

  ü Use the static Kubernetes CPU management policy

  ü MPI processor affinity

  ü Build the TensorFlow environment with Intel MKL-DNN to optimize performance

  ü Optimize TensorFlow for parallel data transformation and threading
    (a minimal tf.data sketch follows this list)

  ü Tune thread pools and CPU performance
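The tf.data item above could look like this minimal sketch (TF 1.13-era API to match the rest of the deck; the file pattern, feature schema, and batch size are assumptions for illustration):

```python
# Parallel input pipeline: parallel TFRecord reads, parallel parsing,
# and prefetching so the CPU stays ahead of the GPUs.
import tensorflow as tf

def parse_fn(record):
    feats = {"image": tf.FixedLenFeature([], tf.string),
             "label": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(record, feats)

files = tf.data.Dataset.list_files("/data/train-*")  # placeholder path
dataset = (tf.data.TFRecordDataset(files, num_parallel_reads=8)
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))
```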

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Large-scale machine learning / deep learning training in the cloud

Distributed TensorFlow training on Amazon SageMaker

Distributed TensorFlow training on Amazon SageMaker (1/6)

Amazon SageMaker
• Amazon SageMaker provides prebuilt TensorFlow containers (TensorFlow v1.11+).
• Configure the hardware resources and hyperparameters for ML model training.
• Training instances: a cost-efficient, automatically managed cluster for ML model training.
• Approaches for distributed training:
  ü TensorFlow's native parameter server (TF v1.11+)
  ü Horovod (TF v1.12+)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (2/6)

Parameter servers
• Multiple dedicated processes that
  ü Collect gradients (computed by "worker" processes)
  ü Aggregate the gradients
  ü Distribute the updated gradients back to the workers asynchronously
• An all-to-all communication model.

• In Amazon SageMaker:
  ü No need to set up and manage the parameter server cluster manually
  ü A built-in script mode option

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (3/6)

Parameter servers – example code

from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

distributions = {
    'parameter_server': {
        'enabled': True
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_ps = TensorFlow(
    base_job_name='ps-imagenet-tf',
    source_dir='code',
    entry_point='train_ps.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=ps_instance_count,
    train_instance_type=ps_instance_type,
    model_dir=model_dir,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_ps.fit(inputs)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (4/6)

Horovod
• Easily automate Horovod cluster setup and execution on Amazon SageMaker.
• The SageMaker TensorFlow container
  ü sets up the MPI environment
  ü runs the mpirun command to start jobs on the cluster nodes

• Consider the following fields of the Estimator's distributions parameter:

  ü enabled (bool): set up for executing mpirun
  ü processes_per_host (int): number of processes MPI launches on each host
  ü custom_mpi_options (str): extra flags appended to mpirun when it runs on
    Amazon SageMaker (for Horovod training)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (5/6)

Horovod – example code

from sagemaker.tensorflow import TensorFlow

hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': hvd_processes_per_host,
        'custom_mpi_options':
            '-verbose --NCCL_DEBUG=INFO '
            '-x OMPI_MCA_btl_vader_single_copy_mechanism=none'
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_hvd = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_hvd.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_hvd.fit(inputs)
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (6/6)

Choose the distributed training approach that fits your needs:

• Scale up on a single machine with multiple GPUs ("data parallelism")
• Scale out with either parameter servers or Horovod ("cluster size")

Time to share gradients          If you want more      If you want more
                                 CPU performance       GPU performance
Long (larger # of gradients,     Parameter Server      Parameter Server or Horovod on
bigger model size)                                     a single instance with multi-GPUs
Short (smaller # of gradients,   Parameter Server      Horovod
smaller model size)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Large-scale machine learning / deep learning training in the cloud

fast.ai – Now anyone can train ImageNet in 18 minutes

fast.ai: Now anyone can train ImageNet in 18 minutes (1/5)

Summary of results

• ImageNet training results
  ü Time: 18 minutes
  ü Machines: 16 x p3.16xlarge on AWS (EC2)
  ü Compute cost: $48.00
  ü PyTorch

• Collaborators
  ü Yaroslav Bulatov
  ü Jeremy Howard
  ü Andrew Shaw
fast.ai: Now anyone can train ImageNet in 18 minutes (2/5)

How to train fast?

• Step 1
  ü Find a good baseline for a single machine

• Step 2
  ü Scale to multiple machines

[Chart: "An analysis of Deep Neural Network models for practical
applications" by Alfredo Canziani, Adam Paszke, Eugenio Culurciello]
fast.ai: Now anyone can train ImageNet in 18 minutes (3/5)

Single-machine training
• Trained ImageNet in 30 epochs (instead of 90).
• A single p3.16xlarge instance trains to 93% accuracy in 1.5 hours.

One-cycle learning rate (Leslie Smith) — a minimal PyTorch sketch follows below
• Start with a relatively high learning rate.
• 20% faster convergence on a single machine.

Progressive resizing for classification (fast.ai)
• Faster initial epochs: 2x training speedup at 128 px vs. 224 px.
• More accurate final epochs: 288-px images increased accuracy by 0.8%.

Rectangular image validation (fast.ai)
• Validate on images close to their original aspect ratio
  (instead of center-cropping images to 224 x 224).
• 23% speedup in training time to reach the benchmark accuracy of 93%.

[Chart: learning rate vs. number of steps under the one-cycle policy.]
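A minimal sketch of the one-cycle policy using the scheduler built into recent PyTorch (torch.optim.lr_scheduler.OneCycleLR); fast.ai's own implementation differs in detail, and the model and loss below are stand-ins:

```python
# One-cycle LR: the learning rate ramps up to max_lr, then anneals down,
# which is what enables the relatively high peak rate described above.
import torch

model = torch.nn.Linear(10, 2)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
epochs, steps_per_epoch = 30, 100
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1.0, epochs=epochs, steps_per_epoch=steps_per_epoch)

for _ in range(epochs * steps_per_epoch):
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                                     # advance the LR schedule
```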

fast.ai: Now anyone can train ImageNet in 18 minutes (4/5)

Distributed architecture

Distributed Data Parallel (PyTorch) + All-Reduce (NVIDIA NCCL*)

• Sync gradients after backprop:
  Forward → Backprop → Gradients → Data Sync

• Optimization: overlap the gradient sync with computation, so the sync
  runs while the remaining backprop is still in progress.

[Reference] apex.parallel.DistributedDataParallel

[Figure: six GPUs (GPU0–GPU5), each holding one shard of the batch
(batch0_0–batch0_5), exchanging gradients in an NCCL all-reduce ring.]

* NCCL: NVIDIA Collective Communications Library
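A minimal sketch of the pattern using PyTorch's built-in torch.nn.parallel.DistributedDataParallel (the fast.ai run used apex.parallel.DistributedDataParallel; the model and data below are stand-ins). Launch one process per GPU, e.g. with `python -m torch.distributed.launch`, which supplies the rendezvous environment variables:

```python
# DistributedDataParallel: each process trains on its own GPU and shard of
# the batch; gradients are all-reduced (via NCCL) during backward, which
# overlaps communication with the remaining backprop computation.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # NCCL does the all-reduce
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda()
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank])

opt = torch.optim.SGD(model.parameters(), lr=0.1)
out = model(torch.randn(32, 10).cuda())          # this process's shard
out.pow(2).mean().backward()                     # gradients synced here
opt.step()
```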

fast.ai: Now anyone can train ImageNet in 18 minutes (5/5)

Other considerations

Scaling techniques
• Tune batch normalization and scale the learning rate (Goyal et al.)
• Increase the batch size instead of decaying the learning rate (Google Brain)

Tips and tricks
• Spot instances: up to 70% cheaper
• AMI with the ImageNet data baked in (AWS Deep Learning AMI)
• Latency: provisioned IOPS + placement groups
  [Chart: images/sec vs. number of steps for a P3 instance reading from a
  10k-IOPS EBS volume, with S3 and the AMI as data sources.]

Run lots of experiments
• On AWS you can easily stand up a distributed environment and experiment:

git clone git@github.com:diux-dev/imagenet18.git
pip install -r requirements.txt
aws configure
python train.py

Large-scale machine learning / deep learning training in the cloud

Distributed Training of MnasNet on AWS

Distributed Training of MnasNet on AWS (1/4)

MnasNet
• An automated mobile NAS* approach.
• Trades off accuracy against latency with the objective

  maximize_m  ACC(m) × [ LAT(m) / T ]^w

  where w = α if LAT(m) ≤ T, and w = β otherwise
  (T is the target latency).

• An example of a MnasNet network architecture is shown in the video below.

https://www.youtube.com/watch?v=4uDZxefPd-I

* NAS: Neural Architecture Search

https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html
https://arxiv.org/pdf/1807.11626
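To make the objective concrete, here is a tiny helper that evaluates it for a candidate model; the soft-constraint setting alpha = beta = -0.07 and the 75 ms target are values quoted from the MnasNet paper, used here as assumptions:

```python
# MnasNet reward: accuracy scaled by a soft latency penalty. With
# alpha = beta, models over the latency target are smoothly penalized
# rather than rejected outright.
def mnasnet_reward(acc, lat_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
    w = alpha if lat_ms <= target_ms else beta
    return acc * (lat_ms / target_ms) ** w

print(mnasnet_reward(0.75, 80.0))  # slightly below 0.75: over the target
print(mnasnet_reward(0.75, 70.0))  # slightly above 0.75: under the target
```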
Distributed Training of MnasNet on AWS (2/4)

Example run (1/2)

# ec2cluster_p3_mnasnet_example.yaml defines the base params
# (naming, launch location, ...)
$ pip install ec2-cluster==0.3.1
$ ec3 create ec2cluster_p3_mnasnet_example.yaml
$ ec3 setup-horovod ec2cluster_p3_mnasnet_example.yaml
$ ec3 ssh-cmd ec2cluster_p3_mnasnet_example.yaml
...
$ source activate tensorflow_p36
$ mpirun -np 16 -hostfile /home/ubuntu/hostfile \
    -bind-to socket -map-by slot -mca plm_rsh_no_tree_spawn 1 \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
    -x NCCL_SOCKET_IFNAME=ens5 -mca btl_tcp_if_exclude lo,docker0 -x TF_CPP_MIN_LOG_LEVEL=0 \
    python /home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py --use_tpu=False \
    --data_dir=/home/ubuntu/data --model_dir=./results_hvd \
    --train_batch_size=256 --eval_batch_size=256 \
    --train_steps=109475 --skip_host_call=False --data_format='channels_first' \
    --transpose_input=False --use_horovod=True --eval_on_single_gpu=True
...

Distributed Training of MnasNet on AWS (3/4)

Example run (2/2)

...
I0923 16:15:22.086650 140202663954176 saver.py:1276] Restoring parameters from ./results_hvd/model.ckpt-62560
I0923 16:15:22.418808 140202663954176 session_manager.py:491] Running local_init_op.
I0923 16:15:22.426828 140202663954176 session_manager.py:493] Done running local_init_op.
I0923 16:15:47.475176 140202663954176 evaluation.py:277] Finished evaluation at 2019-09-23-16:15:47
I0923 16:15:47.475430 140202663954176 estimator.py:1979] Saving dict for global step 62560: global_step = 62560, loss =
2.1191003, top_1_accuracy = 0.74759614, top_5_accuracy = 0.9215545
I0923 16:15:47.475846 140202663954176 estimator.py:2039] Saving 'checkpoint_path' summary for global step 62560:
./results_hvd/model.ckpt-62560
I0923 16:15:47.476232 140202663954176 error_handling.py:93] evaluation_loop marked as finished
I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003,
'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0
I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] Finished training up to step 62560. Elapsed seconds 40649.

• time-to-train: ≈ 11.29 hrs


• Top-1 accuracy : 74.76%
• Top-5 accuracy : 92.16%

Distributed Training of MnasNet on AWS (4/4)

Example performance test results

Machines             • p3dn.24xlarge

TensorFlow           • TENSORFLOW_VERSION: v1.13.1 (p3dn.24xlarge)
                     • CUDNN_VERSION: 7.4.2.24-1+cuda10.0
                     • NCCL_VERSION: 2.4.2-1+cuda10.0
                     • OPENMPI 4.0.0

Dataset (ImageNet)   • 1.28 million images (1,000 classes)
                     • 1024 training files & 128 validation files (TFRecords)

Optimizations        • Mixed channel ordering
                     • Mixed XLA (for all ops except depth-wise convolution)
                     • LARC** optimizer
                     • HOROVOD_VERSION: 0.16.1

Num. of instances   Time-to-train (hours)   Top-1 validation accuracy (%)
1                   29                      75.2
2                   24.3                    74.5
4                   9.0                     74.67
8                   4.6                     74.16
16                  1.8 ~ 2.6               73.9 ~ 74.6

** LARC: Layer-wise Adaptive Rate Control
Summary

• Train smart, with tools that make distributed training easy.
• Experiment, experiment, and experiment.

• Amazon EC2 with the AWS DL AMI:
  ü Efficient linear scalability
  ü Flexibility
• Amazon EKS / Amazon ECS with AWS DL Containers:
  ü Efficient linear scalability
  ü Flexibility
• Amazon SageMaker:
  ü Efficient linear scalability
  ü Fully-managed service
Thank you!

We look forward to your feedback!

#AWSDEVDAYSEOUL

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
