
AI/ML

Practical Techniques for Automating and Optimizing
Machine Learning on AWS

남궁영환, Data Scientist SA, Amazon Web Services
김대근, Data Scientist SA, Amazon Web Services
Agenda

• AI/ML at AWS

• Large-scale machine learning / deep learning training in the cloud


o Part 1
§ Infrastructure for ML on AWS
§ Horovod & TensorFlow distributed training on EC2, EKS, and SageMaker
o Part 2
§ fast.ai on AWS
§ MnasNet on AWS

• Summary

AWS ML Stack: the deepest and broadest set of ML capabilities and technologies

Vision | Speech | Language | Chatbots | Forecasting | Recommendations

AI SERVICES (app developers with little knowledge of ML):
REKOGNITION IMAGE, REKOGNITION VIDEO, TEXTRACT, POLLY, TRANSCRIBE,
TRANSLATE, COMPREHEND, LEX, FORECAST, PERSONALIZE

ML SERVICES (ML developers and data scientists):
AMAZON SAGEMAKER
  BUILD: pre-built algorithms & notebooks, data labeling (GROUND TRUTH)
  TRAIN: one-click model training & tuning, optimization (NEO), reinforcement learning
  DEPLOY: one-click deployment & hosting
  Algorithms & models (AWS MARKETPLACE FOR MACHINE LEARNING)

ML FRAMEWORKS & INFRASTRUCTURE (ML researchers and academics):
Frameworks, interfaces, and infrastructure — EC2 P3 & P3DN, EC2 C5, FPGAs,
GREENGRASS, ELASTIC INFERENCE, INFERENTIA

Scaling TensorFlow near-linearly to 256 GPUs (2018)

                        Stock TensorFlow    AWS-Optimized TensorFlow
Scaling efficiency
with 256 GPUs           65%                 90%

Training time           30 min              14 min

Available in Amazon SageMaker and the AWS Deep Learning AMIs.
https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/
https://www.slideshare.net/ExtractConf
https://eng.uber.com/horovod/

Why large-scale machine learning matters (1/3)

• Model performance keeps improving as more data accumulates.
• Deep learning adoption continues to grow across a wide range of domains.

  "How do data science techniques scale with amount of data?" – Andrew Ng

• Training ML/DL models on large volumes of data demands significant time and resources.
• The answer: "distributed training."

  The "data parallel" approach to distributed training – Uber

Why large-scale machine learning matters (2/3)

Choosing the right algorithm matters, but securing a large volume of
training data matters even more.

"These results suggest that we may want to reconsider the trade-off
between spending time and money on algorithm development versus
spending it on corpus development."

Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Microsoft Research (2001)
http://www.aclweb.org/anthology/P01-1005

Why large-scale machine learning matters (3/3)

Large-scale machine learning can call for very different solutions
depending on the problem and the approach.

• Common goals
  ü Computing, networking, containers, distributed-training performance tuning, . . .
  ü ML engineers focus on building models that drive business value, using their preferred ML/DL framework.

• Data management
  ü Scale of the data ∝ complexity of the task and the algorithm
  ü Durability and availability of the data

• Distributed computing frameworks
  ü Data-pipeline features (Dask, Ray, PyToolz, ipyparallel, etc.) — a minimal Dask sketch follows this list
  ü CPU ➝ GPU ➝ Multi-GPUs ➝ Multi-nodes
  ü TensorFlow, PyTorch, MXNet, . . .

• Build compute clusters to fit the workload!
Where to train and deploy deep learning models

"Choose the ML/DL model training and deployment environment that fits
the workload you are trying to solve."

• Amazon SageMaker
• AWS Deep Learning AMIs
• AWS Deep Learning Containers

• Amazon EC2
• Amazon Elastic Container Service for Kubernetes (EKS)
• Amazon Elastic Container Service (ECS)

Large-scale machine learning / deep learning training in the cloud

Infrastructure for ML on AWS

P3 instances
Three sizes, available in 14 regions.

Well suited to massively parallel workloads:
• Machine learning model training
• HPC (High Performance Computing) simulation
• 3D model rendering
• Video encoding

          P3.2xlarge    P3.8xlarge    P3.16xlarge
GPUs      1 x V100      4 x V100      8 x V100
vCPUs     8             32            64
Memory    61 GB         244 GB        488 GB

Up to 8 NVIDIA Tesla V100 GPUs per instance:
• 1 PetaFLOPS of compute (up to 14x more than P2 instances)
• 300 GB/s GPU-to-GPU interconnect via NVLink (9x faster than P2 instances)
• Supports every major ML framework and model type
• Available under a variety of purchase options
  (up to 70% cost savings with Spot instances)
https://aws.amazon.com/ko/ec2/instance-types/p3/
P3dn.24xlarge instance

• The most powerful GPU instance available in the cloud.
• Efficient large-scale ML training and HPC simulation: 100 Gbps network
  bandwidth enables multi-node clusters of 32+ instances.
• Fast access to training and simulation data
  (Amazon S3, network-based file systems, local instance storage).
• Large-model training and large-scale data processing with the latest
  NVIDIA V100 GPUs carrying 32 GB of GPU memory each.
• Well suited to optimizing data preprocessing
  (96 vCPUs using AWS custom Skylake CPUs and 768 GB of system memory).

Description               P3.16xlarge         P3dn.24xlarge       Improvement
Number and type of GPUs   8 x NVIDIA V100     8 x NVIDIA V100     -
GPU memory                16 GB/GPU           32 GB/GPU           100%
GPU peer-to-peer          NVLink - 300 GB/s   NVLink - 300 GB/s   -
CPU family                Broadwell           Skylake w/ AVX512
vCPUs                     64                  96                  50%
System memory             488 GB              768 GB              57%
Networking throughput     25 Gbps             100 Gbps            300%
EBS throughput            14 Gbps             14 Gbps             -
Local instance storage    No                  2.0 TB NVMe SSD
https://aws.amazon.com/ko/ec2/instance-types/p3/#Amazon_EC2_P3dn.24xlarge_Instances
Amazon FSx for Lustre

• A high-performance file system for machine learning, HPC, video
  processing, financial modeling, and similar workloads.
• Natively integrated with Amazon S3.
• Lustre delivers sub-millisecond latencies and throughput that scales to
  hundreds of gigabytes per second and millions of IOPS.
• POSIX-compatible, so existing Linux-based applications work without
  additional changes.
• Pay only for the resources you use (no minimum commitments or upfront fees).
• No client OS kernel module changes required.

(https://aws.amazon.com/ko/fsx/lustre/)
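As a hedged illustration of the S3 integration, the sketch below creates an S3-linked FSx for Lustre file system with boto3; the region, subnet ID, capacity, and bucket name are placeholders, not values from this deck:

```python
# Hypothetical sketch: create an FSx for Lustre file system that lazily
# hydrates from an S3 bucket (ImportPath). Identifiers are placeholders.
import boto3

fsx = boto3.client("fsx", region_name="us-west-2")
resp = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=3600,                       # in GiB
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet
    LustreConfiguration={
        "ImportPath": "s3://YOUR_BUCKET_NAME",  # S3-backed data repository
    },
)
print(resp["FileSystem"]["FileSystemId"])
```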
Infrastructure for ML on AWS (1/3)

Traditional HPC machine-learning cluster
• Deep learning application stack on instances in a placement group
• Cluster-wide persistent storage on a BeeGFS RAM-based storage array
  (Auto Scaling BeeGFS RAM storage nodes)
• Bastion host | BeeGFS management node | Cluster monitoring

Cloud-native machine-learning cluster
• Object store: Amazon S3 (model parameters are committed back to S3)
• Amazon FSx for Lustre hydrated from S3, plus Amazon EFS for shared files
• Multi-node TensorFlow containers (with the Lustre kernel driver) pulled
  from the Amazon ECR container registry
• AWS Batch multi-node parallel jobs on Auto Scaling P3 / P3dn container
  instances in a deep learning placement group
https://aws.amazon.com/ko/blogs/compute/distributed-deep-learning-made-easy/

Infrastructure for ML on AWS (2/3)

https://github.com/aws-samples/deep-learning-models/tree/master/hpc-cluster
https://github.com/awslabs/deeplearning-cfn

Traditional AWS Deep Learning Cluster

[Architecture: a VPC (10.0.0.0/16) with an Internet Gateway and a NAT
Gateway. An EC2 master instance sits in a public subnet (10.0.0.0/24) and
an Auto Scaling group of EC2 workers (10.0.1.1, 10.0.1.2, 10.0.1.3, ...)
sits in a private subnet (10.0.1.0/24); both share an AWS Elastic File
System. Amazon SQS master and worker queues coordinate worker setup, and
AWS Lambda, Amazon SNS, and Amazon S3 signal when the Auto Scaling setup
is complete.]
Infrastructure for ML on AWS (3/3)

Cloud-native AWS Deep Learning Cluster

[Architecture: an Amazon CloudWatch event triggers an AWS Step Functions
workflow, which submits an AWS Batch multi-node parallel job. FSx for
Lustre is hydrated from a TFRecord input bucket in S3 (archived to Amazon
Glacier), the TensorFlow image is pulled from the container registry, and
training runs on NVIDIA GPU-backed containers, writing results to an
output bucket.]

https://aws.amazon.com/ko/blogs/compute/scalable-deep-learning-training-using-multi-node-parallel-jobs-with-aws-batch-and-amazon-fsx-for-lustre/
Large-scale machine learning / deep learning training in the cloud

with Horovod & TensorFlow

Horovod (1/9)

• An open-source framework for distributed deep learning.
• Works with stock TensorFlow, Keras, PyTorch, and more.
• Quick and simple to install: `pip install horovod`
• Advanced all-reduce algorithms available.
• Supports high-performance networking (RDMA, GPUDirect).
• Separates ML engineering from infrastructure:
  ü The infrastructure team provides the container and MPI environment.
  ü ML engineers use their preferred deep learning framework.
  ü Both teams share common expectations for distributed training across frameworks.

horovod.ai

https://eng.uber.com/horovod/
Horovod (2/9)

[Figure: Ring-AllReduce among Workers A, B, and C — each worker's gradient
chunks are passed around the ring and accumulated step by step until all
workers hold the same reduced values.]

• Ring-AllReduce
  ü Scale of the exchanged data ∝ number of cluster nodes
• Synchronous updates
• NVIDIA's NCCL library (for GPU-level communication)
• Configurations
  ü Single-ring NCCL vs. Hierarchical AllReduce
    HOROVOD_HIERARCHICAL_ALLREDUCE=1
  ü Tensor Fusion
    HOROVOD_FUSION_THRESHOLD=67108864
    HOROVOD_CYCLE_TIME=5
  ü FP16 all-reduce
    hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
https://eng.uber.com/horovod/
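To make the ring-allreduce exchange sketched in the figure concrete, here is a toy single-process NumPy simulation (an illustration only, not Horovod's implementation; Horovod performs these exchanges between real workers over NCCL/MPI):

```python
# Toy ring-allreduce: n workers, each gradient split into n chunks.
# Phase 1 (scatter-reduce) leaves each worker owning one fully summed
# chunk; phase 2 (all-gather) circulates the summed chunks to everyone.
import numpy as np

def ring_allreduce(worker_grads):
    n = len(worker_grads)
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Phase 1: in step s, worker i sends chunk (i - s) % n to worker i + 1,
    # which accumulates it into its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # After phase 1, worker i holds the fully reduced chunk (i + 1) % n.
    # Phase 2: forward the completed chunks around the ring n - 1 times.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]  # identical on every worker

grads = [np.arange(6.0) * (w + 1) for w in range(3)]  # 3 simulated workers
print(ring_allreduce(grads)[0])  # -> [ 0.  6. 12. 18. 24. 30.]
```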
Horovod (3/9)

1. Initialize the library
import horovod.tensorflow as hvd
hvd.init()

2. Set the GPU to use
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

3. Scale the learning rate and add the Horovod distributed optimizer
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

4. Synchronize initial state between workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, ...) as mon_sess:
    ...
# OR
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)

5. Use checkpoints only on the first worker
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, ...) as mon_sess:
    ...

* Horovod for TensorFlow, Keras, and PyTorch
import horovod.tensorflow as hvd
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
# more frameworks coming

( source code from https://github.com/horovod/horovod )


Horovod (4/9)
Example run

# Use the AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit

laptop$ ssh ubuntu@<aws-ip-2>
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit

laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ ssh aws-ip-2
[will ask for prompt, say yes]
aws-ip-2$ exit

aws-ip-1$ wget https://raw.githubusercontent.com/uber/horovod/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 python tensorflow_mnist.py

aws-ip-1$ mpirun -bind-to none -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca btl_tcp_if_exclude lo,docker0 \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py

# Pro tip: hide mpirun args in mpirun.sh
aws-ip-1$ mpirun.sh -np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py

( source code from https://github.com/horovod/horovod )


Horovod (5/9)
[Reference] Example code – Horovod for TensorFlow

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing
# when done or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training
        mon_sess.run(train_op)

( source code from https://github.com/horovod/horovod )

Horovod (6/9)
[Reference] Example code – Estimator API

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
def model_fn(features, labels, mode):
    loss = ...
    opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())

    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)

    return tf.estimator.EstimatorSpec(...)

# Broadcast initial variable state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=ckpt_dir,
    config=tf.estimator.RunConfig(session_config=config))

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=hooks)

( source code from https://github.com/horovod/horovod )

Horovod (7/9)
[Reference] Example code – Horovod for MXNet

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)

( source code from https://github.com/horovod/horovod )

Horovod (8/9)
[Reference] Example code – Horovod for Keras

import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Build model...
model = ...
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())

# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(
    loss='categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'])

# Broadcast initial variable state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

...
model.fit(
    x_train,
    y_train,
    callbacks=callbacks,
    epochs=10,
    validation_data=(x_test, y_test))

( source code from https://github.com/horovod/horovod )

Horovod (9/9)
[Reference] Example code – Horovod for PyTorch

import torch
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Horovod: pin GPU to local rank
torch.cuda.set_device(hvd.local_rank())

# Build model...
model = Net()
model.cuda()
optimizer = optim.SGD(model.parameters())

# Wrap optimizer with DistributedOptimizer
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters())

# Horovod: broadcast parameters
hvd.broadcast_parameters(
    model.state_dict(),
    root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in ...:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

( source code from https://github.com/horovod/horovod )
Large-scale machine learning / deep learning training in the cloud

Scalable multi-node training (EC2)

Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Example training run with ImageNet:

• Preprocess on an instance dedicated to TFRecord conversion
  (a minimal sketch of the conversion step follows this list):
  ü t2.large instance with a 1.0 TB EBS sc1 volume
  ü Download the ImageNet dataset
  ü Transform the raw dataset into TFRecord files
  ü Upload the transformed dataset to Amazon S3:
    nohup aws s3 sync /data s3://YOUR_BUCKET_NAME >& upload.log &
• Set up all EC2 instances with the same instance type, AMI, data path,
  and model path.
• Check GPU utilization on the P3dn.24xlarge (and/or P3.16xlarge) instances.
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Example training run with ImageNet:

• Time-to-train: around 45 minutes
• 8 x P3dn.24xlarge instances
• ML model: ResNet-50
• Top-1 validation accuracy: 75.59%

https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Time-to-train: 47–50 minutes training on P3 instances.

[Figure: training throughput (images/second, up to ~50,000) vs. number of
GPUs (1–64), scaling near-linearly on P3 instances.]

Configuration (ResNet-50 & ImageNet):
• 8 x P3.16xlarge instances
• DL frameworks: TensorFlow, MXNet
• ML model: ResNet-50
• Dataset: ImageNet (1.2 million images)
• Top-1 validation accuracy: 76%

https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2

Time-to-train: 14.6 minutes.

[Figures: training performance (images/sec) w.r.t. TensorFlow & CUDA
versions; time to train vs. number of GPUs; images/sec, efficiency, and
communication overhead.]

Configuration (ResNet-50 & ImageNet):
• 32 x P3.16xlarge instances
• DL framework: TensorFlow
• ML model: ResNet-50
• Dataset: ImageNet
• Top-1 validation accuracy: 75.4%
• Top-5 validation accuracy: 92.6%

https://aws.amazon.com/ko/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/
Large-scale machine learning / deep learning training in the cloud

Optimizing distributed deep learning performance on Amazon EKS

Optimizing distributed deep learning performance on Amazon EKS (1/11)

[Reference] Modular and Scalable Amazon EKS Architecture

https://aws.amazon.com/ko/quickstart/architecture/amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (2/11)
Using Horovod on Amazon EKS
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html

• STEP 1. Install Kubeflow to set up a cluster for distributed training.

• STEP 2. Set the app name and initialize it.

• STEP 3. Install mpi-operator from Kubeflow.

• STEP 4. Create an MPIJob template; define the number of nodes (replicas)
  and the number of GPUs each node has (gpusPerReplica).

• STEP 5. Apply the manifest to the default environment.
  The MPIJob will create a launcher pod.
Optimizing distributed deep learning performance on Amazon EKS (3/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Automated benchmark workflow, from cluster creation to teardown.
• Supports multiple backend storage systems
  (e.g., Amazon EFS, Amazon FSx for Lustre).
• Integrates with S3 to store configuration and results.
• Backed by Kubeflow operators and Kubebench.
• Supports multiple deep learning frameworks
  (TF, TF + Horovod + OpenMPI, PyTorch, MXNet).
• Kubernetes cluster environments configurable to user requirements.
• Saves intermediate results and terminates clusters automatically.
• Multiple experiments can run in parallel.
Optimizing distributed deep learning performance on Amazon EKS (4/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Set up NFS
kubectl create -f deploy/benchmark-nfs-svc.yaml
kubectl get svc benchmark-nfs-svc -o=jsonpath={.spec.clusterIP}

# Replace the IP in `deploy/benchmark-nfs-volume.yaml` before the following step
kubectl create -f deploy/benchmark-nfs-volume.yaml

• Install Argo Workflows
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml

# You can forward a port to localhost and look at the Argo UI
kubectl port-forward deployment/argo-ui 8001:8001 -n argo

• Configure AWS credentials
• Configure your GitHub token
• Set up S3 buckets for your benchmark results and your training data
• Configure your Kubernetes cluster
Optimizing distributed deep learning performance on Amazon EKS (5/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

• Run the benchmark jobs
  1. Update your workflow settings using the ks command, or
  2. Update the benchmark workflow manifest directly:

s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/',
s3DatasetPath: 's3://eks-dl-benchmark/imagenet/',
clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml',
experiments: [{
  experiment: 'experiment-20190415-01',
  trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
  trainingJobPkg: 'mpi-job',
  trainingJobPrototype: 'mpi-job-custom',
  // Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged
  trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow',
}],
githubSecretName: 'github-token',
githubSecretTokenKeyName: 'GITHUB_TOKEN',
image: 'seedjeffwan/benchmark-runner:20190424',
name: '20190424-00',
namespace: 'default',
nfsVolume: 'benchmark-pv',
nfsVolumeClaim: 'benchmark-pvc',
region: 'us-west-2',
trainingDatasetVolume: 'dataset-claim',
s3SecretName: 'aws-secret',
s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID',
s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY',
storageBackend: 'fsx',
kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
Optimizing distributed deep learning performance on Amazon EKS (6/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark

Optimizing distributed deep learning performance on Amazon EKS (7/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Kubernetes
  ü Supports a wide range of container-based ML/DL frameworks
  ü Elastic and easy to scale
  ü Increasingly adopted as an environment for training deep neural networks

• Amazon EKS
  ü Fully managed Kubernetes service
  ü Runs Kubernetes workloads easily on EC2 P2 and P3 instances

• Kubeflow
  ü Kubernetes-native platform for efficiently developing, managing, and deploying ML workloads
  ü Supports distributed training
    (native TensorFlow architecture or MPI AllReduce (NVIDIA NCCL library or Horovod))

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (8/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Amazon FSx for Lustre
  ü High-performance file system
  ü Best for workloads that demand fast processing (e.g., machine learning, HPC)
  ü Integrates natively with Amazon S3

• AWS FSx CSI driver
  ü Kubernetes-native access to FSx for Lustre file systems from containers
  ü Static and dynamic volume provisioning
  ü Containers on multiple nodes within a cluster can connect to the same Lustre file system
  ü S3 can serve as the Lustre data repository

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (9/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

"We observed 90%–100% near-linear scaling performance."

Machines               • 20 x p3.16xlarge (mixed precision)

Amazon EKS-optimized   • Kubernetes v1.11.8
AMI with GPU support   • MPI Operator Alpha from Kubeflow 0.4.1
                       • CUDA 10 with NVIDIA Tesla 410.104 driver
                       • Docker 18.06.1-ce (incl. nvidia-docker2)
                       • FSx CSI Driver v0.1

AWS FSx for Lustre     • Hydrated from an S3 bucket
filesystem               (for ImageNet TFRecords)

TensorFlow             • TENSORFLOW_VERSION: v1.13.1
(customized image)     • HOROVOD_VERSION: 0.16.0
                       • CUDNN_VERSION: 7.4.2.24-1+cuda10.0
                       • NCCL_VERSION: 2.4.2-1+cuda10.0
                       • OPENMPI 4.0.0

Dataset (ImageNet)     • 1.28 million images (1,000 classes)
                       • 1024 training files & 128 validation files
                         (TFRecords)

Relevant tools         • awscli, eksctl, ksonnet, and
                         aws-iam-authenticator

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (10/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Checklist for performance optimization (Part #1)

  ü Use the latest deep learning toolkits (e.g., the AMI for EKS)

  ü Set the GPU clock speed to its maximum (see: bootstrap command)

  ü Create instances within a placement group for low latency
    (a hedged boto3 sketch follows this list)

  ü Use the latest AWS VPC CNI plugin so that all NICs use jumbo frames
    by default across the EKS cluster
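The placement-group item above could look like the following boto3 sketch; the AMI ID, region, and names are placeholders, not values prescribed by the checklist:

```python
# Create a cluster placement group and launch GPU instances into it so
# that inter-node traffic stays low-latency.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.create_placement_group(GroupName="dl-pg", Strategy="cluster")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Deep Learning AMI
    InstanceType="p3.16xlarge",
    MinCount=2, MaxCount=2,
    Placement={"GroupName": "dl-pg"},
)
```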

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (11/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )

• Checklist for performance optimization (Part #2)

  ü Choose the appropriate storage backend (EBS, EFS, FSx for Lustre, etc.)

  ü Use the static Kubernetes CPU management policy

  ü MPI processor affinity

  ü Build the TensorFlow environment with Intel MKL-DNN to optimize performance

  ü Optimize TensorFlow for parallel data transformation and threading
    (a minimal tf.data sketch follows this list)

  ü Tune thread pools and CPU performance
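The tf.data item above could look like this minimal sketch (TF 1.13-era API to match the rest of the deck; the file pattern, feature schema, and batch size are assumptions for illustration):

```python
# Parallel input pipeline: parallel TFRecord reads, parallel parsing,
# and prefetching so the CPU stays ahead of the GPUs.
import tensorflow as tf

def parse_fn(record):
    feats = {"image": tf.FixedLenFeature([], tf.string),
             "label": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(record, feats)

files = tf.data.Dataset.list_files("/data/train-*")  # placeholder path
dataset = (tf.data.TFRecordDataset(files, num_parallel_reads=8)
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))
```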

https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Large-scale machine learning / deep learning training in the cloud

Distributed TensorFlow training on Amazon SageMaker

Distributed TensorFlow training on Amazon SageMaker (1/6)

Amazon SageMaker
• Amazon SageMaker provides prebuilt TensorFlow containers (TensorFlow v1.11+).
• Configure the hardware resources and hyperparameters for ML model training.
• Training instances: a cost-efficient, automatically managed cluster for ML model training.
• Approaches for distributed training:
  ü TensorFlow's native parameter server (TF v1.11+)
  ü Horovod (TF v1.12+)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (2/6)

Parameter servers
• Multiple dedicated processes that
  ü Collect gradients (computed by "worker" processes)
  ü Aggregate the gradients
  ü Distribute the updated gradients back to the workers asynchronously
• An all-to-all communication model.

• In Amazon SageMaker:
  ü No need to set up and manage the parameter server cluster manually
  ü A built-in script mode option

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (3/6)

Parameter servers – example code

from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

distributions = {
    'parameter_server': {
        'enabled': True
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_ps = TensorFlow(
    base_job_name='ps-imagenet-tf',
    source_dir='code',
    entry_point='train_ps.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=ps_instance_count,
    train_instance_type=ps_instance_type,
    model_dir=model_dir,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_ps.fit(inputs)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (4/6)

Horovod
• Easily automate Horovod cluster setup and execution on Amazon SageMaker.
• The SageMaker TensorFlow container
  ü sets up the MPI environment
  ü runs the mpirun command to start jobs on the cluster nodes

• Consider the following fields of the Estimator's distributions parameter:

  ü enabled (bool): set up for executing mpirun
  ü processes_per_host (int): number of processes MPI launches on each host
  ü custom_mpi_options (str): extra flags appended to mpirun when it runs on
    Amazon SageMaker (for Horovod training)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (5/6)

Horovod – example code

from sagemaker.tensorflow import TensorFlow

hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': hvd_processes_per_host,
        'custom_mpi_options':
            '-verbose --NCCL_DEBUG=INFO '
            '-x OMPI_MCA_btl_vader_single_copy_mechanism=none'
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_hvd = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_hvd.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_hvd.fit(inputs)
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Distributed TensorFlow training on Amazon SageMaker (6/6)

Choose the distributed training approach that fits your needs:

• Scale up on a single machine with multiple GPUs ("data parallelism")
• Scale out with either parameter servers or Horovod ("cluster size")

Time to share gradients          If you want more      If you want more
                                 CPU performance       GPU performance
Long (larger # of gradients,     Parameter Server      Parameter Server or Horovod on
bigger model size)                                     a single instance with multi-GPUs
Short (smaller # of gradients,   Parameter Server      Horovod
smaller model size)

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
Large-scale machine learning / deep learning training in the cloud

fast.ai – Now anyone can train ImageNet in 18 minutes

fast.ai: Now anyone can train ImageNet in 18 minutes (1/5)

Summary of results

• ImageNet training results
  ü Time: 18 minutes
  ü Machines: 16 x p3.16xlarge on AWS (EC2)
  ü Compute cost: $48.00
  ü PyTorch

• Collaborators
  ü Yaroslav Bulatov
  ü Jeremy Howard
  ü Andrew Shaw
fast.ai: Now anyone can train ImageNet in 18 minutes (2/5)

How to train fast?

• Step 1
  ü Find a good baseline for a single machine

• Step 2
  ü Scale to multiple machines

[Chart: "An analysis of Deep Neural Network models for practical
applications" by Alfredo Canziani, Adam Paszke, Eugenio Culurciello]
fast.ai: Now anyone can train ImageNet in 18 minutes (3/5)

Single-machine training
• Trained ImageNet in 30 epochs (instead of 90).
• A single p3.16xlarge instance trains to 93% accuracy in 1.5 hours.

One-cycle learning rate (Leslie Smith) — a minimal PyTorch sketch follows below
• Start with a relatively high learning rate.
• 20% faster convergence on a single machine.

Progressive resizing for classification (fast.ai)
• Faster initial epochs: 2x training speedup at 128 px vs. 224 px.
• More accurate final epochs: 288-px images increased accuracy by 0.8%.

Rectangular image validation (fast.ai)
• Validate on images close to their original aspect ratio
  (instead of center-cropping images to 224 x 224).
• 23% speedup in training time to reach the benchmark accuracy of 93%.

[Chart: learning rate vs. number of steps under the one-cycle policy.]
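A minimal sketch of the one-cycle policy using the scheduler built into recent PyTorch (torch.optim.lr_scheduler.OneCycleLR); fast.ai's own implementation differs in detail, and the model and loss below are stand-ins:

```python
# One-cycle LR: the learning rate ramps up to max_lr, then anneals down,
# which is what enables the relatively high peak rate described above.
import torch

model = torch.nn.Linear(10, 2)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
epochs, steps_per_epoch = 30, 100
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1.0, epochs=epochs, steps_per_epoch=steps_per_epoch)

for _ in range(epochs * steps_per_epoch):
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                                     # advance the LR schedule
```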

fast.ai: Now anyone can train ImageNet in 18 minutes (4/5)

Distributed architecture

Distributed Data Parallel (PyTorch) + All-Reduce (NVIDIA NCCL*)

• Sync gradients after backprop:
  Forward → Backprop → Gradients → Data Sync

• Optimization: overlap the gradient sync with computation, so the sync
  runs while the remaining backprop is still in progress.

[Reference] apex.parallel.DistributedDataParallel

[Figure: six GPUs (GPU0–GPU5), each holding one shard of the batch
(batch0_0–batch0_5), exchanging gradients in an NCCL all-reduce ring.]

* NCCL: NVIDIA Collective Communications Library
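A minimal sketch of the pattern using PyTorch's built-in torch.nn.parallel.DistributedDataParallel (the fast.ai run used apex.parallel.DistributedDataParallel; the model and data below are stand-ins). Launch one process per GPU, e.g. with `python -m torch.distributed.launch`, which supplies the rendezvous environment variables:

```python
# DistributedDataParallel: each process trains on its own GPU and shard of
# the batch; gradients are all-reduced (via NCCL) during backward, which
# overlaps communication with the remaining backprop computation.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # NCCL does the all-reduce
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda()
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank])

opt = torch.optim.SGD(model.parameters(), lr=0.1)
out = model(torch.randn(32, 10).cuda())          # this process's shard
out.pow(2).mean().backward()                     # gradients synced here
opt.step()
```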

fast.ai: Now anyone can train ImageNet in 18 minutes (5/5)

Other considerations

Scaling techniques
• Tune batch normalization and scale the learning rate (Goyal et al.)
• Increase the batch size instead of decaying the learning rate (Google Brain)

Tips and tricks
• Spot instances: up to 70% cheaper
• AMI with the ImageNet data baked in (AWS Deep Learning AMI)
• Latency: provisioned IOPS + placement groups
  [Chart: images/sec vs. number of steps for a P3 instance reading from a
  10k-IOPS EBS volume, with S3 and the AMI as data sources.]

Run lots of experiments
• On AWS you can easily stand up a distributed environment and experiment:

git clone git@github.com:diux-dev/imagenet18.git
pip install -r requirements.txt
aws configure
python train.py

Large-scale machine learning / deep learning training in the cloud

Distributed Training of MnasNet on AWS

Distributed Training of MnasNet on AWS (1/4)

MnasNet
• An automated mobile NAS* approach.
• Trades off accuracy against latency with the objective

  maximize_m  ACC(m) × [ LAT(m) / T ]^w

  where w = α if LAT(m) ≤ T, and w = β otherwise
  (T is the target latency).

• An example of a MnasNet network architecture is shown in the video below.

https://www.youtube.com/watch?v=4uDZxefPd-I

* NAS: Neural Architecture Search

https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html
https://arxiv.org/pdf/1807.11626
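To make the objective concrete, here is a tiny helper that evaluates it for a candidate model; the soft-constraint setting alpha = beta = -0.07 and the 75 ms target are values quoted from the MnasNet paper, used here as assumptions:

```python
# MnasNet reward: accuracy scaled by a soft latency penalty. With
# alpha = beta, models over the latency target are smoothly penalized
# rather than rejected outright.
def mnasnet_reward(acc, lat_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
    w = alpha if lat_ms <= target_ms else beta
    return acc * (lat_ms / target_ms) ** w

print(mnasnet_reward(0.75, 80.0))  # slightly below 0.75: over the target
print(mnasnet_reward(0.75, 70.0))  # slightly above 0.75: under the target
```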
Distributed Training of MnasNet on AWS (2/4)

Example run (1/2)

# ec2cluster_p3_mnasnet_example.yaml defines the base params
# (naming, launch location, ...)
$ pip install ec2-cluster==0.3.1
$ ec3 create ec2cluster_p3_mnasnet_example.yaml
$ ec3 setup-horovod ec2cluster_p3_mnasnet_example.yaml
$ ec3 ssh-cmd ec2cluster_p3_mnasnet_example.yaml
...
$ source activate tensorflow_p36
$ mpirun -np 16 -hostfile /home/ubuntu/hostfile \
    -bind-to socket -map-by slot -mca plm_rsh_no_tree_spawn 1 \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
    -x NCCL_SOCKET_IFNAME=ens5 -mca btl_tcp_if_exclude lo,docker0 -x TF_CPP_MIN_LOG_LEVEL=0 \
    python /home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py --use_tpu=False \
    --data_dir=/home/ubuntu/data --model_dir=./results_hvd \
    --train_batch_size=256 --eval_batch_size=256 \
    --train_steps=109475 --skip_host_call=False --data_format='channels_first' \
    --transpose_input=False --use_horovod=True --eval_on_single_gpu=True
...

Distributed Training of MnasNet on AWS (3/4)

Example run (2/2)

...
I0923 16:15:22.086650 140202663954176 saver.py:1276] Restoring parameters from ./results_hvd/model.ckpt-62560
I0923 16:15:22.418808 140202663954176 session_manager.py:491] Running local_init_op.
I0923 16:15:22.426828 140202663954176 session_manager.py:493] Done running local_init_op.
I0923 16:15:47.475176 140202663954176 evaluation.py:277] Finished evaluation at 2019-09-23-16:15:47
I0923 16:15:47.475430 140202663954176 estimator.py:1979] Saving dict for global step 62560: global_step = 62560, loss =
2.1191003, top_1_accuracy = 0.74759614, top_5_accuracy = 0.9215545
I0923 16:15:47.475846 140202663954176 estimator.py:2039] Saving 'checkpoint_path' summary for global step 62560:
./results_hvd/model.ckpt-62560
I0923 16:15:47.476232 140202663954176 error_handling.py:93] evaluation_loop marked as finished
I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003,
'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0
I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] Finished training up to step 62560. Elapsed seconds 40649.

• time-to-train: ≈ 11.29 hrs


• Top-1 accuracy : 74.76%
• Top-5 accuracy : 92.16%

Distributed Training of MnasNet on AWS (4/4)

Example performance test results

Machines             • p3dn.24xlarge

TensorFlow           • TENSORFLOW_VERSION: v1.13.1 (p3dn.24xlarge)
                     • CUDNN_VERSION: 7.4.2.24-1+cuda10.0
                     • NCCL_VERSION: 2.4.2-1+cuda10.0
                     • OPENMPI 4.0.0

Dataset (ImageNet)   • 1.28 million images (1,000 classes)
                     • 1024 training files & 128 validation files (TFRecords)

Optimizations        • Mixed channel ordering
                     • Mixed XLA (for all ops except depth-wise convolution)
                     • LARC** optimizer
                     • HOROVOD_VERSION: 0.16.1

Num. of instances   Time-to-train (hours)   Top-1 validation accuracy (%)
1                   29                      75.2
2                   24.3                    74.5
4                   9.0                     74.67
8                   4.6                     74.16
16                  1.8 ~ 2.6               73.9 ~ 74.6

** LARC: Layer-wise Adaptive Rate Control
Summary

• Train smart, with tools that make distributed training easy.
• Experiment, experiment, and experiment.

• Amazon EC2 with the AWS DL AMI:
  ü Efficient linear scalability
  ü Flexibility
• Amazon EKS / Amazon ECS with AWS DL Containers:
  ü Efficient linear scalability
  ü Flexibility
• Amazon SageMaker:
  ü Efficient linear scalability
  ü Fully-managed service
Thank you!

We look forward to your feedback!

#AWSDEVDAYSEOUL

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
