
CMP311-R

How NextRoll leverages AWS Batch
for daily business operations

Roozbeh Zabihollahi
Technical Lead, NextRoll

Steve Kendrex
Senior Technical Product Manager, AWS Batch and HPC, Amazon Web Services
Introducing AWS Batch

Fully managed
No software to install or servers to manage. AWS Batch provisions,
manages, and scales your infrastructure

Integrated with AWS
Natively integrated with the AWS platform, AWS Batch jobs can easily
and securely interact with services such as Amazon S3, Amazon
DynamoDB, and Amazon Rekognition

Cost-optimized AWS resource provisioning
AWS Batch automatically provisions compute resources tailored to the
needs of your jobs using Amazon EC2 and EC2 Spot
Who uses AWS Batch?

• Weather/systems modeling
• Gene sequencing
• Financial market/risk analysis
• Autonomous vehicle ML and simulation
• Oil and gas exploration
• And much more


Typical AWS Batch job architecture

Diagram: input is put into an Amazon S3 bucket; a job definition
(application image + configuration, plus an IAM role) describes the work;
jobs land in a job queue with runnable jobs; the scheduler places them
onto the AWS Batch compute environment; output is put into an Amazon S3
bucket.
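A minimal sketch of wiring up this architecture with the AWS CLI; the
environment and queue names, subnet, security group, and role values
below are placeholders, not values from the talk:

# Managed compute environment (placeholder network and role values)
aws batch create-compute-environment \
  --compute-environment-name demo-ce \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources type=EC2,minvCpus=0,maxvCpus=256,instanceTypes=optimal,subnets=subnet-EXAMPLE,securityGroupIds=sg-EXAMPLE,instanceRole=ecsInstanceRole

# Job queue that feeds the compute environment
aws batch create-job-queue \
  --job-queue-name demo-queue \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=demo-ce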
New: Allocation strategies for AWS Batch
Make capacity/throughput/cost tradeoffs

SPOT_CAPACITY_OPTIMIZED – Allows AWS to predict the deepest Spot
capacity pools and launch instances accordingly. Available for Spot only

BEST_FIT – Same behavior as before. Compute environments (CEs)
created through the CLI/SDK default to this (to preserve backward
compatibility). Spot and On-Demand CEs supported. Not recommended
for most use cases

BEST_FIT_PROGRESSIVE – Same as BEST_FIT, but when we hit a
capacity error (ICE, Spot reclaim, EC2 limit), AWS Batch progressively
sorts through the list and picks the next-best-fit instance type.
Recommended for On-Demand CEs, or for Spot CEs with a specific use
case
Diagram: SubmitJob sends jobs to job queue JQ1, which is attached to two
compute environments: CE1 (On-Demand, instance types: optimal, max
vCPUs: 100, allocation strategy: BEST_FIT_PROGRESSIVE) and CE2 (Spot,
instance types: optimal, max vCPUs: 2,000, allocation strategy:
SPOT_CAPACITY_OPTIMIZED).
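A hedged sketch of creating a Spot CE like CE2 above using the
allocationStrategy parameter; the account, network, and role values are
placeholders:

aws batch create-compute-environment \
  --compute-environment-name ce2-spot \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources type=SPOT,allocationStrategy=SPOT_CAPACITY_OPTIMIZED,minvCpus=0,maxvCpus=2000,instanceTypes=optimal,subnets=subnet-EXAMPLE,securityGroupIds=sg-EXAMPLE,instanceRole=ecsInstanceRole,spotIamFleetRole=arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole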
Jobs

Jobs are the unit of work executed by AWS Batch as containerized
applications running on Amazon EC2

Containerized jobs can reference a container image, command, and
parameters, or users can simply provide a .zip containing their
application and we will run it on a default Amazon Linux container

aws batch submit-job --job-name sim-variation-1 \
  --job-definition sim-sensors \
  --job-queue high-mem-and-cpu
Job definitions
Similar to Amazon Elastic Container Service (Amazon ECS) task definitions,
AWS Batch job definitions specify how jobs are to be run. While each job
must reference a job definition, many parameters can be overridden at
submit time
Some of the attributes specified in a job definition:
• AWS Identity and Access Management (IAM) role associated with the job
• vCPU and memory requirements
• Retry strategy
• Mount points
• Container properties
• Environment variables

aws batch register-job-definition --job-definition-name sim \
  --container-properties ...
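For illustration, a fuller registration and a submit-time override; the
image URI, role ARN, and command below are made-up placeholders rather
than values from the talk:

aws batch register-job-definition \
  --job-definition-name sim \
  --type container \
  --retry-strategy attempts=2 \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sim:latest",
    "vcpus": 4,
    "memory": 8192,
    "command": ["python", "run_sim.py"],
    "jobRoleArn": "arn:aws:iam::123456789012:role/sim-job-role",
    "environment": [{"name": "STAGE", "value": "prod"}]
  }'

# Most of these can then be overridden per job, e.g. just the command:
aws batch submit-job --job-name sim-variation-2 \
  --job-definition sim \
  --job-queue high-mem-and-cpu \
  --container-overrides '{"command": ["python", "run_sim.py", "--variation", "2"]}'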
Workflows, pipelines, and job dependencies

Jobs can express a dependency on the successful
completion of other jobs, or of specific elements of an
array job

Use your preferred workflow engine and language
to submit jobs. Flow-based systems simply submit
jobs serially, while DAG-based systems submit
many jobs at once, identifying inter-job
dependencies

aws batch submit-job --depends-on jobId=606b3ad1-aa31-48d8-92ec-f154bfc8215f


Model example

Diagram: Job-A is a single job. Job-B, Job-C, and Job-D are 100-element
array jobs (B:0 … B:99, and so on). B is dependent on A; C has an
N_TO_N dependency on B (C:i runs after B:i), same for D and C; Job-E
and Job-F depend on D.
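A hedged CLI sketch of submitting the first part of this chain; the job
definition and queue names are placeholders, and the captured IDs are
illustrative:

# Job-A: a single job
A=$(aws batch submit-job --job-name job-a --job-definition sim \
      --job-queue q1 --query jobId --output text)

# Job-B: a 100-element array job gated on Job-A
B=$(aws batch submit-job --job-name job-b --job-definition sim \
      --job-queue q1 --array-properties size=100 \
      --depends-on jobId=$A --query jobId --output text)

# Job-C: element-for-element (N_TO_N) dependency on Job-B,
# i.e. C:i starts as soon as B:i succeeds
C=$(aws batch submit-job --job-name job-c --job-definition sim \
      --job-queue q1 --array-properties size=100 \
      --depends-on jobId=$B,type=N_TO_N --query jobId --output text)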
About NextRoll

37,000+ growing brands

1.2 billion shopper profiles
80B+ AI predictions per day
Access to 500 supply sources, emails, and onsite presentations

Customers

• D2C
  • TeePublic, Student.com, Wigs.com, and over 37,000 more customers
• B2B
  • Pantheon, PagerDuty, WP Engine, IBM Cloud Video, and more
• Platform Services
  • Rakuten, Springbot
Cloud processing

The pile of data resulting from advertising activity is
a lot more than anyone can fit on their laptop

(Unless it's like a super really big laptop
with thousands of terabytes of disk space)
Who knows?

$4,000,000
Batch and Spot savings
Why AWS Batch?
• Freedom of stack (since it uses Docker)
  • Python, C, Rust, GoLang, Haskell, Java, etc.
• Data processing
  • Using an in-house customized file format and processing where open-source
    tech (such as Apache Hadoop or Apache Spark) does not scale properly
• Ease of deployment
  • It's only a matter of pushing a Docker image
Teams
Job purposes
• 80B+ events per day
• Each event is 1KB in size
• Need to process 80TB of data

• Attribution
• Processing billions of events to attribute 500,000 conversions every day

• Machine learning
• Training models every night
AWS Batch is good for:

• Periodic tasks, nightly pipelines (see the scheduling sketch after this list)
• Automatic scaling processes
• Flexibility in the stack
• Different instance types: m5d.large, x1.16xlarge, optimal or …
• Batch is great for orchestrating
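As a hedged illustration of the nightly-pipeline case, a CloudWatch
Events rule can submit a Batch job on a schedule; the rule name, queue
ARN, role, and job definition below are placeholders:

# Fire at 08:00 UTC every day (placeholder names/ARNs)
aws events put-rule --name nightly-pipeline \
  --schedule-expression "cron(0 8 * * ? *)"

aws events put-targets --rule nightly-pipeline \
  --targets 'Id=1,Arn=arn:aws:batch:us-east-1:123456789012:job-queue/nightly,RoleArn=arn:aws:iam::123456789012:role/events-batch-role,BatchParameters={JobDefinition=nightly-train,JobName=nightly-train}'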


Challenges

• Monitoring
  • Finding single jobs out of millions
  • Checking logs
  • We solved it with Batchiepatchie
• Out of Spot Instances – specific instance types
  • We need to change our compute environment to use a different instance
    type (one workaround is sketched below)
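A sketch of that workaround (names and ARNs are placeholders): create a
replacement CE with different instance types, then repoint the job queue
at it:

aws batch create-compute-environment \
  --compute-environment-name ce-alt-types \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources type=SPOT,minvCpus=0,maxvCpus=1000,instanceTypes=r5.4xlarge,subnets=subnet-EXAMPLE,securityGroupIds=sg-EXAMPLE,instanceRole=ecsInstanceRole,spotIamFleetRole=arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole

aws batch update-job-queue --job-queue high-mem-and-cpu \
  --compute-environment-order order=1,computeEnvironment=ce-alt-types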


Challenges
• Disk cleanup – instances run out of disk sometimes
  • We fixed it with a custom Amazon Machine Image (AMI) (crontab cleanup)
  • Long-running jobs that use disk a lot
• Orchestration happens at a lower level
  • So compared to Hadoop you need to write more code
  • Batch does not help with state management (e.g., no shared file system)
• Managed queues
  • AWS Batch sets the desired vCPUs, but it scales slowly
  • Batchiepatchie sets the min vCPUs based on the jobs in the queue
    (see the sketch below)
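The same min-vCPUs trick can be approximated by hand with the CLI; a
minimal sketch, assuming a hypothetical CE named nightly-ce:

# Raising minvCpus keeps capacity warm instead of scaling up slowly
# from zero; lower it again when the queue drains.
aws batch update-compute-environment \
  --compute-environment nightly-ce \
  --compute-resources minvCpus=256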
Batchiepatchie
• Open source - https://github.com/AdRoll/batchiepatchie
• Monitoring jobs submitted
• Reviewing logs for jobs
• Searching jobs by name and command line
• Estimating cost of jobs submitted
• Instance and job data available in PostgreSQL
(Screenshot slides: monitoring jobs, reviewing logs for jobs, searching
jobs by name and command line, and a holistic view of queues.)

Cost analysis
batchiepatchie=> \d+
List of relations
Name | Type | Owner | Size
------------------------------+----------+----------------+------------
activated_job_queues | table | batchiepatchie | 16 kB
compute_environment_event_log | table | batchiepatchie | 381 MB
goose_db_version | table | batchiepatchie | 40 kB
goose_db_version_id_seq | sequence | batchiepatchie | 8192 bytes
instance_event_log | table | batchiepatchie | 28 GB
instances | table | batchiepatchie | 1523 MB
job_status_events | table | batchiepatchie | 24 kB
job_summary_event_log | table | batchiepatchie | 296 MB
jobs | table | batchiepatchie | 25 GB
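Since the table names above are real relations in the Batchiepatchie
database, quick sanity checks are one-liners; a sketch, assuming local
psql access as in the prompt shown above:

psql batchiepatchie -c "SELECT count(*) FROM jobs;"
psql batchiepatchie -c "SELECT pg_size_pretty(pg_total_relation_size('instance_event_log'));"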
AWS Batch roadmap

• Significant improvements to the AWS Batch console
• Full integration with Container Insights for job-level metrics
• Speed and scale (faster scaling, higher limits)
• Custom logging configuration
• New compute strategies and types
• Better placement logic and additional scheduling methods: increase
  utilization, lower cost, increase scheduling fairness

And more…
Thank you!

Roozbeh Zabihollahi
roozbeh@nextroll.com

Steve Kendrex
kendrexs@amazon.com

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
