Container Orchestration for Big Data Workloads
and Spark
on Docker
Source: https://en.wikipedia.org/wiki/LXC
What is a Big Data Application?
Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Just to Set Expectations …
cluster
NN HDFS NameNode
“Hello world”
Hadoop Services (continued)
RM YARN ResourceManager
NM YARN NodeManager
HFS HttpFS Service
JN Journal Node
ZK ZooKeeper
Master Node: NN RM JN ZK
Worker Node: DN NM
Worker Node: DN NM
This is getting complex.
But Wait, There’s More …
RM YARN ResourceManager
NM YARN NodeManager
JN Journal Node
ZK ZooKeeper
HM HBase Master
HRS HBase Region Server
SHS Spark History Server
Hue Hue
Master Node (HA): NN RM JN ZK
??? Node: Hue OZ
Uh oh. We need some help.
Complete List of Hadoop Services?
RM YARN ResourceManager
DN HDFS DataNode
ZK ZooKeeper
HM HBase Master
HRS HBase Region Server
SHS Spark History Server
SS Solr Server
ISS Impala State Store
CM Cloudera Manager
ACK! There is seemingly no end to these services.
Running Big Data in Containers
• Orchestration of Containers
– Deploy and manage containers for a single application cluster
9+ choices
12+ choices
5+ choices
3+ choices
High degree of flexibility for each and every component (e.g. host OS, container runtime, image registry, etc.)
Pluggable model – exposes key set of capabilities and APIs
Source: The New Stack
Kubernetes (K8)
Source: Kubernetes
Kubernetes (K8) Features
• Pods
– Pods form the atomic unit of scheduling in Kubernetes, as opposed to single containers in other systems
– Pods host dependent/related services
• Flat Networking Space
– Every pod gets its own IP address and can reach other pods across nodes without NAT, unlike default Docker networking, which places containers on a host-private bridge
• Replication Controller
– Controls and monitors the number of running pods (“replicas”) for each service
Source: Kubernetes
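The replication controller's job can be pictured as a reconcile loop that compares the desired replica count with the pods actually running. The sketch below is a toy, in-memory model and not the Kubernetes API: `ReplicaSpec`, `reconcile`, and the pod naming scheme are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ReplicaSpec:
    """Desired state for one service (hypothetical, not a Kubernetes type)."""
    service: str
    replicas: int
    running: list = field(default_factory=list)      # names of live pods
    _ids: count = field(default_factory=count, repr=False)  # pod name counter


def reconcile(spec: ReplicaSpec) -> list:
    """Start or stop pods until the running count matches spec.replicas.

    Returns the actions taken as (verb, pod_name) tuples.
    """
    actions = []
    # Too few pods: start replacements with fresh names.
    while len(spec.running) < spec.replicas:
        pod = f"{spec.service}-{next(spec._ids)}"
        spec.running.append(pod)
        actions.append(("start", pod))
    # Too many pods: stop the most recently started ones.
    while len(spec.running) > spec.replicas:
        pod = spec.running.pop()
        actions.append(("stop", pod))
    return actions


if __name__ == "__main__":
    spec = ReplicaSpec(service="nodemanager", replicas=3)
    print(reconcile(spec))                  # initial scale-up
    spec.running.remove("nodemanager-1")    # simulate a pod failure
    print(reconcile(spec))                  # a replacement pod is started
```

A real controller would run this comparison continuously against the cluster's observed state; here a single call stands in for one pass of that loop.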
Kubernetes (K8) Considerations
Source: Kubernetes
Kubernetes (K8) Considerations
• Multiple ways to implement networking with K8
Source: Kubernetes
Kubernetes (K8) SWOT
Strengths
• Google-backing
• Developer adoption
• Largest community
• Public cloud support
Weaknesses
• Support model
• Complex install
• Alpha/beta features
• Documentation
Opportunities
• De-facto standard
• Enterprise-grade
• Partner ecosystem
Threats
• Docker
• Multiple Kubernetes distributions (confusion/forking)
Docker Swarm (Swarm mode)
Strengths
• Docker native feature
• Simple and fast
• Native integration with Docker tools (e.g. Compose)
Weaknesses
• Small contributor base (Docker Inc. mostly)
• Feature gaps for enterprise IT and complex use cases
Opportunities
• New, native Swarm mode in Docker 1.12
• Strong community and brand
Threats
• Container mgmt & tools (e.g. K8)
• Commoditization of Docker run time
Apache Mesos using Marathon
• Two-tier system
Strengths
• Proven at scale
• Can run both non-containerized and containerized workloads
Weaknesses
• Complexity
• Needs frameworks
• Key container features are alpha/beta
• Support model
Opportunities
• Ability to deploy mixed workloads (containers and non-containers)
• K8 and Swarm as frameworks
Threats
• K8 and Swarm have developer mind share
• Mesos viewed as niche solution for Spark, etc.
Note: Mesos is closely aligned with Spark, but Spark workloads running on Mesos are typically bare-metal (not containerized)
Orchestration Options At-a-Glance
Orchestrator | Size | Time to Install | Maturity | Workload | Storage | Networking
Kubernetes | 10s-1000s | Medium-High | High | Cloud native | Two storage APIs | Flexible but complex (use case dependent)
Swarm | 10s-1000s | Low | Medium | Cloud native | Supports mounting volumes | Uses overlay network
• Orchestration of Containers
– Deploy and manage containers for a single application cluster
DB RDBMS Hue OZ
JHS Job History Server
ZK ZooKeeper
HM HBase Master
Edge Node
GW FA
Use the Best Tool for the Job
• Orchestration of Containers
– Deploy and manage containers for a single application cluster
• Creation
• Monitoring
• Expansion
• Contraction
• Pause / resume
• Application software upgrade
• Deletion
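One way to picture these lifecycle operations is as a small state machine over a cluster of containers. The following sketch is illustrative only: `AppCluster` and its method names are hypothetical, and a real orchestrator would drive a container runtime rather than update counters.

```python
from dataclasses import dataclass


@dataclass
class AppCluster:
    """Toy model of one application cluster (hypothetical names throughout)."""
    name: str
    version: str
    workers: int = 0
    state: str = "absent"   # absent -> running <-> paused -> deleted

    def _require(self, *allowed: str) -> None:
        # Reject operations that make no sense in the current state.
        if self.state not in allowed:
            raise RuntimeError(f"{self.name}: cannot act while {self.state}")

    def create(self, workers: int) -> None:
        self._require("absent")
        self.workers, self.state = workers, "running"

    def resize(self, delta: int) -> None:
        """Expansion (delta > 0) or contraction (delta < 0)."""
        self._require("running")
        if self.workers + delta < 1:
            raise ValueError("a running cluster needs at least one worker")
        self.workers += delta

    def pause(self) -> None:
        self._require("running")
        self.state = "paused"

    def resume(self) -> None:
        self._require("paused")
        self.state = "running"

    def upgrade(self, version: str) -> None:
        """Application software upgrade."""
        self._require("running", "paused")
        self.version = version

    def delete(self) -> None:
        self._require("running", "paused")
        self.workers, self.state = 0, "deleted"

    def status(self) -> dict:
        """Monitoring: report the current state of the cluster."""
        return {"name": self.name, "state": self.state,
                "workers": self.workers, "version": self.version}


if __name__ == "__main__":
    cluster = AppCluster(name="spark-dev", version="2.1")
    cluster.create(workers=3)
    cluster.resize(+2)
    cluster.pause()
    cluster.resume()
    cluster.upgrade("2.2")
    print(cluster.status())
    cluster.delete()
```

Guarding each operation with the allowed states keeps invalid sequences, such as resizing a paused cluster, from silently succeeding.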
Big Data Application Support
• Networking
– Many options to choose from
• Storage
– Local / shared
– Persistent / non-persistent
– Compute and storage separation
• Both are major challenges for Big Data in containers
– Requires custom code (intellectual property)
Integration with Enterprise Services
• Build management and administration console
• Build high availability & disaster recovery
• Build remote storage connectivity (HDFS, S3, NFS etc.)
• Build guard rails for governance and onboarding new apps
+
General container orchestration functionality
[Architecture diagram: Controller and Worker (1) to Worker (N) hosts; labels include REST API server, Container scheduler, Controller Mgr., Worker Mgr., bd_mgmt, compute containers, Storage, External HDFS, EC2]
Flexible architecture with 20+ patents and bare-metal performance for Big Data
Big Data in Docker Containers
Dynamic, fully managed local volume
Bare-Metal Performance
Intel benchmarking study: BlueData EPIC demonstrated 2.33% higher
performance vs. bare-metal (for 50 Hadoop compute nodes and 10 TB of data)
Source: “Bare-metal performance for Big Data workloads on Docker containers”, Intel white paper, March 2017 http://intel.ly/2lXPZHx
Purpose-Built for Big Data on Docker
Out-of-the-box solution with differentiated Big Data innovations & optimizations
BlueData EPIC container-based platform for Big Data
container monitoring
with pre-built HA and multi-tenancy
Aggregate Docker container storage, memory, cores (CPU shares), and QoS level for each tenant
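As a rough illustration of per-tenant aggregation, the sketch below sums the resources requested by each tenant's containers and checks a new request against a quota. The `Container` and `Quota` types, the field names, and the numbers are assumptions made for this example; they do not describe BlueData EPIC's actual data model or API, and the QoS level is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Container:
    """Resources requested by one container (hypothetical fields)."""
    tenant: str
    cores: int
    memory_gb: int
    storage_gb: int


@dataclass(frozen=True)
class Quota:
    """Upper bounds for one tenant (hypothetical fields)."""
    cores: int
    memory_gb: int
    storage_gb: int


def usage_by_tenant(containers: Iterable[Container]) -> dict:
    """Return {tenant: (cores, memory_gb, storage_gb)} summed over containers."""
    totals: dict = {}
    for c in containers:
        cores, mem, disk = totals.get(c.tenant, (0, 0, 0))
        totals[c.tenant] = (cores + c.cores,
                            mem + c.memory_gb,
                            disk + c.storage_gb)
    return totals


def fits(quota: Quota, used: tuple, new: Container) -> bool:
    """True if adding `new` keeps the tenant within every quota dimension."""
    cores, mem, disk = used
    return (cores + new.cores <= quota.cores
            and mem + new.memory_gb <= quota.memory_gb
            and disk + new.storage_gb <= quota.storage_gb)


if __name__ == "__main__":
    running = [
        Container("marketing", cores=4, memory_gb=16, storage_gb=100),
        Container("marketing", cores=4, memory_gb=16, storage_gb=100),
        Container("r-and-d", cores=8, memory_gb=64, storage_gb=500),
    ]
    used = usage_by_tenant(running)
    quota = Quota(cores=12, memory_gb=48, storage_gb=300)
    request = Container("marketing", cores=4, memory_gb=16, storage_gb=100)
    print(used["marketing"], fits(quota, used["marketing"], request))
```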
Isolated Work Environments
Different Big Data applications, tools, and/or versions tailored for each specific tenant (user group)
Containerized Compute Clusters
4 containers on 2 different hosts using 1 VLAN and 4 persistent IPs
Different Services in Each Container
Master Services
Worker Services
BlueData EPIC Monitoring