
Hadoop and Spark on Docker

Container Orchestration for Big Data


Today’s Speakers

Tom Phelan, Co-Founder and Chief Architect, BlueData Software (@tapbluedata)
Anant Chintamaneni, Vice President of Products, BlueData Software (@anantcman)
Agenda

• Containers and Big Data
• Container Orchestration Choices and Considerations
• Requirements for Hadoop and Spark Clusters
• How to Run Multiple Big Data Clusters in Containers
• Q & A
What is a Container?

“LXC (Linux Containers) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.”

Source: https://en.wikipedia.org/wiki/LXC
What is a Big Data Application?

Big Data refers to large sets of data characterized by:


• Volume
• Velocity
• Variety
Common Big Data application frameworks include:
• Hadoop
• Spark
• Kafka, Cassandra, and more
Why Hadoop & Spark on Containers?

• All the value propositions of virtualization:


– Flexibility, agility, cost reduction, etc.
• Lower virtualization “tax” than hypervisor-based VMs
• Simplify management of complex Big Data software
stacks with the use of the Docker file format
• Enhance independent scalability of compute and
storage resources
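The “Docker file format” mentioned above is the Dockerfile: a declarative recipe that captures a Big Data software stack as a reusable image. A minimal sketch, assuming a CentOS base and a hypothetical Hadoop tarball (the versions and paths are illustrative, not from any particular distro):

```dockerfile
# Illustrative sketch only: base image, versions, and paths are
# assumptions, not a supported Hadoop distribution.
FROM centos:7

# Java is a prerequisite for Hadoop and Spark services
RUN yum install -y java-1.8.0-openjdk-headless && yum clean all

# Unpack a Hadoop release into the image (version is hypothetical)
ADD hadoop-2.7.3.tar.gz /opt/
ENV HADOOP_HOME=/opt/hadoop-2.7.3 \
    PATH=$PATH:/opt/hadoop-2.7.3/bin

# Site configuration is mounted at deploy time, not baked in, because
# it is container-instance-specific (host names, IPs, secrets)
VOLUME /etc/hadoop/conf
```

The point of the sketch: the image captures the software stack once, while per-instance state stays outside the image.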
Just to Set Expectations …

This presentation is not about using containers to


run Big Data tasks:

Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Just to Set Expectations …

This presentation is about running Big Data clusters


in containers:
Just to Set Expectations …

To provide a true containerized Big Data environment:


What is Container Orchestration?

• Deploy and Configure
• Fault Isolation & Healing
• Secure
• Upgrades
• Scaling Up and Down

Stateless Applications
- Nothing to disk
- Web front-end
- Can stop and start as many containers as you like
- Container is ephemeral
- No container instance-specific configuration

Stateful Applications
- Container-specific: host names, IP addresses
- Big Data service configuration information
- Security secrets: passwords, KDC keys
Hadoop and Spark in Containers

• The requirements for Hadoop, Spark, and other


similar Big Data applications do not match the
behavior of most containerized applications
• Let’s try a simple example with Hadoop:
– Treat each Hadoop service as a microservice
– Run each Hadoop service in its own container
• How would that work?
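Taken literally, the “one Hadoop service per container” experiment might be sketched as a Compose file. The image names and ports below are hypothetical placeholders, not real published images:

```yaml
# Hypothetical sketch: one Hadoop service per container.
# Image names and ports are placeholders for illustration.
version: "2"
services:
  namenode:          # HDFS NameNode
    image: example/hadoop-namenode
    ports: ["8020:8020", "50070:50070"]
  resourcemanager:   # YARN ResourceManager
    image: example/hadoop-resourcemanager
    ports: ["8088:8088"]
  datanode:          # HDFS DataNode (scale out for more workers)
    image: example/hadoop-datanode
    depends_on: [namenode]
  nodemanager:       # YARN NodeManager
    image: example/hadoop-nodemanager
    depends_on: [resourcemanager]
```

Four services is manageable; the next slides show why this approach does not stay simple.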
Hadoop Services
Legend: RM = YARN ResourceManager; NM = YARN NodeManager; NN = HDFS NameNode; DN = HDFS DataNode

Master Node: NN, RM
Worker Node (×3): DN, NM

Deploy 1 service per container. No problem! A “Hello world” cluster.
Hadoop Services (continued)
Abbreviations as above, plus: JHS = Job History Server; HFS = HttpFS Service; JN = Journal Node; ZK = ZooKeeper

Master Node: NN, RM, JN, ZK
Master Node (HA): NN, RM, JN, ZK, JHS, HFS
Worker Node (×3): DN, NM

This is getting complex.
But Wait, There’s More …
Abbreviations as above, plus: SHS = Spark History Server; Hue = Hue; OZ = Oozie; HM = HBase Master; HRS = HBase Region Server

Master Node (HA): NN, RM, JN, ZK, JHS, HFS, HM, SHS
Worker Node (×3): DN, NM, HRS
??? Node: Hue, OZ

Uh oh. We need some help.
Complete List of Hadoop Services?
RM = YARN ResourceManager; NM = YARN NodeManager; NN = HDFS NameNode; DN = HDFS DataNode; JHS = Job History Server; HFS = HttpFS Service; JN = Journal Node; ZK = ZooKeeper; HM = HBase Master; HRS = HBase Region Server; SHS = Spark History Server; Hue = Hue; OZ = Oozie; CM = Cloudera Manager; DB = RDBMS; GW = Gateway; FA = Flume Agent; ISS = Impala State Store; ICS = Impala Catalog Server; ID = Impala Daemon; SS = Solr Server; HS = Hive Server; HSS = Hive Metastore Service; …

ACK! There is seemingly no end to these services.
Running Big Data in Containers

The blind placement of individual services into containers will lead


to all sorts of problems. Big Data applications break the typical
assumptions for containers and container orchestration.

So is it possible to run Big Data applications in containers?

Yes! Containers are great for Big Data.


But it takes some care.
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Container Orchestration Options

• Primary Options:
– Kubernetes
– Apache Mesos with Marathon
– Docker Swarm (“Swarm mode”)

• Other Options:
– Rancher (Rancher Labs)
– Fleet (CoreOS)
– Nomad (HashiCorp)
– etc.

• There are multiple options and choices for container orchestration and management
• The core use case is deploying “beautifully” architected microservices and stateless apps
• These tools all provide scalability, resilience, and fault tolerance
• They offer capabilities & APIs to build on
• All have similar high-level clustered architecture, but slightly different approaches
Container Management Choices

(Diagram: the container stack offers from 3+ to 12+ choices at each layer.)

High degree of flexibility for each and every component (e.g. host OS, container runtime, image registry, etc.)
Pluggable model: exposes a key set of capabilities and APIs
Source: The New Stack
Kubernetes (K8)

End users see service endpoints (e.g. nginx, web server, etc.)

Source: Kubernetes
Kubernetes (K8) Features
• Pods
– Pods form the atomic unit of scheduling in Kubernetes, as opposed to single containers in other systems
– Pods host dependent/related services
• Flat Networking Space
– Networking is very different in Kubernetes versus the default Docker networking
• Replication Controller
– Controls and monitors the number of running pods (“replicas”) for each service
Source: Kubernetes
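As a sketch of these two concepts, a pod spec co-locates dependent containers, and a replication controller keeps a target number of replicas running. The labels and images below are illustrative placeholders (this uses the ReplicationController API current at the time of this talk, before Deployments became the norm):

```yaml
# Illustrative ReplicationController; images are placeholders.
apiVersion: v1
kind: ReplicationController
metadata:
  name: worker-rc
spec:
  replicas: 3              # monitored count of running pods ("replicas")
  selector:
    app: bigdata-worker
  template:                # the pod: Kubernetes' atomic unit of scheduling
    metadata:
      labels:
        app: bigdata-worker
    spec:
      containers:          # dependent/related services share one pod
      - name: service-a
        image: example/service-a
      - name: log-shipper
        image: example/log-shipper
```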
Kubernetes (K8) Considerations

K8 for on-premises Linux is non-trivial: every component must be manually installed

Source: Kubernetes
Kubernetes (K8) Considerations

Volumes → non-starter for Big Data

Persistent Volumes → requires external storage: cannot leverage local disk

Source: Kubernetes
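For example, a persistent volume claim of this era assumes a provisioner backed by external (typically networked) storage; there is no first-class way to claim a node's local disks. The name and size below are illustrative:

```yaml
# Illustrative PersistentVolumeClaim: satisfied by external storage
# (NFS, cloud block storage, etc.), not by a node's local disks.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datanode-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti   # Big Data scale quickly outgrows networked volumes
```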
Kubernetes (K8) Considerations

• Multiple ways to implement networking with K8

• Big Data workloads have unique requirements that are different from those of stateless workloads

Source: Kubernetes
Kubernetes (K8) SWOT

Strengths:
• Google backing
• Developer adoption
• Largest community
• Public cloud support

Weaknesses:
• Support model
• Complex install
• Alpha/beta features
• Documentation

Opportunities:
• De-facto standard
• Enterprise-grade
• Partner ecosystem

Threats:
• Docker
• Multiple Kubernetes distributions (confusion/forking)
Docker Swarm (Swarm mode)

Docker v1.12 includes a “Swarm mode” in Docker Engine for


natively managing a cluster of Docker Engines
Source: Docker
Docker Swarm SWOT

Strengths:
• Docker native feature
• Simple and fast
• Native integration with Docker tools (e.g. Compose)

Weaknesses:
• Small contributor base (mostly Docker Inc.)
• Feature gaps for enterprise IT and complex use cases

Opportunities:
• New, native Swarm mode in Docker 1.12
• Strong community and brand

Threats:
• Container mgmt & tools (e.g. K8)
• Commoditization of the Docker runtime
Apache Mesos using Marathon

• Two-tier system

• Mesos is a thin resource-sharing layer

• Marathon gets offers from Mesos to start, monitor, and scale containers

• Mesos can fire up Kubernetes or Docker Swarm as a framework
Apache Mesos w/ Marathon SWOT

Strengths:
• Proven at scale
• Can run both non-containerized and containerized workloads

Weaknesses:
• Complexity
• Needs frameworks
• Key container features are alpha/beta
• Support model

Opportunities:
• Ability to deploy mixed workloads (containers and non-containers)
• K8 and Swarm as frameworks

Threats:
• K8 and Swarm have developer mind share
• Mesos viewed as niche solution for Spark, etc.

Note: Mesos is closely aligned with Spark, but Spark workloads running on Mesos are typically bare-metal (not containerized)
Orchestration Options At-a-Glance

Kubernetes: Size 10s-1000s; Time to Install Medium-High; Maturity High; Workload Cloud native; Storage: two storage APIs; Networking: flexible but complex (use case dependent)

Swarm: Size 10s-1000s; Time to Install Low; Maturity Medium; Workload Cloud native; Storage: supports mounting volumes; Networking: uses overlay network

Mesos w/Marathon: Size 10s-1000s; Time to Install High; Maturity High; Workload Cloud native; Storage: beta support for persistent volumes; Networking: IP per container (alpha)
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Attributes of Big Data Clusters

• Not exactly monolithic applications, but close


• Multiple, co-operating services with dynamic APIs
– Service start-up / tear-down ordering requirements
– Different sets of services running on different hosts (nodes)
– Tricky service interdependencies impact scalability
• Lots of configuration (aka state)
– Host name, IP address, ports, etc.
– Big meta-data: Hadoop and Spark service-specific configurations
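That per-node state shows up concretely in Hadoop's site files. For example, a core-site.xml pins the NameNode's host and port, which every container in the cluster must agree on (the host name below is illustrative):

```xml
<!-- Illustrative core-site.xml fragment: the NameNode address is
     cluster-specific state that every node's config must share. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host.example.com:8020</value>
  </property>
</configuration>
```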
Managing and Configuring Hadoop

• Use a Hadoop manager


– Hortonworks: Ambari
– Cloudera: Cloudera Manager
– MapR: MapR Control System (MCS)
• Follow common deployment pattern
• Ensures distro supportability
Using a Hadoop Manager
Abbreviations as above, plus: Hmg = Hadoop Manager; DB = RDBMS; GW = Gateway; FA = Flume Agent

Hadoop Manager Node: Hmg, ZK, JN, DB, Hue, OZ

Using a Hadoop Manager (cont’d)
Hadoop Manager Node: Hmg, ZK, DB, JN, Hue, OZ
Master Node (×2): RM, NN, ZK, JN, JHS, HFS, SHS, HM
Worker Node (×3): NM, DN, HRS
Edge Node: GW, FA
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Requirements for Big Data Clusters

✓ Full cluster lifecycle management
✓ Big Data application support (i.e. requires no modification)
✓ Management of storage and networking resources
✓ Integration with existing enterprise services (e.g. LDAP / AD)
✓ Conform to existing enterprise security policies
  • Multi-tenancy, multi-cluster, auditing, monitoring, etc.
✓ Maintain Big Data performance goals
✓ Support for hybrid (on-prem + cloud) environments
Full Cluster Lifecycle Management

• Creation
• Monitoring
• Expansion
• Contraction
• Pause / resume
• Application software upgrade
• Deletion
Big Data Application Support

• How do you manage multiple Big Data applications?


– Hadoop, Spark, Kafka, Cassandra, and other frameworks
• How do you ensure no modification to these open source Big Data frameworks and tools?
• How do you run different distros/versions of Hadoop?
• Will your distro vendor support your containerized
deployment environment?
Mgmt of Storage and Networking

• Networking
– Many options to choose from
• Storage
– Local / shared
– Persistent / non-persistent
– Compute and storage separation
• Both are major challenges for Big Data in containers
– Requires custom code (intellectual property)
Integration with Enterprise Services

• Lightweight Directory Access Protocol (LDAP) service


• Active Directory (AD) service
• Domain Name System (DNS)
• Kerberos Key Distribution Center (KDC)
• Key Management Service (KMS)
• Big Data platforms are not conducive to service discovery
– A software-defined networking approach is required
Enterprise Security Policies

• Installation and management on hosts without


passwordless ssh root access
• Privilege-controlled (sudo) environment for user operations
– Role-based security is required
• Auditing and monitoring
• Use of a secured Docker Image Registry
• Multi-tenant, multi-cluster with controlled access
Big Data Performance

• Petabyte-scale storage


– No existing container orchestration tool has an answer on
how to handle different types of persistent storage at scale
– New strategies and approaches need to be considered:
separation of compute & storage is a fundamental concept
• Maintain elasticity
– Why use containers if they are full of state?
• Deliver bare-metal performance with containers
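Compute/storage separation can be sketched at the configuration level: ephemeral containerized Spark executors read from and write to a durable, external HDFS rather than container-local disks. The host names below are illustrative:

```properties
# Illustrative spark-defaults.conf fragment: compute containers come
# and go; the data lives in an external HDFS (or object store).
spark.hadoop.fs.defaultFS   hdfs://remote-hdfs.example.com:8020
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs://remote-hdfs.example.com:8020/spark-logs
```

With this split, containers stay elastic (little state to lose) while the data layer scales independently.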
Support for Hybrid Environments

• Today: on-premises (containerized)


• Next: public cloud
• Tomorrow: hybrid
– Compute and data on-premises
– Compute on-premises and in cloud, with data on-premises
– Compute and data on-premises and in the public cloud
– Compute and data in the public cloud
Big Data in Containers
• Choose container platform (e.g. Docker, rkt)
• Choose & build container orchestration (multiple options)
• Choose & build container networking (multiple options)
• Choose & build persistent storage for Big Data containers
• Build infrastructure security integrations (AD/LDAP, KMS, Kerberos)
• Build cluster management for Big Data on-premises and/or in the cloud
• Build policy engine for resource quotas and multi-tenancy
• Build management and administration console
• Build high availability & disaster recovery
• Build remote storage connectivity (HDFS, S3, NFS, etc.)
• Build guard rails for governance and onboarding new apps

Much more than container orchestration: requires specific design, differentiated system architecture, and engineering to build a “whole solution” for Big Data
A Solution for Big Data in Containers

Purpose-built innovations for Big Data in containers

+
General container orchestration functionality

• Turnkey solution for running multiple Big Data clusters in containers


• Focused on stateful Big Data applications in containers
• Leveraging containers as “lightweight VMs”
• Multi-tenant architecture with enterprise-grade security and performance
• Enabling compute / storage separation for Big Data
BlueData EPIC Software Platform

Users: Data Scientists, Developers, Data Engineers, Data Analysts

BlueData EPIC™ Software Platform
• BI/Analytics Tools: Bring-Your-Own
• ElasticPlane™: self-service, multi-tenant clusters
• IOBoost™: extreme performance and scalability
• DataTap™: in-place access to data on-prem or in the cloud

Compute: on-premises servers or EC2 (public cloud)
Storage: NFS, HDFS, S3
BlueData EPIC Architecture
Best of K8 + Swarm concepts for Big Data

(Diagram: the Controller node runs the REST API server, Controller Mgr., bd_mgmt, data_server, and memq_cnode; each Worker node (1…N) runs a container scheduler, Worker Mgr., bd_mgmt, data_server, and memq_cnode, hosting the compute containers; storage spans local disks and external HDFS.)
Flexible architecture with 20+ patents and bare-metal performance for Big Data
Big Data in Docker Containers

• Namespace awareness for data paths & services
• Big Data service endpoints
• Big Data security
• Big Data service dependency mgmt
• Heartbeat / control system
• Container security (non-privileged)
• Storage driver & network interfaces
• Dynamic, fully managed local volume
Bare-Metal Performance
Intel benchmarking study: BlueData EPIC demonstrated 2.33% higher
performance vs. bare-metal (for 50 Hadoop compute nodes and 10 TB of data)

Source: “Bare-metal performance for Big Data workloads on Docker containers”, Intel white paper, March 2017 http://intel.ly/2lXPZHx
Purpose-Built for Big Data on Docker
Out-of-the-box solution with differentiated Big Data innovations & optimizations

BlueData EPIC container-based platform for Big Data:
• Web-based UI and RESTful APIs for automation
• App Store with Docker-based app images & App Workbench
• Metricbeat + ELK stack for container monitoring
• Container management for Big Data workloads with pre-built HA and multi-tenancy
• Open vSwitch with VXLAN
• Dynamic persistent volumes
• Host OS: CentOS / RHEL only
• On-Premises (physical servers or VMs) or Public Cloud


BlueData Use Cases + Functionality

Example Use Cases: Big Data Analytics, Multi-Tenant Hadoop Operations, Data Science Sandbox, Real-Time Data Pipeline

BlueData EPIC Functionality:
• Self-Service User Interface
• RESTful APIs
• App Store / Registry
• Tenant Management
• Cluster Management
• DataTap Connectors
• IOBoost Caching
• Big Data Security
• Identity & Access
• App Workbench
• Container Networking
• Container Storage
• High Availability
• Resource Scheduler
• Cloud Connector
BlueData EPIC – Access Layer
Management console for administrators and users. REST APIs for automation and workflow.
BlueData EPIC – App Store

DOCKER-BASED APP IMAGES OF YOUR CHOICE:


Same images for on-premises, AWS, or any public cloud
(Use BlueData App Workbench to update existing images,
adding newer versions or net new apps)
BlueData EPIC – Tenants & Policies
Tenant quotas for CPU, memory, & storage. LDAP/AD access control & Kerberos security.
Multi-Tenant Resource Quotas & QoS

Aggregate Docker container storage, memory, cores (CPU shares), and QoS level for each tenant
Isolated Work Environments
Different Big Data applications, tools, and/or versions tailored for each specific tenant (user group)
Containerized Compute Clusters

On-demand, elastic compute environments.


Expand or shrink with just a few mouse clicks.
Compute / Storage Separation

Connectivity from containerized


compute clusters to one or many
remote HDFS or NFS systems
Multi-Host

4 containers
on 2 different hosts
using 1 VLAN and 4 persistent IPs
Different Services in Each Container
Master Services

Worker Services
BlueData EPIC Monitoring

Metricbeat + Elastic + Kibana for cluster- and container-level monitoring
Key Takeaways
• Current container orchestration engines not ready for Big Data
– Hadoop and other Big Data workloads require new architecture and innovation
– Wide array of storage, network, security, & operational features are missing
• Compute/storage separation is a critical architectural concept
– Storing terabytes of persistent data inside containers, as on physical servers, defeats the key goals of running applications in containers
– Big Data must be stored in external storage (e.g. HDFS, Object)
• DIY for Big Data in containers is not an option. It will not work!
– Configuring and tuning container orchestration systems will not be sufficient
– Choose a turnkey, purpose-built platform for Big Data to accelerate time to value
Tom Phelan Anant Chintamaneni
@tapbluedata @anantcman
Thank You

For more information:


www.bluedata.com
sales@bluedata.com
TRY BLUEDATA EPIC ON AWS: www.bluedata.com/aws
