
Hadoop and Spark on Docker

Container Orchestration for Big Data


Today’s Speakers

Tom Phelan, Co-Founder and Chief Architect, BlueData Software (@tapbluedata)
Anant Chintamaneni, Vice President of Products, BlueData Software (@anantcman)
Agenda

• Containers and Big Data
• Container Orchestration Choices and Considerations
• Requirements for Hadoop and Spark Clusters
• How to Run Multiple Big Data Clusters in Containers
• Q & A
What is a Container?

“LXC (Linux Containers) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.”

Source: https://en.wikipedia.org/wiki/LXC
What is a Big Data Application?

Big Data refers to large sets of data characterized by:


• Volume
• Velocity
• Variety
Common Big Data application frameworks include:
• Hadoop
• Spark
• Kafka, Cassandra, and more
Why Hadoop & Spark on Containers?

• All the value propositions of virtualization:


– Flexibility, agility, cost reduction, etc.
• Lower virtualization “tax” than hypervisor-based VMs
• Simplify management of complex Big Data software
stacks with the use of the Docker file format
• Enhance independent scalability of compute and
storage resources
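The “Docker file format” mentioned above is the Dockerfile: a declarative recipe that captures a Big Data software stack as a reusable image. A minimal sketch, assuming a CentOS base and a hypothetical Hadoop tarball (the versions and paths are illustrative, not from any particular distro):

```dockerfile
# Illustrative sketch only: base image, versions, and paths are
# assumptions, not a supported Hadoop distribution.
FROM centos:7

# Java is a prerequisite for Hadoop and Spark services
RUN yum install -y java-1.8.0-openjdk-headless && yum clean all

# Unpack a Hadoop release into the image (version is hypothetical)
ADD hadoop-2.7.3.tar.gz /opt/
ENV HADOOP_HOME=/opt/hadoop-2.7.3 \
    PATH=$PATH:/opt/hadoop-2.7.3/bin

# Site configuration is mounted at deploy time, not baked in, because
# it is container-instance-specific (host names, IPs, secrets)
VOLUME /etc/hadoop/conf
```

The point of the sketch: the image captures the software stack once, while per-instance state stays outside the image.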
Just to Set Expectations …

This presentation is not about using containers to


run Big Data tasks:

Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Just to Set Expectations …

This presentation is about running Big Data clusters


in containers:
Just to Set Expectations …

To provide a true containerized Big Data environment:


What is Container Orchestration?

• Deploy and Configure
• Fault Isolation & Healing
• Secure
• Upgrades
• Scaling Up and Down

Stateless Applications
- Nothing to disk
- Web front-end
- Can stop and start as many containers as you like
- Container is ephemeral
- No container instance-specific configuration

Stateful Applications
- Container-specific: host names, IP addresses
- Big Data service configuration information
- Security secrets: passwords, KDC keys
Hadoop and Spark in Containers

• The requirements for Hadoop, Spark, and other


similar Big Data applications do not match the
behavior of most containerized applications
• Let’s try a simple example with Hadoop:
– Treat each Hadoop service as a microservice
– Run each Hadoop service in its own container
• How would that work?
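Taken literally, the “one Hadoop service per container” experiment might be sketched as a Compose file. The image names and ports below are hypothetical placeholders, not real published images:

```yaml
# Hypothetical sketch: one Hadoop service per container.
# Image names and ports are placeholders for illustration.
version: "2"
services:
  namenode:          # HDFS NameNode
    image: example/hadoop-namenode
    ports: ["8020:8020", "50070:50070"]
  resourcemanager:   # YARN ResourceManager
    image: example/hadoop-resourcemanager
    ports: ["8088:8088"]
  datanode:          # HDFS DataNode (scale out for more workers)
    image: example/hadoop-datanode
    depends_on: [namenode]
  nodemanager:       # YARN NodeManager
    image: example/hadoop-nodemanager
    depends_on: [resourcemanager]
```

Four services is manageable; the next slides show why this approach does not stay simple.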
Hadoop Services
Legend: RM = YARN ResourceManager; NM = YARN NodeManager; NN = HDFS NameNode; DN = HDFS DataNode

Master Node: NN, RM
Worker Node (×3): DN, NM

Deploy 1 service per container. No problem! A “Hello world” cluster.
Hadoop Services (continued)
Abbreviations as above, plus: JHS = Job History Server; HFS = HttpFS Service; JN = Journal Node; ZK = ZooKeeper

Master Node: NN, RM, JN, ZK
Master Node (HA): NN, RM, JN, ZK, JHS, HFS
Worker Node (×3): DN, NM

This is getting complex.
But Wait, There’s More …
Abbreviations as above, plus: SHS = Spark History Server; Hue = Hue; OZ = Oozie; HM = HBase Master; HRS = HBase Region Server

Master Node (HA): NN, RM, JN, ZK, JHS, HFS, HM, SHS
Worker Node (×3): DN, NM, HRS
??? Node: Hue, OZ

Uh oh. We need some help.
Complete List of Hadoop Services?
RM = YARN ResourceManager; NM = YARN NodeManager; NN = HDFS NameNode; DN = HDFS DataNode; JHS = Job History Server; HFS = HttpFS Service; JN = Journal Node; ZK = ZooKeeper; HM = HBase Master; HRS = HBase Region Server; SHS = Spark History Server; Hue = Hue; OZ = Oozie; CM = Cloudera Manager; DB = RDBMS; GW = Gateway; FA = Flume Agent; ISS = Impala State Store; ICS = Impala Catalog Server; ID = Impala Daemon; SS = Solr Server; HS = Hive Server; HSS = Hive Metastore Service; …

ACK! There is seemingly no end to these services.
Running Big Data in Containers

The blind placement of individual services into containers will lead


to all sorts of problems. Big Data applications break the typical
assumptions for containers and container orchestration.

So is it possible to run Big Data applications in containers?

Yes! Containers are great for Big Data.


But it takes some care.
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Container Orchestration Options

• Primary Options:
– Kubernetes
– Apache Mesos with Marathon
– Docker Swarm (“Swarm mode”)

• Other Options:
– Rancher (Rancher Labs)
– Fleet (CoreOS)
– Nomad (HashiCorp)
– etc.

• There are multiple options and choices for container orchestration and management
• The core use case is deploying “beautifully” architected microservices and stateless apps
• These tools all provide scalability, resilience, and fault tolerance
• They offer capabilities & APIs to build on
• All have similar high-level clustered architecture, but slightly different approaches
Container Management Choices

(Diagram: the container stack offers from 3+ to 12+ choices at each layer.)

High degree of flexibility for each and every component (e.g. host OS, container runtime, image registry, etc.)
Pluggable model: exposes a key set of capabilities and APIs
Source: The New Stack
Kubernetes (K8)

End users see service endpoints (e.g. nginx, web server, etc.)

Source: Kubernetes
Kubernetes (K8) Features
• Pods
– Pods form the atomic unit of scheduling in Kubernetes, as opposed to single containers in other systems
– Pods host dependent/related services
• Flat Networking Space
– Networking is very different in Kubernetes versus the default Docker networking
• Replication Controller
– Controls and monitors the number of running pods (“replicas”) for each service
Source: Kubernetes
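As a sketch of these two concepts, a pod spec co-locates dependent containers, and a replication controller keeps a target number of replicas running. The labels and images below are illustrative placeholders (this uses the ReplicationController API current at the time of this talk, before Deployments became the norm):

```yaml
# Illustrative ReplicationController; images are placeholders.
apiVersion: v1
kind: ReplicationController
metadata:
  name: worker-rc
spec:
  replicas: 3              # monitored count of running pods ("replicas")
  selector:
    app: bigdata-worker
  template:                # the pod: Kubernetes' atomic unit of scheduling
    metadata:
      labels:
        app: bigdata-worker
    spec:
      containers:          # dependent/related services share one pod
      - name: service-a
        image: example/service-a
      - name: log-shipper
        image: example/log-shipper
```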
Kubernetes (K8) Considerations

K8 for on-premises Linux is non-trivial: every component must be manually installed

Source: Kubernetes
Kubernetes (K8) Considerations

Volumes → non-starter for Big Data

Persistent Volumes → requires external storage: cannot leverage local disk

Source: Kubernetes
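For example, a persistent volume claim of this era assumes a provisioner backed by external (typically networked) storage; there is no first-class way to claim a node's local disks. The name and size below are illustrative:

```yaml
# Illustrative PersistentVolumeClaim: satisfied by external storage
# (NFS, cloud block storage, etc.), not by a node's local disks.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datanode-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti   # Big Data scale quickly outgrows networked volumes
```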
Kubernetes (K8) Considerations

• Multiple ways to implement networking with K8

• Big Data workloads have unique requirements that are different from those of stateless workloads

Source: Kubernetes
Kubernetes (K8) SWOT

Strengths:
• Google backing
• Developer adoption
• Largest community
• Public cloud support

Weaknesses:
• Support model
• Complex install
• Alpha/beta features
• Documentation

Opportunities:
• De-facto standard
• Enterprise-grade
• Partner ecosystem

Threats:
• Docker
• Multiple Kubernetes distributions (confusion/forking)
Docker Swarm (Swarm mode)

Docker v1.12 includes a “Swarm mode” in Docker Engine for


natively managing a cluster of Docker Engines
Source: Docker
Docker Swarm SWOT

Strengths:
• Docker native feature
• Simple and fast
• Native integration with Docker tools (e.g. Compose)

Weaknesses:
• Small contributor base (mostly Docker Inc.)
• Feature gaps for enterprise IT and complex use cases

Opportunities:
• New, native Swarm mode in Docker 1.12
• Strong community and brand

Threats:
• Container mgmt & tools (e.g. K8)
• Commoditization of the Docker runtime
Apache Mesos using Marathon

• Two-tier system

• Mesos is a thin resource-sharing layer

• Marathon gets offers from Mesos to start, monitor, and scale containers

• Mesos can fire up Kubernetes or Docker Swarm as a framework
Apache Mesos w/ Marathon SWOT

Strengths:
• Proven at scale
• Can run both non-containerized and containerized workloads

Weaknesses:
• Complexity
• Needs frameworks
• Key container features are alpha/beta
• Support model

Opportunities:
• Ability to deploy mixed workloads (containers and non-containers)
• K8 and Swarm as frameworks

Threats:
• K8 and Swarm have developer mind share
• Mesos viewed as niche solution for Spark, etc.

Note: Mesos is closely aligned with Spark, but Spark workloads running on Mesos are typically bare-metal (not containerized)
Orchestration Options At-a-Glance

Kubernetes: Size 10s-1000s; Time to Install Medium-High; Maturity High; Workload Cloud native; Storage: two storage APIs; Networking: flexible but complex (use case dependent)

Swarm: Size 10s-1000s; Time to Install Low; Maturity Medium; Workload Cloud native; Storage: supports mounting volumes; Networking: uses overlay network

Mesos w/Marathon: Size 10s-1000s; Time to Install High; Maturity High; Workload Cloud native; Storage: beta support for persistent volumes; Networking: IP per container (alpha)
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Attributes of Big Data Clusters

• Not exactly monolithic applications, but close


• Multiple, co-operating services with dynamic APIs
– Service start-up / tear-down ordering requirements
– Different sets of services running on different hosts (nodes)
– Tricky service interdependencies impact scalability
• Lots of configuration (aka state)
– Host name, IP address, ports, etc.
– Big meta-data: Hadoop and Spark service-specific configurations
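That per-node state shows up concretely in Hadoop's site files. For example, a core-site.xml pins the NameNode's host and port, which every container in the cluster must agree on (the host name below is illustrative):

```xml
<!-- Illustrative core-site.xml fragment: the NameNode address is
     cluster-specific state that every node's config must share. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host.example.com:8020</value>
  </property>
</configuration>
```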
Managing and Configuring Hadoop

• Use a Hadoop manager


– Hortonworks: Ambari
– Cloudera: Cloudera Manager
– MapR: MapR Control System (MCS)
• Follow common deployment pattern
• Ensures distro supportability
Using a Hadoop Manager
Abbreviations as above, plus: Hmg = Hadoop Manager; DB = RDBMS; GW = Gateway; FA = Flume Agent

Hadoop Manager Node: Hmg, ZK, JN, DB, Hue, OZ

Using a Hadoop Manager (cont’d)
Hadoop Manager Node: Hmg, ZK, DB, JN, Hue, OZ
Master Node (×2): RM, NN, ZK, JN, JHS, HFS, SHS, HM
Worker Node (×3): NM, DN, HRS
Edge Node: GW, FA
Use the Best Tool for the Job

• Orchestration of Containers
– Deploy and manage containers for a single application cluster

• Managing and Configuring Hadoop Services


– Configure and start Hadoop-specific application services (e.g.
within each container)

• Running Multiple Big Data Clusters in Containers


– Deploy and manage multiple Big Data clusters simultaneously
Requirements for Big Data Clusters

✓ Full cluster lifecycle management
✓ Big Data application support (i.e. requires no modification)
✓ Management of storage and networking resources
✓ Integration with existing enterprise services (e.g. LDAP / AD)
✓ Conform to existing enterprise security policies
  • Multi-tenancy, multi-cluster, auditing, monitoring, etc.
✓ Maintain Big Data performance goals
✓ Support for hybrid (on-prem + cloud) environments
Full Cluster Lifecycle Management

• Creation
• Monitoring
• Expansion
• Contraction
• Pause / resume
• Application software upgrade
• Deletion
Big Data Application Support

• How do you manage multiple Big Data applications?


– Hadoop, Spark, Kafka, Cassandra, and other frameworks
• How do you ensure no modification to these open source Big Data frameworks and tools?
• How do you run different distros/versions of Hadoop?
• Will your distro vendor support your containerized
deployment environment?
Mgmt of Storage and Networking

• Networking
– Many options to choose from
• Storage
– Local / shared
– Persistent / non-persistent
– Compute and storage separation
• Both are major challenges for Big Data in containers
– Requires custom code (intellectual property)
Integration with Enterprise Services

• Lightweight Directory Access Protocol (LDAP) service


• Active Directory (AD) service
• Domain Name System (DNS)
• Kerberos Key Distribution Center (KDC)
• Key Management Service (KMS)
• Big Data platforms are not conducive to service discovery
– A software-defined networking approach is required
Enterprise Security Policies

• Installation and management on hosts without


passwordless ssh root access
• Privilege-controlled (sudo) environment for user operations
– Role-based security is required
• Auditing and monitoring
• Use of a secured Docker Image Registry
• Multi-tenant, multi-cluster with controlled access
Big Data Performance

• Petabyte-scale storage


– No existing container orchestration tool has an answer on
how to handle different types of persistent storage at scale
– New strategies and approaches need to be considered:
separation of compute & storage is a fundamental concept
• Maintain elasticity
– Why use containers if they are full of state?
• Deliver bare-metal performance with containers
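Compute/storage separation can be sketched at the configuration level: ephemeral containerized Spark executors read from and write to a durable, external HDFS rather than container-local disks. The host names below are illustrative:

```properties
# Illustrative spark-defaults.conf fragment: compute containers come
# and go; the data lives in an external HDFS (or object store).
spark.hadoop.fs.defaultFS   hdfs://remote-hdfs.example.com:8020
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs://remote-hdfs.example.com:8020/spark-logs
```

With this split, containers stay elastic (little state to lose) while the data layer scales independently.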
Support for Hybrid Environments

• Today: on-premises (containerized)


• Next: public cloud
• Tomorrow: hybrid
– Compute and data on-premises
– Compute on-premises and in cloud, with data on-premises
– Compute and data on-premises and in the public cloud
– Compute and data in the public cloud
Big Data in Containers
• Choose container platform (e.g. Docker, rkt)
• Choose & build container orchestration (multiple options)
• Choose & build container networking (multiple options)
• Choose & build persistent storage for Big Data containers
• Build infrastructure security integrations (AD/LDAP, KMS, Kerberos)
• Build cluster management for Big Data on-premises and/or in the cloud
• Build policy engine for resource quotas and multi-tenancy
• Build management and administration console
• Build high availability & disaster recovery
• Build remote storage connectivity (HDFS, S3, NFS, etc.)
• Build guard rails for governance and onboarding new apps

Much more than container orchestration: requires specific design, differentiated system architecture, and engineering to build a “whole solution” for Big Data
A Solution for Big Data in Containers

Purpose-built innovations for Big Data in containers

+
General container orchestration functionality

• Turnkey solution for running multiple Big Data clusters in containers


• Focused on stateful Big Data applications in containers
• Leveraging containers as “lightweight VMs”
• Multi-tenant architecture with enterprise-grade security and performance
• Enabling compute / storage separation for Big Data
BlueData EPIC Software Platform

Users: Data Scientists, Developers, Data Engineers, Data Analysts

BlueData EPIC™ Software Platform
• BI/Analytics Tools: Bring-Your-Own
• ElasticPlane™: self-service, multi-tenant clusters
• IOBoost™: extreme performance and scalability
• DataTap™: in-place access to data on-prem or in the cloud

Compute: on-premises servers or EC2 (public cloud)
Storage: NFS, HDFS, S3
BlueData EPIC Architecture
Best of K8 + Swarm concepts for Big Data

(Diagram: the Controller node runs the REST API server, Controller Mgr., bd_mgmt, data_server, and memq_cnode; each Worker node (1…N) runs a container scheduler, Worker Mgr., bd_mgmt, data_server, and memq_cnode, hosting the compute containers; storage spans local disks and external HDFS.)
Flexible architecture with 20+ patents and bare-metal performance for Big Data
Big Data in Docker Containers

• Namespace awareness for data paths & services
• Big Data service endpoints
• Big Data security
• Big Data service dependency mgmt
• Heartbeat / control system
• Container security (non-privileged)
• Storage driver & network interfaces
• Dynamic, fully managed local volume
Bare-Metal Performance
Intel benchmarking study: BlueData EPIC demonstrated 2.33% higher
performance vs. bare-metal (for 50 Hadoop compute nodes and 10 TB of data)

Source: “Bare-metal performance for Big Data workloads on Docker containers”, Intel white paper, March 2017 http://intel.ly/2lXPZHx
Purpose-Built for Big Data on Docker
Out-of-the-box solution with differentiated Big Data innovations & optimizations

BlueData EPIC container-based platform for Big Data:
• Web-based UI and RESTful APIs for automation
• App Store with Docker-based app images & App Workbench
• Metricbeat + ELK stack for container monitoring
• Container management for Big Data workloads with pre-built HA and multi-tenancy
• Open vSwitch with VXLAN
• Dynamic persistent volumes
• Host OS: CentOS / RHEL only
• On-Premises (physical servers or VMs) or Public Cloud


BlueData Use Cases + Functionality

Example Use Cases: Big Data Analytics, Multi-Tenant Hadoop Operations, Data Science Sandbox, Real-Time Data Pipeline

BlueData EPIC Functionality:
• Self-Service User Interface
• RESTful APIs
• App Store / Registry
• Tenant Management
• Cluster Management
• DataTap Connectors
• IOBoost Caching
• Big Data Security
• Identity & Access
• App Workbench
• Container Networking
• Container Storage
• High Availability
• Resource Scheduler
• Cloud Connector
BlueData EPIC – Access Layer
Management console for administrators and users. REST APIs for automation and workflow.
BlueData EPIC – App Store

DOCKER-BASED APP IMAGES OF YOUR CHOICE:


Same images for on-premises, AWS, or any public cloud
(Use BlueData App Workbench to update existing images,
adding newer versions or net new apps)
BlueData EPIC – Tenants & Policies
Tenant quotas for CPU, memory, & storage. LDAP/AD access control & Kerberos security.
Multi-Tenant Resource Quotas & QoS

Aggregate Docker container storage, memory, cores (CPU shares), and QoS level for each tenant
Isolated Work Environments
Different Big Data applications, tools, and/or versions tailored for each specific tenant (user group)
Containerized Compute Clusters

On-demand, elastic compute environments.


Expand or shrink with just a few mouse clicks.
Compute / Storage Separation

Connectivity from containerized


compute clusters to one or many
remote HDFS or NFS systems
Multi-Host

4 containers
on 2 different hosts
using 1 VLAN and 4 persistent IPs
Different Services in Each Container
Master Services

Worker Services
BlueData EPIC Monitoring

Metricbeat + Elastic + Kibana for cluster- and container-level monitoring
Key Takeaways
• Current container orchestration engines not ready for Big Data
– Hadoop and other Big Data workloads require new architecture and innovation
– Wide array of storage, network, security, & operational features are missing
• Compute/storage separation is a critical architectural concept
– Storing terabytes of persistent data inside containers, as on physical servers, defeats the key goals of running applications in containers
– Big Data must be stored in external storage (e.g. HDFS, Object)
• DIY for Big Data in containers is not an option. It will not work!
– Configuring and tuning container orchestration systems will not be sufficient
– Choose a turnkey, purpose-built platform for Big Data to accelerate time to value
Tom Phelan Anant Chintamaneni
@tapbluedata @anantcman
Thank You

For more information:


www.bluedata.com
sales@bluedata.com
TRY BLUEDATA EPIC ON AWS: www.bluedata.com/aws
