Hadoop Deployment Cheat Sheet

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this
document can help you navigate some of the technology and terminology, and guide you in setting up and
configuring the system.

In this document we provide some background information about the framework, the key distributions,
modules, components, and related products. We also provide you with single and multi-node Hadoop
installation commands and configuration parameters.

The final section includes some tips and tricks to help you get started, and provides guidance in setting up a
Hadoop project.

Hadoop Distributions
Hadoop Modules
Hadoop Components
Hadoop Ecosystem
Major Hadoop Cloud Providers

Single Node Installation
Multi-node Installation
Backup HDFS Metadata
HDFS Basic Commands
HDFS Administration
Resource Manager UI
Secure Hadoop
Common Data Formats
Hadoop Tips and Tricks

Key Hadoop Distributions

Vendor Strength

Apache Hadoop The open source distribution from Apache

Hortonworks A leading vendor committed to a 100% open source package

Cloudera Hadoop filesystem w/ proprietary components for enterprise needs

MapR Uses its own proprietary file system

IBM Integration w/ IBM analytics products

Pivotal Integration w/ Greenplum and Cloud Foundry (CF)

Hadoop Modules
Module Description

Common Common utilities. Supports other Hadoop modules


HDFS Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware

YARN Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling

MapReduce Software framework for parallel processing of large data sets based on YARN

Hadoop Components
Component / Module Description

NameNode / HDFS Maintains the directory tree of the HDFS file system (a.k.a. Hadoop inode)

Secondary NameNode / HDFS Checkpointing helper for the NameNode, not a hot standby. It provides checkpoints of the namespace by merging the edits file into the fsimage file

JournalNode / HDFS Node that holds the shared edit log used by the active and standby NameNodes, which supports automatic failover between them

DataNode / HDFS Nodes (or servers) that store the actual data

NFS3 Gateway / HDFS Daemons that enable NFS3 support

ResourceManager / YARN Global daemon that arbitrates resources among all the applications in the Hadoop cluster

ApplicationMaster / YARN Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager to consume them and monitor the tasks

NodeManager / YARN Single machine agent that is responsible for the containers as well as allocation and monitoring of resource usage such as CPU and disk, and reporting back to the ResourceManager

Container / YARN Runs specific tasks on a specific machine for a specific application based on allocated resources

Hadoop Ecosystem
Product Description


Ambari A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters

Apex Big data in motion platform based on YARN

Azkaban Workflow job scheduling and management system for Hadoop

Flume Reliable, distributed and available service that streams logs into HDFS

Knox Authentication and Access gateway service for Hadoop

HBase Distributed non-relational database that runs on top of HDFS

Hive Data warehouse system based on Hadoop


Mahout Machine learning algorithms (clustering, classification and batch-based collaborative filtering) implemented on Hadoop

Impala Enables low-latency SQL queries on HBase and HDFS

Oozie Workflow job scheduling and management system for Hadoop

Ranger Access policy manager for HDFS files, folders, databases, tables and columns

Spark Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs. Has an SQL-like interface and machine learning library.

Sqoop Data migration application between RDBMS and Hadoop using CLI

Tez Application framework for running complex Directed Acyclic Graph (DAG) of tasks based on YARN

Pig High level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark


ZooKeeper Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop

Major Hadoop Cloud Providers

Cloud operator Service name

Amazon Web Services EMR (Elastic Map Reduce)

IBM SoftLayer IBM BigInsights

Microsoft Azure HDInsight

Common Data Formats
Format Description

Avro Row-based serialization format with schemas defined in JSON; includes RPC support. Designed for systems that exchange data.

Parquet Columnar storage format

ORC Fast columnar storage format

RCFile Data placement format for relational tables

SequenceFile Binary file format made of key/value records with specific data types

Unstructured Hadoop also supports various unstructured data formats

Single Node Installation

Requirement / Task Command

Java Installation / Check version >java -version

Java Installation / Install >sudo apt-get -y update && sudo apt-get -y install default-jdk

Create User and Permissions / Create User >useradd hadoop
>passwd hadoop
>mkdir /home/hadoop
>chown -R hadoop:hadoop /home/hadoop

Create User and Permissions / Create keys >su - hadoop
>ssh-keygen -t rsa
>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>chmod 0600 ~/.ssh/authorized_keys

Install from source >wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
>tar xzf hadoop-2.7.2.tar.gz
>mv hadoop-2.7.2 hadoop

Environment / Env Vars >echo 'export HADOOP_HOME=/home/hadoop/hadoop' >> ~/.bashrc
>source ~/.bashrc

Environment / Set JAVA_HOME >vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_05/

Configuration files / Edit if required core-site.xml


Format NameNode >hdfs namenode -format

Start System >cd $HADOOP_HOME/sbin/
>./start-dfs.sh
>./start-yarn.sh

Test System >$HADOOP_HOME/bin/hdfs dfs -mkdir /user
>$HADOOP_HOME/bin/hdfs dfs -mkdir /user/hadoop
>$HADOOP_HOME/bin/hdfs dfs -put /var/log/httpd logs
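
The configuration-file step above names core-site.xml without showing its content. As a minimal sketch, assuming a single-node setup where the NameNode listens on localhost port 9000 (an assumed address, not a value from this sheet), the file could be generated as follows. The helper name write_core_site and the staging directory are hypothetical; fs.defaultFS is the standard Hadoop property for the default file system URI.

```shell
# Sketch: generate a minimal core-site.xml into a staging directory.
# write_core_site is a hypothetical helper; hdfs://localhost:9000 is an assumed address.
write_core_site() {
  local conf_dir="$1"
  local default_fs="${2:-hdfs://localhost:9000}"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>${default_fs}</value>
  </property>
</configuration>
EOF
}

# Write to a scratch directory first, review the result, then copy it
# into $HADOOP_HOME/etc/hadoop by hand.
staging_dir="$(mktemp -d)"
write_core_site "$staging_dir"
cat "$staging_dir/core-site.xml"
```

Writing to a staging directory rather than straight into the Hadoop configuration directory keeps an existing core-site.xml from being overwritten by accident.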

Multi-node Installation
Task Command

Configure hosts on each node >vi /etc/hosts
//add a line with the IP address and hostname of hadoop-master, hadoop-slave-1 and hadoop-slave-2

Enable cross node authentication >su - hadoop
>ssh-keygen -t rsa
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
>chmod 0600 ~/.ssh/authorized_keys
>exit

Copy system >su - hadoop
>cd /opt/hadoop
>scp -r hadoop hadoop-slave-1:/opt/hadoop
>scp -r hadoop hadoop-slave-2:/opt/hadoop

Configure Master >su - hadoop
>cd /opt/hadoop/hadoop
>vi conf/masters
//add your master node to the file:
hadoop-master
>vi conf/slaves
//add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2

Format NameNode >su - hadoop
>cd /opt/hadoop/hadoop
>bin/hadoop namenode -format

Start system >bin/start-all.sh
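
The hosts and slaves files in the table above follow a simple pattern that can be generated rather than typed on every node. The sketch below prints both, using the three hostnames from this sheet. The IP addresses are placeholders from the 192.0.2.0/24 documentation range, and the function names are hypothetical.

```shell
# Sketch: print /etc/hosts entries and a slaves file for the hosts named above.
# The IP addresses are placeholders, not real cluster addresses.
MASTER="hadoop-master"
SLAVES="hadoop-slave-1 hadoop-slave-2"

hosts_entries() {
  # $1 = first three octets of the network, $2 = host number of the master
  local prefix="$1"
  local n="$2"
  local host
  for host in $MASTER $SLAVES; do
    printf '%s.%s %s\n' "$prefix" "$n" "$host"
    n=$((n + 1))
  done
}

slaves_file() {
  # One slave hostname per line
  local host
  for host in $SLAVES; do
    printf '%s\n' "$host"
  done
}

hosts_entries 192.0.2 10
slaves_file
```

Review the output before appending it to /etc/hosts or the slaves file on each node.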

Backup HDFS Metadata

Task Command

Stop the cluster >stop-all.sh

Perform cold backup of the metadata directories >cd /data/dfs/nn
>tar -czvf /tmp/backup.tar.gz .

Start the cluster >start-all.sh
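
The cold backup above can be wrapped in a small helper so that each run produces a separate, verified archive. This is a sketch only: backup_metadata is a hypothetical function name, the timestamped file name is an assumption, and /data/dfs/nn is the metadata path used in the table.

```shell
# Sketch: archive a metadata directory into a timestamped, gzip-compressed tarball.
# backup_metadata is a hypothetical helper, not a Hadoop command.
backup_metadata() {
  local src_dir="$1"
  local dest_dir="$2"
  local archive="$dest_dir/nn-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C switches into the source directory so the archive holds relative paths
  tar -czf "$archive" -C "$src_dir" . || return 1
  # List the archive to confirm it is readable before relying on it
  tar -tzf "$archive" > /dev/null || return 1
  printf '%s\n' "$archive"
}

# Example (run only while the cluster is stopped):
# backup_metadata /data/dfs/nn /tmp
```

Keeping the destination outside the metadata directory avoids archiving the backup into itself.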

HDFS Basic Commands

Task Command

List the content of the /data directory >hdfs dfs -ls /data/

Upload a file from the local file system to HDFS >hdfs dfs -put logs.csv /data/

Read the content of the file from HDFS >hdfs dfs -cat /data/logs.csv

Change the permission of a file >hdfs dfs -chmod 744 /data/logs.csv

Set the replication factor of a file to 3 >hdfs dfs -setrep -w 3 /data/logs.csv

Check the size of the file >hdfs dfs -du -h /data/logs.csv

Move the file to the newly-created subdirectory >hdfs dfs -mv logs.csv logs/

Remove directory from HDFS >hdfs dfs -rm -r logs

HDFS Administration
Task Command

Balance the cluster storage >hdfs balancer -threshold <percentage>

Run the NameNode >hdfs namenode

Run the secondary NameNode >hdfs secondarynamenode

Run a datanode >hdfs datanode

Run the NFS3 gateway >hdfs nfs3

Run the RPC portmap for the NFS3 gateway >hdfs portmap

Task Command

Show yarn help >yarn

Define configuration directory >yarn [--config confdir]

Define log level >yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE

User commands

Show Hadoop classpath >yarn classpath

Show and kill application >yarn application

Show application attempt >yarn applicationattempt

Show container information >yarn container

Show node information >yarn node

Show queue information >yarn queue

Administration commands

Start NodeManager >yarn nodemanager

Start Proxy web server >yarn proxyserver

Start ResourceManager >yarn resourcemanager

Run ResourceManager admin client >yarn rmadmin

Start Shared Cache Manager >yarn sharedcachemanager

Start Timeline Server >yarn timelineserver

Submit the WordCount MapReduce job to the cluster

>hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input


Check the output of this job in HDFS >hadoop fs -cat logs-output/*

Submit a scalding job >hadoop jar scalding.jar com.twitter.scalding.Tool Scalding

Kill a MapReduce job >yarn application -kill <application_id>

Resource Manager UI
Resource Default URI (replace <host> with the hostname of the node running the service)

NameNode http://<host>:50070/

DataNode http://<host>:50075/

Sec NameNode http://<host>:50090/

Resource Manager http://<host>:8088

HBase Master http://<host>:60010

Secure Hadoop
Aspect Best Practice

Authentication
Define users
Enable Kerberos in Hadoop
Set up Knox gateway to control access and authentication to the HDFS cluster
Integrate with the organization's SSO and LDAP

Authorization
Define groups
Define HDFS permissions
Define HDFS ACLs
Enable Ranger policies to control access to HDFS files, folders, databases, tables and columns

Audit Enable process execution audit trail

Data Protection Wire encryption with Knox or Hadoop

Hadoop Tips and Tricks

Project Concept

Iterate cluster sizing to optimize performance and meet actual load patterns


Clusters with more nodes recover faster

The higher the storage per node, the longer the recovery time

Use commodity hardware:

Use large slow disks (SATA) without RAID (3-6TB disks)
Use as much RAM as is cost-effective (96-192GB RAM)
Use mainstream CPU with as many cores as possible (8-12 cores)

Invest in reliable hardware for the NameNodes

NameNode RAM should be 2GB + 1GB for every 100TB raw disk space

cost should be 20% of hardware budget

40 nodes is the critical mass to achieve best performance/cost ratio

Your actual net storage capacity should be 25% of raw storage capacity. This leaves 25% spare capacity, and allows
for 3 replicas
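
The two sizing rules above (NameNode RAM and net storage capacity) reduce to simple arithmetic. The sketch below applies them to an assumed example cluster with 400 TB of raw disk. The function names are hypothetical, and rounding the RAM figure up to the next full 100 TB is an assumption, since the rule does not say how to treat partial steps.

```shell
# Sketch of the sizing rules above, using integer arithmetic (TB and GB).

# NameNode RAM: 2 GB base + 1 GB for every 100 TB of raw disk (rounded up).
namenode_ram_gb() {
  local raw_tb="$1"
  echo $(( 2 + (raw_tb + 99) / 100 ))
}

# Net capacity: keep 25% of raw space spare, then divide by 3 replicas.
net_capacity_tb() {
  local raw_tb="$1"
  echo $(( raw_tb * 75 / 100 / 3 ))
}

raw_tb=400   # assumed example: 400 TB of raw disk
echo "NameNode RAM: $(namenode_ram_gb "$raw_tb") GB"
echo "Net capacity: $(net_capacity_tb "$raw_tb") TB"
```

For the 400 TB example this prints 6 GB of NameNode RAM and 100 TB of net capacity, which is the 25% of raw storage stated in the rule.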

Operating System and JVM

Must be 64-bit

Set file descriptor limit to 64K (ulimit)

Enable time synchronization using NTP

Speed up reads by mounting disks with NOATIME

Disable hugepages
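
The file descriptor recommendation above is easy to verify on each node. The sketch below compares the current shell's open-file limit with 65536 (64K). The helper name fd_limit_ok is hypothetical; it takes the limit as an argument so the comparison can be checked independently of the machine it runs on.

```shell
# Sketch: compare an open-file limit against the 64K recommended above.
# fd_limit_ok is a hypothetical helper, not part of Hadoop.
fd_limit_ok() {
  local limit="$1"
  local wanted=65536
  # "unlimited" always satisfies the requirement
  if [ "$limit" = "unlimited" ]; then
    return 0
  fi
  [ "$limit" -ge "$wanted" ]
}

current="$(ulimit -n)"
if fd_limit_ok "$current"; then
  echo "open-file limit OK: $current"
else
  echo "open-file limit too low: $current (want at least 65536)"
fi
```

Run it as the user that starts the Hadoop daemons, since limits are set per user and per session.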


Enable monitoring using Ambari

Monitor the checkpoints of the NameNodes to verify that they occur at the correct times. This will enable you to
recover your cluster when needed

Avoid reaching 90% cluster disk utilization

Balance the cluster periodically using balancer

Edit metadata files using Hadoop utilities only, to avoid corruption

Keep replication >= 3

Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation

Clean /tmp regularly; it tends to fill up with junk files

Optimize the number of reducers to avoid system starvation

Verify that the file system you selected is supported by your Hadoop vendor

Data and System Recovery

Disk failure is not an issue

DataNode failure is not a major issue

NameNode failure is an issue even in a clustered environment

Make regular backups of NameNode metadata

Enable NameNode clustering using ZooKeeper

Provide sufficient disk space for NameNode logging

Enable trash in core-site.xml to avoid accidental permanent deletion (rm -r)
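
The trash tip above corresponds to the fs.trash.interval property in core-site.xml, which sets how many minutes deleted files are kept before permanent removal (0 disables the trash). The sketch below prints the property fragment. The helper name trash_property is hypothetical, and the default of 1440 minutes (24 hours) is an assumed retention period, not a value from this sheet.

```shell
# Sketch: print the core-site.xml property that enables the HDFS trash.
# trash_property is a hypothetical helper; 1440 minutes is an assumed retention.
trash_property() {
  local minutes="${1:-1440}"
  cat <<EOF
<property>
  <name>fs.trash.interval</name>
  <value>${minutes}</value>
</property>
EOF
}

trash_property
```

Paste the printed fragment inside the configuration element of core-site.xml and restart the affected services for it to take effect.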

