
MapReduce and

Hadoop Ecosystem
Overview
• MapReduce
• Programming Model
• How it works
• Fault tolerance
• Locality
• Implementation Examples
• Hadoop Ecosystem
• Architecture
• Components
• HDFS
• YARN
MapReduce
• Paper: MapReduce: Simplified Data Processing on Large Clusters
• Jeffrey Dean and Sanjay Ghemawat (Google)
• 6th Symposium on Operating Systems Design and Implementation (OSDI)
• 2004
• Large scale data processing
• Process a lot of data
• Want to use hundreds/thousands of CPUs
• Need to be simple
MapReduce
• The framework provides
• Automatic parallelization and distribution
• Fault tolerance
• I/O scheduling
• Status and monitoring
MapReduce Programming Model
• Map and Reduce functions
  • Created by the programmer
• Map function
  • Input: (in_key, in_value)
  • Output: list(out_key, intermediate_value)
• Reduce function
  • Input: (out_key, list(intermediate_value))
  • Output: list(out_value)
• Parallelization and distribution
  • Handled by the framework
Data Types
• map (k1, v1) → list(k2, v2)
  • The input keys and values are drawn from a different domain than the output keys and values
• reduce (k2, list(v2)) → list(v2)
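These signatures can be sketched with Java generics. This is a minimal sketch; the interface and Pair names below are invented for illustration and are not part of any MapReduce library.

// Hypothetical interfaces mirroring map (k1,v1) → list(k2,v2)
// and reduce (k2, list(v2)) → list(v2). Names are illustrative only.
import java.util.List;

interface MapFunction<K1, V1, K2, V2> {
    // One input record in, zero or more intermediate pairs out.
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2> {
    // All intermediate values for one key in, final values out.
    List<V2> reduce(K2 key, List<V2> values);
}

// Minimal immutable pair type used by the sketch above (Java 16+ record).
record Pair<K, V>(K key, V value) {}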
Example: WordCount
Counting the number of occurrences of each word in a
large collection of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
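For comparison, the same WordCount written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce) looks roughly like the sketch below; the driver/job configuration is omitted for brevity.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: (offset, line of text) -> (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}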
WordCount MapReduce
How MapReduce Works
• The MapReduce library splits the input files into M pieces
• Typically 16 – 64 MB
• Starts up many copies of the
program on a cluster of
machines
How MapReduce Works
• The master assigns work to the workers
• There are M map tasks and R
reduce tasks to assign
• Master picks idle workers and
assigns each one a map task or a
reduce task
How MapReduce Works
• A worker who is assigned a map
task reads the contents of the
corresponding input split
• It parses key/value pairs out of the
input data and passes each
pair to the user-defined Map
function
• The intermediate key/value pairs
produced by the Map function
are buffered in memory
How MapReduce Works
• Periodically, the buffered pairs
are written to local disk
• Partitioned into R regions
• Locations of these buffered pairs
on the local disk are passed back
to the master
• Master forwards the locations to
the reduce workers
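The partitioning into R regions is typically done by hashing the intermediate key modulo R, so every pair with the same key ends up in the same region (and therefore at the same reduce worker). A minimal sketch of that idea, not the library's own code:

public class HashPartitionSketch {
  // Mask off the sign bit before taking the modulus so the result is 0..R-1.
  public static int partition(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int r = 4; // R reduce tasks
    for (String key : new String[] {"hadoop", "mapreduce", "yarn"}) {
      System.out.println(key + " -> region " + partition(key, r));
    }
  }
}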
How MapReduce Works
• The reduce worker is notified by the master about the locations of the intermediate files
• The reduce worker uses remote procedure calls to read the buffered data from the map workers' local disks
• It sorts the data it has read by the intermediate keys
  • Occurrences of the same key are grouped together
How MapReduce Works
• The reduce worker iterates over the
sorted intermediate data
• For each unique intermediate key encountered
  • Passes the key and the corresponding set of intermediate values to the user's Reduce function
• The output of the Reduce function
is appended to a final output file for
this reduce partition
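A simplified sketch of this sort-group-reduce step, assuming the intermediate data has already been sorted by key; the types and names are illustrative, not the library's:

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class ReducePhaseSketch {
  // Iterate over sorted intermediate data and invoke the user's reduce
  // function once per unique key. A TreeMap stands in for the sorted run
  // of intermediate pairs read from the map workers.
  static void runReduce(TreeMap<String, List<Integer>> sortedIntermediate,
                        BiConsumer<String, Iterator<Integer>> userReduce) {
    for (Map.Entry<String, List<Integer>> entry : sortedIntermediate.entrySet()) {
      userReduce.accept(entry.getKey(), entry.getValue().iterator());
    }
  }

  public static void main(String[] args) {
    TreeMap<String, List<Integer>> data = new TreeMap<>();
    data.put("hadoop", List.of(1, 1, 1));
    data.put("yarn", List.of(1));
    runReduce(data, (word, counts) -> {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next();
      // In the real library this result would be appended to the
      // reduce partition's output file.
      System.out.println(word + "\t" + sum);
    });
  }
}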
Fault Tolerance
• MapReduce library is designed to help process very large amounts of
data using hundreds or thousands of machines
• The library must tolerate machine failures
• Worker failure
• Master failure
Worker Failure
• The master pings every worker periodically.
• If no response is received from a worker in a certain amount of
time, the master marks the worker as failed
• Any map tasks completed by the worker are reset back to their initial
idle state
• become eligible for scheduling on other workers
• Similarly, any map task or reduce task in progress on a failed worker is
also reset to idle and becomes eligible for rescheduling.
Worker Failure
• Completed map tasks are re-executed on a failure because their
output is stored on the local disk(s) of the failed machine and is
therefore inaccessible.
• Completed reduce tasks do not need to be re-executed since their
output is stored in a global file system
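The re-execution rule can be summarized in a short sketch (the task bookkeeping types are invented for illustration): map tasks on a failed worker are reset even if completed, while reduce tasks are reset only if still in progress.

import java.util.List;

public class WorkerFailureSketch {
  enum TaskType { MAP, REDUCE }
  enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

  static class Task {
    TaskType type;
    TaskState state;
    String worker;
    Task(TaskType type, TaskState state, String worker) {
      this.type = type; this.state = state; this.worker = worker;
    }
  }

  // When a worker is marked failed:
  // - its map tasks (in progress or completed) go back to IDLE, because their
  //   output lives on the failed machine's local disk;
  // - its reduce tasks go back to IDLE only if still in progress, since
  //   completed reduce output is already in the global file system.
  static void handleWorkerFailure(List<Task> tasks, String failedWorker) {
    for (Task t : tasks) {
      if (!failedWorker.equals(t.worker)) continue;
      boolean mustReset =
          (t.type == TaskType.MAP && t.state != TaskState.IDLE)
          || (t.type == TaskType.REDUCE && t.state == TaskState.IN_PROGRESS);
      if (mustReset) {
        t.state = TaskState.IDLE; // eligible for rescheduling on another worker
        t.worker = null;
      }
    }
  }

  public static void main(String[] args) {
    List<Task> tasks = List.of(
        new Task(TaskType.MAP, TaskState.COMPLETED, "worker-7"),
        new Task(TaskType.REDUCE, TaskState.COMPLETED, "worker-7"));
    handleWorkerFailure(tasks, "worker-7");
    for (Task t : tasks) System.out.println(t.type + " -> " + t.state);
  }
}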
Master Failure
• The master writes periodic checkpoints of the master data structures
• If the master task dies, a new copy can be started from the last
checkpointed state
• Given that there is only a single master, its failure is unlikely
• The current implementation aborts the MapReduce computation if the
master fails
• Clients can check for this condition and retry the MapReduce
operation if they desire
Locality
• Network bandwidth is a relatively scarce resource in a computing
environment
• Conserve network bandwidth by taking advantage of the fact that the
input data (managed by GFS) is stored on the local disks of the
machines that make up our cluster
• GFS divides each file into 64 MB blocks, and stores several copies of
each block (typically 3 copies) on different machines.
• The MapReduce master takes the location information of the input
files into account and attempts to schedule a map task on a machine
that contains a replica of the corresponding input data
Locality
• Failing that, it attempts to schedule a map task near a replica of that
task’s input data
• (e.g., on a worker machine that is on the same network switch as the machine
containing the data).
• When running large MapReduce operations on a significant fraction of
the workers in a cluster, most input data is read locally and consumes
no network bandwidth.
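A rough sketch of this scheduling preference (node-local first, then rack-local, then any idle worker); the Worker type and the way replica locations are represented here are assumptions made for illustration:

import java.util.List;
import java.util.Optional;
import java.util.Set;

public class LocalityScheduleSketch {
  record Worker(String host, String rack, boolean idle) {}

  // Prefer an idle worker that holds a replica of the input split, then one
  // on the same rack/switch as a replica, then any idle worker (off-rack).
  static Optional<Worker> pickWorker(List<Worker> workers,
                                     Set<String> replicaHosts,
                                     Set<String> replicaRacks) {
    return workers.stream()
        .filter(Worker::idle)
        .filter(w -> replicaHosts.contains(w.host()))
        .findFirst()
        .or(() -> workers.stream()
            .filter(Worker::idle)
            .filter(w -> replicaRacks.contains(w.rack()))
            .findFirst())
        .or(() -> workers.stream().filter(Worker::idle).findFirst());
  }
}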
MapReduce Implementations
• Distributed grep
• map function : emits a line if it matches a supplied pattern
• reduce function : an identity function that just copies the supplied
intermediate data to the output
• Count of URL Access frequency
• map function : processes logs of web page requests and outputs <URL, 1>.
• reduce function : adds together all values for the same URL and emits a <URL,
total_count> pair
MapReduce Implementations
• Reverse Web Link Graph
• map function : outputs <target, source> pairs for each link to a target URL
found in a page named source.
• reduce function concatenates the list of all source URLs associated with a
given target URL and emits the pair: <target, list(source)>
• Term Vector per Host
• A term vector summarizes the most important words that occur in a
document or a set of documents as a list of <word, frequency> pairs
• map function : emits a <hostname, term vector> pair for each input document
MapReduce Implementations
• Inverted Index
• map function parses each document, and emits a sequence of <word,
document ID> pairs
• reduce function accepts all pairs for a given word, sorts the corresponding
document IDs and emits a <word, list(document ID)> pair
• The set of all output pairs forms a simple inverted index
• Distributed Sort
• map function extracts the key from each record, and emits a <key, record>
pair
• reduce function emits all pairs unchanged.
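As one concrete illustration of these patterns, the inverted index can be sketched as a pair of plain Java functions; simple String types stand in for the framework's key/value types:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexSketch {
  // map: (document ID, contents) -> list of <word, document ID> pairs
  static List<String[]> map(String docId, String contents) {
    List<String[]> out = new ArrayList<>();
    for (String word : contents.toLowerCase().split("\\W+")) {
      if (!word.isEmpty()) out.add(new String[] {word, docId});
    }
    return out;
  }

  // reduce: (word, list of document IDs) -> <word, sorted set of document IDs>
  static Map.Entry<String, TreeSet<String>> reduce(String word, List<String> docIds) {
    return Map.entry(word, new TreeSet<>(docIds));
  }

  public static void main(String[] args) {
    for (String[] pair : map("doc1", "MapReduce simplifies data processing")) {
      System.out.println(pair[0] + " -> " + pair[1]);
    }
  }
}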
Big Data Platform
• Batch Processing
• Stream Processing
"Hadoop logo" by Apache Software Foundation - https://svn.apache.org/repos/asf/hadoop/logos/out_rgb/. Licensed under Apache License 2.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Hadoop_logo.svg#mediaviewer/File:Hadoop_logo.svg
Developers
• Doug Cutting and Mike Cafarella
Hadoop Timeline
• 2002 : Nutch project started
• 2003 : Google publishes the GFS architecture paper
• 2004 : Nutch Distributed File System developed; Google publishes the MapReduce paper
Timeline (cont’d)
• 2006 : Hadoop project started
• 2008 : Hadoop became a top-level project in Apache
• 2008 : Hadoop broke the record as the fastest system to sort a terabyte of data
Hadoop Distribution and Cloud Support
• Hadoop distribution
• Apache Hadoop (http://hadoop.apache.org)
• Cloudera – CDH (http://cloudera.com)
• Hortonworks Data Platform – HDP (http://hortonworks.com)
• MapR Distribution (http://www.mapr.com)
• Cloud support for Hadoop as a service
• Google Cloud Platform (https://cloud.google.com/hadoop/)
• Rackspace Cloud Big Data (https://www.rackspace.com/cloud/big-data)
• Azure HDInsight (https://azure.microsoft.com/en-us/services/hdinsight/)
• Amazon EMR (https://aws.amazon.com/emr/)
Hadoop
• Framework for distributed storage and processing on clusters of commodity hardware
• Components
  • Hadoop Common
  • Hadoop DFS (HDFS)
  • Hadoop MapReduce
  • Hadoop YARN
Hadoop Architecture
Hadoop Distributed File System (HDFS)
• Store data redundantly on multiple nodes for persistence and
availability
• Adopted from Google GFS
HDFS
• Not good for
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
• HDFS characteristics
• Very large files
• Write-once, read many
• Commodity hardware
HDFS Concepts
• Blocks
• Namenodes and datanodes
Blocks
• A disk has a block size
• Minimum amount of data that it can read or write
• Normally 512 bytes
• Filesystems for a single disk build on this by dealing with data in
blocks
• typically a few kilobytes
• HDFS uses a much larger block size
• Default 128 MB
Blocks
• Why large?
• Minimize the cost of disk seeks
• Map tasks in MapReduce normally operate on one block at a time
• Benefits of the block abstraction for a distributed filesystem
  • A file can be larger than any single disk in the network
  • Can use any disk in the cluster
  • Simplifies the storage subsystem
  • Blocks are a fixed size
    • Eliminates metadata concerns
Blocks
• Blocks fit well with replication for providing fault tolerance and
availability
• Replicated to a small number of physically separate machines
• Typically three
• A block that is no longer available can be replicated from alternative
locations
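A quick worked example of what these numbers imply, assuming the 128 MB default block size and a replication factor of 3:

public class BlockMathSketch {
  public static void main(String[] args) {
    long fileBytes = 1024L * 1024 * 1024;   // a 1 GB file
    long blockBytes = 128L * 1024 * 1024;   // default HDFS block size
    int replication = 3;                    // typical replication factor

    long blocks = (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    System.out.println("blocks         = " + blocks);                  // 8
    System.out.println("block replicas = " + (blocks * replication));  // 24
    // Note: a block only occupies as much disk space as its actual data,
    // so a file's final (possibly partial) block does not waste a full 128 MB.
  }
}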
Namenodes and Datanodes
• HDFS cluster has two types of nodes
• Namenode (master)
• Datanode (worker)
• Namenode manages the filesystem namespace
• Maintains the filesystem tree and the metadata for all the files and directories in the tree
• Stored persistently as the namespace image and the edit log
• Datanodes store and retrieve blocks when they are told to
• By clients or namenode
• Report back periodically to namenode with lists of blocks that they are storing
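Client code normally goes through the FileSystem API rather than talking to namenodes and datanodes directly; a minimal read sketch, where the namenode address and file path are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/example.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}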
Namenodes and Datanodes
• Secondary namenode
• Does not act as a namenode
• Periodically merges the namespace image with the edit log to prevent the edit log becoming too large
• The secondary namenode usually runs on a separate physical machine
  • Needs plenty of CPU and memory to perform the merge
HDFS
• Data is split into blocks
• Each block/chunk is spread across machines
  • Ensures persistence and availability
• [Diagram: blocks C0–C5 and D0–D1 replicated across datanode server 1, datanode server 2, …, datanode server N]
• Datanode servers also act as compute servers
  • Bring computation to the data
Hadoop YARN
• Yet Another Resource Negotiator
• Introduced in Hadoop 2
• Improve the MapReduce implementations
• General enough to support other distributed computing paradigms
• Provides API for requesting and working with cluster resources
• Users typically write to higher-level APIs provided by distributed computing
frameworks
• Hide the resource management details from the user
Hadoop YARN
Hadoop YARN Architecture
YARN Components
• Two types of long-running daemons
• Resource manager (one per cluster)
• Node manager
• Resource manager
• Manage the use of resources across the cluster
• Node manager
• Running on all the nodes in the cluster to launch and monitor containers
• Container
• Executes an application-specific process with a constrained set of resources
(memory, CPU, etc.)
How YARN works
• A client contacts the resource
manager and asks it to run an
application master process (1)
• The resource manager finds a
node manager that can launch
the application master in a
container (2a and 2b)
• Simply run a computation in the
container
• Or request more containers from
the resource manager (3)
How YARN works
• If the application master requests more containers from the resource manager (3)
• It uses the containers to run a distributed computation (4a and 4b)
Resource Requests
• A request for a set of containers can express
• the amount of computer resources required for each container (memory and
CPU)
• locality constraints for the containers in that request
• Locality is critical in ensuring that distributed data processing
algorithms use the cluster bandwidth efficiently
• YARN allows an application to specify locality constraints for the
containers it is requesting.
• Locality constraints can be used to request a container
on a specific node or rack, or anywhere on the cluster (off-rack)
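Expressed in code, an application master using YARN's AMRMClient API can attach node and rack preferences to a container request roughly as follows; the host and rack names are placeholders, and registering with the resource manager plus the allocate() heartbeat loop are omitted from this sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
  public static void main(String[] args) {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // 1 GB of memory and 1 virtual core per container.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);

    // Preferred nodes and racks (placeholder names); YARN may relax these
    // constraints to rack-local or off-rack if the exact nodes are busy.
    String[] nodes = {"node1.example.com", "node2.example.com"};
    String[] racks = {"/rack1"};

    rmClient.addContainerRequest(
        new ContainerRequest(capability, nodes, racks, priority));
  }
}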
Resource Requests
• Sometimes the locality constraint cannot be met, in which case either
no allocation is made or, optionally, the constraint can be loosened.
• E.g., if a specific node was requested but it is not possible to start a
container on it (because other containers are running on it)
• YARN will try to start a container on a node in the same rack, or, if that’s not
possible, on any node in the cluster.
• For example, an application processing an HDFS block will request a container on one of the nodes hosting the block's three replicas, or on a node in one of the racks hosting the replicas, or, failing that, on any node in the cluster
Resource Requests
• A YARN application can make resource requests at any time while it is
running
• E.g., an application can make all of its requests up front, or it can take a more
dynamic approach whereby it requests more resources dynamically to
meet the changing needs of the application.
• Spark starting a fixed number of executors on the cluster
• MapReduce has two phases
• the map task containers are requested up front
• the reduce task containers are not started until later
• if any tasks fail, additional containers will be requested so the failed
tasks can be rerun
Application Lifespan
• Can vary
• One application per user job
• MapReduce approach
• Run one application per workflow or user session of (possibly
unrelated) jobs
• can be more efficient than the first, since containers can be reused between
jobs
• there is also the potential to cache intermediate data between jobs.
• Spark approach
Application Lifespan
• A long-running application that is shared by different users
• An application often acts in some kind of coordination role
• provide a proxy application that the Impala daemons communicate with to
request cluster resources
• Impala approach
YARN vs MapReduce 1
• MapReduce 1 → Hadoop version 1
• YARN → Hadoop version 2
• MapReduce 1
• A jobtracker
• One or more tasktrackers
• Jobtracker
• Coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers
• Tasktrackers
• Run tasks
• Send progress reports to the jobtracker
YARN vs MapReduce 1
• In MapReduce 1
• Jobtracker function:
• job scheduling : matching task with tasktracker
• task progress monitoring : keeping track of tasks, restarting failed or slow tasks, doing
task bookkeeping
• In YARN
• these responsibilities are handled by separate entities (one for each
MapReduce job) :
• the resource manager
• an application master
YARN vs MapReduce 1
• Scalability
• YARN can run on larger clusters than MapReduce 1
• MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000
tasks
• the jobtracker has to manage both jobs and tasks
• Availability
• High availability (HA) is usually achieved by replicating the state needed for
another daemon to take over the work needed to provide the service, in the
event of the service daemon failing
• the large amount of rapidly changing complex state in the jobtracker’s memory
(each task status is updated every few seconds, for example) makes it very difficult
to retrofit HA into the jobtracker service
YARN vs MapReduce 1
• Utilization
• each tasktracker is configured with a static allocation of fixed-size “slots,”
which are divided into map slots and reduce slots at configuration time
• A map slot can only be used to run a map task, and a reduce slot can only be
used for a reduce task.
• A node manager manages a pool of resources, rather than a fixed number of
designated slots
• Multitenancy
• Opens up Hadoop to other types of distributed application beyond
MapReduce.
• MapReduce is just one YARN application among many
Hadoop Related Projects: Avro
• Language-neutral data serialization system
• Makes a data format that can be processed by many languages
  • C, C++, C#, Java, JavaScript, Perl, PHP, Python, Ruby
• Uses a schema in JSON to define the data types
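For instance, a record type can be described with a small JSON schema and parsed with Avro's Java API; the record and field names below are invented for illustration:

import org.apache.avro.Schema;

public class AvroSchemaSketch {
  public static void main(String[] args) {
    // JSON schema for a record with two fields; any language with an Avro
    // library can read and write data that conforms to it.
    String json = "{"
        + "\"type\": \"record\","
        + "\"name\": \"PageView\","
        + "\"fields\": ["
        + "  {\"name\": \"url\",  \"type\": \"string\"},"
        + "  {\"name\": \"hits\", \"type\": \"long\"}"
        + "]}";
    Schema schema = new Schema.Parser().parse(json);
    System.out.println(schema.toString(true)); // pretty-print the parsed schema
  }
}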
Hadoop Related Projects: Flume
• High-volume ingestion of event-based data into Hadoop
• E.g., collect logfiles from a bank of web servers
  • Moving the log events from those files into new aggregated files in HDFS for processing
Hadoop Related Projects: Sqoop
• Extracts data from a structured data store into Hadoop for further processing
  • Using Hadoop MapReduce or Hive
• Sqoop can export the final results back to the data store
Hadoop Related Projects: Pig
• Abstraction for processing large datasets
• Much richer data structures and
transformations
• Components
• Language to express data flows (Pig Latin)
• Environment to run Pig Latin programs
• Local in single JVM
• Distributed execution on a Hadoop Cluster
Hadoop Related Projects: Hive
• A framework for data warehousing on top of Hadoop
• Originated at Facebook
• Uses a SQL-like language to run queries on huge volumes of data
Hadoop Related Projects: HBase
• Distributed, column-oriented database built on top of HDFS
• Used when real-time read/write random access to very large datasets is required
• Not relational and does not support SQL
References
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data
Processing on Large Clusters, OSDI, 2004
• Tom White, Hadoop: The Definitive Guide, 4th Edition, O’Reilly, 2015
• http://hadoop.apache.org
• Jure Leskovec, Anand Rajaraman, and Jeffrey Ullman, Presentation
slide Coursera “Mining Massive Dataset”
• http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
• Subash D’Souza, Hadoop 2.0 and YARN