
-----MapReduce workflows in big data-----

MapReduce is a programming model and an associated implementation for processing large data sets. It was developed by Google and has become a fundamental tool for processing and analyzing big data.

A MapReduce workflow is a process for processing large datasets using a MapReduce system. The workflow typically consists of two stages: the map stage and the reduce stage.

In the map stage, the input data is divided into smaller chunks, called input splits, which are processed independently by different nodes in a cluster. Each node applies a map function to its chunk of the data, producing a set of intermediate key-value pairs.

In the reduce stage, the intermediate key-value pairs are grouped and aggregated by key, using a reduce function. The result is a set of output key-value pairs, which can be written to disk or passed on to another MapReduce job for further processing.
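
To make the two stages concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names (WordCountMapper, WordCountReducer) are illustrative, not part of any particular codebase.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);                // intermediate key-value pair
    }
  }
}

// Reduce stage: sum the counts emitted for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum)); // final key-value pair
  }
}
```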

MapReduce workflows are commonly used for large-scale data processing tasks such as data indexing, log analysis, and machine learning. They are typically implemented on a framework such as Apache Hadoop, whose distributed file system (HDFS) stores the data and whose runtime manages the MapReduce process across multiple nodes in a cluster.

MapReduce workflows can be highly scalable and fault-tolerant, making them a powerful tool for processing big data. However, they can also be complex to design and implement, and they require specialized expertise in distributed systems and parallel programming.

-------MapReduce types----

In the context of Hadoop, there are two main types of MapReduce jobs: batch processing and streaming.

Batch processing: This is the traditional MapReduce job type, where the input data is stored in HDFS (Hadoop Distributed File System) and the output is also written to HDFS. Batch processing jobs are suitable for processing large amounts of structured or unstructured data. They typically involve a large amount of data processing and can take anywhere from several minutes to several hours or even days to complete, depending on the size and complexity of the data.
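
A batch job of this kind is typically driven by a small main class that points the job at HDFS paths. Below is a minimal sketch, assuming the illustrative WordCountMapper and WordCountReducer from the previous section; the /data/input and /data/output paths are example values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchWordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "batch word count");
    job.setJarByClass(BatchWordCount.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Both input and output live in HDFS for a batch job.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```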

Streaming: Streaming is a newer type of MapReduce job that allows for real-time processing of data. With streaming, data is processed as it is generated, rather than waiting for a batch of data to accumulate before processing. Streaming jobs are suitable for processing continuous streams of data, such as log data or sensor data. The output of streaming jobs is typically written to external systems such as NoSQL databases or message queues, rather than being stored in HDFS.

----unit tests with MRUnit----

MRUnit is a Java-based unit testing framework for MapReduce jobs. It allows developers to test their MapReduce code in isolation, without having to run a full MapReduce job on a cluster.

To write unit tests with MRUnit, you create a test class that uses MRUnit's driver classes (MapDriver, ReduceDriver, and MapReduceDriver). This class defines one or more test methods that supply sample input data and verify the output produced by the mapper and reducer.

MRUnit provides a range of helper classes and methods for setting up input data, configuring the MapReduce job, and verifying output data. By using these tools, developers can quickly and easily write unit tests for their MapReduce code, helping to ensure that it works as expected before deploying it to a production environment.
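
A sketch of such a test, assuming MRUnit and JUnit are on the classpath and reusing the illustrative WordCountMapper from earlier; the input line and expected outputs are arbitrary example values.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // The driver wraps the mapper so it can be exercised without a cluster.
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOneCountPerWord() throws Exception {
    mapDriver
        .withInput(new LongWritable(0), new Text("hadoop streams hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("streams"), new IntWritable(1))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .runTest();   // fails the test if the actual output differs
  }
}
```
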
-------test data and local tests----

Testing is an important part of any big data analysis project, as it helps to ensure that the data processing pipeline is functioning correctly and producing accurate results. There are two primary types of testing that are commonly used in big data analysis: test data and local tests.

Test data is a small set of sample data that is used to test the processing pipeline. This data should be representative of the actual data that will be processed, but small enough that it can be easily managed and run on a local machine. Test data is typically used to validate the correctness of the data processing pipeline and to identify any bugs or errors in the code.

Local tests involve running the data processing pipeline on a local machine or in a simulated environment, rather than on a full-scale cluster. Local testing is useful for validating the behaviour of the pipeline, including its speed and resource usage, and it allows developers to identify and fix issues before deploying the code to a production environment.

To use test data and local tests effectively, it's important to have a well-defined testing strategy that includes a range of test cases, from simple unit tests to more complex integration and system tests. Test cases should be designed to validate the different components of the data processing pipeline, including data ingestion, processing, and output.
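
One common way to run a local test is to point the Hadoop client at the local job runner and the local file system rather than a cluster. Below is a minimal sketch, assuming the illustrative word-count classes from earlier; the test-data directories are example paths.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalPipelineTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run map and reduce tasks in-process instead of on a YARN cluster.
    conf.set("mapreduce.framework.name", "local");
    // Read and write the local file system instead of HDFS.
    conf.set("fs.defaultFS", "file:///");

    Job job = Job.getInstance(conf, "local test run");
    job.setJarByClass(LocalPipelineTest.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("test-data/input"));
    FileOutputFormat.setOutputPath(job, new Path("test-data/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```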

It's also important to use testing frameworks and tools that are designed for big data analysis, such as Hadoop's built-in mini-cluster test utilities or third-party libraries like MRUnit and spark-testing-base for Apache Spark. These tools provide a range of features and utilities for setting up test data, running tests, and analyzing results.

By using test data and local tests in big data analysis, developers can ensure that their data processing pipelines are functioning correctly and producing accurate results, which is essential for making informed business decisions and gaining valuable insights from big data.

--------anatomy of a MapReduce job run-------

A MapReduce job consists of a set of map tasks and reduce tasks that are executed in parallel across a cluster of nodes. Here's a step-by-step breakdown of how a MapReduce job runs:

Input data is split into small chunks called InputSplits. Each InputSplit is processed by a single map task.

Map tasks are scheduled to run on available worker nodes in the cluster. The number of map tasks is determined by the number of InputSplits, which in turn depends on the size of the input data and the configured split size.
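
For file-based input, the split size (and therefore the number of map tasks) can be influenced from the driver. A sketch using FileInputFormat's split-size helpers; the 64 MB and 128 MB bounds are chosen arbitrarily for illustration.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
  // Bound each input split between 64 MB and 128 MB, which in turn
  // bounds how many map tasks the input will be divided into.
  public static void configureSplits(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
  }
}
```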

For each map task, the MapReduce framework loads the corresponding InputSplit from the distributed file system and passes its records to the map function defined by the user.

The map function processes the input data and generates intermediate key-value pairs.

The intermediate key-value pairs are grouped by key and shuffled across the cluster. This allows all values with the same key to be processed by the same reduce task.

Reduce tasks are scheduled to run on available worker nodes in the cluster. The number of reduce tasks is set in the job configuration rather than being derived from the data; a partitioner decides which keys are routed to which reduce task.
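
The reduce-task count itself is a single driver setting; a small sketch (the value 8 is arbitrary).

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
  // The number of reduce tasks is a job setting, chosen here arbitrarily.
  public static void configure(Job job) {
    job.setNumReduceTasks(8);
  }
}
```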

For each reduce task, the MapReduce framework fetches the intermediate key-value pairs assigned to that task from the nodes where the map tasks ran (intermediate data lives on the mappers' local disks rather than in the distributed file system) and passes them to the reduce function defined by the user.

The reduce function processes the intermediate data for each key and generates final key-value pairs, which are written to the output file system.

Once all map and reduce tasks have completed, the MapReduce job is considered finished.

Throughout the job run, the MapReduce framework monitors the progress of each task and handles failures and retries as necessary. It also provides utilities for managing data distribution, task scheduling, and intermediate data storage, which allow the job to be executed efficiently and reliably across a large-scale cluster.

By breaking down the input data into smaller chunks and processing them in parallel, MapReduce allows for the efficient processing of large-scale data sets, making it a powerful tool for big data analysis.

------classic Map-reduce-----

Classic MapReduce is a programming model and software framework originally introduced by Google for distributed computing on large data sets. It is designed to handle large-scale data processing tasks by dividing them into smaller, independent tasks that can be processed in parallel across a distributed network of computers.

The Classic MapReduce programming model consists of two primary operations: the map operation and the reduce operation.

Map operation: The map operation takes a set of input key-value pairs and processes them to generate intermediate key-value pairs. The input data is divided into small chunks, and a separate map task is assigned to process each chunk independently. The map operation applies a user-defined function to each input key-value pair and generates one or more intermediate key-value pairs, which are then grouped by key and shuffled across the network.

Reduce operation: The reduce operation takes the intermediate key-value pairs generated by the map operation and processes them to generate final output key-value pairs. It applies a user-defined function to each group of intermediate key-value pairs with the same key and generates one or more output key-value pairs.
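
These signatures are commonly summarized as map: (k1, v1) -> list(k2, v2) and reduce: (k2, list(v2)) -> list(k3, v3). In Hadoop's Java API the same shape appears in the generic parameters of the Mapper and Reducer base classes; a sketch with placeholder type names chosen for illustration.

```java
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Placeholder type parameters: K1/V1 are the input key-value types,
// K2/V2 the intermediate types, and K3/V3 the final output types.
public abstract class MyMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
  // map(K1, V1) -> zero or more (K2, V2) pairs, emitted via context.write(...)
}

abstract class MyReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
  // reduce(K2, Iterable<V2>) -> zero or more (K3, V3) pairs
}
```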

The Classic MapReduce framework provides a distributed infrastructure for running MapReduce jobs across a cluster of computers. The framework manages the coordination between map and reduce tasks, as well as the storage and retrieval of intermediate data. It also provides fault tolerance and resource management capabilities to ensure that jobs complete successfully, even in the face of hardware failures or other issues.

The Classic MapReduce model has been implemented in various distributed computing systems, including Apache Hadoop, a popular open-source implementation of MapReduce. Hadoop provides a set of distributed computing tools and libraries, including the Hadoop Distributed File System (HDFS) for storage and the YARN (Yet Another Resource Negotiator) resource manager for job scheduling and coordination.

Overall, Classic MapReduce has been widely adopted as a powerful tool for large-scale data processing and is used in a variety of applications, including web search, machine learning, and data analysis.

-------YARN----

YARN (Yet Another Resource Negotiator) is a cluster management technology that was introduced as part of Apache Hadoop 2.0. YARN is responsible for resource management and job scheduling in Hadoop clusters, making it a critical component of the Hadoop ecosystem.

YARN allows Hadoop to support a wider range of distributed computing tasks beyond just MapReduce, such as graph processing, stream processing, and interactive queries. It achieves this by separating the resource management and job scheduling functions from the MapReduce programming model, allowing other processing models to use the same resources and scheduling mechanisms.

YARN consists of a ResourceManager and NodeManagers. The ResourceManager is responsible for managing the allocation of resources to the applications running on the cluster, and the NodeManagers are responsible for managing individual nodes in the cluster and executing tasks.

When a new job is submitted to the cluster, the ResourceManager negotiates with the NodeManagers to allocate resources for the job. It then provides the application with containers, each bundling the CPU, memory, and other resources needed to execute part of the job, and the application is free to run its tasks in these containers.
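
For a MapReduce application running on YARN, the container sizes it asks for are ordinary job settings; a minimal sketch, with memory values chosen purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnContainerSizing {
  // Request container sizes for the map tasks, reduce tasks, and the
  // MapReduce ApplicationMaster; the values are illustrative only.
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);
    return conf;
  }
}
```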

YARN also supports dynamic allocation of resources, which means that resources can be allocated and released on demand based on the workload. This allows for better resource utilization and improves overall cluster efficiency.

In summary, YARN is a powerful cluster management technology that allows Hadoop to support a wide range of distributed computing tasks beyond MapReduce. It provides resource management and job scheduling capabilities that are critical to the efficient operation of large-scale Hadoop clusters.

-------failures in classic Map-reduce and YARN----

Failures can occur in Classic MapReduce and YARN for various reasons, such as hardware failures, network failures, software bugs, and resource contention. Both Classic MapReduce and YARN have built-in mechanisms to handle failures and ensure that jobs complete successfully.

In Classic MapReduce, each task is executed on a node in the cluster, and the intermediate output of each map task is stored on the local disk of that node. This intermediate output is not replicated; if a node fails before its output has been fetched, the affected tasks are simply re-executed on another node. The JobTracker keeps track of the progress of each task and reassigns tasks that fail (or whose node is lost) to other nodes to ensure that the job completes successfully, while the job's final output written to HDFS is protected by the file system's own replication.

In YARN, the ResourceManager and NodeManagers have built-in mechanisms to handle failures. The ResourceManager monitors the health of NodeManagers and detects when a node fails or becomes unavailable; it then reallocates the containers that were running on the failed node to other nodes in the cluster. The NodeManagers in turn monitor the health of the containers running on their node and detect when a container fails or becomes unresponsive, reporting this so that the application (through its ApplicationMaster) can request a replacement container from the ResourceManager.
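
The number of automatic retries is configurable; a small sketch of the relevant MapReduce settings (the values shown are illustrative).

```java
import org.apache.hadoop.conf.Configuration;

public class RetrySettings {
  // How many times a failed task or ApplicationMaster is retried before
  // the job as a whole is failed; the values here are illustrative.
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
    conf.setInt("mapreduce.am.max-attempts", 2);
    return conf;
  }
}
```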

----------- job scheduling---------------

Job scheduling is a critical component of any Hadoop cluster. It involves managing the allocation of resources to different jobs and ensuring that jobs are executed in a timely and efficient manner. In Hadoop, job scheduling is typically done by the resource manager, which is responsible for managing the available resources in the cluster and allocating them to running jobs.

The resource manager in Hadoop uses a scheduling algorithm to determine which jobs get access to the available resources and when. Several different scheduling algorithms can be used, including:

First-come, first-served (FCFS): This is a simple scheduling algorithm where jobs are executed in the order in which they are received. Jobs are executed one after the other, with each job running to completion before the next one starts.

Fair scheduling: This algorithm allocates resources to jobs based on the concept of fairness. Jobs are given equal access to resources, with each job getting an equal share of the resources over time. This helps to ensure that no single job monopolizes the resources of the cluster, and that all jobs are given a fair chance to complete in a timely manner.

Capacity scheduling: This algorithm divides the available resources of the cluster into different pools, each with its own set of resources. Jobs are assigned to these pools based on their resource requirements, and each pool is allocated a fixed amount of resources. This helps to ensure that resources are allocated efficiently and that jobs are given the resources they need to complete.
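
From the client side, a job simply names the queue (or pool) it should be scheduled into; a minimal sketch, where the queue name "analytics" is an arbitrary example that would have to exist in the cluster's scheduler configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSelection {
  // Submit the job to a specific scheduler queue; "analytics" is an
  // example queue name, not a default that exists on every cluster.
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.job.queuename", "analytics");
    return Job.getInstance(conf, "scheduled job");
  }
}
```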

--------shuffle and sort---------

Shuffle and sort are two important phases in the MapReduce framework that move data between the map and reduce phases of a MapReduce job.

Shuffle: The shuffle phase involves moving the output of the map phase to the input of the reduce phase. During the shuffle phase, the MapReduce framework sorts the key-value pairs produced by the map phase and groups them by key. The output of this process is a set of partitions, where each partition contains all the key-value pairs destined for one reduce task.

Sort: The sort phase involves sorting the key-value pairs within each partition produced during the shuffle phase. This is necessary because the reduce phase processes the data one partition at a time, and the data within each partition needs to be sorted to ensure that the reduce function processes the data in the correct order.

The shuffle and sort phases are critical to the performance of a MapReduce job because they involve moving a large amount of data between nodes in the cluster. To minimize data movement, the MapReduce framework schedules map tasks on nodes that already hold a copy of their input split (data locality), while each reduce task pulls its share of the intermediate data from all of the map tasks, transferring it across the network where necessary.

To optimize the performance of shuffle and sort, several techniques can be used, such as compressing the map output to reduce the amount of data transferred across the network, using combiners to perform partial aggregation before the shuffle phase, and using partitioners to ensure that the data is evenly distributed across the reduce tasks.
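
Two of these optimizations can be switched on directly from the driver: a combiner and a custom partitioner, plus map-output compression. A sketch, reusing the illustrative WordCountReducer as the combiner (valid here because summing counts is associative and commutative).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShuffleTuning {

  // A simple hash partitioner: decides which reduce task (partition)
  // receives each intermediate key.
  public static class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  public static void configure(Job job) {
    // Combiner performs partial aggregation on the map side,
    // shrinking the data that has to be shuffled.
    job.setCombinerClass(WordCountReducer.class);
    job.setPartitionerClass(WordPartitioner.class);
    // Compress map output before it is sent across the network.
    job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
  }
}
```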

-----------task execution----------

In Hadoop, task execution refers to the process of running the map and reduce tasks of a MapReduce job on the available nodes in the cluster. In classic MapReduce this process is managed by the TaskTracker, which is responsible for running the tasks assigned to it by the JobTracker.

The process of task execution typically involves the following steps:

Task scheduling: The JobTracker, which manages the MapReduce job as a whole, assigns tasks to TaskTrackers across the available nodes in the cluster; each TaskTracker then runs its assigned tasks in the task slots available on its node.

Task initialization: Once a task has been scheduled, the TaskTracker initializes the necessary resources and dependencies required for the task to execute. This includes loading the input data, initializing any required libraries or frameworks, and setting up the environment for the task.

Task execution: The TaskTracker then executes the task by running the appropriate code for the map or reduce phase of the MapReduce job. The task processes the input data, performs any necessary calculations or transformations, and produces output data as required.

Task completion: Once a task has completed its execution, the TaskTracker reports the results back to the JobTracker, which is responsible for managing the overall progress of the MapReduce job. The TaskTracker then frees up the resources used by the completed task and prepares for the next task assignment.

During task execution, several factors can impact the performance and efficiency of the MapReduce job, such as the size of the input data, the complexity of the processing tasks, the availability of resources in the cluster, and the scheduling algorithm used by the JobTracker. By optimizing these factors, the MapReduce framework can efficiently execute the tasks of a MapReduce job, minimizing the execution time and improving the overall performance of the job.
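
A client can watch task execution from outside the cluster through the Job API; a minimal sketch that polls a submitted job (the polling interval is arbitrary).

```java
import org.apache.hadoop.mapreduce.Job;

public class ProgressWatcher {
  // Poll a submitted job and print how far its map and reduce tasks have got.
  public static void watch(Job job) throws Exception {
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);   // arbitrary polling interval
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}
```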

-------input formats, output formats-------------

In big data environments, input formats and output formats specify how the input and output data of a MapReduce job are formatted and processed by the MapReduce framework. These formats define how the input and output data are partitioned, sorted, and serialized, and how they are read from and written to the file system or other storage systems used by the MapReduce job.

Input Formats:
There are several input formats available in Hadoop, including:

TextInputFormat: This is the default input format used by Hadoop. It reads data line by line and converts each line into a key-value pair, where the key is the byte offset of the line in the input file and the value is the content of the line.

KeyValueTextInputFormat: This input format reads each line as a key-value pair, where the key and value are separated by a delimiter (a tab by default).

SequenceFileInputFormat: This input format reads data from binary sequence files, which are Hadoop-specific file formats used to store binary data as key-value pairs.

Output Formats:
Similarly, there are several output formats available in Hadoop, including:

TextOutputFormat: This is the default output format used by Hadoop. It writes data to the file system as text, with each line containing a key-value pair separated by a delimiter (a tab by default; the separator is configurable).

SequenceFileOutputFormat: This output format writes data to binary sequence files, which are Hadoop-specific file formats used to store binary data as key-value pairs.
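
Choosing the formats is part of the driver configuration; a sketch of a driver method that reads delimiter-separated text and writes a SequenceFile (the paths and key/value types are examples).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatSelection {
  public static void configure(Job job) throws Exception {
    // Read each input line as a (key, value) pair split on a tab.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/kv-input"));

    // Write the results as a binary SequenceFile of (Text, IntWritable).
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/seq-output"));
  }
}
```
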
By specifying the appropriate input and output formats, a MapReduce job can efficiently process data in various file formats and storage systems, such as HDFS, Amazon S3, and HBase. This enables the MapReduce framework to work with a wide range of data sources and provides a flexible and scalable platform for big data processing.

-------------end------------
