
BIG DATA ANALYTICS

UNIT - III
Introduction to Hadoop: History of Hadoop, the Hadoop Distributed File System, Components of
Hadoop, Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming, Design of HDFS,
Java Interfaces to HDFS Basics, Developing a Map Reduce Application, How Map Reduce Works,
Anatomy of a Map Reduce Job Run, Failures, Job Scheduling, Shuffle and Sort, Task Execution,
Map Reduce Types and Formats, Map Reduce Features, Hadoop Environment.

Introduction to Hadoop
Hadoop is an open-source software framework for distributed storage and processing of large
datasets across clusters of computers. It is designed to handle big data that is too large and
complex to be processed by traditional computing systems. Hadoop provides a distributed file
system called Hadoop Distributed File System (HDFS) and a distributed processing system
called MapReduce.
HDFS is a highly fault-tolerant file system that provides reliable storage of large files across
multiple machines. It stores data in blocks and replicates them across multiple nodes in a
cluster, ensuring that data is always available even if one or more nodes fail.
MapReduce is a programming model for processing large datasets in a distributed
environment. It breaks down large tasks into smaller sub-tasks, distributes them across
multiple nodes in a cluster, and combines the results into a single output. MapReduce makes
it possible to process large datasets in parallel, significantly reducing the time required for data
processing.
Hadoop also includes other components such as YARN (Yet Another Resource Negotiator),
which manages resources and schedules tasks in a Hadoop cluster, and HBase, a NoSQL
database that runs on top of HDFS and provides real-time access to large datasets.
Overall, Hadoop provides a powerful platform for handling big data and has become an
essential tool for data processing and analysis in various industries, including finance,
healthcare, retail, and telecommunications.

History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, initially as a project to
support the Nutch search engine. The project was named after Doug Cutting's son's toy
elephant.
The original Hadoop project consisted of HDFS and MapReduce, which were inspired by
Google's Google File System and MapReduce, respectively. The goal of the project was to
provide a scalable and fault-tolerant platform for storing and processing large amounts of data.

Hadoop was released as an open-source Apache project in 2006 and quickly gained popularity
among developers and data scientists who needed a tool for handling big data. Yahoo! was an
early and major contributor, and in 2008 Hadoop became a top-level Apache project, with its
development driven by a growing community.
In 2009, Hadoop 0.20 was released, bringing significant improvements to HDFS and
MapReduce and marking a major milestone in the project's development. Later releases in the
Hadoop 2.x line added HDFS Federation (support for multiple NameNodes) and NameNode High
Availability, making the platform considerably more reliable and scalable.
In the following years, Hadoop continued to evolve and new components such as YARN and
HBase were added to the Hadoop ecosystem. Hadoop became the de facto standard for big
data processing and analysis, and many companies, including Facebook, Twitter, and
LinkedIn, adopted Hadoop for their data processing needs.
Today, Hadoop is maintained by the Apache Software Foundation, and new features and
improvements are added to the platform regularly by a large community of developers and
users.

The Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system designed to store and
manage large datasets across multiple machines in a Hadoop cluster. HDFS is one of the key
components of the Hadoop ecosystem and provides reliable and fault-tolerant storage for big
data.
HDFS uses a master/slave architecture, where the NameNode acts as the master and manages
the file system namespace and access control, while the DataNodes act as slaves and store the
actual data blocks.
Files in HDFS are split into blocks of fixed size (typically 128 MB or 256 MB) and replicated
across multiple DataNodes in the cluster. The replication factor is configurable and determines
the number of copies of each block that are stored in the cluster. By default, HDFS replicates
each block three times for fault tolerance.
HDFS provides high throughput data access, which is achieved by streaming data from disk
and minimizing seeks. It also supports parallel access to data, allowing multiple clients to read
and write data simultaneously. HDFS supports various access methods, including Hadoop
APIs, command-line tools, and web interfaces.
HDFS is fault-tolerant and can handle failures of individual nodes without data loss. When a
DataNode fails, HDFS automatically replicates the missing blocks to other nodes in the cluster.
If the NameNode fails, however, the file system becomes unavailable until the NameNode is
restored, unless NameNode High Availability is configured, in which case a standby NameNode
takes over.
Overall, HDFS provides scalable and reliable storage for big data and is a critical component
of the Hadoop ecosystem.

Components of Hadoop and Analysing the Data with Hadoop
Components of Hadoop
Hadoop is a distributed system and comprises several components that work together to
provide a scalable and fault-tolerant platform for storing, processing, and analyzing big data.
Some of the key components of Hadoop include:
1. Hadoop Distributed File System (HDFS): HDFS is a distributed file system that
provides scalable and reliable storage for large datasets across a cluster of machines.
2. Yet Another Resource Negotiator (YARN): YARN is a resource management
framework that manages resources and schedules tasks across the Hadoop cluster.
3. MapReduce: MapReduce is a distributed processing framework for processing large
datasets in parallel across the Hadoop cluster.
4. HBase: HBase is a distributed NoSQL database that runs on top of HDFS and provides
real-time access to large datasets.
5. Pig: Pig is a high-level platform for creating MapReduce programs used to analyze large
datasets. It provides a simple language called Pig Latin to express data analysis tasks.
6. Hive: Hive is a data warehousing tool that provides a SQL-like query language called
HiveQL for analyzing data stored in Hadoop.
7. Spark: Spark is a distributed processing framework that can run on top of Hadoop and
provides faster in-memory data processing capabilities.
8. ZooKeeper: ZooKeeper is a distributed coordination service that provides
synchronization and configuration management services for distributed applications
running on the Hadoop cluster.
9. Oozie: Oozie is a workflow scheduling system that allows users to define and execute
complex workflows of Hadoop jobs.
10. Flume: Flume is a distributed data collection and aggregation system that allows users
to ingest and process streaming data in real time.
Overall, Hadoop provides a comprehensive platform for storing, processing, and analyzing
large datasets, making it a popular choice for big data analytics in various industries.

Analysing the Data with Hadoop


Analyzing data with Hadoop typically involves the following steps:
1. Storing data: The first step is to store data in HDFS, which is the primary storage
system in Hadoop. HDFS provides a distributed and fault-tolerant file system for storing
large volumes of data.
2. Pre-processing data: The next step is to pre-process the data to make it suitable for
analysis. This involves cleaning, transforming, and formatting the data. Various tools
such as Apache Pig and Apache Hive can be used for this purpose.
3. Processing data: Once the data is pre-processed, it can be analyzed using Hadoop's
distributed processing frameworks such as MapReduce and Apache Spark. These
frameworks provide distributed processing capabilities for analyzing large volumes of
data in parallel.
4. Analyzing data: Once the data has been processed, it can be analyzed using tools such
as Apache Hive and Apache Pig. These tools provide high-level abstractions for data
analysis and can be used to run complex queries on the data.
5. Visualizing data: Finally, the analyzed data can be visualized using tools such as
Apache Zeppelin, Tableau, and D3.js. These tools provide interactive visualizations of
the data, making it easier to derive insights and make data-driven decisions.
Overall, Hadoop provides a powerful platform for analyzing large volumes of data. Its
distributed processing capabilities, combined with a range of tools and frameworks, make it
an ideal choice for big data analytics in various industries.

Scaling Out
Scaling out is the process of increasing the capacity of a system by adding more resources. In
the context of Hadoop, scaling out refers to adding more nodes to the Hadoop cluster to
increase its processing capacity and storage capacity.
Hadoop is designed to be highly scalable, and it can handle data and processing requirements
that exceed the capacity of a single machine. By adding more nodes to the cluster, Hadoop can
distribute the workload across multiple machines, enabling it to process and store large
volumes of data.
Scaling out in Hadoop involves adding more nodes to the cluster, configuring them to work
together, and distributing data and processing tasks across the nodes. This requires a well-
designed architecture that takes into account factors such as data partitioning, load balancing,
and fault tolerance.
There are several benefits to scaling out in Hadoop, including:
1. Increased processing capacity: By adding more nodes to the cluster, Hadoop can
handle larger volumes of data and process them more quickly.
2. Improved fault tolerance: With more nodes in the cluster, Hadoop can provide better
fault tolerance, as the data and processing tasks can be replicated across multiple nodes.
3. Lower costs: Scaling out in Hadoop is generally more cost-effective than scaling up
(adding more resources to a single machine), as it allows organizations to use
commodity hardware rather than expensive high-end servers.
4. Greater flexibility: Scaling out in Hadoop provides greater flexibility, as organizations
can add or remove nodes from the cluster as needed to meet changing data and
processing requirements.
Overall, scaling out is an essential part of using Hadoop to handle big data, as it enables
organizations to process and store large volumes of data efficiently and cost-effectively.

Hadoop Streaming
Hadoop Streaming is a utility in Hadoop that enables users to write MapReduce programs in
languages other than Java, such as Python, Ruby, Perl, and Bash. It allows data to be processed
using scripts that can read and write data to standard input and output streams.
Hadoop Streaming works by feeding each input record to a Map script on its standard input and
reading the intermediate key-value pairs (tab-separated lines of text, by default) that the
script writes to its standard output. These intermediate pairs are sorted and grouped by key
and passed to a Reduce script, which processes them and produces the final output.
Hadoop Streaming provides a way for users to leverage existing scripts and tools that they
may have developed for data processing, without having to write Java code. This makes it
easier for users who are not familiar with Java to use Hadoop for data processing.
Some of the benefits of using Hadoop Streaming include:
1. Language flexibility: Hadoop Streaming allows users to write MapReduce programs
in languages other than Java, making it easier for users to work with Hadoop using their
preferred language.
2. Reusability: Hadoop Streaming enables users to reuse existing scripts and tools for data
processing, which can save time and effort.
3. Scalability: Hadoop Streaming works with the Hadoop distributed processing
framework, which means that it can scale to handle large volumes of data.
4. Lower development costs: By allowing users to write MapReduce programs in
languages other than Java, Hadoop Streaming can reduce the development costs
associated with building MapReduce programs from scratch.
Overall, Hadoop Streaming is a powerful tool that enables users to leverage existing scripts
and tools for data processing in a Hadoop environment. It provides language flexibility,
reusability, scalability, and lower development costs, making it an attractive option for data
processing in Hadoop.

Design of HDFS
Hadoop Distributed File System (HDFS) is the primary storage system used in Hadoop. It is
designed to be highly scalable, fault-tolerant, and efficient in handling large volumes of data.
The design of HDFS is based on the following key principles:
1. Data locality: HDFS is designed to store large files in a distributed manner across
multiple nodes in a cluster. To improve performance, Hadoop schedules computation on the
nodes where the data already resides, which minimizes network traffic and reduces latency.
2. Replication: HDFS stores multiple copies of each block, known as replicas, across
different nodes in the cluster. This ensures that data remains available even if a node
or disk fails. By default, HDFS stores three replicas of each block.

3. Block storage: HDFS divides large files into smaller blocks, typically 128 MB or 256
MB in size, and distributes the blocks across the nodes of the cluster. This improves
efficiency in handling large files and allows data to be processed in parallel.
4. Namenode and Datanode architecture: HDFS consists of two types of nodes - the
Namenode and the Datanodes. The Namenode is the central node that manages the file
system namespace and coordinates access to data stored in the cluster. The Datanodes
store the data and are responsible for serving data to clients.
5. Checkpointing and journaling: HDFS periodically checkpoints the namespace image
and logs changes to a journal. This ensures that in case of a Namenode failure, the
system can be quickly restored to a consistent state.
6. Data integrity: HDFS ensures data integrity by using checksums for data stored in the
cluster. Checksums are computed at the time of writing data to the cluster and verified
when data is read from the cluster.
Overall, the design of HDFS is optimized for storing and processing large volumes of data
efficiently and reliably in a distributed environment. Its use of replication, block storage, and
data locality ensures that data is always available, even in the event of node or disk failures.
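The block and replication model described above can also be inspected programmatically. The following is a minimal sketch using the Hadoop Java FileSystem API; the path /user/data/sample.txt is a hypothetical example, and the Configuration is assumed to point at a running HDFS cluster via the usual core-site.xml and hdfs-site.xml files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/data/sample.txt");  // hypothetical HDFS path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size : " + status.getBlockSize());
    System.out.println("Replication: " + status.getReplication());

    // Each BlockLocation lists the DataNodes holding the replicas of one block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}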

Java interfaces to HDFS Basics


Hadoop Distributed File System (HDFS) provides a Java API that allows applications to
interact with HDFS programmatically. The HDFS API consists of several interfaces that
define the methods for accessing and manipulating data stored in HDFS.
1. FileSystem interface: The FileSystem interface is the primary interface for accessing
HDFS. It defines methods for creating, reading, and writing files in HDFS, as well as
for managing permissions and directory structures.
2. Path interface: The Path interface represents a file or directory path in HDFS. It defines
methods for resolving paths and for creating new paths based on existing ones.
3. FSDataInputStream and FSDataOutputStream interfaces: These interfaces provide
methods for reading and writing data to files in HDFS, respectively. They are used in
conjunction with the FileSystem interface.
4. FSNamesystem interface: The FSNamesystem interface is used by the Namenode to
manage the file system namespace and to track data blocks and their locations.
5. DatanodeProtocol interface: The DatanodeProtocol interface is used by Datanodes to
communicate with the Namenode and to report block locations and storage information.
6. ClientProtocol interface: The ClientProtocol interface is used by clients to
communicate with the Namenode and to perform operations such as opening, closing,
and deleting files.
7. LocatedFileStatus and RemoteIterator interfaces: These interfaces are used to
retrieve information about files and directories in HDFS, including their names, sizes,
and locations.
Overall, the Java interfaces to HDFS provide a comprehensive set of methods for interacting
with HDFS programmatically. They enable developers to create custom applications that can
read and write data to HDFS, manage file system permissions and structures, and perform
other operations related to HDFS.
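A minimal sketch of how some of these classes fit together is shown below: it writes a small file to HDFS and reads it back through the FileSystem, Path, FSDataOutputStream, and FSDataInputStream APIs. The path and the message written are illustrative choices, and the configuration is assumed to point at a running cluster (or local mode).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/data/hello.txt");   // hypothetical path

    // Write: create() returns an FSDataOutputStream.
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.writeUTF("Hello, HDFS");
    }

    // Read: open() returns an FSDataInputStream.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}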

Developing a Map Reduce Application


Developing a MapReduce application involves several steps, including:
1. Defining the input and output formats: The first step is to define the input and output
formats for the MapReduce job. This involves specifying the file format, data schema,
and any other relevant properties.
2. Implementing the mapper function: The mapper function is responsible for
processing the input data and emitting key-value pairs that will be used as input to the
reducer function. The mapper function should be designed to be computationally
lightweight and to produce a large number of key-value pairs.
3. Implementing the reducer function: The reducer function receives key-value pairs
from the mapper function and processes them to produce the final output. The reducer
function should be designed to be computationally intensive and to produce a small
number of output values.
4. Configuring the MapReduce job: The MapReduce job must be configured with the
appropriate input and output formats, as well as any other relevant properties such as
the number of reducers to use.
5. Testing and debugging the application: Once the MapReduce application has been
implemented and configured, it should be tested and debugged to ensure that it is
working correctly. This may involve running the application on a small data set and
examining the output to verify that it is correct.
6. Running the application: Once the application has been tested and debugged, it can be
run on the full data set. This may involve running the application on a Hadoop cluster,
or on a single machine using Hadoop's local mode.
7. Monitoring and optimizing performance: During and after the application is run, it
is important to monitor its performance and to optimize it if necessary. This may involve
tweaking the configuration of the MapReduce job, adjusting the number of mappers or
reducers, or optimizing the algorithms used in the mapper and reducer functions.
Overall, developing a MapReduce application involves several steps, each of which must be
carefully executed to ensure that the application performs correctly and efficiently.
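As a concrete illustration of these steps, the following is a minimal word-count application written against the Hadoop Java MapReduce API. It is a sketch rather than a production job: the class names are illustrative, the combiner is an optional optimization, and the input and output paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job (step 4 above).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After packaging the classes into a jar, the job could first be tried on a small input with the hadoop jar command (jar name, main class, and paths supplied as arguments), matching the testing and scaling-up steps described above.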

How Map Reduce Works


MapReduce is a programming model and framework for processing large datasets in a
distributed computing environment. It works by breaking down a large computation into
smaller, independent tasks that can be executed in parallel across a cluster of computers.
The MapReduce programming model consists of two phases:

1. Map Phase: In the map phase, the input data is divided into smaller chunks and
processed independently by a large number of mappers. Each mapper takes a subset of
the input data and performs a computation on it to produce a set of key-value pairs. The
key-value pairs produced by the mappers are then sorted and partitioned based on their
keys.
2. Reduce Phase: In the reduce phase, the key-value pairs produced by the mappers are
combined based on their keys to produce a final output. The reducers take the sorted
and partitioned key-value pairs produced by the mappers and perform a computation on
them to produce a set of output key-value pairs. The output of the reduce phase is
typically stored in a file or database.
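For example, in a word-count job a mapper that reads the line "to be or not to be" emits the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); after the shuffle, the reducer responsible for the key "to" receives (to, [1, 1]) and emits (to, 2).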
The MapReduce framework provides several key features that enable it to efficiently process
large datasets:
1. Data partitioning: Input data is divided into smaller chunks that can be processed
independently by mappers.
2. Parallel processing: The MapReduce framework can execute mappers and reducers in
parallel across a cluster of computers, enabling it to process large datasets quickly.
3. Fault tolerance: If a node in the cluster fails, the MapReduce framework automatically
redistributes the work to other nodes to ensure that the computation can continue.
4. Data locality: When possible, the MapReduce framework tries to execute mappers and
reducers on the same nodes where the input data is stored, reducing network traffic and
improving performance.
Overall, the MapReduce framework provides a powerful tool for processing large datasets in
a distributed computing environment. By breaking down computations into smaller,
independent tasks and executing them in parallel across a cluster of computers, MapReduce
enables efficient processing of large datasets at scale.

Anatomy of a Map Reduce Job run


A MapReduce job typically consists of the following components:
1. Input data: The input data for a MapReduce job can be stored in a variety of formats,
including text files, sequence files, and HBase tables. The input data is divided into
smaller chunks, which are processed independently by the mappers.
2. Mapper: The mapper function takes input data and processes it to produce a set of
intermediate key-value pairs. The mapper is designed to be computationally lightweight
and to produce a large number of key-value pairs.
3. Partitioner: The partitioner function determines which reducer will receive each key-
value pair produced by the mappers. By default, Hadoop uses a hash-based partitioner
that distributes the keys evenly among the reducers (a minimal custom partitioner sketch
appears after this list).
4. Sort and shuffle: The intermediate key-value pairs produced by the mappers are sorted
and partitioned based on their keys. The sorting ensures that all key-value pairs with the
same key are grouped together, while the partitioning ensures that key-value pairs with
the same key are sent to the same reducer.
5. Reducer: The reducer function takes the intermediate key-value pairs produced by the
mappers and processes them to produce a final set of output key-value pairs. The reducer
is designed to be computationally intensive and to produce a small number of output
values.
6. Output data: The output data for a MapReduce job can be stored in a variety of formats,
including text files, sequence files, and HBase tables. The output data is typically stored
in a distributed file system, such as HDFS.
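The following is a minimal sketch of a custom partitioner, referred to in the Partitioner component above. Routing keys by their first character is an illustrative choice rather than a recommended strategy; the comment shows the behaviour of Hadoop's default HashPartitioner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (Text, IntWritable) pair to a reducer.
// Hadoop's default HashPartitioner effectively computes:
//   (key.hashCode() & Integer.MAX_VALUE) % numPartitions
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.toString().isEmpty()) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    return first % numPartitions;   // illustrative: partition by first character
  }
}

Such a class would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).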
During the execution of a MapReduce job, several steps are performed:
1. Job submission: The user submits the MapReduce job to the Hadoop cluster.
2. Job initialization: The Hadoop framework initializes the job by setting up the necessary
configuration and creating the required job-related files.
3. Mapper execution: The mappers read input data and process it to produce intermediate
key-value pairs.
4. Sort and shuffle: The intermediate key-value pairs produced by the mappers are sorted
and partitioned based on their keys.
5. Reducer execution: The reducers process the intermediate key-value pairs to produce
the final output.
6. Output writing: The output data is written to the output directory specified by the user.
7. Job completion: Once the MapReduce job is complete, the Hadoop framework reports
the job status and any errors or exceptions that occurred during the execution.
Overall, the anatomy of a MapReduce job run involves several steps, each of which is carefully
orchestrated by the Hadoop framework to efficiently process large datasets in a distributed
computing environment.

Failures
Failures are common in distributed computing environments like Hadoop clusters due to
hardware failures, network issues, software bugs, or other unforeseeable events. Hadoop
provides several mechanisms to handle failures and ensure the reliability of data processing.
1. Task retries: If a task fails, Hadoop re-executes it, typically on a different node,
until it succeeds or reaches the maximum number of attempts. By default, each task is
attempted up to four times before the job is marked as failed.
2. Speculative execution: When a task runs much more slowly than expected, Hadoop can
launch a duplicate copy of it on another node so that the job is not held up by a
straggler. Whichever copy completes first is used, and the others are killed.
3. Data replication: Hadoop replicates data across multiple nodes in the cluster to ensure
that data is available even if a node fails. By default, Hadoop replicates each data block
three times across different nodes.

4. NameNode High Availability (HA): Hadoop provides High Availability for the NameNode.
In HA mode, an active NameNode and one or more standby NameNodes run simultaneously,
allowing automatic failover if the active NameNode fails. DataNodes do not need a
separate HA mechanism, because block replication already provides redundancy for the
data they hold.
5. Heartbeats and timeouts: Hadoop uses a heartbeat mechanism to monitor the health
of nodes in the cluster. Nodes periodically send heartbeat signals to the master,
indicating that they are still alive. If a node fails to send a heartbeat within a certain time
period, it is marked as dead and its tasks are re-scheduled on other nodes.
6. Checkpointing: Hadoop periodically checkpoints the state of the NameNode to disk to
ensure that metadata is not lost in case of a failure.
Overall, Hadoop provides several mechanisms to handle failures and ensure that data
processing is reliable and fault-tolerant. By replicating data, retrying failed tasks, using
speculative execution, and providing HA support for critical services, Hadoop enables large-
scale data processing in distributed computing environments with high reliability and
availability.
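Several of these behaviours can be tuned per job. The sketch below assumes the standard Hadoop 2.x property names for retry limits, speculative execution, and replication; the particular values shown are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Maximum number of attempts per task (the default is 4).
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);

    // Enable or disable speculative execution of slow tasks.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", false);

    // HDFS replication factor applied to the job's output files.
    conf.setInt("dfs.replication", 3);

    Job job = Job.getInstance(conf, "failure-tuning example");
    // ... set mapper, reducer, input and output paths as in a normal job ...
  }
}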

Job Scheduling
Job scheduling is an important aspect of Hadoop cluster management, as it enables efficient
resource utilization and improved job performance. Hadoop provides several job scheduling
mechanisms to manage job queues, prioritize jobs, and allocate resources based on job
requirements.
1. FIFO Scheduler: The First-In-First-Out (FIFO) scheduler is the default scheduler in
Hadoop. It schedules jobs based on their submission order, without considering their
resource requirements or priority. This scheduler is suitable for small clusters or for
situations where all jobs have similar resource requirements.
2. Capacity Scheduler: The Capacity Scheduler enables multiple users or groups to share
a cluster while ensuring that each user or group is allocated a guaranteed minimum
capacity. The Capacity Scheduler maintains several queues, each with a configured
capacity and priority level, and allocates resources to jobs based on their queue and
priority.
3. Fair Scheduler: The Fair Scheduler enables fair sharing of cluster resources among
multiple users or jobs, without any user or job being starved of resources. The Fair
Scheduler dynamically allocates resources to jobs based on their resource requirements
and job priority. Jobs are scheduled based on their fair share of cluster resources, with
lower priority jobs being throttled to give higher priority jobs more resources.
4. Custom Schedulers: Hadoop also provides the ability to implement custom schedulers,
allowing users to create custom scheduling algorithms that meet their specific
requirements.
Overall, Hadoop's job scheduling mechanisms enable efficient resource utilization and
improved job performance by providing fair and balanced allocation of cluster resources. By
using the appropriate scheduler for their use case, Hadoop users can ensure that their jobs are
executed efficiently and without delays, while ensuring that the cluster is used in an optimal
and fair manner.
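As a small illustration, a job can be directed to a particular scheduler queue at submission time. The sketch below assumes that a Capacity Scheduler queue named "analytics" has already been configured by the cluster administrator; the queue name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Submit this job to a specific scheduler queue (hypothetical queue name).
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "queue submission example");
    // ... configure mapper, reducer, input and output paths as usual ...
  }
}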

Shuffle and Sort


Shuffle
Shuffle is an important phase in the MapReduce programming model of Hadoop. It is the
process of transferring data between the Map and Reduce tasks and ensuring that the data is
grouped by key for efficient processing in the Reduce phase.
During the Map phase, data is processed in parallel across multiple nodes in the cluster. The
output of the Map phase is a set of intermediate key-value pairs. In the Shuffle phase, the
MapReduce framework ensures that all the pairs with the same key are grouped together and sent
to the same Reduce task for processing.
The Shuffle phase consists of two main steps:
1. Partitioning: In this step, the MapReduce framework partitions the Map output data
into several partitions based on the key. Each partition corresponds to one Reduce task
and contains all the key-value pairs assigned to that task (by default, those whose keys
hash to that partition). The number of partitions equals the number of Reduce tasks in
the job.
2. Grouping: In this step, each Reduce task fetches its partition of the data from all
the mappers and groups the records by key. This ensures that all the records with the
same key are processed together by the same Reduce task.
The Shuffle phase is a critical component of the MapReduce programming model in Hadoop.
It enables efficient transfer and grouping of data between Map and Reduce tasks, allowing for
efficient processing of large-scale data sets. By grouping data with the same key before it
reaches the reducers (and, where a combiner is used, aggregating it on the map side), the
Shuffle phase also reduces the volume of data that must be transferred across the network,
improving overall job performance.
Sort
In the context of Hadoop MapReduce, sorting refers to the process of sorting the intermediate
key-value pairs generated by the Map phase, before passing them on to the Reduce phase for
further processing.
Sorting is an important step in the MapReduce workflow, as it enables efficient processing of
the data by the Reduce phase. In particular, it enables the Reduce phase to easily group together
all values that correspond to a given key.
In the Shuffle and Sort phase of MapReduce, the output key-value pairs generated by the Map
phase are first partitioned based on the keys, and then sorted within each partition based on
the keys. This ensures that all the key-value pairs corresponding to a given key are grouped
together within a single partition, and that they are sorted based on the key.

Hadoop performs this sort in stages: map output is sorted in memory (by default with a
quicksort) and spilled to disk, and the sorted spill files are then combined with an external
merge sort on both the map and reduce sides. This sort-and-merge approach handles datasets far
larger than memory and works well in parallel processing environments such as Hadoop.
Overall, sorting is a critical step in the MapReduce workflow, as it enables efficient processing
of large datasets by the Reduce phase. By grouping key-value pairs based on their keys and
sorting them within each partition, MapReduce ensures that the Reduce phase can easily
process all values corresponding to a given key in a single iteration.

Task execution
In the context of Hadoop MapReduce, task execution refers to the process of running Map and
Reduce tasks on a Hadoop cluster to process input data and generate output data.
In a typical MapReduce job, the input data is split into multiple small pieces, called input
splits, which are processed by multiple Map tasks running in parallel on different nodes of the
Hadoop cluster. Each Map task reads one or more input splits, processes the input data, and
generates a set of intermediate key-value pairs.
After the Map tasks have completed, the intermediate data is sorted, partitioned, and grouped
by key in the Shuffle and Sort phase. This produces a set of partitions, each containing a subset
of the intermediate data with the same key.
The partitions are then processed by multiple Reduce tasks running in parallel on different
nodes of the Hadoop cluster. Each Reduce task reads one or more partitions, processes the
intermediate data, and generates the final output data.
The Hadoop MapReduce framework manages the task execution process, including task
scheduling, task coordination, task tracking, and fault tolerance. In classic MapReduce (MRv1)
a central scheduler called the JobTracker assigns Map and Reduce tasks to nodes based on the
available resources and data locality; in YARN-based Hadoop 2.x and later, the ResourceManager
and a per-job ApplicationMaster perform this role. The framework also monitors the progress of
the tasks and handles task failures by automatically re-executing the failed tasks on other
nodes of the cluster.
Overall, task execution is a critical component of the Hadoop MapReduce workflow, as it
enables efficient processing of large-scale data sets on distributed computing clusters. The
MapReduce framework provides a flexible and fault-tolerant framework for task execution,
making it a popular choice for big data processing.

Map Reduce Types and Formats


Map Reduce Types
In the context of Hadoop MapReduce, there are several types of Map and Reduce functions
that can be used to process input data and generate output data in different ways. Here are
some common types of Map and Reduce functions:

Map Functions:
1. Identity Mapper - This is a simple Map function that takes the input key-value pairs
and outputs them as is, without any modifications.
2. Tokenizer Mapper - This Map function splits the input text data into individual words
or tokens, and outputs each token as a separate key-value pair.
3. Filter Mapper - This Map function filters the input key-value pairs based on some
criteria and outputs only the selected key-value pairs.
4. Join Mapper - This Map function is used for joining two or more input data sets based
on a common key.
Reduce Functions:
1. Identity Reducer - This is a simple Reduce function that takes the intermediate key-
value pairs generated by the Map function and outputs them as is, without any
modifications.
2. Sum Reducer - This Reduce function computes the sum of all values corresponding to
each key and outputs the key-sum pairs.
3. Average Reducer - This Reduce function computes the average of all values
corresponding to each key and outputs the key-average pairs.
4. Join Reducer - This Reduce function is used for joining two or more input data sets
based on a common key.
5. Combiner - This is an optional Reduce function that can be used to perform a local
aggregation of the intermediate key-value pairs generated by the Map function, before
sending them over the network to the Reduce tasks. This can help to reduce network
traffic and improve the performance of the MapReduce job.
Overall, the choice of Map and Reduce functions depends on the specific requirements of the
data processing task, and can have a significant impact on the performance and scalability of
the MapReduce job. The Hadoop MapReduce framework provides a flexible and extensible
architecture for defining and executing custom Map and Reduce functions in a distributed and
fault-tolerant manner.
Map Reduce Formats
In Hadoop MapReduce, there are different types of input and output formats that can be used
to process data in different ways. Here are some common input and output formats used in
MapReduce:
Input Formats:
1. Text Input Format - This is the default input format in Hadoop. It reads text files
line by line, producing the byte offset of each line as the key and the contents of the
line as the value.
2. SequenceFile Input Format - This input format reads binary files that contain key-
value pairs in a serialized format.
3. Avro Input Format - This input format reads data files in the Avro format, which is a
compact and efficient serialization format for data interchange.

Output Formats:
1. Text Output Format - This is the default output format in Hadoop, which writes the
output key-value pairs as plain text files.
2. SequenceFile Output Format - This output format writes the output key-value pairs
as binary files, which can be used as input to another MapReduce job.
3. Avro Output Format - This output format writes the output key-value pairs in the Avro
format, which is a compact and efficient serialization format for data interchange.
4. HBase Output Format - This output format writes the output key-value pairs to HBase,
which is a distributed NoSQL database.
Overall, the choice of input and output formats depends on the specific requirements of the
data processing task, and can have a significant impact on the performance and scalability of
the MapReduce job. The Hadoop MapReduce framework provides a flexible and extensible
architecture for defining and using custom input and output formats in a distributed and fault-
tolerant manner.
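To make the choice of formats concrete, the following driver fragment is a sketch that reads plain text with TextInputFormat and writes its results as a SequenceFile with SequenceFileOutputFormat; the paths are illustrative, and the mapper and reducer classes are assumed to be defined elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format example");
    job.setJarByClass(FormatExample.class);

    // Input: plain text, one line per record (the default, shown explicitly here).
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/user/data/input"));      // illustrative path

    // Output: binary key-value pairs, convenient as input to a follow-up job.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/user/data/output")); // illustrative path

    // ... mapper and reducer classes would be set here before submitting the job ...
  }
}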

Map Reduce Features and Hadoop Environment


Map Reduce Features
MapReduce is a programming model and processing framework that is specifically designed
for processing and analyzing large-scale data sets in a distributed and parallel manner. Here
are some of the key features of MapReduce:
1. Scalability: MapReduce is highly scalable and can process massive amounts of data
across a large cluster of commodity hardware in a distributed and parallel manner.
2. Fault-tolerance: MapReduce provides built-in fault-tolerance mechanisms such as task
re-execution and automatic data replication to ensure that the job can continue running
even if some nodes or tasks fail.
3. Data locality: MapReduce leverages data locality to schedule processing tasks on nodes
that have the data they need, reducing network traffic and improving performance.
4. Flexibility: MapReduce provides a flexible and extensible programming model based
on the Map and Reduce functions, allowing developers to define custom data processing
logic in a variety of languages such as Java, Python, and Scala.
5. Parallel processing: MapReduce is designed to process data in parallel, which means
that the data is split into smaller chunks and processed simultaneously across multiple
nodes in the cluster, resulting in faster processing times.
6. Distributed computing: MapReduce is a distributed computing framework that can
process data across a large number of nodes in the cluster, allowing for faster and more
efficient processing of large-scale datasets.
7. Ecosystem integration: MapReduce is integrated with a range of other big data
technologies such as HDFS, Pig, and Hive, allowing users to build end-to-end data
processing pipelines for a wide range of use cases.

Overall, MapReduce provides a powerful and versatile framework for processing and
analyzing large-scale datasets in a distributed and parallel manner, making it an essential tool
for big data processing in many industries and applications.
Map Reduce in the Hadoop Environment
MapReduce is a key component of the Hadoop ecosystem, which is a collection of open-source
tools and technologies that are used for distributed processing and analysis of large-scale
datasets. Here's how MapReduce fits into the Hadoop environment:
1. Hadoop Distributed File System (HDFS): HDFS is the primary storage system for
Hadoop, and it provides a distributed and fault-tolerant file system for storing and
processing large-scale datasets. MapReduce is tightly integrated with HDFS and uses it
for storing input data, intermediate data, and output data.
2. Resource Manager and Job Scheduler: Hadoop provides a centralized resource
manager and job scheduler, such as YARN, which manages the allocation of resources
and scheduling of jobs across the cluster. MapReduce jobs are submitted to the job
scheduler, which then allocates resources and schedules tasks across the nodes in the
cluster.
3. MapReduce APIs: Hadoop provides a range of APIs and libraries for developing
MapReduce applications, including the Java MapReduce API, the Streaming API for
developing MapReduce jobs in languages other than Java such as Python, and the Pipes
API for C++.
4. Hadoop Ecosystem Tools: MapReduce is integrated with a range of other Hadoop
ecosystem tools and technologies, such as Hive, Pig, and Spark, allowing users to build
end-to-end data processing pipelines for a wide range of use cases.
Overall, MapReduce is an essential component of the Hadoop ecosystem and a popular choice
for big data processing and analysis. By leveraging the distributed processing capabilities of
Hadoop, MapReduce enables users to process and analyze large-scale datasets in a distributed
and fault-tolerant manner, making it a powerful tool for big data processing in many industries
and applications.
