Big Data & Hadoop Certification Training

Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Advance Hive and HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project
▪ Kafka Producer
▪ Kafka Consumer
▪ Kafka Operation and Performance Tuning
▪ Kafka Cluster Architectures & Administering Kafka
▪ Kafka Monitoring & Stream Processing
▪ Integration of Kafka with Hadoop
▪ Integration of Kafka with Spark
▪ Kafka Project
Module 4: Advance MapReduce

At the end of this module, you will be able to:

• Implement Counters in MapReduce

• Understand Map and Reduce Side joins

• Test MapReduce Programs

• Implement Distributed Cache Concept in MapReduce

• Implement Custom Input Format in MapReduce

• Implement Sequence Input Format in MapReduce

Let’s Revise

▪ Input data is distributed to nodes (Node 1, Node 2)
▪ Each map task works on a “split” of data
▪ Mapper outputs intermediate data
▪ Data is exchanged between nodes in a “shuffle” process
▪ Intermediate data of the same key goes to the same reducer
▪ Reducer output is stored
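
The flow above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the function names (`map_task`, `partition`, `shuffle`, `reduce_task`, `run_job`) and the word-count job are assumptions chosen for the example.

```python
from collections import defaultdict


def map_task(split):
    """Emit an intermediate (word, 1) pair for every word in one split."""
    return [(word, 1) for line in split for word in line.split()]


def partition(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always meet."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group intermediate values by key inside each reducer's partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition(key, num_reducers)][key].append(value)
    return partitions


def reduce_task(grouped):
    """Sum the values of every key assigned to one reducer."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(splits, num_reducers=2):
    map_outputs = [map_task(split) for split in splits]  # map stage
    partitions = shuffle(map_outputs, num_reducers)      # shuffle stage
    return [reduce_task(p) for p in partitions]          # reduce stage


if __name__ == "__main__":
    splits = [["big data", "hadoop"], ["big hadoop hadoop"]]
    for reducer_id, output in enumerate(run_job(splits)):
        print(reducer_id, sorted(output.items()))
```

Because `partition` depends only on the key, every value for a given word lands in exactly one reducer's output.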

Annie’s Question
Can you use the Map-Reduce algorithm to perform a relational join
on two large tables sharing a key? Assume that the two tables are
formatted as comma-separated files in HDFS:
a. Yes
b. No

Annie’s Answer

Ans. The correct option is ‘a’. Map-Reduce can perform relational joins using
join algorithms such as map-side, reduce-side, and in-memory joins.
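
A reduce-side join can be sketched in plain Python as follows. This is a minimal single-process illustration, not Hadoop code: it assumes the join key is the first CSV column of both tables, and the names `map_tagged` and `reduce_side_join` are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import product


def map_tagged(csv_text, tag):
    """Map step: emit (join key, (source tag, remaining fields)) per row."""
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            yield row[0], (tag, row[1:])


def reduce_side_join(left_csv, right_csv):
    """Group tagged rows by key, then pair left and right rows per key."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    tagged = list(map_tagged(left_csv, "L")) + list(map_tagged(right_csv, "R"))
    for key, (tag, fields) in tagged:  # stands in for the shuffle
        groups[key][tag].append(fields)
    joined = []
    for key in sorted(groups):         # reduce step, one call per key
        for left, right in product(groups[key]["L"], groups[key]["R"]):
            joined.append([key] + left + right)
    return joined


if __name__ == "__main__":
    users = "1,alice\n2,bob\n"
    orders = "1,book\n1,pen\n3,ink\n"
    for row in reduce_side_join(users, orders):
        print(",".join(row))
```

Keys that appear in only one table produce no output, so the sketch behaves like an inner join.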

Annie’s Question
What types of algorithms are difficult to express in Map-Reduce?

a. Algorithms that require global, shared state

b. Algorithms that require application of the same mathematical
function to large numbers of individual binary records

Annie’s Answer
Ans. The correct option is ‘a’. The Map-Reduce paradigm works in a
massively parallel system. Map and Reduce tasks execute in isolation
on a chunk of data (input splits), so algorithms that require a
global shared state to be maintained aren’t suitable for Map-Reduce.

Annie’s Question
At what stage of Map-Reduce task execution does a reducer's reduce
function start?

a. At least one mapper is ready with its output

b. map() and reduce() start simultaneously
c. After processing for all the map tasks is completed

Annie’s Answer
Ans. The correct option is ‘c’. A reducer's reduce function works on the
output of the map tasks, so it starts only after all map tasks have completed.
Reducers may begin copying map output earlier, but reduce() is not called
until every mapper has finished.

Annie’s Question
You want to reduce the traffic between mapper and reducer. Which
interface should your class implement?
a. Partitioner
b. Combiner
c. Writable
d. WritableComparable

Annie’s Answer
Ans. The correct option is ‘b’. Combiners are essentially mini-reducers that
run on the map side. They lessen the workload that is passed on to the
reducers.
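
The effect of a combiner on shuffle traffic can be illustrated with a small plain-Python sketch. It is not Hadoop code: the word-count job and the names `map_words`, `combine`, and `reduce_counts` are assumptions made for the example.

```python
from collections import Counter


def map_words(split):
    """Map step: one (word, 1) pair per word in the split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def reduce_counts(pairs):
    """Reduce step: sum every value that arrives for each key."""
    total = Counter()
    for key, value in pairs:
        total[key] += value
    return dict(total)


if __name__ == "__main__":
    splits = ["a a a b", "a b b"]
    raw = [pair for s in splits for pair in map_words(s)]
    combined = [pair for s in splits for pair in combine(map_words(s))]
    print("pairs shuffled without combiner:", len(raw))
    print("pairs shuffled with combiner:", len(combined))
    print("same result:", reduce_counts(raw) == reduce_counts(combined))
```

The combiner is safe here only because addition is associative and commutative; a job whose reduce logic lacks those properties cannot reuse it as a combiner unchanged.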

Map and Reduce Side Joins
[Diagram: the large table is fragmented into Split 1 to Split 4, one per map task; each map task also reads the small table.]
[Diagram: a MapReduce local task turns the small table into hash table files, which are compressed, archived, and shipped through the Distributed Cache. Each mapper joins its records of big table data against the hash table.]
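
A map-side join of this kind can be sketched in plain Python. This is an illustration under stated assumptions, not Hadoop code: both tables are comma-separated with the join key in the first column, the small table fits in memory, and the names `load_small_table` and `map_side_join` are hypothetical.

```python
def load_small_table(lines):
    """Build the in-memory hash table every mapper would load once."""
    table = {}
    for line in lines:
        key, value = line.split(",", 1)
        table[key] = value
    return table


def map_side_join(split, small_table):
    """Join each big-table record in one split against the hash table.

    No shuffle or reduce step is needed: the lookup happens in the mapper.
    """
    for line in split:
        key, rest = line.split(",", 1)
        if key in small_table:
            yield f"{key},{rest},{small_table[key]}"


if __name__ == "__main__":
    departments = load_small_table(["10,sales", "20,engineering"])
    splits = [["10,alice", "30,carol"], ["20,bob", "10,dave"]]
    for split in splits:  # each split would be one map task
        for joined in map_side_join(split, departments):
            print(joined)
```

Records whose key is missing from the small table are dropped, which makes this an inner join.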

Demo: Joins in Map-Reduce

Input Format
[Diagram: each input file is divided into input splits. The Input Format supplies a RecordReader for each split, each RecordReader feeds records to one Mapper, and each Mapper emits intermediates.]
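
What a record reader hands to a mapper can be sketched in plain Python. The sketch imitates the key/value shape of Hadoop's TextInputFormat (key is the byte offset of the line, value is the line text), but it is not Hadoop code: the name `read_records` is hypothetical, and split boundaries are ignored for simplicity.

```python
def read_records(data: bytes):
    """Yield (byte offset, line text) pairs, one per line of input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value drops the line ending.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def mapper(offset, line):
    """A mapper consumes one record at a time from the reader."""
    yield line.upper(), offset


if __name__ == "__main__":
    data = b"ab\ncde\nf"
    for key, value in read_records(data):
        print(key, value, list(mapper(key, value)))
```

A custom input format changes exactly this step: it decides how raw bytes become the keys and values the mapper sees.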

Input Format – Class Hierarchy
InputFormat<K,V>
    FileInputFormat<K,V>
        CombineFileInputFormat<K,V>
        TextInputFormat
        KeyValueTextInputFormat
        NLineInputFormat
        SequenceFileInputFormat<K,V>
            SequenceFileAsBinaryInputFormat
            SequenceFileAsTextInputFormat
            SequenceFileInputFilter<K,V>
    CompositeInputFormat<K,V>
    DBInputFormat<T>

Output Format
[Diagram: each Reducer passes its output to the Output Format, which supplies a RecordWriter per reducer; each RecordWriter writes one output file.]

Output Format – Class Hierarchy
OutputFormat<K,V>
    FileOutputFormat<K,V>
        TextOutputFormat<K,V>
        SequenceFileOutputFormat<K,V>
            SequenceFileAsBinaryOutputFormat
    NullOutputFormat
    DBOutputFormat
    FilterOutputFormat<K,V>
        LazyOutputFormat<K,V>

Demo: Custom Input Format

MRUnit Testing Framework
▪ Provides 4 drivers for separately testing MapReduce code:
  • MapDriver
  • ReduceDriver
  • MapReduceDriver
  • PipelineMapReduceDriver

▪ Helps in filling the gap between MapReduce programs and JUnit*

▪ Better control over log messages with JUnit integration

*JUnit is a simple framework to write repeatable tests.
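
The idea behind a map driver (feed a mapper one input, declare the expected output, and let the driver compare) can be shown with a plain-Python analogy. This is not the MRUnit API: the class `MapDriverSketch`, its method names, and `word_mapper` are hypothetical stand-ins written for this example.

```python
class MapDriverSketch:
    """Hypothetical stand-in for a map driver; not the MRUnit API."""

    def __init__(self, mapper):
        self.mapper = mapper
        self.inputs = []
        self.expected = []

    def with_input(self, key, value):
        self.inputs.append((key, value))
        return self

    def with_output(self, key, value):
        self.expected.append((key, value))
        return self

    def run_test(self):
        """Run the mapper on every input and compare with expectations."""
        actual = [pair for k, v in self.inputs for pair in self.mapper(k, v)]
        assert actual == self.expected, (
            f"expected {self.expected}, got {actual}"
        )


def word_mapper(offset, line):
    """Mapper under test: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


if __name__ == "__main__":
    (MapDriverSketch(word_mapper)
        .with_input(0, "pig hive pig")
        .with_output("pig", 1)
        .with_output("hive", 1)
        .with_output("pig", 1)
        .run_test())
    print("mapper test passed")
```

Testing the mapper in isolation like this needs no cluster and no input files, which is the gap between MapReduce programs and ordinary unit tests that the slide refers to.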

Demo: MRUnit Testing Framework

Counters
▪ Counters are lightweight objects in Hadoop that allow you to keep track of system progress in both the map and reduce stages of processing.

▪ Counters are used to gather information about the data being analysed, such as how many records of each type were processed and how many invalid records were found while running the job.
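
Counting valid and invalid records while mapping can be sketched in plain Python. This is an illustration, not the Hadoop counter API: the counter names `VALID_RECORDS` and `INVALID_RECORDS`, the `name,number` record layout, and the function `map_with_counters` are assumptions made for the example.

```python
from collections import Counter


def map_with_counters(lines, counters):
    """Emit (name, number) for well-formed lines and count the rest."""
    for line in lines:
        fields = line.split(",")
        if len(fields) != 2 or not fields[1].strip().isdigit():
            counters["INVALID_RECORDS"] += 1  # skip but keep a tally
            continue
        counters["VALID_RECORDS"] += 1
        yield fields[0], int(fields[1])


if __name__ == "__main__":
    counters = Counter()
    lines = ["a,1", "bad", "b,x", "c,3"]
    output = list(map_with_counters(lines, counters))
    print(output)
    print(dict(counters))
```

In a real job each task keeps its own tallies and the framework aggregates them; here a single `Counter` plays both roles.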


Demo: Counters

Distributed Cache
Distributed Cache is a facility provided by the Map-Reduce
framework to cache files (text, archives, jars, etc.) needed by
applications.

Files are copied only once per job and should not be modified
by the application or externally while the job is executing.

Distributed Cache can be used to distribute simple, read-only
data/text files and/or more complex types such as archives and jars,
via the JobConf.

[Diagram: files in HDFS are served through the Hadoop Distributed Cache to each MapReduce task.]

Demo: Distributed Cache

Sequence File
▪ Hadoop is not restricted to processing plain text data. For custom binary data types, one can use SequenceFile
▪ SequenceFile is a flat file consisting of binary key/value pairs
▪ Used in MapReduce as an input/output format
▪ Outputs of maps are stored using SequenceFile
▪ Provides:
  • A Writer
  • A Reader
  • A Sorter

Sequence File
Three different SequenceFile formats:
▪ Uncompressed key/value records
▪ Record compressed key/value records
  • Only ‘values’ are compressed here
▪ Block compressed key/value records
  • Both keys and values are collected in ‘blocks’ separately and compressed
▪ The other objective of using SequenceFile is to 'pack' many small files into a single large SequenceFile for the MapReduce computation, since the design of Hadoop prefers large files (remember that the default HDFS block size is 64 MB in Hadoop 1.x and 128 MB from Hadoop 2.x onward).
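
Packing many small files into one flat file of binary key/value pairs can be sketched in plain Python. This is not the real SequenceFile on-disk format (there is no header and no sync markers): the 4-byte length prefixes and the names `pack_files` and `unpack_files` are assumptions made for the example.

```python
import struct


def pack_files(files):
    """Concatenate (file name, content) pairs into one binary blob.

    Each record is: key length (4 bytes), value length (4 bytes), key, value.
    """
    parts = []
    for name, content in files.items():
        key = name.encode("utf-8")
        parts.append(struct.pack(">II", len(key), len(content)))
        parts.append(key)
        parts.append(content)
    return b"".join(parts)


def unpack_files(blob):
    """Read the records back in order as (file name, content) pairs."""
    pos = 0
    while pos < len(blob):
        key_len, value_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        value = blob[pos:pos + value_len]
        pos += value_len
        yield key, value


if __name__ == "__main__":
    small_files = {"a.log": b"first", "b.log": b"second", "c.log": b""}
    blob = pack_files(small_files)
    print(len(blob), "bytes in one file")
    print(dict(unpack_files(blob)) == small_files)
```

Using the file name as the key and the file content as the value lets one large file stand in for thousands of small ones.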

Sequence File – Record Compression
File layout: Header | Record | Record | Sync | Record | Record | Record | Sync | Record

Record without compression: Record length (4 bytes) | Key length (4 bytes) | Key | Value

Record with record compression: Record length (4 bytes) | Key length (4 bytes) | Key | Compressed value

Sequence File – Block Compression

File layout: Header | Sync | Block | Sync | Block | Sync | Block | Sync | Block

Block with block compression: Number of records | Compressed key lengths | Compressed keys | Compressed value lengths | Compressed values
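
The difference between record compression and block compression can be sketched in plain Python with `zlib`. This is a simplified illustration, not the SequenceFile codec or byte layout: the fixed 4-byte lengths, the dictionary used to hold a block, and the function names are assumptions made for the example.

```python
import struct
import zlib


def record_compress(records):
    """Record compression: keys stay as-is, each value is compressed alone."""
    return [(key, zlib.compress(value)) for key, value in records]


def _pack_lengths(items):
    return struct.pack(f">{len(items)}I", *(len(item) for item in items))


def block_compress(records):
    """Block compression: keys and values are collected separately,
    then each group (and its list of lengths) is compressed as a whole."""
    keys = [key for key, _ in records]
    values = [value for _, value in records]
    return {
        "num_records": len(records),
        "key_lengths": zlib.compress(_pack_lengths(keys)),
        "keys": zlib.compress(b"".join(keys)),
        "value_lengths": zlib.compress(_pack_lengths(values)),
        "values": zlib.compress(b"".join(values)),
    }


def _split(blob, lengths):
    out, pos = [], 0
    for n in lengths:
        out.append(blob[pos:pos + n])
        pos += n
    return out


def block_decompress(block):
    """Rebuild the original (key, value) records from one block."""
    n = block["num_records"]
    key_lengths = struct.unpack(f">{n}I", zlib.decompress(block["key_lengths"]))
    value_lengths = struct.unpack(f">{n}I", zlib.decompress(block["value_lengths"]))
    keys = _split(zlib.decompress(block["keys"]), key_lengths)
    values = _split(zlib.decompress(block["values"]), value_lengths)
    return list(zip(keys, values))


if __name__ == "__main__":
    records = [(f"k{i:03d}".encode(), b"status=OK;host=node01") for i in range(100)]
    per_record = sum(len(v) for _, v in record_compress(records))
    per_block = len(block_compress(records)["values"])
    print("value bytes, record compression:", per_record)
    print("value bytes, block compression:", per_block)
```

Compressing many similar values together lets the compressor exploit repetition across records, which compressing each value alone cannot do.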


Demo: SequenceFile

Practice “Advance MR Codes” present in the LMS in the Cloud Lab

Review the following PIG blogs:

Agenda for Next Class
• PIG and its need
• Difference between PIG and MapReduce
• PIG features and programming structure
• PIG running modes
• PIG components and data model
• Basic operations in PIG
• UDF in PIG

