
Module Name: Hadoop

Roadmap for Session

Topic No. Topic Name

1 Introduction

2 Why? Goals? Assumptions.

3 Hadoop Common Components

4 Hadoop Ecosystem

5 Physical Architecture

6 Hadoop Limitations

7 Exercise Questions

Hadoop

Lecture 4:

Introduction: Why Needed, Goals, Assumptions about Big Data.
Hadoop

• Hadoop was created by Doug Cutting and was released by Yahoo! in 2008.

• It is currently developed as an open-source framework by the Apache Software Foundation.

• It makes distributed storage and distributed processing possible.

• In essence, Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using a simple programming model.

• Designed to scale up from a single server to thousands of machines.

• Each machine provides local storage and computation.

• It does not rely on hardware for fault detection.

• The framework is designed to detect and handle failures at the application layer.

Why Hadoop?

• Data transfer problems led organisations to look for alternatives.

• Example 1:
– Transfer speed is around 100 MB/s and disk size is 1 TB.
– Time to read the entire disk is therefore roughly 3 hours (a quick check of this figure is sketched below).
– Increasing processing speed alone may not help much, for two reasons: (a) network bandwidth and (b) the physical limits of processor chips.
– So, instead of moving data to the computation, it is better to move the computation to the data.
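A rough back-of-the-envelope check of the 3-hour figure (a minimal sketch, not taken from the slides; 1 TB and 100 MB/s are the numbers quoted in the example):

    // Estimated time to read a full disk sequentially,
    // using the figures quoted in Example 1 (1 TB disk, ~100 MB/s).
    public class DiskReadTime {
        public static void main(String[] args) {
            double diskBytes = 1e12;          // 1 TB
            double bytesPerSecond = 100e6;    // ~100 MB/s sequential transfer
            double seconds = diskBytes / bytesPerSecond;       // 10,000 s
            System.out.printf("~%.1f hours%n", seconds / 3600); // about 2.8 hours
        }
    }

Reading from, say, 100 disks in parallel would cut this to a few minutes, which is exactly the kind of parallelism Hadoop exploits.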

4 Goals of Hadoop

1. Scalable

2. Fault tolerant

3. Economical: uses commodity hardware.

4. Handles hardware failures: the software detects and handles failures at the application layer.

• Huge data: divides the data into blocks, stores them across multiple computers, and executes in parallel.

• Processes a variety of data (social media, sensors, etc.) on commodity hardware.

• Throughput is high, but latency is also high: Hadoop is designed for batch processing, not for low-latency access.

Assumptions When Hadoop Was Developed

1. Hardware will fail.

2. Processing is done in batches, favouring high throughput.

3. Applications have large datasets, from gigabytes to terabytes.

4. Portability is important.

5. High aggregate data bandwidth is available, scaling to hundreds of nodes in a single cluster.

6. A single instance should support millions of files.

7. Applications need a write-once-read-many access model.
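As an illustration of the write-once-read-many model, a minimal HDFS shell sketch might look like the following (the directory and file names are made up for the example):

    # Write once: copy a local file into HDFS.
    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put weblogs.txt /user/demo/input/

    # Read many times: the same blocks can be read by many jobs and users.
    hdfs dfs -ls /user/demo/input
    hdfs dfs -cat /user/demo/input/weblogs.txt

    # Files are not edited in place; new data normally goes into new files.
    hdfs dfs -put weblogs-2.txt /user/demo/input/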

Hadoop

Lecture 5:
Core Components, Common Package, HDFS,
HDFS Components
Core Hadoop Components

• Hadoop Common: file-system and OS abstractions, libraries, and utilities.

• Hadoop Distributed File System (HDFS): provides a limited interface for managing the file system.

• Hadoop MapReduce: the key algorithm that the MapReduce engine uses to distribute work around a cluster.

• Hadoop Yet Another Resource Negotiator (YARN): manages cluster resources and uses them to schedule users' applications.

Hadoop Common Package

• It consists of the JAR files and scripts needed to start Hadoop.

• The standard startup and shutdown scripts require Secure Shell (SSH) to be set up between the nodes in the cluster (see the sketch after this list).

• HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop. They work in unison and are co-deployed.

• A single cluster provides the ability to move computation to the data.
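A minimal sketch of bringing such a cluster up, assuming a typical $HADOOP_HOME/sbin layout and illustrative host names:

    # Passwordless SSH so the startup scripts can reach every node.
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    ssh-copy-id hadoop@worker1
    ssh-copy-id hadoop@worker2

    # Format the NameNode once, then start the HDFS and YARN daemons.
    hdfs namenode -format
    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh

    # List the running Java daemons (NameNode, DataNode, ResourceManager, ...).
    jps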

Hadoop Distributed File System

• Provides an interface for managing the file system.

• To allow it to scale and provide high throughput, HDFS creates multiple replicas of each data block and distributes them on computers throughout the cluster.

• This enables reliable and rapid access.

• A file loaded into HDFS is replicated, fragmented into blocks, and stored across cluster nodes (also known as DataNodes).

• The NameNode is responsible for the storage and management of metadata.

• When MapReduce or another framework requests data, the NameNode tells it where the needed data resides.
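As a small illustration (assuming a file already stored in HDFS at a made-up path), the block and location metadata held by the NameNode can be inspected with the fsck tool:

    # Show the blocks of a file and the DataNodes holding each replica.
    hdfs fsck /user/demo/input/weblogs.txt -files -blocks -locations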

Main Components of HDFS

1. NameNode
   – The master; it contains the metadata.
   – Maintains the directories and files.
   – Manages the blocks present on the DataNodes.
   – Maintains inode information.
   – Maps each inode to its list of blocks and their locations.
   – Takes care of authorization and authentication.
   – Creates checkpoints and logs the namespace changes.

Thus it monitors the status of the DataNodes and re-replicates missing blocks.

Main Components of HDFS

2. DataNodes
   – Slaves that provide the actual storage; deployed on each machine.
   – Process read and write requests from clients.
   – Handle block storage on multiple volumes and maintain block integrity.
   – Periodically send heartbeats and block reports to the NameNode.

• By default, the block size is 128 MB.

• By default, each block is replicated 3 times.

• Both defaults can be altered (see the sketch below).
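For example, both defaults could be overridden in hdfs-site.xml (a minimal sketch; the values shown are only examples):

    <!-- hdfs-site.xml: example overrides for block size and replication -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>  <!-- 256 MB instead of the 128 MB default -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>2</value>          <!-- 2 replicas instead of the default 3 -->
      </property>
    </configuration>

The replication factor of an existing file can also be changed from the shell with hdfs dfs -setrep.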

HDFS Job Processing Sequence Diagram

1. The user copies the input files into the DFS and submits the job to the client.
2. The client gets the input file information from the DFS, creates the splits, and uploads the job information to the DFS.
3. The JobTracker puts the ready job into its internal queue.
4. The job scheduler picks the job up from the queue and initializes it by creating a job object.
5. The JobTracker creates a list of tasks and assigns one map task to each input split.
6. TaskTrackers send heartbeats to the JobTracker to indicate whether they are ready to run new tasks.
7. The JobTracker chooses a task from the first job in the priority queue and assigns it to a TaskTracker.
8. Meanwhile, the Secondary NameNode performs periodic checkpoints, which are used to restart the NameNode after failures.

Finally, MapReduce can process the data wherever it is located!
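From the user's side, the whole sequence typically reduces to a few commands like the following (the jar, class, and path names are illustrative):

    # 1. Copy the input files into the distributed file system.
    hdfs dfs -put input/ /user/demo/input

    # 2. Submit the job; the framework creates splits, schedules tasks,
    #    and moves the computation to the nodes holding the data.
    hadoop jar myjob.jar MyJob /user/demo/input /user/demo/output

    # 3. Inspect the output written by the reducers.
    hdfs dfs -cat /user/demo/output/part-r-00000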


Hadoop

Lecture 6:
Map Reduce, Components, YARN,
Ecosystem.
Map Reduce

It enables parallel processing and has two basic steps: Map and Reduce.

1. In the map phase, a set of key-value pairs forms the input, and the desired function is executed over each key-value pair.
2. It generates intermediate key-value pairs.

3. In the reduce phase, the intermediate pairs are grouped by key and the values are combined according to the reduce algorithm provided by the user. Sometimes reduction is not required.

4. The MapReduce process is divided between two applications: the JobTracker and the TaskTracker.
5. The JobTracker is the manager and runs on only one node of the cluster; TaskTrackers are slaves, so there is one on each node (a code sketch of the two phases follows below).
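The classic word-count program below is a minimal sketch of these two phases using the Hadoop MapReduce Java API (the class names are illustrative and not taken from the slides):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: intermediate pairs arrive grouped by key; sum the counts.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local reduce
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The program is packaged as a JAR and submitted with hadoop jar, as in the earlier sketch; the framework takes care of distributing the map and reduce tasks across the cluster.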

Main components of Map Reduce

1. Job Tracker

2. Task Trackers

3. Job History Server

Yet Another Resource Negotiator

• Hadoop initially had issues with scalability, memory, and a single point of failure, so YARN was added.

1. Resource management: a global ResourceManager.

2. Job scheduling and monitoring: an ApplicationMaster per application.

Thus, instead of a single node handling both tasks, YARN distributes this responsibility across the cluster.

3. The ResourceManager and the NodeManagers manage the applications.

4. The RM distributes resources to all applications.

5. The AM negotiates resources from the RM and works with the NodeManager(s) to execute and monitor the tasks (see the sketch below).
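Once YARN is running, this division of labour can be observed from the command line (a minimal sketch; the application ID shown is made up):

    # NodeManagers currently registered with the ResourceManager.
    yarn node -list

    # Applications known to the ResourceManager, each with its own ApplicationMaster.
    yarn application -list

    # Aggregated logs of a finished application.
    yarn logs -applicationId application_1700000000000_0001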

Hadoop Ecosystem

Physical Architecture Components

• Master node and slave nodes

• JobTracker, TaskTracker, NameNode, DataNode

• Racks

• Primary and Secondary NameNode

Hadoop Limitations

• Security concerns: security features are disabled by default because of their complexity, so Hadoop is not preferred by government agencies.

• Vulnerable by nature: it is written in Java, which is well known to cyber criminals.

• Not a good fit for small data.

• Potential Stability Issues

• General Limitations
