Module 2
1 Introduction
2 Why? Goals? Assumptions.
4 Hadoop Ecosystem
5 Physical Architecture
6 Hadoop Limitations
7 Exercise Questions
Hadoop
Lecture 4:
• Designed to scale up from a single server to thousands of machines.
Why Hadoop?
• Example 1:
– Transfer speed is around 100MB/s and disk is 1TB
– Time to read entire disk = 3hrs.
– Increasing processor speed alone may not be very helpful, for
two reasons: a) network bandwidth becomes the bottleneck, and
b) processor chips face physical limits.
– Hence, instead of moving data to the computation, it is better to
move computation to the data.
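The slide's estimate can be checked with simple arithmetic. A quick sketch, assuming a 1 TB disk and a 100 MB/s sequential transfer speed, and 100 disks for the parallel case:

```python
# Back-of-the-envelope check of the numbers above (assumed values:
# 1 TB disk, 100 MB/s sequential transfer speed).
DISK_SIZE_MB = 1_000_000         # 1 TB expressed in MB
TRANSFER_MB_PER_S = 100          # ~100 MB/s sequential read

seconds = DISK_SIZE_MB / TRANSFER_MB_PER_S   # 10,000 s
hours = seconds / 3600                       # ~2.8 h, i.e. roughly 3 hrs

# Reading the same data from 100 disks in parallel (Hadoop's approach):
parallel_seconds = seconds / 100             # 100 s
print(f"single disk: {hours:.1f} h, 100 disks in parallel: {parallel_seconds:.0f} s")
```

This is why Hadoop spreads data across many machines: the same scan drops from hours to seconds once every disk reads its own portion in parallel.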
4 Goals of Hadoop
1. Scalable
2. Fault tolerant
4. Handles hardware failures: the software detects and handles failures at the application layer.
Handles huge data: divides the data into blocks, stores them across multiple computers, and executes in parallel.
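The "divide into blocks, store across multiple computers" idea can be sketched in a few lines. This is a toy model with hypothetical names, not Hadoop's actual placement policy: each block gets three replicas on distinct nodes, so losing any single node still leaves every block readable.

```python
# Toy sketch of block replication for fault tolerance (hypothetical
# names; real HDFS placement is rack-aware and more involved).
from itertools import cycle

def place_blocks(num_blocks, nodes, replication=3):
    """Round-robin each block's replicas onto nodes."""
    ring = cycle(nodes)
    placement = {}
    for b in range(num_blocks):
        placement[b] = [next(ring) for _ in range(replication)]
    return placement

placement = place_blocks(num_blocks=4, nodes=["n1", "n2", "n3", "n4"])

failed = "n2"
# After n2 fails, every block still has live replicas elsewhere:
survivors = {b: [n for n in reps if n != failed]
             for b, reps in placement.items()}
assert all(len(reps) >= 2 for reps in survivors.values())
```

With a replication factor of 3, a single node failure costs at most one replica per block, which is exactly the fault tolerance goal listed above.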
Hadoop Assumptions when it was developed
4. Portability is important.
Hadoop
Lecture 5:
Core Components, Common Package, HDFS,
HDFS Components
Core Hadoop Components
• Hadoop Distributed File System (HDFS): provides a limited interface for managing the file system.
• Hadoop MapReduce: the key algorithm that the MapReduce engine uses to distribute work around a
cluster.
• Hadoop Yet Another Resource Negotiator (YARN): manages resources in the cluster and uses
them to schedule users' applications.
Hadoop Common Package
• The standard startup and shutdown scripts require Secure Shell (SSH) to be set up between the
nodes in the cluster.
Hadoop Distributed File System
• A file loaded into HDFS is replicated and fragmented into blocks, which are stored across the
cluster nodes (a.k.a. DataNodes).
• When MapReduce or another framework calls for the data, the NameNode informs it
where the needed data resides.
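The lookup described above can be modeled in a few lines. This is a toy sketch with assumed names, not the real HDFS protocol: the NameNode holds only metadata (which blocks make up a file and which DataNodes hold each block), while the DataNodes hold the actual bytes.

```python
# Toy model of the NameNode lookup described above (assumed structure,
# not the real HDFS RPC interface).
class NameNode:
    def __init__(self):
        self.file_to_blocks = {}    # filename -> [block ids]
        self.block_locations = {}   # block id  -> [DataNode names]

    def add_file(self, name, blocks, locations):
        self.file_to_blocks[name] = blocks
        for block, locs in zip(blocks, locations):
            self.block_locations[block] = locs

    def locate(self, name):
        """What a framework asks before reading: block ids + where each lives."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[name]]

nn = NameNode()
nn.add_file("/logs/day1", blocks=["blk_1", "blk_2"],
            locations=[["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
print(nn.locate("/logs/day1"))
```

Each block is reported together with its replica locations, so a reader can fetch every block from a nearby DataNode without the file's bytes ever passing through the NameNode.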
Main Components of HDFS
1. NameNode
1. The master; it contains the metadata.
2. Maintains the directories and files.
3. Manages the blocks present on the DataNodes.
4. Maintains inode information.
5. Maps each inode to its list of blocks and their locations.
6. Takes care of authorization and authentication.
7. Creates checkpoints and logs the namespace changes.
2. DataNodes
– Slaves that provide the actual storage; deployed on each machine.
– Process read and write requests from the clients.
– Handle block storage on multiple volumes and also maintain block
integrity.
– Periodically send heartbeats and block reports to the NameNode.
HDFS Job Processing Sequence Diagram
1. The user copies the input files into the DFS and submits the job to the client.
2. The client gets the input file information from the DFS, creates splits, and uploads the job
information to the DFS.
3. The JobTracker puts the ready job into its internal queue.
4. The job scheduler picks up the job from the queue and initializes it by creating a job
object.
5. The JobTracker creates a list of tasks and assigns one map task to each input split.
6. TaskTrackers send heartbeats to the JobTracker to indicate whether they are ready to run new tasks.
7. The JobTracker chooses a task from the first job in the priority queue and assigns it to the
TaskTracker.
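Steps 5 to 7 above can be sketched as a small simulation. This is a hypothetical simplification, not the real JobTracker code: one map task is created per input split, and tasks are handed out in order as TaskTracker heartbeats arrive.

```python
# Sketch of steps 5-7: one map task per split, assigned on heartbeat.
# (Hypothetical simplification of the JobTracker/TaskTracker protocol.)
from collections import deque

splits = ["split-0", "split-1", "split-2"]
task_queue = deque(f"map({s})" for s in splits)   # step 5: one map task per split

assignments = {}
for tracker in ["tt1", "tt2", "tt1"]:             # step 6: heartbeats in arrival order
    if task_queue:                                 # step 7: hand out the next task
        assignments.setdefault(tracker, []).append(task_queue.popleft())

print(assignments)   # tt1 heartbeated twice, so it receives two tasks
```

Note how scheduling is pull-based: a TaskTracker only receives work when its heartbeat shows it is ready, which is what keeps slow or dead nodes from silently holding tasks.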
Hadoop
Lecture 6:
MapReduce, Components, YARN,
Ecosystem.
MapReduce
It enables parallel processing and has two basic steps: Map and Reduce.
1. In the map phase, a set of key-value pairs forms the input, and the
desired function is executed over each key-value pair.
2. This generates intermediate key-value pairs.
3. In the reduce phase, the intermediate pairs are grouped by key and the values are combined
according to the reduce algorithm provided by the user. Sometimes reduction is
not required.
4. The MapReduce process is divided between two applications: the JobTracker and the TaskTracker.
5. The JobTracker is the manager and runs on only one node of the cluster; the TaskTracker is a slave, so it runs on each node.
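The two phases above are easiest to see in the canonical word-count example. A minimal in-process sketch (it runs locally, not on a cluster, and the function names are illustrative):

```python
# Word count as map + reduce, run in-process to illustrate the phases.
from itertools import groupby

def map_phase(records):
    # Map: emit an intermediate (word, 1) pair for each word in a record.
    for _, line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group the intermediate pairs by key, then sum the values.
    out = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        out[key] = sum(v for _, v in group)
    return out

records = [(0, "big data big cluster"), (1, "big data")]
counts = reduce_phase(map_phase(records))
print(counts)   # {'big': 3, 'cluster': 1, 'data': 2}
```

On a real cluster, the map calls run in parallel on the nodes holding each input split, and the framework performs the sort-and-group step between the two phases.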
Main Components of MapReduce
1. JobTracker
2. TaskTrackers
Yet Another Resource Negotiator
• Since Hadoop initially had issues with scalability, memory, and a single point of failure,
YARN was added.
• Thus, instead of a single node handling both resource management and job scheduling, YARN distributes these responsibilities across the cluster.
Hadoop Ecosystem
Physical Architecture Components
• Racks
Hadoop Limitations
• General Limitations