
Module Name: Hadoop

Roadmap for Session

Topic No. Topic Name

1 Introduction

2 Why? Goals? Assumptions.

3 Hadoop Common Components

4 Hadoop Ecosystem

5 Physical Architecture

6 Hadoop Limitations

7 Exercise Questions

Hadoop

Lecture 4:

Introduction: Why Needed, Goals, Assumptions about Big Data.
Hadoop

• Hadoop was created by Doug Cutting and was released by Yahoo! in 2008.

• It is currently developed as an open-source framework by the Apache Software Foundation.

• It makes distributed storage and distributed processing possible.

• In essence, Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using a simple programming model.

• Designed to scale up from a single server to thousands of machines.

• Each machine provides local storage and computation.

• It does not rely on hardware for fault detection.

• The framework is designed to detect and handle failures at the application layer.

Why Hadoop?

• Data transfer problems led organisations to look for alternatives.

• Example 1:
– Transfer speed is around 100 MB/s and disk size is 1 TB.
– Time to read the entire disk is therefore roughly 3 hours (a quick check of this figure is sketched below).
– Increasing processing speed alone may not help much, for two reasons: (a) network bandwidth and (b) the physical limits of processor chips.
– So, instead of moving data to the computation, it is better to move the computation to the data.
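A rough back-of-the-envelope check of the 3-hour figure (a minimal sketch, not taken from the slides; 1 TB and 100 MB/s are the numbers quoted in the example):

    // Estimated time to read a full disk sequentially,
    // using the figures quoted in Example 1 (1 TB disk, ~100 MB/s).
    public class DiskReadTime {
        public static void main(String[] args) {
            double diskBytes = 1e12;          // 1 TB
            double bytesPerSecond = 100e6;    // ~100 MB/s sequential transfer
            double seconds = diskBytes / bytesPerSecond;       // 10,000 s
            System.out.printf("~%.1f hours%n", seconds / 3600); // about 2.8 hours
        }
    }

Reading from, say, 100 disks in parallel would cut this to a few minutes, which is exactly the kind of parallelism Hadoop exploits.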

4 Goals of Hadoop

1. Scalable

2. Fault tolerant

3. Economical: uses commodity hardware.

4. Handles hardware failures: the software detects and handles failures at the application layer.

• Huge data: divides the data into blocks, stores them across multiple computers, and executes in parallel.

• Processes a variety of data (social media, sensors, etc.) on commodity hardware.

• Throughput is high, but latency is also high: Hadoop is designed for batch processing, not for low-latency access.

Assumptions When Hadoop Was Developed

1. Hardware will fail.

2. Processing is done in batches, favouring high throughput.

3. Applications have large datasets, from gigabytes to terabytes.

4. Portability is important.

5. High aggregate data bandwidth is available, scaling to hundreds of nodes in a single cluster.

6. A single instance should support millions of files.

7. Applications need a write-once-read-many access model.
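As an illustration of the write-once-read-many model, a minimal HDFS shell sketch might look like the following (the directory and file names are made up for the example):

    # Write once: copy a local file into HDFS.
    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put weblogs.txt /user/demo/input/

    # Read many times: the same blocks can be read by many jobs and users.
    hdfs dfs -ls /user/demo/input
    hdfs dfs -cat /user/demo/input/weblogs.txt

    # Files are not edited in place; new data normally goes into new files.
    hdfs dfs -put weblogs-2.txt /user/demo/input/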

Hadoop

Lecture 5:
Core Components, Common Package, HDFS,
HDFS Components
Core Hadoop Components

• Hadoop Common: file-system and OS abstractions, libraries, and utilities.

• Hadoop Distributed File System (HDFS): provides a limited interface for managing the file system.

• Hadoop MapReduce: the key algorithm that the MapReduce engine uses to distribute work around a cluster.

• Hadoop Yet Another Resource Negotiator (YARN): manages cluster resources and uses them to schedule users' applications.

Hadoop Common Package

• It consists of the JAR files and scripts needed to start Hadoop.

• The standard startup and shutdown scripts require Secure Shell (SSH) to be set up between the nodes in the cluster (see the sketch after this list).

• HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop. They work in unison and are co-deployed.

• A single cluster provides the ability to move computation to the data.
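A minimal sketch of bringing such a cluster up, assuming a typical $HADOOP_HOME/sbin layout and illustrative host names:

    # Passwordless SSH so the startup scripts can reach every node.
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    ssh-copy-id hadoop@worker1
    ssh-copy-id hadoop@worker2

    # Format the NameNode once, then start the HDFS and YARN daemons.
    hdfs namenode -format
    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh

    # List the running Java daemons (NameNode, DataNode, ResourceManager, ...).
    jps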

Hadoop Distributed File System

• Provides an interface for managing the file system.

• To allow it to scale and provide high throughput, HDFS creates multiple replicas of each data block and distributes them on computers throughout the cluster.

• This enables reliable and rapid access.

• A file loaded into HDFS is replicated, fragmented into blocks, and stored across cluster nodes (also known as DataNodes).

• The NameNode is responsible for the storage and management of metadata.

• When MapReduce or another framework requests data, the NameNode tells it where the needed data resides.
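As a small illustration (assuming a file already stored in HDFS at a made-up path), the block and location metadata held by the NameNode can be inspected with the fsck tool:

    # Show the blocks of a file and the DataNodes holding each replica.
    hdfs fsck /user/demo/input/weblogs.txt -files -blocks -locations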

Main Components of HDFS

1. NameNode
   – The master; it contains the metadata.
   – Maintains the directories and files.
   – Manages the blocks present on the DataNodes.
   – Maintains inode information.
   – Maps each inode to its list of blocks and their locations.
   – Takes care of authorization and authentication.
   – Creates checkpoints and logs the namespace changes.

Thus it monitors the status of the DataNodes and re-replicates missing blocks.

Main Components of HDFS

2. DataNodes
   – Slaves that provide the actual storage; deployed on each machine.
   – Process read and write requests from clients.
   – Handle block storage on multiple volumes and maintain block integrity.
   – Periodically send heartbeats and block reports to the NameNode.

• By default, the block size is 128 MB.

• By default, each block is replicated 3 times.

• Both defaults can be altered (see the sketch below).
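For example, both defaults could be overridden in hdfs-site.xml (a minimal sketch; the values shown are only examples):

    <!-- hdfs-site.xml: example overrides for block size and replication -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>  <!-- 256 MB instead of the 128 MB default -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>2</value>          <!-- 2 replicas instead of the default 3 -->
      </property>
    </configuration>

The replication factor of an existing file can also be changed from the shell with hdfs dfs -setrep.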

HDFS Job Processing Sequence Diagram

1. The user copies the input files into the DFS and submits the job to the client.
2. The client gets the input file information from the DFS, creates the splits, and uploads the job information to the DFS.
3. The JobTracker puts the ready job into its internal queue.
4. The job scheduler picks the job up from the queue and initializes it by creating a job object.
5. The JobTracker creates a list of tasks and assigns one map task to each input split.
6. TaskTrackers send heartbeats to the JobTracker to indicate whether they are ready to run new tasks.
7. The JobTracker chooses a task from the first job in the priority queue and assigns it to a TaskTracker.
8. Meanwhile, the Secondary NameNode performs periodic checkpoints, which are used to restart the NameNode after failures.

Finally, MapReduce can process the data wherever it is located!
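From the user's side, the whole sequence typically reduces to a few commands like the following (the jar, class, and path names are illustrative):

    # 1. Copy the input files into the distributed file system.
    hdfs dfs -put input/ /user/demo/input

    # 2. Submit the job; the framework creates splits, schedules tasks,
    #    and moves the computation to the nodes holding the data.
    hadoop jar myjob.jar MyJob /user/demo/input /user/demo/output

    # 3. Inspect the output written by the reducers.
    hdfs dfs -cat /user/demo/output/part-r-00000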


Hadoop

Lecture 6:
Map Reduce, Components, YARN,
Ecosystem.
Map Reduce

It enables parallel processing and has two basic steps: Map and Reduce.

1. In the map phase, a set of key-value pairs forms the input, and the desired function is executed over each key-value pair.
2. It generates intermediate key-value pairs.

3. In the reduce phase, the intermediate pairs are grouped by key and the values are combined according to the reduce algorithm provided by the user. Sometimes reduction is not required.

4. The MapReduce process is divided between two applications: the JobTracker and the TaskTracker.
5. The JobTracker is the manager and runs on only one node of the cluster; TaskTrackers are slaves, so there is one on each node (a code sketch of the two phases follows below).
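The classic word-count program below is a minimal sketch of these two phases using the Hadoop MapReduce Java API (the class names are illustrative and not taken from the slides):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: intermediate pairs arrive grouped by key; sum the counts.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local reduce
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The program is packaged as a JAR and submitted with hadoop jar, as in the earlier sketch; the framework takes care of distributing the map and reduce tasks across the cluster.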

Main components of Map Reduce

1. Job Tracker

2. Task Trackers

3. Job History Server

Yet Another Resource Negotiator

• Hadoop initially had issues with scalability, memory, and a single point of failure, so YARN was added.

1. Resource management: a global ResourceManager.

2. Job scheduling and monitoring: an ApplicationMaster per application.

Thus, instead of a single node handling both tasks, YARN distributes this responsibility across the cluster.

3. The ResourceManager and the NodeManagers manage the applications.

4. The RM distributes resources to all applications.

5. The AM negotiates resources from the RM and works with the NodeManager(s) to execute and monitor the tasks (see the sketch below).
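Once YARN is running, this division of labour can be observed from the command line (a minimal sketch; the application ID shown is made up):

    # NodeManagers currently registered with the ResourceManager.
    yarn node -list

    # Applications known to the ResourceManager, each with its own ApplicationMaster.
    yarn application -list

    # Aggregated logs of a finished application.
    yarn logs -applicationId application_1700000000000_0001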

Hadoop Ecosystem

Physical Architecture Components

• Master node and slave nodes

• JobTracker, TaskTracker, NameNode, DataNode

• Racks

• Primary and Secondary NameNode

Hadoop Limitations

• Security concerns: security features are disabled by default because of their complexity, so Hadoop is not preferred by government agencies.

• Vulnerable by nature: it is written in Java, which is well known to cyber criminals.

• Not a good fit for small data.

• Potential Stability Issues

• General Limitations
