
BigData & Hadoop
Shushrutha Reddy K
M.Tech in Computational Engineering from RGUKT
Senior BigData Developer @ServiceNow
Agenda
• BigData
• Hadoop
• MapReduce
• YARN
• Spark
• Amazon EMR
Friday, 21 January 2022 2
How It All Started?



What is BigData?

BigData is a term for collections of data sets so large and complex that
they are difficult to store and process using available database
management tools or traditional data processing applications.

The challenges include capturing, curating, storing, searching, sharing,
transferring, analysing, and visualizing this data.



Every minute:



Characteristics:



Types of Big Data
• Three types:
• Structured - stored and processed in a fixed format - SQL
• Semi-Structured - XML files or JSON
• Unstructured - Text Files, images, audios, videos



Why Big Data Analytics?

Making Organisations Smarter and More Efficient

Optimize Business Operations by Analysing Customer Behaviour

Cost Reduction

New Generation Products



Stages in Big Data Analytics



Types of Big Data Analytics

Descriptive Analytics

• Uses data aggregation and data mining to provide insight into the past

Diagnostic Analytics

• Determines why something happened in the past

Predictive Analytics

• Uses statistical models and forecasting techniques to understand the future

Prescriptive Analytics

• Uses optimization and simulation algorithms to advise on possible outcomes



Big Data Domains



Scope of Big Data



Problems with Traditional Approach



Evolution of Hadoop



What is Hadoop?

• Hadoop is a framework that lets you first store Big Data in a
distributed environment so that you can then process it in parallel.

• HDFS (Hadoop Distributed File System) - storage
• YARN (Yet Another Resource Negotiator) - resource management



Advantages Of HDFS
1. Distributed Storage

2. Distributed & Parallel Computation

3. Horizontal Scalability



HDFS



Hadoop – NameNode



Hadoop - NameNode
• Master daemon that maintains and manages the DataNodes (slave nodes)

• Records the metadata of all the blocks stored in the cluster: the
location of the blocks, the size of the files, permissions, hierarchy, etc.

• Records every change made to the file system metadata

• If a file is deleted in HDFS, the NameNode immediately records this in the EditLog

• Regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are alive

• Keeps a record of all the blocks in HDFS and the DataNodes on which they are stored
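
The bookkeeping described above can be pictured with a small in-memory model. This is only a sketch: `ToyNameNode`, its methods, and the DataNode names are hypothetical and are not part of the HDFS API. It shows a block map (file to blocks to DataNodes) and an append-only edit log that records a deletion as soon as it happens.

```python
class ToyNameNode:
    """Hypothetical in-memory model of NameNode bookkeeping (not the HDFS API)."""

    def __init__(self):
        self.block_map = {}  # file path -> {block id: [DataNode ids holding a replica]}
        self.edit_log = []   # append-only list of namespace changes

    def add_file(self, path, blocks):
        self.block_map[path] = blocks
        self.edit_log.append(("ADD", path))

    def delete_file(self, path):
        del self.block_map[path]
        self.edit_log.append(("DELETE", path))  # recorded immediately

    def locate(self, path, block_id):
        return self.block_map[path][block_id]


nn = ToyNameNode()
# Illustrative placement; the DataNode names are made up for this example.
nn.add_file("/example.txt", {"A": ["dn1", "dn4", "dn6"], "B": ["dn3", "dn7", "dn9"]})
block_a_nodes = nn.locate("/example.txt", "A")
nn.delete_file("/example.txt")

print(block_a_nodes)
print(nn.edit_log)
```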



Secondary NameNode:



Hadoop - DataNode
• Slave daemon that runs on each slave machine

• Stores the actual data

• Serves read and write requests from the clients

• Creates, deletes, and replicates blocks based on the decisions taken
by the NameNode

• Sends heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, a heartbeat is sent every 3 seconds
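
A rough sketch of how heartbeats translate into a liveness decision is given here. `HeartbeatMonitor` and its methods are hypothetical names for illustration. The 630-second timeout assumes default HDFS settings (twice the 5-minute recheck interval plus ten 3-second heartbeats); a cluster with different configuration would use a different value.

```python
HEARTBEAT_INTERVAL_S = 3  # default heartbeat interval mentioned on the slide
# Assumed default dead-node timeout: 2 * 300 s recheck + 10 heartbeats = 630 s
DEAD_TIMEOUT_S = 2 * 300 + 10 * HEARTBEAT_INTERVAL_S


class HeartbeatMonitor:
    """Toy stand-in for the NameNode's liveness tracking (hypothetical API)."""

    def __init__(self, timeout_s=DEAD_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of last heartbeat, in seconds

    def heartbeat(self, node_id, now_s):
        self.last_seen[node_id] = now_s

    def live_nodes(self, now_s):
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t <= self.timeout_s)

    def dead_nodes(self, now_s):
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t > self.timeout_s)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn2", 0)  # dn2 reports once, then goes silent
for t in range(0, 700, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", t)  # dn1 keeps reporting every 3 seconds

print(monitor.live_nodes(now_s=700))
print(monitor.dead_nodes(now_s=700))
```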



Blocks



Replication Management



HDFS Write Architecture

• File “example.txt” (248 MB) is split into 2 blocks:
• 128 MB (Block A)
• 120 MB (Block B)
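
The block sizes above follow from simple arithmetic, sketched here. The function name `split_into_blocks` is hypothetical. The 128 MB block size is the usual default in Hadoop 2 and later, and the replication factor of 3 is an assumption taken from the common default.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


REPLICATION_FACTOR = 3  # assumed default replication factor

sizes = split_into_blocks(248)
total_stored_mb = sum(sizes) * REPLICATION_FACTOR

print(sizes)            # [128, 120]
print(total_stored_mb)  # 744
```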



Data copy process
• Three stages:
• Set up of Pipeline
• Data streaming and replication
• Shutdown of Pipeline (Acknowledgement stage)



For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.









For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
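
The three stages of the data copy process can be sketched for Block A as follows. This is a simplified simulation, not HDFS client code: `write_block` is a hypothetical helper, data is modelled as flowing from one DataNode to the next in list order, and acknowledgements are modelled as travelling back in the reverse order.

```python
def write_block(block_id, pipeline):
    """Simulate streaming one block through a DataNode pipeline.

    Returns the replica placement and the order in which acknowledgements
    travel back towards the client (the reverse of the pipeline order).
    """
    stored = {}
    for node in pipeline:            # data flows client -> first -> second -> third
        stored[node] = block_id      # each DataNode stores a replica and forwards it
    acks = list(reversed(pipeline))  # acknowledgements flow back from the last DataNode
    return stored, acks


list_a = ["DataNode 1", "DataNode 4", "DataNode 6"]
stored, acks = write_block("Block A", list_a)

print(stored)
print(acks)
```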



MapReduce: Traditional Way



What is MapReduce?
• Framework for distributed, parallel processing of large data sets

• 2 tasks – Map and Reduce
• Map: a block of data is read and processed to produce key-value pairs as intermediate output
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer
• Reduce: the Reducer aggregates those intermediate key-value pairs into a smaller set of
key-value pairs



MapReduce: Word Count

Deer, Bear, River, Car, Car, River, Deer, Car and Bear
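
A minimal single-process sketch of the word count follows, using the nine words from the slide (spelled "Deer" throughout). It only imitates the map, shuffle, and reduce phases in plain Python and does not use the Hadoop API. The division into three input splits and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Assumed division of the input into three splits of three words each.
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

intermediate = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle(intermediate))

print(word_counts)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```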



YARN



Resource Manager

Cluster-level component (one per cluster) that runs on the master machine

Manages resources and schedules applications running on top of YARN

Keeps track of the heartbeats from the Node Managers

Two Components:

Scheduler - Responsible for allocating resources to the various running applications

Application Manager - Responsible for accepting job submissions and negotiating the first
container for executing the application



Node Manager
• Node-level component (one per node) that runs on each slave machine

• Responsible for managing containers and monitoring resource utilization in each
container

• Keeps track of node health and log management

• Communicates continuously with the Resource Manager to remain up to date



Application Submission in YARN
1) Submit the job

2) Get Application ID

3) Application Submission Context

4 a) Start Container Launch


b) Launch Application Master

5) Allocate Resources

6 a) Container
b) Launch

7) Execute



Application Workflow in Hadoop YARN
1. Client submits an application
2. Resource Manager allocates a container to start the Application Master
3. Application Master registers with the Resource Manager
4. Application Master requests containers from the Resource Manager
5. Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the application’s status
8. Application Master unregisters with the Resource Manager
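
The eight steps can be walked through with a toy model. Everything in this sketch is hypothetical: `ToyResourceManager`, `run_application`, the slot-based resource model, and the first-fit allocation are simplifications for illustration and do not reflect the YARN API or its real schedulers. The per-application process is called the Application Master in the code.

```python
class ToyResourceManager:
    """Hypothetical, heavily simplified stand-in for the YARN Resource Manager."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)  # node name -> free container slots
        self.registered = set()             # applications with a live Application Master

    def allocate(self, count):
        """Scheduler role: grant up to `count` slots, first-fit in node-name order."""
        granted = []
        for node in sorted(self.free_slots):
            while self.free_slots[node] > 0 and len(granted) < count:
                self.free_slots[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes):
        for node in nodes:
            self.free_slots[node] += 1


def run_application(rm, app_id, containers_needed):
    """Walk through steps 1-8 and return a log of what happened."""
    log = [f"1. client submits {app_id}"]
    am = rm.allocate(1)
    if not am:
        raise RuntimeError("no capacity to start the Application Master")
    log.append(f"2. Application Master container allocated on {am[0]}")
    rm.registered.add(app_id)
    log.append("3. Application Master registers")
    workers = rm.allocate(containers_needed)
    log.append(f"4. containers granted on {workers}")
    log.append("5. Node Managers launch the containers")
    log.append("6. application code runs in the containers")
    log.append("7. client polls for status")
    rm.registered.discard(app_id)
    rm.release(am + workers)  # all slots return to the pool when the job ends
    log.append("8. Application Master unregisters")
    return log


rm = ToyResourceManager({"node1": 2, "node2": 2})
for line in run_application(rm, "app_0001", containers_needed=2):
    print(line)
```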



Hadoop Ecosystem



Apache Spark
• Framework for real-time data analytics in a distributed computing environment

• Executes computations in memory to process data faster than MapReduce

• Up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting
in-memory computations and other optimizations



Amazon EMR

Provides a managed Hadoop framework using the elastic infrastructure
of Amazon EC2 and Amazon S3.

Distributes computation of the data over multiple Amazon EC2 instances.

Analysis of the data is easy with Amazon Elastic MapReduce



Benefits of Amazon EMR

• Elastic - Auto Scaling can be used to modify the number of instances automatically

• Economical – Low cost, with support for Amazon EC2 Spot and Reserved Instances

• Secure - Built-in firewall settings protect instances and control network access
to them

• Flexible - Allows root access to any instance, installation of additional
applications, and customization of the cluster with bootstrap actions

