Bigdata & Hadoop: Shushrutha Reddy K M.Tech in Computational Engineering From Rgukt Senior Bigdata Developer @servicenow

BigData & Hadoop
Shushrutha Reddy K
M.Tech in Computational Engineering from RGUKT
Senior BigData Developer @ServiceNow
Agenda
• Bigdata
• Hadoop
• MapReduce
• YARN
• Spark
• Amazon EMR
Friday, 21 January 2022 2
How It All Started?

What is BigData?
BigData is a term used for a collection of data sets that are so large and
complex that they are difficult to store and process using available database
management tools or traditional data processing applications.
The challenges include capturing, curating, storing, searching, sharing,
transferring, analysing and visualizing the data.

Every minute:

Characteristics:

Types of Big Data
• Three types:
• Structured - stored and processed in a fixed format - SQL
• Semi-Structured - XML files or JSON
• Unstructured - Text Files, images, audios, videos
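
The three types above can be illustrated with a minimal Python sketch. All sample records here are hypothetical and exist only to show the difference in shape between the types.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical data).
structured = io.StringIO("id,name,city\n1,Asha,Hyderabad\n2,Ravi,Bengaluru\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields may vary (hypothetical data).
semi_structured = json.loads(
    '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "note": "no tags"}]'
)

# Unstructured: free text with no schema; any structure has to be derived.
unstructured = "Customer called twice about a delayed order and asked for a refund."
word_count = len(unstructured.split())

print(rows[0]["city"])               # a column can be addressed by name
print("tags" in semi_structured[1])  # the second record has no "tags" field
print(word_count)
```

The structured rows can be queried by column name directly, the semi-structured records need a per-record check for each field, and the unstructured text yields nothing until it is processed.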

Why Big Data Analytics?
• Making organisations smarter and more efficient
• Optimizing business operations by analysing customer behaviour
• Cost reduction
• New generation products

Stages in Big Data Analytics

Types of Big Data Analytics
Descriptive Analytics
• data aggregation and data mining to provide insight into the past
Diagnostic Analytics
• determines why something happened in the past
Predictive Analytics
• statistical models and forecasting techniques to understand the future
Prescriptive Analytics
• optimization and simulation algorithms to advise on possible outcomes

Big Data Domains

Scope of Big Data

Problems with Traditional Approach

Evolution of Hadoop

What is Hadoop?
• Hadoop is a framework that allows you to first store Big Data in a
distributed environment, so that you can process it in parallel.
• HDFS (Hadoop Distributed File System)
• storage
• YARN (Yet Another Resource Negotiator)
• resource management

Advantages Of HDFS
1. Distributed Storage
2. Distributed & Parallel Computation
3. Horizontal Scalability

HDFS

Hadoop - NameNode
• Master daemon that maintains and manages the DataNodes (slave nodes)
• Records the metadata of all the blocks stored in the cluster, e.g. the
location of the blocks, the size of the files, permissions, hierarchy, etc.
• Records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• Regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are alive
• Keeps a record of all the blocks in HDFS and the DataNodes on which they are stored

Secondary NameNode:

Hadoop - DataNode
• Slave daemon which runs on each slave machine
• The actual data is stored on DataNodes
• Responsible for serving read and write requests from the clients
• Responsible for creating blocks, deleting blocks and replicating the
same based on the decisions taken by the NameNode
• Sends heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, the heartbeat interval is 3 seconds
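
The heartbeat mechanism can be sketched as a small bookkeeping class on the NameNode side. This is a toy model, not Hadoop code: the class name, the node names and the 30-second timeout are assumptions made for illustration, and only the 3-second interval comes from the slide.

```python
HEARTBEAT_INTERVAL_S = 3   # default interval stated on the slide
DEAD_AFTER_S = 30          # hypothetical timeout, chosen only for this sketch


class HeartbeatTracker:
    """Toy NameNode-side record of when each DataNode last reported in."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}  # DataNode id -> time of last heartbeat (seconds)

    def heartbeat(self, node_id, now_s):
        self.last_seen[node_id] = now_s

    def live_nodes(self, now_s):
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t <= self.dead_after_s)

    def dead_nodes(self, now_s):
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t > self.dead_after_s)


tracker = HeartbeatTracker()
# dn1 reports every 3 seconds for a minute; dn2 goes silent after t=0.
for t in range(0, 61, HEARTBEAT_INTERVAL_S):
    tracker.heartbeat("dn1", t)
tracker.heartbeat("dn2", 0)

print(tracker.live_nodes(60))  # ['dn1']
print(tracker.dead_nodes(60))  # ['dn2']
```

A node that stops reporting is the signal for the NameNode to re-replicate the blocks that node held.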

Blocks

Replication Management

HDFS Write Architecture
• File “example.txt” (248 MB) is split into 2 blocks
• 128 MB (Block A)
• 120 MB (Block B)
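
The split above follows from simple arithmetic, sketched below. The function name is hypothetical, the 128 MB block size is taken from Block A, and the 248 MB file size is assumed to be the sum of the two blocks shown.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        blocks.append(remainder)
    return blocks


print(split_into_blocks(248))  # [128, 120]
```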

Data copy process
• Three stages:
• Set up of Pipeline
• Data streaming and replication
• Shutdown of Pipeline (Acknowledgement stage)

For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.



For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
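
The three stages of the copy process can be traced with a small simulation. This is a sketch under stated assumptions, not the HDFS client protocol: the function name and log messages are invented for illustration, and the DataNode list is the one given for Block A.

```python
def write_block(block_id, pipeline):
    """Simulate the three stages for one block and return the event log."""
    log = []
    # Stage 1: set up the pipeline by contacting each DataNode in order.
    for node in pipeline:
        log.append(f"setup {block_id} on {node}")
    # Stage 2: stream the data; each node stores it and forwards it to the next.
    for src, dst in zip(["client"] + pipeline, pipeline):
        log.append(f"copy {block_id}: {src} -> {dst}")
    # Stage 3: acknowledgements travel back in reverse order, then the pipeline closes.
    for node in reversed(pipeline):
        log.append(f"ack {block_id} from {node}")
    log.append(f"close pipeline for {block_id}")
    return log


list_a = ["DataNode 1", "DataNode 4", "DataNode 6"]
log = write_block("Block A", list_a)
for line in log:
    print(line)
```

The client sends the data only to the first DataNode in the list; replication to the other two happens node to node, which is why the acknowledgements come back in reverse order.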

MapReduce: Traditional Way

What is MapReduce?
• Framework that allows us to perform distributed and parallel processing on large data
sets in a distributed environment
• 2 tasks – Map and Reduce
• Map: a block of data is read and processed to produce key-value pairs as intermediate outputs
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer
• Reduce: the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a
smaller set of tuples or key-value pairs

MapReduce: Word Count
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
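
The word count can be sketched in plain Python to show the map, shuffle and reduce steps on the sentence above. This runs in a single process and only mimics the data flow; the function names, the three-words-per-split choice and dropping the connective "and" are assumptions made for illustration. The words are counted exactly as written, so "Dear" and "Deer" are distinct keys.

```python
import re
from collections import defaultdict

text = "Dear, Bear, River, Car, Car, River, Deer, Car and Bear"


def split_input(text, words_per_split=3):
    """Break the input into splits, one per (simulated) mapper."""
    words = [w for w in re.findall(r"[A-Za-z]+", text) if w.lower() != "and"]
    return [words[i:i + words_per_split]
            for i in range(0, len(words), words_per_split)]


def map_phase(split):
    """Emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in split]


def shuffle(mapped):
    """Group the intermediate values by key across all mappers."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


splits = split_input(text)
counts = reduce_phase(shuffle([map_phase(s) for s in splits]))
print(counts)  # {'Bear': 2, 'Car': 3, 'Dear': 1, 'Deer': 1, 'River': 2}
```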

YARN

Resource Manager
• Cluster-level component (one for each cluster) that runs on the master machine
• Manages resources and schedules applications running on top of YARN
• Keeps track of the heartbeats from the Node Managers
• Two components:
• Scheduler: responsible for allocating resources to the various running applications
• Application Manager: responsible for accepting job submissions and negotiating
the first container for executing the application

Node Manager
• Node-level component (one on each node) that runs on each slave machine
• Responsible for managing containers and monitoring resource utilization in each
container
• Keeps track of node health and log management
• Continuously communicates with the Resource Manager to remain up-to-date

Application Submission in YARN
1) Submit the job
2) Get Application ID
3) Application Submission Context
4) a) Start Container Launch
   b) Launch Application Master
5) Allocate Resources
6) a) Container
   b) Launch
7) Execute
7) Execute

Application Workflow in Hadoop YARN
1. Client submits an application
2. Resource Manager allocates a container to start the Application Master
3. Application Master registers with the Resource Manager
4. Application Master requests containers from the Resource Manager
5. Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the application’s status
8. Application Master unregisters with the Resource Manager

Hadoop Ecosystem

Apache Spark
• Framework for near real-time data analytics in a distributed computing environment
• Executes computations in memory to speed up data processing compared with MapReduce
• Can be up to 100x faster than Hadoop MapReduce for some large-scale workloads, by exploiting
in-memory computation and other optimizations

Amazon EMR
Provides a managed Hadoop framework using the elastic infrastructure
of Amazon EC2 and Amazon S3.
Distributes computation of the data over multiple Amazon EC2 instances.
Analysis of the data is easy with Amazon Elastic MapReduce.

Benefits of Amazon EMR
• Elastic - Auto Scaling can be used to modify the number of instances automatically
• Economical - Low cost, with support for Amazon EC2 Spot and Reserved Instances
• Secure - Built-in firewall settings protect the cluster and control network access
to instances
• Flexible - Allows root access to any instance, installation of additional
applications, and customization of the cluster with bootstrap actions
