Big Data - S
DATA AND BIG DATA
Big data refers to large volumes of structured and unstructured data. Analyzing
big data leads to better insights for business.
FACTS ABOUT BIG DATA
• Personalized Marketing
• Recommendation Engines
• Sentiment Analysis
• Personalize Marketing to the Consumer Behavior
• Genome Sequencing
• Sensors Data
PERSONALIZED MARKETING
A recommendation engine is a type of data filtering tool that uses machine learning
algorithms to recommend the most relevant items to a particular user or customer.
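As a rough illustration of the filtering idea, the sketch below ranks items by how often they were bought by users whose purchases overlap with the target user's. It is a deliberately simple co-occurrence heuristic standing in for a learned model, not a machine learning algorithm; the function name `recommend` and the sample purchase data are hypothetical.

```python
from collections import Counter


def recommend(purchases, user, top_n=2):
    """Suggest items owned by users who share at least one item with `user`.

    `purchases` maps a user name to the set of items that user bought.
    This is a toy co-occurrence heuristic, not a trained model.
    """
    own = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(own & items)  # how similar the two users are
        if overlap == 0:
            continue
        for item in items - own:  # only items the user does not have yet
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical sample data.
purchases = {
    "ana": {"laptop", "mouse"},
    "ben": {"laptop", "mouse", "keyboard"},
    "cy": {"mouse", "monitor"},
    "dee": {"novel"},
}
print(recommend(purchases, "ana"))
```

A production engine would typically learn these relevance scores from data rather than count overlaps directly.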
SENTIMENT ANALYSIS
Machine-Generated Data
Big Data Generated By People
Organization-Generated Data
MACHINE-GENERATED DATA
BIG DATA GENERATED BY PEOPLE
ORGANIZATION-GENERATED DATA
DATA IN ZETTABYTES
BIG DATA: CASE STUDY
• Traditionally, such data was analyzed using a computer algorithm designed to
produce a correct solution for any given instance.
• As the data started to grow, a series of computers was employed to do the
analysis. Such groups of computers are known as distributed systems.
CHARACTERISTICS OF BIG DATA
DATA PROCESSING USING TRADITIONAL SYSTEM
AGENDA
FILE SYSTEM
DISTRIBUTED SYSTEMS
• High chances of system failure
• Limited bandwidth
• High programming complexity
Hadoop is used to overcome these challenges.
HDFS FILE SYSTEM AND IMPORTANCE
INTRODUCTION TO BIG DATA AND HADOOP
• What Is Hadoop?
• Hadoop Introduction
• Importance of HDFS Architecture
• Hadoop Eco System
• HDFS Architecture
• Hadoop Setup
HADOOP CORE COMPONENTS
COMPONENTS OF HADOOP ECOSYSTEM
• Hadoop MapReduce is a framework that processes data. It is the original Hadoop
processing engine, which is primarily Java-based.
• It is based on the map and reduce programming model.
• It has extensive and mature fault tolerance.
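To make the map and reduce programming model concrete, here is a minimal single-process sketch of a word count. It imitates the three conceptual stages (map, shuffle, reduce) in plain Python; it does not use Hadoop or its Java API, and the function names are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce calls would run in parallel on different nodes, with the framework handling the shuffle and any task failures.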
• Impala is an open-source, high-performance SQL engine that runs on the Hadoop
cluster.
• It is ideal for interactive analysis and has very low latency, which can be
measured in milliseconds.
• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
• Hive is an abstraction layer on top of Hadoop that executes queries using
MapReduce.
• It is preferred for data processing and ETL (Extract, Transform, Load) and ad hoc
queries.
In the traditional system, storing and retrieving volumes of data had three major
issues:
• Cost: $10,000 to $14,000 per terabyte. HDFS offers zero licensing and support costs.
• Speed: Search and analysis is time-consuming. Hadoop clusters read/write terabytes
of data per second.
• Reliability: Fetching data is difficult. HDFS copies the data multiple times.
WHAT IS HDFS?
HDFS is a distributed file system that provides access to data across Hadoop
clusters.
It manages and supports analysis of very large volumes of Big Data.
CHARACTERISTICS OF HDFS
HDFS is economical.
HDFS STORAGE
HDFS stores files in a number of blocks.
[Figure: a very large data file is split into blocks B1 to B4, which are distributed across Nodes A to E; the NameNode holds the metadata.]
Each block is replicated to a few separate computers.
Data is divided into blocks of 128 MB each.
Metadata keeps information about each block and its replication. It is stored in
the NameNode.
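The metadata described on these slides can be pictured as two lookup tables: one from a file to its ordered blocks, and one from each block to the nodes holding a replica. The sketch below is an in-memory illustration under that assumption; the class name `NameNodeMetadata`, the file path, and the block placements are hypothetical and do not reflect the actual NameNode implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    """Toy model of block metadata: which blocks form a file, and where they live."""

    file_blocks: dict = field(default_factory=dict)      # path -> ordered block ids
    block_locations: dict = field(default_factory=dict)  # block id -> set of node names

    def add_block(self, path, block_id, nodes):
        """Record a new block of `path` and the nodes that store its replicas."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = set(nodes)

    def locate(self, path):
        """Return (block id, sorted node names) for every block of `path`, in order."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


# Hypothetical placements for a file with two blocks.
meta = NameNodeMetadata()
meta.add_block("/data/big.log", "B1", ["A", "C", "D"])
meta.add_block("/data/big.log", "B2", ["A", "B", "E"])
print(meta.locate("/data/big.log"))
```

A client reading the file would consult this kind of mapping first and then fetch each block from one of the listed nodes.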
HDFS ARCHITECTURE AND COMPONENTS
HDFS ARCHITECTURE
It is also known as the master and slave architecture.
[Figure: the master (NameNode) maintains the file system metadata, for example File.txt = AC with DN1: A,C, DN2: A,C, DN3: A,C; the slave nodes store the blocks.]
The master is responsible for accepting jobs from clients and maintains the file
system metadata.
A file is split into one or more blocks, which are stored and replicated in the
slave nodes.
Data required for the
HDFS COMPONENTS
• NameNode
• Secondary NameNode
• File system
• Metadata
• DataNode
NAMENODE
[Figure: the NameNode holds the file system metadata for the HDFS cluster.]
There can be only one Secondary NameNode server in a cluster. It cannot be treated
as a disaster recovery server, although it can partially restore the NameNode server
in case of failure.
FILE SYSTEM
[Figure: file system namespace showing /Dir 1.1, File A, /Dir 2.1, and File B.]
METADATA
[Figure: file system metadata and its relation to the DataNode.]
DATANODE
[Figure: a client, the metadata, and DataNodes 1 to 5 holding blocks, arranged in racks.]
DATA BLOCK SPLIT
Data block split is an important process of HDFS architecture. Each file is split into one or more
blocks and the blocks are stored and replicated in DataNodes.
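The split itself is simple arithmetic, sketched below under the assumption of the 128 MB block size mentioned earlier in these slides. The function name `split_into_blocks` is hypothetical; it only computes byte ranges and does not touch a real file system.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size stated in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) byte ranges, one per block.

    All blocks have the same size except possibly the last one.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
for offset, length in split_into_blocks(300 * MB):
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

For a 300 MB file this yields two full 128 MB blocks followed by a final 44 MB block.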
[Figure: the NameNode tracks a file split into blocks b1, b2, b3, ..., which the DataNodes manage.]
[Figure: block replication of B1, B2, and B3 between DataNode server 1 and DataNode server 2, with the NameNode, the JobTracker, and a resubmitted Job 1.]
REPLICATION METHOD
• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file
level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
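The heartbeat and block report mechanism above can be sketched as follows: nodes whose heartbeat is too old are treated as dead, and the block reports of the remaining nodes are counted to find blocks with too few replicas. The 30-second timeout, the replication factor of 3, the function names, and the sample data are illustrative assumptions, not Hadoop defaults taken from the source.

```python
from collections import defaultdict


def live_nodes(last_heartbeat, now, timeout_s=30.0):
    """Return the nodes whose most recent heartbeat is within `timeout_s` of `now`."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout_s}


def under_replicated(block_reports, live, replication_factor=3):
    """Map each block to the number of replicas it is missing.

    Only block reports from live nodes are counted. Blocks that no live
    node reports at all are not visible here and would need separate handling.
    """
    counts = defaultdict(int)
    for node, blocks in block_reports.items():
        if node not in live:
            continue
        for block in set(blocks):
            counts[block] += 1
    return {b: replication_factor - n for b, n in counts.items() if n < replication_factor}


# Hypothetical cluster state: dn3 stopped sending heartbeats.
block_reports = {"dn1": ["B1", "B2"], "dn2": ["B1", "B2"], "dn3": ["B1"]}
last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}

alive = live_nodes(last_heartbeat, now=110.0)
print(sorted(alive))
print(under_replicated(block_reports, alive))
```

Blocks reported as under-replicated would then be copied to additional DataNodes to restore the configured replication factor.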
WHAT IS A RACK?
• The NameNode obtains rack information by maintaining the rack ID of each DataNode.
• Choosing closer DataNodes based on rack information is called Rack Awareness
in Hadoop.
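One way to picture "closer" is a distance computed from rack IDs. The sketch below assumes a two-level topology and uses the values 0 (same node), 2 (same rack), and 4 (different rack), modeled on the tree-based distance commonly described for Hadoop; the function names and node labels are hypothetical.

```python
def distance(a, b):
    """Distance between two nodes, each given as a (rack_id, node_id) tuple.

    Assumed convention: 0 for the same node, 2 for the same rack,
    4 for different racks.
    """
    if a == b:
        return 0
    if a[0] == b[0]:
        return 2
    return 4


def closest_replica(reader, replicas):
    """Pick the replica nearest to `reader`; ties are broken by node label."""
    return min(replicas, key=lambda r: (distance(reader, r), r))


reader = ("R1", "N1")
replicas = [("R2", "N1"), ("R1", "N3"), ("R2", "N3")]
print(closest_replica(reader, replicas))
```

Here the replica on the reader's own rack wins over the two replicas on the other rack.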
REPLICATION AND RACK AWARENESS IN HADOOP
The topology of the replica is critical to ensure the reliability of HDFS. Usually, data is replicated thrice.
The suggested replication topology is as follows:
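• The first replica is placed on the local node, that is, the node the client writes from.
• The second replica is placed on a node in a different rack.
• The third replica is placed in the same rack as the second one, but on a different
node.
[Figure: racks of nodes; block B1 is stored on R1N3, R2N1, and R2N3.]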
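A rack-aware placement of three replicas can be sketched as below. It assumes the commonly documented HDFS default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that second rack). Real HDFS chooses among candidates randomly and considers load; this sketch picks the first candidates in sorted order so the result is reproducible. The function name and cluster layout are hypothetical.

```python
def place_replicas(writer, cluster):
    """Choose three (rack, node) locations for one block.

    `writer` is the (rack, node) the client writes from.
    `cluster` maps each rack id to the list of node ids in that rack.
    """
    writer_rack, writer_node = writer
    if writer_node not in cluster.get(writer_rack, []):
        raise ValueError("writer is not part of the cluster")
    # Candidate remote racks need at least two nodes for replicas two and three.
    remote_racks = sorted(
        rack for rack, nodes in cluster.items()
        if rack != writer_rack and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("no remote rack with at least two nodes")
    rack = remote_racks[0]
    second, third = [(rack, node) for node in sorted(cluster[rack])[:2]]
    return [writer, second, third]


# Hypothetical three-rack cluster.
cluster = {
    "R1": ["N1", "N2", "N3"],
    "R2": ["N1", "N2", "N3"],
    "R3": ["N1", "N2"],
}
print(place_replicas(("R1", "N3"), cluster))
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two replicas on one rack limits cross-rack write traffic.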
REPLICATION AND RACK AWARENESS: EXAMPLE
The diagram illustrates a Hadoop cluster with three racks. Each rack contains multiple
nodes.
[Figure: a NameNode and three racks of nodes storing blocks B1, B2, and B3.]
R1N1 represents Node 1 on Rack 1, and so on.
The NameNode decides which DataNode belongs to which rack.
INTRODUCTION TO YARN (YET ANOTHER RESOURCE NEGOTIATOR)
WHAT IS YARN: CASE STUDY
Yahoo was the first company to embrace Hadoop and became a trendsetter within the Hadoop ecosystem.
In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop
infrastructure due to MapReduce limitations.
Both iterative and stream processing were important for Yahoo in facilitating its move from batch
computing to continuous computing.