
WELCOME TO THE BIG DATA ERA!

Big Data Introduction


AGENDA

• DATA and Big Data
• Facts About Big Data
• Big Data Sources
• Big Data Case Study
• Characteristics of Big Data
• Big Data System Requirements
• Data Processing Using Traditional Systems
DATA AND BIG DATA

Anything raw, that is, anything that has not yet been processed, is called data.

DATA >> processed >> INFORMATION

Example: a mobile phone [storage, RAM] cannot handle 10 TB of data; the existing infrastructure does not support it.

Any data that is beyond the processing capacity of the existing infrastructure is called Big Data.
WHAT IS BIG DATA?

Big data refers to the large volume of structured and unstructured data. The
analysis of big data leads to better insights for business.
FACTS ABOUT BIG DATA

• Personalized Marketing
• Recommendation Engines
• Sentiment Analysis
• Personalized Marketing Based on Consumer Behavior
• Genome Sequencing
• Sensor Data
PERSONALIZED MARKETING

By collecting data, creating target personas, anticipating the future needs of customers, and understanding how they have interacted with you, personalized marketing allows your business to craft and deliver messages that are relatable and relevant to the target audience.
RECOMMENDATION ENGINES

A recommendation engine is a type of data filtering tool that uses machine learning algorithms to recommend the most relevant items to a particular user or customer.
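The idea can be sketched as a toy item-to-item co-occurrence recommender in Python. This is a hedged illustration only, not a production algorithm; the `recommend` helper and the sample viewing histories are invented for this sketch:

```python
from collections import Counter

def recommend(user_history, all_histories, top_n=2):
    """Recommend items that co-occur most often with the user's items.

    A toy item-to-item collaborative filter: score candidate items by how
    often they appear in other histories that overlap with the user's.
    """
    scores = Counter()
    seen = set(user_history)
    for history in all_histories:
        if seen & set(history):          # this history overlaps with ours
            for item in history:
                if item not in seen:     # only score unseen items
                    scores[item] += 1
    return [item for item, _ in scores.most_common(top_n)]

histories = [
    ["matrix", "inception", "interstellar"],
    ["matrix", "inception", "tenet"],
    ["titanic", "notebook"],
]
print(recommend(["matrix"], histories))  # 'inception' ranks first (2 co-occurrences)
```

Real recommendation engines use far richer signals (ratings, timestamps, embeddings), but the core idea of scoring unseen items by affinity with a user's history is the same.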
SENTIMENT ANALYSIS

• Sentiment analysis studies the subjective information in an expression, that is, the opinions, appraisals, emotions, or attitudes towards a topic, person, or entity. Expressions can be classified as positive, negative, or neutral.

• Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
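A minimal sketch of the classification step, using a hand-made word lexicon in place of real NLP models (the `POSITIVE`/`NEGATIVE` word sets and the `classify_sentiment` helper are invented for illustration):

```python
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def classify_sentiment(text):
    """Classify an expression as positive, negative, or neutral by
    counting lexicon hits -- a toy stand-in for real NLP pipelines."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great product"))    # positive
print(classify_sentiment("terrible support, awful app"))  # negative
```

Production systems replace the lexicon with trained models, but the output classes are the same three named above.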
PERSONALIZED MARKETING BASED ON CONSUMER BEHAVIOR
GENOME SEQUENCING

Genome sequencing is a laboratory method used to determine the entire genetic makeup of a specific organism or cell type. This method can be used to find changes in areas of the genome, which may help scientists understand how specific diseases, such as cancer, form.
SENSOR DATA
BIG DATA SOURCES

Machine-Generated Data
Big Data Generated By People
Organization-Generated Data
MACHINE-GENERATED DATA
BIG DATA GENERATED BY PEOPLE
ORGANIZATION-GENERATED DATA
DATA IN ZETTABYTES
BIG DATA: CASE STUDY

• Netflix is one of the largest providers of commercial streaming video in the US, with a customer base of over 29 million.

•It receives a huge volume of behavioral data.

• When do users watch a show?

• Where do they watch it?

• On which device do they watch the show?

• How often do they pause a program?

• How often do they re-watch a program?

• Do they skip the credits?

• What are the keywords searched?


BIG DATA: CASE STUDY

• Traditionally, the analysis of such data was done using a computer algorithm designed to produce a correct solution for any given instance.

• As the data started to grow, a group of networked computers was employed to do the analysis; such a setup is known as a distributed system.
CHARACTERISTICS OF BIG DATA
DATA PROCESSING USING TRADITIONAL SYSTEMS
AGENDA

• File System, DFS
• HDFS and Its Importance
• Data Processing Using Hadoop
• History of Hadoop
• Traditional Database Systems vs. Hadoop

FILE SYSTEM
DISTRIBUTED SYSTEMS

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.
HOW DOES A DISTRIBUTED SYSTEM WORK ?
CHALLENGES OF DISTRIBUTED SYSTEMS

• High chances of system failure
• Limited bandwidth
• High programming complexity

Hadoop is used to overcome these challenges!
HDFS FILE SYSTEM AND ITS IMPORTANCE
INTRODUCTION TO BIG DATA AND HADOOP

• What Is Hadoop?

Hadoop is a framework that allows distributed processing of large datasets across clusters of computers using simple programming models.

Doug Cutting created Hadoop and named it after his son's yellow toy elephant. It was inspired by technical papers published by Google.
CHARACTERISTICS OF HADOOP

• Scalable: supports both horizontal and vertical scaling

• Reliable: stores copies of the data on different machines and is resistant to hardware failure

• Flexible: stores a lot of data and enables you to use it later

• Economical: ordinary computers can be used for data processing
TRADITIONAL DATABASE SYSTEMS VS.
HADOOP
• Traditional Database Systems: Data is stored in a central location and sent to the processor at run time. These systems cannot be used to process and store a large amount of data (big data). A traditional RDBMS manages only structured and semi-structured data; it cannot manage unstructured data.

• Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located. It works better when the data size is big: it can process and store a large amount of data easily and effectively, and it can handle a variety of data, whether structured or unstructured.
AGENDA

• Hadoop Introduction
• Importance of HDFS Architecture
• Hadoop Ecosystem
• HDFS Architecture
• Hadoop Setup

HADOOP CORE COMPONENTS
COMPONENTS OF HADOOP ECOSYSTEM

• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.

• It provides file permissions, authentication, and streaming access to file system data.

• HDFS can be accessed through the Hadoop command line interface.
COMPONENTS OF HADOOP ECOSYSTEM

• HBase is a NoSQL, or non-relational, database that stores data in HDFS.

• It supports a high volume of data and high throughput.

• It is used when you need random, real-time read/write access to your big data.
COMPONENTS OF HADOOP ECOSYSTEM

• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.

• It is used to import data from relational databases such as Oracle and MySQL to HDFS, and to export data from HDFS to relational databases.
COMPONENTS OF HADOOP ECOSYSTEM

• Flume is a distributed service for ingesting streaming data, suited for event data from multiple systems.

• It has a simple and flexible architecture based on streaming data flows.

• It is robust and fault tolerant, with tunable reliability mechanisms.

• It uses a simple extensible data model.
COMPONENTS OF HADOOP ECOSYSTEM

• Spark is an open-source cluster computing framework that supports machine learning, business intelligence, streaming, and batch processing.

• Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional-style API.
COMPONENTS OF HADOOP ECOSYSTEM

• Hadoop MapReduce is a framework that processes data. It is the original Hadoop processing engine, which is primarily Java-based.

• It is based on the map and reduce programming model.

• It has extensive and mature fault tolerance.

• Hive and Pig are built on the MapReduce model.
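The map and reduce model can be illustrated with a word count, the canonical MapReduce example, simulated here in plain Python. The function names are ours for this sketch; real Hadoop jobs implement Mapper and Reducer classes in Java, and the framework performs the shuffle between the two phases:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big ideas", "big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In a real cluster, the map tasks run on the DataNodes holding the input blocks and the reduce tasks aggregate the shuffled output, which is what lets the computation scale with the data.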
COMPONENTS OF HADOOP ECOSYSTEM

• Once the data is processed, it is analyzed using an open-source, high-level dataflow system called Pig.

• Pig converts its scripts to Map and Reduce code to reduce the effort of writing complex MapReduce programs.

• Ad hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be done easily using Pig.
COMPONENTS OF HADOOP ECOSYSTEM

• Impala is an open-source, high-performance SQL engine that runs on the Hadoop cluster.

• It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.

• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
COMPONENTS OF HADOOP ECOSYSTEM

• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.

• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
COMPONENTS OF HADOOP ECOSYSTEM

• Cloudera Search is Cloudera's near-real-time access product that enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase.

• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH, Cloudera's Distribution including Hadoop.
BIG DATA PROCESSING
LEARNING OBJECTIVES

Discuss Hadoop Distributed File System (HDFS)

Explain HDFS architecture and components

Describe YARN and its features

Explain YARN architecture
WHY HDFS?

In the traditional system, storing and retrieving volumes of data had three major issues:

1. Cost: $10,000 to $14,000 per terabyte
2. Speed: Search and analysis is time-consuming
3. Reliability: Fetching data is difficult
WHY HDFS?
HDFS resolves all major issues of the traditional file system:

1. Cost: HDFS offers zero licensing and support costs
2. Speed: Hadoop clusters can read/write terabytes of data per second
3. Reliability: HDFS copies the data multiple times
WHAT IS HDFS?

HDFS is a distributed file system that provides access to data across Hadoop
clusters.
It manages and supports analysis of very large volumes of Big Data.
CHARACTERISTICS OF HDFS

• HDFS has high fault-tolerance

• HDFS has high throughput

• HDFS is economical
HDFS STORAGE

HDFS stores files in a number of blocks: a very large data file is split into blocks (B1, B2, B3, B4) that are distributed across the nodes, while the NameNode holds the metadata.
B3
HDFS STORAGE

Each block is replicated to a few separate computers.
HDFS STORAGE

Data is divided into blocks of 128 MB each.
HDFS STORAGE

Metadata keeps information about the blocks and their replication. It is stored in the NameNode.
HDFS ARCHITECTURE AND COMPONENTS
HDFS ARCHITECTURE
It is also known as the master-slave architecture.

Master: The NameNode maintains the file system metadata (e.g., File.txt = A,C with block locations DN1: A,C; DN2: A,C; DN3: A,C). A Secondary NameNode maintains the edit log and fsimage.

Slaves: Data Nodes 1 through N store the actual data blocks.
The NameNode (master) is responsible for accepting jobs from the clients.
The NameNode stores the block locations and their replication.
A file is split into one or more blocks, which are stored and replicated in the slave nodes (DataNodes).
Data required for the operation is loaded and segregated into chunks of data blocks.
HDFS COMPONENTS

The main components of HDFS:

• NameNode
• Secondary NameNode
• File system
• Metadata
• DataNode
HDFS COMPONENTS
NAMENODE

The NameNode server is the core component of an HDFS cluster. It maintains and executes file system namespace operations such as opening, closing, and renaming files and directories that are present in HDFS.

The NameNode is a single point of failure.
HDFS COMPONENTS
NAMENODE: OPERATION

The NameNode maintains two persistent files:

• A transaction log called the EditLog
• A namespace image called the FsImage

At startup, the NameNode retrieves the EditLog and applies its transactions to the FsImage; during operation, it updates the EditLog with new transaction information.
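The startup sequence can be sketched as follows, treating the FsImage as a checkpoint and the EditLog as a list of transactions to replay. The dictionary-based representation and the `replay` helper are simplifications for illustration, not the actual on-disk formats:

```python
def replay(fsimage, edit_log):
    """Apply EditLog transactions to an FsImage checkpoint to
    reconstruct the current namespace, as a NameNode does at startup."""
    namespace = dict(fsimage)  # path -> list of block ids
    for op, path, blocks in edit_log:
        if op == "create":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/data/file_a.txt": ["B1", "B2"]}          # last checkpoint
edits = [("create", "/data/file_b.txt", ["B3"]),       # logged after checkpoint
         ("delete", "/data/file_a.txt", None)]
print(replay(fsimage, edits))  # {'/data/file_b.txt': ['B3']}
```

The split into a checkpoint plus a log is what makes restarts fast: only the transactions since the last FsImage need replaying, not the whole history.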
HDFS COMPONENTS
SECONDARY NAMENODE

The Secondary NameNode server is responsible for maintaining a backup of the NameNode server. It keeps the edit log and namespace image information in sync with the NameNode server.

There can be only one Secondary NameNode server in a cluster. It cannot be treated as a disaster recovery server; it only partially restores the NameNode server in case of failure.
HDFS COMPONENTS
FILE SYSTEM

HDFS exposes a file system namespace and allows user data to be stored in files.

The file system supports operations such as create, remove, move, and rename.
HDFS COMPONENTS
METADATA

HDFS metadata is the structure of HDFS directories and files in a tree. It includes attributes of directories and files, such as ownership, permissions, quotas, and replication factor.
HDFS COMPONENTS
DATANODE

The DataNode is a multiple-instance server. It is responsible for storing and maintaining the data blocks, and it retrieves the blocks when asked by clients or the NameNode.
DATA BLOCK SPLIT

Data block split is an important process of the HDFS architecture. Each file is split into one or more blocks, and the blocks are stored and replicated in DataNodes.

By default, each file block is 128 MB.
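The split can be sketched as simple arithmetic: every block is 128 MB except possibly the last, which holds the remainder. Sizes are in MB here for readability; real HDFS works in bytes, and the block size is configurable rather than fixed:

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks for a file: every block is
    BLOCK_SIZE_MB except possibly the last, which holds the remainder."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    sizes = [BLOCK_SIZE_MB] * full_blocks
    remainder = file_size_mb % BLOCK_SIZE_MB
    if remainder:
        sizes.append(remainder)
    return sizes

print(split_into_blocks(300))  # [128, 128, 44]
```

Note that a 300 MB file occupies three blocks but the last block stores only 44 MB; HDFS blocks do not pad files to the full block size.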
BLOCK REPLICATION ARCHITECTURE

Block replication refers to creating copies of a block in multiple DataNodes. Usually, the data is split into parts, such as part-0 and part-1.
REPLICATION METHOD
• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file
level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
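The last two points can be sketched as follows: from the incoming block reports, the NameNode can derive how many copies of each block exist and flag under-replicated ones. This is a simplification; the `under_replicated` helper and the sample reports are invented for illustration:

```python
from collections import Counter

REPLICATION_FACTOR = 3  # typical default replication factor

def under_replicated(block_reports):
    """Given each DataNode's block report (the list of blocks it holds),
    return the blocks that have fewer copies than the replication
    factor -- roughly what the NameNode derives from the reports."""
    copies = Counter(block
                     for report in block_reports.values()
                     for block in report)
    return {block: n for block, n in copies.items() if n < REPLICATION_FACTOR}

reports = {
    "datanode1": ["B1", "B2"],
    "datanode2": ["B1", "B2"],
    "datanode3": ["B1"],
}
print(under_replicated(reports))  # {'B2': 2} -- B2 needs one more copy
```

When a DataNode's heartbeat stops, its blocks drop out of the aggregated counts, and the NameNode schedules re-replication of whatever falls below the factor.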
WHAT IS A RACK?

A rack is a collection of machines that are physically located in a single place/data center and connected through a network.

In Hadoop, a rack is a physical collection of slave machines put together at a single location for data storage.
RACK AWARENESS IN HADOOP

• In large Hadoop clusters, to reduce network traffic while reading/writing HDFS files, the NameNode chooses DataNodes that are on the same rack or a nearby rack to serve the read/write request.

• The NameNode obtains this rack information by maintaining the rack ID of each DataNode.

• This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
REPLICATION AND RACK AWARENESS IN HADOOP

The topology of the replicas is critical to ensure the reliability of HDFS. Usually, data is replicated three times. The suggested replication topology is as follows:

• The first replica is placed on the same node as the client.

• The second replica is placed on a different rack from that of the first replica.

• The third replica is placed on the same rack as the second one, but on a different node.
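The suggested topology can be sketched as a small placement function. This is a simplification that ignores load balancing and distance heuristics, and the rack/node names are invented for the example:

```python
def place_replicas(client_node, racks):
    """Choose three DataNodes per the suggested topology: first replica
    on the client's node, second on a different rack, third on another
    node of the second replica's rack. `racks` maps rack id -> nodes."""
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    other_rack = next(r for r in racks if r != client_rack)   # a different rack
    second = racks[other_rack][0]
    third = next(n for n in racks[other_rack] if n != second)  # same rack, other node
    return [client_node, second, third]

racks = {"rack1": ["r1n1", "r1n2"], "rack2": ["r2n1", "r2n2"]}
print(place_replicas("r1n1", racks))  # ['r1n1', 'r2n1', 'r2n2']
```

The design trades write bandwidth against safety: two replicas share one remote rack (cheap to write), while the cross-rack copy survives the loss of an entire rack.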
REPLICATION AND RACK AWARENESS: EXAMPLE

The diagram illustrates a Hadoop cluster with three racks, each containing multiple nodes. R1N1 represents Node 1 on Rack 1, and so on. The NameNode decides which DataNode belongs to which rack.
INTRODUCTION TO YARN (YET ANOTHER RESOURCE NEGOTIATOR)

WHAT IS YARN: CASE STUDY

Yahoo was the first company to embrace Hadoop and became a trendsetter within the Hadoop ecosystem.

In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure due to MapReduce limitations.

Both iterative and stream processing were important for Yahoo in facilitating its move from batch
computing to continuous computing.

How could this issue be solved?
