
Technical Principles of HDFS

Copyright © 2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives

Upon completion of this course, you will understand:
A. HDFS application scenarios
B. HDFS system architecture
C. Key HDFS features
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Dictionary vs. File System

Dictionary          File System
Character index     File name, metadata
Dictionary body     Data block

HDFS Overview

The Hadoop Distributed File System (HDFS) is modeled on the Google File System
(GFS) and runs on commodity hardware.
In addition to the features common to other distributed file systems, HDFS
provides the following:
• High fault tolerance: tolerates unreliable hardware.
• High throughput: supports applications that process large amounts of data.
• Large file storage: supports TB-level and PB-level data storage.

HDFS is applicable to:
• Storing large files.
• Streaming data access.

HDFS is inapplicable to:
• Storing massive numbers of small files.
• Random writes.
• Low-latency reads.
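One way to see why massive numbers of small files are a poor fit is to estimate how many namespace objects the NameNode must track. The sketch below is illustrative only: the figure of roughly 150 bytes of NameNode memory per object is a commonly quoted rule of thumb, not a value from these slides, and the 128 MB block size and the function names are assumptions.

```python
# Assumptions (not from the slides): ~150 bytes of NameNode heap per
# namespace object (file or block) and a 128 MB block size.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_size: int, file_count: int,
                     block_size: int = BLOCK_SIZE) -> int:
    """Count the file and block objects needed for file_count equal-sized files."""
    # Ceiling division; an empty file still counts as one block here.
    blocks_per_file = max(1, -(-file_size // block_size))
    return file_count * (1 + blocks_per_file)


def estimated_heap_bytes(file_size: int, file_count: int) -> int:
    """Rough NameNode memory estimate under the assumptions above."""
    return namenode_objects(file_size, file_count) * BYTES_PER_OBJECT
```

Under these assumptions, 1 TB stored as 1 MB files needs 2,097,152 objects, while the same 1 TB stored as 1 GB files needs only 9,216.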

HDFS Application Scenarios

HDFS is the distributed file system of the Hadoop technical framework. It
manages files stored across multiple independent physical servers.

It is applicable to the following scenarios:

• Website user behavior data storage.


• Ecosystem data storage.
• Meteorological data storage.

Position of HDFS in FusionInsight
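[Figure: FusionInsight architecture. Top: the application service layer, reached through Open API / SDK and REST / SNMP / Syslog. Middle: DataFarm (Porter, Miner, Farmer), spanning data, information, knowledge and wisdom, alongside Manager (system management, service governance, security management). Below: Hadoop API and Plugin API over Hive, M/R, Spark, Storm and Flink, then YARN / ZooKeeper and LibrA, with HDFS / HBase at the bottom.]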

As the Hadoop storage infrastructure, HDFS serves as a distributed,
fault-tolerant file system with linear scalability.

Basic System Architecture
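[Figure: HDFS architecture. Clients send metadata operations to the NameNode, which holds the metadata (name, replicas, …), for example /home/foo/data, 3, …. The NameNode issues block operations to the DataNodes. Clients read blocks directly from the DataNodes, and blocks are replicated between DataNodes in Rack 1 and Rack 2.]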

HDFS Data Write Process

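[Figure: on the client node, the HDFS client works through DistributedFileSystem and FSDataOutputStream; DistributedFileSystem talks to the NameNode, and the stream writes to a pipeline of DataNodes.]

1. create: the HDFS client calls DistributedFileSystem.
2. create: DistributedFileSystem contacts the NameNode.
3. write: the HDFS client writes to FSDataOutputStream.
4. write packet: FSDataOutputStream sends each packet to the first DataNode, and each DataNode forwards it to the next.
5. ack packet: acknowledgements travel back along the DataNodes to FSDataOutputStream.
6. close: the HDFS client closes FSDataOutputStream.
7. complete: DistributedFileSystem notifies the NameNode.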
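The "write packet" and "ack packet" steps can be pictured as a chain: each DataNode stores a packet, forwards it downstream, and the acknowledgement flows back the same way. The following is a minimal in-memory sketch of that idea, not the real Hadoop client; the class, the function names and the tiny packet size are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical in-memory stand-in for a DataNode in the write pipeline."""
    name: str
    packets: list = field(default_factory=list)

    def write_packet(self, packet: bytes, downstream: list[DataNode]) -> bool:
        """Store the packet, forward it downstream, and return the ack."""
        self.packets.append(packet)
        if downstream:
            # Step 4: forward to the next DataNode; its ack comes back to us (step 5).
            return downstream[0].write_packet(packet, downstream[1:])
        return True  # the last DataNode starts the ack chain


def write_file(data: bytes, pipeline: list[DataNode], packet_size: int = 4) -> int:
    """Split data into packets, push each through the pipeline, count acks."""
    acked = 0
    for offset in range(0, len(data), packet_size):
        packet = data[offset:offset + packet_size]
        if pipeline[0].write_packet(packet, pipeline[1:]):
            acked += 1
    return acked
```

In this sketch the client talks only to the first DataNode, so every replica is produced by forwarding along the pipeline rather than by separate client writes.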

HDFS Data Read Process

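[Figure: on the client node, the HDFS client works through DistributedFileSystem and FSDataInputStream; DistributedFileSystem talks to the NameNode, and the stream reads from the DataNodes.]

1. open: the HDFS client calls DistributedFileSystem.
2. get block location: DistributedFileSystem asks the NameNode where the blocks are stored.
3. read: the HDFS client reads from FSDataInputStream.
4. read: FSDataInputStream reads a block from a DataNode.
5. read: FSDataInputStream reads the next block from another DataNode.
6. close: the HDFS client closes FSDataInputStream.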
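The read steps (get block location, then read block by block from the DataNodes) can be sketched with two dictionaries standing in for the NameNode and the DataNodes. This is an illustrative model only; the data layout, the function name and the fallback to another replica when one is unavailable are assumptions of the sketch.

```python
def read_file(path: str, namenode_blocks: dict, datanodes: dict) -> bytes:
    """Read a file by fetching each block from the first replica that has it.

    namenode_blocks: path -> list of (block_id, [DataNode names])   (hypothetical)
    datanodes:       DataNode name -> {block_id: bytes}              (hypothetical)
    """
    chunks = []
    # Step 2: ask the "NameNode" for the ordered block locations.
    for block_id, locations in namenode_blocks[path]:
        # Steps 4 and 5: read each block, trying replicas in the order returned.
        for name in locations:
            block = datanodes.get(name, {}).get(block_id)
            if block is not None:
                chunks.append(block)
                break
        else:
            raise IOError(f"no replica available for block {block_id}")
    return b"".join(chunks)
```

Note that the NameNode only supplies locations here; the data itself never passes through it.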

Key Design of HDFS Architecture

• NameNode / DataNode in master / slave mode
• Federation storage
• Unified file system namespace
• Data storage policy
• HA
• Data replication
• Multiple access modes
• Metadata persistence
• Space reclamation
• Robustness

HDFS High Availability (HA)

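[Figure: HDFS HA. Three ZooKeeper nodes exchange heartbeats with a ZKFC attached to each NameNode. The active NameNode writes the EditLog to three JournalNodes (JN), and the standby NameNode reads it; the FSImage is synchronized between the active and standby NameNodes. The HDFS client sends metadata operations to the active NameNode and reads and writes data on the DataNodes. The DataNodes exchange heartbeats and block operations with the NameNodes and copy blocks among themselves.]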
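The three JN boxes in the HA figure form a quorum: in the Hadoop Quorum Journal Manager design, an edit counts as durable once a majority of JournalNodes have accepted it. The slides do not spell this out, so the sketch below is a simplified illustration with hypothetical names and an in-memory "log".

```python
def write_edit(edit: str, journal_nodes: list) -> bool:
    """Send an edit to every reachable JournalNode; succeed on a majority.

    Each JournalNode is modelled as {"alive": bool, "log": list} (hypothetical).
    """
    acks = 0
    for jn in journal_nodes:
        if jn["alive"]:
            jn["log"].append(edit)  # the reachable JN records the edit
            acks += 1
    # A strict majority must acknowledge for the write to count.
    return acks > len(journal_nodes) // 2
```

With three JournalNodes, the sketch tolerates one failure; with two down, the write is rejected.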

Metadata Persistence

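[Figure: checkpoint flow between the Active NameNode and the Standby NameNode.]

1. The active NameNode rolls the Editlog: new edits go to Editlog.new.
2. The standby NameNode obtains the Editlog and FSImage from the active node. The FSImage is downloaded only when the NameNode is initialized; the local FSImage file is used afterwards.
3. The standby NameNode merges the Editlog and the FSImage into FSImage.ckpt.
4. The standby NameNode uploads the new FSImage (FSImage.ckpt) to the active node.
5. The active NameNode rolls the FSImage, leaving a current Editlog and FSImage.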
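The five checkpoint steps amount to replaying a log of edits onto a snapshot and then swapping the new files in. The sketch below models that with plain dictionaries; the operation names, the dictionary keys and the renaming of Editlog.new to Editlog in the last step are assumptions made for illustration, not the actual on-disk format.

```python
def apply_edits(fsimage: dict, editlog: list) -> dict:
    """Step 3: merge an edit log into a copy of the FSImage (path -> replication)."""
    merged = dict(fsimage)
    for op, path, *args in editlog:
        if op == "create":
            merged[path] = args[0]           # args[0] is the replication factor
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)  # args[0] is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


def checkpoint(active: dict) -> None:
    """Run steps 1 to 5 against a dict standing in for the active node's files."""
    active["editlog.new"] = []                          # 1. roll the Editlog
    edits = list(active["editlog"])                     # 2. standby fetches Editlog
    image = dict(active["fsimage"])                     #    and FSImage
    ckpt = apply_edits(image, edits)                    # 3. merge on the standby
    active["fsimage.ckpt"] = ckpt                       # 4. upload FSImage.ckpt
    active["fsimage"] = active.pop("fsimage.ckpt")      # 5. roll the FSImage
    active["editlog"] = active.pop("editlog.new")       #    (assumed) and the Editlog
```

After a checkpoint, the FSImage already contains every edit that was in the old Editlog, so a restart has less log to replay.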

HDFS Federation
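[Figure: HDFS Federation. At the application layer, Client-1 … Client-k … Client-n access Namespace-1 … Namespace-k … Namespace-n, each served by its own NameNode (NN1 … NN-k … NN-n). Each namespace (NS-1 … NS-k … NS-n) owns one block pool (Pool 1 … Pool k … Pool n). The block pools share common block storage on DataNode1, DataNode2, … DataNodeN.]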
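The relationship between namespaces, block pools and the shared DataNodes can be sketched as a DataNode that keeps one independent block map per pool. This is a toy model under that assumption; the class and the pool and block identifiers are hypothetical.

```python
class FederatedDataNode:
    """Hypothetical DataNode that stores blocks for several block pools."""

    def __init__(self) -> None:
        self.pools: dict = {}  # pool id -> {block id: data}

    def store(self, pool_id: str, block_id: int, data: bytes) -> None:
        """Store a block under the pool of the namespace that owns it."""
        self.pools.setdefault(pool_id, {})[block_id] = data

    def blocks_in(self, pool_id: str) -> list:
        """List block ids for one pool; other pools are not visible here."""
        return sorted(self.pools.get(pool_id, {}))
```

Because blocks are keyed by pool, two namespaces can use the same block id on the same DataNode without interfering with each other.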

Data Replication
[Figure: Placement policy. A data center with RACK1, RACK2 and RACK3, each containing Node1 to Node5. The client and block B1 are on the same node of RACK1 (Distance=0), B3 is on another node of RACK1 (Distance=2), and B2 and B4 are on nodes in RACK2 and RACK3 (Distance=4).]
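The distances 0, 2 and 4 in the placement figure follow from counting hops to the closest common ancestor in a data center / rack / node tree. A minimal sketch, assuming each location is written as a path such as /dc1/rack1/node1 (the path format and the function name are illustrative):

```python
def distance(a: str, b: str) -> int:
    """Hops from each location up to their closest common ancestor, summed."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```

Two processes on the same node are at distance 0, two nodes in the same rack at distance 2, and nodes in different racks of one data center at distance 4.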

Colocation

Colocation means storing associated data, or data that is going to be associated, on the same storage node.
In the figure below, files A and D are going to be associated with each other. Because their blocks sit on
different nodes, the association requires massive data migration. This data transmission consumes a great deal
of bandwidth, which slows the processing of massive data and degrades system performance.

[Figure: a NameNode (NN) and DataNodes DN1 to DN6 holding the blocks of File A, File B, File C and File D without colocation.]

Colocation Benefits

HDFS colocation stores files that need to be associated with each other on the same data node, so that data
does not have to be fetched from other nodes during associated computing. This greatly reduces network
bandwidth consumption.
When files A and D are joined with the colocation feature, resource consumption drops sharply because the
blocks of the associated files are placed on the same storage nodes.

[Figure: a NameNode (NN) and DataNodes DN1 to DN6 holding the blocks of File A, File B, File C and File D, with the blocks of File A and File D colocated.]
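The bandwidth saving from colocation can be illustrated by counting how many block pairs sit on different nodes when two files are joined. The sketch assumes, purely for illustration, that block i of one file is joined with block i of the other; the placement dictionaries and the function name are hypothetical.

```python
def blocks_to_transfer(placement: dict, file_a: str, file_b: str) -> int:
    """Count block pairs of file_a and file_b that live on different DataNodes.

    placement maps a file name to the DataNode holding each of its blocks,
    in block order. Each mismatched pair needs one block moved over the network.
    """
    return sum(
        1
        for node_a, node_b in zip(placement[file_a], placement[file_b])
        if node_a != node_b
    )
```

With file A on DN1, DN2, DN3 and file D on DN4, DN5, DN6, all three pairs need a transfer; when both files share DN1, DN2, DN3, none do.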

Summary
This module describes the following information about
HDFS: basic concepts, application scenarios, technical
architecture and its key features.

THANK YOU!
