
HDFS Hadoop Distributed File System

Introduction
Johan Louwers Lead Architect Oracle Technology


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. HDFS provides high throughput access to application data and is suitable for
applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming
access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search
engine project. HDFS is now an Apache Hadoop subproject. The project URL
is http://hadoop.apache.org/hdfs/.

Hadoop HDFS introduction


Copyright 2014 Capgemini. All rights reserved.

HDFS Simple Cluster Setup

[Figure: Simple HDFS Cluster Setup]

A) HDFS cluster consisting of a number of commodity servers.
B) A single server containing both a name node and a data node.
C) Multiple servers containing a data node.


HDFS introduction
HDFS Name Node

Primary index of where data is stored within the cluster.
Primary entry point for all (application) clients that request access to HDFS.
Advisable to size the Name Node server bigger than the Data Node servers.
Option to run a Data Node instance on the same server as the Name Node.
Hadoop 2.0.0 and higher provide the option of a highly available Name Node setup. Prior to 2.0.0 the Name Node was a single point of failure.


HDFS introduction
HDFS Storage

A (large) file is chopped into blocks.
Blocks are written to the different data nodes in the cluster.
The name node keeps track of which block is written to which node.
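
The three points above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the tiny block size, the round-robin assignment and the names `split_into_blocks` and `write_file` are all assumptions made for illustration.

```python
from itertools import cycle

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example stays readable (assumption)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Chop a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def write_file(data: bytes, data_nodes: list[str]) -> dict[int, str]:
    """Return a toy 'name node' index: block number -> data node holding it.

    Round-robin assignment is an illustrative choice, not the HDFS policy.
    """
    nodes = cycle(data_nodes)
    return {block_no: next(nodes)
            for block_no, _ in enumerate(split_into_blocks(data))}


blocks = split_into_blocks(b"0123456789")
index = write_file(b"0123456789", ["dn1", "dn2"])
print(blocks)  # the file content, chopped into blocks
print(index)   # which block went to which data node
```

The dictionary returned by `write_file` plays the role of the name node: it holds no data itself, only the record of where each block lives.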


HDFS introduction
HDFS Storage

Data blocks are replicated over different nodes in the cluster to ensure availability when a node fails.
The level of replication is 3 by default. It is configured with the dfs.replication variable in the HDFS configuration.
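
To illustrate why replication protects against a node failure, here is a minimal Python sketch. It assumes a simplistic placement rule (each block goes to the next `replication` nodes in a fixed list); the real placement logic in HDFS is different, and the function names are hypothetical.

```python
def place_replicas(block_count: int, data_nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct data nodes.

    The default of 3 mirrors the dfs.replication default mentioned above.
    The rotating assignment is an assumption made for this sketch.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block_no: [data_nodes[(block_no + k) % len(data_nodes)]
                   for k in range(replication)]
        for block_no in range(block_count)
    }


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Blocks that still have at least one replica after `failed_node` is lost."""
    return {block_no for block_no, nodes in placement.items()
            if any(node != failed_node for node in nodes)}


placement = place_replicas(4, ["dn1", "dn2", "dn3", "dn4"])
print(surviving_blocks(placement, "dn1"))  # every block is still readable
```

With a replication level of 1 the same failure would lose every block stored only on `dn1`, which is the case the default of 3 is meant to avoid.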

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state.
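
The idea behind Safemode can be sketched as a simple check: the NameNode waits until enough blocks have been reported by the data nodes with a minimum number of replicas. The Python below is a rough model under stated assumptions; the function name, the `min_replicas` parameter and the threshold value are chosen for illustration, and the real NameNode applies additional conditions before leaving Safemode.

```python
def in_safemode(reported_replicas: dict[int, int], min_replicas: int = 1,
                threshold: float = 0.999) -> bool:
    """Return True while too few blocks have reported enough replicas.

    reported_replicas maps block number -> replicas reported so far.
    min_replicas and threshold are illustrative values (assumptions).
    """
    if not reported_replicas:
        return False  # an empty namespace has nothing to wait for
    safe = sum(1 for count in reported_replicas.values() if count >= min_replicas)
    return safe / len(reported_replicas) < threshold


# Right after startup only one of two blocks has been reported.
print(in_safemode({0: 1, 1: 0}))
# Later, both blocks have been reported by at least one data node.
print(in_safemode({0: 1, 1: 3}))
```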


HDFS introduction
HDFS Storage

When operating a large cluster, ensure that you have enabled the rack-aware option. Refer to the HADOOP-692 improvement for more details: http://goo.gl/dQ012n

Thanks to ChrisDag for the image

Typically, large Hadoop clusters are arranged in racks, and network traffic between different nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance.
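
A rack-aware placement can be sketched as follows. This is a simplified toy policy, not the HDFS block placement implementation: it keeps the first replica on the writing node and fills the remaining replicas from other racks first, so that the loss of one rack does not lose the block. The topology layout and the function names are assumptions.

```python
def place_rack_aware(topology: dict[str, list[str]], writer_node: str,
                     replication: int = 3) -> list[str]:
    """Pick replica nodes so that more than one rack is used when possible.

    topology maps rack name -> data nodes in that rack (hypothetical layout).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    # Nodes on other racks come first, then the rest of the writer's own rack.
    remote = [node for rack, nodes in topology.items()
              if rack != local_rack for node in nodes]
    local = [node for node in topology[local_rack] if node != writer_node]
    return ([writer_node] + remote + local)[:replication]


def racks_used(topology: dict[str, list[str]], nodes: list[str]) -> set[str]:
    """Return the set of racks that the given nodes belong to."""
    rack_of = {node: rack for rack, members in topology.items() for node in members}
    return {rack_of[node] for node in nodes}


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_rack_aware(topology, writer_node="dn1")
print(replicas, racks_used(topology, replicas))
```

The trade-off is the one described above: spreading replicas over racks costs some cross-rack traffic on write, but the block survives the failure of a whole rack.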


HDFS Oracle & Big Data


Oracle Big Data Appliance Introduction

Oracle Big Data Appliance is a high-performance, secure platform for running diverse workloads on Hadoop and NoSQL systems.


HDFS Oracle & Big Data


Oracle Big Data Appliance Introduction

Oracle Big Data Appliance includes (it almost goes without saying) an HDFS storage component for storing data.


HDFS Oracle & Big Data


Oracle & Hadoop

Oracle XQuery for Hadoop


HDFS Oracle & Big Data


Oracle & Hadoop

Oracle SQL connector for HDFS


HDFS Oracle & Big Data


Oracle & Hadoop

Oracle Loader for Hadoop


Online mode
Offline mode




HDFS Oracle & Big Data


Oracle & Hadoop

Oracle Big Data SQL




Contact me

Johan Louwers
Capgemini Lead Architect Oracle Technology
Mail: Johan.Louwers@capgemini.com
Twitter: @johanlouwers
Blog 1: http://www.capgemini.com/blog/capgemini-oracle-blog
Blog 2: http://johanlouwers.blogspot.com


About Capgemini
With almost 140,000 people in over 40 countries, Capgemini is
one of the world's foremost providers of consulting, technology
and outsourcing services. The Group reported 2013 global
revenues of EUR 10.1 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience, and draws on
Rightshore, its worldwide delivery model.
Learn more about us at www.capgemini.com.

www.capgemini.com
The information contained in this presentation is proprietary.
2014 Capgemini. All rights reserved.
Rightshore is a trademark belonging to Capgemini.
