
HDFS File System and Commands
Types of Big Data
Big data is categorized into three different types: Structured, Semi-Structured, and Unstructured.
Structured Data

• Structured data are data that are already stored in an ordered, predefined format.
• Only about 20% of all existing data is structured.
• Machine-generated structured data comes from sources such as sensors and weblogs; human-generated structured data is information collected from people, such as their names, addresses, etc.
• A typical example of structured data is a database.
Unstructured Data

• Unstructured data have no clear storage format. Structured data can be stored in a row-column database, but unstructured data cannot. At least 80% of all data is unstructured.
• Satellite-generated images and scientific data or images are machine-generated unstructured data; human-generated unstructured data includes images, videos, social media data, etc.
• Examples of unstructured data are text documents, PDFs, images, videos, etc.
Semi-Structured
• It is difficult to categorize this type of data: sometimes it looks structured, sometimes unstructured, which is why it is known as semi-structured data. It cannot be stored in a traditional database format, but it does contain some organizational properties.
• Examples of semi-structured data are spreadsheet files, XML or JSON documents, NoSQL database items, etc. (JSON, JavaScript Object Notation, is a lightweight format for storing and transporting data, often used when data is sent from a server to a web page.)
What is Hadoop
• Hadoop is an open-source framework. It can easily handle a large amount of data on a low-cost cluster of simple hardware, and it is scalable and fault-tolerant.
• Hadoop is not only a storage system; data can also be processed using this framework.
• Hadoop is written mainly in Java.
• Because it is an open-source project, we can even change the source code of the Hadoop system.
• Much of the code has been contributed by Yahoo, IBM, Cloudera, and others.
• Hadoop provides parallel processing across many commodity machines simultaneously.
• Because it runs on commodity hardware, which is low-end and very cheap, the Hadoop solution is also economical.
Why we should use Hadoop
• The Hadoop solution is very popular; it has captured at least 90% of the big data market.
• Hadoop has some unique features that make this solution very popular:
• Hadoop is scalable, so we can easily increase the number of commodity machines.
• It is fault tolerant: when one node goes down, another node can process the data.
• Data can be stored in structured, unstructured, and semi-structured form, so it is more flexible.
• Google has provided a processing solution called the MapReduce algorithm.
Hadoop ECO-System
Hadoop Vendors
Introduction to Big Data
• The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed to use low-cost hardware.
• HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
What is Hadoop?
• Fundamentally, Hadoop is made up of two parts:
• HDFS, for storing data across the cluster
• MapReduce, which deals with processing the data stored in HDFS
• HDFS is all about storing and managing huge datasets across the cluster.
What is Hadoop?
Hadoop 2.x (x: version number)
Namenode and Datanode
Hadoop Cluster Architecture
•Before Understanding HDFS,
Let us Understand What is DFS?
What is DFS?
Why DFS?
Why DFS?
What is HDFS?
HDFS Blocks
HDFS Blocks
HDFS Blocks
HDFS Architecture
•How are Files stored in HDFS?
HDFS Data Blocks
•What if DataNode Containing
Data Crashes
Data Node Failures
•Is it safe to have Just 1 copy
of each block?
•What do you think?
HDFS Block Replication
•How does Hadoop decide
where to store the replica of
blocks created?
Hadoop Arch – Rack Awareness Algorithm
Hadoop Arch – Rack Awareness Algorithm
Hadoop Arch – Rack Awareness Algorithm
Hadoop Arch – Rack Awareness Algorithm
Hadoop Arch – Rack Awareness Algorithm
Hadoop Arch – Rack Awareness Algorithm
Replication Factor
Features of HDFS

• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
HDFS Architecture
HDFS Architecture
• HDFS follows the master-slave architecture and it has the following
elements.
• Namenode
• The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can run on commodity hardware. The system having the namenode acts as the master server and does the following tasks:
• Manages the file system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as renaming, closing, and
opening files and directories.
HDFS Architecture
• Block
• Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), but it can be changed as needed in the HDFS configuration.
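As a hedged illustration, the block size can also be overridden for a single file at write time with the generic -D option; the local file name and the HDFS target path here are hypothetical:
• hadoop fs -D dfs.blocksize=134217728 -put localfile.txt /user/hadoop/
(134217728 bytes = 128 MB.)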
Hadoop HDFS version Command
• The Hadoop fs shell command version prints the Hadoop version.
• Example: hadoop version
Hadoop HDFS mkdir Command
Usage:
• hadoop fs -mkdir /path/directory_name

• Hadoop HDFS mkdir Command Example 1:
• hadoop fs -mkdir /newDataFlair
• In this example, we are trying to create a directory named newDataFlair in HDFS using the mkdir command.
• This command creates the directory in HDFS if it does not already exist.
• Note: If the directory already exists in HDFS, then we will get an error message that the file already exists.
• Use hadoop fs -mkdir -p /path/directory_name so that the command does not fail even if the directory already exists.
Using the ls command, we can check for
the directories in HDFS.
• Hadoop HDFS ls Command Usage:
• hadoop fs -ls /path

• The Hadoop fs shell command ls displays a list of the contents of the directory specified in the path provided by the user. It shows the name, permissions, owner, size, and modification date for each file or directory in the specified directory.

• Hadoop HDFS ls Command Example 1:
• Here in the below example, we are using the ls command to list the files and directories present in HDFS.
• hadoop fs -ls /
• hdfs dfs -ls /
• The hadoop fs -ls -R command behaves like -ls, but recursively displays entries in all subdirectories of a path.
Hadoop HDFS put Command
Usage:
• The Hadoop fs shell command put is similar to
the copyFromLocal, which copies files from the local
filesystem to the destination in the Hadoop filesystem.
• hadoop fs -put <localsrc> <dest>
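For instance, reusing the 'test1' file and the /newDataFlair directory from the surrounding slides (both names are assumptions here):
• hadoop fs -put test1 /newDataFlair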
Hadoop HDFS copyFromLocal
Command Usage:
• This command copies a file or directory from the local file system to the destination in the Hadoop filesystem (HDFS).
• hadoop fs -copyFromLocal <localsrc> <hdfs destination>
• Hadoop HDFS copyFromLocal Command Example:
• Here in the below example, we are trying to copy the 'test1' file present in the local file system to the newDataFlair directory of Hadoop.
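Based on that description, the command would look like the following (assuming test1 is in the current local working directory):
• hadoop fs -copyFromLocal test1 /newDataFlair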
Hadoop HDFS get Command Usage:

• The Hadoop fs shell command get copies the file or directory from the Hadoop file system to the local file system.
• hadoop fs -get <src> <localdest>
• Hadoop HDFS get Command Example:
• In this example, we are trying to copy the 'testfile' of the Hadoop filesystem to the local file system.
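A sketch of that example; the HDFS path of 'testfile' and the local target directory are assumptions:
• hadoop fs -get /newDataFlair/testfile /home/user/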
Hadoop HDFS
copyToLocal
Command Usage:
• copyToLocal command copies
the file from HDFS to the local
file system.
• hadoop fs -copyToLocal
<hdfs source> <localdst>
• Hadoop HDFS copyToLocal
Command Example:
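For example (the HDFS source path and the local destination directory are assumptions):
• hadoop fs -copyToLocal /newDataFlair/testfile /home/user/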
Hadoop HDFS cat Command
Usage:
• The cat command reads the file in HDFS and displays the
content of the file on console or stdout.
• hadoop fs -cat /path_to_file_in_hdfs

• Hadoop HDFS cat Command Example:


• Here in this example, we are using the cat command to display
the content of the ‘sample’ file present in newDataFlair directory
of HDFS.
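Based on that description, the command would be:
• hadoop fs -cat /newDataFlair/sample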
Hadoop HDFS mv Command Usage
• The HDFS mv command moves the files or directories from the
source to a destination within HDFS
• hadoop fs -mv <src> <dest>
• Hadoop HDFS mv Command Example:
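A hypothetical example, moving a file between two HDFS directories (both paths are assumptions):
• hadoop fs -mv /newDataFlair/test1 /DataFlair/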
Hadoop HDFS cp Command Usage:
• The cp command copies a file from one directory to another
directory within the HDFS.
• hadoop fs -cp <src> <dest>
• Hadoop HDFS cp Command Example:
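A hypothetical example, copying a file between two HDFS directories (both paths are assumptions):
• hadoop fs -cp /newDataFlair/test1 /DataFlair/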

HDFS moveFromLocal Command
Usage:
• The Hadoop fs shell command moveFromLocal moves the file or directory from the local filesystem to the destination in Hadoop HDFS. After the move, the file from the local filesystem gets deleted.
• hadoop fs -moveFromLocal <localsrc> <dest>
• HDFS moveFromLocal Command Example:
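For example (assuming a file named test2 exists in the current local working directory; the name is hypothetical):
• hadoop fs -moveFromLocal test2 /newDataFlair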
HDFS moveToLocal Command
Usage:
• The Hadoop fs shell command moveToLocal moves the file or
directory from the Hadoop filesystem to the destination in the
local filesystem.

• hadoop fs -moveToLocal <src> <localdest>


HDFS tail Command Usage:
• The Hadoop fs shell tail command shows the last 1KB of a file on console or stdout.
• hadoop fs -tail [-f] <file>
• The -f option shows the appended data as the file grows.
• Here, using the tail command, we are trying to display the last 1KB of the file 'test' present in the newDataFlair directory on the HDFS filesystem.
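Based on that description, the command would be:
• hadoop fs -tail /newDataFlair/test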


HDFS rm Command Usage:
• The rm command removes the file present in the specified path.
• hadoop fs -rm <path>
• HDFS rm Command Example:
• Here in the below example, we are recursively deleting the DataFlair directory using the -r option with the rm command.
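Based on that description, the command would be:
• hadoop fs -rm -r /DataFlair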
HDFS expunge Command Usage:
• The HDFS expunge command makes the trash empty.
• hadoop fs -expunge

• HDFS expunge Command Example:
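For example (the trash path shown is the default per-user location and is an assumption about the cluster's trash configuration):
• hadoop fs -expunge
• hadoop fs -ls /user/<username>/.Trash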


HDFS chown Command Usage:
• The Hadoop fs shell command chown changes the owner of the
file.
• The -R option recursively changes the ownership through the directory structure. The user must be the owner of the file or a superuser.
• hadoop fs -chown [-R] [owner][:[group]] <path>
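A hypothetical example, recursively changing the owner and group of the /newDataFlair directory (the user and group names are assumptions):
• hadoop fs -chown -R hduser:hadoop /newDataFlair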
Hadoop Components
Mapreduce: Data Processing using
Programming
Mapreduce
MapReduce: Introduction

• MapReduce is one of the core building blocks of processing in the Hadoop framework.
• Google released a paper on MapReduce technology in December 2004. This became the genesis of the Hadoop processing model.
• MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.
MapReduce: Traditional Way
MapReduce: Traditional Way
• Let us understand how parallel and distributed processing used to happen in a traditional way, before the MapReduce framework existed.
• So, let us take an example where I have a weather log containing the daily average temperature for the years 2000 to 2015.
• Here, I want to find the day with the highest temperature in each year.
MapReduce: Traditional Way
• So, just like in the traditional way,
• I will split the data into smaller parts or blocks and store them in
different machines.
• Then, I will find the highest temperature in each part stored in the
corresponding machine.
• At last, I will combine the results received from each of the machines
to have the final output.
MapReduce: Traditional Way
• Let us look at the challenges associated with this traditional approach:
1. Critical path problem: It is the amount of time taken to finish the job without delaying the next milestone or actual completion date. So, if any of the machines delays the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? The management of this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how to equally divide the data such that no individual machine is overloaded or underutilized.
MapReduce: Traditional Way
4. Single split may fail: If any of the machines fails to provide the output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault tolerance capability of the system.
5. Aggregation of result: There should be a mechanism to aggregate
the result generated by each of the machines to produce the final
output.
MapReduce: Traditional Way
These are the issues which we will have to take care of individually while performing parallel processing of huge data sets when using traditional approaches.

Solution? -> MapReduce


MapReduce: What is MapReduce?

• MapReduce is a programming framework that allows us to perform


distributed and parallel processing on large data sets in a distributed
environment.
MapReduce: What is MapReduce?
• MapReduce consists of two distinct tasks – Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is input to the
Reducer.
• The reducer receives the key-value pair from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pair) into a smaller set of tuples or key-value
pairs which is the final output.
Advantages of MapReduce
• The two biggest advantages of MapReduce are:
• 1. Parallel Processing:
• 2. Data Locality:
Advantages of MapReduce
• 1. Parallel Processing:
• In MapReduce, we are dividing the job among multiple nodes and
each node works with a part of the job simultaneously.
• So, MapReduce is based on Divide and Conquer paradigm which
helps us to process the data using different machines.
• As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process the data is reduced by a tremendous amount, as shown in the figure.
Advantages of MapReduce
Advantages of MapReduce
• 2. Data Locality:
• Instead of moving data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it. But as the data grew and became very huge, bringing this huge amount of data to the processing unit posed the following issues:
Advantages of MapReduce
• 2. Data Locality:
• Moving huge data to the processing unit is costly and deteriorates network performance.
• Processing takes time as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get overburdened and may fail.
Advantages of MapReduce
• 2. Data Locality:
• Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This allows us to have the following advantages:
Advantages of MapReduce
• 2. Data Locality:
• It is very cost effective to move the processing unit to the data.
• The processing time is reduced as all the nodes are working with their
part of the data in parallel.
• Every node gets a part of the data to process and therefore, there is
no chance of a node getting overburdened.
What is the MapReduce?
• MapReduce is one of the main components of the Hadoop ecosystem. MapReduce is designed to process a large amount of data in parallel by dividing the work into smaller, independent tasks.
• The whole job is taken from the user, divided into smaller tasks, and assigned to the worker nodes.
• MapReduce programs take input as a list and produce output as a list as well.
The Map Task
• The map (mapper) takes a set of keys and values, i.e., a key-value pair, as input. The data may be in structured or unstructured form; the framework turns it into keys and values.
• The keys are references to the input files and the values are the data set.
• The user can create custom business logic based on their need for data processing.
• The task is applied to every input value.
The Reducer Task
• The reducer takes the key-value pair, which is created by the mapper
as input. The key-value pairs are sorted by the key elements.
• In the reducer, we perform the sorting, aggregation or summation
type jobs.
How MapReduce task works
The given inputs are processed by user-defined methods. All the different business logic runs in the mapper section. The mapper generates intermediate data, and the reducer takes it as input. The data is then processed by a user-defined function in the reducer section.
The final output is stored in HDFS.
The operation of MapReduce Task

• Each HDFS split becomes the input to a Map task, which emits (key, value) pairs such as (key1, value1) … (key k, value k).
• Shuffle & Sort: the framework aggregates the intermediate values by key across all mappers, producing (key1, intermediate values).
• Each Reduce task receives (key, intermediate values) and produces the final (key, final values) output.
The Overall MapReduce word count process
• Input: three lines of text, "Deer Bear River", "Car Car River", "Deer Car River".
• Splitting: each line becomes a separate split, handled by its own mapper.
• Mapping: each mapper emits (word, 1) for every word it sees, e.g. (Deer, 1), (Bear, 1), (River, 1).
• Shuffling: the pairs are grouped by key, e.g. (Bear, [1, 1]), (Car, [1, 1, 1]), (Deer, [1, 1]), (River, [1, 1]).
• Reducing: each reducer sums the values for its key.
• Final result: (Bear, 2), (Car, 3), (Deer, 2), (River, 2).
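A minimal sketch of running this word count with the example jar that ships with Hadoop; the input/output paths, the local file name words.txt, and the exact jar name (which depends on the installed version and the $HADOOP_HOME layout) are assumptions:
• hadoop fs -mkdir -p /wordcount/input
• hadoop fs -put words.txt /wordcount/input
• hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /wordcount/input /wordcount/output
• hadoop fs -cat /wordcount/output/part-r-00000
Here words.txt is assumed to contain the three sample lines above; the last command prints each word with its count.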
