Module I - Hadoop Distributed File System (HDFS)


What’s HDFS?
 HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to
expand.
 HDFS is the primary distributed storage for Hadoop applications.
 HDFS provides interfaces for applications to move themselves closer to data.
 HDFS is designed to ‘just work’; however, a working knowledge helps with diagnostics
and improvements.

Components of HDFS:
There are two (and a half) types of machines in a HDFS cluster
 NameNode: the heart of an HDFS filesystem. It maintains and manages the file
system metadata, e.g., which blocks make up a file and on which DataNodes those blocks
are stored.
 DataNode: where HDFS stores the actual data; there are usually quite a few of these.

HDFS Architecture:

Unique features of HDFS:


HDFS also has a bunch of unique features that make it ideal for distributed systems:
 Failure tolerant - data is duplicated across multiple DataNodes to protect against
machine failures. The default is a replication factor of 3 (every block is stored on three
machines).
 Scalability - data transfers happen directly with the DataNodes so your read/write
capacity scales fairly well with the number of DataNodes
 Space - need more disk space? Just add more DataNodes and re-balance
 Industry standard - Other distributed applications are built on top of HDFS (HBase,
Map-Reduce)
HDFS is designed to process large data sets with write-once-read-many semantics; it is not
intended for low-latency access.

HDFS – Data Organization:


 Each file written into HDFS is split into data blocks
 Each block is stored on one or more nodes
 Each copy of a block is called a replica.
 Block placement policy
 First replica is placed on the local node
 Second replica is placed in a different rack
 Third replica is placed in the same rack as the second replica

Design Features :
 The design of HDFS is based on the design of the Google File System (GFS).
 The write-once/read-many design is intended to facilitate streaming reads.
 Files may be appended, but random seeks are not permitted. There is no caching of data.
 Converged data storage and processing happen on the same server nodes.
 "Moving computation is cheaper than moving data."
 A reliable file system maintains multiple copies of data across the cluster.
 Consequently, failure of a single node (or even a rack in a large cluster) will not bring
down the file system.
 A specialized file system is used, which is not designed for general use.

HDFS Components:
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
 In a basic design, a single NameNode manages all the metadata needed to store and
retrieve the actual data from the DataNodes.
 No data is actually stored on the NameNode, however.
 For a minimal Hadoop installation, there needs to be a single NameNode daemon and
a single DataNode daemon running on at least one machine.
 File system namespace operations such as opening, closing, and renaming files and
directories are all managed by the NameNode.

Write Operation in HDFS:

The NameNode also determines the mapping of blocks to DataNodes and handles DataNode
failures.
 The slaves (DataNodes) are responsible for serving read and write requests from the
file system to the clients. The NameNode manages block creation, deletion, and
replication.
 An example of the client/NameNode/DataNode interaction is provided
 When a client writes data, it first communicates with the NameNode and requests to
create a file. The NameNode determines how many blocks are needed and provides the
client with the DataNodes that will store the data. As part of the storage process, the
data blocks are replicated after they are written to the assigned node.
 Depending on how many nodes are in the cluster, the NameNode will attempt to write
replicas of the data blocks on nodes that are in other separate racks (if possible). If there
is only one rack, then the replicated blocks are written to other servers in the same rack.
Note: The NameNode does not write any data directly to the DataNodes. It does, however, give the
client a limited amount of time to complete the operation. If it does not complete in the time period,
the operation is cancelled.

Read Operation in HDFS:

 Reading data happens in a similar fashion. The client requests a file from the
NameNode, which returns the best DataNodes from which to read the data. The client
then accesses the data directly from the DataNodes.
 Thus, once the metadata has been delivered to the client, the NameNode steps back and
lets the conversation between the client and the DataNodes proceed. While data transfer
is progressing, the NameNode also monitors the DataNodes by listening for heartbeats
sent from DataNodes. The lack of a heartbeat signal indicates a potential node failure.
In such a case, the NameNode will route around the failed DataNode and begin re-
replicating the now-missing blocks. Because the file system is redundant, DataNodes
can be taken offline (decommissioned) for maintenance by informing the NameNode
of the DataNodes to exclude from the HDFS pool.

The mappings between data blocks and the physical DataNodes:


The NameNode stores all metadata in memory. Upon startup, each DataNode provides a block
report (which it keeps in persistent storage) to the NameNode.
The block reports are sent every 10 heartbeats. (The interval between reports is a configurable
property.) The reports enable the NameNode to keep an up-to-date account of all data blocks
in the cluster.
 In almost all Hadoop deployments, there is a SecondaryNameNode. While not
explicitly required by a NameNode, it is highly recommended. The term
"SecondaryNameNode" (now called CheckPointNode) is somewhat misleading. It is
not an active failover node and cannot replace the primary NameNode in case of its
failure.
 The purpose of the SecondaryNameNode is to perform periodic checkpoints that
evaluate the status of the NameNode. Recall that the NameNode keeps all system
metadata in memory for fast access. It also has two disk files that track changes to the
metadata:
 An image of the file system state when the NameNode was started. This file
begins with fsimage_* and is used only at start-up by the NameNode.
 A series of modifications done to the file system after starting the NameNode
 These files begin with edits_* and reflect the changes made after the file was read.
 The location of these files is set by the dfs.namenode.name.dir property in the
hdfs-site.xml file.
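As a quick check (a hedged example; the actual directory path will differ per installation), the
configured value can be read with getconf, and the checkpoint files can then be listed directly:
$ hdfs getconf -confKey dfs.namenode.name.dir
$ ls <value returned above>/current        # shows the fsimage_* and edits_* files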

The SecondaryNameNode:
 Periodically downloads the fsimage and edits files, joins them into a new fsimage, and
uploads the new fsimage file to the NameNode.
 Thus, when the NameNode restarts, the fsimage file is reasonably up-to-date and
requires only the edit logs to be applied since the last checkpoint.
 If the SecondaryNameNode were not running, a restart of the NameNode could take a
prohibitively long time due to the number of changes to the file system.

The various roles in HDFS can be summarized as follows:


 HDFS uses a master/slave model designed for large file reading/streaming.
 The NameNode is a metadata server or "data traffic cop."
 HDFS provides a single namespace that is managed by the NameNode.
 Data is redundantly stored on DataNodes; there is no data on the NameNode.
 The SecondaryNameNode performs checkpoints of the NameNode's file system
state but is not a failover node.
HDFS Block Replication
 As mentioned, when HDFS writes a file, it is replicated across the cluster. The amount
of replication is based on the value of dfs.replication in the hdfs-site.xml file.
 This default value can be overridden with the hdfs dfs -setrep command (an example
appears after the figure below). For Hadoop clusters containing more than eight DataNodes,
the replication value is usually set to 3. In a Hadoop cluster of eight or fewer DataNodes
but more than one DataNode, a replication factor of 2 is adequate.
 If several machines must be involved in the serving of a file, then a file could be
rendered unavailable by the loss of any one of those machines. HDFS combats this
problem by replicating each block across a number of machines (three is the default).
 In addition, the HDFS default block size is often 64MB. In a typical operating system,
the block size is 4KB or 8KB. The HDFS default block size is not the minimum block
size, however. If a 20KB file is written to HDFS, it will create a block that is
approximately 20KB in size. (The underlying file system may have a minimal block
size that increases the actual file size.) If a file of size 80MB is written to HDFS, a
64MB block and a 16MB block will be created.
 HDFS blocks are not exactly the same as the data splits used by the MapReduce
process. The HDFS blocks are based on size, while the splits are based on a logical
partitioning of the data. For instance, if a file contains discrete records, the logical split
ensures that a record is not split physically across two separate servers during
processing. Each HDFS block may consist of one or more splits
 Figure below provides an example of how a file is broken into blocks and replicated
across the cluster. In this case, a replication factor of 3 ensures that any one DataNode
can fail and the replicated blocks will be available on other nodes—and then
subsequently re-replicated on other DataNodes
Data File (64MB Blocks):
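As a hedged illustration of the dfs.replication discussion above (the file path is a placeholder), the
replication factor of an existing file can be changed with setrep and verified with stat:
$ hdfs dfs -setrep 2 /Demo/input.txt
$ hdfs dfs -stat %r /Demo/input.txt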

HDFS Safe Mode


 When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted. Safe Mode enables the NameNode to perform two important
processes:
 The previous file system state is reconstructed by loading the fsimage file into
memory and replaying the edit log.
 The mapping between blocks and DataNodes is created by waiting for enough
of the DataNodes to register so that at least one copy of the data is available.
Not all DataNodes are required to register before HDFS exits from Safe Mode.
The registration process may continue for some time.
 HDFS may also enter Safe Mode for maintenance using the hdfs dfsadmin -safemode
command, or when there is a file system issue that must be addressed by the administrator.

Rack Awareness
 Rack awareness deals with data locality. Recall that one of the main design goals of
Hadoop MapReduce is to move the computation to the data. Assuming that most data
centre networks do not offer full bisection bandwidth, a typical
Hadoop cluster will exhibit three levels of data locality:
 Data resides on the local machine (best).
 Data resides in the same rack (better).
 Data resides in a different rack (good).

Rack Awareness
 When the YARN scheduler is assigning MapReduce containers to work as mappers, it
will try to place the container first on the local machine, then on the same rack, and
finally on another rack.
 In addition, the NameNode tries to place replicated data blocks on multiple racks for
improved fault tolerance. In such a case, an entire rack failure will not cause data loss
or stop HDFS from working. Performance may be degraded, however.
 HDFS can be made rack-aware by using a user-derived script that enables the master
node to map the network topology of the cluster. A default Hadoop installation assumes
all the nodes belong to the same (large) rack. In that case, the third locality level listed
above (data in a different rack) does not exist.

NameNode High Availability


 With early Hadoop installations, the NameNode was a single point of failure that could
bring down the entire Hadoop cluster. NameNode hardware often employed redundant
power supplies and storage to guard against such problems, but it was still susceptible
to other failures. The solution was to implement NameNode High Availability (HA) as a
means to provide true failover service.
 As shown in the figure below, an HA Hadoop cluster has two (or more) separate
NameNode machines, each configured with exactly the same software.

Name Node High Availability Design:


 One of the NameNode machines is in the Active state, and the other is in the Standby
state. Like a single NameNode cluster, the Active NameNode is responsible for all
client HDFS operations in the cluster. The Standby NameNode maintains enough state
to provide a fast failover (if required).
 To guarantee the file system state is preserved, both the Active and Standby
NameNodes receive block reports from the DataNodes. The Active node also sends all
file system edits to a quorum of Journal nodes. At least three physically separate
JournalNode daemons are required, because edit log modifications must be written to a
majority of the JournalNodes. This design will enable the system to tolerate the failure
of a single JournalNode machine. The Standby node continuously reads the edits from
the JournalNodes to ensure its namespace is synchronized with that of the Active node.
In the event of an Active NameNode failure, the Standby node reads all remaining edits
from the JournalNodes before promoting itself to the Active state.
 To prevent confusion between NameNodes, the JournalNodes allow only one
NameNode to be a writer at a time. During failover, the NameNode that is chosen to
become active takes over the role of writing to the JournalNodes. A
SecondaryNameNode is not required in the HA configuration because the Standby node
also performs the tasks of the Secondary NameNode.
 Apache Zookeeper is used to monitor the NameNode health. Zookeeper is a highly
available service for maintaining small amounts of coordination data, notifying clients
of changes in that data, and monitoring clients for failures. HDFS failover relies on
ZooKeeper for failure detection and for Standby to Active NameNode election. The
Zookeeper components are not depicted in figure below

HDFS NameNode Federation


 Another important feature of HDFS is NameNode Federation. Older versions of HDFS
provided a single namespace for the entire cluster managed by a single NameNode.
Thus, the resources of a single NameNode determined the size of the namespace.
Federation addresses this limitation by adding support for multiple
 NameNodes/namespaces to the HDFS file system.
The key benefits are as follows:
 Namespace scalability. HDFS cluster storage scales horizontally without
placing a burden on the NameNode.
 Better performance. Adding more NameNodes to the cluster scales the file
system read/write operations throughput by separating the total namespace.
 System isolation. Multiple NameNodes enable different categories of
applications to be distinguished, and users can be isolated to different
namespaces.

HDFS NameNode Federation


Figure below illustrates how HDFS NameNode Federation is accomplished. NameNode1
manages the /research and /marketing namespaces, and NameNode2 manages the /data and
/project namespaces. The NameNodes do not communicate with each other, and the DataNodes
"just store data blocks" as directed by either NameNode.

HDFS Checkpoints and Backups


 As mentioned earlier, the NameNode stores the metadata of the HDFS file system in a
file called fsimage. File systems modifications are written to an edits log file, and at
start-up the NameNode merges the edits into a new fsimage. The SecondaryNameNode
or CheckpointNode periodically fetches edits from the NameNode, merges them, and
returns an updated fsimage to the NameNode.
 An HDFS BackupNode is similar, but also maintains an up-to-date copy of the file
system namespace both in memory and on disk. Unlike a CheckpointNode, the
BackupNode does not need to download the fsimage and edits files from the active
NameNode because it already has an up-to-date namespace state in memory. A
NameNode supports one BackupNode at a time. No CheckpointNodes may be
registered if a Backup node is in use.

HDFS Snapshots
 HDFS snapshots are similar to backups, but are created by administrators using the
hdfs dfs -createSnapshot command. HDFS snapshots are read-only point-in-time copies of
the file system.
They offer the following features:
 Snapshots can be taken of a sub-tree of the file system or the entire file
system.
 Snapshots can be used for data backup, protection against user errors,
and disaster recovery.
 Snapshot creation is instantaneous.
 Blocks on the DataNodes are not copied, because the snapshot files
record the block list and the file size.
There is no data copying, although it appears to the user that there are duplicate files.
 Snapshots do not adversely affect regular HDFS operations.

HDFS Security:
 Authentication to Hadoop
 Simple – insecure way of using OS username to determine hadoop identity
 Kerberos – authentication using kerberos ticket
 Set by hadoop.security.authentication=simple|kerberos
 File and directory permissions are the same as in POSIX
 read (r), write (w), and execute (x) permissions
 also has an owner, group and mode
 enabled by default (dfs.permissions.enabled=true)
 ACLs are used for implementing permissions that differ from the natural hierarchy
of users and groups
 enabled by dfs.namenode.acls.enabled=true
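A brief, hedged illustration of these settings (the paths and the user name are made up):
hdfs dfs -chmod 750 /user/hdfs/stuff              (set POSIX-style mode bits)
hdfs dfs -chown hdfs:hadoop /user/hdfs/stuff      (set owner and group)
hdfs dfs -setfacl -m user:alice:r-x /user/hdfs/stuff   (grant an extra user access via ACL)
hdfs dfs -getfacl /user/hdfs/stuff                (display the resulting ACL)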

HDFS – Shell Commands:


There are two types of shell commands
User Commands:
hdfs dfs – runs filesystem commands on the HDFS
hdfs fsck – runs a HDFS filesystem checking command
Administration Commands:
hdfs dfsadmin – runs HDFS administration commands

HDFS User Commands:


• hdfs [--config confdir] COMMAND
• hdfs version – prints the Hadoop version (e.g., Hadoop 2.6.0.2.2.4.3-2)
• hdfs dfs -ls / – Lists files in the root HDFS directory
• hdfs dfs -ls OR hdfs dfs -ls /user/hdfs – Lists the files in the user home directory
• hdfs dfs -mkdir stuff – Create a directory
• hdfs dfs -put test stuff – Copy files to HDFS
• hdfs dfs -get stuff/test test-local – Copy files from HDFS
• hdfs dfs -cp stuff/test test.hdfs – Copy files within HDFS
• hdfs dfs -rm test.hdfs – Delete files within HDFS
• hdfs dfs -rm -r -skipTrash stuff – Delete a directory in HDFS

HDFS – User Commands (fsck):


 Removing a file
hdfs dfs -rm tdataset/tfile.txt
hdfs dfs -ls -R
 List the blocks of a file and their locations
hdfs fsck /user/cloudera/tdata/geneva.csv -files -blocks -locations
 Print missing blocks and the files they belong to
hdfs fsck / -list-corruptfileblocks

HDFS – Administration Commands:


 Comprehensive status report of HDFS cluster
hdfs dfsadmin -report
 Prints a tree of racks and their nodes
hdfs dfsadmin -printTopology
 Get the information for a given datanode (like ping)
hdfs dfsadmin -getDatanodeInfo localhost:50020
 Get a list of namenodes in the Hadoop cluster
hdfs getconf -namenodes
 Dump the NameNode fsimage to XML file
cd /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current
hdfs oiv -i fsimage_0000000000000003388 -o /tmp/fsimage.xml -p XML
 The general command line syntax is
hdfs command [genericOptions] [commandOptions]
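A few more dfsadmin examples for the Safe Mode feature described earlier (hedged; output varies
by cluster):
 Check, enter, or leave Safe Mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave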
Interfaces to HDFS:
 Java API (DistributedFileSystem)
 C wrapper (libhdfs)
 HTTP protocol
 WebDAV protocol
 Shell Commands
However, the command line is one of the simplest and most familiar

Other Interfaces to HDFS:


 HTTP Interface
http://quickstart.cloudera:50070
 MountableHDFS – FUSE
mkdir /home/cloudera/hdfs
sudo hadoop-fuse-dfs dfs://quickstart.cloudera:8020 /home/cloudera/hdfs

Note: Once mounted, all operations on HDFS can be performed using standard Unix
utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', etc.

Mapper Script
#!/bin/bash
# Read lines from standard input and emit a "<name>, 1" pair for every
# occurrence of the words Ram or Sita.
while read line ; do
for token in $line; do
if [ "$token" = "Ram" ]; then
echo "Ram, 1"
elif [ "$token" = "Sita" ]; then
echo "Sita, 1"
fi
done
done

Reducer Script

#!/bin/bash
# Sum the "<name>, 1" pairs emitted by the mapper.
Rcount=0
Scount=0
while read line ; do
if [ "$line" = "Ram, 1" ]; then
Rcount=$((Rcount+1))
elif [ "$line" = "Sita, 1" ]; then
Scount=$((Scount+1))
fi
done
echo "Ram, $Rcount"
echo "Sita, $Scount"

To compile and run the program from the command line


Perform the following steps:
1. Make a local wordcount_classes directory.
$ mkdir wordcount_classes
2. Compile the WordCount program using the 'hadoop classpath' command to include
all the available Hadoop class paths.
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. The jar file can be created using the following command:
$ jar -cvf wordcount.jar -C wordcount_classes/ .
4. To run the example, create an input directory in HDFS and place a text file in the
new directory. For this example, we will use the war-and-peace.txt file (available from
the book download page; see Appendix A):
$ hdfs dfs -mkdir /Demo
$ hdfs dfs -put input.txt /Demo
5. Run the WordCount application using the following command:
$ hadoop jar wordcount.jar WordCount /Demo/input.txt /output

Listing, Killing, and Job Status:

 The jobs can be managed using the mapred job command.


The most important options are -list, -kill, and -status.
 In addition, the yarn application command can be used to control all
applications running on the cluster.
Enabling YARN Log Aggregation:
• To manually enable log aggregation, follow these steps:
• As the HDFS superuser administrator (usually user hdfs), create the following
directory in HDFS:
$ hdfs dfs -mkdir -p /yarn/logs
$ hdfs dfs -chown -R yarn:hadoop /yarn/logs
$ hdfs dfs -chmod -R g+rw /yarn/logs

• Add the following properties to yarn-site.xml and restart all YARN services on all
nodes.
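The properties in question are typically the following (a hedged sketch; verify the exact names and
values against your Hadoop version before use):
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/yarn/logs</value>
</property>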
The options to yarn logs are as follows:
$ yarn logs
 Retrieve logs for completed YARN applications.
 usage: yarn logs -applicationId <application ID> (OPTIONS)

General options are:


 -appOwner <Application Owner>
 -containerId <Container ID>
 -nodeAddress <Node Address>
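For example (hedged; the application ID below is a placeholder for one returned by the
yarn application command):
$ yarn application -list -appStates FINISHED
$ yarn logs -applicationId application_1452770265298_0001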

MapReduce Programming

#!/bin/bash
# A simple shell-script example: compare a counter against 100.
count=99
if [ $count -eq 100 ]
then
echo "Count is 100"
elif [ $count -gt 100 ]
then
echo "Count is greater than 100"
else
echo "Count is less than 100"
fi

Input:
Hello I am GeeksforGeeks
Hello I am an Intern

Output:
GeeksforGeeks 1
Hello 2
I 2
Intern 1
am 2
an 1

Mapper Code: You have to copy paste this program into the WCMapper Java Class file.
//Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

// Map function
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter rep) throws IOException
{

String line = value.toString();

// Splitting the line on spaces


for (String word : line.split(" "))
{
if (word.length() > 0)
{
output.collect(new Text(word), new IntWritable(1));
}
}
}
}

Driver Code: You have to copy paste this program into the WCDriver Java Class file.
// Importing libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {


public int run(String args[]) throws IOException
{
if (args.length < 2)
{
System.out.println("Please give valid inputs");
return -1;
}

JobConf conf = new JobConf(WCDriver.class);


FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WCMapper.class);
conf.setReducerClass(WCReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}

// Main Method
public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new WCDriver(), args);
System.out.println(exitCode);
}
}
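The driver above references a WCReducer class that is not listed in these notes. A minimal sketch of
such a reducer, written against the same org.apache.hadoop.mapred API as WCMapper (this is an
illustrative reconstruction, not the original file), could look like the following:

// Importing libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

// Reduce function: sums the 1s emitted by WCMapper for each word
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException
{
int count = 0;
while (values.hasNext())
{
count += values.next().get();
}
output.collect(key, new IntWritable(count));
}
}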
Module II - Essential Hadoop Tools
Hadoop Ecosystem:
 Sqoop: It is used to import and export data between HDFS and RDBMS.
 Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
 Hbase: HBase is a distributed column-oriented database built on top of the Hadoop
file system.
 Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
 Flume: Used to handle streaming data on the top of Hadoop.
 Oozie: Apache Oozie is a workflow scheduler for Hadoop.

Introduction to Pig:
 Pig raises the level of abstraction for processing large datasets.
 It is a platform for analyzing large data sets, consisting of a high-level language for
expressing data analysis programs.
 It is an open source platform developed by Yahoo!.

Advantages of Pig:
 Reusing the code
 Faster development
 Less number of lines of code
 Nested data types - The Pig provides a useful concept of nested data types like tuple,
bag, and map.
 Schema and type checking etc.

Features of Pig Hadoop:


 Easy to learn, read, write, and implement if you know SQL.
 It implements a new approach of multi-query execution.
 Provides a large number of nested data types such as Maps, Tuples, and Bags, which
are not easily available in MapReduce, along with other data operations like
Filters, Ordering, and Joins.
 It is used by many different user groups: for instance, up to 90% of Yahoo's MapReduce
and up to 80% of Twitter's MapReduce is done through Pig, and various other companies
such as Salesforce, LinkedIn, and Nokia are major users of Pig.
Pig Latin comes with the following features:
 Simple programming: it is easy to code, execute and manage the program.
 Better optimization: system can automatically optimize the execution as per the
requirement raised.
 Extensive nature: Used to achieve highly specific processing tasks

Application of Pig:
 ETL data pipeline
 Research on raw data
 Iterative processing

Data Types Used in Pig:


The scalar data types in Pig are int, long, float, double, chararray, and bytearray.
The complex data types in Pig are the map, tuple, and bag.
 Map: a set of key-value pairs, where the key is a chararray and the value can be any
Pig data type, including a complex data type.
Example - ['city'#'bang', 'pin'#560001]
Here city and pin are keys mapping to the values.
 Tuple: a collection of data elements with a defined, fixed length. It consists of multiple
fields that are ordered in sequence.
 Bag: a collection of tuples; it is unordered, and the tuples arranged in the bag are
separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}

Running Pig Programs:


There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:

 Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs
the commands in the local file script.pig. Alternatively, for very short scripts, you can
use the -e option to run a script specified as a string on the command line.
 Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started
when no file is specified for Pig to run and the -e option is not used. It is also
possible to run Pig scripts from within Grunt using run and exec.
 Embedded
You can run Pig programs from Java using the PigServer class, much as you can use JDBC
to run SQL programs from Java.

Installation of Pig:
 Download
 Extract
 Set Path

Run Example1 -> grunt


pig -x local
pig -x mapreduce
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;

Run Example2 ->


A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
Save the script in a file, e.g., id.pig

Run using command


pig -x local id.pig

Run Example3 ->Script in Hadoop


A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
Save the script in a file, e.g., id.pig

Start Hadoop
Store the input file in HDFS
hdfs dfs -mkdir /passwdDIR
hdfs dfs -put passwd /passwdDIR
Run using command
pig -x mapreduce id.pig

Apache Sqoop
 Sqoop is a tool designed to transfer data between Hadoop and relational databases.
 Sqoop can be used to import data from a relational database management system
(RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in
Hadoop, and then export the data back into an RDBMS.
 Sqoop − “SQL to Hadoop and Hadoop to SQL”
 The traditional application management system, that is, the interaction of applications
with relational database using RDBMS, is one of the sources that generate Big Data.
 Such Big Data, generated by RDBMS, is stored in Relational Database Servers in the
relational database structure.
 When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra,
Pig, etc. of the Hadoop ecosystem came into picture, they required a tool to interact
with the relational database servers for importing and exporting the Big Data residing
in them.
 Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction
between relational database server and Hadoop’s HDFS
 Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database
and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
 In version 1 of Sqoop, data were accessed using connectors written for specific
databases.
 Version 2 (in beta) does not support connectors, nor version 1-style data transfer from
an RDBMS directly to Hive or HBase, or data transfer from Hive or HBase to your
RDBMS.
Working of Sqoop
 Sqoop Import – The import tool imports individual tables from RDBMS to HDFS.
Each row in a table is treated as a record in HDFS. All records are stored as text data in
text files or as binary data in Avro and Sequence files.
 Sqoop Export – The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called rows in the table.
Those are read and parsed into a set of records and delimited with a user-specified
delimiter.

Apache Sqoop Import and Export Methods:


 The Sqoop data import (to HDFS) process. The data import is done in two steps.
 In the first step, shown in the figure, Sqoop examines the database to gather the
necessary metadata for the data to be imported.
 The second step is a map-only (no reduce step) Hadoop job that Sqoop submits to
the cluster. This job does the actual data transfer using the metadata captured in the
previous step. Note that each node doing the import must have access to the
database.
 The imported data are saved in an HDFS directory. Sqoop will use the database
name for the directory, or the user can specify any alternative directory where the
files should be populated. By default, these files contain comma-delimited fields,
with new lines separating different records.
 You can easily override the format in which data are copied over by explicitly
specifying the field separator and record terminator characters. Once placed in
HDFS, the data are ready for processing.
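A typical import invocation looks like the following (a hedged example; the JDBC connection string,
credentials, table name, and target directory are all placeholders):
$ sqoop import --connect jdbc:mysql://dbhost/mydb \
  --username dbuser --password dbpass \
  --table employees --target-dir /user/hdfs/employees -m 4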

Using Apache Pig:


 Apache Pig is a high-level language that enables programmers to write complex
MapReduce transformations using a simple scripting language.
 Pig Latin defines a set of transformations on a data set such as join, aggregation, and
sort.

Apache Pig Features:


 Pig is often used for extract, transform, and load (ETL) data pipelines, quick research
on raw data, and iterative data processing.
 Apache Pig has several usage modes.
 Apache Pig is a high-level language platform developed to execute queries on huge
datasets that are stored in HDFS using Apache Hadoop.
 It is similar to SQL query language but applied on a larger dataset and with
additional features.
 The language used in Pig is called Pig Latin.
 It is very similar to SQL.
 It is used to load the data, apply the required filters and dump the data in the required
format.
 It requires a Java runtime environment to execute the programs.
 Pig converts all the operations into Map and Reduce tasks which can be efficiently
processed on Hadoop.
 It basically allows us to concentrate upon the whole operation irrespective of the
individual mapper and reducer functions.

For example:
 Pig can be used to run a query to find the rows which exceed a threshold value.
 It can be used to join two different types of datasets based upon a key.
 Pig can be used to run iterative algorithms over a dataset.
 It is ideal for ETL operations, i.e., Extract, Transform, and Load.
 It allows a detailed step by step procedure by which the data has to be transformed.
 It can handle inconsistent schema data.

Differences between Apache MapReduce and PIG:


 Apache MapReduce is a low-level data processing tool; Apache Pig is a high-level data
flow tool.
 In MapReduce, it is required to develop complex programs using Java or Python; in Pig,
it is not required to develop complex programs.
 It is difficult to perform data operations in MapReduce; Pig provides built-in operators
to perform data operations like union, sorting, and ordering.
 MapReduce doesn't allow nested data types; Pig provides nested data types like tuple,
bag, and map.

Local Mode:
 It executes in a single JVM and is used for development, experimentation, and
prototyping.
 Here, files are installed and run using localhost.
 The local mode works on the local file system. The input and output data are stored in
the local file system.

The command for local mode grunt shell:


$ pig -x local

MapReduce Mode:
 The MapReduce mode is also known as Hadoop Mode.
 It is the default mode.
 In this mode, Pig translates Pig Latin statements into MapReduce jobs and executes them on the cluster.
 It can be executed against semi-distributed or fully distributed Hadoop installation.
 Here, the input and output data are present on HDFS.

The command for Map reduce mode:


$ pig
Or
$ pig -x mapreduce

Pig Latin:
 The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop. It is a textual language that abstracts the programming from the Java
MapReduce idiom into a notation.

Pig Latin Statements:


 The Pig Latin statements are used to process the data. It is an operator that accepts
a relation as an input and generates another relation as an output.
 It can span multiple lines.
 Each statement must end with a semi-colon.
 It may include expression and schemas.
 By default, these statements are processed using multi-query execution
Pig Latin Conventions:

Convention Description
( ) The parentheses can enclose one or more items. They can also be used to
indicate the tuple data type.
Example - (10, xyz, (3,6,9))
[ ] The straight brackets can enclose one or more items. They can also be used
to indicate the map data type.
Example - [INNER | OUTER]
{ } The curly brackets enclose two or more items. They can also be used to
indicate the bag data type.
Example - { block | nested_block }
... The horizontal ellipsis points indicate that you can repeat a portion of
the code.
Example - cat path [path ...]

Pig Latin Data Types:

Type Description
int It defines a signed 32-bit integer.
Example - 2
long It defines a signed 64-bit integer.
Example - 2L or 2l
float It defines a 32-bit floating point number.
Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F
double It defines a 64-bit floating point number.
Example - 2.5 or 2.5e2 or 2.5E2
chararray It defines a character array in Unicode UTF-8 format.
Example - javatpoint
bytearray It defines a byte array.
boolean It defines boolean type values.
Example - true/false
datetime It defines values in datetime order.
Example - 1970-01-01T00:00:00.000+00:00
biginteger It defines Java BigInteger values.
Example - 5000000000000
bigdecimal It defines Java BigDecimal values.
Example - 52.232344535345

Complex Types:

Type Description
tuple It defines an ordered set of fields.
Example - (15,12)
bag It defines a collection of tuples.
Example - {(15,12), (12,15)}
map It defines a set of key-value pairs.
Example - [open#apache]

Apache Pig supports many data types. A list of Apache Pig data types with descriptions
and examples is given below:

Type Description Example
int Signed 32-bit integer 2
long Signed 64-bit integer 15L or 15l
float 32-bit floating point 2.5f or 2.5F
double 64-bit floating point 1.5 or 1.5e2 or 1.5E2
chararray Character array hello javatpoint
bytearray BLOB (byte array)
tuple Ordered set of fields (12,43)
bag Collection of tuples {(12,43),(54,28)}
map Collection of key-value pairs [open#apache]


Pig Example:
Use case: Using Pig, find the most frequently occurring start letter.
Solution:
Step 1: Load the data into a bag named "lines". The entire line is assigned to the element
line of type chararray.
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);
Step 2: Tokenize the text in the bag lines; this produces one word per row.
grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token: chararray;
Step 3: Retain the first letter of each word using the SUBSTRING function.
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter: chararray;
Step 4: Group by letter so that each group contains every occurrence of that letter.
grunt> lettergrp = GROUP letters BY letter;
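A possible final step, not shown in the original walkthrough, is to count each group and order the
result so that the most frequent start letter appears first:
grunt> lettercount = FOREACH lettergrp GENERATE group, COUNT(letters);
grunt> result = ORDER lettercount BY $1 DESC;
grunt> DUMP result;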

Pig UDF (User Defined Functions):

To specify custom processing, Pig provides support for user-defined functions (UDFs).
Thus, Pig allows us to create our own functions. Currently, Pig UDFs can be
implemented using the following programming languages: -
 Java
 Python
 Jython
 JavaScript
 Ruby
 Groovy
Among all the languages, Pig provides the most extensive support for Java functions.
However, limited support is provided to languages like Python, Jython, JavaScript,
Ruby, and Groovy.

Example of Pig UDF:


In Pig:
 All UDFs must extend "org.apache.pig.EvalFunc".
 All functions must override the "exec" method.
Example of a simple EVAL Function to convert the provided string to
uppercase.
TestUpper.java
package com.hadoop;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {


public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}

The Apache Pig LOAD operator is used to load the data from the file system.
Syntax
LOAD 'info' [USING FUNCTION] [AS SCHEMA];
Where
LOAD is a relational operator.
'info' is a file that is required to load. It contains any type of data.
USING is a keyword.
FUNCTION is a load function.
AS is a keyword.
SCHEMA is a schema of passing file, enclosed in parentheses.
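For instance (a hedged example; the file name and schema are illustrative):
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
DUMP A;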
Flume

 Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating,
and transporting large amounts of streaming data (such as log files and events) from
various sources to a centralized data store.
 Flume is a highly reliable, distributed, and configurable tool.
 It is principally designed to copy streaming data (log data) from various web servers
to HDFS

HDFS put Command:


 The main challenge in handling the log data is in moving these logs produced by
multiple servers to the Hadoop environment.
 Hadoop File System Shell provides commands to insert data into Hadoop and read
from it. You can insert data into Hadoop using the put command as shown below.
 $ hadoop fs -put <path of the required file> <path in HDFS where to save the file>

Problem with put Command:


 The put command of Hadoop can be used to transfer data from these sources to HDFS.
However, it suffers from the following drawbacks:
 Using put command, we can transfer only one file at a time while the data generators
generate data at a much higher rate. Since the analysis made on older data is less
accurate, we need to have a solution to transfer data in real time.
 If we use the put command, the data needs to be packaged and ready for upload. Since
web servers generate data continuously, this is a very difficult task.
 We therefore need solutions that overcome the drawbacks of the put command and
transfer the "streaming data" from data generators to centralized stores (especially HDFS)
with less delay.
Available Solutions:
 Facebook’s Scribe
 Apache Kafka
 Apache Flume

Applications of Flume:
 Assume an e-commerce web application wants to analyze the customer behavior from
a particular region.
 To do so, they would need to move the available log data in to Hadoop for analysis.
Here, Apache Flume comes to our rescue.
 Flume is used to move the log data generated by application servers into HDFS at a
higher speed.

Advantages of Flume:
 Using Apache Flume we can store the data in to any of the centralized stores (HBase,
HDFS).
 When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized
stores and provides a steady flow of data between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message
delivery.
 Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Apache Flume – Architecture:


 The following illustration depicts the basic architecture of Flume.
 As shown in the illustration, data generators (such as Facebook, Twitter) generate data
which gets collected by individual Flume agents running on them.
 Later the data collector (which is also an agent) collects the data from the agents
which is aggregated and pushed into a centralized store such as HDFS or HBase

What is Flume?
Apache Flume – Architecture

Flume Agent
 An agent is an independent daemon process (JVM) in Flume.
 It receives the data (events) from clients or other agents and forwards it to its next
destination (sink or agent).
 Flume may have more than one agent. Following diagram represents a Flume Agent
Source
 A source is the component of an Agent which receives data from the data generators
and transfers it to one or more channels in the form of Flume events.
 Apache Flume supports several types of sources and each source receives events from
a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
Channel
 A channel is a transient store which receives the events from the source and buffers
them till they are consumed by sinks.
 It acts as a bridge between the sources and the sinks.
 These channels are fully transactional and they can work with any number of sources
and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
 Sink
 A sink stores the data into centralized stores like HBase and HDFS.
 It consumes the data (events) from the channels and delivers it to the destination.
 The destination of the sink might be another agent or the central stores.
Example − HDFS sink
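A minimal agent definition tying one source, one channel, and one sink together might look like the
following (a hedged sketch; the agent name a1, the component names, and the HDFS path are arbitrary
placeholders):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Such a configuration would typically be started with something like: flume-ng agent -n a1 -f example.conf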
Setting multi-agent flow
 In order to flow the data across multiple agents or hops, the sink of the previous agent
and source of the current hop need to be avro type with the sink pointing to the hostname
(or IP address) and port of the source.
 Within Flume, there can be multiple agents and before reaching the final destination,
an event may travel through more than one agent. This is known as multi-hop flow

Consolidation
 A very common scenario in log collection is a large number of log producing clients
sending data to a few consumer agents that are attached to the storage subsystem.
For example, logs collected from hundreds of web servers may be sent to a dozen agents that
write to the HDFS cluster.
 This can be achieved in Flume by configuring a number of first tier agents with an avro
sink, all pointing to an avro source of single agent (Again you could use the thrift
sources/sinks/clients in such a scenario).
 This source on the second tier agent consolidates the received events into a single
channel which is consumed by a sink to its final destination.
Diagram :
Apache Hive
 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
 Initially Hive was developed by Facebook, later the Apache Software Foundation took
it up and developed it further as an open source under the name Apache Hive.
 It is used by different companies.
For example: Amazon uses it in Amazon Elastic MapReduce
Features of Hive
 It stores schema in a database and processed data into HDFS.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Using Apache Hive:


 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
 This is a brief tutorial that provides an introduction on how to use Apache Hive HiveQL
with Hadoop Distributed File System.
 Apache Hive is a data warehouse system built on top of Apache Hadoop that facilitates
easy data summarization, ad-hoc queries, and the analysis of large datasets stored in
various databases and file systems that integrate with Hadoop, including the MapR
Data Platform with MapR XD and MapR Database.
 Hive offers a simple way to apply structure to large amounts of unstructured data and
then perform batch SQL-like queries on that data.
 Hive easily integrates with traditional data center technologies using the familiar
JDBC/ODBC interface.

Architecture of Hive

OR
Architecture of Hive

Architecture of Hive
 User Interface Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS.
 The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).
 Meta Store Hive chooses respective database servers to store the schema or Metadata
of tables, databases, columns in a table, their data types, and HDFS mapping.
 HiveQL Process Engine HiveQL is similar to SQL for querying on schema info on the
Metastore. It is one of the replacements of traditional approach for MapReduce
program. Instead of writing MapReduce program in Java, we can write a query for
MapReduce job and process it.
 Execution Engine The conjunction part of HiveQL process Engine and MapReduce is
Hive Execution Engine. Execution engine processes the query and generates results as
same as MapReduce results. It uses the flavor of MapReduce.
 HDFS or HBASE Hadoop distributed file system or HBASE are the data storage
techniques to store data into file system.

Working of Hive

Operation Steps:
1. Execute Query The Hive interface such as Command Line or Web UI sends query to Driver
(any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan The driver takes the help of query compiler that parses the query to check the syntax
and query plan or the requirement of query.
3. Get Metadata The compiler sends metadata request to Metastore (any database).
4. Send Metadata Metastore sends metadata as a response to the compiler.
5. Send Plan The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of a query is complete.
6. Execute Plan The driver sends the execute plan to the execution engine.
7. Execute Job Internally, the process of execution job is a MapReduce job. The execution
engine sends the job to Job Tracker, which is in Name node and it assigns this job to Task
Tracker, which is in Data node. Here, the query executes MapReduce job.
7.1. Metadata Ops Meanwhile in execution, the execution engine can execute metadata
operations with Metastore.
8. Fetch Result The execution engine receives the results from Data nodes.
9. Send Results The execution engine sends those resultant values to the driver.
10. Send Results The driver sends the results to Hive Interfaces.

Hive - Data Types


All the data types in Hive are classified into four types, given as follows:
 Column Types
 Literals
 Null Values
 Complex Types
Hive - Data Types - Column Types
Integral Types :
 Integer type data can be specified using integral data types, INT.
 When the data range exceeds the range of INT, you need to use BIGINT and if the data
range is smaller than the INT, you use SMALLINT.
 TINYINT is smaller than SMALLINT.
String Types:
 String type data types can be specified using single quotes (' ') or double quotes (" ").
 It contains two data types: VARCHAR and CHAR.
 Hive follows C-style escape characters.
Timestamp :
 “YYYY-MM-DD HH:MM:SS.fffffffff”
Date :
 year/month/day format
Other column types include Decimal and Union types.

Hive - Data Types – Literals:


The following literals are used in Hive:
Floating Point Types:
 Floating point types are nothing but numbers with decimal points. Generally, this type
of data is composed of DOUBLE data type
Decimal Type:
 Decimal type data is nothing but floating point value with higher range than DOUBLE
data type.

 The range of the decimal type is approximately -10^308 to 10^308.


Hive - Data Types - Complex Types:
Arrays:
 Arrays in Hive are used the same way they are used in Java.
Maps :
 Maps in Hive are similar to Java Maps.
Structs:
 Structs in Hive are similar to using complex data with comments.
 hive> CREATE DATABASE [IF NOT EXISTS] userdb;
 hive> SHOW DATABASES;
 hive> DROP DATABASE IF EXISTS userdb;
 hive> DROP SCHEMA userdb;
 hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary
String, destination String) COMMENT 'Employee details' ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
 hive> ALTER TABLE employee RENAME TO emp;
 hive> DROP TABLE IF EXISTS employee;
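A hedged usage example for the employee table created above (the local file path is a placeholder):
 hive> LOAD DATA LOCAL INPATH '/home/user/employee.txt' INTO TABLE employee;
 hive> SELECT eid, name, destination FROM employee;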

Apache Oozie
 Apache Oozie is a workflow scheduler for Hadoop.
 It is a system which runs the workflow of dependent jobs.
 Here, users are permitted to create Directed Acyclic Graphs of workflows, which can
be run in parallel and sequentially in Hadoop
It consists of Three parts:
 Workflow engine: Responsibility of a workflow engine is to store and run workflows
composed of Hadoop jobs e.g., MapReduce, Pig, Hive.
 Coordinator engine: It runs workflow jobs based on predefined schedules and
availability of data.
 Bundle: Higher level abstraction that will batch a set of coordinator jobs

Features:
 Oozie is scalable and can manage the timely execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
 Oozie is very much flexible, as well.
 One can easily start, stop, suspend and rerun jobs. Oozie makes it very easy to rerun
failed workflows.
 One can easily understand how difficult it can be to catch up on missed or failed jobs
due to downtime or failure.
 It is even possible to skip a specific failed node.

How does OOZIE work?/ Working of OOZIE


 Oozie runs as a service in the cluster and clients submit workflow definitions for
immediate or later processing.
 Oozie workflow consists of action nodes and control-flow nodes.
 A control-flow node controls the workflow execution between actions by allowing
constructs like conditional logic wherein different branches may be followed depending
on the result of earlier action node.
 Start Node, End Node, and Error Node fall under this category of nodes.
 Start Node, designates the start of the workflow job.
 End Node, signals end of the job.
 Error Node designates the occurrence of an error and corresponding error message to
be printed.
 An action node represents a workflow task, e.g., moving files into HDFS, running
MapReduce, Pig, or Hive jobs, importing data using Sqoop, or running a shell script or
a program written in Java.

Example Workflow Diagram


HBase
 HBase is a data model that is similar to Google’s big table designed to provide quick
random access to huge amounts of structured data
 Since 1970, RDBMS is the solution for data storage and maintenance related problems.
 After the advent of big data, companies realized the benefit of processing big data and
started opting for solutions like Hadoop

Limitations of Hadoop:
 Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner i.e., one has to search the entire dataset even for the simplest of jobs.
 A huge dataset when processed results in another huge data set, which should also be
processed sequentially.
 At this point, a new solution is needed to access any point of data in a single unit of
time (random access).

Definition /What is HBase?


 HBase is a distributed column-oriented database built on top of the Hadoop file system.
 It is an open-source project and is horizontally scalable.
 HBase is a data model that is similar to Google’s big table designed to provide quick
random access to huge amounts of structured data.
 It leverages the fault tolerance provided by the Hadoop File System (HDFS).
HBase Architecture

Storage Mechanism in HBase:


HBase is a column-oriented database and the tables in it are sorted by row. The table schema
defines only column families. In HBase:
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

Storage Mechanism in HBase:


Example 1
Example 2

Where to Use HBase/Applications:


 Apache HBase is used to have random, Realtime read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts up on Google File System, likewise Apache HBase

HBase Architecture
HBase Installation
Download
 http://archive.apache.org/dist/hbase/0.98.24/
 Extract
 sudo tar -zxvf hbase-0.98.24-hadoop2-bin.tar.gz
 Move
 sudo mv hbase-0.98.24-hadoop2 /usr/local/Hbase
 cd /usr/local/Hbase/
 Add the following to hbase-site.xml:
 <property>
 <name>hbase.rootdir</name>
 <value>file:/usr/local/hadoop/HBase/HFiles</value>
 </property>
 // Here you have to set the path where you want HBase to store its files
HBase Shell
 Create Database and insert data
 create 'apple', 'price', 'volume'
 put 'apple', '17-April-19', 'price:open', '125'
 put 'apple', '17-April-19', 'price:high', '126'
 put 'apple', '17-April-19', 'price:low', '124'
 put 'apple', '17-April-19', 'price:close', '125.5'
 put 'apple', '17-April-19', 'volume', '1000'

Inspect Database
scan 'apple'

 ROW COLUMN+CELL
 17-April-19 column=price:close, timestamp=1555508855040, value=122.5
 17-April-19 column=price:high, timestamp=1555508840180, value=126
 17-April-19 column=price:low, timestamp=1555508846589, value=124
 17-April-19 column=price:open, timestamp=1555508823773, value=125

 17-April-19 column=volume:, timestamp=1555508892705, value=1000



Get table Cell


get 'apple', '17-April-19', { COLUMN => ['volume', 'price:low']}
COLUMN CELL
price:low timestamp=1555508846589, value=124
volume: timestamp=1555508892705, value=1000

Delete Cell, Row and Table


delete 'apple', '17-April-19', 'price:low'
 deleteall 'apple', '17-April-19'
 disable 'apple'
 drop 'apple'
Scripting
echo "create 'apple', 'price', 'volume'"
 echo "create 'apple', 'price', 'volume'" | hbase shell
 Create test.sh file with contents
 echo "create 'mango', 'price', 'volume'"
 echo "put 'mango', ‘123’, 'price', '100'"

Then run the following commands


 sudo sh test.sh | hbase shell
Web Interface
 http://localhost:60010

Why YARN?/necessity of YARN


 In Hadoop version 1.0 which is also referred to as MRV1(MapReduce Version 1),
MapReduce performed both processing and resource management functions.
 It consisted of a Job Tracker which was the single master.
 The Job Tracker allocated the resources, performed scheduling and monitored the
processing jobs.
 It assigned map and reduce tasks to a number of subordinate processes called the Task
Trackers.
 The Task Trackers periodically reported their progress to the Job Tracker.

 This design resulted in a scalability bottleneck due to a single Job Tracker.


 IBM mentioned in its article that according to Yahoo!, the practical limits of such a
design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.
 Apart from this limitation, the utilization of computational resources is inefficient in
MRV1.
 Also, the Hadoop framework became limited only to MapReduce processing paradigm
 To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year
2012 by Yahoo and Hortonworks.
 The basic idea behind YARN is to relieve MapReduce by taking over the
responsibility of Resource Management and Job Scheduling.
 YARN started to give Hadoop the ability to run non-MapReduce jobs within the
Hadoop framework.
 MapReduce is a powerful distributed framework and programming model that allows
batch-based parallelized work to be performed on a cluster of multiple nodes.
 Despite being very efficient at what it does, though, MapReduce has some
disadvantages; principally that it's batch-based, and as a result isn't suited to real-time
or even near-real-time data processing.
 Historically this has meant that processing models such as graph, iterative, and real-
time data processing are not a natural fit for MapReduce.

Introduction to Hadoop YARN

Components of YARN:
 Apart from Resource Management, YARN also performs Job Scheduling. YARN
performs all your processing activities by allocating resources and scheduling tasks.
Apache Hadoop YARN Architecture consists of the following main components :
 Resource Manager: Runs on a master daemon and manages the resource allocation in
the cluster.
 Node Manager: They run on the slave daemons and are responsible for the execution
of a task on every single Data Node.
Components of YARN – ResourceManager
 The ResourceManager is the YARN master process.
 A Hadoop cluster has a single ResourceManager (RM) for the entire cluster. Its sole
function is to arbitrate all the available resources on a Hadoop cluster.
 ResourceManager tracks usage of resources, monitors the health of various nodes in
the cluster, enforces resource-allocation invariants, and arbitrates conflicts among
users.
 The components of the ResourceManager are the Scheduler and the ApplicationsManager.

Components of YARN – NodeManager


 The NodeManager is the slave process of YARN.
 It runs on every data node in a cluster.
 Its job is to create, monitor, and kill containers.
 It services requests from the ResourceManager and ApplicationMaster to create
containers, and it reports on the status of the containers to the ResourceManager.
 The ResourceManager uses the data contained in these status messages to make
scheduling decisions for new container requests.
 On start-up, the NodeManager registers with the ResourceManager; it then sends
heartbeats with its status and waits for instructions.
 Its primary goal is to manage application containers assigned to it by the
ResourceManager.
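The registered NodeManagers and their status can be listed from the command line (a hedged example;
output varies by cluster):
$ yarn node -list -all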
YARN Applications:
 The YARN framework/platform exists to manage applications, so let’s take a look at
what components a YARN application is composed of.
 A YARN application implements a specific function that runs on Hadoop.
 A YARN application involves 3 components:
 Client
 ApplicationMaster (AM)
 Container

YARN Applications - YARN Client


 Launching a new YARN application starts with a YARN client communicating with
the ResourceManager to create a new YARN ApplicationMaster instance.
 Part of this process involves the YARN client informing the ResourceManager of the
ApplicationMaster's physical resource requirements.

YARN Applications - YARN ApplicationMaster


 The ApplicationMaster is the master process of a YARN application.
 It doesn’t perform any application-specific work, as these functions are delegated to
the containers.
 Instead, it’s responsible for managing the application-specific containers.
 Once the ApplicationMaster is started (as a container), it will periodically send
heartbeats to the ResourceManager to affirm its health and to update the record of its
resource demands.
YARN Applications - YARN Container
 A container is an application-specific process that’s created by a NodeManager on
behalf of an ApplicationMaster.
At the fundamental level, a container is a collection of physical resources such as
RAM, CPU cores, and disks on a single node.

YARN scheduler policies


 In an ideal world, the requests that a YARN application makes would be granted
immediately.
 In the real world, however, resources are limited, and on a busy cluster, an application
will often need to wait to have some of its requests fulfilled.
 The FIFO scheduler
 The Capacity scheduler
 The Fair scheduler

YARN scheduler policies - The FIFO scheduler

YARN scheduler policies - The Capacity scheduler
