Module I - Hadoop Distributed File System (HDFS)
What’s HDFS?
HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to
expand.
HDFS is the primary distributed storage for Hadoop applications.
HDFS provides interfaces for applications to move themselves closer to data.
HDFS is designed to ‘just work’, however a working knowledge helps in diagnostics
and improvements.
Components of HDFS:
There are two (and a half) types of machines in an HDFS cluster.
NameNode: the heart of an HDFS filesystem. It maintains and manages the file system
metadata, e.g., which blocks make up a file and on which DataNodes those blocks
are stored.
DataNode: where HDFS stores the actual data; there are usually quite a few of these.
HDFS Architecture:
Design Features:
The design of HDFS is based on the design of the Google File System (GFS).
The write-once/read-many design is intended to facilitate streaming reads.
Files may be appended, but random seeks are not permitted. There is no caching of data.
Converged data storage and processing happen on the same server nodes.
"Moving computation is cheaper than moving data."
A reliable file system maintains multiple copies of data across the cluster.
Consequently, failure of a single node (or even a rack in a large cluster) will not bring
down the file system.
HDFS is a specialized file system; it is not designed for general-purpose use.
HDFS Components:
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
In a basic design, a single NameNode manages all the metadata needed to store and
retrieve the actual data from the DataNodes.
No data, however, is actually stored on the NameNode.
For a minimal Hadoop installation, there needs to be a single NameNode daemon and
a single DataNode daemon running on at least one machine.
File system namespace operations such as opening, closing, and renaming files and
directories are all managed by the NameNode.
The NameNode also determines the mapping of blocks to DataNodes and handles DataNode
failures.
The slaves (DataNodes) are responsible for serving read and write requests from the
file system to the clients. The NameNode manages block creation, deletion, and
replication.
An example of the client/NameNode/DataNode interaction is provided below.
When a client writes data, it first communicates with the NameNode and requests to
create a file. The NameNode determines how many blocks are needed and provides the
client with the DataNodes that will store the data. As part of the storage process, the
data blocks are replicated after they are written to the assigned node.
Depending on how many nodes are in the cluster, the NameNode will attempt to write
replicas of the data blocks on nodes that are in other separate racks (if possible). If there
is only one rack, then the replicated blocks are written to other servers in the same rack.
Note: The NameNode does not write any data directly to the DataNodes. It does, however, give the
client a limited amount of time to complete the operation. If it does not complete in the time period,
the operation is cancelled.
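For example, the write path can be observed from the command line. The following is a minimal sketch; the local file name and HDFS path are assumptions, not part of the original example:
# Copy a local file into HDFS; the client writes blocks to DataNodes chosen by the NameNode
hdfs dfs -put war-and-peace.txt /user/hdfs/input/
# Show which DataNodes hold the replicated blocks of the file
hdfs fsck /user/hdfs/input/war-and-peace.txt -files -blocks -locations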
Reading data happens in a similar fashion. The client requests a file from the
NameNode, which returns the best DataNodes from which to read the data. The client
then accesses the data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back and
lets the conversation between the client and the DataNodes proceed. While data transfer
is progressing, the NameNode also monitors the DataNodes by listening for heartbeats
sent from DataNodes. The lack of a heartbeat signal indicates a potential node failure.
In such a case, the NameNode will route around the failed DataNode and begin re-
replicating the now-missing blocks. Because the file system is redundant, DataNodes
can be taken offline (decommissioned) for maintenance by informing the NameNode
of the DataNodes to exclude from the HDFS pool.
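A minimal command-line sketch of the read path and of node monitoring follows; the file path is an assumption carried over from the write example above:
# Read the file back; the client fetches the blocks directly from the DataNodes
hdfs dfs -cat /user/hdfs/input/war-and-peace.txt | head
# Copy it back to the local file system
hdfs dfs -get /user/hdfs/input/war-and-peace.txt /tmp/war-and-peace.txt
# Report DataNode status (live, dead, decommissioning) as seen by the NameNode
hdfs dfsadmin -report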
The SecondaryNameNode:
Periodically downloads the fsimage and edits files, joins them into a new fsimage, and
uploads the new fsimage file to the NameNode.
Thus, when the NameNode restarts, the fsimage file is reasonably up-to-date and
requires only the edit logs to be applied since the last checkpoint.
If the SecondaryNameNode were not running, a restart of the NameNode could take a
prohibitively long time due to the number of changes to the file system.
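As a sketch, the checkpoint interval and checkpoint directory in use can be read back from the running configuration (the property names below are the Hadoop 2.x ones):
# How often (in seconds) the SecondaryNameNode merges edits into a new fsimage
hdfs getconf -confKey dfs.namenode.checkpoint.period
# Where the SecondaryNameNode keeps its temporary images and edits
hdfs getconf -confKey dfs.namenode.checkpoint.dir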
Rack Awareness
When the YARN scheduler is assigning MapReduce containers to work as mappers, it
will try to place the container first on the local machine, then on the same rack, and
finally on another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for
improved fault tolerance. In such a case, an entire rack failure will not cause data loss
or stop HDFS from working. Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master
node to map the network topology of the cluster. A default Hadoop installation assumes
all the nodes belong to the same (large) rack. In that case, the third placement option
(another rack) is not available.
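A minimal sketch of such a user-derived topology script is shown below; the IP-to-rack mapping is an assumption, and the script would be registered through the net.topology.script.file.name property in core-site.xml:
#!/bin/bash
# rack-topology.sh - Hadoop passes one or more IP addresses/hostnames as arguments
# and expects one rack path per argument on standard output.
for node in "$@" ; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done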
HDFS Security:
Authentication to Hadoop
Simple – insecure way of using OS username to determine hadoop identity
Kerberos – authentication using kerberos ticket
Set by hadoop.security.authentication=simple|kerberos
File and directory permissions are the same as in POSIX
read (r), write (w), and execute (x) permissions
also has an owner, group and mode
enabled by default (dfs.permissions.enabled=true)
ACLs are used for implementing permissions that differ from the natural hierarchy
of users and groups
enabled by dfs.namenode.acls.enabled=true
Note: Once mounted, all operations on HDFS can be performed using standard Unix
utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', and 'grep'.
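A few hedged examples of working with permissions and ACLs from the command line (the paths, users, and modes below are assumptions):
# POSIX-style permissions and ownership
hdfs dfs -chmod 640 /user/hdfs/salary.csv
hdfs dfs -chown hdfs:finance /user/hdfs/salary.csv
# Grant an extra user read access outside the owner/group hierarchy via an ACL
hdfs dfs -setfacl -m user:auditor:r-- /user/hdfs/salary.csv
hdfs dfs -getfacl /user/hdfs/salary.csv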
Mapper Script
#!/bin/bash
while read line ; do
  for token in $line ; do
    if [ "$token" = "Ram" ] ; then
      echo "Ram, 1"
    elif [ "$token" = "Sita" ] ; then
      echo "Sita, 1"
    fi
  done
done
Reducer Script
#!/bin/bash
Rcount=0
Scount=0
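The rest of the reducer is not shown above. A minimal sketch that completes it, assuming lines of the form "Ram, 1" and "Sita, 1" arrive on standard input and the two scripts are saved as mapper.sh and reducer.sh (the streaming jar path and HDFS directories are also assumptions), is:
#!/bin/bash
# reducer.sh - counts the occurrences of Ram and Sita emitted by the mapper
Rcount=0
Scount=0
while read line ; do
  case "$line" in
    Ram*)  Rcount=$((Rcount + 1)) ;;
    Sita*) Scount=$((Scount + 1)) ;;
  esac
done
echo "Ram, $Rcount"
echo "Sita, $Scount"
The scripts can then be run with Hadoop Streaming:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hdfs/ramayana.txt -output /user/hdfs/out \
  -mapper mapper.sh -reducer reducer.sh \
  -file mapper.sh -file reducer.sh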
• Add the following properties to yarn-site.xml and restart all YARN services on all
nodes.
The options to yarn logs are as follows:
$ yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]
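For example (the application ID below is a made-up placeholder):
# List finished applications to find the application ID
yarn application -list -appStates FINISHED
# Fetch the aggregated logs for one application
yarn logs -applicationId application_1555508800000_0001 > app.log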
MapReduce Programming
#!/bin/bash
count=99
if [ $count -eq 100 ]
then
  echo "Count is 100"
elif [ $count -gt 100 ]
then
  echo "Count is greater than 100"
else
  echo "Count is less than 100"
fi
Input:
Hello I am GeeksforGeeks
Hello I am an Intern
Output:
GeeksforGeeks 1
Hello 2
I 2
Intern 1
am 2
an 1
Mapper Code: You have to copy paste this program into the WCMapper Java Class file.
//Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
// Map function: splits each input line into words and emits (word, 1)
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException
{
    for (String token : value.toString().split("\\s+")) {
        output.collect(new Text(token), new IntWritable(1));
    }
}
Driver Code: You have to copy paste this program into the WCDriver Java Class file.
// Importing libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
// Main Method
public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new WCDriver(), args);
System.out.println(exitCode);
}
}
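A hedged sketch of compiling and running the WordCount classes above; the jar name, input/output paths, and the matching WCReducer class are assumptions, not part of the listing:
# Compile the WordCount classes and package them into a jar
hadoop com.sun.tools.javac.Main WCMapper.java WCReducer.java WCDriver.java
jar cf wc.jar WC*.class
# Put the input in HDFS and run the job; the output directory must not already exist
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put input.txt /wordcount/input/
hadoop jar wc.jar WCDriver /wordcount/input /wordcount/output
hdfs dfs -cat /wordcount/output/part-00000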
Module II - Essential Hadoop Tools
Hadoop Ecosystem:
Sqoop: It is used to import and export data between HDFS and an RDBMS.
Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
Hbase: HBase is a distributed column-oriented database built on top of the Hadoop
file system.
Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Flume: Used to handle streaming data on the top of Hadoop.
Oozie: Apache Oozie is a workflow scheduler for Hadoop.
Introduction to Pig:
Pig raises the level of abstraction for processing large datasets.
It is a platform for analyzing large data sets which consists of a high-level language
for expressing data analysis programs.
It is an open source platform originally developed by Yahoo.
Advantages of Pig:
Reusing the code
Faster development
Fewer lines of code
Nested data types - Pig provides a useful concept of nested data types like tuple,
bag, and map.
Schema and type checking etc.
Application of Pig:
ETL data pipeline
Research on raw data
Iterative processing
Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs
the commands in the local file script.pig. Alternatively, for very short scripts, you can
use the -e option to run a script specified as a string on the command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started
when no file is specified for Pig to run and the -e option is not used. It is also
possible to run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can use
JDBC to run SQL programs from Java.
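To illustrate the first two ways of running Pig (the script name below is an assumption):
# Script: run the Pig commands stored in a file
pig script.pig
# Command line: run a very short script given as a string with -e
pig -e "fs -ls /"
# Grunt: start the interactive shell (no file, no -e), then use run/exec inside it
pig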
Installation of Pig:
Download
Extract
Set Path
Start Hadoop
Store input file in HDFS
hdfs dfs -mkdir /passwdDIR
hdfs dfs -put passwd /passwdDIR
Run using command
pig -x mapreduce id.pig
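The contents of id.pig are not shown above; a minimal sketch of what such a script could contain, assuming the usual colon-separated passwd layout, is:
# id.pig - extract the user name (first field) from the passwd file stored in HDFS
cat > id.pig <<'EOF'
A = LOAD '/passwdDIR/passwd' USING PigStorage(':');
B = FOREACH A GENERATE $0 AS id;
DUMP B;
EOF
pig -x mapreduce id.pig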
Apache Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
Sqoop can be used to import data from a relational database management system
(RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in
Hadoop, and then export the data back into an RDBMS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
The traditional application management system, that is, the interaction of applications
with relational database using RDBMS, is one of the sources that generate Big Data.
Such Big Data, generated by RDBMS, is stored in Relational Database Servers in the
relational database structure.
When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra,
Pig, etc. of the Hadoop ecosystem came into picture, they required a tool to interact
with the relational database servers for importing and exporting the Big Data residing
in them.
Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction
between relational database server and Hadoop’s HDFS
Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database
and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
In version I of Sqoop, data were accessed using connectors written for specific
databases.
Version 2 (in beta) does not support version 1 connectors, version 1 data transfer from an
RDBMS directly to Hive or HBase, or data transfer from Hive or HBase to your
RDBMS.
Working of Sqoop
Sqoop Import – The import tool imports individual tables from RDBMS to HDFS.
Each row in a table is treated as a record in HDFS. All records are stored as text data in
text files or as binary data in Avro and Sequence files.
Sqoop Export – The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called rows in the table.
These are read and parsed into a set of records and delimited with a user-specified
delimiter.
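Hedged examples of both directions (the MySQL connection string, table names, and HDFS paths are assumptions):
# Import: copy the RDBMS table into HDFS as text files (one map task)
sqoop import --connect jdbc:mysql://dbhost/sales --username sqoop -P \
  --table customers --target-dir /user/hdfs/customers -m 1
# Export: push files from HDFS back into an (empty) RDBMS table
sqoop export --connect jdbc:mysql://dbhost/sales --username sqoop -P \
  --table customers_copy --export-dir /user/hdfs/customers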
For example:
Pig can be used to run a query to find the rows which exceed a threshold value.
It can be used to join two different types of datasets based upon a key.
Pig can be used to run iterative algorithms over a dataset.
It is ideal for ETL operations, i.e., Extract, Transform and Load.
It allows a detailed step by step procedure by which the data has to be transformed.
It can handle inconsistent schema data.
MapReduce does not allow nested data types, whereas Pig provides nested data types
like tuple, bag, and map.
Local Mode:
It executes in a single JVM and is used for development, experimentation, and
prototyping.
Here, files are installed and run using localhost.
The local mode works on a local file system. The input and output data are stored in
the local file system.
MapReduce Mode:
The MapReduce mode is also known as Hadoop Mode.
It is the default mode.
In this Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
It can be executed against semi-distributed or fully distributed Hadoop installation.
Here, the input and output data are present on HDFS.
Pig Latin:
The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop. It is a textual language that abstracts the programming from the Java
MapReduce idiom into a notation.
Convention  Description
()          Parentheses can enclose one or more items. They can also be used to indicate the tuple data type.
            Example - (10, xyz, (3,6,9))
[]          Straight brackets can enclose one or more items. They can also be used to indicate the map data type.
            Example - [INNER | OUTER]
{}          Curly brackets enclose two or more items. They can also be used to indicate the bag data type.
            Example - { block | nested_block }
...         Horizontal ellipsis points indicate that you can repeat a portion of the code.
            Example - cat path [path ...]
Complex Types:
Apache Pig supports many data types. A list of the Apache Pig complex data types with
description and examples is given below:
Type   Description                  Example
tuple  An ordered set of fields     (raja, 30)
bag    A collection of tuples       {(raja, 30), (mohammad, 45)}
map    A set of key-value pairs     [name#Raja, age#30]
To specify custom processing, Pig provides support for user-defined functions (UDFs).
Thus, Pig allows us to create our own functions. Currently, Pig UDFs can be
implemented using the following programming languages:
Java
Python
Jython
JavaScript
Ruby
Groovy
Among all the languages, Pig provides the most extensive support for Java functions.
However, limited support is provided to languages like Python, Jython, JavaScript,
Ruby, and Groovy.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
The Apache Pig LOAD operator is used to load the data from the file system.
Syntax
LOAD 'info' [USING FUNCTION] [AS SCHEMA];
Where
LOAD is a relational operator.
'info' is a file that is required to load. It contains any type of data.
USING is a keyword.
FUNCTION is a load function.
AS is a keyword.
SCHEMA is a schema of passing file, enclosed in parentheses.
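For example, a hedged LOAD with an explicit load function and schema (the file name and field names are assumptions) could be run as:
# Load a colon-delimited file with an explicit load function and schema, then print it
pig -x local -e "A = LOAD 'passwd' USING PigStorage(':') AS (user:chararray, pw:chararray, uid:int); DUMP A;"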
Flume
Applications of Flume:
Assume an e-commerce web application wants to analyze the customer behavior from
a particular region.
To do so, they would need to move the available log data in to Hadoop for analysis.
Here, Apache Flume comes to our rescue.
Flume is used to move the log data generated by application servers into HDFS at a
higher speed.
Advantages of Flume:
Using Apache Flume we can store the data into any of the centralized stores (HBase,
HDFS).
When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized
stores and provides a steady flow of data between them.
Flume provides the feature of contextual routing.
The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message
delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.
What is Flume?
Apache Flume – Architecture
Flume Agent
An agent is an independent daemon process (JVM) in Flume.
It receives the data (events) from clients or other agents and forwards it to its next
destination (sink or agent).
Flume may have more than one agent. Following diagram represents a Flume Agent
Source
A source is the component of an Agent which receives data from the data generators
and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives events from
a specified data generator.
Example − Facebook, Avro source, Thrift source, Twitter 1% source, etc.
Channel
A channel is a transient store which receives the events from the source and buffers
them till they are consumed by sinks.
It acts as a bridge between the sources and the sinks.
These channels are fully transactional and they can work with any number of sources
and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
Sink
A sink stores the data into centralized stores like HBase and HDFS.
It consumes the data (events) from the channels and delivers it to the destination.
The destination of the sink might be another agent or the central stores.
Example − HDFS sink
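A minimal single-agent sketch, using a netcat source, a memory channel, and a logger sink; the agent name, port, and file name are assumptions:
# example.conf - one source (netcat), one channel (memory), one sink (logger)
cat > example.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
# Start the agent; events sent to localhost:44444 will be logged by the sink
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console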
Setting multi-agent flow
In order to flow the data across multiple agents or hops, the sink of the previous agent
and source of the current hop need to be avro type with the sink pointing to the hostname
(or IP address) and port of the source.
Within Flume, there can be multiple agents and before reaching the final destination,
an event may travel through more than one agent. This is known as multi-hop flow
Consolidation
A very common scenario in log collection is a large number of log producing clients
sending data to a few consumer agents that are attached to the storage subsystem.
For example, logs collected from hundreds of web servers are sent to a dozen agents that write
to the HDFS cluster.
This can be achieved in Flume by configuring a number of first-tier agents with an avro
sink, all pointing to an avro source of a single agent (again, you could use the thrift
sources/sinks/clients in such a scenario).
This source on the second tier agent consolidates the received events into a single
channel which is consumed by a sink to its final destination.
Diagram :
Apache Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took
it up and developed it further as an open source under the name Apache Hive.
It is used by different companies.
For example: Amazon uses it in Amazon Elastic MapReduce
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
User Interface: Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are Hive
Web UI, Hive command line, and Hive HD Insight (in Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata
of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info in the
Metastore. It is one of the replacements of the traditional approach for a MapReduce
program. Instead of writing a MapReduce program in Java, we can write a query for
the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is
the Hive Execution Engine. The execution engine processes the query and generates results
the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data storage
techniques to store data into the file system.
Working of Hive
Operation Steps:
1. Execute Query: The Hive interface such as Command Line or Web UI sends the query to the
Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the
syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of the execution job is a MapReduce job. The execution
engine sends the job to the Job Tracker, which is in the Name node, and it assigns this job to the
Task Tracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1. Metadata Ops: Meanwhile in execution, the execution engine can execute metadata
operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive Interfaces.
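A hedged end-to-end example using the Hive command line; the table name, schema, and data file are assumptions:
# Create a table, load a local comma-delimited file into it, and run a query
hive -e "CREATE TABLE IF NOT EXISTS stocks (symbol STRING, price DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
hive -e "LOAD DATA LOCAL INPATH '/tmp/stocks.csv' INTO TABLE stocks;"
hive -e "SELECT symbol, MAX(price) FROM stocks GROUP BY symbol;"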
Apache Oozie
Apache Oozie is a workflow scheduler for Hadoop.
It is a system which runs the workflow of dependent jobs.
Here, users are permitted to create Directed Acyclic Graphs of workflows, which can
be run in parallel and sequentially in Hadoop.
It consists of Three parts:
Workflow engine: Responsibility of a workflow engine is to store and run workflows
composed of Hadoop jobs e.g., MapReduce, Pig, Hive.
Coordinator engine: It runs workflow jobs based on predefined schedules and
availability of data.
Bundle: Higher level abstraction that will batch a set of coordinator jobs
Features:
Oozie is scalable and can manage the timely execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
Oozie is very flexible, as well.
One can easily start, stop, suspend and rerun jobs. Oozie makes it very easy to rerun
failed workflows.
One can easily understand how difficult it can be to catch up on missed or failed jobs due
to downtime or failure.
It is even possible to skip a specific failed node.
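A hedged sketch of submitting and checking a workflow from the Oozie command line; the Oozie URL, properties file, and job id are assumptions:
# Submit and start a workflow described by job.properties (which points to workflow.xml in HDFS)
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
# Check the status of the submitted job
oozie job -oozie http://localhost:11000/oozie -info 0000001-190417123456789-oozie-oozi-W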
Limitations of Hadoop:
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner i.e., one has to search the entire dataset even for the simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially.
At this point, a new solution is needed to access any point of data in a single unit of
time (random access).
HBase Architecture
HBase Installation
Download
http://archive.apache.org/dist/hbase/0.98.24/
Extract
sudo tar -zxvf hbase-0.98.24-hadoop2-bin.tar.gz
Move
sudo mv hbase-0.98.24-hadoop2 /usr/local/Hbase
cd /usr/local/Hbase/
Add the following property to hbase-site.xml:
<property>
  <name>hbase.rootdir</name>
  <value>file:/usr/local/hadoop/HBase/HFiles</value>
</property>
<!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper data -->
HBase Shell
Create Database and insert data
create 'apple', 'price', 'volume'
put 'apple', '17-April-19', 'price:open', '125'
put 'apple', '17-April-19', 'price:high', '126'
put 'apple', '17-April-19', 'price:low', '124'
put 'apple', '17-April-19', 'price:close', '125.5'
put 'apple', '17-April-19', 'volume', '1000'
Inspect Database
scan 'apple'
ROW          COLUMN+CELL
17-April-19  column=price:close, timestamp=1555508855040, value=122.5
17-April-19  column=price:high, timestamp=1555508840180, value=126
17-April-19  column=price:low, timestamp=1555508846589, value=124
17-April-19  column=price:open, timestamp=1555508823773, value=125
17-April-19  column=volume:, timestamp=1555508892705, value=1000
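A single cell can be read back in the same way; for example, the commands can also be piped into the shell non-interactively:
# Fetch just the closing price for that row
echo "get 'apple', '17-April-19', 'price:close'" | hbase shell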
In MapReduce version 1, a single Job Tracker assigned map and reduce tasks to the Task
Trackers.
The Task Trackers periodically reported their progress to the Job Tracker.
Components of YARN:
Apart from Resource Management, YARN also performs Job Scheduling. YARN
performs all your processing activities by allocating resources and scheduling tasks.
Apache Hadoop YARN Architecture consists of the following main components :
Resource Manager: Runs on a master daemon and manages the resource allocation in
the cluster.
Node Manager: They run on the slave daemons and are responsible for the execution
of a task on every single Data Node.
Components of YARN – ResourceManager
The ResourceManager is the YARN master process.
A Hadoop cluster has a single ResourceManager (RM) for the entire cluster. Its sole
function is to arbitrate all the available resources on a Hadoop cluster.
ResourceManager tracks usage of resources, monitors the health of various nodes in
the cluster, enforces resource-allocation invariants, and arbitrates conflicts among
users.
The components of the ResourceManager are the Scheduler and the ApplicationsManager.
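As a quick sketch, the ResourceManager's view of the cluster can be inspected from the command line:
# NodeManagers known to the ResourceManager, with their state and container counts
yarn node -list -all
# Applications currently being arbitrated by the ResourceManager
yarn application -list -appStates RUNNING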