Big Data Analytics(BDA)

GTU #3170722

Unit-4

HDFS, SQOOP, HIVE, PIG AND HBASE
 Outline
HDFS: Daemons, Anatomy of File Read, Anatomy of File Write, Replica Placement Strategy,
Working with HDFS Commands

Sqoop: Introduction, import and export command

Hive: Hive Architecture and Installation, Comparison with Traditional Database, HiveQL Querying Data, Sorting and
Aggregating, Map Reduce Scripts, Joins & Sub queries

PIG: PIG Architecture & Data types, Shell and Utility components, PIG Latin Relational Operators,
PIG Latin: File Loaders and UDF, Programming structure in UDF, PIG Jars Import, limitations of PIG

HBase: HBase concepts, Advanced Usage, Schema Design, Advance Indexing

Zookeeper: How it helps in monitoring a cluster, how HBase uses Zookeeper, and how to build applications with Zookeeper.
HDFS – Daemons and Their Features
 A daemon means a process.
 Hadoop Daemons are a set of processes that run on Hadoop.
 Hadoop is a framework written in JAVA so all these processes are Java Processes.
 Apache Hadoop 2 consists of the following Daemons:
 NameNode
 DataNode
 Secondary Name Node
 Resource Manager
 Node Manager
 Namenode, Secondary NameNode, and Resource Manager work on a Master System while the
Node Manager and DataNode work on the Slave machine.
HDFS – Daemons and Their Features
 NameNode
 NameNode works on the Master System. The primary purpose of Namenode is to manage all the MetaData.
 Metadata is the list of files stored in HDFS(Hadoop Distributed File System).
 As we know the data is stored in the form of blocks in a Hadoop cluster.
 So the DataNode (i.e., the location) on which each block of a file is stored is recorded in the MetaData.
 All information regarding the logs of the transactions happening in a Hadoop cluster (when or who
read/wrote the data) will be stored in MetaData.
 MetaData is stored in the memory.
 Features:
 It never stores the data that is present in the file.
 As Namenode works on the Master System, the Master system should have good processing power and more
RAM than Slaves.
 It stores the information of DataNode such as their Block id’s and Number of Blocks
 How to start Name Node?
 hadoop-daemon.sh start namenode
 How to stop Name Node?
 hadoop-daemon.sh stop namenode
HDFS – Daemons and Their Features
 DataNode
 DataNode works on the Slave system.
 The NameNode always instructs DataNode for storing the Data.
 DataNode is a program that runs on the slave system that serves the read/write request from the client.
 As the data is stored on the DataNodes, they should have large storage capacity to hold more data.
 How to start Data Node?
 hadoop-daemon.sh start datanode
 How to stop Data Node?
 hadoop-daemon.sh stop datanode
HDFS – Daemons and Their Features
 Secondary NameNode
 Secondary NameNode is used for taking the hourly backup of the data.
 In case the Hadoop cluster fails, or crashes, the secondary Namenode will take the hourly backup or
checkpoints of that data and store this data into a file named fsimage.
 This file then gets transferred to a new system.
 A new MetaData is assigned to that new system and a new Master is created with this MetaData, and the
cluster is made to run again correctly.
 This is the benefit of Secondary Name Node.
 Major Function Of Secondary NameNode:
 It merges the edit logs and the fsimage from the NameNode.
 It continuously reads the MetaData from the RAM of NameNode and writes into the Hard Disk.
 As secondary NameNode keeps track of checkpoints in a Hadoop Distributed File System, it is also known as
the checkpoint Node.
HDFS – Daemons and Their Features

These ports can be configured manually in hdfs-site.xml and mapred-site.xml files.

Hadoop Daemon          Port
Name Node              50070
Data Node              50075
Secondary Name Node    50090


HDFS – Daemons and Their Features
 Resource Manager
 Resource Manager is also known as the Global Master Daemon that works on the Master System.
 The Resource Manager Manages the resources for the applications that are running in a Hadoop Cluster. The
Resource Manager Mainly consists of 2 things.
 1. ApplicationsManager
2. Scheduler
 The ApplicationsManager is responsible for accepting job submissions from a client and also negotiates a
resource on the Slaves in a Hadoop cluster to host the ApplicationMaster.
 The Scheduler is responsible for allocating resources to the applications running in a Hadoop cluster.
 How to start ResourceManager?
 yarn-daemon.sh start resourcemanager
 How to stop ResourceManager?
 yarn-daemon.sh stop resourcemanager
Anatomy of File Read in HDFS
Let’s get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes
with the help of a diagram. Consider the figure:
Anatomy of File Read in HDFS
 Step 1:
 The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an
instance of DistributedFileSystem).
 Step 2:
 Distributed File System(DFS) calls the name node, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file.
 For each block, the name node returns the addresses of the data nodes that have a copy of that block.
 The DFS returns an FSDataInputStream to the client for it to read data from.
 FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
 Step 3:
 The client then calls read() on the stream.
 DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then
connects to the first (closest) data node for the first block in the file.
 Step 4:
 Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Anatomy of File Read in HDFS
 Step 5:
 When the end of the block is reached, DFSInputStream will close the connection to the data node, then finds
the best data node for the next block.
 This happens transparently to the client, which from its point of view is simply reading an endless stream.
Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client
reads through the stream.
 It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
 Step 6:
 When the client has finished reading the file, it calls close() on the FSDataInputStream.
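 The call sequence above maps directly onto the Hadoop FileSystem Java API. The following is a minimal client-side sketch, not from the slides; the HDFS URI and file path are assumptions.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost:9000/user/data/sample.txt"; // assumed path
        Configuration conf = new Configuration();
        // Steps 1-2: get the DistributedFileSystem and call open(), which returns an FSDataInputStream
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            // Steps 3-5: read() is called repeatedly; switching between blocks is transparent to the client
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() on the stream
            IOUtils.closeStream(in);
        }
    }
}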
Anatomy of File Write in HDFS
We’ll check out how files are written to HDFS.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit the files which are already
stored in HDFS, but we can append data by reopening the files.
Anatomy of File Write in HDFS
 Step 1:
 The client creates the file by calling create() on DistributedFileSystem(DFS).
 Step 2:
 DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks
associated with it.
 The name node performs various checks to make sure the file doesn’t already exist and that the client has
the right permissions to create the file.
 If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and
the client is thrown an error, i.e. an IOException.
 The DFS returns an FSDataOutputStream for the client to start writing data to.
 Step 3:
 As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue
called the data queue.
 The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new
blocks by picking a list of suitable data nodes to store the replicas.
 The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three
nodes in the pipeline.
 The DataStreamer streams the packets to the first data node in the pipeline, which stores each
packet and forwards it to the second data node in the pipeline.
Anatomy of File Write in HDFS
 Step 4:
 Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the
pipeline.
 Step 5:
 The DFSOutputStream sustains an internal queue of packets that are waiting to be acknowledged by data
nodes, called an “ack queue”.
 Step 6:
 When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets
to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.

 HDFS follows the Write Once Read Many model. So, we can’t edit files that are already stored in HDFS, but we can
append data to them by reopening the file.
 This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across
all the data nodes in the cluster.
 Thus, it increases the availability, scalability, and throughput of the system.
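 Correspondingly, here is a hedged sketch of the write path using the same FileSystem API; the local and HDFS paths are illustrative.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileWrite {
    public static void main(String[] args) throws Exception {
        String localSrc = "/home/user/sample.txt";                  // assumed local file
        String dst = "hdfs://localhost:9000/user/data/sample.txt";  // assumed HDFS destination
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // Steps 1-2: create() asks the name node to record the new file and returns an FSDataOutputStream
        OutputStream out = fs.create(new Path(dst));
        // Steps 3-5: data is split into packets and pushed through the data node pipeline;
        // the final 'true' closes both streams, flushing remaining packets and signalling completion (Step 6)
        IOUtils.copyBytes(in, out, 4096, true);
    }
}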
Replica Placement Strategy
 HDFS as the name says is a distributed file system which is designed to store large files.
 A large file is divided into blocks of defined size and these blocks are stored across machines in a cluster.
 These blocks of the file are replicated for reliability and fault tolerance.

 Rack aware replica placement policy


 Large HDFS instances run on a cluster of computers that is commonly spread across many racks, so rack
awareness is also part of the replica placement policy in Hadoop.
 If two nodes placed in different racks have to communicate that communication has to go through switches.
 If machines are on the same rack then network bandwidth between those machines is generally greater than
the network bandwidth between machines in different racks.

 HDFS replica placement policy


 Taking rack awareness and fault tolerance into consideration, the replica placement policy followed by
Hadoop framework is as follows-
 For the default case, when the replication factor is three.
Replica Placement Strategy
1. Put one replica on the same machine where the client application (application which is using the file) is, if the
client is on a DataNode. Otherwise choose a random datanode for storing the replica.
2. Store another replica on a node in a different (remote) rack.
3. The last replica is also stored on the same remote rack but the node where it is stored is different.
 In case replication factor is greater than 3, for the first 3 replicas policy as described above is followed.
 From replica number 4 onward, the node location is determined randomly while keeping the number of replicas per rack
below the upper limit (which is basically (replicas - 1) / racks + 2).
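 For example, with a replication factor of 5 on 3 racks, the per-rack upper limit works out to (5 - 1) / 3 + 2 = 3 (using integer division), so no single rack ends up holding more than 3 of the 5 replicas.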
 HDFS Replication pipelining
 While replicating blocks across DataNodes, pipelining is used by HDFS.
 Rather than client writing to all the chosen DataNodes data is pipelined from one DataNode to the next.
 For the default replication factor of 3 the replication pipelining works as follows-
 The NameNode retrieves a list of DataNodes that will host the replica of a block.
 Client gets this list of 3 DataNodes from NameNode and writes to the first DataNode in the list.
 The first DataNode starts receiving the data in portions, writes each portion to its local storage and then
transfers that portion to the second DataNode in the list.
 The Second DataNode follows the same procedure writes the portion to its local storage and transfers
the portion to the third DataNode in the list.
Replica Placement Strategy
 For replication factor of 3 following image shows the placement of replicas.
HDFS Commands
HDFS Command  Description
-ls List files with permissions and other details
-mkdir Creates a directory named path in HDFS
-touchz Create a new file on HDFS with size 0 bytes or Empty file
-copyFromLocal Copy file from local file system
-copyToLocal Copy files from HDFS to local file system
-cat Display or print the contents for a file
-moveFromLocal Move file / Folder from local disk to HDFS
-moveToLocal Move a file from HDFS to the local file system
-cp Copy files from source to destination
-mv HDFS Command to move files from source to destination
-rmr Removes the file or directory identified by path, recursively (including subfolders)
-du Shows the size of each file/directory on HDFS
-dus Shows the total (summary) size of a directory/file
-stat Print statistics about the file/directory
-setrep Changes the replication factor of a file
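A few illustrative invocations of these commands (the paths are placeholders, not from the slides):
$ hdfs dfs -mkdir /user/data
$ hdfs dfs -copyFromLocal /home/user/sample.txt /user/data
$ hdfs dfs -ls /user/data
$ hdfs dfs -cat /user/data/sample.txt
$ hdfs dfs -setrep 2 /user/data/sample.txt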
Sqoop
 In a traditional application management system, the interaction of applications with relational databases
(RDBMS) is one of the sources that generate Big Data.
 Such Big Data, generated by the RDBMS, is stored in Relational Database Servers in the relational database structure.
 When Big Data storages and analyzers such as MapReduce, Hive, Hbase, Pig, etc. of the Hadoop ecosystem came
into picture, they required a tool to interact with the relational database servers for importing and exporting the Big
Data residing in them.
 Here, Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational
database server and Hadoop’s HDFS.

 Sqoop:
 “SQL to Hadoop and Hadoop to SQL” Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.
 It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from
Hadoop file system to relational databases.
 It is provided by the Apache Software Foundation.
Features of the Sqoop
 How Sqoop Works?

Fig: Workflow of Sqoop


 Sqoop provides a pluggable mechanism for optimal connectivity to external systems.
 The Sqoop extension API provides a convenient framework for building new connectors which
can be dropped into Sqoop installations to provide connectivity to various systems.
 Sqoop itself comes bundled with various connectors that can be used for popular database and
data warehousing systems.
import command in Sqoop
 The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a
record in HDFS.
 All records are stored as text data in text files or as binary data in Avro (data serialization) and Sequence
files.
 Syntax: The following syntax is used to import data into HDFS.
 $ sqoop import (generic-args) (import-args)
 $ sqoop-import (generic-args) (import-args)
 Example
 Let us take an example of three tables named as emp, empadd, and empcontact, which are in a database
called userdb in a MySQL database server.
 Importing a Table
 Sqoop tool ‘import’ is used to import table data from the table to the Hadoop file system as a text file or a
binary file.
 The following command is used to import the emp table from MySQL database server to HDFS.
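 The original slide shows the command as an image; a representative command (the JDBC URL, username, and mapper count below are assumptions) would be:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1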
import command in Sqoop
 To verify the imported data in HDFS, use the following command.
 $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
 It shows you the emp table data and fields are separated with comma (,).
 Importing into Target Directory
 We can specify the target directory while importing table data into HDFS using the Sqoop import tool.
 Following is the syntax to specify the target directory as option to the Sqoop import command.
--target-dir <new or existing directory in HDFS>

 The following command is used to import empadd table data into ‘/queryresult’ directory.
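 As a hedged sketch (connection details are assumptions), the command could look like:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table empadd \
--m 1 \
--target-dir /queryresult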

 The following command is used to verify the imported data in /queryresult directory form empadd table.
 $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
import command in Sqoop
 Import Subset of Table Data
 We can import a subset of a table using the ‘where’ clause in Sqoop import tool. It executes the
corresponding SQL query in the respective database server and stores the result in a target directory in
HDFS.
 The syntax for where clause is as follows.
--where <condition>
 The following command is used to import a subset of empadd table data. The subset query is
to retrieve the employee id and address, who lives in Secunderabad city.
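 A representative command (the city column name and other connection details are assumptions) would be:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table empadd \
--m 1 \
--where "city = 'Secunderabad'" \
--target-dir /wherequery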

 The following command is used to verify the imported data in /wherequery directory from the
empadd table.
$HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*
import command in Sqoop
 Incremental Import
 Incremental import is a technique that imports only the newly added rows in a table.
 It is required to add ‘incremental’, ‘check-column’, and ‘last-value’ options to perform the incremental import.
 The following syntax is used for the incremental option in Sqoop import command.
--incremental <mode> --check-column <column name> --last-value <last check column value>
 Let us assume the newly added data into emp table is as follows −
 1206, satish p, grp des, 20000, GR
 The following command is used to perform the incremental import in the emp table.
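 A hedged example (the check column and last value are assumptions based on the newly added row 1206):
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205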

 The following command is used to verify the imported data from emp table to HDFS emp/
directory.
$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
import command in Sqoop
 The following command is used to see the modified or newly added rows from the emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1

 Import All Tables
 The import-all-tables tool imports all the tables from an RDBMS database to HDFS.
 Each table's data is stored in a separate directory, and the directory name is the same as the table name.
 Syntax: $ sqoop import-all-tables (generic-args) (import-args)
 Example
 Let us take an example of importing all tables from the userdb database. The list of tables that
the database userdb contains is as follows.
 The following command is used to import all the tables from the userdb database.
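 A representative command (connection details are assumptions) would be:
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root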

 Note − If you are using the import-all-tables, it is mandatory that every table in that database must have a
primary key field.
import command in Sqoop
 The following command is used to verify all the table data to the userdb database in HDFS.
 $HADOOP_HOME/bin/hadoop fs -ls
 It will show you the list of table names in userdb database as directories.

 Output
drwxr-xr-x - hadoop supergroup 0 2014-12-22 22:50 -sqoop
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:50 empadd
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:52 empcontact
export command in Sqoop
 The export command works as the reverse of the import operation.
 Using the export command, we can transfer data from the Hadoop Distributed File System to a
relational database management system.
 The data to be exported is processed into records before the operation is completed.
 The export of data is done in two steps: the first is to examine the database for metadata, and the
second involves the migration of data.
 Each table data is stored in a separate directory and the directory name is same as the table
name.
 Syntax:
 $ sqoop export (generic-args) (export-args)
 $ sqoop-export (generic-args) (export-args)
 Example
 Let us take an example of the employee data in file, in HDFS. The employee data is available in empdata file
in ‘emp/’ directory in HDFS. The empdata is as follows.
 It is mandatory that the table to be exported is created manually and is present in the target database to which
the data has to be exported.
export command in Sqoop
 The following query is used to create the table ‘employee’ in mysql command line.
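The slide shows this as an image; a representative DDL statement (the column layout is an assumption based on the emp data used earlier) could be:
mysql> CREATE TABLE employee (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(20),
        deg VARCHAR(20),
        salary INT,
        dept VARCHAR(10));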

 The following command is used to export the table data (which is in the empdata file on HDFS) to
the employee table in the db database of the MySQL server.
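 A representative export command (connection details are assumptions) would be:
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/empdata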

 The following command is used to verify the table in mysql command line.
mysql>select * from employee;
HIVE
 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Bigdata and makes querying and analysing easy.
 It is a platform used to develop SQL type scripts to do MapReduce operations.
 Initially, Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive.
 It is used by different companies.
 For example, Amazon uses it in Amazon Elastic MapReduce.

 Feature of Hive
 Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
 Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Hive Architecture
 The following component diagram depicts the architecture of Hive:
Hive Architecture
 This component diagram contains different units. The following table describes each unit:
Unit Name: Operation
User Interface (UI): Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema info in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for a MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data storage techniques used to store data into the file system.
Working with Hive
 The following diagram depicts the workflow between Hive and Hadoop.
Working with Hive
 The following table defines how Hive interacts with Hadoop framework:
Step 1 (Execute Query): The Hive interface such as Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 (Get Plan): The driver takes the help of the query compiler that parses the query to check the syntax and the query plan or the requirement of the query.
Step 3 (Get Metadata): The compiler sends a metadata request to the Metastore (any database).
Step 4 (Send Metadata): The Metastore sends the metadata as a response to the compiler.
Step 5 (Send Plan): The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of a query is complete.
Step 6 (Execute Plan): The driver sends the execute plan to the execution engine.
Working with Hive

Step 7 (Execute Job): Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1 (Metadata Ops): Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
Step 8 (Fetch Result): The execution engine receives the results from the Data nodes.
Step 9 (Send Results): The execution engine sends those resultant values to the driver.
Step 10 (Send Results): The driver sends the results to the Hive interfaces.


Installation of HIVE
 Pre-requisites for installing Hive
Step 1: Java Installation - Check whether the Java is installed or not using the following
command.
 $ java -version
Step 2: Hadoop Installation - Check whether the Hadoop is installed or not using the following
command.
 $hadoop version
Step 3: Steps to install Apache Hive
 Download the Apache Hive tar file. http://mirrors.estointernet.in/apache/hive/hive-1.2.2/
 Unzip the downloaded tar file.
 tar -xvf apache-hive-1.2.2-bin.tar.gz

 Open the bashrc file.


 $ sudo nano ~/.bashrc

 Now, provide the following HIVE_HOME path.


 export HIVE_HOME=/home/codegyani/apache-hive-1.2.2-bin
 export PATH=$PATH:/home/codegyani/apache-hive-1.2.2-bin/bin
Installation of HIVE
 Update the environment variable.
 $ source ~/.bashrc

 Let's start Hive by providing the following command.


 $ hive
Comparison with Hive and Traditional database

Hive vs. Traditional database

 Hive: Schema on READ – the schema is not verified while loading the data. Traditional DB: Schema on WRITE – the table schema is enforced at data load time, i.e., if the data being loaded does not conform to the schema, it is rejected.
 Hive: Very easily scalable at low cost. Traditional DB: Not much scalable; costly to scale up.
 Hive: Based on the Hadoop notion of Write once, Read many times. Traditional DB: We can read and write many times.
 Hive: Record-level updates are not possible. Traditional DB: Record-level updates, insertions and deletes, transactions and indexes are possible.
 Hive: OLTP (On-line Transaction Processing) is not yet supported, but OLAP (On-line Analytical Processing) is supported. Traditional DB: Both OLTP and OLAP are supported in an RDBMS.
HiveQL Querying Data
 HiveQL is the HIVE QUERY LANGUAGE.
 Hive offers no support for row-level inserts, updates, and deletes. Hive does not support transactions.
 Hive adds extensions to provide better performance in the context of Hadoop and to integrate with
custom extensions and even external programs.
 DDL and DML are the parts of HiveQL.
 Data Definition Language (DDL) is used for creating, altering and dropping databases, tables, views,
functions and indexes.
 Data manipulation language (DML) is used to put data into Hive tables and to extract data to the file
system and also how to explore and manipulate data with queries, grouping, filtering, joining etc.
 Hive queries provides the following features:
 Data modeling such as Creation of databases, tables, etc.
 ETL functionalities such as Extraction, Transformation, and Loading data into tables
 Joins to merge different data tables
 User specific custom scripts for ease of code
 Faster querying tool on top of Hadoop
HiveQL Querying Data
 Create Database Statement
 Create Database is a statement used to create a database in Hive.
 A database in Hive is a namespace or a collection of tables.
 Syntax: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
 Here, IF NOT EXISTS is an optional clause, which avoids the error in case a database with the same name
already exists. We can use SCHEMA in place of DATABASE in this command.
 The following query is executed to create a database named userdb:
 hive> CREATE DATABASE [IF NOT EXISTS] userdb;
OR
 hive> CREATE SCHEMA userdb;
 The following query is used to verify a databases list:
 hive> SHOW DATABASES;
 default
 userdb
Create Table in Hive
 Creating Table in Hive
 Create Table is a statement used to create a table in Hive.
 Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db-name.] table-name
[(col-name datatype [COMMENT col-comment], ...)]
[COMMENT table-comment]
[ROW FORMAT row-format]
[STORED AS file-format]
 The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Load Data in Hive
 Load Data Statement
 Generally, after creating a table in SQL, we can insert data using the Insert statement. But in Hive, we can
insert data using the LOAD DATA statement.
 While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load
data: one is from local file system and second is from Hadoop file system.
 Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Where,
• LOCAL is identifier to specify the local path. It is optional.
• OVERWRITE is optional to overwrite the data in the table.
• PARTITION is optional.

 The following query loads the given text into the table.
 hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' INTO TABLE employee;
Order by and Sort by Clause in Hive
 ORDER BY and SORT BY Clause
 By using HiveQL ORDER BY and SORT BY clause, we can apply sort on the column. It returns the result set
either in ascending or descending.
 In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence, the complete data
is passed through a single reducer. This may take much time in the execution of large datasets. However, we
can use LIMIT to minimize the sorting time.
 Example: Let's see an example to arrange the data in the sorted order by using ORDER BY clause.
Step 1: Select the database in which we want to create a table.
hive> use hiveql;
Step 2: create a table by using the following command:
hive> create table emp (Id int, Name string , Salary float, Department string)
row format delimited
fields terminated by ',' ;
Step 3: Load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table emp;
Step 4: To fetch the data in the descending order by using the following command.
hive> select * from emp order by salary desc;
Order by and Sort by Clause in Hive
 SORT BY Clause
 The HiveQL SORT BY clause is an alternative of ORDER BY clause. It orders the data within each reducer.
Hence, it performs the local ordering, where each reducer's output is sorted separately. It may also give a
partially ordered result.

 Example: Let's see an example to arrange the data in the sorted order by using the SORT BY clause.

Step 1: Select the database in which we want to create a table.


hive> use hiveql;
Step 2: create a table by using the following command:
hive> create table emp (Id int, Name string , Salary float, Department string)
row format delimited
fields terminated by ',' ;
Step 3: Load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table emp;
Step 4: To fetch the data in the descending order by using the following command.
hive> select * from emp sort by salary desc;
Group by Clause in Hive
 Group By Clause
 The GROUP BY clause is used to group all the records in a result set using a particular collection column. It is
used to query a group of records.
 Syntax
SELECT [ALL | DISTINCT] select-expr, select-expr, ...
FROM table-reference
[WHERE where-condition]
[GROUP BY col-list]
[HAVING having-condition]
[ORDER BY col-list]
[LIMIT number];
 Example: Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a
query to retrieve the number of employees in each department.
 hive> SELECT Dept, count(*) FROM employee GROUP BY DEPT;
Map Reduce Scripts
 Similar to any other scripting language, Hive scripts are used to execute a set of Hive commands collectively.
 Hive scripting helps us to reduce the time and effort invested in writing and executing the individual commands
manually. Hive scripting is supported in Hive 0.10.0 or higher versions of Hive.
 Using an approach like Hadoop Streaming, the TRANSFORM, MAP and REDUCE clauses make it possible to invoke
an external script or program from Hive.
 Example - script to filter out rows to remove poor quality readings.
import re
import sys

for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if temp != "9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)

hive> ADD FILE /path/to/is-good-quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is-good-quality.py'
> AS year, temperature;
 Output :
1949 111
1949 78
1950 0
1950 22
1950 -11
Join in Hive
 Join
 JOIN is a clause that is used for combining specific fields from two tables by using values common to each
one. It is used to combine records from two or more tables in the database.
 JOIN clause is used to combine and retrieve the records from multiple tables. A plain JOIN behaves like an INNER JOIN
in SQL. A JOIN condition is raised using the primary keys and foreign keys of the tables.
 Syntax
Join-table:
table-reference JOIN table-factor [join-condition]
| table-reference {LEFT|RIGHT|FULL} [OUTER] JOIN table-reference
join-condition
| table-reference LEFT SEMI JOIN table-reference join-condition
| table-reference CROSS JOIN table-reference [join-condition]
Example: Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields.
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Join in Hive
 Different types of Join:
 Inner join
 Left outer Join
 Right Outer Join
 Full Outer Join
 Inner Join:
 The Records common to the both tables will be retrieved by this Inner Join.
 Example:
 SELECT c.Id, c.Name, c.Age, o.Amount FROM sample_joins c JOIN sample_joins1 o ON(c.Id=o.Id);

 Left Outer Join:


 Hive query language LEFT OUTER JOIN returns all the rows from the left table even though there are no
matches in right table.
 If ON Clause matches zero records in the right table, the joins still return a record in the result with NULL in
each column from the right table.
 Example:
 SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c LEFT OUTER JOIN sample_joins1 o
ON(c.Id=o.Id)
Join in Hive
 Right outer Join:
 Hive query language RIGHT OUTER JOIN returns all the rows from the Right table even though there are no
matches in left table
 If ON Clause matches zero records in the left table, the joins still return a record in the result with NULL in
each column from the left table
 RIGHT joins always return records from a Right table and matched records from the left table. If the left table
is having no values corresponding to the column, it will return NULL values in that place.
 Example:
 SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c RIGHT OUTER JOIN sample_joins1 o
ON(c.Id=o.Id)
 Full outer join:
 It combines records of both the tables sample_joins and sample_joins1 based on the JOIN Condition given in
query.
 It returns all the records from both tables and fills in NULL Values for the columns missing values matched
on either side.
 Example:
 SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c FULL OUTER JOIN sample_joins1 o
ON(c.Id=o.Id)
Subqueries in Hive
 A Query present within a Query is known as a sub query. The main query will depend on the
values returned by the subqueries.
 Subqueries can be classified into two types
 Subqueries in FROM clause
 Subqueries in WHERE clause
 When to use:
 To get a particular value combined from two column values from different tables
 Dependency of one table values on other tables
 Comparative checking of one column values from other tables
 Syntax:
Subquery in FROM clause
SELECT <column names 1, 2…n> FROM (SubQuery) <TableName_Main>;
Subquery in WHERE clause
SELECT <column names 1, 2…n> FROM <TableName_Main> WHERE col1 IN (SubQuery);
 Example:
SELECT col1 FROM (SELECT a+b AS col1 FROM t1) t2
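Recent Hive versions also accept IN subqueries in the WHERE clause. As a hedged sketch (the table and column names are illustrative, not from the slides):
hive> SELECT Id, Name FROM employee WHERE Id IN (SELECT Id FROM emp WHERE Salary > 30000);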
Aggregate Functions in Hive
 Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single value,
when used with a group, it aggregates all values in each group and returns one value for each group.
 Like in SQL, Aggregate Functions in Hive can be used with or without GROUP BY functions however these
aggregation functions are mostly used with GROUP BY.
 Hive Aggregate Functions List

Hive Aggregate Function: Syntax & Description

COUNT(): Returns the count of all rows in a table including rows containing NULL values. When you specify a column as an input, it ignores NULL values in the column for the count. Ignores duplicates when used with DISTINCT. Return: BIGINT
SUM(): Returns the sum of all values in a column. When used with a group it returns the sum for each group. Ignores duplicates when used with DISTINCT. Return: DOUBLE
AVG(): Returns the average of all values in a column. When used with a group it returns an average for each group. Return: DOUBLE
Aggregate Functions in Hive
Hive Aggregate Function: Syntax & Description

MIN(): Returns the minimum value of the column from all rows. When used with a group it returns a minimum for each group. Return: DOUBLE
MAX(): Returns the maximum value of the column from all rows. When used with a group it returns a maximum for each group. Return: DOUBLE
variance(col), var_pop(col): Returns the variance of a numeric column for all rows or for each group. Return: DOUBLE
var_samp(col): Returns the unbiased sample variance of a numeric column or for each group. Return: DOUBLE
stddev_pop(col): Returns the statistical standard deviation of all values in a column or for each group. Return: DOUBLE
stddev_samp(col): Returns the sample statistical standard deviation of all values in a column or for each group. Return: DOUBLE
covar_samp(col1, col2): Returns the sample covariance of a pair of numeric columns for all rows or for each group. Return: DOUBLE
Aggregate Functions in Hive
Hive Aggregate Function: Syntax & Description

corr(col1, col2): Returns the Pearson coefficient of correlation of a pair of numeric columns in the group. Return: DOUBLE
percentile(BIGINT col, p): For each group, it returns the exact percentile of a column. p must be between 0 and 1. Return: DOUBLE
percentile(BIGINT col, array(p1[, p2]…)): Returns the exact percentiles p1, p2, … of a column in the group. Each pi must be between 0 and 1. Return: array<double>
regr_avgx(independent, dependent): Equivalent to avg(dependent). Return: DOUBLE
regr_avgy(independent, dependent): Equivalent to avg(independent). Return: DOUBLE
regr_count(independent, dependent): Returns the number of non-null pairs used to fit the linear regression line. Return: DOUBLE
regr_intercept(independent, dependent): Returns the y-intercept of the linear regression line. Return: DOUBLE
regr_r2(independent, dependent): Returns the coefficient of determination for the regression. As of Hive 2.2.0. Return: DOUBLE
Aggregate Functions in Hive
Hive Aggregate Function: Syntax & Description

regr_slope(independent, dependent): Returns the slope of the linear regression line. Return: DOUBLE
regr_sxx(independent, dependent): Equivalent to regr_count(independent, dependent) * var_pop(dependent). Return: DOUBLE
regr_sxy(independent, dependent): Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). Return: DOUBLE
regr_syy(independent, dependent): Equivalent to regr_count(independent, dependent) * var_pop(independent). Return: DOUBLE
histogram_numeric(col, b): Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. Return: array<struct {'x','y'}>
collect_set(col): Returns a collection of elements in a group as a set by eliminating duplicate elements. Return: Array
collect_list(col): Returns a collection of elements in a group as a list including duplicate elements. Return: Array
ntile(INTEGER x): Assigns the bucket number to each row in a partition after partitioning into x groups. Return: INTEGER
Aggregate Functions in Hive
 Hive Select Count and Count Distinct
 Syntax:
count(*)
count(expr)
count(DISTINCT expr[, expr...])

 Return: BIGINT
 count(*) – Returns the count of all rows in a table including rows containing NULL values.
 count(expr) – Returns the total number of rows for expression excluding null.
 count(DISTINCT expr[, expr]) – Returns the count of distinct rows of expression (or expressions) excluding
null values.

 Example:
hive>select count(*) from employee;
hive>select count(salary) from employee;
hive>select count(distinct gender, salary) from employee;
Aggregate Functions in Hive
 Hive Sum of a Column and sum of Distinct Columns
 Syntax:
sum(col)
sum(DISTINCT col)
 Return: DOUBLE
 Example: Returns the total sum of the elements in the group or the sum of the distinct values of the column in the group.
hive>select sum(salary) from employee;
hive>select sum(distinct salary) from employee;
hive>select age,sum(salary) from employee group by age;

 Hive Average (Avg) of a Column & Average of Distinct Column


 Syntax:
avg(col)
avg(DISTINCT col)
 Return: DOUBLE
Aggregate Functions in Hive
 Example: Returns the average of the elements in the group or the average of the distinct values of the column in the group
hive> select avg(salary) from employee group by age;
hive>select avg(distinct salary) from employee;
hive>select age,avg(salary) from employee group by age;

 min(col) – Get Minimum value of a column


 Returns the minimum of the column in the group.
Return: DOUBLE
 Example:
 hive>select min(salary) from employee;

 max(col) – Get Maximum value of a column


 Returns the maximum of the column in the group.
Return: DOUBLE
 Example:
 hive>select max(salary) from employee;
Aggregate Functions in Hive
 collect_set(col) – Collapse the records by group and convert them into an array
 Returns a set of objects with duplicate elements eliminated.
 Return: Array
 Example:
 hive> select gender, collect_set(age) from employee group by gender;

 collect_list(col) – Collapse the records by group and convert them into an array
 Returns a list of objects with duplicates. (As of Hive 0.13.0.)
 Return: Array
 Example:
 hive> select gender, collect_list(age) from employee group by gender;
 variance(col), var_pop(col)
 The variance() and var_pop() aggregation functions return the statistical variance of a column in a group.
 Return: DOUBLE
 Example:
 hive> select variance(salary) from employee;
 hive> select var_pop(salary) from employee;
Aggregate Functions in Hive
 var_samp(col)
 The var_samp() function returns the unbiased sample variance of a column in a group.
 Return: DOUBLE
 Example:
 hive> select var_samp(salary) from employee;
 stddev_pop(col) – Get the standard deviation of a column
 The stddev_pop() aggregation function returns the statistical standard deviation of the numeric column values
provided in the group.
 Return: DOUBLE
 Example:
 hive> select stddev_pop(salary) from employee;
 stddev_samp(col) – Get the sample standard deviation of a column
 The stddev_samp() aggregation function returns the sample statistical standard deviation of the numeric column values
provided in the group.
 Return: DOUBLE
 Example:
 hive> select stddev_samp(salary) from employee;
Pig
 Apache Pig is an abstraction from MapReduce.
 It is a tool / platform for the analysis of larger data sets and their representation as data
streams.
 Pig is commonly used with Hadoop; we can use Apache Pig to do all data manipulation
operations in Hadoop.
 Pig provides a high-level language called Pig Latin for writing data analysis programs.
 The language offers a variety of operators that programmers can use to develop their own
functions for reading, writing, and processing data.
 To use Apache Pig to parse data, programmers must write scripts in Pig Latin.
 All of these scripts are converted internally into Map and Reduce tasks.
 Apache Pig has a component called Pig Engine that takes Pig Latin scripts as input and
converts these scripts into MapReduce jobs.
Need of PIG
 Programmers who are not very good at Java often encounter difficulties using Hadoop,
especially when performing MapReduce tasks.
 Apache Pig is the answer for all such programmers.
 Pig Latin allows programmers to easily perform MapReduce tasks without entering complex
code in Java.
 Apache Pig uses a multi-query approach, thereby reducing the length of code.
 For example, operations that require you to enter 200 lines of code (LoC) in Java can be easily
performed by entering just 10 lines of code in Apache Pig. It reduced development time by
almost 16 times.
 Pig Latin is a SQL-like language, it is easy to learn Apache Pig after familiarizing yourself with
SQL.
 It has many built-in operators to support data operations such as join, filter, sort, etc.
 It also provides nested data types such as tuples, bags, and maps that MapReduce lacks.
PIG Architecture
 Apache pig framework has below major components as
part of its Architecture:
 Parser
 Optimizer
 Compiler
 Execution Engine

 Parser
 Initially the Pig Scripts are handled by the Parser.
 It checks the syntax of the script, does type checking, and other
miscellaneous checks.
 The output of the parser will be a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical
operators.
 In the DAG, the logical operators of the script are represented
as the nodes and the data flows are represented as edges.
PIG Architecture
 Optimizer
 The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as
projection and pushdown.

 Compiler
 The compiler compiles the optimized logical plan into a series of MapReduce jobs.

 Execution engine
 Finally, the MapReduce jobs are submitted to Hadoop in sorted order. These MapReduce jobs are then
executed on Hadoop, producing the desired results.
Execution Types / Run Modes
 Apache Pig executes in two modes:
1. Local Mode
2. MapReduce Mode

Local Mode
 It executes in a single JVM and is used for development, experimenting and prototyping.
 Files are installed and run using localhost.
 The local mode works on a local file system.
 The input and output data are stored in the local file system.
 Command: $ pig -x local

MapReduce Mode
 The MapReduce mode is also known as Hadoop Mode.
 It is the default mode.
 In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
 It can be executed against a semi-distributed or fully distributed Hadoop installation.
 Here, the input and output data are present on HDFS.
 Command: $ pig or $ pig -x mapreduce
Ways to execute Pig Program
 These are the following ways of executing a Pig program on local and MapReduce mode: -
 Interactive Mode -
 In this mode, the Pig is executed in the Grunt shell.
 To invoke Grunt shell, run the pig command.
 Once the Grunt mode executes, we can provide Pig Latin statements and command interactively at the
command line.
 Batch Mode -
 In this mode, we can run a script file having a .pig extension.
 These files contain Pig Latin commands.
 Embedded Mode -
 In this mode, we can define our own functions.
 These functions can be called as UDF (User Defined Functions).
 Here, we use programming languages like Java and Python.
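 For instance, assuming a script file named wordcount.pig (the file name is illustrative), the interactive and batch modes can be launched as follows:
$ pig -x local                   # interactive: opens the Grunt shell in local mode
$ pig -x local wordcount.pig     # batch: runs the whole script in local mode
$ pig wordcount.pig              # batch: runs the script in MapReduce mode (the default)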
Data types
Data Type Description & Example
int Represents a signed 32-bit integer. Example : 8
long Represents a signed 64-bit integer. Example : 5L
float Represents a signed 32-bit floating point. Example : 5.5F
double Represents a 64-bit floating point. Example : 10.5
chararray Represents a character array (string) in Unicode UTF-8 format. Example : ‘tutorials point’
Bytearray Represents a Byte array (blob).
Boolean Represents a Boolean value. Example : true/ false.
Datetime Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00
Biginteger Represents a Java BigInteger. Example : 60708090709
Bigdecimal Represents a Java BigDecimal. Example : 185.98376256272893883
Tuple A tuple is an ordered set of fields. Example : (raja, 30)
Bag A bag is a collection of tuples. Example : {(raju,30),(Mohhammad,45)}
Map A Map is a set of key-value pairs. Example : [ ‘name’#’Raju’, ‘age’#30]
Null Values Values for all the above data types can be NULL. A null can be an unknown value or a non-existent value.
Shell & Utility Commands in PIG
 Shell Commands
 In order to write Pig Latin scripts, we use the Grunt shell of Apache Pig.
 Before that, note that by using sh and fs we can invoke any shell commands.

(a) sh Command
 we can invoke any shell commands from the Grunt shell, using the sh command. But make sure, we cannot
execute the commands that are a part of the shell environment (ex − cd), using the sh command.
 Syntax
grunt> sh shell command parameters

Example
By using the sh option, we can invoke the ls command of Linux shell from the Grunt shell.
Here, it lists out the files in the /pig/bin/ directory.
 grunt> sh ls
 pig
 pig-1444799121955.log
 pig.cmd
 pig.py
Shell & Utility Commands in PIG
(b) fs Command
 we can invoke any fs Shell commands from the Grunt shell by using the fs command.
 Syntax
grunt> fs File System command parameters

Example
By using fs command, we can invoke the ls command of HDFS from the Grunt shell. Here, it lists the files in the HDFS
root directory.
 grunt> fs -ls

 Found 3 items
 drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
 drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen-data
 drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter-data
 Similarly, using the fs command we can invoke all the other file system shell commands from the Grunt shell.
Shell & Utility Commands in PIG
 Utility Commands
 It offers a set of Pig Grunt Shell utility commands. It involves clear, help, history, quiet, and set.
 Also, there are some commands to control Pig from the Grunt shell, such as exec, kill, and run. Here is the
description of the utility commands offered by the Grunt shell.
(a) clear Command
 In order to clear the screen of the Grunt shell, we use Clear Command.
Syntax: grunt> clear

(b) help Command


 Prints a list of Pig commands or properties.
Syntax : grunt> help [properties]

(C) history Command


 Display the list of statements used so far.
Syntax: history [-n]
Shell & Utility Commands in PIG
Example: we have executed three statements:
(i) grunt> customers = LOAD 'hdfs://localhost:9000/pig-data/customers.txt' USING PigStorage(',');
(ii) grunt> orders = LOAD 'hdfs://localhost:9000/pig-data/orders.txt' USING PigStorage(',');
(iii) grunt> Employee = LOAD 'hdfs://localhost:9000/pig-data/Employee.txt' USING PigStorage(',');

Then, using the history command will produce the following output.
grunt> history
customers = LOAD 'hdfs://localhost:9000/pig-data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig-data/orders.txt' USING PigStorage(',');
Employee = LOAD 'hdfs://localhost:9000/pig-data/Employee.txt' USING PigStorage(',');

(D) set Command


 To show/assign values to keys, we use set command in Pig.

Syntax:
set [key 'value']
Shell & Utility Commands in PIG
Key (Value): Description
default_parallel (a whole number): Sets the number of reducers for all MapReduce jobs generated by Pig.
debug (on/off): Turns debug-level logging on or off.
job.name (single-quoted string that contains the job name): Sets a user-specified name for the job.
job.priority (acceptable values, case insensitive: very_low, low, normal, high, very_high): Sets the priority of a Pig job.
stream.skippath (string that contains the path): For streaming, sets the path from which not to ship data.

(E) quit Command


 Quits from the Pig grunt shell.

Syntax:
grunt> quit
Shell & Utility Commands in PIG
(F) exec Command : We can execute Pig scripts from the Grunt shell.
Syntax:
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]

(G) kill Command : We can kill a job from the Grunt shell.
Syntax:
grunt> kill JobId

(H) run Command : Run a Pig script.


Syntax:
grunt> run [-param param_name = param_value] [-param_file file_name] script
PIG Latin Relational Operators
(A) ORDER: This helps to sort the data based on Ascending and Descending manner.
1. Sorting for numerical fields are based on numerically.
2. Sorting for chararray fields are based on lexically.
3. Sorting for bytearray fields are based on lexically.
4. Nulls are considered to be smaller than other values. Therefore they always come first (when ascending) or last (when
descending) in the results.

(B) LIMIT: simply limits the number of records to display.

(C) DISTINCT: removes the duplicate data.

(D) Group: Grouping the data from large pool of datasets.


PIG Latin Example
1. Load the data into a bag named "lines". The entire line is assigned to the element 'line' of type chararray.
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);
2. The text in the bag lines needs to be tokenized; this produces one word per row.
grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token: chararray;
3. To retain the first letter of each word, type the command below. This command uses the SUBSTRING function to take the
first character.
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter: chararray;
4. Create a bag for unique character where the grouped bag will contain the same character for each occurrence of
that character.
grunt>lettergrp = GROUP letters by letter;
5. The number of occurrences is counted in each group.
grunt>countletter = FOREACH lettergrp GENERATE group , COUNT(letters);
6. Arrange the output according to count in descending order using the commands below.
grunt>OrderCnt = ORDER countletter BY $1 DESC;
7. Limit to One to give the result.
grunt> result =LIMIT OrderCnt 1;
8. Store the result in HDFS. The result is saved in the output directory under the sonoo folder.
grunt> STORE result INTO '/home/sonoo/output';
PIG UDF (User Defined Functions)
 Apache Pig offers extensive support for Pig UDF, in addition to the built-in functions. Basically, we can define our
own functions and use them, using these UDF’s.
 Moreover, in six programming languages, UDF support is available. Such as Java, Jython, Python, JavaScript, Ruby,
and Groovy.
 However, we can say, complete support is only provided in Java. While in all the remaining languages limited
support is provided.
 In addition, we can write UDF’s involving all parts of the processing like data load/store, column transformation,
and aggregation, using Java.
 Also, we have a Java repository for UDF’s named Piggybank, in Apache Pig.
 Basically, we can access Java UDF’s written by other users, and contribute our own UDF’s, using Piggybank.

 Types of Pig UDF in Java


 We can create and use several types of functions while writing Pig UDF using Java, such as:

 Filter Functions:
 In filter statements, we use the filter functions as conditions. Basically, it accepts a Pig value as input and returns a
Boolean value.
PIG UDF (User Defined Functions)
 Eval Functions:
 In FOREACH GENERATE statements, we use the Eval functions. Basically, it accepts a Pig value as input and returns a Pig
result.
 Algebraic Functions:
In a FOREACH GENERATE statement, we use the Algebraic functions, which act on inner bags. Basically, to perform full MapReduce
operations on an inner bag, we use these functions.

 Example of Pig UDF:


Example of a simple EVAL Function to convert the provided string to uppercase.

 In Pig,
 All UDFs must extend "org.apache.pig.EvalFunc"
 All functions must override the "exec" method.
PIG UDF (User Defined Functions)
TestUpper.java
package com.hadoop;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {
    // exec() is called once per input tuple; it returns the uppercase form of the first field.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
PIG UDF (User Defined Functions)
Create the jar file and export it into the specific
directory. For that ,right click on project - Export
- Java - JAR file - Next.

Now, provide a specific name to the jar file and


save it in a local system directory.
PIG UDF (User Defined Functions)
Create a text file in your local machine and
insert the list of tuples.

$ nano pigsample

Upload the text files on HDFS in the specific


directory.
$ hdfs dfs -put pigsample /pigexample

Create a pig file in your local machine and write


the script.
$ nano pscript.pig
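The slide shows the script as an image; a minimal sketch of what pscript.pig could contain (the jar path, alias name, and input schema are assumptions) is:
REGISTER '/home/user/upperudf.jar';
DEFINE UPPERCASE com.hadoop.TestUpper();
A = LOAD '/pigexample/pigsample' USING PigStorage(',') AS (id:int, name:chararray);
B = FOREACH A GENERATE id, UPPERCASE(name);
DUMP B;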
PIG UDF (User Defined Functions)
Now, run the script in the terminal to get the
output.
$pig pscript.pig
PIG File Loader
The Apache Pig LOAD operator is used to load the data from the file system.
Syntax: LOAD 'info' [USING FUNCTION] [AS SCHEMA];
Here,
 LOAD is a relational operator.
 'info' is a file that is required to load. It contains any type of data.
 USING is a keyword.
 FUNCTION is a load function.
 AS is a keyword.
 SCHEMA is a schema of passing file, enclosed in parentheses.

Example: To load the text file data from the file system.
1) Create a text file in your local machine and provide some values to it.
$ nano pload.txt
PIG File Loader
2) Check the values written in the text files.
$ cat pload.txt

3) Upload the text files on HDFS in the specific directory.


$ hdfs dfs -put pload.txt /pigexample

4) Open the pig MapReduce run mode.


$ pig

5) Load the file that contains the data.


grunt> A = LOAD '/pigexample/pload.txt' USING PigStorage(',') AS
(a1:int,a2:int,a3:int,a4:int) ;

6) Now, execute and verify the data.


grunt> DUMP A;
PIG File Loader
7) Let's check the corresponding schema.
grunt> DESCRIBE A;
Limitations of Apache Pig
i. Errors of Pig
 Errors that Pig produces due to UDFs (Python) are not helpful at all. At times, when something goes wrong, it just gives an error
such as "exec error in UDF", even if the problem is only a syntax or type error, let alone a logical one.
ii. Not mature
 Pig is still under development, even though it has been around for quite some time.
iii. Support
 Generally, Google and StackOverflow do not lead to good solutions for the problems.
iv. Implicit data schema
 In Apache Pig, the data schema is not enforced explicitly but implicitly. This is also a huge disadvantage. As it does not enforce an
explicit schema, sometimes a data structure becomes a bytearray, which is a "raw" data type.
 Until we coerce the fields, even strings turn into bytearrays without notice, and this propagates to the other
steps of the data processing.
v. Minor one
 There is no good IDE or Vim plugin that offers more functionality than syntax completion for writing Pig
scripts.
vi. Delay in execution
 The commands are not executed unless we either dump or store an intermediate or final result. This increases the iterations
between debugging and resolving an issue.
Basics of HBase
 HBase is a data model, similar to Google's Bigtable, designed to provide
quick random access to huge amounts of structured data.
 HBase is a distributed column-oriented database based on the
Hadoop file system.
 It is an open source project and can be scaled horizontally.
 It leverages the fault tolerance of the Hadoop Distributed File System (HDFS).
 It is part of the Hadoop ecosystem and provides real-time random
read/write access to data in the Hadoop file system.
 You can save data in HDFS directly or through HBase. Data
consumers randomly read data in HDFS or access it through
HBase.
 HBase resides on the Hadoop file system and provides read and
write access.
Storage Mechanism in HBase
 HBase is a column-oriented database and the tables in it are sorted by row.
 The table schema defines only the column families, which are the key-value pairs.
 A table has several column families and each column family can have any number of columns.
 Column values of the same column family are stored contiguously on disk.
 Each cell value in the table has a timestamp.
 In summary, in HBase:
 A table is a collection of rows.
 A row is a collection of column families.
 A column family is a collection of columns.
 A column is a collection of key-value pairs.
Example

The coordinates used to identify a cell in an HBase table are (1) row key, (2) column family, (3) column qualifier, and (4) version.
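As an illustration of these coordinates, here is a minimal sketch using the HBase Java client API (not from the slides; the table name "employee", row key "row1", column family "personal" and qualifier "name" are hypothetical). It reads one cell by supplying exactly these coordinates; the latest version is returned by default.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadCell {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            Get get = new Get(Bytes.toBytes("row1"));                        // (1) row key
            get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name")); // (2) family, (3) qualifier
            Result result = table.get(get);                                  // latest (4) version by default
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
        }
    }
}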
HBase Architecture
 HBase has four major components:
 Client
 HMaster
 Region Servers
 Zookeeper
HBase Components
 Master Server
 Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
 Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the
regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing.
 It is responsible for schema changes and other metadata operations, such as the creation of tables and column
families (a sketch follows after this list).
 Regions
 Regions are nothing but tables that are split up and spread across the region servers.
 Region server
 Region servers host regions; they communicate with clients and handle data-related operations.
 They handle read and write requests for all the regions under them.
 They decide the size of the regions by following the region size thresholds.
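As a sketch of such a metadata operation carried out by the HMaster (not from the slides; HBase 2.x Admin API, with the hypothetical table "employee" and column family "personal"), the client asks the cluster to create a table and the master performs the schema change:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Create table "employee" with a single column family "personal",
            // unless it already exists. The HMaster executes this metadata operation.
            TableName name = TableName.valueOf("employee");
            if (!admin.tableExists(name)) {
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                        .build());
            }
        }
    }
}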
Region Server & Region

Table     - An HBase table present in the HBase cluster
Region    - The HRegions for the presented tables
Store     - One store per column family for each region of the table
MemStore  - One MemStore for each store of each region of the table; it sorts data before flushing into HFiles, which improves write and read performance
StoreFile - StoreFiles for each store for each region of the table
Block     - Blocks present inside the StoreFiles
HBase Write & Read
1. When a client wants to write data, it first communicates with the region server and then with the region.
2. The region contacts the MemStore associated with the column family to store the data.
3. The data is first stored in the MemStore, where it is sorted, and after that it is flushed into an HFile. The main reason for using the MemStore is to accumulate data sorted by row key before it is written to the distributed file system. The MemStore resides in the region server's main memory, while HFiles are written to HDFS.
4. When a client wants to read data from regions,
5. it first requests the data from the MemStore,
6. and then approaches the HFiles to get the data. The data is fetched and returned to the client.
 The MemStore holds in-memory modifications to the store. The hierarchy of objects in an HBase region is shown from top to bottom in the Region Server & Region table above.
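From the client's side, the write path boils down to issuing a Put; a minimal sketch with the HBase Java client API follows (not from the slides; the table, row and column names are the same hypothetical ones as above). On the server side, the region server buffers the edit in its MemStore and later flushes it to an HFile on HDFS, as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteCell {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            Put put = new Put(Bytes.toBytes("row1"));          // row key
            put.addColumn(Bytes.toBytes("personal"),           // column family
                          Bytes.toBytes("name"),               // column qualifier
                          Bytes.toBytes("Alice"));             // value
            // The region server hosting this row buffers the edit in its MemStore
            // (sorted by row key) and later flushes it to an HFile on HDFS.
            table.put(put);
        }
    }
}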
Basics of Zookeeper
 Zookeeper is a distributed coordination service for managing a large
set of hosts.
 Coordinating and managing a service in a distributed environment
is a complicated process.
 Zookeeper solves this issue with its simple architecture and API.
 Zookeeper allows developers to focus on core application logic
without worrying about the distributed nature of the application.
 The ZooKeeper framework was originally built at Yahoo! for accessing
their applications in an easy and robust manner. Subsequently,
Apache ZooKeeper became a standard for the coordination
services used by Hadoop, HBase and other distributed frameworks.
 For example, Apache HBase uses ZooKeeper to track the status of
distributed data.
Service of Zookeeper
 Apache Zookeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques.
 Naming service
 Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
 Configuration Management
 Latest and up-to-date configuration information of the system for a joining node.
 Cluster management
 Joining / leaving of a node in a cluster and node status at real time.
 Leader election
 Electing a node as leader for coordination purpose.
 Locking and synchronization service
 Locking the data while modifying it. This mechanism helps in automatic failure recovery while connecting
to other distributed applications like Apache HBase.
 Highly reliable data registry
 Availability of data even when one or a few nodes are down.
Benefits of Zookeeper
 Simple Distributed Coordination Process
 Synchronization
 Mutual exclusion and co-operation between server processes. This process helps in Apache HBase for
configuration management.
 Ordered Messages
 Serialization
 Encoding the data according to specific rules ensures that your application runs consistently. This approach can be
used in MapReduce to coordinate queues for executing running threads.
 Reliability
 Atomicity
 Data transfer either succeed or fail completely, but no transaction is partial.
Zookeeper in HBase
 Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
 Zookeeper has temporary nodes representing different region servers. Master servers use
these nodes to discover available servers.
 In addition to availability, the nodes are also used to track server failures or network partitions.
 Clients use ZooKeeper to locate region servers and then communicate with them.
 In pseudo and standalone modes, HBase itself will take care of zookeeper.
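A minimal sketch of this ephemeral-znode pattern with the ZooKeeper Java API (not from the slides; the connect string, the /servers path and the server names are hypothetical): a server registers itself with an EPHEMERAL znode, and a master-like process lists the children to discover which servers are alive.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServerRegistry {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ensemble and wait until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Make sure the parent znode exists (PERSISTENT, survives sessions).
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // A region-server-like process registers itself with an EPHEMERAL znode;
        // the znode disappears automatically when the server's session ends.
        zk.create("/servers/rs1", "host1:16020".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A master-like process lists the children of /servers to discover live servers.
        List<String> live = zk.getChildren("/servers", false);
        System.out.println("Live servers: " + live);

        zk.close();
    }
}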
Zookeeper Architecture
Client   - Clients, the nodes in our distributed application cluster, access information from the server. At regular intervals, every client sends a message to the server to let the server know that the client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.
Server   - A server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. It sends an acknowledgement to the client to confirm that the server is alive.
Ensemble - A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.
Leader   - The server node which performs automatic recovery if any of the connected nodes fail. Leaders are elected on service startup.
Follower - A server node which follows the leader's instructions.
Zookeeper Data Model
 The ZooKeeper file system uses a tree structure for its in-memory
representation. A ZooKeeper node is referred to as a znode.
 Every znode is identified by a name and addressed by a path whose
components are separated by "/".
 First, you have a root znode separated by "/". Under the root, there are two
logical namespaces, config and workers.
 The config namespace is used for centralized configuration
management and the workers namespace is used for naming.
 Under the config namespace, each znode can store up to 1 MB of data. This is
similar to the UNIX file system, except that a parent znode can store data as
well.
 The main purpose of this structure is to store synchronized data
and to describe the metadata of the znode.
 A znode's metadata consists of a version number, an access control list (ACL),
a timestamp, and the data length.
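A minimal sketch with the ZooKeeper Java API (not from the slides; the connect string and the /config/app1 path are hypothetical) that creates a znode under a config-style namespace and reads back its data together with the metadata held in the Stat object (version, modification timestamp, data length):

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeMetadata {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ensemble and wait until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create /config and a child znode that stores a small configuration value.
        if (zk.exists("/config", false) == null) {
            zk.create("/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists("/config/app1", false) == null) {
            zk.create("/config/app1", "batch.size=100".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back together with the znode's metadata (Stat).
        Stat stat = new Stat();
        byte[] data = zk.getData("/config/app1", false, stat);
        System.out.println("data        = " + new String(data, StandardCharsets.UTF_8));
        System.out.println("version     = " + stat.getVersion());    // version number
        System.out.println("mtime       = " + stat.getMtime());      // last-modified timestamp
        System.out.println("data length = " + stat.getDataLength()); // data length in bytes

        zk.close();
    }
}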
