GTU #3170722
Unit-4
PIG: PIG Architecture & Data types, Shell and Utility components, PIG Latin Relational Operators,
PIG Latin: File Loaders and UDF, Programming structure in UDF, PIG Jars Import, limitations of PIG
HDFS follows the Write Once, Read Many model. So, we can’t edit files that are already stored in HDFS, but we can
append new data to them by reopening the file.
This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across
all the data nodes in the cluster.
Thus, it increases the availability, scalability, and throughput of the system.
Replica Placement Strategy
HDFS as the name says is a distributed file system which is designed to store large files.
A large file is divided into blocks of defined size and these blocks are stored across machines in a cluster.
These blocks of the file are replicated for reliability and fault tolerance.
Sqoop:
“SQL to Hadoop and Hadoop to SQL” Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.
It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from
Hadoop file system to relational databases.
It is provided by the Apache Software Foundation.
Features of Sqoop
How Sqoop Works?
The following command is used to import empadd table data into ‘/queryresult’ directory.
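A minimal sketch of such an import command, assuming a MySQL userdb database accessed as user root (the connection details are illustrative, not from the slides):
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table empadd \
--m 1 \
--target-dir /queryresult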
The following command is used to verify the imported data in the /queryresult directory from the empadd table.
$HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
import command in Sqoop
Import Subset of Table Data
We can import a subset of a table using the ‘where’ clause in Sqoop import tool. It executes the
corresponding SQL query in the respective database server and stores the result in a target directory in
HDFS.
The syntax for where clause is as follows.
--where <condition>
The following command is used to import a subset of empadd table data. The subset query is
to retrieve the employee id and address, who lives in Secunderabad city.
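A hedged sketch of this subset import (the connection details and the exact city value in the WHERE condition are assumptions):
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table empadd \
--m 1 \
--where "city = 'Secunderabad'" \
--target-dir /wherequery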
The following command is used to verify the imported data in /wherequery directory from the
empadd table.
$HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*
import command in Sqoop
Incremental Import
Incremental import is a technique that imports only the newly added rows in a table.
It is required to add ‘incremental’, ‘check-column’, and ‘last-value’ options to perform the incremental import.
The following syntax is used for the incremental option in Sqoop import command.
--incremental <mode> --check-column <column name> --last-value <last check column value>
Let us assume the newly added data into emp table is as follows −
1206, satish p, grp des, 20000, GR
The following command is used to perform the incremental import into the emp table.
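A minimal sketch of the incremental import, assuming the id column is the check column and 1205 was the last value imported earlier (both assumptions for illustration):
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205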
The following command is used to verify the imported data from emp table to HDFS emp/
directory.
$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
import command in Sqoop
The following command is used to see the modified or newly added rows from the emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1
Note − If you are using the import-all-tables, it is mandatory that every table in that database must have a
primary key field.
import command in Sqoop
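All tables of a database are imported with the import-all-tables tool; a hedged sketch (connection details assumed) is:
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root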
The following command is used to verify all the imported table data of the userdb database in HDFS.
$HADOOP_HOME/bin/hadoop fs -ls
It will show you the list of table names in userdb database as directories.
Output
drwxr-xr-x - hadoop supergroup 0 2014-12-22 22:50 -sqoop
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:50 empadd
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:52 empcontact
export command in Sqoop
The export command works as the reverse of the import operation.
Using export command, we can transfer the data from the Hadoop database file system to the
Relational database management system.
The data to be exported is processed into records before the operation is completed.
The export of data is done with two steps, first is to examine the database for metadata and
second step involves migration of data.
Each table data is stored in a separate directory and the directory name is same as the table
name.
Syntax:
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Example
Let us take an example of the employee data in file, in HDFS. The employee data is available in empdata file
in ‘emp/’ directory in HDFS. The empdata is as follows.
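The contents of the empdata file are not reproduced on the slide; illustrative records (assumed values, following the emp examples used earlier) might look like:
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR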
It is mandatory that the table to be exported is created manually and is present in the database from where it
has to be exported.
export command in Sqoop
The following query is used to create the table ‘employee’ in mysql command line.
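A hedged sketch of such a table definition (the column names and types are assumptions chosen to match the empdata fields above):
mysql> USE db;
mysql> CREATE TABLE employee (
   id INT NOT NULL PRIMARY KEY,
   name VARCHAR(20),
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));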
The following command is used to export the table data (which is in empdata file on HDFS) to
the employee table in the db database of MySQL.
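A minimal sketch of the export command (the connection details and the HDFS path of the empdata file are assumptions):
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/empdata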
The following command is used to verify the table in mysql command line.
mysql>select * from employee;
HIVE
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Bigdata and makes querying and analysing easy.
It is a platform used to develop SQL type scripts to do MapReduce operations.
Initially, Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive.
It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Feature of Hive
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive Architecture
The following component diagram depicts the architecture of Hive:
Hive Architecture
This component diagram contains different units. The following table describes each unit:
User Interface (UI): Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for a MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store data into the file system.
Working with Hive
The following diagram depicts the workflow between Hive and Hadoop.
Working with Hive
The following table defines how Hive interacts with Hadoop framework:
Step 1: Execute Query. The Hive interface such as Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2: Get Plan. The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
Step 3: Get Metadata. The compiler sends a metadata request to the Metastore (any database).
Step 4: Send Metadata. The Metastore sends the metadata as a response to the compiler.
Step 5: Send Plan. The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6: Execute Plan. The driver sends the execute plan to the execution engine.
Working with Hive
Step 7: Execute Job. Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1: Metadata Ops. Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
Step 8: Fetch Result. The execution engine receives the results from the Data nodes.
Step 9: Send Results. The execution engine sends those resultant values to the driver.
Hive vs. Traditional Database
Hive is very easily scalable at low cost; a traditional database is not very scalable and is costly to scale up.
Hive is based on the Hadoop notion of write once, read many times; in a traditional database we can read and write many times.
Record-level updates are not possible in Hive; in a traditional database, record-level updates, insertions and deletes, transactions and indexes are possible.
OLTP (On-line Transaction Processing) is not yet supported in Hive, but OLAP (On-line Analytical Processing) is supported; both OLTP and OLAP are supported in an RDBMS.
HiveQL Querying Data
HiveQL is the HIVE QUERY LANGUAGE.
Hive offers no support for row-level inserts, updates, and deletes. Hive does not support transactions.
Hive adds extensions to provide better performance in the context of Hadoop and to integrate with
custom extensions and even external programs.
DDL and DML are the parts of HiveQL.
Data Definition Language (DDL) is used for creating, altering and dropping databases, tables, views,
functions and indexes.
Data manipulation language (DML) is used to put data into Hive tables and to extract data to the file
system and also how to explore and manipulate data with queries, grouping, filtering, joining etc.
Hive queries provide the following features:
Data modeling such as Creation of databases, tables, etc.
ETL functionalities such as Extraction, Transformation, and Loading data into tables
Joins to merge different data tables
User specific custom scripts for ease of code
Faster querying tool on top of Hadoop
HiveQL Querying Data
Create Database Statement
Create Database is a statement used to create a database in Hive.
A database in Hive is a namespace or a collection of tables.
Syntax: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause, which prevents an error when a database with the same name
already exists. We can use SCHEMA in place of DATABASE in this command.
The following query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
OR
hive> CREATE SCHEMA userdb;
The following query is used to verify a databases list:
hive> SHOW DATABASES;
default
userdb
Create Table in Hive
Creating Table in Hive
Create Table is a statement used to create a table in Hive.
Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db-name.] table-name
[(col-name datatype [COMMENT col-comment], ...)]
[COMMENT table-comment]
[ROW FORMAT row-format]
[STORED AS file-format]
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Load Data in Hive
Load Data Statement
Generally, after creating a table in SQL, we can insert data using the Insert statement. But in Hive, we can
insert data using the LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load
data: one is from local file system and second is from Hadoop file system.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Where,
• LOCAL is identifier to specify the local path. It is optional.
• OVERWRITE is optional to overwrite the data in the table.
• PARTITION is optional.
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' INTO TABLE employee;
Order by and Sort by Clause in Hive
ORDER BY and SORT BY Clause
By using the HiveQL ORDER BY and SORT BY clauses, we can apply a sort on a column. The result set is returned
in either ascending or descending order.
In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence, the complete data
is passed through a single reducer. This may take much time in the execution of large datasets. However, we
can use LIMIT to minimize the sorting time.
Example: Let's see an example to arrange the data in the sorted order by using ORDER BY clause.
Step 1: Select the database in which we want to create a table.
hive> use hiveql;
Step 2: create a table by using the following command:
hive> create table emp (Id int, Name string , Salary float, Department string)
row format delimited
fields terminated by ',' ;
Step 3: Load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table emp;
Step 4: To fetch the data in the descending order by using the following command.
hive> select * from emp order by salary desc;
Order by and Sort by Clause in Hive
SORT BY Clause
The HiveQL SORT BY clause is an alternative of ORDER BY clause. It orders the data within each reducer.
Hence, it performs the local ordering, where each reducer's output is sorted separately. It may also give a
partially ordered result.
Example: Let's see an example to arrange the data in sorted order by using the SORT BY clause, reusing the emp table created above.
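A hedged sketch of such a query (it assumes the same hiveql database and emp table loaded in the ORDER BY example):
hive> use hiveql;
hive> select * from emp sort by salary desc;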
Aggregate Functions in Hive
COUNT(): Returns the count of all rows in a table, including rows containing NULL values. When you specify a column as an input, it ignores NULL values in the column for the count. It also ignores duplicates when DISTINCT is used. Return: BIGINT
SUM(): Returns the sum of all values in a column. When used with a group, it returns the sum for each group. It also ignores duplicates when DISTINCT is used. Return: DOUBLE
AVG(): Returns the average of all values in a column. When used with a group, it returns an average for each group. Return: DOUBLE
Aggregate Functions in Hive
Hive Aggregate Functions Syntax & Description
corr(col1, col2): Returns the Pearson coefficient of correlation of a pair of numeric columns in the group. Return: DOUBLE
percentile(BIGINT col, p): For each group, it returns the exact percentile of a column. p must be between 0 and 1. Return: DOUBLE
percentile(BIGINT col, array(p1[, p2]...)): Returns the exact percentiles p1, p2, ... of a column in the group. Each pi must be between 0 and 1. Return: array<double>
regr_avgx(independent, dependent): Equivalent to avg(dependent). Return: DOUBLE
regr_avgy(independent, dependent): Equivalent to avg(independent). Return: DOUBLE
regr_count(independent, dependent): Returns the number of non-null pairs used to fit the linear regression line. Return: DOUBLE
regr_intercept(independent, dependent): Returns the y-intercept of the linear regression line. Return: DOUBLE
regr_r2(independent, dependent): Returns the coefficient of determination for the regression. As of Hive 2.2.0. Return: DOUBLE
Aggregate Functions in Hive
Hive Aggregate Functions Syntax & Description
count(*) – Returns the count of all rows in a table, including rows containing NULL values.
count(expr) – Returns the total number of rows for the expression, excluding nulls.
count(DISTINCT expr[, expr]) – Returns the count of distinct rows of the expression (or expressions), excluding
null values.
Return: BIGINT
Example:
hive>select count(*) from employee;
hive>select count(salary) from employee;
hive>select count(distinct gender, salary) from employee;
Aggregate Functions in Hive
Hive Sum of a Column and sum of Distinct Columns
Syntax:
sum(col)
sum(DISTINCT col)
Return: DOUBLE
Returns the total sum of the elements in the group or the sum of the distinct values of the column in the group.
Example:
hive>select sum(salary) from employee;
hive>select sum(distinct salary) from employee;
hive>select age,sum(salary) from employee group by age;
Parser
Initially the Pig Scripts are handled by the Parser.
It checks the syntax of the script, does type checking, and other
miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical
operators.
In the DAG, the logical operators of the script are represented
as the nodes and the data flows are represented as edges.
PIG Architecture
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as
projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order. These MapReduce jobs are then executed
on Hadoop to produce the desired results.
Execution Types / Run Modes
Apache Pig executes in two modes:
1. Local Mode
2. MapReduce Mode
Local Mode
It executes in a single JVM and is used for development, experimenting, and prototyping.
Files are installed and run using localhost.
The local mode works on a local file system; the input and output data are stored in the local file system.
Command: $ pig -x local
MapReduce Mode
The MapReduce mode is also known as Hadoop Mode. It is the default mode.
In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
It can be executed against a semi-distributed or fully distributed Hadoop installation.
Here, the input and output data are present on HDFS.
Command: $ pig or $ pig -x mapreduce
Ways to execute Pig Program
The following are the ways of executing a Pig program in local and MapReduce mode:
Interactive Mode -
In this mode, the Pig is executed in the Grunt shell.
To invoke Grunt shell, run the pig command.
Once the Grunt mode executes, we can provide Pig Latin statements and command interactively at the
command line.
Batch Mode -
In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
Embedded Mode -
In this mode, we can define our own functions.
These functions can be called as UDF (User Defined Functions).
Here, we use programming languages like Java and Python.
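A hedged illustration of the first two modes (the script name sample_script.pig is an assumption):
Interactive: $ pig -x local opens the Grunt shell, where statements such as lines = LOAD 'data.txt' AS (line:chararray); can be typed one by one.
Batch: $ pig -x mapreduce sample_script.pig runs the Pig Latin statements stored in the script file.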
Data types
Data Type Description & Example
int Represents a signed 32-bit integer. Example : 8
long Represents a signed 64-bit integer. Example : 5L
float Represents a signed 32-bit floating point. Example : 5.5F
double Represents a 64-bit floating point. Example : 10.5
chararray Represents a character array (string) in Unicode UTF-8 format. Example : ‘tutorials point’
Bytearray Represents a Byte array (blob).
Boolean Represents a Boolean value. Example : true/ false.
Datetime Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00
Biginteger Represents a Java BigInteger. Example : 60708090709
Bigdecimal Represents a Java BigDecimal. Example : 185.98376256272893883
Tuple A tuple is an ordered set of fields. Example : (raja, 30)
Bag A bag is a collection of tuples. Example : {(raju,30),(Mohhammad,45)}
Map A Map is a set of key-value pairs. Example : [ ‘name’#’Raju’, ‘age’#30]
Null Values Values for all the above data types can be NULL. A null can be an unknown value or a non-existent value.
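As a small sketch of how these types appear in practice, a LOAD statement can attach a schema that uses them (the file name and field names here are assumptions):
grunt> student = LOAD 'student.txt' USING PigStorage(',')
AS (id:int, name:chararray, gpa:float, details:map[]);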
Shell & Utility Commands in PIG
Shell Commands
In order to write Pig Latin scripts, we use the Grunt shell of Apache Pig.
Before that, note that by using the sh and fs commands we can invoke shell and file system commands from it.
(a) sh Command
We can invoke any shell command from the Grunt shell using the sh command. But note that we cannot
execute commands that are part of the shell environment (e.g., cd) using the sh command.
Syntax
grunt> sh shell command parameters
Example
By using the sh option, we can invoke the ls command of Linux shell from the Grunt shell.
Here, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig-1444799121955.log
pig.cmd
pig.py
Shell & Utility Commands in PIG
(b) fs Command
we can invoke any fs Shell commands from the Grunt shell by using the fs command.
Syntax
grunt> fs File System command parameters
Example
By using fs command, we can invoke the ls command of HDFS from the Grunt shell. Here, it lists the files in the HDFS
root directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen-data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter-data
Similarly, using the fs command we can invoke all the other file system shell commands from the Grunt shell.
Shell & Utility Commands in PIG
Utility Commands
The Grunt shell offers a set of utility commands. These include clear, help, history, quit, and set.
Also, there are some commands to control Pig from the Grunt shell, such as exec, kill, and run. Here is the
description of the utility commands offered by the Grunt shell.
(a) clear Command
In order to clear the screen of the Grunt shell, we use the clear command.
Syntax: grunt> clear
history Command
This command displays a list of the statements executed so far since the Grunt shell was invoked. For example, after loading three files, the history command produces the following output.
grunt> history
customers = LOAD 'hdfs://localhost:9000/pig-data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig-data/orders.txt' USING PigStorage(',');
Employee = LOAD 'hdfs://localhost:9000/pig-data/Employee.txt' USING PigStorage(',');
set Command
The set command is used to show or assign values to keys used in Pig.
Syntax:
grunt> set [key 'value']
Shell & Utility Commands in PIG
Keys, their values, and what they control:
default_parallel (a whole number): Sets the number of reducers for all MapReduce jobs generated by Pig.
debug (on/off): Turns debug-level logging on or off.
job.name (a single-quoted string that contains the job name): Sets a user-specified name for the job.
job.priority (acceptable values, case insensitive: very_low, low, normal, high, very_high): Sets the priority of a Pig job.
stream.skippath (a string that contains the path): For streaming, sets the path from which not to ship data.
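For example (the job name value here is an assumption):
grunt> set job.name 'unit4-demo-job'
grunt> set default_parallel 10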
quit Command
This command is used to quit from the Grunt shell.
Syntax:
grunt> quit
Shell & Utility Commands in PIG
(F) exec Command : We can execute Pig scripts from the Grunt shell.
Syntax:
grunt> exec [–param param-name = param-value] [–param-file file-name] [script]
(G) kill Command : We can kill a job from the Grunt shell.
Syntax:
grunt> kill JobId
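For instance, a stored script could be run and a running job killed as follows (the script path and job id are assumptions):
grunt> exec /sample_script.pig
grunt> kill job_201312091234_0005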
Filter Functions:
In filter statements, we use the filter functions as conditions. Basically, it accepts a Pig value as input and returns a
Boolean value.
PIG UDF (User Defined Functions)
Eval Functions:
In FOREACH GENERATE statements, we use the Eval functions. Basically, it accepts a Pig value as input and returns a Pig
result.
Algebraic Functions:
In a FOREACH GENERATE statement, we use the Algebraic functions act on inner bags. Basically, to perform full MapReduce
operations on an inner bag, we use these functions.
In Pig,
All UDFs must extend "org.apache.pig.EvalFunc"
All functions must override the "exec" method.
PIG UDF (User Defined Functions)
TestUpper.java
package com.hadoop;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts the first field of each input tuple to upper case.
public class TestUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null for empty or missing input tuples.
        if (input == null || input.size() == 0)
            return null;
        try {
            // The UDF expects the first field to be a chararray (String).
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
PIG UDF (User Defined Functions)
Create the jar file and export it into the specific directory. For that, right-click on the project - Export
- Java - JAR file - Next.
Then create a Pig script file that uses the UDF, for example:
$ nano pigsample
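The contents of such a script might look like the following sketch (the jar path, input file, and field name are assumptions for illustration):
REGISTER '/home/user/upper.jar';
A = LOAD '/pigexample/names.txt' USING PigStorage(',') AS (name:chararray);
B = FOREACH A GENERATE com.hadoop.TestUpper(name);
DUMP B;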
Example: To load the text file data from the file system.
1) Create a text file in your local machine and provide some values to it.
$ nano pload.txt
PIG File Loader
2) Check the values written in the text files.
$ cat pload.txt
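Once the file exists, it can be loaded with Pig's default loader, PigStorage; a hedged sketch (field names assumed, and Pig assumed to be running in local mode so the local path is visible):
grunt> pdata = LOAD '/home/user/pload.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP pdata;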
The coordinates used to identify data in an HBase table are (1) row key, (2) column family, (3) column qualifier, and (4) version.
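For example, in the HBase shell a single cell can be addressed with these coordinates (the table, column family, and qualifier names here are assumptions):
hbase> get 'employee', 'row1', {COLUMN => 'personal:city', VERSIONS => 2}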
HBase Architecture
HBase has four major components:
Client
HMaster
Region Servers
Zookeeper
HBase Components
Master Server
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the
regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
It is responsible for schema changes and other metadata operations such as the creation of tables and column
families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that communicate with the client and handle data-related operations.
They handle read and write requests for all the regions under them.
They decide the size of the regions by following the region size thresholds.
Region Server & Region