
UNIT-5

HADOOP RELATED TOOLS: HBase – data model and implementations –
HBase clients – HBase examples – praxis. Cassandra – Cassandra data model –
Cassandra examples – Cassandra clients – Hadoop integration. Pig – Grunt – Pig
data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data
types and file formats – HiveQL data definition – HiveQL data manipulation –
HiveQL queries.

Hbase
What is HBase?
HBase is an open-source, column-oriented distributed database system
that runs in a Hadoop environment. It is modeled on Google's Bigtable, is
primarily written in Java, and is needed for real-time Big Data applications.
HBase can store massive amounts of data, from terabytes to petabytes.
Tables in HBase can consist of billions of rows with millions of
columns. HBase is built for low-latency operations and has some
specific features compared to traditional relational models.

HBase Unique Features


 HBase is built for low latency operations
 HBase is used extensively for random read and write operations
 HBase stores a large amount of data in terms of tables
 Automatic and configurable sharding of tables
 HBase stores data in the form of key/value pairs in a columnar model
The following shows how column families are laid out in a column-oriented database:

Rowid | Column Family  | Column Family  | Column Family  | Column Family
      | col1 col2 col3 | col1 col2 col3 | col1 col2 col3 | col1 col2 col3
1     |                |                |                |

If this table is stored in a row-oriented database, it will store the records as
shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
A column-oriented database, in contrast, stores the same data as:
1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang

How HBase differs from a relational database (RDBMS):

 HBase is schema-less; an RDBMS has a fixed schema.
 HBase is a column-oriented data store; an RDBMS is a row-oriented data store.
 HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
 HBase tables are wide and sparsely populated; RDBMS tables are thin.
 HBase supports automatic and configurable partitioning (sharding) of tables; an RDBMS has no built-in support for partitioning.
 HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
 HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence may read unnecessary data if only some of the data in a row is required.
 Structured and semi-structured data can be stored and processed using HBase; structured data can be stored and processed using an RDBMS.
 HBase enables aggregation over many rows and columns; in an RDBMS aggregation is an expensive operation.

data model and implementations


The Data Model in HBase is designed to accommodate semi-structured data that
could vary in field size, data type and columns. Additionally, the layout of the
data model makes it easier to partition the data and distribute it across the
cluster. The Data Model in HBase is made of different logical components
such as Tables, Rows, Column Families, Columns, Cells and Versions.

Tables
HBase Tables are logical collections of rows stored in
separate partitions called Regions. Every Region is served
by exactly one Region Server.

Rows
A row is one instance of data in a table and is identified by a rowkey.
Rowkeys are unique in a Table and are always treated as a byte[].

Column Families
Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and the Columns in a family are stored
together in a low-level storage file known as an HFile.

For example, a table might have Customer and Sales Column Families.

Columns
A Column Family is made of one or more columns. A Column is
identified by a Column Qualifier that consists of the Column Family name
concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column
Family, and Rows within a table can have a varying number of Columns.

Cell :
A Cell stores data and is essentially a unique combination of rowkey,
Column Family and the Column (Column Qualifier). The data stored in a Cell is
called its value and the data type is always treated as byte[].

Version

 The data stored in a cell is versioned, and versions of data are identified
by the timestamp associated with them.
 By default the timestamp represents the time on the RegionServer when the data was
written, but you can specify a different timestamp value when you put
data into the cell.
 The number of versions kept per cell is configurable; by default it is 3.
 For example, if the value of some column of a row is 1234 and you then update it
to 5678, HBase will not overwrite 1234 with 5678. It keeps both values as
separate versions of the cell, distinguished by their timestamps; by default a
read returns only the most recent version (5678), and the older versions stay
hidden unless explicitly requested (see the Java sketch after this list).
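A minimal, hedged Java sketch tying these pieces together (rowkey, column family, column
qualifier, cell versions), written against the HBase 1.x-style client API. It assumes a
reachable HBase cluster configured via hbase-site.xml and a table named emp with a
"personal data" column family; both names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVersionSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {

            // Two puts to the same rowkey + column family + qualifier create two versions.
            Put first = new Put(Bytes.toBytes("1"));
            first.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(first);

            Put second = new Put(Bytes.toBytes("1"));
            second.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("Delhi"));
            table.put(second);

            // A plain Get returns only the newest version; setMaxVersions asks for more.
            Get get = new Get(Bytes.toBytes("1"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            System.out.println("latest city = " + Bytes.toString(
                    result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"))));
        }
    }
}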
Hbase clients
There are different flavours of clients that can be used with different
programming languages to access HBase.
1) The HBase shell
2) Kundera – the object mapper
3) The REST client
4) The Thrift client
1) HBase shell:
The HBase shell is a command-line interface for reading and writing data to an
HBase database.
To start the shell, run the following:
$ ./bin/hbase shell
This connects the HBase shell to the HBase server.
Creating a table using the HBase shell
• You can create a table using the create command; you must specify the
table name and the Column Family name.
The syntax to create a table in the HBase shell is shown below.
• create '<table name>','<column family>'
Ex:
• create 'emp', 'personal data', 'professional data'
This creates a table named emp with two column families: "personal data" and
"professional data". You can verify the table using the list command:
hbase(main):001:0 > list
Inserting a First Row
Using the put command, you can insert rows into a table.
• put '<table name>','<row key>','<column family:column name>','<value>'
• put 'emp','1','personal data:name','raju'
• put 'emp','1','personal data:city','Pune'
• put 'emp','1','professional data:designation','manager'
• put 'emp','1','professional data:salary','50000'

The emp table now looks like this:

Row   personal data                        professional data
      name     city                        designation     salary
1     raju     Pune (v1) / Delhi (v2)      manager         50000

(The second version of the city cell is created by the update shown next.)
Updating Data using the HBase Shell
You can update an existing cell value using the put command. To do so, just
follow the same syntax and mention your new value as shown.
Syntax:
• put '<table name>','<row key>','<column family:column name>','<new value>'
Ex:
• put 'emp','1','personal data:city','Delhi'

Reading Data:
• get '<table name>','<row key>'
hbase> get 'emp', '1'
Reading a Specific Column
• hbase> get '<table name>', '<row key>', {COLUMN => '<column family:column name>'}
• hbase> get 'emp', '1', {COLUMN => 'personal data:name'}
Deleting a Specific Cell in a Table
• delete '<table name>', '<row key>', '<column name>', '<time stamp>'

2)Kundera Object Mapper:

To use HBase in a Java application, one popular open-source API is
Kundera, a JPA 2.1-compliant object mapper. Kundera is a
polyglot object mapper for NoSQL as well as
RDBMS data stores: a single high-level Java API that supports eight
NoSQL data stores. The idea behind Kundera is to make working with NoSQL
databases drop-dead simple and fun.

Kundera provides the following qualities:


 A robust querying system
 Easy object/relation mapping
 Support for second-level caching (keeping the entity data local to the
application) and event-based data handling
 Optimized data store persistence

 The Java Persistence API (JPA) is a Java application programming interface
specification that describes the management of relational data in
applications using the Java Platform.
 Kundera is called an object mapper because it maps stored data (for
example JSON documents) into Java objects (deserialization), and Java
objects back into the stored format (serialization).

CRUD using Kundera:

It is very easy to perform CRUD operations using Kundera. To perform
CRUD operations, we need an instance of EntityManager; a minimal
bootstrap for obtaining one is sketched after the entity definition below.
The following code shows the implementation of an employee entity
as Employee.java:
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "EMPLOYEE")
public class Employee {
    @Id
    private String employeeId;
    @Column
    private String employeeName;
    @Column
    private String address;
    @Column
    private String department;

    public Employee() {
        // Default constructor required by JPA.
    }
    // Getters and setters for the fields above are assumed.
}
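The CRUD snippets below all use an EntityManager instance named em. A minimal, hedged
bootstrap for obtaining one through the standard JPA API that Kundera implements is shown
here; the persistence-unit name "hbase_pu" is an assumed name that would be declared in
persistence.xml along with Kundera's HBase connection properties:

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

// "hbase_pu" is a hypothetical persistence unit configured for Kundera's HBase client.
EntityManagerFactory emf = Persistence.createEntityManagerFactory("hbase_pu");
EntityManager em = emf.createEntityManager();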
Once the employee entity is created, implement the create, read, update
and delete (CRUD) operations to be performed:

The create operation: The following code shows the creation and persistence of an
employee record:
Employee employee = new Employee();
employee.setEmployeeId("1");
employee.setEmployeeName("John");
employee.setAddress("Atlanta");
employee.setDepartment("R&D Labs");
em.persist(employee);

The read operation: The following code shows the reading of the employee
record
Employee foundEmployee = em.find(Employee.class, "1");

System.out.println(foundEmployee.getEmployeeId());
System.out.println(foundEmployee.getEmployeeName());
System.out.println(foundEmployee.getAddress());
System.out.println(foundEmployee.getDepartment());
The update operation: The following code shows the update process of the
employee record:
foundEmployee.setAddress("New York");
em.merge(foundEmployee);
The delete operation: The following code shows the deletion of the
employee record:
// delete employee record.
em.remove(foundEmployee);

3) The REST client:


The REST style describes how a client communicates with servers to extract
the required information.
Consider a scenario where you are using the Book My Show app. This
application needs a lot of input data, because the data present in
the application is never static.

Various cities show movies in different languages at various times of the
day. The data is never static, which means it is always changing in
these applications.

This data is received from a server, most commonly a web server. The
client requests the required information from the server via an API, and
the server sends a response to the client.
In a plain web application, the response sent to the client is an HTML web
page.

We would prefer the data to be returned in the form of a structured format,


rather than the complete Web page.
So, for such reasons, the data returned by the server, in response to the
request of the client is either in the format of JSON or XML. Both JSON
and XML formats have a proper hierarchical structure of data.

The issue with this approach is that you have to use a lot of
methods (and parse entire pages) to get the required information.

The REST API instead creates an object and sends the values of that object in
response to the client. It breaks a transaction down into small
modules, each of which addresses a specific part of the
transaction. This approach provides more flexibility but requires more effort
to build from scratch.

Methods of the REST API

All of us working with web technology perform CRUD operations. The REST API
maps these onto HTTP methods: create a resource (POST/PUT), read a resource
(GET), update a resource (PUT) and delete a resource (DELETE).
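As a concrete illustration, HBase ships with a REST gateway (started with bin/hbase rest
start and listening on port 8080 by default) that exposes table data over HTTP. The sketch
below is a hedged example using only standard JDK classes; it assumes the gateway is running
on localhost and that the emp table from the shell examples exists, and it reads row 1 as JSON:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HbaseRestGet {
    public static void main(String[] args) throws Exception {
        // GET /<table>/<rowkey> against the HBase REST gateway (assumed at localhost:8080).
        URL url = new URL("http://localhost:8080/emp/1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Ask for JSON instead of the default XML representation.
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // cell names and values arrive base64-encoded in the JSON body
            }
        }
        conn.disconnect();
    }
}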

4) The Thrift client

You can build cross-platform client-server applications with the help of an RPC framework
called Apache Thrift.
Apache Thrift was originally developed by the Facebook development team
and is currently maintained by Apache.
Thrift mainly focuses on the communication layer between components
of your system.
Thrift uses a special Interface Definition Language (IDL) to define data
types and service interfaces, which are stored as .thrift files and used later as
input by the compiler to generate source code for clients and servers
that communicate across different programming languages.
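To make this concrete, the following is a rough, hedged Java sketch. The EmployeeService
service and its getName() call are hypothetical names standing in for whatever the Thrift
compiler generates from a .thrift file; only the TSocket/TBinaryProtocol plumbing comes from
the Thrift Java library, and the host and port are assumptions:

// Hypothetical IDL (employee.thrift):
//   service EmployeeService {
//     string getName(1: string employeeId)
//   }
// Running "thrift --gen java employee.thrift" would generate EmployeeService.Client.

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftClientSketch {
    public static void main(String[] args) throws Exception {
        // Open a socket transport to an assumed Thrift server on localhost:9090.
        TTransport transport = new TSocket("localhost", 9090);
        transport.open();

        // Wrap the transport in the binary protocol and hand it to the generated client stub.
        EmployeeService.Client client = new EmployeeService.Client(new TBinaryProtocol(transport));
        System.out.println(client.getName("1"));   // hypothetical RPC call

        transport.close();
    }
}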

HBase examples – praxis

We have to import the data present in a file into an HBase table, creating the
table through the Java API.
The input file Data_file.txt contains records whose fields include country,
state, city, year and month.

This data has to be loaded into a new HBase table created through the Java
API, with the following column families:

Column family region has three column qualifiers: country, state, city
Column family Time has two column qualifiers: year, month

Jar Files

Make sure that the following jars are present while writing the code, as
they are required by HBase:
a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0
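A hedged Java sketch of the table-creation step, written against the older
HBaseAdmin/HTableDescriptor API that matches the cdh3-era jars listed above; the table name
data_table is an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateImportTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // One table with the two column families required for the import.
        HTableDescriptor desc = new HTableDescriptor("data_table");   // assumed table name
        desc.addFamily(new HColumnDescriptor("region"));   // holds country, state, city
        desc.addFamily(new HColumnDescriptor("Time"));      // holds year, month
        admin.createTable(desc);

        // Each line of Data_file.txt would then be parsed and written with Put,
        // as in the earlier HBase client sketch.
    }
}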
Apache Cassandra:

Cassandra is a distributed database management system designed for
handling a high volume of structured data across commodity servers.
Cassandra handles huge amounts of data with its distributed architecture.
Data is placed on different machines with a replication factor greater than one, which
provides high availability and no single point of failure.

Cassandra is a column-oriented database.

Cassandra is used by some of the biggest companies, such as Facebook,
Twitter, Cisco, Rackspace, eBay, Netflix, and more.

Apache Cassandra Features

Cassandra provides the following features:

 Massively Scalable Architecture: Cassandra has a masterless design
where all nodes are at the same level which provides operational
simplicity and easy scale out.
 Masterless Architecture: Data can be written and read on any node.
 Linear Scale Performance: As more nodes are added, the performance
of Cassandra increases.
 No Single point of failure: Cassandra replicates data on different nodes
that ensures no single point of failure.
 Fault Detection and Recovery: Failed nodes can easily be restored and
recovered.
 Flexible and Dynamic Data Model: Supports datatypes with Fast writes
and reads.
 Data Compression: Cassandra can compress data by up to 80% without any
overhead.

Cassandra Architecture

Cassandra was designed to handle big data workloads across multiple


nodes without a single point of failure. It has a peer-to-peer distributed system
across its nodes, and data is distributed among all the nodes in a cluster.
o In Cassandra, each node is independent and at the same time
interconnected to other nodes. All the nodes in a cluster play the same
role.
o Every node in a cluster can accept read and write requests, regardless of
where the data is actually located in the cluster.
o In the case of failure of one node, Read/Write requests can be served
from other nodes in the network.

Components of Cassandra

The main components of Cassandra are:


o Node: A Cassandra node is a place where data is stored.
o Data center: Data center is a collection of related nodes.
o Cluster: A cluster is a component which contains one or more data
centers.
o Commit log: In Cassandra, the commit log is a crash-recovery
mechanism. Every write operation is written to the commit log.
o Mem-table: A mem-table is a memory-resident data structure. After
commit log, the data will be written to the mem-table. Sometimes, for a
single-column family, there will be multiple mem-tables.
o SSTable: It is a disk file to which the data is flushed from the mem-table
when its contents reach a threshold value.
o Bloom filter: These are nothing but quick, nondeterministic, algorithms
for testing whether an element is a member of a set. It is a special kind of
cache. Bloom filters are accessed after every query.

Cassandra Data Model:

The data model of Cassandra is basically the way the technology stores data.

Cluster in the Cassandra data model

A Cassandra database is distributed over several machines that operate
together. The outermost container is known as the Cluster. For failure handling,
every node contains a replica, and in case of a failure, the replica takes charge.
Cassandra arranges the nodes in a cluster in a ring format and assigns data to
them.
Cassandra Data Model Components

Cassandra data model provides a mechanism for data storage. The


components of Cassandra data model are keyspaces, tables, and columns.
Keyspaces
Cassandra data model consists of keyspaces at the highest level.
Keyspaces are the containers of data, similar to the schema or database in a
relational database. Typically, keyspaces contain many tables.
Some of the features of keyspaces are:
 A keyspace needs to be defined before creating tables, as there is no
default keyspace.
 A keyspace can contain any number of tables, and a table belongs only to
one keyspace. This represents a one-to-many relationship.
 Replication is specified at the keyspace level. For example, replication of
three implies that each data row in the keyspace will have three copies.
Tables
Within the keyspaces, the tables are defined. Tables are also referred to as
Column Families in the earlier versions of Cassandra. Tables contain a set of
columns and a primary key, and they store data in a set of rows.

Columns
Columns define the structure of data in a table. Each column has an
associated type, such as integer, text, double, and Boolean.

Cassandra example:

Create a Keyspace

cqlsh> CREATE KEYSPACE IF NOT EXISTS cycling WITH
REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3 };

Create table:
cqlsh> CREATE TABLE cycling.cyclist_name (
    id UUID PRIMARY KEY,
    lastname text,
    firstname text);

Inserting Values:
To insert simple data into the table cycling.cyclist_name,
use the INSERT command. This example inserts a single record into the
table.

cqlsh> INSERT INTO cycling.cyclist_name (id,
lastname, firstname) VALUES (5b6962dd-3f90-4c93-8f61-
eabfa4a803e2, 'VOS','Marianne');

Querying a table
cqlsh> SELECT id, lastname, firstname FROM
cycling.cyclist_name;

For a large table, limit the number of rows retrieved using LIMIT. The default limit is 10,000
rows.

cqlsh> SELECT * FROM cycling.cyclist_name LIMIT 3;

Cassandra Clients:
Cassandra client drivers are organized by language below. Before choosing a driver, you should
verify the Cassandra version and functionality supported by that specific driver. A short
example using the DataStax Java driver follows the list.

Java

 Achilles
 Astyanax
 Kundera
 PlayORM
Python

 Datastax Python driver


Ruby

 Datastax Ruby driver


C# / .NET

 Cassandra Sharp
 Datastax C# driver
 Fluent Cassandra
PHP

 CQL | PHP
 Datastax PHP driver
 PHP-Cassandra
 PHP Library for Cassandra
C++

 Datastax C++ driver


 libQTCassandra
Scala

 Datastax Spark connector


 Phantom
 Quill
Perl

 Cassandra::Client and DBD::Cassandra
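As an illustration of how one of these drivers is used, here is a minimal, hedged sketch with
the DataStax Java driver (3.x-style API); the contact point is an assumption, and the table is
the cycling.cyclist_name table created in the earlier CQL example:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CyclistReader {
    public static void main(String[] args) {
        // Connect to a Cassandra node assumed to be running locally.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Run a CQL query against the table from the earlier example.
        ResultSet rs = session.execute(
                "SELECT id, lastname, firstname FROM cycling.cyclist_name");
        for (Row row : rs) {
            System.out.println(row.getString("lastname") + ", " + row.getString("firstname"));
        }
        cluster.close();
    }
}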

HADOOP INTEGRATION WITH CASSANDRA:

Here, we explore how Cassandra and Hadoop fit together and how one
can write MapReduce programs against data in Cassandra.
Cassandra has a Java source package for Hadoop integration code,
called org.apache.cassandra.hadoop.
There we find:
ColumnFamilyInputFormat
The main class we’ll use to interact with data stored in Cassandra from
Hadoop. It’s an extension of Hadoop’s InputFormat abstract class.
ConfigHelper
A helper class to configure Cassandra-specific information such as the
server node to point to, the port, and information specific to your
MapReduce job.
ColumnFamilySplit
The extension of Hadoop’s InputSplit abstract class that creates splits
over our Cassandra data. It also provides Hadoop with the location of the
data, so that it may prefer running tasks on nodes where the data is stored.
ColumnFamilyRecordReader
The layer at which individual records from Cassandra are read. It’s an
extension of Hadoop’s RecordReader abstract class.
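A rough, hedged sketch of wiring these classes into a Hadoop job is shown below. The method
names follow the old Thrift-based org.apache.cassandra.hadoop ConfigHelper API; the keyspace
"Keyspace1", column family "Standard1", contact address and port are assumptions, and
mapper/reducer classes and output settings are omitted:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cassandra-mapreduce");
        job.setJarByClass(CassandraJobSetup.class);

        // Tell the job where Cassandra lives and which data to read.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

        // ColumnFamilyInputFormat turns Cassandra rows into the records seen by map().
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // job.setMapperClass(...), job.setReducerClass(...) and output configuration
        // would follow here, exactly as in any other MapReduce job.
        job.waitForCompletion(true);
    }
}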

PIG
 Pig is a high-level data flow platform for executing MapReduce
programs on Hadoop. It was developed by Yahoo. The language of Pig is
Pig Latin.
 Pig scripts get internally converted to MapReduce jobs, which are
executed on data stored in HDFS.
 Pig can handle any type of data, i.e., structured, semi-structured or
unstructured, and stores the corresponding results in the Hadoop Distributed
File System. Every task that can be achieved using Pig can also be
achieved by writing MapReduce jobs in Java.
Features of Pig

Apache Pig comes with the following features −


 Rich set of operators − It provides many operators to perform
operations like join, sort, filter, etc.
 Ease of programming − Pig Latin is similar to SQL and it is easy to
write a Pig script if you are good at SQL.
 Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
 Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
 UDF’s − Pig provides the facility to create User-defined Functions in
other programming languages such as Java and invoke or embed them in
Pig Scripts.
 Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Advantages of Apache Pig

 Less code - Pig needs fewer lines of code to perform any
operation.
 Reusability - Pig code is flexible enough to be reused.
 Nested data types - Pig provides useful nested
data types like tuple, bag, and map.

Apache Pig Architecture:

 The language used to analyze data in Hadoop using Pig is known as Pig
Latin. It is a high-level data processing language which provides a rich set
of data types and operators to perform various operations on the data.
 To use Pig, programmers need to write a Pig script using the Pig Latin language
and execute it using any of the execution mechanisms (Grunt shell,
UDFs, embedded).
 Internally, Apache Pig converts these scripts into a series of MapReduce
jobs, and thus it makes the programmer's job easy. The major components of
the Apache Pig architecture are described below.

Apache Pig Components:

As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.

i. Parser
At first, all the Pig scripts are handled by the Parser. Basically, the Parser
checks the syntax of the script, does type checking, and other miscellaneous
checks. The Parser's output is a DAG (directed acyclic graph) that
represents the Pig Latin statements and logical operators.
In the DAG (the logical plan), the logical operators of the script are represented as nodes
and the data flows are represented as edges.
ii. Optimizer
The DAG is then passed to the logical optimizer, which carries out the
logical optimizations, such as projection and pushdown.
iii. Compiler
It compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution Engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order.
These MapReduce jobs are executed on Hadoop, producing
the desired results.
Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-
atomic data types such as map and tuple.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string and number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − 'raja' or '30'
Tuple
A record that is formed by an ordered set of fields is known as a tuple,
the fields can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338), (raja@gmail.com)})

Map
A map (or data map) is a set of key-value pairs. The key needs to be of
type chararray and should be unique. The value might be of any type. It is
represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered
(there is no guarantee that tuples are processed in any particular order).

GRUNT
Pig commands, which are written in pig latin scripts, are executed with the help
of Grunt shell.
You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the −x option as shown below.

Local mode MapReduce mode

Command − Command −
$ ./pig –x local $ ./pig -x mapreduce

Output- Output-
grunt> grunt>
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering
the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

In addition to that, there are certain useful shell and utility commands provided by
the Grunt shell, described below.
Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Prior to that, we can invoke any shell commands using sh and fs.
sh Command
Using sh command, we can invoke any shell commands from the Grunt
shell.
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
grunt> sh ls
In this example, it lists out the files in the /pig/bin/ directory.
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the
Grunt shell.
Example
We can invoke the ls command of HDFS from the Grunt shell using fs
command. In the following example, it lists the files in the HDFS root
directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data

Utility Commands:

The Grunt shell provides a set of utility commands. These include utility
commands such as clear, help, history, quit, and set; and commands such
as exec, kill, and run to control Pig from the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Example:
grunt> clear
help Command
The help command gives you a list of Pig commands or Pig properties.
Example
grunt> help
It will show list of commands available in pig.
history Command
This command displays a list of statements executed/used so far since
the Grunt shell was invoked.
grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING


PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING
PigStorage(',');
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',');

Assume we have executed three statements since opening the Grunt shell.
set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.

Key Description and values

Debug You can turn off or turn on the debugging feature in Pig by
passing on/off to this key.

job.name You can set the Job name to the required job by passing a
string value to this key.

job.priority You can set the job priority to a job by passing one of the
following values to this key −
 very_low
 low
 normal
 high
 very_high
quit Command
You can quit from the Grunt shell using this command.
Example:
grunt> quit

exec Command
Using the exec command, we can execute Pig scripts from the Grunt
shell.
Example
Let us assume there is a file named student.txt in
the /pig_data/ directory of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in
the /pig_data/ directory of HDFS with the following content.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',')as (id:int,name:chararray,city:chararray);

Dump student;
Now, let us execute the above script from the Grunt shell using
the exec command as shown below.
grunt> exec /sample_script.pig

Output
The exec command executes the script in the sample_script.pig. As
directed in the script, it loads the student.txt file into Pig and gives you the
result of the Dump operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId
Example
Suppose there is a running Pig job having id Id_0055, you can kill it
from the Grunt shell using the kill command, as shown below.
grunt> kill Id_0055
PIG LATIN

Data types:

S.No.  Data Type   Description & Example

1      int         Represents a signed 32-bit integer. Example: 8
2      long        Represents a signed 64-bit integer. Example: 5L
3      float       Represents a signed 32-bit floating point. Example: 5.5F
4      double      Represents a 64-bit floating point. Example: 10.5
5      chararray   Represents a character array (string) in Unicode UTF-8 format.
                   Example: 'tutorials point'
6      bytearray   Represents a byte array (blob).
7      boolean     Represents a Boolean value. Example: true/false
8      datetime    Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9      biginteger  Represents a Java BigInteger. Example: 60708090709
10     bigdecimal  Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11     tuple       A tuple is an ordered set of fields. Example: (raja, 30)
12     bag         A bag is a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13     map         A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a
= 10 and b = 20.

Operator   Description and example

+          Addition − Adds values on either side of the operator. Example: a + b gives 30.
−          Subtraction − Subtracts the right-hand operand from the left-hand operand. Example: a − b gives −10.
*          Multiplication − Multiplies values on either side of the operator. Example: a * b gives 200.
/          Division − Divides the left-hand operand by the right-hand operand. Example: b / a gives 2.
%          Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0.
? :        Bincond − Evaluates the Boolean condition. It has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END   Case − The case operator is equivalent to the nested bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END

Pig Latin – Comparison Operators

The following table describes the comparison operators of Pig Latin.

Operator   Description and example

==         Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a == b) is not true.
!=         Not Equal − Checks if the values of two operands are equal or not; if the values are not equal, then the condition becomes true. Example: (a != b) is true.
>          Greater than − Checks if the value of the left operand is greater than the value of the right operand; if yes, then the condition becomes true. Example: (a > b) is not true.
<          Less than − Checks if the value of the left operand is less than the value of the right operand; if yes, then the condition becomes true. Example: (a < b) is true.
>=         Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, then the condition becomes true. Example: (a >= b) is not true.
<=         Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, then the condition becomes true. Example: (a <= b) is true.
matches    Pattern matching − Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side. Example: f1 matches '.*tutorial.*'

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into
a relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

Grouping and Joining

JOIN To join two or more relations.


GROUP To group the data in a single relation.

Sorting

ORDER To arrange a relation in a sorted order based on one or
more fields (ascending or descending).

LIMIT To get a limited number of tuples from a relation.

The Pig Latin syntax is similar to SQL but it is much more powerful.

Executing the pig script:


1) Local mode (offline)

2) MapReduce mode (online)

It is possible to execute it from the Grunt shell as well using the


exec command.

Assume we have a file Employee_details.txt in HDFS whose records contain
an employee id, first name, last name, phone number, city and age.

We also have a sample script with the name
sample_script.pig in the same HDFS directory. It contains
statements performing operations and transformations on the
Employee relation.

1. Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING
PigStorage(',')
2. as (id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray, age:int);
3. Employee_order = ORDER Employee BY age DESC;
4. Employee_limit = LIMIT Employee_order 4;

5. Dump Employee_limit;

 The script will load the data in the file named Employee_details.txt as a
relation named Employee, in the first statement.
 Moreover, the script will arrange the tuples of the relation in descending
order, based on age, and store it as Employee_order, in the second
statement.
 The script will store the first 4 tuples of Employee_order as
Employee_limit, in the third statement.
 Ultimately, last and the 4th statement will dump the content of the
relation Employee_limit.

Further, let's execute sample_script.pig (for example with the exec command).

Pig then runs the script and dumps the first four tuples of Employee_order as output.
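Pig Latin can also be embedded in a Java program (the "Embedded" execution mechanism
mentioned earlier) through the PigServer class. Below is a hedged sketch that runs the same
statements in local mode against Employee_details.txt; the file path is an assumption:

import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // "local" runs Pig against the local file system instead of a Hadoop cluster.
        PigServer pig = new PigServer("local");
        pig.registerQuery("Employee = LOAD 'Employee_details.txt' USING PigStorage(',') "
                + "as (id:int, firstname:chararray, lastname:chararray, phone:chararray, "
                + "city:chararray, age:int);");
        pig.registerQuery("Employee_order = ORDER Employee BY age DESC;");
        pig.registerQuery("Employee_limit = LIMIT Employee_order 4;");

        // Iterate over the result instead of using DUMP.
        Iterator<Tuple> it = pig.openIterator("Employee_limit");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}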
DEVELOPING AND TESTING PIG LATIN SCRIPTS

Testing your Scripts with PigUnit

PigUnit is simply unit testing for your Pig scripts.

Unit Testing

Unit Testing is a software development process in which the smallest


testable parts of an application, called units, are individually and independently
scrutinized for proper operation. Unit testing is often automated but it can also
be done manually.
PigUnit

With PigUnit you can perform unit testing, regression testing, and rapid
prototyping. No cluster setup is required if you run Pig in local mode.
PigUnit uses the well-known Java JUnit framework to test Pig
scripts. First, we write a Java class which JUnit uses to run, and inside that
class we provide our Pig script, input files and the expected output, so that
it will automatically run the script and check the results. With the help of
PigUnit we can reduce the time spent debugging.

Unit Testing your own Pig script

For unit testing your own Pig script, you need to have the Maven and Pig
Eclipse plugins installed in your Eclipse.

Maven Plugin for Eclipse


Open Eclipse IDE and follow the below steps:
1. Click Help -> Install New Software…
2. Click ‘Add’ button at top right corner
3. In the pop up window, fill up the ‘Name’ as ‘M2Eclipse’ and ‘Location’
as ‘http://download.eclipse.org/technology/m2e/releases’ or
‘http://download.eclipse.org/technology/m2e/milestones/1.0’.
4. Now, click ‘OK’.

Pig Eclipse plugin


Follow the below steps to install the Pig Eclipse plugin:
1. Click ‘Help’ -> Install New Software…
2. Click ‘Add’ button at top right corner
3. In the pop up window, fill up ‘Name’ as ‘pigEditor’ and ‘Location’ as
‘https://pig-eclipse.googlecode.com/svn/trunk/update_site’.
4. Now, click ‘OK’.

After the installation of above specified plugins, restart your Eclipse.

Now, follow the series of steps mentioned below to create a Maven project:
File–>New–>Other–>Maven–>Maven Project
Next, open the pom.xml file, following the below steps:
src–>target–>pom.xml
Then, copy in the required repositories and dependencies.

Let’s now run a sample word count program using Junit.


Below is the Pig script for performing word count:

A = load '/home/kiran/workspace/pigunit/input.data';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;

We have saved the above lines in a file called wordcount.pig and the input data is saved in the
file with name ‘input.data’.
PigTest pigTest = new PigTest("/home/kiran/workspace/pigunit/wordcount.pig");

Here, you can see that we have created an object called ‘pigTest’ and we have passed our Pig
script into it. The output will be given inside the assertOutput method.

pigTest.assertOutput("D", output);

In this case, our statement to be dumped is D, so we have given the alias name as ‘D’ within the
double quotes and the expected output is stored in the String array output. So, we have given
the output variable name as the output.

Now, let’s have a look at the complete program.

package java1.pigunit1;
import junit.framework.TestCase;
import org.apache.pig.pigunit.PigTest;
public class AppTest extends TestCase {
public void testwordcount() throws Exception {
PigTest pigTest = new
PigTest("/home/kiran/workspace/pigunit/wordcount.pig");
String[] output = {
"(1,all)","(1,from)","(1,Hello)","(1,Brindavan!)"
};
pigTest.assertOutput("D", output);
}
}

Now, let’s run this program using Junit Test.


Right Click–>Run As–>Junit Test

What is TOKENIZE?
Example:
Consider that we have data in a file with the contents mentioned below:

USA USA EUROPE EUROPE EUROPE EUROPE USA
USA USA EUROPE EUROPE USA
EUROPE USA

We want to find out the number of occurrences of USA and EUROPE.


1) inp = LOAD '/user/countries.txt' as (singleline);

dump inp;

Output
(USA USA EUROPE EUROPE EUROPE EUROPE USA)
(USA USA EUROPE EUROPE USA)
(EUROPE USA)

2) tknz = FOREACH inp GENERATE TOKENIZE(singleline) as Words;

dump tknz;

Output

{(USA),(USA),(EUROPE),(EUROPE),(EUROPE),(EUROPE),(USA)}
{(USA),(USA),(EUROPE),(EUROPE),(USA)}
{(EUROPE),(USA)}

HIVE
The term ‘Big Data’ is used for collections of large datasets that include
huge volume, high velocity, and a variety of data that is increasing day by day.
Using traditional data management systems, it is difficult to process Big Data.
Therefore, the Apache Software Foundation introduced a framework called
Hadoop to solve Big Data management and processing challenges.
Hadoop

Hadoop is an open-source framework to store and process Big Data in a


distributed environment. It contains two modules, one is MapReduce and
another is Hadoop Distributed File System (HDFS).
 MapReduce: It is a parallel programming model for processing large
amounts of structured, semi-structured, and unstructured data on large
clusters of commodity hardware.
 HDFS:Hadoop Distributed File System is a part of Hadoop framework,
used to store and process the datasets. It provides a fault-tolerant file
system to run on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop,
Pig, and Hive that are used to help Hadoop modules.
 Sqoop: It is used to import and export data between HDFS
and RDBMS.
 Pig: It is a procedural language platform used to develop a script for
MapReduce operations.
 Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.

Note: There are various ways to execute MapReduce operations:

 The traditional approach using Java MapReduce program for structured,


semi-structured, and unstructured data.
 The scripting approach for MapReduce to process structured and semi
structured data using Pig.
 The Hive Query Language (HiveQL or HQL) for MapReduce to process
structured data using Hive.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in


Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software
Foundation took it up and developed it further as an open source under the
name Apache Hive.
Features of Hive

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

The architecture of Hive contains different units. The following table describes
each unit:

Unit Name: Operation

User Interface: Hive is a data warehouse infrastructure software that can
create interaction between the user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows
server).

Meta Store: Hive chooses respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and
HDFS mapping. It stores this data in a traditional RDBMS format.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information
in the Metastore. It is one of the replacements of the traditional approach
for MapReduce programs. Instead of writing a MapReduce program in Java, we can
write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and
MapReduce is the Hive Execution Engine. The execution engine processes the
query and generates results the same as MapReduce results. It uses the flavor
of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase are the data
storage techniques used to store data in the file system.

Working of Hive

The following table describes the workflow between Hive and Hadoop, i.e., how
Hive interacts with the Hadoop framework:

Step Operation
No.

1 Execute Query
The Hive interface such as Command Line or Web UI sends query to
Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan
The driver takes the help of query compiler that parses the query to
check the syntax and query plan or the requirement of query.

3 Get Metadata
The compiler sends metadata request to Metastore (any database).

4 Send Metadata
Metastore sends metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of a query is complete.

6 Execute Plan
The driver sends the execute plan to the execution engine.

7 Execute Job
Internally, the process of execution job is a MapReduce job. The
execution engine sends the job to JobTracker, which is in Name
node and it assigns this job to TaskTracker, which is in Data node.
Here, the query executes MapReduce job.

7.1 Metadata Ops


Meanwhile in execution, the execution engine can execute metadata
operations with Metastore.

8 Fetch Result
The execution engine receives the results from Data nodes.

9 Send Results
The execution engine sends those resultant values to the driver.

10 Send Results
The driver sends the results to Hive Interfaces.
DATA TYPES AND FILE FORMATS
There are different data types in Hive, which are involved in the table
creation. All the data types in Hive are classified into four types, given as
follows:
 Column Types
 Literals
 Null Values
 Complex Types

Column Types

Column types are used as the column data types of Hive. They are as follows:

Integral Types
Integer type data can be specified using integral data types, INT. When
the data range exceeds the range of INT, you need to use BIGINT and if the
data range is smaller than the INT, you use SMALLINT. TINYINT is smaller
than SMALLINT.
The following table depicts various INT data types:

Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types
String type data types can be specified using single quotes (' ') or double
quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows
C-types escape characters.

The following table depicts various CHAR data types:

Data Type   Length

VARCHAR     1 to 65535

CHAR        255

Timestamp
It supports traditional UNIX timestamp with optional nanosecond
precision. It supports java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.

Dates
DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.

Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It
is used for representing immutable arbitrary-precision numbers. The syntax and example
are as follows:
DECIMAL(precision, scale)
decimal(10,0)

Union Types
Union is a collection of heterogeneous data types. You can create an
instance using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:


Floating Point Types
Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range
than the DOUBLE data type. The range of the decimal type is approximately
-10^-308 to 10^308.
Null Value

Missing values are represented by the special value NULL.


Complex Types

The Hive complex data types are as follows:


Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive group together fields of possibly different types; each field
can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

HIVE FILE FORMATS:


Following are the Apache Hive different file formats:
 Text File
 Sequence File
 RC File
 AVRO File
 ORC File
 Parquet File

Hive Text File Format:

The Hive text file format is the default storage format. You can use the text
format to interchange data with other client applications. The text file format
is very common in most applications. Data is stored in lines, with each line
being a record. Each line is terminated by a newline character (\n).
The text format is a simple plain file format. You can use compression
(BZIP2) on the text file to reduce the storage space.
Create a TEXT file by adding the storage option 'STORED AS
TEXTFILE' at the end of a Hive CREATE TABLE command.
Syntax:
Create table textfile_table(column_specs)
stored as textfile;

Hive Sequence File Format:

Sequence files are Hadoop flat files which store values in binary key-
value pairs. The sequence files are in binary format and are
splittable. The main advantage of using sequence files is that two or more
files can be merged into one file.
Create a sequence file by adding the storage option 'STORED AS
SEQUENCEFILE' at the end of a Hive CREATE TABLE command.

Syntax
Create table sequencefile_table (column_specs)
stored as sequencefile;

Hive RC File Format

RCFile is a row-columnar file format. This is another form of Hive file
format which offers high row-level compression rates. If you have a requirement
to process multiple rows at a time, you can use the RCFile format.
RCFiles are very similar to the sequence file format. This file format
also stores the data as key-value pairs.
Create an RCFile by specifying the 'STORED AS RCFILE' option at the end
of a CREATE TABLE command:
Syntax:
Create table RCfile_table(column_specs)
stored as rcfile;

Hive AVRO File Format


Avro stores the data definition (schema) in JSON format making it easy
to read and interpret by any program. The data itself is stored in binary format
making it compact and efficient.
Syntax:
Create table avro_table(column_specs) stored as avro;
Hive ORC File Format

The ORC file stands for Optimized Row Columnar file format. The ORC
file format provides a highly efficient way to store data in a Hive table. This file
format was actually designed to overcome limitations of the other Hive file
formats. The use of ORC files improves performance when Hive is reading,
writing, and processing data from large tables.
Syntax:
Create table orc_table(column_specs) stored as orc;

Hive Parquet File Format

Parquet is a column-oriented binary file format. Parquet is highly
efficient for large-scale queries, and especially good for
queries scanning particular columns within a particular table. Parquet tables
use Snappy or gzip compression; currently Snappy is the default.
Create a Parquet file by specifying the 'STORED AS PARQUET' option at
the end of a CREATE TABLE command.
Syntax:
Create table parquet_table(column_specs) stored as parquet;

HIVEQL DATA DEFINITION


Hive DDL commands are the statements used for defining and changing
the structure of a table or database in Hive. It is used to build or modify the
tables and other objects in the database.

The several types of Hive DDL commands are:


1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE

Table-1 Hive DDL commands

DDL Command   Use With
CREATE        Database, Table
SHOW          Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE      Database, Table, View
USE           Database
DROP          Database, Table
ALTER         Database, Table
TRUNCATE      Table

Note that the Hive commands are case-insensitive.

1.Create Database:
The CREATE DATABASE statement is used to create a database in the
Hive. The DATABASE and SCHEMA are interchangeable. We can use either
DATABASE or SCHEMA.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
Example :
CREATE DATABASE IF NOT EXISTS bda;

2.Show Database:
The SHOW DATABASES statement lists all the databases present in the
Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
Example:
SHOW DATABASES;

3.Describe Database:
The DESCRIBE DATABASE statement in Hive shows the name of
Database in Hive, and its location on the file system.
The EXTENDED can be used to get the database properties.
Syntax:
DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;
Example
DESCRIBE DATABASE bda;

4.Use Database
The USE statement in Hive is used to select the specific database for a
session on which all subsequent HiveQL statements would be executed.
Syntax:
USE database_name;
Example
USE bda;

5.Drop Database
The DROP DATABASE statement in Hive is used to Drop (delete) the
database.
The default behavior is RESTRICT which means that the database is
dropped only when it is empty. To drop the database with tables, we can use
CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];

Example:
DROP DATABASE IF EXISTS bda CASCADE;

For example, the database can then be re-created and a table added to it:
CREATE DATABASE bda;
USE bda;
CREATE TABLE mca (name STRING, dob DATE);
6.Alter Database:
The ALTER DATABASE statement in Hive is used to change the
metadata associated with the database
Syntax:
ALTER (DATABASE|SCHEMA) database_name SET
DBPROPERTIES (property_name=property_value, ...);
Example:
ALTER DATABASE bda SET DBPROPERTIES ('createdfor'='mca');

In this example, we are setting the database properties of the ‘bda’


database after its creation by using the ALTER command.

Syntax for changing the database owner:

ALTER (DATABASE|SCHEMA) database_name SET OWNER
[USER|ROLE] user_or_role;
Example:
ALTER DATABASE bda SET OWNER ROLE admin;
In this example, we are changing the owner role of the 'bda' database using the
ALTER statement.

1. CREATE TABLE

The CREATE TABLE statement in Hive is used to create a table with


the given name. If a table or view already exists with the same name, then the
error is thrown. We can use IF NOT EXISTS to skip the error.

Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name
data_type [COMMENT col_comment], ... [COMMENT col_comment])]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS
file_format] [LOCATION hdfs_path];

Example
CREATE TABLE IF NOT EXISTS Employee(
Emp_id STRING COMMENT 'this is employee-id',
Emp_designation STRING COMMENT 'this is employee post');

2. SHOW TABLES in Hive

The SHOW TABLES statement in Hive lists all the base tables
and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
Example
SHOW TABLES;
3. DESCRIBE TABLE in Hive

The DESCRIBE statement in Hive shows the lists of columns for the
specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]
table_name[.col_name ( [.field_name])];
Example:
DESCRIBE Employee;
It describes the column names, data types and comments.

4. DROP TABLE in Hive

The DROP TABLE statement in Hive deletes the data for a particular
table and remove all metadata associated with it from Hive metastore.
If PURGE is not specified then the data is actually moved to the
.Trash/current directory. If PURGE is specified, then data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
Example:
DROP TABLE IF EXISTS Employee PURGE;

5. ALTER TABLE in Hive

The ALTER TABLE statement in Hive enables you to change the


structure of an existing table. Using the ALTER TABLE statement we can
rename the table, add columns to the table, change the table properties, etc.

Syntax to Rename a table:


ALTER TABLE table_name RENAME TO new_table_name;
Example
ALTER TABLE Employee RENAME TO Comp_Emp;

In this example, we are renaming the 'Employee' table to 'Comp_Emp'
using the ALTER statement.
Syntax to add columns to a table:
ALTER TABLE table_name ADD COLUMNS (column1, column2);
Example
ALTER TABLE Comp_Emp ADD COLUMNS (emp_dob STRING,
emp_contact STRING);
In this example, we are adding two columns 'emp_dob' and
'emp_contact' to the 'Comp_Emp' table using the ALTER command.

6. TRUNCATE TABLE

TRUNCATE TABLE statement in Hive removes all the rows from the
table or partition.
Syntax:
TRUNCATE TABLE table_name;
Example
TRUNCATE TABLE Comp_Emp;
It removes all the rows in Comp_Emp Table.

HIVEQL DATA MANIPULATION


Hive DML (Data Manipulation Language) commands are used to insert,
update, retrieve, and delete data from the Hive table once the table and database
schema has been defined using Hive DDL commands.
The various Hive DML commands are:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT

1.Load Command:
The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
 If a LOCAL keyword is specified, then the LOAD command will look
for the file path in the local filesystem.
 If the LOCAL keyword is not specified, then the Hive will need the
absolute URI of the file.
 In case the keyword OVERWRITE is specified, then the contents of the
target table/partition will be deleted and replaced by the files referred by
filepath.
 If the OVERWRITE keyword is not specified, then the files referred by
filepath will be appended to the table.

Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
Example:
LOAD DATA LOCAL INPATH '/home/user/dab' INTO TABLE emp_data;

2. SELECT COMMAND
The SELECT statement in Hive is similar to the SELECT statement in
SQL used for retrieving data from the database.
Syntax:
SELECT col1,col2 FROM tablename;

Example:

SELECT * FROM emp_data;

It will display all the rows in emp_data

3. INSERT Command
The INSERT command in Hive loads data into a Hive table. We can
insert into either a Hive table or a partition.
a. INSERT INTO

The INSERT INTO statement appends the data into existing data in the
table or partition. INSERT INTO statement works from Hive version 0.8.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1,
partcol2=val2 ...)] select_statement1 FROM from_statement;
Example
CREATE TABLE IF NOT EXISTS example(id STRING, name STRING,
dep STRING, state STRING, salary STRING, year STRING);

INSERT statement to load data into the table "example":


Example:
INSERT INTO TABLE example SELECT emp.emp_id, emp.emp_name,
emp.emp_dep, emp.state, emp.salary, emp.year_of_joing from emp_data
emp;

SELECT * FROM example;


b. INSERT OVERWRITE

The INSERT OVERWRITE table overwrites the existing data in the table or
partition.

Example:
Here we are overwriting the existing data of the table ‘example’ with the
data of table ‘dummy’ using INSERT OVERWRITE statement.

INSERT OVERWRITE TABLE example SELECT dmy.enroll, dmy.name,


dmy.department, dmy.salary, dmy.year from dummy dmy;

By using the SELECT statement we can verify whether the existing data of the
table ‘example’ is overwritten by the data of table ‘dummy’ or not.
SELECT * FROM EXAMPLE;
c. INSERT .. VALUES

The INSERT .. VALUES statement in Hive inserts data into the table directly
from SQL.
Example:
Inserting data into the 'student' table using an INSERT .. VALUES statement:
INSERT INTO TABLE student VALUES (101,'Callen','IT','7.8'),
(103,'joseph','CS','8.2'),
(105,'Alex','IT','7.9');
SELECT * FROM student;

4. DELETE command
The DELETE statement in Hive deletes the table data. If the WHERE
clause is specified, then it deletes the rows that satisfy the condition in where
clause.
Syntax:
DELETE FROM tablename [WHERE expression];
Example:
In the below example, we are deleting the data of the student from table
‘student’ whose roll_no is 105.
DELETE FROM student WHERE roll_no=105;
SELECT * FROM student;
5. UPDATE Command
The UPDATE statement in Hive updates the table data. If a WHERE
clause is specified, then it updates the columns of the rows that satisfy the
condition in the WHERE clause.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE
expression];
Example:
In this example, we are updating the branch of the student whose roll_no is 103
in the 'student' table using an UPDATE statement.
UPDATE student SET branch='IT' WHERE roll_no=103;
SELECT * FROM student;

6. EXPORT Command
The Hive EXPORT statement exports the table or partition data along
with the metadata to the specified output location in the HDFS.
Example:
Here in this example, we are exporting the student table to the HDFS
directory "export_from_hive".
EXPORT TABLE student TO 'export_from_hive';

The table is successfully exported. You can check for the _metadata file and data
sub-directory using the ls command.
7. IMPORT Command
The Hive IMPORT command imports the data from a specified location
to a new table or already existing table.
Example:
Here in this example, we are importing the data exported in the above
example into a new Hive table 'imported_table'.
IMPORT TABLE imported_table FROM 'export_from_hive';

Verify whether the data is imported or not using a Hive SELECT statement:
SELECT * FROM imported_table;

HIVEQL QUERIES
Like SQL, HiveQL contains:
 HiveQL operators,
 HiveQL functions,
 HiveQL GROUP BY and HAVING,
 HiveQL ORDER BY and SORT BY,
 HiveQL joins

HiveQL - Operators
The HiveQL operators facilitate performing various arithmetic and relational operations.
Here, we are going to execute such operations on the following records.

Let's create a table and load the data into it by using the following steps: -

Select the database in which we want to create a table.

hive> use hql;

Create a hive table using the following command:

hive> create table employee (Id int, Name string, Salary float);
Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table employee;

Let's fetch the loaded data by using the following command: -

hive> select * from employee;
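The examples in this section use the hive> command-line prompt. The same queries can also be
issued from a Java program over JDBC once HiveServer2 is running; the sketch below is hedged,
and the host and port (10000 is the HiveServer2 default) are assumptions, while the hql
database and employee table come from the setup above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/hql", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("select Id, Name, Salary from employee")) {
            while (rs.next()) {
                // Columns are read positionally: Id, Name, Salary.
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2)
                        + "\t" + rs.getFloat(3));
            }
        }
    }
}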

Arithmetic Operators in Hive


In Hive, the arithmetic operator accepts any numeric type. The commonly used
arithmetic operators are: -

Operators     Description

A+B This is used to add A and B.

A-B This is used to subtract B from A.

A*B This is used to multiply A and B.

A/B This is used to divide A and B and returns the quotient of the
operands.

A%B This returns the remainder of A / B.

A|B This is used to determine the bitwise OR of A and B.

A&B This is used to determine the bitwise AND of A and B.

A^B This is used to determine the bitwise XOR of A and B.


~A This is used to determine the bitwise NOT of A.

Let's see an example to increase the salary of each employee by 50.

hive> select id, name, salary + 50 from employee;

Let's see an example to decrease the salary of each employee by 50.

hive> select id, name, salary - 50 from employee;

Let's see an example to find out the 10% salary of each employee.

hive> select id, name, (salary * 10) /100 from employee;

Relational operators in HIVE:


hive> select * from employee where salary >= 25000;

hive> select * from employee where salary < 25000;

HIVEQL Functions:

There are many mathematical functions in Hive, such as
round(num), floor(num), sqrt(num), abs(num).
hive> select Id, Name, sqrt(Salary) from employee_data;
Aggregate Functions in Hive
An aggregate function returns a single value resulting from computation over many
rows. Let's see some commonly used aggregate functions: -

count(*) - It returns the count of the number of rows present in the file.

sum(col), - It returns the sum of values.


sum(DISTINCT col)- It returns the sum of distinct values

avg(col) - It returns the average of values.

min(col) - It compares the values and returns the minimum one.
max(col) - It compares the values and returns the maximum one.
Example:
hive> select max(Salary) from employee_data;

hive> select min(Salary) from employee_data;

HiveQL - GROUP BY and HAVING Clause

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
hive> SELECT * FROM employee ORDER BY Dept;
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+------+--------------+-------------+-------------------+--------+
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
+------+--------------+
| Dept | count(*) |
+------+--------------+
|Admin | 1 |
|HR | 1 |
|PR | 1 |
|TP | 2 |
+------+--------------+

HAVING CLAUSE
The HQL HAVING clause is used with GROUP BY clause. Its purpose is to apply
constraints on the group of data produced by GROUP BY clause. Thus, it always returns
the data where the condition is TRUE.

hive> select Dept, sum(Salary) from employee group by Dept
having sum(Salary) >= 45000;

output:
DEPT    SUM(SALARY)
TP      85000
PR      45000
(The HR group with 40000 and the Admin group with 30000 are filtered out by the HAVING condition.)

Differences between Hive and Pig


Hive Pig

Hive is commonly used by Data Pig is commonly used by


Analysts. programmers.

It follows SQL-like queries. It follows the data-flow language.

It can handle structured data. It can handle semi-structured data.

It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.

Hive is slower than Pig. Pig is comparatively faster than Hive.

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries contain high latency.
