Unit 5 Notes
HBase
What is HBase?
HBase is an open-source, column-oriented distributed database system that runs
in a Hadoop environment. It is modeled on Google's Bigtable and is written
primarily in Java. Apache HBase is used for real-time Big Data applications.
HBase can store massive amounts of data, from terabytes to petabytes.
Tables in HBase can consist of billions of rows with millions of columns.
HBase is built for low-latency operations and has several features that
distinguish it from traditional relational models.
HBASE vs. RDBMS
HBASE: Reads only the relevant data from the database.
RDBMS: Retrieves one row at a time, and hence could read unnecessary data if
only some of the data in a row is required.
Tables
The HBase tables are logical collections of rows stored in separate
partitions called Regions. Every Region is served by exactly one Region
Server.
Rows
A row is one instance of data in a table and is identified by a rowkey.
Rowkeys are unique in a Table and are always treated as a byte[].
Column Families
Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and the Columns in a family are stored
together in a low-level storage file known as an HFile.
Columns
A Column Family is made of one or more columns. A Column is
identified by a Column Qualifier that consists of the Column Family name
concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column
Family, and rows within a table can have a varying number of Columns.
Cell
A Cell stores data and is essentially a unique combination of rowkey,
Column Family and the Column (Column Qualifier). The data stored in a Cell is
called its value and the data type is always treated as byte[].
Version
The data stored in a cell is versioned and versions of data are identified
by the timestamp associated with them.
By default, the timestamp represents the time on the RegionServer when the data
was written, but you can specify a different timestamp value when you put data
into the cell.
The number of versions kept per cell is configurable; by default it is 3.
For example, if the value of some column of a row is 1234 and you then update
it to 5678, HBase will not overwrite 1234 with 5678. Instead, it creates a new
version of the cell: 5678 becomes the latest version and 1234 an older one.
By default, HBase shows only the latest version of a cell and hides all other
versions.
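As a hedged illustration (the table, row, and column names here are
hypothetical), the shell can store two values in the same cell and then ask for
multiple versions:
hbase> put 't1', 'r1', 'cf:c1', '1234'
hbase> put 't1', 'r1', 'cf:c1', '5678'
hbase> get 't1', 'r1', {COLUMN => 'cf:c1', VERSIONS => 3}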
Hbase clients
There are different flavours of clients that can be used with different
programming languages to access HBase.
1) The HBase shell
2) Kundera – the object mapper
3) The REST client
4) The Thrift client
1) HBase shell:
The HBase shell is a command-line interface for reading and writing data to an
HBase database.
To start the shell, run the following:
$ ./bin/hbase shell
This will connect the HBase shell to the HBase server.
Creating a table using the HBase shell
• You can create a table using the create command; you must specify the
table name and the Column Family name.
The syntax to create a table in the HBase shell is shown below.
• create '<table name>', '<column family>'
Ex:
• create 'emp', 'personal data', 'professional data'
This creates a table named emp with two column families: "personal data" and
"professional data".
You can verify that the table was created using the list command:
hbase(main):001:0> list
Inserting a First Row
Using the put command, you can insert rows into a table.
• put '<table name>', '<row key>', '<colfamily:colname>', '<value>'
• put 'emp', '1', 'personal data:name', 'raju'
• put 'emp', '1', 'personal data:city', 'Pune'
• put 'emp', '1', 'professional data:designation', 'manager'
• put 'emp', '1', 'professional data:salary', '50000'
Reading Data:
• get '<table name>', '<row key>'
hbase> get 'emp', '1'
Reading a Specific Column
• hbase> get '<table name>', '<row key>', {COLUMN => '<column family>:<column name>'}
• hbase> get 'emp', '1', {COLUMN => 'personal data:name'}
Deleting a Specific Cell in a Table
• delete '<table name>', '<row key>', '<column name>', '<timestamp>'
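For example, using the emp table created earlier, the city cell of row 1 could
be deleted with:
hbase> delete 'emp', '1', 'personal data:city'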
2) Kundera – the object mapper:
To use HBase in a Java application, one popular open-source API is
Kundera, a JPA 2.1 compliant object mapper. Kundera is a polyglot object
mapper for NoSQL as well as RDBMS data stores: it is a single high-level Java
API that supports eight NoSQL data stores. The idea behind Kundera is to make
working with NoSQL databases drop-dead simple and fun.
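Kundera follows the standard JPA bootstrap: the em object used in the
operations below is an EntityManager obtained from an EntityManagerFactory. A
minimal sketch, assuming a persistence unit named hbase_pu has been configured
in persistence.xml and that Employee is a JPA-annotated entity:
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;
// "hbase_pu" is a hypothetical persistence-unit name defined in persistence.xml
EntityManagerFactory emf = Persistence.createEntityManagerFactory("hbase_pu");
EntityManager em = emf.createEntityManager();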
The create operation: The following code shows the creation of an employee
record:
Employee employee = new Employee();
employee.setEmployeeId("1");
employee.setEmployeeName("John");
employee.setAddress("Atlanta");
employee.setDepartment("R&D Labs");
em.persist(employee); // persist the entity so the record is actually saved
The read operation: The following code shows the reading of the employee
record
Employee foundEmployee = em.find(Employee.class, "1");
System.out.println(foundEmployee.getEmployeeId());
System.out.println(foundEmployee.getEmployeeName());
System.out.println(foundEmployee.getAddress());
System.out.println(foundEmployee.getDepartment());
The update operation: The following code shows the update process of the
employee record:
foundEmployee.setAddress("New York");
em.merge(foundEmployee);
The delete operation: The following code shows the deletion of the
employee record:
// delete employee record.
em.remove(foundEmployee);
3) The REST client:
In a typical web application, data is served by a web server: the client
requests the required information from the server via an API, and the server
sends a response back to the client, often in the form of an HTML web page.
The drawback of this approach is that the client may have to call a lot of
methods to get the required information.
A REST API instead creates a resource (object) and sends the values of that
object in the response to the client. It breaks a transaction down into small
modules, and each of these modules addresses a specific part of the
transaction. This approach provides more flexibility but requires more effort
to build from scratch.
When working with web technology, we perform CRUD operations: create a
resource, read a resource, update a resource, and delete a resource.
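As a hedged example, assuming the HBase REST server has been started on its
default port 8080, a row can be read over HTTP (the table and row key follow
the emp example above):
$ curl -H "Accept: application/json" "http://localhost:8080/emp/1"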
The Java API
As an example, suppose some data has to be loaded into a new HBase table
created through the Java API, with the following column families:
Column family region, with three column qualifiers: country, state, city
Column family time, with two column qualifiers: year, month
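A minimal sketch of creating such a table with the older HBase 0.90.x Java API
(matching the jars listed below); the table name data_table is a placeholder
and exception handling is omitted:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

Configuration conf = HBaseConfiguration.create();            // picks up hbase-site.xml from the classpath
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("data_table");  // hypothetical table name
desc.addFamily(new HColumnDescriptor("region"));             // will hold qualifiers country, state, city
desc.addFamily(new HColumnDescriptor("time"));               // will hold qualifiers year, month
admin.createTable(desc);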
Jar Files
Make sure that the following jars are present while writing the code as
they are required by the HBase.
a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0
Apache Cassandra:
Cassandra Architecture
Components of Cassandra
Cassandra Data Model:
Columns
Columns define the structure of data in a table. Each column has an
associated type, such as integer, text, double, and Boolean.
Cassandra Example:
Create a Keyspace
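The keyspace used by the following example might be created as follows (the
replication settings here are illustrative):
CREATE KEYSPACE cycling
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};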
Create table:
CREATE TABLE cycling.cyclist_name (race_id int,
race_name text, point_id int, PRIMARY KEY (race_id,
point_id));
Inserting Values:
To insert simple data into the table cycling.cyclist_name,
use the INSERT command. This example inserts a single record into the
table.
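A sketch of such an insert, matching the columns of the table created above
(the values are illustrative):
INSERT INTO cycling.cyclist_name (race_id, race_name, point_id)
VALUES (201, 'Tour de France', 1);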
For a large table, limit the number of rows retrieved using LIMIT. The default limit is 10,000
rows.
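For example, retrieving at most 100 rows from the table above:
SELECT * FROM cycling.cyclist_name LIMIT 100;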
Cassandra Clients:
Cassandra client drivers are organized by language. Before choosing a driver,
you should verify the Cassandra version and functionality supported by that
specific driver.
Java
Achilles
Astyanax
Kundera
PlayORM
C#
Cassandra Sharp
Datastax C# driver
Fluent Cassandra
PHP
CQL | PHP
Datastax PHP driver
PHP-Cassandra
PHP Library for Cassandra
C++
Here, we explore how Cassandra and Hadoop fit together and how one
can write MapReduce programs against data in Cassandra.
Cassandra has a Java source package for Hadoop integration code,
called org.apache.cassandra.hadoop.
There we find:
ColumnFamilyInputFormat
The main class we’ll use to interact with data stored in Cassandra from
Hadoop. It’s an extension of Hadoop’s InputFormat abstract class.
ConfigHelper
A helper class to configure Cassandra-specific information such as the
server node to point to, the port, and information specific to your
MapReduce job.
ColumnFamilySplit
The extension of Hadoop’s InputSplit abstract class that creates splits
over our Cassandra data. It also provides Hadoop with the location of the
data, so that it may prefer running tasks on nodes where the data is stored.
ColumnFamilyRecordReader
The layer at which individual records from Cassandra are read. It’s an
extension of Hadoop’s RecordReader abstract class.
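A hedged sketch of wiring these classes into a Hadoop job configuration; the
keyspace, column family, host, and port values are placeholders, and the exact
ConfigHelper method names can vary between Cassandra versions:
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(new Configuration(), "cassandra-mapreduce-example");
job.setInputFormatClass(ColumnFamilyInputFormat.class);
// Tell the input format where Cassandra is and which column family to read
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");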
PIG
Pig is a high-level data flow platform for executing MapReduce
programs on Hadoop. It was developed by Yahoo. The language used for Pig is
Pig Latin.
Pig scripts are internally converted to MapReduce jobs and get
executed on data stored in HDFS.
Pig can handle any type of data, i.e., structured, semi-structured, or
unstructured, and stores the corresponding results in the Hadoop Distributed
File System. Every task that can be achieved using Pig can also be achieved
by writing MapReduce programs in Java.
Features of Pig
Less code - Pig requires fewer lines of code to perform an operation.
Reusability - Pig code is flexible enough to be reused.
Nested data types - Pig provides useful nested data types such as tuple,
bag, and map.
The language used to analyze data in Hadoop with Pig is known as Pig
Latin. It is a high-level data processing language that provides a rich set
of data types and operators to perform various operations on the data.
To use Pig, programmers write a Pig script in the Pig Latin language
and execute it using any of the execution mechanisms (Grunt shell, UDFs,
embedded).
Internally, Apache Pig converts these scripts into a series of MapReduce
jobs, which makes the programmer's job easier.
The Apache Pig framework consists of various components; let us take a look
at the major ones.
i. Parser
All Pig scripts are first handled by the Parser. The Parser checks the
syntax of the script, does type checking, and performs other miscellaneous
checks. The Parser's output is a DAG (directed acyclic graph) that represents
the Pig Latin statements and logical operators.
In the DAG (the logical plan), the logical operators of the script are
represented as nodes and the data flows as edges.
ii. Optimizer
The DAG is then passed to the logical optimizer, which carries out
logical optimizations such as projection and pushdown.
iii. Compiler
It compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution Engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and
executed on Hadoop, producing the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and allows complex non-atomic
data types such as map and tuple. Its elements are described below.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string or a number.
int, long, float, double, chararray, and bytearray are the atomic values of
Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple,
the fields can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
Map
A map (or data map) is a set of key-value pairs. The key needs to be of
type chararray and should be unique. The value might be of any type. It is
represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered
(there is no guarantee that tuples are processed in any particular order).
GRUNT
Pig commands, written as Pig Latin scripts, are executed with the help
of the Grunt shell.
You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the -x option as shown below.
Local mode: $ ./pig -x local
MapReduce mode: $ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering
the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
After invoking the Grunt shell, you can run your Pig scripts in the shell. In
addition to that, there are certain useful shell and utility commands provided by
the Grunt shell.
Next, we will see the shell and utility commands provided by the Grunt shell.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
In addition, we can invoke shell and file system commands from it using sh and fs.
sh Command
Using sh command, we can invoke any shell commands from the Grunt
shell.
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
grunt> sh ls
In this example, it lists out the files in the /pig/bin/ directory.
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the
Grunt shell.
Example
We can invoke the ls command of HDFS from the Grunt shell using fs
command. In the following example, it lists the files in the HDFS root
directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
Utility Commands:
The Grunt shell provides a set of utility commands. These include utility
commands such as clear, help, history, quit, and set; and commands such
as exec, kill, and run to control Pig from the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Example:
grunt> clear
help Command
The help command gives you a list of Pig commands or Pig properties.
Example
grunt> help
It will show list of commands available in pig.
history Command
This command displays a list of statements executed/used so far since
the Grunt shell was invoked.
grunt> history
Assume we have executed three statements since opening the Grunt shell.
set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.
debug − Turn the debugging feature in Pig on or off by passing on/off to this key.
job.name − Set the job name for the required job by passing a string value to this key.
job.priority − Set the job priority for a job by passing one of the following
values to this key: very_low, low, normal, high, very_high
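For example (the values are illustrative):
grunt> set debug 'on'
grunt> set job.name 'my-pig-job'
grunt> set job.priority 'high'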
quit Command
You can quit from the Grunt shell using this command.
Example:
grunt> quit
exec Command
Using the exec command, we can execute Pig scripts from the Grunt
shell.
Example
Let us assume there is a file named student.txt in
the /pig_data/ directory of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in
the /pig_data/ directory of HDFS with the following content.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',')as (id:int,name:chararray,city:chararray);
Dump student;
Now, let us execute the above script from the Grunt shell using
the exec command as shown below.
grunt> exec /sample_script.pig
Output
The exec command executes the script in the sample_script.pig. As
directed in the script, it loads the student.txt file into Pig and gives you the
result of the Dump operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId
Example
Suppose there is a running Pig job with the id Id_0055; you can kill it
from the Grunt shell using the kill command, as shown below.
grunt> kill Id_0055
PIG LATIN
Data types:
Pig Latin supports simple data types such as int, long, float, double,
chararray, and bytearray, and complex types such as tuple, bag, and map.
Comparison operators:
The following table describes the comparison operators of Pig Latin. Suppose a
= 10 and b = 20.
>= − Greater than or equal to − Checks whether the value of the left operand is
greater than or equal to the value of the right operand; if yes, the condition
becomes true. Example: (a >= b) is not true.
<= − Less than or equal to − Checks whether the value of the left operand is
less than or equal to the value of the right operand; if yes, the condition
becomes true. Example: (a <= b) is true.
matches − Pattern matching − Checks whether the string on the left-hand side
matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
Operator − Description
LOAD − Loads data from the file system (local/HDFS) into a relation.
Filtering: FILTER − Selects tuples from a relation based on a condition.
Sorting: ORDER − Sorts a relation based on one or more fields.
The Pig Latin syntax is similar to SQL, but it is much more powerful. Pig
scripts can be run in 1) local mode or 2) MapReduce (online) mode.
As an example, consider a script whose final statement is Dump Employee_limit;
its statements are described below, and a full sketch of the script follows the
description.
The script will load the data in the file named Employee_details.txt as a
relation named Employee, in the first statement.
Moreover, the script will arrange the tuples of the relation in descending
order, based on age, and store it as Employee_order, in the second
statement.
The script will store the first 4 tuples of Employee_order as
Employee_limit, in the third statement.
Ultimately, last and the 4th statement will dump the content of the
relation Employee_limit.
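Putting the statements described above together, the script might look like the
following sketch (the field schema and the load path are assumptions):
Employee = LOAD 'Employee_details.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
Employee_order = ORDER Employee BY age DESC;
Employee_limit = LIMIT Employee_order 4;
Dump Employee_limit;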
In this way, Pig executes the script and produces the corresponding output.
DEVELOPING AND TESTING PIG LATIN SCRIPTS
Unit Testing
With PigUnit you can perform unit testing, regression testing, and rapid
prototyping. No cluster set-up is required if you run Pig in local mode.
PigUnit uses the well-known Java JUnit framework to test Pig
scripts. First, we write a Java class that JUnit uses to run, and inside that
class we provide our Pig scripts, input files, and the expected output, so that
it will automatically run the scripts and return the results. With the help of
PigUnit we can reduce debugging time.
To unit test your own Pig scripts, you need Maven and the Pig Eclipse
plugin installed in Eclipse.
Now, follow the steps below to create a Maven project:
File -> New -> Other -> Maven -> Maven Project
Next, open the pom.xml file:
src -> target -> pom.xml
Then copy the required repositories and dependencies into it, as sketched below.
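For illustration, the PigUnit and JUnit dependencies might look like the
following (the artifact versions are placeholders; use the versions that match
your Pig installation):
<dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.15.0</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
</dependencies>
Consider the following Pig script, which will be tested: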
A = load '/home/kiran/workspace/pigunit/input.data';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;
We have saved the above lines in a file called wordcount.pig and the input data is saved in the
file with name ‘input.data’.
PigTest pigTest = new PigTest("/home/kiran/workspace/pigunit/wordcount.pig");
Here, you can see that we have created an object called ‘pigTest’ and we have passed our Pig
script into it. The output will be given inside the assertOutput method.
pigTest.assertOutput("D", output);
In this case, the statement to be dumped is D, so we give the alias name as 'D'
within the double quotes, and the expected output is stored in the String array
output, which is passed as the second argument.
package java1.pigunit1;
import junit.framework.TestCase;
import org.apache.pig.pigunit.PigTest;
public class AppTest extends TestCase {
public void testwordcount() throws Exception {
PigTest pigTest = new
PigTest("/home/kiran/workspace/pigunit/wordcount.pig");
String[] output = {
"(1,all)","(1,from)","(1,Hello)","(1,Brindavan!)"
};
pigTest.assertOutput("D", output);
}
}
What is TOKENIZE?
The TOKENIZE function splits a string (chararray) into a bag of tokens (words).
Example:
Consider the data in a file as shown below:
USA USA EUROPE EUROPE EUROPE EUROPE USA USA USA EUROPE
EUROPE USA EUROPE USA
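The statements producing the relations inp and tknz shown below are not
included above; they might look like this sketch (the load path is a
placeholder):
inp = LOAD '/home/user/country_data' AS (line:chararray);
tknz = FOREACH inp GENERATE TOKENIZE(line);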
dump inp;
Output
(USA USA EUROPE EUROPE EUROPE EUROPE USA)
(USA USA EUROPE EUROPE USA)
(EUROPE USA)
dump tknz;
Output
{(USA),(USA),(EUROPE),(EUROPE),(EUROPE),(EUROPE),(USA)}
{(USA),(USA),(EUROPE),(EUROPE),(USA)}
{(EUROPE),(USA)}
HIVE
The term ‘Big Data’ is used for collections of large datasets that include
huge volume, high velocity, and a variety of data that is increasing day by day.
Using traditional data management systems, it is difficult to process Big Data.
Therefore, the Apache Software Foundation introduced a framework called
Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework for storing and processing Big Data in a
distributed environment. It contains two main modules: MapReduce for
processing and the Hadoop Distributed File System (HDFS) for storage.
What is Hive
Hive is a data warehouse infrastructure tool that processes structured data in
Hadoop. It resides on top of Hadoop and provides an SQL-like language called
HiveQL for summarizing, querying, and analyzing Big Data.
Architecture of Hive
The main units of the Hive architecture are the user interface (CLI, Web UI,
JDBC/ODBC), the metastore, the HiveQL process engine, the execution engine,
and HDFS or HBase for storage.
Working of Hive
The workflow between Hive and Hadoop proceeds through the following steps:
1 Execute Query
The Hive interface such as Command Line or Web UI sends query to
Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of query compiler that parses the query to
check the syntax and query plan or the requirement of query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of execution job is a MapReduce job. The
execution engine sends the job to JobTracker, which is in Name
node and it assigns this job to TaskTracker, which is in Data node.
Here, the query executes MapReduce job.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
DATA TYPES AND FILE FORMATS
There are different data types in Hive, which are involved in the table
creation. All the data types in Hive are classified into four types, given as
follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When
the data range exceeds the range of INT, you need to use BIGINT and if the
data range is smaller than the INT, you use SMALLINT. TINYINT is smaller
than SMALLINT.
The following table depicts the various INT data types (type − postfix − example):
TINYINT − Y − 10Y
SMALLINT − S − 10S
INT − (none) − 10
BIGINT − L − 10L
String Types
String type data can be specified using single quotes (' ') or double
quotes (" "). Hive has two string data types: VARCHAR and CHAR. Hive follows
C-style escape characters.
VARCHAR − length 1 to 65535
CHAR − length 255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond
precision. It supports java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It
is used for representing immutable arbitrary-precision decimal values. The
syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
Hive literals include floating-point literals (numbers with decimal points) and
DECIMAL literals (arbitrary-precision decimal values).
File Formats
Text File
The Hive text file format is the default storage format. You can use the text
format to interchange data with other client applications. The text file
format is very common in most applications. Data is stored in lines, with each
line being a record. Each line is terminated by a newline character (\n).
The text format is a simple plain file format. You can use compression
(for example BZIP2) on text files to reduce storage space.
Create a TEXT file by adding the storage option 'STORED AS
TEXTFILE' at the end of a Hive CREATE TABLE command.
Syntax:
Create table textfile_table(column_specs)
stored as textfile;
Sequence File
Sequence files are Hadoop flat files that store values in binary key-value
pairs. Sequence files are in binary format and are splittable. The main
advantage of using sequence files is that two or more files can be merged
into one file.
Create a sequence file by adding the storage option 'STORED AS
SEQUENCEFILE' at the end of a Hive CREATE TABLE command.
Syntax
Create table sequencefile_table (column_specs)
stored as sequencefile;
RCFile
RCFile stands for Record Columnar File format. It is another Hive file
format, which offers high row-level compression rates. If you have a
requirement to operate on multiple rows at a time, you can use the RCFile
format.
The RCFile format is very similar to the sequence file format; it also
stores the data as key-value pairs.
Create an RCFile by specifying the 'STORED AS RCFILE' option at the end
of a CREATE TABLE command:
Syntax:
Create table RCfile_table(column_specs)
stored as rcfile;
ORC File
ORC stands for Optimized Row Columnar file format. The ORC file format
provides a highly efficient way to store data in a Hive table. This file
format was designed to overcome limitations of the other Hive file formats.
Using ORC files improves performance when Hive is reading, writing, and
processing data from large tables.
Syntax:
Create table orc_table(column_specs) stored as orc;
DDL
Command − Use With
CREATE − Database, Table
SHOW − Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE − Database, Table, View
USE − Database
DROP − Database, Table
ALTER − Database, Table
TRUNCATE − Table
1.Create Database:
The CREATE DATABASE statement is used to create a database in the
Hive. The DATABASE and SCHEMA are interchangeable. We can use either
DATABASE or SCHEMA.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
Example :
CREATE DATABASE IF NOT EXISTS bda;
2.Show Database:
The SHOW DATABASES statement lists all the databases present in the
Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
Example:
SHOW DATABASES;
3.Describe Database:
The DESCRIBE DATABASE statement in Hive shows the name of
Database in Hive, and its location on the file system.
The EXTENDED keyword can be used to get the database properties.
Syntax:
DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;
Example
DESCRIBE DATABASE bda;
4.Use Database
The USE statement in Hive is used to select the specific database for a
session on which all subsequent HiveQL statements would be executed.
Syntax:
USE database_name;
Example
USE bda;
5.Drop Database
The DROP DATABASE statement in Hive is used to Drop (delete) the
database.
The default behavior is RESTRICT which means that the database is
dropped only when it is empty. To drop the database with tables, we can use
CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
Example :
DROP DATABASE IF EXISTS bda CASCADE;
1. CREATE TABLE
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name
data_type [COMMENT col_comment], ... [COMMENT col_comment])]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS
file_format] [LOCATION hdfs_path];
Example
CREATE TABLE IF NOT EXISTS Employee(
Emp_ID STRING COMMENT 'this is employee-id',
Emp_designation STRING COMMENT 'this is employee post');
2. SHOW TABLES
The SHOW TABLES statement in Hive lists all the base tables
and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
Example
SHOW TABLES;
3. DESCRIBE TABLE in Hive
The DESCRIBE statement in Hive shows the lists of columns for the
specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]
table_name[.col_name ( [.field_name])];
Example:
DESCRIBE Employee;
It describes column name , data type and comment.
4. DROP TABLE
The DROP TABLE statement in Hive deletes the data for a particular
table and removes all metadata associated with it from the Hive metastore.
If PURGE is not specified then the data is actually moved to the
.Trash/current directory. If PURGE is specified, then data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
Example:
DROP TABLE IF EXISTS Employee PURGE;
6. TRUNCATE TABLE
TRUNCATE TABLE statement in Hive removes all the rows from the
table or partition.
Syntax:
TRUNCATE TABLE table_name;
Example
TRUNCATE TABLE Comp_Emp;
It removes all the rows in Comp_Emp Table.
1.Load Command:
The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
If a LOCAL keyword is specified, then the LOAD command will look
for the file path in the local filesystem.
If the LOCAL keyword is not specified, then the Hive will need the
absolute URI of the file.
In case the keyword OVERWRITE is specified, then the contents of the
target table/partition will be deleted and replaced by the files referred by
filepath.
If the OVERWRITE keyword is not specified, then the files referred by
filepath will be appended to the table.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
Example:
LOAD DATA LOCAL INPATH '/home/user/dab' INTO TABLE emp_data;
2. SELECT COMMAND
The SELECT statement in Hive is similar to the SELECT statement in
SQL used for retrieving data from the database.
Syntax:
SELECT col1,col2 FROM tablename;
Example:
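For instance, using the emp_data table loaded in the previous example:
hive> SELECT * FROM emp_data;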
3. INSERT Command
The INSERT command in Hive loads the data into a Hive table. We can
do insert to both the Hive table or partition.
a. INSERT INTO
The INSERT INTO statement appends the data into existing data in the
table or partition. INSERT INTO statement works from Hive version 0.8.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1,
partcol2=val2 ...)] select_statement1 FROM from_statement;
Example
CREATE TABLE IF NOT EXISTS example(id STRING, name STRING,
dep STRING, state STRING, salary STRING, year STRING);
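The insert itself might then look like the following sketch (the source table
dummy is the one used in the INSERT OVERWRITE example below):
INSERT INTO TABLE example SELECT * FROM dummy;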
b. INSERT OVERWRITE
The INSERT OVERWRITE statement overwrites the existing data in the table or
partition.
Example:
Here we are overwriting the existing data of the table ‘example’ with the
data of table ‘dummy’ using INSERT OVERWRITE statement.
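The statement might look like this sketch:
INSERT OVERWRITE TABLE example SELECT * FROM dummy;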
By using the SELECT statement we can verify whether the existing data of the
table ‘example’ is overwritten by the data of table ‘dummy’ or not.
SELECT * FROM EXAMPLE;
c. INSERT .. VALUES
The INSERT .. VALUES statement in Hive inserts data directly into the table
by specifying values in SQL.
Example:
Inserting data into the ‘student’ table using INSERT ..VALUES statement.
INSERT INTO TABLE student VALUES(101,’Callen’,’IT’,’7.8’),
(103,’joseph’,’CS’,’8.2’),
(105,’Alex’,’IT’,’7.9’);
SELECT * FROM student;
4. DELETE command
The DELETE statement in Hive deletes the table data. If the WHERE
clause is specified, then it deletes the rows that satisfy the condition in where
clause.
Syntax:
DELETE FROM tablename [WHERE expression];
Example:
In the below example, we are deleting the data of the student from table
‘student’ whose roll_no is 105.
DELETE FROM student WHERE roll_no=105;
SELECT * FROM student;
5. UPDATE Command
The UPDATE statement in Hive updates the table data. If the WHERE
clause is specified, then it updates the columns of the rows that satisfy the
condition in the WHERE clause.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE
expression];
Example:
In this example, we are updating the branch of the student whose roll_no is 103
in the ‘student’ table using an UPDATE statement.
UPDATE student SET branch=’IT’ WHERE roll_no=103;
SELECT * FROM student;
6. EXPORT Command
The Hive EXPORT statement exports the table or partition data along
with the metadata to the specified output location in the HDFS.
Example:
Here in this example, we are exporting the student table to the HDFS
directory “export_from_hive”.
EXPORT TABLE student TO 'export_from_hive';
The table is successfully exported. You can check for the _metadata file and
the data sub-directory using the ls command.
7. IMPORT Command
The Hive IMPORT command imports the data from a specified location
to a new table or already existing table.
Example:
Here in this example, we are importing the data exported in the above
example into a new Hive table ‘imported_table’.
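The import might look like the following sketch, using the export location
from the previous example:
IMPORT TABLE imported_table FROM 'export_from_hive';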
Verify whether the data was imported using a Hive SELECT statement.
SELECT * from imported_table;
HIVEQL QUERIES
Like SQL, HiveQL provides:
HiveQL operators,
HiveQL functions,
HiveQL GROUP BY and HAVING,
HiveQL ORDER BY and SORT BY,
HiveQL JOIN.
HiveQL - Operators
The HiveQL operators facilitate various arithmetic and relational operations.
Here, we are going to execute such operations on the records.
Let's create a table and load the data into it using the following steps:
hive> create table employee (Id int, Name string, Salary float);
Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table employee;
Operators − Description
A/B − This is used to divide A and B and returns the quotient of the operands.
Let's see an example to find out the 10% salary of each employee.
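A sketch of such a query:
hive> select Id, Name, (Salary * 10) / 100 from employee;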
HIVEQL Functions:
count(*) - Returns the count of the number of rows present in the file.
min(col) - Compares the values and returns the minimum from them.
max(col) - Compares the values and returns the maximum from them.
Example:
Assume the employee_data table contains the following records:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
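Running the max query over these records returns the highest salary:
hive> select max(Salary) from employee_data;
45000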
hive> SELECT * FROM employee ORDER BY Dept;
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+------+--------------+-------------+-------------------+--------+
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
+-------+----------+
| Dept | count(*) |
+-------+----------+
| Admin | 1 |
| HR | 1 |
| PR | 1 |
| TP | 2 |
+-------+----------+
HAVING CLAUSE
The HiveQL HAVING clause is used with the GROUP BY clause. Its purpose is to
apply constraints on the groups of data produced by the GROUP BY clause, so it
always returns the data where the condition is TRUE.
Example:
hive> select department, sum(salary) from employee group by department having sum(salary) >= 45000;
Output:
Without the HAVING clause (GROUP BY only):
DEPARTMENT SUM(SALARY)
Tp 85000
Pr 45000
Hr 40000
Admin 30000
With HAVING sum(salary) >= 45000:
DEPARTMENT SUM(SALARY)
Tp 85000
Pr 45000
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.