
Unit 5 – Hive

Introduction
• Apache Hive is an open-source data warehouse system built on top of
Hadoop, used for querying and analyzing large datasets stored in
Hadoop files.
• Previously, users had to write complex MapReduce jobs; with the
help of Hive, they simply submit SQL-like queries.
• Hive is mainly targeted at users who are comfortable with SQL.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• Hive automatically translates HiveQL queries into MapReduce jobs.

Hive Architecture

[Architecture diagram; the components are described on the following slides]
Hive Architecture - Component

• Metastore –
• It stores metadata for each table, such as its schema and location.
• Hive also stores partition metadata here.
• This helps the driver track the various data sets distributed
over the cluster.
• Driver –
• It acts as a controller that receives the HiveQL statements.
• The driver starts execution of a statement by creating sessions.
• It monitors the life cycle and progress of the execution.

Hive Architecture - Component

• Compiler –
• It compiles the HiveQL query.
• It converts the query into an execution plan.
• The plan contains the tasks, and the steps MapReduce needs to
perform, to produce the output the query describes.
• Optimizer –
• It performs various transformations on the execution plan.
• It aggregates transformations together, such as converting a pipeline of
joins into a single join, for better performance.

Hive Architecture - Component

• Executor –
• Once compilation and optimization are complete, the executor executes
the tasks.
• The executor takes care of pipelining the tasks.

• CLI, UI, and Thrift Server –
• The CLI (command-line interface) provides a user interface for an external
user to interact with Hive.
• The web UI offers a browser-based alternative to the CLI.
• The Thrift server allows external clients to interact with Hive over a
network, similar to the JDBC or ODBC protocols.

Hive shell, Hive services, Hive metastore

• Hive shell - the interactive interface in which we can execute Hive
queries and commands.

• Hive Services – Hive provides the following services:
• Hive Metastore - It stores metadata (schemas, locations, and partitions)
for the tables.
• Hive Server - It accepts requests from different clients and forwards
them to the Hive Driver.
• Hive Driver - It receives queries from different sources like the web UI,
CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
• Hive Compiler - The purpose of the compiler is to parse the query and
perform semantic analysis.
• Hive Execution Engine - The execution engine executes the incoming tasks in
the order of their dependencies.
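
A minimal sketch of a Hive shell session (the database and table names are illustrative):

hive> SHOW DATABASES;
hive> USE userdb;
hive> SELECT COUNT(*) FROM employee;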

Comparison with traditional database

RDBMS | Hive
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
The schema is fixed. | The schema varies (schema-on-read).
Only normalized data is stored. | Both normalized and de-normalized data are stored.
Tables in an RDBMS are sparse. | Tables in Hive are dense.
It typically does not partition data across a cluster. | It supports automatic partitioning.
No partitioning method is used. | Sharding is used for partitioning.
HiveQL - Hive Data Types

• Primitive types – numeric (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE,
DECIMAL), string (STRING, VARCHAR, CHAR), date/time (TIMESTAMP, DATE),
BOOLEAN, and BINARY.
• Complex types – ARRAY, MAP, STRUCT, and UNIONTYPE.
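
A sketch of a table declaration exercising several of these types (the contacts table and all of its columns are illustrative):

hive> CREATE TABLE contacts (name STRUCT<first:STRING, last:STRING>,
phones ARRAY<STRING>, address MAP<STRING, STRING>, age INT);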
HiveQL - Operators

• Arithmetic Operators -> +, -, *, /, %
• Relational Operators -> =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, LIKE, etc.
• Logical Operators -> AND, OR, NOT
• Complex Operators ->
• These operators provide expressions to access the elements of complex
types:
• A[n] - A is an ARRAY and n is an int; returns the nth element of A.
• M[key] - M is a MAP<K, V> and key has type K; returns the value for key.
• S.x - S is a STRUCT; returns the x field of S.
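
A short sketch of these access operators, using the hypothetical contacts table above:

hive> SELECT phones[0], address['city'], name.first FROM contacts;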

Hive DDL Commands

• Hive DDL commands are the statements used for defining and changing the
structure of a table or database in Hive.
The main Hive DDL commands are:
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
Hive DDL Commands

CREATE DATABASE in Hive


Syntax –
hive> CREATE DATABASE [IF NOT EXISTS] userdb;

SHOW DATABASES in Hive


Syntax –
hive> SHOW DATABASES;

Hive DDL Commands

Drop Database in Hive


Syntax –
hive> DROP DATABASE IF EXISTS userdb;

Create Table in Hive


Syntax –
hive> CREATE TABLE IF NOT EXISTS employee (eid INT, name STRING,
salary STRING, destination STRING);

Hive DDL Commands

Drop Table in Hive


Syntax –
hive> DROP TABLE IF EXISTS employee;
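
Brief sketches of the remaining DDL commands from the list above (the renamed table emp is illustrative):

hive> USE userdb;
hive> DESCRIBE employee;
hive> ALTER TABLE employee RENAME TO emp;
hive> TRUNCATE TABLE emp;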

HiveQL - DML

The Hive Query Language (HiveQL) is a query language for Hive to
process and analyze structured data in a Metastore.

SELECT Query -
hive> SELECT * FROM employee WHERE salary > 30000;

ORDER BY clause in a SELECT statement -
hive> SELECT Id, Name, Dept FROM employee ORDER BY Dept;
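
Data is usually loaded into a table before it is queried; a minimal DML sketch (the local file path is illustrative):

hive> LOAD DATA LOCAL INPATH '/home/user/employee.txt' OVERWRITE INTO TABLE employee;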

Hive tables

Two types of table in Hive –

• Managed Tables - In a managed table, both the table data and the
table schema are managed by Hive. The data is located in a
folder named after the table within the Hive data warehouse. Dropping
a managed table deletes both the metadata and the data.
• External Tables - In an external table, only the table schema is
controlled by Hive. The user sets up the folder location within HDFS
and copies the data file(s) there. Dropping an external table deletes
only the metadata; the data files remain in HDFS.

Hive tables

Two types of table in Hive –

• Managed Tables –
-- `exchange` is backquoted because it is a reserved word in newer Hive versions
CREATE TABLE IF NOT EXISTS stocks (`exchange` STRING, symbol STRING, price_open FLOAT,
price_high FLOAT, price_low FLOAT, price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• External Tables –
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (`exchange` STRING, symbol STRING,
price_open FLOAT, price_high FLOAT, price_low FLOAT, price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

User Defined Functions

• In Hive, users can define their own functions to meet specific
requirements.
• These are known as UDFs (User Defined Functions) in Hive.
• Hive UDFs are written in Java.
• There are two different interfaces for writing Apache
Hive User Defined Functions:
• Simple API
• Complex API

User Defined Functions

Simple API
• With the simpler UDF API, building a Hive User Defined Function
involves little more than writing a class with one function.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleUDFExample extends UDF
{
    // Called once per row; returns the input prefixed with "Hello ".
    public Text evaluate(Text input)
    {
        if (input == null)
            return null;
        return new Text("Hello " + input.toString());
    }
}

Complex API
A Complex API UDF is written by extending the GenericUDF class.
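
A hedged sketch of registering and invoking such a UDF from the Hive shell (the jar path and the function name hello are illustrative):

hive> ADD JAR /path/to/simple-udf.jar;
hive> CREATE TEMPORARY FUNCTION hello AS 'SimpleUDFExample';
hive> SELECT hello(name) FROM employee;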
Sorting and aggregating in Hive

SORT BY (ASC|DESC): This indicates which columns to sort by when
ordering the records fed to each reducer. Sorting completes before the
data reaches the reducer, so the output is ordered within each reducer
but not globally (unlike ORDER BY, which produces one totally ordered
output).

hive> SELECT name FROM service_table_external SORT BY portid DESC;
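
SORT BY is often paired with DISTRIBUTE BY, which controls which reducer each row is sent to (CLUSTER BY is shorthand for distributing and sorting on the same columns); a sketch using the same table:

hive> SELECT name FROM service_table_external DISTRIBUTE BY portid SORT BY portid DESC;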

Aggregate Functions

• Sum –
hive> select sum(salary) from employee;

• Count –
hive> select count(*) from employee;
• Average –
hive> select avg(salary) from employee where location='Bangalore';

Aggregate Functions

• Minimum –
hive> select min(salary) from employee;
• Maximum -
hive> select max(salary) from employee;
• Variance -
hive> select variance(salary) from employee;
• Standard Deviation -
hive> select stddev_pop(salary) from employee;
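
Aggregate functions are commonly combined with GROUP BY; a minimal sketch against the same employee table:

hive> select location, avg(salary) from employee group by location;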

Joins and Subqueries

• There are different types of joins given as follows:


• JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
• JOIN - The JOIN clause is used to combine and retrieve records from
multiple tables. JOIN is the same as INNER JOIN in SQL: it returns
records that have matching values in both tables.

Joins and Subqueries

• LEFT OUTER JOIN - The HiveQL LEFT OUTER JOIN returns all the
rows from the left table, even if there are no matches in the right
table.
• RIGHT OUTER JOIN - The HiveQL RIGHT OUTER JOIN returns all the
rows from the right table, even if there are no matches in the left
table.
• FULL OUTER JOIN - The HiveQL FULL OUTER JOIN returns all rows from
both the left and the right tables, filling in NULLs on whichever side
has no matching row.
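
A hedged sketch of the four join types over hypothetical customers and orders tables, joined on c.id = o.customer_id:

hive> SELECT c.id, c.name, o.amount FROM customers c JOIN orders o ON (c.id = o.customer_id);
hive> SELECT c.id, c.name, o.amount FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);
hive> SELECT c.id, c.name, o.amount FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.customer_id);
hive> SELECT c.id, c.name, o.amount FROM customers c FULL OUTER JOIN orders o ON (c.id = o.customer_id);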

[Example slides: a sample CUSTOMERS table and the result sets of JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN]
Let’s put your knowledge to the test
Map Reduce scripts

• Using an approach like Hadoop Streaming, the TRANSFORM, MAP,
and REDUCE clauses make it possible to invoke an external script or
program from Hive.
• Hive passes each input row to the script as tab-separated text on its
standard input, and parses the script's tab-separated standard output
into output rows.
• Example - a script to filter out rows with poor-quality readings:
• hive> ADD FILE /path/to/is_good_quality.py;
• hive> FROM records2 SELECT TRANSFORM(year, temperature, quality)
USING 'is_good_quality.py' AS year, temperature;

Q & A Time
We have 10 Minutes for Q&A
