
Overview of Hive and Pig

Objectives

After completing this lesson, you should be able to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

Hive
• Hive is an open source Apache project and was originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs, which makes it a high-level abstraction on top of MapReduce.
• Hive communicates with the JobTracker (or, under YARN, the
ResourceManager) to initiate the MapReduce job.
• This lesson covers Hive and Pig at a high level.

Hive: Data Units

Databases

Tables

Partitions

The Hive Metastore Database

• Contains metadata regarding databases, tables, and partitions
• Contains information about how the rows and columns are
delimited in the HDFS files that are used in the queries
• Is a relational database (RDBMS), such as MySQL, where Hive
persists table schemas and other system metadata
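
To make the second bullet concrete, the sketch below models in plain Python the kind of record a metastore might hold for one table, and how a delimiter stored there lets a reader split a raw line from an HDFS file into named columns. The `TableMeta` class, its fields, and the `page_views` table are hypothetical illustrations, not Hive's actual metastore schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical stand-in for one table's metastore entry."""
    database: str
    name: str
    columns: list                # column names, in file order
    field_delim: str = "\x01"    # Ctrl-A, Hive's default field delimiter for text tables
    partition_keys: list = field(default_factory=list)


def parse_row(meta, raw_line):
    """Split one raw text line into a dict keyed by column name."""
    values = raw_line.rstrip("\n").split(meta.field_delim)
    if len(values) != len(meta.columns):
        raise ValueError("row does not match the table schema")
    return dict(zip(meta.columns, values))


# Hypothetical table: tab-delimited page views, partitioned by date.
page_views = TableMeta(
    database="demo_db",
    name="page_views",
    columns=["user_id", "url"],
    field_delim="\t",
    partition_keys=["view_date"],
)

row = parse_row(page_views, "42\t/index.html\n")
```

The point of the sketch is that the data file itself carries no schema: the column names and the delimiter come from the metadata, which is why Hive needs the metastore to interpret files in HDFS.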

Hive Framework

[Figure: Hive framework. Labels shown: external interfaces (CLI, JDBC, Hue) and HiveServer2.]

Creating a Hive Database
1. Start hive.

2. Create the database.

3. Verify the database creation.

Data Manipulation in Hive

Hive SELECT with a WHERE clause:

    SELECT a, sum(b)
    FROM myTable
    WHERE a < 100
    GROUP BY a

[Figure: the query runs as a MapReduce job. Map tasks feed reduce tasks, which produce the result.]

Note: Aggregations such as GROUP BY are handled by reduce tasks.

Data Manipulation in Hive: Nested Queries
    SELECT mt.a, mt.timesTwo, otherTable.z AS ID FROM (
        SELECT a, sum(b*b) AS timesTwo
        FROM myTable
        GROUP BY a) mt
    JOIN otherTable
    ON otherTable.z = mt.a
    GROUP BY mt.a, otherTable.z

[Figure: Job #1 (map tasks, then reduce tasks) writes a temporary result. Job #2 (map tasks, then reduce tasks) reads it and produces the output.]

Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
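
The two-job sequence described in the notes can be sketched in plain Python. This is a toy, single-process simulation under stated assumptions: `run_job`, the sample rows, and the tagged reduce-side join are all hypothetical, and the plan Hive actually generates may differ. The outer GROUP BY is omitted because each joined pair is already unique in this sample data.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical sample data for myTable (a, b) and otherTable (z).
my_table = [(1, 2), (1, 3), (2, 4)]
other_table = [1, 2, 3]


# Job 1: the inner query, SELECT a, sum(b*b) ... GROUP BY a.
def inner_map(row):
    a, b = row
    yield a, b * b


def inner_reduce(a, squares):
    yield a, sum(squares)


temporary_result = run_job(my_table, inner_map, inner_reduce)


# Job 2: the outer query, joining the temporary result with
# otherTable on otherTable.z = mt.a (a reduce-side join).
def outer_map(tagged):
    source, row = tagged
    if source == "mt":
        a, total = row
        yield a, ("mt", total)
    else:
        yield row, ("other", row)


def outer_reduce(key, tagged_values):
    totals = [v for tag, v in tagged_values if tag == "mt"]
    matches = [v for tag, v in tagged_values if tag == "other"]
    for total in totals:
        for z in matches:
            yield key, total, z


job2_input = [("mt", r) for r in temporary_result] + [("other", z) for z in other_table]
output = run_job(job2_input, outer_map, outer_reduce)
```

Job 2 cannot start until Job 1 has written `temporary_result`, which is the sense in which the jobs run from the innermost query outward.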

Steps in a Hive Query

HiveQL:

    SELECT suit, COUNT(*)
    FROM cards
    WHERE face_value > 10
    GROUP BY suit;

[Figure: the query as a MapReduce job on the Hadoop cluster (JobTracker or ResourceManager).]
• Map task: if face_card, emit(suit, card)
• Shuffle
• Reduce task: emit(suit, count(suit))
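
The map, shuffle, and reduce steps for this query can be sketched in plain Python. This is a minimal single-process illustration, not how Hive executes the job; the sample `cards` rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the cards table: (suit, face_value).
cards = [
    ("hearts", 12), ("hearts", 5), ("spades", 11),
    ("spades", 13), ("clubs", 9), ("hearts", 11),
]


def map_task(card):
    """WHERE face_value > 10: emit (suit, card) for face cards only."""
    suit, face_value = card
    if face_value > 10:
        yield suit, card


def shuffle(pairs):
    """Group the emitted cards by suit, as the shuffle phase does by key."""
    groups = defaultdict(list)
    for suit, card in pairs:
        groups[suit].append(card)
    return groups


def reduce_task(suit, grouped_cards):
    """GROUP BY suit with COUNT(*): emit (suit, number of cards)."""
    return suit, len(grouped_cards)


mapped = [pair for card in cards for pair in map_task(card)]
result = dict(reduce_task(suit, group) for suit, group in shuffle(mapped).items())
```

The WHERE clause is applied in the map step, so cards that fail the filter never reach the shuffle; the aggregation happens only in the reduce step.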

Hive-Based Applications

• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling

Hive: Limitations

• No support for materialized views
• No transaction-level support
• Not ideal for ad hoc work
• Limited subquery support

Pig: Overview

Pig:
• Is an open-source high-level data flow system
• Provides a simple language called Pig Latin for queries
and data manipulation, which is compiled into MapReduce
jobs that run on Hadoop

Pig Latin

• Is a high-level data flow language
• Provides common operations such as join, group, and sort
• Works on files in HDFS
• Was developed by Yahoo

Pig Applications

• Rapid prototyping of algorithms for processing large data sets
• Log analysis
• Ad hoc queries across large data sets
• Analytics and sampling
• PigMix: A set of performance and scalability benchmarks

Running Pig Latin Statements

You can execute Pig Latin statements:
• Using the Grunt shell or the command line
• In MapReduce mode or local mode
• Either interactively or in batch
Pig processes Pig Latin statements as follows:
1. It validates the syntax and semantics of all statements.
2. If Pig encounters a DUMP or STORE, it executes the
statements.
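
The two-step behavior above (validate first, execute only when a DUMP or STORE is reached) is a form of deferred execution. The sketch below is a toy model of that idea in plain Python; the `LazyPlan` class and its methods are hypothetical and are not Pig's implementation or API.

```python
class LazyPlan:
    """Toy model of deferred execution; not Pig's actual implementation."""

    def __init__(self):
        self.steps = []      # validated but not yet executed
        self.executed = 0    # counts how many steps have actually run

    def add(self, name, func):
        # Step 1: validate up front (here, only that the step is callable).
        if not callable(func):
            raise TypeError(f"step {name!r} is not callable")
        self.steps.append((name, func))
        return self

    def dump(self, data):
        # Step 2: a DUMP-like call triggers execution of every queued step.
        for _name, func in self.steps:
            data = func(data)
            self.executed += 1
        return data


plan = LazyPlan()
plan.add("filter", lambda rows: [r for r in rows if r > 10])
plan.add("order", sorted)
ran_before_dump = plan.executed      # nothing has run yet
result = plan.dump([12, 3, 25, 11])
```

Deferring execution in this way gives the system the whole pipeline to look at before running anything, which is one reason a data flow language leaves room for optimization.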

Pig Latin: Features

• Ease of programming
• Optimization opportunities
• Extensibility
• Structure that accommodates substantial parallelization

Working with Pig

1. Open a terminal window, type pig, and press Enter.

2. To execute scripts, use the grunt shell prompt.

Summary

In this lesson, you should have learned how to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

End Of Topic
