Unit 4: Pig and Hive

Pig and Pig Latin
Motivation
➢ As a procedural programmer…
▪ You may find writing queries in SQL unnatural and too restrictive
➢ Map-Reduce
▪ Relies on custom code for even common operations
▪ Needs workarounds for tasks whose data flows differ from the expected Map → Combine → Reduce pattern
Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Big Picture
(Diagram: a Pig Latin script, together with user-defined functions, is handed to Pig, which compiles and optimizes it into a series of Map-Reduce jobs.)
Data Model
➢ Atom – a simple atomic value, e.g. 'alice'
➢ Tuple – a sequence of fields
➢ Bag – a collection of tuples
➢ Map – a collection of key-value pairs
▪ Key is an atom; value can be any type
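As an illustration (the values are hypothetical), a single tuple can nest all four types:

('alice',                  -- atom
 {('lakers'), ('iPod')},   -- bag of tuples
 ['age' # 20])             -- map from an atom key to a value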
Data Model
➢ Control over dataflow
▪ Ex 1 (less efficient):
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
▪ Ex 2 (more efficient):
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
➢ Fully nested
▪ More natural for procedural programmers (the target users) than normalization
▪ Data is often stored on disk in a nested fashion
▪ Eases the writing of user-defined functions
➢ No schema required
Speaking Pig Latin
➢ LOAD
▪ Input is assumed to be a bag (a sequence of tuples)
▪ Can specify a deserializer with USING
▪ Can provide a schema with AS
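A sketch (the file name and the deserializer myLoad() are hypothetical):

queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);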
Speaking Pig Latin
➢ FOREACH
▪ Apply some processing to each tuple in a bag
▪ Each generated field can be a field of the bag or a constant
newBag = FOREACH bagName
         GENERATE field1, field2, …;
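For instance, continuing the hypothetical queries bag from LOAD and applying a (hypothetical) UDF to each tuple:

expanded_queries = FOREACH queries
                   GENERATE userId, expandQuery(queryString);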
Speaking Pig Latin
➢ FILTER
▪ Select a subset of the tuples in a bag
newBag = FILTER bagName BY expression;
▪ Expressions use simple comparison operators (==, !=, <, >, …) and logical connectives (AND, OR, NOT)
some_apples = FILTER apples BY colour != 'red';
▪ Can use UDFs
some_apples = FILTER apples BY NOT isRed(colour);
Speaking Pig Latin
➢ COGROUP
▪ Group two datasets together by a common attribute
▪ Groups data into nested bags
grouped_data = COGROUP results BY queryString,
               revenue BY queryString;
Speaking Pig Latin
➢ Why COGROUP and not JOIN?
url_revenues = FOREACH grouped_data GENERATE
               FLATTEN(distributeRev(results, revenue));
▪ May want to process nested bags of tuples before taking the cross product
▪ Keeps to the goal of a single high-level data transformation per Pig Latin statement
▪ However, the JOIN keyword is still available:
JOIN results BY queryString, revenue BY queryString;
▪ Equivalent:
temp = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp GENERATE
              FLATTEN(results), FLATTEN(revenue);
Speaking Pig Latin
➢ STORE (& DUMP)
▪ Output data to a file (STORE) or to the screen (DUMP)
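A sketch (the output location and serializer myStore() are hypothetical):

STORE query_results INTO 'myoutput' USING myStore();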
Compilation
➢ The Pig system does two tasks:
▪ Builds a logical plan incrementally as each Pig Latin statement is parsed
▪ Compiles the logical plan into a physical plan (a series of Map-Reduce jobs), but only when output is requested
Compilation
➢ Building a Logical Plan Example (the plan grows as each statement is parsed)
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
Resulting logical plan: Load(user.dat) → Group → Foreach → Filter
Compilation
➢ Building a Physical Plan
▪ Only happens when output is specified by STORE or DUMP
Compilation
➢ Building a Physical Plan
▪ Step 1: Create a Map-Reduce job for each (CO)GROUP
(Diagram: the Group operator becomes a Map-Reduce stage between the surrounding Load, Filter, and Foreach operators.)
Compilation
➢ Building a Physical Plan
▪ Step 2: Push the other commands into the map and reduce functions where possible (e.g. the Filter into the map function, the Foreach into the reduce function)
Compilation
➢ Efficiency in Execution
▪ Parallelism (inherited from the underlying Map-Reduce framework)
Compilation
(Diagram: the Filter runs in the map phase; a Combine phase computes partial aggregates; the Group and Foreach run in the reduce phase, where e.g. SUM merges the partial results.)
Compilation
➢ Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
➢ Inefficiencies
Debugging
➢ Pig Pen
▪ Pig Latin command window and command generator
▪ Sandbox dataset (generated automatically!)
More Info
➢ Pig, http://hadoop.apache.org/pig/
➢ Hadoop, http://hadoop.apache.org

Thanks!
HIVE – A Warehousing Solution Over a Map-Reduce Framework
Agenda
➢ Why Hive?
➢ What is Hive?
➢ Hive Data Model
➢ Hive Architecture
➢ HiveQL
➢ Hive SerDes
➢ Pros and Cons
➢ Hive vs. Pig
➢ Graphs
Challenges that Data Analysts Faced
(Graph: data analysts working with Hadoop face extra overhead.)
➢ Data Explosion – TBs of data generated every day
➢ Solution – HDFS to store the data, and the Hadoop Map-Reduce framework to parallelize its processing
➢ What is the catch?
▪ Hadoop Map-Reduce is Java-intensive
▪ Thinking in the Map-Reduce paradigm can get tricky
… Enter Hive!
Hive Key Principles
➢ HiveQL compiles to MapReduce
(Diagram: a data analyst submits HiveQL to the Hive framework, which runs it as Map-Reduce; e.g. a row count emits (rowcount, 1) pairs that are summed into (rowcount, N).)
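For instance, a single HiveQL statement (the table name is hypothetical) replaces the handwritten row-count job sketched in the diagram:

SELECT COUNT(1) FROM status_updates;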
Hive Data Model
➢ Tables
▪ Analogous to relational tables
▪ Each table has a corresponding directory in HDFS
▪ Data is serialized and stored as files within that directory
▪ Hive has a default serialization built in, which supports compression and lazy deserialization
▪ Users can specify custom serialization/deserialization schemes (SerDes)
Hive Data Model Contd.
➢ Partitions
▪ Each table can be broken into partitions
▪ Partitions determine the distribution of data within subdirectories
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Each partition is split out into a different folder, like
Sales/country=US/year=2012/month=12
Hierarchy of Hive Partitions
/hivebase/Sales
  /country=US
    /year=2012
      /month=11  (data files)
      /month=12  (data files)
  /country=CANADA
    /year=2012
    /year=2014
    /year=2015
      /month=11  (data files)
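Filtering on the partition columns lets Hive read only the matching subdirectories instead of scanning the whole table; a sketch against the Sales table above:

SELECT sale_id, amount
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;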
Hive Data Model Contd.
➢ Buckets
▪ Data in each partition is divided into buckets
▪ Based on a hash function of a column: H(column) mod NumBuckets = bucket number
▪ Each bucket is stored as a file in the partition directory
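A sketch of the corresponding DDL, extending the earlier Sales example (the bucketing column and bucket count are illustrative):

CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;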
Architecture
Externel Interfaces- CLI, WebUI, JDBC,
ODBC programming interfaces
Hive SerDes
➢ Hive built-in SerDes: Avro, ORC, Regex, etc.
➢ Can use custom SerDes (e.g. for unstructured data like audio/video, or semi-structured XML data)
(Diagram of a SELECT query: the record reader reads records from the Hive table, the SerDe deserializes each record into a Hive row object, and an ObjectInspector maps the fields for the end user.)
Hive vs. Pig
Similarities:
➢ Both are high-level languages that work on top of the Map-Reduce framework
➢ Can coexist, since both use the underlying HDFS and Map-Reduce
Differences:
◆ Language
➢ Pig is procedural (A = LOAD 'mydata'; DUMP A;)
➢ Hive is declarative (SELECT * FROM A)
◆ Work Type
➢ Pig is more suited for ad hoc analysis (e.g. on-demand analysis of clickstream search logs)
➢ Hive is a reporting tool (e.g. weekly BI reporting)
Hive vs. Pig
Differences:
◆ Users
➢ Pig – researchers, programmers (building complex data pipelines, machine learning)
➢ Hive – business analysts
◆ Integration
➢ Pig – doesn't have a Thrift server (i.e. no/limited cross-language support)
➢ Hive – has a Thrift server
◆ Users' needs
➢ Pig – better development environments and debuggers expected
➢ Hive – better integration with other technologies expected (e.g. JDBC, ODBC)
Head-to-Head (the bee, the pig, the elephant)
(Benchmark graphs comparing Hive, Pig, and plain Hadoop.)
Sqoop
❖ A command-line interface application for transferring data between relational databases and Hadoop
❖ Used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases
❖ When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, and Pig came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop's HDFS
❖ Sqoop − "SQL to Hadoop and Hadoop to SQL"
Sqoop
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool. It offers the following capabilities:
1. Imports individual tables or entire databases to files in HDFS
2. Generates Java classes that allow you to interact with your imported data
3. Offers the ability to import from SQL databases straight into your Hive data warehouse
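A minimal sketch of such an import (the connection string, credentials, table, and paths are hypothetical):

# Import one table from MySQL into HDFS
sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username analyst \
  --table Sales \
  --target-dir /data/sales

# The same import, landing directly in the Hive warehouse
sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username analyst \
  --table Sales \
  --hive-import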
Sqoop
It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import the updates made to a database since the last import.
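A sketch of an incremental load (the check column and last value are hypothetical):

# Import only rows whose sale_id is greater than the last imported value
sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --table Sales \
  --incremental append \
  --check-column sale_id \
  --last-value 12345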
Step 3: Since Sqoop is written in Java, it packages the compiled classes (used to generate the table structure) into a jar file, the standard Java packaging format, after compiling.
Key Features
There are many salient features of Sqoop, which give several reasons to learn it:
a. Parallel import/export – when importing and exporting data, Sqoop uses the YARN framework, which provides fault tolerance on top of parallelism.
Advantages of Flume
➢ Using Apache Flume, we can store the data into any of the centralized stores (HBase, HDFS).
➢ When the rate of incoming data exceeds the rate at which it can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores, providing a steady flow of data between them.
➢ Flume provides the feature of contextual routing.
➢ Transactions in Flume are channel-based: two transactions (one for the sender, one for the receiver) are maintained for each message. This guarantees reliable message delivery.
➢ Flume is reliable, fault-tolerant, scalable, manageable, and customizable.
Features of FLUME
➢ Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
➢ Using Flume, we can get data from multiple servers into Hadoop immediately.
➢ Along with log files, Flume is also used to import the huge volumes of event data produced by social networking sites like Facebook and Twitter, and by e-commerce websites like Amazon and Flipkart.
➢ Flume supports a large set of source and destination types.
➢ Flume supports multi-hop flows, fan-in/fan-out flows, contextual routing, etc.
➢ Flume can be scaled horizontally.
Flume Architecture
A Flume agent has three components:
~ Source: accepts the data from the incoming streamline and stores it in the channel.
~ Channel: in general, reading is faster than writing, so we need a buffer to match the read and write speeds. The buffer acts as an intermediary store that temporarily holds the data in transit and therefore prevents data loss. The channel is this local, temporary storage between the source of the data and the persistent data in HDFS.
~ Sink: the last component; it collects the data from the channel and commits (writes) it to HDFS permanently.
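A minimal agent configuration sketch (the agent and component names are hypothetical, and the netcat source is a stand-in; a real deployment would typically use a web-server log source):

# flume.conf – one agent wiring a source to an HDFS sink via a memory channel
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: accept events from an incoming network stream
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer that matches read/write speeds
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: commit events permanently to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1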
REFERENCES
➢ https://hive.apache.org/
➢ https://cwiki.apache.org/confluence/display/Hive/Presentations
➢ https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
➢ http://www.qubole.com/blog/big-data/hive-best-practices/
➢ Hortonworks tutorials (YouTube)
➢ Graph: https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf
Thanks