
Pig and Pig Latin
Motivation

You're a procedural programmer. You have huge data. You want to analyze it.
Motivation

➢ As a procedural programmer…
▪ You may find writing queries in SQL unnatural and too restrictive
▪ You are more comfortable writing code: a series of statements rather than one long query (part of why MapReduce has been so successful)
Motivation
Data analysis goals
➢ Quick - exploit the parallel processing power of a distributed system
➢ Easy - be able to write a program or query without a huge learning curve; have some common analysis tasks predefined
➢ Flexible - transform a data set into a workable structure without much overhead; perform customized processing
➢ Transparent - have a say in how the data processing is executed on the system
Motivation
➢ Relational Distributed Databases
▪ Parallel database products are expensive
▪ Rigid schemas
▪ Processing requires constructing declarative SQL queries
➢ Map-Reduce
▪ Relies on custom code for even common operations
▪ Requires workarounds for tasks whose data flow differs from the expected Map - Combine - Reduce
Motivation
➢ Sweet spot between Relational Distributed Databases and Map-Reduce: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
Pig Latin Example
Table urls: (url,category, pagerank)

For each sufficiently large category, find the average pagerank of high-pagerank urls in that category

SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Big Picture
[Diagram] A Pig Latin script, together with user-defined functions, goes into Pig, which compiles and optimizes it into Map-Reduce statements; the resulting jobs read the data and write the results.
Data Model
 Atom - simple atomic value (e.g., a number or string)
 Tuple - sequence of fields; each field can be of any type
 Bag - collection of tuples
 Duplicates possible
 Tuples in a bag can have different numbers and types of fields
 Map - collection of key-value pairs
 Key is an atom; the value can be of any type
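A single data item can combine all four kinds of values. For instance (a minimal sketch; the field values are purely illustrative):

('alice', {('lakers', 1), ('iPod', 2)}, ['age' # 20])

Here 'alice' is an atom, the second field is a bag of two tuples, and the third field is a map from the atom 'age' to the atom 20.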
Data Model
 Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
 Fully nested
More natural for procedural programmers (target user) than
normalization
 Data is often stored on disk in a nested fashion
 Facilitates ease of writing user-defined functions

 No schema required
Data Model

 User-Defined Functions (UDFs)


 Ex: spam_urls = FILTER urls BY isSpam(url);
 Can be used in many Pig Latin statements
 Useful for custom processing tasks

 Can use non-atomic values for input and output


 Currently must be written in Java

Speaking Pig Latin
 LOAD
 Input is assumed to be a bag (sequence of tuples)
 Can specify a deserializer with USING
 Can provide a schema with AS

newBag = LOAD ‘filename’


<USING functionName() >
<AS (fieldName1, fieldName2,…)>;

Queries = LOAD ‘query_log.txt’


USING myLoad()
AS (userID,queryString, timeStamp)

Speaking Pig Latin
 FOREACH
 Apply some processing to each tuple in a bag
 Each field can be:
 A field name of the bag
 A constant
 A simple expression (e.g., f1+f2)
 A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
 A UDF (e.g., sumTaxes(gst, pst))

newBag =
FOREACH bagName
GENERATE field1, field2, …;
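For example, reusing the Queries bag loaded earlier (a sketch; expandQuery is a hypothetical UDF that turns a query string into related queries):

expanded_queries = FOREACH Queries GENERATE userID, expandQuery(queryString);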

Speaking Pig Latin
 FILTER
 Select a subset of the tuples in a bag
newBag = FILTER bagName
BY expression;
 Expressions use simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT)
some_apples =
FILTER apples BY colour != ‘red’;
 Can use UDFs
some_apples =
FILTER apples BY NOT isRed(colour);

Speaking Pig Latin
 COGROUP
 Group two datasets together by a common attribute
 Groups data into nested bags
grouped_data = COGROUP results BY queryString,
revenue BY queryString;

Speaking Pig Latin
 Why COGROUP and not JOIN?
url_revenues =
FOREACH grouped_data GENERATE
FLATTEN(distributeRev(results, revenue));

Speaking Pig Latin
 Why COGROUP and not JOIN?
 May want to process nested bags of tuples before taking the
cross product.
 Keeps to the goal of a single high-level data transformation per Pig Latin statement.
 However, JOIN keyword is still available:
JOIN results BY queryString,
revenue BY queryString;
Equivalent
temp = COGROUP results BY queryString,
revenue BY queryString;
join_result = FOREACH temp GENERATE
FLATTEN(results), FLATTEN(revenue);

Speaking Pig Latin
 STORE (& DUMP)
 Output data to a file (or screen)

STORE bagName INTO ‘filename’


<USING serializer()>;

 Other Commands (incomplete)


 UNION - return the union of two or more bags
 CROSS - take the cross product of two or more bags
 ORDER - order tuples by a specified field(s)
 DISTINCT - eliminate duplicate tuples in a bag
 LIMIT - Limit results to a subset
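A brief sketch of a few of these, reusing the good_urls bag from the earlier example (the field name pagerank comes from that example; the limit of 10 is arbitrary):

distinct_urls = DISTINCT good_urls;
ordered_urls = ORDER distinct_urls BY pagerank DESC;
top_urls = LIMIT ordered_urls 10;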

Compilation
 The Pig system does two tasks:
 Builds a Logical Plan from a Pig Latin script
 Supports execution platform independence
 No processing of data is performed at this stage
 Compiles the Logical Plan to a Physical Plan and executes it
 Converts the Logical Plan into a series of Map-Reduce jobs to be executed (in this case) by Hadoop Map-Reduce
Compilation
 Building a Logical Plan
 Verify that the input files and bags referred to are valid
 Create a logical plan for each bag (variable) defined
Compilation
 Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city IS 'kitchener' OR city IS 'waterloo';
STORE D INTO 'local_user_count.dat';

[Diagram] The logical plan grows one operator per statement: Load(user.dat) -> Group -> Foreach -> Filter; the Filter (on the group key city) is then pushed ahead of the Group, giving Load(user.dat) -> Filter -> Group -> Foreach.
Compilation
 Building a Physical Plan
 Only happens when output is requested by STORE or DUMP
[Diagram] Load(user.dat) -> Filter -> Group -> Foreach
Compilation
 Building a Physical Plan
 Step 1: Create a map-reduce job for each COGROUP (the GROUP in this example)
[Diagram] Map and Reduce boundaries are placed around the Group operator: Load(user.dat) -> Filter -> (Map | Group | Reduce) -> Foreach
Compilation
 Building a Physical Plan
 Step 2: Push the other commands into the map and reduce functions where possible
 Certain commands may still require their own map-reduce job (e.g., ORDER needs a separate map-reduce job)
[Diagram] Load(user.dat) and Filter are pushed into the Map function; the Foreach following the Group is pushed into the Reduce function
Compilation
 Efficiency in Execution

 Parallelism

 Loading data - Files are loaded from HDFS

 Statements are compiled into map-reduce jobs

Compilation
 Efficiency with Nested Bags
 In many cases, the nested bags created in each tuple of a COGROUP statement never need to physically materialize
 Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
 This applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
Compilation
 Efficiency with Nested Bags
[Diagram] In the compiled plan, Load(user.dat) and Filter run in the Map function; the Group spans the Map/Reduce boundary; the Foreach (containing COUNT) is split so that partial aggregation runs in a Combine step and the final aggregation runs in the Reduce step
Compilation
 Efficiency with Nested Bags
 Why this works:
 COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
[Diagram] Combine: COUNT over each subset of tuples; Reduce: SUM of the partial counts
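As a small worked example (the numbers are illustrative): if a group of 10 tuples is split across three map tasks, the combiners emit partial counts of 4, 3, and 3, and the reducer computes SUM(4, 3, 3) = 10 - the same answer COUNT would give over the whole bag, without ever materializing it.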
Compilation
 Efficiency with Nested Bags
 Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
 Inefficiencies
 Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; this may cause a very large bag to spill to disk if it doesn't fit in memory
 Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
Debugging
 How do you verify the semantics of an analysis program?
 Run the program against the whole data set - might take hours!
 Generate a sample dataset
 A random sample may produce empty results for operations like join and filter
 In general, testing with a sample dataset is difficult
 Pig-Pen
 Samples data from the large dataset for the Pig Latin statements
 Applies individual Pig Latin commands against the sample dataset
 In case of an empty result, the Pig system resamples
 Removes redundant samples
Debugging
 Pig-Pen [screenshot]
Debugging
 Pig-Latin command window and command generator [screenshot]
Debugging
 Sandbox dataset (generated automatically!) [screenshot]
Debugging
 Pig-Pen
 Provides sample data that is:
 Real - taken from the actual data
 Concise - as small as possible
 Complete - collectively illustrates the key semantics of each command
 Helps with schema definition
 Facilitates incremental program writing
Conclusion
 Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
 Pig Latin offers high-level data manipulation in a procedural style.
 Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
More Info
 Pig, http://hadoop.apache.org/pig/
 Hadoop, http://hadoop.apache.org

Anks-Thay!
HIVE – A WAREHOUSING SOLUTION OVER A MAP-REDUCE FRAMEWORK
Agenda

 Why Hive???
 What is Hive?
 Hive Data Model
 Hive Architecture
 HiveQL
 Hive SerDe’s
 Pros and Cons
 Hive v/s Pig
 Graphs
Data Analysts with Hadoop
 Extra overhead
Challenges that data analysts faced:
 Data explosion - TBs of data generated every day
Solution - HDFS to store the data and the Hadoop Map-Reduce framework to parallelize its processing
What is the catch?
- Hadoop Map-Reduce is Java intensive
- Thinking in the Map-Reduce paradigm can get tricky
… Enter Hive!
Hive Key Principles
 HiveQL to MapReduce
[Diagram] A data analyst submits "SELECT COUNT(1) FROM Sales;" to the Hive framework; Hive compiles it into a Map-Reduce job instance over the Sales Hive table, with map tasks emitting (rowcount, 1) pairs and the reduce step producing the total (rowcount, N).
Hive Data Model

Data in Hive is organized into:


 Tables
 Partitions
 Buckets
Hive Data Model Contd.

Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data serialized and stored as files within that directory
- Hive has default serialization built in which supports
compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
Hive Data Model Contd.
Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
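A brief sketch of how the partition columns are then used (the staging_sales table is assumed purely for illustration):

-- write one partition explicitly
INSERT OVERWRITE TABLE Sales PARTITION (country='US', year=2012, month=12)
SELECT sale_id, amount FROM staging_sales;

-- a query that filters on the partition columns only reads the matching subdirectory
SELECT SUM(amount) FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;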
Hierarchy of Hive Partitions
[Diagram] Partition subdirectories nest in the declared order (country, then year, then month), with the data files at the leaves, e.g.:
/hivebase/Sales/country=US/year=2012/month=12/<files>
/hivebase/Sales/country=US/year=2012/month=11/<files>
/hivebase/Sales/country=CANADA/year=2015/month=11/<files>
Hive Data Model Contd.

 Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in partition directory
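For example, a bucketed version of the Sales table might be declared as follows (a sketch; the bucketing column and the bucket count are illustrative):

CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;
-- each row lands in bucket H(sale_id) mod 32 within its partition directory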
Architecture
 External Interfaces - CLI, WebUI, JDBC, ODBC programming interfaces
 Thrift Server - cross-language service framework
 Metastore - metadata about the Hive tables and partitions
 Driver - the brain of Hive! Compiler, Optimizer and Execution engine
Hive Thrift Server

• Framework for cross language services


• Server written in Java
• Support for clients written in different languages
- JDBC (Java), ODBC (C++), PHP, Perl, Python scripts
Metastore

• System catalog which contains metadata about the Hive tables


• Stored in an RDBMS or the local file system; HDFS is too slow (not optimized for random access)
• Objects of Metastore
➢ Database - namespace of tables
➢ Table - list of columns, types, owner, storage, SerDes
➢ Partition - partition-specific columns, SerDes and storage
Hive Driver

• Driver - Maintains the lifecycle of HiveQL statement


• Query Compiler - compiles HiveQL into a DAG of map-reduce tasks
• Executor - Executes the tasks plan generated by the compiler in proper
dependency order. Interacts with the underlying Hadoop instance
Compiler
 Converts the HiveQL into a plan for execution
 Plans can contain:
- Metadata operations for DDL statements, e.g. CREATE
- HDFS operations, e.g. LOAD
 Semantic Analyzer - checks schema information, performs type checking and implicit type conversion, verifies columns
 Optimizer - finds the best logical plan, e.g. combines multiple joins in a way that reduces the number of map-reduce jobs, prunes columns early to minimize data transfer
 Physical plan generator - creates the DAG of map-reduce jobs
HiveQL
DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE
DML:
LOAD DATA
INSERT
QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
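A short sketch tying a few of these statements together (the tables, columns, and input path are illustrative; it assumes a users table with user_id and city columns):

CREATE TABLE page_views (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

SELECT u.city, COUNT(1)
FROM page_views pv JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.city;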
Hive SerDe
 SELECT Query
➢ Hive has built-in SerDes: Avro, ORC, Regex, etc.
➢ Custom SerDes can be used (e.g. for unstructured data like audio/video, or semi-structured XML data)
[Diagram] A Record Reader reads records from the Hive table; the SerDe deserializes each record into a Hive Row Object; an Object Inspector maps the fields for the end-user object.
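For instance, a table can be bound to a built-in SerDe at creation time (a sketch using the Regex SerDe; the table, columns, and regular expression are illustrative, and the SerDe class name assumes the version bundled with Hive):

CREATE TABLE access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;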
Good Things

 Boon for Data Analysts


 Easy Learning curve
 Completely transparent to underlying Map-Reduce
 Partitions (speed!)
 Flexibility to load data from localFS/HDFS into Hive Tables
Cons and Possible Improvements

 Extend SQL query support (updates, deletes)
 Parallelize firing independent jobs from the work DAG
 Table statistics in the Metastore
 Explore methods for multi-query optimization
 Perform N-way generic joins in a single map-reduce job
 Better debug support in the shell
Hive v/s Pig

Similarities:
➢ Both are high-level languages that work on top of the map-reduce framework
➢ They can coexist since both use the underlying HDFS and map-reduce

Differences:
◆ Language
➢ Pig is procedural (A = load 'mydata'; dump A)
➢ Hive is declarative (select * from A)

◆ Work Type
➢ Pig is more suited for ad hoc analysis (e.g. on-demand analysis of click-stream search logs)
➢ Hive is more of a reporting tool (e.g. weekly BI reporting)
Hive v/s Pig

Differences:

◆ Users
➢ Pig – Researchers, Programmers (build complex data pipelines,
machine learning)
➢ Hive – Business Analysts
◆ Integration
➢ Pig - doesn't have a Thrift server (i.e. no/limited cross-language support)
➢ Hive - Thrift server

◆ User's need
➢ Pig - better dev environments and debuggers expected
➢ Hive - better integration with technologies expected (e.g. JDBC, ODBC)
Head-to-Head
(the bee, the pig, the elephant)

Version: Hadoop – 0.18x, Pig:786346, Hive:786346


Sqoop
❖ Sqoop is an open-source framework provided by Apache.
❖ It is a command-line interface application for transferring data between relational databases and Hadoop.
❖ It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export from the Hadoop file system to relational databases.
❖ When Big Data stores and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with the relational database servers for importing and exporting the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop's HDFS.
❖ Sqoop - "SQL to Hadoop and Hadoop to SQL"
Sqoop
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool. It offers the following capabilities:
1. Imports individual tables or entire databases to files in HDFS
2. Generates Java classes that allow you to interact with your imported data
3. Offers the ability to import from SQL databases straight into your Hive data warehouse
Sqoop
 It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import.
 Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2, and vice versa.
Sqoop Working
Step 1: Sqoop sends a request to the relational DB to return the metadata information about the table (metadata here is the data about the table in the relational DB).
Step 2: From the received information it generates Java classes (this is why you should have Java configured before getting it working - internally Sqoop uses the JDBC API to generate the data access code).
Step 3: Sqoop (as it is written in Java) packages the compiled classes so it can work with the table structure; after compiling, it creates a jar file (the Java packaging standard).
Key Features of Sqoop
There are many salient features of Sqoop, which give us several reasons to learn it.
a. Parallel import/export
When it comes to importing and exporting data, Sqoop uses the YARN framework, which offers fault tolerance on top of parallelism.
b. Connectors for all major RDBMS databases
Sqoop offers connectors for multiple RDBMS databases, covering almost the entire circumference.
c. Import results of SQL query
d. Full load
One of the important features of Sqoop: we can load a whole table with a single command, and we can also load all the tables from a database with a single command.
f. Kerberos security integration
Sqoop supports Kerberos authentication. Kerberos is a computer network authentication protocol that works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
g. Load data directly into Hive/HBase
For analysis, we can load data directly into Apache Hive. We can also dump the data into HBase, which is a NoSQL database.
h. Compression
We can compress the data by using the deflate (gzip) algorithm with the --compress argument, or by specifying the --compression-codec argument.
Sqoop commands
codegen - Generate code to interact with database records
create-hive-table - Import a table definition into Hive
eval - Evaluate a SQL statement and display the results
export - Export an HDFS directory to a database table
help - List available commands
import - Import a table from a database to HDFS
import-all-tables - Import tables from a database to HDFS
list-databases - List available databases on a server
list-tables - List available tables in a database
version - Display version information
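For example, a typical import invocation might look like this (a sketch; the connection string, credentials, table, and target directory are all illustrative):

sqoop import \
  --connect jdbc:mysql://dbhost/sales_db \
  --username analyst -P \
  --table Sales \
  --target-dir /user/hadoop/sales \
  --num-mappers 4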


Flume
➢ Flume is a standard, simple, robust, flexible, and extensible tool for data ingestion from various data producers (web servers) into Hadoop.
➢ Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
➢ Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Data Flow using Flume
Apache Flume - Data Flow
❑ Flume is a framework which is used to move log data into HDFS. Generally, events and log data are generated by log servers, and these servers have Flume agents running on them. These agents receive the data from the data generators.
❑ The data in these agents is collected by an intermediate node known as a Collector. Just like agents, there can be multiple collectors in Flume.
❑ Finally, the data from all these collectors is aggregated and pushed to a centralized store such as HBase or HDFS. [Diagram: data flow in Flume]
Data Flow using Flume
Multi-hop Flow
 Within Flume, there can be multiple agents and before reaching the final destination, an event may
travel through more than one agent. This is known as multi-hop flow.
Fan-out Flow
 The dataflow from one source to multiple channels is known as fan-out flow. It is of two types −
➢ Replicating − The data flow where the data will be replicated in all the configured channels.
➢ Multiplexing − The data flow where the data will be sent to a selected channel which is mentioned
in the header of the event.
Fan-in Flow
 The data flow in which the data will be transferred from many sources to one channel is known as fan-in flow.
Failure Handling
 In Flume, for each event, two transactions take place: one at the sender and one at the receiver. The
sender sends events to the receiver. Soon after receiving the data, the receiver commits its own
transaction and sends a “received” signal to the sender. After receiving the signal, the sender commits
its transaction. (Sender will not commit its transaction till it receives a signal from the receiver.)
Advantages of FLUME

➢ Using Apache Flume we can store the data in to any of the centralized stores
(HBase, HDFS).
➢ When the rate of incoming data exceeds the rate at which data can be written to
the destination, Flume acts as a mediator between data producers and the
centralized stores and provides a steady flow of data between them.
➢ Flume provides the feature of contextual routing.
➢ The transactions in Flume are channel-based where two transactions (one sender
and one receiver) are maintained for each message. It guarantees reliable message
delivery.
➢ Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of FLUME
➢ Flume ingests log data from multiple web servers into a centralized store
(HDFS, HBase) efficiently.
➢ Using Flume, we can get the data from multiple servers immediately into
Hadoop.
➢ Along with the log files, Flume is also used to import huge volumes of
event data produced by social networking sites like Facebook and Twitter,
and e-commerce websites like Amazon and Flipkart.
➢ Flume supports a large set of sources and destinations types.
➢ Flume supports multi-hop flows, fan-in fan-out flows, contextual routing,
etc.
➢ Flume can be scaled horizontally.
Flume Architecture
The Flume agent has 3 components:
~ Source: It accepts the data from the incoming streamline and stores the data in the channel.
~ Channel: In general, the reading speed is faster than the writing speed, so we need some buffer to match the read and write speed difference. This buffer acts as an intermediary storage that temporarily holds the data being transferred and therefore prevents data loss. The channel is this local or temporary storage between the source of the data and the persistent data in HDFS.
~ Sink: The last component, the Sink, collects the data from the channel and commits or writes the data to HDFS permanently.
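As an illustration, a single agent wiring these three components together might be configured like this (a sketch; the agent, source, channel, and sink names, the tailed log file, and the HDFS path are all assumptions):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# source: tail a web server log
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

# channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# sink: write events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.channel = ch1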
REFERENCES

 https://hive.apache.org/
 https://cwiki.apache.org/confluence/display/Hive/Presentations
 https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
 http://www.qubole.com/blog/big-data/hive-best-practices/
 Hortonworks tutorials (youtube)
 Graph: https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf
Thanks
