
PIG

Instructor: Oussama Derbel


Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.

– Applications built with Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available. They are mainly useful for achieving greater computational power at low cost.


Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part

– MapReduce: the processing part


Introduction

■ Apache Hadoop consists of two sub-projects

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous amounts of data in parallel on large clusters of compute nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
What is Pig?

■ Pig was created to remove the burden of writing complex Java code to perform MapReduce jobs.

■ Previously, Hadoop developers had to write complex Java code in order to perform data analysis.

■ Apache Pig provides a high-level language known as Pig Latin, which helps Hadoop developers write data analysis programs.

■ Using the various operators provided by the Pig Latin language, programmers can develop their own functions for reading, writing, and processing data.
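
For a first taste, here is a minimal Pig Latin sketch, assuming a hypothetical tab-delimited file /data/employees in HDFS with the fields shown:

-- load the file and declare a schema (file and fields are placeholders)
emps = LOAD '/data/employees' AS (name:chararray, dept:chararray, salary:float);
-- keep only the rows of interest
high = FILTER emps BY salary > 100000.0;
-- write the result back to HDFS
STORE high INTO '/data/high_earners';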


What is Pig?

■ In order to perform analysis using Apache Pig, programmers write scripts in the Pig Latin language to process data stored in HDFS. Internally, all these scripts are converted into Map and Reduce tasks.

■ A component known as the Pig Engine inside Apache Pig takes Pig Latin scripts as input and converts them into MapReduce jobs.


What is Pig?

■ Apache Pig reduces the amount of code through its multi-query approach.

■ For example, an operation that takes roughly 200 lines of Java code can often be expressed in fewer than 10 lines of Pig Latin (see the sketch below).

■ Hence, development time is reduced by roughly a factor of 16 when using Apache Pig.

■ Developers who already know SQL find Pig Latin very easy to learn, since it is similar to SQL.

■ Apache Pig provides many built-in operators to support data operations such as filters, joins, and ordering.

■ In addition, Pig provides nested data types such as tuples, bags, and maps, which are not present in MapReduce.
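
As a hedged illustration of that claim, the classic word-count job, which needs a full Java MapReduce program, fits in a handful of Pig Latin statements (the input and output paths are placeholders):

-- count how often each word occurs in a text file
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';
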
Pig Architecture

1. Parser: handles all the Pig Latin statements and commands. It performs several checks on the statements, such as syntax and type checks, and generates a DAG (directed acyclic graph) representing the script.
2. Optimizer: once parsing is complete and the DAG has been generated, the DAG is passed to the optimizer, which performs logical optimizations such as split, projection pushdown, and reorder.
3. Compiler: the compiler compiles the optimized output into a series of MapReduce jobs.
4. Execution Engine: finally, the MapReduce jobs are submitted to the execution engine and executed on the Hadoop platform to produce the desired results.
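
The EXPLAIN operator makes this pipeline visible: it prints the logical, physical, and MapReduce plans produced for a relation. A minimal sketch, reusing the NYSE_dividends schema from the LOAD slide later in this deck (the relation names are placeholders):

divs    = LOAD '/data/NYSE_dividends' AS (name:chararray, stock_symbol:chararray, date:datetime, dividend:float);
grouped = GROUP divs BY stock_symbol;
avgdiv  = FOREACH grouped GENERATE group, AVG(divs.dividend);
EXPLAIN avgdiv;   -- shows the parsed, optimized, and compiled plans
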
Pig Features

■ Rich set of operators: Pig provides a rich collection of operators for performing operations such as join, filter, sort, and many more (see the sketch below).

■ Ease of programming: Pig Latin is similar to SQL, so it is very easy for developers to write a Pig script; anyone who knows SQL can pick up Pig Latin quickly.

■ Optimization opportunities: task execution in Apache Pig is optimized automatically, so programmers only need to focus on the semantics of the language.
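
A small hedged sketch of those operators in use, assuming two hypothetical HDFS files with the schemas shown:

users  = LOAD '/data/users'  AS (id:int, name:chararray, age:int);
orders = LOAD '/data/orders' AS (user_id:int, amount:float);
adults = FILTER users BY age >= 18;                -- filter
joined = JOIN adults BY id, orders BY user_id;     -- join
sorted = ORDER joined BY amount DESC;              -- sort
DUMP sorted;
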
Pig Features

■ Extensibility: using the existing operators, users can easily develop their own functions to read, process, and write data.

■ User Defined Functions (UDFs): Pig makes it easy to create User Defined Functions in several programming languages such as Java, and to invoke or embed them in Pig scripts (see the sketch after this list).

■ Handles all kinds of data: Apache Pig can analyse all types of data, both structured and unstructured, and stores the results in HDFS.
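
A hedged sketch of how a Java UDF might be wired into a script; the jar name and the class myudfs.Upper are hypothetical placeholders, not a real library:

REGISTER myudfs.jar;                 -- hypothetical jar containing the UDF
DEFINE MYUPPER myudfs.Upper();       -- hypothetical Java class extending EvalFunc<String>
names = LOAD '/data/employees' AS (name:chararray);
upper = FOREACH names GENERATE MYUPPER(name);
DUMP upper;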


Pig Recap

An engine for executing data flows in parallel on Hadoop

● Language - Pig Latin

● Pig Engine

● Can be used with or without Hadoop


Pig - Modes

■ MapReduce mode

• Access HDFS and Hadoop cluster

• Used in production

■ Local mode

• Access local files and local machine

• Used for testing locally

• Speeds up development


Pig - MapReduce Mode

• Login to Linux console (on Cloudera Virtual Machine).

• Type pig

• Invoke commands in grunt shell to access files in HDFS

• Control Hadoop from grunt shell
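
A hedged example of a short grunt session in MapReduce mode (the HDFS path is a placeholder):

grunt> fs -ls /data
grunt> divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');   -- reads from HDFS
grunt> DUMP divs;                                                  -- runs as MapReduce jobs on the cluster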


Pig - Local Mode

• Login to Linux console (on Cloudera Virtual Machine).

• Type pig -x local

• Invoke commands in grunt shell to access files in local file system
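
A hedged example in local mode; the file name is a placeholder for any file on the local machine:

grunt> divs = LOAD 'NYSE_dividends.csv' USING PigStorage(',');   -- local file, no Hadoop cluster needed
grunt> DUMP divs;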


Pig - Data Types

1. int - Signed 32-bit integer - Example - 8

2. long - Signed 64-bit integer - Example - 5L

3. float - 32-bit floating point - Example - 5.5F

4. double - 64-bit floating point - Example - 10.5

5. chararray - character array - Example - ‘Cloud’

6. bytearray - blob - Example - Any binary data

7. datetime - date and time - Example - 1970-01-01T00:00:00.000+00:00
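
A hedged sketch declaring these simple types in a LOAD schema; the file and field names are hypothetical:

trades = LOAD '/data/trades' AS (id:int, volume:long, price:float, total:double,
                                 symbol:chararray, raw:bytearray, ts:datetime);
DESCRIBE trades;   -- prints the schema with each declared type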


Pig - Complex Data Types

■ http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+and+More
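
As a brief hedged illustration of the complex types described at that link (tuple, bag, and map), with a hypothetical input file:

-- tuple = ordered set of fields, bag = collection of tuples, map = key/value pairs
data = LOAD '/data/complex'
       AS (t:tuple(x:int, y:int),
           b:bag{r:tuple(item:chararray)},
           m:map[chararray]);
DESCRIBE data;
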
Pig - Relational Operators - LOAD

• divs = LOAD '/data/NYSE_dividends';

• divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');

• divs = LOAD '/data/NYSE_dividends' AS (name:chararray, stock_symbol:chararray, date:datetime, dividend:float);
Pig - STORE/DUMP

■ STORE
• Writes a relation to HDFS or another storage system

■ DUMP
• Prints a relation to the screen, much like a print statement
• Useful for debugging
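
A hedged sketch of both operators; the paths are placeholders:

divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');
DUMP divs;                                                     -- print to the screen for debugging
STORE divs INTO '/data/dividends_out' USING PigStorage(',');   -- write back to HDFS
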
Pig - More Operators

http://pig.apache.org/docs/r0.15.0/basic.html
Thank you
