
PIG

Instructor: Oussama Derbel


Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.

– Applications built with Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available. They are mainly useful for achieving greater computational power at low cost.


Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part

– MapReduce: the processing part


Introduction

■ Apache Hadoop consists of two sub-projects

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous amounts of data in parallel on large clusters of compute nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
What is Pig?

■ Pig was created to remove the burden of writing complex Java code to perform MapReduce jobs.

■ Previously, Hadoop developers had to write complex Java code in order to perform data analysis.

■ Apache Pig provides a high-level language known as Pig Latin, which helps Hadoop developers write data analysis programs.

■ Using the various operators provided by the Pig Latin language, programmers can develop their own functions for reading, writing, and processing data.
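
For a first taste, here is a minimal Pig Latin sketch, assuming a hypothetical tab-delimited file /data/employees in HDFS with the fields shown:

-- load the file and declare a schema (file and fields are placeholders)
emps = LOAD '/data/employees' AS (name:chararray, dept:chararray, salary:float);
-- keep only the rows of interest
high = FILTER emps BY salary > 100000.0;
-- write the result back to HDFS
STORE high INTO '/data/high_earners';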


What is Pig?

■ In order to perform analysis using Apache Pig, programmers write scripts in the Pig Latin language to process data stored in HDFS. Internally, all these scripts are converted into Map and Reduce tasks.

■ A component known as the Pig Engine inside Apache Pig takes Pig Latin scripts as input and converts them into MapReduce jobs.


What is Pig?

■ Apache Pig reduces the amount of code through its multi-query approach.

■ For example, an operation that takes roughly 200 lines of Java code can often be expressed in fewer than 10 lines of Pig Latin (see the sketch below).

■ Hence, development time is reduced by roughly a factor of 16 when using Apache Pig.

■ Developers who already know SQL find Pig Latin very easy to learn, since it is similar to SQL.

■ Apache Pig provides many built-in operators to support data operations such as filters, joins, and ordering.

■ In addition, Pig provides nested data types such as tuples, bags, and maps, which are not present in MapReduce.
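
As a hedged illustration of that claim, the classic word-count job, which needs a full Java MapReduce program, fits in a handful of Pig Latin statements (the input and output paths are placeholders):

-- count how often each word occurs in a text file
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';
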
Pig Architecture

1. Parser: handles all the Pig Latin statements and commands. It performs several checks on the statements, such as syntax and type checks, and generates a DAG (directed acyclic graph) representing the script.
2. Optimizer: once parsing is complete and the DAG has been generated, the DAG is passed to the optimizer, which performs logical optimizations such as split, projection pushdown, and reorder.
3. Compiler: the compiler compiles the optimized output into a series of MapReduce jobs.
4. Execution Engine: finally, the MapReduce jobs are submitted to the execution engine and executed on the Hadoop platform to produce the desired results.
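
The EXPLAIN operator makes this pipeline visible: it prints the logical, physical, and MapReduce plans produced for a relation. A minimal sketch, reusing the NYSE_dividends schema from the LOAD slide later in this deck (the relation names are placeholders):

divs    = LOAD '/data/NYSE_dividends' AS (name:chararray, stock_symbol:chararray, date:datetime, dividend:float);
grouped = GROUP divs BY stock_symbol;
avgdiv  = FOREACH grouped GENERATE group, AVG(divs.dividend);
EXPLAIN avgdiv;   -- shows the parsed, optimized, and compiled plans
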
Pig Features

■ Rich set of operators: Pig provides a rich collection of operators for performing operations such as join, filter, sort, and many more (see the sketch below).

■ Ease of programming: Pig Latin is similar to SQL, so it is very easy for developers to write a Pig script; anyone who knows SQL can pick up Pig Latin quickly.

■ Optimization opportunities: task execution in Apache Pig is optimized automatically, so programmers only need to focus on the semantics of the language.
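
A small hedged sketch of those operators in use, assuming two hypothetical HDFS files with the schemas shown:

users  = LOAD '/data/users'  AS (id:int, name:chararray, age:int);
orders = LOAD '/data/orders' AS (user_id:int, amount:float);
adults = FILTER users BY age >= 18;                -- filter
joined = JOIN adults BY id, orders BY user_id;     -- join
sorted = ORDER joined BY amount DESC;              -- sort
DUMP sorted;
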
Pig Features

■ Extensibility: using the existing operators, users can easily develop their own functions to read, process, and write data.

■ User Defined Functions (UDFs): Pig makes it easy to create User Defined Functions in several programming languages such as Java, and to invoke or embed them in Pig scripts (see the sketch after this list).

■ Handles all kinds of data: Apache Pig can analyse all types of data, both structured and unstructured, and stores the results in HDFS.
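
A hedged sketch of how a Java UDF might be wired into a script; the jar name and the class myudfs.Upper are hypothetical placeholders, not a real library:

REGISTER myudfs.jar;                 -- hypothetical jar containing the UDF
DEFINE MYUPPER myudfs.Upper();       -- hypothetical Java class extending EvalFunc<String>
names = LOAD '/data/employees' AS (name:chararray);
upper = FOREACH names GENERATE MYUPPER(name);
DUMP upper;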


Pig Recap

An engine for executing data flows in parallel on Hadoop

● Language - Pig Latin

● Pig Engine

● Can be used with or without Hadoop


Pig - Modes

■ MapReduce mode

• Access HDFS and Hadoop cluster

• Used in production

■ Local mode

• Access local files and local machine

• Used for testing locally

• Speeds up development


Pig - MapReduce Mode

• Login to Linux console (on Cloudera Virtual Machine).

• Type pig

• Invoke commands in grunt shell to access files in HDFS

• Control Hadoop from grunt shell
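
A hedged example of a short grunt session in MapReduce mode (the HDFS path is a placeholder):

grunt> fs -ls /data
grunt> divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');   -- reads from HDFS
grunt> DUMP divs;                                                  -- runs as MapReduce jobs on the cluster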


Pig - Local Mode

• Login to Linux console (on Cloudera Virtual Machine).

• Type pig -x local

• Invoke commands in grunt shell to access files in local file system
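
A hedged example in local mode; the file name is a placeholder for any file on the local machine:

grunt> divs = LOAD 'NYSE_dividends.csv' USING PigStorage(',');   -- local file, no Hadoop cluster needed
grunt> DUMP divs;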


Pig - Data Types

1. int - Signed 32-bit integer - Example - 8

2. long - Signed 64-bit integer - Example - 5L

3. float - 32-bit floating point - Example - 5.5F

4. double - 64-bit floating point - Example - 10.5

5. chararray - character array - Example - ‘Cloud’

6. bytearray - blob - Example - Any binary data

7. datetime - date and time - Example - 1970-01-01T00:00:00.000+00:00
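
A hedged sketch declaring these simple types in a LOAD schema; the file and field names are hypothetical:

trades = LOAD '/data/trades' AS (id:int, volume:long, price:float, total:double,
                                 symbol:chararray, raw:bytearray, ts:datetime);
DESCRIBE trades;   -- prints the schema with each declared type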


Pig - Complex Data Types

■ http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+and+More
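
As a brief hedged illustration of the complex types described at that link (tuple, bag, and map), with a hypothetical input file:

-- tuple = ordered set of fields, bag = collection of tuples, map = key/value pairs
data = LOAD '/data/complex'
       AS (t:tuple(x:int, y:int),
           b:bag{r:tuple(item:chararray)},
           m:map[chararray]);
DESCRIBE data;
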
Pig - Relational Operators - LOAD

• divs = LOAD '/data/NYSE_dividends';

• divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');

• divs = LOAD '/data/NYSE_dividends' AS (name:chararray, stock_symbol:chararray, date:datetime, dividend:float);
Pig - STORE/DUMP

■ STORE
• Writes a relation to HDFS or another storage system

■ DUMP
• Prints a relation to the screen, much like a print statement
• Useful for debugging
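
A hedged sketch of both operators; the paths are placeholders:

divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');
DUMP divs;                                                     -- print to the screen for debugging
STORE divs INTO '/data/dividends_out' USING PigStorage(',');   -- write back to HDFS
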
Pig - More Operators

http://pig.apache.org/docs/r0.15.0/basic.html
Thank you
