BDA - Unit-4 Part 1


Unit-4

Big Data Analytics


(20CSE361)
What is Pig?
• Apache Pig is a tool/platform used to
analyze large data sets, representing
them as data flows.

• Pig is generally used with Hadoop; we can
perform all the data manipulation
operations in Hadoop using Apache Pig.

• To write data analysis programs, Pig
provides a high-level language known
as Pig Latin.
Cont…
• This language provides various operators
using which programmers can develop their
own functions for reading, writing, and
processing data.
• To analyze data using Apache Pig,
programmers need to write scripts using Pig
Latin language.

• All these scripts are internally converted to
Map and Reduce tasks.
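As a sketch of what such a script looks like, the classic word count can be written in a few lines of Pig Latin (the file name input.txt is a hypothetical example):

```pig
-- Minimal word-count sketch; Pig turns these statements into MapReduce jobs.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
```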
Why do we need Apache Pig?
• Using Pig Latin, programmers can perform
MapReduce tasks easily without having to
type complex codes in Java.
• Apache Pig uses a multi-query approach, thereby
reducing the length of code.
• For example,
an operation that would require you to type
200 lines of code (LoC) in Java can be done
with as few as 10 LoC in Apache
Pig.
Why do we need Apache Pig?
• Pig Latin is an SQL-like language.
• Apache Pig is easy to learn when you are
already familiar with SQL.
• Apache Pig provides many built-in operators
like joins, filters, ordering, etc.

• In addition, it also provides nested data
types like tuples, bags, and maps that are
missing from MapReduce.
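These nested types can be declared directly in a LOAD schema. A hedged sketch, with hypothetical file and field names:

```pig
-- Hypothetical schema showing Pig's nested data types in one relation.
students = LOAD 'students.txt'
           AS (name:chararray,                                    -- atomic field
               scores:bag{t:tuple(subject:chararray, mark:int)},  -- bag of tuples
               details:map[]);                                    -- map (untyped values)
```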
Apache Pig – Architecture
Apache Pig – Architecture
• The language used to analyze data in Hadoop
using Pig is known as Pig Latin.
• It is a high-level data processing language
which provides a rich set of data types and
operators to perform various operations on the
data.
• To perform a particular task, programmers
using Pig need to write a Pig
script in the Pig Latin language, and execute
it using any of the execution mechanisms
(Grunt shell, UDFs, Embedded).
Cont...
• After execution, these scripts will go through
a series of transformations applied by the Pig
Framework, to produce the desired output.

• Internally, Apache Pig converts these scripts
into a series of MapReduce jobs, and thus it
makes the programmer’s job easy.
Apache Pig Components
1. Parser
• The parser checks the syntax of the script, performs type
checking, and does other miscellaneous checks.
• The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and
logical operators.
• In the DAG, the logical operators of the script are
represented as the nodes and the data flows are
represented as edges.
2. Optimizer
• The DAG is passed to the logical optimizer, which
carries out logical optimizations such as projection
pushdown.
Apache Pig Components
3. Compiler
• The compiler compiles the optimized logical plan into
a series of MapReduce jobs.
4. Execution engine
• Finally, the MapReduce jobs are submitted to
Hadoop and executed on Hadoop producing the
desired results.
5. Pig Latin Data Model
• The data model of Pig Latin is fully nested, and it
allows complex non-atomic data types such as map
and tuple.
How to download and install Apache Pig?
Method 1:
Download Apache Pig
We need to download the latest version of
Apache Pig from the following website
− https://pig.apache.org/
Method-2
Cont…
Method-3
Cont…
Method-4
Comparisons with Database
Pig vs. MapReduce
Pig vs. SQL
Pig Latin
What is Pig Latin?
• The basics of Pig Latin include Pig Latin statements,
data types, general and relational operators, and Pig
Latin UDFs.
• Pig Latin is the language which is used to analyze
data in Hadoop by using Apache Pig.
• Pig Latin is a dataflow language where each
processing step results in a new data set, or
relation.
Pig Latin
Pig Latin - Data Model
• The data model of Pig Latin is fully nested. A
relation is the outermost structure of the Pig Latin
data model. It is a bag, where −
– A bag is a collection of tuples.
– A tuple is an ordered set of fields.
– A field is a piece of data.
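To illustrate these levels, here are hedged example values written in Pig Latin's literal notation (names and ages are made up):

```pig
-- Illustrative literal values for each level of the data model:
--   field : 'John'
--   tuple : ('John', 21)
--   bag   : {('John', 21), ('Mary', 22)}
--   map   : ['name'#'John', 'age'#21]
```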
Apache Pig - User Defined Functions
Data processing Operators in Pig
• Pig Latin is a high-level procedural
language for querying large data sets using Hadoop
and the MapReduce platform.
• A Pig Latin statement is an operator that takes a
relation as input and produces another relation as
output.
• These operators are the main tools Pig Latin
provides to operate on the data.
• They allow you to transform data by sorting, grouping,
joining, projecting, and filtering.
Data processing Operators in Pig
The Apache Pig operators can be classified as,
i. Relational Operators :
• Relational operators are the main tools Pig Latin
provides to operate on the data.
Some of the Relational Operators are :
1. LOAD:
• The LOAD operator is used to load data from the
file system or HDFS storage into a Pig relation.
2. FOREACH:
• This operator generates data transformations based
on columns of data. It is used to add or remove fields
from a relation.
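A short sketch of LOAD and FOREACH together (the file name, delimiter, and fields are hypothetical):

```pig
-- LOAD reads a tab-separated file into a relation with a declared schema.
emp = LOAD 'employees.txt' USING PigStorage('\t')
      AS (id:int, name:chararray, dept:chararray, salary:double);

-- FOREACH projects columns and can transform them.
raised = FOREACH emp GENERATE name, salary * 1.1 AS raised_salary;
```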
Data processing Operators in Pig
3. FILTER: This operator selects tuples from a relation
based on a condition.
4. JOIN: The JOIN operator is used to perform an inner,
equijoin of two or more relations based on
common field values.
5. ORDER BY: Order By is used to sort a relation based
on one or more fields in either ascending or descending
order using ASC and DESC keywords.
6. GROUP: The GROUP operator groups together the
tuples with the same group key (key field).
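The four operators above can be sketched in one short script; the relations, files, and field names here are hypothetical:

```pig
emp  = LOAD 'emp.txt'  AS (id:int, name:chararray, dept_id:int, salary:double);
dept = LOAD 'dept.txt' AS (dept_id:int, dept_name:chararray);

high   = FILTER emp BY salary > 50000.0;          -- FILTER: select tuples by condition
joined = JOIN high BY dept_id, dept BY dept_id;   -- JOIN: inner equijoin on dept_id
sorted = ORDER joined BY salary DESC;             -- ORDER BY: sort on a field
bydept = GROUP emp BY dept_id;                    -- GROUP: group tuples by key field
```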
Data processing Operators in Pig
7. COGROUP: COGROUP works the same as the GROUP
operator.
For readability, programmers use GROUP when only one
relation is involved and COGROUP when multiple relations
are involved.
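A minimal COGROUP sketch over two hypothetical relations:

```pig
students = LOAD 'students.txt' AS (name:chararray, city:chararray);
workers  = LOAD 'workers.txt'  AS (name:chararray, city:chararray);

-- Each output tuple holds the key plus one bag per input relation:
-- (city, {students in that city}, {workers in that city})
bycity = COGROUP students BY city, workers BY city;
```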
II. Diagnostic Operator
• The load statement will simply load the data into the
specified relation in Apache Pig.
• To verify the execution of the Load statement, you
have to use the Diagnostic Operators.
Some Diagnostic Operators are :
Data processing Operators in Pig

DUMP:
• The DUMP operator is used to run Pig Latin
statements and display the results on the screen.

DESCRIBE:
Use the DESCRIBE operator to review the schema of a
particular relation.
• The DESCRIBE operator is best used for debugging a
script.
Data processing Operators in Pig
ILLUSTRATE:
• ILLUSTRATE operator is used to review how data is
transformed through a sequence of Pig Latin
statements.
• ILLUSTRATE command is your best friend when it
comes to debugging a script.
EXPLAIN:
The EXPLAIN operator is used to display the logical,
physical, and MapReduce execution plans of a relation.
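The four diagnostic operators can each be applied to the same relation; a sketch with a hypothetical file:

```pig
emp = LOAD 'emp.txt' AS (id:int, name:chararray, salary:double);

DESCRIBE emp;      -- prints the schema of the relation
DUMP emp;          -- runs the statements and prints the tuples on screen
ILLUSTRATE emp;    -- shows sample data as it flows through each step
EXPLAIN emp;       -- shows the logical, physical, and MapReduce plans
```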
Types of UDFs in Java
• When writing UDFs in Java, you can create and
use the following three types of functions −

• Filter Functions −
The filter functions are used as conditions in
filter statements. These functions accept a Pig
value as input and return a Boolean value.
Types of UDFs in Java
• Eval Functions −
The Eval functions are used in FOREACH…GENERATE
statements. These functions accept a
Pig value as input and return a Pig result.

• Algebraic Functions −
The Algebraic functions act on inner bags in a
FOREACH…GENERATE statement. These functions
are used to perform full MapReduce operations
on an inner bag.
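Once a Java UDF is compiled into a jar, it is registered and invoked from Pig Latin. A hedged sketch, where the jar name myudfs.jar and the class com.example.IsAdult are hypothetical:

```pig
-- REGISTER makes the jar's classes available; DEFINE gives the UDF a short alias.
REGISTER myudfs.jar;
DEFINE IsAdult com.example.IsAdult();

people = LOAD 'people.txt' AS (name:chararray, age:int);
adults = FILTER people BY IsAdult(age);   -- a filter UDF returns a Boolean
```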
Supporting languages
• The UDF support is provided in six programming
languages, namely, Java, Jython, Python, JavaScript,
Ruby and Groovy.
• For writing UDF’s, complete support is provided in
Java and limited support is provided in all the
remaining languages.
• Using Java, we can write UDF’s involving all parts of
the processing like data load/store, column
transformation, and aggregation.
Supporting languages

• Since Apache Pig is written in Java, UDFs
written in Java work more efficiently
than those written in other languages.
• In Apache Pig, you also have a Java repository for
UDFs named Piggybank.
• Using Piggybank, you can access Java UDFs written
by other users, and contribute your own UDFs.
Features of Pig
• Rich set of operators: It provides many
operators to perform operations like join,
sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL
and it is easy to write a Pig script if you are good
at SQL.
• Optimization opportunities: The tasks in Apache
Pig optimize their execution automatically, so
the programmers need to focus only on
semantics of the language.
Features of Pig
• Extensibility: Using the existing operators,
users can develop their own functions to
read, process, and write data.

• UDFs: Pig provides the facility to create User-defined
Functions in other programming
languages such as Java and invoke or embed
them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all
kinds of data, both structured and
unstructured. It stores the results in HDFS.
Apache Pig - History
• In 2006, Apache Pig was originally developed as a
research project at Yahoo to give researchers an ad-hoc
way to create and execute MapReduce jobs on very large
datasets.
• In 2007, Apache Pig was open-sourced through the
Apache Incubator and moved into the
Apache Software Foundation.
• In 2008, the first version of Apache Pig, version 0.1.1,
was released.
• In 2010, Apache Pig adoption continued to grow, and
Apache Pig graduated from being a Hadoop subproject to
become its own top-level Apache project.
• In 2017, the latest version of Apache Pig, version 0.17.0,
was released.
Applications of Apache Pig
• Apache Pig is used by data scientists to perform
tasks involving ad-hoc processing and quick
prototyping.
• Apache Pig is used for web log processing, i.e. processing
huge data sources and time-sensitive data loads.
• It is also used to perform data processing for
search platforms.
• Pig Latin allows splits in the pipeline.
• It also allows developers or programmers to store
data at any point in the pipeline.
• It declares execution plans and provides
operators to perform ETL (Extract, Transform
and Load) functions.
Cont..
• Advantages
• It decreases development time compared to
development time in Java.
• It is procedural, which makes the commands easier to
follow, and it provides expressions for transforming data.
• We can control the execution, and if we want to write
our own UDFs (User Defined Functions), they can be
written and invoked during execution.
• It automatically optimizes the program from
beginning to end, producing an efficient
execution plan.
• It is quite effective for both unstructured and structured
datasets, and it is one of the best tools for turning
large unstructured data into structured data.
Cont…
• Disadvantages
• If a problem occurs while executing Apache Pig, it
reports a generic "exec error" in the UDF even when the
underlying problem is a syntax or type error.
• Support resources such as Stack Overflow and Google
searches generally do not lead to good solutions for
problems encountered in Apache Pig.
• Debugging Pig scripts is hard: 90 percent of the time
the problem lies in the schema, and schema errors can
propagate to the other steps of the data processing.
• The commands are not executed in Apache Pig
until you DUMP or STORE an intermediate or final
result.
Apache Pig Execution Mechanisms
• Apache Pig scripts can be executed in
three modes, and they are:
– Interactive mode (Grunt shell)
– Batch mode (script file)
– Embedded mode (UDFs embedded in a host language)
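As a sketch, the first two modes look like this (the file names are hypothetical; local mode is used here, but the same applies to MapReduce mode):

```pig
-- Interactive mode: start the Grunt shell (e.g. `pig -x local`)
-- and type Pig Latin statements one at a time:
grunt> emp = LOAD 'emp.txt' AS (id:int, name:chararray);
grunt> DUMP emp;

-- Batch mode: put the same statements in a file, say script.pig,
-- and run it in one go with `pig -x local script.pig`.
```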
