Professional Documents
Culture Documents
Chapter 6 Pig - Matrix
Chapter 6 Pig - Matrix
■ What is Hadoop?
– Apache Hadoop is an open source software framework used to develop data processing applications
– Applications built using HADOOP are run on large data sets distributed across clusters of
commodity computers.
■ Commodity computers are cheap and widely available. These are mainly useful for achieving greater
■ Core of Hadoop
HDFS
Storage Part
( Hadoop Distributed File System)
computation nodes.
■ Pig was created to simplify the burden of writing complex Java codes to perform MapReduce jobs.
■ Earlier Hadoop developers must write complex java codes in order to perform data analysis.
■ Apache Pig provides a high-level language known as Pig Latin which helps Hadoop developers to write data
analysis programs.
■ By using various operators provided by Pig Latin language programmers can develop their own functions for reading,
■ In order to perform analysis using Apache Pig, programmers must write scripts using Pig Latin language to process data
stored in HDFS Internally, all these scripts are converted to Map and Reduce tasks.
■ A component known as Pig Engine is present inside Apache Pig in which Pig Latin scripts are taken as input and these
■ For example, to perform an operation we need to write 200 lines of code in Java that we can easily perform just by typing
less than 10 lines of code in Apache Pig.
■ Hence, ultimately our almost 16 times development time gets reduced using Apache Pig.
■ If developers have knowledge of SQL language, then it is very easy to learn Pig Latin language as it is like SQL language.
■ Many built-in operators are provided by Apache Pig to support data operations like filters, joins, ordering, etc.
■ In addition, nested data types like tuples, bags, and maps which are not present in MapReduce are also provided by Pig.
Pig Architecture
■ Rich Set of Operators: Pig consists of a collection of rich set of operators in order to perform operations such as join,
■ Ease of Programming: Pig Latin is similar to SQL and hence it becomes very easy for developers to write a Pig
script. If you have knowledge of SQL language, then it is very easy to learn Pig Latin language as it is similar to SQL
language.
■ Optimization opportunities: The execution of the task in Apache Pig gets automatically optimized by the task itself,
hence the programmers need to only focus on the semantics of the language.
Pig Features
■ Extensibility: By using the existing operators, users can easily develop their own functions to read, process, and write
data.
■ User Define Functions (UDF’s): With the help of facility provided by Pig of creating UDF’s, we can easily create
User Defined Functions on several programming languages such as Java and invoke or embed them in Pig Scripts.
■ All types of data handling: Analysis of all types of Data (i.e. both structured as well as unstructured) is provided by
● Pig Engine
■MapReduce mode
• Used in production
■ Local mode
• Type pig
■http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+ and+More
Pig - Relational Operators- LOAD
• divs = L O A D '/data/NYSE_dividends';
■STORE
• Stores the data to HDFS and other storages
■DUMP
• Prints the value on the screen - print()
• Useful for debugging
Pig - More Operator
http://pig.apache.org/docs/r0.15.0/basic.html
References
https://data-flair.training/
Thank you