CHAPTER 10

Introduction to Pig

BRIEF CONTENTS
• What's in Store?
• What is Pig?
• Key Features of Pig
• The Anatomy of Pig
• Pig on Hadoop
• Pig Philosophy
• Use Case for Pig: ETL Processing
• Pig Latin Overview
• Pig Latin Statements
• Pig Latin: Keywords
• Pig Latin: Identifiers
• Pig Latin: Comments
• Pig Latin: Case Sensitivity
• Operators in Pig Latin
• Data Types in Pig
• Simple Data Types
• Complex Data Types
• Running Pig
• Interactive Mode
• Batch Mode
• Execution Modes of Pig
• Local Mode
• MapReduce Mode
• HDFS Commands
• Relational Operators
• EVAL Function
• Complex Data Types
• Tuple
• Map
• Piggy Bank
• User-Defined Functions (UDF)
• Parameter Substitution
• Diagnostic Operator
• Word Count Example using Pig
• When to use Pig?
• When NOT to use Pig?
• Pig at Yahoo!
• Pig versus Hive

"If you can't explain it simply, you don't understand it well enough."
- Albert Einstein, Physicist
Big Data and Analytics

WHAT'S IN STORE?

We assume that by now you would have become familiar with the basic concepts of HDFS and MapReduce Programming. The focus of this chapter will be to build on this knowledge to perform analysis using Pig. We will discuss a few relational and eval operators of Pig. We will also discuss Complex Data Types, Piggy Bank, and UDF (User Defined Functions) of Pig.
We suggest you refer to some of the learning resources provided at the end of this chapter for better learning. We will also suggest you to practice "Test Me" exercises.

10.1 WHAT IS PIG?


Apache Pig is a platform for data analysis. It is an alternative to MapReduce Programming. Pig was developed as a research project at Yahoo.

10.1.1 Key Features of Pig


1. It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
2. It provides a language called "Pig Latin" to express data flows.
3. Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
4. It allows users to develop their own functions (User Defined Functions) for reading, processing, and writing data.

10.2 THE ANATOMY OF PIG

The main components of Pig are as follows:

1. Data flow language (Pig Latin).
2. Interactive shell where you can type Pig Latin statements (Grunt).
3. Pig interpreter and execution engine.

Refer Figure 10.1.

Figure 10.1 The anatomy of Pig (Pig Latin Script, Pig Interpreter/Execution Engine, MapReduce Jobs).

10.3 PIG ON HADOOP

Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and MapReduce Programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data (data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input from and place output to other sources.
Pig supports the following:

1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.

10.4 PIG PHILOSOPHY

Figure 10.2 describes the Pig philosophy.

1. Pigs Eat Anything: Pig can process different kinds of data such as structured and unstructured data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the same can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.

Figure 10.2 Pig philosophy.

10.5 USE CASE FOR PIG: ETL PROCESSING


Pig is widely used for "ETL" (Extract, Transform, and Load). Pig can extract data from different sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various operators to perform transformation on the data and subsequently loads it into the data warehouse. Refer Figure 10.3.

Figure 10.3 Pig: ETL Processing (Pig jobs running on cluster, with steps such as removal of duplicates and validation).

10.6 PIG LATIN OVERVIEW

10.6.1 Pig Latin Statements

1. Pig Latin statements are basic constructs to process data using Pig.
2. Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.

Pig Latin Statements are generally ordered as follows:
1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store result.

The following is a simple Pig Latin script to load, filter, and store "student" data.

A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';

Note: In the above example A is a relation and NOT a variable.

10.6.2 Pig Latin: Keywords


Keywords are reserved. They cannot be used to name things.

10.6.3 Pig Latin: Identifiers


1. Identifiers are names assigned to fields or other data structures.
2. An identifier should begin with a letter and should be followed only by letters, numbers, and underscores.
Table 10.1 describes valid and invalid identifiers.

Table 10.1 Valid and invalid identifiers
Valid Identifier      Y    A1    A1_2014
Invalid Identifier    5    Sales%    _Sales
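
The identifier rule above can be checked mechanically. The following is a minimal Python sketch of that rule only (a letter first, then letters, digits, and underscores); it is not Pig's own parser, and the function name `is_valid_identifier` is a hypothetical helper introduced for illustration.

```python
import re

# Rule as stated in the text: begin with a letter, followed only by
# letters, numbers, and underscores. This models the rule, not Pig's parser.
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(name: str) -> bool:
    """Return True if `name` satisfies the identifier rule stated above."""
    return IDENTIFIER_RE.fullmatch(name) is not None


for candidate in ["Y", "A1", "A1_2014", "5", "Sales%", "_Sales"]:
    print(candidate, is_valid_identifier(candidate))
```

A name such as `_Sales` fails because its first character is an underscore, and `Sales%` fails because `%` is not an allowed character.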

10.6.4 Pig Latin: Comments

In Pig Latin two types of comments are supported:
1. Single line comments that begin with "--".
2. Multiline comments that begin with "/*" and end with "*/".

10.6.5 Pig Latin: Case Sensitivity

1. Keywords are not case sensitive such as LOAD, STORE, GROUP, FOREACH, DUMP, etc.
2. Relations and paths are case-sensitive.
3. Function names are case sensitive such as PigStorage, COUNT.

10.6.6 Operators in Pig Latin
Table 10.2 describes operators in Pig Latin.
Table 10.2 Operators in Pig Latin
Arithmetic    Comparison    Null           Boolean
+             ==            IS NULL        AND
-             !=            IS NOT NULL    OR
*             <                            NOT
/             >
%             <=
              >=
10.7 DATA TYPES IN PIG

10.7.1 Simple Data Types

Table 10.3 describes simple data types supported in Pig. In Pig, fields of unspecified types are considered as an array of bytes which is known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or is non-existent.

Table 10.3 Simple data types supported in Pig
Name         Description
int          Whole numbers
long         Large whole numbers
float        Decimals
double       Very precise decimals
chararray    Text strings
bytearray    Raw bytes
datetime     Datetime
boolean      true or false

10.7.2 Complex Data Types

Table 10.4 describes complex data types in Pig.

Table 10.4 Complex data types in Pig
Name     Description
Tuple    An ordered set of fields. Example: (2,3)
Bag      A collection of tuples. Example: {(2,3),(7,5)}
Map      key, value pair. Example: [open#apache]

10.8 RUNNING PIG

You can run Pig in two ways:
1. Interactive Mode.
2. Batch Mode.

10.8.1 Interactive Mode
You can run Pig in interactive mode by invoking grunt shell. Type pig to get grunt shell as shown below.

[root@volgalnx010 ~]# pig
2015-02-23 21:07:38,916 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.1.3 (rexported) compiled Sep 16 2014, 20:39:43
2015-02-23 21:07:38,917 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig-1424705858915.log
2015-02-23 21:07:38,934 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2015-02-23 21:07:39,313 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-02-23 21:07:39,313 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-23 21:07:39,313 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://volgalnx010.ad.infosys.com:9000
... [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
... [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>

Once you get the grunt prompt, you can type the Pig Latin statement as shown below.

grunt> A = load '/pigdemo/student.tsv' as (rollno, name, gpa);
grunt> DUMP A;

Here, the path refers to HDFS path and DUMP displays the result on the console as shown below.

(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
grunt>

10.8.2 Batch Mode

You need to create "Pig Script" to run pig in batch mode. Write Pig Latin statements in a file and save it with .pig extension.

10.9 EXECUTION MODES OF PIG

You can execute pig in two modes:
1. Local Mode.
2. MapReduce Mode.

10.9.1 Local Mode

To run pig in local mode, you need to have your files in the local file system.

Syntax:
pig -x local filename

10.9.2 MapReduce Mode

To run pig in MapReduce mode, you need to have access to a Hadoop Cluster to read/write file. This is the default mode of Pig.

Syntax:
pig filename

10.10 HDFS COMMANDS


You can work with all HDFS commands in Grunt shell. For example, you can create a directory as shown below.

grunt> fs -mkdir /piglatindemos;

The sections have been designed as follows:

Objective: What is it that we are trying to achieve here?
Input: What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.

10.11 RELATIONAL OPERATORS

10.11.1 FILTER
FILTER operator is used to select tuples from a relation based on specified conditions.

Objective: Find the tuples of those students where the GPA is greater than 4.0.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;

Output:
(1003,Smith,4.5)
(1004,Scott,4.2)
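
To make the selection rule concrete, here is a minimal Python sketch that models FILTER on an in-memory list of tuples. It is an illustration of the semantics only, not how Pig executes the statement; the helper name `filter_relation` and the in-memory `students` list are assumptions made for this example.

```python
# Model of a relation: a list of (rollno, name, gpa) tuples, using the
# five sample students shown earlier in the chapter.
students = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.2),
    (1005, "Joshi", 3.5),
]


def filter_relation(relation, predicate):
    """Keep only the tuples for which the predicate holds (hypothetical helper)."""
    return [t for t in relation if predicate(t)]


# Equivalent in spirit to: B = filter A by gpa > 4.0;
b = filter_relation(students, lambda t: t[2] > 4.0)
print(b)
```

Note that a GPA of exactly 4.0 does not satisfy `gpa > 4.0`, which is why Jack is absent from the result.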

10.11.2 FOREACH
Use FOREACH when you want to do data transformation based on columns of data.

Objective: Display the name of all students in uppercase.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate UPPER(name);
DUMP B;

Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)

10.11.3 GROUP
GROUP operator is used to group data.

Objective: Group tuples of students based on their GPA.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = GROUP A BY gpa;
DUMP B;

Output:
(3.0,{(1001,John,3.0),(1001,John,3.0)})
(3.5,{(1005,Joshi,3.5),(1005,Joshi,3.5)})
(4.0,{(1008,James,4.0),(1002,Jack,4.0)})
(4.2,{(1007,David,4.2),(1004,Scott,4.2)})
(4.5,{(1006,Alex,4.5),(1003,Smith,4.5)})
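
The shape of a grouped relation (a key paired with a bag of the original tuples) can be sketched in plain Python. This is a model of the result structure under the assumption that the ten-row student data is held in memory; `group_by` is a hypothetical helper, and the order of tuples inside each bag is not meaningful in Pig.

```python
from collections import defaultdict

# The ten sample rows (rollno, name, gpa), including the two duplicates.
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def group_by(relation, field_index):
    """Return (key, bag) pairs, one per distinct value of the chosen field."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[field_index]].append(t)
    # Sort by key only to make the printed result easy to compare.
    return [(key, groups[key]) for key in sorted(groups)]


# Equivalent in spirit to: B = GROUP A BY gpa;
b = group_by(students, 2)
for key, bag in b:
    print(key, bag)
```

Each output row keeps the complete original tuples in its bag, which is what later lets functions such as COUNT or AVG run per group.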

10.11.4 DISTINCT
DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on the entire tuple and NOT on individual fields.

Objective: To remove duplicate tuples of students.
Input:
Student (rollno:int,name:chararray,gpa:float)

1001 John 3.0
1002 Jack 4.0
1003 Smith 4.5
1004 Scott 4.2
1005 Joshi 3.5
1006 Alex 4.5
1007 David 4.2
1008 James 4.0
1001 John 3.0
1005 Joshi 3.5
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = DISTINCT A;
DUMP B;

Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
(1006,Alex,4.5)
(1007,David,4.2)
(1008,James,4.0)
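
The point that DISTINCT compares entire tuples, not individual fields, can be illustrated with a short Python sketch. It models the semantics on an in-memory list; the helper `distinct` and the extra row with GPA 3.5 for roll number 1001 are hypothetical and exist only to show the whole-tuple comparison.

```python
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def distinct(relation):
    """Drop tuples that repeat an earlier tuple in every field."""
    seen = set()
    result = []
    for t in relation:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


b = distinct(students)
print(len(b))

# Hypothetical extra row: same rollno and name, different gpa.
# It is NOT a duplicate, because the comparison covers the whole tuple.
with_variant = distinct(students + [(1001, "John", 3.5)])
print(len(with_variant))
```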

10.11.5 LIMIT
LIMIT operator is used to limit the number of output tuples.

Objective: Display the first 3 tuples from the "student" relation.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = LIMIT A 3;
DUMP B;

Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)


10.11.6 ORDER BY
ORDER BY is used to sort a relation based on specific value.

Objective: Display the names of the students in Ascending Order.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = ORDER A BY name;
DUMP B;

Output:
(1006,Alex,4.5)
(1007,David,4.2)
(1002,Jack,4.0)
(1008,James,4.0)
(1001,John,3.0)
(1001,John,3.0)
(1005,Joshi,3.5)
(1005,Joshi,3.5)
(1004,Scott,4.2)
(1003,Smith,4.5)
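
Sorting and limiting can be modelled in a few lines of Python. This sketch assumes the ten sample rows are held in a list; `order_by` and `limit` are hypothetical helpers that mirror ORDER BY and LIMIT in spirit only. In Pig, LIMIT without a preceding ORDER BY does not promise which tuples are returned, whereas a Python slice always takes the first ones.

```python
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def order_by(relation, field_index, descending=False):
    """Return the tuples sorted on one field (ascending by default)."""
    return sorted(relation, key=lambda t: t[field_index], reverse=descending)


def limit(relation, n):
    """Return at most n tuples."""
    return relation[:n]


by_name = order_by(students, 1)        # like: ORDER A BY name
first_three = limit(students, 3)       # like: LIMIT A 3
print([t[1] for t in by_name])
print(first_three)
```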

10.11.7 JOIN
It is used to join two or more relations based on values in the common field. It always performs inner Join.

Objective: To join two relations namely, "student" and "department" based on the values contained in the "rollno" column.
Input:
Student (rollno:int,name:chararray,gpa:float)
Department (rollno:int,deptno:int,deptname:chararray)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptno:int, deptname:chararray);
C = JOIN A BY rollno, B BY rollno;
DUMP C;
DUMP B;

Output:
(1001,John,3.0,1001,101,B.E.)
(1001,John,3.0,1001,101,B.E.)
(1002,Jack,4.0,1002,102,B.Tech)
(1003,Smith,4.5,1003,103,M.Tech)
(1004,Scott,4.2,1004,104,MCA)
(1005,Joshi,3.5,1005,105,MBA)
(1005,Joshi,3.5,1005,105,MBA)
(1006,Alex,4.5,1006,101,B.E)
(1007,David,4.2,1007,104,MCA)
(1008,James,4.0,1008,102,B.Tech)
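
The inner-join behaviour described above (only tuples whose key appears on both sides survive, and fields of both relations are concatenated) can be sketched in Python. This is a simple hash-join model, not Pig's MapReduce implementation; `inner_join` is a hypothetical helper, and the department row for roll number 1009 is invented purely to show an unmatched key being dropped.

```python
from collections import defaultdict

students = [(1001, "John", 3.0), (1002, "Jack", 4.0), (1001, "John", 3.0)]
departments = [
    (1001, 101, "B.E."),
    (1002, 102, "B.Tech"),
    (1009, 109, "Hypothetical"),  # hypothetical row with no matching student
]


def inner_join(left, left_key, right, right_key):
    """Pair every left tuple with every right tuple sharing the same key."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Tuples without a partner on the other side produce no output row.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Equivalent in spirit to: C = JOIN A BY rollno, B BY rollno;
c = inner_join(students, 0, departments, 0)
for row in c:
    print(row)
```

Because the duplicated student row for 1001 matches the same department row twice, the joined result also contains that row twice, as in the output shown above.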

10.11.8 UNION
It is used to merge the contents of two relations.

Objective: To merge the contents of two relations "student" and "department".
Input:
Student (rollno:int,name:chararray,gpa:float)
Department (rollno:int,deptno:int,deptname:chararray)
Act:
A = load '/pigdemo/student.tsv' as (rollno, name, gpa);
B = load '/pigdemo/department.tsv' as (rollno, deptno, deptname);
C = UNION A, B;
STORE C INTO '/pigdemo/uniondemo';
DUMP B;

Output:
"Store" is used to save the output to a specified path. The output is stored in two files: part-m-00000 contains "student" content and part-m-00001 contains "department" content.

File: /pigdemo/uniondemo/part-m-00000
1001 John 3.0
1002 Jack 4.0
1003 Smith 4.5
1004 Scott 4.2
1005 Joshi 3.5
1006 Alex 4.5
1007 David 4.2
1008 James 4.0
1001 John 3.0
1005 Joshi 3.5

File: /pigdemo/uniondemo/part-m-00001
1001 101 B.E.
1002 102 B.Tech
1003 103 M.Tech
1004 104 MCA
1005 105 MBA
1006 101 B.E
1007 104 MCA
1008 102 B.Tech


10.11.9 SPLIT
It is used to partition a relation into two or more relations.

Objective: To partition a relation based on the GPAs acquired by the students.
• GPA = 4.0, place it into relation X.
• GPA is <= 4.0, place it into relation Y.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa==4.0, Y IF gpa<=4.0;
DUMP X;
DUMP Y;

Output: Relation X
(1002,Jack,4.0)
(1008,James,4.0)

Output: Relation Y
(1001,John,3.0)
(1002,Jack,4.0)
(1005,Joshi,3.5)
(1008,James,4.0)
(1001,John,3.0)
(1005,Joshi,3.5)

Because the two conditions overlap at 4.0, the tuples with a GPA of exactly 4.0 appear in both X and Y.
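
A short Python sketch can show how SPLIT evaluates each condition independently, so one tuple may land in more than one output relation. It mirrors the two conditions in the Act statement (gpa == 4.0 and gpa <= 4.0) on an in-memory list; `split` is a hypothetical helper, not a Pig API.

```python
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def split(relation, **conditions):
    """Apply every named condition to every tuple; conditions may overlap."""
    return {
        name: [t for t in relation if condition(t)]
        for name, condition in conditions.items()
    }


# Equivalent in spirit to: SPLIT A INTO X IF gpa==4.0, Y IF gpa<=4.0;
parts = split(
    students,
    X=lambda t: t[2] == 4.0,
    Y=lambda t: t[2] <= 4.0,
)
print(parts["X"])
print(parts["Y"])
```

Tuples that satisfy no condition (here, every student with a GPA above 4.0) appear in neither relation.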

10.11.10 SAMPLE
It is used to select a random sample of data based on the specified sample size, given as a fraction between 0 and 1.

Objective: To depict the use of SAMPLE.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = SAMPLE A 0.01;
DUMP B;
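
Since no output is shown for SAMPLE, a small Python model may help explain why: with a fraction of 0.01 and only ten rows, the result is usually empty. The sketch below keeps each tuple independently with the given probability, which is an assumption about the general idea rather than a description of Pig's internals; `sample` and its `seed` parameter are hypothetical.

```python
import random

students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5),
]


def sample(relation, fraction, seed=None):
    """Keep each tuple with probability `fraction` (0.0 keeps none, 1.0 keeps all)."""
    rng = random.Random(seed)
    return [t for t in relation if rng.random() < fraction]


print(sample(students, 0.0))       # never keeps a tuple
print(sample(students, 1.0))       # keeps every tuple
print(len(sample(students, 0.5)))  # varies from run to run
```

The size of the result is therefore approximate: it is close to fraction times the number of input tuples only for large inputs.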

10.12 EVAL FUNCTION

10.12.1 AVG
AVG is used to compute the average of numeric values in a single column bag.

Objective: To calculate the average marks for each student.
Input:
Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, AVG(A.marks);
DUMP C;

Output:
({(Jack),(Jack),(Jack),(Jack)},39.75)
({(John),(John),(John),(John)},39.0)

Note: You need to use PigStorage function if you wish to manipulate files other than .tsv.

10.12.2 MAX
MAX is used to compute the maximum of numeric values in a single column bag.

Objective: To calculate the maximum marks for each student.
Input:
Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, MAX(A.marks);
DUMP C;

Output:
({(Jack),(Jack),(Jack),(Jack)},46)
({(John),(John),(John),(John)},45)

Note: Similarly, you can try the MIN and the SUM functions as well.

10.12.3 COUNT
COUNT is used to count the number of elements in a bag.

Objective: To count the number of tuples in a bag.
Input:
Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, COUNT(A);
DUMP C;

Output:
({(Jack),(Jack),(Jack),(Jack)},4)
({(John),(John),(John),(John)},4)

Note: The default file format of Pig is .tsv file. Use PigStorage() to manipulate files other than .tsv file.
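
The eval functions in this section all follow the same pattern: group first, then reduce each bag to one value. The Python sketch below models that pattern for AVG, MAX, and COUNT. The individual marks are hypothetical (the chapter does not list them); they were chosen only so that the per-student results agree with the outputs quoted above. `group_marks` and `summarize` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical marks: the source shows only the aggregated outputs.
records = [
    ("Jack", 35), ("Jack", 40), ("Jack", 38), ("Jack", 46),
    ("John", 45), ("John", 36), ("John", 38), ("John", 37),
]


def group_marks(rows):
    """Model of GROUP A BY studname: one bag of marks per student."""
    groups = defaultdict(list)
    for name, marks in rows:
        groups[name].append(marks)
    return dict(groups)


def summarize(rows):
    """Model of applying AVG, MAX and COUNT to each student's bag."""
    return {
        name: {"avg": sum(bag) / len(bag), "max": max(bag), "count": len(bag)}
        for name, bag in group_marks(rows).items()
    }


summary = summarize(records)
for name, stats in summary.items():
    print(name, stats)
```

MIN and SUM fit the same shape: only the reducing function applied to each bag changes.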

10.13 COMPLEX DATA TYPES

10.13.1 TUPLE
A TUPLE is an ordered collection of fields.

Objective: To use the complex data type "Tuple" to load data.
Input:
(John,12) (Jack,13)
(James,7) (Joseph,5)
(Smith,8) (Scott,12)

Act:
A = LOAD '/root/pigdemos/studentdata.tsv' AS (t1:tuple(t1a:chararray, t1b:int), t2:tuple(t2a:chararray, t2b:int));
B = FOREACH A GENERATE t1.t1a, t1.t1b, t2.$0, t2.$1;
DUMP B;

Output:
(John,12,Jack,13)
(James,7,Joseph,5)
(Smith,8,Scott,12)

Note: You can refer to the field using Positional Notation as shown above. The Positional Notation is denoted by $ sign and the position starts with 0 (e.g., $0).
10.13.2 MAP
MAP represents a key/value pair.

Objective: To depict the complex data type "map".
Input:
John [city#Bangalore]
Jack [city#Pune]
James [city#Chennai]

Act:
A = load '/root/pigdemos/studentcity.tsv' Using PigStorage as (studname:chararray, m:map[chararray]);
B = foreach A generate m#'city' as CityName:chararray;
DUMP B;

Output:
(Bangalore)
(Pune)
(Chennai)
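
The `m#'city'` lookup above behaves like a dictionary access. The following Python sketch parses the bracketed `key#value` text form used in the input and then looks up the `city` key. The parser `parse_map` is a deliberately simplified, hypothetical helper: it assumes flat string values with no commas, brackets, or `#` characters inside them, which is narrower than what Pig itself accepts.

```python
def parse_map(text):
    """Parse '[k1#v1,k2#v2]' into a dict (simplified; flat string values only)."""
    body = text.strip()[1:-1]  # drop the surrounding square brackets
    result = {}
    if body:
        for pair in body.split(","):
            key, value = pair.split("#", 1)
            result[key.strip()] = value.strip()
    return result


rows = [
    ("John", "[city#Bangalore]"),
    ("Jack", "[city#Pune]"),
    ("James", "[city#Chennai]"),
]

# Equivalent in spirit to: B = foreach A generate m#'city' as CityName;
cities = [parse_map(m).get("city") for _, m in rows]
print(cities)
```

Using `.get` means a missing key yields `None` here, which loosely corresponds to a null result for an absent key.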

10.14 PIGGY BANK

Pig users can use Piggy Bank functions in Pig Latin script and they can also share their functions in Piggy Bank.

Objective: To use Piggy Bank string UPPER function.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
register '/root/pigdemos/piggybank-0.12.0.jar';
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
upper = foreach A generate org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;

Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
(ALEX)
(DAVID)
(JAMES)
(JOHN)
(JOSHI)

Note: You need to use the "register" keyword to use Piggy Bank jar function in your pig script.

10.15 USER-DEFINED FUNCTIONS (UDF)

Pig allows you to create your own function for complex analysis.

Objective: To depict user-defined function.
Java Code to convert name into uppercase:

package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        // A null or empty input tuple yields a null result.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String)input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}

Note: Convert above java class into jar to include this function into your code.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
register /root/pigdemos/myudfs.jar;
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
(ALEX)
(DAVID)
(JAMES)
(JOHN)
(JOSHI)

10.16 PARAMETER SUBSTITUTION

Pig allows you to pass parameters at runtime.

Objective: To depict parameter substitution.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '$student' as (rollno:int, name:chararray, gpa:float);
DUMP A;

Execute:
pig -param student=/pigdemo/student.tsv parameterdemo.pig

Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
(1006,Alex,4.5)
(1007,David,4.2)
(1008,James,4.0)
(1001,John,3.0)
(1005,Joshi,3.5)
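
Parameter substitution is a textual replacement that happens before the script runs. The Python sketch below imitates that step with the standard library's `string.Template`, which happens to use the same `$name` placeholder style. It is an analogy for the idea, not Pig's preprocessor; `substitute` is a hypothetical helper.

```python
from string import Template

# The script text from the Act section, with its $student placeholder.
script = (
    "A = load '$student' as (rollno:int, name:chararray, gpa:float);\n"
    "DUMP A;"
)


def substitute(script_text, params):
    """Replace every $name placeholder; raises KeyError if a value is missing."""
    return Template(script_text).substitute(params)


resolved = substitute(script, {"student": "/pigdemo/student.tsv"})
print(resolved)
```

Failing loudly on a missing parameter, as `substitute` does here, is usually preferable to running a script with an unresolved placeholder in a path.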

10.17 DIAGNOSTIC OPERATOR


It returns the schema of a relation.

Objective: To depict the use of DESCRIBE.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
DESCRIBE A;

Output:
A: {rollno: int,name: chararray,gpa: float}

10.18 WORD COUNT EXAMPLE USING PIG

Objective: To count the occurrence of similar words in a file.
Input:
Welcome to Hadoop Session
Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session

Act:
lines = LOAD '/root/pigdemos/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Output:
(to,2)
(Pig,1)
(Hive,2)
(Hadoop,2)
(Session,3)
(Welcome,1)
(Introducing,1)
(Introduction,1)

Note:
TOKENIZE splits the line into a field for each word.
FLATTEN will take the collection of records returned by TOKENIZE and produce a separate record for each one, calling the single field in the record word.
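
The same tokenize, flatten, group, and count steps can be traced in plain Python on the five input lines. This is a sketch of the logic only: `str.split()` breaks on whitespace, which is a simplification of TOKENIZE, and `word_count` is a hypothetical helper name.

```python
from collections import Counter

lines = [
    "Welcome to Hadoop Session",
    "Introduction to Hadoop",
    "Introducing Hive",
    "Hive Session",
    "Pig Session",
]


def word_count(text_lines):
    """Tokenize each line, flatten into one word stream, then count per word."""
    # The nested comprehension plays the role of FLATTEN(TOKENIZE(line)).
    words = [word for line in text_lines for word in line.split()]
    # Counter groups identical words and counts them (GROUP + COUNT).
    return Counter(words)


counts = word_count(lines)
for word, n in counts.items():
    print(word, n)
```

The counts are case sensitive, so "Hive" and "hive" would be tallied separately.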

10.19 WHEN TO USE PIG?


Pig can be used in the following situations:
1. When your data loads are time sensitive.
2. When you want to process various data sources.
3. When you want to get analytical insights through sampling.

10.20 WHEN NOT TO USE PIG?


Pig should not be used in the following situations:
1. When your data is completely in the unstructured form such as video, text, and audio.
2. When there is a time constraint because Pig is slower than MapReduce jobs.

10.21 PIG AT YAHOO!

Yahoo uses Pig for two things:

1. In Pipelines, to fetch log data from its web servers and to perform cleansing to remove the company's internal views and clicks.
2. In Research, script is used to test a theory. Pig provides facility to integrate Perl or Python script which can be executed on a huge dataset.

10.22 PIG versus HIVE

                     Pig                            Hive
Used By              Programmers and Researchers    Analyst
Used For             Programming                    Reporting
Language             Procedural data flow language  SQL Like
Suitable For         Semi-Structured                Structured
Schema/Types         Explicit                       Implicit
UDF Support          YES                            YES
Join/Order/Sort      YES                            YES
DFS Direct Access    YES (Implicit)                 YES (Explicit)
Web Interface        NO                             YES
Partitions           NO                             YES
Shell                YES                            YES

REMIND ME
• Apache Pig is a platform for data analysis. It is an alternative to MapReduce Programming.
• It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
• It provides a language called "Pig Latin" to express data flows.
• The main components of Pig are as follows:
  • Data flow language (Pig Latin).
  • Interactive shell where you can type Pig Latin statements (Grunt).
  • Pig interpreter and execution engine.
• You can run Pig in two ways:
  • Interactive Mode.
  • Batch Mode.

POINT ME (BOOK)

CONNECT ME (INTERNET RESOURCES)
• http://pig.apache.org/docs/r0.12.0/index.html
• http://www.edureka.co/blog/introduction-to-pig/

TEST ME
A. Fill Me
1. Pig is a ______ language.
2. In Pig, ______ is used to specify data flow.
3. Pig provides an ______ to execute data flow.
4. ______, ______ are execution modes of Pig.
5. The interactive mode of Pig is ______.
6. ______ and ______ are case sensitive in Pig.
7. ______, ______, ______ are Complex Data Types of Pig.
8. Pig is used in ______ process.

Answers:
1. Scripting
2. Pig Latin
3. Pig Engine
4. Local Mode, MapReduce Mode
5. Grunt
6. Fields and Aliases
7. Bag, Tuple, Map
8. ETL

B. Match Me
Column A          Column B
Map               Hadoop Cluster
Bag               An Ordered Collection of Fields
Local Mode        Collection of Tuples
Tuple             Key/Value Pair
MapReduce Mode    Local File System

Answers:
Column A          Column B
Map               Key/Value Pair
Bag               Collection of Tuples
Local Mode        Local File System
Tuple             An Ordered Collection of Fields
MapReduce Mode    Hadoop Cluster

······························
C. True or False
1. PigStorage() function is case sensitive.
2. Local Mode is the default mode of Pig.
3. DISTINCT keyword removes duplicate fields.
4. LIMIT keyword is used to display limited number of tuples in Pig.
5. ORDER BY is used for sorting.

Answers:
1. True
2. False
3. False
4. True
5. True

ASSIGNMENTS FOR HANDS-ON PRACTICE

ASSIGNMENT 1: SPLIT
Objective: To learn about SPLIT relational operator.
Problem Description:

Write a Pig Script to split customers for reward program based on their life time values.

Customers    Life Time Value
Jack         25000
Smith        8000
David        35000
John         15000
Scott        10000
Joshi        28000
Ajay         12000
Vinay        30000
Joseph

• If Life Time Value is >10000 and <=20000: Silver Program.
• If Life Time Value is >20000: Gold Program.


ASSIGNMENT 2: GROUP
Objective: To learn about GROUP relational operator.
Problem Description:
Create a data file for below schemas:
• Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate
• Customer: CustomerId, CustomerName, Address, City, State, Country

1. Load Order and Customer Data.
2. Write a Pig Latin Script to determine number of items bought by each customer.

ASSIGNMENT 3: COMPLEX DATA TYPE - BAG
Objective: To learn complex data type - bag in Pig.
Problem Description:
1. Create a file which contains bag dataset as shown below.

User ID     From                   To
user1001    user1001@sample.com    {(user003@sample.com),(user004@sample.com),(user006@sample.com)}
user1002    user1002@sample.com    {(user005@sample.com),(user006@sample.com)}
user1003    user1003@sample.com    {(user001@sample.com),(user005@sample.com)}

2. Write a Pig Latin statement to display the names of all users who have sent emails and also a list of all the people that they have sent the email to.
3. Store the result in a file.
