CHAPTER 10
Introduction to Pig
BRIEF CONTENTS
• What's in Store?
• What is Pig?
• Key Features of Pig
• The Anatomy of Pig
• Pig on Hadoop
• Pig Philosophy
• Use Case for Pig: ETL Processing
• Pig Latin Overview
• Pig Latin Statements
• Pig Latin: Keywords
• Pig Latin: Identifiers
• Pig Latin: Comments
• Pig Latin: Case Sensitivity
• Operators in Pig Latin
• Data Types in Pig
• Simple Data Types
• Complex Data Types
• Running Pig
• Interactive Mode
• Batch Mode
• Execution Modes of Pig
• Local Mode
• MapReduce Mode
• HDFS Commands
• Relational Operators
• EVAL Function
• Complex Data Types
• Tuple
• Map
• Piggy Bank
• User-Defined Functions (UDF)
• Parameter Substitution
• Diagnostic Operator
• Word Count Example using Pig
• When to use Pig?
• When NOT to use Pig?
• Pig at Yahoo!
• Pig versus Hive
WHAT'S IN STORE?
We assume that by now you have become familiar with the basic concepts of HDFS and MapReduce Programming. The focus of this chapter is to build on this knowledge to perform analysis using Pig. We will discuss a few relational and eval operators of Pig. We will also discuss Complex Data Types, Piggy Bank, and UDF (User-Defined Functions) of Pig.
We suggest you refer to some of the learning resources provided at the end of this chapter for better learning. We also suggest you practice the "Test Me" exercises.
Refer Figure 10.1.
Figure 10.1 The anatomy of Pig (label: Pig Latin Script).
10.3 PIG ON HADOOP
Big Data and Analytics
Figure 10.3 Pig: ETL Processing.
Pig Latin Statements
1. Pig Latin statements are basic constructs to process data using Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.
Pig Latin statements are generally ordered as follows:
1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store the result.
The following is a simple Pig Latin script to load, filter, and store "student" data.
Table 10.1 describes valid and invalid identifiers.
Table 10.1 Valid and invalid identifiers
Valid Identifier      Y    A1    A1_2014
Invalid Identifier    5    Sales%    _Sales
Operators in Pig Latin include arithmetic operators (*, /, %), comparison operators (<, >, <=, >=), and the Boolean operator NOT.
Name Description
Tuple An ordered set of fields. Example: (2,3)
Bag A collection of tuples. Example: {(2,3),(7,5)}
10.8 RUNNING PIG
You can run Pig in two ways:
1. Interactive Mode.
2. Batch Mode.
10.8.1 Interactive Mode
You can run Pig in interactive mode by invoking grunt shell. Type pig to get grunt shell as shown below.
[root@volgalnx010 ~]# pig
2015-02-23 21:07:38,916 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.1.3 (rexported) compiled Sep 16 2014, 20:39:43
2015-02-23 21:07:38,917 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig-1424705858915.log
2015-02-23 21:07:38,934 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2015-02-23 21:07:39,313 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-02-23 21:07:39,313 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-23 21:07:39,313 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://volgalnx010.ad.infosys.com:9000
2015-02-23 21:07:... [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-02-23 21:07:... [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
Once you get the grunt prompt, you can type the Pig Latin statements as shown below.
grunt> A = load '/pigdemo/student.tsv' as (rollno, name, gpa);
grunt> DUMP A;
Here, the path refers to an HDFS path and DUMP displays the result on the console as shown below.
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
grunt>
10.8.2 Batch Mode
You need to create a "Pig Script" to run Pig in batch mode. Write Pig Latin statements in a file and save it with the .pig extension.
10.9.1 Local Mode
To run Pig in local mode, you need to have your files in the local file system.
Syntax:
pig -x local filename
grunt> fs -mkdir /piglatindemos;
10.11.1 FILTER
FILTER operator is used to select tuples from a relation based on specified conditions.
Objective: Find the tuples of those students where the GPA is greater than 4.0.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
Output:
(1003,Smith,4.5)
(1004,Scott,4.2)
[root@volgalnx010 pigdemos]#
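The selection that FILTER performs can also be modelled outside Pig. The following is a minimal plain-Python sketch (not Pig Latin) of the objective above. The helper name pig_filter is hypothetical, and the rows are the first five student tuples that appear elsewhere in this chapter.

```python
# First five student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.2),
    (1005, "Joshi", 3.5),
]


def pig_filter(relation, predicate):
    """Hypothetical helper: keep only tuples for which predicate is true."""
    return [row for row in relation if predicate(row)]


# Equivalent in spirit to filtering the relation on gpa > 4.0.
high_gpa = pig_filter(students, lambda row: row[2] > 4.0)
for row in high_gpa:
    print(row)
```

The input relation is left unchanged; as with a Pig operator, a new relation is produced.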
10.11.2 FOREACH
Use FOREACH when you want to do data transformation based on columns of data.
Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
[root@volgalnx010 pigdemos]#
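A column-wise transformation of the kind FOREACH expresses can be sketched in plain Python (not Pig Latin) as below. The helper name foreach_generate is hypothetical; the sketch assumes the transformation is "convert the name column to uppercase", which is consistent with the output shown above.

```python
# First five student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.2),
    (1005, "Joshi", 3.5),
]


def foreach_generate(relation, expression):
    """Hypothetical helper: apply expression to every tuple, one output per input."""
    return [expression(row) for row in relation]


# Each output tuple has a single field: the uppercased name.
upper_names = foreach_generate(students, lambda row: (row[1].upper(),))
for row in upper_names:
    print(row)
```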
10.11.3 GROUP
GROUP operator is used to group data.
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = GROUP A BY gpa;
DUMP B;
Output:
(3.0,{(1001,John,3.0),(1001,John,3.0)})
(3.5,{(1005,Joshi,3.5),(1005,Joshi,3.5)})
(4.0,{(1008,James,4.0),(1002,Jack,4.0)})
(4.2,{(1007,David,4.2),(1004,Scott,4.2)})
(4.5,{(1006,Alex,4.5),(1003,Smith,4.5)})
[root@volgalnx010 pigdemos]#
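The shape of a GROUP result, one tuple per key holding a bag of the matching tuples, can be illustrated with a plain-Python sketch (not Pig Latin). The helper name group_by is hypothetical. Keys are sorted here only for readable printing; the order of tuples inside each bag may differ from what Pig prints.

```python
from collections import defaultdict

# Student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def group_by(relation, key):
    """Hypothetical helper: collect tuples into one bag (list) per key value."""
    bags = defaultdict(list)
    for row in relation:
        bags[key(row)].append(row)
    return sorted(bags.items(), key=lambda item: item[0])


grouped = group_by(students, lambda row: row[2])
for gpa, bag in grouped:
    print(gpa, bag)
```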
10.11.4 DISTINCT
DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on the entire tuple and NOT on individual fields.
B = DISTINCT A;
DUMP B;
Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
(1006,Alex,4.5)
(1007,David,4.2)
(1008,James,4.0)
[root@volgalnx010 pigdemos]#
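The point that DISTINCT compares whole tuples, not individual fields, can be seen in the following plain-Python sketch (not Pig Latin). The helper name distinct is hypothetical.

```python
# Student tuples (rollno, name, gpa); 1001 and 1005 appear twice.
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def distinct(relation):
    """Hypothetical helper: drop tuples that are identical in every field."""
    seen = set()
    result = []
    for row in relation:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


unique_students = distinct(students)
print(len(students), "->", len(unique_students))
```

Tuples that share a GPA, such as Jack and James, are both kept because they differ in other fields.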
10.11.5 LIMIT
LIMIT operator is used to limit the number of output tuples.
Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
[root@volgalnx010 pigdemos]#
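As a rough plain-Python analogy (not Pig Latin), LIMIT behaves like taking at most n tuples from a relation. The helper name limit is hypothetical. Note that Pig does not promise which tuples are returned unless the relation has been ordered first; the sketch simply takes the first n.

```python
from itertools import islice

# First five student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.2),
    (1005, "Joshi", 3.5),
]


def limit(relation, n):
    """Hypothetical helper: return at most n tuples from the relation."""
    return list(islice(relation, n))


first_three = limit(students, 3)
for row in first_three:
    print(row)
```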
10.11.6 ORDER BY
ORDER BY is used to sort a relation based on a specific value.
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = ORDER A BY name;
DUMP B;
Output:
(1006,Alex,4.5)
(1007,David,4.2)
(1002,Jack,4.0)
(1008,James,4.0)
(1001,John,3.0)
(1001,John,3.0)
(1005,Joshi,3.5)
(1005,Joshi,3.5)
(1004,Scott,4.2)
(1003,Smith,4.5)
[root@volgalnx010 pigdemos]#
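Sorting a relation on one field, as ORDER BY does, can be sketched in plain Python (not Pig Latin). The helper name order_by is hypothetical, and the sketch assumes an ascending sort on the name field.

```python
# Student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def order_by(relation, key):
    """Hypothetical helper: return a new relation sorted ascending by key."""
    return sorted(relation, key=key)


by_name = order_by(students, lambda row: row[1])
for row in by_name:
    print(row)
```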
10.11.7 JOIN
It is used to join two or more relations based on values in the common field. By default, it performs an inner join.
Objective: To join two relations, namely "student" and "department", based on the values contained in the "rollno" column.
Input:
Student (rollno:int,name:chararray,gpa:float)
Department(rollno:int,deptno:int,deptname:chararray)
Act:
Output:
(1001,John,3.0,1001,101,B.E.)
(1001,John,3.0,1001,101,B.E.)
...
(1004,Scott,4.2,1004,104,MCA)
(1005,Joshi,3.5,1005,105,MBA)
(1005,Joshi,3.5,1005,105,MBA)
...
(1008,James,4.0,1008,102,B.Tech)
[root@volgalnx010 pigdemos]#
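The inner-join behaviour described above can be modelled in plain Python (not Pig Latin) as follows. The helper name inner_join is hypothetical. The department rows are taken from the legible part of the output; the department list deliberately omits rollno 1003, an assumption made only to show that unmatched tuples are dropped.

```python
from collections import defaultdict

students = [(1001, "John", 3.0), (1003, "Smith", 4.5), (1004, "Scott", 4.2)]
# Assumption for illustration: no department row for rollno 1003.
departments = [(1001, 101, "B.E."), (1004, 104, "MCA"), (1005, 105, "MBA")]


def inner_join(left, right, left_key, right_key):
    """Hypothetical helper: concatenate every pair of tuples whose keys match."""
    index = defaultdict(list)
    for row in right:
        index[right_key(row)].append(row)
    return [l + r for l in left for r in index.get(left_key(l), [])]


joined = inner_join(students, departments, lambda s: s[0], lambda d: d[0])
for row in joined:
    print(row)
```

As in the Pig output, the join key appears twice in each result tuple because the fields of both relations are kept.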
10.11.8 UNION
It is used to merge the contents of two relations.
Objective: To merge the contents of two relations "student" and "department".
Input:
Student (rollno:int,name:chararray,gpa:float)
Department(rollno:int,deptno:int,deptname:chararray)
Act:
Output:
"Store" is used to save the output to a specified path. The output is stored in two files: part-m-00000 contains "student" content and part-m-00001 contains "department" content.
File: /pigdemo/uniondemo/part-m-00000
1001 John 3.0
1005 Joshi 3.5
File: /pigdemo/uniondemo/part-m-00001
10.11.9 SPLIT
It is used to partition a relation into two or more relations.
Objective: To partition a relation based on the GPAs acquired by the students.
• GPA = 4.0, place it into relation X.
• GPA is <= 4.0, place it into relation Y.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa==4.0, Y IF gpa<=4.0;
DUMP X;
DUMP Y;
Output: Relation X
(1002,Jack,4.0)
(1008,James,4.0)
[root@volgalnx010 pigdemos]#
Output: Relation Y
(1001,John,3.0)
(1002,Jack,4.0)
(1005,Joshi,3.5)
(1008,James,4.0)
(1001,John,3.0)
(1005,Joshi,3.5)
[root@volgalnx010 pigdemos]#
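Because the two conditions used above overlap at GPA 4.0, a tuple can land in both X and Y. The following plain-Python sketch (not Pig Latin) models that behaviour; the helper name split_relation is hypothetical.

```python
# Student tuples (rollno, name, gpa) as shown in this chapter.
students = [
    (1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5),
    (1004, "Scott", 4.2), (1005, "Joshi", 3.5), (1006, "Alex", 4.5),
    (1007, "David", 4.2), (1008, "James", 4.0), (1001, "John", 3.0),
    (1005, "Joshi", 3.5),
]


def split_relation(relation, **conditions):
    """Hypothetical helper: test every tuple against every named condition."""
    return {
        name: [row for row in relation if condition(row)]
        for name, condition in conditions.items()
    }


parts = split_relation(
    students,
    X=lambda row: row[2] == 4.0,
    Y=lambda row: row[2] <= 4.0,
)
print("X:", parts["X"])
print("Y:", parts["Y"])
```

Tuples that satisfy neither condition (GPA above 4.0 here) do not appear in any output relation.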
10.11.10 SAMPLE
It is used to select a random sample of data based on the specified sample size.
10.12.1 AVG
AVG is used to compute the average of numeric values in a single column bag.
Objective: To calculate the average marks for each student.
Input:
Student (studname:chararray,marks:int)
Act:
Output:
({(Jack),(Jack),(Jack),(Jack)},39.75)
({(John),(John),(John),(John)},39.0)
[root@volgalnx010 pigdemos]#
Note: You need to use PigStorage function if you wish to manipulate files other than .tsv.
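The group-then-aggregate pattern behind AVG can be sketched in plain Python (not Pig Latin). The individual marks are not listed in this chapter, so the values below are hypothetical and were chosen only so that the averages match the output shown above; the helper name average_by is also hypothetical.

```python
from collections import defaultdict

# Hypothetical marks: four rows per student, picked to reproduce 39.75 and 39.0.
marks = [
    ("Jack", 35), ("Jack", 40), ("Jack", 42), ("Jack", 42),
    ("John", 36), ("John", 38), ("John", 40), ("John", 42),
]


def average_by(relation, key, value):
    """Hypothetical helper: group tuples by key, then average value per group."""
    bags = defaultdict(list)
    for row in relation:
        bags[key(row)].append(value(row))
    return {k: sum(v) / len(v) for k, v in bags.items()}


averages = average_by(marks, lambda row: row[0], lambda row: row[1])
print(averages)
```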
10.12.2 MAX
MAX is used to compute the maximum of numeric values in a single column bag.
Objective: To calculate the maximum marks for each student.
Input:
Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A.studname, MAX(A.marks);
DUMP C;
Output:
10.12.3 COUNT
COUNT is used to count the number of elements in a bag.
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);
B = GROUP A BY studname;
Output:
({(Jack),(Jack),(Jack),(Jack)},4)
({(John),(John),(John),(John)},4)
[root@volgalnx010 pigdemos]#
Note: The default file format of Pig is .tsv file. Use PigStorage() to manipulate files other than .tsv file.
10.13.1 TUPLE
A TUPLE is an ordered collection of fields.
Objective: To use the complex data type "Tuple" to load data.
Input:
(John,12) (Jack,13)
(James,7) (Joseph,5)
(Smith,8) (Scott,12)
Act:
A = LOAD '/root/pigdemos/studentdata.tsv' AS (t1:tuple(t1a:chararray, t1b:int),t2:tuple(t2a:chararray,t2b:int));
B = FOREACH A GENERATE t1.t1a, t1.t1b,t2.$0,t2.$1;
DUMP B;
Output:
(John,12,Jack,13)
(James,7,Joseph,5)
(Smith,8,Scott,12)
[root@volgalnx010 pigdemos]#
Note: You can refer to the field using Positional Notation as shown above. The Positional Notation is denoted by the $ sign and the position starts with 0 (e.g., $0).
10.13.2 MAP
MAP represents a key/value pair.
Act:
DUMP B;
Output:
(Bangalore)
(Pune)
(Chennai)
[root@volgalnx010 pigdemos]#
Pig user can use Piggy Bank functions in Pig Latin script and they can also share their functions in Piggy
Bank.
register '/root/pigdemos/piggybank-0.12.0.jar';
org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;
Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
(ALEX)
(DAVID)
(JAMES)
(JOHN)
(JOSHI)
[root@volgalnx010 pigdemos]#
Note: You need to use the "register" keyword to use a Piggy Bank jar function in your Pig script.
10.15 USER-DEFINED FUNCTIONS (UDF)
Pig allows you to create your own user-defined function for complex analysis.
Java code to convert name into uppercase:
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

// Eval function that returns its first argument in uppercase.
public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        // Return null for a missing or empty input tuple.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            // Wrap any failure so that Pig reports it as an IOException.
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}
Note: Convert the above Java class into a jar to include this function in your code.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
register /root/pigdemos/myudfs.jar;
A= load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
(ALEX)
(DAVID)
(JAMES)
(JOHN)
(JOSHI)
[root@volgalnx010 pigdemos]#
DUMP A;
Execute:
pig -param student=/pigdemo/student.tsv parameterdemo.pig
Output:
(1001,John,3.0)
(1002,Jack,4.0)
(1003,Smith,4.5)
(1004,Scott,4.2)
(1005,Joshi,3.5)
(1006,Alex,4.5)
(1007,David,4.2)
(1008,James,4.0)
(1001,John,3.0)
(1005,Joshi,3.5)
[root@volgalnx010 pigdemos]#
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
DESCRIBE A;
Input:
Welcome to Hadoop Session
Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session
Act:
lines = LOAD '/root/pigdemos/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
DUMP wordcount;
Output:
(to,2)
(Pig,1)
(Hive,2)
(Hadoop,2)
(Session,3)
(Welcome,1)
(Introducing,1)
(Introduction,1)
[root@volgalnx010 pigdemos]#
Note:
TOKENIZE splits the line into a field for each word.
FLATTEN will take the collection of records returned by TOKENIZE and produce a separate record for each one, calling the single field in the record word.
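The tokenize, flatten, and count steps of the word count example can be mirrored in plain Python (not Pig Latin) as a cross-check of the output above. The sketch splits on whitespace only, which is an assumption that happens to be sufficient for this input; the variable names are illustrative.

```python
from collections import Counter

# The five input lines from the example above.
lines = [
    "Welcome to Hadoop Session",
    "Introduction to Hadoop",
    "Introducing Hive",
    "Hive Session",
    "Pig Session",
]

# Splitting each line models TOKENIZE; the nested loop models FLATTEN,
# producing one record per word instead of one bag of words per line.
words = [word for line in lines for word in line.split()]

# Grouping identical words and counting each group gives the word count.
wordcount = Counter(words)
for word, count in wordcount.items():
    print(word, count)
```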
REMIND ME
• Apache Pig is a platform for data analysis. It is an alternative to MapReduce Programming.
• It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
• It provides a language called "Pig Latin" to express data flows.
• The main components of Pig are as follows:
  • Data flow language (Pig Latin).
  • Interactive shell where you can type Pig Latin statements (Grunt).
  • Pig interpreter and execution engine.
• You can run Pig in two ways:
  • Interactive Mode.
  • Batch Mode.
TEST ME
A. Fill Me
1. Pig is a ________ language.
2. In Pig, ________ is used to specify data flow.
3. Pig provides an ________ to execute data flow.
4. ________, ________ are execution modes of Pig.
5. The interactive mode of Pig is ________.
6. ________ and ________ are case sensitive in Pig.
7. ________, ________, ________ are Complex Data Types of Pig.
8. Pig is used in ________ process.
Answers:
1. Scripting
2. Pig Latin
3. Pig Engine
4. Local Mode, MapReduce Mode
5. Grunt
6. Fields and Aliases
7. Bag, Tuple, Map
8. ETL
B. Match Me
Column A          Column B
Map               Hadoop Cluster
Bag               An Ordered Collection of Fields
Local Mode        Collection of Tuples
Tuple             Key/Value Pair
MapReduce Mode    Local File System
Answers:
Column A          Column B
Map               Key/Value Pair
Bag               Collection of Tuples
Local Mode        Local File System
Tuple             An Ordered Collection of Fields
MapReduce Mode    Hadoop Cluster
C. True or False
1. PigStorage() function is case sensitive.
2. Local Mode is the default mode of Pig.
3. DISTINCT keyword removes duplicate fields.
4. LIMIT keyword is used to display limited number of tuples in Pig.
5. ORDER BY is used for sorting.
Answers:
1. True
2. False
3. False
4. True
5. True
ASSIGNMENT 1: SPLIT
Objective: To learn about SPLIT relational operator.
Problem Description:
Write a Pig Script to split customers for reward program based on their life time values.