Lecture 23: Pig: Making Hadoop Easy (Slides Provided By: Alan Gates, Yahoo! Research)
-2-
Map-Reduce
• Computation is moved to the data
• A simple yet powerful programming model
– Map: every record handled individually
– Shuffle: records collected by key
– Reduce: key and iterator of all associated values
• User provides:
– input and output (usually files)
– map Java function
– key to aggregate on
– reduce Java function
• Opportunities for more control: partitioning, sorting, partial
aggregations, etc.
-3-
Map Reduce Illustrated
Input, one line per map task:
  Map 1: "Romeo, Romeo, wherefore art thou Romeo?"
  Map 2: "What, art thou hurt?"

Map output, each word paired with a count of 1:
  Map 1: (Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
  Map 2: (What, 1) (art, 1) (thou, 1) (hurt, 1)

Reduce output, after the shuffle collects each word's counts:
  Reducer 1: (art, 2) (hurt, 1) (thou, 2)
  Reducer 2: (Romeo, 3) (wherefore, 1) (what, 1)
-4-
Making Parallelism Simple
• Sequential reads = good read speeds
• In a large cluster, failures are guaranteed; Map-Reduce handles retries
• Good fit for batch processing applications that need to touch
all your data:
– data mining
– model tuning
• Bad fit for applications that need to find one particular record
• Bad fit for applications that need to communicate between
processes; oriented around independent units of work
-5-
Why use Pig?
-6-
In Map-Reduce
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url. The join output has no tab, so the whole
            // record arrives in the key.
            String line = k.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Flip (url, count) to (count, url) so Hadoop sorts by count
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
            new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
            new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
            new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        // Sort counts descending so we keep the top, not bottom, 100
        top100.setOutputKeyComparatorClass(
            LongWritable.DecreasingComparator.class);
        FileInputFormat.addInputPath(top100,
            new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl(
            "Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
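-7-
In Pig Latin
The same computation in Pig Latin comes to roughly nine lines (a sketch; the aliases are illustrative):

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top100 = limit Srtd 100;
store Top100 into 'top100sitesforusers18to25';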
-8-
But can it fly?
-9-
Essence of Pig
• Map-Reduce is too low-level a programming model; SQL is too high-level
• Pig Latin, a language intended to sit between the two:
– Imperative
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional: used when available, they can also be defined at
runtime (see the sketch below)
– User Defined Functions are first-class citizens
– Opportunities for an advanced optimizer, but the programmer can also
optimize by hand
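A small sketch of these points in Pig Latin (myudfs.Normalize is a hypothetical UDF):

raw = load 'queries';                  -- no schema declared
short = filter raw by SIZE($1) < 100;  -- fields referenced by position
q = load 'queries' as (user:chararray, query:chararray);
norm = foreach q generate user, myudfs.Normalize(query);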
- 10 -
How It Works
A script such as

  A = load …
  B = filter …
  C = group …
  D = foreach …

moves through the following pipeline:

  Script → Parser → Logical Plan → Semantic Checks → Logical Plan →
  Logical Optimizer → Logical Plan → Logical-to-Physical Translator →
  Physical Plan → Physical-to-MapReduce Translator → Map-Reduce Plan →
  Map-Reduce Launcher

The Logical Plan is roughly relational algebra; the Logical Optimizer
applies standard plan optimizations.
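To see the plans this pipeline produces for a given script, Pig's explain command can be run from the grunt shell (a minimal sketch):

grunt> A = load 'users' as (name, age);
grunt> B = filter A by age > 18;
grunt> C = group B by age;
grunt> D = foreach C generate group, COUNT(B);
grunt> explain D;  -- prints the logical, physical, and map-reduce plans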
- 11 -
Cool Things We’ve Added In the Last Year
• Multiquery – Ability to combine multiple group bys into a
single MR job (0.3)
• Merge join – If data is already sorted on join key, do join via
merge in map phase (0.4)
• Skew join – Hash join for data with skew in join key. Allows
splitting of key across multiple reducers to handle skew.
(0.4)
• Zebra – Contrib project that provides columnar storage of
data (0.4)
• Rework of Load and Store functions to make them much
easier to write (0.7, branched but not released)
• Owl, a metadata service for the grid (committed, will be
released in 0.8).
- 12 -
Fragment Replicate Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Pages is split into blocks; Map 1 reads Pages block 1 and
Map 2 reads Pages block 2, and every map task also gets a complete copy
of Users. The join is done map-side, with no reduce phase; the
replicated input (Users) must be small enough to fit in memory.]
- 13 -
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: map tasks read blocks of Pages and Users, tagging each record
with its input: (1, user) for Pages and (2, name) for Users. The shuffle
routes all records with the same key to one reducer; Reducer 1 receives
(1, fred), (2, fred), (2, fred) and Reducer 2 receives (1, jane),
(2, jane), (2, jane), and each reducer crosses the two lists to produce
the joined records.]
- 14 -
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: as in the hash join, maps tag Pages records (1, user) and
Users records (2, name), but a sampling pass (S) first identifies
skewed keys. The Pages records for the hot key fred are partitioned (P)
across reducers: (1, fred, p1) and (1, fred, p2) go to Reducer 1,
(1, fred, p3) and (1, fred, p4) to Reducer 2, while the matching Users
record (2, fred) is replicated to both.]
- 15 -
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: both Pages and Users are already sorted on the join key, from
aaron through zach. Map 1 merges the first aligned range of the two
inputs (aaron … amr), Map 2 the next range (amy … barb), and so on: a
map-side merge with no shuffle or reduce phase.]
- 16 -
Multi-store script
A = load 'users' as (name, age, gender,
        city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Dataflow: load users → filter nulls → split → two branches:
  group by age, gender → apply UDFs → store into 'bydemo'
  group by state → apply UDFs → store into 'bystate']
- 17 -
Multi-Store Map-Reduce Plan
[Map-Reduce plan: in the map phase, the filter feeds a split whose two
branches each end in a local rearrange (one per group by). In the reduce
phase, a demux routes each record to the matching package → foreach
pipeline, so both group bys run in a single Map-Reduce job.]
- 18 -
What are people doing with Pig?
• At Yahoo!, ~70% of Hadoop jobs are Pig jobs
• Being used at Twitter, LinkedIn, and other companies
• Available as part of Amazon EMR web service and Cloudera
Hadoop distribution
• What users use Pig for:
– Search infrastructure
– Ad relevance
– Model training
– User intent analysis
– Web log processing
– Image processing
– Incremental processing of large data sets
- 19 -
What We’re Working on this Year
• Optimizer rewrite
• Integrating Pig with metadata
• Usability – our current error messages might as well be
written in actual Latin
• Automated usage info collection
• UDFs in Python (see the sketch below)
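A sketch of what this could look like, using the register … using jython syntax that later shipped in Pig 0.8 (myfuncs.py and its reverse function are illustrative):

register 'myfuncs.py' using jython as myfuncs;
A = load 'users' as (name, age);
B = foreach A generate myfuncs.reverse(name), age;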
- 20 -
Research Opportunities
• Cost-based optimization – how does current RDBMS technology carry
over to MR world?
• Memory Usage – given that data processing is very memory intensive
and Java offers poor control of memory usage, how can Pig be written
to use memory well?
• Automated Hadoop Tuning – Can Pig figure out how to configure
Hadoop to best run a particular script?
• Indices, materialized views, etc. – How do these traditional RDBMS
tools fit into the MR world?
• Human time queries – Analysts want access to the petabytes of data
available via Hadoop, but they don’t want to wait hours for their jobs to
finish; can Pig find a way to answer an analyst's question in under 60
seconds?
• Map-Reduce-Reduce – Can MR be made more efficient for multiple
MR jobs?
• How should Pig integrate with workflow systems?
• See more: http://wiki.apache.org/pig/PigJournal
- 21 -
Learn More
• Visit our website: http://hadoop.apache.org/pig/
• Online tutorials
– From Yahoo, http://developer.yahoo.com/hadoop/tutorial/
– From Cloudera, http://www.cloudera.com/hadoop-training
• A couple of Hadoop books are available that include chapters on Pig;
search at your favorite bookstore
• Join the mailing lists:
– pig-user@hadoop.apache.org for user questions
– pig-dev@hadoop.apache.org for developer issues
• Contribute your work; over 50 people have so far
- 22 -