ds2 5 Pig Pyspark
Agenda
● Pig
● PySpark
○ Basics
○ Toy problem
○ Machine Learning Pipeline
● Demo
○ GCP - PySpark
Hadoop Zoo
Pig
● Platform for data processing
● Components:
○ Pig Latin, and
○ Infrastructure layer
● Built-in operators and functions
● Extensible via UDFs (User-Defined Functions)
● Extract, Transform, Load (ETL) operations
● Manipulating, analyzing “raw” data
What is Pig?
● Framework for analyzing large unstructured and
semi-structured data on top of Hadoop.
○ The Pig Engine parses and compiles Pig Latin scripts into
MapReduce jobs that run on top of Hadoop.
○ A high-level language interface for Hadoop.
Motivation of Using Pig
● Faster development
○ Fewer lines of code (compared to native MapReduce)
○ Code re-use (Pig libraries, PiggyBank)
● One test: find the top 5 most frequent words
○ 10 lines of Pig Latin vs. 200 lines in Java
○ Development time: 15 minutes in Pig Latin vs. 4 hours in Java
Word Count using MapReduce
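In the MapReduce style, even a word count needs an explicit mapper and a reducer. A minimal sketch using Hadoop Streaming with Python (hypothetical file names mapper.py and reducer.py; Streaming sorts the mapper output by key before it reaches the reducer):

# mapper.py -- read lines from stdin, emit one (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- input arrives sorted by word; sum the counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))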
Word Count using Pig
● Used to
○ Process images
○ Built-in optimizer
● "Pigs fly": Pig is designed to process data quickly
dump pp;  -- Pig Latin statement that writes the contents of relation pp to the console
Apache Spark
● A general-purpose data processing engine, focused on
in-memory distributed computing use cases and higher-level
data manipulation primitives.
● Spark took many concepts from MapReduce and implemented
them in new ways.
● Spark's APIs are written in Scala and can be called from Scala,
Python, Java, and, more recently, R.
● Spark is built around the concept of an RDD.
● RDD stands for Resilient Distributed Dataset: resilient in the
sense that data can be recreated on the fly in the event of a
machine failure, and distributed in the sense that operations can
happen in parallel on multiple machines working on blocks of
data.
● Spark uses YARN for scheduling and resource management.
Apache Spark Architecture
● Apache Spark is a data access engine for fast,
large-scale data processing hosted on YARN.
● In-memory computations
● Interactive data mining
● APIs for Scala, Java, and Python
● Built-in libraries
○ ETL
○ Machine learning
○ Graph workloads
○ Stream processing.
Comparing Apache Spark to MapReduce
● The initial motivation for Spark: iterative applications
● A much shorter version of the mapper and reducer code we saw earlier (a sketch follows).
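A rough PySpark equivalent of the word count (the input path is a placeholder; sc is an existing SparkContext):

counts = (sc.textFile("hdfs:///path/to/input.txt")   # placeholder input path
            .flatMap(lambda line: line.split())       # one record per word
            .map(lambda word: (word, 1))               # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))          # sum the counts per word
print(counts.take(5))                                  # action: triggers the computation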
Apache Spark RDDs
● Each RDD is a fault-tolerant collection of elements that can be operated
on in parallel.
Apache Spark RDDs
Two types of operations: transformations and actions.
● You can chain transformations and keep deriving new RDDs from the
current one.
● All transformations in Spark are lazy, in that they do not compute their
results right away. The transformations are only computed when an
action requires a result to be returned to the client. This design enables
Spark to run more efficiently.
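A small sketch of the difference (assuming an existing SparkContext sc): the two transformations only build up a lineage, and nothing runs until the action at the end.

rdd = sc.parallelize([1, 2, 3, 4, 5])          # RDD from a local Python list
squares = rdd.map(lambda x: x * x)              # transformation: lazy, nothing computed yet
evens = squares.filter(lambda x: x % 2 == 0)    # another lazy transformation
print(evens.collect())                          # action: the whole chain runs now -> [4, 16]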
Using Spark in Python
● Creating the connection 🡪 creating an instance of
the SparkContext class.
● http://spark.apache.org/docs/2.1.0/api/python/pyspark.html
○ low level object that lets Spark work its magic by splitting data across multiple nodes in the
cluster;
○ RDDs are hard to work with directly 🡪 Spark DataFrame abstraction built on top of RDDs;
○ optimized queries
● Create a SparkSession (the DataFrame entry point) with the SparkSession.builder.getOrCreate() method.
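A minimal sketch of creating (or reusing) a session:

from pyspark.sql import SparkSession

# Returns the existing SparkSession if one is already running,
# otherwise creates a new one
spark = SparkSession.builder.getOrCreate()
print(spark)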
Viewing Tables
● Your SparkSession has an attribute called catalog which lists all the data
inside the cluster.
● One of the most useful is the .listTables() method, which returns the
names of all the tables in your cluster as a list.
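For example (assuming the SparkSession is called spark and a flights table has already been registered):

# List all tables registered in the Spark catalog
print(spark.catalog.listTables())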
Query
● One advantage of the DataFrame interface is that you can run SQL queries on
the tables in your Spark cluster.
● Data: flights table.
○ Row: every flight that left Portland International Airport (PDX) or Seattle-Tacoma International
Airport (SEA) in 2014 and 2015.
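A sketch of querying that table with SQL (the column names are assumptions about the flights data):

# Run a SQL query against the registered flights table; the result is a DataFrame
query = "SELECT origin, dest, air_time FROM flights LIMIT 10"
flights10 = spark.sql(query)
flights10.show()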
● Why deal with pandas at all? Wouldn't it be easier to just read a text
file straight into Spark?
● The .filter() method filters the rows of the table based on some user-specified logical
condition. The resulting table contains the rows where your condition is true. For example,
if you had a table of students and grades you could do:
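For instance (a hypothetical students table with a grade column):

# Keep only the rows where the condition holds; .filter() accepts either
# a SQL-style string or a Column expression -- the two lines are equivalent
passing = students.filter("grade > 60")
passing = students.filter(students.grade > 60)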
● This is done in SQL using the GROUP BY command. This command breaks your data
into groups and applies a function from your SELECT statement to each group;
● For example, if you wanted to count the number of flights from each of two origin
airports, you could use the following query.
○ The following query counts the number of flights from SEA and PDX to every destination
airport:
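A sketch of that query (column names assumed from the flights table):

# Count flights per (origin, destination) pair
query = """
SELECT origin, dest, COUNT(*) AS flight_count
FROM flights
GROUP BY origin, dest
"""
flight_counts = spark.sql(query)
flight_counts.show()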
● Before joining the planes table to the flights data, first rename the year column of planes
to plane_year to avoid duplicate column names.
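A sketch of the rename followed by the join (assuming the two tables share an aircraft tail-number column, here called tailnum):

# Rename 'year' in planes so it does not clash with the flights 'year' column
planes = planes.withColumnRenamed("year", "plane_year")

# Left-outer join: keep every flight, attach plane info where available
model_data = flights.join(planes, on="tailnum", how="leftouter")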
DP: Data Types
● It is important to know that Spark only handles numeric data for modeling.
○ all of the columns in your DataFrame must be either integers or decimals (called 'doubles' in Spark).
● When we import our data 🡪 Spark guesses what kind of information each column holds. Unfortunately,
Spark doesn't always guess right.
○ some of the columns in our DataFrame are strings containing numbers as opposed to actual numeric values.
● To remedy this, you can use the .cast() method in combination with the .withColumn()
method.
○ .cast() works on columns, while .withColumn() works on DataFrames.
● The only argument you need to pass to .cast() is the kind of value you want to create, in
string form.
○ For example, to create integers, you'll pass the argument "integer" and for decimal numbers you'll use "double".
● You can put this call to .cast() inside a call to .withColumn() to overwrite the already existing
column
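For example, an arrival-delay column stored as strings could be overwritten with an integer version of itself (arr_delay is an assumed column name):

# Cast the string column to integer and overwrite it in place
model_data = model_data.withColumn(
    "arr_delay", model_data.arr_delay.cast("integer")
)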
DP: String to integer
● Use the .cast() method to convert all the appropriate columns from
your DataFrame model_data to integers!
● As a little trick to make your code more readable, you can break long chains
across multiple lines by putting .\ then calling the method on the next line,
like so:
string = string.\
upper().\
lower()
● The plane_year column holds the year each plane was manufactured. However, our
model will use the plane's age, which is slightly different from the year it
was made!
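A sketch of deriving the age column (assuming year is the year of the flight and plane_year the year of manufacture):

# Age of the plane at the time of the flight
model_data = model_data.withColumn(
    "plane_age", model_data.year - model_data.plane_year
)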
DP: Making a Boolean
● Consider that you're modeling a yes or no question: is the flight late?
● However, your data contains the arrival delay in minutes for each flight
🡪 derive a boolean column that indicates whether the flight was late or not!
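A sketch, assuming the delay lives in an arr_delay column and the label column is called label:

# True when the flight arrived late (positive delay in minutes)
model_data = model_data.withColumn("is_late", model_data.arr_delay > 0)

# Convert the boolean to an integer label column for modeling
model_data = model_data.withColumn("label", model_data.is_late.cast("integer"))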
DP: Strings and factors
● Spark requires numeric data for modeling.
● So far this hasn't been an issue; even boolean columns can easily be
converted to integers without any trouble.
○ What about strings? E.g. airline and the plane's destination as features in your model.
● Spark's feature Transformers (e.g., StringIndexer and OneHotEncoder) handle this. The
inputCol is the name of the column you want to index or encode, and the
outputCol is the name of the new column that the Transformer should create.
● Then, from the model's point of view, every observation is a vector that
contains all of the information about it and a label that tells the modeler
what value that observation corresponds to. All of the Spark modeling
routines expect the data to be in this form.
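A sketch of the usual pattern with pyspark.ml.feature (the column names are assumptions, and the exact one-hot class name varies slightly across Spark releases):

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Map each carrier string to a numeric index, then one-hot encode that index
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
carr_encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")

# Combine all feature columns into a single 'features' vector column
vec_assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact", "plane_age"],
    outputCol="features",
)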
Create the pipeline
● Pipeline is a class in the pyspark.ml module that combines all the
Estimators and Transformers that you've already created. This lets you
reuse the same modeling process over and over again by wrapping it up
in one simple object.
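A sketch, reusing the Transformer stages defined above and an assumed 60/40 train/test split:

from pyspark.ml import Pipeline

# Stages run in order: index -> encode -> assemble features
flights_pipe = Pipeline(stages=[carr_indexer, carr_encoder, vec_assembler])

# Fit the pipeline, transform the data, then split into training and test sets
piped_data = flights_pipe.fit(model_data).transform(model_data)
training, test = piped_data.randomSplit([0.6, 0.4])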
● For hyperparameter tuning, the submodule pyspark.ml.tuning includes a class called ParamGridBuilder that builds the grid of hyperparameter values to search over.
● You'll need to use the .addGrid() and .build() methods to create a grid that you can use for cross
validation. The .addGrid() method takes a model parameter (an attribute of the model Estimator, lr)
and a list of values that you want to try.
● The .build() method takes no arguments, it just returns the grid that you'll use later.
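A sketch, assuming a logistic regression Estimator lr and two of its hyperparameters (regParam and elasticNetParam); the listed values are arbitrary examples:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LogisticRegression()

# One grid entry per combination of the listed hyperparameter values
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 1.0])
        .build())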
Grid Search
● The submodule pyspark.ml.tuning also has a class
called CrossValidator for performing cross validation.
This Estimator takes the modeler you want to fit, the grid of
hyperparameters you created, and the evaluator you want to use to
compare your models.
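A sketch, combining the estimator, the grid, and an area-under-ROC evaluator (all names carried over from the earlier sketches):

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

# Compare candidate models by area under the ROC curve
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=evaluator)

# Fit on the training data and keep the best model found by the search
best_lr = cv.fit(training).bestModel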
Evaluating binary classifiers
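Continuing the sketch above, the held-out test set is scored with the best model and then passed to the evaluator:

# Score the test set; test_results holds predictions plus the original columns
test_results = best_lr.transform(test)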
print(evaluator.evaluate(test_results))
71.25%