Spark PPT
[Spark stack diagram: Spark Core at the base; Spark SQL (DataFrames), MLlib (ML Pipelines), GraphX, and Spark Streaming layered on top; data sources include Hadoop HDFS, HBase, Hive, Amazon S3, JSON, MySQL, streaming sources, and HPC-style file systems (GlusterFS, Lustre).]
Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed file system (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local file system, Amazon S3,
Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not
require Hadoop; it simply has support for storage
systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.
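A minimal sketch of pointing sc.textFile() at different storage layers (the paths, host name, and bucket name below are illustrative):
val local = sc.textFile("file:///home/user/data.txt")          // local file system
val hdfs  = sc.textFile("hdfs://namenode:8020/data/data.txt")  // HDFS
val s3    = sc.textFile("s3n://my-bucket/data.txt")            // Amazon S3 (URI scheme depends on your Hadoop version and credentials setup)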
Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
The driver program contains your application’s main
function and defines distributed datasets on the
cluster, then applies operations to them.
• Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
• Once you have a SparkContext, you can use it to build RDDs.
• To run operations on RDDs, driver programs typically manage a number of nodes called executors.
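For example, in the interactive shell a SparkContext is already available as the variable sc, and the driver uses it to define RDDs and launch actions that run on the executors (the file name is illustrative):
scala> val lines = sc.textFile("README.md")   // define an RDD through the SparkContext
scala> lines.count()                          // action executed in parallel across the executors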
Spark’s Python and Scala Shells
• Spark comes with interactive shells that enable ad hoc data
analysis.
• Unlike most other shells, however, which let you manipulate
data using the disk and memory on a single machine, Spark’s
shells allow you to interact with data that is distributed on
disk or in memory across many machines, and Spark takes
care of automatically distributing this processing.
• Run bin/pyspark or bin/spark-shell to open the respective shell.
• When the shell starts you will notice a lot of log messages. You can reduce the verbosity by copying conf/log4j.properties.template to conf/log4j.properties and changing log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console, as in the commands below.
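For example (run from the Spark installation directory; WARN is one common choice of level):
cp conf/log4j.properties.template conf/log4j.properties
# in conf/log4j.properties, change: log4j.rootCategory=INFO, console  ->  log4j.rootCategory=WARN, console
bin/spark-shell   # Scala shell
bin/pyspark       # Python shell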
Standalone Applications
• The main difference from using it in the shell is that you need to
initialize your own SparkContext. After that, the API is the same.
• Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("WordCount")
val sc = new SparkContext(conf)
• Once we have our build defined, we can easily package and run our
application using the bin/spark-submit script.
• The spark-submit script sets up a number of environment variables
used by Spark.
Maven build and run (mvn clean && mvn compile && mvn package)
$SPARK_HOME/bin/spark-submit \
--class com.oreilly.learningsparkexamples.mini.java.WordCount \
./target/learning-spark-mini-example-0.0.1.jar \
./README.md ./wordcounts
Word count Scala application
// Create a Scala Spark Context.
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile)
Resilient Distributed Dataset
• An RDD is simply an immutable distributed collection of elements.
• Each RDD is split into multiple partitions, which may be
computed on different nodes of the cluster.
• Once created, RDDs offer two types of operations:
transformations and actions.
• Transformations construct a new RDD from a previous one.
• Actions, on the other hand, compute a result based on an RDD,
and either return it to the driver program or save it to an external
storage system.
• In Spark all work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on RDDs to
compute a result.
Create RDDs
• Spark provides two ways to create RDDs:
loading an external dataset
val lines = sc.textFile("/path/to/README.md")
parallelizing a collection in your driver program.
The simplest way to create RDDs is to take an existing
collection in your program and pass it to SparkContext’s
parallelize() method.
Beyond prototyping and testing, this is not widely used
since it requires that you have your entire dataset in
memory on one machine.
val lines = sc.parallelize(List(1,2,3))
Lazy Evaluation
• Lazy evaluation means that when we call a transformation
on an RDD (for instance, calling map()), the operation is not
immediately performed. Instead, Spark internally records
metadata to indicate that this operation has been
requested.
• Rather than thinking of an RDD as containing specific data, it
is best to think of each RDD as consisting of instructions on
how to compute the data that we build up through
transformations. Loading data into an RDD is lazily evaluated
in the same way transformations are. So, when we call
sc.textFile(), the data is not loaded until it is necessary.
• Spark uses lazy evaluation to reduce the number of passes it
has to take over our data by grouping operations together.
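A small illustration of laziness (the file name is illustrative): nothing is read or computed until the action on the last line runs.
val lines = sc.textFile("log.txt")                          // not loaded yet
val errors = lines.filter(line => line.contains("error"))   // still nothing computed
errors.first()                                              // Spark now reads only as much of the file as it needs to find the first match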
RDD Operations(Transformations)
• Transformations are operations on RDDs that return a new RDD, such as map()
and filter().
• Transformed RDDs are computed lazily, only when you use them in an action.
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
• Transformations do not mutate the existing inputRDD. Instead, they return a pointer to an entirely new RDD.
• inputRDD can still be reused later in the program.
• Transformations can operate on any number of input RDDs; for example, union() merges two RDDs.
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph. It uses this information to compute each RDD on demand and to
recover lost data if part of a persistent RDD is lost.
Common Transformations
• Element-wise transformations
map()
The map() transformation takes in a function and applies it to each
element in the RDD with the result of the function being the new value
of each element in the resulting RDD.
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
filter()
The filter() transformation takes in a function and returns an RDD that
only has elements that pass the filter() function.
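For example (the input strings are illustrative):
val lines = sc.parallelize(List("error: disk full", "all good", "error: timeout"))
val errors = lines.filter(line => line.contains("error"))
errors.collect()   // Array(error: disk full, error: timeout)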
flatMap()
Sometimes we want to produce multiple output elements for each input
element. The operation to do this is called flatMap().
val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first() // returns "hello"
Examples(Transformation)
• Create RDD / collect(Action)
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
rdd.collect()
• Filter
val filteredrdd = rdd.filter(x=>x>2)
• Distinct
val rdddist = rdd.distinct()
rdddist.collect()
• Map (square a number)
val rddsquare = rdd.map(x=>x*x);
• Flatmap
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" "))
• Sample an RDD
rdd.sample(false,0.5).collect()
Examples(Transformation) Cont…
• Create RDD
val x = Seq(1,2,3)
val y = Seq(3,4,5)
val rdd1 = sc.parallelize(x)
val rdd2 = sc.parallelize(y)
• Union
rdd1.union(rdd2).collect()
• Intersection
rdd1.intersection(rdd2).collect()
• Subtract
rdd1.subtract(rdd2).collect()
• Cartesian
rdd1.cartesian(rdd2).collect()
RDD Operations(Actions)
• Actions are operations that return a result to the driver
program or write it to storage, and kick off a
computation, such as count() and first().
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they need
to actually produce output.
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
• It is important to note that each time we call a new
action, the entire RDD must be computed “from scratch.”
Examples(Action)
• Reduce
The most common action on basic RDDs you will likely use is reduce(), which takes a
function that operates on two elements of the type in your RDD and returns a new
element of the same type. A simple example of such a function is +, which we can
use to sum our RDD. With reduce(), we can easily sum the elements of our RDD,
count the number of elements, and perform other types of aggregations.
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
val sum = rdd.reduce((x,y)=> x+y)
• Fold
Similar to reduce() is fold(), which also takes a function with the same signature as
needed for reduce(), but in addition takes a “zero value” to be used for the initial call
on each partition. The zero value you provide should be the identity element for your
operation;
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq, 2)
// Note: 1 is not the identity for +; it is applied once per partition and once more
// when the partition results are merged, so with 2 partitions the result is 19 + 3 = 22.
val rdd2 = rdd.fold(1)((x, y) => x + y)
Examples(Action) cont.…
• Aggregate
val rdd = sc.parallelize(List(1,2,3,4,5,6), 5)
// Aggregate a (sum, count) pair, then divide to get the average.
val rdd2 = rdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
val avg = rdd2._1 / rdd2._2.toDouble
Examples(Action) cont.…
• takeOrdered(num)(ordering)
Reverse Order (Highest number)
val seq = Seq(3,9,2,3,5,4)
val rdd = sc.parallelize(seq,2)
rdd.takeOrdered(1)(Ordering[Int].reverse)
Custom Order (Highest based on age)
case class Person(name:String,age:Int)
val rdd = sc.parallelize(Array(("x",10),("y",14),("z",12)))
val rdd2 = rdd.map(x=>Person(x._1,x._2))
rdd2.takeOrdered(1)(Ordering[Int].reverse.on(x=>x.age))
(highest/lowest repeated word in word count program)
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2)) //Lowest value
rdd3.takeOrdered(3)(Ordering[Int].reverse.on(x=>x._2)) //Highest value
Persist(Cache) RDD
• Spark’s RDDs are by default recomputed each time you run an action on them. If you
would like to reuse an RDD in multiple actions, you can ask Spark to persist it using
RDD.persist().
• After computing it the first time, Spark will store the RDD contents in memory
(partitioned across the machines in your cluster), and reuse them in future actions.
Persisting RDDs on disk instead of memory is also possible.
• If a node that has data persisted on it fails, Spark will recompute the lost partitions of
the data when needed.
• We can also replicate our data on multiple nodes if we want to be able to handle
node failure without slowdown.
• If you attempt to cache too much data to fit in memory, Spark will automatically evict
old partitions using a Least Recently Used (LRU) cache policy. Caching unnecessary
data can lead to eviction of useful data and more recomputation time.
• RDDs come with a method called unpersist() that lets you manually remove them
from the cache.
• cache() is the same as calling persist() with the default storage level.
Persistence levels
Note: cache() and persist() only mark an RDD for persistence; the data is actually stored the first time the RDD is computed, that is, when an action runs on that RDD or on an RDD derived from it.
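The storage levels live in org.apache.spark.storage.StorageLevel (MEMORY_ONLY is the default used by cache(); others include MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, and DISK_ONLY). A minimal sketch of persisting with an explicit level:
import org.apache.spark.storage.StorageLevel
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_AND_DISK)   // only marks the RDD; nothing is stored yet
result.count()                                 // first action computes and caches the partitions
println(result.collect().mkString(","))        // reuses the cached data
result.unpersist()                             // manually remove it from the cache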
Spark Summary
• To summarize, every Spark program and shell
session will work as follows:
Create some input RDDs from external data.
Transform them to define new RDDs using
transformations like filter().
Ask Spark to persist() any intermediate RDDs that will
need to be reused. cache() is the same as calling
persist() with the default storage level.
Launch actions such as count() and first() to kick off a
parallel computation, which is then optimized and
executed by Spark.
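Putting the four steps together (the file name and filter condition are illustrative):
val input = sc.textFile("log.txt")                          // 1. create an input RDD from external data
val errors = input.filter(line => line.contains("error"))   // 2. transform it
errors.persist()                                            // 3. persist the intermediate RDD for reuse
println(errors.count())                                     // 4. actions launch the parallel computation
println(errors.first())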
Pair RDD (Key/Value Pair)
• Spark provides special operations on RDDs containing key/value
pairs, called pair RDDs.
• Key/value RDDs are commonly used to perform aggregations, and
often we will do some initial ETL to get our data into a key/value
format.
• Key/value RDDs expose new operations (e.g., counting up reviews
for each product, grouping together data with the same key, and
grouping together two different RDDs)
• Creating Pair RDDs
Some loading formats directly return the Pair RDDs.
Using map function
Scala
val pairs = lines.map(x => (x.split(" ")(0), x))
Transformations on Pair RDDs
• Pair RDDs are allowed to use all the transformations
available to standard RDDs.
• Aggregations
When datasets are described in terms of key/value pairs,
it is common to want to aggregate statistics across all
elements with the same key. We have looked at the
fold(), combine(), and reduce() actions on basic RDDs, and
similar per-key transformations exist on pair RDDs. Spark
has a similar set of operations that combines values that
have the same key. These operations return RDDs and
thus are called transformations rather than actions.
Transformations...cont…
• reduceByKey()
is quite similar to reduce(); both take a function and use it to combine values.
reduceByKey() runs several parallel reduce operations, one for each key in the
dataset, where each operation combines values that have the same key. Because
datasets can have very large numbers of keys, reduceByKey() is not implemented
as an action that returns a value to the user program. Instead, it returns a new
RDD consisting of each key and the reduced value for that key.
• Example Word count in Scala
val input = sc.textFile("s3://...")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
• Custom Accumulators
Spark supports accumulators of type Int, Double, Long, and Float.
Spark also includes an API to define custom accumulator types and custom aggregation
operations.
Custom accumulators need to extend AccumulatorParam.
we can use any operation for add, provided that operation is commutative and associative.
An operation op is commutative if a op b = b op a for all values a, b.
An operation op is associative if (a op b) op c = a op (b op c) for all values a, b, and c.
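A minimal sketch of a custom accumulator using the classic (pre-2.0) AccumulatorParam API that matches the rest of these examples; max is commutative and associative, so it qualifies (the object name is illustrative):
import org.apache.spark.AccumulatorParam
// Custom accumulator that tracks a running maximum.
object MaxAccumulatorParam extends AccumulatorParam[Int] {
  def zero(initialValue: Int): Int = Int.MinValue
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)
}
val maxAcc = sc.accumulator(Int.MinValue)(MaxAccumulatorParam)
sc.parallelize(List(4, 2, 5, 3, 1, 4)).foreach(x => maxAcc += x)
println(maxAcc.value)   // 5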
Broadcast Variables
• A broadcast variable is a shared variable that lets the program efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations.
Without broadcasting, it is expensive to ship such a value (for example, a large Array) from the driver with every task, and if the same object were used again later (say, running the same code on other files), it would be sent to each node once more.
val c = sc.broadcast(Array(100, 200))
val a = sc.parallelize(List("This is Krishna", "Sureka learning the Spark"))
val d = a.flatMap(x => x.split(" ")).map(x => x + c.value(1))
• Optimizing Broadcasts
When we are broadcasting large values, it is important to choose a data serialization format
that is both fast and compact.
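One common choice is Kryo; a sketch of enabling it when building the SparkConf (spark.serializer is a standard Spark configuration key):
val conf = new SparkConf()
  .setAppName("BroadcastExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // fast, compact serialization
val sc = new SparkContext(conf)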
Working on a Per-Partition Basis
Working with data on a per-partition basis allows us to avoid redoing
setup work for each data item. Operations like opening a database
connection or creating a random number generator are examples of
setup steps that we wish to avoid doing for each element.
Spark has per-partition versions of map and foreach to help reduce the
cost of these operations by letting you run code only once for each
partition of an RDD.
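A minimal sketch of the pattern; the MD5 digest below is just an illustrative stand-in for expensive setup such as opening a database connection:
val ids = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val hashed = ids.mapPartitions { iter =>
  val digest = java.security.MessageDigest.getInstance("MD5")   // created once per partition, not once per element
  iter.map(id => digest.digest(id.toString.getBytes).map("%02x".format(_)).mkString)
}
hashed.collect()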
Example-1 (mapPartitions)
val a = sc.parallelize(List(1,2,3,4,5,6),2)
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.toList.map(y=>y+1).iterator})
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.map(y=>y+1)})
scala> b.collect
Hello
Hello
res20: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> a.mapPartitions(x=>x.filter(y => y > 1)).collect
res35: Array[Int] = Array(2, 3, 4, 5, 6)
Working on a Per-Partition Basis
Example-2 (mapPartitionsWithIndex)
scala> val b = a.mapPartitionsWithIndex((index: Int, x: Iterator[Int]) => { println("Hello from " + index); if (index == 1) { x.map(y => y + 1) } else { x.map(y => y + 200) } })
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at mapPartitionsWithIndex at <console>:29
scala> b.collect
Hello from 0
Hello from 1
res44: Array[Int] = Array(201, 202, 203, 5, 6, 7)
Practice Session
SPARK SQL
scala> case class person(id:Int, name:String, sal:Int)
defined class person

scala> person
res38: person.type = person

scala> val a = sc.textFile("file:///home/vm4learning/Desktop/spark/emp")
a: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[36] at textFile at <console>:27

scala> a.collect
res39: Array[String] = Array(837798 Krishna 5000000, 843123 Sandipta 10000000, 844066 Abhishek 40000, 843289 ZohriSahab 9999999)

scala> val b = a.map(_.split(" "))
b: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[37] at map at <console>:29

scala> b.collect
res40: Array[Array[String]] = Array(Array(837798, Krishna, 5000000), Array(843123, Sandipta, 10000000), Array(844066, Abhishek, 40000), Array(843289, ZohriSahab, 9999999))

scala> val c = b.map(x => person(x(0).toInt, x(1), x(2).toInt))
c: org.apache.spark.rdd.RDD[person] = MapPartitionsRDD[38] at map at <console>:33

scala> c.collect
res41: Array[person] = Array(person(837798,Krishna,5000000), person(843123,Sandipta,10000000), person(844066,Abhishek,40000), person(843289,ZohriSahab,9999999))

scala> val d = c.toDF()
d: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]

scala> d.collect
res42: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000], [843123,Sandipta,10000000], [844066,Abhishek,40000], [843289,ZohriSahab,9999999])

scala> d.registerTempTable("Emp")

scala> val e = sqlContext.sql("""select id,name,sal from Emp where sal < 9999999""")
e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]

scala> e.collect
res44: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000], [844066,Abhishek,40000])

scala> e
res46: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]

scala> val a1 = sc.textFile("file:///home/vm4learning/Desktop/spark/dept")
a1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27

scala> a1.collect
res49: Array[String] = Array(843123 CEO, 844066 SparkCommiter, 843289 BoardHSBC)

scala> case class pdept(id:Int, dept:String)
defined class pdept

scala> val a2 = a1.map(x => x.split(" ")).map(x => pdept(x(0).toInt, x(1))).toDF()
a2: org.apache.spark.sql.DataFrame = [id: int, dept: string]

scala> a2.collect
res50: Array[org.apache.spark.sql.Row] = Array([843123,CEO], [844066,SparkCommiter], [843289,BoardHSBC])

scala> a2.registerTempTable("Dept")

scala> val e = sqlContext.sql("""select * from Emp join Dept on Emp.id = Dept.id where sal < 9999999""")
e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, id: int, dept: string]

scala> e.collect
res52: Array[org.apache.spark.sql.Row] = Array([844066,Abhishek,40000,844066,SparkCommiter])

scala> sqlContext.cacheTable("Dept")

scala> sqlContext.uncacheTable("Dept")