Scala and the JVM for Big Data: Lessons from Spark
polyglotprogramming.com/talks
dean.wampler@lightbend.com
@deanwampler
©Dean Wampler 2014-2019, All Rights Reserved
Spark
A Distributed Computing Engine on the JVM
Resilient Distributed Datasets

[Diagram: a cluster of nodes; the RDD is split into partitions, one or more per node.]
Productivity?
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
Example: Inverted Index
Web Crawl

[Diagram: a web crawl writes pages such as wikipedia.org/hadoop ("Hadoop provides MapReduce and HDFS") and wikipedia.org/hbase ("HBase stores data in HDFS") into index blocks of (path, contents) records.]
Compute Inverted Index

[Diagram: the crawl's index blocks (e.g., wikipedia.org/hive, "Hive queries...") are transformed into inverse index blocks.]
Inverted Index

The inverse index maps each word to a list of (path, count) pairs:

word     (path, count) list
hadoop   (.../hadoop,1)
hbase    (.../hbase,1), (.../hive,1)
hdfs     (.../hadoop,1), (.../hbase,1), (.../hive,1)
hive     (.../hive,1)
...      ...

Miracle!!
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// The slide shows only the skeleton of the job:
// a map step, another map step, then saveAsTextFile.
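A minimal runnable sketch of the whole job, assuming the crawl output is tab-separated (path, text) lines; paths and tokenization details are illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "Inverted Index")
sc.textFile("data/crawl")
  .map { line =>
    val Array(path, text) = line.split("\t", 2)  // (path, contents)
    (path, text)
  }
  .flatMap { case (path, text) =>
    text.split("""\W+""").filter(_.nonEmpty)
      .map(word => ((word.toLowerCase, path), 1))
  }
  .reduceByKey(_ + _)                                  // count word hits per page
  .map { case ((word, path), n) => (word, (path, n)) }
  .groupByKey                                          // word -> all (path, n) pairs
  .mapValues(_.toSeq.sortBy(-_._2).mkString(", "))     // most-cited pages first
  .saveAsTextFile("output/inverse-index")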
Performance?

[Diagram: the job's RDD lineage, textFile -> map -> map -> saveAsTextFile, with intermediate results cached in memory.]
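In the RDD API, keeping an intermediate result in memory is explicit. A small sketch (data and transformations are illustrative):

val lines = sc.textFile("data/crawl")
val pairs = lines.map(line => (line, line.length)).cache() // hold partitions in memory
pairs.count() // the first action computes and caches
pairs.first() // later actions reuse the cached partitions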
Higher-Level APIs
SQL: Datasets/DataFrames
Example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
The join condition, flights("tailNum") === planes("tailNum"), is not an “arbitrary” anonymous function, but a Column instance, which Spark can analyze and optimize.
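The temp views registered above make the same query available through SQL; a sketch of the equivalent query (the variable name is illustrative):

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")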
Performance

The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] ={
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
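A couple of these operators in action, continuing the flights example above (a sketch):

import org.apache.spark.sql.functions.desc

val byTail = flights.groupBy(flights("tailNum")).count()
byTail.orderBy(desc("count")).limit(10).show() // ten most frequent tail numbers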
Structured Streaming
DStream (discretized stream)

[Diagram: a continuous flow of events chopped into successive batches; each batch is an RDD processed as a mini-batch.]
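For contrast, a minimal Structured Streaming sketch; the socket source and console sink are illustrative:

val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load() // an unbounded DataFrame of lines

val counts = events.groupBy("value").count() // running count per distinct line

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()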
ML/MLlib

K-Means
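A sketch of K-Means with the DataFrame-based spark.ml API, assuming a DataFrame named dataset with a "features" vector column:

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(3).setSeed(1L)
val model = kmeans.fit(dataset) // expects a "features" Vector column
model.clusterCenters.foreach(println)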
The JVM
20 Years of DevOps
But it’s not perfect...
Richer data libraries in Python & R.
Garbage Collection
GC Challenges

•Typical Spark heaps: 10s-100s GB.
•Uncommon for “generic”, non-data services.
GC Challenges

•Caching too many RDDs fills the old generation with huge amounts of garbage.
•Billions of objects => long GC pauses.
Tuning GC

•Best for Spark:
•-XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx... -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...

databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
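One way to supply such flags from code; the numeric values here are illustrative assumptions, not recommendations (heap sizes go through spark.executor.memory, not extraJavaOptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8")
  .config("spark.executor.memory", "32g")
  .getOrCreate()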
JVM Object Model
Java Objects?

•“abcd”: 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12-byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars.
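Spark's own SizeEstimator (used again below) can confirm this; exact numbers vary with JVM version and flags:

import org.apache.spark.util.SizeEstimator

SizeEstimator.estimate("abcd") // ~48 bytes for 4 characters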
Arrays

[Diagram: val myArray: Array[String], holding references at indices 0-3, each pointing to a separate String object: "zeroth", "first", "second", "third".]
Class Instances

[Diagram: val person: Person, with fields name: String (a reference to "Buck Trends"), age: Int (29), addr: Address (a reference to another object), and so on.]
Hash Maps

[Diagram: a hash map, an array of hash-code buckets (h/c1...h/c4), each entry referencing a separate key object ("a key") and value object ("a value").]
Improving Performance
Reduce References

•Fewer, bigger objects to GC.
•Fewer cache misses.

[Diagram: the Array[String], Person instance, and hash map pictures from above, repeated; every arrow is another reference to chase and another small object to collect.]
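One way to apply the idea: flatten an Array[String] (one object plus one reference per element) into a single byte array plus an offsets array. A sketch:

val words   = Array("zeroth", "first", "second", "third")
val bytes   = words.flatMap(_.getBytes("UTF-8"))                 // one flat buffer
val offsets = words.scanLeft(0)(_ + _.getBytes("UTF-8").length)  // where each word starts

def word(i: Int): String =
  new String(bytes, offsets(i), offsets(i + 1) - offsets(i), "UTF-8")

word(2) // "second"; the store itself is two flat arrays, no per-element objects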
Less Expression Overhead

sql("SELECT a + b FROM table")
Object Encoding

New CompactRow type:

[Diagram: row layout: null bit set (1 bit/field) | values (8 bytes/field) | variable-length section.]
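A sketch of the fixed-width arithmetic behind such a layout (an illustration of the scheme, not Spark's exact code):

def fixedWidth(numFields: Int): Int = {
  val nullBitsetBytes = ((numFields + 63) / 64) * 8 // null bits, rounded up to 8-byte words
  nullBitsetBytes + 8 * numFields                   // plus one 8-byte slot per field
}

fixedWidth(3) // 8 + 24 = 32 bytes, before any variable-length data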
•BytesToBytesMap:

[Diagram: hash codes (h/c1...h/c4) index into Tungsten memory pages, where keys and values are laid out contiguously: k1 v1 k2 v2, k3 v3 k4 v4, ...]
•Compare:

[Diagram: the conventional hash map from above, with separate key and value objects behind each bucket, next to the BytesToBytesMap, where each bucket points into a Tungsten memory page holding keys and values inline.]
Memory Management

•Some allocations off heap.
•sun.misc.Unsafe.
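A minimal sketch of what off-heap allocation through sun.misc.Unsafe looks like; illustrative only, not Spark's code:

import sun.misc.Unsafe

// The constructor is private; get the singleton reflectively.
val field  = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val addr = unsafe.allocateMemory(1024) // 1 KB outside the GC-managed heap
unsafe.putLong(addr, 42L)
assert(unsafe.getLong(addr) == 42L)
unsafe.freeMemory(addr)                // manual free; the GC never sees it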
Less Expression Overhead

sql("SELECT a + b FROM table")

•Solution:
•Generate custom byte code.
•Spark 1.X: for subexpressions.
•Spark 2.0: for whole queries.
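Why generated code wins: without it, every row pays for tree walking and virtual dispatch. A toy sketch of the interpreted alternative (not Spark's classes):

sealed trait Expr { def eval(row: Map[String, Long]): Long }
case class Attr(name: String) extends Expr {
  def eval(row: Map[String, Long]): Long = row(name)
}
case class Add(left: Expr, right: Expr) extends Expr {
  def eval(row: Map[String, Long]): Long = left.eval(row) + right.eval(row)
}

val aPlusB = Add(Attr("a"), Attr("b"))
aPlusB.eval(Map("a" -> 1L, "b" -> 2L)) // 3: virtual calls and map lookups per row

// Code generation instead compiles the query down to the equivalent of:
//   (a: Long, b: Long) => a + b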
No Value Types
What’s factorial(-1)?
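Scala can express the constraint, but the JVM has no true value types to back it. A sketch (the class name is illustrative):

class NonNegativeInt(val value: Int) extends AnyVal
object NonNegativeInt {
  def apply(i: Int): NonNegativeInt = {
    require(i >= 0, s"$i is negative")
    new NonNegativeInt(i)
  }
}

def factorial(i: NonNegativeInt): BigInt =
  (1 to i.value).map(BigInt(_)).product

// AnyVal avoids allocation in simple cases, but still boxes in arrays and
// generic contexts, exactly where big data code lives.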
Arrays Indexed with Ints

Byte arrays limited to 2GB!
scala> val N = 1100*1000*1000
N: Int = 1100000000 // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(array)
res3: Long = 2200000016 // 2.2GB
scala> val b = sc.broadcast(array)
b: ...broadcast.Broadcast[Array[Short]] = ...

scala> SizeEstimator.estimate(b)
res0: Long = 2368 // just the tiny Broadcast wrapper, not the data
Boom!

java.lang.OutOfMemoryError:
Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
But wait... I actually lied to you...
Spark handles large broadcast variables by breaking them into blocks.
Scala REPL
java.lang.OutOfMemoryError:
Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
  at java.io.ByteArrayOutputStream.write(...)
  ...
  at java.io.ObjectOutputStream.writeObject(...)
  at ...spark.serializer.JavaSerializationStream.writeObject(...)
  ...
  at ...spark.util.ClosureCleaner$.ensureSerializable(..)
  ...
  at org.apache.spark.rdd.RDD.map(...)
Reading the trace bottom-up:

•at org.apache.spark.rdd.RDD.map(...): we pass this closure to RDD.map: i => b.value(i)
•at ...spark.util.ClosureCleaner$.ensureSerializable(..): Spark verifies that it’s “clean” (serializable)...
•at ...spark.serializer.JavaSerializationStream.writeObject(...): ...which it does by serializing to a byte array...
•at java.util.Arrays.copyOf(...): ...which requires copying an array... What array???

scala> val array = Array.fill[Short](N)(0)
Why did this happen?
•You write:

scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     | map(i => b.value(i))
•Scala compiles:

class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  ...
}

The closure i => b.value(i) references b, a field of the REPL’s $iwC wrapper, so serializing the closure drags in the whole wrapper, including the 2.2GB array.
•Transient is often all you need:

scala> @transient val array =
     | Array.fill[Short](N)(0)
scala> ...
object Data { // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i)) // only needs b
    rdd.take(10).foreach(println)
  }
}
Why Scala?

See the longer version of this talk at polyglotprogramming.com/talks
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
dean.wampler@lightbend.com
@deanwampler
Questions?
Bonus Material

You can find an extended version of this talk with more details at polyglotprogramming.com/talks