
Scala and the JVM for Big Data:

Lessons from Spark

polyglotprogramming.com/talks
dean.wampler@lightbend.com
@deanwampler

1
©Dean Wampler 2014-2019, All Rights Reserved
Spark

2
A Distributed
Computing Engine
on the JVM
3
Resilient Distributed Datasets (RDDs)
[Diagram: a cluster of nodes; an RDD is split into partitions, with one or more partitions stored on each node.]
4
Productivity?

Very concise, elegant, functional APIs.


•Scala, Java
•Python, R
•... and SQL!
5
Productivity?

Interactive shell (REPL)


•Scala, Python, R, and SQL

6
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
7
8
Example:
Inverted Index
9
Web Crawl
[Diagram: crawling pages such as wikipedia.org/hadoop ("Hadoop provides MapReduce and HDFS ...") and wikipedia.org/hbase ("HBase stores data in HDFS ...") produces an index of blocks of (url, contents) records, e.g. (wikipedia.org/hadoop, "Hadoop provides..."), (wikipedia.org/hbase, "HBase stores...").]
10
Compute Inverted Index
[Diagram: the (url, contents) index blocks are transformed ("Miracle!!") into an inverted index mapping each word to its (url, count) pairs, e.g. hadoop -> (.../hadoop,1); hbase -> (.../hbase,1),(.../hive,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1).]
11
Inverted Index
[Diagram detail: the inverted index blocks: hadoop -> (.../hadoop,1); hbase -> (.../hbase,1),(.../hive,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1).]
12
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                            // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word, id), 1)) // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.
  mapValues {
    seq => sortByCount(seq)                         // Sort the value seq by count, descending.
  }.saveAsTextFile("/path/to/output")
13
The same code, step by step, with the types of the intermediate RDDs:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").       // RDD[String]:
                                               //   .../hadoop, Hadoop provides...
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                       // RDD[(String,String)]:
  }.flatMap {                                  //   (.../hadoop, Hadoop provides...)
    case (id, contents) =>
      toWords(contents).map(w => ((w, id), 1))
  }.reduceByKey(_ + _).                        // RDD[((String,String),Int)]:
  map {                                        //   ((Hadoop, .../hadoop), 20)
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.                                // RDD[(String,Iterable[(String,Int)])]:
  mapValues {                                  //   (Hadoop, Seq((.../hadoop, 20), ...))
    seq => sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
14-18
Productivity?

Intuitive API:
•Dataflow of steps.
•Inspired by Scala collections and functional programming.

[Pipeline: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
19
Performance?

Lazy API:
•Combines steps into "stages".
•Cache intermediate data in memory.

[Pipeline: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
20
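A small sketch of what "lazy" means in practice (the input path is a placeholder): transformations only record the dataflow, nothing executes until an action runs, and cache() keeps an intermediate RDD in memory for reuse.

val words = sparkContext.textFile("/path/to/input")    // nothing runs yet
  .flatMap(line => line.split("""\W+"""))              // still nothing
  .cache()                                             // mark for in-memory caching

words.count()             // first action: runs the whole pipeline and fills the cache
words.distinct().count()  // reuses the cached RDD instead of re-reading the file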
21
Higher-Level
APIs
22
SQL:
Datasets/DataFrames
23
Example

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
24
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

// Both of the following return another Dataset.
val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
27
val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)

The join condition is not an "arbitrary" anonymous function, but a "Column" instance.
29
Performance
The Dataset API has the
same performance for all
languages:
Scala, Java,
Python, R,
and SQL! 30
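One way to see why (a sketch reusing the two queries defined earlier): both formulations are compiled by the same Catalyst optimizer, so they produce essentially the same physical plan regardless of the language that expressed them.

planes_for_flights1.explain()   // physical plan for the SQL version
planes_for_flights2.explain()   // essentially the same plan for the DataFrame version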
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] = {
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
31
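A small usage sketch combining a few of these methods (the origin, dest, and distance columns are hypothetical; the real flights schema may differ):

flights.where(flights("origin") === "SFO")
  .groupBy(flights("dest"))
  .agg(Map("distance" -> "avg"))   // average distance per destination
  .orderBy("dest")
  .limit(10)
  .show()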
32
Structured
Streaming
33
DStream (discretized stream)
[Diagram: a continuous stream of events is chopped into time-interval batches, each stored as an RDD (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD, ...); sliding windows span several batches, e.g. "Window of 3 RDD Batches #1" and "Window of 3 RDD Batches #2".]
34
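The DStream picture above is the older streaming API; as a rough sketch (not from the slides), the Structured Streaming equivalent expresses the same windowing over an unbounded table. The socket host/port and window sizes are placeholders.

import org.apache.spark.sql.functions.window
import spark.implicits._   // for the $"..." column syntax

// Lines from a TCP socket, with an arrival timestamp attached.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Count lines per 10-second window, sliding every 5 seconds.
val counts = lines
  .groupBy(window($"timestamp", "10 seconds", "5 seconds"))
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()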
ML/MLlib
K-Means

•Machine Learning requires:


•Iterative training of models.
•Good linear algebra perf.
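As a rough illustration of the iterative-training point (a sketch, not from the slides; featuresDF is a hypothetical DataFrame with a "features" vector column):

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(3).setMaxIter(20).setSeed(1L)
val model  = kmeans.fit(featuresDF)   // each iteration is a distributed pass over the data
model.clusterCenters.foreach(println)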
GraphX
PageRank

•Graph algorithms require:


•Incremental traversal.
•Efficient edge and node reps.
Foundation:

The JVM 39
20 Years of
DevOps

Lots of Java Devs 40


Tools and Libraries
Akka
Breeze
Algebird
Spire & Cats
Axle
...
41
Big Data Ecosystem

42
But it’s

not perfect...
43
Richer data libraries exist in Python & R.
44
Garbage
Collection

45
GC Challenges
•Typical Spark heaps: 10s-100s GB.
•Uncommon for “generic”, non-data
services.

46
GC Challenges
•Too many cached RDDs lead to huge old-generation garbage.
•Billions of objects => long GC pauses.

47
Tuning GC
•Best for Spark:
  -XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx...
  -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...
databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
48
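For example, a sketch of wiring those flags into a job: the heap is sized with spark.executor.memory, since Spark disallows -Xms/-Xmx inside extraJavaOptions, and the specific values here are placeholders.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Inv. Index")
  .set("spark.executor.memory", "32g")              // instead of -Xms/-Xmx
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20")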
JVM Object Model

49
Java Objects?
•“abcd”: 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12 byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars. 50
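A rough way to see this for yourself, using Spark's own SizeEstimator (exact numbers vary with the JVM version and flags such as compressed oops):

import org.apache.spark.util.SizeEstimator

SizeEstimator.estimate("abcd")               // ~48 bytes, not 4
SizeEstimator.estimate(new Array[String](0)) // even an empty array has overhead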
Arrays
[Diagram: val myArray: Array[String] — the array holds references at indices 0..3, each pointing to a separate String object ("zeroth", "first", "second", "third").]
51

Class Instances
[Diagram: val person: Person — the instance holds a reference to a String for name ("Buck Trends"), a primitive Int for age (29), a reference to an Address for addr, and so on.]
52

Hash Maps
[Diagram: a hash map is a table of hash codes (h/c1..h/c4) whose entries reference key/value pairs, each of which points to separate key ("a key") and value ("a value") objects.]
53
Improving Performance

Why obsess about this?


Spark jobs are CPU bound:
•Improve network I/O? ~2% better.
•Improve disk I/O? ~20% better. 54
What changed?

•Faster HW (compared to ~2000)


•10Gb/s networks.
•SSDs.
55
What changed?

•Smarter use of I/O


•Pruning unneeded data sooner.
•Caching more effectively.
•Efficient formats, like Parquet. 56
What changed?

•But more CPU use today:


•More Serialization.
•More Compression.
•More Hashing (joins, group-bys). 57
Improving Performance

To improve performance, we need to focus on the CPU:
•Better algorithms, sure.
•And optimized use of memory.
58
Project Tungsten

Initiative to greatly improve Dataset/DataFrame performance.
59
Goals

60
Reduce References
[Diagram: the Array[String], Person instance, and hash map layouts from before — many small objects connected by references.]
61
Reduce References
•Fewer, bigger objects to GC.
•Fewer cache misses.
[Diagram: the same Array[String], Person, and hash map layouts, to be replaced by fewer, larger objects.]
62
Less Expression Overhead
sql("SELECT a + b FROM table")

•Evaluating expressions billions of times:
•Virtual function calls.
•Boxing/unboxing.
•Branching (if statements, etc.)
63
Implementation

64
Object Encoding
New CompactRow type:
[Row layout: null bit set (1 bit/field) | values (8 bytes/field) | variable-length data; a fixed-width value slot holds the offset to its variable-length data.]

•Compute hashCode and equals on the raw bytes.
65
•Compare: the Person object graph (references for name ("Buck Trends") and addr, a primitive Int 29 for age) vs. the flat row layout [null bit set (1 bit/field) | values (8 bytes/field) | variable-length data].
66
•BytesToBytesMap:
[Diagram: hash codes (h/c1..h/c4) point into a Tungsten memory page where keys and values are packed contiguously (k1 v1, k2 v2, k3 v3, k4 v4).]
67
•Compare: the JVM hash map (hash codes referencing entries, which in turn reference separate key and value objects) vs. the BytesToBytesMap (hash codes pointing into a Tungsten memory page of packed keys and values).
68
Memory Management
•Some allocations off heap.
•sun.misc.Unsafe.

69
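For illustration only (a sketch, not Spark's actual code), the kind of off-heap access sun.misc.Unsafe provides:

import sun.misc.Unsafe

// Unsafe isn't meant for application code, so it has to be obtained reflectively.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val address = unsafe.allocateMemory(8)   // 8 bytes outside the JVM heap
unsafe.putLong(address, 42L)             // write a long at that address
assert(unsafe.getLong(address) == 42L)   // read it back
unsafe.freeMemory(address)               // no GC here; you must free it yourself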
Less Expression Overhead
sql("SELECT a + b FROM table")

•Solution:
•Generate custom byte code.
•Spark 1.X - for subexpressions.
•Spark 2.0 - for whole queries.
71
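A quick way to see this (a sketch, not from the slides; the table and column names are made up): explain() marks the operators that were compiled by whole-stage code generation.

// Build a tiny two-column table and look at the physical plan.
val df = spark.range(0, 1000).selectExpr("id AS a", "id * 2 AS b")
df.createOrReplaceTempView("nums")

// Operators prefixed with '*' (or grouped under WholeStageCodegen) were fused
// into a single generated Java function instead of interpreted expression trees.
spark.sql("SELECT a + b FROM nums").explain()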
72
No Value Types

(Planned for Java 9 or 10)


73
case class Timestamp(epochMillis: Long) {

  def toString: String = { ... }

  def add(delta: TimeDelta): Timestamp = {
    /* return new shifted time */
  }
  ...
}

Don't allocate on the heap; just push the primitive long on the stack.
(scalac does this now.)
74
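A sketch of what that looks like with today's Scala value classes (assumption: the delta is just a Long here, standing in for the TimeDelta above). A value class has exactly one val parameter and extends AnyVal; in most contexts scalac then uses the underlying long directly instead of allocating a Timestamp on the heap.

// Minimal value-class sketch; boxing still happens in generic contexts.
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = Timestamp(epochMillis + deltaMillis)
}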
Long operations
aren’t atomic
According to the
JVM spec
75
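The practical consequence (a sketch, not from the slides): a non-volatile long field may be read and written as two 32-bit halves, so another thread can observe a torn value; marking it @volatile guarantees atomic reads and writes, though not atomic read-modify-write.

class LastSeen {
  @volatile var timestampMillis: Long = 0L   // reads and writes are atomic
  // note: timestampMillis += 1 is still not atomic; use AtomicLong for that
}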
No Unsigned Types

What’s
factorial(-1)?
76
Arrays Indexed
with Ints
Byte Arrays
limited to 2GB!
77
scala> val N = 1100*1000*1000
N: Int = 1100000000                // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(array)
res3: Long = 2200000016            // 2.2GB
78
scala> val b = sc.broadcast(array)
...broadcast.Broadcast[Array[Short]] = ...

scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))
79
scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

Boom!
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
80
But wait...
I actually lied
to you...
81
Spark handles large
broadcast variables
by breaking them
into blocks. 82
Scala REPL
83
java.lang.OutOfMemoryError:
Requested array size exceeds VM limit

at java.util.Arrays.copyOf(...)
...
at java.io.ByteArrayOutputStream.write(...)
...
at java.io.ObjectOutputStream.writeObject(...)
at ...spark.serializer.JavaSerializationStream
.writeObject(...)
...
at ...spark.util.ClosureCleaner$.ensureSerializable(..)
...
at org.apache.spark.rdd.RDD.map(...)

84
Pass this closure to RDD.map:
  i => b.value(i)

[Same stack trace, highlighting org.apache.spark.rdd.RDD.map(...).]
85
RDD.map verifies that the closure i => b.value(i) is "clean" (serializable)...

[Same stack trace, highlighting ...spark.util.ClosureCleaner$.ensureSerializable(...).]
86
...which it does by serializing the closure to a byte array...

[Same stack trace, highlighting ...spark.serializer.JavaSerializationStream.writeObject(...) and java.io.ByteArrayOutputStream.write(...).]
87
...which requires copying an array...

[Same stack trace, highlighting java.util.Arrays.copyOf(...).]

What array??? The closure is just i => b.value(i), yet the copy fails on:
scala> val array = Array.fill[Short](N)(0)
88
Why did this
happen?
89
•You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
| map(i => b.value(i))

90
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

•Scala compiles (roughly):
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)

  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}
91
•So, this closure over "b":
    sc.parallelize(...).map(i => b.value(i))
  ... sucks in the whole enclosing $iwC object, including array!
92
Lightbend is
investigating
re-engineering
the REPL 93
Workarounds...

94
•Transient is often all you need:
scala> @transient val array =
| Array.fill[Short](N)(0)
scala> ...

95
object Data {                      // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB              // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i))         // only needs b
    rdd.take(10).foreach(println)
  }
}
96
Why Scala?
See the longer version
of this talk at
polyglotprogramming.com/talks 97
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
dean.wampler@lightbend.com
@deanwampler

Questions?
Bonus Material
You can find an extended version of this
talk with more details at
polyglotprogramming.com/talks

100
