
Scala and the JVM for Big Data:

Lessons from Spark

polyglotprogramming.com/talks
dean.wampler@lightbend.com
@deanwampler

1
©Dean Wampler 2014-2019, All Rights Reserved
Spark

2
A Distributed
Computing Engine
on the JVM
3
Resilient Distributed Datasets (RDDs)
[Diagram: a cluster of nodes; an RDD is split into partitions, with one or more partitions stored on each node.]
4
Productivity?

Very concise, elegant, functional APIs.


•Scala, Java
•Python, R
•... and SQL!
5
Productivity?

Interactive shell (REPL)


•Scala, Python, R, and SQL

6
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
7
8
Example:
Inverted Index
9
Web Crawl
[Diagram: crawling pages such as wikipedia.org/hadoop ("Hadoop provides MapReduce and HDFS ...") and wikipedia.org/hbase ("HBase stores data in HDFS ...") produces an index of blocks of (url, contents) records, e.g. (wikipedia.org/hadoop, "Hadoop provides..."), (wikipedia.org/hbase, "HBase stores...").]
10
Compute Inverted Index
[Diagram: the (url, contents) index blocks are transformed ("Miracle!!") into an inverted index mapping each word to its (url, count) pairs, e.g. hadoop -> (.../hadoop,1); hbase -> (.../hbase,1),(.../hive,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1).]
11
Inverted Index
[Diagram detail: the inverted index blocks: hadoop -> (.../hadoop,1); hbase -> (.../hbase,1),(.../hive,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1).]
12
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                            // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word, id), 1)) // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.
  mapValues {
    seq => sortByCount(seq)                         // Sort the value seq by count, descending.
  }.saveAsTextFile("/path/to/output")
13
The same code, step by step, with the types of the intermediate RDDs:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").       // RDD[String]:
                                               //   .../hadoop, Hadoop provides...
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                       // RDD[(String,String)]:
  }.flatMap {                                  //   (.../hadoop, Hadoop provides...)
    case (id, contents) =>
      toWords(contents).map(w => ((w, id), 1))
  }.reduceByKey(_ + _).                        // RDD[((String,String),Int)]:
  map {                                        //   ((Hadoop, .../hadoop), 20)
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.                                // RDD[(String,Iterable[(String,Int)])]:
  mapValues {                                  //   (Hadoop, Seq((.../hadoop, 20), ...))
    seq => sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
14-18
Productivity?

Intuitive API:
•Dataflow of steps.
•Inspired by Scala collections and functional programming.

[Pipeline: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
19
Performance?

Lazy API:
•Combines steps into "stages".
•Cache intermediate data in memory.

[Pipeline: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
20
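A small sketch of what "lazy" means in practice (the input path is a placeholder): transformations only record the dataflow, nothing executes until an action runs, and cache() keeps an intermediate RDD in memory for reuse.

val words = sparkContext.textFile("/path/to/input")    // nothing runs yet
  .flatMap(line => line.split("""\W+"""))              // still nothing
  .cache()                                             // mark for in-memory caching

words.count()             // first action: runs the whole pipeline and fills the cache
words.distinct().count()  // reuses the cached RDD instead of re-reading the file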
21
Higher-Level
APIs
22
SQL:
Datasets/DataFrames
23
Example

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
24
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

// Both of the following return another Dataset.
val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
27
val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)

The join condition is not an "arbitrary" anonymous function, but a "Column" instance.
29
Performance
The Dataset API has the
same performance for all
languages:
Scala, Java,
Python, R,
and SQL! 30
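One way to see why (a sketch reusing the two queries defined earlier): both formulations are compiled by the same Catalyst optimizer, so they produce essentially the same physical plan regardless of the language that expressed them.

planes_for_flights1.explain()   // physical plan for the SQL version
planes_for_flights2.explain()   // essentially the same plan for the DataFrame version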
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] = {
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
31
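A small usage sketch combining a few of these methods (the origin, dest, and distance columns are hypothetical; the real flights schema may differ):

flights.where(flights("origin") === "SFO")
  .groupBy(flights("dest"))
  .agg(Map("distance" -> "avg"))   // average distance per destination
  .orderBy("dest")
  .limit(10)
  .show()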
32
Structured
Streaming
33
DStream (discretized stream)
[Diagram: a continuous stream of events is chopped into time-interval batches, each stored as an RDD (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD, ...); sliding windows span several batches, e.g. "Window of 3 RDD Batches #1" and "Window of 3 RDD Batches #2".]
34
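The DStream picture above is the older streaming API; as a rough sketch (not from the slides), the Structured Streaming equivalent expresses the same windowing over an unbounded table. The socket host/port and window sizes are placeholders.

import org.apache.spark.sql.functions.window
import spark.implicits._   // for the $"..." column syntax

// Lines from a TCP socket, with an arrival timestamp attached.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Count lines per 10-second window, sliding every 5 seconds.
val counts = lines
  .groupBy(window($"timestamp", "10 seconds", "5 seconds"))
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()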
ML/MLlib
K-Means

•Machine Learning requires:


•Iterative training of models.
•Good linear algebra perf.
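As a rough illustration of the iterative-training point (a sketch, not from the slides; featuresDF is a hypothetical DataFrame with a "features" vector column):

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(3).setMaxIter(20).setSeed(1L)
val model  = kmeans.fit(featuresDF)   // each iteration is a distributed pass over the data
model.clusterCenters.foreach(println)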
GraphX
PageRank

•Graph algorithms require:


•Incremental traversal.
•Efficient edge and node reps.
Foundation:

The JVM 39
20 Years of
DevOps

Lots of Java Devs 40


Tools and Libraries
Akka
Breeze
Algebird
Spire & Cats
Axle
...
41
Big Data Ecosystem

42
But it’s

not perfect...
43
Richer data libraries exist in Python & R.
44
Garbage
Collection

45
GC Challenges
•Typical Spark heaps: 10s-100s GB.
•Uncommon for “generic”, non-data
services.

46
GC Challenges
•Too many cached RDDs lead to huge old-generation garbage.
•Billions of objects => long GC pauses.

47
Tuning GC
•Best for Spark:
  -XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx...
  -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...
databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
48
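For example, a sketch of wiring those flags into a job: the heap is sized with spark.executor.memory, since Spark disallows -Xms/-Xmx inside extraJavaOptions, and the specific values here are placeholders.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Inv. Index")
  .set("spark.executor.memory", "32g")              // instead of -Xms/-Xmx
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20")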
JVM Object Model

49
Java Objects?
•“abcd”: 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12 byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars. 50
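A rough way to see this for yourself, using Spark's own SizeEstimator (exact numbers vary with the JVM version and flags such as compressed oops):

import org.apache.spark.util.SizeEstimator

SizeEstimator.estimate("abcd")               // ~48 bytes, not 4
SizeEstimator.estimate(new Array[String](0)) // even an empty array has overhead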
Arrays
[Diagram: val myArray: Array[String] — the array holds references at indices 0..3, each pointing to a separate String object ("zeroth", "first", "second", "third").]
51

Class Instances
[Diagram: val person: Person — the instance holds a reference to a String for name ("Buck Trends"), a primitive Int for age (29), a reference to an Address for addr, and so on.]
52

Hash Maps
[Diagram: a hash map is a table of hash codes (h/c1..h/c4) whose entries reference key/value pairs, each of which points to separate key ("a key") and value ("a value") objects.]
53
Improving Performance

Why obsess about this?


Spark jobs are CPU bound:
•Improve network I/O? ~2% better.
•Improve disk I/O? ~20% better. 54
What changed?

•Faster HW (compared to ~2000)


•10Gb/s networks.
•SSDs.
55
What changed?

•Smarter use of I/O


•Pruning unneeded data sooner.
•Caching more effectively.
•Efficient formats, like Parquet. 56
What changed?

•But more CPU use today:


•More Serialization.
•More Compression.
•More Hashing (joins, group-bys). 57
Improving Performance

To improve performance, we need to focus on the CPU:
•Better algorithms, sure.
•And optimized use of memory.
58
Project Tungsten

Initiative to greatly improve Dataset/DataFrame performance.
59
Goals

60
Reduce References
[Diagram: the Array[String], Person instance, and hash map layouts from before — many small objects connected by references.]
61
Reduce References
•Fewer, bigger objects to GC.
•Fewer cache misses.
[Diagram: the same Array[String], Person, and hash map layouts, to be replaced by fewer, larger objects.]
62
Less Expression Overhead
sql("SELECT a + b FROM table")

•Evaluating expressions billions of times:
•Virtual function calls.
•Boxing/unboxing.
•Branching (if statements, etc.)
63
Implementation

64
Object Encoding
New CompactRow type:
[Row layout: null bit set (1 bit/field) | values (8 bytes/field) | variable-length data; a fixed-width value slot holds the offset to its variable-length data.]

•Compute hashCode and equals on the raw bytes.
65
•Compare: the Person object graph (references for name ("Buck Trends") and addr, a primitive Int 29 for age) vs. the flat row layout [null bit set (1 bit/field) | values (8 bytes/field) | variable-length data].
66
•BytesToBytesMap:
[Diagram: hash codes (h/c1..h/c4) point into a Tungsten memory page where keys and values are packed contiguously (k1 v1, k2 v2, k3 v3, k4 v4).]
67
•Compare: the JVM hash map (hash codes referencing entries, which in turn reference separate key and value objects) vs. the BytesToBytesMap (hash codes pointing into a Tungsten memory page of packed keys and values).
68
Memory Management
•Some allocations off heap.
•sun.misc.Unsafe.

69
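For illustration only (a sketch, not Spark's actual code), the kind of off-heap access sun.misc.Unsafe provides:

import sun.misc.Unsafe

// Unsafe isn't meant for application code, so it has to be obtained reflectively.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val address = unsafe.allocateMemory(8)   // 8 bytes outside the JVM heap
unsafe.putLong(address, 42L)             // write a long at that address
assert(unsafe.getLong(address) == 42L)   // read it back
unsafe.freeMemory(address)               // no GC here; you must free it yourself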
Less Expression Overhead
sql("SELECT a + b FROM table")

•Solution:
•Generate custom byte code.
•Spark 1.X - for subexpressions.
•Spark 2.0 - for whole queries.
71
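A quick way to see this (a sketch, not from the slides; the table and column names are made up): explain() marks the operators that were compiled by whole-stage code generation.

// Build a tiny two-column table and look at the physical plan.
val df = spark.range(0, 1000).selectExpr("id AS a", "id * 2 AS b")
df.createOrReplaceTempView("nums")

// Operators prefixed with '*' (or grouped under WholeStageCodegen) were fused
// into a single generated Java function instead of interpreted expression trees.
spark.sql("SELECT a + b FROM nums").explain()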
72
No Value Types

(Planned for Java 9 or 10)


73
case class Timestamp(epochMillis: Long) {

  def toString: String = { ... }

  def add(delta: TimeDelta): Timestamp = {
    /* return new shifted time */
  }
  ...
}

Don't allocate on the heap; just push the primitive long on the stack.
(scalac does this now.)
74
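A sketch of what that looks like with today's Scala value classes (assumption: the delta is just a Long here, standing in for the TimeDelta above). A value class has exactly one val parameter and extends AnyVal; in most contexts scalac then uses the underlying long directly instead of allocating a Timestamp on the heap.

// Minimal value-class sketch; boxing still happens in generic contexts.
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = Timestamp(epochMillis + deltaMillis)
}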
Long operations
aren’t atomic
According to the
JVM spec
75
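The practical consequence (a sketch, not from the slides): a non-volatile long field may be read and written as two 32-bit halves, so another thread can observe a torn value; marking it @volatile guarantees atomic reads and writes, though not atomic read-modify-write.

class LastSeen {
  @volatile var timestampMillis: Long = 0L   // reads and writes are atomic
  // note: timestampMillis += 1 is still not atomic; use AtomicLong for that
}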
No Unsigned Types

What’s
factorial(-1)?
76
Arrays Indexed
with Ints
Byte Arrays
limited to 2GB!
77
scala> val N = 1100*1000*1000
N: Int = 1100000000                // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(array)
res3: Long = 2200000016            // 2.2GB
78
scala> val b = sc.broadcast(array)
...broadcast.Broadcast[Array[Short]] = ...

scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))
79
scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

Boom!
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
80
But wait...
I actually lied
to you...
81
Spark handles large
broadcast variables
by breaking them
into blocks. 82
Scala REPL
83
java.lang.OutOfMemoryError:
Requested array size exceeds VM limit

at java.util.Arrays.copyOf(...)
...
at java.io.ByteArrayOutputStream.write(...)
...
at java.io.ObjectOutputStream.writeObject(...)
at ...spark.serializer.JavaSerializationStream
.writeObject(...)
...
at ...spark.util.ClosureCleaner$.ensureSerializable(..)
...
at org.apache.spark.rdd.RDD.map(...)

84
Pass this closure to RDD.map:
  i => b.value(i)

[Same stack trace, highlighting org.apache.spark.rdd.RDD.map(...).]
85
RDD.map verifies that the closure i => b.value(i) is "clean" (serializable)...

[Same stack trace, highlighting ...spark.util.ClosureCleaner$.ensureSerializable(...).]
86
...which it does by serializing the closure to a byte array...

[Same stack trace, highlighting ...spark.serializer.JavaSerializationStream.writeObject(...) and java.io.ByteArrayOutputStream.write(...).]
87
...which requires copying an array...

[Same stack trace, highlighting java.util.Arrays.copyOf(...).]

What array??? The closure is just i => b.value(i), yet the copy fails on:
scala> val array = Array.fill[Short](N)(0)
88
Why did this
happen?
89
•You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
| map(i => b.value(i))

90
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

•Scala compiles (roughly):
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)

  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}
91
•So, this closure over "b":
    sc.parallelize(...).map(i => b.value(i))
  ... sucks in the whole enclosing $iwC object, including array!
92
Lightbend is
investigating
re-engineering
the REPL 93
Workarounds...

94
•Transient is often all you need:
scala> @transient val array =
| Array.fill[Short](N)(0)
scala> ...

95
object Data {                      // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB              // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i))         // only needs b
    rdd.take(10).foreach(println)
  }
}
96
Why Scala?
See the longer version
of this talk at
polyglotprogramming.com/talks 97
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
dean.wampler@lightbend.com
@deanwampler

Questions?
Bonus Material
You can find an extended version of this
talk with more details at
polyglotprogramming.com/talks

100
