
Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage

The 5 Most Common Performance Problems (The 5 Ss)
Storage
● Storage is our 4th problem area

● Traditionally it consisted of one specific type of problem (aka Tiny Files)

● But it is actually a class of problems that relates to high overhead when ingesting data

● That ingest can be the initial ingest from Blob Storage, JDBC, Kafka, EventHubs or even from a previous Spark stage

Storage - More Examples
We are going to take a look at a few examples:

● Tiny Files

● Scanning

● Schemas, Merging Schemas & Schema Evolution

If you had only 60 seconds to pick up as many coins as you can, one coin at a time, which pile do you want to work from?

[Illustration: a pile of small coins ($0.01, $0.02, $0.03 … $0.13) vs a pile of large coins ($0.25, $0.50, $0.75 … $3.25)]

Storage - Tiny Files In Action
See Experiment #8923, contrast Step B, Step C and Step D

● Note the total execution time of each job

● In the Spark UI, see the Stage Details for the last
stage of each step and note the Input Size / Records

● In the Spark UI, see the Query Details for the last job of each step and note the...
■ number of files read
■ scan time total
■ filesystem read time total
■ size of files read

Storage - Tiny Files, Review

Step              Record Count   Execution Time   Number of Files Read   Scan Time Total   Filesystem Read Time Total   Size of Files Read
B - Benchmark #1  ~41 M          ~3 minutes       6,273                  20 minutes        ~10 minutes                  1,209 MB
C - Benchmark #2  ~2.7 B         ~10 minutes      100                    1 hour            ~1 hour                      102 GB
D - Tiny Files    ~34 M          ~1.5 hours       345,612                12 hours          > 6 hours                    2.1 GB

What can we do to mitigate the impact of tiny files?

Storage - Only 2.5 Options?
I might be wrong, but I think there are really only two options...

Scenario #1
● You caused the problem…
○ You can fix the problem

Scenario #2
● Someone else caused the problem…
○ Push back on design
○ Just live with it

Storage - The Ideal File Size
● The ideal part-file is between 128MB and 1GB

● Smaller than 128MB and we creep into the Tiny Files problem & its cousins

● Larger than 1GB part-files are generally advised against, mainly due to the problems associated with creating these large Spark-Partitions

● Remember...
  1 Spark-Partition == 1 Part-File upon write

Storage - Manual Compaction
● You can control the on-disk, part-file size

● The process does require some guessing

● Other “features” like partitioning and [especially] bucketing further complicate this process

● The key to it all is the notion that one Spark-task writes one part-file, meaning that a 1GB part-file requires a 1GB Spark-partition

Storage - Manual Compaction, How-To
The Algorithm (with an example in action):
1. Determine the size of your dataset on disk (e.g. 150 GB)
2. Decide what your ideal part-file size is (e.g. ½ GB part-files)
3. Compute the number of spark-partitions required: size-on-disk / ideal-size (e.g. 150 GB / ½ GB = 300 partitions)
4. Configure a cluster with N cores - more cores == less time (e.g. 9 x C4.8xlarge with 60 GB & 36 cores each). How?
   ● 9 VMs x 36 cores for 324 total cores
   ● 60 GB / 2 = 30 GB for execution (the default is 60%, but 50% is safe)
   ● 30 GB / 36 cores = 0.83 GB per core (over our ½ GB goal, but that’s disk size vs RAM)
5. Read in your data, repartition by N (here, 300), and then write to disk (see the sketch below)
6. Check the Spark UI for spill and any other issues

Homework: See how to manually compact tiny files in Experiment #2586
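
As a rough illustration, here is a minimal PySpark sketch of steps 5 and 6; the paths are hypothetical and the partition count comes from the worked example above:

# A minimal sketch of the compaction recipe; paths are hypothetical and the
# partition count (300) comes from step 3 above (150 GB / 0.5 GB).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

df = spark.read.parquet("dbfs:/datasets/tiny-files")   # the fragmented source

(df.repartition(300)           # one Spark-partition becomes one part-file
   .write
   .mode("overwrite")
   .parquet("dbfs:/datasets/compacted"))

# Step 6: check the Spark UI (Stages tab) for spill before trusting the result.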
Storage - Automatic Compaction
Databricks Delta’s Optimize Operation
● See Optimize (Delta Lake on Databricks) for more information
● Targets a 1GB size for each part-file

Databricks’ Auto-Optimization Feature
● See Auto Optimize for more information
● Targets a 128MB size for each part-file
● Not the most optimal, but better than “tiny files”
● Enable these two options when on Databricks (sketched below):
  1. spark.databricks.delta.optimizeWrite.enabled = true
  2. spark.databricks.delta.autoCompact.enabled = true
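
These flags can be set on the Spark session; a minimal sketch, assuming a Databricks runtime with Delta Lake and an active session named spark (the table name is hypothetical):

# Assumes a Databricks runtime with Delta Lake and an active session `spark`.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# The manual counterpart: Delta's OPTIMIZE command compacts an existing
# table toward ~1GB part-files ("my_table" is a hypothetical name).
spark.sql("OPTIMIZE my_table")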

Storage - Why Auto Optimize?
● Manually compacting files, or writing them out correctly the first time, is the most efficient process

● But Delta’s optimize and auto-optimize features afford you the ability to focus on [potentially] bigger problems

● To better understand this, let’s take a look at an example of how automatic optimization works

● It’s a little bit academic, but it helps to underscore how important this concept is

Storage - Traditional Writes

[Diagram series: four executors (Executor #1-#4) on a Spark cluster writing Delta tables to Disk-Partition A, Disk-Partition B and Disk-Partition C]

● Each task will write one part file to the target disk-partition
● All writes take place at the same time
● There is no optimization of the file size (part-files land anywhere from 16 MB to 144 MB), potentially inducing the tiny-files problem

Storage - Optimized Writes

[Diagram series: the same cluster, with an adaptive shuffle inserted between the executors and the Delta tables]

● Engage in an adaptive shuffle, the cost of which is offset by reduced disk IO
● Based on the target disk-partition, spark-partitions are grouped together
● All partitions are shuffled at the same time
● The final size of each partition on disk is estimated
● Small partitions are merged together, targeting a final size of 128 MB
● Each task then writes one part file to the target disk-partition, and all writes take place at the same time
● Where the estimate ran high, those partitions are simply not merged
● With the files written to disk, a subsequent job can now compact the smaller files
● A little overhead with the write, huge payoffs on the subsequent reads

Storage - Traditional vs Optimized Writes

[Side-by-side diagram] Traditional writes: no optimization on file size, potentially inducing the tiny-files problem. Optimized writes: reduced disk IO, and part files optimized around 128 MB.
Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage - Directory Scanning
The next version of the “Tiny Files Problem” is Directory Scanning

● Here is the idea:
  ■ One can list the files in a single directory
  ■ A single listing with thousands of files is still OK
  ■ The scan is still not as bad as the overhead of reading tiny files

● Highly partitioned datasets (data on disk) present a different problem:
  ■ For every disk-partition there is another directory to scan
  ■ Multiply that number of directories by N secondary & M tertiary partitions
  ■ These have to be scanned by the driver, one directory at a time

Storage - Scanning Example
Consider this common scenario:

● 1 year’s worth of data is partitioned by year, month, day & hour

● 1 year * 12 months * 30 days * 24 hours = 8,640 distinct directories

● 10 years of data becomes 86,400 directories
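
Such layouts typically come from a partitioned write; a minimal sketch, where the DataFrame, its columns and the path are all hypothetical:

# Assumes an active session `spark` and a DataFrame `df` with hypothetical
# year/month/day/hour columns. Each distinct (year, month, day, hour)
# combination becomes its own directory on disk.
(df.write
   .partitionBy("year", "month", "day", "hour")
   .parquet("dbfs:/datasets/trx-partitioned"))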

Storage - Scanning In Action
See Experiment #8973

● Contrast Step B, Step C and Step D
  ■ Note that we are not executing any actions - only declaring the DataFrames
  ■ Note the total execution time for each command
  ■ Note the results for countFiles(..)
    ● The Records total
    ● The Files total
    ● The Directories total

● See Steps E, F & G for more variants and how they affect scanning

● For Step J, open the Spark UI and look at the Query Details for the last job
  ■ Identify the proof that scanning is the root cause of this performance problem
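
The reason merely declaring a DataFrame can be slow: partition discovery lists every directory on the driver before any action runs. A minimal sketch, with a hypothetical path:

# No action is executed here, yet on a highly partitioned dataset this one
# line alone is slow: the driver must list every partition directory up front.
df = spark.read.parquet("dbfs:/datasets/trx-partitioned")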
Storage - Scanning, Review
Step   Description                                Duration      Records      Files     Directories
B      ~100 records per part-file (tiny files)    ~1 minute     34,562,416   345,612   1
C      Partitioned by year & month                ~5 seconds    36,152,970   6,273     12
D      Partitioned by year, month, day & hour     ~15 minutes   37,413,338   6,273     8,760

Storage - Scanning, Prove It
What proof is there in the Query Details for Experiment #8973, Step J,
that scanning is the root cause of these performance problems?

What can we do to mitigate the impact of scanning?

Storage - Can we…?

What happens if we were to specify the schema?
● See Step H and determine if this solution works - Nope!

What happens if we were to register it as a table?
● See Step I and determine if this solution works - Yes!
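
Why the table works: the metastore, not a recursive directory listing, supplies the partition information. A minimal sketch of the idea, with hypothetical table, column and path names:

# Register the partitioned dataset as a table so the metastore, rather than
# a directory scan, supplies partition info. All names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS trx (id BIGINT, amount DOUBLE,
                                    year INT, month INT, day INT, hour INT)
    USING parquet
    PARTITIONED BY (year, month, day, hour)
    LOCATION 'dbfs:/datasets/trx-partitioned'
""")
spark.sql("MSCK REPAIR TABLE trx")   # one-time discovery of existing disk-partitions

df = spark.table("trx")              # subsequent declarations consult the metastore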

Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage - Schemas
● Inferring schemas (for JSON and CSV) requires a full read of the file to determine data types, even if you only want a subset of the data

● Reading Parquet files requires a one-time read of the schema

● However, supporting schema evolution is [potentially] expensive
  ■ If you have hundreds to thousands of part-files, each schema has to be read in and then merged, which collectively can be fairly expensive
  ■ Schema merging was turned off by default starting with Spark 1.5
  ■ It is enabled via the spark.sql.parquet.mergeSchema configuration option or the reader’s mergeSchema option
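
A minimal sketch of the difference; the paths and fields are hypothetical, and an active session named spark is assumed:

# Paths and fields are hypothetical; assumes an active session `spark`.
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Inference forces a full pass over the JSON just to determine data types:
inferred_df = spark.read.json("dbfs:/datasets/events.json")

# Supplying the schema up front skips that inference pass entirely:
schema = StructType([
    StructField("id", LongType(), True),
    StructField("event", StringType(), True),
])
fast_df = spark.read.schema(schema).json("dbfs:/datasets/events.json")

# Parquet schema merging is off by default; opt in only when you need it:
merged_df = spark.read.option("mergeSchema", "true").parquet("dbfs:/datasets/parquet")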

What can we do to mitigate the schema issues?

Storage - Schema Mitigation
There are several ways to mitigate some of these issues:

● Provide the schema every time
  ■ Especially for JSON and CSV
  ■ Applicable to Parquet and other formats

● Use tables - the backing meta store will track the table’s schema

● Or just use Delta
  ■ Zero reads with a meta store
  ■ At most one read, even with schema evolution
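
And the Delta variant, as a minimal sketch (the path is hypothetical): the transaction log carries the schema, so declaring the DataFrame needs no inference pass and no directory scan:

# A minimal sketch of the Delta option; the path is hypothetical.
df = spark.read.format("delta").load("dbfs:/datasets/trx-delta")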
