
Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage

The 5 Most Common Performance Problems (The 5 Ss)
Storage
● Storage is our 4th problem area

● Traditionally it consisted of one specific type of problem (aka Tiny Files)

● But it is actually a class of problems that relates to high overhead when ingesting data

● That ingest can be the initial ingest from Blob Storage, JDBC, Kafka, EventHubs or even from a previous Spark stage

Storage - More Examples
We are going to take a look at a few examples:

● Tiny Files

● Scanning

● Schemas, Merging Schemas & Schema Evolution

If you had only 60 seconds to pick up as many coins as you can, one coin at a time, which pile do you want to work from?

[Illustration: a pile of small coins ($0.01, $0.02, $0.03 … $0.13) vs a pile of large coins ($0.25, $0.50, $0.75 … $3.25)]

Storage - Tiny Files In Action
See Experiment #8923, contrast Step B, Step C and Step D

● Note the total execution time of each job

● In the Spark UI, see the Stage Details for the last
stage of each step and note the Input Size / Records

● In the Spark UI, see the Query Details for the last job of each step and note the...
■ number of files read
■ scan time total
■ filesystem read time total
■ size of files read

Storage - Tiny Files, Review

Step              Record Count   Execution Time   Number of Files Read   Scan Time Total   Filesystem Read Time Total   Size of Files Read
B - Benchmark #1  ~41 M          ~3 minutes       6,273                  20 minutes        ~10 minutes                  1,209 MB
C - Benchmark #2  ~2.7 B         ~10 minutes      100                    1 hour            ~1 hour                      102 GB
D - Tiny Files    ~34 M          ~1.5 hours       345,612                12 hours          > 6 hours                    2.1 GB

What can we do to mitigate the impact of tiny files?

Storage - Only 2.5 Options?
I might be wrong, but I think there are really only two options...

Scenario #1
● You caused the problem…
○ You can fix the problem

Scenario #2
● Someone else caused the problem…
○ Push back on design
○ Just live with it

Storage - The Ideal File Size
● The ideal part-file is between 128MB and 1GB

● Smaller than 128MB and we creep into the Tiny Files problem & its cousins

● Larger than 1GB part-files are generally advised against, mainly due to the problems associated with creating these large Spark-Partitions

● Remember...
  1 Spark-Partition == 1 Part-File upon write

Storage - Manual Compaction
● You can control the on-disk, part-file size

● The process does require some guessing

● Other “features” like partitioning and [especially] bucketing further complicate this process

● The key to it all is the notion that one Spark-task writes one part-file, meaning that a 1GB part-file requires a 1GB Spark-partition

Storage - Manual Compaction, How-To
The Algorithm (with an example in action):
1. Determine the size of your dataset on disk (e.g. 150 GB)
2. Decide what your ideal part-file size is (e.g. ½ GB part-files)
3. Compute the number of spark-partitions required: size-on-disk / ideal-size (e.g. 150 GB / ½ GB = 300 partitions)
4. Configure a cluster with N cores - more cores == less time (e.g. 9 x C4.8xlarge with 60 GB & 36 cores each). How?
   ● 9 VMs x 36 cores for 324 total cores
   ● 60 GB / 2 = 30 GB for execution (the default is 60%, but 50% is safe)
   ● 30 GB / 36 cores = 0.83 GB per core (over our ½ GB goal, but that’s disk size vs RAM)
5. Read in your data, repartition by N (here, 300), and then write to disk (see the sketch below)
6. Check the Spark UI for spill and any other issues

Homework: See how to manually compact tiny files in Experiment #2586
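
As a rough illustration, here is a minimal PySpark sketch of steps 5 and 6; the paths are hypothetical and the partition count comes from the worked example above:

# A minimal sketch of the compaction recipe; paths are hypothetical and the
# partition count (300) comes from step 3 above (150 GB / 0.5 GB).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

df = spark.read.parquet("dbfs:/datasets/tiny-files")   # the fragmented source

(df.repartition(300)           # one Spark-partition becomes one part-file
   .write
   .mode("overwrite")
   .parquet("dbfs:/datasets/compacted"))

# Step 6: check the Spark UI (Stages tab) for spill before trusting the result.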
Storage - Automatic Compaction
Databricks Delta’s Optimize Operation
● See Optimize (Delta Lake on Databricks) for more information
● Targets a 1GB size for each part-file

Databricks’ Auto-Optimization Feature
● See Auto Optimize for more information
● Targets a 128MB size for each part-file
● Not the most optimal, but better than “tiny files”
● Enable these two options when on Databricks (sketched below):
  1. spark.databricks.delta.optimizeWrite.enabled = true
  2. spark.databricks.delta.autoCompact.enabled = true
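
These flags can be set on the Spark session; a minimal sketch, assuming a Databricks runtime with Delta Lake and an active session named spark (the table name is hypothetical):

# Assumes a Databricks runtime with Delta Lake and an active session `spark`.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# The manual counterpart: Delta's OPTIMIZE command compacts an existing
# table toward ~1GB part-files ("my_table" is a hypothetical name).
spark.sql("OPTIMIZE my_table")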

Storage - Why Auto Optimize?
● Manually compacting files, or writing them out correctly the first time, is the most efficient process

● But Delta’s optimize and auto-optimize features afford you the ability to focus on [potentially] bigger problems

● To better understand this, let’s take a look at an example of how automatic optimization works

● It’s a little bit academic, but it helps to underscore how important this concept is

Storage - Traditional Writes

[Diagram series: four executors (Executor #1-#4) on a Spark cluster writing Delta tables to Disk-Partition A, Disk-Partition B and Disk-Partition C]

● Each task will write one part file to the target disk-partition
● All writes take place at the same time
● There is no optimization of the file size (part-files land anywhere from 16 MB to 144 MB), potentially inducing the tiny-files problem

Storage - Optimized Writes

[Diagram series: the same cluster, with an adaptive shuffle inserted between the executors and the Delta tables]

● Engage in an adaptive shuffle, the cost of which is offset by reduced disk IO
● Based on the target disk-partition, spark-partitions are grouped together
● All partitions are shuffled at the same time
● The final size of each partition on disk is estimated
● Small partitions are merged together, targeting a final size of 128 MB
● Each task then writes one part file to the target disk-partition, and all writes take place at the same time
● Where the estimate ran high, those partitions are simply not merged
● With the files written to disk, a subsequent job can now compact the smaller files
● A little overhead with the write, huge payoffs on the subsequent reads

Storage - Traditional vs Optimized Writes

[Side-by-side diagram] Traditional writes: no optimization on file size, potentially inducing the tiny-files problem. Optimized writes: reduced disk IO, and part files optimized around 128 MB.
Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage - Directory Scanning
The next version of the “Tiny Files Problem” is Directory Scanning

● Here is the idea:
  ■ One can list the files in a single directory
  ■ A single listing with thousands of files is still OK
  ■ The scan is still not as bad as the overhead of reading tiny files

● Highly partitioned datasets (data on disk) present a different problem:
  ■ For every disk-partition there is another directory to scan
  ■ Multiply that number of directories by N secondary & M tertiary partitions
  ■ These have to be scanned by the driver, one directory at a time

Storage - Scanning Example
Consider this common scenario:

● 1 year’s worth of data is partitioned by year, month, day & hour

● 1 year * 12 months * 30 days * 24 hours = 8,640 distinct directories

● 10 years of data becomes 86,400 directories
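
Such layouts typically come from a partitioned write; a minimal sketch, where the DataFrame, its columns and the path are all hypothetical:

# Assumes an active session `spark` and a DataFrame `df` with hypothetical
# year/month/day/hour columns. Each distinct (year, month, day, hour)
# combination becomes its own directory on disk.
(df.write
   .partitionBy("year", "month", "day", "hour")
   .parquet("dbfs:/datasets/trx-partitioned"))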

Storage - Scanning In Action
See Experiment #8973

● Contrast Step B, Step C and Step D
  ■ Note that we are not executing any actions - only declaring the DataFrames
  ■ Note the total execution time for each command
  ■ Note the results for countFiles(..)
    ● The Records total
    ● The Files total
    ● The Directories total

● See Steps E, F & G for more variants and how they affect scanning

● For Step J, open the Spark UI and look at the Query Details for the last job
  ■ Identify the proof that scanning is the root cause of this performance problem
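
The reason merely declaring a DataFrame can be slow: partition discovery lists every directory on the driver before any action runs. A minimal sketch, with a hypothetical path:

# No action is executed here, yet on a highly partitioned dataset this one
# line alone is slow: the driver must list every partition directory up front.
df = spark.read.parquet("dbfs:/datasets/trx-partitioned")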
Storage - Scanning, Review
Step   Description                                Duration      Records      Files     Directories
B      ~100 records per part-file (tiny files)    ~1 minute     34,562,416   345,612   1
C      Partitioned by year & month                ~5 seconds    36,152,970   6,273     12
D      Partitioned by year, month, day & hour     ~15 minutes   37,413,338   6,273     8,760

Storage - Scanning, Prove It
What proof is there in the Query Details for Experiment #8973, Step J,
that scanning is the root cause of these performance problems?

What can we do to mitigate the impact of scanning?

Storage - Can we…?

What happens if we were to specify the schema?
● See Step H and determine if this solution works - Nope!

What happens if we were to register it as a table?
● See Step I and determine if this solution works - Yes!
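
Why the table works: the metastore, not a recursive directory listing, supplies the partition information. A minimal sketch of the idea, with hypothetical table, column and path names:

# Register the partitioned dataset as a table so the metastore, rather than
# a directory scan, supplies partition info. All names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS trx (id BIGINT, amount DOUBLE,
                                    year INT, month INT, day INT, hour INT)
    USING parquet
    PARTITIONED BY (year, month, day, hour)
    LOCATION 'dbfs:/datasets/trx-partitioned'
""")
spark.sql("MSCK REPAIR TABLE trx")   # one-time discovery of existing disk-partitions

df = spark.table("trx")              # subsequent declarations consult the metastore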

Performance Tuning on Apache Spark

The Five Most Common Performance Problems

Storage - Schemas
● Inferring schemas (for JSON and CSV) requires a full read of the file to determine data types, even if you only want a subset of the data

● Reading Parquet files requires a one-time read of the schema

● However, supporting schema evolution is [potentially] expensive
  ■ If you have hundreds to thousands of part-files, each schema has to be read in and then merged, which collectively can be fairly expensive
  ■ Schema merging was turned off by default starting with Spark 1.5
  ■ It is enabled via the spark.sql.parquet.mergeSchema configuration option or the reader’s mergeSchema option
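
A minimal sketch of the difference; the paths and fields are hypothetical, and an active session named spark is assumed:

# Paths and fields are hypothetical; assumes an active session `spark`.
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Inference forces a full pass over the JSON just to determine data types:
inferred_df = spark.read.json("dbfs:/datasets/events.json")

# Supplying the schema up front skips that inference pass entirely:
schema = StructType([
    StructField("id", LongType(), True),
    StructField("event", StringType(), True),
])
fast_df = spark.read.schema(schema).json("dbfs:/datasets/events.json")

# Parquet schema merging is off by default; opt in only when you need it:
merged_df = spark.read.option("mergeSchema", "true").parquet("dbfs:/datasets/parquet")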

What can we do to mitigate the schema issues?

Storage - Schema Mitigation
There are several ways to mitigate some of these issues:

● Provide the schema every time
  ■ Especially for JSON and CSV
  ■ Applicable to Parquet and other formats

● Use tables - the backing meta store will track the table’s schema

● Or just use Delta
  ■ Zero reads with a meta store
  ■ At most one read, even with schema evolution
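
And the Delta variant, as a minimal sketch (the path is hypothetical): the transaction log carries the schema, so declaring the DataFrame needs no inference pass and no directory scan:

# A minimal sketch of the Delta option; the path is hypothetical.
df = spark.read.format("delta").load("dbfs:/datasets/trx-delta")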
