Catalyst Rastersincatalyst 170525201903

Harnessing Spark Catalyst for
Custom Data Payloads

GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea
Astraea: See the earth. As it was, as it is, as it could be.
• Developing a machine learning platform to make solving planetary problems easier
• With exploding population growth and finite resources, we need to have tools to better plan for sustainable growth
• We aim to bring earth science data to business applications through machine learning
Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, the approach is not officially sanctioned; it uses undocumented, private APIs
– Not for everyone, but for us, the benefits outweigh the risks
PROBLEM STATEMENT
To efficiently and effectively build machine learning models with Earth observation data
Data Native Form
[Figure: a remote sensing data product (granule/scene/tile; GeoTIFF, HDF-EOS, GML-JPEG2000) made up of a multiband raster (Band a, Band b, Band c), a Temporal Projected Extent (TPE), and granule-wide tile properties, the Granule Metadata (GM).]
Granule Metadata (GM), example entries:
Key            Value
scale_factor   0.002
TileID         51004010
valid_range    1, 255
long_name      Band 32 emissivity
add_offset     0.49
…              …
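The metadata keys above matter once cell values are read. The following is a minimal sketch of how scale_factor, add_offset, and valid_range might be applied to stored cell values. It assumes the common linear convention physical = raw × scale_factor + add_offset, which the slide does not state. The function name decode_cells is hypothetical.

```python
def decode_cells(raw_cells, scale_factor=0.002, add_offset=0.49, valid_range=(1, 255)):
    """Convert stored integer cell values to physical values.

    Assumption: physical = raw * scale_factor + add_offset.
    Cells outside valid_range are treated as "no data" and become None.
    """
    low, high = valid_range
    return [
        raw * scale_factor + add_offset if low <= raw <= high else None
        for raw in raw_cells
    ]


# Raw value 0 falls outside valid_range and is masked.
decoded = decode_cells([0, 1, 128, 255])
```

Keeping the metadata next to the cells is what lets a conversion like this run later in the pipeline, per row.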
Canonical ML Functional Form
[Figure: the granule (Granule Metadata (GM), Temporal Projected Extent (TPE), Bands a, b, c) is flattened so that each Spark DataFrame row (i.e. ML observation) describes a single cell.]
Spark DataFrame row (i.e. ML observation):
Granule Metadata   Projected Extent of Tile + Cell Row/Column   Band Values at Single Cell (a, b, c)
GM_A               TPE_A [r1, c1]                               a1[0]  b1[0]  c1[0]  ...
...                ...                                          ...
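The flattening into one observation per cell can be sketched in plain Python, without Spark. The function name explode_tile and the dictionary keys are hypothetical, and the bands are assumed to share the same dimensions.

```python
def explode_tile(granule_metadata, extent, bands):
    """Yield one observation (dict) per cell of a multiband tile.

    bands maps a band name to a 2D list of cell values; all bands are
    assumed to have identical dimensions.
    """
    names = sorted(bands)
    reference = bands[names[0]]
    for row_index, row in enumerate(reference):
        for col_index in range(len(row)):
            observation = {
                "gm": granule_metadata,
                "tpe": extent,
                "row": row_index,
                "col": col_index,
            }
            # One column per band, holding that band's value at this cell.
            for name in names:
                observation[name] = bands[name][row_index][col_index]
            yield observation


rows = list(explode_tile(
    "GM_A",
    "TPE_A",
    {"a": [[1, 2], [3, 4]], "b": [[5, 6], [7, 8]], "c": [[9, 10], [11, 12]]},
))
```

A 2×2 tile with three bands yields four rows, each carrying the granule metadata, the extent, the cell position, and one value per band.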
Delivering Imagery to ML
World-wide data coverage, Scalable Machine Learning
[Figure: Scenes/Granules (Scene 1, Scene 2, … Scene N) arranged along a time axis (t0, t1, … tf) and a wavelength axis (bands b1 … bn), shown together with a Data Quality Check (DQC), the Base Analytics Functional Form (BAFF, a distributed DataFrame), Exploratory Data Analysis (EDA), Feature Engineering, and the Analytics Base Table (ABT, a distributed DataFrame). One label in the figure reads “SLAAW”.]
Why This is Hard: Dimensionality
• Spatial (500m → 5m → 30cm)
• Temporal (Refresh rates: Weeks → Daily → Hourly)
• Spectral (4 bands → 200 bands)
• Metadata
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
[Figure: Landsat8, Planet, DigitalGlobe, and Planetary Resources positioned along the spatial, temporal, and spectral axes.]
Why This is Hard: Data Footprint
As resolution scales, image size explodes
Provider     Landsat8 (NASA)   Planet PlanetScope Ortho   DigitalGlobe     Planetary Resources
Resolution   30 meters         3 meters                   30 centimeters   10 m
Bands        8 band            4 band                     4 band           200 band (hyper-spectral)
Size         0.5 GB/image      16 GB/image                1.0 TB/image     50 TB/image?
Data footprint for one football field size multiband raster (single point in time!)
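The growth in footprint follows from simple arithmetic: for a fixed ground area, the cell count grows with the inverse square of the resolution. Below is a rough sketch of that calculation. The 3 km × 3 km area, the 2 bytes per cell, and the function name raster_bytes are illustrative assumptions, not figures from the slide, and compression is ignored.

```python
def raster_bytes(width_m, height_m, resolution_m, bands, bytes_per_cell=2):
    """Estimate the uncompressed size in bytes of a multiband raster.

    Assumptions: square cells of resolution_m metres, every band stored at
    the same resolution, and a fixed number of bytes per cell.
    """
    cols = round(width_m / resolution_m)
    rows = round(height_m / resolution_m)
    return cols * rows * bands * bytes_per_cell


# Same hypothetical 3 km x 3 km area and band count at two resolutions.
coarse = raster_bytes(3000, 3000, resolution_m=30, bands=8)
fine = raster_bytes(3000, 3000, resolution_m=3, bands=8)
growth = fine // coarse
```

A tenfold improvement in resolution gives a hundredfold increase in cells, before any increase in band count or revisit rate.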
CAPABILITY DEMONSTRATION
Prototyping Spark Catalyst raster integration
Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster
𝑛 × 𝑚 where 𝑛, 𝑚 ≳ 1200 (e.g. Landsat 8: 7600²)
Each of these has one or more “bands” (e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
⇓
Tile ~ Chip
𝑛², where 𝑛 ≲ 512 (Typical: 64² to 256²)
⇓
Cell ~ Pixel
1×1
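The raster-to-tile step can be sketched as a simple grid split. The function name split_into_tiles is hypothetical, and a toy 4×4 raster stands in for a real scene. Edge tiles come out smaller when the raster size is not a multiple of the tile size.

```python
def split_into_tiles(raster, tile_size):
    """Yield ((tile_row, tile_col), tile) pairs covering a 2D raster.

    raster is a list of equal-length rows; each tile is at most
    tile_size x tile_size cells.
    """
    row_count = len(raster)
    col_count = len(raster[0])
    for top in range(0, row_count, tile_size):
        for left in range(0, col_count, tile_size):
            tile = [
                row[left:left + tile_size]
                for row in raster[top:top + tile_size]
            ]
            yield (top // tile_size, left // tile_size), tile


raster = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = dict(split_into_tiles(raster, 2))
```

Each tile keeps its grid position as a key, so the original raster can be stitched back together.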
TileUDT and Friends
• Using the approach covered in the next section, we register TileUDT with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
– vectorizeTiles
– explodeTiles
– localMax
– localMin
– localStats
– localAdd
– localSubtract
– tileHistogram
– tileStatistics
– tileMean
– aggHistogram
– aggStats
See work-in-progress code and examples/tests in: https://github.com/s22s/geotrellis-spark-sql/
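To give a feel for what functions of this kind compute, here is a plain-Python sketch of three of them. The semantics are assumed from the names: "local" is read as cell-wise, in the map algebra sense. These are not the library's implementations, which live in the repository above.

```python
from collections import Counter


def local_add(left, right):
    """Cell-wise sum of two tiles with identical dimensions."""
    return [
        [a + b for a, b in zip(left_row, right_row)]
        for left_row, right_row in zip(left, right)
    ]


def tile_mean(tile):
    """Mean of all cell values in a single tile."""
    cells = [value for row in tile for value in row]
    return sum(cells) / len(cells)


def tile_histogram(tile):
    """Count of occurrences of each cell value in a single tile."""
    return Counter(value for row in tile for value in row)


summed = local_add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
mean = tile_mean([[1, 2], [3, 4]])
histogram = tile_histogram([[1, 1], [2, 3]])
```

The local functions take tiles and return a tile, while the tile-level functions reduce one tile to a scalar or summary.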
TileUDT Notebook Demo
ZeppelinHub Version
IMPLEMENTATION
From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin
GeoTrellis
• GeoTrellis is an open source Scala framework for efficiently manipulating raster GIS data
• Provides facilities to ingest and process tiles at scale
• Has powerful abstractions for working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding, resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s “Map Algebra”
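Map Algebra groups raster operations by the neighborhood they read. Local operations look at one cell at a time, while focal operations look at a window around each cell. The following is a small sketch of a focal mean over a 3×3 window. It is written in plain Python for illustration and does not use GeoTrellis's API. The function name focal_mean is hypothetical, and edge cells are assumed to use only the neighbors that exist.

```python
def focal_mean(tile):
    """Replace each cell with the mean of its 3x3 neighborhood.

    Cells on the tile edge average over the neighbors that fall inside
    the tile (an assumption; other edge policies are possible).
    """
    row_count, col_count = len(tile), len(tile[0])
    result = []
    for r in range(row_count):
        result_row = []
        for c in range(col_count):
            neighborhood = [
                tile[rr][cc]
                for rr in range(max(0, r - 1), min(row_count, r + 2))
                for cc in range(max(0, c - 1), min(col_count, c + 2))
            ]
            result_row.append(sum(neighborhood) / len(neighborhood))
        result.append(result_row)
    return result


smoothed = focal_mean([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```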
Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in DataFrames, it’s also available in SQL!
Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
Anatomy of a UDT
[Code listing of a UDT, annotated with the following callouts:]
• To access the private API, the UDT needs to be in a subpackage of sql.
• Supertype parameterized on user type
• Name shown in schema and query plan
• Runtime class descriptor of user type
• Conversion from user data type to Catalyst encoding
• Conversion from Catalyst encoding to user data type
• Schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob.
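The two conversion callouts, together with the "opaque blob" encoding, can be illustrated with a plain-Python stand-in. This is not Spark's UDT API: the class TileCodec, its type_name, and the binary layout (two unsigned 32-bit dimensions followed by 16-bit signed cells) are assumptions made for the sketch.

```python
import struct


class TileCodec:
    """Stand-in for a UDT: converts a tile (2D list of ints) to and from a blob."""

    # Plays the role of the name shown in schema and query plan.
    type_name = "tile_blob"

    @staticmethod
    def serialize(tile):
        """User type -> encoded form: a header (rows, cols) plus packed cells."""
        rows, cols = len(tile), len(tile[0])
        cells = [value for row in tile for value in row]
        return struct.pack(f"<II{rows * cols}h", rows, cols, *cells)

    @staticmethod
    def deserialize(blob):
        """Encoded form -> user type: read the header, then rebuild the rows."""
        rows, cols = struct.unpack_from("<II", blob, 0)
        cells = struct.unpack_from(f"<{rows * cols}h", blob, 8)
        return [list(cells[r * cols:(r + 1) * cols]) for r in range(rows)]


tile = [[1, -2], [300, 4]]
blob = TileCodec.serialize(tile)
restored = TileCodec.deserialize(blob)
```

The engine only ever sees the blob. The tile structure is recovered when a function asks for the user type back, which is the trade-off an opaque encoding makes.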
UDT Registration
• User defined type is registered with Catalyst by providing mapping between native type and UDT
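Conceptually, registration is a lookup table from the native class to the type that knows how to encode it. Below is a toy sketch of that idea. The names Tile, TileUDT, register_udt, and udt_for are hypothetical, and none of this is Spark's registration API.

```python
class Tile:
    """Placeholder for the native (user) type."""

    def __init__(self, cells):
        self.cells = cells


class TileUDT:
    """Placeholder for the type that describes how a Tile is encoded."""


# Maps a native class to the UDT class responsible for it.
udt_registry = {}


def register_udt(native_class, udt_class):
    """Record which UDT handles values of the given native class."""
    udt_registry[native_class] = udt_class


def udt_for(value):
    """Look up the UDT registered for a value's class."""
    return udt_registry[type(value)]


register_udt(Tile, TileUDT)
resolved = udt_for(Tile([[1, 2], [3, 4]]))
```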
Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a. “Generator”)
• Data Source
• Query Plan
• Optimization Rule
Future Work
• GeoTrellis Layer Store as an integrated Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
The End
THANK YOU!