Catalyst Rastersincatalyst 170525201903
2
Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame
compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, approach is not officially sanctioned;
uses undocumented, private APIs
– Not for everyone, but for us, benefits outweigh the risks
3
To efficiently and effectively build machine learning models with Earth observation data
PROBLEM STATEMENT
4
Data Native Form
Remote Sensing Data Product: Granule/Scene/Tile (GeoTIFF, HDF-EOS, GML-JPEG2000)
[Figure: a multiband granule (Band a, Band b, Band c) with its Temporal Projected Extent (TPE) and Granule Metadata (GM)]
Granule Metadata (GM), granule-wide tile properties:
Key            Value
TileID         51004010
long_name      Band 32 emissivity
valid_range    1, 255
scale_factor   0.002
add_offset     0.49
…              …
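The metadata keys above are typically what turns stored integer cells into physical values. A minimal sketch, assuming the common convention physical = raw × scale_factor + add_offset (some products define the transform differently, so check the product guide); the function name `decode_cell` is hypothetical:

```python
from typing import Optional

# Values copied from the Granule Metadata table on this slide.
SCALE_FACTOR = 0.002
ADD_OFFSET = 0.49
VALID_RANGE = (1, 255)


def decode_cell(raw: int) -> Optional[float]:
    """Convert a stored integer cell to a physical value, or None if invalid."""
    low, high = VALID_RANGE
    if not (low <= raw <= high):
        return None  # outside valid_range: treat the cell as no-data
    # Assumed convention: physical = raw * scale_factor + add_offset
    return raw * SCALE_FACTOR + ADD_OFFSET


decoded = [decode_cell(v) for v in (0, 1, 255)]
```

Carrying these parameters alongside each tile is one reason the metadata has to travel with the imagery through the pipeline.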
5
Canonical ML Functional Form
[Figure: the granule's Granule Metadata (GM), Temporal Projected Extent (TPE), and bands (Band a, Band b, Band c) flattened into rows of the form below]
Projected Extent of Tile + Cell Row/Column | Band Values at Single Cell
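The flattening this slide describes can be sketched in plain Python: one output row per cell, carrying the tile identifier, the cell's row/column, and that cell's value in every band. The function name `explode_tile` and the sample values are illustrative assumptions, not the project's API:

```python
from typing import Dict, Iterator, List, Tuple

Band = List[List[float]]  # one band as rows of cell values


def explode_tile(tile_id: str, bands: Dict[str, Band]) -> Iterator[Tuple]:
    """Yield (tile_id, row, col, band values...) for every cell in the tile."""
    names = sorted(bands)  # fixed band order so every row has the same layout
    first = bands[names[0]]
    for r in range(len(first)):
        for c in range(len(first[0])):
            yield (tile_id, r, c) + tuple(bands[n][r][c] for n in names)


# Hypothetical 2x2 tile with two bands.
tile = {
    "a": [[1.0, 2.0], [3.0, 4.0]],
    "b": [[10.0, 20.0], [30.0, 40.0]],
}
rows = list(explode_tile("51004010", tile))
```

Each resulting tuple is a feature vector for a single cell, which is the shape most ML libraries expect.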
6
Delivering Imagery to ML
[Figure labels: World-wide data coverage; Scenes/Granules (Scene 1, Scene 2, … Scene N) arranged by time (t0, t1, … tf) and wavelength (bands b1 … b7); Data Quality Check (DQC); (EDA); Distributed DataFrame; Scalable Machine Learning]
7
Why This is Hard: Dimensionality
• Spatial (500m → 5m → 30cm)
• Temporal (Refresh rates: Weeks → Daily → Hourly)
• Spectral (4 bands → 200 bands)
• Metadata
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
[Figure labels: Landsat8, Planet, DigitalGlobe, Planetary Resources]
8
Why This is Hard: Data Footprint
As resolution scales, image size explodes
[Figure: relative image footprints for Landsat8 (NASA), Planet PlanetScope Ortho, DigitalGlobe, and Planetary Resources]
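A rough way to see the growth: the number of cells covering a fixed ground area scales with the inverse square of the cell size. The sketch below uses the spatial resolutions quoted on the Dimensionality slide; the function name is hypothetical, and real products add bands, bit depth, and overlap on top of this:

```python
def cells_per_km2(resolution_m: float) -> float:
    """Number of cells needed to cover 1 km² at a given cell size in metres."""
    return (1000.0 / resolution_m) ** 2


# Resolutions quoted on the Dimensionality slide: 500 m, 5 m, 30 cm.
for res in (500.0, 5.0, 0.3):
    print(f"{res:>6} m/cell -> {cells_per_km2(res):,.0f} cells per km² per band")
```

Going from 500 m to 5 m cells multiplies the cell count by 10,000 before any increase in band count or revisit rate.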
9
Prototyping Spark Catalyst raster integration
CAPABILITY DEMONSTRATION
10
Domain-Specific Data Discretization
Each of these has one or more “bands”
(e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
Tile ~ Chip
𝑛², where 𝑛 ≲ 512 (Typical: 64² to 256²)
⇓
Cell ~ Pixel
1×1
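The discretization from a larger grid into fixed-size chips can be sketched as follows. This is an illustration only, with a hypothetical `chip` function and a tiny grid standing in for a scene; it is not how GeoTrellis tiles data internally:

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]


def chip(scene: Grid, n: int) -> Dict[Tuple[int, int], Grid]:
    """Split a 2-D grid into n x n tiles keyed by (tile_row, tile_col).

    Edge tiles are smaller when the scene size is not a multiple of n.
    """
    tiles = {}
    for top in range(0, len(scene), n):
        for left in range(0, len(scene[0]), n):
            tiles[(top // n, left // n)] = [
                row[left:left + n] for row in scene[top:top + n]
            ]
    return tiles


scene = [[r * 6 + c for c in range(6)] for r in range(4)]  # 4 rows x 6 cols
tiles = chip(scene, 4)
```

With the typical sizes from the slide, `n` would be somewhere between 64 and 256 rather than 4.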
11
TileUDT and Friends
• Using the approach covered in the next section, we register TileUDT
with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
§ vectorizeTiles
§ explodeTiles
§ localMax
§ localMin
§ localStats
§ localAdd
§ localSubtract
§ tileHistogram
§ tileStatistics
§ tileMean
§ aggHistogram
§ aggStats
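To give a feel for what functions in these families compute, here is a plain-Python sketch. The names echo the list above, but the bodies are assumptions about the usual meaning of "local" (cell-wise across tiles) and "tile" (summary over one tile) operations, not the project's implementations:

```python
from collections import Counter
from typing import List

Tile = List[List[int]]


def local_add(a: Tile, b: Tile) -> Tile:
    """Cell-wise sum of two tiles of identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def local_max(a: Tile, b: Tile) -> Tile:
    """Cell-wise maximum of two tiles of identical shape."""
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tile_mean(t: Tile) -> float:
    """Mean of all cells in one tile."""
    cells = [v for row in t for v in row]
    return sum(cells) / len(cells)


def tile_histogram(t: Tile) -> Counter:
    """Count of each distinct cell value in one tile."""
    return Counter(v for row in t for v in row)


a = [[1, 2], [3, 4]]
b = [[4, 3], [2, 1]]
summed = local_add(a, b)
```

In the DataFrame setting, functions like these would be applied per row to tile-typed columns, with the "agg" variants combining results across rows.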
ZeppelinHub Version
13
From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
IMPLEMENTATION
14
Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin
15
GeoTrellis
• GeoTrellis is an open source
Scala framework for efficiently
manipulating raster GIS data
• Provides facilities to ingest and
process tiles at scale
• Has powerful abstractions for
working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding,
resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s
“Map Algebra”
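Map Algebra groups raster operations by the neighbourhood each output cell depends on: local (the same cell), focal (nearby cells), and zonal (cells in the same zone). As a hedged illustration of the focal case, the sketch below computes a 3×3 mean in plain Python; it is not GeoTrellis code, and clipping the window at the edges is just one possible edge policy:

```python
from typing import List

Tile = List[List[float]]


def focal_mean(tile: Tile) -> Tile:
    """Mean over each cell's 3x3 neighbourhood, clipped at the tile edges."""
    rows, cols = len(tile), len(tile[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                tile[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


smoothed = focal_mean([[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]])
```

Focal operations are the ones that make tile boundaries matter, since cells at a tile's edge need values from neighbouring tiles to be computed exactly.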
16
Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in
DataFrames, it’s also available in SQL!
17
Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two
Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
18
Anatomy of a UDT
To access the private API, the UDT needs to live in a subpackage of sql.
Supertype parameterized on user type
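The contract a UDT fulfils can be sketched without Spark: declare the storage schema, convert the user type into that storage form, and convert it back. The toy class below loosely mirrors the method names of Spark's UserDefinedType (sqlType, serialize, deserialize), but it is a plain-Python stand-in under assumed field layouts, not the Catalyst API:

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tile:
    cols: int
    rows: int
    cells: List[float]  # row-major, length cols * rows


class TileUDT:
    """Toy stand-in for a UDT: maps Tile to and from a storable form."""

    def sql_type(self) -> Tuple[str, str, str]:
        # Column types of the stored representation (labels are illustrative).
        return ("int", "int", "binary")

    def serialize(self, tile: Tile) -> Tuple[int, int, bytes]:
        # Pack the cells as little-endian doubles into one binary payload.
        payload = struct.pack(f"<{len(tile.cells)}d", *tile.cells)
        return (tile.cols, tile.rows, payload)

    def deserialize(self, datum: Tuple[int, int, bytes]) -> Tile:
        cols, rows, payload = datum
        cells = list(struct.unpack(f"<{cols * rows}d", payload))
        return Tile(cols, rows, cells)


udt = TileUDT()
original = Tile(2, 2, [0.5, 1.5, 2.5, 3.5])
restored = udt.deserialize(udt.serialize(original))
```

Storing the cells as a single binary payload rather than one field per cell is the kind of choice that suits larger data payloads.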
19
UDT Registration
• User defined type is registered with
Catalyst by providing mapping between
native type and UDT
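Conceptually, registration is a lookup table from the native type to the UDT that knows how to encode it. A minimal plain-Python sketch of that idea follows; the `register` and `udt_for` helpers and the empty placeholder classes are hypothetical and stand in for whatever registration mechanism the Spark version in use provides:

```python
from typing import Dict


class Tile:
    """Placeholder for the native user type."""


class TileUDT:
    """Placeholder for the UDT that encodes Tile."""


_REGISTRY: Dict[type, type] = {}


def register(user_class: type, udt_class: type) -> None:
    """Record which UDT class handles a given native type."""
    _REGISTRY[user_class] = udt_class


def udt_for(value: object) -> object:
    """Return a UDT instance for value's type, or raise if unregistered."""
    try:
        return _REGISTRY[type(value)]()
    except KeyError:
        raise TypeError(f"no UDT registered for {type(value).__name__}") from None


register(Tile, TileUDT)
handler = udt_for(Tile())
```

Once such a mapping exists, the engine can pick the right encoder whenever it meets a value of the native type.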
20
Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a.
“Generator”)
• Data Source
• Query Plan
• Optimization Rule
21
Future Work
• GeoTrellis Layer Store as an integrated
Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD
features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
22
The End
THANK YOU!
23