Harnessing Spark Catalyst for Custom Data Payloads


GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea
Astraea: See the earth. As it was, as it is, as it could be.

• Developing a machine learning platform to make solving planetary problems easier
• With exploding population growth and finite resources, we need tools to better plan for sustainable growth
• We aim to bring earth science data to business applications through machine learning

Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame compute model
– Basic understanding of a typical ETL/ML pipeline

• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro

• Caveat Emptor:
– As of Spark 2.1.0, this approach is not officially sanctioned; it uses undocumented, private APIs
– Not for everyone, but for us the benefits outweigh the risks

To efficiently and effectively build machine learning models with Earth observation data

PROBLEM STATEMENT

Data Native Form

[Figure: a Remote Sensing Data Product, i.e. a Granule/Scene/Tile (GeoTIFF, HDF-EOS, GML-JPEG2000). Labeled parts: Multiband (Band a, Band b, Band c), Temporal, Projected Extent (TPE) for tile properties, and granule-wide Granule Metadata (GM).]

Example Granule Metadata (GM):

Key            Value
scale_factor   0.002
TileID         51004010
valid_range    1, 255
long_name      Band 32 emissivity
add_offset     0.49
…              …
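
Metadata keys such as scale_factor, add_offset and valid_range usually describe how stored integer cells map to physical values. The following is a minimal sketch in plain Python, assuming the common convention physical = raw * scale_factor + add_offset with valid_range as an inclusive mask. That convention is an assumption, not something stated on the slide, and some products define the offset differently. The function name `decode_cells` and the raw cell values are hypothetical; only the metadata values come from the table above.

```python
from typing import List, Optional, Tuple


def decode_cells(
    raw: List[int],
    scale_factor: float,
    add_offset: float,
    valid_range: Tuple[int, int],
) -> List[Optional[float]]:
    """Convert stored integer cell values to physical values.

    Assumes physical = raw * scale_factor + add_offset (an assumption).
    Cells outside the inclusive valid_range become None.
    """
    low, high = valid_range
    return [
        value * scale_factor + add_offset if low <= value <= high else None
        for value in raw
    ]


# Metadata values from the table above; the raw cell values are made up.
decoded = decode_cells(
    [0, 1, 100, 255], scale_factor=0.002, add_offset=0.49, valid_range=(1, 255)
)
print(decoded)
```

Keeping the metadata next to the cells matters because the same raw value means different things in different granules.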

Canonical ML Functional Form
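[Figure: the multiband tile (Band a, Band b, Band c), with its Granule Metadata (GM), Temporal label and Projected Extent (TPE), is flattened so that each cell becomes one Spark DataFrame row (i.e. one ML observation).]

Spark DataFrame Row (i.e. ML Observation):

Granule Metadata   Projected Extent of Tile + Cell Row/Column   Band Values at Single Cell
GM_A               TPE_A, [r1, c1]                              a[0], b[0], c[0], ...
...                ...                                          ...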
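
The flattening in the figure can be illustrated without Spark. The following is a minimal sketch in plain Python that turns a multiband tile into one record per cell. The `Tile` alias, the `explode_tile` function and the sample values are hypothetical stand-ins, not GeoTrellis or Spark types.

```python
from typing import Dict, Iterator, List, Tuple

Tile = List[List[float]]  # one band of a tile, indexed as tile[row][col]


def explode_tile(bands: Dict[str, Tile]) -> Iterator[Tuple[int, int, Dict[str, float]]]:
    """Yield one (row, col, {band: value}) record per cell.

    All bands must share the same dimensions.
    """
    if not bands:
        return
    first = next(iter(bands.values()))
    n_rows, n_cols = len(first), len(first[0])
    for name, tile in bands.items():
        if len(tile) != n_rows or any(len(r) != n_cols for r in tile):
            raise ValueError(f"band {name!r} does not match the tile dimensions")
    for row in range(n_rows):
        for col in range(n_cols):
            yield row, col, {name: tile[row][col] for name, tile in bands.items()}


# A hypothetical 2 x 2 tile with three bands, named as in the figure.
bands = {
    "a": [[1.0, 2.0], [3.0, 4.0]],
    "b": [[10.0, 20.0], [30.0, 40.0]],
    "c": [[100.0, 200.0], [300.0, 400.0]],
}
rows = list(explode_tile(bands))
for record in rows:
    print(record)
```

In a DataFrame the granule metadata and extent would be repeated on every row, which is what makes each row a self-contained observation.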

Delivering Imagery to ML
[Figure: pipeline from "World-wide data coverage" to "Scalable Machine Learning". Scenes/granules (Scene 1, Scene 2, …, Scene N) are laid out along a time axis (t0, t1, …, tf) and a wavelength axis (bands b1 … bn). The pipeline stages shown are Data Quality Check (DQC), Base Analytics Functional Form (BAFF), Exploratory Data Analysis (EDA), Feature Engineering, and Analytics Base Table (ABT). Two tables in the figure are each labeled "Distributed DataFrame". One further label, "SLAAW", is not expanded.]

Why This is Hard: Dimensionality
• Spatial (500 m → 5 m → 30 cm)
• Temporal (refresh rates: weeks → daily → hourly)
• Spectral (4 bands → 200 bands)
• Metadata
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
• Example sources: Landsat 8, Planet, DigitalGlobe, Planetary Resources

Why This is Hard: Data Footprint
As resolution scales, image size explodes

Source                      Resolution   Bands                  Size
Landsat 8 (NASA)            30 m         8                      0.5 GB/image
Planet PlanetScope Ortho    3 m          4                      16 GB/image
DigitalGlobe                30 cm        4                      1.0 TB/image
Planetary Resources         10 m         200 (hyper-spectral)   50 TB/image?

Data footprint for one football field size multiband raster (single point in time!)
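
Halving the cell size quadruples the number of cells covering the same ground area, which is why the footprint grows so quickly. The following is a rough sketch in plain Python of that arithmetic for an uncompressed raster. The 90 km side length, the band count and the 2 bytes per cell are all assumptions chosen for illustration, and the sketch does not attempt to reproduce the per-image sizes quoted above.

```python
def uncompressed_bytes(side_cm: int, resolution_cm: int, bands: int, bytes_per_cell: int) -> int:
    """Uncompressed size of a square multiband raster.

    Lengths are integer centimeters to avoid floating-point rounding.
    """
    cells_per_side = -(-side_cm // resolution_cm)  # ceiling division
    return cells_per_side * cells_per_side * bands * bytes_per_cell


# Hypothetical 90 km x 90 km area, 4 bands, 2 bytes per cell (all assumptions).
SIDE_CM = 9_000_000
sizes = {
    resolution_cm: uncompressed_bytes(SIDE_CM, resolution_cm, bands=4, bytes_per_cell=2)
    for resolution_cm in (3000, 300, 30)  # 30 m, 3 m, 30 cm
}
for resolution_cm, size in sizes.items():
    print(f"{resolution_cm / 100:g} m -> {size / 1e9:,.3f} GB")
```

Each tenfold improvement in resolution multiplies the cell count, and therefore the uncompressed size, by one hundred.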

Prototyping Spark Catalyst raster integration

CAPABILITY DEMONSTRATION

Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster
n × m, where n, m ≳ 1200 (e.g. Landsat 8: 7600²)
Each of these has one or more “bands” (e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)

⇓

Tile ~ Chip
n², where n ≲ 512 (typical: 64² to 256²)

⇓

Cell ~ Pixel
1 × 1

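
To make the raster-to-tile step concrete, here is a minimal sketch in plain Python that cuts a raster's cell grid into square tile windows. The function `tile_windows` is hypothetical, and letting edge tiles shrink rather than padding them is an assumption; a real tiling library may handle edges differently.

```python
from typing import Iterator, Tuple


def tile_windows(n_rows: int, n_cols: int, tile_size: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row_start, col_start, height, width) windows covering the raster.

    Edge tiles are smaller when the raster size is not a multiple of tile_size.
    """
    for row_start in range(0, n_rows, tile_size):
        for col_start in range(0, n_cols, tile_size):
            height = min(tile_size, n_rows - row_start)
            width = min(tile_size, n_cols - col_start)
            yield row_start, col_start, height, width


# A 7600 x 7600 raster (the Landsat 8 example size) cut into 256 x 256 tiles.
windows = list(tile_windows(7600, 7600, 256))
print(len(windows), "tiles; last window:", windows[-1])
```

Tiles of this size are small enough to be handled as individual values in a distributed collection, while the full raster is not.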
TileUDT and Friends
• Using the approach covered in the next section we register TileUDT with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:

§ vectorizeTiles
§ explodeTiles
§ localMax
§ localMin
§ localStats
§ localAdd
§ localSubtract
§ tileHistogram
§ tileStatistics
§ tileMean
§ aggHistogram
§ aggStats

See work-in-progress code and examples/tests in: https://github.com/s22s/geotrellis-spark-sql/

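
The slide lists function names without their behavior. As a rough illustration, the sketch below shows in plain Python what cell-wise ("local") operations and per-tile summaries commonly mean in map algebra. The semantics are assumed from the names, the `Tile` alias is a hypothetical stand-in, and none of this is the project's actual implementation.

```python
from collections import Counter
from typing import Callable, Dict, List

Tile = List[List[int]]  # hypothetical stand-in for a single-band tile


def local_op(op: Callable[[int, int], int], left: Tile, right: Tile) -> Tile:
    """Apply op cell by cell to two tiles of identical dimensions."""
    if len(left) != len(right) or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("tiles must have identical dimensions")
    return [[op(a, b) for a, b in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]


def tile_histogram(tile: Tile) -> Dict[int, int]:
    """Count how often each cell value occurs in one tile."""
    return dict(Counter(value for row in tile for value in row))


def tile_mean(tile: Tile) -> float:
    """Mean over all cells of one tile."""
    cells = [value for row in tile for value in row]
    return sum(cells) / len(cells)


t1 = [[1, 2], [3, 4]]
t2 = [[4, 3], [2, 1]]
added = local_op(lambda a, b: a + b, t1, t2)  # in the spirit of localAdd
maxed = local_op(max, t1, t2)                 # in the spirit of localMax
print(added, maxed, tile_histogram(maxed), tile_mean(t1))
```

Local operations return a tile, while the tile-level functions reduce a tile to a scalar or a small summary.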
TileUDT Notebook Demo

ZeppelinHub Version

From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame

IMPLEMENTATION

Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin

GeoTrellis
• GeoTrellis is an open source Scala framework for efficiently manipulating raster GIS data
• Provides facilities to ingest and process tiles at scale
• Has powerful abstractions for working with RDD[Tile]s
– Mosaicing, stitching, pyramiding, resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s “Map Algebra”

Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in DataFrames, it’s also available in SQL!

Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two Catalyst APIs:
– ExpressionEncoder[Tile], or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit

Anatomy of a UDT
[Code listing not captured in this extract. Its callouts, in order:]
• To access the private API, the class needs to live in a subpackage of Spark’s sql package
• Supertype parameterized on user type
• Name shown in schema and query plan
• Runtime class descriptor of user type
• Conversion from user data type to Catalyst encoding
• Conversion from Catalyst encoding to user data type
• Schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob.
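
Since the listing itself did not survive, the sketch below illustrates the roles the callouts describe in plain Python. It is not Spark code and does not use the Catalyst API: `SimpleTile`, `SimpleTileCodec`, the `st_tile` name and the byte layout are all hypothetical. It only shows the idea of a display name, a user class, and a pair of conversions to and from an opaque blob.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimpleTile:
    """Hypothetical stand-in for a tile: row-major 32-bit float cells."""
    cols: int
    rows: int
    cells: List[float]


class SimpleTileCodec:
    """Mirrors the roles of a UDT's members; it is not a Spark class."""

    type_name = "st_tile"    # name a schema or query plan would display
    user_class = SimpleTile  # runtime class of the user type

    @staticmethod
    def serialize(tile: SimpleTile) -> bytes:
        """User type -> opaque blob: two int32 dimensions, then float32 cells."""
        if len(tile.cells) != tile.cols * tile.rows:
            raise ValueError("cell count does not match tile dimensions")
        header = struct.pack("<ii", tile.cols, tile.rows)
        body = struct.pack(f"<{len(tile.cells)}f", *tile.cells)
        return header + body

    @staticmethod
    def deserialize(blob: bytes) -> SimpleTile:
        """Opaque blob -> user type."""
        cols, rows = struct.unpack_from("<ii", blob, 0)
        cells = struct.unpack_from(f"<{cols * rows}f", blob, 8)
        return SimpleTile(cols, rows, list(cells))


# Cell values are chosen to be exactly representable as 32-bit floats.
tile = SimpleTile(cols=2, rows=2, cells=[0.5, 1.5, 2.5, 3.5])
blob = SimpleTileCodec.serialize(tile)
restored = SimpleTileCodec.deserialize(blob)
print(len(blob), restored == tile)
```

An opaque blob keeps the encoding compact for large payloads, at the cost that the query engine cannot look inside the value without deserializing it.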

UDT Registration
• A user defined type is registered with Catalyst by providing a mapping between the native type and the UDT

Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a. “Generator”)
• Data Source
• Query Plan
• Optimization Rule
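
Of the tools listed, the aggregation function differs most from a plain UDF because it must combine partial results computed on separate partitions. The sketch below shows that shape in plain Python, in the spirit of an aggregate statistics function. The phase names (update, merge, evaluate) echo the usual phases of a distributed aggregation. `StatsBuffer`, `aggregate` and the sample partitions are hypothetical and are not Spark classes.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StatsBuffer:
    """Partial aggregation state for one partition."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        """Fold one input value into the buffer."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "StatsBuffer") -> "StatsBuffer":
        """Combine two partial buffers into a new one."""
        return StatsBuffer(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    def evaluate(self) -> dict:
        """Produce the final result from the merged buffer."""
        mean = self.total / self.count if self.count else float("nan")
        return {"count": self.count, "min": self.minimum, "max": self.maximum, "mean": mean}


def aggregate(partitions: Iterable[List[float]]) -> dict:
    """Aggregate each partition separately, then merge the partial buffers."""
    merged = StatsBuffer()
    for partition in partitions:
        partial = StatsBuffer()
        for value in partition:
            partial.update(value)
        merged = merged.merge(partial)
    return merged.evaluate()


result = aggregate([[1.0, 2.0], [3.0], [4.0, 6.0]])
print(result)
```

Because merge is associative, the partitions can be processed in any order and on different machines before the final evaluate step.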

Future Work
• GeoTrellis Layer Store as an integrated Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis

The End

THANK YOU!
