Professional Documents
Culture Documents
ApacheCon 2018 Geo 3 Calcite Spatial Hyde
ApacheCon 2018 Geo 3 Calcite Spatial Hyde
ApacheCon 2018 Geo 3 Calcite Spatial Hyde
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as
HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements
much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes.
Calcite is bringing spatial query to the masses!
@julianhyde
SQL
Query planning
Query federation
BI & OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Calcite, Drill, Eagle, Kylin
Architect at Looker
Apache Calcite
https://calcite.apache.org
https://github.com/apache/calcite
SELECT d.name, COUNT(*) AS c
FROM Emps AS e
Relational algebra JOIN Depts AS d USING (deptno)
WHERE e.age < 40
GROUP BY d.deptno
HAVING COUNT(*) > 5 Sort [c DESC]
Based on set theory, plus operators:
ORDER BY c DESC
Project, Filter, Aggregate, Union, Join,
Project [name, c]
Sort
Filter [c > 5]
Requires: declarative language (SQL),
query planner
Aggregate [deptno, COUNT(*) AS c]
Zachary’s
pizza
•
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
King
Yen
•
A spatial query •
Station
burger
Filippo’s 7 4
Station burger 5 6
package org.apache.calcite.runtime;
FROM Restaurants AS r
SpatialReference sr();
Geom transform(int srid);
WHERE ST_Distance( }
Geom wrap(Geometry g);
ST_MakePoint(r.x, r.y),
static class SimpleGeom implements Geom { … }
ST_MakePoint(6, 7)) < 1.5 }
Traditional DB indexing
•
techniques don’t work •
Sort •
CREATE /* b-tree */ INDEX
I_Restaurants
ON Restaurants(x, y);
•
A scan over a two-dimensional index only
Hash has locality in one dimension
Master
● Data-oriented
● Space-oriented
R-tree (a data-oriented structure)
R-tree (split vertically into 2)
R-tree (split horizontally into 4)
R-tree (split vertically into 8)
R-tree (split horizontally into 16)
R-tree (split vertically into 32)
Grid (a space-oriented structure)
Grid (a space-oriented structure)
King
Yen
•
Spatial query •
Station
burger
SELECT * Zachary’s
pizza
FROM Restaurants AS r •
WHERE ST_Distance(
ST_MakePoint(r.x, r.y), restaurant x y
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
Hilbert space-filling curve
Filippo’s 7 4 52
Algebraic rewrite
Scan [T]
Scan [T]
SELECT * SELECT *
FROM States AS s FROM States AS s
WHERE ST_Intersects(s.geometry, WHERE ST_Intersects(s.geometry,
ST_MakePoint(6, 7)) ST_GeomFromText('LINESTRING(...)'))
Montana
Tile index
Idaho Wyoming
We cannot use space-filling curves, because
each region (state or park) is a set of points and
not known as planning time.
Nevada Utah Colorado
Divide regions into a (coarse) set of tiles. They
intersect only if some of their tiles intersect.
Montana
Tile index 6 7 8
Idaho Wyoming
tileId state tileId state 3 4 5
0 Nevada 4 Utah
0 Utah 4 Wyoming
1 Utah 5 Wyoming 0
Nevada 1
Utah 2
Colorado
2 Colorado 6 Idaho
2 Utah 6 Montana
3 Idaho 7 Montana
3 Nevada 7 Wyoming
3 Utah 8 Montana
4 Idaho 8 Wyoming
CREATE MATERIALIZED
VIEW EmpSummary AS
Aside: Materialized views SELECT deptno, COUNT(*) AS c
FROM Emp
GROUP BY deptno;
A materialized view is a table
that is defined by a query Aggregate [deptno, count(*)]
Scan
The planner knows about the Scan [Emps]
[EmpSummary]
mapping and can
transparently rewrite queries empno name deptno deptno c
to use it
100 Fred 20 10 2
110 Barney 10 20 1
120 Wilma 30 30 1
130 Dino 10
Montana
Point-to-polygon query 6 7 8
Idaho Wyoming
What state am I in? (point-to-polygon) 3 4 5
1. Divide the plane into tiles, and pre-compute
the state-tile intersections 0
Nevada 1
Utah 2
Colorado
2. Use this ‘tile index’ to narrow list of states
SELECT s.*
FROM States AS s
WHERE s.stateId IN (SELECT stateId
FROM StateTiles AS t
WHERE t.tileId = 8)
AND ST_Intersects(s.geometry, ST_MakePoint(6, 7))
Algebraic rewrite
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)]
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)] TileSemiJoinRule
SemiJoin [S.stateId = T.stateId]
Scan [S]
Example query: Every minute, emit the number of journeys that have intersected
each city. (Some journeys intersect multiple cities.)
Table: splunk
Key: productId
Condition: Key: productName Key: c desc
action = 'purchase' Agg: count
scan
join
MySQL
filter group sort
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
Optimized query join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
Splunk
order by c desc
MySQL
join group sort
scan
Table: products
Calcite framework
Relational algebra SQL parser Transformation rules
RelNode (operator) SqlNode RelOptRule
• TableScan SqlParser • FilterMergeRule
• Filter SqlValidator • AggregateUnionTransposeRule
• Project • 100+ more
• Union Metadata Global transformations
• Aggregate Schema • Unification (materialized view)
• … Table • Column trimming
RelDataType (type) Function • De-correlation
RexNode (expression) • TableFunction
RelTrait (physical property) • TableMacro Cost, statistics
• RelConvention (calling-convention) Lattice
• RelCollation (sortedness) RelOptCost
• RelDistribution (partitioning) JDBC driver RelOptCostFactory
RelBuilder RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity