GeoKettle: A Powerful Spatial ETL Tool for Feeding Your Spatial Data Infrastructure (SDI)
Spatialytics
http://www.spatialytics.com
• What is GeoKettle?
• Basic features of GeoKettle
• Installing GeoKettle
• Spatial features of GeoKettle
• Practical learning: Exercises
• Conclusion
What is GeoKettle?
• GeoKettle forum
http://www.spatialytics.com/forum
• GeoKettle Trac
http://trac.spatialytics.com/geokettle
• GeoKettle plugins
http://trac.spatialytics.com/geokettle/wiki/Plugins
Introduction to basic features of GeoKettle
Transformations (1/3)
• The ETL processes are named
transformations
• Elements of a transformation are steps
• Links between steps are hops
• Parallel execution (threads) of steps
Transformations (2/3)
• Steps have configuration parameters (double-
click the step icon to open the dialog box):
– DB connection
– Filename to open
– Query filter
– Source code of a script (JavaScript)
– ...
• Steps categories:
– input
– output
– transformations
– flow
– scripting
– ...
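The step/hop model can be pictured outside GeoKettle with a small Python sketch. This is only an analogy, not GeoKettle code: each function stands for a step, chaining them stands for the hops, and rows pass from one step to the next one at a time. The function names and sample rows are made up for illustration, and the sketch does not reproduce the parallel (threaded) execution of real steps.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def input_step() -> Iterator[Row]:
    """Input step: emits rows one at a time (made-up sample values)."""
    yield {"name": "A", "population": 120000}
    yield {"name": "B", "population": 8000}
    yield {"name": "C", "population": 65000}


def filter_step(rows: Iterable[Row], minimum: int) -> Iterator[Row]:
    """Flow step: lets a row through only if it passes the condition."""
    for row in rows:
        if row["population"] >= minimum:
            yield row


def output_step(rows: Iterable[Row]) -> List[Row]:
    """Output step: collects rows (a real step would write a file or table)."""
    return list(rows)


# The nesting below plays the role of the hops: input -> filter -> output.
result = output_step(filter_step(input_step(), minimum=50000))
print(result)
```

Every row carries the same fields with the same types, which is what lets each step be configured once for the whole stream.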
Transformations (3/3)
• Hops link steps together and define the
data flow
• To create a hop: drag and drop from one step to
another with the middle mouse button
pressed (or Shift+left button)
• In a hop:
– data flows from the output of a step to the input of the
next step, row by row
– the field definition (number, names & types) is always
the same from one row to the next
• Different hop types:
– Spatial DBMS:
• PostgreSQL/PostGIS (native)
• MySQL spatial (native)
• Oracle Spatial / Locator (native)
• ESRI personal geodatabase*, Ingres*, Informix datablade*,
ArcSDE*, SQLite/SpatiaLite (through GDAL/OGR)
* requires valid licenses and GDAL/OGR re-compilation
• MS SQL Server 2008, IBM DB2, … (non native, requires
hints)
./pan.sh -file="/home/user/Desktop/geokettle_workshop/solutions/transformations/exercise_1/ex_1.ktr"
Exercise 2
• The aim of this exercise is to learn:
– A way to perform a spatial selection over geospatial features
in GeoKettle
– How to aggregate data in order to compute
statistics and export them to an MS Excel file
– How to create a job that performs the two previous
tasks sequentially
• In this exercise we will play with the following new steps/job entries:
– Filter rows
– Join rows (cartesian product)
– OGR File Input
– Sort rows
– Group by
– Excel Output
– Transformation
Exercise 2 – Part 1
• Design a transformation that:
– Reads data from the previous municipalities table and extracts
the_geom and name fields as muni_geom and muni_name fields
– Filters rows in order to keep only the county of Durham
– In parallel, reads data from a MapInfo TAB file located in the
ontario_rrn_tab directory. It is an extract of the national road
network from Geobase.ca.
– Selects only roads that intersect the Durham county
– Sets the SRS of data to WGS84
– Filters the stream in order to preserve only the_geom, ROADSEGID,
ROADCLASS, RTNUMBER1, RTENAME1EN attributes but renames them,
respectively, to the_geom, id, class, number and name
– Adds an identifier (numeric incremental id) to objects
– Stores data into a roads table of a geokettle database on your
PostgreSQL/PostGIS instance
– Finally, publish it in GeoServer
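The field selection, renaming and numbering described above can be sketched in plain Python. This is an illustration of the logic only, not of how GeoKettle steps are configured. The source and target field names come from the exercise; the name of the incremental identifier (`gid`) and the sample row values are assumptions.

```python
from itertools import count

# Source field -> target field, as listed in the exercise.
RENAMES = {
    "the_geom": "the_geom",
    "ROADSEGID": "id",
    "ROADCLASS": "class",
    "RTNUMBER1": "number",
    "RTENAME1EN": "name",
}


def select_and_rename(rows, renames):
    """Keep only the listed fields and rename them; drop everything else."""
    for row in rows:
        yield {new: row[old] for old, new in renames.items()}


def add_sequence(rows, field="gid", start=1):
    """Add a numeric incremental identifier ('gid' is a hypothetical name)."""
    counter = count(start)
    for row in rows:
        yield {**row, field: next(counter)}


# Made-up sample rows; EXTRA stands for any attribute that must be dropped.
rows = [
    {"the_geom": "LINESTRING (0 0, 1 1)", "ROADSEGID": 501, "ROADCLASS": "Local",
     "RTNUMBER1": "12", "RTENAME1EN": "Main St", "EXTRA": "x"},
    {"the_geom": "LINESTRING (1 1, 2 2)", "ROADSEGID": 502, "ROADCLASS": "Freeway",
     "RTNUMBER1": "401", "RTENAME1EN": "Highway 401", "EXTRA": "y"},
]
roads = list(add_sequence(select_and_rename(rows, RENAMES)))
print(roads)
```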
Exercise 2 – Part 2
• Design a transformation that:
– Reads data in the previously created roads table
– Converts coordinates of data from WGS84 to NAD83 (CSRS) /
UTM Zone 17N
– Computes, by script only, the length in km of each road segment
and adds the value in a new field named length
– Aggregates (sum) the values of length for all roads of the same
class and stores the total value in a new field named total_length
– Finally, exports aggregated data into an Excel file
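The length computation and the aggregation can be sketched in plain Python. The sketch assumes coordinates are already projected in metres (as they are in UTM) and that each geometry is a simple list of (x, y) pairs; the class names and coordinates are made up. It does not use the GeoKettle scripting API, whose geometry functions are not shown in these slides.

```python
import math
from collections import defaultdict


def length_km(coords):
    """Length of a polyline in km, for projected coordinates given in metres."""
    metres = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )
    return metres / 1000.0


def total_length_by_class(rows):
    """Sum the length field per road class (the 'Group by' part)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["class"]] += row["length"]
    return dict(totals)


# Made-up road segments in a metric projection.
rows = [
    {"class": "Freeway", "coords": [(0, 0), (3000, 4000)]},
    {"class": "Local", "coords": [(0, 0), (0, 1000), (1000, 1000)]},
    {"class": "Local", "coords": [(0, 0), (1000, 0)]},
]
for row in rows:
    row["length"] = length_km(row["coords"])

totals = total_length_by_class(rows)
print(totals)
```

The dictionary used here does not care about row order. Kettle's Group by step, by contrast, expects its input sorted on the grouping field, which is why a Sort rows step normally precedes it.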
Exercise 2 – Job
• Design a job that performs the two previous
tasks sequentially
• Run it in Spoon
• But also, try to run it with the Kitchen
command line tool
Exercise 2 – Part 1: Solution
Exercise 2 – Part 2: Solution
Exercise 2 – Job: Solution
Exercise 2 - Solution
./kitchen.sh -file="/home/user/Desktop/geokettle_workshop/solutions/transformations/exercise_2/ex_2.kjb"
Exercise 3
• The aim of this exercise is to learn how to:
– retrieve data from a WFS service
– perform some geo-processing operations with the Sextante
plugin
– and export the result to two different file formats: KML and
MapInfo
• In this exercise we will play with the following new steps/job entries:
– Sextante plugin
– OGR Output
– KML Output
– HTTP
Exercise 3 – Job
• Design a job that:
– Requests municipalities data in GML 2 from the GeoServer WFS
hosted on your VM. Use the layer preview in GeoServer in order
to retrieve the GET request to send.
– And runs a transformation that we will define in the next slide
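For reference, a WFS GetFeature request sent over GET is just a URL with a few key/value parameters. The sketch below assembles one in Python. The base URL and the layer name `geokettle:municipalities` are assumptions; the actual values are the ones shown by the layer preview of your own GeoServer instance.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wfs_getfeature_url(base_url, type_name, output_format="GML2", version="1.0.0"):
    """Build a WFS GetFeature GET request for one layer."""
    params = {
        "service": "WFS",
        "version": version,
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": output_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, to be replaced by your own.
url = wfs_getfeature_url("http://localhost:8080/geoserver/wfs", "geokettle:municipalities")
print(url)

# Decoding the query string again shows the parameters that will be sent.
print(parse_qs(urlsplit(url).query))
```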
Exercise 3 – Transformation
• Design a transformation that:
– Reads the GML file extracted from the WFS
– Removes holes from the polygons and stores the new
geometry of objects in a result_geom field
– Filters the stream in order to preserve only the gml_id,
name, county_name, designation, area and result_geom
fields
– Filters rows in order to keep only those with a valid,
non-null geometry
– And stores the resulting stream in a KML file and a MapInfo
MIF/MID file
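What "removing holes" means can be shown with a plain Python sketch, independent of the Sextante plugin. A polygon is represented here as a list of rings, with the exterior ring first and any holes after it; this representation, the sample rows and the simplified geometry check are assumptions made for illustration (the check is far weaker than a full validity test).

```python
def remove_holes(polygon):
    """Return a copy of the polygon that keeps only its exterior ring."""
    return [list(polygon[0])] if polygon else []


def has_usable_geometry(polygon):
    """Simplified check: non-null, closed exterior ring with at least 4 points."""
    if not polygon:
        return False
    exterior = polygon[0]
    return len(exterior) >= 4 and exterior[0] == exterior[-1]


# Made-up rows: one polygon with a hole, one row without geometry.
rows = [
    {"name": "with_hole",
     "geom": [[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
              [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]]},
    {"name": "no_geom", "geom": None},
]

# Same order as the exercise: compute result_geom first, then filter rows.
with_result = [{"name": r["name"], "result_geom": remove_holes(r["geom"])} for r in rows]
result = [r for r in with_result if has_usable_geometry(r["result_geom"])]
print(result)
```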
Exercise 3 – Job: Solution
Exercise 3 – Transform.: Solution
Exercise 4
• The aim of this exercise is to learn how to extract
some POI from an OSM data file
• Listen to the instructor, who will explain how
an OSM data file is structured
• In this exercise we will play with the following new
steps:
– Get data from XML
Exercise 4
• Design a transformation that:
– Extracts POI data from the OSM data file located in the
ottawa_osm directory
– Sets the SRS of data to WGS84
– And exports the result as an ESRI shapefile
– Finally, publish it in GeoServer
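As a reminder of the file structure, an OSM XML file stores points as `node` elements with `lat`/`lon` attributes and optional `tag` children holding key/value pairs. The Python sketch below extracts POI from such a document. Treating every node that has an `amenity` tag as a POI is an assumption made for this example, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up extract following the OSM XML layout.
OSM_SAMPLE = """<osm version="0.6">
  <node id="1" lat="45.4215" lon="-75.6972">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="45.4230" lon="-75.6950"/>
</osm>"""


def extract_pois(xml_text, key="amenity"):
    """Return one row per node that carries the given tag key."""
    root = ET.fromstring(xml_text)
    pois = []
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        if key in tags:
            pois.append({
                "id": int(node.get("id")),
                "name": tags.get("name"),
                key: tags[key],
                # WKT uses x y order, i.e. longitude first.
                "wkt": f"POINT ({node.get('lon')} {node.get('lat')})",
            })
    return pois


pois = extract_pois(OSM_SAMPLE)
print(pois)
```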
Exercise 4 – Solution
Exercise 5
• The aim of this exercise is to learn how to:
– Extract sensor data from a SOS
– Perform some spatial computation with the Spatial
Analysis step
– Retrieve some metadata on the data stream
– And push this metadata to a CSW
• Listen to the instructor, who will explain how to
proceed with the SOS and CSW steps
• In this exercise we will play with the following new
steps:
– SOS Input
– Spatial Analysis
– CSW Output
Exercise 5
• Design a transformation that:
– Retrieves GAUGE_HEIGHT measures from the SOS service
given by the instructor
– Removes rows where the measure value is <= 30
– Groups rows by procedure
– Computes the envelope of each resulting geometry
– Retrieves and sets some mandatory metadata
(MD_METADATA profile)
– And finally, publish the metadata in GeoNetwork
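The filter, group and envelope sequence can be sketched in plain Python. The field names (`procedure`, `value`, `x`, `y`) and the sample measures are hypothetical; the real field names are those delivered by the SOS Input step. The envelope is returned as (minx, miny, maxx, maxy).

```python
def envelopes_by_procedure(rows, threshold=30.0):
    """Drop measures <= threshold, then compute one bounding box per procedure."""
    boxes = {}
    for row in rows:
        if row["value"] <= threshold:
            continue
        x, y = row["x"], row["y"]
        minx, miny, maxx, maxy = boxes.get(row["procedure"], (x, y, x, y))
        boxes[row["procedure"]] = (min(minx, x), min(miny, y), max(maxx, x), max(maxy, y))
    return boxes


# Made-up GAUGE_HEIGHT measures.
rows = [
    {"procedure": "gauge-1", "value": 35.0, "x": 1.0, "y": 2.0},
    {"procedure": "gauge-1", "value": 40.0, "x": 3.0, "y": 1.0},
    {"procedure": "gauge-1", "value": 20.0, "x": 9.0, "y": 9.0},  # removed: <= 30
    {"procedure": "gauge-2", "value": 31.0, "x": 5.0, "y": 5.0},
]
boxes = envelopes_by_procedure(rows)
print(boxes)
```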
Exercise 5 – Solution
Exercise 6
• The aim of this exercise is to learn how to harvest
metadata from a CSW compliant service
• In this exercise we will play with the following new
steps:
– CSW Input
– Dummy
Exercise 6
• Design a transformation that:
– Harvests metadata from the geocat.ch online catalog
– Filters metadata in order to keep only records that deal with datasets
– For each metadata row, computes by script the extent of
the dataset
– And exports the BriefRecord_title, BriefRecord_type and
the extent to a new PostGIS table named meta_extent
– Finally, publish this new table in GeoServer
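One way to picture the extent computation: a catalogue record usually describes its bounding box by a lower corner and an upper corner, and the extent is the rectangle built from them. The Python sketch below turns two such corners into a WKT polygon. It assumes each corner arrives as a string of two numbers in x y order; the actual field names and axis order delivered by the CSW Input step are not given in these slides, and the sample values are made up.

```python
def bbox_to_wkt(lower_corner, upper_corner):
    """Build a WKT rectangle from 'minx miny' and 'maxx maxy' strings."""
    minx, miny = (float(v) for v in lower_corner.split())
    maxx, maxy = (float(v) for v in upper_corner.split())
    ring = [(minx, miny), (maxx, miny), (maxx, maxy), (minx, maxy), (minx, miny)]
    return "POLYGON ((" + ", ".join(f"{x:g} {y:g}" for x, y in ring) + "))"


# Made-up corners for illustration.
extent = bbox_to_wkt("5.9 45.8", "10.5 47.8")
print(extent)
```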
Exercise 6 – Solution
Exercise 7
• The aim of this exercise is to learn how to call a
process hosted in a WPS compliant service
• In this exercise, we will create a new layer from
our polygons layer (municipalities) hosted in
GeoServer by applying on each polygon a
Centroid WPS service
• In this exercise we will play with the following
new steps:
– Add constants
– HTTP Client
Exercise 7 – Job
• Design a job that:
– Requests municipalities data in GML 2 from the GeoServer WFS
hosted on your VM. Use the layer preview in GeoServer in order
to retrieve the GET request to send.
– And runs a transformation that we will define in the next slide
Exercise 7 – Transformation
• Design a transformation that:
– Reads the GML file extracted from the WFS
– For each row, calls the Centroid service hosted in the Zoo
WPS instance on your VM
– Stores the result in a new table named muninames in your
PostGIS DBMS instance.
– Finally, publish it in GeoServer.
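To make clear what the remote service returns, here is a local Python sketch of the usual area-weighted centroid of a simple polygon. It only illustrates the result you should expect for each municipality; it is not the code of the Zoo service, and whether the service uses exactly this method is an assumption. The ring must be closed (first point repeated at the end) and holes are ignored.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a closed ring given as (x, y) pairs."""
    twice_area = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if twice_area == 0:
        raise ValueError("degenerate ring: zero area")
    # Cx = sum / (6 * A), and twice_area = 2 * A, hence the factor 3.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# Made-up 4 x 2 rectangle.
centroid = polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2), (0, 0)])
print(centroid)
```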
Exercise 7 – Job: Solution
Exercise 7 – Transform.: Solution
Exercise 8
• Based on exercise 4, design a transformation that
extracts the road network from the Ottawa OSM
data file
• In this exercise we will play with the following new
steps:
– Shapefile File Output
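Compared with exercise 4, the difference is that roads are stored as `way` elements, which only reference their nodes through `nd ref` children. The Python sketch below resolves these references into coordinates. Selecting ways that carry a `highway` tag follows the common OSM tagging convention and is an assumption here; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up extract: two nodes, one road and one building outline.
OSM_SAMPLE = """<osm version="0.6">
  <node id="1" lat="45.42" lon="-75.70"/>
  <node id="2" lat="45.43" lon="-75.69"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Example Street"/>
  </way>
  <way id="11">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="building" v="yes"/>
  </way>
</osm>"""


def extract_roads(xml_text):
    """Return one row per way tagged 'highway', with resolved coordinates."""
    root = ET.fromstring(xml_text)
    # Index nodes by id so that way references can be looked up.
    coords = {n.get("id"): (float(n.get("lon")), float(n.get("lat")))
              for n in root.iter("node")}
    roads = []
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" not in tags:
            continue
        points = [coords[nd.get("ref")] for nd in way.findall("nd")
                  if nd.get("ref") in coords]
        if len(points) >= 2:  # a line needs at least two points
            roads.append({"id": int(way.get("id")), "highway": tags["highway"],
                          "name": tags.get("name"), "coords": points})
    return roads


roads = extract_roads(OSM_SAMPLE)
print(roads)
```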
Exercise 8 – Solution
Exercise 9
• The aim of this exercise is to learn how to:
– Retrieve location information from some Twitter
tweets
– Call the geonames gazetteer service in order to
retrieve lat/lon information for tweets that have
no geo tag
• Listen to the instructor, who will explain how
the Twitter and geonames services work
• In this exercise we will play with the following new
steps:
– Unique rows (HashSet)
– Generate rows
Exercise 9
• Design a transformation that:
– Retrieves tweets mentioning the #foss4g tag
– For each tweet, checks if there is a geo info present
– If not, uses the location info and calls the geonames.org
gazetteer in order to retrieve the lat/lon of this location
– Stores the result in a new table named tweets in your
geokettle database in the PostGIS DBMS.
– Finally, publish it in GeoServer
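The decision logic of this transformation can be sketched in plain Python. The tweet fields (`geo`, `location`, `text`) and the `lookup` function are hypothetical: the gazetteer is replaced by a small in-memory table with approximate, made-up coordinates, so no web service is called.

```python
def locate_tweet(tweet, gazetteer_lookup):
    """Use the tweet's own geo tag if present, else fall back to the gazetteer."""
    if tweet.get("geo") is not None:
        lat, lon = tweet["geo"]
        return {"text": tweet["text"], "lat": lat, "lon": lon, "source": "geo"}
    location = (tweet.get("location") or "").strip()
    coords = gazetteer_lookup(location) if location else None
    if coords is None:
        return {"text": tweet["text"], "lat": None, "lon": None, "source": "none"}
    lat, lon = coords
    return {"text": tweet["text"], "lat": lat, "lon": lon, "source": "gazetteer"}


# Stand-in for the gazetteer: place name -> (lat, lon), approximate values.
FAKE_GAZETTEER = {"Denver": (39.74, -104.98)}


def lookup(name):
    """Hypothetical lookup; a real one would query the gazetteer service."""
    return FAKE_GAZETTEER.get(name)


tweets = [
    {"text": "Hello #foss4g", "geo": (45.42, -75.69), "location": "Ottawa"},
    {"text": "On my way #foss4g", "geo": None, "location": "Denver"},
    {"text": "No clue #foss4g", "geo": None, "location": ""},
]
located = [locate_tweet(t, lookup) for t in tweets]
print(located)
```

Looking up each distinct location only once, rather than once per tweet, keeps the number of gazetteer calls down.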
Exercise 9 – Solution
Conclusion
Upcoming features
• Versions 2.x will be the last versions of GeoKettle based on
the Kettle 3.2 code base.
• Thanks to the tremendous work of the Kettle developers,
future versions of GeoKettle will be more pluggable with
Kettle
• Hence, it will be possible to add spatial extensions provided
by GeoKettle to any Kettle/PDI 4.x installation.
• Taking advantage of this architecture switch, we want to perform a
re-engineering of the Geometry data type.
• At present, it only supports 2D data.
• We want to allow support for:
– X,Y,Z,t and M data
– LiDAR data
– Linear referencing
– Raster data
Upcoming features
• So many tasks can be automated with GeoKettle.
• We can think about many new steps in future
releases ...
• But, you know, the roadmap can be influenced by
opportunities ...
• So, we are open to your ideas, opportunities and
possible sponsoring to have your required feature
implemented
• Spatialytics can also provide:
– Support (1st and 2nd line through partners)
– Advanced training
– Partnership in tenders
– ...
Upcoming features
• A non-exhaustive list of additional steps/job entries that could be envisaged:
– Additional geometric data cleansing and geo-processing
capabilities:
• inclusion of some JCS/OpenJump conflation & topology
checking/cleansing capabilities (GPL -> plugin)
• Towards a geospatial data quality module to check and correct errors
– Read/write support for other DBMS, GIS file formats and services
• NetCDF, SDMX, Linked Geodata, ...
• Native support for MS SQL Server 2008, Netezza spatial, NoSQL dbs, ...
• Native support for WFS-T, WPS, WMS, Table Joining Service (TJS), ...
– Dedicated steps:
• Social media (Twitter, ...), OSM, cartographic generalisation, geocoding &
reverse geocoding ...
– Direct publishing into GeoServer and MapServer
– But also, why not see GeoKettle as a possible data source for these
web servers ...
– Raster support: re-initiating the development of a plugin to integrate all
raster capabilities provided by the Sextante library (BeETLe project)
To learn more about GeoKettle
• Do not hesitate to:
– Visit our web sites
• http://www.spatialytics.com
• http://www.spatialytics.org
– Subscribe to the monthly Spatialytics eNewsletter
– Follow us on Twitter and Facebook
– Check the documentation on the wiki
– Post your questions on the forum
– Submit a bug report or feature request on the
GeoKettle trac
– Contact us
Questions
Contact info:
Dr. Thierry Badard, CTO
Spatialytics inc.
Quebec, Canada
Email: tbadard@spatialytics.com
Web: http://www.spatialytics.org
http://www.spatialytics.com
Twitter: tbadard, spatialytics