GeoKettle: A powerful spatial ETL tool for feeding your Spatial Data Infrastructure (SDI)

Dr. Thierry Badard, CTO


FOSS4G 2011 Workshop, Denver, CO, USA, September 12, 2011


• These slides constitute the training material used
for the GeoKettle workshop given by Spatialytics
during the FOSS4G 2011 conference
• They are available online in PDF format:
• They are released under the terms of the Creative
Commons CC-BY-SA license.

• What is GeoKettle?
• Basic features of GeoKettle
• Installing GeoKettle
• Spatial features of GeoKettle
• Practical learning: Exercises
• Conclusion
What is GeoKettle?

• It is an open source Spatial ETL tool
• It is part of the geospatial BI software stack
developed initially by the GeoSOA research group
at Laval University in Quebec
• The stack is now developed and supported by
– (open source community)
– (professional support, training &
services but also Enterprise Editions which include support)
• The stack comprises:
– GeoKettle
– GeoMondrian
– SOLAPLayers/GeoBIExt
What is Geospatial BI (GeoBI)?

• Want to know more about GeoBI and what this
type of application can do for you?
– Please attend my presentation entitled “Building
professional geo-analytical dashboards and
reports with GeoBIExt”
Time slot: Friday - 11:00am - 11:30am
Room: Denver
• In this workshop, we will focus on GeoKettle
capabilities and how it can facilitate your
everyday life while playing with geospatial data, SDI,
web services, GIS formats, spatial databases, ...
What is an ETL tool?
• A type of software used to populate databases or
data warehouses from heterogeneous data sources
• ETL stands for:
– Extract – Extract data from data sources
– Transform – Transform data in order to correct
errors, cleanse data, change the data structure,
make it compliant with defined standards, etc.
– Load – Load transformed data into a target DBMS,
service, file format ...
• An ETL tool should manage the insertion of new
data and the updating of existing data
• Should be able to perform transformations from:
– An OLTP system to another OLTP system
– An OLTP system to an analytical data warehouse
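The three ETL stages and the insert-or-update requirement can be illustrated with a minimal Python sketch. It is not how GeoKettle works internally (GeoKettle builds this graphically); the CSV content, the places table and the SQLite target are all assumptions chosen so the example runs on its own.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with inconsistent casing and spacing.
SOURCE_CSV = "id,name\n1, toronto \n2,OTTAWA\n"


def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse values and enforce the target structure."""
    return [(int(r["id"]), r["name"].strip().title()) for r in rows]


def load(conn, rows):
    """Load: insert new rows and update rows whose id already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS places (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO places (id, name) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
# Loading a changed row again updates it instead of duplicating it.
load(conn, [(2, "Ottawa (ON)")])
result = conn.execute("SELECT id, name FROM places ORDER BY id").fetchall()
```

The second call to load shows the "updating of existing data" case: row 2 is replaced rather than inserted twice.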
Why use an ETL tool?
• Automation of complex and repetitive data
processing without writing any specific code
• Conversion between various data formats
• Migration of data from a DBMS to another
• Data feeding into various DBMS
• Population of analytical data warehouses
for decision support purposes
• etc.
• A "spatially-enabled" version of Pentaho Data
Integration (Kettle)
• Kettle is a metadata-driven ETL with direct
execution of transformations
– No intermediate code generation!
• Kettle supports several DBMS and file formats
– DBMS support: MySQL, PostgreSQL, Oracle, DB2, MS
SQL Server, ... (total of 37)
– Read/write support of various data file formats: text,
Excel, Access, DBF, XML, …
– Various services/systems: LDAP, CRM, ...
• Numerous transformation steps
– A transformation is built in a GUI and can be seen as a
chain of transformation steps
• Methods for the updating of databases and DW
• GeoKettle provides a true and consistent integration
of the spatial component
– All steps provided by Kettle are able to deal with geospatial
data types
– Some geospatial dedicated steps have been added (SRS,
SOS, CSW, Spatial Analysis, …)
– This allows powerful integration of corporate + spatial data
• First release in May 2008: 2.5.2-20080531
• Version 3.2.0-20090706 in July 2009
• Current stable version: 2.0 stable (Sept. 2011)
• Released under LGPL
• Used in different organizations and countries:
– Some ministries, public bodies, utilities, bank, insurance,
integrators, …
• A growing community of users and contributors
GeoKettle – Online resources
• GeoKettle project page

• GeoKettle documentation (wiki)

• GeoKettle forum
• GeoKettle Trac
• GeoKettle plugins
Introduction to basic features of GeoKettle
Transformations (1/3)
• The ETL processes are named transformations
• Elements of a transformation are steps
• Links between steps are hops
• Parallel execution (threads) of steps

Transformations (2/3)
• Steps have configuration parameters (double-click
the step icon to open the dialog box):
– DB connection
– Filename to open
– Query filter
– Source code of a script (JavaScript)
– ...

• Steps categories:
– input
– output
– transformations
– flow
– scripting
– ...
Transformations (3/3)
• Hops link steps together and define the
data flow
• To create a hop: drag and drop from a step to
another with the middle button of the mouse
pressed (or Shift+left button)
• In a hop:
– data flows from the output of a step to the input of the
next step, row by row
– fields definition (number, names & types) is always
the same from one row to another
• Different hop types: copy, distribute, conditional output
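The difference between the copy and distribute hop types can be sketched in a few lines of Python. This is only an illustration of the two row-routing behaviours (every target gets every row versus rows dealt out in turn); the function names are hypothetical and the real engine streams rows between threads rather than building lists.

```python
from itertools import cycle


def copy_rows(rows, n_targets):
    """Copy: every target step receives every row."""
    return [list(rows) for _ in range(n_targets)]


def distribute_rows(rows, n_targets):
    """Distribute: rows are dealt out to the target steps in turn."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


rows = ["row 1", "row 2", "row 3"]
copied = copy_rows(rows, 2)
distributed = distribute_rows(rows, 2)
```

With three rows and two target steps, copy delivers all three rows to each target, while distribute sends rows 1 and 3 to the first target and row 2 to the second.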

• A job defines a series of job entries (tasks) to
run sequentially
• These tasks can be:
– transformations
– SQL queries
– file operations (copy, delete, upload, etc.)
– conditional tests
– scripts (shell, JavaScript)
– e-mailing operations (send / receive emails)
– other jobs
– etc.
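A job can be pictured as an ordered list of tasks that run one after another. The sketch below is a simplified model in Python, assuming that the job stops at the first failing entry; the entry names are hypothetical and real jobs can also branch on success or failure.

```python
def run_job(entries):
    """Run (name, callable) job entries in order; stop at the first failure."""
    log = []
    for name, task in entries:
        ok = bool(task())
        log.append((name, ok))
        if not ok:
            break
    return log


# Hypothetical job: the second entry fails, so the third one never runs.
job_log = run_job([
    ("load roads", lambda: True),
    ("compute stats", lambda: False),
    ("send e-mail", lambda: True),
])
```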
The different GeoKettle tools
• Spoon: GUI for editing transformations
and jobs
• Pan: command line interface for running
transformations
• Kitchen: command line interface for running
jobs
• Carte: Web service for the remote execution of
transformations and jobs
– Allows exposing and running transformations and data
integration processes as web services ...
– Remote execution and running transformations in a
cluster environment (i.e. in the cloud)
• Transformations and jobs are usually
saved in XML files (.ktr/.kjb)
• Alternatively, they can be saved in a
database repository and hence be
shared between users more easily
– Transformations, jobs and connection
parameters to DBMS are stored in a dedicated
repository database
– See the first pop-up window when running
Spoon
• Enables the preservation/centralisation of
knowledge about data integration processes
inside the company
Installing & compiling GeoKettle
Compiling GeoKettle?
• To get all the latest features of GeoKettle
– Get the source code and compile GeoKettle!
• Requirements:
– Subversion Client (Eclipse Subversive or Tortoise SVN)
– Java JDK version 5 or higher
– Apache Ant
• 3 steps:
% svn co­2.0/trunk geokettle
% cd geokettle
% ant
% ant zip (builds a binary distribution archive of GeoKettle)
% ant zip-plugins (builds a binary distribution archive of GeoKettle including
selected plugins)
Installation procedure
• Available (2.0-RC1) on OSGeo Live DVD but we will use the
2.0 stable version in the workshop
• Very simple installation procedure without the installer
– See documentation on GeoKettle wiki
• Even simpler with the new installer!
• Prerequisites:
– All you need is a Java Runtime Environment
– Version 5.0 or higher
• Start the OSGeo Live Virtual Machine (if not already done)
• Download and start the installer inside the VM:
• When done, double click the GeoKettle icon on the desktop to
run it
– Please wait for instructions when the first window (repository selection)
pops up!
Spatial features of GeoKettle
Transparent spatial support
• Consistent and transparent integration
of the geometry data types:
– Vector geometry (based on JTS – point-line-polygon model)
– Transparent conversions between data types
• Geometry ↔ String: from and to WKT
• Geometry ↔ Binary: from and to WKB
– Native I/O support for some spatial DBMS
(via JDBC or through GDAL/OGR)
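To make the WKT and WKB conversions concrete, here is a minimal Python sketch for a single 2D point. GeoKettle performs these conversions itself (through JTS); the helper names below are hypothetical and only the simplest geometry type is handled.

```python
import re
import struct


def point_to_wkt(x, y):
    """Geometry -> String: encode a 2D point as Well-Known Text."""
    return f"POINT ({x} {y})"


def wkt_to_point(wkt):
    """String -> Geometry: parse a 2D WKT point into an (x, y) tuple."""
    match = re.fullmatch(
        r"\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*", wkt, re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def point_to_wkb(x, y):
    """Geometry -> Binary: little-endian WKB (flag 1, type 1 = Point, x, y)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Binary -> Geometry: decode a little-endian WKB point."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian 2D points are handled here")
    return x, y


wkt = point_to_wkt(-79.38, 43.65)
wkb = point_to_wkb(-79.38, 43.65)
```

A WKB point is 21 bytes long: one byte-order flag, a 4-byte geometry type and two 8-byte doubles.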
Inputs / outputs
• Read/write support:

– Spatial DBMS:
• PostgreSQL/PostGIS (native)
• MySQL spatial (native)
• Oracle Spatial / Locator (native)
• ESRI personal geodatabase*, Ingres*, Informix datablade*,
ArcSDE*, SQLite/SpatiaLite (through GDAL/OGR)
* requires valid licenses and GDAL/OGR re-compilation
• MS SQL Server 2008, IBM DB2, … (non native, requires

– GIS file formats:

• ESRI ShapeFile, GML 3.1.1, KML 2.2
• And all GIS file formats provided by GDAL/OGR
– Arc/Info, GeoJSON, GeoConcept, GeoRSS, GML 2.x,
GPX, KML 2.0, ...
Inputs / outputs
• Read/write support:

– Geospatial web services:

• SOS (read only)
• No dedicated steps yet but possible:
– WFS, WMS, WPS, …
– We will see how in this workshop! ;-)

• On the fly preview/geopreview

– Lets you check whether a transformation produces the expected results
on a smaller dataset
– Offers different widgets: pan, zoom, get object attributes,
symbolization (color, opacity, ...)
– Can preview streams with more than one geometry column
Spatial analysis
• Accessing and processing Geometry objects in JavaScript
– Based on Mozilla Rhino
– It allows the definition of custom transformation steps by the user
(“Modified Javascript Value” step)
– JTS (Java Topology Suite) and Sextante API fully available!
– JCS (Java Conflation Suite) processing capabilities should be available
soon …
• Spatial analysis functions
– Topological predicates: intersects, touches, within, …
• Join and Filtering steps
– Spatial functions: union, intersection, length, buffer, ...
• Modified JavaScript Value, Spatial Analysis and Calculator steps
– Aggregative operators: union, geometry collection, bounding
box, …
• Group by step
– Advanced geoprocessing: delaunay, remove holes, simplify, smooth, ...
• Sextante plugin
SRS & coordinates transformation
• Native support of Spatial Reference Systems
(SRS) in metadata of the Geometry fields
(based on GeoTools – referencing library)
• Coordinates transformation / Change of
Spatial Reference System
– SRS Transformation step

• Assign a SRS to a data flow

– Set SRS step

• Reading and writing of SRS metadata

– Read SRS from data source: databases and GIS files
– Validation of SRS when inserting data into PostGIS and
• Other DBMS do not support this feature yet!
– Add the SRS info when writing data into GIS file
Practical learning: Exercises!
Before beginning the exercises ...
• Start the OSGeo Live Virtual Machine (if not already done)
and log in
• Download the archive containing data and solutions to
the different exercises of this workshop
– Unzip the archive on your Desktop
– It contains 3 sub directories:
• data
– input
– output
• solutions
– transformations
» exercise_0 to exercise_9
• transformations
• We are now ready!
Exercise 0
• We will do this first exercise all together, step by
step in order to discover GeoKettle
• The aim of this exercise is to know how to load an
ESRI shapefile into a PostGIS database and have it
published properly in GeoServer
• In this exercise we will play with the following new steps:
– Shapefile File Input
– Set SRS
– Select Values
– Add sequence
– Table Output
Exercise 0
• Design a transformation that:
– Reads the Shapefile contained in the ontario_names_shp
data directory. It is a set of points that locate geo names
for the whole Ontario province in Canada (source:
– Assigns the EPSG 4326 SRS code (WGS 84) to data
– Filters the stream in order to preserve only the_geom,
REGIONNAME attributes
– Adds an identifier (numeric incremental id) to objects
– Stores data into a geonames table of a geokettle database
on your PostgreSQL/PostGIS instance
– Finally, publish it in GeoServer
Exercise 0 – Solution
• From this point, do the exercises by yourself
• The exercises get progressively more difficult
• The aim is not to follow the step-by-step procedures
mentioned in the exercises
• We want you to become more and more
efficient/autonomous and aware of how to perform
tasks in GeoKettle
• That's why instructions will be less and less detailed
as we progress through the exercises
Exercise 1
• The aim of this exercise is to know how to perform
some basic computation (compute the area of
polygons) with GeoKettle
• In this exercise we will play with the following new steps:
– SRS Transformation
– Calculator
– Modified JavaScript Value
Exercise 1
• Based on the previous transformation, design a new one that:
– Reads the Shapefile contained in the ontario_mrc_shp data directory. It is
a set of polygons that represents some counties in the Ontario province in
Canada (source: Geobase,
– Converts coordinates of data from WGS84 to NAD83 (CSRS) / UTM Zone
– Computes the area of each polygon and adds the value in a new field named area_meters
– Converts by scripting area_meters values from m2 to km2 and stores this
value in a new field named area
– Filters the stream in order to preserve only the_geom, COMMONAME1,
LEGALNAME1, DESIGNATN attributes but renames them resp. as
the_geom, name, county_name, designation
– Converts back coordinates to WGS84
– Adds an identifier (numeric incremental id) to objects
– Stores data into a municipalities table of a geokettle database on your
PostgreSQL/PostGIS instance
– Finally, publish it in GeoServer
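The reason for reprojecting to a UTM zone before computing areas is that the coordinates are then in metres, so the result is in m2 rather than in square degrees. The Python sketch below illustrates the two computations on a hypothetical rectangle already expressed in projected coordinates; in the workshop the Calculator step computes the area and a Modified JavaScript Value script does the unit conversion.

```python
def ring_area_m2(ring):
    """Shoelace formula on a closed ring of (x, y) coordinates in metres."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def m2_to_km2(area_m2):
    """1 km2 = 1,000,000 m2."""
    return area_m2 / 1_000_000.0


# Hypothetical 2000 m x 1500 m rectangle in a projected (metric) SRS.
rectangle = [(0, 0), (2000, 0), (2000, 1500), (0, 1500), (0, 0)]
area_meters = ring_area_m2(rectangle)
area = m2_to_km2(area_meters)
```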
Exercise 1
• Run this transformation in Spoon in order to test it
• When finished, try to run it with the pan command
line tool
Exercise 1 – Solution
Exercise 1 - Solution

./ -file=”/home/user/Desktop/geokettle_workshop/solutions/
Exercise 2
• The aim of this exercise is to know:
– A way to perform some spatial selection over geospatial features
in GeoKettle
– How to perform some data aggregation in order to compute
statistics on data and export these stats in a MS Excel file
– How to create a job that performs the two previous
tasks sequentially
• In this exercise we will play with the following new steps/job entries:
– Filter rows
– Join rows (cartesian product)
– OGR File Input
– Sort rows
– Group by
– Excel Output
– Transformation
Exercise 2 – Part 1
• Design a transformation that:
– Reads data from the previously created municipalities table and extracts
the_geom and name fields as muni_geom and muni_name fields
– Filters rows in order to keep only the county of Durham
– In parallel, reads data from a MapInfo TAB file located in the
ontario_rrn_tab directory. It is an extract of the national road
network stemming from
– Selects only roads that intersect the Durham county
– Sets the SRS of data to WGS84
– Filters the stream in order to preserve only the_geom, ROADSEGID,
ROADCLASS, RTNUMBER1, RTENAME1EN attributes but renames them
resp. as the_geom, id, class, number and name
– Adds an identifier (numeric incremental id) to objects
– Stores data into a roads table of a geokettle database on your
PostgreSQL/PostGIS instance
– Finally, publish it in GeoServer
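The spatial selection in this exercise amounts to a cartesian product of the two streams followed by a filter on a spatial predicate. The Python sketch below shows that pattern only; it uses a bounding-box overlap test as a coarse stand-in for the real intersects predicate, and the sample geometries are made up.

```python
from itertools import product


def bbox(coords):
    """Bounding box (min_x, min_y, max_x, max_y) of a coordinate list."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_intersect(a, b):
    """True when two bounding boxes overlap or touch (NOT a true intersects)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select_roads(roads, counties, county_name):
    """Join every road with every county, keep roads meeting the named county."""
    selected = []
    for road, county in product(roads, counties):
        if county["muni_name"] != county_name:
            continue
        if bboxes_intersect(bbox(road["the_geom"]), bbox(county["muni_geom"])):
            selected.append(road)
    return selected


# Made-up sample data.
roads = [
    {"id": 1, "the_geom": [(0, 0), (5, 5)]},
    {"id": 2, "the_geom": [(20, 20), (30, 30)]},
]
counties = [
    {"muni_name": "Durham",
     "muni_geom": [(4, 4), (10, 4), (10, 10), (4, 10), (4, 4)]},
    {"muni_name": "Other",
     "muni_geom": [(15, 15), (40, 15), (40, 40), (15, 40), (15, 15)]},
]
durham_road_ids = [road["id"] for road in select_roads(roads, counties, "Durham")]
```

Filtering the county stream down to Durham before the join, as the exercise asks, keeps the cartesian product small.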
Exercise 2 – Part 2
• Design a transformation that:
– Reads data in the previously created roads table
– Converts coordinates of data from WGS84 to NAD83 (CSRS) /
UTM Zone 17N
– Computes by script the length in km of each road segment
and adds the value in a new field named length
– Aggregates (sum) the values of length for all roads of the same
class and stores the total value in a new field named total_length
– Finally, exports aggregated data into an Excel file
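The length computation and the aggregation can be sketched in Python as follows. The sketch assumes coordinates already projected to a metric SRS (hence the conversion to UTM above); the road classes and geometries are made up, and the sorted output mirrors the Sort rows step placed before Group by.

```python
import math
from collections import defaultdict


def length_km(coords):
    """Length of a polyline given in metres, returned in kilometres."""
    metres = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )
    return metres / 1000.0


def total_length_by_class(roads):
    """Sum the length of all road segments sharing the same class."""
    totals = defaultdict(float)
    for road in roads:
        totals[road["class"]] += length_km(road["the_geom"])
    return dict(sorted(totals.items()))


# Made-up road segments in projected coordinates (metres).
roads = [
    {"class": "Freeway", "the_geom": [(0, 0), (3000, 4000)]},
    {"class": "Local", "the_geom": [(0, 0), (0, 1000), (1000, 1000)]},
    {"class": "Freeway", "the_geom": [(0, 0), (6000, 8000)]},
]
total_length = total_length_by_class(roads)
```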
Exercise 2 – Job
• Design a job that performs the two previous
tasks sequentially
• Run it in Spoon
• But also, try to run it with the Kitchen
command line tool
Exercise 2 – Part 1: Solution
Exercise 2 – Part 2: Solution
Exercise 2 – Job: Solution
Exercise 2 - Solution

./ -file=”/home/user/Desktop/geokettle_workshop/solutions/
Exercise 3
• The aim of this exercise is to know how to:
– retrieve data from a WFS service
– perform some geo-processing operations with the Sextante
plugin
– and export the result to two different file formats: KML and
MapInfo MIF/MID
• In this exercise we will play with the following new steps/job entries:
– Sextante plugin
– OGR Output
– KML Output
Exercise 3 – Job
• Design a job that:
– Requests municipalities data in GML 2 from the GeoServer WFS
hosted on your VM. Use the layer preview in GeoServer in order
to retrieve the GET request to send.
– And runs a transformation that we will define in the next slide
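For reference, a WFS GET request of this kind is a plain URL with a handful of key-value parameters. The Python sketch below assembles one using the standard WFS 1.0.0 parameter names; the base URL and the workshop:municipalities layer name are assumptions, so copy the real request from the GeoServer layer preview.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wfs_getfeature_url(base_url, type_name, output_format="GML2"):
    """Build a WFS 1.0.0 GetFeature request using key-value parameters."""
    params = {
        "service": "WFS",
        "version": "1.0.0",
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": output_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name; adapt both to your own VM.
url = wfs_getfeature_url(
    "http://localhost:8080/geoserver/wfs", "workshop:municipalities"
)
query = parse_qs(urlsplit(url).query)
```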
Exercise 3 – Transformation
• Design a transformation that:
– Reads the GML file extracted from the WFS
– Removes holes from the polygons and stores the new
geometry of objects in a result_geom field
– Filters the stream in order to preserve only the gml_id,
name, county_name, designation, area and result_geom
– Filters rows that have a valid and not null geometry
– And stores the resulting stream in a KML file and a MapInfo
MIF/MID file
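What hole removal and the geometry filter do to the stream can be sketched in Python. This is not the Sextante algorithm: polygons are modelled here as a list of rings (exterior first, then holes) and the validity check is deliberately minimal, both being assumptions made for the example.

```python
def remove_holes(polygon):
    """Keep only the exterior ring of [exterior, hole_1, hole_2, ...]."""
    if not polygon:
        return None
    return [polygon[0]]


def has_valid_geometry(row):
    """Minimal check: geometry present, exterior ring closed with >= 4 points."""
    geom = row.get("result_geom")
    return bool(geom) and len(geom[0]) >= 4 and geom[0][0] == geom[0][-1]


# Made-up rows: one polygon with a hole, one row without geometry.
outer = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]
rows = [
    {"name": "with hole", "the_geom": [outer, hole]},
    {"name": "no geometry", "the_geom": None},
]
for row in rows:
    row["result_geom"] = remove_holes(row["the_geom"])
kept = [row["name"] for row in rows if has_valid_geometry(row)]
```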
Exercise 3 – Job: Solution
Exercise 3 – Transform.: Solution
Exercise 4
• The aim of this exercise is to know how to extract
some POI from an OSM data file
• Listen to the instructor, who will explain how
an OSM data file is structured
• In this exercise we will play with the following new step:
– Get data from XML
Exercise 4
• Design a transformation that:
– Extracts POI data from the OSM data file located in the
ottawa_osm directory
– Sets the SRS of data to WGS84
– And exports the result as an ESRI shapefile
– Finally, publish it in GeoServer
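An OSM data file is XML in which points are node elements carrying lat/lon attributes and optional tag children. The Python sketch below extracts tagged nodes from a tiny made-up sample; treating nodes with an amenity tag as POI is an assumption for the example, and in GeoKettle the same logic is expressed with the Get data from XML step.

```python
import xml.etree.ElementTree as ET

# Tiny made-up OSM extract: one tagged node, one bare node, one way.
OSM_SAMPLE = """<osm version="0.6">
  <node id="1" lat="45.4215" lon="-75.6972">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Sample Cafe"/>
  </node>
  <node id="2" lat="45.4200" lon="-75.7000"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
</osm>"""


def extract_poi(osm_xml, poi_key="amenity"):
    """Return nodes carrying the given tag key, with a WKT point geometry."""
    root = ET.fromstring(osm_xml)
    pois = []
    for node in root.iter("node"):
        tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
        if poi_key in tags:
            pois.append({
                "id": node.get("id"),
                "name": tags.get("name"),
                "type": tags[poi_key],
                # WKT uses x y order, i.e. longitude first.
                "wkt": f"POINT ({node.get('lon')} {node.get('lat')})",
            })
    return pois


pois = extract_poi(OSM_SAMPLE)
```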
Exercise 4 – Solution
Exercise 5
• The aim of this exercise is to know how to:
– Extract sensor data from a SOS
– Perform some spatial computation with the Spatial
Analysis step
– Retrieve some metadata on the data stream
– And push this metadata to a CSW
• Listen to the instructor, who will explain how to
proceed with the SOS and CSW steps
• In this exercise we will play with the following new steps:
– SOS Input
– Spatial Analysis
– CSW Output
Exercise 5
• Design a transformation that:
– Retrieves GAUGE_HEIGHT measures from the SOS service
given by the instructor
– Removes rows whose measure value is <= 30
– Groups rows by procedure
– Computes the envelope of each resulting geometry
– Retrieves and sets some mandatory metadata
(MD_METADATA profile)
– And finally, publishes the metadata in GeoNetwork
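The filter, group and envelope steps can be sketched in Python as below. The row layout (procedure, measure, x, y) and the sample values are assumptions; the actual field names come from the SOS Input step.

```python
from collections import defaultdict


def envelopes_by_procedure(rows, threshold=30):
    """Drop rows with measure <= threshold, then compute one envelope
    (min_x, min_y, max_x, max_y) per procedure from the remaining points."""
    groups = defaultdict(list)
    for row in rows:
        if row["measure"] > threshold:
            groups[row["procedure"]].append((row["x"], row["y"]))
    envelopes = {}
    for procedure, points in groups.items():
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        envelopes[procedure] = (min(xs), min(ys), max(xs), max(ys))
    return envelopes


# Made-up observations from two sensors.
rows = [
    {"procedure": "sensor-a", "measure": 35, "x": 1.0, "y": 2.0},
    {"procedure": "sensor-a", "measure": 50, "x": 3.0, "y": 1.0},
    {"procedure": "sensor-a", "measure": 30, "x": 9.0, "y": 9.0},
    {"procedure": "sensor-b", "measure": 10, "x": 5.0, "y": 5.0},
]
envelopes = envelopes_by_procedure(rows)
```

The row with a measure of exactly 30 is removed, and sensor-b disappears entirely because none of its rows pass the filter.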
Exercise 5 – Solution
Exercise 6
• The aim of this exercise is to know how to harvest
metadata from a CSW compliant service
• In this exercise we will play with the following new steps:
– CSW Input
– Dummy
Exercise 6
• Design a transformation that:
– Harvests metadata from the online catalog
– Keeps only the metadata records that describe datasets
– For each metadata row, computes by script the extent of
the dataset
– And exports the BriefRecord_title, BriefRecord_type and
the extent in a new PostGIS table named meta_extent
– Finally, publish this new table in GeoServer
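Computing the extent by script boils down to turning the two corners of a bounding box into a closed polygon. The Python sketch below shows that construction as WKT; only BriefRecord_title and BriefRecord_type come from the exercise, while the min_x/min_y/max_x/max_y field names and the sample records are hypothetical.

```python
def extent_wkt(min_x, min_y, max_x, max_y):
    """Build a closed WKT polygon from the corners of a bounding box."""
    corners = [
        (min_x, min_y), (max_x, min_y), (max_x, max_y),
        (min_x, max_y), (min_x, min_y),
    ]
    return "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in corners) + "))"


# Made-up harvested records; the corner field names are hypothetical.
records = [
    {"BriefRecord_title": "Roads", "BriefRecord_type": "dataset",
     "min_x": -80, "min_y": 43, "max_x": -78, "max_y": 45},
    {"BriefRecord_title": "Viewer", "BriefRecord_type": "service",
     "min_x": -90, "min_y": 40, "max_x": -70, "max_y": 50},
]
meta_extent = [
    (r["BriefRecord_title"], r["BriefRecord_type"],
     extent_wkt(r["min_x"], r["min_y"], r["max_x"], r["max_y"]))
    for r in records
    if r["BriefRecord_type"] == "dataset"
]
```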
Exercise 6 – Solution
Exercise 7
• The aim of this exercise is to know how to call a
process hosted in a WPS compliant service
• In this exercise, we will create a new layer from
our polygons layer (municipalities) hosted in
GeoServer by applying on each polygon a
Centroid WPS service
• In this exercise we will play with the following
new steps:
– Add constants
– HTTP Client
Exercise 7 – Job
• Design a job that:
– Requests municipalities data in GML 2 from the GeoServer WFS
hosted on your VM. Use the layer preview in GeoServer in order
to retrieve the GET request to send.
– And runs a transformation that we will define in the next slide
Exercise 7 – Transformation
• Design a transformation that:
– Reads the GML file extracted from the WFS
– For each row, calls the Centroid service hosted in the Zoo
WPS instance on your VM
– Stores the result in a new table named muninames in your
PostGIS DBMS instance.
– Finally, publish it in GeoServer.
Exercise 7 – Job: Solution
Exercise 7 – Transform.: Solution
Exercise 8
• Based on exercise 4, design a transformation that
extracts the road network from the Ottawa OSM
data file
• In this exercise we will play with the following new step:
– Shapefile File Output
Exercise 8 – Solution
Exercise 9
• The aim of this exercise is to know how to:
– Retrieve location information from some Twitter
– Call the geonames gazetteer service in order to
retrieve lat/lon information for tweets that have
no geo tag
• Listen to the instructor, who will explain how
the Twitter and geonames services work
• In this exercise we will play with the following new steps:
– Unique rows (HashSet)
– Generate rows
Exercise 9
• Design a transformation that:
– Retrieves tweets mentioning the #foss4g tag
– For each tweet, checks whether geo info is present
– If not, uses the location info and calls the
gazetteer in order to retrieve the lat/lon of this location
– Stores the result in a new table named tweets in your
geokettle database in the PostGIS DBMS.
– Finally, publish it in GeoServer
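The logic of this transformation (deduplicate, use the geo tag when present, otherwise fall back to the gazetteer) can be sketched in Python. No web service is called here: the gazetteer is a stub dictionary, and the tweet fields (id, geo, location) and sample values are assumptions made for the example.

```python
# Stub standing in for the gazetteer service: location text -> (lat, lon).
GAZETTEER = {"Denver, CO": (39.7392, -104.9903)}


def locate_tweets(tweets, gazetteer):
    """Deduplicate tweets by id, then attach a WKT point from the geo tag
    or, failing that, from a gazetteer lookup on the location text."""
    seen = set()  # plays the role of the Unique rows (HashSet) step
    located = []
    for tweet in tweets:
        if tweet["id"] in seen:
            continue
        seen.add(tweet["id"])
        coords = tweet.get("geo")
        if coords is None:
            coords = gazetteer.get(tweet.get("location"))
        if coords is not None:
            lat, lon = coords
            # WKT uses x y order, i.e. longitude first.
            located.append({"id": tweet["id"], "wkt": f"POINT ({lon} {lat})"})
    return located


# Made-up tweets: a geotagged one, its duplicate, and two without geo tag.
tweets = [
    {"id": 1, "geo": (45.0, -75.0), "location": "Ottawa"},
    {"id": 1, "geo": (45.0, -75.0), "location": "Ottawa"},
    {"id": 2, "geo": None, "location": "Denver, CO"},
    {"id": 3, "geo": None, "location": "Nowhere"},
]
located = locate_tweets(tweets, GAZETTEER)
```

Tweets that have neither a geo tag nor a location known to the gazetteer are dropped, since they cannot be given a geometry.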
Exercise 9 – Solution
Upcoming features
• Versions 2.x will be the last versions of GeoKettle based on
the Kettle 3.2 code base.
• Thanks to the tremendous work of the Kettle developers,
future versions of GeoKettle will be more pluggable with
• Hence, it will be possible to add spatial extensions provided
by GeoKettle to any Kettle/PDI 4.x installation.
• Taking advantage of this architecture switch, we want to
re-engineer the Geometry data type.
• At present, it only supports 2D data.
• We want to allow support for:
– X,Y,Z,t and M data
– LiDAR data
– Linear referencing
– Raster data
Upcoming features
• So many tasks can be automated with GeoKettle.
• We can think about many new steps in future
releases ...
• But, you know, the roadmap can be influenced by
opportunities ...
• So, we are open to your ideas, opportunities and
possible sponsoring to have your required feature
• Spatialytics can also provide:
– Support (1st and 2nd line through partners)
– Advanced training
– Partnership in tenders
– ...
Upcoming features
• Additional non exhaustive list of steps/jobs that could be envisaged:
– Additional geometric data cleansing and geo-processing
• inclusion of some JCS/OpenJump conflation & topology
checking/cleansing capabilities (GPL -> plugin)
• Towards a geospatial data quality module to check and correct errors
– Read/write support for other DBMS, GIS file formats and services
• NetCDF, SDMX, Linked Geodata, ...
• Native support for MS SQL Server 2008, Netezza spatial, NoSQL dbs, ...
• Native support for WFS-T, WPS, WMS, Table Joining Service (TJS), ...
– Dedicated steps:
• Social media (Twitter, ...), OSM, cartographic generalisation, geocoding &
reverse geocoding ...
– Direct publishing into GeoServer and MapServer
– But also why not see GeoKettle as a possible data source for these
web servers ...
– Raster support: re-initiating the development of a plugin to integrate all
raster capabilities provided by the Sextante library (BeETLe project)
To learn more about GeoKettle
• Do not hesitate to:
– Visit our web sites
– Subscribe to the monthly Spatialytics eNewsletter
– Follow us on Twitter and Facebook
– Check the documentation on the wiki
– Post your questions on the forum
– Submit a bug report or feature request on the
GeoKettle trac
– Contact us
Contact info:
Dr. Thierry Badard, CTO
Spatialytics inc.
Quebec, Canada
Twitter: tbadard, spatialytics, geokettle, geomondrian, solaplayer, geobiext

