Open Source ETL Software: Pentaho Kettle
ISDS 570
Introduction
Kettle is an open-source ETL tool that provides extract, transform, and load capabilities. After
Kettle was acquired by Pentaho, it was renamed Pentaho Data Integration. It is now owned and
supported by Hitachi Data Systems (now Hitachi Vantara), which acquired Pentaho in 2015. The
Pentaho business intelligence suite provides customers with tools for data cleansing, data
analysis, data integration, data mining, and reporting. This software suite is suitable for
performing data extraction, transformation, analysis, forecasting, and publishing for the
Electricity Market ETL project.
Main Features
Pentaho Data Integration allows users to build ETL processes and schedule tasks in the Pentaho
Data Integration client interface. The tool lets users drag and drop steps onto a canvas.
Common features include:
Migrating data between different databases
Retrieving data, aggregating data, populating tables, and emailing an error log if a task fails
Cleaning dirty data with simple to complex transformations
Integrating real-time data
Dashboards and visualizations
Data discovery and analysis (OLAP)
Embedded reporting and OLAP engine
Scheduling jobs, much like Windows Task Scheduler, via scheduled transformations
Web services, including web service lookup, Modified Java Script Value, RSS Input, and
HTTP Post
Generating SQL statements
Creating checkpoints to restart jobs
Mapping commonly used steps to reuse transformation flows
Performing multidimensional modeling, relational modeling, and streamlined data refinery
Lin Yuan
Supporting and executing Hadoop and Spark jobs
Users can access files through Virtual File System (VFS) connections to specific file systems.
The Virtual File System supports Google Cloud Storage, Snowflake Staging, Amazon S3/MinIO,
HCP, and Catalog.
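The retrieve, aggregate, populate, and error-log flow listed among the features can be sketched in
plain Python to show what such a job does conceptually. This is only an illustration and not
Pentaho Data Integration's API: the sample data, the table name avg_price, and the function names
are hypothetical, SQLite stands in for the target database, and a log message stands in for the
error email.

```python
import csv
import io
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_sketch")

# Hypothetical sample data standing in for an extracted source file.
SAMPLE_CSV = """region,hour,price
North,1,30.5
North,2,
South,1,28.0
South,2,32.0
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: skip rows with a missing price, then average price per region."""
    totals = {}
    for row in rows:
        price = row["price"].strip()
        if not price:
            continue  # cleansing step: drop incomplete records
        total, count = totals.get(row["region"], (0.0, 0))
        totals[row["region"]] = (total + float(price), count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


def load(conn, averages):
    """Load: populate the (hypothetical) avg_price table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS avg_price (region TEXT PRIMARY KEY, price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO avg_price VALUES (?, ?)", sorted(averages.items())
    )
    conn.commit()


def run_job(conn, text):
    """Run extract, transform, and load; log the error and return False on failure."""
    try:
        load(conn, transform(extract(text)))
        return True
    except Exception:
        # A real job might email this log; here it is only written to the logger.
        log.exception("ETL job failed")
        return False
```

Calling run_job(sqlite3.connect(":memory:"), SAMPLE_CSV) fills the table with one average price per
region. In Pentaho Data Integration each of these functions would correspond roughly to steps placed
on the canvas rather than to hand-written code.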
Summary
Potential limitations:
As open-source software, the free edition does not include vendor support, training, consulting,
or the automatic patches and updates that come with a paid license. Users can rely only on the
community for updates and for support with critical issues. It has some suitable modules, but
development of newer modules is slow, and they may not contain enough features. Compared with
popular software such as Power BI and SSIS, community support is minimal. The report designer
seems outdated and is not easy to navigate. The integrated task scheduler and job manager are not
available in the free edition. Since it is not a popular tool, there will be a learning curve for
users. To enjoy the full benefits of the ETL software, users will need to purchase the enterprise
edition.
Recommendation:
Pentaho Data Integration has a user-friendly interface and is easy to learn. The Pentaho Data
Integration application can build transformations and schedule jobs in an environment that
allows users to collaborate with other users to build solutions faster and more efficiently.
Pentaho integrates with its built-in task scheduler, which is an excellent convenience feature.
The product can extract data from websites via its web service tools, transform data, cleanse
data, consolidate data, validate data, populate tables, notify users when errors occur, schedule
jobs and backups, and publish tables on websites. Pentaho is an excellent open-source ETL tool
overall, and the latest version was released eight months ago. Despite some limitations, it is
still recommended for the Electricity Market ETL project.