
ETL TOOLS PENTAHO DATA INTEGRATION

BHAVANI.P SUBHASHINI.V PUNNIYAA

Introduction

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

Extract

Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen-scraping. An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure.
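The parsing check described above can be sketched in a few lines of Python. This is only an illustration, not part of any ETL tool: the column names, the regular expressions in `EXPECTED`, and the sample data are all assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical expected layout: numeric id, ISO date, decimal amount.
EXPECTED = {
    "id": re.compile(r"\d+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"-?\d+(\.\d+)?"),
}

def extract(text):
    """Parse CSV text and split rows into (valid, rejected) by pattern check."""
    valid, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        ok = all(
            row.get(col) is not None and pattern.fullmatch(row[col])
            for col, pattern in EXPECTED.items()
        )
        (valid if ok else rejected).append(row)
    return valid, rejected

# Made-up flat-file content: the second row has a bad id, the third a bad date.
SAMPLE = "id,date,amount\n1,2011-03-01,19.90\nx2,2011-03-02,5\n3,03/03/2011,7.5\n"
valid, rejected = extract(SAMPLE)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected rows instead of dropping them makes it possible to report on them later in the cycle.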

Transform

- Selecting only certain columns to load
- Translating coded values
- Encoding free-form values
- Deriving a new calculated value
- Filtering
- Sorting
- Joining data from multiple sources
- Aggregation
- Generating surrogate-key values
- Transposing or pivoting
- Splitting a column into multiple columns
- Dis-aggregation of repeating columns into a separate detail table
- Lookup and validate the relevant data from tables or referential files for slowly changing dimensions
- Applying any form of simple or complex data validation
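A few of these transformations can be shown together in a short Python sketch. The code table, the field names, and the sample rows are hypothetical; a real job would take them from the source systems.

```python
from itertools import count

# Hypothetical code table and source rows, for illustration only.
GENDER_CODES = {"M": "Male", "F": "Female"}
rows = [
    {"cust": "a17", "gender": "F", "qty": 2, "unit_price": 4.5, "note": "x"},
    {"cust": "b02", "gender": "M", "qty": 0, "unit_price": 9.0, "note": "y"},
    {"cust": "c33", "gender": "M", "qty": 3, "unit_price": 1.0, "note": "z"},
]

def transform(source_rows):
    """Filter, select columns, translate codes, derive a value, add a key, sort."""
    surrogate = count(1)  # surrogate keys generated outside the database
    out = []
    for r in source_rows:
        if r["qty"] <= 0:  # filtering: skip rows with nothing ordered
            continue
        out.append({
            "customer_key": next(surrogate),                     # surrogate key
            "cust": r["cust"],                                   # selected column
            "gender": GENDER_CODES.get(r["gender"], "Unknown"),  # coded value
            "total": r["qty"] * r["unit_price"],                 # derived value
        })
    return sorted(out, key=lambda r: r["total"])                 # sorting

result = transform(rows)
for r in result:
    print(r)
```

The `note` column is not copied, which is the "selecting only certain columns" case from the list.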

Load

- The load phase loads the data into the end target, usually the data warehouse (DW).
- The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs.
- Because the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply.
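The replace-or-append choice and the effect of schema constraints can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the table `sales`, its columns, and the `load` helper are invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

def load(conn, rows, mode="append"):
    """Load (id, amount) rows into the hypothetical table 'sales'."""
    with conn:  # one transaction: a constraint failure rolls the batch back
        if mode == "replace":
            conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL CHECK (amount >= 0))"
)
load(conn, [(1, 10.0), (2, 5.5)])
load(conn, [(3, 7.0)], mode="append")
try:
    load(conn, [(4, 1.0), (5, -2.0)])  # second row violates the CHECK constraint
except sqlite3.IntegrityError as exc:
    print("rejected by schema:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count, "rows loaded")
```

Because each call runs in one transaction, the batch containing the invalid row is rolled back as a whole and the earlier loads are unaffected.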

ETL Cycle

The typical real-life ETL cycle consists of the following execution steps:

- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
- Publish (to target tables)
- Archive
- Clean up
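A reduced version of this cycle can be written as one small Python function. It is a sketch only: the reference table, the validation rule, and the row layout are assumptions, and in-memory lists stand in for staging and target tables.

```python
def run_cycle(source_rows):
    """Run a reduced ETL cycle; returns (published_rows, audit_report)."""
    reference = {"DE": "Germany", "FR": "France"}           # build reference data
    extracted = list(source_rows)                           # extract
    validated = [r for r in extracted if r.get("country")]  # validate
    staged = [                                              # transform and stage
        {**r, "country_name": reference.get(r["country"])} for r in validated
    ]
    audit = {                                               # audit report
        "extracted": len(extracted),
        "rejected": len(extracted) - len(validated),
        "unknown_country": sum(r["country_name"] is None for r in staged),
    }
    published = [r for r in staged if r["country_name"] is not None]  # publish
    staged.clear()                                          # clean up staging
    return published, audit

published, audit = run_cycle([
    {"id": 1, "country": "DE"},
    {"id": 2, "country": ""},
    {"id": 3, "country": "XX"},
])
print(published)
print(audit)
```

The audit counts are computed before publishing, so a failed run still leaves a record of how many rows were rejected and why.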

Challenges


ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data warehouses are typically assembled from a variety of data sources with different formats and purposes. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time.

Performance

- Use a direct-path extract or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting a high-speed extract.
- Do most of the transformation processing outside of the database.
- Use bulk load operations whenever possible. Even with bulk operations, database access is usually the bottleneck in the ETL process.
- Partition tables (and indices). Try to keep partitions similar in size (watch for null values, which can skew the partitioning).
- Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
- Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
- Generate IDs in the ETL layer.
- Drop the indexes (on a table or partition) before the load, and recreate them after the load.
- Use parallel bulk load when possible.
- If a requirement exists to do insertions, updates, or deletions, find out in the ETL layer which rows should be processed in which way, and then process these three operations in the database separately.
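The last point, deciding in the ETL layer which rows are inserts, updates, or deletes and then applying each kind separately, can be sketched as follows. The table `dim`, its columns, and both helper functions are hypothetical, and SQLite stands in for the target database.

```python
import sqlite3

def classify(source, target):
    """Compare keyed snapshots (primary key -> value) in the ETL layer."""
    inserts = [(k, v) for k, v in source.items() if k not in target]
    updates = [(k, v) for k, v in source.items() if k in target and target[k] != v]
    deletes = [k for k in target if k not in source]
    return inserts, updates, deletes

def apply_changes(conn, inserts, updates, deletes):
    """Apply the three kinds of change as three separate batched statements."""
    with conn:
        conn.executemany("INSERT INTO dim (id, name) VALUES (?, ?)", inserts)
        conn.executemany(
            "UPDATE dim SET name = ? WHERE id = ?", [(v, k) for k, v in updates]
        )
        conn.executemany("DELETE FROM dim WHERE id = ?", [(k,) for k in deletes])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO dim VALUES (?, ?)", [(1, "old"), (2, "same"), (3, "gone")])
conn.commit()

target = dict(conn.execute("SELECT id, name FROM dim"))
source = {1: "new", 2: "same", 4: "added"}  # made-up incoming snapshot
inserts, updates, deletes = classify(source, target)
apply_changes(conn, inserts, updates, deletes)
final = dict(conn.execute("SELECT id, name FROM dim ORDER BY id"))
print(final)
```

Unchanged rows (key 2 here) never reach the database, which keeps the write volume down.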

Parallel Processing

- Sources
- Central ETL layer
- Targets

ETL applications implement three main types of parallelism:

- Data: Splitting a single sequential file into smaller data files to provide parallel access.
- Pipeline: Allowing the simultaneous running of several components on the same data stream, for example, looking up a value on record 1 at the same time as adding two fields on record 2.
- Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
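Data parallelism, the first of these types, can be illustrated with Python's standard `concurrent.futures` module. The chunking scheme and the trivial `transform_chunk` function are assumptions made for the example. Threads are used to keep the sketch short; for CPU-bound transformations in CPython a process pool would usually be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, parts):
    """Split a list into at most `parts` contiguous chunks."""
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def transform_chunk(chunk):
    """Stand-in transformation: double every value in the chunk."""
    return [value * 2 for value in chunk]

def parallel_transform(records, parts=4):
    """Transform chunks concurrently and merge the results in original order."""
    chunks = split(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(transform_chunk, chunks))  # map keeps chunk order
    return [row for chunk in results for row in chunk]

print(parallel_transform(list(range(10))))
```

Because `map` returns results in submission order, the merged output matches what a sequential run would produce.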

Rerunnability, recoverability


Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece. Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
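The idea of tagging rows and runs and writing checkpoints can be sketched in plain Python. Everything here is illustrative: the JSON checkpoint file, the step names, and the simulated failure are assumptions, not a description of how any particular tool does it.

```python
import json
import os
import tempfile
import uuid

def run_with_checkpoints(steps, rows, checkpoint_path):
    """Run named steps in order, skipping those already checkpointed."""
    state = {"run_id": str(uuid.uuid4()), "done": [], "rows": rows}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)  # resume the earlier, failed run
    for name, func in steps:
        if name in state["done"]:
            continue
        state["rows"] = func(state["rows"])
        state["done"].append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)   # checkpoint: persist state to disk
    return state

def tag_rows(rows):
    """Tag each value with a row_id so it can be traced later."""
    return [{"row_id": i, "value": v} for i, v in enumerate(rows, start=1)]

attempts = {"n": 0}

def flaky_double(rows):
    """Fails on the first call to simulate a crash, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated failure")
    return [{**r, "value": r["value"] * 2} for r in rows]

steps = [("tag", tag_rows), ("double", flaky_double)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_with_checkpoints(steps, [10, 20], path)
except RuntimeError:
    pass  # first attempt fails after the "tag" checkpoint was written
state = run_with_checkpoints(steps, [10, 20], path)
print(state["done"], state["rows"])
```

On the second call the "tag" step is skipped and only the failed piece is rerun, under the same `run_id` as the first attempt.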

Best practices

- Four-layered approach for ETL architecture design
- Use file-based ETL processing where possible
- Use data-driven methods and minimize custom ETL coding
- Qualities of a good ETL architecture design

Tools


Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities.

Open-source ETL frameworks

- Apatar
- CloverETL
- Flat File Checker
- Jitterbit 2.0
- Pentaho Data Integration (now included in OpenOffice Base)
- RapidMiner
- Scriptella
- Talend Open Studio

Proprietary ETL frameworks

- IBM InfoSphere DataStage
- Informatica PowerCenter
- Oracle Data Integrator (ODI)
- Ab Initio
- Altova MapForce
- HiT Software Allora
- Digital Fuel Service Flow
- Phocas ETL
- Microsoft SQL Server Integration Services (SSIS)

The Pentaho BI Project is open source application software for enterprise reporting, analysis, dashboard, data mining, workflow and ETL capabilities for business intelligence needs.

Business Model

Pentaho uses a subscription model: its commercial open source business model eliminates software license fees, providing support, services, and product enhancements via an annual subscription. A commercial open source company, Pentaho "leads and sponsors" the open source projects that are core to its suite, giving it direct influence over software development.

Pentaho's Board of Directors & Investors

The composition of the board and investors is a strong, balanced blend of skills and experience, allowing them to offer guidance in core areas important to Pentaho.

Management and Technical Leads


The core project team at Pentaho has been together for many years and through success after success. It includes highly experienced industry leaders with a strong record of creating successful BI products for top-tier commercial vendors, including:

- Business Objects
- Cognos
- Hyperion
- IBM
- Oracle
- SAS

COMPONENTS OF PENTAHO BI SUITE ENTERPRISE EDITION


The Pentaho BI Suite provides a full spectrum of business intelligence (BI) capabilities including query and reporting, interactive analysis, dashboards, data integration/ETL, data mining, and a BI platform that has made it the world's most popular open source BI suite. Pentaho Enterprise Edition products provide comprehensive technical support, software maintenance, and enhanced functionality. Pentaho's technology was architected from the ground up as a modern, fully integrated BI platform built on open standards. That means it fits easily into any IT infrastructure, out of the box or embedded in a custom application.

Pentaho Reporting
- Flexible deployment from standalone desktop reporting to embedded reporting and enterprise business intelligence
- Broad data source support including relational, OLAP, or XML-based data sources
- Popular output options including Adobe PDF, HTML, Microsoft Excel, Rich Text Format, or plain text
- Web-based ad hoc query and reporting for business users
- Enterprise Edition provides enhanced software functionality, comprehensive professional technical support, product expertise, certified software and software maintenance

Embedded reporting

Operational Reporting

Production Reporting

Pentaho Report Designer


- Design reports quickly with the streamlined report wizard that takes authors from a blank canvas to a highly polished report in four simple steps.
- Connect to diverse data sources including relational data, Pentaho Analysis, flat files, Java objects, or even stream data directly from Pentaho Data Integration transformations to design reports.
- Create and view user prompts, including dynamic cascading prompts.
- Publish directly to the BI server to give business users instant access to the information they need.
- Add rich data visualizations with over 15 customizable chart types, barcodes, sparklines, survey scales, and more.
- Localize reports easily to support multi-lingual deployment with a single report file.
- Embed HTML and JavaScript controls for dynamic and interactive online reports.
- Fine-tune reports using the built-in interactive preview mode.

Pentaho Analysis
- Freely explore business information by drilling into and cross-tabulating data
- Experience speed-of-thought response times to complex analytical queries
- View information multi-dimensionally, choosing specific metrics and attributes to analyze
- Deploy stand-alone or integrated with other products in the Pentaho BI Suite

Pentaho Analyzer

Pentaho Analyzer provides intuitive, interactive analytical reporting, letting non-technical business users quickly understand business information. As part of the enhanced functionality in Pentaho Analysis Enterprise Edition, Analyzer features:

- Web-based, drag-and-drop report creation
- Advanced sorting and filtering
- Customized totals and user-defined calculations
- Chart visualizations
- And much more

Pentaho Dashboards
Pentaho Dashboards delivers visibility by providing:

- Rich, interactive displays, including Adobe Flash-based visualizations, so that business users can immediately see which business metrics are on track and which need attention
- A self-service dashboard designer that lets business users easily create personalized dashboards with zero training
- Integration with Pentaho Reporting and Pentaho Analysis so that users can drill to underlying reports and analysis to understand what factors are contributing to good or bad performance
- Portal integration to make it easy to deliver relevant business metrics to large numbers of users, seamlessly integrated into their application
- Integrated alerting to continuously monitor for exceptions and notify users to take action

Pentaho Data Integration


With Pentaho Data Integration, Pentaho is redefining the way that BI applications are built and deployed. Utilizing Pentaho's Agile BI approach, Pentaho Data Integration unifies the ETL, modeling and visualization processes into a single, integrated environment that enables developers and end users to work seamlessly together. The end result is that BI developers and end users can build BI applications more quickly, more easily, and at a small fraction of the cost of traditional solutions. Pentaho's Agile BI:

- Powers instantaneous, iterative BI application development
- Enables seamless collaboration between developers and end users
- Merges complex BI development into a single process
- Dramatically reduces time and difficulty of building and deploying BI apps

Pentaho Data Integration is a full-featured ETL solution including:

- Rich transformation library with over 100 out-of-the-box mapping objects
- Broad data source support including packaged applications, over 30 open source and proprietary database platforms, flat files, Excel documents and more
- Advanced data warehousing support for Slowly Changing and Junk Dimensions
- Proven enterprise-class performance and scalability
- Integration with the Pentaho BI Suite for Enterprise Information Integration (EII), advanced scheduling, and process integration
- Unified ETL, modeling and visualization development environment for design of BI applications

Pentaho Data Integration Transformation Screenshot

Pentaho Data Integration Job Screenshot

Common use cases for Pentaho Data Integration include:

- Data warehouse population
- Agile design of BI applications
- Information enrichment by integrating data from various sources
- Data migration between applications
- Imports of data into databases from text files, Excel spreadsheets, relational systems and more
- Data cleansing by applying complex conditions in data transformations
- Exploration of data in existing databases (tables, views, etc.)

Pentaho Data Mining


- Data Mining is the process of running data through sophisticated algorithms to uncover meaningful patterns and correlations that may otherwise be hidden. These can be used to understand the business better and also exploited to improve future performance through predictive analytics.
- Pentaho Data Mining is differentiated by its open, standards-compliant nature, use of Weka data mining technology, and tight integration with core business intelligence capabilities including reporting, analysis and dashboards. Other data mining offerings lack this level of sophistication and integration.

Pentaho Data Mining can be deployed as:

- An out-of-the-box solution for immediate deployment to analysts. As far as end users are concerned, data mining operates entirely in the background: users see results and recommendations through e-mail or other web pages, which can include Pentaho Dashboards.
- A set of components that enable Java developers to quickly create custom reporting solutions using Java objects or JavaServer Pages (JSPs). These can be tightly integrated with other applications or portals.
- Together with other components of the overall Pentaho BI Suite.

Features and Benefits

- Provides insight into hidden patterns and relationships in your data
- Enables you to exploit these correlations to improve organizational performance
- Provides indicators of future performance
- Enables embedding of recommendations in your applications
- Enables you to take full advantage of a range of data mining algorithms

Technology
Powerful Data Mining Engine


Provides a comprehensive set of machine learning algorithms from the Weka project including clustering, segmentation, decision trees, random forests, neural networks, and principal component analysis. Pentaho has added integration with Pentaho Data Integration and automated the process of transforming data into the format the data mining engine needs. Algorithms can either be applied directly to a dataset or called from Java code. Output can be viewed graphically, interacted with programmatically, or used as a data source for reports, further analysis, and other processes. Filters are provided for discretization, normalization, re-sampling, attribute selection, and transforming and combining attributes.

  

Classifiers provide models for predicting nominal or numeric quantities. Learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, and other advanced techniques. The data mining engine is also well-suited for developing new machine learning schemes, enabling customers to incorporate their own models. Inputs and outputs can be controlled programmatically, enabling developers to create completely custom solutions using the components provided.

 

Graphical Design Tools




Graphical user interfaces are provided for data pre-processing, classification, regression, clustering, association rules, and visualization.

Data Mining - Boundary Visualizer

Data Mining - Classify Panel

Data Mining - Knowledge Flow

Data Mining - Explorer

CUSTOMER SUCCESSES

Pentaho customers address a wide range of BI challenges using services and software from Pentaho. Many Pentaho customers use Pentaho for reporting, data integration, dashboards, and/or analysis. Some use multiple modules or the full Pentaho BI Suite. With subscription services and open source licensing from Pentaho, customers can get best-in-class BI capabilities with the peace of mind of professional support, software maintenance, training, consulting, and more. The following is a small sample of the many organizations around the globe that depend on Pentaho for commercial open source business intelligence.

"Our only regret was that we didn't have Pentaho for data integration years ago. Immediately we were able to see the increased operational efficiency, reduced internal costs and greater customer value using Pentaho Data Integration."

Deployment Overview
Key Challenges
- Cumbersome, manual process for creation and distribution of reports
- Multiple data points, including Google Analytics, that needed to be integrated and automated into one report

Pentaho Solution
- Pentaho Data Integration
- Business and implementation services by Pentaho Systems Integrator Partner, DEFTeam Solutions

Results
- Increased operational efficiency
- Reduced internal costs
- Greater customer value

Why Pentaho
- Low cost
- Flexibility
- Speed-to-market

"We needed to deliver a business intelligence solution that would show immediate benefit by increasing efficiencies, containing costs, and helping drive revenue. By using Pentaho BI Suite Enterprise Edition, we were able to do so in a fiscally responsible manner, and in today's economic climate that is of utmost importance."

Deployment Overview
Key Challenges
- Gaining better insight across the organization to help steer strategic decision-making
- Conducting deeper analysis on historical data across all facets of its service offerings

Pentaho Solution
- Pentaho BI Suite Enterprise Edition for data integration, reporting and analysis
- CentOS, PostgreSQL Database

Results
- Company-wide performance gains through better visibility into customer, cost, and revenue trends
- Increased operational efficiency, reduced internal costs and greater customer value

Why Pentaho
- End-to-end BI capabilities
- Value vs. proprietary BI
- Enterprise Edition features

"The simplicity of the interface actually allows Lifetime Entertainment Services to give direct access to business analysts, allowing them to understand and manage the business rules governing the integration of information. That wasn't previously possible with complex hand-coded integration jobs."

Deployment Overview
Key Challenges
- Optimizing advertising processes to drive ad revenue growth
- Adapting data integration infrastructure to keep up with changing business rules

Pentaho Solution
- Pentaho Data Integration Enterprise Edition
- Selected over Informatica and BusinessObjects Data Integrator
- Continued use of Business Objects BI tools

Results
- Ability for business analysts to manage integration rules and adapt integration processes to company business rules

Why Pentaho
- Ease of use
- Cost of ownership
- Enterprise Edition features

"ActivePivot (tm) uniquely marries the concept of online analytical process with real-time position-keeping; something no other company currently offers. Thanks to Pentaho Spreadsheet Services we can now offer seamless MDX connectivity to Microsoft Excel."

Deployment Overview
Key Challenges
- Excel-based access to analytic application data
- Maximizing margins on analytic software solution for financial institutions

Pentaho Solution
- Pentaho Analysis
- Pentaho Spreadsheet Services

Results
- Competitive differentiation based on Excel-based access to centralized information

Why Pentaho
- Low costs delivered by commercial open source business model
- Standards-based offering allowing Excel-based connectivity to live OLAP data

"Pentaho's BI suite and top-notch professional support enabled us to deliver a successful, high-value BI solution at a much lower cost than would have been possible with the expensive, proprietary alternatives."

Deployment Overview
Key Challenges
- Understanding the effectiveness of its online marketing activities
- Outgrowing Microsoft Excel-based reporting system
- Maintaining complex, hand-coded ETL scripts

Pentaho Solution
- Pentaho BI Suite Enterprise Edition
- IBM servers, SUSE Linux, 1.5 terabyte Microsoft SQL Server data warehouse
- Professional services from Pentaho partner OpenBI

Results
- Automated integration of clickstream data with Google Analytics and catalog sales activity data
- Greater visibility into website traffic, keyword performance and revenue attribution

Why Pentaho
- Standards-based, cross-platform support
- Quality of support and services

"With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution."

Deployment Overview
Key Challenges
- Measuring and optimizing agent performance, customer satisfaction, and marketing ROI
- Getting an integrated, strategic view across multiple operational systems

Pentaho Solution
- Pentaho Data Integration Enterprise Edition
- Red Hat Enterprise Linux, MySQL database
- Continued use of proprietary BI tools (MicroStrategy)
- Product expertise

Results
- Three-fold performance increase, 8 hour reduction in batch load times
- Simplified maintenance and reduced costs

Why Pentaho
- Functionality and flexibility
- Professional support

AWARDS AND RECOGNITION