ISAS ETL Final
Introduction
Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)
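The three phases above can be sketched as three small functions over in-memory data. This is a minimal illustration, not any product's API; all names (extract, transform, load, warehouse) are invented for the example.

```python
# Minimal sketch of the three ETL phases over in-memory data.

def extract():
    # Pretend these rows came from a flat file or a source database.
    return [
        {"id": 1, "amount": "10.50", "region": "eu"},
        {"id": 2, "amount": "3.25", "region": "us"},
    ]

def transform(rows):
    # Fit the data to operational needs: cast types, normalize codes.
    return [
        {"id": r["id"], "amount": float(r["amount"]), "region": r["region"].upper()}
        for r in rows
    ]

def load(rows, target):
    # Append into the end target (a list standing in for a warehouse table).
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```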
Extract
Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even outside sources reached through web spidering or screen-scraping. An intrinsic part of the extraction is parsing the extracted data, checking whether it meets an expected pattern or structure.
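That parsing step can be as simple as checking each extracted record against an expected set of fields and value patterns. A hedged sketch, with invented field names and a made-up date format:

```python
import re

# Sketch: check that an extracted record meets an expected structure,
# as the extraction phase's intrinsic parsing step does.

EXPECTED_FIELDS = {"customer_id", "order_date", "total"}
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed ISO-style dates

def parse_record(record):
    """Return the record if it matches the expected pattern, else raise ValueError."""
    if set(record) != EXPECTED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(record)}")
    if not DATE_PATTERN.match(record["order_date"]):
        raise ValueError(f"bad date: {record['order_date']}")
    return record

ok = parse_record({"customer_id": "C1", "order_date": "2011-04-01", "total": "9.99"})
```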
Transform
- Selecting only certain columns to load
- Translating coded values
- Encoding free-form values
- Deriving a new calculated value
- Filtering
- Sorting
- Joining data from multiple sources
- Aggregation
- Generating surrogate-key values
- Transposing or pivoting
- Splitting a column into multiple columns
- Disaggregating repeating columns into a separate detail table
- Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
- Applying any form of simple or complex data validation
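A few of the transformations listed above can be shown in a few lines: deriving a calculated value, generating surrogate keys, and splitting a column into multiple columns. The row layout and field names are invented for the example.

```python
from itertools import count

# Sketch of three transformations from the list above.

surrogate_key = count(1)  # surrogate-key generator for the target table

def transform_row(row):
    first, last = row["full_name"].split(" ", 1)   # split one column into two
    return {
        "sk": next(surrogate_key),                 # generated surrogate key
        "first_name": first,
        "last_name": last,
        "total": row["qty"] * row["unit_price"],   # derived calculated value
    }

rows = [{"full_name": "Ada Lovelace", "qty": 2, "unit_price": 5.0}]
out = [transform_row(r) for r in rows]
```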
Load
The load phase loads the data into the end target, usually the data warehouse (DW). The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs. As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply.
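The point about schema constraints applying at load time can be demonstrated with an in-memory SQLite database standing in for the warehouse; the table and column names are invented.

```python
import sqlite3

# Sketch: constraints defined in the target schema apply during the load.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

# A clean bulk insert of transformed rows.
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 10.5), (2, 3.25)])

# A row violating the NOT NULL constraint is rejected by the database itself.
try:
    con.execute("INSERT INTO fact_sales VALUES (3, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

loaded = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```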
ETL Cycle
The typical real-life ETL cycle consists of the following execution steps:
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
- Publish (to target tables)
- Archive
- Clean up
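One common way to drive such a cycle is to treat each step as a function and run them in a fixed order, keeping an audit trail as the cycle progresses. A minimal sketch with placeholder step bodies and only a few of the steps listed above:

```python
# Sketch: the ETL cycle as an ordered sequence of named steps with an
# audit trail. Step implementations are placeholders.

audit = []

def step(name):
    """Decorator that records each completed step in the audit trail."""
    def decorator(fn):
        def wrapper(state):
            out = fn(state)
            audit.append(name)
            return out
        return wrapper
    return decorator

@step("extract")
def extract(state):
    state["rows"] = [1, 2, 3]          # pretend source rows
    return state

@step("validate")
def validate(state):
    assert all(isinstance(r, int) for r in state["rows"])
    return state

@step("transform")
def transform(state):
    state["rows"] = [r * 10 for r in state["rows"]]
    return state

@step("publish")
def publish(state):
    state["published"] = True          # stand-in for writing target tables
    return state

state = {}
for phase in (extract, validate, transform, publish):
    state = phase(state)
```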
Challenges
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data warehouses are typically assembled from a variety of data sources with different formats and purposes. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time.
Performance
- Use the Direct Path Extract method or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting a high-speed extract
- Do most of the transformation processing outside of the database
- Use bulk load operations whenever possible; still, even with bulk operations, database access is usually the bottleneck in the ETL process
- Partition tables (and indices); try to keep partitions similar in size (watch for null values, which can skew the partitioning)
- Do all validation in the ETL layer before the load
- Disable integrity checking in the target database tables during the load
- Disable triggers in the target database tables during the load; simulate their effect as a separate step
- Generate IDs in the ETL layer
- Drop the indexes (on a table or partition) before the load, and recreate them after the load
- Use parallel bulk load when possible
- If insertions, updates, or deletions are all required, find out in the ETL layer which rows should be processed in which way, and then process these three operations in the database separately
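The drop-and-recreate-indexes tip can be illustrated with SQLite (the same pattern applies to any RDBMS, though the DDL syntax varies); table and index names are invented.

```python
import sqlite3

# Sketch of one tip above: drop indexes before a bulk load and recreate them
# afterwards, so the load does not pay for index maintenance row by row.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")
con.execute("CREATE INDEX idx_t_val ON t (val)")

con.execute("DROP INDEX idx_t_val")                  # drop before the load
con.executemany("INSERT INTO t VALUES (?, ?)",       # bulk insert
                [(i, f"v{i}") for i in range(1000)])
con.execute("CREATE INDEX idx_t_val ON t (val)")     # recreate after the load

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```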
Parallel Processing
- Data: splitting a single sequential file into smaller data files to provide parallel access
- Pipeline: allowing the simultaneous running of several components on the same data stream; for example, looking up a value on record 1 at the same time as adding two fields on record 2
- Component: the simultaneous running of multiple processes on different data streams in the same job; for example, sorting one input file while removing duplicates on another file
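The first of these, data parallelism, can be sketched as follows. A real ETL tool would split physical files and use separate processes; here the input is an in-memory list and a thread pool keeps the sketch self-contained, with an invented stand-in transformation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of data parallelism: split one sequential input into smaller
# partitions and process the partitions concurrently.

def split(rows, n):
    # n roughly equal partitions via striding.
    return [rows[i::n] for i in range(n)]

def process(partition):
    return sum(partition)  # stand-in for the real per-partition transformation

rows = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, split(rows, 4)))

total = sum(partials)  # recombine the partial results
```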
Rerunnability, recoverability
Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece. Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
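The row_id/run_id tagging described above can be sketched in a few lines; the tag names follow the text, while the data and the rollback mechanism are invented for illustration.

```python
# Sketch: tag each row with a run_id and row_id so a failed piece of the
# process can be identified, rolled back, and rerun.

def tag_rows(rows, run_id):
    return [dict(row, run_id=run_id, row_id=i) for i, row in enumerate(rows, 1)]

def rollback(loaded, run_id):
    # Remove everything written by the failed run before rerunning it.
    return [r for r in loaded if r["run_id"] != run_id]

loaded = tag_rows([{"v": 1}, {"v": 2}], "run-001")
loaded += tag_rows([{"v": 3}], "run-002")

# Suppose run-001 failed downstream: undo its rows only, leaving run-002 intact.
loaded = rollback(loaded, "run-001")
```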
Best practices
- Four-layered approach for ETL architecture design
- Use file-based ETL processing where possible
- Use data-driven methods and minimize custom ETL coding
- Qualities of a good ETL architecture design
Tools
Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities.
Open-source ETL frameworks:
- Apatar
- CloverETL
- Flat File Checker
- Jitterbit 2.0
- Pentaho Data Integration (now included in OpenOffice Base)
- RapidMiner
- Scriptella
- Talend Open Studio

Proprietary ETL frameworks:
- IBM InfoSphere DataStage
- Informatica PowerCenter
- Oracle Data Integrator (ODI)
- Ab Initio
- Altova MapForce
- HiT Software Allora
- Digital Fuel Service Flow
- Phocas ETL
- Microsoft SQL Server Integration Services (SSIS)
The Pentaho BI Project is open source application software providing enterprise reporting, analysis, dashboards, data mining, workflow, and ETL capabilities for business intelligence needs.
Business Model
Pentaho uses a subscription model: its commercial open source business model eliminates software license fees, providing support, services, and product enhancements via an annual subscription. As a commercial open source company, Pentaho "leads and sponsors" the open source projects that are core to its suite, giving it direct influence over software development.
Pentaho Reporting
- Flexible deployment from standalone desktop reporting to embedded reporting and enterprise business intelligence
- Broad data source support including relational, OLAP, or XML-based data sources
- Popular output options including Adobe PDF, HTML, Microsoft Excel, Rich Text Format, or plain text
- Web-based ad hoc query and reporting for business users

The Enterprise Edition provides enhanced software functionality, comprehensive professional technical support, product expertise, certified software, and software maintenance.
Embedded Reporting
Operational Reporting
Production Reporting
Pentaho Analysis
- Freely explore business information by drilling into and cross-tabulating data
- Experience speed-of-thought response times to complex analytical queries
- View information multi-dimensionally, choosing specific metrics and attributes to analyze
- Deploy stand-alone or integrated with other products in the Pentaho BI Suite
Pentaho Analyzer
Pentaho Analyzer provides intuitive, interactive analytical reporting, letting non-technical business users quickly understand business information. As part of the enhanced functionality in Pentaho Analysis Enterprise Edition, Analyzer features:
- Web-based, drag-and-drop report creation
- Advanced sorting and filtering
- Customized totals and user-defined calculations
- Chart visualizations
- And much more
Pentaho Dashboards
Pentaho Dashboards delivers visibility into business performance by providing:
- Rich, interactive displays, including Adobe Flash-based visualizations, so that business users can immediately see which business metrics are on track and which need attention
- A self-service dashboard designer that lets business users easily create personalized dashboards with zero training
- Integration with Pentaho Reporting and Pentaho Analysis, so that users can drill to underlying reports and analysis to understand what factors are contributing to good or bad performance
- Portal integration, to make it easy to deliver relevant business metrics to large numbers of users, seamlessly integrated into their applications
- Integrated alerting, to continuously monitor for exceptions and notify users to take action
- Powers instantaneous, iterative BI application development
- Enables seamless collaboration between developers and end users
- Merges complex BI development into a single process
- Dramatically reduces the time and difficulty of building and deploying BI apps
- Rich transformation library with over 100 out-of-the-box mapping objects
- Broad data source support, including packaged applications, over 30 open source and proprietary database platforms, flat files, Excel documents, and more
- Advanced data warehousing support for Slowly Changing and Junk Dimensions
- Proven enterprise-class performance and scalability
- Integration with the Pentaho BI Suite for Enterprise Information Integration (EII), advanced scheduling, and process integration
- Unified ETL, modeling, and visualization development environment for the design of BI applications
- Cleansing transformations by applying complex conditions in data

Pentaho Data Mining
Pentaho Data Mining uses algorithms to uncover meaningful patterns and correlations that may otherwise be hidden. These can be used to understand the business better and can also be exploited to improve future performance through predictive analytics.
Pentaho Data Mining is differentiated by its open, standards-compliant nature, its use of Weka data mining technology, and its tight integration with core business intelligence capabilities including reporting, analysis, and dashboards. Other data mining offerings lack this level of sophistication and integration.
It is delivered as:
- An out-of-the-box solution for immediate deployment to analysts. As far as end-users are concerned, data mining operates entirely in the background; users see results and recommendations through e-mail or other web pages, which can include Pentaho Dashboards.
- A set of components that enable Java developers to quickly create custom reporting solutions using Java Objects or Java Server Pages (JSPs). These can be tightly integrated with other applications or portals, together with other components of the overall Pentaho BI Suite.
- Provides insight into hidden patterns and relationships in your data
- Enables you to exploit these correlations to improve organizational performance
- Provides indicators of future performance
- Enables embedding of recommendations in your applications
- Enables you to take full advantage of a range of data mining algorithms
Technology
Powerful Data Mining Engine
Provides a comprehensive set of machine learning algorithms from the Weka project, including clustering, segmentation, decision trees, random forests, neural networks, and principal component analysis. Pentaho has added integration with Pentaho Data Integration and automated the process of transforming data into the format the data mining engine needs. Algorithms can either be applied directly to a dataset or called from Java code. Output can be viewed graphically, interacted with programmatically, or used as a data source for reports, further analysis, and other processes. Filters are provided for discretization, normalization, re-sampling, attribute selection, and transforming and combining attributes.
Classifiers provide models for predicting nominal or numeric quantities. Learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, and other advanced techniques. The data mining engine is also well-suited for developing new machine learning schemes, enabling customers to incorporate their own models. Inputs and outputs can be controlled programmatically, enabling developers to create completely custom solutions using the components provided.
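To make "instance-based classifier" concrete: such a classifier predicts a nominal value for a new point from its stored training instances. Weka itself is Java; the following is only an illustrative 1-nearest-neighbour sketch in plain Python, with invented data, not Weka's API.

```python
# Minimal instance-based (1-nearest-neighbour) classifier sketch.

def nn_classify(train, point):
    """train: list of (features, label) pairs; returns the label of the
    training instance closest to point (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda t: dist(t[0], point))[1]

train = [((0.0, 0.0), "no"), ((1.0, 1.0), "yes")]
label = nn_classify(train, (0.9, 0.8))  # nearest stored instance is "yes"
```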
Graphical user interfaces are provided for data pre-processing, classification, regression, clustering, association rules, and visualization.
Customer Successes
Pentaho customers address a wide range of BI challenges using services and software from Pentaho. Many Pentaho customers use Pentaho for reporting, data integration, dashboards, and/or analysis. Some use multiple modules or the full Pentaho BI Suite. With subscription services and open source licensing from Pentaho, customers can get best-in-class BI capabilities with the peace of mind of professional support, software maintenance, training, consulting, and more. The following is a small sample of the many organizations around the globe that depend on Pentaho for commercial open source business intelligence.
"Our only regret was that we didn't have Pentaho for data integration years ago. Immediately we were able to see the increased operational efficiency, reduced internal costs and greater customer value using Pentaho Data Integration."
Deployment Overview
Key Challenges
- Cumbersome, manual process for creation and distribution of reports
- Multiple data points, including Google Analytics, needed to be integrated and automated into one report

Pentaho Solution
- Pentaho Data Integration
- Business and implementation services by Pentaho Systems Integrator Partner, DEFTeam Solutions

Results
- Increased operational efficiency
- Reduced internal costs
- Greater customer value

Why Pentaho
- Low cost
- Flexibility
- Speed-to-market
"We needed to deliver a business intelligence solution that would show immediate benefit by increasing efficiencies, containing costs, and helping drive revenue. By using Pentaho BI Suite Enterprise Edition, we were able to do so in a fiscally responsible manner, and in today's economic climate that is of utmost importance."
Deployment Overview
Key Challenges
- Gaining better insight across the organization to help steer strategic decision-making
- Conducting deeper analysis on historical data across all facets of its service offerings

Pentaho Solution
- Pentaho BI Suite Enterprise Edition for data integration, reporting, and analysis
- CentOS, PostgreSQL database

Results
- Company-wide performance gains through better visibility into customer, cost, and revenue trends
- Increased operational efficiency, reduced internal costs, and greater customer value

Why Pentaho
- End-to-end BI capabilities
- Value vs. proprietary BI
- Enterprise Edition features
"The simplicity of the interface actually allows Lifetime Entertainment Services to give direct access to business analysts, allowing them to understand and manage the business rules governing the integration of information. That wasn't previously possible with complex hand-coded integration jobs."
Deployment Overview
Key Challenges
- Optimizing advertising processes to drive ad revenue growth
- Adapting data integration infrastructure to keep up with changing business rules

Pentaho Solution
- Pentaho Data Integration Enterprise Edition
- Selected over Informatica and BusinessObjects Data Integrator
- Continued use of Business Objects BI tools

Results
- Ability for business analysts to manage integration rules and adapt integration processes to company business rules

Why Pentaho
- Ease of use
- Cost of ownership
- Enterprise Edition features
"ActivePivot (tm) uniquely marries the concept of online analytical process with real-time position-keeping; something no other company currently offers. Thanks to Pentaho Spreadsheet Services we can now offer seamless MDX connectivity to Microsoft Excel."
Deployment Overview
Key Challenges
- Excel-based access to analytic application data
- Maximizing margins on an analytic software solution for financial institutions

Pentaho Solution
- Pentaho Analysis
- Pentaho Spreadsheet Services

Results
- Competitive differentiation based on Excel-based access to centralized information
- Low costs delivered by the commercial open source business model

Why Pentaho
- Standards-based offering allowing Excel-based connectivity to live OLAP data
"Pentaho's BI suite and top-notch professional support enabled us to deliver a successful, high-value BI solution at a much lower cost than would have been possible with the expensive, proprietary alternatives."
Deployment Overview
Key Challenges
- Understanding the effectiveness of its online marketing activities
- Outgrowing its Microsoft Excel-based reporting system
- Maintaining complex, hand-coded ETL scripts

Pentaho Solution
- Pentaho BI Suite Enterprise Edition
- IBM servers, SUSE Linux, 1.5 terabyte Microsoft SQL Server data warehouse
- Professional services from Pentaho partner OpenBI

Results
- Automated integration of clickstream data with Google Analytics and catalog sales activity data
- Greater visibility into website traffic, keyword performance, and revenue attribution

Why Pentaho
- Standards-based, cross-platform support
- Quality of support and services
"With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution."
Deployment Overview
Key Challenges
- Measuring and optimizing agent performance, customer satisfaction, and marketing ROI
- Getting an integrated, strategic view across multiple operational systems

Pentaho Solution
- Pentaho Data Integration Enterprise Edition
- Red Hat Enterprise Linux, MySQL database
- Continued use of proprietary BI tools (MicroStrategy)

Results
- Three-fold performance increase, 8-hour reduction in batch load times
- Simplified maintenance and reduced costs

Why Pentaho
- Functionality and flexibility
- Product expertise
- Professional support