ETL Concepts
- What is ETL?
- Need for ETL
- ETL Glossary
- The ETL Process
- Data Extraction and Preparation
- Data Cleansing
- Data Transformation
- Data Load
- Data Refresh Strategies
- ETL Solution Options
- Characteristics of ETL Tools
ETL stands for Extraction, Transformation and Load. This is the most challenging, costly and time-consuming step in building any type of data warehouse. It usually determines the success or failure of a data warehouse, because any analysis places great importance on the data and on the quality of the data being analyzed.
What is ETL?
Extraction: the process of culling out the data required for the Data Warehouse from the source system.
- Can be to a file or to a database
- Could involve some degree of cleansing or transformation
- Can be automated, since it becomes repetitive once established
- Facilitates integration of data from various data sources for building a data warehouse. (Note: mergers and acquisitions also create disparities in data representation and pose more difficult ETL challenges.)
- Businesses have data in multiple databases, with different codification and formats.
- Transformation is required to convert and summarize operational data into a consistent, business-oriented format.
- Pre-computation of any derived data.
Typical data quality problems:
- Same person, different spellings: Agarwal, Agrawal, Aggarwal, etc.
- Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. Ltd.
- Use of different names: mumbai, bombay
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at point of sale
ETL Glossary
- Extracting
- Conditioning
- Householding
- Enrichment
- Scoring
ETL Glossary
Extracting: culling out the data required for the Data Warehouse from the source systems (see "What is ETL?" above).
Householding: identifying all members of a household (those living at the same address). Ensures only one mailing is sent per household, which can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs, so a 2% saving is worth Rs. 1 lakh. (A householding sketch follows below.)
Enrichment: bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A. C. Nielsen, etc.
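As a rough illustration of the householding idea, the sketch below groups customer records by a normalized address key and keeps one mailing contact per household. The record layout and the normalization rules are invented for illustration; real householding logic is far more elaborate.

```python
from collections import defaultdict

# Hypothetical customer records; in practice these come from the staging area.
customers = [
    {"name": "A. Sharma", "address": "12, MG Road, Pune"},
    {"name": "R. Sharma", "address": "12 M.G. Road Pune"},
    {"name": "S. Iyer",   "address": "4 Park Street, Kolkata"},
]

def household_key(address: str) -> str:
    """Normalize an address into a crude household key:
    lowercase, strip punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

households = defaultdict(list)
for rec in customers:
    households[household_key(rec["address"])].append(rec)

# Mail one catalogue per household instead of one per record.
for key, members in households.items():
    print(f"{key!r}: mail to {members[0]['name']} ({len(members)} member(s))")
```

Here the two Sharma records normalize to the same key, so only one catalogue goes to that address.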
Data Extraction and Preparation
Pull: a pull strategy is initiated by the target system. As part of the extraction process, the source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.
Advantage: no additional space is required on the source to store the data that needs to be loaded into the staging database.
Disadvantage: places a burden on the transactional systems when loading data into the staging database.
In short: with a PUSH strategy, the source system side maintains the application that reads the source and creates an interface file presented to your ETL. With a PULL strategy, the DW side maintains the application that reads the source.
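A minimal sketch of a pull-style extract follows, with a SQLite database standing in for the transactional source and a CSV file as the staging area. The database file, table, and column names are hypothetical, chosen only to make the sketch self-contained.

```python
import csv
import sqlite3

# PULL strategy: the warehouse side initiates the connection and reads the source.
source = sqlite3.connect("oltp.db")  # hypothetical transactional database
cursor = source.execute(
    "SELECT order_id, customer_id, amount, order_date FROM orders"
)

# Land the extract in a staging file for downstream transformation.
with open("staging_orders.csv", "w", newline="") as staging:
    writer = csv.writer(staging)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)                                 # data rows

source.close()
```

A push-style equivalent would run on the source side on its own schedule and simply deposit the same file where the ETL expects it.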
[Figure: The ETL process. Stage I: Extract from the OLTP systems into the staging area; Stage II: Transform within the staging area; Stage III: Load into the Data Warehouse.]
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract: capturing a snapshot of the source data at a point in time.
Incremental extract: capturing only the changes that have occurred since the last static extract.
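One common way to implement an incremental extract is to keep a high-water mark (the timestamp of the previous run) and pull only rows modified since then. The sketch below assumes the source table carries a last_updated column; that column, and the table name, are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

def incremental_extract(conn: sqlite3.Connection, last_extract: str):
    """Pull only rows changed since the previous run (incremental extract).
    Assumes a last_updated ISO-timestamp column on the (hypothetical) source table."""
    rows = conn.execute(
        "SELECT * FROM orders WHERE last_updated > ?", (last_extract,)
    ).fetchall()
    # The new high-water mark must be persisted somewhere durable
    # (a control table, a file) for the next run to use.
    new_mark = datetime.now(timezone.utc).isoformat()
    return rows, new_mark
```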
Transform: convert data from the format of the operational system to the format of the data warehouse.
Record-level transformations:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level transformations:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or from one field to many
(Both categories are illustrated in the sketch below.)
Load/Index: place the transformed data into the warehouse and create indexes.
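To make the record-level and field-level categories concrete, here is a minimal sketch over invented staging rows: record-level selection and aggregation, then a field-level multi-field split (one field to many). Field names and data are purely illustrative.

```python
from collections import defaultdict

staged = [
    {"name": "Srinivasan Seshadri", "region": "west", "amount": 120.0},
    {"name": "A. Sharma",           "region": "west", "amount": 80.0},
    {"name": "S. Iyer",             "region": "east", "amount": 0.0},
]

# Record-level selection: keep only rows relevant to the warehouse subject.
selected = [r for r in staged if r["amount"] > 0]

# Record-level aggregation: summarize detail rows by region.
totals = defaultdict(float)
for r in selected:
    totals[r["region"]] += r["amount"]

# Field-level, one field to many: split a free-form name into components.
for r in selected:
    parts = r["name"].split()
    r["first_name"], r["last_name"] = parts[0], parts[-1]

print(dict(totals))   # {'west': 200.0}
print(selected)
```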
Sophisticated transformation tools are used to improve the quality of data; clean data is vital for the success of the warehouse. Example: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. may all refer to the same person.
Scrubbing/Cleansing Data
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Examples include parsing the first, middle, and last name; street number and street name; and city and state.
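A rough sketch of parsing follows, splitting free-form name and street strings into individual elements. The input formats handled here are simplified assumptions; production parsers cope with far messier data.

```python
def parse_name(full_name: str) -> dict:
    """Isolate first, middle, and last name from a free-form name string."""
    parts = full_name.split()
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1] if len(parts) > 1 else "",
    }

def parse_street(street: str) -> dict:
    """Separate the street number from the street name."""
    number, _, name = street.partition(" ")
    return {"street_number": number, "street_name": name}

print(parse_name("Srinivasan S. Seshadri"))
# {'first': 'Srinivasan', 'middle': 'S.', 'last': 'Seshadri'}
print(parse_street("221B Baker Street"))
# {'street_number': '221B', 'street_name': 'Baker Street'}
```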
Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources. Examples include replacing a vanity address and adding a ZIP code.
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. Examples include adding a pre-name, replacing a nickname, and using a preferred street name.
Matching
Matching searches for and matches records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplicates. Examples include identifying similar names and addresses.
Consolidating
Consolidating analyzes and identifies relationships between matched records and consolidates/merges them into ONE representation.
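Matching and consolidating can be approximated with fuzzy string comparison. The sketch below uses Python's standard difflib; the 0.8 similarity threshold is an arbitrary assumption, and real matching engines use richer rules across multiple fields.

```python
from difflib import SequenceMatcher

records = ["Seshadri", "Sheshadri", "Sesadri", "S. Iyer"]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two values as a match above an (assumed) similarity threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Consolidate: fold each record into the first cluster it matches,
# keeping one surviving representation (the cluster head) per cluster.
clusters = []
for rec in records:
    for cluster in clusters:
        if similar(rec, cluster[0]):
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

print(clusters)  # [['Seshadri', 'Sheshadri', 'Sesadri'], ['S. Iyer']]
```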
[Figure: MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner, June 2005)]
The first load is a complex exercise: data is extracted from tapes, files, archives, etc., and the first-time load might take a long time to complete.
Data Refresh
Issues: when to refresh?
- On every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
- Periodically (e.g., every 24 hours, every week) or after significant events
- The refresh policy is set by the administrator
Data refresh can follow two approaches:
- Complete Data Refresh: completely refresh the target table every time
- Data Trickle Load: replicate only the net changes and update the target database
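A trickle load can be implemented as an upsert of only the net changes into the target table, rather than truncating and reloading everything. The sketch below assumes a SQLite target table dw_orders with a primary key on order_id; both are assumptions for illustration (SQLite's ON CONFLICT upsert requires version 3.24+).

```python
import sqlite3

def trickle_load(target: sqlite3.Connection, changed_rows):
    """Apply only net changes to the target: insert new rows,
    update rows whose key already exists (an upsert)."""
    target.executemany(
        """INSERT INTO dw_orders (order_id, customer_id, amount)
           VALUES (?, ?, ?)
           ON CONFLICT(order_id) DO UPDATE SET
               customer_id = excluded.customer_id,
               amount      = excluded.amount""",
        changed_rows,
    )
    target.commit()
```

A complete refresh, by contrast, would be a DELETE (or TRUNCATE) followed by a full reload of the table.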
Data Refresh
Snapshot approach (full extract from base tables):
- Reads the entire source table or database: expensive
- May be the only choice for legacy databases or files
Incremental techniques (related to work on active databases):
- Detect and propagate changes on base tables
- Replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
- Snapshots and triggers (Oracle)
ETL solution options: a Custom Solution or a Generic Solution.
Custom Solution
Options for building a custom solution:
- Using RDBMS staging tables and stored procedures
- Programming languages like C, C++, Perl, Visual Basic, etc.
- Building a code generator
Control Program
- Time-window based extraction
- Restart at point of failure
- High level of error handling
- Control metadata captured in Oracle tables
- Facility to launch failure recovery programs automatically
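A minimal sketch of the restart-at-point-of-failure idea: a control table records the last batch successfully processed, so a rerun resumes where the failed run stopped instead of starting over. The control-table layout, the batch notion, and the process() step are assumptions for illustration (the slide's control metadata lives in Oracle; SQLite stands in here).

```python
import sqlite3

def process(batch):
    """Hypothetical transformation/load step for one batch."""
    print("processing", batch)

def run_with_restart(ctrl: sqlite3.Connection, batches):
    """Process extraction batches, recording progress in a control table
    so a failed run can resume at the point of failure."""
    ctrl.execute("CREATE TABLE IF NOT EXISTS etl_control (last_batch INTEGER)")
    row = ctrl.execute("SELECT MAX(last_batch) FROM etl_control").fetchone()
    start = (row[0] if row[0] is not None else -1) + 1  # resume point

    for i, batch in enumerate(batches):
        if i < start:
            continue  # already committed by a previous (failed) run
        process(batch)
        ctrl.execute("INSERT INTO etl_control VALUES (?)", (i,))
        ctrl.commit()  # commit control metadata after each batch
```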
Generic Solution
Drivers for adopting generic ETL tools:
- Address the limitations (in scalability and complexity) of manual coding
- The need to deliver quantifiable business value
- Functionality, reliability, and viability are no longer major issues
- Provide a facility to specify a large number of transformation rules through a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
- Support data extraction, cleansing, aggregation, reorganization, transformation, and load operations
ETL tools fall into two generations.
First-generation tools generate directly executable code. Their drawbacks:
- High cost of products
- Complex training
- Extract programs have to be compiled from source
- Many transformations have to be coded manually
- Lack of parallel execution support
- Most metadata has to be manually generated
Second-generation tools are engine-driven products: a multi-threaded engine performs the ETL functions in a highly parallel fashion.
Note: due to their more efficient architecture, second-generation tools have significant performance advantages.
- Support to retrieve, cleanse, transform, summarize, aggregate, and load data
- Engine-driven products for fast, parallel operation
- Generate and manage a central metadata repository
- Open metadata exchange architecture
- Provide end users with access to metadata in business terms
- Support development of logical and physical data models
[Figure: Representative ETL products for target database loading: ETI Extract, SAS Warehouse Administrator, Ardent Warehouse Executive, Carleton Pureview (Source: Gartner Report)]