ETL Concepts


Scope of the Training

What is ETL?
Need for ETL
ETL Glossary
The ETL Process
• Data Extraction and Preparation
• Data Cleansing
• Data Transformation
• Data Load
• Data Refresh Strategies
ETL Solution Options
Characteristics of ETL Tools
What is ETL?
ETL stands for Extraction, Transformation and Load.
It is the most challenging, costly and time-consuming step in building any type of data warehouse.
This step usually determines the success or failure of a data warehouse, because any analysis places great importance on the data, and on the quality of the data, being analyzed.
What is ETL? - Extraction
Extraction
The process of selecting and copying out the data required for the Data Warehouse from the source system
Can be to a file or to a database
Could involve some degree of cleansing or transformation
Can be automated, since it becomes repetitive once established
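As a rough illustration of extraction to a file, here is a minimal Python sketch assuming a SQLite source and a hypothetical orders table; real sources are usually legacy or transactional systems reached through their own drivers:

```python
# Minimal extract-to-file sketch; "orders" and its columns are
# hypothetical names used only for illustration.
import csv
import sqlite3

def extract_to_file(source_db, out_path):
    conn = sqlite3.connect(source_db)
    cur = conn.execute("SELECT order_id, customer_id, amount FROM orders")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([d[0] for d in cur.description])  # header row
        writer.writerows(cur)                             # data rows
    conn.close()

extract_to_file("source.db", "orders_extract.csv")
```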
What is ETL? - Transformation & Cleansing
Transformation
Modification or transformation of data being imported into the Data Warehouse
Usually done to ensure ‘clean’ and ‘consistent’ data
Cleansing
The process of removing errors and inconsistencies from data being imported into a data warehouse
Could involve multiple stages
What is ETL? - Loading
After extracting, scrubbing, cleaning,
validating etc. need to load the data into
the warehouse
Issues
huge volumes of data to be loaded
small time window available when
warehouse can be taken off line (usually
nights)
when to build index and summary
tables
What is ETL? - Loading Techniques

Techniques
batch load utility: sort input records on the clustering key and use sequential I/O; build indexes and derived tables afterwards
when sequential loads are still too long, use parallelism and incremental techniques
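A minimal sketch of the batch-load idea, reusing the hypothetical CSV extract and table names from the earlier example: records are sorted on the clustering key so the write is sequential, and the index is built only after the load:

```python
# Batch-load sketch: sort on the clustering key, write sequentially,
# build the index afterwards. Table and column names are invented.
import csv
import sqlite3

def batch_load(extract_file, warehouse_db):
    with open(extract_file) as f:
        reader = csv.reader(f)
        next(reader)                                   # skip header
        rows = sorted(reader, key=lambda r: r[0])      # sort on clustering key
    conn = sqlite3.connect(warehouse_db)
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id TEXT, customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    # index built after the bulk insert, not row by row
    conn.execute("CREATE INDEX IF NOT EXISTS ix_orders ON fact_orders (order_id)")
    conn.commit()
    conn.close()
```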
The Need for ETL
Facilitates Integration of data from various
data sources for building a Datawarehouse
• Note: Mergers and acquisitions also create
disparities in data representation and pose
more difficult challenges in ETL.
Businesses have data in multiple
databases with different codification and
formats
Transformation is required to convert and
to summarize operational data into a
consistent, business oriented format
Pre-Computation of any derived
The Need for ETL - Example
Data Warehouse
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female

appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
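One way to reconcile such encodings is a set of per-application mapping tables, sketched below in Python; which gender each code denotes (e.g. appl B's 1/0), and the record field names, are assumptions made for illustration:

```python
# Per-application mapping tables reconciling the encodings above
# into one warehouse standard.
GENDER_MAP = {
    "appl A": {"m": "male", "f": "female"},
    "appl B": {"1": "male", "0": "female"},   # assumed: 1 = male
    "appl C": {"x": "male", "y": "female"},   # assumed: x = male
    "appl D": {"male": "male", "female": "female"},
}
PIPELINE_TO_CM = {"appl A": 1.0, "appl B": 2.54, "appl C": 30.48, "appl D": 91.44}
BALANCE_FIELD = {"appl A": "balance", "appl B": "bal",
                 "appl C": "currbal", "appl D": "balcurr"}

def standardize(app, record):
    """Return a record in the warehouse's consistent format."""
    return {
        "gender": GENDER_MAP[app][str(record["gender"])],
        "pipeline_cm": record["pipeline"] * PIPELINE_TO_CM[app],
        "balance": record[BALANCE_FIELD[app]],
    }

standardize("appl B", {"gender": 1, "pipeline": 100, "bal": 2500.0})
# -> {'gender': 'male', 'pipeline_cm': 254.0, 'balance': 2500.0}
```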
Data Integrity Problems - Scenarios
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc.
Multiple ways to denote a company name
Persistent Systems, PSPL, Persistent Pvt. Ltd.
Use of different names
mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
ETL Glossary

 Extracting
 Conditioning
 Householding
 Enrichment
 Scoring
ETL Glossary

Extracting
Capture of data from the operational source in “as is” status
Sources are generally legacy mainframes (VSAM, IMS, IDMS, DB2); more data today is in relational databases on Unix
Conditioning
The conversion of data types from the source format to the format of the target data warehouse
ETL Glossary
Householding
Identifying all members of a household (living at the same address)
Ensures only one mailing is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakh; a 2% saving would save Rs. 1 lakh.
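A toy householding pass might group customers on a standardized address key, as in this hypothetical sketch:

```python
# Householding sketch: customers sharing the same (already
# standardized) address key are grouped so a mailing goes out once
# per household. Field names are invented.
from collections import defaultdict

def household_keys(customers):
    households = defaultdict(list)
    for c in customers:
        key = (c["address"].lower(), c["city"].lower(), c["pincode"])
        households[key].append(c["name"])
    return households

mailing = household_keys([
    {"name": "A. Kumar", "address": "12 MG Road", "city": "Pune", "pincode": "411001"},
    {"name": "S. Kumar", "address": "12 MG Road", "city": "Pune", "pincode": "411001"},
])
# one catalogue per household key, not per customer
```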
Enrichment
Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet and A. C. Nielsen.
The ETL Process
Access data dictionaries defining source files
Build logical and physical data models for target data
Identify sources of data from existing systems
Specify business and technical rules for data extraction, conversion and transformation
Perform data extraction, transformation and load
The ETL Process – Push vs. Pull
 Pull: A pull strategy is initiated by the target system. As part of the extraction process, the source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.
• Advantage: no additional space is required on the source to store the data that is to be loaded into the staging database
• Disadvantage: places a burden on the transactional systems when data is loaded into the staging database
The ETL Process – Push vs. Pull
With a PUSH strategy, the source system maintains the application that reads the source and creates an interface file, which is presented to your ETL.
With a PULL strategy, the DW maintains the application that reads the source.
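A minimal sketch of the PULL strategy, with the warehouse side opening the connection and copying rows into its own staging table; SQLite and the table names stand in for a real transactional source:

```python
# PULL sketch: the target (warehouse) side connects to the
# transactional source and copies rows into a staging table.
import sqlite3

def pull_into_staging(source_db, staging_db):
    src = sqlite3.connect(source_db)      # connection opened by the target side
    stg = sqlite3.connect(staging_db)
    stg.execute("CREATE TABLE IF NOT EXISTS stg_orders "
                "(order_id TEXT, customer_id TEXT, amount REAL)")
    rows = src.execute("SELECT order_id, customer_id, amount FROM orders")
    stg.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
    stg.commit()
    src.close()
    stg.close()
```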
The ETL Process - Data Extraction and Preparation
Stage I: Extract
Stage II: Analyze, Clean and Transform
Stage III: Data Movement and Load
(followed by periodic Refresh/Update of the warehouse)
The ETL Process – A simplified picture

OLTP Systems (multiple) --Extract (Stage I)--> Staging Area --Transform (Stage II), Load (Stage III)--> Data Warehouse
The ETL Process – Step 1
Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing changes that have occurred since the last static extract.
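The two extract styles could look like this, assuming a hypothetical updated_at column on the source; a real source might instead expose a change log or replication stream:

```python
# Static vs. incremental extract, sketched against a hypothetical
# "orders" table with an "updated_at" timestamp column.
import sqlite3

def static_extract(conn):
    # full snapshot of the source at a point in time
    return conn.execute("SELECT * FROM orders").fetchall()

def incremental_extract(conn, last_extract_ts):
    # only rows changed since the previous extract
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_extract_ts,)
    ).fetchall()
```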
The ETL Process – Step 2
Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality.
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies.
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data.
The ETL Process – Step 3
Transform = convert data from the format of the operational system to the format of the data warehouse.
Record-level: Selection – data partitioning; Joining – data combining; Aggregation – data summarization.
Field-level: Single-field – from one field to one field; Multi-field – from many fields to one, or one field to many.
The ETL Process – Step 4
Load/Index = place transformed data into the warehouse and create indexes.
Refresh mode: bulk rewriting of target data at periodic intervals.
Update mode: only changes in source data are written to the data warehouse.
The ETL Process - Data Transformation
Transforms the data in accordance with the business rules and standards that have been established
Examples include: format changes, de-duplication, splitting up fields, replacement of codes, derived values, and aggregates
Scrubbing/Cleansing Data
Sophisticated transformation tools are used for improving the quality of data
Clean data is vital for the success of the warehouse
Example
• Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
Reasons for “Dirty” data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems
The ETL Process - Data Cleansing

Source systems contain “dirty data” that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
Leading data cleansing/quality technology vendors include IBM, among others (see the vendor list below)
Steps in Data Cleansing

Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
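A simplified parsing sketch; production parsers handle far messier input than these patterns assume:

```python
# Parsing sketch: isolate name and address elements with simple rules.
import re

def parse_name(full):
    parts = full.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1]) or None
    return first, middle, last

def parse_street(line):
    m = re.match(r"\s*(\d+)\s+(.*)", line)   # leading number, then street name
    return (m.group(1), m.group(2)) if m else (None, line.strip())

parse_name("Srinivasan S. Seshadri")   # ('Srinivasan', 'S.', 'Seshadri')
parse_street("221 Baker Street")       # ('221', 'Baker Street')
```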
Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples include adding a pre-name (title), replacing a nickname, and using a preferred street name.
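Standardization is often table-driven; the conversion tables below are invented business rules used only for illustration:

```python
# Standardizing sketch: conversion tables for nicknames and street
# suffixes; both mappings are hypothetical business rules.
NICKNAMES = {"bob": "Robert", "bill": "William"}
STREET_SUFFIX = {"st": "Street", "rd": "Road", "ave": "Avenue"}

def standardize_record(rec):
    rec["first_name"] = NICKNAMES.get(rec["first_name"].lower(), rec["first_name"])
    words = rec["street"].split()
    words[-1] = STREET_SUFFIX.get(words[-1].lower().rstrip("."), words[-1])
    rec["street"] = " ".join(words)
    return rec

standardize_record({"first_name": "Bob", "street": "12 Baker St."})
# {'first_name': 'Robert', 'street': '12 Baker Street'}
```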
Matching

Searching for and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications.
Examples include identifying similar names and addresses.
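A toy fuzzy-match using only the standard library; real matching engines combine several similarity rules and thresholds:

```python
# Fuzzy-match sketch on normalized strings.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

similar("Agarwal", "Agrawal")    # True  -- likely the same surname
similar("mumbai", "bombay")      # False -- synonyms need a lookup table,
                                 #          not string similarity
```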
Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
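A minimal consolidation rule, keeping the first non-empty value per field; real merge rules are usually source- and field-specific:

```python
# Consolidation sketch: merge a group of matched records into one
# "golden" record. The first non-empty value per field wins.
def consolidate(matched):
    golden = {}
    for rec in matched:
        for field, value in rec.items():
            if value and not golden.get(field):
                golden[field] = value
    return golden

consolidate([
    {"name": "S. Seshadri", "phone": "", "city": "Pune"},
    {"name": "Srinivasan Seshadri", "phone": "9822000000", "city": ""},
])
# {'name': 'S. Seshadri', 'phone': '9822000000', 'city': 'Pune'}
```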
Data Quality Technology Tools (Vendors)
DataFlux Integration Server & dfPower® Studio (www.DataFlux.com)
Trillium Software Discovery & Trillium Software System (www.trilliumsoftware.com)
ProfileStage & QualityStage (www.ascential.com)

MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner - June 2005)
The ETL Process - Data Loading

Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is towards near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications
Data Loading - First Time Load
First load is a complex exercise
• Data extracted from tapes, files, archives
etc.
• First time load might take a lot of time to
complete
Data Refresh

Issues:
when to refresh?
on every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
periodically (e.g., every 24 hours, every week) or after “significant” events
refresh policy is set by the administrator
Data Refresh
Data refreshing can follow two approaches:
Complete Data Refresh
• Completely refresh the target table every time
Data Trickle Load
• Replicate only net changes and update the target database
Data Refresh Techniques
Snapshot Approach - Full extract from base
tables
read entire source table or database:
expensive
may be the only choice for legacy
databases or files.
Incremental techniques (related to work on
active DBs)
detect & propagate changes on base
tables: replication servers (e.g., Sybase,
Oracle, IBM Data Propagator)
snapshots & triggers (Oracle)
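A trickle-load sketch that applies only net changes via an upsert; it assumes the changed rows were captured by one of the techniques above (timestamps, triggers, or a replication feed), and the upsert syntax needs SQLite 3.24+:

```python
# Trickle load: apply only the net changes to the target table.
# Table name and schema are hypothetical.
import sqlite3

def trickle_load(target_db, changed_rows):
    conn = sqlite3.connect(target_db)
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                 "(id TEXT PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer (id, name, city) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, city = excluded.city",
        changed_rows,
    )
    conn.commit()
    conn.close()

trickle_load("warehouse.db", [("C1", "A. Kumar", "Pune")])
```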
ETL Solution Options

ETL solutions fall into two broad categories: a Custom Solution (hand-built) or a Generic Solution (tool-based).
Custom Solution
Using RDBMS staging tables and stored procedures
Programming languages like C, C++, Perl, Visual Basic etc.
Building a code generator
Custom Solution – Typical components

Extract From Source
• Multiple stars extracted as separate groups
• Snapshots for dimension tables
• PL/SQL extraction procedures
• Complex views for transformation
• Control table and highly parameterized/generic extraction process

Data Quality
• Control table driven
• Highly configurable process
• PL/SQL procedure
• Checks performed - referential integrity, Y2K, elementary statistics, business rules
• Mechanism to flag the records as bad / reject

Generate Download Files
• Pro*C programs using embedded SQL
• Surrogate key generation mechanism
• ASCII file downloads generated for load into warehouse

Control Program
• Time window based extraction
• Restart at point of failure
• High level of error handling
• Control metadata captured in Oracle tables
• Facility to launch failure recovery programs automatically
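The restart-at-point-of-failure idea above can be sketched with a small control table; the step names and schema here are hypothetical:

```python
# Control-program sketch: each completed step records a status row,
# so a rerun of the same run_id resumes at the point of failure.
import sqlite3

STEPS = ["extract", "data_quality", "generate_files", "load"]

def run_with_restart(conn, run_id, actions):
    conn.execute("CREATE TABLE IF NOT EXISTS etl_control "
                 "(run_id TEXT, step TEXT, status TEXT)")
    done = {s for (s,) in conn.execute(
        "SELECT step FROM etl_control WHERE run_id = ? AND status = 'OK'",
        (run_id,))}
    for step in STEPS:
        if step in done:
            continue                  # already completed in a prior attempt
        actions[step]()               # may raise; no 'OK' row is written then
        conn.execute("INSERT INTO etl_control VALUES (?, ?, 'OK')",
                     (run_id, step))
        conn.commit()

conn = sqlite3.connect("etl_control.db")
run_with_restart(conn, "2024-01-15", {s: (lambda: None) for s in STEPS})
```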
Generic Solution
Address limitations (in scalability &
complexity) of manual coding
The need to deliver quantifiable business
value
Functionality, Reliability and Viability are
no longer major issues
Characteristics of ETL Tools
 Provides facility to specify a large number of
transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
 Support data extraction, cleansing, aggregation,
reorganization, transformation, and load
operations
Types of ETL Tools
First-generation
• Code-generation products
• Generate the source code
Second-generation
• Engine-driven products
• Generate directly executable code

Note: Due to their more efficient architecture, second-generation tools offer significantly higher throughput.
Types of ETL Tools - First-Generation
Extraction, transformation and load processes run on a server or host
A GUI is used to define extraction/transformation processes
Detailed transformations require coding in COBOL or C
The extract program is generated automatically as source code; the source code is compiled, scheduled, and run in batch mode
Uses intermediate files
First-Generation ETL Tools – Strengths and Limitations
Strengths
 Tools are mature
 Programmers are familiar with code generation in COBOL or C
Limitations
 High cost of products
 Complex training
 Extract programs have to be compiled from source
 Many transformations have to be coded manually
 Lack of parallel execution support
 Most metadata has to be manually generated
First-Generation ETL Tools –
Examples
SAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from
Evolutionary Technologies
Copy Manager from Information Builders
Types of ETL Tools - Second-Generation
Extraction/Transformation/Load runs on a server
Data is directly extracted from the source and processed on the server
Data transformation happens in memory and is written directly to the warehouse database; high throughput since intermediate files are not used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation, and load
Second-Generation ETL Tools – Strengths and Limitations
Strengths
 Lower cost suites, platforms, and environment
 Fast, efficient, and multi-threaded
 ETL functions highly integrated
Limitations
 Not mature
 Initial tools oriented only to RDBMS sources
Second-Generation ETL Tools – Examples
 PowerMart from Informatica
 DataStage from Ardent
 Data Mart Solution from Sagent Technology
 Tapestry from D2K
ETL Tools - Examples
 DataStage from Ascential Software
 SAS System from SAS Institute
 Power Mart/Power Center from Informatica
 Sagent Solution from Sagent Software
 Hummingbird Genio Suite from Hummingbird Communications
ETL Tool - General Selection criteria
 Business Vision/Considerations
 Overall IT strategy/Architecture
 Overall cost of ownership
 Vendor positioning in the market
 Performance
 In-house expertise available
 User friendliness
 Training requirements for existing users
 References from other customers
ETL Tool – Specific Selection criteria
Support to retrieve, cleanse, transform,
summarize, aggregate, and load data
Engine-driven products for fast, parallel
operation
Generate and manage central metadata
repository
Open metadata exchange architecture
Provide end-users with access to
metadata in business terms
Support development of logical and
physical data models
ETL Tool - Selection criteria
(Figure: Gartner chart rating ETL tools High/Low on six dimensions)
Dimensions rated:
• Data Extraction & Integration complexity
• Metadata Management and Administration
• Data Transformation and Repair Complexity
• Operations Management / Process Automation
• Target Database Loading Capabilities
• Ease of Use / Development
Tools rated: ETI Extract, Informatica PowerCenter, SAS Warehouse Administrator, Platinum Decision Base, Ardent DataStage, Ardent Warehouse Executive, DataMirror Transformation Server, Carleton Pureview

Source: Gartner Report
