ETL Concepts
Techniques
Batch load utility: sort input records on the clustering key and use sequential I/O; then build indexes and derived tables
If sequential loads are still too long, use parallelism and incremental techniques
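The batch-load idea above can be sketched in a few lines, assuming in-memory records and a dict-based secondary index; the function name and record layout are illustrative, not a real load utility:

```python
def batch_load(records, key):
    """Sort records on the clustering key so the load is a single
    sequential pass, then build an index as a post-load step."""
    sorted_rows = sorted(records, key=lambda r: r[key])   # clustering order
    table = list(sorted_rows)                             # sequential append
    index = {row[key]: pos for pos, row in enumerate(table)}
    return table, index

rows = [{"id": 3, "v": "c"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
table, index = batch_load(rows, "id")
```

Sorting first is what lets a real utility stream the rows to disk instead of doing random inserts; the index build is deferred for the same reason.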
The Need for ETL
Facilitates integration of data from various data sources for building a data warehouse
Businesses have data in multiple databases with different codification and formats
Transformation is required to convert and summarize operational data into a consistent, business-oriented format
Pre-computation of any derived data is also performed at this stage
• Note: Mergers and acquisitions also create disparities in data representation and pose more difficult challenges in ETL.
The Need for ETL - Example
Gender codes: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
Pipeline length units: appl A - cm; appl B - in; appl C - feet; appl D - yds
Balance field names: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
The data warehouse must reconcile each of these into a single representation
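A minimal sketch of the codification problem above: each source application encodes gender and pipeline length differently, and the transformation step maps them to one warehouse-wide convention. The mapping tables and the choice of metres as the warehouse unit are illustrative assumptions:

```python
# Per-application gender codes, normalised to M/F in the warehouse
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}

# Conversion factors to the assumed warehouse unit (metres): cm, in, ft, yd
LENGTH_TO_M = {"A": 0.01, "B": 0.0254, "C": 0.3048, "D": 0.9144}

def transform(appl, gender_code, pipeline_length):
    return {
        "gender": GENDER_MAP[appl][gender_code],
        "pipeline_m": round(pipeline_length * LENGTH_TO_M[appl], 4),
    }

transform("B", "1", 100)   # -> {"gender": "M", "pipeline_m": 2.54}
```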
Data Integrity Problems - Scenarios
Same person, different spellings
• Agarwal, Agrawal, Aggarwal, etc.
Multiple ways to denote a company name
• Persistent Systems, PSPL, Persistent Pvt. Ltd.
Use of different names for the same place
• mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at the point of sale
ETL Glossary
Extracting
Conditioning
Householding
Enrichment
Scoring
ETL Glossary
Extracting
Capture of data from the operational source in “as is” status
Sources for data are generally legacy mainframes in VSAM, IMS, IDMS and DB2; more data today is in relational databases on Unix
Conditioning
The conversion of data types from the source format into the target format
ETL Glossary
Householding
Identifying all members of a household (living at the same address)
Ensures only one mailing is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs; a 2% saving is worth Rs. 1 lakh
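Householding can be sketched as grouping customer records on a normalised address key, so one catalogue goes to each group. The normalisation here is a crude lowercase/whitespace pass on invented sample data; real tools do far more (abbreviation expansion, postal standards):

```python
def normalise(address):
    # Collapse case and runs of whitespace so near-identical addresses match
    return " ".join(address.lower().split())

def households(customers):
    groups = {}
    for name, address in customers:
        groups.setdefault(normalise(address), []).append(name)
    return groups

mailing_list = [
    ("A. Kumar", "12 MG Road, Pune"),
    ("S. Kumar", "12 mg road,  pune"),   # same household, messy case/spacing
    ("R. Mehta", "7 FC Road, Pune"),
]
hh = households(mailing_list)
len(hh)   # 2 households, so 2 catalogues instead of 3
```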
Enrichment
Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet and A. C. Nielsen.
The ETL Process
Access data dictionaries defining source
files
Build logical and physical data models for
target data
Identify sources of data from existing
systems
Specify business and technical rules for
data extraction, conversion and
transformation
Perform data extraction, transformation and loading
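The steps above end in one pipeline: extract rows from a source, apply the specified conversion rules, and load into a target. The sketch below uses invented source data and rules to show the shape of that last step; it is not a real ETL framework:

```python
def extract(source):
    return list(source)                       # "as is" capture

def transform(rows, rules):
    # Apply each named rule (a function of the source row) per output column
    return [{col: rule(row) for col, rule in rules.items()} for row in rows]

def load(target, rows):
    target.extend(rows)
    return target

source = [{"bal": "1,200", "cust": " alice "}]
rules = {
    "balance": lambda r: int(r["bal"].replace(",", "")),  # format change
    "customer": lambda r: r["cust"].strip().title(),      # standardisation
}
warehouse = load([], transform(extract(source), rules))
```

Keeping the business rules as data (the `rules` dict) rather than hard-coded logic mirrors the "specify rules, then perform" split in the process above.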
The ETL Process – Push vs. Pull
Pull: a pull strategy is initiated by the target system. As part of the extraction process, source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.
• Advantage: no additional space is required on the source to store the data that needs to be loaded into the staging database
• Disadvantage: places a burden on the transactional systems when we want to load data into the staging database
Push: a push strategy is initiated by the source system, which writes its data out to the staging area on its own schedule.
The ETL Process – A simplified picture
OLTP Systems → Extract → Staging Area → Transform → Load → Data Warehouse

Record-level transformations:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level transformations:
Single-field – from one field to one field
Multi-field – from many fields to one, or one field to many
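The transformation types above can be shown on plain rows: record-level selection and aggregation, and a field-level multi-field merge (many fields to one). All data here is illustrative:

```python
sales = [
    {"region": "west", "amount": 10, "fname": "A", "lname": "Rao"},
    {"region": "west", "amount": 15, "fname": "B", "lname": "Sen"},
    {"region": "east", "amount": 20, "fname": "C", "lname": "Das"},
]

# Record-level Selection: partition the data on a predicate
west = [r for r in sales if r["region"] == "west"]

# Record-level Aggregation: summarise many records into one value
west_total = sum(r["amount"] for r in west)

# Field-level Multi-field: combine many fields into one
names = [f'{r["fname"]} {r["lname"]}' for r in sales]
```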
The ETL Process – Step 4
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
The ETL Process - Data
Transformation
Transforms the data in accordance with
the business rules and standards that
have been established
Examples include: format changes, de-duplication, splitting up fields,
replacement of codes, derived values,
and aggregates
Scrubbing/Cleansing Data
Sophisticated transformation tools used for
improving the quality of data
Clean data is vital for the success of the
warehouse
Example
• Seshadri, Sheshadri, Sesadri, Seshadri S.,
Srinivasan Seshadri, etc. are the same
person
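One cleansing sketch for spelling variants like these: flag two name strings as probable duplicates when their similarity ratio crosses a threshold. `difflib` from the standard library is a crude stand-in for the sophisticated matching tools the slide mentions, and the 0.8 threshold is an assumption:

```python
from difflib import SequenceMatcher

def probably_same(a, b, threshold=0.8):
    # Ratio is 2*matches/(len(a)+len(b)), computed case-insensitively
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

probably_same("Seshadri", "Sheshadri")   # True: one-letter variant
probably_same("Seshadri", "Mehta")       # False: unrelated names
```

In practice the threshold trades false merges against missed duplicates, which is why cleansing usually combines string similarity with other keys (address, account number).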
Reasons for “Dirty” data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems
The ETL Process - Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
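One hypothetical pass through the steps above for a "name, city" field: parse it into components, correct a known variant, standardise, then match and consolidate duplicates on the cleaned key. The lookup table and input format are invented for illustration:

```python
CORRECTIONS = {"bombay": "mumbai"}          # correcting known variants

def parse(raw):
    # Parsing: split the free-form field into components
    name, city = [part.strip() for part in raw.split(",")]
    return {"name": name, "city": city.lower()}

def standardise(rec):
    rec["city"] = CORRECTIONS.get(rec["city"], rec["city"])
    return rec

def consolidate(records):
    seen = {}
    for rec in map(standardise, map(parse, records)):
        seen[(rec["name"], rec["city"])] = rec   # matching on the cleaned key
    return list(seen.values())

rows = ["Agarwal, Bombay", "Agarwal, Mumbai"]
consolidate(rows)   # one consolidated record
```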
Issues:
when to refresh?
on every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
periodically (e.g., every 24 hours,
every week) or after “significant”
events
refresh policy set by administrator
Data Refresh
Data refreshing can follow two approaches :
Complete Data Refresh
• Completely refresh the target table every
time
Data Trickle Load
• Replicate only net changes and update
the target database
Data Refresh Techniques
Snapshot Approach - Full extract from base
tables
read entire source table or database:
expensive
may be the only choice for legacy
databases or files.
Incremental techniques (related to work on
active DBs)
detect & propagate changes on base
tables: replication servers (e.g., Sybase,
Oracle, IBM Data Propagator)
snapshots & triggers (Oracle)
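The incremental technique above can be sketched with a high-water mark: instead of a full snapshot, pull only rows whose last-modified timestamp is newer than the previous run's mark. The column name and integer timestamps are assumptions for illustration:

```python
def incremental_extract(source_rows, high_water_mark):
    # Trickle load: select only rows changed since the last run
    changed = [r for r in source_rows if r["last_modified"] > high_water_mark]
    # Advance the mark so the next run skips what we just propagated
    new_mark = max((r["last_modified"] for r in changed), default=high_water_mark)
    return changed, new_mark

source = [
    {"id": 1, "last_modified": 10},
    {"id": 2, "last_modified": 25},   # changed since the last run
]
delta, mark = incremental_extract(source, high_water_mark=20)
```

Trigger- and replication-based detection achieve the same effect without a timestamp column, at the cost of touching the source schema or log.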
ETL Solution Options
ETL solutions fall into two broad categories: custom solutions and generic solutions
Custom Solution
Using RDBMS staging tables and stored
procedures
Programming languages such as C, C++, Perl, Visual Basic, etc.
Building a code generator
Custom Solution – Typical components
Extract from source → Data quality checks → Generate download files
• Control program, typically written in COBOL or C
First-Generation ETL Tools –
Examples
SAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from
Evolutionary Technologies
Copy Manager from Information Builders
Types of ETL Tools - Second-
Generation
Extraction/Transformation/Load runs on
server
Data directly extracted from source and
processed on server
Data transformation in memory and written
directly to warehouse database. High
throughput since intermediate files are not
used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation and loading
Second-Generation ETL Tools –
Strengths and Limitations
Strengths:
Lower-cost suites
Multi-threaded
Operations management / process automation
Limitations:
Not mature
Initial tools oriented only to one environment
Data transformation and repair complexity