ETL Concepts
Techniques
Batch load utility: sort input records on the clustering key and use sequential I/O; then build indexes and derived tables
If sequential loads are still too long, use parallelism and incremental techniques
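The batch-load idea above can be sketched in a few lines, assuming in-memory records and a dict-based secondary index; the function name and record layout are illustrative, not a real load utility:

```python
def batch_load(records, key):
    """Sort records on the clustering key so the load is a single
    sequential pass, then build an index as a post-load step."""
    sorted_rows = sorted(records, key=lambda r: r[key])   # clustering order
    table = list(sorted_rows)                             # sequential append
    index = {row[key]: pos for pos, row in enumerate(table)}
    return table, index

rows = [{"id": 3, "v": "c"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
table, index = batch_load(rows, "id")
```

Sorting first is what lets a real utility stream the rows to disk instead of doing random inserts; the index build is deferred for the same reason.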
The Need for ETL
Facilitates integration of data from various data sources for building a data warehouse
Businesses have data in multiple databases with different codification and formats
Transformation is required to convert and summarize operational data into a consistent, business-oriented format
Pre-computation of any derived data is also performed at this stage
• Note: Mergers and acquisitions also create disparities in data representation and pose more difficult challenges in ETL.
The Need for ETL - Example
Gender codes: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
Pipeline length units: appl A - cm; appl B - in; appl C - feet; appl D - yds
Balance field names: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
The data warehouse must reconcile each of these into a single representation
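A minimal sketch of the codification problem above: each source application encodes gender and pipeline length differently, and the transformation step maps them to one warehouse-wide convention. The mapping tables and the choice of metres as the warehouse unit are illustrative assumptions:

```python
# Per-application gender codes, normalised to M/F in the warehouse
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}

# Conversion factors to the assumed warehouse unit (metres): cm, in, ft, yd
LENGTH_TO_M = {"A": 0.01, "B": 0.0254, "C": 0.3048, "D": 0.9144}

def transform(appl, gender_code, pipeline_length):
    return {
        "gender": GENDER_MAP[appl][gender_code],
        "pipeline_m": round(pipeline_length * LENGTH_TO_M[appl], 4),
    }

transform("B", "1", 100)   # -> {"gender": "M", "pipeline_m": 2.54}
```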
Data Integrity Problems - Scenarios
Same person, different spellings
• Agarwal, Agrawal, Aggarwal, etc.
Multiple ways to denote a company name
• Persistent Systems, PSPL, Persistent Pvt. Ltd.
Use of different names for the same place
• mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at the point of sale
ETL Glossary
Extracting
Conditioning
Householding
Enrichment
Scoring
ETL Glossary
Extracting
Capture of data from the operational source in “as is” status
Sources for data are generally legacy mainframes in VSAM, IMS, IDMS and DB2; more data today is in relational databases on Unix
Conditioning
The conversion of data types from the source format into the target format
ETL Glossary
Householding
Identifying all members of a household (living at the same address)
Ensures only one mailing is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs; a 2% saving is worth Rs. 1 lakh
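Householding can be sketched as grouping customer records on a normalised address key, so one catalogue goes to each group. The normalisation here is a crude lowercase/whitespace pass on invented sample data; real tools do far more (abbreviation expansion, postal standards):

```python
def normalise(address):
    # Collapse case and runs of whitespace so near-identical addresses match
    return " ".join(address.lower().split())

def households(customers):
    groups = {}
    for name, address in customers:
        groups.setdefault(normalise(address), []).append(name)
    return groups

mailing_list = [
    ("A. Kumar", "12 MG Road, Pune"),
    ("S. Kumar", "12 mg road,  pune"),   # same household, messy case/spacing
    ("R. Mehta", "7 FC Road, Pune"),
]
hh = households(mailing_list)
len(hh)   # 2 households, so 2 catalogues instead of 3
```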
Enrichment
Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet and A. C. Nielsen.
The ETL Process
Access data dictionaries defining source
files
Build logical and physical data models for
target data
Identify sources of data from existing
systems
Specify business and technical rules for
data extraction, conversion and
transformation
Perform data extraction, transformation and loading
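The steps above end in one pipeline: extract rows from a source, apply the specified conversion rules, and load into a target. The sketch below uses invented source data and rules to show the shape of that last step; it is not a real ETL framework:

```python
def extract(source):
    return list(source)                       # "as is" capture

def transform(rows, rules):
    # Apply each named rule (a function of the source row) per output column
    return [{col: rule(row) for col, rule in rules.items()} for row in rows]

def load(target, rows):
    target.extend(rows)
    return target

source = [{"bal": "1,200", "cust": " alice "}]
rules = {
    "balance": lambda r: int(r["bal"].replace(",", "")),  # format change
    "customer": lambda r: r["cust"].strip().title(),      # standardisation
}
warehouse = load([], transform(extract(source), rules))
```

Keeping the business rules as data (the `rules` dict) rather than hard-coded logic mirrors the "specify rules, then perform" split in the process above.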
The ETL Process – Push vs. Pull
Pull: a pull strategy is initiated by the target system. As part of the extraction process, source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.
• Advantage: no additional space is required on the source to store the data that needs to be loaded into the staging database
• Disadvantage: places a burden on the transactional systems when we want to load data into the staging database
Push: a push strategy is initiated by the source system, which writes its data out to the staging area on its own schedule.
The ETL Process – A simplified picture
OLTP Systems → Extract → Staging Area → Transform → Load → Data Warehouse

Record-level transformations:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level transformations:
Single-field – from one field to one field
Multi-field – from many fields to one, or one field to many
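The transformation types above can be shown on plain rows: record-level selection and aggregation, and a field-level multi-field merge (many fields to one). All data here is illustrative:

```python
sales = [
    {"region": "west", "amount": 10, "fname": "A", "lname": "Rao"},
    {"region": "west", "amount": 15, "fname": "B", "lname": "Sen"},
    {"region": "east", "amount": 20, "fname": "C", "lname": "Das"},
]

# Record-level Selection: partition the data on a predicate
west = [r for r in sales if r["region"] == "west"]

# Record-level Aggregation: summarise many records into one value
west_total = sum(r["amount"] for r in west)

# Field-level Multi-field: combine many fields into one
names = [f'{r["fname"]} {r["lname"]}' for r in sales]
```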
The ETL Process – Step 4
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
The ETL Process - Data
Transformation
Transforms the data in accordance with
the business rules and standards that
have been established
Examples include: format changes, de-duplication, splitting up fields,
replacement of codes, derived values,
and aggregates
Scrubbing/Cleansing Data
Sophisticated transformation tools used for
improving the quality of data
Clean data is vital for the success of the
warehouse
Example
• Seshadri, Sheshadri, Sesadri, Seshadri S.,
Srinivasan Seshadri, etc. are the same
person
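One cleansing sketch for spelling variants like these: flag two name strings as probable duplicates when their similarity ratio crosses a threshold. `difflib` from the standard library is a crude stand-in for the sophisticated matching tools the slide mentions, and the 0.8 threshold is an assumption:

```python
from difflib import SequenceMatcher

def probably_same(a, b, threshold=0.8):
    # Ratio is 2*matches/(len(a)+len(b)), computed case-insensitively
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

probably_same("Seshadri", "Sheshadri")   # True: one-letter variant
probably_same("Seshadri", "Mehta")       # False: unrelated names
```

In practice the threshold trades false merges against missed duplicates, which is why cleansing usually combines string similarity with other keys (address, account number).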
Reasons for “Dirty” data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems
The ETL Process - Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
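One hypothetical pass through the steps above for a "name, city" field: parse it into components, correct a known variant, standardise, then match and consolidate duplicates on the cleaned key. The lookup table and input format are invented for illustration:

```python
CORRECTIONS = {"bombay": "mumbai"}          # correcting known variants

def parse(raw):
    # Parsing: split the free-form field into components
    name, city = [part.strip() for part in raw.split(",")]
    return {"name": name, "city": city.lower()}

def standardise(rec):
    rec["city"] = CORRECTIONS.get(rec["city"], rec["city"])
    return rec

def consolidate(records):
    seen = {}
    for rec in map(standardise, map(parse, records)):
        seen[(rec["name"], rec["city"])] = rec   # matching on the cleaned key
    return list(seen.values())

rows = ["Agarwal, Bombay", "Agarwal, Mumbai"]
consolidate(rows)   # one consolidated record
```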
Issues:
when to refresh?
on every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
periodically (e.g., every 24 hours,
every week) or after “significant”
events
refresh policy set by administrator
Data Refresh
Data refreshing can follow two approaches :
Complete Data Refresh
• Completely refresh the target table every
time
Data Trickle Load
• Replicate only net changes and update
the target database
Data Refresh Techniques
Snapshot Approach - Full extract from base
tables
read entire source table or database:
expensive
may be the only choice for legacy
databases or files.
Incremental techniques (related to work on
active DBs)
detect & propagate changes on base
tables: replication servers (e.g., Sybase,
Oracle, IBM Data Propagator)
snapshots & triggers (Oracle)
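The incremental technique above can be sketched with a high-water mark: instead of a full snapshot, pull only rows whose last-modified timestamp is newer than the previous run's mark. The column name and integer timestamps are assumptions for illustration:

```python
def incremental_extract(source_rows, high_water_mark):
    # Trickle load: select only rows changed since the last run
    changed = [r for r in source_rows if r["last_modified"] > high_water_mark]
    # Advance the mark so the next run skips what we just propagated
    new_mark = max((r["last_modified"] for r in changed), default=high_water_mark)
    return changed, new_mark

source = [
    {"id": 1, "last_modified": 10},
    {"id": 2, "last_modified": 25},   # changed since the last run
]
delta, mark = incremental_extract(source, high_water_mark=20)
```

Trigger- and replication-based detection achieve the same effect without a timestamp column, at the cost of touching the source schema or log.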
ETL Solution Options
ETL solutions fall into two broad categories: custom solutions and generic solutions
Custom Solution
Using RDBMS staging tables and stored
procedures
Programming languages such as C, C++, Perl, Visual Basic, etc.
Building a code generator
Custom Solution – Typical components
Extract from source → Data quality checks → Generate download files
• Control program, typically written in COBOL or C
First-Generation ETL Tools –
Examples
SAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from
Evolutionary Technologies
Copy Manager from Information Builders
Types of ETL Tools - Second-
Generation
Extraction/Transformation/Load runs on
server
Data directly extracted from source and
processed on server
Data transformation in memory and written
directly to warehouse database. High
throughput since intermediate files are not
used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation and loading
Second-Generation ETL Tools –
Strengths and Limitations
Strengths:
Lower-cost suites
Multi-threaded
Operations management / process automation
Limitations:
Not mature
Initial tools oriented only to one environment
Data transformation and repair complexity