DW Training

You might also like

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 31

ETL- Extraction,Transformation & Loading

What is ETL?

• ETL(Extraction, Transformation and Loading) is a process by which


data is integrated and transformed from the operational systems into
the Data warehouse environment

Filters and
Extractors
Cleanser
Error
Error
Operational systems Cleaning View
Rules Check
• Rule 1 Correct
• Rule 2
• Rule 3
Transformation
Rules

• Rule 1
• Rule 2
• Rule 3

Transformation
Engine
Integrator

Error
View
Check
Correct Loader Warehouse

1-2
Operational Data - Challenges

• Data from heterogeneous sources


• Format differences
• Data Variations

• Context

– Across locations the same code could


represent different customers
– Across periods of time a product code could
have been reused

1-3
Extraction

Data
from
80 tables tables 30

Filter
Oracle
Data from 10
tables Where
Date<10/12/99
50 tables

f iles
Sybase ta f rom
Da

Target

1-4 Text files


Transformation

Source
Emp id Last First
Name Name
10001 Jones Indiana

10002 Holmes Sherlock


Staging Area

Name =
Concat(First Name,
Last Name)
Indiana Jones
Sherlock Homes

1-5
Loading

Data
Source Warehouse
Direct Load

Staging
Area g rated
i nt e
rm ed &
ra n sfo load
l ean,T data
C
Cleaning,
Transformation
& Integration of
Raw data

1-6
Volume of ETL in a Data warehouse

Source OLTP
Systems Data Marts

Metadata
e re
is h
Enterprise
or k
Data Warehouse
e w
o f th
0 %
o 8
0 t
6
•Extract •Load
•Design •Extract •Load •Replication •Access & Analysis
•Design •Scrub •Index •Replication •Access & Analysis
•Mapping •Scrub •Index •Data Set Distribution •Resource Scheduling & Distributio
•Mapping •Transform •Aggregation •Data Set Distribution •Resource Scheduling & Distributio
•Transform •Aggregation

Meta
MetaData
Data

System
SystemMonitoring
Monitoring
1-7
Factors Influencing ETL Architecture

• Volume at each warehouse component.

• The time window available for extraction.

• The extraction type (Full,Periodic etc.)

• Complexity of the processes at each stage.

1-8
Extraction Types
Extraction Types

Extraction

Periodic/
Full Extract Incremental
Extract

1-10
Full Extract

Existing data

Data Mart

Full Extract

Source System

1-11
Full Extract

New data

Data Mart

Full Extract

Source System

1-12
Incremental Extract

Existing data
Incremental
Data
Data Mart

Incremental Extract
Source System

1-14
Incremental Extract

Existing data
New data
Incremental
Data
Data Mart

Incremental Extract
Source System

Changed data

1-15
Incremental Extract

New data Incremental addition


Incremental to data mart
Data
Data Mart

Source System
Incremental Extract

Changed data
Existing data updated
using changed data

1-16
Transformation
Data Transformation

• Conversions

– Data type (e.g. Char to Date)

– Bring data to common units (Currency,Measuring Units)

• Classifications

– Changing continuous values to discrete ranges (e.g.


Temperatures to Temperature Ranges)

• Splitting of fields

• Merging of fields

• Aggregations (e.g. Sum, Avg., Count)

• Derivations (Percentages, Ratios, Indicators)

1-18
Structural Transformations

• Additive
OLTP

Orders arrive
Data ware
every Aggregate house
two minutes

OLTP
Average

Daily
Productivity Average Data ware
figures house

1-19
Format transformation

Source Target
Schema Schema
Data Type Transformation
Conversions “32” 32

Age as a String Age as an


Integer

Source Target
Splitting Schema Schema
Transformation
“15-10-1992” 15 10 1999

Day Month Year

Date as a Date as a combination of 3


String integer fields

1-20
Simple Conversions

Source Target
Schema Schema

Multiply by 1/43
Rs. 10000 $232.56

Revenue in Revenue in
Rupees Dollars

Multiply by 0.4536
1000 lbs. 453.56 kgs.

Production in Production in
Pounds Kilograms
Source Target
Schema Schema

• Transformations using Simple Conversions

1-21
Classification

Name Age
John Black 27
Richard Wayne 53
Age GroupFrequency
Jennifer Goldman 45 20-25 1
Helmut Koch 37 26-30 4
Anna Ludwig 32
Shito Maketha 28 31-35 3
Tracy Withman 39 36-40 2
Ada Zhesky 25 Grouping
41-45 2
David Rosenberg 33
Pankaj Sharma 29 46-50 1
Zhu Ling 44 51-55 1
George Kurtz 27
Rita Hartman 34
56-60 0

1-22
Data Consistency Transformations

Source 1 Source 2 Source 1


Gender Gender Gender
Male – M Male – Male Male – 1
Female – F Female – Female Female – 2

Target
Gender
Male – M
Female – F

1-23
Reconciliation of Duplicated data

Joseph J.R.Smith Joe Smith


Smith 123 Maine 123 Maine
123 Maine St. St.
St. MA - MA -
MA - 70127 70127
70127

Joseph R Smith
123 Maine St.
MA - 70127

1-24
Data Aggregation - Design Requirements

• Aggregates must be stored in their own fact tables and each level
should have its own fact table

• Dimension tables attached to the aggregate fact tables should where


ever possible be shrunken versions of the dimension tables attached to
the base fact table

• The base fact table and all of its related aggregate fact tables must be
associated together as a family of schemas

1-25
Loading
Types of Data warehouse Loading

• Target update types


– Insert
– Update

1-27
Types of Data Warehouse Updates

Data Warehouse

Source data Data Staging

 Point in Time Snapshots  Insert


 New Data  Full Replace
 Changed Data  Selective Replace
 Update
 Update plus Retain
History

1-28
New Data and Point-In-Time Data Insert

Source data

New data

OR

Point-in-Time
Snapshot New Data Added to
(e.g.. Monthly) Existing Data

1-29
Changed Data Insert

Source data
Changed Data Added to
Existing Data

Changed
data

1-30
Change of Dimension values

When the value of dimension in a data warehouse changes,


then

 History of change needs to be maintained.


 Changed data alone needs to be identified
 Changed data should be easier to access.
 Reconstruction of the dimension table any point in time should be
easier

1-31
ETL - Approach in a nutshell

1) Identify the Operational systems based on data islands in the


target
2) Map source-target dependencies.
3) Define cleaning and transformation rules
4) Validate source-target mapping
5) Consolidate Meta data for ETL
6) Draw the ETL architecture
7) Build the cleaning, transformation and auditing routines using
either a tool or customized programs

1-32

You might also like