Lecture 3: ETL
02/17/24 1
Data Extraction, Transformation, and Loading
Page 257
Extract, Transform, and Load
Extract, transform, and load (ETL) is a process
in database usage, and especially in data
warehousing, that involves extracting data from
source systems, transforming it to fit the target
schema, and loading it into the data warehouse.
Major steps in the ETL process
Data extraction
Determine all the target data needed in the data warehouse
Major steps in the ETL process
Data Transformation
Determine data transformation and cleansing rules.
Challenges of ETL Functions
Source systems are very diverse and disparate.
Categories of data in operational systems
Current Value:
Under current value, the stored value of an attribute
represents the value of that attribute at a given moment
in time, i.e. the value remains constant only until a
business transaction changes it.
Periodic Status:
In this category, the value of the attribute is stored as its
status every time a change occurs.
The history of the changes is preserved in the source systems
themselves. Whether it is status data or data about an event,
the source systems contain data at each point in time when
any change occurred.
Data extraction techniques
1. Capture of static data or “AS IS”
This is the capture of data at a given point in time.
Both current and transient data are captured.
Used primarily for the initial load of the data warehouse. Sometimes,
may require a full refresh of a dimension table.
Advantages
Good flexibility for capture specifications.
Performance of source systems not affected.
No revisions to existing applications.
Can be used on legacy systems.
Can be used on file-oriented systems.
Vendor products are used. No internal costs.
Data extraction techniques cont’d
Other techniques are categorized into two:
A) Immediate Data Extraction. In this option, the data
extraction is real-time. Extraction takes place while
transactions occur in the source operational systems.
Three options for immediate data extraction include:
Capture of transaction logs, Capture through database
triggers and Capture in source applications. (Elaborated
in next slides)
B) Deferred Data Extraction. Does not capture the changes
in real time; the data capture happens later. Two
options for deferred data extraction include: capture
based on date and time, and capture by comparing files.
(Elaborated in next slides)
Data extraction techniques cont’d
The options under Immediate Data Extraction
2. Capture through transaction logs
As each transaction adds, updates, or deletes a row from a database
table, the DBMS immediately writes entries on the log file. This data
extraction technique reads the transaction log and selects all the
committed transactions.
All the transactions have to be extracted before the log file gets
refreshed.
Advantages
Performance of source systems not affected because logging is
already part of the transaction processing.
No revisions to existing applications.
Can be used on most legacy systems.
Vendor products are used. No internal costs.
Disadvantages
Not much flexibility for capture specifications.
Cannot be used on file-oriented systems.
Data extraction techniques cont’d
3. Capture through database triggers
Triggers are special stored procedures (programs) that are stored on
the database and fired when certain predefined events occur.
Trigger programs are created for all events for which data is to be
captured. The output of the trigger programs is written to a
separate file that will be used to extract data for the data
warehouse.
Advantages
Data capture occurs in real time, right at the source.
No revisions to existing applications.
Disadvantages
Building and maintaining the trigger programs adds development
cost, and trigger execution adds processing overhead to the
source systems.
Not much flexibility for capture specifications.
Cannot be used on file-oriented systems. Only applicable to database
applications.
Data extraction techniques cont’d
4. Capture in source applications
This is also referred to as application-assisted data capture. Involves
revision of the programs to write all adds, updates, and deletes to the
source files and database tables. Then other extract programs can use
the separate file containing the changes to the source data.
Advantages
Good flexibility for capture specifications.
Can be used on most legacy systems.
Can be used on file-oriented systems.
Disadvantages
Performance of source systems affected a bit.
High internal costs because of in-house work.
Major revisions to existing applications.
Data extraction techniques cont’d
The options under Deferred Data Extraction
5. Capture based on date and time stamp (one of the two
options under deferred extraction)
Every time a source record is created or updated, it may be marked with a
stamp showing the date and time. The time stamp provides the basis for
selecting records for data extraction.
Advantages
Good flexibility for capture specifications.
Performance of source systems not affected.
Can be used on file-oriented systems.
Vendor products may be used.
Disadvantages
Major revisions to existing applications likely.
Cannot be used on most legacy systems.
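The timestamp-based selection described above can be sketched in Python. The record layout and the `last_updated` column are assumptions for illustration; a real source system would expose this through its own schema.

```python
from datetime import datetime

def extract_changed_records(records, last_extract_time):
    """Select only the records created or updated since the last
    extraction run, using the date/time stamp on each record."""
    return [r for r in records if r["last_updated"] > last_extract_time]

# Hypothetical source rows, each stamped by the source application.
source_rows = [
    {"id": 1, "last_updated": datetime(2024, 2, 10, 9, 0)},
    {"id": 2, "last_updated": datetime(2024, 2, 16, 17, 30)},
]
changed = extract_changed_records(source_rows, datetime(2024, 2, 15))
```

Only record 2 is selected here, because only its stamp falls after the previous extraction time.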
Data extraction techniques cont’d
6. Capture by comparing files (second option under deferred
extraction)
This technique is also called the snapshot differential technique.
It compares two snapshots of the source data, then captures any
changes between the two copies.
Advantages
Good flexibility for capture specifications.
Performance of source systems not affected.
No revisions to existing applications.
May be used on legacy systems.
May be used on file-oriented systems.
Vendor products are used. No internal costs.
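The snapshot differential technique can be illustrated with a minimal sketch: two snapshots keyed by primary key are compared, and every difference is classified as an insert, update, or delete. The dict-of-names representation is an assumption for brevity; real snapshots are full record files.

```python
def snapshot_diff(previous, current):
    """Compare two snapshots, keyed by primary key, and classify
    every difference as an insert, an update, or a delete."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Yesterday's and today's copies of a hypothetical customer file.
before = {1: "Alice", 2: "Bob", 3: "Carol"}
after = {1: "Alice", 2: "Robert", 4: "Dave"}
ins, upd, dele = snapshot_diff(before, after)
```

Comparing full snapshots is simple but costs a pass over both copies, which is why this option is deferred rather than immediate.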
Major types of data transformation
Format Revisions.
These include changes to the data types and lengths of
individual fields. There is a need to standardize, for example
by changing a coded field's data type to text so that its values
are meaningful to the users.
Decoding of Fields.
In multiple source systems, the same data item may be described
by different field values, e.g. coding gender as 1 and 2 in one
system and as M and F in another. Also, where cryptic codes are
used, such as IST3109 to represent Business Intelligence and
Data Warehousing, decode all such codes and change them into
values that make sense to the users.
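Field decoding amounts to per-source lookup tables. A minimal sketch, using the gender and course-code examples from the slide; the table names and source-system labels are hypothetical:

```python
# Hypothetical per-system code tables; a real warehouse would load
# these from reference data rather than hard-code them.
GENDER_DECODE = {
    "system_a": {"1": "Male", "2": "Female"},
    "system_b": {"M": "Male", "F": "Female"},
}
COURSE_DECODE = {"IST3109": "Business Intelligence and Data Warehousing"}

def decode_gender(source_system, raw_value):
    """Map a source-specific code to the single standard value."""
    return GENDER_DECODE[source_system][raw_value]
```

Both `decode_gender("system_a", "1")` and `decode_gender("system_b", "M")` yield the same standard value, which is the point of the transformation.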
Major types of data transformation
Calculated and Derived Values.
Calculations need to be performed on data extracted from the
source systems before it can be stored in the data warehouse.
Examples of derived fields are average daily balances and
operating ratios.
Major types of data transformation
Merging of Information.
Does not exactly mean the merging of several fields to create a single
field of data. It means the combination of different fields from
different source systems into a single entity.
Character Set Conversion. The conversion of character sets to an
agreed standard character set for textual data in the data warehouse.
E.g. the source data in EBCDIC format (common with IBM) to the
ASCII format (American Standard Code for Information Interchange ).
Conversion of Units of Measurements.
For companies with global branches, there is need to convert the
metrics so that the numbers may all be in one standard unit of
measurement. E.g. dollars for currency
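The EBCDIC-to-ASCII conversion can be sketched with Python's built-in codecs. `cp500` is one EBCDIC code page (IBM international); the actual code page of a given source system may differ (e.g. `cp037`), so treat the codec name as an assumption.

```python
# Simulate extracting a field from an IBM source: the same text has
# different byte values in EBCDIC and in ASCII.
ebcdic_bytes = "CUSTOMER".encode("cp500")  # EBCDIC representation
text = ebcdic_bytes.decode("cp500")        # decode to a Unicode string
ascii_bytes = text.encode("ascii")         # re-encode as ASCII for the warehouse
```

The intermediate Unicode string is what makes the conversion safe: decode with the source code page, encode with the target one.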
Major types of data transformation
Date/Time Conversion.
This type relates to representation of date and time in
standard formats. For example, 04/29/2010 in the U.S.
format and as 29/04/2010 in the British format can be
standardized to be written as 29 APR 2010.
Summarization.
Involves creating summaries to be loaded into the data
warehouse instead of loading data at the most granular
level, e.g. summarize the daily transactions for each credit
card and store the summary data instead of the individual
transactions.
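Both transformations above can be sketched briefly: standardizing the two regional date formats to one form, and rolling daily credit-card transactions up to one row per card per day. The transaction layout is a hypothetical example.

```python
from collections import defaultdict
from datetime import datetime

# Date/time conversion: U.S. and British formats map to one standard.
us = datetime.strptime("04/29/2010", "%m/%d/%Y")
uk = datetime.strptime("29/04/2010", "%d/%m/%Y")
standard = us.strftime("%d %b %Y").upper()  # both become "29 APR 2010"

def summarize_daily(transactions):
    """Roll individual transactions up to one total per (card, day)."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["card"], t["day"])] += t["amount"]
    return dict(totals)

txns = [
    {"card": "4111", "day": "29 APR 2010", "amount": 25.0},
    {"card": "4111", "day": "29 APR 2010", "amount": 10.0},
    {"card": "4222", "day": "29 APR 2010", "amount": 5.0},
]
summary = summarize_daily(txns)
```

Three granular transactions become two summary rows, which is what gets loaded instead of the individual transactions.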
Major types of data transformation
Key Restructuring.
The primary keys of the extracted records form a basis for
coming up with keys for the fact and dimension tables. When
choosing keys for the data warehouse database tables, avoid
keys with built-in meanings. Transform such keys into generic
keys generated by the system itself. This is called key
restructuring.
Deduplication.
One entity may have several records, i.e. duplicates. In the data
warehouse, there is a need to keep a single record for each entity
and link all the duplicates in the source systems to this single
record.
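Key restructuring can be sketched as a generator that replaces production keys carrying built-in meaning with generic system-generated keys, remembering each mapping so the same entity always receives the same surrogate key. The account-key format is a hypothetical example.

```python
import itertools

class SurrogateKeyGenerator:
    """Replace production keys that carry built-in meaning with
    generic system-generated keys, remembering each mapping so
    the same entity always gets the same surrogate key."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._mapping = {}

    def key_for(self, natural_key):
        if natural_key not in self._mapping:
            self._mapping[natural_key] = next(self._counter)
        return self._mapping[natural_key]

gen = SurrogateKeyGenerator()
k1 = gen.key_for("ACCT-NY-001")  # production key with embedded region code
k2 = gen.key_for("ACCT-TX-002")
k3 = gen.key_for("ACCT-NY-001")  # same entity -> same surrogate key
```

Because the surrogate carries no meaning, a later change to the region code no longer forces a key change in the warehouse.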
Problems faced in data integration and
consolidation
1. Entity Identification Problem
This problem arises when one entity has a different unique
identification number in each of the source systems.
However, in the data warehouse you need to keep a single record for each
entity which requires one to get all the activities of this entity from the
various source systems and then match up with the single record to be
loaded to the data warehouse.
This is a common but very difficult problem in many enterprises where
applications have evolved over time from the distant past.
Solution:
In the first phase, all records, irrespective of whether they are duplicates
or not, are assigned unique identifiers.
The second phase consists of reconciling the duplicates periodically
through automatic algorithms and manual verification.
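The two-phase solution can be sketched as follows. Exact matching on name and date of birth is a deliberate simplification; real systems combine fuzzy matching algorithms with manual verification, and the field names here are assumptions.

```python
import itertools

def assign_ids(records):
    """Phase 1: every incoming record receives its own unique
    identifier, whether or not it is a duplicate."""
    counter = itertools.count(1)
    return [dict(r, uid=next(counter)) for r in records]

def reconcile(records, match_fields=("name", "dob")):
    """Phase 2: records that agree on the match fields are linked
    to a single master identifier (the first uid seen)."""
    masters = {}
    for r in records:
        key = tuple(r[f] for f in match_fields)
        masters.setdefault(key, r["uid"])
        r["master_uid"] = masters[key]
    return records

rows = assign_ids([
    {"name": "J. Doe", "dob": "1990-01-01"},
    {"name": "A. Smith", "dob": "1985-06-30"},
    {"name": "J. Doe", "dob": "1990-01-01"},  # duplicate of the first
])
linked = reconcile(rows)
```

Every record keeps its own uid, but the duplicate pair shares one master uid, which is the single record the warehouse keeps.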
Problems faced in data integration and
consolidation
2. Multiple Sources Problem
This problem results from a single data element having more than one
source, with different values for the same element in the different
source systems.
Solution:
A straightforward solution is to assign a higher priority to one of the
sources and pick up the value from that source.
You may have to select from either of the files based on the last
update date. Or, in some other instances, your determination of the
appropriate source depends on other related fields.
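The priority-plus-date resolution described above can be sketched as a ranking function. The candidate layout and source names are hypothetical:

```python
from datetime import date

def resolve_value(candidates, source_priority):
    """Pick one value when the same data element arrives from
    several sources: the highest-priority source wins, and ties
    (or unknown sources) are broken by the most recent update."""
    def rank(c):
        try:
            pri = source_priority.index(c["source"])
        except ValueError:
            pri = len(source_priority)           # unknown sources rank last
        return (pri, -c["updated"].toordinal())  # newer dates rank higher
    return min(candidates, key=rank)["value"]

candidates = [
    {"source": "billing", "value": "60 days", "updated": date(2024, 1, 5)},
    {"source": "crm", "value": "30 days", "updated": date(2024, 2, 1)},
]
preferred = resolve_value(candidates, source_priority=["billing", "crm"])
```

Reordering the priority list changes the outcome, which mirrors the slide's point that the choice of authoritative source is a design decision.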
Methods of data transformation
1. Use of transformation tools
The goal is to eliminate manual methods by using automated
transformation tools.
Advantage
improves efficiency and accuracy. However, one has to specify the
parameters, the data definitions, and the rules to the
transformation tool. If this input into the tool is accurate, then the
rest of the work is performed efficiently by the tool.
Methods of data transformation
2. Using manual techniques
Adequate for smaller data warehouses.
Manually coded programs and scripts perform every data
transformation. Mostly, these programs are executed in
the data staging area. The analysts and programmers who
already possess the knowledge and the expertise are able
to produce the programs and scripts.
Disadvantage
Although the initial cost may be reasonable, ongoing
maintenance may escalate the cost.
Prone to errors.
May require several individual programs
Every time changes occur to transformation rules, the
metadata has to be maintained. This puts an additional
burden on the maintenance of the manually coded
transformation programs
Data Loading Processes
1. Initial Load
Populating all the data warehouse tables for the very first time.
Afterwards, you maintain the data warehouse and keep it up to date
using two methods: update and refresh (explained later).
2. Incremental Load
Applying ongoing changes as necessary in a periodic manner.
History data may either remain as it is alongside the new data or be
overwritten by the incremental data, i.e. it involves destructive
merge and constructive merge.
3. Full Refresh
Completely erasing the contents of one or more tables and reloading
them with fresh data (the initial load is a refresh of all the tables).
Note: There are two kinds of refresh, i.e. partial refresh and full
refresh. A partial refresh rewrites only specific tables. Partial
refreshes are rare because every dimension table is intricately tied
to the fact table.
Update vs Refresh
Update:
—application of incremental changes in the data sources
Refresh:
—complete reload at specified intervals
Four Modes of applying data to a data
warehouse
1. Load.
If the target table to be loaded already exists and data exists in the
table, the load process wipes out the existing data and applies the data
from the incoming file.
If the table is already empty before loading, the load process simply
applies the data from the incoming file.
2. Append.
If data already exists in the table, the append process unconditionally
adds the incoming data while preserving the existing data in the table.
3. Destructive Merge.
If the primary key of an incoming record matches the key of an existing
record, update the matching existing record; otherwise, add the
incoming record.
4. Constructive Merge.
If the primary key of an incoming record matches with the key of an
existing record, leave the existing record, add the incoming record,
and mark the added record as superseding the old record.
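Constructive merge can be sketched as follows. The row layout with a `current` flag is one common way to mark superseded records; it is an assumption for illustration, not the only representation.

```python
def constructive_merge(table, incoming):
    """Apply incoming records without losing history: matching old
    records are kept but flagged as no longer current, and the new
    records are added as the superseding, current versions."""
    incoming_keys = {r["key"] for r in incoming}
    result = []
    for row in table:
        row = dict(row)
        if row["key"] in incoming_keys:
            row["current"] = False  # old record retained, superseded
        result.append(row)
    for row in incoming:
        result.append(dict(row, current=True))  # new current version
    return result

history = [{"key": 1, "value": "Bronze", "current": True}]
merged = constructive_merge(history, [{"key": 1, "value": "Gold"}])
```

A destructive merge would instead overwrite the matching row in place, leaving a single record and no history.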
Three Broad functional categories of ETL tools
1. Data transformation engines.
Consist of dynamic and sophisticated data manipulation algorithms.
The tool suite:
captures data from a designated set of source systems at user-defined
intervals,
performs elaborate data transformations,
sends the results to a target environment, and
applies the data to target files.
2. Data capture through replication.
Most of these tools use the transaction recovery logs maintained by
the DBMS. The changes to the source systems captured in the
transaction logs are replicated in near real time to the data staging
area for further processing.
Some of the tools provide the ability to replicate data through the
use of database triggers which signal the replication agent to capture
and transport the changes.
Three Broad functional categories of ETL tools
3. Code generators.
These are tools that directly deal with the extraction,
transformation, and loading of data. The tools enable the process
by generating program code to perform these functions.
You provide the parameters of the data sources and the target
layouts along with the business rules. The tools generate most of
the program code in some of the common programming
languages.
When you want to add code to handle types of transformation
not covered by the tool, you may do so with your own program
code. The code automatically generated by the tool has exit
points at which you may add your own code to handle special
conditions.
ETL vs ELT
• Increase in the number of data sources
• Processing demands of massive data sets for
BI and big data analytics
• ELT provides an alternative to the
traditional data integration method.
• Data is loaded straight into a central
repository where all transformations occur.
• The staging database is absent.
While ETL focuses on retaining important data, through business
logic, elaborations, decisions, filters and aggregation, and
produces a data warehouse ready for easy consumption and business
reports, ELT retains all data in its natural state as it flows
from the sources, including both the data that is important today
and the data that might be used someday.
•ETL is applied when
working with OLAP data
warehouses, legacy
systems, and relational
databases. It doesn’t
provide data lake support.
When to use ETL or ELT
Choosing between ETL and ELT depends on multiple
considerations. For example:
What is the nature of my data?
What’s the business value case I want to accomplish?
Who are the people who need to query my data store?
What are their skills? What types of queries will they
need to perform?
Which technologies do I have in place or do I plan to
deploy?
ETL vs ELT use cases
Cloud data warehouses have opened new horizons for data
integration, but the choice between ETL and ELT relies on
the needs of a company in the first place.
It is better to use ETL when . . .
Working with sensitive data. E.g. healthcare organizations.
ETL is used to mask, encrypt or remove sensitive data before
loading it in the cloud.
It is better to use ELT when . . .
You deal with cloud projects or hybrid architectures.
With modern cloud warehouses, ETL needs a separate engine to
perform transformations before loading data into the cloud,
unlike ELT. So ELT is a better choice for cloud and hybrid
use cases.
You have a data science team that needs access to all raw
data to use in machine learning projects.
ELT seems to be the logical future for building
effective data flows as it offers a myriad of benefits
over ETL.
ELT is cost-effective, flexible, and lower maintenance.
It fits businesses of diverse fields and sizes.
On the other hand, ETL is an older and slower process
with many pitfalls on which organizations can stumble
on the way to data integration.
But as we can tell from the use cases, ETL cannot be
replaced completely.