Lecture 3 - ETL

ETL

Data Extraction, Transformation, and Loading

Page 257

Extract, Transform, and Load
Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves three phases:

 Extraction. The first part of an ETL process involves extracting the data from the source systems.
 Transformation. Transformation of the source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse.
 Loading. The load phase loads the data into the end target, usually the data warehouse (DW).
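The three phases can be pictured with a short, illustrative Python sketch; the source file sales.csv, its fields, and the SQLite warehouse table are assumptions made for this example, not part of the lecture.

import csv
import sqlite3

def extract(path):
    # Extraction: read the raw records from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: turn the extracted data into usable, consistent values.
    for row in rows:
        row["amount"] = float(row["amount"] or 0)            # resolve missing values
        row["customer"] = row["customer"].strip().title()    # standardize names
    return rows

def load(rows, conn):
    # Loading: apply the transformed data to the target warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                     [(r["customer"], r["amount"]) for r in rows])
    conn.commit()

warehouse = sqlite3.connect("warehouse.db")
load(transform(extract("sales.csv")), warehouse)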

Major steps in the ETL process
Data extraction
 Determine all the target data needed in the data warehouse.
 Determine all the data sources, both internal and external.
 Prepare data mapping for target data elements from sources (see the sketch below).
 Ensure compatibility of data structures in the case of external data sources.
 Establish comprehensive data extraction rules.
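A data mapping can be recorded as a simple source-to-target specification before any extraction code exists. The system names, fields, and rules below are hypothetical, shown only to illustrate the idea.

# Hypothetical source-to-target mapping for two target data elements.
TARGET_MAPPING = {
    "customer_name": {
        "source_system": "ORDERS_CRM",      # internal source (assumed name)
        "source_field": "cust_nm",
        "extraction_rule": "latest non-null value per customer key",
    },
    "credit_rating": {
        "source_system": "BUREAU_FEED",     # external source (assumed name)
        "source_field": "rating_cd",
        "extraction_rule": "daily file; decode rating_cd to descriptive text",
    },
}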

Major steps in the ETL process
Data Transformation
 Determine data transformation and cleansing rules or functions based on the data structures of the sources:
   Input selection
   Separation of input structures
   Normalization and denormalization of source structures
   Aggregation
   Conversion
   Resolution of missing values and conversion of names and addresses
 Plan for aggregate tables.
 Organize the data staging area and test the tools.


Major steps in the ETL process
Data Loading
 Write procedures for all data loads.

 ETL for dimension tables

 ETL for fact tables

Challenges of ETL Functions
 Source systems are very diverse and disparate.
 Multiple platforms and different operating systems.
 Older legacy applications running on obsolete database technologies.
 Historical information not preserved in source operational systems.
 Poor quality of data in many old source systems.


Challenges of ETL Functions
 Changes in source system structures due to new business conditions; the ETL functions must be modified accordingly.
 Gross lack of consistency among source systems.
 Lack of a means for resolving mismatches escalates the problem of inconsistency.
 Most source systems do not represent data in types or formats that are meaningful to the users.
Data extraction issues
 Source Identification: identify the source applications and source structures.
 Method of extraction: for each data source, define whether the extraction process is manual or tool-based.
 Extraction frequency: for each data source, establish how frequently the data extraction must be done, e.g. daily, weekly, or quarterly.
 Time window: for each data source, denote the time window for the extraction process.
Data extraction issues
 Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
 Exception handling: determine how to handle input records that cannot be extracted.

Categories of data in operational systems
Current Value:
 The stored value of an attribute represents the value of that attribute at that moment in time, i.e. the value of an attribute remains constant only until a business transaction changes it.
Periodic Status:
 In this category, the value of the attribute is preserved as the status every time a change occurs.
 The history of the changes is preserved in the source systems themselves. Whether it is status data or data about an event, the source systems contain data at each point in time when any change occurred.

Data extraction techniques
1. Capture of static data or “AS IS”
 This is the capture of data at a given point in time.
 Both current and transient data are captured.
 Used primarily for the initial load of the data warehouse; sometimes a full refresh of a dimension table may also require it.

Advantages
 Good flexibility for capture specifications.
 Performance of source systems not affected.
 No revisions to existing applications.
 Can be used on legacy systems.
 Can be used on file-oriented systems.
 Vendor products are used. No internal costs

Data extraction techniques cont’d
Other techniques are categorized into two:
A) Immediate Data Extraction. In this option, the data extraction is real-time: extraction takes place while transactions occur in the source operational systems. The three options for immediate data extraction are capture through transaction logs, capture through database triggers, and capture in source applications (elaborated in the next slides).
B) Deferred Data Extraction. The changes are not captured in real time; the data capture happens later. The two options for deferred data extraction are capture based on date and time stamp and capture by comparing files (elaborated in the next slides).

Data extraction techniques cont’d
The options under Immediate Data Extraction
2. Capture through transaction logs
 As each transaction adds, updates, or deletes a row from a database
table, the DBMS immediately writes entries on the log file. This data
extraction technique reads the transaction log and selects all the
committed transactions.
 All the transactions have to be extracted before the log file gets
refreshed.
Advantages
 Performance of source systems not affected because logging is
already part of the transaction processing.
 No revisions to existing applications.
 Can be used on most legacy systems.
 Vendor products are used. No internal costs.

Disadvantages
 Not much flexibility for capture specifications.
 Cannot be used on file-oriented systems.

Data extraction techniques cont’d
3. Capture through database triggers
 Triggers are special stored procedures (programs) that are stored in the database and fired when certain predefined events occur.

 Trigger programs are created for all events for which data is to be
captured. The output of the trigger programs is written to a
separate file that will be used to extract data for the data
warehouse.
Advantages
 Performance of source systems not affected because logging is already
part of the transaction processing.
 No revisions to existing applications.
 Can be used on most legacy systems.
 Vendor products are used. No internal costs.
Disadvantages
 Not much flexibility for capture specifications.
 Cannot be used on file-oriented systems; only applicable to database applications.
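As an illustration only (not part of the lecture), the sketch below uses Python's built-in sqlite3 module to create a trigger that copies every update on a source table into a separate change-capture table, which an extract program could read later. The table and column names are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL);
CREATE TABLE customer_changes (id INTEGER, name TEXT, balance REAL,
                               change_type TEXT, changed_at TEXT);

-- Trigger fires on every UPDATE and writes the new row to the capture table.
CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes
    VALUES (NEW.id, NEW.name, NEW.balance, 'U', datetime('now'));
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice', 100.0)")
conn.execute("UPDATE customer SET balance = 250.0 WHERE id = 1")
print(conn.execute("SELECT * FROM customer_changes").fetchall())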

Data extraction techniques cont’d
4. Capture in source applications
 This is also referred to as application-assisted data capture. It involves revising the source application programs so that, in addition to writing adds, updates, and deletes to the source files and database tables, they also write the changes to a separate file. Other extract programs can then use this separate file of changes to the source data.

Advantages
 Good flexibility for capture specifications.
 Can be used on most legacy systems.
 Can be used on file-oriented systems.
Disadvantages
 Performance of source systems affected a bit.
 High internal costs because of in-house work.
 Major revisions to existing applications.

Data extraction techniques cont’d
The options under Deferred Data Extraction
5. Capture based on date and time stamp (one of the two
options under deferred extraction)
 Every time a source record is created or updated, it may be marked with a
stamp showing the date and time. The time stamp provides the basis for
selecting records for data extraction.
Advantages
 Good flexibility for capture specifications.
 Performance of source systems not affected.
 Can be used on file-oriented systems.
 Vendor products may be used.
Disadvantages
 Major revisions to existing applications likely.
 Cannot be used on most legacy systems.
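A minimal sketch of time-stamp-based selection, assuming a hypothetical source table with an updated_at column and a remembered timestamp of the previous extraction run:

def extract_changes(conn, last_extract_time):
    # conn is assumed to be a sqlite3-style connection to the source system.
    # Select only the rows created or updated since the previous extraction run.
    return conn.execute(
        "SELECT id, name, balance FROM customer WHERE updated_at > ?",
        (last_extract_time,),
    ).fetchall()

# The high-water mark is remembered between runs, e.g.:
# changed_rows = extract_changes(conn, "2024-02-16 00:00:00")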

Data extraction techniques cont’d
6. Capture by comparing files (second option under deferred
extraction)
 This technique is also called the snapshot differential technique.
 It compares two snapshots of the source data and then captures any changes between the two copies.
Advantages
 Good flexibility for capture specifications.
 Performance of source systems not affected.
 No revisions to existing applications.
 May be used on legacy systems.
 May be used on file-oriented systems.
 Vendor products are used. No internal costs.
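A rough sketch of the snapshot differential idea, assuming two hypothetical snapshot files keyed by an id column:

import csv

def load_snapshot(path):
    # Read a snapshot file into a dict keyed by the record's primary key.
    with open(path, newline="") as f:
        return {row["id"]: row for row in csv.DictReader(f)}

def diff_snapshots(old, new):
    # Compare yesterday's and today's snapshots and classify the changes.
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes

# Usage (the file names are hypothetical):
# ins, upd, dels = diff_snapshots(load_snapshot("snap_yesterday.csv"),
#                                 load_snapshot("snap_today.csv"))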

Major types of data transformation
Format Revisions.
These include changes to the data types and lengths of individual fields. There is a need to standardize and change the data type to text to provide values meaningful to the users.

Decoding of Fields.
In multiple source systems, the same data items may be described by different field values, e.g. the coding for gender, using 1 and 2 for male and female in one system and M and F in another. Similarly, where cryptic codes are used, such as IST3109 to represent Business Intelligence and Data Warehousing, decode all such codes and change them into values that make sense to the users.
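A small sketch of field decoding, with hypothetical code tables, showing how differently coded gender values from two source systems are mapped to one standard value:

# Hypothetical code tables for two source systems.
GENDER_DECODE = {
    "SYSTEM_A": {"1": "Male", "2": "Female"},
    "SYSTEM_B": {"M": "Male", "F": "Female"},
}

def decode_gender(source_system, raw_value):
    # Translate the source-specific code into the standard warehouse value.
    return GENDER_DECODE[source_system].get(raw_value, "Unknown")

print(decode_gender("SYSTEM_A", "2"))   # Female
print(decode_gender("SYSTEM_B", "M"))   # Male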

Major types of data transformation
Calculated and Derived Values.
Calculations may need to be performed on the data extracted from the source systems before it can be stored in the data warehouse. Examples of derived fields are average daily balances and operating ratios.

Splitting of Single Fields.
Earlier legacy systems stored large text fields, e.g. the first name, middle initials, and last name of a customer stored as one large text value in a single field. The individual components need to be stored in separate fields in the data warehouse, both to improve operating performance by indexing on individual components and to allow analysis by individual components.
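An illustrative sketch (the field names and sample record are made up) that splits a single name field into its components and computes one derived value:

def transform_customer(raw):
    # Splitting of a single field: break the combined name into components
    # (assumes the format "first middle last" for this example).
    first, middle, last = raw["full_name"].split(" ", 2)
    # Calculated/derived value: average daily balance over the period.
    avg_daily_balance = sum(raw["daily_balances"]) / len(raw["daily_balances"])
    return {"first_name": first, "middle_initial": middle,
            "last_name": last, "avg_daily_balance": avg_daily_balance}

print(transform_customer({"full_name": "Jane Q Public",
                          "daily_balances": [100.0, 150.0, 200.0]}))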

Major types of data transformation
 Merging of Information.
Does not exactly mean the merging of several fields to create a single
field of data. It means the combination of different fields from
different source systems into a single entity.
 Character Set Conversion.
The conversion of character sets to an agreed standard character set for textual data in the data warehouse, e.g. converting source data in the EBCDIC format (common on IBM systems) to the ASCII format (American Standard Code for Information Interchange).
 Conversion of Units of Measurement.
For companies with global branches, there is a need to convert the metrics so that the numbers are all in one standard unit of measurement, e.g. dollars for currency.
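As a small illustration of character set conversion, Python's standard codecs can translate EBCDIC to ASCII; the sample value is made up, and cp500 is just one of several EBCDIC code pages:

# A hypothetical name field as it might arrive in EBCDIC (code page 500).
ebcdic_bytes = "ACME LTD".encode("cp500")

# Convert EBCDIC -> Unicode text -> ASCII for the warehouse standard.
text = ebcdic_bytes.decode("cp500")
ascii_bytes = text.encode("ascii")

print(ebcdic_bytes)   # raw EBCDIC bytes
print(ascii_bytes)    # b'ACME LTD'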

Major types of data transformation
Date/Time Conversion.
This type relates to representation of date and time in
standard formats. For example, 04/29/2010 in the U.S.
format and as 29/04/2010 in the British format can be
standardized to be written as 29 APR 2010.

Summarization.
Involves creating summaries to be loaded into the data warehouse instead of loading data at the most granular level, e.g. summarizing the daily transactions for each credit card and storing the summary data instead of the individual transactions.
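A brief sketch of both ideas, using made-up transaction records; datetime and itertools are from the Python standard library:

from datetime import datetime
from itertools import groupby

# Date/time conversion: U.S. and British formats standardized to '29 APR 2010'.
us = datetime.strptime("04/29/2010", "%m/%d/%Y")
uk = datetime.strptime("29/04/2010", "%d/%m/%Y")
print(us.strftime("%d %b %Y").upper(), uk.strftime("%d %b %Y").upper())

# Summarization: daily totals per credit card instead of individual transactions.
txns = [("card-1", 20.0), ("card-1", 35.5), ("card-2", 10.0)]
txns.sort(key=lambda t: t[0])
daily_totals = {card: sum(amount for _, amount in group)
                for card, group in groupby(txns, key=lambda t: t[0])}
print(daily_totals)   # {'card-1': 55.5, 'card-2': 10.0}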

Major types of data transformation
Key Restructuring.
The primary keys of the extracted records form a basis for
coming up with keys for the fact and dimension tables. When
choosing keys for the data warehouse database tables, avoid
keys with built-in meanings. Transform such keys into generic
keys generated by the system itself. This is called key
restructuring.

Deduplication.
One entity may have several records, i.e. duplicates. In the data warehouse there is a need to keep a single record for each entity and to link all the duplicates in the source systems to this single record.
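A toy sketch of both ideas: meaning-laden production keys (the made-up values below embed a branch code) are replaced with system-generated surrogate keys, and duplicate records of the same entity are linked to one warehouse record. The match key is a hypothetical normalized name plus date of birth.

from itertools import count

surrogate = count(1)        # generator of system-generated generic keys
entity_to_key = {}          # match_key -> surrogate key
source_to_key = {}          # original production key -> surrogate key

def warehouse_key(source_key, match_key):
    # Key restructuring: replace the meaning-laden production key with a
    # generic surrogate key. Deduplication: records sharing the same
    # match_key are linked to one surrogate key.
    if match_key not in entity_to_key:
        entity_to_key[match_key] = next(surrogate)
    source_to_key[source_key] = entity_to_key[match_key]
    return entity_to_key[match_key]

print(warehouse_key("NBO-00123", "jane public|1990-01-01"))   # 1
print(warehouse_key("MSA-99877", "jane public|1990-01-01"))   # 1 (duplicate linked)
print(warehouse_key("NBO-00456", "john doe|1985-07-22"))      # 2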
Problems faced in data integration and
consolidation
1. Entity Identification Problem
 This problem arises because one entity may have different unique identification numbers in the different source systems.
 However, in the data warehouse you need to keep a single record for each entity, which requires gathering all the activities of the entity from the various source systems and matching them up with the single record to be loaded into the data warehouse.
 This is a common but very difficult problem in many enterprises where
applications have evolved over time from the distant past.
Solution:
 In the first phase, all records, irrespective of whether they are duplicates
or not, are assigned unique identifiers.
 The second phase consists of reconciling the duplicates periodically
through automatic algorithms and manual verification.
Problems faced in data integration and
consolidation
2. Multiple Sources Problem
 This problem results from a single data element having more than one source, with different values for that element in the different source systems.

Solution:
 A straightforward solution is to assign a higher priority to one of the
sources and pick up the value from that source.
 You may have to select from either of the files based on the last
update date. Or, in some other instances, your determination of the
appropriate source depends on other related fields.
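A minimal sketch of the priority-based solution, where the source names and the priority order are assumptions made for the example:

# Higher-priority sources come first (assumed ordering for illustration).
SOURCE_PRIORITY = ["BILLING", "CRM", "LEGACY_MASTER"]

def pick_value(values_by_source):
    # Pick the value from the highest-priority source that actually has one.
    for source in SOURCE_PRIORITY:
        if values_by_source.get(source) is not None:
            return values_by_source[source]
    return None

print(pick_value({"CRM": "Nairobi", "LEGACY_MASTER": "NBO"}))   # 'Nairobi'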

Methods of data transformation
1. Use of transformation tools
The desired goal is to eliminate manual methods; it involves the use of automated tools.

Advantages
 Improved efficiency and accuracy. However, one has to specify the parameters, the data definitions, and the rules to the transformation tool. If this input into the tool is accurate, the rest of the work is performed efficiently by the tool.
 Recording of metadata by the tool. When you specify the transformation parameters and rules, these are stored as metadata by the tool. This metadata then becomes part of the overall metadata component of the data warehouse and may be shared by other components. When changes occur, the metadata for the transformations is automatically adjusted by the tool.

Methods of data transformation
2. Using manual techniques
Adequate for smaller data warehouses.
Manually coded programs and scripts perform every data
transformation. Mostly, these programs are executed in
the data staging area. The analysts and programmers who
already possess the knowledge and the expertise are able
to produce the programs and scripts.
Disadvantages
 Although the initial cost may be reasonable, ongoing maintenance may escalate the cost.
 Prone to errors.
 May require several individual programs.
 Every time changes occur to the transformation rules, the metadata has to be maintained by hand. This puts an additional burden on the maintenance of the manually coded transformation programs.

Data Loading Processes
1. Initial Load
 populating all the data warehouse tables for the very first time
 you may maintain the data warehouse and keep it up-to-date by using
two methods: Update and Refresh (explained later)
2. Incremental Load
 applying ongoing changes as necessary in a periodic manner
 History data could remain as it is alongside the new data, or be overwritten by the incremental data; i.e. it involves the destructive merge and constructive merge modes
3. Full Refresh
 completely erasing the contents of one or more tables and reloading them with fresh data (the initial load is a refresh of all the tables)

Note: There are two kinds of refresh, i.e. partial refresh and full refresh.
 A partial refresh is used to rewrite only specific tables. Partial refreshes are rare because every dimension table is intricately tied to the fact table.

Update vs Refresh
Update:
 application of incremental changes in the data sources
Refresh:
 complete reload at specified intervals

Refresh is a much simpler option than update:
 The update option involves devising the proper strategy to extract the changes from each data source, and then determining the best strategy to apply the changes to the data warehouse.
 The refresh option simply involves the periodic replacement of complete data warehouse tables, but refresh jobs can take a long time to run.

Four Modes of applying data to a data
warehouse
1. Load.
 If the target table to be loaded already exists and data exists in the
table, the load process wipes out the existing data and applies the data
from the incoming file.
 If the table is already empty before loading, the load process simply
applies the data from the incoming file.

2. Append. (This is an extension of the load.)
 If data already exists in the table, the append process unconditionally adds the incoming data, preserving the existing data in the target table.
 When an incoming record is a duplicate of an already existing record, you may either:
   allow it to be added as a duplicate, or
   reject the duplicate record during the append process.
Four Modes of applying data to a data
warehouse
3. Destructive Merge.
 In this mode, you apply the incoming data to the target data. If the
primary key of an incoming record matches with the key of an
existing record, update the matching target record.
 If the incoming record is a new record without a match with any
existing record, add the incoming record to the target table.

4. Constructive Merge.
 If the primary key of an incoming record matches with the key of an existing record, leave the existing record as it is, add the incoming record, and mark the added record as superseding the old record.
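To contrast the four modes side by side, here is a compact illustrative sketch over in-memory dictionaries keyed by primary key (an assumption for the example; a real implementation would work against warehouse tables). Each key maps to a list of row versions, and the last version is the current one.

def load(target, incoming):
    # Load: wipe out the existing contents and apply the incoming data.
    target.clear()
    for key, row in incoming.items():
        target[key] = [row]

def append(target, incoming):
    # Append: add incoming rows while preserving existing data; here a
    # duplicate key is rejected (it could instead be added as a duplicate).
    for key, row in incoming.items():
        target.setdefault(key, [row])

def destructive_merge(target, incoming):
    # Destructive merge: overwrite the matching row, or add the record if new.
    for key, row in incoming.items():
        target[key] = [row]

def constructive_merge(target, incoming):
    # Constructive merge: keep the existing row and add the incoming one,
    # marking the newest version (last in the list) as superseding the old.
    for key, row in incoming.items():
        target.setdefault(key, []).append(row)

warehouse = {}
load(warehouse, {1: "initial row"})
constructive_merge(warehouse, {1: "revised row", 2: "new row"})
print(warehouse)   # {1: ['initial row', 'revised row'], 2: ['new row']}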

Three Broad functional categories of ETL tools
1. Data transformation engines.
 Consist of dynamic and sophisticated data manipulation algorithms.
The tool suite:
 captures data from a designated set of source systems at user-defined
intervals,
 performs elaborate data transformations,
 sends the results to a target environment, and
 applies the data to target files.
2. Data capture through replication.
 Most of these tools use the transaction recovery logs maintained by
the DBMS. The changes to the source systems captured in the
transaction logs are replicated in near real time to the data staging
area for further processing.
 Some of the tools provide the ability to replicate data through the
use of database triggers which signal the replication agent to capture
and transport the changes.
Three Broad functional categories of ETL tools
3. Code generators.
 These are tools that directly deal with the extraction,
transformation, and loading of data. The tools enable the process
by generating program code to perform these functions.
 You provide the parameters of the data sources and the target
layouts along with the business rules. The tools generate most of
the program code in some of the common programming
languages.
 When you want to add more code to handle types of transformation not covered by the tool, you may do so with your own program code. The code automatically generated by the tool has exits at which you may add your code to handle special conditions.

ETL vs ELT
• Increase in number of data sources
• Processing demand for massive data sets for
BI and big data analytics
• ELT provides an alternative to the
traditional data integration method.
• Data is loaded straight into a central
repository where all transformations occur.
• The staging database is absent.

While ETL focuses on retaining important data,
through business logic, elaborations, decisions, filters
and aggregation, and produces a data warehouse
ready for easy consumption and business reports, ELT
retains all data in its natural state, as it flows from the
sources, including both the data that is important
today, and the data that might be used someday.

Unlike the data warehouse, which is a highly structured data model designed to answer specific questions through reporting, a data lake retains all types of structured, semi-structured, and unstructured data (for example web server logs, sensor data, social network activity, text and images).

• ETL is applied when working with OLAP data warehouses, legacy systems, and relational databases. It doesn't provide data lake support.

• ELT is a modern method that can be used with cloud-based warehouses and data lakes.

When to use ETL or ELT
Choosing between ETL and ELT depends on multiple
considerations. For example:
What is the nature of my data?
What’s the business value case I want to accomplish?
Who are the people who need to query my data store?
What are their skills? What types of queries will they
need to perform?
Which technologies do I have in place or do I plan to
deploy?

ETL vs ELT use cases
Cloud data warehouses have opened new horizons for data
integration, but the choice between ETL and ELT depends first of all on the needs of the company.
It is better to use ETL when . . .
 Working with sensitive data, e.g. in healthcare organizations. ETL is used to mask, encrypt, or remove sensitive data before loading it into the cloud.
 You have only structured data and/or small volumes of data.
 Your company runs a legacy system or deals with on-premises relational databases. For example, during EHR system modernization, data is extracted from the legacy EHR system but must be selectively transformed into a fitting format for the new system.
It is better to use ELT when . . .
Real-time decision-making is key. The speed of data
integration is a key advantage of ELT since the target system
can do the data transformation and loading in parallel. This,
in turn, allows nearly real-time analytics.
Your company operates on massive amounts of data, both structured and unstructured. For example, a transportation
company that uses telematics devices in its fleet may need
to process huge volumes of diverse data generated by
sensors, video recorders, GPS trackers, etc.
You are going to run a Big Data project. ELT was built to address the key challenges of Big Data: Volume, Variety, Velocity, and Veracity.

You deal with cloud projects or hybrid architectures. Unlike ELT, modern ETL with cloud warehouses still needs a separate engine to perform transformations before loading data into the cloud, so ELT is a better choice for cloud and hybrid use cases.

You have a data science team that needs access to all raw
data to use in machine learning projects.

Your project is likely to scale, and you want to take advantage of the high scalability of modern cloud warehouses and data lakes.

ELT seems to be the logical future for building
effective data flows as it offers a myriad of benefits
over ETL.
ELT is cost-effective, flexible, and requires less maintenance. It fits businesses of diverse fields and sizes.
On the other hand, ETL is an outdated and slow process with many hidden pitfalls on which organizations can stumble on the way to data integration.
But, as we can tell from the use cases, ETL cannot be replaced completely.
