Data Extraction and Loading
Introduction
During the ETL process, data is extracted from an OLTP database, transformed to
match the data warehouse schema, and loaded into the data warehouse database.
Many data warehouses also incorporate data from non-OLTP systems, such as text
files, legacy systems, and spreadsheets; such data also requires extraction,
transformation, and loading.
In its simplest form, ETL is the process of copying data from one database to another.
This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL
is often a complex combination of process and technology that consumes a significant
portion of the data warehouse development efforts and requires the skills of business
analysts, database designers, and application developers.
When defining ETL for a data warehouse, it is important to think of ETL as a process,
not a physical implementation. ETL systems vary from data warehouse to data
warehouse and even between department data marts within a data warehouse. A
monolithic application, regardless of whether it is implemented in Transact-SQL or a
traditional programming language, does not provide the flexibility for change necessary
in ETL systems. A mixture of tools and technologies should be used to develop
applications that each perform a specific ETL task.
The ETL process is not a one-time event; new data is added to a data warehouse
periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending
on the purpose of the data warehouse and the type of business it serves. Because ETL
is an integral, ongoing, and recurring part of a data warehouse, ETL processes must be
automated and operational procedures documented. ETL also changes and evolves as
the data warehouse evolves, so ETL processes must be designed for ease of
modification. A solid, well-designed, and documented ETL system is necessary for the
success of a data warehouse project.
Data warehouses evolve to improve their service to the business and to adapt to
changes in business processes and requirements. Business rules change as the
business reacts to market influences, and the data warehouse must respond in order to
maintain its value as a tool for decision makers. The ETL implementation must adapt as
the data warehouse evolves.
Microsoft SQL Server 2000 significantly enhances existing performance and
capabilities, and introduces new features that make ETL processes easier to develop,
deploy, and maintain, and faster to execute.
ETL Functional Elements
Regardless of how they are implemented, all ETL systems have a common purpose:
they move data from one database to another. Generally, ETL systems move data from
OLTP systems to a data warehouse, but they can also be used to move data from one
data warehouse to another. An ETL system consists of four distinct functional elements:
Extraction
Transformation
Loading
Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system.
During extraction, data may be removed from the source system or a copy made and
the original data retained in the source system. It is common to move historical data that
accumulates in an operational OLTP system to a data
warehouse to maintain OLTP performance and efficiency. Legacy systems may require
too much effort to implement such offload processes, so legacy data is often copied into
the data warehouse, leaving the original data in place. Extracted data is loaded into the
data warehouse staging area (a relational database usually separate from the data
warehouse database), for manipulation by the remaining ETL processes.
Data extraction is generally performed within the source system itself, especially if it is a
relational database to which extraction procedures can easily be added. It is also
possible for the extraction logic to exist in the data warehouse staging area and query
the source system for data using ODBC, OLE DB, or other APIs. For legacy systems,
the most common method of data extraction is for the legacy system to produce text
files, although many newer systems offer direct query APIs or accommodate access
through ODBC or OLE DB.
ETL systems are arguably the single most important source of meta data about both the
data in the data warehouse and data in the source system. Finally, the ETL process
itself generates useful meta data that should be retained and analyzed regularly. Meta
data is discussed in greater detail later in this chapter.
ETL Architectures
Before discussing the physical implementation of ETL systems, it is important to
understand the different ETL architectures and how they relate to each other.
Essentially, ETL systems can be classified in two architectures: the homogenous
architecture and the heterogeneous architecture.
Homogenous Architecture
A homogenous architecture for an ETL system is one that involves only a single source
system and a single target system. Data flows from the single source of data through
the ETL processes and is loaded into the data warehouse, as shown in the following
diagram.
Simple research requirements: The research efforts to locate data are generally
simple: if the data is in the source system, it can be used. If it is not, it cannot.
The homogeneous ETL architecture is generally applicable to data marts, especially
those focused on a single subject matter.
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from multiple
sources, as shown in the following diagram. The complexity of this architecture arises
from the fact that data from more than one source must be merged, rather than from the
fact that data may be formatted differently in the different sources. However, significantly
different storage formats and database schemas do provide additional complications.
ETL Development
ETL development consists of two general phases: identifying and mapping data, and
developing functional element implementations. Both phases should be carefully
documented and stored in a central, easily accessible location, preferably in electronic
form.
Identify and Map Data
This phase of the development process identifies sources of data elements, the targets
for those data elements in the data warehouse, and the transformations that must be
applied to each data element as it is migrated from its source to its destination. High
level data maps should be developed during the requirements gathering and data
modeling phases of the data warehouse project. During the ETL system design and
development process, these high level data maps are extended to thoroughly specify
system details.
Identify Source Data
For some systems, identifying the source data may be as simple as identifying the
server where the data is stored in an OLTP database and the storage type (SQL Server
database, Microsoft Excel spreadsheet, or text file, among others). In other systems,
identifying the source may mean preparing a detailed definition of the meaning of the
data, such as a business rule, a definition of the data itself, such as decoding rules (O =
On, for example), or even detailed documentation of a source system for which the
system documentation has been lost or is not current.
Identify Target Data
Each data element is destined for a target in the data warehouse. A target for a data
element may be an attribute in a dimension table, a numeric measure in a fact table, or
a summarized total in an aggregation table. There may not be a one-to-one
correspondence between a source data element and a data element in the data
warehouse because the destination system may not contain the data at the same
granularity as the source system. For example, a retail client may decide to roll data up
to the SKU level by day rather than track individual line item data. The level of item
detail that is stored in the fact table of the data warehouse is called the grain of the data.
If the grain of the target does not match the grain of the source, the data must be
summarized as it moves from the source to the target.
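As a minimal sketch of this kind of summarization in Transact-SQL (the line-item source
table, fact table, and columns shown here are hypothetical and are not tables used
elsewhere in this chapter), a GROUP BY query can aggregate the detail rows to the
target grain:
--Summarize hypothetical line-item detail to the grain of the fact table:
--one row per SKU per day
INSERT INTO Sales_Fact (SKU, Sale_Date, Qty_Sold, Sales_Amount)
SELECT SKU,
CONVERT(char(8), Sale_DateTime, 112), --roll individual timestamps up to the day (yyyymmdd)
SUM(Qty),
SUM(Qty * Unit_Price)
FROM Line_Item_Source
GROUP BY SKU, CONVERT(char(8), Sale_DateTime, 112)
GO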
Map Source Data to Target Data
A data map defines the source fields of the data, the destination fields in the data
warehouse and any data modifications that need to be accomplished to transform the
data into the desired format for the data warehouse. Some transformations require
aggregating the source data to a coarser granularity, such as summarizing individual
item sales into daily sales by SKU. Other transformations involve altering the source
data itself as it moves from the source to the target. Some transformations decode data
into human readable form, such as replacing "1" with "on" and "0" with "off" in a status
field. If two source systems encode data destined for the same target differently (for
example, a second source system uses Yes and No for status), a separate
transformation for each source system must be defined. Transformations must be
documented and maintained in the data maps. The relationship between the source and
target systems is maintained in a map that is referenced to execute the transformation
of the data before it is loaded in the data warehouse.
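A minimal sketch of such a decode transformation in Transact-SQL (the staging table,
status columns, and source system codes below are hypothetical) uses a CASE
expression to translate each source system's encoding into the common target value:
--Decode status values from two source systems into a common human-readable form
UPDATE Staging_Table
SET Status_Desc = CASE Source_System
WHEN 'A' THEN CASE Status_Code WHEN '1' THEN 'on' ELSE 'off' END
WHEN 'B' THEN CASE Status_Code WHEN 'Yes' THEN 'on' ELSE 'off' END
END
GO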
Develop Functional Elements
Design and implementation of the four ETL functional elements, Extraction,
Transformation, Loading, and meta data logging, vary from system to system. There will
often be multiple versions of each functional element.
Each functional element contains steps that perform individual tasks, which may
execute on one of several systems, such as the OLTP or legacy systems that contain
the source data, the staging area database, or the data warehouse database. Various
tools and techniques may be used to implement the steps in a single functional area,
such as Transact-SQL, DTS packages, or custom applications developed in a
programming language such as Microsoft Visual Basic. Steps that are discrete in one
functional element may be combined in another.
Extraction
The extraction element may have one version to extract data from one OLTP data
source, a different version for a different OLTP data source, and multiple versions for
legacy systems and other sources of data. This element may include tasks that execute
SELECT queries from the ETL staging database against a source OLTP system, or it
may execute some tasks on the source system directly and others in the staging
database, as in the case of generating a flat file from a legacy system and then
importing it into tables in the ETL database. Regardless of methods or number of steps,
the extraction element is responsible for extracting the required data from the source
system and making it available for processing by the next element.
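For example, when the extraction logic runs in the staging database and queries the
source OLTP system over OLE DB, it might use a linked server. The following is a
minimal sketch only; the linked server name, source table, and staging table are
hypothetical, and the linked server is assumed to be already defined:
--Pull source rows across a linked server into a staging-area table
INSERT INTO Orders_Temp
SELECT *
FROM OPENQUERY(SOURCE_OLTP, 'SELECT * FROM sales.dbo.orders WHERE ord_date >= ''20000101''')
GO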
Transformation
Frequently a number of different transformations, implemented with various tools or
techniques, are required to prepare data for loading into the data warehouse. Some
transformations may be performed as data is extracted, such as an application on a
legacy system that collects data from various internal files as it produces a text file of
data to be further transformed. However, transformations are best accomplished in the
ETL staging database, where data from several data sources may require varying
transformations specific to the incoming data organization and format.
Data from a single data source usually requires different transformations for different
portions of the incoming data. Fact table data transformations may include
summarization, and will always require surrogate dimension keys to be added to the
fact records. Data destined for dimension tables in the data warehouse may require one
process to accomplish one type of update to a changing dimension and a different
process for another type of update.
Transformations may be implemented using Transact-SQL, as is demonstrated in the
code examples later in this chapter, DTS packages, or custom applications.
Regardless of the number and variety of transformations and their implementations, the
transformation element is responsible for preparing data for loading into the data
warehouse.
Loading
The loading element typically has the least variety of task implementations. After the
data from the various data sources has been extracted, transformed, and combined, the
loading operation consists of inserting records into the various data warehouse
database dimension and fact tables. Implementation may vary in the loading tasks, such
as using BULK INSERT, bcp, or the Bulk Copy API. The loading element is responsible
for loading data into the data warehouse database tables.
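A sketch of one such loading task using BULK INSERT (the file path and target table
are hypothetical) might look like this:
--Bulk load a delimited extract file into a data warehouse fact table
BULK INSERT Sales_Fact
FROM 'C:\etl\sales_fact.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK)
GO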
Meta Data Logging
Meta data is collected from a number of the ETL operations. The meta data logging
implementation for a particular ETL task will depend on how the task is implemented.
For a task implemented by using a custom application, the application code may
produce the meta data. For tasks implemented by using Transact-SQL, meta data can
be captured with Transact-SQL statements in the task processes. The meta data
logging element is responsible for capturing and recording meta data that documents
10
the operation of the ETL functional areas and tasks, which includes identification of data
that moves through the ETL system as well as the efficiency of ETL tasks.
Common Tasks
Each ETL functional element should contain tasks that perform the following functions,
in addition to tasks specific to the functional area itself:
Confirm Success or Failure. A confirmation should be generated on the success or
failure of the execution of the ETL processes. Ideally, this mechanism should exist for
each task so that rollback mechanisms can be implemented to allow for incremental
responses to errors.
Scheduling. ETL tasks should include the ability to be scheduled for execution; a
minimal scheduling sketch follows this list. Scheduling mechanisms reduce repetitive
manual operations and allow for maximum use of system resources during recurring
periods of low activity.
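As a minimal sketch of such scheduling with SQL Server Agent (the job name, step
command, database, and schedule values below are hypothetical), the msdb stored
procedures can create, schedule, and target a job:
--Create a SQL Server Agent job that runs a nightly ETL stored procedure
EXEC msdb.dbo.sp_add_job @job_name = 'Nightly ETL Load'
EXEC msdb.dbo.sp_add_jobstep @job_name = 'Nightly ETL Load',
@step_name = 'Run load procedure',
@subsystem = 'TSQL',
@database_name = 'ETL_Staging',
@command = 'EXEC usp_Nightly_Load'
EXEC msdb.dbo.sp_add_jobschedule @job_name = 'Nightly ETL Load',
@name = 'Nightly at 2 AM',
@freq_type = 4,            --daily
@freq_interval = 1,
@active_start_time = 20000 --02:00:00
EXEC msdb.dbo.sp_add_jobserver @job_name = 'Nightly ETL Load'
GO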
SQL Server 2000 ETL Components
SQL Server 2000 includes several components that aid in the development and
maintenance of ETL systems:
Data Transformation Services (DTS): SQL Server 2000 DTS is a set of graphical
tools and programmable objects that lets you extract, transform, and consolidate data
from disparate sources into single or multiple destinations.
SQL Server Agent: SQL Server Agent provides features that support the scheduling
of periodic activities on SQL Server 2000, or the notification to system administrators
of problems that have occurred with the server.
Stored Procedures and Views: Stored procedures assist in achieving a consistent
implementation of logic across applications. The Transact-SQL statements and logic
needed to perform a commonly performed task can be designed, coded, and tested
once in a stored procedure. A view can be thought of as either a virtual table or a
stored query. The data accessible through a view is not stored in the database as a
distinct object; only the SELECT statement for the view is stored in the database.
Transact SQL: Transact-SQL is a superset of the SQL standard that provides
powerful programming capabilities that include loops, variables, and other
programming constructs.
OLE DB: OLE DB is a low-level interface to data. It is an open specification designed
to build on the success of ODBC by providing an open standard for accessing all
kinds of data.
Meta Data Services: SQL Server 2000 Meta Data Services provides a way to store
and manage meta data about information systems and applications. This technology
serves as a hub for data and component definitions, development and deployment
models, reusable software components, and data warehousing descriptions.
The ETL Staging Database
In general, ETL operations should be performed on a relational database server
separate from the source databases and the data warehouse database. A separate
staging area database server creates a logical and physical separation between the
source systems and the data warehouse, and minimizes the impact of the intense
periodic ETL activity on source and data warehouse databases. If a separate database
server is not available, a separate database on the data warehouse database server
can be used for the ETL staging area. However, in this case it is essential to schedule
periods of high ETL activity during times of low data warehouse user activity.
For small data warehouses with available excess performance and low user activity, it is
possible to incorporate the ETL system into the data warehouse database. The
advantage of this approach is that separate copies of data warehouse tables are not
needed in the staging area. However, there is always some risk associated with
performing transformations on live data, and ETL activities must be very carefully
coordinated with data warehouse periods of minimum activity. When ETL is integrated
into the data warehouse database, it is recommended that the data warehouse be taken
offline when performing ETL transformations and loading.
Most systems can effectively stage data in a SQL Server 2000 database, as we
describe in this chapter. An ETL system that needs to process extremely large volumes
of data will need to use specialized tools and custom applications that operate on files
rather than database tables. With extremely large volumes of data, it is not practical to
load data into a staging database until it has been cleaned, aggregated, and stripped of
meaningless information. Because it is much easier to build an ETL system using the
standard tools and techniques that are described in this chapter, most experienced
system designers will attempt to use a staging database, and move to custom tools only
if data cannot be processed during the load window.
What does "extremely large" mean and when does it become infeasible to use standard
DTS tasks and Transact-SQL scripts to process data from a staging database? The
answer depends on the load window, the complexity of transformations, and the degree
of data aggregation necessary to create the rows that are permanently stored in the
data warehouse. As a conservative rule of thumb, if the transformation application
needs to process more than 1 gigabyte of data in less than an hour, it may be
necessary to consider specialized high performance techniques, which are outside the
scope of this chapter.
This section provides general information about configuring the SQL Server 2000
database server and the database to support an ETL system staging area database
with effective performance. ETL systems can vary greatly in their database server
requirements; server configurations and performance option settings may differ
significantly from one ETL system to another.
ETL data manipulation activities are similar in design and functionality to those of OLTP
systems although ETL systems do not experience the constant activity associated with
OLTP systems. Instead of constant activity, ETL systems have periods of high write
activity followed by periods of little or no activity. Configuring a server and database to
meet the needs of an ETL system is not as straightforward as configuring a server and
database for an OLTP system.
For a detailed discussion of RAID and SQL Server 2000 performance tuning, see
Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
Server Configuration
Disk storage system performance is one of the most critical factors in the performance
of database systems. Server configuration options offer additional methods for adjusting
server performance.
RAID
As with any OLTP system, the RAID level for the disk drives on the server can make a
considerable performance difference. For maximum performance of an ETL database,
the disk drives for the server computer should be configured with RAID 1 or RAID 10.
Additionally, it is recommended that the transaction logs, databases, and tempdb be
placed on separate physical drives. Finally, if the hardware controller supports write
caching, it is recommended that write caching be enabled. However, be sure to use a
caching controller that guarantees that the controller cache contents will be written to
disk in case of a system failure.
Server Configuration Options (sp_configure)
For more information about database performance tuning, see Chapter 20, "RDBMS
Performance Tuning Guide for Data Warehousing."
The following table lists some database options and the settings that may be used to
increase ETL performance.

Option name               Setting
AUTO_CREATE_STATISTICS    Off
AUTO_UPDATE_STATISTICS    On
AUTO_SHRINK               Off
CURSOR_DEFAULT            LOCAL
RECOVERY                  BULK_LOGGED
TORN_PAGE_DETECTION       On
Caution Different recovery model options introduce varying degrees of risk of data loss.
It is imperative that the risks be thoroughly understood before choosing a recovery
model.
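As a sketch, these options could be applied to a staging database (here a hypothetical
database named ETL_Staging) with ALTER DATABASE statements such as:
--Apply the recommended database option settings to the ETL staging database
ALTER DATABASE ETL_Staging SET AUTO_CREATE_STATISTICS OFF
ALTER DATABASE ETL_Staging SET AUTO_UPDATE_STATISTICS ON
ALTER DATABASE ETL_Staging SET AUTO_SHRINK OFF
ALTER DATABASE ETL_Staging SET CURSOR_DEFAULT LOCAL
ALTER DATABASE ETL_Staging SET RECOVERY BULK_LOGGED --see the caution above regarding recovery models
ALTER DATABASE ETL_Staging SET TORN_PAGE_DETECTION ON
GO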
Managing Surrogate Keys
Surrogate keys are critical to successful data warehouse design: they provide the
means to maintain data warehouse information when dimensions change. For more
information and details about surrogate keys, see Chapter 17, "Data Warehouse Design
Considerations."
The following are some common characteristics of surrogate keys:
Used as the primary key for each dimension table, instead of the original key used in
the source data system. The original key for each record is carried in the table but is
not used as the primary key.
May be defined as the primary key for the fact table. In general, the fact table uses a
composite primary key composed of the dimension foreign key columns, with no
surrogate key. In schemas with many dimensions, load and query performance will
improve substantially if a surrogate key is used. If the fact table is defined with a
surrogate primary key and no unique index on the composite key, the ETL application
must be careful to ensure row uniqueness outside the database. A third possibility for
the fact table is to define no primary key at all. While there are systems for which this
is the most effective approach, it is not good database practice and should be
considered with caution.
Contains no meaningful business information; its only purpose is to uniquely identify
each row. There is one exception: the primary key for a time dimension table provides
human-readable information in the format "yyyymmdd ".
Is a simple key on a single column, not a composite key.
Should be numeric, preferably integer, and not text.
Should never be a GUID.
The SQL Server 2000 Identity column provides an excellent surrogate key mechanism.
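A minimal sketch of a dimension table that uses an IDENTITY column as its surrogate
key follows; the Customer_Dim table and its columns are hypothetical and are used
only for illustration:
--Dimension table with an integer IDENTITY surrogate key;
--the original source key (customer_id) is carried but is not the primary key
CREATE TABLE Customer_Dim (
Customer_Key int IDENTITY (1, 1) NOT NULL
CONSTRAINT PK_Customer_Dim PRIMARY KEY CLUSTERED,
customer_id char (10) NOT NULL, --original key from the source system
customer_name varchar (50) NOT NULL,
DateCreated datetime NOT NULL DEFAULT (getdate()),
DateUpdated datetime NOT NULL DEFAULT (getdate())
)
GO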
ETL Code Examples
Code examples in these sections use the pubs sample database included with SQL
Server 2000 to demonstrate various activities performed in ETL systems. The examples
illustrate techniques for loading dimension tables in the data warehouse; they do not
take into consideration separate procedures that may be required to update OLAP
cubes or aggregation tables.
The use of temporary and staging tables in the ETL database allows the data extraction
and loading process to be broken up into smaller segments of work that can be
individually recovered. The temporary tables allow the source data to be loaded and
transformed without impacting the performance of the source system except for what is
necessary to extract the data. The staging tables provide a mechanism for data
validation and surrogate key generation before loading transformed data into the data
warehouse. Transformation, validation, and surrogate key management tasks should
never be performed directly on dimension tables in the data warehouse.
The code examples in this chapter are presented as Transact-SQL, in order to
communicate to the widest audience. A production ETL system would use DTS to
perform this work. A very simple system may use several Execute SQL tasks linked
within a package. More complex systems divide units of work into separate packages,
and call those subpackages from a master package. For a detailed explanation of how
to use DTS to implement the functionality described in this chapter, please see SQL
Server Books Online.
Tables for Code Examples
The examples use the authors table in the pubs database as the source of data. The
following three tables are created for use by the code examples.
Table name        Purpose
Authors_Temp      Holds the data imported from the source system.
Authors_Staging   Holds the dimension data while it is being updated. The data for the
                  authors will be updated in this table and then the data will be loaded
                  into the data warehouse dimension table.
Authors_DW        Simulates the Authors dimension table in the data warehouse.
--Load all of the data from the source system into the Authors_Temp table
INSERT INTO Authors_Temp
SELECT * FROM Authors
GO

--Set a starting value for the Contract field for two records
--for use by future examples
UPDATE Authors_Temp
SET Contract = 0
WHERE state = 'UT'
GO

--Locate all of the new records that have been added to the source system by
--comparing the new temp table contents to the existing staging table contents,
--and add the new records to the staging table
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state, zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip, T.contract
FROM Authors_Temp T
LEFT OUTER JOIN Authors_Staging S ON T.au_id = S.au_id
WHERE (S.au_id IS NULL)
GO
--Locate all of the new records that are to be added to the data warehouse
--and insert them into the data warehouse by comparing Authors_Staging to Authors_DW
INSERT INTO Authors_DW (Author_Key, au_id, au_lname, au_fname, phone, address, city,
    state, zip, contract, DateCreated, DateUpdated)
SELECT S.Author_Key, S.au_id, S.au_lname, S.au_fname, S.phone, S.address, S.city,
    S.state, S.zip, S.contract, S.DateCreated, S.DateUpdated
FROM Authors_Staging S
LEFT OUTER JOIN Authors_DW D ON S.au_id = D.au_id
WHERE (D.au_id IS NULL)
GO
Managing Slowly Changing Dimensions
This section describes various techniques for managing slowly changing dimensions in
the data warehouse. "Slowly changing dimensions" is the customary term used for
dimensions that contain attributes that, when changed, may affect grouping or
summarization of historical data. Design approaches to dealing with the issues of slowly
changing dimensions are commonly categorized into the following three change types:
Type 1: Overwrite the dimension record
Type 2: Add a new dimension record
Type 3: Create new fields in the dimension record
Type 1 and Type 2 dimension changes are discussed in this section. Type 3 changes
are not recommended for most data warehouse applications and are not discussed
here. For more information and details about slowly changing dimensions, see Chapter
17, "Data Warehouse Design Considerations."
Type 1 and Type 2 dimension change techniques are used when dimension attributes
change in records that already exist in the data warehouse. The techniques for inserting
new records into dimensions were discussed earlier in this chapter.
--For example purposes, make sure the staging table records have a different value
--Insert new records into the Staging Table for those records in the temp table
--that have a different value for the contract field
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state, zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip, T.contract
FROM Authors_Temp T
LEFT OUTER JOIN Authors_Staging S ON T.au_id = S.au_id
WHERE T.Contract <> S.Contract
GO
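By contrast, a Type 1 change overwrites the existing dimension record in place rather
than adding a new one. A minimal sketch of a Type 1 overwrite, using the same example
tables and assuming the changed values have already been loaded into
Authors_Staging, might look like this:
--Type 1 change: overwrite the contract value in the existing data warehouse record
UPDATE Authors_DW
SET contract = S.contract,
DateUpdated = getdate()
FROM Authors_DW D
INNER JOIN Authors_Staging S ON D.au_id = S.au_id
WHERE D.contract <> S.contract
GO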
Several key points about the structures of these tables should be noted:
There is no difference between the structures of the Fact_Source table and the
Fact_Temp tables. This allows for the easiest method to extract data from the source
system so that transformations on the data do not impact the source system.
The Fact_Staging table is used to add the dimension surrogate keys to the fact table
records. This table is also used to validate any data changes, convert any data types,
and so on.
The structures of the Fact_Staging and Fact_DW tables do not match. This is
because the final fact table in the data warehouse does not store the original keys,
just the surrogate keys.
The fact table key is an identity column that is generated when the transformed data is
loaded into the fact table. Since we will not be updating the records once they have
been added to the Fact_DW table, there is no need to generate the key prior to the
data load into the fact table. This is not how the key column is generated in dimension
tables. As discussed above, the decision to use an identity key for a fact table
depends on the complexity of the data warehouse schema and the performance of
load and query operations; this example implements an identity key for the fact table.
The following Transact-SQL statements create the tables defined above:
Code Example 19.6
--Create the simulated source data table
CREATE TABLE [Fact_Source] (
[stor_id] [char] (4) NOT NULL ,
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL ,
[payterms] [varchar] (12) NOT NULL ,
[title_id] [tid] NOT NULL
) ON [PRIMARY]
GO
--Create the example temporary source data table used in the ETL database
CREATE TABLE [Fact_Temp] (
[stor_id] [char] (4) NOT NULL ,
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL ,
[payterms] [varchar] (12) NOT NULL ,
[title_id] [tid] NOT NULL
) ON [PRIMARY]
GO
--Load the Fact_Temp table with data from the Fact_Source table
INSERT INTO Fact_Temp
SELECT *
FROM Fact_Source
GO
FROM Fact_Temp
GO
Now that the Fact_Staging table is loaded, the surrogate keys can be updated. The
techniques for updating the surrogate keys in the fact table will differ depending on
whether the dimension contains Type 2 changes. The following technique can be used
for Type 1 dimensions:
Code Example 19.9
--Update the Fact_Staging table with the surrogate key for Titles
--(Type 1 dimension)
UPDATE Fact_Staging
SET Title_Key = T.Title_Key
FROM Fact_Staging F INNER JOIN
Titles_DW T ON F.title_id = T.title_id
GO
--Update the Fact_Staging table with the surrogate key for Store
--(Type 1 dimension)
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_DW S ON F.Stor_id = S.Stor_id
GO
The technique above will not work for dimensions that contain Type 2 changes,
however, because there may be more than one dimension record that contains the
original source key. The following technique is appropriate for Type 2 dimensions:
Code Example 19.10
--Add a few new rows to the Stores_DW table to demonstrate technique
--Duplicate Store records are added that reflect changed store names
INSERT INTO Stores_DW (stor_id, stor_name, stor_address, city, state, zip)
SELECT stor_id, 'New ' + stor_name, stor_address, city, state, zip
FROM Stores_DW
WHERE state = 'WA'
GO
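--#Stores is assumed to be a temporary table that maps each stor_id to the
--appropriate surrogate Store_Key for the incoming fact records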
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
#Stores S ON F.stor_id = S.stor_id
WHERE F.Store_Key = 0
Table name        Purpose
Stores_Staging    Holds the dimension data while it is being updated. The data for
                  the stores will be updated in this table and then the data will be
                  loaded into the data warehouse dimension table.
Stores_Current    Holds the current version of each store record, used to relate
                  new fact data to the latest store record.
Stores_DW         Simulates the Stores dimension table in the data warehouse.
FROM Fact_Staging
WHERE QTY < 20
GO
--Update the fact data using the Store_Key from the Stores_Current table
--to relate the new fact data to the latest store record
UPDATE Fact_Staging
SET Store_Key = C.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_Current C ON F.stor_id = C.stor_id
WHERE F.Store_Key = 0
GO
Meta Data Logging
A critical design element in successful ETL implementation is the capability to generate,
store and review meta data. Data tables in a data warehouse store information about
customers, items purchased, dates of purchase, and so on. Meta data tables store
information about users, query execution times, number of rows retrieved in a report,
etc. In ETL systems, meta data tables store information about transformation execution
time, number of rows processed by a transformation, the last date and time a table was
updated, failure of a transformation to complete, and so on. This information, if analyzed
appropriately, can help predict what is likely to occur in future transformations by
analyzing trends of what has already occurred.
In the code examples that follow, the terms "Job" and "Step" are used with the following
meanings:
A "Job" is an ETL element that is either executed manually or as a scheduled event. A
Job contains one or more steps.
A "Step" is an individual unit of work in a job such as an INSERT, UPDATE, or
DELETE operation.
A "Threshold" is a range of values defined by a minimum value and a maximum value.
Any value that falls within the specified range is deemed acceptable. Any value that
does not fall within the range is unacceptable. For example, a processing window is a
type of threshold. If a job completes within the time allotted for the processing window,
then it is acceptable. If it does not, then it is not acceptable.
Designing meta data storage requires careful planning and implementation. There are
dependencies between tables and order of precedence constraints on records.
However, the meta data information generated by ETL activities is critical to the success
of the data warehouse. Following is a sample set of tables that can be used to track
meta data for ETL activities.
Job Audit
ETL jobs produce data points that need to be collected. Most of these data points are
aggregates of the data collected for the job steps and could theoretically be derived by
querying the job step audit table. However, the meta data for the job itself is important
enough to warrant storage in a separate table. Below are sample meta data tables that
aid in tracking job information for each step in an ETL process.
tblAdmin_Job_Master
This table lists all of the jobs that are used to populate the data warehouse. These are
the fields in tblAdmin_Job_Master:
Field               Definition
JobNumber           A unique identifier for the record, generally an identity column.
JobName             The name (description) for the job. For example, "Load new
                    dimension data."
MinThreshRecords    The minimum acceptable number of records affected by the job.
MaxThreshRecords    The maximum acceptable number of records affected by the job.
MinThreshTime       The minimum acceptable execution time for the job.
MaxThreshTime       The maximum acceptable execution time for the job.
CreateDate          The date and time the record was created.
tblAdmin_Audit_Jobs
This table is used to track each specific execution of a job. It is related to the
tblAdmin_Job_Master table using the JobNumber column. These are the fields in
tblAdmin_Audit_Jobs:
Field           Definition
JobNumber       A unique identifier for the record, generally an identity column.
JobName         The name (description) for the job. For example, "Load new
                dimension data."
StartDate       The date and time the job was started.
EndDate         The date and time the job ended.
NumberRecords   The number of records affected by the job.
Successful      A flag indicating whether the execution of the job was successful.
This data definition language will generate the above audit tables:
Code Example 19.19
CREATE TABLE [dbo].[tblAdmin_Job_Master] (
[JobNumber] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Job PRIMARY KEY CLUSTERED,
[JobName] [varchar] (50) NULL DEFAULT ('Missing'),
[MinThreshRecords] [int] NOT NULL DEFAULT (0),
[MaxThreshRecords] [int] NOT NULL DEFAULT (0),
[MinThreshTime] [int] NOT NULL DEFAULT (0),
[MaxThreshTime] [int] NOT NULL DEFAULT (0),
[CreateDate] [datetime] NOT NULL DEFAULT (getdate())
)
GO
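As a sketch of how this meta data can be used (assuming the two tables above, plus
the JobAuditID column referenced in the code samples later in this section), a simple
query can flag job executions whose record counts fall outside their thresholds:
--Flag job executions whose affected record counts fall outside the acceptable thresholds
SELECT A.JobAuditID, M.JobName, A.StartDate, A.NumberRecords,
M.MinThreshRecords, M.MaxThreshRecords
FROM tblAdmin_Audit_Jobs A
INNER JOIN tblAdmin_Job_Master M ON A.JobNumber = M.JobNumber
WHERE A.NumberRecords < M.MinThreshRecords
OR A.NumberRecords > M.MaxThreshRecords
GO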
Step Audit
Meta data for each step in an ETL job should record what the step did, when it
happened, and how many rows it processed. This information should be stored
for every step in an ETL job. Below are sample meta data tables that aid in tracking
information for each step in an ETL job.
tblAdmin_Step_Master
This table lists all of the steps in a job. These are the fields in tblAdmin_Step_Master:
Field             Definition
JobNumber         The unique number of the job that this step is associated with.
StepSeqNumber     The step number within the object that executed the unit of work.
                  Frequently, ETL jobs contain more than a single unit of work, and
                  storing the step number allows for easy debugging and specific
                  reporting. If the object only has a single step, then the value of this
                  field is "1".
StepDescription   A description of the step. For example, "Inserted records into tblA."
Object            The name of the object. For example, the name of a stored procedure.
The step audit table (tblAdmin_Audit_Step) records each specific execution of a job
step. Its fields include a unique value assigned to the record (generally an identity
column), a value used to tie the specific execution of a job step to the specific execution
of a job, the step number executed (StepNumber), and any parameters sent to the job
step for the specific execution instance (Parameters).
The error log table records errors raised by job steps. Its fields include a unique value
assigned to the record (generally an identity column), the step number executed that
generated the error, and any parameters sent to the job step for the specific execution
instance.
--DECLARE variables
DECLARE @ErrorNumber int --the number of the SQL error generated
DECLARE @ErrorRowCount int --the number of rows in the unit of work affected by the error
DECLARE @Startdate smalldatetime --the datetime the load job started
DECLARE @EndDate smalldatetime --the datetime the load job ended
--INSERT the first record (start time) for the job into the tblAdmin_Audit_Jobs table
BEGIN TRANSACTION
SET @StartDate = getdate() --set a start date for the batch
SET @EndDate = '01/01/1900' --set a bogus end date for the batch
insert into tblAdmin_Audit_Jobs (JobNumber, StartDate, EndDate, NumberRecords, Successful)
values (@JobNumber, @StartDate, @EndDate, 0, 0)
SELECT @ErrorNumber = @@ERROR, @ErrorRowCount = @@ROWCOUNT --capture status immediately after the INSERT
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_AdminError @@ProcID, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
The following stored procedure indicates the end of an ETL job and should be the last
stored procedure executed in the ETL job. It is important to note that in addition to
updating the tblAdmin_Audit_Jobs table, this stored procedure also updates the
tblAdmin_Audit_Step table with the threshold information for each step. The threshold
information is stored with each step in the table because over time, the acceptable
thresholds for the step may change. If the threshold information is only stored in the
master step table (a Type 1 dimension), any changes to the table affect meta data
generated for historical steps.
Therefore, storing the threshold with the step (a Type 2 dimension) allows us to
maintain historical execution records without affecting their integrity if the master step
information is changed. For example, if a step initially loads 1,000 rows but over time
the number of rows increases to 1 million, the acceptable threshold information for that
step must be changed as well. If the threshold data is stored only in the
tblAdmin_Step_Master table and not stored with each record, the context of the data
will be lost, which can cause inaccuracies in reports built on the meta data information.
For simplicity, to illustrate the technique, the sample code does not maintain threshold
information automatically. In order to change the threshold information for a step, an
administrator will need to modify the master step record manually. However, it would be
possible to automate this process.
Code Example 19.22
CREATE PROCEDURE usp_Admin_Audit_Job_End
@JobNumber int = 1, --The number of the job (from the master job table) being executed
@Successful bit --A flag indicating if the job was successful
AS
--DECLARE variables
DECLARE @ErrorNumber int --the number of the SQL error generated
DECLARE @ErrorRowCount int --the number of rows in the unit of work affected by the error
DECLARE @Startdate smalldatetime --the datetime the load job started
DECLARE @EndDate smalldatetime --the datetime the load job ended
DECLARE @JobAuditID int --the # for the instance of the job
DECLARE @RowCount int --the number of rows affected by the job
BEGIN TRANSACTION
SET @EndDate = getdate() --set the end date for the batch
SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs) --the instance of the job being closed
UPDATE tblAdmin_Audit_Jobs --Update the Job record with the end time
SET EndDate = @EndDate,
NumberRecords = @RowCount,
Successful = @Successful
WHERE JobAuditID = @JobAuditID
SELECT @ErrorNumber = @@ERROR, @ErrorRowCount = @@ROWCOUNT --capture status immediately after the UPDATE
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_AdminError @@ProcID, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
Code Sample: Step Audit
The following stored procedures demonstrate one method of logging step records from
within ETL stored procedures. Notice that the @@ProcID is used to retrieve the object
id of the executing stored procedure. Also note that the values of @@error and
@@rowcount are retrieved immediately after the INSERT statement.
Code Example 19.23
ALTER PROCEDURE usp_Admin_Audit_Step
@StepNumber tinyint = 0, --the unique number of the step
@Parameters varchar(50) = 'none', --any parameters used in the SP
@RecordCount int = 0, --the number of records modified by the step
@StartDate smalldatetime, --the date & time the step started
@EndDate smalldatetime --the date & time the step ended
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
BEGIN TRANSACTION --INSERT the audit record into the tblAdmin_Audit_Step table
SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs)
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
The following stored procedure demonstrates the use of the auditing stored procedure
detailed above:
Code Example 19.24
CREATE PROCEDURE usp_AuditSample
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
DECLARE @ErrorNumber int
DECLARE @RecordCount int
DECLARE @StartDate smalldatetime
DECLARE @EndDate smalldatetime
BEGIN TRANSACTION
SET @StartDate = getdate() --get the datetime the step started
insert into tblTest
select * from tblTest
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @RecordCount
RETURN (1)
GO
Code Sample: Error Tracking
The following stored procedures demonstrate one possible method of logging errors in
ETL stored procedures. Notice that the stored procedure uses the OBJECT_NAME
function to retrieve the name of the object (table, view, stored procedure, and so on).
This introduces a level of abstraction so that the code is only useful for stored
procedures.
Code Example 19.25
CREATE PROCEDURE usp_Admin_Log_Error
@ObjectID int,
@StepNumber int,
@Parameters varchar(50) = 'none',
@ErrorNumber int = 0,
@RecordCount int = 0
AS
--SET NoCount ON
SET NOCOUNT ON
--DECLARE Variables (@ErrorNumber and @RecordCount are already declared as parameters)
DECLARE @ObjectName varchar(50)
DECLARE @Step int
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, @Step, 'none', @ErrorNumber,
@RecordCount
RETURN (1)
GO
Conclusion
The ETL system efficiently extracts data from its sources, transforms and sometimes
aggregates data to match the target data warehouse schema, and loads the
transformed data into the data warehouse database. A well-designed ETL system
supports automated operation that informs operators of errors with the appropriate level
of warning. SQL Server 2000 Data Transformation Services can be used to
manage the ETL operations, regardless of the techniques used to implement individual
ETL tasks.
While it is tempting to perform some transformation on data as it is extracted from the
source system, the best practice is to isolate transformations within the transformation
modules. In general, the data extraction code should be designed to minimize the
impact on the source system databases.
In most applications, the key to efficient transformation is to use a SQL Server 2000
database for staging. Once extracted data has been loaded into a staging database, the
powerful SQL Server 2000 database engine is used to perform complex
transformations.
The process of loading fact table data from the staging area into the target data
warehouse should use bulk load techniques. Dimension table data is usually small in
volume, which makes bulk loading less important for dimension table loading.
The ETL system is a primary source of meta data that can be used to track information
about the operation and performance of the data warehouse as well as the ETL
processes.