TRA04.05 - Extract, Transform, and Load (ETL) Processing
Prepared by:
South Carolina Department of Health and Human Services (SCDHHS)
Enterprise Services (ES)
Table of Contents
1. Introduction
2. ETL in the MES Architecture
3. Two ETL Patterns: E-T-L and E-L-T
3.1. E-T-L Pattern
3.2. E-L-T Pattern
4. ETL Process Setup
4.1. Direct Database Connections
4.2. File-Based Exchanges
5. ETL Process Standard
5.1. Authentication and Authorization
5.2. Data Access
5.2.1. Direct Database Exchanges
5.2.2. File-Based Exchanges
5.2.3. Metadata Repository
5.3. Data Extraction
5.4. Data Validation
5.5. Data Transformation
5.6. Load Step
5.7. Common Infrastructure
5.7.1. Process Logging
5.7.2. Exception Handling
5.7.3. Component Documentation
5.8. Operational Considerations
5.8.1. High Availability and Failover Mechanisms
5.8.2. Causes of Infrastructure Failure
5.8.3. Clustering
5.8.4. Restartability
5.9. Scheduling
5.9.1. Process Monitoring
5.9.2. Reporting
5.9.3. Alerts and Notification
6. Use Cases
6.1. Initial Load
6.2. Incremental Loads / Deltas
6.3. Initial as Well as Ongoing Loads
6.4. Transformation Performed on Target Systems
6.4.1. Data Loads into Data Marts
6.4.2. Data Loads into NoSQL Databases
7. Tools and Technologies
8. Appendices
8.1. Relevant Documentation
8.2. Revision History
8.3. Acronyms
8.4. Glossary
Table of Figures
Figure 1. ETL in the MES architecture
Figure 2. MES ETL architecture
Figure 3. E-T-L pattern steps
Figure 4. E-L-T pattern steps
Figure 5. Detailed ETL process
Figure 6. MES ETL connection architecture
Table of Tables
Table 1. Events and metrics to record in ETL process logs
Table 2. Information logged in event of ETL process error
Table 3. Considerations for scheduling ETL process jobs
Table of Standards
TRA04.05-S1. The ETL team will establish failure response scenarios.
TRA04.05-S2. ETL processes will log exceptions in accordance with the enterprise standards.
TRA04.05-S3. Changes affecting the ETL process will adhere to ITIL framework.
TRA04.05-S4. Database connection details will be set up as an encrypted configuration of the ETL tool.
TRA04.05-S5. Database table structures will be managed through the metadata repository.
TRA04.05-S6. The ETL process will comply with TRA04.06 – Managed File Transfer (MFT).
TRA04.05-S7. File schemas will be managed through the metadata repository.
TRA04.05-S8. ETL processes will adhere to TRA01.02 – Access Control and Identity Management.
TRA04.05-S9. At each step in the ETL process, access to the data will be controlled.
TRA04.05-S10. The source systems will ensure the ETL processes have access to the source data.
TRA04.05-S11. The ETL tool will have read-only access to the source data.
TRA04.05-S12. The ETL application will perform file validations.
TRA04.05-S13. The ETL process will validate the structure of the database or file.
TRA04.05-S14. The ETL process will validate that the data follows specified business rules.
TRA04.05-S15. The ETL tool will have read/write access to staging and destination databases.
TRA04.05-S16. The ETL process will log important events and metrics.
TRA04.05-S17. The ETL process will not write personal or protected data in logs.
TRA04.05-S18. The ETL application will be able to increase/decrease log detail.
TRA04.05-S19. The ETL application will adhere to the MES log retention policy.
TRA04.05-S20. The ETL process will log all defined and undefined errors.
TRA04.05-S21. The ETL process will anticipate failures and perform prevention procedures.
TRA04.05-S22. The ETL process will handle errors as defined during set-up.
TRA04.05-S23. The ETL process will send an alert in the event of an exception.
TRA04.05-S24. The ETL process will capture full error context.
TRA04.05-S25. Each ETL process will document its business purpose and technology components.
TRA04.05-S26. The ETL will be executed per the configurations in scheduler.
SCDHHS expressly restricts the distribution of this document to include only those SCDHHS
staff, SCDHHS processing environment contractors, or any entity given explicit access to this
document with SCDHHS executive or management approval.
Intended audience
The SCDHHS Technical Reference Architecture (TRA) audience includes SCDHHS staff, trading
partners, and third-party vendors who will be implementing, integrating, managing, or
operating the SCDHHS Medicaid Enterprise System (MES). The SCDHHS TRA provides the
reference model by which technologies and components of the SCDHHS MES will be measured
for development and implementation.
Trademarks
Microsoft and Excel are registered trademarks of the Microsoft Corporation.
1. Introduction
Extract, transform, and load (ETL) processing broadly refers to processes that extract large
volumes of data from source systems, transform the data to fit into the schema of the
destination systems, and load the data into the target (destination) systems. Although ETL
refers to the three distinct steps (extract, transform, and load), these major steps can be
performed in both the E-T-L and E-L-T pattern, depending on the purpose of the process. For
simplification, unless explicitly specified, when this document uses the term 'ETL' the document
refers to the general ETL process in either the E-T-L or E-L-T pattern.
The ETL process involves more than just extracting, transforming, and loading data. For
example, the ETL process designer needs to understand the schema of the source database and
the schema of the target database. Other ETL components include data validation, process
logging, authentication, exception handling, etc. Section 5. ETL Process Standard describes the
ETL process in detail, and Section 6. Use Cases presents sample use cases to load data into
transactional databases and to perform transformation in the target systems.
The ETL process can be designed and executed in a number of technologies, for instance,
database stored procedures, Java programs, and ETL tools. However, the South Carolina
Department of Health and Human Services (SCDHHS) Medicaid Enterprise System (MES) uses a
commercial off-the-shelf (COTS) ETL application. COTS ETL applications offer an integrated
development environment (IDE) that gives developers drag-and-drop functionality to configure
data transformation.
COTS ETL applications are ideal for dealing with large volumes of data and accessing data from
more than one source. Figure 1. ETL in the MES architecture highlights integration points where
ETL processes may be appropriate to move MES data through the Enterprise Data Management
Pipeline (EDMP). The TRA05 – Enterprise Data Services (EDS) supplement describes the EDS and
its sub-components in further detail.
Figure 1. ETL in the MES architecture
To bring the dissimilar data residing in the MES modules into the EDS requires a well-planned
strategy and design, as well as attention to daily operations and maintenance tasks. Using COTS
ETL tools in the MES provides the following benefits:
Connectors to common data sources such as databases, flat files, mainframe systems, etc.
Data transformations across disparate data sources, including filtering, reformatting, sorting, joining, merging, aggregation, and other operations
Scheduling and monitoring
Version control
Unified metadata management
Integration with business intelligence (BI) tools
Built-in support for establishing templates and enabling standards and reuse
The information in this supplement applies to all COTS ETL applications approved for use in the
MES architecture. See Section 7. Tools and Technologies for technology details and the TRA06
– Technology Products Portfolio (TPP) supplement for the approved COTS ETL technologies.
staging tables. Transformed data is loaded to the destination tables and the process is
complete. Throughout the process, logs document steps performed on the extracted records. In
addition, throughout the process, if an error occurs, appropriate exception handling is executed
and the error is logged.
especially true when partnering with legacy applications that are nearing their end-of-life, or
with partners with limited resources.
The ETL team should agree on and establish failure response scenarios for each type of
exception. For example, if certain records do not meet the structure requirements, those rows
should be rejected and the process should continue. If the entire table is missing a field,
the process should log the appropriate error code and halt further processing.
Some exception handling scenarios include the following:
Detect an error, stop the process, and present the error code.
Detect an error and write the record to an error table with the corresponding error code.
Detect an error, write the record to both the target and the error table with the error code, and flag the record as an error in the target table.
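The distinction between a row-level rejection and a table-level halt can be sketched as follows. This is a minimal illustration only: the error codes, the `process` function, and the `LoadResult` container are hypothetical names, not part of any MES standard or COTS ETL tool.

```python
from dataclasses import dataclass, field

# Hypothetical error codes; real codes would be assigned by the ETL team.
ERR_MISSING_COLUMN = "E001"
ERR_BAD_ROW = "E002"


class TableStructureError(Exception):
    """Raised when the whole input is missing a required field."""


@dataclass
class LoadResult:
    loaded: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (row, error_code) pairs


def process(rows, header, required_fields):
    # Table-level failure: report the error code and halt further processing.
    missing = [name for name in required_fields if name not in header]
    if missing:
        raise TableStructureError(f"{ERR_MISSING_COLUMN}: missing {missing}")

    result = LoadResult()
    for row in rows:
        # Row-level failure: reject the row, record its code, and continue.
        if len(row) != len(header):
            result.rejected.append((row, ERR_BAD_ROW))
        else:
            result.loaded.append(row)
    return result
```

In this sketch the rejected rows play the role of the error table, and the raised exception corresponds to the "stop the process and present the error code" scenario.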
TRA04.05-S2. ETL processes will log exceptions in accordance with the enterprise standards.
Processes for all scenarios must embed logging mechanisms that adhere to the standards for
documenting process steps for process control and audits.
All potential and anticipated errors shall be assigned a unique error code and description.
See also Section 5.7.2. Exception Handling.
TRA04.05-S3. Changes affecting the ETL process will adhere to ITIL framework.
Changes affecting the ETL process will be made according to the Information Technology
Infrastructure Library (ITIL) change management process and will be subject to version and
configuration management. Examples of changes affecting the ETL process include:
File layout changes
Database schema changes
Addition or removal of data elements
The TRA09 – Project Delivery Framework supplement describes the SCDHHS implementation of
ITIL.
Database name
Port
Service ID
Password
Appropriate drivers
For the source system, the database access will be read-only and for the target system, the
database access will be read-write.
The connection details will be set up as an encrypted configuration of the ETL tool and will
under no circumstances be hard-coded or human-readable.
TRA04.05-S5. Database table structures will be managed through the metadata repository.
The table structure of the source and target databases will be managed through the metadata
repository. (See Section 5.2.3. Metadata Repository.)
TRA04.05-S6. The ETL process will comply with TRA04.06 – Managed File Transfer (MFT).
To ensure secure transmission of files across enterprise systems and between internal and
external partners, the ETL process will use a secure file transport mechanism (for both input
and output files). File transfers in the MES architecture conform to the standards described in
the TRA04.06 – Managed File Transfer (MFT) supplement.
The integrating applications agree on the file server and the directory to be used to transport
files based on the systems’ ability to transport the files in and out of the enterprise firewall.
In addition to the file format, the structure of fields in the file is also agreed upon. For example,
in delimited files, the sequence of fields should be pre-defined and distributed among the parties;
in XML files, the nested relationships should be shared beforehand, during the interface design stage.
The schemas of the files being exchanged shall be stored in the metadata repository. (See
Section 5.2.3. Metadata Repository.)
TRA04.05-S9. At each step in the ETL process, access to the data will be controlled.
ETL processes may make multiple copies of data in staging areas, such as extracted flat files or
temporary relational tables. At each step, the data must be subjected to access control to avoid
inadvertent access.
Fixed-width files
Relational databases
Non-relational databases
NoSQL databases
Direct database connections are preferable for ETL processing. However, when the source
and/or destination locations do not support direct data access, file-based exchanges can also be
used. Figure 3. Detailed ETL process illustrates the MES ETL architecture for both direct
database and file-based data exchanges. The use cases in Section 6. Use Cases provide details
about the enactment of this architecture.
Each source system ensures that the ETL process has access to the source data. The level of
access given to the ETL application should be the minimum level of access required to access
only the data required for the ETL process. A common alternative is to have the application
support teams extract the data and provide extracted data to the ETL process in the form of flat
files or other staging formats.
TRA04.05-S11. The ETL tool will have read-only access to the source data.
The ETL tool extracts the data from the source database without updating any data. No other
updates will be made to the source database during the read process.
File-based exchanges are sensitive to file structure changes and require validations to avoid
inadvertent data corruption. Hence, the ETL application will validate that the file format and file
encoding are as specified (when setting up the ETL process) before reading and staging the
data.
When validation fails, the ETL process will handle the exception gracefully and terminate
execution (see Section 5.7.2. Exception Handling). Examples of file-based processing validation
failures include:
An ETL processor encounters a comma-separated file instead of a pipe-delimited file.
An ETL processor encounters a file with only 9 fields while it is expecting 10 fields.
A field that usually takes up 9 characters is extended to 10 characters, thereby invalidating all subsequent mapped fields.
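The three failure examples above could be detected with checks along the following lines. This is a sketch under stated assumptions: the function names, the default delimiter, and the expected field count are illustrative, and a COTS ETL tool would typically provide equivalent checks through configuration.

```python
import csv
import io


def is_valid_encoding(raw, encoding="utf-8"):
    """Return True when the raw bytes decode cleanly in the agreed encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def validate_delimited(text, delimiter="|", expected_fields=10):
    """Return (line_number, problem) tuples; an empty list means the file passed."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != expected_fields:
            problems.append(
                (line_no, f"expected {expected_fields} fields, got {len(fields)}")
            )
    return problems


def validate_fixed_width(text, widths):
    """Check that every line has exactly the total agreed width."""
    total = sum(widths)
    return [
        (line_no, f"expected {total} characters, got {len(line)}")
        for line_no, line in enumerate(text.splitlines(), start=1)
        if len(line) != total
    ]
```

A comma-separated file read with the pipe delimiter shows up as a single field per line, so the wrong-delimiter case and the wrong-field-count case are both caught by the same count check.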
directly attributed to the business purpose of the ETL should be accessed and/or extracted.
The scope of data extraction is more of a consideration for direct database exchanges, where
filtering of the data is built into the ETL process. For file-based exchanges, data filtering is
outside the span of control of the ETL process and is designed into the source system batch
processes.
Metadata stored in the metadata repository is used to translate the structure of the data
source to intermediate structures, if any, for further processing.
TRA04.05-S13. The ETL process will validate the structure of the database or file.
The ETL process validates the structure of the database or files according to the data exchange
agreement and/or approved design. Records that fail structural validation are rejected from
further transformation and loading. The file shall be rejected after proper error
handling.
TRA04.05-S14. The ETL process will validate that the data follows specified business rules.
The ETL process will validate the data to ensure that the data follows certain business rules, for
instance:
A phone number should have 10 characters with all of them being numeric.
The value in the state field should be a valid state code.
The email field should have one and only one @ character.
A member address record contains a valid address according to the USPS.
A valid ICD-10 code is provided in a claims record.
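The simpler rules in the list above can be expressed as small checks, as in the sketch below. The `check_record` function and the `VALID_STATES` set are hypothetical, and the state list is deliberately partial. Rules that depend on external reference data, such as USPS address validation or ICD-10 lookup, are omitted here.

```python
import re

# Deliberately partial, illustrative list; a real process would use the full
# reference data set of state codes.
VALID_STATES = {"SC", "NC", "GA"}


def check_record(record):
    """Return a list of business-rule violations for one record (empty if valid)."""
    errors = []
    if not re.fullmatch(r"[0-9]{10}", record.get("phone", "")):
        errors.append("phone must be exactly 10 numeric characters")
    if record.get("state") not in VALID_STATES:
        errors.append("state must be a valid state code")
    if record.get("email", "").count("@") != 1:
        errors.append("email must contain one and only one @ character")
    return errors
```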
TRA04.05-S16. The ETL process will log important events and metrics.
The ETL process shall have a mechanism to log important events and metrics before, during,
and after the execution of the process.
TRA04.05-S17. The ETL process will not write personal or protected data in logs.
The ETL application, or any program block used to execute the ETL process, should have a
customizable process logging mechanism that allows adjustment for the level of detail captured
in process log files.
Detailed logs assist investigations; however, creating and maintaining detailed logs is expensive
in terms of processing power and memory. Administrators should be able to adjust the ETL process
logging.
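One possible way to make log detail adjustable is shown below with Python's standard `logging` module. The logger name `etl.example` and the `set_log_detail` helper are assumptions made for illustration; a COTS ETL application would expose its own setting for this.

```python
import logging

logger = logging.getLogger("etl.example")  # hypothetical logger name


def set_log_detail(level_name):
    """Raise or lower the level of detail at run time, e.g. 'DEBUG' or 'WARNING'."""
    logger.setLevel(level_name.upper())


set_log_detail("warning")   # routine operation: warnings and errors only
set_log_detail("debug")     # investigation: full detail, at higher cost
```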
TRA04.05-S19. The ETL application will adhere to the MES log retention policy.
MES shall have a policy to determine the retention time of the logs. A shorter log retention
period keeps the logs generated daily from quickly filling up log storage, while a longer log
retention period helps analyze ETL loads and may be needed to fulfill regulatory
requirements.
Systems operations and maintenance staff can use process logs to perform an initial analysis of
process execution. To ensure that appropriate, and consistent, data is available for analysis, the
Event/Metric | Description
Start and stop events | The beginning and ending time stamp for the ETL process as a whole, as well as the individual steps, should be stored.
Status | Process steps can succeed or fail individually, and as such, their status (not started, running, succeeded, or failed) should be logged individually.
Errors and other exceptions | While logging failures and anomalies often consumes the most time when building a logging infrastructure, these logs also yield the most value during testing and troubleshooting. See Section 5.7.2. Exception Handling for failure response scenarios, logging requirements, and information logged for errors and exceptions.
Audit information | This can vary from simply capturing the number of rows loaded in each process execution, to a full analysis of row count and dollar value from source to destination.
Testing and debugging information | This is particularly useful during the development and testing phase, most notably for processes that are heavy on the transformation part of ETL.
Security events | Security events, such as user login, login time, etc.
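A per-step logging wrapper that records start and stop events, status, and a row count might look like the following sketch. The `logged_step` helper and the `metrics` dictionary are hypothetical. Only counts, timings, and the exception type are written to the log, on the assumption that record contents and exception messages could contain personal or protected data.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("etl.steps")  # hypothetical logger name


@contextmanager
def logged_step(name, metrics):
    """Log start/stop and status for one step; the caller fills metrics['rows']."""
    start = time.monotonic()
    log.info("step=%s status=running", name)
    try:
        yield metrics
    except Exception as exc:
        metrics["status"] = "failed"
        # Log the exception type only, not its message, to keep data out of logs.
        log.error("step=%s status=failed error=%s", name, type(exc).__name__)
        raise
    else:
        metrics["status"] = "succeeded"
    finally:
        metrics["seconds"] = time.monotonic() - start
        log.info(
            "step=%s status=%s rows=%s seconds=%.3f",
            name, metrics["status"], metrics.get("rows"), metrics["seconds"],
        )
```

A caller would wrap each step, for example `with logged_step("extract", {}) as m: m["rows"] = 1200`, so that every step reports its own status independently of the others.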
TRA04.05-S20. The ETL process will log all defined and undefined errors.
The ETL process will log any and all defined or undefined errors occurring in all stages of the ETL
process.
TRA04.05-S21. The ETL process will anticipate failures and perform prevention procedures.
The ETL process shall contain routines to perform a sanity check on the data before processing
it. The routines should be designed to automatically correct some common data discrepancies
and error scenarios, for example, removing leading zeros from certain data elements,
removing hyphens in a phone number, etc.
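The two corrections mentioned above could be written as small normalization routines, sketched here with hypothetical function names:

```python
def normalize_phone(value):
    """Remove hyphens and surrounding whitespace from a phone number."""
    return value.replace("-", "").strip()


def strip_leading_zeros(value):
    """Remove leading zeros, keeping a single '0' when the value is all zeros."""
    stripped = value.lstrip("0")
    return stripped or "0"
```

Which data elements are safe to correct automatically is a design decision for each ETL process; identifiers where leading zeros are significant should not be passed through `strip_leading_zeros`.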
TRA04.05-S22. The ETL process will handle errors as defined during set-up.
During set-up, the ETL team agrees on, and establishes, the error handling scenarios. See
Section 4. ETL Process Setup.
TRA04.05-S23. The ETL process will send an alert in the event of an exception.
The ETL process will send alerts reliably and consistently by email in the event of an
exception. The list of individuals alerted should be customizable by the Enterprise Services (ES)
administrators.
In the event of an error, the ETL process will capture and save the information listed in Table 2.
Information logged in event of ETL process error. All potential and anticipated errors shall be
assigned a unique error code and description. Based on the error that occurred, the error
handling routine should be designed to capture the appropriate error code.
Table 2. Information logged in event of ETL process error
TRA04.05-S25. Each ETL process will document its business purpose and technology
components.
Each ETL process job will specify the supported business purpose and the technology
components that directly feed data to the ETL job or consume data directly from the ETL job.
ETL jobs will not be migrated to production without documentation of the business
purpose. In addition, each ETL process will include the following documentation:
Programmer
Name of the programmer who last updated the code.
Version
Version information and update history.
Last update date
Date of last code update.
Dependency
Dependencies on other components as well as which components are dependent upon
this process.
5.8.3. Clustering
One of the methods of handling systems-level failures and achieving high availability is
clustering: implementing a group of hosts that act like a single system to provide continuous
uptime. Clustering is generally used for load balancing and failover purposes to aid in making
the system highly available.
The MES clustered environment for ETL processing has multiple web servers and schedulers. This
environment provides failover capabilities for load balancing and scheduling to handle
situations such as one of the nodes in the system going down. The ETL tool shall have
a mechanism to replicate/synchronize the configuration across the server nodes.
5.8.4. Restartability
Restartability is the ability to restart an ETL job if a processing step fails to execute properly.
This avoids the need for manual cleanup before a failed job can restart. Depending on
the design of individual jobs, the ETL design will address the ability to restart processing at the
step where it failed as well as the ability to restart the entire ETL session.
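Restarting at the failed step can be approximated with a checkpoint that records completed steps, as in this sketch. The `run_job` function and the JSON checkpoint file are assumptions for illustration; ETL tools usually provide their own recovery mechanism, and each step still has to be safe to re-run from its beginning.

```python
import json
import os


def run_job(steps, checkpoint_path):
    """Run (name, callable) steps in order, skipping any finished in an earlier run."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)

    for name, action in steps:
        if name in done:
            continue  # completed before the failure; do not repeat
        action()      # an exception here leaves the checkpoint in place
        done.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)

    # Clean finish: remove the checkpoint so the next run starts from step one.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
```

Deleting the checkpoint file by hand before calling `run_job` corresponds to restarting the entire ETL session rather than resuming at the failed step.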
5.9. Scheduling
A job scheduler can be part of the ETL tool or a compatible application that enables the
developers and systems administrators to control and monitor the execution of ETL processes
across enterprises.
After design, development, and testing, ETL processes need to be scheduled to be executed on
the application server. These processes can be designed to be executed separately in a
sequence of simple standalone tasks or together as a combination of tasks.
Examples of standalone tasks include:
Creating an extract file from a database
Validating addresses against the USPS
An example of a sequence of tasks:
Extracting data from source table and loading the records into a staging table
Running transformation logic on staged data using data maps
Schedules define when and how often the ETL process or group of processes needs to be run.
ETL processes can run repeatedly based on an interval defined in the scheduler.
A common definition of the schedule contains the following information:
The date and time when a process should begin to run
The frequency at which the process should run
The date and time when the process should end its runs
The process identification details (e.g. the process name, process code, etc.)
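The four elements of a schedule definition listed above map naturally onto a small record type. The `Schedule` class and its field names below are hypothetical and do not reflect the configuration format of any particular scheduler.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Schedule:
    process_name: str    # process identification details
    process_code: str
    start: datetime      # when the process should begin to run
    every: timedelta     # the frequency at which it should run
    end: datetime        # when the process should end its runs

    def run_times(self):
        """Yield each planned run time from start through end, inclusive."""
        current = self.start
        while current <= self.end:
            yield current
            current += self.every
```

For example, a schedule starting at 1 AM with a 12-hour interval and an end time 24 hours later yields three run times.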
Table 3. Considerations for scheduling ETL process jobs describes a number of things to consider
when scheduling ETL processes.
Consideration | Description
Sequence | The sequence in which the process should run, for instance, reference data must be loaded first.
Duration | Processes are scheduled taking their end times into consideration. For example, if a process is supposed to be done by 8 AM and might take 3 hours to run, then it needs to be started before 5 AM.
Preconditions | Events that must occur before running the process, for instance, a particular process should be completed before the start of another process.
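The duration and sequencing considerations can be worked through in a few lines. The sketch below assumes Python 3.9 or later for `graphlib`; the function names and the process names in the usage note are illustrative only.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter


def latest_start(deadline, expected_duration):
    """Latest time a process can start and still finish by its deadline."""
    return deadline - expected_duration


def run_order(preconditions):
    """Order processes so that each runs only after its preconditions.

    preconditions maps a process name to the set of processes that must
    complete before it starts.
    """
    return list(TopologicalSorter(preconditions).static_order())
```

With a deadline of 8 AM and an expected duration of 3 hours, `latest_start` returns 5 AM, matching the example in the table. Calling `run_order({"claims": {"reference_data"}, "reference_data": set()})` places `reference_data` before `claims`.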
5.9.2. Reporting
The scheduler should be able to generate a report daily and/or on-demand to report on all the
processes executed and details of completion. The reports include the following details:
Process name and code
Completion status
Start time
End time
Duration
Name of the server on which the task was run
Exit code
6. Use Cases
This section provides a few examples of ETL processes in an operational environment that show
how standards are applied within the context of the MES.
system, the changes may have to be reflected in the claims and financial systems for invoices to
be processed correctly. In such cases, the provider management system becomes the source,
and the operational data store becomes the target. When the source data is being updated and
the target database needs to be in sync with the source database, the ETL process runs
periodically (in predefined time intervals, such as hourly, monthly, etc.) bringing in data from
the source database to the target. Figure 6. Incremental loads/deltas use case illustrates this
incremental load use case.
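One common way to implement a periodic incremental load is a watermark on a last-updated column, sketched below with SQLite standing in for the source and target databases. The `provider` table, its `updated_at` column, and the `incremental_load` function are assumptions for illustration; the document does not prescribe a specific change-detection technique.

```python
import sqlite3


def incremental_load(source, target, last_watermark):
    """Copy rows changed after last_watermark and return the new watermark."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM provider "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()

    with target:  # commits on success, rolls back if the load fails
        # INSERT OR REPLACE relies on id being the primary key in the target.
        target.executemany(
            "INSERT OR REPLACE INTO provider (id, name, updated_at) VALUES (?, ?, ?)",
            rows,
        )

    # Keep the old watermark when nothing changed in this interval.
    return rows[-1][2] if rows else last_watermark
```

Each scheduled run passes in the watermark returned by the previous run, so only rows added or updated since then are read from the source.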
business analytics. Data from one or more operational systems needs to be extracted and
copied into the data mart following steps similar to those listed in Section 6.1. Initial Load.
However, the transformation step takes place in the target system. Figure 8. Data loads into
data mart use case shows the steps to load data into the data mart using an E-L-T pattern.
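The E-L-T ordering, in which extracted data is loaded unchanged and then transformed by the target system, can be sketched as follows. SQLite stands in for the data mart, and the `stg_claim` and `mart_member_totals` tables and the `load_then_transform` function are hypothetical.

```python
import sqlite3


def load_then_transform(target, extracted_rows):
    """Load raw rows into staging, then transform inside the target database."""
    with target:  # one transaction: commit on success, roll back on error
        # Load: copy the extracted rows as-is into a staging table.
        target.executemany(
            "INSERT INTO stg_claim (member_id, amount_cents) VALUES (?, ?)",
            extracted_rows,
        )
        # Transform: the aggregation runs in the target's own SQL engine.
        target.execute(
            "INSERT INTO mart_member_totals (member_id, claim_count, total_cents) "
            "SELECT member_id, COUNT(*), SUM(amount_cents) "
            "FROM stg_claim GROUP BY member_id"
        )
```

The design point is that the ETL process only moves data, while the transformation work is pushed down to the target database.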
8. Appendices
8.1. Relevant Documentation
Document name | Dependency
TRA04.02 – Web Services | Development guide for web services and integration into the ESB.
TRA04.06 – Managed File Transfer (MFT) | Specifies the mechanism by which files are transported between source and destination locations.
TRA04.07 – Electronic Data Interchange (EDI) | Guidance on when to use the X12 standards, when other formats (XML, JSON) are allowed. Decoding, interpretation and encoding of X12 files.
TRA05 – The Enterprise Data Services (EDS) framework | Overview of the data architecture and the data flow.
TRA06 – Technology Products Portfolio | Overview of the technology products portfolio (TPP) and the process governing the TPP.
8.3. Acronyms
Term | Meaning
BI | Business Intelligence
ES | Enterprise Services
8.4. Glossary
Term Definition