
Best Practices in Extraction

• Data profiling should be done on the source data to analyze it and to ensure data quality and completeness against business requirements (a minimal profiling sketch follows this list).
• A logical data mapping describing the source elements, target elements and the transformations between them should be prepared; this is often referred to as a Source-to-Target Mapping (an illustrative mapping entry follows this list).
  o Transformation rules should be explicitly stated for each target column, taking care of necessary data conversions.
  o This Source-to-Target information acts as metadata that can be used by developers as well as by the QA team and end users to understand and test the ETL process.
• On completion of the logical data mapping, a review walkthrough of it should be done with the ETL team before any actual coding begins.
• In case the source system is a database, the extract SQL should be written with full attention and should be reviewed by a DBA for optimal performance.
• In case the source system provides feed files, each feed should be checked for duplication (see the signature-check sketch after this list).
  o This check can be done by generating a unique 32-byte file signature from the first 1 MB of data read from the feed file and comparing it with the signatures of all previous days' feed files. If the file signatures match, the current feed should be treated as a duplicate and a process should notify the support team to contact the source system.
  o This is an audit check to prevent processing the same feed again.
• Names of the feed files to be extracted by a particular ETL process should be parameterized (see the parameter-file sketch after this list).
  o This helps keep code pushes to a minimum; whenever a new feed has to be added, you only need to add the file name to the parameter file, thereby improving the maintainability of the ETL process.
• Incremental extraction of only new, changed and deleted data from the feed/source table should be considered whenever possible (a watermark-based sketch follows this list).
  o This helps reduce the time taken for extraction, and the downstream ETL will also take less time.
• Capture the time taken for extraction along with the number of records processed (see the audit-logging sketch after this list).
  o This helps in building an automated process to reconcile data across ETL steps.
  o Such statistics, collected over a period of time, also help in analyzing the performance of the ETL process in conjunction with the audit data captured for data volume.
• Make sure to extract only the required columns from the source; this helps speed up extraction.
• Archive the source feeds for a minimum of a week, depending on how much space is available to you.
  o These feeds can be useful for quick data recovery in case of data loss.
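A minimal sketch of the profiling step described above, using pandas. The checks shown (data types, NULL counts, distinct counts, numeric ranges) are only a starting point, and the in-memory DataFrame stands in for the real source feed:

import pandas as pd

# Stand-in for the real source feed; in practice this would be
# pd.read_csv("<feed file>") or a query against the source table.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, None],
    "signup_date": ["20240101", "20240102", None, "2024-01-04"],
    "amount": [10.5, None, 7.25, 3.0],
})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(2),
    "distinct_values": df.nunique(),
})
print(profile)

# Numeric ranges help spot outliers and impossible values early.
print(df.describe())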
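The Source-to-Target Mapping can be kept as structured metadata rather than free text, so developers, the QA team and end users all read the same definition. The field names and rules below are illustrative assumptions, not a prescribed standard:

# Illustrative Source-to-Target Mapping entries; the table/column names and
# rule wording are assumptions for this sketch.
source_to_target_mapping = [
    {
        "source_table": "STG_CUSTOMER",
        "source_column": "CUST_NM",
        "target_table": "DIM_CUSTOMER",
        "target_column": "CUSTOMER_NAME",
        "transformation_rule": "TRIM and convert to upper case",
    },
    {
        "source_table": "STG_CUSTOMER",
        "source_column": "SIGNUP_DT",
        "target_table": "DIM_CUSTOMER",
        "target_column": "SIGNUP_DATE",
        "transformation_rule": "Parse as YYYYMMDD; default to 1900-01-01 when NULL",
    },
]

for entry in source_to_target_mapping:
    print(f"{entry['source_table']}.{entry['source_column']} -> "
          f"{entry['target_table']}.{entry['target_column']}: "
          f"{entry['transformation_rule']}")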
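A sketch of the duplicate-feed audit check. It assumes an MD5 hex digest (32 characters) over the first 1 MB serves as the file signature and that previous signatures are kept in a plain text file; both choices, and the file names, are assumptions for illustration:

import hashlib
from pathlib import Path

SIGNATURE_LOG = Path("feed_signatures.txt")  # assumed store of prior signatures

def feed_signature(feed_path, nbytes=1024 * 1024):
    """Hash the first 1 MB of the feed to get a compact signature."""
    with open(feed_path, "rb") as fh:
        return hashlib.md5(fh.read(nbytes)).hexdigest()

def is_duplicate_feed(feed_path):
    """Return True when the signature matches any previously recorded feed."""
    signature = feed_signature(feed_path)
    seen = set(SIGNATURE_LOG.read_text().split()) if SIGNATURE_LOG.exists() else set()
    if signature in seen:
        return True  # caller should alert support to contact the source system
    with open(SIGNATURE_LOG, "a") as fh:
        fh.write(signature + "\n")
    return False

# Usage (the feed path is a placeholder):
#   if is_duplicate_feed("daily_customer_feed.dat"):
#       notify the support team instead of processing the feed again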
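A sketch of parameterized feed names, assuming a feeds.ini parameter file read with configparser; the section name, key and feed names are made up for this example:

import configparser

# Contents of an assumed feeds.ini parameter file; the real process would
# call config.read("feeds.ini") instead of using an inline string.
config = configparser.ConfigParser()
config.read_string("""
[customer_etl]
feed_files = customer_feed.dat, customer_address_feed.dat
""")

feed_files = [name.strip()
              for name in config["customer_etl"]["feed_files"].split(",")]

for feed in feed_files:
    # Adding a new feed only means adding its name to feeds.ini, no code push.
    print("Extracting", feed)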
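A watermark-based sketch of incremental extraction covering new and changed rows (deletes usually need change-data-capture or a soft-delete flag). The table, column and watermark-file names are assumptions, and SQLite stands in for the real source database:

import sqlite3
from pathlib import Path

WATERMARK_FILE = Path("last_extract_ts.txt")  # assumed watermark store

def incremental_extract(conn):
    """Pull only rows changed since the previous run, using a stored watermark."""
    last_ts = (WATERMARK_FILE.read_text().strip()
               if WATERMARK_FILE.exists() else "1900-01-01 00:00:00")
    rows = conn.execute(
        "SELECT id, name, updated_at FROM source_orders WHERE updated_at > ?",
        (last_ts,),
    ).fetchall()
    if rows:
        # Advance the watermark to the newest change just extracted.
        WATERMARK_FILE.write_text(max(row[2] for row in rows))
    return rows

# Tiny in-memory stand-in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, name TEXT, updated_at TEXT)")
conn.execute("INSERT INTO source_orders VALUES (1, 'A', '2024-01-02 10:00:00')")
print(incremental_extract(conn))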
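A sketch of capturing extraction time and record counts into a simple audit log; the CSV file name and column layout are assumptions, and in practice these statistics would usually land in an audit table:

import csv
import time
from datetime import datetime

def log_audit(step, source_name, record_count, elapsed_seconds,
              audit_path="etl_audit_log.csv"):
    """Append one audit row per ETL step; the layout here is an assumption."""
    with open(audit_path, "a", newline="") as fh:
        csv.writer(fh).writerow([
            datetime.now().isoformat(timespec="seconds"),
            step, source_name, record_count, round(elapsed_seconds, 3),
        ])

start = time.perf_counter()
records = [("C001", "2024-01-01"), ("C002", "2024-01-02")]  # stand-in extract
log_audit("extract", "customer_feed.dat", len(records), time.perf_counter() - start)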

Best Practices in Transformation
• Filter out the data that should not be loaded into the data warehouse as the first step of transformation.
• Identify common transformation processes to be used across different transformation steps, within the same or across different ETL processes, and implement them as a common reusable module that can be shared (a small shared-module sketch follows this list).
  o This ensures that shared transformation logic is available centrally in one place and used by different ETL processes. It also prevents coding mistakes that might be introduced while coding the same logic in multiple places, and any future change needs to be made only once to the common module, thus improving productivity and maintainability.
• Add lineage to the warehouse: a batch identifier, ETL PROC ID (Process Identifier), not only ties the auditing metadata together for reporting and isolation, but is also useful in the warehouse to identify the record source. Every dimension and fact record originated from a specific data load, and this column identifies that load (see the lineage sketch after this list).
  o Any downstream transformations can utilize this column for updates, inserts and tracking. This column is also useful for data validation as well as for manual corrections needed in case of data corruption.
• While converting a source DATE into a target DATE column, validate that the source DATE is in the expected format (see the validation sketch after this list).
  o This helps in the correct transformation of DATE columns and also prevents ETL process failure when trying to convert an unexpected source DATE.
• NULL values received in source data for loading into a dimension column should be set to a default value (a lookup-default sketch follows this list).
  o This ensures that queries based on such columns do not leave out records with NULL data in those columns.
  o An example: while loading a dimension or fact table, if a lookup fails then assign -1 to the target column.
• The extract and transformation process should try to minimize resource utilization on the database server; this can be done by unloading the required data to the ETL layer and using it there for processing.
  o This helps reduce the impact of the ETL process on user query performance.
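A sketch of a shared transformation module; the module and function names are illustrative. The point is that every ETL process imports the same logic instead of re-coding it:

# common_transforms.py -- shared transformation logic, maintained in one place
# (module and function names are assumptions for this sketch).

def standardize_name(value):
    """Trim whitespace and normalise case for name-type columns."""
    return " ".join(value.split()).upper() if value else None

def to_decimal_amount(value, default=0.0):
    """Convert a free-text amount such as '1,234.50' into a float."""
    try:
        return float(str(value).replace(",", ""))
    except (TypeError, ValueError):
        return default

# Any ETL process can then reuse the same logic:
#   from common_transforms import standardize_name, to_decimal_amount
print(standardize_name("  acme   corp "))   # -> "ACME CORP"
print(to_decimal_amount("1,234.50"))        # -> 1234.5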
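A sketch of stamping records with a batch identifier before loading; the underscored ETL_PROC_ID column name and the timestamp-based identifier format are assumptions:

from datetime import datetime

def new_etl_proc_id():
    """Generate a batch identifier for this load, e.g. 20240115103000 (assumed format)."""
    return int(datetime.now().strftime("%Y%m%d%H%M%S"))

def stamp_lineage(records, etl_proc_id):
    """Add the batch identifier to every transformed record before loading."""
    return [dict(record, ETL_PROC_ID=etl_proc_id) for record in records]

batch_id = new_etl_proc_id()
loaded = stamp_lineage([{"customer_id": 42, "amount": 10.5}], batch_id)
print(loaded)  # every dimension/fact row now carries the load that produced it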
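A sketch of validating a source DATE against its expected format before conversion; the YYYYMMDD format is an assumption, and invalid values come back as None so the caller can reject or default them instead of failing the whole process:

from datetime import datetime

def parse_source_date(value, fmt="%Y%m%d"):
    """Validate the source DATE against the expected format before converting it."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except (AttributeError, ValueError):
        return None  # route to a reject/default handler instead of failing the ETL

print(parse_source_date("20240131"))    # -> 2024-01-31
print(parse_source_date("31/01/2024"))  # -> None, unexpected format caught safely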
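A sketch of defaulting NULL and failed-lookup values to -1 when resolving a dimension surrogate key; the lookup dictionary and key names are made up for the example:

UNKNOWN_KEY = -1  # default surrogate key for NULLs and failed lookups

def resolve_customer_key(natural_key, customer_lookup):
    """Return the surrogate key, or -1 when the value is NULL or the lookup fails."""
    if natural_key is None:
        return UNKNOWN_KEY
    return customer_lookup.get(natural_key, UNKNOWN_KEY)

customer_lookup = {"C001": 1001, "C002": 1002}  # assumed dimension lookup
for nk in ("C001", "C999", None):
    print(nk, "->", resolve_customer_key(nk, customer_lookup))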

Best Practices in Loading
• Loading transformed data into target tables should be done in a separate (designated) loading window during which no user queries are fired.
  o This ensures that there is no contention for database server resources between the loading process and user queries.
• If loading large volumes of data into a high-volume target table that has indexes, drop the indexes on the target table before loading and recreate them after completion of the load (see the sketch after this list).
  o This improves the time taken to insert into and update such tables.
• Use the Oracle Exchange Partition feature to the extent possible, to keep the loading time to a minimum. All the processing required for a partition exchange should be completed as part of the transformation phase (a DDL sketch follows this list).
  o This approach results in minimum downtime in data warehouse availability for user queries.
• Reconcile the count of records at the end of the loading step (including records that are loaded into the target table plus records that are rejected, if any) against the count of records at the end of the transform step. Both counts should match (see the reconciliation sketch after this list).
  o This is an audit check to ensure that all transformed records are accounted for and nothing is missed during the loading phase.
