Base SAS® vs. SAS® Data Integration Studio: Greg Nelson and Danny Grasse
Overview
• ETL
• Data Warehousing 101
• Data Integration Studio
• “Consistent” version of the truth
• Credible information versus Data quality
Corporate Information Factory
Ralph Kimball History Excellence
The Data Integration Process
38 Subsystems
38 Subsystems: Category 1
Productionization
Design and Data Profiling
Source Data Extraction
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and
sorting at the source, proprietary data format conversions, and data staging after
transfer to ETL environment.
Transformation and Loading
5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.
Change Data Capture
2. Change data capture system. Source log file readers, source date and
sequence number filters, and CRC-based record comparison in ETL
system.
11. Late arriving dimension handler. Insertion and update logic for dimension
changes that have been delayed in arriving at the data warehouse.
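The record comparison described in subsystem 2 can be sketched roughly as follows. This is an illustration in Python rather than SAS, and it uses an MD5 digest in place of a true CRC. The function names `row_digest` and `detect_changes`, and the record layout, are assumptions made for the example.

```python
import hashlib

def row_digest(row, columns):
    """Return an MD5 hex digest over the tracked columns of one record."""
    # A unit-separator character keeps ("ab", "c") and ("a", "bc") distinct.
    payload = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def detect_changes(previous, current, key, columns):
    """Compare two extracts by digest and classify each business key."""
    old = {r[key]: row_digest(r, columns) for r in previous}
    result = {"insert": [], "update": [], "unchanged": [], "delete": []}
    seen = set()
    for r in current:
        k = r[key]
        seen.add(k)
        if k not in old:
            result["insert"].append(k)
        elif old[k] != row_digest(r, columns):
            result["update"].append(k)
        else:
            result["unchanged"].append(k)
    # Keys present in the previous extract but missing now.
    result["delete"] = [k for k in old if k not in seen]
    return result
```

Storing only the digest of the previous extract, rather than the full rows, is the usual reason for comparing by checksum.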
Change Data Capture
• SAS Approaches
– Base SAS: very robust using macros
– Can control everything about the load
– DI Studio has limited coverage
– SAS supports CRC-style record comparison (via the MD5 function)
• DI Studio
– 3 types of loading techniques: update, refresh, append
– Type 1 and Type 2 are dropdowns; Type 2 SCD is a transform
– Does not support Type 3 outside of transform code
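For readers unfamiliar with Type 2 slowly changing dimensions, the load logic can be sketched as below. This is a minimal Python illustration, not the DI Studio transform or a Base SAS macro. The column names (`sk`, `valid_from`, `valid_to`), the `HIGH_DATE` marker, and the function `apply_type2` are assumptions for the example, and it assumes at most one incoming record per business key.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "still current"

def apply_type2(dimension, incoming, key, tracked, load_date):
    """Expire changed current rows and append new versions (Type 2).

    The dimension list is updated in place and also returned.
    """
    next_sk = max((r["sk"] for r in dimension), default=0) + 1
    current = {r[key]: r for r in dimension if r["valid_to"] == HIGH_DATE}
    for rec in incoming:
        existing = current.get(rec[key])
        if existing is not None and all(existing[c] == rec[c] for c in tracked):
            continue  # nothing changed, keep the current version
        if existing is not None:
            existing["valid_to"] = load_date  # close the old version
        new_row = {"sk": next_sk, key: rec[key],
                   "valid_from": load_date, "valid_to": HIGH_DATE}
        new_row.update({c: rec[c] for c in tracked})
        dimension.append(new_row)
        next_sk += 1
    return dimension
```

A Type 1 load would instead overwrite the tracked columns on the existing row, keeping no history.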
38 Subsystems: Category 5
Quality Handling
4. Data cleansing system. Typically a dictionary driven system for complete parsing of
names and addresses of individuals and organizations, possibly also products or
locations. "De-duplication" including identification and removal usually of individuals
and organizations, possibly products or locations. Often uses fuzzy logic. "Surviving"
using specialized data merge logic that preserves specified fields from certain
sources to be the final saved versions. Maintains back references (such as natural
keys) to all participating original sources.
7. Quality screen handler. Inline ETL tests applied systematically to all data flows
checking for data quality issues. One of the feeds to the error event handler (see
subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL
error events. Includes branching logic to handle various classes of errors, and
includes real-time monitoring of ETL data quality.
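The relationship between quality screens (subsystem 7) and the error event handler (subsystem 8) can be sketched as follows. This is a hedged Python illustration, not a SAS or DataFlux implementation. The screen factories `not_missing` and `in_range` and the error-event layout are invented for the example.

```python
def not_missing(column):
    """Build a screen that passes when the column has a non-empty value."""
    def screen(row):
        return row.get(column) not in (None, "")
    screen.__name__ = f"not_missing_{column}"
    return screen

def in_range(column, low, high):
    """Build a screen that passes when low <= value <= high."""
    def screen(row):
        value = row.get(column)
        return value is not None and low <= value <= high
    screen.__name__ = f"in_range_{column}"
    return screen

def run_screens(rows, screens):
    """Apply every screen to every row; failures become error events."""
    clean, error_events = [], []
    for index, row in enumerate(rows):
        failed = [s.__name__ for s in screens if not s(row)]
        if failed:
            error_events.extend({"row": index, "screen": name} for name in failed)
        else:
            clean.append(row)
    return clean, error_events
```

Here a failing row is simply held back from the clean output; a fuller handler would branch on the class of error, as the subsystem description suggests.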
Quality Handling
• Detecting errors
• Handling them
• Providing audit records
Quality Management
• Detecting errors
– SAS errors versus data errors
• DataFlux
– Data rationalization
– At the point of data entry
• Base SAS
– IF-THEN/ELSE routines (lookup tables, formats)
• DI Studio
– Little beyond what Base SAS provides
Audit trail
• Base SAS
– Log parsing routines
• DI Studio
– Workspace server logs
• Event System
– Detailed logs, summary logs and event triggers
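A log parsing routine of the kind mentioned for Base SAS can be sketched as below. It is written in Python for illustration and relies only on the fact that SAS log messages begin with `NOTE:`, `WARNING:` or `ERROR:` (optionally with a message number, as in `ERROR 22-322:`). The function names and the event layout are assumptions.

```python
import re
from collections import Counter

# Matches "NOTE: ...", "WARNING: ...", "ERROR: ..." and "ERROR 22-322: ...".
LOG_LINE = re.compile(r"^(ERROR|WARNING|NOTE)(?: \d+-\d+)?:\s*(.*)$")

def parse_log(lines):
    """Return one event per log message line, with its line number."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            events.append({"line": number,
                           "level": match.group(1),
                           "text": match.group(2)})
    return events

def summarize(events):
    """Count events per level, giving a summary log alongside the detail."""
    return Counter(e["level"] for e in events)
```

The detailed event list and the per-level counts correspond loosely to the detailed and summary logs listed above; continuation lines of multi-line messages are ignored in this sketch.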
Exception Handling
• Base SAS
– Macros; PUT statements written to the log file
• DI Studio
– Simple email, exception tables and log file
• Event System
– Subscribe to events
– Responds to errors, warnings, notes and custom assertions
38 Subsystems: Category 6
Productionization of SAS ETL
Productionization of SAS
• Scheduling, dependency management and restartability, including parallelization.
– Provided by LSF Scheduler
– Managed by the person doing the scheduling, not the person writing the code
– LSF provides parallelization, but also 'grid' computing with the associated 'pipelining' of steps
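Dependency management and restartability can be illustrated with a small Python sketch. It does not model LSF Scheduler or its interfaces. The function `run_flow`, the job names, and the checkpoint structure are assumptions, and jobs run one at a time here, so parallelization is not shown.

```python
from graphlib import TopologicalSorter

def run_flow(jobs, dependencies, completed=None):
    """Run jobs in dependency order, skipping those already completed.

    jobs:         name -> zero-argument callable
    dependencies: name -> set of prerequisite job names
    completed:    names finished in an earlier run (the checkpoint)
    """
    completed = set(completed or ())
    graph = {name: dependencies.get(name, ()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        if name in completed:
            continue  # restart: do not repeat finished work
        try:
            jobs[name]()
        except Exception as exc:
            # Stop at the first failure and report where to resume.
            return {"completed": completed, "failed": name, "error": str(exc)}
        completed.add(name)
    return {"completed": completed, "failed": None, "error": None}
```

Passing the `completed` set from a failed run back into the next call resumes the flow at the failed job rather than from the beginning.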
Summary
Danny Grasse
Senior Consultant
dgrasse@thotwave.com
Greg Nelson
CEO and Founder
greg@thotwave.com