
Base SAS® vs.

SAS® Data Integration Studio

Greg Nelson and Danny Grasse


• Outline

– Understanding ETL: An Overview
– What SAS approaches do we have?
– The 38 subsystems of ETL
– Key Areas for Comparison
1. Design and data profiling.
2. Source data extraction.
3. Transformation and loading.
4. Change data capture.
5. Quality review, auditing and exception handling/management.
6. Integration with the production environment and business process components.
– Summary and Conclusions

2
Overview

• ETL
• Data Warehousing 101
• Data Integration Studio
• “Consistent” version of the truth
• Credible information versus data quality

3
Corporate Information Factory

4
Ralph Kimball: A History of Excellence

• Father of “dimensional” data warehouse design
• The Data Warehouse Toolkit (I and II)
• The Data Warehouse Lifecycle Toolkit
• The Data Warehouse ETL Toolkit

5
The Data Integration Process

6
38 Subsystems

• 38 subsystems define your ETL strategy
– Design and data profiling
– Source data extraction
– Transformation and loading
– Change data capture
– Quality review, auditing and exception handling/management
– Integration with the production environment and business process components

7
38 Subsystems: Category 1

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

8
Design and Data Profiling

3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key–primary key relationships, data rule analysis, and value rule analysis.

• Design is often handled through “soft skills” in the SAS ecosystem
• Data profiling
– Base SAS – frequencies, crosstabs, macros and toolkits (see the sketch below)
– DataFlux software – data profiling on steroids
– DI Studio – data profiling currently supported through generalized metadata
• Data profiling is an analysis exercise, not a technical one
9
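A minimal Base SAS profiling sketch along the lines of the bullets above; sashelp.class stands in for a real source table, and the statistics requested are only illustrative:

proc freq data=sashelp.class nlevels;
  tables _all_ / missing;   /* distinct levels and frequencies per column */
run;

proc means data=sashelp.class n nmiss min max mean;
run;                        /* basic column-property profile for numerics */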
38 Subsystems: Category 2

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

10
Source Data Extraction

1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and
sorting at the source, proprietary data format conversions, and data staging after
transfer to ETL environment.

• Source data adapters (including data conversions and filters)
– SAS/Access products
– Data Step, SQL and DI Studio (see the sketch below)
• Push/pull/dribble
– How we move the data
• Filtering & Sorting
– How we select the data to be moved
• Data staging (versus accessing)
– How we land the data
11
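As a sketch of filtering at the source and landing the result in a staging library, assuming a licensed SAS/ACCESS engine (the Oracle connection details, table and column names are all illustrative):

libname src oracle user=etl_user password=XXXXXXXX path=dwh1;
libname stage '/data/staging';

proc sql;
  create table stage.orders as     /* stage (land) the extract locally */
  select order_id, customer_id, order_dt, amount
  from src.orders
  where order_dt >= '01JAN2024'd;  /* filter pushed to the source where possible */
quit;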
38 Subsystems: Category 3

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

12
Transformation and Loading

5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.
13
Transformation and Loading

• Conforming dimensions and facts
– Good old SAS code (Data Step, SQL, Formats) – all available in DI Studio
– Good design
• Creation of surrogate keys
– Hand coded in Base SAS (see the sketch below)
– “Automagic” in DI Studio
• Building summary tables and cubes
– Just another “target” in DI Studio
14
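One common hand-coded surrogate key pattern in Base SAS, assuming dw.dim_customer already exists and stage.new_customers holds only the rows to be added (all names are illustrative); this is the kind of logic the DI Studio loader transforms generate “automagically”:

proc sql noprint;
  select coalesce(max(customer_sk), 0) into :max_sk
  from dw.dim_customer;            /* current high-water mark */
quit;

data new_dim_rows;
  set stage.new_customers;
  customer_sk = &max_sk + _n_;     /* assign keys above the maximum */
run;

proc append base=dw.dim_customer data=new_dim_rows;
run;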
38 Subsystems: Category 4

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

15
Change Data Capture

2. Change data capture system. Source log file readers, source date and
sequence number filters, and CRC-based record comparison in ETL
system.

10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling the three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).

11. Late arriving dimension handler. Insertion and update logic for dimension
changes that have been delayed in arriving at the data warehouse.

16
Change Data Capture

• When we see new data coming in from the operational system, we have to decide how to handle that change. We have three options (a minimal Type 1 sketch follows this list):
– We can overwrite or update the old value (Type 1).
– We can create a new record, along with some mechanism for recreating historical references to that data depending on the date of the report being requested (Type 2).
– We can retain both values as alternative “realities”: we usually create a new column and put the old value in it to allow for alternative reporting (Type 3).

17
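A minimal Type 1 (overwrite) sketch using a match-merge; both tables are assumed sorted by the natural key cust_id, and a Type 2 load would instead append a new row carrying effective-date columns:

data dw.dim_customer;
  merge dw.dim_customer(in=in_dim)
        stage.changed_customers(keep=cust_id city);
  by cust_id;
  if in_dim;    /* keep existing rows; incoming city values overwrite in place */
run;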
Change Data Capture

• SAS Approaches
– Base SAS – very robust using macros
– Can control everything about the load
– DI Studio has limited coverage
– SAS does support CRC-based record comparisons (MD5 function – see the sketch below)
• DI Studio
– 3 types of loading techniques: update, refresh, append
– Type 1 & 2 are dropdowns; Type 2 SCD is a transform
– Doesn’t support Type 3 outside of transform code

18
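A sketch of the MD5-based comparison, assuming cust_id is the natural key, the dimension already stores a row_md5 hash of its attribute columns, and name/city/phone are the tracked attributes (all names illustrative):

data changed_rows;
  merge dw.dim_customer(in=in_old keep=cust_id row_md5
                        rename=(row_md5=old_md5))
        stage.customers(in=in_new);
  by cust_id;
  length row_md5 $32;
  row_md5 = put(md5(catx('|', name, city, phone)), $hex32.);
  if in_new and (not in_old or row_md5 ne old_md5);   /* new or changed rows */
run;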
38 Subsystems: Category 5

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

19
Quality Handling

4. Data cleansing system. Typically a dictionary driven system for complete parsing of
names and addresses of individuals and organizations, possibly also products or
locations. "De-duplication" including identification and removal usually of individuals
and organizations, possibly products or locations. Often uses fuzzy logic. "Surviving"
using specialized data merge logic that preserves specified fields from certain
sources to be the final saved versions. Maintains back references (such as natural
keys) to all participating original sources.

6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.

7. Quality screen handler. In-line ETL tests applied systematically to all data flows, checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.
20
Quality Handling

• Detecting errors
• Handling them
• Providing audit records

21
Quality Management

• Detecting errors
– SAS errors versus data errors
• DataFlux
– Data rationalization
– At the point of data entry
• Base SAS
– If-then-else routines (lookup tables, formats – see the sketch below)
• DI Studio
– Not much beyond Base SAS

22
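A small Base SAS quality screen of the if-then-else/format variety mentioned above; the valid codes and table names are illustrative:

proc format;
  value $state_ok 'NC', 'SC', 'VA' = 'OK'
                  other            = 'BAD';
run;

data stage.clean stage.exceptions;
  set stage.customers;
  if put(state, $state_ok.) = 'OK' then output stage.clean;
  else output stage.exceptions;    /* route failures to an exception table */
run;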
Audit trail

• Base SAS
– Log parsing routines (see the sketch below)
• DI Studio
– Workspace server logs
• Event System
– Detailed logs, summary logs and event
triggers

23
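A minimal log-parsing sketch for a saved SAS log; the log path is illustrative:

data work.log_issues;
  infile '/logs/etl_nightly.log' truncover;
  input logline $char256.;
  if upcase(logline) =: 'ERROR' or upcase(logline) =: 'WARNING';
run;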
Exception Handling

• Base SAS
– Macros, PUT statements in the log file (see the sketch below)
• DI Studio
– Simple email, exception tables and log
file
• Event System
– Subscribe to events
– Responds to errors, warnings, notes and
custom assertions
24
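A sketch of the macro/PUT-statement approach combined with simple email notification via FILENAME EMAIL; it assumes the SAS session is configured for SMTP, and the step name and address are illustrative:

%macro check_step(stepname);
  %if &syserr > 0 %then %do;
    %put ERROR: Step &stepname ended with SYSERR=&syserr;
    filename mail email to='etl-admin@example.com'
                        subject="ETL failure: &stepname";
    data _null_;
      file mail;
      put "Step &stepname ended with SYSERR=&syserr";
    run;
  %end;
%mend check_step;

%check_step(load_dim_customer)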
38 Subsystems: Category 6

Design and Data Profiling

Source data extraction

Transformation and loading

Change data capture

Quality and exception handling

Productionization

25
Productionization of SAS ETL

26. Job scheduler.
27. Workflow monitor.
28. Recovery and restart system.
29. Parallelizing/pipelining system.
30. Problem escalation system.
31. Version control system.
32. Version migration system.
33. Lineage and dependency analyzer.
34. Compliance reporter.
35. Security system.
36. Backup system.
37. Metadata repository manager.
38. Project management system.
26
Productionization of SAS

• Version control, change control, promotion, backup and recovery
– None of these is available in Base SAS
– Version control – minimal for multi-developer access
– Change management – partially available in DI Studio
– Automated promotion – weak and/or not available
– Backup – the metadata server can be backed up (but there is no rollback feature)

27
Productionization of SAS

• Scheduling, dependency management and restartability, including parallelization.
– Provided by the LSF Scheduler
– Managed by the person doing the scheduling, not the person writing the code
– LSF provides parallelization, but also 'grid' computing with the associated 'pipelining' of steps
28
Productionization of SAS

• Metadata management and impact analysis.
– Very good in DI Studio

29
Productionization of SAS

• Project management and problem escalation.
– Not in scope for DI Studio

30
Summary

• DI Studio is a code generator
• It can do everything Base SAS can do
• Metadata and security are the key reasons to use DI Studio
• DI Studio writes “better” code in some cases
• Challenges: change control and what happens when things go wrong
31
ThotWave Technologies
Thinking Data

  How to reach us...

Danny Grasse
Senior Consultant
dgrasse@thotwave.com

Greg Nelson
CEO and Founder
greg@thotwave.com
32
