
Practical Tips to Improve Data Load Performance and Efficiency

Joe Darlak
Comerit

© 2010 Wellesley Information Services. All rights reserved.


In This Session ...
• Learn how to improve data loading performance by up to 75% by
applying proven optimization methods to data modeling;
extraction, transformation, and loading (ETL) processes; and
process chain design in SAP NetWeaver Business Warehouse.
• Discover how each decision made when architecting a BI system,
designing a data model, or developing ETL logic can have a
significant impact on data load performance.
• Receive best practices to maximize data load performance while
reducing long-term maintenance costs, such as using portable
ETL code and eliminating hard-coding in your ETL logic.

2
In This Session ... Continued
• Find out how to enable version history to track code changes and
how to create reusable ETL logic to improve throughput and
reduce data load time.
• Get tips on when and how to use customer exits in DataSources
and variables to manage risk and reduce maintenance costs.
• Identify the challenges and benefits of semantic partitioning and
the importance of efficient data models.
• Take home a checklist to ensure your data models are optimally
designed.

3
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

4
How to Leverage BW ETL Architecture 1
• Implement a Layered Scalable Architecture (LSA):
 Create multiple data warehouse layers (e.g., a DSO layer)

 Employ the "touch it, take it" principle

 Enable deltas to subsequent data targets to add stability

 Eliminate reporting impact due to overlapping requests

• Keep data normalized to reduce redundancy


 Header ODS and detail ODS

• De-normalize to improve performance


• Limit transformation logic on extract
 Minimize the risk of re-loads caused by logic errors

• Lookups on extract reduce timing constraints on loads

5
How to Leverage BW ETL Architecture 2
• Illustration: Sample Dataflow Diagram

6
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

7
Data Modeling 1: Overview
• Data modeling is still important!!!
 BWA does not give license to design poorly

 Data model design still impacts load performance

 BWA memory is expensive (licensing, H/W and service cost)

• Manage granularity
 Do not add free text fields to cubes

 Minimize use of different dates and/or document/line item detail

 Use Report-to-report interface to provide details when needed

• Think ahead
 Semantic partitioning

 Data retention policy (archiving)

8
Data Modeling 2: Defining Dimensions
• Use as many dimensions as possible
 Separate common filter characteristics into own dimension

• Use line-item dimensions for high cardinality characteristics


 Do not set the high cardinality flag!

• Define related characteristics in the same dimension


 Calculate expected number of dimensional entries

 Try not to exceed 10% of expected fact table entries

• Add all relevant time characteristics


 If 0CALMONTH is the lowest granularity, add 0CALMONTH2,
0CALQUARTER, 0CALQUART1, 0HALFYEAR1 and 0CALYEAR
 Provides greatest reporting flexibility without need to reload
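As a rough check of the 10% guideline above (figures are illustrative only):
with 10 million expected fact table rows, a dimension combining Sales
Organization (50 values) and Distribution Channel (4 values) has at most
50 x 4 = 200 entries, far below the threshold. Combining Customer (100,000
values) and Material (50,000 values) could in the worst case approach
100,000 x 50,000 combinations, well beyond 10% of 10 million (1 million
entries), a sign that those characteristics belong in separate or
line-item dimensions.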

9
Data Modeling 3: Semantic Partitioning
• What is it?
 An architectural design to enable parallel data loading and
query execution
 Partitioning criteria: Year, Region or Actual/Plan

10
Data Modeling 4: Semantic Partitioning
• Benefits of Semantic Partitioning:
 Reduction in BWA footprint (when partitioned by year)

 Parallel data loading (when not partitioned by year)

 Parallel query execution

 Best case when partitioning criterion is set as constant

 Almost as good to create variables to filter on 0INFOPROV

 Archival of a single InfoCube does not impact others

 Easier DB maintenance

Performance benefits are so significant…


Semantic Partitioning should be deployed
on virtually every data model!
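A minimal sketch of the 0INFOPROV filter variable approach (customer exit
EXIT_SAPLRRS0_001, include ZXRSRU01). The variable names ZVAR_INFOPROV and
ZVAR_YEAR and the cube naming convention ZSD_C1_<year> are assumptions for
illustration only:

DATA: l_s_range     TYPE rsr_s_rangesid,
      loc_var_range LIKE rrrangeexit.

CASE i_vnam.
  WHEN 'ZVAR_INFOPROV'.
    IF i_step = 2.                              "after user entry
      READ TABLE i_t_var_range INTO loc_var_range
           WITH KEY vnam = 'ZVAR_YEAR'.
      IF sy-subrc = 0.
        CLEAR l_s_range.
*       Derive the partition cube name from the year the user entered
        CONCATENATE 'ZSD_C1_' loc_var_range-low INTO l_s_range-low.
        l_s_range-sign = 'I'.
        l_s_range-opt  = 'EQ'.
        APPEND l_s_range TO e_t_range.
      ENDIF.
    ENDIF.
ENDCASE.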

11
Data Modeling 5: Semantic Partitioning
• Example: Semantic partitioning by year
(Diagram) A MultiProvider sits over a summarized History cube and five
yearly InfoCubes (Current Year – 3 through Current Year + 1). These are fed
from a write-optimized layer (no SIDs) partitioned the same way (ALL years
plus Current Year – 3 through Current Year + 1), which in turn is loaded
from the DataSource.
Ex: Current Year + 1 = 2010, Current Year = 2009, Current Year – 1 = 2008,
Current Year – 2 = 2007, Current Year – 3 = 2006

12
Data Modeling 6: Data Retention Policy
• Develop and implement a data retention strategy to effectively
manage data as it ages
• Use a combination of approaches:
 Aggregated history cubes

 Near-line storage

 Traditional archiving

 Data deletion

Up front planning will significantly reduce


implementation cost later and allow for a
common scalable approach

13
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

14
Extraction 1 Overview
• Focus on R/3 extraction
• SAP delivers over 1,000 pre-developed DataSources
 Still doesn’t cover all SAP extraction requirements

 In addition, custom tables and customer-enhanced tables need


their own extractors or enhancements to delivered ones
• Defining a flexible yet consistent strategy to deal with the many
different extraction scenarios you will face is an important up-front
task for any BW project and/or architect

15
Extraction 2: To Enhance Or Not To Enhance?
• Enhance business content (create user exit) if:
 DataSource is delta-enabled

 Extraction method is "function module" (e.g., LIS


extractors)
 Extraction method is "View" and required fields do not
exist in base tables or check tables (or other joinable
tables)
• Create a generic DataSource if:
 New view could contain all necessary fields

 Function module can be copied and modified to provide


better performance

16
Extraction 3: Coding Tips – Dynamic Calls
• Code the extractor user exits so that they call a dynamic
program per DataSource
 Isolate the code per DataSource in a self-contained
program
 Minimize risk that a syntax error in code for one
DataSource impacts extraction from all other DataSources
• Example
 Program name = 'YBW' + <DataSource name>

 Form name = 'DOYBW' + <DataSource name>

• This same technique can be used with customer exit variable


code

17
Extraction 4 : User Exit: Program Calls
• Illustration: Sample dynamic program call
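A minimal sketch of the dynamic call inside the DataSource user exit (e.g.,
EXIT_SAPLRSAP_001 for transaction data), following the naming convention on
the previous slide; the variable names are illustrative:

DATA: l_prog TYPE progname,
      l_form(60) TYPE c.

CONCATENATE 'YBW'   i_datasource INTO l_prog.
CONCATENATE 'DOYBW' i_datasource INTO l_form.

* Each DataSource has its own program, so a syntax error in one
* program cannot break extraction for the other DataSources
PERFORM (l_form) IN PROGRAM (l_prog) IF FOUND
  TABLES c_t_data.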

18
Extraction 5: Coding Tips – Field Symbols
• Performance consideration: where possible, use field symbols to
populate fields in the data package
 The move costs of a LOOP ... INTO statement depend on the
size of a table line. The larger the line size, the longer the move
will take
 By applying a LOOP ... ASSIGNING statement you can attach a
field-symbol to the table lines and operate directly on the line
contents
 This is a much faster way to access the internal table lines
without moving their contents

19
Extraction 6: User Exit: Field Symbols
• Illustration: Sample use of field symbols
User Exit (without field symbols):

REPORT YBWZDS_AGR_USER.
*********************************************************************
* Form called dynamically must start with DOYBW + <DataSource>     *
*********************************************************************
FORM DOYBWZDS_AGR_USER
  TABLES C_T_DATA STRUCTURE ZOXBWD0001.

  data: l_logsys type logsys,
        l_s_data like ZOXBWD0001.

  select single logsys from t000
    into l_logsys
    where mandt = sy-mandt.

  loop at c_t_data into l_s_data.
    l_s_data-load_dt = sy-datum.
    l_s_data-logsys  = l_logsys.
    modify c_t_data from l_s_data index sy-tabix.
  endloop.

ENDFORM.

User Exit (with field symbols):

REPORT YBWZDS_AGR_USER.
*********************************************************************
* Form called dynamically must start with DOYBW + <DataSource>     *
*********************************************************************
FORM DOYBWZDS_AGR_USER
  TABLES C_T_DATA STRUCTURE ZOXBWD0001.

  data: l_logsys type logsys.
  field-symbols: <fs> like c_t_data.

  select single logsys from t000
    into l_logsys
    where mandt = sy-mandt.

  loop at c_t_data assigning <fs>.
    <fs>-load_dt = sy-datum.
    <fs>-logsys  = l_logsys.
  endloop.

ENDFORM.
20
Extraction 7: Generic DataSources
• Improve extract performance by creating delta-enabled generic
DataSources
• Simple:
 By date

 By timestamp

 By sequential number (unique table key)

• Complex:
 Pointers – ABAP techniques can be used to record an array of
pointers to identify new and changed records

21
Extraction 8: Generic DataSources
• Illustration: Delta enabling a generic DataSource

22
Extraction 9: Architecture Tip
• Need to update an ODS or master data from multiple
sources?
 Rather than enhancing business content, consider using
multiple DataSources to load a single BW Object—as long
as the ODS or master data key is available to both DataSources
 Decrease regression testing

 Mitigate risk of re-loading delta initializations (win-win if


delta extractor is an LIS extractor)
 Perform a single activation (if ODS) or a single attribute
change run (if master data)

23
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

24
Transformation 1: Overview
• Common needs for transforming data:
 Aggregation

 Disaggregation (e.g., time distribution)

 Conversion

 Validation

 Filtering/deletion

 Creation (result tables)

 Lookups/merging

25
Transformation 2: Use 3.x or 7.x Technology?
• Architecture decision:
 Transfer Rules and Update Rules (3.x)?

 Or Transformations (7.x)?

• Transfer Rules and Update Rules are stable and proven


 Easier to track through the system (retain same technical id)

 Offer better performance

 Fewer transport bugs

• Transformations are new and improving


 Appear to be more flexible (perception only?)

 Visually more appealing

 The long-term standard?

26
Transformation 3: Transfer Rules
• Architecture:
 Only one InfoSource per DataSource

 Transfer routines are record by record—no consolidation, no


results table
 Changes to the number of records in the start routine are not reflected
in the PSA, which makes it difficult to link error messages from
monitor entries back to PSA records
• If communication structure feeds multiple data targets, then this
is the logical place for common transformations
• Use the start routine to maintain entire data package at one time
(good place to use field symbols)

27
Transformation 4: Master Data Transfer Rules
• If an InfoObject requires a common transformation across the
warehouse, code it in the InfoObject definition
• The transfer routine will now be available in all transfer rules
where the InfoObject is used
 You need to re-activate pre-existing transfer rules for a newly
added InfoObject routine to be recognized
• Allows for global conversion and/or validation of master data

28
Transformation 5: Master Data Transfer Rules
• Illustration: InfoObject Definition with Transfer Routine

29
Transformation 6: Architecture Tips
• Consider designing Level 1 ODS Objects to contain all possible
fields from source (if not LIS DataSource)
 Minimize maintenance and downtime later to add fields and
populate in live environment
 ODS Level 1 objects can then become the source for lookups
from other updates, thereby reducing redundant reads of
source tables in R/3
• Master data to multiple targets? Use flexible update rules
 Default communication structures for InfoObjects are the
attribute tables—here you can define custom ones and use
update rules from them to multiple data targets

30
Transformation 7: Lookups
• Do not use single selects for lookups!
• For better performance:
 Use start routines to read lookup data to an internal table

 Read internal table to populate field values in routines

• For best performance:


 Add lookup fields to InfoSource

 Use start routine and field symbols to populate blank


fields for entire data package at one time (see illustration
for DataSource user exit above)
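A minimal sketch of this pattern in a 3.x start routine: one database read
for the whole data package, then one pass with field symbols. The lookup
against 0MATERIAL attributes and the MATERIAL / MATL_GROUP fields in the
InfoSource are assumptions for illustration only:

DATA: BEGIN OF ls_mat,
        material   TYPE /bi0/oimaterial,
        matl_group TYPE /bi0/oimatl_group,
      END OF ls_mat.
DATA: lt_mat LIKE STANDARD TABLE OF ls_mat.

FIELD-SYMBOLS: <fs> LIKE LINE OF DATA_PACKAGE.

IF NOT DATA_PACKAGE[] IS INITIAL.
* One database read for all records of the data package
  SELECT material matl_group FROM /bi0/pmaterial
    INTO CORRESPONDING FIELDS OF TABLE lt_mat
    FOR ALL ENTRIES IN DATA_PACKAGE
    WHERE material = DATA_PACKAGE-material
      AND objvers  = 'A'.
  SORT lt_mat BY material.

* One pass over the package, no record copies
  LOOP AT DATA_PACKAGE ASSIGNING <fs> WHERE matl_group IS INITIAL.
    READ TABLE lt_mat INTO ls_mat
         WITH KEY material = <fs>-material BINARY SEARCH.
    IF sy-subrc = 0.
      <fs>-matl_group = ls_mat-matl_group.
    ENDIF.
  ENDLOOP.
ENDIF.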

31
Transformation 8: Program Includes
• Use includes for all complex routine logic
• Access logic by using "perform" statements
• Increase portability of transformation logic
 Use same read statements for multiple lookups

 Reduce risk of errors in obscure places

• Decrease maintenance cost of complex update rules


 One place to go to fix/enhance logic

 Code is consistent and easier to follow

• Enable version management of code


 Track changes over time

 Compare between systems

 Revert to previous versions

32
Transformation 9: Program Includes
• Illustration – Select into internal table
Start routine:

FORM startup
  TABLES   MONITOR STRUCTURE RSMONITOR        "user defined monitoring
           MONITOR_RECNO STRUCTURE RSMONITORS "monitoring with record n
           DATA_PACKAGE STRUCTURE DATA_PACKAGE
  USING    RECORD_ALL LIKE SY-TABIX
           SOURCE_SYSTEM LIKE RSUPDSIMULH-LOGSYS
  CHANGING ABORT LIKE SY-SUBRC.    "set ABORT <> 0 to cancel update
*$*$ begin of routine - insert your code only below this line *-*
* fill the internal tables "MONITOR" and/or "MONITOR_RECNO",
* to make monitor entries
  perform READ_USR02_TO_MEMORY_FOR_0BWTC_C02
    TABLES   MONITOR
             DATA_PACKAGE
    USING    RECORD_ALL
             SOURCE_SYSTEM
    CHANGING ABORT.
* if abort is not equal zero, the update process will be canceled
  ABORT = 0.
*$*$ end of routine - insert your code only before this line *-*
ENDFORM.

Update include:

************************************************************************
* INITIALIZATION (ONE-TIME PER DATA PACKET) ****************************
* TO READ FROM DATABASE (ALL RECORDS FOR DATA PACKAGE) *****************
************************************************************************
FORM READ_USR02_TO_MEMORY_FOR_0BWTC_C02
  TABLES   MONITOR STRUCTURE RSMONITOR        "user defined monitoring
           DATA_PACKAGE STRUCTURE /BIC/CS80BWTC_C02
  USING    RECORD_ALL LIKE SY-TABIX
           SOURCE_SYSTEM LIKE RSUPDSIMULH-LOGSYS
  CHANGING ABORT LIKE SY-SUBRC.    "ABORT <> 0 cancels update

* REFRESH ALL INTERNAL TABLES.
  REFRESH: GT_USR02.
* READ USR02 user data to memory
  select * into corresponding fields of table GT_USR02
    from USR02
    FOR ALL ENTRIES IN DATA_PACKAGE
    where BNAME = DATA_PACKAGE-TCTUSERNM
    order by primary key.

* if abort is not equal zero, the update process will be canceled
  ABORT = 0.
ENDFORM.                          "READ_USR02_TO_MEMORY_FOR_0BWTC_C02
33
Transformation 10: Program Includes
• Illustration – Include perform statements

Update routine:

FORM compute_key_field
  TABLES   MONITOR STRUCTURE RSMONITOR        "user defined monitoring
  USING    COMM_STRUCTURE LIKE /BIC/CS0BWTC_C02
           RECORD_NO LIKE SY-TABIX
           RECORD_ALL LIKE SY-TABIX
           SOURCE_SYSTEM LIKE RSUPDSIMULH-LOGSYS
  CHANGING RESULT LIKE /BI0/V0BWTC_C02T-USERGROUP
           RETURNCODE LIKE SY-SUBRC
           ABORT LIKE SY-SUBRC.    "set ABORT <> 0 to cancel update
*$*$ begin of routine - insert your code only below this line *-*
* fill the internal table "MONITOR", to make monitor entries
  PERFORM READ_GT_USR02
    USING    COMM_STRUCTURE-TCTUSERNM
             RECORD_NO
             RECORD_ALL
             SOURCE_SYSTEM
    CHANGING GS_USR02
             ABORT.
  RESULT = GS_USR02-CLASS.
* if abort is not equal zero, the update process will be canceled
*$*$ end of routine - insert your code only before this line *-*
ENDFORM.

Update include:

************************************************************************
* RECORD PROCESSING (RUN PER RECORD) ***********************************
* TO READ FROM MEMORY (ONE RECORD) *************************************
************************************************************************
FORM READ_GT_USR02
  USING    TCTUSERNM LIKE USR02-BNAME
           RECORD_NO LIKE SY-TABIX
           RECORD_ALL LIKE SY-TABIX
           SOURCE_SYSTEM LIKE RSUPDSIMULH-LOGSYS
  CHANGING GS_USR02
           ABORT LIKE SY-SUBRC.    "set ABORT <> 0 to cancel update

  STATICS: L_RECORD LIKE SY-TABIX.

  IF RECORD_NO <> L_RECORD.
    L_RECORD = RECORD_NO.
    clear GS_USR02.
*   Read user data from internal table GT_USR02
    read table GT_USR02
      with key BNAME = TCTUSERNM
      into GS_USR02.
  ENDIF.
ENDFORM.                          "READ_GT_USR02

34
Transformation 11: Update Rules - Results Tables
• Need to "create" data based on business logic
• Beware of hard-coding based on fields like document types
 New doc types can require enhancements/corrections to hard-
coded logic
 Such dependencies need to be communicated to business and
changes to logic need to become part of business process for
creating doc types
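One way to reduce this risk, sketched here with a hypothetical control table
ZBW_DOCTYPE (field BLART) and an assumed DOC_TYPE field in the communication
structure, is to drive the logic from configuration instead of literals, so
a new document type becomes a table entry rather than a code change:

* In the start routine: load the relevant document types once per package
DATA: lt_doctype TYPE STANDARD TABLE OF zbw_doctype,
      ls_doctype TYPE zbw_doctype.

SELECT * FROM zbw_doctype INTO TABLE lt_doctype.
SORT lt_doctype BY blart.

* In the update routine: test membership instead of IF/CASE on literals
READ TABLE lt_doctype INTO ls_doctype
     WITH KEY blart = comm_structure-doc_type BINARY SEARCH.
IF sy-subrc = 0.
* create the derived record for this document type
ENDIF.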

35
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

36
Load 1: Process Chain Strategy
• Split loads by frequency and criticality
 Separate daily loads from weekly, monthly, annual and ad-hoc
loads
 Within each frequency group, identify the critical path, and
remove non-essential loads
• Design chains based on Dataflow dependencies
 Remember the dataflow diagram?

• Within each chain, take advantage of parallel processing wherever


possible
 Not all loads need to be sequential

• Minimize parallel updates to BWA (competing changes to


common master data indexes can cause an abort)

37
Load 2: Process Chain Tips
• Process chains require explicit scheduling of all load events
previously handled by the InfoPackage
 Use "Only to PSA" and "Subsequent Update" to reduce number
of dialog processes spawned during loads
• If possible, schedule loads when users are off system
 Can then delete indexes prior to loads and re-create after

 Will result in poor query performance during loads if not using


BWA or aggregates
• Schedule deletion of PSA data by process chain
 A good rule of thumb is to delete PSA data once it is no longer
needed for recovery (after 8-30 days)

38
Load 3: Use Decision Variants
• Decision variants allow flexibility in chain logic
• For example, if you need to load a cube only on a specific day of
the month, or month of the year:

39
Load 4: Performance Tips
• Reduce data packet transfer size if there is extensive use of
lookups in transfer/update rules
• Use multiple loads with non-overlapping selection conditions vs.
single loads
 Some R/3 DataSources are neither delta capable nor ODS
compatible—so they only support full loads
 Separate InfoPackages for actual and plan data by current and
future years reduces full load size
 Set number of background processes accordingly

• Turn off consistency check for proven loads from proven sources

40
Load 5: Error Handling
• If source data is frequently problematic, use error handling
 Strips error records out into a separate PSA request or error DTP to be
processed later without impacting the current load
 Completes processing of correct records

• Illustration: Error Handling in InfoPackage

41
Load 6: Partitioning
• Define the partitioning strategy before go-live
 Cubes must be empty before they are partitioned by transport

 Quicker and less risky than using the repartitioning tool

• Partition by calendar month or fiscal period


 Queries should use filters, variables or selections on the
partitioning column characteristic—read values from another
variable if necessary
• SEM Transactional Cubes should also be partitioned

42
Load 7: Compression
• Compression should be scheduled regularly
• SAP recommendations for number of partitions:
 30-50 partitions (requests) per F-fact table

 120-150 partitions (time periods) per E-fact table—this is more


than 10 years by calendar month!
• Use zero-elimination during compression
 Can greatly reduce number of fact table records for cubes
loaded by ODS Objects or delta capable DataSources
 Consult OSS before using zero elimination—there are known
issues with specific database versions, although patches are
available

43
Load 8: Data Load Scheduling Strategy
• Will the loads be scheduled by external software?
 Does the R/3 batch process use an external tool such as
AutoSys?
 Consistent approach to batch scheduling could reduce overall
support and maintenance costs
• Will BW load success be monitored in BW or via the external tool?
 If using an external tool, need to develop a mechanism to report
success/failure back to the tool
 If using BW, consider adding text message notification steps to
process chains upon success/failure

44
Load 9: Data Load Scheduling Strategy
• Illustration: External scheduling process

(Diagram) The external scheduling tool calls the BW program ZBW_PC_LOAD,
which triggers the event that starts the process chain (Start event →
Data Load → Success / Failure) and then waits until the chain reports a
success or failure back to the tool.
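A minimal sketch of such a wrapper report, in the spirit of ZBW_PC_LOAD but
using the standard RSPC API to start the chain and poll its status so the
external scheduler can evaluate the job result; the chain name, wait
interval, and failure handling are assumptions, and the slide's variant
raises an event instead of calling the API directly:

REPORT zbw_pc_load.

PARAMETERS: p_chain TYPE rspc_chain OBLIGATORY.

DATA: l_logid  TYPE rspc_logid,
      l_status TYPE rspc_state.

* Start the process chain and remember its log ID
CALL FUNCTION 'RSPC_API_CHAIN_START'
  EXPORTING
    i_chain = p_chain
  IMPORTING
    e_logid = l_logid.

* Wait until the chain reaches a final status
DO.
  WAIT UP TO 60 SECONDS.
  CALL FUNCTION 'RSPC_API_CHAIN_GET_STATUS'
    EXPORTING
      i_chain  = p_chain
      i_logid  = l_logid
    IMPORTING
      e_status = l_status.
  IF l_status = 'G'.                        "finished successfully
    EXIT.
  ELSEIF l_status = 'R' OR l_status = 'X'.  "errors or canceled
*   An E message cancels the background job, signaling failure
    MESSAGE 'Process chain ended with errors' TYPE 'E'.
  ENDIF.
ENDDO.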

45
What We’ll Cover …
• How to leverage the BW architecture
• Data Modeling
• ETL – Extraction
• ETL – Transformation
• ETL – Load (Process Chains)
• Wrap-up

46
7 Key Points to Take Home
• Intelligently managing data model granularity is critical to
performance—even with BW Accelerator!
• Implement Semantic Partitioning on every data model
• Define a data retention strategy early on to lower TCO
• Use dynamic programming for customer exits to simplify
maintenance and reduce risk of production impact
• Use field symbols in the start routine to transform data to achieve
optimal performance
• Use program includes to enable portability and version history for
your complex transformations
• Define process chains based on frequency and the critical path
 Use decision variants to improve flexibility

47
Resources
• Jens Doerpmund, "Introducing the Layered, Scalable Architecture (LSA)
Approach to Data Warehouse Design for Improved Reporting and Analytic
Performance" (BI and Portals 2009)
• Jens Doerpmund, "Beyond the Basics of SAP NetWeaver Business Intelligence
Accelerator" (BI and Portals 2009)
• Ron Silberstein, "Data Modeling, Management, and Architectural Techniques
for High Data Volumes with SAP NetWeaver Business Intelligence" (BI and
Portals 2008)
• Joe Darlak, "Maximize the Capabilities, Efficiency and Performance of ETL
Logic in BW" (ASUG Forums, October 2004)
• Ralph Kimball, The Data Warehouse Toolkit (Wiley Publishing 2002)
• Rajiv Kalra, "Conditional Execution" (BI Expert, March 2008)
• John Kurgen, "Use a New Process Type to Create Dynamic Process Chains" (BI
Expert, January 2008)

48
Your Turn!

How to contact me:


Joe Darlak
jdarlak@comerit.net
49
Disclaimer
SAP, R/3, mySAP, mySAP.com, SAP NetWeaver®, Duet™, PartnerEdge, and other SAP products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other product
and service names mentioned are the trademarks of their respective companies. Wellesley Information Services is neither owned nor controlled by
SAP.

50
