Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

Complex ETL Testing:

A Strategic Approach

Jeffrey R. Bocarsly, Ph.D


Director Functional Test Automation Group

2011 Real-Time Technology Solutions (212) 240-9050 www.rtts.com info@rttsweb.com 360 Lexington Ave. Fl 9, New York, NY 10017

Data Warehouse Testing


A data warehouse is a repository of transactional data that has been extracted from original electronic sources and transformed so that query, analysis and reporting on trends within historic data are possible and efficient. The analyses provided by data warehouses may involve strategic planning, decision support, and monitoring the outcomes of a chosen strategy. Typically, data that is loaded into a data warehouse is derived from diverse sources of operational data, which may consist of data from databases, feeds, application files or flat files. The data must be extracted from these diverse sources, transformed to a common format, and loaded into the data warehouse. Typically, it is further aggregated into a data mart for efficient reporting. The ETL (Extract, transform and load) process is a critical step in any data warehouse implementation, and continues to be an area of major significance whenever the ETL code is updated. Once the data warehouse and data marts are populated, business intelligence applications facilitate querying, analysis and reporting. The business intelligence tools may provide simple presentations of data based on queries, or may support sophisticated statistical analysis options. Data warehouses may have multiple front-end applications, depending on the needs of the user community. Figure 1 shows a simplified data warehouse scheme. An effective data warehouse testing strategy focuses on the main structures within the data warehouse architecture: 1. The ETL layer 2. The data warehouse itself 3. Associated data marts 4. The front-end business intelligence/reporting applications Each of these units must be treated separately and in combination, and since there may be multiple components in each (multiple feeds to ETL, multiple databases or data repositories that constitute the warehouse, multiple data marts, and multiple front-end applications), each of these subsystems must be individually validated. One issue specific to data warehousing in the Pharmaceutical industry is the variety and types of data sources. In addition to application databases with varied data models, data sources may utilize Pharma-specific information exchange formats (e.g., HL7, CDISC). Organizations may have multiple proprietary internal data formats, which may have been acquired in the process of industry consolidation. Thus, pharmaceutical data warehouses may have to support a broader range of input formats to their ETL layers than data warehouses in other industries.

2011 Real-Time Technology Solutions (212) 240-9050 www.rtts.com info@rttsweb.com 360 Lexington Ave. Fl 9, New York, NY 10017

Data Verification
The recommended pre-deployment strategy is to build test automation (both functional and performance) for every test entry point in the system (feeds, databases, internal messaging, front-end transactions). The goal of the strategy is to provide automated tools for rapid localization of issues between test entry points (see Figure 1). HL7 CDISC XML File
ETL ETL ETL ETL ETL Warehouse Mart(s) BI Tools Data ETL Data

Figure 1. A simplified Data Warehouse scheme

In Figure 1, data from a variety of sources is transformed by the ETL into the Data Warehouse. A second ETL leg aggregates data from the Data Warehouse into Datamart tables for efficient reporting. Front-end applications and Business Intelligence applications access the Datamart in order to provide historical and statistical analyses of company data.

Data Comparison

Data Comparison

HL7 CDISC XML File


ETL ETL ETL ETL ETL Warehouse Mart(s) BI Tools Data ETL Data

Data Comparison Data Comparison

Figure 2. Data comparison points in the Data Warehouse scheme

As suggested by Figure 1, there are four major points of data comparison in the scheme. These are indicated in Figure 2 and are found between the: Source(s) and Data Warehouse Data Warehouse and Datamart Datamart and BI Tool Source(s) and Datamart

2011 Real-Time Technology Solutions (212) 240-9050 www.rtts.com info@rttsweb.com 360 Lexington Ave. Fl 9, New York, NY 10017

In automated Data Warehouse testing, the emphasis is the validation of data integrity between all points of comparison to ensure the proper implementation of the ETL mappings and transformations across the architecture. Using test automation, data can be tracked from the source layer, through the ETL processing and through the Data Warehouse and Datamart components, to the front-end applications. If corrupt data is found in a front-end application, the execution of automated tests can quickly determine whether the problem is located in the data source, an ETL process, in a data warehouse database, a datamart or in the frontend/Business Intelligence tool. Rapid determination of the application tier in which an issue is located can dramatically lower turnaround times for remediation, as well as enhance quality.

Performance and Scalability


HL7 CDISC XML File
ETL ETL ETL ETL ETL Data Warehouse Performance ETL Data Mart(s) BI Tools Performance Performance

Figure 3. RTTS Data Warehouse Post-Deployment Testing Strategy

On the performance side, the test entry points analogous to the data comparison points are used, to focus on characterizing subsystem response under load. These are indicated in Figure 3. Using industry standard performance tools, the performance of each of the components in the ETL processes can be evaluated. Often, a major focus of performance and scalability testing is the Business Intelligence tool client interfaces because these are the user-facing components of the Data Warehouse architecture. Perceived poor performance by the user community is frequently a significant concern, and performance measurements across a series of releases can often reduce project tensions around this issue.

Overall Strategy
The emphasis on rapid localization of either data or performance problems in complex data warehouse architectures provides a key tool for: Promoting efficiencies in the development cycle Shortening build cycles and meeting release targets Delivering high quality Data Warehouse architectures.

2011 Real-Time Technology Solutions (212) 240-9050 www.rtts.com info@rttsweb.com 360 Lexington Ave. Fl 9, New York, NY 10017

Bibliography
1. Inmon, W. H., Building the Data Warehouse, John Wiley & Sons; 3rd edition, 2002 2. Kimball, R.; Reeves, L.; Ross, M.; Thornthwaite, W., The Data Warehouse Lifecycle Toolkit : Expert Methods for Designing, Developing, and Deploying Data Warehouses, John Wiley & Sons, 1998

About the Author


Jeffrey R. Bocarsly, Ph.D., a Vice President and functional testing division manager with RTTS, has implemented many automated software testing projects at Fortune 500 firms. His experience includes projects in various sectors including brokerage, utility, media, pharmaceutical, software, banking, and insurance. Jeff is a co-author of Software Test Engineering with IBM Rational Functional Tester: The Definitive Resource, IBM Press, 2010. He holds a BS (UCLA) and Masters and PhD degrees (Columbia University).

About RTTS
RTTS offers the most comprehensive suite of quality assurance services to help organizations drive positive results from their critical software projects. Headquartered in New York, NY, our expert team has worked closely with over 400 clients to improve their testing processes, tool knowledge, and application deployment outcomes. RTTS was founded in 1996, and has forged partnerships with the worlds leading test tool vendors. Our satellite locations are in Philadelphia, Atlanta, and Phoenix, and many of our consulting and education services are offered through the cloud. No matter where you are, RTTS will ensure application functionality, performance, scalability, and security for your organization.

About QuerySurge
RTTS team of test experts developed QuerySurge to address the unique testing needs in the data warehousing space. It has been implemented on projects ranging from data warehousing and ETL processes to data migrations and database upgrades. QuerySurge can verify as much as 100% of all data from source systems, through the ETL process, to the target data warehouse. The tool has increased test coverage and reduced test cycle time for several organizations, helping them to mitigate risk and meet business requirements. QuerySurge is offered exclusively by RTTS, as are any accompanying Data Warehouse Testing services. If you are interested in learning more, or to schedule a private demonstration, please contact us by e-mail here: info@rttsweb.com. If you would like to speak with our Sales team, please call (212) 240-9050.

2011 Real-Time Technology Solutions (212) 240-9050 www.rtts.com info@rttsweb.com 360 Lexington Ave. Fl 9, New York, NY 10017

You might also like