
Informatica Data Explorer Performance Tuning

2011 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation.

Abstract
The system resource guidelines for Informatica Data Explorer include resource recommendations for the Profiling Service Module, the Data Integration Service, and the profile warehouse, along with hardware settings for different profile types. You can also follow the guidelines for mapping memory and disk size configuration for profiles that contain Data Quality transformations. This article describes the system performance guidelines for Informatica Data Explorer.

Supported Versions
Informatica Data Explorer 9.1.0

Table of Contents
System Performance Guidelines Overview
Resource Guidelines
Profiling Service Module
Data Integration Service
Hardware Considerations for Flat File and Mainframe Sources
Hardware Considerations for Relational Sources
Profile Warehouse Guidelines for Column Profiling
Profile Warehouse Guidelines for Key and Functional Dependency Discovery
Profile Warehouse Guidelines for Foreign Key and Overlap Discovery
Resource Guidelines for Profiles with Data Quality Transformations
Mapping Memory and Disk Size Guidelines for Standard Transformations
Mapping Memory and Disk Size Guidelines for Reference Data Transformations

System Performance Guidelines Overview


Effective performance tuning of Informatica Data Explorer depends on how well you balance system resources for the Data Integration Service, the Profiling Service Module, and the profile warehouse. It is also important to plan mapping memory and disk space for profiles that contain Data Quality transformations.

Resource Guidelines
Resource guidelines include recommendations for the number of CPUs, the amount of memory, disk space, and disk speed. Optimal use of these resources can improve the performance of the Profiling Service Module, the Data Integration Service, and the profile warehouse. The system resource guidelines depend on the profile type. Column profiling guidelines depend on the data source type and hardware capacity. Other types of profiling, such as key discovery, functional dependency discovery, foreign key discovery, and overlap discovery, have specific hardware resource guidelines.

Profiling Service Module


The Profiling Service Module interacts with the profile warehouse and with data sources such as relational and nonrelational databases. Modern relational databases are optimized to process the data stored in them. The Profiling Service Module requires additional resources to read a nonrelational database source. Nonrelational sources can be SAP resources or mainframe sources, such as IMS or VSAM. For mainframe sources, the Profiling Service Module performs most of the data processing tasks to minimize the data access costs. The Profiling Service Module has the following system resource requirements:
CPU
Informatica Data Explorer uses less than 1 CPU. Each profile type has different CPU requirements:
- Relational systems require less than 1 CPU for each Data Transformation Manager thread.
- Flat files use approximately 2.3 CPUs for each Data Transformation Manager thread.
- Key and functional dependency discovery require 1 CPU for each Data Transformation Manager thread.
- Join, foreign key, and overlap discovery require 2 CPUs for each Data Transformation Manager thread.

Memory
Minimum memory required to run the profile.

Disk
No disk space is required.

Operating System
Use a 64-bit operating system if memory requirements are greater than 3 GB.

Data Integration Service


The Data Integration Service runs the Profiling Service Module. The Data Integration Service has fixed and variable memory requirements. The CPU requirements are not significant. The memory requirements are as follows:
Fixed
The amount of memory required to run the Java Virtual Machine that the Data Integration Service uses. The requirement is approximately 500 MB.

Variable
The amount of memory required to run each Data Transformation Manager thread. One Data Transformation Manager thread is required to run each mapping that computes a part of a profile job. This overhead is dependent on the Maximum Execution Pool Size property in the service properties. The default value of this property is 10 and the overhead is approximately 1000 MB.

Note: A profile that reads the output of an address validation rule may incur an additional 1 GB in memory to read and cache the address validation reference data.
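The fixed and variable figures can be combined into a rough sizing estimate. The following Python sketch is illustrative only; the per-thread cost of 100 MB is an assumption derived from the stated 1000 MB overhead at the default pool size of 10, not a documented constant.

# Rough Data Integration Service memory estimate in MB (illustrative sketch).
JVM_FIXED_MB = 500            # approximate fixed JVM requirement stated above
PER_THREAD_MB = 100           # assumption: 1000 MB overhead / default pool size of 10
ADDRESS_VALIDATION_MB = 1024  # extra cache when a profile reads address validation output

def dis_memory_mb(max_execution_pool_size=10, uses_address_validation=False):
    total = JVM_FIXED_MB + max_execution_pool_size * PER_THREAD_MB
    if uses_address_validation:
        total += ADDRESS_VALIDATION_MB
    return total

print(dis_memory_mb())  # 1500 MB with the default pool size of 10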

Hardware Considerations for Flat File and Mainframe Sources


When you run a profile job on a flat file, the Profiling Service Module generates mappings that infer the metadata for the columns and virtual columns. Each mapping can run serially or in parallel. The Profiling Service Module may generate a second type of mapping to cache the source data. This mapping always runs in parallel with the column profiling mappings because it takes longer than a column profile mapping. The following section describes the hardware requirements for running different profiles on flat file and mainframe sources:

Column Profile for a Column Profile Mapping


A column profile mapping has the following requirements:

CPU
2.3

Memory
The minimum resource required is 10 MB, representing 2 MB for each of 5 columns. The maximum resource required is 72 MB, representing a 64 MB buffer for one high-cardinality column and 8 MB for the remaining four low-cardinality columns.

Disk Space
2 * Number of columns per mapping * Maximum number of rows * ((2 bytes per character * Maximum string size in characters) + frequency bytes)

Disk Speed
7200 RPM is the minimum required disk speed.
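Read as a calculation, the disk space requirement can be estimated with a short script. The following Python sketch is illustrative only; the function name and the input values are hypothetical.

# Disk space estimate for one column profile mapping, in bytes (illustrative sketch).
def column_profile_disk_bytes(columns_per_mapping, max_rows,
                              max_string_chars, frequency_bytes=8):
    # 2 passes * columns * rows * ((2 bytes per character * max string size) + frequency bytes)
    return 2 * columns_per_mapping * max_rows * ((2 * max_string_chars) + frequency_bytes)

# Hypothetical source: 5 columns per mapping, 1 million rows, 50-character strings.
print(column_profile_disk_bytes(5, 1_000_000, 50) / (1024 ** 2), "MB")  # roughly 1030 MB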

Column Profile for a Profile Cache Mapping


A profile cache mapping has the following requirements:

CPU
1.5

Memory
Memory required for the Data Transformation Manager

Disk Space
No disk space is required.

Disk Speed
Not applicable for a flat file source; 7200 RPM is the minimum required disk speed for a mainframe source.

Key and Functional Dependency Discovery


Key and functional dependency discovery have the following requirements:

CPU
1

Memory
256 MB, in addition to the mapping memory

Disk Space
A minimum of 128 GB

Disk Speed
7200 RPM is the minimum required disk speed.

Foreign Key and Overlap Discovery


Foreign key and overlap discovery have the following requirements:

CPU
2

Memory
64 MB

Disk Space
No disk space is required.

Disk Speed
Not applicable

Hardware Considerations for Relational Sources


The Profiling Service Module transfers as much processing as it can to the machine that hosts the relational database. The division of work between the Profiling Service Module and the database can make it challenging to estimate resources for each machine. The following considerations are based on a single mapping that pushes the profiling logic down to the relational database for each column:

CPU
Based on the relational database, at least one CPU processes each query. If the relational database provides a mechanism to increase this, such as the parallel hint in Oracle, the number of CPUs utilized increases accordingly.

Memory
The relational database requires memory in the form of a buffer cache. The greater the buffer cache, the faster the relational database runs the query. Use at least 512 MB of buffer cache.

Disk
Relational systems use temporary table space. The formula for the maximum amount of temporary table space required is as follows:

2 * maximum number of rows in any table * (maximum column size + frequency bytes)

where
2 = two passes. Some analyses need two passes.
Maximum column size = the number of bytes in any column in a table that is not one of the very large datatypes, for example CLOB, that you cannot run a profile on. The column size must take into account the character encoding, such as Unicode or ASCII.
Frequency bytes = 4 or 8 bytes to store the frequency during the analysis. This is the default size that the database uses for COUNT(*).

Operating System
Use a 64-bit operating system if memory requirements are greater than 3 GB.
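As a quick check, the temporary table space formula can be evaluated directly. The following Python sketch is illustrative only and uses hypothetical input values.

# Temporary table space estimate for a relational source, in bytes (illustrative sketch).
def temp_table_space_bytes(max_rows, max_column_size_bytes, frequency_bytes=8):
    # 2 passes * rows * (largest profiled column in bytes + frequency bytes)
    return 2 * max_rows * (max_column_size_bytes + frequency_bytes)

# Hypothetical table: 10 million rows, widest profiled column of 200 bytes.
print(temp_table_space_bytes(10_000_000, 200) / (1024 ** 3), "GB")  # roughly 3.9 GB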

Profile Warehouse Guidelines for Column Profiling


The profile warehouse stores profiling results. The main resource for the profile warehouse is disk space. The disk size calculations depend on the expected storage sizes of integers. Some databases, such as Oracle, use a compressed number format and require less disk space. Column profiling stores statistical and bookkeeping data, value frequencies, and staged data in the profile warehouse. The following guidelines apply to column profiling:

Statistical and Bookkeeping Data Guidelines
Each column contains a set of statistics, such as the minimum and maximum values. The profile warehouse contains a set of tables that store bookkeeping data, such as the profile ID. These tables take up very little space and you can exclude them from disk space calculations.

Value Frequency Calculation Guidelines
Value frequencies are a key element in profile results. They list the unique values in a column along with a count of the occurrences of each value. Low-cardinality columns have very few values, but high-cardinality columns can have millions of values. The Profiling Service Module limits the number of unique values it identifies to 16,000 by default. You can change this value. Use the following formula to calculate disk size requirements:
Number of columns * number of unique values * (average value size + 64)

where
Number of columns = the sum of columns and virtual columns in the profile run.
Average value size includes Unicode encoding of characters.
64 bytes for each value = 8 bytes for the frequency and 56 bytes for the key.
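The value frequency formula can be turned into a quick sizing helper. The following Python sketch is illustrative only; the 100-column profile and 30-byte average value size are hypothetical inputs.

# Value frequency disk estimate, in bytes (illustrative sketch of the formula above).
def value_frequency_disk_bytes(num_columns, unique_values_per_column, avg_value_size_bytes):
    # 64 bytes per value = 8 bytes for the frequency + 56 bytes for the key
    return num_columns * unique_values_per_column * (avg_value_size_bytes + 64)

# Hypothetical profile: 100 columns, the default limit of 16,000 unique values per column.
print(value_frequency_disk_bytes(100, 16_000, 30) / (1024 ** 2), "MB")  # roughly 143 MB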

Cached Data Guidelines
Cached data is also known as staged data. It is a copy of the source data that is used for drilldown operations. Depending on the data source, this can use a very large amount of disk space. Use the following formula to calculate disk size requirements for cached data:
Number of rows * number of columns * (average value size + 24)

where 24 is the cache key size. Sum the results of this calculation for all cached tables.

Other Resource Needs
The profile warehouse has the following memory and CPU requirements:

Memory
The queries run by the Profiling Service Module do not use significant amounts of memory. Use the manufacturer's recommendations based on the table sizes.

CPU
Use 1 CPU for each concurrent profile job. This applies to each relational database or flat file profile job, not to each profile mapping. If the data is cached, use 2 CPUs for each concurrent profile job.

Profile Warehouse Guidelines for Key and Functional Dependency Discovery


The disk space for key and functional dependency discovery depends on the number of inferred keys, functional dependencies, and their dependency violations. These items can take up a large amount of space in the profile warehouse if you set a large limit for key and functional dependency discovery. You can use the following formulas to compute the disk space. If you set the confidence parameter to 100%, the profile warehouse does not store violating rows and you can omit that part of the computation.

Keys
Use the following formula to compute the disk space for key discovery:
Number of Inferred Keys * Average Number of Columns in the Key * 32 + Number of Keys * (32 + (2 * Average Column Size) * Average Number of Key Columns * Average Number of Rows that Violate the Key)

where
32 is the number of bytes used to store one column in the key.
2 is the typical number of bytes used for a single Unicode character.
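The key discovery formula can be evaluated the same way. The following Python sketch is illustrative only, follows the formula as written above, and uses hypothetical input values.

# Key discovery disk estimate, in bytes (illustrative sketch of the formula above).
def key_discovery_disk_bytes(inferred_keys, avg_key_columns,
                             avg_column_size_chars, avg_violating_rows):
    key_storage = inferred_keys * avg_key_columns * 32
    violation_storage = inferred_keys * (32 + (2 * avg_column_size_chars)
                                         * avg_key_columns * avg_violating_rows)
    return key_storage + violation_storage

# Hypothetical run: 50 inferred keys, 2 columns per key, 20-character values,
# and 1,000 violating rows per key on average.
print(key_discovery_disk_bytes(50, 2, 20, 1000) / (1024 ** 2), "MB")  # roughly 3.8 MB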

Functional Dependency
Use the following formula to compute the disk space for functional dependency:
Number of Inferred Functional Dependencies * (Average Number of LHS Columns + 1) * 32 + Number of Inferred Functional Dependencies * (32 + (2 * Average Number of Characters in Columns) * (Average Number of LHS Columns) * Average Number of Rows that Violate the Functional Dependency)

where
Average Number of LHS Columns is the average number of columns in the determinant of the functional dependency. One is added for the dependent column.
32 is the number of bytes used to store one column in the functional dependency.
2 is the typical number of bytes used for a single Unicode character.
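The functional dependency formula follows the same pattern. The following Python sketch is illustrative only and uses hypothetical input values.

# Functional dependency disk estimate, in bytes (illustrative sketch of the formula above).
def functional_dependency_disk_bytes(inferred_fds, avg_lhs_columns,
                                     avg_chars_per_column, avg_violating_rows):
    fd_storage = inferred_fds * (avg_lhs_columns + 1) * 32
    violation_storage = inferred_fds * (32 + (2 * avg_chars_per_column)
                                        * avg_lhs_columns * avg_violating_rows)
    return fd_storage + violation_storage

# Hypothetical run: 30 inferred dependencies with 2 determinant columns, 20-character
# values, and 500 violating rows per dependency on average.
print(functional_dependency_disk_bytes(30, 2, 20, 500) / (1024 ** 2), "MB")  # roughly 1.1 MB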

Profile Warehouse Guidelines for Foreign Key and Overlap Discovery


The disk space for foreign key and overlap discovery depends on the number of inferred foreign keys and overlapping column pairs. These items can take up a large amount of space in the profile warehouse if you set a large limit for foreign key and overlap discovery. The Profiling Service Module computes column signatures once for foreign key and overlap discovery. You can use the following formula to compute the disk space for column signatures:

Signatures
Number of Columns in Schema * 3600

where
Number of Columns in Schema is the total number of columns in the profile model. After the Profiling Service Module generates the column signature for a profile task, subsequent profile tasks reuse the signature.
3600 is the amount of space required to store the signatures for one column.

Foreign Keys
Use the following formula to compute the disk space for foreign keys:
Number of Inferred Foreign Keys * 2 * (Average Number of Columns in the Primary or Foreign Key) * 32 + Number of Foreign Keys * (32 + (2 Bytes per Character * Average Number of Characters in the Columns) * Average Number of Key Columns * Average Number of Rows that Violate the Foreign Key in Either the Parent Table or Child Table)

where
2 is the multiplier to get the total number of columns for the foreign key.
32 is the number of bytes to store one column in the key.
2 Bytes per Character is the typical number of bytes for a single Unicode character.

Overlap Discovery
Use the following formula to compute the disk space for overlap discovery:
Number Of Inferred Overlap Pairs * 2 * 32

where
2 is the number of columns in the pair.
32 is the number of bytes required to store one column in the overlap pair.
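The three formulas in this section can be combined into one overall estimate. The following Python sketch is illustrative only; it treats the 3600 signature figure as bytes, and the schema size, foreign key counts, and overlap counts are hypothetical inputs.

# Combined disk estimate for signatures, foreign keys, and overlap pairs (illustrative sketch).
def signatures_bytes(columns_in_schema):
    return columns_in_schema * 3600  # 3600 of signature data per column, assumed to be bytes

def foreign_key_bytes(inferred_fks, avg_key_columns, avg_chars_per_column, avg_violating_rows):
    key_storage = inferred_fks * 2 * avg_key_columns * 32
    violation_storage = inferred_fks * (32 + (2 * avg_chars_per_column)
                                        * avg_key_columns * avg_violating_rows)
    return key_storage + violation_storage

def overlap_bytes(inferred_overlap_pairs):
    return inferred_overlap_pairs * 2 * 32

# Hypothetical schema: 500 columns, 20 inferred foreign keys, 200 overlapping column pairs.
total = signatures_bytes(500) + foreign_key_bytes(20, 2, 20, 1000) + overlap_bytes(200)
print(total / (1024 ** 2), "MB")  # roughly 3.3 MB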

Resource Guidelines for Profiles with Data Quality Transformations


The memory and disk overhead are critical when you run profiles with Data Quality transformations. When you determine your resource needs, consider the number of concurrent mappings submitted to the server, the types of transformation used in each mapping, and the size of the source data sets.

Mapping Memory and Disk Size Guidelines for Standard Transformations


The standard transformations, in the performance context, are Comparison, Decision, Weighted Average, and Merge. The memory or disk usage of these transformations does not vary with the size of the data processed. These components process data rows in small batches and send them to the next component in the mapping immediately. The standard transformations do not incur additional costs in memory or disk usage beyond the standard running size.

Mapping Memory and Disk Size Guidelines for Reference Data Transformations
Reference data transformations such as Case Converter, Labeler, Parser, and Standardizer process data immediately, but they have initialization costs that increase memory use according to their configuration. The reference table data is managed in the database. At run time, the data is held in memory for performance reasons. To optimize data throughput, this in-memory storage is designed for speed rather than space efficiency. Each transformation has its own copy of the in-memory reference data.

To estimate the in-memory storage, multiply the average number of bytes in each column of the reference table by the number of columns and by the number of rows in the reference table. Then multiply the total by 1.3. For example, the following is the in-memory requirement for a reference table with 10000 rows, 6 columns, and an average byte count of 25:
10000 * 6 * 25 * 1.3

The total value equals approximately 2 MB. Data Quality uses reference tables to enable operations such as standardization, labeling, and parsing. Each reference data set is carried in a table and has a size in the database equivalent to its disk size. Use the following formulas to calculate reference data table size:
number of data rows * number of columns * number of characters per column

Note: This formula applies if all columns have the same average data size.
number of data rows * (characters in column 1 + characters in column 2 + ... + characters in column n)

Note: This formula applies when table columns have different sizes.
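As a check on the figures above, the in-memory and on-disk reference data formulas can be expressed as small helpers. The following Python sketch is illustrative only; the function names are hypothetical and the worked example repeats the 10000-row table described above.

# Reference data sizing sketch (uses the 1.3 in-memory factor described above).
def reference_table_memory_bytes(rows, columns, avg_bytes_per_column):
    # In-memory copy held by each transformation that uses the reference table.
    return rows * columns * avg_bytes_per_column * 1.3

def reference_table_disk_bytes(rows, bytes_per_column):
    # bytes_per_column is a list with one entry per column, so columns may differ in size.
    return rows * sum(bytes_per_column)

# Worked example from the text: 10000 rows, 6 columns, average of 25 bytes per column.
print(reference_table_memory_bytes(10000, 6, 25) / (1024 ** 2), "MB")  # roughly 1.9 MB, about 2 MB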

Author
Rajesh Sivanarayanan Lead Technical Writer

Acknowledgements
The author would like to acknowledge Jeff Millman and Venkatakrishnan Swaminathan for their contributions to this article.
