De 90 0363 IDE Performance Tuning
2011 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means
Abstract
The system resource guidelines for Informatica Data Explorer include resource recommendations for the Profiling Service Module, the Data Integration Service, profile warehouse, and hardware settings for different profile types. You can follow the guidelines for mapping memory and disk size configuration for profiles with Data Quality transformations in them. This article describes the system performance guidelines for Informatica Data Explorer.
Supported Versions
Informatica Data Explorer 9.1.0
Table of Contents
System Performance Guidelines Overview
Resource Guidelines
    Profiling Service Module
    Data Integration Service
    Hardware Considerations for Flat File and Mainframe Sources
    Hardware Considerations for Relational Sources
    Profile Warehouse Guidelines for Column Profiling
    Profile Warehouse Guidelines for Key and Functional Dependency Discovery
    Profile Warehouse Guidelines for Foreign Key and Overlap Discovery
Resource Guidelines for Profiles with Data Quality Transformations
    Mapping Memory and Disk Size Guidelines for Standard Transformations
    Mapping Memory and Disk Size Guidelines for Reference Data Transformations
Resource Guidelines
Resource guidelines include resource recommendations such as number of CPUs, amount of memory, disk space, and disk speed. The optimal use of these resources can lead to improved performance of the Profiling Service Module, the Data Integration Service, and profile warehouse. The system resource guidelines depend on profile types. Column profiling guidelines depend on the data source type and hardware capacity. Other types of profiling such as key discovery, functional dependency discovery, foreign key discovery, and overlap discovery have specific hardware resource guidelines.
Variable
where
2 = two passes (some analyses need two passes).
Maximum column size = the number of bytes in any column in a table that is not one of the very large datatypes, for example CLOB, that you cannot run a profile on. The column size must take into account the character encoding, such as Unicode or ASCII.
Frequency bytes = 4 or 8 bytes to store the frequency during the analysis. This is the default size that the database uses for COUNT(*).
Operating System
Use a 64-bit operating system if memory requirements are greater than 3 GB.
where
Number of columns = the sum of columns and virtual columns in the profile run.
Average value size includes Unicode encoding of characters.
64 bytes for each value = 8 bytes for the frequency and 56 bytes for the key.
Cached Data Guidelines
Cached data is also known as staged data. It is a copy of the source data that is used for drilldown operations. Depending on the data source, cached data can use a very large amount of disk space. Use the following formula to calculate disk size requirements for cached data:
Number of rows * number of columns * (average value size + 24)
where 24 is the cache key size. Sum the results of this calculation for all cached tables.
Other Resource Needs
The profile warehouse has the following memory and CPU requirements:
Memory
The queries run by the Profiling Service Module do not use significant amounts of memory. Use the manufacturer's recommendations based on the table sizes.
CPU
Use 1 CPU for each concurrent profile job. This applies to each relational database or flat file profile job, not to each profile mapping. If the data is cached, use 2 CPUs for each concurrent profile job.
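The cached data formula and the CPU rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample table sizes are hypothetical and do not come from the product.

```python
def cached_data_bytes(tables, cache_key_size=24):
    """Sum rows * columns * (average value size + cache key size) over all cached tables."""
    return sum(
        rows * columns * (avg_value_size + cache_key_size)
        for rows, columns, avg_value_size in tables
    )


def warehouse_cpus(concurrent_jobs, cached=False):
    """Use 1 CPU per concurrent profile job, or 2 per job when the data is cached."""
    return concurrent_jobs * (2 if cached else 1)


# Hypothetical workload: (rows, columns, average value size in bytes) for each cached table.
tables = [(1_000_000, 20, 16), (250_000, 8, 40)]
print(f"{cached_data_bytes(tables):,} bytes of cached data")  # 928,000,000 bytes
print(warehouse_cpus(4, cached=True), "CPUs")                 # 8 CPUs
```

The average value size should already include the character encoding overhead, as the guidelines note for Unicode data.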
Keys
Use the following formula to compute the disk space for key discovery:
Number of Inferred Keys * Average Number of Columns in the Key * 32 + Number of Keys * (32 + (2 * Average Column Size) * Average Number of Key Columns * Average Number of Rows that Violate the Key)
where
32 is the number of bytes used to store one column in the key.
2 is the typical number of bytes used for a single Unicode character.
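As a rough illustration, the key discovery formula can be written as a small Python function. The function and parameter names are hypothetical. The sketch also assumes that "Number of Keys" equals the number of inferred keys unless a separate value is supplied, and that the average column size is measured in characters, since the formula multiplies it by 2 bytes per character.

```python
def key_discovery_bytes(inferred_keys, avg_key_columns, avg_column_size,
                        avg_violating_rows, num_keys=None):
    """Estimate profile warehouse disk space, in bytes, for key discovery."""
    if num_keys is None:
        num_keys = inferred_keys  # assumption: both counts refer to the same keys
    # 32 bytes to store each column of each inferred key.
    key_definitions = inferred_keys * avg_key_columns * 32
    # Per key: 32 bytes plus 2 bytes per character for each key column of each violating row.
    violations = num_keys * (32 + (2 * avg_column_size) * avg_key_columns * avg_violating_rows)
    return key_definitions + violations


# Hypothetical profile: 10 inferred keys, 2 columns each, 20-character values,
# 100 violating rows per key on average.
print(key_discovery_bytes(10, 2, 20, 100))  # 80960
```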
Functional Dependency
Use the following formula to compute the disk space for functional dependency discovery:
Number of Inferred Functional Dependencies * (Average Number of LHS Columns + 1) * 32 + Number of Inferred Functional Dependencies * (32 + (2 * Average Number of Characters in Columns) * (Average Number of LHS Columns) * Average Number of Rows that Violate the Functional Dependency)
where
Average Number of LHS Columns is the average number of columns in the determinant of the functional dependency.
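The functional dependency estimate can be sketched the same way. The function and parameter names below are hypothetical, and the sketch assumes that the second parenthesized term of the formula closes after the violating-row count, by analogy with the key discovery formula.

```python
def functional_dependency_bytes(inferred_fds, avg_lhs_columns, avg_chars_per_column,
                                avg_violating_rows):
    """Estimate profile warehouse disk space, in bytes, for functional dependency discovery."""
    # 32 bytes for each LHS (determinant) column plus one dependent column per dependency.
    definitions = inferred_fds * (avg_lhs_columns + 1) * 32
    # Per dependency: 32 bytes plus 2 bytes per character for each LHS column of each violating row.
    violations = inferred_fds * (
        32 + (2 * avg_chars_per_column) * avg_lhs_columns * avg_violating_rows
    )
    return definitions + violations


# Hypothetical profile: 50 inferred dependencies, 2 LHS columns on average,
# 20-character values, 100 violating rows per dependency.
print(functional_dependency_bytes(50, 2, 20, 100))  # 406400
```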
where
Number of Columns in Schema is the total number of columns in the profile model. After the Profiling Service Module generates the column signature for a profile task, subsequent profile tasks reuse the signature.
3600 is the amount of space required to store the signatures for one column.
Foreign Keys
Use the following formula to compute the disk space for foreign keys:
Number of Inferred Foreign Keys * 2 * (Average Number of Columns in the Primary or Foreign Key) * 32 + Number of Foreign Keys * (32 + (2 Bytes per Character * Average Number of Characters in the Columns) * Average Number of Key Columns * Average Number of Rows that Violate the Foreign Key Either in the Parent Table or Child Table)
where
2 is the multiplier to get the total number of columns for the foreign key.
32 is the number of bytes to store one column in the key.
2 Bytes per Character is the typical number of bytes for a single Unicode character.
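A minimal Python sketch of the foreign key estimate follows. The names are hypothetical, and the sketch assumes that "Number of Foreign Keys" equals the number of inferred foreign keys unless a separate value is supplied.

```python
def foreign_key_bytes(inferred_fks, avg_key_columns, avg_chars_per_column,
                      avg_violating_rows, num_fks=None):
    """Estimate profile warehouse disk space, in bytes, for foreign key discovery."""
    if num_fks is None:
        num_fks = inferred_fks  # assumption: both counts refer to the same foreign keys
    # The factor 2 covers the primary key side and the foreign key side; 32 bytes per column.
    definitions = inferred_fks * 2 * avg_key_columns * 32
    # Per foreign key: 32 bytes plus 2 bytes per character for each key column of each
    # violating row, in either the parent table or the child table.
    violations = num_fks * (
        32 + (2 * avg_chars_per_column) * avg_key_columns * avg_violating_rows
    )
    return definitions + violations


# Hypothetical profile: 5 inferred foreign keys, 2 key columns on average,
# 20-character values, 100 violating rows per foreign key.
print(foreign_key_bytes(5, 2, 20, 100))  # 40800
```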
Overlap Discovery
Use the following formula to compute the disk space for overlap discovery:
Number of Inferred Overlap Pairs * 2 * 32
where
2 is the number of columns in the pair.
32 is the number of bytes required to store one column in the overlap pair.
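For completeness, the overlap discovery estimate reduces to 64 bytes for each inferred pair. The function name and the sample pair count below are hypothetical.

```python
def overlap_discovery_bytes(inferred_overlap_pairs):
    """Each overlap pair stores 2 columns at 32 bytes per column."""
    return inferred_overlap_pairs * 2 * 32


# Hypothetical profile with 1,000 inferred overlap pairs.
print(overlap_discovery_bytes(1_000))  # 64000
```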
Mapping Memory and Disk Size Guidelines for Reference Data Transformations
Reference data transformations such as Case Converter, Labeler, Parser, and Standardizer process data immediately, but they have initialization costs that increase memory use according to their configuration.
The reference table data is managed in the database. At run time, the data is held in memory for performance reasons. To optimize data throughput, this in-memory storage is designed for speed rather than space efficiency. Each transformation has its own copy of the in-memory reference data.
To estimate the in-memory storage, multiply the number of rows in the reference table by the number of columns and by the average number of bytes in each column value. Then multiply the total by 1.3. For example, the following calculation gives the in-memory requirement for a reference table with 10,000 rows, 6 columns, and an average of 25 bytes in each column value:
10000 * 6 * 25 * 1.3
The result is 1,950,000 bytes, or approximately 2 MB.
Data Quality uses reference tables to enable operations such as standardization, labeling, and parsing. Each reference data set is stored in a table, and its size in the database is equivalent to its disk size. Use the following formulas to calculate reference data table size:
number of data rows * number of columns * number of characters per column
Note: This formula applies if all columns have the same average data size.
number of data rows * (characters in column 1 + characters in column 2 + ... + characters in column n)
Note: This formula applies when table columns have different sizes.
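The reference data estimates above can be sketched in Python as follows. The function names are hypothetical, and the table size results are in characters, so the byte size on disk depends on the character encoding.

```python
def in_memory_bytes(rows, columns, avg_bytes_per_value, overhead=1.3):
    """In-memory size of one copy of a reference table, including the 1.3 overhead factor."""
    return rows * columns * avg_bytes_per_value * overhead


def table_size_uniform(rows, columns, chars_per_column):
    """Table size when all columns have the same average data size."""
    return rows * columns * chars_per_column


def table_size_varied(rows, chars_per_column):
    """Table size when columns differ: rows * (sum of the average characters in each column)."""
    return rows * sum(chars_per_column)


# The worked example from the text: 10,000 rows, 6 columns, 25 bytes on average.
print(f"{in_memory_bytes(10_000, 6, 25) / 1e6:.2f} MB")        # 1.95 MB
print(table_size_uniform(10_000, 6, 25))                       # 1500000
print(table_size_varied(10_000, [10, 20, 30, 40, 25, 25]))     # 1500000
```

Because each transformation holds its own copy of the in-memory reference data, a mapping with several transformations that use the same table needs the in-memory estimate multiplied by the number of those transformations.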
Author
Rajesh Sivanarayanan, Lead Technical Writer
Acknowledgements
The author would like to acknowledge Jeff Millman and Venkatakrishnan Swaminathan for their contributions to this article.