
Enterprise Discovery Best Practices

© 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by
any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All
other company and product names may be trade names or trademarks of their respective owners and/or copyrighted
materials of such owners.
Abstract
This article describes the best practice guidelines that you can follow when you perform enterprise discovery for
different use cases. An enterprise discovery profile runs multiple data discovery tasks on many data sources and
generates a consolidated summary of the profile results.

Supported Versions
• Data Quality 9.6.1

Table of Contents
Introduction
Sampling Options
Profile Functions
Conclusion

Introduction
Enterprise discovery is a process that finds column profile statistics, data domains, primary keys, and foreign keys in
many data sources spread across multiple connections or schemas.

You can use enterprise discovery to solve use cases that range from scanning tables for specific properties to performing a complete data discovery analysis. Consider the costs and benefits before you extend the profile operation from a single table to a schema, especially because the tables in a schema might vary considerably in the number of columns and rows. To meet this challenge, choose the design-time and run-time parameters for enterprise discovery carefully.

Enterprise discovery has the following steps:

1. Column profile that discovers the basic column data.
2. Data domain discovery that discovers the functional meaning or semantics of column data.
3. Primary key profile that discovers the key structure of a table.
4. Foreign key profile that discovers table relationships in the schema.

You can choose to run one or more of these steps when you run the enterprise discovery profile.
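As a rough illustration only, the following Python sketch models the step toggles for a single run. The class and attribute names are hypothetical; the product exposes these choices through the enterprise discovery profile options, not through an API like this.

```python
from dataclasses import dataclass

@dataclass
class EnterpriseDiscoveryRun:
    """Hypothetical step toggles for one enterprise discovery run."""
    column_profile: bool = True         # step 1: basic column statistics
    data_domain_discovery: bool = True  # step 2: semantics of column data
    primary_key_profile: bool = True    # step 3: key structure of each table
    foreign_key_profile: bool = True    # step 4: relationships in the schema

# A screening run might enable only the first two steps:
screening = EnterpriseDiscoveryRun(primary_key_profile=False,
                                   foreign_key_profile=False)
```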

How you choose the right parameters for each step of enterprise discovery depends on the use case. The two primary use cases are screening, where enterprise discovery derives summary profile statistics from a sample, and complete analysis, where the profile statistics come from the entire data set.

Screening Versus Complete Analysis
Use the screening use case to scan the tables in a schema for specific statistics and take the required actions. The most common example is to screen a schema for specific data domains or patterns. Generally, you set the parameter values to reduce the total time for the profile run. Some use cases apply aggressive sampling to prioritize profile run time over accuracy. Aggressive sampling is effective only if the data in each table is consistent. If this assumption of data consistency does not hold, you must run a column profile on the entire data set.

The complete analysis use case prioritizes complete and accurate information over the profile run time. One way to
reduce the profile run time is to provide more hardware resources for profiling, such as additional cores or machines.
This approach of adding more resources works because profiling is scalable.

Sampling Options
Profiling supports several sampling techniques, each with different trade-offs. Not all profiling steps support all the sampling options. The sampling options are as follows:

• Complete – Selects all the rows in the data set for analysis.
• First N Rows – Selects the first 'N' rows of the data set, or the entire data set if the number of rows is fewer than 'N'. This option is the fastest method because the Data Integration Service stops processing source rows after reading the first 'N' rows from the source. However, if the data source changes over time, this option might not select a representative sample.
• Random N Rows – Selects the specified number of rows randomly throughout the data set. First, profiling determines the number of rows in the data set to compute the percentage of rows to sample. Determining the row count can be costly for data sources that do not track this metric. Then, profiling reads the entire data set and discards the rows that are not in the sample. This step generally takes more time than sampling the first N rows.
• Random Sample (Auto) – Uses the database to return a sample of rows based on more efficient sampling techniques that are specific to each database type. If a database does not support sampling pushdown, profiling uses a random sample of about 100,000 rows.
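To make the trade-off concrete, here is a minimal Python sketch, not Informatica's implementation, that contrasts the two approaches: First N Rows can stop reading the source early, while a random sample must scan the entire source. The sketch uses reservoir sampling, which avoids the separate row-count pass described above.

```python
import random
from typing import Iterable, Iterator, List

def first_n_rows(rows: Iterable[dict], n: int) -> Iterator[dict]:
    """First N Rows: stop consuming source rows after the first N."""
    for i, row in enumerate(rows):
        if i >= n:
            break  # fastest option: no further source rows are read
        yield row

def random_n_rows(rows: Iterable[dict], n: int) -> List[dict]:
    """Random N Rows: one full pass over the source; every row has an
    equal chance of landing in the sample (reservoir sampling)."""
    sample: List[dict] = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)
        else:
            j = random.randint(0, i)
            if j < n:
                sample[j] = row  # replace a reservoir slot at random
    return sample
```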
For an enterprise discovery (EDD) profile, the Complete and First N Rows sampling options are available at the job level. When the profile first creates the enterprise discovery job, it applies the job-level sampling option to all the data objects. If you require the Random N Rows or Random Sample (Auto) option, go to each source object in the EDD profile and change the option individually. When the enterprise discovery job runs, it picks up the updated sampling options.

The following table describes the sampling options that each profile type supports:

Sampling Option        Column    Data Domain   Primary Key   Foreign Key   EDD Default
                       Profile   Discovery     Profile       Profile
Complete               Yes       Yes           -             Yes           Column profiling, Foreign key profiling
First N Rows           Yes       Yes           Yes           -             Data domain discovery, Primary key profiling
Random N Rows          Yes       -             -             -             -
Random Sample (Auto)   Yes       -             -             -             -

Sample Size
After you decide to use a sampling option, the next question is what sample size to use. If the data source contains the full population, you can use any of the widely available sample size calculators. These calculators determine the random sample size based on the acceptable level of error.

Sample size charts depict the number of rows to select based on the data set size for 95% and 99% accuracy levels. A general guideline is to select 1,000 rows if 5% error is acceptable and 17,000 rows for 1% error.
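If you want to derive such numbers yourself rather than read them off a chart, Cochran's formula is the standard starting point. The sketch below is illustrative and is not the calculator the product uses; note that a 1% margin of error at 99% confidence yields roughly 16,600 rows, in line with the 17,000-row guideline.

```python
import math

def cochran_sample_size(margin_of_error: float, z: float,
                        proportion: float = 0.5) -> int:
    """Cochran's sample size for a large population. proportion=0.5 is
    the most conservative choice because it maximizes variance."""
    return math.ceil(z ** 2 * proportion * (1 - proportion)
                     / margin_of_error ** 2)

def finite_population_correction(n: int, population: int) -> int:
    """Shrink the sample when the data set is small relative to n."""
    return math.ceil(n / (1 + (n - 1) / population))

# z = 1.96 for 95% confidence, 2.576 for 99% confidence
print(cochran_sample_size(0.01, 2.576))  # ~16,590, near the 17,000 guideline
print(cochran_sample_size(0.05, 1.96))   # ~385; 1,000 rows is more conservative
```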

When you use the First N Rows sampling technique, you can treat the result as a random sample if there is no correlation between the rows. In this case, the general recommendation of a 1,000-row sample for 5% error and a 17,000-row sample for 1% error applies. Otherwise, use your best judgment to determine an appropriate sample size. Note that column profiling and data domain discovery are optimized for sample sizes of 100,000 rows or fewer.

Filtering
Filtering is another way to sample the data source. You must apply filters to individual tables because a filter is specific to the columns in that table. Use filtering to select the rows that meet specific criteria or goals.

Note: It is outside the scope of this document to make filter recommendations.

Profile Functions
The specific recommendations for each profiling function within an enterprise discovery profile vary based on the
screening and complete analysis use cases.

Data Domain Discovery
Data domain discovery uses data rules to discover the semantic content of a column. If a data value matches the data rule of a data domain, the value adds to the conformance count for that domain. A null represents a missing value and does not add to the count because it gives no information about the data domain.

For most columns, the data is consistent throughout the table. In these situations, if a data domain matches the
column, it matches with all parts of the table including the initial rows. Therefore, the recommendation is to sample the
first 1000 rows. This recommendation is also applicable for the screening analysis use case.

You might have a use case that requires a data domain to be inferred if even a single row matches its data rule. For these use cases, set the minimum conformance percent to zero and the sampling option to All Rows.
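The following sketch shows the conformance arithmetic in simplified form, with a hypothetical US ZIP code rule; it is not the product's matching engine. Nulls are excluded from the denominator, so they neither support nor penalize the domain.

```python
import re
from typing import Optional, Sequence

def domain_conformance(values: Sequence[Optional[str]],
                       rule: re.Pattern) -> float:
    """Percentage of non-null values that match the data rule.
    Nulls carry no evidence, so they are left out entirely."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    matches = sum(1 for v in non_null if rule.fullmatch(v))
    return 100.0 * matches / len(non_null)

zip_rule = re.compile(r"\d{5}(-\d{4})?")  # hypothetical ZIP code data rule
column = ["10001", "94105-1234", None, "ABC", "30301"]
pct = domain_conformance(column, zip_rule)  # 75.0: 3 of 4 non-null values
inferred = pct >= 75.0  # compare against the minimum conformance percent
```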

In the Analyst and Developer tools, you can use the Verify option for data domain discovery. Click Verify to get the counts of the conforming and non-conforming rows. You can also drill down into either set of rows. For interactive use cases, a small sample size provides a quick profile for an effective first-pass analysis. Then, use the Verify option only for those inferred data domains that require further investigation based on unusual results.

Column Profiling
Most column profiling use cases require you to select all rows for the analysis because they depend on the exact number of rows in the data source and on exact aggregate statistics. The general recommendation is to run a column profile on all rows of all the tables in the enterprise discovery profile.

There are a few use cases where you do not need to compute the aggregate statistics on the entire data source, such as the screening and interactive use cases. For screening, you might want to compute and analyze aggregates such as the null percentage, patterns, and data types. If the profile results for a column display unusual values, you can run a profile on that column with the entire data set.

For the interactive use case, you might know the approximate row count of a table. If the table has a high row count,
instead of profiling all the rows, use a sample to identify the problems in the data set. After the initial inspection, you
can decide whether to run a profile for the entire data set or not.

If you apply any of the sampling options, the recommendation is to use a sample size of 100,000 rows or fewer. This size ensures that the Data Integration Service (DIS) processes the sample entirely in one pass, which is faster than splitting the work between the DIS and the database.
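As an illustration of the aggregates involved, here is a minimal sketch of two common column profile computations, null percentage and value patterns; the actual product computes a much richer set of statistics.

```python
import re
from typing import Optional, Sequence

def column_stats(values: Sequence[Optional[str]]) -> dict:
    """Basic aggregates: row count, null percentage, distinct count."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    distinct = len({v for v in values if v is not None})
    return {"rows": total,
            "null_pct": 100.0 * nulls / total if total else 0.0,
            "distinct": distinct}

def value_pattern(value: str) -> str:
    """Character-class pattern, e.g. '94105-1234' -> '99999-9999'."""
    return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", value))

stats = column_stats(["A-12", None, "B-34", "B-34"])  # null_pct=25.0, distinct=2
pattern = value_pattern("A-12")                       # 'X-99'
```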

Primary Key Profiling
Primary key profiling requires a sample of the data set because the algorithm consumes many resources when the profile runs on the entire data set. Even with aggressive sampling, the primary key profiling algorithm can run for a long time depending on the complexity of the data. The general recommendation is to use a sample size equal to the square of the number of columns. For example, if the table has 50 columns, use a sample size of 2,500 (50²).

Primary key profile results might display too many candidate primary keys or false positives. You can reduce the false
positives in the following ways:

• Increase the number of rows in the sample. More rows reduce false positives by increasing the probability that the extra rows violate a false candidate key; a small sample might accidentally support these false positives. When you increase the number of rows, verify that there are enough resources to accommodate the computation that the algorithm requires. It is best to run primary key profiling first to get a baseline for the table before you increase the number of rows in the sample.
• Increase the Minimum Percent Conformance or decrease the Maximum Violation Rows. When the table has a strong primary key, a small violation threshold aggressively eliminates false positives.
• Decrease the Max Key Columns. Fewer potential columns in a key reduce the number of column combinations, and with fewer combinations the source data is less likely to support a false positive key. Follow this method if the schema does not contain any table with many primary key columns.
• Set the Exclude data objects with parameter to exclude the documented, user-defined, and approved keys. You can use this parameter if the data source enforces primary keys.

When you review the primary key profile results, you can use the Verify option to get the exact conformance of the key. All the duplicate rows and key columns that contain nulls count towards the number of violating rows, as the sketch below illustrates.
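A minimal sketch of the violation count, assuming the rule stated above: rows that share a duplicate key value and rows whose key contains a null both count as violating rows. The "square of the column count" sampling heuristic appears as a comment.

```python
from collections import Counter
from typing import Sequence, Tuple

def key_violations(rows: Sequence[dict], key_cols: Tuple[str, ...]) -> int:
    """Count violating rows for a candidate key: all rows that share a
    duplicate key value, plus all rows whose key contains a null."""
    null_rows = 0
    counts: Counter = Counter()
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if any(v is None for v in key):
            null_rows += 1
        else:
            counts[key] += 1
    duplicates = sum(n for n in counts.values() if n > 1)
    return null_rows + duplicates

# Recommended sample size: the square of the column count, e.g. 50**2 == 2500.
rows = [{"id": 1, "code": "A"},
        {"id": 1, "code": "B"},     # duplicate id: 2 violating rows
        {"id": None, "code": "C"}]  # null key: 1 violating row
assert key_violations(rows, ("id",)) == 3
```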

Foreign Key Profiling
Foreign key profiling does not require sampling because the profile uses all of the data for each source. When you run a foreign key profile for the first time, the profile computes the signatures, which is an expensive operation. Subsequent profiles compute signatures only for new tables. Therefore, after the initial computation, foreign key profiles run faster because the signature computation is already complete.

The exception to signature reuse is when you change certain foreign key profile parameters: the data type classification, case sensitivity, and whitespace trimming. When you change any of these parameters, or when you select the Regenerate Signature option, the profile recomputes the signatures because they take these parameters into consideration.
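You can picture this behavior as a signature cache keyed by the comparison parameters, so that a parameter change naturally forces recomputation. The following Python sketch illustrates only the caching idea; the product's actual signature algorithm is not documented here.

```python
import hashlib
from typing import Dict, Sequence, Tuple

# Cache keyed by table, column, and the comparison parameters, so changing
# any parameter produces a cache miss and a fresh, expensive computation.
_cache: Dict[Tuple, str] = {}

def column_signature(table: str, column: str, values: Sequence[str],
                     case_sensitive: bool, trim_whitespace: bool) -> str:
    key = (table, column, case_sensitive, trim_whitespace)
    if key in _cache:
        return _cache[key]  # reuse: subsequent profiles skip the full pass
    normalized = set()
    for v in values:
        if trim_whitespace:
            v = v.strip()
        if not case_sensitive:
            v = v.lower()
        normalized.add(v)
    digest = hashlib.sha256()
    for v in sorted(normalized):  # order-independent digest
        digest.update(v.encode("utf-8"))
    _cache[key] = digest.hexdigest()
    return _cache[key]
```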

Conclusion
Enterprise discovery, by default, is configured for the screening use case. You can use sampling to reduce the overall
cost of profiling a schema.

You can tune the default parameters to enable the exact computation of all the profiling results. If you have to use
sampling options in profiles, you can verify specific results to compute the exact values.

Authors
Jeff Millman
Development Architect
