Super Informatica Basics
OLTP vs OLAP

Feature        OLTP                             OLAP
Processing     Operational processing           Informational processing
Function       Transactional                    Analysis
Users          Clerk, DBA                       Knowledge worker
Focus          Day-to-day operations            Long-term informational requirements
Design         ER based, application oriented   Star/snowflake, subject oriented
Data           Detailed, flat relational        Summarized, multidimensional
Access         Read/Write                       Mostly read
Database size  10 MB to 100 MB                  100 MB to TB
1. Star Schema
In a star schema a central fact table is joined directly to a set of denormalized dimension tables.
[Figure: Star Schema — the Sales Fact table (Sale_id pk; Cust_id, Store_id, Product_id, Date_id fk) joined directly to the Customer, Product, Store, and Time dimensions.]
2. Snowflake Schema
The snowflake schema is a variant of the star schema in which some dimension tables are normalized, thereby splitting the data further into additional tables. The resulting schema graph forms a shape similar to a snowflake.
Adv: Space is minimized by splitting the data into normalized tables.
Disadv: It can hamper query performance due to the larger number of joins.
[Figure: Snowflake Schema — the Sales Fact table (Sale_id pk; Cust_id, Store_id, Product_id, Date_id fk) joined to the Customer, Product, Store, and Time dimensions, with the Store dimension further normalized into a City dimension and the Product dimension into an Item dimension.]
3. Galaxy Schema (Fact Constellation Schema)
Sophisticated applications may require multiple fact tables to share dimension tables. This type of schema can be viewed as a combination of stars and is hence called a galaxy schema or fact constellation schema.
[Figure: Galaxy Schema — two fact tables (Fact 1 and Fact 2) sharing dimension tables D1 through D9.]
Dimensions
Dimension tables are sometimes called lookup or reference tables.
1. Conformed Dimension: A dimension table which can be shared by multiple fact tables is known as a conformed dimension.
2. Junk Dimension: A dimension with descriptive, flag, or Boolean attributes which are not used to describe the key performance indicators (the facts). Examples: product description, address, phone number.
3. Slowly Changing Dimension: Dimensions that change over time are called slowly changing dimensions. For instance, a product price changes over time, people change their names, and country and state names may change. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time. Slowly changing dimensions are often categorized into three types, namely Type 1, Type 2, and Type 3. The following deals with how to capture and handle these changes over time.
Type 1: Overwriting the old values.
If the price of the product changes to $250 in the year 2005, the old values of the columns "Year" and "Product Price" are updated and replaced with the new values. With Type 1 there is no way to find out the old price of "Product1" in 2004, since the table now contains only the new price and year information.
Type 2: Creating an additional record.
With Type 2 the old values are not replaced; instead a new row containing the new values is added to the product table. At any point of time the difference between the old values and the new values can be retrieved and easily compared, which is very useful for reporting purposes.
Type 3: Adding a new column.
With Type 3 a separate column holds the previous value, so the current value and one prior value sit side by side in the same row.
Data Modeling
A Data model is a conceptual representation of data structures (tables) required for a
database and is very powerful in expressing and communicating the business
requirements.
A data model visually represents the nature of data, business rules governing the data,
and how it will be organized in the database.
Data modeling consists of three phases to design the database.
1. Conceptual Modeling
Understand the business requirements
Identify the entities (tables)
Identify the columns (attributes)
Identify the relationship
2. Logical Modeling
Design the tables with the required attributes.
3. Physical Modeling
Implement the logical tables physically in the database.
Data modeling tools
There are a number of data modeling tools to transform business requirements into a logical data model, and a logical data model into a physical data model. From the physical data model, these tools can be instructed to generate the SQL code for creating the database.
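For example, the physical model of the customer dimension used later in these notes might be generated as DDL along these lines (column sizes are illustrative):

CREATE TABLE Dim_Customer (
  CID    NUMBER(4) PRIMARY KEY,
  Cname  VARCHAR2(10),
  Gender VARCHAR2(1)
);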
INFORMATICA
Introduction
It is a GUI-based ETL product from Informatica Corporation.
It is a client/server technology.
It is developed using the Java language.
It is an integrated tool set (to design, to run, to monitor).
Versions:
1. 5.0
2. 6.0
3. 7.1.1
4. 8.1.1
5. 8.5
6. 8.6
Meta Data
Metadata is data about data: data that describes other data and structures, such as objects, business rules, and processes.
Example: table structure (column name, data type, precision, scale, and keys), description.
Mapping
A mapping is a GUI representation of the data flow from source to target; in other words, the definition of the relationship and data flow between source and target objects.
Requirements for mappings
a) Source Metadata
b) Business logic
c) Target Metadata
Repository
Central Database or Metadata Storage place
[Figure: the Informatica client tools connect to the repository, the working place in Informatica, alongside the source database.]
Staging Area
A place where data is processed before entering the warehouse.
Source System
A database, application, file, or other storage facility from which the data in a data
warehouse is derived.
Target System
A database, application, file, or other storage facility to which the "transformed source
data" is loaded in a data warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data,
typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation.
Examples include cleansing, aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data warehouse.
Two Flavors
1. Informatica PowerCenter, for large-scale industries.
2. Informatica PowerMart, for small-scale industries.
Components of Informatica
Client Components
1. Designer
2. Work flow Manager
3. Work flow Monitor
4. Repository Manager
5. Admin Console
[Figure: development flow — in the Designer, create a mapping (M_xyz) and save it to the repository. In the Workflow Manager, 1. create a session (S_xyz) for the mapping and save it, and 2. create a workflow with a Start task; the workflow executes on the Informatica server, where the Integration Service is responsible for execution. In the Workflow Monitor, monitor the running mappings and sessions. The Admin Console is used for administrative purposes.]
[Figure: example mapping — the source table Customer (CID number(4) pk, Cfname varchar2(5), Clname varchar2(5), Gender number(1)) in the source database is read through ODBC, transformed with CONCAT() and DECODE(), and loaded through ODBC into the target table Dim_Customer (CID number(4) pk, Cname varchar2(10), Gender varchar2(1)) in the data warehouse (target database).]
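The same transformation expressed in SQL terms (a sketch; the gender codes are assumed):

SELECT CID,
       Cfname || Clname               AS Cname,   -- CONCAT()
       DECODE(Gender, 1, 'M', 2, 'F') AS Gender   -- DECODE()
FROM   Customer;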
[Figure: PowerCenter architecture — the client tools (Designer: create source definitions, create target definitions, define T/R rules, design mappings; Workflow Manager: execute and schedule workflows; Workflow Monitor; Repository Manager) connect through the Repository Service to the repository, which stores mappings, source definitions, target definitions, T/R rules, sessions, workflows, session logs, and schedule information. The Integration Service extracts from the source database, transforms in the staging area, and loads into the target database; the Web Services Hub serves external clients.]
6. Repository Service
The Repository service manages connections to the power center
repository from client applications.
The Repository Service is a multithreaded process that inserts, retrieves, deletes, and updates metadata in the repository.
The Repository service ensures the consistency of the metadata in
the repository.
The following PowerCenter applications can access the Repository Service:
a) PowerCenter Client
b) Integration Service
c) Web Services Hub
d) Command Line Program (for backup, recovery, and other administrative purposes)
7. Integration Service
The Integration Service reads mapping and session information from the repository.
It extracts the data from the mapping sources and stores it in memory (the staging area), where it applies the transformation rules that you configure in the mapping.
The Integration Service loads the transformed data into the mapping targets.
The Integration Service connects to the repository through the Repository Service to fetch the metadata.
8. Web Services Hub
The Web Services Hub is a web service gateway for external clients.
Web service clients (for example, browsers such as Internet Explorer or Mozilla) access the Integration Service and Repository Service through the Web Services Hub.
It is used to run and monitor web-enabled workflows.
Definitions
Session: A session is a set of instructions that tells the Integration Service how and when to move data from a source to a target; a session is created for a single mapping.
Workflow: A workflow is a start task which contains a set of instructions to execute other tasks, such as sessions.
The workflow is the top object in the PowerCenter development hierarchy.
Schedule Workflow: A workflow schedule is an administrative task which specifies the date and time to run the workflow.
Transformation
A transformation is an object used to define business logic for processing the data.
Transformations can be categorized in two ways:
1. Based on the number of rows processed
2. Based on connection
Based on the number of rows processed, there are two types of transformation:
1. Active Transformation
2. Passive Transformation
Active Transformation:
A transformation which can affect the number of rows while data is moving from source to target is known as an active transformation.
The following are the active transformations used for processing data:
1. Source Qualifier Transformation
2. Filter Transformation
3. Aggregator Transformation
4. Joiner Transformation
5. Router Transformation
6. Rank Transformation
7. Sorter Transformation
8. Update Strategy Transformation
9. Transaction Control Transformation
10. Union Transformation
11. Normalizer Transformation
12. XML Source Qualifier
13. Java Transformation
14. SQL Transformation
Passive Transformation:
A transformation which does not affect the number of rows while data is moving from source to target is known as a passive transformation.
The following are the passive transformations used for processing data:
1. Expression Transformation
2. Sequence Generator Transformation
3. Stored Procedure Transformation
4. Lookup Transformation
Examples:
[Figure: active transformation — Emp (14 rows) → SQ_Emp → Filter SAL > 3000 (14 rows in, 6 rows out) → T_Emp. The row count changes, so the filter is active.]
[Figure: passive transformation — Emp (14 rows) → SQ_Emp → Expression Tax = Sal * 0.10 (14 rows in, 14 rows out) → T_Emp. The input ports (SAL, COM) pass through, and the output ports (Tax, Annual Sal) carry the calculated values; the row count is unchanged, so the expression is passive.]
Transformation rule: calculate Tax (Sal * 0.10) for the top 3 employees based on salary in dept 30.
DFD:
[Figure: Emp (14 rows) → SQ_EMP (14 rows) → Filter Dept = 30 (14 in, 6 out) → Rank Top 3 (6 in, 3 out) → Expression Tax = Sal * 0.10 (3 in, 3 out) → T_Emp (3 rows).]
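In SQL terms, the same requirement can be sketched with an analytic function (Oracle syntax):

SELECT empno, ename, sal, sal * 0.10 AS tax
FROM (
  SELECT e.*, RANK() OVER (ORDER BY sal DESC) AS rnk
  FROM   emp e
  WHERE  deptno = 30
)
WHERE rnk <= 3;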
1. Filter Transformation
This is an active transformation which allows you to filter the data based on a given condition.
-- A condition is created with three elements:
1. Port
2. Operator
3. Operand
The Integration Service evaluates the filter condition against each input record and returns TRUE or FALSE.
-- The Integration Service returns TRUE when the record satisfies the condition; those records are passed on for further processing or loading into the target.
-- The Integration Service returns FALSE when the input record does not satisfy the condition; those records are rejected by the filter transformation.
-- The filter transformation does not support the IN operator.
-- The filter transformation can send data to a single target only.
-- Use the filter transformation to perform data cleansing.
-- The filter transformation functions as a WHERE clause in terms of SQL.
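For example, a filter condition SAL > 3000 corresponds to:

SELECT *
FROM   emp
WHERE  sal > 3000;   -- records returning FALSE are rejected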
2. Rank Transformation
This is an active transformation which allows you to identify the top and bottom performers.
-- The rank transformation can be created with the following types of ports:
1. Input Port (I)
2. Output Port (O)
3. Rank Port (R)
4. Variable Port (V)
Rank Port: The port based on which the rank is determined is known as the rank port.
Variable Port: A port which can store data temporarily is known as a variable port.
The following properties need to be set for calculating the ranks:
1. Top/Bottom
2. Number of Ranks
The rank transformation is created by default with an output port called RANKINDEX.
Dense Ranking: It is the process of calculating the ranks within each group.
Sampling: It is the process of reading data of a specified size (number of records) for testing.
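In SQL terms, top-N ranking and per-group dense ranking can be sketched with analytic functions (Oracle syntax):

-- Top 3 earners overall
SELECT * FROM (
  SELECT e.*, RANK() OVER (ORDER BY sal DESC) AS rnk
  FROM   emp e
) WHERE rnk <= 3;

-- Dense ranking within each department (ranks per group)
SELECT empno, deptno, sal,
       DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS drnk
FROM   emp;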
3. Expression Transformation
This is a passive transformation which allows you to calculate an expression for each record.
The expressions can be defined only on the output ports.
Use the expression transformation to perform data cleansing and data scrubbing activities.
4. Sorter Transformation
This is an active transformation which sorts the data in ascending or descending order.
-- The port on which the sorting takes place is marked as a key.
-- Use the sorter transformation for eliminating duplicates.
5. Aggregator Transformation
This is an active transformation which allows you to calculate summaries for groups of records.
The aggregator transformation is created with the following four components:
1. Group By: It defines the group on a port for which the summaries are calculated, e.g. Deptno.
2. Aggregate Expression: Aggregate expressions can be developed only in the output ports, using aggregate functions such as:
-- sum( )
-- max( )
-- avg( )
3. Sorted Input: The aggregator transformation receives sorted data as input to improve the performance of the summary calculations.
The ports on which the group is defined must be sorted using a sorter transformation (only the group-by ports need to be sorted).
4. Aggregate Cache: The Integration Service creates the cache memory the first time the session executes.
-- The aggregate cache is stored on the server hard drive.
-- Incremental aggregation uses the aggregate cache to improve the performance of the session.
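In SQL terms, an aggregator grouped on Deptno corresponds to:

SELECT deptno,
       SUM(sal) AS sum_sal,
       MAX(sal) AS max_sal,
       AVG(sal) AS avg_sal
FROM   emp
GROUP BY deptno;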
Incremental Aggregation
It is the process of calculating the summary only for new records which pass through the mapping, using the historical cache.
Note: Sorted input and incremental aggregation cannot be used in the same session to achieve greater performance (the session fails because the ROWIDs will not match).
6. Lookup Transformation
This is a passive transformation which allows you to perform a lookup on relational tables, flat files, synonyms, and views.
-- When the mapping contains a lookup transformation, the Integration Service queries the lookup data and compares it with the transformation port values (EMP.DEPTNO = DEPT.DEPTNO).
-- A lookup transformation can be created with the following types of ports:
1. Input Port ( I )
2. Output Port ( O )
3. Lookup Port ( L )
4. Return Port ( R )
-- There are two types of lookup:
1. Connected
2. Unconnected
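Conceptually, a connected lookup on DEPT from the EMP pipeline behaves like an outer join (unmatched rows return NULL):

SELECT e.empno, e.ename, e.deptno,
       d.dname                    -- looked-up value
FROM   emp e
LEFT OUTER JOIN dept d
       ON e.deptno = d.deptno;    -- lookup condition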
7. Joiner Transformation
Merging data can be done in two ways: horizontally, using the joiner transformation (for example, an equi-join), and vertically, using the union transformation.
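The two kinds of merge in SQL terms (the emp_india and emp_us tables are hypothetical):

-- Horizontal merge: joiner transformation (equi-join)
SELECT e.empno, e.ename, d.dname
FROM   emp e, dept d
WHERE  e.deptno = d.deptno;

-- Vertical merge: union transformation (UNION ALL)
SELECT empno, ename FROM emp_india
UNION ALL
SELECT empno, ename FROM emp_us;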
8. Router Transformation
The router transformation is an active transformation which allows you to apply multiple conditions to load multiple target tables.
-- It is created with two types of groups:
1. Input group: receives the data from the source.
2. Output groups: send the data to the targets.
The output groups are also of two types:
1. User-defined groups, which allow you to apply a condition.
2. The default group, which captures the rejected records.
A router example in SQL terms follows the figure below.
[Figure: router — the Sales source flows through Sales_SQ into the router's input group; user-defined output groups State = HR, State = DL, and State = KA route rows to the State HR, State DL, and State KA targets, while the default group captures the remaining rows.]
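In SQL terms (Oracle), the router resembles a multi-table conditional insert; the target table and column names here are assumptions:

INSERT ALL
  WHEN state = 'HR' THEN INTO sales_hr    (sale_id, state, amount) VALUES (sale_id, state, amount)
  WHEN state = 'DL' THEN INTO sales_dl    (sale_id, state, amount) VALUES (sale_id, state, amount)
  WHEN state = 'KA' THEN INTO sales_ka    (sale_id, state, amount) VALUES (sale_id, state, amount)
  ELSE                   INTO sales_other (sale_id, state, amount) VALUES (sale_id, state, amount)
SELECT sale_id, state, amount FROM sales;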
9. Union Transformation
The union transformation combines multiple input flows into a single output flow.
It supports both homogeneous and heterogeneous sources.
It is created with two kinds of groups:
1. Input groups: receive the information.
2. Output group: sends the information to the target.
The union transformation works like UNION ALL in Oracle (see the sketch after the joiner section above).
Note: All the sources should have the same structure.
10. Stored Procedure Transformation
This is a passive transformation which is used to call a stored procedure in the database.
A stored procedure is a set of precompiled SQL statements which receives input and provides output.
There are two types of stored procedure transformation:
1. Connected Stored Procedure
2. Unconnected Stored Procedure
The following properties need to be set for the stored procedure transformation:
i. Normal
ii. Source Pre-load
iii. Source Post-load
iv. Target Pre-load
v. Target Post-load
Use the Normal property when the stored procedure performs a calculation.
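A minimal example of a stored procedure such a transformation might call (Oracle PL/SQL; the procedure and its logic are illustrative):

CREATE OR REPLACE PROCEDURE calc_tax (
  p_sal IN  NUMBER,   -- input received from the transformation
  p_tax OUT NUMBER    -- output returned to the transformation
) AS
BEGIN
  p_tax := p_sal * 0.10;
END;
/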
11. Source Qualifier Transformation
This is an active transformation which allows you to read the data from databases and flat files (text files).
SQL Override
It is the process of changing the default SQL using a source filter, user-defined joins, sorted input, and the elimination of duplicates (Distinct).
The source qualifier transformation supports SQL override when the source is a database.
The overridden logic is processed on the database server, so the business logic processing is shared between the Integration Service and the database server. This improves the performance of data acquisition.
User-Defined Joins: If two sources belong to the same database user account or the same ODBC connection, apply the join in the source qualifier rather than using a joiner transformation.
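A sketch of an overridden source qualifier query combining a user-defined join, a source filter, and sorted input:

SELECT DISTINCT e.empno, e.ename, e.sal, e.deptno, d.dname, d.loc
FROM   emp e, dept d
WHERE  e.deptno = d.deptno      -- user-defined join
AND    e.sal > 1000             -- source filter
ORDER  BY e.deptno;             -- sorted input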
Mapplet & Types of Mapplet
A mapplet is a reusable metadata object created with business logic using a set of transformations.
A mapplet is created using the Mapplet Designer tool.
There are two types of mapplet:
1. Active mapplet: created with a set of active transformations.
2. Passive mapplet: created with a set of passive transformations.
A mapplet can be reused in multiple mappings, with the following restrictions:
1. When you want to use a stored procedure transformation, it must be of the type Normal.
2. When you want to use a sequence generator transformation, you must use a reusable sequence generator transformation.
3. The following objects cannot be used in a mapplet:
i. Normalizer transformation
ii. XML source qualifier transformation
iii. Pre/post-session stored procedure transformation
iv. Mapplets (nested mapplets)
Note: A reusable transformation contains a single transformation; a mapplet contains a set of transformations.
Reusable Transformation
A reusable transformation is a reusable metadata object which contains business logic created with a single transformation.
[Figure: example — the sources Emp (Empno, Ename, Job, Sal, Deptno) and Dept (Deptno, Dname, Location) are joined in the source qualifier Emp S_Q to produce Emp_Dept (Empno, Ename, Job, Sal, Deptno, Dname, Loc).]
Scheduling Workflow
A schedule specifies the date and time to run the workflow.
There are two types of schedule:
1. Reusable schedule: A schedule which can be attached to multiple workflows is known as a reusable schedule.
2. Non-reusable schedule: A schedule which is created at the time of creating the workflow is known as a non-reusable schedule.
A non-reusable schedule can be converted into a reusable schedule.
Target Load Plan
A target load plan specifies the order in which data is extracted from the source qualifier transformations.
Flat Files
A flat file is an ASCII text file saved with the extension .txt or .csv.
There are two types of flat files:
1. Delimited flat files: Each field or column is separated by a special character such as a comma, tab, space, or semicolon.
2. Fixed-width flat files: A record of fixed length is split into multiple fields.
Note:
-- The relational reader reads data from relational sources.
-- The file reader reads data from flat files.
-- The XML reader reads data from XML files.
-- The relational writer writes data to relational targets.
-- The file writer writes data to flat-file targets.
-- The XML writer writes data to XML file targets.
-- The DTM (Data Transformation Manager) processes the business logic defined in the mapping.
The above readers, writers, and the DTM are known as Integration Service components.
File List
A file list is a list of flat files with the same data definition which need to be merged; they are read by setting the source file type to Indirect.
XML Source Qualifier Transformation
This transformation is used to read the data from XML files (just like a source qualifier).
Every XML source definition is associated by default with an XML source qualifier transformation.
XML is a case-sensitive markup language; files are saved with the extension .xml.
Note: XML files are case-sensitive files.
XML File Example:
Emp.xml
<EMP_DETAILS>
<EMP>
<EMPNO>100</EMPNO>
<ENAME>PRAKASH</ENAME>
<JOB>DEVELOPER</JOB>
<SAL>17000</SAL>
<DEPTNO>20</DEPTNO>
</EMP>
<EMP>
<EMPNO>200</EMPNO>
<ENAME>JITESH</ENAME>
<JOB>MANAGER</JOB>
<SAL>77000</SAL>
<DEPTNO>20</DEPTNO>
</EMP>
</EMP_DETAILS>
Normalizer Transformation
This is an active transformation which reads the data from COBOL file sources.
Every COBOL source definition is associated by default with a normalizer transformation.
The normalizer transformation functions like a source qualifier when reading the data from COBOL sources.
Use the normalizer transformation to convert a single input record from the source into multiple output records. This process is known as data pivoting.
Example:
File name: Account.txt

Year  Account  Month1  Month2  Month3
2008  Salary   25000   30000   28000
2008  Others    5000    6000    4000

Output:

Year  Account  Amount  Month
2008  Salary   25000   1
2008  Salary   30000   2
2008  Salary   28000   3
2008  Others    5000   1
2008  Others    6000   2
2008  Others    4000   3
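The same data pivoting in SQL terms (the account_tab table is hypothetical):

SELECT year, account, month1 AS amount, 1 AS month FROM account_tab
UNION ALL
SELECT year, account, month2 AS amount, 2 AS month FROM account_tab
UNION ALL
SELECT year, account, month3 AS amount, 3 AS month FROM account_tab;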
A transaction can also be controlled at the session level by using the Commit Interval property.
Sequence Generator Transformation
This is a passive transformation which allows you to generate sequence numbers to be used as primary keys.
-- A surrogate key is a system-generated sequence number used as a primary key to maintain history in dimension tables.
-- A surrogate key is also known as a dimension key, artificial key, or synthetic key.
-- A sequence generator transformation is created with two default output ports:
i. NEXTVAL
ii. CURRVAL
-- This transformation does not allow you to create new ports or edit the existing output ports.
This transformation is used in implementing slowly changing dimensions of Type 2, to maintain the history in the Type 2 SCD.
The following properties need to be set to generate the sequence numbers:
1. Start Value
2. Current Value
3. Increment By
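The database analogue is a sequence (Oracle syntax):

CREATE SEQUENCE cust_key_seq START WITH 1 INCREMENT BY 1;

-- NEXTVAL supplies the next surrogate key; CURRVAL returns the last value generated
SELECT cust_key_seq.NEXTVAL FROM dual;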
Update Strategy Transformation
This is an active transformation which flags the source records for insert, update, delete, and reject (data-driven) operations.
This transformation functions as a DML command in terms of SQL.
There are two different ways to implement an update strategy:
i. Using the update strategy transformation at the mapping level.
ii. Using the target table options at the session level.
Conditional update strategy expressions can be developed using the following constants:
DD_INSERT  0
DD_UPDATE  1
DD_DELETE  2
DD_REJECT  3
-- DD stands for Data Driven.
Ex: IIF(SAL > 3000, DD_INSERT, DD_REJECT)
The above expression can be implemented using the update strategy transformation at the mapping level.
CACHE
JOINER CACHE
How it Works
There are two types of cache memory: the index cache and the data cache.
All rows from the master source are loaded into cache memory.
The index cache contains all port values from the master source where the port is specified in the join condition.
The data cache contains all port values not specified in the join condition.
After the cache is loaded, the detail source is compared row by row to the values in the index cache.
Upon a match, the rows from the data cache are included in the stream.
Key Point
If there is not enough memory specified in the index and data cache properties, the overflow is written out to disk.
Performance Consideration
The master source should be the source that takes up the least amount of space in the cache.
Another performance consideration is sorting the data prior to the joiner transformation (sorted input).
Note: The index cache is saved with the extension .idx and the data cache with the extension .dat.
The cache is stored on the server hard drive.
[Figure: joiner cache — the index cache holds the join-condition port Deptno (10, 20, 30, 40); the data cache holds the remaining ports Dname (HR, IT, MKT, SALE) and Location (HYBD, NDLS, KA, CHE).]
LOOKUP CACHE
How it Works
There are two types of cache memory: the index cache and the data cache.
All port values from the lookup table where the port is part of the lookup condition are loaded into the index cache.
The data cache contains all port values from the lookup table that are not part of the lookup condition and are specified as output ports.
After the cache is loaded, values from the lookup input ports that are part of the lookup condition are compared to the index cache.
Upon a match, the rows from the cache are included in the stream.
Types of Lookup Cache
When the mapping contains a lookup transformation, the Integration Service queries the lookup data and stores it in the lookup cache.
The following are the types of cache created by the Integration Service.
1. Static Lookup Cache
This is the default lookup cache created by the Integration Service. It is a read-only cache and cannot be updated.
2. Dynamic Lookup Cache
This cache can be updated during the session run. It is particularly used when you perform a lookup on the target table while implementing a slowly changing dimension.
If the lookup table is the target, the cache is changed dynamically as the target load rows are processed.
New rows to be inserted or updated in the target are also written to the cache.
[Figure: dynamic lookup cache — the lookup transformation sends a lookup request to the dynamic lookup cache and receives a lookup response; rows written to the target table are also written to the cache.]
Business Purpose
In data warehousing, dimension tables are frequently updated, and changes to new row data must be captured within the load cycle.
New Lookup Row
0 - The Integration Service does not update or insert the row in the cache.
1 - The Integration Service inserts the row into the cache.
2 - The Integration Service updates the row in the cache.
Key Points
1. The lookup transformation's Associated Port matches a lookup input port with the corresponding port in the lookup cache.
2. Ignore Null Inputs for Updates should be checked for ports where null data in the input stream might otherwise overwrite the corresponding field in the lookup cache.
3. Ignore in Comparison should be checked for any port that is not to be compared.
4. The NewLookupRow flag indicates the type of row manipulation of the cache. If an input row creates an insert in the lookup cache, the flag is set to 1. If an input row creates an update of the lookup cache, the flag is set to 2. If no change is detected, the flag is set to 0. A filter or router transformation can be used with an update strategy transformation to set the proper row flag to update a target table.
Performance Consideration
A large lookup table may require more memory resources than are available; a SQL override in the lookup transformation can be used to reduce the amount of data cached.
Persistent Lookup Cache
The cache can be reused across multiple session runs, which improves the performance of the session.
AGGREGATE CACHE
How it Works
When the session executes on the Integration Service for the first time, the Integration Service creates an aggregate cache, which is made up of the index cache and the data cache.
The Integration Service uses the aggregate cache to perform incremental aggregation, which improves the performance of the session.
There are two types of cache memory: the index cache and the data cache.
All rows are loaded into the cache before any aggregation takes place.
The index cache contains the group-by port values.
The data cache contains:
-- variable ports and connected output ports;
-- non-group-by input ports used in non-aggregate output expressions;
-- non-group-by input/output ports;
-- local variable ports;
-- ports containing aggregate functions (allow three times the size).
One output row is produced for each unique occurrence of the group-by port values.
When you perform incremental aggregation, the Integration Service reads a record from the source and checks the index cache for the existence of the group value.
If the group value exists, it performs the aggregate calculation incrementally, using the historical cache.
If it does not find the group in the index cache, it creates the group and performs the aggregation.
Performance Consideration
Sorted Input: Aggregator performance can be increased when you sort the input in the same order as the aggregator group-by ports prior to the aggregation. The aggregator's Sorted Input property must be checked.
Relational source data can be sorted using an ORDER BY clause in the source qualifier override.
Flat-file source data can be sorted using an external sort application or the sorter transformation.
Cache size is also important in assuring optimal performance of the aggregator. Make sure that your cache size settings are large enough to accommodate all of the data; if they are not, the system will cache out to disk, causing a slowdown in performance.
[Figure: aggregate cache — the index cache holds the group-by port Deptno (10, 20, 30, 40); the data cache holds the aggregated values Sum(Sal) (8000, 12000, 6000, 99000).]
SORTER CACHE
How it Works
If the cache size specified in the properties exceeds the amount of memory available on the Integration Service process machine, the Integration Service fails the session.
All of the incoming data is passed into cache memory before the sort operation is performed.
If the amount of incoming data is greater than the specified cache size, PowerCenter temporarily stores the data in the sorter transformation's work directory.
Key Points
The Integration Service requires disk space of at least twice the amount of incoming data when storing data in the work directory.
Performance Consideration
Using a sorter transformation may improve performance over an ORDER BY clause in a SQL override in an aggregate session when the source is a database, because the source database may not be tuned with the buffer size needed for a database sort.
Performance Consideration in Various Transformations
Filter Transformation
Keep the filter transformation as close to the source qualifier as possible to filter the data early in the data flow.
If possible, move the filter condition into the source qualifier transformation.
Router Transformation
When splitting row data based on field values, a router transformation has a performance advantage over multiple filter transformations, because a row is read once into the input group but evaluated multiple times based on the number of groups, whereas multiple filter transformations require the same row data to be duplicated for each filter transformation.
Update Strategy Transformation
The update strategy transformation's performance can vary depending on the number of updates and inserts. In some cases there may be a performance benefit in splitting a mapping with updates and inserts into two mappings and sessions: one mapping with the inserts and the other with the updates.
Expression Transformation
Use operators instead of functions.
Ex: Instead of using the CONCAT function, use the || operator to concatenate two string fields.
Simplify complex expressions by defining variable ports.
Try to avoid the use of aggregate functions.
TASK and TYPES OF TASK
A task is defined as a set of instructions. There are two types of task.
i. Reusable Task: A task which can be used in multiple workflows is known as a reusable task. A reusable task is created using the Task Developer tool. Ex: Session, Command, Email.
ii. Non-Reusable Task: A task which is created and defined at the time of creating the workflow is known as a non-reusable task. Ex: Session, Command, Email, Decision task, Control task, Timer task, Event Wait task, Event Raise task, Worklet.
Note: A non-reusable task can be converted into a reusable task.
Types of Batch Processing
There are two types of batch processing.
i. Parallel batch processing: In parallel batch processing all the sessions start executing at the same point of time; the sessions execute concurrently.
[Figure: parallel batch processing — the workflow (WKF) starts sessions S-10, S-20, and S-30 at the same time.]
ii. Sequential batch processing: In sequential batch processing the sessions execute one after another.
[Figure: sequential batch processing — WKF runs S-10, then S-20, then S-30 in sequence, with link conditions such as $S-20: PrevTaskStatus = SUCCEEDED.]
The above pictorial representation is read as follows: if S-10 succeeded, then S-20 will execute, and so on.
WORKLET and TYPES OF WORKLET
A worklet is defined as a group of tasks. There are two types of worklet.
i. Reusable Worklet: A worklet which can be used in multiple workflows is known as a reusable worklet.
A reusable worklet is created using the Worklet Designer tool in the Workflow Manager.
A worklet is executed using a start task, i.e. from a workflow.
ii. Non-Reusable Worklet: A worklet which is created at the time of creating the workflow is known as a non-reusable worklet.
A non-reusable worklet can be converted into a reusable worklet.
COMMAND TASK
You can specify one or more shell commands to run during the workflow with a command task.
You specify the shell commands in the command task to delete reject files, copy files, etc.
Use the command task in the following ways:
1. Stand-alone command task: Use a command task anywhere in the workflow or worklet to run shell commands.
2. Pre-/post-session shell command: You can call a command task as the pre- or post-session shell command for a session task.
You can use any valid UNIX command for UNIX servers and any valid DOS command for Windows servers.
Ex: copy C:\test.txt D:\New Test
[Figure: WKF — session S-10 followed by a command task.]
Event Task
You can define events in the workflow to specify the sequence of task execution.
An event is triggered based on the completion of a sequence of tasks.
Use the following tasks to define events in the workflow.
i. Event Raise Task: The event raise task represents a user-defined event. When the Integration Service runs the event raise task, it triggers the event. Use the event raise task together with an event wait task to define the events.
ii. Event Wait Task: The event wait task waits for an event to occur. Once the event triggers, the Integration Service continues executing the rest of the workflow. You may specify the following types of events for the event wait and event raise tasks:
a) Pre-defined Event: A predefined event is a file-watch event. For predefined events, use an event wait task to instruct the Integration Service to wait for a specified indicator file to appear before continuing with the rest of the workflow. When the Integration Service locates the indicator file, it starts the next task in the workflow.
b) User-Defined Event: A user-defined event is a sequence of tasks in the workflow. Use an event raise task to specify the location of the user-defined event in the workflow.
Decision Task
You can enter a condition that determines the execution of the workflow with a decision task, similar to a link condition. The decision task has a predefined variable called $decision_task_name.condition that represents the result of the decision condition.
The Integration Service evaluates the condition in the decision task and sets the predefined condition variable to TRUE or FALSE.
Use a decision task instead of multiple link conditions in the workflow.
Timer Task
You can specify the period of time to wait before the Integration Service runs the next task in the workflow with the timer task.
The timer task has two types of settings:
i. Absolute time: You specify the time at which the Integration Service starts running the next task in the workflow.
ii. Relative time: You instruct the Integration Service to wait for a specified period of time after the timer task.
Ex: A workflow contains two sessions. You want the Integration Service to wait 10 minutes after the first session completes before it runs the second session.
Use a timer task after the first session; in the relative time setting of the timer task, specify 10 minutes for the start time.
Assignment Task
You can assign a value to a user-defined workflow variable with the assignment task.
To use an assignment task in the workflow, first create and add an assignment task to the workflow, then configure the assignment task to assign a value or expression to the user-defined variable.
Email Task
The email task is used to send an email within a workflow.
Note: Emails can also be sent post-session in a session task.
-- An email task can be used within a link condition to notify success or failure of a prior task.
pmcmd Utility
pmcmd is a command line program which communicates with the Integration Service.
Using pmcmd, the following tasks can be performed:
i. Start workflows
ii. Schedule workflows
iii. Get service details
iv. Ping the service
The following commands can be used with the pmcmd utility:
1. connect: Connects the pmcmd program to the Integration Service.
2. disconnect: Disconnects pmcmd from the Integration Service.
3. exit: Disconnects pmcmd from the Integration Service and closes the pmcmd program.
[Parameter file fragment — a section header [Folder.Session] followed by the assignment of a connection value to a $-prefixed session parameter.]
Tracing Level
A tracing level determines the amount of information written to the session log.
The following are the types of tracing levels:
1. Normal
2. Terse
3. Verbose Initialization
4. Verbose Data
The default tracing level is Normal.

Tracing Level           Description
Normal                  Logs initialization and status information, errors, and summaries of skipped rows.
Terse                   Logs initialization information, error messages, and notification of rejected data.
Verbose Initialization  Normal tracing plus additional initialization details, such as the names of index and data files.
Verbose Data            Verbose initialization tracing plus a log of each row that passes into the mapping.
Session Recovery
If you stop a session, or an error causes a session to stop, identify the reasons for the failure and start the session again using one of the following methods:
1. Restart the session if the Integration Service has not issued at least one commit.
2. Perform session recovery if the Integration Service has issued at least one commit.
When you start the recovery session, the Integration Service reads the ROWID of the last committed row from the OPB_SRVR_RECOVERY table.
The Integration Service reads all the source data and starts processing from the next ROWID.
DEBUGGER
The debugger is used to debug a mapping in order to check its business functionality.
Metadata Extension
A metadata extension provides information about the developer who created an object.
Metadata extension includes the following information.
1. Developer Name
2. Object Creation Date
3. Email ID
4. Desk Phone etc
UNIT TESTING
A unit test for the data warehouse is white-box testing. It should check the ETL procedures, the mappings, and the front-end reports.
Execute the following test cases.
1. Data Availability
Test Procedure
Connect to the source database with a valid username and password. Run a SQL query on the database to verify that the data is available in the table from which it needs to be extracted.
Expected Behavior
-- The login to the database should be successful.
-- The table should contain relevant data.
Actual Behavior
-- As expected
Test Result
-- Pass or Fail
2. Data Load/Insert
Ensure that records are being inserted into the target.
Test Procedure
i. Make sure that the target table does not contain any records.
ii. Run the mapping and check that records are inserted into the target table.
Expected Behavior
The target table should contain the inserted records.
Actual Behavior
-- As expected
Test Result
-- Pass
3. Data Load/Update
Ensure that updates are properly applied in the target.
Test Procedure
i. Make sure that some records are already in the target.
ii. Update the value of some field in a source table record which has already been loaded into the target.
iii. Run the mapping.
Expected Behavior
The target table should contain the updated record.
Actual Behavior
-- As expected
Test Result
-- Pass
4. Incremental Data Load
Ensure that the data from the source is properly populated into the target incrementally and without any data loss.
Test Procedure
i. Add new records with new values in addition to the already existing records in the source.
ii. Run the mapping.
Expected Behavior
Only the new records should be added to the target table.
Actual Behavior
-- As expected
Test Result
-- Pass
5. Data Accuracy
The data from the source should be populated into the target accurately.
Test Procedure
i. Add new records with new values in addition to the already existing records in the source.
ii. Run the mapping.
Expected Behavior
The column values in the target should be the same as the source values.
Actual Behavior
-- As expected
Test Result
-- Pass
6. Verify Data Loss
Check the number of records in the source and the target.
Test Procedure
i. Run the mapping and check the number of records inserted into the target and the number of records rejected.
Expected Behavior
The number of records in the source table should be equal to the number of records in the target table plus the number of rejected records.
Actual Behavior
-- As expected
Test Result
-- Pass
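A reconciliation sketch in SQL (the source and target table names are illustrative; the rejected count comes from the session log or the reject file):

SELECT (SELECT COUNT(*) FROM emp)   AS source_rows,
       (SELECT COUNT(*) FROM t_emp) AS target_rows
FROM   dual;
-- Pass when source_rows = target_rows + rejected_rows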
7. Verify Column Mapping
Verify that the source columns are properly linked to the target columns.
Test Procedure
i. Perform a manual check to confirm that the source columns are properly linked to the target columns.
Expected Behavior
The data from the source columns should be placed in target table accurately.
Actual Behavior
-- As expected
Test Result
-- Pass
8. Verify Naming Standard
Ensure that objects are created with industry-standard naming conventions.
Test Procedure
i. A manual check can be performed to verify the naming standard.
Expected Behavior
Objects should follow the appropriate naming standards.
Actual Behavior
-- As expected
Test Result
-- Pass
9. SCD Type 2 Mapping
Ensure that surrogate keys are properly generated for a dimensional change.
Test Procedure
i. Insert a new record with new values in addition to the already existing records in the source.
ii. Change the value of some field in a source table record which has already been loaded into the target, and run the mapping.
iii. Verify the target for appropriate surrogate keys.
Expected Behavior
The target table should contain appropriate surrogate keys for the insert and the update.
Actual Behavior
-- As expected
Test Result
-- Pass
SYSTEM TESTING
System testing is also called data validation testing.
The system and acceptance testing phases are usually separate. It might be more beneficial to combine the two phases in case of a tight timeline and budget constraints.
A simple technique is counting the number of records in the source table, which should tie up with the number of records in the target table plus the number of records rejected.
Test the rejects for business logic.
The ETL system is tested with full functionality and is expected to function as in production.
In many cases the dimension tables exist as masters in OLTP and can be checked directly.
Performance Testing and Optimization
The first step in performance tuning is to identify the performance bottleneck in the
following order.
1. Target
2. Source
3. Mapping
4. Session
5. System
The most common performance bottleneck occurs when the Integration Service writes data to the target.
1. Identifying Target Bottleneck
Test Procedure: A target bottleneck can be identified by configuring the session to write
to a flat file target.
Optimization:
i. Use bulk loading instead of normal load.
ii. Increase the commit interval.
iii. Drop the indexes of the target table before loading.
2. Identifying Source Bottleneck
Test Procedure: A source bottleneck can be identified by removing all the transformations in a test mapping; if the performance is similar, there is a source bottleneck.
Test Procedure: Add a filter condition set to false after the source qualifier so that no data is processed past the filter transformation. If the time it takes to run the new session is the same as the original session, there is a source bottleneck.
Optimization:
i. Create indexes.
ii. Optimize the query, e.g. using hints and the WHERE clause.
3. Identifying Mapping Bottleneck
Test Procedure: Add a filter condition before each target definition and set the condition to false so that no records are loaded into the target.
If the time it takes to run the new session is the same as the original session, there is a mapping bottleneck.
Optimization:
i. Joiner Transformation
1. Use sorted input.
2. Define as the master source the source which occupies the least amount of memory in the cache.
ii. Aggregator Transformation
1. Use sorted input.
2. Use incremental aggregation with the aggregate cache.
3. Group by simpler ports, preferably numeric ports.
iii. Lookup Transformation
1. Define a SQL override on the lookup table.
2. Use a persistent lookup cache.
iv. Expression Transformation
1. Use operators instead of functions.
2. Avoid the use of aggregate function calls.
3. Simplify expressions by creating variable ports.
v. Filter Transformation
1. Keep the filter transformation as close to the source qualifier as possible to filter the data early in the data flow.
4. Identifying Session Bottleneck
Test Procedure: Use Collect Performance Details to identify a session bottleneck. Low (0-20%) buffer input efficiency and buffer output efficiency counter values indicate a session bottleneck.
Optimization:
Tune the following parameters in the session:
1. DTM buffer size (6 MB to 128 MB)
2. Buffer block size (4 KB to 128 KB)
3. Data cache size (2 MB to 24 MB)
4. Index cache size (1 MB to 12 MB)
Test Procedure: Double-click the session, open the Properties tab, select Collect Performance Data, click Apply and OK, then execute the session.
The Integration Service creates a performance file saved with the extension .perf, located in the session log directory.
5. Identifying System Bottleneck
If there is no target, source, mapping, or session bottleneck, then there may be a system bottleneck. Use system tools to monitor CPU usage and memory usage.
On the Windows operating system use the Task Manager; on the UNIX operating system use system tools such as iostat and sar.
Optimization:
Improve network speed.
Improve CPU usage.
SQL Transformation
The SQL transformation processes SQL queries in the pipeline. You can insert, delete, update, and retrieve rows from a database. You can pass the database connection information to the SQL transformation as input data at run time.
You can configure the SQL transformation to run in the following modes.
1. Script Mode: A SQL transformation running in script mode runs SQL scripts from text files.
You pass each script file name from the source to the SQL transformation using the script name port.
The script file name contains the complete path to the script file.
A SQL transformation configured for script mode has the following default ports:
i. ScriptName - input port
ii. ScriptResult - output port (returns PASSED if the script execution succeeded, otherwise FAILED)
iii. ScriptError - output port (returns error messages)
2. Query Mode: When a SQL transformation runs in query mode, it executes a SQL query that you define in the transformation.
When you configure the SQL transformation to run in query mode, you create an active transformation; the transformation can return multiple rows for each input row.
Unconnected Stored Procedure
An unconnected stored procedure transformation is not part of the data flow. It can be called from other transformations using the :SP( ) identifier.
An unconnected stored procedure acts as a function that can be called from other transformations, such as an expression transformation.
An unconnected stored procedure can receive multiple inputs but provides a single output.
Difference between Connected and Unconnected Lookup Transformation

   Connected Lookup                              Unconnected Lookup
1  Part of the mapping data flow.                Separate from the mapping data flow.
2  Returns multiple values (by linking output    Returns one value, by checking the Return Port
   ports to another transformation).             option for the output port that provides the
                                                 return value.
3  Executed for every record passing through     Executed only when the lookup function is
   the transformation.                           called.
4  More visible: shows where the lookup          Less visible, as the lookup is called from an
   values are used.                              expression within another transformation.
5  Default values are used.                      Default values are ignored.