Why ETL


Think of GE: the company has over 100 years of history and a presence in almost all industries. Over
these years the company's management systems have changed from bookkeeping to SAP, and this
transition was not a single-day transition. In moving from bookkeeping to SAP, they used a wide array of
technologies: hardware ranging from mainframes to PCs, data storage ranging from flat files to relational
databases, and programming languages ranging from COBOL to Java. This transformation resulted in
different businesses, or to be precise different sub-businesses within a business, running different
applications, different hardware, and different architectures, with technologies introduced as and when
invented and as and when required.

This directly results in scenarios like the HR department of the company running Oracle Applications,
Finance running SAP, some part of the process chain supported by mainframes, some data stored in
Oracle, some data on mainframes, some data in VSAM files, and the list goes on. If one day the company
requires a consolidated report of its assets, there are two ways to produce it:

 First, completely manual: generate different reports from the different systems and integrate them by hand.
 Second, fetch all the data from the different systems/applications, build a Data Warehouse, and
generate reports from it as per the requirement.

Obviously the second approach is going to be the best.


Now, fetching the data from different systems, making it coherent, and loading it into a Data Warehouse
requires some kind of extraction, cleansing, integration, and load. ETL stands for Extraction,
Transformation & Load.

ETL tools provide the facility to extract data from different, non-coherent systems, cleanse it, merge it,
and load it into target systems.

ETL Architecture:

Extract

The first part of an ETL process involves extracting the data from the source systems. In many cases this
is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent
processes will go.
Most data warehousing projects consolidate data from different source systems. Each separate system
may also use a different data organization/format. Common data source formats are relational databases
and flat files, but may include non-relational database structures such as Information Management
System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed
Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or
screen-scraping. The streaming of the extracted data source and load on-the-fly to the destination
database is another way of performing ETL when no intermediate data storage is required. In general, the
goal of the extraction phase is to convert the data into a single format which is appropriate for
transformation processing.

An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an
expected pattern or structure. If not, the data may be rejected entirely or in part.
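The parsing check described above can be sketched in plain Python (a generic illustration, not any particular ETL tool's API; the field names and types are assumptions):

```python
# Each extracted record is checked against an expected structure;
# records that fail the check are rejected rather than passed on.
EXPECTED_FIELDS = {"roll_no": int, "salary": float}  # assumed schema

def parse_record(raw):
    """Return a typed record, or None if it does not match the expected pattern."""
    try:
        return {name: cast(raw[name]) for name, cast in EXPECTED_FIELDS.items()}
    except (KeyError, ValueError, TypeError):
        return None  # structural check failed

def extract(rows):
    """Split raw rows into accepted (typed) records and rejected ones."""
    accepted, rejected = [], []
    for raw in rows:
        rec = parse_record(raw)
        if rec is None:
            rejected.append(raw)
        else:
            accepted.append(rec)
    return accepted, rejected
```

A real system would typically log or quarantine the rejected rows rather than silently drop them.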

Transform

The transform stage applies a series of rules or functions to the extracted data from the source to
derive the data for loading into the end target. Some data sources will require very little or even no
manipulation of data. In other cases, one or more of the following transformation types may be required
to meet the business and technical needs of the target database:

 Selecting only certain columns to load (or selecting null columns not to load). For example, if the
source data has three columns (also called attributes), for example roll_no, age, and salary, then
the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all
those records where salary is not present (salary = null).
 Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the
warehouse stores M for male and F for female)
 Encoding free-form values (e.g., mapping "Male" to "1")
 Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
 Sorting
 Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
 Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store,
and for each region, etc.)
 Generating surrogate-key values
 Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
 Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string
in one column as individual values in different columns)
 Disaggregation of repeating columns into a separate detail table (e.g., moving a series of
addresses in one record into single addresses in a set of records in a linked address table)
 Lookup and validate the relevant data from tables or referential files for slowly changing
dimensions.
 Applying any form of simple or complex data validation. If validation fails, it may result in a full,
partial or no rejection of the data, and thus none, some or all the data is handed over to the next
step, depending on the rule design and exception handling. Many of the above transformations
may result in exceptions, for example, when a code translation parses an unknown code in the
extracted data.
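A few of the transformation types listed above (column selection, null filtering, code translation, and deriving a calculated value) can be sketched in plain Python. This is an illustration only; the row shape and the extra columns qty and unit_price are assumptions:

```python
# Translating coded values: source stores 1/2, warehouse stores M/F.
GENDER_CODES = {"1": "M", "2": "F"}

def transform(row):
    """Apply several of the listed transformation types to one row."""
    if row.get("salary") is None:   # ignore records where salary is null
        return None
    return {
        "roll_no": row["roll_no"],                      # select only certain columns
        "gender": GENDER_CODES[row["gender"]],          # translate coded value
        "sale_amount": row["qty"] * row["unit_price"],  # derive calculated value
    }
```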

Load

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the
requirements of the organization, this process varies widely. Some data warehouses may overwrite
existing information with cumulative information; refreshing the extracted data is frequently done on a
daily, weekly, or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a
historicized form, for example, hourly. To understand this, consider a DW that is required to maintain
sales records of the last year. This DW overwrites any data older than a year with newer data, but within
the one-year window entries are made in a historicized manner. The timing and scope of replace or
append are strategic design choices dependent on the time available and the business needs.
More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.
As the load phase interacts with a database, the constraints defined in the database schema — as well as
in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory
fields), which also contribute to the overall data quality performance of the ETL process.
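The two load styles described above, overwriting current state versus appending in historicized form, can be contrasted in a minimal sketch (in-memory lists stand in for warehouse tables; the column names are illustrative):

```python
def load_overwrite(table, row, key="id"):
    """Replace any existing row with the same key (current-state warehouse)."""
    table[:] = [r for r in table if r[key] != row[key]]
    table.append(row)

def load_historicized(table, row, as_of):
    """Append a new version stamped with its load time; history is kept."""
    table.append({**row, "as_of": as_of})
```

In a real warehouse the same choice shows up as `UPDATE`/`MERGE` versus `INSERT` with an effective-date column.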

 For example, a financial institution might have information on a customer in several departments
and each department might have that customer's information listed in a different way. The
membership department might list the customer by name, whereas the accounting department
might list the customer by number. ETL can bundle all this data and consolidate it into a uniform
presentation, such as for storing in a database or data warehouse.

 Another way that companies use ETL is to move information to another application permanently.
For instance, the new application might use another database vendor and most likely a very
different database schema. ETL can be used to transform the data into a format suitable for the
new application to use.

 An example of this would be an Expense and Cost Recovery System (ECRS) such as used by
accountancies, consultancies and lawyers. The data usually ends up in the time and billing system,
although some businesses may also utilize the raw data for employee productivity reports to
Human Resources (personnel dept.) or equipment usage reports to Facilities Management.

Challenges

ETL processes can involve considerable complexity, and significant operational problems can occur with
improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of
designers at the time validation and transformation rules are specified. Data profiling of a source during
data analysis can identify the data conditions that will need to be managed by transform rules
specifications. This will lead to an amendment of validation rules explicitly and implicitly implemented in
the ETL process.

Data warehouses are typically assembled from a variety of data sources with different formats and
purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous
environment.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This
includes understanding the volumes of data that will have to be processed within service level
agreements. The time available to extract from source systems may change, which may mean the same
amount of data may have to be processed in less time. Some ETL systems have to scale to process
terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data
may require designs that can scale from daily batch to multiple-day micro batch to integration with
message queues or real-time change-data capture for continuous transformation and update.

Performance

ETL vendors benchmark their record systems at multiple terabytes (TB) per hour (or ~1 GB per second)
using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and
plenty of memory. The fastest ETL record is currently held by Syncsort,[1] Vertica, and HP at 5.4 TB in
under an hour, which is more than twice as fast as the earlier record held by Microsoft and Unisys.

In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may
perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus,
for better performance, it may make sense to employ:

 a Direct Path Extract method or bulk unload whenever possible (instead of querying the database), to reduce
the load on the source system while getting a high-speed extract
 most of the transformation processing outside of the database
 bulk load operations whenever possible.

Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Some
common methods used to increase performance are:

 Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the
partitioning).
 Do all validation in the ETL layer before the load. Disable integrity checking (disable constraint ...) in the target
database tables during the load.
 Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a
separate step.
 Generate IDs in the ETL layer (not in the database).
 Drop the indices (on a table or partition) before the load - and recreate them after the load (SQL: drop
index ...; create index ...).
 Use parallel bulk load when possible; this works well when the table is partitioned or there are no indices.
Note: attempts to do parallel loads into the same table (partition) usually cause locks, if not on the data rows,
then on the indices.
 If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in
which way in the ETL layer, and then process these three operations in the database separately. You often can
do bulk load for inserts, but updates and deletes commonly go through an API (using SQL).
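The last tip, working out in the ETL layer which rows are inserts, updates, or deletes before touching the database, might be sketched like this (the key column and row shape are assumptions for illustration):

```python
def classify(incoming, existing_keys, key="id"):
    """Split incoming rows into inserts/updates and compute keys to delete.

    incoming:      rows from the ETL layer
    existing_keys: set of keys already present in the target table
    """
    inserts, updates, seen = [], [], set()
    for row in incoming:
        seen.add(row[key])
        (updates if row[key] in existing_keys else inserts).append(row)
    deletes = existing_keys - seen  # in target but absent from the feed
    return inserts, updates, deletes
```

The `inserts` list can then go through a fast bulk-load path, while `updates` and `deletes` are issued separately through the database API.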

Whether to do certain operations in the database or outside may involve a trade-off. For example,
removing duplicates using distinct may be slow in the database, so it makes sense to do it outside. On
the other hand, if using distinct significantly (say, 100x) decreases the number of rows to be extracted,
then it makes sense to remove duplicates as early as possible, in the database, before unloading the data.

A common source of problems in ETL is a large number of dependencies among ETL jobs, for example, job
"B" cannot start while job "A" is not finished. You can usually achieve better performance by visualizing all
processes on a graph and trying to reduce that graph, making maximum use of parallelism and keeping
"chains" of consecutive processing as short as possible. Again, partitioning big tables and their
indices can really help.

Another common issue occurs when the data is spread between several databases, and processing is done
in those databases sequentially. Sometimes database replication may be involved as a method of copying
data between databases - and this can significantly slow down the whole process. The common solution is
to reduce the processing graph to only three layers:

 Sources
 Central ETL layer
 Targets

This allows processing to take maximum advantage of parallel processing. For example, if you need to
load data into two databases, you can run the loads in parallel (instead of loading into 1st - and then
replicating into the 2nd).

Of course, sometimes processing must take place sequentially. For example, you usually need to get
dimensional (reference) data before you can get and validate the rows for main "fact" tables.

Parallel processing

A recent development in ETL software is the implementation of parallel processing. This has enabled a
number of methods to improve overall performance of ETL processes when dealing with large volumes of
data.

ETL applications implement three main types of parallelism:

 Data: By splitting a single sequential file into smaller data files to provide parallel access.
 Pipeline: Allowing the simultaneous running of several components on the same data stream. For example:
looking up a value on record 1 at the same time as adding two fields on record 2.
 Component: The simultaneous running of multiple processes on different data streams in the same job, for
example, sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate combined in a single job.
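Data parallelism, the first of the three types, can be sketched in Python; pipeline and component parallelism would layer further concurrency on top. This is a simplified illustration, not how any specific ETL engine implements it:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Round-robin split of one sequential input into n partitions."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def parallel_transform(rows, fn, workers=4):
    """Apply fn to every row, processing the partitions concurrently."""
    parts = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: [fn(r) for r in part], parts)
    return [row for part in results for row in part]
```

Note that row order is only preserved within each partition, which is why downstream steps in such designs either sort or treat the stream as unordered.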

An additional difficulty comes with making sure that the data being uploaded is relatively consistent.
Because multiple source databases may have different update cycles (some may be updated every few
minutes, while others may take days or weeks), an ETL system may be required to hold back certain data
until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents
in a source system or with the general ledger, establishing synchronization and reconciliation points
becomes necessary.

What is Informatica?

Informatica is a tool supporting all the steps of the Extraction, Transformation and Load process. Nowadays
Informatica is also being used as an integration tool.

Informatica is an easy-to-use tool. It has a simple visual interface, like forms in Visual Basic. You just need to drag
and drop different objects (known as transformations) and design the process flow for data extraction, transformation,
and load. These process-flow diagrams are known as mappings. Once a mapping is made, it can be scheduled to run as
and when required. In the background, the Informatica server takes care of fetching data from the source, transforming
it, and loading it into the target systems/databases.

Informatica can communicate with all major data sources (mainframe/RDBMS/flat files/XML/VSAM/SAP etc.) and can
move and transform data between them. It can move huge volumes of data very effectively, often better than even
bespoke programs written for one specific data movement. It can throttle transactions (doing big updates in small
chunks to avoid long locks and a full transaction log). It can effectively join data from two distinct data sources
(even an XML file can be joined with a relational table). In all, Informatica has the ability to effectively
integrate heterogeneous data sources and convert raw data into useful information.

Before we actually start working in Informatica, let's get an idea about the company owning this wonderful
product. Some facts and figures about Informatica Corporation:

 Founded in 1993, based in Redwood City, California
 1,400+ employees; 3,450+ customers; 79 of the Fortune 100 companies
 NASDAQ stock symbol: INFA; stock price: $18.74 (09/04/2009)
 Revenues in fiscal year 2008: $455.7M
 Informatica Developer Network: 20,000 members

In short, Informatica is the world's leading ETL tool, and it is rapidly gaining market share as an enterprise integration platform.

Informatica Software Architecture illustrated

Informatica's ETL product, known as Informatica PowerCenter, consists of 3 main components.

1. Informatica PowerCenter Client Tools:

These are the development tools installed at the developer's end. These tools enable a developer to:

 Define the transformation process, known as a mapping (Designer)
 Define run-time properties for a mapping, known as sessions (Workflow Manager)
 Monitor execution of sessions (Workflow Monitor)
 Manage the repository, useful for administrators (Repository Manager)
 Report metadata (Metadata Reporter)

2. Informatica PowerCenter Repository:

The repository is the heart of the Informatica tools. It is a kind of data inventory where all the data
related to mappings, sources, targets, etc. is kept. This is the place where all the metadata for your
application is stored. All the client tools and the Informatica server fetch data from the repository. An
Informatica client and server without a repository are like a PC without memory/hard disk: able to
process data but with no data to process. The repository can be treated as the backend of Informatica.

3. Informatica PowerCenter Server:

The server is where all the executions take place. The server makes physical connections to
sources/targets, fetches data, applies the transformations mentioned in the mapping, and loads the data
into the target system. This architecture is illustrated in the diagram below.

Sources and targets (both sides of the original diagram list the same categories):

 Standard: RDBMS, flat files, XML, ODBC
 Applications: SAP R/3, SAP BW, PeopleSoft, Siebel, JD Edwards, i2
 EAI: MQ Series, Tibco, JMS, Web Services
 Legacy: mainframes (DB2, VSAM, IMS, IDMS, Adabas), AS400 (DB2, flat files)
 Remote sources and remote targets

Informatica Product Line

Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data
integration and ETL software.

The important products provided by Informatica Corporation are listed below:

 Power Center
 Power Mart
 Power Exchange
 Power Center Connect
 Power Channel
 Metadata Exchange
 Power Analyzer
 Super Glue

Power Center & Power Mart: Power Mart is a departmental version of Informatica for building,
deploying, and managing data warehouses and data marts. PowerCenter is used for the corporate enterprise
data warehouse, while Power Mart is used for departmental data warehouses such as data marts. PowerCenter
supports global and networked repositories and can be connected to several sources. Power Mart supports a
single repository and can be connected to fewer sources than PowerCenter. A Power Mart deployment can
grow into an enterprise implementation, and its codeless environment makes for easy developer productivity.

Power Exchange: Informatica PowerExchange, as a standalone service or along with PowerCenter,
helps organizations leverage data by avoiding manual coding of data-extraction programs. PowerExchange
supports batch, real-time, and changed-data-capture options for mainframe (DB2, VSAM, IMS, etc.),
midrange (AS400 DB2, etc.), and relational databases (Oracle, SQL Server, DB2, etc.), and for flat files
on Unix, Linux, and Windows systems.

Power Center Connect: This is an add-on to Informatica PowerCenter. It helps to extract data and
metadata from systems such as IBM's MQSeries, PeopleSoft, SAP, Siebel, and other third-party
applications.

Power Channel: This helps to transfer large amounts of encrypted and compressed data over LAN or WAN,
through firewalls, transfer files over FTP, etc.

Meta Data Exchange: Metadata Exchange enables organizations to take advantage of the time and effort
already invested in defining data structures within their IT environment when used with PowerCenter. For
example, an organization may be using data-modeling tools such as Erwin, Embarcadero, Oracle Designer,
or Sybase PowerDesigner for developing data models. Functional and technical teams will have spent much
time and effort creating the data model's data structures (tables, columns, data types, procedures,
functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into
PowerCenter to identify source and target mappings, which leverages that time and effort; there is no
need for an Informatica developer to create these data structures once again.

Power Analyzer: PowerAnalyzer provides organizations with reporting facilities. It makes accessing,
analyzing, and sharing enterprise data simple and easily available to decision makers, and enables them
to gain insight into business processes and develop business intelligence.

With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from
data stored in a data warehouse, data mart, operational data store, or other data storage models.
PowerAnalyzer works best with a dimensional data warehouse in a relational database, but it can also run
reports on data in any relational table that does not conform to the dimensional model.

Super Glue: SuperGlue is used for loading metadata from several sources into a centralized place. Reports
can be run against SuperGlue to analyze metadata.

Informatica Power Center Client:

The PowerCenter Client consists of the following applications that we use to manage the repository,
design mappings and mapplets, and create sessions to load the data:

1. Designer
2. Data Stencil
3. Repository Manager
4. Workflow Manager
5. Workflow Monitor

1. Designer:

Use the Designer to create mappings that contain transformation instructions for the Integration Service.

The Designer has the following tools that you use to analyze sources, design target Schemas, and build
source-to-target mappings:

 Source Analyzer: Import or create source definitions.
 Target Designer: Import or create target definitions.
 Transformation Developer: Develop transformations to use in mappings. You can also develop
user-defined functions to use in expressions.
 Mapplet Designer: Create sets of transformations to use in mappings.
 Mapping Designer: Create mappings that the Integration Service uses to extract, transform, and load data.
2. Data Stencil

Use the Data Stencil to create mapping templates that can be used to generate multiple mappings. Data
Stencil uses the Microsoft Office Visio interface to create mapping templates. It is not usually used by a
developer.

3. Repository Manager

Use the Repository Manager to administer repositories. You can navigate through multiple folders and
repositories, and complete the following tasks:

 Manage users and groups: Create, edit, and delete repository users and user groups. We can assign and
revoke repository privileges and folder permissions.
 Perform folder functions: Create, edit, copy, and delete folders. Work we perform in the Designer and
Workflow Manager is stored in folders. If we want to share metadata, we can configure a folder to be shared.
 View metadata: Analyze sources, targets, mappings, and shortcut dependencies, search by keyword, and
view the properties of repository objects. We create repository objects using the Designer and Workflow
Manager client tools.

We can view the following objects in the Navigator window of the Repository Manager:

 Source definitions: Definitions of database objects (tables, views, synonyms) or files that provide source
data.
 Target definitions: Definitions of database objects or files that contain the target data.
 Mappings: A set of source and target definitions along with transformations containing business logic that
you build into the transformation. These are the instructions that the Integration Service uses to transform
and move data.
 Reusable transformations: Transformations that we use in multiple mappings.
 Mapplets: A set of transformations that you use in multiple mappings.
 Sessions and workflows: Sessions and workflows store information about how and when the Integration
Service moves data. A workflow is a set of instructions that describes how and when to run tasks related to
extracting, transforming, and loading data. A session is a type of task that you can put in a workflow. Each
session corresponds to a single mapping.

4. Workflow Manager:

Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that
describes how and when to run tasks related to extracting, transforming, and loading data.

The Workflow Manager has the following tools to help us develop a workflow:

 Task Developer: Create the tasks we want to accomplish in the workflow.
 Worklet Designer: Create a worklet in the Worklet Designer. A worklet is an object that groups a set of
tasks. A worklet is similar to a workflow, but without scheduling information. We can nest worklets inside a
workflow.
 Workflow Designer: Create a workflow by connecting tasks with links in the Workflow Designer. You can
also create tasks in the Workflow Designer as you develop the workflow.

When we create a workflow in the Workflow Designer, we add tasks to the workflow. The Workflow
Manager includes tasks, such as the Session task, the Command task, and the Email task so you can
design a workflow. The Session task is based on a mapping we build in the Designer.

We then connect tasks with links to specify the order of execution for the tasks we created. Use
conditional links and workflow variables to create branches in the workflow.
5. Workflow Monitor

Use the Workflow Monitor to monitor scheduled and running workflows for each Integration Service. We
can view details about a workflow or task in Gantt Chart view or Task view, and we can run, stop, abort,
and resume workflows from the Workflow Monitor. We can view session and workflow log events in the
Workflow Monitor Log Viewer.

The Workflow Monitor displays workflows that have run at least once. It continuously receives information
from the Integration Service and Repository Service, and also fetches information from the repository to
display historic information.

Informatica Architecture:

Informatica PowerCenter is not just a tool but an end-to-end data processing and data integration
environment. It enables organizations to collect, centrally process, and redistribute data. It can be used
simply to integrate two different systems, like SAP and MQ Series, or to load data warehouses or
Operational Data Stores (ODS).

Now Informatica PowerCenter also includes many add-on tools to report the data being processed,
business rules applied and quality of data before and after processing.

To facilitate this PowerCenter is divided into different components:

 PowerCenter Domain: As Informatica says, "The PowerCenter domain is the primary unit for
management and administration within PowerCenter". Doesn't make much sense? Right... So here
is a simpler version: the PowerCenter domain is the collection of all the servers required to support
PowerCenter functionality. Each domain has gateway hosts (called domain servers). Whenever you
want to use a PowerCenter service, you send a request to the domain server; based on the request
type, it redirects your request to one of the PowerCenter services.
 PowerCenter Repository: Repository is nothing but a relational database which stores all the
metadata created in Power Center. Whenever you develop mapping, session, workflow, execute
them or do anything meaningful (literally), entries are made in the repository.
 Integration Service: Integration Service does all the real job. It extracts data from sources,
processes it as per the business logic and loads data to targets.
 Repository Service: Repository Service is the one that understands the content of the repository,
fetches data from the repository, and sends it back to the requesting components (mostly the client
tools and the Integration Service).
 PowerCenter Client Tools: The PowerCenter Client consists of multiple tools. They are used to
manage users, define sources and targets, build mappings and mapplets with the transformation
logic, and create workflows to run the mapping logic. The PowerCenter Client connects to the
repository through the Repository Service to fetch details. It connects to the Integration Service to
start workflows. So essentially client tools are used to code and give instructions to PowerCenter
servers.
 PowerCenter Administration Console: This is simply a web-based administration tool you can use to
administer the PowerCenter installation.
There are some more not-so-essential-to-know components discussed below:

 Web Services Hub: Web Services Hub exposes PowerCenter functionality to external clients
through web services.
 SAP BW Service: The SAP BW Service extracts data from and loads data to SAP BW.
 Data Analyzer: Data Analyzer is like a reporting layer to perform analytics on data warehouse or
ODS data.
 Metadata Manager: Metadata Manager is a metadata management tool that you can use to browse
and analyze metadata from disparate metadata repositories. It shows how the data is acquired,
what business rules are applied and where data is populated in readable reports.
 PowerCenter Repository Reports: PowerCenter Repository Reports are a set of prepackaged Data
Analyzer reports and dashboards to help you analyze and manage PowerCenter metadata.

Informatica Transformations

A transformation is a repository object that generates, modifies, or passes data. The Designer
provides a set of transformations that perform specific functions. For example, an Aggregator
transformation performs calculations on groups of data.

Transformations can be of two types:

Active Transformation

An active transformation can change the number of rows that pass through it, change the transaction
boundary, or change the row type. For example, Filter, Transaction Control, and Update Strategy are
active transformations.

The key point to note is that the Designer does not allow you to connect multiple active transformations,
or an active and a passive transformation, to the same downstream transformation or transformation
input group, because the Integration Service may not be able to concatenate the rows passed by active
transformations. However, the Sequence Generator transformation (SGT) is an exception to this rule. A
SGT does not receive data; it generates unique numeric values. As a result, the Integration Service does
not encounter problems concatenating rows passed by a SGT and an active transformation.

Passive Transformation

A passive transformation does not change the number of rows that pass through it, maintains the
transaction boundary, and maintains the row type.

The key point to note is that the Designer allows you to connect multiple transformations to the same
downstream transformation or transformation input group only if all transformations in the upstream
branches are passive. The transformation that originates the branch can be active or passive.

Transformations can be Connected or Unconnected to the data flow.

Connected Transformation
Connected transformation is connected to other transformations or directly to target table in the mapping.

Unconnected Transformation

An unconnected transformation is not connected to other transformations in the mapping. It is called
within another transformation, and returns a value to that transformation.

Aggregator Transformation

The Aggregator transformation performs aggregate functions like average, sum, count, etc. on multiple
rows or groups. The Integration Service performs these calculations as it reads, storing group and row
data in an aggregate cache. It is an Active & Connected transformation.

Difference between Aggregator and Expression Transformation?

The Expression transformation permits you to perform calculations on a row-by-row basis only. In the
Aggregator you can perform calculations on groups.

For example, an Aggregator transformation might have ports such as State, State_Count, Previous_State and State_Counter.

Components: Aggregate Cache, Aggregate Expression, Group by port, Sorted input.

Aggregate Expressions: are allowed only in the Aggregator transformation. They:

 Can include conditional clauses and non-aggregate functions.
 Can include one aggregate function nested within another aggregate function.

Aggregate Functions: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM,
VARIANCE.
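The group-by behavior described above can be sketched in plain Python. This is an illustrative model, not Informatica code; the port names (`STATE`, `AMOUNT`) and the `aggregate` helper are assumptions for the example.

```python
from collections import defaultdict

def aggregate(rows, group_key, value_key):
    """Sketch of an Aggregator: SUM, COUNT and AVG per group, computed
    while rows accumulate in an aggregate cache."""
    cache = defaultdict(lambda: {"SUM": 0, "COUNT": 0})
    for row in rows:
        grp = cache[row[group_key]]
        grp["SUM"] += row[value_key]
        grp["COUNT"] += 1
    # One output row per group -- this is what makes the Aggregator active.
    return {k: dict(v, AVG=v["SUM"] / v["COUNT"]) for k, v in cache.items()}

sales = [
    {"STATE": "NY", "AMOUNT": 100},
    {"STATE": "NY", "AMOUNT": 300},
    {"STATE": "CA", "AMOUNT": 50},
]
result = aggregate(sales, "STATE", "AMOUNT")
```

Three input rows collapse into two output rows, one per state, which is why the transformation can change the row count.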

Expression Transformation

Passive & Connected. It is used to perform non-aggregate functions, i.e. to calculate values in a single row.
Example: to calculate the discount on each product, to concatenate first and last names, or to convert a date
to a string field.

You can create an Expression transformation in the Transformation Developer or the Mapping Designer.
Components: Transformation, Ports, Properties, Metadata Extensions.
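The row-by-row contract can be sketched as follows; the port names (`FIRST_NAME`, `PRICE`, etc.) are hypothetical, chosen to mirror the examples above.

```python
def expression(rows):
    """Sketch of an Expression transformation: exactly one output row per
    input row, with derived ports computed row by row (passive)."""
    out = []
    for row in rows:
        out.append({
            **row,
            "FULL_NAME": f"{row['FIRST_NAME']} {row['LAST_NAME']}",
            "NET_PRICE": round(row["PRICE"] * (1 - row["DISCOUNT"]), 2),
        })
    return out

rows = expression([{"FIRST_NAME": "Ada", "LAST_NAME": "Lovelace",
                    "PRICE": 100.0, "DISCOUNT": 0.1}])
```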

Filter Transformation

Active & Connected. It passes through rows that meet the specified filter condition and removes the rows that do
not meet the condition. For example, to find all the employees who are working in New York, or to find
all the faculty members teaching Chemistry in a state. The input ports for the filter must come from a
single transformation. You cannot concatenate ports from more than one transformation into the Filter
transformation. Components: Transformation, Ports, Properties, Metadata Extensions.
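As a minimal sketch (the employee data and `filter_transform` name are invented for illustration), the Filter simply drops non-matching rows; unlike the Router below, the dropped rows go nowhere:

```python
def filter_transform(rows, condition):
    """Sketch of a Filter transformation: rows failing the condition are
    dropped from the pipeline, which makes the transformation active."""
    return [row for row in rows if condition(row)]

employees = [
    {"NAME": "Ada", "CITY": "New York"},
    {"NAME": "Bob", "CITY": "Boston"},
]
ny_only = filter_transform(employees, lambda r: r["CITY"] == "New York")
```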

Joiner Transformation

Active & Connected. It is used to join data from two related heterogeneous sources residing in different
locations, or to join data from the same source. In order to join two sources, there must be at least one
pair of matching columns between the sources, and you must specify one source as the master and the
other as the detail. For example: to join a flat file and a relational source, to join two flat files, or to join a
relational source and an XML source.
The Joiner transformation supports the following types of joins:

 Normal- Normal join discards all the rows of data from the master and detail source that do not
match, based on the condition.
 Master Outer- Master outer join discards all the unmatched rows from the master source and
keeps all the rows from the detail source and the matching rows from the master source.
 Detail Outer- Detail outer join keeps all rows of data from the master source and the matching
rows from the detail source. It discards the unmatched rows from the detail source.
 Full Outer- Full outer join keeps all rows of data from both the master and detail sources.

Limitations on the pipelines you connect to the Joiner transformation:

* You cannot use a Joiner transformation when either input pipeline contains an Update Strategy
transformation.
* You cannot use a Joiner transformation if you connect a Sequence Generator transformation directly
before the Joiner transformation.
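The master/detail mechanics can be sketched like this: master rows are cached, detail rows are streamed against the cache. The sketch covers the normal and master outer joins only, and all names (`DEPT_ID`, `joiner`, etc.) are illustrative.

```python
def joiner(master, detail, key, join_type="normal"):
    """Sketch of a Joiner: cache the master source, stream the detail
    source, and match rows on the join condition."""
    cache = {}
    for m in master:
        cache.setdefault(m[key], []).append(m)
    out = []
    for d in detail:
        matches = cache.get(d[key], [])
        if matches:
            out.extend({**m, **d} for m in matches)
        elif join_type == "master_outer":
            # master outer keeps unmatched detail rows; master ports stay empty
            out.append(dict(d))
    return out

master = [{"DEPT_ID": 1, "DEPT_NAME": "HR"}]
detail = [{"DEPT_ID": 1, "EMP": "Ada"}, {"DEPT_ID": 2, "EMP": "Bob"}]
normal = joiner(master, detail, "DEPT_ID")
outer = joiner(master, detail, "DEPT_ID", "master_outer")
```

Choosing the smaller source as the master matters in practice, since it is the side held in the cache.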

Lookup Transformation

Passive & Connected or Unconnected. It is used to look up data in a flat file, relational table, view, or
synonym. It compares Lookup transformation port values to lookup source column values based on the
lookup condition. The returned values can then be passed to other transformations. You can create a lookup
definition from a source qualifier, and you can use multiple Lookup transformations in a mapping.

You can perform the following tasks with a Lookup transformation:

* Get a related value. Retrieve a value from the lookup table based on a value in the source. For example,
the source has an employee ID; retrieve the employee name from the lookup table.
* Perform a calculation. Retrieve a value from a lookup table and use it in a calculation. For example,
retrieve a sales tax percentage, calculate a tax, and return the tax to a target.
* Update slowly changing dimension tables. Determine whether rows exist in a target.

Lookup Components: Lookup source, Ports, Properties, Condition.


Types of Lookup:
1) Relational or flat file lookup.
2) Pipeline lookup.
3) Cached or uncached lookup.
4) Connected or unconnected lookup.
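The "get a related value" case above can be sketched as a cached, connected lookup. The helper name and ports (`EMP_ID`, `EMP_NAME`) are assumptions for the example:

```python
def lookup(rows, lookup_source, condition_key, return_port, default=None):
    """Sketch of a connected, cached Lookup: for each source row, find the
    lookup-source row whose key satisfies the condition and return one port."""
    cache = {r[condition_key]: r for r in lookup_source}
    out = []
    for row in rows:
        match = cache.get(row[condition_key])
        out.append({**row,
                    return_port: match[return_port] if match else default})
    return out

orders = [{"EMP_ID": 7, "AMOUNT": 10}, {"EMP_ID": 9, "AMOUNT": 20}]
employees = [{"EMP_ID": 7, "EMP_NAME": "Ada"}]
enriched = lookup(orders, employees, "EMP_ID", "EMP_NAME")
```

Note the transformation is passive here: every source row comes out exactly once, matched or not.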

Rank Transformation

Active & Connected. It is used to select the top or bottom rank of data. You can use it to return the largest
or smallest numeric value in a port or group, or to return the strings at the top or the bottom of a session
sort order. For example, to select the top 10 regions where the sales volume was highest, or to select the 10
lowest-priced products. As an active transformation, it might change the number of rows passed through
it: if you pass 100 rows to the Rank transformation but select to rank only the top 10 rows, only those 10
rows pass from the Rank transformation to the next transformation. You can connect ports from only one
transformation to the Rank transformation. You can also create local variables and write non-aggregate
expressions.
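A minimal sketch of the top/bottom-N behavior, with optional grouping; the function and port names are invented for illustration:

```python
def rank(rows, rank_port, top=True, rank_count=3, group_by=None):
    """Sketch of a Rank transformation: keep only the top (or bottom)
    rank_count rows per group, ordered by the rank port."""
    groups = {}
    for row in rows:
        key = row[group_by] if group_by else None
        groups.setdefault(key, []).append(row)
    out = []
    for members in groups.values():
        members.sort(key=lambda r: r[rank_port], reverse=top)
        out.extend(members[:rank_count])   # rows beyond rank_count are dropped
    return out

data = [{"REGION": "E", "SALES": 5}, {"REGION": "E", "SALES": 9},
        {"REGION": "E", "SALES": 1}, {"REGION": "W", "SALES": 4}]
top2 = rank(data, "SALES", top=True, rank_count=2)
```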
Router Transformation

Active & Connected. It is similar to the Filter transformation because both allow you to apply a condition to
test data. The difference is that the Filter transformation drops the data that do not meet the condition,
whereas the Router has an option to capture the data that do not meet the condition and route it to a default
output group.
If you need to test the same input data based on multiple conditions, use a Router transformation in a
mapping instead of creating multiple Filter transformations to perform the same task. The Router
transformation is more efficient.
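The multi-condition routing can be sketched as one pass over the data, which is why a single Router beats several Filters. Group names and conditions here are illustrative:

```python
def router(rows, groups):
    """Sketch of a Router: each output group receives the rows matching
    its condition; rows matching no condition fall to the default group.
    A row satisfying several conditions appears in several groups."""
    out = {name: [] for name in groups}
    out["DEFAULT"] = []
    for row in rows:           # a single pass tests every condition
        matched = False
        for name, condition in groups.items():
            if condition(row):
                out[name].append(row)
                matched = True
        if not matched:
            out["DEFAULT"].append(row)
    return out

cities = [{"CITY": "NY"}, {"CITY": "LA"}, {"CITY": "Oslo"}]
routed = router(cities, {"EAST": lambda r: r["CITY"] == "NY",
                         "WEST": lambda r: r["CITY"] == "LA"})
```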

Sequence Generator Transformation

Passive & Connected transformation. It is used to create unique primary key values, cycle through a
sequential range of numbers, or replace missing primary keys.

It has two output ports: NEXTVAL and CURRVAL. You cannot edit or delete these ports, and you
cannot add ports to the transformation. Connect the NEXTVAL port to a transformation or target to
generate a sequence of numbers. CURRVAL is NEXTVAL plus the Increment By
value.
You can make a Sequence Generator reusable, and use it in multiple mappings. You might reuse a
Sequence Generator when you perform multiple loads to a single target.

For non-reusable Sequence Generator transformations, Number of Cached Values is set to zero by default,
and the Integration Service does not cache values during the session. For non-reusable Sequence
Generator transformations, setting Number of Cached Values greater than zero can increase the number
of times the Integration Service accesses the repository during the session. It also causes sections of
skipped values, since unused cached values are discarded at the end of each session.

For reusable Sequence Generator transformations, you can reduce Number of Cached Values to minimize
discarded values; however, it must be greater than one. When you reduce the Number of Cached Values,
you might increase the number of times the Integration Service accesses the repository to cache values
during the session.
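The NEXTVAL/CURRVAL relationship can be sketched with a small class. This models only the basic port behavior, not repository caching; the class and method names are assumptions:

```python
class SequenceGenerator:
    """Sketch of a Sequence Generator transformation: it receives no input
    rows, only emits values on its NEXTVAL and CURRVAL ports."""

    def __init__(self, start=1, increment_by=1):
        self._next = start
        self.increment_by = increment_by
        self._last = None  # no value generated yet

    def nextval(self):
        # NEXTVAL hands out the next value in the sequence.
        self._last = self._next
        self._next += self.increment_by
        return self._last

    def currval(self):
        # CURRVAL is NEXTVAL plus the Increment By value.
        return self._last + self.increment_by

seq = SequenceGenerator(start=1, increment_by=1)
keys = [seq.nextval() for _ in range(3)]
```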

Sorter Transformation

Active & Connected transformation. It is used to sort data in either ascending or descending order according
to a specified sort key. You can also configure the Sorter transformation for case-sensitive sorting, and
specify whether the output rows should be distinct. When you create a Sorter transformation in a
mapping, you specify one or more ports as a sort key and configure each sort key port to sort in
ascending or descending order.
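Multi-key sorting with a distinct option can be sketched as below; the port names and the `sorter` helper are invented for the example. Sorting each key in reverse order relies on Python's stable sort, so the first key listed ends up as the primary sort:

```python
def sorter(rows, sort_keys, distinct=False):
    """Sketch of a Sorter: sort by one or more key ports, each ascending
    or descending, optionally emitting only distinct rows."""
    out = list(rows)
    # apply keys in reverse so the first key becomes the primary sort
    for port, ascending in reversed(sort_keys):
        out.sort(key=lambda r: r[port], reverse=not ascending)
    if distinct:
        seen, unique = set(), []
        for row in out:
            marker = tuple(sorted(row.items()))
            if marker not in seen:
                seen.add(marker)
                unique.append(row)
        out = unique               # distinct output can drop rows (active)
    return out

data = [{"DEPT": "B", "SAL": 10}, {"DEPT": "A", "SAL": 20},
        {"DEPT": "A", "SAL": 5}]
ordered = sorter(data, [("DEPT", True), ("SAL", False)])
```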

Source Qualifier Transformation

Active & Connected transformation. When adding a relational or flat file source definition to a mapping,
you need to connect it to a Source Qualifier transformation. The Source Qualifier is used to join data
originating from the same source database, to filter rows when the Integration Service reads source data,
to specify an outer join rather than the default inner join, and to specify sorted ports.
It is also used to select only distinct values from the source, and to create a custom query to issue a
special SELECT statement for the Integration Service to read source data.

Union Transformation

Active & Connected. The Union transformation is a multiple input group transformation that you use to
merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from
multiple sources similar to the UNION ALL SQL statement to combine the results from two or more SQL
statements. Similar to the UNION ALL statement, the Union transformation does not remove duplicate
rows.
Rules
1) You can create multiple input groups, but only one output group.
2) All input groups and the output group must have matching ports. The precision, datatype, and scale
must be identical across all groups.
3) The Union transformation does not remove duplicate rows. To remove duplicate rows, you must add
another transformation such as a Router or Filter transformation.
4) You cannot use a Sequence Generator or Update Strategy transformation upstream from a Union
transformation.
5) The Union transformation does not generate transactions.
Components: Transformation tab, Properties tab, Groups tab, Group Ports tab.
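The UNION ALL semantics described by the rules above reduce to a simple concatenation, as this sketch shows (the `union` helper is illustrative):

```python
def union(*pipelines):
    """Sketch of a Union transformation: like SQL UNION ALL, rows from
    every input group are merged into one output group, duplicates kept.
    All input groups are assumed to have matching ports."""
    merged = []
    for pipeline in pipelines:
        merged.extend(pipeline)
    return merged

branch_a = [{"ID": 1}]
branch_b = [{"ID": 1}, {"ID": 2}]
merged = union(branch_a, branch_b)
```

Per rule 3, the duplicate `{"ID": 1}` survives; a downstream Filter or Router would be needed to remove it.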

Update Strategy Transformation

Active & Connected transformation. It is used to update data in the target table, either to maintain a history
of data or only the most recent changes. It flags rows for insert, update, delete or reject within a mapping.
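The flagging behavior can be sketched as follows. The DD_* constants mirror the values Informatica assigns to the row flags; the `update_strategy` helper and its key-existence check are assumptions for the example:

```python
# Row flags as the Update Strategy transformation assigns them.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def update_strategy(rows, existing_keys, key):
    """Sketch of an Update Strategy: flag each row for insert or update
    depending on whether its key already exists in the target."""
    return [(DD_UPDATE if row[key] in existing_keys else DD_INSERT, row)
            for row in rows]

flags = update_strategy([{"ID": 1}, {"ID": 2}], existing_keys={1}, key="ID")
```

In a real mapping, the existence check would typically come from a Lookup transformation on the target table.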
