
1) Filter by Expression

Filter by expression filters records according to a given DML expression.

Ports: In, Out, Deselect, Reject, Error, Log

Parameters: Select expr

If the expression evaluates to a non-0 value for a record, the component writes the record to the out port.

If it evaluates to 0, the component writes the record to the deselect port.

If it evaluates to NULL, the component writes the record to the reject port and a descriptive error message to the error port.
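For example, a select expression like the following (the field name is illustrative) exercises all three outcomes: a positive amount evaluates to 1 and goes to the out port, a zero or negative amount evaluates to 0 and goes to the deselect port, and a NULL amount makes the comparison NULL, so the record goes to the reject port.

    amount > 0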

2) Dedup Sorted

It separates one specified record in each group of records from the rest of the records in
the group.

Ports: In, Out, Dup, Reject, Error, Log

DEDUP SORTED requires grouped input.

Parameters: key, select, keep. The keep parameter (first, last, or unique-only) specifies which record the component keeps from each group.

Runtime behavior: Reads a grouped flow of records from the in port.

If you supply an expression for the select parameter, Dedup Sorted applies it to the records and processes them according to that expression. If you do not supply a select expression, it processes all records on the in port.

It considers any consecutive records with the same key value to be the same group. If a group consists of one record, it writes that record to the out port.

If you choose unique-only, the component does not write any record from a group consisting of more than one record.

Both the out and dup ports are optional. If you do not connect flows to them, the component discards those records.
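For example, with key {customer_id} and keep set to last, the final record of each customer group goes to the out port and the earlier ones go to the dup port (values are illustrative):

    key : {customer_id}    keep : last

    in  : (1, "2021-01-03")  (1, "2021-02-07")  (2, "2021-01-15")
    out : (1, "2021-02-07")  (2, "2021-01-15")
    dup : (1, "2021-01-03")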
3) Reformat

Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.

Ports: In, Out, Reject, Error, Log

Parameters:

Count: Sets the number of out ports, reject ports, error ports, and transform parameters.

Default is 1.

Transform n: Specifies either the name of a file containing a transform function, or a transform string, corresponding to an out port; n represents the number of an out port.

Transform functions for REFORMAT should have one input and one output.
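For example, a minimal transform sketch in DML (the field names are illustrative) that derives new fields and drops the rest:

    out :: reformat(in) =
    begin
      out.customer_id :: in.customer_id;
      out.full_name   :: string_concat(in.first_name, " ", in.last_name);
      out.net_amount  :: in.gross_amount - in.discount;
    end;

Input fields that are never assigned to an output field (here first_name, last_name, gross_amount, and discount as separate fields) do not appear in the output record format.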

Select (expression, optional): Filter for records before reformatting.

Output-index (filename or string, optional)

When you specify a value for this parameter, each input record goes to exactly one transform-output port pair. Without this parameter, each input record goes to every transform-output port pair: for example, if the component has two output ports and there are no rejects, 100 input records result in 100 output records on each port, for a total of 200 output records.

Output-indexes (filename or string, optional)

The expected output of the transform function is a vector of numeric values. The component
considers each element of this vector as an index into the output transforms and ports. The
component directs the input record to the identified output ports and executes the
transform functions, if any, associated with those ports.

If you specify a value for the output-index parameter, you cannot also specify the
output-indexes parameter.
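As a sketch (the function name and field name are assumptions for illustration; assuming two transform-output pairs numbered 0 and 1), a transform that routes large transactions to pair 0 and everything else to pair 1:

    out :: output_index(in) =
    begin
      out :: if (in.amount >= 1000) 0 else 1;
    end;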

If you specify an expression for the select parameter, the expression filters the records on the in
port:
If the expression evaluates to 0 for a particular record, REFORMAT does not process the
record, which means that the record does not appear on any output port.

If the expression produces NULL for any record, REFORMAT writes a descriptive error
message and stops execution of the graph.

If the expression evaluates to anything other than 0 or NULL for a particular record,
REFORMAT processes the record.

If a transform function returns NULL, REFORMAT writes:

1. An error message to the corresponding error port

2. The current input record to the corresponding reject port

If you do not connect flows to the reject or error ports, REFORMAT discards the information.

Writes the valid records to the out ports.

4) Rollup

Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.

Ports: In, Out, Reject, Error, Log.

Parameters:

Sorted-input- (Boolean, required)

When set to in memory: Input need not be sorted, the component accepts ungrouped
input, and requires the use of the max-core parameter.

When set to Input must be sorted or grouped, the component requires grouped input, and
the max-core parameter is not available.

Default is Input must be sorted or grouped.

Key-method- (choice, optional)


o Use key specifier — The component uses a key specifier.

o Use key change function — The component uses the key change transform function.

Key- (key specifier, optional)

Name(s) of the key field(s) the component can use to group or define groups of records. If
the value of the key-method parameter is Use key specifier, you must specify a value for the
key parameter.

If you specify Use key change function for the key-method parameter, the key parameter is not
available.

Transform (filename or string, required)

Either the name of the file containing the types and transform functions, or a transform string.

Max-core- (integer, required)

Maximum memory usage in bytes.

If the total size of the intermediate results the component holds in memory exceeds the
number of bytes specified in the max-core parameter, the component writes temporary files
to disk.

Default is 67108864 (64 MB).

Template mode: During a rollup operation, aggregation functions calculate cumulative information about any expression. For example, suppose you have an input record for each purchase by each customer. You could use the sum aggregation function to determine the total amount spent by each customer.
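In template mode that computation is a single transform function built from aggregation functions; a sketch with illustrative field names:

    out :: rollup(in) =
    begin
      out.customer_id :: in.customer_id;    // the grouping key
      out.total_spent :: sum(in.amount);    // aggregation over the group
    end;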

Expanded mode: Now suppose that for each customer, you want to determine the price of the
largest single purchase and the item that was purchased. In this situation, aggregation functions
cannot compute the result you want. Consequently, you would write a transform that examines the
input records for each group and selects the appropriate fields from one of the records.
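In expanded mode you supply a temporary type plus initialize, rollup, and finalize functions. A sketch for this case (field names and types are illustrative): the temporary record carries the running maximum, and finalize emits one output record per group.

    type temporary_type =
    record
      real(8)     max_price;
      string(100) max_item;
    end;

    /* Start each group with an empty running maximum. */
    temp :: initialize(in) =
    begin
      temp.max_price :: 0;
      temp.max_item  :: "";
    end;

    /* Called once per input record; keeps the larger purchase. */
    temp :: rollup(temp, in) =
    begin
      temp.max_price :: if (in.price > temp.max_price) in.price else temp.max_price;
      temp.max_item  :: if (in.price > temp.max_price) in.item  else temp.max_item;
    end;

    /* Emits the group's result record. */
    out :: finalize(temp, in) =
    begin
      out.customer_id   :: in.customer_id;
      out.largest_price :: temp.max_price;
      out.largest_item  :: temp.max_item;
    end;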
5) Scan

For every input record, Scan generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records.

Ports: In, Out, Reject, Error, Log.

Parameters: Sorted-input, key-method, key, transform, max-core.
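A template-mode sketch for Scan (transform and field names are illustrative): here sum yields a running rather than final total, so every output record carries the cumulative amount for its group so far:

    out :: scan(in) =
    begin
      out.customer_id :: in.customer_id;
      out.ytd_total   :: sum(in.amount);   // running total, one output per input
    end;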


Lookup :
---From a lookup file, returns the first record that matches a specified expression.
A lookup is a component of an Ab Initio graph where we can store data and retrieve it by using a key parameter.
A lookup file is the physical file where the data for the lookup is stored.

Parameters : key
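Inside a transform you retrieve lookup data with the DML lookup function, addressed by the lookup file's label; a sketch with an illustrative label and field names, using lookup_match to guard against records with no match:

    out :: reformat(in) =
    begin
      out.customer_id   :: in.customer_id;
      out.customer_name :: if (lookup_match("Customers", in.customer_id))
                              lookup("Customers", in.customer_id).name
                           else "UNKNOWN";
    end;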

Fuse :

FUSE combines multiple input flows (perhaps with different record formats) into a single output
flow. It examines one record from each input flow simultaneously, acting on the records
according to the transform function you specify. For example, you can compare records,
selecting one record or another based on specified criteria, or “fuse” them into a single record
that contains data from all the input records.
Parameters : count, transform
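A sketch of a two-input fuse transform (field names are illustrative, and the one-argument-per-flow signature is the assumption here) that takes the first flow's value when present and falls back to the second:

    out :: fuse(in0, in1) =
    begin
      out.id     :: in0.id;
      out.amount :: if (is_null(in0.amount)) in1.amount else in0.amount;
    end;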

SORT WITHIN GROUPS :

SORT WITHIN GROUPS refines the sorting of records already sorted according to one key
specifier: it sorts the records within the groups formed by the first sort according to a second
key specifier.

Parameters : major-key, minor-key, max-core, allow-unsorted

---major-key : Specifies the name(s) of key field(s) and sequence specifier(s), according to which
the component assumes the input has already been ordered.

---minor-key : Name(s) of the key field(s) and the sequence specifier(s) you want the
component to use when it refines the sorting of records.

---max-core : Maximum memory usage in bytes.

When the component reaches the number of bytes specified in the max-core parameter, it
sorts the records it has read for the group and writes a temporary file to disk. Once all the data
is sorted, it merges the temporary files and sends the records to the out port.

Default is 10485760 (10 MB).

---allow-unsorted

Set to True to allow input not sorted according to the major-key parameter.

When you set allow-unsorted to True, the boundary between groups of records occurs when
any change occurs in the value(s) of the field(s) specified in the major-key parameter.

Default is False.
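For example (field names illustrative), with input already sorted by region, the following orders each region's records by amount, largest first:

    major-key : {region}
    minor-key : {amount descending}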
Partition components :
1) Partition by Key : Partition by Key distributes records to its output flow partitions according to key values.
Parameters :
key : Specifies the name(s) of the key field(s) that you want the component to use when it
distributes records among flow partitions.
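For example (field name illustrative), the following sends every record with the same customer_id to the same output partition:

    key : {customer_id}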

2) Partition by Round-Robin: Partition by Round-Robin distributes blocks of records evenly to each output flow in round-robin fashion.
---According to the block_size parameter, it distributes blocks of records to its output flows in the order in which the flows are connected.
Parameters :
block_size : Number of records you want the component to distribute to one flow before
distributing the same number to the next flow.
Default is 1.

3) Partition by Expression:
Partition By Expression distributes records to its output flow partitions according to a specified
DML expression or transform function.
Parameters :
function : A DML expression or transform function whose value for each record determines the output partition that receives it.
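As a sketch (field name illustrative; assuming four output partitions and that the expression's value selects the partition number), the following spreads records by customer id:

    customer_id % 4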

4) Partition by Key and Sort:


Partition by Key and Sort repartitions records by key values and
then sorts the records within each partition. The number of input and output partitions can be
different. Partition by Key and Sort is a subgraph that contains two components, Partition by
Key and Sort.

Departition Components :

13) GATHER :

GATHER combines records from multiple flow partitions in an arbitrary order.

You can use GATHER to:


Reduce data parallelism, by connecting a single fan-in flow to the in port
Reduce component parallelism, by connecting multiple straight flows to the in port

Runtime behavior :
1.Reads records from the flows connected to the in port.
2.Combines the records in an arbitrary order.
3.Writes the combined records to the out port.

Parameters : None

14) MERGE :

MERGE combines records from multiple flows or flow partitions that have been sorted
according to the same key specifier, and maintains the sort order.
MERGE requires sorted input data, but never sorts data itself.

Parameters for MERGE


key: Name of the key field and the sequence specifier you want MERGE to use to maintain the
order of data records while merging them.

check-sort: Prevents the graph from running to completion if the component detects unsorted
input data.

Possible values are the following:

•True (default) — The component stops the graph with an error on the first input record that is
out of sorted order, according to the value of the key parameter. In almost all cases, this default
value is appropriate.

•False — The component does not stop or issue an error when it encounters unsorted inputs.
Since MERGE does not sort data itself, do not expect that unsorted input data will result in
output data that is sorted or grouped. This setting is rarely used.
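For example (field names illustrative), if each input partition was sorted on last_name then first_name, give MERGE the same key specifier so the merged output preserves that order:

    key : {last_name; first_name}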

15) CONCATENATE :

CONCATENATE appends multiple flow partitions of records one after another. The in port for
CONCATENATE is ordered.

Runtime behavior of concatenate


CONCATENATE does the following:

1.Reads all records from the first flow connected to the in port (counting from top to bottom on
the graph) and copies them to the out port.
2.Reads all records from the next flow connected to the in port and appends them to the
records from the previously processed flow.
3.Repeats Step 2 for each subsequent flow connected to the in port.

16) REPLICATE :
REPLICATE arbitrarily combines all records it receives into a single flow and writes a copy of that
flow to each of its output flows. Use REPLICATE to support component parallelism.

Parameters for REPLICATE


None

Runtime behavior of REPLICATE :

1.Arbitrarily combines the records from all the flows on the in port into a single flow.

2.Copies that flow to all the flows connected to the out port.

REPLICATE does not support implicit reformat, so you cannot use it to change the record format
associated with a particular flow. For that reason, you must make the record format of the in
and out ports identical. If you do not, execution of the graph stops when it reaches REPLICATE.

17) BROADCAST :

BROADCAST combines in an arbitrary order all records it receives into a single flow and writes a
copy of that flow to each of its output flow partitions.
Use BROADCAST to increase data parallelism when you have connected a single fan-out flow to
the out port, or to increase component parallelism when you have connected multiple straight
flows to the out port.

Parameters for BROADCAST


None

Runtime behavior of BROADCAST


BROADCAST does the following:

1.Reads records from all flows on the in port.

2.Combines the records in an arbitrary order into a single flow.

3.Copies all the records to all the flow partitions connected to the out port.

18) REPLICATE versus BROADCAST


REPLICATE and BROADCAST are similar components, so it can be difficult to know which one to
use in a particular graph:

BROADCAST is used to increase data parallelism by feeding records to fan-out or all-to-all flows.
REPLICATE is generally used to increase component parallelism, emitting multiple straight flows
to separate pipelines.

Specifically, the difference between them lies in how their flows are set up and how their
layouts are propagated in the GDE:

REPLICATE allows multiple outputs for a given layout and propagates the layout from the input
to the output.

BROADCAST is a partitioning component that defines the transition from one layout to another.

19) Normalize :

NORMALIZE generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.

Normalize converts a vector of size n into n separate records. For example, if each input record holds one student with a vector of 4 subjects, the input to Normalize is a single record per student, and the output is 4 records for that student, one per subject.
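A sketch of the two transform functions Normalize typically uses (field names are illustrative): length tells the component how many output records to generate for an input record, and normalize builds the index-th of them.

    out :: length(in) =
    begin
      out :: length_of(in.subjects);         // number of output records
    end;

    out :: normalize(in, index) =
    begin
      out.student_id :: in.student_id;
      out.subject    :: in.subjects[index];  // one subject per output record
    end;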

20) Denormalize :

Denormalization is a strategy used on a previously normalized database to increase performance.

There are a few situations when you should definitely think of denormalization:
Maintaining history: Data can change over time, and we need to store values that were valid when a record was created. ...
Improving query performance: Some queries may use multiple tables to access data that we frequently need.
