
Ab Initio Playbook

Overview
Ab Initio is Latin for “from the beginning” or “from first principles”

From the beginning, our software was designed to meet the demands of the largest and most complex organizations in the world. To achieve this, we built in many features that address:

● Performance
● Scalability
● Flexibility
● Robustness

The Ab Initio ETL platform provides a robust architecture that allows simple, fast, and highly secure integration of systems and applications. The tool can run heterogeneous applications in parallel over distributed networks. It can also integrate diverse, complex, and continuous data streams ranging from gigabytes to petabytes, covering both ETL and EAI tasks within a single, consistent framework. Moreover, it can integrate arbitrary data sources and programs and supplies complete metadata management across the enterprise.

Architecture
Ab Initio is a general-purpose data processing platform with a single architecture for processing files, database tables, message queues, web services, and metadata. This same architecture enables virtually any technical or business rule to be graphically defined, shared, and executed. It processes data in parallel across multiple processors, even processors on different servers.

Ab Initio is client-server software that includes the following components: the Graphical Development Environment (GDE), the Co>Operating System, the Enterprise Meta>Environment (EME), the Component Library, and the Data Profiler.

The Graphical Development Environment is a graphical interface for creating, editing, and executing graphs. You can easily drag and drop components, configure them, and develop graphs. You can also check the record count at every stage, check execution time, and work on performance tuning.

The Co>Operating System is the foundation for all Ab Initio applications; it acts as an engine that integrates all kinds of data processing and handles communication among all the tools. It runs on mainframe, Unix, and Linux systems and supports parallelism.

The Enterprise Meta>Environment is a database that tracks changes to the graphs you develop. It is also intelligent enough to trace the data that flows through a graph and show the impact on other graphs; this is called data impact analysis.

The Component Library consists of all the components with which you can transform, load, and unload data.

The Data Profiler is an analytical application that presents summary information about data, such as the number of values, the minimum and maximum values, the number of invalid values, and so forth.

Sandbox
A sandbox is a working copy of a project. A project is a collection of graphs, plans, and related files that accomplish a single business goal. A project is stored in a single directory tree in the EME technical repository.

● Sandbox is the work area where the developer develops a graph
● Each sandbox is particular to a specific project
● Every user has an individual sandbox in which to develop their own graphs

There are two types of sandboxes in Ab Initio

● Private sandbox
● Common sandbox

Private sandboxes are checked out to your own area, typically under a ‘home’ directory

Common sandboxes are generally checked out once into a shared area
Serial and Multifile Systems
Serial File

● A serial file is just an ordinary file in the native file system, usually stored under a serial path
● The Data Location URL for a serial file uses a parameter like $AI_SERIAL

MFS (Multifile System)

● When you use multi files, your graphs can read, write, and process data in parallel
● A multifile is made up of many partitions. Each partition of the multifile is an ordinary file
that contains part of the dataset
● The Data Location URL for a multifile uses the mfile protocol and a parameter like $AI_MFS
● Multi files use the same record formats as serial files.
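
For illustration, the two kinds of Data Location URLs might look like the following (a sketch; the parameter values and the file name are hypothetical):

$AI_SERIAL/customers.dat (a serial file)
mfile:$AI_MFS/customers.dat (a multifile, using the mfile protocol)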

Components
Read Hive table

Reads data from a Hive table. In this component you can use a filter expression to select specific data, but you can filter only on partition columns (such as year, month, day).

Write Hive table

Writes data to a Hive table. You can specify the write method in a parameter, such as overwrite before load or append every new record.
Read Excel Spreadsheet

Reads a Microsoft Excel spreadsheet into a graph and creates an output record for each row in the spreadsheet.

Write Excel Spreadsheet

Writes data to a Microsoft Excel spreadsheet.

Partition by Round Robin


The Round-Robin Partition achieves an even distribution of rows across partitions by dealing records out cyclically; you cannot specify a column to partition on.

Partition by Key
The Partition by Key component groups data by a key; you can specify which column to partition on, as in the sketch below.
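
As a sketch, the key parameter takes a key specifier; assuming a hypothetical customer_id field, the specifier

{customer_id}

sends all records with the same customer_id value to the same partition.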

Reformat

Reformat changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records, as in the sketch below.
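
A minimal transform sketch in DML, assuming hypothetical input fields first_name, last_name, and amount (the field names are illustrative only):

out :: reformat(in) =
begin
  out.full_name :: string_concat(in.first_name, " ", in.last_name); // combine two fields
  out.amount_cents :: in.amount * 100; // transform a field
end;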

Merge
Merge combines data records from multiple flow partitions that have been sorted according to the same key specifier, and it maintains the sort order (provided the records are already sorted). In other words, Merge collects different flows while preserving their sorted order.

Concatenate
Use the Concatenate component where the flows need to be combined in a particular order.
Concatenate appends multiple flow partitions of data records one after another

Gather
Gather combines data records arbitrarily from multiple flow partitions (a multifile system) or from multiple serial flows of the same type (the same DML), making them into a single serial flow. Gather collects the different flows in arbitrary order.

● Reads data records from the flows connected to the input port
● Combines the records arbitrarily and writes to the output
● Not key-based.
● Result ordering is unpredictable.
● Most useful method for efficient collection of data from multiple partitions and for
repartitioning.
● Used most frequently.
Sort (order by)

Sorts data in ascending or descending order according to the specified key. You can specify which column to sort on in the key parameter, and you can use Sort to order records before you send them to a component that requires grouped or sorted records. The key parameter describes the collation order, as in the sketch below.
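
For example, a key specifier that sorts by a hypothetical trx_date field ascending and an amount field descending might look like this (a sketch):

{trx_date; amount descending}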

Join

We can combine records from different data sources using the Join component. The key creates the correspondence between the records in the data sets. You need to sort the data with the Sort component before using the Join component. A transform sketch follows the list of join types below.

Join Types

● Inner Join
The transform is called to produce an output record only when there is a match in all of the inputs
● Full Outer Join
The transform is called to produce an output record even if matching records are not found
● Explicit Join
You set the record-required parameters to specify, on a port-by-port basis, whether the transform will be called if a matching record is not found
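
A minimal inner-join transform sketch in DML, assuming hypothetical customer records on port in0 and order records on port in1, matched on a customer_id key:

out :: join(in0, in1) =
begin
  out.customer_id :: in0.customer_id; // the join key
  out.customer_name :: in0.name; // field taken from the first input
  out.order_amount :: in1.amount; // field taken from the second input
end;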

Rollup (group by)


Use the Rollup component to compute aggregation functions on groups of records. A transform sketch follows the list below.

● Rollup component requires the input to be sorted
● Choose which field or fields to use as the grouping key
● Choose which aggregation computations to perform
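
A minimal Rollup transform sketch in DML, assuming hypothetical customer_id (the grouping key) and amount fields:

out :: rollup(in) =
begin
  out.customer_id :: in.customer_id; // grouping key
  out.total_amount :: sum(in.amount); // aggregation over the group
  out.record_count :: count(1); // number of records in the group
end;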

Dedup Sorted (distinct)

Dedup Sorted separates one specified data record in each group from the rest of the records in the group; that is, it removes duplicate records from the flow according to the specified key.

Input file

Input File represents records read as input to a graph from one or more serial files or from a multifile.

Output file

Output File represents records written as output from a graph into one or more serial files or a
multifile

Input Table

Input Table unloads records from a database into a graph, allowing you to specify as the source
either a database table or an SQL statement that selects records from one or more tables

Output Table
Output Table loads records from a graph into a database, letting you specify the destination either
directly as a single database table, or through an SQL Statement that inserts records into one or
more tables

Filter by Expression (where)

Filters records according to a DML expression or transform function that specifies the selection criteria, as in the sketch below.
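
For instance, a hypothetical selection expression that keeps only large California transactions might look like this (amount and state are illustrative field names):

amount > 1000 && state == "CA"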

Replicate

It replicates the data of a particular partition and sends it out to multiple output ports of the component.

SCAN
Generates a series of cumulative summary records for groups of data records. You can use this component, for example, to create ranks in the data records, as in the sketch below.
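
A minimal Scan transform sketch in DML, assuming hypothetical customer_id and amount fields; each output record carries the running total of amount within its group:

out :: scan(in) =
begin
  out.customer_id :: in.customer_id;
  out.running_total :: sum(in.amount); // cumulative sum up to the current record
end;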

Expression
First Defined
first_defined(expr,value)

This function is equivalent to the ANSI-92 SQL function COALESCE and similar to the older Oracle
function NVL.
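
For example, assuming a hypothetical total_amt field that may be NULL:

first_defined(total_amt, 0)

The function returns the value of total_amt when it is defined, and 0 otherwise.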

Is Defined
is_defined(expr)

This function is the same as “is not null” in SQL.

Example: “Product_Type == "SSP_SMS_Anynet" and is_defined(total_amt)” is equivalent to the SQL condition “where Product_Type = 'SSP_SMS_Anynet' and total_amt is not null”.

Is Null
is_null(expr)
This function is the same as “is null” in SQL.
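
Example: is_null(total_amt) is equivalent to the SQL condition “total_amt is null” (total_amt is a hypothetical field).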

String Substring
string_substring(string, start, length)

Extracts a string with a specified length, starting from a given location in an input string
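
Example: the following call extracts four characters starting at position 3 (positions are 1-based):

string_substring("abcdefgh", 3, 4)

The function returns the following result:

"cdef"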

String Upcase
string_upcase(column)

Converts all the letters in a string to uppercase.
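
Example: string_upcase("abc") returns "ABC".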

String Downcase
string_downcase(column)

Converts all the characters in a string to lowercase.
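
Example: string_downcase("ABC") returns "abc".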

String like
string_like(column,expression)

Example: string_like tests whether the string "abcdef" matches the pattern "abc%", that is, "abc" followed by zero or more characters:

string_like(str = "abcdef", pattern = "abc%")

The function returns 1 (true).

Is Valid
is_valid(expr)

To use is_valid to check the validity of fields in a record or union, call the function on a record or
union that contains validation function fields
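
As a sketch, calling is_valid on a value cast to a date type checks whether the cast produced a valid value:

is_valid((date("YYYYMMDD"))"20240231")

The function returns 0, because February 31 is not a valid date; for "20240131" it returns 1.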

String Filter
Returns the characters of the first string that also appear in the second string.

Example: string_filter returns "ABC", the characters of the first string that also appear in the second:

string_filter(str = "AXBYCZ", filter_str = "ABCDEF")

The function returns the following result:

"ABC"

string_filter("023#46#13", "0123456789")

The function returns the following result:

"0234613"

Parameter
A parameter is a value that you specify to control some part of an object’s behaviour.
The object can be a project, component, graph, subgraph, plan, and so on. You type in a value for a
parameter (or click a button or select a value from a list), and thus specify the aspect of the object’s
behaviour identified by the parameter’s name.

Every parameter has two main parts:

● The declaration of its name
● The definition of its value

Parameters also have attributes that specify various details, such as what type of value the parameter can hold and whether it is input or local.

The normal way to edit a component’s parameters is through the Parameters tab of the component
dialog.

Graph, subgraph, and project parameters are edited through the Parameters Editor. Component parameters can also be edited with the Parameters Editor, although they are usually edited through the components’ own dialogs (with Description, Parameters, Ports, and other tabs).

Parameter sets
Every object (component, graph, project, plan) has a parameter set consisting of all that object’s
parameters.

The parameter set completely controls the object’s behaviour. It’s like turning the knobs on a piece
of equipment: each parameter is like one knob, controlling one detail of the object’s behaviour.

When editing a graph in the GDE, you can view the complete parameter set of any component in the
graph by selecting the component in the Parameters Editor’s left pane. A graph by default has no
parameters of its own.

The complete component parameter set you see in the editor includes parameters not shown in the
component dialog’s Parameters tab.

The GDE allows you to create files containing sets of values for a given graph’s input parameters (as
well as any input parameters in the graph’s sandbox or common sandboxes). Such a file is called an
input pset.

A component’s full parameter set includes all the values that can be set for it, in any of its tabs in its
GDE dialog.

The component’s Description, Layout, Port and other tabs allow more convenient access to these
values than would be possible by showing them as parameters in the Parameters tab. Nevertheless,
all these values are parameters.

Parameter interpretation
The Interpretation you specify for a given parameter determines what kind of expression you can
use to define its value. Above all, it determines how you can make references to other parameters in
its definition.

Most of the interpretation methods offer, at a minimum, some form of $ substitution: by specifying the name of another parameter, preceded by a $, you effectively substitute the value of that parameter in the expression.

The big difference between the different interpretation methods is what they allow you to do in
addition to simply referencing other parameters:

● Parameter Definition Language (PDL) interpretation

A full-featured notation that allows you to specify parameter references with $ or ${} substitution, as
needed. Also allows you to use inline DML or Korn shell expressions in parameter definitions.

● Shell interpretation

Allows you to use shell syntax to construct parameter definitions (along with $ or ${} substitution).
Shell interpretation is not as versatile as PDL; PDL includes the capability of using shell interpretation
along with other features that shell interpretation does not offer.

● $ substitution interpretation

Allows you to reference other parameters by specifying them in the $name form in the definition.

● ${} substitution interpretation

Allows you to reference other parameters by specifying them in the ${name} form in the definition.

● Constant interpretation

Declares that everything appearing in the parameter’s definition is to be interpreted literally; no substitution occurs at all.

In the GDE, you specify a parameter’s interpretation method by first selecting the parameter in the Parameters tab and then selecting the interpretation you want in the Interpretation box at the bottom of the tab.

In the Parameters Editor, you can set or change a parameter’s interpretation method by selecting the parameter and then editing the Interpretation attribute’s value in the editor’s right-hand pane.

Parameter example:
TRX_DATE $[(date("YYYYMMDD"))(today()-1)]

YEAR $[string_substring(TRX_DATE,1,4)]

MONTH $[string_substring(TRX_DATE,1,6)]

DAY $[(string(""))(date("YYYYMMDD"))(today()-1)]

AS_OF_DATE $[(date("YYYYMMDD"))datetime_add(now(), -1)]

PROCESS_MONTH $[(string(""))(date("YYYYMM"))(date("YYYYMMDD"))AS_OF_DATE]
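
As a hedged walk-through of these definitions (the specific dates are illustrative only): if today() falls on 2024-07-15, TRX_DATE evaluates to yesterday’s date, 20240714, in YYYYMMDD form; YEAR takes its first four characters (2024) and MONTH its first six (202407); DAY casts the same date to a string ("20240714"); AS_OF_DATE is likewise intended to hold yesterday’s date, derived from now(); and PROCESS_MONTH casts AS_OF_DATE to a date and truncates it to YYYYMM ("202407").
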
Graph example
