Ab Initio Playbook 1
Overview
Ab Initio is Latin for “from the beginning” or “from first principles”.
From the beginning, our software was designed to meet the demands of the largest and most complex
organizations in the world. To achieve this, we built in many features that address:
● Performance
● Scalability
● Flexibility
● Robustness
The Ab Initio ETL platform provides a robust architecture that allows simple, fast, and highly secure
integration of systems and applications. It can run heterogeneous applications in parallel over
distributed networks. It can also integrate diverse, complex, and continuous data streams
ranging from gigabytes to petabytes, handling both ETL and EAI tasks within a single, consistent
framework. Moreover, it can integrate arbitrary data sources and programs, and it supplies complete
metadata management across the enterprise.
Architecture
Ab Initio is a general-purpose data processing platform. It has a single architecture for processing
files, database tables, message queues, Web services, and metadata. This same architecture enables
virtually any technical or business rule to be graphically defined, shared, and executed. It processes
data in parallel across multiple processors, even processors on different servers.
The Co-Operating System is the foundation for all Ab Initio applications. It acts as the engine that
integrates all kinds of data processing and the communication between all the tools. It runs on
mainframe, Unix, and Linux and supports parallelism.
Enterprise Meta Environment (EME) is a database that keeps track of the changes to the graphs
developed. It can also report how data flows through a graph so that you can see the impact of a
change on other graphs, which is called data impact analysis.
Component Library consists of all the components with which you can transform, load, and
unload data.
Data Profiler is an analytical application that presents summary information about the data, such as
the number of values, the minimum and maximum values, the number of invalid values, and so forth.
Sandbox
A sandbox is a working copy of a project. A project is a collection of graphs, plans, and related files that
accomplish a single business goal. A project is stored in a single directory tree in the EME technical
repository. There are two kinds of sandbox:
● Private sandbox
● Common sandbox
Private sandboxes are checked out to your own area, typically under a ‘home’ directory.
Common sandboxes are generally checked out once into a shared area.
Serial and Multifile System
Serial File
● A serial file is just like a file in a normal file system, usually stored under a serial path
● The Data Location URL for a serial file uses a parameter like $AI_SERIAL
Multifile
● When you use multifiles, your graphs can read, write, and process data in parallel
● A multifile is made up of many partitions. Each partition of the multifile is an ordinary file
that contains part of the dataset
● The Data Location URL for a multifile uses the mfile protocol and a parameter like $AI_MFS
● Multifiles use the same record formats as serial files
Component
Read Hive table
Read Hive Table reads data from a Hive table. In this component you can use a filter expression to
select specific data, but you can filter only on the partition columns (year, month, day).
Write Hive table
Write Hive Table writes data to a Hive table. You specify the write method in a parameter, for
example overwriting the table before the load or appending every new record.
Read Excel Spreadsheet
Read Excel Spreadsheet reads a Microsoft Excel spreadsheet into a graph and creates an output
record for each row in the spreadsheet.
Partition by Key
Partition by Key distributes records across partitions according to a key, so that records with the
same key value are grouped in the same partition. You specify which column you want to
partition on.
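The idea of key-based partitioning can be illustrated outside Ab Initio. The following is a minimal Python sketch, not Ab Initio code: the function name `partition_by_key`, the field `cust_id`, and the use of a CRC32 hash are all assumptions made for illustration, and the real component may choose partitions differently.

```python
import zlib
from collections import defaultdict


def partition_by_key(records, key, n_partitions):
    """Assign each record to a partition chosen from a hash of its key value."""
    partitions = defaultdict(list)
    for rec in records:
        # crc32 is stable across runs, unlike Python's salted built-in hash() for str
        h = zlib.crc32(str(rec[key]).encode("utf-8"))
        partitions[h % n_partitions].append(rec)
    return [partitions[i] for i in range(n_partitions)]


records = [
    {"cust_id": "A", "amount": 10},
    {"cust_id": "B", "amount": 20},
    {"cust_id": "A", "amount": 30},
    {"cust_id": "C", "amount": 40},
]
parts = partition_by_key(records, "cust_id", 3)
```

Whatever the hash function, the property that matters is that all records sharing a key value end up in one partition, so later key-based steps can work on each partition independently.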
Reformat
Reformat changes the format of records by dropping fields, or by using DML expressions to add
fields, combine fields, or transform the data in the records
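As a rough picture of what a reformat step does, the Python sketch below drops one field, combines two fields, and transforms two others. It is an emulation under assumed field names (`first_name`, `amount_cents`, and so on), not DML.

```python
def reformat(rec):
    """Build the output record from the input record."""
    return {
        # combine two input fields into one
        "full_name": rec["first_name"] + " " + rec["last_name"],
        # transform a value (cents to currency units)
        "amount": rec["amount_cents"] / 100,
        # transform a value (normalise case)
        "country": rec["country"].upper(),
        # "internal_flag" is simply not copied, so it is dropped
    }


inp = [
    {"first_name": "Ada", "last_name": "Lovelace", "amount_cents": 1250,
     "country": "gb", "internal_flag": "x"},
]
out = [reformat(r) for r in inp]
```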
Merge
Merge combines data records from multiple flow partitions that have been sorted according to the
same key specifier, and it maintains that sort order in its output. In other words, Merge interleaves
the incoming flows so that the result stays sorted on the key.
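Combining already-sorted flows while preserving order is the classic k-way merge. A minimal Python sketch using the standard library's `heapq.merge` is shown here; the partitions `p0`, `p1`, `p2` and the field `id` are made up for illustration, and this says nothing about how Ab Initio implements the component internally.

```python
import heapq
from operator import itemgetter

# Three partitions, each already sorted on "id"
p0 = [{"id": 1}, {"id": 4}, {"id": 7}]
p1 = [{"id": 2}, {"id": 5}]
p2 = [{"id": 3}, {"id": 6}]

# heapq.merge reads the inputs lazily and always emits the smallest key next
merged = list(heapq.merge(p0, p1, p2, key=itemgetter("id")))
```

If any input is not sorted on the key, the output is not sorted either, which is why the inputs must share the same sort key.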
Concatenate
Use the Concatenate component where the flows need to be combined in a particular order.
Concatenate appends multiple flow partitions of data records one after another
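Appending flows one after another can be pictured with `itertools.chain` in Python. The flow names and contents below are hypothetical; the point is only that the whole of the first flow is emitted before any record of the second.

```python
from itertools import chain

flow_0 = ["header"]
flow_1 = ["row-1", "row-2"]
flow_2 = ["trailer"]

# Every record of flow_0 comes first, then flow_1, then flow_2
combined = list(chain(flow_0, flow_1, flow_2))
```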
Gather
Gather combines data records from multiple flow partitions (a multifile system), or from multiple
serial flows of the same type (the same DML record format), into a single serial flow. Gather
collects the records from the different flows in arbitrary order.
● Reads data records from the flows connected to the input port
● Combines the records arbitrarily and writes to the output
● Not key-based.
● Result ordering is unpredictable.
● Most useful method for efficient collection of data from multiple partitions and for
repartitioning.
● Used most frequently.
Sort (order by)
Sort orders the data in ascending or descending order according to the key specified. You specify
which column you want to sort on in the key parameter, which also describes the collation order.
Use Sort to order records before you send them to a component that requires grouped or sorted
records.
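A sort on more than one key, with a different direction per key, can be sketched in Python as follows. The fields `region` and `amount` are assumptions for the example, and the two-pass approach relies on Python's sort being stable; it is not a description of how the Sort component works internally.

```python
records = [
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 9},
    {"region": "east", "amount": 3},
]

# Sort on the minor key first (amount, descending) ...
by_amount_desc = sorted(records, key=lambda r: r["amount"], reverse=True)
# ... then on the major key (region, ascending); stability keeps the amount order
result = sorted(by_amount_desc, key=lambda r: r["region"])
```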
Join
The Join component combines records from different data sources. The key creates the
correspondence between the records in the data sets. Sort the data on the join key using the Sort
component before it reaches the Join component.
Join Types
● Inner Join
The transform is called to produce an output record only when there is a match in all of the
inputs
● Full Outer Join
The transform is called to produce an output record for every key, even if matching records
are not found in the other inputs
● Explicit Join
You set the record-required parameters to specify on a port-by-port basis whether the
transform will be called if a matching record is not found
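The three join types can be compared side by side in a small Python emulation. This sketch uses a dictionary lookup rather than sorted inputs, assumes at most one record per key on each side, and uses two hypothetical flags (`left_required`, `right_required`) to stand in for the record-required idea; none of these names come from Ab Initio.

```python
def join(left, right, key, left_required=True, right_required=True):
    """Join two record lists on `key` (unique keys per input assumed).

    both flags True  -> inner join
    both flags False -> full outer join
    mixed            -> explicit join (one side may be missing)
    """
    left_by_key = {r[key]: r for r in left}
    right_by_key = {r[key]: r for r in right}
    out = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        lrec, rrec = left_by_key.get(k), right_by_key.get(k)
        # Skip the key when a required side has no matching record
        if (lrec is None and left_required) or (rrec is None and right_required):
            continue
        out.append({**(lrec or {}), **(rrec or {})})
    return out


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "total": 50}, {"id": 3, "total": 75}]

inner = join(customers, orders, "id")
full_outer = join(customers, orders, "id", left_required=False, right_required=False)
left_outer = join(customers, orders, "id", left_required=True, right_required=False)
```

Here `inner` contains only id 2, `full_outer` contains ids 1, 2, and 3, and `left_outer` keeps every customer whether or not an order exists.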
Dedup Sorted
Dedup Sorted separates one specified data record in each group of data records from the rest of the
records in the group; that is, it removes duplicate records from the flow according to the key specified.
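Removing duplicates from key-sorted data amounts to keeping one record per run of equal keys. The Python sketch below uses `itertools.groupby` for that; the `keep` argument and the field names are hypothetical, and the input is assumed to be already sorted (or at least grouped) on the key.

```python
from itertools import groupby
from operator import itemgetter


def dedup_sorted(records, key, keep="first"):
    """Keep one record from each run of consecutive records with the same key."""
    out = []
    for _, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        out.append(group[0] if keep == "first" else group[-1])
    return out


records = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
    {"id": 3, "v": "d"},
    {"id": 3, "v": "e"},
]
first_of_each = dedup_sorted(records, "id")
last_of_each = dedup_sorted(records, "id", keep="last")
```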
Input file
Input File represents records read as input to a graph from one or more serial files or from a
multifile
Output file
Output File represents records written as output from a graph into one or more serial files or a
multifile
Input Table
Input Table unloads records from a database into a graph, allowing you to specify as the source
either a database table or an SQL statement that selects records from one or more tables
Output Table
Output Table loads records from a graph into a database, letting you specify the destination either
directly as a single database table, or through an SQL Statement that inserts records into one or
more tables
Filter by Expression
Filter by Expression filters records according to a DML expression or transform function, which
specifies the selection criteria
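Filtering on a selection expression can be pictured as splitting a flow by a predicate. In this Python sketch the predicate is an ordinary function; collecting the rejected records in a second list is a choice of the sketch, and the names `filter_records` and `amount` are made up for illustration.

```python
def filter_records(records, select):
    """Split records into those that satisfy `select` and those that do not."""
    selected, deselected = [], []
    for rec in records:
        (selected if select(rec) else deselected).append(rec)
    return selected, deselected


records = [{"amount": 5}, {"amount": 150}, {"amount": 99}, {"amount": 100}]
selected, deselected = filter_records(records, lambda r: r["amount"] >= 100)
```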
Replicate
Replicate copies the data it receives in each partition and sends a copy to each of the component’s out ports
SCAN
Scan generates a series of cumulative summary records (for example, running totals) for groups of
data records. You can use this component to assign a rank to each record within its group
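A cumulative summary per group, including a rank, can be sketched in Python like this. The function `scan` below is an emulation with assumed field names (`dept`, `pay`), and it expects the input to be grouped on the key already.

```python
from itertools import groupby
from operator import itemgetter


def scan(records, key, value):
    """Emit one output per input record with a running total and rank per group."""
    out = []
    for _, group in groupby(records, key=itemgetter(key)):
        running = 0
        for rank, rec in enumerate(group, start=1):
            running += rec[value]
            out.append({**rec, "running_total": running, "rank": rank})
    return out


records = [
    {"dept": "a", "pay": 10},
    {"dept": "a", "pay": 5},
    {"dept": "b", "pay": 7},
]
scanned = scan(records, "dept", "pay")
```

Unlike an aggregation that emits one record per group, this produces one record per input record, each carrying the summary so far.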
Expression
First Defined
first_defined(expr,value)
This function is equivalent to the ANSI-92 SQL function COALESCE and similar to the older Oracle
function NVL.
Is Defined
is_defined(expr)
Is Null
is_null(expr)
This function is equivalent to the IS NULL test in SQL
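The behaviour of these NULL-handling functions can be mimicked in Python by treating `None` as NULL. This is only an analogy: the Python functions below return booleans and reuse the DML names for readability, and they are not the DML functions themselves.

```python
def first_defined(expr, value):
    """Return expr unless it is NULL (None here), otherwise the fallback value."""
    return expr if expr is not None else value


def is_defined(expr):
    """True when expr is not NULL."""
    return expr is not None


def is_null(expr):
    """True when expr is NULL."""
    return expr is None


discount = first_defined(None, 0)      # falls back to 0
price = first_defined(19.99, 0)        # keeps the defined value
```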
String Substring
string_substring(string, start, length)
Extracts a string with a specified length, starting from a given location in an input string
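A Python equivalent helps when checking expected values by hand. The sketch below assumes that `start` counts from 1, which matches the way `string_substring(TRX_DATE,1,4)` is used elsewhere in this playbook to take the year; behaviour for a start below 1 is not modelled.

```python
def string_substring(s, start, length):
    """Return `length` characters of `s`, with `start` counted from 1."""
    begin = start - 1  # convert the 1-based position to Python's 0-based index
    return s[begin:begin + length]


year = string_substring("20240131", 1, 4)
year_month = string_substring("20240131", 1, 6)
day = string_substring("20240131", 7, 2)
```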
String Upcase
string_upcase(column)
String Downcase
string_downcase(column)
String like
string_like(column,expression)
Example: the following call tests whether the string "abcdef" matches the pattern "abc%", that is,
"abc" followed by any further characters:
string_like("abcdef", "abc%")
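Pattern matching of this kind can be emulated in Python by translating the pattern to a regular expression. The sketch assumes SQL LIKE conventions (`%` for any run of characters, including an empty one, and `_` for exactly one character) and returns 1 or 0; whether the DML function treats every edge case the same way is not confirmed here.

```python
import re


def string_like(s, pattern):
    """Return 1 if `s` matches the LIKE-style `pattern`, else 0."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return 1 if re.fullmatch(regex, s, flags=re.DOTALL) else 0


matches_prefix = string_like("abcdef", "abc%")
no_match = string_like("xabc", "abc%")
single_char = string_like("abc", "a_c")
```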
Is Valid
is_valid(expr)
To use is_valid to check the validity of fields in a record or union, call the function on a record or
union that contains validation function fields
String Filter
Returns the characters of the first string that also appear in the second string
"ABC"
string_filter("023#46#13", "0123456789")
"0234613"
Parameter
A parameter is a value that you specify to control some part of an object’s behaviour.
The object can be a project, component, graph, subgraph, plan, and so on. You type in a value for a
parameter (or click a button or select a value from a list), and thus specify the aspect of the object’s
behaviour identified by the parameter’s name.
Parameters also have attributes that specify various details, such as what type of value the
parameter can hold and whether the parameter is input or local.
The normal way to edit a component’s parameters is through the Parameters tab of the component
dialog.
Graph, subgraph, and project parameters are edited through the Parameters Editor. Component
parameters can also be edited with the Parameters Editor, but they are usually
edited through the component’s own dialog (with its Description, Parameters, Ports, and other tabs).
Parameter sets
Every object (component, graph, project, plan) has a parameter set consisting of all that object’s
parameters.
The parameter set completely controls the object’s behaviour. It’s like turning the knobs on a piece
of equipment: each parameter is like one knob, controlling one detail of the object’s behaviour.
When editing a graph in the GDE, you can view the complete parameter set of any component in the
graph by selecting the component in the Parameters Editor’s left pane. A graph by default has no
parameters of its own.
The complete component parameter set you see in the editor includes parameters not shown in the
component dialog’s Parameters tab.
The GDE allows you to create files containing sets of values for a given graph’s input parameters (as
well as any input parameters in the graph’s sandbox or common sandboxes). Such a file is called an
input pset.
A component’s full parameter set includes all the values that can be set for it, in any of its tabs in its
GDE dialog.
The component’s Description, Layout, Port and other tabs allow more convenient access to these
values than would be possible by showing them as parameters in the Parameters tab. Nevertheless,
all these values are parameters.
Parameter interpretation
The Interpretation you specify for a given parameter determines what kind of expression you can
use to define its value. Above all, it determines how you can make references to other parameters in
its definition.
Most of the interpretation methods offer, at a minimum, some form of $ substitution: by specifying
the name of another parameter, preceded by a $, you effectively substitute the value of that
parameter in the expression.
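The substitution idea is the same one used by Python's `string.Template`, which accepts both the `$name` and `${name}` forms. The sketch below is an analogy only: the parameter `PROJECT`, the path `/data/serial`, and the file name are hypothetical values, and the real resolution rules are richer than plain text replacement.

```python
from string import Template

# Values of already-defined parameters (hypothetical)
params = {"AI_SERIAL": "/data/serial", "PROJECT": "sales"}

# A definition that references other parameters; ${...} is needed where the
# name is followed directly by characters that could be part of a name
definition = "$AI_SERIAL/${PROJECT}_out.dat"

value = Template(definition).substitute(params)
```

The braces in `${PROJECT}_out` mark where the parameter name ends; without them the reference would be read as a parameter called `PROJECT_out`.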
The big difference between the interpretation methods is what they allow you to do in
addition to simply referencing other parameters:
● PDL interpretation
PDL (Parameter Definition Language) is a full-featured notation that allows you to specify parameter
references with $ or ${} substitution, as needed. It also allows you to use inline DML or Korn shell
expressions in parameter definitions.
● Shell interpretation
Allows you to use shell syntax to construct parameter definitions (along with $ or ${} substitution).
Shell interpretation is not as versatile as PDL; PDL includes the capability of using shell interpretation
along with other features that shell interpretation does not offer.
● $ substitution interpretation
Allows you to reference other parameters by specifying them in the $name form in the definition.
● ${} substitution interpretation
Allows you to reference other parameters by specifying them in the ${name} form in the definition.
● Constant interpretation
The value is used exactly as typed; references to other parameters are not resolved.
In the GDE, you specify a parameter’s interpretation method by first selecting the parameter in the
Parameters tab and then selecting the interpretation you want in the Interpretation box at the bottom
of the tab.
In the Parameters Editor, you can set or change a parameter’s interpretation method by selecting the
parameter and then editing the Interpretation attribute’s value in the editor’s right-hand pane.
Parameter example:
TRX_DATE $[(date("YYYYMMDD"))(today()-1)]
YEAR $[string_substring(TRX_DATE,1,4)]
MONTH $[string_substring(TRX_DATE,1,6)]
DAY $[(string(""))(date("YYYYMMDD"))(today()-1)]
PROCESS_MONTH $[(string(""))(date("YYYYMM"))(date("YYYYMMDD"))AS_OF_DATE]
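To check what values these definitions should produce for a given run date, a Python sketch of the same date arithmetic can help. It reflects one reading of the expressions above (yesterday's date as YYYYMMDD, with slices of it for year and month, and the first six characters of AS_OF_DATE for the process month); the function name `derive_params` and its arguments are hypothetical, and AS_OF_DATE is assumed to be supplied elsewhere as a YYYYMMDD string.

```python
from datetime import date, timedelta


def derive_params(today, as_of_date):
    """Compute the example parameters for a given run date."""
    trx_date = (today - timedelta(days=1)).strftime("%Y%m%d")
    return {
        "TRX_DATE": trx_date,
        "YEAR": trx_date[0:4],           # characters 1-4
        "MONTH": trx_date[0:6],          # characters 1-6, i.e. YYYYMM
        "DAY": trx_date,                 # yesterday as a YYYYMMDD string
        "PROCESS_MONTH": as_of_date[0:6],
    }


values = derive_params(date(2024, 3, 1), "20240215")
```

Running it for 1 March 2024 shows the leap-day case: the transaction date becomes 20240229 and the month becomes 202402.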
Graph example