Ab Initio Training
Applications
Ab Initio
Application Development Environments: Graphical, C++, Shell
Metadata Repository
Components: Component Library, User-defined Components, Third Party Components
DTM: Used to schedule graphs developed in the GDE. It can also maintain dependencies between graphs.
A graph, when deployed, generates a .ksh script.
Co>Operating System
[Figure: GDE, hosts, and agents]
File Extensions
– .mp Stored Ab Initio graph or graph component
– .mpc Program or custom component
– .mdc Dataset or custom dataset component
– .dml Data Manipulation Language file or record type definition
– .xfr Transform function file
– .dat Data file (either serial file or multifile)
Versions
To find the GDE version, select Help >> About Ab Initio from the GDE window.
To find the Co>Operating System version, select Run >> Settings from the GDE window and look for the Detected Base System Version.
Connecting to Co>op Server from GDE
Host Profile Setting
Enter the Host, Login, Password, and Host directory.
Select the Shell Type.
Ab Initio Components
Ab Initio provides a library of components. The Datasets, Partition, Transform, Sort, and Database components are used frequently.
Creating Graph
Type the label.
Specify the input .dat file.
Creating Graph - DML
Specify the .dml file.
Propagate from Neighbors: Copy record formats from a connected flow.
Same As: Copy the record format from a specific component’s port.
Path: Store record formats in a local file, a host file, or the Ab Initio repository.
Embedded: Type the record format directly as a string.
Creating Graph - DML
DML is Ab Initio’s Data Manipulation Language. DML describes data in terms of:
– Record Formats that list the fields and format of input, output, and intermediate records.
– Expressions that define simple computations, for example, selection.
– Transform Functions that control reformatting, aggregation, and other data transformations.
– Keys that specify groupings, ordering, and partitioning relationships between records.
Editing a .dml file through the Record Format Editor – Grid View
Creating Graph - Transform
Specify the .xfr file.
A transform function is either a DML file or a DML string that describes how you manipulate your data.
Ab Initio transform functions mainly consist of a series of assignment statements. Each statement is called a business rule.
When Ab Initio evaluates a transform function, it performs the following tasks:
– Initializes local variables
– Evaluates statements
– Evaluates rules
Transform function files have the .xfr extension.
Creating Graph - xfr
Transform function: A set of rules that compute output values from input values.
Business rule: Part of a transform function that describes how you manipulate one field of your output data.
Variable: Optional part of a transform function that provides storage for temporary values.
Statement: Optional part of a transform function that assigns values to variables in a specific order.
Sample Components
Sort
Dedup
Join
Replicate
Rollup
Filter by Expression
Merge
Lookup
Reformat etc.
Creating Graph – Sort Component
Specify the key for the Sort.
Sort: The Sort component reorders data. It has two parameters: key and max-core.
Key: The key parameter describes the collation order.
Max-core: The max-core parameter controls how often the Sort component dumps data from memory to disk.
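The effect of a memory limit on sorting can be illustrated with a small external merge sort in Python. This is only a sketch of the general technique, not Ab Initio's implementation: the function name `external_sort` and the parameter `max_in_memory` are hypothetical, and the limit is counted in records here as a simplifying assumption.

```python
import heapq
import pickle
import tempfile


def external_sort(records, key, max_in_memory):
    """Yield records in sorted order while buffering at most max_in_memory of them.

    Hypothetical stand-in for a memory-limited sort: a smaller limit means
    more frequent spills to disk and more sorted runs to merge at the end.
    """
    runs = []    # temporary files, each holding one sorted run
    buffer = []  # records currently held in memory

    def spill():
        # Sort the in-memory buffer and dump it to a temporary file.
        buffer.sort(key=key)
        run = tempfile.TemporaryFile()
        for rec in buffer:
            pickle.dump(rec, run)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    def read_run(run):
        # Stream records back from one spilled run.
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    try:
        # Merge the sorted runs without loading them fully into memory.
        yield from heapq.merge(*(read_run(r) for r in runs), key=key)
    finally:
        for run in runs:
            run.close()


amounts = [30, 9, 8, 92, 23, 14, 38, 2]
sorted_amounts = list(external_sort(amounts, key=lambda a: a, max_in_memory=3))
# sorted_amounts == [2, 8, 9, 14, 23, 30, 38, 92]
```

With a limit of 3 records, the eight amounts are spilled as three sorted runs and then merged.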
Creating Graph – Dedup component
The Dedup component removes duplicate records.
The dedup criterion is one of unique-only, first, or last.
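The three criteria can be sketched in Python as follows. The function `dedup` and its `keep` argument are hypothetical names chosen for illustration, and the sketch assumes that records with the same key are already adjacent (for example, after a sort) and that unique-only keeps only records whose key occurs exactly once.

```python
from itertools import groupby


def dedup(records, key, keep="first"):
    """Remove duplicates from records that are already grouped by key."""
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])       # first record of each key group
        elif keep == "last":
            out.append(group[-1])      # last record of each key group
        elif keep == "unique-only":
            if len(group) == 1:        # drop every key that has duplicates
                out.append(group[0])
        else:
            raise ValueError(f"unknown keep criterion: {keep!r}")
    return out


# (zipcode, amount) pairs, sorted by zipcode
rows = [("02114", 9), ("02114", 14), ("02116", 30), ("02241", 92), ("02241", 2)]
by_zip = lambda r: r[0]

first = dedup(rows, by_zip, "first")         # [("02114", 9), ("02116", 30), ("02241", 92)]
last = dedup(rows, by_zip, "last")           # [("02114", 14), ("02116", 30), ("02241", 2)]
unique = dedup(rows, by_zip, "unique-only")  # [("02116", 30)]
```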
Component parallelism
Pipeline parallelism
Data parallelism
Component Parallelism
Example: sorting Customers and sorting Transactions at the same time.
Comes “for free” with graph programming.
Limitation:
– Scales to the number of “branches” in a graph.
Pipeline Parallelism
Example: one component is processing record 100 while the next is processing record 99.
Comes “for free” with graph programming.
Limitations:
– Scales to the length of “branches” in a graph.
– Some operations, like sorting, do not pipeline.
Data Parallelism
[Figure: data divided into partitions]
Two ways of looking at data parallelism: the expanded view and the global view.
Scales with data.
Data Partitioning: The Global View
[Figure labels: Degree of Parallelism, Fan-out Flow]
Session III
Partitioning
Partitioning Review
Fan-out Flow
[Figure: Partition 0 through Partition 3, shown once balanced and once skewed]
Balanced: Processors get neither too much nor too little.
Skewed: Some processors get too much, others too little.
Sample Data to be Partitioned
Customers:
42John 02116 30
43Mark 02114  9
44Bob  02116  8
45Sue  02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241  2
Record format:
record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
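To make the record format concrete, the sketch below slices one fixed-width Customers line into its fields in Python. The names `Customer` and `parse_customer` are hypothetical, and two choices are assumptions of this sketch rather than anything DML prescribes: the name is stripped of padding, and the zipcode is kept as text so that its leading zero survives.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    id: int
    name: str
    zipcode: str
    amount: int


def parse_customer(line: str) -> Customer:
    """Slice one record laid out as 2 + 5 + 5 + 3 characters (newline excluded)."""
    return Customer(
        id=int(line[0:2]),         # decimal(2) id
        name=line[2:7].strip(),    # string(5) name, padded with spaces
        zipcode=line[7:12],        # decimal(5) zipcode, kept as text here
        amount=int(line[12:15]),   # decimal(3) amount, right-aligned
    )


RAW = "42John 02116 30\n43Mark 02114  9\n"
customers = [parse_customer(line) for line in RAW.splitlines()]
# customers[0] == Customer(id=42, name="John", zipcode="02116", amount=30)
# customers[1] == Customer(id=43, name="Mark", zipcode="02114", amount=9)
```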
Partition by Round-robin
Partition 0        Partition 1        Partition 2
42John 02116 30    43Mark 02114  9    44Bob  02116  8
45Sue  02241 92    46Rick 02116 23    47Bill 02114 14
48Mary 02116 38    49Jane 02241  2
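The round-robin distribution shown above can be reproduced with a few lines of Python. The function name `partition_round_robin` is a hypothetical one used only for this sketch; records are dealt to the partitions in turn, like cards.

```python
def partition_round_robin(records, n):
    """Deal records to n partitions in turn: record i goes to partition i % n."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


names = ["John", "Mark", "Bob", "Sue", "Rick", "Bill", "Mary", "Jane"]
parts = partition_round_robin(names, 3)
# parts[0] == ["John", "Sue", "Mary"]
# parts[1] == ["Mark", "Rick", "Jane"]
# parts[2] == ["Bob", "Bill"]
```

Because the record content is never inspected, the partition sizes differ by at most one record.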
Partition by Key
Partition on zipcode:
Partition 0        Partition 1
43Mark 02114  9    42John 02116 30
45Sue  02241 92    44Bob  02116  8
47Bill 02114 14    46Rick 02116 23
49Jane 02241  2    48Mary 02116 38
Partition by Key is often followed by a Sort
Sort on zipcode:
Partition 0        Partition 1
43Mark 02114  9    42John 02116 30
47Bill 02114 14    44Bob  02116  8
45Sue  02241 92    46Rick 02116 23
49Jane 02241  2    48Mary 02116 38
Rollup by zipcode:
Totals by Zipcode (Partition 0)    Totals by Zipcode (Partition 1)
02114 23                           02116 99
02241 94
Partition by Key
Key-based.
Usually results in well balanced data.
Useful for key-dependent parallelism.
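The reason key partitioning supports key-dependent work such as a rollup can be sketched in Python: every record with a given zipcode lands in the same partition, so each partition can total its own zipcodes independently. The names `partition_by_key` and `rollup_amount` are hypothetical, and `zlib.crc32` is only a stand-in hash, so the assignment of zipcodes to partitions will not necessarily match the slides.

```python
import zlib
from collections import defaultdict

CUSTOMERS = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_key(records, key, n):
    """Send each record to the partition chosen by a hash of its key."""
    parts = [[] for _ in range(n)]
    for rec in records:
        # crc32 is an assumed stand-in; it is stable across runs.
        parts[zlib.crc32(key(rec).encode()) % n].append(rec)
    return parts


def rollup_amount(partition):
    """Total the amount per zipcode within one partition."""
    totals = defaultdict(int)
    for _id, _name, zipcode, amount in partition:
        totals[zipcode] += amount
    return dict(totals)


parts = partition_by_key(CUSTOMERS, key=lambda r: r[2], n=2)
totals = {}
for part in parts:
    # No zipcode spans two partitions, so per-partition totals never collide.
    totals.update(rollup_amount(part))
# totals == {"02114": 23, "02116": 99, "02241": 94} (in some key order)
```

This sketch totals with a dictionary, so it does not need the sort step that the slides place between the partition and the rollup.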
Partition by Expression
Expression: amount/33
Partition 0        Partition 1        Partition 2
42John 02116 30    48Mary 02116 38    45Sue  02241 92
43Mark 02114  9
44Bob  02116  8
46Rick 02116 23
47Bill 02114 14
49Jane 02241  2
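The amount/33 example can be checked with a short Python sketch in which the expression result is used directly as the partition number. The function name `partition_by_expression` is hypothetical, and integer division is assumed for the expression.

```python
def partition_by_expression(records, expr, n):
    """Send each record to the partition whose number the expression returns."""
    parts = [[] for _ in range(n)]
    for rec in records:
        index = expr(rec)
        if not 0 <= index < n:
            raise ValueError(f"expression gave partition {index}, expected 0..{n - 1}")
        parts[index].append(rec)
    return parts


customers = [("John", 30), ("Mark", 9), ("Bob", 8), ("Sue", 92),
             ("Rick", 23), ("Bill", 14), ("Mary", 38), ("Jane", 2)]

parts = partition_by_expression(customers, lambda r: r[1] // 33, 3)
names = [[name for name, _ in part] for part in parts]
# names[0] == ["John", "Mark", "Bob", "Rick", "Bill", "Jane"]
# names[1] == ["Mary"]
# names[2] == ["Sue"]
```

Six of the eight records fall into partition 0, which shows how strongly the balance depends on the expression and on the data.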
Partition by Expression
Key-based.
Resulting balance depends on the set of splitters chosen.
Useful for “binning” and global sorting.
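How splitters drive both balance and global sorting can be sketched in Python. This assumes that splitters are sorted boundary values and that a key equal to a splitter goes to the lower partition; the function name `partition_by_range` and the splitter values 10 and 30 are hypothetical.

```python
from bisect import bisect_left


def partition_by_range(records, key, splitters):
    """Use sorted splitter values as boundaries between consecutive partitions."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        # bisect_left sends key <= splitter to the lower partition (an assumption).
        parts[bisect_left(splitters, key(rec))].append(rec)
    return parts


amounts = [30, 9, 8, 92, 23, 14, 38, 2]
parts = partition_by_range(amounts, key=lambda a: a, splitters=[10, 30])
# parts == [[9, 8, 2], [30, 23, 14], [92, 38]]

# Sorting each partition and reading the partitions in order gives a global sort.
globally_sorted = [a for part in parts for a in sorted(part)]
# globally_sorted == [2, 8, 9, 14, 23, 30, 38, 92]
```

A poor choice such as splitters 50 and 90 would put seven of the eight records in partition 0, which is the balance dependence noted above.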
Partition with Load Balance
Not key-based.
Results in skewed data distribution to
complement skewed load.
Useful for record-independent parallelism.
Partition with Percentage
With percentages: 4, 20
Partition 0        Partition 1        Partition 2
42John 02116 30    46Rick 02116 23    ...
43Mark 02114  9    47Bill 02114 14
44Bob  02116  8    48Mary 02116 38
45Sue  02241 92    49Jane 02241  2
Not key-based.
Usually results in a skewed data distribution conforming to the provided percentages.
Useful for record-independent parallelism.
Broadcast (as a Partitioner)
Unlike all other partitioners, which write a record to ONE output flow, Broadcast writes each record to EVERY output flow.
Partition 0        Partition 1        Partition 2
42John 02116 30    42John 02116 30    42John 02116 30
43Mark 02114  9    43Mark 02114  9    43Mark 02114  9
44Bob  02116  8    44Bob  02116  8    44Bob  02116  8
45Sue  02241 92    45Sue  02241 92    45Sue  02241 92
46Rick 02116 23    46Rick 02116 23    46Rick 02116 23
47Bill 02114 14    47Bill 02114 14    47Bill 02114 14
48Mary 02116 38    48Mary 02116 38    48Mary 02116 38
49Jane 02241  2    49Jane 02241  2    49Jane 02241  2
Broadcast
Not key-based
Results in perfectly balanced partitions
Useful for record-independent parallelism.
Session IV
De-Partitioning
Departitioning
[Figure: three Score partitions feed a Departition component that writes one Output File]
Global View: Departitioning is a fan-in flow.
Sorted data:
49Jane 02241  2
44Bob  02116  8
43Mark 02114  9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue  02241 92
Concatenation
Not key-based.
Result ordering is by partition.
Serializes pipelined computation.
Useful for:
– creating serial flow from partitioned data
– appending headers and trailers
– writing DML
Used infrequently
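Concatenation can be sketched in Python as reading each partition to its end before starting the next one. The function name `concatenate` and the header and trailer strings are hypothetical; the example mirrors the header-and-trailer use listed above.

```python
from itertools import chain


def concatenate(partitions):
    """Read partition 0 to its end, then partition 1, and so on."""
    return list(chain.from_iterable(partitions))


header = ["HEADER"]
body = ["42John 02116 30", "43Mark 02114  9"]
trailer = ["TRAILER count=2"]

output = concatenate([header, body, trailer])
# output == ["HEADER", "42John 02116 30", "43Mark 02114  9", "TRAILER count=2"]
```

Nothing from a later partition is emitted until every earlier partition is exhausted, which is why concatenation serializes the upstream computation.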
Merge
Round-robin partitioned and sorted by amount:
Partition 0        Partition 1        Partition 2
42John 02116 30    49Jane 02241  2    44Bob  02116  8
48Mary 02116 38    43Mark 02114  9    47Bill 02114 14
45Sue  02241 92    46Rick 02116 23
Sorted data, following merge on amount:
49Jane 02241  2
44Bob  02116  8
43Mark 02114  9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue  02241 92
Merge
Key-based.
Result ordering is sorted if each input is sorted.
Possibly synchronizes pipelined computation;
may even serialize.
Useful for creating ordered data flows.
Used more than concatenate, but still infrequently
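The merge of sorted partitions corresponds closely to Python's standard `heapq.merge`, which repeatedly takes the smallest head record among its inputs. The sketch below uses the three round-robin partitions from the example, each already sorted by amount; the variable names are chosen only for illustration.

```python
import heapq

# Three partitions, each already sorted by amount (name, amount).
parts = [
    [("John", 30), ("Mary", 38), ("Sue", 92)],
    [("Jane", 2), ("Mark", 9), ("Rick", 23)],
    [("Bob", 8), ("Bill", 14)],
]

# The output is sorted only because every input is sorted on the same key.
merged = list(heapq.merge(*parts, key=lambda rec: rec[1]))
names = [name for name, _ in merged]
# names == ["Jane", "Bob", "Mark", "Bill", "Rick", "John", "Mary", "Sue"]
```

The merge can emit a record only after it has seen the head of every partition that still has data, which is where the synchronization mentioned above comes from.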
Interleave
Round-robin partitioned and scored:
Partition 0         Partition 1         Partition 2
42John 02116 30A    43Mark 02114  9C    44Bob  02116  8C
45Sue  02241 92A    46Rick 02116 23B    47Bill 02114 14B
48Mary 02116 38A    49Jane 02241  2C
Scored dataset in original order, following interleave:
42John 02116 30A
43Mark 02114  9C
44Bob  02116  8C
45Sue  02241 92A
46Rick 02116 23B
47Bill 02114 14B
48Mary 02116 38A
49Jane 02241  2C
Interleave
Not key-based.
Result ordering is inverse of round-robin.
Synchronizes pipelined computation.
Useful for restoring original order following a
record-independent parallel computation
partitioned by round-robin.
Used in rare circumstances
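Interleaving as the inverse of round-robin can be sketched in Python by taking one record from each partition in turn. The function name `interleave` is hypothetical, and the sketch assumes the partitions were produced by round-robin and that each partition kept its internal order.

```python
from itertools import zip_longest

_MISSING = object()  # marks partitions that have run out of records


def interleave(partitions):
    """Take one record from each partition in turn, undoing round-robin."""
    out = []
    for row in zip_longest(*partitions, fillvalue=_MISSING):
        out.extend(rec for rec in row if rec is not _MISSING)
    return out


parts = [
    ["John", "Sue", "Mary"],
    ["Mark", "Rick", "Jane"],
    ["Bob", "Bill"],
]
restored = interleave(parts)
# restored == ["John", "Mark", "Bob", "Sue", "Rick", "Bill", "Mary", "Jane"]
```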
Gather
Not key-based.
Result ordering is unpredictable.
Neither serializes nor synchronizes pipelined
computation.
Useful for efficient collection of data from multiple
partitions and for repartitioning.
Used most frequently
Layout
dedupn
Set the dedupn parameter to true to remove duplicates from the corresponding inn port before joining, where n is the number of the in port. This lets you choose only one record from a group with matching key values as the argument to the transform function.
The default is false, which does not remove duplicates.
override-keyn
Alternative name(s) for the key field(s) for a particular inn port.