Intro 2

A Practical Introduction to
Ab Initio Software
Part 2: Building Applications
Confidential & Proprietary

Outline
• Constructing Applications
• Parallelism
• Data Partitioning
• Multifiles

Steps in Building an
Application
• Add datasets.
• Add components.
• Add flows.
• Modify as needed.
• Configure datasets and components along the way; let
the yellow “To Do” cues guide you.
• Generally, you should configure your input and output
metadata (record formats) before adding flows.

Adding an Input Dataset
1. Click on Component Button
2. Open Datasets Category
3. Choose InputFile

Configuring the Input
Dataset
1. Browse to find simple.dat
2. Browse to find simple.dml
3. Change label to something descriptive

Adding a Filter by Expression
Component
1. Open Transform Category
2. Choose Filter by Expression

Adding an Output Dataset
Choose OutputFile

Configuring the Output
Dataset
1. Browse to see directory
2. Enter name of output file

Adding Flows
1. Click on source (hold)
2. Drag to destination (release)

Configuring Filter by
Expression
Enter expression

Flows Can Propagate
Configuration
• One way to “Get rid of yellow” is to configure
datasets or components.
• Hooking up flows allows the GDE to
automatically propagate many kinds of
information, like record format metadata;
sometimes, connecting things is all you need
to do to “Get rid of yellow.”

Tip: Let Propagation Do the
Work!
• Define record formats for input datasets.
• Define record formats for output datasets only
when they differ from input datasets; let
propagation do as much as possible.
• If record formats change, this minimizes the
impact on the graph.
• Sometimes you will need to set record formats
on components. In such cases, you should
usually set the format on the output port.

Tip: Look Before Deleting
Components!
• Before deleting a component in a graph, look to
see whether the component defines record
formats for any of its ports. If you delete a
component with record format definitions, you
may lose the definitions.
• To safely delete such a component: For each
port with a record format definition, go to the
other end of the flow for that port (which will be
some other component or dataset) and uncheck
the ‘propagate from neighbor’ box for the
associated port.

Running the Application
1. Push “Run” button.
2. View monitoring information.
3. View output data.

Diagnostic Ports:
Reject, Error
• Reject: Input records that caused errors.
• Error: Error messages.

Instrumentation Parameters:
Reject-threshold
• A drop-down menu specifying the number of
errors to tolerate. The choice “Use
limit/ramp” allows for other possibilities.
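
As a rough illustration of how the reject and error ports relate to a reject threshold, here is a minimal Python sketch. The function name run_with_rejects, the max_rejects parameter, and the abort-when-exceeded behavior are assumptions made for this example; they are not the Co>Operating System's implementation, and the “Use limit/ramp” choice is not modeled.

```python
def run_with_rejects(records, transform, max_rejects):
    """Apply transform to each record; route failures to reject/error lists.

    max_rejects is a hypothetical stand-in for a reject-threshold setting:
    the run stops once more than max_rejects records have failed.
    """
    out, reject, error = [], [], []
    for rec in records:
        try:
            out.append(transform(rec))
        except Exception as exc:
            reject.append(rec)       # the input record that caused the error
            error.append(str(exc))   # the matching error message
            if len(reject) > max_rejects:
                raise RuntimeError(
                    f"reject threshold exceeded: {len(reject)} rejects")
    return out, reject, error


# One bad record is tolerated when max_rejects=1.
out, reject, error = run_with_rejects(["1", "x", "3"], int, max_rejects=1)
```

With max_rejects=0 the same input would stop at the first bad record, which corresponds to tolerating no errors at all.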

Diagnostic Port:
Log
• Log: Logging records.

Instrumentation Parameters:
Log
• Syntax: event OR event/n (where n is a power of 10)
• Logs records of type event. If n is specified, only
1 of every n records is logged. Valid events
are: input, output, reject, intermediate.
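
The event/n syntax above can be sketched in Python as follows. The helper names parse_log_spec and sample are hypothetical, and which record out of every n is kept is an assumption (here, the first of each group of n); the slide only states that 1 of every n is logged.

```python
VALID_EVENTS = {"input", "output", "reject", "intermediate"}


def parse_log_spec(spec):
    """Parse 'event' or 'event/n' into (event, n); n defaults to 1."""
    event, _, n_text = spec.partition("/")
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    n = int(n_text) if n_text else 1
    # n must be a power of 10: a single '1' followed only by zeros.
    if n < 1 or str(n).strip("0") != "1":
        raise ValueError(f"n must be a power of 10, got {n}")
    return event, n


def sample(records, n):
    """Keep 1 of every n records (assumed: the first of each group of n)."""
    return [rec for i, rec in enumerate(records) if i % n == 0]
```

For example, parse_log_spec("reject/100") yields the event name and the sampling interval, which sample can then apply to a stream of records.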

Logging Record Format
• Logging flows have predefined metadata.
• The record format is:
    record
      string("|") node;
      string("|") timestamp;
      string("|") component;
      string("|") subcomponent;
      string("|") event_type;
      string("|\n") event_text;
    end
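
To show what a record in this format looks like outside the graph, here is a small Python sketch that splits one logging record into its six fields. The LogRecord and parse_log_line names and the sample values are made up for illustration, and the sketch assumes no field contains the "|" delimiter.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    node: str
    timestamp: str
    component: str
    subcomponent: str
    event_type: str
    event_text: str


def parse_log_line(line):
    """Split one '|'-delimited logging record into its six fields."""
    # The last field is terminated by "|\n"; remove that terminator first.
    if not line.endswith("|\n"):
        raise ValueError("record must end with '|' and a newline")
    fields = line[:-2].split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return LogRecord(*fields)


# Sample values below are invented for the example.
rec = parse_log_line(
    "host1|2001-01-01 12:00:00|Reformat|reformat|input|first record|\n")
```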

Component: Gather Logs
• Reads logging records from multiple flows
connected to the input port and writes them to
the specified file outside of the application’s
transactional context. The start-text and end-text
parameter values are written to the log at
the beginning and end.

Component: Replicate
• Copies records from input port to multiple flows
connected to output port.

Sample Graph

Exercise 9: Creating a
Reformatting Application
• Create a new graph that:
  • Reads data from simple.dat with record format simple.dml.
  • Reformats that data with simple-out.xfr.
  • Writes the results to simple-out.dat with record format simple-out.dml.
• Run it and verify the results.

Exercise 10:
Obtaining Log Information
• Add a Gather Logs component to the application.
• Configure the component. Don’t forget to provide a log
file name.
• Connect it to the Reformat’s log port.
• Run the application.
• View the log file on the server.

Exercise 11: Creating an
Aggregation Application
• Create an application that:
  • Reads data from visits.dat with record format visits.dml.
  • Sorts it by city.
  • Aggregates it (using the Rollup component) by city with visits-to-city-rollup.xfr.
  • Writes the results to visits-to-city.dat with record format visits-to-city.dml.
  • Logs input, output, and intermediate events.

Computing without Sort
• Some components do not require pre-sorted inputs.
• These components work by keeping some or all of the
inputs in memory.
• These components usually have a sorted-input parameter,
or have the word hash in their name.
• There are rules of thumb about when to use “in-memory”
sorting or grouping vs. sorting before the component.
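
The difference between the two approaches can be sketched in Python. This is only an analogy for the idea, not how the Rollup component is implemented; the function names are hypothetical and the (city, amount) sample values are borrowed from the aggregation slides later in this deck.

```python
from itertools import groupby

visits = [("Bristol", 56), ("Compton", 12), ("Bristol", 7),
          ("London", 8), ("London", 23), ("New York", 42)]


def rollup_sorted(records):
    """Sorted-input style: groups must be contiguous, so sort first."""
    ordered = sorted(records, key=lambda r: r[0])
    return {city: sum(n for _, n in group)
            for city, group in groupby(ordered, key=lambda r: r[0])}


def rollup_in_memory(records):
    """In-memory style: keep one running total per key; no sort needed."""
    totals = {}
    for city, n in records:
        totals[city] = totals.get(city, 0) + n
    return totals
```

Both functions return the same totals. The in-memory version avoids the sort but has to hold one entry per distinct key in memory, which is the trade-off the rules of thumb are about.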

Exercise 12: Rollup without
Sort
• Open figure-05.
• Save As... to figure-05-nosort.
• Delete the Sort component.
• Change the sorted-input parameter of the Rollup component to
“in-memory…”
• Run the application and examine the results.

Exercise 13:
Join without Sort
• Open figure-06.
• Save As... to figure-06-nosort.
• Delete both Sort components.
• Change the sorted-input parameter of the Join component to
“in-memory…”

Forms of Parallelism
• Component parallelism
• Pipeline parallelism
• Data parallelism

Component Parallelism
Sorting Customers
Sorting Transactions

Component Parallelism
• Comes “for free” with graph programming.
• Limitation:
  • Scales to the number of “branches” in a graph.
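
As a loose analogy in Python, two independent branches (sorting customers and sorting transactions, as in the slide) can be submitted to a thread pool at the same time. The sample data is invented, and the sketch only illustrates the structure: threads in CPython do not necessarily give a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

customers = ["Smith", "Jones", "Forth"]
transactions = [56, 12, 7, 8]

# Two independent branches: neither sort needs the other's result,
# so both can be submitted at once and run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    sorted_customers = pool.submit(sorted, customers)
    sorted_transactions = pool.submit(sorted, transactions)
    results = (sorted_customers.result(), sorted_transactions.result())
```

The limitation is visible here too: with only two branches, adding more workers does not help.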

Pipeline Parallelism
Processing Record: 100
Processing Record: 99

Pipeline Parallelism
• Comes “for free” with graph programming.
• Limitations:
  • Scales to length of “branches” in a graph.
  • Some operations, like sorting, do not pipeline.
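
Pipelining can be imitated in Python with one thread per stage and a queue for each flow, so that a downstream stage works on one record while the upstream stage is already handling the next. The stage function, the DONE sentinel, and the two string operations are assumptions chosen for this sketch.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the record stream


def stage(func, inbox, outbox):
    """Read records from inbox, apply func, pass results downstream."""
    while (rec := inbox.get()) is not DONE:
        outbox.put(func(rec))
    outbox.put(DONE)


q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(str.strip, q1, q2)),
    threading.Thread(target=stage, args=(str.upper, q2, q3)),
]
for t in threads:
    t.start()

for rec in [" a ", " b ", " c "]:
    q1.put(rec)  # upstream keeps feeding while downstream stages work
q1.put(DONE)

for t in threads:
    t.join()

results = []
while (rec := q3.get()) is not DONE:
    results.append(rec)
```

A sort stage would not fit this pattern, because it has to read its entire input before it can emit its first record.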

Data Parallelism
(Figure label: Partitions)

Two Ways of Looking at
Data Parallelism
Expanded View:
Global View:

Data Parallelism
• Scales with data.
• Requires data partitioning.
• Different partitioning methods for different operations.

Data Partitioning
Expanded View:
Global View:

Data Partitioning:
The Global View
Degree of Parallelism
Fan-out Flow

Component:
Partition by Round-robin
• Reads records from its input port and writes them to
the flow partitions connected to its output port. Records
are written to partitions in “roundrobin” fashion, with
block-size records going to a partition before moving on
to the next.

Roundrobin Partitioning
Input records, in order: A B C D E F C D B G B A A D F E A D

Output partitions: Partition 0, Partition 1, Partition 2

Roundrobin Partitioning

Input records, in order: A B C D E F C D B G B A A D F E A D

Partition 0   Partition 1   Partition 2
A             B             C
D             E             F
C             D             B
G             B             A
A             D             F
E             A             D
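
The round-robin figure can be reproduced with a short Python sketch. The function name partition_round_robin is hypothetical, and the default block size of 1 is an assumption chosen because it matches the partitions shown in the figure.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records to partitions in turn, block_size records at a time."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions


# The 18 input records from the figure, dealt to three partitions.
records = list("ABCDEFCDBGBAADFEAD")
p0, p1, p2 = partition_round_robin(records, 3)
```

Note that records with the same value (for example, the four A records) end up spread across all three partitions; round-robin balances the load but ignores record content.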

A Data Parallel Application:
The Expanded View

Exercise 14: Data Parallel
Reformatting (Expanded)
• Open figure-04.
• Save As... to figure-04-expanded.
• Create a copy of the Reformat and the Simple-Out dataset (use Edit...Copy and
Edit...Paste).
• Change the path for the copy of Simple-Out.
• Add a Partition by Round-robin component before the Reformat components;
hook them up with flows.

A Data Parallel Application:
The Global View
Degree of Parallelism
(Abstract)
Fan-out Flow Multifile

What is a Multifile?
• A multifile is essentially the “global view” of a set of
ordinary files, each of which may be located
anywhere.
• Each partition of a multifile is an ordinary file.
• By using the global view and multifiles, you can
avoid having to draw data parallelism explicitly.
• Ab Initio utilities let you manipulate (copy, rename,
delete, etc.) multifiles as easily as ordinary files.
• Note that the icon for a multifile has 3 platters
instead of 2.

Multifiles
• Multifiles reside in multidirectories.
• Multidirectories and multifiles are identified using
URL syntax with “mfile” as the protocol part:
  • mfile:/users/training-07/test-mfs/
  • mfile:mfs2/transactions/
  • mfile://mktg-mpp/vol3/big-mfs/january/sales.dat
• These URLs are simply abbreviations for the many
pieces making up a multidirectory or multifile.
(See Chapter 2 of the Co>Operating System
Administrator’s Guide for more information on
multifiles.)

A Multidirectory
mfile://host1/u/jo/mfs
A single name for three directories

Control partition:  //host1/u/jo/mfs/
Data partition:     //host1/vol4/pA/
Data partition:     //host2/vol3/pB/
Data partition:     //host3/vol7/pC/

A Multifile
mfile://host1/u/jo/mfs/a.dat
A single name for three files
(Figure: a file named a.dat appears in the control partition and in each of the three data partitions.)
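
One way to picture the mapping from a single multifile name to its ordinary files is the Python sketch below. The Multidirectory class and its method are hypothetical; the directory names come from the multidirectory slide. The figure also shows an a.dat entry under the control directory, which this sketch leaves out and treats only the three data partitions as holding data (an assumption).

```python
from dataclasses import dataclass


@dataclass
class Multidirectory:
    control: str  # control partition directory
    data: list    # data partition directories

    def multifile_partitions(self, name):
        """Return the ordinary data files that make up multifile `name`."""
        return [d.rstrip("/") + "/" + name for d in self.data]


mfs = Multidirectory(
    control="//host1/u/jo/mfs/",
    data=["//host1/vol4/pA/", "//host2/vol3/pB/", "//host3/vol7/pC/"],
)
parts = mfs.multifile_partitions("a.dat")
```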


Additional Multidirectories
mfile://host1/u/jo/mfs/dir1

(Figure: the control partition and each data partition contain a dir1/ directory. After a second multidirectory, dir2, is added, each partition contains both dir1/ and dir2/.)



A Multidirectory Hierarchy

(Figure: each of the four partitions holds dir1/ and dir2/, with the files x.dat and b.dat.)

mfile://host1/u/jo/mfs/dir2/b.dat

Adding a Multifile Dataset
1. Drill into multidirectory
2. Type in filename

Exercise 15: Data Parallel
Reformatting (Global)
• Open figure-04.
• Save As... to figure-04-global.
• Add a Partition by Round-robin component.
• Change the Simple-Out dataset to a multifile.
• Run the application and examine the results (use the “Partition”
option in View Data).

Data Aggregation in Parallel
Input records:
0345Smith Bristol 56
0322Jones Compton 12
0121Forth Bristol 7
0212Spade London 8
0492West London 23
0221Black New York 42

Output:
Bristol 63
Compton 12
London 31
New York 42

Data Aggregation of Grouped
Input in Parallel
Input records (grouped by city)    Output
0345Smith Bristol 56
0121Forth Bristol 7                Bristol 63
0322Jones Compton 12               Compton 12
0212Spade London 8
0492West London 23                 London 31
0221Black New York 42              New York 42

Key-Dependent Data
Parallelism
• Aggregation processes records in groups defined by key
values.
• Parallel aggregation requires partitioning based on key
value.
• Parallel aggregation takes three steps, using the same key
in each step:
  • Partition by key.
  • Sort by key.
  • Aggregate by key.
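
The three steps can be sketched in Python using the city records from the aggregation slides. This is an illustration under stated assumptions: zlib.crc32 stands in for whatever hash code the real partitioner uses, three partitions are chosen arbitrarily, and the function names are hypothetical.

```python
import zlib
from itertools import groupby

records = [("Smith", "Bristol", 56), ("Jones", "Compton", 12),
           ("Forth", "Bristol", 7), ("Spade", "London", 8),
           ("West", "London", 23), ("Black", "New York", 42)]


def city(rec):
    return rec[1]


def partition_by_key(recs, key, n):
    """Step 1: a hash of the key picks the partition, so equal keys stay together."""
    parts = [[] for _ in range(n)]
    for rec in recs:
        parts[zlib.crc32(key(rec).encode()) % n].append(rec)
    return parts


def aggregate_sorted(recs, key):
    """Step 3: sum the amount over each run of equal keys (input must be sorted)."""
    return [(k, sum(r[2] for r in grp)) for k, grp in groupby(recs, key=key)]


totals = []
for part in partition_by_key(records, city, 3):
    part.sort(key=city)  # Step 2: sort by the same key
    totals.extend(aggregate_sorted(part, city))
```

Because every record for a given city lands in one partition, each partition can be aggregated independently and the per-partition results simply concatenated.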

Component: Partition by Key
• Reads records from its input port and writes them to
the flow partitions connected to its output port. A hash
code computed using the key determines which
partition a record will be written to, meaning that
records with the same key value will go to the same
partition.

Partitioning by Key

Input records, in order: A B C D E F C D B G B A A D F E A D

Partitioning by Key

Input records, in order: A B C D E F C D B G B A A D F E A D

Partition 0   Partition 1   Partition 2
A             B             D
C             E             F
C             B             D
A             G             D
A             B             F
A             E             D

Partition by Key + Sort =
Parallel Grouping
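Input records, in order: A B C D E F C D B G B A A D F E A D

After Partition by Key:
Partition 0   Partition 1   Partition 2
A             B             D
C             E             F
C             B             D
A             G             D
A             B             F
A             E             D

After Sort within each partition:
Partition 0   Partition 1   Partition 2
A             B             D
A             B             D
A             B             D
A             E             D
C             E             F
C             G             F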

Common Mistakes
• Incorrect results if:
  • Keys for partition, sort, or aggregate differ.
  • Data is partitioned, but is never sorted.
• Computationally expensive if:
  • Data is sorted before it is partitioned.
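
The “partitioned, but never sorted” mistake can be demonstrated with a Python analogy. The aggregate_sorted function is a hypothetical stand-in for an aggregation that expects key-sorted input; the three records are taken from the aggregation slides.

```python
from itertools import groupby


def aggregate_sorted(recs):
    """Sum amounts over each run of equal keys; assumes key-sorted input."""
    return [(k, sum(n for _, n in grp))
            for k, grp in groupby(recs, key=lambda r: r[0])]


partition = [("Bristol", 56), ("Compton", 12), ("Bristol", 7)]

wrong = aggregate_sorted(partition)          # never sorted: Bristol is split
right = aggregate_sorted(sorted(partition))  # tuples sort by city first
```

In the unsorted case Bristol appears twice in the output with partial totals, which is the kind of incorrect result the slide warns about.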

Exercise 16:
Data Parallel Aggregation
• Start with figure-05.
• Save As... to figure-05-parallel.
• Add a Partition by Key component.
• Change the output file to a multifile.
