
A Practical Introduction to

Ab Initio Software

Part 2: Building Applications

Confidential & Proprietary


Outline

• Constructing Applications
• Parallelism
• Data Partitioning
• Multifiles



Steps in Building an
Application
• Add datasets.
• Add components.
• Add flows.
• Modify as needed.

• Configure datasets and components along the way; let the yellow “To Do” cues guide you.
• Generally, you should configure your input and output metadata (record formats) before adding flows.



Adding an Input Dataset

1. Click on Component Button

2. Open Datasets Category

3. Choose InputFile



Configuring the Input
Dataset
1. Browse to find simple.dat

2. Browse to find simple.dml

3. Change label to something descriptive



Adding a Filter by Expression
Component

1. Open Transform Category

2. Choose Filter by Expression



Adding an Output Dataset

Choose OutputFile



Configuring the Output
Dataset

1. Browse to see directory

2. Enter name of output file



Adding Flows

1. Click on source (hold)

2. Drag to destination (release)



Configuring Filter by
Expression

Enter expression



Flows Can Propagate
Configuration

• One way to “Get rid of yellow” is to configure datasets or components.

• Hooking up flows allows the GDE to automatically propagate many kinds of information, like record format metadata; sometimes, connecting things is all you need to do to “Get rid of yellow.”



Tip: Let Propagation Do the
Work!
• Define record formats for input datasets.

• Define record formats for output datasets only when they differ from input datasets; let propagation do as much as possible.

• If record formats change, this minimizes the impact on the graph.

• Sometimes you will need to set record formats on components. In such cases, usually you should set the format on the output port.

Tip: Look Before Deleting
Components!
• Before deleting a component in a graph, look to
see whether the component defines record
formats for any of its ports. If you delete a
component with record format definitions, you
may lose the definitions.

• To safely delete such a component: For each port with a record format definition, go to the other end of the flow for that port (which will be some other component or dataset) and uncheck the ‘propagate from neighbor’ box for the associated port.



Running the Application

1. Push “Run” button.

2. View monitoring information.

3. View output data.



Diagnostic Ports:
Reject, Error

• Reject: Input records that caused errors.
• Error: Error messages.



Instrumentation Parameters:
Reject-threshold

• A drop-down menu specifying the number of errors to tolerate. The choice “Use limit/ramp” allows for other possibilities.



Diagnostic Port:
Log

• Log: Logging records.



Instrumentation Parameters:
Log

• Syntax: event OR event/n, where n is a power of 10.

• Logs records of type event. If n is specified, only 1 of every n records is logged. Valid events are:
  • input, output, reject, intermediate
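
To make the syntax concrete, here is a minimal Python sketch that parses a log parameter value and decides which records to log. It is not Ab Initio code: the function names are hypothetical, and choosing every n-th record by index is an assumption, since the slide only says that 1 of every n records is logged.

```python
VALID_EVENTS = {"input", "output", "reject", "intermediate"}


def is_power_of_ten(n):
    """Return True for 1, 10, 100, ... and False otherwise."""
    while n > 1 and n % 10 == 0:
        n //= 10
    return n == 1


def parse_log_spec(spec):
    """Split 'event' or 'event/n' into (event, n); n defaults to 1."""
    event, _, rate = spec.partition("/")
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event type: {event!r}")
    n = int(rate) if rate else 1
    if not is_power_of_ten(n):
        raise ValueError(f"sampling rate must be a power of 10, got {n}")
    return event, n


def should_log(record_index, n):
    """Assumed sampling rule: keep records 0, n, 2n, ..."""
    return record_index % n == 0


event, n = parse_log_spec("reject/100")
logged = [i for i in range(1000) if should_log(i, n)]
print(event, n, len(logged))
```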



Logging Record Format

• Logging flows have predefined metadata.


• The record format is:

    record
      string("|") node;
      string("|") timestamp;
      string("|") component;
      string("|") subcomponent;
      string("|") event_type;
      string("|\n") event_text;
    end
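
A log file written in this format can be read back outside the GDE. The Python sketch below is one way to do that, assuming the first five fields never contain "|" and that event_text never contains "|" followed by a newline. The sample lines are invented for illustration and are not real Ab Initio output.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    node: str
    timestamp: str
    component: str
    subcomponent: str
    event_type: str
    event_text: str


def parse_log_records(text):
    """Parse pipe-delimited logging records; each record ends with '|' + newline."""
    records = []
    for chunk in text.split("|\n"):
        if not chunk:
            continue  # skip the empty piece after the final terminator
        fields = chunk.split("|", 5)  # at most 6 fields; the last is event_text
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {chunk!r}")
        records.append(LogRecord(*fields))
    return records


# Hypothetical sample data, made up for this sketch.
sample = (
    "host1|2001-01-01 12:00:00|Reformat|0|input|first record|\n"
    "host1|2001-01-01 12:00:01|Reformat|0|reject|bad value|\n"
)
parsed = parse_log_records(sample)
for rec in parsed:
    print(rec.event_type, rec.event_text)
```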



Component: Gather Logs

• Reads logging records from multiple flows connected to the input port and writes them to the specified file outside of the application’s transactional context. The start-text and end-text parameter values are written to the log at the beginning and end.



Component: Replicate

• Copies records from input port to multiple flows connected to output port.



Sample Graph



Exercise 9: Creating a
Reformatting Application

• Create a new graph that:
  • Reads data from simple.dat with record format simple.dml.
  • Reformats that data with simple-out.xfr.
  • Writes the results to simple-out.dat with record format simple-out.dml.

• Run it and verify the results.



Exercise 10:
Obtaining Log Information
• Add a Gather Logs component to the application.

• Configure the component. Don’t forget to provide a log file name.

• Connect it to the Reformat’s log port.

• Run the application.

• View the log file on the server.



Exercise 11: Creating an
Aggregation Application
• Create an application that:
  • Reads data from visits.dat with record format visits.dml.
  • Sorts it by city.
  • Aggregates it (using the Rollup component) by city with visits-to-city-rollup.xfr.
  • Writes the results to visits-to-city.dat with record format visits-to-city.dml.
  • Logs input, output, and intermediate events.



Computing without Sort

Some components do not require pre-sorted inputs.

These components work by keeping some or all of the inputs in memory.

These components usually have a sorted-input parameter, or have the word hash in their name.

There are rules of thumb about when to use “in-memory” sorting or grouping vs. sorting before the component.
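
The in-memory approach can be pictured with a short Python sketch. This is an analogy only, not how the Rollup component is implemented: a dictionary holds one running total per key, so the input may arrive in any order, at the cost of memory that grows with the number of distinct keys. The (city, amount) records are illustrative.

```python
from collections import defaultdict


def rollup_in_memory(records):
    """Sum amounts per city without sorting; keeps one total per distinct key."""
    totals = defaultdict(int)
    for city, amount in records:
        totals[city] += amount
    return dict(totals)


# Unsorted input: the two Bristol records are not adjacent.
records = [
    ("Bristol", 56), ("Compton", 12), ("Bristol", 7),
    ("London", 8), ("London", 23), ("New York", 42),
]
totals = rollup_in_memory(records)
print(totals)
```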



Exercise 12: Rollup without
Sort
• Open figure-05.

• Save As... to figure-05-nosort.

• Delete the Sort component.

• Change the sorted-input parameter of the Rollup component to “in-memory…”

• Run the application and examine the results.



Exercise 13:
Join without Sort

• Open figure-06.

• Save As... to figure-06-nosort.

• Delete both Sort components.

• Change the sorted-input parameter of the Join component to “in-memory…”

• Run the application and examine the results.



Forms of Parallelism

• Component parallelism

• Pipeline parallelism

• Data parallelism



Component Parallelism

Sorting Customers

Sorting Transactions



Component Parallelism

• Comes “for free” with graph programming.

• Limitation:
  • Scales to the number of “branches” in a graph.



Pipeline Parallelism

Processing Record: 100

Processing Record: 99



Pipeline Parallelism

• Comes “for free” with graph programming.

• Limitations:
  • Scales to length of “branches” in a graph.
  • Some operations, like sorting, do not pipeline.



Data Parallelism

(Figure: data divided into partitions)



Two Ways of Looking at
Data Parallelism
Expanded View:

Global View:



Data Parallelism

• Scales with data.

• Requires data partitioning.

• Different partitioning methods for different operations.



Data Partitioning

Expanded View:

Global View:



Data Partitioning:
The Global View

Degree of Parallelism

Fan-out Flow



Component:
Partition by Round-robin
• Reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in round-robin fashion, with block-size records going to one partition before moving on to the next.
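
The distribution rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not the component's implementation; the function name is hypothetical, and the default block size of 1 is an assumption chosen so that each record goes to the next partition in turn.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Send block_size consecutive records to one partition, then move to the next."""
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        target = (i // block_size) % n_partitions
        partitions[target].append(record)
    return partitions


# Eighteen single-letter records distributed over three partitions.
records = list("ABCDEFCDBGBAADFEAD")
for number, part in enumerate(partition_round_robin(records, 3)):
    print(f"Partition {number}: {' '.join(part)}")
```

With a larger block size, runs of adjacent input records stay together: partition_round_robin(records, 3, block_size=2) sends A and B to partition 0, C and D to partition 1, and so on.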



Roundrobin Partitioning

Input records (before partitioning):
A B C D E F C D B G B A A D F E A D

Output partitions: Partition 0, Partition 1, Partition 2


Roundrobin Partitioning

Input records:
A B C D E F C D B G B A A D F E A D

After round-robin partitioning:

Partition 0   Partition 1   Partition 2
A             B             C
D             E             F
C             D             B
G             B             A
A             D             F
E             A             D


A Data Parallel Application:
The Expanded View



Exercise 14: Data Parallel
Reformatting (Expanded)

• Open figure-04.

• Save As... to figure-04-expanded.

• Create a copy of the Reformat and the Simple-Out dataset (use Edit...Copy and
Edit…Paste).

• Change the path for the copy of Simple-Out.

• Add a Partition by Round-robin component before the Reformat components; hook them up with flows.

• Run the application and examine the results.



A Data Parallel Application:
The Global View

Degree of Parallelism
(Abstract)

Fan-out Flow Multifile



What is a Multifile?
• A multifile is essentially the “global view” of a set of ordinary files, each of which may be located anywhere.

• Each partition of a multifile is an ordinary file.

• By using the global view and multifiles, you can avoid having to draw data parallelism explicitly.

• Ab Initio utilities let you manipulate (copy, rename, delete, etc.) multifiles as easily as ordinary files.

• Note that the icon for a multifile has 3 platters instead of 2.



Multifiles
• Multifiles reside in multidirectories.

• Multidirectories and multifiles are identified using URL syntax with “mfile” as the protocol part:
  • mfile:/users/training-07/test-mfs/
  • mfile:mfs2/transactions/
  • mfile://mktg-mpp/vol3/big-mfs/january/sales.dat

• These URLs are simply abbreviations for the many pieces making up a multidirectory or multifile. (See Chapter 2 of the Co>Operating System Administrator’s Guide for more information on multifiles.)



A Multidirectory
mfile://host1/u/jo/mfs
A single name for three directories

Partition           Directory
Control partition   //host1/u/jo/mfs/
Data partition      //host1/vol4/pA/
Data partition      //host2/vol3/pB/
Data partition      //host3/vol7/pC/



A Multifile
mfile://host1/u/jo/mfs/a.dat
A single name for three files

Partition           Directory            File
Control partition   //host1/u/jo/mfs/    a.dat
Data partition      //host1/vol4/pA/     a.dat
Data partition      //host2/vol3/pB/     a.dat
Data partition      //host3/vol7/pC/     a.dat
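
One way to picture the relationship between a multifile name and its pieces is the Python sketch below. It is a conceptual model only: the Multidirectory class and its methods are hypothetical and are not part of the Co>Operating System. The sketch simply records the control and data directories from this example and appends the same file name to each of them.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Multidirectory:
    """Hypothetical model: one control directory plus several data directories."""
    control: str
    data: list

    def control_file(self, name):
        """Path of the file with this name in the control partition."""
        return posixpath.join(self.control, name)

    def data_files(self, name):
        """Paths of the ordinary files holding the data partitions."""
        return [posixpath.join(directory, name) for directory in self.data]


mfs = Multidirectory(
    control="//host1/u/jo/mfs/",
    data=["//host1/vol4/pA/", "//host2/vol3/pB/", "//host3/vol7/pC/"],
)

print(mfs.control_file("a.dat"))
for path in mfs.data_files("a.dat"):
    print(path)
```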



Additional Multidirectories
mfile://host1/u/jo/mfs/dir1

Partition           Directory            Contents
Control partition   //host1/u/jo/mfs/    a.dat, dir1/
Data partition      //host1/vol4/pA/     a.dat, dir1/
Data partition      //host2/vol3/pB/     a.dat, dir1/
Data partition      //host3/vol7/pC/     a.dat, dir1/



Additional Multidirectories
mfile://host1/u/jo/mfs/dir2

Partition           Directory            Contents
Control partition   //host1/u/jo/mfs/    a.dat, dir1/, dir2/
Data partition      //host1/vol4/pA/     a.dat, dir1/, dir2/
Data partition      //host2/vol3/pB/     a.dat, dir1/, dir2/
Data partition      //host3/vol7/pC/     a.dat, dir1/, dir2/



A Multidirectory Hierarchy

Partition           Directory            Contents
Control partition   //host1/u/jo/mfs/    a.dat, dir1/x.dat, dir2/b.dat
Data partition      //host1/vol4/pA/     a.dat, dir1/x.dat, dir2/b.dat
Data partition      //host2/vol3/pB/     a.dat, dir1/x.dat, dir2/b.dat
Data partition      //host3/vol7/pC/     a.dat, dir1/x.dat, dir2/b.dat

mfile://host1/u/jo/mfs/dir2/b.dat



Adding a Multifile Dataset

1. Drill into multidirectory

2. Type in filename



Exercise 15: Data Parallel
Reformatting (Global)

• Open figure-04.

• Save As... to figure-04-global.

• Add a Partition by Round-robin component.

• Change the Simple-Out dataset to a multifile.

• Run the application and examine the results (use the “Partition”
option in View Data).



Data Aggregation in Parallel

Input records              Aggregated output

0345Smith Bristol 56       Bristol 63
0322Jones Compton 12       Compton 12
0121Forth Bristol 7

0212Spade London 8         London 31
0492West London 23         New York 42
0221Black New York 42



Data Aggregation of Grouped
Input in Parallel

Grouped input records      Aggregated output

0345Smith Bristol 56
0121Forth Bristol 7        Bristol 63
0322Jones Compton 12       Compton 12

0212Spade London 8
0492West London 23         London 31
0221Black New York 42      New York 42



Key-Dependent Data
Parallelism

• Aggregation processes records in groups defined by key values.

• Parallel aggregation requires partitioning based on key value.

• Parallel aggregation takes three steps, using the same key in each step:
  • Partition by key.
  • Sort by key.
  • Aggregate by key.
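
The three steps can be sketched in Python as follows. This is an analogy rather than Ab Initio code: zlib.crc32 stands in for whatever hash function the partitioning component actually uses, the (city, amount) field names are assumptions, and the partitions are processed one after another here instead of in parallel.

```python
import zlib
from itertools import groupby
from operator import itemgetter

records = [
    ("Bristol", 56), ("Compton", 12), ("Bristol", 7),
    ("London", 8), ("London", 23), ("New York", 42),
]


def partition_by_key(records, n_partitions):
    """Step 1: a hash of the key picks the partition, so equal keys stay together."""
    partitions = [[] for _ in range(n_partitions)]
    for city, amount in records:
        index = zlib.crc32(city.encode("utf-8")) % n_partitions
        partitions[index].append((city, amount))
    return partitions


def aggregate_sorted(partition):
    """Step 3: sum each run of adjacent equal keys (the input must be sorted by key)."""
    return [
        (city, sum(amount for _, amount in group))
        for city, group in groupby(partition, key=itemgetter(0))
    ]


partitions = partition_by_key(records, 2)                               # step 1
sorted_partitions = [sorted(p, key=itemgetter(0)) for p in partitions]  # step 2
results = [aggregate_sorted(p) for p in sorted_partitions]              # step 3

# Each city appears in exactly one partition's output.
combined = dict(pair for part in results for pair in part)
print(combined)
```

All three steps use the city as the key. Which partition a given city lands in depends on the hash, so the sketch checks only the combined totals.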



Component: Partition by Key

• Reads records from its input port and writes them to the flow partitions connected to its output port. A hash code computed from the key determines which partition a record is written to, so records with the same key value go to the same partition.



Partitioning by Key

Input records (before partitioning):
A B C D E F C D B G B A A D F E A D

Output partitions: Partition 0, Partition 1, Partition 2


Partitioning by Key

Input records:
A B C D E F C D B G B A A D F E A D

After partitioning by key:

Partition 0   Partition 1   Partition 2
A             B             D
C             E             F
C             B             D
A             G             D
A             B             F
A             E             D


Partition by Key + Sort =
Parallel Grouping
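Input records:
A B C D E F C D B G B A A D F E A D

After partitioning by key:

Partition 0   Partition 1   Partition 2
A             B             D
C             E             F
C             B             D
A             G             D
A             B             F
A             E             D

After sorting each partition by key:

Partition 0   Partition 1   Partition 2
A             B             D
A             B             D
A             B             D
A             E             D
C             E             F
C             G             F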



Common Mistakes

• Incorrect results if:
  • Keys for partition, sort, or aggregate differ.
  • Data is partitioned, but is never sorted.

• Computationally expensive if:
  • Data is sorted before it is partitioned.
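
The second mistake is easy to reproduce in a small Python sketch. Here itertools.groupby plays the role of an aggregator that expects sorted input, because it only merges adjacent records with equal keys; this is an analogy chosen for illustration, not a description of how any Ab Initio component is built.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted_input(records):
    """Sum amounts per run of adjacent equal keys, as a sorted-input aggregator would."""
    return [
        (city, sum(amount for _, amount in group))
        for city, group in groupby(records, key=itemgetter(0))
    ]


# One partition whose records were never sorted by key.
partition = [("Bristol", 56), ("Compton", 12), ("Bristol", 7)]

wrong = aggregate_sorted_input(partition)
right = aggregate_sorted_input(sorted(partition, key=itemgetter(0)))

print("unsorted:", wrong)   # Bristol appears twice, with partial totals
print("sorted:  ", right)   # one total per city
```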



Exercise 16:
Data Parallel Aggregation
• Start with figure-05.

• Save As... to figure-05-parallel.

• Add a Partition by Key component.

• Change the output file to a multifile.

• Run the application and examine the results.

