AB INITIO (Day 3)
A Practical Introduction to Ab Initio Software: Part 3
Lookup files
Pop Quiz: Lookup Files and Joins
 When is it a bad idea to replace a Join with use of a lookup file in a reformatting component?
Lookup File Topics
 Lookup File Basics
 Lookup Multifiles
Lookup Files
 DML provides a facility for looking up records in a dataset based on a key:
record lookup(string file, expression [, expression ...] )
 The data is read from a file into memory.
 The GDE provides a Lookup File component as a special dataset with no ports.
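For example (a minimal sketch, assuming a Lookup File labeled "Last-Visits", keyed on id, with a date field dt, as in the transform function shown later), the expression

lookup("Last-Visits", in.id).dt

returns the dt field of the Last-Visits record whose key matches in.id.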
Using lookup instead of Sort/Merge
Using Last-Visits as a lookup file
Pop Quiz Answer: Lookup Files and Joins
 Q: When is it a bad idea to replace a Join with use of a lookup file in a reformatting component?
 A: Whenever you need to process every record in the flow that would become the lookup file. In addition to outer joins, this includes inner joins with flows connected to unused ports (anti-joins).
Configuring a Lookup File
1. Label used as name in lookup expression
2. Browse for pathname
3. Set record format
4. Set key


Using lookup in a Transform Function

Input 0 record format:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

Output record format:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

 Transform function:
out :: lookup_info(in) =
begin
  out.id :: in.id;
  out.city :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;
Other lookup functions
int lookup_count(string file, expression [, expression ...] )
 Returns the number of records in Lookup File file that match the given expression(s).
record lookup_next(string file )
 Used after lookup_count or after a successful call to lookup, lookup_next returns successive records from file that match the values of the expression arguments given in the prior lookup_count or lookup call.
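As an illustration, a minimal sketch that combines these calls. It assumes the same "Last-Visits" lookup file keyed on id with a date field dt, and an output record with illustrative fields n_visits, dt1, and dt2 (these names are assumptions, not part of the exercise data):

out :: first_two_visits(in) =
begin
  out.id :: in.id;
  // Number of Last-Visits records matching this id (0 if none)
  out.n_visits :: lookup_count("Last-Visits", in.id);
  // First matching record, with a default when there is no match
  out.dt1 :1: lookup("Last-Visits", in.id).dt;
  out.dt1 :2: "1900/01/01";
  // Next matching record for the same key, if any
  // (assumes the lookup rule above has already been evaluated)
  out.dt2 :1: lookup_next("Last-Visits").dt;
  out.dt2 :2: "1900/01/01";
end;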
Exercise 6: Using lookup in an Application
 Create an application that joins visits.dat and last-visits.dat using a Reformat component instead of a Join component. Use the transform function given above. The output record format will be merged-visits.dml.
 The files are in $AI_SERIAL; the record formats are in $AI_DML.
Using lookup_local()
 If lookup files grow large, it may be useful to partition them as multifiles on their key.
 The function lookup_local() is identical to lookup(), except that only the local partition will be examined. The input data must be partitioned on the same key that is used for the lookup.
 Usage:
record lookup_local(string file, expression [, expression ...] )
Example of lookup_local()
Multifile partitioned by field A
Input data partitioned by field A
lookup_local("Lookup File", in.A)
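Following the same call form, a minimal transform sketch (assuming a lookup multifile labeled "Last-Visits" partitioned on id, an input flow partitioned on the same id key, and a date field dt; the names are illustrative):

out :: add_last_visit(in) =
begin
  out.id :: in.id;
  // Only the local partition of "Last-Visits" is searched,
  // so the input flow must be partitioned on id as well.
  out.dt :1: lookup_local("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;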


Other local lookup functions
int lookup_local_count(string file, expression [, expression ...] )
record lookup_local_next(string file )
 These are identical to the non-local versions, except that they operate only on the local partition.
Parallelism
Forms of Parallelism
 Component parallelism

 Pipeline parallelism

 Data parallelism
Component Parallelism

Sorting Customers

Sorting Transactions
Component Parallelism
 Comes “for free” with graph programming.
 Limitation:
• Scales to the number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100

Processing Record: 99
Pipeline Parallelism
 Comes “for free” with graph programming.
 Limitations:
• Scales to the length of “branches” in a graph.
• Some operations, like sorting, do not pipeline.
Data Parallelism
 Scales with data.
 Requires data partitioning.
 Depending upon the application, different partitioning methods are available.
Data Parallelism
Partitions
Two Ways of Looking at Data Parallelism
Expanded View:
Global View:
Data Partitioning
Expanded View:
Global View:
Data Partitioning: The Global View
Degree of Parallelism
Fan-out Flow
Flows
 Four kinds of flows:
• straight
• fan-in
• fan-out
• all-to-all
Straight Flow
 Straight flows connect components
that have the same number of
partitions.
Illustration of Straight Flow
Fan-In
 Fan-in flows connect components
with a large number of partitions to
components with a smaller number
of partitions.
 The most common use of fan-in is to
connect flows to Departition
components.
Illustration of Fan-in flow
Fan-out
 Fan-out flows connect components with
a small number of partitions to
components with a larger number of
partitions.
 The most common use of fan-out is to
connect flows from partition
components.
 This flow pattern is used to divide data
into many segments for performance
improvements.
Illustration of Fan-out flow
All-to-All
 All-to-all flows typically connect
components with different numbers
of partitions.
 Data from any of the upstream
partitions is sent to any of the
downstream partitions.
Illustration of All-to-all flow
What is a Multifile?
 A multifile is essentially the “global view” of a set of ordinary files, each of which may be located anywhere the Ab Initio Co>Operating System is installed.
 Each partition of a multifile is an ordinary file.

What is a Multifile? (cont.)

 By using the global view and multifiles, you can avoid having to draw data parallelism explicitly.
 Note that the icon for a multifile has 3 platters instead of 2.
 Ab Initio utilities let you copy, rename, delete, etc., multifiles as easily as ordinary files.
Multifiles
 Multifiles reside in multidirectories.
 Multidirectories and multifiles are
identified using URL syntax with “mfile:”
as the protocol part:
• A Multidirectory
mfile:/users/training-07/test-mfs/
• A Multifile within a Multidirectory
mfile:/users/training-07/test-mfs/xx.dat
A Multidirectory across multiple platforms
A single name or abstraction for a multidirectory:
mfile://host1/u/jo/mfs
  Control partition:  //host1/u/jo/mfs/
  Data partition 0:   //host1/vol4/pA/
  Data partition 1:   //host2/vol3/pB/
  Data partition 2:   //host3/vol7/pC/
A Multifile across multiple platforms
A single name or abstraction for a multifile (a.dat):
mfile://host1/u/jo/mfs/a.dat
  Control partition:  //host1/u/jo/mfs/a.dat
  Data partition 0:   //host1/vol4/pA/a.dat
  Data partition 1:   //host2/vol3/pB/a.dat
  Data partition 2:   //host3/vol7/pC/a.dat
Multifile Commands
 m_mkfs
 m_ls
 m_expand
 m_dump
 m_cp
 m_mv
 m_touch
 m_rm
The m_mkfs Command
m_mkfs mfs-url dir-url1 dir-url2 ...
 Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...

$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2

$ m_mkfs my_mfs my_mfs_p0 my_mfs_p1 my_mfs_p2


The m_ls command
m_ls [options...] url [url...]
 Lists information about the files or directories specified by the URLs. The information presented is controlled by the options, which follow the form of ls.

$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
$ m_ls -l -partitions .
The m_expand command
m_expand [options...] path
 Displays the locations of the data partitions of a multifile or multidirectory.

$ m_expand mfile:mymfs
$ m_expand -native /path/to/the/mdir/bar
The m_dump command
m_dump metadata [path] [options ...]
 Displays the contents of files or multifiles, or selected records from them, similar to View Data in the GDE.

$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml -describe
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump help
$ m_dump -string 'string("\n")' bigex/acct.dat
The m_cp command
m_cp source dest
m_cp source […] directory
 Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp actually builds and runs a small graph, so it can copy from one machine to another where Ab Initio is installed.

$ m_cp mfs1 mfs2
The m_mv command
m_mv oldpath newpath
 Moves a single file, multifile, directory, or multidirectory from one path to another on the same host by renaming; it does not actually move data.

$ m_mv mfs1 mfs2
The m_touch command
m_touch path
 Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.

$ m_touch foo.dat
The m_rm command
m_rm [options] path [...]
 Removes a file or multifile and all its associated data partitions.

$ m_rm foo
$ m_rm -f -r mfile:dir1
Exercise: Multifile Commands
 Create a four-partition multifile system named mfs_4way.
 Create two directories within mfs_4way named dir1 and dir2.
 Use m_ls to list the contents of mfs_4way.
 Create a dummy multifile using m_touch.
 Use ls to examine the contents of mfs_4way.
One possible command sequence is sketched below.
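A minimal command sketch (the partition directory names are placeholders, and an m_mkdir utility is assumed to be available alongside the commands listed above; adjust paths to your environment):

$ m_mkfs mfs_4way mfs_4way_p0 mfs_4way_p1 mfs_4way_p2 mfs_4way_p3
$ m_mkdir mfs_4way/dir1        # assumes an m_mkdir utility exists
$ m_mkdir mfs_4way/dir2
$ m_ls -l mfs_4way
$ m_touch mfs_4way/dummy.dat
$ m_ls -l mfs_4way
$ ls mfs_4way                  # serial view of the control directory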
Partitioning
Components
Component: Partition by Round-robin
 Reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in “round-robin” fashion, with ‘block-size’ records going to a partition before moving on to the next.
Round-robin Partitioning
Input stream (to be partitioned across Partition 0, Partition 1, Partition 2):
A B C D E F C D B G B A A D F E A

Round-robin Partitioning (result, one record per partition in turn)
Partition 0: A D C G A E
Partition 1: B E D B D A
Partition 2: C F B A F
A Data Parallel Application: The Expanded View
A Data Parallel Application: The Global View
Degree of Parallelism (Abstract)
Fan-out Flow    Multifile


Exercise: Partition by Round-robin
 Read data from a serial file.
• id | name
 Partition it 4 ways using Partition by Round-robin (PRR) with block size 2.
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result (one way to do this is sketched below).
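One way to check a multifile result, as a sketch (the record format file id_name.dml and the output name out_4way.dat are placeholders for whatever your graph actually uses):

$ m_dump id_name.dml out_4way.dat
$ m_dump id_name.dml out_4way.dat -start 1 -end 8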
Component: Partition by Key
 Reads records from its input port and writes them to the flow partitions connected to its output port. A hash code computed using the key determines which partition a record will be written to, meaning that records with the same key value will go to the same partition.
Key-Dependent Data Parallelism
 Aggregation processes records in groups defined by key values.
 Parallel aggregation requires partitioning based on key value.
 Parallel aggregation takes three steps (using the same key in each step):
• Partition by key.
• Sort by key.
• Aggregate by key.
Partitioning by Key
Input stream (to be partitioned across Partition 0, Partition 1, Partition 2):
A B C D E F C D B G B A A D F E A G

Partitioning by Key (result)
Partition 0: A C C A A A
Partition 1: B E B G B E G
Partition 2: D F D D F
Partition by Key and Sort
Partition by Key + Sort = Parallel Grouping

After Partition by Key:
Partition 0: A C C A A A
Partition 1: B E B G B E G
Partition 2: D F D D F

After sorting each partition on the key:
Partition 0: A A A A C C
Partition 1: B B B E E G G
Partition 2: D D D F F
Exercise: Partition by Key
 Read data from a serial file.
• id | name
 Partition it 4 ways using Partition by Key (PBK) with id as the key.
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result.
Part 3
 Replace Partition by Key with Partition by Key and Sort.
 Save in a 4-way MFS file and check the result.
Partition by Expression
 Partition by Expression distributes data records to its output flow partitions according to a specified DML expression.
 The expression must evaluate to a number between 0 and the number of flows connected to the out port minus 1.
 Partition by Expression routes the record to the flow number returned by this expression. Flow numbers start at 0.
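For example, a sketch of an expression that routes records to four flows based on a numeric id field (the thresholds are illustrative and mirror the exercise below; the field is referenced here simply as id, though the reference form may differ depending on how the component is configured):

if (id < 5) 0
else if (id < 10) 1
else if (id < 15) 2
else 3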
Exercise: Partition by Expression
 Read data from a serial file.
• id | name
 Partition it 4 ways using the following rules:
• Flow 1: id < 5
• Flow 2: id from 5 to 10
• Flow 3: id from 10 to 15
• Flow 4: everything else
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result.

Departitioning
Components
Departitioning
Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning. Each departition component combines flows in a different manner.
Departitioning
Expanded View:
Global View:
Departitioning
 For the various departitioning components, consider:
• Key-based?
• Result ordering?
• Effect on parallelism?
• Uses?
Fan-in Flow
Departitioning: Performance

Input buffer Output buffer

Free space

Used space
Gather
Round-robin partitioned and scored:
Partition 0: 42John 02116 30A, 45Sue 02241 92A, 48Mary 02116 38A
Partition 1: 43Mark 02114 9C, 46Rick 02116 23B, 49Jane 02241 2C
Partition 2: 44Bob 02116 8C, 47Bill 02114 14B

Scored dataset in random order, following gather:


43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Gather: Performance
Reading flows as data is available
 Note that the Gather will not affect upstream processing.
Gather
 Not key-based.
 Result ordering is unpredictable.
 Most useful method for efficient collection of data from multiple partitions and for repartitioning.
 Used most frequently.
Merge
Round-robin partitioned and sorted by amount:
Partition 0: 42John 02116 30, 48Mary 02116 38, 45Sue 02241 92
Partition 1: 49Jane 02241 2, 43Mark 02114 9, 46Rick 02116 23
Partition 2: 44Bob 02116 8, 47Bill 02114 14

Sorted data, following merge on amount:


49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Merge
 Key-based.
 Result ordering is sorted if each input is sorted.
 Useful for creating ordered data flows.
 Besides the Gather, the Merge is the other departitioner of choice.
Concatenation
Globally ordered, partitioned data:
Partition 0: 49Jane 02241 2, 44Bob 02116 8, 43Mark 02114 9
Partition 1: 47Bill 02114 14, 46Rick 02116 23, 45Sue 02241 92
Partition 2: 42John 02116 30, 48Mary 02116 38

Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
 Not key-based.
 Result ordering is by partition.
 Useful for:
• appending headers and trailers
• creating serial flow from partitioned
data
 Used very infrequently
Interleave
 Interleave combines blocks of data records from multiple flow partitions in round-robin fashion.
Interleave
 Not key-based.
 Result ordering is not fixed.
 Used very infrequently.
Exercise: Departition
 Read data from 2 serial input files.
• id | name
 Combine the data using Gather, Merge, Concatenation, and Interleave.
 Save the output in a serial output file.
 Check the result.
Part 2
 Then replace the serial input files with the MFS output file created in the Partitioning exercise.
 Save the output in a serial output file.
 Check the result.


Repartitioning
 Used to redistribute records across partitions.
 Records are almost always redistributed in a key-based manner, but they don’t have to be.
 Records can be redistributed to fewer partitions, the same number of partitions, or more partitions.
The “Wrong” Way, but technically a correct solution
This serializes the computation unnecessarily!


Repartitioning -- The Right Way
An Expanded View:
Partition the data in Partition 1
to all downstream departitioners
Repartitioning -- The Right Way
An Expanded View:
Additionally partition the data in Partition 2
to all downstream departitioners
Repartitioning - The Global View
All-to-All Flow
Note: The departition component is almost always a Gather.
Layout
 Layout determines the location of a
resource.
 A layout is either serial or parallel.
 A serial layout specifies a single node
and a single directory on that node.
 A parallel layout specifies multiple nodes
with multiple directories across the
nodes. It is permissible for the same
node to be repeated.
Controlling Layout
• Propagate (default)
• Bind layout to that of another component
• Use layout of URL
• Construct layout manually
• Run on these hosts
Layout Determines What Runs Where

Node W Node X Node Y Node Z


Layout Determines What Runs Where
Serial: file on Node W
Parallel: 3-way multifile on Nodes X, Y, Z
Layout Determines What Runs Where

Node W Node X Node Y Node Z


Phases

Phase 0 Phase 1
Phases
 Breaking an application into phases
limits the contention for:
• Main memory.
• Processor(s).

 Breaking an application into phases


costs:
• Disk space.
Checkpoints
 Since data is staged to disk between phases, one can arrange to use that data to “start from the middle” should something go wrong.
 Any phase break can be a checkpoint.
The Phase Toolbar
• Toggle between Phase (P) and Checkpoint After Phase (C)
• Increment Phase Number
• Decrement Phase Number
• View Phase
END
(Day – 3)
