AB INITIO (Day 3)
A Practical Introduction to Ab Initio Software: Part 3
Lookup files
Pop Quiz: Lookup Files and Joins
 When is it a bad idea to replace a Join with use of a lookup file in a reformatting component?
Lookup File Topics
 Lookup File Basics
 Lookup Multifiles
Lookup Files
 DML provides a facility for looking up records in a dataset based on a key:
record lookup(string file, expression [, expression ...] )
 The data is read from a file into memory.
 The GDE provides a Lookup File component as a special dataset with no ports.
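For example (a minimal sketch, assuming a Lookup File labeled "Last-Visits", keyed on id, with a date field dt, as in the transform function shown later), the expression

lookup("Last-Visits", in.id).dt

returns the dt field of the Last-Visits record whose key matches in.id.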
Using lookup instead of Sort/Merge
Using Last-Visits as a lookup file
Pop Quiz Answer: Lookup Files and Joins
 Q: When is it a bad idea to replace a Join with use of a lookup file in a reformatting component?
 A: Whenever you need to process every record in the flow that would become the lookup file. In addition to outer joins, this includes inner joins with flows connected to unused ports (anti-joins).
Configuring a Lookup File
1. Label used as name in lookup expression
2. Browse for pathname
3. Set record format
4. Set key


Using lookup in a Transform Function

Input 0 record format:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

Output record format:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

 Transform function:
out :: lookup_info(in) =
begin
  out.id :: in.id;
  out.city :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;
Other lookup functions
int lookup_count(string file, expression [, expression ...] )
 Returns the number of records in Lookup File file that match the given expression(s).
record lookup_next(string file )
 Used after lookup_count or after a successful call to lookup, lookup_next returns successive records from file that match the values of the expression arguments given in the prior lookup_count or lookup call.
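As an illustration, a minimal sketch that combines these calls. It assumes the same "Last-Visits" lookup file keyed on id with a date field dt, and an output record with illustrative fields n_visits, dt1, and dt2 (these names are assumptions, not part of the exercise data):

out :: first_two_visits(in) =
begin
  out.id :: in.id;
  // Number of Last-Visits records matching this id (0 if none)
  out.n_visits :: lookup_count("Last-Visits", in.id);
  // First matching record, with a default when there is no match
  out.dt1 :1: lookup("Last-Visits", in.id).dt;
  out.dt1 :2: "1900/01/01";
  // Next matching record for the same key, if any
  // (assumes the lookup rule above has already been evaluated)
  out.dt2 :1: lookup_next("Last-Visits").dt;
  out.dt2 :2: "1900/01/01";
end;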
Exercise 6: Using lookup in an Application
 Create an application that joins visits.dat and last-visits.dat using a Reformat component instead of a Join component. Use the transform function given above. The output record format will be merged-visits.dml.
 The files are in $AI_SERIAL; the record formats are in $AI_DML.
Using lookup_local()
 If lookup files grow large, it may be useful to partition them as multifiles on their key.
 The function lookup_local() is identical to lookup(), except that only the local partition will be examined. The input data must be partitioned on the same key that is used for the lookup.
 Usage:
record lookup_local(string file, expression [, expression ...] )
Example of lookup_local()
Multifile partitioned by field A
Input data partitioned by field A
lookup_local("Lookup File", in.A)
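Following the same call form, a minimal transform sketch (assuming a lookup multifile labeled "Last-Visits" partitioned on id, an input flow partitioned on the same id key, and a date field dt; the names are illustrative):

out :: add_last_visit(in) =
begin
  out.id :: in.id;
  // Only the local partition of "Last-Visits" is searched,
  // so the input flow must be partitioned on id as well.
  out.dt :1: lookup_local("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;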


Other local lookup functions
int lookup_local_count(string file, expression [, expression ...] )
record lookup_local_next(string file )
 These are identical to the non-local versions, except that they operate only on the local partition.
Parallelism
Forms of Parallelism
 Component parallelism

 Pipeline parallelism

 Data parallelism
Component Parallelism

Sorting Customers

Sorting Transactions
Component Parallelism
 Comes “for free” with graph programming.
 Limitation:
• Scales to the number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100

Processing Record: 99
Pipeline Parallelism
 Comes “for free” with graph programming.
 Limitations:
• Scales to the length of “branches” in a graph.
• Some operations, like sorting, do not pipeline.
Data Parallelism
 Scales with data.
 Requires data partitioning.
 Depending upon the application, different partitioning methods are available.
Data Parallelism
Partitions
Two Ways of Looking at Data Parallelism
Expanded View:
Global View:
Data Partitioning
Expanded View:
Global View:
Data Partitioning: The Global View
Degree of Parallelism
Fan-out Flow
Flows
 Four kinds of flows:
• straight
• fan-in
• fan-out
• all-to-all
Straight Flow
 Straight flows connect components
that have the same number of
partitions.
Illustration of Straight Flow
Fan-In
 Fan-in flows connect components
with a large number of partitions to
components with a smaller number
of partitions.
 The most common use of fan-in is to
connect flows to Departition
components.
Illustration of Fan-in flow
Fan-out
 Fan-out flows connect components with
a small number of partitions to
components with a larger number of
partitions.
 The most common use of fan-out is to
connect flows from partition
components.
 This flow pattern is used to divide data
into many segments for performance
improvements.
Illustration of Fan-out flow
All-to-All
 All-to-all flows typically connect
components with different numbers
of partitions.
 Data from any of the upstream
partitions is sent to any of the
downstream partitions.
Illustration of All-to-all flow
What is a Multifile?
 A multifile is essentially the “global view” of a set of ordinary files, each of which may be located anywhere the Ab Initio Co>Operating System is installed.
 Each partition of a multifile is an ordinary file.

What is a Multifile? (cont.)

 By using the global view and multifiles, you can avoid having to draw data parallelism explicitly.
 Note that the icon for a multifile has 3 platters instead of 2.
 Ab Initio utilities let you copy, rename, delete, etc., multifiles as easily as ordinary files.
Multifiles
 Multifiles reside in multidirectories.
 Multidirectories and multifiles are
identified using URL syntax with “mfile:”
as the protocol part:
• A Multidirectory
mfile:/users/training-07/test-mfs/
• A Multifile within a Multidirectory
mfile:/users/training-07/test-mfs/xx.dat
A Multidirectory across multiple platforms
A single name or abstraction for a multidirectory:
mfile://host1/u/jo/mfs
  Control partition:  //host1/u/jo/mfs/
  Data partition 0:   //host1/vol4/pA/
  Data partition 1:   //host2/vol3/pB/
  Data partition 2:   //host3/vol7/pC/
A Multifile across multiple platforms
A single name or abstraction for a multifile (a.dat):
mfile://host1/u/jo/mfs/a.dat
  Control partition:  //host1/u/jo/mfs/a.dat
  Data partition 0:   //host1/vol4/pA/a.dat
  Data partition 1:   //host2/vol3/pB/a.dat
  Data partition 2:   //host3/vol7/pC/a.dat
Multifile Commands
 m_mkfs
 m_ls
 m_expand
 m_dump
 m_cp
 m_mv
 m_touch
 m_rm
The m_mkfs Command
m_mkfs mfs-url dir-url1 dir-url2 ...
 Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...

$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2

$ m_mkfs my_mfs my_mfs_p0 my_mfs_p1 my_mfs_p2


The m_ls command
m_ls [options...] url [url...]
 Lists information about the files or directories specified by the URLs. The information presented is controlled by the options, which follow the form of ls.

$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
$ m_ls -l -partitions .
The m_expand command
m_expand [options...] path
 Displays the locations of the data partitions of a multifile or multidirectory.

$ m_expand mfile:mymfs
$ m_expand -native /path/to/the/mdir/bar
The m_dump command
m_dump metadata [path] [options ...]
 Displays the contents of files or multifiles, or selected records from them, similar to View Data in the GDE.

$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml -describe
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump help
$ m_dump -string 'string("\n")' bigex/acct.dat
The m_cp command
m_cp source dest
m_cp source […] directory
 Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp actually builds and runs a small graph, so it can copy from one machine to another where Ab Initio is installed.

$ m_cp mfs1 mfs2
The m_mv command
m_mv oldpath newpath
 Moves a single file, multifile, directory, or multidirectory from one path to another on the same host by renaming; it does not actually move data.

$ m_mv mfs1 mfs2
The m_touch command
m_touch path
 Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.

$ m_touch foo.dat
The m_rm command
m_rm [options] path [...]
 Removes a file or multifile and all its associated data partitions.

$ m_rm foo
$ m_rm -f -r mfile:dir1
Exercise: Multifile Commands
 Create a four-partition multifile system named mfs_4way.
 Create two directories within mfs_4way named dir1 and dir2.
 Use m_ls to list the contents of mfs_4way.
 Create a dummy multifile using m_touch.
 Use ls to examine the contents of mfs_4way.
One possible command sequence is sketched below.
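A minimal command sketch (the partition directory names are placeholders, and an m_mkdir utility is assumed to be available alongside the commands listed above; adjust paths to your environment):

$ m_mkfs mfs_4way mfs_4way_p0 mfs_4way_p1 mfs_4way_p2 mfs_4way_p3
$ m_mkdir mfs_4way/dir1        # assumes an m_mkdir utility exists
$ m_mkdir mfs_4way/dir2
$ m_ls -l mfs_4way
$ m_touch mfs_4way/dummy.dat
$ m_ls -l mfs_4way
$ ls mfs_4way                  # serial view of the control directory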
Partitioning
Components
Component: Partition by Round-robin
 Reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in “round-robin” fashion, with ‘block-size’ records going to a partition before moving on to the next.
Round-robin Partitioning
Input stream (to be partitioned across Partition 0, Partition 1, Partition 2):
A B C D E F C D B G B A A D F E A

Round-robin Partitioning (result, one record per partition in turn)
Partition 0: A D C G A E
Partition 1: B E D B D A
Partition 2: C F B A F
A Data Parallel Application: The Expanded View
A Data Parallel Application: The Global View
Degree of Parallelism (Abstract)
Fan-out Flow    Multifile


Exercise: Partition by Round-robin
 Read data from a serial file.
• id | name
 Partition it 4 ways using Partition by Round-robin (PRR) with block size 2.
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result (one way to do this is sketched below).
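One way to check a multifile result, as a sketch (the record format file id_name.dml and the output name out_4way.dat are placeholders for whatever your graph actually uses):

$ m_dump id_name.dml out_4way.dat
$ m_dump id_name.dml out_4way.dat -start 1 -end 8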
Component: Partition by Key
 Reads records from its input port and writes them to the flow partitions connected to its output port. A hash code computed using the key determines which partition a record will be written to, meaning that records with the same key value will go to the same partition.
Key-Dependent Data Parallelism
 Aggregation processes records in groups defined by key values.
 Parallel aggregation requires partitioning based on key value.
 Parallel aggregation takes three steps (using the same key in each step):
• Partition by key.
• Sort by key.
• Aggregate by key.
Partitioning by Key
Input stream (to be partitioned across Partition 0, Partition 1, Partition 2):
A B C D E F C D B G B A A D F E A G

Partitioning by Key (result)
Partition 0: A C C A A A
Partition 1: B E B G B E G
Partition 2: D F D D F
Partition by Key and Sort
Partition by Key + Sort = Parallel Grouping

After Partition by Key:
Partition 0: A C C A A A
Partition 1: B E B G B E G
Partition 2: D F D D F

After sorting each partition on the key:
Partition 0: A A A A C C
Partition 1: B B B E E G G
Partition 2: D D D F F
Exercise: Partition by Key
 Read data from a serial file.
• id | name
 Partition it 4 ways using Partition by Key (PBK) with id as the key.
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result.
Part 3
 Replace Partition by Key with Partition by Key and Sort.
 Save in a 4-way MFS file and check the result.
Partition by Expression
 Partition by Expression distributes data records to its output flow partitions according to a specified DML expression.
 The expression must evaluate to a number between 0 and the number of flows connected to the out port minus 1.
 Partition by Expression routes the record to the flow number returned by this expression. Flow numbers start at 0.
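For example, a sketch of an expression that routes records to four flows based on a numeric id field (the thresholds are illustrative and mirror the exercise below; the field is referenced here simply as id, though the reference form may differ depending on how the component is configured):

if (id < 5) 0
else if (id < 10) 1
else if (id < 15) 2
else 3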
Exercise: Partition by Expression
 Read data from a serial file.
• id | name
 Partition it 4 ways using the following rules:
• Flow 1: id < 5
• Flow 2: id from 5 to 10
• Flow 3: id from 10 to 15
• Flow 4: everything else
Part 1
 Save the output in four different files.
 Check the result.
Part 2
 Then replace the four output files with a 4-way MFS file.
 Check the result.

Departitioning
Components
Departitioning
Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning. Each departition component combines flows in a different manner.
Departitioning
Expanded View:
Global View:
Departitioning
 For the various departitioning components, consider:
• Key-based?
• Result ordering?
• Effect on parallelism?
• Uses?
Fan-in Flow
Departitioning: Performance

Input buffer Output buffer

Free space

Used space
Gather
Round-robin partitioned and scored:
Partition 0: 42John 02116 30A, 45Sue 02241 92A, 48Mary 02116 38A
Partition 1: 43Mark 02114 9C, 46Rick 02116 23B, 49Jane 02241 2C
Partition 2: 44Bob 02116 8C, 47Bill 02114 14B

Scored dataset in random order, following gather:


43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Gather: Performance
Reading flows as data is available
 Note that the Gather will not affect upstream processing.
Gather
 Not key-based.
 Result ordering is unpredictable.
 Most useful method for efficient collection of data from multiple partitions and for repartitioning.
 Used most frequently.
Merge
Round-robin partitioned and sorted by amount:
Partition 0: 42John 02116 30, 48Mary 02116 38, 45Sue 02241 92
Partition 1: 49Jane 02241 2, 43Mark 02114 9, 46Rick 02116 23
Partition 2: 44Bob 02116 8, 47Bill 02114 14

Sorted data, following merge on amount:


49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Merge
 Key-based.
 Result ordering is sorted if each input is sorted.
 Useful for creating ordered data flows.
 Besides the Gather, the Merge is the other departitioner of choice.
Concatenation
Globally ordered, partitioned data:
Partition 0: 49Jane 02241 2, 44Bob 02116 8, 43Mark 02114 9
Partition 1: 47Bill 02114 14, 46Rick 02116 23, 45Sue 02241 92
Partition 2: 42John 02116 30, 48Mary 02116 38

Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
 Not key-based.
 Result ordering is by partition.
 Useful for:
• appending headers and trailers
• creating serial flow from partitioned
data
 Used very infrequently
Interleave
 Interleave combines blocks of data records from multiple flow partitions in round-robin fashion.
Interleave
 Not key-based.
 Result ordering is not fixed.
 Used very infrequently.
Exercise: Departition
 Read data from 2 serial input files.
• id | name
 Combine the data using Gather, Merge, Concatenation, and Interleave.
 Save the output in a serial output file.
 Check the result.
Part 2
 Then replace the serial input files with the MFS output file created in the Partitioning exercise.
 Save the output in a serial output file.
 Check the result.


Repartitioning
 Used to redistribute records across partitions.
 Records are almost always redistributed in a key-based manner, but they don’t have to be.
 Records can be redistributed to fewer partitions, the same number of partitions, or more partitions.
The “Wrong” Way, but technically a correct solution
This serializes the computation unnecessarily!


Repartitioning -- The Right Way
An Expanded View:
Partition the data in Partition 1
to all downstream departitioners
Repartitioning -- The Right Way
An Expanded View:
Additionally partition the data in Partition 2
to all downstream departitioners
Repartitioning - The Global View
All-to-All Flow
Note: The departition component is almost always a Gather.
Layout
 Layout determines the location of a
resource.
 A layout is either serial or parallel.
 A serial layout specifies a single node
and a single directory on that node.
 A parallel layout specifies multiple nodes
with multiple directories across the
nodes. It is permissible for the same
node to be repeated.
Controlling Layout
• Propagate (default)
• Bind layout to that of another component
• Use layout of URL
• Construct layout manually
• Run on these hosts
Layout Determines What Runs Where

Node W Node X Node Y Node Z


Layout Determines What Runs Where
Serial: file on Node W
Parallel: 3-way multifile on Nodes X, Y, Z
Layout Determines What Runs Where

Node W Node X Node Y Node Z


Phases

Phase 0 Phase 1
Phases
 Breaking an application into phases
limits the contention for:
• Main memory.
• Processor(s).

 Breaking an application into phases


costs:
• Disk space.
Checkpoints
 Since data is staged to disk between phases, one can arrange to use that data to “start from the middle” should something go wrong.
 Any phase break can be a checkpoint.
The Phase Toolbar
• Toggle between Phase (P) and Checkpoint After Phase (C)
• Increment Phase Number
• Decrement Phase Number
• View Phase
END
(Day – 3)
