Day 3
( Day - 3 )
A Practical Introduction to Ab Initio Software: Part 3
Lookup files
Pop Quiz: Lookup Files and
Joins
When is it a bad idea to replace a
Join with use of a lookup file in a
reformatting component?
Lookup File Topics
Using Last-Visits
as a lookup file
Pop Quiz Answer: Lookup Files and Joins
Q: When is it a bad idea to replace a Join
with use of a lookup file in a reformatting
component?
A: Whenever you need to process every
record in the flow that would become the
lookup file. In addition to outer joins,
this includes inner joins with flows
connected to the unused ports (anti-joins).
Configuring a Lookup File
1. Label used as name in 3. Set record format
lookup expression
Transform function:
out :: lookup_info(in) =
begin
  out.id :: in.id;
  out.city :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;
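The two prioritized assignments to out.dt (try the lookup first, fall back to a fixed date) can be mimicked outside Ab Initio. The Python sketch below is only an analogy: the `last_visits` dictionary and the sample records are hypothetical stand-ins for the Last-Visits lookup file, not Ab Initio APIs.

```python
# Hypothetical stand-in for the "Last-Visits" lookup file, keyed by id.
last_visits = {
    1: {"id": 1, "dt": "2001/03/15"},
    3: {"id": 3, "dt": "2002/11/02"},
}

DEFAULT_DT = "1900/01/01"  # same fallback value as priority 2 in the transform


def lookup_info(rec):
    """Copy id, city, amount; take dt from the lookup, else use the default."""
    match = last_visits.get(rec["id"])  # priority 1: lookup by id
    dt = match["dt"] if match is not None else DEFAULT_DT  # priority 2
    return {"id": rec["id"], "city": rec["city"],
            "amount": rec["amount"], "dt": dt}


records = [
    {"id": 1, "city": "Boston", "amount": 10},
    {"id": 2, "city": "Austin", "amount": 20},
]
results = [lookup_info(r) for r in records]
for row in results:
    print(row)
```

The record with id 2 has no match in the stand-in lookup, so it receives the default date, which is the role the second priority plays in the transform.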
Other lookup functions
int lookup_count(string file, expression [, expression ...])
Returns the number of records in the lookup file named by file that
match the given expression(s).
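As a rough analogy for counting matches, the Python sketch below indexes a small in-memory list by key and returns the number of records per key. The `visits` data and the `lookup_count_sketch` name are hypothetical and only illustrate the idea; they are not part of Ab Initio.

```python
from collections import defaultdict

# Hypothetical lookup data: several visit records may share one id.
visits = [
    {"id": 1, "dt": "2001/03/15"},
    {"id": 1, "dt": "2001/04/20"},
    {"id": 3, "dt": "2002/11/02"},
]

# Index the records by key once so each count is a dictionary access.
index = defaultdict(list)
for rec in visits:
    index[rec["id"]].append(rec)


def lookup_count_sketch(key):
    """Return how many lookup records match the key (0 if none)."""
    return len(index.get(key, []))


for key in (1, 2, 3):
    print(key, lookup_count_sketch(key))
```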
Pipeline parallelism
Data parallelism
Component Parallelism
Sorting Customers
Sorting Transactions
Component Parallelism
Comes “for free” with graph
programming.
Limitation:
• Scales to the number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100
Processing Record: 99
Pipeline Parallelism
Comes “for free” with graph
programming.
Limitations:
• Scales to length of “branches” in a graph.
• Some operations, like sorting, do not pipeline.
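The difference between a stage that pipelines and one that does not can be seen in a small Python sketch. Generators here only interleave stages in a single thread, so this is an analogy for record-at-a-time flow rather than true concurrency; all stage names are made up for the illustration.

```python
events = []  # records the order in which stages touch each record


def read(n):
    for i in range(n):
        events.append(f"read {i}")
        yield i


def transform(stream):
    for x in stream:
        events.append(f"transform {x}")
        yield x * 10


def sort_stage(stream):
    # Sorting must consume the entire input before emitting anything.
    yield from sorted(stream, reverse=True)


def write(stream):
    out = []
    for x in stream:
        events.append(f"write {x}")
        out.append(x)
    return out


# Pipelined: each record is read, transformed and written before the next.
events.clear()
piped = write(transform(read(3)))
piped_events = list(events)

# With a sort in the middle, every read finishes before the first write.
events.clear()
sorted_out = write(sort_stage(transform(read(3))))
sorted_events = list(events)

print(piped_events)
print(sorted_events)
```

In the first run the events alternate read, transform, write per record; in the second run all reads and transforms appear before any write, which is why a sort interrupts the pipeline.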
Data Parallelism
Scales with data.
Partitions
Two Ways of Looking at Data Parallelism
Expanded View:
Global View:
Data Partitioning
Expanded View:
Global View:
Data Partitioning: The Global View
Degree of Parallelism
Fan-out Flow
Flows
Four kinds of flows:
• straight
• fan-in
• fan-out
• all-to-all
Straight Flow
Straight flows connect components
that have the same number of
partitions.
Illustration of Straight Flow
Fan-In
Fan-in flows connect components
with a large number of partitions to
components with a smaller number
of partitions.
The most common use of fan-in is to
connect flows to Departition
components.
Illustration of Fan-in flow
Fan-out
Fan-out flows connect components with
a small number of partitions to
components with a larger number of
partitions.
The most common use of fan-out is to
connect flows from Partition
components.
This flow pattern is used to divide data
into many segments for performance
improvements.
Illustration of Fan-out flow
All-to-All
All-to-all flows typically connect
components with different numbers
of partitions.
Data from any of the upstream
partitions is sent to any of the
downstream partitions.
Illustration of All-to-all flow
What is a Multifile?
mfile://host1/u/jo/mfs
mfile://host1/u/jo/mfs/a.dat
$ m_mkfs //host1/u/jo/mfs3 \
//host1/vol4/dat/mfs3_p0 \
//host2/vol3/dat/mfs3_p1 \
//host3/vol7/dat/mfs3_p2
$ m_expand mfile:mymfs
$ m_expand -native /path/to/the/mdir/bar
The m_dump command
m_dump metadata [path] [options ...]
Displays contents of files, multifiles, or selected
records from files or multifiles, similar to View
Data in the GDE.
$ m_touch foo.dat
The m_rm command
m_rm [options] path [...]
$ m_rm foo
$ m_rm -f -r mfile:dir1
Exercise: Multifile Commands
Create a four-partition multifile system named mfs_4way.
Create two directories within mfs_4way named dir1 and dir2.
Use m_ls to list the contents of mfs_4way.
Create a dummy multifile using m_touch.
Use ls to examine the contents of mfs_4way.
Partitioning Components
Component: Partition by Round-robin
Reads records from its input port
and writes them to the flow
partitions connected to its output
port. Records are written to
partitions in “round-robin” fashion,
with ‘block-size’ records going to a
partition before moving on to the
next.
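The dealing rule described above can be sketched in a few lines of Python. The function name `partition_round_robin` and its parameters are illustrative only; the 17 sample records are the ones used in this section's round-robin illustration.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records to partitions in turn, block_size records at a time."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        target = (i // block_size) % n_partitions
        partitions[target].append(rec)
    return partitions


data = list("ABCDEFCDBGBAADFEA")  # 17 sample records
parts = partition_round_robin(data, 3)
for n, p in enumerate(parts):
    print(f"Partition {n}: {' '.join(p)}")
```

With a block size of 1 each partition receives every third record. A larger block size sends that many consecutive records to one partition before moving on.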
Round-robin Partitioning
Input records, in arrival order:
A B C D E F C D B G B A A D F E A
Result:
Partition 0  Partition 1  Partition 2
A            B            C
D            E            F
C            D            B
G            B            A
A            D            F
E            A
A Data Parallel Application: The Expanded View
A Data Parallel Application: The Global View
Degree of Parallelism
(Abstract)
Part 1
Save the output in four different files.
Part 2
Then replace the four output files with a 4-way MFS file.
Check the result.
Component: Partition by Key
Reads records from its input port and
writes them to the flow partitions
connected to its output port. A hash
code computed from the key
determines which partition a record
is written to, meaning that
records with the same key value
go to the same partition.
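A minimal Python sketch of hash-based routing follows. The hash used here (`zlib.crc32` on the key's string form) is an arbitrary, repeatable choice made for the illustration; it is not claimed to be the hash Ab Initio computes, and the function and field names are hypothetical.

```python
import zlib


def partition_by_key(records, key, n_partitions):
    """Route each record to the partition chosen by a hash of its key."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        # crc32 is an arbitrary, repeatable hash chosen for this sketch only.
        code = zlib.crc32(str(rec[key]).encode("utf-8"))
        partitions[code % n_partitions].append(rec)
    return partitions


rows = [{"id": i % 5, "amount": i} for i in range(20)]
parts = partition_by_key(rows, "id", 4)
for n, p in enumerate(parts):
    print(n, sorted({r["id"] for r in p}))
```

Whatever the hash, the property that matters is that each id shows up in exactly one partition, while partition sizes may be uneven.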
Key-Dependent Data
Parallelism
Aggregation processes records in groups
defined by key values.
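Why aggregation depends on key-based partitioning can be shown with a small Python sketch. The `rollup` helper, the sample records, and the toy partition rule (character code of the key modulo 2) are all assumptions made for the illustration.

```python
from collections import Counter


def rollup(records):
    """Sum amounts per key within one partition."""
    totals = Counter()
    for key, amount in records:
        totals[key] += amount
    return dict(totals)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]

# Round-robin over 2 partitions splits the records of one key apart.
rr_parts = [records[0::2], records[1::2]]
rr_results = [rollup(p) for p in rr_parts]

# Key-based partitioning keeps every record of a key together.
# The partition rule here (ord of the key, modulo 2) is illustrative only.
key_parts = [[], []]
for rec in records:
    key_parts[ord(rec[0]) % 2].append(rec)
key_results = [rollup(p) for p in key_parts]

print(rr_results)
print(key_results)
```

After round-robin partitioning, key "a" has a partial total in each partition, so the per-partition results would still need combining. After key-based partitioning each partition holds complete groups and its totals are final.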
Part 1
Save the output in four different files.
Part 2
Then replace the four output files with a 4-way MFS file.
Part 3
Replace Partition by Round-robin (PRB) with Partition by Key and Sort.
• Flow1 : id<5
• Flow2 : id 5 to 10
• Flow3 : id 10 to 15
• Flow4 : all above
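Routing by id range, as in the four flows listed above, can be sketched in Python. The listed ranges share their end points (10 appears in both Flow2 and Flow3), so this sketch assumes half-open ranges: ids below 5, 5 to 9, 10 to 14, and 15 or more. That choice, the `flow_for` name, and the sample ids are assumptions, not part of the exercise.

```python
from bisect import bisect_right

# Assumed boundaries: flow 1 takes id < 5, flow 2 takes 5..9,
# flow 3 takes 10..14, flow 4 takes everything from 15 up.
BOUNDARIES = [5, 10, 15]


def flow_for(record_id):
    """Return the 1-based flow number for an id under the assumed ranges."""
    return bisect_right(BOUNDARIES, record_id) + 1


flows = {n: [] for n in range(1, 5)}
for record_id in [1, 4, 5, 9, 10, 14, 15, 42]:
    flows[flow_for(record_id)].append(record_id)

for n, ids in flows.items():
    print(f"Flow{n}: {ids}")
```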
Part 1
Save the output in four different files.
Part 2
Then replace the four output files with a 4-way MFS file.
Departitioning
For the various departitioning
components:
• Key-based?
• Result ordering?
• Effect on parallelism?
• Uses?
Fan-in Flow
Departitioning: Performance
Free space
Used space
Gather
Round-robin partitioned and scored:
42 John 02116 30 A    43 Mark 02114 9 C     44 Bob 02116 8 C
45 Sue 02241 92 A     46 Rick 02116 23 B    47 Bill 02114 14 B
48 Mary 02116 38 A    49 Jane 02241 2 C
Reading flows as
data is available
Sorted data:
49 Jane 02241 2
44 Bob 02116 8
43 Mark 02114 9
47 Bill 02114 14
46 Rick 02116 23
42 John 02116 30
48 Mary 02116 38
45 Sue 02241 92
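One way to obtain the sorted listing from partitioned data is to sort each partition on the score and then merge the sorted flows on that same key. The Python sketch below assumes that the three columns of the round-robin listing are three partitions and that the last number in each record is the score; both are readings of the slide, not stated facts. Python's `heapq.merge` stands in for a key-based merge.

```python
import heapq

# Columns of the round-robin listing, read as three partitions.
# Each tuple is (id, name, zip, score); the grade letter is omitted.
p0 = [(42, "John", "02116", 30), (45, "Sue", "02241", 92),
      (48, "Mary", "02116", 38)]
p1 = [(43, "Mark", "02114", 9), (46, "Rick", "02116", 23),
      (49, "Jane", "02241", 2)]
p2 = [(44, "Bob", "02116", 8), (47, "Bill", "02114", 14)]


def score(rec):
    return rec[3]


# Each partition is sorted on its own, then the sorted flows are merged.
sorted_parts = [sorted(p, key=score) for p in (p0, p1, p2)]
merged = list(heapq.merge(*sorted_parts, key=score))
for rec in merged:
    print(*rec)
```

Because every input flow is already sorted, the merge only ever compares the head record of each flow, so it preserves the sort order without re-sorting the combined data.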
Concatenation
Not key-based.
Result ordering is by partition.
Useful for:
• appending headers and trailers
• creating serial flow from partitioned
data
Used very infrequently
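Ordering by partition means all of the first partition is emitted, then all of the second, and so on. A minimal Python sketch of the header and trailer use case follows; the `concatenate` name and the sample values are hypothetical.

```python
from itertools import chain


def concatenate(partitions):
    """Emit all of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(partitions))


header = ["HEADER"]
body = ["rec1", "rec2", "rec3"]
trailer = ["TRAILER count=3"]
serial = concatenate([header, body, trailer])
print(serial)
```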
Interleave
Interleave combines blocks of data
records from multiple flow partitions in
round-robin fashion.
Interleave
Not key-based.
Result ordering is not fixed.
Used very infrequently
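Combining blocks in round-robin fashion can be sketched as the reverse of round-robin dealing. The Python function below is a simplified serial model with a hypothetical name and block-size parameter; in this model the output order is fully determined, and it makes no claim about ordering in a running graph.

```python
def interleave(partitions, block_size=1):
    """Take block_size records from each partition in turn until all are empty."""
    positions = [0] * len(partitions)
    out = []
    remaining = sum(len(p) for p in partitions)
    while remaining:
        for n, part in enumerate(partitions):
            block = part[positions[n]:positions[n] + block_size]
            positions[n] += len(block)
            remaining -= len(block)
            out.extend(block)
    return out


parts = [list("ADCGAE"), list("BEDBDA"), list("CFBAF")]
restored = "".join(interleave(parts))
print(restored)
```

The three sample partitions are the result of dealing the 17 letters A B C D E F C D B G B A A D F E A round-robin with a block size of 1, so interleaving them with the same block size gives back the original sequence.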
Exercise: Departition
Read data from 2 serial input files.
• id| name
Combine the data using Gather, Merge,
Part 2
Then replace the serial input file with the output file.
All-to-All Flow
Construct layout manually
Run on these hosts
Layout Determines What Runs Where
Serial: file on Node W
Parallel: 3-way multifile on Node X, Y, Z
Phase 0 Phase 1
Phases
Breaking an application into phases
limits the contention for:
• Main memory.
• Processor(s).
View Phase
END
(Day – 3)