02 Principles of Parallel Execution and Partitioning

DataStage Fundamentals (version 8 parallel jobs)

Staffordshire, December 2010

Module 2 Principles of Parallel Execution


DataStage parallel jobs, as the name suggests, execute a number of
processes in parallel. There are two aspects to this parallelism; that of
having more than one stage executing at the same time, passing rows of
data along the stream of stages, and that of dividing ("partitioning") the
data over one or more logical processing nodes.
We also introduce how you can specify how data are to be partitioned across
multiple nodes, or collected from multiple nodes into one.

Objectives
Having completed this module you will be able to:

define pipeline parallelism and partition parallelism

name five partitioning algorithms used by DataStage

name two situations in which key-based partitioning is necessary to yield
correct results

state how pipeline parallelism is implemented in DataStage parallel jobs

name two DataStage structures in which partitioned data can be stored

Parallel Execution
Parallel execution is best understood by contemplating what happens
when parallel execution is not occurring.
In the worst case, one row is processed right through the ETL process, and
disposed of in accordance with the design. Then the next row is processed
right through the ETL process, and so on. In short, processing occurs one
row at a time.
There are two ways that parallelism (that is, processing more than one
row at the same time) can occur. They are called pipeline parallelism
and partition parallelism. DataStage
parallel jobs employ both, largely automatically. It is possible to design
both into DataStage server jobs too, in a non-automatic implementation,
but that discussion is outside the scope of these notes.

Pipeline Parallelism
If a DataStage job consists of stages connected by links, then the data flow
through that job is indicated by the arrowheads on the links.

By associating some kind of memory buffer with each of the links, it
becomes possible to transmit more than one row at a time, and for
upstream stages to begin processing later rows while earlier rows are
still being processed by downstream stages.

This is known as pipeline parallelism. It establishes a pipeline from one
stage to the next. Each stage is a producer of rows, a consumer of rows, or
both.
In a DataStage parallel job a specific structure, called a Data Set (more
exactly a virtual Data Set), is used to implement these buffers. By default
there is a pair of buffers associated with each link, with the upstream stage

producing rows into one and the downstream stage consuming rows from
the other. These roles switch when certain tuneable thresholds are reached [1].
This mechanism tends to smooth out speed variations in stage execution
and allows processing to continue as fast as possible.
Another way to think of pipeline parallelism is as a conveyor belt moving
buffers (boxes?) of rows from one stage to the next.
Conceptually the operating system idea of a pipeline may be used to aid
understanding of what is happening. Operating systems allow pipelines of
commands to be created, with operators such as the pipe (|), re-direction
operators such as > and conditional execution operators such as && joining
them and controlling what happens.
If you imagine for a moment that each stage in a DataStage parallel job
design generates an operator when the job is compiled then you might
contemplate the execution path of a parallel job as being able to be
described as something like:
op0 | op1 | op2 | op3
To provide for the buffering (the virtual Data Sets) this model needs to
be slightly refined.
op0 > ds0 | op1 < ds0 > ds1 | op2 < ds1 > ds2 | op3 < ds2
In this model, op0 through op3 refer generically to operators and ds0
through ds2 refer generically to Data Sets [2].
Indeed, when you examine the Orchestrate shell script that is generated by
compiling a job, or when you examine the score that is actually executed,
you see precisely this relationship between operators and Data Sets.
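As a conceptual illustration only (this is Python, not DataStage or
Orchestrate code, and the row values and buffer size are arbitrary), the
following sketch shows the producer/consumer idea behind pipeline
parallelism: a bounded buffer sits on the link, so the upstream stage can
keep producing later rows while the downstream stage is still consuming
earlier ones.

import queue
import threading

def upstream(link):
    # producer: writes rows into the link's buffer as fast as it can
    for row in range(10):
        link.put(row)
    link.put(None)                      # end-of-data marker

def downstream(link):
    # consumer: reads rows from the buffer as they become available
    while True:
        row = link.get()
        if row is None:
            break
        print("processed row", row)

link = queue.Queue(maxsize=4)           # the bounded buffer on the link
producer = threading.Thread(target=upstream, args=(link,))
consumer = threading.Thread(target=downstream, args=(link,))
producer.start(); consumer.start()
producer.join(); consumer.join()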
In the Orchestrate parallel execution engine, the only structure with which
an operator can deal is an Orchestrate Data Set. This has implications for
dealing with real world data, which will be discussed in a later module.

[1] Details on how these mechanisms are tuned are beyond the scope of these
notes. They properly belong in an advanced syllabus.

[2] In the Orchestrate parallel execution environment, since everything is
written in C or C++, counts always begin from zero. That is, counts are
zero based.

Partition Parallelism
Partition parallelism involves dividing the data into subsets of rows and
processing each subset identically and in parallel. Provided that there are
sufficient resources to handle this parallel processing, overall execution
time should be reduced linearly.
That is, if you divide your data into N equal subsets of rows, then it should
take very close to 1/N the amount of time to execute as it would to process
the entire set of rows in a single stream.
In practice there will be factors, such as the need to keep common key
values together and the overheads of starting up and shutting down the
additional parallel processes, that will work against achieving this
optimal result, but as a general rule partition parallelism will yield reduced
throughput time.
DataStage parallel jobs achieve partition parallelism automatically, using
the Orchestrate parallel execution mechanism described below.
This means that you only have to design as if there will be a single stream
of processing; the DataStage parallel engine looks after the rest (though
you do need to remain alert to implications for partitioning data and
generating sequences of numbers in a parallel execution environment).
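A rough sketch of partition parallelism, again in Python rather than
DataStage: the data are split into one subset per node and each subset is
processed by the same (hypothetical) row-doubling logic in a separate
process. With sufficient resources, elapsed time approaches 1/N of the
single-stream time, less the startup and shutdown overheads mentioned
above.

from multiprocessing import Pool

def process_subset(rows):
    # identical logic is applied to every subset (hypothetical transform)
    return [row * 2 for row in rows]

if __name__ == "__main__":
    data = list(range(100_000))
    nodes = 4
    # round-robin style split into one subset of rows per processing node
    subsets = [data[i::nodes] for i in range(nodes)]
    with Pool(nodes) as pool:
        results = pool.map(process_subset, subsets)
    print(sum(len(part) for part in results))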
Consider the following job design (don't worry too much yet about what it
actually does).

When you execute this parallel job in an execution environment that
specifies four processing nodes, what actually gets executed might be
represented as follows. Again, this simplifies the actual situation, but is
useful as an explanation.

The number of processing nodes is specified by the parallel execution
configuration file (of which more shortly). A DataStage server will
typically have more than one configuration file available. The one that a
particular job will use is specified by an environment variable called
APT_CONFIG_FILE.
The value of APT_CONFIG_FILE is a pathname of a DataStage parallel
execution configuration file. By default these are stored in a directory
called Configurations that is a sibling to the DataStage engine directories,
but may be stored in any directory. We strongly recommend conforming
to the conventional storage location, so as to cause least surprise to anyone
subsequently maintaining the system.

Always include the environment variable $APT_CONFIG_FILE as a job
parameter in DataStage parallel jobs and in job sequences that invoke
DataStage parallel jobs. This allows the configuration file for parallel
execution to be selected on a job by job basis when the job is executed.
The default value for this job parameter should be the special value
$PROJDEF, which means that the project-level default will be used unless
it is overridden when the job run is requested.

Execution Environment
It would be well here to take a short sidestep to discuss some terminology.
The term symmetric multi-processing (SMP) is often used to describe a
single machine with more than one CPU.
Multi-core CPU chips count as multiple processors for this discussion.
Although technically it is not quite the same thing, non-uniform memory
architecture (NUMA) is treated as SMP for DataStage.
In an SMP environment, every processor has access to all of the memory
and all of the disk resources available to that machine. For that reason, it
is sometimes usefully referred to as a "shared everything" environment.
When processing occurs on multiple machines, which may be connected
together in a cluster or grid configuration, then each machine's CPUs
have access only to that machine's memory and disk resources. For that
reason, such a configuration is sometimes referred to as a "shared
nothing" environment. The term massively parallel processing (MPP) is
also encountered when describing this kind of environment.
Execution on a grid configuration, where the number of available
machines may vary over time, requires a more dynamic configuration file.
DataStage has the capability to manage or, more accurately, to be
managed by a grid environment. This requires installation of the Grid
Enablement Toolkit.

Configuration File
A DataStage parallel execution configuration file contains a list of one or
more node names, the location of each node and the resources available on
each node. A node, in this context, is a purely logical construct called a
processing node; there is no fixed relationship between the number of
nodes and, for example, the number of CPUs available.
On a four CPU machine (or cluster or grid), for example, it is likely that
one-node, two-node, four-node and even eight-node configurations might
be defined. And where the nodes are on separate machines, the same
considerations apply (consider the total number of CPUs available), but
there are some small overheads in communication between the machines
that need to be taken into consideration.
In an SMP environment, everything is on the same machine in the
network. Therefore the machine location of every processing node in the
parallel execution file will be the same machine name. In an MPP
environment, however, there will be different machine names associated
with different processing nodes. It still may be the case that there is more
than one node on each of these machines, but if there is ever more than
one machine named in a configuration file, then it describes an MPP
configuration of some kind.
Let's examine the default configuration file that ships with a DataStage
server installed on a Windows operating system.
{
  node "node1"
  {
    fastname "PCXX"
    pools ""
    resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}

The outer set of curly braces contains the entire configuration definition.
The only thing that can appear outside of these curly braces are C-style
comments (that is, text introduced by /* and terminated by */).
In the default configuration file shown there is only one node, and it is
named node1. Node names are simply names; they are used only in
error, warning and informational messages to be read by humans.
DataStage internally uses the node number; the ordinal number, beginning
from zero, of the node within the current configuration file. Therefore, in
the default configuration file shown here, you would expect to see node #0
referred to in areas like monitoring.
After each node name there is another set of curly braces, which contains
the definition of that particular node. A node definition contains:

the fastname of that node, that is, the network name of the machine on
which this node's processing is to occur. It is called fastname because you
always specify the name by which that machine is known on the fastest
possible network connection.

a list of the node pools in which this node will participate. This is a
space-separated list introduced by the word pools, with one or more node
pool names each surrounded by double quote characters.

a list of resources associated with the node. Resources called disk and
scratchdisk are always defined and there may be others. Disk and
scratchdisk resources may be further qualified by a list of disk pools in
which the resource will participate; this is a space-separated list enclosed
in curly braces and introduced by the word pools, with one or more disk
pool names each surrounded by double quote characters.

A node pool or disk pool name that consists of a zero-length string, as
shown in the default configuration file displayed above, is the default
pool. At least one node must belong to the default node pool unless every
stage in the job design has an explicit node pool specified.
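To make the structure just described concrete, here is a small Python
sketch (not a DataStage utility) that pulls the node names and fastnames
out of a configuration file and prints them with their zero-based node
numbers. The file path is only an assumption for illustration, and the
pattern matching is deliberately simplistic.

import re

# Hypothetical location of a configuration file; substitute your own.
with open("/opt/IBM/InformationServer/Server/Configurations/default.apt") as f:
    text = f.read()

text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)        # drop C-style comments
names = re.findall(r'node\s+"([^"]+)"', text)
fastnames = re.findall(r'fastname\s+"([^"]+)"', text)

for number, (name, fastname) in enumerate(zip(names, fastnames)):
    # DataStage refers to the node by this zero-based ordinal in monitoring
    print("node #%d: name=%s fastname=%s" % (number, name, fastname))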

Figure 2-1 DataStage Configurations Editor

From the DataStage Designer client (under the Tools menu) you can open a
tool called the Configurations editor, which is illustrated in Figure 2-1. As
well as being able to view the configuration files available on the system,
this editor allows new configuration files to be created from old, existing
configuration files to be edited [3], and the validity of a configuration file
to be checked.
Let's examine a configuration file from a UNIX system. (The only
difference on a Windows system would be the Windows pathnames.) [4]
/* Four-node configuration file four_node.apt */
/* Created 2008-02-22, Ray Wurlod */
{
  node "node0"
  {
    fastname "dsdevsvr"
    pools "" "node0" "low2"
    resource disk "/usr/data1/DataStage/disk00"
    resource disk "/usr/data2/DataStage/disk00"
    resource scratchdisk "/usr/tmp/DataStage/disk00" {}
    resource scratchdisk "/var/tmp/DataStage/disk00" {pools "buffer"}
  }
  node "node1"
  {
    fastname "dsdevsvr"
    pools "" "node1" "low2"
    resource disk "/usr/data1/DataStage/disk01"
    resource disk "/usr/data2/DataStage/disk01"
    resource scratchdisk "/usr/tmp/DataStage/disk01" {}
    resource scratchdisk "/var/tmp/DataStage/disk01" {pools "buffer"}
  }
  node "node2"
  {
    fastname "dsdevsvr"
    pools "" "node2" "high2"
    resource disk "/usr/data1/DataStage/disk02"
    resource disk "/usr/data2/DataStage/disk02"
    resource scratchdisk "/usr/tmp/DataStage/disk02" {}
    resource scratchdisk "/var/tmp/DataStage/disk02" {pools "buffer"}
  }
  node "node3"
  {
    fastname "dsdevsvr"
    pools "" "node3" "high2"
    resource disk "/usr/data1/DataStage/disk03"
    resource disk "/usr/data2/DataStage/disk03"
    resource scratchdisk "/usr/tmp/DataStage/disk03" {}
    resource scratchdisk "/var/tmp/DataStage/disk03" {pools "buffer"}
  }
}

[3] Most sites protect the Configurations directory so that edits can be made
only by administrative staff. If this is the case at your site, you will receive
an access denied message when you attempt to save changes.

[4] Configuration files for grid environments have other requirements due to
the dynamic nature of nodes resulting from grid management software. These
are beyond the scope of these notes.

This is clearly an SMP system (the fastname is the same for all nodes).
There are four single-node node pools, and two two-node node pools. The
two-node node pools are named low2 and high2. The four one-node
node pools are eponymously named for the node to which each relates.
Node pool names can be dedicated. There are several reserved node pool
names in DataStage, among them DB2, INFORMIX, ORACLE, SAS and
sort. The first four tie in with resources available on that node. The sort
node pool can be used to specify a separate scratch disk resource that will
be used by sort rather than the default scratch disk pool.
Directories listed under resource disk are used to store data in Data Sets,
File Sets and Lookup File Sets. We shall see later that files stored in these
directories are recorded in a control file for each of these particular
objects.
Directories listed under resource scratchdisk are, as the name suggests,
used for scratch space by various kinds of operator. For example, the
Sort stage allows a certain maximum amount of physical memory per
node to be allocated to the sorting process. Once all this memory is in use,
the sorting operation overflows into files in the scratch space. These files
should automatically be deleted once the job completes successfully.
Scratch space may also be used by other facilities. For example, if the
Oracle bulk loader (sqlldr) experiences a problem and generates a bad
file and a log file, and the designer has not specified locations where
these should be written, then DataStage will direct them to the scratch
space.
The parallel engine uses the default scratch disk for temporary storage
other than buffering. If you define a buffer scratch disk pool for a node in
the configuration file, the parallel engine uses that scratch disk pool rather
than the default scratch disk for buffering, and all other scratch disk pools
defined are used for temporary storage other than buffering. Each
processing node in the above example has a single scratch disk resource in
the buffer pool, so buffering will use /var/tmp/DataStage/disknn but will
not use /usr/tmp/DataStage/disknn (other operators will use this scratch
space).
However, if /var/tmp/DataStage/disknn were not in the buffer pool,
both /var/tmp/DataStage/disknn and /usr/tmp/DataStage/disknn would be
used because both would then be in the default pool.
Other resource types may (indeed may need to) be defined, for example
resource DB2, resource INFORMIX, resource ORACLE and resource
SAS.
For more information on configuration files and the configuration file
editor tool refer to the chapter on the parallel engine configuration file in
the DataStage Parallel Job Developer Guide.

Configuration File in a Grid Environment
In a grid environment the available resources may vary over time, so that a
different mechanism than a static configuration file is required.
To enable a dynamic configuration to be used, the Grid Enablement
Toolkit introduces a number of additional environment variables and
scripts. The Grid Enablement Toolkit environment variables are installed
in all existing DataStage/QualityStage projects, and in the template from
which new projects are created.
All but one of the environment variables have names beginning with
"APT_GRID_". The single exception is an environment variable called
APT_DYNAMIC_CONFIG_SCRIPT, which allows a name to be given to
the generated dynamic configuration file.
Grid-enabling a single job is triggered by setting the environment variable
APT_GRID_ENABLE to YES and providing values to two other
variables:

APT_GRID_COMPUTENODES = m, where m is the number of machines
from the grid on which processes will be run

APT_GRID_PARTITIONS = n, where n is the number of "processing
nodes" (in the DataStage sense [5]) to be created on each of those
machines

The dynamic configuration file is dumped into the job log. Its "nodes" are
named according to the convention node_m_n where m is the machine
number and n is the partition number on that machine. For example, on a
2:2 configuration (two compute nodes and two partitions per compute
node), the configuration file would refer to node_1_1, node_1_2,
node_2_1 and node_2_2. (This numbering is an exception to the zero-based
numbering encountered in most aspects of parallel job execution.)
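A small Python sketch (illustration only, not part of the toolkit) of that
naming convention for the 2:2 example above:

compute_nodes, partitions = 2, 2
names = ["node_%d_%d" % (m, n)
         for m in range(1, compute_nodes + 1)
         for n in range(1, partitions + 1)]
print(names)   # ['node_1_1', 'node_1_2', 'node_2_1', 'node_2_2']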
If APT_GRID_ENABLE is not set to YES then DataStage will use the
static configuration file whose pathname is given in APT_CONFIG_FILE.
Sequences use a different mechanism; sequences always execute on a
single machine. In a sequence you begin with an Execute Command
activity that executes sequencer.sh, which takes the name of the new
configuration file and the number of compute nodes and partitions as
arguments, and returns a space-separated list of two sequential file host
names (discussed in a later module) and the APT_CONFIG_FILE pathname.

[5] Grid management software uses the term "node" to mean a machine in the
grid. DataStage uses the term "node" to mean a logical subset of available
processing resources, as described in the earlier section on configuration files.

Partitioning
A partitioning algorithm describes to DataStage how to distribute rows of
data among the processing nodes. The partitioning algorithm to be used is
specified on the input link to a stage.
If you do not explicitly specify a partitioning algorithm, then (Auto) is
filled in, which instructs DataStage to choose the appropriate partitioning
algorithm when the job design is converted for parallel execution [6].
Table 2-1 DataStage Partitioning Algorithms

Round Robin (not key-based): first row to first node (number 0), second row
to second node (number 1), and so on.

Random (not key-based): rows randomly allocated to nodes.

Modulus (key-based): the (integer) key value is divided by the number of
nodes and the remainder is used as the number of the node to which the row
is allocated.

Hash (key-based): a hashing algorithm is applied to the (any data type) key
value to determine the number of the node to which the row is allocated.

DB2 (key-based): matches the DB2 partitioning algorithm.

Entire (not key-based): every row is placed on every node.

Range (key-based): rows are allocated to nodes based on the range of values
in a field, via a pre-created range map.
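The behaviours in Table 2-1 can be sketched in a few lines of Python
(illustration only; DataStage's actual hash and modulus functions are
internal to the engine and not reproduced here, and the sample rows are
invented).

def round_robin(rows, nodes):
    # first row to node 0, second row to node 1, and so on
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def modulus(rows, nodes, key):
    # integer key value modulo the node count chooses the node
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[key(row) % nodes].append(row)
    return parts

def hash_partition(rows, nodes, key):
    # a hash of the key value (any data type) chooses the node
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(key(row)) % nodes].append(row)
    return parts

def entire(rows, nodes):
    # every row is placed on every node
    return [list(rows) for _ in range(nodes)]

rows = [{"id": 7, "state": "NY"}, {"id": 8, "state": "CA"}, {"id": 9, "state": "NY"}]
print(hash_partition(rows, 2, key=lambda r: r["state"]))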

Also on the list of available partitioning algorithms are two
pseudo-algorithms called (Auto) and Same.
(Auto) specifies that DataStage is to determine the partitioning
algorithm according to the following rules.
Use Same on a parallel to parallel link.
Use Round Robin on a sequential to parallel link.
Use Entire on the reference input link to a Lookup stage.
Use DB2 on a link that connects to a DB2 Enterprise stage or
DB2 Connector.
[6] For reasons that will become clear in a later module, this process is
called composing the score.

These are not always appropriate.
Particular circumstances where a better partitioning algorithm can be
found essentially group into three categories: where key values need to be
compared, where key values need to be combined, and one special case,
that of a lookup in an MPP environment.
Where key values need to be compared, such as in a Lookup, a Join, a
Merge (master and updates) or when removing duplicates, then it is vital
that all occurrences of each key value occur on the same node as each
other. The ability to search other nodes for matching values is not part of
what DataStage does; execution on each node is isolated from execution
on other nodes.
Where key values need to be combined, such as in an Aggregator, a Sort
or some calculations in a Transformer stage, again it is vital that all
occurrences of each key value occur on the same node as each other. For
example, consider a summary by state of US data. In a four-node parallel
configuration, with key adjacency, the correct result (at most 50 rows) will
be produced. Without key adjacency, as many as 200 rows (50 from each
of four nodes) may occur, which would not be the desired summary.
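The effect is easy to demonstrate with a small Python sketch (illustration
only, with made-up rows): summarising per node after round robin
partitioning produces several partial rows per state, whereas hash
partitioning on the state key keeps each state on a single node, so the
union of the per-node summaries is the correct result.

from collections import Counter

rows = [("CA", 1), ("NY", 1), ("CA", 1), ("TX", 1), ("NY", 1), ("CA", 1)]
nodes = 2

def summarise(partition):
    # per-node aggregation: row count per state on that node
    counts = Counter()
    for state, n in partition:
        counts[state] += n
    return counts

# Round robin: the same state lands on more than one node, so the union of
# the per-node summaries contains duplicate state rows.
round_robin = [rows[i::nodes] for i in range(nodes)]
print([summarise(p) for p in round_robin])

# Hash on the state key: all rows for a state are adjacent on one node,
# so the union of the per-node summaries is the desired summary.
hashed = [[] for _ in range(nodes)]
for row in rows:
    hashed[hash(row[0]) % nodes].append(row)
print([summarise(p) for p in hashed])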
To get around the problem of key adjacency with a Lookup stage, the
default is to Entire partition on the reference input. That way, no matter
how the stream input is partitioned, every row of the reference input is
findable, because every key occurs on every partition. In an SMP
environment this is a good way to do things due to the shared everything
nature of the environment: in fact, the reference input is placed into shared
memory so that there is only one copy of the Data Set. However, in an
MPP environment, Entire partitioning will involve moving all the rows of
the reference Data Set to every processing node, via TCP/IP. This may be
too great an overhead compared to using a key-based partitioning
algorithm on the reference input.

Configuring Partitioning
Every stage type that has an input link and which executes in parallel
mode can have the Partitioning properties of that input link configured.
On the input link properties page there is a tab captioned Partitioning.

Figure 2-2 Configuring Partitioning

In the Partitioning/Collecting frame there is a drop-down list where the
partitioning algorithm (partition type) may be selected. If a key-based
partition type is selected, then one or more columns must be selected from
the Available columns field immediately below the drop-down list.
To the right of the drop-down list is a tool that is enabled if certain of the
algorithms are chosen. For example, if Range is selected the name of the
associated range map must be specified; if DB2 is selected the DB2
database name, instance name and partition table must be specified.
If NLS is enabled, this tool may also be used to select the appropriate
locale to use when performing calculations. Locale can specify sorting and
comparison order.

Configuring Collecting
Every stage type that has an input link and which executes in sequential
mode can have the Collecting properties of that input link configured. On
the input link properties page there is a tab captioned Partitioning.

Figure 2-3 Configuring Collecting

In the Partitioning/Collecting frame there is a drop-down list where the
collecting algorithm (collector type) may be selected. If a key-based
collector type is selected, then one or more columns must be selected from
the Available columns field immediately below the drop-down list.
To the right of the drop-down list is a tool that is enabled if the Sort Merge
algorithm is chosen. This tool may also be used to select the appropriate
locale to use when performing sorting.
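Conceptually, a Sort Merge collector interleaves already-sorted partitions
into a single sorted stream on the collecting key. A minimal Python sketch
of that idea (not DataStage code, with invented rows, and assuming each
partition is already sorted on the key):

import heapq

partitions = [
    [("AL", 10), ("CA", 30), ("NY", 70)],    # node 0, sorted on state
    [("AK", 20), ("CA", 40), ("TX", 50)],    # node 1, sorted on state
]
merged = list(heapq.merge(*partitions, key=lambda row: row[0]))
print(merged)   # one sorted stream: AK, AL, CA, CA, NY, TX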

Parallel Data Storage
Quickest throughput in a DataStage job is achieved by keeping the rows in
memory; that is, not letting them touch down on disk. The logic behind
this statement is attractive; memory access times are faster than disk
access times by several orders of magnitude.
However there will be times when you do not want extraction,
transformation and load to occur all in one fell swoop. For example,
there may be limited, and different, time windows when the source and
target databases are available. Also, if data are in memory when some
catastrophic failure occurs that shuts down the machine, those data are
lost. Creating staging points for data affords an aid to restartability.
Recall, from earlier in this module, that the only input or output that can
be used with a DataStage operator is a Data Set. When resident in
memory (that is, when directly associated with a link), it is referred to as a
virtual Data Set.
Conceptually a Data Set spans all partitions, though physically each node
has its own subset of rows. In addition, the use of node pools or disk
pools can limit the nodes on which a Data Set has rows.
Another characteristic of a Data Set is that the data in it are in internal
format; that is, for example, numeric data are stored in binary format
while variable length character strings are stored in a fixed size buffer and
have a numeric length indicator associated with them.
It turns out that the data in a virtual Data Set can be stored on disk in a
structure called a persistent Data Set.
Both kinds of Data Set have, when created, a control file that contains
information about the record schema (the layout of each record in the Data
Set).

For a virtual Data Set, this file has a name created from the stage
and link to which it relates, and a suffix of .v. Virtual Data Set
control files are deleted automatically when the job finishes. The
name of a virtual Data Set control file can be seen in the osh script
generated by compiling a DataStage parallel job.

For a persistent Data Set, the control file has a name given by the
developer, which conventionally has a suffix of .ds. Persistent
Data Set control files contain not only the record schema but also
contain the location of the physical files on each processing node
in which the data are stored. These files always reside in
directories mentioned in the resource disk portion of the
configuration file.

Let us contemplate, therefore, what a persistent (or virtual) Data Set looks
like.

Figure 2-4 Contents of a Data Set

There are rows of data on each of the nodes, and a control file that outlines
the structure of each record.
Within that structure the final four fields are used to control such things as
preservation of sorted order in the Data Set. These are used only within
DataStage; they are not accessible to developers. However, the record
schema can be viewed.
Note that there is no mention in the control file, when we are considering a
virtual Data Set, of where the data files actually are. This is because they
are only in memory. For a persistent Data Set the control file will also
contain the pathnames of the data files for each segment of the Data Set.
These may be viewed in the Data Set Management tool.
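Purely as a conceptual picture (this is not the actual on-disk format, and
the column names and segment file paths are invented for illustration),
the information a persistent Data Set's control file carries can be
imagined as something like the following; note that each segment path
sits under a resource disk directory from the configuration file.

control_file = {
    "record_schema": ["customer_id: int32", "state: string[2]"],     # hypothetical columns
    "segments": {
        "node0": ["/usr/data1/DataStage/disk00/customers.ds.0000"],  # hypothetical paths
        "node1": ["/usr/data1/DataStage/disk01/customers.ds.0001"],
    },
}
for node, files in control_file["segments"].items():
    print(node, files)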
Data are moved to and from persistent Data Sets using a Data Set stage,
which in turn compiles to a copy operator. That is, any sorting and
partitioning in the Data Set is preserved. This makes the Data Set the ideal
structure for intermediate storage of data between jobs. That is, if you
need one job to store data on disk for another job to pick up later, then the
persistent Data Set should be the structure in which to store them.
In DataStage, persistent Data Sets are read from or written to using the
Data Set stage. They can be manipulated (inspected, copied or deleted)
using the Data Set Management tool (from the Tools menu in Designer) or
using the orchadmin command.

File Set
A File Set is a similar structure to a Data Set.
A File Set has a control file (this time with a .fs suffix) and one or more
data files on each processing node. Data in a File Set preserve both sorted
order and partitioning.
The difference is that the data stored in a File Set are in human-readable,
or external format, and therefore must go through conversion from or to
the internal format used within the DataStage job. On the other hand, data
in a File Set are readily accessible to other applications.

Figure 2-5 Structure of a File Set

The Lfile entries in the control file (on the right) are the physical locations
where data are stored on each node. The directory part of each of these
pathnames is the directory recorded as resource disk in the configuration
file for parallel execution (the one referred to by the environment variable
APT_CONFIG_FILE).
Data in each of these files are human readable; in this case the format is
comma-separated values with strings double-quoted, which can be seen in
the format portion of the record schema.
As with Data Sets, the final four fields are used to control such things as
preservation of sorted order in the File Set; these fields are not visible to
developers. The control file for a File Set
is clear text, so it is straightforward for other applications to read it (the
control file) to determine the locations of data.
In DataStage, File Sets are read from and written to using the File Set
stage.

Lookup File Set
The Lookup File Set is neither a File Set nor a Data Set, though it has
characteristics of both.
It has a control file that has a .fs suffix. That is its only resemblance to a
File Set.
Its data are stored in internal format, but not in rows like in a Data Set.
Instead, rows are stored in blocks within the Lookup File Set, and an index
to the location of each row is built and incorporated into the Lookup File
Set.
Because of this indexed structure, there is no capacity for viewing the data
in a Lookup File Set: there is no utility, and the Lookup File Set stage
does not support a View Data capability.
This structure can only be used to serve rows into the reference input of a
Lookup stage, with which we will deal in a later module.


Review
Pipeline Parallelism is like a conveyor belt in which buffers are used
between operators so that more than one row at a time may be passed.
Partition Parallelism involves splitting the data into row-based subsets and
processing each subset identically, according to the original design.
There are seven Partitioning Algorithms available in DataStage parallel
jobs: round robin, random, entire, hash, modulus, range and DB2. Hash
and modulus are key-based algorithms. Auto and same cause an
algorithm to be selected when the job score is composed.
Situations in which key-based partitioning is necessary to achieve correct
results include any situation in which keys must be compared (such as in
the Lookup, Join, Merge and Remove Duplicates stage types) or combined
(such as in the Sort and Aggregator stage types).
DataStage structures in which partitioned data can be stored include Data
Sets (in which data are stored in internal format), File Sets (in which data
are stored in external, or human-readable, format) and Lookup File Sets
(in which data are stored in an indexed internal format). We examine
these structures in more detail in a later module.

Further Reading
Types of Parallelism: Parallel Job Developer Guide, Chapter 2
Configuration Files: Parallel Job Developer Guide, Chapter 2
Partitioning: Parallel Job Developer Guide, Chapter 2
Data Sets: Parallel Job Developer Guide, Chapters 2 and 4
File Sets: Parallel Job Developer Guide, Chapter 6
Lookup File Sets: Parallel Job Developer Guide, Chapter 7
