DataStage Configuration File


DataStage Configuration file FAQ

July 2, 2010

1. APT_CONFIG_FILE is the environment variable through which DataStage determines which configuration file to use (one can have many configuration files for a project); in fact, this is what is generally used in production. However, if this environment variable is not defined, then how does DataStage determine which file to use?
1. If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the default configuration file (config.apt) in the following locations:
1. The current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
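For example (a minimal sketch; the path and file name below are hypothetical and should be adjusted to your installation), you would point a session at a specific configuration file like this before running a job:

# Hypothetical path: substitute your installation's configuration directory.
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt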
2. What are the different options a logical node can have in the configuration file?
1. fastname: The fastname is the physical node name that stages use to open connections for high-volume data transfers. The value of this option is often the network name. Typically, you can get this name by using the Unix command uname -n.
2. pools: The names of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools.
1. A pool can be associated with many nodes, and a node can be part of many pools.
2. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
3. A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (a set of processing nodes).
1. If both the job and a stage within the job are constrained to run on specific processing nodes, the stage will run on the nodes that are common to both the stage and the job constraints.
3. resource: The syntax is resource resource_type "location" [{pools "disk_pool_name"}] or resource resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the Conductor node by the high-speed network), disk (a quoted absolute path name of a directory to which persistent data is read and written), scratchdisk (a quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.).
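Putting these options together, a single logical node entry might look like the following sketch (the host name and directory paths here are hypothetical):

node "node1"
{
	fastname "etlhost1"
	pools ""
	resource disk "/data/datasets" {pools ""}
	resource scratchdisk "/data/scratch" {pools ""}
}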
3. How does DataStage decide on which processing node a stage should run?
1. If a job or stage is not constrained to run on specific nodes, then the parallel engine executes a parallel stage on all nodes defined in the default node pool (the default behavior).
2. If the stage is constrained, then the constrained processing nodes are chosen while executing the parallel stage (refer to 2.2.3 for more detail).
4. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. The node from which you start parallel engine applications is called the conductor node, and you need to copy the configuration (.apt) file only to that node; for the other nodes, you do not need to specify the physical node. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other using very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism?
1. Make sure that none of the stages are specified to run on the conductor node.
2. Use the conductor node just to start the execution of the parallel job.
3. Make sure that the conductor node is not part of the default pool.
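For instance, the conductor node can be given its own pool name and left out of the default pool, as in the sketch below (the host name and pool name are hypothetical):

node "conductor"
{
	fastname "conductor_host"
	pools "conductor"
	resource disk "/data/datasets" {pools ""}
	resource scratchdisk "/data/scratch" {pools ""}
}

Because the pools list omits "", no stage runs on this node by default; it only coordinates the job.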
5. Although parallelization increases the throughput and speed of the process, why is maximum parallelization not necessarily the optimal parallelization?
1. DataStage creates one process for every stage for each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with 4 stages, and we define 3 logical nodes (one corresponding to each physical node, say a CPU). DataStage will then start 3*4 = 12 processes, which have to be managed by a single operating system, and significant time will be spent in context switching and process scheduling.
6. Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So how do we decide which node will be suitable for which stage?

1. If a stage is performing a memory-intensive task, then it should be run on a node which has more memory (and scratch disk space) available to it. E.g., sorting data is a memory-intensive task, and it should be run on such nodes.
2. If a stage depends on a licensed version of software (e.g. the SAS stage, RDBMS-related stages, etc.), then you need to associate that stage with the processing node which is physically mapped to the machine on which the licensed software is installed; see the sketch after this list. (Assumption: the machine on which the licensed software is installed is connected to the other machines through a high-speed network.)
3. If a job contains stages which exchange large amounts of data, then they should be assigned to nodes where the stages can communicate in the most optimized manner, either by shared memory (SMP) or by a high-speed link (MPP).
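As a sketch of the licensed-software case (all names here are hypothetical), a dedicated pool can mark the node that is mapped to the machine holding the license, and the licensed stages can then be constrained to that pool:

node "node4"
{
	fastname "sas_host"
	pools "" "sas"
	resource disk "/data/datasets" {pools ""}
	resource scratchdisk "/data/scratch" {pools ""}
}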
7. Basically, nodes are nothing but a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment onto them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has the default name startup.apt. However, like the main configuration file, we can also have many startup scripts, and the appropriate one can be picked using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?
1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shells.
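A minimal sketch of such a startup script is shown below. The environment settings are hypothetical, and the shift/exec trailer follows the pattern commonly shown for parallel-engine startup scripts; verify it against your engine version's documentation before relying on it:

#!/bin/sh
# Hypothetical startup.apt sketch: selectively adjust the environment
# on a remote node before the parallel engine processes start.
LD_LIBRARY_PATH=/opt/sas/lib:$LD_LIBRARY_PATH   # illustrative path only
export LD_LIBRARY_PATH
# Hand control back to the engine (assumed trailer; confirm for your version).
shift 2
exec $*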
8. What are the general guidelines one must follow while creating a configuration file so that optimal parallelization can be achieved?
1. Consider avoiding the disk/disks that your input files reside on.
2. Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles, even if they're located on a RAID (Redundant Array of Inexpensive Disks) system.
3. Know what is real and what is NFS:
1. Real disks are directly attached, or are reachable over a SAN (storage area network: dedicated, just for storage, low-level protocols).
2. Never use NFS file systems for scratchdisk resources; remember that scratch disks are also used for temporary storage of files/data during processing.

3. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a final disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS; see the sketch after this list.
4. Know which data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.
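As a sketch of the final-pool idea above (the paths and the pool name "final" are hypothetical), the NFS area can be declared as an extra disk resource in its own pool, so only data sets explicitly constrained to that pool land on NFS:

node "node1"
{
	fastname "etlhost1"
	pools ""
	resource disk "/local/datasets" {pools ""}
	resource disk "/nfs/results" {pools "final"}
	resource scratchdisk "/local/scratch" {pools ""}
}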

In DataStage, the degree of parallelism, the resources being used, etc., are all determined at run time, based entirely on the configuration provided in the APT configuration file. This is one of the biggest strengths of DataStage. If you change your processing configuration, or change servers or platforms, you never have to worry about it affecting your jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine which node to run a process on, where to store temporary data, and where to store data set data, based on the entries provided in the configuration file. A default configuration file is available whenever the server is installed; you can typically find it under the <>\IBM\InformationServer\Server\Configurations folder with the name default.apt. Bear in mind that you will have to optimize this configuration for your server based on your resources.
Basically, the configuration file contains the different processing nodes and also specifies the disk space provided for each processing node. Now, when we talk about processing nodes, you have to remember that these are logical processing nodes specified in the configuration file. So if you have more than one CPU, this does not mean the nodes in your configuration file correspond to those CPUs; it is possible to have more than one logical node on a single physical node. However, you should be wise in configuring the number of logical nodes on a single physical node. Increasing nodes increases the degree of parallelism, but it does not necessarily mean better performance, because it results in a greater number of processes. If your underlying system does not have the capability to handle these loads, you will have a very inefficient configuration on your hands.
Now let's try our hand at interpreting a configuration file. Let's try the sample below.
{
	node "node1"
	{
		fastname "SVR1"
		pools ""
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
	node "node2"
	{
		fastname "SVR1"
		pools ""
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
	node "node3"
	{
		fastname "SVR2"
		pools "sort"
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
}
This is a three-node configuration file. Let's go through the basic entries and what they represent.
Fastname: This refers to the node name on a fast network. From this we can imply that the nodes node1 and node2 are on the same physical node. However, if we look at node3, we can see that it is on a different physical node (identified by SVR2). So basically, in node1 and node2 all the resources are shared: the disk and scratch disk specified are actually shared between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.
Pools: Pools allow us to associate different processing nodes based on their functions and characteristics. If you see an entry like node0, or one of the reserved node pools like sort, db2, etc., then it means that the node is part of the specified pool. A node is by default associated with the default pool, which is indicated by "". Now if you look at node3, you can see that this node is associated with the sort pool. This will ensure that the sort stage runs only on nodes that are part of the sort pool.
Resource disk: Specifies the location on your server where the processing node will write all the data set files. As you might know, when DataStage creates a data set, the file you see will not contain the actual data; the data set file actually points to the place where the actual data is stored. Where that data is stored is what is specified in this line.
Resource scratchdisk: The location of the temporary files created during DataStage processes, like lookups and sorts, is specified here. If the node is part of the sort pool, then the scratch disk can also be made part of the sort scratch disk pool. This will ensure that the temporary files created during sorts are stored only in this location. If such a pool is not specified, then DataStage determines whether there are any scratch disk resources that belong to the default scratch disk pool on the nodes that the sort is specified to run on; if so, that space is used.
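For example (a sketch; the path is hypothetical), a scratch disk reserved for sorting would be declared like this on a node in the sort pool:

resource scratchdisk "C:/IBM/InformationServer/Server/SortScratch" {pools "sort"}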
I'm hoping this will help you in reading your configuration files. In most cases it's not as hard as it looks. But then again, this is just a simple configuration file I've explained; it should at least help you start off in understanding more complex configuration files.
