Configuration File


The DataStage configuration file is a master control file (a text file that sits on
the server side) for jobs, describing the parallel system resources and
architecture. The configuration file describes the hardware configuration for
architectures such as SMP (a single machine with multiple CPUs, shared memory and
disk), Grid, Cluster, or MPP (multiple nodes, each with its own CPUs and dedicated
memory). DataStage understands the architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing
configuration, or change servers or platforms, you do not have to worry about it
affecting your jobs, since all jobs depend on this configuration file for execution.
DataStage jobs determine which node to run a process on, where to store temporary
data, and where to store dataset data, based on the entries provided in the
configuration file. A default configuration file is created whenever the server is
installed.
Configuration files have the extension ".apt". The main benefit of the
configuration file is that it separates the software and hardware configuration from
the job design: hardware and software resources can change without changing the job
design. DataStage jobs can point to different configuration files through job
parameters, which means that a job can utilize different hardware architectures
without being recompiled.
The configuration file lists the different processing nodes and also
specifies the disk space available to each processing node. These are logical
processing nodes: if you have more than one CPU, the nodes in your configuration
file do not necessarily correspond to those CPUs, and it is possible to have more
than one logical node on a single physical node. However, you should be wise in
configuring the number of logical nodes on a single physical node. Increasing the
number of nodes increases the degree of parallelism, but it does not necessarily
mean better performance, because it results in a greater number of processes. If
your underlying system does not have the capacity to handle these loads, you will
have a very inefficient configuration on your hands.
1.    APT_CONFIG_FILE is the environment variable through which DataStage determines
the configuration file to use (a project can have many configuration files). In fact,
this is what is generally used in production. However, if this environment variable
is not defined, how does DataStage determine which file to use?
1.    If the APT_CONFIG_FILE environment variable is not defined, DataStage looks
for the default configuration file (config.apt) in the following locations:
1.    The current working directory.
2.    INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level
directory of the DataStage installation.
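The lookup order above can be sketched as a small function. This is a hypothetical illustration of the described behavior; resolve_config_file is not a real DataStage API, and the fallback order is taken only from the text above.

```python
import os

def resolve_config_file(env=None, cwd=".", install_dir=None):
    """Sketch of the configuration-file lookup described above.

    Illustration only, not a real DataStage API. Assumed order:
    $APT_CONFIG_FILE if set, then config.apt in the current working
    directory, then INSTALL_DIR/etc/config.apt. Returns None if no
    configuration file is found.
    """
    if env is None:
        env = os.environ
    # 1. An explicit APT_CONFIG_FILE setting always wins.
    explicit = env.get("APT_CONFIG_FILE")
    if explicit:
        return explicit
    # 2. Fall back to config.apt in the current working directory.
    candidate = os.path.join(cwd, "config.apt")
    if os.path.isfile(candidate):
        return candidate
    # 3. Finally, try INSTALL_DIR/etc ($APT_ORCHHOME/etc).
    if install_dir:
        candidate = os.path.join(install_dir, "etc", "config.apt")
        if os.path.isfile(candidate):
            return candidate
    return None
```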

2.    Define Node in configuration file


A node is a logical processing unit. Each node in a configuration file is
distinguished by a virtual name and defines the number and speed of CPUs, memory
availability, page and swap space, network connectivity details, etc.

3.    What are the different options a logical node can have in the configuration file?
1.    fastname – The fastname is the physical node name that stages use to open
connections for high-volume data transfers. Its value is usually the network name;
typically, you can get this name with the Unix command 'uname -n'.
2.    pools – The names of the pools to which the node is assigned. Based on the
characteristics of the processing nodes, you can group nodes into sets of pools.
1.    A pool can be associated with many nodes, and a node can be part of many pools.
2.    A node belongs to the default pool unless you explicitly specify a pools list for
it and omit the default pool name ("") from the list.
3.    A parallel job, or a specific stage in a parallel job, can be constrained to run
on a pool (a set of processing nodes).
1.    If both the job and a stage within the job are constrained to run on specific
processing nodes, the stage runs on the nodes common to both the stage and
the job.
3.    resource – resource resource_type "location" [{pools "disk_pool_name"}] |
resource resource_type "value". The resource_type can be canonicalhostname (the
quoted Ethernet name of a node in a cluster that is not connected to the conductor
node by the high-speed network), disk (a directory for reading/writing persistent
data), scratchdisk (the quoted absolute path of a directory on a file system where
intermediate data is temporarily stored; it is local to the processing node), or an
RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.).
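Putting these options together, a single node entry might look like the sketch below. The hostname, pool names, and paths here are illustrative assumptions, not values from any real installation:

```
{
    node "node1"
    {
        fastname "etlserver1"
        pools "" "sort"
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools "sort"}
    }
}
```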

4.    How does DataStage decide on which processing node a stage should run?


1.    If a job or stage is not constrained to run on specific nodes, the parallel
engine executes the parallel stage on all nodes defined in the default node pool
(default behavior).
2.    If the stage is constrained, the constrained processing nodes are chosen when
executing the parallel stage.
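This selection rule can be sketched in a few lines. nodes_for_stage is a hypothetical helper, not part of any DataStage API; it only illustrates the pool-matching logic described above.

```python
def nodes_for_stage(config_nodes, constraint_pool=None):
    """Sketch of node selection for a parallel stage (illustrative only).

    config_nodes maps a node name to the set of pool names it belongs
    to. With no constraint, the stage runs on every node in the default
    pool ""; with a constraint, only on nodes belonging to that pool.
    """
    pool = constraint_pool if constraint_pool is not None else ""
    return sorted(n for n, pools in config_nodes.items() if pool in pools)
```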

You can typically find the default configuration file under the
<>\IBM\InformationServer\Server\Configurations folder, with the name default.apt.
Bear in mind that you will have to optimise this configuration for your server
based on your resources.

Now let's try our hand at interpreting a configuration file, using the sample below.

{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
This is a 3-node configuration file. Let's go through the basic entries and what
they represent.
Fastname – This refers to the node name on a fast network. From this we can infer
that node1 and node2 are on the same physical node, while node3 is on a different
physical node (identified by SVR2). So in node1 and node2, all the resources are
shared: the disk and scratch disk specified are actually shared between those two
logical nodes. Node3, on the other hand, has its own disk and scratch disk space.

Pools – Pools allow us to group processing nodes based on their functions and
characteristics. If you see an entry like "node0", or a reserved node pool like
"sort", "db2", etc., it means that the node is part of the specified pool. A node is
by default associated with the default pool, indicated by "". If you look at node3,
you can see that it is also associated with the sort pool; this ensures that the
Sort stage will run only on nodes that are part of the sort pool.
Resource disk – Specifies the location on your server where the processing node
will write all the dataset files. As you might know, when DataStage creates a
dataset, the file you see does not contain the actual data; the dataset file points
to the place where the actual data is stored, and that location is what is specified
on this line.

Resource scratchdisk – The location of temporary files created during DataStage
processes, such as lookups and sorts, is specified here. If the node is part of the
sort pool, the scratch disk can also be made part of the sort scratch disk pool;
this ensures that temporary files created during sorts are stored only in this
location. If no such pool is specified, DataStage checks whether any scratch disk
resources belong to the default scratch disk pool on the nodes that the sort is
specified to run on, and if so, that space is used.
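The fallback described above can be sketched as follows. scratchdisks_for_sort is a hypothetical illustration of the selection logic, not a real API.

```python
def scratchdisks_for_sort(node_scratchdisks):
    """Sketch of the scratch-disk selection described above (illustrative).

    node_scratchdisks maps a scratch-disk path to the set of scratch-disk
    pools it belongs to on a given node. Disks in the "sort" pool are
    preferred; otherwise disks in the default pool "" are used.
    """
    sort_disks = [p for p, pools in node_scratchdisks.items() if "sort" in pools]
    if sort_disks:
        return sorted(sort_disks)
    # No sort scratch-disk pool: fall back to the default pool.
    return sorted(p for p, pools in node_scratchdisks.items() if "" in pools)
```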

Below is the sample diagram for 1-node and 4-node resource allocation:

SAMPLE CONFIGURATION FILES

Configuration file for a simple SMP


 
A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below.
The file defines two nodes (node1 and node2) on a single "dev" server (an IP address
might be provided instead of a hostname) with three disk resources (d1 and d2 for data,
and Scratch as scratch space).
The configuration file is shown below:

{
node "node1"
{
               fastname "dev"
               pools ""
               resource disk "/IIS/Config/d1" { }
               resource disk "/IIS/Config/d2" { }
               resource scratchdisk "/IIS/Config/Scratch" { }
}
node "node2"
{
               fastname "dev"
               pools ""
               resource disk "/IIS/Config/d1" { }
               resource scratchdisk "/IIS/Config/Scratch" { }
}
}
Configuration file for a cluster / MPP / grid

A sample configuration file for a cluster or grid computing on 4 machines is shown
below.
The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]),
resource pools "bigdata" and "sort", and a temporary space.

{
node "node1"
{
            fastname "dev1"
            pools "" "n1" "s1" "sort"
            resource disk "/IIS/Config1/d1" {}
            resource disk "/IIS/Config1/d2" {pools "bigdata"}
            resource scratchdisk "/IIS/Config1/Scratch" {pools "sort"}
}
node "node2"
{
            fastname "dev2"
            pools "" "n2" "s2"
            resource disk "/IIS/Config2/d1" {}
            resource disk "/IIS/Config2/d2" {pools "bigdata"}
            resource scratchdisk "/IIS/Config2/Scratch" {}
}
node "node3"
{
            fastname "dev3"
            pools "" "n3" "s3"
            resource disk "/IIS/Config3/d1" {}
            resource scratchdisk "/IIS/Config3/Scratch" {}
}
node "node4"
{
            fastname "dev4"
            pools "n4" "s4"
            resource disk "/IIS/Config4/d1" {}
            resource scratchdisk "/IIS/Config4/Scratch" {}
}
}

Resource disk: here a disk path is defined. The data files of a dataset are stored
on the resource disk.

Resource scratchdisk: here, too, a folder path is defined. This path is used by the
parallel job stages to buffer data while the parallel job runs.
