Introduction to DataStage

DataStage Essentials v9.1


What is IBM InfoSphere DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects, such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Build, run, and monitor jobs, all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs, called job sequences

What is Information Server?


Suite of applications, including DataStage, that:
Share a common repository

DB2, by default
Repository stores objects built and used by Information Server applications

DataStage jobs
Metadata imported into DataStage

Share a common set of application services and functionality

Provided by the Metadata Server component

By default an application named server1, hosted by an IBM WebSphere Application Server (WAS) instance

Provided services include:

Security
Repository
Logging and reporting
Metadata management

Managed using the Information Server Web Console client

Information Server backbone

[Architecture diagram: product modules, including Information Services Director, Business Glossary, Information Analyzer, FastTrack, DataStage/QualityStage, Metadata Workbench, and MetaBrokers, sit on top of the Metadata Server, which provides metadata access services and metadata analysis services and is administered through the Information Server Web Console.]

Information Server Web Console

[Screenshot: the Web Console Administration and Reporting areas, used to manage Information Server users.]

DataStage architecture
DataStage clients
Administrator
Designer
Director

DataStage engines
Parallel engine
Runs parallel jobs

Server engine
Runs server jobs
Runs job sequences

DataStage Clients

[Screenshots of the three DataStage clients:
DataStage Administrator, showing project environment variables
DataStage Designer, showing the menus/toolbar, the job log, and a parallel job with a DB2 Connector stage
DataStage Director, showing log messages]

Developing in DataStage

Define global and project properties in Administrator
Import metadata into the Repository
Specifies the formats of the sources and targets accessed by your jobs
Build the job in Designer
Compile the job in Designer
Run the job and monitor the job log messages
The job log can be viewed either in Director or in Designer
In Designer, only the job log for the currently opened job is available
Jobs can be run from Director, from Designer, or from the command line (see the dsjob sketch after this list)
Performance statistics show up in the log and also on the Designer canvas as the job runs
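
As a rough sketch of the command-line route, a job can be started with the dsjob utility that ships with the DataStage engine. The project name (dstage1), job name (LoadCustomers), and parameter below are hypothetical, and the exact options available depend on your release.

# Run the job and wait for it, returning an exit code based on its finishing status
dsjob -run -mode NORMAL -param TargetDB=SAMPLE -jobstatus dstage1 LoadCustomers

# Print a summary of the job's log entries
dsjob -logsum dstage1 LoadCustomers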

DataStage project repository

[Screenshot of the Designer repository tree, showing a user-added folder alongside the standard Jobs folder and Table Definitions folder.]

Types of DataStage jobs

Parallel jobs
Executed by the DataStage parallel engine
Built-in capability for pipeline and partition parallelism
Compiled into OSH (Orchestrate shell), an executable script viewable in Designer and in the log

Server jobs
Executed by the DataStage Server engine
Use a completely different set of stages than parallel jobs
No built-in capability for partition parallelism
Runtime monitoring in the job log

Job sequences (batch jobs, controlling jobs)
A type of server job that runs and controls jobs and other activities specified on the diagram
Can run both parallel jobs and other job sequences
Provides a common interface to the set of jobs it controls
Runtime monitoring in the job log

Design elements of parallel jobs

Stages
Passive stages (E and L of ETL)
Read data
Write data
Examples: Sequential File, DB2, Oracle, Peek stages
Processor (active) stages (T of ETL)
Transform data (Transformer stage)
Filter data (Transformer stage)
Aggregate data (Aggregator stage)
Generate data (Row Generator stage)
Merge data (Join, Lookup stages)

Links
Pipes through which data moves from stage to stage

Job Parallelism

Pipeline parallelism
Transform, Enrich, and Load stages execute in parallel
Like a conveyor belt moving rows from stage to stage
Run downstream stages while upstream stages are still running
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
Has limits on scalability

Partition parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Each partition of data is processed by a copy of the same stage
Subsets are called partitions
For example, if the stage is Filter, each partition will be filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed

Three-node partitioning

[Diagram: incoming data is split into subset1, subset2, and subset3, each flowing to a copy of the stage running on Node 1, Node 2, and Node 3.]

Here the data is split into three partitions (nodes)
The stage is executed on each partition of the data separately and in parallel
If the data is evenly distributed, the data will be processed three times faster

Job design versus execution

A developer designs the job flow in DataStage Designer
At runtime, this job runs in parallel for any number of partitions (nodes)

Configuration file
Determines the degree of parallelism (number of partitions) of jobs that use it
Every job runs under a configuration file
Each DataStage project has a default configuration file
Specified by the $APT_CONFIG_FILE job parameter
Individual jobs can run under different configuration files than the project default
The same job can also run using different configuration files on different job runs

Example configuration file

[Screenshot of a configuration file, with callouts marking each node (partition) definition and the resources attached to the node.]
Checkpoint
1. True or false: DataStage Director is used to build and
compile your ETL jobs
2. True or false: Use Designer to monitor your job during
execution
3. True or false: Administrator is used to set global and project
properties

Checkpoint solutions
1. False. Jobs are built and compiled in Designer; Director is used to run and monitor them.
2. True. The job log is available both in Director and Designer.
In Designer, you can only view log messages for a job open
in Designer.
3. True.
