Introduction to DataStage

DataStage Essentials v9.1


What is IBM InfoSphere DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects, such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Build, run, and monitor jobs, all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs, called job sequences

What is Information Server?


Suite of applications, including DataStage, that:
Share a common repository

DB2, by default
Repository stores objects built and used by Information Server applications

DataStage jobs
Metadata imported into DataStage

Share a common set of application services and functionality

Provided by the Metadata Server component

By default an application named server1, hosted by an IBM WebSphere Application Server (WAS) instance

Provided services include:

Security
Repository
Logging and reporting
Metadata management

Managed using the Information Server Web Console client

Information Server backbone

[Architecture diagram: product modules, including Information Services Director, Business Glossary, Information Analyzer, FastTrack, DataStage/QualityStage, Metadata Workbench, and MetaBrokers, sit on top of the Metadata Server, which provides metadata access services and metadata analysis services and is administered through the Information Server Web Console.]

Information Server Web Console

[Screenshot: the Web Console Administration and Reporting areas, used to manage Information Server users.]

DataStage architecture
DataStage clients
Administrator
Designer
Director

DataStage engines
Parallel engine
Runs parallel jobs

Server engine
Runs server jobs
Runs job sequences

DataStage Clients

[Screenshots of the three DataStage clients:
DataStage Administrator, showing project environment variables
DataStage Designer, showing the menus/toolbar, the job log, and a parallel job with a DB2 Connector stage
DataStage Director, showing log messages]

Developing in DataStage

Define global and project properties in Administrator
Import metadata into the Repository
Specifies the formats of the sources and targets accessed by your jobs
Build the job in Designer
Compile the job in Designer
Run the job and monitor the job log messages
The job log can be viewed either in Director or in Designer
In Designer, only the job log for the currently opened job is available
Jobs can be run from Director, from Designer, or from the command line (see the dsjob sketch after this list)
Performance statistics show up in the log and also on the Designer canvas as the job runs
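
As a rough sketch of the command-line route, a job can be started with the dsjob utility that ships with the DataStage engine. The project name (dstage1), job name (LoadCustomers), and parameter below are hypothetical, and the exact options available depend on your release.

# Run the job and wait for it, returning an exit code based on its finishing status
dsjob -run -mode NORMAL -param TargetDB=SAMPLE -jobstatus dstage1 LoadCustomers

# Print a summary of the job's log entries
dsjob -logsum dstage1 LoadCustomers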

DataStage project repository

[Screenshot of the Designer repository tree, showing a user-added folder alongside the standard Jobs folder and Table Definitions folder.]

Types of DataStage jobs

Parallel jobs
Executed by the DataStage parallel engine
Built-in capability for pipeline and partition parallelism
Compiled into OSH (Orchestrate shell), an executable script viewable in Designer and in the log

Server jobs
Executed by the DataStage Server engine
Use a completely different set of stages than parallel jobs
No built-in capability for partition parallelism
Runtime monitoring in the job log

Job sequences (batch jobs, controlling jobs)
A type of server job that runs and controls jobs and other activities specified on the diagram
Can run both parallel jobs and other job sequences
Provides a common interface to the set of jobs it controls
Runtime monitoring in the job log

Design elements of parallel jobs

Stages
Passive stages (E and L of ETL)
Read data
Write data
Examples: Sequential File, DB2, Oracle, Peek stages
Processor (active) stages (T of ETL)
Transform data (Transformer stage)
Filter data (Transformer stage)
Aggregate data (Aggregator stage)
Generate data (Row Generator stage)
Merge data (Join, Lookup stages)

Links
Pipes through which data moves from stage to stage

Job Parallelism

Pipeline parallelism
Transform, Enrich, and Load stages execute in parallel
Like a conveyor belt moving rows from stage to stage
Run downstream stages while upstream stages are still running
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
Has limits on scalability

Partition parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Each partition of data is processed by a copy of the same stage
Subsets are called partitions
For example, if the stage is Filter, each partition will be filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed

Three-node partitioning

[Diagram: incoming data is split into subset1, subset2, and subset3, each flowing to a copy of the stage running on Node 1, Node 2, and Node 3.]

Here the data is split into three partitions (nodes)
The stage is executed on each partition of the data separately and in parallel
If the data is evenly distributed, the data will be processed three times faster

Job design versus execution

A developer designs the job flow in DataStage Designer
At runtime, this job runs in parallel for any number of partitions (nodes)

Configuration file
Determines the degree of parallelism (number of partitions) of jobs that use it
Every job runs under a configuration file
Each DataStage project has a default configuration file
Specified by the $APT_CONFIG_FILE job parameter
Individual jobs can run under different configuration files than the project default
The same job can also run using different configuration files on different job runs

Example configuration file

[Screenshot of a configuration file, with callouts marking each node (partition) definition and the resources attached to the node.]
Checkpoint
1. True or false: DataStage Director is used to build and
compile your ETL jobs
2. True or false: Use Designer to monitor your job during
execution
3. True or false: Administrator is used to set global and project
properties

Checkpoint solutions
1. False. Jobs are built and compiled in Designer; Director is used to run and monitor them.
2. True. The job log is available both in Director and Designer.
In Designer, you can only view log messages for a job open
in Designer.
3. True.
