Datastage Fundamentals: January 2008 Module 01: Introduction Slide 1-1
Introduction
DataStage Fundamentals
Objectives
Having completed this module the student will
be able:
to define the terms "project", "job", "stage"
and "link" in a DataStage context
to state the purpose of DataStage within
enterprise information integration
to identify types of parallelism and of data
partitioning
What Is DataStage?
"ETL" Tool
Extraction
from any source
Transformation
rich set of transformation capabilities
Loading
to any target
DataStage Editions
► Server Edition
► Enterprise Edition
► Extended Enterprise Edition
► Enterprise MVS Edition
Copyright © Ray Wurlod, 2005-2008. All rights reserved.
Metadata
Meta
above
Data
things given
Metadata
Business metadata
business rules, ownership
Technical metadata
table definitions, process specifications
Process metadata
what happened, when, success/fail
Metadata
Technical metadata imported
using DataStage tools
stored in Repository (database)
Business metadata from business analyst
stored in documentation
Terminology
Project
location where components stored
Job
unit of execution
Stage
part of job that performs specific task
Link
joins two stages; represents data flow
[Figure: job design with a Stage and a Link labelled]
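The four terms can be pictured as a small containment model. The sketch below is only an illustration of the definitions above, not how DataStage stores its components; the class names mirror the terms, while the `connect` helper and the sample names ("training", "load_customers") are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Part of a job that performs one specific task."""
    name: str


@dataclass
class Link:
    """Joins two stages and represents the data flow between them."""
    source: Stage
    target: Stage


@dataclass
class Job:
    """Unit of execution: stages connected by links."""
    name: str
    stages: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def connect(self, source: Stage, target: Stage) -> Link:
        # Hypothetical helper: register both stages, then join them.
        for stage in (source, target):
            if stage not in self.stages:
                self.stages.append(stage)
        link = Link(source, target)
        self.links.append(link)
        return link


@dataclass
class Project:
    """Location where components such as jobs are stored."""
    name: str
    jobs: dict = field(default_factory=dict)


project = Project("training")
job = Job("load_customers")
project.jobs[job.name] = job

extract, transform, load = Stage("extract"), Stage("transform"), Stage("load")
job.connect(extract, transform)
job.connect(transform, load)
```

Three stages joined by two links gives the simplest extract, transform, load shape.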
More Terminology
Orchestrate
original name for parallel execution environment
also name of execution shell (osh)
Data Set
a set of rows containing a known structure
may be "virtual" or "persistent"
Operator
defines processing action during data flow
takes Data Sets as input and output
[Figure: the job as designed compared with the job as executed]
Configuration File
Contains definitions of "processing nodes"
logical concept
unrelated to number of CPUs
may be on same machine, multiple machines
Specified by APT_CONFIG_FILE
environment variable
usually set up as a job parameter
More in next module
Processing Node
• Players
– the actual processes associated with operators
– combined players: one process only
– sends stderr, stdout to Section Leader
– establishes connections to other players for data flow, repartitioning
– cleans up upon completion
[Diagram: one Section Leader (SL) and three Players (P)]
Partitioning
Partitioning Definition
Using an algorithm to distribute rows over
available processing nodes
Each subset of rows is known as a partition of
the data
Definition includes "re-partitioning"
Goals of Partitioning
Distribute rows as evenly as possible over
available nodes
Guarantee that key values are adjacent when
necessary
Use simplest (lowest cost) algorithm
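The first two goals can pull against each other, which a small check makes visible. This is a minimal sketch in plain Python, not a DataStage facility; the function names, the `tolerance` parameter and the sample rows are all hypothetical.

```python
def partition_sizes(partitions):
    """Number of rows held by each partition."""
    return [len(part) for part in partitions]


def is_balanced(partitions, tolerance=1):
    """Goal 1: rows spread as evenly as possible over the nodes."""
    sizes = partition_sizes(partitions)
    return max(sizes) - min(sizes) <= tolerance


def keys_kept_together(partitions, key):
    """Goal 2: no key value appears in more than one partition."""
    owner = {}
    for index, part in enumerate(partitions):
        for row in part:
            if owner.setdefault(row[key], index) != index:
                return False
    return True


# Even row counts, but customer "A" is split across both partitions.
even_but_split = [
    [{"cust": "A"}, {"cust": "B"}],
    [{"cust": "A"}, {"cust": "C"}],
]

# Every customer sits in one partition, but the row counts are skewed.
grouped_but_skewed = [
    [{"cust": "A"}, {"cust": "A"}, {"cust": "B"}],
    [{"cust": "C"}],
]
```

A key-based algorithm is only worth its extra cost when a downstream stage needs the second property.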
Partitioning Algorithms
Non Key-Based
Round Robin
Random
Entire
Key-Based
Modulus
Hash
Range
Other
DB2
Same
Can also specify:
(Auto)
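The generic algorithms can be sketched in a few lines each. This is an illustration of the ideas only, not the engine's implementation: CRC32 stands in for whatever hash function the engine actually uses, the `boundaries` list is a simplified stand-in for a range map, and all function and parameter names are hypothetical. DB2 and Same are omitted because they depend on an existing table or an upstream stage.

```python
import bisect
import random
import zlib


def _empty(node_count):
    return [[] for _ in range(node_count)]


def round_robin(rows, node_count):
    """Deal rows to nodes in turn: row 0 to node 0, row 1 to node 1, ..."""
    parts = _empty(node_count)
    for position, row in enumerate(rows):
        parts[position % node_count].append(row)
    return parts


def random_partition(rows, node_count, seed=None):
    """Send each row to a randomly chosen node."""
    rng = random.Random(seed)
    parts = _empty(node_count)
    for row in rows:
        parts[rng.randrange(node_count)].append(row)
    return parts


def entire(rows, node_count):
    """Give every node a complete copy of all rows."""
    return [list(rows) for _ in range(node_count)]


def modulus(rows, node_count, key):
    """Integer key value modulo the number of nodes picks the node."""
    parts = _empty(node_count)
    for row in rows:
        parts[row[key] % node_count].append(row)
    return parts


def hash_partition(rows, node_count, key):
    """Hash of the key value picks the node (CRC32 is an assumption)."""
    parts = _empty(node_count)
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % node_count].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """Sorted, inclusive upper bounds split keys into len(boundaries) + 1 ranges."""
    parts = _empty(len(boundaries) + 1)
    for row in rows:
        parts[bisect.bisect_left(boundaries, row[key])].append(row)
    return parts


rows = [{"id": n} for n in range(6)]
by_turn = round_robin(rows, 3)
by_range = range_partition(rows, [1, 3], "id")
```

In every key-based variant, equal key values always land in the same partition; the non key-based variants make no such promise.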
[Figures: one diagram each for Round Robin, Entire, Modulus, Hash, Range and DB2 partitioning]
Same
Forces downstream stage to use same
partitioning algorithm as upstream stage
May not be possible
warning generated, "Same" ignored, Collector used
Partitioning Icons
Automatic
Repartitioning
Partitioning
Preserve
Collecting
Collecting
Collecting Definition
Gathering rows from multiple partitions into one
for stage executing in sequential mode
Four algorithms
(Auto)
Ordered
Round Robin
Sort Merge
[Diagram: a collector feeding a stage running sequentially]
Collection Algorithms
(Auto)
read any row from any partition
Round Robin
Ordered
all rows from first partition, then …
Sort Merge
preserve sorting from all inputs into sequential
(sorted) stream
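The three deterministic collectors can be sketched with the Python standard library. This is an illustration only, with hypothetical function names; (Auto) is left out because it takes whichever row is available first, so its output order is not fixed. Sort Merge assumes each partition is already sorted on the collecting key.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks partitions that have run out of rows


def ordered(partitions):
    """All rows from the first partition, then all from the second, ..."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions):
    """One row from each partition in turn, skipping exhausted ones."""
    collected = []
    for batch in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in batch if row is not _MISSING)
    return collected


def sort_merge(partitions, key=None):
    """Merge already-sorted partitions into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))


partitions = [[1, 2, 9], [3, 4], [5, 6, 7, 8]]

in_partition_order = ordered(partitions)       # [1, 2, 9, 3, 4, 5, 6, 7, 8]
in_turns = round_robin_collect(partitions)     # [1, 3, 5, 2, 4, 6, 9, 7, 8]
in_sorted_order = sort_merge(partitions)       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Only Sort Merge preserves the sort order across partitions; the other two preserve it only within each partition's own contribution.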
Review Questions
Answer the review questions for Module 1 in
your Lab book