DataStage Fundamentals: January 2008 Module 01: Introduction Slide 1-1

Copyright © Ray Wurlod, 2005-2008. All rights reserved.
Introduction
DataStage Fundamentals

Objectives
Having completed this module the student will
be able:
 to define the terms "project", "job", "stage"
and "link" in a DataStage context
 to state the purpose of DataStage within
enterprise information integration
 to identify types of parallelism and of data
partitioning

What Is DataStage?
 "ETL" Tool
 Extraction
 from any source
 Transformation
 rich set of transformation capabilities
 Loading
 to any target

DataStage Editions
► Server Edition
► Enterprise Edition
► Extended Enterprise Edition
► Enterprise MVS Edition

What DataStage Is Not

 CASE Tool (for designing databases)
 Discovery Tool
 Reporting Tool
 Fuzzy Matching / Survivorship
 Metadata Management Tool
 Anything except ETL Tool

How DataStage Works

 Graphical Design Tool
 draw picture of data flow
 generates appropriate executables
 Metadata Driven
 can import/create metadata

Metadata
 Meta
 above
 Data
 things given
 Information that describes data

 allows questions about data to be answered

Metadata
 Business metadata
 business rules, ownership
 Technical metadata
 table definitions, process specifications
 Process metadata
 what happened, when, success/fail

Metadata
 Technical metadata imported
 using DataStage tools
 stored in Repository (database)
 Business metadata from business analyst
 stored in documentation

Parallel Execution Environment
Execution on more than one node

Terminology
 Project
 location where components stored
 Job
 unit of execution
 Stage
 part of job that performs specific task
 Link
 joins two stages; represents data flow

(Diagram: two Stages joined by a Link)

More Terminology
 Orchestrate
 original name for parallel execution environment
 also name of execution shell (osh)
 Data Set
 a set of rows containing a known structure
 may be "virtual" or "persistent"
 Operator
 defines processing action during data flow
 takes Data Sets as input and output

Job Design versus Job Execution
Designed: (diagram)
Executed: (diagram)
… at runtime, this job runs in parallel for any
configuration (one node, four nodes, N nodes)
No need to modify or recompile the job design!

Configuration File
 Contains definitions of "processing nodes"
 logical concept
 unrelated to number of CPUs
 may be on same machine, multiple machines
 Specified by APT_CONFIG_FILE
 environment variable
 usually set up as a job parameter
 More in next module
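
As a rough illustration of "processing nodes" being a logical concept, the Python sketch below lists the node names found in a configuration file text and reads the APT_CONFIG_FILE environment variable. The helper names, the host name and the paths are hypothetical, and the sample layout (node, fastname, pools, resource entries) is an assumption about a typical file, not a definitive reference.

```python
import os
import re
from typing import List, Optional

# Assumption: each logical node is declared as: node "name" { ... }
NODE_PATTERN = re.compile(r'\bnode\s+"([^"]+)"')

# Hypothetical sample: two logical nodes defined on the same machine.
SAMPLE_CONFIG = '''
{
  node "node0"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch/ds" {pools ""}
  }
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch/ds" {pools ""}
  }
}
'''


def node_names(config_text: str) -> List[str]:
    """Return the names of the processing nodes declared in the text."""
    return NODE_PATTERN.findall(config_text)


def configured_file() -> Optional[str]:
    """Return the path held in APT_CONFIG_FILE, or None if it is not set."""
    return os.environ.get("APT_CONFIG_FILE")


print(node_names(SAMPLE_CONFIG))
print(configured_file())
```

Both nodes in the sample share one host name, which mirrors the point that the number of nodes is unrelated to the number of machines or CPUs.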

Execution of Parallel Jobs

 Orchestra metaphor
 overall control = Conductor process
 per-node control = Section Leader process
 operator execution = Player process
 script executed is called the Score

Job Execution: the Orchestra

• Conductor - initial process (on the Conductor Node)
– composes the Score
– creates Section Leader processes (one/node)
via fork() or rsh, distributes Score
– consolidates messages to DataStage log
– manages orderly shutdown
• Section Leader (one per Node)
– forks Player processes (one per operator)
– manages up/down communication
• Players
– the actual processes associated with operators
– combined players: one process only
– sends stderr, stdout to Section Leader
– establishes connections to other players for data flow,
repartitioning
– cleans up upon completion
(Diagram: a Conductor Node with the Conductor (C), and Processing Nodes each with a Section Leader (SL) and Players (P))
Image copyright © 2005 International Business Machines Corporation
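
The process hierarchy described above can be turned into simple arithmetic. The sketch below is illustrative only: the function name `process_counts` is hypothetical, and it assumes every operator runs on every node and that no players are combined, so it gives an upper bound.

```python
from typing import Dict


def process_counts(num_nodes: int, num_operators: int) -> Dict[str, int]:
    """Estimate process counts for one parallel job run.

    Assumptions: one Conductor, one Section Leader per node, and one
    Player per operator on each node (no combined players).
    """
    section_leaders = num_nodes
    players = num_nodes * num_operators
    return {
        "conductor": 1,
        "section_leaders": section_leaders,
        "players": players,
        "total": 1 + section_leaders + players,
    }


print(process_counts(num_nodes=4, num_operators=3))
```

Combined players reduce the player count, because several operators then share one process.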

Partitioning
Distributing rows over available processing nodes

Partitioning Definition
 Using an algorithm to distribute rows over
available processing nodes
 Each subset of rows is known as a partition of
the data
 Definition includes "re-partitioning"
 Specified on the input link of a stage

Goals of Partitioning
 Distribute rows as evenly as possible over
available nodes
 Guarantee that key values are adjacent when
necessary
 Use simplest (lowest cost) algorithm

Partitioning Algorithms
 Non Key-Based
 Round Robin
 Random
 Entire
 Key-Based
 Modulus
 Hash
 Range
Can also specify:
 (Auto)
 Other
 DB2
 Same

Round Robin
(Diagram: rows with key values 3, 4, 5, 0, 6, 1, 5, 0, 4 dealt in turn to Node #0, Node #1 and Node #2)
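
A minimal Python sketch of round robin partitioning, using the key values from the slide. The function name is hypothetical, and sending the first row to node 0 is an assumption made for illustration.

```python
from typing import List, Sequence


def round_robin_partition(rows: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Deal rows to partitions in turn: row i goes to partition i % num_nodes."""
    partitions: List[List[int]] = [[] for _ in range(num_nodes)]
    for position, row in enumerate(rows):
        partitions[position % num_nodes].append(row)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
for node, part in enumerate(round_robin_partition(keys, 3)):
    print(f"Node #{node}: {part}")
```

The key value plays no part in the placement, so equal keys can land on different nodes, but the row counts per node differ by at most one.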
4 4

Random (indeterminate, really)

(Diagram: rows with key values 3, 4, 5, 0, 6, 1, 5, 0, 4 scattered over Node #0, Node #1 and Node #2)

Entire

Key  Node #0  Node #1  Node #2
3    3        3        3
4    4        4        4
5    5        5        5
0    0        0        0
6    6        6        6
1    1        1        1
5    5        5        5
0    0        0        0
4    4        4        4
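
Entire partitioning can be sketched in a few lines of Python. The function name is hypothetical; the point shown is only that each node receives a full copy of the rows.

```python
from typing import List, Sequence


def entire_partition(rows: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Give every partition its own complete copy of the rows."""
    return [list(rows) for _ in range(num_nodes)]


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
for node, part in enumerate(entire_partition(keys, 3)):
    print(f"Node #{node}: {part}")
```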

Modulus
Key  Node #0  Node #1  Node #2
3    3
4             4
5                      5
0    0
6    6
1             1
5                      5
0    0
4             4
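
A short Python sketch of modulus partitioning with the slide's key values and three nodes. The function name is hypothetical; the rule shown is simply key modulo number of partitions, for non-negative integer keys.

```python
from typing import List, Sequence


def modulus_partition(keys: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Send each row to partition (key % num_nodes)."""
    partitions: List[List[int]] = [[] for _ in range(num_nodes)]
    for key in keys:
        partitions[key % num_nodes].append(key)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
for node, part in enumerate(modulus_partition(keys, 3)):
    print(f"Node #{node}: {part}")
```

Rows with equal keys always share a node, but the distribution is only as even as the key values allow: here node 0 receives four rows and node 2 receives two.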

Hash

(Diagram: rows with key values 3, 4, 5, 0, 6, 1, 5, 0, 4 assigned to Node #0, Node #1 and Node #2 by hashing the Key column)
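
The idea of hash partitioning can be sketched in Python as below. CRC32 is used purely as a stand-in hash function (an assumption for this sketch, not the hash that DataStage uses), and the function name is hypothetical.

```python
import zlib
from typing import List, Sequence


def hash_partition(keys: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Send each row to the partition chosen by hashing its key."""
    partitions: List[List[int]] = [[] for _ in range(num_nodes)]
    for key in keys:
        # CRC32 is only a stand-in hash function for this sketch.
        digest = zlib.crc32(str(key).encode("utf-8"))
        partitions[digest % num_nodes].append(key)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
for node, part in enumerate(hash_partition(keys, 3)):
    print(f"Node #{node}: {part}")
```

Whatever hash function is used, the property that matters is that equal key values always reach the same node. Unlike Modulus, the key does not have to be an integer.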

Range

(Diagram: rows with key values 3, 4, 5, 0, 6, 1, 5, 0, 4 assigned to Node #0, Node #1 and Node #2 by range of the Key column)
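
A minimal Python sketch of range partitioning. The boundary values 2 and 4 are made up for illustration (keys below 2 go to node 0, keys from 2 up to but not including 4 go to node 1, the rest go to node 2), and the function name is hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def range_partition(keys: Sequence[int], boundaries: Sequence[int]) -> List[List[int]]:
    """Send each row to the partition whose key range contains its key.

    boundaries must be sorted; they split the key space into
    len(boundaries) + 1 consecutive ranges.
    """
    partitions: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect_right(boundaries, key)].append(key)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
for node, part in enumerate(range_partition(keys, boundaries=[2, 4])):
    print(f"Node #{node}: {part}")
```

With these made-up boundaries node 2 receives five of the nine rows, which shows why the boundaries need to suit the actual spread of key values.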

DB2

(Diagram: rows with key values 3, 4, 5, 0, 6, 1, 5, 0, 4; the target nodes are shown as "?")
Depends upon DB2 table name property

(Auto) Typical Choice

 Round Robin
 sequential to parallel
 Same
 parallel to parallel
 Hash
 stages that require matched key values
 Entire
 non-sparse reference inputs to Lookup

Same
 Forces downstream stage to use same
partitioning algorithm as upstream stage
 May not be possible
 warning generated, "Same" ignored, Collector used

Partitioning Icons
 Automatic
 Repartitioning
 Partitioning
 Preserve
 Collecting
"Preserve Partitioning" Flag

 For stages that use (Auto)
 Set on Advanced tab
 Three settings:
 Set
 Clear
 Propagate
 Part of Data Set metadata

Summary: Partitioning Strategy

 Stage (input link) needs grouping of related
key values?
 use Hash (or Modulus if integer key)
 Range may be appropriate
 Grouping not required?
 use Round Robin
 Optimize over entire flow
 Avoid unnecessary re-partitioning

Collecting
Bringing rows from multiple partitions into one

Collecting Definition
 Gathering rows from multiple partitions into one
 for stage executing in sequential mode
 Four algorithms
 (Auto)
 Ordered
 Round Robin
 Sort Merge
 Specified on the input link of a stage
(Diagram: collector feeding a Stage running Sequentially)

Specifying Collector Type

 Drop down list on
Input link
 Partitioning tab is
captioned "Collector
type" if:
 stage running in
sequential mode
 upstream stage
running in parallel
mode

Collection Algorithms
 (Auto)
 read any row from any partition
 Round Robin
 Ordered
 all rows from first partition, then …
 Sort Merge
 preserve sorting from all inputs into sequential
(sorted) stream
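
The three deterministic collection algorithms can be sketched in Python as follows. The function names are hypothetical and the sample partitions are made up; (Auto) is not modelled because it reads whichever row is available first.

```python
import heapq
from itertools import chain, zip_longest
from typing import List, Sequence

_MISSING = object()  # marks exhausted partitions in round robin collection


def ordered_collect(partitions: Sequence[Sequence[int]]) -> List[int]:
    """All rows from the first partition, then all from the second, and so on."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions: Sequence[Sequence[int]]) -> List[int]:
    """One row from each partition in turn until every partition is empty."""
    collected: List[int] = []
    for batch in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in batch if row is not _MISSING)
    return collected


def sort_merge_collect(partitions: Sequence[Sequence[int]]) -> List[int]:
    """Merge partitions that are already sorted into one sorted stream."""
    return list(heapq.merge(*partitions))


sorted_partitions = [[0, 3, 5], [0, 4, 6], [1, 4, 5]]
print(ordered_collect(sorted_partitions))
print(round_robin_collect(sorted_partitions))
print(sort_merge_collect(sorted_partitions))
```

Only the sort merge result is fully sorted, and it relies on each input partition already being sorted.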

Summary: Collector Strategy

 Generally choose (Auto)
 Sort Merge to generate single sorted stream
of data
 Ordered only appropriate when sorted input
has been range partitioned
 Round robin rarely used
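
The remark about the Ordered collector can be checked with a small Python sketch. The boundaries 2 and 4 and the function names are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import chain
from typing import List, Sequence


def range_partition(keys: Sequence[int], boundaries: Sequence[int]) -> List[List[int]]:
    """Partition keys into consecutive ranges split at the sorted boundaries."""
    partitions: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect_right(boundaries, key)].append(key)
    return partitions


def round_robin_partition(keys: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Deal keys to partitions in turn."""
    partitions: List[List[int]] = [[] for _ in range(num_nodes)]
    for position, key in enumerate(keys):
        partitions[position % num_nodes].append(key)
    return partitions


def sort_then_ordered_collect(partitions: Sequence[Sequence[int]]) -> List[int]:
    """Sort each partition, then read the partitions one after another."""
    return list(chain.from_iterable(sorted(part) for part in partitions))


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
from_range = sort_then_ordered_collect(range_partition(keys, [2, 4]))
from_round_robin = sort_then_ordered_collect(round_robin_partition(keys, 3))
print(from_range, from_range == sorted(keys))
print(from_round_robin, from_round_robin == sorted(keys))
```

With range partitioning every key in one partition is below every key in the next, so reading the sorted partitions in order gives a sorted stream. With any other partitioning the same collector generally does not.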

Sequential Cannot Preserve…

 Warning message may be logged
A sequential operator cannot preserve the partitioning of
the parallel data set on input port 0.
 Caused by upstream parallel stage's
"Preserve Partitioning" flag being Set or
Propagate
 downstream sequential stage cannot comply
 Change to Clear to eliminate Warning

Review Questions
 Answer the review questions for Module 1 in
your Lab book
