
Copyright © Ray Wurlod, 2005-2008. All rights reserved.

Introduction

DataStage Fundamentals

January 2008 Module 01: Introduction Slide 1-1



Objectives
Having completed this module the student will be able:
• to define the terms "project", "job", "stage" and "link" in a DataStage context
• to state the purpose of DataStage within enterprise information integration
• to identify types of parallelism and of data partitioning


What Is DataStage?
• "ETL" Tool
  – Extraction: from any source
  – Transformation: rich set of transformation capabilities
  – Loading: to any target


DataStage Editions
► Server Edition
► Enterprise Edition
► Extended Enterprise Edition
► Enterprise MVS Edition


What DataStage Is Not

• CASE Tool (for designing databases)
• Discovery Tool
• Reporting Tool
• Fuzzy Matching / Survivorship
• Metadata Management Tool
• Anything except ETL Tool


How DataStage Works

• Graphical Design Tool
  – draw picture of data flow
  – generates appropriate executables
• Metadata Driven
  – can import/create metadata


Metadata
• Meta
  – above
• Data
  – things given

• Information that describes data
  – allows questions about data to be answered


Metadata
• Business metadata
  – business rules, ownership
• Technical metadata
  – table definitions, process specifications
• Process metadata
  – what happened, when, success/fail


Metadata
• Technical metadata imported
  – using DataStage tools
  – stored in Repository (database)
• Business metadata from business analyst
  – stored in documentation


Parallel Execution Environment

Execution on more than one node


Terminology
• Project
  – location where components are stored
• Job
  – unit of execution
• Stage
  – part of job that performs specific task
• Link
  – joins two stages; represents data flow

[Figure: two Stages joined by a Link]


More Terminology
• Orchestrate
  – original name for parallel execution environment
  – also name of execution shell (osh)
• Data Set
  – a set of rows with a known structure
  – may be "virtual" or "persistent"
• Operator
  – defines processing action during data flow
  – takes Data Sets as input and output


Job Design versus Job Execution

Designed:

… at runtime, this job runs in parallel for any configuration (one node, four nodes, N nodes)

Executed:

No need to modify or recompile the job design!
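
The point that one design runs unchanged on any number of nodes can be illustrated outside DataStage with a small Python sketch. The names `run_job` and `double`, and the round robin split, are illustrative assumptions and not DataStage APIs; only the node count changes between runs.

```python
from typing import Callable, List


def run_job(rows: List[int], operator: Callable[[int], int], node_count: int) -> List[int]:
    """Run the same operator over every partition; node_count comes from configuration."""
    # Partition: deal the rows round robin across the configured nodes.
    partitions = [rows[i::node_count] for i in range(node_count)]
    # Execute: each node applies the unchanged operator to its own partition.
    processed = [[operator(row) for row in part] for part in partitions]
    # Collect: gather all partitions back into a single list.
    return [row for part in processed for row in part]


def double(row: int) -> int:
    """A stand-in for the processing an operator performs."""
    return row * 2


rows = [3, 4, 5, 0, 6, 1, 5, 0, 4]
# The "design" (operator and flow) is fixed; only the node count differs.
results = {n: sorted(run_job(rows, double, n)) for n in (1, 4, 8)}
```

Sorting the collected rows makes the comparison independent of arrival order, which does vary with the number of nodes.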



Configuration File
• Contains definitions of "processing nodes"
  – logical concept
  – unrelated to number of CPUs
  – may be on the same machine or on multiple machines
• Specified by APT_CONFIG_FILE
  – environment variable
  – usually set up as a job parameter
• More in next module


Execution of Parallel Jobs

• Orchestra metaphor
  – overall control = Conductor process
  – per-node control = Section Leader process
  – operator execution = Player process
  – script executed is called the Score


Job Execution: the Orchestra


• Conductor - initial process
  – composes the Score
  – creates Section Leader processes (one per node) via fork() or rsh, distributes Score
  – consolidates messages to DataStage log
  – manages orderly shutdown
• Section Leader (one per Node)
  – forks Player processes (one per operator)
  – manages up/down communication
• Players
  – the actual processes associated with operators
  – combined players: one process only
  – send stderr, stdout to Section Leader
  – establish connections to other players for data flow, repartitioning
  – clean up upon completion

[Figure: a Conductor Node running the Conductor (C), and two Processing Nodes, each running a Section Leader (SL) with three Players (P)]

Image copyright © 2005 International Business Machines Corporation
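
As a rough illustration of the process hierarchy described above, the following Python sketch counts the processes a job would start. It assumes every operator runs on every node, and that when operators are combined they all share a single Player per node; `count_processes` and `ProcessCounts` are hypothetical names, not part of DataStage.

```python
from dataclasses import dataclass


@dataclass
class ProcessCounts:
    conductor: int
    section_leaders: int
    players: int

    @property
    def total(self) -> int:
        return self.conductor + self.section_leaders + self.players


def count_processes(node_count: int, operator_count: int, combined: bool = False) -> ProcessCounts:
    """Estimate process counts for a job in which every operator runs on every node."""
    # Assumption: combining puts all operators into one Player process per node.
    players_per_node = 1 if combined else operator_count
    return ProcessCounts(
        conductor=1,                      # one initial process overall
        section_leaders=node_count,       # one per processing node
        players=node_count * players_per_node,
    )


counts = count_processes(node_count=2, operator_count=3)
combined_counts = count_processes(node_count=2, operator_count=3, combined=True)
```

With two nodes and three operators this gives one Conductor, two Section Leaders and six Players, matching the layout in the figure; with combination the Player count drops to two.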


Partitioning

Distributing rows over available processing nodes


Partitioning Definition
• Using an algorithm to distribute rows over available processing nodes
• Each subset of rows is known as a partition of the data
• Definition includes "re-partitioning"
• Specified on the input link of a stage


Goals of Partitioning
• Distribute rows as evenly as possible over available nodes
• Guarantee that key values are adjacent when necessary
• Use simplest (lowest cost) algorithm


Partitioning Algorithms
• Non Key-Based
  – Round Robin
  – Random
  – Entire
• Key-Based
  – Modulus
  – Hash
  – Range

Can also specify:
• (Auto)
• Other
  – DB2
  – Same


Round Robin

[Figure: input rows with values 3, 4, 5, 0, 6, 1, 5, 0, 4, each row placed on one of Node #0, Node #1 and Node #2]
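
A minimal Python sketch of round robin distribution, using the nine values from the figure. It assumes rows arrive in the order listed and that the first row goes to Node #0; the slide does not state either, and `round_robin_partition` is a hypothetical name.

```python
from typing import List, Sequence


def round_robin_partition(rows: Sequence[int], node_count: int) -> List[List[int]]:
    """Deal rows to the nodes in turn, ignoring their values."""
    partitions: List[List[int]] = [[] for _ in range(node_count)]
    for position, row in enumerate(rows):
        # The destination depends only on arrival position, never on the data.
        partitions[position % node_count].append(row)
    return partitions


rows = [3, 4, 5, 0, 6, 1, 5, 0, 4]
round_robin_result = round_robin_partition(rows, 3)
# round_robin_result == [[3, 0, 5], [4, 6, 0], [5, 1, 4]]
```

Each node receives exactly three rows, but equal values (for example the two 5s) end up on different nodes.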


Random (indeterminate, really)

[Figure: input rows with values 3, 4, 5, 0, 6, 1, 5, 0, 4, each row placed on one of Node #0, Node #1 and Node #2]


Entire

Input   Node #0   Node #1   Node #2
3       3         3         3
4       4         4         4
5       5         5         5
0       0         0         0
6       6         6         6
1       1         1         1
5       5         5         5
0       0         0         0
4       4         4         4
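
Entire can be sketched in a few lines of Python: every node receives its own complete copy of the input. `entire_partition` is a hypothetical name used only for illustration.

```python
from typing import List, Sequence


def entire_partition(rows: Sequence[int], node_count: int) -> List[List[int]]:
    """Give every node a full, independent copy of all rows."""
    return [list(rows) for _ in range(node_count)]


rows = [3, 4, 5, 0, 6, 1, 5, 0, 4]
entire_result = entire_partition(rows, 3)
```

The total number of rows held across the nodes is the input size multiplied by the node count, which is why this method suits small reference data rather than large inputs.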


Modulus

[Figure: input rows with Key values 3, 4, 5, 0, 6, 1, 5, 0, 4, each row placed on one of Node #0, Node #1 and Node #2 according to its Key]
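
A minimal Python sketch of modulus partitioning, assuming the node number is the integer key modulo the number of nodes; `modulus_partition` is a hypothetical name. The key values are those from the figure.

```python
from typing import List, Sequence


def modulus_partition(keys: Sequence[int], node_count: int) -> List[List[int]]:
    """Send each row to the node numbered key mod node_count."""
    partitions: List[List[int]] = [[] for _ in range(node_count)]
    for key in keys:
        partitions[key % node_count].append(key)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
modulus_result = modulus_partition(keys, 3)
# modulus_result == [[3, 0, 6, 0], [4, 1, 4], [5, 5]]
```

Equal keys always land on the same node, but the distribution is only as even as the key values allow: here the nodes receive four, three and two rows.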


Hash

[Figure: input rows with Key values 3, 4, 5, 0, 6, 1, 5, 0, 4, each row placed on one of Node #0, Node #1 and Node #2 according to its Key]
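
Hash partitioning follows the same pattern as modulus, except that the key is first passed through a hash function, so it works for non-integer keys too. The sketch below uses CRC32 purely as a stand-in; the hash function DataStage actually uses is not given in this module, and `hash_partition` is a hypothetical name.

```python
import zlib
from typing import List, Sequence


def hash_partition(keys: Sequence[int], node_count: int) -> List[List[int]]:
    """Send each row to the node chosen by hashing its key."""
    partitions: List[List[int]] = [[] for _ in range(node_count)]
    for key in keys:
        # CRC32 is an assumption for illustration; any deterministic hash would do.
        digest = zlib.crc32(str(key).encode("utf-8"))
        partitions[digest % node_count].append(key)
    return partitions


hash_keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
hash_result = hash_partition(hash_keys, 3)
```

Whatever the hash function, the property that matters is that rows with equal keys are always placed on the same node.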


Range

[Figure: input rows with Key values 3, 4, 5, 0, 6, 1, 5, 0, 4, each row placed on one of Node #0, Node #1 and Node #2 according to its Key]
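
Range partitioning can be sketched in Python with a sorted list of boundaries. The boundaries `[1, 4]` below are hand-picked assumptions for the nine key values in the figure, and `range_partition` is a hypothetical name; the module does not say how the ranges are chosen.

```python
from bisect import bisect_left
from typing import List, Sequence


def range_partition(keys: Sequence[int], upper_bounds: Sequence[int]) -> List[List[int]]:
    """upper_bounds[i] is the largest key sent to node i; the last node takes the rest."""
    partitions: List[List[int]] = [[] for _ in range(len(upper_bounds) + 1)]
    for key in keys:
        # bisect_left finds the first boundary that is >= key.
        partitions[bisect_left(upper_bounds, key)].append(key)
    return partitions


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
range_result = range_partition(keys, upper_bounds=[1, 4])
# range_result == [[0, 1, 0], [3, 4, 4], [5, 6, 5]]
```

With well-chosen boundaries the nodes are evenly loaded and every key on Node #0 is smaller than every key on Node #1, and so on.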


DB2

[Figure: input rows with values 3, 4, 5, 0, 6, 1, 5, 0, 4; the destination among Node #0, Node #1 and Node #2 is shown as "?" for each row. Caption: Depends upon DB2 table name property]


(Auto) Typical Choice

• Round Robin
  – sequential to parallel
• Same
  – parallel to parallel
• Hash
  – stages that require matched key values
• Entire
  – non-sparse reference inputs to Lookup


Same
• Forces downstream stage to use the same partitioning as the upstream stage (rows are not re-partitioned)
• May not be possible
  – warning generated, "Same" ignored, Collector used


Partitioning Icons

• Automatic
• Repartitioning
• Partitioning
• Preserve
• Collecting


"Preserve Partitioning" Flag


• For stages that use (Auto)
• Set on Advanced tab
• Three settings:
  – Set
  – Clear
  – Propagate
• Part of Data Set metadata


Summary: Partitioning Strategy

• Stage (input link) needs grouping of related key values?
  – use Hash (or Modulus if integer key)
  – Range may be appropriate
• Grouping not required?
  – use Round Robin
• Optimize over entire flow
• Avoid unnecessary re-partitioning
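
The difference between the two branches of this strategy can be checked with a short Python sketch that reports which key values end up spread over more than one node. The partitioners are simplified stand-ins written for this example (all function names are hypothetical), using the key values from the earlier figures.

```python
from collections import Counter
from typing import List, Sequence


def round_robin_partition(keys: Sequence[int], node_count: int) -> List[List[int]]:
    """Non key-based: destination depends on arrival position only."""
    return [list(keys[i::node_count]) for i in range(node_count)]


def modulus_partition(keys: Sequence[int], node_count: int) -> List[List[int]]:
    """Key-based: destination depends on the key value only."""
    partitions: List[List[int]] = [[] for _ in range(node_count)]
    for key in keys:
        partitions[key % node_count].append(key)
    return partitions


def keys_split_across_nodes(partitions: List[List[int]]) -> List[int]:
    """Return the key values that appear on more than one node."""
    nodes_per_key: Counter = Counter()
    for part in partitions:
        for key in set(part):
            nodes_per_key[key] += 1
    return sorted(key for key, nodes in nodes_per_key.items() if nodes > 1)


keys = [3, 4, 5, 0, 6, 1, 5, 0, 4]
split_round_robin = keys_split_across_nodes(round_robin_partition(keys, 3))
split_modulus = keys_split_across_nodes(modulus_partition(keys, 3))
# split_round_robin == [0, 4, 5]; split_modulus == []
```

A stage that groups by key would see incomplete groups for 0, 4 and 5 under round robin, whereas the key-based method keeps every group on a single node.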


Collecting

Bringing rows from multiple partitions into one


Collecting Definition
• Gathering rows from multiple partitions into one
  – for stage executing in sequential mode
• Four algorithms
  – (Auto)
  – Ordered
  – Round Robin
  – Sort Merge
• Specified on the input link of a stage

[Figure: a collector feeding a Stage running Sequentially]

Specifying Collector Type

• Drop down list on Input link
• Partitioning tab is captioned "Collector type" if:
  – stage running in sequential mode
  – upstream stage running in parallel mode


Collection Algorithms
• (Auto)
  – read any row from any partition
• Round Robin
• Ordered
  – all rows from first partition, then …
• Sort Merge
  – preserve sorting from all inputs into sequential (sorted) stream
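
The three deterministic collection algorithms can be sketched in Python as follows. (Auto) is not modelled because its order depends on which partition has a row ready. The function names are hypothetical, and Ordered is assumed to continue with the second partition, third partition, and so on.

```python
import heapq
from itertools import chain, zip_longest
from typing import List


def ordered_collect(partitions: List[List[int]]) -> List[int]:
    """All rows from the first partition, then all rows from the next, and so on."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions: List[List[int]]) -> List[int]:
    """One row from each partition in turn, skipping partitions that have run out."""
    missing = object()
    rounds = zip_longest(*partitions, fillvalue=missing)
    return [row for group in rounds for row in group if row is not missing]


def sort_merge_collect(partitions: List[List[int]]) -> List[int]:
    """Merge partitions that are each already sorted into one sorted stream."""
    return list(heapq.merge(*partitions))


# Sorted and range partitioned input: Ordered already yields a sorted stream.
range_partitions = [[0, 0, 1], [3, 4, 4], [5, 5, 6]]
ordered_rows = ordered_collect(range_partitions)
round_robin_rows = round_robin_collect(range_partitions)

# Sorted within each partition but not range partitioned: Sort Merge is needed.
key_partitions = [[0, 0, 3, 6], [1, 4, 4], [5, 5]]
merged_rows = sort_merge_collect(key_partitions)
```

Both `ordered_rows` and `merged_rows` come out as `[0, 0, 1, 3, 4, 4, 5, 5, 6]`, while `round_robin_rows` interleaves the partitions as `[0, 3, 5, 0, 4, 5, 1, 4, 6]`.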


Summary: Collector Strategy

• Generally choose (Auto)
• Sort Merge to generate single sorted stream of data
• Ordered only appropriate when sorted input has been range partitioned
• Round Robin rarely used


Sequential Cannot Preserve…

• Warning message may be logged
    A sequential operator cannot preserve the partitioning of the parallel data set on input port 0.
• Caused by upstream parallel stage's "Preserve Partitioning" flag being Set or Propagate
  – downstream sequential stage cannot comply
• Change to Clear to eliminate Warning


Review Questions
• Answer the review questions for Module 1 in your Lab book
