Introduction To ETL and DataStage
ETL Basics
• Extraction Transformation & Load
Extracts data from source systems
Enforces data quality and consistency standards
Conforms data from different sources
Loads data to target systems
• Usually a batch process
Involves large volumes of data
• Scenarios
Load a data warehouse or data mart for analytical and reporting applications
Data Integration
Load packaged applications, or external systems, through their APIs or interface
databases
Data Migration
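The extract–transform–load flow described above can be sketched in a few lines. This is a minimal illustration only; all function and field names are hypothetical, not DataStage API.

```python
# Minimal ETL sketch: extract rows, enforce a quality rule,
# conform values from different sources, load into a target.
# All names here are illustrative, not part of any real ETL tool.

def extract(source):
    """Extract: read raw records from a source system."""
    return list(source)

def transform(rows):
    """Transform: enforce quality (drop rows with a missing id)
    and conform (normalise country codes used by different sources)."""
    conform = {"UK": "GB", "GB": "GB", "USA": "US", "US": "US"}
    out = []
    for r in rows:
        if r.get("id") is None:          # quality rule: reject bad record
            continue
        r = dict(r, country=conform.get(r["country"], r["country"]))
        out.append(r)
    return out

def load(rows, target):
    """Load: write conformed rows to the target system."""
    target.extend(rows)

source = [{"id": 1, "country": "UK"},
          {"id": None, "country": "US"},   # fails the quality rule
          {"id": 2, "country": "USA"}]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)  # two conformed rows, country codes GB and US
```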
• Ideal tool for data integration projects – such as, data warehouses, data marts,
and system migrations
• Import, export, create, and manage metadata for use within jobs
[Architecture diagram: data flows from Sources through the Engine to Targets;
the Engine reads & writes metadata held in the Metadata Repository]
• Server Edition
Lower-end version, much less expensive
Includes Server Engine, supports only Server Jobs
Sufficient for less performance critical applications
MetaStage can also be packaged with it
• MVS Edition
An extension that allows generation of COBOL code & JCL for execution on mainframes
Common development environment, but involves porting & compiling the code onto the mainframe
• SOA Edition
RTI component to handle real-time interface
Allows job components to be exposed as web-services
Multiple servers can service requests routed through the RTI component
Note that the web service client component is available even without purchasing the SOA Edition
• Repository:
• Contains all the metadata, mapping rules, etc.
• DataStage applications are organized into Projects, each server can handle multiple
projects
DataStage repository is maintained in an internal file format, not in a database
• Windows-based components
• Need to access the server at development time
• Designer: used to create DataStage ‘jobs’, which are compiled to create the executables;
also used to import & export component definitions
• Director: validate, schedule, run, and monitor jobs
• Administrator: setting up users, creating and moving projects, setting up
purging criteria, and setting environment variables
• Designer & Director can connect to one Project at a time
Project
• Usually created for each application (or version of an application, e.g. Test,
Dev, etc.)
• Multiple projects can exist on a single server box
• Associated with a specific directory with the same name as the Project: the
“Repository”, which contains all metadata associated with the project
• Consists of
DataStage Server & Parallel Jobs
Pre-built components (Stages, Functions, etc.)
User-defined components
• User Roles & Privileges set at this level
• Managed through the Information Server Web console / DS Administrator client
tool
• Connected to through other client components
Table Definition
Schema Files
• External metadata definition for a sequential file, with a specific format & syntax; associated
with a data file at run-time
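A schema file for a comma-delimited sequential file looks roughly like the following sketch. Field names are illustrative, and the exact property keywords should be checked against the product documentation:

```text
record
  {final_delim=end, delim=',', quote=double}
(
  CustId: int32;
  Name: string[max=30];
  Region: nullable string[2];
)
```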
Job
• Executable unit of work that can be compiled & executed independently or as part
of a data flow stream
• Created using DS Designer Client (Compile & Execute also available through
Designer)
• Managed (copy, rename, import, export) through DS Designer
• Executed, monitored through DS Director, Log also available through Director
• Parallel Jobs (Available with Enterprise Edition):
• have built-in functionality for Pipeline and Partitioning Parallelism
• Compiled into OSH (Orchestrate shell) scripts
• The OSH executes “operators”, which are instances of executable C++ classes
• Server Jobs (Available with Enterprise as well as Server Editions):
• Compiled into BASIC (interpreted pseudo-code)
• Limited functionality and parallelism
• Can accept parameters **
• Reads & writes from one or more files/tables, may include transformations
• Collection of stages & links
Stages
• Pre-built component to
• Perform a frequently required operation on a record or set of records, e.g.
Aggregate, Sort, Join, Transform, etc.
• Read or write into a source or target table or file
Links
• Depicts flow of data between stages
Data Sets
• Data is internally carried through links in the form of Data Sets
• DataStage provides facility to “land” or store this data in the form of files
• Recommended for staging data, as the data is already partitioned & sorted; so a fast
way of sharing/passing data between jobs
• Not recommended for back-ups or for sharing between applications as it is not
readable, except through DataStage
Shared Containers
• Reusable job elements, comprising stages and links
Job Sequence
• Definition of a workflow, executing jobs (or sub sequences), routines, OS commands, etc.
• Can accept specifications for dependency, e.g.
• when file A arrives, execute Job B
• Execute Job A, On Failure of Job A Execute OS Command <<XXX>> On Completion of Job
A execute Job B & C
• Can invoke parallel as well as server jobs
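The dependency logic described above can be sketched as follows. A real Job Sequence is drawn graphically in DS Designer, not coded; job names here are hypothetical.

```python
# Sketch of Job Sequence dependency logic: execute Job A, then on
# completion run Jobs B & C, or on failure run an OS command instead.
# run_job is a stand-in; real sequences are built in DS Designer.

def run_job(name):
    """Stand-in for executing a DataStage job; returns True on success."""
    return name != "JobThatFails"

def sequence():
    executed = []
    if run_job("Job A"):
        # On completion of Job A, execute Job B and Job C
        for job in ("Job B", "Job C"):
            run_job(job)
            executed.append(job)
    else:
        # On failure of Job A, execute an OS command instead
        executed.append("os-command")
    return executed

print(sequence())  # ['Job B', 'Job C']
```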
DS API
• SDK functions
• Can be embedded into C++ code, invoked through the command line or from shell scripts
• Can retrieve information, compile, start, & stop jobs
Configuration File
• Defines the system size & configuration applicable to the job, in terms of nodes, node
pools, mapped to disk space & assigned scratch disk space
• Details maintained external to the job design
• Different files can be used according to individual job requirements
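A configuration file defines named nodes with disk and scratch-disk resources, roughly as in the sketch below. Hostnames and paths are illustrative only:

```text
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data1" {pools ""}
    resource scratchdisk "/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data2" {pools ""}
    resource scratchdisk "/ds/scratch2" {pools ""}
  }
}
```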
Environment Variables
• Set or defined through the Administrator at a project level
• Overridden at a job level
• Types
• Standard/Generic variables: affect the design and running of parallel jobs, e.g. buffering,
message logging, etc.
• User Defined Variables
• We Saw:
• What, Why & How ETL
• DataStage
• Architecture
• Flavors
• Components & Other Features
e.g. If input is:
1  City 1  Z1  10
2  City 2  Z1  10
3  City 3  Z1  20
4  City 4  Z2  20
5  City 5  Z2  30
the output rows shown are:
3  City 3  Z1  20  800   50
4  City 4  Z2  20  800   40
5  City 5  Z2  30  1200  60
• Menu Option: Import > Table Definitions > Sequential File Definitions
• Browse to the directory & select source file.
• Select category under which to save the table definition & the name of the table definition
• Click on Import
• Step 2 …
• Define formatting (e.g. fixed width/delimited, what end of line character has been used, does
the first line contain column names, etc.)
• Set Column Names (if file does not already contain them), & widths
• Open Designer
• Directly through Desktop or through tools menu in Director OR
• Create a new “Parallel Job”
• Save within the chosen ‘Category’ or folder
[Designer window screenshot: Repository pane, Design pane, Palette]
• Step 4 Contd.
• Step 6 – Run
• Features
• Normally executes in sequential mode**
• Can read from multiple files with same metadata
• Can accept wild-card paths & names.
• The stage needs to be told:
• How file is divided into rows (record format)
• How row is divided into columns (column format)
• Stage Rules
• Accepts 1 input link OR 1 stream output link
• Rejects record(s) that have metadata mismatch. Options on reject
• Continue: ignore record
• Fail: Job aborts
• Output: reject link metadata is a single column, not alterable; can be written to a
file/table
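The three reject options above can be sketched as follows. This is an illustration of the behaviour, not the stage's implementation; names are hypothetical.

```python
# Sketch of the Sequential File stage reject options for records
# whose fields do not match the stage metadata:
#   continue -> ignore the record
#   fail     -> job aborts
#   output   -> record goes down a single-column reject link

def read_file(lines, n_cols, on_reject="continue"):
    good, rejects = [], []
    for line in lines:
        cols = line.split(",")
        if len(cols) != n_cols:          # metadata mismatch
            if on_reject == "fail":
                raise RuntimeError("job aborts: " + line)
            if on_reject == "output":
                rejects.append(line)     # reject link: one column, unaltered
            continue                     # "continue": ignore record
        good.append(cols)
    return good, rejects

rows, rej = read_file(["1,a", "2,b,extra", "3,c"], 2, on_reject="output")
print(rows)  # [['1', 'a'], ['3', 'c']]
print(rej)   # ['2,b,extra']
```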
[Transformer stage screenshot annotations:]
• Input links, output links & column mappings
• Expressions/transforms; stage variable derivations
• Metadata area; metadata defined for derived columns
• Drop columns, change the order of columns, rename columns
• Not all input columns need to be used
• Constraint example: do not output if Region_ID is NULL
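The Transformer behaviours above (mapping, derivation, constraint, dropped columns) can be sketched as plain Python; column names are illustrative only.

```python
# Sketch of Transformer-stage behaviour: map & rename a column,
# derive a new column via an expression, drop an unused input
# column, and apply the output constraint
# "do not output if Region_ID is NULL".

def transform(rows):
    out = []
    for r in rows:
        if r["Region_ID"] is None:           # output constraint
            continue
        out.append({
            "Region": r["Region_ID"],        # mapped & renamed column
            "Total": r["Qty"] * r["Price"],  # derivation / expression
            # input column "Notes" is dropped:
            # not all input columns need to be used
        })
    return out

rows = [{"Region_ID": "Z1", "Qty": 2, "Price": 5, "Notes": "x"},
        {"Region_ID": None, "Qty": 1, "Price": 9, "Notes": "y"}]
print(transform(rows))  # [{'Region': 'Z1', 'Total': 10}]
```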
• Four types:
• Inner
• Left outer
• Right outer
• Full outer
• Join keys must have the same name; can be modified, if required, in a previous stage
• All input link data is pre-sorted & partitioned** on the join key
• By default
• Sort inserted by DataStage
• If data is pre-sorted (by a previous stage), does not pre-sort
** - to be discussed shortly
Join Types
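The four join types can be sketched on two small keyed inputs. Payload values are illustrative; as noted above, the Join stage expects its inputs pre-sorted on the key.

```python
# The four join types on a common key, sketched with plain dicts.

left  = {1: "a", 2: "b"}          # key -> left-link payload
right = {2: "x", 3: "y"}          # key -> right-link payload

def join(kind):
    if kind == "inner":
        keys = left.keys() & right.keys()   # keys present on both links
    elif kind == "left outer":
        keys = left.keys()                  # all left keys
    elif kind == "right outer":
        keys = right.keys()                 # all right keys
    else:                                   # full outer: all keys
        keys = left.keys() | right.keys()
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}

print(join("inner"))       # {2: ('b', 'x')}
print(join("full outer"))  # {1: ('a', None), 2: ('b', 'x'), 3: (None, 'y')}
```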
• Hash
• Intermediate results for each group are stored in a hash table
• Final results are written out after all input has been processed
• No sort required
• Use when number of unique groups is small
• Running tally for each group’s aggregate calculations needs to fit
into memory. Requires about 1K RAM / group
• Sort
• Only a single aggregation group is kept in memory
• When new group is seen, current group is written out
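The two aggregation methods above can be contrasted in a short sketch: the hash method keeps a running tally per group in memory, while the sort method assumes input sorted on the group key and holds only the current group.

```python
# Hash vs. sort aggregation (summing a value per group).

def agg_hash(rows):
    """Hash method: one in-memory tally per unique group;
    results written out only after all input is processed."""
    tally = {}
    for grp, val in rows:
        tally[grp] = tally.get(grp, 0) + val
    return tally

def agg_sort(rows_sorted):
    """Sort method: input must be sorted on the group key;
    only the current group is kept in memory."""
    out, cur, total = {}, None, 0
    for grp, val in rows_sorted:
        if grp != cur:                # new group seen:
            if cur is not None:
                out[cur] = total      # write the current group out
            cur, total = grp, 0
        total += val
    if cur is not None:
        out[cur] = total
    return out

rows = [("Z1", 10), ("Z2", 20), ("Z1", 30)]
print(agg_hash(rows))          # {'Z1': 40, 'Z2': 20}
print(agg_sort(sorted(rows)))  # {'Z1': 40, 'Z2': 20}
```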
Job Parameters (referenced as #XXX#)
• Direct usage for expression evaluation
• Usage as a stage parameter for string substitution
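The string-substitution usage can be sketched as below; the parameter names and the file path are illustrative only.

```python
# Sketch of #ParamName# string substitution in a stage property,
# as used for job parameters. Names and paths are illustrative.

def substitute(text, params):
    for name, value in params.items():
        text = text.replace("#" + name + "#", str(value))
    return text

params = {"SrcDir": "/data/in", "RunDate": "2008-01-31"}
path = substitute("#SrcDir#/orders_#RunDate#.csv", params)
print(path)  # /data/in/orders_2008-01-31.csv
```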
• We Saw:
• Table Definition
• Job
• Stages
• Sequential File as source & target
• Aggregator
• Join
• Transform
• Job Parameters