
InfoSphere DataStage

Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 01: Introduction ……………………………………………………………… 05


● Mod 02: Deployment …………………………………………………………………… 27
● Mod 03: Administering DataStage …………………………………… 41
● Mod 04: DataStage Designer ……………………………………………… 67
● Mod 05: Creating Parallel Jobs …………………………………… 97
● Mod 06: Accessing Sequential Data …………………………… 133
● Mod 07: Platform Architecture …………………………………… 163
● Mod 08: Combining Data ………………………………………………………… 201
● Mod 09: Sorting and Aggregating Data ………………… 243
● Mod 10: Transforming Data ……………………………………………… 271
● Mod 11: Working With Relational Data ………………… 327
● Mod 12: Metadata in the Parallel Framework …… 373
● Mod 13: Job Control ……………………………………………………………… 397

© Copyright IBM Corporation 2009 2


IBM InfoSphere DataStage

Mod 01: Introduction

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs

© Copyright IBM Corporation 2009 4


What is IBM InfoSphere DataStage?

● Design jobs for Extraction, Transformation, and Loading


(ETL)
● Ideal tool for data integration projects – such as, data
warehouses, data marts, and system migrations
● Import, export, create, and manage metadata for use
within jobs
● Schedule, run, and monitor jobs, all within DataStage
● Administer your DataStage development and execution
environments
● Create batch (controlling) jobs

© Copyright IBM Corporation 2009 5


IBM Information Server
● Suite of applications, including DataStage, that:
– Share a common repository
• DB2, by default
– Share a common set of application services and functionality
• Provided by Metadata Server components hosted by an application
server
– IBM InfoSphere Application Server
• Provided services include:
– Security
– Repository
– Logging and reporting
– Metadata management
● Managed using web console clients
– Administration console
– Reporting console

© Copyright IBM Corporation 2009 6


Information Server Backbone

[Diagram: the Information Server backbone. Product modules — Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, MetaBrokers, and Federation Server — sit on top of Metadata Access Services and Metadata Analysis Services, which are provided by the Metadata Server. The suite is administered through the Information Server console.]

© Copyright IBM Corporation 2009 7


Information Server Administration Console

© Copyright IBM Corporation 2009 8


DataStage Architecture

[Diagram: DataStage clients connect to the engines (parallel engine and server engine), which share a common repository.]


© Copyright IBM Corporation 2009 9
DataStage Administrator

© Copyright IBM Corporation 2009 10


DataStage Designer

DataStage
parallel job

© Copyright IBM Corporation 2009 11


DataStage Director

Log
message
events

© Copyright IBM Corporation 2009 12


Developing in DataStage
● Define global and project properties in Administrator
● Import metadata into the Repository
● Build job in Designer
● Compile job in Designer
● Run and monitor job log messages in Director
– Jobs can also be run in Designer, but job log messages cannot be
viewed in Designer

© Copyright IBM Corporation 2009 13


DataStage Project Repository

User-added
folder

Standard
jobs folder

Standard
Table
Definitions
folder

© Copyright IBM Corporation 2009 14


Types of DataStage Jobs
● Server jobs
– Executed by the DataStage server engine
– Compiled into Basic
– Runtime monitoring in DataStage Director
● Parallel jobs
– Executed by the DataStage parallel engine
– Built-in functionality for pipeline and partition parallelism
– Compiled into OSH (Orchestrate Scripting Language)
• OSH executes Operators
– Executable C++ class instances
– Runtime monitoring in DataStage Director
● Job sequences (batch jobs, controlling jobs)
– Master Server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director
– Executed by the Server engine

© Copyright IBM Corporation 2009 15


Design Elements of Parallel Jobs

● Stages
– Implemented as OSH operators (pre-built components)
– Passive stages (E and L of ETL)
• Read data
• Write data
• E.g., Sequential File, DB2, Oracle, Peek stages
– Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split / Merge data
• E.g., Transformer, Aggregator, Join, Sort stages
● Links
– “Pipes” through which the data moves from stage to stage

© Copyright IBM Corporation 2009 16


Pipeline Parallelism

● Transform, clean, load processes execute simultaneously


● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability

© Copyright IBM Corporation 2009 17


Partition Parallelism

● Divide the incoming stream of data into subsets to be


separately processed by an operator
– Subsets are called partitions
● Each partition of data is processed by the same operator
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed

© Copyright IBM Corporation 2009 18


Three-Node Partitioning

[Diagram: incoming Data is split into subset1, subset2, and subset3; the same Operation runs on each subset on Node 1, Node 2, and Node 3]

● Here the data is partitioned into three partitions


● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster

© Copyright IBM Corporation 2009 19


Job Design v. Execution

User designs the flow in DataStage Designer

… at runtime, this job runs in parallel for any configuration
of partitions (called nodes)

© Copyright IBM Corporation 2009 20


Checkpoint

1. True or false: DataStage Director is used to build and


compile your ETL jobs
2. True or false: Use Designer to monitor your job during
execution
3. True or false: Administrator is used to set global and
project properties

© Copyright IBM Corporation 2009 21


Unit summary
Having completed this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs

© Copyright IBM Corporation 2009 22


IBM InfoSphere DataStage

Mod 02: Deployment

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server

© Copyright IBM Corporation 2009 24


What Gets Deployed
An Information Server domain, consisting of the following:
● MetaData Server, hosted by an IBM InfoSphere Application Server
instance
● One or more DataStage servers
– DataStage server includes both the parallel and server engines
● One DB2 UDB instance containing the Repository database
● Information Server clients
– Administration console
– Reporting console
– DataStage clients
• Administrator
• Designer
• Director
● Additional Information Server applications
– Information Analyzer
– Business Glossary
– Rational Data Architect
– Information Services Director
– Federation Server

© Copyright IBM Corporation 2009 25


Deployment: Everything on One Machine
● Here we have a single domain with the hosted applications all on
one machine
● Additional client workstations can connect to this machine using
TCP/IP

[Diagram: one machine hosting the MetaData Server backbone, the
DataStage server, and the DB2 instance with the Repository; client
workstations connect to it]
© Copyright IBM Corporation 2009 26
Deployment: DataStage on Separate Machine

● Here the domain is split between two machines
– DataStage server
– MetaData server and DB2 Repository

[Diagram: the DataStage server on one machine; the MetaData Server
backbone and the DB2 instance with the Repository on the other;
clients connect to both]
© Copyright IBM Corporation 2009 27
MetaData Server and DB2 On Separate Machines

● Here the domain is split between three machines
– DataStage server
– MetaData server
– DB2 Repository

[Diagram: the DataStage server, the MetaData Server backbone, and
the DB2 instance with the Repository each on its own machine;
clients connect over the network]

© Copyright IBM Corporation 2009 28


Information Server Installation

© Copyright IBM Corporation 2009 29


Installation Configuration Layers
● Configuration layers include:
– Client
• DataStage and Information Server clients
– Engine
• DataStage and other application engines
– Domain
• MetaData Server and hosted metadata server components
• Installed products domain components
– Repository
• Repository database server and database
– Documentation
● Selected layers are installed on the machine local to the
installation
● Already existing components can be configured and used
– E.g., DB2, InfoSphere Application Server

© Copyright IBM Corporation 2009 30


Information Server Start-Up
– Start the MetaData Server
• From Windows Start menu, click “Start the Server” after the profile
to be used (e.g., “default”)
• From the command line, open the profile bin directory
– Enter “startup server1” (see the sketch after this list)
> server1 is the default name of the application server hosting the MetaData
Server
– Start the ASB agent
• From Windows Start menu, click “Start the agent” after selecting
the Information Server folder
• Only required if DataStage and the MetaData Server are on
different machines
– To begin work in DataStage, double-click on a DataStage client
icon
– To begin work in the Administration and Reporting consoles,
double-click on the Web Console for Information Server icon
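For example, starting the MetaData Server from the command line might look like this (the profile path is an assumption for illustration; use the bin directory of your own profile):

cd C:\IBM\WebSphere\AppServer\profiles\default\bin    (illustrative profile bin directory)
startup server1                                       (server1 is the default application server name)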

© Copyright IBM Corporation 2009 31


Starting the MetaData Server Backbone

Application Server Profiles folder Profile Start the Server

Profile Profile\bin directory

Startup command

© Copyright IBM Corporation 2009 32


Starting the ASB Agent

Start the agent

© Copyright IBM Corporation 2009 33


Checkpoint

1. What application components make up a domain?


2. Can a domain contain multiple DataStage servers?
3. Does the DB2 instance and the Repository database need
to be on the same machine as the Application Server?
4. Suppose DataStage is on a separate machine from the
Application Server. What two components need to be
running before you can log onto DataStage?

© Copyright IBM Corporation 2009 34


Unit summary
Having completed this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server

© Copyright IBM Corporation 2009 35


IBM InfoSphere DataStage

Mod 03: Administering


DataStage

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables

© Copyright IBM Corporation 2009 37


Information Server Administration Web Console
● Web application for administering Information Server
● Use for:
– Domain management
– Session management
– Management of users and groups
– Logging management
– Scheduling management

© Copyright IBM Corporation 2009 38


Opening the Administration Web Console

Information Server
console address

Information
Server
administrator ID

© Copyright IBM Corporation 2009 39


Users and Group Management

© Copyright IBM Corporation 2009 40


User and Group Management
● Suite authorizations can be provided to users or groups
– Users that are members of a group acquire the authorizations of the group
● Authorizations are provided in the form of roles
– Two types of roles
• Suite roles: Apply to the Suite
• Suite Component roles: Apply to a specific product or component of
Information Server, e.g., DataStage
● Suite roles
– Administrator
• Perform user and group management tasks
• Includes all the privileges of the Suite User role
– User
• Create views of scheduled tasks and logged messages
• Create and run reports
● Suite Component roles
– DataStage
• DataStage user
– Permissions are assigned within DataStage
> Developer, Operator, Super Operator, Production Manager
• DataStage administrator
– Full permissions to work in DataStage Administrator, Designer, and Director
– And so on, for all products in the Suite
© Copyright IBM Corporation 2009 41
Creating a DataStage User ID

Administration
console

Create
new user

Users

© Copyright IBM Corporation 2009 42


Assigning DataStage Roles

User ID
Users
Assign Suite Administrator role
Assign Suite User role
Assign DataStage User role
Assign DataStage Administrator role
© Copyright IBM Corporation 2009 43
DataStage Credential Mapping
● DataStage credentials
– Required by DataStage in order to log onto a DataStage client
– Managed by the DataStage Server operating system or LDAP
● Users given DataStage Suite roles in the Suite
Administration console do not automatically receive
DataStage credentials
– Users need to be mapped to a user who has DataStage credentials
on the DataStage Server machine
• This DataStage user must have file access permission to the
DataStage engine/project files or Administrator rights on the
operating system
● Three ways of mapping users (IS ID → OS ID)
– Many to 1
– 1 to 1
– Combination of the above

© Copyright IBM Corporation 2009 44


DataStage Credential Mapping

Map DataStage
User credential

© Copyright IBM Corporation 2009 45


DataStage Administrator

© Copyright IBM Corporation 2009 46


Logging onto DataStage Administrator
Host name,
port number of
application
server

DataStage
administrator ID
and password

Name or IP
address of
DataStage server
machine
© Copyright IBM Corporation 2009 47
DataStage Administrator Projects Tab

Server projects
Click to specify
project properties

Link to Information Server Administration console

© Copyright IBM Corporation 2009 48


DataStage Administrator General Tab

Enable runtime
column propagation
(RCP) by default

Specify auto purge

Environment
variable settings

© Copyright IBM Corporation 2009 49


Environment Variables

Parallel job variables

User-defined variables

© Copyright IBM Corporation 2009 50


Environment Reporting Variables

Display Score

Display record
counts

Display OSH

© Copyright IBM Corporation 2009 51


DataStage Administrator Permissions Tab

Assigned product
role

Add DataStage
users

© Copyright IBM Corporation 2009 52


Adding Users and Groups

Add DataStage users
Available users / groups. Must have a DataStage product User role.

© Copyright IBM Corporation 2009 53


Specify DataStage Role

Added DataStage
user

Select DataStage
role

© Copyright IBM Corporation 2009 54


DataStage Administrator Tracing Tab

© Copyright IBM Corporation 2009 55


DataStage Administrator Parallel Tab

© Copyright IBM Corporation 2009 56


DataStage Administrator Sequence Tab

© Copyright IBM Corporation 2009 57


Checkpoint

1. Authorizations can be assigned to what two items?


2. What two types of authorization roles can be assigned to a
user or group?
3. In addition to Suite authorization to log onto DataStage
what else does a DataStage developer require to work in
DataStage?

© Copyright IBM Corporation 2009 58


Unit summary
Having completed this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables

© Copyright IBM Corporation 2009 59


IBM InfoSphere DataStage

Mod 04: DataStage Designer

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file

© Copyright IBM Corporation 2009 61


Logging onto DataStage Designer

Host name,
port number of
application
server

DataStage server
machine/project

© Copyright IBM Corporation 2009 62


Designer Work Area
Repository Menus Toolbar

Parallel
canvas

Palette

© Copyright IBM Corporation 2009 63


Importing and Exporting
DataStage Objects

© Copyright IBM Corporation 2009 64


Repository Window
Search for
objects in the
project

Project
Default jobs
folder

Default Table
Definitions
folder

© Copyright IBM Corporation 2009 65


Import and Export

● Any object in the repository window can be exported to a


file
● Can export whole projects
● Use for backup
● Sometimes used for version control
● Can be used to move DataStage objects from one
project to another
● Use to share DataStage jobs and projects with other
developers

© Copyright IBM Corporation 2009 66


Export Procedure

● Click “Export>DataStage Components”


● Select DataStage objects for export
● Specify type of export:
– DSX: Default format
– XML: Enables processing of export file by XML applications,
e.g., for generating reports
● Specify file path on client machine

© Copyright IBM Corporation 2009 67


Export Window

Click to select objects from the repository
Selected objects
Export type
Export location on client system

Begin export
© Copyright IBM Corporation 2009 68
Import Procedure

● Click “Import>DataStage Components”


– Or “Import>DataStage Components (XML)” if you are importing
an XML-format export file
● Select DataStage objects for import

© Copyright IBM Corporation 2009 69


Import Options

Import all objects in the file
Display list to select from

© Copyright IBM Corporation 2009 70


Import/Export Command Line Interface
● Command istool
● Located in \IBM\InformationServer\Clients\istools\clients
– Documentation in same location
● Invoke through Programs → IBM Information Server →
Information Server Command Line Interface
● Examples:
– istool -hostname domain:9080 -username dsuser -password inf0server -deployment-package importfile.dsx import
– istool -hostname domain:9080 -username dsuser -password inf0server -job GenDataJob -filename exportfile.dsx export

© Copyright IBM Corporation 2009 71


Information Server Manager

© Copyright IBM Corporation 2009 72


Information Server Manager
● Create and deploy packages of DataStage objects
● Perform DataStage object exports / imports
– Similar to DataStage Designer exports / imports
• Export files are stored as archives (extension .isx)
● Provides a domain-wide view of the DataStage Repository
– Displays all DataStage servers
• E.g., a Development server and a Production server
– Displays all DataStage projects on each server
● Before you can use it you must add a domain
– Right-click over Repository, click Add Domain
• Select domain
• Specify user ID / password
● Invoke through Information Server start menu:
– IBM Information Server → Information Server Manager

© Copyright IBM Corporation 2009 73


Adding a Domain

Information Server
domain. May
contain multiple
DataStage systems

© Copyright IBM Corporation 2009 74


Information Server Manager Window

Created
export
archive

DataStage
Repository. Shows all
DataStage servers
and projects
© Copyright IBM Corporation 2009 75
Information Server Manager – Package View
● Multiple packages
● Multiple builds within each package
● Build history
[Tree view: Domain → Packages → Builds → Objects]

© Copyright IBM Corporation 2009 76


Importing Table Definitions

© Copyright IBM Corporation 2009 77


Importing Table Definitions

● Table Definitions describe the format and columns


of files and tables
● You can import Table Definitions for:
– Sequential files
– Relational tables
– COBOL files
– Many other things
● Table Definitions can be loaded into job stages

© Copyright IBM Corporation 2009 78


Sequential File Import Procedure

● Click Import>Table Definitions>Sequential File Definitions


● Select directory containing sequential file and then the file
● Select a repository folder to store the Table Definition in
● Examine format and column definitions and edit as
necessary

© Copyright IBM Corporation 2009 79


Importing Sequential Metadata

Import Table
Definition for
a sequential
file

© Copyright IBM Corporation 2009 80


Sequential Import Window
Select
directory
containing
files

Start import

Select file

Select repository
folder

© Copyright IBM Corporation 2009 81


Specify Format

Edit columns
Select if first row has column names

Delimiter

© Copyright IBM Corporation 2009 82


Edit Column Names and Types

Double-click to define
extended properties

© Copyright IBM Corporation 2009 83


Extended Properties window

Parallel properties
Property
categories

Available
properties

© Copyright IBM Corporation 2009 84


Table Definition General Tab

Type Source

Stored Table
Definition

© Copyright IBM Corporation 2009 85


Checkpoint

● True or False? The directory to which you export is on the


DataStage client machine, not on the DataStage server
machine.
● Can you import Table Definitions for sequential files with
fixed-length record formats?

© Copyright IBM Corporation 2009 86


Unit summary
Having completed this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file

© Copyright IBM Corporation 2009 87


IBM InfoSphere DataStage

Mod 05: Creating Parallel Jobs

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log

© Copyright IBM Corporation 2009 89


What Is a Parallel Job?

● Executable DataStage program


● Created in DataStage Designer
– Can use components from Repository
● Built using a graphical user interface
● Compiles into Orchestrate script language (OSH) and object
code (from generated C++)

© Copyright IBM Corporation 2009 90


Job Development Overview

● Import metadata defining sources and targets


– Done within Designer using import process
● In Designer, add stages defining data extractions and loads
● Add processing stages to define data transformations
● Add links defining the flow of data from sources to targets
● Compile the job
● In Director, validate, run, and monitor your job
– Can also run the job in Designer
– Can only view the job log in Director

© Copyright IBM Corporation 2009 91


Tools Palette

Stage
categories

Stages

© Copyright IBM Corporation 2009 92


Adding Stages and Links

● Drag stages from the Tools Palette to the diagram


– Can also be dragged from Stage Type branch to the diagram
● Draw links from source to target stage
– Right mouse over source stage
– Release mouse button over target stage

© Copyright IBM Corporation 2009 93


Job Creation Example Sequence

● Brief walkthrough of procedure


● Assumes Table Definition of source already exists in the
repository

© Copyright IBM Corporation 2009 94


Create New Parallel Job

Open
New
window

Parallel job

© Copyright IBM Corporation 2009 95


Drag Stages and Links From Palette

Compile

Row Generator
Peek
Job
properties

© Copyright IBM Corporation 2009 96


Renaming Links and Stages
● Click on a stage or link to
rename it
● Meaningful names have
many benefits
– Documentation
– Clarity
– Fewer development errors

© Copyright IBM Corporation 2009 97


RowGenerator Stage
● Produces mock data for specified columns
● No input link; single output link
● On Properties tab, specify number of rows
● On Columns tab, load or specify column definitions
– Open Extended Properties window to specify the values to be
generated for that column
– A number of algorithms for generating values are available
depending on the data type
● Algorithms for Integer type
– Random: seed, limit
– Cycle: Initial value, increment
● Algorithms for string type: Cycle, alphabet
● Algorithms for date type: Random, cycle
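For example, an integer column using Cycle with Initial value = 1 and Increment = 1 generates 1, 2, 3, …; using Random, the seed makes the generated sequence repeatable across runs and the limit bounds the values.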

© Copyright IBM Corporation 2009 98


Inside the Row Generator Stage

Properties
tab

Set property
value

Property

© Copyright IBM Corporation 2009 99


Columns Tab

Double-click to specify
extended properties

View data

Select Table Definition
Load a Table Definition
© Copyright IBM Corporation 2009 100
Extended Properties

Specified
properties and
their values

Additional
properties to add

© Copyright IBM Corporation 2009 101


Peek Stage

● Displays field values


– Displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed
– Shows data in each partition, labeled 0, 1, 2, …
● Useful stub stage for iterative job development
– Develop job to a stopping point and check the data

© Copyright IBM Corporation 2009 102


Peek Stage Properties

Output to
job log

© Copyright IBM Corporation 2009 103


Job Parameters

● Defined in Job Properties window


● Makes the job more flexible
● Parameters can be:
– Used in directory and file names
– Used to specify property values
– Used in constraints and derivations
● Parameter values are determined at run time
● When used for directory and file names and property
values, they are surrounded with pound signs (#)
– E.g., #NumRows#
● Job parameters can reference DataStage environment
variables
– Prefaced by $, e.g., $APT_CONFIG_FILE
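For example (a sketch; SourceDir and NumRows are assumed parameter names and the path is illustrative), stage properties might be set to:

File = #SourceDir#/customers.txt
Number of Records = #NumRows#

At run time, DataStage substitutes the values supplied for the parameters.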

© Copyright IBM Corporation 2009 104


Defining a Job Parameter

Parameters tab

Parameter
Add environment
variable

© Copyright IBM Corporation 2009 105


Using a Job Parameter in a Stage

Job parameter
Click to insert job parameter

© Copyright IBM Corporation 2009 106


Adding Job Documentation

● Job Properties
– Short and long descriptions
● Annotation stage
– Added from the Tools Palette
– Displays formatted text descriptions on diagram

© Copyright IBM Corporation 2009 107


Job Properties Documentation

Documentation

© Copyright IBM Corporation 2009 108


Annotation Stage Properties

© Copyright IBM Corporation 2009 109


Compiling a Job

Run

Compile

© Copyright IBM Corporation 2009 110


Errors or Successful Message

Highlight stage with error
Click for more info

© Copyright IBM Corporation 2009 111


Running Jobs and Viewing the Job
Log in Designer

© Copyright IBM Corporation 2009 112


DataStage Director

● Use to run and schedule jobs


● View runtime messages
● Can invoke directly from Designer
– Tools > Run Director

© Copyright IBM Corporation 2009 113


Run Options

Assign values to
parameter

© Copyright IBM Corporation 2009 114


Run Options

Stop after number


of warnings

Stop after number


of rows

© Copyright IBM Corporation 2009 115


Director Status View

Status view Schedule view


Log view

Select job to
view
messages in
the log view
© Copyright IBM Corporation 2009 116
Director Log View
Click the notepad
icon to view log
messages

Peek messages

© Copyright IBM Corporation 2009 117


Message Details

© Copyright IBM Corporation 2009 118


Other Director Functions

● Schedule job to run on a particular date/time


● Clear job log of messages
● Set job log purging conditions
● Set Director options
– Row limits (server job only)
– Abort after x warnings

© Copyright IBM Corporation 2009 119


Running Jobs from Command Line

● dsjob -run -param numrows=10 dx444 GenDataJob
– Runs a job
– Use -run to run the job
– Use -param to specify parameters
– In this example, dx444 is the name of the project
– In this example, GenDataJob is the name of the job
● dsjob -logsum dx444 GenDataJob
– Displays a job’s messages in the log
● Documented in “Parallel Job Advanced Developer’s
Guide”

© Copyright IBM Corporation 2009 120


Checkpoint

1. Which stage can be used to display output data in the job


log?
2. Which stage is used for documenting your job on the job
canvas?
3. What command is used to run jobs from the operating
system command line?

© Copyright IBM Corporation 2009 121


Unit summary
Having completed this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log

© Copyright IBM Corporation 2009 122


IBM InfoSphere DataStage

Mod 06: Accessing Sequential


Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Understand the stages for accessing different kinds of file
data
● Sequential File stage
● Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers

© Copyright IBM Corporation 2009 124


Types of File Data

● Sequential
– Fixed or variable length
● Data Set
● Complex flat file

© Copyright IBM Corporation 2009 125


How Sequential Data is Handled

● Import and export operators are generated


– Stages get translated into operators during the compile
● Import operators convert data from the external format,
as described by the Table Definition, to the framework
internal format
– Internally, the format of data is described by schemas
● Export operators reverse the process
● Messages in the job log use the “import” / “export”
terminology
– E.g., “100 records imported successfully; 2 rejected”
– E.g., “100 records exported successfully; 0 rejected”
– Records get rejected when they cannot be converted correctly
during the import or export

© Copyright IBM Corporation 2009 126


Using the Sequential File Stage

Both import and export of sequential files (text, binary) can be
performed by the Sequential File stage
● Importing / exporting data:
– Data import: external format → internal format
– Data export: internal format → external format

© Copyright IBM Corporation 2009 127


Features of Sequential File Stage

● Normally executes in sequential mode


● Executes in parallel when reading multiple files
● Can use multiple readers within a node
– Reads chunks of a single file in parallel
● The stage needs to be told:
– How file is divided into rows (record format)
– How row is divided into columns (column format)

© Copyright IBM Corporation 2009 128


Sequential File Format Example

Record delimiter

Field 1 , Field 2 , Field 3 , Last field  nl

Final Delimiter = end

Field Delimiter (comma)

Field 1 , Field 2 , Field 3 , Last field ,  nl

Final Delimiter = comma

© Copyright IBM Corporation 2009 129


Sequential File Stage Rules

● One input link


● One stream output link
● Optionally, one reject link
– Will reject any records not matching metadata in the column
definitions
• Example: You specify three columns separated by commas, but
the row that’s read had no commas in it
• Example: The second column is designated a decimal in the
Table Definition, but contains alphabetic characters

© Copyright IBM Corporation 2009 130


Job Design Using Sequential Stages

Stream link

Reject link
(broken line)

© Copyright IBM Corporation 2009 131


Input Sequential Stage Properties

Output tab
File to
access

Column names
in first row

Click to add more


files to read
© Copyright IBM Corporation 2009 132
Format Tab

Record format

Load from Table


Definition

Column format

© Copyright IBM Corporation 2009 133


Sequential Source Columns Tab

View data

Load from Table


Definition

Save as a new
Table Definition

© Copyright IBM Corporation 2009 134


Reading Using a File Pattern

Use wild cards
Select File Pattern
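For example (an illustrative pattern), selecting the File Pattern read method with

File Pattern = /data/source/sales_*.txt

reads all files matching the wildcard as a single data source.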

© Copyright IBM Corporation 2009 135


Properties - Multiple Readers

Multiple readers option allows


you to set number of readers
per node

© Copyright IBM Corporation 2009 136


Sequential Stage As a Target
Input Tab

Append /
Overwrite

© Copyright IBM Corporation 2009 137


Reject Link
● Reject mode =
– Continue: Continue reading
records
– Fail: Abort job
– Output: Send down output
link
● In a source stage
– All records not matching the
metadata (column definitions)
are rejected
● In a target stage
– All records that fail to be
written for any reason
● Rejected records consist of
one column, datatype = raw

Reject mode property

© Copyright IBM Corporation 2009 138


Inside the Copy Stage

Column
mappings

© Copyright IBM Corporation 2009 139


Reading and Writing NULL Values to a
Sequential File

© Copyright IBM Corporation 2009 140


Working with NULLs
● Internally, NULL is represented by a special value outside
the range of any existing, legitimate values
● If NULL is written to a non-nullable column, the job will abort
● Columns can be specified as nullable
– NULLs can be written to nullable columns
● You must “handle” NULLs written to nullable columns in a
Sequential File stage
– You need to tell DataStage what value to write to the file
– Unhandled rows are rejected
● In a Sequential File source stage, you can specify values you
want DataStage to convert to NULLs
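A sketch of the idea (the exact property name may vary by version; the token is your choice): in the extended properties of the nullable column, set something like

Null field value = ''

so that NULLs are written out as, and read back from, the empty string.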

© Copyright IBM Corporation 2009 141


Specifying a Value for NULL

Nullable
column

Added
property

© Copyright IBM Corporation 2009 142


DataSet Stage

© Copyright IBM Corporation 2009 143


Data Set
● Binary data file
● Preserves partitioning
– Component dataset files are written to each partition
● Suffixed by .ds
● Referred to by a header file
● Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
● Represents persistent data
● Key to good performance in set of linked jobs
– No import / export conversions are needed
– No repartitioning needed
● Accessed using DataSet stage
● Implemented with two types of components:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple files, one per partition (node)

© Copyright IBM Corporation 2009 144


Job With DataSet Stage

DataSet stage

DataSet stage
properties

© Copyright IBM Corporation 2009 145


Data Set Management Utility
Display schema

Display data

Display record
counts for each
data file

© Copyright IBM Corporation 2009 146


Data and Schema Displayed

Data viewer

Schema describing the


format of the data

© Copyright IBM Corporation 2009 147


File Set Stage

● Use to read and write to filesets


● Files suffixed by .fs
● Files are similar to a dataset
– Partitioned
– Implemented with header file and data files
● How filesets differ from datasets
– Data files are text files
• Hence readable by external applications
– Datasets have a proprietary data format which may
change in future DataStage versions

© Copyright IBM Corporation 2009 148


Checkpoint

1. List three types of file data.


2. What makes datasets perform better than other types of
files in parallel jobs?
3. What is the difference between a data set and a file set?

© Copyright IBM Corporation 2009 149


Unit summary
Having completed this unit, you should be able to:
● Understand the stages for accessing different kinds of file
data
● Sequential File stage
● Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers

© Copyright IBM Corporation 2009 150


IBM InfoSphere DataStage

Mod 07: Platform Architecture

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score

© Copyright IBM Corporation 2009 152


Key Parallel Job Concepts

● Parallel processing:
– Executing the job on multiple CPUs
● Scalable processing:
– Add more resources (CPUs and disks) to increase system
performance

• Example system: 6 CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system
[Diagram: six numbered processing nodes with disks]

© Copyright IBM Corporation 2009 153


Scalable Hardware Environments

● Single CPU
– Dedicated memory & disk
● SMP
– Multi-CPU (2-64+)
– Shared memory & disk
● GRID / Clusters
– Multiple, multi-CPU systems
– Dedicated memory per node
– Typically SAN-based shared storage
● MPP
– Multiple nodes with dedicated memory, storage
– 2 – 1000’s of CPUs

© Copyright IBM Corporation 2009 154


Pipeline Parallelism

● Transform, clean, load processes execute simultaneously


● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability

© Copyright IBM Corporation 2009 155


Partition Parallelism

● Divide the incoming stream of data into subsets to be


separately processed by an operation
– Subsets are called partitions (nodes)
● Each partition of data is processed by the same
operation
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed

© Copyright IBM Corporation 2009 156


Three-Node Partitioning

[Diagram: incoming Data is split into subset1, subset2, and subset3; the same Operation runs on each subset on Node 1, Node 2, and Node 3]

● Here the data is partitioned into three partitions


● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster

© Copyright IBM Corporation 2009 157


Parallel Jobs Combine Partitioning and Pipelining

– Within parallel jobs pipelining, partitioning, and repartitioning are automatic


– Job developer only identifies:
• Sequential vs. parallel mode (by stage)
– Default for most stages is parallel mode
• Method of data partitioning
– The method to use to distribute the data into the available nodes
• Method of data collecting
– The method to use to collect the data from the nodes into a single
node
• Configuration file
– Specifies the nodes and resources

© Copyright IBM Corporation 2009 158


Job Design v. Execution

User assembles the flow using DataStage Designer

… at runtime, this job runs in parallel for any configuration


(1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!


© Copyright IBM Corporation 2009 159
Configuration File

● Configuration file separates configuration (hardware /


software) from job design
– Specified per job at runtime by $APT_CONFIG_FILE
– Change hardware and resources without changing job design
● Defines nodes with their resources (need not match
physical CPUs)
– Dataset, Scratch, Buffer disk (file systems)
– Optional resources (Database, SAS, etc.)
– Advanced resource optimizations
• “Pools” (named subsets of nodes)
● Multiple configuration files can be used for different occasions
of job execution
– Optimizes overall throughput and matches job characteristics to
overall hardware resources
– Allows runtime constraints on resource usage on a per job basis

© Copyright IBM Corporation 2009 160


Example Configuration File
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)
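To run a job against a particular file, point $APT_CONFIG_FILE at it, for example as a job parameter or environment variable setting (the path below is illustrative):

$APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/4node.apt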

© Copyright IBM Corporation 2009 161


Partitioning and Collecting

© Copyright IBM Corporation 2009 162


Partitioning and Collecting

● Partitioning breaks incoming rows into multiple streams of


rows (one for each node)
● Each partition of rows is processed separately by the
stage/operator
● Collecting returns partitioned data back to a single stream
● Partitioning / Collecting is specified on stage input links

© Copyright IBM Corporation 2009 163


Partitioning / Collecting Algorithms
● Partitioning algorithms include:
– Round robin
– Random
– Hash: Determine partition based on key value
• Requires key specification
– Modulus
– Entire: Send all rows down all partitions
– Same: Preserve the same partitioning
– Auto: Let DataStage choose the algorithm
● Collecting algorithms include:
– Round robin
– Auto
• Collect first available record
– Sort Merge
• Read in by key
• Presumes data is sorted by the key in each partition
• Builds a single sorted stream based on the key
– Ordered
• Read all records from first partition, then second, …
© Copyright IBM Corporation 2009 164
Keyless V. Keyed Partitioning Algorithms
● Keyless: Rows are distributed independently of data values
– Round Robin
– Random
– Entire
– Same
● Keyed: Rows are distributed based on values in the
specified key
– Hash: Partition based on key
• Example: Key is State. All “CA” rows go into the same partition; all
“MA” rows go in the same partition. Two rows of the same state
never go into different partitions
– Modulus: Partition based on modulus of key divided by the number
of partitions. Key is a numeric type.
• Example: Key is OrderNumber (numeric type). Rows with the same
order number will all go into the same partition.
– DB2: Matches DB2 EEE partitioning

© Copyright IBM Corporation 2009 165


Round Robin and Random Partitioning

Keyless
● Keyless partitioning methods
● Rows are evenly distributed across partitions
– Good for initial import of data if no other partitioning is needed
– Useful for redistributing data
● Fairly low overhead
● Round Robin assigns rows to partitions like dealing cards
– Row/Partition assignment will be the same for a given
$APT_CONFIG_FILE
● Random has slightly higher overhead, but assigns rows in a
non-deterministic fashion between job runs
[Diagram: input rows …8 7 6 5 4 3 2 1 0 dealt Round Robin into three
partitions holding (0 3 6), (1 4 7), (2 5 8)]

© Copyright IBM Corporation 2009 166


ENTIRE Partitioning
Keyless
● Each partition gets a complete copy of the data
– Useful for distributing lookup and reference data
• May have performance impact in MPP / clustered environments
– On SMP platforms, the Lookup stage (only) uses shared memory
instead of duplicating ENTIRE reference data
• On MPP platforms, each server uses shared memory for a
single local copy
● ENTIRE is the default partitioning for Lookup reference links with
“Auto” partitioning
– On SMP platforms, it is a good practice to set this explicitly on the
Normal Lookup reference link(s)
[Diagram: input rows …8 7 6 5 4 3 2 1 0 copied in full (0 1 2 3 …) to
every partition]

© Copyright IBM Corporation 2009 167


HASH Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in key columns
– Guarantees that rows with same key values go into the same
partition
– Needed to prevent matching rows from “hiding” in other partitions
• E.g. Join, Merge, Remove Duplicates, …
– Partition distribution is relatively equal if the data across the
source key column(s) is evenly distributed
[Diagram: rows with key values …0 3 2 1 0 2 3 2 1 1 HASH-partitioned
so that each key value always lands in the same partition]

© Copyright IBM Corporation 2009 168


Modulus Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in one integer key
column
– Uses the modulus of the key value and the number of partitions:
partition = MOD(key_value, #partitions)
● Faster than HASH
● Guarantees that rows with identical key values go in the same
partition
● Partition size is relatively equal if the data within the key column
is evenly distributed
[Diagram: rows with key values …0 3 2 1 0 2 3 2 1 1 assigned by
MODULUS into three partitions]
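For example, with 3 partitions, a row whose key value is 7 goes to partition MOD(7, 3) = 1, and every other row with key value 7 lands in that same partition.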

© Copyright IBM Corporation 2009 169


Auto Partitioning

● DataStage inserts partition components as necessary


to ensure correct results
– Before any stage with “Auto” partitioning
– Generally chooses ROUND-ROBIN or SAME
– Inserts HASH on stages that require matched key values
(e.g. Join, Merge, Remove Duplicates)
– Inserts ENTIRE on Normal (not Sparse) Lookup reference
links
• NOT always appropriate for MPP/clusters
● Since DataStage has limited awareness of your data
and business rules, explicitly specify HASH partitioning
when needed
– DataStage has no visibility into Transformer logic
– Hash is required before Sort and Aggregator stages
– DataStage sometimes inserts un-needed partitioning
• Check the log
© Copyright IBM Corporation 2009 170
Partitioning Requirements for Related Records

● Misplaced records
– Using Aggregator stage to sum customer sales by
customer number
– If there are 25 customers, 25 records should be output
– But suppose records with the same customer numbers
are spread across partitions
• This will produce more than 25 groups (records)
– Solution: Use hash partitioning algorithm
● Partition imbalances
– If all the records are going down only one of the nodes,
then the job is in effect running sequentially

© Copyright IBM Corporation 2009 171


Unequal Distribution Example

● Hash on LName, with a 2-node config file
● Same key values are assigned to the same partition

Source Data:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

Partition 0:
ID LName FName Address
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard

Partition 1:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

All the Ford rows hash to the same partition, so the two partitions
are unequally loaded (8 rows vs. 2).

© Copyright IBM Corporation 2009 172


Partitioning / Collecting Link Icons

Partitioning icon

Collecting icon

© Copyright IBM Corporation 2009 173


More Partitioning Icons

“Fan-out”: sequential to parallel
SAME partitioner
Re-partition (watch for this!)
AUTO partitioner

© Copyright IBM Corporation 2009 174


Partitioning Tab
Input tab
Key
specification

Algorithms

© Copyright IBM Corporation 2009 175


Collecting Specification

Key
specification

Algorithms
© Copyright IBM Corporation 2009 176
Runtime Architecture

© Copyright IBM Corporation 2009 177


Parallel Job Compilation
What gets generated:
● OSH: a kind of script
● OSH represents the design data flow and stages
– Stages become OSH operators
● A Transform operator is generated for each Transformer
– A custom operator built during the compile
– Compiled into C++ and then into the corresponding native
operator
• Thus a C++ compiler is needed to compile jobs with a
Transformer stage
[Diagram: the Designer client compiles the job on the DataStage
server, producing the executable job and Transformer components]

© Copyright IBM Corporation 2009 178


Generated OSH

● Enable viewing of generated OSH in Administrator
● OSH is visible in:
– Job properties
– Job run log
– View Data
– Table Definitions
[Screenshot callouts: comments, operator, schema]
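As a rough, hand-written illustration of the shape of generated OSH (not actual compiler output; the stage, link, and schema names are invented), a Row Generator feeding a Peek might compile to something like:

#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record ( Id: int32; Name: string[10]; )
-records 10
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

#### STAGE: Peek_1
## Operator
peek
## Inputs
0< [] 'Row_Generator_0:lnk_gen.v'
;

Each stage becomes an operator, and each link becomes a virtual dataset (the .v names) connecting an operator's output to the next operator's input.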

© Copyright IBM Corporation 2009 179


Stage to Operator Mapping Examples
● Sequential File
– Source: import
– Target: export
● DataSet: copy
● Sort: tsort
● Aggregator: group
● Row Generator, Column Generator, Surrogate Key
Generator: generator
● Oracle
– Source: oraread
– Sparse Lookup: oralookup
– Target Load: orawrite
– Target Upsert: oraupsert

© Copyright IBM Corporation 2009 180


Checkpoint

1. What file defines the degree of parallelism a job runs


under?
2. Name two partitioning algorithms that partition based on
key values?
3. What partitioning algorithm produces the most even
distribution of data in the various partitions?

© Copyright IBM Corporation 2009 181


Unit summary
Having completed this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score

© Copyright IBM Corporation 2009 182


IBM InfoSphere DataStage

Mod 08: Combining Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage

© Copyright IBM Corporation 2009 184


Combining Data

Ways to combine data:


● Horizontally:
– Multiple input links
– One output link made of columns from different input links.
– Joins
– Lookup
– Merge
● Vertically:
– One input link, one output link combining groups of related records
into a single record
– Aggregator
– Remove Duplicates
● Funneling: Multiple input streams funneled into a single
output stream
– Funnel stage

© Copyright IBM Corporation 2009 185


Lookup, Merge, Join Stages

● These stages combine two or more input links


– Data is combined by designated "key" column(s)
● These stages differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)

© Copyright IBM Corporation 2009 186


Lookup Stage

© Copyright IBM Corporation 2009 187


Lookup Features
● One stream input link (source link)
● Multiple reference links
● One output link
● Lookup failure options
– Continue, Drop, Fail, Reject
● Reject link
– Only available when lookup failure option is set to Reject
● Can return multiple matching rows
● Hash file is built in memory from the lookup files
– Indexed by key
– Should be small enough to fit into physical memory

© Copyright IBM Corporation 2009 188


Lookup Types
● Equality match
– Match exactly values in the lookup key column of the reference link to
selected values in the source row
– Return row or rows (if multiple matches are to be returned) that match
● Caseless match
– Like an equality match except that it’s caseless
• E.g., “abc” matches “AbC”
● Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within the
range
● Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within
the range

© Copyright IBM Corporation 2009 189


Lookup Example

Reference link
Source (stream)
link

© Copyright IBM Corporation 2009 190


Lookup Stage With an Equality Match
Source link columns

Lookup
constraints

Mappings to
output columns

Reference key column and source mapping
Column definitions
© Copyright IBM Corporation 2009 191
Lookup Failure Actions
If the lookup fails to find a matching key column, one of
several actions can be taken:

– Fail (Default)
• Stage reports an error and the job fails immediately
– Drop
• Input row is dropped
– Continue
• Input row is transferred to the output. Reference link columns are
filled with null or default values
– Reject
• Input row sent to a reject link
• This requires that a reject link has been created for the stage

© Copyright IBM Corporation 2009 192


Specifying Lookup Failure Actions

Select to return
multiple rows

Select lookup failure


action

© Copyright IBM Corporation 2009 193


Lookup Stage Behavior

Source link:
Revolution Citizen
1789 Lefty
1776 M_B_Dextrous

Reference link:
Citizen Exchange
M_B_Dextrous Nasdaq
Righty NYSE

Lookup key column: Citizen

© Copyright IBM Corporation 2009 194


Lookup Stage

Output of Lookup with Continue option:
Revolution Citizen Exchange
1789 Lefty (empty string or NULL)
1776 M_B_Dextrous Nasdaq

Output of Lookup with Drop option:
Revolution Citizen Exchange
1776 M_B_Dextrous Nasdaq

© Copyright IBM Corporation 2009 195


Designing a Range Lookup Job

© Copyright IBM Corporation 2009 196


Range Lookup Job

Reference link

Lookup stage

© Copyright IBM Corporation 2009 197


Range on Reference Link
Reference
range values

Retrieve
description

Source values

© Copyright IBM Corporation 2009 198


Selecting the Stream Column

Source link
Double-click
to specify
range

Reference link

© Copyright IBM Corporation 2009 199


Range Expression Editor

Select range
columns

Select
operators

© Copyright IBM Corporation 2009 200


Range on Stream Link

Source range

Retrieve other
column values

Reference link key


© Copyright IBM Corporation 2009 201
Specifying the Range Lookup

Reference link
key column
Range type

© Copyright IBM Corporation 2009 202


Range Expression Editor

Select range
columns

© Copyright IBM Corporation 2009 203


Join Stage

© Copyright IBM Corporation 2009 204


Join Stage
● Four types of joins
• Inner
• Left outer
• Right outer
• Full outer
● Input links must be sorted
– Left link and a right link
– Supports additional “intermediate” links
● Light-weight
– Little memory required, because of the sort requirement
● Join key column or columns
– Column names for each input link must match

© Copyright IBM Corporation 2009 205


Job With Join Stage

Right input link

Left input link


Join stage

© Copyright IBM Corporation 2009 206


Join Stage Editor

Column to match
Select left input link
Select join type
Select if multiple columns make up the join key

© Copyright IBM Corporation 2009 207


Join Stage Behavior

Left link (primary input):
Revolution Citizen
1789 Lefty
1776 M_B_Dextrous

Right link (secondary input):
Citizen Exchange
M_B_Dextrous Nasdaq
Righty NYSE

Join key column: Citizen

© Copyright IBM Corporation 2009 208


Inner Join

● Transfers rows from both data sets whose key


columns have matching values

Output of inner join on key Citizen

Revolution Citizen Exchange


1776 M_B_Dextrous Nasdaq

© Copyright IBM Corporation 2009 209


Left Outer Join

● Transfers all values from the left link and transfers


values from the right link only where key columns
match.

Revolution Citizen Exchange


1789 Lefty
1776 M_B_Dextrous Nasdaq

Null or default
value

© Copyright IBM Corporation 2009 210


Right Outer Join

● Transfers all values from the right link and transfers values from
the left link only where key columns match.

Revolution Citizen Exchange


1776 M_B_Dextrous Nasdaq
Righty NYSE

Null or default
value

© Copyright IBM Corporation 2009 211


Full Outer Join

● Transfers rows from both data sets, whose key columns


contain equal values, to the output link
● It also transfers rows, whose key columns contain
unequal values, from both input links to the output link
● Treats both inputs symmetrically
● Creates new columns, with new column names!

Revolution leftRec_Citizen rightRec_Citizen Exchange


1789 Lefty
1776 M_B_Dextrous M_B_Dextrous Nasdaq
0 Righty NYSE

© Copyright IBM Corporation 2009 212


Merge Stage

© Copyright IBM Corporation 2009 213


Merge Stage
● Similar to the Join stage
● Input links must be sorted
– Master link and one or more secondary (update) links
– Master must be duplicate-free
● Light-weight
– Little memory required, because of the sort requirement
● Unmatched master rows can be kept or dropped
● Unmatched (“bad”) update rows in input link n can be captured in a reject link in the corresponding output link n
● Matched update rows are consumed
● Follows the Master-Update model
– A master row and one or more update rows are merged if and only if they have the same value in the user-specified key column(s)
● What if a non-key column name occurs in several inputs?
– The lowest input port number prevails
– For example, master over update; the update values are ignored

© Copyright IBM Corporation 2009 214


Merge Stage Job

Merge stage

Capture update
link non-matches

© Copyright IBM Corporation 2009 215


Merge Stage Properties

Match key

Keep or drop
unmatched
masters

© Copyright IBM Corporation 2009 216


The Merge Stage

● Allows composite keys


● Multiple update links
Master One or more
updates
● Matched update rows are
consumed
● Unmatched updates in
input port n can be captured
0 1 2 in output port n
● Lightweight
0 1 2
Merge

Output Rejects

© Copyright IBM Corporation 2009 217


Comparison: Joins, Lookup, Merge

                                  Joins                             Lookup                              Merge
Model                             RDBMS-style relational            Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                             heavy                               light
# and names of inputs             2 or more: left, right            1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              all inputs                        no                                  all inputs
Duplicates in primary input       OK                                OK                                  Warning!
Duplicates in secondary input(s)  OK                                Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                                capture in reject set(s)
On match, secondary entries are   captured                          captured                            consumed
# Outputs                         1                                 1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries           unmatched secondary entries

© Copyright IBM Corporation 2009 218


Funnel Stage

© Copyright IBM Corporation 2009 219


What is a Funnel Stage?
● Combines data from multiple input links to a single
output link
● All sources must have identical metadata
● Three modes
– Continuous
• Records are combined in no particular order: the first available record is taken
– Sort Funnel
• Combines the input records in the order defined by a key
• Produces sorted output if the input links are all sorted by the same key
– Sequence
• Outputs all records from the first input link, then all from the second input link, and so on

© Copyright IBM Corporation 2009 220


Funnel Stage Example

© Copyright IBM Corporation 2009 221


Checkpoint

1. Name three stages that horizontally join data.


2. Which stage uses the least amount of memory? Join or
Lookup?
3. Which stage requires that the input data is sorted? Join or
Lookup?

© Copyright IBM Corporation 2009 222


Checkpoint solutions

1. Lookup, Merge, Join


2. Join
3. Join

© Copyright IBM Corporation 2009 223


Unit summary
Having completed this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage

© Copyright IBM Corporation 2009 224


IBM InfoSphere DataStage

Mod 09: Sorting and


Aggregating Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Sort data using in-stage sorts and the Sort stage
● Aggregate data using the Aggregator stage
● Remove duplicates using the Remove Duplicates stage

© Copyright IBM Corporation 2009 226


Sort Stage

© Copyright IBM Corporation 2009 227


Sorting Data
● Uses
– Some stages require sorted input
• Join, merge stages require sorted input
– Some stages use less memory with sorted input
• e.g., Aggregator
● Sorts can be done:
– Within stages
• On input link Partitioning tab, set partitioning to anything other
than Auto
– In a separate Sort stage
• Makes sort more visible on diagram
• Has more options

© Copyright IBM Corporation 2009 228


Sorting Alternatives

Sort stage Sort within


stage

© Copyright IBM Corporation 2009 229


In-Stage Sorting

Partitioning tab

Do sort

Preserve non-key ordering

Remove dups

Sort key

Can’t be Auto when sorting
© Copyright IBM Corporation 2009 230
Sort Stage

Sort key

Sort options

© Copyright IBM Corporation 2009 231


Sort keys
● Add one or more keys
● Specify sort mode for each key
– Sort: Sort by this key
– Don’t sort (previously sorted):
• Assumes the data has already been sorted on this key
• Continue sorting by any secondary keys
● Specify sort order: ascending / descending
● Specify case sensitive or not

© Copyright IBM Corporation 2009 232


Sort Options

● Sort Utility
– DataStage – the default
– Unix: Don’t use. Slower than DataStage sort utility
● Stable
● Allow duplicates
● Memory usage
– Sorting takes advantage of the available memory for increased
performance
• Uses disk if necessary
– Increasing the amount of memory can improve performance

© Copyright IBM Corporation 2009 233


Stable Sort

Before sort       After stable sort on Key
Key  Col          Key  Col
4    X            1    K
3    Y            1    A
1    K            2    P
3    C            2    L
2    P            3    Y
3    D            3    C
1    A            3    D
2    L            4    X

© Copyright IBM Corporation 2009 234


Partitioning vs. Sorting Keys
● Partitioning keys are often different from sorting keys
– Keyed partitioning (e.g., Hash) is used to group related records into
the same partition
– Sort keys are used to establish order within each partition
● Example:
– Partition on HouseHoldID, sort on HouseHoldID, EntryDate
• Partitioning on HouseHoldID ensures that the same ID will not be
spread across multiple partitions
• Sorting orders the records with the same ID by entry date
– Useful for deciding which of a group of duplicate records with the same
ID should be retained

© Copyright IBM Corporation 2009 235


Aggregator Stage

© Copyright IBM Corporation 2009 236


Aggregator Stage

Purpose: Perform data aggregations


Specify:
● One or more key columns that define the aggregation
units (or groups)
● Columns to be aggregated
● Aggregation functions include, among many others:
– Count (nulls/non-nulls)
– Sum
– Max / Min / Range
● The grouping method (hash table or pre-sort) is a
performance issue

© Copyright IBM Corporation 2009 237


Job with Aggregator Stage

Aggregator
stage

© Copyright IBM Corporation 2009 238


Aggregation Types

● Count rows
– Count rows in each group
– Put result in a specified output column
● Calculation
– Select column
– Put result of calculation in a specified output column
– Calculations include:
• Sum
• Count
• Min, max
• Mean
• Missing value count
• Non-missing value count
• Percent coefficient of variation

© Copyright IBM Corporation 2009 239


Count Rows Aggregator Properties

Grouping key
columns

Output column for


Count Rows the result
aggregation type

© Copyright IBM Corporation 2009 240


Calculation Type Aggregator Properties

Grouping key
columns

Calculation
aggregation type

Calculations with
output columns
Available
calculations
© Copyright IBM Corporation 2009 241
Grouping Methods
● Hash (default)
– Calculations are made for all groups and stored in memory
• Hash table structure (hence the name)
– Results are written out after all input has been processed
– Input does not need to be sorted
– Useful when the number of unique groups is small
• Running tally for each group’s aggregations needs to fit into
memory
● Sort
– Requires the input data to be sorted by grouping keys
• Does not perform the sort itself! It expects sorted input
– Only a single aggregation group is kept in memory
• When a new group is seen, the current group is written out
– Can handle unlimited numbers of groups

© Copyright IBM Corporation 2009 242


Grouping Method - Hash

Input            Hash table of groups (in memory)
Key  Col
4    X           4: 4X
3    Y           3: 3Y 3C 3D
1    K           1: 1K 1A
3    C           2: 2P 2L
2    P
3    D
1    A
2    L

© Copyright IBM Corporation 2009 243


Grouping Method - Sort
Input (sorted)   Group written out when complete
Key  Col
1    K           1K 1A
1    A
2    P           2P 2L
2    L
3    Y           3Y 3C 3D
3    C
3    D
4    X           4X

© Copyright IBM Corporation 2009 244


Remove Duplicates Stage

© Copyright IBM Corporation 2009 245


Removing Duplicates

● Can be done by Sort stage


– Use unique option
• No choice on which duplicate to keep
• Stable sort always retains the first row in the group
• Non-stable sort is indeterminate

OR

● Remove Duplicates stage


– Has more sophisticated ways to remove duplicates
• Can choose to retain first or last

© Copyright IBM Corporation 2009 246


Remove Duplicates Stage Job

Remove Duplicates
stage

© Copyright IBM Corporation 2009 247


Remove Duplicates Stage Properties

Key that defines


duplicates

Retain first or last


duplicate
© Copyright IBM Corporation 2009 248
Checkpoint

1. What stage is used to perform calculations of column


values grouped in specified ways?
2. In what two ways can sorts be performed?
3. What is a stable sort?

© Copyright IBM Corporation 2009 249


Unit summary
Having completed this unit, you should be able to:
● Sort data using in-stage sorts and the Sort stage
● Aggregate data using the Aggregator stage
● Remove duplicates using the Remove Duplicates stage

© Copyright IBM Corporation 2009 250


IBM InfoSphere DataStage

Mod 10: Transforming Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations

© Copyright IBM Corporation 2009 252


Transformer Stage

© Copyright IBM Corporation 2009 253


Transformer Stage
● Column mappings
● Derivations
– Written in BASIC
– Final compiled code is C++-generated object code
● Constraints
– Filter data
– Direct data down different output links
• For different processing or storage
● Expressions for constraints and derivations can reference
– Input columns
– Job parameters
– Functions
– System variables and constants
– Stage variables
– External routines

© Copyright IBM Corporation 2009 254


Job With a Transformer Stage

Transformer

Single input

Reject link Multiple outputs

© Copyright IBM Corporation 2009 255


Inside the Transformer Stage
Stage variables

Show/Hide

Input columns Output columns

Constraint

Derivations / Mappings

Column defs

© Copyright IBM Corporation 2009 256


Defining a Constraint

Input column

Function

Job parameter

© Copyright IBM Corporation 2009 257


Defining a Derivation

Input column

String in quotes

Concatenation operator (:)
© Copyright IBM Corporation 2009 258
IF THEN ELSE Derivation
● Use IF THEN ELSE to conditionally derive a value
● Format:
– IF <condition> THEN <expression1> ELSE <expression2>
– If the condition evaluates to true then the result of expression1 will
be copied to the target column or stage variable
– If the condition evaluates to false then the result of expression2 will
be copied to the target column or stage variable
● Example:
– Suppose the source column is named In.OrderID and the target
column is named Out.OrderID
– Replace In.OrderID values of 3000 by 4000
– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
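Conditions can also be chained with ELSE IF. A minimal sketch (the column In.Qty and the band values are illustrative, not from the course data):

  IF In.Qty < 10 THEN "LOW"
  ELSE IF In.Qty < 100 THEN "MEDIUM"
  ELSE "HIGH"

The whole expression is a single derivation; each ELSE IF is evaluated only when the preceding condition is false.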

© Copyright IBM Corporation 2009 259


String Functions and Operators
● Substring operator
– Format: “String” [loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] → “Juice”
● UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) → “ORANGE JUICE”
● Len(<string>)
– Example: Len(In.Description) → 12
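These functions can be combined in one derivation. A hedged sketch, assuming hypothetical columns In.FirstName and In.LastName:

  UpCase(In.FirstName[1,1]) : ". " : UpCase(In.LastName)

For an input of John / Smith this would yield “J. SMITH”.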

© Copyright IBM Corporation 2009 260


Checking for NULLs
● Nulls can be introduced into the data
flow from lookups
– Mismatches (lookup failures) can produce
nulls
● Can be handled in constraints,
derivations, stage variables, or a
combination of these
● NULL functions
– Testing for NULL
• IsNull(<column>)
• IsNotNull(<column>)
– Replace NULL with a value
• NullToValue(<column>, <value>)
– Set to NULL: SetNull()
• Example: IF In.Col = 5 THEN SetNull()
ELSE In.Col

© Copyright IBM Corporation 2009 261


Transformer Functions

● Date & Time


● Logical
● Null Handling
● Number
● String
● Type Conversion

© Copyright IBM Corporation 2009 262


Transformer Execution Order

● Derivations in stage variables are executed first
● Constraints are executed before derivations
● Column derivations in earlier links are executed before those in later links
● Derivations in higher columns are executed before those in lower columns

© Copyright IBM Corporation 2009 263


Transformer Stage Variables

● Derivations execute in order from top to bottom


– Later stage variables can reference earlier stage variables
– Earlier stage variables can reference later stage variables
• These variables will contain a value derived from the previous row
that came into the Transformer
● Multi-purpose
– Counters
– Store values from previously read rows to make comparisons with
the currently read row
– Store derived values to be used in multiple target field derivations
– Can be used to control execution of constraints
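A minimal sketch of the previous-row comparison pattern (names are hypothetical). The stage variables are listed in their execution order, each with the derivation it would be given:

  svIsNewID:  IF In.CustomerID <> svPrevID THEN 1 ELSE 0
  svPrevID:   In.CustomerID

Because svIsNewID executes first, svPrevID still holds the CustomerID from the previous row when the comparison runs, so svIsNewID flags the first row of each group.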

© Copyright IBM Corporation 2009 264


Transformer Reject Links

Reject link

Convert link to a
Reject link

© Copyright IBM Corporation 2009 265


Transformer and Null Expressions
● Within a Transformer, any expression that includes a NULL
value produces a NULL result
– 1 + NULL = NULL
– “John” : NULL : “Doe” = NULL
● When the result of a link constraint or output derivation is
NULL, the Transformer will output that row to the reject link.
If no reject link, job will abort!
● Standard practices
– Always create a reject link
– Always test for NULL values in expressions with NULLABLE columns
• IF isNull(link.col) THEN … ELSE …
– Transformer issues warnings for rejects
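A hedged sketch of a null-safe derivation, assuming a nullable column In.MiddleName (the column names are illustrative):

  IF IsNull(In.MiddleName)
  THEN In.FirstName : " " : In.LastName
  ELSE In.FirstName : " " : In.MiddleName : " " : In.LastName

Without the IsNull test, a NULL middle name would turn the whole concatenation NULL and send the row to the reject link.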

© Copyright IBM Corporation 2009 266


Otherwise Link

Otherwise link

© Copyright IBM Corporation 2009 267


Defining an Otherwise Link

Check to create
otherwise Can specify abort
condition condition

© Copyright IBM Corporation 2009 268


Specifying Link Ordering

Link ordering toolbar icon

Last in
order

© Copyright IBM Corporation 2009 269


Parameter Sets

© Copyright IBM Corporation 2009 270


Parameter Sets
● Store a collection of parameters in a named object
● One or more values files can be named and specified
– A values file stores values for specified parameters
– Values are picked up at runtime
● Parameter Sets can be added to the job parameters
specified on the Parameters tab in the job properties

© Copyright IBM Corporation 2009 271


Creating a New Parameter Set

New Parameter
Set

© Copyright IBM Corporation 2009 272


Parameters Tab

Specify
parameters

Parameter set
name is specified
on General tab

© Copyright IBM Corporation 2009 273


Values Tab

Values for
Values file name parameters

© Copyright IBM Corporation 2009 274


Adding a Parameter Set to Job Properties

View Parameter
Set

Parameter Set Add Parameter


Reference Set

© Copyright IBM Corporation 2009 275


Using Parameter Set Parameters

Select Parameter
Set parameters

© Copyright IBM Corporation 2009 276


Checkpoint

1. What occurs first? Derivations or constraints?


2. Can stage variables be referenced in constraints?
3. Where should you test for NULLS within a Transformer?
Stage Variable derivations or output column derivations?

© Copyright IBM Corporation 2009 277


Unit summary
Having completed this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations

© Copyright IBM Corporation 2009 278


IBM InfoSphere DataStage

Mod 11: Working With


Relational Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage

© Copyright IBM Corporation 2009 280


Working with Relational Data
● Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions
● Data Connection objects
– Store database connection information in a named object
● Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server Jobs
● Selecting data
– Build SELECT statements using SQL Builder
● Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder

© Copyright IBM Corporation 2009 281


Importing Table Definitions

© Copyright IBM Corporation 2009 282


Importing Table Definitions

● Can import using ODBC or using Orchestrate schema


definitions
– Orchestrate schema imports are better because the data types
are more accurate
● Import>Table Definitions>Orchestrate Schema
Definitions
● Import>Table Definitions>ODBC Table Definitions

© Copyright IBM Corporation 2009 283


Orchestrate Schema Import

© Copyright IBM Corporation 2009 284


ODBC Import

Select ODBC data


source name

© Copyright IBM Corporation 2009 285


Connectors Stages

© Copyright IBM Corporation 2009 286


Connector Stage Types

● ODBC
– Conforms to ODBC 3.5 standard and is Level 3 compliant
– Certified with Oracle, DB2 UDB, SQL Server, and others
– DataDirect drivers
● DB2 UDB
● Teradata
● InfoSphere MQ
– WSMQ 5.3 and 6.0 for Client / Sever
– WSMB 5.0

© Copyright IBM Corporation 2009 287


Connector Stage Features
● Common stage editor
● Convenient drop-down lists to choose properties
● Job parameters can be inserted into any property
● Required properties are identified with a visual indicator
● Warning indicator for properties requiring attention
● Metadata retrieval
● Integrated SQL Builder
● Parallel support
– Read: parallel connections to the server and modified SQL queries for
each connection
– Write: parallel connections to the server
● Transaction isolation level support
– Read Uncommitted
– Read Committed
– Repeatable Read
– Serializable
● Before / After commands
– Executed once before or after the job runs

© Copyright IBM Corporation 2009 288


Metadata Retrieval

● List data sources


– Can list both user and system DSNs
● List users defined for the given database
● Individual table or view metadata
– Column data type, size, precision, etc.
● Database information
– Name, vendor name, version number, etc.
● List of database schemas
● List tables that match the filter criteria
● List related tables
– Those related by foreign key
– Those whose foreign keys point to the primary key in the given
table

© Copyright IBM Corporation 2009 289


Connector Stages

ODBC connector

ODBC Connector

© Copyright IBM Corporation 2009 290


Stage Editor

Click to display Navigator


stage properties panel

Properties
Click to display link
properties and columns

© Copyright IBM Corporation 2009 291


Property Warning Indicators

Warning indicator

© Copyright IBM Corporation 2009 292


Auto-Generated SQL

Stage generates SQL


SELECT based on
column definitions

© Copyright IBM Corporation 2009 293


Invoking SQL Builder

SQL statement
property

SQL Builder

© Copyright IBM Corporation 2009 294


Connector Stage Properties

Connection
information

SQL

Transaction
and session
management

Before/After
the SQL

© Copyright IBM Corporation 2009 295


DB2 Connector Stage

© Copyright IBM Corporation 2009 296


DB2 Connector Source Stage

© Copyright IBM Corporation 2009 297


DB2 Connector Source Stage Properties

Connection
information

© Copyright IBM Corporation 2009 298


DB2 Connector Target Stage

© Copyright IBM Corporation 2009 299


DB2 Connector Target Stage Properties

Connection
information

Write mode

© Copyright IBM Corporation 2009 300


Building a Query Using SQL
Builder

© Copyright IBM Corporation 2009 301


SQL Builder
● Uses the Table Definition
– Be sure the Locator tab information is specified fully and correctly
• Schema and table names are based on Locator tab information
● Drag Table Definitions to SQL Builder canvas
● Drag columns from Table Definition to Select columns table

© Copyright IBM Corporation 2009 302


Table Definition Locator Tab

Locator tab

Table schema name

Table name

© Copyright IBM Corporation 2009 303


SQL Builder Read Method

SQL Builder
for Select

© Copyright IBM Corporation 2009 304


SQL Builder Window

Drag Table
Definition

Drag
Columns

Build column
expressions

Selection tab

© Copyright IBM Corporation 2009 305


Creating a Calculated Column
Columns
dragged
from table

Select function
Select Expression Editor
Open a second
Expression
Editor

Functions

Select function
Function parameters

© Copyright IBM Corporation 2009 306


Constructing a WHERE Clause

Select predicate

Job parameter

Add condition to clause

Add second job parameter

© Copyright IBM Corporation 2009 307


Sorting the Data

Second column
to sort by

Sort Ascending
/ Descending
First column to
sort by

© Copyright IBM Corporation 2009 308


Viewing the Generated SQL

Read-only

SQL tab
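The generated statement typically resembles the following sketch; the table, columns, and job parameters shown are illustrative, not from the lab:

  SELECT ITEM_ID, DESCRIPTION, PRICE
  FROM MYSCHEMA.PRODUCTS
  WHERE REGION = #RegionParam# AND PRICE > #MinPrice#
  ORDER BY DESCRIPTION ASC

Job parameters are referenced with the #parameter# notation and are substituted at runtime.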

© Copyright IBM Corporation 2009 309


Building SQL to Write to a
Table

© Copyright IBM Corporation 2009 310


Select Type of Write

Write mode
property

© Copyright IBM Corporation 2009 311


Invoking SQL Builder

Specify
INSERT

SQL
Specify Builder
UPDATE
© Copyright IBM Corporation 2009 312
SQL Builder INSERT

Drag Table Definition

Drag columns to load

Specify values
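An INSERT built this way generally resembles the sketch below (table and column names are illustrative); input link columns are referenced with the ORCHESTRATE. prefix:

  INSERT INTO MYSCHEMA.PRODUCTS (ITEM_ID, DESCRIPTION, PRICE)
  VALUES (ORCHESTRATE.ItemID, ORCHESTRATE.Description, ORCHESTRATE.Price)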

© Copyright IBM Corporation 2009 313


SQL Builder Update

Drag Table Definition

Drag columns to load

Specify values

Add condition to WHERE clause

© Copyright IBM Corporation 2009 314


Viewing the SQL
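For reference, an UPDATE built with SQL Builder generally resembles this sketch (illustrative names; input link columns again carry the ORCHESTRATE. prefix):

  UPDATE MYSCHEMA.PRODUCTS
  SET DESCRIPTION = ORCHESTRATE.Description,
      PRICE = ORCHESTRATE.Price
  WHERE ITEM_ID = ORCHESTRATE.ItemID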

© Copyright IBM Corporation 2009 315


Data Connections

© Copyright IBM Corporation 2009 316


Data Connection
● Stores database parameters and values in a named object in
the Repository
– Optionally can create a parameter set with the parameters and values from within the Data Connection object
● Associated with a stage type
● Property values can be specified in a job stage of the given
type by loading the Data Connection into the stage

© Copyright IBM Corporation 2009 317


Creating a New Data Connection Object

New Data
Connection

© Copyright IBM Corporation 2009 318


Select the Stage Type

Select the stage


type

Specify parameter
Optionally create values
a parameter set

© Copyright IBM Corporation 2009 319


Loading the Data Connection

Load Data
Connection

Save Data
Connection

© Copyright IBM Corporation 2009 320


Checkpoint

1. Which of the following stage types provides the most functionality in DataStage parallel jobs? ODBC Plug-in, ODBC Connector, ODBC Enterprise?
2. What are three ways of building SQL statements in
Connector stages?
3. Which of the following statements can be specified in
Connector stages? Select, Insert, Update, Upsert, Create
Table.
4. What are two ways of loading Data Connection metadata
into a database stage?

© Copyright IBM Corporation 2009 321


Unit summary
Having completed this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage

© Copyright IBM Corporation 2009 322


IBM InfoSphere DataStage

Mod 12: Metadata in the


Parallel Framework

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container

© Copyright IBM Corporation 2009 324


Schema
● Alternative way of specifying column definitions and record formats
– Similar to a Table Definition
● Written in a plain text file
● Can be imported as a Table Definition
● Can be created from a Table Definition
● Can be used in place of a Table Definition in a Sequential
file stage
– Requires RCP
– Schema file path can be parameterized
• Enables a single job to process files with different column
definitions
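A minimal sketch of the schema file syntax (the field names, types, and format options here are illustrative):

  record
    {final_delim=end, delim=',', quote=double}
  (
    CustomerID: int32;
    Name: string[max=30];
    Balance: nullable decimal[10,2];
  )

The record-level properties in braces describe the file format; each field is declared with a parallel framework data type.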

© Copyright IBM Corporation 2009 325


Creating a Schema

● Using a text editor


– Follow correct syntax for definitions
– Not recommended
● Import from an existing data set or file set
– Click Import > Table Definitions > Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds suffix
● Import from a database table
● Create from a Table Definition
– Click Parallel on Layout tab

© Copyright IBM Corporation 2009 326


Importing a Schema

Schema location can be


server or workstation

Import from Check if schema is


Database table from a data set or
file set

© Copyright IBM Corporation 2009 327


Creating a Schema From a Table Definition

Parallel layout

Layout

Schema corresponding to
the Table Definition Save schema

© Copyright IBM Corporation 2009 328


Reading a Sequential File Using a Schema

No columns
defined here

Path to
schema file

© Copyright IBM Corporation 2009 329


Runtime Column Propagation (RCP)
● When RCP is turned on:
– Columns of data can flow through a stage without being explicitly
defined in the stage
– Target columns in a stage need not have any columns explicitly
mapped to them
• No column mapping enforcement at design time
– Input columns are mapped to unmapped columns by name
● How implicit columns get into a job
– Read a file using a schema in a Sequential File stage
– Read a database table using “Select *”
– Explicitly define as an output column in a stage earlier in the flow
● Benefits of RCP
– Job flexibility
• Job can process input with different layouts
– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched
© Copyright IBM Corporation 2009 330
Enabling Runtime Column Propagation (RCP)

● Project level
– DataStage Administrator Parallel tab
● Job level
– Job properties General tab
● Stage level
– Link Output Column tab
● Settings at a lower level override settings at a higher level
– E.g., disable at the project level, but enable for a given job
– E.g., enable at the job level, but disable a given stage

© Copyright IBM Corporation 2009 331


Enabling RCP at Project Level

Check to enable RCP


to be used

Check to make RCP


the default for new
jobs

© Copyright IBM Corporation 2009 332


Enabling RCP at Job Level

Check to make RCP


the job default

© Copyright IBM Corporation 2009 333


Enabling RCP at Stage Level
● Sequential File stage
– Output Columns tab
● Transformer
– Open Stage Properties
– Stage Properties Output tab

Enable RCP in
Sequential File
stage

Check to enable
RCP in
Transformer

© Copyright IBM Corporation 2009 334


When RCP is Disabled
● DataStage Designer enforces Stage Input to Output column mappings.

Colored red; job


won’t compile

© Copyright IBM Corporation 2009 335


When RCP is Enabled
● DataStage does not enforce mapping rules
● Runtime error if no incoming columns match unmapped target
column names

Job will compile

© Copyright IBM Corporation 2009 336


Shared Containers

© Copyright IBM Corporation 2009 337


Shared Containers
● Encapsulate job design components into a stored container
● Provide reusable job design components
● Example
– Apply stored Transformer business logic

© Copyright IBM Corporation 2009 338


Creating a Shared Container
● Select stages from an existing job
● Click Edit>Construct Container>Shared

Selected
components

Click

© Copyright IBM Corporation 2009 339


Using a Shared Container in a Job

Shared
Container

© Copyright IBM Corporation 2009 340


Mapping Input / Output Links to the Container

Select container link


to map input link to

© Copyright IBM Corporation 2009 341


Mapping Input/Output Links

© Copyright IBM Corporation 2009 342


Checkpoint

1. What are two benefits of RCP?


2. What can you use to encapsulate stages and links in a job
to make them reusable?

© Copyright IBM Corporation 2009 343


Unit summary
Having completed this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container

© Copyright IBM Corporation 2009 344


IBM InfoSphere DataStage

Module 14: Job Control

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence a
set of jobs run in
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions

© Copyright IBM Corporation 2009 346


What is a Job Sequence?

● A master controlling job that controls the execution of a


set of subordinate jobs
● Passes values to the subordinate job parameters
● Controls the order of execution (links)
● Specifies conditions under which the subordinate jobs get
executed (triggers)
● Specifies complex flow of control
– Loops
– All / Some
– Wait for file
● Perform system activities
– Email
– Execute system commands and executables
● Can include Restart checkpoints

© Copyright IBM Corporation 2009 347


Basics for Creating a New Job Sequence
● Open a new job sequence
– Specify whether it’s restartable
● Add stages
– Stages to execute jobs
– Stages to execute system commands and executables
– Special purpose stages
● Add links
– Specify the order in which jobs are to be executed
● Specify triggers
– Triggers specify the condition under which control passes across
a link
● Specify error handling
● Enable / Disable restart checkpoints

© Copyright IBM Corporation 2009 348


Job Sequencer Stages

● Run stages
– Job Activity: Run a job
– Execute Command: Run a system command
– Notification Activity: Send an email
● Flow control stages
– Sequencer: Go if All / Some
– Wait for File: Go when file exists / doesn’t
exist
– StartLoop / EndLoop
– Nested Condition: Go if condition satisfied
● Error handling
– Exception Handler
– Terminator
● Variables
– User Variables

© Copyright IBM Corporation 2009 349


Example
Wait for file Execute a
command

Send email
Run job

Handle
exceptions

© Copyright IBM Corporation 2009 350


Sequence Properties

Restart

Exception stage to
handle aborts

© Copyright IBM Corporation 2009 351


Job Activity Stage Properties

Job to be executed

Execution mode

Job parameters
to be passed

© Copyright IBM Corporation 2009 352


Job Activity Trigger
Output link names

List of trigger types

Build custom trigger


expressions

© Copyright IBM Corporation 2009 353


Execute Command Stage

Executable

Parameters to pass

● Execute system commands, shell scripts, and other


executables
● Use e.g. to drop or rename database tables

© Copyright IBM Corporation 2009 354


Notification Activity Stage

Include job
status info in
email body

© Copyright IBM Corporation 2009 355


User Variables Stage
Variable
User Variables stage

Expression defining the


value for the variable

© Copyright IBM Corporation 2009 356


Referencing the User Variable

Variable

© Copyright IBM Corporation 2009 357


Flow of Control Stages

© Copyright IBM Corporation 2009 358


Wait for File Stage

File

Options

© Copyright IBM Corporation 2009 359


Sequencer Stage

● Sequence multiple jobs using the Sequence stage

Can be set to all


or any

© Copyright IBM Corporation 2009 360


Nested Condition Stage

Fork based on trigger


conditions

Trigger conditions

© Copyright IBM Corporation 2009 361


Loop Stages

Reference link to start

Pass counter value


Counter values

© Copyright IBM Corporation 2009 362


Error Handling

© Copyright IBM Corporation 2009 363


Handling Activities that Fail

Pass control to
Exception stage when
an activity fails

© Copyright IBM Corporation 2009 364


Exception Handler Stage

Control
goes here
if an
activity fails

© Copyright IBM Corporation 2009 365


Restart

© Copyright IBM Corporation 2009 366


Enable Restart

Enable
checkpoints to be
added

© Copyright IBM Corporation 2009 367


Disable Checkpoint at a Stage

Don’t
checkpoint
this activity

© Copyright IBM Corporation 2009 368


Checkpoint

1. Which stage is used to run jobs in a job sequence?


2. Does the Exception Handler stage support an input link?

© Copyright IBM Corporation 2009 369


Unit summary
Having completed this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence a
set of jobs run in
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions

© Copyright IBM Corporation 2009 370


InfoSphere QualityStage
v8.1
Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 13: Course Scenario


● Mod 14A: Structure of a Rule Set
● Mod 14B: Creation of a Custom Rule Set
● Mod 15: Initial Investigation of Data to be
Standardized
● Mod 16: Classification Schema/Classification
Table
● Mod 17: Pattern Action File

© Copyright IBM Corporation 2009 372


Module 13: Course Scenario

Information Management

© 2010 IBM Corporation


Information Management

Course Scenario

 Acme Foods is implementing a data warehouse to:


– Provide the ability to analyze sales data and buying trends
– Enable improved target marketing campaigns for new customers
– Increase the profitability of existing customers
 Historical sales data to be loaded into the warehouse has no
consistent structure or data formats as it was stored as free-form
text in various stand-alone systems used by the sales department
 Prior to the warehouse load, historical data must also be enriched
with product ID data from the warehouse’s SCM system

10 IBM Confidential © 2010 IBM Corporation


Information Management

Review Questions

 In what format is the current standardized data?

11 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 1: Setup Data Files

 Copy data files to disk if not using the virtual machine supplied
with the course materials

– Create a working directory C:\DX746


– Copy the entire content from C:\Class_Files\DX746\Lab to C:\DX746

12 IBM Confidential © 2010 IBM Corporation


Module 14A: Structure of a Rule Set

Information Management

© 2010 IBM Corporation


Information Management

Rule Sets

 Rule sets contain logic for:


– Parsing
– Classifying
– Processing data elements
 Three required files
– Classification Table (.cls)
– Dictionary File (.dct)
– Pattern Action File (.pat)
 Optional lookup tables (.tbl)
 Override tables

50 IBM Confidential © 2010 IBM Corporation


Information Management

Rule Set Files

Classification Table (.CLS) – Contains token, standard value, class, and threshold weights (optional)

Pattern Action File (.PAT) – Contains SEPLIST/STRIPLIST, NYSIIS and Soundex calculations, and pattern action sets for processing data

Dictionary File (.DCT) – Contains field identifier, type, length, missing value identifier, and field description

Lookup Tables (.TBL) – Optional conversion and lookup tables for converting and returning standardized values

Override Tables (.TBL) – Tables for storing user overrides entered through the QualityStage Designer GUI
51 IBM Confidential © 2010 IBM Corporation
Information Management

Classification Table (.cls)

 The format for the Classification Table file is:


– token / standard value / class / [threshold-weights] [; comments]
 The first line of a Classification Table file must be:
– \FORMAT\ SORT=Y
 Each column must be separated by one or more spaces
 Standard form of a token can be more than one word
– Must be in double quotes
– Example: POB “PO BOX” B
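Putting these rules together, a hedged sketch of a small classification table (the entries are illustrative, not from the course rule set):

  \FORMAT\ SORT=Y
  HWY HIGHWAY T 700
  POB "PO BOX" B
  RD ROAD T

Each line gives the token, its standard form, its class, and an optional threshold weight.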

52 IBM Confidential © 2010 IBM Corporation


Information Management

Classes

 One-character tag indicating the class of the word


 User-defined classes can be any letter, A to Z
 Zero (0) class indicates a null class
– Use of a class of zero indicates that a word is a null (or noise) and is skipped
in the pattern matching process
– Standard abbreviation can be any value desired since it is not used
 Any token not classed by the classification table receives the
appropriate default class

53 IBM Confidential © 2010 IBM Corporation


Information Management

Default Classes

^  Numeric, all digits, such as 1234
+  Single, unknown word such as HIPPO
?  Unknown, one or more words, such as CHERRY HILL
>  Leading numeric, numbers followed by one or more letters, such as 123A
<  Leading alpha, letters followed by one or more numbers, such as A3
@  Complex mix, a mixture of alpha and numeric characters such as 123A45
~  Special, special characters including !, \, @, ~, %, etc.
0  Null
-  Hyphen
/  Slash
\&  Ampersand
\#  Number sign
\(  Left parenthesis

54 IBM Confidential © 2010 IBM Corporation
Information Management

Threshold Weights

 Threshold weights specify the degree of tolerated spelling


uncertainty
 An information-theoretic string comparator is used that takes into
account phonetic errors, random insertion, deletion and
replacement of characters, and transpositions of characters
 Score is weighted by the length of the word, since small errors in
long words are less serious than errors in short words
– 900 Exact match
– 800 Strings are almost certainly the same
– 750 Strings are probably the same
– 700 Strings are probably different

55 IBM Confidential © 2010 IBM Corporation


Information Management

Comparison Threshold Examples

Token .CLS Entry 850 800 750

NRTHWEST NORTHWEST yes yes yes


NRTHWST NORTHWEST no yes yes
NRTHW NORTHWEST no no yes

56 IBM Confidential © 2010 IBM Corporation


Information Management

Dictionary File (.dct)

 Defines the fields for the output file for the rule set
 The format for the Dictionary File is:
 field-identifier / field-type / field-length / missing-value / description [; comments]
– Field identifier: Any length, no spaces in the name,
• Example: HouseNumber
– Field type: C (character), N (numeric)
• This is legacy information but must be entered into the dictionary file
– Field length: Length (in bytes)
– Missing value identifier: S (spaces), Z (zero), X (none)
• This is legacy information but a value must be entered into the dictionary
file
– Description: No spaces allowed in the description
 The first line of a Dictionary File must be:
– \FORMAT\ SORT=N
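A hedged sketch of a small dictionary file (the field names are illustrative):

  \FORMAT\ SORT=N
  HouseNumber C 10 S House_Number ;house number bucket
  StreetName C 30 S Street_Name
  StreetType C 10 S Street_Type

Each line gives the field identifier, type, length, missing-value identifier, and description, matching the format above.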

57 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action File (.pat)

 Contains the rules for standardization


 After the input record is tokenized (parsed) and classified,
patterns are executed in the order they appear in the Pattern
Action file
 ASCII file that can be created or updated using any standard text
editor and has the following general format:
\PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|\\:;\"<>,.?/"
STRIPLIST " ~`!@$^*()_+={}[]|\\:;\"<>,?"
\PRAGMA_END
\POST_START
NYSIIS {SN} {NS}
RSOUNDEX {SN} {SS}
\POST_END
pattern
actions
...

58 IBM Confidential © 2010 IBM Corporation
Information Management

Optional Tables (.tbl)

 Lookup tables
– Contain information that is specific to the rule set; for example the gender
lookup table for USNAME contains the following:
MRS F
MS F
MR M
– ASCII file that can be manually edited
 User override tables
– User classification, input text, input pattern, unhandled text, unhandled
pattern
– Entries added through Designer Standardization Overrides GUI

59 IBM Confidential © 2010 IBM Corporation


Information Management

Module 14A Review Questions

 Which rule set file is not sorted? Why?


 If the standard form of a token is two words, how is it formatted in
the classification table?
 Where are the SEPLIST and STRIPLIST located?

60 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 5: Copy and provision an existing rule set

 USNAME  CopyOfUSNAME

61 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 6: Modify a rule set

 CopyOfUSNAME
– CLS – Apaloosa Apalusa F
– PAT – F | +
• Change COPY TO COPY_A
 Use Rules Analyzer to test

62 IBM Confidential © 2010 IBM Corporation


Module 14B: Creation of a Custom Rule Set

Information Management

© 2010 IBM Corporation


Information Management

Custom Rule Set Development Cycle

 Investigate source data file – word investigation


 Develop classification schema/table
 Re-run investigation with new classification table to reveal
patterns
 Develop dictionary file
 Develop pattern action sets
 Run standardization and investigate results
 Refine pattern action processing

64 IBM Confidential © 2010 IBM Corporation


Information Management

Investigate Data File

Which comes first – the investigation or the rule set?


 In order to investigate a free-form field, you must specify a rule
 No rule exists for your data

65 IBM Confidential © 2010 IBM Corporation


Information Management

The Answer

 Use a generic rule set first


– No entries in the classification table
• Actually, there must be at least one entry in the classification table; one is created by default when you create a new rule set
– Every token will be unclassified, therefore a default class (tag) will be
assigned based upon the data value
 Create a folder and new Rule Set
 Default values will be created for the dictionary, classification
table, and pattern action file
 Modify the pattern action file to reference the default field in the
dictionary file
 Use the Rule Analyzer to test

66 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 7: Create new rule set

 Create a folder called Standardization Rules under the


ACMEFOOD Folder
 Create a New Rule Set named FOOD
– File, New
– Select Data Quality folder
– Choose Rule Set
 Modify (see slide notes)
– Dictionary file
– Pattern Action file
 Run the Standardization Rules Analyzer using your new rule on
the following input
– “THIS IS A TEST”
– If you correctly copied and updated your new rule, it will return an unhandled
pattern of +++T

67 IBM Confidential © 2010 IBM Corporation


Information Management

Parsing

 Parsing is controlled by the SEPLIST/STRIPLIST


 SEPLIST/STRIPLIST entries are stored in the pattern action file
– SEPLIST/STRIPLIST can also be accessed under Advanced Options in the
Investigate stage Wizard
– The SEPLIST/STRIPLIST located in the investigation window is read from the
pattern action file and displayed on the Advanced Options of the word
investigation screen
• Any changes made to the lists in the investigation screen are not reflected in the pattern action file; to change the list in the pattern action file, you must edit the file manually

68 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 Start by having only the space in the SEPLIST and only the space
in the STRIPLIST
 Examine the pattern report to examine how the current SEPLIST/
STRIPLIST combination parsed the data
 Remember, it’s important to thoroughly parse data before
classifying words
– New words can result from additional parsing

69 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 For example, if the “/” is not in the SEPLIST, the following pattern
results

VALHALLA/WRAPPER @

 In this example, a brand name and a container type are treated as


the same data token

70 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 When the “/” is added to the SEPLIST, the following pattern


results

VALHALLA / WRAPPER ?/?

 The pattern now shows that there are two (currently) unclassified
words separated by a “/”
– There are now two more words we can classify

71 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 Continue to re-run the Word Investigate stage while modifying the


SEPLIST/STRIPLIST combination until the free-form data is
parsed into manageable words, “data chunks”, also called tokens
– TIP: Except when representing a complex mix of alphas and numbers
(C3PO), the “@” in pattern reports is an indicator that more can be added to
the SEPLIST
 Before adding a character to the STRIPLIST, be sure the character
doesn’t add meaning or context to the data
– Rule of thumb: Whenever in doubt don’t strip it out
 It’s important to analyze the pattern reports for ways the would-be
stripped character can impact the data

72 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 How would removing the highlighted special characters impact the meaning
of the below data items?
#2 pencil
5# flour
Ave.
$100.00
£100.00
1/2

73 IBM Confidential © 2010 IBM Corporation


Information Management

Update the SEPLIST/STRIPLIST

 After you’ve determined the parsing requirements, update the


SEPLIST/STRIPLIST in the .PAT file
 Using CTRL-C / CTRL-V (copy/paste) can be helpful in copying the
characters used in the Word Investigate stage and pasting them to
the rule set’s .PAT file

74 IBM Confidential © 2010 IBM Corporation


Information Management

Module 14B Review Questions

 True or false: It’s best to start with all special characters in the
STRIPLIST.

75 IBM Confidential © 2010 IBM Corporation


Module 15: Initial Investigation of Data to be Standardized

Information Management

© 2010 IBM Corporation


Information Management

Exercise 5-1

 Create a new Investigate stage


– Name: InvWordProductData
– Option: Word
– Input file: SourceProductData
– Rule: FOOD
– Field: ProdDescr
– Adv. Opt.: Include Unclassified Alphas
Include Classified Original Spelling
 Create a job, add the Investigate stage and run

77 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

78 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

79 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

80 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

81 IBM Confidential © 2010 IBM Corporation


Information Management

Result File InvWordProductData – Pattern Report

82 IBM Confidential © 2010 IBM Corporation


Information Management

Result File InvWordProductData – Token Report

83 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 8: Create word investigation job

 Refine SEPLIST/STRIPLIST
 Modify .PAT to reflect new SEPLIST/STRIPLIST
 Re-run Word Investigate stage
 Iterate as often as needed

84 IBM Confidential © 2010 IBM Corporation


Module 16: Classification Schema/Classification Table

Information Management

© 2010 IBM Corporation


Information Management

Create the Classification Table

 When creating or modifying a custom rule set, entries to the


classification table are often made by copying lines from the
Token Report
 The user develops a classification schema for key tokens (words)
that create context in patterns
 The classification schema is a set of one letter codes used to
classify categories of words
 Available are the 26 letters of the alphabet and the numeral 0 (for
words you want set to null or nothing)

86 IBM Confidential © 2010 IBM Corporation


Information Management

Classification Schema

 When choosing letters to represent categories, consider choosing


letters that are not easily confused with other categories
– For example, if one category of words is “Drugs” and the other is
“Diagnoses”, consider not categorizing either with the obvious choice of “D”
– This can help minimize misclassifications when creating the classification
table

87 IBM Confidential © 2010 IBM Corporation


Information Management

What to Classify

 The general rule is “Don’t classify every word.”


 Classify categories of words that occur many times in the data
and help to lend meaning and context to other items in the free-
form text
– Check the frequency of words in the investigation reports
 Examples of words to classify are those whose category contains
a finite set of values, or whose expected values are finite
– For example: street types, name titles, name suffixes, unit types, units of
measure, packaging or container types, brand names

88 IBM Confidential © 2010 IBM Corporation


Information Management

What to Classify

 Think twice before creating a classification for a category of


words for which the set of values are both infinite and do not
conform to a set of standards
 Non-conforming or infinite category value words are best left to
be classified with default tags
– Examples include individual names, street names, and miscellaneous
description words
– Their context will allow them to be processed correctly in the pattern action
file

89 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 9: Create classification schema

 Review the Token Report created in the previous exercise


 Develop a classification schema
 Add your classification schema key to the classification table
 Copy and paste entries from the Token Report file into the
classification table and replace the “?” with the appropriate one-
character class

90 IBM Confidential © 2010 IBM Corporation


Information Management

Result File

91 IBM Confidential © 2010 IBM Corporation


Information Management

Review Questions

 True or false: You should classify as many tokens as you have


time to.
 Which file do you look in for candidates for inclusion in the
classification table?
 True or false: When you start developing a new rule set, the
classification table is completely blank.
 How many classes did you use? What are they?

92 IBM Confidential © 2010 IBM Corporation


Pattern Review
Refining the Classification Table

Information Management

© 2010 IBM Corporation


Information Management

Generating Patterns

 Once the classification table has been created and the Word
Investigate stage is rerun, patterns will emerge
 Review of the Pattern Report will indicate whether additional
classes or new members to existing classes must be made
 Are the patterns explicit enough for you to glean the meaning of
the data based on its context?
– If you can answer yes to this question, you are ready to begin developing
pattern action sets
– If you answer no to this question, you must refine/modify your classification
schema/table until you can answer this question in the affirmative

94 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 10: Review Pattern Report

 Add entries to/update the classification table


 Re-run the job containing the Word Investigate stage
 Review the Pattern Report
 Refine/modify the classification table as needed

95 IBM Confidential © 2010 IBM Corporation


Information Management

Result File

96 IBM Confidential © 2010 IBM Corporation


Information Management

Module Review Question

 Do any default classed tokens remain after you have developed


the classification table for a new rule set?

97 IBM Confidential © 2010 IBM Corporation


Module 17: Pattern Action File

Information Management

© 2010 IBM Corporation


Pattern Action Language

Information Management

© 2010 IBM Corporation


Information Management

Pattern Format

 A pattern consists of one or more operands


 Operands are separated by vertical lines
– The following pattern has four operands:
^|D|?|T
– These are referred to in the actions as
[1], [2], [3], and [4]
 You can use spaces to separate operands
– ^|D|?|T is equivalent to ^ | D | ? | T
 Comments follow a semicolon ; like this comment

100 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action Set

 A Pattern Action Set is comprised of a pattern and a set of action


statements

Pattern:
^|D|?|T

Action statements:
COPY [1] {HouseNumber}    ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PreDirection} ; Copy operand 2 w/standardization to PREDIR
COPY_S [3] {StreetName}   ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {StreetType}   ; Copy operand 4 w/standardization to SUFFIXTYPE

101 IBM Confidential © 2010 IBM Corporation


Information Management

Common Action Statements

 COPY - Copy the data in its original form
 COPY_A - Copy the standardized abbreviation of the word from the classification table
– HINT: Only use COPY_A if the token you’re dealing with has a letter classification
 COPY_S - Copy with spaces preserved
– HINT: Most often used with ?
 CONCAT - Concatenate a data item to a user variable. Also available is CONCAT_A
 RETYPE - Change the classification of a data item, and optionally its actual and/or standard value

102 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action File (.PAT)

Which of the following patterns will handle the following example?

Example: 1 North Ronald McDonald Lane

^|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_S [2] {SN} ; Copy operand 2 w/spaces to STREETNAME
COPY_A [3] {ST} ; Copy operand 3 w/standardization to SUFFIXTYPE

^|D|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PD} ; Copy operand 2 w/standardization to PREDIRECTION
COPY_S [3] {SN} ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {ST} ; Copy operand 4 w/standardization to SUFFIXTYPE

103 IBM Confidential © 2010 IBM Corporation


Information Management

User Variables

 This pattern action set looks for a date such as 10/15/98 and
copies it as a single token to a date field

*^|/|^|/|^
COPY [1] temp      ; temp is a temporary (user) variable
CONCAT [2] temp
CONCAT [3] temp
CONCAT [4] temp
CONCAT [5] temp
COPY temp {DA}

104 IBM Confidential © 2010 IBM Corporation


Information Management

Position-related Commands

 End of Field Specifier - $


– ^ | ? | T | $ would only match a field that had a simple address and nothing
else
• 123 Main Road would match
• 123 Main Road Apt 3 would not match
• Apt 3 123 Main Road would not match
 Floating Position Specifier – *
– *^ | ? | T would look starting from the left to find a simple address that appears
anywhere within the field
• Home Address 123 Main Road would match
• Apt 3 123 Main Road would match
• 123 Main Road Apt 3 would match

105 IBM Confidential © 2010 IBM Corporation


Information Management

Position-related Commands

 Reverse Floating Position Specifier - #


– #S | ^ would look starting from the right to find the first instance of state
followed by a numeric
– CA 45 Products, Phoenix, AZ 12345 Dept 45
– Notice it correctly finds AZ 12345. If we used the floating position (*), it
would have incorrectly identified CA 45 as the state and zip code
 Fixed Position Specifier - %
– Specifies a particular operand based on its position in a string
– Input string: 123 Happy Valley Joy Road
• %1 would locate 123
• %3 would locate Valley
• %-1 would locate Road

106 IBM Confidential © 2010 IBM Corporation


Information Management

Subfield Commands

 Subfield Classes - 1 to 9, -1 to -9
– Parses individual words of a ? string
– Johnson Controls Corp whose pattern is ? | C
– 1 | C – Johnson is operand 1 and Corp is operand 2
– -1 | C – Controls is operand 1 and Corp is operand 2
 Subfield Ranges - beg:end
– Specifies a range of unknown words
– 123 – A B Main Street whose pattern is ^ | - | ? | T
– ^ | - | (1:2) | T
• “A B Main” is the string specified by ?
• 1:2 is the portion A B
– Operand [3] is A B
• You can also look from the end of the field. For example, (-3:-1) specifies
a range of the third word from the last to the last word

107 IBM Confidential © 2010 IBM Corporation


Information Management

Additional Classes

 Universal Class - **
– Matches all tokens
– ** | T would match any pattern that ended with a T class
 Negation Class Qualifier - !
– Is used to indicate NOT
– ** | !T would match any pattern that did not end with a T class

108 IBM Confidential © 2010 IBM Corporation


Information Management

Conditionals

 Simple Conditional Values


– ^ | ?=“MAPLE” | T
– Would only match simple addresses with the single street name of
“MAPLE”
• 120 MAPLE TERRACE would match
• 240 MAPLE DRIVE would match
• 140 MAPLE LEAF ROAD would not match
 Series of Conditional Values
– ^ | ? | T=“AVE”, “DR”
– Would only match simple addresses with a standard form suffix of AVE or
DR
• 120 MAPLE AVE would match
• 240 MAPLE DR would match
• 140 MAPLE ROAD would not match

109 IBM Confidential © 2010 IBM Corporation



Tables of Conditional Values

 If it’s necessary to specify a large number of conditional values, tables of values can be used
– Postcode.tbl contains valid postal codes
– S | ^ = @Postcode.tbl
– Would only match a state followed by a numeric if the numeric represented a
value in the Postcode.tbl file

110 IBM Confidential © 2010 IBM Corporation



Referencing Operands

 Referencing Data Represented by an Operand


– ^ [{} = 10000] is read as “a numeric whose actual value is 10000”
 Referencing the length of an operand
– ^ [{} LEN = 5] is read as “a numeric whose length is 5 bytes”
 Referencing the template of an operand
– @ [{} PICT = “cncc”] is read as “a complex mixed that looks like a character
followed by a numeric followed by two characters” such as C3PO
 Referencing a substring of an operand
– + [{} (-7:-1) = “STRASSE”] is read as “a single unknown alpha whose actual
value’s last 7 bytes are STRASSE” such as LEUVENSTRASSE

111 IBM Confidential © 2010 IBM Corporation



Calling Subroutines

 Subroutines facilitate faster processing


 Pattern action sets are executed sequentially
 Faster to test for a generic condition and then call subroutine to
process

*M ; look for existence of apartment data
CALL APTS ; send record to APTS subroutine

\SUB APTS
Process apartment data here
\END_SUB

112 IBM Confidential © 2010 IBM Corporation



Review Questions

 COPY_A copies what form of the token?


 COPY_S is most commonly used with what default class?
 Would ^ | ? | T find an address anywhere in a field?
 True or false: ^ | ? | T | $ would only find an address that was the
only value in the field.
 True or false: ** is the negation class qualifier.

113 IBM Confidential © 2010 IBM Corporation


Development of Pattern Action Sets


© 2010 IBM Corporation



Defining Standardize Output File Format

 The output of the Standardize stage is dictated by the dictionary file
 Your business rules and requirements will dictate the output file
format
 Because we need to match the product data against the
warehouse file, our output should match the format of the
warehouse file

115 IBM Confidential © 2010 IBM Corporation



Warehouse File Format

 DWProductData is the warehouse reference file and has the following format:

116 IBM Confidential © 2010 IBM Corporation



Result - Dictionary File

117 IBM Confidential © 2010 IBM Corporation



Lab 11: Develop Pattern Set

 Add entries to your dictionary file based on the format of the warehouse data file that you are going to match your standardized data to

 Select patterns and begin developing action statements (pattern action sets) to bucket/standardize the data

– Create a Standardize stage


• Input file: SourceProductData
• Output file: StanProdData
• Option: Append All
• Rule: FOOD
• Standardize Field: ProdDescr
– Build and run Standardize job

118 IBM Confidential © 2010 IBM Corporation



Result File

119 IBM Confidential © 2010 IBM Corporation


Refining Pattern Action Sets: Investigation of Standardize Results


© 2010 IBM Corporation



Investigate Your Standardize Results

 Once you’ve run the Standardize stage, you should investigate the
results to see how accurately your rule set is working
 Your rule set contains two fields that make this process much
faster and easier
– Unhandled Pattern – which shows the pattern of any data it was unable to
process
– Unhandled Data – which shows the actual unhandled/unprocessed data
 Running a Character Concatenate Investigate stage on these two
fields will provide you with a consolidated view of patterns/data
that still need processing
 You should also run a Character Discrete Investigate stage on all
your single domain fields to ensure they have domain integrity
and your rule set is working as desired

121 IBM Confidential © 2010 IBM Corporation



Lab 12: Audit Standardization Results (Task 1)

 Create a Character Concatenate Investigate stage on the Unhandled_Pattern and Unhandled_Data fields
– Use a C mask on UnhandledPattern and an X mask on UnhandledData
– Be sure to increase your sample size
 Add/refine your pattern action sets
 Rerun your Standardize stage and your Character Concatenate
Investigate stage until you have processed all the data correctly

122 IBM Confidential © 2010 IBM Corporation



Lab 12: Audit Standardization Results (Task 2)

 Create a Character Discrete Investigate stage to investigate the contents of all your single domain fields in your Standardize stage result file
 Review the report to ensure that you have domain integrity in all
the fields
 If you identify any errors, go back and modify your rule’s pattern
action sets to correct the mishandling of the data

123 IBM Confidential © 2010 IBM Corporation



Review Questions

 What pattern/actual data value did you find most difficult to process?
 Were you able to attain 100% processing of the data?
 Why did we use a C mask on the unhandled pattern and an X
mask on the unhandled data in our quality control Character
Concatenate Investigate stage?
 Did you need to make any changes to your classification table?

124 IBM Confidential © 2010 IBM Corporation


DB2® v9.1
Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 18: Overview of DB2


● Mod 19: Command Line Processor (CLP) and GUI
usage
● Mod 20: DB2 Environment
● Mod 21: Create an instance and explore the
environment
● Mod 22: Creating databases and data placement
● Mod 23: Creating database objects
● Mod 24: Moving data
● Mod 25: Backup and recovery

© Copyright IBM Corporation 2009 374


Overview of DB2 9 on Linux, UNIX and
Windows

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters

© Copyright IBM Corporation 2010


DB2 product platforms

(Diagram: the DB2 family runs on Windows, Linux/UNIX, zSeries, and iSeries platforms.)

© Copyright IBM Corporation 2010


DB2: The scalable database

IBM InfoSphere Warehouse – (DPF database partitioning) supports clusters of physical or virtual servers

DB2 Enterprise Server – meets data server needs of mid-size to large-size businesses

DB2 Workgroup Server – data server of choice for deployment in a departmental, workgroup, or medium-sized business environment

DB2 Express – DB2 data server with entry-level pricing for small and medium business

DB2 Express-C – free, entry-level edition of the DB2 data server for the developer and partner community

© Copyright IBM Corporation 2010


Accessing DB2

Command Line Processor: commands, interactive SQL, XQuery
Administration Tools: Control Center, Command Editor, Health Center, Task Center, Replication Center, Journal, Tools Settings
Applications (APIs): embedded SQL/XQuery, Call Level Interface, JDBC/SQLJ, DAO/RDO/ADO

All three paths reach the Database Engine through the SQL/XML/API interface.

© Copyright IBM Corporation 2010


DB2 product components

Clients: IBM Data Server Driver Package, IBM Data Server Client, IBM Data Server Runtime Client (each acts as a DRDA application requester)

Communication support connects the clients to DB2 and to DB2 Connect (the DRDA application server)

© Copyright IBM Corporation 2010


Installation prerequisites
• Do you have:
– Enough disk space
– Enough memory
– Available Internet browser
– Security (Authorities) concept for
SYSADM / Admin. Server

• Specific requirements depend on operating system


http://www-01.ibm.com/software/data/db2/9/sysreqs.html

• Product documentation:
http://www.ibm.com/support/docview.wss?rs=71&uid=swg27015148

© Copyright IBM Corporation 2010


Installation steps

• GUI or non-GUI
• Customizing?
• Default instance?
• Admin-Server?
• Notification?
• Installation verification
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Command Line Processor (CLP) and
GUI usage

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools

© Copyright IBM Corporation 2010


CLP Command Line Processor

© Copyright IBM Corporation 2010


CLP syntax

db2 [option-flag] { db2-command | sql-statement | ? [phrase | message | sql-state | class-code] }

© Copyright IBM Corporation 2010


Online reference

• Online Command Reference


db2 ?
db2 ? command string
db2 ? SQLnnnn (nnnn = 4 or 5 digit SQLCODE)
db2 ? nnnnn (nnnnn = 5 digit SQLSTATE)

• Online Reference Manuals

© Copyright IBM Corporation 2010


Using the CLP

Non-interactive mode

db2 connect to musicdb


db2 "select * from syscat.tables" more
(double quotes may be required)

Interactive mode

db2
db2=> connect to musicdb
db2=> select * from syscat.tables
© Copyright IBM Corporation 2010
CLP command options

© Copyright IBM Corporation 2010


Modify CLP options

1 Temporary for Command


db2 -r options.rep list command options
db2 -svtf create.tab3
db2 +c "update tab3 set salary=salary + 100"

2 Temporary for Interactive CLP Session


db2=>update command options using c off a on

3 Temporary for non-interactive CLP Session


export DB2OPTIONS="-svt" (UNIX)
set DB2OPTIONS="-svt" (Windows)
db2 -f create.tab3

4 Every session
Place the setting from point 3 in the UNIX db2profile,
or in the System Program Group on Windows
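A minimal sketch of option 4 on UNIX (the instance owner name db2inst1 is illustrative; sqllib/userprofile is the customary place for user additions so that db2profile itself is left unmodified):

echo 'export DB2OPTIONS="-svt"' >> ~db2inst1/sqllib/userprofile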
© Copyright IBM Corporation 2010
Input file

Edit create.tab
-- comment: db2 -svtf create.tab

connect to sample;

create table tab3


(name varchar(20) not null,
phone char(40),
salary dec(7,2));

select * from tab3;

commit work;

connect reset;

db2 -svtf create.tab

© Copyright IBM Corporation 2010


Input file: Operating system commands

UNIX (vi seltab):
echo "Table Name Is" $1 > out.sel
db2 "select * from $1" >> out.sel

Windows (edit seltab.cmd):
echo 'Table Name Is' %1 > out.sel
db2 select * from %1 >> out.sel

seltab org

Table Name Is org

DEPTNUMB DEPTNAME MANAGER DIVISION LOCATION


10 Head Office 160 Corporate New York
15 New England 50 Eastern Boston
20 Mid Atlantic 10 Eastern Washington
38 South Atlantic 30 Eastern Atlanta
42 Great Lakes 100 Midwest Chicago
51 Plains 140 Midwest Dallas
66 Pacific 270 Western San Francisco
84 Mountain 290 Western Denver
(out.sel contents)

© Copyright IBM Corporation 2010


QUIT/TERMINATE/CONNECT RESET differences

COMMAND         Terminates CLP back-end process   Disconnects database connection
quit            No                                No
terminate       Yes                               Yes
connect reset   No                                Yes, if CONNECT=1 (RUOW)

© Copyright IBM Corporation 2010


CLPPlus
command line processor introduced in DB2 9.7
• CLPPlus:
– SQL*Plus compatible command
– Variable substitution
– Column formatting
– Simple reporting
– Control variables

© Copyright IBM Corporation 2010


CLPPlus features
• Support for establishing connections to databases when a
database user ID and password are provided.
• A buffer that can be used to store scripts, script fragments,
SQL statements, SQL PL statements, or PL/SQL statements
for editing and then execution. Text in the buffer can be listed,
printed, edited, or run as a batch script.
• A comprehensive set of processor commands can be used to
define variables and strings that can be stored in the buffer.
• A set of commands that retrieve information about the
database and database objects.
• Ability to store buffers or buffer output to a file.
• Multiple options for formatting the output of scripts and
queries.
• Support for executing system-defined routines.
• Support for executing operating system commands.
• Option for recording the output of executed commands,
statements, or scripts.
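A minimal interactive sketch (the database name SAMPLE, user db2inst1, and port 50000 are illustrative assumptions):

clpplus db2inst1@localhost:50000/sample
SQL> SET PAGESIZE 24
SQL> COLUMN LASTNAME FORMAT A20
SQL> SELECT EMPNO, LASTNAME FROM EMPLOYEE;
SQL> EXIT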
© Copyright IBM Corporation 2010
Roadmap to the GUI tools

IBM DB2 tools:

Information: Information Center (db2ic)
Command Line Tools: Command Editor (db2ce), Command Window (db2cw), Command Line Processor (db2), Command Line Processor Plus (clpplus)
General Administration Tools: Control Center (db2cc), Journal (db2journal), Replication Center (db2rc), Task Center (db2tc), License Center
Monitoring Tools: Event Analyzer (db2eva), Health Center (db2hc), InDoubt Transaction Manager (db2indbt), Activity Monitor (db2am), Memory Visualizer (db2memvis)
Setup Tools: Configuration Assistant (db2ca), First Steps (db2fs), Register Visual Studio Add-ins (db2vsregister), Satellite Synchronizer

© Copyright IBM Corporation 2010


Control Center

(Screen capture: the Control Center window with its Object Tree pane, Contents pane, and Object Detail pane.)

© Copyright IBM Corporation 2010


Command Editor

© Copyright IBM Corporation 2010


Administration tools

© Copyright IBM Corporation 2010


Using the wizards

© Copyright IBM Corporation 2010


Health Center

© Copyright IBM Corporation 2010


Journal

© Copyright IBM Corporation 2010


Configuration Assistant (1 of 2)
• Used to configure client connections
– Clients can be configured:
• Manually
• Automatically using DB2 Discovery
• Using imported profiles
• Search the network
• DB2 CA can also:
– Connect to databases, both LAN and DRDA
– Perform CLI/ODBC administration
– Bind applications to database

(Diagram: the Configuration Assistant on a remote client uses DB2 Discovery to reach the DB2 Administration Server and DB2 servers.)
© Copyright IBM Corporation 2010
Configuration Assistant (2 of 2)

© Copyright IBM Corporation 2010


DB2 Administration Server (DAS)
• DB2 control point to assist with server tasks
• Must have a running DAS on the server to use the Configuration
Assistant, Control Center, or Task Center
• Might be created and configured at installation time
• Assists with:
– Enabling remote administration of DB2 servers
– Providing the facility for job management, including scheduling
– Providing a means for discovering information about the
configuration of DB2 instances, databases, and other DB2
Administration Servers in conjunction with DB2 Discovery

© Copyright IBM Corporation 2010


Using the Administration Server
• Creating the Administration Server:
– dascrt ASName (Linux/UNIX)
– db2admin create (Windows)

• Starting and stopping the Administration Server:


– db2admin start
– db2admin stop

• Listing the Administration Server:


– db2admin

• Configuring the Administration Server:


– db2 get admin cfg
– db2 update admin cfg using ...

• Removing the Administration Server:


– dasdrop ASName (Linux/UNIX)
– db2admin drop (Windows)
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


The DB2 environment

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration

© Copyright IBM Corporation 2010


What is an instance?
(Diagram: one DB2 product installation hosting Instance_1 and Instance_2. Each instance has its own DBM CFG. Instance_1 contains DB_1 and DB_2; Instance_2 contains DB_3. Each database has its own DB CFG, catalog, and log.)

© Copyright IBM Corporation 2010


The Database Manager instance

(Diagram: a computer system running two Database Manager instances, inst1 and inst2. Each instance manages its own databases; a database contains table spaces, for example Tablespace A and Tablespace B, which hold the tables. A local user application connects to a database in inst1; the target instance is selected through the environment, for example PATH=... and DB2INSTANCE=inst1.)

© Copyright IBM Corporation 2010


Create and drop an instance
• CREATE (different on Linux/UNIX and Windows):
– Prerequisites met?
– Creates the Instance
– Creates the SQLLIB subdirectory
– Creates needed directories in the SQLLIB subdirectory
– Creates the DBM CFG with defaults
db2icrt -u <fencedID> <instance_name> (UNIX/Linux)
db2icrt <instance_name> (Windows)

• DROP:
– Instance must be stopped and no applications are allowed to be
connected to the databases in this instance
– Does not remove databases (catalog entries)
– Removes the Instance
db2idrop <instance_name>

© Copyright IBM Corporation 2010


Starting and stopping an instance

db2start db2stop

© Copyright IBM Corporation 2010


Setting DB2 variable values

Environment variables – platform specific; the environment must be reinitialized after a change (no restart of the system)
Instance-level profile registry – changed with the db2set command; the instance must be stopped and restarted
Global-level profile registry – changed with db2set; no restart of the system after changing
© Copyright IBM Corporation 2010
The db2set command
• Command line tool for DB2 Profile Registry administration
• Displays, sets, resets, or removes profile variables
db2set variable=value
-g
-i instance [db-partition-number]
-n DAS node [[-u user id] [-p password]]
-u
-ul
-ur
-p
-r
-l
-lr
-v
-? (or -h)
-all
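Typical invocations (the instance name db2inst1 is illustrative):

db2set -all (display all defined variables at every level)
db2set -lr (list all supported registry variables)
db2set DB2COMM=tcpip -i db2inst1 (set DB2COMM in the instance-level registry)
db2set DB2COMM= (remove the setting)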

© Copyright IBM Corporation 2010


Database Manager configuration
C:\IBM\SQLLIB\BIN>db2 get dbm cfg

Database Manager Configuration

Node type = Enterprise Server Edition with local and remote clients

Database manager configuration release level = 0x0d00

Maximum total of files open (MAXTOTFILOP) = 16000


CPU speed (millisec/instruction) (CPUSPEED) = 4.605357e-007
Communications bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+002

Max number of concurrently active databases (NUMDB) = 8


Federated Database System Support (FEDERATED) = NO
Transaction processor monitor name (TP_MON_NAME) =

Default charge-back account (DFT_ACCOUNT_STR) =

Java Development Kit installation path (JDK_PATH) = C:\IBM\SQLLIB\java\jdk

Diagnostic error capture level (DIAGLEVEL) = 3


Notify Level (NOTIFYLEVEL) = 3
Diagnostic data directory path (DIAGPATH) =
Size of rotating db2diag & notify logs (MB) (DIAGSIZE) = 0

Default database monitor switches


Buffer pool (DFT_MON_BUFPOOL) = OFF
Lock (DFT_MON_LOCK) = OFF
Sort (DFT_MON_SORT) = OFF
Statement (DFT_MON_STMT) = OFF
Table (DFT_MON_TABLE) = OFF
Timestamp (DFT_MON_TIMESTAMP) = ON
Unit of work (DFT_MON_UOW) = OFF
Monitor health of instance and databases (HEALTH_MON) = ON
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Creating databases and data placement

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to setup and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters

© Copyright IBM Corporation 2010


Create database overview

Database Manager Instance

database1 database2

Tablespace A Tablespace A
Tablespace B
Table 1 Table 2 Table 3 Table 4 Table 1 Table 2

• Databases are created within a Database Manager instance


• Table spaces are a logical layer created within a database
• Tables are created within table spaces

© Copyright IBM Corporation 2010


Database storage requirements
• Database Path:
– Database Control Files for each database
• Includes Database Configuration file, Recovery History, Log Control files,
Table space Control file, Bufferpool Control file and others
– Initial location for database log files
– Default location is dftdbpath in DBM CFG
• Needs to be a local file system
• Automatic Storage paths:
– Allows table space storage to be managed at the database level
– If Automatic storage is enabled there will be at least one path defined
– Initial Storage Paths defined when database is created
• Default System table spaces:
– Use Automatic Storage management by default, if enabled, but can be
defined to use any supported type of table space
– SYSCATSPACE: DB2 catalog tables
– TEMPSPACE1: System Temporary tables, provide work space for
sorting and utility processing
– USERSPACE1: Initial table space for defining user tables and indexes
© Copyright IBM Corporation 2010
DB2 Storage Management basics
• DB2 supports three types of storage management for table spaces
• All three types can be used in a single database
• Storage Management type set when a table space is created
• DMS – Database Managed Storage:
– Table space containers defined using the specified files or raw devices
– Disk space allocated is reserved for objects in that table space
• SMS – System Managed Storage:
– Table space containers defined using the specified directories
– No defined initial size or limit
– DB2 creates files for each database object
– Disk space is freed when objects are dropped
• Automatic Storage Management:
– Disk Storage Paths are assigned to the database
– Can be enabled when a database is created or added to an existing database
– Available space can be monitored at the database partition level
– DB2 defines the number and names for containers
– Uses SMS for temporary storage and DMS for other storage types

© Copyright IBM Corporation 2010


DMS table space: Minimum requirements

Table space creation (4 extents, plus one per additional container):
– container tag: 1 extent per container
– table space header: 1 extent
– table space space map: 1 extent
– object table (metadata): 1 extent

Table (object) creation:
– extent map: 1 extent
– extent for data: 1 extent

Minimum space required in containers = 6 extents (for the first object)

If EXTENTSIZE = 32 pages, then 6 * 32 = 192 pages need to be provided in the containers

© Copyright IBM Corporation 2010


CREATE DATABASE syntax
CREATE {DATABASE | DB} database-name [AT DBPARTITIONNUM]
|Create Database options|

Create Database options:
[AUTOMATIC STORAGE YES | NO] [ON path-or-drive, ... [DBPATH ON path-or-drive]]
[ALIAS db-alias] [USING CODESET codeset TERRITORY territory]
[COLLATE USING {SYSTEM | IDENTITY | ...}] [PAGESIZE 4096 | integer [K]]
[DFT_EXTENT_SZ dft-extentsize] [RESTRICTIVE]
[CATALOG TABLESPACE |tblspace-defn|] [USER TABLESPACE |tblspace-defn|]
[TEMPORARY TABLESPACE |tblspace-defn|]
[WITH "comment-string"] [|autoconfigure-settings|]

tblspace-defn:
MANAGED BY SYSTEM USING ( 'container-string', ... )
| MANAGED BY DATABASE USING ( {FILE | DEVICE} 'container-string' num-pages, ... )
[EXTENTSIZE num-pages] [PREFETCHSIZE num-pages]
[OVERHEAD number-of-milliseconds] [TRANSFERRATE number-of-milliseconds]


© Copyright IBM Corporation 2010
Create Database Wizard

© Copyright IBM Corporation 2010


CREATE DATABASE examples
create database sales1 on /dbsales1
– Database Path: /dbsales1
– Automatic Storage Path: /dbsales1
create database sales2 automatic storage no on
/dbsales2
– Database Path: /dbsales2
– Automatic Storage not enabled
create database sales3 on /dbauto3 dbpath on /dbsales3
– Database Path: /dbsales3
– Automatic Storage Path: /dbauto3
create database sales4 automatic storage yes
on /dbauto41,/dbauto42,/dbauto43
dbpath on /dbsales4
– Database Path: /dbsales4
– Automatic Storage Paths: /dbauto41, /dbauto42 and /dbauto43

© Copyright IBM Corporation 2010


Completed by
DB2 during database creation (1 of 2)
1. Creates database in the specified subdirectory
2. Creates IBMDEFAULTGROUP, IBMCATGROUP, and IBMTEMPGROUP database
partition groups
3. Creates SYSCATSPACE, TEMPSPACE1, and USERSPACE1 table spaces
4. Creates the system catalog tables and recovery logs
5. Catalogs database in local database directory and system database
directory
6. Stores the specified code set, territory, and collating sequence
7. Creates the schemas SYSCAT, SYSFUN, SYSIBM, and SYSSTAT
8. Binds database manager bind files to the database (db2ubind.lst)
– DB2 CLI packages are automatically bound to databases when the databases
are created or migrated. If a user has intentionally dropped a package, then
you must rebind db2cli.lst.

© Copyright IBM Corporation 2010


Completed by
DB2 during database creation (2 of 2)
9. Grants the following privileges:
– ACCESSCTRL , DATAACCESS , DBADM and SECADM
privileges to database creator
– SELECT privilege on system catalog tables and views to PUBLIC
– UPDATE access to the SYSSTAT catalog views
– BIND and EXECUTE privilege to PUBLIC for each successfully
bound utility
– CREATETAB, BINDADD, IMPLICIT_SCHEMA, and CONNECT
authorities to PUBLIC
– USE privilege on USERSPACE1 table space to PUBLIC
– Usage of the WLM workload for user class
SYSDEFAULTUSERCLASS to PUBLIC
• When the RESTRICTIVE option is used, no privileges are
automatically granted to PUBLIC

© Copyright IBM Corporation 2010


Database Path files
CREATE DATABASE DSS ON /dbauto DBPATH ON /database

database DB2INSTANCE=inst20

inst20
db2nodes.cfg

NODEnnnn
SQLDBDIR
sqldbdir

SQLxxxxx
SQLDBCONF SQLBP.1 SQLBP.2
db2rhist.asc db2rhist.bak
SQLOGCTL.LFH.1 SQLOGCTL.LFH.2 SQLOGMIR.LFH
SQLSGF.1 SQLSGF.2
SQLSPCS.1 SQLSPCS.2

SQLOGDIR

S0000000.LOG S0000001.LOG S0000002.LOG

© Copyright IBM Corporation 2010


Default
table space containers with Automatic Storage

CREATE DATABASE DSS ON /dbauto DBPATH ON /database

dbauto DB2INSTANCE=inst20

inst20

NODE0000

DSS
T0000000 SYSCATSPACE
C0000000.CAT

T0000001 TEMPSPACE1
C0000000.TMP

T0000002 USERSPACE1
C0000000.LRG

© Copyright IBM Corporation 2010


System Catalog tables and views

(Diagram: the SYSCAT and SYSSTAT views are defined over base catalog tables such as SYSIBM.SYSTABLES and SYSIBM.SYSCOLUMNS.)

See notes for actual table and view names.

© Copyright IBM Corporation 2010


Create BUFFERPOOL syntax
• Requires SYSCTRL or SYSADM authority
CREATE BUFFERPOOL bufferpool-name [DEFERRED]
[ALL DBPARTITIONNUMS | DATABASE PARTITION GROUP db-partition-group-name, ...]
[SIZE number-of-pages [AUTOMATIC]] (default: SIZE 1000)
[NUMBLOCKPAGES number-of-pages [BLOCKSIZE number-of-pages]] (default: NUMBLOCKPAGES 0)
[PAGESIZE integer [K]]
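For example, a minimal statement (the buffer pool name and size are illustrative):

db2 "CREATE BUFFERPOOL BP32K SIZE 2000 PAGESIZE 32 K"

A table space created with PAGESIZE 32 K must then name a buffer pool of the same page size.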
© Copyright IBM Corporation 2010
Table space, container, extent, page

TABLE SPACE
CONTAINER

EXTENT
DATA
PAGE

© Copyright IBM Corporation 2010


Containers and table spaces
• Container is an Allocation of Physical Space

What does a container look like?
– a directory (SMS)
– a file (DMS)
– a device (DMS)

© Copyright IBM Corporation 2010


Writing to containers
• DFT_EXTENT_SZ defined at database level
• EXTENTSIZE defined at table space level
• Data written in round-robin manner

(Diagram: table space TBS with two containers; extents 0 and 2 are written to Container 0, extents 1 and 3 to Container 1, in round-robin order.)
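For example, with EXTENTSIZE 32 and the two containers above, pages 0-31 (extent 0) are written to Container 0, pages 32-63 (extent 1) to Container 1, pages 64-95 (extent 2) back to Container 0, and so on.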


© Copyright IBM Corporation 2010
Table space design limits: Row Identifiers
• Standard RIDs (Row ID = 4 bytes) – for tables in REGULAR table spaces
– Page size / maximum table space size: 4 KB / 64 GB; 8 KB / 128 GB; 16 KB / 256 GB; 32 KB / 512 GB
– Limits per table: 16M pages, 255 rows per page, about 4x10^9 rows

• Large RIDs (Large Row ID = 6 bytes) – for tables in LARGE table spaces (DMS or Automatic Storage); also all SYSTEM and USER temporary table spaces
– Page size / maximum table space size: 4 KB / 8 TB; 8 KB / 16 TB; 16 KB / 32 TB; 32 KB / 64 TB
– Limits per table: 512M pages, about 2K rows per page, 1.1x10^12 rows
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (1 of 2)
CREATE [LARGE | REGULAR | SYSTEM TEMPORARY | USER TEMPORARY] TABLESPACE tablespace-name
[IN DATABASE PARTITION GROUP db-partition-group-name]
[PAGESIZE integer [K]] (default: PAGESIZE 4096)
[MANAGED BY AUTOMATIC STORAGE |size-attributes|
 | MANAGED BY SYSTEM |system-containers|
 | MANAGED BY DATABASE |database-containers|]
[EXTENTSIZE num-pages | integer [K | M | G]]
[PREFETCHSIZE AUTOMATIC | num-pages | integer [K | M | G]]
[BUFFERPOOL bufferpool-name]
[OVERHEAD number-of-milliseconds] (default: 7.5)
[TRANSFERRATE number-of-milliseconds] (default: 0.06)
[FILE SYSTEM CACHING | NO FILE SYSTEM CACHING]
[DROPPED TABLE RECOVERY ON | OFF] (default: ON)
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (2 of 2)
size-attributes:
[AUTORESIZE YES | NO] [INITIALSIZE integer {K | M | G}]
[INCREASESIZE integer {PERCENT | K | M | G}] [MAXSIZE integer {K | M | G} | NONE]

system-containers:
USING ( 'container-string', ... ) [on-db-partitions-clause]

database-containers:
USING |container-clause| [on-db-partitions-clause]

container-clause:
( {FILE | DEVICE} 'container-string' {num-pages | integer {K | M | G}}, ... )

on-db-partitions-clause:
ON {DBPARTITIONNUM | DBPARTITIONNUMS} ( db-partition-number1 [TO db-partition-number2], ... )
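For example, a minimal DMS definition (paths and sizes are illustrative):

db2 "CREATE TABLESPACE DMS01
MANAGED BY DATABASE
USING (FILE '/db2data/dms01c1' 10000, FILE '/db2data/dms01c2' 10000)
EXTENTSIZE 32 PREFETCHSIZE AUTOMATIC"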

© Copyright IBM Corporation 2010


Create Table Space Wizard

© Copyright IBM Corporation 2010


Storage Management alternatives: Automatic
• Automatic Storage Managed:
– Administration is very easy, no need to define the number or names of
the containers
– Disk space assigned and maintained at the database level
– Monitoring of available space at the database partition level instead of
each table space
– Multiple containers will be created using all available storage paths for
performance
– Automatic Storage can be enabled when the database is created or
added to an existing database
• Default is ON for CREATE DATABASE with DB2 9
– Storage paths can be added or removed using ALTER DATABASE
– Uses standard DMS and SMS under the covers:
• DMS used for REGULAR and LARGE table spaces
• SMS used for SYSTEM and USER TEMPORARY table spaces
– Table space allocation controlled by CREATE/ALTER options:
• INITIALSIZE: Defaults to 32 MB
• AUTORESIZE: Can be set to YES or NO
• INCREASESIZE: Can be set to amount or percent increase
• MAXSIZE: Can define growth limits
© Copyright IBM Corporation 2010
Automatic storage: Table space examples
• Syntax for CREATE and ALTER TABLESPACE:
CREATE TABLESPACE <tsName> [MANAGED BY AUTOMATIC STORAGE]
[INITIALSIZE integer {K|M|G}]
[AUTORESIZE {NO|YES}] [INCREASESIZE integer {PERCENT|K|M|G}]
[MAXSIZE {NONE | integer {K|M|G}}]
• Default initial size is 32 MB
• Default max size is none
• Default increase size is determined by DB2, which might
change over the life of the table space
• Examples:
CREATE TABLESPACE USER1
CREATE TEMPORARY TABLESPACE TEMPTS
CREATE TABLESPACE MYTS INITIALSIZE 100 M MAXSIZE 1 G
CREATE LARGE TABLESPACE LRGTS INITIALSIZE 5 G AUTORESIZE NO
CREATE REGULAR TABLESPACE USER2 INITIALSIZE 500 M
© Copyright IBM Corporation 2010
ALTER TABLESPACE
• ALTER TABLESPACE can be used to change table space
characteristics:
– For all types of table space management, you can adjust:
• Bufferpool Assigned
• Prefetch size
• Overhead and Transfer rate I/O Costs
• File System Caching option
– For DMS-managed table spaces, you can:
• Use the ADD, DROP, RESIZE, EXTEND, REDUCE and BEGIN NEW
STRIPE SET to directly adjust container names and sizes.
• Use MANAGED BY AUTOMATIC STORAGE to convert to automatic
storage management
– For Automatic Storage-managed table spaces, you can:
• Use the REDUCE option to release unused disk space
• Use REBALANCE to reallocate containers when automatic storage
paths are added or dropped
– For DMS and Automatic Storage, you can:
• Change the MAXSIZE, INCREASESIZE and AUTORESIZE settings
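A few illustrative statements (table space names, paths, and sizes are assumptions):

db2 "ALTER TABLESPACE DMS01 PREFETCHSIZE AUTOMATIC"
db2 "ALTER TABLESPACE DMS01 ADD (FILE '/db2data/dms01c3' 10000)"
db2 "ALTER TABLESPACE MYTS MAXSIZE 2 G INCREASESIZE 10 PERCENT"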
© Copyright IBM Corporation 2010
Adding or dropping Automatic Storage Paths
• New Automatic Storage paths can be added to a database using
the ALTER DATABASE statement
alter database add storage on '/dbauto/path3','/dbauto/path4'

– Automatic Storage can be enabled in an existing database by adding


the first paths
– Existing Automatic tablespaces will grow using the previously
assigned storage paths until remaining space is used
– Newly created tablespaces will begin to use all defined paths
– Individual table spaces can be altered using REBALANCE to spread
data over all storage paths

• Storage paths can also be removed using ALTER DATABASE


alter database drop storage on '/dbauto/path1'

– The dropped path will be in drop pending state until an ALTER
TABLESPACE with the REBALANCE option is used for each table
space with containers on the dropped path
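For example, after a path is dropped, each affected table space is rebalanced individually (the table space name is illustrative):

db2 "ALTER TABLESPACE USER1 REBALANCE"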
© Copyright IBM Corporation 2010
Alter Table Space Wizard

© Copyright IBM Corporation 2010


GET SNAPSHOT FOR TABLESPACES command
• Use GET SNAPSHOT FOR TABLESPACES ON <database>
– LIST TABLESPACES does not provide all of the information
available
• Example of the GET SNAPSHOT FOR TABLESPACES command
for database SAMPLE:
Tablespace name = SYSCATSPACE
Tablespace ID = 0

Automatic Prefetch size enabled = Yes
Buffer pool ID currently in use = 1
Buffer pool ID next startup = 1
Using automatic storage = Yes
Auto-resize enabled = Yes

Increase size (bytes) = AUTOMATIC
• Note that Using automatic storage = Yes and Increase size
(bytes) = AUTOMATIC are shown in the snapshot.
© Copyright IBM Corporation 2010
Using the SNAPCONTAINER view
• Use the SYSIBMADM.SNAPCONTAINER view to retrieve
container information
SELECT SNAPSHOT_TIMESTAMP, TBSP_NAME, TBSP_ID,
CONTAINER_NAME, CONTAINER_ID, CONTAINER_TYPE,
ACCESSIBLE, TOTAL_PAGES, USABLE_PAGES,
DBPARTITIONNUM FROM SYSIBMADM.SNAPCONTAINER ORDER
BY DBPARTITIONNUM, TBSP_NAME, CONTAINER_ID

© Copyright IBM Corporation 2010


Using the MON_GET_TABLESPACE function
• The MON_GET_TABLESPACE table function returns monitor metrics
for one or more table spaces.
Syntax
>>-MON_GET_TABLESPACE--(--tbsp_name--,--member--)---><

• Example
To list table spaces ordered by number of physical reads from table
space containers.
SELECT varchar(tbsp_name, 30) as tbsp_name,
member,
tbsp_type,
pool_data_p_reads
FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t
ORDER BY pool_data_p_reads DESC

TBSP_NAME MEMBER TBSP_TYPE POOL_DATA_P_READS


------------------------------ ------ ---------- --------------------
SYSCATSPACE 0 DMS 79
USERSPACE1 0 DMS 34
TEMPSPACE1 0 SMS 0
© Copyright IBM Corporation 2010
Data placement considerations
• If index scans are required:
– Consider separate table spaces for index and data

• Consider placing index table space on fastest media available


• Consider placing entire table onto separate table space
– If accessed frequently

• Consider EXTENTSIZE
– Trade-off between space (very small tables) and performance

• I/O PREFETCH size


– Setting to AUTOMATIC is the default and recommended

© Copyright IBM Corporation 2010


Maintain or List System Database Directory
• CLP
db2 ? CATALOG DATABASE
CATALOG DATABASE database-name [AS alias] [ON path | AT NODE nodename]
[AUTHENTICATION {SERVER | CLIENT | ... | SERVER_ENCRYPT}]
[WITH "comment-string"]

db2 'CATALOG DATABASE ourdb AS TESTDB ON /database WITH "Test Database"'

db2 LIST DATABASE DIRECTORY


...

Database alias = TESTDB


Database name = OURDB
Local database directory = /database
Database release level = a.00
Comment = Test Database
Directory entry type = Indirect
Catalog database partition number = 4

© Copyright IBM Corporation 2010


Additional DB2 System table spaces
• The SYSTOOLSPACE table space is a user data table space used by the DB2
administration tools and some SQL administrative routines for storing historical data
and configuration information:
– Design Advisor
– Alter table notebook
– Configure Automatic Maintenance wizard
– Storage Management tool
– db2look command
– Automatic statistics collection (including the Statistics Collection Required
health indicator)
– Automatic reorganization (including the Reorganization Required health
indicator)
– GET_DBSIZE_INFO stored procedure
– ADMIN_COPY_SCHEMA stored procedure
– ADMIN_DROP_SCHEMA stored procedure
– SYSINSTALLOBJECTS stored procedure
– ALTOBJ stored procedure
• The SYSTOOLSTMPSPACE table space is a user temporary table space used by
the REORGCHK_TB_STATS, REORGCHK_IX_STATS and the ADMIN_CMD
stored procedures for storing temporary data.
© Copyright IBM Corporation 2010
Database configuration
db2 get db cfg for musicdb
show detail

db2 update db cfg for musicdb


using <parm> <value>

© Copyright IBM Corporation 2010


Database Manager configuration
db2 get dbm cfg show detail

db2 update dbm cfg using <parm>


<value>

© Copyright IBM Corporation 2010


ACTIVATE and DEACTIVATE the database
• ACTIVATE DATABASE
– Activates the specified database and starts up all necessary
database services, so that the database is available for connection
and use by any application.
• Incurs the database startup processing
db2 activate db <db_name> user <user> using
<password>

• DEACTIVATE DATABASE
– Databases initialized by ACTIVATE DATABASE can be shut down
using the DEACTIVATE DATABASE command, or using the
db2stop command.
db2 deactivate db <db_name>

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to setup and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Creating database objects

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Exploring the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies

© Copyright IBM Corporation 2010


DB2 object hierarchy

(Diagram: Instance 1, with its dbm configuration file, contains Database 1 and Database 2. Each database has its own DB configuration file, log, and SYSCATSPACE holding the catalog and views. Database 1 holds table spaces TSSMS1, TSDMSLRG1, TSDMSREG2, and TSDMSLRG3 containing tables, indexes, and BLOBs; Database 2 holds USERSPACE1 with its own table and index.)

© Copyright IBM Corporation 2010


Database physical directories and files

/path
[ SAMPLE ]
[ DB2INSTANCE ] [ T0000000 ] C0000000.CAT
SQLCRT.FLG
db2rhist.asc [ T0000001 ] SQLCRT.FLG
[ NODE0000 ] db2rhist.bak
SQLBP.1
[ C0000000.TMP ]
[ sqldbdir ] SQLBP.2
[ T0000002 ] C0000000.LRG
SQLDBCON SQLCRT.FLG
[ SQL00001 ] SQLDBCONF C0000000.LRG
SQLINSLK [ T0000003 ] SQLCRT.FLG
SQLOGCTL.LFH.1
SQLOGCTL.LFH.2 C0000000.LRG
[ T0000004 ] SQLCRT.FLG
SQLOGMIR.LFH
SQLSGF.1 C0000000.LRG
SQLSGF.2 [ T0000005 ] SQLCRT.FLG
SQLSPCS.1
[ db2event ] SQLSPCS.2
SQLTMPLK
[ your_SMS_tablespace ]
[ db2detaildeadlock ]
00000000.EVT
[ CONT1 ]

..
DB2EVENT.CTL SQLTAG.NAM
[ SQLOGDIR ] [ CONT2 ]
S0000000.LOG SQLTAG.NAM
S0000001.LOG SQL00041.DAT

© Copyright IBM Corporation 2010


Create Schema

CREATE SCHEMA PAYROLL AUTHORIZATION DB2ADMIN;


COMMENT ON Schema PAYROLL IS 'schema for
payroll application';
© Copyright IBM Corporation 2010
Set current schema

connect to musicdb user Keith;


select * from employee;
• Will select from KEITH.EMPLOYEE

set current schema = 'PAYROLL';


select * from employee;
• Will select from PAYROLL.EMPLOYEE

© Copyright IBM Corporation 2010


Create Table

© Copyright IBM Corporation 2010


Create Table statement

connect to musicdb;
create table artists
(artno smallint not null,
name varchar(50) with default 'abc',
classification char(1) not null,
bio clob(100K) logged,
picture blob( 10M) not logged compact)

in dms01

index in dms02

long in dms03;

© Copyright IBM Corporation 2010


Application Temporary Tables
• Applications can utilize global temporary tables:
– Global temporary tables are stored in User temporary table spaces
– The associated data is deleted when the application connection ends
– Each connection accesses a private copy of the table, so it only sees data put
into the table by that one connection
– Normal table locking is not needed for global temporary tables since the data is
always limited to a single connection
– Declared Global Temporary table:
• These are defined during application execution using the DECLARE GLOBAL
TEMPORARY TABLE statement
• No DB2 Catalog information is required
• The schema for a declared global temporary table is always ‘SESSION’
– Created Global Temporary table:
• These are defined using a CREATE GLOBAL TEMPORARY TABLE statement
with a user selected schema
• The table and any associated indexes may be created before an application
connection starts, but only the catalog definition exists
• The catalog definition is used when the application references the table
• Each application connection still works only with the data it stores in the table
© Copyright IBM Corporation 2010
Example of a Declared Temporary Table
• A System Administrator creates the user temporary table space
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE
BUFFERPOOL IBMDEFAULTBP ;
• The application uses SQL statements to declare and access the
table
DECLARE GLOBAL TEMPORARY TABLE T1
LIKE REAL_T1
ON COMMIT DELETE ROWS
NOT LOGGED
IN USR_TEMP_TS;
INSERT INTO SESSION.T1
SELECT * FROM REAL_T1 WHERE DEPTNO=:mydept;
• /* do other work on T1 */
• /* when connection ends, table is automatically dropped */
© Copyright IBM Corporation 2010
Example of a Created Temporary Table
• A System Administrator creates the user temporary table
space and defines a global temporary table and indexes
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS2"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE ;

CREATE GLOBAL TEMPORARY TABLE APP1.DEPARTMENT


LIKE PROD.DEPARTMENT
ON COMMIT DELETE ROWS
NOT LOGGED IN USR_TEMP_TS2;

CREATE INDEX APP1.DEPTIX1 ON APP1.DEPARTMENT (DEPTNO);

• The application uses SQL statements to reference the


temporary table; no DECLARE is needed
INSERT INTO APP1.DEPARTMENT
SELECT * FROM PROD.DEPARTMENT WHERE DEPTNO=:mydept;

SELECT * FROM APP1.DEPARTMENT WHERE LASTNAME = 'STOPFER'


© Copyright IBM Corporation 2010
Large objects: The need

(Diagram: content such as an "Action News" clip or the book "DB2 by the Book" is stored in DB2 as large objects – BLOB (Binary Large Object), CLOB (Character Large Object), and DBCLOB (Double-Byte Character Large Object).)
© Copyright IBM Corporation 2010
Table partitioning
• Data organization scheme in which table data is divided across
multiple storage objects called data partitions or ranges:
– Each data partition is stored separately
– These storage objects can be in different table spaces, in the same
table space, or a combination of both

• Benefits:
– Easier roll-in and roll-out of table data
– Allows large data roll-in (ATTACH) or roll-out (DETACH) with a
minimal impact to table availability for applications
– Supports very large tables
– Indexes can be either partitioned (local) or non-partitioned (global)
– Table and Index scans can use partition elimination when access
includes predicates for the defined ranges
– Hierarchical Storage Management (HSM) solutions can be used to
manage older, less frequently accessed ranges
© Copyright IBM Corporation 2010
Example of a partitioned table
• The PARTITION BY RANGE clause defines a set of data ranges
CREATE TABLE PARTTAB.HISTORYPART ( ACCT_ID INTEGER NOT NULL ,
TELLER_ID SMALLINT NOT NULL ,
BRANCH_ID SMALLINT NOT NULL ,
BALANCE DECIMAL(15,2) NOT NULL ,
……….
TEMP CHAR(6) NOT NULL )
PARTITION BY RANGE (BRANCH_ID)
(STARTING FROM (1) ENDING (20) IN TSHISTP1 INDEX IN TSHISTI1 ,
STARTING FROM (21) ENDING (40) IN TSHISTP2 INDEX IN TSHISTI2 ,
STARTING FROM (41) ENDING (60) IN TSHISTP3 INDEX IN TSHISTI3 ,
STARTING FROM (61) ENDING (80) IN TSHISTP4 INDEX IN TSHISTI4 ) ;

CREATE INDEX PARTTAB.HISTPIX1 ON PARTTAB.HISTORYPART (TELLER_ID)


PARTITIONED ;

CREATE INDEX PARTTAB.HISTPIX2 ON PARTTAB.HISTORYPART (BRANCH_ID)


PARTITIONED ;

• In this example the data objects and index objects for each data range
are stored in different table spaces
• The table spaces used must be defined with the same options, such as
type of management, extent size and page size

© Copyright IBM Corporation 2010


Create View

© Copyright IBM Corporation 2010


Create View statement

CONNECT TO TESTDB;
CREATE VIEW EMPSALARY
AS SELECT EMPNO, EMPNAME, SALARY
FROM PAYROLL, PERSONNEL
WHERE EMPNO=EMPNUMB AND SALARY > 30000.00;

SELECT * FROM EMPSALARY

EMPNO EMPNAME SALARY


------ ----------------- ----------
10 John Smith 1000000.00
20 Jane Johnson 300000.00
30 Robert Appleton 250000.00
...

© Copyright IBM Corporation 2010


Create Alias

© Copyright IBM Corporation 2010


Create Alias statement
• Cannot be the same as an existing table, view, or alias

CONNECT TO SAMPLE;
CREATE ALIAS ADMIN.MUSICIANS FOR ADMIN.ARTISTS;
COMMENT ON Alias ADMIN.MUSICIANS IS 'MUSICIANS
alias for ARTISTS';
CONNECT RESET;

© Copyright IBM Corporation 2010


Create Index

© Copyright IBM Corporation 2010


Create Index statements
create unique index dba1.empno on
dba1.employee (empno asc)
pctfree 10
minpctused 10
allow reverse scans
page split symmetric
collect sampled detailed statistics ;

create unique index itemno on albums (itemno) ;

create index item on stock (itemno) cluster ;

create unique index empidx on employee (empno)


include (lastname, firstname) ;

© Copyright IBM Corporation 2010


Overview of Referential Integrity

Department Table (DEPT, DEPTNAME) – parent table, PRIMARY KEY = DEPT
Employee Table (EMPNO, NAME, WKDEPT) – dependent table, FOREIGN KEY = WKDEPT, delete rule RESTRICT

• Place constraints between tables


• Constraints specified with Create and Alter table statements
• Database services enforce constraints: inserts, updates, deletes
• Removes burden of constraint checking from application programs
© Copyright IBM Corporation 2010
RI: Add a foreign key

© Copyright IBM Corporation 2010


Referential Integrity: CREATE TABLE statement
CREATE TABLE DEPARTMENT
(DEPT CHAR (3) NOT NULL,
DEPTNAME CHAR(20) NOT NULL,
CONSTRAINT DEPT_NAME
UNIQUE (DEPTNAME),
PRIMARY KEY (DEPT))
IN DMS99 .... ;

CREATE TABLE EMPLOYEE


(EMPNO CHAR(6) NOT NULL,
NAME CHAR(30),
WKDEPT CHAR(3) NOT NULL,
CONSTRAINT DEPT
FOREIGN KEY
(WKDEPT)
REFERENCES DEPARTMENT
ON DELETE RESTRICT)
IN DMS100 ... ;
© Copyright IBM Corporation 2010
Unique Key considerations
• Multiple keys in one table can be foreign key targets
CREATE TABLE PAY.EMPTAB
(EMPNO SMALLINT NOT NULL PRIMARY KEY,
NAME CHAR(20),
DRIVERLIC CHAR(17) NOT NULL,
CONSTRAINT DRIV_UNIQ UNIQUE(DRIVERLIC)
)IN TBSP1 ;

– Unique indexes PAY.DRIV_UNIQ and SYSIBM.yymmddhhmmssxxx are created
– Columns of a unique key must be NOT NULL

• Deferral of unique checking until end of statement processing


– UPDATE EMPTAB SET EMPNO=EMPNO + 1

© Copyright IBM Corporation 2010


Overview of Check Constraints: The need

DATA RULE : No value in the US_SL


column in the table
SPEED_LIMITS can exceed 65.
(Diagram: with application enforcement, the rule must be coded in every program – PGM 1, PGM 2, ... PGM n. With database enforcement, DB2 applies the rule once, inside the database.)

© Copyright IBM Corporation 2010


Check Constraints: Definition
CREATE TABLE SPEED_LIMITS
(ROUTE_NUM SMALLINT,
CANADA_SL INTEGER NOT NULL,
US_SL INTEGER NOT NULL
CHECK (US_SL <=65) ) ;

ALTER TABLE SPEED_LIMITS


ADD
CONSTRAINT SPEED65
CHECK (US_SL <=65) ;

© Copyright IBM Corporation 2010


Implementation of Check Constraints

© Copyright IBM Corporation 2010


Overview of triggers: The need

BUSINESS RULES:

INSERT, UPDATE, and DELETE actions against a table may require
another action or set of actions to occur to meet the needs of the
business.

(Diagram: with triggers, DB2 enforces these follow-on actions in the database rather than in each application program.)

© Copyright IBM Corporation 2010


Implementation of triggers (1 of 2)

© Copyright IBM Corporation 2010


Implementation of triggers (2 of 2)

© Copyright IBM Corporation 2010


Create Trigger statement

create trigger reorder


after update
of qty on stock
referencing new as n
for each row mode db2sql
when (n.qty <=5)
insert into reorder values (n.itemno, current
timestamp) ;

© Copyright IBM Corporation 2010


Data Row Compression summary
• Data row compression was introduced in DB2 9.1
• COMPRESS option for CREATE and ALTER TABLE is used
to specify compression
• Compression uses a static dictionary:
– Dictionary stored in data object, about 100K in size
– Compression Dictionary needs to be built before a row can be
compressed
– Dictionary can be built or rebuilt using REORG TABLE offline, which
also compresses existing data
– A dictionary will be built automatically when a table reaches a
threshold size (about 2 MB). This applies to SQL INSERT as well as
IMPORT and LOAD (DB2 9.5).
• Compression is intended to:
– Reduce disk storage requirements
– Reduce I/Os for scanning tables
– Reduce buffer pool memory for storing data
• Compression requires DB2 Storage Optimization feature
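A minimal sketch (the table name is illustrative):

db2 "ALTER TABLE SALES.HISTORY COMPRESS YES"
db2 "REORG TABLE SALES.HISTORY RESETDICTIONARY"

REORG with RESETDICTIONARY (re)builds the compression dictionary and compresses the existing rows.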
© Copyright IBM Corporation 2010
Additional compression features added in DB2 9.7
• Multiple algorithms for automatic index compression

• Automatic compression for system and user


temporary tables

• Intelligent compression of large objects and XML

© Copyright IBM Corporation 2010


db2look utility
• DDL and statistics extraction tool, used to
capture table definitions and generate the
corresponding DDL.
• In addition to capturing the DDL for a set of tables, db2look can
create a test system that mimics a production system by
generating the following things:
– The UPDATE statements used to replicate the
statistics on the objects
– The UPDATE DB CFG and DBM CFG statements
for replicating configuration parameter settings
– The db2set statements for replicating registry
settings
– Statements for creating partition groups, buffer
pools, table spaces, and so on
© Copyright IBM Corporation 2010
db2look examples
• To capture all of the DDL for a database (includes all tables, views,
RI, constraints, triggers, and so on):
db2look -d proddb -e -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

• To capture the DDL for one table in particular (table1 in this


example):
db2look -d proddb -e -t table1 -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

• To capture the DDL for all of the tables belonging to a particular


schema (db2user in this example):
db2look -d proddb -e -z db2user -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

© Copyright IBM Corporation 2010


What is XML? (1 of 2)
• XML = eXtensible Markup Language
• XML is self-describing data
• Here is some data from an ordinary delimited flat file:
47; John Doe; 58; Peter Pan; Database systems; 29; SQL;
relational

• What does this data mean?


<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

XML: describes data. HTML: describes display.
© Copyright IBM Corporation 2010
What is XML? (2 of 2)

<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

(Slide callouts label the parts: start tags and end tags such as <authors> and </authors>, attributes such as id="47", elements such as <price>29</price>, and the data between the tags.)

© Copyright IBM Corporation 2010


The XML document tree
<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

(Diagram: XML parsing produces a node tree rooted at book, with children authors, title, price, and keywords; authors holds two author nodes with id attributes, and keywords holds two keyword nodes.)

Each node has a path:
/book
/book/authors
/book/authors/author
/book/authors/author/@id
(...)
© Copyright IBM Corporation 2010
Integration of XML and relational capabilities
• Applications combine XML and relational data
• Native XML data type (server and client side)
• XML capabilities in all DB2 components
(Diagram: a customer client application uses the DB2 client to send SQL/XML or XQuery to the DB2 server; the DB2 engine exposes a relational interface and an XML interface over native relational and XML storage.)

© Copyright IBM Corporation 2010


Two ways to query XML data
• Given the following DDL used to create the table:
CREATE TABLE CUSTOMER (
CID BIGINT NOT NULL,
INFO XML,
HISTORY XML )
IN IBMDB2SAMPLEXML;

• Using XQuery as the primary language


XQUERY db2-fn:xmlcolumn("CUSTOMER.INFO")
– Returns all Documents in column INFO

• Using SQL as the primary language


Select Pid from PRODUCT xmlp
WHERE XMLExists('declare default element namespace
"http://posample.org";$p/product/description[price
< 100]' passing xmlp.description as "p");

© Copyright IBM Corporation 2010


For more information (1 of 2)
• To learn more about DB2 and XML,
including the use of xQuery and
SQL/XML:

• Course CL131 – 5 days


Query and Manage XML Data with
DB2 9

• Course CL121 – 3 days


Query XML data with DB2 9
• Documentation:
pureXML Guide (SC27-2465)
DB2 XQuery Reference (SC27-2466)

© Copyright IBM Corporation 2010


For more information (2 of 2)
Some of our other courses:
• Course CE120 – 2 days
DB2 SQL Workshop
• Course CF18 – 4 days
Relational Database Design
• Course CE130 – 2.5 days
DB2 SQL Workshop for Experienced
Users
• Course DW11 – 4.5 days
Building the Data Warehouse

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Exploring the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Moving data

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending

• Use the SET INTEGRITY command


• Discuss the db2move and db2look commands

© Copyright IBM Corporation 2010


Review INSERT statement
• The SQL INSERT statement can insert one or more rows of
data into tables, nicknames, or views:
– SQL overhead is imposed, such as obeying RI, or Check
Constraints, or Uniqueness, or executing triggers.
– As INSERTs occur, the activity is also stored in logs

• The SQL INSERT might not be the best or fastest method to


load massive amounts of data into a database
Example inserts:

insert into artists (artno, name, classification)
values (100, 'Patti & Cart Wheels', 'S') ;

and

insert into emptemp select * from employee ;

© Copyright IBM Corporation 2010


EXPORT/IMPORT overview

• Must be connected prior to call

Workstation file formats: non-delimited ASCII (ASC), delimited ASCII (DEL), WSF, and PC/IXF

(Diagram: IMPORT reads these files into DB2 for Linux, UNIX, and Windows or DRDA databases, writing log records as it inserts; EXPORT writes table data out to these files.)

© Copyright IBM Corporation 2010


EXPORT data wizard

© Copyright IBM Corporation 2010


EXPORT command syntax (Basic)

EXPORT TO filename OF filetype
[LOBS TO lob-path, ...] [LOBFILE filename, ...]
[MODIFIED BY filetype-mod ...]
[MESSAGES message-file]
select-statement

© Copyright IBM Corporation 2010


EXPORT command example

• Exports data from database tables to file


• Check message for error or warning messages

DB2
MUSICDB
EXPORT
artexprt
artists

db2 connect to musicdb


db2 export to artexprt
of ixf messages artmsg
select artno, name
from artists
© Copyright IBM Corporation 2010
IMPORT data wizard

© Copyright IBM Corporation 2010


IMPORT command syntax (Basic)
IMPORT FROM filename OF filetype
[LOBS FROM lob-path, ...]
[ALLOW NO ACCESS | ALLOW WRITE ACCESS]
[MODIFIED BY filetype-mod ...]
[COMMITCOUNT n | COMMITCOUNT AUTOMATIC] [RESTARTCOUNT n] [MESSAGES message-file]
{INSERT | INSERT_UPDATE | REPLACE | REPLACE_CREATE} INTO table-name [( insert-column, ... )]
| CREATE INTO table-name [( insert-column, ... )] [|tblspace-specs|]

tblspace-specs:
IN tablespace-name [INDEX IN tablespace-name] [LONG IN tablespace-name]
© Copyright IBM Corporation 2010
IMPORT command example
• Imports data from a file to a database table
db2 connect to musicdb
db2 import from artexprt of ixf messages artmsg create into artists
(in place of create, the action may be: insert, insert_update, replace,
replace_create)

• Imports data from a file into three DMS table spaces:

db2 connect to musicdb
db2 import from artexprt of ixf messages artmsg
    create into artists [(column_list)]
    in <tablespace> index in <indextablespace> long in <largetablespace>

[Figure: IMPORT reads the file artexprt and inserts into the artists table
in the MUSICDB database]
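
When the target table already exists with a primary key, INSERT_UPDATE
inserts new rows and updates matching ones; a minimal sketch, with an
illustrative commit interval:

db2 import from artexprt of ixf commitcount 1000 messages artmsg
    insert_update into artists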
© Copyright IBM Corporation 2010
Differences between IMPORT and LOAD

IMPORT:
• Slow when moving large amounts of data
• Creation of table/index supported with IXF format
• Import into tables, views, aliases
• Option to ALLOW WRITE ACCESS
• All rows logged
• Triggers fired, constraints enforced
• Inserts can use space freed by deleted rows

LOAD:
• Faster for large amounts of data
• Tables and indexes must exist
• Load into tables only
• Existing data can still be seen by read applications
• Minimal logging; can make copy
• Triggers not fired; unique constraints enforced, RI and check constraints
  via SET INTEGRITY
• LOAD builds new extents
© Copyright IBM Corporation 2010
Four phases of Load

1. LOAD
   Load data into tables
   Collect index keys / sort
   Consistency points at SAVECOUNT
   Invalid data rows in dump file; messages in message file

2. BUILD
   Indexes created or updated

3. DELETE
   Unique key violations placed in exception table
   Messages generated for unique key violations
   Deletes unique key violation rows

4. INDEX COPY
   Copy indexes from temporary table space to index table space

© Copyright IBM Corporation 2010


LOAD: Load phase
• During the LOAD phase, data is stored in a table and index
keys are collected
• Save points are established at intervals
• Messages indicate the number of input rows successfully
loaded
• If a failure occurs in this phase, use the RESTART option for
LOAD to restart from the last successful consistency point
[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Build phase
• During the BUILD phase, indexes are created based on the
index keys collected in the Load phase
• The index keys are sorted during the Load phase
• If a failure occurs during this phase, LOAD restarts from the
BUILD phase

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Delete phase
• During the DELETE phase, all rows that have violated a unique constraint
are deleted
• If a failure occurs, LOAD restarts from the DELETE phase
• Once the database indexes are rebuilt, information about the rows
containing the invalid keys is placed in an exception table, which must be
created before the LOAD
• Finally, any duplicate keys found are deleted

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Index Copy phase
• The index data is copied from a system temporary table
space to the original table space.
• This will only occur if a system temporary table space was
specified for index creation during a load operation with the
READ ACCESS option specified.

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
Load data wizard

© Copyright IBM Corporation 2010


LOAD command syntax (Basic)

LOAD [CLIENT] FROM { file | pipe | device | cursorname } ,...
  OF { ASC | DEL | IXF | CURSOR }
  [MODIFIED BY filetype-mod ...]
  [SAVECOUNT n] [ROWCOUNT n] [WARNINGCOUNT n] [MESSAGES msg-file]
  { INSERT | REPLACE | RESTART | TERMINATE }
      INTO table-name [( insert-column ,... )]
  [FOR EXCEPTION table-name]
  [statistics options] [copy options]
  [ALLOW NO ACCESS | ALLOW READ ACCESS [USE tablespace-name]]
  [set integrity pending options] [LOCK WITH FORCE]

© Copyright IBM Corporation 2010


LOAD scenario

[Figure: the delimited input file calpar.del is loaded into table cal.par,
which has a primary key (unique index) and a not null numeric column;
duplicate-key rows are routed to the exception table cal.parexp, rejected
rows go to dump.fil, and messages go to par.msgs]

Steps:
create tables/indexes
obtain delimited input file in sorted format
create exception table

db2 load from calpar.del of del
    modified by dumpfile=<path>/dump.fil
    warningcount 100 messages par.msgs
    insert into cal.par for exception cal.parexp
db2 load query table cal.par

examine par.msgs file
examine cal.parexp exception table

© Copyright IBM Corporation 2010


Rules and methods for creating Exception Tables
• The first n columns are the same as the base table
• No constraints and no trigger definitions
• Column n + 1: TIMESTAMP
• Column n + 2: CLOB (32 KB)
• The user requires INSERT privilege

CREATE TABLE T1EXC LIKE T1;

ALTER TABLE T1EXC
  ADD COLUMN TS TIMESTAMP
  ADD COLUMN MSG CLOB(32K);

or

CREATE TABLE T1EXC AS
  (SELECT T1.*,
          CURRENT TIMESTAMP AS TS,
          CLOB('', 32767) AS MSG
     FROM T1)
  DEFINITION ONLY;

© Copyright IBM Corporation 2010


Offline versus Online Load
• ALLOW NO ACCESS (offline):
  – The load requests and is granted an exclusive lock; applications have
    read/write access before the lock is granted and again after the load
    commits, but no access while the load runs

• ALLOW READ ACCESS (online):
  – The load first drains existing writers, then allows read access for most
    of the load; a super-exclusive lock is requested and held briefly just
    before the load commits

[Figure: timelines contrasting the locking behavior of ALLOW NO ACCESS and
ALLOW READ ACCESS loads]

© Copyright IBM Corporation 2010


Table states (Load Pending, Set Integrity Pending)
• LOAD QUERY TABLE <table-name>
• Tablestate:
Normal
Set Integrity Pending
Load in Progress
Load Pending
Reorg Pending
Read Access Only
Unavailable
Not Load Restartable
Unknown
• Table can be in several states at same time
Tablestate:
Set Integrity Pending
Load in Progress
Read Access Only
© Copyright IBM Corporation 2010
Checking Load status: Load query
db2 load query table prod.table1
SQL3501W The table space(s) in which the table resides will not be placed in
backup pending state since forward recovery is disabled for the database.

SQL3109N The utility is beginning to load data from file
"C:\cf45\reorg\savehist.del".

SQL3500W The utility is beginning the "LOAD" phase at time "03/29/2005
21:30:24.468073".

SQL3519W Begin Load Consistency Point. Input record count = "0".
SQL3520W Load Consistency Point was successful.
........................
SQL3519W Begin Load Consistency Point. Input record count = "48314".
SQL3520W Load Consistency Point was successful.

SQL0289N Unable to allocate new pages in table space "LOADTSPD".
SQLSTATE=57011

SQL3532I The Load utility is currently in the "LOAD" phase.

Number of rows read = 48314


Number of rows skipped = 0
Number of rows loaded = 48314
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 48314
Number of warnings = 0

Tablestate:
Load Pending

© Copyright IBM Corporation 2010


Load monitoring: LIST UTILITIES

db2 LIST UTILITIES SHOW DETAIL


ID = 2
Type = LOAD
Database Name = MUSICDB
Partition Number = 0
Description = ONLINE LOAD DEL AUTOMATIC INDEXING INSERT NON
-RECOVERABLE ADMIN.LOA
Start Time = 03/30/2005 15:57:42.354567
Progress Monitoring:
Phase Number = 1
Description = SETUP
Total Work = 0 bytes
Completed Work = 0 bytes
Start Time = 03/30/2005 15:57:42.354573
Phase Number = 2
Description = LOAD
Total Work = 10000 rows
Completed Work = 10000 rows
Start Time = 03/30/2005 15:57:50.669476
Phase Number [Current] = 3
Description = BUILD
Total Work = 2 indexes
Completed Work = 2 indexes
Start Time = 03/30/2005 15:57:51.182767

© Copyright IBM Corporation 2010


Load Pending state:
Recovering from LOAD failure
• Restart the Load:
  – Check the messages file
  – Use the RESTART option
  – The load operation automatically continues from the last consistency
    point in the Load or Build phase
    • or the Delete phase if ALLOW NO ACCESS

• Replace the whole table:
  – LOAD ... REPLACE

• Terminate the Load:
  – If LOAD ... INSERT, returns the table to the state preceding the load
  – If LOAD ... REPLACE, the table will be truncated to an empty state
    • Create a backup copy prior to the load to return to the state
      preceding the load

• Do not tamper with Load temporary files
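
Minimal sketches of the restart and terminate options, reusing the
illustrative names from the earlier scenario (calpar.del, cal.par, par.msgs):

db2 load from calpar.del of del messages par.msgs restart into cal.par
db2 load from calpar.del of del terminate into cal.par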


© Copyright IBM Corporation 2010
Backup Pending state: COPY options
• When loading data and forward recovery is enabled:
– COPY NO (default)
• During load, Backup pending and Load in progress
• After load, Backup Pending
– COPY YES
• Load has made Copy not Backup pending
– NONRECOVERABLE
• No copy made and no backup required

Copy image naming (UNIX and Windows):
Databasealias.Type.Instancename.Nodename.Catnodename.Timestamp.number

Type: 0 = Full Backup
      3 = Table Space Backup
      4 = Copy from Table Load
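
A minimal sketch of a recoverable load that writes its own copy image (the
path /backup/copies is illustrative):

db2 load from calpar.del of del messages par.msgs
    insert into cal.par copy yes to /backup/copies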

© Copyright IBM Corporation 2010


Set Integrity Pending table state
• Load turns OFF constraint checking:
– Leaves table in Set Integrity Pending state
– If parent table is in Set Integrity Pending, then dependents may also
be in Set Integrity Pending
– LOAD INSERT, ALLOW READ ACCESS
• Loaded table in Set Integrity Pending with read access
– LOAD INSERT, ALLOW NO ACCESS
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
DEFERRED
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
IMMEDIATE
• Loaded table and descendant foreign key tables are in Set Integrity
Pending with no access

© Copyright IBM Corporation 2010


SET INTEGRITY command syntax (Basic)
SET INTEGRITY
  FOR table-name ,... OFF
| FOR table-name ,... IMMEDIATE CHECKED [exception-clause]
| FOR table-name ,... { ALL | FOREIGN KEY | CHECK } IMMEDIATE UNCHECKED

exception-clause:
  FOR EXCEPTION { IN table-name USE table-name } ,...

© Copyright IBM Corporation 2010


Set Integrity Pending state

[Figure: before SET INTEGRITY, SYSCAT.TABLES shows cal.par and cal.for with
STATUS = C (check pending) and read-only ACCESS_MODE; after, rows violating
the primary key of cal.par or the foreign key of cal.for are moved to the
exception tables cal.parexp and cal.forexp, each with a timestamp and
message column]

db2 SET INTEGRITY FOR cal.par, cal.for IMMEDIATE CHECKED
    FOR EXCEPTION IN cal.par USE cal.parexp,
                  IN cal.for USE cal.forexp

© Copyright IBM Corporation 2010


Meaningful steps for LOAD
• Create tables and indexes
• Create exception tables
• Sort data
• Back up TS/DB (if using REPLACE)
• Consider freespace
• Load .... for Exception ... Savecount ... Warningcount...
• Examine xx.msg and dumpfile (after LOAD completes)
• Examine exception tables (after LOAD completes)
• Back up table space if log retain=recovery and COPY NO
• Set Integrity for .... (only if table in Set Integrity Pending state)
• Update statistics (if necessary)

© Copyright IBM Corporation 2010


db2move utility
• Facilitates the moving/copying of large numbers of tables
between databases
• Can be used with db2look to move/copy a database between
different platforms (for example, AIX to Windows)
• Usage: db2move <dbName> EXPORT/IMPORT/LOAD/COPY [options]
• EXPORT: The system catalogs are queried, a list of tables is
compiled (based on the options), and the tables are exported in
IXF format -- additionally, a file called db2move.lst is created
• IMPORT: The db2move.lst file is used to import the IXF files
created in the EXPORT step
• LOAD: The db2move.lst file is used to load the PC/IXF data files
created in the EXPORT step
• COPY: Duplicates schema(s) into a target database
© Copyright IBM Corporation 2010
db2move options: Export/Import/Load
• EXPORT:
  – -tc table creator list
  – -tn table name list
  – -sn schema name list
  – -ts table space list
  – -tf file contains list of tables
  db2move sample export -tc admin -tn EXPL*

• IMPORT:
  – -io defaults to REPLACE_CREATE
    (CREATE, INSERT, INSERT_UPDATE, REPLACE)
  db2move sample import -io replace

• LOAD:
  – -lo defaults to INSERT
    (REPLACE)
  db2move sample load -lo insert

Sample db2move.lst:
!"ADMIN"."EXPLAIN_INSTANCE"!tab1.ixf!tab1.msg!
!"ADMIN"."EXPLAIN_STATEMENT"!tab2.ixf!tab2.msg!
!"ADMIN"."EXPLAIN_ARGUMENT"!tab3.ixf!tab3.msg!
!"ADMIN"."EXPLAIN_OBJECT"!tab4.ixf!tab4.msg!
!"ADMIN"."EXPLAIN_OPERATOR"!tab5.ixf!tab5.msg!
!"ADMIN"."EXPLAIN_PREDICATE"!tab6.ixf!tab6.msg!
!"ADMIN"."EXPLAIN_STREAM"!tab7.ixf!tab7.msg!

© Copyright IBM Corporation 2010


db2move COPY option
Copy one or more schemas between DB2 databases
Uses a -co option to specify:
• Target Database:
"TARGET_DB <db name> [USER <userid> USING <password>]"
• MODE:
– DDL_AND_LOAD - Creates all supported objects from the source schema and
  populates the tables with the source table data (default)
– DDL_ONLY - Creates all supported objects from the source schema, but does
  not repopulate the tables
– LOAD_ONLY - Loads all specified tables from the source database to the
  target database; the tables must already exist on the target
• SCHEMA_MAP: Allows user to rename schema when copying to target
• TABLESPACE_MAP: Table space name mappings to be used
• Load Utility option: COPY NO or Nonrecoverable
• Owner: Change the owner of each new object created in the target
schema

© Copyright IBM Corporation 2010


db2move COPY schema examples
• To duplicate schema schema1 from source database dbsrc to target database
dbtgt, issue:

  db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
      USER myuser1 USING mypass1

• To duplicate schema schema1 from source database dbsrc to target database
dbtgt, rename the schema to newschema1 on the target, and map source table
space ts1 to ts2 on the target, issue:

  db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
      USER myuser1 USING mypass1
      SCHEMA_MAP ((schema1,newschema1))
      TABLESPACE_MAP ((ts1,ts2), SYS_ANY)

Output files generated (timestamped): COPYSCHEMA.msg, COPYSCHEMA.err,
LOADTABLE.msg, LOADTABLE.err
© Copyright IBM Corporation 2010
Online Table Move stored procedure
• The ADMIN_MOVE_TABLE procedure available with DB2 9.7
is designed to move data from a source table to a target table
with a minimal impact to application access
– Changes that can be made using ADMIN_MOVE_TABLE:
• New Data, Index or Long table spaces, which could have a different
page size, extent size or type of table space management (like
moving from SMS to Automatic Storage)
• Data compression could be implemented during the move
• MDC clustering can be added or changed
• Range partitions can be added or changed
• Distribution keys can be changed for DPF tables
• Columns can be added, removed or changed
– Multiple phased processing allows write access to the source table
except for a short outage required to swap access to the target table

© Copyright IBM Corporation 2010


ADMIN_MOVE_TABLE: Processing phases
[Figure: ADMIN_MOVE_TABLE processing]
1. INIT phase: creates the triggers on the source table, the target table,
   and a staging table; protocol entries are recorded in
   SYSTOOLS.ADMIN_MOVE_TABLE (tabschema, tabname, key, value)
2. COPY phase: rows are copied from the source table to the target table
3. REPLAY phase: keys of rows changed by the online workload (INSERT,
   UPDATE, and DELETE activity captured via the triggers into the staging
   table) are used to re-copy those rows from the source table
4. SWAP phase: the target table is renamed to replace the source table
© Copyright IBM Corporation 2010
ADMIN_MOVE_TABLE procedure methods
• There are two methods of calling ADMIN_MOVE_TABLE:

One method specifies how to define the target table:

ADMIN_MOVE_TABLE( tabschema, tabname,
                  data_tbsp, index_tbsp, lob_tbsp,
                  mdc_cols, partkey_cols, data_part, coldef,
                  options, ..., operation )

The second method allows a predefined table to be specified as the target
for the move:

ADMIN_MOVE_TABLE( tabschema, tabname,
                  target_tabname, options, ..., operation )
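
A minimal sketch of the second method, assuming a precreated target table
MDC.HIST1_NEW (the names are illustrative):

call SYSPROC.ADMIN_MOVE_TABLE('MDC', 'HIST1', 'HIST1_NEW', 'KEEP', 'MOVE')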
© Copyright IBM Corporation 2010
Example: Move a table to new table space
call SYSPROC.ADMIN_MOVE_TABLE
  ( 'MDC', 'HIST1', 'MDCTSP2', 'MDCTSPI', 'MDCTSP2', NULL, NULL, NULL, NULL,
    'KEEP, REORG', 'MOVE' )
('KEEP, REORG': the source table is kept and the target is reorganized)

KEY VALUE
-------------------------------- --------------------------------------------------
AUTHID INST461
CLEANUP_END 2009-06-01-10.57.55.282196
CLEANUP_START 2009-06-01-10.57.55.050882
COPY_END 2009-06-01-10.57.36.243768
COPY_INDEXNAME HIST1IX1
COPY_INDEXSCHEMA MDC
COPY_OPTS ARRAY_INSERT,CLUSTER_OVER_INDEX
COPY_START 2009-06-01-10.57.25.132774
COPY_TOTAL_ROWS 100000
INDEXNAME MDC.HIST1IX3
INDEXSCHEMA INST461
INDEX_CREATION_TOTAL_TIME 7
INIT_END 2009-06-01-10.57.24.880488
INIT_START 2009-06-01-10.57.21.220930
ORIGINAL HIST1AATOhxo
REPLAY_END 2009-06-01-10.57.53.870351
REPLAY_START 2009-06-01-10.57.36.244234
REPLAY_TOTAL_ROWS 0
REPLAY_TOTAL_TIME 0
STATUS COMPLETE
SWAP_END 2009-06-01-10.57.54.995848
SWAP_RETRIES 0
SWAP_START 2009-06-01-10.57.54.255941
VERSION 09.07.0000
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending

• Use the SET INTEGRITY command


• Discuss the db2move and db2look commands

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Backup and recovery

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Types of failure

• Media
• Operational
• Hardware
• Software
• Power
© Copyright IBM Corporation 2010
Database objects
DB2 recovery is based on restoring at the database or table space level.

[Figure: a database containing the table spaces SYSCATSPACE, USERSPACE1,
and TBSP1; each table space holds tables and indexes and is stored in one
or more containers]

© Copyright IBM Corporation 2010


DB2 Database recovery methods
[Figure: an offline database-level backup is taken at 1 AM; DB2 log files
(Log 1 through Log 6) record all subsequent changes to the database and its
table spaces]

1: Crash Recovery – the active logs are replayed after a failure to bring
   the database back to a consistent state
2: Version Recovery – a RESTORE returns the database to the state captured
   in the backup image
3: Roll Forward Recovery – after a restore, the logs are reapplied to bring
   the database forward (for example, to 3 PM or to the end of the logs)

© Copyright IBM Corporation 2010


Introduction to logging

[Figure: requests for reads and writes operate on pages in the buffer pool
(TABLES); an Insert logs the new row, an Update logs the old and new row,
and a Delete logs the old row into the log buffer, which is externalized to
the LOGS on commit or when the log buffer is full]

© Copyright IBM Corporation 2010


Circular logging (Non-recoverable database)

"n" PRIMARY 2 1 "n"

SECONDARY

© Copyright IBM Corporation 2010


Archival logging (Recoverable database)

Manual or automatic archive.

• ACTIVE — Contains information for non-committed or non-externalized
  transactions
• ONLINE ARCHIVE — Contains information for committed and externalized
  transactions; stored in the ACTIVE log subdirectory
• OFFLINE ARCHIVE — Archive moved from the ACTIVE log subdirectory (may
  also be on other media)

© Copyright IBM Corporation 2010


Log file information

[Figure: timeline of log files: online archive logs, followed by the active
log files, followed by unused log files; the loghead pointer marks the
start point for crash recovery]

SQLLOGCTL.LFH (1 and 2) contains:
— Log configuration
— Log files to be archived
— Active logs
— Start point for crash recovery
© Copyright IBM Corporation 2010
Location of log files

[Figure: by default the active logs (S~0.LOG, S~1.LOG, ...) reside in
SQLOGDIR under the database directory (Instance/NODE0000/SQL0000n);
NEWLOGPATH relocates them to a path of your choice, ideally on a different
disk drive and potentially a different disk controller unit;
MIRRORLOGPATH maintains a second copy of the logs on another path]

© Copyright IBM Corporation 2010


Configure database logs: Parameters

• Primary log files (LOGPRIMARY)
• Secondary log files (LOGSECOND)
• Log size (LOGFILSIZ)
• Log buffer (LOGBUFSZ)
• New log path (NEWLOGPATH)
• Mirror log path (MIRRORLOGPATH)
• Primary log archive method (LOGARCHMETH1)
• Secondary log archive method (LOGARCHMETH2)
• Number of commits to group (MINCOMMIT)
• Log records to write before soft checkpoint (SOFTMAX)

© Copyright IBM Corporation 2010


Configure database logs: GUI (Control Center)

© Copyright IBM Corporation 2010


Configure database logs: Wizard (CC)

© Copyright IBM Corporation 2010


Configure database logs: Command line
• Windows: db2 get db cfg for musicdb | find /i "log"
• Linux/UNIX: db2 get db cfg for musicdb | grep -i log
• Update: db2 update db cfg for musicdb using logprimary 15
Log retain for recovery status = NO
User exit for logging status = NO
Catalog cache size (4KB) (CATALOGCACHE_SZ) = 260
Log buffer size (4KB) (LOGBUFSZ) = 98
Log file size (4KB) (LOGFILSIZ) = 1024
Number of primary log files (LOGPRIMARY) = 13
Number of secondary log files (LOGSECOND) = 4
Changed path to log files (NEWLOGPATH) =
Path to log files = C:\DB2\NODE0000\SQL00003\SQLOGDIR\
Overflow log path (OVERFLOWLOGPATH) =
Mirror log path (MIRRORLOGPATH) =
First active log file =
Block log on disk full (BLK_LOG_DSK_FUL) = NO
Percent max primary log space by transaction (MAX_LOG) = 0
Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0
Percent log file reclaimed before soft chckpt (SOFTMAX) = 520
Log retain for recovery enabled (LOGRETAIN) = OFF
User exit for logging enabled (USEREXIT) = OFF
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
First log archive method (LOGARCHMETH1) = OFF
Options for logarchmeth1 (LOGARCHOPT1) =
Second log archive method (LOGARCHMETH2) = OFF
Options for logarchmeth2 (LOGARCHOPT2) =
Failover log archive path (FAILARCHPATH) =
Number of log archive retries on error (NUMARCHRETRY) = 5
Log archive retry Delay (secs) (ARCHRETRYDELAY) = 20
Log pages during index build (LOGINDEXBUILD) = OFF
© Copyright IBM Corporation 2010
DB2 recovery-related system files
• Recovery History File - db2rhist.asc:
  – Created during the CREATE DATABASE command
  – Updated by DB2 utility execution:
    • Back up database or table space
    • Restore database or table space
    • Roll forward database or table space
    • Load table
    • Reorg table
    • DB2 log file archival
    • Create/Alter/Rename table space
    • Quiesce table spaces for table
    • Drop table (optional)
  – Included on each DB2 backup file
  – db2 LIST HISTORY command

• Log Control File - SQLOGCTL.LFH.1 and SQLOGCTL.LFH.2:
  – Used during crash recovery
  – Disk updated at end of each log file
  – Use a SOFTMAX value < 100 (% of LOGFILSIZ) to refresh pointers more often
  – Included at end of each DB2 backup

[Figure: db2rhist.asc, SQLOGCTL.LFH.1, and SQLOGCTL.LFH.2 reside in the
database directory, Instance/NODE0000/SQL0000n]
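
The history file can be queried from the command line; for example, to list
the backup entries for the MUSICDB database:

db2 connect to musicdb
db2 list history backup all for musicdb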
© Copyright IBM Corporation 2010
BACKUP command

BACKUP DATABASE MUSICDB

[Figure: a user with SYSADM, SYSCTRL, or SYSMAINT authority backs up the
local or remote MUSICDB database to disk, tape, TSM, XBSA, or a vendor
backup shared library; the buffer size and number of buffers are tunable]

© Copyright IBM Corporation 2010


Backup Utility options

BACKUP DATABASE database-alias [USER username [USING password]]
  [TABLESPACE (tblspace-name [ {,tblspace-name} ... ])] [ONLINE]
  [INCREMENTAL [DELTA]]
  [USE {TSM | XBSA} [OPEN num-sess SESSIONS]
     [OPTIONS {options-string | options-filename}]
   | TO dir/dev [ {,dir/dev} ... ]
   | LOAD lib-name [OPEN num-sess SESSIONS]
     [OPTIONS {options-string | options-filename}]]
  [WITH num-buff BUFFERS] [BUFFER buffer-size] [PARALLELISM n]
  [COMPRESS [COMPRLIB lib-name [EXCLUDE]] [COMPROPTS options-string]]
  [UTIL_IMPACT_PRIORITY [priority]]
  [{INCLUDE | EXCLUDE} LOGS] [WITHOUT PROMPTING]
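
Minimal sketches of an offline and an online backup (the target path
/backup is illustrative):

db2 backup db musicdb to /backup
db2 backup db musicdb online to /backup compress include logs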
© Copyright IBM Corporation 2010
BACKUP using the GUI interface

© Copyright IBM Corporation 2010


Online versus Offline Backup

© Copyright IBM Corporation 2010


The backup files
• File name for backup images on disk has:
– Database alias
– Type of backup (0=full, 3=table space, 4=copy from table load)
– Instance name
– Node number
– Catalog node number
– Timestamp of backup (YYYYMMDDHHMMSS)
– Sequence number (for multiple files)
MUSICDB.0.DB2.NODE0000.CATN0000.20071013150616.001

• Tape images are not named, but internally contain the same
information in the backup header for verification purposes
• Backup history provides key information in easy-to-use format

© Copyright IBM Corporation 2010


RESTORE command

RESTORE DATABASE MUSICDB

[Figure: a user with SYSADM, SYSCTRL, or SYSMAINT authority restores the
local or remote MUSICDB database from disk, tape, TSM, XBSA, or a vendor
backup shared library; the buffer size and number of buffers are tunable]

© Copyright IBM Corporation 2010


Syntax of the RESTORE command
RESTORE {DATABASE | DB} source-database-alias
  [restore-options | CONTINUE | ABORT]

Restore options:
  [USER username [USING password]]
  [TABLESPACE [ONLINE]
   | TABLESPACE ( tablespace-name ,... ) [ONLINE]
   | HISTORY FILE [ONLINE]]
  [USE TSM [OPEN num-sessions SESSIONS]
   | FROM { directory | device } ,...
   | LOAD shared-library [OPEN num-sessions SESSIONS]]
  [TAKEN AT date-time]
  [TO target-directory] [INTO target-database-alias] [NEWLOGPATH directory]
  [WITH num-buffers BUFFERS] [BUFFER buffer-size] [REPLACE EXISTING] [REDIRECT]
  [WITHOUT ROLLING FORWARD] [WITHOUT PROMPTING] ...
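
A minimal sketch restoring a specific backup image into a new database (the
timestamp and path are illustrative):

db2 restore db musicdb from /backup taken at 20071013150616 into musicdb2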

© Copyright IBM Corporation 2010


RESTORE using the GUI interface

© Copyright IBM Corporation 2010


Backup/restore table space considerations
• Roll-forward must be enabled
• Can choose to restore a subset of table spaces
• Generally best to put multiple table spaces in one backup image:
  – Makes the table space recovery strategy easier
  – Provides access to related table spaces and coherent management of
    these table spaces

• Handling of long/LOB/XML data requires a correlated strategy


• Point-in-time recovery is supported, but has requirements
• Faster recovery for Catalogs using Tablespace Level backup
• Critical business application tables should obviously be the
focus of the backup/restore, but other tables are needed in
support of these tables
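
A minimal sketch of an online table space backup (the names are
illustrative):

db2 backup db musicdb tablespace (tbsp1, tbsp2) online to /backup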
© Copyright IBM Corporation 2010
Roll forward pending state
• Roll forward pending is set as a result of:
– Restore of offline database backup omitting the command option
WITHOUT ROLLING FORWARD
– Restore of an online database backup
– Restore of any table space level backup
– DB2 detects media failure isolated at a table space

• Scope of pending state managed by DB2:


– Database in pending state will not permit any activity
– Table spaces in pending state will permit access to other table
spaces

© Copyright IBM Corporation 2010


Syntax of the ROLLFORWARD command
ROLLFORWARD {DATABASE | DB} database-alias
  [ TO { isotime [ON ALL DBPARTITIONNUMS] [USING UTC TIME | USING LOCAL TIME]
       | END OF BACKUP [ON ALL DBPARTITIONNUMS]
       | END OF LOGS [On Database Partition clause] }
      [AND COMPLETE | AND STOP]
  | {COMPLETE | STOP | CANCEL} [On Database Partition clause]
  | QUERY STATUS [USING UTC TIME | USING LOCAL TIME] ]
  [TABLESPACE ONLINE | TABLESPACE ( tablespace-name ,... ) [ONLINE]]
  [OVERFLOW LOG PATH ( log-directory [, Log Overflow clause] )]
  [NORETRIEVE]

© Copyright IBM Corporation 2010


ROLLFORWARD: GUI interface

© Copyright IBM Corporation 2010


ROLLFORWARD: How far?
• END OF LOGS: (Apply as many changes as possible):
– Rollforward will apply all available logs beginning with the logs associated with
the backup that was restored
– Archived logs will be retrieved unless NORETRIEVE is specified
• Point-in-time (PIT): (Apply changes up to a specified time):
– Specified in Coordinated Universal Time (UTC) via command
– Specified in local time on server with USING LOCAL TIME
– Specified in local time on the client via GUI interface
– Format: yyyy-mm-dd-hh.mm.ss.nnnnnn
• END OF BACKUP: (Apply as few changes as possible):
– Allows a Database to be recovered from an online database backup and to end
the ROLLFORWARD processing at the earliest point where the database is
consistent.
– Recovery history file (RHF) shows logs associated with online backups
• Table space point-in-time considerations:
  – A minimum roll forward time is maintained for each table space – requires
    rolling forward at least to the last DDL change (create, alter, drop) in
    the table space
  – Table spaces are placed in backup pending when the roll forward
    completes, to ensure future recoverability
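
Minimal sketches of the three targets (the timestamp is illustrative):

db2 rollforward db musicdb to end of logs and stop
db2 rollforward db musicdb to 2010-06-01-10.00.00.000000 using local time and stop
db2 rollforward db musicdb to end of backup and stop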

© Copyright IBM Corporation 2010


DB2 RECOVER command

db2 recover database salesdb

– The RECOVER command performs both RESTORE and ROLLFORWARD command
  processing using recovery history file data
– Can use full or incremental database-level backup images
– Allows a partitioned (DPF) database to be recovered using a single command
– The database can be recovered to the end of the logs or to a point in time
– The RESTART option forces a failed RECOVER command to redo the RESTORE
  even if the previous restore completed
– The USING HISTORY FILE option allows disaster recovery with no existing
  database
– No table space level recovery options

[Figure: RECOVER locates the required backups and archived logs through the
recovery history file]
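
A minimal sketch of a point-in-time recovery (the timestamp is
illustrative):

db2 recover db salesdb to 2010-06-01-10.00.00 using local time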

© Copyright IBM Corporation 2010


Log file naming

Log files are named Sxxxxxxx.LOG, numbered S0000000.LOG through
S9999999.LOG.

[Figure: a sequence of log files S~0.LOG through S~4.LOG, with a BACKUP
taken partway through]

The naming sequence restarts:
• When roll-forward is enabled
• When S9999999.LOG is filled
• When roll-forward is disabled
• After RESTORE DATABASE without rolling forward
• After ROLLFORWARD DATABASE to some point in an earlier log file (for
  example, S~2.LOG)
© Copyright IBM Corporation 2010
Disaster recovery considerations
Should you use database recovery, table space recovery, or a
combination?
• Database level backups:
– Offline backup does not require a roll forward
– Online backup will include the ‘required’ logs for a roll forward TO END OF
BACKUP
– Entire database can be restored on another system
– Reduces backup management

• Table space level backups:


– Can be used to supplement database backups
– A complete set of tablespace backups could be used to REBUILD a database
for disaster recovery
– Increases the number of backup images

• Using the REBUILD WITH option of RESTORE simplifies creating a full


or partial copy of a database using either database or table space
backup images

© Copyright IBM Corporation 2010


Questions to consider
• Will the log files and the database files be on separate physical
disks?
• What medium can best handle the volume of log data to be
generated?
• Should copies of log files, backups, or both be made and
managed?
• Can the OS support mirroring? Should a mirror log path be
defined?
• How far back should recovery be enabled?
• When will all log files and backups be discarded?
• Are paging space/swap files on the same device or drive as
DB2 logs?
• Have you backed up database system support files, such as
db2profile and db2nodes.cfg, and your shell scripts?
© Copyright IBM Corporation 2010
Logging/backup requirements summary

Consideration              CRASH                 VERSION         ROLLFORWARD
Logging                    Circular or Archive   Circular or     Archive
                                                 Archive
Backup                     N/A                   OFFLINE         OFFLINE or
                                                                 ONLINE
Resources                  Low                   Medium          High
Management                 Low                   Medium          High
Currency after media       Not Provided          Latest Backup   Last Committed
failure                                                          Work
Table Space Recovery       N/A                   NO              YES

© Copyright IBM Corporation 2010


DB2 Integrated Cluster Manager support
• HA Cluster Manager Integration:
– Coupling of DB2 and TSA (SA MP) on Linux, Solaris, and AIX.
– TSA can be installed and maintained with DB2 installation
procedures.
– DB2 interface to configure the cluster.
– DB2 maintains the cluster configuration (add node, add table space,
and so on).
– Exploitation of a new vendor independent layering (VIL),
providing support for any cluster manager.

• NO SCRIPTING REQUIRED!
– One set of embedded scripts that are used by all cluster managers.

• Automates HADR failover


– Exploit HA cluster manager integration to automate HADR Takeover
on Standby.

© Copyright IBM Corporation 2010


DB2: Cluster configuration simplified
• DB2 utility db2haicu:
– DB2 High Availability Instance Configuration Utility
– Sets up DB2 with the Cluster Manager
• For HADR or Shared Storage cluster
– Interactive or XML file driven interface
– Sets the DBM CFG option cluster_mgr

• DB2 communicates with the Cluster Manager:


– Keeps the Cluster Manager in sync with DB2 changes:
• CM needs to be aware of new or changed table spaces, containers,
and so on
• Avoids failover problems due to missing resources
– Keeps the XML cluster configuration file up to date
– Automates HADR failover

© Copyright IBM Corporation 2010


High Availability Disaster Recovery: HADR

[Figure: clients connect over the network to the primary DB2 database
server; a direct database-to-database TCP/IP link connects the primary copy
of the DB2 database to the standby copy on the standby DB2 database server]

HADR is supported on all editions of DB2 9.

For DB2 Express 9, HADR is part of the separately priced High Availability
Feature.

© Copyright IBM Corporation 2010


DB2 additional recovery facilities
• On-demand log archiving
• Infinite active logs
• Block transactions on log directory full
• Split mirror database copies (see the sketch after this list):
  – SET WRITE SUSPEND/RESUME commands
  – db2inidb command modes:
    • SNAPSHOT: Database copy for testing or reporting
    • STANDBY: Database copy to create a standby database for quick
      recovery
    • MIRROR: Use the split mirror database copy instead of RESTORE
• Incremental and delta database and table space backups
• Relocating a database or a table space:
– RESTORE UTILITY with REDIRECT option
– db2relocatedb command
• Full and partial database REBUILD support
• Integrated Cluster Failover support
• High Availability Disaster Recovery (HADR)
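
A minimal sketch of taking a split mirror snapshot (the storage-level copy
step depends on the disk subsystem):

db2 connect to musicdb
db2 set write suspend for database
(split the mirror or copy the database storage at the disk-subsystem level)
db2 set write resume for database

On the system holding the copy:
db2inidb musicdb as snapshot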
© Copyright IBM Corporation 2010
DB2 for LUW Advanced Database Recovery
CL491: DB2 for LUW Advanced Database Recovery (4 days
with Labs):
• Automated Recovery log management
• Dropped table recovery
• Table space recovery
• Incremental backup and restore
• Database crash recovery
• Database and table space relocation methods
• Additional recovery facilities
• Recovery history file
• Disaster recovery of DB2 databases
• DB2 high availability and split mirror support
• DB2 partitioned database recovery considerations
• DB2 high availability disaster recovery (HADR) implementation

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Student exercise

© Copyright IBM Corporation 2010
