
InfoSphere DataStage

Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 01: Introduction ……………………………………………………………… 05


● Mod 02: Deployment …………………………………………………………………… 27
● Mod 03: Administering DataStage …………………………………… 41
● Mod 04: DataStage Designer ……………………………………………… 67
● Mod 05: Creating Parallel Jobs …………………………………… 97
● Mod 06: Accessing Sequential Data …………………………… 133
● Mod 07: Platform Architecture …………………………………… 163
● Mod 08: Combining Data ………………………………………………………… 201
● Mod 09: Sorting and Aggregating Data ………………… 243
● Mod 10: Transforming Data ……………………………………………… 271
● Mod 11: Working With Relational Data ………………… 327
● Mod 12: Metadata in the Parallel Framework …… 373
● Mod 13: Job Control ……………………………………………………………… 397

© Copyright IBM Corporation 2009 2


IBM InfoSphere DataStage

Mod 01: Introduction

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs

© Copyright IBM Corporation 2009 4


What is IBM InfoSphere DataStage?

● Design jobs for Extraction, Transformation, and Loading


(ETL)
● Ideal tool for data integration projects – such as, data
warehouses, data marts, and system migrations
● Import, export, create, and manage metadata for use
within jobs
● Schedule, run, and monitor jobs, all within DataStage
● Administer your DataStage development and execution
environments
● Create batch (controlling) jobs

© Copyright IBM Corporation 2009 5


IBM Information Server
● Suite of applications, including DataStage, that:
– Share a common repository
• DB2, by default
– Share a common set of application services and functionality
• Provided by Metadata Server components hosted by an application
server
– IBM InfoSphere Application Server
• Provided services include:
– Security
– Repository
– Logging and reporting
– Metadata management
● Managed using web console clients
– Administration console
– Reporting console

© Copyright IBM Corporation 2009 6


Information Server Backbone

[Diagram: the Information Server backbone. Product modules — Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, MetaBrokers, and Federation Server — sit on top of Metadata Access Services and Metadata Analysis Services, which are provided by the Metadata Server. The suite is administered through the Information Server console.]

© Copyright IBM Corporation 2009 7


Information Server Administration Console

© Copyright IBM Corporation 2009 8


DataStage Architecture

[Diagram: DataStage clients connect to the engines (parallel engine and server engine), which share a common repository.]


© Copyright IBM Corporation 2009 9
DataStage Administrator

© Copyright IBM Corporation 2009 10


DataStage Designer

DataStage
parallel job

© Copyright IBM Corporation 2009 11


DataStage Director

Log
message
events

© Copyright IBM Corporation 2009 12


Developing in DataStage
● Define global and project properties in Administrator
● Import metadata into the Repository
● Build job in Designer
● Compile job in Designer
● Run and monitor job log messages in Director
– Jobs can also be run in Designer, but job log messages cannot be
viewed in Designer

© Copyright IBM Corporation 2009 13


DataStage Project Repository

User-added
folder

Standard
jobs folder

Standard
Table
Definitions
folder

© Copyright IBM Corporation 2009 14


Types of DataStage Jobs
● Server jobs
– Executed by the DataStage server engine
– Compiled into Basic
– Runtime monitoring in DataStage Director
● Parallel jobs
– Executed by the DataStage parallel engine
– Built-in functionality for pipeline and partition parallelism
– Compiled into OSH (Orchestrate Scripting Language)
• OSH executes Operators
– Executable C++ class instances
– Runtime monitoring in DataStage Director
● Job sequences (batch jobs, controlling jobs)
– Master Server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director
– Executed by the Server engine

© Copyright IBM Corporation 2009 15


Design Elements of Parallel Jobs

● Stages
– Implemented as OSH operators (pre-built components)
– Passive stages (E and L of ETL)
• Read data
• Write data
• E.g., Sequential File, DB2, Oracle, Peek stages
– Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split / Merge data
• E.g., Transformer, Aggregator, Join, Sort stages
● Links
– “Pipes” through which the data moves from stage to stage

© Copyright IBM Corporation 2009 16


Pipeline Parallelism

● Transform, clean, load processes execute simultaneously


● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability

© Copyright IBM Corporation 2009 17


Partition Parallelism

● Divide the incoming stream of data into subsets to be


separately processed by an operator
– Subsets are called partitions
● Each partition of data is processed by the same operator
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed

© Copyright IBM Corporation 2009 18


Three-Node Partitioning

[Diagram: incoming Data is split into subset1, subset2, and subset3; the same Operation runs on each subset on Node 1, Node 2, and Node 3]

● Here the data is partitioned into three partitions


● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster

© Copyright IBM Corporation 2009 19


Job Design v. Execution

User designs the flow in DataStage Designer

… at runtime, this job runs in parallel for any configuration
of partitions (called nodes)

© Copyright IBM Corporation 2009 20


Checkpoint

1. True or false: DataStage Director is used to build and


compile your ETL jobs
2. True or false: Use Designer to monitor your job during
execution
3. True or false: Administrator is used to set global and
project properties

© Copyright IBM Corporation 2009 21


Unit summary
Having completed this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs

© Copyright IBM Corporation 2009 22


IBM InfoSphere DataStage

Mod 02: Deployment

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server

© Copyright IBM Corporation 2009 24


What Gets Deployed
An Information Server domain, consisting of the following:
● MetaData Server, hosted by an IBM InfoSphere Application Server
instance
● One or more DataStage servers
– DataStage server includes both the parallel and server engines
● One DB2 UDB instance containing the Repository database
● Information Server clients
– Administration console
– Reporting console
– DataStage clients
• Administrator
• Designer
• Director
● Additional Information Server applications
– Information Analyzer
– Business Glossary
– Rational Data Architect
– Information Services Director
– Federation Server

© Copyright IBM Corporation 2009 25


Deployment: Everything on One Machine
● Here we have a single domain with the hosted applications all on
one machine
● Additional client workstations can connect to this machine using
TCP/IP

[Diagram: one machine hosting the MetaData Server backbone, the
DataStage server, and the DB2 instance with the Repository; client
workstations connect to it]
© Copyright IBM Corporation 2009 26
Deployment: DataStage on Separate Machine

● Here the domain is split between two machines
– DataStage server
– MetaData server and DB2 Repository

[Diagram: the DataStage server on one machine; the MetaData Server
backbone and the DB2 instance with the Repository on the other;
clients connect to both]
© Copyright IBM Corporation 2009 27
MetaData Server and DB2 On Separate Machines

● Here the domain is split between three machines
– DataStage server
– MetaData server
– DB2 Repository

[Diagram: the DataStage server, the MetaData Server backbone, and
the DB2 instance with the Repository each on its own machine;
clients connect over the network]

© Copyright IBM Corporation 2009 28


Information Server Installation

© Copyright IBM Corporation 2009 29


Installation Configuration Layers
● Configuration layers include:
– Client
• DataStage and Information Server clients
– Engine
• DataStage and other application engines
– Domain
• MetaData Server and hosted metadata server components
• Installed products domain components
– Repository
• Repository database server and database
– Documentation
● Selected layers are installed on the machine local to the
installation
● Already existing components can be configured and used
– E.g., DB2, InfoSphere Application Server

© Copyright IBM Corporation 2009 30


Information Server Start-Up
– Start the MetaData Server
• From Windows Start menu, click “Start the Server” after the profile
to be used (e.g., “default”)
• From the command line, open the profile bin directory
– Enter “startup server1” (see the sketch after this list)
> server1 is the default name of the application server hosting the MetaData
Server
– Start the ASB agent
• From Windows Start menu, click “Start the agent” after selecting
the Information Server folder
• Only required if DataStage and the MetaData Server are on
different machines
– To begin work in DataStage, double-click on a DataStage client
icon
– To begin work in the Administration and Reporting consoles,
double-click on the Web Console for Information Server icon
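For example, starting the MetaData Server from the command line might look like this (the profile path is an assumption for illustration; use the bin directory of your own profile):

cd C:\IBM\WebSphere\AppServer\profiles\default\bin    (illustrative profile bin directory)
startup server1                                       (server1 is the default application server name)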

© Copyright IBM Corporation 2009 31


Starting the MetaData Server Backbone

Application Server Profiles folder Profile Start the Server

Profile Profile\bin directory

Startup command

© Copyright IBM Corporation 2009 32


Starting the ASB Agent

Start the agent

© Copyright IBM Corporation 2009 33


Checkpoint

1. What application components make up a domain?


2. Can a domain contain multiple DataStage servers?
3. Does the DB2 instance and the Repository database need
to be on the same machine as the Application Server?
4. Suppose DataStage is on a separate machine from the
Application Server. What two components need to be
running before you can log onto DataStage?

© Copyright IBM Corporation 2009 34


Unit summary
Having completed this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server

© Copyright IBM Corporation 2009 35


IBM InfoSphere DataStage

Mod 03: Administering


DataStage

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables

© Copyright IBM Corporation 2009 37


Information Server Administration Web Console
● Web application for administering Information Server
● Use for:
– Domain management
– Session management
– Management of users and groups
– Logging management
– Scheduling management

© Copyright IBM Corporation 2009 38


Opening the Administration Web Console

Information Server
console address

Information
Server
administrator ID

© Copyright IBM Corporation 2009 39


Users and Group Management

© Copyright IBM Corporation 2009 40


User and Group Management
● Suite authorizations can be provided to users or groups
– Users that are members of a group acquire the authorizations of the group
● Authorizations are provided in the form of roles
– Two types of roles
• Suite roles: Apply to the Suite
• Suite Component roles: Apply to a specific product or component of
Information Server, e.g., DataStage
● Suite roles
– Administrator
• Perform user and group management tasks
• Includes all the privileges of the Suite User role
– User
• Create views of scheduled tasks and logged messages
• Create and run reports
● Suite Component roles
– DataStage
• DataStage user
– Permissions are assigned within DataStage
> Developer, Operator, Super Operator, Production Manager
• DataStage administrator
– Full permissions to work in DataStage Administrator, Designer, and Director
– And so on, for all products in the Suite
© Copyright IBM Corporation 2009 41
Creating a DataStage User ID

Administration
console

Create
new user

Users

© Copyright IBM Corporation 2009 42


Assigning DataStage Roles

User ID
Users
Assign Suite Administrator role
Assign Suite User role
Assign DataStage User role
Assign DataStage Administrator role
© Copyright IBM Corporation 2009 43
DataStage Credential Mapping
● DataStage credentials
– Required by DataStage in order to log onto a DataStage client
– Managed by the DataStage Server operating system or LDAP
● Users given DataStage Suite roles in the Suite
Administration console do not automatically receive
DataStage credentials
– Users need to be mapped to a user who has DataStage credentials
on the DataStage Server machine
• This DataStage user must have file access permission to the
DataStage engine/project files or Administrator rights on the
operating system
● Three ways of mapping users (IS ID → OS ID)
– Many to 1
– 1 to 1
– Combination of the above

© Copyright IBM Corporation 2009 44


DataStage Credential Mapping

Map DataStage
User credential

© Copyright IBM Corporation 2009 45


DataStage Administrator

© Copyright IBM Corporation 2009 46


Logging onto DataStage Administrator
Host name,
port number of
application
server

DataStage
administrator ID
and password

Name or IP
address of
DataStage server
machine
© Copyright IBM Corporation 2009 47
DataStage Administrator Projects Tab

Server projects
Click to specify
project properties

Link to Information Server Administration console

© Copyright IBM Corporation 2009 48


DataStage Administrator General Tab

Enable runtime
column propagation
(RCP) by default

Specify auto purge

Environment
variable settings

© Copyright IBM Corporation 2009 49


Environment Variables

Parallel job variables

User-defined variables

© Copyright IBM Corporation 2009 50


Environment Reporting Variables

Display Score

Display record
counts

Display OSH

© Copyright IBM Corporation 2009 51


DataStage Administrator Permissions Tab

Assigned product
role

Add DataStage
users

© Copyright IBM Corporation 2009 52


Adding Users and Groups

Add DataStage users
Available users / groups. Must have a DataStage product User role.

© Copyright IBM Corporation 2009 53


Specify DataStage Role

Added DataStage
user

Select DataStage
role

© Copyright IBM Corporation 2009 54


DataStage Administrator Tracing Tab

© Copyright IBM Corporation 2009 55


DataStage Administrator Parallel Tab

© Copyright IBM Corporation 2009 56


DataStage Administrator Sequence Tab

© Copyright IBM Corporation 2009 57


Checkpoint

1. Authorizations can be assigned to what two items?


2. What two types of authorization roles can be assigned to a
user or group?
3. In addition to Suite authorization to log onto DataStage
what else does a DataStage developer require to work in
DataStage?

© Copyright IBM Corporation 2009 58


Unit summary
Having completed this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables

© Copyright IBM Corporation 2009 59


IBM InfoSphere DataStage

Mod 04: DataStage Designer

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file

© Copyright IBM Corporation 2009 61


Logging onto DataStage Designer

Host name,
port number of
application
server

DataStage server
machine/project

© Copyright IBM Corporation 2009 62


Designer Work Area
Repository Menus Toolbar

Parallel
canvas

Palette

© Copyright IBM Corporation 2009 63


Importing and Exporting
DataStage Objects

© Copyright IBM Corporation 2009 64


Repository Window
Search for
objects in the
project

Project
Default jobs
folder

Default Table
Definitions
folder

© Copyright IBM Corporation 2009 65


Import and Export

● Any object in the repository window can be exported to a


file
● Can export whole projects
● Use for backup
● Sometimes used for version control
● Can be used to move DataStage objects from one
project to another
● Use to share DataStage jobs and projects with other
developers

© Copyright IBM Corporation 2009 66


Export Procedure

● Click “Export>DataStage Components”


● Select DataStage objects for export
● Specify type of export:
– DSX: Default format
– XML: Enables processing of export file by XML applications,
e.g., for generating reports
● Specify file path on client machine

© Copyright IBM Corporation 2009 67


Export Window

Click to select objects from the repository
Selected objects
Export type
Export location on client system

Begin export
© Copyright IBM Corporation 2009 68
Import Procedure

● Click “Import>DataStage Components”


– Or “Import>DataStage Components (XML)” if you are importing
an XML-format export file
● Select DataStage objects for import

© Copyright IBM Corporation 2009 69


Import Options

Import all objects in the file
Display list to select from

© Copyright IBM Corporation 2009 70


Import/Export Command Line Interface
● Command istool
● Located in \IBM\InformationServer\Clients\istools\clients
– Documentation in same location
● Invoke through Programs → IBM Information Server →
Information Server Command Line Interface
● Examples:
– istool -hostname domain:9080 -username dsuser -password inf0server -deployment-package importfile.dsx import
– istool -hostname domain:9080 -username dsuser -password inf0server -job GenDataJob -filename exportfile.dsx export

© Copyright IBM Corporation 2009 71


Information Server Manager

© Copyright IBM Corporation 2009 72


Information Server Manager
● Create and deploy packages of DataStage objects
● Perform DataStage object exports / imports
– Similar to DataStage Designer exports / imports
• Export files are stored as archives (extension .isx)
● Provides a domain-wide view of the DataStage Repository
– Displays all DataStage servers
• E.g., a Development server and a Production server
– Displays all DataStage projects on each server
● Before you can use it you must add a domain
– Right-click over Repository, click Add Domain
• Select domain
• Specify user ID / password
● Invoke through Information Server start menu:
– IBM Information Server → Information Server Manager

© Copyright IBM Corporation 2009 73


Adding a Domain

Information Server
domain. May
contain multiple
DataStage systems

© Copyright IBM Corporation 2009 74


Information Server Manager Window

Created
export
archive

DataStage
Repository. Shows all
DataStage servers
and projects
© Copyright IBM Corporation 2009 75
Information Server Manager – Package View
● Multiple packages
● Multiple builds within each package
● Build history
[Tree view: Domain → Packages → Builds → Objects]

© Copyright IBM Corporation 2009 76


Importing Table Definitions

© Copyright IBM Corporation 2009 77


Importing Table Definitions

● Table Definitions describe the format and columns


of files and tables
● You can import Table Definitions for:
– Sequential files
– Relational tables
– COBOL files
– Many other things
● Table Definitions can be loaded into job stages

© Copyright IBM Corporation 2009 78


Sequential File Import Procedure

● Click Import>Table Definitions>Sequential File Definitions


● Select directory containing sequential file and then the file
● Select a repository folder to store the Table Definition in
● Examine format and column definitions and edit as
necessary

© Copyright IBM Corporation 2009 79


Importing Sequential Metadata

Import Table
Definition for
a sequential
file

© Copyright IBM Corporation 2009 80


Sequential Import Window
Select
directory
containing
files

Start import

Select file

Select repository
folder

© Copyright IBM Corporation 2009 81


Specify Format

Edit columns
Select if first row has column names

Delimiter

© Copyright IBM Corporation 2009 82


Edit Column Names and Types

Double-click to define
extended properties

© Copyright IBM Corporation 2009 83


Extended Properties window

Parallel properties
Property
categories

Available
properties

© Copyright IBM Corporation 2009 84


Table Definition General Tab

Type Source

Stored Table
Definition

© Copyright IBM Corporation 2009 85


Checkpoint

● True or False? The directory to which you export is on the


DataStage client machine, not on the DataStage server
machine.
● Can you import Table Definitions for sequential files with
fixed-length record formats?

© Copyright IBM Corporation 2009 86


Unit summary
Having completed this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file

© Copyright IBM Corporation 2009 87


IBM InfoSphere DataStage

Mod 05: Creating Parallel Jobs

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log

© Copyright IBM Corporation 2009 89


What Is a Parallel Job?

● Executable DataStage program


● Created in DataStage Designer
– Can use components from Repository
● Built using a graphical user interface
● Compiles into Orchestrate script language (OSH) and object
code (from generated C++)

© Copyright IBM Corporation 2009 90


Job Development Overview

● Import metadata defining sources and targets


– Done within Designer using import process
● In Designer, add stages defining data extractions and loads
● Add processing stages to define data transformations
● Add links defining the flow of data from sources to targets
● Compile the job
● In Director, validate, run, and monitor your job
– Can also run the job in Designer
– Can only view the job log in Director

© Copyright IBM Corporation 2009 91


Tools Palette

Stage
categories

Stages

© Copyright IBM Corporation 2009 92


Adding Stages and Links

● Drag stages from the Tools Palette to the diagram


– Can also be dragged from Stage Type branch to the diagram
● Draw links from source to target stage
– Right mouse over source stage
– Release mouse button over target stage

© Copyright IBM Corporation 2009 93


Job Creation Example Sequence

● Brief walkthrough of procedure


● Assumes Table Definition of source already exists in the
repository

© Copyright IBM Corporation 2009 94


Create New Parallel Job

Open
New
window

Parallel job

© Copyright IBM Corporation 2009 95


Drag Stages and Links From Palette

Compile

Row Generator
Peek
Job
properties

© Copyright IBM Corporation 2009 96


Renaming Links and Stages
● Click on a stage or link to
rename it
● Meaningful names have
many benefits
– Documentation
– Clarity
– Fewer development errors

© Copyright IBM Corporation 2009 97


RowGenerator Stage
● Produces mock data for specified columns
● No input link; single output link
● On Properties tab, specify number of rows
● On Columns tab, load or specify column definitions
– Open Extended Properties window to specify the values to be
generated for that column
– A number of algorithms for generating values are available
depending on the data type
● Algorithms for Integer type
– Random: seed, limit
– Cycle: Initial value, increment
● Algorithms for string type: Cycle, alphabet
● Algorithms for date type: Random, cycle
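For example, an integer column using Cycle with Initial value = 1 and Increment = 1 generates 1, 2, 3, …; using Random, the seed makes the generated sequence repeatable across runs and the limit bounds the values.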

© Copyright IBM Corporation 2009 98


Inside the Row Generator Stage

Properties
tab

Set property
value

Property

© Copyright IBM Corporation 2009 99


Columns Tab

Double-click to specify
extended properties

View data

Select Table Definition
Load a Table Definition
© Copyright IBM Corporation 2009 100
Extended Properties

Specified
properties and
their values

Additional
properties to add

© Copyright IBM Corporation 2009 101


Peek Stage

● Displays field values


– Displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed
– Shows data in each partition, labeled 0, 1, 2, …
● Useful stub stage for iterative job development
– Develop job to a stopping point and check the data

© Copyright IBM Corporation 2009 102


Peek Stage Properties

Output to
job log

© Copyright IBM Corporation 2009 103


Job Parameters

● Defined in Job Properties window


● Makes the job more flexible
● Parameters can be:
– Used in directory and file names
– Used to specify property values
– Used in constraints and derivations
● Parameter values are determined at run time
● When used for directory and file names and property
values, they are surrounded with pound signs (#)
– E.g., #NumRows#
● Job parameters can reference DataStage environment
variables
– Prefaced by $, e.g., $APT_CONFIG_FILE
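For example (a sketch; SourceDir and NumRows are assumed parameter names and the path is illustrative), stage properties might be set to:

File = #SourceDir#/customers.txt
Number of Records = #NumRows#

At run time, DataStage substitutes the values supplied for the parameters.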

© Copyright IBM Corporation 2009 104


Defining a Job Parameter

Parameters tab

Parameter
Add environment
variable

© Copyright IBM Corporation 2009 105


Using a Job Parameter in a Stage

Job parameter
Click to insert job parameter

© Copyright IBM Corporation 2009 106


Adding Job Documentation

● Job Properties
– Short and long descriptions
● Annotation stage
– Added from the Tools Palette
– Displays formatted text descriptions on diagram

© Copyright IBM Corporation 2009 107


Job Properties Documentation

Documentation

© Copyright IBM Corporation 2009 108


Annotation Stage Properties

© Copyright IBM Corporation 2009 109


Compiling a Job

Run

Compile

© Copyright IBM Corporation 2009 110


Errors or Successful Message

Highlight stage with error
Click for more info

© Copyright IBM Corporation 2009 111


Running Jobs and Viewing the Job
Log in Designer

© Copyright IBM Corporation 2009 112


DataStage Director

● Use to run and schedule jobs


● View runtime messages
● Can invoke directly from Designer
– Tools > Run Director

© Copyright IBM Corporation 2009 113


Run Options

Assign values to
parameter

© Copyright IBM Corporation 2009 114


Run Options

Stop after number


of warnings

Stop after number


of rows

© Copyright IBM Corporation 2009 115


Director Status View

Status view Schedule view


Log view

Select job to
view
messages in
the log view
© Copyright IBM Corporation 2009 116
Director Log View
Click the notepad
icon to view log
messages

Peek messages

© Copyright IBM Corporation 2009 117


Message Details

© Copyright IBM Corporation 2009 118


Other Director Functions

● Schedule job to run on a particular date/time


● Clear job log of messages
● Set job log purging conditions
● Set Director options
– Row limits (server job only)
– Abort after x warnings

© Copyright IBM Corporation 2009 119


Running Jobs from Command Line

● dsjob -run -param numrows=10 dx444 GenDataJob
– Runs a job
– Use -run to run the job
– Use -param to specify parameters
– In this example, dx444 is the name of the project
– In this example, GenDataJob is the name of the job
● dsjob -logsum dx444 GenDataJob
– Displays a job’s messages in the log
● Documented in “Parallel Job Advanced Developer’s
Guide”

© Copyright IBM Corporation 2009 120


Checkpoint

1. Which stage can be used to display output data in the job


log?
2. Which stage is used for documenting your job on the job
canvas?
3. What command is used to run jobs from the operating
system command line?

© Copyright IBM Corporation 2009 121


Unit summary
Having completed this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log

© Copyright IBM Corporation 2009 122


IBM InfoSphere DataStage

Mod 06: Accessing Sequential


Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Understand the stages for accessing different kinds of file
data
● Sequential File stage
● Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers

© Copyright IBM Corporation 2009 124


Types of File Data

● Sequential
– Fixed or variable length
● Data Set
● Complex flat file

© Copyright IBM Corporation 2009 125


How Sequential Data is Handled

● Import and export operators are generated


– Stages get translated into operators during the compile
● Import operators convert data from the external format,
as described by the Table Definition, to the framework
internal format
– Internally, the format of data is described by schemas
● Export operators reverse the process
● Messages in the job log use the “import” / “export”
terminology
– E.g., “100 records imported successfully; 2 rejected”
– E.g., “100 records exported successfully; 0 rejected”
– Records get rejected when they cannot be converted correctly
during the import or export

© Copyright IBM Corporation 2009 126


Using the Sequential File Stage

Both import and export of sequential files (text, binary) can be
performed by the Sequential File stage
● Importing / exporting data:
– Data import: external format → internal format
– Data export: internal format → external format

© Copyright IBM Corporation 2009 127


Features of Sequential File Stage

● Normally executes in sequential mode


● Executes in parallel when reading multiple files
● Can use multiple readers within a node
– Reads chunks of a single file in parallel
● The stage needs to be told:
– How file is divided into rows (record format)
– How row is divided into columns (column format)

© Copyright IBM Corporation 2009 128


Sequential File Format Example

Record delimiter

Field 1 , Field 2 , Field 3 , Last field  nl

Final Delimiter = end

Field Delimiter (comma)

Field 1 , Field 2 , Field 3 , Last field ,  nl

Final Delimiter = comma

© Copyright IBM Corporation 2009 129


Sequential File Stage Rules

● One input link


● One stream output link
● Optionally, one reject link
– Will reject any records not matching metadata in the column
definitions
• Example: You specify three columns separated by commas, but
the row that’s read had no commas in it
• Example: The second column is designated a decimal in the
Table Definition, but contains alphabetic characters

© Copyright IBM Corporation 2009 130


Job Design Using Sequential Stages

Stream link

Reject link
(broken line)

© Copyright IBM Corporation 2009 131


Input Sequential Stage Properties

Output tab
File to
access

Column names
in first row

Click to add more


files to read
© Copyright IBM Corporation 2009 132
Format Tab

Record format

Load from Table


Definition

Column format

© Copyright IBM Corporation 2009 133


Sequential Source Columns Tab

View data

Load from Table


Definition

Save as a new
Table Definition

© Copyright IBM Corporation 2009 134


Reading Using a File Pattern

Use wild cards
Select File Pattern
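For example (an illustrative pattern), selecting the File Pattern read method with

File Pattern = /data/source/sales_*.txt

reads all files matching the wildcard as a single data source.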

© Copyright IBM Corporation 2009 135


Properties - Multiple Readers

Multiple readers option allows


you to set number of readers
per node

© Copyright IBM Corporation 2009 136


Sequential Stage As a Target
Input Tab

Append /
Overwrite

© Copyright IBM Corporation 2009 137


Reject Link
● Reject mode =
– Continue: Continue reading
records
– Fail: Abort job
– Output: Send down output
link
● In a source stage
– All records not matching the
metadata (column definitions)
are rejected
● In a target stage
– All records that fail to be
written for any reason
● Rejected records consist of
one column, datatype = raw

Reject mode property

© Copyright IBM Corporation 2009 138


Inside the Copy Stage

Column
mappings

© Copyright IBM Corporation 2009 139


Reading and Writing NULL Values to a
Sequential File

© Copyright IBM Corporation 2009 140


Working with NULLs
● Internally, NULL is represented by a special value outside
the range of any existing, legitimate values
● If NULL is written to a non-nullable column, the job will abort
● Columns can be specified as nullable
– NULLs can be written to nullable columns
● You must “handle” NULLs written to nullable columns in a
Sequential File stage
– You need to tell DataStage what value to write to the file
– Unhandled rows are rejected
● In a Sequential File source stage, you can specify values you
want DataStage to convert to NULLs
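A sketch of the idea (the exact property name may vary by version; the token is your choice): in the extended properties of the nullable column, set something like

Null field value = ''

so that NULLs are written out as, and read back from, the empty string.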

© Copyright IBM Corporation 2009 141


Specifying a Value for NULL

Nullable
column

Added
property

© Copyright IBM Corporation 2009 142


DataSet Stage

© Copyright IBM Corporation 2009 143


Data Set
● Binary data file
● Preserves partitioning
– Component dataset files are written to each partition
● Suffixed by .ds
● Referred to by a header file
● Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
● Represents persistent data
● Key to good performance in set of linked jobs
– No import / export conversions are needed
– No repartitioning needed
● Accessed using DataSet stage
● Implemented with two types of components:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple files, one per partition (node)

© Copyright IBM Corporation 2009 144


Job With DataSet Stage

DataSet stage

DataSet stage
properties

© Copyright IBM Corporation 2009 145


Data Set Management Utility
Display schema

Display data

Display record
counts for each
data file

© Copyright IBM Corporation 2009 146


Data and Schema Displayed

Data viewer

Schema describing the


format of the data

© Copyright IBM Corporation 2009 147


File Set Stage

● Use to read and write to filesets


● Files suffixed by .fs
● Files are similar to a dataset
– Partitioned
– Implemented with header file and data files
● How filesets differ from datasets
– Data files are text files
• Hence readable by external applications
– Datasets have a proprietary data format which may
change in future DataStage versions

© Copyright IBM Corporation 2009 148


Checkpoint

1. List three types of file data.


2. What makes datasets perform better than other types of
files in parallel jobs?
3. What is the difference between a data set and a file set?

© Copyright IBM Corporation 2009 149


Unit summary
Having completed this unit, you should be able to:
● Understand the stages for accessing different kinds of file
data
● Sequential File stage
● Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers

© Copyright IBM Corporation 2009 150


IBM InfoSphere DataStage

Mod 07: Platform Architecture

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score

© Copyright IBM Corporation 2009 152


Key Parallel Job Concepts

● Parallel processing:
– Executing the job on multiple CPUs
● Scalable processing:
– Add more resources (CPUs and disks) to increase system
performance

• Example system: 6 CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system
[Diagram: six numbered processing nodes with disks]

© Copyright IBM Corporation 2009 153


Scalable Hardware Environments

● Single CPU
– Dedicated memory & disk
● SMP
– Multi-CPU (2-64+)
– Shared memory & disk
● GRID / Clusters
– Multiple, multi-CPU systems
– Dedicated memory per node
– Typically SAN-based shared storage
● MPP
– Multiple nodes with dedicated memory, storage
– 2 – 1000’s of CPUs

© Copyright IBM Corporation 2009 154


Pipeline Parallelism

● Transform, clean, load processes execute simultaneously


● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability

© Copyright IBM Corporation 2009 155


Partition Parallelism

● Divide the incoming stream of data into subsets to be


separately processed by an operation
– Subsets are called partitions (nodes)
● Each partition of data is processed by the same
operation
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed

© Copyright IBM Corporation 2009 156


Three-Node Partitioning

[Diagram: incoming Data is split into subset1, subset2, and subset3; the same Operation runs on each subset on Node 1, Node 2, and Node 3]

● Here the data is partitioned into three partitions


● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster

© Copyright IBM Corporation 2009 157


Parallel Jobs Combine Partitioning and Pipelining

– Within parallel jobs pipelining, partitioning, and repartitioning are automatic


– Job developer only identifies:
• Sequential vs. parallel mode (by stage)
– Default for most stages is parallel mode
• Method of data partitioning
– The method to use to distribute the data into the available nodes
• Method of data collecting
– The method to use to collect the data from the nodes into a single
node
• Configuration file
– Specifies the nodes and resources

© Copyright IBM Corporation 2009 158


Job Design v. Execution

User assembles the flow using DataStage Designer

… at runtime, this job runs in parallel for any configuration


(1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!


© Copyright IBM Corporation 2009 159
Configuration File

● Configuration file separates configuration (hardware /


software) from job design
– Specified per job at runtime by $APT_CONFIG_FILE
– Change hardware and resources without changing job design
● Defines nodes with their resources (need not match
physical CPUs)
– Dataset, Scratch, Buffer disk (file systems)
– Optional resources (Database, SAS, etc.)
– Advanced resource optimizations
• “Pools” (named subsets of nodes)
● Multiple configuration files can be used for different occasions
of job execution
– Optimizes overall throughput and matches job characteristics to
overall hardware resources
– Allows runtime constraints on resource usage on a per job basis

© Copyright IBM Corporation 2009 160


Example Configuration File
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)
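To run a job against a particular file, point $APT_CONFIG_FILE at it, for example as a job parameter or environment variable setting (the path below is illustrative):

$APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/4node.apt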

© Copyright IBM Corporation 2009 161


Partitioning and Collecting

© Copyright IBM Corporation 2009 162


Partitioning and Collecting

● Partitioning breaks incoming rows into multiple streams of


rows (one for each node)
● Each partition of rows is processed separately by the
stage/operator
● Collecting returns partitioned data back to a single stream
● Partitioning / Collecting is specified on stage input links

© Copyright IBM Corporation 2009 163


Partitioning / Collecting Algorithms
● Partitioning algorithms include:
– Round robin
– Random
– Hash: Determine partition based on key value
• Requires key specification
– Modulus
– Entire: Send all rows down all partitions
– Same: Preserve the same partitioning
– Auto: Let DataStage choose the algorithm
● Collecting algorithms include:
– Round robin
– Auto
• Collect first available record
– Sort Merge
• Read in by key
• Presumes data is sorted by the key in each partition
• Builds a single sorted stream based on the key
– Ordered
• Read all records from first partition, then second, …
© Copyright IBM Corporation 2009 164
Keyless V. Keyed Partitioning Algorithms
● Keyless: Rows are distributed independently of data values
– Round Robin
– Random
– Entire
– Same
● Keyed: Rows are distributed based on values in the
specified key
– Hash: Partition based on key
• Example: Key is State. All “CA” rows go into the same partition; all
“MA” rows go in the same partition. Two rows of the same state
never go into different partitions
– Modulus: Partition based on modulus of key divided by the number
of partitions. Key is a numeric type.
• Example: Key is OrderNumber (numeric type). Rows with the same
order number will all go into the same partition.
– DB2: Matches DB2 EEE partitioning

© Copyright IBM Corporation 2009 165


Round Robin and Random Partitioning

Keyless
● Keyless partitioning methods
● Rows are evenly distributed across partitions
– Good for initial import of data if no other partitioning is needed
– Useful for redistributing data
● Fairly low overhead
● Round Robin assigns rows to partitions like dealing cards
– Row/Partition assignment will be the same for a given
$APT_CONFIG_FILE
● Random has slightly higher overhead, but assigns rows in a
non-deterministic fashion between job runs
[Diagram: input rows …8 7 6 5 4 3 2 1 0 dealt Round Robin into three
partitions holding (0 3 6), (1 4 7), (2 5 8)]

© Copyright IBM Corporation 2009 166


ENTIRE Partitioning
Keyless
● Each partition gets a complete copy of the data
– Useful for distributing lookup and reference data
• May have performance impact in MPP / clustered environments
– On SMP platforms, the Lookup stage (only) uses shared memory
instead of duplicating ENTIRE reference data
• On MPP platforms, each server uses shared memory for a
single local copy
● ENTIRE is the default partitioning for Lookup reference links with
“Auto” partitioning
– On SMP platforms, it is a good practice to set this explicitly on the
Normal Lookup reference link(s)
[Diagram: input rows …8 7 6 5 4 3 2 1 0 copied in full (0 1 2 3 …) to
every partition]

© Copyright IBM Corporation 2009 167


HASH Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in key columns
– Guarantees that rows with same key values go into the same
partition
– Needed to prevent matching rows from “hiding” in other partitions
• E.g. Join, Merge, Remove Duplicates, …
– Partition distribution is relatively equal if the data across the
source key column(s) is evenly distributed
[Diagram: rows with key values …0 3 2 1 0 2 3 2 1 1 HASH-partitioned
so that each key value always lands in the same partition]

© Copyright IBM Corporation 2009 168


Modulus Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in one integer key
column
– Uses the modulus of the key value and the number of partitions:
partition = MOD(key_value, #partitions)
● Faster than HASH
● Guarantees that rows with identical key values go in the same
partition
● Partition size is relatively equal if the data within the key column
is evenly distributed
[Diagram: rows with key values …0 3 2 1 0 2 3 2 1 1 assigned by
MODULUS into three partitions]
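For example, with 3 partitions, a row whose key value is 7 goes to partition MOD(7, 3) = 1, and every other row with key value 7 lands in that same partition.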

© Copyright IBM Corporation 2009 169


Auto Partitioning

● DataStage inserts partition components as necessary


to ensure correct results
– Before any stage with “Auto” partitioning
– Generally chooses ROUND-ROBIN or SAME
– Inserts HASH on stages that require matched key values
(e.g. Join, Merge, Remove Duplicates)
– Inserts ENTIRE on Normal (not Sparse) Lookup reference
links
• NOT always appropriate for MPP/clusters
● Since DataStage has limited awareness of your data
and business rules, explicitly specify HASH partitioning
when needed
– DataStage has no visibility into Transformer logic
– Hash is required before Sort and Aggregator stages
– DataStage sometimes inserts un-needed partitioning
• Check the log
© Copyright IBM Corporation 2009 170
Partitioning Requirements for Related Records

● Misplaced records
– Using Aggregator stage to sum customer sales by
customer number
– If there are 25 customers, 25 records should be output
– But suppose records with the same customer numbers
are spread across partitions
• This will produce more than 25 groups (records)
– Solution: Use hash partitioning algorithm
● Partition imbalances
– If all the records are going down only one of the nodes,
then the job is in effect running sequentially

© Copyright IBM Corporation 2009 171


Unequal Distribution Example

● Hash on LName, with a 2-node config file
● Same key values are assigned to the same partition

Source Data:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

Partition 0:
ID LName FName Address
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard

Partition 1:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

All the Ford rows hash to the same partition, so the two partitions
are unequally loaded (8 rows vs. 2).

© Copyright IBM Corporation 2009 172


Partitioning / Collecting Link Icons

Partitioning icon

Collecting icon

© Copyright IBM Corporation 2009 173


More Partitioning Icons

“Fan-out”: sequential to parallel
SAME partitioner
Re-partition (watch for this!)
AUTO partitioner

© Copyright IBM Corporation 2009 174


Partitioning Tab
Input tab
Key
specification

Algorithms

© Copyright IBM Corporation 2009 175


Collecting Specification

Key
specification

Algorithms
© Copyright IBM Corporation 2009 176
Runtime Architecture

© Copyright IBM Corporation 2009 177


Parallel Job Compilation
What gets generated:
● OSH: a kind of script
● OSH represents the design data flow and stages
– Stages become OSH operators
● A Transform operator is generated for each Transformer
– A custom operator built during the compile
– Compiled into C++ and then into the corresponding native
operator
• Thus a C++ compiler is needed to compile jobs with a
Transformer stage
[Diagram: the Designer client compiles the job on the DataStage
server, producing the executable job and Transformer components]

© Copyright IBM Corporation 2009 178


Generated OSH

● Enable viewing of generated OSH in Administrator
● OSH is visible in:
– Job properties
– Job run log
– View Data
– Table Definitions
[Screenshot callouts: comments, operator, schema]
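As a rough, hand-written illustration of the shape of generated OSH (not actual compiler output; the stage, link, and schema names are invented), a Row Generator feeding a Peek might compile to something like:

#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record ( Id: int32; Name: string[10]; )
-records 10
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

#### STAGE: Peek_1
## Operator
peek
## Inputs
0< [] 'Row_Generator_0:lnk_gen.v'
;

Each stage becomes an operator, and each link becomes a virtual dataset (the .v names) connecting an operator's output to the next operator's input.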

© Copyright IBM Corporation 2009 179


Stage to Operator Mapping Examples
● Sequential File
– Source: import
– Target: export
● DataSet: copy
● Sort: tsort
● Aggregator: group
● Row Generator, Column Generator, Surrogate Key
Generator: generator
● Oracle
– Source: oraread
– Sparse Lookup: oralookup
– Target Load: orawrite
– Target Upsert: oraupsert

© Copyright IBM Corporation 2009 180


Checkpoint

1. What file defines the degree of parallelism a job runs


under?
2. Name two partitioning algorithms that partition based on
key values?
3. What partitioning algorithm produces the most even
distribution of data in the various partitions?

© Copyright IBM Corporation 2009 181


Unit summary
Having completed this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score

© Copyright IBM Corporation 2009 182


IBM InfoSphere DataStage

Mod 08: Combining Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage

© Copyright IBM Corporation 2009 184


Combining Data

Ways to combine data:


● Horizontally:
– Multiple input links
– One output link made of columns from different input links.
– Joins
– Lookup
– Merge
● Vertically:
– One input link, one output link combining groups of related records
into a single record
– Aggregator
– Remove Duplicates
● Funneling: Multiple input streams funneled into a single
output stream
– Funnel stage

© Copyright IBM Corporation 2009 185


Lookup, Merge, Join Stages

● These stages combine two or more input links


– Data is combined by designated "key" column(s)
● These stages differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)

© Copyright IBM Corporation 2009 186


Lookup Stage

© Copyright IBM Corporation 2009 187


Lookup Features
● One stream input link (source link)
● Multiple reference links
● One output link
● Lookup failure options
– Continue, Drop, Fail, Reject
● Reject link
– Only available when lookup failure option is set to Reject
● Can return multiple matching rows
● Hash file is built in memory from the lookup files
– Indexed by key
– Should be small enough to fit into physical memory

© Copyright IBM Corporation 2009 188


Lookup Types
● Equality match
– Match exactly values in the lookup key column of the reference link to
selected values in the source row
– Return row or rows (if multiple matches are to be returned) that match
● Caseless match
– Like an equality match except that it’s caseless
• E.g., “abc” matches “AbC”
● Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within the
range
● Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within
the range

© Copyright IBM Corporation 2009 189


Lookup Example

Reference link
Source (stream)
link

© Copyright IBM Corporation 2009 190


Lookup Stage With an Equality Match
Source link columns

Lookup
constraints

Mappings to
output columns

Reference key column and source mapping
Column definitions
© Copyright IBM Corporation 2009 191
Lookup Failure Actions
If the lookup fails to find a matching key column, one of
several actions can be taken:

– Fail (Default)
• Stage reports an error and the job fails immediately
– Drop
• Input row is dropped
– Continue
• Input row is transferred to the output. Reference link columns are
filled with null or default values
– Reject
• Input row sent to a reject link
• This requires that a reject link has been created for the stage

© Copyright IBM Corporation 2009 192


Specifying Lookup Failure Actions

Select to return
multiple rows

Select lookup failure


action

© Copyright IBM Corporation 2009 193


Lookup Stage Behavior

Source link:
Revolution Citizen
1789 Lefty
1776 M_B_Dextrous

Reference link:
Citizen Exchange
M_B_Dextrous Nasdaq
Righty NYSE

Lookup key column: Citizen

© Copyright IBM Corporation 2009 194


Lookup Stage

Output of Lookup with Continue option:
Revolution Citizen Exchange
1789 Lefty (empty string or NULL)
1776 M_B_Dextrous Nasdaq

Output of Lookup with Drop option:
Revolution Citizen Exchange
1776 M_B_Dextrous Nasdaq

© Copyright IBM Corporation 2009 195


Designing a Range Lookup Job

© Copyright IBM Corporation 2009 196


Range Lookup Job

Reference link

Lookup stage

© Copyright IBM Corporation 2009 197


Range on Reference Link
Reference
range values

Retrieve
description

Source values

© Copyright IBM Corporation 2009 198


Selecting the Stream Column

Source link
Double-click
to specify
range

Reference link

© Copyright IBM Corporation 2009 199


Range Expression Editor

Select range
columns

Select
operators

© Copyright IBM Corporation 2009 200


Range on Stream Link

Source range

Retrieve other
column values

Reference link key


© Copyright IBM Corporation 2009 201
Specifying the Range Lookup

Reference link
key column
Range type

© Copyright IBM Corporation 2009 202


Range Expression Editor

Select range
columns

© Copyright IBM Corporation 2009 203


Join Stage

© Copyright IBM Corporation 2009 204


Join Stage
● Four types of joins
• Inner
• Left outer
• Right outer
• Full outer
● Input links must be sorted
– Left link and a right link
– Supports additional “intermediate” links
● Light-weight
– Little memory required, because of the sort requirement
● Join key column or columns
– Column names for each input link must match

© Copyright IBM Corporation 2009 205


Job With Join Stage

Right input link

Left input link


Join stage

© Copyright IBM Corporation 2009 206


Join Stage Editor

Column to match
Select left input link
Select join type
Select if multiple columns make up the join key

© Copyright IBM Corporation 2009 207


Join Stage Behavior

Left link (primary input):
Revolution Citizen
1789 Lefty
1776 M_B_Dextrous

Right link (secondary input):
Citizen Exchange
M_B_Dextrous Nasdaq
Righty NYSE

Join key column: Citizen

© Copyright IBM Corporation 2009 208


Inner Join

● Transfers rows from both data sets whose key


columns have matching values

Output of inner join on key Citizen

Revolution Citizen Exchange


1776 M_B_Dextrous Nasdaq

© Copyright IBM Corporation 2009 209


Left Outer Join

● Transfers all values from the left link and transfers


values from the right link only where key columns
match.

Revolution Citizen Exchange


1789 Lefty
1776 M_B_Dextrous Nasdaq

Null or default
value

© Copyright IBM Corporation 2009 210


Right Outer Join

● Transfers all values from the right link and transfers values from
the left link only where key columns match.

Revolution Citizen Exchange


1776 M_B_Dextrous Nasdaq
Righty NYSE

Null or default
value

© Copyright IBM Corporation 2009 211


Full Outer Join

● Transfers rows from both data sets, whose key columns


contain equal values, to the output link
● It also transfers rows, whose key columns contain
unequal values, from both input links to the output link
● Treats both inputs symmetrically
● Creates new columns, with new column names!

Revolution leftRec_Citizen rightRec_Citizen Exchange


1789 Lefty
1776 M_B_Dextrous M_B_Dextrous Nasdaq
0 Righty NYSE

© Copyright IBM Corporation 2009 212


Merge Stage

© Copyright IBM Corporation 2009 213


Merge Stage
● Similar to the Join stage
● Input links must be sorted
– Master link and one or more secondary (update) links
– Master must be duplicate-free
● Light-weight
– Little memory required, because of the sort requirement
● Unmatched master rows can be kept or dropped
● Unmatched (“bad”) update rows in input link n can be captured in a reject link in the corresponding output link n
● Matched update rows are consumed
● Follows the Master-Update model
– A master row and one or more update rows are merged if and only if they have the same value in the user-specified key column(s)
● What if a non-key column name occurs in several inputs?
– The lowest input port number prevails
– For example, master over update; the update values are ignored

© Copyright IBM Corporation 2009 214


Merge Stage Job

Merge stage

Capture update
link non-matches

© Copyright IBM Corporation 2009 215


Merge Stage Properties

Match key

Keep or drop
unmatched
masters

© Copyright IBM Corporation 2009 216


The Merge Stage

● Allows composite keys


● Multiple update links
Master One or more
updates
● Matched update rows are
consumed
● Unmatched updates in
input port n can be captured
0 1 2 in output port n
● Lightweight
0 1 2
Merge

Output Rejects

© Copyright IBM Corporation 2009 217


Comparison: Joins, Lookup, Merge

                                  Joins                             Lookup                              Merge
Model                             RDBMS-style relational            Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                             heavy                               light
# and names of inputs             2 or more: left, right            1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              all inputs                        no                                  all inputs
Duplicates in primary input       OK                                OK                                  Warning!
Duplicates in secondary input(s)  OK                                Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                                capture in reject set(s)
On match, secondary entries are   captured                          captured                            consumed
# Outputs                         1                                 1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries           unmatched secondary entries

© Copyright IBM Corporation 2009 218


Funnel Stage

© Copyright IBM Corporation 2009 219


What is a Funnel Stage?
● Combines data from multiple input links to a single
output link
● All sources must have identical metadata
● Three modes
– Continuous
• Records are combined in no particular order: the first available record is taken
– Sort Funnel
• Combines the input records in the order defined by a key
• Produces sorted output if the input links are all sorted by the same key
– Sequence
• Outputs all records from the first input link, then all from the second input link, and so on

© Copyright IBM Corporation 2009 220


Funnel Stage Example

© Copyright IBM Corporation 2009 221


Checkpoint

1. Name three stages that horizontally join data.


2. Which stage uses the least amount of memory? Join or
Lookup?
3. Which stage requires that the input data is sorted? Join or
Lookup?

© Copyright IBM Corporation 2009 222


Checkpoint solutions

1. Lookup, Merge, Join


2. Join
3. Join

© Copyright IBM Corporation 2009 223


Unit summary
Having completed this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage

© Copyright IBM Corporation 2009 224


IBM InfoSphere DataStage

Mod 09: Sorting and


Aggregating Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Sort data using in-stage sorts and the Sort stage
● Aggregate data using the Aggregator stage
● Remove duplicates using the Remove Duplicates stage

© Copyright IBM Corporation 2009 226


Sort Stage

© Copyright IBM Corporation 2009 227


Sorting Data
● Uses
– Some stages require sorted input
• Join, merge stages require sorted input
– Some stages use less memory with sorted input
• e.g., Aggregator
● Sorts can be done:
– Within stages
• On input link Partitioning tab, set partitioning to anything other
than Auto
– In a separate Sort stage
• Makes sort more visible on diagram
• Has more options

© Copyright IBM Corporation 2009 228


Sorting Alternatives

Sort stage Sort within


stage

© Copyright IBM Corporation 2009 229


In-Stage Sorting

Partitioning tab

Do sort

Preserve non-key ordering

Remove dups

Sort key

Can’t be Auto when sorting
© Copyright IBM Corporation 2009 230
Sort Stage

Sort key

Sort options

© Copyright IBM Corporation 2009 231


Sort keys
● Add one or more keys
● Specify sort mode for each key
– Sort: Sort by this key
– Don’t sort (previously sorted):
• Assumes the data has already been sorted on this key
• Continue sorting by any secondary keys
● Specify sort order: ascending / descending
● Specify case sensitive or not

© Copyright IBM Corporation 2009 232


Sort Options

● Sort Utility
– DataStage – the default
– Unix: Don’t use. Slower than DataStage sort utility
● Stable
● Allow duplicates
● Memory usage
– Sorting takes advantage of the available memory for increased
performance
• Uses disk if necessary
– Increasing the amount of memory can improve performance

© Copyright IBM Corporation 2009 233


Stable Sort

Before sort       After stable sort on Key
Key  Col          Key  Col
4    X            1    K
3    Y            1    A
1    K            2    P
3    C            2    L
2    P            3    Y
3    D            3    C
1    A            3    D
2    L            4    X

© Copyright IBM Corporation 2009 234


Partitioning vs. Sorting Keys
● Partitioning keys are often different from sorting keys
– Keyed partitioning (e.g., Hash) is used to group related records into
the same partition
– Sort keys are used to establish order within each partition
● Example:
– Partition on HouseHoldID, sort on HouseHoldID, EntryDate
• Partitioning on HouseHoldID ensures that the same ID will not be
spread across multiple partitions
• Sorting orders the records with the same ID by entry date
– Useful for deciding which of a group of duplicate records with the same
ID should be retained

© Copyright IBM Corporation 2009 235


Aggregator Stage

© Copyright IBM Corporation 2009 236


Aggregator Stage

Purpose: Perform data aggregations


Specify:
● One or more key columns that define the aggregation
units (or groups)
● Columns to be aggregated
● Aggregation functions include, among many others:
– Count (nulls/non-nulls)
– Sum
– Max / Min / Range
● The grouping method (hash table or pre-sort) is a
performance issue

© Copyright IBM Corporation 2009 237


Job with Aggregator Stage

Aggregator
stage

© Copyright IBM Corporation 2009 238


Aggregation Types

● Count rows
– Count rows in each group
– Put result in a specified output column
● Calculation
– Select column
– Put result of calculation in a specified output column
– Calculations include:
• Sum
• Count
• Min, max
• Mean
• Missing value count
• Non-missing value count
• Percent coefficient of variation

© Copyright IBM Corporation 2009 239


Count Rows Aggregator Properties

Grouping key
columns

Output column for


Count Rows the result
aggregation type

© Copyright IBM Corporation 2009 240


Calculation Type Aggregator Properties

Grouping key
columns

Calculation
aggregation type

Calculations with
output columns
Available
calculations
© Copyright IBM Corporation 2009 241
Grouping Methods
● Hash (default)
– Calculations are made for all groups and stored in memory
• Hash table structure (hence the name)
– Results are written out after all input has been processed
– Input does not need to be sorted
– Useful when the number of unique groups is small
• Running tally for each group’s aggregations needs to fit into
memory
● Sort
– Requires the input data to be sorted by grouping keys
• Does not perform the sort itself! It expects sorted input
– Only a single aggregation group is kept in memory
• When a new group is seen, the current group is written out
– Can handle unlimited numbers of groups

© Copyright IBM Corporation 2009 242


Grouping Method - Hash

Input            Hash table of groups (in memory)
Key  Col
4    X           4: 4X
3    Y           3: 3Y 3C 3D
1    K           1: 1K 1A
3    C           2: 2P 2L
2    P
3    D
1    A
2    L

© Copyright IBM Corporation 2009 243


Grouping Method - Sort
Input (sorted)   Group written out when complete
Key  Col
1    K           1K 1A
1    A
2    P           2P 2L
2    L
3    Y           3Y 3C 3D
3    C
3    D
4    X           4X

© Copyright IBM Corporation 2009 244


Remove Duplicates Stage

© Copyright IBM Corporation 2009 245


Removing Duplicates

● Can be done by Sort stage


– Use unique option
• No choice on which duplicate to keep
• Stable sort always retains the first row in the group
• Non-stable sort is indeterminate

OR

● Remove Duplicates stage


– Has more sophisticated ways to remove duplicates
• Can choose to retain first or last

© Copyright IBM Corporation 2009 246


Remove Duplicates Stage Job

Remove Duplicates
stage

© Copyright IBM Corporation 2009 247


Remove Duplicates Stage Properties

Key that defines


duplicates

Retain first or last


duplicate
© Copyright IBM Corporation 2009 248
Checkpoint

1. What stage is used to perform calculations of column


values grouped in specified ways?
2. In what two ways can sorts be performed?
3. What is a stable sort?

© Copyright IBM Corporation 2009 249


Unit summary
Having completed this unit, you should be able to:
● Sort data using in-stage sorts and the Sort stage
● Aggregate data using the Aggregator stage
● Remove duplicates using the Remove Duplicates stage

© Copyright IBM Corporation 2009 250


IBM InfoSphere DataStage

Mod 10: Transforming Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations

© Copyright IBM Corporation 2009 252


Transformer Stage

© Copyright IBM Corporation 2009 253


Transformer Stage
● Column mappings
● Derivations
– Written in BASIC
– Final compiled code is C++-generated object code
● Constraints
– Filter data
– Direct data down different output links
• For different processing or storage
● Expressions for constraints and derivations can reference
– Input columns
– Job parameters
– Functions
– System variables and constants
– Stage variables
– External routines

© Copyright IBM Corporation 2009 254


Job With a Transformer Stage

Transformer

Single input

Reject link Multiple outputs

© Copyright IBM Corporation 2009 255


Inside the Transformer Stage
Stage variables

Show/Hide

Input columns Output columns

Constraint

Derivations / Mappings

Column defs

© Copyright IBM Corporation 2009 256


Defining a Constraint

Input column

Function

Job parameter

© Copyright IBM Corporation 2009 257


Defining a Derivation

Input column

String in quotes

Concatenation operator (:)
© Copyright IBM Corporation 2009 258
IF THEN ELSE Derivation
● Use IF THEN ELSE to conditionally derive a value
● Format:
– IF <condition> THEN <expression1> ELSE <expression2>
– If the condition evaluates to true then the result of expression1 will
be copied to the target column or stage variable
– If the condition evaluates to false then the result of expression2 will
be copied to the target column or stage variable
● Example:
– Suppose the source column is named In.OrderID and the target
column is named Out.OrderID
– Replace In.OrderID values of 3000 by 4000
– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
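Conditions can also be chained with ELSE IF. A minimal sketch (the column In.Qty and the band values are illustrative, not from the course data):

  IF In.Qty < 10 THEN "LOW"
  ELSE IF In.Qty < 100 THEN "MEDIUM"
  ELSE "HIGH"

The whole expression is a single derivation; each ELSE IF is evaluated only when the preceding condition is false.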

© Copyright IBM Corporation 2009 259


String Functions and Operators
● Substring operator
– Format: “String” [loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] → “Juice”
● UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) → “ORANGE JUICE”
● Len(<string>)
– Example: Len(In.Description) → 12
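These functions can be combined in one derivation. A hedged sketch, assuming hypothetical columns In.FirstName and In.LastName:

  UpCase(In.FirstName[1,1]) : ". " : UpCase(In.LastName)

For an input of John / Smith this would yield “J. SMITH”.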

© Copyright IBM Corporation 2009 260


Checking for NULLs
● Nulls can be introduced into the data
flow from lookups
– Mismatches (lookup failures) can produce
nulls
● Can be handled in constraints,
derivations, stage variables, or a
combination of these
● NULL functions
– Testing for NULL
• IsNull(<column>)
• IsNotNull(<column>)
– Replace NULL with a value
• NullToValue(<column>, <value>)
– Set to NULL: SetNull()
• Example: IF In.Col = 5 THEN SetNull()
ELSE In.Col

© Copyright IBM Corporation 2009 261


Transformer Functions

● Date & Time


● Logical
● Null Handling
● Number
● String
● Type Conversion

© Copyright IBM Corporation 2009 262


Transformer Execution Order

● Derivations in stage variables are executed first
● Constraints are executed before derivations
● Column derivations in earlier links are executed before those in later links
● Derivations in higher columns are executed before those in lower columns

© Copyright IBM Corporation 2009 263


Transformer Stage Variables

● Derivations execute in order from top to bottom


– Later stage variables can reference earlier stage variables
– Earlier stage variables can reference later stage variables
• These variables will contain a value derived from the previous row
that came into the Transformer
● Multi-purpose
– Counters
– Store values from previously read rows to make comparisons with
the currently read row
– Store derived values to be used in multiple target field derivations
– Can be used to control execution of constraints
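A minimal sketch of the previous-row comparison pattern (names are hypothetical). The stage variables are listed in their execution order, each with the derivation it would be given:

  svIsNewID:  IF In.CustomerID <> svPrevID THEN 1 ELSE 0
  svPrevID:   In.CustomerID

Because svIsNewID executes first, svPrevID still holds the CustomerID from the previous row when the comparison runs, so svIsNewID flags the first row of each group.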

© Copyright IBM Corporation 2009 264


Transformer Reject Links

Reject link

Convert link to a
Reject link

© Copyright IBM Corporation 2009 265


Transformer and Null Expressions
● Within a Transformer, any expression that includes a NULL
value produces a NULL result
– 1 + NULL = NULL
– “John” : NULL : “Doe” = NULL
● When the result of a link constraint or output derivation is
NULL, the Transformer will output that row to the reject link.
If no reject link, job will abort!
● Standard practices
– Always create a reject link
– Always test for NULL values in expressions with NULLABLE columns
• IF isNull(link.col) THEN … ELSE …
– Transformer issues warnings for rejects
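A hedged sketch of a null-safe derivation, assuming a nullable column In.MiddleName (the column names are illustrative):

  IF IsNull(In.MiddleName)
  THEN In.FirstName : " " : In.LastName
  ELSE In.FirstName : " " : In.MiddleName : " " : In.LastName

Without the IsNull test, a NULL middle name would turn the whole concatenation NULL and send the row to the reject link.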

© Copyright IBM Corporation 2009 266


Otherwise Link

Otherwise link

© Copyright IBM Corporation 2009 267


Defining an Otherwise Link

Check to create
otherwise Can specify abort
condition condition

© Copyright IBM Corporation 2009 268


Specifying Link Ordering

Link ordering toolbar icon

Last in
order

© Copyright IBM Corporation 2009 269


Parameter Sets

© Copyright IBM Corporation 2009 270


Parameter Sets
● Store a collection of parameters in a named object
● One or more values files can be named and specified
– A values file stores values for specified parameters
– Values are picked up at runtime
● Parameter Sets can be added to the job parameters
specified on the Parameters tab in the job properties

© Copyright IBM Corporation 2009 271


Creating a New Parameter Set

New Parameter
Set

© Copyright IBM Corporation 2009 272


Parameters Tab

Specify
parameters

Parameter set
name is specified
on General tab

© Copyright IBM Corporation 2009 273


Values Tab

Values for
Values file name parameters

© Copyright IBM Corporation 2009 274


Adding a Parameter Set to Job Properties

View Parameter
Set

Parameter Set Add Parameter


Reference Set

© Copyright IBM Corporation 2009 275


Using Parameter Set Parameters

Select Parameter
Set parameters

© Copyright IBM Corporation 2009 276


Checkpoint

1. What occurs first? Derivations or constraints?


2. Can stage variables be referenced in constraints?
3. Where should you test for NULLS within a Transformer?
Stage Variable derivations or output column derivations?

© Copyright IBM Corporation 2009 277


Unit summary
Having completed this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations

© Copyright IBM Corporation 2009 278


IBM InfoSphere DataStage

Mod 11: Working With


Relational Data

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage

© Copyright IBM Corporation 2009 280


Working with Relational Data
● Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions
● Data Connection objects
– Store database connection information in a named object
● Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server Jobs
● Selecting data
– Build SELECT statements using SQL Builder
● Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder

© Copyright IBM Corporation 2009 281


Importing Table Definitions

© Copyright IBM Corporation 2009 282


Importing Table Definitions

● Can import using ODBC or using Orchestrate schema


definitions
– Orchestrate schema imports are better because the data types
are more accurate
● Import>Table Definitions>Orchestrate Schema
Definitions
● Import>Table Definitions>ODBC Table Definitions

© Copyright IBM Corporation 2009 283


Orchestrate Schema Import

© Copyright IBM Corporation 2009 284


ODBC Import

Select ODBC data


source name

© Copyright IBM Corporation 2009 285


Connectors Stages

© Copyright IBM Corporation 2009 286


Connector Stage Types

● ODBC
– Conforms to ODBC 3.5 standard and is Level 3 compliant
– Certified with Oracle, DB2 UDB, SQL Server, and others
– DataDirect drivers
● DB2 UDB
● Teradata
● InfoSphere MQ
– WSMQ 5.3 and 6.0 for Client / Sever
– WSMB 5.0

© Copyright IBM Corporation 2009 287


Connector Stage Features
● Common stage editor
● Convenient drop-down lists to choose properties
● Job parameters can be inserted into any property
● Required properties are identified with a visual indicator
● Warning indicator for properties requiring attention
● Metadata retrieval
● Integrated SQL Builder
● Parallel support
– Read: parallel connections to the server and modified SQL queries for
each connection
– Write: parallel connections to the server
● Transaction isolation level support
– Read Uncommitted
– Read Committed
– Repeatable Read
– Serializable
● Before / After commands
– Executed once before or after the job runs

© Copyright IBM Corporation 2009 288


Metadata Retrieval

● List data sources


– Can list both user and system DSNs
● List users defined for the given database
● Individual table or view metadata
– Column data type, size, precision, etc.
● Database information
– Name, vendor name, version number, etc.
● List of database schemas
● List tables that match the filter criteria
● List related tables
– Those related by foreign key
– Those whose foreign keys point to the primary key in the given
table

© Copyright IBM Corporation 2009 289


Connector Stages

ODBC connector

ODBC Connector

© Copyright IBM Corporation 2009 290


Stage Editor

Click to display Navigator


stage properties panel

Properties
Click to display link
properties and columns

© Copyright IBM Corporation 2009 291


Property Warning Indicators

Warning indicator

© Copyright IBM Corporation 2009 292


Auto-Generated SQL

Stage generates SQL


SELECT based on
column definitions

© Copyright IBM Corporation 2009 293


Invoking SQL Builder

SQL statement
property

SQL Builder

© Copyright IBM Corporation 2009 294


Connector Stage Properties

Connection
information

SQL

Transaction
and session
management

Before/After
the SQL

© Copyright IBM Corporation 2009 295


DB2 Connector Stage

© Copyright IBM Corporation 2009 296


DB2 Connector Source Stage

© Copyright IBM Corporation 2009 297


DB2 Connector Source Stage Properties

Connection
information

© Copyright IBM Corporation 2009 298


DB2 Connector Target Stage

© Copyright IBM Corporation 2009 299


DB2 Connector Target Stage Properties

Connection
information

Write mode

© Copyright IBM Corporation 2009 300


Building a Query Using SQL
Builder

© Copyright IBM Corporation 2009 301


SQL Builder
● Uses the Table Definition
– Be sure the Locator tab information is specified fully and correctly
• Schema and table names are based on Locator tab information
● Drag Table Definitions to SQL Builder canvas
● Drag columns from Table Definition to Select columns table

© Copyright IBM Corporation 2009 302


Table Definition Locator Tab

Locator tab

Table schema name

Table name

© Copyright IBM Corporation 2009 303


SQL Builder Read Method

SQL Builder
for Select

© Copyright IBM Corporation 2009 304


SQL Builder Window

Drag Table
Definition

Drag
Columns

Build column
expressions

Selection tab

© Copyright IBM Corporation 2009 305


Creating a Calculated Column
Columns
dragged
from table

Select function
Select Expression Editor
Open a second
Expression
Editor

Functions

Select function
Function parameters

© Copyright IBM Corporation 2009 306


Constructing a WHERE Clause

Select predicate

Job parameter

Add condition to clause

Add second job parameter

© Copyright IBM Corporation 2009 307


Sorting the Data

Second column
to sort by

Sort Ascending
/ Descending
First column to
sort by

© Copyright IBM Corporation 2009 308


Viewing the Generated SQL

Read-only

SQL tab
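The generated statement typically resembles the following sketch; the table, columns, and job parameters shown are illustrative, not from the lab:

  SELECT ITEM_ID, DESCRIPTION, PRICE
  FROM MYSCHEMA.PRODUCTS
  WHERE REGION = #RegionParam# AND PRICE > #MinPrice#
  ORDER BY DESCRIPTION ASC

Job parameters are referenced with the #parameter# notation and are substituted at runtime.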

© Copyright IBM Corporation 2009 309


Building SQL to Write to a
Table

© Copyright IBM Corporation 2009 310


Select Type of Write

Write mode
property

© Copyright IBM Corporation 2009 311


Invoking SQL Builder

Specify
INSERT

SQL
Specify Builder
UPDATE
© Copyright IBM Corporation 2009 312
SQL Builder INSERT

Drag Table Definition

Drag columns to load

Specify values
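An INSERT built this way generally resembles the sketch below (table and column names are illustrative); input link columns are referenced with the ORCHESTRATE. prefix:

  INSERT INTO MYSCHEMA.PRODUCTS (ITEM_ID, DESCRIPTION, PRICE)
  VALUES (ORCHESTRATE.ItemID, ORCHESTRATE.Description, ORCHESTRATE.Price)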

© Copyright IBM Corporation 2009 313


SQL Builder Update

Drag Table Definition

Drag columns to load

Specify values

Add condition to WHERE clause

© Copyright IBM Corporation 2009 314


Viewing the SQL
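For reference, an UPDATE built with SQL Builder generally resembles this sketch (illustrative names; input link columns again carry the ORCHESTRATE. prefix):

  UPDATE MYSCHEMA.PRODUCTS
  SET DESCRIPTION = ORCHESTRATE.Description,
      PRICE = ORCHESTRATE.Price
  WHERE ITEM_ID = ORCHESTRATE.ItemID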

© Copyright IBM Corporation 2009 315


Data Connections

© Copyright IBM Corporation 2009 316


Data Connection
● Stores database parameters and values in a named object in
the Repository
– Optionally can create a parameter set with the parameters and values from within the Data Connection object
● Associated with a stage type
● Property values can be specified in a job stage of the given
type by loading the Data Connection into the stage

© Copyright IBM Corporation 2009 317


Creating a New Data Connection Object

New Data
Connection

© Copyright IBM Corporation 2009 318


Select the Stage Type

Select the stage


type

Specify parameter
Optionally create values
a parameter set

© Copyright IBM Corporation 2009 319


Loading the Data Connection

Load Data
Connection

Save Data
Connection

© Copyright IBM Corporation 2009 320


Checkpoint

1. Which of the following stage types provides the most functionality in DataStage parallel jobs? ODBC Plug-in, ODBC Connector, ODBC Enterprise?
2. What are three ways of building SQL statements in
Connector stages?
3. Which of the following statements can be specified in
Connector stages? Select, Insert, Update, Upsert, Create
Table.
4. What are two ways of loading Data Connection metadata
into a database stage?

© Copyright IBM Corporation 2009 321


Unit summary
Having completed this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage

© Copyright IBM Corporation 2009 322


IBM InfoSphere DataStage

Mod 12: Metadata in the


Parallel Framework

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container

© Copyright IBM Corporation 2009 324


Schema
● Alternative way of specifying column definitions and record formats
– Similar to a Table Definition
● Written in a plain text file
● Can be imported as a Table Definition
● Can be created from a Table Definition
● Can be used in place of a Table Definition in a Sequential
file stage
– Requires RCP
– Schema file path can be parameterized
• Enables a single job to process files with different column
definitions
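A minimal sketch of the schema file syntax (the field names, types, and format options here are illustrative):

  record
    {final_delim=end, delim=',', quote=double}
  (
    CustomerID: int32;
    Name: string[max=30];
    Balance: nullable decimal[10,2];
  )

The record-level properties in braces describe the file format; each field is declared with a parallel framework data type.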

© Copyright IBM Corporation 2009 325


Creating a Schema

● Using a text editor


– Follow correct syntax for definitions
– Not recommended
● Import from an existing data set or file set
– Click Import > Table Definitions > Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds suffix
● Import from a database table
● Create from a Table Definition
– Click Parallel on Layout tab

© Copyright IBM Corporation 2009 326


Importing a Schema

Schema location can be


server or workstation

Import from Check if schema is


Database table from a data set or
file set

© Copyright IBM Corporation 2009 327


Creating a Schema From a Table Definition

Parallel layout

Layout

Schema corresponding to
the Table Definition Save schema

© Copyright IBM Corporation 2009 328


Reading a Sequential File Using a Schema

No columns
defined here

Path to
schema file

© Copyright IBM Corporation 2009 329


Runtime Column Propagation (RCP)
● When RCP is turned on:
– Columns of data can flow through a stage without being explicitly
defined in the stage
– Target columns in a stage need not have any columns explicitly
mapped to them
• No column mapping enforcement at design time
– Input columns are mapped to unmapped columns by name
● How implicit columns get into a job
– Read a file using a schema in a Sequential File stage
– Read a database table using “Select *”
– Explicitly define as an output column in a stage earlier in the flow
● Benefits of RCP
– Job flexibility
• Job can process input with different layouts
– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched
© Copyright IBM Corporation 2009 330
Enabling Runtime Column Propagation (RCP)

● Project level
– DataStage Administrator Parallel tab
● Job level
– Job properties General tab
● Stage level
– Link Output Column tab
● Settings at a lower level override settings at a higher level
– E.g., disable at the project level, but enable for a given job
– E.g., enable at the job level, but disable a given stage

© Copyright IBM Corporation 2009 331


Enabling RCP at Project Level

Check to enable RCP


to be used

Check to make RCP


the default for new
jobs

© Copyright IBM Corporation 2009 332


Enabling RCP at Job Level

Check to make RCP


the job default

© Copyright IBM Corporation 2009 333


Enabling RCP at Stage Level
● Sequential File stage
– Output Columns tab
● Transformer
– Open Stage Properties
– Stage Properties Output tab

Enable RCP in
Sequential File
stage

Check to enable
RCP in
Transformer

© Copyright IBM Corporation 2009 334


When RCP is Disabled
● DataStage Designer enforces Stage Input to Output column mappings.

Colored red; job


won’t compile

© Copyright IBM Corporation 2009 335


When RCP is Enabled
● DataStage does not enforce mapping rules
● Runtime error if no incoming columns match unmapped target
column names

Job will compile

© Copyright IBM Corporation 2009 336


Shared Containers

© Copyright IBM Corporation 2009 337


Shared Containers
● Encapsulate job design components into a stored container
● Provide reusable job design components
● Example
– Apply stored Transformer business logic

© Copyright IBM Corporation 2009 338


Creating a Shared Container
● Select stages from an existing job
● Click Edit>Construct Container>Shared

Selected
components

Click

© Copyright IBM Corporation 2009 339


Using a Shared Container in a Job

Shared
Container

© Copyright IBM Corporation 2009 340


Mapping Input / Output Links to the Container

Select container link


to map input link to

© Copyright IBM Corporation 2009 341


Mapping Input/Output Links

© Copyright IBM Corporation 2009 342


Checkpoint

1. What are two benefits of RCP?


2. What can you use to encapsulate stages and links in a job
to make them reusable?

© Copyright IBM Corporation 2009 343


Unit summary
Having completed this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container

© Copyright IBM Corporation 2009 344


IBM InfoSphere DataStage

Module 14: Job Control

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence a
set of jobs run in
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions

© Copyright IBM Corporation 2009 346


What is a Job Sequence?

● A master controlling job that controls the execution of a


set of subordinate jobs
● Passes values to the subordinate job parameters
● Controls the order of execution (links)
● Specifies conditions under which the subordinate jobs get
executed (triggers)
● Specifies complex flow of control
– Loops
– All / Some
– Wait for file
● Perform system activities
– Email
– Execute system commands and executables
● Can include Restart checkpoints

© Copyright IBM Corporation 2009 347


Basics for Creating a New Job Sequence
● Open a new job sequence
– Specify whether it’s restartable
● Add stages
– Stages to execute jobs
– Stages to execute system commands and executables
– Special purpose stages
● Add links
– Specify the order in which jobs are to be executed
● Specify triggers
– Triggers specify the condition under which control passes across
a link
● Specify error handling
● Enable / Disable restart checkpoints

© Copyright IBM Corporation 2009 348


Job Sequencer Stages

● Run stages
– Job Activity: Run a job
– Execute Command: Run a system command
– Notification Activity: Send an email
● Flow control stages
– Sequencer: Go if All / Some
– Wait for File: Go when file exists / doesn’t
exist
– StartLoop / EndLoop
– Nested Condition: Go if condition satisfied
● Error handling
– Exception Handler
– Terminator
● Variables
– User Variables

© Copyright IBM Corporation 2009 349


Example
Wait for file Execute a
command

Send email
Run job

Handle
exceptions

© Copyright IBM Corporation 2009 350


Sequence Properties

Restart

Exception stage to
handle aborts

© Copyright IBM Corporation 2009 351


Job Activity Stage Properties

Job to be executed

Execution mode

Job parameters
to be passed

© Copyright IBM Corporation 2009 352


Job Activity Trigger
Output link names

List of trigger types

Build custom trigger


expressions

© Copyright IBM Corporation 2009 353


Execute Command Stage

Executable

Parameters to pass

● Execute system commands, shell scripts, and other


executables
● Use e.g. to drop or rename database tables

© Copyright IBM Corporation 2009 354


Notification Activity Stage

Include job
status info in
email body

© Copyright IBM Corporation 2009 355


User Variables Stage
Variable
User Variables stage

Expression defining the


value for the variable

© Copyright IBM Corporation 2009 356


Referencing the User Variable

Variable

© Copyright IBM Corporation 2009 357


Flow of Control Stages

© Copyright IBM Corporation 2009 358


Wait for File Stage

File

Options

© Copyright IBM Corporation 2009 359


Sequencer Stage

● Sequence multiple jobs using the Sequence stage

Can be set to all


or any

© Copyright IBM Corporation 2009 360


Nested Condition Stage

Fork based on trigger


conditions

Trigger conditions

© Copyright IBM Corporation 2009 361


Loop Stages

Reference link to start

Pass counter value


Counter values

© Copyright IBM Corporation 2009 362


Error Handling

© Copyright IBM Corporation 2009 363


Handling Activities that Fail

Pass control to
Exception stage when
an activity fails

© Copyright IBM Corporation 2009 364


Exception Handler Stage

Control
goes here
if an
activity fails

© Copyright IBM Corporation 2009 365


Restart

© Copyright IBM Corporation 2009 366


Enable Restart

Enable
checkpoints to be
added

© Copyright IBM Corporation 2009 367


Disable Checkpoint at a Stage

Don’t
checkpoint
this activity

© Copyright IBM Corporation 2009 368


Checkpoint

1. Which stage is used to run jobs in a job sequence?


2. Does the Exception Handler stage support an input link?

© Copyright IBM Corporation 2009 369


Unit summary
Having completed this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence a
set of jobs run in
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions

© Copyright IBM Corporation 2009 370


InfoSphere QualityStage
v8.1
Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 13: Course Scenario


● Mod 14A: Structure of a Rule Set
● Mod 14B: Creation of a Custom Rule Set
● Mod 15: Initial Investigation of Data to be
Standardized
● Mod 16: Classification Schema/Classification
Table
● Mod 17: Pattern Action File

© Copyright IBM Corporation 2009 372


Module 13: Course Scenario

Information Management

© 2010 IBM Corporation


Information Management

Course Scenario

 Acme Foods is implementing a data warehouse to:


– Provide the ability to analyze sales data and buying trends
– Enable improved target marketing campaigns for new customers
– Increase the profitability of existing customers
 Historical sales data to be loaded into the warehouse has no
consistent structure or data formats as it was stored as free-form
text in various stand-alone systems used by the sales department
 Prior to the warehouse load, historical data must also be enriched
with product ID data from the warehouse’s SCM system

10 IBM Confidential © 2010 IBM Corporation


Information Management

Review Questions

 In what format is the current standardized data?

11 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 1: Setup Data Files

 Copy data files to disk if not using the virtual machine supplied
with the course materials

– Create a working directory C:\DX746


– Copy the entire content from C:\Class_Files\DX746\Lab to C:\DX746

12 IBM Confidential © 2010 IBM Corporation


Module 14A: Structure of a Rule Set

Information Management

© 2010 IBM Corporation


Information Management

Rule Sets

 Rule sets contain logic for:


– Parsing
– Classifying
– Processing data elements
 Three required files
– Classification Table (.cls)
– Dictionary File (.dct)
– Pattern Action File (.pat)
 Optional lookup tables (.tbl)
 Override tables

50 IBM Confidential © 2010 IBM Corporation


Information Management

Rule Set Files

Classification Table (.CLS) – Contains token, standard value, class, and threshold weights (optional)

Pattern Action File (.PAT) – Contains SEPLIST/STRIPLIST, NYSIIS and Soundex calculations, and pattern action sets for processing data

Dictionary File (.DCT) – Contains field identifier, type, length, missing value identifier, and field description

Lookup Tables (.TBL) – Optional conversion and lookup tables for converting and returning standardized values

Override Tables (.TBL) – Tables for storing user overrides entered through the QualityStage Designer GUI
51 IBM Confidential © 2010 IBM Corporation
Information Management

Classification Table (.cls)

 The format for the Classification Table file is:


– token / standard value / class / [threshold-weights] [; comments]
 The first line of a Classification Table file must be:
– \FORMAT\ SORT=Y
 Each column must be separated by one or more spaces
 Standard form of a token can be more than one word
– Must be in double quotes
– Example: POB “PO BOX” B
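Putting these rules together, a hedged sketch of a small classification table (the entries are illustrative, not from the course rule set):

  \FORMAT\ SORT=Y
  HWY HIGHWAY T 700
  POB "PO BOX" B
  RD ROAD T

Each line gives the token, its standard form, its class, and an optional threshold weight.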

52 IBM Confidential © 2010 IBM Corporation


Information Management

Classes

 One-character tag indicating the class of the word


 User-defined classes can be any letter, A to Z
 Zero (0) class indicates a null class
– Use of a class of zero indicates that a word is a null (or noise) and is skipped
in the pattern matching process
– Standard abbreviation can be any value desired since it is not used
 Any token not classed by the classification table receives the
appropriate default class

53 IBM Confidential © 2010 IBM Corporation


Information Management

Default Classes

^  Numeric, all digits, such as 1234
+  Single, unknown word such as HIPPO
?  Unknown, one or more words, such as CHERRY HILL
>  Leading numeric, numbers followed by one or more letters, such as 123A
<  Leading alpha, letters followed by one or more numbers, such as A3
@  Complex mix, a mixture of alpha and numeric characters such as 123A45
~  Special, special characters including !, \, @, ~, %, etc.
0  Null
-  Hyphen
/  Slash
\&  Ampersand
\#  Number sign
\(  Left parenthesis

54 IBM Confidential © 2010 IBM Corporation
Information Management

Threshold Weights

 Threshold weights specify the degree of tolerated spelling


uncertainty
 An information-theoretic string comparator is used that takes into
account phonetic errors, random insertion, deletion and
replacement of characters, and transpositions of characters
 Score is weighted by the length of the word, since small errors in
long words are less serious than errors in short words
– 900 Exact match
– 800 Strings are almost certainly the same
– 750 Strings are probably the same
– 700 Strings are probably different

55 IBM Confidential © 2010 IBM Corporation


Information Management

Comparison Threshold Examples

Token .CLS Entry 850 800 750

NRTHWEST NORTHWEST yes yes yes


NRTHWST NORTHWEST no yes yes
NRTHW NORTHWEST no no yes

56 IBM Confidential © 2010 IBM Corporation


Information Management

Dictionary File (.dct)

 Defines the fields for the output file for the rule set
 The format for the Dictionary File is:
 field-identifier / field-type / field-length / missing-value / description [; comments]
– Field identifier: Any length, no spaces in the name,
• Example: HouseNumber
– Field type: C (character), N (numeric)
• This is legacy information but must be entered into the dictionary file
– Field length: Length (in bytes)
– Missing value identifier: S (spaces), Z (zero), X (none)
• This is legacy information but a value must be entered into the dictionary
file
– Description: No spaces allowed in the description
 The first line of a Dictionary File must be:
– \FORMAT\ SORT=N
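A hedged sketch of a small dictionary file (the field names are illustrative):

  \FORMAT\ SORT=N
  HouseNumber C 10 S House_Number ;house number bucket
  StreetName C 30 S Street_Name
  StreetType C 10 S Street_Type

Each line gives the field identifier, type, length, missing-value identifier, and description, matching the format above.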

57 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action File (.pat)

 Contains the rules for standardization


 After the input record is tokenized (parsed) and classified,
patterns are executed in the order they appear in the Pattern
Action file
 ASCII file that can be created or updated using any standard text
editor and has the following general format:
\PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|\\:;\"<>,.?/"
STRIPLIST " ~`!@$^*()_+={}[]|\\:;\"<>,?"
\PRAGMA_END
\POST_START
NYSIIS {SN} {NS}
RSOUNDEX {SN} {SS}
\POST_END
pattern
actions
...

58 IBM Confidential © 2010 IBM Corporation
Information Management

Optional Tables (.tbl)

 Lookup tables
– Contain information that is specific to the rule set; for example the gender
lookup table for USNAME contains the following:
MRS F
MS F
MR M
– ASCII file that can be manually edited
 User override tables
– User classification, input text, input pattern, unhandled text, unhandled
pattern
– Entries added through Designer Standardization Overrides GUI

59 IBM Confidential © 2010 IBM Corporation


Information Management

Module 14A Review Questions

 Which rule set file is not sorted? Why?


 If the standard form of a token is two words, how is it formatted in
the classification table?
 Where are the SEPLIST and STRIPLIST located?

60 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 5: Copy and provision an existing rule set

 USNAME  CopyOfUSNAME

61 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 6: Modify a rule set

 CopyOfUSNAME
– CLS – Apaloosa Apalusa F
– PAT – F | +
• Change COPY TO COPY_A
 Use Rules Analyzer to test

62 IBM Confidential © 2010 IBM Corporation


Module 14B: Creation of a Custom Rule Set

Information Management

© 2010 IBM Corporation


Information Management

Custom Rule Set Development Cycle

 Investigate source data file – word investigation


 Develop classification schema/table
 Re-run investigation with new classification table to reveal
patterns
 Develop dictionary file
 Develop pattern action sets
 Run standardization and investigate results
 Refine pattern action processing

64 IBM Confidential © 2010 IBM Corporation


Information Management

Investigate Data File

Which comes first – the investigation or the rule set?


 In order to investigate a free-form field, you must specify a rule
 No rule exists for your data

65 IBM Confidential © 2010 IBM Corporation


Information Management

The Answer

 Use a generic rule set first


– No entries in the classification table
• Actually, there must be at least one entry in the classification table; one is created by default when you create a new rule set
– Every token will be unclassified, therefore a default class (tag) will be
assigned based upon the data value
 Create a folder and new Rule Set
 Default values will be created for the dictionary, classification
table, and pattern action file
 Modify the pattern action file to reference the default field in the
dictionary file
 Use the Rule Analyzer to test

66 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 7: Create new rule set

 Create a folder called Standardization Rules under the


ACMEFOOD Folder
 Create a New Rule Set named FOOD
– File, New
– Select Data Quality folder
– Choose Rule Set
 Modify (see slide notes)
– Dictionary file
– Pattern Action file
 Run the Standardization Rules Analyzer using your new rule on
the following input
– “THIS IS A TEST”
– If you correctly copied and updated your new rule, it will return an unhandled
pattern of +++T

67 IBM Confidential © 2010 IBM Corporation


Information Management

Parsing

 Parsing is controlled by the SEPLIST/STRIPLIST


 SEPLIST/STRIPLIST entries are stored in the pattern action file
– SEPLIST/STRIPLIST can also be accessed under Advanced Options in the
Investigate stage Wizard
– The SEPLIST/STRIPLIST located in the investigation window is read from the
pattern action file and displayed on the Advanced Options of the word
investigation screen
• Any changes made to the lists in the investigation screen are not reflected in the pattern action file; to change the list in the pattern action file, you must edit the file manually

68 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 Start by having only the space in the SEPLIST and only the space
in the STRIPLIST
 Examine the pattern report to examine how the current SEPLIST/
STRIPLIST combination parsed the data
 Remember, it’s important to thoroughly parse data before
classifying words
– New words can result from additional parsing

69 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 For example, if the “/” is not in the SEPLIST, the following pattern
results

VALHALLA/WRAPPER @

 In this example, a brand name and a container type are treated as


the same data token

70 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 When the “/” is added to the SEPLIST, the following pattern


results

VALHALLA / WRAPPER ?/?

 The pattern now shows that there are two (currently) unclassified
words separated by a “/”
– There are now two more words we can classify

71 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 Continue to re-run the Word Investigate stage while modifying the


SEPLIST/STRIPLIST combination until the free-form data is
parsed into manageable words, “data chunks”, also called tokens
– TIP: Except when representing a complex mix of alphas and numbers
(C3PO), the “@” in pattern reports is an indicator that more can be added to
the SEPLIST
 Before adding a character to the STRIPLIST, be sure the character
doesn’t add meaning or context to the data
– Rule of thumb: Whenever in doubt don’t strip it out
 It’s important to analyze the pattern reports for ways the would-be
stripped character can impact the data

72 IBM Confidential © 2010 IBM Corporation


Information Management

Determining Parsing Requirements

 How would removing the highlighted special characters impact the meaning
of the below data items?
#2 pencil
5# flour
Ave.
$100.00
£100.00
1/2

73 IBM Confidential © 2010 IBM Corporation


Information Management

Update the SEPLIST/STRIPLIST

 After you’ve determined the parsing requirements, update the


SEPLIST/STRIPLIST in the .PAT file
 Using CTRL-C / CTRL-V (copy/paste) can be helpful in copying the
characters used in the Word Investigate stage and pasting them to
the rule set’s .PAT file

74 IBM Confidential © 2010 IBM Corporation


Information Management

Module 14B Review Questions

 True or false: It’s best to start with all special characters in the
STRIPLIST.

75 IBM Confidential © 2010 IBM Corporation


Module 15: Initial Investigation of Data to be Standardized

Information Management

© 2010 IBM Corporation


Information Management

Exercise 5-1

 Create a new Investigate stage


– Name: InvWordProductData
– Option: Word
– Input file: SourceProductData
– Rule: FOOD
– Field: ProdDescr
– Adv. Opt.: Include Unclassified Alphas
Include Classified Original Spelling
 Create a job, add the Investigate stage and run

77 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

78 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

79 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

80 IBM Confidential © 2010 IBM Corporation


Information Management

Word Investigation Implementation

81 IBM Confidential © 2010 IBM Corporation


Information Management

Result File InvWordProductData – Pattern Report

82 IBM Confidential © 2010 IBM Corporation


Information Management

Result File InvWordProductData – Token Report

83 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 8: Create word investigation job

 Refine SEPLIST/STRIPLIST
 Modify .PAT to reflect new SEPLIST/STRIPLIST
 Re-run Word Investigate stage
 Iterate as often as needed

84 IBM Confidential © 2010 IBM Corporation


Module 16: Classification Schema/Classification Table

Information Management

© 2010 IBM Corporation


Information Management

Create the Classification Table

 When creating or modifying a custom rule set, entries to the


classification table are often made by copying lines from the
Token Report
 The user develops a classification schema for key tokens (words)
that create context in patterns
 The classification schema is a set of one letter codes used to
classify categories of words
 Available are the 26 letters of the alphabet and the numeral 0 (for
words you want set to null or nothing)

86 IBM Confidential © 2010 IBM Corporation


Information Management

Classification Schema

 When choosing letters to represent categories, consider choosing


letters that are not easily confused with other categories
– For example, if one category of words is “Drugs” and the other is
“Diagnoses”, consider not categorizing either with the obvious choice of “D”
– This can help minimize misclassifications when creating the classification
table

87 IBM Confidential © 2010 IBM Corporation


Information Management

What to Classify

 The general rule is “Don’t classify every word.”


 Classify categories of words that occur many times in the data
and help to lend meaning and context to other items in the free-
form text
– Check the frequency of words in the investigation reports
 Examples of words to classify are those whose category contains
a finite set of values, or whose expected values are finite
– For example: street types, name titles, name suffixes, unit types, units of
measure, packaging or container types, brand names

88 IBM Confidential © 2010 IBM Corporation


Information Management

What to Classify

 Think twice before creating a classification for a category of


words for which the set of values are both infinite and do not
conform to a set of standards
 Non-conforming or infinite category value words are best left to
be classified with default tags
– Examples include individual names, street names, and miscellaneous
description words
– Their context will allow them to be processed correctly in the pattern action
file

89 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 9: Create classification schema

 Review the Token Report created in the previous exercise


 Develop a classification schema
 Add your classification schema key to the classification table
 Copy and paste entries from the Token Report file into the
classification table and replace the “?” with the appropriate one-
character class

90 IBM Confidential © 2010 IBM Corporation


Information Management

Result File

91 IBM Confidential © 2010 IBM Corporation


Information Management

Review Questions

 True or false: You should classify as many tokens as you have


time to.
 Which file do you look in for candidates for inclusion in the
classification table?
 True or false: When you start developing a new rule set, the
classification table is completely blank.
 How many classes did you use? What are they?

92 IBM Confidential © 2010 IBM Corporation


Pattern Review
Refining the Classification Table

Information Management

© 2010 IBM Corporation


Information Management

Generating Patterns

 Once the classification table has been created and the Word
Investigate stage is rerun, patterns will emerge
 Review of the Pattern Report will indicate whether additional
classes or new members to existing classes must be made
 Are the patterns explicit enough for you to glean the meaning of
the data based on its context?
– If you can answer yes to this question, you are ready to begin developing
pattern action sets
– If you answer no to this question, you must refine/modify your classification
schema/table until you can answer this question in the affirmative

94 IBM Confidential © 2010 IBM Corporation


Information Management

Lab 10: Review Pattern Report

 Add entries to/update the classification table


 Re-run the job containing the Word Investigate stage
 Review the Pattern Report
 Refine/modify the classification table as needed

95 IBM Confidential © 2010 IBM Corporation


Information Management

Result File

96 IBM Confidential © 2010 IBM Corporation


Information Management

Module Review Question

 Do any default classed tokens remain after you have developed


the classification table for a new rule set?

97 IBM Confidential © 2010 IBM Corporation


Module 17: Pattern Action File

Information Management

© 2010 IBM Corporation


Pattern Action Language

Information Management

© 2010 IBM Corporation


Information Management

Pattern Format

 A pattern consists of one or more operands


 Operands are separated by vertical lines
– The following pattern has four operands:
^|D|?|T
– These are referred to in the actions as
[1], [2], [3], and [4]
 You can use spaces to separate operands
– ^|D|?|T is equivalent to ^ | D | ? | T
 Comments follow a semicolon ; like this comment

100 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action Set

 A Pattern Action Set is comprised of a pattern and a set of action


statements

Pattern:
^|D|?|T

Action statements:
COPY [1] {HouseNumber}    ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PreDirection} ; Copy operand 2 w/standardization to PREDIR
COPY_S [3] {StreetName}   ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {StreetType}   ; Copy operand 4 w/standardization to SUFFIXTYPE

101 IBM Confidential © 2010 IBM Corporation


Information Management

Common Action Statements

 COPY - Copy the data in its original form
 COPY_A - Copy the standardized abbreviation of the word from the classification table
– HINT: Only use COPY_A if the token you’re dealing with has a letter classification
 COPY_S - Copy with spaces preserved
– HINT: Most often used with ?
 CONCAT - Concatenate a data item to a user variable. Also available is CONCAT_A
 RETYPE - Change the classification of a data item, and optionally its actual and/or standard value

102 IBM Confidential © 2010 IBM Corporation


Information Management

Pattern Action File (.PAT)

Which of the following patterns will handle the following example?

Example: 1 North Ronald McDonald Lane

^|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_S [2] {SN} ; Copy operand 2 w/spaces to STREETNAME
COPY_A [3] {ST} ; Copy operand 3 w/standardization to SUFFIXTYPE

^|D|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PD} ; Copy operand 2 w/standardization to PREDIRECTION
COPY_S [3] {SN} ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {ST} ; Copy operand 4 w/standardization to SUFFIXTYPE

103 IBM Confidential © 2010 IBM Corporation


Information Management

User Variables

 This pattern action set looks for a date such as 10/15/98 and
copies it as a single token to a date field

*^|/|^|/|^
COPY [1] temp      ; temp is a temporary (user) variable
CONCAT [2] temp
CONCAT [3] temp
CONCAT [4] temp
CONCAT [5] temp
COPY temp {DA}

104 IBM Confidential © 2010 IBM Corporation


Information Management

Position-related Commands

 End of Field Specifier - $


– ^ | ? | T | $ would only match a field that had a simple address and nothing
else
• 123 Main Road would match
• 123 Main Road Apt 3 would not match
• Apt 3 123 Main Road would not match
 Floating Position Specifier – *
– *^ | ? | T would look starting from the left to find a simple address that appears
anywhere within the field
• Home Address 123 Main Road would match
• Apt 3 123 Main Road would match
• 123 Main Road Apt 3 would match

105 IBM Confidential © 2010 IBM Corporation


Information Management

Position-related Commands

 Reverse Floating Position Specifier - #


– #S | ^ would look starting from the right to find the first instance of state
followed by a numeric
– CA 45 Products, Phoenix, AZ 12345 Dept 45
– Notice it correctly finds AZ 12345. If we used the floating position (*), it
would have incorrectly identified CA 45 as the state and zip code
 Fixed Position Specifier - %
– Specifies a particular operand based on its position in a string
– Input string: 123 Happy Valley Joy Road
• %1 would locate 123
• %3 would locate Valley
• %-1 would locate Road

106 IBM Confidential © 2010 IBM Corporation


Information Management

Subfield Commands

 Subfield Classes - 1 to 9, -1 to -9
– Parses individual words of a ? string
– Johnson Controls Corp whose pattern is ? | C
– 1 | C – Johnson is operand 1 and Corp is operand 2
– -1 | C – Controls is operand 1 and Corp is operand 2
 Subfield Ranges - beg:end
– Specifies a range of unknown words
– 123 – A B Main Street whose pattern is ^ | - | ? | T
– ^ | - | (1:2) | T
• “A B Main” is the string specified by ?
• 1:2 is the portion A B
– Operand [3] is A B
• You can also look from the end of the field. For example, (-3:-1) specifies
a range of the third word from the last to the last word

107 IBM Confidential © 2010 IBM Corporation


Information Management

Additional Classes

 Universal Class - **
– Matches all tokens
– ** | T would match any pattern that ended with a T class
 Negation Class Qualifier - !
– Is used to indicate NOT
– ** | !T would match any pattern that did not end with a T class

108 IBM Confidential © 2010 IBM Corporation


Information Management

Conditionals

 Simple Conditional Values


– ^ | ?=“MAPLE” | T
– Would only match simple addresses with the single street name of
“MAPLE”
• 120 MAPLE TERRACE would match
• 240 MAPLE DRIVE would match
• 140 MAPLE LEAF ROAD would not match
 Series of Conditional Values
– ^ | ? | T=“AVE”, “DR”
– Would only match simple addresses with a standard form suffix of AVE or
DR
• 120 MAPLE AVE would match
• 240 MAPLE DR would match
• 140 MAPLE ROAD would not match

109 IBM Confidential © 2010 IBM Corporation



Tables of Conditional Values

 If it’s necessary to specify a large number of conditional values, tables of values can be used
– Postcode.tbl contains valid postal codes
– S | ^ = @Postcode.tbl
– Would only match a state followed by a numeric if the numeric represented a
value in the Postcode.tbl file

110 IBM Confidential © 2010 IBM Corporation



Referencing Operands

 Referencing Data Represented by an Operand


– ^ [{} = 10000] is read as “a numeric whose actual value is 10000”
 Referencing the length of an operand
– ^ [{} LEN = 5] is read as “a numeric whose length is 5 bytes”
 Referencing the template of an operand
– @ [{} PICT = “cncc”] is read as “a complex mixed that looks like a character
followed by a numeric followed by two characters” such as C3PO
 Referencing a substring of an operand
– + [{} (-7:-1) = “STRASSE”] is read as “a single unknown alpha whose actual
value’s last 7 bytes are STRASSE” such as LEUVENSTRASSE

111 IBM Confidential © 2010 IBM Corporation



Calling Subroutines

 Subroutines facilitate faster processing


 Pattern action sets are executed sequentially
 Faster to test for a generic condition and then call subroutine to
process

*M ; look for existence of apartment data
CALL APTS ; send record to APTS subroutine

\SUB APTS
Process apartment data here
\END_SUB

112 IBM Confidential © 2010 IBM Corporation



Review Questions

 COPY_A copies what form of the token?


 COPY_S is most commonly used with what default class?
 Would ^ | ? | T find an address anywhere in a field?
 True or false: ^ | ? | T | $ would only find an address that was the
only value in the field.
 True or false: ** is the negation class qualifier.

113 IBM Confidential © 2010 IBM Corporation


Development of Pattern Action Sets


© 2010 IBM Corporation



Defining Standardize Output File Format

 The output of the Standardize stage is dictated by the dictionary file
 Your business rules and requirements will dictate the output file
format
 Because we need to match the product data against the
warehouse file, our output should match the format of the
warehouse file

115 IBM Confidential © 2010 IBM Corporation



Warehouse File Format

 DWProductData is the warehouse reference file and has the following format:

116 IBM Confidential © 2010 IBM Corporation



Result - Dictionary File

117 IBM Confidential © 2010 IBM Corporation



Lab 11: Develop Pattern Set

 Add entries to your dictionary file based on the format of the warehouse data file that you are going to match your standardized data to

 Select patterns and begin developing action statements (pattern action sets) to bucket/standardize the data

– Create a Standardize stage


• Input file: SourceProductData
• Output file: StanProdData
• Option: Append All
• Rule: FOOD
• Standardize Field: ProdDescr
– Build and run Standardize job

118 IBM Confidential © 2010 IBM Corporation



Result File

119 IBM Confidential © 2010 IBM Corporation


Refining Pattern Action Sets: Investigation of Standardize Results


© 2010 IBM Corporation



Investigate Your Standardize Results

 Once you’ve run the Standardize stage, you should investigate the
results to see how accurately your rule set is working
 Your rule set contains two fields that make this process much
faster and easier
– Unhandled Pattern – which shows the pattern of any data it was unable to
process
– Unhandled Data – which shows the actual unhandled/unprocessed data
 Running a Character Concatenate Investigate stage on these two
fields will provide you with a consolidated view of patterns/data
that still need processing
 You should also run a Character Discrete Investigate stage on all
your single domain fields to ensure they have domain integrity
and your rule set is working as desired

121 IBM Confidential © 2010 IBM Corporation



Lab 12: Audit Standardization Results (Task 1)

 Create a Character Concatenate Investigate stage on the Unhandled_Pattern and Unhandled_Data fields
– Use a C mask on UnhandledPattern and an X mask on UnhandledData
– Be sure to increase your sample size
 Add/refine your pattern action sets
 Rerun your Standardize stage and your Character Concatenate
Investigate stage until you have processed all the data correctly

122 IBM Confidential © 2010 IBM Corporation



Lab 12: Audit Standardization Results (Task 2)

 Create a Character Discrete Investigate stage to investigate the contents of all your single domain fields in your Standardize stage result file
 Review the report to ensure that you have domain integrity in all
the fields
 If you identify any errors, go back and modify your rule’s pattern
action sets to correct the mishandling of the data

123 IBM Confidential © 2010 IBM Corporation



Review Questions

 What pattern/actual data value did you find most difficult to process?
 Were you able to attain 100% processing of the data?
 Why did we use a C mask on the unhandled pattern and an X
mask on the unhandled data in our quality control Character
Concatenate Investigate stage?
 Did you need to make any changes to your classification table?

124 IBM Confidential © 2010 IBM Corporation


DB2® v9.1
Workshop

© Copyright IBM Corporation 2009


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents

● Mod 18: Overview of DB2


● Mod 19: Command Line Processor (CLP) and GUI
usage
● Mod 20: DB2 Environment
● Mod 21: Create an instance and explore the
environment
● Mod 22: Creating databases and data placement
● Mod 23: Creating database objects
● Mod 24: Moving data
● Mod 25: Backup and recovery

© Copyright IBM Corporation 2009 374


Overview of DB2 9 on Linux, UNIX and
Windows

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters

© Copyright IBM Corporation 2010


DB2 product platforms

(Diagram: the DB2 family runs on Windows, Linux/UNIX, zSeries, and iSeries platforms.)

© Copyright IBM Corporation 2010


DB2: The scalable database

IBM InfoSphere Warehouse – (DPF database partitioning) supports clusters of physical or virtual servers

DB2 Enterprise Server – meets data server needs of mid-size to large-size businesses

DB2 Workgroup Server – data server of choice for deployment in a departmental, workgroup, or medium-sized business environment

DB2 Express – DB2 data server with entry-level pricing for small and medium business

DB2 Express-C – free, entry-level edition of the DB2 data server for the developer and partner community

© Copyright IBM Corporation 2010


Accessing DB2

Command Line Processor: commands, interactive SQL, XQuery
Administration Tools: Control Center, Command Editor, Health Center, Task Center, Replication Center, Journal, Tools Settings
Applications (APIs): embedded SQL/XQuery, Call Level Interface, JDBC/SQLJ, DAO/RDO/ADO

All three paths reach the Database Engine through the SQL/XML/API interface.

© Copyright IBM Corporation 2010


DB2 product components

Clients: IBM Data Server Driver Package, IBM Data Server Client, IBM Data Server Runtime Client (each acts as a DRDA application requester)

Communication support connects the clients to DB2 and to DB2 Connect (the DRDA application server)

© Copyright IBM Corporation 2010


Installation prerequisites
• Do you have:
– Enough disk space
– Enough memory
– Available Internet browser
– Security (Authorities) concept for
SYSADM / Admin. Server

• Specific requirements depend on operating system


http://www-01.ibm.com/software/data/db2/9/sysreqs.html

• Product documentation:
http://www.ibm.com/support/docview.wss?rs=71&uid=swg27015148

© Copyright IBM Corporation 2010


Installation steps

• GUI or non-GUI
• Customizing?
• Default instance?
• Admin-Server?
• Notification?
• Installation verification
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Command Line Processor (CLP) and
GUI usage

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools

© Copyright IBM Corporation 2010


CLP Command Line Processor

© Copyright IBM Corporation 2010


CLP syntax

db2 [option-flag] { db2-command | sql-statement | ? [phrase | message | sql-state | class-code] }

© Copyright IBM Corporation 2010


Online reference

• Online Command Reference


db2 ?
db2 ? command string
db2 ? SQLnnnn (nnnn = 4 or 5 digit SQLCODE)
db2 ? nnnnn (nnnnn = 5 digit SQLSTATE)

• Online Reference Manuals

© Copyright IBM Corporation 2010


Using the CLP

Non-interactive mode

db2 connect to musicdb


db2 "select * from syscat.tables" more
(double quotes may be required)

Interactive mode

db2
db2=> connect to musicdb
db2=> select * from syscat.tables
© Copyright IBM Corporation 2010
CLP command options

© Copyright IBM Corporation 2010


Modify CLP options

1 Temporary for Command


db2 -r options.rep list command options
db2 -svtf create.tab3
db2 +c "update tab3 set salary=salary + 100"

2 Temporary for Interactive CLP Session


db2=>update command options using c off a on

3 Temporary for non-interactive CLP Session


export DB2OPTIONS="-svt" (UNIX)
set DB2OPTIONS="-svt" (Windows)
db2 -f create.tab3

4 Every session
Place the setting from point 3 in the UNIX db2profile,
or in the System Program Group on Windows
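A minimal sketch of option 4 on UNIX (the instance owner name db2inst1 is illustrative; sqllib/userprofile is the customary place for user additions so that db2profile itself is left unmodified):

echo 'export DB2OPTIONS="-svt"' >> ~db2inst1/sqllib/userprofile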
© Copyright IBM Corporation 2010
Input file

Edit create.tab
-- comment: db2 -svtf create.tab

connect to sample;

create table tab3


(name varchar(20) not null,
phone char(40),
salary dec(7,2));

select * from tab3;

commit work;

connect reset;

db2 -svtf create.tab

© Copyright IBM Corporation 2010


Input file: Operating system commands

UNIX (vi seltab):
echo "Table Name Is" $1 > out.sel
db2 "select * from $1" >> out.sel

Windows (edit seltab.cmd):
echo 'Table Name Is' %1 > out.sel
db2 select * from %1 >> out.sel

seltab org

Table Name Is org

DEPTNUMB DEPTNAME MANAGER DIVISION LOCATION


10 Head Office 160 Corporate New York
15 New England 50 Eastern Boston
20 Mid Atlantic 10 Eastern Washington
38 South Atlantic 30 Eastern Atlanta
42 Great Lakes 100 Midwest Chicago
51 Plains 140 Midwest Dallas
66 Pacific 270 Western San Francisco
84 Mountain 290 Western Denver
(out.sel contents)

© Copyright IBM Corporation 2010


QUIT/TERMINATE/CONNECT RESET differences

COMMAND         Terminates CLP back-end process   Disconnects database connection
quit            No                                No
terminate       Yes                               Yes
connect reset   No                                Yes, if CONNECT=1 (RUOW)

© Copyright IBM Corporation 2010


CLPPlus
command line processor introduced in DB2 9.7
• CLPPlus:
– SQL*Plus compatible command
– Variable substitution
– Column formatting
– Simple reporting
– Control variables

© Copyright IBM Corporation 2010


CLPPlus features
• Support for establishing connections to databases when a
database user ID and password are provided.
• A buffer that can be used to store scripts, script fragments,
SQL statements, SQL PL statements, or PL/SQL statements
for editing and then execution. Text in the buffer can be listed,
printed, edited, or run as a batch script.
• A comprehensive set of processor commands can be used to
define variables and strings that can be stored in the buffer.
• A set of commands that retrieve information about the
database and database objects.
• Ability to store buffers or buffer output to a file.
• Multiple options for formatting the output of scripts and
queries.
• Support for executing system-defined routines.
• Support for executing operating system commands.
• Option for recording the output of executed commands,
statements, or scripts.
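A minimal interactive sketch (the database name SAMPLE, user db2inst1, and port 50000 are illustrative assumptions):

clpplus db2inst1@localhost:50000/sample
SQL> SET PAGESIZE 24
SQL> COLUMN LASTNAME FORMAT A20
SQL> SELECT EMPNO, LASTNAME FROM EMPLOYEE;
SQL> EXIT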
© Copyright IBM Corporation 2010
Roadmap to the GUI tools

IBM DB2 tools:

Information: Information Center (db2ic)
Command Line Tools: Command Editor (db2ce), Command Window (db2cw), Command Line Processor (db2), Command Line Processor Plus (clpplus)
General Administration Tools: Control Center (db2cc), Journal (db2journal), Replication Center (db2rc), Task Center (db2tc), License Center
Monitoring Tools: Event Analyzer (db2eva), Health Center (db2hc), InDoubt Transaction Manager (db2indbt), Activity Monitor (db2am), Memory Visualizer (db2memvis)
Setup Tools: Configuration Assistant (db2ca), First Steps (db2fs), Register Visual Studio Add-ins (db2vsregister), Satellite Synchronizer

© Copyright IBM Corporation 2010


Control Center

(Screen capture: the Control Center window with its Object Tree pane, Contents pane, and Object Detail pane.)

© Copyright IBM Corporation 2010


Command Editor

© Copyright IBM Corporation 2010


Administration tools

© Copyright IBM Corporation 2010


Using the wizards

© Copyright IBM Corporation 2010


Health Center

© Copyright IBM Corporation 2010


Journal

© Copyright IBM Corporation 2010


Configuration Assistant (1 of 2)
• Used to configure client connections
– Clients can be configured:
• Manually
• Automatically using DB2 Discovery
• Using imported profiles
• Search the network
• DB2 CA can also:
– Connect to databases, both LAN and DRDA
– Perform CLI/ODBC administration
– Bind applications to database

(Diagram: the Configuration Assistant on a remote client uses DB2 Discovery to reach the DB2 Administration Server and DB2 servers.)
© Copyright IBM Corporation 2010
Configuration Assistant (2 of 2)

© Copyright IBM Corporation 2010


DB2 Administration Server (DAS)
• DB2 control point to assist with server tasks
• Must have a running DAS on the server to use the Configuration
Assistant, Control Center, or Task Center
• Might be created and configured at installation time
• Assists with:
– Enabling remote administration of DB2 servers
– Providing the facility for job management, including scheduling
– Providing a means for discovering information about the
configuration of DB2 instances, databases, and other DB2
Administration Servers in conjunction with DB2 Discovery

© Copyright IBM Corporation 2010


Using the Administration Server
• Creating the Administration Server:
– dascrt ASName (Linux/UNIX)
– db2admin create (Windows)

• Starting and stopping the Administration Server:


– db2admin start
– db2admin stop

• Listing the Administration Server:


– db2admin

• Configuring the Administration Server:


– db2 get admin cfg
– db2 update admin cfg using ...

• Removing the Administration Server:


– dasdrop ASName (Linux/UNIX)
– db2admin drop (Windows)
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


The DB2 environment

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration

© Copyright IBM Corporation 2010


What is an instance?
(Diagram: one DB2 product installation hosting Instance_1 and Instance_2. Each instance has its own DBM CFG. Instance_1 contains DB_1 and DB_2; Instance_2 contains DB_3. Each database has its own DB CFG, catalog, and log.)

© Copyright IBM Corporation 2010


The Database Manager instance

(Diagram: a computer system running two Database Manager instances, inst1 and inst2. Each instance manages its own databases; a database contains table spaces, for example Tablespace A and Tablespace B, which hold the tables. A local user application connects to a database in inst1; the target instance is selected through the environment, for example PATH=... and DB2INSTANCE=inst1.)

© Copyright IBM Corporation 2010


Create and drop an instance
• CREATE (different on Linux/UNIX and Windows):
– Prerequisites met?
– Creates the Instance
– Creates the SQLLIB subdirectory
– Creates needed directories in the SQLLIB subdirectory
– Creates the DBM CFG with defaults
db2icrt -u <fencedID> <instance_name> (UNIX/Linux)
db2icrt <instance_name> (Windows)

• DROP:
– Instance must be stopped and no applications are allowed to be
connected to the databases in this instance
– Does not remove databases (catalog entries)
– Removes the Instance
db2idrop <instance_name>

© Copyright IBM Corporation 2010


Starting and stopping an instance

db2start db2stop

© Copyright IBM Corporation 2010


Setting DB2 variable values

Environment variables – platform specific; the environment must be reinitialized after a change (no restart of the system)
Instance-level profile registry – changed with the db2set command; the instance must be stopped and restarted
Global-level profile registry – changed with db2set; no restart of the system after changing
© Copyright IBM Corporation 2010
The db2set command
• Command line tool for DB2 Profile Registry administration
• Displays, sets, resets, or removes profile variables
db2set variable=value
-g
-i instance [db-partition-number]
-n DAS node [[-u user id] [-p password]]
-u
-ul
-ur
-p
-r
-l
-lr
-v
-? (or -h)
-all
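Typical invocations (the instance name db2inst1 is illustrative):

db2set -all (display all defined variables at every level)
db2set -lr (list all supported registry variables)
db2set DB2COMM=tcpip -i db2inst1 (set DB2COMM in the instance-level registry)
db2set DB2COMM= (remove the setting)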

© Copyright IBM Corporation 2010


Database Manager configuration
C:\IBM\SQLLIB\BIN>db2 get dbm cfg

Database Manager Configuration

Node type = Enterprise Server Edition with local and remote clients

Database manager configuration release level = 0x0d00

Maximum total of files open (MAXTOTFILOP) = 16000


CPU speed (millisec/instruction) (CPUSPEED) = 4.605357e-007
Communications bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+002

Max number of concurrently active databases (NUMDB) = 8


Federated Database System Support (FEDERATED) = NO
Transaction processor monitor name (TP_MON_NAME) =

Default charge-back account (DFT_ACCOUNT_STR) =

Java Development Kit installation path (JDK_PATH) = C:\IBM\SQLLIB\java\jdk

Diagnostic error capture level (DIAGLEVEL) = 3


Notify Level (NOTIFYLEVEL) = 3
Diagnostic data directory path (DIAGPATH) =
Size of rotating db2diag & notify logs (MB) (DIAGSIZE) = 0

Default database monitor switches


Buffer pool (DFT_MON_BUFPOOL) = OFF
Lock (DFT_MON_LOCK) = OFF
Sort (DFT_MON_SORT) = OFF
Statement (DFT_MON_STMT) = OFF
Table (DFT_MON_TABLE) = OFF
Timestamp (DFT_MON_TIMESTAMP) = ON
Unit of work (DFT_MON_UOW) = OFF
Monitor health of instance and databases (HEALTH_MON) = ON
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Creating databases and data placement

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to setup and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters

© Copyright IBM Corporation 2010


Create database overview

Database Manager Instance

database1 database2

Tablespace A Tablespace A
Tablespace B
Table 1 Table 2 Table 3 Table 4 Table 1 Table 2

• Databases are created within a Database Manager instance


• Table spaces are a logical layer created within a database
• Tables are created within table spaces

© Copyright IBM Corporation 2010


Database storage requirements
• Database Path:
– Database Control Files for each database
• Includes Database Configuration file, Recovery History, Log Control files,
Table space Control file, Bufferpool Control file and others
– Initial location for database log files
– Default location is dftdbpath in DBM CFG
• Needs to be a local file system
• Automatic Storage paths:
– Allows table space storage to be managed at the database level
– If Automatic storage is enabled there will be at least one path defined
– Initial Storage Paths defined when database is created
• Default System table spaces:
– Use Automatic Storage management by default, if enabled, but can be
defined to use any supported type of table space
– SYSCATSPACE: DB2 catalog tables
– TEMPSPACE1: System Temporary tables, provide work space for
sorting and utility processing
– USERSPACE1: Initial table space for defining user tables and indexes
© Copyright IBM Corporation 2010
DB2 Storage Management basics
• DB2 supports three types of storage management for table spaces
• All three types can be used in a single database
• Storage Management type set when a table space is created
• DMS – Database Managed Storage:
– Table space containers defined using the specified files or raw devices
– Disk space allocated is reserved for objects in that table space
• SMS – System Managed Storage:
– Table space containers defined using the specified directories
– No defined initial size or limit
– DB2 creates files for each database object
– Disk space is freed when objects are dropped
• Automatic Storage Management:
– Disk Storage Paths are assigned to the database
– Can be enabled when a database is created or added to an existing database
– Available space can be monitored at the database partition level
– DB2 defines the number and names for containers
– Uses SMS for temporary storage and DMS for other storage types

© Copyright IBM Corporation 2010


DMS table space: Minimum requirements

Table space creation (4 extents, plus one per additional container):
– container tag: 1 extent per container
– table space header: 1 extent
– table space space map: 1 extent
– object table (metadata): 1 extent

Table (object) creation:
– extent map: 1 extent
– extent for data: 1 extent

Minimum space required in containers = 6 extents (for the first object)

If EXTENTSIZE = 32 pages, then 6 * 32 = 192 pages need to be provided in the containers

© Copyright IBM Corporation 2010


CREATE DATABASE syntax
CREATE {DATABASE | DB} database-name [AT DBPARTITIONNUM]
|Create Database options|

Create Database options:
[AUTOMATIC STORAGE YES | NO] [ON path-or-drive, ... [DBPATH ON path-or-drive]]
[ALIAS db-alias] [USING CODESET codeset TERRITORY territory]
[COLLATE USING {SYSTEM | IDENTITY | ...}] [PAGESIZE 4096 | integer [K]]
[DFT_EXTENT_SZ dft-extentsize] [RESTRICTIVE]
[CATALOG TABLESPACE |tblspace-defn|] [USER TABLESPACE |tblspace-defn|]
[TEMPORARY TABLESPACE |tblspace-defn|]
[WITH "comment-string"] [|autoconfigure-settings|]

tblspace-defn:
MANAGED BY SYSTEM USING ( 'container-string', ... )
| MANAGED BY DATABASE USING ( {FILE | DEVICE} 'container-string' num-pages, ... )
[EXTENTSIZE num-pages] [PREFETCHSIZE num-pages]
[OVERHEAD number-of-milliseconds] [TRANSFERRATE number-of-milliseconds]


© Copyright IBM Corporation 2010
Create Database Wizard

© Copyright IBM Corporation 2010


CREATE DATABASE examples
create database sales1 on /dbsales1
– Database Path: /dbsales1
– Automatic Storage Path: /dbsales1
create database sales2 automatic storage no on
/dbsales2
– Database Path: /dbsales2
– Automatic Storage not enabled
create database sales3 on /dbauto3 dbpath on /dbsales3
– Database Path: /dbsales3
– Automatic Storage Path: /dbauto3
create database sales4 automatic storage yes
on /dbauto41,/dbauto42,/dbauto43
dbpath on /dbsales4
– Database Path: /dbsales4
– Automatic Storage Paths: /dbauto41, /dbauto42 and /dbauto43

© Copyright IBM Corporation 2010


Completed by
DB2 during database creation (1 of 2)
1. Creates database in the specified subdirectory
2. Creates IBMDEFAULTGROUP, IBMCATGROUP, and IBMTEMPGROUP database
partition groups
3. Creates SYSCATSPACE, TEMPSPACE1, and USERSPACE1 table spaces
4. Creates the system catalog tables and recovery logs
5. Catalogs database in local database directory and system database
directory
6. Stores the specified code set, territory, and collating sequence
7. Creates the schemas SYSCAT, SYSFUN, SYSIBM, and SYSSTAT
8. Binds database manager bind files to the database (db2ubind.lst)
– DB2 CLI packages are automatically bound to databases when the databases
are created or migrated. If a user has intentionally dropped a package, then
you must rebind db2cli.lst.

© Copyright IBM Corporation 2010


Completed by
DB2 during database creation (2 of 2)
9. Grants the following privileges:
– ACCESSCTRL , DATAACCESS , DBADM and SECADM
privileges to database creator
– SELECT privilege on system catalog tables and views to PUBLIC
– UPDATE access to the SYSSTAT catalog views
– BIND and EXECUTE privilege to PUBLIC for each successfully
bound utility
– CREATETAB, BINDADD, IMPLICIT_SCHEMA, and CONNECT
authorities to PUBLIC
– USE privilege on USERSPACE1 table space to PUBLIC
– Usage of the WLM workload for user class
SYSDEFAULTUSERCLASS to PUBLIC
• When the RESTRICTIVE option is used, no privileges are
automatically granted to PUBLIC

© Copyright IBM Corporation 2010


Database Path files
CREATE DATABASE DSS ON /dbauto DBPATH ON /database

database DB2INSTANCE=inst20

inst20
db2nodes.cfg

NODEnnnn
SQLDBDIR
sqldbdir

SQLxxxxx
SQLDBCONF SQLBP.1 SQLBP.2
db2rhist.asc db2rhist.bak
SQLOGCTL.LFH.1 SQLOGCTL.LFH.2 SQLOGMIR.LFH
SQLSGF.1 SQLSGF.2
SQLSPCS.1 SQLSPCS.2

SQLOGDIR

S0000000.LOG S0000001.LOG S0000002.LOG

© Copyright IBM Corporation 2010


Default
table space containers with Automatic Storage

CREATE DATABASE DSS ON /dbauto DBPATH ON /database

dbauto DB2INSTANCE=inst20

inst20

NODE0000

DSS
T0000000 SYSCATSPACE
C0000000.CAT

T0000001 TEMPSPACE1
C0000000.TMP

T0000002 USERSPACE1
C0000000.LRG

© Copyright IBM Corporation 2010


System Catalog tables and views

(Diagram: the SYSCAT and SYSSTAT views are defined over base catalog tables such as SYSIBM.SYSTABLES and SYSIBM.SYSCOLUMNS.)

See notes for actual table and view names.

© Copyright IBM Corporation 2010


Create BUFFERPOOL syntax
• Requires SYSCTRL or SYSADM authority
CREATE BUFFERPOOL bufferpool-name [DEFERRED]
[ALL DBPARTITIONNUMS | DATABASE PARTITION GROUP db-partition-group-name, ...]
[SIZE number-of-pages [AUTOMATIC]] (default: SIZE 1000)
[NUMBLOCKPAGES number-of-pages [BLOCKSIZE number-of-pages]] (default: NUMBLOCKPAGES 0)
[PAGESIZE integer [K]]
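For example, a minimal statement (the buffer pool name and size are illustrative):

db2 "CREATE BUFFERPOOL BP32K SIZE 2000 PAGESIZE 32 K"

A table space created with PAGESIZE 32 K must then name a buffer pool of the same page size.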
© Copyright IBM Corporation 2010
Table space, container, extent, page

TABLE SPACE
CONTAINER

EXTENT
DATA
PAGE

© Copyright IBM Corporation 2010


Containers and table spaces
• Container is an Allocation of Physical Space

What does a container look like?
– a directory (SMS)
– a file (DMS)
– a device (DMS)

© Copyright IBM Corporation 2010


Writing to containers
• DFT_EXTENT_SZ defined at database level
• EXTENTSIZE defined at table space level
• Data written in round-robin manner

(Diagram: table space TBS with two containers; extents 0 and 2 are written to Container 0, extents 1 and 3 to Container 1, in round-robin order.)
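For example, with EXTENTSIZE 32 and the two containers above, pages 0-31 (extent 0) are written to Container 0, pages 32-63 (extent 1) to Container 1, pages 64-95 (extent 2) back to Container 0, and so on.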


© Copyright IBM Corporation 2010
Table space design limits: Row Identifiers
• Standard RIDs (Row ID = 4 bytes) – for tables in REGULAR table spaces
– Page size / maximum table space size: 4 KB / 64 GB; 8 KB / 128 GB; 16 KB / 256 GB; 32 KB / 512 GB
– Limits per table: 16M pages, 255 rows per page, about 4x10^9 rows

• Large RIDs (Large Row ID = 6 bytes) – for tables in LARGE table spaces (DMS or Automatic Storage); also all SYSTEM and USER temporary table spaces
– Page size / maximum table space size: 4 KB / 8 TB; 8 KB / 16 TB; 16 KB / 32 TB; 32 KB / 64 TB
– Limits per table: 512M pages, about 2K rows per page, 1.1x10^12 rows
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (1 of 2)
CREATE [LARGE | REGULAR | SYSTEM TEMPORARY | USER TEMPORARY] TABLESPACE tablespace-name
[IN DATABASE PARTITION GROUP db-partition-group-name]
[PAGESIZE integer [K]] (default: PAGESIZE 4096)
[MANAGED BY AUTOMATIC STORAGE |size-attributes|
 | MANAGED BY SYSTEM |system-containers|
 | MANAGED BY DATABASE |database-containers|]
[EXTENTSIZE num-pages | integer [K | M | G]]
[PREFETCHSIZE AUTOMATIC | num-pages | integer [K | M | G]]
[BUFFERPOOL bufferpool-name]
[OVERHEAD number-of-milliseconds] (default: 7.5)
[TRANSFERRATE number-of-milliseconds] (default: 0.06)
[FILE SYSTEM CACHING | NO FILE SYSTEM CACHING]
[DROPPED TABLE RECOVERY ON | OFF] (default: ON)
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (2 of 2)
size-attributes:
[AUTORESIZE YES | NO] [INITIALSIZE integer {K | M | G}]
[INCREASESIZE integer {PERCENT | K | M | G}] [MAXSIZE integer {K | M | G} | NONE]

system-containers:
USING ( 'container-string', ... ) [on-db-partitions-clause]

database-containers:
USING |container-clause| [on-db-partitions-clause]

container-clause:
( {FILE | DEVICE} 'container-string' {num-pages | integer {K | M | G}}, ... )

on-db-partitions-clause:
ON {DBPARTITIONNUM | DBPARTITIONNUMS} ( db-partition-number1 [TO db-partition-number2], ... )
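For example, a minimal DMS definition (paths and sizes are illustrative):

db2 "CREATE TABLESPACE DMS01
MANAGED BY DATABASE
USING (FILE '/db2data/dms01c1' 10000, FILE '/db2data/dms01c2' 10000)
EXTENTSIZE 32 PREFETCHSIZE AUTOMATIC"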

© Copyright IBM Corporation 2010


Create Table Space Wizard

© Copyright IBM Corporation 2010


Storage Management alternatives: Automatic
• Automatic Storage Managed:
– Administration is very easy, no need to define the number or names of
the containers
– Disk space assigned and maintained at the database level
– Monitoring of available space at the database partition level instead of
each table space
– Multiple containers will be created using all available storage paths for
performance
– Automatic Storage can be enabled when the database is created or
added to an existing database
• Default is ON for CREATE DATABASE with DB2 9
– Storage paths can be added or removed using ALTER DATABASE
– Uses standard DMS and SMS under the covers:
• DMS used for REGULAR and LARGE table spaces
• SMS used for SYSTEM and USER TEMPORARY table spaces
– Table space allocation controlled by CREATE/ALTER options:
• INITIALSIZE: Defaults to 32 MB
• AUTORESIZE: Can be set to YES or NO
• INCREASESIZE: Can be set to amount or percent increase
• MAXSIZE: Can define growth limits
© Copyright IBM Corporation 2010
Automatic storage: Table space examples
• Syntax for CREATE and ALTER TABLESPACE:
CREATE TABLESPACE <tsName> [MANAGED BY AUTOMATIC STORAGE]
[INITIALSIZE integer {K|M|G}]
[AUTORESIZE {NO|YES}] [INCREASESIZE integer {PERCENT|K|M|G}]
[MAXSIZE {NONE | integer {K|M|G}}]
• Default initial size is 32 MB
• Default max size is none
• Default increase size is determined by DB2, which might
change over the life of the table space
• Examples:
CREATE TABLESPACE USER1
CREATE TEMPORARY TABLESPACE TEMPTS
CREATE TABLESPACE MYTS INITIALSIZE 100 M MAXSIZE 1 G
CREATE LARGE TABLESPACE LRGTS INITIALSIZE 5 G AUTORESIZE NO
CREATE REGULAR TABLESPACE USER2 INITIALSIZE 500 M
© Copyright IBM Corporation 2010
ALTER TABLESPACE
• ALTER TABLESPACE can be used to change table space
characteristics:
– For all types of table space management, you can adjust:
• Bufferpool Assigned
• Prefetch size
• Overhead and Transfer rate I/O Costs
• File System Caching option
– For DMS-managed table spaces, you can:
• Use the ADD, DROP, RESIZE, EXTEND, REDUCE and BEGIN NEW
STRIPE SET to directly adjust container names and sizes.
• Use MANAGED BY AUTOMATIC STORAGE to convert to automatic
storage management
– For Automatic Storage-managed table spaces, you can:
• Use the REDUCE option to release unused disk space
• Use REBALANCE to reallocate containers when automatic storage
paths are added or dropped
– For DMS and Automatic Storage, you can:
• Change the MAXSIZE, INCREASESIZE and AUTORESIZE settings
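A few illustrative statements (table space names, paths, and sizes are assumptions):

db2 "ALTER TABLESPACE DMS01 PREFETCHSIZE AUTOMATIC"
db2 "ALTER TABLESPACE DMS01 ADD (FILE '/db2data/dms01c3' 10000)"
db2 "ALTER TABLESPACE MYTS MAXSIZE 2 G INCREASESIZE 10 PERCENT"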
© Copyright IBM Corporation 2010
Adding or dropping Automatic Storage Paths
• New Automatic Storage paths can be added to a database using
the ALTER DATABASE statement
alter database add storage on '/dbauto/path3','/dbauto/path4'

– Automatic Storage can be enabled in an existing database by adding


the first paths
– Existing Automatic tablespaces will grow using the previously
assigned storage paths until remaining space is used
– Newly created tablespaces will begin to use all defined paths
– Individual table spaces can be altered using REBALANCE to spread
data over all storage paths

• Storage paths can also be removed using ALTER DATABASE


alter database drop storage on '/dbauto/path1'

– The dropped path will be in drop pending state until an ALTER
TABLESPACE with the REBALANCE option is used for each table
space with containers on the dropped path
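For example, after a path is dropped, each affected table space is rebalanced individually (the table space name is illustrative):

db2 "ALTER TABLESPACE USER1 REBALANCE"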
© Copyright IBM Corporation 2010
Alter Table Space Wizard

© Copyright IBM Corporation 2010


GET SNAPSHOT FOR TABLESPACES command
• Use GET SNAPSHOT FOR TABLESPACES ON <database>
– LIST TABLESPACES does not provide all of the information
available
• Example of the GET SNAPSHOT FOR TABLESPACES command
for database SAMPLE:
Tablespace name = SYSCATSPACE
Tablespace ID = 0

Automatic Prefetch size enabled = Yes
Buffer pool ID currently in use = 1
Buffer pool ID next startup = 1
Using automatic storage = Yes
Auto-resize enabled = Yes

Increase size (bytes) = AUTOMATIC
• Note that Using automatic storage = Yes and Increase size
(bytes) = AUTOMATIC are shown in the snapshot.
© Copyright IBM Corporation 2010
Using the SNAPCONTAINER view
• Use the SYSIBMADM.SNAPCONTAINER view to retrieve
container information
SELECT SNAPSHOT_TIMESTAMP, TBSP_NAME, TBSP_ID,
CONTAINER_NAME, CONTAINER_ID, CONTAINER_TYPE,
ACCESSIBLE, TOTAL_PAGES, USABLE_PAGES,
DBPARTITIONNUM FROM SYSIBMADM.SNAPCONTAINER ORDER
BY DBPARTITIONNUM, TBSP_NAME, CONTAINER_ID

© Copyright IBM Corporation 2010


Using the MON_GET_TABLESPACE function
• The MON_GET_TABLESPACE table function returns monitor metrics
for one or more table spaces.
Syntax
>>-MON_GET_TABLESPACE--(--tbsp_name--,--member--)---><

• Example
To list table spaces ordered by number of physical reads from table
space containers.
SELECT varchar(tbsp_name, 30) as tbsp_name,
member,
tbsp_type,
pool_data_p_reads
FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t
ORDER BY pool_data_p_reads DESC

TBSP_NAME MEMBER TBSP_TYPE POOL_DATA_P_READS


------------------------------ ------ ---------- --------------------
SYSCATSPACE 0 DMS 79
USERSPACE1 0 DMS 34
TEMPSPACE1 0 SMS 0
© Copyright IBM Corporation 2010
Data placement considerations
• If index scans are required:
– Consider separate table spaces for index and data

• Consider placing index table space on fastest media available


• Consider placing entire table onto separate table space
– If accessed frequently

• Consider EXTENTSIZE
– Trade-off between space (very small tables) and performance

• I/O PREFETCH size


– Setting to AUTOMATIC is the default and recommended

© Copyright IBM Corporation 2010


Maintain or List System Database Directory
• CLP
db2 ? CATALOG DATABASE
CATALOG DATABASE database-name [AS alias] [ON path | AT NODE nodename]
[AUTHENTICATION {SERVER | CLIENT | ... | SERVER_ENCRYPT}]
[WITH "comment-string"]

db2 'CATALOG DATABASE ourdb AS TESTDB ON /database WITH "Test Database"'

db2 LIST DATABASE DIRECTORY


...

Database alias = TESTDB


Database name = OURDB
Local database directory = /database
Database release level = a.00
Comment = Test Database
Directory entry type = Indirect
Catalog database partition number = 4

© Copyright IBM Corporation 2010


Additional DB2 System table spaces
• The SYSTOOLSPACE table space is a user data table space used by the DB2
administration tools and some SQL administrative routines for storing historical data
and configuration information:
– Design Advisor
– Alter table notebook
– Configure Automatic Maintenance wizard
– Storage Management tool
– db2look command
– Automatic statistics collection (including the Statistics Collection Required
health indicator)
– Automatic reorganization (including the Reorganization Required health
indicator)
– GET_DBSIZE_INFO stored procedure
– ADMIN_COPY_SCHEMA stored procedure
– ADMIN_DROP_SCHEMA stored procedure
– SYSINSTALLOBJECTS stored procedure
– ALTOBJ stored procedure
• The SYSTOOLSTMPSPACE table space is a user temporary table space used by
the REORGCHK_TB_STATS, REORGCHK_IX_STATS and the ADMIN_CMD
stored procedures for storing temporary data.
© Copyright IBM Corporation 2010
Database configuration
db2 get db cfg for musicdb
show detail

db2 update db cfg for musicdb


using <parm> <value>

© Copyright IBM Corporation 2010


Database Manager configuration
db2 get dbm cfg show detail

db2 update dbm cfg using <parm>


<value>

© Copyright IBM Corporation 2010


ACTIVATE and DEACTIVATE the database
• ACTIVATE DATABASE
– Activates the specified database and starts up all necessary
database services, so that the database is available for connection
and use by any application.
• Incurs the database startup processing
db2 activate db <db_name> user <user> using
<password>

• DEACTIVATE DATABASE
– Databases initialized by ACTIVATE DATABASE can be shut down
using the DEACTIVATE DATABASE command, or using the
db2stop command.
db2 deactivate db <db_name>

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to setup and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Creating database objects

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Exploring the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies

© Copyright IBM Corporation 2010


DB2 object hierarchy

(Diagram: Instance 1, with its dbm configuration file, contains Database 1 and Database 2. Each database has its own DB configuration file, log, and SYSCATSPACE holding the catalog and views. Database 1 holds table spaces TSSMS1, TSDMSLRG1, TSDMSREG2, and TSDMSLRG3 containing tables, indexes, and BLOBs; Database 2 holds USERSPACE1 with its own table and index.)

© Copyright IBM Corporation 2010


Database physical directories and files

/path
[ SAMPLE ]
[ DB2INSTANCE ] [ T0000000 ] C0000000.CAT
SQLCRT.FLG
db2rhist.asc [ T0000001 ] SQLCRT.FLG
[ NODE0000 ] db2rhist.bak
SQLBP.1
[ C0000000.TMP ]
[ sqldbdir ] SQLBP.2
[ T0000002 ] C0000000.LRG
SQLDBCON SQLCRT.FLG
[ SQL00001 ] SQLDBCONF C0000000.LRG
SQLINSLK [ T0000003 ] SQLCRT.FLG
SQLOGCTL.LFH.1
SQLOGCTL.LFH.2 C0000000.LRG
[ T0000004 ] SQLCRT.FLG
SQLOGMIR.LFH
SQLSGF.1 C0000000.LRG
SQLSGF.2 [ T0000005 ] SQLCRT.FLG
SQLSPCS.1
[ db2event ] SQLSPCS.2
SQLTMPLK
[ your_SMS_tablespace ]
[ db2detaildeadlock ]
00000000.EVT
[ CONT1 ]

..
DB2EVENT.CTL SQLTAG.NAM
[ SQLOGDIR ] [ CONT2 ]
S0000000.LOG SQLTAG.NAM
S0000001.LOG SQL00041.DAT

© Copyright IBM Corporation 2010


Create Schema

CREATE SCHEMA PAYROLL AUTHORIZATION DB2ADMIN;


COMMENT ON Schema PAYROLL IS 'schema for
payroll application';
© Copyright IBM Corporation 2010
Set current schema

connect to musicdb user Keith;


select * from employee;
• Will select from KEITH.EMPLOYEE

set current schema = 'PAYROLL';


select * from employee;
• Will select from PAYROLL.EMPLOYEE

© Copyright IBM Corporation 2010


Create Table

© Copyright IBM Corporation 2010


Create Table statement

connect to musicdb;
create table artists
(artno smallint not null,
name varchar(50) with default 'abc',
classification char(1) not null,
bio clob(100K) logged,
picture blob( 10M) not logged compact)

in dms01

index in dms02

long in dms03;

© Copyright IBM Corporation 2010


Application Temporary Tables
• Applications can utilize global temporary tables:
– Global temporary tables are stored in User temporary table spaces
– The associated data is deleted when the application connection ends
– Each connection accesses a private copy of the table, so it only sees data put
into the table by that one connection
– Normal table locking is not needed for global temporary tables since the data is
always limited to a single connection
– Declared Global Temporary table:
• These are defined during application execution using the DECLARE GLOBAL
TEMPORARY TABLE statement
• No DB2 Catalog information is required
• The schema for a declared global temporary table is always ‘SESSION’
– Created Global Temporary table:
• These are defined using a CREATE GLOBAL TEMPORARY TABLE statement
with a user selected schema
• The table and any associated indexes may be created before an application
connection starts, but only the catalog definition exists
• The catalog definition is used when the application references the table
• Each application connection still works only with the data it stores in the table
© Copyright IBM Corporation 2010
Example of a Declared Temporary Table
• A System Administrator creates the user temporary table space
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE
BUFFERPOOL IBMDEFAULTBP ;
• The application uses SQL statements to declare and access the
table
DECLARE GLOBAL TEMPORARY TABLE T1
LIKE REAL_T1
ON COMMIT DELETE ROWS
NOT LOGGED
IN USR_TEMP_TS;
INSERT INTO SESSION.T1
SELECT * FROM REAL_T1 WHERE DEPTNO=:mydept;
• /* do other work on T1 */
• /* when connection ends, table is automatically dropped */
© Copyright IBM Corporation 2010
Example of a Created Temporary Table
• A System Administrator creates the user temporary table
space and defines a global temporary table and indexes
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS2"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE ;

CREATE GLOBAL TEMPORARY TABLE APP1.DEPARTMENT


LIKE PROD.DEPARTMENT
ON COMMIT DELETE ROWS
NOT LOGGED IN USR_TEMP_TS2;

CREATE INDEX APP1.DEPTIX1 ON APP1.DEPARTMENT (DEPTNO);

• The application uses SQL statements to reference the


temporary table; no DECLARE is needed
INSERT INTO APP1.DEPARTMENT
SELECT * FROM PROD.DEPARTMENT WHERE DEPTNO=:mydept;

SELECT * FROM APP1.DEPARTMENT WHERE LASTNAME = 'STOPFER'


© Copyright IBM Corporation 2010
Large objects: The need

(Diagram: content such as an "Action News" clip or the book "DB2 by the Book" is stored in DB2 as large objects – BLOB (Binary Large Object), CLOB (Character Large Object), and DBCLOB (Double-Byte Character Large Object).)
© Copyright IBM Corporation 2010
Table partitioning
• Data organization scheme in which table data is divided across
multiple storage objects called data partitions or ranges:
– Each data partition is stored separately
– These storage objects can be in different table spaces, in the same
table space, or a combination of both

• Benefits:
– Easier roll-in and roll-out of table data
– Allows large data roll-in (ATTACH) or roll-out (DETACH) with a
minimal impact to table availability for applications
– Supports very large tables
– Indexes can be either partitioned (local) or non-partitioned (global)
– Table and Index scans can use partition elimination when access
includes predicates for the defined ranges
– Hierarchical Storage Management (HSM) solutions can be used to
manage older, less frequently accessed ranges
© Copyright IBM Corporation 2010
Example of a partitioned table
• The PARTITION BY RANGE clause defines a set of data ranges
CREATE TABLE PARTTAB.HISTORYPART ( ACCT_ID INTEGER NOT NULL ,
TELLER_ID SMALLINT NOT NULL ,
BRANCH_ID SMALLINT NOT NULL ,
BALANCE DECIMAL(15,2) NOT NULL ,
……….
TEMP CHAR(6) NOT NULL )
PARTITION BY RANGE (BRANCH_ID)
(STARTING FROM (1) ENDING (20) IN TSHISTP1 INDEX IN TSHISTI1 ,
STARTING FROM (21) ENDING (40) IN TSHISTP2 INDEX IN TSHISTI2 ,
STARTING FROM (41) ENDING (60) IN TSHISTP3 INDEX IN TSHISTI3 ,
STARTING FROM (61) ENDING (80) IN TSHISTP4 INDEX IN TSHISTI4 ) ;

CREATE INDEX PARTTAB.HISTPIX1 ON PARTTAB.HISTORYPART (TELLER_ID)


PARTITIONED ;

CREATE INDEX PARTTAB.HISTPIX2 ON PARTTAB.HISTORYPART (BRANCH_ID)


PARTITIONED ;

• In this example the data objects and index objects for each data range
are stored in different table spaces
• The table spaces used must be defined with the same options, such as
type of management, extent size and page size

© Copyright IBM Corporation 2010


Create View

© Copyright IBM Corporation 2010


Create View statement

CONNECT TO TESTDB;
CREATE VIEW EMPSALARY
AS SELECT EMPNO, EMPNAME, SALARY
FROM PAYROLL, PERSONNEL
WHERE EMPNO=EMPNUMB AND SALARY > 30000.00;

SELECT * FROM EMPSALARY

EMPNO EMPNAME SALARY


------ ----------------- ----------
10 John Smith 1000000.00
20 Jane Johnson 300000.00
30 Robert Appleton 250000.00
...

© Copyright IBM Corporation 2010


Create Alias

© Copyright IBM Corporation 2010


Create Alias statement
• Cannot be the same as an existing table, view, or alias

CONNECT TO SAMPLE;
CREATE ALIAS ADMIN.MUSICIANS FOR ADMIN.ARTISTS;
COMMENT ON Alias ADMIN.MUSICIANS IS 'MUSICIANS
alias for ARTISTS';
CONNECT RESET;

© Copyright IBM Corporation 2010


Create Index

© Copyright IBM Corporation 2010


Create Index statements
create unique index dba1.empno on
dba1.employee (empno asc)
pctfree 10
minpctused 10
allow reverse scans
page split symmetric
collect sampled detailed statistics ;

create unique index itemno on albums (itemno) ;

create index item on stock (itemno) cluster ;

create unique index empidx on employee (empno)


include (lastname, firstname) ;

© Copyright IBM Corporation 2010


Overview of Referential Integrity

Department Table (DEPT, DEPTNAME) – parent table, PRIMARY KEY = DEPT
Employee Table (EMPNO, NAME, WKDEPT) – dependent table, FOREIGN KEY = WKDEPT, delete rule RESTRICT

• Place constraints between tables


• Constraints specified with Create and Alter table statements
• Database services enforce constraints: inserts, updates, deletes
• Removes burden of constraint checking from application programs
© Copyright IBM Corporation 2010
RI: Add a foreign key

© Copyright IBM Corporation 2010


Referential Integrity: CREATE TABLE statement
CREATE TABLE DEPARTMENT
(DEPT CHAR (3) NOT NULL,
DEPTNAME CHAR(20) NOT NULL,
CONSTRAINT DEPT_NAME
UNIQUE (DEPTNAME),
PRIMARY KEY (DEPT))
IN DMS99 .... ;

CREATE TABLE EMPLOYEE


(EMPNO CHAR(6) NOT NULL,
NAME CHAR(30),
WKDEPT CHAR(3) NOT NULL,
CONSTRAINT DEPT
FOREIGN KEY
(WKDEPT)
REFERENCES DEPARTMENT
ON DELETE RESTRICT)
IN DMS100 ... ;
© Copyright IBM Corporation 2010
Unique Key considerations
• Multiple keys in one table can be foreign key targets
CREATE TABLE PAY.EMPTAB
(EMPNO SMALLINT NOT NULL PRIMARY KEY,
NAME CHAR(20),
DRIVERLIC CHAR(17) NOT NULL,
CONSTRAINT DRIV_UNIQ UNIQUE(DRIVERLIC)
)IN TBSP1 ;

– Unique indexes PAY.DRIV_UNIQ and SYSIBM.yymmddhhmmssxxx are created
– Columns of a unique key must be NOT NULL

• Deferral of unique checking until end of statement processing


– UPDATE EMPTAB SET EMPNO=EMPNO + 1

© Copyright IBM Corporation 2010


Overview of Check Constraints: The need

DATA RULE : No value in the US_SL


column in the table
SPEED_LIMITS can exceed 65.
(Diagram: with application enforcement, the rule must be coded in every program – PGM 1, PGM 2, ... PGM n. With database enforcement, DB2 applies the rule once, inside the database.)

© Copyright IBM Corporation 2010


Check Constraints: Definition
CREATE TABLE SPEED_LIMITS
(ROUTE_NUM SMALLINT,
CANADA_SL INTEGER NOT NULL,
US_SL INTEGER NOT NULL
CHECK (US_SL <=65) ) ;

ALTER TABLE SPEED_LIMITS


ADD
CONSTRAINT SPEED65
CHECK (US_SL <=65) ;

© Copyright IBM Corporation 2010


Implementation of Check Constraints

© Copyright IBM Corporation 2010


Overview of triggers: The need

BUSINESS RULES:

INSERT, UPDATE, and DELETE actions against a table may require
another action or set of actions to occur to meet the needs of the
business.

(Diagram: with triggers, DB2 enforces these follow-on actions in the database rather than in each application program.)

© Copyright IBM Corporation 2010


Implementation of triggers (1 of 2)

© Copyright IBM Corporation 2010


Implementation of triggers (2 of 2)

© Copyright IBM Corporation 2010


Create Trigger statement

create trigger reorder


after update
of qty on stock
referencing new as n
for each row mode db2sql
when (n.qty <=5)
insert into reorder values (n.itemno, current
timestamp) ;

© Copyright IBM Corporation 2010


Data Row Compression summary
• Data row compression was introduced in DB2 9.1
• COMPRESS option for CREATE and ALTER TABLE is used
to specify compression
• Compression uses a static dictionary:
– Dictionary stored in data object, about 100K in size
– Compression Dictionary needs to be built before a row can be
compressed
– Dictionary can be built or rebuilt using REORG TABLE offline, which
also compresses existing data
– A dictionary will be built automatically when a table reaches a
threshold size (about 2 MB). This applies to SQL INSERT as well as
IMPORT and LOAD (DB2 9.5).
• Compression is intended to:
– Reduce disk storage requirements
– Reduce I/Os for scanning tables
– Reduce buffer pool memory for storing data
• Compression requires DB2 Storage Optimization feature
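A minimal sketch (the table name is illustrative):

db2 "ALTER TABLE SALES.HISTORY COMPRESS YES"
db2 "REORG TABLE SALES.HISTORY RESETDICTIONARY"

REORG with RESETDICTIONARY (re)builds the compression dictionary and compresses the existing rows.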
© Copyright IBM Corporation 2010
Additional compression features added in DB2 9.7
• Multiple algorithms for automatic index compression

• Automatic compression for system and user


temporary tables

• Intelligent compression of large objects and XML

© Copyright IBM Corporation 2010


db2look utility
• DDL and statistics extraction tool, used to
capture table definitions and generate the
corresponding DDL.
• In addition to capturing the DDL for a set of tables, db2look can
create a test system that mimics a production system by
generating the following things:
– The UPDATE statements used to replicate the
statistics on the objects
– The UPDATE DB CFG and DBM CFG statements
for replicating configuration parameter settings
– The db2set statements for replicating registry
settings
– Statements for creating partition groups, buffer
pools, table spaces, and so on
© Copyright IBM Corporation 2010
db2look examples
• To capture all of the DDL for a database (includes all tables, views,
RI, constraints, triggers, and so on):
db2look -d proddb -e -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

• To capture the DDL for one table in particular (table1 in this


example):
db2look -d proddb -e -t table1 -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

• To capture the DDL for all of the tables belonging to a particular


schema (db2user in this example):
db2look -d proddb -e -z db2user -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql

© Copyright IBM Corporation 2010


What is XML? (1 of 2)
• XML = eXtensible Markup Language
• XML is self-describing data
• Here is some data from an ordinary delimited flat file:
47; John Doe; 58; Peter Pan; Database systems; 29; SQL;
relational

• What does this data mean?


<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

XML: describes data. HTML: describes display.
© Copyright IBM Corporation 2010
What is XML? (2 of 2)

<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

(Slide callouts label the parts: start tags and end tags such as <authors> and </authors>, attributes such as id="47", elements such as <price>29</price>, and the data between the tags.)

© Copyright IBM Corporation 2010


The XML document tree
<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>

(Diagram: XML parsing produces a node tree rooted at book, with children authors, title, price, and keywords; authors holds two author nodes with id attributes, and keywords holds two keyword nodes.)

Each node has a path:
/book
/book/authors
/book/authors/author
/book/authors/author/@id
(...)
© Copyright IBM Corporation 2010
Integration of XML and relational capabilities
• Applications combine XML and relational data
• Native XML data type (server and client side)
• XML capabilities in all DB2 components
(Diagram: a customer client application uses the DB2 client to send SQL/XML or XQuery to the DB2 server; the DB2 engine exposes a relational interface and an XML interface over native relational and XML storage.)

© Copyright IBM Corporation 2010


Two ways to query XML data
• Given the following DDL used to create the table:
CREATE TABLE CUSTOMER (
CID BIGINT NOT NULL,
INFO XML,
HISTORY XML )
IN IBMDB2SAMPLEXML;

• Using XQuery as the primary language


XQUERY db2-fn:xmlcolumn("CUSTOMER.INFO")
– Returns all Documents in column INFO

• Using SQL as the primary language


Select Pid from PRODUCT xmlp
WHERE XMLExists('declare default element namespace
"http://posample.org";$p/product/description[price
< 100]' passing xmlp.description as "p");

© Copyright IBM Corporation 2010


For more information (1 of 2)
• To learn more about DB2 and XML,
including the use of xQuery and
SQL/XML:

• Course CL131 – 5 days


Query and Manage XML Data with
DB2 9

• Course CL121 – 3 days


Query XML data with DB2 9
• Documentation:
pureXML Guide (SC27-2465)
DB2 XQuery Reference (SC27-2466)

© Copyright IBM Corporation 2010


For more information (2 of 2)
Some of our other courses:
• Course CE120 – 2 days
DB2 SQL Workshop
• Course CF18 – 4 days
Relational Database Design
• Course CE130 – 2.5 days
DB2 SQL Workshop for Experienced
Users
• Course DW11 – 4.5 days
Building the Data Warehouse

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Exploring the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Moving data

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending

• Use the SET INTEGRITY command


• Discuss the db2move and db2look commands

© Copyright IBM Corporation 2010


Review INSERT statement
• The SQL INSERT statement can insert one or more rows of
data into tables, nicknames, or views:
– SQL overhead is imposed, such as obeying RI, or Check
Constraints, or Uniqueness, or executing triggers.
– As INSERTs occur, the activity is also stored in logs

• The SQL INSERT might not be the best or fastest method to


load massive amounts of data into a database
Example inserts:

insert into artists (artno, name, classification)
values (100, 'Patti & Cart Wheels', 'S') ;

and

insert into emptemp select * from employee ;

© Copyright IBM Corporation 2010


EXPORT/IMPORT overview

• Must be connected prior to call

Workstation file formats: non-delimited ASCII (ASC), delimited ASCII (DEL), WSF, and PC/IXF

(Diagram: IMPORT reads these files into DB2 for Linux, UNIX, and Windows or DRDA databases, writing log records as it inserts; EXPORT writes table data out to these files.)

© Copyright IBM Corporation 2010


EXPORT data wizard

© Copyright IBM Corporation 2010


EXPORT command syntax (Basic)

EXPORT TO filename OF filetype
[LOBS TO lob-path, ...] [LOBFILE filename, ...]
[MODIFIED BY filetype-mod ...]
[MESSAGES message-file]
select-statement

© Copyright IBM Corporation 2010


EXPORT command example

• Exports data from database tables to file


• Check message for error or warning messages

DB2
MUSICDB
EXPORT
artexprt
artists

db2 connect to musicdb


db2 export to artexprt
of ixf messages artmsg
select artno, name
from artists
© Copyright IBM Corporation 2010
IMPORT data wizard

© Copyright IBM Corporation 2010


IMPORT command syntax (Basic)
IMPORT FROM filename OF filetype
[LOBS FROM lob-path, ...]
[ALLOW NO ACCESS | ALLOW WRITE ACCESS]
[MODIFIED BY filetype-mod ...]
[COMMITCOUNT n | COMMITCOUNT AUTOMATIC] [RESTARTCOUNT n] [MESSAGES message-file]
{INSERT | INSERT_UPDATE | REPLACE | REPLACE_CREATE} INTO table-name [( insert-column, ... )]
| CREATE INTO table-name [( insert-column, ... )] [|tblspace-specs|]

tblspace-specs:
IN tablespace-name [INDEX IN tablespace-name] [LONG IN tablespace-name]
© Copyright IBM Corporation 2010
IMPORT command example
• Imports data from a file to a database table
db2 connect to musicdb
db2 import from artexprt of ixf messages artmsg create into artists
(in place of create, the action may be: insert, insert_update, replace,
replace_create)

• Imports data from a file into three DMS table spaces:

db2 connect to musicdb
db2 import from artexprt of ixf messages artmsg
    create into artists [(column_list)]
    in <tablespace> index in <indextablespace> long in <largetablespace>

[Figure: IMPORT reads the file artexprt and inserts into the artists table
in the MUSICDB database]
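
When the target table already exists with a primary key, INSERT_UPDATE
inserts new rows and updates matching ones; a minimal sketch, with an
illustrative commit interval:

db2 import from artexprt of ixf commitcount 1000 messages artmsg
    insert_update into artists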
© Copyright IBM Corporation 2010
Differences between IMPORT and LOAD

IMPORT:
• Slow when moving large amounts of data
• Creation of table/index supported with IXF format
• Import into tables, views, aliases
• Option to ALLOW WRITE ACCESS
• All rows logged
• Triggers fired, constraints enforced
• Inserts can use space freed by deleted rows

LOAD:
• Faster for large amounts of data
• Tables and indexes must exist
• Load into tables only
• Existing data can still be seen by read applications
• Minimal logging; can make copy
• Triggers not fired; unique constraints enforced, RI and check constraints
  via SET INTEGRITY
• LOAD builds new extents
© Copyright IBM Corporation 2010
Four phases of Load

1. LOAD
   Load data into tables
   Collect index keys / sort
   Consistency points at SAVECOUNT
   Invalid data rows in dump file; messages in message file

2. BUILD
   Indexes created or updated

3. DELETE
   Unique key violations placed in exception table
   Messages generated for unique key violations
   Deletes unique key violation rows

4. INDEX COPY
   Copy indexes from temporary table space to index table space

© Copyright IBM Corporation 2010


LOAD: Load phase
• During the LOAD phase, data is stored in a table and index
keys are collected
• Save points are established at intervals
• Messages indicate the number of input rows successfully
loaded
• If a failure occurs in this phase, use the RESTART option for
LOAD to restart from the last successful consistency point
[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Build phase
• During the BUILD phase, indexes are created based on the
index keys collected in the Load phase
• The index keys are sorted during the Load phase
• If a failure occurs during this phase, LOAD restarts from the
BUILD phase

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Delete phase
• During the DELETE phase, all rows that have violated a unique constraint
are deleted
• If a failure occurs, LOAD restarts from the DELETE phase
• Once the database indexes are rebuilt, information about the rows
containing the invalid keys is placed in an exception table, which must be
created before the LOAD
• Finally, any duplicate keys found are deleted

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
LOAD: Index Copy phase
• The index data is copied from a system temporary table
space to the original table space.
• This will only occur if a system temporary table space was
specified for index creation during a load operation with the
READ ACCESS option specified.

[Figure: phase sequence: Load, Build, Delete, Index Copy]
© Copyright IBM Corporation 2010
Load data wizard

© Copyright IBM Corporation 2010


LOAD command syntax (Basic)

LOAD [CLIENT] FROM { file | pipe | device | cursorname } ,...
  OF { ASC | DEL | IXF | CURSOR }
  [MODIFIED BY filetype-mod ...]
  [SAVECOUNT n] [ROWCOUNT n] [WARNINGCOUNT n] [MESSAGES msg-file]
  { INSERT | REPLACE | RESTART | TERMINATE }
      INTO table-name [( insert-column ,... )]
  [FOR EXCEPTION table-name]
  [statistics options] [copy options]
  [ALLOW NO ACCESS | ALLOW READ ACCESS [USE tablespace-name]]
  [set integrity pending options] [LOCK WITH FORCE]

© Copyright IBM Corporation 2010


LOAD scenario

[Figure: the delimited input file calpar.del is loaded into table cal.par,
which has a primary key (unique index) and a not null numeric column;
duplicate-key rows are routed to the exception table cal.parexp, rejected
rows go to dump.fil, and messages go to par.msgs]

Steps:
create tables/indexes
obtain delimited input file in sorted format
create exception table

db2 load from calpar.del of del
    modified by dumpfile=<path>/dump.fil
    warningcount 100 messages par.msgs
    insert into cal.par for exception cal.parexp
db2 load query table cal.par

examine par.msgs file
examine cal.parexp exception table

© Copyright IBM Corporation 2010


Rules and methods for creating Exception Tables
• The first n columns are the same as the base table
• No constraints and no trigger definitions
• Column n + 1: TIMESTAMP
• Column n + 2: CLOB (32 KB)
• The user requires INSERT privilege

CREATE TABLE T1EXC LIKE T1;

ALTER TABLE T1EXC
  ADD COLUMN TS TIMESTAMP
  ADD COLUMN MSG CLOB(32K);

or

CREATE TABLE T1EXC AS
  (SELECT T1.*,
          CURRENT TIMESTAMP AS TS,
          CLOB('', 32767) AS MSG
     FROM T1)
  DEFINITION ONLY;

© Copyright IBM Corporation 2010


Offline versus Online Load
• ALLOW NO ACCESS (offline):
  – The load requests and is granted an exclusive lock; applications have
    read/write access before the lock is granted and again after the load
    commits, but no access while the load runs

• ALLOW READ ACCESS (online):
  – The load first drains existing writers, then allows read access for most
    of the load; a super-exclusive lock is requested and held briefly just
    before the load commits

[Figure: timelines contrasting the locking behavior of ALLOW NO ACCESS and
ALLOW READ ACCESS loads]

© Copyright IBM Corporation 2010


Table states (Load Pending, Set Integrity Pending)
• LOAD QUERY TABLE <table-name>
• Tablestate:
Normal
Set Integrity Pending
Load in Progress
Load Pending
Reorg Pending
Read Access Only
Unavailable
Not Load Restartable
Unknown
• Table can be in several states at same time
Tablestate:
Set Integrity Pending
Load in Progress
Read Access Only
© Copyright IBM Corporation 2010
Checking Load status: Load query
db2 load query table prod.table1
SQL3501W The table space(s) in which the table resides will not be placed in
backup pending state since forward recovery is disabled for the database.

SQL3109N The utility is beginning to load data from file
"C:\cf45\reorg\savehist.del".

SQL3500W The utility is beginning the "LOAD" phase at time "03/29/2005
21:30:24.468073".

SQL3519W Begin Load Consistency Point. Input record count = "0".
SQL3520W Load Consistency Point was successful.
........................
SQL3519W Begin Load Consistency Point. Input record count = "48314".
SQL3520W Load Consistency Point was successful.

SQL0289N Unable to allocate new pages in table space "LOADTSPD".
SQLSTATE=57011

SQL3532I The Load utility is currently in the "LOAD" phase.

Number of rows read = 48314


Number of rows skipped = 0
Number of rows loaded = 48314
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 48314
Number of warnings = 0

Tablestate:
Load Pending

© Copyright IBM Corporation 2010


Load monitoring: LIST UTILITIES

db2 LIST UTILITIES SHOW DETAIL


ID = 2
Type = LOAD
Database Name = MUSICDB
Partition Number = 0
Description = ONLINE LOAD DEL AUTOMATIC INDEXING INSERT NON
-RECOVERABLE ADMIN.LOA
Start Time = 03/30/2005 15:57:42.354567
Progress Monitoring:
Phase Number = 1
Description = SETUP
Total Work = 0 bytes
Completed Work = 0 bytes
Start Time = 03/30/2005 15:57:42.354573
Phase Number = 2
Description = LOAD
Total Work = 10000 rows
Completed Work = 10000 rows
Start Time = 03/30/2005 15:57:50.669476
Phase Number [Current] = 3
Description = BUILD
Total Work = 2 indexes
Completed Work = 2 indexes
Start Time = 03/30/2005 15:57:51.182767

© Copyright IBM Corporation 2010


Load Pending state:
Recovering from LOAD failure
• Restart the Load:
  – Check the messages file
  – Use the RESTART option
  – The load operation automatically continues from the last consistency
    point in the Load or Build phase
    • or the Delete phase if ALLOW NO ACCESS

• Replace the whole table:
  – LOAD ... REPLACE

• Terminate the Load:
  – If LOAD ... INSERT, returns the table to the state preceding the load
  – If LOAD ... REPLACE, the table will be truncated to an empty state
    • Create a backup copy prior to the load to return to the state
      preceding the load

• Do not tamper with Load temporary files
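
Minimal sketches of the restart and terminate options, reusing the
illustrative names from the earlier scenario (calpar.del, cal.par, par.msgs):

db2 load from calpar.del of del messages par.msgs restart into cal.par
db2 load from calpar.del of del terminate into cal.par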


© Copyright IBM Corporation 2010
Backup Pending state: COPY options
• When loading data and forward recovery is enabled:
– COPY NO (default)
• During load, Backup pending and Load in progress
• After load, Backup Pending
– COPY YES
• Load has made Copy not Backup pending
– NONRECOVERABLE
• No copy made and no backup required

Copy image naming (UNIX and Windows):
Databasealias.Type.Instancename.Nodename.Catnodename.Timestamp.number

Type: 0 = Full Backup
      3 = Table Space Backup
      4 = Copy from Table Load
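
A minimal sketch of a recoverable load that writes its own copy image (the
path /backup/copies is illustrative):

db2 load from calpar.del of del messages par.msgs
    insert into cal.par copy yes to /backup/copies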

© Copyright IBM Corporation 2010


Set Integrity Pending table state
• Load turns OFF constraint checking:
– Leaves table in Set Integrity Pending state
– If parent table is in Set Integrity Pending, then dependents may also
be in Set Integrity Pending
– LOAD INSERT, ALLOW READ ACCESS
• Loaded table in Set Integrity Pending with read access
– LOAD INSERT, ALLOW NO ACCESS
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
DEFERRED
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
IMMEDIATE
• Loaded table and descendant foreign key tables are in Set Integrity
Pending with no access

© Copyright IBM Corporation 2010


SET INTEGRITY command syntax (Basic)
SET INTEGRITY
  FOR table-name ,... OFF
| FOR table-name ,... IMMEDIATE CHECKED [exception-clause]
| FOR table-name ,... { ALL | FOREIGN KEY | CHECK } IMMEDIATE UNCHECKED

exception-clause:
  FOR EXCEPTION { IN table-name USE table-name } ,...

© Copyright IBM Corporation 2010


Set Integrity Pending state

[Figure: before SET INTEGRITY, SYSCAT.TABLES shows cal.par and cal.for with
STATUS = C (check pending) and read-only ACCESS_MODE; after, rows violating
the primary key of cal.par or the foreign key of cal.for are moved to the
exception tables cal.parexp and cal.forexp, each with a timestamp and
message column]

db2 SET INTEGRITY FOR cal.par, cal.for IMMEDIATE CHECKED
    FOR EXCEPTION IN cal.par USE cal.parexp,
                  IN cal.for USE cal.forexp

© Copyright IBM Corporation 2010


Meaningful steps for LOAD
• Create tables and indexes
• Create exception tables
• Sort data
• Back up TS/DB (if using REPLACE)
• Consider freespace
• Load .... for Exception ... Savecount ... Warningcount...
• Examine xx.msg and dumpfile (after LOAD completes)
• Examine exception tables (after LOAD completes)
• Back up table space if log retain=recovery and COPY NO
• Set Integrity for .... (only if table in Set Integrity Pending state)
• Update statistics (if necessary)

© Copyright IBM Corporation 2010


db2move utility
• Facilitates the moving/copying of large numbers of tables
between databases
• Can be used with db2look to move/copy a database between
different platforms (for example, AIX to Windows)
• Usage: db2move <dbName> EXPORT/IMPORT/LOAD/COPY [options]
• EXPORT: The system catalogs are queried, a list of tables is
compiled (based on the options), and the tables are exported in
IXF format -- additionally, a file called db2move.lst is created
• IMPORT: The db2move.lst file is used to import the IXF files
created in the EXPORT step
• LOAD: The db2move.lst file is used to load the PC/IXF data files
created in the EXPORT step
• COPY: Duplicates schema(s) into a target database
© Copyright IBM Corporation 2010
db2move options: Export/Import/Load
• EXPORT:
  – -tc table creator list
  – -tn table name list
  – -sn schema name list
  – -ts table space list
  – -tf file contains list of tables
  db2move sample export -tc admin -tn EXPL*

• IMPORT:
  – -io defaults to REPLACE_CREATE
    (CREATE, INSERT, INSERT_UPDATE, REPLACE)
  db2move sample import -io replace

• LOAD:
  – -lo defaults to INSERT
    (REPLACE)
  db2move sample load -lo insert

Sample db2move.lst:
!"ADMIN"."EXPLAIN_INSTANCE"!tab1.ixf!tab1.msg!
!"ADMIN"."EXPLAIN_STATEMENT"!tab2.ixf!tab2.msg!
!"ADMIN"."EXPLAIN_ARGUMENT"!tab3.ixf!tab3.msg!
!"ADMIN"."EXPLAIN_OBJECT"!tab4.ixf!tab4.msg!
!"ADMIN"."EXPLAIN_OPERATOR"!tab5.ixf!tab5.msg!
!"ADMIN"."EXPLAIN_PREDICATE"!tab6.ixf!tab6.msg!
!"ADMIN"."EXPLAIN_STREAM"!tab7.ixf!tab7.msg!

© Copyright IBM Corporation 2010


db2move COPY option
Copy one or more schemas between DB2 databases
Uses a -co option to specify:
• Target Database:
"TARGET_DB <db name> [USER <userid> USING <password>]"
• MODE:
– DDL_AND_LOAD - Creates all supported objects from the source schema and
  populates the tables with the source table data (default)
– DDL_ONLY - Creates all supported objects from the source schema, but does
  not repopulate the tables
– LOAD_ONLY - Loads all specified tables from the source database to the
  target database; the tables must already exist on the target
• SCHEMA_MAP: Allows user to rename schema when copying to target
• TABLESPACE_MAP: Table space name mappings to be used
• Load Utility option: COPY NO or Nonrecoverable
• Owner: Change the owner of each new object created in the target
schema

© Copyright IBM Corporation 2010


db2move COPY schema examples
• To duplicate schema schema1 from source database dbsrc to target database
dbtgt, issue:

  db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
      USER myuser1 USING mypass1

• To duplicate schema schema1 from source database dbsrc to target database
dbtgt, rename the schema to newschema1 on the target, and map source table
space ts1 to ts2 on the target, issue:

  db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
      USER myuser1 USING mypass1
      SCHEMA_MAP ((schema1,newschema1))
      TABLESPACE_MAP ((ts1,ts2), SYS_ANY)

Output files generated (timestamped): COPYSCHEMA.msg, COPYSCHEMA.err,
LOADTABLE.msg, LOADTABLE.err
© Copyright IBM Corporation 2010
Online Table Move stored procedure
• The ADMIN_MOVE_TABLE procedure available with DB2 9.7
is designed to move data from a source table to a target table
with a minimal impact to application access
– Changes that can be made using ADMIN_MOVE_TABLE:
• New Data, Index or Long table spaces, which could have a different
page size, extent size or type of table space management (like
moving from SMS to Automatic Storage)
• Data compression could be implemented during the move
• MDC clustering can be added or changed
• Range partitions can be added or changed
• Distribution keys can be changed for DPF tables
• Columns can be added, removed or changed
– Multiple phased processing allows write access to the source table
except for a short outage required to swap access to the target table

© Copyright IBM Corporation 2010


ADMIN_MOVE_TABLE: Processing phases
[Figure: ADMIN_MOVE_TABLE processing]
1. INIT phase: creates the triggers on the source table, the target table,
   and a staging table; protocol entries are recorded in
   SYSTOOLS.ADMIN_MOVE_TABLE (tabschema, tabname, key, value)
2. COPY phase: rows are copied from the source table to the target table
3. REPLAY phase: keys of rows changed by the online workload (INSERT,
   UPDATE, and DELETE activity captured via the triggers into the staging
   table) are used to re-copy those rows from the source table
4. SWAP phase: the target table is renamed to replace the source table
© Copyright IBM Corporation 2010
ADMIN_MOVE_TABLE procedure methods
• There are two methods of calling ADMIN_MOVE_TABLE:

One method specifies how to define the target table:

ADMIN_MOVE_TABLE( tabschema, tabname,
                  data_tbsp, index_tbsp, lob_tbsp,
                  mdc_cols, partkey_cols, data_part, coldef,
                  options, ..., operation )

The second method allows a predefined table to be specified as the target
for the move:

ADMIN_MOVE_TABLE( tabschema, tabname,
                  target_tabname, options, ..., operation )
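
A minimal sketch of the second method, assuming a precreated target table
MDC.HIST1_NEW (the names are illustrative):

call SYSPROC.ADMIN_MOVE_TABLE('MDC', 'HIST1', 'HIST1_NEW', 'KEEP', 'MOVE')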
© Copyright IBM Corporation 2010
Example: Move a table to new table space
call SYSPROC.ADMIN_MOVE_TABLE
  ( 'MDC', 'HIST1', 'MDCTSP2', 'MDCTSPI', 'MDCTSP2', NULL, NULL, NULL, NULL,
    'KEEP, REORG', 'MOVE' )
('KEEP, REORG': the source table is kept and the target is reorganized)

KEY VALUE
-------------------------------- --------------------------------------------------
AUTHID INST461
CLEANUP_END 2009-06-01-10.57.55.282196
CLEANUP_START 2009-06-01-10.57.55.050882
COPY_END 2009-06-01-10.57.36.243768
COPY_INDEXNAME HIST1IX1
COPY_INDEXSCHEMA MDC
COPY_OPTS ARRAY_INSERT,CLUSTER_OVER_INDEX
COPY_START 2009-06-01-10.57.25.132774
COPY_TOTAL_ROWS 100000
INDEXNAME MDC.HIST1IX3
INDEXSCHEMA INST461
INDEX_CREATION_TOTAL_TIME 7
INIT_END 2009-06-01-10.57.24.880488
INIT_START 2009-06-01-10.57.21.220930
ORIGINAL HIST1AATOhxo
REPLAY_END 2009-06-01-10.57.53.870351
REPLAY_START 2009-06-01-10.57.36.244234
REPLAY_TOTAL_ROWS 0
REPLAY_TOTAL_TIME 0
STATUS COMPLETE
SWAP_END 2009-06-01-10.57.54.995848
SWAP_RETRIES 0
SWAP_START 2009-06-01-10.57.54.255941
VERSION 09.07.0000
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending

• Use the SET INTEGRITY command


• Discuss the db2move and db2look commands

© Copyright IBM Corporation 2010


Student exercise

© Copyright IBM Corporation 2010


Backup and recovery

© Copyright IBM Corporation 2010


Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Types of failure

• Media
• Operational
• Hardware
• Software
• Power
© Copyright IBM Corporation 2010
Database objects
DB2 recovery is based on restoring at the database or table space level.

[Figure: a database containing the table spaces SYSCATSPACE, USERSPACE1,
and TBSP1; each table space holds tables and indexes and is stored in one
or more containers]

© Copyright IBM Corporation 2010


DB2 Database recovery methods
[Figure: an offline database-level backup is taken at 1 AM; DB2 log files
(Log 1 through Log 6) record all subsequent changes to the database and its
table spaces]

1: Crash Recovery – the active logs are replayed after a failure to bring
   the database back to a consistent state
2: Version Recovery – a RESTORE returns the database to the state captured
   in the backup image
3: Roll Forward Recovery – after a restore, the logs are reapplied to bring
   the database forward (for example, to 3 PM or to the end of the logs)

© Copyright IBM Corporation 2010


Introduction to logging

[Figure: requests for reads and writes operate on pages in the buffer pool
(TABLES); an Insert logs the new row, an Update logs the old and new row,
and a Delete logs the old row into the log buffer, which is externalized to
the LOGS on commit or when the log buffer is full]

© Copyright IBM Corporation 2010


Circular logging (Non-recoverable database)

"n" PRIMARY 2 1 "n"

SECONDARY

© Copyright IBM Corporation 2010


Archival logging (Recoverable database)

Manual or automatic archive.

• ACTIVE — Contains information for non-committed or non-externalized
  transactions
• ONLINE ARCHIVE — Contains information for committed and externalized
  transactions; stored in the ACTIVE log subdirectory
• OFFLINE ARCHIVE — Archive moved from the ACTIVE log subdirectory (may
  also be on other media)

© Copyright IBM Corporation 2010


Log file information

[Figure: timeline of log files: online archive logs, followed by the active
log files, followed by unused log files; the loghead pointer marks the
start point for crash recovery]

SQLLOGCTL.LFH (1 and 2) contains:
— Log configuration
— Log files to be archived
— Active logs
— Start point for crash recovery
© Copyright IBM Corporation 2010
Location of log files

[Figure: by default the active logs (S~0.LOG, S~1.LOG, ...) reside in
SQLOGDIR under the database directory (Instance/NODE0000/SQL0000n);
NEWLOGPATH relocates them to a path of your choice, ideally on a different
disk drive and potentially a different disk controller unit;
MIRRORLOGPATH maintains a second copy of the logs on another path]

© Copyright IBM Corporation 2010


Configure database logs: Parameters

• Primary log files (LOGPRIMARY)
• Secondary log files (LOGSECOND)
• Log size (LOGFILSIZ)
• Log buffer (LOGBUFSZ)
• New log path (NEWLOGPATH)
• Mirror log path (MIRRORLOGPATH)
• Primary log archive method (LOGARCHMETH1)
• Secondary log archive method (LOGARCHMETH2)
• Number of commits to group (MINCOMMIT)
• Log records to write before soft checkpoint (SOFTMAX)

© Copyright IBM Corporation 2010


Configure database logs: GUI (Control Center)

© Copyright IBM Corporation 2010


Configure database logs: Wizard (CC)

© Copyright IBM Corporation 2010


Configure database logs: Command line
• Windows: db2 get db cfg for musicdb | find /i "log"
• Linux/UNIX: db2 get db cfg for musicdb | grep -i log
• Update: db2 update db cfg for musicdb using logprimary 15
Log retain for recovery status = NO
User exit for logging status = NO
Catalog cache size (4KB) (CATALOGCACHE_SZ) = 260
Log buffer size (4KB) (LOGBUFSZ) = 98
Log file size (4KB) (LOGFILSIZ) = 1024
Number of primary log files (LOGPRIMARY) = 13
Number of secondary log files (LOGSECOND) = 4
Changed path to log files (NEWLOGPATH) =
Path to log files = C:\DB2\NODE0000\SQL00003\SQLOGDIR\
Overflow log path (OVERFLOWLOGPATH) =
Mirror log path (MIRRORLOGPATH) =
First active log file =
Block log on disk full (BLK_LOG_DSK_FUL) = NO
Percent max primary log space by transaction (MAX_LOG) = 0
Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0
Percent log file reclaimed before soft chckpt (SOFTMAX) = 520
Log retain for recovery enabled (LOGRETAIN) = OFF
User exit for logging enabled (USEREXIT) = OFF
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
First log archive method (LOGARCHMETH1) = OFF
Options for logarchmeth1 (LOGARCHOPT1) =
Second log archive method (LOGARCHMETH2) = OFF
Options for logarchmeth2 (LOGARCHOPT2) =
Failover log archive path (FAILARCHPATH) =
Number of log archive retries on error (NUMARCHRETRY) = 5
Log archive retry Delay (secs) (ARCHRETRYDELAY) = 20
Log pages during index build (LOGINDEXBUILD) = OFF
© Copyright IBM Corporation 2010
DB2 recovery-related system files
• Recovery History File - db2rhist.asc:
  – Created during the CREATE DATABASE command
  – Updated by DB2 utility execution:
    • Back up database or table space
    • Restore database or table space
    • Roll forward database or table space
    • Load table
    • Reorg table
    • DB2 log file archival
    • Create/Alter/Rename table space
    • Quiesce table spaces for table
    • Drop table (optional)
  – Included on each DB2 backup file
  – db2 LIST HISTORY command

• Log Control File - SQLOGCTL.LFH.1 and SQLOGCTL.LFH.2:
  – Used during crash recovery
  – Disk updated at end of each log file
  – Use a SOFTMAX value < 100 (% of LOGFILSIZ) to refresh pointers more often
  – Included at end of each DB2 backup

[Figure: db2rhist.asc, SQLOGCTL.LFH.1, and SQLOGCTL.LFH.2 reside in the
database directory, Instance/NODE0000/SQL0000n]
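
The history file can be queried from the command line; for example, to list
the backup entries for the MUSICDB database:

db2 connect to musicdb
db2 list history backup all for musicdb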
© Copyright IBM Corporation 2010
BACKUP command

BACKUP DATABASE MUSICDB

[Figure: a user with SYSADM, SYSCTRL, or SYSMAINT authority backs up the
local or remote MUSICDB database to disk, tape, TSM, XBSA, or a vendor
backup shared library; the buffer size and number of buffers are tunable]

© Copyright IBM Corporation 2010


Backup Utility options

BACKUP DATABASE database-alias [USER username [USING password]]
  [TABLESPACE (tblspace-name [ {,tblspace-name} ... ])] [ONLINE]
  [INCREMENTAL [DELTA]]
  [USE {TSM | XBSA} [OPEN num-sess SESSIONS]
     [OPTIONS {options-string | options-filename}]
   | TO dir/dev [ {,dir/dev} ... ]
   | LOAD lib-name [OPEN num-sess SESSIONS]
     [OPTIONS {options-string | options-filename}]]
  [WITH num-buff BUFFERS] [BUFFER buffer-size] [PARALLELISM n]
  [COMPRESS [COMPRLIB lib-name [EXCLUDE]] [COMPROPTS options-string]]
  [UTIL_IMPACT_PRIORITY [priority]]
  [{INCLUDE | EXCLUDE} LOGS] [WITHOUT PROMPTING]
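
Minimal sketches of an offline and an online backup (the target path
/backup is illustrative):

db2 backup db musicdb to /backup
db2 backup db musicdb online to /backup compress include logs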
© Copyright IBM Corporation 2010
BACKUP using the GUI interface

© Copyright IBM Corporation 2010


Online versus Offline Backup

© Copyright IBM Corporation 2010


The backup files
• File name for backup images on disk has:
– Database alias
– Type of backup (0=full, 3=table space, 4=copy from table load)
– Instance name
– Node number
– Catalog node number
– Timestamp of backup (YYYYMMDDHHMMSS)
– Sequence number (for multiple files)
MUSICDB.0.DB2.NODE0000.CATN0000.20071013150616.001

• Tape images are not named, but internally contain the same
information in the backup header for verification purposes
• Backup history provides key information in easy-to-use format

© Copyright IBM Corporation 2010


RESTORE command

RESTORE DATABASE MUSICDB

[Figure: a user with SYSADM, SYSCTRL, or SYSMAINT authority restores the
local or remote MUSICDB database from disk, tape, TSM, XBSA, or a vendor
backup shared library; the buffer size and number of buffers are tunable]

© Copyright IBM Corporation 2010


Syntax of the RESTORE command
RESTORE {DATABASE | DB} source-database-alias
  [restore-options | CONTINUE | ABORT]

Restore options:
  [USER username [USING password]]
  [TABLESPACE [ONLINE]
   | TABLESPACE ( tablespace-name ,... ) [ONLINE]
   | HISTORY FILE [ONLINE]]
  [USE TSM [OPEN num-sessions SESSIONS]
   | FROM { directory | device } ,...
   | LOAD shared-library [OPEN num-sessions SESSIONS]]
  [TAKEN AT date-time]
  [TO target-directory] [INTO target-database-alias] [NEWLOGPATH directory]
  [WITH num-buffers BUFFERS] [BUFFER buffer-size] [REPLACE EXISTING] [REDIRECT]
  [WITHOUT ROLLING FORWARD] [WITHOUT PROMPTING] ...
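
A minimal sketch restoring a specific backup image into a new database (the
timestamp and path are illustrative):

db2 restore db musicdb from /backup taken at 20071013150616 into musicdb2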

© Copyright IBM Corporation 2010


RESTORE using the GUI interface

© Copyright IBM Corporation 2010


Backup/restore table space considerations
• Roll-forward must be enabled
• Can choose to restore a subset of table spaces
• Generally best to put multiple table spaces in one backup image:
  – Makes the table space recovery strategy easier
  – Provides access to related table spaces and coherent management of
    these table spaces

• Handling of long/LOB/XML data requires a correlated strategy


• Point-in-time recovery is supported, but has requirements
• Faster recovery for Catalogs using Tablespace Level backup
• Critical business application tables should obviously be the
focus of the backup/restore, but other tables are needed in
support of these tables
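
A minimal sketch of an online table space backup (the names are
illustrative):

db2 backup db musicdb tablespace (tbsp1, tbsp2) online to /backup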
© Copyright IBM Corporation 2010
Roll forward pending state
• Roll forward pending is set as a result of:
– Restore of offline database backup omitting the command option
WITHOUT ROLLING FORWARD
– Restore of an online database backup
– Restore of any table space level backup
– DB2 detects media failure isolated at a table space

• Scope of pending state managed by DB2:


– Database in pending state will not permit any activity
– Table spaces in pending state will permit access to other table
spaces

© Copyright IBM Corporation 2010


Syntax of the ROLLFORWARD command
ROLLFORWARD {DATABASE | DB} database-alias
  [ TO { isotime [ON ALL DBPARTITIONNUMS] [USING UTC TIME | USING LOCAL TIME]
       | END OF BACKUP [ON ALL DBPARTITIONNUMS]
       | END OF LOGS [On Database Partition clause] }
      [AND COMPLETE | AND STOP]
  | {COMPLETE | STOP | CANCEL} [On Database Partition clause]
  | QUERY STATUS [USING UTC TIME | USING LOCAL TIME] ]
  [TABLESPACE ONLINE | TABLESPACE ( tablespace-name ,... ) [ONLINE]]
  [OVERFLOW LOG PATH ( log-directory [, Log Overflow clause] )]
  [NORETRIEVE]

© Copyright IBM Corporation 2010


ROLLFORWARD: GUI interface

© Copyright IBM Corporation 2010


ROLLFORWARD: How far?
• END OF LOGS: (Apply as many changes as possible):
– Rollforward will apply all available logs beginning with the logs associated with
the backup that was restored
– Archived logs will be retrieved unless NORETRIEVE is specified
• Point-in-time (PIT): (Apply changes up to a specified time):
– Specified in Coordinated Universal Time (UTC) via command
– Specified in local time on server with USING LOCAL TIME
– Specified in local time on the client via GUI interface
– Format: yyyy-mm-dd-hh.mm.ss.nnnnnn
• END OF BACKUP: (Apply as few changes as possible):
– Allows a Database to be recovered from an online database backup and to end
the ROLLFORWARD processing at the earliest point where the database is
consistent.
– Recovery history file (RHF) shows logs associated with online backups
• Table space point-in-time considerations:
  – A minimum roll forward time is maintained for each table space – requires
    rolling forward at least to the last DDL change (create, alter, drop) in
    the table space
  – Table spaces are placed in backup pending when the roll forward
    completes, to ensure future recoverability
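
Minimal sketches of the three targets (the timestamp is illustrative):

db2 rollforward db musicdb to end of logs and stop
db2 rollforward db musicdb to 2010-06-01-10.00.00.000000 using local time and stop
db2 rollforward db musicdb to end of backup and stop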

© Copyright IBM Corporation 2010


DB2 RECOVER command

db2 recover database salesdb

– The RECOVER command performs both RESTORE and ROLLFORWARD command
  processing using recovery history file data
– Can use full or incremental database-level backup images
– Allows a partitioned (DPF) database to be recovered using a single command
– The database can be recovered to the end of the logs or to a point in time
– The RESTART option forces a failed RECOVER command to redo the RESTORE
  even if the previous restore completed
– The USING HISTORY FILE option allows disaster recovery with no existing
  database
– No table space level recovery options

[Figure: RECOVER locates the required backups and archived logs through the
recovery history file]
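
A minimal sketch of a point-in-time recovery (the timestamp is
illustrative):

db2 recover db salesdb to 2010-06-01-10.00.00 using local time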

© Copyright IBM Corporation 2010


Log file naming

Log files are named Sxxxxxxx.LOG, numbered S0000000.LOG through
S9999999.LOG.

[Figure: a sequence of log files S~0.LOG through S~4.LOG, with a BACKUP
taken partway through]

The naming sequence restarts:
• When roll-forward is enabled
• When S9999999.LOG is filled
• When roll-forward is disabled
• After RESTORE DATABASE without rolling forward
• After ROLLFORWARD DATABASE to some point in an earlier log file (for
  example, S~2.LOG)
© Copyright IBM Corporation 2010
Disaster recovery considerations
Should you use database recovery, table space recovery, or a
combination?
• Database level backups:
– Offline backup does not require a roll forward
– Online backup will include the ‘required’ logs for a roll forward TO END OF
BACKUP
– Entire database can be restored on another system
– Reduces backup management

• Table space level backups:


– Can be used to supplement database backups
– A complete set of tablespace backups could be used to REBUILD a database
for disaster recovery
– Increases the number of backup images

• Using the REBUILD WITH option of RESTORE simplifies creating a full


or partial copy of a database using either database or table space
backup images

© Copyright IBM Corporation 2010


Questions to consider
• Will the log files and the database files be on separate physical
disks?
• What medium can best handle the volume of log data to be
generated?
• Should copies of log files, backups, or both be made and
managed?
• Can the OS support mirroring? Should a mirror log path be
defined?
• How far back should recovery be enabled?
• When will all log files and backups be discarded?
• Are paging space/swap files on the same device or drive as
DB2 logs?
• Have you backed up database system support files, such as
db2profile and db2nodes.cfg, and your shell scripts?
© Copyright IBM Corporation 2010
Logging/backup requirements summary

Consideration              CRASH                 VERSION         ROLLFORWARD
Logging                    Circular or Archive   Circular or     Archive
                                                 Archive
Backup                     N/A                   OFFLINE         OFFLINE or
                                                                 ONLINE
Resources                  Low                   Medium          High
Management                 Low                   Medium          High
Currency after media       Not Provided          Latest Backup   Last Committed
failure                                                          Work
Table Space Recovery       N/A                   NO              YES

© Copyright IBM Corporation 2010


DB2 Integrated Cluster Manager support
• HA Cluster Manager Integration:
– Coupling of DB2 and TSA (SA MP) on Linux, Solaris, and AIX.
– TSA can be installed and maintained with DB2 installation
procedures.
– DB2 interface to configure the cluster.
– DB2 maintains the cluster configuration (add node, add table space,
and so on).
– Exploitation of a new vendor independent layering (VIL),
providing support for any cluster manager.

• NO SCRIPTING REQUIRED!
– One set of embedded scripts that are used by all cluster managers.

• Automates HADR failover


– Exploit HA cluster manager integration to automate HADR Takeover
on Standby.

© Copyright IBM Corporation 2010


DB2: Cluster configuration simplified
• DB2 utility db2haicu:
– DB2 High Availability Instance Configuration Utility
– Sets up DB2 with the Cluster Manager
• For HADR or Shared Storage cluster
– Interactive or XML file driven interface
– Sets the DBM CFG option cluster_mgr

• DB2 communicates with the Cluster Manager:


– Keeps the Cluster Manager in sync with DB2 changes:
• CM needs to be aware of new or changed table spaces, containers,
and so on
• Avoids failover problems due to missing resources
– Keeps the XML cluster configuration file up to date
– Automates HADR failover

© Copyright IBM Corporation 2010


High Availability Disaster Recovery: HADR

[Figure: clients connect over the network to the primary DB2 database
server; a direct database-to-database TCP/IP link connects the primary copy
of the DB2 database to the standby copy on the standby DB2 database server]

HADR is supported on all editions of DB2 9.

For DB2 Express 9, HADR is part of the separately priced High Availability
Feature.

© Copyright IBM Corporation 2010


DB2 additional recovery facilities
• On-demand log archiving
• Infinite active logs
• Block transactions on log directory full
• Split mirror database copies (see the sketch after this list):
  – SET WRITE SUSPEND/RESUME commands
  – db2inidb command modes:
    • SNAPSHOT: Database copy for testing or reporting
    • STANDBY: Database copy to create a standby database for quick
      recovery
    • MIRROR: Use the split mirror database copy instead of RESTORE
• Incremental and delta database and table space backups
• Relocating a database or a table space:
– RESTORE UTILITY with REDIRECT option
– db2relocatedb command
• Full and partial database REBUILD support
• Integrated Cluster Failover support
• High Availability Disaster Recovery (HADR)
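
A minimal sketch of taking a split mirror snapshot (the storage-level copy
step depends on the disk subsystem):

db2 connect to musicdb
db2 set write suspend for database
(split the mirror or copy the database storage at the disk-subsystem level)
db2 set write resume for database

On the system holding the copy:
db2inidb musicdb as snapshot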
© Copyright IBM Corporation 2010
DB2 for LUW Advanced Database Recovery
CL491: DB2 for LUW Advanced Database Recovery (4 days
with Labs):
• Automated Recovery log management
• Dropped table recovery
• Table space recovery
• Incremental backup and restore
• Database crash recovery
• Database and table space relocation methods
• Additional recovery facilities
• Recovery history file
• Disaster recovery of DB2 databases
• DB2 high availability and split mirror support
• DB2 partitioned database recovery considerations
• DB2 high availability disaster recovery (HADR) implementation

© Copyright IBM Corporation 2010


Unit summary
Having completed this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Student exercise

© Copyright IBM Corporation 2010
