InfoSphereDataStageEssentials PDF
Webinar Series
Overview of Talk
IBM's InfoSphere platform
Function and core highlights of InfoSphere
DataStage
Architecture of InfoSphere DataStage
Working with InfoSphere DataStage
Key features of InfoSphere DataStage
IBM InfoSphere DataStage Essentials
IBM InfoSphere Components
IBM InfoSphere Integration
IBM InfoSphere Architecture
What is InfoSphere DataStage?
Part of the InfoSphere Information Server
suite:
Uses the shared metadata repository to integrate
with other components
Web-based operations console enables users to
view and analyze the runtime environment
Data integration tool:
Enables users to move and transform data
between operational, transactional, and
analytical target systems
Why InfoSphere DataStage?
Provides direct connectivity to enterprise
applications as sources or targets:
Ensures that the most relevant, complete, and accurate
data is integrated into your data integration project
Solves large-scale business problems:
Uses the parallel processing capabilities of multiprocessor
hardware platforms
Accelerates development time:
Use the hundreds of prebuilt transformation functions
Simplifies the process of data transformation
Transformation functions can be modified and reused
Decrease the overall cost of development
Increase the effectiveness in building, deploying, and
managing your data integration infrastructure
Core Highlights
InfoSphere DataStage v9.1 Highlights
Built on a massively parallel processing (MPP)
architecture
Overcome the common challenges associated
with “big” and “regular” data
Design once and deploy anywhere to improve
performance and provide greater integration agility
while keeping costs low
Prioritize and streamline tasks
Leverage a common enterprise-wide business
rules approach
Quickly process and react to business changes
Available for customers who require broader data
integration capabilities
Support for Big Data
Manage big data from new and emerging
sources
Built on a massively parallel processing (MPP)
architecture
Easily scales to meet the demanding data integration
requirements of a Hadoop environment
“Design-once, run anywhere” paradigm provides the
flexibility to easily move data integration tasks
between a single machine and a cluster of low-cost
commodity servers
Job design is isolated from the runtime environment:
Enables developers to focus on the business requirements at
hand rather than coding explicitly for a given architecture
Various Methods of ETL
Workload Management
InfoSphere DataStage supports policy-driven
control of system resources and prioritization
of different workload classes
Support for various policy-driven controls:
System-level job policies
System-level resource policies
Queue-based policies
Administrators can proactively manage the
distribution of resources and payloads
Useful in shared services environments where multiple teams share
a common hardware infrastructure
Get the best out of hardware
Prioritize mission-critical tasks
Throttle job activities when resource usage exceeds specified thresholds
Assess, assign and reassign priorities as new jobs are submitted
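The queue-based prioritization and throttling described above can be sketched in a few lines of Python. This is an illustration only and not the DataStage workload management interface: the `WorkloadQueue` class, the `max_running` threshold, and the job names are all hypothetical.

```python
import heapq
from itertools import count


class WorkloadQueue:
    """Toy priority queue with a concurrency cap (hypothetical model)."""

    def __init__(self, max_running):
        self.max_running = max_running  # throttle: cap on concurrent jobs
        self.running = []
        self._heap = []
        self._order = count()  # tie-breaker keeps submission order stable

    def submit(self, name, priority):
        # Lower number means higher priority.
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def dispatch(self):
        """Start queued jobs, highest priority first, until the cap is hit."""
        started = []
        while self._heap and len(self.running) < self.max_running:
            _, _, name = heapq.heappop(self._heap)
            self.running.append(name)
            started.append(name)
        return started

    def finish(self, name):
        self.running.remove(name)


queue = WorkloadQueue(max_running=2)
queue.submit("nightly_report", priority=5)
queue.submit("billing_load", priority=1)
queue.submit("adhoc_extract", priority=9)
first_wave = queue.dispatch()   # mission-critical job starts first
queue.finish("billing_load")
second_wave = queue.dispatch()  # the waiting job starts once a slot frees up
```

Priorities are reassessed on every call to `dispatch`, so a high-priority job submitted late still runs ahead of lower-priority jobs that are waiting.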
Empowering Business Users
InfoSphere DataStage integrates directly with
IBM Operational Decision Management
(formerly ILOG JRules)
Implementation of decision logic tightens the link
between business people and IT
Responsive business rules management tools
Rules can quickly be promoted to the operational
environment and consumed by InfoSphere Information
Server
Systems can react instantly to gain business advantage
Control of Operational Environment
InfoSphere DataStage enriches the level of
information and control available for the data
integration and data quality runtime
environment
Run jobs directly from a web-based operations
console
Combine with the enhanced ability to view full log
messages to adapt more quickly to new information
Prebuilt reports are included and can be deployed to
IBM Cognos BI
Performance Optimization
InfoSphere DataStage and QualityStage
Designer client is optimized to scale up,
accommodating even more users working in
shared services environments
Shows performance gains of 60% or more for some
of its most common operations
Data integration engine includes two new
enhancements:
Connector optimization - new DBMS connector boost
delivers job results faster
Improved Windows process model – now uses native
process forking APIs to increase scalability on Windows
A Closer Look
More detail
InfoSphere DataStage is the data integration
component of IBM InfoSphere Information
Server
Provides a graphical framework for developing the
jobs that move data from source systems to target
systems
Transformed data can be delivered to data
warehouses, data marts, and operational data stores,
real-time web services and messaging systems, and
other enterprise applications
Supports ETL and ELT patterns
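The difference between extract-transform-load (ETL) and extract-load-transform (ELT) is where the transformation runs. The sketch below illustrates this with Python's built-in `sqlite3` module standing in for the target system; the table names and sample rows are invented for the example and nothing here is DataStage-specific.

```python
import sqlite3

rows = [("alice", "10.5"), ("bob", "20.0")]


def etl(conn):
    # ETL: transform in the integration engine, then load the finished rows.
    transformed = [(name.upper(), float(amount)) for name, amount in rows]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)


def elt(conn):
    # ELT: load the raw rows first, then let the target database transform them.
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT UPPER(name) AS name, CAST(amount AS REAL) AS amount FROM staging"
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_result = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_result = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
```

Both paths produce the same target rows; they differ only in which system does the transformation work.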
InfoSphere DataStage Architecture
InfoSphere DataStage Tier Topology
Partitioning and Dynamic Repartitioning
InfoSphere Information Server parallel
technology operates using a divide-and-conquer
technique:
Partition parallelism:
Splits the largest integration jobs into subsets
Pipeline parallelism:
Flows these subsets concurrently across all available
processors
Delivers true linear scalability
Performance increases proportionally to the
number of processors
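A minimal Python sketch of the divide-and-conquer idea follows. It is not how the parallel engine is implemented: the modulus-based `partition` function, the generator stages, and the use of a thread pool are assumptions chosen to keep the example short. Generators stand in for pipeline stages, because each record moves to the next stage as soon as it is read, and each partition is processed by its own worker.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, key, n_partitions):
    """Partition parallelism: split the data into subsets by key modulus."""
    parts = [[] for _ in range(n_partitions)]
    for record in records:
        parts[record[key] % n_partitions].append(record)
    return parts


def extract(records):
    for record in records:
        yield record


def transform(records):
    for record in records:
        yield {**record, "amount": record["amount"] * 2}


def run_pipeline(records):
    """Pipeline: each record flows to the next stage as soon as it is read."""
    return list(transform(extract(records)))


records = [{"id": i, "amount": i * 10} for i in range(8)]
parts = partition(records, key="id", n_partitions=4)

# One worker per partition; all partitions run the same pipeline.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, parts))

total = sum(r["amount"] for part in results for r in part)
```

The sketch shows only the structure. It makes no claim about speedup, which in practice depends on the hardware and on how evenly the data is partitioned.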
Data Partitioning
Working with InfoSphere DataStage
Get More from InfoSphere DataStage
Use the capabilities of other IBM InfoSphere
Information Server components as you develop
and deploy InfoSphere DataStage jobs
Collaborate across projects by using the metadata
repository
Determine development requirements for a project by
using IBM InfoSphere Blueprint Director
Collaborate with business analysts to jump-start job
designs by using IBM InfoSphere FastTrack
Develop data rules by using IBM InfoSphere Information
Analyzer
Deploy jobs as services by using IBM InfoSphere
Information Services Director
InfoSphere DataStage Design Elements
Design jobs that extract data from various sources,
transform it, and deliver it to target systems in
the required formats.
Extract and load data
Transform data
Enrich data
Cleanse data
Real-time processing
Big data processing
Combine jobs in a sequence job
Sample Scenario
Projects
All work is organized in projects, along with
associated design items
Different users can be granted access to different
projects
For example, give each designer on your team an
individual project to work in
An engine tier can have any number of projects
Use the Administrator client to perform administrative
tasks on an engine tier and its associated projects
Jobs and Stages – 1/2
Develop, test, deploy, and run jobs that move
data from source systems to target systems
Jobs consist of stages linked together which
describe the flow of data from a data source to a
data target
For example, a final data warehouse
Stages have at least one data input or one data
output
Some stages can accept more than one data input,
and output to more than one stage
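The job-and-stage structure described above is a directed graph: stages are nodes and links carry data between them. The Python sketch below models that idea. The `Stage` and `Job` classes and the stage names are hypothetical and do not correspond to any DataStage API.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    stages: dict = field(default_factory=dict)

    def add_stage(self, name):
        self.stages[name] = Stage(name)
        return self.stages[name]

    def link(self, source, target):
        """A link carries data from one stage's output to another's input."""
        self.stages[source].outputs.append(target)
        self.stages[target].inputs.append(source)

    def validate(self):
        # Every stage needs at least one data input or one data output.
        return all(s.inputs or s.outputs for s in self.stages.values())


job = Job("load_warehouse")
for stage_name in ("orders_file", "customers_db", "join", "warehouse", "rejects"):
    job.add_stage(stage_name)
job.link("orders_file", "join")
job.link("customers_db", "join")
job.link("join", "warehouse")
job.link("join", "rejects")
is_valid = job.validate()
```

Here the `join` stage accepts two data inputs and writes to two downstream stages, which matches the last point on the slide.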
Jobs and Stages – 2/2
Different types of job have different stage types
Stages that are available in the Designer depend
on the type of job that is currently open in the
Designer
Jobs and Stages:
Three types of jobs:
Parallel
Server
Mainframe
Two types of stages:
Passive stages (E and L of ETL)
Processor (active) stages (T of ETL)
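The split between passive and active stages can be illustrated with three small Python functions. This is a sketch under stated assumptions, not DataStage code: the function names, the CSV layout, and the sample data are invented. The passive stages only read or write data, while the active stage is the only one that changes it.

```python
import csv
import io


def read_stage(text):
    """Passive stage (E): only reads data from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transformer_stage(rows):
    """Active stage (T): changes the data as it passes through."""
    return [
        {"name": r["name"].title(), "total": int(r["qty"]) * int(r["price"])}
        for r in rows
    ]


def write_stage(rows):
    """Passive stage (L): only writes data to the target."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "total"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


source = "name,qty,price\nwidget,3,4\ngadget,2,10\n"
target = write_stage(transformer_stage(read_stage(source)))
```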
Creating Jobs
Example of Job with Stages
InfoSphere DataStage User Interfaces
Web-based tools:
Information Server Web Console
Clients:
Designer
Director
Administrator
Information Server Web Console
Use the web console for the following
administrative activities:
Managing security
Creating views of scheduled tasks
Reporting
Tasks that are associated with IBM InfoSphere
Business Glossary
Tasks that are associated with the Information
Services catalog
Designer Client
IBM InfoSphere DataStage and QualityStage
Designer:
Set up your project
Create, manage, and design jobs
Define tables
Access metadata services
Director Client
IBM InfoSphere DataStage and QualityStage
Director:
Validates, runs, schedules, and monitors jobs that are
run by the IBM InfoSphere Information Server engine
Administrator Client
IBM InfoSphere DataStage and QualityStage
Administrator:
Provides tools for managing general and project-
related tasks such as server timeout and NLS
mappings
Exists in parallel with the web client-based Suite
Administrator
Open the Suite Administrator from within the Administrator
by clicking the Suite Admin hyperlink
Key Features
Enterprise Connectivity
Successful enterprise-class information
integration requires access to a full range of data
sources
structured, semi-structured or unstructured within and outside of
the enterprise
Connectivity to a virtually unlimited array of
heterogeneous data sources, targets and
applications that can be combined within a single
job
Supports best-of-breed integration with IBM
Netezza Performance Server data warehouse
appliances
Realize faster time to value
Development and Maintenance
Support for Developers and Administrators:
InfoSphere Information Server uses a powerful
architecture:
maximize speed, flexibility and effectiveness in building, deploying, updating
and managing data integration infrastructure
InfoSphere DataStage leverages the productivity-enhancing
features of InfoSphere Information Server:
reduce the learning curve
simplify administration
optimize the use of development resources
Common Metadata Repositories
Provides a unified metadata repository for
InfoSphere DataStage and all other modules
Immediately access technical and process
metadata developed during data profiling,
cleansing and integration processes:
Speed development
Reduce the chance for errors
Productivity
Development cycle shortened:
Fosters reuse of existing data integration business
logic
Employs a container concept enabling jobs and
metadata created in one container to be shared and
reused by other jobs
Quick Find and Advanced Find capabilities:
Easy to locate objects for reuse across different
projects
Robust job specification reporting:
Provides documentation so other developers can
easily understand job design and provide additional
support
Right-time Data Integration
InfoSphere Information Server architecture
enables InfoSphere DataStage to operate in real
time:
Captures messages or extracts data at a moment’s notice
on the same platform that integrates bulk data, using
the same transformation rules
Use InfoSphere DataStage together with
InfoSphere Information Services Director:
Data integration jobs can be deployed with Java Message
Services, web services and other methods
Service-oriented architecture (SOA) approach enables
numerous developers to share complex data integration
processes without requiring them to understand the steps
contained in the services
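The idea of applying the same transformation rules to bulk data and to individual messages can be sketched as follows. The rule set, field names, and handler names are hypothetical. The point is only that one function is shared by the batch path and the real-time path, so callers do not need to know the steps it contains.

```python
def apply_rules(record):
    """One shared set of transformation rules (hypothetical)."""
    return {
        "customer": record["customer"].strip().upper(),
        "amount_cents": int(record["amount_cents"]),
    }


def run_batch(records):
    """Bulk path: transform a whole extract at once."""
    return [apply_rules(r) for r in records]


def on_message(message):
    """Real-time path: transform a single message as it arrives."""
    return apply_rules(message)


batch_out = run_batch([
    {"customer": " acme ", "amount_cents": "1250"},
    {"customer": "globex", "amount_cents": "300"},
])
message_out = on_message({"customer": " acme ", "amount_cents": "1250"})
```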
Flexibility and Scalability
InfoSphere DataStage facilitates high performance
integration of massive amounts of data
Leverages the parallel processing capabilities of
multiprocessor hardware platforms enabling
businesses to linearly scale the speed of data
throughput
Scale transformation jobs to address the demands of
ever-growing data volumes and ever-shrinking batch
windows
During development, the deployment configuration
automatically adds the desired degree of parallelism:
For example, an organization could take an application from 2-way
processing in the morning to 32-way in the afternoon to 128-way processing
at night—all with a simple change to the configuration file
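The 2-way, 32-way, and 128-way example can be mimicked in Python to show why the job itself does not change. This is a sketch, not the format or behavior of a real DataStage configuration file: the `config` dictionary and its `nodes` key are assumptions made for illustration, and the partitions are processed in a plain loop rather than on separate nodes.

```python
def run_job(records, config):
    """Same job logic for any degree of parallelism; only config changes."""
    n = config["nodes"]
    # Round-robin split into one partition per configured node.
    partitions = [records[i::n] for i in range(n)]
    # Each partition would run on its own node; here we simply loop.
    partial_sums = [sum(part) for part in partitions]
    return len(partitions), sum(partial_sums)


records = list(range(1, 1001))
morning = run_job(records, {"nodes": 2})
afternoon = run_job(records, {"nodes": 32})
night = run_job(records, {"nodes": 128})
```

All three runs return the same total; only the number of partitions differs, and that number comes entirely from the configuration passed in.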
InfoSphere DataStage in Perspective
Summary
In this session, we:
Took a look at IBM's InfoSphere platform and
what DataStage brings in its offering
Examined the architecture of InfoSphere
DataStage
Learned how to use InfoSphere DataStage
Reviewed the key features of InfoSphere
DataStage
IBM InfoSphere DataStage Education
Web Age Solutions is an authorized provider of
IBM InfoSphere DataStage education.
InfoSphere DataStage Resources
IBM
IBM InfoSphere DataStage home page
http://www-03.ibm.com/software/products/en/ibminfodata
IBM InfoSphere Information Server Information
Center
http://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp