

Companion Guidebook



Learning course
Data warehouse architectures
and development strategy


Sabir Asadullaev
Executive IT Architect SWG IBM EE/A
Distinguished IT Architect, Open Group




Table of Contents

DATA WAREHOUSE ARCHITECTURES - I.............................................................................................................. 5
ABSTRACT....................................................................................................................................................................... 5
OLAP AND OLTP ........................................................................................................................................................... 5
SIX LEVELS OF DATA WAREHOUSE ARCHITECTURES........................................................................................................ 6
VIRTUAL DATA WAREHOUSE.......................................................................................................................................... 8
INDEPENDENT DATA MARTS .......................................................................................................................................... 10
CONCLUSION................................................................................................................................................................. 11
LITERATURE.................................................................................................................................................................. 11
DATA WAREHOUSE ARCHITECTURES - II .......................................................................................................... 12
ABSTRACT..................................................................................................................................................................... 12
CENTRALIZED DATA WAREHOUSE WITH ETL............................................................................................................... 12
CENTRALIZED DATA WAREHOUSE WITH ELT............................................................................................................... 14
CENTRALIZED DW WITH OPERATIONAL DATA STORE.................................................................................................. 16
EXTENDED MODEL WITH DATA MARTS........................................................................................................................ 17
CONCLUSION................................................................................................................................................................. 18
LITERATURE.................................................................................................................................................................. 18
DATA WAREHOUSE ARCHITECTURES - III......................................................................................................... 19
ABSTRACT..................................................................................................................................................................... 19
CENTRALIZED ETL WITH PARALLEL DW AND DATA MARTS......................................................................................... 19
DW WITH INTERMEDIATE APPLICATION DATA MARTS................................................................................................... 20
DATA WAREHOUSE WITH INTEGRATION BUS................................................................................................................ 22
RECOMMENDED EDW ARCHITECTURE ......................................................................................................................... 24
CONCLUSION................................................................................................................................................................. 25
LITERATURE.................................................................................................................................................................. 26
DATA, METADATA AND MASTER DATA: THE TRIPLE STRATEGY FOR DATA WAREHOUSE
PROJECTS...................................................................................................................................................................... 27
ABSTRACT..................................................................................................................................................................... 27
INTRODUCTION.............................................................................................................................................................. 27
MASTER DATA MANAGEMENT ....................................................................................................................................... 28
METADATA MANAGEMENT............................................................................................................................................ 28
DATA, METADATA AND MASTER DATA INTERRELATIONS .............................................................................................. 28
Data and metadata................................................................................................................................................... 29
Data and master data............................................................................................................................................... 30
Metadata and master data ....................................................................................................................................... 30
COMPONENTS OF ENTERPRISE DATA WAREHOUSE ....................................................................................................... 32
EXAMPLE OF EXISTING APPROACH ................................................................................................................................ 33
THE PRACTICAL REALIZATION OF THE TRIPLE STRATEGY.............................................................................................. 34
CONCLUSION................................................................................................................................................................. 37
LITERATURE.................................................................................................................................................................. 37
METADATA MANAGEMENT USING IBM INFORMATION SERVER............................................................... 38
ABSTRACT..................................................................................................................................................................... 38
GLOSSARY..................................................................................................................................................................... 38
METADATA TYPES......................................................................................................................................................... 38
SUCCESS CRITERIA OF METADATA PROJECT................................................................................................................... 38
METADATA MANAGEMENT LIFECYCLE ......................................................................................................................... 39
IBM INFORMATION SERVER METADATA MANAGEMENT TOOLS .................................................................................... 41
ROLES IN METADATA MANAGEMENT PROJECT............................................................................................................... 42
THE ROLES SUPPORT BY IBM INFORMATION SERVER TOOLS......................................................................................... 43
CONCLUSION................................................................................................................................................................. 46
INCREMENTAL IMPLEMENTATION OF IBM INFORMATION SERVERS METADATA MANAGEMENT
TOOLS............................................................................................................................................................................. 47
ABSTRACT..................................................................................................................................................................... 47
SCENARIO, CURRENT SITUATION AND BUSINESS GOALS................................................................................................. 47
LOGICAL TOPOLOGY AS IS ......................................................................................................................................... 47
ARCHITECTURE OF METADATA MANAGEMENT SYSTEM................................................................................................. 49
ARCHITECTURE OF METADATA MANAGEMENT ENVIRONMENT...................................................................................... 49
ARCHITECTURE OF METADATA REPOSITORY ................................................................................................................. 50
LOGICAL TOPOLOGY TO BE........................................................................................................................................ 51
TWO PHASES OF EXTENDED METADATA MANAGEMENT LIFECYCLE............................................................................... 51
METADATA ELABORATION PHASE.............................................................................................................................. 53
METADATA PRODUCTION PHASE ............................................................................................................................... 53
ROLES AND INTERACTIONS ON METADATA ELABORATION PHASE.................................................................................. 55
ROLES AND INTERACTIONS ON METADATA PRODUCTION PHASE.................................................................................... 58
ADOPTION ROUTE 1: METADATA ELABORATION............................................................................................................ 60
ADOPTION ROUTE 2: METADATA PRODUCTION.............................................................................................................. 62
CONCLUSION................................................................................................................................................................. 64
LITERATURE.................................................................................................................................................................. 64
MASTER DATA MANAGEMENT WITH PRACTICAL EXAMPLES................................................................... 65
ABSTRACT..................................................................................................................................................................... 65
BASIC CONCEPTS AND TERMINOLOGY ........................................................................................................................... 65
REFERENCE DATA (RD) AND MASTER DATA (MD) ........................................................................................................ 66
ENTERPRISE RD & MD MANAGEMENT.......................................................................................................................... 67
TECHNOLOGICAL SHORTCOMINGS OF RD & MD MANAGEMENT................................................................................... 67
No unified data model for RD & MD....................................................................................................................... 67
There is no single regulation of history and archive management .......................................................................... 68
The complexity of identifying RD & MD objects ..................................................................................................... 68
The emergence of duplicate RD & MD objects........................................................................................................ 68
Metadata inconsistency of RD & MD...................................................................................................................... 68
Referential integrity and synchronization of RD & MD model................................................................................ 68
Discrepancy of RD & MD object life cycle.............................................................................................................. 69
Clearance rules development................................................................................................................................... 69
Wrong core system selection for RD & MD management ....................................................................................... 69
IT systems are not ready for RD & MD integration ................................................................................................ 69
EXAMPLES OF TRADITIONAL RD & MD MANAGEMENT ISSUES ..................................................................................... 69
Passport data as a unique identifier ........................................................................................................................ 70
Address as a unique identifier.................................................................................................................................. 70
The need for mass contracts renewal ..................................................................................................................... 70
The discrepancy between the consistent data .......................................................................................................... 70
BENEFITS OF CORPORATE RD & MD............................................................................................................................. 70
Law compliance and risk reduction......................................................................................................................... 71
Profits increase and customer retention .................................................................................................................. 71
Cost reduction.......................................................................................................................................................... 71
Increased flexibility to support new business strategies .......................................................................................... 71
ARCHITECTURAL PRINCIPLES OF RD & MD MANAGEMENT .......................................................................................... 72
CONCLUSION................................................................................................................................................................. 73
LITERATURE.................................................................................................................................................................. 73
DATA QUALITY MANAGEMENT USING IBM INFORMATION SERVER....................................................... 74
ABSTRACT..................................................................................................................................................................... 74
INTRODUCTION.............................................................................................................................................................. 74
METADATA AND PROJECT SUCCESS............................................................................................................................... 74
METADATA AND MASTER DATA PARADOX .................................................................................................................... 75
METADATA IMPACT ON DATA QUALITY......................................................................................................................... 75
DATA QUALITY AND PROJECT STAGES ........................................................................................................................... 76
QUALITY MANAGEMENT IN METADATA LIFE CYCLE ...................................................................................................... 76
DATA FLOWS AND QUALITY ASSURANCE....................................................................................................................... 77
ROLES, INTERACTIONS AND QUALITY MANAGEMENT TOOLS ......................................................................................... 80
NECESSARY AND SUFFICIENT TOOLS ............................................................................................................................. 80
CONCLUSION................................................................................................................................................................. 82
LITERATURE.................................................................................................................................................................. 82
PRIMARY DATA GATHERING AND ANALYSIS SYSTEM - I............................................................................. 83
ABSTRACT..................................................................................................................................................................... 83
INTRODUCTION.............................................................................................................................................................. 83
SYSTEM REQUIREMENTS................................................................................................................................................ 84
PROJECT OBJECTIVES..................................................................................................................................................... 85
Development of e-forms for approved paper forms ................................................................................................. 85
Development of e-forms for new paper forms.......................................................................................................... 85
Development of storage for detailed data................................................................................................................ 85
Development of analytical tools............................................................................................................................... 86
Development of reporting and visualization tools ................................................................................................... 86
Information security................................................................................................................................................. 86
Data back-up............................................................................................................................................................ 86
Data archiving......................................................................................................................................................... 86
Logging system events.............................................................................................................................................. 86
Success criteria........................................................................................................................................................ 86
ARCHITECTURE OF SYSTEM FOR DATA COLLECTION, STORAGE AND ANALYSIS ............................................................. 87
DATA COLLECTION........................................................................................................................................................ 89
DATA STORAGE ............................................................................................................................................................. 90
CONCLUSION................................................................................................................................................................. 92
LITERATURE.................................................................................................................................................................. 92
PRIMARY DATA GATHERING AND ANALYSIS SYSTEM - II ........................................................................... 93
ABSTRACT..................................................................................................................................................................... 93
DATA ANALYSIS USING IBM INFOSPHERE WAREHOUSE ............................................................................................... 93
Cubing Services & Alphablox based OLAP............................................................................................................. 93
Text and data mining ............................................................................................................................................... 94
Data mining using MiningBlox & Alphablox........................................................................................................... 94
Data mining using Intelligent Miner........................................................................................................................ 95
Text analysis ............................................................................................................................................................ 96
Data mining application development ..................................................................................................................... 97
DATA ANALYSIS USING IBM COGNOS BUSINESS INTELLIGENCE................................................................................... 98
ENTERPRISE PLANNING USING COGNOS TM1.............................................................................................................. 101
CONCLUSION............................................................................................................................................................... 102
LITERATURE................................................................................................................................................................ 102
DATA WAREHOUSING: TRIPLE STRATEGY IN PRACTICE .......................................................................... 103
ABSTRACT................................................................................................................................................................... 103
INTRODUCTION............................................................................................................................................................ 103
ARCHITECTURE OF PRIMARY DATA GATHERING AND ANALYSIS SYSTEM..................................................................... 103
ROLE OF METADATA AND MASTER DATA MANAGEMENT PROJECTS ............................................................................. 105
RECOMMENDED DW ARCHITECTURE.......................................................................................................................... 106
RELATION BETWEEN THE RECOMMENDED ARCHITECTURE AND THE SOLUTION........................................................... 107
COMPARISON OF PROPOSED AND EXISTING APPROACHES ............................................................................................ 109
THE FINAL ARCHITECTURE OF IMPLEMENTATION OF EXISTING APPROACHES .............................................................. 110
CONCLUSION............................................................................................................................................................... 114
LITERATURE................................................................................................................................................................ 114


Data Warehouse Architectures - I
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
19.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html
Abstract
This paper opens a series of three articles on data warehouse (DW) architectures and their
predecessors. The abundance of approaches, methods and recommendations blurs the concepts,
advantages, drawbacks, limitations and applicability of specific architectural solutions. The first
article covers the evolution of the understanding of OLAP's role, the components of DW
architecture, virtual DWs, and independent data marts. The second article considers the centralized
DW (CDW) with ETL (Extract, Transform, Load), CDW with ELT (Extract, Load, Transform),
CDW with an operational data store, and the extended model with data marts. The third article
discusses centralized ETL with parallel DW and data marts, DW with intermediate application data
marts, DW with an integration bus, and the recommended DW architecture.
OLAP and OLTP
Any transactional system usually contains two types of tables. One of them is responsible for quick
transactions. For example, a ticket sales system has to ensure a reliable exchange of short messages
between the system and a large number of ticket agents. Indeed, the entered and printed information
about a passenger's name, flight date, flight number, seat and destination amounts to around 1000
bytes. Thus, passenger service requires fast processing of short records.
The other type of table contains sales summaries for a specified period, by destination and by
passenger category. These tables are used by analysts and financial specialists at the end of a month,
quarter, or year, when the company's financial results are needed. Although the number of analysts
is ten times smaller than the number of ticket agents, the volume of data required for analysis
exceeds the average transaction size by several orders of magnitude.
Not surprisingly, the execution of analytical queries increases the system's response time to ticket
availability requests. Creating a system with a reserve of computational power can mitigate the
negative impact of the analytical processing load on transactional activity, but it significantly
increases the cost of the required software and hardware, because the excess processing capacity
remains unused most of the time. The second factor that led to the separation of analytical and
transactional systems is the different requirements they place on computing systems.
The OLAP story begins in 1993, when the article "Providing OLAP (On-line Analytical Processing)
to User-Analysts" was published [1]. Initially, it appeared that separating transactional and
analytical systems (OLTP and OLAP) would be sufficient.
However, it soon became clear that OLAP systems cope poorly with the role of mediator between
various data sources (including transactional systems) and client applications.
It became clear that a dedicated analytical data storage environment was required. Initially, shared
databases were proposed for this role, which implied copying and storing the original data from the
data sources. This idea turned out to be unviable, as transactional systems were commonly developed
without a unified plan and therefore contained conflicting and inconsistent information.


Pic. 1. Evolution of OLTP and OLAP understandings
These considerations led to the idea of a data warehouse designed for secure data storage, together
with systems for data extraction, transformation and loading (ETL). OLAP systems now had to use
information from the data warehouse.
It soon became apparent that a data warehouse accumulates critically important enterprise
information, so any unauthorized access to the DW is fraught with serious financial losses. Moreover,
data formats oriented towards reliable storage are hard to combine with the requirements of fast
information service. The geographical distribution and organizational structure of an enterprise also
require a specific approach to the information services provided to each of its units. The solution is a
data mart that contains the required subset of information from the DW. Data can be loaded from the
DW into a data mart during periods of low user activity. In case of a data mart failure, its data can
easily be restored from the DW with minimal losses.
Data Marts can handle reporting, statistical analysis, planning, scenario calculations (what-if
analysis) and, in particular, multidimensional analysis (OLAP).
Thus, OLAP systems, which initially claimed almost half of the computing world (leaving the other
half to OLTP systems), now rank among analytical tools at the workgroup level.
Six levels of data warehouse architectures
Data warehouse architecture at times resembles a child's building blocks: almost any combination of
blocks represents something that can be met in real life. Sometimes a company has several enterprise
data warehouses, each of which is positioned as the single, unified source of consistent information.
Multilevel data marts in the presence of a unified data warehouse add even more fun. Why not build
the new DM directly on top of the DW? Well, users want to combine some data from the two
existing DMs into a third one. Perhaps it would make sense if the DMs contained information that is
not in the DW, for example, if users have enriched a DM with their own calculations and data. Even
so, what is the value of these enriched data compared with data that have passed through a cleansing
sieve in accordance with enterprise policies? Who is responsible for the quality of the data? How did
they appear in the system? Nobody knows, but everyone wants access to information that is not in
the DW.
Data warehouses are somewhat similar to a water purification system. Water of different chemical
composition is collected from various sources, so specific cleaning and disinfection methods are
applied to each source. The water delivered to consumers meets strict quality standards. And no
matter how much we complain about the quality of the water, this approach prevents the spread of
epidemics in the city. It occurs to no one (I hope) to enrich purified water with water from a nearby
pond. IT, however, has its own laws.
Various data warehousing architectures will be considered later, though extremely exotic approaches
will not be examined.
We will discuss the architecture of the enterprise data warehouse in terms of six layers, because even
when particular components are absent, the layers themselves exist in some form.

Pic. 2. Six layers of DW architecture
The first layer consists of data sources, such as transactional and legacy systems, archives, separate
files of known formats, MS Office documents, and any other structured data sources.
The second layer hosts the ETL (Extract, Transform and Load) system. The main objective of ETL
is to extract data from multiple sources, bring them to a consistent form, and load them into the DW.
The hardware and software system where ETL is implemented must have a large throughput, but
high computing performance is even more important. Therefore, the best ETL systems are able to
provide a high degree of task parallelism and can even work with clusters and computational grids.
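As an illustration, here is a minimal sketch, not a description of any particular ETL product, of how per-source extract and transform steps can run in parallel before a single load step; the source names and the trivial transform are invented for the example.

    # A minimal sketch: per-source extract and transform in parallel, then one load.
    # Source names and the toy transform are illustrative assumptions.
    from concurrent.futures import ThreadPoolExecutor

    SOURCES = ["ticketing", "billing", "crm"]   # hypothetical source systems

    def extract(source):
        # Stand-in for querying a database, reading files, calling an API, etc.
        return [{"source": source, "amount": 100.0}]

    def transform(records):
        # Stand-in for bringing records to a consistent enterprise format.
        return [{**r, "amount": round(r["amount"], 2)} for r in records]

    def load(records):
        # Stand-in for a bulk insert into the DW (or an ODS / staging area).
        print(f"loading {len(records)} records")

    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        batches = list(pool.map(lambda s: transform(extract(s)), SOURCES))

    load([record for batch in batches for record in batch])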
The role of the next layer is reliable data storage, protected from unauthorized access. Under the
proposed triple strategy [2], we believe that metadata and master data management systems should
also be placed at this level. An operational data store (ODS) is needed when quick access is required
to data that may be incomplete and not fully consistent, but is available with the least possible time
lag. A staging area is needed to implement a specific business process, for example, when a data
steward must review data and approve loading the reviewed data into the DW.
Sometimes the term storage area refers to a database buffer needed for internal process operations;
for example, ETL retrieves data from a source, writes them into an internal database, cleanses them,
and loads them into the DW. In this paper, the term staging zone is used for storage areas designed
for operations performed by external users or systems in accordance with business requirements for
data processing. Separating the staging zone into a distinct component of the DW is necessary
because these zones require additional administration, monitoring, security and audit processes.
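To make the distinction concrete, here is a minimal sketch of a staging zone gated by data-steward approval; the in-memory "tables" and the approved_by field are assumptions made for illustration, not part of any specific product.

    # A minimal sketch of a staging zone gated by data-steward approval.
    staging = [
        {"id": 1, "value": 42, "approved_by": "steward_a"},
        {"id": 2, "value": 17, "approved_by": None},   # not yet reviewed
    ]
    warehouse = []

    def promote_approved(staging_rows, dw_rows):
        """Move steward-approved rows to the DW; keep the rest in the staging zone."""
        remaining = []
        for row in staging_rows:
            if row["approved_by"] is not None:
                dw_rows.append({k: v for k, v in row.items() if k != "approved_by"})
            else:
                remaining.append(row)
        return remaining

    staging = promote_approved(staging, warehouse)
    print(len(warehouse), "row(s) promoted,", len(staging), "awaiting review")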
Information systems at the data distribution layer still do not have a common name. They may
simply be called ETL, like the extraction, transformation and loading system on the second layer, or,
to emphasize the difference, they are sometimes called ETL-2. Data distribution systems at the fourth
layer perform tasks that differ significantly from those of ETL, namely sampling, restructuring and
data delivery (SRD: Sample, Restructure, Deliver).
ETL extracts data from a variety of external systems; SRD selects data from a single DW. ETL
receives inconsistent data that must be converted to a common format; SRD deals with cleansed data
whose structure must be brought into compliance with the requirements of different applications.
ETL loads data into a central DW; SRD delivers data to different data marts in accordance with
access rights, delivery schedules and requirements for the information set.
The information access layer is intended to separate data storage functions from the information
support functions for various applications. Data marts must have a data structure that best suits the
needs of their information support tasks. Since there is no universal data structure that is optimal for
all applications, data marts should be grouped by geographical, thematic, organizational, functional
and other characteristics.
The business applications layer comprises scenario and statistical analysis, multidimensional
analysis, planning and reporting tools, and other business applications.
Virtual Data Warehouse
The virtual data warehouse belongs to the romantic era, when it seemed possible to implement
everything the human mind could imagine. No one remembers virtual DWs, and so they are invented
again and again, each time on a new level. So we have to start with something that is long gone, but
keeps trying to be reborn in a new guise.
The concept of the virtual data warehouse was based on a few sublimely beautiful ideas.
The first great idea is cost reduction. There is no need to spend money on expensive equipment for a
central data warehouse, no need for qualified personnel to maintain this repository, and no need for
server rooms with expensive cooling, fire control and monitoring equipment.
The second idea is that we should work with the most recent data. The analytical system must work
directly with the data sources, bypassing all middlemen. Everyone knows that intermediaries are evil;
our experts do not trust mediator applications and have always worked directly with the source
systems.
The third idea is that we will write everything ourselves. All that is needed is a workstation, access
to the data sources, and a compiler. Our programmers are sitting idle anyway. They will develop an
application that, at the user's request, queries all the sources itself, delivers the data to the user's
computer, converts the divergent formats, performs the data analysis, and shows everything on the
screen.

Pic. 3. Virtual Data Warehouse
Does the company have many users with different needs? Don't worry, we will modify our universal
application for as many options as you want.
Is there a new data source? Wonderful. We will rewrite all of our applications to take into account
the peculiarities of this source.
Did a data format change? Fine. We will rewrite all of our applications to reflect the new format.
Everything is going well, everybody is busy; we need more staff, and the software development
department should be expanded.
Oh, and the users of the data source systems are complaining that their systems have become very
slow, because our universal client application queries the data sources again and again, even if the
same request has already been made before. Therefore it is necessary to purchase new, more
powerful servers.
What about the promised cost reduction? There is none; on the contrary, costs only increase. We
need more developers, more servers, more power, and more space for server rooms.
Are there still any benefits from this architecture?
What we got is tight coupling between data sources and analytical applications. Any change in the
source data must be agreed with the developers of the universal client to avoid passing distorted or
misinterpreted data to the analysis applications. A set of interfaces for accessing the different source
systems must be maintained at every workplace.
Some believe that all this is obvious and that it is not worth wasting time explaining things everyone
understands. But then why, when a user asks for data from data marts A and B, do the same
developers write a client application that accesses multiple data marts, reproducing again and again
the dead architecture of the virtual data warehouse?
Independent data marts
Independent data marts emerged as a physical realization of the understanding that transactional and
analytical data processing do not get along well on a single computer.
The reasons for this incompatibility are as follows:
- Transactional processing is characterized by a large number of reads and writes to the database.
Analytical processing may require only a few queries to the database.
- A record in OLTP is typically shorter than 1000 characters. A single analytical query may require
megabytes of data for analysis.
- The number of transactional system users can reach several thousand employees. The number of
analysts is usually within a few tens.
- The typical requirement for transactional systems is round-the-clock, non-stop operation 365 days
a year (24 x 365). Analytical processing has no such well-defined availability requirements, but a
report that is not prepared in time can lead to serious trouble for the analysts as well as for the
company.
- The load on transactional systems is distributed more or less evenly over the year. The load on
analytical systems usually peaks at the end of accounting periods (month, quarter, year).
- Transactional processing is mainly carried out on current data. Analytical calculations address
historical data.
- Data in transactional systems can be updated, whereas in analytical systems data should only be
appended. Any attempt to change data retroactively should at least attract attention.
Thus, transactional and analytical systems place different requirements on both software and
hardware in terms of performance, capacity, availability, data models, data storage organization,
data access methods, peak loads, data volumes and processing methods.
The creation of independent data marts was the first response to the need to separate analytical and
transactional systems. In those days it was a big step forward, simplifying the design and operation
of software and hardware because they no longer had to satisfy the mutually exclusive requirements
of analytical and transactional systems.
The advantage of independent data marts is the ease and simplicity of their organization: each of
them operates on the data of one specific application, so there are no problems with metadata and
master data, and there is no need for a complex extraction, transformation and loading (ETL) system.
Data are simply copied from a transactional system to a data mart on a regular basis. One application
- one data mart. This is why independent data marts are often called application data marts.
But what if users need information from multiple data marts simultaneously? The development of
complex client applications that query many data marts at a time and convert the data on the fly has
already been discredited by the virtual data warehouse approach.

Pic. 4. Independent data marts

So a single repository - a data warehouse - is needed. But the information in the data marts is not
consistent: each data mart has inherited its terminology, data model and master data, including data
encodings, from its transactional system. For example, in one system the date of an operation may be
encoded in the Russian format dd.mm.yyyy (day, month, year), and in another in the American
format mm.dd.yyyy (month, day, year). When merging data it is necessary to understand what
06.05.2009 means: is it June 5 or May 6? That is why an ETL (extract, transform and load) system is
needed.
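A short sketch shows why the format must travel with the data as metadata: the same string parses to two different dates depending on the source convention (Python's standard datetime module is used here purely for illustration).

    # The same string yields two different dates depending on the source convention,
    # so the ETL layer has to know the format of every source (i.e. its metadata).
    from datetime import datetime

    raw = "06.05.2009"
    as_russian = datetime.strptime(raw, "%d.%m.%Y")    # day.month.year -> May 6, 2009
    as_american = datetime.strptime(raw, "%m.%d.%Y")   # month.day.year -> June 5, 2009
    print(as_russian.date(), as_american.date())       # 2009-05-06 2009-06-05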
Thus, the benefits of independent data marts disappear with the first user request to work with data
from several data marts at once.
Conclusion
This article has dealt with the evolution of the understanding of OLAP's role, with the component
architecture of the DW, and with virtual DWs and independent data marts. The next papers will
discuss the advantages and limitations of the following architectures: a centralized DW with an ETL
system, a DW with an ELT system, a central data warehouse with an operational data store (ODS),
the extended model with data marts, centralized ETL with parallel DW and data marts, a DW with
intermediate application data marts, a data warehouse with an integration bus, and the recommended
DW architecture.
Literature
1. Codd E.F., Codd S.B., and Salley C.T. "Providing OLAP (On-line Analytical Processing) to
User-Analysts: An IT Mandate". Codd & Date, Inc. 1993.
2. Asadullaev S. "Data, metadata and master data: the triple strategy for data warehouse projects",
http://www.ibm.com/developerworks/ru/library/r-nci/index.html, 2009.

Data Warehouse Architectures - II
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
23.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html
Abstract
The second paper continues a series of three articles on data warehouse (DW) architectures and their
predecessors. The abundance of approaches, methods and recommendations blurs the concepts,
advantages, drawbacks, limitations and applicability of specific architectural solutions. The first
article [1] covered the evolution of the understanding of OLAP's role, the components of DW
architecture, virtual DWs and independent data marts. This publication considers the centralized DW
(CDW) with ETL (Extract, Transform, Load), CDW with ELT (Extract, Load, Transform), CDW
with an operational data store, and the extended model with data marts. The third article discusses
centralized ETL with parallel DW and data marts, DW with intermediate application data marts, DW
with an integration bus, and the recommended DW architecture.
Centralized Data Warehouse with ETL
The virtual data warehouse and independent data marts showed that a unified data repository is
required for the effective operation of analytical systems. To fill this repository, disparate data from
various sources must be extracted, reconciled, and loaded into the repository.
ETL tools should be aware of all the information about the data sources: the structure and formats of
the stored data, differences in data processing methods, the meaning of the stored data, and the data
processing schedules of the transactional systems. Ignoring this information about data (metadata)
inevitably degrades the quality of the information loaded into the repository. As a result, users lose
confidence in the data warehouse and try to get information directly from the sources, which wastes
the time of the specialists who maintain the source systems.
Thus, ETL tools must use the information about data sources. Therefore, ETL tools should work in
close conjunction with metadata management tools.
Extracted data should be converted into a unified form. Since the data are stored mainly in relational
databases, differences in coded values must be taken into account: dates can be encoded in different
formats, addresses may use different abbreviations, and product codes may follow different
nomenclatures. Initially, information about master data was embedded in the data conversion
algorithms of the ETL tools. As the number of data sources and the volume of processed data grew
(the former can reach thousands of systems, the latter more than ten terabytes per day), it became
clear that master data management (MDM) had to be separated from ETL, and that effective
interaction between MDM and ETL had to be ensured.
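The following minimal sketch illustrates the idea of keeping the correspondence between local and enterprise codes outside the ETL conversion code, in a shared master data mapping; all system names and product codes below are invented.

    # A minimal sketch: the mapping from local codes to enterprise codes lives in a
    # shared master data structure rather than inside each ETL conversion algorithm.
    MASTER_PRODUCT_CODES = {
        ("billing", "WTR-05"): "PROD-0001",     # bottled water, 0.5 l
        ("ticketing", "1001"): "PROD-0001",     # the same product, another local code
    }

    def to_enterprise_code(source, local_code):
        """Resolve a source-specific code via the shared master data mapping."""
        try:
            return MASTER_PRODUCT_CODES[(source, local_code)]
        except KeyError:
            raise ValueError(f"unmapped code {local_code!r} from source {source!r}")

    print(to_enterprise_code("billing", "WTR-05"))      # PROD-0001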
Thus, ETL tools, in conjunction with metadata and master data management tools, extract data from
sources, transform them into the required format, and load them into a data repository. Usually the
data warehouse itself is used to store the data, but the target can also be an operational data store
(ODS), a staging area, or even a data mart. Therefore, one of the key requirements for ETL tools is
the ability to interact with various systems.
The growing volume of processed data and the need for more responsive provisioning of analytical
information impose increased requirements on the performance and scalability of ETL tools.
Therefore, ETL tools should use various parallel computing schemes and be able to run on
high-performance systems of different architectures.

Pic. 1. Centralized Data Warehouse with ETL

As can be seen, ETL tools must meet a variety of requirements:
- Data from the various source systems must be collected even if one or more systems fail to
complete their data processing in time, and at least part of the required data must be provided.
- The collected information must be recognized and converted in accordance with the transformation
rules, with the help of the metadata and master data management systems.
- The transformed information must be loaded into a staging zone, a data warehouse, an ODS, or a
data mart, as required by business and production processes.
- ETL tools must have high throughput to collect and load the ever-increasing data volumes into the
various repositories.
- ETL tools must possess high performance and scalability to reduce data processing time and to
shorten the lag in providing data for analytical tasks.
- ETL tools should provide various data extraction instruments for different operating environments:
from batch data collection systems that are not sensitive to time delays to near-real-time incremental
data processing.
Given these often mutually exclusive requirements, the design and development of ETL tools
becomes a difficult task, especially when ready-made solutions are not used.
Centralized Data Warehouse with ELT
The traditional ETL system is often blamed for poor efficiency and the high cost of the dedicated
hardware and software it requires. As an alternative to ETL, ELT (Extract, Load, Transform) tools
were proposed, which are credited with high productivity and efficient use of equipment.
To understand the comparative advantages and disadvantages of ETL and ELT systems, let us turn
to the three main functions of an enterprise data warehouse (EDW):
1. Full and timely collection and processing of information from data sources;
2. Safe and secure data storage;
3. Provision of data for analytical tasks.
The input to ETL/ELT systems is disparate data which have to be compared, cleansed, transformed
to a common format, and processed according to calculation algorithms. On the one hand, data
practically do not stay in ETL/ELT systems; on the other hand, the main information stream flows
through these systems to the data repositories. Therefore, the information security requirements for
them can be moderate.

Pic. 2. Centralized Data Warehouse with ELT

As a rule, a central data warehouse (CDW) contains a wealth of information whose full disclosure
could lead to serious losses for the company. A reliable information security perimeter is therefore
required around the CDW. The data structures in the CDW should best fit the requirements of
long-term, reliable and secure storage. Using the ELT approach means that the CDW must also
perform the data transformation.
Data delivery for analytical tasks requires a specific reorganization of data structures for each
analytical application: multidimensional analysis requires data cubes; statistical analysis, as a rule,
uses data series; scenario and model analysis may use MS Excel files. In this architecture, business
applications use data from the CDW directly, so the CDW has to store data structures that are
optimized both for current and for future business applications. Moreover, such direct access
increases the risk of unauthorized access to all the data in the CDW.
Thus, this architecture entrusts the CDW with both the data transformation function and information
services for analytical tasks. Both of these functions are foreign to a CDW, which in this form
becomes an "all in one" unit whose functional components are generally of lower quality than if they
were implemented separately (like a camera in a mobile phone).
We will discuss later how data storage functions can be separated from the functions of data delivery
to analytical applications.
Implementation of the ETL scheme makes it possible to separate data processing and data storage
functions, whereas the ELT scheme burdens the CDW with data conversion functions that do not
belong to it. Migrating the ETL functionality into the CDW forces us not only to provide the same
processing power, but also to design a universal platform that can efficiently both process and store
data. This approach may be acceptable in the SOHO segment, but an enterprise-wide system such as
an EDW requires an adequate solution.
Despite the claimed performance advantages of the ELT scheme, in practice it turns out that:
1. Data quality affects data load time. For example, ETL may discard up to 90% of the data as
duplicates during cleansing and transformation, whereas ELT will load all the data into the CDW,
where the cleansing will take place.
2. The data transformation rate inside the CDW depends strongly on the processing algorithms and
data structures. In some cases SQL processing within the CDW's database is more efficient; in other
cases external programs that extract the data, process them and load the results back into the CDW
run much faster (a minimal sketch of the two placements of the transformation step follows this list).
3. Some algorithms are very difficult to implement with SQL statements and stored procedures. This
restricts the use of the ELT scheme, whereas ETL can use more appropriate and effective data
processing tools.
4. ETL is a single place where the data extraction, processing and loading rules reside, which
simplifies testing, modification and operation of the algorithms. ELT, by contrast, separates the data
collection and loading algorithms from the transformation algorithms. So, to test a new
transformation rule, we must either risk the integrity of the production data warehouse or create a
test copy of the repository, which is very costly.
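As mentioned in item 2, the following sketch contrasts the two placements of the transformation step, using an in-memory SQLite database as a stand-in for the CDW; the table and column names are invented, and the example makes no claim about the relative performance of the two approaches.

    # ELT vs. ETL placement of the transformation step, with SQLite as a stand-in CDW.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_sales (amount_cents INTEGER)")
    con.executemany("INSERT INTO raw_sales VALUES (?)", [(1250,), (990,)])

    # ELT style: load the raw data first, then transform inside the database with SQL.
    con.execute(
        "CREATE TABLE sales_elt AS SELECT amount_cents / 100.0 AS amount FROM raw_sales"
    )

    # ETL style: transform outside the database, load only the final form.
    raw_rows = [(1250,), (990,)]
    transformed = [(cents / 100.0,) for (cents,) in raw_rows]
    con.execute("CREATE TABLE sales_etl (amount REAL)")
    con.executemany("INSERT INTO sales_etl VALUES (?)", transformed)

    print(con.execute("SELECT * FROM sales_elt").fetchall())   # [(12.5,), (9.9,)]
    print(con.execute("SELECT * FROM sales_etl").fetchall())   # [(12.5,), (9.9,)]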
Thus, comparing ETL and ELT, we see that the advantages in data loading and transformation are
not clear-cut, that ELT faces SQL constraints in data conversion, and that the savings on ELT
software and hardware turn into the cost of creating a software and hardware test copy of the CDW.
The use of ELT may be justified if:
1. There are no stringent requirements for DW reliability, performance, and security.
2. Budget constraints force one to accept the risk of data loss.
3. The data warehouse and data sources interact via a service bus (SOA).
The latter case is the most exotic, but it has a right to exist under certain conditions. Here the service
bus is responsible for integrating the data sources and the DW at the messaging level, and for
minimal (by DW standards) data conversion and loading into the DW.
Centralized DW with Operational Data Store
Data extraction, transformation and loading processes naturally take some time to complete.
Additional delay is caused by the need to check the data loaded into the DW for consistency with the
data already there, to consolidate data, and to recalculate totals based on the new data.
The operational data store (ODS) was proposed in 1998 [2] to reduce the time lag between the
receipt of information from ETL and its availability to analytical systems. An ODS holds less
accurate information because internal checks are omitted, and more detailed data because the
consolidation phase is skipped. Therefore, data from the ODS are intended for tactical decisions,
while information from the central data warehouse (CDW) is better suited to strategic tasks [3].
Imagine a company that sells drinks and snacks from vending machines throughout the country.
Fifteen minutes of downtime for an empty machine means lost profit, so it is crucial to monitor the
status of each machine and restock it with the missing goods. Collecting and processing all the
information across the country may take several hours, whereas product delivery is local: every city
has a warehouse from which drinks and snacks are delivered to the nearest outlets, and the
warehouses are replenished through centralized procurement. Thus, there are two different types of
tasks: tactical (filling vending machines) and strategic planning (filling warehouses).

Pic. 3. Centralized DW with Operational Data Store

Indeed, if an extra bottle of water is delivered because of incomplete or inaccurate data in the ODS,
it will not lead to serious losses. However, a planning error caused by low data quality in the ODS
may adversely affect a decision on the types and volumes of bulk purchases.
The information security requirements for the CDW and the ODS are also different. In our example,
the ODS stores recent data for no more than a couple of hours, while the CDW stores historical
information that may cover several years to allow better prediction of the required purchase volumes.
This historical information can be of considerable commercial interest to competitors. So tactical
analysts can work directly with the ODS, while strategic analysts must work with the CDW through
a data mart to delineate responsibility. Tactical analysts get nearly online access to the data because
there is no data mart in between; the data mart does not hinder strategic analysis, since such analysis
is carried out on a monthly or even quarterly basis.
The architecture shown in Pic. 3 involves direct interaction between the CDW and business
applications. The strengths and limitations of this approach are considered in the section "Extended
Model with Data Marts". For now it should be noted that the ODS actually plays the additional role
of a staging zone when data move sequentially from the ODS to the CDW: tactical analysts working
with data in the ODS, wittingly or unwittingly, reveal errors and contradictions in the data and
thereby improve their quality.
In this scheme, corrected data from the ODS are transferred to the CDW. There are, however, other
schemes, for example, when data from ETL flow into the ODS and the CDW in parallel, and
unneeded data are simply erased from the ODS after use. This scheme is applicable where human
intervention in the data could only distort them, deliberately or not.
Extended Model with Data Marts
Direct access of business applications to the CDW is admissible if user requests do not interfere with
the normal functioning of the CDW, if users communicate with the CDW over high-speed lines, and
if accidental access to all the data in the CDW would not lead to serious losses.
Administering direct user access to the CDW is an extremely difficult task. For example, a user from
one department may be authorized to access data from another unit only 10 days after the data
become available; another user may see only aggregates, but no detailed data. There are other, more
complicated access rules, and managing, tracking and changing them inevitably leads to errors
caused by the combination of intricate access conditions.
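A minimal sketch of such direct-access rules (all of them hypothetical) shows how quickly the policy code accumulates special cases:

    # A minimal sketch of direct-access rules against a CDW; every rule below is
    # hypothetical, and each new department or exception adds another branch.
    from datetime import date, timedelta

    def can_see(user, record, today):
        if user["dept"] == record["dept"]:
            return True
        # Cross-department data become visible only 10 days after publication...
        if today - record["published"] >= timedelta(days=10):
            # ...and some users may still see aggregates only, not detailed rows.
            return not (user["aggregates_only"] and record["detailed"])
        return False

    user = {"dept": "sales", "aggregates_only": True}
    record = {"dept": "finance", "published": date(2009, 5, 6), "detailed": True}
    print(can_see(user, record, today=date(2009, 5, 20)))   # False: detailed row hidden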
Data marts that contain only the information intended for a specific group of users significantly
reduce the risk of information security breaches.
Even today, the quality of communication lines is a serious problem for geographically distributed
organizations. In the event of a line failure or insufficient bandwidth, remote users are denied access
to the information in the CDW. The solution is remote data marts, which are filled either after
working hours or incrementally, as information becomes available, using assured data transfer.
Various business applications require different data formats: multidimensional cubes, data series,
two-dimensional arrays, relational tables, MS Excel files, comma-separated values, XML files, and
so on. No single data structure in the CDW can meet all these requirements. The solution is to create
data marts whose data structures are optimized for the specific requirements of individual
applications.
Another reason for creating data marts is the reliability requirement for the CDW, which is often
defined as four or five nines. This means that CDW downtime can be no more than about 5 minutes
per year (99.999%) or no more than about 1 hour per year (99.99%). Creation of a hardware
and software system with such characteristics is a complex and expensive engineering task.
Requirements for protection against terrorist attacks, sabotage and natural disasters further
complicate the construction of the software and hardware system and the implementation of appropriate
organizational arrangements. The more complex such a system is and the more data it stores, the
higher the cost and complexity of its support.
Data marts dramatically reduce the CDW load, both in the number of users and in the volume of data
in the repository, since these data can be optimized for storage rather than for query processing.


Pic. 4. Extended Model with Data Marts

If the data marts are filled directly from the CDW, the actual number of users is reduced from hundreds
and thousands to the tens of data marts, which become the CDW's users. Implementation of SRD (Sample,
Restructure, Delivery) tools reduces the number of users to one and only one. In this case, the logic
of supplying information to the data marts is concentrated in the SRD, so the DMs can be optimized for
serving user requests, while the CDW hardware and software can be optimized exclusively for reliable,
secure data storage.
SRD tools also ease the CDW workload: different data marts may need the same data, but the SRD
retrieves the data once, converts them to various formats and delivers them to the different
data marts.
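The SRD idea can be sketched in a few lines of Python; the query, the output formats and the delivery targets below are invented for illustration and stand in for real SRD tooling. The point is that the CDW is queried exactly once, and the restructuring for each data mart happens outside it.

import csv, io, json

# Minimal sketch of the SRD idea: one extract from the CDW, several deliveries.
def extract_once():
    # Stands in for a single query against the CDW.
    return [{"region": "north", "sales": 120}, {"region": "south", "sales": 95}]

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

FORMATTERS = {"finance_mart": json.dumps, "regional_mart": to_csv}

def srd_deliver(marts):
    rows = extract_once()                 # CDW is queried exactly once
    for mart in marts:
        payload = FORMATTERS[mart](rows)  # restructure per data mart
        print(f"delivering {len(payload)} bytes to {mart}")

srd_deliver(["finance_mart", "regional_mart"])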
Conclusion
The paper considers the following architectures: a centralized DW with an ETL system, a DW with an ELT
system, a central data warehouse with an operational data store (ODS), and an extended model with data
marts. The next papers will discuss the advantages and limitations of centralized ETL with parallel DW
and data marts, DW with intermediate application data marts, data warehouse with an integration bus,
and the recommended DW architecture.
Literature
1. Asadullaev S. Data Warehouse Architectures - I, 19.10.2009,
http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html
2. Inmon, W. The Operational Data Store. Designing the Operational Data Store. Information
Management Magazine, July 1998.
3. Building the Operational Data Store on DB2 UDB Using IBM Data Replication, WebSphere MQ
Family, and DB2 Warehouse Manager, SG24-6513-00, IBM Redbooks, 19 December 2001,
http://www.redbooks.ibm.com/abstracts/sg246513.html?Open
Data Warehouse Architectures - III
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
03.11.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html
Abstract
The series of three articles is devoted to data warehouse (DW) architectures and their
predecessors. The abundance of various approaches, methods and recommendations leads to a tangle
of concepts, advantages and drawbacks, limitations and applicability of specific architectural
solutions. The first article [1] is concerned with the evolution of the understanding of OLAP's role,
of DW architecture components, of virtual DWs and of independent data marts. The second article [2]
considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT
(Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts.
This article discusses centralized ETL with parallel DW and data marts; DW with intermediate
application data marts, DW with integration bus, and the recommended DW architecture.
Centralized ETL with parallel DW and data marts
In this case the whole EDW architecture is built around the ETL (Extract, Transform and Load)
system. Information from disparate sources goes to the ETL, which purifies and harmonizes the data and
loads them into the central data warehouse (CDW), into the operational data store (ODS), if any, and, if
necessary, into a temporary staging area. This is common practice in EDW development. But
loading data from the ETL directly into the data marts is unusual.
In practice, this architecture results from the users' requirement to access analytical data as soon as
possible, without delay. An operational data store does not solve the problem, as users may be
located in distant regions and require territorial data marts. Security limitations on the
deployment of heterogeneous information in the ODS may be another rationale for this architecture.
This architecture has a weak spot: one of its operational problems is the difficulty of data recovery
after a crash of a data mart supplied directly from the ETL. The point is that ETL tools are not designed
for long-term storage of extracted and cleaned data, and transactional systems tend to focus on ongoing
operations. Therefore, in case of data loss in data marts directly associated with the ETL, one has to
either extract information from the backups of the transactional systems or organize historical archives
of the source systems. These archives require funds for development and operational support,
and they are redundant from a corporate standpoint, since they duplicate functions of the EDW yet
are designed to support only a limited number of data marts.
As another approach, these data marts are sometimes connected both to the ETL directly and to the data
warehouse, which leads to confusion and misalignment of the results of analytical work. The reason
is that data coming into the EDW, as a rule, pass additional checks for consistency with the already
loaded data. For example, a financial document can arrive with attributes almost coinciding with those
of a document received by the EDW earlier. The ETL system, having no information about all
previously loaded data, cannot tell whether the new document is a mistake or the result of a legitimate
correction.
Data verification procedures running inside the data warehouse can resolve such uncertainty: the
new data are discarded in case of an error, whereas a legitimate correction changes both the figures
in question and the corresponding aggregates.
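A simplified sketch of such an in-warehouse check is shown below; the matching key (document number and date) and the rule "same amount means duplicate, different amount means correction" are assumptions made purely for illustration.

# The history of already loaded documents, which the ETL alone does not have.
loaded = {("INV-100", "2009-09-30"): {"amount": 1000.0}}

def verify(doc):
    key = (doc["number"], doc["date"])
    existing = loaded.get(key)
    if existing is None:
        loaded[key] = {"amount": doc["amount"]}
        return "loaded"
    if abs(existing["amount"] - doc["amount"]) < 0.005:
        return "discarded as duplicate"
    # A legitimate correction: update the detail row; aggregates would then be
    # recalculated from the corrected detail data.
    existing["amount"] = doc["amount"]
    return "applied as correction"

print(verify({"number": "INV-100", "date": "2009-09-30", "amount": 1000.0}))
print(verify({"number": "INV-100", "date": "2009-09-30", "amount": 1250.0}))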


Pic.1. Centralized ETL with parallel DW and data marts

Thus, the information loaded into a data mart directly from the ETL may contradict the data received
from the EDW. Sometimes, to resolve this contradiction, identical data verification algorithms are
implemented both in the data marts and in the EDW. The disadvantage is the need to support and
synchronize the same algorithms in the EDW and in the data marts fed directly from the ETL.
To sum up, parallel data marts lead to additional data processing, to the organization
and maintenance of excess operational archives, to the support of duplicate applications and to
decentralized data processing, which causes information mismatch.
Nevertheless, parallel data marts can be implemented in cases where rapid access to analytical
information is more important than the disadvantages of this architecture.
DW with intermediate application data marts
The following assumptions were the rationales for this architecture's invention.
1. Some companies still deploy and operate independent, disparate application data marts. Data
quality in these data marts can meet the requirements of the analysts who work with them.
2. Project stakeholders are confident that enterprise data warehouse implementation is a deadly
technical trick with unpredictable consequences. As a matter of fact, the difficulties of EDW
development and implementation are not technical, but are associated with poor project
organization and with the lack of involvement of experts - the future EDW users. Nevertheless,
instead of improving project organization, the project team tries to sidestep technological
issues it considers insignificant and to simplify the immediate tasks.
3. The requirement for quick results. The necessity to report on a quarterly basis creates a need
for quick tangible results. That is why the project team is not immune to the temptation to
develop and implement a restricted solution with no relation to other tasks.
Following these principles either accidentally or deliberately, companies start data integration by
introducing separate independent data marts, in the hope that the data they contain will be easily,
simply and quickly integrated when required. The reality is much more complicated. Although the
quality of data in the data marts can satisfy their users, this information is not consistent with data
from other DMs. So reports prepared for top management and decision makers cannot be reconciled
into a single consistent view.
The same indicators can be calculated by different algorithms based on different data sets for
various periods of time. Figures with the same name may conceal different entities, and vice versa,
the same entity may have different names in various DMs and reports.

Pic. 2. DW with intermediate application data marts

The diagnosis is a lack of common understanding of the data's meaning. Users of independent data marts
speak different business languages, and each DM contains its own metadata.
Another problem lies in the differences between the master data used in the independent data marts. The
differences in data encoding and in the codifiers, dictionaries, classifiers, identifiers, indices and
glossaries in use make it impossible to combine these data without serious analysis, design and
development of master data management tools.
However, the organization already has approved plans, a budget and a timeline for an EDW based
on independent data marts. Management expects to get results quickly and inexpensively.
Developers, provided with a scarce budget, are forced to implement the cheapest solutions. This is a
proven recipe for creating a repository of inconsistent reports. Such a repository contradicts the idea
of data warehousing as the single and sole source of purified, coherent and consistent historical data.
Obviously, neither the company management nor the repository users are inclined to trust the
information contained therein. Therefore, a total rebuilding of the DW is required, which usually implies
that a new EDW should be created that stores the underlying report indicators rather than full reports.
This allows the indicators to be aggregated into consistent reports.
Successful EDW rebuilding is impossible without metadata and master data management systems.
Both systems will affect only the central data warehouse (CDW), as the independent data marts contain
their own metadata and master data.
As a result, management and experts can get coherent and consistent records, but they cannot trace
the origin of the data, due to the discontinuity in metadata management between the independent data
marts and the CDW.
Thus, the desire to achieve immediate results and to demonstrate rapid progress leads to the rejection
of unified, end-to-end management of metadata and master data. The result of this approach is
semantic islands whose users speak a variety of business languages.
Nevertheless, this architecture can be implemented where a single data model is not necessary or is
impossible, and where a relatively small amount of data must be transferred to the CDW without
knowledge of its origin and initial components. For example, an international company operating
in different countries may have already implemented several national data warehouses that follow local
legal requirements, business constraints and financial accounting rules. The CDW may require only a
portion of the information from the national DWs for corporate reporting. There is no need to develop a
unified data model, because it would not be demanded at the national level.
Certainly, such a scheme requires a high degree of confidence in the national data and can be used
only if intentional or unintentional distortion of the data does not lead to serious financial
consequences for the entire organization.
Data Warehouse with Integration Bus
Widespread acceptance of service-oriented architecture (SOA) [3] has led to the idea of using SOA in
enterprise data warehousing solutions instead of ETL tools to extract, transform and load data into a
central data warehouse, and instead of SRD tools to sample, restructure and deliver data to the data
marts. The integration bus, which underpins SOA, is designed for web service and application
integration, and provides intelligent message routing, protocol mediation and message transformation
between service consumer and service provider applications.
At first glance, the functionality of the service bus allows the ETL and SRD to be replaced by the
integration bus. Indeed, the ETL mediates between the central data warehouse (CDW) and the data
sources, and the SRD is the mediator between the CDW and the data marts. It would seem that replacing
the ETL and SRD with the integration bus could benefit from the flexibility provided by the bus for
application integration.
Imagine that the CDW, the operational data store (ODS), the temporary storage area, and the metadata
and master data management systems call the bus as independent applications with queries to update
data from the data sources.
First of all, the load on the data sources will increase many times over, since the same information will
be transmitted repeatedly at the request of the CDW, the ODS, the temporary storage area and the
metadata and master data management systems. The obvious solution is to give the integration bus its
own data store to cache query results.
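The caching idea can be sketched as follows; the fetch function, the query text and the TTL are hypothetical. A request arriving within the TTL is answered from the bus's own store, so the source is hit only once regardless of how many EDW components ask for the same extract.

import time

# Illustrative bus-side cache: the source is queried once per TTL period,
# later identical requests are served from the bus's own store.
_cache = {}

def fetch_from_source(query):
    print(f"hitting source for: {query}")
    return [("row-1",), ("row-2",)]

def bus_request(query, ttl_seconds=300):
    entry = _cache.get(query)
    if entry and time.time() - entry["at"] < ttl_seconds:
        return entry["rows"]              # served from the bus cache
    rows = fetch_from_source(query)       # source is queried only once per TTL
    _cache[query] = {"rows": rows, "at": time.time()}
    return rows

bus_request("SELECT * FROM daily_sales")  # hits the source
bus_request("SELECT * FROM daily_sales")  # served from cache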



Pic. 3. Data Warehouse with Integration Bus

Secondly, the data gathering procedures, previously centralized in the ETL, are now scattered across the
applications requesting the data. Sooner or later, discrepancies will arise among the various data
gathering procedures of the CDW, the ODS, and the metadata and master data management systems.
Data collected by different methods at different time intervals and processed by different algorithms
contradict each other. Thereby the main goal of creating the CDW as the single source of consistent,
non-contradictory data is defeated.
The consequences of replacing the SRD with the integration bus are not so dramatic. The CDW must be
turned into a service in order to respond to data mart requests for data routed through the
integration bus. This means that the data warehouse must conform to the most common style of web
services and support the HTTP/HTTPS protocols and the SOAP/XML message format. This approach
works well for short messages, but data marts usually require large volumes of data to pass through
the integration bus.
The task can be solved by transmitting binary objects. The necessary data restructuring
cannot be performed by the integration bus and must be carried out either in the CDW or in the
data marts. Data restructuring inside the CDW is unusual functionality for it: the CDW would have to be
aware of all data marts and carry an additional workload irrelevant to its main goal, reliable data
storage. Data restructuring inside the data marts requires direct access from the DM to the CDW, which
in many cases is unacceptable for security reasons. This function can be realized by a proxy service that
receives the data and transmits them to the data marts after restructuring. So we return to the idea of
an SRD tool, merely supplied with a bus interface.
Thus, the integration bus can be used in the EDW architecture as a transport medium between the data
sources and the ETL and between the SRD and the data marts in those cases where the components of the
EDW are geographically separated and sit behind firewalls in accordance with strict data protection
requirements. In this case, it is sufficient for interoperability that the exchange be enabled
over the HTTP/HTTPS protocols. All data collection, transformation and dissemination logic should
still be concentrated in the ETL and SRD.
Recommended EDW Architecture
The architecture of an enterprise data warehouse (EDW) should satisfy many functional and
nonfunctional requirements that depend on the specific tasks the EDW solves. Just as there is no
generic bank, airline or oil company, there is no single EDW solution to fit all occasions.
But the basic principles that an EDW must follow can still be formulated.
First and foremost is data quality, which can be understood as complete, accurate and
reproducible data delivered on time to where they are needed. Data quality is difficult to measure
directly, but it can be judged by the decisions made. That is, data quality requires investment, and in
turn it can generate profits.
Secondly, it is the security and reliability of data storage. The value of information stored in EDW
can be compared to the market value of the company. Unauthorized access to EDW is a threat with
serious consequences, and therefore adequate protection measures must be taken.
Thirdly, the data must be available to the employees to the extent necessary and sufficient to carry
out their duties.
Fourthly, employees should have a unified understanding of the data, so a single semantic space is
required.
Fifthly, it is necessary, if possible, to resolve conflicts in data encoding in the source systems.

Pic. 4. Recommended EDW Architecture

The proposed architecture follows the examined principle of modular design: "unsinkable
compartments". The strategy of "divide and rule" is applicable not only in politics. By separating the
architecture into modules, we concentrate certain functionality in each of them in order to gain power
over the unruly IT elements.
ETL tools provide complete, reliable and accurate information gathering from data sources by
means of the algorithms concentrated in the ETL for data collection, processing and conversion, and for
interaction with the metadata and master data management systems.
The metadata management system is the principal "keeper of wisdom" that can be asked for advice.
It keeps business, technical, operational and project metadata up to date.
The master data management system is an arbitrator that resolves data encoding conflicts.
The central data warehouse (CDW) carries only the workload of reliable and secure data storage.
Depending on the tasks, the reliability of the CDW can be up to 99.999%, ensuring smooth functioning
with no more than about 5 minutes of downtime per year. The CDW's software and hardware tools can
protect data from unauthorized access, sabotage and natural disasters. The data structure in the CDW is
optimized solely for effective data storage.
The data sample, restructure and delivery (SRD) tools are, in this architecture, the only users of the
CDW; they take on the whole job of filling the data marts and thereby reduce the user query workload
on the CDW.
Data marts contain data in formats and structures that are optimized for the tasks of their specific
users. At present, when even a laptop can be equipped with a terabyte disk drive, the problems
associated with multiple duplication of data in the data marts no longer matter. The main advantages of
this architecture are:
comfortable user operation with the necessary amount of data,
the ability to quickly restore the contents from the CDW in case of a data mart failure,
off-line data access when the connection with the CDW is lost.
This architecture allows separate design, development, operation and refinement of individual
EDW components without a radical overhaul of the whole system. This means that starting work on an
EDW does not require enormous effort or investment. To start, it is enough to implement a data
warehouse with limited capabilities and, following the proposed principles, to develop a prototype that
works and is truly useful for users. Then the bottlenecks should be identified and the required
components evolved.
Implementation of this architecture, along with the triple strategy for data, metadata and master data
integration [4], allows the time and budget needed for EDW implementation to be reduced and the
EDW to be developed in accordance with changing business requirements.
Conclusion
The article discusses the advantages and limitations of the following architectures: centralized ETL
with parallel DW and data marts, DW with intermediate application data marts, data warehouse with an
integration bus, and the recommended EDW architecture.
The recommended corporate data warehouse architecture allows a workable prototype that is useful to
business users to be created in a short time and with minimal investment. The key element of this
architecture, which enables evolutionary development of the EDW, is the introduction of metadata and
master data management systems at early stages of development.
Literature
1. Asadullaev S. Data Warehouse Architectures - I, 19.10.2009,
http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html
2. Asadullaev S. Data Warehouse Architectures - II, 23.10.2009,
http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html
3. Bieberstein N., Bose S., Fiammante M, Jones K., Shah R. Service-Oriented Architecture
Compass: Business Value, Planning, and Enterprise Roadmap, IBM Press, 2005.
4. Asadullaev S. Data, metadata, master data: the triple strategy for data warehouse project,
09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html


Data, metadata and master data: the triple strategy for data
warehouse projects
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
09.07.2009
http://www.ibm.com/developerworks/ru/library/r-nci/index.html
Abstract
The concept of data warehousing emerged in the early 90s. Since then, many data warehouse
implementation projects have been carried out, but not all of them were successfully completed.
One of the most important reasons for failure is the problem of common interpretation of data meaning,
data cleaning, alignment and reconciliation. The article shows that three interrelated projects
for data, metadata and master data integration should be performed simultaneously in order to
implement an enterprise data warehouse (EDW).
Introduction
The largest companies have been implementing DWs since the mid 90s. Previous projects cannot be
considered unsuccessful, as they solved the tasks requested of them, in particular, providing the
company's management with consistent, reliable information, at least in some areas of the company's
business. However, the growth of companies, changes in legislation and increased needs for strategic
analysis and planning require further development of the data warehouse implementation strategy.
By now companies have understood that a successful data warehouse requires a centralized system for
master data and metadata management. Unfortunately, these projects are still performed separately. It
is generally assumed that the development of an enterprise data warehouse is a project of integrating
data from disparate sources. Yet the sources contain not only data but also master data, as well as
metadata elements. Typically, large companies start a data warehouse project without allocating funds
and resources for metadata and master data management.
Project sponsors, the steering committee and other decision makers usually try to allocate funds for the
project phase by phase and tend to implement these three projects sequentially. As a result, the
projects' budgets, completion periods and quality do not meet the initial requirements because of the
need for changes and improvements to IT systems built in the previous project.
In most cases the need for a data warehouse project comes from business users who are no
longer able to bring together data from different information systems. That is, it is precisely the
requirements of business users that define, first of all, the information content of the future data
warehouse.
The data sources for the future DW are transactional databases (OLTP), legacy systems, file storage,
intranet sites, archives, and isolated local analytic applications. First, you need to determine where
the required data are located. Since, as a rule, these data are stored in different formats, you must
bring them to a single format. This task is performed by a fairly complex system of data extraction,
transformation and loading (ETL) into the data warehouse.
The ETL procedures cannot be accomplished without an accompanying analysis of metadata and
master data. Moreover, the practice of data warehouse implementation has shown [1] that the
metadata created and imported from various sources in fact drive the entire data collection process.
Master data management
Reference data and master data include glossaries, dictionaries, classifiers, indices, identifiers, and
codifiers. In addition to standard dictionaries, directories and classifiers, each IT system has its own
master data required for its operation. As long as several information systems operate in
isolation from each other, the problems caused by differences in master data usually do not
arise. However, if the reported data of two or more different systems have to be combined, the
discrepancies in master data make it impossible to merge tables directly. Such cases require
a "translator" that brings the codes stored in multiple tables to a unified form. In addition, master
data, although infrequently, do change, and consistently updating master data in all information
systems is a challenge.
Thus, there is a need to establish a master data management system, which helps coordinate master
data changes in the various information systems and simplifies data integration from these systems.
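A minimal sketch of such a "translator" is shown below; the system names, the local codes and the unified codes are all invented. It maps local codes from two systems to a unified code before their figures are merged, which is exactly the step that fails when no such crosswalk exists.

# Invented crosswalk: local product codes from two systems mapped to unified codes.
CROSSWALK = {
    ("billing", "P-001"): "PRD-1001",
    ("crm",     "A17"):   "PRD-1001",
    ("crm",     "B02"):   "PRD-2002",
}

def to_unified(system, local_code):
    try:
        return CROSSWALK[(system, local_code)]
    except KeyError:
        raise ValueError(f"no mapping for {system}/{local_code}") from None

billing_rows = [("P-001", 10)]
crm_rows = [("A17", 4), ("B02", 7)]

merged = {}
for system, rows in (("billing", billing_rows), ("crm", crm_rows)):
    for code, qty in rows:
        unified = to_unified(system, code)
        merged[unified] = merged.get(unified, 0) + qty

print(merged)   # {'PRD-1001': 14, 'PRD-2002': 7}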
Metadata management
The prototype for metadata management was the Data Dictionary / Directory System, which was
designed for logical centralization of information about data resources and was intended to serve as a
tool for enterprise data resource management [2].
Data sources, including transactional systems, contain metadata in an implicit form. For example,
the table names and the column names in the tables are technical metadata, while the definitions of
the entities stored in the tables represent business metadata. Application statistics that
can be collected by monitoring systems should be classified as operational metadata. The relationships
between project roles and database access rights, including administration rights, and the data
for audit and change management usually belong to the project metadata. Finally, business
metadata are the most important kind of metadata and include business rules, definitions,
terminology, glossaries, data origin and processing algorithms.
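As a small illustration of how implicit technical metadata can be harvested, the sketch below reads table and column names from a relational source; an invented SQLite schema stands in for a real transactional system.

import sqlite3

# Sketch: extract implicit technical metadata (table and column names) from a source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (purchase_id INTEGER, item_code TEXT, qty INTEGER)")

def technical_metadata(conn):
    meta = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        meta[table] = [(col[1], col[2]) for col in columns]
    return meta

print(technical_metadata(conn))
# {'purchases': [('purchase_id', 'INTEGER'), ('item_code', 'TEXT'), ('qty', 'INTEGER')]}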
Many sources contain metadata elements, but almost never a full set of metadata. The result is a
mismatch in the reports provided by different source systems. For example, in one report the
production volume can be calculated in dollars, in another in pieces, and in a third on a
total weight basis. That is, the same field "production volume" may contain quite different data in
different reports.
Such a mismatch of data meaning in reports forces companies to develop and implement an integrated
system of unified indicators, reports and terminology.
Data, metadata and master data interrelations
DW structure consists of three main information levels: detailed, summary and historical data, as
well as their accompanying metadata [3]. Now it is clear that this list should be complemented by
master data. The relationship between data, metadata and master data can be visualized as a triangle
(Pic. 1).
As can be seen from the figure, all the relationships fall into three pairs:
Data – metadata
Data – master data
Metadata – master data
Consider each pair in more detail.


Pic. 1. Data, metadata and master data interrelations
Data and metadata
The interdependence of data and metadata can be shown by the following example. We can assume
that any book contains data. The library catalog card is the metadata that describes the book. The
collection of cards is a library catalog, which can itself be treated as a set of data (a database). A card
should be filled in according to certain rules, which are specified in a book on librarianship
(meta-metadata). This book should also be placed in the library, and its catalog card must be prepared
and placed in the appropriate drawer of the catalog cabinet, where the rules for using the catalog can
be found. The question of whether these rules are meta-meta-metadata we leave to the reader as homework.
If you already hold the book you need in your hands, you do not need a catalog card for it.
Most home libraries do not have catalogs because the owners know their library; they are its
creator and its user. And they act as librarians if someone asks for a book from their
library. But a large public library cannot operate without a catalog.
The situation in enterprise information systems is not so simple and obvious. Despite the fact that
the first publications on the need for data dictionary systems appeared in the mid 80s, corporate
resources are still designed, developed and operated in isolation, without a unified semantic space.
In libraries, this situation would mean that a reader in one library could not even tell whether the
required book exists in another library.
In 1995 an article was published [4] which stated that successful data integration requires
establishing and maintaining a metadata flow. In the language of library users this discovery sounds
something like this: "Libraries need to share information about books in a single format." It is now
clear that this requirement needs to be refined, since metadata are generated at all stages of the
development and operation of information systems.
Business metadata are generated at the initial stage of system design. They include business rules,
definitions, terminology, glossaries, data origin and processing algorithms, described
in the language of business.
At the next stage, logical design, technical metadata appear, such as the names of entities and the
relationships between them. Table names and column names also belong to the technical metadata,
and are determined at the physical design stage of the system.
The metadata of the production stage are operational metadata, which include statistics on computing
resource usage, user activity and application statistics (e.g., frequency of execution, number of
records processed, component-wise analysis).
The metadata that document the development effort, provide data for project audit, assign
metadata stewards and support change management belong to the project metadata.
Operational metadata are the most undervalued. Their importance may be
demonstrated by the example of a large company which provides customers with a variety of web
services. At the heart of the company's IT infrastructure resides a multi-terabyte data warehouse,
around which custom local client databases and applications are built. The marketing department
receives clients' orders, the legal department manages contracts, and the architectural team develops
and provides documentation for project managers to hand over to outsourced development.
When a customer's contract ends, everybody is looking for new clients, and nobody takes the time
to inform the administrators that the customer's application and database can be deleted. As a result,
the overhead for archiving data and applications keeps growing. In addition, the development of
new versions is significantly hampered, since the unused protocols and interfaces still have to be
supported.
Operational metadata management provides administrators and developers with information about
how frequently applications are used. Based on this information, unused applications, data and
interfaces can be identified, and their removal from the system will significantly reduce the costs
of its maintenance and future upgrades.
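A sketch of this use of operational metadata is shown below; the application names, the statistics and the one-year idleness threshold are invented.

from datetime import date, timedelta

# Hypothetical operational metadata: last run date and run counts per application.
usage_stats = {
    "client_a_report": {"last_run": date(2008, 1, 15), "runs_last_year": 0},
    "client_b_portal": {"last_run": date(2009, 6, 30), "runs_last_year": 240},
}

def unused_applications(stats, today, idle=timedelta(days=365)):
    return [name for name, s in stats.items()
            if s["runs_last_year"] == 0 and today - s["last_run"] > idle]

print(unused_applications(usage_stats, today=date(2009, 7, 1)))
# ['client_a_report'] -- a candidate for archiving and removal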
Data and master data
In relational databases designed in accordance with the requirements of normalization, one can
identify two different types of tables. Some contain, for example, a list of goods and their prices
(master data). Other tables contain information on purchases (the data). Without going into the
jungle of definitions, this example shows the difference between data and master data. Many
purchases can be made every second in a large store, but the prices and names of goods, at least
for now, do not change every second.
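The store example can be written down as two table types; the schema and values below are invented for illustration: a slowly changing master data table (goods and prices) and a fast-growing data table (purchases).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE goods (item_code TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.execute("CREATE TABLE purchases (sold_at TEXT, item_code TEXT REFERENCES goods, qty INTEGER)")

conn.execute("INSERT INTO goods VALUES ('W01', 'water', 0.80)")        # master data
conn.execute("INSERT INTO purchases VALUES ('10:00:01', 'W01', 2)")    # data
conn.execute("INSERT INTO purchases VALUES ('10:00:02', 'W01', 1)")    # data, every second

# The short code 'W01' keeps the purchases table compact and consistent.
total = conn.execute("""SELECT g.name, SUM(p.qty) * g.price
                        FROM purchases p JOIN goods g USING (item_code)
                        GROUP BY g.name""").fetchone()
print(total)   # ('water', 2.4...)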
Master data in relational databases perform several functions. They help reduce the number of data
entry errors and support more compact data storage through the use of short codes instead of long
names. In addition, master data are the basis for standardization and normalization of data. On the one
hand, the presence of a corresponding nationwide classifier will inevitably affect the structure of
the database. On the other hand, the process of bringing data to the third normal form, as a rule, leads
to internal codifiers.
Despite the fact that the ISBN or SIN codes are unique and can be the primary keys in relational
databases, in practice, additional local codifiers are often created.
Metadata and master data
There are various definitions and classifications of master data: by source, by management method,
by the data being classified. For the purposes of this work, we may assume that master data include
codifiers, dictionaries, classifiers, identifiers, indices and glossaries (Table 1).
A classifier, for example the bank identification code (BIC), is managed centrally by an external
organization (the Bank of Russia), which provides the rules for the code. In addition, the classifier may
determine the rules of code usage. For instance, the reuse of bank identification codes by payment
participants is allowed one calendar year after the date of their exclusion from the BIC classifier of
Russia, but not before the Bank of Russia draws up the consolidated balance of payments made
using avisos for that calendar year. The BIC code does not contain a control number.
A three-stage hierarchical classification is adopted in the All-Russia Classifier of Management
Documentation: the class of forms (the first two digits), the subclass of forms (the next two
digits), the registration number (the next three digits), and the control number (the last digit).
A classifier can contain rules for control number calculation or code validation algorithms.
Table 1. Master data types


Metadata in classifiers are the rules for calculating the control number, a description of the
hierarchical classification, and usage regulations for identification codes.
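As an illustration of the code structure just described, the sketch below splits an eight-digit code into its class, subclass, registration and control parts; the sample code value is invented, and the control number algorithm itself is deliberately not reproduced, so only the shape of the code is checked.

# Parse an 8-digit code with the structure described above: 2 + 2 + 3 + 1 digits.
# The control number calculation is not reproduced here.
def parse_classifier_code(code):
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit code")
    return {
        "class": code[0:2],
        "subclass": code[2:4],
        "registration": code[4:7],
        "control": code[7],
    }

print(parse_classifier_code("09010046"))
# {'class': '09', 'subclass': '01', 'registration': '004', 'control': '6'}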
An identifier (e.g., ISBN) is managed by authorized organizations in a decentralized manner. Unlike
classifier codes, identifier codes must follow the rules of control number calculation. The
rules for composing the identifier are developed centrally and are maintained through standards or
other regulatory documents. The main difference from a classifier is that a complete list of identifier
values either is not available or is not needed at the system design phase. The working
list is updated with individual codes during system operation.
The difference between identifier metadata and classifier metadata is their different behavior at
various stages of the system life cycle. Identifier metadata must be defined at the design stage, when
the identifiers are not yet filled with individual values. New identifier values may appear during
system operation. In some cases they do not match the existing metadata, and the metadata should be
revised to eliminate possible misinterpretation of the new identifier values.
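The ISBN-10 check digit is a convenient example of such a control number rule: each character is weighted from 10 down to 1 and the weighted sum must be divisible by 11, with 'X' standing for the value 10. The sketch below validates a code against this rule.

# ISBN-10 validation: the weighted sum of all ten digits must be divisible by 11.
def isbn10_is_valid(isbn):
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X":
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))   # True
print(isbn10_is_valid("0-306-40615-3"))   # False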
A dictionary (e.g., a phone book) is managed by an outside agency. The code numbering (telephone
numbers) is not subject to any rules.
Dictionary metadata are less structured, as is the phone book itself. However, they are also necessary.
For example, if the organization provides several different communication methods (work phone,
home phone, mobile phone, e-mail, instant messaging tools, etc.), the system administrator can describe
the rules for sending a message in case of a system failure.
A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither
checksum calculation algorithms nor coding rules are designed for a codifier. The encoding of the
months of the year is a simple example of a codifier.
Despite the absence of external rules, encoding is carried out in accordance with the designer's
concept and often contains rules (metadata) in an implicit form. For example, for payments made over
the New Year, January can be entered in the codifier again as the 13th month.
An index may be just a numeric value (for example, a tax rate) derived from an
unstructured document (an order, a law, an act). It would be unreasonable to embed such a numeric
value directly in an algorithm, since changing it would require finding all occurrences in the program
text and replacing the old value with the new one. Therefore indices, isolated in separate tables, are an
important part of the master data.
The metadata of indices define the scope of their application, time limits and restrictions.
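A sketch of an index kept in its own table, together with its metadata (the validity period), is shown below; the rates and dates are invented.

from datetime import date

# An index stored as data rather than hardcoded: a tax rate with its validity period.
TAX_RATES = [
    {"rate": 0.20, "valid_from": date(2004, 1, 1), "valid_to": date(2008, 12, 31)},
    {"rate": 0.18, "valid_from": date(2009, 1, 1), "valid_to": None},
]

def tax_rate(on_date):
    for row in TAX_RATES:
        if row["valid_from"] <= on_date and (row["valid_to"] is None or on_date <= row["valid_to"]):
            return row["rate"]
    raise LookupError(f"no tax rate defined for {on_date}")

# Changing the rate means adding a row here, not editing every program that uses it.
print(tax_rate(date(2009, 7, 9)))   # 0.18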
Glossaries contain abbreviations, terms and other string values that are needed during the
generation of forms and reports. The presence of these glossaries in the system provides a common
terminology for all input and output documents. Glossaries are so close in nature to metadata
that it is sometimes difficult to distinguish between them.
Thus, master data always contain business and technical metadata.
Most of the technical and business metadata are created during the understanding phase of the metadata
management life cycle [5]. Project metadata arise during the development phase and, to a lesser
extent, during the operation phase (e.g., assigning metadata stewards). Operational metadata are
created and accumulated during the operation of the system.
Components of Enterprise Data Warehouse
An enterprise data warehouse (EDW) transforms the data, master data and metadata from disparate
sources and makes them available to the users of analytical systems as a single version of the truth. Data
sources are usually described as transactional databases, legacy systems, various file formats, as well
as other sources whose information must be provided to end users (Pic. 2).
The components of an enterprise data warehouse are:
1. ETL tools used to extract, transform and load data into a central data warehouse (CDW);
2. The central data warehouse, designed and optimized for reliable and secure data storage;
3. Data marts, which provide efficient user access to data stored in structures that are optimal
for specific users' tasks.
The central repository includes, above all, three repositories:
1. The master data repository;
2. The data repository;
3. The metadata repository.
The scheme does not include an operational data store, the staging area, data delivery and
access tools, business applications and other EDW components that are not relevant at this level
of detail.

Pic. 2. Components of Enterprise Data Warehouse

After several unsuccessful attempts to create virtual data warehouses, the need for a data repository
became unquestionable. In the virtual DW architecture, a client program receives data directly from
the sources, transforming them on the fly. The simplicity of the architecture is offset by the waiting
time for query execution and data transformation. The query result is not saved, and the next identical
or similar request requires the data to be converted again, which led to the abandonment of virtual DWs
and to the creation of data repositories.
The present situation with metadata and master data resembles the situation with virtual DWs.
Metadata and master data are used intensively during data extraction, transformation and loading.
The cleaned data are saved in the data warehouse, but the metadata and master data are discarded as
waste material. Creating a repository of metadata and master data significantly reduces EDW
implementation costs and improves the quality of information support for business users by reusing
consistent metadata and master data from a single source.
Example of existing approach
An illustration of the existing approaches to data integration is presented in the paper [6] on a
master data implementation in a bank. The bank spent more than six months reengineering its
planning and forecasting process for performance management. The bank's vice-president
explained that the success of the master data management initiative was due to the fact
that the team focused on solving a local problem, avoiding the "big bang", by which he meant the
creation of an enterprise data warehouse. In his view, the creation of an enterprise master data
management system is a long, difficult and risky job.
In the next step it is planned to create a bank reporting system based on the integration of core
banking systems, in order to use more detailed data compatible with the general ledger. This will
create a financial data repository, which should become the main source for all financial reporting
systems and will support drill-down analysis.
Careful reading of the article leads to the following conclusions. First of all, this project did not
provide for the integration of enterprise data and covered only the reengineering of the planning and
forecasting process. The created data repository appears to be a narrowly thematic data mart and is not
capable of supporting common analytical techniques, such as drill-down analysis.
In contrast, an enterprise data warehouse provides consistent enterprise data to a wide range of
analytical applications. In practice, a single version of the data can be provided only by an enterprise
data warehouse that works in conjunction with unified enterprise master data and metadata
management systems. The article [6] describes how to create a single version of truth for metadata
only for the financial reporting area.
Thus, the project team implemented metadata and master data management for one
specific area of activity. The team deliberately avoided enterprise-wide solutions: neither an
enterprise data warehouse nor an enterprise metadata or master data management system was
implemented. Statements that an enterprise data warehouse cannot be implemented in practice are
refuted by projects performed by IBM employees on a regular basis.
This project is a typical "fast win"; the main objective is to demonstrate a quick, small success. At this
stage no one thinks about the price of application redesign, redevelopment and integration into the
enterprise infrastructure. Unfortunately, we increasingly have to deal with the consequences of the
activity of "quick winners" who avoid complicated, lengthy, and therefore risky decisions.
It should be clarified that small successful projects are quite useful as pilots at the beginning of a
global project, geared to demonstrating a workable solution in a production IT environment. In
this situation all the pros and cons should be weighed. In the worst case, all results of the pilot
project may be rejected due to incompatibility with the enterprise architecture of information systems.
In EDW development the compatibility problem is particularly acute because not only interfaces and
data formats, but also the accompanying metadata and master data have to be coordinated.
The practical realization of the triple strategy
The data warehouse, as corporate memory, should deliver unified, consistent information, but
usually does not, due to conflicting master data and the lack of a common understanding of the data's
meaning. Known solutions analyze metadata and master data as part of the data integration project
without establishing metadata and master data management systems. Implementations of metadata and
master data management systems are usually regarded as separate projects performed after the data
warehouse implementation (Pic. 3).
The drawbacks of such known solutions are the insufficient quality of information delivered to data
warehouse end users due to the lack of consistent metadata and master data management, and extra
expenditures for data warehouse redesign to align the existing data integration processes with the
requirements of the new metadata and / or master data management systems. The result is the
inefficiency of these three systems, the coexistence of modules with similar functionality, wasted
duplicated functionality, a rising development budget, a high total cost of ownership and user
frustration due to discrepancies in data, metadata and master data.



Pic.3. One of the existing workflows

The master data, metadata and data integration projects, performed sequentially in any order, cannot
provide the business with the required quality of information. The only way to solve this problem is
the parallel execution of three projects: metadata integration, master data integration and data
integration (Pic. 4).
1. Enterprise metadata integration establishes a common understanding of the meaning of the data
and master data.
2. Master data integration eliminates conflicts in data and metadata coding in the various
information systems.
3. Data integration provides end users with data as a single version of the truth based on
consistent metadata and master data.
Coordinated execution of these three projects delivers a corporate data warehouse of improved
quality at lower cost and in less time. The proposed approach increases the quality of
information delivered from the data warehouse to business users, and consequently provides better
support for decisions based on improved information.
The three integration projects (for data, metadata and master data), performed in parallel,
make it possible to implement a coordinated architecture design, a consistent environment, coherent
life cycles and interrelated core capabilities for the data warehouse, the metadata management system
and the master data management system.

Pic.4. Workflow according to the triple strategy

In practice there are many ways, methods and approaches which assure the success of the parallel,
coordinated execution of the three projects of data, metadata and master data integration.
1. Arrange the data warehouse, metadata integration and master data integration projects as a
program.
2. Adopt the Guide to the Project Management Body of Knowledge as the world-wide recognized
project management standard.
3. Select the spiral development life cycle.
4. Gather functional and non-functional requirements to select suitable core capabilities for the
data warehouse, for metadata, and for master data.
5. Select an environment
a. for data warehouse: data sources, ETL, data repository, staging area, operational data
store, application data mart, departmental and regional data marts, analytical, reporting
and other applications
b. for metadata: Managed Metadata Environment with 6 layers: sourcing, integration, data
repository, management, meta data marts, and delivery layer
c. for master data: Upstream, MDM core, Downstream
6. Select an architecture design
a. centralized data warehouse architecture
b. centralized metadata architecture
c. centralized master data repository
7. Select life cycles
a. life cycle for data: for example: understand, extract, transform, load, consolidate, archive,
deliver
b. life cycle for metadata: development, publishing, ownership, consuming, metadata
management
c. life cycle for master data: identify, create, review, publish, update, and retire
8. Define project roles and responsibilities, and assign team members to specific roles.
9. Select the tools for each team member.
The technical feature that is absolutely required for the strategy implementation is, primarily, the
coordination of these three projects. In general, this is the subject matter of program management. The
specific details (who, what, when, where, how, why) of inter-project communication depend on the
project environment described above.
Conclusion
At the moment IBM is the only company which offers an almost complete product set for the triple
strategy implementation: ETL tools for data extraction from heterogeneous data sources, metadata
glossary tools, data architecture instruments, master data management tools, sophisticated tools for
designing the BI environment, industry data models, and middleware that allows the components to be
integrated into a unified environment for information delivery to business users.
The idea of the triple strategy could have arisen 10 or 15 years ago. Practical implementation of
the strategy was impossible at that time due to the huge cost of developing the required tools, which
are available now.
Ready-made software tools for data, metadata and master data integration support the triple strategy
and together can mitigate project risks, reduce data warehouse development time and provide
companies with new opportunities to improve corporate performance.
The author thanks M. Barinstein, R.Ivanov, D.Makoed, A.Karpov, A.Spirin, and O.Tretyak for
helpful discussions.
Literature
1. Asadullaev S. Vendors data warehouse architectures, PC Week / RE, 1998, 32-33, p. 156-
157
2. Leong-Hong B.W., Plagman B.K. Data Dictionary / Directory Systems. John Wiley & Sons.
1982.
3. Inmon, W. H., Zachman, J. A., Geiger, J. G. Data Stores, Data Warehousing and the Zachman
Framework, McGraw-Hill, 1997
4. Hackathorn R. Data Warehousing Energizes Your Enterprise, Datamation, Feb.1, 1995, p. 39.
5. Asadullaev S. Metadata management using IBM Information Server, 2008,
http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html
6. Financial Service Technology. Mastering financial systems success, 2009,
http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/

Metadata Management Using IBM Information Server
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
06.10.2008
Abstract
Selecting a strategy for implementing a BI metadata management system requires answering
several critical questions. Which metadata need to be managed? What does the metadata
lifecycle look like? Which specialists are needed to complete the project successfully? Which
tools can support the specialists during the whole lifecycle of the required metadata set?
This paper examines a metadata management system for data integration projects from these
four points of view.
Glossary
A glossary is a simple dictionary which includes a list of terms and definitions on a specific subject. It
contains terms and their textual definitions in natural language, like this glossary.
A thesaurus ("treasure") is a variety of dictionary in which lexical relations are established between
lexical units (e.g., synonyms, antonyms, homonyms, paronyms, hyponyms, hyperonyms).
A controlled vocabulary requires the use of predefined, authorized terms that have been preferred by
the authors of the vocabulary.
A taxonomy models subtype-supertype relationships, also called parent-child relationships, on the
basis of controlled vocabularies with hierarchical relationships between the terms.
An ontology expands on taxonomy by modeling other relationships, constraints and functions, and
comprises the modeled specification of the concepts embodied by a controlled vocabulary.
Metadata types
IBM Information Server handles four metadata types. These metadata serve the data integration tasks
solved by Information Server.
Business metadata are intended for business users and include business rules, definitions,
terminology, glossaries, algorithms and lineage expressed in business language.
Technical metadata are required by the users of specific BI, ETL, profiling and modeling tools and
define source and target systems, their table and field structures and attributes, derivations and
dependencies.
Operational metadata are intended for operations, management and business users who need
information about application runs: their frequency, record counts, component-by-component
analysis and other statistics.
Project metadata are used by operations, stewards, tool users and management in order to
document and audit the development process, to assign stewards, and to handle change
management.
Success criteria of metadata project
Not all companies have recognized the necessity of metadata management in data integration
projects (for instance, in data warehouse development). Those who have started implementing a
metadata management system face a number of challenges. The requirements for a metadata
management system in a BI environment can be defined precisely and in good time, for example:
The metadata management system must provide relevant and accessible centralized information
about all information systems and their relations.
The metadata management system must establish consistent usage of business terminology
across the organization.
The impact of change must be discovered and planned.
Problems must be traced from the point of detection down to their origin.
New development must be supplied with information about existing systems.
The reality, however, is that an unwieldy repository stores a pile of useless records; each system uses
its own isolated metadata; uncoordinated policies turn out to be mismatching factors; obsolete and
unqualified metadata do not meet quickly changing business requirements.
The failure of metadata management projects, when one would think the goals are defined, the
budget is allocated and a competent team is picked, is mainly caused by the following reasons:
Insufficient participation of business users in the creation of a consolidated glossary, which
may be the result of the inconvenience and complexity of the glossary and metadata repository
management tools.
Supporting only a couple of metadata types due to a shortage of time and / or financial
resources.
Lacking or incomplete documentation for production systems, which could be mitigated by
tools for analyzing the data structures of existing systems at the initial investigation step.
Lack of support for the full metadata management life cycle due to fragmented metadata
management tools.
Sidebar
Strictly speaking, these statements relate to product success or failure as a result of
project execution. As a rule, a project's success criteria are timely execution within budget,
required quality and scope. Project success does not guarantee product success. The history of
technology knows many examples where a technically perfect product, developed
on time and within budget, was not in demand or did not find a ready market.
The success criterion for a metadata implementation project is the demand for the developed metadata
management system from subject matter experts, business users, IT personnel and other
information systems, both in production and in development.
Metadata management Lifecycle
The simplified lifecycle implies five stages and five roles (Pic. 1). Development is the creation of
new metadata by an author (a subject matter expert). Publishing, performed by a publisher, notifies the
participants and users of the existing and available metadata and their locations. Ownership allows
metadata usage rights to be defined and assigned. Consuming of metadata is performed by the
development team, by users or by information systems. Metadata management, executed by a
manager or stewards, includes modification, enrichment, extension, and access control.

Pic.1. Simplified metadata management lifecycle

The extended metadata management lifecycle consists of the following stages (Pic. 2).
Analysis and understanding includes data profiling and analysis, determining the quality of data sets
and structures, understanding the meaning and content of the input data, revealing connections between
the columns of database tables, analyzing dependencies and information relations, and investigating
data for integration.

Pic.2. Extended metadata management lifecycle

Modeling means revealing data aggregation schemas, detecting and mapping metadata
interrelations, impact analysis and model synchronization.
Development provides team glossary building and management, business context support for IT
assets, and elaboration of extraction, transformation and delivery data flows.
Transformation consists of the automated generation of complex data transformation tasks and of
linking source and target systems by means of data transformation rules.
Publishing provides a unified mechanism for metadata deployment and for upgrade notification.
Consuming covers visual navigation and mapping of metadata and their relations; metadata access,
integration, import and export; change impact analysis; search and queries.
Metadata quality management addresses the lineage of heterogeneous data in data integration
processes, quality improvement of information assets and input data quality monitoring, and allows
data structure and processability problems to be eliminated before they affect the project.
Reporting and audit imply setting formatting options for report results, generating reports on the
lineage between business terms and IT assets, scheduling report execution, and saving and reviewing
report versions. Audit results can be used for analysis and understanding on the next loop of the
life cycle.
Metadata management means managing access to templates, reports and results, controlling metadata,
navigating and querying the metamodel, and defining access rights, responsibilities and manageability.
Ownership determines metadata usage rights.

Sidebar
Support of the full metadata management lifecycle is critically important for metadata management
goals, especially for large enterprise information systems. Lifecycle discontinuity leads to
violation of the consistency of corporate metadata, and isolated islands of contradictory
metadata arise.
Implementation of a consistent set of metadata management tools considerably increases the
probability of success of a metadata management system implementation project.

IBM Information Server metadata management tools
IBM Information Server platform includes the following metadata management tools.
Business Glossary is a Web-based application that supports the collaborative authoring and
collective management of a business dictionary (glossary). It allows metadata categories to be
maintained, their relations to be built, and terms to be linked to physical sources. Business Glossary
supports metadata management, alignment and browsing, and the assignment of responsible stewards.
Business Glossary Anywhere is a small program which provides read-only access to the content
of the business glossary through the operating system's clipboard. A user can highlight a term on the
screen of any application, and a business definition of the term will appear in a pop-up window.
Business Glossary Browser provides read-only access to the business glossary's content in a separate
web-browser window.
Information Analyzer automatically scans data sets to determine their structure and quality. This
analysis helps in understanding data inputs to an integration process, ranging from individual fields to
high-level data entities. Information analysis also enables problems with structure or validity to be
corrected before they affect the metadata project. Information Analyzer maintains profiling and
analysis as an ongoing process of data reliability improvement.
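To make the idea of data profiling concrete, the following minimal Python sketch shows the kind of per-column statistics such analysis produces (completeness, cardinality, an inferred type, frequent values). It is only a conceptual illustration, not Information Analyzer's actual algorithms or APIs, and the sample table and field names are invented.

# A minimal column-profiling sketch (illustrative only; Information Analyzer's
# own algorithms and interfaces are far richer and are not reproduced here).
from collections import Counter

def profile_column(name, values):
    """Summarize completeness, cardinality and an inferred type for one column."""
    non_null = [v for v in values if v not in (None, "")]
    def looks_numeric(v):
        try:
            float(str(v).replace(",", "."))
            return True
        except ValueError:
            return False
    inferred = "numeric" if non_null and all(looks_numeric(v) for v in non_null) else "text"
    counts = Counter(non_null)
    return {
        "column": name,
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "inferred_type": inferred,
        "most_common": counts.most_common(3),
    }

# Hypothetical sample: a fragment of a source table with quality problems.
rows = [
    {"bic": "044525225", "region": "77"},
    {"bic": "044525593", "region": "77"},
    {"bic": None,        "region": "50"},
    {"bic": "04452A225", "region": ""},
]
for col in ("bic", "region"):
    print(profile_column(col, [r[col] for r in rows]))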
QualityStage provides the instruments for investigation, consolidation, standardization and
validation of heterogeneous data in integration processes and improves the quality of the
information assets.
DataStage maintains the development of data flows which extract information from multiple
sources, transform it according to the specified rules and deliver it to target databases or
applications.
Information Analyzer performs source system analysis and passes the results to QualityStage, which,
in turn, supports DataStage, responsible for data transformation. Used together, Information Analyzer,
QualityStage and DataStage automate the data quality assurance processes and eliminate painstaking
or even impossible manual data integration work.
FastTrack reveals the relations between columns of database tables, links columns and business
terms, automatically creates complex data transformation tasks in DataStage and QualityStage
Designer, binds data sources and target systems by data transformation rules, reducing the
application development time.
Metadata Workbench provides metadata visualization and navigation tools, maintains a visual
representation of metadata interdependencies, enables analysis of information dependencies and
relations across various tools, allows metadata binding, generates reports on relations between
business terms and IT assets, supports metadata management, navigation and metamodel queries,
and allows investigation of key integration data: jobs, reports, databases, models, terms, stewards and systems.
Web Console provides administrators with role-based access management tools; maintains the
scheduling of report execution, the storing of query results in a common repository and the viewing
of multiple versions of a report; and supports creating directories for report storage and indicating in
which directories the reports will be stored. Web Console also allows the formatting options for
query results to be defined.
Information Services Director resides in the domain layer of IBM Information Server and provides
a unified mechanism for publishing and managing data quality services, allowing IT specialists to
deploy and control services for any data integration task. Common services include metadata
services, which supply standard service-oriented end-to-end access to metadata and their analysis.
Rational Data Architect is an enterprise data modeling and integration design tool that combines
data structure modeling capabilities with metadata discovery, relationship mapping, and analysis.
Rational Data Architect helps to understand data assets and their relationships to each other and
allows to reveal data integration schemas, to visualize metadata relations, to analyze the impact of
changes and the synchronization of models.
Metadata Server maintains the metadata repository and the interaction of the other components, and
supports metadata services: metadata access and integration, impact analysis, metadata import, export,
search and queries. The repository is a J2EE application. For persistent storage it uses a standard
relational database such as IBM DB2, Oracle, or SQL Server. Backup, administration, scalability,
parallel access, transactions, and concurrent access are provided by the underlying database.
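As an illustration of how such a repository can be organized on top of a relational database, here is a deliberately simplified Python sketch; the real Metadata Server schema is proprietary, and all table and column names below are invented. SQLite merely stands in for DB2, Oracle or SQL Server.

# A simplified relational sketch of a metadata repository: glossary terms,
# categories, stewards and links to physical columns of production systems.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE term (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    definition TEXT,
    category_id INTEGER REFERENCES category(id),
    steward TEXT
);
CREATE TABLE physical_column (
    id INTEGER PRIMARY KEY,
    system TEXT, table_name TEXT, column_name TEXT
);
-- Link business terms to the physical assets they describe.
CREATE TABLE term_to_column (
    term_id INTEGER REFERENCES term(id),
    column_id INTEGER REFERENCES physical_column(id)
);
""")
conn.execute("INSERT INTO category VALUES (1, 'Finance')")
conn.execute("INSERT INTO term VALUES (1, 'Net revenue', 'Revenue after discounts', 1, 'j.doe')")
conn.execute("INSERT INTO physical_column VALUES (1, 'SAP BW', 'SALES_FACT', 'NET_REV')")
conn.execute("INSERT INTO term_to_column VALUES (1, 1)")

# Example lineage-style query: which physical columns implement a business term?
for row in conn.execute("""
    SELECT t.name, c.system, c.table_name, c.column_name
    FROM term t JOIN term_to_column tc ON tc.term_id = t.id
                JOIN physical_column c ON c.id = tc.column_id
"""):
    print(row)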
As we can see, IBM Information Server metadata management tools cover the extended metadata
management lifecycle.
Roles in metadata management project
The set of team roles in a metadata management project depends on many factors and can include,
for example, infrastructure engineers, information security specialists, and middleware developers.
Limited to the roles of direct relevance to metadata system development, the role list can
look as follows.
The project manager, for effective project management, requires both project documentation and
information on product deliverables, namely, on the metadata management system under development.
So the project manager should be granted access to tools producing reports on jobs, queries,
databases, models, terms, stewards, systems and servers.
The subject matter expert has to participate in the collaborative creation and management of the
business glossary. The expert must define the terms and their categories, and establish their relations.
The business analyst should know the subject matter, understand the terminology and the sense of
entities, and have previous experience in formulating the rules of data processing and transformation
from sources to target systems and consumers. Participation of the business analyst in business
glossary creation is also very important.
The data analyst reveals all the inconsistencies and contradictions in data and terms before
application development starts.
The IT developer should have the ability to familiarize himself with business terminology, to develop
the data processing jobs, to implement transformation rules, and to code data quality checks.
Application administrator is responsible for maintaining and versioning the configuration of
applications in production; for updates and patch sets installation; for maintaining and monitoring
the current state of program components; for execution of the general policies of the protection
profiles; for conducting the performance analysis and for application execution optimization.
The database administrator should tune the database and control its growth; reveal performance
problems and fix them; generate the required database configurations; change the structure of the
database; and add and remove users and change their access rights.
Business users, within the metadata project, need simple and effective access to the metadata
dictionary. As with a common paper dictionary, users require the ability to read the lexical
entry along with the explicit description and the brief dictionary definition, preferably without any
loss of context or focus.
Support of the roles by IBM Information Server tools
Business Glossary allows a steward (responsible for metadata) role to be assigned to a user or a group
of users, and makes the steward accountable for one or more metadata objects. The steward's
responsibilities imply efficient management and integration with related data and making the data
available to authorized users. The steward should ensure that data is properly defined, and that all
users of the data clearly understand its meaning.
The subject matter expert (metadata author) uses Business Glossary to create the business
classification (taxonomy), which maintains a hierarchical structure of terms. A term is a word or
phrase which can be used for object classification and grouping in the metadata repository.
Business Glossary supplies subject matter experts with a collaborative tool to annotate existing data
definitions, to edit descriptions, and to assign data objects to categories.
If a business analyst or data analyst discovers contradictions between the glossary and database
columns, they can notify the metadata authors by means of Business Glossary features.
Other project participants need read-only access to metadata. Their demands can be met by
two instruments: Business Glossary Browser and Business Glossary Anywhere.
Information Analyzer plays an important role at the integration analysis stage, which is required
for assessing what data exist and their current state. The result of this stage is
an understanding of the source systems and, consequently, an adequate target system design.
Instead of time-consuming manual analysis of outdated or missing documentation,
Information Analyzer provides the business analyst and data analyst with the possibilities of
automated analysis of production systems.
The business analyst uses Information Analyzer to make decisions on integration design on the basis
of investigation of database tables, columns, keys and their relations. Data analysis helps to understand
the content and structure of data before the project starts, and allows conclusions useful for the
integration process to be drawn at later project stages.
The data analyst uses Information Analyzer as an instrument for a complete analysis of source
information systems and target systems, and for evaluation of the structure, content and quality of data
at the single and multiple column level, the table level, the file level, the cross-table level, and the
level of multiple sources.
Stewards can use Information Analyzer to maintain a common understanding of the meaning of data
by all users and project participants.
The business analyst or data analyst can use Information Analyzer to create additional rules for
evaluation and measurement of data and data quality over time. These rules are either simple criteria
for column evaluation based on data profiling results, or complex conditions which evaluate several
fields. Evaluation rules allow indices to be created whose deviation can be monitored over time.
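The sketch below illustrates, under hypothetical record and field names, the two kinds of evaluation rules just described: a simple single-column criterion and a cross-field condition, each yielding an index whose drift can be monitored from one load to the next. Information Analyzer expresses such rules declaratively; this is only a conceptual analogue in Python.

# Two illustrative evaluation rules: a column-format check and a cross-field condition.
import re

records = [
    {"order_id": "A-001", "amount": 120.0, "discount": 10.0, "email": "a@example.com"},
    {"order_id": "A-002", "amount": 80.0,  "discount": 95.0, "email": "not-an-email"},
    {"order_id": "A-003", "amount": 50.0,  "discount": 0.0,  "email": "c@example.com"},
]

rules = {
    # Simple column rule: the field must look like an e-mail address.
    "email_format": lambda r: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r["email"]) is not None,
    # Complex rule evaluating several fields: a discount may not exceed the amount.
    "discount_le_amount": lambda r: r["discount"] <= r["amount"],
}

# Each rule yields an index (share of passing records) that can be tracked over time.
for name, rule in rules.items():
    passed = sum(1 for r in records if rule(r))
    print(f"{name}: {passed / len(records):.0%} of records pass")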
QualityStage can be invoked at the preparation stage of enterprise data integration (often referred to
as data cleansing). The IT developer runs QualityStage to automate data standardization, to transform
data into verified standard formats, to design and test match passes, and to set up data-cleansing
operations. Information is extracted from the source system, measured, cleansed, enriched,
consolidated, and loaded into the target system. Data cleansing jobs consist of the following
sequence of stages.
The Investigation stage is performed by the business analyst to reach complete visibility of the actual
condition of the data, and can be fulfilled using both Information Analyzer and QualityStage's
embedded analysis tools.
The Standardization stage reformats data from multiple systems to ensure that each data type has the
correct content and format.
The Match stages ensure data integrity by linking records from one or more data sources that correspond
to the same entity. The goal of the Match stages is to create semantic keys to identify information
relationships.
The Survive stage ensures that the best available data survives and is correctly prepared for the target.
This means that the Survive stage is executed to build the best available view of related information.
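A toy Python sketch of the Standardization, Match and Survive stages is shown below (Investigation is assumed to have been done already). It only conveys the idea; QualityStage's rule sets and matching algorithms are far more elaborate, and the sample records are invented.

# Standardize free-form fields, group records by a semantic key, keep the best survivor.

def standardize(rec):
    """Bring free-form fields to a common format."""
    return {
        "name": " ".join(rec["name"].upper().replace(".", " ").split()),
        "phone": "".join(ch for ch in rec["phone"] if ch.isdigit()),
        "source": rec["source"],
        "updated": rec["updated"],
    }

def match_key(rec):
    """Semantic key used to link records describing the same entity."""
    return (rec["name"], rec["phone"][-7:])  # last 7 digits tolerate country/city prefixes

def survive(group):
    """Keep the best available view of the matched records (most recent one here)."""
    return max(group, key=lambda r: r["updated"])

raw = [  # hypothetical records from two source systems
    {"name": "Ivanov I. I.", "phone": "+7 (495) 123-45-67", "source": "SAP",    "updated": "2009-03-01"},
    {"name": "IVANOV I I",   "phone": "8-495-123-45-67",    "source": "Oracle", "updated": "2009-07-15"},
    {"name": "Petrova A.",   "phone": "495 765 43 21",      "source": "SAP",    "updated": "2009-05-02"},
]

groups = {}
for rec in map(standardize, raw):
    groups.setdefault(match_key(rec), []).append(rec)

golden = [survive(g) for g in groups.values()]
for rec in golden:
    print(rec)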
Based on the data understanding achieved at the Investigation stage, the IT developer can apply
QualityStage's ready-to-run rules to reformat data from several sources at the Standardization stage.
IT developer leverages DataStage for data transformation and movement from source systems to
target systems in accordance with business rules, requirements of subject matter and integrity, and /
or in compliance with other data of target environment.
Using metadata for analysis and maintenance, and embedded data validation rules, the IT developer
can design and implement integration processes for data received from a broad set of corporate and
external sources, and processes of mass data manipulation and transformation leveraging scalable
parallel technologies. The IT developer can implement these processes as DataStage batch jobs, as real-time
tasks, or as Web services.
FastTrack is predominantly an instrument of the business analyst and the IT developer.
The business analyst, with the help of the mapping editor, a component of FastTrack,
creates mapping specifications for data flows from sources to target systems. Each mapping can
contain several sources and targets. Mapping specifications are used to document business
requirements.
Mapping can be adjusted by applying business rules. End-to-end mapping can involve data
transformation rules, which are part of the functional requirements and define how an application
should be developed.
The IT developer uses FastTrack during the development of program logic for end-to-end
information processing. FastTrack converts the artifacts received from various sources into
understandable descriptions. This information has internal relations and allows the developer to get
the descriptions from the metadata repository and to concentrate on complex logic development,
avoiding losing time searching through multiple documents and files.
FastTrack is integrated into IBM Information Server, so specifications, metadata and jobs become
available for all project participants, who use Information Server, Information Analyzer, DataStage
Server and Business Glossary.

Table 1. Roles in a metadata management project and IBM Information Server tools

Roles: project manager, subject matter expert, steward, business analyst, data analyst, IT developer, application administrator, DB administrator, business users.

Business Glossary: subject matter expert, steward, business analyst, data analyst
Business Glossary Browser: all roles
Business Glossary Anywhere: all roles
Information Analyzer: steward, business analyst, data analyst, IT developer
QualityStage: business analyst, IT developer
DataStage: IT developer
FastTrack: business analyst, IT developer
Metadata Workbench: all roles
Web Console: application administrator, DB administrator
Information Services Director: IT developer
Rational Data Architect: data analyst
Metadata Workbench provides IT developers with metadata view, analysis and enrichment tools.
Thus IT developers can use Metadata Workbench's embedded design tools for managing and
understanding the information assets created and shared in IBM Information Server.
Business analysts and subject matter experts can leverage Metadata Workbench to manage
metadata stored in IBM Information Server.
Specialists, responsible for compliance with regulations such as Sarbanes-Oxley and Basel II,
have the possibility to trace the data lineage of business intelligence reports using the appropriate
tools of Metadata Workbench.
IT specialists who are responsible for change management, say, the project manager, can analyze
the impact of changes on the information environment with Metadata Workbench.
Administrators can use the capabilities of Web Console for global administration, which is based on
the common framework of Information Server. For example, a user needs only one credential to access
all the components of Information Server. A set of credentials is stored for each user to provide single
sign-on to the products registered with the domain.
The IT developer uses Information Services Director as a foundation for deploying integration tasks
as consistent and reusable information services. Thus the IT developer can combine service-oriented
metadata management tasks with corporate application integration, business process management,
an enterprise service bus and application servers.
Data analysts and architects can invoke Rational Data Architect for database design, including
federated databases that can interact with DataStage and other components of Information Server.
Rational Data Architect provides data analysts with metadata research and analysis capabilities;
data analysts can discover, model, visualize and relate heterogeneous data assets, and can create
physical data models from scratch, from logical models by using transformation, or from the
database using reverse engineering.
Conclusion
The performed multi-faceted analysis, covering the types of metadata, the metadata life cycle, the roles
in a metadata project and the metadata management tools, allows the following conclusions to be drawn.
IBM Information Server metadata management tools cover an extended metadata management
lifecycle in data integration projects.
The participants of a metadata management project are provided with a consistent set of IBM
Information Server metadata management tools, which considerably increases the probability of a
successful corporate metadata management system implementation.
The process flows of IBM Information Server components and their interaction will be considered
in further papers.
The author thanks S. Likharev for useful discussions.
Incremental implementation of IBM Information Server's
metadata management tools
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
21.09.2009
http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html
Abstract
Just 15 years ago a data warehouse (DW) implementation team had to develop custom DW tools
from scratch. Currently, integrated DW development tools are numerous and their implementation is
a challenging task. This article proposes incremental implementation of IBM Information Server's
metadata management tools in DW projects, using the example of a typical oil & gas company.
Scenario, current situation and business goals
After having spent significant amounts of money on hundreds of SAP applications, our client
suddenly realized that a seemingly homogeneous IT environment does not automatically provide a
unified understanding of business terms.
The customer, one of the world's leading companies, incorporates four groups of subsidiary units,
which operate in oil & gas exploration and production, and in refining and marketing of petroleum
products. The subsidiary units are spread around the world; they operate in various
countries with different legislation, languages and terminologies. Each unit has its own information
accounting system. Branch data warehouses integrate information from the units' accounting systems.
The reports produced by branch data warehouses are not aligned with each other due to disparate
treatment of report fields (attributes).
The company decided to build an attribute-based reporting system, and realized that the lack of a
common business language made enterprise data integration impossible. In this scenario, the
company decided to establish a unified understanding of business terms, which eliminates
contradictions in the understanding of report fields.
Business goals were formulated in accordance with the identified issues:
Improve the quality of information, enhance security of information, and provide the
transparency of its origin;
Increase the efficiency of business process integration and minimize time and effort of its
implementation;
Remove the hindrances for corporate data warehouse development.
Logical Topology As Is
The existing IT environment incorporates the information accounting systems of the units, branch
information systems, branch data warehouses, information systems of the headquarters, data marts
and the planned enterprise data warehouse.
The information accounting systems of subsidiaries were realized on various platforms and are out of
the scope of this consideration. Branch information systems are mainly based on SAP R/3. Branch data
warehouses were developed on SAP BW. Headquarters information systems are realized using
Oracle technologies. Data marts currently work on top of the headquarters information systems and
the branch data warehouses running SAP BW. The platform for the enterprise data warehouse is DB2.


Pic. 1. Logical Topology As Is
On the left side of Pic.1 we can see the information systems of four branches: Exploration,
Production, Refinery and Marketing. These hundreds of systems include HR, financial, material
management and other modules, and are currently out of our scope because they will not be
connected to the metadata management system at this stage.
The center of Pic.1 presents the centralized part of the client's IT infrastructure. It includes
several branches' data warehouses on the SAP BW platform and the headquarters' information system
on an Oracle database. Historically these two groups use various independent data gathering tools
and methods, so stored data are not consistent across the information systems. The information is
grouped in several regional, thematic and departmental data marts. These data marts were built
independently over the years. That is why the reports generated by OLAP systems do not provide
a unified understanding of the report fields.
Since metadata management eliminates the data mismatch, improves the integration of
business processes and removes obstacles to developing an enterprise data warehouse, it was
decided to implement an enterprise metadata management system.
Architecture of metadata management system
Basically, there are three main approaches to metadata integration: point-to-point architecture,
point-to-point architecture with model-based approach, and central repository-based hub-and-
spoke metadata architecture [1].
The first one is point-to-point architecture, a traditional metadata bridging approach, in which
pair-wise metadata bridges are built for every pair of product types to be integrated.
The relative ease and simplicity of pairwise integration leads to uncontrolled growth of the number of
connections between systems: n systems may require up to n(n-1)/2 bridges, so, for example, ten
systems can need as many as 45 bridges. This uncontrolled growth results in considerable expense for
maintaining a unified semantic space when changes are made in at least one system.
The second is point-to-point architecture with a model-based approach, which significantly reduces the
cost and complexity associated with traditional point-to-point metadata integration
architectures based on metadata bridges. A common meta-model eliminates the need to construct
pair-wise metadata bridges and establishes complete semantic equivalence at the metadata level
between the different systems and tools included in the information supply chain to users.
The third is central repository-based hub-and-spoke metadata architecture. In this case the
repository generally takes on a new meaning as the central store for both the common meta-
model definition and all of its various instances (models) used within the overall environment.
The centralized structure of the oil and gas company's information systems dictates the choice of
the central repository-based hub-and-spoke metadata architecture as the one that most adequately
implements the necessary connections between the systems to be integrated.
Architecture of metadata management environment
Metadata Management Environment (MME) [2] includes the sources of metadata, metadata
integration tools, a metadata repository, and tools for metadata management, delivery, access and
publication. In some cases, the metadata management environment includes metadata data marts,
but in this task they are not needed, since the functionality of the metadata data marts is not
required.
Metadata sources are all information systems and other data sources that are included in the
enterprise metadata management system.
Metadata integration tools are designed to extract metadata from sources, integrate it, and deploy it
to a metadata repository.
The metadata repository stores business rules, definitions, terminology, the glossary, data lineage and
data processing algorithms described in business language, as well as descriptions of tables and
columns (attributes), application execution statistics and data for project audit.
Metadata management tools provide the definition of access rights, responsibilities and
manageability.
Tools for metadata delivery, access and publication allow users and information systems to work
with metadata in the most convenient way.
Architecture of metadata repository
Metadata repository can be implemented using either a centralized, a decentralized, or a
distributed architecture approach.
The centralized architecture implies a global repository, which is designed on a single metadata
model and maintains all enterprise systems. There are no local repositories. The system has a
single, unified and coherent metadata model. The need to access a single central repository of
metadata can lead to performance degradation of metadata consuming remote systems due to
possible communication problems.
In the distributed architecture the global repository contains enterprise metadata for the core
information systems. Local repositories, containing a subset of metadata, serve the peripheral
system. Metadata model is uniform and consistent. All metadata are processed and agreed in a
central repository, but are accessed through the local repository. The advantages of local
repositories are balanced by requirements to be synchronized with a central metadata repository.
The distributed architecture is preferable for geographically distributed enterprises.
Table 1. Comparison of metadata repository architectures


The decentralized architecture assumes that the central repository contains only metadata
references, which are maintained independently in local repositories. The lack of coordination efforts
on terms and concepts significantly reduces development costs, but leads to multiple and varied
models that are mutually incompatible. The applicability of this architecture is limited to the case
when the integrated systems are within non-overlapping areas of the company's operations.
As one of the Company's most important objectives is to establish a single business language, a
decentralized architecture is not applicable. The choice between centralized and distributed
architecture is based on the fact that all the systems to be integrated are located in the headquarters,
and there is no problem with stable communication lines.
Thus, the most applicable to this scenario is a centralized architecture of the metadata repository.
In various publications one can find statements that a metadata repository is a transactional
system and should be managed differently than a data warehouse. From our point of view, the
recommendation to organize the metadata repository like a data warehouse is more justified. Metadata
should accompany the data throughout its lifecycle. That is, if the data warehouse contains
historical data, the metadata repository should also contain the relevant historical metadata.
Logical Topology To Be
The selected architectures of the metadata management environment, the metadata management
system and the metadata repository lead to the target logical topology shown in Pic. 2. One can see
two major changes compared to the current logical topology.
1. We plan to create an enterprise data warehouse and to use IBM Information Server as an
ETL (Extract, Transform and Load) tool. This task is beyond the scope of the current work.
2. The second, most important change is the centralized metadata management, which
allows the Company to establish a common business language for all systems operating
in the headquarters. So, on the client side only a metadata client is required.
Two phases of extended metadata management lifecycle
The extended metadata management lifecycle (Pic.3), as proposed in [3], consists of the following
stages: analysis and understanding, modeling, development, transformation, publication,
consuming, ownership, quality management, metadata management, and reporting and auditing.
In terms of incremental implementation the extended metadata management lifecycle can be
divided into two phases:
1. Metadata elaboration phase: analysis and understanding, modeling, development,
transformation, publication.
2. Metadata production phase: consuming, ownership, quality management, metadata
management, reporting and audit.
As the phase names imply, the first phase mainly involves analysis, modeling and development of
metadata, while the second phase is more closely related to the operation of the metadata management
system. For clarity, the stages of the Metadata elaboration phase are grouped on the left-hand side of
Pic. 3, whereas the stages of the Metadata production phase are placed on the right-hand side.

Pic. 2. Logical Topology To Be
Metadata elaboration phase
Analysis and understanding includes data profiling and analysis, quality assessment of data
sets and structures, understanding the sense and content of the input data, identification of
connections between columns of database tables, analysis of dependencies and information
relations, and investigation of data for their integration.
Business Analyst performs data flow mapping and prepares the initial classification.
Subject matter expert develops business classification.
Data Analyst accomplishes analysis of systems.
Modeling means revealing data aggregation schemes, detection and mapping of metadata
interrelation, impact analysis and synchronization of models.
Data Analyst develops the logical and physical models and provides synchronization of
models.
Development provides team glossary elaboration and maintenance, business context support for
IT assets, elaboration of flows of data extraction, transformation and delivery.
IT developer creates the logic of data processing, transformation and delivery.
Transformation consists of automated generation of complex data transformation tasks and of
linking source and target systems by means of data transformation rules.
IT developer prepares the tasks to transform and move data, which are performed by the
system.
Publishing provides a unified mechanism for metadata deployment and for upgrade notification.
IT developer provides deployment of integration services, which help the metadata steward to
publish metadata.
Metadata Production phase
Consuming is visual navigation and mapping of metadata and their relations; metadata access,
integration, import and export; change impact analysis; search and queries (Pic.3).
Business users are able to use metadata
Ownership determines metadata access rights.
Metadata steward maintains the metadata access rights
Metadata quality management solves the tasks of lineage of heterogeneous data in data
integration processes; quality improvement of information assets; input data quality monitoring,
and allows to eliminate issues of data structure and their processability before they affect the
project.
Project manager analyzes the impact of changes
Business analyst identifies inconsistencies in the metadata
Subject matter expert updates business classification
Data analyst removes the contradiction between metadata and classification
IT developer manages information assets
Metadata steward supports a unified understanding of metadata meaning
Business users use metadata and inevitably reveal metadata contradictions


Pic. 3. Extended metadata management lifecycle


During the Metadata management stage, access to templates, reports and results is managed;
metadata, navigation and queries in the meta-model are controlled; and access rights, responsibilities
and manageability are defined.
The project manager should appoint stewards and allocate responsibilities among team
members.
Reporting and audit imply setting formatting options for report results, generating reports on
the connections between business terms and IT assets, scheduling report execution, and saving and
reviewing report versions.
Metadata steward provides auditing and reporting.
Audit results can be used to analyze and understand metadata at the next iteration of the life cycle.
Roles and interactions on metadata elaboration phase
The business analyst, with the help of the mapping editor, a component of FastTrack, creates
mapping specifications for data flows from sources to target systems (Pic.4).
Each mapping can contain several sources and targets. Mapping specifications are used for
business requirements documentation. Mapping can be adjusted by applying business rules. End-
to-end mapping can involve data transformation rules, which are part of functional requirements
and define how an application should be developed.
Business analyst uses Information Analyzer to make decisions on integration design on the basis
of data base tables investigation, columns, keys and their relations. Data analysis helps to
understand the content and structure of data before a project starts, and on later project stages
allows making conclusions useful for integration process.
Subject matter expert (metadata author) uses Business Glossary to create the business
classification (taxonomy), which maintains hierarchical structure of terms. A term is a word or
phrase which can be used for object classification and grouping in metadata repository. Business
Glossary supplies subject matter experts with a collaborative tool to annotate existing data
definitions, to edit descriptions, and to assign data object to categories.
Data analyst uses Information Analyzer as an instrument for a complete analysis of data source
systems and target systems; for evaluation of structure, content and quality of data on single and
multiple columns level, on table level, on file level, on cross table level, and on the level of
multiple sources.
Data analysts and architects can invoke Rational Data Architect for database design, including
federated databases that can interact with DataStage and other components of Information
Server. Rational Data Architect provides data analysts with metadata research and analysis
capabilities; data analysts can discover, model, visualize and link heterogeneous data assets, and can
create physical data models from scratch, derive them from logical models by means of
transformation, or obtain them with the help of reverse engineering of production databases.
IT developer uses FastTrack during program logic development of end-to-end information
processing. FastTrack converts the artifacts received from various sources into understandable
descriptions. This information has internal relations and allows the developer to get the
descriptions from metadata repository and to concentrate on the complex logic development,
avoiding losing the time for search in multiple documents and files.
FastTrack is integrated into the IBM Information Server. Thats why the specifications,
metadata, and the jobs become available to all project participants, who use the Information
Server, Information Analyzer, DataStage Server and Business Glossary.
The IT developer runs QualityStage to automate data standardization, to transform data into
verified standard formats, to design and test match passes, and to set up data cleansing
operations. Information is extracted from the source system, measured, cleansed,
enriched, consolidated, and loaded into the target system.
The IT developer leverages DataStage for data transformation and movement from source systems
to target systems in accordance with business rules, subject matter and integrity requirements,
and / or in compliance with other data of the target environment. Using metadata for analysis and
maintenance, and embedded data validation rules, the IT developer can design and implement
integration tasks for data received from a broad set of internal and external sources, and can
arrange very large-scale data manipulation and transformation using scalable parallel processing
technologies. The IT developer can choose to implement these processes as DataStage batch jobs, as
real-time tasks, or as Web services.
The IT developer uses Information Services Director as a foundation for deploying integration
tasks as consistent and reusable information services. Thus the IT developer can combine
service-oriented metadata management tasks with enterprise application integration, business
process management, an enterprise service bus and application servers.
Business users need a read-only access to metadata. Their demands can be met by two
instruments: Business Glossary Browser and Business Glossary Anywhere.



Pic. 4. Roles & Interactions on Elaboration phases of metadata management lifecycle
Roles and interactions on metadata production phase
IT specialists who are responsible for change management, say, a project manager, can analyze
a change impact on the information environment with the help of Metadata Workbench (Pic.5).
Business Glossary allows the role of steward, responsible for the metadata, to be assigned to a user or
a group, and the steward role to be linked with one or more metadata objects. The steward's
responsibility includes effective metadata management and integration with related data, and
providing authorized users with access to relevant data. Stewards must ensure that all data are
correctly described and that all data users understand the meaning of the data.
If the business analyst discovers contradictions between the glossary and database columns, they can
notify the metadata authors by means of Business Glossary features.
The business analyst investigates the data status to reach complete visibility of the actual data
condition using QualityStage's embedded analysis tools.
The data analyst eliminates contradictions between the glossary and database tables and columns by
means of Business Glossary and Rational Data Architect.
Metadata Workbench provides IT developers with metadata view, analysis and enrichment
tools. Thus IT developers can use Metadata Workbench's embedded design tools for managing
and understanding the information assets created and shared by IBM Information Server.
Business users responsible for compliance with regulations such as Sarbanes-Oxley and Basel II
can trace the data lineage in reports using the appropriate tools of Metadata Workbench.
Stewards can use Information Analyzer to maintain the common understanding of data sense by
all users and project participants.
Stewards can invoke Metadata Workbench to maintain metadata stored in the IBM Information
Server.
Administrators can use the capabilities of Web console for global administration that is based
on a common framework of Information Server. For example, user may need only one credential
to access all the components of Information Server. A set of credentials is stored for each user to
provide a single sign-on to all registered assets.

Pic. 5. Roles & Interactions on Production phases of metadata management lifecycle
Adoption route 1: metadata elaboration
So, we have two metadata adoption routes:
Route 1: Metadata elaboration and
Route 2: Metadata production.
Both routes begin at a single starting point.
Picture 6 represents Route 1, which deals mainly with the first part of the metadata management
lifecycle, namely with Analysis and understanding, Modeling, Development, Transformation,
Publishing, and Consuming.
As the first step we have to install Metadata Server, which maintains the metadata repository and
supports metadata services.
At the second step one should add Information Analyzer to perform automated analysis of
production systems and to define the initial classification.
Step three is adding FastTrack, which allows mapping specifications for data flows from
sources to targets to be created.
We can add Business Glossary as the fourth step in order to create the business classification.
To create logical and physical data models, Rational Data Architect could be added at the fifth
step.
The sixth step is the extended usage of Information Analyzer to create rules for data evaluation.
At the seventh step we plan the extended usage of FastTrack to program the logic of end-to-end
information processing.
As step eight one could install QualityStage and DataStage to design and execute data
transformation procedures.
To deploy integration tasks as services we should add Information Services Director at the ninth
step.
At the last step one has to grant users read-only access to metadata, so we add
Business Glossary Browser and Business Glossary Anywhere.


Pic. 6. Metadata adoption route on Elaboration phases of metadata management lifecycle
Adoption route 2: metadata production
This adoption route covers the production part of the metadata management lifecycle, and includes
Reporting and audit, Ownership, Quality management, and Metadata management. The second route
begins at the same starting point as Route 1.
Almost all products were installed during the first route, so this route in general deals with
extended usage of the software added previously.
Web Console is one of the two products which should be added during this route. It enables
management of users' credentials, and hence it is required at the very beginning.
The next step, the extended use of Business Glossary, should be performed as soon as possible to
assign a steward.
To perform the change impact analysis one should add Metadata Workbench.
The extended usage of FastTrack and QualityStage allows contradictions between the glossary and
database columns to be discovered.
Extended usage of Rational Data Architect could eliminate the revealed contradictions between the
glossary and database tables and columns.
Metadata Workbench can help in understanding and managing the information assets.
By means of Business Glossary, users can update the business classification according to new
requirements.
Again, Metadata Workbench helps in reporting the revealed metadata issues.
Information Analyzer can be used to maintain a common understanding of the meaning of data.
Both Metadata Workbench and Web Console can be used to maintain metadata and to report
the metadata state.




Pic. 7. Metadata adoption route on Production phases of metadata management lifecycle
Conclusion
The proposed routes cover an extended metadata management lifecycle in data integration projects.
The participants of the metadata management project are provided incrementally with a consistent set
of IBM Information Server metadata management tools. Software that is implemented following the
proposed routes realizes the pre-selected architectures of the metadata management environment, the
metadata management system and the metadata repository in accordance with the target logical
topology.
Incremental implementation of IBM Information Server's metadata management tools reduces the
time and complexity of the project, enables business users to get the benefits of metadata
management at earlier stages, and increases the probability of a successful implementation of the
metadata management system.
This work was performed as part of the plusOne initiative. The author would like to express his
gratitude to Anshu Kak for the invitation to the plusOne project.
Literature
1. Poole J., Chang D., Tolbert D., Mellor D. Common Warehouse Metamodel: An Introduction to the
Standard for Data Warehouse Integration, Wiley, 2003.
2. Marco D., Jennings M. Universal Meta Data Models, Wiley, 2004.
3. Asadullaev S. Metadata management using IBM Information Server, 2008,
http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html


Master data management with practical examples
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
Alexander Karpov, Solution Architect, SWG IBM EE/A
09.11.2010
http://www.ibm.com/developerworks/ru/library/sabir/nsi/index.html
Abstract
The article provides examples where insufficient attention to master data management (MDM) leads
to inefficient use of information systems, because the results of queries and reports do not fit the task
and do not reflect the real situation. The article also describes the difficulties faced by a company
that decided to implement a home-grown MDM system, and provides practical examples and common
errors. The benefits of enterprise MDM are stressed, and the basic requirements for an MDM system
are formulated.
Basic concepts and terminology
Master data (MD) includes information about customers, employees, products, goods suppliers,
which typically is not transactional in its nature.
Reference data refer to codifiers, dictionaries, classifiers, identifiers, indices and glossaries [1]. This
is a basic level of transactional systems, which in many cases is supplied by external designated
organizations.
A classifier is managed centrally by an external entity, contains the rules of code generation and has a
three- or four-level hierarchical structure. A classifier may determine the coding rules, but does not
always contain the rules for calculating a check digit or code validation algorithms. An
example of a classifier is the bank identification code (BIC), which is managed by the Bank of Russia,
contains no check digit, and has a four-level hierarchical structure: the code of the Russian Federation,
the code of the region of the Russian Federation, the identification number of the division of the
settlement network of the Bank of Russia, and the identification number of the credit institution. The
Russian Classifier of Enterprises and Organizations is managed centrally by the Russian Statistics
Committee. In contrast to BIC, it contains the method for calculating the check digit of an enterprise's
or organization's code.
An identifier (e.g., ISBN) is managed by authorized organizations in a non-centralized manner. Unlike
the case of a classifier, identifier codes must follow the rules of check digit calculation. The rules
for identifier compilation are developed centrally and are maintained by the requirements of
standards or other regulatory documents. The main difference from a classifier is that a complete list
of identifiers is either not available, or not needed at the system design phase; the working list
is updated with individual codes during system operation.
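As an illustration of check digit rules, the following Python sketch validates the ISBN-10 scheme mentioned above, using one common formulation: the sum of each digit multiplied by its position (1 through 10) must be divisible by 11, with 'X' standing for the value 10 in the last position. The sample numbers are used only to show a passing and a failing case.

# Minimal ISBN-10 check-digit validation.

def isbn10_is_valid(isbn: str) -> bool:
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10:
        return False
    total = 0
    for position, ch in enumerate(digits, start=1):
        if ch == "X" and position == 10:
            value = 10          # 'X' encodes the value 10 in the check position
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += position * value
    return total % 11 == 0

print(isbn10_is_valid("0-471-20052-2"))   # True: weighted sum is divisible by 11
print(isbn10_is_valid("0-471-20052-3"))   # False: corrupted check digit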
Dictionary (e.g., Yellow Pages) is managed by a third party. The numbering code (telephone
number) is not subject to any rules.
A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither
checksum calculation algorithms nor coding rules are designed for a codifier. The encoding of the
months of the year is a simple example of a codifier.
An index may simply be a numeric value (for example, a tax rate) derived from an unstructured
document (order, law, act). A flat tax rate of 13% is an example of an index.
Glossaries contain abbreviations, terms and other string values that are needed during the generation
of forms and reports. The presence of these glossaries in the system provides a common terminology
for all input and output documents. Glossaries are so close in nature to metadata that it is sometimes
difficult to distinguish them.
Reference data (RD) and master data (MD)
In Russian literature there is a long-established concept of "normative-reference information"
(reference data), which appeared in the disciplines related to management of the economy back in the
pre-computer days [2]. The term "master data" comes from English-language documentation
and, unfortunately, has been used as a synonym for reference data. In fact, there is a significant
difference between reference data and master data.

Pic. 1. Data, reference data and master data

Pic.1 illustrates the difference between reference data, master data and transactional data in a
simplified form. In a hypothetical e-ticketing system the codifier of airports performs the role
of reference data. This codifier could be created by the developers of the system, taking into account
some specific requirements. But the airport code should be understandable to other international
information systems for flawless interaction between them. This purpose is achieved by the unique
three-letter airport code assigned to airports by the International Air Transport Association (IATA).
Passenger data are not as stable as airport codes. At the same time, once introduced into the
system, passenger data can be further used for various marketing activities, such as discounts
when a certain total flight distance is reached. Such information usually belongs to master data.
Master data may also include information about the crew, the company's fleet, freight and
passenger terminals, and many other entities involved in air transportation that are not considered
in the framework of our simplified example.
The top row in Pic.1 schematically depicts a transaction related to a ticket sale. There are relatively
few airports in the world and many more passengers, who can repeatedly use the services of the
company, but a ticket cannot and must not be reused. Thus, ticket sales data are the most frequently
changing transactional data for an airline company.
To sum up, we can say that reference data constitutes the base level of automated information
systems. Master data stores information about customers and employees, suppliers of products,
equipment, materials and other business entities. As reference data and master data have much
in common, in those cases where the considered factors relate both to reference data and to master
data, we will refer to them as "RD & MD", for example, an "RD & MD management system".
Enterprise RD & MD management
The most common and obvious issue of traditional RD & MD management is the lack of support
for data that changes over time. An address, as a rule, is one of the most important components of RD &
MD. Unfortunately, addresses change. A client can move to another street, but a whole building and
even a street can also "move". So, in 2009 the address of the "Tower on the waterfront" group of
buildings changed from "18, Krasnopresnenskaya embankment" to "10, Presnenskaya embankment".
Thus, the query "How much mail was delivered to the office of the company renting premises in the
Tower on the waterfront in 2009?" should correctly handle delivery records to two different
addresses.
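One common technological answer is to keep a stable surrogate key for the object and to attach a validity period to every address record. The Python sketch below illustrates this under invented names and dates: a query for 2009 then naturally spans both historical addresses of the same office.

# Time-versioned reference data: an office keeps a stable key, its address has a validity period.
from datetime import date

address_history = [  # (office_id, address, valid_from, valid_to)
    ("office-42", "18, Krasnopresnenskaya embankment", date(2005, 1, 1), date(2009, 6, 30)),
    ("office-42", "10, Presnenskaya embankment",       date(2009, 7, 1), date(9999, 12, 31)),
]

deliveries = [  # (address as written on the mail, delivery date, items)
    ("18, Krasnopresnenskaya embankment", date(2009, 3, 10), 120),
    ("10, Presnenskaya embankment",       date(2009, 11, 5), 95),
]

def addresses_of(office_id, start, end):
    """All addresses the office had at any point inside [start, end]."""
    return {
        addr for oid, addr, frm, to in address_history
        if oid == office_id and frm <= end and to >= start
    }

year_2009 = (date(2009, 1, 1), date(2009, 12, 31))
valid_addresses = addresses_of("office-42", *year_2009)
total = sum(items for addr, when, items in deliveries
            if addr in valid_addresses and year_2009[0] <= when <= year_2009[1])
print(total)  # 215: both historical addresses are attributed to the same office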
However, RD & MD management tools (hardware and software) themselves are not enough to
reflect real world changes in the IT system. Someone or something is needed to track changes. That
is, organizational measures are required, for example, qualified staff with proper responsibilities
relevant to adopted methodology of RD & MD management.
Thus, the enterprise RD & MD management includes three categories of activities:
1. Methodological activities that set guidelines, regulations, standards, processes and roles
which support the entire life cycle of the RD & MD.
2. Organizational arrangements that determine the organizational structure, functional units and
their tasks, roles and duties of employees in accordance with the methodological
requirements.
3. Technological measures, which lie at the IT level and ensure the execution of
methodological activity and organizational arrangements.
In this article we will primarily discuss technological measures, which include the creation of a
unified data model for RD & MD, management and archiving of historical RD & MD, identification
of RD & MD objects, elimination of duplicates, conflict identification of RD & MD objects,
enforcing referential integrity, support RD & MD objects life cycle, formulation of clearance rules,
creating a RD & MD management system, and its integration with enterprise production information
systems.
Technological shortcomings of RD & MD management
Let us consider in more detail the technological area of RD & MD infrastructure development and
associated disadvantages of the traditional RD & MD management.
No unified data model for RD & MD
A unified data model for RD & MD is missing or not formalized, which prevents the efficient use of
RD & MD objects and obstructs any automation of data processing. The data model is the basic and
most important part of RD & MD management, giving answers, for example, to the following
questions:
What should be included in the identifying attribute set of an RD & MD object?
Which of all the attributes of an RD & MD object should be attributed to the RD & MD and
stored in the data model, and what should be attributed to operational data and left in the
production system?
How to integrate the model with external identifiers and classifiers?
Does a combination of two attributes from different IT systems provide a third unique
attribute, important from a business perspective?
There is no single regulation of history and archive management
Historical information in existing enterprise IT systems is often maintained according to each system's
own regulations and has its own life cycles and its own responsibilities for the processing, aggregation
and archiving of RD & MD objects. Synchronization and archiving of historical data and bringing them
to a common view is a nontrivial task even with a common data model of RD & MD. An example of
the problems caused by the lack of historical reference data is provided in the section "Law compliance
and risk reduction".
The complexity of identifying RD & MD objects
RD & MD objects in various IT systems have their own identifiers - sets of attributes. Together, the
attributes can uniquely identify an RD & MD object in the information system, and such a set of
attributes can be treated as an analog of a composite primary key in the database. The situation
becomes more complicated when it is impossible to allocate a common set of attributes for the same
objects in different systems. In this case, the problem of identifying and comparing objects across
different IT systems changes from deterministic to probabilistic. High-quality identification of RD &
MD objects without specialized data analysis and processing tools is difficult in this case.
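A minimal Python sketch of such probabilistic matching is shown below: records are normalized, compared by a weighted similarity score and classified as a match, a possible duplicate for manual review, or distinct. The attribute names, weights and thresholds are purely illustrative; industrial matching engines are considerably more sophisticated.

# Fuzzy comparison of two customer records from different systems.
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    return " ".join(s.upper().replace(".", " ").replace(",", " ").split())

def similarity(a: dict, b: dict) -> float:
    """Weighted similarity over name and address."""
    name_sim = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    addr_sim = SequenceMatcher(None, normalize(a["address"]), normalize(b["address"])).ratio()
    return 0.6 * name_sim + 0.4 * addr_sim

def classify(score: float) -> str:
    if score >= 0.90:
        return "match"
    if score >= 0.75:
        return "possible duplicate (manual review)"
    return "distinct"

rec_a = {"name": "OOO Romashka",  "address": "10, Presnenskaya emb., Moscow"}
rec_b = {"name": "Romashka, OOO", "address": "10 Presnenskaya embankment, Moscow"}

score = similarity(rec_a, rec_b)
print(f"{score:.2f} -> {classify(score)}")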
The emergence of duplicate RD & MD objects
The complexity of object identification leads to the potential emergence of duplicates (or possible
duplicates) of the same RD & MD object in different systems, which is the main and most
significant problem for business. Duplication of information leads to duplication of the cost of object
processing, duplication of "entry points", and increased cost of maintaining the objects' life cycles.
Additionally, we have to mention the cost of manual reconciliation of duplicates, which is high, as it
often goes beyond the boundaries of IT systems and requires human intervention. It should be stressed
that the occurrence of duplicates is a systemic error that appears at the earliest steps of the business
processes which involve RD & MD objects. At the next stages of business process execution the
duplicate acquires bindings and attributes, so the situation becomes more complicated.
Metadata inconsistency of RD & MD
Each information system which supports a line of business of the enterprise generates RD & MD
objects specific to that business. Such an IT system defines its own set of business rules and constraints
applied both to the attribute composition (metadata) and to the attribute values. As a result, the
rules and constraints imposed by various information systems conflict with each other, thus
nullifying even theoretical attempts to bring all of the RD & MD objects to a single view. The
situation is exacerbated when, with outwardly matching data models, the data have the same semantic
meaning but different presentations: various spellings, permutations in addresses, name
reductions, different character sets, contractions and abbreviations.
Referential integrity and synchronization of RD & MD model
In real life, RD & MD objects located within their own IT systems contain not only values but also
references to other RD & MD objects, which may be stored and managed in separate external systems.
Here the problem of synchronization and integrity maintenance of the enterprise wide RD & MD model
arises in full. One common way of dealing with such problems is to switch to RD & MD that are
maintained outside the organization and imported from external sources.
Inconsistency of the RD & MD object life cycle
Because the same RD & MD object is present in a variety of enterprise systems, its entry and
modification in these systems are inconsistent and often stretched over time. The same object may even
be in mutually exclusive statuses in different systems (active in one, archived in another, deleted in a
third), making it difficult to maintain the integrity of RD & MD objects. Objects that are unlinked and
"spread" over time are difficult to use in both transactional and analytical processing.
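A small sketch of how mutually exclusive statuses of the same object could be detected across systems; the status names and the list of conflicting pairs are assumptions for illustration only.

    from enum import Enum

    class Status(Enum):
        CREATED = 1
        ACTIVE = 2
        ARCHIVED = 3
        DELETED = 4

    # Pairs of statuses considered mutually exclusive for one and the same object.
    CONFLICTS = {frozenset({Status.ACTIVE, Status.DELETED}),
                 frozenset({Status.ACTIVE, Status.ARCHIVED})}

    def lifecycle_conflicts(statuses_by_system: dict) -> list:
        """Return the pairs of systems whose statuses contradict each other."""
        items = list(statuses_by_system.items())
        return [(s1, s2)
                for i, (s1, st1) in enumerate(items)
                for s2, st2 in items[i + 1:]
                if frozenset({st1, st2}) in CONFLICTS]

    print(lifecycle_conflicts({"CRM": Status.ACTIVE,
                               "Billing": Status.ARCHIVED,
                               "ERP": Status.DELETED}))
    # [('CRM', 'Billing'), ('CRM', 'ERP')]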
Clearance rules development
RD & MD clearance rules are often, quite fairly, attributed to the methodological area. Of course,
IT professionals need a problem statement from business users, for example, on when airport codes
should be updated, or which of two payment orders carries the correct data encoding. But business
specialists are not familiar with the intricacies of the implementation of the IT systems they use.
Moreover, the documentation on these systems is often incomplete or missing. Therefore an analysis
of the information systems is required in order to clarify existing clearance rules and to identify new
rules where necessary.
Wrong core system selection for RD & MD management
Most often, the most significant sources and consumers of RD & MD are large legacy enterprise
information systems that form the core of the company's business. In real life, such a system is often
chosen as the "master system" for RD & MD management instead of creating a specialized RD & MD
repository. The fact that this role is irrelevant to the system's initial design is usually ignored. As a
result, any revision of such systems related to RD & MD turns into large and unnecessary spending.
The situation is exacerbated when qualitatively new features must be introduced along with the
development of RD & MD management subsystems: batch data processing, data formatting and
cleanup, and data steward assignment.
IT systems are not ready for RD & MD integration
In order to fully implement RD & MD management in existing enterprise IT systems, these systems
must be integrated. More often than not, this integration is needed not as a one-time local event but as a
change to the processes living inside the IT systems. Integration intended only to support the operational
mode is not enough: it also has to cover initial batch data loading (ETL) as well as the procedures for
manual data verification (reconciliation).
Not all automated information systems are ready for such changes, and not all systems provide such
interfaces; for most of them this is completely new functionality. During system implementation,
architectural issues arise related to the choice among different approaches to developing the RD & MD
management system and integrating it with the technological landscape of the enterprise. To confirm
the importance of this point, we note that there are designed and proven architectural patterns and
approaches aimed at the proper deployment and integration of RD & MD.
Examples of traditional RD & MD management issues
Thus, the main issues of RD & MD management arise from the decentralization and fragmentation of
RD & MD across the company's IT systems, and they manifest themselves in practice in concrete
examples.
Passport data as a unique identifier
For example, in a major bank, as a result of creating a customer data model, it was decided to use
passport data among the identifying attributes, assuming its maximum selectivity. During execution of
the client data merge procedures it was revealed that a customer's passport is not unique. For example,
customers who had dealt with the bank first using an old passport and then using a new one were
registered as different clients. Analysis of client records revealed instances where one passport had been
reported by thousands of customers. On top of that, one data source was a banking information system
in which the passport fields were optional and were filled with "garbage" during data entry.
It should be noted that the detected problems with customer data quality were not expected and
were found only at the data cleaning stage, which required additional time and resources to finalize
the data cleaning rules and improve the customer data model.
Address as a unique identifier
In another case, an insurance company merged customers' personal data using the address as an
identifying attribute. It was found that most clients were registered at the address "same" or "ibid."
The poor quality data were supplied by the application system that supports the activities of insurance
agents: the system allowed agents to interpret the fields of the client questionnaire freely, and it lacked
any logical or format validation of data input.
The need for mass contracts renewal
In the third case, when an existing enterprise CRM system was connected to the RD & MD management
system, it became clear only in the testing phase that the CRM system could not automatically accept
updates from the RD & MD management system. This required procedural actions, in this case inviting
the customer in and renewing the paper contract documents that mention critical information related to
RD & MD. Because of the large amount of work, both the technological and the organizational aspects
of RD & MD integration and usage had to be reconsidered.
Divergence of initially consistent data
The fourth example describes a situation typical of many organizations. As a result of rapid business
growth, the company decided to open a new line of business supporting B2C / B2B customer interaction
over the Internet. To do this, a new IT system automating the new business was acquired. During
deployment, integration with the existing enterprise RD & MD was required, so the existing master data
had to be extended with attributes specific to the new IT system. The lack of a dedicated RD & MD
management system made this task far from easy, so the RD & MD were loaded into the new system
once, without any feedback to the existing enterprise IT landscape. Some time later this led to two
independent versions of the client directories. Initially the problem was solved by manual handling of
customer data in spreadsheets, but after a while the number of customers grew considerably, the
customer directories "diverged", and manual processing proved ineffective and expensive. As a result,
the situation escalated into a serious problem at the level of business users, who no longer had an overall
picture of their customers for marketing campaigns.
Benefits of corporate RD & MD
Enterprise RD & MD management has the following advantages:
Law compliance and risk reduction
Profits increase and customer retention
Cost reduction
Increased flexibility to support new business strategies.
It sounds too good to be true, so let us consider each of these benefits using practical examples.
Law compliance and risk reduction
Prosecuting authorities demanded that a big company provide data for the past 10 years. The task
seemed simple and doable: the company had introduced procedures for regular archiving and backup of
data and applications long before, the storage media were kept in a secure room, and the equipment to
read them had not yet become obsolete. However, after the historical data were restored from the
archives it was revealed that they made no practical sense. The RD & MD had changed repeatedly
during this time, and it was impossible to determine what the data referred to. Nobody had foreseen
RD & MD archiving, because this part of the information had seemed stable at the time.
Major penalties were imposed on the company. The company's management responsible for these
decisions was changed. In addition, a unit responsible for RD & MD management was established to
avoid a repetition of such an unpleasant situation.
Profits increase and customer retention
A large flower shop was one of the first to realize the effectiveness of e-mail marketing. A web site
was created where marketing campaigns were run and where customers could subscribe to mailings
for Valentine's Day, the birth of a first child, the birthday of a loved one, and so on. Subsequently,
clients received greetings with proposals of flower arrangements. However, the advertising campaigns
were conducted with the assistance of various developers who created disparate applications unrelated
to each other. As a result, customers could receive up to ten letters on the same occasion, which annoyed
them and caused their outflow. Each successive advertising campaign not only turned out unprofitable,
but also reduced the number of existing customers. The flower shop had to spend considerable resources
to rework and integrate the applications. The high cost was related to the heterogeneity of customer
information and the multiple formats of names, addresses and telephone numbers, which caused big
problems in identifying customers in order to eliminate multiple entries.
Cost reduction
One of the main requirements for a company's products is the need to respond quickly to changes in
demand, to launch a new product to the market in a short time, and to communicate with consumers.
We see yesterday's undisputed leaders turn into laggards, while newcomers who bring their product to
market first greatly increase their profits and capitalization. Under these conditions, the various
corporate information systems responsible for developing a product, its supply and sales, service and
evolution should be based on a unified information base covering all lines of the company's business.
Then the launch of a new product to market requires less time and money thanks to seamless interaction
between the supporting information systems.
Increased flexibility to support new business strategies
Eliminating the fragmentation and decentralization of RD & MD makes it possible to provide
information as a service. This means that any IT system, following established communication protocols
and access rights, can query the enterprise RD & MD management system and obtain the necessary data.
A service oriented approach allows flexible data services to be built in accordance with changing
business processes, thus ensuring a timely response of IT systems and services to changing requirements.
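As a hedged illustration of "information as a service", the sketch below shows how a consuming IT system might query a central RD & MD service. The endpoint URL, the query parameter and the response format are entirely hypothetical and depend on the chosen product and protocol.

    import json
    import urllib.request

    MDM_ENDPOINT = "https://mdm.example.com/api/customers"   # hypothetical service URL

    def lookup_customer(tax_id: str) -> dict:
        """Query the central RD & MD service instead of a local copy of the directory."""
        url = f"{MDM_ENDPOINT}?tax_id={tax_id}"
        req = urllib.request.Request(url, headers={"Accept": "application/json"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))

    # Any IT system with the proper access rights could call, for example,
    #     customer = lookup_customer("7701234567")
    # and receive the single authoritative customer record.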
Architectural principles of RD & MD management
The basic architectural principles of master data management are published in paper [3]. Let us list
them briefly:
The MDM solution should provide the ability to decouple information from enterprise
applications and processes to make it available as a strategic asset of the enterprise.
The MDM solution should provide the enterprise with an authoritative source for master data
that manages information integrity and controls distribution of master data across the
enterprise in a standardized way that enables reuse.
The MDM solution should provide the flexibility to accommodate changes to master data
schema, business requirements and regulations, and support the addition of new master data.
The MDM solution should be designed with the highest regard to preserve the ownership of
data, integrity and security of the data from the time it is entered into the system until
retention of the data is no longer required.
The MDM solution should be based upon industry-accepted open computing standards to
support the use of multiple technologies and techniques for interoperability with external
systems and systems within the enterprise.
The MDM solution should be based upon an architectural framework and reusable services
that can leverage existing technologies within the enterprise.
The MDM solution should provide the ability to implement an MDM solution incrementally
so that it can demonstrate immediate value.
Based on the practical examples considered above, we can expand the list of architectural principles
with additional requirements for the RD & MD management system:
The master data system must be based on a unified RD & MD model. Without a unified data
model it is not possible to create and operate an RD & MD system as the single enterprise
source of master data.
Unified rules and regulations for managing master data history and archiving are needed.
Their purpose is to make it possible to work with historical data in order to improve the
accuracy of analytical processing, ensure law compliance and reduce risk.
An MDM solution must be capable of identifying RD & MD objects and eliminating
duplicates. Without identification it is impossible to build a unified RD & MD model or to
detect duplicates, which cause multiple "entry points" and increase the cost of object
processing and of maintaining the object life cycle.
RD & MD metadata must be consistent. Because of metadata mismatches, even if a unified
RD & MD model can be created, it will in fact be of low quality, since objects may be
duplicated due to differing definitions and presentations.
An MDM solution must support referential integrity and synchronization of RD & MD
models. Depending on the solution architecture, the RD & MD model may contain both
objects and links, so synchronization and integrity are necessary to support a unified
RD & MD model.
A consistent life cycle of RD & MD objects must be supported. An RD & MD object stored in
different IT systems at different stages of its life cycle (e.g., created, agreed, active, frozen,
archived, destroyed) essentially destroys the unified RD & MD model. The life cycle of
RD & MD objects must be expressed as a set of procedures and methodological and regulatory
documents approved by the organization.
The development of clearance rules for RD & MD objects and their correction must be
supported. This ensures the relevance of the unified RD & MD model, which may otherwise be
disrupted by changing business requirements and legislation.
It is necessary to create a specialized RD & MD repository instead of using existing
information systems as an RD & MD "master system". The result is flexibility and
performance of the RD & MD management system, data security and protection, and improved
availability.
The RD & MD management system must take into account that IT systems may not be ready
for RD & MD integration. Systems integration requires effort on both sides: the existing
systems should be further developed to meet the requirements of the centralized RD & MD.
Conclusion
The practice of creating RD & MD systems discussed in this paper shows that a company attempting to
develop and implement such an enterprise level system on its own faces a number of problems that lead
to significant material, labor and time costs.
As follows from the case studies, the main RD & MD technological challenges are caused by
decentralization and fragmentation of RD & MD across the enterprise. To address these challenges,
requirements for an RD & MD management system have been proposed and formulated.
The following articles will discuss tools that can facilitate the creation of an enterprise RD & MD
management system, the main stages of its implementation, and the roles at various phases of the
RD & MD life cycle.
Literature
1. Asadullaev S., Data, metadata and master data: the triple strategy for data warehouse projects,
09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html
2. Kolesov A., Technology of enterprise master data management, PC Week/RE, 18(480),
24.05.2005, http://www.pcweek.ru/themes/detail.php?ID=70392
3. Oberhofer M., Dreibelbis A., An introduction to the Master Data Management Reference
Architecture, 24.04.2008, http://www.ibm.com/developerworks/data/library/techarticle/dm-0804oberhofer/
Data quality management using IBM Information Server
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
08.12.2010
http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html
Abstract
Data integration projects often fail to provide users with data of the required quality. The reasons are the
lack of established rules and processes for improving data quality, the wrong choice of software, and
insufficient attention to how the work is organized. The belief that data quality can be improved after the
project is completed is also widespread.
The aim of this study is to determine the required process of data quality assurance, to identify the
roles and qualifications, as well as to analyze the tools for interaction between participants in a data
quality improvement project.
Introduction
Data quality has a crucial impact on the correctness of decision making. Inaccurate geological data
can lead to the collapse of high-rise buildings; low-quality oil and gas exploration data cause
significant losses due to incorrectly assessed effects of well drilling; incomplete data on a bank's
customers are a source of errors and losses. Other examples of serious consequences of inadequate
data quality are published in [1].
Despite apparent agreement on the need to improve data quality, the intangibility of data quality as an
end product raises doubts about the advisability of spending on this work. Typically a customer,
especially one from financial management, asks what profit the organization will gain on completion of
the work and how the result can be measured.
Some researchers identify up to 200 data quality characteristics [2], so the absence of precise quality
criteria also hinders the deployment of data quality improvement work.
An analogy with a water supply may clarify the situation. Every end user knows that he needs water
suitable for drinking, but he does not necessarily understand the chemical, organoleptic,
epidemiological and other requirements for water.
Similarly, an end user does not have to understand what technology, engineering structures and
equipment are needed for water purification. Their purpose is to take water from designated sources,
treat it in accordance with the requirements, and deliver it to consumers.
To sum up, achieving the required data quality means creating an adequate infrastructure and setting up
the required procedures. That is, the customer does not receive "high-quality data in a box" (the
equivalent of a bottle of drinking water), but the processes, tools and methods for their preparation
(the equivalent of a town water supply).
The aim of this study is to determine the data quality improvement process, to identify the needed
roles and qualifications, and to analyze the tools for data quality improvement.
Metadata and project success
I first faced the metadata problem in an explicit form in 1988, when I was the SW development manager
for one of the largest petrochemical companies. In a simplified form, the task was to enter a large
amount of raw data manually, to apply complicated and convoluted algorithms to the input data, and to
present the results on screen and on paper. The complexity of the task and the large amount of work
required that various parts of the task be performed by several parallel workgroups of customers and
developers. The project ran fast, and customer representatives regularly received working prototypes of
the future system. Discussion of the prototypes took the form of lengthy debates on the correctness of
the calculations, because no one could substantiate their doubts or identify why the experts rejected the
results. That is, the results did not correspond to the customers' intuitive understanding of the expected
values.
In this connection we performed a data and code review to check the consistency of the input and output
forms and the data processing algorithms. Imagine our surprise when we discovered that the same data
had different names in the input and output forms. This "discovery" compelled us to change the
architecture of the system under development (moving all the names first into a separate file and later
into a dedicated database table) and to reexamine the data processing algorithms. Indeed, different
names for the same data led to different understandings of their meaning and to different algorithms for
processing them.
The corrections made it possible to substantiate the correctness of the calculations and to simplify the
support of the developed system. To change an indicator's name, the customer only had to change one
text field in one table, and the change was reflected in all forms.
This was the first, but not the last, time that metadata had a critical influence on project success. My
further practice of data warehouse development reaffirmed the importance of metadata more than once.
Metadata and master data paradox
The need to maintain metadata was stressed even in the earliest publications on data warehouse
architecture [3]. At the same time, master data management as part of the DW development process was
not considered until recently.
Paradoxically, master data management was quite satisfactory, while metadata management was
simply ignored. Perhaps the paradox can be explained by the fact that a DW is usually implemented
on relational databases, where the third normal form automatically leads to the need for master data
management. The lack of off-the-shelf products on the software market also meant that companies
experienced difficulties in implementing enterprise metadata management.
Metadata are still out of the focus of developers and customers, and ignoring them is often the cause of
DW project delays, cost overruns, and even project failure.
Metadata impact on data quality
Many years ago, reading Dijkstra, I found his statement: "I know one -very successful- software
firm in which it is a rule of the house that for one year project coding is not allowed to start before
the ninth month! In this organization they know that eventual code is no more than the deposit of
your understanding." [4]. At that moment I could not understand what one could do for eight months
without programming, without demonstrating working prototypes to the customer, without improving
the existing code based on discussions with the customer.
Now, hopefully, I can guess what the developers were occupied with for those eight months. I believe
that the understanding of a solution is best formalized through metadata: a data model, a glossary of
technical terms, source descriptions, data processing algorithms, application launch schedules,
identification of responsible personnel, access requirements... All this and much more is metadata.
In my opinion, one of the best definitions of a specification of a system under development is given
in [5]: "A specification is a statement of how a system - a set of planned responses - will react to
events in the world immediately outside its borders". This definition shows how closely related metadata
and the system specification are. In turn, there are close links between metadata, data, and master data
[6]. This gives reason to believe that the better the metadata are worked out, the higher the quality of the
system specification and, under certain circumstances, the higher the data quality.
Data quality and project stages
Data quality must be ensured at all stages of the problem statement, design, implementation and
operation of an information system.
The problem statement is eventually expressed in formulated business rules, adopted definitions,
industry terminology, a glossary, identification of data origins, and data processing algorithms described
in business language. This is business metadata. Thus, the problem statement is a definition of business
metadata. The better the problem statement and the definition of business metadata are performed, the
better the data quality that the designed IT system can provide.
IT system development involves naming entities (such as tables and columns in a database), identifying
the links between them, and programming data processing algorithms in accordance with the business
rules. Thus the following statements are equally true:
1. Technical metadata appear during the development phase;
2. Development of the system is the definition of technical metadata.
Documenting the design process establishes the personal responsibility of each team member for the
results of his work, which leads to improved data quality thanks to the existence of project metadata.
Deviations from established regulations may happen during system operation. Operational metadata,
such as user activity logs, computing resource usage, and application statistics (e.g., execution
frequency, record counts, component analysis), allow not only the identification and prevention of
incidents that lead to data quality deterioration, but also improvement of user service quality through
optimal utilization of resources.
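A minimal sketch of how business, technical and operational metadata for one indicator might be tied together; the structure and field names are illustrative assumptions and do not reproduce the repository model of any specific tool.

    from dataclasses import dataclass, field

    @dataclass
    class BusinessTerm:            # business metadata: the problem statement side
        name: str
        definition: str
        calculation_rule: str

    @dataclass
    class TechnicalAsset:          # technical metadata: the development side
        table: str
        column: str
        datatype: str

    @dataclass
    class MetadataEntry:
        term: BusinessTerm
        implemented_by: TechnicalAsset
        run_stats: dict = field(default_factory=dict)   # operational metadata

    profit = MetadataEntry(
        term=BusinessTerm("Net profit", "Revenue minus expenses and taxes",
                          "SUM(revenue) - SUM(expenses) - SUM(taxes)"),
        implemented_by=TechnicalAsset("FIN_RESULTS", "NET_PROFIT", "DECIMAL(15,2)"),
        run_stats={"last_load": "2010-12-01", "rows_loaded": 125000},
    )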
Quality management in metadata life cycle
The extended metadata life cycle [7] consists of the following phases: analysis and understanding,
modeling, development, transformation, publishing, consuming, reporting and auditing,
management, quality management, and ownership (Pic. 1).
The quality management stage addresses lineage of heterogeneous data in data integration processes,
quality improvement of information assets, and input data quality monitoring, and allows data structure
and processability issues to be eliminated before they affect the project.
Pic.1. Extended metadata management life cycle
Data flows and quality assurance
At first glance, the role of the quality management stage is not remarkable. However, if we use the role
descriptions [7, 8] and draw up Table 2, which shows the task description for each role at each stage of
the metadata management life cycle, it becomes evident that all of the project's tasks can be divided into
two streams.
The first stream, directed along the table's diagonal, contains the activities aimed at creating the
functionality of the metadata management system.
The second stream consists of tasks to improve data quality. It should be noted that all project
participants contribute to data quality improvement if the project team is selected properly.
Let us consider the flow of data quality improvement tasks in more detail. In practice, four
indicators of data quality are usually discussed: comprehensiveness, accuracy, consistency and
relevance [9].
Comprehensiveness implies that all required data are collected and presented. For example, a client
address may omit a supplementary house number, or a patient's medical history may be missing a
record of a disease.
Accuracy indicates that the presented values (e.g., a passport number, a loan period, or a departure date)
do not contain errors.
Consistency is closely related to metadata and influences data understanding. Examples are dates in
different formats, or a term such as "profit", which is calculated differently in different countries.
Relevance is associated with timely data updates. A client can change his name or get a new passport;
a well's production rate might change over time. In the absence of timely updates the data may be
complete, accurate and consistent, but out of date.
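These four indicators can be made concrete with simple record-level checks, as in the sketch below; the particular rules (the required fields, a ten-digit passport number, one agreed date format, a one-year relevance window) are assumptions chosen for illustration, not a complete quality rule set.

    from datetime import date, datetime

    REQUIRED = ("name", "passport", "address", "updated")

    def comprehensiveness(rec: dict) -> bool:          # all required fields present
        return all(rec.get(f) for f in REQUIRED)

    def accuracy(rec: dict) -> bool:                   # passport: 10 digits (example rule)
        value = str(rec.get("passport", ""))
        return value.isdigit() and len(value) == 10

    def consistency(rec: dict) -> bool:                # one agreed date format
        try:
            datetime.strptime(rec.get("updated", ""), "%Y-%m-%d")
            return True
        except ValueError:
            return False

    def relevance(rec: dict, max_age_days: int = 365) -> bool:
        updated = datetime.strptime(rec["updated"], "%Y-%m-%d").date()
        return (date.today() - updated).days <= max_age_days

    rec = {"name": "Ivanov I.", "passport": "4509123456",
           "address": "Moscow, Tverskaya 1", "updated": "2010-11-30"}
    print(comprehensiveness(rec), accuracy(rec), consistency(rec), relevance(rec))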
Changes of requirements, which inevitably accompany IT system development, can, like any changes,
lead to a result opposite to the desired one.
Completeness of data may suffer from an inaccurate problem statement.
Data accuracy can be reduced as a result of an increased load on the employee responsible for
manual data entry.
Consistency can be impaired by the integration of a new system with a different understanding of
the data (metadata).
Relevance of data can be compromised by the inability to update data in a timely manner due to
insufficient throughput of the IT system.
So IT professionals responsible for change management (for example, the project manager) should
analyze the impact of changes on the IT environment.
Discrepancies between the glossary and database columns lead to data consistency violations, which are
essentially metadata contradictions. Since identifying these conflicts requires an understanding of both
the subject area and IT technologies, a business analyst who can gain complete visibility of the actual
state of the data must be involved at this step.
Revealed discrepancies may require updates to the business classification, which must be performed by
a subject matter expert.
Consistency as a data quality indicator requires the elimination of discrepancies in metadata. This work
should be performed by a data analyst.
Enterprise data used in the company's business are its most important information assets, or data
resources. The quality of these resources has a direct impact on business performance and is a concern
of, among others, IT developers, who can use design tools for managing and understanding the
information assets that are created and made available through IBM Information Server. Thus, an IT
developer helps ensure data comprehensiveness, accuracy, consistency and relevance.
Business users have tools to track data lineage, which allows missing data to be identified and
comprehensiveness to be ensured.
Stewards maintain data consistency by managing metadata to support a common understanding of data
meaning by all users and project participants, and they monitor the comprehensiveness, accuracy and
relevance of the data.
Table 2. Data flows and quality assurance
Roles, interactions and quality management tools
Picture 2 shows the interaction pattern between the roles and the tools used [8]. Tasks related to data
quality improvement and discussed in the previous section are highlighted.
Groups of tasks related to one role are enclosed in a dotted rectangle. Interactions between the roles
form a workflow, whose direction is marked by arcs with arrows. Let us consider in more detail the
tools and the tasks performed by each role.
The project manager, who is responsible for the change management process, analyzes the impact of
changes on the IT environment with the help of Metadata Workbench.
The business analyst reveals contradictions between the glossary and database columns and notifies
metadata authors using the functionality of Business Glossary and FastTrack. Data analysis tools built
into QualityStage help the business analyst reach full visibility of the actual state of the data.
The subject matter expert (the metadata author) uses Business Glossary to update the business
classification (taxonomy), which supports a hierarchical structure of terms. A term is a word or phrase
that can be used to classify and group objects in the metadata repository. If joint work of several experts
is necessary, Business Glossary provides them with collaboration tools for annotating data definitions,
editing descriptions and categorizing them.
Using Business Glossary and Rational Data Architect, the data analyst eliminates the conflicts between
the glossary and the database tables and columns that were identified by the business analyst.
Metadata Workbench provides the IT developer with tools for metadata review, analysis, design and
enrichment, and allows him to manage and understand the information assets that were created and are
available through IBM Information Server.
Business users, who are responsible for compliance with legislative requirements, are able to trace data
lineage using the appropriate Metadata Workbench tools.
A common understanding of data meaning by all users and project participants is supported by stewards
with the help of Information Analyzer.
Necessary and sufficient tools
As follows from the analysis, the IBM Information Server product family provides all participants
with the necessary tools to ensure data quality.
Information is extracted from a source system and then evaluated, cleaned, enriched, consolidated and
loaded into the target system. Data quality improvement is carried out in four stages, illustrated by the
sketch after the list below.
1. The research stage is performed in order to fully understand the information.
2. The standardization stage reformats data from different systems and converts them to the
required content and format.
3. The matching stage ensures data consistency by linking records from one or more data
sources that relate to the same entity. This stage is performed in order to create semantic
keys for identifying information relationships.
4. The survival stage ensures that the best available data survive and that the data are prepared
correctly for transfer to the target system. This stage is required to obtain the best
representation of interrelated information.
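A toy sketch of the standardization, matching and survival steps (the research step is the profiling work that precedes them); the normalization rules, the semantic key built from the surname and phone tail, and the "most recently updated value wins" survival rule are assumptions for illustration and do not reproduce the behavior of QualityStage.

    def standardize(rec: dict) -> dict:
        """Bring free-form fields to one agreed representation."""
        out = dict(rec)
        out["name"] = " ".join(rec["name"].replace(".", ". ").split()).title()
        out["phone"] = "".join(ch for ch in rec.get("phone", "") if ch.isdigit())
        return out

    def match_key(rec: dict) -> tuple:
        """Semantic key used to link records describing the same entity."""
        surname = rec["name"].split()[0].lower()
        return (surname, rec["phone"][-7:])

    def survive(records: list) -> dict:
        """Keep the best available values: here, the most recently updated ones."""
        best = {}
        for rec in sorted(records, key=lambda r: r["updated"]):
            best.update({k: v for k, v in rec.items() if v})
        return best

    source_a = {"name": "ivanov i.", "phone": "+7 (495) 123-45-67", "updated": "2010-10-01"}
    source_b = {"name": "Ivanov I",  "phone": "4951234567", "email": "i@example.com",
                "updated": "2010-12-01"}

    a, b = standardize(source_a), standardize(source_b)
    if match_key(a) == match_key(b):       # the two records describe the same entity
        golden = survive([a, b])
        print(golden)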
Pic. 2. Roles, interactions and quality management tools
Thus, the IBM Information Server family is a necessary tool for ensuring data quality, but not always a
sufficient one, since in some cases additional instruments are needed for master data quality assurance.
The issues of master data quality assurance will be discussed in future articles.
Conclusion
Data quality assurance is a complex process that requires the involvement of all project participants.
The impact of metadata quality is extremely high, so it is important to ensure quality management
within the metadata life cycle. The analysis showed that, when used properly, the IBM Information
Server family creates a workflow that ensures data quality. IBM Information Server's tools provide each
employee involved in a data integration project with quality management instruments and ensure
effective interaction within the project team.
Literature
1. Redman T.C. Data: An Unfolding Quality Disaster. Information Management Magazine,
August 2004. http://www.information-management.com/issues/20040801/1007211-1.html
2. Wang, R., Kon, H. & Madnick, S. Data Quality Requirements Analysis and Modeling, Ninth
International Conference of Data Engineering, 1993, Vienna, Austria.
3. Hackathorn R. Data Warehousing Energizes Your Enterprise, Datamation, Feb.1, 1995, p. 39.
4. Dijkstra E.W. "Why is software so expensive?", in "Selected Writings on Computing: A
Personal Perspective", Springer-Verlag, 1982, pp. 338-348
5. DeMarco T. The Deadline: A Novel About Project Management, Dorset House Publishing
Company, Incorporated, 1997
6. Asadullaev S. Data, metadata and master data: the triple strategy for data warehouse projects,
09.07.2009. http://www.ibm.com/developerworks/ru/library/r-nci/index.html
7. Asadullaev S. Metadata Management Using IBM Information Server, 30.09.2008.
http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html
8. Asadullaev S. Incremental implementation of IBM Information Server's metadata management
tools, 21.09.2009, http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html
9. Giovinazzo W. BI: Only as Good as its Data Quality, Information Management Special
Reports, August 18, 2009.
http://www.information-management.com/specialreports/2009_157/business_intelligence_bi_data_quality_governance_decision_making-10015888-1.html
Primary data gathering and analysis system - I
Problem formulation, data collecting and storing
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
17.02.2011
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html
Abstract
A standard solution for primary data collection, storage and analysis is proposed. The solution is
based on manual input using IBM Forms and on IBM InfoSphere Warehouse for storage. The analysis
of the collected data by means of IBM InfoSphere Warehouse analytical tools and IBM Cognos is
discussed in the second part of the article [1].
The proposed approach can be used as a basis for a variety of solutions for different industries and
enterprises.
Introduction
A distribution company purchases goods and distributes them to regional customers. Planning new
purchases requires information from the regions on the goods balance. Customers' representatives enter
these data manually, and despite training, instructions and reminders, the data input does not meet
expectations. As a result, a whole department in the central office verifies the collected data by phone
together with the regional representatives.
A transportation company has an extensive fleet of cargo vehicles. Despite the presence of automatic
diagnostic tools, the technical inspection data for the rolling stock are recorded manually on paper forms
and only later entered into the computer. In the time between problem detection and the transfer of data
from paper to the information system, a defective vehicle can be sent for loading, accidentally or
intentionally. The fines specified for this situation cause losses to the transportation company and
generate profits for the loading firm that can hardly be called fair.
A federal agency collects reports from the regions on a regular basis. Experts in the regions are trained,
and they use the acquired skills to provide the agency with data that make their region look decent. The
real picture, of course, is blurred, and the agency operates with inaccurate data.
The list of similar examples can be continued. An offender can steal a car in one region, commit a
robbery in a second, and illegally purchase a weapon in a third, yet his crimes cannot be combined into
one case due to small errors in the recording of incidents. Or an employer can reside in one region, open
a business in another, and pay taxes in a third.
These examples are united by the same functional requirements:
Collection of primary, statistical, or reporting data from remote locations is required;
The data must be checked at the workplace before being sent to the center;
The collected data must be cleansed and stored for a period defined by regulations;
Management statements must be produced to assess the state of affairs in the regions;
Analysis based on the collected statistics must be performed to identify regular patterns and to
support management decisions.
In this paper we consider a standard solution for the collection, storage and analysis of manually entered
primary data. Despite obvious limitations, this approach can be used as a basis for various solutions for
different industries and enterprises.
System requirements
Let us consider the most general requirements typical of such systems. Certainly, this task formulation
is somewhat artificial; at the same time it gets rid of the details that are specific to real systems but are
not central to the typical task of primary data collection, storage and analysis.
Information systems operating in the regions are external to the "Collection and analysis of primary
data" system being developed and do not interact with it.
The required information must be entered manually using on-screen forms. The forms should reproduce
the existing approved paper data collection forms as accurately as possible. Data input error checks
should be performed before the filled e-form is sent.
The completed e-form is always sent from the region to the central office, and the form is resubmitted as
a whole if data input errors are revealed.
The completed e-form does not have to be kept in its entirety for audit or legal purposes; only the data
contained in the e-form fields are retained. Therefore there is no need to send the entire e-form to the
center; only the data extracted from it can be sent.
Consumers of internal management statements are employees and executives. Management statements
are provided as on-screen reports and as hard copies.
External users of reports are the staff and management of superior bodies and cooperating organizations
and agencies. Statements for external organizations are provided in hard copy.
The number of users in each region is limited to one user (a data input clerk) at any one time, so one
workplace is enough to operate the system in each region. The number of regions is 1000.
The number of simultaneous users of the system in the head office can be estimated at 100 analysts.
The number of approved input forms is 10 daily forms with 100 fields each.
To evaluate the data streams one can confine oneself to the following approximate values. As
experience shows, one field corresponds on average to 3 - 5 KB of form data.
Thus the size of one form can be estimated at 300 - 500 KB, and the daily flow from a single location at
about 3 - 5 MB per day. Given that the e-forms are filled in by hand during the workday, the minimum
required connection throughput must allow the transfer of about one form per hour, that is, about
1 kbit/s. The total daily flow from the regions is 3 - 5 GB per day.
In case of insufficient throughput, the peak data flow can be reduced by exploiting the difference in
regional time zones and by an approved data transmission schedule.
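The estimates above can be reproduced with a short back-of-the-envelope calculation; the figures are the ones assumed in these requirements (10 daily forms of 100 fields, 3 - 5 KB per field, 1000 regions).

    FIELDS_PER_FORM = 100
    KB_PER_FIELD = (3, 5)            # empirical range assumed above
    FORMS_PER_DAY = 10
    REGIONS = 1000

    form_kb = tuple(k * FIELDS_PER_FORM for k in KB_PER_FIELD)          # 300-500 KB per form
    region_mb_day = tuple(f * FORMS_PER_DAY / 1024 for f in form_kb)    # ~3-5 MB per region/day
    total_gb_day = tuple(m * REGIONS / 1024 for m in region_mb_day)     # ~3-5 GB in total
    # one 300-500 KB form transferred over an hour needs roughly 0.7-1.1 kbit/s
    kbit_s = tuple(f * 1024 * 8 / 3600 / 1000 for f in form_kb)

    print(form_kb, region_mb_day, total_gb_day, kbit_s)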
The storage period for on-line access is 5 years, after which the data are transferred to the archives. The
storage period in the archives is 50 years.
Backup and restore tools and procedures should be provided.
The telecommunication infrastructure (active and passive network equipment, communication lines and
channels) is beyond the scope of the project.
The proposed solution must be expandable and scalable. For example, integration of the "Collection and
analysis of primary data" system with a document workflow system should be anticipated.
Project objectives
The following tasks should be performed within the project:
Development of e-forms for approved paper forms
Development of e-forms for new paper forms
Development of storage for detailed data
Development of analytical tools
Development of reporting and visualization tools
Information Security
Data back-up
Data archiving
Logging of system events
Development of e-forms for approved paper forms
On-screen forms should be designed to correspond to the approved paper forms of statistical indicators.
Data entry e-forms should be offered for both offline and online modes; online is the primary mode for
data entry.
Development of e-forms for new paper forms
Developers should be provided with simple and intuitive design tools for creating new data collection
e-forms to extend the production system. There is no need for specially simplified form development
tools; standard tools can be used for this purpose.
New forms can be developed both by internal staff and by third-party organizations. The customer must
be completely independent of any external development company, so as to be able to change external
developers or give them direct access to the tools for developing and testing e-forms and applications.
Development of storage for detailed data
As the collected detailed data are subject oriented, integrated, time-variant and non-volatile, a data
warehouse rather than an ordinary database is proposed for data storage. A traditional database is
oriented toward executing a large number of short transactions, while analytical tasks require a
relatively small number of queries over large volumes of data; a data warehouse meets these conditions.
The data warehouse should focus on storing not the individual e-forms, but the data from which an
e-form can be assembled, so the form fields must be mutually agreed upon; intrinsically, the e-form
should be derived from these agreed data.
We highly recommend using a ready-made data model adapted to the needs of the task. In case of
specific requirements, the data warehouse model will be modified jointly with the customers.
It is not necessary to store the history of e-forms or the algorithms for e-form calculation and assembly.
Data can be aggregated into coarser-period indicators for long term storage. For example, data with a
storage period of more than 5 years can be combined into 5-year indicators; data with a storage period of
more than 1 year can be combined into yearly indicators; data with a storage period of less than 1 year
can be left as monthly indicators.
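A sketch of this aggregation idea; the retention thresholds and the use of simple yearly and five-year sums are illustrative assumptions.

    from collections import defaultdict

    def aggregate_for_archive(monthly_values: dict, retention_years: int) -> dict:
        """monthly_values maps 'YYYY-MM' to a numeric indicator value."""
        if retention_years < 1:
            return dict(monthly_values)                  # keep monthly granularity
        bucket_years = 5 if retention_years > 5 else 1   # 5-year or yearly buckets
        totals = defaultdict(float)
        for month, value in monthly_values.items():
            year = int(month[:4])
            bucket = year - year % bucket_years if bucket_years == 5 else year
            totals[str(bucket)] += value
        return dict(totals)

    monthly = {"2009-01": 10.0, "2009-02": 12.5, "2010-01": 11.0}
    print(aggregate_for_archive(monthly, retention_years=3))   # {'2009': 22.5, '2010': 11.0}
    print(aggregate_for_archive(monthly, retention_years=10))  # {'2005': 22.5, '2010': 11.0}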
Development of analytical tools
Tools for analytical calculations based on the gathered statistics should provide the following
capabilities:
Fairly simple statistical analysis, for example, calculating the efficiency of usage of various
resources;
Scenario and forecast calculations.
Development of reporting and visualization tools
Reporting and visualization tools should provide:
On-screen report generation
Paper report generation
Report visualization in graphical form through a web interface (browser)
Grouping of graphs into a desired set (control panel or dashboard).
Information security
Since the system for data collection, storage and analysis should not contain sensitive data, information
security will be provided by the built-in tools of the operating system, databases and data warehouses,
application servers and applications.
Data back-up
Backups should be performed by means of the built-in tools of the databases and the data warehouse.
Data archiving
Data archiving is necessary for long term storage. The retention period is currently defined as 50 years.
Perhaps in the future it will be necessary to reduce the reporting forms to a coarser granularity, that is,
to combine monthly statements into yearly statistics, and yearly data into statistics for periods of several
years.
Logging system events
Source data input logging should be provided to resolve possible disputes over the non-receipt of data
sent by a user from a region.
Success criteria
Data collection e-forms must comply with the approved list of data input forms.
The implemented procedures should follow the business process of collecting, processing, storing and
disseminating information agreed with the customer.
Electronic and paper output forms must conform to the approved list of management statement forms.
Logging of data input processes should be ensured to track the timeliness of reporting.
The reliability of the delivered information should not be worse than the quality of the collected data.
Architecture of system for data collection, storage and analysis
In this paper we consider a typical task without taking into account the specific requirements of
various projects. Therefore, the proposed architecture is based on the simplest software configuration.
Data are collected with the help of IBM Lotus Forms, and storage, analysis and reporting are
implemented using IBM InfoSphere Warehouse. The architecture also includes IBM Cognos software
for corporate performance management and data interpretation.
Separating the subsystems for data entry, collection, storage and analysis allows different architectures
to be constructed, depending on the needs of the task and the requirements of the enterprise
infrastructure. A centralized architecture for data collection, storage and analysis is shown in Pic. 1.
This architecture assumes that data input can be carried out remotely, and that all servers for data
acquisition (Lotus Forms), data storage (InfoSphere Warehouse), and data analysis and interpretation
(Cognos) are installed in a single data center. Analysts can work both locally and remotely, using the
Web interface provided by Cognos to prepare and execute analytical calculations.
A distributed Lotus Forms server architecture can be created if region-specific forms must be filled in.
In this case, initial form processing should be implemented at the regional level, and the gathered data
are sent to the central office where the data storage servers reside.
A combination of large volumes of regional data and poor telecommunication lines may require a
solution in which both form processing and data storage are decentralized.
Analytical work with a large number of ad hoc queries may require a distributed infrastructure of
Cognos servers. In this case, data from the centralized repository can be transmitted in advance to the
regional centers where Cognos servers are deployed. This architecture provides acceptable response
times and high-performance execution of analytical tasks in the regions, even in the absence of
high-speed communication channels.
Various options for the architecture of the data collection, storage and analysis system will be discussed
in more detail in a separate article.
Another advantage of the proposed modular system is the possibility of expanding its functionality.
Since all modules interact via standard protocols, it is possible to integrate the system with document
management, metadata and master data management, and enterprise resource planning systems, as
well as with a variety of analytical and statistical packages.
Pic.1. Centralized architecture of system for collecting, storing and analyzing data
Data collection
Lotus Forms is a set of products that enables organizations to use e-forms for manual data entry and to
transfer the collected data to other systems [2]. Lotus Forms Server can be further integrated with data
repositories (e.g., IBM DB2, Oracle, and MS SQL Server) and with a variety of document management
systems and document repositories (for example, IBM FileNet).
The architecture of primary data collection based on Lotus Forms is shown in Pic. 2.
Pic. 2. Primary data collection using Lotus Forms
The forms designer prepares e-forms for data entry using Lotus Forms Designer. The e-forms are stored
in the forms repository in the XFDL format [3, 4], which has been submitted to the W3C as a Note.
The application developer develops the Forms application logic, the Webform Server servlets and the
mapping for Transformation Extender (TX), which associates the form fields with values in the
database.
A translator converts the e-form from XFDL to HTML and JavaScript for users who work with a thin
client (browser).
Users who have installed Lotus Forms Viewer (a thick client) may work with e-forms in XFDL format,
bypassing the translation to HTML.
Users in the regions enter data by means of Lotus Forms Viewer or a browser. The data can pass through
several stages of verification:
On the user's computer during form filling, using the form's built-in logic
On the Lotus Forms Server, invoking the application logic
When the data are loaded into the database.
The data can be transmitted to the InfoSphere Warehouse database through the CLI, ODBC, and JDBC
interfaces.
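A minimal sketch of the last two verification stages: a server-side field check followed by loading only the extracted field values into the warehouse. The table name, the validation rules and the use of a generic Python DB-API connection (SQLite as a stand-in) are assumptions; in the architecture described here the actual transfer would go through the Lotus Forms application logic and the CLI, ODBC or JDBC interfaces.

    import sqlite3   # stand-in for a DB-API connection to the warehouse database

    RULES = {
        "region_code":   lambda v: v.isdigit() and len(v) == 3,
        "report_date":   lambda v: len(v) == 10 and v[4] == "-" and v[7] == "-",
        "goods_balance": lambda v: v.replace(".", "", 1).isdigit(),
    }

    def validate(fields: dict) -> list:
        """Server-side check, mirroring the built-in checks on the user's workstation."""
        return [name for name, rule in RULES.items()
                if name not in fields or not rule(str(fields[name]))]

    def load(conn, fields: dict) -> None:
        """Store only the extracted field values, not the whole e-form."""
        conn.execute("INSERT INTO primary_data (region_code, report_date, goods_balance) "
                     "VALUES (?, ?, ?)",
                     (fields["region_code"], fields["report_date"], fields["goods_balance"]))
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE primary_data (region_code TEXT, report_date TEXT, goods_balance REAL)")
    form_fields = {"region_code": "077", "report_date": "2011-02-17", "goods_balance": "1250.5"}
    errors = validate(form_fields)
    if not errors:
        load(conn, form_fields)
    else:
        print("form rejected, invalid fields:", errors)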
Data storage
IBM InfoSphere Warehouse Enterprise Edition [5, 6] consists of the following products:
InfoSphere Warehouse Design Studio, which includes the IBM Data Server Developer Workbench
and a subset of IBM Rational Data Architect components.
InfoSphere Warehouse SQL Warehousing Tool
InfoSphere Warehouse Administration Console, which is part of the Integrated Solutions
Console.
DB2 Enterprise Server Edition for Linux, UNIX and Windows
InfoSphere Warehouse Cubing Services
DB2 Query Patroller
InfoSphere Warehouse Intelligent Miner
IBM Alphablox and companion documentation
WebSphere Application Server
The architecture of data storage in the IBM InfoSphere Warehouse is shown in Pic.3.
Design Studio provides a common design environment for creating physical models, OLAP cubes
and data mining models, for designing data flows and SQL control flows, and for building Alphablox
Blox Builder analytical applications. Design Studio is based on the open source Eclipse platform.
The application developer builds applications using InfoSphere Warehouse Design Studio and
deploys them on the server, providing data processing in accordance with the required business
logic.
The SQL Warehousing Tool (SQW) is a graphical tool that replaces manual SQL coding by generating
the SQL code needed to support and administer the data warehouse. Based on the visual flow of
statements modeled in Design Studio, SQW automatically generates DB2-specific SQL code. The
integration of SQW with IBM WebSphere DataStage extends the development capabilities of
analytical systems based on DB2.
In this project, e-forms filled out according to strict rules are the only data source, so at this stage
there is no need for Extract, Transform and Load (ETL) tools such as DataStage. However, as the
project evolves, other sources are expected to be connected. The ability to use ETL tools provides
functional extensibility of the system without the need for radical changes.
The administrator uses the Administration Console, which is a WebSphere application, to deploy and
manage applications created in Design Studio. The Administration Console allows you to:
Create and manage database resources, view logs and manage SQW processes.
Run and monitor database applications, and review their deployment history and execution
statistics.
Manage cube services, import and export cubes and models, and execute the OLAP Metadata
Optimization Advisor.
Maintain database jobs for data mining, and load, import and export data mining models.
Pic. 3. Data storage in IBM InfoSphere Warehouse
DB2, IBM Alphablox, and WebSphere Application Server have their own administration tools, but
these tools can also be launched from the Integrated Solutions Console.
The administrator uses DB2 Query Patroller to dynamically manage the flow of queries to the DB2
database. Query Patroller allows database resource usage to be adjusted so that short queries or queries
with the highest priority are executed first, ensuring efficient use of resources. In addition,
administrators can collect and analyze information about executed queries to determine temporal
patterns, frequently used tables and indexes, and resource intensive applications.
Conclusion
The proposed solution is scalable and has expandable functionality. In the future, various document
workflow, enterprise planning, and metadata and master data management systems can be connected.
The system for collecting and analyzing primary data can be easily integrated into an existing enterprise
IT infrastructure; alternatively, it may be treated as a first step in the implementation of an enterprise
system for data collection, storage and analysis.
Various solutions for data analysis by means of IBM InfoSphere Warehouse and IBM Cognos BI
will be described in the second part of the article.
The author thanks M.Barinstein, V.Ivanov, M.Ozerova, D.Savustjan, A.Son, and E.Fischukova for
useful discussions.
Literature
1. Asadullaev S. Primary data gathering and analysis system II., 2011,
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
2. IBM Forms documentation, https://www.ibm.com/developerworks/lotus/documentation/forms/
3. Boyer J., Bray T., Gordon M. Extensible Forms Description Language (XFDL) 4.0. 1998,
http://www.w3.org/TR/1998/NOTE-XFDL-19980902
4. IBM, XFDL 8 Specification, 2010, http://www-10.lotus.com/ldd/lfwiki.nsf/xpViewCategories.xsp?lookupName=XFDL%208%20Specification
5. IBM, InfoSphere Warehouse overview 9.7, 2010, http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/helpindex_isw.html
6. IBM, IBM DB2 Database for Linux, UNIX, and Windows Information Center, 2011, http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.doc/welcome.html
Primary data gathering and analysis system - II
Analysis of primary data
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
17.02.2011
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
Abstract
A standard solution for primary data collection, storage and analysis was proposed in [1]. The
solution is based on IBM Forms for manual input and IBM InfoSphere Warehouse for storage. This
article considers the analysis of collected data by means of IBM InfoSphere Warehouse analytical
tools and IBM Cognos.
The proposed approach can be used as a basis for a variety of solutions for different industries and
enterprises.
Data analysis using IBM InfoSphere Warehouse
Cubing Services & Alphablox based OLAP
IBM Alphablox and the Cubing Services components are used to provide direct data access from
InfoSphere Warehouse. The Cubing Services components include tools for OLAP cube metadata
modeling, an optimizer for materialized query tables (MQTs), and a cube server for multidimensional
data access (Pic. 1).

Pic. 1. Cubing Services & Alphablox based OLAP
Due to full integration of Cubing Services components with the user interface of InfoSphere
Warehouse, the design is performed using the Design Studio and the administration and support is
provided through the Administration Console.
The Cubing Services cube server processes multidimensional queries expressed in the MDX query language and returns their results. To answer MDX queries, the cube server retrieves data from DB2 via SQL. Materialized query tables (MQTs) are used by the DB2 optimizer, which rewrites incoming SQL queries and routes them to the appropriate MQT to achieve high query execution performance.
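The following sketch illustrates this mechanism with a deferred-refresh summary MQT; the DWH.SALES fact table and all column names are invented for the example. Once the MQT is refreshed and the session allows deferred-refresh MQTs to be considered, the DB2 optimizer can route matching aggregate queries to the summary table instead of the base table.

    -- Illustrative only: summary MQT over a hypothetical DWH.SALES fact table.
    CREATE TABLE mart.sales_by_region_mqt AS
      (SELECT region_id,
              product_group,
              SUM(amount) AS total_amount,
              COUNT(*)    AS sales_count
       FROM   dwh.sales
       GROUP  BY region_id, product_group)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;

    -- Populate the MQT with current data from the base table.
    REFRESH TABLE mart.sales_by_region_mqt;

    -- Allow the optimizer to consider deferred-refresh MQTs for query rewrite.
    SET CURRENT REFRESH AGE ANY;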
IBM Alphablox enables rapid development of analytical web applications that meet enterprise infrastructure requirements and are available both on the intranet and behind the enterprise firewall. Alphablox applications let users perform multidimensional data analysis in real time, using a standard browser as a client.
Alphablox applications can work with data from multiple sources, including DB2 and Cubing Services, create structured reports and present data in the required form with the help of filters and drill-down tools.
Alphablox analytic applications aim to improve the quality of decision making and to streamline financial reporting and analysis, planning, operational analysis, task analysis and reporting, performance analysis and the analysis of key performance indicators (KPIs).
Text and data mining
Data mining algorithms embedded in InfoSphere Warehouse are used to understand the behavior of a customer or a business unit. Data discovery tools make it possible to identify hidden data relationships, profile data, browse table contents and visualize correlated statistics in order to identify data suitable for analysis. InfoSphere Warehouse provides the following data mining tools:
Miningblox
Intelligent Miner Easy Mining
Intelligent Miner Modeling
Intelligent Miner Scoring
Intelligent Miner Visualization
Unstructured text analysis
Data mining using MiningBlox & Alphablox
A typical data mining application can include the following steps:
Selecting data for analysis
Starting the analysis and tracking its progress
Viewing the analysis results
Selecting, managing or controlling the data mining tasks
The Miningblox tag library provides tags for each of these steps and is designed to perform predictive analysis using Alphablox functions. In this configuration the J2EE application server hosts the Miningblox applications, Alphablox and the Miningblox tag library, the data warehouse applications, and the Administration Console (Pic. 2).
Miningblox web applications include JavaServer Pages (JSP) that use the Alphablox and Miningblox JSP tag libraries. A JSP page that invokes Alphablox is compiled at run time on the application server; Alphablox manages the queries, and the web server returns the dynamic content.
The data warehouse application contains control flows, which are invoked by the Miningblox web application. Control flows contain data flows and data mining flows. The DB2 database stores both the data analyzed by the mining flows and the results, in the form of mining models and result tables.

Pic. 2. Data mining using MiningBlox & Alphablox
Administration Console can be used to deploy and administer data warehouse applications related to
Miningblox applications.
Design Studio is used for the visual design of data mining or text analysis flows, as well as of the preprocessing and text operators they contain. The generated SQL can be embedded in an Alphablox application, or in any other application, to invoke the data mining flow.
Data mining using Intelligent Miner
To solve data mining and text analysis tasks it is necessary to develop applications using a specialized SQL API, which consists of two levels with different degrees of detail and abstraction:
The Easy Mining API is task-oriented and is used to perform basic data mining tasks;
The IM Scoring / Modeling SQL/MM API conforms to the ISO/IEC 13249-6 Data Mining standard and makes it possible to create data mining applications for specific individual user requirements. This interface can be used from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.
Easy Mining Procedures provide the core functionality of typical data mining tasks. Users only need knowledge of their subject area and are not required to understand the intricacies of data mining in depth.
IM Scoring and IM Modeling are software development kits (SDKs). They are DB2 extensions and include an SQL API that allows applications to invoke data mining functions.
With the modeling tools of IM Modeling you can use the following functions to develop analytical PMML (Predictive Model Markup Language) models: association rules, sequence rules, cluster models, regression models, and classification. Generated PMML models can be used in the IM Visualization modules or in IM Scoring.
The IM Scoring tools allow application programs to apply a PMML model to a large database, to a subset of it, or to individual rows. IM Scoring can work with the following PMML models: association rules, sequence rules, cluster models, regression models, classification, naive Bayes models, neural networks, and decision trees. Forecasting models created with Intelligent Miner for Data are not part of PMML; they can be exported from IM for Data to XML format and then used in IM Scoring.
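As a rough, hypothetical sketch of what scoring can look like in SQL, the fragment below applies a stored classification model to customer rows through functions in the IDMMX schema registered by IM Scoring. The function names follow the SQL/MM (ISO/IEC 13249-6) convention, but the exact signatures, the model repository table and all data names used here are assumptions and must be verified against the IM Scoring reference for the installed release.

    -- Hypothetical sketch only: score customer rows with a stored classification
    -- model. Function, table and column names are assumptions and must be
    -- checked against the IM Scoring documentation.
    SELECT c.customer_id,
           IDMMX.DM_getPredClass(
             IDMMX.DM_applyClasModel(
               m.model,
               IDMMX.DM_applData(
                 IDMMX.DM_applData(IDMMX.DM_ApplicationData(), 'AGE', c.age),
                 'INCOME', c.income))) AS predicted_class
    FROM   dwh.customers c,
           idmmx.classifmodels m
    WHERE  m.modelname = 'CHURN_MODEL';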
The modeling results (associations, sequences, classification, clustering and regression) can be viewed with the ready-made Java visualization tools of IM Visualization. To present the modeling results, these visualization tools can be invoked from applications or run as a browser applet.
Design Studio contains editors and visual tools for data mining application development integrated
into the Eclipse environment. An application developer can visually model the data mining tasks and
generate SQL code to include the Intelligent Miner SQL functionality in analytical applications.
For prototyping, you can use an Excel extension, which allows you to prove a concept while avoiding the complexities of the SQL API.
Administrators can configure a database for data mining, manage data mining models, and optimize the performance of analytic queries through the web interface of the Administration Console.
Text analysis
Text analysis makes it possible to extract business information from patient records, repair reports, database text fields and call center records. This information can be used in multidimensional analysis, in reports, or as an input for data mining. Text analysis covers a broad area of computer science, including:
Automatic classification of documents into groups of similar documents (clustering, or
unsupervised categorization)
Automatic classification of documents into predefined categories (supervised categorization)
Structured data extraction from unstructured text.
The text analysis functions of InfoSphere Warehouse are targeted at information extraction, which produces structured data that can be processed for business intelligence, together with other structured information, by data mining, multidimensional analysis and reporting tools.
For text analysis, InfoSphere Warehouse uses software tools based on UIMA (Unstructured Information Management Architecture [2]). UIMA is an open, industry-oriented, scalable and extensible platform for the integration and deployment of text analysis solutions.
The text analysis functions of InfoSphere Warehouse can perform the following tasks:
Explore tables that contain text columns;
Extract data matching regular expressions, such as phone numbers, email addresses, social insurance numbers, or uniform resource locators (URLs);
Extract data using dictionaries and classifiers, for example product names or client names;
Extract data using UIMA-compliant components.
Data mining application development

Pic. 3. Intelligent Miner using scenario
You can select one of the following approaches to application development, depending on the complexity of the problem and the experience, preferences and skills of the specialists:
Use examples and tutorials and quickly adjust the code to fit your goals.
Use the graphical user interface of Design Studio to define the analysis process, generate code and integrate it into the application, and include data mining steps in automated data transformation processes.
Use the Easy Mining procedures for the basic functionality of typical mining tasks.
Use the command-line script generator idmmkSQL as a starting point for scoring statements.
Invoke the powerful low-level SQL/MM API from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.
Picture 3 shows a typical scenario of using Intelligent Miner for data mining tasks. The application developer integrates the SQL functionality of Intelligent Miner into applications using the development tools of Design Studio. The analyst then uses the data mining tools from those applications.
Data analysis using IBM Cognos Business Intelligence
The proposed solution (Pic. 4) is based on the previous architecture and extends its analytic functionality by means of IBM Cognos 10 Business Intelligence [3].
IBM Cognos 10 Business Intelligence (BI) is an integrated software suite for enterprise performance management, designed to aid in the interpretation of data arising during the operation of the organization.
Cognos BI 10 makes it possible to plot graphs comparing plan and actual figures, to create different types of reports, to embed reports into a convenient portal and to create custom dashboards.
Any employee of the organization can use IBM Cognos 10 BI to create business reports, analyze data and monitor events and metrics in order to make effective business decisions.
Cognos BI 10 includes the following components:
Cognos Connection - content publishing, managing and viewing.
Cognos Administration console - viewing, organizing and scheduling of content,
administration and data protection
Cognos Business Insight - interactive dashboards
Cognos Business Insight Advanced - easy reporting and data research
Cognos Query Studio - ad hoc queries
Cognos Report Studio - professionally authored, managed reports
Cognos Event Studio - event management and notification
Cognos Metric Studio - metrics and scorecarding
Cognos Analysis Studio - business analysis
Cognos for Microsoft Office - working with Cognos BI data in Microsoft Office
Framework Manager - business metadata management for cube connection.
Metric Designer - data extraction.
Transformer - modeling of PowerCubes multidimensional data cubes
Map Manager - import maps and update labels
Cognos Software Development Kit - Cognos BI application development
Cognos Connection is a portal that provides a single access point to all enterprise data available in Cognos 10 BI. The portal allows users to publish, find, organize and view data. With appropriate access rights, users can work through the portal with a variety of applications and manage the portal's content, including schedule management and the preparation and distribution of reports.
The Cognos Administration console, together with Cognos Connection, gives system administrators the ability to administer Cognos servers, tune performance and manage user access rights.
Cognos Business Insight allows users to create complex interactive dashboards using data from Cognos and from external sources, such as TM1 Websheets and CubeViews. A user can open a personal dashboard, manage reports, send the dashboard via e-mail and participate in collective decision making.
Cognos Business Insight Advanced allows users to create simple reports and explore data from internal and external data sources, both relational and multidimensional. When analysts working with their personal dashboards want to perform a deeper data analysis, they can switch to Business Insight Advanced, where it is possible to add new dimensions, conditional formatting and complex calculations. The user can launch Business Insight Advanced directly from the Cognos Connection portal.
Query Studio provides an interface for creating simple queries and reports in Cognos 10 BI. Users
without special training can use Query Studio to create reports that answer simple business
questions. With minimal effort, users can change report layout, filter and sort data, add formatting,
and create charts.
Report Studio is a tool that professional report authors and developers use to create sophisticated, managed, multi-page reports with composite queries against multiple databases (relational or multidimensional). Using Report Studio, you can create any report an organization requires, such as sales invoices, budgets, or weekly activity reports of any complexity.
Event Studio is the event management tool in IBM Cognos 10. It notifies users of events as they occur, so that timely and effective decisions can be made. Event Studio can be used to create agents that monitor changes in the financial and operational performance of the company and of key customers in order to identify important events. When an event occurs, the agent can send an e-mail, publish information on the portal, or prepare a report.
Metric Studio allows users to create and use a balanced scorecard to track and analyze key
performance indicators (KPI) of the organization. You can use a standard or a custom scorecard, if it
is already implemented in the company.
Metric Studio translates the organization's strategy into measurable goals, which allow each employee to correlate their actions with the strategic plan of the company. The scorecarding environment reveals both the company's successful activities and those that need improvement. It monitors progress toward these goals and shows the current state of the business, so all employees and managers of the organization can make the necessary decisions and plan their work.
Analysis Studio is intended for the research, analysis and comparison of multidimensional data, and provides online analytical processing (OLAP) of various multidimensional data sources. The results of the analysis are available for creating professional-quality reports in Report Studio.

Pic. 4. Solution architecture using Lotus Forms, InfoSphere Warehouse and Cognos BI
Managers and analysts use Analysis Studio to quickly analyze the reasons for past events and to understand the actions required to improve performance. The analysis allows users to identify non-obvious but business-relevant patterns and anomalies in large data volumes, which other types of reports cannot reveal.
Cognos for Microsoft Office allows you to work with Cognos reports directly from Microsoft Office, and offers two types of client software:
1. The smart client requires no installation or administration and is updated automatically.
2. The COM add-in client requires installation; updates are performed by reinstalling the software.
Cognos for Microsoft Office lets users work with reports created in Query Studio, Analysis Studio, or Report Studio, with full access to the contents of the report, including data, metadata, headers, footers, and pictures.
Framework Manager is a modeling tool designed to create and manage business metadata for use in the analysis and reporting tools of Cognos BI. Metadata provides a common understanding of data from different sources. OLAP cubes contain metadata for business analysis and reporting. Since cube metadata can change, Framework Manager models only the minimum amount of information required to connect to a cube.
Metric Designer is a modeling tool for data extraction. Extracts are used to map and transfer data to the scorecarding environment from existing metadata sources, such as Framework Manager and Impromptu Query Definition files. Typically, a data model is optimized for storage rather than for reporting. Therefore, a data model developer uses Framework Manager to create data models that are optimized for the needs of business users. For example, a model can define business rules that describe the data and their relationships, dimensions and hierarchies from a business perspective.
Transformer is used to model PowerCubes multidimensional data cubes for business reporting in Cognos BI. After collecting all the necessary metadata from various data sources, modeling the dimensions, customizing the measures and applying dimensional filters, you can create PowerCubes based on this model. These cubes can be deployed to support OLAP analysis and reporting.
Map Manager allows administrators and modelers to import maps and update map labels used in Report Studio, and to add alternative names of countries and cities to create the multilingual text that appears on maps.
IBM Cognos Software Development Kit is designed to create custom reports, to manage the deployment of Cognos BI components, and to ensure the security of the portal and its functionality in accordance with user requirements, local legislation and the existing IT infrastructure. The Cognos SDK includes cross-platform web services, libraries and programming interfaces.
Enterprise planning using Cognos TM1
The analytical tools can be extended with IBM Cognos TM1 enterprise planning software [4], which provides a complete, robust and dynamic planning environment for the timely preparation of personalized budgets and forecasts. Its 64-bit OLAP engine delivers high-performance analysis of complex models, large data sets, and even streamed data.
A full set of enterprise planning requirements is supported: from profitability calculation, financial analysis and flexible modeling to revealing the contribution of each business unit.
The ability to create an unlimited number of custom scenarios allows employees, groups, departments and companies to respond more quickly to changing conditions.
Best practices, such as driver-based planning and rolling forecasts, can become part of the enterprise planning process.
Model and data access configuration tools can provide data in familiar formats.
Managed teamwork provides quick, automated collection of results from different systems and entities, their assembly into a single enterprise planning process and the presentation of results.
The consistent scorecarding, reporting and analysis environment of Cognos BI gives a complete picture, from planning and goal setting to progress measurement and reporting.
Financial and production units have full control over the processes of planning, budgeting and forecasting.
Users can work with familiar interfaces (Microsoft Excel and the Cognos TM1 Web or Contributor client software).
Conclusion
The proposed solution is scalable and functionally expandable. It can be integrated with various document management, enterprise planning, metadata and master data management systems.
The primary data gathering and analysis system can be easily integrated into an existing enterprise IT infrastructure. In other circumstances the solution may serve as a first step in the implementation of an enterprise system for collecting, storing and analyzing data.
The author thanks M.Barinstein, V.Ivanov, M.Ozerova, D.Savustjan, A.Son, and E.Fischukova for
useful discussions.
Literature
1. Asadullaev S., Primary data gathering and analysis system - I. Problem formulation, data
collecting and storing, 2011,
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html
2. Apache Software Foundations, Apache UIMA, 2010, http://uima.apache.org/
3. IBM, Cognos Business Intelligence, 2010,
http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cogn
os.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html
4. IBM, Cognos TM1, 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/
Data Warehousing: Triple Strategy in Practice
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
Program Engineering, 2011, v4, pp 26-33
www.ibm.com/developerworks/ru/library/sabir/strategy/index.html
Abstract
This paper uses a practical example of a system for collecting and analyzing primary data to show
how triple strategy and recommended architecture of enterprise data warehouse (EDW) can provide
higher quality of the information analysis service while reducing costs and time of EDW
development.
Introduction
Many successful companies have found that managing lines of business separately does not give a complete picture of the company's market position and operations. To make accurate and timely decisions, experts, analysts and company management need unified, consistent information, which should be provided by an enterprise data warehouse (EDW).
In practice, enterprise data warehouse projects as a rule do not meet time, cost and quality targets. In many cases the analytical reports produced by the data warehouse still contain conflicting information. This article shows that adherence to recommended architectural solutions, the use of proven strategies for creating an EDW and the right choice of software tools can reduce EDW development costs and improve the quality of EDW services. Based on the triple strategy, the recommended architecture, the proposed principles and best practices of EDW construction, a project management plan is proposed for the development of an enterprise data warehouse.
IBM offers a complete toolset for data, metadata and master data integration at all stages of the life
cycle of an EDW development project. The purpose of this paper is to analyze the simplified
solution based on IBM Forms, IBM InfoSphere Warehouse and IBM Cognos BI software. The
solution must be scalable and functionally expandable. It should be easily integrated into the
enterprise IT infrastructure and to be able to become a foundation for enterprise data warehouse.
Architecture of primary data gathering and analysis system
A typical solution for the collection, storage and analysis of manually entered primary data was proposed in articles [1, 2]. We recall the essential system requirements:
Primary, statistical or reported data must be collected from remote locations;
Data must be validated at the workplace before being sent to the center;
The collected data must be cleansed and stored for a period defined by regulations;
Management reports must be produced to assess the state of affairs in the regions;
Analysis based on the collected statistics must be performed to identify regular patterns and support management decisions;
The solution must be extensible and scalable; for example, it must anticipate the subsequent integration of the primary data gathering and analysis system with a document management system.
Pic. 1 shows the centralized architecture of a primary data gathering and analysis system, which assumes that data input can be performed remotely, while all the IBM Forms [3] data collection servers, InfoSphere Warehouse [4] data storage servers, and IBM Cognos [5, 6] data analysis and interpretation servers are installed in a single data center.

Pic. 1. Solution architecture using Lotus Forms, InfoSphere Warehouse and Cognos BI
Analysts can work both locally and remotely, thanks to the Web interface provided by Cognos for
the preparation and execution of analytical calculations.
The proposed architecture is based on the simplest software configuration for a typical task, without taking into account the specific requirements of particular projects. Despite the obvious limitations, the proposed approach can be used as a basis for a variety of solutions for different industries and enterprises.
Separation of subsystems for data entry, collection, storage and analysis allows us to construct
different architectures depending on the needs and demands of the task and enterprise infrastructure
requirements.
Another advantage of this modular system is the possibility of expanding its functionality. Since all modules communicate over standard, generally accepted protocols, the system can be integrated with various IT systems such as document management, metadata and master data management, enterprise resource planning, analytical and statistical packages and many others.
The system for collecting and analyzing raw data can be easily integrated into existing corporate IT
infrastructure. In other circumstances it may be a first step to implementation of a corporate system
for collecting, storing and analyzing data.
As you can see, the architecture of the primary data gathering and analysis system (Pic. 1) contains no metadata or master data management tools. At first glance, this contradicts the proposed triple strategy of data warehouse development [7], which involves the integration of data, metadata and master data. It is also not obvious how this solution relates to the recommended EDW architecture [8], and how the proposed approach differs from the numerous projects whose primary purpose is to demonstrate a quick, if modest, success.
Role of metadata and master data management projects
The task of primary data input, collection, storage and analysis has several features. First of all, primary data is entered manually into the fields of approved on-screen forms (e-forms). That is why e-form fields are aligned with each other, both within an individual e-form and across e-forms. This means that distinct entities have distinct names and distinct fields. Therefore, at the early project stages the customer planned to store not forms or reports, but the individual field data from which forms and reports can be constructed later. A consistent set of data is a great first step toward metadata management, even if this requirement was not formulated explicitly.
Under these conditions, the first phase of metadata management does not require any specific software and can be done with a pencil, an eraser and paper. The main difficulty of this stage is to reach agreement among all the experts on the terms, the entities, their names and the methods of calculation. Sometimes users have to abandon familiar but ambiguous names, and agreement may require considerable effort and time. Fortunately, this work had been performed by the customer before engaging the external project team.
The names of the e-form fields and the methods of input data validation and calculation are essential business metadata. The solution architecture, including the data warehouse model, is developed on the basis of the collected metadata, and the EDW tables and columns are created and named from it. Thus implicit technical metadata management begins.
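A minimal illustration of this step, assuming a hypothetical report form with two indicator fields: the agreed e-form field names become EDW column names, and the agreed business definitions are recorded as DB2 column comments so that they are not lost.

    -- Illustrative only: a hypothetical EDW table whose columns repeat the
    -- agreed e-form field names; business definitions are kept as comments.
    CREATE TABLE edw.regional_report (
      region_code    CHAR(4)       NOT NULL,  -- code from the agreed regional codifier
      report_period  DATE          NOT NULL,
      headcount      INTEGER,
      revenue_total  DECIMAL(15,2)
    );

    COMMENT ON COLUMN edw.regional_report.headcount
      IS 'Number of employees on the last day of the reporting period (e-form field HEADCOUNT)';
    COMMENT ON COLUMN edw.regional_report.revenue_total
      IS 'Total revenue for the reporting period (e-form field REVENUE_TOTAL)';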
Changes are inevitable during the maintenance of the developed system. At this stage the developers need to manage a glossary of terms. If it was not created earlier, it is time to think about implementing one, since the maintenance process forces centralized metadata management to begin in an explicit form.
This scenario implies minimal overhead for the implementation of a centralized metadata management system, as the kernel of the future system has already been created. This core, though small and limited in features, has the major asset of consistent metadata.
Centralized master data management should be started simultaneously with metadata management. The reason is simple: master data and metadata are closely connected [7], and master data implementation without a metadata project, as a rule, does not lead to success.
The basis for a master data management system can be the set of codifiers, dictionaries, classifiers, identifiers, indices and glossaries maintained for the data warehouse. In this case well-conceived metadata play a master data quality assurance role, which, given skilled DW design, eliminates data encoding conflicts.
Thus, the systematization of business metadata, based on the e-form fields and performed at the pre-project stage, made it possible to create trouble-free metadata and master data management systems and reduced the budget of the project implementing the primary data gathering and analysis system. At the same time the project team was aware that the metadata and master data projects were being performed implicitly. At this stage only the designers' strategic vision and the developers' accuracy are required.
Recommended DW Architecture
The recommended enterprise data warehouse architecture proposed in [8] is constructed in accordance with the following basic principles.
The EDW is the only source of noncontradictory data and should provide users with consistent, high-quality data gathered from different information systems.
Data should be available to employees to the extent necessary and sufficient to carry out
their duties.
Users should have a common understanding of the data, i.e., there should be a common
semantic space.
It is necessary to eliminate data encoding conflicts in the source systems.
Analytical calculations must be separated from operational data processing.
Multilevel data organization should be ensured and maintained.
It is necessary to follow an evolutionary approach, preserving business continuity and IT investments.
The information content of the future data warehouse, the stages of EDW development and the commissioning of functional modules are determined, first of all, by the requirements of the business users.
Data protection and secure storage must be ensured; protection measures should be adequate to the value of the information.
An architecture designed in accordance with these principles follows the modular design principle of "unsinkable compartments". By separating the architecture into modules, we also concentrate specific functionality in them (Pic. 2).
ETL tools provide complete, reliable and accurate information gathering from data sources; the algorithms for data collection, processing, conversion and interaction with the metadata and master data management systems are concentrated in the ETL layer.
The metadata management system is the primary source of information about the data in the EDW. It keeps business, technical, operational and project metadata up to date.
The master data management system eliminates conflicts in data encoding in the source systems.
The Central Data Warehouse (CDW) has a single workload: reliable and secure data storage. The data structure in the CDW is optimized solely for effective data storage.
The data sampling, restructuring and delivery (SRD) tools in this architecture are the only consumers of the CDW; they take on the whole job of filling the data marts and thereby remove the user query workload from the CDW.
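As a simplified sketch with invented table names, an SRD step might periodically rebuild a mart table from normalized CDW tables, restructuring the data into the denormalized form that the mart's users query:

    -- Illustrative SRD step: refresh a denormalized mart table from the CDW.
    -- All schema, table and column names are invented for the example.
    DELETE FROM mart.sales_by_region
    WHERE  report_month = '2011-01-01';

    INSERT INTO mart.sales_by_region
           (report_month, region_name, product_group, total_amount)
    SELECT DATE('2011-01-01'),
           r.region_name,
           p.product_group,
           SUM(s.amount)
    FROM   cdw.sales    s
    JOIN   cdw.regions  r ON r.region_id  = s.region_id
    JOIN   cdw.products p ON p.product_id = s.product_id
    WHERE  s.sale_date >= '2011-01-01' AND s.sale_date < '2011-02-01'
    GROUP  BY r.region_name, p.product_group;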
Data marts contain data in formats and structures that are optimized for tasks of specific data mart
users.

Pic. 2. Recommended DW Architecture
Thus, users can work comfortably with the data they need even when the connection to the CDW is lost. The ability to quickly restore the data mart content from the CDW in case of a data mart failure is also provided.
The advantage of this architecture is the ability to separate the design, development, operation and refinement of individual EDW components without overhauling the whole system. This means that starting work on an EDW does not require extraordinary effort or investment. To begin, it is enough to implement a data warehouse with limited capabilities and, following the proposed principles, to develop a prototype that works and is truly useful to users. Then you need to identify the bottlenecks and evolve the required components.
Relation between the recommended architecture and the solution
The architecture of the primary data collection, storage and analysis system (Pic. 3) is translated into the terms of the recommended EDW architecture and aligned with it.
Data are collected with the help of IBM Forms, which uses e-forms for manual data entry and
allows you to transfer the collected data to other systems. IBM Forms application server can be
further integrated with the repositories of structured and unstructured data.
The only data sources in this project are e-forms filled in according to strict rules, so at this stage there is no need for extraction, transformation and loading tools (e.g., DataStage). However, the project will evolve, and one can expect the need to connect other sources. The possibility of using ETL tools provides functional extensibility of the system without the need for a radical redesign.
Data storage is implemented using IBM InfoSphere Warehouse. Data analysis can be performed by
means of IBM InfoSphere Warehouse and IBM Cognos Business Intelligence (BI).

Pic. 3. Recommended and solution architectures relations
IBM InfoSphere Warehouse provides the following data analysis tools: analytical processing using
Cubing Services based OLAP tools and Alphablox, data mining using Miningblox and Alphablox,
and data mining with the assistance of Intelligent Miner.
IBM Cognos 10 Business Intelligence (BI) is an integrated software suite for enterprise performance management, designed to aid in the interpretation of data arising during the operation of the organization. Any employee can use IBM Cognos 10 BI to create business reports, analyze data and monitor events and metrics in order to make effective business decisions. Cognos BI 10 makes it possible to plot graphs comparing plan and actual figures, to create different types of reports, to embed reports into a convenient portal and to create custom dashboards.
Analytical tools can be extended by means of IBM Cognos TM1 enterprise planning software,
which provides a complete, robust and dynamic planning environment for the timely preparation of
personalized budgets and forecasts.
The metadata obtained as a byproduct of reconciling e-forms, and the master data that result from reducing the data to normal form in the EDW relational database, are the prototypes of the future enterprise metadata and master data management systems (Pic. 4).
The need to establish Data Dictionary / Directory Systems was first published in the early 1980s [9]. An article [10] published in 1995 stated that successful data integration requires establishing and maintaining a metadata flow. Current practice shows that this requirement needs to be refined, since metadata are generated at all stages of the development and operation of information systems. The relationship between data, metadata and master data was discussed in detail in [8], where it was shown that master data contain business and technical metadata.
Data loading into the EDW cannot be performed properly without metadata and master data, which are heavily used at this stage, explicitly or implicitly. The cleansed and consistent data are stored, but the metadata and master data are usually ignored.


Pic. 4. Step to enterprise metadata & master data management
The creation of metadata and master data repositories significantly reduces EDW implementation costs, allows us to move from storing inconsistent forms to storing consistent data, and improves the quality of information services for business users [11].
Comparison of proposed and existing approaches
To answer the question of how the proposed approach differs from existing practice, consider a typical example: the development of a financial analysis system in a bank [12].
The project team proceeded from the premise that creating enterprise master data is a long, difficult and risky job. Therefore, the project was limited to the local task of reengineering the planning and forecasting processes, which should pave the way for a bank reporting system, based on the integration of core banking systems, to use more detailed data consistent with the general ledger.
In the project team's eyes, development of an EDW was tantamount to the "Big Bang" that created the universe. The team, avoiding enterprise-wide solutions, introduced metadata and master data for a particular area of activity only. As a result, the financial data repository is a highly specific data mart for financial statements (Pic. 5).
In contrast, an EDW provides consistent corporate data for a wide range of analytical applications. Practice shows that only an EDW that is integrated with metadata and master data management systems can provide a single version of the data.
As you can see, the main objective of this project was to demonstrate a quick win. Many of us have been put in the same situation, when there was an urgent need to demonstrate even a small but working system. An experienced developer knows that he will have to follow the advice of Brooks [13] and throw this first version away. The reason is that the cost of redesigning the applications and integrating them into the enterprise infrastructure would be prohibitive because of the lack of agreed metadata and master data.

Pic. 5. Example of existing approach
The final architecture resulting from existing approaches
Let us briefly summarize the results of the analysis.
1. Existing approaches, in effect, implement disparate application data marts. The data in these data marts may be valuable within the business units, but not for the company as a whole, because reconciliation is impossible due to differences in data meaning and coding.
2. The belief that creating an EDW is a deadly trick with unpredictable consequences is widespread, so it is often decided to create local data marts without EDW development.
3. The demand for instant results leads to the development and implementation of limited solutions with no relation to enterprise-level tasks.
Following these principles, the company initially introduces separate, independent data marts. The information in each data mart is not consistent with data from the other data marts, so management receives contradictory reports on their desks. Indicators with the same name in these reports may hide different entities and, vice versa, the same entity may have different names and may be calculated by different algorithms, based on different data sets, for different periods of time.
As a result, the users of independent application data marts speak different business languages, and each data mart has its own metadata.
Another problem is the difference between the master data used in the independent data marts. Differences in the data encoding used in codifiers, dictionaries, classifiers, identifiers, indices and glossaries make it impossible to combine these data without serious analysis, design and development of master data management tools.
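A small sketch of the kind of conflict meant here, with invented codes and tables: two data marts encode the same region differently, so their figures can be combined only through an explicit cross-reference of the sort a master data management tool maintains.

    -- Invented example: the same region is coded 'MSK' in mart A and '077' in
    -- mart B; a cross-reference table maps both codes to one master identifier.
    CREATE TABLE mdm.region_xref (
      master_region_id  INTEGER      NOT NULL,  -- enterprise-wide region identifier
      source_system     VARCHAR(16)  NOT NULL,  -- 'MART_A', 'MART_B', ...
      source_code       VARCHAR(16)  NOT NULL,  -- code as used in that source
      PRIMARY KEY (source_system, source_code)
    );

    -- Combining figures from both marts becomes possible only via the mapping.
    SELECT x.master_region_id,
           SUM(t.amount) AS combined_amount
    FROM  (SELECT 'MART_A' AS source_system, region_code, amount FROM mart_a.sales
           UNION ALL
           SELECT 'MART_B', region_code, amount FROM mart_b.sales) t
    JOIN   mdm.region_xref x
           ON x.source_system = t.source_system AND x.source_code = t.region_code
    GROUP  BY x.master_region_id;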
So the company creates several inconsistent data warehouses, which fundamentally contradicts the very idea of establishing an EDW as the one and only source of cleansed, consistent and noncontradictory historical data. The lack of enterprise metadata and master data management (shaded in Pic. 6) makes data reconciliation between them even less probable.
Obviously, neither the management nor the users of such a repository are inclined to trust the information it contains. At the next step the need arises for a radical redesign and, in fact, for the creation of a new data warehouse that stores not reports but agreed indicators from which reports will be assembled.
Thus, the pursuit of short-term results and the need to demonstrate quick wins lead to the rejection of unified, end-to-end metadata and master data management. The result of this approach is the presence of semantic islands, where the staff speaks a variety of business languages. The enterprise data integration architecture must then be redesigned completely, which leads to repeated expenditures of time and money to create a full-scale EDW (Pic. 6).

Pic. 6. Result of existing approach: DW with intermediate Application DMs
Triple strategy and EDW development planning
The proposed approach is based on the triple strategy, on recommended architecture, on formulated
principles and on best practices of EDW development.
As a rule, developers need to quickly demonstrate at least a modest success in data integration. In other companies, by contrast, a corporate strategy for the EDW must be developed and implemented. However the task is formulated, in both cases you must keep a global goal in view and reach it in short steps.
The role of the compass that keeps the work aligned with the strategic goals belongs to the coordinated integration of data, metadata, and master data (Pic. 7):
1. master data integration to eliminate data redundancies and inconsistencies;
2. metadata integration to ensure a common understanding of data and metadata;
3. data integration to provide end users with a single version of truth on the basis of agreed metadata and master data.
As you know, a great journey starts and ends with a small step. The creation of a centralized data, metadata, and master data management environment is a priority task. But business users do not see immediate benefits for themselves in that environment, and management prefers to avoid long-term projects with no tangible results for the company's core business.

Pic. 7. DW development plan

Therefore, two or three pilot projects should be chosen in the first phase. The main selection criteria for these projects are management support and the willingness of users and experts to participate in formulating the task. The projects should provide the minimum acceptable functionality of the future EDW.
As a tentative example, the following pilot projects are selected for the first phase (Pic. 7):
1. Data mining on the basis of Intelligent Miner (IM);
2. Multidimensional analysis (OLAP) using Cubing Services and Alphablox;
3. Unstructured text analysis using Unstructured Text Analysis Tools (UTAT).
All the tools deployed in the first-phase pilot projects are part of IBM InfoSphere Warehouse.
It is important that users feel the real benefits of the EDW as a result of these short projects. The project team, together with the users, needs to analyze the results of the pilot projects and, if necessary, determine the actions required to change the EDW environment and to adjust the tasks of data, metadata, and master data integration.
The next step is to choose three or four new pilot projects that could move the company toward creating the basic functionality of the future EDW. It is desirable that the selection process involves all concerned parties: company management, users, business experts, the project team and the EDW maintenance and support team. The centralized data, metadata and master data management environment must be developed far enough to meet the requirements of the EDW's basic functionality.
Assume that the following projects and tools are chosen for implementation in the second phase:
1. Report generation and data analysis with Cognos Business Insight Advanced and Report Studio;
2. Creation of complex interactive dashboards based on Cognos Business Insight;
3. Scenario analysis using Cognos TM1;
4. Corporate planning with Cognos TM1.
The project results should be reexamined after completion of the second-phase pilot projects. The next step is the development of a fully functional EDW, which is impossible without comprehensive support from the centralized data, metadata and master data management environment.
Thus, a rough plan for EDW development may look as follows:
Strategic objectives:
o coordinated integration of data, metadata, and master data
Tactical objectives:
o Selection of two or three projects to demonstrate the benefits
o Creation of a centralized data, metadata and master data management environment,
o Project results analysis and alteration of EDW environment, if necessary
o Implementation of three or four pilot projects, relying on the experience gained
o In case of success - EDW development with company-wide functionality
o EDW operation and modernization to fit new tasks, formulation and solution of
which became possible due to the accumulated experience of EDW operation
Thus, the EDW development project is not completed when the EDW is accepted as commissioned and fully operational. The EDW must evolve together with the company. Life goes on, new problems arise, and new information systems are required. If these systems can provide information that is important for enterprise-wide data analysis, they must be connected to the EDW. To avoid integration issues it is desirable to build new systems on the capabilities of the centralized data, metadata and master data management environment.
In turn, the centralized data, metadata and master data management environment should be changed and improved to take into account the needs of the new systems. Therefore, the centralized data, metadata and master data management environment must evolve for as long as the company and its IT systems exist, which is conventionally indicated in Pic. 7 by the arrows that extend beyond the chart.
Conclusion
Enterprise data warehouse, built as a result of a coordinated data, metadata, and master data
integration, provides higher quality of information and analytical services at lower costs, reduces
development time and enables decision making based on more accurate information.
The proposed approach provides effective operation of the data, metadata, and master data management systems, eliminates the coexistence of modules with similar functionality, lowers the total cost of ownership and increases user confidence in the EDW data. The integration of data, metadata, and master data, performed simultaneously with the development of EDW functionality, makes it possible to implement agreed architectures, environments, life cycles, and key capabilities for the data warehouse and the metadata and master data management systems.
Literature
1. Asadullaev S., Primary data gathering and analysis system I. Problem formulation, data
collecting and storing , 2011,
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html
2. Asadullaev S., Primary data gathering and analysis system II. Primary data analysis, 2011,
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
3. IBM, IBM Forms documentation, 2010,
https://www.ibm.com/developerworks/lotus/documentation/forms/
4. IBM, InfoSphere Warehouse overview 9.7, 2010,
http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/he
lpindex_isw.html
5. IBM, Cognos Business Intelligence, 2010,
http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cogn
os.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html
6. IBM, Cognos TM1, 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/
7. Asadullaev S., Data, metadata and master data: triple strategy for data warehouse project.
http://www.ibm.com/developerworks/ru/library/r-nci/index.html, 2009.
8. Asadullaev S., Data warehouse architecture III,
http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html, 2009.
9. Leong-Hong B.W., Plagman B.K. Data Dictionary / Directory Systems. Wiley & Sons. 1982.
10. Hackathorn R. Data Warehousing Energizes Your Enterprise, Datamation, Feb.1, 1995, p. 39.
11. Asadullaev S., Data quality management using IBM Information Server, 2010,
http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html
12. Financial Service Technology. Mastering financial systems success, 2009,
http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/
13. Brooks F. P. The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley
Professional; Anniversary edition, 1995.
