
Part 9  Business Intelligence

Chapter 31  Data Warehousing Concepts   1149
Chapter 32  Data Warehousing Design     1181
Chapter 33  OLAP                        1204
Chapter 34  Data Mining                 1232

Chapter 31
Data Warehousing Concepts

Chapter Objectives

In this chapter you will learn:

- How data warehousing evolved.
- The main concepts and benefits associated with data warehousing.
- How Online Transaction Processing (OLTP) systems differ from data warehousing.
- The problems associated with data warehousing.
- The architecture and main components of a data warehouse.
- The important data flows or processes of a data warehouse.
- The main tools and technologies associated with data warehousing.
- The concept of a data mart and the main reasons for implementing a data mart.
- The main issues associated with the development and management of data marts.
- The issues associated with the integration of a data warehouse and the importance of
  managing metadata.
- How Oracle supports data warehousing.

We have already noted in earlier chapters that database management systems are pervasive
throughout industry, with relational database management systems being the dominant
type of system. These systems have been designed to handle high transaction throughput,
with transactions typically making small changes to the organization's operational data,
that is, the data that the organization requires to handle its day-to-day operations. These
types of system are called Online Transaction Processing (OLTP) systems. The size of OLTP
databases can range from small databases of a few megabytes (MB), to medium-sized
databases of several gigabytes (GB), to large databases requiring terabytes (TB) or even
petabytes (PB) of storage.
Corporate decision-makers require access to all the organization's data, wherever it is
located. To provide comprehensive analysis of the organization, its business, its
requirements, and any trends requires access not only to the current values in the database
but also to historical data. To facilitate this type of analysis, the data warehouse has been
created to hold data drawn from several data sources, maintained by different operating
units, together with historical and summary transformations. The data warehouse based on
extended database technology provides the management of the datastore. However,
decision-makers also require powerful analysis tools. Two main types of analysis tools
have emerged over the last few years: Online Analytical Processing (OLAP) and data
mining tools.
As data warehousing is such a complex subject, we have devoted four chapters to
different aspects of data warehousing. In this chapter, we describe the basic concepts
associated with data warehousing. In Chapter 32 we describe how to design and build a data
warehouse, and in Chapters 33 and 34 we discuss the important end-user access tools for a
data warehouse.

Structure of this Chapter

In Section 31.1 we outline what data warehousing is and how it evolved, and also describe
the potential benefits and problems associated with this approach. In Section 31.2 we
describe the architecture and main components of a data warehouse. In Sections 31.3 and
31.4 we identify and discuss the important data flows or processes of a data warehouse, and
the associated tools and technologies of a data warehouse, respectively. In Section 31.5 we
introduce data marts and the issues associated with the development and management of
data marts. Finally, in Section 31.6 we present an overview of how Oracle supports a data
warehouse environment. The examples in this chapter are taken from the DreamHome
case study described in Section 10.4 and Appendix A.


31.1 Introduction to Data Warehousing

In this section we discuss the origin and evolution of the concept of data warehousing.
We then discuss the main benefits associated with data warehousing. We next identify the
main characteristics of data warehousing systems in comparison with Online Transaction
Processing (OLTP) systems. We conclude this section by examining the problems of
developing and managing a data warehouse.

31.1.1 The Evolution of Data Warehousing


Since the 1970s, organizations have mostly focused their investment in new computer
systems that automate business processes. In this way, organizations gained competitive
advantage through systems that offered more efficient and cost-effective services to the
customer. Throughout this period, organizations accumulated growing amounts of data
stored in their operational databases. However, in recent times, where such systems are
commonplace, organizations are focusing on ways to use operational data to support
decision-making, as a means of regaining competitive advantage.
Operational systems were never designed to support such business activities and so
using these systems for decision-making may never be an easy solution. The legacy is that
a typical organization may have numerous operational systems with overlapping and
sometimes contradictory definitions, such as data types. The challenge for an organization
is to turn its archives of data into a source of knowledge, so that a single
integrated/consolidated view of the organization's data is presented to the user. The concept
of a data warehouse was deemed the solution to meet the requirements of a system capable
of supporting decision-making, receiving data from multiple operational data sources.

31.1.2 Data Warehousing Concepts

The original concept of a data warehouse was devised by IBM as the 'information
warehouse' and presented as a solution for accessing data held in non-relational systems.
The information warehouse was proposed to allow organizations to use their data archives
to help them gain a business advantage. However, due to the sheer complexity and
performance problems associated with the implementation of such solutions, the early
attempts at creating an information warehouse were mostly rejected. Since then, the concept
of data warehousing has been raised several times, but it is only in recent years that the
potential of data warehousing has come to be seen as a valuable and viable solution. The
latest and most successful advocate for data warehousing is Bill Inmon, who has earned the
title of 'father of data warehousing' due to his active promotion of the concept.

Data warehousing    A subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management's decision-making process.

In this definition by Inmon (1993), the data is:

- Subject-oriented, as the warehouse is organized around the major subjects of the
  enterprise (such as customers, products, and sales) rather than the major application areas
  (such as customer invoicing, stock control, and product sales). This is reflected in the
  need to store decision-support data rather than application-oriented data.
- Integrated, because of the coming together of source data from different enterprise-wide
  application systems. The source data is often inconsistent, using, for example, different
  formats. The integrated data source must be made consistent to present a unified view
  of the data to the users.
- Time-variant, because data in the warehouse is only accurate and valid at some point in
  time or over some time interval. The time-variance of the data warehouse is also shown
  in the extended time that the data is held, the implicit or explicit association of time with
  all data, and the fact that the data represents a series of snapshots.
- Non-volatile, as the data is not updated in real time but is refreshed from operational
  systems on a regular basis. New data is always added as a supplement to the database,
  rather than a replacement. The database continually absorbs this new data, incrementally
  integrating it with the previous data.

There are numerous definitions of data warehousing, with the earlier definitions focusing
on the characteristics of the data held in the warehouse. Alternative definitions widen the
scope of the definition of data warehousing to include the processing associated with
accessing the data, from the original sources through to the delivery of the data to the
decision-makers (Anahory and Murray, 1997).
Whatever the definition, the ultimate goal of data warehousing is to integrate
enterprise-wide corporate data into a single repository from which users can easily run
queries, produce reports, and perform analysis. In summary, a data warehouse is data
management and data analysis technology.
In recent years a new term associated with data warehousing has been used, namely the
Data Webhouse.

Data Webhouse    A distributed data warehouse that is implemented over the Web with no
central data repository.

The Web is an immense source of behavioral data, as individuals interact through their
Web browsers with remote Web sites. The data generated by this behavior is called
clickstream. Using a data warehouse on the Web to harness clickstream data has led to the
development of Data Webhouses. Further discussion of the development of this new
variation of data warehousing is beyond the scope of this book; the interested reader is
referred to Kimball et al. (2000).

31.1.3 Benefits of Data Warehousing

The successful implementation of a data warehouse can bring major benefits to an
organization, including:

- Potential high returns on investment. An organization must commit a huge amount of
  resources to ensure the successful implementation of a data warehouse, and the cost can
  vary enormously from £50,000 to over £10 million due to the variety of technical
  solutions available. However, a study by the International Data Corporation (IDC) in
  1996 reported that average three-year returns on investment (ROI) in data warehousing
  reached 401%, with over 90% of the companies surveyed achieving over 40% ROI, half
  the companies achieving over 160% ROI, and a quarter with more than 600% ROI
  (IDC, 1996).
- Competitive advantage. The huge returns on investment for those companies that have
  successfully implemented a data warehouse are evidence of the enormous competitive
  advantage that accompanies this technology. The competitive advantage is gained by
  allowing decision-makers access to data that can reveal previously unavailable,
  unknown, and untapped information on, for example, customers, trends, and demands.
- Increased productivity of corporate decision-makers. Data warehousing improves the
  productivity of corporate decision-makers by creating an integrated database of
  consistent, subject-oriented, historical data. It integrates data from multiple incompatible
  systems into a form that provides one consistent view of the organization. By
  transforming data into meaningful information, a data warehouse allows corporate
  decision-makers to perform more substantive, accurate, and consistent analysis.

31.1.4 Comparison of OLTP Systems and Data Warehousing
A DBMS built for Online Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximize the transaction
processing capacity, while data warehouses are designed to support ad hoc query processing. Table 31.1 provides a comparison of the major characteristics of OLTP systems
and data warehousing systems (Singh, 1997).
An organization will normally have a number of different OLTP systems for business
processes such as inventory control, customer invoicing, and point-of-sale. These systems
generate operational data that is detailed, current, and subject to change. The OLTP systems are optimized for a high number of transactions that are predictable, repetitive, and
update intensive. The OLTP data is organized according to the requirements of the transactions associated with the business applications and supports the day-to-day decisions of
a large number of concurrent operational users.
In contrast, an organization will normally have a single data warehouse, which holds
data that is historical, detailed, and summarized to various levels and rarely subject to
change (other than being supplemented with new data). The data warehouse is designed
to support relatively low numbers of transactions that are unpredictable in nature and
require answers to queries that are ad hoc, unstructured, and heuristic. The warehouse data
is organized according to the requirements of potential queries and supports the long-term
strategic decisions of a relatively low number of managerial users.
Although OLTP systems and data warehouses have different characteristics and are
built with different purposes in mind, these systems are closely related in that the OLTP
systems provide the source data for the warehouse. A major problem of this relationship
is that the data held by the OLTP systems can be inconsistent, fragmented, and subject
to change, containing duplicate or missing entries. As such, the operational data must be
cleaned up before it can be used in the data warehouse. We discuss the tasks associated
with this process in Section 31.3.1.

Table 31.1  Comparison of OLTP systems and data warehousing systems.

OLTP systems                                         Data warehousing systems
Holds current data                                   Holds historical data
Stores detailed data                                 Stores detailed, lightly, and highly summarized data
Data is dynamic                                      Data is largely static
Repetitive processing                                Ad hoc, unstructured, and heuristic processing
High level of transaction throughput                 Medium to low level of transaction throughput
Predictable pattern of usage                         Unpredictable pattern of usage
Transaction-driven                                   Analysis-driven
Application-oriented                                 Subject-oriented
Supports day-to-day decisions                        Supports strategic decisions
Serves large number of clerical/operational users    Serves relatively low number of managerial users

OLTP systems are not built to answer ad hoc queries quickly. They also tend not to store
historical data, which is necessary to analyze trends. Basically, OLTP offers large amounts
of raw data, which is not easily analyzed. The data warehouse allows more complex queries
to be answered, in addition to simple aggregations such as 'What is the average selling price
for properties in the major cities of Great Britain?'. The types of queries that a data
warehouse is expected to answer range from the relatively simple to the highly complex and
are dependent on the types of end-user access tools used (see Section 31.2.10). Examples of
the range of queries that the DreamHome data warehouse may be capable of supporting
include:

- What was the total revenue for Scotland in the third quarter of 2004?
- What was the total revenue for property sales for each type of property in Great Britain
  in 2003?
- What are the three most popular areas in each city for the renting of property in 2004,
  and how does this compare with the results for the previous two years?
- What is the monthly revenue for property sales at each branch office, compared with
  rolling 12-monthly prior figures?
- What would be the effect on property sales in the different regions of Britain if legal costs
  went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
- Which type of property sells for prices above the average selling price for properties in
  the main cities of Great Britain, and how does this correlate to demographic data?
- What is the relationship between the total annual revenue generated by each branch
  office and the total number of sales staff assigned to each branch office?
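To make the second of these example queries concrete, the following is a minimal sketch of
the kind of SQL that might answer it against an assumed star schema; the table and column
names (PropertySaleFact, TimeDim, PropertyTypeDim, BranchDim and their columns) are
illustrative assumptions rather than part of the DreamHome schema used elsewhere in the book.

    -- Total revenue for property sales for each type of property in Great Britain in 2003,
    -- assuming a fact table joined to time, property type, and branch dimensions.
    SELECT   pt.propertyType,
             SUM(f.salePrice) AS totalRevenue
    FROM     PropertySaleFact f
             JOIN TimeDim t          ON f.timeId = t.timeId
             JOIN PropertyTypeDim pt ON f.propertyTypeId = pt.propertyTypeId
             JOIN BranchDim b        ON f.branchId = b.branchId
    WHERE    t.year = 2003
    AND      b.country IN ('England', 'Scotland', 'Wales')
    GROUP BY pt.propertyType
    ORDER BY totalRevenue DESC;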

31.1.5 Problems of Data Warehousing

The problems associated with developing and managing a data warehouse are listed in
Table 31.2 (Greenfield, 1996).

Table 31.2  Problems of data warehousing.

Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration


Underestimation of resources for data loading


Many developers underestimate the time required to extract, clean, and load the data into
the warehouse. This process may account for a significant proportion of the total development time, although better data cleansing and management tools should ultimately reduce
the time and effort spent.

Hidden problems with source systems


Hidden problems associated with the source systems feeding the data warehouse may be
identified, possibly after years of being undetected. The developer must decide whether
to fix the problem in the data warehouse and/or fix the source systems. For example, when
entering the details of a new property, certain fields may allow nulls, which may result in
staff entering incomplete property data, even when available and applicable.

Required data not captured


Warehouse projects often highlight a requirement for data not being captured by the
existing source systems. The organization must decide whether to modify the OLTP systems or create a system dedicated to capturing the missing data. For example, when considering the DreamHome case study, we may wish to analyze the characteristics of certain
events such as the registering of new clients and properties at each branch office. However,
this is currently not possible as we do not capture the data that the analysis requires such
as the date registered in either case.

Increased end-user demands


After end-users receive query and reporting tools, requests for support from IS staff may
increase rather than decrease. This is caused by users' increasing awareness of the
capabilities and value of the data warehouse. This problem can be partially alleviated
by investing in easier-to-use, more powerful tools, or in providing better training for the
users. A further reason for increasing demands on IS staff is that once a data warehouse is
online, it is often the case that the number of users and queries increase together with
requests for answers to more and more complex queries.

Data homogenization
Large-scale data warehousing can become an exercise in data homogenization that lessens
the value of the data. For example, in producing a consolidated and integrated view of the
organization's data, the warehouse designer may be tempted to emphasize similarities
rather than differences in the data used by different application areas such as property sales
and property renting.

High demand for resources


The data warehouse can use large amounts of disk space. Many relational databases
used for decision-support are designed around star, snowflake, and starflake schemas
(see Chapter 32). These approaches result in the creation of very large fact tables. If there
are many dimensions to the factual data, the combination of aggregate tables and indexes
to the fact tables can use up more space than the raw data.

Data ownership

Data warehousing may change the attitude of end-users to the ownership of data. Sensitive
data that was originally viewed and used only by a particular department or business area,
such as sales or marketing, may now be made accessible to others in the organization.

High maintenance

Data warehouses are high-maintenance systems. Any reorganization of the business
processes and the source systems may affect the data warehouse. To remain a valuable
resource, the data warehouse must remain consistent with the organization that it supports.

Long-duration projects

A data warehouse represents a single data resource for the organization. However, the
building of a warehouse can take up to three years, which is why some organizations are
building data marts (see Section 31.5). Data marts support only the requirements of a
particular department or functional area and can therefore be built more rapidly.

Complexity of integration

The most important area for the management of a data warehouse is the integration
capabilities. This means an organization must spend a significant amount of time
determining how well the various different data warehousing tools can be integrated into
the overall solution that is needed. This can be a very difficult task, as there are a number
of tools for every operation of the data warehouse, which must integrate well in order that
the warehouse works to the organization's benefit.

31.2 Data Warehouse Architecture

In this section we present an overview of the architecture and major components of a data
warehouse (Anahory and Murray, 1997). The processes, tools, and technologies associated
with data warehousing are described in more detail in the following sections of this chapter.
The typical architecture of a data warehouse is shown in Figure 31.1.

Figure 31.1  Typical architecture of a data warehouse.

31.2.1 Operational Data


The source of data for the data warehouse is supplied from:

- Mainframe operational data held in first-generation hierarchical and network databases.
  It is estimated that the majority of corporate operational data is held in these systems.
- Departmental data held in proprietary file systems such as VSAM and RMS, and in
  relational DBMSs such as Informix and Oracle.
- Private data held on workstations and private servers.
- External systems such as the Internet, commercially available databases, or databases
  associated with an organization's suppliers or customers.

31.2.2 Operational Data Store

An Operational Data Store (ODS) is a repository of current and integrated operational data
used for analysis. It is often structured and supplied with data in the same way as the data
warehouse, but may in fact act simply as a staging area for data to be moved into the
warehouse.
The ODS is often created when legacy operational systems are found to be incapable of
achieving reporting requirements. The ODS provides users with the ease of use of a
relational database while remaining distant from the decision support functions of the data
warehouse.


Building an ODS can be a helpful step towards building a data warehouse because an
ODS can supply data that has been already extracted from the source systems and cleaned.
This means that the remaining work of integrating and restructuring the data for the data
warehouse is simplified (see Section 32.3).

31.2.3 Load Manager


The load manager (also called the frontend component) performs all the operations
associated with the extraction and loading of data into the warehouse. The data may be
extracted directly from the data sources or more commonly from the operational data store.
The operations performed by the load manager may include simple transformations of the
data to prepare the data for entry into the warehouse. The size and complexity of this component will vary between data warehouses and may be constructed using a combination
of vendor data loading tools and custom-built programs.

31.2.4 Warehouse Manager

The warehouse manager performs all the operations associated with the management of
the data in the warehouse. This component is constructed using vendor data management
tools and custom-built programs. The operations performed by the warehouse manager
include:

- analysis of data to ensure consistency;
- transformation and merging of source data from temporary storage into data warehouse
  tables;
- creation of indexes and views on base tables;
- generation of denormalizations (if necessary);
- generation of aggregations (if necessary);
- backing-up and archiving data.

In some cases, the warehouse manager also generates query profiles to determine which
indexes and aggregations are appropriate. A query profile can be generated for each user,
group of users, or the data warehouse, and is based on information that describes the
characteristics of the queries, such as frequency, target table(s), and size of result sets.
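As an illustration of the kind of aggregation and index the warehouse manager might
generate, the following is a minimal SQL sketch; the summary and fact table names and
their columns are hypothetical and not part of the schema used elsewhere in the book.

    -- Pre-computed summary: monthly sales revenue and volume per branch, derived from an
    -- assumed detailed fact table PropertySaleFact.
    CREATE TABLE MonthlyBranchSales AS
    SELECT   branchNo,
             EXTRACT(YEAR FROM saleDate)  AS saleYear,
             EXTRACT(MONTH FROM saleDate) AS saleMonth,
             SUM(salePrice)               AS totalRevenue,
             COUNT(*)                     AS salesCount
    FROM     PropertySaleFact
    GROUP BY branchNo,
             EXTRACT(YEAR FROM saleDate),
             EXTRACT(MONTH FROM saleDate);

    -- Index to support queries that restrict to a particular branch and period.
    CREATE INDEX idxMonthlyBranchSales
        ON MonthlyBranchSales (branchNo, saleYear, saleMonth);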

31.2.5 Query Manager


The query manager (also called the backend component) performs all the operations
associated with the management of user queries. This component is typically constructed
using vendor end-user data access tools, data warehouse monitoring tools, database
facilities, and custom-built programs. The complexity of the query manager is determined
by the facilities provided by the end-user access tools and the database. The operations
performed by this component include directing queries to the appropriate tables and
scheduling the execution of queries. In some cases, the query manager also generates
query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

31.2.6 Detailed Data
This area of the warehouse stores all the detailed data in the database schema. In most
cases, the detailed data is not stored online but is made available by aggregating the data
to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

31.2.7 Lightly and Highly Summarized Data


This area of the warehouse stores all the predefined lightly and highly summarized (aggregated)
data generated by the warehouse manager. This area of the warehouse is transient as it will
be subject to change on an ongoing basis in order to respond to changing query profiles.
The purpose of summary information is to speed up the performance of queries.
Although there are increased operational costs associated with initially summarizing the
data, this is offset by removing the requirement to continually perform summary operations (such as sorting or grouping) in answering user queries. The summary data is updated
continuously as new data is loaded into the warehouse.

31.2.8 Archive/Backup Data
This area of the warehouse stores the detailed and summarized data for the purposes of
archiving and backup. Even though summary data is generated from detailed data, it
may be necessary to backup online summary data if this data is kept beyond the retention
period for detailed data. The data is transferred to storage archives such as magnetic tape
or optical disk.

31.2.9 Metadata

This area of the warehouse stores all the metadata (data about data) definitions used by all
the processes in the warehouse. Metadata is used for a variety of purposes, including:

- the extraction and loading processes: metadata is used to map data sources to a common
  view of the data within the warehouse;
- the warehouse management process: metadata is used to automate the production of
  summary tables;
- the query management process: metadata is used to direct a query to the most appropriate
  data source.


The structure of metadata differs between each process, because the purpose is different.
This means that multiple copies of metadata describing the same data item are held within
the data warehouse. In addition, most vendor tools for copy management and end-user
data access use their own versions of metadata. Specifically, copy management tools use
metadata to understand the mapping rules to apply in order to convert the source data into
a common form. End-user access tools use metadata to understand how to build a query.
The management of metadata within the data warehouse is a very complex task that should
not be underestimated. The issues associated with the management of metadata in a data
warehouse are discussed in Section 31.4.3.

31.2.10 End-User Access Tools

The principal purpose of data warehousing is to provide information to business users for
strategic decision-making. These users interact with the warehouse using end-user access
tools. The data warehouse must efficiently support ad hoc and routine analysis. High
performance is achieved by pre-planning the requirements for joins, summations, and
periodic reports by end-users.
Although the definitions of end-user access tools can overlap, for the purpose of this
discussion we categorize these tools into five main groups (Berson and Smith, 1997):

- reporting and query tools;
- application development tools;
- Executive Information System (EIS) tools;
- Online Analytical Processing (OLAP) tools;
- data mining tools.

Reporting and query tools


Reporting tools include production reporting tools and report writers. Production reporting tools are used to generate regular operational reports or support high-volume batch
jobs, such as customer orders/invoices and staff pay cheques. Report writers, on the other
hand, are inexpensive desktop tools designed for end-users.
Query tools for relational data warehouses are designed to accept SQL or generate
SQL statements to query data stored in the warehouse. These tools shield end-users from
the complexities of SQL and database structures by including a meta-layer between users
and the database. The meta-layer is the software that provides subject-oriented views of
a database and supports point-and-click creation of SQL. An example of a query tool is
Query-By-Example (QBE). The QBE facility of Microsoft Office Access DBMS was
demonstrated in Chapter 7. Query tools are popular with users of business applications
such as demographic analysis and customer mailing lists. However, as questions become
increasingly complex, these tools may rapidly become inefficient.
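As a sketch of the kind of SQL statement that the meta-layer of such a tool might generate
behind a point-and-click interface, consider the simple aggregation mentioned in
Section 31.1.4; the table and column names (PropertyForSale, city, salePrice) and the list of
cities are illustrative assumptions rather than the book's schema.

    -- Average selling price per city for an assumed list of major cities of Great Britain.
    SELECT   city,
             AVG(salePrice) AS avgSellingPrice
    FROM     PropertyForSale
    WHERE    city IN ('London', 'Glasgow', 'Edinburgh', 'Cardiff')
    GROUP BY city;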

Application development tools


The requirements of the end-users may be such that the built-in capabilities of reporting
and query tools are inadequate, either because the required analysis cannot be performed
or because the user interaction requires an unreasonably high level of expertise by the
user. In this situation, user access may require the development of in-house applications
using graphical data access tools designed primarily for client-server environments. Some
of these application development tools integrate with popular OLAP tools, and can access
all major database systems, including Oracle, Sybase, and Informix.

Executive information system (EIS) tools


Executive information systems, more recently referred to as 'everybody's information
systems', were originally developed to support high-level strategic decision-making. However, the focus of these systems widened to include support for all levels of management.
EIS tools were originally associated with mainframes enabling users to build customized,
graphical decision-support applications to provide an overview of the organization's data
and access to external data sources.
Currently, the demarcation between EIS tools and other decision-support tools is even
more vague as EIS developers add additional query facilities and provide custom-built
applications for business areas such as sales, marketing, and finance.

Online Analytical Processing (OLAP) tools


Online Analytical Processing (OLAP) tools are based on the concept of multi-dimensional
databases and allow a sophisticated user to analyze the data using complex, multidimensional views. Typical business applications for these tools include assessing the
effectiveness of a marketing campaign, product sales forecasting, and capacity planning. These tools assume that the data is organized in a multi-dimensional model
supported by a special multi-dimensional database (MDDB) or by a relational database
designed to enable multi-dimensional queries. We discuss OLAP tools in more detail in
Chapter 33.

Data mining tools


Data mining is the process of discovering meaningful new correlations, patterns, and
trends by mining large amounts of data using statistical, mathematical, and artificial
intelligence (AI) techniques. Data mining has the potential to supersede the capabilities of
OLAP tools, as the major attraction of data mining is its ability to build predictive rather
than retrospective models. We discuss data mining in more detail in Chapter 34.

31.3 Data Warehouse Data Flows


In this section we examine the activities associated with the processing (or flow) of data
within a data warehouse. Data warehousing focuses on the management of five primary
data flows, namely the inflow, upflow, downflow, outflow, and metaflow (Hackathorn,
1995). The data flows within a data warehouse are shown in Figure 31.2. The processes
associated with each data flow include:

- Inflow: Extraction, cleansing, and loading of the source data.
- Upflow: Adding value to the data in the warehouse through summarizing, packaging, and
  distribution of the data.
- Downflow: Archiving and backing-up the data in the warehouse.
- Outflow: Making the data available to end-users.
- Metaflow: Managing the metadata.

Figure 31.2  Information flows of a data warehouse.

31.3.1 Inflow
Inflow    The processes associated with the extraction, cleansing, and loading of the data
from the source systems into the data warehouse.

The inflow is concerned with taking data from the source systems to load into the data
warehouse. Alternatively, the data may be first loaded into the operational data store (ODS)
(see Section 31.2.2) before being transferred to the data warehouse. As the source data is
generated predominantly by OLTP systems, the data must be reconstructed for the purposes
of the data warehouse. The reconstruction of the data involves:

- cleansing dirty data;
- restructuring the data to suit the new requirements of the data warehouse including, for
  example, adding and/or removing fields, and denormalizing the data;
- ensuring that the source data is consistent with itself and with the data already in the
  warehouse.

To manage the inflow effectively, mechanisms must be identified to determine when to
start extracting the data, to carry out the necessary transformations, and to undertake
consistency checks. When extracting data from the source systems, it is important to ensure
that the data is in a consistent state, so as to generate a single, consistent view of the
corporate data. The complexity of the extraction process is determined by the extent to
which the source systems are in tune with one another.
Once the data is extracted, the data is usually loaded into a temporary store for the
purposes of cleansing and consistency checking. As this process is complex, it is important
for it to be fully automated and to have the ability to report when problems and failures
occur. Commercial tools are available to support the management of the inflow. However,
unless the process is relatively straightforward, the tools may require customization.
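A minimal sketch of the kind of cleansing and restructuring performed during the inflow is
shown below; the staging and source table names, their columns, and the coding rules are
purely illustrative assumptions.

    -- Copy an extract into a staging table, standardizing property type codes and
    -- discarding rows with missing keys or prices.
    INSERT INTO StagingPropertySale (propertyNo, branchNo, saleDate, salePrice, propertyType)
    SELECT propertyNo,
           branchNo,
           saleDate,
           salePrice,
           CASE UPPER(TRIM(propertyType))
                WHEN 'F'     THEN 'Flat'
                WHEN 'FLAT'  THEN 'Flat'
                WHEN 'H'     THEN 'House'
                WHEN 'HOUSE' THEN 'House'
                ELSE 'Unknown'
           END AS propertyType
    FROM   SourcePropertySale
    WHERE  propertyNo IS NOT NULL
      AND  salePrice  IS NOT NULL;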

31.3.2 Upflow

Upflow    The processes associated with adding value to the data in the warehouse through
summarizing, packaging, and distribution of the data.

The activities associated with the upflow include:

- Summarizing the data by selecting, projecting, joining, and grouping relational data into
  views that are more convenient and useful to the end-users. Summarizing extends beyond
  simple relational operations to involve sophisticated statistical analysis, including
  identifying trends, clustering, and sampling the data.
- Packaging the data by converting the detailed or summarized data into more useful
  formats, such as spreadsheets, text documents, charts, other graphical presentations,
  private databases, and animation.
- Distributing the data to appropriate groups to increase its availability and accessibility.

While adding value to the data, consideration must also be given to support the performance requirements of the data warehouse and to minimize the ongoing operational costs.
These requirements essentially pull the design in opposing directions, forcing restructuring to improve query performance or to lower operational costs. In other words, the data
warehouse administrator must identify the most appropriate database design to meet all
requirements, which often necessitates a degree of compromise.

31.3.3 Downflow

Downflow    The processes associated with archiving and backing-up of data in the
warehouse.

Archiving old data plays an important role in maintaining the effectiveness and performance
of the warehouse by transferring older data of limited value to a storage archive such as
magnetic tape or optical disk. However, if the correct partitioning scheme is selected for the
database, the amount of data online should not affect performance.
Partitioning is a useful design option for very large databases that enables the
fragmentation of a table storing enormous numbers of records into several smaller tables.
The rule for partitioning a given table can be based on characteristics of the data such as
timespan or area of the country. For example, the PropertySale table of DreamHome could
be partitioned according to the countries of the UK.
The downflow of data includes the processes to ensure that the current state of the data
warehouse can be rebuilt following data loss or software/hardware failures. Archived data
should be stored in a way that allows the re-establishment of the data in the warehouse
when required.
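A minimal sketch of the partitioning scheme mentioned above is shown below, using
Oracle-style list partitioning; the PropertySale column definitions are illustrative
assumptions.

    -- Partition an assumed PropertySale table by country of the UK.
    CREATE TABLE PropertySale (
        propertyNo  VARCHAR2(10),
        branchNo    VARCHAR2(5),
        saleDate    DATE,
        salePrice   NUMBER(10,2),
        country     VARCHAR2(20)
    )
    PARTITION BY LIST (country) (
        PARTITION sale_england   VALUES ('England'),
        PARTITION sale_scotland  VALUES ('Scotland'),
        PARTITION sale_wales     VALUES ('Wales'),
        PARTITION sale_nireland  VALUES ('Northern Ireland')
    );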

31.3.4 Outflow
Outflow    The processes associated with making the data available to the end-users.

The outflow is where the real value of warehousing is realized by the organization. This may
require re-engineering the business processes to achieve competitive advantage (Hackathorn,
1995). The two key activities involved in the outflow include:

- Accessing, which is concerned with satisfying the end-users' requests for the data they
  need. The main issue is to create an environment so that users can effectively use the
  query tools to access the most appropriate data source. The frequency of user accesses
  can vary from ad hoc, to routine, to real-time. It is important to ensure that the system's
  resources are used in the most effective way in scheduling the execution of user queries.
- Delivering, which is concerned with proactively delivering information to the end-users'
  workstations and is referred to as a type of publish-and-subscribe process. The warehouse
  publishes various business objects that are revised periodically by monitoring usage
  patterns. Users subscribe to the set of business objects that best meets their needs.

An important issue in managing the outflow is the active marketing of the data warehouse
to users, which will contribute to its overall impact on an organization's operations. There
are additional operational activities in managing the outflow, including directing queries to
the appropriate target table(s) and capturing information on the query profiles associated
with user groups to determine which aggregations to generate.
Data warehouses that contain summary data potentially provide a number of distinct
data sources to respond to a specific query, including the detailed data itself and any number
of aggregations that satisfy the query's data needs. However, the performance of the query
will vary considerably depending on the characteristics of the target data, the most obvious
being the volume of data to be read. As part of managing the outflow, the system must
determine the most efficient way to answer a query.

31.3.5 Metaflow

Metaflow    The processes associated with the management of the metadata.

The previous flows describe the management of the data warehouse with regard to how the
data moves in and out of the warehouse. Metaflow is the process that moves metadata
(data about the other flows). Metadata is a description of the data contents of the data
warehouse, what is in it, where it came from originally, and what has been done to it by
way of cleansing, integrating, and summarizing. We discuss issues associated with the
management of metadata in a data warehouse in Section 31.4.3.
To respond to changing business needs, legacy systems are constantly changing. The
warehouse must therefore respond to these continuous changes, reflecting the changes to
the source legacy systems and the changing business environment; the metaflow (metadata)
must be continuously updated with these changes.

31.4 Data Warehousing Tools and Technologies


In this section we examine the tools and technologies associated with building and
managing a data warehouse and, in particular, we focus on the issues associated with the
integration of these tools. For more information on data warehousing tools and technologies, the interested reader is referred to Berson and Smith (1997).

31.4.1 Extraction, Cleansing, and Transformation Tools


Selecting the correct extraction, cleansing, and transformation tools is a critical step in
the construction of a data warehouse. There are an increasing number of vendors that are
focused on fulfilling the requirements of data warehouse implementations as opposed
to simply moving data between hardware platforms. The tasks of capturing data from
a source system, cleansing and transforming it, and then loading the results into a target
system can be carried out either by separate products, or by a single integrated solution.
Integrated solutions fall into one of the following categories:

- code generators;
- database data replication tools;
- dynamic transformation engines.

Code generators
Code generators create customized 3GL/4GL transformation programs based on source and
target data definitions. The main issue with this approach is the management of the large
number of programs required to support a complex corporate data warehouse. Vendors
recognize this issue and some are developing management components employing techniques such as workflow methods and automated scheduling systems.

Database data replication tools


Database data replication tools employ database triggers or a recovery log to capture
changes to a single data source on one system and apply the changes to a copy of the
source data located on a different system (see Chapter 24). Most replication products do
not support the capture of changes to non-relational files and databases, and often do not
provide facilities for significant data transformation and enhancement. These tools can be
used to rebuild a database following failure or to create a database for a data mart (see
Section 31.5), provided that the number of data sources is small and the level of data
transformation is relatively simple.
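As a minimal sketch of the trigger-based change capture on which such replication tools
rely, the following records changes to an assumed Staff table in a change-capture table that
a replication process could later apply to a copy; the object names and the Oracle-style
PL/SQL syntax are assumptions for illustration.

    -- Change-capture table and trigger recording inserts, updates, and deletes on Staff.
    CREATE TABLE StaffChanges (
        staffNo    VARCHAR2(5),
        changeType CHAR(1),          -- 'I' = insert, 'U' = update, 'D' = delete
        changedAt  TIMESTAMP
    );

    CREATE OR REPLACE TRIGGER trgCaptureStaffChanges
    AFTER INSERT OR UPDATE OR DELETE ON Staff
    FOR EACH ROW
    DECLARE
        vType CHAR(1);
    BEGIN
        IF INSERTING THEN
            vType := 'I';
        ELSIF UPDATING THEN
            vType := 'U';
        ELSE
            vType := 'D';
        END IF;
        INSERT INTO StaffChanges (staffNo, changeType, changedAt)
        VALUES (COALESCE(:NEW.staffNo, :OLD.staffNo), vType, SYSTIMESTAMP);
    END;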

Dynamic transformation engines


Rule-driven dynamic transformation engines capture data from a source system at
user-defined intervals, transform the data, and then send and load the results into a target
environment. To date, most products support only relational data sources, but products are now
emerging that handle non-relational source files and databases.

31.4.2 Data Warehouse DBMS


There are few integration issues associated with the data warehouse database. Due to the
maturity of such products, most relational databases will integrate predictably with other
types of software. However, there are issues associated with the potential size of the data
warehouse database. Parallelism in the database becomes an important issue, as well as the
usual issues such as performance, scalability, availability, and manageability, which must
all be taken into consideration when choosing a DBMS. We first identify the requirements
for a data warehouse DBMS and then discuss briefly how the requirements of data warehousing are supported by parallel technologies.

Requirements for data warehouse DBMS


The specialized requirements for a relational DBMS suitable for data warehousing are
published in a White Paper (Red Brick Systems, 1996) and are listed in Table 31.3.

Table 31.3  The requirements for a data warehouse RDBMS.

Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality

Load performance
Data warehouses require incremental loading of new data on a periodic basis within
narrow time windows. Performance of the load process should be measured in hundreds
of millions of rows or gigabytes of data per hour and there should be no maximum limit
that constrains the business.
Load processing
Many steps must be taken to load new or updated data into the data warehouse including
data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and
metadata update. Although each step may in practice be atomic, the load process should
appear to execute as a single, seamless unit of work.
Data quality management
The shift to fact-based management demands the highest data quality. The warehouse
must ensure local consistency, global consistency, and referential integrity despite dirty
sources and massive database sizes. While loading and preparation are necessary steps,
they are not sufficient. The ability to answer end-users' queries is the measure of success
for a data warehouse application. As more questions are answered, analysts tend to ask
more creative and complex questions.
Query performance
Fact-based management and ad hoc analysis must not be slowed or inhibited by the
performance of the data warehouse RDBMS. Large, complex queries for key business
operations must complete in reasonable time periods.
Terabyte scalability
Data warehouse sizes are growing at enormous rates with sizes ranging from a few to
hundreds of gigabytes, to terabyte-sized (10^12 bytes) and petabyte-sized (10^15 bytes).


The RDBMS must not have any architectural limitations to the size of the database and
should support modular and parallel management. In the event of failure, the RDBMS should
support continued availability, and provide mechanisms for recovery. The RDBMS must
support mass storage devices such as optical disk and hierarchical storage management
devices. Lastly, query performance should not be dependent on the size of the database,
but rather on the complexity of the query.
Mass user scalability
Current thinking is that access to a data warehouse is limited to relatively low numbers
of managerial users. This is unlikely to remain true as the value of data warehouses is
realized. It is predicted that the data warehouse RDBMS should be capable of supporting
hundreds, or even thousands, of concurrent users while maintaining acceptable query
performance.
Networked data warehouse
Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets
of data between warehouses. Users should be able to look at, and work with, multiple data
warehouses from a single client workstation.
Warehouse administration
The very-large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource
limits, chargeback accounting to allocate costs back to users, and query prioritization to
address the needs of different user classes and activities. The RDBMS must also provide
for workload tracking and tuning so that system resources may be optimized for maximum
performance and throughput. The most visible and measurable value of implementing
a data warehouse is evidenced in the uninhibited, creative access to data it provides for
end-users.
Integrated dimensional analysis
The power of multi-dimensional views is widely accepted, and dimensional support
must be inherent in the warehouse RDBMS to provide the highest performance for
relational OLAP tools (see Chapter 33). The RDBMS must support fast, easy creation of
pre-computed summaries common in large data warehouses, and provide maintenance
tools to automate the creation of these pre-computed aggregates. Dynamic calculation of
aggregates should be consistent with the interactive performance needs of the end-user.
Advanced query functionality
End-users require advanced analytical calculations, sequential and comparative analysis,
and consistent access to detailed and summarized data. Using SQL in a client-server
point-and-click tool environment may sometimes be impractical or even impossible
due to the complexity of the users' queries. The RDBMS must provide a complete and
advanced set of analytical operations.

Parallel DBMSs

Data warehousing requires the processing of enormous amounts of data, and parallel
database technology offers a solution to providing the necessary growth in performance.
The success of parallel DBMSs depends on the efficient operation of many resources,
including processors, memory, disks, and network connections. As data warehousing grows
in popularity, many vendors are building large decision-support DBMSs using parallel
technologies. The aim is to solve decision-support problems using multiple nodes working
on the same problem. The major characteristics of parallel DBMSs are scalability,
operability, and availability.
The parallel DBMS performs many database operations simultaneously, splitting
individual tasks into smaller parts so that tasks can be spread across multiple processors.
Parallel DBMSs must be capable of running parallel queries. In other words, they must be
able to decompose large complex queries into subqueries, run the separate subqueries
simultaneously, and reassemble the results at the end. The capability of such DBMSs must
also include parallel data loading, table scanning, and data archiving and backup. There are
two main parallel hardware architectures commonly used as database server platforms for
data warehousing:

- Symmetric Multi-Processing (SMP): a set of tightly coupled processors that share memory
  and disk storage;
- Massively Parallel Processing (MPP): a set of loosely coupled processors, each of which
  has its own memory and disk storage.

The SMP and MPP parallel architectures were described in detail in Section 22.1.1.

31.4.3 Data Warehouse Metadata


There are many issues associated with data warehouse integration; however, in this section
we focus on the integration of metadata, that is, data about data (Darling, 1996). The
management of the metadata in the warehouse is an extremely complex and difficult task.
Metadata is used for a variety of purposes and the management of metadata is a critical
issue in achieving a fully integrated data warehouse.
The major purpose of metadata is to show the pathway back to where the data began,
so that the warehouse administrators know the history of any item in the warehouse.
However, the problem is that metadata has several functions within the warehouse that
relate to the processes associated with data transformation and loading, data warehouse
management, and query generation (see Section 31.2.9).
The metadata associated with data transformation and loading must describe the
source data and any changes that were made to the data. For example, for each source
field there should be a unique identifier, original field name, source data type, and original
location including the system and object name, along with the destination data type and
destination table name. If the field is subject to any transformations, ranging from a simple
field type change to a complex set of procedures and functions, this should also be recorded.
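A minimal sketch of how such transformation and loading metadata might be recorded is
shown below; the table and its columns are illustrative assumptions rather than part of any
particular tool or of the book's schema.

    -- Metadata table describing source-to-target field mappings and transformations.
    CREATE TABLE FieldMapping (
        mappingId      INTEGER PRIMARY KEY,  -- unique identifier for the mapping
        sourceSystem   VARCHAR(30),          -- originating system
        sourceObject   VARCHAR(30),          -- source table or file name
        sourceField    VARCHAR(30),          -- original field name
        sourceDataType VARCHAR(20),
        targetTable    VARCHAR(30),          -- destination table in the warehouse
        targetField    VARCHAR(30),
        targetDataType VARCHAR(20),
        transformation VARCHAR(200)          -- description of any transformation applied
    );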
The metadata associated with data management describes the data as it is stored in the
warehouse. Every object in the database needs to be described, including the data in each
table, index, and view, and any associated constraints. This information is held in the
DBMS system catalog; however, there are additional requirements for the purposes of
the warehouse. For example, metadata should also describe any fields associated with
aggregations, including a description of the aggregation that was performed. In addition,
table partitions should be described including information on the partition key, and the
data range associated with that partition.
The metadata described above is also required by the query manager to generate appropriate queries. In turn, the query manager generates additional metadata about the queries
that are run, which can be used to generate a history on all the queries and a query profile
for each user, group of users, or the data warehouse. There is also metadata associated
with the users of queries that includes, for example, information describing what the term
'price' or 'customer' means in a particular database and whether the meaning has changed
over time.

Synchronizing metadata
The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse. The various tools of a data warehouse generate and use
their own metadata, and to achieve integration, we require that these tools are capable of
sharing their metadata. The challenge is to synchronize metadata between different products from different vendors using different metadata stores. For example, it is necessary
to identify the correct item of metadata at the right level of detail from one product and
map it to the appropriate item of metadata at the right level of detail in another product,
then sort out any coding differences between them. This has to be repeated for all other
metadata that the two products have in common. Further, any changes to the metadata
(or even meta-metadata) in one product need to be conveyed to the other product. The
task of synchronizing two products is highly complex, and therefore repeating this process
for six or more products that make up the data warehouse can be resource intensive.
However, integration of the metadata must be achieved.
In the beginning there were two major standards for metadata and modeling in the
areas of data warehousing and component-based development proposed by the Meta
Data Coalition (MDC) and the Object Management Group (OMG). However, these two
industry organizations jointly announced that the MDC would merge into the OMG. As
a result, the MDC discontinued independent operations and work continued in the OMG
to integrate the two standards.
The merger of MDC into the OMG marked an agreement of the major data warehousing and metadata vendors to converge on one standard, incorporating the best of the
MDC's Open Information Model (OIM) with the best of the OMG's Common Warehouse
Metamodel (CWM). This work is now complete and the resulting specification issued by
the OMG as the next version of the CWM is discussed in Section 27.1.3. A single standard allows users to exchange metadata between different products from different vendors
freely.
The OMG's CWM builds on various standards, including the OMG's UML (Unified
Modeling Language), XMI (XML Metadata Interchange), and MOF (Meta Object
Facility), and on the MDCs OIM. The CWM was developed by a number of companies,
including IBM, Oracle, Unisys, Hyperion, Genesis, NCR, UBS, and Dimension EDI.

31.4.4 Administration and Management Tools

A data warehouse requires tools to support the administration and management of such a
complex environment. These tools are relatively scarce, especially those that are well
integrated with the various types of metadata and the day-to-day operations of the data
warehouse. The data warehouse administration and management tools must be capable of
supporting the following tasks:

- monitoring data loading from multiple sources;
- data quality and integrity checks;
- managing and updating metadata;
- monitoring database performance to ensure efficient query response times and resource
  utilization;
- auditing data warehouse usage to provide user chargeback information;
- replicating, subsetting, and distributing data;
- maintaining efficient data storage management;
- purging data;
- archiving and backing-up data;
- implementing recovery following failure;
- security management.

31.5 Data Marts

Accompanying the rapid emergence of data warehouses is the related concept of data marts.
In this section we describe what data marts are, the reasons for building data marts, and the
issues associated with the development and use of data marts.

Data mart    A subset of a data warehouse that supports the requirements of a particular
department or business function.

A data mart holds a subset of the data in a data warehouse, normally in the form of summary
data relating to a particular department or business function. The data mart can be
standalone or linked centrally to the corporate data warehouse. As a data warehouse grows
larger, the ability to serve the various needs of the organization may be compromised. The
popularity of data marts stems from the fact that corporate-wide data warehouses are
proving difficult to build and use. The typical architecture for a data warehouse and
associated data mart is shown in Figure 31.3. The characteristics that differentiate data
marts and data warehouses include:

- a data mart focuses on only the requirements of users associated with one department or
  business function;
- data marts do not normally contain detailed operational data, unlike data warehouses;
- as data marts contain less data compared with data warehouses, data marts are more
  easily understood and navigated.

1171

1172

|
Chapter 31 z Data Warehousing Concepts

Figure 31.3 Typical data warehouse and data mart architecture.

31.5.2

31.5.1

31.5 Data Marts

as data marts contain less data compared with data warehouses, data marts are more
easily understood and navigated.

There are several approaches to building data marts. One approach is to build several
data marts with a view to the eventual integration into a warehouse; another approach is
to build the infrastructure for a corporate data warehouse while at the same time building
one or more data marts to satisfy immediate business needs.
Data mart architectures can be built as two-tier or three-tier database applications. The
data warehouse is the optional first tier (if the data warehouse provides the data for the
data mart), the data mart is the second tier, and the end-user workstation is the third tier,
as shown in Figure 31.3. Data is distributed among the tiers.

31.5.1 Reasons for Creating a Data Mart


There are many reasons for creating a data mart, which include:

To give users access to the data they need to analyze most often.
To provide data in a form that matches the collective view of the data by a group of users in a department or business function.
To improve end-user response time due to the reduction in the volume of data to be accessed.
To provide appropriately structured data as dictated by the requirements of end-user access tools such as Online Analytical Processing (OLAP) and data mining tools, which may require their own internal database structures. In practice, these tools often create their own data mart designed to support their specific functionality.
Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse.
The cost of implementing data marts is normally less than that required to establish a data warehouse.
The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.

31.5.2 Data Marts Issues


The issues associated with the development and management of data marts are listed in
Table 31.4 (Brooks, 1997).

Table 31.4  The issues associated with data marts.

Data mart functionality
Data mart size
Data mart load performance
Users' access to data in multiple data marts
Data mart Internet/intranet access
Data mart administration
Data mart installation

Data mart functionality

The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes (Gb), and provide sophisticated analysis using Online Analytical Processing (OLAP) and/or data mining tools. Further, hundreds of users must be capable of remotely accessing the data mart. The complexity and size of some data marts are matching the characteristics of small-scale corporate data warehouses.

Data mart size


Users expect faster response times from data marts than from data warehouses; however,
performance deteriorates as data marts grow in size. Several vendors of data marts are
investigating ways to reduce the size of data marts to gain improvements in performance. For example, dynamic dimensions allow aggregations to be calculated on demand
rather than pre-calculated and stored in the multi-dimensional database (MDDB) cube
(see Chapter 33).

Data mart load performance


A data mart has to balance two critical components: end-user response time and data
loading performance. A data mart designed for fast user response will have a large
number of summary tables and aggregate values. Unfortunately, the creation of such tables
and values greatly increases the time of the load procedure. Vendors are investigating
improvements in the load procedure by providing indexes that automatically and continually adapt to the data being processed or by supporting incremental database updating
so that only cells affected by the change are updated and not the entire MDDB structure.

Users' access to data in multiple data marts


One approach is to replicate data between different data marts or, alternatively, build
virtual data marts. Virtual data marts are views of several physical data marts or the
corporate data warehouse tailored to meet the requirements of specific groups of users.
Commercial products that manage virtual data marts are available.
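For example, a virtual data mart may be implemented simply as a set of views defined over the warehouse tables. The sketch below is illustrative only; the table, column, and view names are assumptions made for the example and do not refer to any particular product.

-- A 'virtual data mart' for one group of users, defined as a view over
-- the corporate warehouse rather than as physically copied data.
CREATE VIEW glasgow_sales_mart AS
SELECT s.timeID, s.propertyID, s.sellingPrice, s.saleRevenue
FROM   sales s, branch b
WHERE  s.branchID = b.branchID
AND    b.city = 'Glasgow';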

Data mart Internet/intranet access

Internet/intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers such as Netscape Navigator and Microsoft Internet Explorer. Data mart Internet/intranet products normally sit between a Web server and the data analysis product. Vendors are developing products with increasingly advanced Web capabilities. These products include Java and ActiveX capabilities. We discussed Web and DBMS integration in detail in Chapter 29.

Data mart administration


As the number of data marts in an organization increases, so does the need to centrally
manage and coordinate data mart activities. Once data is copied to data marts, data can
become inconsistent as users alter their own data marts to allow them to analyze data in
different ways. Organizations cannot easily perform administration of multiple data marts,
giving rise to issues such as data mart versioning, data and metadata consistency and
integrity, enterprise-wide security, and performance tuning. Data mart administrative tools
are commercially available.

Data mart installation


Data marts are becoming increasingly complex to build. Vendors are offering products referred to as 'data marts in a box' that provide a low-cost source of data mart tools.

Data Warehousing Using Oracle


In Chapter 8 we provided a general overview of the major features of the Oracle DBMS.
In this section we describe the features of Oracle9i Enterprise Edition that are specifically
designed to improve performance and manageability for the data warehouse (Oracle
Corporation, 2004f).

Oracle9i
Oracle9i Enterprise Edition is one of the leading relational DBMSs for data warehousing. Oracle has achieved this success by focusing on basic, core requirements for data warehousing: performance, scalability, and manageability. Data warehouses store larger volumes of data, support more users, and require faster performance, and so these core requirements remain key factors in the successful implementation of data warehouses.
However, Oracle goes beyond these core requirements and is the first true data warehouse
platform. Data warehouse applications require specialized processing techniques to allow
support for complex, ad hoc queries running against large amounts of data. To address
these special requirements, Oracle offers a variety of query processing techniques, sophisticated query optimization to choose the most efficient data access path, and a scalable
architecture that takes full advantage of all parallel hardware configurations. Successful
data warehouse applications rely on superior performance when accessing the enormous
amounts of stored data. Oracle provides a rich variety of integrated indexing schemes,
join methods, and summary management features, to deliver answers quickly to data

warehouse users. Oracle also addresses applications that have mixed workloads and where administrators want to control which users, or groups of users, have priority when executing transactions or queries. In this section we provide an overview of the main features of Oracle, which are particularly aimed at supporting data warehousing applications. These features include:

summary management;
analytical functions;
bitmapped indexes;
advanced join methods;
sophisticated SQL optimizer;
resource management.

Summary management
In a data warehouse application, users often issue queries that summarize detail data by
common dimensions, such as month, product, or region. Oracle provides a mechanism for
storing multiple dimensions and summary calculations on a table. Thus, when a query
requests a summary of detail records, the query is transparently re-written to access the
stored aggregates rather than summing the detail records every time the query is issued.
This results in dramatic improvements in query performance. These summaries are automatically maintained from data in the base tables. Oracle also provides summary advisory
functions that assist database administrators in choosing which summary tables are the
most effective, depending on actual workload and schema statistics. Oracle Enterprise
Manager supports the creation and management of materialized views and related dimensions and hierarchies via a graphical interface, greatly simplifying the management of
materialized views.
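As an illustration, a summary of property sales by branch city might be stored as a materialized view that is eligible for query rewrite. The table and column names below are assumptions made for the example; they are not part of the Oracle feature itself.

-- Pre-computed summary; queries that aggregate the detail table can be
-- transparently rewritten by the optimizer to read this summary instead.
CREATE MATERIALIZED VIEW sales_by_city_mv
  BUILD IMMEDIATE
  REFRESH ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT b.city,
       SUM(s.sellingPrice) AS totalSales,
       COUNT(*)            AS numSales
FROM   sales s, branch b
WHERE  s.branchID = b.branchID
GROUP BY b.city;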

Analytical functions

Oracle9i includes a range of SQL functions for business intelligence and data warehousing applications. These functions are collectively called analytical functions, and they provide improved performance and simplified coding for many business analysis queries. Some examples of the new capabilities are:

ranking (for example, who are the top ten sales reps in each region of Great Britain?);
moving aggregates (for example, what is the three-month moving average of property sales?);
other functions including cumulative aggregates, lag/lead expressions, period-over-period comparisons, and ratio-to-report.

Oracle also includes the CUBE and ROLLUP operators for OLAP analysis, via SQL.
These analytical and OLAP functions significantly extend the capabilities of Oracle for
analytical applications (see Chapter 33).
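The following queries sketch the style of these functions; the table and column names are assumptions made purely for the illustration.

-- Ranking: sales staff ranked by total revenue within each region
SELECT region, staffNo,
       RANK() OVER (PARTITION BY region
                    ORDER BY SUM(saleRevenue) DESC) AS revenueRank
FROM   sales
GROUP BY region, staffNo;

-- Moving aggregate: three-month moving average of monthly sales totals
SELECT saleMonth,
       AVG(monthlyTotal)
         OVER (ORDER BY saleMonth
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS movingAvg3Month
FROM   monthlySales;

-- ROLLUP: sales subtotals by region and city, plus a grand total
SELECT region, city, SUM(sellingPrice) AS totalSales
FROM   sales
GROUP BY ROLLUP (region, city);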

Bitmapped indexes

31.6 Data Warehousing Using Oracle

Bitmapped indexes deliver performance benefits to data warehouse applications. They coexist with, and complement, other available indexing schemes, including standard B-tree indexes, clustered tables, and hash clusters. While a B-tree index may be the most efficient way to retrieve data using a unique identifier, bitmapped indexes are most efficient when retrieving data based on much wider criteria, such as 'How many flats were sold last month?'. In data warehousing applications, end-users often query data based on
these wider criteria. Oracle enables efficient storage of bitmap indexes through the use of
advanced data compression technology.
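For example, bitmap indexes might be created on low-cardinality columns such as a property's type and city (the table and column names here are assumed for the illustration):

-- Bitmap indexes on low-cardinality columns
CREATE BITMAP INDEX property_type_bix ON property (type);
CREATE BITMAP INDEX property_city_bix ON property (city);

-- A wide, ad hoc query that the optimizer can answer by combining the
-- two bitmaps with a bitwise AND before touching the table
SELECT COUNT(*)
FROM   property
WHERE  type = 'Flat' AND city = 'Glasgow';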

Advanced join methods


Oracle offers partition-wise joins, which dramatically increase the performance of joins
involving tables that have been partitioned on the join keys. Joining records in matching
partitions increases performance, by avoiding partitions that could not possibly have
matching key records. Less memory is also used since less in-memory sorting is required.
Hash joins deliver higher performance over other join methods in many complex
queries, especially for those queries where existing indexes cannot be leveraged in join
processing, a common occurrence in ad hoc query environments. This join eliminates the
need to perform sorts, by using an in-memory hash table constructed at runtime. The hash
join is also ideally suited for scalable parallel execution.
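A brief sketch of how these techniques might be applied (the table definitions and the 16-way partitioning are invented for the example):

-- Hash-partition the fact table on the join key, so that a join to a table
-- partitioned the same way can be performed partition-wise.
CREATE TABLE sales (
  saleID       NUMBER,
  branchID     NUMBER,
  sellingPrice NUMBER(10,2)
)
PARTITION BY HASH (branchID) PARTITIONS 16;

-- Hint requesting a hash join where no useful index exists, as is common
-- in ad hoc query environments
SELECT /*+ USE_HASH(s b) */ b.city, SUM(s.sellingPrice)
FROM   sales s, branch b
WHERE  s.branchID = b.branchID
GROUP BY b.city;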

Sophisticated SQL optimizer


Oracle provides numerous powerful query processing techniques that are completely
transparent to the end-user. The Oracle cost-based optimizer dynamically determines
the most efficient access paths and joins for every query. It incorporates transformation
technology that automatically re-writes queries generated by end-user tools, for efficient
query execution.
To choose the most efficient query execution strategy, the Oracle cost-based optimizer
takes into account statistics, such as the size of each table and the selectivity of each query
condition. Histograms provide the cost-based optimizer with more detailed statistics based
on a skewed, non-uniform data distribution. The cost-based optimizer optimizes execution
of queries involved in a star schema, which is common in data warehouse applications
(see Section 32.2). By using a sophisticated star-query optimization algorithm and bitmapped indexes, Oracle can dramatically reduce the query executions done in a traditional
join fashion. Oracle query processing not only includes a comprehensive set of specialized
techniques in all areas (optimization, access and join methods, and query execution), they
are also all seamlessly integrated, and work together to deliver the full power of the query
processing engine.
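In practice the cost-based optimizer relies on up-to-date statistics (including histograms for skewed columns) and, for star queries, on star transformation being enabled. A minimal sketch, with the schema and table names assumed:

-- Gather table statistics, letting Oracle decide where histograms help
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'DWH',
    tabname    => 'SALES',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');
END;
/

-- Allow the optimizer to consider the bitmap-based star transformation
ALTER SESSION SET star_transformation_enabled = TRUE;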

Resource management
Managing CPU and disk resources in a multi-user data warehouse or OLTP application
is challenging. As more users require access, contention for resources becomes greater.


Oracle has resource management functionality that provides control of system resources assigned to users. Important online users, such as order entry clerks, can be given a high priority, while other users, such as those running batch reports, receive lower priorities. Users are assigned to resource classes, such as order entry or batch, and each resource class is then assigned an appropriate percentage of machine resources. In this way, high-priority users are given more system resources than lower-priority users.
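This kind of policy can be expressed with the Database Resource Manager package. The sketch below is illustrative only: the plan, group names, and percentages are invented, and the separate step of mapping individual users to the consumer groups is omitted.

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('order_entry', 'online order entry clerks');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('batch',       'batch reporting');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN('daytime_plan', 'favour online work during the day');
  -- CPU shares at level 1: high priority for online users, low for batch
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'daytime_plan', group_or_subplan => 'order_entry',
    comment => 'high priority', cpu_p1 => 80);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'daytime_plan', group_or_subplan => 'batch',
    comment => 'low priority', cpu_p1 => 15);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'daytime_plan', group_or_subplan => 'OTHER_GROUPS',
    comment => 'everyone else', cpu_p1 => 5);
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/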

Additional data warehouse features


Oracle also includes many features that improve the management and performance of data
warehouse applications. Index rebuilds can be done online without interrupting inserts,
updates, or deletes that may be occurring on the base table. Function-based indexes can be
used to index expressions, such as arithmetic expressions, or functions that modify column
values. The sample scan functionality allows queries to run and only access a specified
percentage of the rows or blocks of a table. This is useful for getting meaningful aggregate
amounts, such as an average, without accessing every row of a table.
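For example (table and column names assumed):

-- Function-based index on an expression used by queries
CREATE INDEX sales_margin_idx ON sales (sellingPrice - offerPrice);

-- Sample scan: estimate an average by reading roughly 5% of the rows
SELECT AVG(sellingPrice)
FROM   sales SAMPLE (5);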

Chapter Summary

Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. A data warehouse is a data management and data analysis technology.

A Data Webhouse is a distributed data warehouse that is implemented over the Web with no central data repository.

The potential benefits of data warehousing are high returns on investment, substantial competitive advantage, and increased productivity of corporate decision-makers.

A DBMS built for Online Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximize the transaction processing capacity, while data warehouses are designed to support ad hoc query processing.

The major components of a data warehouse include the operational data sources, operational data store, load manager, warehouse manager, query manager, detailed, lightly and highly summarized data, archive/backup data, metadata, and end-user access tools.

The operational data source for the data warehouse is supplied from mainframe operational data held in first generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.

The operational data store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.

Data warehousing focuses on the management of five primary data flows, namely the inflow, upflow, downflow, outflow, and metaflow.

Inflow is the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

Upflow is the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

Downflow is the processes associated with archiving and backing-up of data in the warehouse.

Outflow is the processes associated with making the data available to the end-users.

Metaflow is the processes associated with the management of the metadata (data about data).

The load manager (also called the frontend component) performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.

The warehouse manager performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing-up data.

The query manager (also called the backend component) performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.

End-user access tools can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, Online Analytical Processing (OLAP) tools, and data mining tools.

The requirements for a data warehouse RDBMS include load performance, load processing, data quality management, query performance, terabyte scalability, mass user scalability, networked data warehouse, warehouse administration, integrated dimensional analysis, and advanced query functionality.

A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function. The issues associated with data marts include functionality, size, load performance, users' access to data in multiple data marts, Internet/intranet access, administration, and installation.

Review Questions

31.1 Discuss what is meant by the following terms when describing the characteristics of the data in a data warehouse:
(a) subject-oriented;
(b) integrated;
(c) time-variant;
(d) non-volatile.

31.2 Discuss how Online Transaction Processing (OLTP) systems differ from data warehousing systems.

31.3 Discuss the main benefits and problems associated with data warehousing.

31.4 Present a diagrammatic representation of the typical architecture and main components of a data warehouse.

31.5 Describe the characteristics and main functions of the following components of a data warehouse:
(a) load manager;
(b) warehouse manager;
(c) query manager;
(d) metadata;
(e) end-user access tools.

31.6 Discuss the activities associated with each of the five primary data flows or processes within a data warehouse:
(a) inflow;
(b) upflow;
(c) downflow;
(d) outflow;
(e) metaflow.

31.7 What are the three main approaches taken by vendors to provide data extraction, cleansing, and transformation tools?

31.8 Describe the specialized requirements of a relational database management system (RDBMS) suitable for use in a data warehouse environment.

31.9 Discuss how parallel technologies can support the requirements of a data warehouse.

31.10 Discuss the importance of managing metadata and how this relates to the integration of the data warehouse.

31.11 Discuss the main tasks associated with the administration and management of a data warehouse.

31.12 Discuss how data marts differ from data warehouses and identify the main reasons for implementing a data mart.

31.13 Identify the main issues associated with the development and management of data marts.

31.14 Describe the features of Oracle that support the core requirements of data warehousing.

Exercise

31.15 You are asked by the Managing Director of DreamHome to investigate and report on the applicability of data
warehousing for the organization. The report should compare data warehouse technology with OLTP systems
and should identify the advantages and disadvantages, and any problem areas associated with implementing
a data warehouse. The report should reach a fully justified set of conclusions on the applicability of a data
warehouse for DreamHome.

Chapter 32
Data Warehousing Design

Chapter Objectives

In this chapter you will learn:

The issues associated with designing a data warehouse database.
A technique for designing a data warehouse database called dimensionality modeling.
How a dimensional model (DM) differs from an Entity-Relationship (ER) model.
A step-by-step methodology for designing a data warehouse database.
Criteria for assessing the degree of dimensionality provided by a data warehouse.
How Oracle Warehouse Builder can be used to build a data warehouse.

In Chapter 31 we described the basic concepts of data warehousing. In this chapter we focus on the issues associated with data warehouse database design. Since the 1980s, data warehouses have evolved their own design techniques, distinct from transaction-processing systems. Dimensional design techniques have emerged as the dominant approach for most data warehouse databases.

Structure of this Chapter

In Section 32.1 we highlight the major issues associated with data warehouse design. In Section 32.2 we describe the basic concepts associated with dimensionality modeling and then compare this technique with traditional Entity-Relationship modeling. In Section 32.3 we describe and demonstrate a step-by-step methodology for designing a data warehouse database using worked examples taken from an extended version of the DreamHome case study described in Section 10.4 and Appendix A. In Section 32.4 we describe criteria for assessing the dimensionality of a data warehouse. Finally, in Section 32.5 we describe how to design a data warehouse using an Oracle product called Oracle Warehouse Builder.

32.1 Designing a Data Warehouse Database

Designing a data warehouse database is highly complex. To begin a data warehouse project, we need answers for questions such as: which user requirements are most important
and which data should be considered first? Also, should the project be scaled down into
something more manageable yet at the same time provide an infrastructure capable of
ultimately delivering a full-scale enterprise-wide data warehouse? Questions such as these
highlight some of the major issues in building data warehouses. For many enterprises the
solution is data marts, which we described in Section 31.5. Data marts allow designers
to build something that is far simpler and achievable for a specific group of users. Few
designers are willing to commit to an enterprise-wide design that must meet all user
requirements at one time. However, despite the interim solution of building data marts,
the goal remains the same; the ultimate creation of a data warehouse that supports the
requirements of the enterprise.
The requirements collection and analysis stage (see Section 9.5) of a data warehouse
project involves interviewing appropriate members of staff such as marketing users,
finance users, sales users, operational users, and management to enable the identification
of a prioritized set of requirements for the enterprise that the data warehouse must meet.
At the same time, interviews are conducted with members of staff responsible for Online
Transaction Processing (OLTP) systems to identify which data sources can provide clean,
valid, and consistent data that will remain supported over the next few years.
The interviews provide the necessary information for the top-down view (user requirements) and the bottom-up view (which data sources are available) of the data warehouse.
With these two views defined we are ready to begin the process of designing the data warehouse database.
The database component of a data warehouse is described using a technique called dimensionality modeling. In the following sections, we first describe the concepts associated
with a dimensional model and contrast this model with the traditional EntityRelationship
(ER) model (see Chapters 11 and 12). We then present a step-by-step methodology for
creating a dimensional model using worked examples from an extended version of the
DreamHome case study.

32.2 Dimensionality Modeling

Dimensionality modeling  A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.

Dimensionality modeling uses the concepts of Entity-Relationship (ER) modeling with some important restrictions. Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table. In other words, the primary key of the fact table is made up of two or more foreign keys. This characteristic star-like structure is called a star schema or star join. An example star schema for the property sales of DreamHome is shown in Figure 32.1. Note that foreign keys (labeled {FK}) are included in a dimensional model.

Another important feature of a DM is that all natural keys are replaced with surrogate keys. This means that every join between fact and dimension tables is based on surrogate keys, not natural keys. Each surrogate key should have a generalized structure based on simple integers. The use of surrogate keys allows the data in the warehouse to have some independence from the data used and produced by the OLTP systems. For example, each branch has a natural key, namely branchNo, and also a surrogate key, namely branchID.

Star schema  A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).
The star schema exploits the characteristics of factual data such that facts are generated
by events that occurred in the past, and are unlikely to change, regardless of how they are
analyzed. As the bulk of data in a data warehouse is represented as facts, the fact tables
can be extremely large relative to the dimension tables. As such, it is important to treat
fact data as read-only reference data that will not change over time. The most useful fact
tables contain one or more numerical measures, or facts, that occur for each record. In
Figure 32.1, the facts are offerPrice, sellingPrice, saleCommission, and saleRevenue. The most
useful facts in a fact table are numeric and additive because data warehouse applications
almost never access a single record; rather, they access hundreds, thousands, or even
millions of records at a time and the most useful thing to do with so many records is to
aggregate them.
Dimension tables, by contrast, generally contain descriptive textual information.
Dimension attributes are used as the constraints in data warehouse queries. For example,
the star schema shown in Figure 32.1 can support queries that require access to sales
of properties in Glasgow using the city attribute of the PropertyForSale table, and on sales
of properties that are flats using the type attribute in the PropertyForSale table. In fact, the
usefulness of a data warehouse is in relation to the appropriateness of the data held in the
dimension tables.
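A minimal SQL sketch of the centre of the star schema of Figure 32.1 follows. Only two of the dimension tables are shown (the remaining dimensions and attributes of the figure are omitted for brevity), and the data types and key constraints are assumptions made for the illustration.

-- Dimension tables with simple surrogate primary keys
CREATE TABLE Branch (
  branchID NUMBER PRIMARY KEY,   -- surrogate key
  branchNo VARCHAR2(8),          -- natural key from the OLTP system
  city     VARCHAR2(30),
  region   VARCHAR2(30),
  country  VARCHAR2(30)
);

CREATE TABLE PropertyForSale (
  propertyID NUMBER PRIMARY KEY,
  propertyNo VARCHAR2(8),
  type       VARCHAR2(10),
  city       VARCHAR2(30)
);

-- Fact table: the composite primary key is made up entirely of foreign keys
CREATE TABLE PropertySale (
  timeID         NUMBER,
  propertyID     NUMBER REFERENCES PropertyForSale,
  branchID       NUMBER REFERENCES Branch,
  clientID       NUMBER,
  offerPrice     NUMBER(10,2),
  sellingPrice   NUMBER(10,2),
  saleCommission NUMBER(10,2),
  saleRevenue    NUMBER(10,2),
  PRIMARY KEY (timeID, propertyID, branchID, clientID)
);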

Figure 32.1  Star schema for property sales of DreamHome.

Star schemas can be used to speed up query performance by denormalizing reference information into a single dimension table. For example, in Figure 32.1 note that several dimension tables (namely PropertyForSale, Branch, ClientBuyer, Staff, and Owner) contain location data (city, region, and country), which is repeated in each. Denormalization is appropriate when there are a number of entities related to the dimension table that are often accessed, avoiding the overhead of having to join additional tables to access those attributes. Denormalization is not appropriate where the additional data is not accessed very often, because the overhead of scanning the expanded dimension table may not be offset by any gain in the query performance.

Snowflake schema  A variant of the star schema where dimension tables do not contain denormalized data.

Figure 32.2  Part of star schema for property sales of DreamHome with a normalized version of the Branch dimension table.

There is a variation to the star schema called the snowflake schema, which allows dimensions to have dimensions. For example, we could normalize the location data (city, region, and country attributes) in the Branch dimension table of Figure 32.1 to create two new dimension tables called City and Region. A normalized version of the Branch dimension table of the property sales schema is shown in Figure 32.2. In a snowflake schema the location data in the PropertyForSale, ClientBuyer, Staff, and Owner dimension tables would also be removed and the new City and Region dimension tables would be shared with these tables.
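A sketch of the normalization shown in Figure 32.2 (data types and constraint details are assumed):

-- Location data removed from Branch and held in its own dimension tables
CREATE TABLE Region (
  regionID NUMBER PRIMARY KEY,
  region   VARCHAR2(30),
  country  VARCHAR2(30)
);

CREATE TABLE City (
  cityID   NUMBER PRIMARY KEY,
  city     VARCHAR2(30),
  regionID NUMBER REFERENCES Region
);

CREATE TABLE Branch (
  branchID NUMBER PRIMARY KEY,
  branchNo VARCHAR2(8),
  cityID   NUMBER REFERENCES City   -- a dimension with its own dimension
);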
Starflake schema  A hybrid structure that contains a mixture of star and snowflake schemas.

The most appropriate database schemas use a mixture of denormalized star and normalized snowflake schemas. This combination of star and snowflake schemas is called a starflake schema. Some dimensions may be present in both forms to cater for different query requirements. Whether the schema is star, snowflake, or starflake, the predictable and standard form of the underlying dimensional model offers important advantages within a data warehouse environment, including:

Efficiency  The consistency of the underlying database structure allows more efficient access to the data by various tools including report writers and query tools.

Ability to handle changing requirements  The star schema can adapt to changes in the users' requirements, as all dimensions are equivalent in terms of providing access to the fact table. This means that the design is better able to support ad hoc user queries.

Ability to model common business situations  There are a growing number of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces; for example, slowly changing dimensions, where a constant dimension such as Branch or Staff actually evolves slowly and asynchronously. We discuss slowly changing dimensions in more detail in Section 32.3, Step 8.

Extensibility  The dimensional model is extensible; for example, typical changes that a DM must support include: (a) adding new facts, as long as they are consistent with the fundamental granularity of the existing fact table; (b) adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record; (c) adding new dimensional attributes; and (d) breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

Predictable query processing  Data warehouse applications that drill down will simply be adding more dimension attributes from within a single star schema. Applications that drill across will be linking separate fact tables together through the shared (conformed) dimensions. Even though the overall suite of star schemas in the enterprise dimensional model is complex, the query processing is very predictable because, at the lowest level, each fact table should be queried independently.

32.2.1 Comparison of DM and ER models


In this section we compare and contrast the dimensional model (DM) with the Entity
Relationship (ER) model. As described in the previous section, DMs are normally used to
design the database component of a data warehouse whereas ER models have traditionally
been used to describe the database for Online Transaction Processing (OLTP) systems.
ER modeling is a technique for identifying relationships among entities. A major
goal of ER modeling is to remove redundancy in the data. This is immensely beneficial to
transaction processing because transactions are made very simple and deterministic. For
example, a transaction that updates a clients address normally accesses a single record in
the Client table. This access is extremely fast as it uses an index on the primary key clientNo.
However, in making transaction processing efficient such databases cannot efficiently and
easily support ad hoc end-user queries. Traditional business applications such as customer
ordering, stock control, and customer invoicing require many tables with numerous joins
between them. An ER model for an enterprise can have hundreds of logical entities, which
can map to hundreds of physical tables. Traditional ER modeling does not support the
main attraction of data warehousing, namely intuitive and high-performance retrieval
of data.
The key to understanding the relationship between dimensional models and Entity
Relationship models is that a single ER model normally decomposes into multiple DMs.
The multiple DMs are then associated through shared dimension tables. We describe the
relationship between ER models and DMs in more detail in the following section, in which
we present a database design methodology for data warehouses.

32.3 Database Design Methodology for Data Warehouses
In this section we describe a step-by-step methodology for designing the database of a
data warehouse. This methodology was proposed by Kimball and is called the Nine-Step
Methodology (Kimball, 1996). The steps of this methodology are shown in Table 32.1.
There are many approaches that offer alternative routes to the creation of a data warehouse.
One of the more successful approaches is to decompose the design of the data warehouse
into more manageable parts, namely data marts (see Section 31.5). At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
Thus, a data warehouse is the union of a set of separate data marts implemented over a
period of time, possibly by different design teams, and possibly on different hardware and
software platforms.
The Nine-Step Methodology specifies the steps required for the design of a data mart.
However, the methodology also ties together separate data marts so that over time they
merge together into a coherent overall data warehouse. We now describe the steps shown
in Table 32.1 in some detail using worked examples taken from an extended version of the
DreamHome case study.
Table 32.1  Nine-Step Methodology by Kimball (1996).

Step    Activity
1       Choosing the process
2       Choosing the grain
3       Identifying and conforming the dimensions
4       Choosing the facts
5       Storing pre-calculations in the fact table
6       Rounding out the dimension tables
7       Choosing the duration of the database
8       Tracking slowly changing dimensions
9       Deciding the query priorities and the query modes

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. The best choice for the first data mart tends to be the one that is related to sales. This data source is likely to be accessible and of high quality. In selecting the first data mart for DreamHome, we first identify that the discrete business processes of DreamHome include:

property sales;
property rentals (leasing);
property viewing;
property advertising;
property maintenance.

The data requirements associated with these processes are shown in the ER diagram of Figure 32.3. Note that this ER diagram forms part of the design documentation, which describes the Online Transaction Processing (OLTP) systems required to support the business processes of DreamHome. The ER diagram of Figure 32.3 has been simplified by labeling only the main entities and relationships and is created by following Steps 1 and 2 of the database design methodology described earlier in Chapters 15 and 16. The shaded entities represent the core facts for each business process of DreamHome. The business process selected to be the first data mart is property sales. The part of the original ER diagram that represents the data requirements of the property sales business process is shown in Figure 32.4.

Figure 32.3  ER diagram of an extended version of DreamHome.

Figure 32.4  Part of ER diagram in Figure 32.3 that represents the data requirements of the property sales business process of DreamHome.
Step 2: Choosing the grain
Choosing the grain means deciding exactly what a fact table record represents. For example,
the PropertySale entity shown with shading in Figure 32.4 represents the facts about each
property sale and becomes the fact table of the property sales star schema shown
previously in Figure 32.1. Therefore, the grain of the PropertySale fact table is individual
property sales.
Only when the grain for the fact table is chosen can we identify the dimensions of the
fact table. For example, the Branch, Staff, Owner, ClientBuyer, PropertyForSale, and Promotion
entities in Figure 32.4 will be used to reference the data about property sales and will become the dimension tables of the property sales star schema shown previously in Figure 32.1.
We also include Time as a core dimension, which is always present in star schemas.
The grain decision for the fact table also determines the grain of each of the dimension
tables. For example, if the grain for the PropertySale fact table is an individual property sale,
then the grain of the ClientBuyer dimension is the details of the client who bought a particular property.
Step 3: Identifying and conforming the dimensions
Dimensions set the context for asking questions about the facts in the fact table. A well-built set of dimensions makes the data mart understandable and easy to use. We identify
dimensions in sufficient detail to describe things such as clients and properties at the
correct grain. For example, each client of the ClientBuyer dimension table is described by
the clientID, clientNo, clientName, clientType, city, region, and country attributes, as shown previously in Figure 32.1. A poorly presented or incomplete set of dimensions will reduce the
usefulness of a data mart to an enterprise.

Figure 32.5  Star schemas for property sales and property advertising with Time, PropertyForSale, Branch, and Promotion as conformed (shared) dimension tables.

If any dimension occurs in two data marts, they must be exactly the same dimension, or
one must be a mathematical subset of the other. Only in this way can two data marts share
one or more dimensions in the same application. When a dimension is used in more than
one data mart, the dimension is referred to as being conformed. Examples of dimensions that must conform between property sales and property advertising are the Time,
PropertyForSale, Branch, and Promotion dimensions. If these dimensions are not synchronized
or if they are allowed to drift out of synchronization between data marts, the overall data
warehouse will fail, because the two data marts will not be able to be used together.
For example, in Figure 32.5 we show the star schemas for property sales and property
advertising with Time, PropertyForSale, Branch, and Promotion as conformed dimensions with
light shading.

Step 4: Choosing the facts

Figure 32.6  Star schema for property rentals of DreamHome. This is an example of a badly structured fact table with non-numeric facts, a non-additive fact, and a numeric fact with a granularity inconsistent with the other facts in the table.

The grain of the fact table determines which facts can be used in the data mart. All the
facts must be expressed at the level implied by the grain. In other words, if the grain
of the fact table is an individual property sale, then all the numerical facts must refer
to this particular sale. Also, the facts should be numeric and additive. In Figure 32.6 we
use the star schema of the property rental process of DreamHome to illustrate a badly
structured fact table. This fact table is unusable with non-numeric facts (promotionName
and staffName), a non-additive fact (monthlyRent), and a fact (lastYearRevenue) at a different
granularity from the other facts in the table. Figure 32.7 shows how the Lease fact
table shown in Figure 32.6 could be corrected so that the fact table is appropriately
structured.
Additional facts can be added to a fact table at any time provided they are consistent
with the grain of the table.

Figure 32.7  Star schema for the property rentals of DreamHome. This is the schema shown in Figure 32.6 with the problems corrected.

Step 5: Storing pre-calculations in the fact table


Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations. A common example of the need to store pre-calculations occurs when the facts comprise a profit and loss statement. This situation will
often arise when the fact table is based on invoices or sales. Figure 32.7 shows the fact table
with the rentDuration, totalRent, clientAllowance, staffCommission, and totalRevenue attributes. These
types of facts are useful because they are additive quantities, from which we can derive
valuable information such as the average clientAllowance based on aggregating some number
of fact table records. To calculate the totalRevenue generated per property rental we subtract
the clientAllowance and the staffCommission from totalRent. Although the totalRevenue can always
be derived from these attributes, we still need to store the totalRevenue. This is particularly
true for a value that is fundamental to an enterprise, such as totalRevenue, or if there is any
chance of a user calculating the totalRevenue incorrectly. The cost of a user incorrectly representing the totalRevenue is offset against the minor cost of a little redundant data storage.


Step 6: Rounding out the dimension tables


In this step, we return to the dimension tables and add as many text descriptions to the
dimensions as possible. The text descriptions should be as intuitive and understandable to
the users as possible. The usefulness of a data mart is determined by the scope and nature
of the attributes of the dimension tables.
Step 7: Choosing the duration of the database
The duration measures how far back in time the fact table goes. In many enterprises,
there is a requirement to look at the same time period a year or two earlier. For other enterprises, such as insurance companies, there may be a legal requirement to retain data
extending back five or more years. Very large fact tables raise at least two very significant
data warehouse design issues. First, it is often increasingly difficult to source increasingly
old data. The older the data, the more likely there will be problems in reading and
interpreting the old files or the old tapes. Second, it is mandatory that the old versions
of the important dimensions be used, not the most current versions. This is known as the
slowly changing dimension problem, which is described in more detail in the following step.
Step 8: Tracking slowly changing dimensions
The slowly changing dimension problem means, for example, that the proper description
of the old client and the old branch must be used with the old transaction history. Often,
the data warehouse must assign a generalized key to these important dimensions in order
to distinguish multiple snapshots of clients and branches over a period of time.
There are three basic types of slowly changing dimensions: Type 1, where a changed
dimension attribute is overwritten; Type 2, where a changed dimension attribute causes a
new dimension record to be created; and Type 3, where a changed dimension attribute
causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.
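For example, a Type 2 change to a client's city might be handled as sketched below. The currentFlag and validity date columns, the sequence used to generate the new surrogate key, and the sample values are all assumptions added for the illustration; they do not appear in Figure 32.1.

-- Close off the current version of the dimension record ...
UPDATE ClientBuyer
SET    currentFlag = 'N', validTo = SYSDATE
WHERE  clientNo = 'CR76' AND currentFlag = 'Y';

-- ... and create a new version with a new surrogate key, so that existing
-- facts continue to point at the old snapshot of the client
INSERT INTO ClientBuyer (clientID, clientNo, clientName, clientType,
                         city, currentFlag, validFrom)
VALUES (clientIDSeq.NEXTVAL, 'CR76', 'John Kay', 'Buyer',
        'Aberdeen', 'Y', SYSDATE);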
Step 9: Deciding the query priorities and the query modes
In this step we consider physical design issues. The most critical physical design issues
affecting the end-user's perception of the data mart are the physical sort order of the fact
table on disk and the presence of pre-stored summaries or aggregations. Beyond these issues
there are a host of additional physical design issues affecting administration, backup,
indexing performance, and security. For further information on the issues affecting the
physical design for data warehouses the interested reader is referred to Anahory and
Murray (1997).
At the end of this methodology, we have a design for a data mart that supports the
requirements of a particular business process and also allows the easy integration with
other related data marts to ultimately form the enterprise-wide data warehouse. Table 32.2
lists the fact and dimension tables associated with the star schema for each business process
of DreamHome (identified in Step 1 of the methodology).

Table 32.2  Fact and dimension tables for each business process of DreamHome.

Business process       Fact table           Dimension tables
Property sales         PropertySale         Time, Branch, Staff, PropertyForSale, Owner, ClientBuyer, Promotion
Property rentals       Lease                Time, Branch, Staff, PropertyForRent, Owner, ClientRenter, Promotion
Property viewing       PropertyViewing      Time, Branch, PropertyForSale, PropertyForRent, ClientBuyer, ClientRenter
Property advertising   Advert               Time, Branch, PropertyForSale, PropertyForRent, Promotion, Newspaper
Property maintenance   PropertyMaintenance  Time, Branch, Staff, PropertyForRent

We integrate the star schemas for the business processes of DreamHome using the conformed dimensions. For example, all the fact tables share the Time and Branch dimensions, as shown in Table 32.2. A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation. The fact constellation for the DreamHome data warehouse is shown in Figure 32.8. The model has been simplified by displaying only the names of the fact and dimension tables. Note that the fact tables are shown with dark shading and all the dimension tables being conformed are shown with light shading.

Figure 32.8  Dimensional model (fact constellation) for the DreamHome data warehouse.

32.4 Criteria for Assessing the Dimensionality of a Data Warehouse
Since the 1980s, data warehouses have evolved their own design techniques, distinct from
OLTP systems. Dimensional design techniques have emerged as the main approach for
most of the data warehouses. In this section we describe the criteria proposed by Ralph
Kimball to measure the extent to which a system supports the dimensional view of data
warehousing (Kimball, 2000a,b).
When assessing a particular data warehouse remember that few vendors attempt to
provide a completely integrated solution. However, as a data warehouse is a complete
system, the criteria should only be used to assess complete end-to-end systems and not a
collection of disjointed packages that may never integrate well together.
There are twenty criteria divided into three broad groups: architecture, administration,
and expression as shown in Table 32.3. The purpose of establishing these criteria is to
establish an objective standard for assessing how well a system supports the dimensional
view of data warehousing, and to set the threshold high so that vendors have a target for
improving their systems. The intended way to use this list is to rate a system on each
criterion with a simple 0 or 1. A system qualifies for a 1 only if it meets the full definition
of support for that criterion. For example, a system that offers aggregate navigation
(the fourth criterion) that is available only to a single front-end tool gets a zero because
the aggregate navigation is not open. In other words, there can be no partial credit for a
criterion.
Architectural criteria are fundamental characteristics to the way the entire system is
organized. These criteria usually extend from the backend, through the DBMS, to the
frontend and the user's desktop.
Administration criteria are more tactical than architectural criteria, but are considered
to be essential to the smooth running of a dimensionally oriented data warehouse.
These criteria generally affect IT personnel who are building and maintaining the data
warehouse.

Expression criteria are mostly analytic capabilities that are needed in real-life situations. The end-user community experiences all expression criteria directly. The expression criteria for dimensional systems are not the only features users look for in a data warehouse, but they are all capabilities that are needed to exploit the power of a dimensional system.

Table 32.3  Criteria for assessing the dimensionality provided by a data warehouse (Kimball, 2000a,b).

Group           Criteria
Architecture    Explicit declaration; Conformed dimensions and facts; Dimensional integrity; Open aggregate navigation; Dimensional symmetry; Dimensional scalability; Sparsity tolerance
Administration  Graceful modification; Dimensional replication; Changed dimension notification; Surrogate key administration; International consistency
Expression      Multiple-dimension hierarchies; Ragged-dimension hierarchies; Multiple valued dimensions; Slowly changing dimensions; Roles of a dimension; Hot-swappable dimensions; On-the-fly fact range dimensions; On-the-fly behavior dimensions

A system that supports most or all of these dimensional criteria would be adaptable, easier to administer, and able to address many real-world applications. The major point of dimensional systems is that they are business-issue and end-user driven. For further details of the criteria in Table 32.3, the interested reader is referred to Kimball (2000a,b).

32.5 Data Warehousing Design Using Oracle

We introduced the Oracle DBMS in Section 8.2. In this section, we describe Oracle
Warehouse Builder (OWB) as a key component of the Oracle Warehouse solution,
enabling the design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is a design tool and an extraction, transformation, and loading

(ETL) tool. An important aspect of OWB from the customer's perspective is that it allows
the integration of the traditional data warehousing environments with the new e-Business
environments (Oracle Corporation, 2000). This section first provides an overview of the
components of OWB and the underlying technologies and then describes how the user
would apply OWB to typical data warehousing tasks.

32.5.1 Oracle Warehouse Builder Components


OWB provides the following primary functional components:

A repository consisting of a set of tables in an Oracle database that is accessed via a


Java-based access layer. The repository is based on the Common Warehouse Model
(CWM) standard, which allows the OWB meta-data to be accessible to other products
that support this standard (see Section 31.4.3).
A graphical user interface (GUI) that enables access to the repository. The GUI
features graphical editors and an extensive use of wizards. The GUI is written in Java,
making the frontend portable.
A code generator, also written in Java, generates the code that enables the deployment
of data warehouses. The different code types generated by OWB are discussed later in
this section.
Integrators, which are components that are dedicated to extracting data from a particular
type of source. In addition to native support for Oracle, other relational, non-relational,
and flat-file data sources, OWB integrators allow access to information in enterprise
resource planning (ERP) applications such as Oracle and SAP R/3. The SAP integrator
provides access to SAP transparent tables using PL/SQL code generated by OWB.
An open interface that allows developers to extend the extraction capabilities of OWB,
while leveraging the benefits of the OWB framework. This open interface is made available to developers as part of the OWB Software Development Kit (SDK).
Runtime, which is a set of tables, sequences, packages, and triggers that are installed
in the target schema. These database objects are the foundation for the auditing and
error detection/correction capabilities of OWB. For example, loads can be restarted
based on information stored in the runtime tables. OWB includes a runtime audit viewer
for browsing the runtime tables and runtime reports.

The architecture of the Oracle Warehouse Builder is shown in Figure 32.9. Oracle Warehouse Builder is a key component of the larger Oracle data warehouse. The other products that the OWB must work with within the data warehouse include:

Oracle, the engine of OWB (as there is no external server);
Oracle Enterprise Manager for scheduling;
Oracle Workflow for dependency management;
Oracle PureExtract for MVS mainframe access;
Oracle PureIntegrate for customer data quality;
Oracle Gateways for relational and mainframe data access.

Figure 32.9  Oracle Warehouse Builder architecture.

32.5.2 Using Oracle Warehouse Builder


In this section we describe how OWB assists the user in some typical data warehousing
tasks like defining source data structures, designing the target warehouse, mapping sources
to targets, generating code, instantiating the warehouse, extracting the data, and maintaining the warehouse.

Defining sources
Once the requirements have been determined and all the data sources have been identified,
a tool such as OWB can be used for constructing the data warehouse. OWB can handle
a diverse set of data sources by means of integrators. OWB also has the concept of a
module, which is a logical grouping of related objects. There are two types of modules:
data source and warehouse. For example, a data source module might contain all the
definitions of the tables in an OLTP database that is a source for the data warehouse.
And a module of type warehouse might contain definitions of the facts, dimensions, and
staging tables that make up the data warehouse. It is important to note that modules merely
contain definitions, that is metadata, about either sources or warehouses, and not objects
that can be populated or queried. A user identifies the integrators that are appropriate
for the data sources, and each integrator accesses a source and imports the metadata
that describes it.
Oracle sources
To connect to an Oracle database, the user chooses the integrator for Oracle databases.
Next, the user supplies some more detailed connection information, for example user
name, password, and SQL*Net connection string. This information is used to define a
database link in the database that hosts the OWB repository. OWB uses this database link
to query the system catalog of the source database and extract metadata that describes the
tables and views of interest to the user. The user experiences this as a process of visually
inspecting the source and selecting objects of interest.
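As a rough illustration of what happens under the covers, the database link and dictionary
query might look something like the following; the link name, credentials, and schema are
assumptions for this sketch, not values produced by OWB.

    CREATE DATABASE LINK dreamhome_oltp
      CONNECT TO dw_reader IDENTIFIED BY dw_reader_pwd
      USING 'OLTP_DB';          -- SQL*Net connect string supplied by the user

    -- Import metadata by reading the source's data dictionary over the link
    SELECT table_name, column_name, data_type, data_length
    FROM   all_tab_columns@dreamhome_oltp
    WHERE  owner = 'DREAMHOME'
    ORDER  BY table_name, column_id;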


Non-Oracle sources
Non-Oracle databases are accessed in exactly the same way as Oracle databases. What
makes this possible is the Transparent Gateway technology of Oracle. In essence, a
Transparent Gateway allows a non-Oracle database to be treated in exactly the same
way as if it were an Oracle database. On the SQL level, once the database link pointing to
the non-Oracle database has been defined, the non-Oracle database can be queried via
SELECT just like any Oracle database. In OWB, all the user has to do is identify the type
of database, so that OWB can select the appropriate Transparent Gateway for the database
link definition. In the case of MVS mainframe sources, OWB and Oracle PureExtract
provide data extraction from sources such as IMS, DB2, and VSAM. The plan is that
Oracle PureExtract will ultimately be integrated with the OWB technology.
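As a brief sketch (the link and table names are assumed), once such a database link has
been defined through a Transparent Gateway, a DB2 table, for example, can be queried just
like an Oracle table:

    SELECT staff_no, branch_no, salary
    FROM   payroll@db2_source      -- database link defined via a Transparent Gateway
    WHERE  salary > 20000;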
Flat files
OWB supports two kinds of flat files: character-delimited and fixed-length files. If the data
source is a flat file, the user selects the integrator for flat files and specifies the path and
file name. The process of creating the metadata that describes a file is different from the
process used for a table in a database. With a table, the owning database itself stores
extensive information about the table such as the table name, the column names, and data
types. This information can be easily queried from the catalog. With a file, on the other
hand, the user assists in the process of creating the metadata with some intelligent guesses
supplied by OWB. In OWB, this process is called sampling.
Web data
With the proliferation of the Internet, the new challenge for data warehousing is to capture
data from Web sites. There are different types of data in e-Business environments: transactional Web data stored in the underlying databases; clickstream data stored in Web server
log files; registration data in databases or log files; and consolidated clickstream data in
the log files of Web analysis tools. OWB can address all these sources with its built-in
features for accessing databases and flat files.

Data quality

A solution to the challenge of data quality is OWB with Oracle PureIntegrate. Oracle
PureIntegrate is customer data integration software that automates the creation of consolidated
profiles of customers and related business data to support e-Business and customer relationship
management applications. PureIntegrate complements OWB by providing advanced data
transformation and cleansing features designed specifically to meet the requirements of
database applications. These include:

integrated name and address processing to standardize, correct, and enhance representations
of customer names and locations;

advanced probabilistic matching to identify unique consumers, businesses, households,
super-households, or other entities for which no common identifiers exist;

powerful rule-based merging to resolve conflicting data and create the best possible
integrated result from the matched data.


Designing the target warehouse


Once the source systems have been identified and defined, the next task is to design the
target warehouse based on user requirements. One of the most popular designs in data
warehousing is the star schema and its variations, as discussed in Section 32.2. Also, many
business intelligence tools such as Oracle Discoverer are optimized for this kind of design.
OWB supports all variations of star schema designs. It features wizards and graphical
editors for fact and dimension tables. For example, in the Dimension Editor the user
graphically defines the attributes, levels, and hierarchies of a dimension.
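To make this concrete, the following is a minimal sketch of the kind of star schema that
might be designed for DreamHome property sales; the table and column names are illustrative
assumptions rather than OWB output.

    -- Dimension tables with simple surrogate primary keys
    CREATE TABLE time_dim (
      time_id    NUMBER      PRIMARY KEY,
      day_date   DATE,
      quarter    NUMBER(1),
      cal_year   NUMBER(4)
    );

    CREATE TABLE branch_dim (
      branch_id  NUMBER      PRIMARY KEY,
      branch_no  VARCHAR2(6),
      city       VARCHAR2(30),
      region     VARCHAR2(30)
    );

    -- Fact table whose composite primary key is made up of foreign keys
    CREATE TABLE property_sale_fact (
      time_id    NUMBER REFERENCES time_dim,
      branch_id  NUMBER REFERENCES branch_dim,
      sale_count NUMBER,
      revenue    NUMBER(12,2),
      CONSTRAINT property_sale_pk PRIMARY KEY (time_id, branch_id)
    );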

Mapping sources to targets


When both the sources and the target have been well defined, the next step is to map the
two together. Remember that there are two types of modules: source modules and warehouse modules. Modules can be reused many times in different mappings. Warehouse
modules can themselves be used as source modules. For example, in an architecture where
we have an OLTP database that feeds a central data warehouse, which in turn feeds a data
mart, the data warehouse is a target (from the perspective of the OLTP database) and a
source (from the perspective of the data mart).
OWB mappings are defined on two levels: a high-level mapping that indicates the source
and target modules, and, one level down, a detail mapping that allows a user to map
source columns to target columns and to define transformations. OWB features a built-in
transformation library from which the user can pick predefined transformations. Users can
also define their own transformations in PL/SQL and Java.
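For example, a user-defined PL/SQL transformation registered with OWB might look like the
following sketch (the function and its cleansing rule are assumptions for illustration):

    -- Standardize UK postcodes before they are loaded into a dimension table
    CREATE OR REPLACE FUNCTION clean_postcode(p_postcode IN VARCHAR2)
      RETURN VARCHAR2
    IS
    BEGIN
      -- Trim the value, remove embedded spaces, and fold to upper case
      RETURN UPPER(REPLACE(TRIM(p_postcode), ' ', ''));
    END clean_postcode;
    /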

Generating code
The Code Generator is the OWB component that reads the target definitions and source-to-target mappings and generates code to implement the warehouse. The type of generated
code varies depending on the type of object that the user wants to implement.
Logical versus physical design
Before generating code, the user has primarily been working on the logical level, that is,
on the level of object definitions. On this level, the user is concerned with capturing all the
details and relationships (the semantics) of an object, but is not yet concerned with
defining any implementation characteristics. For example, consider a table to be implemented in an Oracle database. On the logical level, the user may be concerned with the
table name, the number of columns, the column names and data types, and any relationships that the table has to other tables. On the physical level, however, the question
becomes: how can this table be optimally implemented in an Oracle database? The user
must now be concerned with things like tablespaces, indexes, and storage parameters (see
Section 8.2.2). OWB allows the user to view and manipulate an object on both the logical
and physical level. The logical definition and physical implementation details are automatically synchronized.


Configuration
In OWB, the process of assigning physical characteristics to an object is called configuration.
The specific characteristics that can be defined depend on the object that is being configured;
they include, for example, storage parameters, indexes, tablespaces, and partitions.
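At the SQL level, the configured physical characteristics surface as clauses such as the
following (tablespace names and values are illustrative assumptions, applied here to the
fact table sketched earlier):

    -- Place the fact table in its own tablespace with little free space per block
    ALTER TABLE property_sale_fact
      MOVE TABLESPACE dw_facts PCTFREE 5;

    -- A bitmap index on a low-cardinality foreign key column
    CREATE BITMAP INDEX psf_branch_idx
      ON property_sale_fact (branch_id)
      TABLESPACE dw_indexes;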
Validation
It is good practice to check the object definitions for completeness and consistency prior
to code generation. OWB offers a validate feature to automate this process. Errors detectable by the validation process include, for example, data type mismatches between sources
and targets, and foreign key errors.
Generation
The following are some of the main types of code that OWB produces:

SQL Data Definition Language (DDL) commands  A warehouse module with its definitions of fact
and dimension tables is implemented as a relational schema in an Oracle database. OWB generates
SQL DDL scripts that create this schema. The scripts can either be executed from within OWB or
saved to the file system for later, manual execution.

PL/SQL programs  A source-to-target mapping results in a PL/SQL program if the source is a
database, whether Oracle or non-Oracle. The PL/SQL program accesses the source database via a
database link, performs the transformations as defined in the mapping, and loads the data into
the target table.

SQL*Loader control files  If the source in a mapping is a flat file, OWB generates a control
file for use with SQL*Loader (a sketch of such a control file follows this list).

Tcl scripts  OWB also generates Tcl scripts. These can be used to schedule PL/SQL and
SQL*Loader mappings as jobs in Oracle Enterprise Manager, for example, to refresh the
warehouse at regular intervals.
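For instance, a generated control file for a comma-delimited staff extract might resemble
the following sketch (file, table, and column names are assumptions, not actual OWB output):

    -- SQL*Loader control file for a character-delimited flat-file source
    LOAD DATA
    INFILE 'staff_extract.csv'
    APPEND
    INTO TABLE staff_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (staff_no, branch_no, job_title, salary)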

Instantiating the warehouse and extracting data


Before the data can be moved from the source to the target database, the developer has to
instantiate the warehouse, in other words execute the generated DDL scripts to create the
target schema. OWB refers to this step as deployment. Once the target schema is in place,
the PL/SQL programs can move data from the source into the target. Note that the basic
data movement mechanism is INSERT . . . SELECT . . . with the use of a database link.
If an error should occur, a routine from one of the OWB runtime packages logs the error
in an audit table.
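In its simplest form, the generated load reduces to a statement of the following shape
(table, column, and link names are assumed for illustration, matching the earlier sketches):

    INSERT INTO property_sale_fact (time_id, branch_id, sale_count, revenue)
    SELECT t.time_id, b.branch_id, COUNT(*), SUM(s.sale_price)
    FROM   property_sale@dreamhome_oltp s,   -- source table read over the database link
           time_dim t,
           branch_dim b
    WHERE  t.day_date  = TRUNC(s.sale_date)
    AND    b.branch_no = s.branch_no
    GROUP  BY t.time_id, b.branch_id;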

Maintaining the warehouse


Once the data warehouse has been instantiated and the initial load has been completed, it
has to be maintained. For example, the fact table has to be refreshed at regular intervals,
so that queries return up-to-date results. Dimension tables have to be extended and
updated, albeit much less frequently than fact tables. An example of a slowly changing
dimension is the Customer table, in which a customer's address, marital status, or name
may all change over time. In addition to INSERT, OWB also supports other ways of
manipulating the warehouse:

UPDATE
DELETE
INSERT/UPDATE (insert a row; if it already exists, update it)
UPDATE/INSERT (update a row; if it does not exist, insert it)

These features give the OWB user a variety of tools to undertake ongoing maintenance
tasks. OWB interfaces with Oracle Enterprise Manager for repetitive maintenance tasks;
for example, a fact table refresh that is scheduled to occur at a regular interval. For complex dependencies, OWB integrates with Oracle Workflow.
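The INSERT/UPDATE mode listed above corresponds roughly to a MERGE statement; a sketch for
refreshing a slowly changing Customer dimension (table, column, and link names assumed) is:

    MERGE INTO customer_dim d
    USING (SELECT customer_no, name, address, marital_status
           FROM   customer@dreamhome_oltp) s
    ON (d.customer_no = s.customer_no)
    WHEN MATCHED THEN
      UPDATE SET d.name = s.name,
                 d.address = s.address,
                 d.marital_status = s.marital_status
    WHEN NOT MATCHED THEN
      INSERT (customer_no, name, address, marital_status)
      VALUES (s.customer_no, s.name, s.address, s.marital_status);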

Metadata integration
OWB is based on the Common Warehouse Model (CWM) standard (see Section 31.4.3).
It can seamlessly exchange metadata with Oracle Express and Oracle Discoverer as well
as other business intelligence tools that comply with the standard.

Chapter Summary

Dimensionality modeling is a design technique that aims to present the data in a standard, intuitive form that
allows for high-performance access.

Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table,
and a set of smaller tables called dimension tables. Each dimension table has a simple (non-composite)
primary key that corresponds exactly to one of the components of the composite key in the fact table. In other
words, the primary key of the fact table is made up of two or more foreign keys. This characteristic star-like
structure is called a star schema or star join.

Star schema is a logical structure that has a fact table containing factual data in the center, surrounded by
dimension tables containing reference data (which can be denormalized).

The star schema exploits the characteristics of factual data such that facts are generated by events that
occurred in the past, and are unlikely to change, regardless of how they are analyzed. As the bulk of data in
the data warehouse is represented within facts, the fact tables can be extremely large relative to the dimension
tables.

The most useful facts in a fact table are numerical and additive because data warehouse applications almost
never access a single record; rather, they access hundreds, thousands, or even millions of records at a time and
the most useful thing to do with so many records is to aggregate them.

Dimension tables most often contain descriptive textual information. Dimension attributes are used as the
constraints in data warehouse queries.

Snowflake schema is a variant of the star schema where dimension tables do not contain denormalized data.

Starflake schema is a hybrid structure that contains a mixture of star and snowflake schemas.

The key to understanding the relationship between dimensional models and ER models is that a single ER
model normally decomposes into multiple DMs. The multiple DMs are then associated through conformed
(shared) dimension tables.

There are many approaches that offer alternative routes to the creation of a data warehouse. One of the more
successful approaches is to decompose the design of the data warehouse into more manageable parts, namely
data marts. At a later stage, the integration of the smaller data marts leads to the creation of the
enterprise-wide data warehouse.

The Nine-Step Methodology specifies the steps required for the design of a data mart / warehouse. The steps
include: Step 1 Choosing the process, Step 2 Choosing the grain, Step 3 Identifying and conforming the
dimensions, Step 4 Choosing the facts, Step 5 Storing pre-calculations in the fact table, Step 6 Rounding out
the dimensions, Step 7 Choosing the duration of the database, Step 8 Tracking slowly changing dimensions,
and Step 9 Deciding the query priorities and query modes.

There are criteria to measure the extent to which a system supports the dimensional view of data
warehousing. The criteria are divided into three broad groups: architecture, administration, and expression.

Oracle Warehouse Builder (OWB) is a key component of the Oracle Warehouse solution, enabling the
design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is both
a design tool and an extraction, transformation, and loading (ETL) tool.

Review Questions

32.1  Identify the major issues associated with designing a data warehouse database.
32.2  Describe how a dimensional model (DM) differs from an Entity-Relationship (ER) model.
32.3  Present a diagrammatic representation of a typical star schema.
32.4  Describe how the fact and dimensional tables of a star schema differ.
32.5  Describe how star, snowflake, and starflake schemas differ.
32.6  The star, snowflake, and starflake schemas offer important advantages in a data warehouse
      environment. Describe these advantages.
32.7  Describe the main activities associated with each step of the Nine-Step Methodology for
      data warehouse database design.
32.8  Describe the purpose of assessing the dimensionality of a data warehouse.
32.9  Briefly outline the criteria groups used to assess the dimensionality of a data warehouse.
32.10 Describe how the Oracle Warehouse Builder supports the design of a data warehouse.

Exercises

32.11 Use the Nine-Step Methodology for data warehouse database design to produce dimensional models for the
      case studies described in Appendix B.
32.12 Use the Nine-Step Methodology for data warehouse database design to produce a dimensional model for all
      or part of your organization.

Chapter 33

OLAP

Chapter Objectives

In this chapter you will learn:

The purpose of Online Analytical Processing (OLAP).

The relationship between OLAP and data warehousing.

The key features of OLAP applications.

The potential benefits associated with successful OLAP applications.

How to represent multi-dimensional data.

The rules for OLAP tools.

The main categories of OLAP tools.

OLAP extensions to the SQL standard.

How Oracle supports OLAP.

In Chapter 31 we discussed the increasing popularity of data warehousing as a means of


gaining competitive advantage. We learnt that data warehouses bring together large
volumes of data for the purposes of data analysis. Until recently, access tools for large
database systems have provided only limited and relatively simplistic data analysis.
However, accompanying the growth in data warehousing is an ever-increasing demand by
users for more powerful access tools that provide advanced analytical capabilities. There
are two main types of access tools available to meet this demand, namely Online
Analytical Processing (OLAP) and data mining. These tools differ in what they offer the
user and because of this they are complementary technologies.
A data warehouse (or more commonly one or more data marts) together with tools such
as OLAP and/or data mining are collectively referred to as Business Intelligence (BI)
technologies. In this chapter we describe OLAP and in the following chapter we describe
data mining.

Structure of this Chapter


In Section 33.1 we introduce Online Analytical Processing (OLAP) and discuss the
relationship between OLAP and data warehousing. In Section 33.2 we describe OLAP
applications and identify the key features and potential benefits associated with OLAP
applications. In Section 33.3 we discuss how multi-dimensional data can be represented
and describe the main concepts associated with multi-dimensional analysis. In Section
33.4 we describe the rules for OLAP tools and highlight the characteristics and issues
associated with OLAP tools. In Section 33.5 we discuss how the SQL standard has been
extended to include OLAP functions. Finally, in Section 33.6, we describe how Oracle
supports OLAP. The examples in this chapter are taken from the DreamHome case study
described in Section 10.4 and Appendix A.

33.1 Online Analytical Processing

Over the past few decades, we have witnessed the increasing popularity and prevalence of
relational DBMSs such that we now find a significant proportion of corporate data is housed
in such systems. Relational databases have been used primarily to support traditional
Online Transaction Processing (OLTP) systems. To provide appropriate support for OLTP
systems, relational DBMSs have been developed to enable the highly efficient execution
of a large number of relatively simple transactions.
In the past few years, relational DBMS vendors have targeted the data warehousing
market and have promoted their systems as tools for building data warehouses. As discussed in Chapter 31, a data warehouse stores operational data and is expected to support
a wide range of queries from the relatively simple to the highly complex. However, the
ability to answer particular queries is dependent on the types of end-user access tools
available for use on the data warehouse. General-purpose tools such as reporting and query
tools can easily support 'who?' and 'what?' questions about past events. A typical query
submitted directly to a data warehouse is: 'What was the total revenue for Scotland in the
third quarter of 2004?'. In this section we focus on a tool that can support more advanced
queries, namely Online Analytical Processing (OLAP).
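Against a star schema of the kind described in Chapter 32, the query quoted above might be
written as follows (table and column names are illustrative assumptions):

    SELECT SUM(f.revenue) AS total_revenue
    FROM   property_sale_fact f,
           time_dim t,
           branch_dim b
    WHERE  f.time_id   = t.time_id
    AND    f.branch_id = b.branch_id
    AND    b.region    = 'Scotland'
    AND    t.quarter   = 3
    AND    t.cal_year  = 2004;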
Online Analytical Processing (OLAP): The dynamic synthesis, analysis, and consolidation of
large volumes of multi-dimensional data.

OLAP is a term that describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced
analysis (Codd et al., 1995). OLAP enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access
to a wide variety of possible views of the data. OLAP allows the user to view corporate
data in such a way that it is a better model of the true dimensionality of the enterprise.
While OLAP systems can easily answer 'who?' and 'what?' questions, it is their ability to
answer 'what if?' and 'why?' type questions that distinguishes them from general-purpose
